Serving Git

31 Mar 2010

The standard git client has support for tunneling through SSH to get to a repository. Naturally, you'd need a system user account on the host machine for each person requiring access. This is fine for small teams, but what about the various hosting sites that allow cloning from URLs like:

git clone git@example.com:path/to/repo

Gitosis allows access in the style above, along with some access-control features, by wrapping git-shell [1] functionality. Under the hood, Gitosis maintains a file-based username-key map and rewrites the ${HOME}/.ssh/authorized_keys file every time permissions or the user-key map change. As the number of users grows, Gitosis' approach to key management becomes unwieldy. GitHub also provides SSH access but, at one point, took a drastically different approach based on their operating requirements. By Engine Yard's account, GitHub's solution to the problem was to patch their SSH server [2].
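To make the key-management cost concrete, here is a minimal sketch of regenerating an authorized_keys file from a username-to-keys map. It is not Gitosis's actual code: the function name render_authorized_keys, the sample users, the truncated placeholder keys, and the git-serve wrapper command are all hypothetical. The command= option and the no-* restrictions are standard OpenSSH authorized_keys options.

```python
# Sketch only: not Gitosis's implementation. "git-serve" is a hypothetical
# wrapper script that would receive the username as its argument.
RESTRICTIONS = "no-port-forwarding,no-X11-forwarding,no-agent-forwarding,no-pty"


def render_authorized_keys(keymap, wrapper="git-serve"):
    """Return authorized_keys content for a {username: [pubkey, ...]} map."""
    lines = []
    for username in sorted(keymap):
        for pubkey in keymap[username]:
            # The forced command ties every key to the user who owns it.
            lines.append('command="%s %s",%s %s' % (
                wrapper, username, RESTRICTIONS, pubkey.strip()))
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # Placeholder keys, shortened for display; real keys are much longer.
    sample = {"jane": ["ssh-rsa AAAAB3Nza... jane@laptop"],
              "john": ["ssh-rsa AAAAB3Nzb... john@desktop"]}
    print(render_authorized_keys(sample))
```

The file gains one line per key and has to be rewritten in full whenever the map changes, which is the part that becomes awkward as the user count grows.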

Both implementations use SSH public keys to control access, but neither approach offers much flexibility with respect to user management and authentication (unless you consider maintaining a patched OpenSSH deployment trivial). Fortunately, there is yet another alternative. Twisted [3] has a client and server implementation of the SSH protocol with which it is possible to host git over SSH using a custom authentication system.

Aside: Twisty Shells

Note

As of the publication of this document, you will need the SVN checkout of Twisted for these examples. Fortunately, virtualenv makes test-driving the code very easy. Appendix A provides details and a simple script that will bootstrap a development environment.

The founder of the Twisted project wrote last year that the goal of Twisted is to provide a high-quality, high-level, secure implementation of every protocol spoken on the Internet [4]. Twisted is great if you need to make two or more computers talk, but it really shines if you need to talk in obscure or multiple protocols. Twisted's SSH package, Conch [5] (see what I did there?), can be used to emulate OpenSSH:

#!/usr/bin/env python
import sys
from twisted.cred.portal import Portal
from twisted.conch import checkers
from twisted.conch.ssh.factory import SSHFactory
from twisted.conch.ssh.keys import Key
from twisted.conch.unix import UnixSSHRealm
from twisted.internet import reactor
from twisted.python import log
log.startLogging(sys.stderr)

class SSHServer(SSHFactory):
    'Simulate an OpenSSH server.'
    portal = Portal(UnixSSHRealm())
    portal.registerChecker(checkers.UNIXPasswordDatabase())
    portal.registerChecker(checkers.SSHPublicKeyDatabase())

    def __init__(self, privkey):
        pubkey = '.'.join((privkey, 'pub'))
        self.privateKeys = {'ssh-rsa': Key.fromFile(privkey)}
        self.publicKeys = {'ssh-rsa': Key.fromFile(pubkey)}

if __name__ == '__main__':
    reactor.listenTCP(2222, SSHServer(sys.argv[1]))
    reactor.run()

To run the example, you'll need to generate the host keys and run the service as root. On a unix system, you can run the following to generate a test key (just press enter when prompted for a passphrase):

$ ssh-keygen -t rsa -f /tmp/test_host_key

And assuming you've placed the sample code in sshserver.py:

$ ./sshserver.py /tmp/test_host_key

Our checker classes come with some caveats. OS X does not use /etc/passwd or shadow passwords for authentication, so UNIXPasswordDatabase will not work properly on that operating system.

SSHServer is built on top of the authentication framework Cred (Figure 1) allowing us to customize our server with very little code. In particular we can take advantage of the authentication and session handling framework.

Authentication with Twisted.

Figure 1: This image is from the Twisted authentication tutorial [6]. Twisted already has much of the authentication logic we need. Our task is only to connect all the components together into something cohesive.

Minimal Git Over SSH

Note

The full source code for this section is in simplessh.py.

In this section, we create a very basic server wrapping git-shell to allow remote clone and push operations. As we have seen how straightforward [7] it is to plug in authentication modules, we won't worry about authentication until the next section. SSHServer hides a bit of magic that is key to customizing the service and registering a connection handler. At the bottom of the twisted.conch.unix module, the following statement is executed on module import:

components.registerAdapter(
        SSHSessionForUnixConchUser,
        UnixConchUser,
        session.ISession)

The Portal configures authentication and the session handling is hooked in via components.registerAdapter(). As a first step, an implementation of IRealm needs to return a user [8][6] instance. Users are application-specific so we need to implement our own. Our git server isn't doing anything special so we'll reuse what we can from Conch.

def find_git_shell():
    'Find git-shell path.'
    # Adapted from http://bugs.python.org/file15381/shutil_which.patch
    path = os.environ.get("PATH", os.defpath)
    for directory in path.split(os.pathsep):
        full_path = os.path.join(directory, 'git-shell')
        if (os.path.exists(full_path) and
                os.access(full_path, (os.F_OK | os.X_OK))):
            return full_path
    raise Exception('Could not find git-shell executable!')


class GitConchUser(ConchUser):
    shell = find_git_shell()

    def __init__(self, username):
        ConchUser.__init__(self)
        self.username = username
        self.channelLookup.update({"session": SSHSession})

    def logout(self): pass

The interface between a user and an SSH connection is defined in IConchUser. Twisted provides the reference implementation ConchUser. Since SSH connections multiplex different channels [9] (X11, shell, etc.), classes inheriting from ConchUser must specify the class [10] responsible for handling low level protocol details by updating the channelLookup dictionary.

The next step is to define the session. For our purposes, it is sufficient for each connection to wrap a single git command. The only ISession interface method requiring implementation is execCommand(). The rest can remain stubs.

class SimpleGitSession(object):
    interface.implements(ISession)

    def __init__(self, user):
        self.user = user

    def execCommand(self, proto, cmd):
        command = (self.user.shell, '-c', cmd)
        reactor.spawnProcess(proto, self.user.shell, command)

    def eofReceived(self): pass

    def closed(self): pass

execCommand() dispatches the requested command to git-shell to validate and execute. Finally, we put all our custom components together:

class GitRealm(object):
    interface.implements(IRealm)

    def requestAvatar(self, username, mind, *interfaces):
        user = GitConchUser(username)
        return interfaces[0], user, user.logout


class SimpleGitServer(SSHFactory):
    portal = Portal(GitRealm())

    mockpasswd = InMemoryUsernamePasswordDatabaseDontUse()
    mockpasswd.addUser('testuser', 'testpassword')
    portal.registerChecker(mockpasswd)

    def __init__(self, privkey):
        pubkey = '.'.join((privkey, 'pub'))
        self.privateKeys = {'ssh-rsa': Key.fromFile(privkey)}
        self.publicKeys = {'ssh-rsa': Key.fromFile(pubkey)}


if __name__ == '__main__':
    components.registerAdapter(SimpleGitSession, GitConchUser, ISession)
    reactor.listenTCP(2222, SimpleGitServer(sys.argv[1]))
    reactor.run()

Running this is identical to running SSHServer: just specify the path to a private RSA key for the server as the first argument. You do not need superuser privileges to run this test. Try cloning an existing git repository (make sure the repository is readable by the system user the service is running as):

$ git clone ssh://testuser@localhost:2222/path/to/existing/repo.git

When prompted for the password, use the one set above: "testpassword". The URI in the above command is fairly ugly. Deploying on port 22 would allow the port part to be omitted, though you'd have to move your real SSH server to another port.

Authentication and Access Control

Note

See Appendix B for a mock implementation of IGitMetadata. The full source code for this section is in gitssh.py.

The simple example server in the previous section lacks any meaningful authentication or access control. Anyone that logs in gets the full read and write access of the system user the service is running under.

To implement these things, our server needs user or account metadata mapping usernames to SSH public keys and repository names to file-system paths. There are many ways to persist this data (we've already covered the two used by Gitosis and GitHub), but that is somewhat outside the scope of this document. For generality, let's define a simple interface, IGitMetadata, to an imaginary persistent datastore:

class IGitMetadata(Interface):
    'API for authentication and access control.'

    # zope.interface method declarations omit self.
    def repopath(username, reponame):
        '''
        Given a username and repo name, return the full path of the repo on
        the file system, or None if there is no such repo.
        '''

    def pubkeys(username):
        '''
        Given a username return a list of OpenSSH compatible public key
        strings.
        '''
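As a rough illustration of what a repopath() implementation might do, the sketch below maps a username and repository name to a directory under a single base directory and refuses names that would escape the user's own subtree. The layout <base>/<username>/<reponame> and the helper name safe_repopath are assumptions made for this example; they are not part of the article's code.

```python
import os


def safe_repopath(base, username, reponame):
    """Return <base>/<username>/<reponame>, or None if the name escapes.

    Hypothetical helper: the directory layout is an assumption.
    """
    base = os.path.realpath(base)
    user_root = os.path.realpath(os.path.join(base, username))
    candidate = os.path.realpath(os.path.join(user_root, reponame))
    # realpath() collapses ".." and symlinks, so prefix checks are enough.
    if not user_root.startswith(base + os.sep):
        return None
    if not candidate.startswith(user_root + os.sep):
        return None
    return candidate


if __name__ == "__main__":
    print(safe_repopath("/srv/git-example", "jane", "foobar.git"))
    print(safe_repopath("/srv/git-example", "jane", "../john/secret.git"))
```

Returning None for anything suspicious matches the contract used elsewhere in this article, where a missing repository is signalled by None.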

When connecting to the basic git server, the repository path supplied after the server hostname maps directly to a file-system path on the server host. Ideally, we'd like more flexibility with respect to mapping the URL path in addition to managing access control. git-shell allows three commands: upload-archive, receive-pack, and upload-pack. In order to implement access control and a custom path-to-storage map, we need to intercept the command, in particular the directory portion, and manipulate it before passing it on to git-shell. The man pages for these three subcommands reveal the following signatures:

git upload-archive <directory>
git receive-pack <directory>
git upload-pack [--strict] [--timeout=<n>] <directory>

We could manually parse the requested command, but the standard library shlex module provides a nice parser that handles escaping for us; there is no point in reinventing the wheel here. If the command line arguments were more complex we could use the optparse module to intercept the command string. GitSession is identical to SimpleGitSession with the following exceptions:

import shlex

...

class GitSession(object):
    interface.implements(ISession)

    def __init__(self, user):
        self.user = user

    def execCommand(self, proto, cmd):
        argv = shlex.split(cmd)
        reponame = argv[-1]
        sh = self.user.shell

        # Check permissions by mapping requested path to file system path
        repopath = self.user.meta.repopath(self.user.username, reponame)
        if repopath is None:
            raise ConchError('Invalid repository.')
        command = ' '.join(argv[:-1] + ["'%s'" % (repopath,)])
        reactor.spawnProcess(proto, sh, (sh, '-c', command))

    def eofReceived(self): pass

    def closed(self): pass
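The parsing and rewriting step inside execCommand() can be tried on its own, without Twisted. The sketch below assumes the client sends the dashed command form (for example git-upload-pack) and adds an explicit whitelist check before the lookup. The function name rewrite_command, the lookup callable, and the choice to ignore a leading slash on the repository name are assumptions of this example.

```python
import shlex

# Dashed forms of the three commands git-shell accepts.
ALLOWED = ("git-upload-pack", "git-receive-pack", "git-upload-archive")


def rewrite_command(cmd, lookup):
    """Swap the requested repository for its on-disk path.

    `lookup` is any callable mapping a repository name to a path or None.
    Returns the rewritten command string, or None if the request is refused.
    """
    try:
        argv = shlex.split(cmd)
    except ValueError:
        return None  # unbalanced quotes and similar parse errors
    if len(argv) < 2 or argv[0] not in ALLOWED:
        return None
    # Assumption: treat '/helloworld.git' and 'helloworld.git' alike.
    reponame = argv[-1].lstrip("/")
    repopath = lookup(reponame)
    if repopath is None:
        return None
    return " ".join(argv[:-1] + ["'%s'" % (repopath,)])


if __name__ == "__main__":
    repos = {"helloworld.git": "/path/to/helloworld/"}
    print(rewrite_command("git-upload-pack '/helloworld.git'", repos.get))
```

Note that wrapping the path in single quotes, as the article's code does, would break for a path that itself contains a single quote; a production version would want a proper shell-quoting helper.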

Note that the user object has a new attribute meta representing an implementation of IGitMetadata that we need to provide. One option is to provide a reference to the class instance on user-creation in the Realm logic. This involves a slight modification of the GitConchUser and GitRealm classes:

class GitConchUser(ConchUser):
    shell = find_git_shell()

    def __init__(self, username, meta):
        ConchUser.__init__(self)
        self.username = username
        self.channelLookup.update({"session": SSHSession})
        self.meta = meta

    def logout(self): pass


class GitRealm(object):
    interface.implements(IRealm)

    def __init__(self, meta):
        self.meta = meta

    def requestAvatar(self, username, mind, *interfaces):
        user = GitConchUser(username, self.meta)
        return interfaces[0], user, user.logout

SSHPublicKeyDatabase implements the ICredentialsChecker interface and, as the name implies, authenticates a remote user's public key against those listed in $HOME/.ssh/authorized_keys. We override the checkKey() method to use an instance of IGitMetadata to verify credentials:

class GitPubKeyChecker(checkers.SSHPublicKeyDatabase):
    def __init__(self, meta, *args, **kwargs):
        super(GitPubKeyChecker, self).__init__(*args, **kwargs)
        self.meta = meta

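    def checkKey(self, credentials):
        # credentials.blob is the raw public key blob offered by the client,
        # so compare it against each stored key's blob, not the Key object.
        for k in self.meta.pubkeys(credentials.username) or ():
            if Key.fromString(k).blob() == credentials.blob:
                return True
        return False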
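checkKey() has to decide whether the key offered by the client matches one stored for the user. In an OpenSSH public key line, the comparable part is the key blob: the base64-decoded middle field. The standard-library sketch below shows that comparison without Twisted. The helper names key_blob and blob_is_authorized are hypothetical, and treating an unknown user (None) as having no keys is an assumption of this example.

```python
import base64


def key_blob(openssh_line):
    """Return the decoded key blob from an OpenSSH public key line.

    A line looks like "<type> <base64 blob> [comment]".
    """
    fields = openssh_line.split()
    if len(fields) < 2:
        raise ValueError("not an OpenSSH public key line")
    return base64.b64decode(fields[1])


def blob_is_authorized(offered_blob, authorized_lines):
    """True if offered_blob matches any of the user's stored keys."""
    for line in authorized_lines or ():
        try:
            if key_blob(line) == offered_blob:
                return True
        except (ValueError, TypeError):
            continue  # skip malformed entries rather than failing the login
    return False
```

Matching the key is only half of public key authentication; the server must also verify the client's signature, which this sketch leaves out entirely.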

And to put everything together:

class GitServer(SSHFactory):
    authmeta = BallinMockMeta()
    portal = Portal(GitRealm(authmeta))
    portal.registerChecker(GitPubKeyChecker(authmeta))

    def __init__(self, privkey):
        pubkey = '.'.join((privkey, 'pub'))
        self.privateKeys = {'ssh-rsa': Key.fromFile(privkey)}
        self.publicKeys = {'ssh-rsa': Key.fromFile(pubkey)}


if __name__ == '__main__':
    components.registerAdapter(GitSession, GitConchUser, ISession)
    reactor.listenTCP(2222, GitServer(sys.argv[1]))
    reactor.run()

Parting Thoughts

When hosting a service like git for multiple users it is rare that the users require full shell access. The reference git implementation doesn't (rightly so) provide any functionality for this use-case, so one needs to implement a wrapper that takes care of the kind of business logic that a shared hosting service requires. Hopefully this article has sketched out an interesting way of approaching the problem. I'll part with some exercises left to the reader.

The Java world has had an implementation of a git server that uses many of the same techniques discussed here. Gerrit is a code review tool descending from Google's Mondrian and Rietveld that is tightly integrated with git. The application provides git hosting by using MINA, a Java asynchronous networking library analogous to Twisted, to provide an SSH transport around a pure Java git implementation called JGit. Python has a similar project called Dulwich which is fast becoming a pure Python implementation of the git file and protocol specification. A Twisted server backed by Dulwich would have some interesting properties.

Tangentially, the git-shell style interface isn't unique to git; most version control systems have some protocol that operates over multiple transports (SSH and HTTP being the more popular ones). It would probably be fairly trivial to support Mercurial alongside git.

And as for storing application state, remember that Twisted is an asynchronous IO library, so if you're using a database to store that state, there are some extra topics to be aware of [11], as most of the popular DBAPI implementations aren't asynchronous.

Appendix A: Test Drive

All the source code in this article can be found here. Use the bootstrap.sh script to set up a sandbox to try the code samples. The bootstrap process sets up a virtual environment and retrieves all dependencies from PyPI. The code samples were tested using Python 2.6 but likely work for 2.4 and later.

If you are working on OS X, it's possible you'll have no choice but to use this virtual environment method as Apple includes a distribution of Twisted with OS X. virtualenv creates a completely isolated python sandbox (using --no-site-packages) to work around this.

[bshi@mac ~]$ bash bootstrap.sh /path/to/python2.aw3some
...
[bshi@mac ~]$ source GITEX/bin/activate
(GITEX)[bshi@mac ~]$ python
Python 2.6.1 (r261:67515, Feb 11 2010, 00:51:29)
[GCC 4.2.1 (Apple Inc. build 5646)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> import twisted
>>> twisted.__version__
'10.0.0'

The examples here require two bug fixes in Twisted, one of which is as yet unreleased in a stable tarball. Ticket 3984, fixed in 10.x, is required for the first example to run but is not required for subsequent examples. Ticket 4350, unreleased, fixes an IO stability issue with larger repositories. If you are stuck with an older version of Twisted, you can try backporting the patches attached to the tickets.

Appendix B: Mock Metadata

The mock metadata instance simply stores repository information in memory. With the mock implementation below, John would clone with a command like:

$ git clone ssh://john@localhost/helloworld.git

assuming, of course, that the server was operating on port 22, and that the strings listed in pubkeys were valid SSH public keys.

class BallinMockMeta(object):
    'Mock persistence layer.'
    def __init__(self):
        self.db = {
            'jane': {
                'pubkeys': ('a', 'b'),
                'repos': {
                    'foobar.git': '/path/to/foobar',
                    'project.git': '/path/to/project',
                },
            },
            'john': {
                'pubkeys': ('c', 'd'),
                'repos': {
                    'helloworld.git': '/path/to/helloworld/',
                },
            },
            'bshi': {
                'pubkeys': ('c', 'd'),
                'repos': {
                    'poop.git': '/Users/bshi/sandbox/poop/',
                },
            }
        }

    def repopath(self, username, reponame):
        if username not in self.db:
            return None
        return self.db[username]['repos'].get(reponame, None)

    def pubkeys(self, username):
        # Unknown users have no keys; an empty tuple keeps callers iterable.
        if username not in self.db:
            return ()
        return self.db[username]['pubkeys']

[1] Restricted login shell for GIT-only SSH access. The git-shell source code is short and informative.
[2] Engine Yard and GitHub Transition, September 11, 2009. Tom Mornini of Engine Yard reveals that GitHub used a patched SSH server that performed key lookups against MySQL; presumably their user database.
[3] Twisted is an event-driven networking engine written in Python and licensed under the MIT license. All code examples in this document require version 10.0.0 of Twisted along with pycrypto and pyasn1. If you are stuck with a version prior to 10, there is a bug in Conch that will affect the first two code samples.
[4] A Chicken in Every Pot and a Python on Every Port, June 23, 2009.
[5] Conch contains a multitude of extras including implementation of unix remote login functionality in the twisted.conch.unix module. This module is worth studying in detail as the existing official Conch documentation is a bit sparse. That said, it is straightforward to mimic an SSH server.
[6] "Cred: Pluggable Authentication". A good overview of Portal, IRealm and other classes/interfaces used in this article; required reading for a good understanding of how all the classes come together.
[7] "Components: Interfaces and Adapters". Twisted's implementation is maximally flexible and extensible but this may make using some parts of the library daunting to first time developers when they attempt to put components together; just take a look at the sheer number of imports required for the SSHServer code sample.
[8] Parts of the Twisted documentation (and code) use "avatar" and "user" interchangeably. This document uses "user".
[9] RFC 4254 Section 5: Channel Mechanism: All terminal sessions, forwarded connections, etc., are channels. Either side may open a channel. Multiple channels are multiplexed into a single connection.
[10] twisted.conch.ssh.channel.SSHChannel, twisted.conch.ssh.session.SSHSession
[11] Databases and Twisted: When Threads Are OK (For Some Purposes), December 20, 2008.