Distributed version control systems (DVCSes) are powerful and liberating tools for software developers, but projects aiming to choose one of the major contenders can find themselves mired in contention and politics. The most common such controversy is whether to use git or hg (aka Mercurial). With a little planning it's possible to have it both ways, allowing developers to use either git or hg to work with the same repository.
This page is a guide to up-converting your repository, finding the tools you need, and adopting practices that will reduce process and political friction to a minimum.
The technical fact central to the strategy I'm going to describe is that as of late 2011 an hg plugin already exists to allow seamless access to a git repository - but the reverse is not true. So we'll walk through how to up-convert a git repository and then line up the tools to access it in several different ways.
This assumption may become false in the future. The git-remote-hg project seems to be attempting seamless live gatewaying in the other direction (that is, using hg repos as git remotes). But as I write it seems to be in early development and as yet only poorly documented.
But tools aren't the end of the story. Your developers will need some education in good practice to get the most out of the tools. I'll cover that aspect as well.
In 90% of cases you'll be converting from CVS or Subversion, and those are the cases we'll discuss in detail. If you're using something older or weirder, see the short section on other VCSes for some hints, but you're mostly on your own.
Create a scratch directory for your conversion work.
Copy this generic makefile designed to sequence conversions to be the Makefile in your conversion directory. Then set the variables near the top appropriately for your project.
This Makefile will help you avoid typing a lot of fiddly commands by hand, and ensure that later products of the conversion pipeline are always updated when earlier ones have been modified or removed.
Create an empty file named $(PROJECT).lift. The Makefile will use it. Later, you will put your custom commands in here.
Doing a high-quality repository conversion is not a simple job and the odds that you will get it perfectly right the first time are close to zero. By packaging your lift commands in a repeatable script and using the Makefile to sequence repetitive operations, you will reduce the overhead of experimenting.
In the rest of the steps we describe below, when we write "make foo" that means the step can be sequenced by the "foo" production in the Makefile. Replace $(PROJECT) with your project name.
You may find it instructive to type "make -n" to see what the entire conversion sequence will look like.
Subversion and CVS identify users by a Unix login name local to the repository host; DVCSes use pairs of full names and email addresses. Before you can finish your conversion, you'll need to put together an author map that maps the former to the latter; the Makefile assumes this is named $(PROJECT).map. The author map should specify a full name and email address for each local user ID in the repo you're converting. Each line should be in the following form:
foonly=Fred Foonly <foonly@foobar.com>
You can optionally specify a third field that is a timezone description, either an ISO8601 offset (like "-0500") or a named entry in the Unix timezone file (like "America/Chicago"). If you do, this timezone will be attached to the timestamps on commits made by this person.
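As a rough illustration of the author-map format described above, here is a minimal Python sketch that reads such a file into a dictionary. The function name parse_author_map, the second sample entry, and the choice to skip blank lines are assumptions made for this example; it is not the parser reposurgeon itself uses.

```python
import re

# One entry per line: login=Full Name <email> [timezone]
ENTRY = re.compile(
    r"^(?P<login>[^=\s]+)\s*=\s*(?P<name>[^<]+?)\s*"
    r"<(?P<email>[^>]+)>\s*(?P<tz>\S+)?\s*$"
)

def parse_author_map(text):
    """Return {login: (full name, email, timezone or None)}."""
    authors = {}
    for lineno, line in enumerate(text.splitlines(), start=1):
        if not line.strip():
            continue  # skipping blank lines is an assumption of this sketch
        match = ENTRY.match(line)
        if match is None:
            raise ValueError(f"line {lineno}: not in login=Name <email> form")
        authors[match["login"]] = (match["name"], match["email"], match["tz"])
    return authors

# Hypothetical sample entries; the second carries an ISO8601 offset.
sample = """\
foonly=Fred Foonly <foonly@foobar.com>
jrh=J. Random Hacker <jrh@foobar.com> -0500
"""
for login, fields in parse_author_map(sample).items():
    print(login, fields)
```

Running a check like this over your map before each conversion attempt catches malformed lines early, while the file is still easy to fix.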
Using the generic Makefile for Subversion, "make $(PROJECT).map" will display a start on an author-map file. Edit in real names and addresses to the right of the equals signs.
How best to get this information will vary depending on your
situation. If you can get shell access to the repository host,
looking at /etc/passwd will give you the real name
corresponding to each username: usually you can simply append @ and
the repository hostname to each username to get a valid email
address. If the repository is owned by a project on a forge site, you
can usually get the real name information through the Web interface;
try looking for the project membership or developers' list information.
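If you do have shell access, the /etc/passwd approach can be sketched in a few lines of Python. This is only an illustration: the function name draft_author_map is hypothetical, the sample passwd text is made up, and the guessed email addresses (login plus "@" plus hostname) still need checking by hand.

```python
def draft_author_map(passwd_text, hostname, logins):
    """Build draft author-map lines for the given logins from passwd-format text."""
    wanted = set(logins)
    lines = []
    for record in passwd_text.splitlines():
        fields = record.split(":")
        # passwd records have seven colon-separated fields.
        if len(fields) < 7 or fields[0] not in wanted:
            continue
        login, gecos = fields[0], fields[4]
        # The fifth field often holds "Full Name,room,phone,..."; keep the name.
        fullname = gecos.split(",")[0].strip() or login
        lines.append(f"{login}={fullname} <{login}@{hostname}>")
    return lines

# Made-up passwd content for illustration only.
passwd = """\
root:x:0:0:root:/root:/bin/bash
foonly:x:1001:1001:Fred Foonly,,,:/home/foonly:/bin/bash
"""
print("\n".join(draft_author_map(passwd, "foobar.com", ["foonly"])))
```

Restricting the output to the logins that actually appear in the repository keeps system accounts out of the draft.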
If you are converting the repository for an open-source project, it is good courtesy and good practice after the above first step to email the contributors and invite them to supply a preferred form of their name and a preferred email address to be used in the mapping. The reason for this is that some sites, like Ohloh, aggregate participation statistics (and thus, reputation) across many projects, using developer name and email address as a primary key.
Your authors file does not have to be final until you ship your converted repo, so you can chase down authors' preferred identifications in parallel with the rest of the work.
Your first step will be converting your repository to git.
There are at least half a dozen utilities out there for lifting CVS and Subversion repositories to a git repository or import stream. My opinion of them can be gauged by the fact that I wrote my own. (You can read a description of the things it does that other conversion tools don't.)
So, install reposurgeon and whatever tool it needs to
read your repository. That will be cvs-fast-export
for CVS, or the Subversion tools themselves for Subversion.
The generic-workflow Makefile will call reposurgeon
for you, interpreting your $(PROJECT).lift file, when you type "make".
You may have to watch the baton spin for a few minutes.
Normally reposurgeon will do branch analysis for you.
On most Subversion repositories, and in particular anything with a
standard trunk/tags/branches layout, it will do the right thing. (It
will also cope with adventitious branches in the root directory of the
repo, such as many projects use for website content.) In unusual
cases you may want to use the "svn_nobranch" option; find out more
about this and other options via the "set" command.
When you're done, run make repodiffer. This is a
sanity check that redoes your repo conversion with
git-svn, then compares the file contents of each pair of
revisions it can match by committer and timestamp. They should have
no file-content differences; if they do, there is a bug in either
reposurgeon or git-svn (I have seen both
cases).
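The idea behind this check, pairing revisions by committer and timestamp and then comparing file contents, can be shown with a small Python sketch. It is not how repodiffer is implemented; the commit layout (dicts with "committer", "timestamp" and "files" keys) and the function name compare_conversions are hypothetical, and the sketch assumes the committer-plus-timestamp pair is unique within each conversion.

```python
def compare_conversions(commits_a, commits_b):
    """Pair commits by (committer, timestamp) and report file-content differences.

    Each commit is a dict with "committer", "timestamp", and "files"
    (a mapping from path to content); this layout is hypothetical.
    """
    index_b = {(c["committer"], c["timestamp"]): c for c in commits_b}
    report = {}
    for commit in commits_a:
        key = (commit["committer"], commit["timestamp"])
        other = index_b.get(key)
        if other is None:
            continue  # no counterpart, so nothing to compare
        paths = set(commit["files"]) | set(other["files"])
        differing = sorted(
            p for p in paths if commit["files"].get(p) != other["files"].get(p)
        )
        if differing:
            report[key] = differing
    return report

first = [
    {"committer": "fred@foonly.com", "timestamp": "2011-10-25T15:11:09Z",
     "files": {"README": "hello\n", "main.c": "int main(void) { return 0; }\n"}},
]
second = [
    {"committer": "fred@foonly.com", "timestamp": "2011-10-25T15:11:09Z",
     "files": {"README": "hello\n", "main.c": "int main() { return 0; }\n"}},
]
print(compare_conversions(first, second))
```

An empty report means every matched pair of revisions has identical file contents; unmatched revisions are silently skipped, which is why a clean report is a sanity check rather than a proof.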
To my knowledge, reposurgeon is the only conversion
tool that handles multibranch Subversion repositories in full
generality. It can even translate Subversion commits that alter
multiple branches.
Performance tip: reposurgeon should analyze Subversion
repositories at the rate of over 250 commits a second, but that rate
falls off somewhat on very large repositories (apparently due to I/O
costs). You can speed it up significantly by running it under pypy.
SCCS: Use sccs2rcs to get to RCS, then follow the directions for RCS. There is a script called sccs2git on CPAN which I don't recommend, as it is poorly documented and makes no attempt to group commits into changesets.
RCS: reposurgeon will read an RCS collection. It uses cvs-fast-export, which despite its name does not actually require CVS metadata other than the RCS master files that store the content.
For other systems, see the Git wiki page on conversion tools.
You should now have a git repository, but it is likely to have a lot of cruft and conversion artifacts in it. Here are some common forms of cruft:
Multiline comments with no summary line. git and hg both want comments to begin with a summary line that can stand alone as a short description of the change; this produces more readable output from git log and hg log. For a really high-quality conversion, multiline comments should be edited into this form.
Junk commits generated by cvs2svn to carry tag information.
cvs2svn seem to have generated all kinds of meaningless crud into these.
git-svn inserts lines at the end of each commit comment that refer back to the Subversion commit it is derived from. This is necessary for live-gatewaying, and useful during one-shot conversions, but you will probably not want it in the final repo.
Junk commits generated by git-svn to carry tag information.
The two kinds of git-svn cruft are only an issue if you're
starting from a repository that has been preconverted with
git-svn, which is the procedure older versions of this
guide recommended. Since its 2.0 version you can use
reposurgeon to read Subversion repos directly, which is a
better idea and avoids these problems.
You can use reposurgeon to clean
up all these sorts of problems; it's specifically designed for this
job. The remainder of this section explains reposurgeon
commands for common problems; the tool has a lot of additional power
for dealing with unusual situations.
(The descriptions below apply to reposurgeon 1.6 and later.
The command set was significantly different in earlier versions.
Also note that the cvspurge and gitsvnparse
commands in versions up to 1.9 were obsoleted by the ability to read
Subversion repositories directly and have been removed. The equivalent
fixups are now done, and done better, at repository read time.)
Here's a checklist of manual cleanup steps. Tips on how to do them with reposurgeon follow.
git gc --aggressive.
reposurgeon has a "script" command that allows you to
bundle up a set of commands with comments. I recommend writing your
repository lift as a reposurgeon script; this helps you
not lose older steps as you experiment with newer ones, and it
documents what you did.
Most of the work will be in the comment-fixup and reference-lifting stages. I find, however, that they normally take only a couple of hours even on very large repos with thousands of commits. An entire conversion is usually less than a day of work.
You can use the authors read command to perform the
author-ID mapping operation with reposurgeon.
The command list /cvs2svn/ will show you all remaining
cvs2svn artifacts. Some can be deleted; a clue to look
for is junk commits generated to carry a tag at branch tips that have
one or two M fileops referring to a blob much earlier than the commit.
Very occasionally the generated commits will have real fileops on
them; all you can do in this case is note conversion damage in the
comment and move on.
Another good way to spot junk commits is to eyeball the picture of
the commit DAG created by the reposurgeon 'graph' command
- they tend to stand out visually as leaf nodes in odd places. Be
aware that the graph command outputs DOT, the language interpreted by
the graphviz suite; you will
need a DOT rendering program and an image viewer.
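To make the "leaf nodes in odd places" idea concrete, here is a small Python sketch that finds childless commits in a parent map and writes a DOT graph with those commits drawn as boxes. The data layout, the mark-style names, and the function names are hypothetical, and the DOT it emits is not the format reposurgeon's graph command produces. Real branch tips are leaves too, so the list is only a set of candidates to inspect.

```python
def leaf_commits(parents):
    """Return commits that no other commit names as a parent."""
    have_children = {p for plist in parents.values() for p in plist}
    return sorted(c for c in parents if c not in have_children)

def to_dot(parents):
    """Emit a DOT digraph with parent -> child edges; leaves are drawn as boxes."""
    leaves = set(leaf_commits(parents))
    out = ["digraph history {"]
    for commit in sorted(parents):
        shape = "box" if commit in leaves else "ellipse"
        out.append(f'  "{commit}" [shape={shape}];')
        for parent in parents[commit]:
            out.append(f'  "{parent}" -> "{commit}";')
    out.append("}")
    return "\n".join(out)

# Hypothetical history: commit -> list of parents.
history = {":1": [], ":2": [":1"], ":3": [":2"], ":4": [":1"]}
print(leaf_commits(history))
print(to_dot(history))
```

Saved to a file, output like this can be rendered with the graphviz dot program, for example with dot -Tpng history.dot -o history.png.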
See the documentation of the references command for
details on how to fix up Subversion and CVS changeset references in
comments so they're still meaningful.
The edit multiline command is good for fixing up
multiline comments.
The reposurgeon command inspect =H will
show you tip commits which may contain only deletes and
deletealls.
Tags can be inspected with inspect =T. Junk tags can
be removed with the delete command. Tag comments can be
modified with edit.
Version 2.x and later of reposurgeon have a new
merge command specifically for performing branch merges.
The edit command will also allow you to add a parent mark
to a commit.
One minor feature you lose in moving from CVS or Subversion to a
DVCS is keyword expansion. You should go through the last revision of
the code and remove $Id$, $Date$, $Revision$, and other keyword
cookies lest they become unhelpful fossils. A command like grep -R
'\$[A-Z]' . may be helpful.
After conversion of a branchy repository, look to see if there is a
'root' branch. If there are any commits with a sufficiently
pathological structure that reposurgeon can't figure out
what branch they belong to, they'll wind up there. Certain odd
combinations of Subversion branch creation and deletion operations may
do this, producing spurious deleteall commits; the results have to
be garbage-collected by hand.
It's good practice to leave a commit in the stream noting the date and time of the repo lift. See the next section on conversion comments for discussion.
Experiments with reposurgeon suggest that git
fast-import doesn't try to pack or otherwise optimize for space when
it populates a repo from a dump file; this produces large
repositories. Running git repack and git gc
--aggressive can slim them down quite a lot.
Sometimes, in converting a repository, you may need to insert an explanatory comment - for example, if metadata is garbled or missing and you need to point to that fact.
It's helpful for repository-browsing tools if there is a uniform syntax for this that is highly unlikely to show up in repository comments. I recommend enclosing translation notes in [[ ]]. This has the advantage of being visually similar to the [ ] traditionally used for editorial comments in text.
It is good practice to include, in the root commit of the repository, a note dating and attributing the conversion work and explaining these conventions. Example:
[[This repository was converted from Subversion to git on 2012-10-24 by Eric S. Raymond <esr@thyrsus.com>. Here and elsewhere, conversion notes are enclosed in double square brackets. Junk commits generated by cvs2svn have been removed, commit references have been mapped into a uniform VCS-independent syntax, and some comments edited into summary-plus-continuation form.]]
You should also, as previously noted, leave a comment in the normal commit sequence noting the switchover.
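One payoff of a uniform syntax is that conversion notes become trivial to extract mechanically. The following Python sketch pulls [[ ]] notes out of a commit comment; the function name conversion_notes and the sample comment are hypothetical, and the sketch assumes notes are never nested.

```python
import re

# Non-greedy match so that two notes in one comment stay separate.
NOTE = re.compile(r"\[\[(.*?)\]\]", re.DOTALL)

def conversion_notes(comment):
    """Return the text of each [[ ]] note, with internal whitespace collapsed."""
    return [" ".join(note.split()) for note in NOTE.findall(comment)]

comment = """Fix the parser.

[[Author field was empty in
the Subversion dump.]]
"""
print(conversion_notes(comment))
```

Single square brackets, as used for ordinary editorial asides, are left alone.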
You'll want to run through the repository removing CVS and Subversion keyword-expansion headers. "grep -R '\$[A-Z]' ." will turn these up. Note that if you've been relying on these to supply version strings that are visible at runtime, you will need to supply that information in some different way.
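If the grep pattern turns up too many false positives (shell variables, for instance), a stricter search is easy to sketch in Python. The keyword list below covers the usual CVS/RCS and Subversion keywords as I understand them; the function names find_keywords and scan_tree, and the decision to skip the .git directory and unreadable files, are assumptions of this example.

```python
import os
import re

# Matches both unexpanded ($Id$) and expanded ($Id: ... $) cookies.
KEYWORD = re.compile(
    r"\$(?:Author|Date|Header|Id|Locker|Log|Name|RCSfile|Revision|Source|State"
    r"|LastChangedDate|LastChangedRevision|LastChangedBy|Rev|HeadURL|URL)"
    r"(?::[^$\n]*)?\$"
)

def find_keywords(text):
    """Return (line number, cookie) pairs for each keyword cookie in text."""
    hits = []
    for lineno, line in enumerate(text.splitlines(), start=1):
        for match in KEYWORD.finditer(line):
            hits.append((lineno, match.group(0)))
    return hits

def scan_tree(root):
    """Yield (path, line number, cookie) for text files under root, skipping .git."""
    for dirpath, dirnames, filenames in os.walk(root):
        dirnames[:] = [d for d in dirnames if d != ".git"]
        for name in filenames:
            path = os.path.join(dirpath, name)
            try:
                with open(path, encoding="utf-8") as handle:
                    text = handle.read()
            except (UnicodeDecodeError, OSError):
                continue  # binary or unreadable file
            for lineno, cookie in find_keywords(text):
                yield path, lineno, cookie

print(find_keywords("/* $Id$ */\nstatic int x;\n"))
```

Calling scan_tree(".") from the top of a checkout lists every remaining cookie with its file and line number.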
A step that too often gets missed and then inelegantly patched in
later is converting the declarations that tell the version-control
system to ignore derived files. reposurgeon does this for
you if you're using it for CVS- or Subversion-to-git conversion, both
expressing Subversion svn:ignore properties as .gitignore files
and lifting .cvsignore files to .gitignore files; see the
LIMITATIONS AND GUARANTEES section on its manpage if other DVCSes are
involved.
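For a sense of what such a translation involves, here is a minimal Python sketch for the Subversion case. It is one plausible mapping, not necessarily the one reposurgeon performs: since an svn:ignore property applies only to the directory that carries it, the sketch anchors each pattern with a leading slash so that the resulting .gitignore, placed in the same directory, does not match in subdirectories. The function name is hypothetical.

```python
def svn_ignore_to_gitignore(property_value):
    """Translate one directory's svn:ignore value into .gitignore content."""
    lines = []
    for pattern in property_value.splitlines():
        pattern = pattern.strip()
        if pattern:
            # svn:ignore is not recursive, so anchor the pattern to this directory.
            lines.append("/" + pattern)
    return "\n".join(lines) + "\n" if lines else ""

print(svn_ignore_to_gitignore("*.o\n\nbuild\n"), end="")
```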
Occasionally you'll discover problems with a conversion after you've pushed it to a project's hosting site, typically to a bare repo that the hosting software created for you. Here's how to cope:
Do your surgery on a copy of the repo with its .git/config pointing to the public location.
Warn the public repo's users that it is briefly going out of service and they will need to re-clone it afterwards!
From your modified local repo, try
git push -f origin HEAD:master
Re-initialize the public repo. You'll need ssh access to the bare repo directory on the host - let's suppose it's 'myproject'. Pop up to the enclosing directory and do this:
mv myproject myproject-hidden
rm -fr myproject-hidden/*
git init --bare myproject-hidden
mv myproject-hidden myproject
After re-initializing, you should be able to run git
push to push the new history up to the public repo.
Inform the public repo's users that it is available and remind them that they will need to re-clone it.
Developers who are already git fans and know how to use a git client will, of course, have no particular trouble using a git repository.
Windows users accustomed to working through TortoiseSVN can move to TortoiseGit.
Developers who like hg can use the hg-git mercurial plugin. There is an Ubuntu package "mercurial-git" for this, and other distributions are likely to carry it as well.
There are some hg-git limitations to be aware of. In order to simulate git behavior, hg-git keeps some local state in the .hg directories: a map from git branch names to Mercurial commits, a list of Mercurial bookmarks describing git branches (which have bookmark-like behavior different from a Mercurial named branch), and a file mapping git SHA1 hashes to hg SHA1 hashes (both systems use them as commit IDs). The problem is that hg doesn't copy any of this local state when it clones a repo, so clones of hg-git repos lose their git branches and tags.
Since the object of this exercise is to support both git and hg fans, both groups need to use the repo in a way that doesn't assume the other group will understand artifacts (like commit hashes) that are specific to either VCS.
Being careful about this has an additional benefit. Someday your project may need to change VCSes yet again; on that day, it will be extremely helpful if nobody has to try to convert years' or decades' worth of VCS-specific magic cookies in the history.
Educate your developers in the following good practices:
The combination of a committer email address with an ISO8601
timestamp is a good way to refer to a commit without being
VCS-specific. Thus, instead of "commit 304a53c2", write
"2011-10-25T15:11:09Z!fred@foonly.com". I recommend that you not
vary from this format, even in trivial ways like omitting the 'Z'
or changing the 'T' or '!'. Making these cookies uniform and
machine-parseable will have good consequences for future
repository-browsing tools. The reference-lifting code in
Sometimes it's enough to quote the summary line of a commit. So, instead of "Back out most of commit 304a53c2", you might write "Back out 'Attempted divide-by-zero fix'.".
When appropriate, "my last commit" is simple and effective.
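Because the timestamp-and-email cookie described above is strictly formatted, it is easy to generate and parse. The Python sketch below shows one way to do both; the function names make_cookie and parse_cookie are hypothetical, and the email pattern is deliberately loose (anything of the form something@something without whitespace).

```python
import re
from datetime import datetime, timezone

COOKIE = re.compile(
    r"(\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2}Z)!([^\s@]+@[^\s@]+)"
)

def make_cookie(when, email):
    """Format a timezone-aware datetime and a committer email as a cookie."""
    stamp = when.astimezone(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")
    return f"{stamp}!{email}"

def parse_cookie(cookie):
    """Return (UTC datetime, email); reject anything that varies from the format."""
    match = COOKIE.fullmatch(cookie)
    if match is None:
        raise ValueError(f"not a commit reference: {cookie!r}")
    when = datetime.strptime(match.group(1), "%Y-%m-%dT%H:%M:%SZ")
    return when.replace(tzinfo=timezone.utc), match.group(2)

when, email = parse_cookie("2011-10-25T15:11:09Z!fred@foonly.com")
print(when.isoformat(), email)
print(make_cookie(when, email))
```

The parser refuses cookies with a missing 'Z' or a different separator, which is the kind of strictness the uniform-format advice is meant to allow.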
As previously noted, git and hg both want comments to begin with a summary line that can stand alone as a short description of the change; this may optionally be followed by a separating blank line and details in whatever form the commenter likes.
Try to end summary lines with a period. Ending punctuation other than a period should be used to indicate that the summary line is incomplete and continues after the separator; "..." is conventional.
For best results, stay within 72 characters per line. Don't go over 80.
Good comment practice produces more readable output from git
log and hg log, and makes it easy to take in
whole sequences of changes at a glance.
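These comment conventions are simple enough to check automatically, for example from a commit hook. The Python sketch below is one possible checker; the function name check_comment and the wording of its messages are made up for this example, and it treats any summary ending in a period (including "...") as acceptable.

```python
def check_comment(comment, soft_limit=72, hard_limit=80):
    """Return a list of complaints about a commit comment; empty means it looks fine."""
    problems = []
    lines = comment.rstrip("\n").split("\n")
    summary = lines[0]
    if len(lines) > 1 and lines[1].strip():
        problems.append("no blank line after the summary")
    if not summary.endswith("."):
        problems.append("summary does not end with a period")
    for number, line in enumerate(lines, start=1):
        if len(line) > hard_limit:
            problems.append(f"line {number} is over {hard_limit} characters")
        elif len(line) > soft_limit:
            problems.append(f"line {number} is over {soft_limit} characters")
    return problems

good = "Fix divide-by-zero in parser.\n\nThe tokenizer passed an empty list.\n"
bad = "Fix things\nmore details here"
print(check_comment(good))
print(check_comment(bad))
```

A summary that deliberately ends with other punctuation to signal continuation will be flagged; in that case the complaint is a prompt to double-check rather than an error.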
1.0 (2011-10-25) Original version.
2.0 (2011-11-04) Much more about CVS-to-git conversion, including
recommending git cvsimport. I started numbering versions
at this point.
2.1 (2011-11-07) Updated for reposurgeon 1.7.
2.2 (2011-11-10) Updated for reposurgeon 1.8.
2.3 (2011-11-10) Fix incorrect assertion about newer versions of git handling properties; this was a failure in my testing.
2.4 (2011-11-16) Add section on post-surgical cleanup: moving ignores, removing keyword expansions.
2.5 (2011-11-25) Fix typos and note the existence of git-remote-hg.
2.6 (2012-11-02) reposurgeon can read Subversion repos now, making earlier conversion tools obsolete.
2.7 (2012-11-03) Add a link to the generic conversion makefile.
2.8 (2012-11-04) Title change, cleanup, and a Step Zero section.
3.0 (2012-11-05) Get serious about capturing the workflow in the Makefile.
3.1 (2012-11-18) It's a good idea to run 'make compare'.
3.2 (2012-12-05) Add hints on other systems.
3.3 (2012-12-19) Update for reposurgeon 2.10.
3.4 (2012-12-20) Update for reposurgeon 2.11.
3.5 (2013-01-09) Update for reposurgeon 2.13 and the 'graph' command.
3.6 (2013-01-22) Update for reposurgeon 2.15 and cvs-fast-export.
3.7 (2013-04-01) Note that reposurgeon is significantly faster under pypy.