DVCS migration HOWTO

Overview

Distributed version control systems (DVCSes) are powerful and liberating tools for software developers, but projects aiming to choose one of the major contenders can find themselves mired in contention and politics. The most common such controversy is whether to use git or hg (aka Mercurial). With a little planning it's possible to have it both ways, allowing developers to use either git or hg to work with the same repository. It is even possible to support a CVS emulation from the same repository.

This page is a guide to up-converting your repository, finding the tools you need, and adopting practices that will reduce process and political friction to a minimum.

The technical fact central to the strategy I'm going to describe is that as of late 2011 an hg plugin already exists to allow seamless access to a git repository, and as of 2013 recent versions of git can clone from and push to Mercurial repos. We'll walk through how to up-convert a repository to git and then line up the tools to access it in several different ways.

But tools aren't the end of the story. Your developers will need some education in good practice to get the most out of the tools. I'll cover that aspect as well.

In 90% of cases you'll be converting from CVS or Subversion, and those are the cases we'll discuss in detail. If you're using something older or weirder, see the short section on other VCSes for some hints, but you're mostly on your own.

Commercial Note

If you are an organization that pays programmers and has a requirement to do a repository conversion, the author can be engaged to perform or assist with the transition. You are likely to find this is more efficient than paying someone in-house for the time required to learn the tools and procedures. I have been very open about my methods here, but nothing substitutes for experience when you need it done quickly and right.

Step Zero: Preparation

Create a scratch directory for your conversion work.

Copy this generic makefile, which is designed to sequence conversions, into your conversion directory as Makefile. Then set the variables near the top appropriately for your project.

This Makefile will help you avoid typing a lot of fiddly commands by hand, and ensure that later products of the conversion pipeline are always updated when earlier ones have been modified or removed.

Create an empty file named $(PROJECT).lift. The Makefile will use it. Later, you will put your custom commands in here. Doing this helps you not lose older steps as you experiment with newer ones, and it documents what you did.

Doing a high-quality repository conversion is not a simple job and the odds that you will get it perfectly right the first time are close to zero. By packaging your lift commands in a repeatable script and using the Makefile to sequence repetitive operations, you will reduce the overhead of experimenting.

In the rest of the steps we describe below, when we write "make foo" that means the step can be sequenced by the "foo" production in the Makefile. Replace $(PROJECT) with your project name.

You may find it instructive to type "make -n" to see what the entire conversion sequence will look like.

Step One: The Author Map

Subversion and CVS identify users by a Unix login name local to the repository host; DVCSes use pairs of full names and email addresses. Before you can finish your conversion, you'll need to put together an author map that maps the former to the latter; the Makefile assumes this is named $(PROJECT).map. The author map should specify a full name and email address for each local user ID in the repo you're converting. Each line should be in the following form:

foonly=Fred Foonly <foonly@foobar.com>

You can optionally specify a third field that is a timezone description, either an ISO8601 offset (like "-0500") or a named entry in the Unix timezone file (like "America/Chicago"). If you do, this timezone will be attached to the timestamps on commits made by this person.
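A quick format check can catch malformed map lines before a conversion run trips over them. The following is a minimal sketch, not part of reposurgeon: check_author_map is a hypothetical helper, and it assumes the optional timezone field follows the email address after a single space, which you should confirm against the reposurgeon manual page.

```shell
# check_author_map FILE
# Print every non-blank line of FILE that is not of the form
#   login=Full Name <user@host> [timezone]
# Returns 0 if all lines look well-formed, 1 if any do not,
# 2 if FILE cannot be read.
# ASSUMPTION: the timezone, if present, follows the email after one space.
check_author_map() {
    [ -r "$1" ] || { echo "cannot read $1" >&2; return 2; }
    pat='^[^=[:space:]]+=[^<>]+ <[^<>@[:space:]]+@[^<>@[:space:]]+>( ([+-][0-9]{4}|[A-Za-z_]+/[A-Za-z_/+-]+))?$'
    # The first grep lists non-matching lines as "N:text";
    # the second drops the ones that are merely blank.
    if grep -nEv "$pat" "$1" | grep -Ev '^[0-9]+:[[:space:]]*$'; then
        return 1
    fi
    return 0
}
```

Running check_author_map on your $(PROJECT).map file prints the offending lines with their line numbers, so it is cheap to rerun each time contributors send in corrections.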

Using the generic Makefile for Subversion, "make stubmap" will display a start on an author-map file. Edit in real names and addresses to the right of the equals signs.

How best to get this information will vary depending on your situation.

If you are converting the repository for an open-source project, it is good courtesy and good practice after the above first step to email the contributors and invite them to supply a preferred form of their name and a preferred email address to be used in the mapping. The reason for this is that some sites, like Ohloh, aggregate participation statistics (and thus, reputation) across many projects, using developer name and email address as a primary key.

Your authors file does not have to be final until you ship your converted repo, so you can chase down authors' preferred identifications in parallel with the rest of the work.

Step Two: Conversion

Your first step will be converting your repository to git.

There are at least half a dozen utilities out there for lifting CVS and Subversion repositories to a git repository or import stream. My opinion of them can be gauged by the fact that I wrote my own. (You can read a description of the things it does that other conversion tools don't.)

So, install reposurgeon and whatever tool it needs to read your repository. That will be cvs-fast-export for CVS, or the Subversion tools themselves for Subversion.

The generic-workflow Makefile will call reposurgeon for you, interpreting your $(PROJECT).lift file, when you type "make". You may have to watch the baton spin for a few minutes.

If you are exporting from CVS, it may be a good idea to run some trial conversions with cvsconvert, a wrapper script shipped with cvs-fast-export. This script runs a conversion direct to git; the advantage is that it can do a comparison of the repository histories and identify problems for you to fix in your lift script.

Normally reposurgeon will do branch analysis for you. On most Subversion repositories, and in particular anything with a standard trunk/tags/branches layout, it will do the right thing. (It will also cope with adventitious branches in the root directory of the repo, such as many projects use for website content.) In unusual cases you may want to use the "--nobranch" option; find out more about this from the manual page.

To my knowledge, reposurgeon is the only conversion tool that handles multibranch Subversion repositories in full generality. It can even translate Subversion commits that alter multiple branches.

Performance tip: reposurgeon should analyze Subversion repositories at the rate of over 10K commits per minute, but that rate falls off somewhat on very large repositories (apparently due to I/O costs). You can speed it up significantly by running it under pypy.

Other VCSes

SCCS: Use sccs2rcs to get to RCS, then follow the directions for RCS. There is a script called sccs2git on CPAN which I don't recommend, as it is poorly documented and makes no attempt to group commits into changesets.

RCS: reposurgeon will read an RCS collection. It uses cvs-fast-export, which despite its name does not actually require CVS metadata other than the RCS master files that store the content.

Fossil: reposurgeon will read a Fossil repository file. It uses the native Fossil exporter, which is pretty good but doesn't export ignore patterns, wiki events, or tickets.

For other systems, see the Git wiki page on conversion tools.

Step Three: Sanity Checking

Before putting in further effort on polishing your conversion and putting it into production, you should check it for basic correctness.

Use diff(1) with the -r option to compare a head checkout of the unconverted repository with a checkout of the converted repository. The only differences you should see are those due to keyword expansion and ignore-file lifting. If this is not true, you have found a serious bug in either reposurgeon or the front end it used. Consult http://www.catb.org/~esr/reposurgeon/reporting-bugs.html for information on how to usefully report bugs.
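As a sketch of that comparison, the hypothetical helper below wraps diff so that version-control metadata directories and ignore files do not drown out real differences. The -q and -x options are assumed to be available (they are in GNU and BSD diff), and the exclusion list is a guess at typical contents that you should adjust for your project.

```shell
# compare_checkouts OLD_CHECKOUT NEW_CHECKOUT
# Recursively compare two working trees, skipping VCS metadata directories
# and ignore files (which the conversion is expected to rewrite).
# Exit status is diff's: 0 if the trees match, 1 if they differ.
compare_checkouts() {
    diff -r -q \
        -x .git -x .svn -x CVS \
        -x .gitignore -x .cvsignore \
        "$1" "$2"
}
```

Files that differ only by keyword expansion will still be reported; drop the -q option to see the actual differences and confirm that is all they are.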

If you are converting from Subversion, run "make repodiffer". This is a sanity check that redoes your repo conversion with git-svn, then compares the file contents of each pair of revisions it can match by committer and timestamp. They should have no file-content differences; if they do, there is a bug in either reposurgeon or git-svn (I have seen both cases).

If you are converting from CVS, use reposurgeon's graph command to examine the conversion, looking (in particular) for misplaced tags or branch joins. Often these can be manually repaired with little effort. These flaws do not necessarily imply bugs in cvs-fast-export or reposurgeon; they may simply indicate previously undetected malformations in the history. However, reporting them may help improve cvs-fast-export.

Step Four: Cleanup

You should now have a git repository, but it is likely to have a lot of cruft and conversion artifacts in it. Here are some common forms of cruft:

Subversion and CVS commit references
Often Subversion references will be in the form 'r' followed by a string of digits referring to a Subversion commit number. But not always; humans come up with lots of ambiguous ways to write these. CVS commit references are even harder to spot mechanically, as they're just groups of digits separated by dots with no identifying prefix. A clean conversion should turn all these into VCS-independent commit references, which I'll describe later in this document.
Multiline comments with no summary
git and hg both encourage comments to begin with a summary line that can stand alone as a short description of the change; this practice produces more readable output from git log and hg log. For a really high-quality conversion, multiline comments should be edited into this form.
Branch tip deletes, deletealls, and unexpressed merges
In Subversion it is common practice to delete a branch directory when that line of development is finished or merged to trunk; this makes sense because it reduces the checkout size of the repo in later revisions. In a DVCS deletes at a branch tip don't save you any storage, so it makes more sense to leave the branch with all of its tip content live if you're not going to delete it entirely. Sometimes editing a later commit to have the branch tip as a parent (creating a merge that Subversion could not express) makes sense; look for svn:mergeinfo properties as clues.
Commits generated by cvs2svn to carry tag information.
These lurk in the history of a lot of Subversion projects. Sometimes these junk commits are empty (no file operations associated with them at all); sometimes they're translated as long lists of spurious delete fileops, and sometimes they have actual file content (duplicating parent file versions, or referring randomly to file versions far older than the junk commit). Older versions of cvs2svn seem to have generated all kinds of meaningless crud into these.
Metadata inserted by git-svn.
git-svn inserts lines at the end of each commit comment that refer back to the Subversion commit it is derived from. This is necessary for live-gatewaying, and useful during one-shot conversions, but you will probably not want it in the final repo.
Commits generated by git-svn to carry tag information.
Yes, these are a different phenomenon from cvs2svn-generated tag commits. These are tip commits carrying a tag which have no file-operation content.

The two kinds of git-svn cruft are only an issue if you're starting from a repository that has been preconverted with git-svn, which is the procedure older versions of this guide recommended. Since version 2.0, reposurgeon can read Subversion repos directly, which is a better idea and avoids these problems.

Surgical cleanup using reposurgeon

You can use reposurgeon to clean up all these sorts of problems; it's specifically designed for this job. The remainder of this section explains reposurgeon commands for common problems; the tool has a lot of additional power for dealing with unusual situations.

Here's a checklist of manual cleanup steps. Tips on how to do them with reposurgeon follow.

  1. Map author IDs from local to DVCS form.
  2. Check for leftover cvs2svn junk commits and remove them if possible.
  3. Lift references in commit comments.
  4. Massage comments into summary-line-plus-continuation form.
  5. Remove delete-only tip commits where appropriate.
  6. Review generated tags, pruning as appropriate.
  7. Look for branch merge points and patch parent marks to make them.
  8. Fix up or remove $-keyword cookies in the latest revision.
  9. If there's a root branch, check for and remove junk commits on it.
  10. For the record, make a commit noting time and date of the repo lift.
  11. If your target was git, run git gc --aggressive.

Most of the work will be in the comment-fixup and reference-lifting stages. I find, however, that they normally take only a couple of hours even on very large repos with thousands of commits. An entire conversion is usually less than a day of work.

You can use the authors read command to perform the author-ID mapping operation with reposurgeon.

The command list /cvs2svn/ will show you all remaining cvs2svn artifacts. Some can be deleted; a clue to look for is junk commits generated to carry a tag at branch tips that have one or two M fileops referring to a blob much earlier than the commit. Very occasionally the generated commits will have real fileops on them; all you can do in this case is note conversion damage in the comment and move on.

Another good way to spot junk commits is to eyeball the picture of the commit DAG created by the reposurgeon 'graph' command - they tend to stand out visually as leaf nodes in odd places. Be aware that the graph command outputs DOT, the language interpreted by the graphviz suite; you will need a DOT rendering program and an image viewer.

See the documentation of the references command for details on how to fix up Subversion and CVS changeset references in comments so they're still meaningful.

The command =L edit is good for fixing up multiline comments.

The reposurgeon command =H inspect will show you tip commits which may contain only deletes and deletealls.

Tags can be inspected with =T inspect. Junk tags can be removed with the delete command. Tag comments can be modified with edit.

Version 2.x and later of reposurgeon have a new merge command specifically for performing branch merges. The edit command will also allow you to add a parent mark to a commit.

One minor feature you lose in moving from CVS or Subversion to a DVCS is keyword expansion. You should go through the last revision of the code and remove $Id$, $Date$, $Revision$, and other keyword cookies lest they become unhelpful fossils. A command like grep -R '\$[A-Z]' . may be helpful.

After conversion of a branchy repository, look to see if there is a 'root' branch. If there are any commits with a sufficiently pathological structure that reposurgeon can't figure out what branch they belong to, they'll wind up there. Certain odd combinations of Subversion branch creation and deletion operations may do this, producing spurious deleteall commits; the results have to be garbage-collected by hand.

It's good practice to leave a commit in the stream noting the date and time of the repo lift. See the next section on conversion comments for discussion.

Experiments with reposurgeon suggest that git fast-import doesn't try to pack or otherwise optimize for space when it populates a repo from a dump file; this produces large repositories. Running git repack and git gc --aggressive can slim them down quite a lot.

Conversion comments

Sometimes, in converting a repository, you may need to insert an explanatory comment - for example, if metadata is garbled or missing and you need to point to that fact.

It's helpful for repository-browsing tools if there is a uniform syntax for this that is highly unlikely to show up in repository comments. I recommend enclosing translation notes in [[ ]]. This has the advantage of being visually similar to the [ ] traditionally used for editorial comments in text.

It is good practice to include, in the root commit of the repository, a note dating and attributing the conversion work and explaining these conventions. Example:

[[This repository was converted from Subversion to git on 2012-10-24 by Eric S. Raymond <esr@thyrsus.com>. Here and elsewhere, conversion notes are enclosed in double square brackets. Junk commits generated by cvs2svn have been removed, commit references have been mapped into a uniform VCS-independent syntax, and some comments edited into summary-plus-continuation form.]]

You should also, as previously noted, leave a comment in the normal commit sequence noting the switchover.

Nonsurgical cleanup steps

You'll want to run through the repository removing CVS and Subversion keyword-expansion headers. "grep -R '\$[A-Z]' ." will turn these up. Note that if you've been relying on these to supply version strings that are visible at runtime, you will need to supply that information in some different way.
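A search for any dollar sign followed by a capital letter also turns up shell and Makefile variables. As a narrower sketch, the hypothetical helper below matches only the keyword names that CVS and Subversion are known to expand, in both bare and expanded form. The --exclude-dir option is assumed to be available (it is in GNU and BSD grep).

```shell
# find_keyword_cookies [DIR]
# List file:line:text for every RCS/CVS/Subversion keyword cookie under DIR
# (default: current directory), in both the bare form ($Id$) and the
# expanded form ($Id: ... $). The .git directory is skipped.
find_keyword_cookies() {
    grep -RnE --exclude-dir=.git \
        '\$(Id|Date|Revision|Rev|Author|Header|HeadURL|URL|Log|Source|RCSfile|Name|Locker|State|LastChanged(Date|Revision|Rev|By))(:[^$]*)?\$' \
        "${1:-.}"
}
```

Run it from the top of a checkout of the converted repository; each line of output names a file and line number to clean up by hand.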

A step that too often gets missed and then inelegantly patched in later is converting the declarations that tell the version-control system to ignore derived files. reposurgeon does this for you if you're using it for CVS- or Subversion-to-git conversion, both expressing Subversion svn:ignore properties as .gitignore files and lifting .cvsignore files to .gitignore files; see the LIMITATIONS AND GUARANTEES section on its manpage if other DVCSes are involved.

Under versions of reposurgeon before 3.11 (August 2014) explicit .gitignore files in Subversion repositories were preserved and could interfere with .gitignore files generated from svn:ignore properties during the conversion. Under 3.11 and later the assumption is that these were created by git-svn users ad hoc and should be discarded; it is up to the human doing the conversion to look through them and rescue any ignore patterns that should be merged into the converted repository.

Recovering from errors

Occasionally you'll discover problems with a conversion after you've pushed it to a project's hosting site, typically to a bare repo that the hosting software created for you. Here's how to cope:

  1. Do your surgery on a copy of the repo with its .git/config pointing to the public location.

  2. Warn the public repo's users that it is briefly going out of service and they will need to re-clone it afterwards!

  3. From your modified local repo, try

     
        git push -f origin HEAD:master

    to push the new history up to the public repo. This will work if the public repo is not locked against non-fast-forward updates; otherwise, try the next step.

  4. Re-initialize the public repo. You'll need ssh access to the bare repo directory on the host - let's suppose it's 'myproject'. Pop up to the enclosing directory and do this:

    
        mv myproject myproject-hidden
        rm -fr myproject-hidden/*
        git init --bare myproject-hidden
        mv myproject-hidden myproject
    
     
    The point of doing it this way is (a) so you never actually remove myproject (on many hosts you will not have create permissions in the enclosing directory), and (b) so no user can update the repo while you're clearing it (mv is atomic).

    After re-initializing, you should be able to run git push to push the new history up to the public repo.

  5. Inform the public repo's users that it is available and remind them that they will need to re-clone it.

Step Five: Client Tools

Developers who are already git fans and know how to use a git client will, of course, have no particular trouble using a git repository.

Windows users accustomed to working through TortoiseSVN can move to TortoiseGit.

Developers who like hg can use the hg-git Mercurial plugin. There is an Ubuntu package "mercurial-git" for this, and other distributions are likely to carry it as well.
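Once the plugin is installed it still has to be switched on in the user's Mercurial configuration. A minimal sketch follows; enable_hggit is a hypothetical helper, the clone URL in the comment is a placeholder, and the code assumes that Mercurial merges a repeated [extensions] section with any that is already in the file.

```shell
# enable_hggit [HGRC]
# Append the stanza that turns on the hg-git extension to HGRC
# (default: ~/.hgrc), unless an hggit line is already present.
# ASSUMPTION: a second [extensions] section is merged with an existing one.
enable_hggit() {
    hgrc="${1:-$HOME/.hgrc}"
    if ! grep -qE '^[[:space:]]*hggit[[:space:]]*=' "$hgrc" 2>/dev/null; then
        printf '[extensions]\nhggit =\n' >> "$hgrc"
    fi
}
# Afterwards a git URL can be cloned with hg, for example (placeholder URL):
#   hg clone git://example.org/myproject.git
```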

There are some hg-git limitations to be aware of. In order to simulate git behavior, hg-git keeps some local state in the .hg directories: a map from git branch names to Mercurial commits, a list of Mercurial bookmarks describing git branches (which have bookmark-like behavior different from a Mercurial named branch), and a file mapping git SHA1 hashes to hg SHA1 hashes (both systems use them as commit IDs). The problem is that hg doesn't copy any of this local state when it clones a repo, so clones of hg-git repos lose their git branches and tags.

If you have developers attached to the CVS interface, it is possible (and in fact relatively easy) to set up a gateway interface that lets them continue using their CVS client tools. Consult the documentation for git-cvsserver.

Step Six: Good Practice

Since the object of this exercise is to support both git and hg fans, both groups need to use the repo in a way that doesn't assume the other group will understand artifacts (like commit hashes) that are specific to either VCS.

Being careful about this has an additional benefit. Someday your project may need to change VCSes yet again; on that day, it will be extremely helpful if nobody has to try to convert years' or decades' worth of VCS-specific magic cookies in the history.

Educate your developers in the following good practices:

Commit references

The combination of a committer email address with an ISO8601 timestamp is a good way to refer to a commit without being VCS-specific. Thus, instead of "commit 304a53c2", write "2011-10-25T15:11:09Z!fred@foonly.com". I recommend that you not vary from this format, even in trivial ways like omitting the 'Z' or changing the 'T' or '!'. Making these cookies uniform and machine-parseable will have good consequences for future repository-browsing tools. The reference-lifting code in reposurgeon generates them.

Sometimes it's enough to quote the summary line of a commit. So, instead of "Back out most of commit 304a53c2", you might write "Back out 'Attempted divide-by-zero fix'.".

When appropriate, "my last commit" is simple and effective.
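Typing these timestamp references by hand is error-prone, so a small helper is worth having. The sketch below is hypothetical and not part of any of the tools discussed here: commit_ref takes the commit time in seconds since the epoch plus the committer's email address, and it assumes GNU date for the conversion to UTC.

```shell
# commit_ref EPOCH_SECONDS EMAIL
# Print a VCS-independent commit reference:
# ISO8601 UTC timestamp, then '!', then the committer's email address.
# ASSUMPTION: GNU date, whose "-d @N" syntax converts seconds since the epoch.
commit_ref() {
    printf '%s!%s\n' "$(date -u -d "@$1" +%Y-%m-%dT%H:%M:%SZ)" "$2"
}
# In a git checkout, the two inputs for HEAD can be obtained with:
#   git log -1 --format='%ct %ce'
```

For example, commit_ref 1319555469 fred@foonly.com prints 2011-10-25T15:11:09Z!fred@foonly.com, the reference used as an example above.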

Comment summary lines

As previously noted, git and hg both want comments to begin with a summary line that can stand alone as a short description of the change; this may optionally be followed by a separating blank line and details in whatever form the commenter likes.

Try to end summary lines with a period. Ending punctuation other than a period should be used to indicate that the summary line is incomplete and continues after the separator; "..." is conventional.

For best results, stay within 72 characters per line. Don't go over 80.

Good comment practice produces more readable output from git log and hg log, and makes it easy to take in whole sequences of changes at a glance.
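These conventions are easy to check mechanically. The following sketch is a hypothetical helper, not a feature of git or hg. It reads a commit comment on standard input, treats a missing separator line or a line over 80 characters as an error, and only remarks on a summary without a final period or a line over 72 characters.

```shell
# check_commit_message  (reads the comment on standard input)
# Prints one line per finding; returns 1 if a hard rule is broken
# (non-blank second line, or any line over 80 characters), else 0.
check_commit_message() {
    awk '
        NR == 1 && $0 !~ /\.$/ { print "note: summary line does not end with a period" }
        NR == 2 && $0 != ""    { print "line 2 should be a blank separator"; bad = 1 }
        length($0) > 80        { print "line " NR " is longer than 80 characters"; bad = 1 }
        length($0) > 72 && length($0) <= 80 { print "note: line " NR " is longer than 72 characters" }
        END { exit bad + 0 }
    '
}
```

One way to use it is from a commit hook that feeds it the proposed comment and refuses the commit on a nonzero status; how hooks are wired up differs between git and hg, so that part is left out.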

Revision history

1.0 (2011-10-25) Original version.

2.0 (2011-11-04) Much more about CVS-to-git conversion, including recommending git cvsimport. I started numbering versions at this point.

2.1 (2011-11-07) Updated for reposurgeon 1.7.

2.2 (2011-11-10) Updated for reposurgeon 1.8.

2.3 (2011-11-10) Fix incorrect assertion about newer versions of git handling properties, this was a failure in my testing.

2.4 (2011-11-16) Add section on post-surgical cleanup: moving ignores, removing keyword expansions.

2.5 (2011-11-25) Fix typos and note the existence of git-remote-hg

2.6 (2012-11-02) reposurgeon can read Subversion repos now, making earlier conversion tools obsolete.

2.7 (2012-11-03) Add a link to the generic conversion makefile.

2.8 (2012-11-04) Title change, cleanup, and a Step Zero section.

3.0 (2012-11-05) Get serious about capturing the workflow in the Makefile.

3.1 (2012-11-18) It's a good idea to run 'make compare'.

3.2 (2012-12-05) Add hints on other systems.

3.3 (2012-12-19) Update for reposurgeon 2.10.

3.4 (2012-12-20) Update for reposurgeon 2.11.

3.5 (2013-01-09) Update for reposurgeon 2.13 and the 'graph' command.

3.6 (2013-01-22) Update for reposurgeon 2.15 and cvs-fast-export.

3.7 (2013-04-01) Note that reposurgeon is significantly faster under pypy.

3.8 (2013-11-15) Remove an obsolete paragraph.

3.9 (2013-12-11) Incorporate the report that git now does hg remotes.

3.10 (2014-02-16) Minor changes for 3.0 syntax.

3.11 (2014-02-18) More about post-conversion sanity checking.

3.12 (2014-08-12) Merging SVN .gitignore files.

3.13 (2014-10-26) Note that git-cvsserver exists.

3.14 (2014-11-05) Mention cvsconvert.