Overview

Distributed version control systems (DVCSes) are powerful and liberating tools for software developers, but projects aiming to choose one of the major contenders can find themselves mired in contention and politics. The most common such controversy is whether to use git or hg (aka Mercurial). With a little planning it’s possible to have it both ways, allowing developers to use either git or hg to work with the same repository. It is even possible to support a CVS emulation from the same repository.

This page is a guide to up-converting your repository, finding the tools you need, and adopting practices that will reduce process and political friction to a minimum.

The technical fact central to the strategy I’m going to describe is that as of late 2011 an hg plugin already exists to allow seamless access to a git repository, and as of 2013 recent versions of git can clone from and push to Mercurial repos. We’ll walk through how to up-convert a git repository and then line up the tools to access it in several different ways.

But tools aren’t the end of the story. Your developers will need some education in good practice to get the most out of the tools. I’ll cover that aspect as well.

In 90% of cases you’ll be converting from CVS or Subversion, and those are the cases we’ll discuss in detail. If you’re using something older or weirder, see the short section on other VCSes for some hints, but you’re mostly on your own.

Commercial Note

If you are an organization that pays programmers and has a requirement to do a repository conversion, the author can be engaged to perform or assist with the transition. You are likely to find this is more efficient than paying someone in-house for the time required to learn the tools and procedures. I have been very open about my methods here, but nothing substitutes for experience when you need it done quickly and right.

Step Zero: Preparation

Make sure the tools in the reposurgeon suite (especially reposurgeon and repotool) are on your $PATH.

Create a scratch directory for your conversion work.

Run "repotool initialize" in the scratch directory. This will create a Makefile designed to sequence your conversion, and an empty lift script. Then set the variables near the top appropriately for your project.

This Makefile will help you avoid typing a lot of fiddly commands by hand, and ensure that later products of the conversion pipeline are always updated when earlier ones have been modified or removed.

Later, you will put your custom commands in the lift script file. Doing this helps you not lose older steps as you experiment with newer ones, and it documents what you did.

Doing a high-quality repository conversion is not a simple job and the odds that you will get it perfectly right the first time are close to zero. By packaging your lift commands in a repeatable script and using the Makefile to sequence repetitive operations, you will reduce the overhead of experimenting.

In the rest of the steps we describe below, when we write "make foo" that means the step can be sequenced by the "foo" production in the Makefile. Replace $(PROJECT) in these instructions with your project name.

You may find it instructive to type "make -n" to see what the entire conversion sequence will look like.
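The preparation steps above can be sketched as a short shell session. This is only an illustration under assumptions: the directory name myproject-conversion and the helper missing_tools are made up, and "repotool initialize" is invoked exactly as this guide describes it, so check the manual page of your installed version for any required arguments.

```shell
#!/bin/sh
# Sketch of Step Zero. "myproject-conversion" is a hypothetical directory name.

# Print the name of every listed tool that is not on $PATH.
missing_tools() {
    for tool in "$@"; do
        command -v "$tool" >/dev/null 2>&1 || echo "$tool"
    done
}

missing=$(missing_tools reposurgeon repotool)
if [ -n "$missing" ]; then
    echo "not on \$PATH:" $missing
else
    mkdir -p myproject-conversion
    (
        cd myproject-conversion || exit 1
        repotool initialize   # writes the Makefile and an empty lift script
        echo "Edit the variables at the top of the Makefile, then try: make -n"
    )
fi
```

Checking for the tools first means a missing install shows up as one clear message rather than as a failure partway through the Makefile.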

Step One: The Author Map

Subversion and CVS identify users by a Unix login name local to the repository host; DVCSes use pairs of fullnames and email addresses. Before you can finish your conversion, you’ll need to put together an author map that maps the former to the latter; the Makefile assumes this is named $(PROJECT).map. The author map should specify a full name and email address for each local user ID in the repo you’re converting. Each line should be in the following form:

foonly = Fred Foonly <foonly@foobar.com>

You can optionally specify a third field that is a timezone description, either an ISO8601 offset (like "-0500") or a named entry in the Unix timezone file (like "America/Chicago"). If you do, this timezone will be attached to the timestamps on commits made by this person.
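Because a malformed map line is easy to miss by eye, a quick mechanical check can help. The following is a minimal sketch, not part of the reposurgeon suite: the function name check_author_map, the sample entries, and the regular expression are my own assumptions about what a well-formed line looks like, based only on the format described above.

```shell
#!/bin/sh
# Sketch: flag author-map lines that do not match
#   login = Full Name <email> [timezone]
# The sample entries are invented for illustration.

map=$(mktemp)
cat >"$map" <<'EOF'
foonly = Fred Foonly <foonly@foobar.com>
jrh = J. Random Hacker <jrh@foobar.com> America/Chicago
alice = Alice Example <alice@foobar.com> -0500
bob = Bob Example bob@foobar.com
EOF

# Print "line-number:text" for every entry that fails the pattern.
check_author_map() {
    grep -nvE '^[^ =]+ = [^<>]+ <[^<> ]+@[^<> ]+>( ([+-][0-9]{4}|[A-Za-z_]+(/[A-Za-z0-9_+-]+)*))?$' "$1"
}

bad=$(check_author_map "$map")
echo "malformed entries: $bad"
rm -f "$map"
```

Here the last sample line is reported because its email address is not enclosed in angle brackets.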

Using the generic Makefile for Subversion, "make stubmap" will generate a start on an author-map file as $(PROJECT).map. Edit in real names and addresses (and optionally offsets) to the right of the equals signs.

How best to get this information will vary depending on your situation.

  • If you can get shell access to the repository host, looking at /etc/passwd will give you the real name corresponding to each username of a developer still active: usually you can simply append @ and the repository hostname to each username to get a valid email address. You can do this automatically, and merge in real names from the password file, using the repomapper tool from the reposurgeon distribution.

  • If the repository is owned by a project on a forge site, you can usually get the real name information through the Web interface; try looking for the project membership or developer’s list information.

  • If the project has a development mailing list, posting your incomplete map with a request for completions often gives good results.

  • If you can download the archives of the project’s development mailing list, grepping out all the From addresses may suggest some obvious matches with otherwise unknown usernames. You may also be able to get timezone offsets from the date stamps on the mail.
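The mailing-list approach in the last bullet can be sketched with standard text tools. This is an illustration only: the archive contents are invented, the function names are hypothetical, and the offset extraction assumes the numeric offset is the last field of the Date header, which is not true of all mail.

```shell
#!/bin/sh
# Sketch: list distinct From: identities in an mbox archive, and pair
# each with the UTC offset seen in its Date: header.
# The archive contents below are invented sample data.

archive=$(mktemp)
cat >"$archive" <<'EOF'
From foonly@foobar.com Tue Oct 25 10:11:09 2011
From: Fred Foonly <foonly@foobar.com>
Date: Tue, 25 Oct 2011 10:11:09 -0500
Subject: build fix

From jrh@foobar.com Wed Oct 26 09:00:00 2011
From: J. Random Hacker <jrh@foobar.com>
Date: Wed, 26 Oct 2011 09:00:00 +0100
Subject: Re: build fix

From foonly@foobar.com Wed Oct 26 11:30:00 2011
From: Fred Foonly <foonly@foobar.com>
Date: Wed, 26 Oct 2011 11:30:00 -0500
Subject: Re: build fix
EOF

# Header lines start with "From: "; the mbox separator is "From " with no colon.
from_identities() {
    grep '^From: ' "$1" | sed 's/^From: //' | sort -u
}

# Remember the most recent sender, then print it with the offset that follows.
from_offsets() {
    awk '/^From: /{sub(/^From: /, ""); who = $0}
         /^Date: /{print who " " $NF}' "$1" | sort -u
}

identities=$(from_identities "$archive")
offsets=$(from_offsets "$archive")
echo "$identities"
echo "$offsets"
rm -f "$archive"
```

The second listing already has the shape of the right-hand side of an author-map line, so matching entries can be pasted in after review.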

If you are converting the repository for an open-source project, it is good courtesy and good practice after the above first step to email the contributors and invite them to supply a preferred form of their name, a preferred email address to be used in the mapping, and a timezone offset. The reason for this is that some sites, like OpenHub, aggregate participation statistics (and thus, reputation) across many projects, using developer name and email address as a primary key.

Your authors file does not have to be final until you ship your converted repo, so you can chase down authors' preferred identifications in parallel with the rest of the work.

Step Two: Conversion

Your first step will be converting your repository to git.

There are at least half a dozen utilities out there for lifting CVS and Subversion repositories to a git repository or import stream. My opinion of them can be gauged by the fact that I wrote my own: reposurgeon. You can read a description of the things it does that other conversion tools don’t.

So, install reposurgeon and whatever tool it needs to read your repository. That will be cvs-fast-export for CVS, or the Subversion tools themselves for Subversion.

The generic-workflow Makefile will call reposurgeon for you, interpreting your $(PROJECT).lift file, when you type "make". You may have to watch the baton spin for a few minutes. For very large repositories it could be more than a few minutes.

CVS

If you are exporting from CVS, it may be a good idea to run some trial conversions with cvsconvert, a wrapper script shipped with cvs-fast-export. This script runs a conversion direct to git; the advantage is that it can do a comparison of the repository histories and identify problems for you to fix in your lift script.

Problems in CVS conversions generally arise from the fact that CVS’s data model doesn’t have real multi-file changesets, which are the fundamental unit of a commit in DVCSes. It can be difficult to fully recover changesets from what are actually large numbers of single-file changes flying in loose formation - in fact, old CVS operator errors can sometimes make it impossible. Bad tools silently propagate such damage forward into your translation. Good tools, like cvs-fast-export and reposurgeon, warn you of problems and help you recover.

Subversion

Normally reposurgeon will do branch analysis for you. On most Subversion repositories, and in particular anything with a standard trunk/tags/branches layout, it will do the right thing. (It will also cope with adventitious branches in the root directory of the repo, such as many projects use for website content.)

In unusual cases you may need to use the "--nobranch" option; find out more about this from the manual page. However, this has the disadvantage that you’ll have to do the branch surgery by hand at a later stage. Instead, you may be able to use the repocutter filter to transform the dump file into a version shaped right for a regular branch-sensitive lift.

To my knowledge, reposurgeon is the only conversion tool that handles multibranch Subversion repositories in full generality. It can even translate Subversion commits that alter multiple branches.

Special Google Code note: If you are converting a Subversion project from Google Code, you may want to use the command "debranch wiki" to turn the wiki branch into a subdirectory on your master branch.

Performance tip: reposurgeon should analyze Subversion repositories at the rate of over 10K commits per minute, but that rate falls off somewhat on very large repositories (apparently due to I/O costs). You can speed it up significantly by building a binary with cython; there’s a production to do this in the reposurgeon Makefile.

Unlike CVS, Subversion repositories have real changesets and the work in them can effectively always be mapped onto equivalent DVCS commits. The parent-child relationships among commits will also translate cleanly. There is, however, a minor problem around tags, and a significant problem around merges.

The tag problem arises because Subversion tags are really branches that you’ve conventionally agreed not to commit to after the initial branch copy (that’s what the tags/ directory name conveys). But Subversion doesn’t enforce any prohibition against committing to the tag branch, and various odd things can happen if you do.

Another case with surprising results is if you create a tag directory in Subversion and then move it. The gitspace tag is likely to end up attached to the old location. Yes, this is a bug, but not a practically fixable one; a detailed explanation of why is possible but would probably make your head explode.

The reposurgeon analyzer tries to warn you about these cases, and reposurgeon gives you tools for coping with them. Unfortunately, the warnings are (unavoidably) cryptic unless you understand Subversion internals in detail.

In a DVCS, a merge normally coalesces two entire branches. Subversion has something close to this in newer versions; it’s called a "sync merge" working on directories (and is expressed as an svn:mergeinfo property of the target directory that names the source). A sync merge of a branch directory into another branch directory behaves like a DVCS merge; reposurgeon picks these up and translates them for you.

The older, more basic Subversion merge is per file and is expressed by per-file svn:mergeinfo properties. These correspond to what in DVCS-land are called "cherry-picks", which just replay a commit from a source branch onto a target branch but do not create cross-branch links.

Sometimes Subversion developers use collections of per-file mergeinfo properties to express partial branch merges. This does not map to the DVCS model at all well, and trying to promote these to full-branch merges by hand is actually dangerous. An excellent essay, Partial git merges — just say no. explores the problem in depth.

The bottom line is that reposurgeon warns about per-file svn:mergeinfo properties and then discards them for good reasons. If you feel an urge to hand-edit in a branch merge based on these, do so with care and check your work.

Other VCSes

SCCS: Use sccs2rcs to get to RCS, then follow the directions for RCS. There is a script called sccs2git on CPAN which I don’t recommend, as it is poorly documented and makes no attempt to group commits into changesets.

RCS: reposurgeon will read an RCS collection. It uses cvs-fast-export, which despite its name does not actually require CVS metadata other than the RCS master files that store the content.

Fossil: reposurgeon will read a Fossil repository file. It uses the native Fossil exporter, which is pretty good but doesn’t export ignore patterns, wiki events, or tickets.

BitKeeper: As of version 7.3 (and probably earlier versions under open-source licensing) BitKeeper has fast-import and fast-export subcommands, and reposurgeon now knows how to use these.

For other systems, see the Git wiki page on conversion tools.

Step Three: Sanity Checking

Before putting in further effort on polishing your conversion and putting it into production, you should check it for basic correctness.

Pay attention to error messages emitted during the lift. Most of these, and remedial actions to take, are described in the reposurgeon manual.

For Subversion lifts, use the "compare" and "compare-tags" productions to compare the head revision and tagged revisions between the unconverted repository and the converted one. If you didn’t use the cvsconvert wrapper for your CVS lift, these productions have a similar effect. The only differences you should see are those due to keyword expansion and ignore-file lifting. If this is not true, you may have found a serious bug in either reposurgeon or the front end it used, or you might just have a misassigned tag that can be manually fixed. Consult How to report bugs for information on how to usefully report bugs.
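If you want to see what such a comparison amounts to, a rough manual approximation is a recursive diff of two checkouts that ignores each system's metadata directory. This sketch is not what the Makefile productions run; the directory names, file contents, and the compare_trees helper are invented for illustration.

```shell
#!/bin/sh
# Sketch: compare a checkout of the old repository with one of the new.
# Directory names and file contents are invented for illustration.

work=$(mktemp -d)
mkdir -p "$work/svn-checkout/.svn" "$work/git-checkout/.git"
printf 'int main(void) { return 0; }\n' >"$work/svn-checkout/main.c"
printf 'int main(void) { return 0; }\n' >"$work/git-checkout/main.c"
printf '*.o\n' >"$work/git-checkout/.gitignore"   # lifted ignore file

# Recursive diff that skips each system's own metadata directory.
compare_trees() {
    diff -r -x .svn -x .git -x CVS "$1" "$2"
}

report=$(compare_trees "$work/svn-checkout" "$work/git-checkout")
echo "$report"
rm -rf "$work"
```

In this sample the only difference reported is the lifted .gitignore file, which is one of the two kinds of expected difference named above.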

If you are converting from CVS, use reposurgeon’s graph command to examine the conversion, looking (in particular) for misplaced tags or branch joins. Often these can be manually repaired with little effort. These flaws do not necessarily imply bugs in cvs-fast-export or reposurgeon; they may simply indicate previously undetected malformations in the history. However, reporting them may help improve cvs-fast-export.

Step Four: Cleanup

You should now have a git repository, but it is likely to have a lot of cruft and conversion artifacts in it. Here are some common forms of cruft:

Subversion and CVS commit references

Often Subversion references will be in the form r followed by a string of digits referring to a Subversion commit number. But not always; humans come up with lots of ambiguous ways to write these. CVS commit references are even harder to spot mechanically, as they’re just groups of digits separated by dots with no identifying prefix. A clean conversion should turn all these into VCS-independent commit references, which I’ll describe later in this document.

Multiline comments with no summary

git and hg both encourage comments to begin with a summary line that can stand alone as a short description of the change; this practice produces more readable output from git log and hg log. For a really high-quality conversion, multiline comments should be edited into this form.

No-fileop commits

Commits with no fileops are automatically transformed into tags when reading a Subversion repository. Other importers may generate them for various reasons; you can detect them as the =Z visibility set. In order to preserve the behavior that read followed by immediate write does not modify a stream file, this simplification is not done by default in non-Subversion imports.

Branch tip deletes, deletealls, and unexpressed merges

In Subversion it is common practice to delete a branch directory when that line of development is finished or merged to trunk; this makes sense because it reduces the checkout size of the repo in later revisions. In a DVCS, deletes at a branch tip don’t save you any storage, so it makes more sense to leave the branch with all of its tip content live if you’re not going to delete it entirely. Sometimes editing a later commit to have the branch tip as a parent (creating a merge that Subversion could not express) makes sense; look for svn:mergeinfo properties as clues.

Commits generated by cvs2svn to carry tag information

These lurk in the history of a lot of Subversion projects. Sometimes these junk commits are empty (no file operations associated with them at all); sometimes they’re translated as long lists of spurious delete fileops, and sometimes they have actual file content (duplicating parent file versions, or referring randomly to file versions far older than the junk commit). Older versions of cvs2svn seem to have generated all kinds of meaningless crud into these.

Metadata inserted by git-svn

git-svn inserts lines at the end of each commit comment that refer back to the Subversion commit it is derived from. This is necessary for live-gatewaying, and useful during one-shot conversions, but you will probably not want it in the final repo.
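To make the git-svn case concrete, here is a minimal sketch of what the trailer looks like and how it could be filtered out of a single message with standard tools. It only illustrates the line format; for a whole repository you would do this inside reposurgeon. The URL, revision number, UUID, and function names are invented.

```shell
#!/bin/sh
# Sketch: the trailer git-svn appends, and filters that read or drop it.
# The URL, revision, and UUID below are invented sample values.

msg=$(mktemp)
cat >"$msg" <<'EOF'
Fix off-by-one error in buffer sizing.

git-svn-id: https://svn.example.com/repo/trunk@1234 0123abcd-4567-89ab-cdef-0123456789ab
EOF

# Print the Subversion revision number recorded in the trailer, if any.
svn_revision() {
    sed -n 's/^git-svn-id: [^@ ]*@\([0-9][0-9]*\) .*$/\1/p' "$1"
}

# Print the message without the trailer and without trailing blank lines.
strip_git_svn_id() {
    awk '!/^git-svn-id: / { lines[++n] = $0 }
         END {
             while (n > 0 && lines[n] == "") n--
             for (i = 1; i <= n; i++) print lines[i]
         }' "$1"
}

rev=$(svn_revision "$msg")
clean=$(strip_git_svn_id "$msg")
echo "revision: $rev"
echo "$clean"
rm -f "$msg"
```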

Surgical cleanup using reposurgeon

You can use reposurgeon to clean up all these sorts of problems; it’s specifically designed for this job. The remainder of this section explains reposurgeon commands for common problems; the tool has a lot of additional power for dealing with unusual situations.

Here’s a checklist of manual cleanup steps. Tips on how to do them with reposurgeon follow.

  1. Map author IDs from local to DVCS form.

  2. Check for leftover cvs2svn junk commits and remove them if possible.

  3. Lift references in commit comments.

  4. Massage comments into summary-line-plus-continuation form.

  5. Remove empty and delete-only tip commits where appropriate.

  6. Review generated tags, pruning and fixing locations as appropriate.

  7. Look for branch merge points and patch parent marks to make them.

  8. Fix up or remove $-keyword cookies in the latest revision.

  9. If there’s a root branch, check for and remove junk commits on it.

  10. For the record, make a tag noting time and date of the repo lift.

  11. If your target was git, run git gc --aggressive.

Most of the work will be in the comment-fixup and reference-lifting stages. I find, however, that they normally take only a couple of hours even on very large repos with thousands of commits. An entire conversion is usually less than a day of work.

You can use the authors read command to perform the author-ID mapping operation with reposurgeon.

You can find empty commits as the =Z visibility set and clean them up with the command tagify. Consult the reposurgeon manual page for usage details.

The cvs2svn list command will show you all remaining cvs2svn artifacts. Some can be deleted; a clue to look for is junk commits generated to carry a tag at branch tips that have one or two M fileops referring to a blob much earlier than the commit. Very occasionally the generated commits will have real fileops on them; all you can do in this case is note conversion damage in the comment and move on.

Another good way to spot junk commits is to eyeball the picture of the commit DAG created by the reposurgeon graph command - they tend to stand out visually as leaf nodes in odd places. Be aware that the graph command outputs DOT, the language interpreted by the graphviz suite; you will need a DOT rendering program and an image viewer.

See the documentation of the references command for details on how to fix up Subversion and CVS changeset references in comments so they’re still meaningful.
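Before lifting references, it can help to get a rough count of how many comments cite Subversion revisions in the common "r" plus digits form. The following sketch works on a plain text dump of commit comments; the sample log and the find_svn_refs helper are invented, and the pattern deliberately catches only the simplest spelling.

```shell
#!/bin/sh
# Sketch: find commit comments that appear to cite Subversion revisions.
# The log text is invented sample data.

log=$(mktemp)
cat >"$log" <<'EOF'
Back out r1234, it broke the build.
Refactor parser error handling.
Follow-up to revision 1300 and r1301.
Bump version number for release.
EOF

# Match "r" followed by digits when it stands alone as a word.
find_svn_refs() {
    grep -nE '(^|[^[:alnum:]_])r[0-9]+([^[:alnum:]_]|$)' "$1"
}

hits=$(find_svn_refs "$log")
echo "$hits"
rm -f "$log"
```

Note that "revision 1300" in the third sample line is not matched on its own; spellings like that are the ambiguous cases that need a human eye.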

The command =L edit is good for fixing up multiline comments.

The reposurgeon command inspect =H will show you tip commits which may contain only deletes and deletealls.

Tags can be inspected with =T inspect. Junk tags can be removed with the delete command. Tag comments can be modified with edit. Check that the creation date of tags matches what you see in the source repository; this is the easiest way to spot when one has been attached to the wrong commit, something that can be manually fixed by editing its from field.

Version 2.x and later of reposurgeon have a new merge command specifically for performing branch merges. The edit command will also allow you to add a parent mark to a commit.

One minor feature you lose in moving from CVS or Subversion to a DVCS is keyword expansion. You should go through the last revision of the code and remove $Id$, $Date$, $Revision$, and other keyword cookies lest they become unhelpful fossils. A command like grep -R '\$[A-Z]' . may be helpful.
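A somewhat tighter search than a bare dollar-sign match is to look for the known keyword names, in either collapsed or expanded form. This is a sketch under assumptions: the keyword list is a common subset rather than an exhaustive one, and the file names and the find_keyword_files helper are invented.

```shell
#!/bin/sh
# Sketch: list files that still contain RCS-style keyword cookies.
# File names and contents are invented for illustration.

tree=$(mktemp -d)
printf '/* $Id: main.c 1234 2011-10-25 15:11:09Z fred $ */\n' >"$tree/main.c"
printf '# $Revision$\n' >"$tree/Makefile"
printf 'cost: $5 per unit\n' >"$tree/NOTES"

# -l prints only file names; the backslash makes "$" literal.
find_keyword_files() {
    grep -R -l -E '\$(Id|Date|Revision|Author|Header|Log|Source)(:[^$]*)?\$' "$1" | sort
}

found=$(find_keyword_files "$tree")
echo "$found"
rm -rf "$tree"
```

Naming the keywords avoids false hits on ordinary dollar signs, such as the price in the NOTES sample file.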

After conversion of a branchy repository, look to see if there is a root branch. If there are any commits with a sufficiently pathological structure that reposurgeon can’t figure out what branch they belong to, they’ll wind up there. Certain odd combinations of Subversion branch creation and deletion operations may do this, producing spurious deleteall commits; the results have to be garbage-collected by hand.

It’s good practice to leave a commit in the stream noting the date and time of the repo lift. See the next section on conversion comments for discussion.

Experiments with reposurgeon suggest that git import doesn’t try to pack or otherwise optimize for space when it populates a repo from a dump file; this produces large repositories. Running git repack and git gc --aggressive can slim them down quite a lot.

Conversion comments

Sometimes, in converting a repository, you may need to insert an explanatory comment - for example, if metadata is garbled or missing and you need to point to that fact.

It’s helpful for repository-browsing tools if there is a uniform syntax for this that is highly unlikely to show up in repository comments. I recommend enclosing translation notes in [[ ]]. This has the advantage of being visually similar to the [ ] traditionally used for editorial comments in text.

It is good practice to include, in the root commit of the repository, a note dating and attributing the conversion work and explaining these conventions. Example:

[[This repository was converted from Subversion to git on 2012-10-24
by Eric S. Raymond <esr@thyrsus.com>.  Here and elsewhere, conversion
notes are enclosed in double square brackets. Junk commits generated
by cvs2svn have been removed, commit references have been mapped into
a uniform VCS-independent syntax, and some comments edited into
summary-plus-continuation form.]]

You should also, as previously noted, leave a tag in the normal commit sequence noting the switchover. You can do this with the mailbox --create command; see the reposurgeon manual page for details and an example.

Nonsurgical cleanup steps

You’ll want to run through the repository removing CVS and Subversion keyword-expansion headers. "grep -R '\$[A-Z]' ." will turn these up; the single quotes keep the shell from interpreting the dollar sign. Note that if you’ve been relying on these to supply version strings that are visible at runtime, you will need to supply that information in some different way.

A step that too often gets missed and then inelegantly patched in later is converting the declarations that tell the version-control system to ignore derived files. reposurgeon does this for you if you’re using it for CVS- or Subversion-to-git conversion, both expressing Subversion svn:ignore properties as .gitignore files and lifting .cvsignore files to .gitignore files; see the LIMITATIONS AND GUARANTEES section on its manpage if other DVCSes are involved.

Under versions of reposurgeon before 3.11 (August 2014) explicit .gitignore files in Subversion repositories were preserved and could interfere with .gitignore files generated from svn:ignore properties during the conversion. Under 3.11 and later the assumption is that these were created by git-svn users ad hoc and should be discarded; it is up to the human doing the conversion to look through them and rescue any ignore patterns that should be merged into the converted repository. This behavior can be reversed with the --user-ignores option, which simply passes through .gitignore files.

Recovering from errors

Occasionally you’ll discover problems with a conversion after you’ve pushed it to a project’s hosting site, typically to a bare repo that the hosting software created for you. Here’s how to cope:

  1. Do your surgery on a copy of the repo with its .git/config pointing to the public location.

  2. Warn the public repo’s users that it is briefly going out of service and they will need to re-clone it afterwards!

  3. Ensure that it is possible to force-push to the repository. How you do this will vary depending on your hosting site.

  4. On gitlab.com, under Settings, there is a "Protected Branches" item you can use. If you unprotect a branch, you can force-push to it.

    Elsewhere, you may be able to re-initialize the public repo (this works, for example, on SourceForge). You’ll need ssh access to the bare repo directory on the host - let’s suppose it’s myproject. Pop up to the enclosing directory and do this:

        mv myproject myproject-hidden
        rm -fr myproject-hidden/*
        git init --bare myproject-hidden
        mv myproject-hidden myproject

    The point of doing it this way is (a) so you never actually remove myproject (on many hosts you will not have create permissions in the enclosing directory), and (b) so no user can update the repo while you’re clearing it (mv is atomic).

    After re-initializing, you should be able to run git push to push the new history up to the public repo.

  5. From your modified local repo, try

         git push --mirror --force

    to push the new history up to the public repo.

  6. Inform the public repo’s users that it is available and remind them that they will need to re-clone it.

Step Five: Client Tools

Developers who are already git fans and know how to use a git client will, of course, have no particular trouble using a git repository.

Windows users accustomed to working through TortoiseSVN can move to TortoiseGIT.

Developers who like hg can use the hg-git mercurial plugin. There is an Ubuntu package "mercurial-git" for this, and other distributions are likely to carry it as well.
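Once the plugin is installed, it has to be enabled in the user's Mercurial configuration. A minimal sketch follows; it writes to a scratch file so that it is safe to run as-is, whereas in real use the target would normally be ~/.hgrc. The clone URL in the comment is hypothetical.

```shell
#!/bin/sh
# Sketch: enable hg-git for one user. Writes to a scratch file here;
# in real use the target would normally be ~/.hgrc.

hgrc=$(mktemp)
cat >>"$hgrc" <<'EOF'
[extensions]
hggit =
EOF

cat "$hgrc"
# With the extension enabled, a git repository can be cloned directly, e.g.:
#   hg clone git://git.example.org/myproject.git
```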

There are some hg-git limitations to be aware of. In order to simulate git behavior, hg-git keeps some local state in the .hg directories: a map from git branch names to Mercurial commits, a list of Mercurial bookmarks describing git branches (which have bookmark-like behavior different from a Mercurial named branch), and a file mapping git SHA1 hashes to hg SHA1 hashes (both systems use them as commit IDs). The problem is that hg doesn’t copy any of this local state when it clones a repo, so clones of hg-git repos lose their git branches and tags.

If you have developers attached to the CVS interface, it is possible (and in fact relatively easy) to set up a gateway interface that lets them continue using their CVS client tools. Consult the documentation for git-cvsserver.

Step Six: Good Practice

Since the object of this exercise is to support both git and hg fans, both groups need to use the repo in a way that doesn’t assume the other group will understand artifacts (like commit hashes) that are specific to either VCS.

Being careful about this has an additional benefit. Someday your project may need to change VCSes yet again; on that day, it will be extremely helpful if nobody has to try to convert years' or decades' worth of VCS-specific magic cookies in the history.

Educate your developers in the following good practices:

Commit references

The combination of a committer email address with an ISO8601 timestamp is a good way to refer to a commit without being VCS-specific. Thus, instead of "commit 304a53c2", use "2011-10-25T15:11:09Z!fred@foonly.com". I recommend that you not vary from this format, even in trivial ways like omitting the Z or changing the T or !. Making these cookies uniform and machine-parseable will have good consequences for future repository-browsing tools. The reference-lifting code in reposurgeon generates them.
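Machine-parseability is easy to demonstrate. The sketch below composes a reference and checks one against the exact format given above; the function names are hypothetical and the email pattern is intentionally loose, so treat it as an illustration rather than a definitive validator.

```shell
#!/bin/sh
# Sketch: build and validate VCS-independent commit references.
# Function names are hypothetical; only the format comes from the text.

# Compose a reference from a UTC timestamp and a committer address.
make_commit_ref() {
    printf '%s!%s\n' "$1" "$2"
}

# Succeed only if the argument is exactly <ISO8601 UTC timestamp>!<email>.
is_commit_ref() {
    printf '%s\n' "$1" |
        grep -qE '^[0-9]{4}-[0-9]{2}-[0-9]{2}T[0-9]{2}:[0-9]{2}:[0-9]{2}Z![^@[:space:]]+@[^@[:space:]]+$'
}

ref=$(make_commit_ref 2011-10-25T15:11:09Z fred@foonly.com)
echo "$ref"
is_commit_ref "$ref" && echo "well-formed"
is_commit_ref '2011-10-25 15:11:09!fred@foonly.com' || echo "rejected"
```

The second check fails because the T and the Z are missing, which is exactly the kind of trivial variation the text warns against.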

Sometimes it’s enough to quote the summary line of a commit. So, instead of "Back out most of commit 304a53c2", you might write "Back out Attempted divide-by-zero fix.".

When appropriate, "my last commit" is simple and effective.

Comment summary lines

As previously noted, git and hg both want comments to begin with a summary line that can stand alone as a short description of the change; this may optionally be followed by a separating blank line and details in whatever form the commenter likes.

Try to end summary lines with a period. Ending punctuation other than a period should be used to indicate that the summary line is incomplete and continues after the separator; "…" is conventional.

For best results, stay within 72 characters per line. Don’t go over 80.

Good comment practice produces more readable output from git log and hg log, and makes it easy to take in whole sequences of changes at a glance.
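These conventions are simple enough to check mechanically. The following is a minimal sketch of such a check, which could be adapted into a git commit-msg hook (git passes the message file path as the first argument); the check_message name and the sample messages are invented, and the limits are the 72 and 80 character figures given above.

```shell
#!/bin/sh
# Sketch: check a commit message file against the conventions above.

check_message() {
    awk '
        NR == 1 && length($0) > 72 { print "summary line longer than 72 characters"; bad = 1 }
        NR == 2 && $0 != ""        { print "second line should be blank"; bad = 1 }
        length($0) > 80            { print "line " NR " longer than 80 characters"; bad = 1 }
        END { exit bad + 0 }
    ' "$1"
}

good_msg=$(mktemp)
printf 'Fix divide-by-zero in rate computation.\n\nThe interval could be zero on the first sample.\n' >"$good_msg"
bad_msg=$(mktemp)
printf 'Fix divide-by-zero in rate computation\nand also tidy up.\n' >"$bad_msg"

check_message "$good_msg" && echo "good message accepted"
check_message "$bad_msg" || echo "bad message rejected"
rm -f "$good_msg" "$bad_msg"
```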

Revision history

1.0 (2011-10-25)

Original version.

2.0 (2011-11-04)

Much more about CVS-to-git conversion, including recommending git cvsimport. I started numbering versions at this point.

2.1 (2011-11-07)

Updated for reposurgeon 1.7.

2.2 (2011-11-10)

Updated for reposurgeon 1.8.

2.3 (2011-11-10)

Fix incorrect assertion about newer versions of git handling properties; this was a failure in my testing.

2.4 (2011-11-16)

Add section on post-surgical cleanup: moving ignores, removing keyword expansions.

2.5 (2011-11-25)

Fix typos and note the existence of git-remote-hg.

2.6 (2012-11-02)

reposurgeon can read Subversion repos now, making earlier conversion tools obsolete.

2.7 (2012-11-03)

Add a link to the generic conversion makefile.

2.8 (2012-11-04)

Title change, cleanup, and a Step Zero section.

3.0 (2012-11-05)

Get serious about capturing the workflow in the Makefile.

3.1 (2012-11-18)

It’s a good idea to run make compare.

3.2 (2012-12-05)

Add hints on other systems.

3.3 (2012-12-19)

Update for reposurgeon 2.10.

3.4 (2012-12-20)

Update for reposurgeon 2.11.

3.5 (2013-01-09)

Update for reposurgeon 2.13 and the graph command.

3.6 (2013-01-22)

Update for reposurgeon 2.15 and cvs-fast-export.

3.7 (2013-04-01)

Note that reposurgeon is significantly faster under pypy.

3.8 (2013-11-15)

Remove an obsolete paragraph.

3.9 (2013-12-11)

Incorporate the report that git now does hg remotes.

3.10 (2014-02-16)

Minor changes for 3.0 syntax.

3.11 (2014-02-18)

More about post-conversion sanity checking.

3.12 (2014-08-12)

Merging SVN .gitignore files.

3.13 (2014-10-26)

Note that git-cvsserver exists.

3.14 (2014-11-05)

Mention cvsconvert.

3.15 (2015-05-27)

A note on Google Code.

4.0 (2015-05-31)

Converted to asciidoc and merged into the reposurgeon distribution. Improved advice about force-pushing. Simplified conversion procedure. No longer recommending comparison of Subversion with a git-svn translation; it’s too flaky and limited for that to be a good idea. Add recommendation to create a synthetic conversion tag. Describe differences between SVN and DVCS merging models in detail.

4.1 (2015-08-26)

Point at the repomapper tool where appropriate.

4.2 (2015-09-03)

More about detecting and fixing misplaced tags in Subversion conversions.

4.3 (2016-01-23)

Mention use of repocutter to avoid --nobranch lifts.

4.4 (2017-03-06)

BitKeeper conversions are now supported.