There are many tools for converting repositories between version-control systems out there. This file explains why reposurgeon is the best of breed by comparing it to the competition.
The problems other repository-translation tools have come from ontological mismatches between their source and target systems - models of changesets, branching and tagging can differ in complicated ways. While these gaps can often be bridged by careful analysis, the techniques for doing so are algorithmically complex, difficult to test, and have ugly edge cases.
Furthermore, doing a really high-quality translation often requires human judgment about how to move artifacts - and what to discard. But most lifting tools are, unlike reposurgeon, designed as run-it-once batch processors that can only implement simple and mechanical rules.
Consequently, most repository-translation tools evade the harder problems. They produce a sort of pidgin rendering that crudely and partially copies the history from the source system to the target without fully translating it into native idioms, leaving behind metadata that would take more effort to move over or leaving it in the native format for the source system.
But pidgin repository translations are a kind of friction drag on future development, and are just plain unpleasant to use. So instead of evading the hard problems, reposurgeon provides a power assist for a human to tackle them head-on.
Here are some specific symptoms of evasion that are common enough to deserve tags for later reference.
LINEAR: One very common form of evasion is only handling linear histories.
NO_IGNORES: There are many different mechanisms for ignoring files - .cvsignore, Subversion svn:ignore properties, .gitignore and their analogues. Many older Subversion repositories still have .cvsignore files in them as relics of CVS prehistory that weren’t translated when the repository was lifted. Reposurgeon, on the other hand, knows these can be changed to .gitignore files and does it.
NO_TAGS: Many repository translators cannot generate annotated tags (or their non-git equivalents) even when that would be the right abstraction in the target system.
CONFIGURATION: Another common failing is for repository-translation tools to require a lot of configuration and ceremony before they can operate. Often, for example, tools that translate from Subversion repositories require you to declare the repository’s branch structure every time even though sensible defaults and a bit of autodetection could have avoided this.
MIXEDBRANCH: Yet another case usually handled poorly (in translators that handle branching) is mixed-branch commits. In Subversion it is possible (though a bad idea) to commit a changeset that modifies multiple branches at once. All sufficiently old Subversion repositories have these, often by accident. The proper thing to do is split these up; the usual thing is to assign them to one branch and leave them omitted from the others.
Version references in commit comments. It is not uncommon to see a lot of references that are no longer usable embedded in translated repositories like fossils in geological strata - file-version numbers like '1.2' in Subversion repos that had a former life in CVS, Subversion references like 'r1234' in git repositories, and so forth. There’s no tag for this because tools other than reposurgeon generally have no support at all for lifting these.
Problems with specific tools
To avoid repetitive text in these descriptions, I use the following additional bug tags:
CONFIGURATION: Requires elaborate configuration even for cases that ought to be simple.
ABANDONED: Effectively abandoned by its maintainer. Some tools with this tag are still nominally maintained but have not been updated or released in years.
NO_DOCUMENTATION: Poorly (if at all) documented.
!FOO means the tool is known not to have problem FOO.
?FOO means I have not tried the tool but have strong reason to suspect the problem is present based on other things I know about it.
You should assume that none of these tools do reference-lifting.
Just after the turn of the 21st century, when Subversion was the new thing in version control, most projects that were using version control were using CVS, and cvs2svn was about the only migration path.
Early cvs2svn had problems on every level, only some of which have been fixed by more recent releases. It tended to spew junk commits into the translated history, and produced strange combinations of Subversion internal operations that most later translation tools would cope with only poorly. Sometimes the resulting translations are actually malformed; more often they contains noisy commits or commit duplications that made little sense under Subversion and make even less under the new target system.
!LINEAR, ?MIXEDBRANCH, DOCUMENTATION
Formerly named parsecvs. Originally written by Keith Packard to port the X.org repositories, which it did a good job on. Now maintained by me; reposurgeon uses it to read CVS repositories. It is extremely fast and can thus be productively used even on huge repositories.
!ABANDONED, !LINEAR, !NO_IGNORES, !DOCUMENTATION, !CONFIGURATION
Don’t use this. Just plain don’t. I maintained version 3.x until I end-of-lifed it in favor of cvs-fast-export due to fundamental, unfixable problems. It gets branch topology wrong in ways that are difficult to detect.
git-svn, the Subversion converter in the git distribution, is really designed to be a two-way live gateway enabling git users to push and pull commits from a Subversion server. It operates by creating a git repository that is effectively a local mirror of the Subversion history, then performing Subversion client commands to synchronize the two in a git-like way.
This choice of mission means that git-svn’s translation of history into git uses a compromise between Subversion idioms and git ones that is more designed to make transactions back to the Subversion server easy and safe to generate than it is to make full use of the git capabilities that Subversion doesn’t have. This is pidgin translation for a reason better than laziness or failure of nerve, but it’s still pidgin.
Worse, git-svn has bugs that severely compromise it for full translations. It tends to stumble over common repository malformations in Subversion, producing history damage that is significant but evades superficial scrutiny. I have written about this problem in detail at Don’t do svn-to-git repository conversions with git-svn!
For a straight linear history with no tags or branches, the difference between git-svn’s Subversion-emulating behavior and the way a git repository would most naturally be structured is minimal. But for conformability with Subversion, git-svn cannot (practically speaking) use git’s annotated-tag facility in the local mirror; instead, Subversion tags have to be represented in the local mirror as git branches even if they have no changes after the initial branch copy.
Another thing the live-gatewaying use case prevents is reference-lifting. Subversion references like "r1234" in commit comments have to be left as-is to avoid creating pain for users of the same Subversion remote not going through git-svn.
git-svn is used by both the Google Code exporter and GitHub importer web services. Depending on these services is not recommended.
!ABANDONED, MIXEDBRANCH, NO_TAGS, NO_IGNORES.
Formerly part of the git suite; what they had before git-svn, and inferior to it. Among other problems, it can only handle Subversion repos with a "standard" trunk/tags/branches layout. Now deprecated.
MIXEDBRANCH, NO_TAGS, NO_IGNORES, ABANDONED.
A trivial wrapper around git-svn. All the reasons not to use git-svn apply to it as well.
MIXEDBRANCH, NO_TAGS, NO_IGNORES, !ABANDONED.
svn-fe was a valiant effort to produce a tool that would dump a Subversion repository history as a git fast-import stream. It made it into the git contrib directory and lingers there still.
LINEAR, NO_TAGS, NO_IGNORES, ABANDONED.
Tailor aimed to be an any-to-any repository translator.
LINEAR, ?NO_IGNORES, ABANDONED.
This is a Subversion-to-git tool that was written to handle some cases that git-svn barfs on (but reposurgeon doesn’t - the reposurgeon test suite contains a case sent by agito’s author to check this). It even handles mixed-branch commits correctly.
!LINEAR, !NO_TAGS, !MIXEDBRANCH, CONFIGURATION.
If you cannot use reposurgeon for some reason, this is one of the best alternatives.
svn2git (jcoglan/nirvdrum version)
A batch-conversion wrapper around git-svn that creates real tag objects. This is the one written in Ruby.
!ABANDONED, !NO_TAGS, NO_IGNORES.
If you cannot use reposurgeon for some reason, this is another alternative that is not too horrible. But beware of possible history damage if your Subversion repo has malformations that confuse git-svn.
svn2git (Schemenauer version)
Native Python. More a proof of concept than a production tool.
LINEAR, NO_TAGS, NO_IGNORES, NO_DOCUMENTATION, ABANDONED.
svn2git (Nyblom version)
Written in C++. Says it’s based on svn-fast-export by Chris Lee. Not easy to figure out what it actually does, as there is no documentation at all and no test cases. May be genetically related to svn-all-fast-export, but if so they diverged in 2008.
Written in C. More a proof of concept than a production tool.
LINEAR, NO_TAGS, NO_IGNORES, NO_DOCUMENTATION, ABANDONED.
Written in C. Documentation is so lacking that there isn’t even a README. However, it’s possible to deduce what isn’t there by reading the code.
LINEAR, NO_TAGS, NO_IGNORES, NO_DOCUMENTATION.
May be genetically related to the Nyblom svn2git, but if so they diverged in 2008.
LINEAR, NO_TAGS, NO_IGNORES, NO_DOCUMENTATION, ABANDONED.
Nearly unique for this category of software in being closed-source. Beyond an evaluation period, users have to register, possibly for a cost (it’s supposed to be free-of-charge for certain uses: open source projects, education, and ``startups'' — history with BitKeeper shows that these arrangements should not be trusted).
The intended outcome of this program is to provide a server with support for both Subversion and Git users to interact at once. This may be of little value overall, as new developers are frequently unfamiliar with Subversion (and old ones forget the usage patterns!), fundamental differences in design of the two VCSes interfering with the quality of both views, and increased confusion with preferred modes of contribution arise.
The quality of SubGit’s conversion is rather poor. It fails to properly translate at least half of the reposurgeon *.svn regression tests, even some of the simpler ones - although trickier cases such as agito.svn it does translates correctly. Large real-world Subversion repos will exhibit multiple issues that SubGit may, silently or otherwise, trip over.
This program will forever contain compromises for the same reasons git-svn does. The non-open source nature leaves little hope of having such issues repaired by skilled community members.
Atlassian’s BitBucket service relies on this for Subversion-to-Git migration. Depending on this service is not recommended.
!MIXEDBRANCH, !LINEAR, CONFIGURATION, DOCUMENTATION
reposurgeon success stories
reposurgeon has been used for successful conversion on projects including but not limited to the following. These are in rough chronological order.
- Hercules (IBM mainframe emulator)
I did this one, Subversion to hg. About ten years of history at the time, not too horribly messy.
- NUT (Network UPS Tools)
I did this one, Subversion to git. The trial by fire - it was when the Subversion dump analyzer got built. Very large old repository with lots of pathologies (there was a CVS stratum).
- Battle For Wesnoth
I did this one, Subversion to git. Very large repo, moderately complex.
- Roundup (issue tracker)
I did this one, Subversion to git (they later switched to hg). Moderate-sized Subversion repo with some very strange malformations.
I did this one, CVS to git. Simple history, pretty easy.
Two guys at Blender did this one with help from me, Subversion to git. Huge repository with a lot of nasty pathologies. The tool needed some serious optimization and feature upgrades to handle it.
I did this one, CVS to git. Rather easy as the project history was almost linear and, though very old, not huge.
CVS to git. This conversion has not yet been publicly released at time of writing (late October 2014) for complicated political reasons.
A record three layers, Bazaar over CVS over RCS. Malformations not too bad except for some unique challenges created by the RCS-to-CVS conversion, but the sheer size of the history and number of layers makes it the most complex conversion yet.
I did BitKeeper to git using a derivative of Tridge’s SourcePuller as a front end, done in early 2015. Nothing especially taxing about the reposurgeon side of things, the magic was all in the front end.
- pdfrw, playtag, pyeda, rson
Four small Subversion projects by Patrick Maupin, converted in two hours' work in May 2015. No significant difficulties. These mainly served to demonstrate that the standard conversion workflow in conversion.mk is fast and effective for a wide range of projects.
The Emacs interface for MH. Converted by Bill Wohler in late 2015. He reports that the standard conversion workflow worked fine.
CVS to git, 30 years of history with some early releases recovered from tarballs. Converted by me in late 2017. Somewhat messy due to vendor-branch issues.
SVN to git, with ancient strata of CVS and RCS. 280K commits of history back to 1987, dwarfing Emacs. Converted by myself and two core GCC developers. The 4.0 release came out of this.