How to use this manual

Everybody should read the Introduction (immediately after this section) to be sure reposurgeon is the tool you actually want.

Then read the Quick Start section (immediately following the Introduction) to get a feeling for the syntax and style of reposurgeon commands.

Assuming you’re trying to do a repository lift, you should probably continue by reading A Guide to Repository Conversion. After that, you do not need to read the rest of the manual (the command reference) straight through, but keep it handy for when you need to learn more about a particular command or group of commands.

If you’re doing something more unusual than a conversion, you’ll probably have to read through the command reference until you discover the commands you need.

Help is available within reposurgeon using the "help" command. Type "help" alone for a list of help topics.

Introduction

The purpose of reposurgeon is to enable risky operations that VCSes (version-control systems) don’t want to let you do, such as (a) editing past comments and metadata, (b) excising commits, (c) coalescing and splitting commits, (d) removing files and subtrees from repo history, (e) merging or grafting two or more repos, and (f) cutting a repo in two by cutting a parent-child link, preserving the branch structure of both child repos.

A major use of reposurgeon is to assist a human operator to perform higher-quality conversions among version-control systems than can be achieved with fully automated converters. Another application is when code needs to be removed from a repository for legal or policy reasons.

Fully supported systems (those for which reposurgeon can both read and write repositories) include git, hg, bzr, darcs, bk, RCS, and SRC. For a complete list, with dependencies and technical notes, type "prefer" at the reposurgeon prompt.

Writing to the file-oriented systems RCS and SRC has some serious limitations because those systems cannot represent all the metadata in a git-fast-export stream. They require rcs-fast-import as a back end; consult that tool’s documentation for details and partial workarounds.

Fossil repository files can be read in using the --format=fossil option of the ‘read’ command and written out with the --format=fossil option of the ‘write’ command. Ignore patterns are not translated in either direction.

SVN and CVS are supported for read only, not write. For CVS, reposurgeon must be run from within a repository directory (one with a CVSROOT subdirectory), not a checkout. Each module becomes a subdirectory in the reposurgeon representation of the change history.

Note that reposurgeon is a sharp enough tool to cut you. It takes care not to ever write a repository in an actually inconsistent state, and will terminate with an error message rather than proceed when its internal data structures are confused. However, there are lots of things you can do with it - like altering stored commit timestamps so they no longer match the commit sequence - that are likely to cause havoc after you’re done. Proceed with caution and check your work.

Also note that, if your DVCS does the usual thing of making commit IDs a cryptographic hash of content and parent links, editing a publicly-accessible repository with this tool would be a bad idea. All of the surgical operations in reposurgeon will modify the hash chains.

Please also see the notes on system-specific issues under Limitations and guarantees.

Quick start

As a motivating example, here are commands to do a Subversion-to-Git lift on a repository named project just under the current directory:

# Load the project into main memory
# Warning: this command is slow because Subversion is slow.
read project

# Map Subversion usernames to Git-style user IDs.  In a real
# conversion you'd probably have a lot more of these and you'd
# probably read them in from a separate file, not a heredoc.
authors read <<EOF
fred = Fred Foonly <fred@foonly.net> America/Chicago
jrh = J. Random Hacker <jrh@random.org> America/Los_Angeles
esr = Eric S. Raymond <esr@thyrsus.com> America/New_York
julien = Julien '_FrnchFrgg_' RIVAUD <julien@frnchfrgg.pw> Europe/Paris
db48x = Daniel Brooks <db48x@db48x.net> America/Los_Angeles
ecree = Edward Cree <ecree@solarflare.com> Europe/London

EOF

# Massage comments into Git-like form (with a topic sentence and a
# spacer line after it if there is following running text). Only
# done when the first line is syntactically recognizable as a whole
# sentence.
gitify

# Tags with the name prefix emptycommit were branch-creation commits
# in Subversion. Usually there's nothing interesting in the comment
# text, but you'll want to browse them and check.  These commands
# save one such tag and delete the rest
tag emptycommit-23 noteworthy
tag /emptycommit/ delete

# Often, your Subversion repository was a CVS repository in a past
# life. CVS creates tags named with the suffix "-root" to mark branch
# points, and cvs2svn blindly copies them even though the Subversion
# tools don't need that marker. This clutter can be tossed.
tag /-root$/ delete

# This command illustrates how to use msgin to modify the comment
# text of a commit. In this test we're patching a Subversion revision
# reference because we're going to want to reference-lift it later.
# But this capability could also be used, for example, to add an
# update note to a commit comment that turned out to be incorrect.
msgin <<EOF
Legacy-ID: 23
Check-Text: Referring back to r2.

Referring back to [[SVN:2]].
EOF

# Change cookies like [[SVN:2]] into action stamps that are independent
# of the VCS in use. A typical action stamp looks like this:
# <jrh@random.org!2006-08-14T02:34:56Z>
references lift

# Sometimes it's useful to drop files from the repo that should
# never have been checked in.
1,$ expunge :documents/.*.pdf:

# Process GNU Changelogs to get better attributions for changesets.
# When a commit was derived from a patch and checked in by someone
# other than its author this can often correct the commit attribution.
changelogs

# It's good practice to add a tag marking the point of conversion.
tag cutover-git create @max(=C)
msgin <<EOF
Tag-Name: cutover-git
Tagger: J. Random Hacker <jrh@random.org> America/Los_Angeles

This tag marks the last Subversion commit before the move to Git.
EOF

# We want to write a Git repository
prefer git

# Do it
rebuild project-git

Typically you’d have these commands in a script that you evolved by experimenting until you got a conversion that suited your tastes and needs.
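The script comments above show an action stamp as an email address, a "!", and a UTC timestamp inside angle brackets. As a minimal sketch of that format only (not reposurgeon's own code), here is how such a stamp could be composed in Python. The function name action_stamp is hypothetical, and the sketch assumes the timestamp is rendered in UTC at one-second resolution, as in the example.

```python
from datetime import datetime, timedelta, timezone


def action_stamp(email: str, when: datetime) -> str:
    """Format an email address and commit time like the example stamp."""
    if when.tzinfo is None:
        raise ValueError("need a timezone-aware datetime")
    # Normalize to UTC so the stamp does not depend on the committer's zone.
    utc = when.astimezone(timezone.utc)
    return "<%s!%s>" % (email, utc.strftime("%Y-%m-%dT%H:%M:%SZ"))


# A commit made at 19:34:56 local time on 2006-08-13 in a UTC-7 zone.
local = datetime(2006, 8, 13, 19, 34, 56, tzinfo=timezone(timedelta(hours=-7)))
print(action_stamp("jrh@random.org", local))
```

Because the stamp carries only an identity and a time, it stays meaningful no matter which VCS ends up holding the history.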

A Guide to Repository Conversion

One of the main uses for reposurgeon is converting repositories between different version-control systems. In the year 2020, this usually means converting from something else to Git.

This section is a guide to up-converting your repository, and adopting practices that will reduce process friction to a minimum. It is meant to provide context for the description of reposurgeon’s features in later sections.

If you are aiming at something other than a repository conversion, you can safely skip this section.

In 90% of cases you’ll be converting from CVS or Subversion, and those are the cases we’ll discuss in detail. If you’re using something older or weirder, see the short section on other VCSes for some hints, but you’re mostly on your own.

Why convert with reposurgeon?

Reposurgeon is more difficult to use than any of the dozens of fully-automated conversion tools out there; you have to make choices and compose a recipe. This section explains why it’s worth the bother.

In brief, it’s because fully-automated converters don’t work very well. They are very poor at dealing with the ontological mismatches between the data models of different version-control systems. For detailed discussion of the technical flaws in many common converters, see Appendix A.

But even automated converters that are relatively good at bridging data-model differences tend to produce crude, jackleg, unidiomatic conversions that make the seam between the pre-conversion and post-conversion parts of the repository very obvious.

A central example of this is commit references in change comments. These references convey important information to anyone reading the comments, and it is correspondingly important to change them from using the reference format of the old system to one that is intelligible in whatever your new one is.

As another example, git has a convention about the form of change comments; they’re supposed to consist of a standalone summary line, followed optionally by a spacer blank line and running text. Git relies on this convention to produce log summaries that are easy to take in at a glance.

Older version-control systems don’t have this convention. An ideal conversion changes as many comments as possible to be in Git-like form so that the Git summary tools see the data regularity they want. But this kind of editing can’t be fully automated. The best you can hope for, if you want to do it right, is that your tool automates as much of this fixup as it can and assists a human operator in applying the rest.

Neither reference-lifting nor patching comments for Git-friendliness is a process that can be fully automated. Both require human judgment; accordingly, fully-automated converters don’t even try to do the right thing. The result is often a history that is full of unpleasant little speedbumps and distractions. These induce wasted developer effort and, correspondingly, higher defect rates.

On the other hand, a skilled operator of reposurgeon can produce a conversion that is fully idiomatic in the target system, significantly lowering future friction costs for developers browsing the history.

One fully automated reposurgeon feature of some significance that no other importer supports is that it can parse ChangeLogs in histories which use that Free Software Foundation convention, and use the attributions in them to fill in Git author fields. This recovers better information about the provenance of changesets corresponding to patches committed by a project developer (who continues to be recorded as the committer of that changeset).

Commercial Note

If you are an organization that pays programmers and has a requirement to do a repository conversion, the author can be engaged to perform or assist with the transition. You are likely to find this is more efficient than paying someone in-house for the time required to learn the tools and procedures. I (the author) have been very open about my methods here, but nothing substitutes for experience when you need it done quickly and right.

If you are wondering why it’s worth spending any money at all for a real history conversion, as opposed to just starting a new repository with a snapshot of the old head revision, the answer comes down to two words: risk management.

Suppose you do a snapshot conversion, head revision only. Then you get a regression report with a way to reproduce the problem. What you want to do is bisect in the new history to identify the revision where the bug was introduced, because knowing what the breaking change was makes a fix far easier. Bzzzt! You can’t. That history is missing in the new system.

Yes, in theory you could run a manual bisection using bracketing builds in new and old repositories. Until you have tried this, you will have no comprehension of how easy it is to get that process slightly but fatally wrong, and (actually more importantly) how difficult it is to be sure you haven’t gotten it wrong. This is the kind of friction cost that sounds minor until the first time it blows up on you and eats man-weeks of NRE.

So congratulations, tracing the bug just got an order of magnitude more expensive in engineer time, and your expected time to fix changed proportionally. It typically only takes one of these incidents to justify the up-front cost of having had the conversion done right.

If you go the snapshot-conversion route, maybe you’ll get lucky and never need visibility further back. Or maybe you’ll have a disaster because you increased the friction costs of debugging just enough that you, say, miss a critical ship date. The more experienced with in-the-trenches software development you are, the more plausible that second scenario will sound.

A subtler issue is that by losing the old change comments you have thrown away a great deal of hard-won knowledge about why your code is written the way it is. Again, this may never matter – but if it does, it’s going to bite you on the butt, hard, probably when you least expect it.

And if you’re thinking "No problem, the old repository will still be around"… heh. Repositories that have become seldom-accessed are like other kinds of dead storage in that they have a way of quietly disappearing because after a few job turnovers the knowledge of why they’re important is lost. Typically you don’t find out this has happened until you have an unanticipated urgent need, at which point whatever trouble you were in gets deeper.

Spending the relatively small amount it takes to have a proper full conversion done right is a way of bounding your downside risk. If you aren’t a software engineer and had trouble following the preceding argument, propose a snapshot conversion to the engineer you trust the most and watch that person reaching for a diplomatic way to tell you it’s a stupid, shortsighted idea.

Step Zero: Preparation

Make sure the tools in the reposurgeon suite (especially reposurgeon and repotool) are on your $PATH.

Create a scratch directory for your conversion work.

Run "repotool initialize" in the scratch directory. This will create a Makefile designed to sequence your conversion, and an empty lift script. Then set the variables near the top appropriately for your project.

This Makefile will help you avoid typing a lot of fiddly commands by hand, and ensure that later products of the conversion pipeline are always updated when earlier ones have been modified or removed.

Later, you will put your custom commands in the lift script file. Doing this helps you not lose older steps as you experiment with newer ones, and it documents what you did.

Doing a high-quality repository conversion is not a simple job, and the odds that you will get it perfectly right the first time are close to zero. By packaging your lift commands in a repeatable script and using the Makefile to sequence repetitive operations, you will reduce the overhead of experimenting.

In the rest of the steps we describe below, when we write "make foo" that means the step can be sequenced by the "foo" production in the Makefile. Replace $(PROJECT) in these instructions with your project name.

You may find it instructive to type "make -n" to see what the entire conversion sequence will look like.

Step One: The Author Map

Subversion and CVS identify users by a Unix login name local to the repository host; DVCSes use pairs of fullnames and email addresses. Before you can finish your conversion, you’ll need to put together an author map that maps the former to the latter; the Makefile assumes this is named $(PROJECT).map. The author map should specify a full name and email address for each local user ID in the repo you’re converting. Each line should be in the following form:

foonly = Fred Foonly <foonly@foobar.com>

You can optionally specify a third field that is a timezone description, either an ISO8601 offset (like "-0500") or a named entry in the Unix timezone file (like "America/Chicago"). If you do, this timezone will be attached to the timestamps on commits made by this person.
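To make the line format concrete, here is a minimal Python sketch of a checker for author-map lines. It is an illustration written for this guide, not the parser reposurgeon uses; the names AuthorEntry, parse_author_line and timezone_kind are hypothetical, and the regular expressions encode only what is described above (login, equals sign, full name, address in angle brackets, optional timezone field).

```python
import re
from typing import NamedTuple, Optional


class AuthorEntry(NamedTuple):
    login: str
    fullname: str
    email: str
    timezone: Optional[str]


# login = Full Name <email> [timezone]
_LINE = re.compile(r"^\s*([^\s=]+)\s*=\s*([^<>]*?)\s*<([^<>\s]+)>\s*(\S+)?\s*$")
# A numeric offset such as -0500 or +0100.
_OFFSET = re.compile(r"^[+-]\d{4}$")


def parse_author_line(line: str) -> AuthorEntry:
    """Split one author-map line into its fields, or raise ValueError."""
    m = _LINE.match(line)
    if m is None:
        raise ValueError("malformed author-map line: %r" % line)
    login, fullname, email, tz = m.groups()
    return AuthorEntry(login, fullname, email, tz)


def timezone_kind(tz: Optional[str]) -> str:
    """Classify the optional third field as 'none', 'offset' or 'named'."""
    if tz is None:
        return "none"
    return "offset" if _OFFSET.match(tz) else "named"


entry = parse_author_line("foonly = Fred Foonly <foonly@foobar.com> America/Chicago")
print(entry.login, entry.email, timezone_kind(entry.timezone))
```

Running something like this over a draft map before the lift is a cheap way to catch stray angle brackets or missing addresses.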

Using the generic Makefile for Subversion, "make stubmap" will generate a start on an author-map file as $(PROJECT).map. Edit in real names and addresses (and optionally offsets) to the right of the equals signs.

How best to get this information will vary depending on your situation.

  • If you can get shell access to the repository host, looking at /etc/passwd will give you the real name corresponding to each username of a developer still active: usually you can simply append @ and the repository hostname to each username to get a valid email address. You can do this automatically, and merge in real names from the password file, using the 'repomapper' tool from the reposurgeon distribution.

  • If the repository is owned by a project on a forge site, you can usually get the real name information through the Web interface; try looking for the project membership or developer’s list information.

  • If the project has a development mailing list, posting your incomplete map with a request for completions often gives good results.

  • If you can download the archives of the project’s development mailing list, grepping out all the From addresses may suggest some obvious matches with otherwise unknown usernames. You may also be able to get timezone offsets from the date stamps on the mail. The repomapper tool can mine matching addresses from mailbox files automatically, though it does not extract timezones.
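The first and last approaches above can be sketched in a few lines of Python. This is only an illustration of the idea, not a substitute for repomapper; the function names are hypothetical, the password-file handling assumes the usual colon-separated layout with the real name first in the fifth (GECOS) field, and the mail handling assumes you already have the raw text of each message.

```python
from email import message_from_string
from email.utils import parseaddr


def stub_from_passwd(passwd_lines, hostname):
    """Yield author-map lines built from /etc/passwd-style records."""
    for line in passwd_lines:
        fields = line.rstrip("\n").split(":")
        if len(fields) < 5:
            continue  # not a passwd record
        login = fields[0]
        # The GECOS field often holds "Full Name,office,phone,...".
        fullname = fields[4].split(",")[0]
        yield "%s = %s <%s@%s>" % (login, fullname, login, hostname)


def from_addresses(raw_messages):
    """Collect (name, address) pairs from the From headers of raw messages."""
    seen = set()
    for raw in raw_messages:
        name, addr = parseaddr(message_from_string(raw).get("From", ""))
        if addr:
            seen.add((name, addr.lower()))
    return sorted(seen)


passwd = ["fred:x:1000:1000:Fred Foonly,,,:/home/fred:/bin/sh\n"]
print(list(stub_from_passwd(passwd, "foonly.net")))

mail = ["From: Fred Foonly <Fred@Foonly.net>\nSubject: patch\n\nSee attached.\n"]
print(from_addresses(mail))
```

Either output is only a starting point; addresses guessed from a hostname still need to be confirmed by the people they belong to.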

If you are converting the repository for an open-source project, it is good courtesy and good practice after the above first step to email the contributors and invite them to supply a preferred form of their name, a preferred email address to be used in the mapping, and a timezone offset. The reason for this is that some sites, like OpenHub, aggregate participation statistics (and thus, reputation) across many projects, using developer name and email address as a primary key.

Your authors file does not have to be final until you ship your converted repo, so you can chase down authors' preferred identifications in parallel with the rest of the work.

Step Two: Conversion

Install whatever front end reposurgeon needs to read your repository. That will be cvs-fast-export for CVS, or the Subversion tools themselves for Subversion.

The generic-workflow Makefile will call reposurgeon for you, interpreting your $(PROJECT).lift file, when you type "make". You may have to watch the baton spin for a few minutes. For very large repositories it could be more than a few minutes.

This will convert your repository to git. If you need to export to something else, reposurgeon has write support for a couple of other modern VCSes.

CVS

If you are exporting from CVS, it may be a good idea to run some trial conversions with cvsconvert, a wrapper script shipped with cvs-fast-export. This script runs a conversion direct to git; the advantage is that it can do a comparison of the repository histories and identify problems for you to fix in your lift script.

A CVS repository normally consists of a set of module subdirectories and a CVSROOT directory containing metadata. If yours has more than one module, see this important caveat.

Problems in CVS conversions generally arise from the fact that CVS’s data model doesn’t have real multi-file changesets, which are the fundamental unit of a commit in DVCSes. It can be difficult to fully recover changesets from what are actually large numbers of single-file changes flying in loose formation - in fact, old CVS operator errors can sometimes make it impossible. Bad tools silently propagate such damage forward into your translation. Good tools, like cvs-fast-export and reposurgeon, warn you of problems and help you recover.

There are two kinds of non-serious CVS conversion glitches: file content mismatches due to keyword fields in masters, and 'zombie' files deleted in CVS that get resurrected in Git revisions associated with tags.

You can spot content mismatches due to keyword expansion easily. They will produce single-line diffs of lines containing dollar signs surrounding keyword text. Because binary files can be corrupted by keyword expansion, cvs-fast-export behaves like cvs -kb mode and does no keyword expansion of its own.
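If you have many such diffs to triage, a small filter can separate keyword-only differences from real ones. The Python sketch below is an assumption-laden illustration: the function names are hypothetical, and the keyword list covers the standard RCS keywords but may not include every keyword a given CVS installation expands.

```python
import re

# RCS/CVS keywords, in collapsed ($Id$) or expanded ($Id: ... $) form.
_KEYWORD = re.compile(
    r"\$(Author|Date|Header|Id|Locker|Log|Name|RCSfile|Revision|Source|State)"
    r"(?::[^$\n]*)?\$"
)


def collapse_keywords(line: str) -> str:
    """Replace every expanded keyword with its bare $Keyword$ form."""
    return _KEYWORD.sub(lambda m: "$%s$" % m.group(1), line)


def keyword_only_difference(old_line: str, new_line: str) -> bool:
    """True if two differing lines become identical once keywords collapse."""
    return (old_line != new_line
            and collapse_keywords(old_line) == collapse_keywords(new_line))


cvs_side = "/* $Id: foo.c,v 1.3 2001/01/01 12:00:00 esr Exp $ */"
git_side = "/* $Id$ */"
print(keyword_only_difference(cvs_side, git_side))
```

Any differing line pair for which this returns false deserves a closer look by hand.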

Manifest mismatches on tags are most likely to occur on files which were deleted in CVS but persist under later tags in the Git conversion. You can bet this is what’s going on if, when you search for the pathname in the CVS repository, you find it in an Attic directory.
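The Attic check is easy to script. The following Python sketch assumes the conventional CVS layout, in which the master for a deleted file lives in an Attic subdirectory next to where it used to be and carries a ",v" suffix; the function names and the example paths are hypothetical.

```python
import os


def attic_master(cvs_module_dir: str, relpath: str) -> str:
    """Return the path where a deleted file's RCS master would sit in an Attic."""
    dirname, basename = os.path.split(relpath)
    return os.path.join(cvs_module_dir, dirname, "Attic", basename + ",v")


def is_in_attic(cvs_module_dir: str, relpath: str) -> bool:
    """True if the repository holds an Attic master for this pathname."""
    return os.path.isfile(attic_master(cvs_module_dir, relpath))


# Where to look for the master of src/old.c in a module directory.
print(attic_master("/cvsroot/project", "src/old.c"))
```

Feeding the pathnames from a manifest mismatch report through is_in_attic gives a quick list of likely zombies.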

These spurious reports happen because CVS does not always retain enough information to track deletions reliably and is somewhat flaky in its handling of "dead"-state revisions. To make your CVS and git repos match perfectly, you may need to add delete fileops to the conversion - or, more likely, move existing ones back along their branches to commits that predate the gitspace tag - using reposurgeon(1).

Manifest mismatches in the other direction (present in CVS, absent in gitspace) should never occur. If one does, submit a bug report.

Any other kind of content or manifest mismatch - but especially any on the master branch - is bad news and indicates either a severe repository malformation or a bug in cvs-fast-export (or possibly both). Any such situation should be reported as a bug.

Conversion bugs are disproportionately likely to occur on older branches or tags from before CVS had reliable commitids. Often the most efficient remedy is simply to delete junk branches and tags; reposurgeon(1) makes this easy to do.

Subversion

Normally reposurgeon will do branch analysis for you. On most Subversion repositories, and in particular anything with a standard trunk/tags/branches layout, it will do the right thing. (It will also cope with adventitious branches in the root directory of the repo, such as many projects use for website content.)

In very unusual cases you may need to use the "--nobranch" option. However, this has the disadvantage that you’ll have to do the branch surgery by hand at a later stage. Instead, you may be able to use the repocutter filter to transform the dump file into a version shaped right for a regular branch-sensitive lift.

To the author’s knowledge, reposurgeon is the only conversion tool that handles multibranch Subversion repositories in full generality. It can even correctly translate Subversion commits that alter multiple branches.

Performance tip: reposurgeon should analyze Subversion repositories at the rate of over 100K commits per minute, but that rate can fall off greatly on very large repositories.

Unlike CVS, Subversion repositories have real changesets, and the work in them can effectively always be mapped to equivalent DVCS commits. The parent-child relationships among commits will also translate cleanly. There is, however, a minor problem around tags, and a significant problem around merges.

The tag problem arises because Subversion tags are really branches that you’ve conventionally agreed not to commit to after the initial branch copy (that’s what the tags/ directory name conveys). But Subversion doesn’t enforce any prohibition against committing to the tag branch, and various odd things can happen if you do.

The reposurgeon analyzer tries to warn you about pathological cases, and reposurgeon gives you tools for coping with them. Unfortunately, the warnings are (unavoidably) cryptic unless you understand Subversion internals in detail.

In a DVCS, a merge normally coalesces two entire branches. Subversion has something close to this in newer versions; it’s called a "sync merge" working on directories (and is expressed as an svn:mergeinfo property of the target directory that names the source). A sync merge of a branch directory into another branch directory behaves like a DVCS merge; reposurgeon picks these up and translates them for you.

The older, more basic Subversion merge is per file and is expressed by per-file svn:mergeinfo properties. These correspond to what in DVCS-land are called "cherry-picks", which just replay a commit from a source branch onto a target branch but do not create cross-branch links.

Sometimes Subversion developers use collections of per-file mergeinfo properties to express partial branch merges. This does not map to the DVCS model at all well, and trying to promote these to full-branch merges by hand is actually dangerous. An excellent essay, Partial git merges — just say no. explores the problem in depth.

The bottom line is that reposurgeon warns about per-file svn:mergeinfo properties and then discards them for good reasons. If you feel an urge to hand-edit in a branch merge based on these, do so with care and check your work.

Other VCSes

SCCS: Use sccs2rcs to get to RCS, then follow the directions for RCS. There is a script called sccs2git on CPAN which is not recommended, as it is poorly documented and makes no attempt to group commits into changesets.

RCS: reposurgeon will read an RCS collection. It uses cvs-fast-export, which despite its name does not actually require CVS metadata other than the RCS master files that store the content.

Mercurial: reposurgeon will read a Mercurial repository. It uses hg-git-fast-import as an importer. Note that this conversion is not very well tested yet; you may want to run conversions with both the importer and the hg extractor harness and compare them.

Fossil: reposurgeon will read a Fossil repository file. It uses the native Fossil exporter, which is pretty good but doesn’t export ignore patterns, wiki events, or tickets.

BitKeeper: As of version 7.3 (and probably earlier versions under open-source licensing) BitKeeper has fast-import and fast-export subcommands, and reposurgeon now knows how to use these.

Perforce: According to this Stack Overflow answer, the magic incantation is something like git p4 clone --import-labels --detect-branches //depot/path/project@all. This will create a live Git repository capturing your Perforce history. This recipe is included for completeness; it is unknown to the author what (if any) reposurgeon cleanup operations might be required, but a skim of Perforce documentation suggests that mapping Perforce user IDs to a Git-style name/address pair will be desirable.

AccuRev: There are a couple of tools for translating AccuRev repositories to live Git repositories. Of these the most developed appears to be called "ac2git.py"; you should be able to find it with a search engine. We recommend you use this tool to get a first-pass conversion to Git, then use reposurgeon to clean up the result. The ac2git.py converter’s goal is to produce an accurate representation of a collection of AccuRev streams, as multiple disconnected git branches; while there is functionality to identify branch and merge points, actually weaving the streams into a single DAG is something best done in reposurgeon.

For other systems, see the Git wiki page on conversion tools.

Step Three: Sanity Checking

Before putting in further effort on polishing your conversion and putting it into production, you should check it for basic correctness.

Pay attention to error messages emitted during the lift. Most of these, and remedial actions to take, are described in the reposurgeon manual.

For Subversion lifts, use the "headcompare", "tagscompare" and "branchescompare" productions to compare the converted with the unconverted repository. If you didn’t use the cvsconvert wrapper for your CVS lift, these productions have a similar effect. Be aware that these operations may be extremely slow on large Subversion repositories.

The only differences you should see are those due to keyword expansion and ignore-file lifting. If this is not true, you may have found a serious bug in either reposurgeon or the front end it used, or you might just have a misassigned tag that can be manually fixed. Consult How to report bugs for information on how to usefully report bugs.

Use reposurgeon’s ‘lint’ command to find anomalies like detached branches that may need manual correction.

If you are converting from CVS, use reposurgeon’s graph command to examine the conversion, looking (in particular) for misplaced tags or branch joins. Often these can be manually repaired with little effort. These flaws do 'not' necessarily imply bugs in cvs-fast-export or reposurgeon; they may simply indicate previously undetected malformations in the history. However, reporting them may help improve cvs-fast-export.

Warning: As of September 2016, stock CVS is known to be buggy in ways which may affect checking the correctness of conversions. For best results, use a CVS version with the MirOS patches. These are carried by Debian Linux and derivatives; you can check by looking for "MirDebian" in the output of cvs --version.

Step Four: Cleanup

You should now have a git repository, but it is likely to have a lot of cruft and conversion artifacts in it. Here are some common forms of cruft:

Commit comments and attributions containing non-UTF8 data

You could have metadata in your repository in an encoding incompatible with UTF-8 (Latin-1 is the most common offender). You will probably want to transcode the repo to UTF-8.

Subversion and CVS commit references

Often Subversion references will be in the form 'r' followed by a string of digits referring to a Subversion commit number. But not always; humans come up with lots of ambiguous ways to write these. CVS commit references are even harder to spot mechanically, as they’re just groups of digits separated by dots with no identifying prefix. A clean conversion should turn all these into VCS-independent commit references, which will be described later in this document.

Multi-line contents with no summary

git and hg both encourage comments to begin with a summary line that can stand alone as a short description of the change; this practice produces more readable output from git log and hg log. For a really high-quality conversion, multi-line comments should be edited into this form.

Tags from Subversion no-fileop commits

Commits with no fileops are automatically transformed into tags when reading a Subversion repository. Other importers may generate them for various reasons; you can detect them as the =Z visibility set. You will probably want to delete these; they’re preserved just in case something about the metadata is interesting.

Branch tip deletes, deletealls, and unexpressed merges

In Subversion it is common practice to delete a branch directory when that line of development is finished or merged to trunk; this makes sense because it reduces the checkout size of the repo in later revisions. In a DVCS, deletes at a branch tip don’t save you any storage, so it makes more sense to leave the branch with all of its tip content live if you’re not going to delete it entirely. Sometimes editing a later commit to have the branch tip as a parent (creating a merge that Subversion could not express) makes sense; look for svn:mergeinfo properties as clues.

Commits generated by cvs2svn to carry tag information

These lurk in the history of a lot of Subversion projects. Sometimes these junk commits are empty (no file operations associated with them at all); sometimes they’re translated as long lists of spurious delete fileops, and sometimes they have actual file content (duplicating parent file versions, or referring randomly to file versions far older than the junk commit). Older versions of cvs2svn seem to have generated all kinds of meaningless crud into these.

Metadata inserted by git-svn

git-svn inserts lines at the end of each commit comment that refer back to the Subversion commit it is derived from. This is necessary for live-gatewaying, and useful during one-shot conversions, but you may not want it in the final repo.
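To make the first kind of cruft above concrete, here is a minimal Python sketch of how non-UTF-8 metadata can be detected and, if you know it is Latin-1, converted. This only illustrates the principle; it is not how reposurgeon's transcode command is implemented, and the function names are hypothetical.

```python
def needs_transcoding(raw: bytes) -> bool:
    """True if the bytes are not valid UTF-8."""
    try:
        raw.decode("utf-8")
    except UnicodeDecodeError:
        return True
    return False


def latin1_to_utf8(raw: bytes) -> bytes:
    """Reinterpret Latin-1 bytes as text and re-encode them as UTF-8."""
    return raw.decode("latin-1").encode("utf-8")


comment = b"Corrig\xe9 par Ren\xe9"  # Latin-1 bytes for two accented names
if needs_transcoding(comment):
    comment = latin1_to_utf8(comment)
print(comment.decode("utf-8"))
```

Note that decoding as Latin-1 never fails, whatever the bytes are, so a human still has to confirm that Latin-1 was really the original encoding.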

Here’s a checklist of cleanup steps. If you’re using the makefile generated by repotool, most of these will be done by commands in your lift script.

  1. Map author IDs from local to DVCS form.

  2. Check for leftover cvs2svn junk commits and remove them if possible.

  3. Lift references in commit comments.

  4. Massage comments into summary-line-plus-continuation form.

  5. If the project used the GNU ChangeLog convention, run "changelogs".

  6. Remove empty and delete-only tip commits where appropriate.

  7. Review generated tags, pruning and fixing locations as appropriate.

  8. Look for branch merge points and patch parent marks to make them.

  9. Fix up or remove $-keyword cookies in the latest revision.

  10. Resolve any [zombie] files in a CVS conversion by patching in D ops.

  11. If there’s a root branch, check for and remove junk commits on it.

  12. Use the transcode command to fix up metadata in non-UTF8 encoding.

  13. Run lint to detect remaining anomalies that might need to be patched.

  14. For the record, make a tag noting time and date of the repo lift.

  15. If your target was git, run git gc --aggressive.

Most of the work will be in the comment-fixup and reference-lifting stages. These normally take only a couple of hours even on very large repos with thousands of commits. An entire conversion is usually less than two days of work.

You can use the authors read command to perform the author-ID mapping operation with reposurgeon.

You can find empty commits as the =Z visibility set and clean them up with the command tagify. Consult the reposurgeon manual page for usage details.

A good way to spot junk commits is to eyeball the picture of the commit DAG created by the reposurgeon 'graph' command - they tend to stand out visually as leaf nodes in odd places. Be aware that the graph command outputs DOT, the language interpreted by the graphviz suite; you will need a DOT rendering program and an image viewer.

See the documentation of the references command for details on how to fix up Subversion and CVS changeset references in comments so they’re still meaningful.

The command =L edit is good for fixing up multi-line comments.

The reposurgeon command inspect =H will show you tip commits which may contain only deletes and deletealls.

Tags can be inspected with =T inspect. Junk tags can be removed with the delete command. Tag comments can be modified with edit. Check that the creation date of tags matches what you see in the source repository; this is the easiest way to spot when one has been attached to the wrong commit, something that can be manually fixed by editing its "from" field.

The command =I will select all commits that don’t decode to UTF-8 in both the commit comment and attribution parts. You eyeball those to figure out what the encoding is and apply the transcode command to fix things.

Reposurgeon has a merge command specifically for performing branch merges. The edit command will also allow you to add a parent mark to a commit.

One minor feature you lose in moving from SCCS, CVS, Subversion, or BitKeeper to a DVCS is keyword expansion. You should go through the last revision of the code and remove $Id$, $Date$, $Revision$, and other keyword cookies lest they become unhelpful fossils. The full Subversion set is $Date$, $Revision$, $Author$, $HeadURL$ and $Id$. CVS uses $Author$, $Date$, $Header$, $Id$, $Log$, $Revision$, also (rarely) $Locker$, $Name$, $RCSfile$, $Source$, and $State$. A command like grep -R '$[A-Z]' . may be helpful.
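A broad grep pattern can also hit unrelated text such as shell variables. As a rough illustration, here is a minimal Python sketch that looks only for the keywords listed above, in both bare and expanded form. The function name find_keyword_cookies and the sample text are hypothetical; nothing here is part of reposurgeon.

```python
import re

# Keywords taken from the Subversion and CVS lists above.
KEYWORDS = (
    "Author", "Date", "Header", "HeadURL", "Id", "Locker", "Log",
    "Name", "RCSfile", "Revision", "Source", "State",
)

# Matches both the bare form ($Id$) and the expanded form ($Id: foo.c 1.2 $).
COOKIE_RE = re.compile(r"\$(" + "|".join(KEYWORDS) + r")(?::[^$\n]*)?\$")


def find_keyword_cookies(text):
    """Return (line_number, cookie) pairs for each keyword cookie in text."""
    hits = []
    for lineno, line in enumerate(text.splitlines(), start=1):
        for match in COOKIE_RE.finditer(line):
            hits.append((lineno, match.group(0)))
    return hits


# Hypothetical file content used only for demonstration.
sample = "/* $Id: main.c 42 2012-10-24 esr $ */\nint cost = $5;\n# $Revision$\n"
for lineno, cookie in find_keyword_cookies(sample):
    print(lineno, cookie)
```

Running something like this over the last revision gives you a worklist of cookies to remove or freeze by hand.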

After conversion of a branchy repository, look to see if there is a 'root' branch. If there are any commits with a sufficiently pathological structure that reposurgeon can’t figure out what branch they belong to, they’ll wind up there.

It’s good practice to leave an annotated tag at the conversion point noting the date and time of the repo lift. See the next section on conversion comments for discussion. Here’s an example of how to make a tag:

msgin --create <<EOF
Tag-Name: git-conversion

Marks the spot at which this repository was converted from Subversion to git.

Conversion notes are enclosed in double square brackets. Junk commits
generated by cvs2svn have been removed, commit references have been
mapped into a uniform VCS-independent syntax, and some comments edited
into summary-plus-continuation form.
EOF

Experiments with reposurgeon suggest that git fast-import doesn’t try to pack or otherwise optimize for space when it populates a repo from a dump file; this produces large repositories. Running git repack and git gc --aggressive can slim them down quite a lot.

Conversion comments

Sometimes, in converting a repository, you may need to insert an explanatory comment - for example, if metadata was garbled or missing and you need to point out that fact.

It’s helpful for repository-browsing tools if there is a uniform syntax for this that is highly unlikely to show up in repository comments. Enclosing translation notes in [[ ]] is recommended; this has the advantage of being visually similar to the [ ] traditionally used for editorial comments in text.

It is good practice to include, in either the root commit of the repository or the conversion tag, a note dating and attributing the conversion work and explaining these conventions. Example:

[[This repository was converted from Subversion to git on 2012-10-24
by Eric S. Raymond <esr@thyrsus.com>.  Here and elsewhere, conversion
notes are enclosed in double square brackets. Junk commits generated
by cvs2svn have been removed, commit references have been mapped into
a uniform VCS-independent syntax, and some comments edited into
summary-plus-continuation form.]]
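To show why a uniform bracket syntax helps tools, here is a minimal Python sketch of how a repository browser might pull conversion notes out of a comment. The function name conversion_notes and the sample comment are hypothetical, and the sketch assumes notes do not nest.

```python
import re

# Non-greedy match so that two notes in one comment are found separately;
# DOTALL lets a note span several lines.
NOTE_RE = re.compile(r"\[\[(.*?)\]\]", re.DOTALL)


def conversion_notes(comment):
    """Return the text of each [[...]] note with whitespace normalized."""
    return [" ".join(m.group(1).split()) for m in NOTE_RE.finditer(comment)]


# Hypothetical commit comment; single brackets are ordinary editorial text.
comment = (
    "Fix buffer overrun [see bug 12].\n"
    "\n"
    "[[Author field was garbled in the\n"
    "Subversion history.]]\n"
)
print(conversion_notes(comment))
```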

You should also, as previously noted, leave a tag in the normal commit sequence noting the switchover. You can do this with the msgin --create command; see the reposurgeon manual page for details and an example.

Nonsurgical cleanup steps

A step that too often gets missed and then inelegantly patched in later is converting the declarations that tell the version-control system to ignore derived files. reposurgeon does this for you if you’re using it for CVS- or Subversion-to-git conversion, both expressing Subversion svn:ignore and svn:global-ignores properties as .gitignore files and lifting .cvsignore files to .gitignore files; see Limitations and guarantees if other DVCSes are involved.

Any .gitignore files found in a repository were almost certainly created by git-svn users ad hoc and should be discarded; it is up to the human doing the conversion to look through them and rescue any ignore patterns that should be merged into the converted repository. This behavior can be reversed with the --user-ignores option, which simply passes through .gitignore files.

Recovering from errors

Occasionally you’ll discover problems with a conversion after you’ve pushed it to a project’s hosting site, typically to a bare repo that the hosting software created for you. Here’s how to cope:

  1. Do your surgery on a copy of the repo with its .git/config pointing to the public location.

  2. Warn the public repo’s users that it is briefly going out of service, and they will need to re-clone it afterwards!

  3. Ensure that it is possible to force-push to the repository. How you do this will vary depending on your hosting site.

  4. On gitlab.com, under Settings, there is a "Protected Branches" item you can use. If you unprotect a branch, you can force-push to it.

    Elsewhere, you may be able to re-initialize the public repo (this works, for example, on SourceForge). You’ll need ssh access to the bare repo directory on the host - let’s suppose it’s 'myproject'. Pop up to the enclosing directory and do this:

        mv myproject myproject-hidden
        rm -fr myproject-hidden/*
        git init --bare myproject-hidden
        mv myproject-hidden myproject

    The point of doing it this way is (a) so you never actually remove myproject (on many hosts you will not have create permissions in the enclosing directory), and (b) so no user can update the repo while you’re clearing it (mv is atomic).

    Here’s a script that will do the job on SourceForge:

    #!/usr/bin/expect -f
    #
    # nuke - nuke a SourceForge repo
    #
    # usage: nuke project [userid]
    #
    
    if {$argc < 1} {
        puts "nuke: project name argument is required"
        exit 1
    } else {
        set project [lindex $argv 0]
        set user $env(USER)
        if {$argc >= 2} {
            set user [lindex $argv 1]
        }
    }
    
    set remoteprompt "bash-4.1"
    
    set timeout -1
    spawn $env(SHELL)
    match_max 100000
    send -- "ssh -t $user@shell.sourceforge.net create"
    expect -exact "ssh -t $user@shell.sourceforge.net \r create"
    send -- "\r"
    expect -exact "$remoteprompt\$ "
    send -- "cd /home/git/p/$project\n"
    expect -exact "$remoteprompt\$ "
    send -- "cd git-main.git\n"
    expect -exact "$remoteprompt\$ "
    send -- "rm -fr *\n"
    expect -exact "$remoteprompt\$ "
    send -- "git init --bare .\n"
    expect -exact "$remoteprompt\$ "

    After re-initializing, you should be able to run git push to push the new history up to the public repo.

  5. From your modified local repo, try

         git push --mirror --force

    to push the new history up to the public repo.

  6. Inform the public repo’s users that it is available and remind them that they will need to re-clone it.

On GitLab, you can get a similar effect by unprotecting all branches and doing a git push --force to unconditionally overwrite the public history. It is good practice to re-protect the branches afterwards.

Step Five: Client Tools

Developers who are already git fans and know how to use a git client will, of course, have no particular trouble using a git repository.

Windows users accustomed to working through TortoiseSVN can move to TortoiseGit.

Developers who like hg can use the hg-git Mercurial plugin. There is an Ubuntu package "mercurial-git" for this, and other distributions are likely to carry it as well.

There are some hg-git limitations to be aware of. In order to simulate git behavior, hg-git keeps some local state in the .hg directories: a map from git branch names to Mercurial commits, a list of Mercurial bookmarks describing git branches (which have bookmark-like behavior different from a Mercurial named branch), and a file mapping git SHA1 hashes to hg SHA1 hashes (both systems use them as commit IDs). The problem is that hg doesn’t copy any of this local state when it clones a repo, so clones of hg-git repos lose their git branches and tags.

If you have developers attached to the CVS interface, it is possible (and in fact relatively easy) to set up a gateway interface that lets them continue using their CVS client tools. Consult the documentation for git-cvsserver.

Step Six: Good Practice

Educate your developers in the following good practices:

Commit references

The combination of a committer email address with an ISO 8601 timestamp is a good way to refer to a commit without being VCS-specific. Thus, instead of "commit 304a53c2", use "<2011-10-25T15:11:09Z!fred@foonly.com>". It is recommended that you not vary from this format, even in trivial ways like omitting the 'Z' or changing the 'T' or '!'. Making these cookies uniform and machine-parseable will have good consequences for future repository-browsing tools. The reference-lifting code in reposurgeon generates them.

Being careful about this has an additional benefit. Someday your project may need to change VCSes yet again; on that day, it will be extremely helpful if nobody has to try to convert years' or decades' worth of VCS-specific magic cookies in the history.
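As an illustration of how machine-parseable these cookies are, here is a minimal Python sketch that formats and parses the timestamp!email form shown above, without the surrounding angle brackets. The function names are hypothetical, and the sketch accepts only the exact layout recommended here.

```python
import re
from datetime import datetime, timezone

# Exactly the recommended layout: UTC timestamp with 'T' and 'Z', then '!'.
STAMP_RE = re.compile(r"^(\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2}Z)!(\S+@\S+)$")


def action_stamp(when, email):
    """Format a timezone-aware datetime and an email address as a stamp."""
    utc = when.astimezone(timezone.utc)
    return utc.strftime("%Y-%m-%dT%H:%M:%SZ") + "!" + email


def parse_action_stamp(stamp):
    """Split a stamp into (datetime, email); raise ValueError if malformed."""
    match = STAMP_RE.match(stamp)
    if match is None:
        raise ValueError("not an action stamp: %r" % stamp)
    when = datetime.strptime(match.group(1), "%Y-%m-%dT%H:%M:%SZ")
    return when.replace(tzinfo=timezone.utc), match.group(2)


when = datetime(2011, 10, 25, 15, 11, 9, tzinfo=timezone.utc)
print(action_stamp(when, "fred@foonly.com"))
```

Rejecting near-misses, such as a space in place of the 'T', is deliberate: it is the strictness that keeps the cookies uniform.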

Sometimes it’s enough to quote the summary line of a commit. So, instead of "Back out most of commit 304a53c2", you might write "Back out Attempted divide-by-zero fix."

When appropriate, "my last commit" is simple and effective.

Comment summary lines

As previously noted, git and hg both want comments to begin with a summary line that can stand alone as a short description of the change; this may optionally be followed by a separating blank line and details in whatever form the commenter likes.

Try to end summary lines with a period. Ending punctuation other than a period should be used to indicate that the summary line is incomplete and continues after the separator; "…" is conventional.

For best results, stay within 72 characters per line. Don’t go over 80.

Good comment practice produces more readable output from git log and hg log, and makes it easy to take in whole sequences of changes at a glance.
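These conventions are easy to check mechanically, for instance in a commit hook. The following Python sketch is one possible checker; the function name comment_problems and the exact wording of its messages are hypothetical, and it treats any non-alphanumeric final character as acceptable ending punctuation.

```python
def comment_problems(comment, soft_limit=72, hard_limit=80):
    """Return a list of human-readable problems found in a commit comment."""
    problems = []
    lines = comment.rstrip("\n").split("\n")
    summary = lines[0]
    if not summary.strip():
        problems.append("empty summary line")
    elif summary.rstrip()[-1].isalnum():
        problems.append("summary line lacks ending punctuation")
    # A second line, if present, must be the blank separator.
    if len(lines) > 1 and lines[1].strip():
        problems.append("no blank line after summary")
    for number, line in enumerate(lines, start=1):
        if len(line) > hard_limit:
            problems.append("line %d exceeds %d characters" % (number, hard_limit))
        elif len(line) > soft_limit:
            problems.append("line %d exceeds %d characters" % (number, soft_limit))
    return problems


print(comment_problems("fix stuff\nmore stuff\n"))
```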

Theory of Operation

The outside view

As the quick-start example shows, you’re typically going to do three steps when you use reposurgeon: (1) read in one (or more) repositories, (2) do surgical things on them, and (3) write out one (or more) repositories.

To keep reposurgeon simple and flexible, it normally does not do its own repository reading and writing. Instead, it relies on being able to parse and emit the command streams created by git-fast-export and read by git-fast-import. This means that it can be used on any version-control system that has both fast-export and fast-import utilities. The git-fast-import stream format also implicitly defines a common language of primitive operations for reposurgeon to speak.

In order to deal with version-control systems that do not have fast-export equivalents, reposurgeon can also host extractor code that reads repositories directly. For each version-control system supported through an extractor, reposurgeon uses a small amount of knowledge about the system’s command-line tools to (in effect) replay repository history into an input stream internally. Repositories under systems supported through extractors can be read by reposurgeon, but not modified by it. In particular, reposurgeon can be used to move a repository history from any VCS supported by an extractor to any VCS supported by a normal importer/exporter pair.

Mercurial repository reading is implemented with an extractor class; writing is handled with the "hg-git-fast-import" command. A test extractor exists for git, but is normally disabled in favor of the regular exporter.

Subversion is an important exception. Its exporter is ‘svnadmin dump’, which doesn’t ship a git-fast-import stream, but rather the unique dump format supported by Subversion. Reposurgeon contains an interpreter for this stream format.

As a matter of historical interest, some old versions of reposurgeon had the ability to build a Subversion repository on output by synthesizing a Subversion dump stream and feeding it to ‘svnadmin load’. This feature was a cute stunt, but was abandoned during translation to Go for a couple of reasons. Most importantly, there is zero demand for moving histories to Subversion - and supposing there were, moving content and metadata from git’s DAG representation to a Subversion stream is very lossy. Subversion to Git to Subversion wouldn’t even have round-tripped well.

The inside view

Between reads and writes, reposurgeon can usefully be thought of as a structure editor for directed acyclic graphs with a pre-defined set of attributes on their nodes.

To get a feel for what that graph is like, it’s helpful to have seen a git-fast-import stream file. Here is a trivial example from the reposurgeon test suite, describing a history with two commits to a single file:

blob
mark :1
data 20
1234567890123456789

commit refs/heads/master
mark :2
committer Ralf Schlatterbeck <rsc@runtux.com> 0 +0000
data 14
First commit.
M 100644 :1 README

blob
mark :3
data 20
0123456789012345678

commit refs/heads/master
mark :4
committer Ralf Schlatterbeck <rsc@runtux.com> 10 +0000
data 15
Second commit.
from :2
M 100644 :3 README

A git-fast-import stream consists of a sequence of commands which must be executed in the specified sequence to build the repo; to avoid confusion with reposurgeon commands we will refer to the stream commands as events in this documentation. These events are implicitly numbered from 1 upwards. Most commands require specifying a selection of event sequence numbers so reposurgeon will know which events to modify or delete.
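To make the event numbering concrete, here is a minimal Python sketch that walks the example stream above and numbers its events. It is not reposurgeon's parser: the function name split_events is hypothetical, and it handles only blob and commit events with counted "data" payloads, which is all the example contains.

```python
def split_events(stream):
    """Return (event_number, kind) pairs for a simple fast-import stream."""
    events = []
    pos = 0
    while pos < len(stream):
        end = stream.index(b"\n", pos)
        line = stream[pos:end]
        pos = end + 1
        if line == b"blob" or line.startswith(b"commit "):
            events.append((len(events) + 1, line.split()[0].decode()))
        elif line.startswith(b"data "):
            # The count is in bytes and includes the payload's newline;
            # skip the payload without interpreting it.
            pos += int(line[5:])
    return events


STREAM = (
    b"blob\nmark :1\ndata 20\n1234567890123456789\n\n"
    b"commit refs/heads/master\nmark :2\n"
    b"committer Ralf Schlatterbeck <rsc@runtux.com> 0 +0000\n"
    b"data 14\nFirst commit.\nM 100644 :1 README\n\n"
    b"blob\nmark :3\ndata 20\n0123456789012345678\n\n"
    b"commit refs/heads/master\nmark :4\n"
    b"committer Ralf Schlatterbeck <rsc@runtux.com> 10 +0000\n"
    b"data 15\nSecond commit.\nfrom :2\nM 100644 :3 README\n\n"
)
print(split_events(STREAM))
```

In this small example the event numbers happen to coincide with the mark numbers; in general they are separate namespaces.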

For all the details of event types and semantics, see the git-fast-import(1) manual page; the rest of this paragraph is a quick start for the impatient. The most prominent events in a stream are commits describing revision states of the repository; these group together under a single change comment one or more fileops (file operations), which usually point to blobs that are revision states of individual files. A fileop may also be a delete operation indicating that a specified previously-existing file was deleted as part of the commit; there are a couple of other special fileop types of lesser importance.

Reposurgeon’s internal representation of a repository history is basically a deserialized git fast-import stream. A few extra attributes are supported; most notably, commits and resets have a legacy-id attribute that carries over the object’s ID from whatever version-control system exported the stream, in particular a Subversion or CVS revision number.

The interpreter view

The program can be run in one of two modes, either as an interactive command interpreter or in batch mode to execute commands given as arguments on the reposurgeon invocation line.

The only differences between these modes are (1) the interactive one begins by turning on the ‘interactive’ option, (2) in batch mode all errors (including normally-recoverable errors in selection-set syntax) are fatal, and (3) each command-line argument beginning with ‘--’ has that stripped off (which in particular means that --help and --version will work as expected).

Also, in interactive mode, Ctrl-P and Ctrl-N will be available to scroll through your command history, and tab completion of both command keywords and name arguments (wherever that makes semantic sense) is available.

It is expected that interactive mode will be used mainly for exploring repository metadata, while conversion experiments will be captured in a script that is gradually improved until the day final cutover can be performed and the old repository decommissioned.

Note that this means the old repository can be left in service while the conversion recipe is under development. Recipe development should be treated as a serious project with its own change tracking.

Finding your way around

In the remainder of this document, individual commands are described by hanging paragraphs led by the command sequence.

Help is always available.

help

Get help on the interpreter commands. Optionally follow with whitespace and a command name; with no argument, lists all commands. '?' also invokes this.

History is always available.

history

List the commands you have entered this session.

You can do Ctrl-P or up-arrow to scroll back through the command history list, and Ctrl-N or down-arrow to scroll forward in it.

Tab-completion on command keywords is available.

You don’t need to exit the interpreter to run quick shell commands.

shell

Execute the shell command given in the remainder of the line. '!' also invokes this.

General command syntax

Commands to reposurgeon consist of a command keyword, usually preceded by a selection set, sometimes followed by whitespace-separated arguments. It is often possible to omit the selection-set argument and have it default to something reasonable. For commands that are considered safe (no side effects) the default is all events; for risky commands the default is no events.

When a command changes repository state, it will usually so indicate in a response.

Here are some motivating examples. The commands will be explained in more detail after the description of selection syntax.

:15 edit               ;; edit the object associated with mark :15.

edit                   ;; edit all editable objects.

29..71 list            ;; list summary index of events 29..71.

236..$ list            ;; List events from 236 to the last.

<#523> inspect         ;; Look for commit #523; they are numbered
                       ;; 1-origin from the beginning of the
                       ;; repository.

<2317> inspect         ;; Look for a tag with the name 2317, a tip
                       ;; commit of a branch named 2317, or a commit
                       ;; with legacy ID 2317. Inspect what is found.
                       ;; A plain number is probably a legacy ID
                       ;; inherited from a Subversion revision
                       ;; number.

/regression/ list      ;; list all commits and tags with comments or
                       ;; committer headers or author headers
                       ;; containing the string "regression".

1..:97 & =T delete     ;; delete tags from event 1 to mark 97.

[Makefile] inspect     ;; Inspect all commits with a file op touching
                       ;; Makefile and all blobs referred to in a
                       ;; fileop touching Makefile.

:46 tip                ;; Display the branch tip that owns
                       ;; commit :46.

@dsc(:55) list         ;; Display all commits with ancestry tracing
                       ;; to :55.

@min([.gitignore]) remove .gitignore delete
                       ;; Remove the first .gitignore fileop in the
                       ;; repo.

The regular expressions should be in Golang’s format, with one exception. Due to a conflict with the use of $ for arguments in the script command, we retain Python’s use of backslashes as a leader for references to group matches.

Regular expressions are not anchored. Use ^ and $ to anchor them to the beginning or end of the search space, when appropriate.
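The effect of anchoring can be seen in a small Python sketch. Python's re module stands in here for Go's regexp package, which is an assumption of convenience; for patterns this simple the two behave alike. The comment strings are made up for the demonstration.

```python
import re

# Hypothetical one-line commit comments.
comments = [
    "regression fixed.",
    "Fix parser regression",
    "Fix regression in parser.",
]

# Unanchored: the pattern may match anywhere in the text.
unanchored = [c for c in comments if re.search(r"regression", c)]
# Anchored to the beginning or the end of the text.
at_start = [c for c in comments if re.search(r"^regression", c)]
at_end = [c for c in comments if re.search(r"regression$", c)]

print(len(unanchored), at_start, at_end)
```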

Selection syntax

A selection set is ordered; that is, any given element may occur only once, and the set is ordered by when its members were first added.

The selection-set specification syntax is an expression-oriented minilanguage. The most basic term in this language is a location. The following sorts of primitive locations are supported:

event numbers

A plain numeric literal is interpreted as a 1-origin event-sequence number. It is not expected that you will have to use this feature often.

marks

A numeric literal preceded by a colon is interpreted as a mark; see the import stream format documentation for explanation of the semantics of marks.

tag and branch names

The basename of a branch (including branches in the refs/tags namespace) refers to its tip commit. The name of a tag is equivalent to its mark (that of the tag itself, not the commit it refers to). Tag and branch locations are bracketed with < > (angle brackets) to distinguish them from command keywords.

legacy IDs

If the content of name brackets (< >) does not match a tag or branch name, the interpreter next searches legacy IDs of commits. This is especially useful when you have imported a Subversion dump; it means that commits made from it can be referred to by their corresponding Subversion revision numbers.

commit numbers

A numeric literal within name brackets (< >) preceded by # is interpreted as a 1-origin commit-sequence number.

reset targets

If the previous ways of interpreting a name within brackets don’t resolve, the name is checked to see if it matches a reset. If so, the expression resolves to the commit the reset is attached to.

reset@ names

A name with the prefix ‘reset@’ refers to the latest reset with a basename matching the part after the @. Usually there is only one such reset.

$

Refers to the last event.

These may be grouped into sets in the following ways:

ranges

A range is two locations separated by ‘..’, and is the set of events beginning at the left-hand location and ending at the right-hand location (inclusive).

lists

Comma-separated lists of locations and ranges are accepted, with the obvious meaning.
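As a rough model of how ranges and lists expand, here is a minimal Python sketch limited to plain event numbers and ‘$’. The function name resolve is hypothetical, marks and names are not handled, and tokens must not contain whitespace. It keeps first-seen order and drops duplicates, matching the ordered-set behavior described at the start of this section.

```python
def resolve(selection, last_event):
    """Expand a comma-separated list of event numbers and ranges."""

    def location(token):
        # '$' refers to the last event; anything else is an event number.
        return last_event if token == "$" else int(token)

    result = []
    for term in selection.split(","):
        if ".." in term:
            low, high = term.split("..")
            members = range(location(low), location(high) + 1)  # inclusive
        else:
            members = [location(term)]
        for event in members:
            if event not in result:
                result.append(event)
    return result


print(resolve("29..31,5", 300))
print(resolve("236..$", 238))
```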

There are some other ways to construct event sets:

visibility sets

A visibility set is an expression specifying a set of event types. It will consist of a leading equal sign, followed by type letters. These are the type letters:

B

blobs

Most default selection sets exclude blobs; they have to be manipulated through the commits they are attached to.

C

commits

D

all-delete commits

These are artifacts produced by some older repository-conversion tools.

H

head (branch tip) commits

O

orphaned (parentless) commits

U

commits with callout parents

Z

commits with no fileops

M

merge (multi-parent) commits

F

fork (multi-child) commits

L

commits with unclean multi-line comments

E.g. without a separating empty line after the first

I

commits for which metadata cannot be decoded to UTF-8

T

tags

R

resets

P

passthroughs

All event types simply passed through, including comments, progress commands, and checkpoint commands

N

Legacy IDs

Any comment matching a cookie (legacy-ID) format.
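Several of the commit type letters are purely structural, which a short Python sketch can make concrete. This is only a model of the descriptions above for O, M, F and Z: the function name type_letters and its argument layout are hypothetical, and the remaining letters need information (branch tips, comment text, encodings) that the sketch does not carry.

```python
def type_letters(parents, children, fileops):
    """Return the structural type letters that apply to one commit."""
    letters = "C"                # every commit is in =C
    if not parents:
        letters += "O"           # orphaned (parentless)
    if len(parents) > 1:
        letters += "M"           # merge (multi-parent)
    if len(children) > 1:
        letters += "F"           # fork (multi-child)
    if not fileops:
        letters += "Z"           # no fileops
    return letters


# A root commit with one child and one fileop, then an empty merge commit.
print(type_letters([], [":4"], ["M 100644 :1 README"]))
print(type_letters([":2", ":3"], [], []))
```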

references

A reference name (bracketed by angle brackets) resolves to a single object, either a commit or tag.

type interpretation

tag name

annotated tag with that name

branch name

the branch tip commit

legacy ID

commit with that legacy ID

assigned name

name equated to a selection by assign

Note that if an annotated tag and a branch have the same name foo, <foo> will resolve to the tag rather than the branch tip commit.
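The lookup order described in this section can be sketched in Python as a sequence of dictionary probes. The data model here (dictionaries mapping names to marks) and the function name resolve_name are hypothetical. Assigned names are left out because the text above does not say where they fall in the order.

```python
def resolve_name(name, tags, branch_tips, legacy_ids, resets):
    """Resolve the text inside < > by trying each namespace in turn."""
    for kind, table in (
        ("tag", tags),               # a tag wins over a branch of the same name
        ("branch", branch_tips),
        ("legacy ID", legacy_ids),
        ("reset", resets),
    ):
        if name in table:
            return kind, table[name]
    raise KeyError("no such name: " + name)


# Hypothetical repository contents.
tags = {"foo": ":9"}
branch_tips = {"foo": ":7", "master": ":12"}
legacy_ids = {"2317": ":5"}
resets = {}

print(resolve_name("foo", tags, branch_tips, legacy_ids, resets))
print(resolve_name("2317", tags, branch_tips, legacy_ids, resets))
```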

dates and action stamps

A date or action stamp in angle brackets resolves to a selection set of all matching commits.

type interpretation

RFC3339 timestamp

commit or tag with that time/date

action stamp (timestamp!email)

commits or tags with that timestamp and author (or committer if no author). Aliases of the author are also accepted.

yyyy-mm-dd part of RFC3339 timestamp

all commits and tags with that date

To refine the match to a single commit, use a 1-origin index suffix separated by #. Thus <2000-02-06T09:35:10Z> can match multiple commits, but <2000-02-06T09:35:10Z#2> matches only the second in the set.
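The index suffix can be modeled with a few lines of Python. In this sketch the dictionary of matches and the function name pick are hypothetical; it only shows how a 1-origin ‘#n’ suffix narrows a multi-commit match.

```python
def pick(matches, reference):
    """Apply an optional '#n' suffix (1-origin) to the commits for a stamp."""
    stamp, _, index = reference.strip("<>").partition("#")
    selected = list(matches.get(stamp, []))
    if index:
        # Slicing yields an empty list when the index is out of range.
        selected = selected[int(index) - 1:int(index)]
    return selected


# Two hypothetical commits that share one timestamp.
matches = {"2000-02-06T09:35:10Z": [":4", ":6"]}
print(pick(matches, "<2000-02-06T09:35:10Z>"))
print(pick(matches, "<2000-02-06T09:35:10Z#2>"))
```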

text search

A text search expression is a regular expression surrounded by forward slashes (to embed a forward slash in it, use a C-like string escape such as \x2f).

A text search normally matches against the comment fields of commits and annotated tags, or against their author/committer names, or against the names of tags; also the text of passthrough objects.

The scope of a text search can be changed with qualifier letters after the trailing slash. These are as follows:

letter interpretation

a

author name in commit

b

branch name in commit; also matches blobs referenced by commits on matching branches, and tags which point to commits on matching branches.

c

comment text of commit or tag

r

committish reference in tag or reset

p

text in passthrough

t

tagger in tag

n

name of tag

B

blob content

Multiple qualifier letters can add more search scopes.

(The "b" qualifier replaces the branch-set syntax in earlier versions of reposurgeon.)

paths

A "path expression" enclosed in square brackets resolves to the set of all commits and blobs related to a path matching the given expression. The path expression itself is either a path literal or a regular expression surrounded by slashes. Immediately after the trailing / of a path regexp you can put any number of the following characters which act as flags: ‘a’, ‘c’, ‘D’, ‘M’, ‘R’, ‘C’, ‘N’.

By default, a path is related to a commit if the latter has a fileop that touches that file path - modifies that change it, deletes that remove it, renames and copies that have it as a source or target. When the ‘c’ flag is in use the meaning changes: the paths related to a commit become all paths that would be present in a checkout for that commit.

A path literal matches a commit if and only if the path literal is exactly one of the paths related to the commit (no prefix or suffix operation is done). In particular a path literal won’t match if it corresponds to a directory in the chosen repository.

A regular expression matches a commit if it matches any path related to the commit anywhere in the path. You can use ^ or $ if you want the expression to only match at the beginning or end of paths. When the ‘a’ flag is in use, the path expression selects commits whose every path matches the regular expression. This is not necessarily a subset of commits selected without the ‘a’ flag because it also selects commits with no related paths (e.g. empty commits, deletealls and commits with empty trees). If you want to avoid those, you can use e.g. ‘[/regex/] & [/regex/a]’.
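The difference between the plain form and the ‘a’ flag is the difference between "any" and "all", and a commit with no related paths passes an "all" test vacuously. The following Python sketch models this on a made-up set of commits; the names and data are hypothetical, and Python's re stands in for Go's regexp.

```python
import re


def matches_any(regex, paths):
    """Plain path regexp: some related path matches."""
    return any(re.search(regex, p) for p in paths)


def matches_all(regex, paths):
    """'a' flag: every related path matches (vacuously true if none)."""
    return all(re.search(regex, p) for p in paths)


# Hypothetical commits mapped to their related paths; :6 is empty.
commits = {
    ":2": ["src/main.c", "README"],
    ":4": ["src/util.c"],
    ":6": [],
}

plain = [m for m, p in commits.items() if matches_any(r"^src/", p)]
with_a = [m for m, p in commits.items() if matches_all(r"^src/", p)]
both = [m for m in plain if m in with_a]
print(plain, with_a, both)
```

The last list corresponds to intersecting the two forms, which drops the empty commit again.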

The flags ‘D’, ‘M’, ‘R’, ‘C’, ‘N’ restrict match checking to the corresponding fileop types. Note that this means an ‘a’ match is easier (not harder) to achieve. These are no-ops when used with ‘c’.

A path literal or regular expression matches a blob if it matches any path that appeared in a modification fileop that referred to that blob. To select purely matching blobs or matching commits, compose a path expression with =B or =C.

If you need to embed a ‘/’ in your regular expression (e.g. in ‘[^/]’ to express "all characters but a slash"), you can write the slash as a C-like string escape such as \x2f.

function calls

The expression language has named special functions. The sequence for a named function is “@” followed by a function name, followed by an argument in parentheses. Presently the following functions are defined:

name interpretation

min

minimum member of a selection set

max

maximum member of a selection set

amp

nonempty selection set becomes all objects, empty set is returned empty

par

all parents of commits in the argument set

chn

all children of commits in the argument set

dsc

all commits descended from the argument set (argument set included)

anc

all commits whom the argument set is descended from (argument set included)

pre

events before the argument set; empty if the argument set includes the first event.

suc

events after the argument set; empty if the argument set includes the last event.

srt

sort the argument set by event number.
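The ancestry functions amount to reachability over the commit graph. Here is a minimal Python sketch of anc and dsc on a small made-up DAG, with commits identified by integers and a hypothetical parents table. Results are sorted here only to make the output readable; the sketch says nothing about the order reposurgeon itself returns.

```python
def reachable(edges, start):
    """Everything reachable from start along edges, start included."""
    seen = set(start)
    stack = list(start)
    while stack:
        for neighbor in edges.get(stack.pop(), ()):
            if neighbor not in seen:
                seen.add(neighbor)
                stack.append(neighbor)
    return sorted(seen)


def anc(parents, start):
    """Commits the start set is descended from (start set included)."""
    return reachable(parents, start)


def dsc(parents, start):
    """Commits descended from the start set (start set included)."""
    children = {}
    for child, plist in parents.items():
        for parent in plist:
            children.setdefault(parent, []).append(child)
    return reachable(children, start)


# Commit 2 forks into 3 and 4, which merge again in 5.
PARENTS = {1: [], 2: [1], 3: [2], 4: [2], 5: [3, 4]}
print(anc(PARENTS, [4]))
print(dsc(PARENTS, [3]))
```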

Set expressions may be combined with the operators ‘|’ and ‘&’ which are, respectively, set union and intersection. The | has lower precedence than intersection, but you may use parentheses ‘(’ and ‘)’ to group expressions in case there is ambiguity.

Any set operation may be followed by ‘?’ to add the set members' neighbors and referents. This extends the set to include the parents and children of all commits in the set, and the referents of any tags and resets in the set. Each blob reference in the set is replaced by all commit events that refer to it. The ? can be repeated to extend the neighborhood depth. The result of a ? extension is sorted so the result is in ascending order.

Do set negation with prefix ‘~’; it has higher precedence than & and | but lower than ?.
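Python's built-in sets use the same relative precedence for & and |, so they make a convenient model of how selection expressions group. In this sketch the event numbers are made up, and negation is modeled as a complement against all events because Python sets have no prefix negation operator.

```python
ALL_EVENTS = set(range(1, 11))   # a hypothetical repository with events 1..10
COMMITS = {2, 4, 6, 8, 10}       # stands in for =C
TAGS = {9}                       # stands in for =T
EARLY = set(range(1, 6))         # stands in for the range 1..5


def negate(selection):
    """Model of prefix negation: everything not in the selection."""
    return ALL_EVENTS - selection


# Intersection binds tighter, so this groups as (COMMITS & EARLY) | TAGS.
default_grouping = COMMITS & EARLY | TAGS
# Parentheses change the result.
regrouped = COMMITS & (EARLY | TAGS)
# Negation applies before the intersection.
not_tags_late = negate(TAGS) & set(range(6, 11))

print(sorted(default_grouping), sorted(regrouped), sorted(not_tags_late))
```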

Command syntax

Following the (optional) selection set will be a whitespace-separated command name, possibly another whitespace-separated subcommand name, and possibly following arguments.

The syntax of following arguments is variable according to the requirements of individual commands, but there are a couple of general rules.

  • You can have comments in a script, led by the character "#". Both whole-line and "winged" comments following command arguments are supported. Note that reposurgeon’s command parser is fairly primitive and will be confused by a literal # in a command argument.

  • Many commands interpret C/Go style backslash escapes like \n in arguments. You can usually, for example, get around having to include a literal # in an argument by writing \x23.

  • Some commands support option flags. These are led with a --, so if there is an option flag named "foo" you would write it as "--foo". Option flags are parsed out of the command line before any other interpretation is performed, and can be anywhere on the line. The order of option flags is never significant.

  • When an option flag "foo" sets a value, the syntax is --foo=xxx with no spaces around the equal sign.

  • All commands that expect data to be presented on standard input support input redirection. You may write "<myfile" to take input from the file named "myfile". Redirections are parsed out early, before the command arguments proper are interpreted, and can be anywhere on the line.

  • All commands that expect data to be presented on standard input also accept a here-document, using the same syntax as shell here-documents with a leading "<<". There are two here-documents in the quick-start example.

  • Most commands that normally ship data to standard output accept output redirection. As in the shell, you can write ">outfile" to send the command output to "outfile", and ">>outfile2" to append to outfile2.

  • Some commands take following arguments that are regular expressions. In this context, they still require start and end delimiters as they do when used in a selection prefix, but if you need to have a / in the expression the delimiters can be any printable character. As a reminder, these are described in the embedded help as "delimited" regular expressions.

  • Also note that following-argument regular expressions may not contain whitespace; if you need to specify whitespace or a non-printable character use a standard C-style escape such as \s for space.
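The option-flag and redirection rules above can be illustrated with a short Python sketch. This is not reposurgeon's parser (which is written in Go); the function name split_command and the layout of its return value are hypothetical. The sketch assumes the selection set has already been removed from the line, and it ignores comments, backslash escapes, and here-documents.

```python
def split_command(line):
    """Separate option flags and redirections from the remaining arguments.

    Hypothetical helper for illustration only; assumes the selection set
    has already been stripped from the front of the line.
    """
    options, redirects, args = {}, {}, []
    for token in line.split():
        if token.startswith("--"):
            # "--foo" sets a boolean flag, "--foo=xxx" sets a value.
            name, eq, value = token[2:].partition("=")
            options[name] = value if eq else True
        elif token.startswith(">>"):
            redirects["append"] = token[2:]
        elif token.startswith(">"):
            redirects["write"] = token[1:]
        elif token.startswith("<"):
            redirects["read"] = token[1:]
        else:
            args.append(token)
    return options, redirects, args
```

Because flags and redirections are collected wherever they occur, "write --legacy >out.fi" and "write >out.fi --legacy" yield the same result in this sketch, mirroring the rule that their position and order on the line are not significant.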

Beware that while the reposurgeon CLI mimics simple shell features like redirection, many things you can do in a real shell won’t work. String-quoting arguments will fail unless the specific, documented syntax of a command supports that. You can’t redirect standard error (but see the ‘log’ command for a rough equivalent). And you can’t pipe input from a command or output to a command.

In general you should avoid getting cute with the command parser. It’s stupider than it looks.

Import and Export

reposurgeon can hold multiple repository states in core. Each has a name. At any given time, one may be selected for editing. Commands in this group import repositories, export them, and manipulate the in-core list and the selection.

If you are planning a conversion from Subversion, you should probably read Working with Subversion after this section.

If you are planning a conversion from Mercurial, you should probably read Working with Mercurial after this section.

Reading and writing repositories

read [ --format=fossil ] [ --no-implicit ] [ directory | - | <infile ]

With a directory-name argument, this command attempts to read in the contents of a repository in any supported version-control system under that directory; read with no arguments does this in the current directory. If input is redirected from a plain file, it will be read in as a fast-import stream or Subversion dumpfile. With an argument of ‘-’, this command reads a fast-import stream or Subversion dumpfile from standard input (this will be useful in filters constructed with command-line arguments).

If the content is a fast-import stream, any “cvs-revision” property on a commit is taken to be a newline-separated list of CVS revision cookies pointing to the commit, and used for reference lifting.

If the content is a fast-import stream, any “legacy-id” property on a commit is taken to be a legacy ID token pointing to the commit, and used for reference-lifting.

If the read location is a git repository and contains a .git/cvsauthors file (such as is left in place by ‘git cvsimport -A’) that file will be read in as if it had been given to the ‘authors read’ command.

If the read location is a directory, and its repository subdirectory has a file named legacy-map, that file will be read as though passed to a ‘legacy read’ command.

If the read location is a file and the --format=fossil option is used, the file is interpreted as a Fossil repository.

The just-read-in repo is added to the list of loaded repositories and becomes the current one, selected for surgery. If it was read from a plain file and the file name ends with one of the extensions ‘.fi’ or ‘.svn’, that extension is removed from the load list name.
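The naming rule can be sketched in Python as follows. This is an illustration only, not reposurgeon's code: the function name load_list_name and its from_plain_file parameter are hypothetical, and the sketch assumes the name otherwise defaults to the basename of the directory or file that was read (as described under the choose command). Disambiguating suffixes for repeated reads are not modeled.

```python
import os

def load_list_name(path, from_plain_file):
    """Derive a load-list name from the path a repository was read from.

    Hypothetical helper: strips '.fi' or '.svn' only for plain files.
    """
    base = os.path.basename(path.rstrip("/"))
    if from_plain_file:
        for ext in (".fi", ".svn"):
            if base.endswith(ext):
                return base[: -len(ext)]
    return base
```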

Normally, missing ‘from’ links in input streams are defaulted to the previous commit. The --no-implicit option disables this and may enable round-tripping of some streams on which it would fail (note however that git fast-export generates explicit ‘from’ links). This option will mainly be useful for testing and debugging.

Additional options to this command, not listed here, are described in a later section; they apply only to Subversion repositories.

Note: this command does not take a selection set.

[ selection ] write [ --legacy ] [ --format=fossil ] [ --noincremental ] [ --callout ] [ >outfile | - ]

Dump selected events as a fast-import stream representing the edited repository; the default selection set is all events. Output goes to standard output if there is no argument or the argument is ‘-’, or to the target of an output redirect.

Alternatively, if there is no redirect and the argument names a directory, the repository is rebuilt into that directory, with any selection set being ignored; if that target directory is nonempty its contents are backed up to a save directory.

If the write location is a file and the --format=fossil option is used, the file is written in Fossil repository format.

With the --legacy option, the Legacy-ID of each commit is appended to its commit comment at write time. This option is mainly useful for debugging conversion edge cases.

If you specify a partial selection set such that some commits are included but their parents are not, the output will include incremental dump cookies for each branch with an origin outside the selection set, just before the first reference to that branch in a commit. An incremental dump cookie looks like “refs/heads/foo^0” and is a clue to export-stream loaders that the branch should be glued to the tip of a pre-existing branch of the same name. The --noincremental option suppresses this behavior.
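As a rough Python sketch of when a cookie would be emitted: the data layout (dicts with 'mark', 'branch', and 'parents' keys) and the function name incremental_cookies are assumptions made for illustration, and the sketch interprets "origin outside the selection set" as "the first selected commit on the branch has a parent that is not selected". reposurgeon's actual test may differ.

```python
def incremental_cookies(commits, selected):
    """Return the cookies a partial write would need, in stream order.

    commits:  list of dicts in event order, each with 'mark', 'branch',
              and 'parents' (a list of parent marks).
    selected: set of marks being written.
    """
    cookies = []
    seen_branches = set()
    for commit in commits:
        if commit["mark"] not in selected:
            continue
        branch = commit["branch"]
        if branch in seen_branches:
            continue
        seen_branches.add(branch)
        # First selected commit on this branch: does it hang off
        # something that is not being written?
        if any(parent not in selected for parent in commit["parents"]):
            cookies.append(branch + "^0")
    return cookies
```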

Specifying a partial selection set, including a commit object, forces the inclusion of every blob to which it refers and every tag that refers to it.

Specifying a partial selection may cause a situation in which some parent marks in merges don’t correspond to commits present in the dump. When this happens and the --callout option was specified, the write code replaces the merge mark with a callout, the action stamp of the parent commit; otherwise the parent mark is omitted. Importers will fail when reading a stream dump with callouts; such a dump is intended to be used only by the ‘graft’ command.

Specifying a write selection set with gaps in it is allowed but unlikely to lead to good results if it is loaded by an importer.

Property extensions will be omitted from the output if the importer for the preferred repository type cannot digest them.

Note: to examine small groups of commits without the progress meter, use ‘inspect’.

Repository type preference

prefer [ repotype ]

With no arguments, describe capabilities of all supported systems. With an argument (which must be the name of a supported system) this has two effects:

First, if there are multiple repositories in a directory you do a read on, reposurgeon will read the preferred one (otherwise it will complain that it can’t choose among them).

Secondly, this will change reposurgeon’s preferred type for output. This means that if you do a write to a directory, it will build a repo of the preferred type rather than its original type (if it had one).

If no preferred type has been explicitly selected, reading in a repository (but not a fast-import stream) will implicitly set the preferred type to the type of that repository.

sourcetype [ repotype ]

Report (with no arguments) or select (with one argument) the current repository’s source type. This type is normally set at repository-read time, but may remain unset if the source was a stream file.

The source type affects the interpretation of legacy IDs (for purposes of the =N visibility set and the ‘references’ command) by controlling the regular expressions used to recognize them. If no preferred output type has been set, it may also change the output format of stream files made from the repository.

The source type is reliably set whenever a live repository is read, or when a Subversion stream or Fossil dump is interpreted - but not necessarily by other stream files. Streams generated by cvs-fast-export(1) using the --reposurgeon option are detected as CVS. In some other cases, the source system is detected from the presence of magic $-headers in contents blobs.

Rebuilds in place

reposurgeon can rebuild an altered repository in place. Untracked files are normally saved and restored when the contents of the new repository are checked out (but see the documentation of the ‘preserve’ command for a caveat).

rebuild [ directory ]

Rebuild a repository from the state held by reposurgeon. This command does not take a selection set.

The single argument, if present, specifies the target directory in which to do the rebuild; if the repository read was from a repo directory (and not a fast-import stream), it defaults to that directory. If the target directory is nonempty its contents are backed up to a save directory. Files and directories on the repository’s preserve list are copied back from the backup directory after repo rebuild. The default preserve list depends on the repository type, and can be displayed with the ‘stats’ command.

If reposurgeon has a nonempty legacy map, it will be written to a file named legacy-map in the repository subdirectory as though by a ‘legacy write’ command. (This will normally be the case for Subversion and CVS conversions.)

Crash recovery

This section will become relevant only if reposurgeon or something underneath it in the software and hardware stack crashes while in the middle of writing out a repository, in particular if the target directory of the rebuild is your current directory.

The tool has two conflicting objectives. On the one hand, we never want to risk clobbering a pre-existing repo. On the other hand, we want to be able to run this tool in a directory with a repo and modify it in place.

We resolve this dilemma by playing a game of three-directory monte.

  1. First, we build the repo in a freshly-created staging directory. If your target directory is named /path/to/foo, the staging directory will be a peer named /path/to/foo-stageNNNN, where NNNN is a cookie derived from reposurgeon’s process ID.

  2. We then make an empty backup directory. This directory will be named /path/to/foo.~N~, where N is incremented so as not to conflict with any existing backup directories. reposurgeon never, under any circumstances, ever deletes a backup directory.

    So far, all operations are safe; the worst that can happen up to this point if the process gets interrupted is that the staging and backup directories get left behind.

  3. The critical region begins. We first move everything in the target directory to the backup directory.

  4. Then we move everything in the staging directory to the target.

  5. We finish off by restoring untracked files in the target directory from the backup directory. That ends the critical region.

During the critical region, all signals that can be ignored are ignored.
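The five steps can be sketched in Python. This is a simplified illustration under stated assumptions, not reposurgeon's implementation: the names next_backup_dir and swap_in are hypothetical, the staging cookie is taken to be the raw process ID, the emptied staging directory is removed at the end (the steps above do not say what happens to it), and signal handling during the critical region is omitted.

```python
import os
import shutil

def next_backup_dir(target):
    """Pick the first unused /path/to/foo.~N~ name."""
    n = 1
    while os.path.exists(f"{target}.~{n}~"):
        n += 1
    return f"{target}.~{n}~"

def swap_in(target, build, untracked=()):
    """Rebuild `target` in place; `build(dir)` populates the new repo."""
    target = os.path.abspath(target)
    # Step 1: build in a fresh staging directory next to the target.
    staging = f"{target}-stage{os.getpid()}"
    os.mkdir(staging)
    build(staging)
    # Step 2: make an empty backup directory. Still safe to interrupt.
    backup = next_backup_dir(target)
    os.mkdir(backup)
    # Step 3 (critical region begins): move the old contents aside.
    for name in os.listdir(target):
        shutil.move(os.path.join(target, name), os.path.join(backup, name))
    # Step 4: move the freshly built contents into place.
    for name in os.listdir(staging):
        shutil.move(os.path.join(staging, name), os.path.join(target, name))
    # Step 5: copy untracked paths back from the backup.
    for name in untracked:
        src = os.path.join(backup, name)
        dst = os.path.join(target, name)
        if os.path.isdir(src):
            shutil.copytree(src, dst, dirs_exist_ok=True)
        elif os.path.exists(src):
            os.makedirs(os.path.dirname(dst), exist_ok=True)
            shutil.copy2(src, dst)
    os.rmdir(staging)  # assumption of this sketch
    return backup
```

Note that the backup directory is never deleted by the sketch, matching the guarantee given in step 2.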

File preservation

When the repository type you are working with has a "lister" method, it can tell which files in a repository directory are not checked in and will copy them into the edited repository made by a rebuild.

The following commands are required only if there is no lister method and you have to set preservations by hand.

preserve [ file…​ ]

Add (presumably untracked) files or directories to the repo’s list of paths to be restored from the backup directory after a ‘rebuild’. Each argument, if any, is interpreted as a pathname. The current preserve list is displayed afterwards.

It is only necessary to use this feature if your version-control system lacks a command to list files under version control. Under systems with such a command (which include git and hg), all files that are neither beneath the repository dot directory nor under reposurgeon temporary directories are preserved automatically.

unpreserve [ file…​ ]

Remove (presumably untracked) files or directories from the repo’s list of paths to be restored from the backup directory after a ‘rebuild’. Each argument, if any, is interpreted as a pathname. The current preserve list is displayed afterwards.

Incorporating release tarballs

When converting a legacy repository, it sometimes happens that there are archived releases of the project surviving from before the date of the repository’s initial commit. It may be desirable to insert those releases at the front of the repository history.

To do this, use the ‘incorporate’ command. This inserts the contents of specified tarballs as commits. The tarball names are given as arguments; if no arguments, a list is read from stdin. Tarballs may be gzipped or bzipped. The initial segment of each path is assumed to be a version directory and stripped off. The number of segments stripped off can be set with the option --strip=<n>, n defaulting to 1.
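The effect of --strip on a path inside a tarball can be sketched in Python. The function name strip_segments is hypothetical and the sketch only models the path arithmetic, not how reposurgeon reads the archive.

```python
def strip_segments(path, n=1):
    """Drop the first n '/'-separated segments of an archive member path."""
    return "/".join(path.split("/")[n:])
```

With the default of 1, a member named "foo-1.0/src/main.c" becomes "src/main.c", so successive release tarballs with different version directories land on the same repository paths.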

Takes a singleton selection set. Normally inserts before that commit; with the option --after, inserts after it. The default selection set is the very first commit of the repository.

The option --date can be used to set the commit date. It takes an argument, which is expected to be an RFC3339 timestamp.

The generated commits have a committer field (the invoking user) and each gets as date the modification time of the newest file in the tarball (not the mod time of the tarball itself). No author field is generated. A comment recording the tarball name is generated.

Note that the import stream generated by this command is - while correct - not optimal, and may in particular contain duplicate blobs.

With the --firewall option, generate an additional commit after the sequence consisting only of deletes crafted to prevent the incorporated content from leaking forward.

The repository list

Reposurgeon can have several repositories loaded at once. The following commands operate on the repository list.

choose [ reponame ]

Choose a named repo on which to operate. The name of a repo is normally the basename of the directory or file it was loaded from, but repos loaded from standard input are "unnamed". reposurgeon will add a disambiguating suffix if there have been multiple reads from the same source.

With no argument, lists the names of the currently stored repositories and their load times. The second column is ‘*’ for the currently selected repository, ‘-’ for others.

drop [ reponame ]

Drop a repo named by the argument from reposurgeon’s list, freeing the memory used for its metadata and deleting on-disk blobs. With no argument, drops the currently chosen repo.

rename reponame

Rename the currently chosen repo; requires an argument. Won’t do it if there is already one by the new name.

Information and reports

Commands in this group report information about the selected repository.

The output of these commands can individually be redirected to a named output file. Where indicated in the syntax, you can prefix the output filename with ‘>’ and give it as a following argument. If you use ‘>>’ the file is opened for append rather than write.

Reports on the DAG

[ selection ] list [ >outfile ]

This is the main command for identifying the events you want to modify. It lists commits in the selection set by event sequence number with summary information. The first column is raw event numbers, the second a timestamp in local time. If the repository has legacy IDs, they will be displayed in the third column. The leading portion of the comment follows.

[ selection ] index [ >outfile ]

Display four columns of info on objects in the selection set: their number, their type, the associated mark (or ‘-’ if no mark) and a summary field varying by type. For a branch or tag it’s the reference; for a commit it’s the commit branch; for a blob it’s the repository path of the file in the blob.

[ selection ] stamp [ >outfile ]

Alternative form of listing that displays full action stamps, usable as references in selections.

[ selection ] tip [ >outfile ]

Display the branch tip names associated with commits in the selection set. These will not necessarily be the same as their branch fields (which will often be tag names if the repo contains either annotated or lightweight tags).

If a commit is at a branch tip, its tip is its branch name. If it has only one child, its tip is the child’s tip. If it has multiple children, then if there is a child with a matching branch name its tip is the child’s tip. Otherwise this function throws a recoverable error.
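The rule can be written out as a Python sketch. The data layout (a dict mapping each mark to its 'branch' and list of 'children') and the names tip and TipError are hypothetical, and the sketch treats "at a branch tip" as "has no children", which is an assumption.

```python
class TipError(Exception):
    """Stands in for reposurgeon's recoverable error."""

def tip(mark, commits):
    """Follow children from `mark` until a branch tip is reached."""
    while True:
        commit = commits[mark]
        kids = commit["children"]
        if not kids:
            # At a branch tip: the tip is the commit's own branch.
            return commit["branch"]
        if len(kids) == 1:
            mark = kids[0]
            continue
        # Several children: follow one that stays on the same branch.
        same = [k for k in kids if commits[k]["branch"] == commit["branch"]]
        if not same:
            raise TipError(f"cannot deduce a tip for {mark}")
        mark = same[0]
```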

[ selection ] tags [ >outfile ]

Display tags and resets: three fields, an event number and a type and a name. Branch tip commits associated with tags are also displayed with the type field ‘commit’.

[ selection ] inspect [ >outfile ]

Dump a fast-import stream representing selected events to standard output. Just like a write, except (1) the progress meter is disabled, and (2) there is an identifying header before each event dump.

[ selection ] graph [ >outfile ]

Emit a visualization of the commit graph in the DOT markup language used by the graphviz tool suite. This can be fed as input to the main graphviz rendering program dot(1), which will yield a viewable image.

You may find a script like this useful:

graph $1 >/tmp/foo$$
shell dot </tmp/foo$$ -Tpng | display -; rm /tmp/foo$$

You can substitute in your own preferred image viewer, of course.

[ selection ] lint [ options ] [ >outfile ]

Look for DAG and metadata configurations that may indicate a problem. Presently checks for: (1) mid-branch deletes, (2) disconnected commits, (3) parentless commits, (4) the existence of multiple roots, (5) committer and author IDs that don’t look well-formed as DVCS IDs, (6) multiple child links with identical branch labels descending from the same commit, (7) time and action-stamp collisions.

Options to issue only partial reports are supported; ‘lint --options’ or ‘lint -?’ lists them.

The options and output format of this command are unstable; they may change without notice as more sanity checks are added.

Statistics

stats [ repo-name… ] [ >outfile ]

Report size statistics and import/export method information about named repositories, or with no argument the currently chosen repository.

[ selection ] count [ >outfile ]

Report a count of items in the selection set. Default set is everything in the currently-selected repo.

[ selection ] sizes [ >outfile ]

Print a report on data volume per branch; takes a selection set, defaulting to all events. The numbers tally the size of uncompressed blobs, commit and tag comments, and other metadata strings (a blob is counted each time a commit points at it).

The numbers are not an exact measure of storage size: they are intended mainly as a way to get information on how to efficiently partition a repository that has become large enough to be unwieldy.

Examining tree states

[ selection ] manifest [ /regular expression/ ] [ >outfile ]

Takes an optional selection set argument defaulting to all commits, and an optional regular expression. For each commit in the selection set, print the mapping of all paths in that commit tree to the corresponding blob marks, mirroring what files would be created in a checkout of the commit. If a delimited regular expression is given, only print "path -> mark" lines for paths matching it. This command supports > redirection.

[ selection ] checkout directory

Takes a selection set which must resolve to a single commit, and a second argument. The second argument is interpreted as a directory name. The state of the code tree at that commit is materialized beneath the directory.

[ selection ] diff [ >outfile ]

Display the difference between commits. Takes a selection-set argument which must resolve to exactly two commits. Supports output redirection.

Surgical Operations

These are the operations the rest of reposurgeon is designed to support.

Commit deletion

[ selection ] squash [ policy… ]

Combine or delete commits in a selection set of events. The default selection set for this command is empty. Has no effect on events other than commits unless the --delete policy is selected; see the ‘delete’ command for discussion.

Normally, when a commit is squashed, its file operation list (and any associated blob references) gets either prepended to the beginning of the operation list of each of the commit’s children or appended to the operation list of each of the commit’s parents. Then children of a deleted commit get it removed from their parent set and its parents added to their parent set.

The analogous operation is performed on commit comments, so no comment text is ever outright discarded. Exception: comments consisting of “*** empty log message ***”, as generated by CVS, are ignored.

The default is to squash forward, modifying children; but see the list of policy modifiers below for how to change this.
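A minimal Python sketch of the default forward squash follows. It is an illustration under assumptions, not reposurgeon's algorithm: the function name squash_forward and the dict layout are hypothetical, the deleted commit's parents are spliced into each child's parent list at the position the deleted commit occupied (the text above does not specify an order), and comment merging, tag movement, and fileop normalization are left out.

```python
def squash_forward(mark, commits):
    """Remove commit `mark`, pushing its fileops onto its children.

    commits: dict mark -> {'parents': [...], 'children': [...],
                           'fileops': [...]}; modified in place.
    """
    victim = commits.pop(mark)
    for kid in victim["children"]:
        child = commits[kid]
        # Prepend the victim's operations to the child's list.
        child["fileops"] = victim["fileops"] + child["fileops"]
        # Replace the victim in the child's parents with the victim's parents.
        new_parents = []
        for parent in child["parents"]:
            replacement = victim["parents"] if parent == mark else [parent]
            for p in replacement:
                if p not in new_parents:
                    new_parents.append(p)
        child["parents"] = new_parents
    for p in victim["parents"]:
        parent = commits[p]
        kept = [c for c in parent["children"] if c != mark]
        added = [k for k in victim["children"] if k not in kept]
        parent["children"] = kept + added
```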

Warning
It is easy to get the bounds of a squash command wrong, with confusing and destructive results. Beware thinking you can squash on a selection set to merge all commits except the last one into the last one; what you will actually do is to merge all of them to the first commit after the selected set.

Normally, any tag pointing to a combined commit will also be pushed forward. But see the list of policy modifiers below for how to change this.

Following all operation moves, every one of the altered file operation lists is reduced to a shortest normalized form. The normalized form detects various combinations of modification, deletion, and renaming and simplifies the operation sequence as much as it can without losing any information.

The following modifiers change these policies:

--delete

Simply discards all file ops and tags associated with deleted commit(s).

--no-coalesce

Do not normalize the modified commit operations.

--pushback

Append fileops to parents, rather than prepending to children.

--pushforward

Prepend fileops to children. This is the default; it can be specified in a lift script for explicitness about intentions.

--tagforward

Any tag on the deleted commit is pushed forward to the first child rather than being deleted. This is the default; it can be specified for explicitness.

--tagback

Any tag on the deleted commit is pushed backward to the first parent rather than being deleted.

--quiet

Suppresses warning messages about deletion of commits with non-delete fileops.

--complain

The opposite of --quiet. Can be specified for explicitness.

--empty-only

Complain if a squash operation modifies a nonempty comment.

--blobs

Allow deletion of selected blobs.

Under any of these policies except --delete, deleting a commit that has children does not back out the changes made by that commit, as they will still be present in the blobs attached to versions past the end of the deletion set. All a delete does when the commit has children is lose the metadata information about when and by whom those changes were actually made; after the delete any such changes will be attributed to the first undeleted children of the deleted commits. It is expected that this command will be useful mainly for removing commits mechanically generated by repository converters such as cvs2svn.

[ selection ] delete [ policy…​ ]

Delete a selection set of events. The default selection set for this command is empty. On a set of commits, this is equivalent to a squash with the --delete flag. It unconditionally deletes tags, resets, and passthroughs; blobs can be removed only as a side effect of deleting every commit that points at them.

Commit mutation

[ selection ] merge

Create a merge link. Takes a selection set argument, ignoring all but the lowest (source) and highest (target) members. Creates a merge link from the highest member (child) to the lowest (parent).

[ selection ] unmerge

Linearize a commit. Takes a selection set argument, which must resolve to a single commit, and removes all its parents except for the first. It is equivalent to 'first_parent, commit reparent --rebase', where commit is the same selection set as used with unmerge and first_parent is a set resolving commit's first parent (see the reparent command below). The main interest of the unmerge is that you don’t have to find and specify the first parent yourself, saving time and avoiding errors when nearby surgery would make a manual first parent argument stale.

[ selection ] reparent [ options…​ ] [ policy ]

Changes the parent list of a commit. Takes a selection set, zero or more option arguments, and an optional policy argument.

Selection set: The selection set must resolve to one or more commits. The selected commit with the highest event number (not necessarily the last one selected) is the commit to modify. The remainder of the selected commits, if any, become its parents: the selected commit with the lowest event number (which is not necessarily the first one selected) becomes the first parent, the selected commit with second lowest event number becomes the second parent, and so on. All original parent links are removed. Examples:

# this makes 17 the parent of 33
17,33 reparent

# this also makes 17 the parent of 33
33,17 reparent

# this makes 33 a root (parentless) commit
33 reparent

# this makes 33 an octopus merge commit.  its first parent
# is commit 15, second parent is 17, and third parent is 22
22,33,15,17 reparent

The option --use-order says to use the selection order to determine which selected commit is the commit to modify and which are the parents (and if there are multiple parents, their order). The last selected commit (not necessarily the one with the highest event number) is the commit to modify, the first selected commit (not necessarily the one with the lowest event number) becomes the first parent, the second selected commit becomes the second parent, and so on. Examples:

# this makes 33 the parent of 17
33,17 reparent --use-order

# this makes 17 an octopus merge commit.  its first parent
# is commit 22, second parent is 33, and third parent is 15
22,33,15,17 reparent --use-order
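The two ordering rules can be summarized in a short Python sketch. The function name reparent_plan is hypothetical; it only computes which commit is modified and what its new parent list would be, and performs no surgery.

```python
def reparent_plan(selection, use_order=False):
    """Return (commit_to_modify, new_parent_list) for a selection.

    selection: event numbers in the order they were selected.
    """
    ordered = list(selection) if use_order else sorted(selection)
    return ordered[-1], ordered[:-1]
```

Applied to the examples above, reparent_plan([22, 33, 15, 17]) gives (33, [15, 17, 22]), while reparent_plan([22, 33, 15, 17], use_order=True) gives (17, [22, 33, 15]).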

Because ancestor commit events must appear before their descendants, giving a commit with a low event number a parent with a high event number triggers a re-sort of the events. A re-sort assigns different event numbers to some or all of the events. Re-sorting only works if the reparenting does not introduce any cycles. To swap the order of two commits that have an ancestor–descendant relationship without introducing a cycle during the process, you must reparent the descendant commit first.
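To see why the descendant must be reparented first, here is a Python sketch of a cycle check. The function name creates_cycle and the parents_of mapping are hypothetical; reposurgeon's own check may be implemented differently.

```python
def creates_cycle(target, new_parents, parents_of):
    """True if giving `target` these parents would make it its own ancestor.

    parents_of: dict commit -> list of its current parents.
    """
    stack = list(new_parents)
    seen = set()
    while stack:
        commit = stack.pop()
        if commit == target:
            return True
        if commit in seen:
            continue
        seen.add(commit)
        stack.extend(parents_of.get(commit, []))
    return False
```

Take a chain in which 1 is the parent of 2 and 2 is the parent of 3. Making 3 the parent of 2 straight away is a cycle, because 2 is still an ancestor of 3. Reparenting 3 onto 1 first removes that ancestry, after which 2 can safely be given 3 as its parent.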

By default, the manifest of the reparented commit is computed before modifying it; a deleteall and some fileops are prepended so that the manifest stays unchanged even when the first parent has been changed. This behavior can be changed by specifying a policy flag. --rebase inhibits the default behavior—no deleteall is issued, and the tree contents of all descendants can be modified as a result.

{ selection } split {at|by} item

The first argument is required to be a commit location; the second is a preposition which indicates which splitting method to use. If the preposition is ‘at’, then the third argument must be an integer 1-origin index of a file operation within the commit. If it is ‘by’, then the third argument must be a pathname to be prefix-matched, with the pathname match done first.

The commit is copied and inserted into a new position in the event sequence, immediately following itself; the duplicate becomes the child of the original, and replaces it as parent of the original’s children. Commit metadata is duplicated; the new commit then gets a new mark. If the new commit has a legacy ID, the suffix ‘.split’ is appended to it.

Finally, some file operations — starting at the one matched or indexed by the split argument — are moved forward from the original commit into the new one. Legal indices are 2-n, where n is the number of file operations in the original commit.
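For the ‘at’ form, the division of file operations can be sketched in Python. The function name split_at is hypothetical, and the sketch assumes that every operation from the indexed one onward moves to the new commit.

```python
def split_at(fileops, index):
    """Divide a fileop list at a 1-origin index.

    Returns (ops kept by the original commit, ops moved to the new commit).
    """
    n = len(fileops)
    if not 2 <= index <= n:
        raise ValueError(f"legal split indices are 2-{n}")
    return fileops[: index - 1], fileops[index - 1 :]
```

An index of 1 is rejected because it would leave the original commit with no file operations at all.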

{ selection } add { D path | M perm mark path | R source target | C source target}

To a selected commit, add a specified fileop.

For a D operation to be valid there must be an M operation for the path in the commit’s ancestry. For an M operation to be valid, the ‘perm’ part must be a token ending with 755 or 644, and the ‘mark’ must refer to a blob that precedes the commit location. For an R or C operation to be valid, there must be an M operation for the source in the commit’s ancestry.

{ selection } remove [ index | path | deletes ] [ to commit ]

From a selected commit, remove a specified fileop. The op must be one of (a) the keyword ‘deletes’, (b) a file path, (c) a file path preceded by an op type set (some subset of the letters DMRCN), or (d) a 1-origin numeric index. The ‘deletes’ keyword selects all D fileops in the commit; the others select one each.

If the ‘to’ clause is present, the removed op is appended to the commit specified by the following singleton selection set. This option cannot be combined with ‘deletes’.

Note that this command does not attempt to scavenge blobs even if the deleted fileop might be the only reference to them. This behavior may change in a future release.

[ selection ] tagify [ --canonicalize ] [ --tipdeletes ] [ --tagify-merges ]

Search for empty commits and turn them into tags. Takes an optional selection set argument defaulting to all commits. For each commit in the selection set, turn it into a tag with the same message and author information if it has no fileops. By default merge commits are not considered, even if they have no fileops (thus no tree differences with their first parent). To change that, use the --tagify-merges option.

The name of the generated tag will be ‘emptycommit-ident’, where ident is generated from the legacy ID of the deleted commit, or from its mark, or from its index in the repository, with a disambiguation suffix if needed.

With the --canonicalize option, tagify tries harder to detect trivial commits by first ensuring that all fileops of selected commits will have an actual effect when processed by fast-import.

With the --tipdeletes option, tagify also considers branch tips with only deleteall fileops to be candidates for tagification. The corresponding tags get names of the form ‘tipdelete-branchname’ rather than the default ‘emptycommit-ident’.

With the --tagify-merges option, tagify also tagifies merge commits that have no fileops. When this is done the merge link is moved to the tagified commit’s parent.

[ selection ] reorder [ --quiet ]

Re-order a contiguous range of commits.

Older revision-control systems tracked change history on a per-file basis, rather than as a series of atomic changesets, which often made it difficult to determine the relationships between changes. Some tools which convert a history from one revision-control system to another attempt to infer changesets by comparing each file’s commit comment and time-stamp against those of other nearby commits, but such inference is a heuristic and can easily fail. In the best case, when inference fails, a range of commits in the resulting conversion which should have been coalesced into a single changeset instead ends up as a contiguous range of separate commits. This situation typically can be repaired easily enough with the coalesce or squash commands.

However, in the worst case, numerous commits from several different topics, each of which should have been one or more distinct changesets, may end up interleaved in an apparently chaotic fashion. To deal with such cases, the commits need to be re-ordered, so that those pertaining to each particular topic are clumped together, and then possibly squashed into one or more changesets pertaining to each topic. This command, reorder, can help with the first task; the squash command with the second.

Selected commits are re-arranged in the order specified; for instance: ‘:7,:5,:9,:3 reorder’. The specified commit range must be contiguous; each commit must be accounted for after re-ordering. Thus, for example, ‘:5’ can not be omitted from ‘:7,:5,:9,:3 reorder’. (To drop a commit, use the delete or squash command.)
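The contiguity requirement can be expressed as a small Python check. The function name is_contiguous_range is hypothetical, and `sequence` is assumed to be the list of commit marks in repository order with blobs and other events left out.

```python
def is_contiguous_range(selection, sequence):
    """True if `selection` is a permutation of one unbroken run of commits."""
    if not selection or len(set(selection)) != len(selection):
        return False
    positions = sorted(sequence.index(mark) for mark in selection)
    first = positions[0]
    return positions == list(range(first, first + len(positions)))
```

With commits :1, :3, :5, :7, :9, :11 in that order, the selection :7,:5,:9,:3 passes the check, while :7,:9,:3 fails because :5 is unaccounted for.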

The selected commits must represent a linear history; however, the lowest numbered commit being re-ordered may have multiple parents, and the highest numbered may have multiple children. Re-ordered commits and their immediate descendants are inspected for rudimentary fileops inconsistencies. The command warns if re-ordering results in a commit trying to delete, rename, or copy a file before it was ever created. Likewise, it warns if all of a commit’s fileops become no-ops after re-ordering. Other fileops inconsistencies may arise from re-ordering, both within the range of affected commits and beyond; for instance, moving a commit which renames a file ahead of a commit which references the original name. Such anomalies can be discovered via manual inspection and repaired with the add and remove (and possibly path) commands. Warnings can be suppressed with --quiet.

In addition to adjusting their parent/child relationships, re-ordering commits also re-orders the underlying events since ancestors must appear before descendants, and blobs must appear before commits which reference them. This means that events within the specified range will have different event numbers after the operation.

Branches, tags, and resets

branch branchname { rename | delete } [ arg ]

Rename or delete a branch (and any associated resets). First argument must be an existing branch name; second argument must be one of the verbs ‘rename’ or ‘delete’.

For a ‘rename’, the third argument may be any token that is a syntactically valid branch name (but not the name of an existing branch). If it does not contain a ‘/’ the prefix ‘heads/’ is prepended. If it does not begin with ‘refs/’, then ‘refs/’ is prepended.

For a ‘delete’, the name may optionally be a regular expression wrapped in //; if so, all objects of the specified type with names matching the regexp are deleted. This is useful for mass deletion of branches. Such deletions can be restricted by a selection set in the normal way. No third argument is required.

[ selection ] tag tagname { create | move | rename | delete } [ arg ]

Create, move, rename, or delete a tag.

Creation is a special case. First argument is a name, which must not be an existing tag. Takes a singleton event second argument which must point to a commit. A tag object pointing to the commit is created and inserted just after the last tag in the repo (or just after the last commit if there are no tags). The tagger, committish, and comment fields are copied from the commit’s committer, mark, and comment fields.

Otherwise, first argument must be an existing tag name; second argument must be one of the verbs ‘move’, ‘rename’, or ‘delete’.

For a ‘move’, a third argument must be a singleton selection set. For a ‘rename’, the third argument may be any token that is a syntactically valid tag name (but not the name of an existing tag).

For a ‘delete’, no third argument is required. The name portion of a delete may be a regexp wrapped in //; if so, all objects of the specified type with names matching the regexp are deleted. This is useful for mass deletion of junk tags such as CVS branch-root tags.

The tagname may use C-style backslash escapes, such as \s.

The behavior of this command is complex because features which present as tags may be any of three things: (1) True tag objects, (2) lightweight tags, actually sequences of commits with a common branchname beginning with ‘refs/tags’ - in this case the tag is considered to point to the last commit in the sequence, (3) Reset objects. These may occur in combination; in fact, stream exporters from systems with annotation tags commonly express each of these as a true tag object (1) pointing at the tip commit of a sequence (2) in which the basename of the common branch field is identical to the tag name. An exporter that generates lightweight-tagged commit sequences (2) may or may not generate resets pointing at their tip commits.

This command tries to handle all combinations in a natural way by doing up to three operations on any true tag, commit sequence, and reset matching the source name. In a rename, all are renamed together. In a delete, any matching tag or reset is deleted; then matching branch fields are changed to match the branch of the unique descendant of the tagged commit, if there is one. When a tag is moved, no branch fields are changed, and a warning is issued.

Attempts to delete a lightweight tag may fail with the message “couldn’t determine a unique successor”. When this happens, the tag is on a commit with multiple children that have different branch labels. There is a hole in the specification of git fast-import streams that leaves it uncertain how branch labels can be safely reassigned in this case; rather than do something risky, reposurgeon throws a recoverable error.

[ selection ] reset resetname { create | move | rename | delete } [ arg ]

Create, move, rename, or delete a reset. Create is a special case; it requires a singleton selection which is the associated commit for the reset, takes as a first argument the name of the reset (which must not exist), and ends with the keyword create.

In the other modes, the first argument must match an existing reset name; second argument must be one of the verbs ‘move’, ‘rename’, or ‘delete’.

The reset name may use C-style backslash escapes, such as \s.

For a ‘move’, a third argument must be a singleton selection set. For a ‘rename’, the third argument may be any token that matches a syntactically valid reset name (but not the name of an existing reset). For a ‘delete’, no third argument is required.

For either name, if it does not contain a ‘/’ the prefix ‘heads/’ is prepended. If it does not begin with ‘refs/’, ‘refs/’ is prepended.

An argument matches a reset’s name if it is either the entire reference (refs/heads/FOO or refs/tags/FOO for some value of FOO) or the basename (e.g. FOO), or a suffix of the form heads/FOO or tags/FOO. An unqualified basename is assumed to refer to a head.
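One way to read the prefixing and matching rules together is as a normalization step followed by an exact comparison. The Python sketch below illustrates that reading; the function names are hypothetical, and whether reposurgeon implements matching this way internally is an assumption.

```python
def canonical_ref(name):
    """Expand a short reset or branch name to a full reference."""
    if "/" not in name:
        name = "heads/" + name   # an unqualified basename is taken to be a head
    if not name.startswith("refs/"):
        name = "refs/" + name
    return name


def matches_reset(argument, reset_ref):
    """True if `argument` names the reset whose full reference is `reset_ref`."""
    return canonical_ref(argument) == reset_ref
```

Under this reading, FOO and heads/FOO both resolve to refs/heads/FOO, while a tag must be written at least as tags/FOO.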

When a reset is renamed, commit branch fields matching the reset are renamed with it to match. When a reset is deleted, matching branch fields are changed to match the branch of the unique descendant of the tip commit of the associated branch, if there is one. When a reset is moved, no branch fields are changed.

Repository splitting and merging

{ selection } divide

Attempt to partition a repo by cutting the parent-child link between two specified commits (they must be adjacent). Does not take a general selection set. It is only necessary to specify the parent commit, unless it has multiple children in which case the child commit must follow (separate it with a comma).

If the repo was named ‘_foo_’, you will normally end up with two repos named ‘_foo_-early’ and ‘_foo_-late’ (option and feature events at the beginning of the early segment will be duplicated onto the beginning of the late one). But if the commit graph would remain connected through another path after the cut, the behavior changes. In this case, if the parent and child were on the same branch ‘_qux_’, the branch segments are renamed ‘_qux_-early’ and ‘_qux_-late’ but the repo is not divided.

[ selection ] expunge [ --notagify ] [~] [ path | /regexp/ ]…​

Expunge files from the selected portion of the repo history; the default is the entire history. The arguments to this command may be paths or regular expressions matching paths (regexps must be marked by being surrounded with //). String quotes and backslash escapes are interpreted when parsing the command line.

Exceptionally, the first argument may be the token "~" which chooses all file paths other than those selected by the remaining arguments. You may use this to sift out all file operations matching a pattern set rather than expunging them.

All filemodify (M) operations and delete (D) operations involving a matched file in the selected set of events are disconnected from the repo and put in a removal set. Renames are followed as the tool walks forward in the selection set; each triggers a warning message. If a selected file is a copy (C) target, the copy will be deleted and a warning message issued. If a selected file is a copy source, the copy target will be added to the list of paths to be deleted and a warning issued.

After file expunges have been performed, any commits with no remaining file operations will be removed, along with any tags pointing to them. By default each deleted commit is replaced with a tag of the form ‘emptycommit-ident’ on the preceding commit unless --notagify is specified as an argument. Commits with deleted fileops pointing both in and outside the path set are not deleted, but are cloned into the removal set.

unite [ --prune ] reponame…​

Unite repositories. Name any number of loaded repositories; they will be united into one union repo and removed from the load list. The union repo will be chosen.

The root of each repo (other than the oldest repo) will be grafted as a child to the last commit in the dump with a preceding commit date. This will produce a union repository with one branch for each part. Running last to first, duplicate tag and branch names will be disambiguated using the source repository name (thus, recent duplicates will get priority over older ones). After all grafts, marks will be renumbered.

The name of the new repo will be the names of all parts concatenated, separated by ‘+’. It will have no source directory or preferred system type.

With the option --prune, at each join D operations for every ancestral file existing will be prepended to the root commit, then it will be canonicalized using the rules for squashing. The effect will be that only files with properly matching M, R, and C operations in the root survive.

[ selection ] graft [ --prune ] reponame

For when unite doesn’t give you enough control. This command may have either of two forms, selected by the size of the selection set. The first argument is always required to be the name of a loaded repo.

If the selection set is of size 1, it must identify a single commit in the currently chosen repo; in this case the named repo’s root will become a child of the specified commit. If the selection set is empty, the named repo must contain one or more callouts matching a commit in the currently chosen repo.

Labels and branches in the named repo are prefixed with its name; then it is grafted to the selected one. Any other callouts in the named repo are also resolved in the context of the currently chosen one. Finally, the named repo is removed from the load list.

With the option --prune, prepend a deleteall operation into the root of the grafted repository.

Metadata editing

[ selection ] msgout [ >outfile ] [ --filter=/regex/ ] [ --blobs ]

Emit a file of messages in RFC2822 format representing the contents of repository metadata. Takes a selection set; members of the set other than commits, annotated tags, and passthroughs are ignored (that is, presently, blobs and resets), except that if the --blobs option is passed, blobs will also be included.

May have an option --filter, followed by = and a /-enclosed regular expression. If this is given, only headers with names matching it are emitted. In this context the name of the header includes its trailing colon.

msgin [ --create ] [ --empty-only ] [ <infile ] [ --changed >outfile ]

Accept a file of messages in RFC2822 format representing the contents of the metadata in selected commits and annotated tags. Takes no selection set. If there is an argument it will be taken as the name of a message file to read from; with no argument, or an argument of ‘-’, the command reads from standard input.

Users should be aware that modifying an Event-Number or Event-Mark field will change which event the update from that message is applied to. This is unlikely to have good results.

The header CheckText, if present, is examined to see if the comment text of the associated event begins with it. If not, the item modification is aborted. This helps ensure that you are landing updates of the events you intend.

If the --create modifier is present, new tags and commits will be appended to the repository. In this case it is an error for a tag name to match any existing tag name. Commit objects are created with no fileops. If Committer-Date or Tagger-Date fields are not present they are filled in with the time at which this command is executed. If Committer or Tagger fields are not present, reposurgeon will attempt to deduce the user’s git-style identity and fill it in. If a singleton commit set was specified for commit creations, the new commits are made children of that commit.

Otherwise, if the Event-Number and Event-Mark fields are absent, the msgin logic will attempt to match the commit or tag first by Legacy-ID, then by a unique committer ID and timestamp pair.

If output is redirected and the modifier --changed appears, a minimal set of modifications actually made is written to the output file in a form that can be fed back in.

If the option --empty-only is given, this command will throw a recoverable error if it tries to alter a message body that is neither empty nor consists of the CVS empty-comment marker.

[ selection ] setfield attribute value

In the selected objects (defaulting to none) set every instance of a named field to a string value. The string may be quoted to include whitespace, and use C-style backslash escapes, such as \n and \t.

Attempts to set nonexistent attributes are ignored. Valid values for the attribute are internal field names; in particular, for commits, ‘comment’ and ‘branch’ are legal. Consult the source code for other interesting values.

The special field names ‘author’, ‘commitdate’ and ‘authdate’ apply only to commits in the range. The latter two set attribution dates. The first sets the author’s name and email address (assuming the value can be parsed for both), copying the committer timestamp. The author’s timezone may be deduced from the email address.

[ selection ] edit [ --blobs | --not-last ] [<`infile`] [>`outfile`]

Report the selection set of events to a tempfile as msgout does, call an editor on it, and update from the result as msgin does. If you do not specify an editor name as second argument, it will be taken from the $EDITOR variable in your environment. If $EDITOR is not set, /usr/bin/editor will be used as a fallback if it exists as a symlink to your default editor, as is the case on Debian, Ubuntu and their derivatives.

Normally this command ignores blobs because msgout does. However, if you specify a selection set consisting of a single blob, your editor will be called directly on the blob file; alternatively, as with msgout, the --blobs option will include blobs in the file.

In the singleton blob case (without --blobs), the command will warn if the blob to be edited appears in any commits whose descendants modify the same blob (since changes will not propagate to the descendant versions). This warning may be suppressed (e.g. in scripts) with the --not-last option.

Supports < and > redirection.

[ selection ] attribution [ attr-selection ] { show | set | delete | prepend | append } [ args ]

Inspect, modify, add, and remove commit and tag attributions.

Attributions upon which to operate are selected in much the same way as events are selected, as described in Selection syntax. attr-selection is an expression composed of 1-origin attribution-sequence numbers, ‘$’ for last attribution, ‘..’ ranges, comma-separated items, ‘(…​)’ grouping, set operations ‘|’ union, ‘&’ intersection, and ‘~’ negation, and function calls @min(), @max(), @amp(), @pre(), @suc(), @srt(). Attributions can also be selected by visibility set ‘=C’ for committers, ‘=A’ for authors, and ‘=T’ for taggers. Finally, ‘/regex/’ will attempt to match the regular expression regex against an attribution name and email address; ‘/n’ limits the match to only the name, and ‘/e’ to only the email address.

With the exception of ‘show’, all actions require an explicit event selection upon which to operate. Available actions are:

[ attr-selection ] [ show ] [ >file ]

Inspect the selected attributions of the specified events (commits and tags). The ‘show’ keyword is optional. If no attribution selection expression is given, defaults to all attributions. If no event selection is specified, defaults to all events. Supports > redirection.

{ attr-selection } set [ name ] [ email ] [ date ]

Assign name, email, date to the selected attributions. As a convenience, if only some fields need to be changed, the others can be omitted. Arguments name, email, and date can be given in any order.

[ attr-selection ] delete

Delete the selected attributions. As a convenience, deletes all authors if attr-selection is not given. It is an error to delete the mandatory committer and tagger attributions of commit and tag events, respectively.

[ attr-selection ] prepend [ name ] [ email ] [ date ]

Insert a new attribution before the first attribution named by attr-selection. The new attribution has the same type (committer, author, or tagger) as the one before which it is being inserted. Arguments name, email, and date can be given in any order.

If name is omitted, an attempt is made to infer it from email by trying to match email against an existing attribution of the event, with preference given to the attribution before which the new attribution is being inserted. Similarly, email is inferred from an existing matching name. Likewise, for date.

As a convenience, if attr-selection is empty or not specified a new author is prepended to the author list.

It is presently an error to insert a new committer or tagger attribution. To change a committer or tagger, use ‘set’ instead.

[ attr-selection ] append [ name ] [ email ] [ date ]

Insert a new attribution after the last attribution named by attr-selection. The new attribution has the same type (committer, author, or tagger) as the one after which it is being inserted. Arguments name, email, and date can be given in any order.

If name is omitted, an attempt is made to infer it from email by trying to match email against an existing attribution of the event, with preference given to the attribution after which the new attribution is being inserted. Similarly, email is inferred from an existing matching name. Likewise, for date.

As a convenience, if attr-selection is empty or not specified a new author is appended to the author list.

It is presently an error to insert a new committer or tagger attribution. To change a committer or tagger, use ‘set’ instead.

{ selection } append [ --rstrip ] [ --legacy ] [text]

Append text to the comments of commits and tags in the specified selection set. The text is the first token of the command and may be a quoted string. C-style escape sequences in the string are interpreted as one would expect.

If the option --rstrip is given, the comment is right-stripped before the new text is appended. If the option --legacy is given, the string ‘%LEGACY%’ in the append payload is replaced with the commit’s legacy-ID before it is appended.

[ selection ] gitify

Attempt to massage comments into a git-friendly form with a blank separator line after a summary line. This code assumes it can insert a blank line if the first line of the comment ends with ‘.’, ‘,’, ‘:’, ‘;’, ‘?’, or ‘!’. If the separator line is already present, the comment won’t be touched.

Takes a selection set, defaulting to all commits and tags.
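The heuristic described for gitify can be sketched in a few lines of Python. This is an illustration of the rule as stated here, not the tool’s actual code; the function name gitify_comment is hypothetical.

```python
def gitify_comment(comment):
    """Insert a blank separator after the summary line when the heuristic allows it."""
    lines = comment.split("\n")
    if len(lines) < 2 or lines[1] == "":
        return comment  # single-line comment, or separator already present
    if lines[0].endswith((".", ",", ":", ";", "?", "!")):
        return lines[0] + "\n\n" + "\n".join(lines[1:])
    return comment      # first line does not look like a finished summary
```

A comment such as "Fix bug.\nDetails\n" gains a blank second line, while one whose first line lacks terminal punctuation is left alone.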

[ selection ] filter [ --shell | --regex | --replace | --dedos ]

Run blobs, commit comments, or tag comments in the selection set through the filter specified on the command line.

In any mode other than --dedos, attempting to specify a selection set including both blobs and non-blobs (that is, commits or tags) throws an error. Inline content in commits is filtered when the selection set contains (only) blobs and the commit is within the range bounded by the earliest and latest blob in the specification.

With --shell, the remainder of the line specifies a filter as a shell command. Each blob or comment is presented to the filter on standard input; the content is replaced with whatever the filter emits to standard output.

When filtering blobs, if the command line contains the magic cookie '%PATHS%' it is replaced with a space-separated list of all paths that reference the blob.

With --regex, the remainder of the line is expected to be a regular expression substitution written as /from/to/ with from and to being passed as arguments to the standard re.sub() function to modify the content. Actually, any non-space character will work as a delimiter in place of the /; this makes it easier to use / in patterns. Ordinarily only the first such substitution is performed; putting ‘g’ after the slash replaces globally, and a numeric literal gives the maximum number of substitutions to perform. Other flags available restrict substitution scope - ‘c’ for comment text only, ‘C’ for committer name only, ‘a’ for author names only. Note that parsing of a --regex argument will be confused by any substring consisting of whitespace followed by #; use ‘\s’ rather than whitespace to avoid this.
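To make the shape of a --regex argument concrete, here is a minimal Python sketch that splits a substitution on its delimiter and applies it with re.sub(). It is illustrative only: the function names are hypothetical, escaped delimiters are not handled, and the scope flags ‘c’, ‘C’, and ‘a’ are accepted but ignored.

```python
import re


def parse_substitution(spec):
    """Split a spec such as '/from/to/g' into (pattern, replacement, count)."""
    delim = spec[0]                      # any non-space character may delimit
    parts = spec[1:].split(delim)
    if len(parts) != 3:
        raise ValueError("expected <delim>from<delim>to<delim>[flags]")
    pattern, replacement, flags = parts
    digits = "".join(ch for ch in flags if ch.isdigit())
    if "g" in flags:
        count = 0                        # re.sub treats 0 as "no limit"
    elif digits:
        count = int(digits)              # explicit maximum number of substitutions
    else:
        count = 1                        # default: first match only
    return pattern, replacement, count


def apply_substitution(spec, text):
    pattern, replacement, count = parse_substitution(spec)
    return re.sub(pattern, replacement, text, count=count)
```

Choosing a delimiter other than / (for example |a/b|c|g) avoids having to escape slashes inside the pattern.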

With --replace, the behavior is like --regex, but the expressions are not interpreted as regular expressions. (This is slightly faster.)

With --dedos, DOS/Windows-style \r\n line terminators are replaced with \n.

Path modifications

[ selection ] path source rename [ --force ] target

Rename a path in every fileop of every selected commit. The default selection set is all commits. The first argument is interpreted as a regular expression to match against paths; the second may contain back-reference syntax.

Ordinarily, if the target path already exists in the fileops, or is visible in the ancestry of the commit, this command throws an error. With the --force option, these checks are skipped.

[ selection ] paths [ sub | sup ] [ dirname ] [ >outfile ]

Takes a selection set. Without a modifier, list all paths touched by fileops in the selection set (which defaults to the entire repo). This reporting variant does >-redirection.

With the ‘sub’ modifier, take a second argument that is a directory name and prepend it to every path. With the ‘sup’ modifier, strip any directory argument from the start of the path if it appears there; with no argument, strip the first directory component from every path.
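The effect of the two modifiers on a list of paths can be sketched in Python as follows. The function names are hypothetical, and leaving a path with no directory component unchanged under an argument-less ‘sup’ is an assumption, since the text does not say what happens in that case.

```python
def paths_sub(paths, dirname):
    """Prepend a directory name to every path."""
    prefix = dirname.rstrip("/") + "/"
    return [prefix + p for p in paths]


def paths_sup(paths, dirname=None):
    """Strip a leading directory from each path."""
    result = []
    for p in paths:
        if dirname is None:
            # No argument: drop the first directory component, if there is one.
            head, sep, tail = p.partition("/")
            result.append(tail if sep else p)
        else:
            prefix = dirname.rstrip("/") + "/"
            result.append(p[len(prefix):] if p.startswith(prefix) else p)
    return result
```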

[ selection ] setperm {100644|100755|120000} path…​

For the selected objects (defaulting to none) take the first argument as an octal literal describing permissions. All subsequent arguments are paths. For each M fileop in the selection set and exactly matching one of the paths, patch the permission field to the first argument value.

Timequakes and time offsets

Modifying a repository so every commit in it has a unique timestamp is often a useful thing to do, in order for every commit to have a unique action stamp that can be referred to in surgical commands.

The ‘lint’ command will tell you if you have timestamp collisions.

[ selection ] timequake

Attempt to hack committer and author time stamps in the selection set (defaulting to all commits in the repository) to be unique. Works by identifying collisions between parent and child, then incrementing child timestamps so they no longer coincide. Won’t touch commits with multiple parents.

Because commits are checked in ascending order, this logic will normally do the right thing on chains of three or more commits with identical timestamps.

Any timestamp collisions left after this operation are probably cross-branch and have to be individually dealt with using ‘timebump’ commands.

[ selection ] timeoffset [ offset [ timezone ] ]

Apply a time offset to all time/date stamps in the selected set. An offset argument is required; it may be in the form [+-]ss, [+-]mm:ss or [+-]hh:mm:ss. The leading sign is optional. With no argument, the default is 1 second.

Optionally you may also specify another argument in the form [+-]hhmm, a timezone literal to apply. To apply a timezone without an offset, use an offset literal of 0, +0 or -0.
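The offset and timezone literals described above can be parsed along these lines. This Python sketch only illustrates the stated formats; the function names are hypothetical, and treating a missing sign as positive is an assumption.

```python
import re


def parse_offset(text):
    """Convert [+-]ss, [+-]mm:ss, or [+-]hh:mm:ss to signed seconds."""
    match = re.fullmatch(r"([+-]?)(\d+(?::\d+){0,2})", text)
    if not match:
        raise ValueError("bad offset: " + text)
    sign = -1 if match.group(1) == "-" else 1
    seconds = 0
    for field in match.group(2).split(":"):
        seconds = seconds * 60 + int(field)   # hh -> mm -> ss, base 60
    return sign * seconds


def parse_timezone(text):
    """Convert a [+-]hhmm literal to signed minutes east of UTC."""
    match = re.fullmatch(r"([+-]?)(\d{2})(\d{2})", text)
    if not match:
        raise ValueError("bad timezone: " + text)
    sign = -1 if match.group(1) == "-" else 1
    return sign * (int(match.group(2)) * 60 + int(match.group(3)))
```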

Those of you twitchy about "rewriting history" should bear in mind that the commit stamps in many older repositories were never very reliable to begin with.

CVS in particular is notorious for shipping client-side timestamps with timezone and DST issues (as opposed to UTC) that don’t necessarily compare well with stamps from different clients of the same CVS server. Thus, inducing a timequake in a CVS repo seldom produces effects anywhere near as large as the measurement noise of the repository’s own timestamps.

Subversion was somewhat better about this, as commits were stamped at the server, but older Subversion repositories often have sections that predate the era of ubiquitous NTP time.

Miscellanea

blob

Create a blob at mark :1 after renumbering other marks starting from :2. Data is taken from stdin, which may be a here-doc. This can be used with the add command to patch synthetic data into a repository.

renumber

Renumber the marks in a repository, from :1 up to :<n> where <n> is the count of the last mark. Just in case an importer ever cares about mark ordering or gaps in the sequence.

A side effect of this command is to clean up stray "done" passthroughs that may have entered the repository via graft operations. After a renumber, the repository will have at most one "done", and it will be at the end of the events.

[ selection ] dedup

Deduplicate blobs in the selection set. If multiple blobs in the selection set have the same SHA1, throw away all but the first, and change fileops referencing them to instead reference the (kept) first blob.
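The keep-the-first rule can be sketched in Python as below. The data layout (blobs as mark/content pairs, fileops as path/mark pairs) and the function name are assumptions for illustration, and the sketch hashes the raw content only, which may differ from how the tool computes its SHA1.

```python
import hashlib


def dedup(blobs, fileops):
    """Drop duplicate blobs and repoint fileops at the first copy.

    blobs   -- list of (mark, content_bytes) in repository order
    fileops -- list of (path, mark)
    Returns (kept_blobs, rewritten_fileops).
    """
    first_by_hash = {}   # digest -> mark of the first blob with that content
    replacement = {}     # discarded mark -> kept mark
    kept = []
    for mark, data in blobs:
        digest = hashlib.sha1(data).hexdigest()
        if digest in first_by_hash:
            replacement[mark] = first_by_hash[digest]
        else:
            first_by_hash[digest] = mark
            kept.append((mark, data))
    rewritten = [(path, replacement.get(mark, mark)) for path, mark in fileops]
    return kept, rewritten
```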

[ selection ] transcode codec

Transcode blobs, commit comments and committer/author names, or tag comments and tag committer names in the selection set to UTF-8 from the character encoding specified on the command line.

Attempting to specify a selection set including both blobs and non-blobs (that is, commits or tags) throws an error. Inline content in commits is filtered when the selection set contains (only) blobs and the commit is within the range bounded by the earliest and latest blob in the specification.

The encoding argument must name one of the codecs known to the Golang standard codecs library. In particular, ‘latin1’ is a valid codec name.

Errors in this command are fatal, because an error may leave repository objects in a damaged state.

The theory behind the design of this command is that the repository might contain a mixture of encodings used to enter commit metadata by different people at different times. After using =I to identify metadata containing non-Unicode high bytes in text, a human must use context to identify which particular encodings were used in particular event spans and compose appropriate transcode commands to fix them up.
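What a transcode amounts to, byte for byte, can be shown with a short Python sketch. It uses Python’s codec names rather than the Go codec library the command relies on, and both function names are hypothetical.

```python
def needs_transcoding(raw):
    """True if the bytes are not valid UTF-8 (a rough stand-in for an =I check)."""
    try:
        raw.decode("utf-8")
    except UnicodeDecodeError:
        return True
    return False


def transcode(raw, codec):
    """Reinterpret bytes written in `codec` and re-encode them as UTF-8."""
    return raw.decode(codec).encode("utf-8")
```

For example, the Latin-1 bytes for "café" end with the single byte 0xE9, which is not valid UTF-8 on its own; after transcoding from latin1 that byte becomes the two-byte sequence 0xC3 0xA9.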

debranch source-branch [ target-branch ]

Takes one or two arguments which must be the names of source and target branches; if the second (target) argument is omitted it defaults to refs/heads/master. Any trailing segment of a branch name is accepted as a synonym for it; thus master is the same as refs/heads/master. Does not take a selection set.

The history of the source branch is merged into the history of the target branch, becoming the history of a subdirectory with the name of the source branch. Any resets of the source branch are removed.

[ selection ] strip [ --blobs | --reduce ]

Reduce the selected repository to make it a more tractable test case. Use this when reporting bugs.

With the option ‘--blobs’, replace each blob in the repository with a small, self-identifying stub, leaving all metadata and DAG topology intact. This is useful when you are reporting a bug, for reducing large repositories to test cases of manageable size. You can do a repocutter "obscure" pass after the strip to mask even the metadata.

A selection set is effective only with the ‘--blobs’ option, defaulting to all blobs. The ‘--reduce’ mode always acts on the entire repository.

With the modifier ‘--reduce’, perform a topological reduction that throws out uninteresting commits. If a commit has all file modifications (no deletions or copies or renames) and has exactly one ancestor and one descendant, then it may be boring. To be fully boring, it must also not be referred to by any tag or reset. Interesting commits are not boring, or have a non-boring parent or non-boring child.
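The boring/interesting distinction can be sketched in Python as follows. The dictionary layout and function names are hypothetical, and reading "one ancestor and one descendant" as exactly one parent and one child is an interpretation, not something confirmed by the tool’s source.

```python
def is_boring(commit, referenced):
    """A commit is boring if it only modifies files, sits mid-chain, and is unreferenced.

    commit     -- dict with 'mark', 'ops' (fileop letters), 'parents', 'children'
    referenced -- set of marks pointed at by some tag or reset
    """
    return (all(op == "M" for op in commit["ops"])
            and len(commit["parents"]) == 1
            and len(commit["children"]) == 1
            and commit["mark"] not in referenced)


def interesting(commits, referenced):
    """Return the marks that a reduction would keep."""
    by_mark = {c["mark"]: c for c in commits}
    boring = {m for m, c in by_mark.items() if is_boring(c, referenced)}
    keep = set()
    for mark, c in by_mark.items():
        neighbors = c["parents"] + c["children"]
        # Keep non-boring commits and any commit adjacent to a non-boring one.
        if mark not in boring or any(n not in boring for n in neighbors):
            keep.add(mark)
    return keep
```

On a plain five-commit chain of modifications, only the middle commit is dropped: the root and tip are not boring, and their immediate neighbors are kept because they touch a non-boring commit.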

With no modifiers, this command strips blobs.

Artifact handling

Some commands automate fixing various kinds of artifacts associated with repository conversions from older systems.

Attributions

[ selection ] authors [ read | write ] [ <filename ] [ >filename ]

Apply or dump author-map information for the specified selection set, defaulting to all events.

Lifts from CVS and Subversion may have only usernames local to the repository host in committer and author IDs. DVCSes want email addresses (net-wide identifiers) and complete names. To supply the map from one to the other, an authors file is expected to consist of lines each beginning with a local user ID, followed by a ‘=’ (possibly surrounded by whitespace) followed by a full name and email address, optionally followed by a timezone offset field. Thus:

fred = Fred J. Foonly <foonly@foo.com> America/New_York

An authors file may also contain lines of this form

+ Fred J. Foonly <foonly@foobar.com> America/Los_Angeles

These are interpreted as aliases for the last preceding ‘=’ entry that may appear in ChangeLog files. When such an alias is matched on a ChangeLog attribution line, the author attribution for the commit is mapped to the basename, but the timezone is used as is. This accommodates people with past addresses (possibly at different locations) unifying such aliases in metadata so searches and statistical aggregation will work better.

An authors file may have comment lines beginning with ‘#’; these are ignored.

When an authors file is applied, email addresses in committer and author metadata for which the local ID matches between < and @ are replaced according to the mapping (this handles git-svn lifts). Alternatively, if the local ID is the entire address, this is also considered a match (this handles what git-cvsimport and cvs2git do). If a timezone was specified in the map entry, that person’s author and committer dates are mapped to it.
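The file format and the two matching rules can be illustrated with a small Python sketch. It parses only the ‘=’ entries (alias lines beginning with ‘+’ are skipped for brevity), and the function names and the regular expression are assumptions for illustration rather than the tool’s own parser.

```python
import re

# local-id = Full Name <email> [timezone]
ENTRY = re.compile(r"^(\S+)\s*=\s*(.+?)\s*<([^>]*)>\s*(\S+)?\s*$")


def parse_authors(text):
    """Return {local_id: (full_name, email, timezone_or_None)}."""
    mapping = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or line.startswith("+"):
            continue  # blank line, comment, or alias (ignored in this sketch)
        match = ENTRY.match(line)
        if match:
            local, name, email, tz = match.groups()
            mapping[local] = (name, email, tz)
    return mapping


def map_address(address, mapping):
    """Look up 'local' or 'local@host'; return the mapped entry or None."""
    local = address.split("@", 1)[0]
    return mapping.get(local)
```

Both a git-svn style address (local ID before the @) and a bare local ID, as produced by git-cvsimport or cvs2git, resolve to the same entry here.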

With the ‘read’ modifier, or no modifier, apply author mapping data (from standard input or a <-redirected file). May be useful if you are editing a repo or dump created by cvs2git or by git-svn invoked without -A.

With the ‘write’ modifier, write a mapping file that could be interpreted by ‘authors read’, with entries for each unique committer, author, and tagger (to standard output or a >-redirected mapping file). This may be helpful as a start on building an authors file, though each part to the right of an equals sign will need editing.

Ignore patterns

reposurgeon recognizes how supported VCSes represent file ignores (CVS .cvsignore files lurking untranslated in older Subversion repositories, Subversion ignore properties, .gitignore/.hgignore/.bzrignore files in other systems) and moves ignore declarations among these containers on repo input and output. This will be sufficient if the ignore patterns are exact filenames.

Translation may not, however, be perfect when the ignore patterns are Unix glob patterns or regular expressions. This compatibility table describes which patterns will translate; "plain" indicates a plain filename with no glob or regexp syntax or negation, "no !" means no negated regexps, and "no RE:" means the RE prefix for a regular expression does not work.

RCS has no ignore files or patterns and is therefore not included in the table.

          from CVS  from svn  from git         from hg  from bzr      from darcs  from SRC  from bk
to CVS    all       all       no ! & nonempty  all      no RE:, no !  plain       all       all
to svn    no !      all       no !             all      no RE:, no !  plain       all       all
to git    all       all       all              no !     no RE:        plain       all       all
to hg     no !      all       no !             all      no RE:, no !  plain       all       all
to bzr    all       all       all              all      all           plain       all       all
to darcs  plain     plain     plain            plain    plain         all         all       all
to SRC    no !      all       no !             all      no RE:, no !  plain       all       all

The hg rows and columns of the table describe compatibility to hg’s glob syntax rather than its default regular-expression syntax. When writing to an hg repository from any other kind, reposurgeon prepends to the output .hgignore a ‘syntax: glob’ line.

For dealing with unusual cases, there’s this:

ignores [ --rename ] [ --translate ] [ --defaults ]

Intelligent handling of ignore-pattern files. This command fails if no repository has been selected or no preferred write type has been set for the repository. It does not take a selection set.

If ‘--rename’ is present, this command attempts to rename all ignore-pattern files to whatever is appropriate for the preferred type - e.g. .gitignore for git, .hgignore for hg, etc. This option does not cause any translation of the ignore files it renames.

If ‘--translate’ is present, syntax translation of each ignore file is attempted. At present, the only transformation the code knows is to prepend a ‘syntax: glob’ header if the preferred type is hg.

If ‘--defaults’ is present, the command attempts to prepend the default ignore patterns of the preferred type to all ignore files. If no ignore file is created by the first commit, that commit will be modified to create one containing the defaults. This command will error out on preferred types that have no default ignore patterns (git and hg, in particular). It will also error out when it knows the import tool has already set default patterns.
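The ‘syntax: glob’ transformation performed by ‘--translate’ is simple enough to sketch in a few lines of Python. This is an illustration only, not reposurgeon's own code: the function name add_glob_header is made up, and leaving alone a file that already starts with the header is an assumption.

```python
def add_glob_header(hgignore_text):
    """Prepend a 'syntax: glob' line so hg reads the patterns as globs."""
    header = "syntax: glob\n"
    # Assumption: a file that already begins with the header is left alone.
    if hgignore_text.startswith(header):
        return hgignore_text
    return header + hgignore_text
```

Calling add_glob_header on the text of a translated .gitignore yields content that hg will interpret with glob rather than regular-expression syntax.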

Reference lifting

This group of commands is meant for fixing up references in commits that are in the format of older version-control systems. The general workflow is this: first, go over the comment history and change all old-fashioned commit references into machine-parseable cookies. Then, automatically turn the machine-parseable cookies into action stamps. The point of dividing the process this way is that the first part is hard for a machine to get right, while the second part is prone to errors when a human does it.

A Subversion cookie is a comment substring of the form ‘[[SVN:ddddd]]’ (example: ‘[[SVN:2355]]’) with the revision read directly via the Subversion exporter, deduced from git-svn metadata, or matching a $Revision$ header embedded in blob data for the filename.

A CVS cookie is a comment substring of the form ‘[[CVS:filename:revision]]’ (example: ‘[[CVS:src/README:1.23]]’) with the revision matching a CVS $Id$ or $Revision$ header embedded in blob data for the filename.

A mark cookie is of the form ‘[[:dddd]]’ and is simply a reference to the specified mark. You may want to hand-patch this in when one of the previous forms is inconvenient.
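If you generate or check cookies with your own tooling, the three forms above can be recognized with regular expressions. The following Python sketch is not taken from reposurgeon; the names are invented for illustration, and it assumes that a CVS filename contains neither ‘:’ nor ‘]’.

```python
import re

# Patterns for the three cookie forms described above.
SVN_COOKIE = re.compile(r"\[\[SVN:(\d+)\]\]")
# Assumption: the filename contains neither ':' nor ']'.
CVS_COOKIE = re.compile(r"\[\[CVS:([^:\]]+):(\d+(?:\.\d+)+)\]\]")
MARK_COOKIE = re.compile(r"\[\[:(\d+)\]\]")


def find_cookies(comment):
    """Return (kind, payload) pairs for every cookie found in a comment."""
    found = []
    for match in SVN_COOKIE.finditer(comment):
        found.append(("SVN", match.group(1)))
    for match in CVS_COOKIE.finditer(comment):
        found.append(("CVS", match.group(1) + ":" + match.group(2)))
    for match in MARK_COOKIE.finditer(comment):
        found.append(("mark", match.group(1)))
    return found
```

Results are grouped by cookie kind, not by position in the comment.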

An action stamp is an RFC3339 timestamp, followed by a ‘!’, followed by an author email address (author is preferred rather than committer because that timestamp is not changed when a patch is replayed on to a branch, but the code to make a stamp for a commit will fall back to the committer if no author field is present). It attempts to refer to a commit without being VCS-specific. Thus, instead of “commit 304a53c2” or “r2355”, “2011-10-25T15:11:09Z!fred@foonly.com”.
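As a minimal sketch of the format, here is how an action stamp could be built and taken apart in Python. The function names are hypothetical, and the code assumes stamps are always written in UTC with a trailing ‘Z’, as in the example above.

```python
from datetime import datetime, timezone


def make_stamp(when, email):
    """Build an action stamp from an aware datetime and an email address."""
    utc = when.astimezone(timezone.utc)
    return utc.strftime("%Y-%m-%dT%H:%M:%SZ") + "!" + email


def parse_stamp(stamp):
    """Split an action stamp back into (datetime, email)."""
    timestamp, email = stamp.split("!", 1)
    when = datetime.strptime(timestamp, "%Y-%m-%dT%H:%M:%SZ")
    return when.replace(tzinfo=timezone.utc), email
```

Splitting on the first ‘!’ is safe here because the timestamp part never contains one.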

The following git aliases allow git to work directly with action stamps. Append them to your ~/.gitconfig; if you already have an [alias] section, leave off the first line.

[alias]
	# git stamp <commit-ish> - print a reposurgeon-style action stamp
	stamp = show -s --format='%cI!%ce'

	# git scommit <stamp> <rev-list-args> - list most recent commit that matches <stamp>.
	# Must also specify a branch to search or --all, after these arguments.
	scommit = "!f(){ d=${1%%!*}; a=${1##*!}; arg=\"--until=$d -1\"; if [ $a != $1 ]; then arg=\"$arg --committer=$a\"; fi; shift; git rev-list $arg ${1:+\"$@\"}; }; f"

	# git scommits <stamp> <rev-list-args> - as above, but list all matching commits.
	scommits = "!f(){ d=${1%%!*}; a=${1##*!}; arg=\"--until=$d --after $d\"; if [ $a != $1 ]; then arg=\"$arg --committer=$a\"; fi; shift; git rev-list $arg ${1:+\"$@\"}; }; f"

	# git smaster <stamp> - list most recent commit on master that matches <stamp>.
	smaster = "!f(){ git scommit \"$1\" master --first-parent; }; f"
	smasters = "!f(){ git scommits \"$1\" master --first-parent; }; f"

	# git shs <stamp> - show the commits on master that match <stamp>.
	shs = "!f(){ stamp=$(git smasters $1); shift; git show ${stamp:?not found} $*; }; f"

	# git slog <stamp> <log-args> - start git log at <stamp> on master
	slog = "!f(){ stamp=$(git smaster $1); shift; git log ${stamp:?not found} $*; }; f"

	# git sco <stamp> - check out most recent commit on master that matches <stamp>.
	sco = "!f(){ stamp=$(git smaster $1); shift; git checkout ${stamp:?not found} $*; }; f"

There is a rare case in which an action stamp will not refer uniquely to one commit. It is theoretically possible that the same author might check in revisions on different branches within the one-second resolution of the timestamps in a fast-import stream. There is nothing to be done about this; tools using action stamps need to be aware of the possibility and throw a warning when it occurs.

In order to support reference lifting, reposurgeon internally builds a legacy-reference map that associates revision identifiers in older version-control systems with commits. The contents of this map come from three places: (1) cvs2svn:rev properties if the repository was read from a Subversion dump stream, (2) $Id$ and $Revision$ headers in repository files, and (3) the .git/cvs-revisions file created by ‘git cvsimport’.

The detailed sequence for lifting possible references is this: first, find possible CVS and Subversion references with the references or =N visibility set; then replace them with equivalent cookies; then run references lift to turn the cookies into action stamps (using the information in the legacy-reference map) without having to do the lookup by hand.

references [ list | edit | lift ] [ >outfile ]

With the modifier ‘list’, list commit and tag comments for strings that might be CVS- or Subversion-style revision identifiers. This will be useful when you want to replace them with equivalent cookies that can automatically be translated into VCS-independent action stamps. This reporting command supports >-redirection. It is equivalent to ‘=N list’.

With the modifier ‘edit’, edit the set where revision IDs are found. This version of the command supports < and > redirection. This is equivalent to ‘=N edit’.

With the modifier ‘lift’, attempt to resolve Subversion and CVS cookies in comments into action stamps using the legacy map. An action stamp is a timestamp/email/sequence-number combination uniquely identifying the commit associated with that blob, as described in [style].

It is not guaranteed that every such reference will be resolved, or even that any at all will be. Normally all references in history from a Subversion repository will resolve, but CVS references are less likely to be resolvable.
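The lifting step can be pictured as a table lookup. The Python sketch below is an illustration under stated assumptions, not reposurgeon's implementation: the legacy map is modeled as a plain dictionary keyed by the text inside the double brackets, and unresolved cookies are left untouched.

```python
import re

COOKIE = re.compile(r"\[\[((?:SVN|CVS):[^\]]+)\]\]")


def lift_cookies(comment, legacy_map):
    """Replace resolvable cookies with action stamps; leave the rest alone."""
    def replace(match):
        # Unresolved references keep their cookie form.
        return legacy_map.get(match.group(1), match.group(0))
    return COOKIE.sub(replace, comment)
```

A cookie with no entry in the map survives unchanged, which mirrors the caveat above that not every reference will resolve.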

legacy [ read | write ] [ <filename ] [ >filename ]

Apply or list legacy-reference information. Does not take a selection set. The ‘read’ variant reads from standard input or a <-redirected filename; the ‘write’ variant writes to standard output or a >-redirected filename. If neither is specified, defaults to ‘read’.

A legacy-reference file maps reference cookies to (committer, commit-date, sequence-number) triplets; these in turn (should) uniquely identify a commit. The format is two whitespace-separated fields: the cookie followed by an action stamp identifying the commit.

It should not normally be necessary to use this command. The legacy map is automatically preserved through repository reads and rebuilds, being stored in the file legacy-map under the repository subdirectory.

Changelogs

CVS, Subversion and Mercurial do not have separate notions of committer and author for changesets; when lifted to a VCS that does, like git, their one author field is used for both.

However, if the project used the FSF ChangeLog convention, many changesets will include a ChangeLog modification listing an author for the commit. In the common case that the changeset was derived from a patch and committed by a project maintainer, but the ChangeLog entry names the actual author, this information can be recovered.

Use the ‘changelogs’ command. This takes no required arguments, but may take a selection set; the default is all commits. It mines the selected ChangeLog files for authorship data.

An optional following argument is a delimited regular expression to match the basename of files that should be treated as changelogs. The default expression is ‘/^ChangeLog$/’.

It assumes such files are in the format used by FSF projects: entry header lines begin with YYYY-MM-DD and are followed by a fullname/address. When a ChangeLog file modification is found in a clique, the entry header at or before the section changed since its last revision is parsed, and the address is inserted as the commit author.

If the entry header contains an email address but no name, a name will be filled in if possible by looking for the address in author map entries.

In accordance with FSF policy for ChangeLogs, any date in an attribution header is discarded and the committer date is used. However, if the name is an author-map alias with an associated timezone, that zone is used.

The Co-Author convention described in the Linux kernel’s co-author message conventions is observed: If an attribution header is followed by a whitespace-led line containing only a valid email address, that name becomes the payload of a "Co-Author" header that is appended to the change comment for the containing commit.

The command reports statistics on how many commits were altered.
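To make the expected header format concrete, here is a hedged Python sketch that parses an FSF-style entry header. The regular expression is an approximation written for this example, not the one reposurgeon uses, and the sample name "Fred Foonly" is invented.

```python
import re

# Date, optional full name, then an address in angle brackets.
HEADER = re.compile(r"^(\d{4}-\d{2}-\d{2})\s+(.*?)\s*<([^>]+)>\s*$")


def parse_entry_header(line):
    """Return (date, name, address) for an entry header, else None."""
    match = HEADER.match(line)
    if match is None:
        return None
    return match.groups()
```

An empty name in the result corresponds to the case above where the name has to be filled in from the author map.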

Clique coalescence

When lifting a history from a version-control system that lacks changesets, it is useful to have a way to recognize cliques of per-file changes that ought to be grouped into changesets.

You won’t need this for CVS because cvs-fast-export does clique coalescence itself.

[ selection ] coalesce [ --debug | --changelog ] [ timefuzz ]

Scan the selection set for runs of commits with identical comments close to each other in time (this is a common form of scar tissue in repository up-conversions from older file-oriented version-control systems). Merge these cliques by deleting all but the last commit, in order; fileops from the deleted commits are pushed forward to that last one.

The optional timefuzz argument, if present, is a maximum time separation in seconds; the default is 90 seconds.

The default selection set for this command is =C, all commits. Occasionally you may want to restrict it, for example to avoid coalescing unrelated cliques of “*** empty log message ***” commits from CVS lifts.

With the --debug option, show messages about mismatches.

With the --changelog option, any commit with a comment containing the string ‘empty log message’ (such as is generated by CVS) and containing exactly one file operation modifying a path ending in ChangeLog is treated specially. Such ChangeLog commits are considered to match any commit before them by content, and will coalesce with it if the committer matches and the commit separation is small enough. This option handles a convention used by Free Software Foundation projects.
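The core of the coalescence pass can be sketched as follows. This Python example is a simplification for illustration, not reposurgeon's algorithm: commits are modeled as dictionaries, the time separation is assumed to be measured between consecutive commits, and the committer checks and the --changelog special case are left out.

```python
def coalesce(commits, timefuzz=90):
    """Merge runs of commits with identical comments close together in time.

    Each commit is a dict with 'time' (Unix seconds), 'comment', and
    'fileops' (a list). Commits must be in chronological order.
    """
    merged = []
    for commit in commits:
        previous = merged[-1] if merged else None
        if (previous is not None
                and previous["comment"] == commit["comment"]
                and commit["time"] - previous["time"] <= timefuzz):
            # Keep the later commit; push the earlier fileops forward.
            merged[-1] = {
                "time": commit["time"],
                "comment": commit["comment"],
                "fileops": previous["fileops"] + commit["fileops"],
            }
        else:
            merged.append(dict(commit))
    return merged
```

Each clique collapses onto its last member, which keeps that commit's timestamp and accumulates the fileops of the commits it absorbed.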

Control Options

The following options change reposurgeon’s behavior:

canonicalize

If set, import stream reads and msgin and edit will canonicalize comments by replacing CR-LF with LF, stripping leading and trailing whitespace, and then appending an LF. This behavior inverts if the crlf option is on: LF is replaced with CR-LF and a CR-LF is appended.

crlf

If set, expect CR-LF line endings on text input and emit them on output. Comment canonicalization will map LF to CR-LF.

compressblobs

Use compression for on-disk copies of blobs. Accepts an increase in repository read and write time in order to reduce the amount of disk space required while editing; this may be useful for large repositories. No effect if the edit input was a dump stream; in that case, reposurgeon doesn’t make on-disk blob copies at all (it points into sections of the input stream instead).

echo

Echo commands before executing them. Setting this in test scripts may make the output easier to read.

experimental

This flag is reserved for developer use. If you set it, it could do anything up to and including making demons fly out of your nose.

interactive

Enable interactive responses even when not on a tty.

progress

Enable fancy progress messages even when not on a tty.

relax

Continue script execution on error, do not bail out.

serial

Disable parallelism in code. Use for generating test loads.

testmode

Disable some features that cause output to vary depending on wall time, screen width, and the ID of the invoking user. Use in regression-test loads.

quiet

Suppress time-varying parts of reports.

Most options are described in conjunction with the specific operations that they modify.

Here are the commands to manipulate them. None of these take a selection set:

set [ option ]

Turn on an option flag. With no arguments, list all options.

clear [ option ]

Turn off an option flag. With no arguments, list all options.
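To make the interaction between the canonicalize and crlf options concrete, here is a small Python sketch of comment canonicalization as described under those options. It is an illustration, not the program's code; the function name and the boolean parameter standing in for the crlf option are assumptions.

```python
def canonicalize(comment, crlf=False):
    """Normalize line endings and surrounding whitespace in a comment."""
    # Replace CR-LF with LF, then strip leading and trailing whitespace.
    text = comment.replace("\r\n", "\n").strip()
    if crlf:
        # With crlf on, the mapping is inverted: LF becomes CR-LF.
        return text.replace("\n", "\r\n") + "\r\n"
    return text + "\n"
```

Either way the result ends with exactly one line terminator of the selected kind.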

Scripting and debugging support

Variables, macros, and scripts

Occasionally you will need to issue a large number of complex surgical commands of very similar form, and it’s convenient to be able to package that form so you don’t need to do a lot of error-prone typing. For those occasions, reposurgeon supports simple forms of named variables and macro expansion.

{ selection } assign [ name ]

Compute a leading selection set and assign it to a symbolic name. It is an error to assign to a name that is already assigned, or to any existing branch name. Assignments may be cleared by sequence mutations (though not ordinary deletions); you will see a warning when this occurs.

With no selection set and no name, list all assignments. This version accepts output redirection.

If the option --singleton is given, the assignment will throw an error if the selection set is not a singleton.

Use this to optimize out location and selection computations that would otherwise be performed repeatedly, e.g. in macro calls.

unassign name

Unassign a symbolic name. Throws an error if the name is not assigned.

names [ >outfile ]

List the names of all known branches and tags. Tells you what things are legal within angle brackets and parentheses.

define name body

Define a macro. The first whitespace-separated token is the name; the remainder of the line is the body, unless it is “{”, which begins a multi-line macro terminated by a line beginning with “}”.

A later ‘do’ call can invoke this macro.

The command ‘define’ by itself without a name or body produces a macro list.

do name arguments…​

Expand and perform a macro. The first whitespace-separated token is the name of the macro to be called; remaining tokens replace {0}, {1}…​ in the macro definition. Tokens may contain whitespace if they are string-quoted; string quotes are stripped. Macros can call macros.

If the macro expansion does not itself begin with a selection set, whatever set was specified before the ‘do’ keyword is available to the command generated by the expansion.

undefine name

Undefine the named macro.

Here’s an example to illustrate how you might use this. In CVS repositories of projects that use the GNU ChangeLog convention, a very common pre-conversion artifact is a commit with the comment “*** empty log message ***” that modifies only a ChangeLog entry explaining the commit immediately previous to it. The following

define changelog <{0}> & /empty log message/ squash --pushback
do changelog 2012-08-14T21:51:35Z
do changelog 2012-08-08T22:52:14Z
do changelog 2012-08-07T04:48:26Z
do changelog 2012-08-08T07:19:09Z
do changelog 2012-07-28T18:40:10Z

is equivalent to the more verbose

<2012-08-14T21:51:35Z> & /empty log message/ squash --pushback
<2012-08-08T22:52:14Z> & /empty log message/ squash --pushback
<2012-08-07T04:48:26Z> & /empty log message/ squash --pushback
<2012-08-08T07:19:09Z> & /empty log message/ squash --pushback
<2012-07-28T18:40:10Z> & /empty log message/ squash --pushback

but you are less likely to make difficult-to-notice errors typing the first version.

(Also note how the text regexp acts as a failsafe against the possibility of typing a wrong date that doesn’t refer to a commit with an empty comment. This was a real-world example from the CVS-to-git conversion of groff.)
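The substitution that ‘do’ performs can be modeled in a few lines of Python. This is a sketch, not reposurgeon's parser: it assumes shell-style quoting as implemented by the standard shlex module, and it leaves a placeholder untouched when no matching argument was supplied.

```python
import re
import shlex


def expand_macro(body, argument_line):
    """Substitute call arguments for {0}, {1}, ... in a macro body."""
    # Assumption: quoting follows shell rules; quotes are stripped.
    arguments = shlex.split(argument_line)

    def replace(match):
        index = int(match.group(1))
        if index < len(arguments):
            return arguments[index]
        return match.group(0)  # no such argument: leave the placeholder

    return re.sub(r"\{(\d+)\}", replace, body)
```

Applied to the body of the changelog macro above and one of the dates, this yields the corresponding line of the verbose version.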

script filename [ arg…​ ]

Takes a filename and optional following arguments. Reads each line from the file and executes it as a command.

During execution of the script, the script name replaces the string $0, and the optional following arguments (if any) replace the strings $1, $2 …​ $n in the script text. This is done before tokenization, so the $1 in a string like ‘foo$1bar’ will be expanded. Additionally, $$ is expanded to the current process ID (which may be useful for scripts that use tempfiles).

Within scripts (and only within scripts) reposurgeon accepts a slightly extended syntax: First, a backslash ending a line signals that the command continues on the next line. Any number of consecutive lines thus escaped are concatenated, without the ending backslashes, prior to evaluation. Second, a command that takes an input filename argument can instead take literal data using the syntax of a shell here-document. That is: if the ‘<filename’ is replaced by ‘<<EOF’, all following lines in the script up to a terminating line consisting only of ‘EOF’ will be read, placed in a temporary file, and that file fed to the command and afterwards deleted. EOF may be replaced by any string. Backslashes have no special meaning while reading a here-document.

Scripts may have comments. Any line beginning with a ‘#’ is ignored. If a line has a trailing portion that begins with one or more whitespace characters followed by ‘#’, that trailing portion is ignored.

Scripts may call other scripts to arbitrary depth.
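The argument substitution and line-continuation rules above can be sketched in Python. This is illustrative only: here-documents and comments are not handled, the function names are invented, the process ID is passed in as a parameter, and leaving an out-of-range $n unchanged is an assumption.

```python
import re


def substitute_args(text, script_name, args, pid):
    """Expand $0, $1 ... $n and $$ in script text before tokenization."""
    def replace(match):
        token = match.group(1)
        if token == "$":
            return str(pid)
        index = int(token)
        if index == 0:
            return script_name
        if index <= len(args):
            return args[index - 1]
        return match.group(0)  # assumption: unknown arguments stay as-is
    return re.sub(r"\$(\$|\d+)", replace, text)


def join_continuations(lines):
    """Concatenate lines that end in a backslash with the line that follows."""
    joined, pending = [], ""
    for line in lines:
        if line.endswith("\\"):
            pending += line[:-1]
        else:
            joined.append(pending + line)
            pending = ""
    if pending:
        joined.append(pending)
    return joined
```

Because substitution works on raw text, a parameter embedded in a longer word, as in ‘foo$1bar’, is expanded just like a standalone one.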

print output-text…​ [>outfile]

Does nothing but ship its argument line to standard output. Useful in scripts for regression tests. Output redirection is supported.

Here are some more advanced examples of scripting:

define lastchange {
@max(=B & [/ChangeLog/] & /{0}/B)? list
}

List the last commit that refers to a ChangeLog file containing a specified string. (The trick here is that ? extends the singleton set consisting of the last eligible ChangeLog blob to its set of referring commits, and list only notices the commits.)

index >index.txt
shell <index.txt awk '/refs\/tags/ {print $4}' | sort | uniq | while read t; do echo "tag $(basename "$t") rename $(basename "$t" | sed -e 's/sample/example/')"; done >renames.script
script renames.script

Mass-rename tags, replacing "sample" on the basename with "example". Illustrates a general technique of generating reposurgeon commands via shell that you then execute with the ‘script’ command. Enabling this technique is the reason as many commands as possible support redirects.

Housekeeping

gc [ percent ]

Trigger a garbage collection. Scavenges and removes all blob objects that no longer have references, e.g. as a result of delete operations on repositories. This is followed by a Go-runtime garbage collection.

The optional argument, if present, is passed as a SetGCPercent call to the Go runtime. The initial value is 100; setting it lower causes more frequent garbage collection and may reduce maximum working set, while setting it higher causes less frequent garbage collection and will raise maximum working set.

when timespec

Interconvert between git timestamps (integer Unix time plus TZ) and RFC3339 format. Takes one argument, autodetects the format. Useful when eyeballing export streams. Also accepts any other supported date format and converts to RFC3339.
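For readers who want to do the same conversion outside reposurgeon, here is a hedged Python sketch. It handles only the two formats named first above, always reports results in UTC, and makes no claim to match the exact output of the ‘when’ command.

```python
import re
from datetime import datetime, timezone


def when(timespec):
    """Convert between 'seconds +hhmm' git time and RFC3339 UTC time."""
    match = re.fullmatch(r"(\d+) ([+-]\d{4})", timespec)
    if match:
        # Unix time is already UTC-based; the offset only affects display.
        seconds = int(match.group(1))
        moment = datetime.fromtimestamp(seconds, tz=timezone.utc)
        return moment.strftime("%Y-%m-%dT%H:%M:%SZ")
    parsed = datetime.strptime(timespec, "%Y-%m-%dT%H:%M:%SZ")
    parsed = parsed.replace(tzinfo=timezone.utc)
    return "%d +0000" % int(parsed.timestamp())
```

The direction of conversion is chosen by inspecting the input, which is the same autodetection idea the command description mentions.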

Debugging and diagnostics

A few commands have been implemented primarily for debugging and regression-testing purposes, but may be useful in unusual circumstances.

The output of most of these commands can individually be redirected to a named output file. Where indicated in the syntax, you can prefix the output filename with ‘>’ and give it as a following argument.

[ selection ] resolve [ label-text…​ ]

Does nothing but resolve a selection-set expression and echo the resulting event-number set to standard output. The remainder of the line after the command is used as a label for the output.

Implemented mainly for regression testing, but may be useful for exploring the selection-set language.

log [ logclasses…​ ]

Without an argument, list all log message classes, prepending a + if that class is enabled and a - if not.

Otherwise, it expects a space-separated list of ‘<+ or -><log message class>’ entries, and enables (with +) or disables (with -) the corresponding log message class. The special keyword ‘all’ can be used to affect all the classes at the same time.

For instance, ‘log -all +shout +warn’ will disable all classes except "shout" and "warn", which is the default setting. ‘log +all -svnparse’ would enable logging everything but messages from the svn parser.

You can get a list of other log message classes from ‘help log’.

logfile [ path ]

Error, warning, and diagnostic messages are normally emitted to standard error. This command, with a nonempty path argument, directs them to the specified file instead. Without an argument, reports what logfile is set.

version [ version…​ ]

With no argument, display the program version and the list of VCSes directly supported. With an argument, declare the major version (single digit) or full version (major.minor) under which the enclosing script was developed. The program will error out if the major version has changed (which means the surgical language is not backwards compatible).

It is good practice to start your lift script with a version requirement, especially if you are going to archive it for later reference.

[ selection ] hash [--tree|--bare]

Takes a selection set, defaulting to all. For each eligible object in the set, returns its index and the same hash that Git would generate for its representation of the object. Eligible objects are blobs and commits.

With the option --bare, omit the event number; list only the hash.

With the option --tree, generate a tree hash for the specified commit rather than the commit hash. This option is not expected to be useful for anything but verifying the hash code itself.

This command supports output redirection.
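For blobs, the hash that git generates is the SHA-1 of a short header followed by the content, so you can cross-check a reported value independently. The Python sketch below shows that computation; it assumes a repository using git's SHA-1 object format and covers blobs only, not commits or trees.

```python
import hashlib


def git_blob_hash(content):
    """Return the SHA-1 that git assigns to a blob with this content."""
    # Git hashes the header "blob <size>\0" followed by the raw bytes.
    header = b"blob %d\x00" % len(content)
    return hashlib.sha1(header + content).hexdigest()
```

The empty blob, for example, hashes to e69de29bb2d1d6434b8b29ae775ad8c2e48c5391, a value you will see in many git repositories.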

Profiling

elapsed [ >outfile ]

Display elapsed time since start.

timing [ >outfile ]

Display statistics on phase timing in repository analysis. Mainly of interest to developers trying to speed up the program.

If the command has following text, this creates a new, named time mark that will be visible in a later report; this may be useful during long-running conversion recipes.

readlimit [number]

Set a maximum number of commits to read from a stream. If the limit is reached before EOF, it will be logged. Mainly useful for benchmarking. Without arguments, report the read limit; 0 means there is none.

memory [ >outfile ]

Report memory usage. Runs a garbage-collect before reporting so the figure will better reflect storage currently held in loaded repositories; this will not affect the reported high-water mark.

profile [ live | start | save ] [ args…​ ]

Profiling is enabled by default, but viewing the profile data requires either starting the HTTP server with ‘profile live’, or saving it to a file with ‘profile save’. When no arguments are given it prints out the available types of profiles. There is more detailed documentation on this command in the embedded help.

exit [ >outfile ]

Exit, reporting the time. Included here because, while EOT will also cleanly exit the interpreter, this command reports elapsed time since start.

Working with Mercurial

There is a built-in extractor class to perform extractions from Mercurial repositories.

Mercurial branches are exported as branches in the exported repository, and tags are exported as tags. By default, bookmarks are ignored. You can specify explicit handling for bookmarks by setting ‘reposurgeon.bookmarks’ in your .hg/hgrc. Set the value to the prefix that reposurgeon should use for bookmarks.

For example, if your bookmarks represent branches, put this at the bottom of your .hg/hgrc:

[reposurgeon]
bookmarks=heads/

If you do that, it’s your responsibility to ensure that branch names do not conflict with bookmark names. You can add a prefix like ‘bookmarks=heads/feature-’ to disambiguate as necessary.

Alternatively, you can import directly using hg-git-fast-import. This importer is not yet well-tested, but may be substantially faster than using the extractor harness. You may wish to run test conversions using both methods and compare them.

Mercurial subrepositories

The hg extractor does not attempt to recursively handle subrepos. Rather, it will extract the history of the top-level repo, in which .hgsub and .hgsubstate will be treated as regular files. If you wish to translate these into the semantics of your target VCS, you will need to do so with surgical primitives after reading the history into reposurgeon.

Working with Subversion

reposurgeon can read Subversion dumpfiles or repositories. If you want it to read a repository, you must run it within the top-level directory of the repository itself, not in a checkout directory made from the repository.

The transaction model of Subversion is nothing like that of the DVCSes (distributed version-control systems) that followed it. Two of the more obvious differences are around tags and branches.

A Subversion tag isn’t an annotation attached to a commit. The Subversion data model is that a history is a sequence of surgical operations on a tree; there are no annotation tags as such, a tag is just another branch of the tree. Accordingly a Subversion tag is a copy of the state of an entire branch at a particular revision. This can be losslessly translated to an annotation only if no additional commits are added to the tag branch after the copy. But nothing prevents this! reposurgeon tries to do the right thing, creating a DVCS-style annotated tag when it can and otherwise preserving the changes as commits, using a lightweight tag to point at the tip.

There is a subtler problem around branches themselves. In a DVCS, deleting a branch removes it from the repository history entirely, a fact of some significance since repositories are copied around often enough that keeping every discarded experiment forever would eventually drown the live content in superannuated cruft. Subversion repositories, on the other hand, are designed on the assumption that they sit on one server and never move. A Subversion branch is just a directory in the branch namespace; if you delete it, you won’t see it in following revisions, but if you update to an older one that content will still be there. By default, reposurgeon will delete the corresponding branches as if the deletion was done in a DVCS, keeping only the commits that are also part of other branches' histories, but you can tell it to preserve the branches instead and give them unambiguous names in the refs/deleted namespace.

Bad things can happen when a tag directory is created, copied from, deleted, then recreated from a different source directory. This is a place where the Subversion model of tags clashes badly with the changeset-DAG model used by git and other DVCSes, especially if the same tag is recreated later! The obvious thing to do when converting this sequence would be to just nuke the tag history from the deletion back to its branch point, but that will cause problems if a copy operation was ever sourced in the deleted branch (and this does happen!).

What reposurgeon does instead is preserve the most recent branch with any given name, so the view back from any branch tip in the repository has the correct content. This does mean that reposurgeon discards the content of any earlier branch having that same name; to keep it, see the --preserve option of the read command.

Reading Subversion repositories

Note that the Subversion dump reader only supports versions 1 and 2 of the dump file format, not version 3 with diff-based file changes. This shouldn’t be a problem with normal use of reposurgeon, which calls svnadmin dump in its default mode generating version 2.

Certain optional modifiers on the read command change its behavior when reading Subversion repositories:

--nobranch

Suppress branch analysis. The generated git repository will mirror the whole subversion tree, with trunk and branches as subdirectories. No directory deletions are translated to branch deletions, since no directories are seen as branches in the first place.

--ignore-properties

Suppress read-time warnings about discarded property settings.

--user-ignores

By default reposurgeon filters in-tree .gitignore files found in the history because they would clash with those generated from the svn:ignore and svn:global-ignores properties. With this option, in-tree .gitignore files are passed through instead. They will still be overridden by generated .gitignore files, so this option is often used along with --no-automatic-ignores.

--use-uuid

If the --use-uuid read option is set, the repository’s UUID will be used as the hostname when faking up email addresses, a la git-svn. Otherwise, addresses will be generated the way git cvsimport does it, simply copying the username into the address field.

--no-automatic-ignores

Do not generate .gitignore files from svn:ignore and svn:global-ignores properties.

--preserve

When a branch or tag was deleted in SVN, preserve the history up to deletion in a git ref under refs/deleted/, instead of deleting the branch and only keeping the commits that are also part of the history of other branches.

--cvsignores

Suppress the normal deletion of .cvsignore files.

These modifiers can go anywhere in any order on the command line after the read verb. They must be whitespace-separated.

It is also possible to embed a magic comment in a Subversion stream file to set these options. Prefix a space-separated list of them with the magic comment ‘ # reposurgeon-read-options:’; the leading space is required. This may be useful when synthesizing test loads; in particular, a stream file that does not set up a standard trunk/branches/tags directory layout can use this to perform a mapping of all commits onto the master branch that the git importer will accept.
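If you synthesize test loads with a script, it can help to check the magic comment before feeding the stream to reposurgeon. This Python sketch shows one way to recognize it; it is an approximation written for this example, and the exact tolerance of reposurgeon's own parser for spacing is not something this manual specifies.

```python
MAGIC = " # reposurgeon-read-options:"


def read_options(line):
    """Return the options named on a magic-comment line, or None."""
    # The leading space is part of the magic prefix and is required.
    if not line.startswith(MAGIC):
        return None
    return line[len(MAGIC):].split()
```

A line lacking the leading space is rejected, matching the requirement stated above.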

Here are the rules used for mapping subdirectories in a Subversion repository to branches:

  • At any given time there is a set of eligible paths and path wildcards which declare potential branches. See the documentation of the branchify command for how to alter this set, which initially consists of {trunk, tags/*, branches/*, and *}.

  • A repository is considered “flat” if it has no directory that matches a path or path wildcard in the branchify set. All commits in a flat repository are assigned to branch master, and what would have been branch structure becomes directory structure. In this case, we’re done; all the other rules apply to non-flat repos.

  • If you give the option --nobranch when reading a Subversion repository, branch analysis is skipped, and the repository is treated as though flat (left as a linear sequence of commits on refs/heads/master). This may be useful if your repository configuration is highly unusual and you need to do your own branch surgery. Note that this option will disable partitioning of mixed commits.

  • If ‘trunk’ is eligible, it always becomes the master branch.

  • If an element of the branchify set ends with /*, it is considered a branch namespace: each immediate subdirectory of it is considered a potential branch, unless it itself appears in branchify as a namespace. If * is in the branchify set (which is true by default) all top-level directories are also considered potential branches (other than /trunk which is mapped to master, and /tags and /branches which are namespaces by default).

  • Files in the top-level directory are assigned to a synthetic branch named ‘root’. If there is no "trunk" (or rather no master branch), then this synthetic ‘root’ branch becomes the master branch. You can map another directory to master using branchify and branchmap.

  • Each potential branch is checked to see if it has commits on it after the initial creation or copy. If there are such commits, or if the branch creation or copy introduces changes other than the copy, it becomes a branch. If not, it may become a tag in order to preserve the commit metadata. In all cases, the name of any created tag or branch is the basename of the directory, unless another mapping is in place.
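The path-to-branch part of these rules can be approximated in Python as shown below. This is a deliberately rough sketch, not the branch analyzer itself: it only handles single-level namespaces, takes file paths as input, ignores the later decision between branch and tag, does not promote ‘root’ to master when there is no trunk, and returns None wherever the rules above do not give an answer.

```python
DEFAULT_BRANCHIFY = ("trunk", "tags/*", "branches/*", "*")


def branch_name(path, branchify=DEFAULT_BRANCHIFY):
    """Guess the branch for a file path; None where the rules are silent."""
    parts = path.split("/")
    if len(parts) == 1:
        return "root"  # file in the top-level directory
    top = parts[0]
    namespaces = {entry[:-2] for entry in branchify if entry.endswith("/*")}
    if top in namespaces:
        # Each immediate subdirectory of a namespace is a potential branch,
        # named after the basename of that directory.
        return parts[1] if len(parts) > 2 else None
    if top == "trunk" and "trunk" in branchify:
        return "master"
    if "*" in branchify or top in branchify:
        return top  # any other top-level directory
    return None
```

With the default set, a path under trunk maps to master, a path under branches/stable maps to stable, and a top-level file lands on the synthetic root branch.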

Branch-creation operations with no following commits are tagified.

Otherwise, each commit that only creates or deletes directories (in particular, copy commits for tags and branches, and commits that only change properties) will be transformed into a tag named after the tag or branch, containing the date/author/comment metadata from the commit.

Subversion branch deletions are turned into deletealls, clearing the fileset of the import-stream branch. When a branch finishes with a deleteall at its tip, the deleteall is transformed into a tag. This rule cleans up after aborted branch renames.

Occasionally (and usually by mistake) a branchy Subversion repository will contain revisions that touch multiple branches. These are handled by partitioning them into multiple import-stream commits, one on each affected branch. The Legacy-ID of such a split commit will have a pseudo-decimal part - for example, if Subversion revision 2317 touches three branches, the three generated commits will have IDs 2317.1, 2317.2, and 2317.3.
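
The numbering of split commits can be illustrated with a small Python sketch. The function name and the order in which branches are numbered are assumptions made for illustration; they are not taken from reposurgeon's source.

```python
def split_legacy_ids(revision, branches):
    """Return (legacy_id, branch) pairs for one Subversion revision.

    A revision touching a single branch keeps its plain number; a mixed
    revision gets one pseudo-decimal ID per affected branch.
    The branch ordering here is an assumption of this sketch.
    """
    if len(branches) <= 1:
        return [(str(revision), b) for b in branches]
    return [(f"{revision}.{i}", b) for i, b in enumerate(branches, start=1)]


print(split_legacy_ids(2317, ["trunk", "branches/stable", "branches/dev"]))
```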

The svn:executable and svn:special properties are translated into permission settings in the input stream; svn:executable becomes 100755 and svn:special becomes 120000 (indicating a symlink; the blob contents will be the path to which the symlink should resolve).
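
As a rough illustration of the property-to-mode translation, here is a Python sketch. The function name, the precedence of svn:special over svn:executable, and the 100644 default for ordinary files are assumptions of the sketch rather than documented reposurgeon behavior.

```python
def fast_import_mode(props):
    """Pick an import-stream file mode from a dict of Subversion properties."""
    if "svn:special" in props:
        return "120000"  # symlink; the blob content is the link target
    if "svn:executable" in props:
        return "100755"  # executable regular file
    return "100644"      # assumed default for an ordinary file


print(fast_import_mode({"svn:executable": "*"}))
```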

Any cvs2svn:rev properties generated by cvs2svn are incorporated into the internal map used for reference-lifting, then discarded.

Normally, per-directory svn:ignore properties (and svn:global-ignores properties, e.g. in a site configuration file) become .gitignore files. Actual .gitignore files in a Subversion directory are presumed to have been created by git-svn users separately from native Subversion ignore properties and are discarded with a warning. It is up to the user to merge the content of such files into the target repository by hand. This behavior can be changed with the --user-ignores option, which disables filtering of in-tree .gitignore files, and the --no-automatic-ignores option, which discards Subversion svn:ignore and svn:global-ignores properties without translation.

Normally, .cvsignore files left over from a Subversion repository’s ancient history as a CVS repository are deleted. The assumption is that the repository users want the (presumably more up-to-date) Subversion ignore properties to be translated. However, this deletion can be prevented with the --cvsignores read option.

svn:mergeinfo properties are interpreted. Any svn:mergeinfo property on a revision A with a merge source containing all revisions on a branch from the forking point (or the branch start if the histories are independent) up to revision B produces a merge link such that the branch tip at revision B becomes a parent of A. The "svnmerge-integrated" properties produced by Subversion’s svnmerge.py script are handled the same way.

All other Subversion properties are discarded. (This may change in a future release.) The property for which this is most likely to cause semantic problems is svn:eol-style. However, since property-change-only commits get turned into annotated tags, the translated tags will retain information about setting changes.

The sub-second resolution on Subversion commit dates is discarded; Git wants integer timestamps only.
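
A minimal Python sketch of the truncation, assuming the usual Subversion date format with a microsecond field and a trailing Z; the function name is hypothetical.

```python
from datetime import datetime, timezone


def svn_date_to_epoch(svn_date):
    """Convert a Subversion date string to whole seconds since the epoch."""
    parsed = datetime.strptime(svn_date, "%Y-%m-%dT%H:%M:%S.%fZ")
    # int() drops the fractional part, which the import stream cannot carry.
    return int(parsed.replace(tzinfo=timezone.utc).timestamp())


print(svn_date_to_epoch("2010-11-07T16:33:54.123456Z"))  # 1289147634
```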

Because fast-import format cannot represent an empty directory, empty directories in Subversion repositories will be lost in translation.

Normally, Subversion local usernames are mapped in the style of git cvs-import; thus user ‘foo’ becomes ‘foo <foo>’, which is sufficient to pacify git and other systems that require email addresses. With the option --use-uuid, usernames are mapped in the git-svn style, with the repository’s UUID used as a fake domain in the email address. Both forms can be remapped to real addresses using the authors read command.
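
The two mapping styles can be sketched in Python as follows. The function name is hypothetical, and the exact layout of the UUID-based address is assumed from the git-svn convention rather than confirmed against reposurgeon's output.

```python
def svn_attribution(username, uuid=None):
    """Build a name-and-address pair from a Subversion local username."""
    if uuid is None:
        # Default style: the bare username doubles as the address.
        return f"{username} <{username}>"
    # Assumed git-svn style: the repository UUID stands in for a domain.
    return f"{username} <{username}@{uuid}>"


print(svn_attribution("foo"))
print(svn_attribution("foo", "0123abcd-0000-0000-0000-000000000000"))
```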

Reading a Subversion stream enables writing of the legacy map as 'legacy-id' passthroughs when the repo is written to a stream file.

reposurgeon tries hard to silently do the right thing, but there are Subversion edge cases in which it emits warnings because a human may need to intervene and perform fixups by hand. Here are the less obvious messages it may emit:

user-created .gitignore ignored

This message means reposurgeon has found a .gitignore file in the Subversion repository it is analyzing. This probably happened because somebody was using git-svn as a live gateway, and created ignores which may or may not be congruent with those in the generated .gitignore files that the Subversion ignore properties will be translated into. You’ll need to make a policy decision about which set of ignores to use in the conversion, and possibly set the --user-ignores option on read to pass through user-created .gitignore files; in that case this warning will not be emitted.

properties set

reposurgeon has detected a setting of a user-defined property, or of the Subversion property svn:externals. These properties cannot be expressed in an import stream; the user is notified in case this is a showstopper for the conversion or some corrective action is required, but normally this warning can be ignored. It is suppressed by the --ignore-properties option.

Detected link from <revision> to <revision> might be dubious

While trying to detect parent links from multiple file copies of the kind cvs2svn can produce, reposurgeon found that the source revisions of the different copies were not all the same. The link should probably be checked by hand because it has a non-negligible probability of being slightly wrong. This does not impact the tree contents, only the quality of the history.

Branchification

These commands control the branchify set, defined earlier, and how branch and tag names are to be rewritten as they are read from the Subversion repository.

branchify [ glob …​ ]

Specify the list of directories to be treated as potential branches (to become tags if there are no modifications after the creation copies) when analyzing a Subversion repo. This list is ignored when the ‘--nobranch’ read option is used. It defaults to the 'standard layout' set of directories, plus any unrecognized directories in the repository root.

String quotes and backslash escapes are interpreted when parsing the command line.

With no arguments, displays the current branchification set.

List elements are old-fashioned path glob patterns; each * matches any path segment. In particular, one asterisk at the end of a path in the set means ‘all immediate subdirectories of this path, unless they are part of another (longer) path in the branchify set’.
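
The glob rule can be approximated by the following Python sketch. It is a simplified model written for this explanation, not reposurgeon's actual algorithm: the function names are hypothetical, the last path segment is assumed to be a file name, and a directory is skipped whenever it is a literal leading part of a longer pattern.

```python
def segment_match(pattern, directory):
    """True if pattern matches directory segment by segment ('*' = any one segment)."""
    p, d = pattern.split("/"), directory.split("/")
    return len(p) == len(d) and all(ps in ("*", ds) for ps, ds in zip(p, d))


def is_leading_part(directory, pattern):
    """True if directory is a proper leading part of a longer pattern."""
    p, d = pattern.split("/"), directory.split("/")
    return len(p) > len(d) and p[:len(d)] == d


def branch_dir(path, branchify):
    """Return the potential-branch directory holding path, or None for top-level files."""
    segments = path.split("/")[:-1]  # drop the file name
    for n in range(1, len(segments) + 1):
        candidate = "/".join(segments[:n])
        if any(is_leading_part(candidate, pat) for pat in branchify):
            continue  # a namespace or part of a longer path, not a branch itself
        if any(segment_match(pat, candidate) for pat in branchify):
            return candidate
    return None


default = ["trunk", "tags/*", "branches/*", "*"]
print(branch_dir("branches/foo/src/a.c", default))
```

With the default set, branches/foo/src/a.c lands in branches/foo, while a file directly in the top-level directory yields None, corresponding to the synthetic root branch.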

Note that the branchify set is a property of the reposurgeon interpreter, not of any individual repository, and will persist across Subversion dumpfile reads. This may lead to unexpected results if you forget to re-set it.

branchmap [ /regex/branch/ …​ | reset ]

Specify the list of regular expressions used for mapping the SVN branches that are detected by branchify to their rewritten names. If none of the expressions match a particular branch, the default behavior applies. This maps a branch to the name of the last directory, except for trunk and * which are mapped to master and root.

With no arguments the current regex replacement pairs are shown. Passing ‘reset’ will clear the mapping.

String quotes and backslash escapes are not interpreted when parsing the command line, as this would clash with the use of backslashes as substitution-part references. If you need to include a non-printing character in a regexp, use its C-style escape, e.g. \s for space.

For each branch name read from the Subversion repository, this command will attempt to match the name against each regex in the map. If it finds a match, it rewrites the branch name to the associated branch. It stops after it has either found a match, or there are no more regexes left in the map. The branch name can use Go backreferences.

Note that the rewritten branch names are appended to ‘refs/’ without either the needed ‘heads/’ or ‘tags/’, so the replacement part must supply it. This allows for choosing the right kind of ref.

While the syntax template above uses slashes, any first character will be used as a delimiter (and you will need to use a different one in the common case that the paths contain slashes).
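
The first-match-wins rewriting and the delimiter convention can be sketched in Python. This is an illustration under stated assumptions: the function names are hypothetical, Python's re module stands in for Go's regular expressions (whose backreference syntax differs), and the sketch does not handle a delimiter appearing inside the regex.

```python
import re


def parse_branchmap_arg(arg):
    """Split an argument like ':regex:replacement:' on its first character."""
    delim = arg[0]
    _, regex, replacement, _ = arg.split(delim)
    return re.compile(regex), replacement


def map_branch(name, branchmap):
    """Rewrite name with the first matching rule; None means use the default."""
    for regex, replacement in branchmap:
        if regex.search(name):
            return "refs/" + regex.sub(replacement, name)
    return None


rules = [parse_branchmap_arg(a) for a in
         (":project1/trunk:heads/master:", ":project1/tags:tags:")]
print(map_branch("project1/tags/v1.0", rules))
```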

You must give this command before the Subversion repository read it is supposed to affect! This will not affect any other repository type.

Note that the branchmap set is a property of the reposurgeon interpreter, not of any individual repository, and will persist across Subversion dumpfile or repository reads. This may lead to unexpected results if you forget to re-set it.

Mid-branch deletions

When a branch A is deleted and a branch B is copied to the name A, the Subversion intent is to replace the contents of branch A with the contents of branch B, keeping the A name. This is a poor man’s merge from before "svn merge" existed. Many Subversion users who formed their habits before svn merge existed still operate this way.

In git terms, this almost corresponds to a merge of A into B followed by a rename of B to A. Branch B continues to exist, however, so we can’t do that in translation. The reposurgeon logic does not try to be clever about this, because "clever" would have rebarbative edge cases; the sequence is translated into a deleteall followed by a commit operation that recreates the B files under corresponding A names. No merge link is created. The commit filling A with a branch copy from B will have B as its first parent, though, so all that would be needed is to create a merge link from the old A before the delete to the commit recreating A.

This case is mentioned here because it is likely to confuse the merge-tracking algorithms used, e.g., by git diff, or if you ever try to merge a branch that forked off the old A to a branch spun off the new A (and expect git to know that you do not want to incorporate old A’s changes).

Multiproject Subversion repositories

Subversion repositories are sometimes organized to hold multiple projects, with the root directory containing one subdirectory per project and each subdirectory having its own trunk/branches/tags layout.

Suppose you have a stream dump from a repository with two project subdirectories, project1 and project2. The pattern for dissecting out project1 looks like this:

branchify project1/trunk project1/branches/* project1/tags/* *
branchmap :project1/trunk:heads/master: :project1/tags:tags: :project1/branches:branches:
set testmode
read <multiproject.svn
branch project2 delete

The first command branchifies every directory underneath project1 for which that’s required, with project2 left as its own branch from top level. The second command sets up a transform of these branches into a standard layout.

These transformations are performed when the actual read of the repository happens. Following that, the unneeded project2 branch can be dropped.

Of course we could have done the same thing with project2 and dropped project1. Repeat this as many times as required to turn each partial into an autonomous git repository.

While something like this could be done with repocutter sift commands, that would not correctly resolve Subversion copies across projects. This reposurgeon procedure handles those correctly.

Working with CVS

When you are converting a CVS repository using reposurgeon, most of the heavy lifting will have been done by the importer - cvs-fast-export. In particular, it coalesces CVS per-file changes into changesets when it detects that they have identical comments and attributions and are close in time, and it converts .cvsignore files to .gitignores.

A CVS repository normally consists of a set of module subdirectories and a CVSROOT directory containing metadata. cvs-fast-export ignores CVSROOT; thus you can run reposurgeon at any level of a directory tree containing CVS master files, and it will try to lift what it can see at and below the current directory it is run from.

If you do this at the top level of the repository directory, your converted repository will have a subdirectory corresponding to each module. This is normally not the way you want to do things, as CVS tags are not likely to be consistent across all modules and thus won’t lift correctly. You probably want to do individual module conversions.

One issue to keep an eye on is whether the window defining "close in time" is wide enough. If it’s not, you may detect commit groups with the same committer and comment text that should have been merged into one changeset but were not. You can either clean these up with the ‘coalesce’ command in reposurgeon or run cvs-fast-export by hand with a larger -w option and read in the generated stream.

One step in cleaning up a CVS conversion that is unique to that system is deleting root tags - tags which have "-root" as a name suffix and mark the beginning of a branch. CVS uses these for bookkeeping, but later systems don’t need them. They’re just clutter and can be removed.

It’s also worth paying careful attention to reference-lifting so that you can scrub useless CVS revision numbers out of comments. This is a more pressing issue than it is with Subversion, where changesets map to changesets, and conversions have the option of marking each target changeset with its revision number.

Troubleshooting and bug reports

Dealing with memory exhaustion

To do its job, reposurgeon needs to hold all of your history’s metadata in memory. That doesn’t mean the content part, but does mean all of the changeset attributions, comments, and tags. Given a large enough repository, this will overrun the RAM of a small machine. If this happens to you, your reposurgeon instance will die abruptly with an OOM (Out Of Memory) error while attempting to read in your repository.

It is extremely unlikely that this is due to a bug in reposurgeon. Before filing an issue about it, there is a procedure you should try. It consists of bisecting on the parameter the Go language runtime uses to control the frequency of garbage collection. You can set this using the environment variable GOGC or the reposurgeon "gc" command.

GOGC defaults to 100, which instructs the runtime to garbage-collect when the heap size is 100% bigger than (i.e., twice) the size of the reachable objects after the last garbage collection. To increase the frequency of GC, usually resulting in a lower memory high-water mark, decrease that percentage threshold. To decrease gc frequency, increase the threshold so the runtime tolerates a larger heap.

To troubleshoot your OOM problem, bisect on this threshold to find the highest value that will avoid OOM. Start by cutting it to 50, then to 25, then to 12, then to 6. If you find a value that allows you to read to completion, you may want to try increasing it by a half interval (e.g. 50 to 75, 25 to 37, etc.) to get back some throughput.
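
The arithmetic behind this procedure can be sketched in Python. The function names are hypothetical, and the trigger formula is a simplification of the Go runtime's behavior as described above (collect when the heap exceeds the live size by GOGC percent).

```python
def gc_trigger_bytes(live_bytes, gogc):
    """Approximate heap size at which the next collection starts."""
    return live_bytes * (100 + gogc) // 100


def gogc_schedule(start=100, floor=6):
    """Successive halvings of GOGC to try, stopping at the floor."""
    values = []
    value = start // 2
    while value >= floor:
        values.append(value)
        value //= 2
    return values


def half_interval_up(working):
    """Step a working value halfway back toward the last failing one."""
    return working + working // 2


print(gogc_schedule())          # [50, 25, 12, 6]
print(half_interval_up(25))     # 37
```

For example, with 4 GB of reachable data, GOGC=100 lets the heap grow to about 8 GB before collecting, while GOGC=50 caps it near 6 GB.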

If your repository won’t read in at GOGC=6 you have a real problem. Unfortunately, it’s not one the reposurgeon devteam can help you with; the correct solution to it is to do your conversion on a machine with more RAM and/or more swap configured. 64GB should be sufficient. The largest repository the reposurgeon devteam has ever seen (the history of GCC, 280K Subversion commits) fit on a 128GB machine with GOGC=30.

If you can’t read your history onto a 128GB machine with GOGC=30, then maybe the reposurgeon devteam ought to hear about it. That said, if you can find ways to make reposurgeon more efficient, we are eager to accept those patches, or even just a bug report with the details. It’s probable that there are some efficiency gains yet to be made.

Dealing with stalled conversions

Occasionally it will happen that a conversion on a particularly large or malformed repository seems to stall out, grinding endlessly without completing a conversion phase.

Reposurgeon’s execution time is dominated by cycles spent in the memory allocator and garbage collector. Thus, you can pay RAM to decrease running time - push GOGC up from its default of 100. If your conversion completes in reasonable time before your memory usage increases enough that reposurgeon gets killed by OOM, you win. Otherwise see the previous section about adding RAM and swap space.

A stallout is more difficult to troubleshoot than an OOM, and more likely to indicate an actual bug or algorithmic problem in reposurgeon. There are a couple of things you can do to make a good resolution more likely:

  1. Identify and report the phase in which the stallout occurs. Be aware that there is a known O(n**2) problem in phase C2 of the Subversion dump reader that really thrashes the allocator; that’s not reducible and we’re just going to tell you you have to throw more RAM at the problem.

  2. Use repobench to see if you can identify a revision that triggers the stall. The procedure for this is to use it to step your readlimit up from zero until you see the runtime spike.

  3. As always, provide a stripped (and possibly obscured) dump of the repository for testing.

How to report bugs

It is generally not possible to reproduce reposurgeon/repocutter bugs without a copy of the history on which they occurred. When you find a bug, please send the maintainers:

(a) An exact description of how observed behavior differed from expected behavior. If reposurgeon/repocutter died, include the backtrace.

(b) A git fast-import or Subversion dump file of the repository you were operating on, or a pointer to where it can be pulled from.

(c) A script containing the sequence of reposurgeon or repocutter commands that revealed the bug. If you were exploring interactively, remember that the "history" command exists and can dump your command history to a file.

(d) If you were using the standard-workflow Makefile generated by "repotool initialize", mention that in your bug report. If you modified the Makefile, include a copy with the bug report.

Please use the reposurgeon project’s issue tracker and attach these files. It’s helpful.

Are you seeing git die with a complaint about an unknown --show-original-IDs option? Upgrade your git; reposurgeon needs 1.19.2 or later.

Test case reduction

If you know how to reproduce the error, the best possible test case is a hand-crafted dump stream of minimal size with content that explains how it breaks the tool. Those are turned into regression tests instantly.

When you don’t know the cause of the error, ship the project a dump file derived from the real repository that triggered it. To speed up the debugging process so you can get an answer more quickly, there are some tactics you can use to reduce the bulk of the test case you send. Also, a well-reduced dump can become a regression test to ensure the bug does not recur.

How to make dumps in Git: cd to your git repository and capture the output of "repotool export".

How to make dumps in Subversion: cd to the toplevel directory of the repository master - not a checkout of it. You can tell you’re in the right place if you see this:

$ ls
conf  db  format  hooks  locks  README.txt

Then run "repotool export", capturing the output.

The commands you will use for test-case reduction are reposurgeon and, on Subversion dumps, repocutter.

Replace the content blobs in the dump with stubs

The subcommand in both tools is 'strip'; it will usually cut the size of the dump by more than a factor of 10. Check that the bug still reproduces on the stripped dump; if it doesn’t, that would be unprecedented and interesting in itself.

If you are trying to maintain confidentiality about your code, sending the maintainers a stripped repo has the additional advantage that the code won’t be disclosed! The command preserves structure and metadata but replaces each content blob with a unique magic cookie.

If you don’t want to disclose even the metadata, you can do a repocutter "obscure" pass after the strip. This will mask file paths and developer names.

Truncate the dump as much as possible

Try to truncate the dump to the shortest leading section that reproduces the bug.

A reposurgeon error message will normally include a mark, event number, or (when reading a Subversion dump) a Subversion revision number. Use a selection-set argument to reposurgeon’s 'write' command, or the 'select' subcommand of repocutter, to pare down the dump so that it ends just after the error is triggered. Again, check to ensure that the bug reproduces on the truncated dump.

If the error message doesn’t tell you where the problem occurred, try a bisection process. Use the --readlimit option of the read command to ignore the last half of the events in the dump; check to see if the bug reproduces. If it does, repeat; if it does not, throw out the last quarter, then the last eighth, and so forth. Keep this up until you can no longer drop events without making the bug go away.
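
A slightly more systematic variant of this bisection is an ordinary binary search for the smallest read limit that still reproduces the bug. The Python sketch below assumes a hypothetical callback, reproduces(limit), that runs your reproduction script with that read limit and reports whether the bug appeared; it also assumes the bug shows up on the full dump and on every prefix that includes the triggering event.

```python
def smallest_failing_prefix(total_events, reproduces):
    """Binary-search for the shortest leading section that shows the bug."""
    lo, hi = 0, total_events  # bug absent at lo, present at hi
    while hi - lo > 1:
        mid = (lo + hi) // 2
        if reproduces(mid):
            hi = mid   # bug still there: the tail past mid is not needed
        else:
            lo = mid   # bug gone: the trigger lies after mid
    return hi


# Stand-in for a real test run: pretend event 2317 triggers the bug.
print(smallest_failing_prefix(10000, lambda limit: limit >= 2317))  # 2317
```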

Bisection is more effective than you might expect, because the kinds of repository malformations that trigger misbehavior from reposurgeon tend to rise in frequency as you go back in time. The largest single category of them has been ancient cruft produced by cvs2svn conversions.

Topologically reduce the dump

Next, topologically reduce the dump, throwing out boring commits that are unlikely to be related to your problem.

If a commit has all file modifications (no additions or deletions or copies or renames) and has exactly one ancestor and one descendant, then it may be boring. In a Subversion dump it also has to not have any property changes; in a git dump it must not be referred to by any tag or reset. Interesting commits are not boring, or have a not-boring parent or not-boring child.
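
The boring/interesting distinction can be modeled with a short Python sketch. The Commit class and its field names are invented for illustration and do not mirror the data structures of reposurgeon or repocutter.

```python
from dataclasses import dataclass, field


@dataclass
class Commit:
    ops: list                                      # fileop kinds, e.g. ["M", "D"]
    parents: list = field(default_factory=list)    # indices of parent commits
    children: list = field(default_factory=list)   # indices of child commits
    referenced: bool = False  # tag/reset points here (git) or properties changed (svn)


def is_boring(commit):
    """Only modifications, exactly one parent and one child, nothing referring to it."""
    return (all(op == "M" for op in commit.ops)
            and len(commit.parents) == 1
            and len(commit.children) == 1
            and not commit.referenced)


def interesting(commits):
    """Indices of commits that are not boring or that neighbor a not-boring commit."""
    boring = [is_boring(c) for c in commits]
    return {i for i, c in enumerate(commits)
            if not boring[i] or any(not boring[j] for j in c.parents + c.children)}


# A linear history of six modification-only commits.
chain = [Commit(["M"], [] if i == 0 else [i - 1], [] if i == 5 else [i + 1])
         for i in range(6)]
print(sorted(interesting(chain)))  # [0, 1, 4, 5]
```

In the linear example only the two middle commits are dropped: the root and tip are not boring, and their neighbors are kept because they touch a not-boring commit.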

Try using the 'reduce' subcommand of repocutter to strip boring commits out of a Subversion dump. For a git dump, look at reposurgeon’s "strip --reduce" command.

Prune irrelevant branches

Try to throw away branches that are not relevant to your problem. The 'expunge' operation on repocutter or the 'branch delete' command in reposurgeon may be helpful.

This is the simplification most likely to make your bug vanish, so check that it still reproduces after each branch deletion.

Know how to spot possible importer bugs

If your target VCS’s importer dies during a rebuild, try writing the repository content to a stream instead and importing the stream by hand. If the latter does not fail, the target VCS’s importer may be slightly buggy - but you have a workaround.

(This has been observed under git 2.5.0 with the result of a unite operation on two repositories. The cause is unknown, as git dies suddenly enough to not leave a crash report.)

Benchmarking

A fair amount of effort has been expended to keep the run-time performance of reposurgeon as linear as possible. This is not an easy state to stay in; it is unfortunately quite simple to accidentally regress this without noticing.

To that end, there are some fairly simple scripts in the bench directory of the source distribution that can be used to check for this type of problem. repobench runs reposurgeon multiple times with a different readlimit each time, recording the run time and memory allocated at each iteration. Supply arguments specifying the svn dump file to read and the readlimit values to use like this:

    ./repobench your-dump-file.svn 1000 2000 20000

This reads your-dump-file.svn 10 times, with the readlimit set first to 1,000, then 3,000, and so on in steps of 2,000, stopping before it would exceed 20,000.
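
To make the stepping concrete, here is a Python sketch that lists the readlimit values implied by the example. Reading the three numbers as first value, step, and upper bound is an inference from the description above, and the function name is hypothetical.

```python
def readlimits(first, step, last):
    """Readlimit values from first, in increments of step, not exceeding last."""
    return list(range(first, last + 1, step))


limits = readlimits(1000, 2000, 20000)
print(len(limits), limits[0], limits[1], limits[-1])  # 10 1000 3000 19000
```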

This produces a .dat file which you can use with repobench -p, or repobench -o to produce graphs.

For an example, see oops.svg. This shows a graph made using a good revision that had linear performance, several made with revisions that introduced a regression that made performance quite non-linear, and the fix. You can easily tell the difference visually.

Incompatible language changes

Reposurgeon scripts are effectively never reused. Thus, incompatible changes to the command language don’t have a high cost in pain to users, and the maintainers feel free to make them whenever improvement seems possible. But just in case, such changes are recorded here.

In versions before 4.10, the "reduce" and "blob" options of the "strip" command were bare keywords. Also the options of the "ignores" command were bare keywords. There was a command to set the prompt string that has been retired.

In versions before 4.8, the expunge command run on a repository named "foo" tried to keep deleted fileops in a new synthetic repository named "foo-expunges". This feature has been replaced by the "~" negation operator on expunge selections.

In versions before 4.1, the index command did not see blobs by default.

In versions before 4.0, msgin and msgout were named "mailbox_in" and "mailbox_out"; branchify was "branchify_map". Previous versions used the Python variant of regular expressions; some of the more idiosyncratic features of these are not replicated in the Go implementation.

In versions before 3.23, ‘prefer’ changed the repository type as well as the preferred output format. Since then, do this with ‘sourcetype’.

In versions before 3.0, the general command syntax put the command verb first, then the selection set (if any) then modifiers (VSO). It has changed to optional selection set first, then command verb, then modifiers (SVO). The change made parsing simpler, allowed abolishing some noise keywords, and recapitulates a successful design pattern in some other Unix tools - notably sed(1).

In versions before 3.0, path expressions only matched commits, not commits and the associated blobs as well. The names of the "a" and "c" flags were different.

In reposurgeon versions before 3.0, the delete command had the semantics of squash; also, the policy flags did not require a ‘--’ prefix. The ‘--delete’ flag was named "obliterate".

In reposurgeon versions before 3.0, read and write optionally took file arguments rather than requiring redirects (and the write command never wrote into directories). This was changed in order to allow these commands to have modifiers. These modifiers replaced several global options that no longer exist.

In reposurgeon versions before 3.0, the earliest factor in a unite command always kept its tag and branch names unaltered. The new rule for resolving name conflicts, giving priority to the latest factor, produces more natural behavior when uniting two repositories end to end; the master branch of the second (later) one keeps its name.

In reposurgeon versions before 3.0, the tagify command expected policies as trailing arguments to alter its behavior. The new syntax uses similarly named options with leading dashes, that can appear anywhere after the tagify command.

In versions before 2.9, the syntax of authors, legacy, list, and what are now msg{in|out} was different (and legacy was fossils). They took plain filename arguments rather than using redirect < and >.

In versions so old that the changeover point is now lost in the mists of time, curly brackets (not parens) performed subexpression grouping.

Emergency help

If you need emergency help, go to the #reposurgeon IRC channel on freenode. Be aware, however, that the maintainer is too busy to babysit difficult repository conversions unless he has explicitly volunteered for one or someone is paying him to care about it. For explanation, see Your money or your spec.

Stream syntax extensions

The event-stream parser in reposurgeon supports some extended syntax. Exporters designed to work with reposurgeon may have a --reposurgeon option that enables emission of extended syntax; notably, this is true of cvs-fast-export(1). The remainder of this section describes these syntax extensions. The properties they set are (usually) preserved and re-output when the stream file is written.

The token ‘#reposurgeon’ at the start of a comment line in a fast-import stream signals reposurgeon that the remainder is an extension command to be interpreted by reposurgeon.

One such extension command is implemented: ‘sourcetype’, which behaves identically to the reposurgeon sourcetype command. An exporter for a version-control system named "frobozz" could, for example, say

#reposurgeon sourcetype frobozz

Within a commit, a magic comment of the form ‘#legacy-id’ declares a legacy ID from the stream file’s source version-control system.

Also accepted is the bzr syntax for setting per-commit properties. While parsing commit syntax, a line beginning with the token ‘property’ must continue with a whitespace-separated property-name token. If it is then followed by a newline it is taken to set that boolean-valued property to true. Otherwise it must be followed by a numeric token specifying a data length, a space, following data (which may contain newlines) and a terminating newline. For example:

commit refs/heads/master
mark :1
committer Eric S. Raymond <esr@thyrsus.com> 1289147634 -0500
data 16
Example commit.

property legacy-id 2 r1
M 644 inline README

Unlike other extensions, bzr properties are only preserved on stream output if the preferred type is bzr, because any importer other than bzr’s will choke on them.

Limitations and guarantees

Guarantee: In DVCSes that use commit hashes, editing with reposurgeon never changes the hash of a commit object unless (a) you edit the commit, or (b) it is a descendant of an edited commit in a VCS that includes parent hashes in the input of a child object’s hash (git and hg both do this).

Guarantee: reposurgeon only requires main memory proportional to the size of a repository’s metadata history, not its entire content history. (Exception: the data from inline content is held in memory.)

Guarantee: In the worst case, reposurgeon makes its own copy of every content blob in the repository’s history and thus uses intermediate disk space approximately equal to the size of a repository’s content history. However, when the repository to be edited is presented as a stream file, reposurgeon requires no or only very little extra disk space to represent it; the internal representation of content blobs is a (seek-offset, length) pair pointing into the stream file.
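
The (seek-offset, length) representation can be pictured with a small Python sketch. The class and method names are invented for illustration; reposurgeon itself is written in Go and its internals may differ.

```python
import io


class BlobRef:
    """A content blob held as a (seek-offset, length) pair into a stream file."""

    def __init__(self, stream, offset, length):
        self.stream = stream
        self.offset = offset
        self.length = length

    def content(self):
        # Nothing is copied until the content is actually requested.
        self.stream.seek(self.offset)
        return self.stream.read(self.length)


raw = b"blob\nmark :1\ndata 5\nhello\n"
ref = BlobRef(io.BytesIO(raw), raw.index(b"hello"), 5)
print(ref.content())  # b'hello'
```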

Guarantee: reposurgeon never modifies the contents of a repository it reads, nor deletes any repository. The results of surgery are always expressed in a new repository.

Guarantee: Any line in a fast-import stream that is not a part of a command reposurgeon parses and understands will be passed through unaltered. At present the set of potential passthroughs is known to include the progress, options, and checkpoint commands as well as comments led by #.

Guarantee: All reposurgeon operations either preserve all repository state they are not explicitly told to modify or warn you when they cannot do so.

Guarantee: reposurgeon handles the bzr commit-properties extension, correctly passing through property items including those with embedded newlines. (Such properties are also editable in the message-box format.)

Limitation: Because reposurgeon relies on other programs to generate and interpret the fast-import command stream, it is subject to bugs in those programs.

Limitation: bzr suffers from deep confusion over whether its unit of work is a repository or a floating branch that might have been cloned from a repo or created from scratch, and might or might not be destined to be merged to a repo one day. Its exporter only works on branches, but its importer creates repos. Thus, a rebuild operation will produce a subdirectory structure that differs from what you expect. Look for your content under the subdirectory ‘trunk’.

Limitation: under git, signed tags are imported verbatim. However, any operation that modifies any commit upstream of the target of the tag will invalidate it.

Limitation: Stock git (at least as of version 1.7.3.2) will choke on property extension commands. Accordingly, reposurgeon omits them when rebuilding a repo with git type.

Limitation: Converting an hg repo that uses bookmarks (not branches) to git can lose information; the branch ref that git assigns to each commit may not be the same as the hg bookmark that was active when the commit was originally made under hg. Unfortunately, this is a real ontological mismatch, not a problem that can be fixed by cleverness in reposurgeon.

Limitation: Converting an hg repo that uses branches to git can lose information because git does not store an explicit branch as part of commit metadata, but colors commits with branch or tag names on the fly using a specific coloring algorithm, which might not match the explicit branch assignments to commits in the original hg repo. Reposurgeon preserves the hg branch information when reading an hg repo, so it is available from within reposurgeon itself, but there is no way to preserve it if the repo is written to git.

Limitation: Not all BitKeeper versions have the fast-import and fast-export commands that reposurgeon requires. They are present back to the 7.3 opensource version.

Limitation: reposurgeon may misbehave under a filesystem which smashes case in filenames, or which nominally preserves case but maps names differing only by case to the same filesystem node (Mac OS X behaves like this by default). Problems will arise if any two paths in a repo differ by case only. To avoid the problem on a Mac, do all your surgery on an HFS+ file system formatted with case sensitivity specifically enabled.
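Before doing surgery on such a filesystem, it can be worth checking whether a repo contains any paths that differ by case only. The following Go program is a minimal sketch of such a check and is not part of reposurgeon. The function name caseCollisions is invented for this example, and strings.ToLower is only an approximation of what a real filesystem does when it folds case.

```go
package main

import (
	"fmt"
	"sort"
	"strings"
)

// caseCollisions groups paths that differ only by letter case.
// It is a hypothetical helper for illustration; strings.ToLower is a
// simplification of real filesystem case folding.
func caseCollisions(paths []string) [][]string {
	seen := make(map[string]bool)
	groups := make(map[string][]string)
	for _, p := range paths {
		if seen[p] {
			continue // an exact duplicate is not a collision
		}
		seen[p] = true
		key := strings.ToLower(p)
		groups[key] = append(groups[key], p)
	}
	// Keep only folded names that more than one distinct path maps to.
	keys := make([]string, 0, len(groups))
	for k, g := range groups {
		if len(g) > 1 {
			keys = append(keys, k)
		}
	}
	sort.Strings(keys)
	var out [][]string
	for _, k := range keys {
		g := groups[k]
		sort.Strings(g)
		out = append(out, g)
	}
	return out
}

func main() {
	paths := []string{"README", "src/Makefile", "Readme", "src/makefile", "doc/a.txt"}
	for _, group := range caseCollisions(paths) {
		fmt.Println(strings.Join(group, " <-> "))
	}
}
```

Any group this reports is a pair (or more) of paths that would land on the same filesystem node under a case-smashing filesystem.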

Limitation: If whitespace followed by # appears in a string or regexp command argument, it will be misinterpreted as the beginning of a line-ending comment and screw up parsing.
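To see why this happens, consider a comment stripper that follows the rule as stated. The Go sketch below is an illustration only, not reposurgeon's actual lexer. The function stripComment and the command name "mycommand" are invented for this example.

```go
package main

import (
	"fmt"
	"strings"
)

// stripComment mimics the rule described above: the first whitespace
// character followed by '#' begins a line-ending comment. This is a
// hypothetical illustration, not code taken from reposurgeon.
func stripComment(line string) string {
	for i := 0; i+1 < len(line); i++ {
		if (line[i] == ' ' || line[i] == '\t') && line[i+1] == '#' {
			return strings.TrimRight(line[:i], " \t")
		}
	}
	return line
}

func main() {
	// An intended comment is removed, as expected.
	fmt.Printf("%q\n", stripComment("mycommand # explain what this does"))
	// A regexp argument containing " #" is truncated at the same point.
	fmt.Printf("%q\n", stripComment("mycommand /issue #42/"))
	// Without preceding whitespace the '#' is left alone.
	fmt.Printf("%q\n", stripComment("mycommand /issue#42/"))
}
```

Because the rule looks only at the characters and not at whether they fall inside a delimited argument, the second call loses the tail of its regexp.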

Guarantee: As version-control systems add support for the fast-import format, their repositories will become editable by reposurgeon.

Credits

These are in roughly descending magnitude.

Eric S. Raymond <esr@thyrsus.com>

Designer and original author.

Julien "FrnchFrgg" RIVAUD <frnchfrgg@free.fr>

Lots of high-quality code cleanups and speed tuning. Responsible for at least half of the massively revamped Subversion dump reader on the 4.0 releases. Ported the CoW filemaps from Python to Go.

Daniel Brooks <db48x@db48x.net>

Date unit testing, improvements for split and expunge commands. Assistance on Python to Go port. Go profiling support. Several significant reductions in total run time, total allocations, and max heap usage.

Greg Hudson <ghudson@MIT.EDU>

Contributed copy-on-write filemaps, which both tremendously sped up Subversion dumpfile parsing and squashed a nasty bug in the older code. While his CoW implementation was eventually replaced with one by Julien Rivaud, it busted the project out of a nearly two-year slump.

Eric Sunshine <sunshine@sunshineco.com>

Review of seldom-used features, test improvements, bug-fixing. Generalized selection expression parser for use-cases other than events. Converted selection parser, which evaluated an expression while parsing it, to a compile/evaluate paradigm in which a selection expression can be compiled once and evaluated many times. Added 'attribution' command. Added 'reorder' command. Assistance on Python to Go port.

Edward Cree <ec429@cantab.net>

Wrote the Hg extractor class and its test.

Ian Bruene <ianbruene@gmail.com>

Wrote the kommandant package and the Go port of Python difflib in order to support this package.

Chris Lemmons <alficles@gmail.com>

Solved some problems with inline blobs, improved interoperability with Mercurial, wrote the --prune option for graft.

Richard Hansen <rhansen@rhansen.org>

Selections as ordered rather than compulsorily sorted sets. The generalized reparent command. Improvements in regression-test infrastructure.

Peter Donis <peterdonis@alum.mit.edu>

Python 3 port and Python2/3 interoperability. Historical: none of this survived the port to Go.

Appendix A: The ontological-mismatch problem and its consequences

There are many tools for converting repositories between version-control systems out there. This appendix explains why reposurgeon is the best of breed by comparing it to the competition.

The problems other repository-translation tools have come from ontological mismatches between their source and target systems - models of changesets, branching and tagging can differ in complicated ways. While these gaps can often be bridged by careful analysis, the techniques for doing so are algorithmically complex, difficult to test, and have ugly edge cases.

Furthermore, doing a really high-quality translation often requires human judgment about how to move artifacts - and what to discard. But most lifting tools are, unlike reposurgeon, designed as run-it-once batch processors that can only implement simple and mechanical rules.

Consequently, most repository-translation tools evade the harder problems. They produce a sort of pidgin rendering that crudely and partially copies the history from the source system to the target without fully translating it into native idioms, leaving behind metadata that would take more effort to move over or leaving it in the native format for the source system.

But pidgin repository translations are a kind of friction drag on future development, and are just plain unpleasant to use. So instead of evading the hard problems, reposurgeon provides a power assist for a human to tackle them head-on.

Here are some specific symptoms of evasion that are common enough to deserve tags for later reference.

LINEAR: One very common form of evasion is only handling linear histories.

NO_IGNORES: There are many different mechanisms for ignoring files - .cvsignore, Subversion svn:ignore properties, .gitignore and their analogues. Many older Subversion repositories still have .cvsignore files in them as relics of CVS prehistory that weren’t translated when the repository was lifted. Reposurgeon, on the other hand, knows these can be changed to .gitignore files and does it.

NO_TAGS: Many repository translators cannot generate annotated tags (or their non-git equivalents) even when that would be the right abstraction in the target system.

CONFIGURATION: Another common failing is for repository-translation tools to require a lot of configuration and ceremony before they can operate. Often, for example, tools that translate from Subversion repositories require you to declare the repository’s branch structure every time even though sensible defaults and a bit of autodetection could have avoided this.

MIXEDBRANCH: Yet another case usually handled poorly (in translators that handle branching) is mixed-branch commits. In Subversion it is possible (though a bad idea) to commit a changeset that modifies multiple branches at once. All sufficiently old Subversion repositories have these, often by accident. The proper thing to do is split these up; the usual thing is to assign them to one branch and leave them omitted from the others.
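As a rough picture of what splitting a mixed-branch commit involves, the Go sketch below groups the paths touched by one Subversion revision by the branch they belong to. It assumes the standard trunk/branches/tags layout and is not reposurgeon's algorithm. The names branchOf and splitByBranch are invented for this example.

```go
package main

import (
	"fmt"
	"sort"
	"strings"
)

// branchOf returns the branch a Subversion path belongs to and the
// path relative to that branch. It assumes a standard
// trunk/branches/tags layout (an assumption of this sketch).
func branchOf(path string) (branch, rest string) {
	parts := strings.Split(path, "/")
	switch {
	case parts[0] == "trunk":
		return "trunk", strings.Join(parts[1:], "/")
	case (parts[0] == "branches" || parts[0] == "tags") && len(parts) >= 2:
		return parts[0] + "/" + parts[1], strings.Join(parts[2:], "/")
	}
	return "", path // outside any recognized branch
}

// splitByBranch partitions the paths of one revision so that each
// branch gets its own list, which could become its own commit.
func splitByBranch(paths []string) map[string][]string {
	out := make(map[string][]string)
	for _, p := range paths {
		b, rest := branchOf(p)
		out[b] = append(out[b], rest)
	}
	return out
}

func main() {
	changed := []string{"trunk/src/main.c", "branches/stable/src/main.c", "trunk/README"}
	groups := splitByBranch(changed)
	names := make([]string, 0, len(groups))
	for name := range groups {
		names = append(names, name)
	}
	sort.Strings(names)
	for _, name := range names {
		fmt.Printf("%s: %v\n", name, groups[name])
	}
}
```

A revision whose paths fall into more than one group is a mixed-branch commit. A translator that evades the problem keeps one group and drops the rest.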

Version references in commit comments. It is not uncommon to see a lot of references that are no longer usable embedded in translated repositories like fossils in geological strata - file-version numbers like '1.2' in Subversion repos that had a former life in CVS, Subversion references like 'r1234' in git repositories, and so forth. There’s no tag for this because tools other than reposurgeon generally have no support at all for lifting these.
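The first step in dealing with such fossils is finding them. The Go sketch below locates Subversion-style references in a commit comment. It is an illustration under simple assumptions, not reposurgeon's implementation. The name findSvnRefs and the pattern are invented for this example, and turning each match into a reference that is valid in the target system is a separate and harder step that is not shown.

```go
package main

import (
	"fmt"
	"regexp"
)

// svnRef matches tokens such as "r1234" standing alone as a word.
// The pattern is an assumption of this sketch and will also match
// things that merely look like revision references.
var svnRef = regexp.MustCompile(`\br[0-9]+\b`)

// findSvnRefs returns every candidate Subversion revision reference
// found in a commit comment, in order of appearance.
func findSvnRefs(comment string) []string {
	return svnRef.FindAllString(comment, -1)
}

func main() {
	comment := "Fix regression from r1234; see also r99 and var123."
	for _, ref := range findSvnRefs(comment) {
		fmt.Println(ref)
	}
}
```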

To avoid repetitive text in these descriptions, we use the following additional bug tags:

ABANDONED: Effectively abandoned by its maintainer. Some tools with this tag are still nominally maintained but have not been updated or released in years.

NO_DOCUMENTATION: Poorly (if at all) documented.

!FOO means the tool is known not to have problem FOO.

?FOO means the author has not tried the tool but has strong reason to suspect the problem is present based on other things known about it.

You should assume that none of these tools do reference-lifting.

cvs2svn

Just after the turn of the 21st century, when Subversion was the new thing in version control, most projects that were using version control were using CVS, and cvs2svn was about the only migration path.

Early cvs2svn had problems on every level, only some of which have been fixed by more recent releases. It tended to spew junk commits into the translated history, and produced strange combinations of Subversion internal operations that most later translation tools would cope with only poorly. Sometimes the resulting translations are actually malformed; more often they contain noisy commits or commit duplications that made little sense under Subversion and make even less under the new target system.

!LINEAR, ?MIXEDBRANCH, DOCUMENTATION

cvs-fast-export

Formerly named parsecvs. Originally written by Keith Packard to port the X.org repositories, which it did a good job on. Now maintained by me; reposurgeon uses it to read CVS repositories. It is extremely fast and can thus be productively used even on huge repositories.

!ABANDONED, !LINEAR, !NO_IGNORES, !NO_DOCUMENTATION, !CONFIGURATION

cvsps

Don’t use this. Just plain don’t. The author maintained version 3.x until end-of-lifing it in favor of cvs-fast-export due to fundamental, unfixable problems. It gets branch topology wrong in ways that are difficult to detect.

git-svn

git-svn, the Subversion converter in the git distribution, is really designed to be a two-way live gateway enabling git users to push and pull commits from a Subversion server. It operates by creating a git repository that is effectively a local mirror of the Subversion history, then performing Subversion client commands to synchronize the two in a git-like way.

This choice of mission means that git-svn’s translation of history into git uses a compromise between Subversion idioms and git ones that is more designed to make transactions back to the Subversion server easy and safe to generate than it is to make full use of the git capabilities that Subversion doesn’t have. This is pidgin translation for a reason better than laziness or failure of nerve, but it’s still pidgin.

Worse, git-svn has bugs that severely compromise it for full translations. It tends to stumble over common repository malformations in Subversion, producing history damage that is significant but evades superficial scrutiny. The author has written about this problem in detail in "Don’t do svn-to-git repository conversions with git-svn!"

For a straight linear history with no tags or branches, the difference between git-svn’s Subversion-emulating behavior and the way a git repository would most naturally be structured is minimal. But for conformability with Subversion, git-svn cannot (practically speaking) use git’s annotated-tag facility in the local mirror; instead, Subversion tags have to be represented in the local mirror as git branches even if they have no changes after the initial branch copy.

Another thing the live-gatewaying use case prevents is reference-lifting. Subversion references like "r1234" in commit comments have to be left as-is to avoid creating pain for users of the same Subversion remote not going through git-svn.

git-svn was used by Google Code’s exporter and is used by GitHub’s importer web service. Depending on the latter is not recommended.

!ABANDONED, MIXEDBRANCH, NO_TAGS, NO_IGNORES.

git-svnimport

Formerly part of the git suite; what they had before git-svn, and inferior to it. Among other problems, it can only handle Subversion repos with a "standard" trunk/tags/branches layout. Now deprecated.

MIXEDBRANCH, NO_TAGS, NO_IGNORES, ABANDONED.

git-svn-import

A trivial wrapper around git-svn. All the reasons not to use git-svn apply to it as well.

MIXEDBRANCH, NO_TAGS, NO_IGNORES, !ABANDONED.

svn-fe

svn-fe was a valiant effort to produce a tool that would dump a Subversion repository history as a git fast-import stream. It made it into the git contrib directory and lingers there still.

LINEAR, NO_TAGS, NO_IGNORES, ABANDONED.

Tailor

Tailor aimed to be an any-to-any repository translator.

LINEAR, ?NO_IGNORES, ABANDONED.

agito

This is a Subversion-to-git tool that was written to handle some cases that git-svn barfs on (but reposurgeon doesn’t - the reposurgeon test suite contains a case sent by agito’s author to check this). It even handles mixed-branch commits correctly.

!LINEAR, !NO_TAGS, !MIXEDBRANCH, CONFIGURATION.

If you cannot use reposurgeon for some reason, this is one of the best alternatives.

svn2git (jcoglan/nirvdrum version)

A batch-conversion wrapper around git-svn that creates real tag objects. This is the one written in Ruby. Has all of git-svn’s problems, alas.

!ABANDONED, !NO_TAGS, NO_IGNORES.

If you cannot use reposurgeon for some reason, this is another alternative that is not too horrible. But beware of possible history damage if your Subversion repo has malformations that confuse git-svn.

svn2git (Schemenauer version)

Native Python. More a proof of concept than a production tool.

LINEAR, NO_TAGS, NO_IGNORES, NO_DOCUMENTATION, ABANDONED.

svn2git (Nyblom version)

Written in C++. Says it’s based on svn-fast-export by Chris Lee. Not easy to figure out what it actually does, as there is no documentation at all and no test cases. May be genetically related to svn-all-fast-export, but if so they diverged in 2008.

CONFIGURATION, NO_DOCUMENTATION.

svn-fast-export

Written in C. More a proof of concept than a production tool.

LINEAR, NO_TAGS, NO_IGNORES, NO_DOCUMENTATION, ABANDONED.

svn-dump-fast-export

Written in C. Documentation is so lacking that there isn’t even a README. However, it’s possible to deduce what isn’t there by reading the code.

LINEAR, NO_TAGS, NO_IGNORES, NO_DOCUMENTATION.

svn-all-fast-export

May be genetically related to the Nyblom svn2git, but if so they diverged in 2008.

LINEAR, NO_TAGS, NO_IGNORES, NO_DOCUMENTATION, ABANDONED.

SubGit

Nearly unique for this category of software in being closed-source. Beyond an evaluation period, users have to register, possibly for a cost (it’s supposed to be free-of-charge for certain uses: open source projects, education, and "startups" - history with BitKeeper shows that such exemptions should probably not be trusted).

The intended outcome of this program is to provide a server that Subversion and Git users can interact with at once. This may be of little value overall: new developers are frequently unfamiliar with Subversion (and old ones forget the usage patterns!), fundamental differences in the design of the two VCSes interfere with the quality of both views, and confusion about the preferred mode of contribution increases.

The quality of SubGit’s conversion is rather poor. It fails to properly translate at least half of the reposurgeon *.svn regression tests, even some of the simpler ones - although it does correctly translate some trickier cases such as agito.svn. Large real-world Subversion repos will exhibit multiple issues that SubGit may, silently or otherwise, trip over.

This program will forever contain compromises for the same reasons git-svn does. The non-open source nature leaves little hope of having such issues repaired by skilled community members.

Atlassian’s BitBucket service relies on this for Subversion-to-Git migration. Depending on this service is not recommended.

!MIXEDBRANCH, !LINEAR, CONFIGURATION, DOCUMENTATION

Appendix B: A tour of the codebase

Reposurgeon is intended to be hackable to support special-purpose or custom operations, though it’s even better if you can come up with a new surgical primitive general enough to ship with the stock version. For either case, here’s a guide to the architecture.

inner.go

The core classes in inner.go support deserializing and reserializing import streams. In between these two operations the repo state lives in a fairly simple object, Repository. The main part of Repository is just a list of objects implementing the Event interface - Commits, Blobs, Tags, Resets, and Passthroughs. These are straightforward representations of the command types in an import stream, with Passthrough as a way of losslessly conveying lines the parser does not recognize.

 +-------------+    +---------+    +-------------+
 | Deserialize |--->| Operate |--->| Reserialize |
 +-------------+    +---------+    +-------------+

The general theory of reposurgeon is: you deserialize, you do stuff to the event list that preserves correctness invariants, you reserialize. The "do stuff" is mostly not in the core classes, but there is one major exception. The primitive to delete a commit and squash its fileops forwards or backwards is seriously intertwined with the core classes and actually makes up almost 50% of Repository by line count.
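The shape of that pipeline can be shown in a few lines of Go. The sketch below is a toy model, not the code in inner.go: its Event interface has a single method, only reset lines are parsed, and every other line becomes a Passthrough. All names and signatures here are assumptions made for this example.

```go
package main

import (
	"fmt"
	"strings"
)

// Event is a hypothetical stand-in for the interface described above;
// the real one has more to it.
type Event interface {
	String() string // reserialize this event as stream text
}

// Reset is a much-simplified reset command.
type Reset struct{ Ref string }

func (r Reset) String() string { return "reset " + r.Ref + "\n" }

// Passthrough carries a line the parser did not recognize, verbatim.
type Passthrough struct{ Text string }

func (p Passthrough) String() string { return p.Text + "\n" }

// deserialize turns newline-terminated stream text into an event list.
func deserialize(stream string) []Event {
	var events []Event
	for _, line := range strings.Split(strings.TrimSuffix(stream, "\n"), "\n") {
		if strings.HasPrefix(line, "reset ") {
			events = append(events, Reset{Ref: strings.TrimPrefix(line, "reset ")})
		} else {
			events = append(events, Passthrough{Text: line})
		}
	}
	return events
}

// renameRef is the "do stuff" step: it edits the event list in place.
func renameRef(events []Event, from, to string) {
	for i, e := range events {
		if r, ok := e.(Reset); ok && r.Ref == from {
			events[i] = Reset{Ref: to}
		}
	}
}

// reserialize writes the event list back out as stream text.
func reserialize(events []Event) string {
	var sb strings.Builder
	for _, e := range events {
		sb.WriteString(e.String())
	}
	return sb.String()
}

func main() {
	in := "# a comment\nreset refs/heads/master\nprogress done\n"
	events := deserialize(in)
	renameRef(events, "refs/heads/master", "refs/heads/main")
	fmt.Print(reserialize(events))
}
```

In this toy, deserializing and reserializing without an operation in between reproduces the input text exactly, which is the property Passthrough exists to protect.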

The rest of the surgical code lives outside the core classes. Most of it lives in the RepoSurgeon class (the command interpreter) or the RepositoryList class (which encapsulates by-name access to a list of repositories and also hosts surgical operations involving multiple repositories). A few bits, like the repository reader and builder, have enough logic that’s independent of these classes to be factored out of them.

In designing new commands for the interpreter, try hard to keep them orthogonal to the selection-set code. As often as possible, commands should all have a similar form with a (single) selection set argument.

VCS is not a core class. The code for manipulating actual repos is bolted onto the ends of the pipeline, like this:

 +--------+    +-------------+    +---------+    +-----------+    +--------+
 | Import |--->| Deserialize |--->| Operate |--->| Serialize |--->| Export |
 +--------+    +-------------+ A  +---------+    +-----------+    +--------+
      +-----------+            |
      | Extractor |------------+
      +-----------+

The Import and Export boxes call methods in VCS.

extractor.go

Extractor classes build the deserialized internal representation directly. Each extractor class is a set of VCS-specific methods to be used by the RepoStreamer driver class. Key detail: when a repository is recognized by an extractor it sets the repository type to point to the corresponding VCS instance.

reposurgeon.go

All code that knows about the DSL syntax should live in reposurgeon.go along with the program main and the functions for reporting errors, logging, handling signals and aborts, etc.

svnread.go

This is the reader for Subversion dumpfiles. It is the only exception to the rule that read support for version-control systems is implemented by front ends that read them and emit a fast-import stream.

The reason it’s an exception is that Subversion has its own serialization format, and the total complexity of embedding support for those streams was estimated to be lower than that of writing a completely separate front end.

Style notes

The code was translated from Python. It retains, for internal documentation purposes, the Python convention of using leading underscores on field names to flag fields that should never be referenced outside a method of the associated struct.

The capitalization of other fieldnames looks inconsistent because the code tries to retain the lowercase Python names and compartmentalize as much as possible to be visible only within the declaring package. Some fields are capitalized for backwards compatibility with the setfield command in the Python implementation, others (like some members of FileOp) because there’s an internal requirement that they be settable by the Go reflection primitives.

Appendix C: Adding support for more version-control systems

The best way to add support for a version-control system not already on the list is to write a pair of foo-fast-export and foo-fast-import utilities (separate from reposurgeon) that generate and consume git fast-import streams. When this is achievable, it enables full read/write support for repositories of that type. In this case the supporting changes in reposurgeon will be trivial, just a pair of table entries.

The next best route is to write a FooExtractor class in reposurgeon itself. This is less good because (a) it provides only read-side support, and (b) it adds complexity to reposurgeon. There’s also a filter derived from testing requirements; a command-line client of your FooVCS must be freely available running under Unix in order for the reposurgeon maintainers to run tests on it. We are not willing to ship features we can’t test.

Finally, if your VCS supports a native serialization format that it can use as a dump/restore for live repositories, and has one or both of a pair of foo-dump/foo-load utilities analogous to git-fast-export and git-fast-import, it may be possible to support your FooVCS through that format. Subversion’s svnadmin dump/load commands fit this pattern.

In this case it is still best to try to write filters that interconvert between the native serialization and git-fast-import streams, separately from reposurgeon. This makes the testing problem more tractable, and means that reposurgeon itself needs only a couple of additional table entries calling simple pipelines.

As a last resort, the reposurgeon maintainers may consider adding support for reading and writing a native serialization format to reposurgeon itself. So far this has only been done once, for Subversion, and there is an important precondition; the serialization format must have complete public documentation.

Be aware that proprietary VCSes in general are likely to cause us serious testing problems and we are reluctant to try to support them. If a maintainer has to pay money to have binaries he or she can run tests with, you will have to pay a maintainer money to make that happen.

It’s also basically a crash landing if your FooVCS can only be accessed through a GUI, or its clients only run on Windows, or it has a CLI that is not capable enough to support an extractor class. We know of cases where proprietary VCS vendors have deliberately crippled their export and CLI features in order to lock customers in; that is no fun to deal with, so you’ll have to pay somebody money.

Appendix D: Reposurgeon success stories

Reposurgeon has been used for successful conversion on projects including but not limited to the following. These are in rough chronological order.

Hercules (IBM mainframe emulator)

The author did this one, Subversion to hg. About ten years of history at the time, not too horribly messy.

NUT (Network UPS Tools)

The author did this one, Subversion to git. The trial by fire - it was when the Subversion dump analyzer got built. Very large old repository with lots of pathologies (there was a CVS stratum).

Battle For Wesnoth

The author did this one, Subversion to git. Very large repo, moderately complex.

Roundup (issue tracker)

The author did this one, Subversion to git (they later switched to hg). Moderate-sized Subversion repo with some very strange malformations.

robotfindskitten

The author did this one, CVS to git. Simple history, pretty easy.

Blender

Two guys at Blender did this one with help from the author, Subversion to git. Huge repository with a lot of nasty pathologies. The tool needed some serious optimization and feature upgrades to handle it.

groff

The author did this one, CVS to git. Rather easy as the project history was almost linear and, though very old, not huge.

Nethack

CVS to git. This conversion has not yet been publicly released at time of writing (late October 2014) for complicated political reasons.

Emacs

A record three layers, Bazaar over CVS over RCS. Malformations not too bad except for some unique challenges created by the RCS-to-CVS conversion, but the sheer size of the history and number of layers makes it the most complex conversion yet. Converted in 2011.

ntp

The author did this, BitKeeper to git using a derivative of Tridge’s SourcePuller as a front end, done in early 2015. Nothing especially taxing about the reposurgeon side of things, the magic was all in the front end.

pdfrw, playtag, pyeda, rson

Four small Subversion projects by Patrick Maupin, converted in two hours' work in May 2015. No significant difficulties. These mainly served to demonstrate that the standard conversion workflow in conversion.mk is fast and effective for a wide range of projects.

mh-e

The Emacs interface for MH. Converted by Bill Wohler in late 2015. He reports that the standard conversion workflow worked fine.

GNUPLOT

CVS to git, 30 years of history with some early releases recovered from tarballs. Converted by the author in late 2017. Somewhat messy due to vendor-branch issues.

GCC

SVN to git, with ancient strata of CVS and RCS. 280K commits of history back to 1987, dwarfing Emacs. Converted by myself and two core GCC developers. The 4.0 release came out of this. Final cutover was on Jan 12th 2020.

Here are some other field reports on successful uses:

Appendix E: Development History

Links to notable blog posts during the development of reposurgeon. Trivial release announcements have been omitted.

Cometary Contributors (2016-01-10)

30 Days in the Hole (2020-01-24)

Two graceful finishes (2020-05-13)