1. How to use this manual

This is the long-form manual for the reposurgeon tool suite.

Everybody should read the Introduction (immediately after this section) to be sure reposurgeon is the tool you actually want.

Then read the Quick Start section (immediately following the Introduction) to get a feeling for the syntax and style of reposurgeon commands.

Assuming you’re trying to do a repository lift, you should probably continue by reading A Guide to Repository Conversion. It is not necessary to read the entire main body of the manual after that (the command reference) straight through, but keep it handy for when you need to learn more about a particular command or group of commands.

If you’re doing something more unusual than a conversion, you’ll probably have to read through the command reference until you discover the commands you need.

Help is available within reposurgeon using the "help" command. Type "help" alone for a list of help topics.

2. Introduction

The purpose of reposurgeon is to enable risky operations that VCSes (version-control systems) don’t want to let you do, such as (a) editing past comments and metadata, (b) excising commits, (c) coalescing and splitting commits, (d) removing files and subtrees from repo history, (e) merging or grafting two or more repos, and (f) cutting a repo in two by cutting a parent-child link, preserving the branch structure of both child repos.

A major use of reposurgeon is to assist a human operator to perform higher-quality conversions among version-control systems than can be achieved with fully automated converters. Another application is when code needs to be removed from a repository for legal or policy reasons.

Fully supported systems (those for which reposurgeon can both read and write repositories) include git, hg, bzr, darcs, bk, RCS, and SRC. For a complete list, with dependencies and technical notes, type "prefer" at the reposurgeon prompt.

Writing to the file-oriented systems RCS and SRC has some serious limitations because those systems cannot represent all the metadata in a git-fast-export stream. They require rcs-fast-import as a back end; consult that tool’s documentation for details and partial workarounds.

Fossil repository files can be read in using the --format=fossil option of the ‘read’ command and written out with the --format=fossil option of the ‘write’ command. Ignore patterns are not translated in either direction.

SVN and CVS are supported for read only, not write. For CVS, reposurgeon must be run from within a repository directory (one with a CVSROOT subdirectory), not a checkout. Each module becomes a subdirectory in the reposurgeon representation of the change history.

Note that reposurgeon is a sharp enough tool to cut you. It never modifies a repository in place, and it takes care not to ever write a repository in an actually inconsistent state, and will terminate with an error message rather than proceed when its internal data structures are confused. However, there are lots of things you can do with it - like altering stored commit timestamps so they no longer match the commit sequence - that are likely to cause havoc after you’re done. Proceed with caution and check your work.

Also note that, if your DVCS does the usual thing of making commit IDs a cryptographic hash of content and parent links, editing a publicly-accessible repository with this tool would be a bad idea. All of the surgical operations in reposurgeon will modify the hash chains.

Please also see the notes on system-specific issues under Limitations and guarantees.

3. Quick start

As a motivating example, commands follow to do a Subversion-to-Git lift on a repository named project just under the current directory. Note that these are given as though typed to the interactive command prompt, which is not shown.

# Load the project into main memory
# Warning: this command is slow because Subversion is slow.
read project

# Check for and report glitches such as timestamp collisions,
# ill-formed committer/author IDs, multiple roots, etc.
lint

# Map Subversion usernames to Git-style user IDs.  In a real
# conversion you'd probably have a lot more of these and you'd
# probably read them in from a separate file, not a heredoc.
authors read <<EOF
fred = Fred Foonly <fred@foonly.net> America/Chicago
jrh = J. Random Hacker <jrh@random.org> America/Los_Angeles
esr = Eric S. Raymond <esr@thyrsus.com> America/New_York
julien = Julien '_FrnchFrgg_' RIVAUD <julien@frnchfrgg.pw> Europe/Paris
db48x = Daniel Brooks <db48x@db48x.net> America/Los_Angeles
ecree = Edward Cree <ecree@solarflare.com> Europe/London
EOF

# Massage comments into Git-like form (with a topic sentence and a
# spacer line after it if there is following running text). Only
# done when the first line is syntactically recognizable as a whole
# sentence.
gitify

# Tags with the name prefix emptycommit were branch-creation commits
# in Subversion. Usually there's nothing interesting in the comment
# text, but you'll want to browse them and check.  These commands
# save one such tag and delete the rest
tag emptycommit-23 noteworthy
tag /emptycommit/ delete

# Delete remnant .cvsignore files from a past life as CVS.
expunge /.cvsignore$/

# Often, your Subversion repository was a CVS repository in a past
# life. CVS creates tags named with the suffix "-root" to mark branch
# points for its internal housekeeping, and cvs2svn blindly copied them
# even though the Subversion tools don't need that marker. This
# clutter can be tossed.
tag /-root$/ delete

# This command illustrates how to use msgin to modify the comment
# text of a commit. In this test we're patching a Subversion revision
# reference because we're going to want to reference-lift it later.
# But this capability could also be used, for example, to add an
# update note to a commit comment that turned out to be incorrect.
msgin <<EOF
Legacy-ID: 23
Check-Text: Referring back to r2.

Referring back to [[SVN:2]].
EOF

# Change cookies like [[SVN:2]] into action stamps that are independent
# of the VCS in use. A typical action stamp looks like this:
# <jrh@random.org!2006-08-14T02:34:56Z>
references lift

# Sometimes it's useful to drop files from the repo that should
# never have been checked in.
1..$ expunge /documents/.*.pdf/

# Process GNU Changelogs to get better attributions for changesets.
# When a commit was derived from a patch and checked in by someone
# other than its author this can often correct the commit attribution.
changelogs

# It's good practice to add a tag marking the point of conversion.
tag cutover-git create @max(=C)
msgin <<EOF
Tag-Name: cutover-git
Tagger: J. Random Hacker <jrh@random.org> America/Los_Angeles

This tag marks the last Subversion commit before the move to Git.
EOF

# We want to write a Git repository
prefer git

# Do it
rebuild project-git

Typically you’d have these commands in a script that you evolved by experimenting until you got a conversion that suited your tastes and needs.

To learn more about reposurgeon, keep reading or type "help" at the reposurgeon command prompt.

Reposurgeon can be passed commands on its invocation line but the syntax may be a little surprising if you’re used to classic Unix switch conventions. Each string in the command line argument is passed to the interpreter as a command, in order; thus, you could replicate the first two actions of the previous script with this:

$ reposurgeon "read project" lint

The quotes are required so the read command and its argument will be presented to the command interpreter as a single string.
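To see why the quotes matter, it can help to look at how a POSIX shell groups that line into arguments. The following is a minimal Python sketch that uses the standard shlex module to imitate the shell's word splitting; it only illustrates argument grouping and does not invoke reposurgeon.

```python
import shlex

# With quotes: after the program name there are two arguments, so the
# interpreter would receive the commands "read project" and "lint".
quoted = shlex.split('reposurgeon "read project" lint')

# Without quotes: there are three arguments, so "read", "project" and
# "lint" would each be presented as a separate command.
unquoted = shlex.split('reposurgeon read project lint')

print(quoted[1:])
print(unquoted[1:])
```

The same grouping applies if you build the invocation from another program: each reposurgeon command must be one element of the argument vector.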

4. A Guide to Repository Conversion

One of the main uses for reposurgeon is converting repositories between different version-control systems. Since around 2015 this has almost always meant converting from something else to Git.

This section is a guide to up-converting your repository, and adopting practices that will reduce process friction to a minimum. It is meant to provide context for the description of reposurgeon’s features in later sections.

If you are aiming at something other than a repository conversion, you can safely skip this section.

In 90% of cases you’ll be converting from CVS or Subversion, and those are the cases we’ll discuss in detail. If you’re using something older or weirder, see the short section on other VCSes for some hints, but you’re mostly on your own.

4.1. Why convert with reposurgeon?

Reposurgeon is more difficult to use than any of dozens of fully-automated conversion tools out there; you have to make choices and compose a recipe. This section explains why it’s worth the bother.

In brief, it’s because fully-automated converters don’t work very well. They are very poor at dealing with the ontological mismatches between the data models of different version-control systems. For detailed discussion of the technical flaws in many common converters, see Appendix A.

In particular: reposurgeon is the only conversion tool that handles multibranch Subversion repositories in full generality. It can even correctly translate Subversion commits that alter multiple branches.

But even automated converters that are relatively good at bridging data-model differences tend to produce crude, jackleg, unidiomatic conversions that make the seam between the pre-conversion and post-conversion parts of the repository very obvious.

A central example of this is commit references in change comments. These references convey important information to anyone reading the comments, and it is correspondingly important to change them from using the reference format of the old system to one that is intelligible in whatever your new one is.

As another example, git has a convention about the form of change comments; they’re supposed to consist of a standalone summary line, followed optionally by a spacer blank line and running text. Git relies on this convention to produce log summaries that are easy to take in at a glance.

Older version-control systems don’t have this convention. An ideal conversion changes as many comments as possible to be in Git-like form so that the Git summary tools see the data regularity they want. But this kind of editing can’t be fully automated. The best you can hope for, if you want to do it right, is that your tool automates as much of this fixup as it can and it assists a human operator in applying fixups.
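As an illustration of the comment shape Git expects, here is a minimal Python sketch of a checker for it. The function name is_git_style is hypothetical, and this is not how reposurgeon decides what to rewrite; it only tests for "summary line alone" or "summary line, blank spacer, running text".

```python
def is_git_style(comment: str) -> bool:
    """True if the comment is a lone summary line, or a summary line
    followed by a blank spacer line and then running text."""
    lines = comment.rstrip("\n").split("\n")
    if lines[0].strip() == "":
        return False          # empty comment or blank first line
    if len(lines) == 1:
        return True           # a standalone summary line
    # Multi-line: the second line must be the blank spacer,
    # and there must be running text after it.
    return lines[1].strip() == "" and len(lines) > 2


print(is_git_style("Fix crash on empty input\n"))
print(is_git_style("Fixed the crash\nand also tidied the parser\n"))
```

A check like this can help you find the comments that still need attention; deciding how to rewrite them remains a human judgment.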

Neither reference-lifting nor patching comments for Git-friendliness is a process that can be fully automated. Both require human judgment; accordingly, fully-automated converters don’t even try to do the right thing. The result is often a history that is full of unpleasant little speedbumps and distractions. These induce wasted developer effort and, correspondingly, higher defect rates.

On the other hand, a skilled operator of reposurgeon can produce a conversion that is fully idiomatic in the target system, significantly lowering future friction costs for developers browsing the history.

One fully automated reposurgeon feature of some significance that no other importer supports is that it can parse ChangeLogs in histories which use that Free Software Foundation convention, and use the attributions in them to fill in Git author fields. This recovers better information about the provenance of changesets corresponding to patches committed by a project developer (who continues to be recorded as the committer of that changeset).

4.2. Commercial Note

If you are an organization that pays programmers and has a requirement to do a repository conversion, the author can be engaged to perform or assist with the transition. You are likely to find this is more efficient than paying someone in-house for the time required to learn the tools and procedures. I (the author) have been very open about my methods here, but nothing substitutes for experience when you need it done quickly and right.

If you are wondering why it’s worth spending any money at all for a real history conversion, as opposed to just starting a new repository with a snapshot of the old head revision, the answer comes down to two words: risk management.

Suppose you do a snapshot conversion, head revision only. Then you get a regression report with a way to reproduce the problem. What you want to do is bisect in the new history to identify the revision where the bug was introduced, because knowing what the breaking change was makes a fix far easier. Bzzzt! You can’t. That history is missing in the new system.

Yes, in theory you could run a manual bisection using bracketing builds in new and old repositories. Until you have tried this, you will have no comprehension of how easy it is to get that process slightly but fatally wrong, and (actually more importantly) how difficult it is to be sure you haven’t gotten it wrong. This is the kind of friction cost that sounds minor until the first time it blows up on you and eats man-weeks of NRE.

So congratulations, tracing the bug just got an order of magnitude more expensive in engineer time, and your expected time to fix changed proportionally. It typically only takes one of these incidents to justify the up-front cost of having had the conversion done right.

If you go the snapshot-conversion route, maybe you’ll get lucky and never need visibility further back. Or maybe you’ll have a disaster because you increased the friction costs of debugging just enough that you, say, miss a critical ship date. The more experienced with in-the-trenches software development you are, the more plausible that second scenario will sound.

A subtler issue is that by losing the old change comments you have thrown away a great deal of hard-won knowledge about why your code is written the way it is. Again, this may never matter – but if it does, it’s going to bite you on the butt, hard, probably when you least expect it.

And if you’re thinking "No problem, the old repository will still be around"…​heh. Repositories that have become seldom-accessed are like other kinds of dead storage in that they have a way of quietly disappearing because after a few job turnovers the knowledge of why they’re important is lost. Typically you don’t find out this has happened until you have an unanticipated urgent need, at which point whatever trouble you were in gets deeper.

Spending the relatively small amount it takes to have a proper full conversion done right is a way of bounding your downside risk. If you aren’t a software engineer and had trouble following the preceding argument, propose a snapshot conversion to the engineer you trust the most and watch that person reaching for a diplomatic way to tell you it’s a stupid, shortsighted idea.

4.3. Step Zero: Preparation

Make sure the tools in the reposurgeon suite (especially reposurgeon and repotool) are on your $PATH.

Create a scratch directory for your conversion work.

Run "repotool initialize" in the scratch directory; it requires that you follow the initialize verb with a project name. This will create a Makefile designed to sequence your conversion, and an empty lift script. Then set the variables near the top appropriately for your project.

This Makefile will help you avoid typing a lot of fiddly commands by hand, and ensure that later products of the conversion pipeline are always updated when earlier ones have been modified or removed.

The most important variables to set in the Makefile are the ones that set up local mirroring of your repository. The repotool command has a mode that handles the details of making (and, when necessary, updating) a local mirror. To enable this you need to fill in either REMOTE_URL or CVS_HOST and CVS_MODULE; read the header comment of the conversion makefile for details.

If you’re lifting a Subversion repository you can specify the repository URL in either of two ways: as a standard Subversion repository URL (service prefix "svn:") or as an rsync URL pointing at the same repository master directory (service prefix "rsync:"). Usually rsync mirroring is faster, but it depends on sshd running at the server end and may bring in complications around its security features. Use a "svn:" URL, which uses mirroring by svnsync, when you can’t make rsync work. Note, by the way, that repotool mirror’s "rsync:" URLs do not have the requirement for a remote rsyncd that an rsync URL fed directly to rsync itself does; internally, repotool turns them into a single-colon host plus path rsync source specification.

Later, you will put your custom commands in the lift script file. Doing this helps you not lose older steps as you experiment with newer ones, and it documents what you did.

Doing a high-quality repository conversion is not a simple job, and the odds that you will get it perfectly right the first time are close to zero. By packaging your lift commands in a repeatable script and using the Makefile to sequence repetitive operations, you will reduce the overhead of experimenting.

In the rest of the steps we describe below, when we write "make foo" that means the step can be sequenced by the "foo" production in the Makefile. Replace $(PROJECT) in these instructions with your project name.

You may find it instructive to type "make -n" to see what the entire conversion sequence will look like.

4.4. Step One: The Author Map

Subversion and CVS identify users by a Unix login name local to the repository host; DVCSes use pairs of fullnames and email addresses. Before you can finish your conversion, you’ll need to put together an author map that maps the former to the latter; the Makefile assumes this is named $(PROJECT).map. The author map should specify a full name and email address for each local user ID in the repo you’re converting. Each line should be in the following form:

foonly = Fred Foonly <foonly@foobar.com>

You can optionally specify a third field that is a timezone description, either an ISO8601 offset (like "-0500") or a named entry in the Unix timezone file (like "America/Chicago"). If you do, this timezone will be attached to the timestamps on commits made by this person.
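If you want to sanity-check a map file before feeding it to reposurgeon, the line format above is easy to parse. What follows is a minimal Python sketch, not reposurgeon's own parser; the names AuthorEntry and parse_author_line are hypothetical, and the regular expression assumes exactly the "id = Full Name <email> [timezone]" shape described here.

```python
import re
from typing import NamedTuple, Optional


class AuthorEntry(NamedTuple):
    local_id: str
    fullname: str
    email: str
    timezone: Optional[str]   # e.g. "-0500" or "America/Chicago", if given


# local-id = Full Name <email> [timezone]
_LINE = re.compile(r"^\s*(\S+)\s*=\s*(.*?)\s*<([^<>]+)>\s*(\S+)?\s*$")


def parse_author_line(line: str) -> AuthorEntry:
    """Split one author-map line into its fields, or raise ValueError."""
    m = _LINE.match(line)
    if m is None:
        raise ValueError("not an author-map line: %r" % line)
    local_id, fullname, email, tz = m.groups()
    return AuthorEntry(local_id, fullname, email, tz)


print(parse_author_line("foonly = Fred Foonly <foonly@foobar.com>"))
print(parse_author_line("fred = Fred Foonly <fred@foonly.net> America/Chicago"))
```

Running every line of your map through a check like this catches missing angle brackets and stray fields early, before a long conversion run.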

Using the generic Makefile for Subversion, "make stubmap" will generate a start on an author-map file as $(PROJECT).map. Edit in real names and addresses (and optionally offsets) to the right of the equals signs.

How best to get this information will vary depending on your situation.

  • If you can get shell access to the repository host, looking at /etc/passwd will give you the real name corresponding to each username of a developer still active: usually you can simply append @ and the repository hostname to each username to get a valid email address. You can do this automatically, and merge in real names from the password file, using the 'repomapper' tool from the reposurgeon distribution.

  • If the repository is owned by a project on a forge site, you can usually get the real name information through the Web interface; try looking for the project membership or developer’s list information.

  • If the project has a development mailing list, posting your incomplete map with a request for completions often gives good results.

  • If you can download the archives of the project’s development mailing list, grepping out all the From addresses may suggest some obvious matches with otherwise unknown usernames. You may also be able to get timezone offsets from the date stamps on the mail. The repomapper tool can mine matching addresses from mailbox files automatically, though it does not extract timezones.
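The mailing-list approach in the last item can be sketched in a few lines of Python. This is an illustration only, not the method repomapper uses; the function names are hypothetical, and the sketch makes the simplifying assumption that any line beginning with "From:" is a header, so quoted headers in message bodies will also be picked up.

```python
from email.utils import parseaddr


def from_addresses(mbox_text):
    """Collect distinct (name, address) pairs from From: header lines."""
    seen = []
    for line in mbox_text.splitlines():
        # "From:" headers only; the mbox "From " separator line is skipped.
        if line.startswith("From:"):
            name, addr = parseaddr(line[len("From:"):])
            if addr and (name, addr) not in seen:
                seen.append((name, addr))
    return seen


def suggest_matches(usernames, pairs):
    """For each username, list pairs whose address local part equals it."""
    return {
        user: [p for p in pairs if p[1].split("@")[0].lower() == user.lower()]
        for user in usernames
    }


sample = (
    "From fred@foonly.net Mon Aug 14 02:34:56 2006\n"
    "From: Fred Foonly <fred@foonly.net>\n"
    "Subject: patch\n"
    "\n"
    "From: \"J. Random Hacker\" <jrh@random.org>\n"
)
print(suggest_matches(["fred", "esr"], from_addresses(sample)))
```

Treat the output as suggestions to confirm with the people involved, since a matching local part does not prove the address belongs to the same person.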

If you are converting the repository for an open-source project, it is good courtesy and good practice after the above first step to email the contributors and invite them to supply a preferred form of their name, a preferred email address to be used in the mapping, and a timezone offset. The reason for this is that some sites, like OpenHub, aggregate participation statistics (and thus, reputation) across many projects, using developer name and email address as a primary key.

Your authors file does not have to be final until you ship your converted repo, so you can chase down authors' preferred identifications in parallel with the rest of the work.

4.5. Step Two: Conversion

Install whatever front end reposurgeon needs to read your repository. That will be cvs-fast-export for CVS, or the Subversion tools themselves for Subversion.

The generic-workflow Makefile will call reposurgeon for you, interpreting your $(PROJECT).lift file, when you type "make". You may have to watch the baton spin for a few minutes. For very large repositories it could be more than a few minutes.

This will convert your repository to git. If you need to export to something else, reposurgeon has write support for a couple of other modern VCSes.

4.5.1. CVS

Reposurgeon can read CVS repositories. You must have a copy of the repository master directory, and run reposurgeon pointed at that directory or one of its submodules; a checkout directory won’t do.

If you are exporting from CVS, it may be a good idea to run some trial conversions with cvsconvert, a wrapper script shipped with cvs-fast-export. This script runs a conversion direct to git; the advantage is that it can do a comparison of the repository histories and identify problems for you to fix in your lift script. You probably don’t want to use this for final conversion, though, as it does not clean up CVS junk tags, perform reference lifting, or Gitify comments.

For more detailed discussion of CVS conversion and troubleshooting see Working with CVS.

4.5.2. Subversion

reposurgeon can read Subversion dumpfiles or repositories. If you want it to read a repository, you must run it within the top-level directory of the repository itself, not in a checkout directory made from the repository.

Unlike CVS, Subversion repositories have real changesets, and the work in them can effectively always be mapped to equivalent DVCS commits. The parent-child relationships among commits will also translate cleanly.

Normally reposurgeon will do branch analysis for you. On most Subversion repositories, and in particular anything with a standard trunk/tags/branches layout, it will do the right thing. It will also cope with adventitious branches in the root directory of the repo, such as many projects use for website content.

There is, however, a minor problem around tags, and a slightly more significant problem around Subversion merges. Also, some Subversion repositories are multi-project with a nonstandard directory layout.

For more detailed discussion of Subversion conversion and troubleshooting see Working with Subversion. For discussion of handling multiproject repositories, see Multiproject Subversion repositories.

4.5.3. Other VCSes

Mercurial: reposurgeon can read a Mercurial repository using either of two methods. See Working with Mercurial for details.

SCCS: Use sccs2rcs to get to RCS, then follow the directions for RCS. There is a script called sccs2git on CPAN which is not recommended, as it is poorly documented and makes no attempt to group commits into changesets.

RCS: reposurgeon will read an RCS collection. It uses cvs-fast-export, which despite its name does not actually require CVS metadata other than the RCS master files that store the content.

Fossil: reposurgeon will read a Fossil repository file. It uses the native Fossil exporter, which is pretty good but doesn’t export ignore patterns, wiki events, or tickets.

BitKeeper: As of version 7.3 (and probably earlier versions under open-source licensing) BitKeeper has fast-import and fast-export subcommands, and reposurgeon now knows how to use these.

Perforce: According to this Stack Overflow answer, the magic incantation is something like git p4 clone --import-labels --detect-branches //depot/path/project@all. This will create a live Git repository capturing your Perforce history. This recipe is included for completeness; it is unknown to the author what (if any) reposurgeon cleanup operations might be required, but a skim of Perforce documentation suggests that mapping Perforce user IDs to a Git-style name/address pair will be desirable.

AccuRev: There are a couple of tools for translating AccuRev repositories to live Git repositories. Of these the most developed appears to be called "ac2git.py"; you should be able to find it with a search engine. We recommend you use this tool to get a first-pass conversion to Git, then use reposurgeon to clean up the result. The ac2git.py converter’s goal is to produce an accurate representation of a collection of AccuRev streams, as multiple disconnected git branches; while there is functionality to identify branch and merge points, actually weaving the streams into a single DAG is something best done in reposurgeon.

For other systems, see the Git wiki page on conversion tools.

4.6. Step Three: Sanity Checking

Before putting in further effort on polishing your conversion and putting it into production, you should check it for basic correctness.

Pay attention to error messages emitted during the lift. Most of these, and remedial actions to take, are described in this guide.

For Subversion lifts, use the "headcompare", "tagscompare" and "branchescompare" productions to compare the converted with the unconverted repository. If you didn’t use the cvsconvert wrapper for your CVS lift, these productions have a similar effect. Be aware that these operations may be extremely slow on large Subversion repositories.

The only differences you should see are those due to keyword expansion and ignore-file lifting. If this is not true, you may have found a serious bug in either reposurgeon or the front end it used, or you might just have a misassigned tag that can be manually fixed. Consult How to report bugs for information on how to usefully report bugs.

Use reposurgeon’s ‘lint’ command to find anomalies like detached branches that may need manual correction.

If you are converting from CVS, use reposurgeon’s ‘graph’ command to examine the conversion, looking (in particular) for misplaced tags or branch joins. Often these can be manually repaired with little effort. These flaws do 'not' necessarily imply bugs in cvs-fast-export or reposurgeon; they may simply indicate previously undetected malformations in the history. However, reporting them may help improve cvs-fast-export.

Warning: As of September 2016, stock CVS is known to be buggy in ways that may affect checking the correctness of conversions. For best results, use a CVS version with the MirOS patches. These are carried by Debian Linux and derivatives; you can check by looking for "MirDebian" in the output of cvs --version.

4.7. Step Four: Cleanup

You should now have a git repository, but it is likely to have a lot of cruft and conversion artifacts in it. Here are some common forms of cruft:

Commit comments and attributions containing non-UTF8 data

You could have metadata in your repository in an encoding incompatible with UTF-8 (Latin-1 is the most common offender). You will probably want to transcode the repo to UTF-8.

Subversion and CVS commit references

Often Subversion references will be in the form 'r' followed by a string of digits referring to a Subversion commit number. But not always; humans come up with lots of ambiguous ways to write these. CVS commit references are even harder to spot mechanically, as they’re just groups of digits separated by dots with no identifying prefix. A clean conversion should turn all these into VCS-independent commit references, which will be described later in this document.

Multi-line comments with no summary

git and hg both encourage comments to begin with a summary line that can stand alone as a short description of the change; this practice produces more readable output from git log and hg log. For a really high-quality conversion, multi-line comments should be edited into this form.

Tags from Subversion no-fileop commits

Commits with no fileops are automatically transformed into tags when reading a Subversion repository. Other importers may generate them for various reasons; you can detect them as the =Z visibility set. You will probably want to delete these; they’re preserved just in case something about the metadata is interesting.

Tags from CVS -root commits

Branch root markers are made by CVS for internal purposes. These often persist through up-conversions to SVN. "tag /-root$/ delete" will clean those up; this is not done automatically because implicit magic that deletes data generally turns out to be a bad idea.

Branch tip deletes, deletealls, and unexpressed merges

In Subversion it is common practice to delete a branch directory when that line of development is finished or merged to trunk; this makes sense because it reduces the checkout size of the repo in later revisions. In a DVCS, deletes at a branch tip don’t save you any storage, so it makes more sense to leave the branch with all of its tip content live if you’re not going to delete it entirely. Sometimes editing a later commit to have the branch tip as a parent (creating a merge that Subversion could not express) makes sense; look for svn:mergeinfo properties as clues.

Commits generated by cvs2svn to carry tag information

These lurk in the history of a lot of Subversion projects. Sometimes these junk commits are empty (no file operations associated with them at all); sometimes they’re translated as long lists of spurious delete fileops, and sometimes they have actual file content (duplicating parent file versions, or referring randomly to file versions far older than the junk commit). Older versions of cvs2svn seem to have generated all kinds of meaningless crud into these.

Metadata inserted by git-svn

git-svn inserts lines at the end of each commit comment that refer back to the Subversion commit it is derived from. This is necessary for live-gatewaying, and useful during one-shot conversions, but you may not want it in the final repo.

Remnant RCS, CVS and Subversion dollar cookies

Under older version-control systems there was a custom of embedding magic cookies in master files that would be expanded on checkout with various metadata like the commit date and committer ID. These are useless clutter under modern VCSes.

Here’s a checklist of cleanup steps. If you’re using the makefile generated by repotool, most of these will be done by commands in your lift script.

  1. Map author IDs from local to DVCS form.

  2. Check for leftover cvs2svn junk commits and remove them if possible.

  3. Lift references in commit comments.

  4. Massage comments into summary-line-plus-continuation form.

  5. If the project used the GNU ChangeLog convention, run "changelogs".

  6. Remove empty and delete-only tip commits where appropriate.

  7. Review generated tags, pruning and fixing locations as appropriate.

  8. Look for branch merge points and patch parent marks to make them.

  9. Fix up or remove $-keyword cookies in the latest revision.

  10. Resolve any [zombie] files in a CVS conversion by patching in D ops.

  11. If there’s a root branch, check for and remove junk commits on it.

  12. Use the transcode command to fix up metadata in non-UTF8 encoding.

  13. Run lint to detect remaining anomalies that might need to be patched.

  14. For the record, make a tag noting time and date of the repo lift.

  15. If your target was git, run git gc --aggressive.

Most of the work will be in the comment-fixup and reference-lifting stages. These normally take only a couple of hours even on very large repos with thousands of commits. An entire conversion is usually less than two days of work.

You can use the authors read command to perform the author-ID mapping operation with reposurgeon.

You can find empty commits as the =Z visibility set and clean them up with the command tagify. Consult the reposurgeon manual page for usage details.

A good way to spot junk commits is to eyeball the picture of the commit DAG created by the reposurgeon 'graph' command - they tend to stand out visually as leaf nodes in odd places. Be aware that the graph command outputs DOT, the language interpreted by the graphviz suite; you will need a DOT rendering program and an image viewer.

See the documentation of the references command for details on how to fix up Subversion and CVS changeset references in comments so they’re still meaningful.

The command =L msgout is good for extracting commits with multi-line comments in a form that can be edited and merged back in with the msgin command.

The reposurgeon command =H inspect will show you tip commits, which may contain only deletes and deletealls.

Tags can be inspected with =T inspect. Junk tags can be removed with the delete command. Tag comments can be modified with a msgin/msgout sequence. Check that the creation date of tags matches what you see in the source repository; this is the easiest way to spot when one has been attached to the wrong commit, something that can be manually fixed by editing its "from" field.

The selection =I picks out all commits whose metadata (commit comment or attribution) does not decode to UTF-8. You can eyeball those to figure out what the encoding is and apply the transcode command to fix things.

Reposurgeon has a merge command specifically for performing branch merges. The msgin command will also allow you to add a parent mark to a commit.

One minor feature you lose in moving from SCCS, CVS, Subversion, or BitKeeper to a DVCS is keyword expansion. You should go through the last revision of the code and remove $Id$, $Date$, $Revision$, and other keyword cookies lest they become unhelpful fossils. The full Subversion set is $Date$, $Revision$, $Author$, $HeadURL$ and $Id$. CVS uses $Author$, $Date$, $Header$, $Id$, $Log$, $Revision$, also (rarely) $Locker$, $Name$, $RCSfile$, $Source$, and $State$. A command like grep -R '$[A-Z]' . may be helpful.
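As a rough alternative to the grep one-liner, here is a minimal Python sketch that looks only for the keyword names listed above, in both bare and expanded form. The function name find_keyword_cookies and the sample text are illustrative assumptions, not part of reposurgeon.

```python
import re

# Keyword names taken from the Subversion and CVS lists in the text above.
KEYWORDS = (
    "Author", "Date", "Header", "HeadURL", "Id", "Locker", "Log",
    "Name", "RCSfile", "Revision", "Source", "State",
)

# Matches both the bare form ($Id$) and the expanded form ($Id: foo.c 42 $).
COOKIE_RE = re.compile(r"\$(" + "|".join(KEYWORDS) + r")(?::[^$\n]*)?\$")


def find_keyword_cookies(text):
    """Return (line_number, cookie) pairs for every keyword cookie in text."""
    hits = []
    for lineno, line in enumerate(text.splitlines(), start=1):
        for match in COOKIE_RE.finditer(line):
            hits.append((lineno, match.group(0)))
    return hits


# Hypothetical file content: two cookies and one harmless dollar sign.
sample = (
    "/* $Id: main.c 1234 2011-10-25 fred $ */\n"
    "int x; /* costs $5 */\n"
    "// $Revision$\n"
)
for lineno, cookie in find_keyword_cookies(sample):
    print(lineno, cookie)
```

Unlike the broad grep pattern, this ignores dollar signs that are not followed by a known keyword, at the cost of missing any project-specific keywords.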

After conversion of a branchy repository, look to see if there is a 'root' branch. If there are any commits with a sufficiently pathological structure that reposurgeon can’t figure out what branch they belong to, they’ll wind up there.

It’s good practice to leave an annotated tag at the conversion point noting the date and time of the repo lift. See the next section on conversion comments for discussion. Here’s an example of how to make a tag:

msgin --create <<EOF
Tag-Name: git-conversion

Marks the spot at which this repository was converted from Subversion to git.

Conversion notes are enclosed in double square brackets. Junk commits
generated by cvs2svn have been removed, commit references have been
mapped into a uniform VCS-independent syntax, and some comments edited
into summary-plus-continuation form.
EOF

Experiments with reposurgeon suggest that git import doesn’t try to pack or otherwise optimize for space when it populates a repo from a dump file; this produces large repositories. Running git repack and git gc --aggressive can slim them down quite a lot.

4.7.1. Conversion comments

Sometimes, in converting a repository, you may need to insert an explanatory comment - for example, if metadata is garbled or missing and you need to point to that fact.

It’s helpful for repository-browsing tools if there is a uniform syntax for this that is highly unlikely to show up in repository comments. Enclosing translation notes in [[ ]] is recommended; this has the advantage of being visually similar to the [ ] traditionally used for editorial comments in text.
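To show why a uniform syntax helps tools, here is a small Python sketch of how a repository browser might pull conversion notes out of a comment. The function name conversion_notes is an assumption made for illustration; no such function exists in reposurgeon.

```python
import re

# Non-greedy match so that several notes in one comment are found separately.
# DOTALL lets a note span more than one line.
NOTE_RE = re.compile(r"\[\[(.*?)\]\]", re.DOTALL)


def conversion_notes(comment):
    """Return the text of every [[...]] note, with whitespace normalized."""
    return [" ".join(m.group(1).split()) for m in NOTE_RE.finditer(comment)]


# Hypothetical commit comment containing one note and one ordinary bracket.
comment = "Fix build. [[Original author\nfield was garbled.]] See [bug 12]."
print(conversion_notes(comment))
```

Ordinary single-bracket editorial remarks are left alone, which is the point of choosing a doubled delimiter.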

It is good practice to include, in either the root commit of the repository or the conversion tag, a note dating and attributing the conversion work and explaining these conventions. Example:

[[This repository was converted from Subversion to git on 2012-10-24
by Eric S. Raymond <esr@thyrsus.com>.  Here and elsewhere, conversion
notes are enclosed in double square brackets. Junk commits generated
by cvs2svn have been removed, commit references have been mapped into
a uniform VCS-independent syntax, and some comments edited into
summary-plus-continuation form.]]

You should also, as previously noted, leave a tag in the normal commit sequence noting the switchover. You can do this with the msgin --create command.

4.7.2. Nonsurgical cleanup steps

A step that too often gets missed and then inelegantly patched in later is converting the declarations that tell the version-control system to ignore derived files. reposurgeon does this for you if you’re using it for CVS- or Subversion-to-git conversion, both expressing Subversion svn:ignore and svn:global-ignores properties as .gitignore files and lifting .cvsignore files to .gitignore files; see Limitations and guarantees if other DVCSes are involved.

Any .gitignore files found in a Subversion repository were almost certainly created by git-svn users ad hoc and are discarded by default; it is up to the human doing the conversion to look through them and rescue any ignore patterns that should be merged into the converted repository. This behavior can be reversed with the --user-ignores option, which will retain that information and merge it with the ignore patterns generated from svn:ignore and svn:global-ignores properties.

4.7.3. Recovering from errors

Occasionally you’ll discover problems with a conversion after you’ve pushed it to a project’s hosting site, typically to a bare repo that the hosting software created for you. Here’s how to cope:

  1. Do your surgery on a copy of the repo with its .git/config pointing to the public location.

  2. Warn the public repo’s users that it is briefly going out of service, and they will need to re-clone it afterwards!

  3. Ensure that it is possible to force-push to the repository. How you do this will vary depending on your hosting site.

  4. On gitlab.com, under Settings, there is a "Protected Branches" item you can use. If you unprotect a branch, you can force-push to it.

    Elsewhere, you may be able to re-initialize the public repo (this works, for example, on SourceForge). You’ll need ssh access to the bare repo directory on the host - let’s suppose it’s 'myproject'. Pop up to the enclosing directory and do this:

        mv myproject myproject-hidden
        rm -fr myproject-hidden/*
        git init --bare myproject-hidden
        mv myproject-hidden myproject

    The point of doing it this way is (a) so you never actually remove myproject (on many hosts you will not have create permissions in the enclosing directory), and (b) so no user can update the repo while you’re clearing it (mv is atomic).

    Here’s a script that will do the job on SourceForge:

    #!/usr/bin/expect -f
    #
    # nuke - nuke a SourceForge repo
    #
    # usage: nuke project [userid]
    #
    
    if {$argc < 1} {
        puts "nuke: project name argument is required"
        exit 1
    } else {
        set project [lindex $argv 0]
        set user $env(USER)
        if {$argc >= 2} {
            set user [lindex $argv 1]
        }
    }
    
    set remoteprompt "bash-4.1"
    
    set timeout -1
    spawn $env(SHELL)
    match_max 100000
    send -- "ssh -t $user@shell.sourceforge.net create"
    expect -exact "ssh -t $user@shell.sourceforge.net \r create"
    send -- "\r"
    expect -exact "$remoteprompt\$ "
    send -- "cd /home/git/p/$project\n"
    expect -exact "$remoteprompt\$ "
    send -- "cd git-main.git\n"
    expect -exact "$remoteprompt\$ "
    send -- "rm -fr *\n"
    expect -exact "$remoteprompt\$ "
    send -- "git init --bare .\n"
    expect -exact "$remoteprompt\$ "

    After re-initializing, you should be able to run git push to push the new history up to the public repo.

  5. From your modified local repo, try

         git push --mirror --force

    to push the new history up to the public repo.

  6. Inform the public repo’s users that it is available and remind them that they will need to re-clone it.

On GitLab, you can get a similar effect by unprotecting all branches and doing a git push --force to unconditionally overwrite the public history. It is good practice to re-protect the branches afterwards.

4.8. Step Five: Client Tools

Developers who are already git fans and know how to use a git client will, of course, have no particular trouble using a git repository.

Windows users accustomed to working through TortoiseSVN can move to TortoiseGit.

Developers who like hg can use the hg-git Mercurial plugin. There is an Ubuntu package "mercurial-git" that installs it, and other distributions are likely to carry it as well.

There are some hg-git limitations to be aware of. In order to simulate git behavior, hg-git keeps some local state in the .hg directories: a map from git branch names to Mercurial commits, a list of Mercurial bookmarks describing git branches (which have bookmark-like behavior different from a Mercurial named branch), and a file mapping git SHA1 hashes to hg SHA1 hashes (both systems use them as commit IDs). The problem is that hg doesn’t copy any of this local state when it clones a repo, so clones of hg-git repos lose their git branches and tags.

If you have developers attached to the CVS interface, it is possible (and in fact relatively easy) to set up a gateway interface that lets them continue using their CVS client tools. Consult the documentation for git-cvsserver.

4.9. Step Six: Good Practice

Educate your developers in the following good practices:

4.9.1. Commit references

The combination of a committer email address with an ISO 8601 timestamp is a good way to refer to a commit without being VCS-specific. Thus, instead of "commit 304a53c2", use "<2011-10-25T15:11:09Z!fred@foonly.com>". It is recommended that you not vary from this format, even in trivial ways like omitting the 'Z' or changing the 'T' or '!'. Making these cookies uniform and machine-parseable will have good consequences for future repository-browsing tools. The reference-lifting code in reposurgeon generates them.
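The format is easy to generate and check mechanically. The following Python sketch is an illustration only; the names make_stamp and parse_stamp are assumptions, and the email pattern is deliberately loose rather than a full address validator.

```python
import re
from datetime import datetime, timedelta, timezone

# <YYYY-MM-DDTHH:MM:SSZ!email>, with the 'T', 'Z' and '!' all mandatory.
STAMP_RE = re.compile(
    r"^<(\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2}Z)!([^<>\s]+@[^<>\s]+)>$"
)


def make_stamp(when, email):
    """Build a commit reference from a timezone-aware datetime and an email."""
    utc = when.astimezone(timezone.utc)
    return "<%s!%s>" % (utc.strftime("%Y-%m-%dT%H:%M:%SZ"), email)


def parse_stamp(stamp):
    """Split a commit reference into (UTC datetime, email); reject variants."""
    m = STAMP_RE.match(stamp)
    if m is None:
        raise ValueError("not a well-formed commit reference: %r" % stamp)
    when = datetime.strptime(m.group(1), "%Y-%m-%dT%H:%M:%SZ")
    return when.replace(tzinfo=timezone.utc), m.group(2)


# A commit made at 11:11:09 in a UTC-4 zone is normalized to UTC.
local = datetime(2011, 10, 25, 11, 11, 9, tzinfo=timezone(timedelta(hours=-4)))
print(make_stamp(local, "fred@foonly.com"))
```

Because the parser rejects anything that deviates from the exact shape, it also demonstrates why small variations such as dropping the 'Z' would defeat automated tools.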

Being careful about this has an additional benefit. Someday your project may need to change VCSes yet again; on that day, it will be extremely helpful if nobody has to try to convert years' or decades' worth of VCS-specific magic cookies in the history.

Sometimes it’s enough to quote the summary line of a commit. So, instead of "Back out most of commit 304a53c2", you might write "Back out Attempted divide-by-zero fix."

When appropriate, "my last commit" is simple and effective.

4.9.2. Comment summary lines

As previously noted, git and hg both want comments to begin with a summary line that can stand alone as a short description of the change; this may optionally be followed by a separating blank line and details in whatever form the commenter likes.

Try to end summary lines with a period. Ending punctuation other than a period should be used to indicate that the summary line is incomplete and continues after the separator; "…" is conventional.

For best results, stay within 72 characters per line. Don’t go over 80.

Good comment practice produces more readable output from git log and hg log, and makes it easy to take in whole sequences of changes at a glance.
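These conventions are simple enough to check in a commit hook. Here is a minimal Python sketch, assuming the 72- and 80-character limits given above; the function name check_comment and the warning wording are invented for illustration.

```python
def check_comment(comment, soft_limit=72, hard_limit=80):
    """Return a list of warnings about a commit comment (empty means clean)."""
    warnings = []
    lines = comment.rstrip("\n").split("\n")
    summary = lines[0].rstrip()
    if not summary:
        warnings.append("summary line is empty")
    elif not summary.endswith((".", "\u2026")):
        # Other ending punctuation is legal but signals a continuation.
        warnings.append("summary line does not end with a period or ellipsis")
    if len(lines) > 1 and lines[1].strip():
        warnings.append("no blank line between summary and details")
    for number, line in enumerate(lines, start=1):
        if len(line) > hard_limit:
            warnings.append("line %d exceeds %d characters" % (number, hard_limit))
        elif len(line) > soft_limit:
            warnings.append("line %d is longer than %d characters" % (number, soft_limit))
    return warnings


print(check_comment("Fix stuff\nmore detail"))
```

The checker only warns; whether a missing period should block a commit is a policy decision for the project.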

5. Theory of Operation

5.1. The outside view

As the quick-start example shows, you’re typically going to do three steps when you use reposurgeon: (1) read in one (or more) repositories, (2) do surgical things on them, and (3) write out one (or more) repositories.

To keep reposurgeon simple and flexible, it normally does not do its own repository reading and writing. Instead, it relies on being able to parse and emit the command streams created by git-fast-export and read by git-fast-import. This means that it can be used on any version-control system that has both fast-export and fast-import utilities. The git-fast-import stream format also implicitly defines a common language of primitive operations for reposurgeon to speak.

In order to deal with version-control systems that do not have fast-export equivalents, reposurgeon can also host extractor code that reads repositories directly. For each version-control system supported through an extractor, reposurgeon uses a small amount of knowledge about the system’s command-line tools to (in effect) replay repository history into an input stream internally. Repositories under systems supported through extractors can be read by reposurgeon, but not modified by it. In particular, reposurgeon can be used to move a repository history from any VCS supported by an extractor to any VCS supported by a normal importer/exporter pair.

Mercurial repository reading is implemented with an extractor class; writing is handled with the "hg-git-fast-import" command. A test extractor exists for git, but is normally disabled in favor of the regular exporter.

Subversion is an important exception. Its exporter is ‘svnadmin dump’, which doesn’t ship a git-fast-import stream, but rather the unique dump format supported by Subversion. Reposurgeon contains an interpreter for this stream format.

As a matter of historical interest, some old versions of reposurgeon had the ability to build a Subversion repository on output by synthesizing a Subversion dump stream and feeding it to ‘svnadmin load’. This feature was a cute stunt, but was abandoned during translation to Go for a couple of reasons. Most importantly, there is zero demand for moving histories to Subversion - and supposing there were, moving content and metadata from git’s DAG representation to a Subversion stream is very lossy. Subversion to Git to Subversion wouldn’t even have round-tripped well.

5.2. The inside view

Between reads and writes, reposurgeon can usefully be thought of as a structure editor for directed acyclic graphs with a pre-defined set of attributes on their nodes.

To get a feel for what that graph is like, it’s helpful to have seen a git-fast-import stream file. Here is a trivial example from the reposurgeon test suite, describing a history with two commits to a single file:

blob
mark :1
data 20
1234567890123456789

commit refs/heads/master
mark :2
committer Ralf Schlatterbeck <rsc@runtux.com> 0 +0000
data 14
First commit.
M 100644 :1 README

blob
mark :3
data 20
0123456789012345678

commit refs/heads/master
mark :4
committer Ralf Schlatterbeck <rsc@runtux.com> 10 +0000
data 15
Second commit.
from :2
M 100644 :3 README

A git-fast-import stream consists of a sequence of commands which must be executed in the specified sequence to build the repo; to avoid confusion with reposurgeon commands we will refer to the stream commands as events in this documentation. These events are implicitly numbered from 1 upwards. Most commands require specifying a selection of event sequence numbers so reposurgeon will know which events to modify or delete.

For all the details of event types and semantics, see the git-fast-import(1) manual page; the rest of this paragraph is a quick start for the impatient. The most prominent events in a stream are commits describing revision states of the repository; these group together under a single change comment one or more fileops (file operations), which usually point to blobs that are revision states of individual files. A fileop may also be a delete operation indicating that a specified previously-existing file was deleted as part of the commit; there are a couple of other special fileop types of lesser importance.
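To make the idea of implicitly numbered events concrete, here is a toy Python scanner run over the two-commit example stream. It is a sketch under stated assumptions, not reposurgeon's parser: the name scan_events is invented, only the counted "data N" form is handled (not the delimited form), and only M and D fileops are recorded.

```python
def scan_events(stream):
    """Return one dict per blob/commit/tag/reset event in a stream (bytes)."""
    events = []
    current = None
    pos = 0
    while pos < len(stream):
        end = stream.find(b"\n", pos)
        if end == -1:
            end = len(stream)
        line = stream[pos:end]
        pos = end + 1
        word = line.split(b" ", 1)[0]
        if word in (b"blob", b"commit", b"tag", b"reset"):
            # Events are numbered from 1 in order of appearance.
            current = {"number": len(events) + 1, "kind": word.decode(),
                       "mark": None, "fileops": []}
            events.append(current)
        elif word == b"mark" and current is not None:
            current["mark"] = line.split(b" ", 1)[1].decode()
        elif word == b"data":
            # Skip the payload so its content is never mistaken for commands.
            pos += int(line.split(b" ", 1)[1])
        elif word in (b"M", b"D") and current is not None:
            current["fileops"].append(line.decode())
    return events


STREAM = b"""blob
mark :1
data 20
1234567890123456789

commit refs/heads/master
mark :2
committer Ralf Schlatterbeck <rsc@runtux.com> 0 +0000
data 14
First commit.
M 100644 :1 README

blob
mark :3
data 20
0123456789012345678

commit refs/heads/master
mark :4
committer Ralf Schlatterbeck <rsc@runtux.com> 10 +0000
data 15
Second commit.
from :2
M 100644 :3 README
"""

for event in scan_events(STREAM):
    print(event["number"], event["kind"], event["mark"], event["fileops"])
```

In this stream the event numbers happen to coincide with the mark numbers, but that is an accident of the example; marks are labels assigned by the exporter, while event numbers are positions in the stream.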

Reposurgeon’s internal representation of a repository history is basically a deserialized git fast-import stream. A few extra attributes are supported; most notably, commits and resets have a legacy-id attribute that carries over the object’s ID from whatever version-control system exported the stream, in particular a Subversion or CVS revision number.

5.3. The interpreter view

The program can be run in one of two modes, either as an interactive command interpreter or in batch mode to execute commands given as arguments on the reposurgeon invocation line.

The only differences between these modes are (1) the interactive one begins by turning on the ‘interactive’ option, (2) in batch mode all errors (including normally-recoverable errors in selection-set syntax) are fatal, and (3) each command-line argument beginning with ‘--’ has that stripped off (which in particular means that --help and --version will work as expected).

Also, in interactive mode, Ctrl-P and Ctrl-N will be available to scroll through your command history, and tab completion of both command keywords and name arguments (wherever that makes semantic sense) is available.

It is expected that interactive mode will be used mainly for exploring repository metadata, while conversion experiments will be captured in a script that is gradually improved until the day final cutover can be performed and the old repository decommissioned.

Note that this means the old repository can be left in service while the conversion recipe is under development. Recipe development should be treated as a serious project with its own change tracking.

5.4. Finding your way around

In the remainder of this document, individual commands are described by hanging paragraphs led by the command syntax in a simple dialect of BNF. Metavariables and substitutable text are capitalized, [] surrounds optional arguments, and {} surrounds mandatory ones.

Help is always available.

help [COMMAND]

Follow with space and a command name to show help for the command.

Without an argument, list help topics.

"?" is a shortcut synonym for "help".

If $PAGER is set, help items too long to fit on the screen will be fed to that pager for display.

History is always available.

history

Dump your command list from this session so far.

You can do Ctrl-P or up-arrow to scroll back through the command history list, and Ctrl-N or down-arrow to scroll forward in it. Tab-completion on command keywords is available in combination with these commands.

You don’t need to exit the interpreter to run quick shell commands.

shell [COMMAND-TEXT]

Run a shell command. Honors the $SHELL environment variable.

"!" is a shortcut for this command.

6. General command syntax

Commands to reposurgeon consist of a command keyword, usually preceded by a selection set, sometimes followed by whitespace-separated arguments. It is often possible to omit the selection-set argument and have it default to something reasonable. For commands that are considered safe (no side effects) the default is all events; for risky commands the default is no events.

When a command changes repository state, it will usually so indicate in a response.

Here are some motivating examples. The commands will be explained in more detail after the description of selection syntax.

29..71 list            ;; list summary index of events 29..71.

236..$ list            ;; List events from 236 to the last.

<#523> inspect         ;; Look for commit #523; they are numbered
                       ;; 1-origin from the beginning of the
                       ;; repository.

<2317> inspect         ;; Look for a tag with the name 2317, a tip
                       ;; commit of a branch named 2317, or a commit
                       ;; with legacy ID 2317. Inspect what is found.
                       ;; A plain number is probably a legacy ID
                       ;; inherited from a Subversion revision
                       ;; number.

/regression/ list      ;; list all commits and tags with comments or
                       ;; committer headers or author headers
                       ;; containing the string "regression".

1..:97 & =T delete     ;; delete tags from event 1 to mark 97.

[Makefile] inspect     ;; Inspect all commits with a file op touching
                       ;; Makefile and all blobs referred to in a
                       ;; fileop touching Makefile.

:46 tip                ;; Display the branch tip that owns
                       ;; commit :46.

@dsc(:55) list         ;; Display all commits with ancestry tracing
                       ;; to :55.

@min([.gitignore]) remove .gitignore delete
                       ;; Remove the first .gitignore fileop in the
                       ;; repo.  In a Subversion lift this contains
                       ;; patterns corresponding to Subversion default
                       ;; ignores.

6.1. Regular Expressions

The pattern expressions used in event selections and various commands (attribution, expunge, filter, msgout, path) are those of the Go language, with one exception. Due to a conflict with the use of $ for arguments in the "script" command, we retain Python’s use of backslashes as a leader for references to group matches.

Normally patterns intended to be interpreted as regular expressions are wrapped in slashes (e.g. /foobar/ matches any text containing the string "foobar"), but any punctuation character other than single quote will work as a delimiter in place of the /; this makes it easier to use an actual / in patterns. Matched single quote delimiters mean the literal should be interpreted as plain text, suppressing interpretation of regexp special characters and requiring an anchored, entire match.

Pattern expressions following the command verb may not contain literal whitespace; use \s or \t if you need to. Event-selection regexps may contain literal whitespace.

Regular expressions are not anchored. Use ^ and $ to anchor them to the beginning or end of the search space, when appropriate.
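The difference between slash-delimited and single-quote-delimited patterns can be modeled in a few lines of Python. This is only an analogy: reposurgeon uses Go's regular-expression engine, not Python's re module, and the helper names slash_match and quote_match are invented. For patterns this simple the two engines behave the same way.

```python
import re


def slash_match(pattern, text):
    """Model of /pattern/: an unanchored regular-expression search."""
    return re.search(pattern, text) is not None


def quote_match(literal, text):
    """Model of 'literal': plain text that must match the entire string."""
    return re.fullmatch(re.escape(literal), text) is not None


print(slash_match("foobar", "xx foobar yy"))    # found anywhere in the text
print(slash_match("^foobar$", "xx foobar yy"))  # anchors force a whole match
print(slash_match("a.c", "abc"))                # '.' is a regexp wildcard
print(quote_match("a.c", "abc"))                # '.' is literal inside quotes
```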

Some commands support regular expression flags, and some even add additional flags over the standard set. The documentation for each individual command will include these details.

6.2. Selection syntax

A selection set is ordered; that is, any given element may occur only once, and the set is ordered by when its members were first added.

The selection-set specification syntax is an expression-oriented minilanguage. The most basic term in this language is a location. The following sorts of primitive locations are supported:

event numbers

A plain numeric literal is interpreted as a 1-origin event-sequence number. It is not expected that you will have to use this feature often.

marks

A numeric literal preceded by a colon is interpreted as a mark; see the import stream format documentation for explanation of the semantics of marks.

tag and branch names

The basename of a branch (including branches in the refs/tags namespace) refers to its tip commit. The name of a tag is equivalent to its mark (that of the tag itself, not the commit it refers to). Tag and branch locations are bracketed with < > (angle brackets) to distinguish them from command keywords.

legacy IDs

If the content of name brackets (< >) does not match a tag or branch name, the interpreter next searches legacy IDs of commits. This is especially useful when you have imported a Subversion dump; it means that commits made from it can be referred to by their corresponding Subversion revision numbers.

commit numbers

A numeric literal within name brackets (< >) preceded by # is interpreted as a 1-origin commit-sequence number.

reset targets

If the previous ways of interpreting a name within brackets don’t resolve, the name is checked to see if it matches a reset. If so, the expression resolves to the commit the reset is attached to.

reset@ names

A name with the prefix ‘reset@’ refers to the latest reset with a basename matching the part after the @. Usually there is only one such reset.

$

Refers to the last event.

These may be grouped into sets in the following ways:

ranges

A range is two locations separated by ‘..’, and is the set of events beginning at the left-hand location and ending at the right-hand location (inclusive).

lists

Comma-separated lists of locations and ranges are accepted, with the obvious meaning.
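As an illustration of how ranges and lists combine into an ordered set, here is a Python sketch that handles only plain event numbers and $. The function name expand is an assumption, and the real selection parser accepts far more than this.

```python
def expand(spec, last_event):
    """Expand a spec such as '29..31,40' into an ordered list of event numbers."""

    def location(token):
        # '$' stands for the last event; anything else is an event number.
        return last_event if token == "$" else int(token)

    selected = []
    for part in spec.split(","):
        if ".." in part:
            left, right = part.split("..", 1)
            candidates = range(location(left), location(right) + 1)  # inclusive
        else:
            candidates = [location(part)]
        for event in candidates:
            # Each element occurs once, in order of first addition.
            if event not in selected:
                selected.append(event)
    return selected


print(expand("29..31", 100))
print(expand("236..$", 238))
print(expand("5,2..4,3", 10))
```

The last call shows the ordering rule: 3 is already present from the range, so the trailing 3 adds nothing and the order of first addition is kept.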

There are some other ways to construct event sets:

visibility sets

A visibility set is an expression specifying a set of event types. It will consist of a leading equal sign, followed by type letters. These are the type letters:

B

blobs

Most default selection sets exclude blobs; they have to be manipulated through the commits they are attached to.

C

commits

D

all-delete commits

These are artifacts produced by some older repository-conversion tools.

H

head (branch tip) commits

O

orphaned (parentless) commits

U

commits with callout parents

Z

commits with no fileops

M

merge (multi-parent) commits

F

fork (multi-child) commits

L

commits with unclean multi-line comments

E.g. without a separating empty line after the first

I

commits for which metadata cannot be decoded to UTF-8

T

tags

R

resets

P

passthroughs

All event types simply passed through, including comments, progress commands, and checkpoint commands

N

Legacy IDs

Any comment matching a cookie (legacy-ID) format

Q

Recently touched

Set/cleared by some commands

references

A reference name (bracketed by angle brackets) resolves to a single object, either a commit or tag.

type            interpretation
tag name        annotated tag with that name
branch name     the branch tip commit
legacy ID       commit with that legacy ID
assigned name   name equated to a selection by assign

Note that if an annotated tag and a branch have the same name foo, <foo> will resolve to the tag rather than the branch tip commit.

dates and action stamps

A date or action stamp in angle brackets resolves to a selection set of all matching commits.

type                                   interpretation
RFC3339 timestamp                      commit or tag with that time/date
action stamp (timestamp!email)         commits or tags with that timestamp and author (or committer if no author); aliases of the author are also accepted
yyyy-mm-dd part of RFC3339 timestamp   all commits and tags with that date

To refine the match to a single commit, use a 1-origin index suffix separated by #. Thus <2000-02-06T09:35:10Z> can match multiple commits, but <2000-02-06T09:35:10Z#2> matches only the second in the set.

text search

A text search expression is a regular expression surrounded by forward slashes (to embed a forward slash in it, use a C-like string escape such as \x2f).

A text search normally matches against the comment fields of commits and annotated tags, or against their author/committer names, or against the names of tags; also the text of passthrough objects.

The scope of a text search can be changed with qualifier letters after the trailing slash. These are as follows:

letter   interpretation
a        author name in commit
b        branch name in commit; also matches blobs referenced by commits on matching branches, and tags which point to commits on matching branches
c        comment text of commit or tag
r        committish reference in tag or reset
p        text in passthrough
t        tagger in tag
n        name of tag
B        blob content

Multiple qualifier letters can add more search scopes.

(The "b" qualifier replaces the branch-set syntax in earlier versions of reposurgeon.)

paths

A "path expression" enclosed in square brackets resolves to the set of all commits and blobs related to a path matching the given expression. The path expression itself is either a path literal or a regular expression surrounded by slashes. Immediately after the trailing / of a path regexp you can put any number of the following characters which act as flags: ‘a’, ‘c’, ‘D’, ‘M’, ‘R’, ‘C’, ‘N’.

By default, a path is related to a commit if the latter has a fileop that touches that file path - modifies that change it, deletes that remove it, renames and copies that have it as a source or target. When the ‘c’ flag is in use the meaning changes: the paths related to a commit become all paths that would be present in a checkout for that commit.

A path literal matches a commit if and only if the path literal is exactly one of the paths related to the commit (no prefix or suffix operation is done). In particular a path literal won’t match if it corresponds to a directory in the chosen repository.

A regular expression matches a commit if it matches any path related to the commit anywhere in the path. You can use ^ or $ if you want the expression to only match at the beginning or end of paths. When the ‘a’ flag is in use, the path expression selects commits whose every path matches the regular expression. This is not necessarily a subset of commits selected without the ‘a’ flag because it also selects commits with no related paths (e.g. empty commits, deletealls and commits with empty trees). If you want to avoid those, you can use e.g. ‘[/regexp/] & [/regexp/a]’.
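The behavior of the ‘a’ flag on commits with no related paths follows from ordinary "any" versus "all" logic, which the Python sketch below models. The commit table and helper names are hypothetical, and Python's re stands in for Go's regexp engine.

```python
import re


def any_path_matches(regexp, paths):
    """Default rule: at least one related path matches."""
    return any(re.search(regexp, p) for p in paths)


def all_paths_match(regexp, paths):
    """'a' flag rule: every related path matches (vacuously true if none)."""
    return all(re.search(regexp, p) for p in paths)


# Hypothetical commits, keyed by mark, with the paths their fileops touch.
commits = {
    ":2": ["src/main.c", "src/util.c"],
    ":4": ["src/main.c", "README"],
    ":6": [],  # an empty commit with no related paths
}

without_a = [m for m, paths in commits.items() if any_path_matches(r"^src/", paths)]
with_a = [m for m, paths in commits.items() if all_paths_match(r"^src/", paths)]
both = [m for m in with_a if m in without_a]

print(without_a)  # commits touching something under src/
print(with_a)     # commits touching only src/, plus the empty commit
print(both)       # the intersection drops the empty commit
```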

The flags ‘D’, ‘M’, ‘R’, ‘C’, ‘N’ restrict match checking to the corresponding fileop types. Note that this means an ‘a’ match is easier (not harder) to achieve. These are no-ops when used with ‘c’.

A path literal or regular expression matches a blob if it matches any path that appeared in a modification fileop that referred to that blob. To select purely matching blobs or matching commits, compose a path expression with =B or =C.

If you need to embed ‘[^/]’ into your regular expression (e.g. to express "all characters but a slash") you can use a C-like string escape such as \x2f.

The selection-expression language has named special functions. The syntax for a named function is "@" followed by a function name, followed by an argument in parentheses. Presently the following functions are defined:

@min()

create singleton set of the least element in the argument

@max()

create singleton set of the greatest element in the argument

@amp()

nonempty selection set becomes all events, empty set is returned

@par()

all parents of commits in the argument set

@chn()

all children of commits in the argument set

@dsc()

all commits descended from the argument set (argument set included)

@anc()

all commits ancestral to the argument set (argument set included)

@pre()

events before the argument set

@suc()

events after the argument set

@srt()

sort the argument set by event number.
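The graph-walking functions can be pictured on a small commit DAG. The Python sketch below is a model for intuition only: the five-commit history is hypothetical, every event is assumed to be a commit, and returning results sorted by number is a simplification rather than a statement about reposurgeon's output order.

```python
# Hypothetical history: commit number -> list of parent commit numbers.
# Commit 2 forks into 3 and 4, which are merged again by 5.
PARENTS = {1: [], 2: [1], 3: [2], 4: [2], 5: [3, 4]}


def par(selection):
    """Model of @par(): all parents of the selected commits."""
    return sorted({p for c in selection for p in PARENTS[c]})


def chn(selection):
    """Model of @chn(): all children of the selected commits."""
    return sorted(c for c, ps in PARENTS.items() if any(p in selection for p in ps))


def anc(selection):
    """Model of @anc(): selected commits plus everything they descend from."""
    result, frontier = set(selection), list(selection)
    while frontier:
        for parent in PARENTS[frontier.pop()]:
            if parent not in result:
                result.add(parent)
                frontier.append(parent)
    return sorted(result)


def dsc(selection):
    """Model of @dsc(): selected commits plus everything descended from them."""
    result, changed = set(selection), True
    while changed:
        changed = False
        for child, parents in PARENTS.items():
            if child not in result and any(p in result for p in parents):
                result.add(child)
                changed = True
    return sorted(result)


print(par([5]), chn([2]), anc([3]), dsc([3]))
```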

Set expressions may be combined with the operators ‘|’ and ‘&’ which are, respectively, set union and intersection. The | has lower precedence than intersection, but you may use parentheses ‘(’ and ‘)’ to group expressions in case there is ambiguity.

Any set operation may be followed by ‘?’ to add the set members' neighbors and referents. This extends the set to include the parents and children of all commits in the set, and the referents of any tags and resets in the set. Each blob reference in the set is replaced by all commit events that refer to it. The ? can be repeated to extend the neighborhood depth. The result of a ? extension is sorted so the result is in ascending order.

Do set negation with prefix ‘~’; it has higher precedence than & and | but lower than ?.
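Python's set operators happen to share the relative precedence of | and &, so they make a convenient model of how an unparenthesized expression groups. The event numbers below are hypothetical, and negation is modeled as subtraction from the set of all events.

```python
# Hypothetical repository with eight events.
ALL_EVENTS = frozenset(range(1, 9))
commits = frozenset({2, 4, 5, 7})  # stands in for =C
merges = frozenset({7})            # stands in for =M
tags = frozenset({8})              # stands in for =T


def negate(selection):
    """Model of prefix ~: everything not in the selection."""
    return ALL_EVENTS - selection


# "=T | =C & =M" groups as "=T | (=C & =M)" because & binds tighter than |.
default_grouping = tags | commits & merges
forced_grouping = (tags | commits) & merges

# "=C & ~=M": commits that are not merges.
non_merge_commits = commits & negate(merges)

print(sorted(default_grouping))
print(sorted(forced_grouping))
print(sorted(non_merge_commits))
```

The first two results differ, which is the case where explicit parentheses are worth writing even when you are sure of the precedence rules.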

6.3. Command syntax

Following the (optional) selection set will be a whitespace-separated command name, possibly another whitespace-separated subcommand name, and possibly following arguments.

The syntax of following arguments is variable according to the requirements of individual commands, but there are a couple of general rules.

  • You can have comments in a script, led by the character "#". Both whole-line and "winged" comments following command arguments are supported. Note that reposurgeon’s command parser is fairly primitive and will be confused by a literal # in a command argument.

  • Many commands interpret C/Go style backslash escapes like \n in arguments. You can usually, for example, get around having to include a literal # in an argument by writing \x23.

  • Some commands support option flags. These are led with a --, so if there is an option flag named "foo" you would write it as "--foo". Option flags are parsed out of the command line before any other interpretation is performed, and can be anywhere on the line. The order of option flags is never significant.

  • When an option flag "foo" sets a value, the syntax is --foo=xxx with no spaces around the equal sign.

  • All commands that expect data to be presented on standard input support input redirection. You may write "<myfile" to take input from the file named "myfile". Redirections are parsed out early, before the command arguments proper are interpreted, and can be anywhere on the line.

  • All commands that expect data to be presented on standard input also accept a here-document, just like the shell syntax for here-documents with a leading "<<". There are two here-documents in the quick-start example.

  • Most commands that normally ship data to standard output accept output redirection. As in the shell, you can write ">outfile" to send the command output to "outfile", and ">>outfile2" to append to outfile2.

  • Some commands take following arguments that are regular expressions. In this context, they still require start and end delimiters as they do when used in a selection prefix, but if you need to have a / in the expression the delimiters can be any punctuation character other than an ASCII single quote. As a reminder, these are described in the embedded help as pattern expressions.

  • Also note that following-argument regular expressions may not contain whitespace; if you need to specify whitespace or a non-printable character use a standard C-style escape such as \s for space.

6.4. Redirection and shell-like features

An optional command argument prefixed by "<" indicates that the command accepts input redirection; an optional argument prefixed by ">" indicates that the command accepts output redirection. There must be whitespace before the "<" or ">" so that the command parser won’t falsely match uses of these characters in regular expressions.

Commands that support output redirection can also be followed by a pipe bar and a normal Unix command. For example, "list | more" directs the output of a list command to more(1). Some whitespace around the pipe bar is required to distinguish it from uses of the same character as the alternation operator in regular expressions.

The command line following the first pipe bar, if present, is passed to a shell and may contain a general shell command line, including more pipe bars. The SHELL environment variable can set the shell, falling back to /bin/sh.

Beware that while the reposurgeon CLI mimics these simple shell features, many things you can do in a real shell won’t work until the right-hand side of a pipe-bar output redirection, if there is one. String-quoting arguments will fail unless the specific, documented syntax of a command supports that. You can’t redirect standard error (but see the "log" command for a rough equivalent). And you can’t pipe input from a shell command.

In general you should avoid trying to get cute with the command parser. It’s stupider than it looks.

7. Import and Export

reposurgeon can hold multiple repository states in core. Each has a name. At any given time, one may be selected for editing. Commands in this group import repositories, export them, and manipulate the in-core list and the selection.

If you are planning a conversion from Subversion, you should probably read Working with Subversion after this section.

If you are planning a conversion from Mercurial, you should probably read Working with Mercurial after this section.

7.1. Reading and writing repositories

read [ --format=fossil ] [ --no-implicit ] [ DIRECTORY | - | <INFILE ]

With a directory-name argument, this command attempts to read in the contents of a repository in any supported version-control system under that directory; read with no arguments does this in the current directory. If input is redirected from a plain file, it will be read in as a fast-import stream or Subversion dumpfile. With an argument of ‘-’, this command reads a fast-import stream or Subversion dumpfile from standard input (this will be useful in filters constructed with command-line arguments).

If the content is a fast-import stream, any “cvs-revision” property on a commit is taken to be a newline-separated list of CVS revision cookies pointing to the commit, and used for reference lifting.

If the content is a fast-import stream, any “legacy-id” property on a commit is taken to be a legacy ID token pointing to the commit, and used for reference-lifting.

If the read location is a git repository and contains a .git/cvsauthors file (such as is left in place by ‘git cvsimport -A’) that file will be read in as if it had been given to the ‘authors read’ command.

If the read location is a directory, and its repository subdirectory has a file named legacy-map, that file will be read as though passed to a ‘legacy read’ command.

If the read location is a file and the --format=fossil option is used, the file is interpreted as a Fossil repository.

The just-read-in repo is added to the list of loaded repositories and becomes the current one, selected for surgery. If it was read from a plain file and the file name ends with one of the extensions ‘.fi’ or ‘.svn’, that extension is removed from the load list name.

Normally, missing ‘from’ links in input streams are defaulted to the previous commit. The --no-implicit option disables this and may enable round-tripping of some streams on which it would fail (note however that git fast-export generates explicit ‘from’ links). This option will mainly be useful for testing and debugging.

Additional options to this command, not listed here, are given in Reading Subversion repositories, as they apply only to Subversion repositories.

Note: this command does not take a selection set.

[SELECTION] write [--legacy] [--format=fossil] [--noincremental] [--callout] [>OUTFILE | - | DIR]

Dump selected events as a fast-import stream representing the edited repository; the default selection set is all events. The stream goes to standard output if there is no argument or the argument is ‘-’, or to the target of an output redirect.

Alternatively, if there is no redirect and the argument names a directory, the repository is rebuilt into that directory, with any selection set being ignored; if that target directory is nonempty its contents are backed up to a save directory.

If the write location is a file and the --format=fossil option is used, the file is written in Fossil repository format.

With the --legacy option, the Legacy-ID of each commit is appended to its commit comment at write time. This option is mainly useful for debugging conversion edge cases.

If you specify a partial selection set such that some commits are included but their parents are not, the output will include incremental dump cookies for each branch with an origin outside the selection set, just before the first reference to that branch in a commit. An incremental dump cookie looks like “refs/heads/foo^0” and is a clue to export-stream loaders that the branch should be glued to the tip of a pre-existing branch of the same name. The --noincremental option suppresses this behavior.

Specifying a partial selection set, including a commit object, forces the inclusion of every blob to which it refers and every tag that refers to it.

Specifying a partial selection may cause a situation in which some parent marks in merges don’t correspond to commits present in the dump. When this happens and the --callout option was specified, the write code replaces the merge mark with a callout, the action stamp of the parent commit; otherwise the parent mark is omitted. Importers will fail when reading a stream dump with callouts; it is intended to be used by the ‘graft’ command.

Specifying a write selection set with gaps in it is allowed but unlikely to lead to good results if it is loaded by an importer.

Property extensions will be omitted from the output if the importer for the preferred repository type cannot digest them.

Note: to examine small groups of commits without the progress meter, use ‘inspect’.

7.2. Repository type preference

prefer [VCS-NAME]

Report or set (with argument) the preferred type of repository. With no arguments, describe capabilities of all supported systems. With an argument (which must be the name of a supported version-control system, and tab-completes in that list) this has two effects:

First, if there are multiple repositories in a directory you do a read on, reposurgeon will read the preferred one (otherwise it will complain that it can’t choose among them).

Secondly, this will change reposurgeon’s preferred type for output. This means that when you do a write to a directory, it will build a repo of the preferred type rather than its original type (if it had one).

If no preferred type has been explicitly selected, reading in a repository (but not a fast-import stream) will implicitly set reposurgeon’s preference to the type of that repository.

sourcetype [VCS-NAME]

Report (with no arguments) or select (with one argument) the current repository’s source type. This type is normally set at repository-read time, but may remain unset if the source was a stream file. The argument tab-completes using the list of supported systems.

The source type affects the interpretation of legacy IDs (for purposes of the =N visibility set and the 'references' command) by controlling the regular expressions used to recognize them. If no preferred output type has been set, it may also change the output format of stream files made from the repository.

The source type is reliably set whenever a live repository is read, or when a Subversion stream or Fossil dump is interpreted - but not necessarily by other stream files. Streams generated by cvs-fast-export(1) using the "--reposurgeon" option are detected as CVS. In some other cases, the source system is detected from the presence of magic $-headers in contents blobs.

7.3. Rebuilds in place

reposurgeon can rebuild an altered repository in place. Untracked files are normally saved and restored when the contents of the new repository are checked out (but see the documentation of the ‘preserve’ command for a caveat).

rebuild [DIRECTORY]

Rebuild a repository from the state held by reposurgeon. This command does not take a selection set.

The single argument, if present, specifies the target directory in which to do the rebuild; if the repository read was from a repo directory (and not a git-import stream), it defaults to that directory. If the target directory is nonempty its contents are backed up to a save directory. Files and directories on the repository’s preservation list are copied back from the backup directory after repo rebuild. The default preserve list depends on the repository type, and can be displayed with the "stats" command.

If reposurgeon has a nonempty legacy map, it will be written to a file named "legacy-map" in the repository subdirectory as though by a "legacy write" command. (This will normally be the case for Subversion and CVS conversions.)

7.4. Crash recovery

This section will become relevant only if reposurgeon or something underneath it in the software and hardware stack crashes while in the middle of writing out a repository, in particular if the target directory of the rebuild is your current directory.

The tool has two conflicting objectives. On the one hand, we never want to risk clobbering a pre-existing repo. On the other hand, we want to be able to run this tool in a directory with a repo and modify it in place.

We resolve this dilemma by playing a game of three-directory monte.

  1. First, we build the repo in a freshly-created staging directory. If your target directory is named /path/to/foo, the staging directory will be a peer named /path/to/foo-stageNNNN, where NNNN is a cookie derived from reposurgeon’s process ID.

  2. We then make an empty backup directory. This directory will be named /path/to/foo.~N~, where N is incremented so as not to conflict with any existing backup directories. reposurgeon never, under any circumstances, ever deletes a backup directory.

    So far, all operations are safe; the worst that can happen up to this point if the process gets interrupted is that the staging and backup directories get left behind.

  3. The critical region begins. We first move everything in the target directory to the backup directory.

  4. Then we move everything in the staging directory to the target.

  5. We finish off by restoring untracked files in the target directory from the backup directory. That ends the critical region.

During the critical region, all signals that can be ignored are ignored.
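The directory shuffle above can be modelled in a few lines of Python. This is only an illustrative sketch, not reposurgeon's actual code: the function names next_backup_dir and swap_in are hypothetical, the staging directory is assumed to be already populated, and signal handling during the critical region is left out.

```python
import shutil
from pathlib import Path


def next_backup_dir(target):
    """Return the first unused backup name of the form <target>.~N~."""
    n = 1
    while True:
        candidate = target.with_name(f"{target.name}.~{n}~")
        if not candidate.exists():
            return candidate
        n += 1


def swap_in(target, staging, preserve):
    """Move staging's contents into target; return the backup directory.

    'preserve' is a list of names (relative to target) to restore afterwards.
    """
    backup = next_backup_dir(target)
    backup.mkdir()  # step 2: empty backup directory, never deleted
    # Critical region begins.
    for entry in list(target.iterdir()):  # step 3: target -> backup
        shutil.move(str(entry), str(backup / entry.name))
    for entry in list(staging.iterdir()):  # step 4: staging -> target
        shutil.move(str(entry), str(target / entry.name))
    for name in preserve:  # step 5: restore untracked files
        saved = backup / name
        if saved.is_dir():
            shutil.copytree(saved, target / name)
        elif saved.exists():
            shutil.copy2(saved, target / name)
    # Critical region ends.
    return backup
```

Because the backup directory keeps everything that was in the target, an interruption at any point inside the critical region still leaves the old contents recoverable from the newest .~N~ directory.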

7.5. File preservation

When the repository type you are working with has a "lister" method, reposurgeon can tell which files in a repository directory are not checked in and will copy them into the edited repository made by a rebuild.

The following commands are required only if there is no lister method and you have to set preservations by hand. Under systems with such a method (which include git and hg), all files that are neither beneath the repository dot directory nor under reposurgeon temporary directories are preserved automatically.

preserve [PATH…​]

Add (presumably untracked) files or directories to the repo’s list of paths to be restored from the backup directory after a rebuild. Each argument, if any, is interpreted as a pathname. The current preserve list is displayed afterwards.

unpreserve [PATH…​]

Remove (presumably untracked) files or directories from the repo’s list of paths to be restored from the backup directory after a rebuild. Each argument, if any, is interpreted as a pathname. The current preserve list is displayed afterwards.

7.6. Incorporating release tarballs

When converting a legacy repository, it sometimes happens that there are archived releases of the project surviving from before the date of the repository’s initial commit. It may be desirable to insert those releases at the front of the repository history. To do this, use this command:

{SELECTION} incorporate [--date=YY-MM-DDTHH:MM:SS|--after|--firewall] [TARBALL…​]

Insert the contents of specified tarballs as commits. The tarball names are given as arguments; if no arguments, a list is read from stdin. Tarballs may be gzipped or bzipped. The initial segment of each path is assumed to be a version directory and stripped off. The number of segments stripped off can be set with the option --strip=<n>, n defaulting to 1.

Takes a singleton selection set. Normally inserts before that commit; with the option --after, insert after it. The default selection set is the very first commit of the repository.

The option --date can be used to set the commit date. It takes an argument, which is expected to be an RFC3339 timestamp.

The generated commits have a committer field (the invoking user) and each gets as date the modification time of the newest file in the tarball (not the mod time of the tarball itself). No author field is generated. A comment recording the tarball name is generated.
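To make the path-stripping and dating rules concrete, here is a rough Python model using the standard tarfile module. It is not how reposurgeon itself is implemented; strip_segments and plan_commit are hypothetical helper names, and the sketch assumes the tarball contains at least one regular file.

```python
import io
import tarfile


def strip_segments(path, n=1):
    """Drop the first n slash-separated segments of a tarball member path."""
    return "/".join(path.split("/")[n:])


def plan_commit(tar, strip=1):
    """Return (commit_time, paths) for the regular files in an open tarfile.

    commit_time is the modification time of the newest member, not of the
    tarball itself; paths have their leading version directory removed.
    """
    files = [m for m in tar.getmembers() if m.isfile()]
    newest = max(m.mtime for m in files)
    paths = [strip_segments(m.name, strip) for m in files]
    return newest, paths
```

Opening the archive with tarfile.open(name, "r:*") lets the module detect gzip or bzip2 compression on its own, which matches the two compressed forms mentioned above.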

Note that the import stream generated by this command is - while correct - not optimal, and may in particular contain duplicate blobs.

With the --firewall option, generate an additional commit after the sequence consisting only of deletes crafted to prevent the incorporated content from leaking forward.

7.7. The repository list

Reposurgeon can have several repositories loaded at once. The following commands operate on the repository list.

choose [REPO-NAME]

Choose a named repo on which to operate. The name of a repo is normally the basename of the directory or file it was loaded from, but repos loaded from standard input are 'unnamed'. The program will add a disambiguating suffix if there have been multiple reads from the same source.

With no argument, lists the names of the currently stored repositories. The second column is '*' for the currently selected repository, '-' for others.

With an argument, the command tab-completes on the above list.

drop [REPO-NAME]

Drop a repo named by the argument from reposurgeon’s list, freeing the memory used for its metadata and deleting on-disk blobs. With no argument, drops the currently chosen repo. Tab-completes on the list of loaded repositories.

rename {NEW-NAME}

Rename the currently chosen repo; requires an argument. Won’t do it if there is already one by the new name.

8. Information and reports

Commands in this group report information about the selected repository.

The output of these commands can individually be redirected to a named output file. Where indicated in the syntax, you can prefix the output filename with ‘>’ and give it as a following argument. If you use ‘>>’ the file is opened for append rather than write.

8.1. Reports on the DAG

[SELECTION] list [>OUTFILE]

Display commits in a human-friendly format; the first column is raw event numbers, the second a timestamp in UTC. If the repository has legacy IDs, they will be displayed in the third column. The leading portion of the comment follows.

[SELECTION] index [>OUTFILE]

Display four columns of info on selected events: their number, their type, the associated mark (or '-' if no mark) and a summary field varying by type. For a branch or tag it’s the reference; for a commit it’s the commit branch; for a blob it’s a space-separated list of the repository paths of the files with the blob as content.

[SELECTION] stamp [>OUTFILE]

Display full action stamps corresponding to commits in the selection set. The stamp is followed by the first line of the commit message.

[SELECTION] tip [>OUTFILE]

Display the branch tip names associated with commits in the selection set. These will not necessarily be the same as their branch fields (which will often be tag names if the repo contains either annotated or lightweight tags).

If a commit is at a branch tip, its tip is its branch name. If it has only one child, its tip is the child’s tip. If it has multiple children, then if there is a child with a matching branch name its tip is the child’s tip. Otherwise this function throws a recoverable error.
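The tip rule is recursive, which a small Python model may make easier to follow. This is an illustrative sketch only: the dictionaries branch (commit to branch name) and children (commit to list of child commits) are hypothetical stand-ins for the repository state, and a commit with no children is assumed to be a branch tip.

```python
def tip(commit, branch, children):
    """Return the branch tip name for a commit, following the rule above."""
    kids = children.get(commit, [])
    if not kids:
        # At a branch tip: the tip is the commit's own branch name.
        return branch[commit]
    if len(kids) == 1:
        # Only one child: inherit the child's tip.
        return tip(kids[0], branch, children)
    for kid in kids:
        # Several children: follow the one on the same branch, if any.
        if branch[kid] == branch[commit]:
            return tip(kid, branch, children)
    raise ValueError(f"cannot determine a tip for {commit}")
```

The ValueError stands in for the recoverable error mentioned above; it arises when a commit forks into several children and none of them carries the commit's own branch name.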

[SELECTION] tags [>OUTFILE]

Display tags and resets: three fields, an event number and a type and a name. Branch tip commits associated with tags are also displayed with the type field 'commit'.

[SELECTION] inspect [>OUTFILE]

Dump a fast-import stream representing selected events to standard output or via > redirect to a file. Just like a write, except (1) the progress meter is disabled, and (2) there is an identifying header before each event dump.

[SELECTION] graph [>OUTFILE]

Emit a visualization of the commit graph in the DOT markup language used by the graphviz tool suite. This can be fed as input to the main graphviz rendering program dot(1), which will yield a viewable image.

Because graph supports output redirection, you can do this:

graph | dot -Tpng | display

You can substitute in your own preferred image viewer, of course.

[SELECTION] lint [--OPTION…​] [>OUTFILE]

Look for DAG and metadata configurations that may indicate a problem. Presently can check for: (1) Mid-branch deletes, (2) disconnected commits, (3) parentless commits, (4) the existence of multiple roots, (5) committer and author IDs that don’t look well-formed as DVCS IDs, (6) multiple child links with identical branch labels descending from the same commit, (7) time and action-stamp collisions.

Options to issue only partial reports are supported; "lint --options" or "lint -?" lists them.

The options and output format of this command are unstable; they may change without notice as more sanity checks are added.

8.2. Statistics

stats [>OUTFILE]

Report object counts for the loaded repository.

{SELECTION} count [>OUTFILE]

Report a count of items in the selection set. Default set is everything in the currently-selected repo.

[SELECTION] sizes [>OUTFILE]

Print a report on data volume per branch; takes a selection set, defaulting to all events. The numbers tally the size of uncompressed blobs, commit and tag comments, and other metadata strings (a blob is counted each time a commit points at it). Not an exact measure of storage size: intended mainly as a way to get information on how to efficiently partition a repository that has become large enough to be unwieldy.

8.3. Examining tree states

[SELECTION] manifest [PATTERN] [>OUTFILE]

Print commit path lists. Takes an optional selection set argument defaulting to all commits, and an optional pattern expression. For each commit in the selection set, print the mapping of all paths in that commit tree to the corresponding blob marks, mirroring what files would be created in a checkout of the commit. If a regular expression is given, only print "path → mark" lines for paths matching it. See "help regexp" for more information about regular expressions.

{SELECTION} checkout

Check out files for a specified commit into a directory. The selection set must resolve to a singleton commit.

{SELECTION} diff [>OUTFILE]

Display the difference between commits. Takes a selection-set argument which must resolve to exactly two commits.

9. Surgical Operations

These are the operations the rest of reposurgeon is designed to support.

9.1. Commit deletion

[SELECTION] squash [POLICY…​]

Combine or delete commits in a selection set of events. The default selection set for this command is empty. Has no effect on events other than commits unless the --delete policy is selected; see the ‘delete’ command for discussion.

Normally, when a commit is squashed, its file operation list (and any associated blob references) gets either prepended to the beginning of the operation list of each of the commit’s children or appended to the operation list of each of the commit’s parents. Then children of a deleted commit get it removed from their parent set and its parents added to their parent set.

The analogous operation is performed on commit comments, so no comment text is ever outright discarded. Exception: comments consisting of “*** empty log message ***”, as generated by CVS, are ignored.

The default is to squash forward, modifying children; but see the list of policy modifiers below for how to change this.
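As a toy model of the fileop and parent-link bookkeeping just described, consider the following Python sketch. It is not reposurgeon's implementation: the squash function and its dictionary arguments are hypothetical, fileops are plain strings, and neither comment merging nor the later normalization pass is modelled.

```python
def squash(commit, ops, parents, children, pushback=False):
    """Fold a commit's fileops into its neighbors and unlink it from the DAG.

    ops, parents and children each map a commit name to a list.
    """
    if pushback:
        for p in parents[commit]:
            ops[p] = ops[p] + ops[commit]  # append to each parent
    else:
        for c in children[commit]:
            ops[c] = ops[commit] + ops[c]  # prepend to each child
    for c in children[commit]:
        # The child loses the squashed commit as parent and gains its parents.
        i = parents[c].index(commit)
        parents[c][i:i + 1] = parents[commit]
    for p in parents[commit]:
        # Mirror the same change in the parents' child lists.
        i = children[p].index(commit)
        children[p][i:i + 1] = children[commit]
    del ops[commit], parents[commit], children[commit]
```

In a linear history A, B, C, squashing B forward leaves C with B's operations followed by its own and with A as its parent; with pushback=True the operations land at the end of A's list instead.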

Warning
It is easy to get the bounds of a squash command wrong, with confusing and destructive results. Beware thinking you can squash on a selection set to merge all commits except the last one into the last one; what you will actually do is to merge all of them to the first commit after the selected set.

Normally, any tag pointing to a combined commit will also be pushed forward. But see the list of policy modifiers below for how to change this.

Following all operation moves, every one of the altered file operation lists is reduced to a shortest normalized form. The normalized form detects various combinations of modification, deletion, and renaming and simplifies the operation sequence as much as it can without losing any information.

The following modifiers change these policies:

--delete

Simply discards all file ops and tags associated with deleted commit(s).

--no-coalesce

Do not normalize the modified commit operations.

--pushback

Append fileops to parents, rather than prepending to children.

--pushforward

Prepend fileops to children. This is the default; it can be specified in a lift script for explicitness about intentions.

--tagforward

Any tag on the deleted commit is pushed forward to the first child rather than being deleted. This is the default; it can be specified for explicitness.

--tagback

Any tag on the deleted commit is pushed backward to the first parent rather than being deleted.

--quiet

Suppresses warning messages about deletion of commits with non-delete fileops.

--complain

The opposite of --quiet. Can be specified for explicitness.

--empty-only

Complain if a squash operation modifies a nonempty comment.

--blobs

Allow deletion of selected blobs.

Under any of these policies except --delete, deleting a commit that has children does not back out the changes made by that commit, as they will still be present in the blobs attached to versions past the end of the deletion set. All a delete does when the commit has children is lose the metadata information about when and by whom those changes were actually made; after the delete any such changes will be attributed to the first undeleted children of the deleted commits. It is expected that this command will be useful mainly for removing commits mechanically generated by repository converters such as cvs2svn.

[SELECTION] delete

Delete a selection set of events. The default selection set for this command is empty. Tags, resets, and passthroughs are deleted with no side effects. Blobs cannot be directly deleted with this command; they are removed only when removal of fileops associated with commits requires this.

When a commit is deleted, what becomes of tags and fileops attached to it is controlled by policy flags. A delete is equivalent to a squash with the --delete flag.

Clears all Q bits.

9.2. Commit mutation

{SELECTION} merge

Create a merge link. Takes a selection set argument, ignoring all but the lowest (source) and highest (target) members. Creates a merge link from the highest member (child) to the lowest (parent).

{SELECTION} unmerge

Linearizes a commit. Takes a selection set argument, which must resolve to a single commit, and removes all its parents except for the first. It is equivalent to reparent --rebase {first parent},{commit}, where {commit} is the selection set given to unmerge and {first parent} is a set resolving to that commit’s first parent, but doesn’t need you to find the first parent yourself, saving time and avoiding errors when nearby surgery would make a manual first parent argument stale.

[SELECTION] reparent [OPTIONS…] [POLICY]

Changes the parent list of a commit. Takes a selection set, zero or more option arguments, and an optional policy argument.

Selection set: The selection set must resolve to one or more commits. The selected commit with the highest event number (not necessarily the last one selected) is the commit to modify. The remainder of the selected commits, if any, become its parents: the selected commit with the lowest event number (which is not necessarily the first one selected) becomes the first parent, the selected commit with second lowest event number becomes the second parent, and so on. All original parent links are removed. Examples:

# this makes 17 the parent of 33
17,33 reparent

# this also makes 17 the parent of 33
33,17 reparent

# this makes 33 a root (parentless) commit
33 reparent

# this makes 33 an octopus merge commit.  its first parent
# is commit 15, second parent is 17, and third parent is 22
22,33,15,17 reparent

The option --use-order says to use the selection order to determine which selected commit is the commit to modify and which are the parents (and if there are multiple parents, their order). The last selected commit (not necessarily the one with the highest event number) is the commit to modify, the first selected commit (not necessarily the one with the lowest event number) becomes the first parent, the second selected commit becomes the second parent, and so on. Examples:

# this makes 33 the parent of 17
33,17 reparent --use-order

# this makes 17 an octopus merge commit.  its first parent
# is commit 22, second parent is 33, and third parent is 15
22,33,15,17 reparent --use-order
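The two ordering rules can be summarized in a short Python sketch. The function name plan_reparent is hypothetical and the sketch only decides which commit is modified and in what order its parents are listed; it does not model the re-sort or the manifest handling.

```python
def plan_reparent(selection, use_order=False):
    """Return (child, parents) for a selection given as event numbers.

    By default the selection is ranked by event number; with use_order
    the order in which the commits were selected is kept.
    """
    if not selection:
        raise ValueError("selection must contain at least one commit")
    ordered = list(selection) if use_order else sorted(selection)
    # The last commit is the one modified; the rest become its parents.
    return ordered[-1], ordered[:-1]
```

For the selection 22,33,15,17 this yields commit 33 with parents 15, 17, 22 by default, and commit 17 with parents 22, 33, 15 when use_order is set, matching the examples above.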

Because ancestor commit events must appear before their descendants, giving a commit with a low event number a parent with a high event number triggers a re-sort of the events. A re-sort assigns different event numbers to some or all of the events. Re-sorting only works if the reparenting does not introduce any cycles. To swap the order of two commits that have an ancestor–descendant relationship without introducing a cycle during the process, you must reparent the descendant commit first.

By default, the manifest of the reparented commit is computed before modifying it; a deleteall and some fileops are prepended so that the manifest stays unchanged even when the first parent has been changed. This behavior can be changed by specifying a policy flag. --rebase inhibits the default behavior—no deleteall is issued, and the tree contents of all descendants can be modified as a result.

[SELECTION] split [ at {M} | by {PREFIX} ]

Split a specified commit in two, the opposite of squash.

The selection set is required to be a commit location; the modifier is a preposition which indicates which splitting method to use. If the preposition is 'at', then the third argument must be an integer 1-origin index of a file operation within the commit. If it is 'by', then the third argument must be a pathname to be matched.

The commit is copied and inserted into a new position in the event sequence, immediately following itself; the duplicate becomes the child of the original, and replaces it as parent of the original’s children. Commit metadata is duplicated; the mark of the new commit is then changed. If the new commit has a legacy ID, the suffix '.split' is appended to it.

Finally, some file operations - starting at the one matched or indexed by the split argument - are moved forward from the original commit into the new one. Legal indices are 2-n, where n is the number of file operations in the original commit.
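As a rough model of the two splitting methods (the session transcript that follows shows the real tool's output), the Python sketch below partitions a list of file operations. The names split_at and split_by are hypothetical, each fileop is reduced to a (type, path) pair, and the prefix-matching behavior of 'by' is an assumption inferred from that transcript.

```python
def split_at(ops, index):
    """Split at a 1-origin index; the indexed op and all later ones move."""
    if not 2 <= index <= len(ops):
        raise ValueError("legal indices are 2 through the number of fileops")
    return ops[:index - 1], ops[index - 1:]


def split_by(ops, prefix):
    """Split by path; ops whose path starts with prefix move to the new commit."""
    keep = [op for op in ops if not op[1].startswith(prefix)]
    move = [op for op in ops if op[1].startswith(prefix)]
    if not move:
        raise ValueError(f"no fileop path starts with {prefix!r}")
    return keep, move
```

Each function returns a pair: the operations that stay in the original commit, then the operations that move into the new one. The lower bound of 2 in split_at guarantees the original commit keeps at least one operation.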

reposurgeon% :3 inspect
Event 4 =================================================================
commit refs/heads/master
#legacy-id 2
mark :3
committer brooksd <brooksd> 1353813663 +0000
data 33
add a new file in each directory
M 100644 :1 .gitignore
M 100644 :2 bar/src
M 100644 :2 baz/src
M 100644 :2 foo/src

reposurgeon% :3 split by bar
reposurgeon: new commits are events 4 and 5.

reposurgeon% 4,5 inspect
Event 4 =================================================================
commit refs/heads/master
#legacy-id 2
mark :3
committer brooksd <brooksd> 1353813663 +0000
data 33
add a new file in each directory
M 100644 :1 .gitignore
M 100644 :2 baz/src
M 100644 :2 foo/src

Event 5 =================================================================
commit refs/heads/master
#legacy-id 2.split
mark :7
committer brooksd <brooksd> 1353813663 +0000
data 33
add a new file in each directory
from :3
M 100644 :2 bar/src

{SELECTION} add { "D" {PATH} | "M" {PERM} {MARK} {PATH} | "R" {SOURCE} {TARGET} | "C" {SOURCE} {TARGET} }

In a specified commit, add a specified fileop.

For a D operation to be valid there must be an M operation for the path in the commit’s ancestry.

For an M operation to be valid, PERM must be a token ending with 755 or 644 and the MARK must refer to a blob that precedes the commit location. If the MARK is nonexistent or names something other than a blob, attempting to rebuild a live repository will throw a fatal error.

For an R or C operation to be valid, there must be an M operation for the SOURCE path in the commit’s ancestry.

Some examples:


# At commit :15, stop .gitignore from being checked out in later revisions
:15 add D .gitignore

# Create a new blob :2317 with specified content. At commit :17, add modify
# or creation of a file named "spaulding" with its content in the new blob.
# Make it check out with 755 (-rwxr-xr-x) permissions rather than the
# normal 644 (-rw-r--r--).
blob :2317 <<EOF
Hello, I must be going.
EOF
:17 add M 100755 :2317 spaulding

[SELECTION] remove [DMRCN] {OP} [to {SELECTION}]

From a specified commit, remove a specified fileop. The syntax:

OP must be one of (a) the keyword 'deletes', (b) a file path, (c) a file path preceded by an op type set (some subset of the letters DMRCN), or (d) a 1-origin numeric index. The 'deletes' keyword selects all D fileops in the commit; the others select one each.

If the to clause is present, the removed op is appended to the commit specified by the following singleton selection set. This option cannot be combined with 'deletes'.

[SELECTION] tagify [ --tagify-merges | --canonicalize | --tipdeletes ]

Search for empty commits and turn them into tags. Takes an optional selection set argument defaulting to all commits. For each commit in the selection set, turn it into a tag with the same message and author information if it has no fileops. By default merge commits are not considered, even if they have no fileops (thus no tree differences with their first parent). To change that, see the '--tagify-merges' option.

The name of the generated tag will be 'emptycommit-<ident>', where <ident> is generated from the legacy ID of the deleted commit, or from its mark, or from its index in the repository, with a disambiguation suffix if needed.

tagify currently recognizes three options: first is '--canonicalize' which makes tagify try harder to detect trivial commits by first removing all fileops of the selected commits which have no actual effect when processed by fast-import. For example, file modification ops that don’t actually change the content of the file, or deletion ops that delete a file that doesn’t exist in the parent commit get removed. This rarely happens naturally, but can happen after some surgical operations, such as reparenting.

The second option is '--tipdeletes' which makes tagify also consider branch tips with only deleteall fileops to be candidates for tagification. The corresponding tags get names of the form 'tipdelete-<branchname>' rather than the default 'emptycommit-<ident>'.

The third option is '--tagify-merges', which makes reposurgeon also tagify merge commits that have no fileops. When this is done the merge link is moved to the tagified commit’s parent.

[SELECTION] reorder [--quiet]

Re-order a contiguous range of commits.

Older revision control systems tracked change history on a per-file basis, rather than as a series of atomic "changesets", which often made it difficult to determine the relationships between changes. Some tools which convert a history from one revision control system to another attempt to infer changesets by comparing file commit comment and time-stamp against those of other nearby commits, but such inference is a heuristic and can easily fail.

In the best case, when inference fails, a range of commits in the resulting conversion which should have been coalesced into a single changeset instead end up as a contiguous range of separate commits. This situation typically can be repaired easily enough with the 'coalesce' or 'squash' commands. However, in the worst case, numerous commits from several different "topics", each of which should have been one or more distinct changesets, may end up interleaved in an apparently chaotic fashion. To deal with such cases, the commits need to be re-ordered, so that those pertaining to each particular topic are clumped together, and then possibly squashed into one or more changesets pertaining to each topic. This command, 'reorder', can help with the first task; the 'squash' command with the second.

Selected commits are re-arranged in the order specified; for instance: ":7,:5,:9,:3 reorder". The specified commit range must be contiguous; each commit must be accounted for after re-ordering. Thus, for example, ':5' cannot be omitted from ":7,:5,:9,:3 reorder". (To drop a commit, use the 'delete' or 'squash' command.) The selected commits must represent a linear history; however, the lowest numbered commit being re-ordered may have multiple parents, and the highest numbered may have multiple children.
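The contiguity rule can be sketched in Python. This is only an illustration of the constraint, not reposurgeon's implementation; the function name check_reorder and the representation of the history as a plain list of commit marks are assumptions made for the example.

```python
def check_reorder(requested, history):
    """Return True if `requested` is a legal re-ordering of part of `history`.

    `history` is a linear list of commit marks in their current order and
    `requested` is the order given on the command line.  Both names are
    hypothetical and exist only for this illustration.
    """
    # Every commit must be named exactly once.
    if not requested or len(set(requested)) != len(requested):
        return False
    if any(mark not in history for mark in requested):
        return False
    # The selected commits must occupy a gap-free run of the history.
    positions = sorted(history.index(mark) for mark in requested)
    return positions == list(range(positions[0], positions[0] + len(positions)))


history = [":1", ":3", ":5", ":7", ":9", ":11"]
print(check_reorder([":7", ":5", ":9", ":3"], history))  # True
print(check_reorder([":7", ":9", ":3"], history))        # False, ':5' omitted
```

In a real repository blobs are interleaved with commits, so marks are not consecutive; the sketch sidesteps that by looking only at commit positions.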

Re-ordered commits and their immediate descendants are inspected for rudimentary fileops inconsistencies. Warns if re-ordering results in a commit trying to delete, rename, or copy a file before it was ever created. Likewise, warns if all of a commit’s fileops become no-ops after re-ordering. Other fileops inconsistencies may arise from re-ordering, both within the range of affected commits and beyond; for instance, moving a commit which renames a file ahead of a commit which references the original name. Such anomalies can be discovered via manual inspection and repaired with the 'add' and 'remove' (and possibly 'path') commands. Warnings can be suppressed with '--quiet'.

In addition to adjusting their parent/child relationships, re-ordering commits also re-orders the underlying events since ancestors must appear before descendants, and blobs must appear before commits which reference them. This means that events within the specified range will have different event numbers after the operation.

9.3. Branches, tags, and resets

branch {rename|delete} {BRANCH-PATTERN} [--not] [NEW-NAME]

Rename or delete all branches matching the pattern expression BRANCH-PATTERN (also any associated annotated tags and resets). For purposes of this command a Git lightweight tag is simply a branch in the tags/ namespace.

The --not option inverts a selection for deletion, deleting all branches other than those matched.

The subcommand must be one of the verbs 'rename' or 'delete'.

For a rename, NEW-NAME may be any token that is a syntactically valid branch name (but not the name of an existing branch). If it does not begin with "refs/", then "refs/" is prepended; you should supply "heads/" or "tags/" yourself. In it, references to match parts in BRANCH-PATTERN will be expanded.
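The prefixing rule for a rename target is easy to get wrong, so here is a minimal Python sketch of it. The function name qualify_branch_name is hypothetical, and the sketch covers only the "refs/" rule stated above, not match-reference expansion or name validation.

```python
def qualify_branch_name(new_name):
    """Prepend "refs/" unless the name already begins with it."""
    if new_name.startswith("refs/"):
        return new_name
    return "refs/" + new_name


print(qualify_branch_name("heads/stable"))       # refs/heads/stable
print(qualify_branch_name("refs/tags/v1.0"))     # refs/tags/v1.0
print(qualify_branch_name("stable"))             # refs/stable
```

The last call shows why you should supply "heads/" or "tags/" yourself: a bare name ends up directly under refs/, which is rarely what is wanted.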

[SELECTION] tag {create|move|rename|delete} [TAG-PATTERN] [--not] [NEW-NAME|SINGLETON]

Create, move, rename, or delete annotated tags.

Creation is a special case. First argument is NEW-NAME, which must not be an existing tag. Takes a singleton event second argument which must point to a commit. A tag event pointing to the commit is created and inserted just after the last tag in the repo (or just after the last commit if there are no tags). The tagger, committish, and comment fields are copied from the commit’s committer, mark, and comment fields.

Otherwise, the TAG-PATTERN argument is a pattern expression matching a set of tags. The subcommand must be one of the verbs 'move', 'rename', or 'delete'.

For a 'move', a second argument must be a singleton selection set. For a 'rename', the third argument may be any token that is a syntactically valid tag name (but not the name of an existing tag). When TAG-PATTERN is a regexp, NEW-NAME may contain references to portions of the match. Errors are thrown for wildcarding that would produce name collisions.

For a 'delete', no second argument is required. Annotated tags with names matching the pattern are deleted. Giving a regular expression rather than a plain string is useful for mass deletion of junk tags such as those derived from CVS branch-root tags. Such deletions can be restricted by a selection set in the normal way.

[SELECTION] reset {create|move|rename|delete} [RESET-PATTERN] [--not] [NEW-NAME]

Note: While this command is provided for the sake of completeness, think twice before actually using it. Normally a reset should only be deleted or renamed when its associated branch is, and the branch command does this.

Create, move, rename, or delete resets. Create is a special case; it requires a singleton selection which is the associated commit for the reset, takes as a first argument the name of the reset (which must not exist), and ends with the keyword create. In this case the name must be fully qualified, with a refs/heads/ or refs/tags/ prefix.

In the other modes, the RESET-PATTERN finds by text match existing resets within the selection. If RESET-PATTERN is a delimited regexp, the match is to the regexp (--not inverts this, selecting all non-matching resets).

If RESET-PATTERN is a text literal, each reset’s name is matched if RESET-PATTERN is either the entire reference (refs/heads/FOO or refs/tags/FOO for some value of FOO) or the basename (e.g. FOO), or a suffix of the form heads/FOO or tags/FOO. An unqualified basename is assumed to refer to a branch in refs/heads/.

The second argument must be one of the verbs 'move', 'rename', or 'delete'. The default selection is all events.

For a 'move', a SINGLETON argument must be a singleton selection set. For a 'rename', the third argument may be any token that can be interpreted as a valid reset name (but not the name of an existing reset). For a 'delete', no third argument is required.

When a reset is renamed, commit branch fields matching the reset are renamed with it to match. When a reset is deleted, matching branch fields are changed to match the branch of the unique descendant of the tip commit of the associated branch, if there is one. When a reset is moved, no branch fields are changed.

branchlift {SOURCEBRANCH} {PATHPREFIX} [NEWNAME]

Every commit on SOURCEBRANCH with fileops matching the PATHPREFIX is examined; all commits with every fileop matching the PATHPREFIX are moved to a new branch; if a commit has only some matching fileops it is split and the fragment containing the matching fileops is moved.

Every matching commit is modified to have the branch label specified by NEWNAME. If NEWNAME is not specified, the basename of PATHPREFIX is used. If the resulting branch already exists, this command errors out without modifying the repository.

The PATHPREFIX is removed from the paths of all fileops in modified commits.

Backslash escapes are processed in all three names.

Sets Q bits: true on commits on the source branch that were modified by having fileops lifted to the new branch, false on all others.

9.4. Repository splitting and merging

{SELECTION} divide

Attempt to partition a repo by cutting the parent-child link between two specified commits (they must be adjacent). Does not take a general selection-set argument. It is only necessary to specify the parent commit, unless it has multiple children in which case the child commit must follow (separate it with a comma).

If the repo was named 'foo', you will normally end up with two repos named 'foo-early' and 'foo-late'. But if the commit graph would remain connected through another path after the cut, the behavior changes. In this case, if the parent and child were on the same branch 'qux', the branch segments are renamed 'qux-early' and 'qux-late', but the repo is not divided.

[SELECTION] expunge [--not|--notagify] {PATH-PATTERN}

Expunge files from the selected portion of the repo history; the default is the entire history. The argument to this command is a pattern expression matching paths.

The option --not inverts this; all file paths other than those selected by the remaining arguments are expunged. You may use this to sift out all file operations matching a pattern set rather than expunging them.

All filemodify (M) operations and delete (D) operations involving a matched file in the selected set of events are disconnected from the repo and put in a removal set. Renames are followed as the tool walks forward in the selection set; each triggers a warning message. If a selected file is a copy (C) target, the copy will be deleted and a warning message issued. If a selected file is a copy source, the copy target will be added to the list of paths to be deleted and a warning issued.

After file expunges have been performed, any commits with no remaining file operations will be deleted, and any tags pointing to them. By default each deleted commit is replaced with a tag of the form emptycommit-<ident> on the preceding commit unless the --notagify option is specified. Commits with deleted fileops pointing both in and outside the path set are not deleted.

This command sets Q bits: true on any commit which lost fileops but was not entirely deleted, false on all other events.

Example:


# Delete all PDFs from the loaded repository.
expunge /[.]pdf$/

unite [--prune] [REPO-NAME…​]

Unite repositories. Name any number of loaded repositories; they will be united into one union repo and removed from the load list. The union repo will be selected.

The root of each repo (other than the oldest repo) will be grafted as a child to the last commit in the dump with a preceding commit date. This will produce a union repository with one branch for each part. Running last to first, tag and branch duplicate names will be disambiguated using the source repository name (thus, recent duplicates will get priority over older ones). After all grafts, marks will be renumbered.

The name of the new repo will be the names of all parts concatenated, separated by '+'. It will have no source directory; if all factors have the same type it will be inherited, otherwise no type will be set.

With the option --prune, at each join generate D ops for every file that doesn’t have a modify operation in the root commit of the branch being grafted on.

Note that the union repo will not have a single master branch. You must rename one of its branches before the result can be fed to git without throwing a "fatal: You are on a branch yet to be born" error.

[SELECTION] graft [--prune] {REPO-NAME}

For when unite doesn’t give you enough control. This command may have either of two forms, selected by the size of the selection set. The first argument is always required to be the name of a loaded repo.

If the selection set is of size 1, it must identify a single commit in the currently chosen repo; in this case the named repo’s root will become a child of the specified commit. If the selection set is empty, the named repo must contain one or more callouts matching commits in the currently chosen repo.

Labels and branches in the named repo are prefixed with its name; then it is grafted to the selected one. Any other callouts in the named repo are also resolved in the context of the currently chosen one. Finally, the named repo is removed from the load list.

With the option --prune, prepend a deleteall operation into the root of the grafted repository.

9.5. Metadata editing

[SELECTION] msgout [--filter=PATTERN] [--blobs]

Emit a file of messages in RFC822 format representing the contents of repository metadata. Takes a selection set; members of the set other than commits, annotated tags, and passthroughs are ignored (that is, presently, blobs and resets).

May have an option --filter, whose value must be a pattern expression. If this is given, only headers with names matching it are emitted. In this context the name of the header includes its trailing colon. See "help regexp" for information on the regexp syntax.

Blobs may be included in the output with the option --blobs.

The following example produces a mailbox of commit comments in a decluttered form that is convenient for editing:

=C msgout --filter=/Event-Number:|Committer:|Author|Check-Text:/

[SELECTION] msgin [--create] [<INFILE]

Accept a file of messages in RFC822 format representing the contents of the metadata in selected commits and annotated tags. If there is an argument, it will be taken as the name of a message-box file to read from; if there is no argument, or the argument is '-', the command reads from standard input. Supports < redirection. Ordinarily takes no selection set.

Users should be aware that modifying an Event-Number or Event-Mark field will change which event the update from that message is applied to. This is unlikely to have good results.

The header Check-Text, if present, is examined to see if the comment text of the associated event begins with it. If not, the item modification is aborted. This helps ensure that you are landing updates on the events you intend.

If the --create modifier is present, new tags and commits will be appended to the repository. In this case it is an error for a tag name to match any existing tag name. Commit events are created with no fileops. If Committer-Date or Tagger-Date fields are not present they are filled in with the time at which this command is executed. If Committer or Tagger fields are not present, reposurgeon will attempt to deduce the user’s git-style identity and fill it in. If a singleton commit set was specified for commit creations, the new commits are made children of that commit.

Otherwise, if the Event-Number and Event-Mark fields are absent, the msgin logic will attempt to match the commit or tag first by Legacy-ID, then by a unique committer ID and timestamp pair.

If the option --empty-only is given, this command will throw a recoverable error if it tries to alter a message body that is neither empty nor just the CVS empty-comment marker.

This operation sets Q bits; true where an object was modified by it, false otherwise.

[SELECTION] setfield {FIELD} {VALUE}

In the selected events (defaulting to none) set every instance of a named field to a string value. The string may be quoted to include whitespace, and use backslash escapes interpreted by Go’s C-like string-escape codec, such as \s.

Attempts to set nonexistent attributes are ignored. Valid values for the attribute are internal field names; in particular, for commits, 'comment' and 'branch' are legal. Consult the source code for other interesting values.

The special fieldnames 'author', 'commitdate' and 'authdate' apply only to commits in the range. The latter two set attribution dates. The first sets the author’s name and email address (assuming the value can be parsed for both), copying the committer timestamp. The author’s timezone may be deduced from the email address.

[SELECTION] attribution [ATTR-SELECTION] {show|set|delete|prepend|append} [ARG…​]

Inspect, modify, add, and remove commit and tag attributions.

Attributions upon which to operate are selected in much the same way as events are selected, as described in Selection syntax. ATTR-SELECTION is an expression composed of 1-origin attribution-sequence numbers, ‘$’ for last attribution, ‘..’ ranges, comma-separated items, ‘(…)’ grouping, set operations ‘|’ union, ‘&’ intersection, and ‘~’ negation, and function calls @min(), @max(), @amp(), @pre(), @suc(), @srt(). Attributions can also be selected by visibility set ‘=C’ for committers, ‘=A’ for authors, and ‘=T’ for taggers. Finally, ‘/regexp/’ will attempt to match the regular expression regexp against an attribution name and email address; ‘/n’ limits the match to only the name, and ‘/e’ to only the email address.

With the exception of ‘show’, all actions require an explicit event selection upon which to operate. Available actions are:

[ATTR-SELECTION] show [>OUTFILE]

Inspect the selected attributions of the specified events (commits and tags). The ‘show’ keyword is optional. If no attribution selection expression is given, defaults to all attributions. If no event selection is specified, defaults to all events. Supports > redirection.

{ATTR-SELECTION} set [NAME] [EMAIL] [DATE]

Assign NAME, EMAIL, DATE to the selected attributions. As a convenience, if only some fields need to be changed, the others can be omitted. Arguments NAME, EMAIL, and DATE can be given in any order.

[ATTR-SELECTION] delete

Delete the selected attributions. As a convenience, deletes all authors if ATTR-SELECTION is not given. It is an error to delete the mandatory committer and tagger attributions of commit and tag events, respectively.

[ATTR-SELECTION] prepend [NAME] [EMAIL] [DATE]

Insert a new attribution before the first attribution named by ATTR-SELECTION. The new attribution has the same type (committer, author, or tagger) as the one before which it is being inserted. Arguments NAME, EMAIL, and DATE can be given in any order.

If NAME is omitted, an attempt is made to infer it from EMAIL by trying to match EMAIL against an existing attribution of the event, with preference given to the attribution before which the new attribution is being inserted. Similarly, EMAIL is inferred from an existing matching NAME. Likewise, for DATE.

As a convenience, if ATTR-SELECTION is empty or not specified a new author is prepended to the author list.

It is presently an error to insert a new committer or tagger attribution. To change a committer or tagger, use "setfield" instead.

[ATTR-SELECTION] append [NAME] [EMAIL] [DATE]

Insert a new attribution after the last attribution named by ATTR-SELECTION. The new attribution has the same type (committer, author, or tagger) as the one after which it is being inserted. Arguments NAME, EMAIL, and DATE can be given in any order.

If NAME is omitted, an attempt is made to infer it from EMAIL by trying to match EMAIL against an existing attribution of the event, with preference given to the attribution after which the new attribution is being inserted. Similarly, EMAIL is inferred from an existing matching NAME. Likewise, for DATE.

As a convenience, if ATTR-SELECTION is empty or not specified a new author is appended to the author list.

It is presently an error to insert a new committer or tagger attribution. To change a committer or tagger, use "setfield" instead.

[SELECTION] append [--rstrip] {TEXT}

Append text to the comments of commits and tags in the specified selection set. The text is the first token of the command and may be a quoted string. C-style escape sequences in the string are interpreted using Go’s Quote/Unquote codec from the strconv library.

If the option --rstrip is given, the comment is right-stripped before the new text is appended. If the option --legacy is given, the string %LEGACY% in the append payload is replaced with the commit’s legacy-ID before it is appended.

Example:

=C append "\nLegacy-Id: %LEGACY%" --legacy

[SELECTION] gitify

Attempt to massage comments into a git-friendly form with a blank separator line after a summary line. This code assumes it can insert a blank line if the first line of the comment ends with '.', ',', ':', ';', '?', or '!'. If the separator line is already present, the comment won’t be touched.

Takes a selection set, defaulting to all commits and tags.

Sets Q bits: true for each commit and tag with a comment modified by this command, false on all other events.
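The gitify heuristic can be pictured with a short Python sketch. This is an illustration of the rule as described above, not the actual implementation; the function name gitify_comment is hypothetical, and details such as how trailing whitespace on the first line is treated are assumptions.

```python
SUMMARY_ENDINGS = (".", ",", ":", ";", "?", "!")


def gitify_comment(comment):
    """Insert a blank separator after the summary line when it looks safe."""
    lines = comment.split("\n")
    # One-line comments, and comments that already have the separator,
    # are left untouched.
    if len(lines) < 2 or lines[1].strip() == "":
        return comment
    # Only split when the first line ends like a complete summary.
    if lines[0].rstrip().endswith(SUMMARY_ENDINGS):
        lines.insert(1, "")
    return "\n".join(lines)


print(repr(gitify_comment("Fix the parser.\nIt crashed on empty input.\n")))
# 'Fix the parser.\n\nIt crashed on empty input.\n'
```

A comment whose first line does not end in one of the listed punctuation marks is returned unchanged, since the first line may simply run on into the second.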

[SELECTION] filter {dedos|shell|regexp|replace} [TEXT-OR-REGEXP]

Run blobs, commit comments and committer/author names, or tag comments and tag committer names in the selection set through the filter specified on the command line.

With any verb other than dedos, attempting to specify a selection set including both blobs and non-blobs (that is, commits or tags) throws an error. Inline content in commits is filtered when the selection set contains (only) blobs and the commit is within the range bounded by the earliest and latest blob in the specification.

When filtering blobs, if the command line contains the magic cookie '%PATHS%' it is replaced with a space-separated list of all paths that reference the blob.

With the verb shell, the remainder of the line specifies a filter as a shell command. Each blob or comment is presented to the filter on standard input; the content is replaced with whatever the filter emits to standard output.

With the verb regexp, the remainder of the line is expected to be a Go regular expression substitution written as /from/to/ with Go-style backslash escapes interpreted in 'to' as well as 'from'. Python-style backreferences (\1, \2 etc.) rather than Go-style $1, $2… are interpreted; this avoids a conflict with parameter substitution in script commands. Any non-space character will work as a delimiter in place of the /; this makes it easier to use / in patterns. Ordinarily only the first such substitution is performed; putting 'g' after the slash replaces globally, and a numeric literal gives the maximum number of substitutions to perform. Other flags available restrict substitution scope - 'c' for comment text only, 'C' for committer name only, 'a' for author names only. See "help regexp" for more information about regular expressions.

With the verb replace, the behavior is like regexp but the expressions are not interpreted as regular expressions. (This is slightly faster.)

With the verb dedos, DOS/Windows-style \r\n line terminators are replaced with \n.

All variants of this command set Q bits; events actually modified by the command get true, all other events get false.

Some examples:

# In all blobs, expand tabs to 8-space tab stops
=B filter shell expand --tabs=8

# Text replacement in comments
=C filter replace /Telperion/Laurelin/c
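To make the shape of a regexp-verb argument concrete, here is a rough Python model of how a /from/to/flags specification maps onto a substitution. It is a sketch only: it uses Python's re dialect rather than Go's, does not handle a delimiter escaped inside the pattern, and ignores the scope flags 'c', 'C' and 'a'. The function name apply_substitution is hypothetical.

```python
import re


def apply_substitution(spec, text):
    """Apply a <delim>from<delim>to<delim>flags substitution to text."""
    delim = spec[0]                      # any non-space character may delimit
    parts = spec[1:].split(delim)
    if len(parts) != 3:
        raise ValueError("expected <delim>from<delim>to<delim>flags")
    pattern, replacement, flags = parts
    count = 1                            # default: first match only
    if "g" in flags:
        count = 0                        # 0 means "replace all" for re.sub
    else:
        digits = "".join(ch for ch in flags if ch.isdigit())
        if digits:
            count = int(digits)          # numeric literal caps substitutions
    return re.sub(pattern, replacement, text, count=count)


print(apply_substitution("/a+/b/", "aa caa"))                  # b caa
print(apply_substitution("/a+/b/g", "aa caa"))                 # b cb
print(apply_substitution(r"#(\w+)/(\w+)#\2/\1#", "src/lib"))   # lib/src
```

The third call shows why an alternate delimiter is convenient: with '#' as the delimiter, the '/' in the pattern needs no escaping.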

9.6. Path reports and modifications

path [list [>OUTFILE] | rename {PATTERN} [--force] {TARGET}]

With the verb "list", list all paths touched by fileops in the selection set (which defaults to the entire repo). This command does > redirection.

With the verb "rename", rename a path in every fileop of every selected commit. The default selection set is all commits. The first argument is interpreted as a pattern expression to match against paths; the second may contain back-reference syntax (\1 etc.). See "help regexp" for more information about regular expressions.

Ordinarily, if the target path already exists in the fileops, or is visible in the ancestry of the commit, this command throws an error. With the --force option, these checks are skipped.

Example:

# move all content into docs/ subdir
path rename /.+/ docs/\0

This command sets commit Q bits; true if the commit was modified.

{SELECTION} setperm {PERM} [PATH…​]

For the selected events (defaulting to none) take the first argument as an octal literal describing permissions. All subsequent arguments are paths. For each M fileop in the selection set that exactly matches one of the paths, patch the permission field to the first argument value.
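The effect of setperm on a commit's fileops can be sketched as follows. The data layout (a list of dicts with 'op', 'mode' and 'path' keys) and the function name patch_permissions are assumptions made for illustration and say nothing about reposurgeon's internal structures.

```python
def patch_permissions(fileops, perm, paths):
    """Set the mode of every M fileop whose path exactly matches one of paths.

    Returns the number of fileops changed.
    """
    wanted = set(paths)
    changed = 0
    for op in fileops:
        # Only modify ops carry a permission field; D, R, C ops are skipped.
        if op["op"] == "M" and op["path"] in wanted:
            op["mode"] = perm
            changed += 1
    return changed


ops = [
    {"op": "M", "mode": "100644", "path": "configure"},
    {"op": "M", "mode": "100644", "path": "README"},
    {"op": "D", "path": "configure.old"},
]
print(patch_permissions(ops, "100755", ["configure"]))  # 1
print(ops[0]["mode"])                                   # 100755
```

Matching is exact, so "configure" does not match "configure.old" or a path in a subdirectory.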

9.7. Timequakes and time offsets

Modifying a repository so every commit in it has a unique timestamp is often a useful thing to do, in order for every commit to have a unique action stamp that can be referred to in surgical commands.

The ‘lint’ command will tell you if you have timestamp collisions.

[SELECTION] timequake

Attempt to hack committer and author time stamps to make all action stamps in the selection set (defaulting to all commits in the repository) unique. Works by identifying collisions between parent and child, then incrementing child timestamps so they no longer coincide. Won’t touch commits with multiple parents.

Because commits are checked in ascending order, this logic will normally do the right thing on chains of three or more commits with identical timestamps.

Any collisions left after this operation are probably cross-branch and have to be individually dealt with using 'timeoffset' commands.

The normal use case for this command is early in converting CVS or Subversion repositories, to ensure that the surgical language can count on having a unique action-stamp ID for each commit.

This command sets Q bits: true on each event with a timestamp bumped, false on all other events.

[SELECTION] timeoffset {OFFSET}

Apply a time offset to all time/date stamps in the selected set. An offset argument is required; it may be in the form [-]ss, [-]mm:ss or [+-]hh:mm:ss. The leading sign is optional. With no argument, the default is 1 second.

Optionally you may also specify another argument in the form [+-]hhmm, a timezone literal to apply. To apply a timezone without an offset, use an offset literal of 0, +0 or -0.
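The offset forms reduce to a signed number of seconds. The following Python sketch shows one way to do the conversion; it is not the parser reposurgeon uses, the function name parse_offset is hypothetical, and accepting a leading '+' on every form is an assumption.

```python
def parse_offset(text):
    """Convert ss, mm:ss or hh:mm:ss (optionally signed) to signed seconds."""
    sign = 1
    if text[:1] in ("+", "-"):
        sign = -1 if text[0] == "-" else 1
        text = text[1:]
    fields = [int(part) for part in text.split(":")]
    if len(fields) > 3:
        raise ValueError("offset has too many fields")
    seconds = 0
    for field in fields:
        # Each additional field to the right is worth 60 times less.
        seconds = seconds * 60 + field
    return sign * seconds


print(parse_offset("90"))         # 90
print(parse_offset("1:30"))       # 90
print(parse_offset("-01:00:00"))  # -3600
print(parse_offset("+0"))         # 0
```

The literals 0, +0 and -0 all yield zero seconds, which is what lets them act as a placeholder when only a timezone is to be applied.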

Those of you twitchy about "rewriting history" should bear in mind that the commit stamps in many older repositories were never very reliable to begin with.

CVS in particular is notorious for shipping client-side timestamps with timezone and DST issues (as opposed to UTC) that don’t necessarily compare well with stamps from different clients of the same CVS server. Thus, inducing a timequake in a CVS repo seldom produces effects anywhere near as large as the measurement noise of the repository’s own timestamps.

Subversion was somewhat better about this, as commits were stamped at the server, but older Subversion repositories often have sections that predate the era of ubiquitous NTP time.

9.8. Miscellanea

quit

Terminate reposurgeon cleanly.

Typing EOT (usually Ctrl-D) is a shortcut for this.

blob [MARK-NUMBER] {<INFILE|>OUTFILE}

Given an argument, create a blob with the specified mark name, which must not already exist. The new blob is inserted at the front of the repository event sequence, after options but before previously-existing blobs. The blob data is taken from standard input, which may be a redirect from a file or a here-doc.

Without an argument return a legal blob name that is not in use, having a numeric part just one greater than the highest-numbered existing blob in the repository. This output may be redirected, but is intended for interactive use when developing a script.

These commands can be used with the add command to patch new data into a repository.

renumber

Renumber the marks in a repository, from :1 up to <n> where <n> is the count of the last mark. Just in case an importer ever cares about mark ordering or gaps in the sequence.

A side effect of this command is to clean up stray "done" passthroughs that may have entered the repository via graft operations. After a renumber, the repository will have at most one "done", and it will be at the end of the events.

[SELECTION] dedup

Deduplicate blobs in the selection set. If multiple blobs in the selection set have the same SHA1, throw away all but the first, and change fileops referencing them to instead reference the (kept) first blob.
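The bookkeeping behind deduplication can be sketched in a few lines of Python. The sketch hashes raw blob content with SHA1 and is not a description of reposurgeon's internals; the data layout (blobs as (mark, bytes) pairs, fileops as dicts with a 'ref' key) and the function name dedup_blobs are assumptions.

```python
import hashlib


def dedup_blobs(blobs, fileops):
    """Keep the first blob for each SHA1 and repoint fileops at it.

    blobs is a list of (mark, content) pairs in repository order; fileops
    is a list of dicts whose 'ref' key names a blob mark.  Returns the
    list of blobs that were kept.
    """
    first_seen = {}   # SHA1 hex digest -> mark of the first blob with it
    remap = {}        # mark of a discarded duplicate -> mark of kept blob
    kept = []
    for mark, content in blobs:
        digest = hashlib.sha1(content).hexdigest()
        if digest in first_seen:
            remap[mark] = first_seen[digest]
        else:
            first_seen[digest] = mark
            kept.append((mark, content))
    for op in fileops:
        op["ref"] = remap.get(op["ref"], op["ref"])
    return kept


blobs = [(":1", b"hello\n"), (":2", b"world\n"), (":3", b"hello\n")]
ops = [{"path": "a", "ref": ":1"}, {"path": "b", "ref": ":3"}]
print([mark for mark, _ in dedup_blobs(blobs, ops)])  # [':1', ':2']
print(ops[1]["ref"])                                  # :1
```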

[SELECTION] transcode {ENCODING}

Transcode blobs, commit comments and committer/author names, or tag comments and tag committer names in the selection set to UTF-8 from the character encoding specified on the command line.

Attempting to specify a selection set including both blobs and non-blobs (that is, commits or tags) throws an error. Inline content in commits is filtered when the selection set contains (only) blobs and the commit is within the range bounded by the earliest and latest blob in the specification.

The ENCODING argument must name one of the codecs listed at https://www.iana.org/assignments/character-sets/character-sets.xhtml and known to the Go standard codecs library.

Errors in this command force the repository to be dropped, because an error may leave repository events in a damaged state.

The theory behind the design of this command is that the repository might contain a mixture of encodings used to enter commit metadata by different people at different times. After using "=I" to identify metadata containing non-Unicode high bytes in text, a human must use context to identify which particular encodings were used in particular event spans and compose appropriate transcode commands to fix them up.

This command sets Q bits; objects actually modified by the command get true, all other events get false.


# In all commit comments containing non-ASCII bytes, transcode from Latin-1.
=I transcode latin1
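What a transcode does to a single piece of text can be shown with Python's own codecs. This is purely illustrative: reposurgeon uses Go's codec library, Python's encoding names need not match Go's, and the function name to_utf8 is hypothetical.

```python
def to_utf8(raw, encoding):
    """Reinterpret raw bytes in the given encoding and re-encode as UTF-8."""
    return raw.decode(encoding).encode("utf-8")


latin1_bytes = b"caf\xe9"                  # "cafe" with e-acute, in Latin-1
print(to_utf8(latin1_bytes, "latin1"))     # b'caf\xc3\xa9'
```

Pure ASCII text passes through unchanged, which is why the "=I" selection is a sensible way to limit the operation to events that actually need it.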

debranch {SOURCE-BRANCH} [TARGET-BRANCH]

Takes one or two arguments which must be the names of source and target branches; if the second (target) argument is omitted it defaults to 'master'. The history of the source branch is merged into the history of the target branch, becoming the history of a subdirectory with the name of the source branch. Any trailing segment of a branch name is accepted as a synonym for it; thus 'master' is the same as 'refs/heads/master'. Any resets of the source branch are removed.

[SELECTION] strip {--blobs|--reduce}

This is intended for producing reduced test cases from large repositories.

Replace the blobs in the selected repository with self-identifying stubs; and/or strip out topologically uninteresting commits. The options for this are "--blobs" and "--reduce" respectively; the default is "--blobs".

A selection set is effective only with the "--blobs" option, defaulting to all blobs. The "--reduce" mode always acts on the entire repository.

With the modifier "--reduce", perform a topological reduction that throws out uninteresting commits. If a commit has all file modifications (no deletions or copies or renames) and has exactly one ancestor and one descendant, then it may be boring. To be fully boring, it must also not be referred to by any tag or reset. Interesting commits are not boring, or have a non-boring parent or non-boring child.
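The boring/interesting distinction can be expressed as a pair of predicates. The Python below is a sketch under assumed data structures (a dict mapping each mark to a dict with 'ops', 'parents', 'children' and 'referenced' keys); the names is_boring and interesting_commits are hypothetical and the real reduction may differ in detail.

```python
def is_boring(commit):
    """Only M fileops, exactly one parent and one child, no tag or reset."""
    return (all(op == "M" for op in commit["ops"])
            and len(commit["parents"]) == 1
            and len(commit["children"]) == 1
            and not commit["referenced"])


def interesting_commits(commits):
    """Return marks of commits that are non-boring or adjacent to one."""
    keep = set()
    for mark, commit in commits.items():
        neighbours = commit["parents"] + commit["children"]
        if not is_boring(commit) or any(
                not is_boring(commits[n]) for n in neighbours):
            keep.add(mark)
    return keep


def node(parents, children, ops=("M",), referenced=False):
    return {"ops": list(ops), "parents": parents,
            "children": children, "referenced": referenced}


chain = {
    ":1": node([], [":2"]),        # root: no parent, so not boring
    ":2": node([":1"], [":3"]),
    ":3": node([":2"], [":4"]),    # boring, with boring neighbours
    ":4": node([":3"], [":5"]),
    ":5": node([":4"], []),        # tip: no child, so not boring
}
print(sorted(interesting_commits(chain)))  # [':1', ':2', ':4', ':5']
```

In this five-commit chain only ':3' is discarded: it is boring itself and so are both of its neighbours.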

10. Artifact handling

Some commands automate fixing various kinds of artifacts associated with repository conversions from older systems.

10.1. Attributions

[SELECTION] authors {read <INFILE | write >OUTFILE}

Apply or dump author-map information for the specified selection set, defaulting to all events.

Lifts from CVS and Subversion may have only usernames local to the repository host in committer and author IDs. DVCSes want email addresses (net-wide identifiers) and complete names. To supply the map from one to the other, an authors file is expected to consist of lines each beginning with a local user ID, followed by a '=' (possibly surrounded by whitespace) followed by a full name and email address. Thus:

fred = Fred J. Foonly <foonly@foo.com> America/New_York

An authors file may also contain lines of this form

+ Fred J. Foonly <foonly@foobar.com> America/Los_Angeles

These are interpreted as aliases for the last preceding = entry that may appear in ChangeLog files. When such an alias is matched on a ChangeLog attribution line, the author attribution for the commit is mapped to the basename, but the timezone is used as is. This accommodates people with past addresses (possibly at different locations) unifying such aliases in metadata so searches and statistical aggregation will work better.

An authors file may have comment lines beginning with #; these are ignored.

When an authors file is applied, email addresses in committer and author metadata for which the local ID matches between < and @ are replaced according to the mapping (this handles git-svn lifts). Alternatively, if the local ID is the entire address, this is also considered a match (this handles what git-cvsimport and cvs2git do). If a timezone was specified in the map entry, that person’s author and committer dates are mapped to it.
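As a rough illustration of the file format and the matching rule, here is a Python sketch. It is not reposurgeon’s parser; the regular expressions, function names, and return shapes are assumptions chosen for the example.

```python
import re

# "local = Full Name <email> [Timezone]"
ENTRY = re.compile(
    r"^(?P<local>[^=\s]+)\s*=\s*(?P<name>[^<]*?)\s*<(?P<email>[^>]+)>\s*(?P<tz>\S+)?\s*$"
)
# "+ Full Name <email> [Timezone]", an alias for the last preceding = entry
ALIAS = re.compile(
    r"^\+\s*(?P<name>[^<]*?)\s*<(?P<email>[^>]+)>\s*(?P<tz>\S+)?\s*$"
)


def parse_authors(text):
    """Return (entries, aliases) parsed from authors-file text.

    entries maps local ID -> (name, email, timezone or None);
    aliases maps alias email -> (local ID it belongs to, timezone or None).
    """
    entries, aliases = {}, {}
    last = None
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue                      # blank lines and comments are ignored
        m = ENTRY.match(line)
        if m:
            last = m.group("local")
            entries[last] = (m.group("name"), m.group("email"), m.group("tz"))
            continue
        m = ALIAS.match(line)
        if m and last is not None:
            aliases[m.group("email")] = (last, m.group("tz"))
    return entries, aliases


def local_id(address):
    """Text before the '@' (git-svn style), or the whole address if there is none."""
    return address.split("@", 1)[0]
```

A lookup such as entries.get(local_id(address)) then covers both matching cases described above: an ID found between < and @, and an ID that is the entire address.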

With the 'read' modifier, apply author mapping data (from standard input or a <-redirected input file). Q bits are set: true on each commit event with attributions actually modified by the mapping, false on all other events.

With the 'write' modifier, write a mapping file that could be interpreted by 'authors read', with entries for each unique committer, author, and tagger (to standard output or a >-redirected file). This may be helpful as a start on building an authors file, though each part to the right of an equals sign will need editing.

10.2. Ignore patterns

reposurgeon recognizes how supported VCSes represent file ignores (CVS .cvsignore files lurking untranslated in older Subversion repositories, Subversion ignore properties, .gitignore/.hgignore/.bzrignore files in other systems) and moves ignore declarations among these containers on repo input and output. This will be sufficient if the ignore patterns are exact filenames.

Translation may not, however, be perfect when the ignore patterns are Unix glob patterns or regular expressions. This compatibility table describes which patterns will translate; "plain" indicates a plain filename with no glob or regexp syntax or negation, "no !" means no negated regexps, and "no RE:" means the RE prefix for a regular expression does not work.

RCS has no ignore files or patterns and is therefore not included in the table.

          from CVS  from svn  from git         from hg  from bzr      from darcs  from SRC  from bk
to CVS    all       all       no ! & nonempty  all      no RE:, no !  plain       all       all
to svn    no !      all       no !             all      no RE:, no !  plain       all       all
to git    all       all       all              no !     no RE:        plain       all       all
to hg     no !      all       no !             all      no RE:, no !  plain       all       all
to bzr    all       all       all              all      all           plain       all       all
to darcs  plain     plain     plain            plain    plain         all         all       all
to SRC    no !      all       no !             all      no RE:, no !  plain       all       all

The hg rows and columns of the table describe compatibility to hg’s glob syntax rather than its default regular-expression syntax. When writing to an hg repository from any other kind, reposurgeon prepends to the output .hgignore a ‘syntax: glob’ line.

For dealing with unusual cases, there’s this:

ignores [--rename] [--translate] [--defaults]

Intelligent handling of ignore-pattern files.

This command fails if no repository has been selected or no preferred write type has been set for the repository. It does not take a selection set.

If --rename is present, this command attempts to rename all ignore-pattern files to whatever is appropriate for the preferred type - e.g. .gitignore for git, .hgignore for hg, etc. This option does not cause any translation of the ignore files it renames.

If --translate is present, syntax translation of each ignore file is attempted. At present, the only transformation the code knows is to prepend a 'syntax: glob' header if the preferred type is hg.

If --defaults is present, the command attempts to prepend the preferred type’s default ignore patterns to all ignore files. If no ignore file is created by the first commit, it will be modified to create one containing the defaults. This command will error out on preferred types that have no default ignore patterns (git and hg, in particular). It will also error out when it knows the import tool has already set default patterns.
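The effect of --rename and --translate can be pictured with a short Python sketch. It is only an illustration: the function names and the dictionary are hypothetical, the file names are the ones mentioned earlier in this section, and skipping a header that is already present is an assumption rather than documented behavior.

```python
# Ignore-file names mentioned in this section; the mapping is illustrative.
IGNORE_NAMES = {"git": ".gitignore", "hg": ".hgignore", "bzr": ".bzrignore"}


def rename_ignore(path, target):
    """Swap the basename of a known ignore file for the target system's name."""
    directory, _, base = path.rpartition("/")
    if base not in IGNORE_NAMES.values():
        return path                       # not an ignore file: leave it alone
    renamed = IGNORE_NAMES[target]
    return f"{directory}/{renamed}" if directory else renamed


def translate_ignore(text, target):
    """Prepend the hg glob header; other targets pass through unchanged."""
    if target == "hg" and not text.startswith("syntax: glob"):
        return "syntax: glob\n" + text
    return text
```

Keeping the two steps separate mirrors the text: renaming never changes file content, and translation never changes file names.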

10.3. Reference lifting

This group of commands is meant for fixing up references in commits that are in the format of older version-control systems. The general workflow is this: first, go over the comment history and change all old-fashioned commit references into machine-parseable cookies. Then, automatically turn the machine-parseable cookies into action stamps. The point of dividing the process this way is that the first part is hard for a machine to get right, while the second part is prone to errors when a human does it.

A Subversion cookie is a comment substring of the form ‘[[SVN:ddddd]]’ (example: ‘[[SVN:2355]]’) with the revision read directly via the Subversion exporter, deduced from git-svn metadata, or matching a $Revision$ header embedded in blob data for the filename.

A CVS cookie is a comment substring of the form ‘[[CVS:filename:revision]]’ (example: ‘[[CVS:src/README:1.23]]’) with the revision matching a CVS $Id$ or $Revision$ header embedded in blob data for the filename.

A mark cookie is of the form ‘[[:dddd]]’ and is simply a reference to the specified mark. You may want to hand-patch this in when one of the previous forms is inconvenient.

An action stamp is an RFC3339 timestamp, followed by a ‘!’, followed by an author email address (author is preferred rather than committer because that timestamp is not changed when a patch is replayed on to a branch, but the code to make a stamp for a commit will fall back to the committer if no author field is present). It attempts to refer to a commit without being VCS-specific. Thus, instead of “commit 304a53c2” or “r2355”, “2011-10-25T15:11:09Z!fred@foonly.com”.
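To make the format concrete, here is a minimal Python sketch that builds and splits an action stamp. The function names are hypothetical, and the sketch assumes the timestamp is given as Unix seconds and rendered in UTC with a trailing Z, as in the example above.

```python
from datetime import datetime, timezone


def action_stamp(unix_time, email):
    """Join an RFC3339 UTC timestamp and an email address with '!'."""
    when = datetime.fromtimestamp(unix_time, tz=timezone.utc)
    return when.strftime("%Y-%m-%dT%H:%M:%SZ") + "!" + email


def split_stamp(stamp):
    """Return (timestamp, email); email is empty if the stamp has no '!'."""
    when, _, email = stamp.partition("!")
    return when, email
```

For instance, action_stamp(1319555469, "fred@foonly.com") yields the stamp used as the example in the text.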

The following git aliases allow git to work directly with action stamps. Append them to your ~/.gitconfig; if you already have an [alias] section, leave off the first line.

[alias]
	# git stamp <commit-ish> - print a reposurgeon-style action stamp
	stamp = show -s --format='%cI!%ce'

	# git scommit <stamp> <rev-list-args> - list most recent commit that matches <stamp>.
	# Must also specify a branch to search or --all, after these arguments.
	scommit = "!f(){ d=${1%%!*}; a=${1##*!}; arg=\"--until=$d -1\"; if [ $a != $1 ]; then arg=\"$arg --committer=$a\"; fi; shift; git rev-list $arg ${1:+\"$@\"}; }; f"

	# git scommits <stamp> <rev-list-args> - as above, but list all matching commits.
	scommits = "!f(){ d=${1%%!*}; a=${1##*!}; arg=\"--until=$d --after $d\"; if [ $a != $1 ]; then arg=\"$arg --committer=$a\"; fi; shift; git rev-list $arg ${1:+\"$@\"}; }; f"

	# git smaster <stamp> - list most recent commit on master that matches <stamp>.
	smaster = "!f(){ git scommit \"$1\" master --first-parent; }; f"
	smasters = "!f(){ git scommits \"$1\" master --first-parent; }; f"

	# git shs <stamp> - show the commits on master that match <stamp>.
	shs = "!f(){ stamp=$(git smasters $1); shift; git show ${stamp:?not found} $*; }; f"

	# git slog <stamp> <log-args> - start git log at <stamp> on master
	slog = "!f(){ stamp=$(git smaster $1); shift; git log ${stamp:?not found} $*; }; f"

	# git sco <stamp> - check out most recent commit on master that matches <stamp>.
	sco = "!f(){ stamp=$(git smaster $1); shift; git checkout ${stamp:?not found} $*; }; f"

There is a rare case in which an action stamp will not refer uniquely to one commit. It is theoretically possible that the same author might check in revisions on different branches within the one-second resolution of the timestamps in a fast-import stream. There is nothing to be done about this; tools using action stamps need to be aware of the possibility and throw a warning when it occurs.

In order to support reference lifting, reposurgeon internally builds a legacy-reference map that associates revision identifiers in older version-control systems with commits. The contents of this map come from three places: (1) cvs2svn:rev properties if the repository was read from a Subversion dump stream, (2) $Id$ and $Revision$ headers in repository files, and (3) the .git/cvs-revisions created by ‘git cvsimport’.

The detailed sequence for lifting possible references is this: first, find possible CVS and Subversion references with 'references list' or the =N visibility set; then replace them with equivalent cookies; then run 'references lift' to turn the cookies into action stamps (using the information in the legacy-reference map) without having to do the lookup by hand.

[SELECTION] references [list|lift]

With the 'list' modifier, produces a listing of events that may have Subversion or CVS commit references in them. This version of the command supports > redirection. Equivalent to '=N list'.

With the modifier 'lift', transform commit-reference cookies from CVS and Subversion into action stamps. This command expects cookies consisting of the leading string '[[', followed by a VCS identifier (currently SVN or CVS) followed by VCS-dependent information, followed by ']]'. An action stamp pointing at the corresponding commit is substituted when possible. Enables writing of the legacy-reference map when the repo is written or rebuilt. This variant sets Q bits: true if a commit’s comment was modified by a reference lift, false on all other events.

It is not guaranteed that every such reference will be resolved, or even that any at all will be. Normally all references in history from a Subversion repository will resolve, but CVS references are less likely to be resolvable.
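The lifting step can be sketched in Python as a substitution over comment text. This is an illustration under stated assumptions, not the real code path: legacy_map is a hypothetical dictionary keyed by the text between the double brackets (for example "SVN:2355"), and the returned flag merely imitates the Q bit described above.

```python
import re

# Subversion and CVS cookie forms described in this section.
COOKIE = re.compile(r"\[\[(SVN:\d+|CVS:[^:\]]+:[0-9.]+)\]\]")


def lift(comment, legacy_map):
    """Replace each resolvable cookie with its action stamp.

    Returns (new_comment, changed). Cookies with no entry in
    legacy_map are left in place, since resolution is not guaranteed.
    """
    changed = False

    def substitute(match):
        nonlocal changed
        stamp = legacy_map.get(match.group(1))
        if stamp is None:
            return match.group(0)         # unresolved: keep the cookie
        changed = True
        return stamp

    return COOKIE.sub(substitute, comment), changed
```

Leaving unresolved cookies untouched keeps them visible for a later pass or for hand-editing.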

legacy {read [<INFILE] | write [>OUTFILE]}

Apply or list legacy-reference information. Does not take a selection set. The 'read' variant reads from standard input or a <-redirected filename; the 'write' variant writes to standard output or a >-redirected filename.

A legacy-reference file maps reference cookies to (committer, commit-date, sequence-number) triplets; these in turn (should) uniquely identify a commit. The format is two whitespace-separated fields: the cookie followed by an action stamp identifying the commit.

It should not normally be necessary to use this command. The legacy map is automatically preserved through repository reads and rebuilds, being stored in the file legacy-map under the repository subdirectory.

10.4. Changelogs

CVS, Subversion and Mercurial do not have separated notions of committer and author for changesets; when lifted to a VCS that does, like git, their one author field is used for both.

However, if the project used the FSF ChangeLog convention, many changesets will include a ChangeLog modification listing an author for the commit. In the common case that the changeset was derived from a patch and committed by a project maintainer, but the ChangeLog entry names the actual author, this information can be recovered.

[SELECTION] changelogs [BASENAME-PATTERN]

Mine ChangeLog files for authorship data.

Takes a selection set. If no set is specified, process all changelogs. An optional following argument is a pattern expression to match the basename of files that should be treated as changelogs; the default is "/^ChangeLog$/". See "help regexp" for more information about regular expressions.

This command assumes that changelogs are in the format used by FSF projects: entry header lines begin with YYYY-MM-DD and are followed by a fullname/address.

When a ChangeLog file modification is found in a clique, the entry header at or before the section changed since its last revision is parsed and the address is inserted as the commit author. This is useful in converting CVS and Subversion repositories that don’t have any notion of author separate from committer but which use the FSF ChangeLog convention.

If the entry header contains an email address but no name, a name will be filled in if possible by looking for the address in author map entries.

In accordance with FSF policy for ChangeLogs, any date in an attribution header is discarded and the committer date is used. However, if the name is an author-map alias with an associated timezone, that zone is used.
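The entry-header handling can be sketched in Python as follows. The regular expression and function name are assumptions made for illustration; real ChangeLog files contain variations this sketch does not attempt to cover.

```python
import re

# "YYYY-MM-DD  Full Name  <address>"; the name part may be empty.
HEADER = re.compile(r"^(\d{4}-\d{2}-\d{2})\s+(.*?)\s*<([^>]+)>\s*$")


def parse_header(line):
    """Return (name, email) for an entry header line, or None otherwise.

    The date is matched but deliberately dropped, mirroring the policy
    of using the committer date instead.
    """
    m = HEADER.match(line)
    if m is None:
        return None
    _date, name, email = m.groups()
    return name, email
```

An empty name in the result marks the case where a name has to be looked up from the address in the author map.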

Sets Q bits: true if the event is a commit with authorship modified by this command, false otherwise.

The Co-Author convention described in the Linux kernel’s co-author message conventions is observed: If an attribution header is followed by a whitespace-led line containing only a valid email address, that name becomes the payload of a "Co-Author" header that is appended to the change comment for the containing commit.

The command reports statistics on how many commits were altered.

10.5. Clique coalescence

When lifting a history from a version-control system that lacks changesets, it is useful to have a way to recognize cliques of per-file changes that ought to be grouped into changesets.

You won’t need this for CVS because cvs-fast-export does clique coalescence itself.

[SELECTION] coalesce [--debug]

Scan the selection set (defaulting to all) for runs of commits with identical comments close to each other in time (this is a common form of scar tissue in repository up-conversions from older file-oriented version-control systems). Merge these cliques by pushing their fileops and tags up to the last commit, in order.

The optional argument, if present, is a maximum time separation in seconds; the default is 90 seconds.
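The clique-detection idea can be sketched in Python. This is a simplified illustration, not the actual algorithm: commits are reduced to hypothetical (timestamp, comment) pairs in repository order, and the time separation is assumed to be measured between adjacent commits.

```python
def cliques(commits, max_gap=90):
    """Group indices of adjacent commits with equal comments and small gaps.

    commits is a list of (unix_time, comment) pairs in order; max_gap is
    the maximum separation in seconds, defaulting to 90 as in the text.
    """
    groups = []
    for i, (when, comment) in enumerate(commits):
        if groups:
            prev_when, prev_comment = commits[groups[-1][-1]]
            if comment == prev_comment and when - prev_when <= max_gap:
                groups[-1].append(i)      # extend the current clique
                continue
        groups.append([i])                # start a new clique
    return groups
```

Each group with more than one member is a candidate for merging into its last commit.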

The default selection set for this command is "=C", all commits. Occasionally you may want to restrict it, for example to avoid coalescing unrelated cliques of "empty log message" commits from CVS lifts.

With the --changelog option, any commit with a comment containing the string 'empty log message' (such as is generated by CVS) and containing exactly one file operation modifying a path ending in 'ChangeLog' is treated specially. Such ChangeLog commits are considered to match any commit before them by content, and will coalesce with it if the committer matches and the commit separation is small enough. This option handles a convention used by Free Software Foundation projects.

With the --debug option, show messages about mismatches.

11. Control Options

The following options change reposurgeon’s behavior:

asciidoc

Dump help items using asciidoc definition markup.

canonicalize

If set, import stream reads and msgin and edit will canonicalize comments by replacing CR-LF with LF, stripping leading and trailing whitespace, and then appending a LF. This behavior inverts if the crlf option is on - LF is replaced with CR-LF and CR-LF is appended.

crlf

If set, expect CR-LF line endings on text input and emit them on output. Comment canonicalization will map LF to CR-LF.

compress

Use compression for on-disk copies of blobs. Accepts an increase in repository read and write time in order to reduce the amount of disk space required while editing; this may be useful for large repositories. No effect if the edit input was a dump stream; in that case, reposurgeon doesn’t make on-disk blob copies at all (it points into sections of the input stream instead).

echo

Echo commands before executing them. Setting this in test scripts may make the output easier to read.

experimental

This flag is reserved for developer use. If you set it, it could do anything up to and including making demons fly out of your nose.

interactive

Enable interactive responses even when not on a tty.

progress

Enable fancy progress messages even when not on a tty.

relax

Continue script execution on error, do not bail out.

serial

Disable parallelism in code. Use for generating test loads.

testmode

Disable some features that cause output to vary depending on wall time, screen width, and the ID of the invoking user. Use in regression-test loads.

quiet

Suppress time-varying parts of reports.

Most options are described in conjunction with the specific operations that they modify.
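As an illustration of how the canonicalize and crlf options interact, here is a Python sketch of comment canonicalization. The function name is hypothetical, and normalizing to LF before converting back to CR-LF in crlf mode is an assumption about ordering.

```python
def canonicalize(comment, crlf=False):
    """Normalize line endings and surrounding whitespace in a comment."""
    text = comment.replace("\r\n", "\n").strip()
    if crlf:
        # Inverted behavior: LF becomes CR-LF and CR-LF is appended.
        return text.replace("\n", "\r\n") + "\r\n"
    return text + "\n"
```

Either way the result ends with exactly one line terminator.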

Here are the commands to manipulate them. None of these take a selection set:

set [canonicalize|crlf|compress|echo|experimental|interactive|progress|serial|testmode|quiet]+

Set a (tab-completed) boolean option to control reposurgeon’s behavior. With no arguments, displays the state of all flags. Do "help options" to see the available options.

clear [canonicalize|crlf|compress|echo|experimental|interactive|progress|serial|testmode|quiet]+

Clear a (tab-completed) boolean option to control reposurgeon’s behavior. With no arguments, displays the state of all flags. Do "help options" to see the available options.

12. Scripting and debugging support

12.1. Variables, macros, and scripts

Occasionally you will need to issue a large number of complex surgical commands of very similar form, and it’s convenient to be able to package that form so you don’t need to do a lot of error-prone typing. For those occasions, reposurgeon supports simple forms of named variables and macro expansion.

{SELECTION} assign [--singleton] [NAME]

Compute a leading selection set and assign it to a symbolic name, which must follow the assign keyword. It is an error to assign to a name that is already assigned, or to any existing branch name. Assignments may be cleared by some sequence mutations (though not by ordinary deletion); you will see a warning when this occurs.

With no selection set and no argument, list all assignments. This version accepts output redirection.

If the option --singleton is given, the assignment will throw an error if the selection set is not a singleton.

Use this to optimize out location and selection computations that would otherwise be performed repeatedly, e.g. in macro calls.

Example:

# Assign to the name "cvsjunk" the selection set of all commits with a
# boilerplate CVS empty log message in the comment.
/empty log message/ assign cvsjunk

unassign {NAME}

Unassign a symbolic name. Throws an error if the name is not assigned. Tab-completes on the list of defined names.

names [>OUTFILE]

List all known symbolic names of branches and tags. Tells you what things are legal within angle brackets and parentheses.

define [NAME [TEXT]]

Define a macro. The first whitespace-separated token is the name; the remainder of the line is the body, unless it is '{', which begins a multi-line macro terminated by a line beginning with '}'.

A later 'do' call can invoke this macro.

'define' by itself without a name or body produces a macro list.

do {MACRO-NAME} [ARG…​]

Expand and perform a macro. The first whitespace-separated token is the name of the macro to be called; remaining tokens replace {0}, {1}…​ in the macro definition. Tokens may contain whitespace if they are string-quoted; string quotes are stripped. Macros can call macros.

If the macro expansion does not itself begin with a selection set, whatever set was specified before the 'do' keyword is available to the command generated by the expansion.

undefine {MACRO-NAME}

Undefine the macro named in this command’s first argument.

Here’s an example to illustrate how you might use this. In CVS repositories of projects that use the GNU ChangeLog convention, a very common pre-conversion artifact is a commit with the comment “*** empty log message ***” that modifies only a ChangeLog entry explaining the commit immediately previous to it. The following

define changelog <{0}> & /empty log message/ squash --pushback
do changelog 2012-08-14T21:51:35Z
do changelog 2012-08-08T22:52:14Z
do changelog 2012-08-07T04:48:26Z
do changelog 2012-08-08T07:19:09Z
do changelog 2012-07-28T18:40:10Z

is equivalent to the more verbose

<2012-08-14T21:51:35Z> & /empty log message/ squash --pushback
<2012-08-08T22:52:14Z> & /empty log message/ squash --pushback
<2012-08-07T04:48:26Z> & /empty log message/ squash --pushback
<2012-08-08T07:19:09Z> & /empty log message/ squash --pushback
<2012-07-28T18:40:10Z> & /empty log message/ squash --pushback

but you are less likely to make difficult-to-notice errors typing the first version.

(Also note how the text regexp acts as a failsafe against the possibility of typing a wrong date that doesn’t refer to a commit with an empty comment. This was a real-world example from the CVS-to-git conversion of groff.)
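The argument substitution performed by 'do' can be imitated in a few lines of Python. This is a sketch, not the real expander: the function name is hypothetical, shell-style quoting via shlex stands in for reposurgeon’s own string-quote handling, and leaving an out-of-range placeholder untouched is an assumption.

```python
import re
import shlex


def expand(body, argline):
    """Replace {0}, {1}, ... in a macro body with whitespace-separated tokens.

    Quoted tokens may contain whitespace; the quotes are stripped.
    """
    args = shlex.split(argline)

    def substitute(match):
        index = int(match.group(1))
        return args[index] if index < len(args) else match.group(0)

    return re.sub(r"\{(\d+)\}", substitute, body)
```

Expanding the changelog macro body from the example with the argument 2012-08-14T21:51:35Z reproduces the first line of the verbose form.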

script {PATH} [ARG…​]

Takes a filename and optional following arguments. Reads each line from the file and executes it as a command.

During execution of the script, the script name replaces the string "$0", and the optional following arguments (if any) replace the strings "$1", "$2" …​ "$n" in the script text. This is done before tokenization, so the "$1" in a string like "foo$1bar" will be expanded. Additionally, "$$" is expanded to the current process ID (which may be useful for scripts that use tempfiles).

Within scripts (and only within scripts) reposurgeon accepts a slightly extended syntax: First, a backslash ending a line signals that the command continues on the next line. Any number of consecutive lines thus escaped are concatenated, without the ending backslashes, prior to evaluation. Second, a command that takes an input filename argument can instead take literal data using the syntax of a shell here-document. That is: if the "<filename" is replaced by "<<EOF", all following lines in the script up to a terminating line consisting only of "EOF" will be read, placed in a temporary file, and that file fed to the command and afterwards deleted. "EOF" may be replaced by any string. Backslashes have no special meaning while reading a here-document.

Scripts may have comments. Any line beginning with a "#" is ignored. If a line has a trailing portion that begins with one or more whitespace characters followed by "#", that trailing portion is ignored.
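The two preprocessing steps described above, argument substitution and backslash continuation, can be sketched in Python. The function names are hypothetical, and leaving a reference to a missing argument unexpanded is an assumption; here-documents and comment stripping are not covered.

```python
import os
import re


def substitute(line, script_name, args):
    """Expand $0, $1 ... $n and $$ anywhere in the line, before tokenization."""
    def replace(match):
        token = match.group(1)
        if token == "$":
            return str(os.getpid())       # $$ is the current process ID
        n = int(token)
        if n == 0:
            return script_name
        return args[n - 1] if n <= len(args) else match.group(0)

    return re.sub(r"\$(\$|\d+)", replace, line)


def join_continuations(lines):
    """Concatenate lines ending in a backslash with the line that follows."""
    out, pending = [], ""
    for line in lines:
        if line.endswith("\\"):
            pending += line[:-1]          # drop the backslash, keep going
        else:
            out.append(pending + line)
            pending = ""
    if pending:
        out.append(pending)
    return out
```

Because substitution works on raw text, "foo$1bar" expands even though $1 is not a separate token.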

Scripts may call other scripts to arbitrary depth.

Here are some more advanced examples of scripting:

define lastchange {
@max(=B & [/ChangeLog/] & /{0}/B)? list
}

List the last commit that refers to a ChangeLog file containing a specified string. (The trick here is that ? extends the singleton set consisting of the last eligible ChangeLog blob to its set of referring commits, and list only notices the commits.)

index >index.txt
shell <index.txt awk '/refs\/tags/ {print $4}' | sort | uniq | while read t; do echo "tag $(basename "$t") rename $(basename "$t" | sed -e 's/sample/example/')"; done >renames.script
script renames.script

Mass-rename tags, replacing "sample" on the basename with "example". Illustrates a general technique of generating reposurgeon commands via shell that you then execute with the ‘script’ command. Enabling this technique is the reason as many commands as possible support redirects.

12.2. Housekeeping

gc [GOGC]

Trigger a garbage collection. Scavenges and removes all blob events that no longer have references, e.g. as a result of delete operations on repositories. This is followed by a Go-runtime garbage collection.

The optional argument, if present, is passed as a SetGCPercent call to the Go runtime. The initial value is 100; setting it lower causes more frequent garbage collection and may reduce maximum working set, while setting it higher causes less frequent garbage collection and will raise maximum working set.

when {TIMESTAMP}

Interconvert between git timestamps (integer Unix time plus TZ) and RFC3339 format. Takes one argument, autodetects the format. Useful when eyeballing export streams. Also accepts any other supported date format and converts to RFC3339.

12.3. Debugging and diagnostics

A few commands have been implemented primarily for debugging and regression-testing purposes, but may be useful in unusual circumstances.

The output of most of these commands can individually be redirected to a named output file. Where indicated in the syntax, you can prefix the output filename with ‘>’ and give it as a following argument.

{SELECTION} resolve

Does nothing but resolve a selection-set expression and report the resulting event-number set to standard output. The remainder of the line after the command, if any, is used as a label for the output.

Implemented mainly for regression testing, but may be useful for exploring the selection-set language.

The parenthesized literal produced by this command is valid selection-set syntax; it can be pasted into a script for re-use.

log [[+-]LOG-CLASS]…​

Without an argument, list all log message classes, prepending a + if that class is enabled and a - if not.

Otherwise, it expects a space-separated list of "<+ or -><log message class>" entries, and enables (with +) or disables (with -) the corresponding log message class. The special keyword "all" can be used to affect all the classes at the same time.

For instance, "log -all +shout +warn" will disable all classes except "shout" and "warn", which is the default setting. "log +all -svnparse" would enable logging everything but messages from the svn parser.
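The way a list of +/- entries is applied left to right can be sketched in Python. The function name and the use of a set are illustrative assumptions; the class names are the ones this manual lists for the log command.

```python
# Log message classes as listed in this manual.
CLASSES = [
    "shout", "warn", "baton", "tagfix", "topology", "properties",
    "extract", "filemap", "ancestry", "delete", "ignores", "svnparse",
    "emailin", "shuffle", "commands", "unite", "lexer",
]


def apply_log_spec(enabled, spec):
    """Return the enabled set after applying entries such as '-all +warn'."""
    enabled = set(enabled)
    for entry in spec.split():
        sign, name = entry[0], entry[1:]
        if sign not in ("+", "-") or (name != "all" and name not in CLASSES):
            raise ValueError("bad log entry: " + entry)
        targets = CLASSES if name == "all" else [name]
        if sign == "+":
            enabled.update(targets)
        else:
            enabled.difference_update(targets)
    return enabled
```

Order matters: "-all +shout +warn" first clears everything and then enables two classes.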

A list of available message classes follows; most classes above the "warn" level are of interest only to developers. Consult the source code to learn more.

shout
warn
baton
tagfix
topology
properties
extract
filemap
ancestry
delete
ignores
svnparse
emailin
shuffle
commands
unite
lexer

logfile [PATH]

Error, warning, and diagnostic messages are normally emitted to standard error. This command, with a nonempty path argument, directs them to the specified file instead. Without an argument, reports what logfile is set.

version [EXPECT]

With no argument, display the reposurgeon version and supported VCSes. With argument, declare the major version (single digit) or full version (major.minor) under which the enclosing script was developed. The program will error out if the major version has changed (which means the surgical language is not backwards compatible).

It is good practice to start your lift script with a version requirement, especially if you are going to archive it for later reference.

[SELECTION] hash [--bare] [--tree] [>OUTFILE]

Report Git event hashes. This command simulates Git hash generation.

Takes a selection set, defaulting to all. For each eligible event in the set, returns its index and the same hash that Git would generate for its representation of the event. Eligible events are blobs and commits.

With the option --bare, omit the event number; list only the hash.

With the option --tree, generate a tree hash for the specified commit rather than the commit hash. This option is not expected to be useful for anything but verifying the hash code itself.
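For reference, Git computes a blob’s SHA-1 identifier over a short header followed by the content. The Python sketch below shows that calculation for blobs only; the function name is hypothetical, and commit and tree hashes (which use the same header scheme over Git’s serialized form of those objects) are not shown.

```python
import hashlib


def git_blob_hash(data: bytes) -> str:
    """SHA-1 of b'blob <size>' + NUL + content, as Git names a blob."""
    header = b"blob %d\0" % len(data)
    return hashlib.sha1(header + data).hexdigest()
```

This matches what 'git hash-object' reports for the same bytes in a SHA-1 repository.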

sizeof

This command is for developer use when optimizing structure packing to reduce memory use. It is probably not of interest to ordinary reposurgeon users.

It displays byte-extent sizes for various reposurgeon internal types. Note that these sizes are stride lengths, as in C’s sizeof(); this means that for structs they will include whatever trailing padding is required for instances in an array of the structs.

12.4. Profiling

elapsed [>OUTFILE]

Display elapsed time since start.

timings [MARK-NAME] [>OUTFILE]

Report phase-timing results from repository analysis.

If the command has following text, this creates a new, named time mark that will be visible in a later report; this may be useful during long-running conversion recipes.

readlimit {N}

Set a maximum number of commits to read from a stream. If the limit is reached before EOF it will be logged. Mainly useful for benchmarking. Without arguments, report the read limit; 0 means there is none.

memory [>OUTFILE]

Report memory usage. Runs a garbage-collect before reporting so the figure will better reflect storage currently held in loaded repositories; this will not affect the reported high-water mark.

profile [live|start|save] [PORT | SUBJECT [FILENAME]]

Profiling is enabled by default, but viewing the profile data requires either starting the HTTP server with ‘profile live’, or saving it to a file with ‘profile save’. When no arguments are given it prints out the available types of profiles. There is more detailed documentation on this command in the embedded help.

exit [>OUTFILE]

Exit cleanly, emitting a goodbye message including elapsed time.

bench

Report elapsed time and memory usage in the format expected by repobench. Note: this command is not intended for interactive use or to be used by scripts other than repobench. The output format may change as repobench does.

Runs a garbage-collect before reporting so the figure will better reflect storage currently held in loaded repositories; this will not affect the reported high-water mark.

13. Working with Mercurial

There is a built-in extractor class to perform extractions from Mercurial repositories.

Mercurial branches are exported as branches in the exported repository, and tags are exported as tags. By default, bookmarks are ignored. You can specify explicit handling for bookmarks by setting ‘reposurgeon.bookmarks’ in your .hg/hgrc. Set the value to the prefix that reposurgeon should use for bookmarks.

For example, if your bookmarks represent branches, put this at the bottom of your .hg/hgrc:

[reposurgeon]
bookmarks=heads/

If you do that, it’s your responsibility to ensure that branch names do not conflict with bookmark names. You can add a prefix like ‘bookmarks=heads/feature-’ to disambiguate as necessary.

Alternatively, you can import directly using hg-git-fast-import. This importer is not yet well-tested, but may be substantially faster than using the extractor harness. You may wish to run test conversions using both methods and compare them.

13.1. Mercurial subrepositories

The hg extractor does not attempt to recursively handle subrepos. Rather, it will extract the history of the top-level repo, in which .hgsub and .hgsubstate will be treated as regular files. If you wish to translate these into the semantics of your target VCS, you will need to do so with surgical primitives after reading the history into reposurgeon.

14. Working with Subversion

The transaction model of Subversion is nothing like that of the DVCSes (distributed version-control systems) that followed it. Two of the more obvious differences are around tags and branches. These differences occasionally lead to conversion problems.

A Subversion tag isn’t an annotation attached to a commit. The Subversion data model is that a history is a sequence of surgical operations on a tree; there are no annotation tags as such, a tag is just another branch of the tree. Accordingly a Subversion tag is a copy of the state of an entire branch at a particular revision. This can be losslessly translated to an annotation only if no additional commits are added to the tag branch after the copy. But nothing prevents this! reposurgeon tries to do the right thing, creating a DVCS-style annotated tag when it can and otherwise preserving the changes as commits, using a lightweight tag to point at the tip.

There is a subtler problem around branches themselves. In a DVCS, deleting a branch removes it from the repository history entirely, a fact of some significance since repositories are copied around often enough that keeping every discarded experiment forever would eventually drown the live content in superannuated cruft. Subversion repositories, on the other hand, are designed on the assumption that they sit on one server and never move. A Subversion branch is just a directory in the branch namespace; if you delete it, you won’t see it in following revisions, but if you update to an older one that content will still be there. By default, reposurgeon will delete the corresponding branches as if the deletion was done in a DVCS, keeping only the commits that are also part of other branches' histories, but you can tell it to preserve the branches instead and give them unambiguous names in the refs/deleted namespace.

Bad things can happen when a tag directory is created, copied from, deleted, then recreated from a different source directory. This is a place where the Subversion model of tags clashes destructively with the changeset-DAG model used by git and other DVCSes, especially if the same tag is recreated later! The obvious thing to do when converting this sequence would be to just nuke the tag history from the deletion back to its branch point, but that will cause problems if a copy operation was ever sourced in the deleted branch (and this does happen!).

What reposurgeon does instead is preserve the most recent branch with any given name, so the view back from any branch tip in the repository has the correct content. This does mean that reposurgeon discards the content of any previous branch having that same name, but see the --preserve option of the read command.

In very unusual cases you may need to use the "--nobranch" option. However, this has the disadvantage that you’ll have to do the branch surgery by hand at a later stage (the command for this is "branchlift"). Instead, you may be able to use the repocutter filter to transform the dump file into a version shaped right for a regular branch-sensitive lift.

The reposurgeon analyzer tries to warn you about pathological cases, and reposurgeon gives you tools for coping with them. Unfortunately, the warnings are (unavoidably) cryptic unless you understand Subversion internals in detail.

There’s another problem around Subversion merges. In a DVCS, a merge normally coalesces two entire branches. Subversion has something close to this in newer versions; it’s called a "sync merge" working on directories (and is expressed as an svn:mergeinfo property of the target directory that names the source). A sync merge of a branch directory into another branch directory behaves like a DVCS merge; reposurgeon picks these up and translates them for you.

The older, more basic Subversion merge is per file and is expressed by per-file svn:mergeinfo properties. These correspond to what in DVCS-land are called "cherry-picks", which just replay a commit from a source branch onto a target branch but do not create cross-branch links.

Sometimes Subversion developers use collections of per-file mergeinfo properties to express partial branch merges. This does not map to the DVCS model at all well, and trying to promote these to full-branch merges by hand is actually dangerous. An excellent essay, Partial git merges — just say no. explores the problem in depth.

The bottom line is that reposurgeon warns about per-file svn:mergeinfo properties and then discards them for good reasons. If you feel an urge to hand-edit in a branch merge based on these, do so with care and check your work.

Three minor issues to watch for:

  1. Superfluous root tags.

  2. Fossil Subversion revision numbers.

  3. File content mismatches due to $-keyword expansion.

More details follow.

14.1. Reading Subversion repositories

Note that the Subversion dump reader only supports versions 1 and 2 of the dump file format, not version 3 with diff-based file changes. This shouldn’t be a problem with normal use of reposurgeon, which calls svnadmin dump in its default mode generating version 2.

Certain optional modifiers on the read command change its behavior when reading Subversion repositories:

--nobranch

Suppress branch analysis. The generated git repository will mirror the whole subversion tree, with trunk and branches (if they exist) as subdirectories. No directory deletions are translated to branch deletions, since no directories are seen as branches in the first place.

--user-ignores

By default reposurgeon tosses out in-tree .gitignore files found in the history because they probably come from git-svn users who checked-in their own .gitignore files. Using this option makes reposurgeon keep the content of these files and merge them with the .gitignore files generated from svn:ignore and svn:global-ignores properties, if any.

--use-uuid

If the --use-uuid read option is set, the repository’s UUID will be used as the hostname when faking up email addresses, a la git-svn. Otherwise, addresses will be generated the way git cvs-import does it, simply copying the username into the address field.

--no-automatic-ignores

Do not generate .gitignore files from svn:ignore and svn:global-ignores properties. If --user-ignores is also used then only .gitignore files that were present in the SVN tree will exist in the final repository. If --user-ignores is not used, no .gitignore file at all will survive the conversion.

--preserve

When a branch or tag was deleted in SVN, preserve the history up to deletion in a git ref under refs/deleted/, instead of deleting the branch and only keeping the commits that are also part of the history of other branches. The reference is disambiguated using the base revision number of the dead branch. Also, preserve branch-copy commits autogenerated by cvs2svn that would otherwise be discarded. (Note that --preserve is not the default behavior because of experience with large old repositories that may have hundreds or even thousands of dead branches. While it is important that content copies from dead branches be resolved correctly, the branches themselves are almost never interesting.)

--branchify=DIRECTORY[:DIRECTORY]…​

Specify a colon-separated list of directories to be treated as potential branches (to become tags if there are no modifications after the creation copies) when analyzing a Subversion repo. This option is ignored when reading with the --nobranch option. It defaults to the 'standard layout' set of directories, plus any unrecognized directories in the repository root.

An asterisk at the end of a path in the set means 'all immediate subdirectories of this path, unless they are part of another (longer) path in the branchify set'.

# This is what the branchify option would look like
# if you needed to specify the default set of branch patterns.
read --branchify=trunk:tags/*:branches/*:* <example.svn

These modifiers can go anywhere in any order on the command line after the read verb. They must be whitespace-separated.

As stacking up read options can result in a very long read invocation line, it’s useful to know that backslash is accepted as a continuation character. Thus, you can do something like this, which is an actual line from reposurgeon’s test suite:

read <nesting.svn \
     --branchify=cpp-msbuild/trunk:cpp-msbuild/branches/*:cpp-msbuild/tags/*

It is also possible to embed a magic comment in a Subversion stream file to set these options. Prefix a space-separated list of them with the magic comment ‘ # reposurgeon-read-options:’; the leading space is required. This is mainly useful when synthesizing test loads; in particular, a stream file that does not set up a standard trunk/branches/tags directory layout can use this to perform a mapping of all commits onto the master branch that the git importer will accept.

Here are the rules used for mapping subdirectories in a Subversion repository to branches:

  • At any given time there is a set of eligible paths and path wildcards which declare potential branches. See the documentation of the --branchify read option for how to alter this set, which initially consists of {trunk, tags/*, branches/*, and *}.

  • A repository is considered “flat” if it has no directory that matches a path or path wildcard in the branchify set. All commits in a flat repository are assigned to branch master, and what would have been branch structure becomes directory structure. In this case, we’re done; all the other rules apply to non-flat repos.

  • If you give the option --nobranch when reading a Subversion repository, branch analysis is skipped, and the repository is treated as though flat (left as a linear sequence of commits on refs/heads/master). This may be useful if your repository configuration is highly unusual and you need to do your own branch surgery, with e.g. the branchlift command. Note that this option will disable partitioning of mixed commits.

  • If ‘trunk’ is eligible, it always becomes the master branch.

  • If an element of the branchify set ends with /*, it is considered a branch namespace: each immediate subdirectory of it is considered a potential branch, unless it itself appears in --branchify as a namespace. If * is in the branchify set (which is true by default) all top-level directories are also considered potential branches (other than /trunk which is mapped to master, and /tags and /branches which are namespaces by default).

  • Files in the top-level directory are assigned to a synthetic branch named ‘unbranched’. If there is no "trunk", then this synthetic ‘unbranched’ branch becomes the master branch. If the Subversion repository has a branch named ‘unbranched’ the name ‘unbranched-bis’ is used instead; actually, ‘-bis’ is appended enough times to get to an unused branch name.

  • Each potential branch is checked to see if it has commits on it after the initial creation or copy. If there are such commits, or if the branch creation or copy introduces changes other than the copy, it becomes a branch. If not, it may become a tag in order to preserve the commit metadata. In all cases, the name of any created tag or branch is the basename of the directory, unless another mapping is in place.
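The wildcard rules above can be sketched in Python. This is an illustration under stated assumptions, not reposurgeon's actual implementation: the function name branch_dir is hypothetical, it only models which directory a file path is assigned to, and it reads "part of another (longer) path" as "is a prefix of some other branchify entry".

```python
def branch_dir(file_path, branchify=("trunk", "tags/*", "branches/*", "*")):
    """Return the branch directory owning file_path, or None if unbranched.

    Hypothetical helper: the last path component is assumed to be a file,
    so it is never itself considered as a branch directory.
    """
    parts = file_path.strip("/").split("/")
    # Entries without a trailing asterisk name a branch directory directly.
    literals = {p for p in branchify if not p.endswith("*")}
    # A wildcard entry declares a namespace; "" stands for the repository root.
    namespaces = {p[:-1].rstrip("/") for p in branchify if p.endswith("*")}
    # Try the longest directory prefixes first so longer entries win.
    for n in range(len(parts) - 1, 0, -1):
        candidate = "/".join(parts[:n])
        if candidate in literals:
            return candidate
        parent = "/".join(parts[:n - 1])
        # A directory that is part of a longer branchify entry is not a branch.
        shadowed = any(p.startswith(candidate + "/") for p in branchify)
        if parent in namespaces and not shadowed:
            return candidate
    return None


if __name__ == "__main__":
    for path in ("trunk/src/a.c", "branches/foo/a.c", "vendor/x.c", "README"):
        print(path, "->", branch_dir(path))
```

With the default set, a file under branches/foo maps to branches/foo, a file under an unrecognized top-level directory maps to that directory, and a file in the root maps to no branch directory (the case the manual assigns to 'unbranched').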

Branch-creation operations with no following commits are usually tagified. However, this is done to preserve comment/committer data entered by users; when reposurgeon can detect that a branch-creation comment was automatically generated (as often happens in cvs2svn conversions) the commit will simply be discarded so as not to create clutter that has to be manually removed by the operator. (That discard action is prevented by the --preserve option.)

Otherwise, each commit that only creates or deletes directories (in particular, copy commits for tags and branches, and commits that only change properties) will be transformed into a tag named after the tag or branch, containing the date/author/comment metadata from the commit.

Subversion branch deletions are turned into deletealls, clearing the fileset of the import-stream branch. When a branch finishes with a deleteall at its tip, the deleteall is transformed into a tag. This rule cleans up after aborted branch renames.

Occasionally (and usually by mistake) a branchy Subversion repository will contain revisions that touch multiple branches. These are handled by partitioning them into multiple import-stream commits, one on each affected branch. The Legacy-ID of such a split commit will have a pseudo-decimal part - for example, if Subversion revision 2317 touches three branches, the three generated commits will have IDs 2317.1, 2317.2, and 2317.3.
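The numbering scheme for split commits can be illustrated with a short Python sketch. The function name split_legacy_ids is hypothetical, and the order in which branches receive their suffixes is an assumption made for the example; the text above only fixes the form of the IDs.

```python
def split_legacy_ids(revision, branches):
    """Map each touched branch to the Legacy-ID of its generated commit.

    Assumption: suffixes are handed out in the order the branches are given.
    """
    if len(branches) == 1:
        # A revision touching one branch keeps its plain revision number.
        return {branches[0]: str(revision)}
    return {
        branch: f"{revision}.{index}"
        for index, branch in enumerate(branches, start=1)
    }


if __name__ == "__main__":
    print(split_legacy_ids(2317, ["trunk", "branches/a", "branches/b"]))
```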

The svn:executable and svn:special properties are translated into permission settings in the input stream; svn:executable becomes 100755 and svn:special becomes 120000 (indicating a symlink; the blob contents will be the path to which the symlink should resolve).
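As a minimal sketch of that mapping (not reposurgeon's code): the function name git_mode is hypothetical, 100644 is the ordinary git mode for a non-executable regular file, and giving svn:special priority when both properties are set is an assumption.

```python
def git_mode(properties):
    """Return the import-stream mode string for a file's Subversion properties."""
    if "svn:special" in properties:
        # Symlink: the blob content is the path the link resolves to.
        return "120000"
    if "svn:executable" in properties:
        return "100755"
    # Plain regular file (standard git mode, not stated in the text above).
    return "100644"


if __name__ == "__main__":
    print(git_mode({"svn:executable": "*"}))
    print(git_mode({"svn:special": "*"}))
    print(git_mode({}))
```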

Any cvs2svn:rev properties generated by cvs2svn are incorporated into the internal map used for reference-lifting, then discarded.

Normally, per-directory svn:ignore properties (and svn:global-ignores properties, e.g. in a site configuration file) become .gitignore files. Actual .gitignore files in a Subversion directory are presumed to have been created by git-svn users separately from native Subversion ignore properties and discarded with a warning. It is up to the user to merge the content of such files into the target repository by hand. But this behavior is changed by the --user-ignores option which disables filtering of in-tree .gitignore files and instead merges them with .gitignore files generated from Subversion properties. On the other hand, the --no-automatic-ignores option discards Subversion svn:ignore and svn:global-ignores properties without translation.

Normally, .cvsignore files left over from a Subversion repository’s ancient history as a CVS repository are deleted. The assumption is that the repository users want the (presumably more up-to-date) Subversion ignore properties to be translated. However, this deletion can be prevented with the --cvsignores read option.

svn:mergeinfo properties are interpreted. Any svn:mergeinfo property on a revision A with a merge source containing all revisions on a branch from the forking point (or the branch start if the histories are independent) up to revision B produces a merge link such that the branch tip at revision B becomes a parent of A. The "svnmerge-integrated" properties produced by Subversion’s svnmerge.py script are handled the same way.

All other Subversion properties are discarded. (This may change in a future release.) The property for which this is most likely to cause semantic problems is svn:eol-style. However, since property-change-only commits get turned into annotated tags, the translated tags will retain information about setting changes.

The sub-second resolution on Subversion commit dates is discarded; Git wants integer timestamps only. Normally Subversion timestamps are rounded down, but when two adjacent timestamps have the same seconds part and the later one is in the top half-second of the interval, the later one is rounded up instead. This does much to reduce collisions while guaranteeing that no timestamp is ever shifted to a non-adjacent second mark.
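The rounding rule can be sketched as follows. This is an illustration of the rule as described, not reposurgeon's code: the function name round_timestamps is hypothetical, timestamps are assumed to be sorted fractional seconds, and treating exactly .5 as "top half" is an assumption.

```python
import math


def round_timestamps(stamps):
    """Convert sorted fractional-second timestamps to integer seconds.

    Timestamps are rounded down, except that a timestamp sharing its
    seconds part with its predecessor is rounded up when its fractional
    part is in the top half-second.
    """
    rounded = []
    for index, stamp in enumerate(stamps):
        whole = math.floor(stamp)
        same_second = index > 0 and math.floor(stamps[index - 1]) == whole
        if same_second and stamp - whole >= 0.5:
            whole += 1  # moves to the adjacent second mark, never further
        rounded.append(whole)
    return rounded


if __name__ == "__main__":
    print(round_timestamps([10.25, 10.75, 12.5]))
```

In the example, 10.25 and 10.75 would collide at 10 if both were rounded down; the second is pushed to 11 instead, while 12.5 has no neighbor in the same second and is simply rounded down.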

Because fast-import format cannot represent an empty directory, empty directories in Subversion repositories will be lost in translation.

Normally, Subversion local usernames are mapped in the style of git cvs-import; thus user ‘foo’ becomes ‘foo <foo>’, which is sufficient to pacify git and other systems that require email addresses. With the option --use-uuid, usernames are mapped in the git-svn style, with the repository’s UUID used as a fake domain in the email address. Both forms can be remapped to real addresses using the authors read command.
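The two mapping styles can be sketched like this. The function name fake_attribution and the sample UUID are hypothetical; using the bare username as the display name in the UUID form is an assumption based on how git-svn is described here.

```python
def fake_attribution(username, uuid=None):
    """Build a placeholder 'name <email>' string for a Subversion username."""
    if uuid is None:
        # Default style: the username is simply copied into the address field.
        return f"{username} <{username}>"
    # --use-uuid style: the repository UUID serves as a fake domain.
    return f"{username} <{username}@{uuid}>"


if __name__ == "__main__":
    print(fake_attribution("foo"))
    # The UUID below is made up for the example.
    print(fake_attribution("foo", "0123abcd-0000-1111-2222-333344445555"))
```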

Reading a Subversion stream enables writing of the legacy map as 'legacy-id' passthroughs when the repo is written to a stream file.

reposurgeon tries hard to silently do the right thing, but there are Subversion edge cases in which it emits warnings because a human may need to intervene and perform fixups by hand. Here are the less obvious messages it may emit:

user-created .gitignore ignored

This message means reposurgeon has found a .gitignore file in the Subversion repository it is analyzing. This probably happened because somebody was using git-svn as a live gateway, and created ignores which may or may not be congruent with those in the generated .gitignore files that the Subversion ignore properties will be translated into. You’ll need to make a policy decision about which set of ignores to use in the conversion, and possibly set the --user-ignores option on read to pass through user-created .gitignore files and possibly merge them with files generated from Subversion ignore properties; in that case this warning will not be emitted.

properties set

reposurgeon has detected a setting of a user-defined property, or of the Subversion property svn:externals. These properties cannot be expressed in an import stream; the user is notified in case this is a showstopper for the conversion or some corrective action is required, but normally this warning can be ignored. It is suppressed by the --ignore-properties option.

Detected link from <revision> to <revision> might be dubious

When trying to detect parent links from multiple file copies like what cvs2svn can produce, source revisions of the different copies were not all the same. The link should probably be monitored because it has a non-negligible probability of being slightly wrong. This does not impact the tree contents, only the quality of the history.

14.2. Mid-branch deletions

When a branch A is deleted and a branch B is copied to the name A, the Subversion intent is to replace the contents of branch A with the contents of branch B, keeping the A name. This is a poor man’s merge from before "svn merge" existed. Many Subversion users who formed their habits before svn merge existed still operate this way.

In git terms, this almost corresponds to a merge of A into B followed by a rename of B to A. Branch B continues to exist, however, so we can’t do that in translation. The reposurgeon logic does not try to be clever about this, because "clever" would have rebarbative edge cases; the sequence is translated into a deleteall followed by a commit operation that recreates the B files under corresponding A names. No merge link is created. The commit filling A with a branch copy from B will have B as its first parent, though, so all that would be needed is to create a merge link from the old A before the delete to the commit recreating A.

This case is mentioned here because it is likely to confuse the merge-tracking algorithms used, e.g., by git diff, or if you ever try to merge a branch that forked off the old A to a branch spun off the new (and expect git to know that you do not want to incorporate old A’s changes).

14.3. Multiproject Subversion repositories

Subversion repositories are sometimes organized to hold multiple projects, with the root directory containing one subdirectory per project and each subdirectory having its own trunk/branches/tags layout.

Suppose you have a stream dump from a repository with two project subdirectories, project1 and project2. One pattern for dissecting out project1 looks like this:

read <multiproject.svn \
  --branchify=project1/trunk:project1/branches/*:project1/tags/*:*
branch heads/project1/trunk rename heads/master
branch @heads/project1/branches/(.*)@ rename heads/\1
branch heads/project2 delete

The read option branchifies every directory underneath project1 for which that’s required, with project2 left as its own branch from top level. This transformation is performed when the actual read of the repository happens.

The following branch renames transform these branches into a standard layout. Tag creations do not need to be mapped separately, as the generated gitspace tags are moved when their base branch is. Following the renames, the unneeded project2 branch can be dropped.

Of course we could have done the same thing with project2 and dropped project1. Repeat this as many times as required to turn each partial into an autonomous git repository.

While something like this could be done with repocutter sift commands, that would not correctly resolve Subversion copies across projects. This reposurgeon procedure handles those correctly.

Alternatively, you can do a "--nobranch" read of the Subversion repository and then individually turn subdirectories into branches using the "branchlift" command. The pattern for that looks like this:

read --nobranch <multiproject.svn
branchlift master branches/release-1.0
branchlift master branches/release-2.0
path rename "trunk/(.*)" "\1"

This will turn two subdirectories into branches named "release-1.0" and "release-2.0", then pop "trunk" off all the paths where it occurs (leaving those commits on the master branch).

This method has the disadvantage that you have to enumerate all branches you want to lift. Still, it may be useful on repositories that consist of one big unbranched file tree not conforming to a standard layout.

15. Working with CVS

When you are converting a CVS repository using reposurgeon, most of the heavy lifting will have been done by the importer - cvs-fast-export. In particular, it coalesces CVS per-file changes into changesets when it detects that they have identical comments and attributions and are close in time, and it converts .cvsignore files to .gitignores.

A CVS repository normally consists of a set of module subdirectories and a CVSROOT directory containing metadata. cvs-fast-export ignores CVSROOT; thus you can run reposurgeon at any level of a directory tree containing CVS master files, and it will try to lift what it can see at and below the current directory it is run from.

If you do this at the top level of the repository directory, your converted repository will have a subdirectory corresponding to each module. This is normally not the way you want to do things, as CVS tags are not likely to be consistent across all modules and thus won’t lift correctly. You probably want to do individual module conversions.

Problems in CVS conversions generally arise from the fact that CVS’s data model doesn’t have real multi-file changesets, which are the fundamental unit of a commit in DVCSes. It can be difficult to fully recover changesets from what are actually large numbers of single-file changes flying in loose formation - in fact, old CVS operator errors can sometimes make it impossible. Bad tools silently propagate such damage forward into your translation. Good tools, like cvs-fast-export and reposurgeon, warn you of problems and help you recover.

Here are the kinds of conversion glitches to watch for:

  1. Failure to coalesce runs of commits with identical attribution and comment text.

  2. Superfluous root tags.

  3. Fossil CVS revision numbers.

  4. File content mismatches due to $-keyword expansion.

  5. "Zombie" files due to failure to track deletion operations.

Details follow.

Glitch #1: This is driven by whether the window defining "close in time" is wide enough. If it’s not, you may detect commit groups with the same committer and comment text that should have been merged into one changeset but were not. You can either clean these up with the ‘coalesce’ command in reposurgeon or run cvs-fast-export by hand with a larger -w option and read in the generated stream.

Glitch #2: A cleanup step unique to CVS conversions is deleting root tags - tags which have "-root" as a name suffix and mark the beginning of a branch. CVS uses these for bookkeeping, but later systems don’t need them. They’re just clutter and can be removed.

Glitch #3: It’s also worth paying careful attention to reference-lifting so that you can scrub useless CVS revision numbers out of comments. This is a more pressing issue than it is with Subversion, where changesets map to changesets, and conversions have the option of marking each target changeset with its revision number.

Glitch #4: You can spot content mismatches due to keyword expansion easily. They will produce single-line diffs of lines containing dollar signs surrounding keyword text. Because binary files can be corrupted by keyword expansion, cvs-fast-export behaves like cvs -kb mode and does no keyword expansion of its own.

Glitch #5: Manifest mismatches on tags are most likely to occur on files which were deleted in CVS but persist under later tags in the Git conversion. You can bet this is what’s going on if, when you search for the pathname in the CVS repository, you find it in an Attic directory.

These spurious reports happen because CVS does not always retain enough information to track deletions reliably and is somewhat flaky in its handling of "dead"-state revisions. To make your CVS and git repos match perfectly, you may need to add delete fileops to the conversion - or, more likely, move existing ones back along their branches to commits that predate the gitspace tag - using reposurgeon.

Manifest mismatches in the other direction (present in CVS, absent in gitspace) should never occur. If one does, submit a bug report.

Any other kind of content or manifest mismatch - but especially any on the master branch - is bad news and indicates either a severe repository malformation or a bug in cvs-fast-export (or possibly both). Any such situation should be reported as a bug.

Conversion bugs are disproportionately likely to occur on older branches or tags made with CVS versions from before CVS got commit IDs in 2006 (version 2.12). Often the most efficient remedy is simply to delete junk branches and tags; reposurgeon(1) makes this easy to do.

16. Troubleshooting and bug reports

16.1. Dealing with memory exhaustion

To do its job, reposurgeon needs to hold all of your history’s metadata in memory. That doesn’t mean the content part, but does mean all of the changeset attributions, comments, and tags. Given a large enough repository, this will overrun the RAM of a small machine. If this happens to you, your reposurgeon instance will die abruptly with an OOM (Out Of Memory) error while attempting to read in your repository.

It is extremely unlikely that this is due to a bug in reposurgeon. Before filing an issue about it, there is a procedure you should try. It consists of bisecting on the parameter the Go language runtime uses to control the frequency of garbage collection. You can set this using the environment variable GOGC or the reposurgeon "gc" command.

GOGC defaults to 100, which instructs the runtime to garbage-collect when the heap size is 100% bigger than (i.e., twice) the size of the reachable objects after the last garbage collection. To increase the frequency of GC, usually resulting in a lower memory high-water mark, decrease that percentage threshold. To decrease GC frequency, increase the threshold so the runtime tolerates a larger heap.

To troubleshoot your OOM problem, bisect on this threshold to find the highest value that will avoid OOM. Start by cutting it to 50, then to 25, then to 12, then to 6. If you find a value that allows you to read to completion, you may want to try increasing it by a half interval (e.g. 50 to 75, 25 to 37, etc.) to get back some throughput.
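The arithmetic behind this procedure can be sketched in Python. All function names here are hypothetical, and gc_trigger_bytes is a simplified model of the threshold as described above rather than the Go runtime's exact accounting.

```python
def gc_trigger_bytes(live_bytes, gogc=100):
    """Heap size at which a collection starts, per the simplified rule above."""
    return live_bytes * (100 + gogc) // 100


def halving_schedule(start=100, floor=6):
    """GOGC values to try in order: each is half the previous, down to floor."""
    values = []
    value = start
    while value // 2 >= floor:
        value //= 2
        values.append(value)
    return values


def half_interval_up(value):
    """Raise a working GOGC value halfway back toward the one that failed."""
    return value + value // 2


if __name__ == "__main__":
    print(halving_schedule())
    print(gc_trigger_bytes(1_000_000, gogc=50))
    print(half_interval_up(25))
```

With one megabyte of live data, lowering GOGC from 100 to 50 moves the modeled trigger point from two megabytes to one and a half, which is why smaller values lower the memory high-water mark at the cost of more frequent collections.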

If your repository won’t read in at GOGC=6 you have a real problem. Unfortunately, it’s not one the reposurgeon devteam can help you with; the correct solution to it is to do your conversion on a machine with more RAM and/or more swap configured. 64GB should be sufficient. The largest repository the reposurgeon devteam has ever seen (the history of GCC, 280K Subversion commits) fit on a 128GB machine with GOGC=30.

If you can’t read your history onto a 128GB machine with GOGC=30, then maybe the reposurgeon devteam ought to hear about it. That said, if you can find ways to make reposurgeon more efficient, we are eager to accept those patches, or even just a bug report with the details. It’s probable that there are some efficiency gains yet to be made.

16.2. Dealing with stalled conversions

Occasionally it will happen that a conversion on a particularly large or malformed repository seems to stall out, grinding endlessly without completing a conversion phase.

Reposurgeon’s execution time is dominated by cycles spent in the memory allocator and garbage collector. Thus, you can pay RAM to decrease running time - push GOGC up from its default of 100. If your conversion completes in reasonable time before your memory usage increases enough that reposurgeon gets killed by OOM, you win. Otherwise see the previous section about adding RAM and swap space.

A stallout is more difficult to troubleshoot than an OOM, and more likely to indicate an actual bug or algorithmic problem in reposurgeon. There are a couple of things you can do to make a good resolution more likely:

  1. Identify and report the phase in which the stallout occurs. Be aware that there is a known O(n**2) problem in phase C2 of the Subversion dump reader that really thrashes the allocator; that’s not reducible and we’re just going to tell you you have to throw more RAM at the problem.

  2. Use repobench to see if you can identify a revision that triggers the stall. The procedure for this is to use it to step your readlimit up from zero until you see the runtime spike.

  3. As always, provide a stripped (and possibly obscured) dump of the repository for testing.

16.3. How to report bugs

It is generally not possible to reproduce reposurgeon/repocutter bugs without a copy of the history on which they occurred. When you find a bug, please send the maintainers:

(a) An exact description of how observed behavior differed from expected behavior. If reposurgeon/repocutter died, include the backtrace.

(b) A git fast-import or Subversion dump file of the repository you were operating on, or a pointer to where it can be pulled from.

(c) A script containing the sequence of reposurgeon or repocutter commands that revealed the bug. If you were exploring interactively, remember that the "history" command exists and can dump your command history to a file.

(d) If you were using the standard-workflow Makefile generated by "repotool initialize", mention that in your bug report. If you modified the Makefile, include a copy with the bug report.

Please use the reposurgeon project’s issue tracker and attach these files. It’s helpful.

Are you seeing git die with a complaint about an unknown --show-original-ids option? Upgrade your git; reposurgeon needs 2.19.2 or later.

16.3.1. Test case reduction

If you know how to reproduce the error, the best possible test case is a hand-crafted dump stream of minimal size with content that explains how it breaks the tool. Those are turned into regression tests instantly.

When you don’t know the cause of the error, ship the project a dump file derived from the real repository that triggered it. To speed up the debugging process so you can get an answer more quickly, there are some tactics you can use to reduce the bulk of the test case you send. Also, a well-reduced dump can become a regression test to ensure the bug does not recur.

How to make dumps in Git: cd to your git repository and capture the output of "repotool export".

How to make dumps in Subversion: cd to the toplevel directory of the repository master - not a checkout of it. You can tell you’re in the right place if you see this:

$ ls
conf  db  format  hooks  locks  README.txt

Then run "repotool export", capturing the output.

The commands you will use for test-case reduction are reposurgeon and, on Subversion dumps, repocutter.

16.3.2. Replace the content blobs in the dump with stubs

The subcommand in both tools is 'strip'; it will usually cut the size of the dump by more than a factor of 10. Check that the bug still reproduces on the stripped dump; if it doesn’t, that would be unprecedented and interesting in itself.

If you are trying to maintain confidentiality about your code, sending the maintainers a stripped repo has the additional advantage that the code won’t be disclosed! The command preserves structure and metadata but replaces each content blob with a unique magic cookie.

If you don’t want to disclose even the metadata, you can do a repocutter "obscure" pass after the strip. This will mask file paths and developer names.

16.3.3. Truncate the dump as much as possible

Try to truncate the dump to the shortest leading section that reproduces the bug.

A reposurgeon error message will normally include a mark, event number, or (when reading a Subversion dump) a Subversion revision number. Use a selection-set argument to reposurgeon’s 'write' command, or the 'select' subcommand of repocutter, to pare down the dump so that it ends just after the error is triggered. Again, check to ensure that the bug reproduces on the truncated dump.

If the error message doesn’t tell you where the problem occurred, try a bisection process. Use the --readlimit option of the 'read' command to ignore the last half of the events in the dump; check to see if the bug reproduces. If it does, repeat; if it does not, throw out the last quarter, then the last eighth, and so forth. Keep this up until you can no longer drop events without making the bug go away.
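The bisection can be sketched as a binary search over the read limit. This is a sketch under assumptions: the names smallest_reproducing_limit and reproduces are hypothetical, reproduces(n) stands for "read with a limit of n events and check whether the bug shows up" (how you run that check is up to you), and the bug is assumed to persist once enough leading events are included.

```python
def smallest_reproducing_limit(total_events, reproduces):
    """Find the shortest leading section of the dump that still shows the bug.

    reproduces(n) is a caller-supplied check (hypothetical) that returns True
    when reading only the first n events triggers the bug.
    """
    low, high = 0, total_events  # bug absent at low, present at high
    while high - low > 1:
        middle = (low + high) // 2
        if reproduces(middle):
            high = middle  # the bug survives; drop everything after middle
        else:
            low = middle   # cut too much; the trigger lies after middle
    return high


if __name__ == "__main__":
    # Stand-in check: pretend event 137 is the one that triggers the bug.
    print(smallest_reproducing_limit(1000, lambda n: n >= 137))
```

Each probe halves the remaining interval, so even a dump with a million events needs only about twenty trial reads.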

Bisection is more effective than you might expect, because the kinds of repository malformations that trigger misbehavior from reposurgeon tend to rise in frequency as you go back in time. The largest single category of them has been ancient cruft produced by cvs2svn conversions.

16.3.4. Topologically reduce the dump

Next, topologically reduce the dump, throwing out boring commits that are unlikely to be related to your problem.

If a commit has all file modifications (no additions or deletions or copies or renames) and has exactly one ancestor and one descendant, then it may be boring. In a Subversion dump it also has to not have any property changes; in a git dump it must not be referred to by any tag or reset. Interesting commits are not boring, or have a not-boring parent or not-boring child.
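The definition of boring and interesting commits can be written down as a small Python sketch. The Commit class and its field names are hypothetical stand-ins for illustration, not the data model of reposurgeon or repocutter.

```python
from dataclasses import dataclass


@dataclass
class Commit:
    ops: list          # file operation kinds; "M" means a plain modification
    parents: list      # indices of parent commits
    children: list     # indices of child commits
    decorated: bool = False  # property change (svn) or tag/reset reference (git)


def is_boring(commit):
    """Only modifications, exactly one ancestor and descendant, no decoration."""
    return (
        all(op == "M" for op in commit.ops)
        and len(commit.parents) == 1
        and len(commit.children) == 1
        and not commit.decorated
    )


def interesting(commits):
    """Indices of commits that are not boring or touch a not-boring neighbor."""
    keep = set()
    for index, commit in enumerate(commits):
        neighbors = commit.parents + commit.children
        if not is_boring(commit) or any(
            not is_boring(commits[n]) for n in neighbors
        ):
            keep.add(index)
    return keep


if __name__ == "__main__":
    # A linear chain of six modification-only commits.
    chain = [
        Commit(["M"], [i - 1] if i > 0 else [], [i + 1] if i < 5 else [])
        for i in range(6)
    ]
    print(sorted(interesting(chain)))
```

In the six-commit chain, the root and the tip are not boring because they lack a parent or a child, their immediate neighbors are kept as interesting, and the two commits in the middle are candidates for removal.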

Try using the 'reduce' subcommand of repocutter to strip boring commits out of a Subversion dump. For a git dump, look at reposurgeon’s "strip --reduce" command.

16.3.5. Prune irrelevant branches

Try to throw away branches that are not relevant to your problem. The 'expunge' operation on repocutter or the 'branch delete' command in reposurgeon may be helpful.

This is the attempted simplification most likely to make your bug vanish, so check that carefully after each branch deletion.

16.4. Know how to spot possible importer bugs

If your target VCS’s importer dies during a rebuild, try writing the repository content to a stream instead and importing the stream by hand. If the latter does not fail, the target VCS’s importer may be slightly buggy - but you have a workaround.

(This has been observed under git 2.5.0 with the result of a 'unite' operation on two repositories. The cause is unknown, as git dies suddenly enough to not leave a crash report.)

16.5. Benchmarking

A fair amount of effort has been expended to keep the run-time performance of reposurgeon as linear as possible. This is not an easy state to stay in; it is unfortunately quite simple to accidentally regress this without noticing.

To that end, there are some fairly simple scripts in the bench directory of the source distribution that can be used to check for this type of problem. repobench runs reposurgeon multiple times with a different readlimit each time, recording the run time and memory allocated at each iteration. Supply arguments specifying the svn dump file to read and the readlimit values to use like this:

./repobench your-dump-file.svn 1000 2000 20000

This reads your-dump-file.svn 10 times, with the readlimit set first to 1,000, then 3,000, etc, stepping up until it reaches 20,000.
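The count of ten runs follows if the three numbers are read as a first value, a step, and an upper bound, in the style of seq(1). That reading is an assumption drawn from the description above, not from the script itself. A quick check of the arithmetic in Python:

```python
def readlimits(start: int, step: int, stop: int) -> list:
    """Readlimit values from start, increasing by step, never exceeding stop."""
    return list(range(start, stop + 1, step))


limits = readlimits(1000, 2000, 20000)
print(len(limits), limits[0], limits[1], limits[-1])
```

Under this reading the last readlimit actually used is 19,000, the largest value in the sequence that does not exceed 20,000.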

This produces a .dat file, which you can feed to repobench -p or repobench -o to produce graphs.

For an example, see oops.svg. This shows a graph made using a good revision that had linear performance, several made with revisions that introduced a regression that made performance quite non-linear, and the fix. You can easily tell the difference visually.

16.6. Incompatible language changes

Reposurgeon scripts are effectively never reused. Thus, incompatible changes to the command language don’t have a high cost in pain to users, and the maintainers feel free to make them whenever improvement seems possible. But just in case, such changes are recorded here.

The blob command now takes an explicit mark, rather than creating a new blob :1 and renumbering all others as in 4.23 and earlier.

Filter syntax now uses subcommand verbs rather than the options of 4.23 and earlier.

In versions 4.23 and earlier, the stats command operated on named repositories rather than the currently-selected one.

In versions 4.23 and earlier, it was possible to redirect output from msgin to capture a mailbox of changed entries. This feature has been removed; msgin now sets =Q bits which can be used to generate the same report.

In versions 4.23 and earlier, the "--branchify" read option was a separate command that needed to be invoked before the read and that set hidden global context. There was a related branchmap command that has been retired because branch rename covers all its cases in a simpler way.

In versions 4.23 and earlier, the behavior of the branch command was incompatibly different; it did not require a "heads/" or "tags/" prefix on its operand and, because of that, could only operate on heads/ branches and not lightweight tags.

In versions 4.23 and earlier, several commands that now have the form "object verb selection" had the form "object selection verb". This includes branch, tag, reset, and path.

In versions 4.23 and earlier, the syntax of the expunge command was different; it used "~" instead of "--not" and took multiple patterns.

The "paths sup" and "paths sub" commands of versions 4.23 and earlier have been retired and replaced by the enhanced path rename command.

In versions before 4.10, the "reduce" and "blob" options of the "strip" command were bare keywords. Also the options of the "ignores" command were bare keywords. There was a command to set the prompt string that has been retired.

In versions before 4.8, the expunge command run on a repository named "foo" tried to keep deleted fileops in a new synthetic repository named "foo-expunges". This feature has been replaced by the "~" negation operator on expunge selections.

In versions before 4.1, the index command did not see blobs by default.

In versions before 4.0, msgin and msgout were named "mailbox_in" and "mailbox_out"; --branchify was "branchify_map". Previous versions used the Python variant of regular expressions; some of the more idiosyncratic features of these are not replicated in the Go implementation.

In versions before 3.23, ‘prefer’ changed the repository type as well as the preferred output format. Since then, do this with ‘sourcetype’.

In versions before 3.0, the general command syntax put the command verb first, then the selection set (if any) then modifiers (VSO). It has changed to optional selection set first, then command verb, then modifiers (SVO). The change made parsing simpler, allowed abolishing some noise keywords, and recapitulates a successful design pattern in some other Unix tools - notably sed(1).

In versions before 3.0, path expressions only matched commits, not commits and the associated blobs as well. The names of the "a" and "c" flags were different.

In reposurgeon versions before 3.0, the delete command had the semantics of squash; also, the policy flags did not require a ‘--’ prefix. The ‘--delete’ flag was named "obliterate".

In reposurgeon versions before 3.0, read and write optionally took file arguments rather than requiring redirects (and the write command never wrote into directories). This was changed in order to allow these commands to have modifiers. These modifiers replaced several global options that no longer exist.

In reposurgeon versions before 3.0, the earliest factor in a unite command always kept its tag and branch names unaltered. The new rule for resolving name conflicts, giving priority to the latest factor, produces more natural behavior when uniting two repositories end to end; the master branch of the second (later) one keeps its name.

In reposurgeon versions before 3.0, the tagify command expected policies as trailing arguments to alter its behavior. The new syntax uses similarly named options with leading dashes, that can appear anywhere after the tagify command.

In versions before 2.9, the syntax of authors, legacy, list, and what are now msgin and msgout was different (and legacy was named fossils). They took plain filename arguments rather than using redirect < and >.

In versions so old that the changeover point is now lost in the mists of time, curly brackets (not parens) performed subexpression grouping.

16.7. Emergency help

If you need emergency help, go to the #reposurgeon IRC channel on freenode. Be aware, however, that the maintainer is too busy to babysit difficult repository conversions unless he has explicitly volunteered for one or someone is paying him to care about it. For explanation, see Your money or your spec.

17. Stream syntax extensions

The event-stream parser in reposurgeon supports some extended syntax. Exporters designed to work with reposurgeon may have a --reposurgeon option that enables emission of extended syntax; notably, this is true of cvs-fast-export(1). The remainder of this section describes these syntax extensions. The properties they set are (usually) preserved and re-output when the stream file is written.

The token ‘#reposurgeon’ at the start of a comment line in a fast-import stream signals reposurgeon that the remainder is an extension command to be interpreted by reposurgeon.

One such extension command is implemented: ‘sourcetype’, which behaves identically to the reposurgeon sourcetype command. An exporter for a version-control system named "frobozz" could, for example, say

#reposurgeon sourcetype frobozz
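A reader that wants to honor such lines only needs to check the first token of a comment. The following is a minimal sketch of that check, not reposurgeon’s own parser; the function name is invented for illustration.

```python
def parse_extension(line: str):
    """Return (command, args) for a '#reposurgeon' comment line, else None."""
    tokens = line.split()
    if not line.startswith("#") or len(tokens) < 2 or tokens[0] != "#reposurgeon":
        return None       # ordinary comment, or not a comment at all
    return tokens[1], tokens[2:]


print(parse_extension("#reposurgeon sourcetype frobozz"))
print(parse_extension("# an ordinary comment"))
```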

Within a commit, a magic comment of the form ‘#legacy-id’ declares a legacy ID from the stream file’s source version-control system.

Also accepted is the bzr syntax for setting per-commit properties. While parsing commit syntax, a line beginning with the token ‘property’ must continue with a whitespace-separated property-name token. If it is then followed by a newline it is taken to set that boolean-valued property to true. Otherwise it must be followed by a numeric token specifying a data length, a space, following data (which may contain newlines) and a terminating newline. For example:

commit refs/heads/master
mark :1
committer Eric S. Raymond <esr@thyrsus.com> 1289147634 -0500
data 16
Example commit.

property legacy-id 2 r1
M 644 inline README

Unlike other extensions, bzr properties are only preserved on stream output if the preferred type is bzr, because any importer other than bzr’s will choke on them.
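The two forms of the property line described above (boolean, and length-prefixed data) can be parsed as follows. This is a sketch under stated assumptions: tokens are separated by single spaces, and lengths are counted in characters rather than bytes, which is only equivalent for ASCII data. The property names "frozen" and "note" in the usage lines are hypothetical.

```python
def parse_property(stream: str):
    """Parse one 'property' line at the start of stream.

    Returns (name, value, rest). value is True for a boolean property;
    rest is whatever follows the property in the stream.
    """
    head, _, tail = stream.partition("\n")
    fields = head.split(" ", 3)
    if fields[0] != "property" or len(fields) < 2:
        raise ValueError("not a property line")
    name = fields[1]
    if len(fields) == 2:
        return name, True, tail            # bare name: boolean true
    length = int(fields[2])
    # The data begins right after "property NAME LEN " and may span lines.
    start = len("property ") + len(name) + 1 + len(fields[2]) + 1
    end = start + length
    if stream[end:end + 1] != "\n":
        raise ValueError("missing terminating newline")
    return name, stream[start:end], stream[end + 1:]


print(parse_property("property legacy-id 2 r1\nM 644 inline README\n"))
print(parse_property("property frozen\n"))
print(parse_property("property note 5 a\nb c\n"))
```

Because the length is explicit, the parser never has to guess where data containing newlines ends.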

18. Limitations and guarantees

Guarantee: In DVCSes that use commit hashes, editing with reposurgeon never changes the hash of a commit object unless (a) you edit the commit, or (b) it is a descendant of an edited commit in a VCS that includes parent hashes in the input of a child object’s hash (git and hg both do this).
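To see why condition (b) follows, consider a toy model in which a commit’s ID is a hash over its message and its parents’ IDs. This is a sketch, not the actual object format of git or hg, and the function names are made up for illustration.

```python
import hashlib


def commit_hash(message: str, parent_hashes: list) -> str:
    """Toy commit ID: SHA-1 over the parents' IDs followed by the message."""
    h = hashlib.sha1()
    for parent in parent_hashes:
        h.update(b"parent " + parent.encode() + b"\n")
    h.update(message.encode())
    return h.hexdigest()


def chain_hashes(messages: list) -> list:
    """IDs for a linear history, oldest commit first."""
    hashes = []
    for message in messages:
        parents = hashes[-1:]              # empty for the root commit
        hashes.append(commit_hash(message, parents))
    return hashes


before = chain_hashes(["first", "second", "third", "fourth"])
after = chain_hashes(["first", "second", "third, edited", "fourth"])
print([a == b for a, b in zip(before, after)])
```

Editing the third commit leaves the first two IDs alone, changes the third, and forces a change in the fourth even though its own content is untouched.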

Guarantee: reposurgeon only requires main memory proportional to the size of a repository’s metadata history, not its entire content history. (Exception: the data from inline content is held in memory.)

Guarantee: In the worst case, reposurgeon makes its own copy of every content blob in the repository’s history and thus uses intermediate disk space approximately equal to the size of a repository’s content history. However, when the repository to be edited is presented as a stream file, reposurgeon requires no or only very little extra disk space to represent it; the internal representation of content blobs is a (seek-offset, length) pair pointing into the stream file.
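The (seek-offset, length) representation can be illustrated with a small sketch. It is not reposurgeon’s implementation: the class and function names are invented, only the counted "data N" form is handled, and a real reader has many more cases to consider.

```python
import io


class BlobRef:
    """A content blob held as a (seek-offset, length) pair into a stream file."""

    def __init__(self, stream, offset: int, length: int):
        self.stream, self.offset, self.length = stream, offset, length

    def content(self) -> bytes:
        """Fetch the payload on demand instead of keeping it in memory."""
        self.stream.seek(self.offset)
        return self.stream.read(self.length)


def index_blobs(stream) -> list:
    """Scan a seekable binary stream, recording where each blob payload lives."""
    refs, in_blob = [], False
    while True:
        line = stream.readline()
        if not line:
            break
        if line == b"blob\n":
            in_blob = True
        elif line.startswith(b"data "):
            length = int(line[5:])
            if in_blob:
                refs.append(BlobRef(stream, stream.tell(), length))
            stream.seek(length, io.SEEK_CUR)   # skip the payload without copying it
            in_blob = False
    return refs


demo = io.BytesIO(b"blob\nmark :1\ndata 6\nhello\n\n"
                  b"commit refs/heads/master\nmark :2\n"
                  b"committer A <a@b> 0 +0000\ndata 3\nhi\nM 644 :1 README\n")
ref = index_blobs(demo)[0]
print(ref.offset, ref.length, ref.content())
```

Only two integers per blob stay in memory, which is why a stream-file input needs little or no extra disk space.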

Guarantee: reposurgeon never modifies the contents of a repository it reads, nor deletes any repository. The results of surgery are always expressed in a new repository.

Guarantee: Any line in a fast-import stream that is not a part of a command reposurgeon parses and understands will be passed through unaltered. At present the set of potential passthroughs is known to include the progress, options, and checkpoint commands as well as comments led by #.

Guarantee: All reposurgeon operations either preserve all repository state they are not explicitly told to modify or warn you when they cannot do so.

Guarantee: reposurgeon handles the bzr commit-properties extension, correctly passing through property items including those with embedded newlines. (Such properties are also editable in the message-box format.)

Limitation: Because reposurgeon relies on other programs to generate and interpret the fast-import command stream, it is subject to bugs in those programs.

Limitation: bzr suffers from deep confusion over whether its unit of work is a repository or a floating branch that might have been cloned from a repo or created from scratch, and might or might not be destined to be merged to a repo one day. Its exporter only works on branches, but its importer creates repos. Thus, a rebuild operation will produce a subdirectory structure that differs from what you expect. Look for your content under the subdirectory ‘trunk’.

Limitation: under git, signed tags are imported verbatim. However, any operation that modifies any commit upstream of the target of the tag will invalidate it.

Limitation: Stock git (at least as of version 1.7.3.2) will choke on property extension commands. Accordingly, reposurgeon omits them when rebuilding a repo with git type.

Limitation: Converting an hg repo that uses bookmarks (not branches) to git can lose information; the branch ref that git assigns to each commit may not be the same as the hg bookmark that was active when the commit was originally made under hg. Unfortunately, this is a real ontological mismatch, not a problem that can be fixed by cleverness in reposurgeon.

Limitation: Converting an hg repo that uses branches to git can lose information because git does not store an explicit branch as part of commit metadata, but colors commits with branch or tag names on the fly using a specific coloring algorithm, which might not match the explicit branch assignments to commits in the original hg repo. Reposurgeon preserves the hg branch information when reading an hg repo, so it is available from within reposurgeon itself, but there is no way to preserve it if the repo is written to git.

Limitation: Not all BitKeeper versions have the fast-import and fast-export commands that reposurgeon requires. They are present back to the 7.3 opensource version.

Limitation: reposurgeon may misbehave under a filesystem which smashes case in filenames, or which nominally preserves case but maps names differing only by case to the same filesystem node (Mac OS X behaves like this by default). Problems will arise if any two paths in a repo differ by case only. To avoid the problem on a Mac, do all your surgery on an HFS+ file system formatted with case sensitivity specifically enabled.
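Before working on such a filesystem, it may be worth checking whether a repository contains any paths that differ by case only. Here is a minimal sketch; it uses Python’s `str.casefold` as an approximation, and a particular filesystem’s folding rules may differ in detail.

```python
from collections import defaultdict


def case_collisions(paths: list) -> list:
    """Group paths that differ only by case."""
    groups = defaultdict(set)
    for path in paths:
        groups[path.casefold()].add(path)
    return [sorted(group) for group in groups.values() if len(group) > 1]


print(case_collisions(["README", "src/Main.c", "src/main.c", "Makefile"]))
```

An empty result means no two paths in the list would collide on a case-smashing filesystem.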

Limitation: If whitespace followed by # appears in a string or regexp command argument, it will be misinterpreted as the beginning of a line-ending comment and screw up parsing.

Guarantee: As version-control systems add support for the fast-import format, their repositories will become editable by reposurgeon.

19. Credits

These are in roughly descending magnitude.

Eric S. Raymond <esr@thyrsus.com>

Designer and original author.

Julien "FrnchFrgg" RIVAUD <frnchfrgg@free.fr>

Lots of high-quality code cleanups and speed tuning. Responsible for at least half of the massively revamped Subversion dump reader on the 4.0 releases. Ported the CoW filemaps from Python to Go.

Daniel Brooks <db48x@db48x.net>

Date unit testing, improvements for split and expunge commands. Assistance on Python to Go port. Go profiling support. Several significant reductions in total run time, total allocations, and max heap usage.

Greg Hudson <ghudson@MIT.EDU>

Contributed copy-on-write filemaps, which both tremendously sped up Subversion dumpfile parsing and squashed a nasty bug in the older code. While his CoW implementation was eventually replaced with one by Julien Rivaud, it busted the project out of a nearly two-year slump.

Eric Sunshine <sunshine@sunshineco.com>

Review of seldom-used features, test improvements, bug-fixing. Generalized selection expression parser for use-cases other than events. Converted selection parser, which evaluated an expression while parsing it, to a compile/evaluate paradigm in which a selection expression can be compiled once and evaluated many times. Added 'attribution' command. Added 'reorder' command. Assisted with the Python to Go port.

Edward Cree <ec429@cantab.net>

Wrote the Hg extractor class and its test.

Ian Bruene <ianbruene@gmail.com>

Wrote the kommandant package and the Go port of Python difflib in order to support this package.

Chris Lemmons <alficles@gmail.com>

Solved some problems with inline blobs, improved interoperability with Mercurial, wrote the --prune option for graft.

Richard Hansen <rhansen@rhansen.org>

Selections as ordered rather than compulsorily sorted sets. The generalized reparent command. Improvements in regression-test infrastructure.

Peter Donis <peterdonis@alum.mit.edu>

Python 3 port and Python2/3 interoperability. Historical: none of this survived the port to Go.

Appendix A: The ontological-mismatch problem and its consequences

There are many tools for converting repositories between version-control systems out there. This appendix explains why reposurgeon is the best of breed by comparing it to the competition.

The problems other repository-translation tools have come from ontological mismatches between their source and target systems - models of changesets, branching and tagging can differ in complicated ways. While these gaps can often be bridged by careful analysis, the techniques for doing so are algorithmically complex, difficult to test, and have ugly edge cases.

Furthermore, doing a really high-quality translation often requires human judgment about how to move artifacts - and what to discard. But most lifting tools are, unlike reposurgeon, designed as run-it-once batch processors that can only implement simple and mechanical rules.

Consequently, most repository-translation tools evade the harder problems. They produce a sort of pidgin rendering that crudely and partially copies the history from the source system to the target without fully translating it into native idioms, leaving behind metadata that would take more effort to move over or leaving it in the native format for the source system.

But pidgin repository translations are a kind of friction drag on future development, and are just plain unpleasant to use. So instead of evading the hard problems, reposurgeon provides a power assist for a human to tackle them head-on.

Here are some specific symptoms of evasion that are common enough to deserve tags for later reference.

LINEAR: One very common form of evasion is only handling linear histories.

NO_IGNORES: There are many different mechanisms for ignoring files - .cvsignore, Subversion svn:ignore properties, .gitignore and their analogues. Many older Subversion repositories still have .cvsignore files in them as relics of CVS prehistory that weren’t translated when the repository was lifted. Reposurgeon, on the other hand, knows these can be changed to .gitignore files and does it.

NO_TAGS: Many repository translators cannot generate annotated tags (or their non-git equivalents) even when that would be the right abstraction in the target system.

CONFIGURATION: Another common failing is for repository-translation tools to require a lot of configuration and ceremony before they can operate. Often, for example, tools that translate from Subversion repositories require you to declare the repository’s branch structure every time even though sensible defaults and a bit of autodetection could have avoided this.

MIXEDBRANCH: Yet another case usually handled poorly (in translators that handle branching) is mixed-branch commits. In Subversion it is possible (though a bad idea) to commit a changeset that modifies multiple branches at once. All sufficiently old Subversion repositories have these, often by accident. The proper thing to do is split these up; the usual thing is to assign them to one branch and leave them omitted from the others.

Version references in commit comments. It is not uncommon to see a lot of references that are no longer usable embedded in translated repositories like fossils in geological strata - file-version numbers like '1.2' in Subversion repos that had a former life in CVS, Subversion references like 'r1234' in git repositories, and so forth. There’s no tag for this because tools other than reposurgeon generally have no support at all for lifting these.
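As an illustration of the splitting mentioned under MIXEDBRANCH above, the following sketch groups the paths touched by one Subversion changeset by the branch they belong to. It assumes the conventional trunk/branches/tags layout and is not the algorithm reposurgeon uses; the function names are invented.

```python
from collections import defaultdict


def branch_of(path: str):
    """Map a Subversion path to (branch, path within branch)."""
    parts = path.split("/")
    if parts[0] == "trunk":
        return "trunk", "/".join(parts[1:])
    if parts[0] in ("branches", "tags") and len(parts) > 1:
        return "/".join(parts[:2]), "/".join(parts[2:])
    return "", path                        # outside the standard layout


def split_mixed(changed_paths: list) -> dict:
    """Split one changeset's paths into per-branch groups."""
    per_branch = defaultdict(list)
    for path in changed_paths:
        branch, rest = branch_of(path)
        per_branch[branch].append(rest)
    return dict(per_branch)


print(split_mixed(["trunk/a.c", "branches/stable/a.c", "trunk/doc/x.txt"]))
```

A changeset that yields more than one group is a mixed-branch commit; each group would become its own commit on the corresponding branch.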

To avoid repetitive text in these descriptions, we use the following additional bug tags:

ABANDONED: Effectively abandoned by its maintainer. Some tools with this tag are still nominally maintained but have not been updated or released in years.

NO_DOCUMENTATION: Poorly (if at all) documented.

!FOO means the tool is known not to have problem FOO.

?FOO means the author has not tried the tool but has strong reason to suspect the problem is present based on other things known about it.

You should assume that none of these tools do reference-lifting.

A.1. cvs2svn

Just after the turn of the 21st century, when Subversion was the new thing in version control, most projects that were using version control were using CVS, and cvs2svn was about the only migration path.

Early cvs2svn had problems on every level, only some of which have been fixed by more recent releases. It tended to spew junk commits into the translated history, and produced strange combinations of Subversion internal operations that most later translation tools would cope with only poorly. Sometimes the resulting translations are actually malformed; more often they contain noisy commits or commit duplications that made little sense under Subversion and make even less under the new target system.

!LINEAR, ?MIXEDBRANCH, DOCUMENTATION

A.2. cvs-fast-export

Formerly named parsecvs. Originally written by Keith Packard to port the X.org repositories, which it did a good job on. Now maintained by the author of reposurgeon, which uses it to read CVS repositories. It is extremely fast and can thus be productively used even on huge repositories.

!ABANDONED, !LINEAR, !NO_IGNORES, !NO_DOCUMENTATION, !CONFIGURATION

A.3. cvsps

Don’t use this. Just plain don’t. The author maintained version 3.x until end-of-lifing it in favor of cvs-fast-export due to fundamental, unfixable problems. It gets branch topology wrong in ways that are difficult to detect.

A.4. git-svn

git-svn, the Subversion converter in the git distribution, is really designed to be a two-way live gateway enabling git users to push and pull commits from a Subversion server. It operates by creating a git repository that is effectively a local mirror of the Subversion history, then performing Subversion client commands to synchronize the two in a git-like way.

This choice of mission means that git-svn’s translation of history into git uses a compromise between Subversion idioms and git ones that is more designed to make transactions back to the Subversion server easy and safe to generate than it is to make full use of the git capabilities that Subversion doesn’t have. This is pidgin translation for a reason better than laziness or failure of nerve, but it’s still pidgin.

Worse, git-svn has bugs that severely compromise it for full translations. It tends to stumble over common repository malformations in Subversion, producing history damage that is significant but evades superficial scrutiny. The author has written about this problem in detail at Don’t do svn-to-git repository conversions with git-svn!

For a straight linear history with no tags or branches, the difference between git-svn’s Subversion-emulating behavior and the way a git repository would most naturally be structured is minimal. But for conformability with Subversion, git-svn cannot (practically speaking) use git’s annotated-tag facility in the local mirror; instead, Subversion tags have to be represented in the local mirror as git branches even if they have no changes after the initial branch copy.

Another thing the live-gatewaying use case prevents is reference-lifting. Subversion references like "r1234" in commit comments have to be left as-is to avoid creating pain for users of the same Subversion remote not going through git-svn.

git-svn was used by Google Code’s exporter and is used by GitHub’s importer web service. Depending on the latter is not recommended.

!ABANDONED, MIXEDBRANCH, NO_TAGS, NO_IGNORES.

A.5. git-svnimport

Formerly part of the git suite; what they had before git-svn, and inferior to it. Among other problems, it can only handle Subversion repos with a "standard" trunk/tags/branches layout. Now deprecated.

MIXEDBRANCH, NO_TAGS, NO_IGNORES, ABANDONED.

A.6. git-svn-import

A trivial wrapper around git-svn. All the reasons not to use git-svn apply to it as well.

MIXEDBRANCH, NO_TAGS, NO_IGNORES, !ABANDONED.

A.7. svn-fe

svn-fe was a valiant effort to produce a tool that would dump a Subversion repository history as a git fast-import stream. It made it into the git contrib directory and lingers there still.

LINEAR, NO_TAGS, NO_IGNORES, ABANDONED.

A.8. Tailor

Tailor aimed to be an any-to-any repository translator.

LINEAR, ?NO_IGNORES, ABANDONED.

A.9. agito

This is a Subversion-to-git tool that was written to handle some cases that git-svn barfs on (but reposurgeon doesn’t - the reposurgeon test suite contains a case sent by agito’s author to check this). It even handles mixed-branch commits correctly.

!LINEAR, !NO_TAGS, !MIXEDBRANCH, CONFIGURATION.

If you cannot use reposurgeon for some reason, this is one of the best alternatives.

A.10. svn2git (jcoglan/nirvdrum version)

A batch-conversion wrapper around git-svn that creates real tag objects. This is the one written in Ruby. Has all of git-svn’s problems, alas.

!ABANDONED, !NO_TAGS, NO_IGNORES.

If you cannot use reposurgeon for some reason, this is another alternative that is not too horrible. But beware of possible history damage if your Subversion repo has malformations that confuse git-svn.

A.11. svn2git (Schemenauer version)

Native Python. More a proof of concept than a production tool.

LINEAR, NO_TAGS, NO_IGNORES, NO_DOCUMENTATION, ABANDONED.

A.12. svn2git (Nyblom version)

Written in C++. Says it’s based on svn-fast-export by Chris Lee. Not easy to figure out what it actually does, as there is no documentation at all and no test cases. May be genetically related to svn-all-fast-export, but if so they diverged in 2008. Notable for having a configuration language for specifying rules that guide the conversion, allowing the conversion to be incrementally developed.

As Gitorious has shut down, this repository is now available from the Software Heritage Archive.

Now apparently hosted on GitHub. This version has been extended and modified somewhat, and some documentation has been added.

A later version turned up in FreeBSD’s SVN repository. The git history has been removed from this version, but the SVN log indicates that it is a fork of the GitHub repository.

Both the FreeBSD repository and the kde-ruleset repository on GitHub contain rules used by actual conversions, making them a useful source of inspiration for your own.

CONFIGURATION, NO_DOCUMENTATION.

A.13. svn-fast-export

Written in C. More a proof of concept than a production tool.

LINEAR, NO_TAGS, NO_IGNORES, NO_DOCUMENTATION, ABANDONED.

A.14. svn-dump-fast-export

Written in C. Documentation is so lacking that there isn’t even a README. However, it’s possible to deduce what isn’t there by reading the code.

LINEAR, NO_TAGS, NO_IGNORES, NO_DOCUMENTATION.

A.15. svn-all-fast-export

May be genetically related to the Nyblom svn2git, but if so they diverged in 2008.

LINEAR, NO_TAGS, NO_IGNORES, NO_DOCUMENTATION, ABANDONED.

A.16. SubGit

Nearly unique for this category of software in being closed-source. Beyond an evaluation period, users have to register, possibly for a cost (it’s supposed to be free-of-charge for certain uses: open source projects, education, and "startups"; history with BitKeeper shows that such exemptions should probably not be trusted).

The intended outcome of this program is to provide a server with support for both Subversion and Git users to interact at once. This may be of little value overall: new developers are frequently unfamiliar with Subversion (and old ones forget the usage patterns!), fundamental differences in the design of the two VCSes interfere with the quality of both views, and confusion about the preferred mode of contribution increases.

The quality of SubGit’s conversion is rather poor. It fails to properly translate at least half of the reposurgeon *.svn regression tests, even some of the simpler ones, although it does correctly translate trickier cases such as agito.svn. Large real-world Subversion repos will exhibit multiple issues that SubGit may, silently or otherwise, trip over.

This program will forever contain compromises for the same reasons git-svn does. The non-open source nature leaves little hope of having such issues repaired by skilled community members.

Atlassian’s BitBucket service relies on this for Subversion-to-Git migration. Depending on this service is not recommended.

!MIXEDBRANCH, !LINEAR, CONFIGURATION, DOCUMENTATION

Appendix B: A tour of the codebase

Reposurgeon is intended to be hackable to add support for special-purpose or custom operations, though it’s even better if you can come up with a new surgical primitive general enough to ship with the stock version. For either case, here’s a guide to the architecture.

B.1. inner.go

The core classes in inner.go support deserializing and reserializing import streams. In between these two operations the repo state lives in a fairly simple object, Repository. The main part of Repository is just a list of objects implementing the Event interface - Commits, Blobs, Tags, Resets, and Passthroughs. These are straightforward representations of the command types in a Git import stream, with Passthrough as a way of losslessly conveying lines the parser does not recognize.

 +-------------+    +---------+    +-------------+
 | Deserialize |--->| Operate |--->| Reserialize |
 +-------------+    +---------+    +-------------+

The general theory of reposurgeon is: you deserialize, you do stuff to the event list that preserves correctness invariants, you reserialize. The "do stuff" is mostly not in the core classes, but there is one major exception. The primitive to delete a commit and squash its fileops forwards or backwards is seriously intertwined with the core classes and actually makes up almost 50% of Repository by line count.
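The deserialize/operate/reserialize cycle can be sketched in a few lines. The real code is Go and far richer; this Python toy knows only one event type, treats every other line as a Passthrough, assumes each blob carries a mark and a counted data section, and measures lengths in characters rather than bytes. All names are illustrative.

```python
from dataclasses import dataclass


@dataclass
class Blob:
    mark: str
    data: str

    def dump(self) -> str:
        return f"blob\nmark {self.mark}\ndata {len(self.data)}\n{self.data}"


@dataclass
class Passthrough:
    text: str                      # a line the parser does not interpret

    def dump(self) -> str:
        return self.text


def deserialize(text: str) -> list:
    """Turn a newline-terminated stream into a list of events."""
    events, pos = [], 0
    while pos < len(text):
        end = text.index("\n", pos) + 1
        if text[pos:end] == "blob\n":
            mark_end = text.index("\n", end) + 1
            mark = text[end:mark_end].split()[1]
            data_end = text.index("\n", mark_end) + 1
            length = int(text[mark_end:data_end].split()[1])
            events.append(Blob(mark, text[data_end:data_end + length]))
            pos = data_end + length
        else:
            events.append(Passthrough(text[pos:end]))
            pos = end
    return events


def operate(events: list) -> None:
    """Example surgery: upper-case every blob, leave everything else alone."""
    for event in events:
        if isinstance(event, Blob):
            event.data = event.data.upper()


def reserialize(events: list) -> str:
    return "".join(event.dump() for event in events)


stream = "blob\nmark :1\ndata 6\nhello\nprogress halfway\n# a comment\n"
events = deserialize(stream)
operate(events)
print(reserialize(events))
```

Without the operate step the output is identical to the input, which is the invariant the Passthrough type exists to protect.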

The rest of the surgical code lives outside the core classes. Most of it lives in the RepoSurgeon class (the command interpreter) or the RepositoryList class (which encapsulates by-name access to a list of repositories and also hosts surgical operations involving multiple repositories). A few bits, like the repository reader and builder, have enough logic that’s independent of these classes to be factored out of them.

In designing new commands for the interpreter, try hard to keep them orthogonal to the selection-set code. As often as possible, commands should all have a similar form with a (single) selection set argument.

VCS is not a core class. The code for manipulating actual repos is bolted onto the ends of the pipeline, like this:

 +--------+    +-------------+    +---------+    +-----------+    +--------+
 | Import |--->| Deserialize |--->| Operate |--->| Serialize |--->| Export |
 +--------+    +-------------+ A  +---------+    +-----------+    +--------+
      +-----------+            |
      | Extractor |------------+
      +-----------+

The Import and Export boxes call methods in VCS.

B.2. extractor.go

Extractor classes build the deserialized internal representation directly. Each extractor class is a set of VCS-specific methods to be used by the RepoStreamer driver class. Key detail: when a repository is recognized by an extractor it sets the repository type to point to the corresponding VCS instance.

B.3. reposurgeon.go

All code that knows about the DSL syntax should live in reposurgeon.go along with the program main and the functions for reporting errors, logging, handling signals and aborts, etc.

B.4. svnread.go

This is the reader for Subversion dumpfiles. It is the only exception to the rule that read support for version control systems is implemented by front ends that read them and emit a fast-import stream.

The reason it’s an exception is that Subversion has its own serialization format, and the total complexity of embedding support for those streams was estimated to be lower than that of writing a completely separate front end.

B.5. Style notes

The code was translated from Python. It retains, for internal documentation purposes, the Python convention of using leading underscores on field names to flag fields that should never be referenced outside a method of the associated struct.

The capitalization of other fieldnames looks inconsistent because the code tries to retain the lowercase Python names and compartmentalize as much as possible to be visible only within the declaring package. Some fields are capitalized for backwards compatibility with the setfield command in the Python implementation, others (like some members of FileOp) because there’s an internal requirement that they be settable by the Go reflection primitives.

Appendix C: Adding support for more version-control systems

The best way to add support for a version-control system not already on the list is to write a pair of foo-fast-export and foo-fast-import utilities (separate from reposurgeon) that generate and consume git fast-import streams. When this is achievable, it enables full read/write support for repositories of that type. In this case the supporting changes in reposurgeon will be trivial, just a pair of table entries.

The next best route is to write a FooExtractor class in reposurgeon itself. This is less good because (a) it provides only read-side support, and (b) it adds complexity to reposurgeon. There’s also a filter derived from testing requirements; a command-line client of your FooVCS must be freely available running under Unix in order for the reposurgeon maintainers to run tests on it. We are not willing to ship features we can’t test.

Finally, if your VCS supports a native serialization format that it can use as a dump/restore for live repositories, and has one or both of a pair of foo-dump/foo-load utilities analogous to git-fast-export and git-fast-import, it may be possible to support your FooVCS through that format. Subversion’s svnadmin dump/load commands fit this pattern.

In this case it is still best to try to write filters that interconvert between the native serialization and git-fast-import streams, separately from reposurgeon. This makes the testing problem more tractable, and means that reposurgeon itself needs only a couple of additional table entries calling simple pipelines.

As a last resort, the reposurgeon maintainers may consider adding support for reading and writing a native serialization format to reposurgeon itself. So far this has only been done once, for Subversion, and there is an important precondition; the serialization format must have complete public documentation.

Be aware that proprietary VCSes in general are likely to cause us serious testing problems and we are reluctant to try to support them. If a maintainer has to pay money to have binaries he or she can run tests with, you will have to pay a maintainer money to make that happen.

It’s also basically a crash landing if your FooVCS can only be accessed through a GUI, or its clients only run on Windows, or it has a CLI that is not capable enough to support an extractor class. We know of cases where proprietary VCS vendors have deliberately crippled their export and CLI features in order to lock customers in; that is no fun to deal with, so you’ll have to pay somebody money.

Appendix D: Reposurgeon success stories

Reposurgeon has been used for successful conversion on projects including but not limited to the following. These are in rough chronological order.

Hercules (IBM mainframe emulator)

The author did this one, Subversion to hg. About ten years of history at the time, not too horribly messy.

NUT (Network UPS Tools)

The author did this one, Subversion to git. The trial by fire - it was when the Subversion dump analyzer got built. Very large old repository with lots of pathologies (there was a CVS stratum).

Battle For Wesnoth

The author did this one, Subversion to git. Very large repo, moderately complex.

Roundup (issue tracker)

The author did this one, Subversion to git (they later switched to hg). Moderate-sized Subversion repo with some very strange malformations.

robotfindskitten

The author did this one, CVS to git. Simple history, pretty easy.

Blender

Two guys at Blender did this one with help from the author, Subversion to git. Huge repository with a lot of nasty pathologies. The tool needed some serious optimization and feature upgrades to handle it.

groff

The author did this one, CVS to git. Rather easy as the project history was almost linear and, though very old, not huge.

Nethack

CVS to git. This conversion has not yet been publicly released at time of writing (late October 2014) for complicated political reasons.

Emacs

A record three layers, Bazaar over CVS over RCS. Malformations not too bad except for some unique challenges created by the RCS-to-CVS conversion, but the sheer size of the history and number of layers makes it the most complex conversion yet. Converted in 2011.

ntp

The author did this, BitKeeper to git using a derivative of Tridge’s SourcePuller as a front end, done in early 2015. Nothing especially taxing about the reposurgeon side of things; the magic was all in the front end.

pdfrw, playtag, pyeda, rson

Four small Subversion projects by Patrick Maupin, converted in two hours' work in May 2015. No significant difficulties. These mainly served to demonstrate that the standard conversion workflow in conversion.mk is fast and effective for a wide range of projects.

mh-e

The Emacs interface for MH. Converted by Bill Wohler in late 2015. He reports that the standard conversion workflow worked fine.

GNUPLOT

CVS to git, 30 years of history with some early releases recovered from tarballs. Converted by the author in late 2017. Somewhat messy due to vendor-branch issues.

GCC

SVN to git, with ancient strata of CVS and RCS. 280K commits of history back to 1987, dwarfing Emacs. Converted by the author and two core GCC developers. The 4.0 release came out of this. Final cutover was on Jan 12th 2020.

Here are some other field reports on successful uses:

Appendix E: Development History

Links to notable blog posts during the development of reposurgeon. Trivial release announcements have been omitted.

Cometary Contributors (2016-01-10)

30 Days in the Hole (2020-01-24)

Two graceful finishes (2020-05-13)