The purpose of reposurgeon is to enable risky operations that version-control systems don't want to let you do, such as (a) editing past comments and metadata, (b) excising commits, (c) coalescing and splitting commits, (d) removing files and subtrees from repo history, (e) merging or grafting two or more repos, and (f) cutting a repo in two by cutting a parent-child link, preserving the branch structure of both child repos.
The original motivation for reposurgeon was to clean up artifacts created by repository conversions, and it has some special-purpose commands for this use. It was foreseen that the tool would also have applications when code needs to be removed from repositories for legal or policy reasons.
To keep reposurgeon simple and flexible, it does not do its own repository reading and writing. Instead, it relies on being able to parse and emit the command streams created by git-fast-export and read by git-fast-import. This means that it can be used on any version-control system that has both fast-export and fast-import utilities. At the time of writing this set includes git itself, hg, and bzr. The git-import stream format also implicitly defines a common language of primitive operations for reposurgeon to speak.
reposurgeon is a sharp enough tool to cut you. It takes care not to ever write a repository in an actually inconsistent state, and will terminate with an error message rather than proceed when its internal data structures are confused. However, there are lots of things you can do with it - like altering stored commit timestamps so they no longer match the commit sequence - that are likely to cause havoc after you're done. Proceed with caution and check your work.
Also note that, if your DVCS does the usual thing of making commit IDs a cryptographic hash of content and parent links, editing a publicly-accessible repository with this tool would be a bad idea. All of the surgical operations in reposurgeon will modify the hash chains, meaning others will become unable to pull from or push to the repo.
Please also see the notes on system-specific issues under the section called “LIMITATIONS AND GUARANTEES”.
The program can be run in one of two modes, either as an interactive command interpreter or in batch mode to execute commands given as arguments on the reposurgeon invocation line. The only differences between these modes are (1) the interactive one begins by turning on the 'verbose 1' option, and (2) in batch mode all errors (including normally recoverable errors in selection-set syntax) are fatal. Also, in interactive mode, Ctrl-P and Ctrl-N will be available to scroll through your command history and tab completion of command keywords is available.
A git-fast-import stream consists of a sequence of commands which must be executed in the specified sequence to build the repo; to avoid confusion with reposurgeon commands we will refer to the stream commands as events in this documentation. These events are implicitly numbered from 1 upwards. Most commands require specifying a selection of event sequence numbers so reposurgeon will know which events to modify or delete.
Commands to reposurgeon consist of a command keyword, sometimes followed by a selection set, sometimes followed by whitespace-separated arguments. It is often possible to omit the selection-set argument and have it default to something reasonable. When the command descriptions refer to a 'second' argument, it may actually be first after the keyword with the selection set omitted.
Here are some motivating examples. The commands will be explained in more detail after the description of selection syntax.
edit :15 ;; edit the object associated with mark :15
edit ;; edit all editable objects
list 29..71 ;; list summary index of events 29..71
list /regression/ ;; list all commits and tags with comments or
;; committer headers or author headers containing
;; the string "regression"
delete =T & 1..:97 ;; delete tags from event 1 to mark 97
inspect [Makefile] ;; Inspect all commits with a file op touching Makefile
list (master) ;; List commits on the 'master' branch
The selection-set specification syntax is an expression-oriented minilanguage. The most basic term in this language is a location. The following sorts of primitive locations are supported:
A plain numeric literal is interpreted as a 1-origin event-sequence number.
A numeric literal preceded by a colon is interpreted as a mark; see the import stream format documentation for explanation of the semantics of marks.
The basename of a branch refers to its tip commit. The name of a tag is equivalent to its mark, and through that to a commit. Tag and branch locations are bracketed with < > (angle brackets) to distinguish them from command keywords. (But also see the discussion of branch sets and the & operator a bit further on; also, in older versions of this tool tag and branch locations were prefixed with '@' rather than bracketed. The change fixed a parsing ambiguity.)
These may be grouped into sets in the following ways:
A range is two locations separated by "..", and is the set of events beginning at the left-hand location and ending at the right-hand location (inclusive).
Comma-separated lists of locations and ranges are accepted, with the obvious meaning.
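The way ranges and comma-separated lists expand can be illustrated with a short Python sketch. This is not reposurgeon's own parser: the function name resolve_selection and the marks dictionary are hypothetical, and only numeric and mark locations are handled (tag and branch locations are left out).

```python
def resolve_selection(spec, marks):
    """Expand a comma-separated list of locations and ranges.

    `marks` is a hypothetical dict mapping mark strings such as ':97'
    to 1-origin event numbers. Returns a sorted list of event numbers.
    """
    def locate(token):
        token = token.strip()
        if token.startswith(":"):
            return marks[token]      # a mark resolves through the mark table
        return int(token)            # a plain literal is an event number

    selected = set()
    for part in spec.split(","):
        if ".." in part:
            left, right = part.split("..", 1)
            # Ranges include both endpoints.
            selected.update(range(locate(left), locate(right) + 1))
        else:
            selected.add(locate(part))
    return sorted(selected)
```

For instance, under these assumptions "1..3,7" expands to events 1, 2, 3 and 7.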
There are some other ways to construct event sets:
A visibility set is an expression specifying a set of event types. It will consist of a leading equal sign, followed by type letters. These are the type letters:
| B | blobs | Most default selection sets exclude blobs; they have to be manipulated through the commits they are attached to. |
| C | commits | |
| H | head (branch tip) commits | |
| T | tags | |
| R | resets | |
| P | passthrough | All event types simply passed through, including comments, progress commands, and checkpoint commands. |
A branch name (bracketed by parentheses) resolves to the set of all commits on that branch, plus any tags pointed at them and any associated branch resets.
A text search expression is a Python regular expression surrounded by forward slashes (to embed a forward slash in it, use a Python string escape such as \x2f).
A text search matches against the comment fields of commits and annotated tags, or against their author/committer headers. It matches against the text of passthrough objects.
A path name enclosed in [] resolves to the set of all commits with a fileop that touches that file path - modifies that change it, deletes that remove it, renames and copies that have it as a source or target.
Set expressions may be combined with the operators | and &; these are, respectively, set union and intersection. The | has lower precedence than intersection, but you may use curly brackets '{' and '}' to group expressions in case there is ambiguity.
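Python's built-in sets happen to follow the same precedence rule (& binds tighter than |), so the effect of grouping can be tried out directly. The event numbers below are made up for illustration, and the selection expressions in the comments are only an informal rendering of what each Python set stands for.

```python
tags = {3, 9, 14}             # stands in for =T
early = set(range(1, 11))     # stands in for 1..10
on_master = {2, 3, 5, 14}     # stands in for (master)

# Without grouping, the intersection is evaluated first:
# on_master | (tags & early)
ungrouped = on_master | tags & early

# Explicit grouping, analogous to {(master) | =T} & 1..10
grouped = (on_master | tags) & early
```

The two results differ: the ungrouped form keeps event 14 from the branch set, while the grouped form restricts everything to the first ten events.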
Finally, any set operation may be followed by '?' to add the set members' neighbors and referents. This extends the set to include the parents and children of all commits in the set, and the referents of any tags and resets in the set. Each blob reference in the set is replaced by all commits that refer to it.
reposurgeon can hold multiple repository states in core. Each has a name. At any given time, one may be selected for editing. Commands in this group import repositories, export them, and manipulate the in-core list and the selection.
With a directory-name argument, this command attempts to read in the contents of a repository in any supported version-control system under that directory; read with no arguments does this in the current directory. If the argument is the name of a plain file, it will be read in as a fast-import stream. With an argument of “-”, this command reads a fast-import stream from standard input (this will be useful in filters constructed with command-line arguments).
If the read location is a git repository and contains a
.git/cvsauthors file (such as is left in place
by git cvsimport -A) that file will be read in as
if it had been given to the authormap read.
If the read location is a directory, and its repository
subdirectory has a file named fossils, that file
will be read as though passed to a fossil read
command.
The just-read-in repo is added to the list of loaded repositories and
becomes the current one, selected for surgery. If it was read from a
plain file and the file name ends with the extension .fi,
that is removed from the load list name.
Note: this command does not take a selection set.
Dump selected events as a fast-import stream representing the edited repository; the default selection set is all events. The optional second argument tells where to dump to; standard output if argument is empty or '-' or a named file otherwise. Fails if the argument exists and is a directory or anything other than a plain file.
Property extensions will be omitted from the output if the importer for the selected repo cannot digest them.
Note: to examine small groups of commits without the progress meter, use inspect.
Choose a named repo on which to operate. The name of a repo is normally the basename of the directory or file it was loaded from, but repos loaded from standard input are "unnamed". reposurgeon will add a disambiguating suffix if there have been multiple reads from the same source.
With no argument, lists the names of the currently stored repositories and their load times. The second column is '*' for the currently selected repository, '-' for others.
Drop a repo named by the argument from reposurgeon's list, freeing the memory used for its metadata and deleting on-disk blobs. With no argument, drops the currently chosen repo.
Rename the currently chosen repo; requires an argument. Won't do it if there is already one by the new name.
reposurgeon can rebuild an altered repository in place. Because of safety measures it takes to ensure that no existing repo is ever altered or clobbered, it has to be told which untracked files should be saved and restored when the contents of the new repository are checked out.
Rebuild a repository from the state held by reposurgeon. This command does not take a selection-set argument.
The single argument, if present, specifies the target directory in which to do the rebuild; if the repository read was from a repo directory (and not a git-import stream), it defaults to that directory. If the target directory is nonempty its contents are backed up to a save directory. Files and directories on the repository's preserve list are copied back from the backup directory after repo rebuild. The default preserve list depends on the repository type, and can be displayed with the stats command.
If reposurgeon has a nonempty fossil map,
it will be written to a file named fossils
in the repository subdirectory as though by a
fossil write command.
Add (presumably untracked) files or directories to the repo's list of paths to be restored from the backup directory after a rebuild. Each argument, if any, is interpreted as a pathname. The current preserve list is displayed afterwards.
It is only necessary to use this command if your version-control system lacks a command to list files under version control. Under systems with such a command, all files that are neither beneath the repository dot directory nor under reposurgeon temporary directories are preserved automatically.
Remove (presumably untracked) files or directories from the repo's list of paths to be restored from the backup directory after a rebuild. Each argument, if any, is interpreted as a pathname. The current preserve list is displayed afterwards.
Commands in this group report information about the selected repository.
This is the main command for identifying the events you want to modify. It lists commits in the selection set by event sequence number with summary information. The first column is raw event numbers, the second a timestamp in local time, and the third the leading text of the comment. If there is a second argument, or the first is not recognized as a selection set, it will be taken as the name of the file to report to; with no argument, or an argument of '-', the report goes to standard output.
For each commit in the selection set with a branch member that contains "/tags/", list the event and the branch member. This will list the lightweight tags in the selection set.
Report size statistics and import/export method information about named repositories, or with no argument the currently chosen repository.
Dump a fast-import stream representing selected events to standard output. Just like a write, except (1) the progress meter is disabled, (2) properties are always dumped, and (3) there is an identifying header before each event dump.
These are the operations the rest of reposurgeon is designed to support.
Delete a selection set of commits. The default selection set for this command is empty.
No blob event can be deleted directly with this command; blob events are tied to associated commit or tag events and are discarded only when those are.
Deleting a tag, reset or passthrough event has no side effects.
The interesting use of this command is to delete commits. Children of a deleted commit get it removed from their parent set and its parents added.
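The reparenting rule can be sketched in a few lines of Python. This is an illustration of the rule as stated, not the tool's implementation; the function name delete_commit and the dictionary representation of the commit graph are assumptions.

```python
def delete_commit(parents, victim):
    """Return a new parent map with `victim` removed.

    `parents` maps each commit id to the ordered list of its parent ids.
    Every child of the victim loses it as a parent and inherits the
    victim's own parents in its place.
    """
    inherited = parents[victim]
    result = {}
    for commit, plist in parents.items():
        if commit == victim:
            continue
        new_plist = []
        for p in plist:
            # Substitute the victim's parents for the victim itself,
            # skipping any parent the child already has.
            replacements = inherited if p == victim else [p]
            for r in replacements:
                if r not in new_plist:
                    new_plist.append(r)
        result[commit] = new_plist
    return result
```

Deleting B from a linear history A, B, C leaves C as a direct child of A.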
Normally, when a commit is deleted, its file operation list (and any associated blob references) gets either prepended to the beginning of the operation list of each of the commit's children or appended to the operation list of each of the commit's parents. The default is to push forward, modifying children; but see the list of policy modifiers below for how to change this.
Normally, any tag pointing to a deleted commit will also be deleted. But see the list of policy modifiers below for how to change this.
Following all operation moves, every one of the altered file operation lists is reduced to a shortest normalized form. The normalized form detects various combinations of modification, deletion, and renaming and simplifies the operation sequence as much as it can without losing any information.
After canonicalization, a file op list may still end up containing multiple M operations on the same file. Normally the tool utters a warning when this occurs but does not try to resolve it.
The following modifiers change these policies:
simply discards all file ops associated with deleted commit(s).
Discard all M operations (and associated blobs) except the last.
Append fileops to parents, rather than prepending to children.
With the "tagforward" modifier, any tag on the deleted commit is pushed forward to the first child rather than being deleted.
With the "tagback" modifier, any tag on the deleted commit is pushed backward to the first parent rather than being deleted.
Under either of these policies, deleting a commit that has children does not back out the changes made by that commit, as they will still be present in the blobs attached to versions past the end of the deletion set. All a delete does when the commit has children is lose the metadata information about when and by whom those changes were actually made; after the delete any such changes will be attributed to the first undeleted children of the deleted commits. It is expected that this command will be useful mainly for removing commits mechanically generated by repository converters such as cvs2svn.
Other policies which do more to attempt to back out content changes may be added in future versions of this tool.
Attempt to partition a repo by cutting the parent-child link between two specified commits (they must be adjacent). Does not take a general selection-set argument. It is only necessary to specify the parent commit, unless it has multiple children in which case the child commit must follow.
This operation may fail if the commit graph remains connected through another path; the tool will detect this.
On success, the original repo will be dropped, and there will be no repo still chosen, but two repos will appear in the in-core list. If the repo was named 'foo', the cut segments will be named 'foo-early' and 'foo-late'. Option and feature events at the beginning of the early segment will be duplicated onto the beginning of the late one.
Expunge files from the selected portion of the repo history; the default is the entire history. The arguments to this command may be paths or Python regular expressions matching paths.
All filemodify (M) operations and delete (D) operations involving a matched file in the selected set of events are removed. Renames are followed as the tool walks forward in the selection set; each triggers a warning message. If a selected file is a copy (C) target, the copy will be deleted and a warning message issued. If a selected file is a copy source, the copy target will be added to the list of paths to be deleted and a warning issued.
After file expunges have been performed, any commits with no remaining file operations will be removed, and any tags pointing to them.
The removed commits and blobs are not discarded. They are assembled into a new repository named after the old one with the suffix "-expunges" added. Thus, this command can be used to carve a repository into sections by file path matches.
Scan the selection set for runs of commits with identical comments close to each other in time (this is a common form of scar tissue in repository up-conversions from older file-oriented version-control systems). Merge these cliques by deleting all but the last commit, in order.
The optional second argument, if present, is a maximum time separation in seconds; the default is 90 seconds.
The default selection set for this command is =C, all commits. Occasionally you may want to restrict it, for example to avoid coalescing unrelated cliques of "*** empty log message ***" commits from CVS lifts.
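The clique detection described above can be approximated as follows. This is a sketch under stated assumptions rather than the tool's algorithm: the name find_cliques is hypothetical, commits are reduced to (timestamp, comment) pairs, and the time separation is measured between consecutive commits.

```python
def find_cliques(commits, max_gap=90):
    """Group runs of adjacent commits with identical comments.

    `commits` is a list of (timestamp_seconds, comment) tuples in event
    order. Returns a list of index lists, one per run of two or more
    commits whose neighbours are at most `max_gap` seconds apart.
    """
    cliques = []
    run = [0] if commits else []
    for i in range(1, len(commits)):
        prev_time, prev_comment = commits[i - 1]
        time, comment = commits[i]
        if comment == prev_comment and abs(time - prev_time) <= max_gap:
            run.append(i)            # same clique continues
        else:
            if len(run) > 1:
                cliques.append(run)  # close a finished clique
            run = [i]
    if len(run) > 1:
        cliques.append(run)
    return cliques
```

Each returned run would then be merged by keeping its last member and deleting the others.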
Split a specified commit in two, the opposite of coalesce. The first argument is required to be a commit location; the separating keyword 'at' must follow, then an integer 1-origin index of a file operation within the commit.
The commit is copied and inserted into a new position in the event sequence, immediately following itself; the duplicate becomes the child of the original, and replaces it as parent of the original's children. Commit metadata is duplicated; the mark of the new commit is then changed, with 'bis' added as a suffix.
Finally, some file operations - starting at the one indexed by the split argument - are moved forward from the original commit into the new one. Legal indices are 2-n, where n is the number of file operations in the original commit.
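The index arithmetic is easy to get wrong by one, so here is a minimal Python sketch of it. The function name split_fileops and the use of plain strings for file operations are illustrative assumptions.

```python
def split_fileops(fileops, index):
    """Split a fileop list at a 1-origin index.

    Operations before `index` stay in the original commit; the one at
    `index` and all later ones move to the new commit. Legal indices
    run from 2 to len(fileops), so neither half is ever empty.
    """
    if not 2 <= index <= len(fileops):
        raise ValueError("split index must be in the range 2..%d" % len(fileops))
    return fileops[:index - 1], fileops[index - 1:]
```

With three operations and an index of 2, the original commit keeps the first operation and the new commit receives the other two.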
Renumber the marks in a repository, from :1 up to :<n> where <n> is the count of the last mark. Just in case an importer ever cares about mark ordering or gaps in the sequence.
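A minimal sketch of such a renumbering, assuming marks are handled as strings like ':7' and visited in event order (the function name renumber_marks is hypothetical):

```python
def renumber_marks(marks):
    """Map existing marks, in event order, to a gap-free sequence from :1."""
    mapping = {}
    for old in marks:
        if old not in mapping:
            # The next free number is one more than the marks seen so far.
            mapping[old] = ":%d" % (len(mapping) + 1)
    return mapping
```

Every reference to an old mark elsewhere in the stream would then have to be rewritten through the same mapping.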
Emit a mailbox file of messages in RFC822 format representing the contents of repository metadata. Takes a selection set; members of the set other than commits, annotated tags, and passthroughs are ignored (that is, presently, blobs and resets). If there is a second argument, or the first is not recognized as a selection set, it will be taken as the name of the file to report to; with no argument, or an argument of '-', the mailbox is written to standard output.
Accept a mailbox file of messages in RFC822 format representing the contents of the metadata in selected commits and annotated tags. Takes no selection set. If there is an argument it will be taken as the name of a mailbox file to read from; with no argument, or an argument of '-', the mailbox is read from standard input.
Users should be aware that modifying an Event-Number field will change which event the update from that message is applied to. This is unlikely to have good results.
Report the selection set of events to a tempfile as mailbox_out does, call an editor on it, and update from the result as mailbox_in does. If you do not specify an editor name as second argument, it will be taken from the $EDITOR variable in your environment.
Normally this command ignores blobs because mailbox_out does. However, if you specify a selection set consisting of a single blob, your editor will be called directly on the blob file.
The modifier 'multiline' will trim the selection set to commits that are multiline and not in summary/blank-line/details form.
Apply a time offset to all time/date stamps in the selected set. An offset argument is required; it may be in the form [+-]ss, [+-]mm:ss or [+-]hh:mm:ss. The leading sign is required to distinguish it from a selection expression.
Optionally you may also specify another argument in the form [+-]hhmm, a timezone literal to apply. To apply a timezone without an offset, use an offset literal of +0 or -0.
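The three offset forms reduce to a signed number of seconds. The following Python sketch shows one way to do the conversion; the function name parse_offset is an assumption, and the timezone literal is not handled here.

```python
import re

def parse_offset(text):
    """Convert [+-]ss, [+-]mm:ss or [+-]hh:mm:ss to signed seconds."""
    match = re.fullmatch(r"([+-])(\d+(?::\d+){0,2})", text)
    if match is None:
        raise ValueError("offset needs a leading sign: %r" % text)
    seconds = 0
    for field in match.group(2).split(":"):
        # Each further field is worth one sixtieth of the previous one.
        seconds = seconds * 60 + int(field)
    return -seconds if match.group(1) == "-" else seconds
```

An unsigned literal such as 90 is rejected, mirroring the rule that the sign is what separates an offset from a selection expression.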
Merge repositories. Name any number of loaded repositories; they will be merged into one union repo and removed from the load list. The union repo will be selected.
Before merging, the repos will be sorted by date of first commit. The oldest will keep all its branch and tag names unchanged (this rule is followed so there will always be a defined default branch). All others will have their branch and tag names suffixed with their load name. Marks will be renumbered.
The name of the new repo will be the names of all parts concatenated, separated by '+'. It will have no source directory or preferred system type.
For when merge doesn't give you enough control. The selection set must be of size 1, identifying a single commit in the currently selected repo. A following argument must be a repository name. Labels and branches in the named repo are prefixed with its name; then it is grafted to the selected one. Its root becomes a child of the specified commit. Finally the named repo is removed from the load list.
Sort events in the selected repo by timestamp, unless doing so would put a child before a parent in which case it throws an error listing the out-of-order commit pair. All parent-child relationships are preserved. Commit hashes are not modified.
May be useful after a graft operation; a sorted repo tends to display more nicely in browsing tools which exhibit the graph structure.
Does not take a selection set; always operates on the entire repository.
Takes a selection set. Without a modifier, list all paths touched by fileops in the selection set (which defaults to the entire repo). With the 'sub' modifier, take a second argument that is a directory name and prepend it to every path. With the 'sup' modifier, strip the first directory component from every path.
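The effect of the two modifiers on a list of paths can be sketched in Python. The function names are hypothetical, and leaving a path with no directory component unchanged under 'sup' is an assumption of this sketch, not documented behavior.

```python
def paths_sub(paths, directory):
    """Prepend a directory name to every path (the 'sub' modifier)."""
    prefix = directory.rstrip("/") + "/"
    return [prefix + p for p in paths]

def paths_sup(paths):
    """Strip the first directory component (the 'sup' modifier).

    Assumption: a path without any directory component is returned as is.
    """
    return [p.split("/", 1)[1] if "/" in p else p for p in paths]
```

Applied in sequence with the same directory, the two operations cancel out.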
This group of commands is meant for fixing up references in commits that are in the format of older version control systems. The general workflow is this: first, go over the comment history and change all old-fashioned commit references into machine-parseable cookies. Then, automatically turn the machine-parseable cookie into action stamps. The point of dividing the process this way is that the first part is hard for a machine to get right, while the second part is prone to errors when a human does it.
A Subversion cookie is a comment substring of the form [[SVN:ddddd]] (example: [[SVN:2355]]), with the revision deduced from git-svn metadata or matching a $Revision$ header embedded in blob data for the filename.
A CVS cookie is a comment substring of the form [[CVS:filename:revision]] (example: [[CVS:src/README:1.23]]), with the revision matching a CVS $Id$ or $Revision$ header embedded in blob data for the filename.
A mark cookie is of the form [[:dddd]] and is simply a reference to the specified mark. You may want to hand-patch this in when one of the previous forms is inconvenient.
An action stamp is an RFC3339 timestamp, followed by a '!', followed by a committer email address; it refers to a commit without being VCS-specific. Thus, instead of "commit 304a53c2" or "r2355", "2011-10-25T15:11:09Z!fred@foonly.com".
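Building an action stamp from a commit's timestamp and committer address takes only the standard library. In this sketch the function name action_stamp is an assumption, and the commit time is taken to be seconds since the Unix epoch, as in a fast-import stream.

```python
from datetime import datetime, timezone

def action_stamp(epoch_seconds, email):
    """Format an RFC3339 UTC timestamp, a '!', and a committer email."""
    when = datetime.fromtimestamp(epoch_seconds, tz=timezone.utc)
    return when.strftime("%Y-%m-%dT%H:%M:%SZ") + "!" + email
```

Converting to UTC before formatting is what makes the trailing 'Z' accurate regardless of the committer's own timezone.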
In order to support reference lifting,
reposurgeon internally builds a fossil-reference
map that associates revision identifiers in older version-control
systems with commits. The contents of this map come from three
places: (1) git-svn metadata headers (when gitsvnparse is called), (2)
$Id$ and $Revision$ headers in repository files, and (3) the
.git/cvs-revisions file created by git
cvsimport.
The workflow for lifting possible references is this: first, find possible CVS and Subversion references with the references command; then replace them with equivalent cookies; then run references lift to turn the cookies into action stamps (using the information in the fossil-reference map) without having to do the lookup by hand.
Interpret final comment lines of commits beginning with 'git-svn-id:' as git-svn metadata lines and remove them. If present, the Subversion commit ID is extracted and made the value of a per-commit property named 'svn', which becomes visible; also, the association between that Subversion revision and its commit becomes part of the fossil map.
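The metadata-line handling can be sketched as follows. This is not the tool's code: the function name strip_git_svn_id is hypothetical, and the sketch assumes the usual layout written by git-svn, namely 'git-svn-id:' followed by a URL, an '@', the revision number, and a repository UUID.

```python
import re

# Assumed layout: git-svn-id: <url>@<revision> <uuid>
GIT_SVN_ID = re.compile(r"^git-svn-id:\s+(\S+)@(\d+)\s+(\S+)$")

def strip_git_svn_id(comment):
    """Return (comment without a final git-svn-id line, revision or None)."""
    lines = comment.rstrip("\n").split("\n")
    match = GIT_SVN_ID.match(lines[-1])
    if match is None:
        return comment, None
    # Drop the metadata line and the blank line that usually precedes it.
    body = "\n".join(lines[:-1]).rstrip("\n")
    return body + "\n", match.group(2)
```

The returned revision string is what would be stored in the 'svn' property and entered in the fossil map.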
Also, change refs/remotes/svn branch names to corresponding local ones and lift tip tag commits without file operations to actual tag objects (replicating what svn2git does to clean up a repository after translation).
Finally, detect delete/modify sequences that are actually renames and turn them into renames.
The command modifier "strip" suppresses saving the 'svn' property. This option was required for older versions of this tool that did not automatically suppress writing properties to importers that couldn't handle them: it has been retained for backwards compatibility.
Also, enables writing of the fossil map as 'fossil' passthroughs when the repo is written to a stream file.
Note: in earlier versions of reposurgeon this command took a selection set of events to modify, defaulting to all events. It now always converts the entire repo, and no longer takes a selection set.
Search commit comments for strings that might be CVS- or Subversion-style revision identifiers. This will be useful when you want to replace them with equivalent cookies that can automatically be translated into VCS-independent action stamps. With the modifier 'edit', edit the set where revision IDs are found.
With the modifier "lift", attempt to resolve Subversion and CVS cookies in comments into action stamps using the fossil map. An action stamp is a timestamp/email combination uniquely identifying the commit associated with that blob, as described in the section called “TRANSLATION STYLE”.
It is not guaranteed that every such reference will be resolved, or even that any at all will be. Generally, if the repo contains git-svn headers and gitsvnparse has been called, every SVN cookie will resolve. CVS references are less likely to be resolvable.
Also, enables writing of the fossil map as 'fossil' passthroughs when the repo is written to a stream file.
Some commands automate fixing various kinds of artifacts associated with repository conversions from older systems.
Purge cruft created by cvs2svn conversions. Takes a selection-set argument, but the default of the entire selected repo is probably what you want.
This looks for dummy commits created by cvs2svn with one parent that were created within 10 seconds of their parent and contain "manufactured by cvs2svn" in the comment. For each such commit, a tag object with appropriate metadata is generated and attached.
If the commit either has no fileops at all or a fileop set consisting entirely of deletes, the commit is marked for deletion; otherwise a warning is emitted. At deletion time, if the commit was created to represent a tag, the branch attribute of the commit is given to the commit's parent and the generated tag object is re-pointed at the parent. The commit itself is unconditionally deleted.
Apply or dump author-map information for the specified selection set, defaulting to all events.
Lifts from CVS and Subversion may have only usernames local to the repository host in committer and author IDs. DVCSes want email addresses (net-wide identifiers) and complete names. To supply the map from one to the other, an authors file is expected to consist of lines each beginning with a local user ID, followed by a '=' (possibly surrounded by whitespace) followed by a full name and email address.
When an authors file is applied, email addresses in committer and author metadata for which the local ID matches between < and @ are replaced according to the mapping (this handles git-svn lifts). Alternatively, if the local ID is the entire address, this is also considered a match (this handles what git-cvsimport and cvs2git do).
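The authors-file format and the two matching rules can be illustrated with a small Python sketch. The function names parse_authors and map_identity are hypothetical, and the sketch assumes the email address on each line is enclosed in angle brackets.

```python
import re

def parse_authors(text):
    """Parse 'localid = Full Name <email>' lines into a dict."""
    mapping = {}
    for line in text.splitlines():
        local, sep, rest = line.partition("=")
        match = re.fullmatch(r"(.*?)\s*<([^>]+)>", rest.strip())
        if sep and match:
            mapping[local.strip()] = (match.group(1), match.group(2))
    return mapping

def map_identity(name, email, mapping):
    """Replace (name, email) when the local ID has an entry in the map.

    The local ID is the part of the address before '@', or the whole
    address when it contains no '@' at all.
    """
    local = email.split("@", 1)[0]
    return mapping.get(local, (name, email))
```

Taking everything before the first '@' covers both cases at once: a git-svn style address such as esr@uuid and a bare CVS username such as esr yield the same local ID.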
With the 'read' modifier, or no modifier and a filename argument, apply author mapping data (no filename argument, read the mapping from standard input). May be useful if you are editing a repo or dump created by cvs2git or by git-svn invoked without -A.
With no file argument, or with the 'write' modifier and a file argument: write each unique committer, author, and tagger (no file argument sends the report to standard output). This may be helpful as a start on building an authors file.
The gitsvnparse command (previously described) will strip out git-svn metadata lines when given the 'strip' modifier.
Takes a selection set which must resolve to a single commit, and a second argument. The second argument is interpreted as a directory name. The state of the code tree at that commit is materialized beneath the directory.
Display the difference between commits. Takes a selection-set argument which must resolve to exactly two commits.
These are backed up by the following housekeeping commands, none of which take a selection set:
Get help on the interpreter commands. Optionally follow with whitespace and a command name; with no argument, lists all commands. '?' also invokes this.
Execute the shell command given in the remainder of the line. '!' also invokes this.
With no arguments, describe capabilities of all supported systems. With an argument (which must be the name of a supported system) this has two effects:
First, if there are multiple repositories in a directory you do a read on, reposurgeon will read the preferred one (otherwise it will complain that it can't choose among them).
Secondly, if there is a selected repo, this will change its type. This means that when you do a write to a directory, it will build a repo of the preferred type rather than its original type (if it had one).
If no preferred type has been explicitly selected, reading in a repository (but not a fast-import stream) will implicitly set it to the type of that repository.
A few commands have been implemented primarily for debugging and regression-testing purposes, but may be useful in unusual circumstances.
Display four columns of info on objects in the selection set: their number, their type, the associated mark (or '-' if no mark) and a summary field varying by type. For a branch or tag it's the reference; for a commit it's the commit branch; for a blob it's the repository path of the file in the blob.
The default selection set for this command is =CTRU, all objects except blobs.
Does nothing but resolve a selection-set expression and echo the resulting event-number set to standard output. Implemented mainly for regression testing, but may be useful for exploring the selection-set language.
List the names of all known branches and tags. Tells you what things are legal within angle brackets and parentheses.
Report the version of reposurgeon and the list of version-control systems it directly supports.
'verbose 1' enables the progress meter and messages, 'verbose 0' disables them. Higher levels of verbosity are available but intended for developers only.
'echo 1' causes each reposurgeon command to be echoed to standard output just before its output. This can be useful in constructing regression tests that are easily checked by eyeball.
Takes a filename argument. Reads each line from the file and executes it as a command.
List the commands you have entered this session.
Apply or list fossil-reference information. Does not take a selection set.
A fossil-reference file maps reference cookies to (committer, commit-date) pairs; these in turn (should) uniquely identify a commit. The format is three whitespace-separated fields: the cookie, an RFC3339 timestamp, and a committer email ID.
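Reading such a file is straightforward. The sketch below is illustrative only: the function name parse_fossil_map is hypothetical, the cookie spelling used in the usage note is a guess, and lines that do not have exactly three fields are silently skipped as an assumption.

```python
def parse_fossil_map(text):
    """Map each cookie to an action stamp built from the other two fields."""
    fossils = {}
    for line in text.splitlines():
        fields = line.split()
        if len(fields) != 3:
            continue                 # assumption: ignore malformed lines
        cookie, timestamp, email = fields
        fossils[cookie] = timestamp + "!" + email
    return fossils
```

A line such as "SVN:2355 2011-10-25T15:11:09Z fred@foonly.com" would then associate that cookie with the stamp 2011-10-25T15:11:09Z!fred@foonly.com.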
It should not normally be necessary to use this command. The fossil map is automatically preserved through repository reads and rebuilds, being stored in the file fossils under the repository subdirectory.
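The three-field format described above is simple enough to parse in a few lines. The following is a minimal sketch, not reposurgeon's own reader: the function name parse_fossil_map, the cookie spelling r1234, and the restriction to UTC timestamps ending in 'Z' are all assumptions made for illustration.

```python
from datetime import datetime, timezone


def parse_fossil_map(text):
    """Map each reference cookie to a (committer, commit-date) pair."""
    fossils = {}
    for lineno, line in enumerate(text.splitlines(), start=1):
        fields = line.split()
        if not fields:
            continue  # tolerate blank lines
        if len(fields) != 3:
            raise ValueError(
                "line %d: expected 3 fields, got %d" % (lineno, len(fields)))
        cookie, stamp, committer = fields
        # Assumption: timestamps are UTC in the form YYYY-MM-DDTHH:MM:SSZ.
        when = datetime.strptime(stamp, "%Y-%m-%dT%H:%M:%SZ")
        fossils[cookie] = (committer, when.replace(tzinfo=timezone.utc))
    return fossils


# Hypothetical one-line fossil map; the cookie "r1234" is made up.
sample = "r1234 2011-10-24T15:11:09Z fred@example.com\n"
fossil_map = parse_fossil_map(sample)
```

A line with the wrong number of fields raises ValueError rather than being silently skipped, on the theory that a malformed map is better caught early.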
After converting a CVS or SVN repository, check for and remove $-cookies in the head revision(s) of the files. The full Subversion set is $Date:, $Revision:, $Author:, $HeadURL and $Id:. CVS uses $Author:, $Date:, $Header:, $Id:, $Log:, $Revision:, also (rarely) $Locker:, $Name:, $RCSfile:, $Source:, and $State:.
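Checking for leftover $-cookies can be done with a regular-expression scan. The sketch below is an illustration only and is not part of reposurgeon; the name find_cookies is hypothetical, and the keyword list is simply the union of the two sets named above.

```python
import re

# Union of the Subversion and CVS keyword names listed above.
KEYWORDS = ("Author", "Date", "Header", "HeadURL", "Id", "Locker", "Log",
            "Name", "RCSfile", "Revision", "Source", "State")

# Matches the unexpanded form "$Id$" and the expanded form "$Id: ... $".
COOKIE_RE = re.compile(r"\$(?:%s)(?::[^$\n]*)?\$" % "|".join(KEYWORDS))


def find_cookies(text):
    """Return (line number, cookie) pairs for every $-cookie in text."""
    hits = []
    for lineno, line in enumerate(text.splitlines(), start=1):
        for match in COOKIE_RE.finditer(line):
            hits.append((lineno, match.group(0)))
    return hits
```

Running this over the head revision of each file gives a list of places to clean up by hand; the sketch deliberately reports rather than rewrites.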
When you need to specify a commit, use the action-stamp format that references lift generates when it can resolve an SVN or CVS reference in a comment. It is best that you not vary from this format, even in trivial ways like omitting the 'Z' or changing the 'T' or '!'. Making action stamps uniform and machine-parseable will have good consequences for future repository-browsing tools.
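A quick format check can catch hand-typed stamps that drift from the convention. This sketch assumes that an action stamp is an RFC3339 UTC timestamp, a '!', and a committer email address, which is what the 'T', 'Z' and '!' mentioned above imply but this section does not spell out. The name is_action_stamp is hypothetical.

```python
import re

# Assumed layout: YYYY-MM-DDTHH:MM:SSZ!committer-email
ACTION_STAMP_RE = re.compile(
    r"\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2}Z![^\s@!]+@[^\s@!]+")


def is_action_stamp(text):
    """True if text matches the assumed action-stamp layout exactly."""
    return ACTION_STAMP_RE.fullmatch(text) is not None
```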
Sometimes, in converting a repository, you may need to insert an explanatory comment - for example, if metadata is garbled or missing and you need to point to that fact. It's helpful for repository-browsing tools if there is a uniform syntax for this that is highly unlikely to show up in repository comments. I recommend enclosing translation notes in [[ ]]. This has the advantage of being visually similar to the [ ] traditionally used for editorial comments in text.
It is good practice to include, in the comment for the root commit of the repository, a note dating and attributing the conversion work and explaining these conventions. Example:
[[This repository was converted from Subversion to git on 2011-10-24 by Eric S. Raymond <esr@thyrsus.com>. Here and elsewhere, conversion notes are enclosed in double square brackets. Junk commits generated by cvs2svn have been removed, commit references have been mapped into a uniform VCS-independent syntax, and some comments edited into summary-plus-continuation form.]]
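The point of a uniform syntax is that tools can pick the notes out mechanically. As a minimal sketch of what a repository browser might do (the name translation_notes is hypothetical and not part of any existing tool):

```python
import re

# Non-greedy match so that two notes in one comment stay separate;
# DOTALL lets a note span several lines.
NOTE_RE = re.compile(r"\[\[(.*?)\]\]", re.DOTALL)


def translation_notes(comment):
    """Return the text of every [[ ]] note found in a commit comment."""
    return [note.strip() for note in NOTE_RE.findall(comment)]
```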
Guarantee: Editing with reposurgeon never changes the hash of a commit object unless (a) you edit the commit, or (b) it is a descendant of an edited commit in a VCS that includes parent hashes in the input of a child object's hash (git and hg both do this).
Guarantee: reposurgeon only requires main memory proportional to the size of a repository's metadata history, not its entire content history. Blobs are stored on disk.
Guarantee: reposurgeon never modifies the contents of a repository it reads, nor deletes any repository. The results of surgery are always expressed in a new repository.
Guarantee: Any line in a fast-import stream that is not a part of a command reposurgeon parses and understands will be passed through unaltered. At present the set of potential passthroughs is known to include the progress, options, and checkpoint commands as well as comments led by #.
Guarantee: All reposurgeon operations either preserve all repository state they are not explicitly told to modify or warn you when they cannot do so.
Guarantee: reposurgeon handles the bzr commit-properties extension, correctly passing through property items including those with embedded newlines. (Such properties are also editable in the mailbox format.)
Limitation: Because reposurgeon relies on other programs to generate and interpret the fast-import command stream, it is subject to bugs in those programs.
Limitation: bzr suffers from deep confusion over whether its unit of work is a repository or a floating branch that might have been cloned from a repo or created from scratch, and might or might not be destined to be merged to a repo one day. Its exporter only works on branches, but its importer creates repos. Thus, a rebuild operation will produce a subdirectory structure that differs from what you expect. Look for your content under the subdirectory 'trunk'.
Limitation: Under git, signed tags are imported verbatim. However, any operation that modifies any commit upstream of the target of the tag will invalidate it.
Limitation: Subversion/RCS/CVS aren't directly supported because exporting from them requires fixups of usernames in the committer information to full email addresses. Trying to handle that entirely inside this tool would be excessively messy, so we don't. Instead we let the user transform repo-command streams and cope with the export/import separately.
Limitation: Stock git (at least as of version 1.7.3.2) will choke on property extension commands. Accordingly, reposurgeon omits them when rebuilding a repo with git type.
Guarantee: As version-control systems add support for the fast-import format, their repositories will become editable by reposurgeon. See the Git Wiki tools page for a large collection of such tools.
reposurgeon relies on importers and exporters associated with the VCSes it supports:
git: Core git supports both export and import.
bzr: Requires bzr plus the bzr-fast-import plugin.
hg: Requires core hg, the hg-fastimport plugin, and the third-party hg-fast-export.py script.
It is expected that reposurgeon will be extended with more deletion policies. Policy authors may need to know more about how a commit's file operation sequence is reduced to normal form after operations from deleted commits are prepended to it.
Recall that each commit has a list of file operations, each a M (modify), D (delete), R (rename), C (copy), or 'deleteall' (delete all files). Only M operations have associated blobs. Normally there is only one M operation per individual file in a commit's operation list.
To understand how the reduction process works, it's enough to understand the case where all the operations in the list are working on the same file. Sublists of operations referring to different files don't affect each other and reducing them can be thought of as separate operations. Also, a "deleteall" acts as a D for everything and cancels all operations before it in the list.
The reduction process walks through the list from the beginning looking for adjacent pairs of operations it can compose. The following table describes all possible cases and all but one of the reductions.
| M + D → D | If a file is modified then deleted, the result is as though it had been deleted. If the M was the only modify for the file, it's removed too. |
| M a + R a b → R a b + M b | The purpose of this transformation is to push renames towards the beginning of the list, where they may become adjacent to another R or C they can be composed with. If the M is the only modify operation for this file, the rename is dropped. |
| M a + C a b | No reduction. |
| M b + R a b → nothing | Should be impossible, and may indicate repository corruption. |
| M b + C a b → nothing | The copy undoes the modification. |
| D + M → M | If a file is deleted and modified, the result is as though the deletion had not taken place (because M operations store entire files, not deltas). |
| D + {D|R|C} | These cases should be impossible and would suggest the repository has been corrupted. |
| R a b + D a | Should never happen, and is another case that would suggest repository corruption. |
| R a b + D b → nothing | The delete removes the just-renamed file. |
| {R|C} + M | No reduction. |
| R a b + R b c → R a c | The b terms have to match for these operations to have made sense when they lived in separate commits; if they don't, it indicates repository corruption. |
| R a b + C b c | No reduction. |
| C a b + D a → R a b | Copy followed by delete of the source is a rename. |
| C a b + D b → nothing | This delete undoes the copy. |
| C a b + R a c | No reduction. |
| C a b + R b c → C a c | Copy followed by a rename of the target reduces to a single copy. |
| C + C | No reduction. |
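The pairwise walk can be sketched in a few lines. This is an illustration under stated assumptions, not reposurgeon's implementation: operations are modeled as plain tuples, blobs are ignored, the names compose and reduce_ops are hypothetical, and only the straightforward rows of the table are covered (the "should be impossible" cases, the "only modify" special cases, and deleteall are left out). It also assumes the list has already been split so that all operations concern the same file history.

```python
def compose(first, second):
    """Return the replacement for an adjacent pair, or None if no rule applies.

    Operations are tuples: ("M", path), ("D", path),
    ("R", source, target) or ("C", source, target).
    """
    k1, k2 = first[0], second[0]
    if k1 == "M" and k2 == "D" and first[1] == second[1]:
        return [second]                        # M + D -> D
    if k1 == "D" and k2 == "M" and first[1] == second[1]:
        return [second]                        # D + M -> M
    if k1 == "M" and k2 == "R" and first[1] == second[1]:
        return [second, ("M", second[2])]      # M a + R a b -> R a b + M b
    if k1 == "R" and k2 == "D" and first[2] == second[1]:
        return []                              # R a b + D b -> nothing
    if k1 == "R" and k2 == "R" and first[2] == second[1]:
        return [("R", first[1], second[2])]    # R a b + R b c -> R a c
    if k1 == "C" and k2 == "D" and first[1] == second[1]:
        return [("R", first[1], first[2])]     # C a b + D a -> R a b
    if k1 == "C" and k2 == "D" and first[2] == second[1]:
        return []                              # C a b + D b -> nothing
    if k1 == "C" and k2 == "R" and first[2] == second[1]:
        return [("C", first[1], second[2])]    # C a b + R b c -> C a c
    return None                                # no reduction


def reduce_ops(ops):
    """Walk the list from the beginning, composing adjacent pairs."""
    ops = list(ops)
    i = 0
    while i < len(ops) - 1:
        merged = compose(ops[i], ops[i + 1])
        if merged is None:
            i += 1
        else:
            ops[i:i + 2] = merged
            i = 0  # a change may expose a new pair earlier in the list
    return ops
```

For example, reduce_ops([("M", "a"), ("R", "a", "b"), ("R", "b", "c")]) first pushes each rename ahead of the modify, then composes the two renames, leaving a rename from a to c followed by a modify of c. The walk terminates because every rule either shortens the list or moves a rename strictly earlier.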
This section will become relevant only if reposurgeon or something underneath it in the software and hardware stack crashes while in the middle of writing out a repository, in particular if the target directory of the rebuild is your current directory.
The tool has two conflicting objectives. On the one hand, we never want to risk clobbering a pre-existing repo. On the other hand, we want to be able to run this tool in a directory with a repo and modify it in place.
We resolve this dilemma by playing a game of three-directory monte.
First, we build the repo in a freshly-created staging directory. If your target directory is named /path/to/foo, the staging directory will be a peer named /path/to/foo-stageNNNN, where NNNN is a cookie derived from reposurgeon's process ID.
We then make an empty backup directory. This directory will be named /path/to/foo.~N~, where N is incremented so as not to conflict with any existing backup directories. reposurgeon never, under any circumstances, ever deletes a backup directory.
So far, all operations are safe; the worst that can happen up to this point if the process gets interrupted is that the staging and backup directories get left behind.
The critical region begins. We first move everything in the target directory to the backup directory.
Then we move everything in the staging directory to the target.
We finish off by restoring untracked files in the target directory from the backup directory. That ends the critical region.
During the critical region, all signals that can be ignored are ignored.
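The sequence above can be sketched as follows. This is a simplified illustration, not reposurgeon's code. The names swap_in, build_repo and untracked are hypothetical; the caller is assumed to supply the list of untracked file names; restoring them is done by copying so the backup stays complete; and removing the emptied staging directory at the end is an assumption, since the text does not say what happens to it.

```python
import os
import shutil
import signal
import threading


def next_backup_dir(target):
    """Pick the first /path/to/foo.~N~ that does not already exist."""
    n = 1
    while os.path.exists("%s.~%d~" % (target, n)):
        n += 1
    return "%s.~%d~" % (target, n)


def move_contents(src, dst):
    """Move every entry of directory src into directory dst."""
    for name in os.listdir(src):
        shutil.move(os.path.join(src, name), os.path.join(dst, name))


def swap_in(target, build_repo, untracked=()):
    """Rebuild target via a staging directory; return the backup path."""
    staging = "%s-stage%d" % (target, os.getpid())
    os.mkdir(staging)
    build_repo(staging)            # caller-supplied: writes the new repo
    backup = next_backup_dir(target)
    os.mkdir(backup)

    # Critical region: ignore the signals we are able to ignore.
    # Handlers can only be changed from the main thread.
    saved = {}
    if threading.current_thread() is threading.main_thread():
        for name in ("SIGINT", "SIGTERM", "SIGHUP"):
            if hasattr(signal, name):
                sig = getattr(signal, name)
                saved[sig] = signal.signal(sig, signal.SIG_IGN)
    try:
        move_contents(target, backup)
        move_contents(staging, target)
        for name in untracked:     # copy, so the backup stays complete
            dst = os.path.join(target, name)
            os.makedirs(os.path.dirname(dst), exist_ok=True)
            shutil.copy2(os.path.join(backup, name), dst)
    finally:
        for sig, old in saved.items():
            if old is not None:
                signal.signal(sig, old)

    os.rmdir(staging)              # assumption: staging is empty by now
    return backup
```

Note that the backup directory is created but never removed, matching the rule that backups are never deleted. If the process dies inside the critical region, the old content is in the backup directory and whatever has not yet been moved is still in staging.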