Notes on the Go translation of Reposurgeon

This is an experience report on a Python-to-Go translation of a program with significant complexity, written in attempted conformance with the Go community’s practice for grounding language enhancement requests not in it-would-be-nice-to-have abstractions but rather in a description of real-world problems.

Reposurgeon is a program for editing version-control histories and interconverting them among version-control systems. I spent much of 2019 moving the reposurgeon codebase from Python to Go because the Python implementation was too slow to be useful on on really large repositories. The workflow it is designed to support is rapid iterative improvement of conversion recipes by a human operator; long test cycles disrupt this by making experimentation painful.

Subversion-to-git conversion of the Gnu Compiler Collection history, at over 280K commits (over 1.6M individual change actions), was the straw that broke the camel’s back. Using PyPy with every optimization on semi-custom hardware tuned for this job still yielded test conversion times of over 9 hours, which is death on the recipe-debugging cycle. A test translation of the auxiliary repocutter tool suggested that we could expect up to a 40x speedup, which was in line with published Python vs. Go comparative benchmarks.

The problem directed the choice of Go, not the other way around. I seriously considered OCaml or a compiled Lisp as alternatives. I concluded that in either case the semantic gap between Python and the target language was so large that translation would be impractical. Only Go offered me any practical hope.

I did examine two automated tools for Python to Go translation, but rejected them because I judged the generated Go code would have been a maintainability disaster. Thus, translation by hand. Though at about 22% in I did write a fast, crude, incomplete Python-to-Go source translator to assist the process.

The man barrier to translation was that, while at 14KLOC of Python reposurgeon was not especially large, the code is very dense. It’s a DSL that’s a structure editor for attributed DAGs — algorithmically complex, bristling with graph theory, FSMs, parsing, tricky data structures, two robot harnesses driving other tools, and three different operator-composition algebras. It became 21KLOC of Go.

The algorithmic density of reposurgeon is such that it would be a challenge to the expressiveness of any language it were implemented in. It makes a good test of the relative expressiveness of Python and Go, and an effective way to audit for places where moving from Python to Go hinders concision and readability.

The skillset I approached this problem with is: Go novice, Python and C expert; old Unix hand; Lisp hacker of even more ancient vintage; ex-mathematician with strength in graph theory, group theory and combinatorics; lots of experience as a systems programmer, lots of exposure to odd languages, and lots of domain knowledge about version-control systems. Adjust bias compensators accordingly.

(The 1.8 version of these notes was shipped to golang-nuts on 2020-01-25; the topic thread is here. A typo fix and a clarification have been backported from the thread.)

Expected problems that weren’t

Semantic distance. In general, the translation gap between Python and Go is narrower than I initially expected, especially considering the dynamic-vs-static type-system difference. On reflection, I think it turns out that GC and the existence of maps as first-class types do more to narrow that gap than the static/dynamic divergence in type systems does to widen it.

Polymorphic lists. The principal data structure of a repository’s representation in reposurgeon is a list of events — data structures representing modification operations on sets of files. The list is necessarily polymorphic because (for example) a change commit and a tag creation are different kinds of things. Translating this to static typing using interfaces proved less arduous than I had feared, though the process revealed a documentation issue and a problem with absence of sum types; I will return to both points.

Operator overloading. Good style in Python, but surprisingly easy to relinquish in Go. I went in thinking that they’d be on my want list for Go before I was done, but no — not even at reposurgeon’s complexity scale.

Generics. Yes, these would have made translation easier, but the main impact turned out to be that I had to write my own set-of-int and set-of-string classes. That was a surprisingly light hit. What I really missed was generic map-function-over-slice, which could be handled by adding a much narrower feature.

The positive part of my summation is that hand-translation of Python to Go even at this scale and complexity is not horrible. It’s not easy, exactly, but quite doable. It is however time-intensive; counting time to build the required unit tests in Go, I managed about 150-200 lines a day before writing pytogo and 500-600 lines per day afterwards. The entire translation, interleaved with my work on NTPsec and other projects, took just about 12 months in wall time. Perhaps a third of that was spent debugging the Go result after it achieved first full compilation.

The pain points were mainly around a handful of advanced Python features: iterators, generators, comprehensions, and class-valued exceptions. Fortunately, even in quite advanced Python code like reposurgeon’s these turn out to not be very common on a per-KLOC basis.

Problems that were

Keyword arguments

The problem that obtruded on me first was quite unexpected: absence of keyword arguments. In Python one can write a function signature like this

    def EuclideanDistance(x, y):

and then call it like this:

    d = EuclideanDistance(x=3.0, y=9.6)

I used keyword arguments extensively in the Python, especially in object-creation functions where it is often required to pass in multiple parameters that won’t fit neatly into a single data structure.

Go presently has no such feature. This is probably the single most serious readability hit my translation took; it got significantly more difficult to grasp what was going on at all those callsites.

No map over slices

Translating Python map() calls and comprehensions produces code that is ugly and bulky, forcing the declaration of dummy variables that shouldn’t need to exist.

If one graded possible Go point extensions by a figure of merit in which the numerator is "how much Python expressiveness this keeps" and the denominator is "how simple and self-contained the Go feature would be", I think this one would be top of list.

So: map as a functional builtin takes two arguments, one x = []T and a second f = func(T)T. The expression map(x, f) yields a new slice in which for each element of x, f(x) is appended.

This proposal can be discarded if generics are implemented, as any reasonable implementation of generics would make it trivial to implement in Go itself.

Annoying limitations on const

Inability to apply const to variables with structure, map, or slice initializers is annoying in these ways:

Compiler can’t enforce noli me tangere
const functions as a declaration of programmer intent that is valuable at scale.

In Python one can often get a similar effect by using tuples. I used this as a form of internal documentation hint in the original Python. I want it back in Go.

Any extension in the scope of const, even a relatively conservative one like only allowing const structures with compile-time constant members, would have significant benefits.

Absence of lookbehind in Go regexps

This is a small point problem, easily fixed, that was far more annoying in practice than it should have been in theory.

Python regexps have both positive and negative lookbehind clauses. The following expression looks for possible Subversion revision designators in comments, excluding bug references:

"(?<!bug )[0-9]+"

Go translation reveals that it is remarkably unpleasant, verging on "too painful to be worth it" to do that filtering without lookbehinds.

This is the only real problem I have identified in moving from Python regexps to Go ones. Take that "only" seriously, because regexps are a Swiss-army knife I use heavily; Go regexps are doing well to have no limits that are more annoying.

Absence of sum/discriminated-union types

I have read issue #19412 and am aware of the objections to adding sum types to Go.

Nevertheless, I found their absence was something of a pain point in my translation. Because reposurgeon events can have any one of a set of types (Blob, Tag, Commit, Callout, Passthrough, Reset) I found myself writing a lot of stupid boilerplate code like this:

	for _, child := range commit.children() {
		switch child.(type) {
		case *Commit:
			successorBranches.Add(child.(Commit).branch)
		case *Callout:
			complain("internal error: callouts do not have branches: %s",
				child.idMe())
		default:
			panic("in tags method, unexpected type in child list")
		}
	}

Besides being inelegant, the requirement for a runtime check to exhaust all cases is a defect attractor. It’s way too easy to forget to write the default case and wind up with silent errors.

Thus, absence of discriminated-sum types is an actual hole in the language that compromises its goal of enforcing strong invariants through type safety checked at compile time.

This will especially tend to become an issue when translating from a language like Python with fully dynamic typing.

I don’t have a concrete proposal to fix this yet. If these notes are well received I may write one.

Catchable exceptions require silly contortions

Though I revised it significantly on completion, much of this report was originally written at about the 12% point of the translation. By twice that far in, 23%, another problem about which I had not originally been intending to complain became obtrusive. That is absence of a general facility for structured exceptions.

Yes, I’m familiar with all the reasons throw/catch wasn’t included in Go 1. Including the laudable goal of forcing programmers to be explicit about error handling and how they propagate errors up their call stack. And I understand that defer/recover was an attempt to provide a tractable subset of catchable exceptions that would minimize the temptation to sin.

Because I broadly agree with this set of goals, I was actively intending when I started this translation not to complain about the lack of general catchable exceptions, or ship any related RFEs, in spite of having a presentiment that they would be a problem. That is, until I hit a wall in the real world and had to rethink.

Here’s my use case. Reposurgeon is an interpreter for a DSL. Situations in which I can tolerate panic-out and die are rare and mostly occur at initialization time. Usually what I want to do instead of panicking on error is throw control back to the read/eval loop, executing some kind of local cleanup hook on the way out. Analogous situations will frequently occur in, for example, network servers.

In a language with labeled throw/catch, or class-valued exceptions, I can address this by explicitly target an exception to some level of the call stack above the point it’s raised. In reposurgeon, for example, there are usually two levels of interest. One is the DSL’s read-eval loop. The other is the outermost scope; if an exception gets there I want to call hooks to gracefully remove working directories (blob storage associated with the repository-history structures being edited) before exiting the program.

In Go, I didn’t seem to have a clean option for this. Which was a problem on two levels….

Python reposurgeon was 14 KLOC of dense code. At that scale, any prudent person in a situation like this will perform as linear and literal a translation as possible; to do otherwise is to risk a complexity explosion as you try to cross the semantic gap and rethink the design at the same time. Absence of class-valued exceptions was far and away the biggest technical blocker. "First make it work, then make it right"; the least risky path seemed to be to shim in exceptions with the intention of removing them later.

Eventually, after beating on the panic/recover feature for a while, I found this kludge:

package main

import "fmt"

type exception struct {
	class string
	message string
}

func (e exception) Error() string {
	return e.message
}

func throw(class string, msg string, args ...interface{}) *exception {
	// We could call panic() in here but we leave it at the callsite
	// to clue the compiler in that no return after is required.
	e := new(exception)
	e.class = class
	e.message = fmt.Sprintf(msg, args...)
	return e
}

func catch(accept string, x interface{}) *exception {
	// Because recover() returns interface{}.
	// Return us to the world of type safety.
	if x == nil {
		return nil
	}
	err := x.(*exception)
	if err.class == accept {
		return err
	}
	panic(x)
}

func main() {
	defer println("Defer 1")
	defer println("Defer 2")
	defer println("Defer 3")

	defer func() {
		fmt.Println("Recover:", catch("recoverable", recover()))
	}()
	panic(throw("recoverable", "Don't Panic!!!"))

	fmt.Println("Unreachable.")
}

This works, and it works if you change the class to something other than "recoverable"; you get the expected rethrow and panic. But it is unreasonably ugly. So why am I bringing it forward? Because…

The translation experience reduced my disposition to think that Go is right to be narrow and prescriptive on this issue. Two kinds of doubts grew on me:
- Pragmatic doubt. Trying to be a good Go citizen, I kept looking at places where existing nonlocal control transfers in Python could be replaced by explicit Go-style passing upwards of an error status. But I noticed that there were a significant percentage of cases in which doing this made the code more difficult to follow rather than easier.
  
  A simple representative example is a call chain of several data transformations in which each stage has its own failure condition and any failure aborts the transformation. If we there were no error cases we might write, in a Pythonoid sort of notation:
  sink = transform3(transform2(transform1(source)))
  If a stage can error out, we might have these structural alternatives to consider. One is Go style:
  (fail1, result1) = transform1(source) if fail1 == true: status = Exception1 else: (fail2, result2) = transform2(result1) if fail2 == true: status = Exception2 else: (fail3, result3) = transform3(result1) if fail3 == true: status = Exception3 else: sink = result3 status = OK
  The other style is with a catchable exception:
  status = OK try: sink = transform3(transform2(transform1(source))) except (Exception1, Exception2, Exception3) as err: status = err
  I don’t think there’s even a colorable argument that the Go structure is better in a case like this. Look at all those extra variables, that eye-confusing ladder structure, the defect-prone near-but-not-quite repetition of code.
  
  An early reviewer pointed out that if the Go code were an entire function it could be expressed something like this:
  func pipeline(source T) { result1, err1 := transform1(source) if err1 != nil { return err } result2, err2 := transform2(result1) if err2 != nil { return err } result3, err3 := transform3(result2) if err3 != nil { return err } return nil }
  That’s still a lot of eyeball friction compared to functional-style with exceptions. And it gets worse faster as the number of stages rises.
  
  My problem was that I kept finding analogous situations in my translation. The specific one that motivated the above pseudocode was in a feature called "extractor classes". There are little bots that run the client tools of a VCS to mine the output for its metadata. It’s actually a five- or six-stage process wherein any command failure requires an abort.
  
  In these cases moving to Go style produced a serious loss of clarity. And a rising feeling that I wanted my exceptions back (and in fact the extractor-class code now contains the one real instance of my exceptions kludge). Which leads to this:
- Aesthetic doubt. I’ve never written a general-purpose language, but I have designed way more than my share of DSLs and declarative markups, and from this I have learned a heuristic for doing engineering that I won’t regret. For any given capability X:
  
  Being able to express X elegantly is a good place to be. Leaving out X entirely for safety and verifiability can be a good choice, and is at least defensible on those grounds. But if you implement X in a half-hearted, weak way that requires ugly code to use and fails to actually foreclose the conceptual problems you were trying to dodge, that’s a bad place to be.
  
  That bad place is where Go is right now with respect to nonlocal control transfers, and why I had to write my kludge.
  
  Interestingly, I was also able to come up with a very minimalist solution. No new syntax, two minor new compilation rules.
  
  To motivate it, let’s set the goal of being able to rewrite my example like this:
  package main import "fmt" type exception struct { class string message string } func (e exception) Error() string { return e.message } func throw(class string, msg string, args ...interface{}) { e := new(exception) e.class = class e.message = fmt.Sprintf(msg, args...) panic(e) } func catch(accept string) *exception { if x := recover(); x == nil { return nil } err := x.(*exception) if err.class == accept { return err } panic(x) } func main() { defer println("Defer 1") defer println("Defer 2") defer println("Defer 3") defer func() { fmt.Println("Recover:", catch("recoverable")) }() throw("recoverable", "Don't Panic!!!") fmt.Println("Unreachable.") }
  That is rather less ugly, actually pretty reasonable if the implementations of throw and catch aren’t staring you in the face. And all it would take to get there is two minor loosenings of restrictions.
  1. The panic function has a new property, "terminating". If the compiler can prove that all exit paths from a function invoke terminating functions, it is marked "terminating". The effect of this property is to suppress "missing return" errors on any code path from call of a terminating function to exit of its caller, but not on other paths to exit.
  2. A recover() call is no longer required to be within the lexical frame of a defer(). It can be in a helper called by the defer clause (but still within the call scope of a defer). For safety we’d need an additional rule that a go clause in the helper puts the code it runs out of scope for purposes of this check.

Absence of iterators

Having Python iterators go missing is really annoying for reposurgeon, in which lazy evaluation of very long lists is a frequent requirement.

Here’s the type example. I have in my repository representation a list of possibly hundreds of thousands of events. A subset of these events is Commit objects. I would like to be able to write

	for i, commit := range repo.commits() {
		do_stuff_to(commit)
	}

In Python it is easy and natural to write commits() as an iterator which lazily walks the repository event list looking for Commit objects. Each time it is called it either returns with "yield", handing back the next commit, or actually returns — which is a signal that the for loop should terminate.

I can’t do this in Go; I have to write commits() to return an entire constructed slice made by filtering the event list. Which is annoying for long lists, especially when it might well terminate early.

Sure, there’s an alternative. It looks like this…

	for i, event := range self.events {
		switch event.(type) {
		case *Commit:
			do_stuff_to(event.(*Commit))
		}
	}

…and about which what I have to say is "Ugh!". That code does not say "walk through all commits", it says "walk through all events and do something to the ones that happen to be commits". I don’t want to wander into event-land here; that type-assertion/cast pair looks altogether too much like a defect attractor. Also, unnecessary eyeball friction.

I had no good idea what could be done about this. I read Ewen Cheslack-Postava’s excellent discussion of iterator patterns in Go and agreed with him that none of them are really satisfactory.

Annoyingly, the iterator pattern he suggests is almost the right thing — except for the part where early break from a channel-based iterator leaves its goroutine running and some uncollectible garbage.

Then, on my second reading, I had a brainstorm. I found a trivial Go extension that would give iterators with no new syntax, no hidden magic, and no yield/return distinction:

New evaluation rule on how to interpret for loops when the range operand is a callable: the loop runs as a generator, yielding each value in succession, until the callable returns the zero value of its type.

So, with that I could write a Repository method like this:

// Iterator variant A: range stops on a zero value

func (repo *Repository) commits() func() *Commit {
	idx := -1
	return func() *Commit {
		for {
			if idx++; idx >= len(self.events) {
				return nil
			}
			if _, ok = self.events[idx].(*Commit); ok {
				return self.events[idx]
			}
		}
	}
}

…and there I have it. An iterator, with exactly the same lifetime as the for loop.

Then I thought it might be best to make this properly parallel to the way iteration via range works.

// Iterator variant B: stop variable.

func (repo *Repository) commits() func() *Commit {
	idx := -1
	return func() (*Commit, bool) {
		for {
			if idx++; idx >= len(self.events) {
				return nil, false
			}
			if _, ok = self.events[idx].(*Commit); ok {
				return self.events[idx], true
			}
		}
	}
}

With this form the iterator could pass back zero values without terminating, terminating only when the second return value from the function-valued range argument goes to false.

I suggest that one of these be adopted for a future release of Go. Small, easy new evaluation rule, big gain in expressiveness.

Hieratic documentation

Figuring out how to do type-safe polymorphism in the event list was more difficult than it should have been. The problem here wasn’t the Go language, it was the official (and unofficial) documentation.

There are two problems here, one of organization and one of style.

The organization problem is that there isn’t one. The official Go documentation seems to center on the library API docs, the specification, the Tour, and a couple of "official" essays written for it. It also includes a corona of white papers and blog posts. Often these are valuable deep dives into specific aspects of the language even when they are notionally obsolete. Some of them are outside the boundaries of the official documentation site.

For example, I got substantial help understanding interfaces from an old blog post by Ian Lance Taylor (one of the Go devs) that was offsite, dated from 2009, and contained obsolete implementation details.

The high-level problem is that while the Go devs have done a praiseworthy and unusually effective job of documenting their language considering the usual limitations of documentation-by-developers, finding things in the corona is hard. And knowing what’s current is hard.

The documentation is (dis)organized in such a way that it’s difficult to know what you still don’t know after reading a Tour page or blog entry or white paper. There should be more "But see here for a dangerous detail" links, in particular to the language specification.

Style. Go has a problem that is common to new languages with opinionated developers (this is part of "the usual limitations" above). There are one or two exceptions, but the documentation is predominantly written in a terse, hieratic style that implicitly assumes the reader already inhabits the mindset of a Go developer.

The documentation is not very good at providing an entry path into that mindset. Not even for me, and I’m an extreme case of the sort of person for whom it should do an effective job if it can do that for anyone.

There is a fix for both problems. It is not magic, but it is doable.

The Go dev team should bring in a documentation specialist with no initial knowledge of Go and a directive to try to maintain an outside-in view of the language as he or she learns. That specialist needs to be full-time on the following tasks:

(1) Edit for accessibility — a less hieratic style

(2) Maintain a documentation portal that attempts to provide a reasonable map of where everything is and how to find it.

(3) Curate links to third-party documents (for example notable Stack Overflow postings), with dates and attached notes on what parts might be obsolete and when the document was last reviewed for correctness.

(4) Bring the very best third-party stuff inside, onto https://golang.org/doc/.

Note: After writing this, I had an even worse time digging up and fixing in my mind all the details of how defer/panic/recover works. It’s almost all documented somewhere, though Peter Seebach and I ended up writing a FAQ entry on how to set local variables from a defer clause to clear up minor confusion. There’s a very helpful blog post on the general topic. But the blog post leaves out the crucial detail that recover returns interface {}, not error; this tripped me up when I was writing my kludge, and I ended up on IRC getting referred to the formal Go specification.

This is all too typical. Everything makes sense once you know it, but before you know it critical details are often lurking in places you have no way of knowing you should look.

Attention to the problem and a good technical writer/editor can fix this.

Outcomes

My performance objectives were achieved. I didn’t get a fully 40x speedup, but only because the running time of the GCC conversion is dominated by the low speed of the Subversion tools. The non-I/O limited part of processing fell from about 7 hours to about 20 minutes (about 20x), and the overall speedup over Python was about 10x.

Additionally, maximum working set drastically decreased. These improvements re-enabled the workflow reposurgeon was designed for, rapid iterative improvement of conversion recipes.

On January 12th 2020 the production conversion of the GCC history from Subversion to Git actually took place.

I am no longer a Go novice. :-)

Accentuating the positive

The Go translation of reposurgeon is better — more maintainable — code than the Python original, not just faster. And this is not because I rewrote or refactored as I went; as I’ve explained, I tried very hard to avoid that. It’s that Go’s minimalistic approach actually…works.

I see a maintainability benefit from the static typing. The Go type system does what a type system is supposed to do, which is express program invariants and assist understanding of its operational semantics.

Translation from Python, which is a dreadful language to try to do concurrency in due to its global interpreter lock, really brought home to me how unobtrusively brilliant the Go implementation of Communicating Sequential Processes is. The primitives are right and the integration with the rest of the language is wonderfully seamless. The Go port of reposurgeon got some very large speedups at an incremental-complexity cost that I found to be astonishingly low. I am impressed both by the power of the CSP part of the design and the extreme simplicity and non-fussiness of the interface it presents. I hope it will become a model for how concurrency is managed in future languages.

I’ve also seen a maintainability benefit from how easy Go makes it to write unit tests in parallel with code.

The Go profiling tools (especially the visualization parts) are extremely effective, much better at smoking out hidden superlinearities in algorithms than Python’s.

I have to call out the Go time library as a particularly good piece of work. Having the basic timestamp property be location-aware with its presentation modified by the implied zone offset simplified a lot of cruft out of the handling of committer/author dates in Python.

Now that I’ve seen Go strings…holy hell, Python 3 unicode strings sure look like a nasty botch in retrospect. Good work not falling into that trap.

Translation pragmatics

The strategy of writing as literal as possible a translation of the Python first, at the cost of generating Go code that was initially clunky and unidiomatic, worked quite well. It actually took effort and discipline to refrain from trying to improve the code as it passed through translation, but I am very glad I expended that effort. It kept the translation process sane and controllable.

I should note that a prerequisite for a translation like this is an excellent test suite. As I write (in late January 2020), reposurgeon at 22KLOC, has 52 unit tests, 177 round-trip tests, 218 end-to-end functional tests, and a miscellany of special tests that add up to a total of 502 reporting items. Only around 20 of these were added during the translation itself. All these tests are run by continuous integration on every commit.

That translation phase was completed In November 2019 when the Go code first passed the full test suite. The two months since has been sufficient to polish the literalistic mock-Python it was at that time into what is now, I believe, pretty clean idiomatic Go.

Pass-by-reference vs. pass-by-value

I think I can say now that once one has a translation from Python to Go that compiles, the largest single sources of bugs is the difference between Python pass-by-reference semantics for object and Go pass-by-value. Especially when iterating over lists.

Go "for i, node := range nodelist" looks very similar to Python "for (i, node) in enumerate nodelist"; the gotcha is that Go’s pass-by-value semantics means that altering members of node will not mutate the nodelist.

The fix isn’t very difficult; this

	for i := range nodelist {
		node := &nodelist[i]
		...
	}

often suffices.

I don’t have any recommended language change around this, as I don’t think Go’s choice is wrong. I do think the fact that this relatively minor issue is one of the larger translation barriers is interesting.

Envoi: Actionable recommendations

These are in what I consider rough priority order.

Keyword arguments should be added to the language.
True iterators are felt by their absence and would be easy to add.
Any enlargement in the range of what can be declared const would be good for safety and expressiveness.
Yes, throw()/catch() needs to be writeable in the language. Two minimal relaxations of compilation rules would make writing it possible.
A technical writer with an outside-in view of the language should be hired on to do an edit pass and reorganization of the documents.
Lookbehinds should be added to the regexp library.
If generics don’t fly, a map-over-slice intrinsic should be added.

Not quite actionable yet:

Absence of sum types creates an actual hole in the type-safety of the language.