The C-INTERCAL compiler has a very conventional implementation using YACC and LEX (latterly, bison and flex). It generates C, which is then passed to your C compiler.
The spectacular ugliness of INTERCAL syntax requires that the lexical analyzer have two levels. One, embedded in the input() function, handles the backquote and bang constructs, and stashes the input line away in a buffer for the splat construct’s benefit. The upper level is generated by lex(1) and does normal tokenizing for YACC.
The upper level of the lexer also does a bit of syntax analysis, by distinguishing it from an ordinary COME FROM and passing different symbols to the parser. The lexer is also responsible for helping the parser to match sparks and ears in complicated array situations, by distinguishing possibly closing sparkears from definitely opening sparkears.
In order to splatter erroneous statements correctly, the generated code has to track both logical and physical lines. That’s the reason for the lineno variable in the generated C, it’s actually tracking physical lines.
Numeral tokens for input are defined in a symbol table (numerals.c) that is directly included in the run-time library module (cesspool.c). This avoids having to put the the size of the numerals array in an extern. To add new numeral tokens, simply put them in the numerals initializer.
The parser builds an array of tuples, one for each INTERCAL statement. Most tuples have node trees attached. Once all tuples have been generated, the compile-time checker and optimizer phases can do consistency checks and expression-tree rewrites. Finally, the tuples are ground out as C code by the emit() function.
Calculations are fully type-checked at compile time; they have to be because (as I read the manual) the 16- and 32-bit versions of the unary ops do different things. The only potential problem here is that the typechecker has to assume that :m ~ :n has the type of :n (32-bit) even though the result might fit in 16 bits. At run-time everything is calculated in 32 bits. When INTERCAL-72 was designed 32 bits was expensive; now it’s cheap. Really, the only reason for retaining a 16-bit type at all is for the irritation value of it (yes, C-INTERCAL does enforce the 16-bit limit on constants).
Labels are mapped to tuple indices (logical line numbers) in the code checker, just before optimization.
Each line of INTERCAL is translated into a C if()-then; the guard part is used to implement abstentions and RESUMES, and the arm part translates the ‘body’ of the corresponding INTERCAL statement.
The generated C code is plugged into the template file ick-wrap.c inside main(). It needs to be linked with cesspool.o, fiddle.o and lose.o (these are in libick.a, with the support for runtime switches, arrgghh.o). Cesspool.o is the code that implements the storage manager; fiddle.o implements the INTERCAL operators; and lose.o is the code that generates INTERCAL’s error messages. The routine arrgghh.o parses the runtime command line arguments.
The abstain array in the generated C is used to track line and label abstentions; if member i is on, the statement on line i is being abstained from. If gerund abstentions/reinstatements are present in the code, a second array recording the type of each statement in generated into the runtime, and used to ensure that these operations are translated into abstention-guard changes on all appropriate line numbers.
RESUMES are implemented with a branch to a generated switch statement that executes a goto to the appropriate label. If there are no RESUMES, no such switch is generated.
The compiler places a simple label at the location of each COME FROM in the program, while all of the machinery for checking for abstention and conditional execution and actually performing the jump is placed immediately after the code for the target statement.
AIS adds: The additions to the language [since 0.24] have made code generation potentially more complicated, depending on which switches are on. Each INTERCAL command is still translated into an if()-then, but with extra guards around it. COME FROM is still handled the same way, but only in singlethreaded programs with no computed COME FROMs. For anything more complicated than that, a general COMING FROM mechanism is invoked. C’s most confusing control-flow identifiers, setjmp and longjmp, are used. When computed COME FROMs are involved, after every labeled statement, a set of gotos are done to check each in turn. A jmp_buf, cjb, is used to store the point in the program where the check started. Each COME FROM leading to that line and computed COME FROM anywhere in the program is checked, and modifies cjb to aim for its suckpoint if neccesary. In the case of a multithreaded program, one cjb will be stored in a linked ring of threads, and the other will be the one in the main program. Extra guards are also added for ONCE/AGAIN; the position of this guard depends on whether the command is a NEXT or something else. ONCE/AGAIN are quite legal even in singlethreaded programs. Computed ABSTAIN is dealt with by allowing the abstain array to hold values other than 0 or 1. Multithreading is supported by a new library, libickmt.a, to hold the multithread versions of libick.a functions. See unravel.c for an explanation of the full gory details of the multithread implementation (it has many comments explaining the implementation at the top).
The optimizer core does full recursive folding of all constant expressions at compile time (evaluating away all the irritating little kluges you have to use to generate 32-bit constants). It also checks for known INTERCAL idioms for ‘test for equality’, ‘test for nonzeroness’, and the C logical operators &, |, ^, and ~.
Here are the constant-folding optimizations from the original optimizer core:
Vx$y"#0x55555555" into (x | y) ?x$y"#0x55555555" into (x y) &x$y"#0x55555555" into (x & y) (x 0xFFFF) or (x ^ 0xFFFFFFFF) into x
AIS adds: When I got my C-INTERCAL distribution, it only had a rudimentary optimiser. Constant folding had already been done, and the following optimisations. The following is mixed INTERCAL and C, sometimes in the same statement. It should be clear from context and whether I have used INTERCAL spark-ears or C brackets which language is which, although in some places I’ve mixed them. Note that the optimizer is only run in binary.
ESR adds: After 0.24, Alex Smith, assisted by Joris Huizer, wrote a program that interprets an Optimizer Idiom Language and uses it to generate a library implementing those optimizations to be linked to the compiler. This subsumes all the stuff that used to be in the optimizer code.
AIS adds: Optimizations available with -f (I wrote all these)
Guard removal (if a line can’t be abstained, the if(!abstained[x]) line is removed when possible)
Ignorance checks (if a variable can’t be ignored and it can’t overflow, assignment is done with C’s = rather than a function)
Recognition of NEXT…NEXT…computed RESUME as an idiom for if() (currently disabled because the rest of the optimiser can’t handle it)
Optimizations available with -F (I wrote this one)
This optimization should mean that INTERCAL suddenly becomes the best language for various benchmark programs. Basically, the file is analysed to try to determine whether it takes no input and always produces the same output. If that is the case, the output is simply a shell-script (with executable permissions set and a #! line) that outputs the correct output! Therefore, the time required for one run of the program is taken up during compilation, and whenever the program is executed, it just copies out the predetermined output. This means (for instance) that primes.i runs in only a fraction of a second when optimized with -F. Of course, this tends to slow down the compiler and often produces a large object file, which is why it has a command-line option of its own.
ESR wrote the first version of this compiler over a weekend using a pre-ANSI C compiler. It worked, but it wasn’t pretty.
Louis Howell added the array support later; he also torture-tested the COME FROM implementation by actually using it for the life2.i program included in this distribution, and fixed some bugs.
Brian Raiter did a much-needed delinting while porting it for ANSI C, flex and Linux. He also improved the lexical analyzer’s line tracking.
Alex Smith, assisted by Joris Huizer, did a lot of heavy-duty work on optimization, including designing and implementing OIL, after 0.24.