SYNOPSIS

comit [-v] [-h] [-d phases] program…​

DESCRIPTION

COMIT, dating from 1957, was the first programming language designed specifically for string manipulation. It was directly ancestral to SNOBOL, and through SNOBOL to sh(1), sed(1), perl(1) and all other languages built around regular expressions.

This program interprets COMIT II program files. Execution begins at the first rule of the first file.

Language syntax and semantics are as described in "Programming with COMIT II" by Victor H. Yngve, MIT Press 1972. Only excerpts from the book (almost) sufficient to describe the language itself are included in the distribution.

Numerous example COMIT programs from the COMIT II book are included in the distribution. Read them and marvel.

OPTIONS

-h

Dump a usage message and exit.

-d 'phases'

Debug; followed by a comma-separated list of phases. Defined phases are 'parse', 'match', and 'exec'; these produce labeled traces to standard error.
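
As an illustration of the argument format only, the following Python sketch splits a -d argument into phase names. It is not the interpreter's own option handling; the function name parse_phases and the choice to reject unknown names are assumptions made for the example.

```python
# Illustrative sketch only: not the interpreter's actual option handling.
KNOWN_PHASES = ("parse", "match", "exec")  # phase names documented above


def parse_phases(arg):
    """Split a -d argument such as "parse,match" into a list of phase names."""
    phases = [p.strip() for p in arg.split(",") if p.strip()]
    for p in phases:
        if p not in KNOWN_PHASES:
            # Assumption: rejecting unknown names is this sketch's choice;
            # the real interpreter may treat them differently.
            raise ValueError("unknown debug phase: " + p)
    return phases
```

Under these assumptions, parse_phases("parse,exec") yields the two phase names in the order given.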

LIMITATIONS

Some language features are not yet implemented: subroutines, medial $0, subscripts, and some routing instructions other than R and W. The main blocker on these is that it is difficult to extract a crisp specification for them from the rather vague language and fragmentary examples in the source materials.

This implementation’s author has elected to revive the largest language subset that can be verified by eyeball and punt the rest, rather than choose interpretations that are plausible but might be wildly wrong. In the future, perhaps careful forensics will make it possible to reconstruct the rest of the language with more confidence.

Some limits in the IBM 704 implementation are not enforced. Arithmetic is in the machine’s native size, not 16 bits. Names may be longer than 12 characters. Character data is not limited to the IBM 704’s 6-bit BCD character set.

The batch-mode diagnostic dump at the end of each run is not implemented.

In the I/O routing instructions R and W, the mode and format fields defined for the IBM 704’s batch environment are ignored. RC loads a newline-terminated line into the workspace from standard input, with each character as one constituent; RT reads in blank-separated tokens. W writes out the entire workspace every time. These, in effect, implement COMIT’s "Format C" input and "Format A" output style, which seem to have been those most commonly used.
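
The difference between the two read instructions can be pictured with a small Python sketch. This is a model of the behavior described above, not code from the interpreter: the names read_rc and read_rt are hypothetical, a workspace is represented as a plain list of constituents, and the handling of the trailing newline and of runs of blanks is assumed.

```python
def read_rc(line):
    """RC-style read: each character of the line becomes one constituent."""
    # Assumption: the terminating newline is not kept as a constituent.
    return list(line.rstrip("\n"))


def read_rt(line):
    """RT-style read: each blank-separated token becomes one constituent."""
    # Assumption: any run of whitespace counts as a single separator.
    return line.split()
```

Given the input line "THE DOG", read_rc produces seven constituents (including the blank), while read_rt produces two.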

The channel code in R and W routing instructions is mostly ignored. Under W, channel M is mapped to standard error and all other channels to standard output.
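
The output mapping amounts to a single test on the channel code. A minimal Python sketch follows; the function name stream_for_channel is hypothetical, and an exact, case-sensitive match on "M" is assumed.

```python
import sys


def stream_for_channel(channel):
    """Pick the output stream for a W routing instruction's channel code."""
    # Channel M goes to standard error; every other channel goes to
    # standard output. Assumption: the comparison is case-sensitive.
    return sys.stderr if channel == "M" else sys.stdout
```

A caller could then write with print(text, file=stream_for_channel(code)).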

The interpreter will cough and die if the input contains non-ASCII characters.

REPORTING BUGS

Use the bugtracker on the project page at https://gitlab.com/esr/comit

SEE ALSO

sed(1)

A searchable transcript of the 1958 paper is included in the comit source distribution.