file(1) specification

Daniel Quinlan (quinlan@proton.pathname.com)
Sun, 20 Oct 96 20:40 PDT

Eric S Raymond <esr@locke.ccil.org> writes:

> [...] Once the magic numbers RFC and the Magic Numbers Registry have
> been developed, the list may take on other challenges including
> extension of the file(1) specification language to be able to get
> hints from filename patterns.

A new version of the POSIX 2 specification, including file(1), is
under development. The last draft I saw had some problems, and I
began writing a critique but didn't get far with it because of
time constraints.

I just mailed hlj@posix.com, the editor of POSIX 2, to ask for an
updated version of the draft (of the file(1) section). I'll forward
it here when I get it. If I can locate the last draft I had
(draft 11), I'll send that here too.

Anyway, I might as well include the critique as it stands.

------- start of cut text --------------
Comments concerning POSIX 1003.2B, draft 11, section 5.14:

1. byte-order
-------------

The current draft has no provision for the differentiation between magic
numbers originating from big-endian (MSB first or Motorola-order) and
little-endian (LSB first or Intel-order) machines.

With the exception of string matches, this calls the validity of any
match into serious question. It also makes the porting of magic files
between big-endian and little-endian architectures impossible. (Note:
byte matches are even worse; Darwin file only uses them for subtests.)

Darwin file was extended in 1993 to provide byte-order handling. It was
accomplished through the addition of several new magic types: "beshort",
"leshort", "belong", and "lelong" (where the "be" and "le" prefixes
refer to big-endian and little-endian, respectively).

There has been a lot of work to update the Darwin file collection of
magic entries to byte-order specific forms, where possible.
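To illustrate why byte-order specific types matter, here is a minimal
Python sketch (not taken from any file implementation) that decodes the
same four bytes three ways. The variable names mirror the "belong" and
"lelong" magic types; the sample bytes are simply the ASCII text "Crsh".

```python
import struct

# Four bytes as they appear at the start of a file: ASCII "Crsh".
data = b"\x43\x72\x73\x68"

# "belong": always interpret the bytes as a big-endian 32-bit value.
belong = struct.unpack(">I", data)[0]
# "lelong": always interpret the bytes as a little-endian 32-bit value.
lelong = struct.unpack("<I", data)[0]
# Host byte order: the result depends on the machine doing the test,
# which is the ambiguity described above.
native = struct.unpack("=I", data)[0]

print(hex(belong))  # 0x43727368
print(hex(lelong))  # 0x68737243
```

A magic entry written against the host-order value matches on only one
class of machine; the two explicit forms give the same answer everywhere.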

2. Strings
----------

According to the current draft, any non-ASCII characters included in
string magic must be written with octal escapes. Providing a mechanism
to support hexadecimal escape sequences would be beneficial to writers
of magic files.

A common coding practice is to specify magic in a hexadecimal format;
manual conversion from hexadecimal numbers to "hexadecimal strings" is
very simple (especially on big-endian machines).

Additionally, this is a very simple extension to support in file.

Darwin file supports the use of \x?? or \X?? to specify an escaped
character, where '??' is the value of the character in hexadecimal
notation.

Without this extension, SGI had to use the following magic, which has
the potential to match much more than IRIX vmcore dumps, especially
since "belong" isn't being used.

# New style crash dump file
0 long 0x43727368
>4 long 0x44756d70 IRIX vmcore dump of
>36 addr x '%s'

Darwin file was able to use:

# New style crash dump file
0 string \x43\x72\x73\x68\x44\x75\x6d\x70 IRIX vmcore dump of
>36 string >\0 '%s'

Translating from the "long long" 0x4372736844756d70 to the octal string
notation is an awkward and time-consuming task.
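As a rough illustration of the difference in effort, the following
Python sketch converts a numeric magic value into both escape notations.
The helper name magic_escapes is hypothetical, and the value is assumed
to be stored most significant byte first.

```python
def magic_escapes(value, width):
    """Return (hex_form, octal_form) string escapes for a magic number.

    Assumes the value is stored big-endian in `width` bytes.
    """
    raw = value.to_bytes(width, "big")
    # Hexadecimal escapes: each byte maps directly to two hex digits,
    # so the string can be read straight off the original number.
    hex_form = "".join("\\x%02x" % b for b in raw)
    # Octal escapes: every byte has to be re-encoded in base 8.
    octal_form = "".join("\\%03o" % b for b in raw)
    return hex_form, octal_form

hex_form, octal_form = magic_escapes(0x4372736844756D70, 8)
print(hex_form)    # \x43\x72\x73\x68\x44\x75\x6d\x70
print(octal_form)  # \103\162\163\150\104\165\155\160
```

The hexadecimal form is a digit-for-digit copy of the number, while the
octal form shares no visible digits with it.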

3. Multiple levels of subtests
------------------------------

Multiple levels of subtests should be supported in magic files. This is
an essential facility for the accurate recognition of files.

The Solaris 5.4 magic(4) manual page lists the lack of multiple levels
of subtests as a bug.

+----
| BUGS
| There should be more than one level of subtests, with the
| level indicated by the number of `>' at the beginning of the
| line.
+----

Darwin file supports multiple levels of subtests, in the manner
proposed by Sun's magic(4) manual page.
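The level-by-`>' convention can be sketched in a few lines of Python.
This is a simplified model, not the logic of any real file
implementation: the function names are hypothetical, the entries use
placeholder test names (A through E) instead of real magic, and the
matches callback stands in for the actual comparison against file data.

```python
def parse_level(line):
    """Split a magic line into (level, rest); level counts leading '>'."""
    stripped = line.lstrip(">")
    return len(line) - len(stripped), stripped

def run_tests(entries, matches):
    """Return the lines that are both reached and matched."""
    printed = []
    active = 0  # deepest level that may currently be tested
    for line in entries:
        level, rest = parse_level(line)
        if level > active:
            continue  # an enclosing test failed, so skip this subtest
        if matches(rest):
            printed.append(rest)
            active = level + 1  # subtests one level deeper are now open
        else:
            active = level      # siblings stay open, deeper levels close
    return printed

entries = ["0 A", ">4 B", ">>5 C", ">4 D", ">>5 E"]
# Pretend every test except D matches the file being examined.
result = run_tests(entries, lambda rest: rest.split()[1] != "D")
print(result)  # ['0 A', '4 B', '5 C']
```

Test E is never evaluated because its parent D failed, which is the
behavior a single level of subtests cannot express.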

For example, the following section of a magic file, intended for ELF
objects, becomes impossible without multiple test levels. Please also
note the use of byte-order specific magic.

[insert ELF magic here]

4. Required <type> strings
--------------------------

Each <type> string is completely in lower case, but several of the words
are almost always written with some (or all) upper case characters. The
left column of the table even lists several items containing upper case
characters: "FIFO", "C program text", and "FORTRAN program text".

POSIX should allow either upper or lower case output or perhaps it
should specify a sensible usage of capital letters.

5. Types
--------

Section 5.14.7, subsection "type", contains very serious problems.

The number of bytes for any particular type must not be implementation
defined. Doing so would make magic files completely non-portable.
Files should be identifiable on any system.
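The portability problem can be seen with Python's struct module, which
offers both host-defined and fixed sizes for the same type code. This is
only a sketch; treating a magic "long" as 4 bytes and a "short" as
2 bytes is an assumption consistent with the 32-bit values used in the
examples above, and read_long is a hypothetical helper.

```python
import struct

# Native C "long": the size is whatever the host compiler chose
# (commonly 4 or 8 bytes), so a test built on it is not portable.
native_long = struct.calcsize("@l")

# Fixed-size interpretation: identical on every host.
fixed_long = struct.calcsize(">l")    # always 4
fixed_short = struct.calcsize(">h")   # always 2

def read_long(data, offset=0):
    """Read a 4-byte big-endian value the same way on every host."""
    return struct.unpack_from(">L", data, offset)[0]

print(native_long, fixed_long, fixed_short)
print(hex(read_long(b"CrshDump", 4)))  # 0x44756d70
```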

This section of the specification also drastically breaks historical
implementations, including Darwin file and Sun's file.
------- end ----------------------------