Re: RFC first cut -- springboard for discussion

Daniel Quinlan (quinlan@proton.pathname.com)
Thu, 31 Oct 96 23:54 PST

General comments.

> HOW TO PICK MAGIC NUMBERS
>
> GLOSSARY
>
> Primary magic:
>
> Magic numbers used to identify the type of data stream or file. Any given
> file has only one primary magic block.

Possibly superfluous clarification: some files, such as archives, may
be composed of multiple files, each containing a primary magic block,
but the archive itself still only has one primary magic -- the rest is
data in the file.

> Secondary magic
>
> Magic numbers used to identify characteristics of the data stream
> or file such as version, sub-type, etc. A file may have more than
> one secondary magic block.

And secondary magic can be composed of multiple levels of tests.

> REQUIREMENTS

> The first and third are required for compatibility with both
> traditional and POSIX versions of file(1). To be eligible for
> inclusion in the Registry, a new file format MUST have these properties:

> 1. The primary and all secondary magic blocks must be located at predictable
                                                ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
> constant offsets from the beginning of the data stream. (The purpose of
> this requirement is to permit tests for magic to be expressed in the
> simple, rapidly-interpretable notation of file(1)).

Define predictable. Predictable means "at a fixed offset" from the
beginning of the file. The secondary magic may be ignored or utilized
depending on the results of a higher-level magic entry.

We might distinguish between secondary magic and other information
about the file's characteristics. Sure, stuff can be all over, but it
can't be "secondary magic" if we can't find it easily.
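
A minimal sketch of what "fixed offset" plus "secondary magic only after
a higher-level match" could look like. The format, its magic bytes, and
the names PRIMARY, SECONDARY, match_at, and identify are all made up for
illustration; this is not how file(1) is implemented.

```python
# Hypothetical format: 4-byte primary magic at offset 0, followed by a
# 1-byte version field (secondary magic) at fixed offset 4.
PRIMARY = (0, b"\x89XYZ")            # (offset, bytes), invented values
SECONDARY = {                        # consulted only if PRIMARY matches
    (4, b"\x01"): "version 1",
    (4, b"\x02"): "version 2",
}


def match_at(data: bytes, offset: int, magic: bytes) -> bool:
    """True if `magic` appears at exactly `offset` in `data`."""
    return data[offset:offset + len(magic)] == magic


def identify(data: bytes):
    """Return a description, or None if the primary magic is absent."""
    if not match_at(data, *PRIMARY):
        return None                  # secondary tests are never tried
    for (offset, magic), label in SECONDARY.items():
        if match_at(data, offset, magic):
            return "XYZ data, " + label
    return "XYZ data"                # primary matched, no known sub-type
```

Every test is a plain comparison at a constant offset, so nothing has to
be searched for; anything that would need a search is "other
information", not secondary magic.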

> 2. The primary and all secondary magic blocks must be located within
> 512 bytes of the beginning of the file or stream. (The purpose
> of this requirement is to limit the amount of data which
> MAGIC-compliant programs must read to determine a file's type.)

No, only make the primary magic satisfy this requirement. We might go
for 1k too. 1k is not a big deal.

Here are the only active magic entries that appear above offset 50
or so.

216   lelong   0421                             Linux/i386 core file
257   string   ustar\0                          POSIX tar archive
257   string   ustar\040\040\0                  GNU tar archive
596   string   \130\337\377\377                 Ultrix core file
1080  leshort  0xEF53                           Linux/i386 ext2 filesystem
2048  string   PCD_IPI                          Kodak Photo CD image pack file
2080  string   Microsoft\ Word\ 6.0\ Document   %s
2080  string   Microsoft\ Excel\ 5.0\ Worksheet %s
4086  string   SWAP-SPACE                       Linux/i386 swap file

On this basis, I'm tempted to require <4k, but recommend <512 bytes.
*Nothing* is past 4k.
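
A quick way to compare candidate limits against the entries listed
above. The byte lengths are an assumption (a reading of each test value:
4 for lelong, 2 for leshort, string length with escapes decoded), and
ENTRIES and excluded are illustrative names only.

```python
# (offset, length of the tested value in bytes, description)
ENTRIES = [
    (216, 4, "Linux/i386 core file"),             # lelong
    (257, 6, "POSIX tar archive"),                # "ustar\0"
    (257, 8, "GNU tar archive"),                  # "ustar  \0"
    (596, 4, "Ultrix core file"),                 # four octal escapes
    (1080, 2, "Linux/i386 ext2 filesystem"),      # leshort
    (2048, 7, "Kodak Photo CD image pack file"),  # "PCD_IPI"
    (2080, 27, "Microsoft Word 6.0 Document"),
    (2080, 29, "Microsoft Excel 5.0 Worksheet"),
    (4086, 10, "Linux/i386 swap file"),           # "SWAP-SPACE"
]


def excluded(limit: int):
    """Entries whose magic is not fully inside the first `limit` bytes."""
    return [desc for offset, length, desc in ENTRIES
            if offset + length > limit]


for limit in (512, 1024, 4096):
    print(limit, excluded(limit))
```

Under these assumed lengths, three of the nine entries fit within 512
bytes, ext2 still misses a 1k limit, and the swap-space string ends
exactly at byte 4096.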

> 1. Both primary and secondary magic blocks should be limited to 8 bytes each.

Limited? Why? More than 8 is fine. It's less than 4 that is a problem!
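
A back-of-the-envelope sketch of why short magic is the problem. It
assumes the bytes being tested are uniformly random, which real files
are not, so treat the numbers as optimistic; false_match_probability is
a name invented here.

```python
from fractions import Fraction


def false_match_probability(n_bytes: int) -> Fraction:
    """Chance that n uniformly random bytes equal a given n-byte magic."""
    return Fraction(1, 256 ** n_bytes)


for n in (2, 4, 8):
    print(n, "bytes:", float(false_match_probability(n)))
```

A 2-byte magic matches random data once in 65536 tries; a 4-byte magic
once in about 4.3 billion.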

> 2. Primary magic blocks should not contain NULs.

Hmm... on second thought, it's really more complicated than that. NUL
bytes are okay, as evidenced by the POSIX tar archive. We know it
isn't plain text that way.

We should say something more general: "should not contain repeated
characters, especially in sequence".
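
One possible check for the "repeated characters in sequence" wording.
The function name and the default run length of 2 are assumptions, not
anything the draft specifies.

```python
def has_repeated_run(magic: bytes, min_run: int = 2) -> bool:
    """True if any byte value occurs `min_run` or more times in a row."""
    run = 1
    for prev, cur in zip(magic, magic[1:]):
        run = run + 1 if cur == prev else 1   # extend or restart the run
        if run >= min_run:
            return True
    return False


print(has_repeated_run(b"ustar\0"))             # single NUL, no run
print(has_repeated_run(b"\x00\x00\x00\x00"))    # all NULs
```

Under this rule the POSIX tar magic passes even though it contains a
NUL, while a block of nothing but NULs does not.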

> 3. Magic should be not composed solely of characters in ASCII, ISO Latin-1,
> or other standard character sets unless the file type is a text format
> with characters limited to that set.
>
> 4. For binary files, magic should be a random mix of text and non-text
> characters, preferably including characters in each of these ranges:
> ASCII, Latin-1 extensions to ASCII, and everything else.
>
> 5. The primary magic should be constant for a particular application or data
> type. Only secondary magic should change depending on the version of
> an application or data type.

Right on the money!
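
For item 4, a rough sketch of how the "mix of ranges" recommendation
could be checked. The byte ranges are one interpretation (printable
ASCII 0x20-0x7E, Latin-1 extensions 0xA0-0xFF, everything else); the
draft does not define them, and classify and ranges_used are
illustrative names.

```python
def classify(byte: int) -> str:
    """Place a byte value in one of three assumed ranges."""
    if 0x20 <= byte <= 0x7E:
        return "ascii"       # printable ASCII
    if 0xA0 <= byte <= 0xFF:
        return "latin1"      # Latin-1 extensions to ASCII
    return "other"           # control characters and 0x7F-0x9F


def ranges_used(magic: bytes) -> set:
    """Set of range names that occur in `magic`."""
    return {classify(b) for b in magic}


# The 8-byte PNG signature, as an example of a mixed binary magic.
PNG_SIGNATURE = b"\x89PNG\r\n\x1a\n"
print(sorted(ranges_used(PNG_SIGNATURE)))
```

By this reading the PNG signature covers two of the three ranges
(printable ASCII and "everything else").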

> (2) Contact
>
> A Web page or email contact for the person(s) responsible for the
> format.

Contacting responsible parties is an interesting problem. Who can be
responsible?

1. Single Person (John Cristy for Magick Image File Format)
2. Committee or group (PNG authors or GNU developers)
3. Company or Organization (Sun Microsystems)

We should consider each case, but without using a three-way case
statement (since there may be others).

Contact information:
Name of Responsible Party
Email address(es) of Responsible Party
URL of Responsible Party (optional)
Postal address of Responsible Party (optional)

Some contacts may be blank.

> (3) Status
>
> One of: Experimental, Production, or Obsolete.

One of: Experimental, Maintained, Unmaintained, Obsolete, Unknown.
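
The contact fields and the five status values suggested above could be
recorded along these lines. This is only a sketch: the class and field
names are invented here, and a single "responsible party" record stands
in for person, group, or company alike, so no three-way case is needed.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import List, Optional


class Status(Enum):
    EXPERIMENTAL = "Experimental"
    MAINTAINED = "Maintained"
    UNMAINTAINED = "Unmaintained"
    OBSOLETE = "Obsolete"
    UNKNOWN = "Unknown"


@dataclass
class Contact:
    """Responsible party; any field may be left blank."""
    name: str = ""
    emails: List[str] = field(default_factory=list)
    url: Optional[str] = None               # optional
    postal_address: Optional[str] = None    # optional


@dataclass
class RegistryEntry:                        # hypothetical record layout
    format_name: str
    contact: Contact = field(default_factory=Contact)
    status: Status = Status.UNKNOWN


entry = RegistryEntry("PNG", Contact(name="PNG authors"))
print(entry.status.value, entry.contact.emails)
```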

> (4) Code
>
> A code of at most 8 ASCII characters intended for use as a type
> representation icon in text-only navigation tools.

Mention Braille.

That's all for now.

Dan

-- 
Daniel Quinlan                  http://www.pathname.com/~quinlan
quinlan@pathname.com            quinlan@transmeta.com (at work)