Trove Design Document <author>by Eric S. Raymond <date>$Id: trove-design.m4,v 1.16 1998/06/25 15:49:23 esr Exp $ <abstract> Trove is a next-generation Internet software archiving facility, intended to supplant the classical FTP-tree-with-decorations model. This document describes the history, design, architecture, and user interface of Trove. It is a work in progress, intended both to guide implementation and to document the project. </abstract> <toc> <sect>Introduction <p> <sect1>Why Trove? <p> The `classical' model of Internet software archive (exemplified by <url url="http://sunsite.unc.edu" name="Sunsite">, WWW frosting on an FTP cake) is no longer adequate to the increasing size and evolutionary speed of the open-source community. It eats too much maintainer time; the classification/search mechanisms are woefully weak; and the package namespace has no collision detection. One of us (Eric Raymond) had been Sunsite's principal maintainer for more than a year before Trove got started. Eric wrote the <url url="http://sunsite.unc.edu/search" name="keeper"> tool, which does about as good a job as possible of automating away the scutwork under the present system. It's not good enough. The amount of maintainer time Sunsite requires is rising to the point where the archive is not sustainable. On present trends, Eric thinks Sunsite's system (or its maintainers) will collapse by the end of 1998. Some prominent Python people (including Ken Manheimer, Andrew Kuchling, and Guido van Rossum) had realized for a while that they were facing similar problems in the future of the Python archive, and had begun discussing a redesign they thought of as the `locator' project.<P> The concept of the Trove project was originally floated by Eric Raymond in early April 1998. Within a week, he was approached by Guido van Rossum about joining forces. 
By the end of April, when the project and the Trove web pages were officially launched, principals included Ken Manheimer and Andrew Kuchling of the Python Software Activity. Ken Manheimer proposed the name `Trove'. John Cowan provided valuable expertise in database design and IR pragmatics.<P> <sect1>Terminology <p> For purposes of this document, a <em>resource</em> is a file such as a source or binary archive, an RPM or Debian installable package, a document, etc. A resource may have associated <em>metadata</em> (such as a description of the resource). Related resources will be grouped into a <em>package</em>, which will have associated metadata of its own (including but not limited to author's name, the project home page location, etc.). The metadata exists to provide a handle on packages and resources, making them discoverable through searching and browsing facilities. A <em>search</em> is any selection operation that returns a subset of the archive metadata. A <em>site ring</em> is a collection of Trove sites that mirror each other's metadata (so that a search of any is effectively a search of all). 
<sect>Objectives and Architecture <p> <sect1>Objectives <p> <sect2>Primary Objectives <p> <itemize> <item> CONTRIBUTOR-DRIVEN: Minimize the need for intervention by archive maintainers, so the system scales up to the capacity of the automation, rather than the availability of maintainers.</item> <item> SEARCHABLE: Support access to packages through a rich, user-friendly keyword and text-search-based interface, rather than topic directories.</item> <item> NON-RESTRICTIVE (referred to below as ENABLING): the design should be enabling rather than restrictive -- it should not force use of a single interface or server that might become a performance or (more importantly) a conceptual bottleneck.</item> <item> LOCATION-INDEPENDENCE: the metadata representation and Trove tools should be indifferent to where resources are actually stored.</item> <item> RICH METADATA: Per-package metadata should have at least the descriptive power of the best-of-breed installable package format, which means RPM.</item> <item> NOTIFICATION: Anyone should be able to sign up to be notified when a package's resources or its metadata are updated.</item> <item> MIRRORABILITY: It must be possible for an entire Trove site (resources and metadata both) to be mirrored for load-sharing purposes.</item> <item> DISTRIBUTOR-FRIENDLINESS: One of the deliverables should be a tool or access mode that collects copies of all resources and metadata turned up by a given search, so that CD-ROM distributors can make distributable snapshots of the archive or subsets of it.</item> <item> CONFIGURABILITY: Full configurability of things like keyword categories, so the software can be used for multiple archives with different policies (in particular, both son-of-Sunsite and the Python archive).</item> <item> SCALABILITY: Must scale well, up to Sunsite's level of traffic and beyond. 
Verifying this scalability before releasing will be important.</item> </itemize> <sect2>Secondary Objectives <p> <itemize> <item> PERFORMANCE: It would be a good idea (for performance) if running CGIs were only required for searching and for modifying the database, and everything else was available as static HTML files.</item> <item> AUTHENTICATION: Strong authentication for packages and package updates, like what Debian does.</item> <item> META-ARCHIVE: Meta-archive functions -- queries to one Trove service may automatically also be forwarded to other Trove services.</item> <item> EMAIL: Support metadata updates by email to a robot.</item> <item> CRAWLER: Support an optional `trusted remote metadata' field in the metadata and write a crawler that polls these for metadata updates.</item> </itemize> <sect2>Blue Sky <p> <itemize> <item> DEPENDENCIES: Teach Trove to extract inter-resource dependencies by analyzing binaries. Long-term project! </item> </itemize> <sect1>Architectural Implications <p> To achieve the CONTRIBUTOR-DRIVEN objective, submissions and updates will normally be done through a Web form with upload capability. Maintaining metadata will be the responsibility of each package's authors and maintainers. The ENABLING objective implies that at least package resources (if not the metadata) should be directly accessible via FTP or the Web. The LOCATION-INDEPENDENCE objective implies that all resource pointers in metadata are actually URLs. The ENABLING and LOCATION-INDEPENDENCE objectives together require that the Trove data architecture must have a clean separation between two parts: the <em>catalog</em>, a database holding package metadata, and the <em>archive</em>, a local FTP/Web tree holding some (but not necessarily all) of the resources pointed to by the catalog. The ENABLING and PERFORMANCE objectives further imply that as much as possible of the catalog view should be available through unmediated Web and FTP access into the archive. 
This implies making HTML and plaintext versions of package metadata available in the archive, updated automatically when the master copy in the catalog database changes. To achieve RICH METADATA, we must roughly capture RPM's annotation semantics. See the appendix on <ref id="rpm-import" name="importing RPMs">. The NOTIFICATION objective implies that each package's metadata must include a mailing list, and that the interface must support subscription and unsubscription facilities. The SCALABILITY requirement implies managing the metadata with a real database capable of handling high transaction volumes. For the ENABLING and EMAIL and CRAWLER objectives, we must define a plain-text tag format for rendering metadata. We'll use this to (1) represent the metadata in FTP-accessible files in the archive, (2) define the required format for email submissions, and (3) define the required format for trusted remote metadata. The plain-text tag format will come up again, so it needs a name: TRL, for Trove Request Language. <sect1>Architecture <p> The foregoing objectives make the general architecture of the system pretty clear. A Trove site will consist of the following parts: <itemize> <item> The <bf>catalog</bf> -- a database of metadata records, including URIs pointing to resources.</item> <item> The <bf>archive</bf>, a local directory tree containing resources managed by the Trove software but independently FTP- and Web-accessible. (Some Trove sites may not have an archive, instead being purely registries of metadata and pointers.)</item> <item> The <bf>shovel</bf>, a serializing front end that translates TRL requests on its standard input into database actions. The shovel is the only program that modifies the database directly. It's the shovel's job to ensure transaction atomicity.</item> <item> The <bf>librarian</bf>, a collection of web pages and CGIs that mediates interactive access to the library (the catalog and archives together) through Web browsers. 
The librarian manipulates the database by making TRL service requests through the shovel program. It may query the database directly.</item> <item> The <bf>crawler</bf>, a program that periodically attempts to update the library by polling maintainer sites specified in metadata. The crawler makes TRL service requests through the shovel program. (Some Trove sites may not have a crawler.)</item> <item> The <bf>mailbot</bf>, a program that accepts email updates in TRL format. The mail robot makes service requests through the shovel program.</item> </itemize> The structure of TRL, with an example, is discussed <ref id="TRL" name="in the Appendix">. <sect2>Fundamental Types and Namespace Control <p> To reason about the design, we need to know what kinds of things will be in the Trove database and how they are named (that is, what handles they can be retrieved by). Some of this has been touched on in the section on terminology. There are three different kinds of objects in the Trove universe. These are: <itemize> <item><em>Resource</em> A <em>resource</em> is `real' data, a source or binary archive or document of the kind a Trove archive is intended to serve. In the Trove universe, a resource is represented by a <em>resource record</em> that must include a URL to where the resource actually lives and may include other metadata (such as a description). The name of a resource is the URL of the resource. Accordingly, any given resource name always identifies exactly one resource. <item><em>Package</em> A <em>package</em> is a collection of resources tied together by a <em>package record</em>. The associated resources may be the same program or document in several different forms (such as source archive, binary archive, installable package, etc.) or it may be a group of related resources such as the individual components of a multiple-program project. 
Besides resource names, package records contain other metadata intended to facilitate finding packages by topic or subject area, including both a text description and controlled-vocabulary keywords (discriminators). The name of a package is an arbitrary identifier chosen by the package record creator (its initial owner) and changeable by the package record owner. A package may have any number of resources associated with it. In general, any given resource will only belong to one package, but exceptions are harmless. <item><em>Person</em> A <em>person</em> record associates metadata with an RFC822 email name/address pair. The metadata may include such things as a home-page location, a PGP public key (as an optimization, in order to make a public-key-server lookup on each submission unnecessary), etc. Person records exist so that Trove users can go from a package to its maintainers to their home pages and other projects. A person is named by the email address part of their name (which is unique). </itemize> All three kinds of objects are always explicitly created, modified, and deleted, with a notification to interested parties on each action. The general policy on name validation is that references to unregistered people and packages are not errors. Thus, maintainers of a package need not be in the Person table as long as they have syntactically valid email addresses; and package relations may refer to packages by name that are not registered in Trove. This implies that every creation of a Package or Person record needs a global check to mark references it suddenly fills, but that is an acceptable price for making the namespace open rather than closed. Issue: We know that package names will be unique per site. Are they unique across all sites in the Trove ring? If not, how do we do synchronization when rings merge? And how do crawlers know which package they are responsible for? <sect2>Catalog architecture <p> The catalog will be stored in a database. 
The <url name="schema" url="http://www.tuxedo.org/~esr/trove"> is available at the Trove website. <sect2>Archive architecture <p> To make the rest of this document concrete, we need to specify an organization for the archive part. Here it is: Each project has a directory. The name of the directory is the name of the project, <em>without</em> a version number (this is so project directories can contain multiple versions). Observe the implication that project names must be unique per Trove site. Project directories may live directly under a per-site root, or (for performance) under superdirectories which express some kind of hash on the names. It is important for bare-FTP accessibility that this hash be easy for human beings to calculate by inspection. Example: terminfo's scheme of having each terminal type live in a superdirectory named after the first character of the terminal type name. Whether such a scheme is used, and what it is, is per-site policy. Within each project's directory live all its associated local resources. Other resources may live offsite (the catalog records don't care, they use URIs for everything). The directory will also contain FTP and HTML versions of the package's metadata, as files named %%INDEX.TRL and index.html respectively. The former name is chosen to sort as early as possible in an FTP directory listing without including Unix shell metacharacters; the latter, to be the page automatically displayed by a browser pointed at the directory. <sect2>Librarian architecture <p> The librarian will be a set of HTML pages and CGIs that mediate between users (including uploaders and maintainers) and the library. It will be necessary for the librarian to maintain state through multiple-form transactions. For discussion of the librarian design, see the major section on <ref id="interface" name="user interface design"> below. 
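Returning to the archive layout described under Archive architecture above, the mapping from a project name to its directory and metadata files can be sketched in a few lines of Python. This is only an illustration: it assumes the terminfo-style first-character superdirectory (which the text offers as an example, not as settled policy), and the function names and the `/archive` root used below are hypothetical.

```python
import posixpath

def project_dir(root, name, hashed=True):
    """Return the archive directory for a project; `name` carries no version."""
    if not name:
        raise ValueError("project name must be non-empty")
    if hashed:
        # Assumed per-site policy: superdirectory named after the first
        # character of the project name, as in the terminfo example.
        return posixpath.join(root, name[0], name)
    return posixpath.join(root, name)

def metadata_files(root, name, hashed=True):
    """Return the paths of the plain-text and HTML metadata files."""
    d = project_dir(root, name, hashed)
    return posixpath.join(d, "%%INDEX.TRL"), posixpath.join(d, "index.html")
```

For instance, `project_dir("/archive", "fetchmail")` gives `/archive/f/fetchmail`, a hash a human can compute by inspection. Because `%` sorts before digits and letters in ASCII, `%%INDEX.TRL` lands ahead of typical resource file names in a plain sorted listing.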
<sect2>Mail-Robot and Crawler architecture <p> These will be programs that, essentially, translate metadata submissions in TRL into actions on the archive. The only difference between them will be that the email robot waits for input fed to it through a mail alias, while the crawler looks for descriptions in remote locations specified by metadata URIs. In both cases, a parse error or package name collision or other exception will generate email to the submitting party and contact persons given in both the new and old metadata. <sect1>Architecture Open Issues <p> What do we use as the database back end? Postgres95? SOLID? MySQL? Something else? <sect>User Interface Design<label id="interface"> <p> <sect1>User Roles <p> To understand the interface, it will be helpful to recognize that different kinds of people will be using the archive: <sect2>Users <p> Users (people looking for packages that match their requirements to download) are presented with a search/browse form on the Web. The search/browse form allows them to enter search terms (discriminators). The keywords may be selected with buttons from a controlled vocabulary defined by site policy, or entered as `roll-your-owns' in a text field for free-text searching of package descriptions. Searches would yield all targets that are in the intersection set of the controlled-keyword hits, intersected with all hits from a search for roll-your-own keywords in package text descriptions. We'll give below a more detailed description of the handling of <ref id="keytrees" name="controlled-vocabulary keywords">. The result of a search/browse operation is a generated HTML <bf>catalog listing</bf>. The body of a catalog listing consists of a series of one-line entries each beginning with a package-name hotlink and including a one-line package summary. The catalog has section headers indicating which lines are controlled-keyword hits and which are free-text hits. 
Users looking at a catalog listing may either refine the search or look at individual entries that interest them (by chasing the package-name hotlinks). An individual entry displays all package metadata contained in the Trove database, possibly including resource links to a local cache of package resource files. When an individual entry is selected, a user may take one of several actions: <itemize> <item> Chase a resource hotlink on the package metadata display (such as the package home page URL, or a mailto URL for the package contact person).</item> <item> Download package resources (e.g. by chasing FTP hotlinks on the package metadata display).</item> <item> Subscribe or unsubscribe to the package's notification list (that is, the list of people automatically notified by email whenever package metadata or resources are changed). All unsubscription requests must be authenticated; this is to prevent bad guys from masquerading as good guys in order to suppress notifications.</item> <item> Attach a review annotation to the package. (This is a future feature and has not yet been designed into the database schema.)</item> </itemize> <sect2>Contributors <p> Contributors have three tasks: (1) creating and updating resources, (2) creating and updating package records, and (3) creating and updating person records. All three of these things are done in the same way; by submitting TRL requests to a Trove site. A contributor can either mail a TRL request to a Trove mail robot, or use a TRL request to register a URL where update requests for a given package can be found. Periodically a Trove crawler will go to the registered URL to pick up a new copy of the metadata. The crawler keeps internal track of the last-upload time and will only actually copy the metadata if the file has since been altered. In either case, if the request is authenticated (PGP-signed) it will be executed immediately and a report emailed to the contributor. 
If the request is not authenticated, it will be emailed back to the contributor's home address for confirmation; the confirmation message will include a request ticket. Replying to the confirmation email will ship it back to Trove for execution of the request. TRL requests may also be generated by a browser session with Trove (in fact, this will be normal for initial package creation, and will assist the contributor in editing controlled-vocabulary fields like discriminators). When the request is committed, a TRL copy will be made and emailed to the contributor, including a request ticket. When a Trove server finally executes a request, it makes local copies of any attached and replica resources. It then makes whatever modifications are permitted by the request's privilege level and the locked/unlocked state of the items the request wants to modify. Finally, an email report of all changes made (and any updates refused) is sent to everyone on the package notification list. <sect2>Administrators <p> Trove site administrators get email notification of all package creations (so they can watch for site-policy violations). Trove site administrators can use a web form to view a catalog of recently added entries, and delete or modify them if there appears to be some problem. Administrators are also responsible for watching logs of roll-your-own keyword entries and noticing when keywords should be migrated into the core keyword set described in site policy. <sect2>Mirror Makers <p> Mirror makers include both CD-ROM distribution makers and people running load-sharing mirrors of a Trove site. Both have the same requirement, which is to be able to snapshot the library to another medium. The CD-ROM distributor's medium happens to be read-only, but he has every good reason to simply ship an instance of Trove as his organizer for the archive. <sect1>Searching and Browsing <p> <label id="keytrees">To flesh out the user interface, we also need to specify how users will find packages. 
We have developed a unified searching/browsing model for the Trove project. This model was motivated by a desire to structure the controlled-keyword set so the user doesn't have to see all of it while specifying a package (that is, some keywords become `visible' only after given other keywords have been chosen). <sect2>Discriminators, Packages, and Searches <p> The general model is that the set of controlled keywords is structured like the nodes of a tree (or more generally like a directed acyclic graph). A keyword is selectable only when a predecessor node has been selected. A <em>discriminator</em> is a rooted path in the tree (e.g. a sequence of keywords going from root to most specific). Each package has a list of discriminators associated with it. A package matches a given discriminator if the package has some discriminator of which the given discriminator is a prefix. Thus, if a package has the discriminator /a/b/c/d, any of the given discriminators /a, /a/b, /a/b/c, or /a/b/c/d will match it. The discriminators a, b, c, d, or a/b, or c/d will also match it (but a/d would not). A <em>search</em> is a set of discriminators created by the user, plus uncontrolled keywords to match against package description fields. The result of the search is the intersection set of all packages that match every discriminator in the search, plus the set of all packages whose descriptions matched the uncontrolled keywords. <sect2>An Example <p> As an example for this model, consider a user wishing to explore what graphics viewers are available in a Trove catalog. The user would like to be able to specify both (a) graphics formats of interest, and (b) a display toolkit (SVGA, Xlib, Motif, etc.). Let us suppose that the user is looking for a Motif GIF viewer. The user's search might be performed by building the following pair of discriminators: <bf>/topic/graphics/viewers/gif</bf> and <bf>/interface/toolkit/motif</bf>. 
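The matching rule above is compact enough to sketch in Python. The following is a minimal illustration, not the Trove implementation: it reads a leading slash as "rooted, so match as a prefix" and treats a discriminator without one as a contiguous run of keywords anywhere in the path, which is one reading of the /a/b/c/d examples. Matching is case-sensitive here by assumption, the free-text (uncontrolled keyword) half of a search is left out, and the function names and package names are hypothetical.

```python
def parse_discriminator(disc):
    """Return (rooted, segments) for a discriminator such as '/a/b' or 'c/d'."""
    rooted = disc.startswith("/")
    return rooted, [s for s in disc.strip("/").split("/") if s]

def matches(package_disc, given):
    """True if `given` matches one discriminator attached to a package."""
    _, have = parse_discriminator(package_disc)
    rooted, want = parse_discriminator(given)
    if not want or len(want) > len(have):
        return False
    if rooted:
        # Rooted form: the given discriminator must be a prefix.
        return have[:len(want)] == want
    # Unrooted form (assumed reading): any contiguous run of keywords.
    return any(have[i:i + len(want)] == want
               for i in range(len(have) - len(want) + 1))

def search(catalog, discriminators):
    """Return the names of packages matching every given discriminator.

    `catalog` maps a package name to its list of discriminators.
    """
    return {name for name, discs in catalog.items()
            if all(any(matches(d, g) for d in discs) for g in discriminators)}

# Hypothetical catalog for the Motif GIF viewer example.
catalog = {
    "foobar": ["/topic/graphics/viewers/gif", "/interface/toolkit/motif"],
    "bazzam": ["/topic/graphics/viewers/gif", "/interface/curses"],
}
hits = search(catalog, ["/topic/graphics/viewers/gif", "/interface/toolkit/motif"])
```

With this catalog, `hits` contains only `foobar`, since `bazzam` matches the first discriminator but not the second.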
<sect2>Catalogs and Browsing <p> A search defines a <em>catalog</em>, a subset of the global catalog that is the entire set of metadata. This model unifies searching and browsing; one browses the set of Trove packages by editing a collection of selections, and viewing the catalog resulting after each edit. <sect1>Sketch of Interface Design <p> (Note: this sketch deliberately avoids specifying a detailed UI.) The user is initially presented with a menu consisting of all keywords that appear in the leftmost slot of any spec. (This is a subset of the keytree nodes adjacent to root.) On choosing one of these, the user is presented with all keywords which appear in slot 2 of a spec containing the chosen keyword in slot 1. This is browsing down the specs. In addition, any <em>package</em> tagged by a spec consisting solely of the chosen keyword is also displayed. There is also the further option <bf>Narrow Search</bf>. Further choices of keywords walk down the spec path, displaying any packages that have been tagged by the currently chosen spec, along with the keywords for the next level. Eventually one reaches a level where there are only packages and no further keywords. Choosing ``Narrow Search'' returns one to the top level, but now only those keywords and packages that can be located in the catalog consisting of the result of the previous browsing operation are selectable. Packages outside the catalog are ignored; excluded keywords are indicated in a non-selectable mode. The purpose of `graying out' keywords is to allow the user to see instantly that what is wanted is inconsistent with the currently narrowed search. Narrowing can be iterated, or it can be backed out of (by throwing away rightmost segments of the current spec). At all times the current narrowing list and the current spec are displayed, so that the user doesn't get lost. 
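The browsing steps just described can be sketched as a single function that, for the current spec, returns the keywords available at the next level together with the packages tagged by exactly that spec. This is a rough Python illustration under stated assumptions: specs are stored as tuples of keywords, narrowing is modeled as an optional set of package names produced by the previous browsing operation, and graying out of excluded keywords is not modeled. All names and the sample data are hypothetical.

```python
def packages_under(catalog, path):
    """Names of packages having some spec that starts with `path`."""
    path = tuple(path)
    return {name for name, specs in catalog.items()
            if any(spec[:len(path)] == path for spec in specs)}

def browse(catalog, path=(), within=None):
    """Return (next_keywords, packages) for the current spec `path`.

    `catalog` maps a package name to a list of specs (tuples of keywords).
    `within`, if given, is the narrowed catalog: packages outside it are
    ignored, which is how Narrow Search is modeled here.
    """
    path = tuple(path)
    keywords, packages = set(), set()
    for name, specs in catalog.items():
        if within is not None and name not in within:
            continue
        for spec in specs:
            if spec[:len(path)] != path:
                continue
            if len(spec) == len(path):
                packages.add(name)             # tagged by exactly this spec
            else:
                keywords.add(spec[len(path)])  # keyword in the next slot
    return sorted(keywords), sorted(packages)

# Hypothetical data: two GIF viewers and one format-independent viewer.
catalog = {
    "foobar": [("Topic", "Graphics", "Viewers", "GIF"),
               ("Interface", "Toolkit", "Motif")],
    "bazzam": [("Topic", "Graphics", "Viewers", "GIF"),
               ("Interface", "Curses")],
    "barfoo": [("Topic", "Graphics", "Viewers")],
}
narrowed = packages_under(catalog, ("Topic", "Graphics", "Viewers", "GIF"))
top_after_narrowing = browse(catalog, (), within=narrowed)
```

Backing out of a narrowing then amounts to dropping the `within` set, and backing out of a spec to dropping its rightmost keywords before calling `browse` again.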
In this style, the user can hardly tell the difference between browsing and searching: ideally the delay is always the same, so one does not assemble a ``search string'' and then hit ``Search''; instead, one simply browses and, if the current catalog seems to be too large, narrows. Depending on external considerations, a too-large catalog might elicit a warning such as ``There are 3500 packages available. You can |display| the full list or |narrow| your search.'' (where pipe bars bracket hotlinks). <sect2>Sample Session <p> Let's say the top-level screen provides keywords Topic, Interface, Audience, Status, etc. (I will use This Style or THIS STYLE for keywords, and all-lower-case-nn.nn for packages.) Choosing Topic, the user is presented with Compilers, Browsers, Graphics, etc. and Narrow Search (henceforth N.S.). Choosing Graphics, the user is presented with Painters, Drawers, Viewers, etc. and N.S. Choosing Viewers, the user is presented with GIF, JPEG, PNG, etc. the packages barfoo-2.2, zambaz-3.3 (these are packages that are not specialized as to format), and N.S. Choosing GIF, the user gets packages foobar-1.2, bazzam-3.4, etc., etc., etc. and N.S. There are too many to investigate in detail, so the user chooses N.S. and is returned to the top level. Choosing Interface, the user is presented with Dumb, Curses, Toolkit, etc. Choosing Toolkit, the user is presented with razbaz-9.99 (which uses a standardly available toolkit), Motif, KDE, etc. Choosing Motif, the user sees only foobar-1.2, the intersection of /Topic/Graphics/Viewers/GIF and /Interface/Toolkit/Motif. And chooses it. All of this could be done HTML-style, or using one or more choice boxes, or in any number of other ways - this is an abstract UI. <sect>Security and Authentication <p> There are two levels of protection in the Trove design. Which will operate depends on whether a contributor is authenticated or not. 
<sect1>Security through Visibility <p> The contributor who creates a package entry, and anyone who changes the package metadata or resources after the fact, will be put on the package's heads-up list. Every time the package metadata is modified after that, the updating contributor will be added to the heads-up list, and everybody on the heads-up list will be notified. The intent of this feature (and the requirement that unsubscription requests be validated) is to make sure that all metadata and resource changes are visible to everybody with a stake in the package. In particular, any modifications an unauthorized person succeeds in doing will be visible to the real package owners. <sect1>Security through Authentication <p> Either a resource or a package may be <em>locked</em>. When an item (resource or package) is locked, any request to modify it must be authenticated as coming from a <em>maintainer</em> of the package. Here are the rules of maintainership: 1. The <em>owner</em> of an item is the person who can add and delete maintainers. <itemize> <item> The owner of an item is automatically a maintainer of the item. </item> <item> The person who creates an item is its first owner.</item> <item> The owner may pass the owner role to another validated user.</item> </itemize> 2. Any maintainer of an item (package or resource) can modify or delete the item. 3. The maintainers of a package may delete associated resources. <sect1>How To Authenticate Requests <p> The Trove authentication system leverages the PGP-key infrastructure. To authenticate a TRL request, you must sign it with PGP. The shovel program will check the signature. If the key ID matches the TRL Contributor line, the request is considered authenticated. The big advantage of this system is that it relies on the existing public-key infrastructure. This means that Trove won't need to keep any secrets of its own, and contributors don't need any secrets other than their existing PGP passphrases. 
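The interaction between locking, maintainership, and authentication described above can be summarized in a small Python sketch. It is illustrative only: PGP signature checking is not shown and is represented by a boolean `authenticated` argument, people are identified by email address, and the class and function names are hypothetical. Requiring authentication for changes to the maintainer list is an assumption; the rules above only say that the owner is the person who may make them.

```python
from dataclasses import dataclass, field

@dataclass
class Item:
    """A package or resource record, reduced to its access-control fields."""
    owner: str
    maintainers: set = field(default_factory=set)
    locked: bool = False

    def is_maintainer(self, person):
        # The owner of an item is automatically a maintainer of it.
        return person == self.owner or person in self.maintainers

def may_modify(item, contributor, authenticated):
    """Unlocked items rely on visibility; locked ones need a signed
    request from a maintainer."""
    if not item.locked:
        return True
    return authenticated and item.is_maintainer(contributor)

def may_edit_maintainers(item, contributor, authenticated):
    """Only the owner adds or deletes maintainers (authentication assumed)."""
    return authenticated and contributor == item.owner

fetchmail = Item(owner="esr@thyrsus.com",
                 maintainers={"funk+@osu.edu"}, locked=True)
```

Here a signed request from `funk+@osu.edu` may modify the locked record, while an unsigned one, or a signed one from anybody outside the maintainer list, may not.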
<sect1>Package Authentication <p> To be specified. Base on the <url url="http://java.sun.com/products/jdk/1.1/docs/guide/jar/manifest.html" name="JAR approach"> suggested by Jeremy Hylton? <sect>TRL Language Reference <p> <label id="TRL"> TRL (Trove Request Language) is used for two central purposes in the Trove architecture. First, it is the request language used to request updates to a Trove archive. Second, it is the format in which FTP-browseable dumps of index data are emitted (and from which the index may be regenerated). <sect1>TRL Reference <p> Lexical analysis of TRL is simple (it's modeled on RFC822 message format). A TRL request consists of the required begin marker with version, followed by any number of tagged logical lines, followed by the required end marker, followed by an optional PGP signature. A tagged logical line consists of a tag, followed by a colon, followed by a line of text, optionally followed by continuation lines. A tag is any sequence of printable non-space, non-colon characters beginning with an alphabetic character. Continuation lines begin with a space or tab. Blank lines are ignored. # and end-of-line delimit comments. Values in keyword fields (Locked, Action, Icon-Action, Resource-Action, Resource-Role) are case-insensitive. Semantically, a TRL request consists of a preamble followed by any number of person or package updates. The preamble consists of a Contributor field followed by an optional Comment field. The Contributor should be the name of the person submitting the TRL request; it must correspond to the PGP key if both are present. A person update may include Home-Page, Authorization-Mode, and Authorization-Secret fields directly corresponding to the schema. A Rename-To field is also supported. A package update contains a package section followed by any number of resource sections. Order of lines within sections is not significant. 
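The lexical rules above are simple enough to sketch as a small tokenizer. The Python below is a minimal illustration, not the shovel's parser: it takes the rules literally (so a # anywhere on a line starts a comment, which would truncate a value containing #), joins continuation lines with a single space, and ignores everything after END-TRL, including attachments and the PGP signature. The function name `lex_trl` and the exact form of the markers (`BEGIN-TRL` followed by a version, `END-TRL` alone on a line) are assumptions drawn from the sample requests in this document.

```python
import re

# A tag starts with a letter and contains no whitespace or colon.
TAG_LINE = re.compile(r"^([A-Za-z][^\s:]*):[ \t]*(.*)$")
BEGIN_LINE = re.compile(r"^BEGIN-TRL\s+(\S+)$")

def lex_trl(text):
    """Return (version, [(tag, value), ...]) for one TRL request."""
    version, fields, ended = None, [], False
    for raw in text.splitlines():
        line = raw.split("#", 1)[0].rstrip()   # strip comment and trailing space
        if not line.strip():
            continue                            # blank lines are ignored
        if version is None:
            m = BEGIN_LINE.match(line)
            if not m:
                raise ValueError("missing BEGIN-TRL marker")
            version = m.group(1)
        elif line.strip() == "END-TRL":
            ended = True
            break
        elif line[0] in " \t":                  # continuation line
            if not fields:
                raise ValueError("continuation line before any tag")
            tag, value = fields[-1]
            fields[-1] = (tag, (value + " " + line.strip()).strip())
        else:
            m = TAG_LINE.match(line)
            if not m:
                raise ValueError("malformed line: %r" % raw)
            fields.append((m.group(1), m.group(2)))
    if version is None or not ended:
        raise ValueError("missing BEGIN-TRL or END-TRL marker")
    return version, fields

sample = """BEGIN-TRL 0.6
Contributor: "Eric S. Raymond" <esr@thyrsus.com>
# a comment line
Package: fetchmail
Description: fetchmail is a free,
    full-featured utility.

Home-Page: http://www.tuxedo.org/~esr/fetchmail
END-TRL
"""
version, fields = lex_trl(sample)
```

Grouping the resulting (tag, value) pairs into preamble, person, package, and resource sections is a separate semantic pass and is not attempted here.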
The END-TRL field may be followed by MIME-multipart attachments corresponding to `attached' resources in the TRL header. An optional cryptosignature following attached resources may be used to authenticate the request. Here's a motivating example of a package update.
<tscreen><verb>
BEGIN-TRL 0.6
Contributor: "Eric S. Raymond" <esr@thyrsus.com>
Comment: This is a sample

# Replace package metadata
# Note: if this TRL were a dump rather than an action request, it
# would include Created and Last-Modified date fields and an
# Update-Count integer.
Package: fetchmail
Summary: A full-featured POP/IMAP mail retrieval daemon.
Description: fetchmail is a free, full-featured, robust, and
    well-documented remote mail retrieval and forwarding utility
    intended to be used over on-demand TCP/IP links (such as SLIP or
    PPP connections). It retrieves mail from remote POP and IMAP
    servers and forwards it to your local (client) machine's delivery
    system, so it can then be read by normal mail user agents such
    as mutt, elm, pine, or mailx. Comes with an interactive GUI
    configurator suitable for end-users.
Update-Notes: Anybody running a version older than 4.3.0 should
    definitely upgrade.
Latest-Version: 4.5.0
Last-Stable-Version: 4.5.0
Icon: http://www.tuxedo.org/~esr/fetchmail/fetchmail.gif
Icon-Location: replica
Home-Page: http://www.tuxedo.org/~esr/fetchmail
Crawl-To: http://www.tuxedo.org/~esr/fetchmail/TROVE-METADATA
Owner: "Eric S. Raymond" <esr@thyrsus.com>
#
# The following are list fields which update package-to-person relations
#
Authors: "Eric S. Raymond" <esr@thyrsus.com>
Contacts: "Eric S. Raymond" <esr@thyrsus.com>
Maintainers: "Eric S. Raymond" <esr@thyrsus.com>,
    "Rob Funk" <funk+@osu.edu>,
    "Dave Bodenstab" <imdave@mcs.net>,
    "Al Youngwerth" <alberty@apexxtech.com>
# This adds a person to the package notification list.
# The entire list could have been set with a `Notify' header,
# or individual unsubscriptions done with an `Unsubscribe' header.
Subscribe: "Catherine Olanich Raymond" <cor@ccil.org>
#
# The following are list fields which update package-to-package
# relations. The two other relations are Extends and See-Also.
#
Supersedes: popclient
Requires: smtpdaemon
#
Discriminators: system/mail/{pop, imap},
    audience/{end-users, sysadmins},
    status/production,
    embedding/application,
    interaction/utility,
    license/GPL,
    platforms/{Linux, BSD},
Locked: TRUE
Action: replace

# Delete old source tarball
Resource: http://www.tuxedo.org/~esr/fetchmail/fetchmail-4.4.8.tar.gz
Action: delete

# Create new source tarball.
# Note: if this TRL were a dump rather than an action request, it
# would include Created and Last-Modified date fields and an
# Update-Count integer.
Resource: http://www.tuxedo.org/~esr/fetchmail/fetchmail-4.4.9.tar.gz
Resource-Role: source
Resource-Location: replica
Version: 4.4.9
MIME-Type: application/data
Description: Gzipped source tarball of fetchmail sources
Locked: TRUE
Action: replace

# Change version field of existing metadata for FAQ
Resource: http://www.tuxedo.org/~esr/fetchmail/fetchmail-FAQ.html
Resource-Role: documentation
Version: 4.4.9
Action: merge
END-TRL
</verb></tscreen>
<sect2>Person fields <p> All Person update requests must be authenticated. <itemize> <item><em>Person</em>: RFC822 name/address pair of the person this record is about. This field will be used as the key when attempting to fetch a PGP public key to verify requests. <item><em>Home-Page</em>: WWW home page of the person. <item><em>Rename-To</em>: (Updates only) Specifies an RFC822 name/address pair to replace the Person address with (replacement will be performed throughout the database). </itemize> This is a test load for the Person record parsing. It would have the effect of replacing all metadata references to "Eric S. Raymond" with "Thaddeus Q. Foonly". Normally such an update would be used to change a contributor's primary email address, the spelling of his/her name, or his/her home page. 
<tscreen><verb>
BEGIN-TRL 0.6
Contributor: "Eric S. Raymond" <esr@thyrsus.com>
Person: "Eric S. Raymond" <esr@thyrsus.com>
Home-Page: http://www.tuxedo.org/~esr
Rename-To: "Thaddeus Q. Foonly" <foon@random.org>
END-TRL
</verb></tscreen>

<sect2>Package fields
<p>
<itemize>
<item><em>Action</em>: (Updates only) Must be one of `merge', `replace', or `delete'; `merge' is the default. The `delete' action requests deletion of the package record. The `merge' action requests that only nonempty fields in the package update should be merged into the existing record. The `replace' action specifies that the data in the update should entirely replace the existing record. (It is an error to specify any field besides the name in a delete request.)
<item><em>Authors</em>: A list of RFC822 name/address pairs, the people considered authors of the package.
<item><em>Contacts</em>: A list of RFC822 name/address pairs, the people considered public contact people for the package.
<item><em>Crawl-To</em>: A URL where the Trove crawler can find updated metadata for this package.
<item><em>Created</em>: (Dumps only) Date of first creation of this record.
<item><em>Conflicts-With</em>: Asserts that this package cannot be concurrently installed with the listed package.
<item><em>Description</em>: Description of this package.
<item><em>Discriminators</em>: A comma-separated list of discriminators or discriminator wildcards (alternation is supported and implies a list of all discriminators matching the expression). All discriminators are considered rooted (the leading slash is implicit).
<item><em>Extends</em>: A comma-separated list of package names (not necessarily Trove-registered packages). This field declares that the current package is an extension of the listed package.
<item><em>Fixes-For</em>: Asserts that this package contains fixes for the listed packages.
<item><em>Home-Page</em>: WWW home page of the package.
<item><em>Icon</em>: URL of a PNG, JPEG, or GIF to use as a package icon.
<item><em>Icon-Location</em>: (Updates only) Must be one of `replica', `original', or `attached'. These specify whether a local copy of the icon should be made. If the value is `original', no copy will be made. If the value is `replica', the icon will be copied from the specified URL. If the value is `attached', the TRL parser will expect to find a matching MIME-multipart attachment in the update message.
<item><em>Last-Modified</em>: (Dumps only) Date this record was last modified.
<item><em>Last-Stable-Version</em>: Name of the version considered by the maintainer to be the last production version.
<item><em>Latest-Version</em>: Name of the version considered by the maintainer to be the leading version.
<item><em>Locked</em>: Must be `true' or `false'. If `true', this package record may only be modified by an authenticated request from a maintainer, author, or Trove archivist.
<item><em>Maintainers</em>: List of RFC822 name/address pairs of persons allowed to modify this record (even if it is locked).
<item><em>Notify</em>: (Updates only) Sets the package notification list, that is, the people who will be emailed whenever the metadata changes. Compare <em>Subscribe</em> and <em>Unsubscribe</em>, which modify this list.
<item><em>Owner</em>: RFC822 name/address pair of the package owner (the person privileged to modify the maintainers/authors/contacts lists).
<item><em>Package</em>: Name of the package.
<item><em>Requires</em>: A comma-separated list of package names (not necessarily Trove-registered packages). This field declares that the current package requires the listed packages in order to work.
<item><em>Rename-To</em>: (Updates only) Specifies a new name for the package (the replacement will be performed throughout the database).
<item><em>See-Also</em>: A comma-separated list of package names (not necessarily Trove-registered packages). This field declares that all listed packages are somehow related to the current package.
<item><em>Subscribe</em>: (Updates only) Comma-separated list of RFC822 name/address pairs to be added to the notification list. Compare <em>Notify</em>, which sets (overwrites) the entire list.
<item><em>Summary</em>: One-line summary of the package description.
<item><em>Supersedes</em>: A comma-separated list of package names (not necessarily Trove-registered packages). This field declares that the current package supersedes the listed packages.
<item><em>Unsubscribe</em>: (Updates only) Comma-separated list of RFC822 name/address pairs to be removed from the notification list. Compare <em>Notify</em>, which sets (overwrites) the entire list.
<item><em>Update-Count</em>: (Dumps only) Count of times this package record has been updated.
<item><em>Update-Notes</em>: Packager's notes on deprecated versions, upgrade urgency, etc. These are kept separate from Description so that free-text searches will ignore them.
<item><em>Via</em>: (Dumps only) Name of the program through which the last update was submitted.
</itemize>

<sect2>Resource fields
<p>
<itemize>
<item><em>Action</em>: (Updates only) Must be one of `merge', `replace', or `delete'; `merge' is the default. The `delete' action requests deletion of the resource record. The `merge' action requests that only nonempty fields in the resource update should be merged into the existing record. The `replace' action specifies that the data in the update should entirely replace the existing record. (It is an error to specify any field besides the name in a delete request.)
<item><em>Authors</em>: A list of RFC822 name/address pairs, the people considered authors of the resource. If this list is empty, the authors list is inherited from the containing package record.
<item><em>Created</em>: (Dumps only) Date of first creation of this record.
<item><em>Description</em>: Description of this resource.
<item><em>Last-Modified</em>: (Dumps only) Date this record was last modified.
<item><em>Locked</em>: Must be `true' or `false'. If `true', this resource record may only be modified by an authenticated request from a maintainer, author, or Trove archivist.
<item><em>MIME-Type</em>: The MIME type of the resource file.
<item><em>Maintainers</em>: List of RFC822 name/address pairs of persons allowed to modify this resource. If this list is empty, the maintainers list is inherited from the containing package record.
<item><em>Notify</em>: (Updates only) Sets the resource notification list, that is, the people who will be emailed whenever the metadata changes.
<item><em>Owner</em>: RFC822 name/address pair of the resource owner (the person privileged to modify the maintainers/authors lists).
<item><em>Resource</em>: Name of the resource. Can be either a URL (if the Resource-Location field is `original' or `replica') or a bare filename (if the Resource-Location field is `attached').
<item><em>Resource-Location</em>: (Updates only) Must be one of `replica', `original', or `attached'. These specify whether a local copy of the resource should be made. If the value is `original', no copy will be made. If the value is `replica', the resource will be copied from the specified URL. If the value is `attached', the TRL parser will expect to find a matching MIME-multipart attachment in the update message.
<item><em>Resource-Role</em>: Specifies a role for the resource; see below.
<item><em>Update-Count</em>: (Dumps only) Count of times this resource record has been updated.
<item><em>Update-Notes</em>: Packager's notes on deprecated versions, upgrade urgency, etc. These are kept separate from Description so that free-text searches will ignore them.
<item><em>Version</em>: Version of this resource.
</itemize>

<sect2>Resource roles
<p>
<itemize>
<item><em>source</em>: This resource is a source archive of some part of the package.
<item><em>binary</em>: This resource is an executable binary, or archive of executable binaries, generated from the package source.
<item><em>installable</em>: This resource is an installable package (such as an RPM) generated from the package sources.
<item><em>documentation</em>: This resource is documentation for the package.
<item><em>data</em>: This resource is data of some sort associated with the package.
<item><em>other</em>: None of the above.
</itemize>

<sect1>TRL Future Directions
<p>
Define and implement an XML presentation of TRL semantically equivalent to this one (at present, mature XML tools to support this are lacking).

Dates should be accepted in <url url="http://www.ft.uni-erlangen.de/~mskuhn/iso-time.html" name="ISO-8601"> format. This is the format to use with the XML syntax.

<sect>Data Formats
<p>
<sect1>Text Field Rules
<p>
Description fields are interpreted according to the following rules:

Text is plain text. Paragraphs are separated by one or more blank lines. No HTML tags are recognized; >, & and < mean themselves.

Normal paragraphs are word-filled. Indented text is treated as-is and converted to <PRE>...</PRE> in HTML (tabs should be expanded to spaces here).

A single word between *asterisks* means <b>boldface</b> and a single word in _underscores_ means <i>italics</i> (even in indented text).

Any text that looks sufficiently like a URL (e.g. http://www.python.org) is turned into a hyperlink with an <A...>...</A> tag pair (even in indented text).

<sect1>Mappings between Trove and RPM metadata
<P>
<label id="rpm-import">It would be highly desirable to be able to automatically import Red Hat RPMs and SRPMs into the Trove scheme. To this end, we compare them here.
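To make the comparison concrete before going through the tags one by one, here is a minimal Python sketch of what an importer could and could not fill in. It is an illustration only, not part of any Trove implementation: the function name <tt>rpm_to_trove</tt>, the table names, and the shape of the returned dictionary are all hypothetical. The field correspondences are the ones worked out in the rest of this section; joining Version and Release with a hyphen is an assumption, since the text only says the two are composed.

```python
# Direct copies discussed in this section: RPM tag -> Trove field.
DIRECT_COPIES = {
    "Name": "Package",
    "Summary": "Summary",
    "Description": "Description",
    "Icon": "Icon",
    "URL": "Home-Page",
    "Packager": "Contributor",
}

# Trove fields that RPM metadata gives no way to derive.
UNDERIVABLE = ["Authors", "Contacts", "Maintainers",
               "Supersedes", "Extends", "See-Also"]


def rpm_to_trove(rpm_tags):
    """Build a partial Trove record from a dict of RPM tag values.

    Hypothetical helper: it only shows which fields carry over.
    """
    package = {trove: rpm_tags[rpm]
               for rpm, trove in DIRECT_COPIES.items() if rpm in rpm_tags}
    # The RPM itself becomes a resource with the `installable' role.
    resource = {"Resource-Role": "installable"}
    if "Version" in rpm_tags and "Release" in rpm_tags:
        # Assumption: compose the resource version as Version-Release.
        resource["Version"] = f'{rpm_tags["Version"]}-{rpm_tags["Release"]}'
    return {"package": package,
            "resource": resource,
            "missing": list(UNDERIVABLE)}
```

The <tt>missing</tt> list is the interesting part of the result: it names the fields a human would still have to supply after an automatic import.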
Here are the currently defined RPM metadata tags:

<itemize>
<item><bf>Name</bf> -- the package name</item>
<item><bf>Version</bf> -- the package version</item>
<item><bf>Release</bf> -- the RPM release number</item>
<item><bf>Copyright</bf> -- the license type of the software</item>
<item><bf>Group</bf> -- topic category of the application</item>
<item><bf>Source</bf> -- URL pointing to home archive of the sources</item>
<item><bf>URL</bf> -- URL pointing to documentation</item>
<item><bf>Distribution</bf> -- the distribution this package belongs to</item>
<item><bf>Vendor</bf> -- the organization distributing the software</item>
<item><bf>Packager</bf> -- contact email of package maker</item>
<item><bf>Summary</bf> -- one-line summary description</item>
<item><bf>Description</bf> -- multiline description of package.</item>
<item><bf>Icon</bf> -- GIF or XPM icon for package.</item>
</itemize>

Going through the Trove schema package fields in order, we see that we can copy the Name, Summary, Description, and Icon fields directly. RPM's URL field is equivalent to our Home-Page field. We don't need RPMs to fill in the Crawl-To, Remote-Date, or Refresh-Date fields, as those are strictly for the crawler's use.

In the package access record, the Created, Update-Count, Modified, and Locked fields could be created at RPM translation time. The Contributor field could be copied from the RPM Packager field.

Now, proceeding to the relations. Assuming we had some systematic mapping of Group values to Trove discriminators, we could derive exactly one topic discriminator. We could derive `required-by' relations from the Requires header. We could derive the license-type controlled keywords from the Copyright header. There is no way to extract `supersedes', `extends', or `see-also'.

The real problems are with the package-to-person relations. RPM has no tags for contact people, authors, or maintainers.
Metadata maintainership privileges would default to the contributor, but in the case of RPMs created by (say) Red Hat for distribution this is unlikely to be useful.

The picture is a little brighter with respect to automatically declaring resources. We could declare the RPM itself a resource with a version number composed from the Version and Release fields. The Contributor field could set the maintainer.

Conclusion: RPM metadata is not really adequate for generating Trove records. The major problems are (a) it doesn't supply enough keyword/discriminator info, and (b) there is no way to derive a reliable maintainer or author list from it.

<sect1>Mappings between Trove and Debian metadata
<P>
<label id="debian-import">It would be desirable to be able to automatically import Debian .deb packages into the Trove scheme. To this end, we compare them here.

Here are the Debian metadata tags defined in the <url name="Debian Packaging Manual" url="http://fatman.mathematik.tu-muenchen.de/~schwarz/debian-doc/manuals/packaging-manual/">:

<itemize>
<item><bf>Package</bf> -- the package name</item>
<item><bf>Version</bf> -- the package version</item>
<item><bf>Architecture</bf> -- architecture the package is for</item>
<item><bf>Maintainer</bf> -- contact email of package maker</item>
<item><bf>Source</bf> -- name of corresponding source package</item>
<item><bf>Depends</bf> -- declares an absolute dependency</item>
<item><bf>Recommends</bf> -- declares a strong but not absolute dependency</item>
<item><bf>Suggests</bf> -- recommends other packages to install</item>
<item><bf>Pre-Depends</bf> -- declares an installation dependency</item>
<item><bf>Conflicts</bf> -- says what this cannot coexist with</item>
<item><bf>Replaces</bf> -- declares that this replaces given packages</item>
<item><bf>Provides</bf> -- declares `virtual' packages for dependency purposes</item>
<item><bf>Description</bf> -- multiline description of package.</item>
<item><bf>Essential</bf> -- declares
that a package cannot be removed (only replaced).</item>
<item><bf>Priority</bf> -- how essential the package is.</item>
<item><bf>Section</bf> -- application area of the package</item>
<item><bf>Installed-Size</bf> -- installed size of the package</item>
<item><bf>Standards-Version</bf> -- applicable version of Debian packaging standards</item>
<item><bf>Distribution</bf> -- the distribution this package belongs to</item>
<item><bf>Urgency</bf> -- how important it is to get current</item>
<item><bf>Date</bf> -- last-modified-date of metadata</item>
<item><bf>Format</bf> -- format level for changes file</item>
<item><bf>Changes</bf> -- human-readable changelog data</item>
<item><bf>Size</bf> -- size of binary package</item>
<item><bf>MD5sum</bf> -- MD5 checksum of the package</item>
</itemize>

(A few listed fields which are not package metadata have been omitted. Note that the Maintainer field is the <em>.deb</em> maintainer, analogous to the RPM Packager field, not the person responsible for the software.)

We can copy the Package, Version, and Description fields directly. Our Requires field might be derivable from Debian's Depends. It has been noted that we might be able to derive Crawl-To, Latest-Version, and Last-Stable-Version by looking at the Debian FTP site.

Otherwise there is little overlap -- not nearly enough to make using Debian metadata reasonable.

<sect1>Comparison with Dublin Core metadata
<p>
The <url url="http://purl.oclc.org/metadata/dublin_core" name="Dublin Core"> is a set of 15 metadata items that are meant to be fully general across all kinds of intellectual-property resources.
Here is a summary of the Dublin Core fields:

<itemize>
<item><bf>Title:</bf> -- the name of the resource</item>
<item><bf>Creator:</bf> -- the person who created the intellectual content of the resource</item>
<item><bf>Subject:</bf> -- structured keywords</item>
<item><bf>Description:</bf> -- free text</item>
<item><bf>Publisher:</bf> -- the entity responsible for making the resource available</item>
<item><bf>Contributor:</bf> -- secondary provider of content</item>
<item><bf>Date:</bf> -- creation or first-availability date of the resource</item>
<item><bf>Type:</bf> -- category of work (home page, novel, poem, working paper)</item>
<item><bf>Format:</bf> -- data format, intended to identify what is required to present or use the resource</item>
<item><bf>Identifier:</bf> -- URL, URN, ISBN, or other unique identifier within category</item>
<item><bf>Source:</bf> -- information about a base resource from which this one is derived.</item>
<item><bf>Language:</bf> -- language of the intellectual content of the resource</item>
<item><bf>Relation:</bf> -- relates this resource to another, via assertions such as IsVersionOf(), IsBasedOn(), IsPartOf(), etc.</item>
<item><bf>Coverage:</bf> -- spatial or temporal characteristics of the intellectual content of the resource.</item>
<item><bf>Rights:</bf> -- pointer to license and rights information.</item>
</itemize>

We are certainly not going to be able to use the Dublin Core as a complete set of descriptors. But there are some things we could do to be name-compatible where we're semantically compatible, and avoid name clashes where we cannot be semantically compatible.

<tscreen><verb>
Simple renamings:
    Author         -> Creator
    Maintainer     -> Contributor
    Contributor    -> Publisher
    Discriminators -> Subject

Fieldnames to avoid in our metadata:
    Title      (hard experience that people don't interpret this well)
    Date       (because of creation vs. last-modified ambiguity)
    Type       (incompatible vocabulary with Dublin Core's)
    Format     (incompatible vocabulary with Dublin Core's)
    Identifier (there isn't any natural scheme)
    Source     (doesn't specify mode of derivation well enough)
    Language   (doc-language vs. implementation-language ambiguity)
    Relation   (incompatible vocabulary with Dublin Core's)
    Coverage   (just irrelevant)
</verb></tscreen>

Finally, we could stipulate that if we end up disambiguating package names with a site prefix or other uniquifying prefix (rather than resolving collisions), the "true name" would be designated the Identifier. No decision has been made on this yet.

<sect>Recent Changes
<p>
Version 1.2: Minor changes to TRL suggested by M. A. Lemburg. Relation fields added to TRL example; corresponds to version 1.12 of the schema. New section on translation from RPMs.

Version 1.3: Separated resource-action fields from resource fields. Dropped Refresh-Date; life is much simpler without it. Introduced the concept of a preamble. This example corresponds to version 1.13 of the schema and the initial version of the TRL parser.

Versions 1.4, 1.5, 1.6: Minor editorial changes.

Version 1.7: Action-field renames as suggested by John Cowan. Added Resource-Role field.

Version 1.8: Added Subscribe and Unsubscribe fields.

Version 1.9: TML renamed to TRL to avoid collision with `HTML'.

Version 1.10: Submitter field changed to Contributor.

Version 1.11: Documented the new `attached' adjective.

Version 1.12: SGML markup fixes.

Version 1.13: Added Person record parsing. Added Rename-To fields.

Version 1.14: Support notification lists in base classes. Added a complete TRL reference.

Version 1.15: Added comparisons with Debian fields and with Dublin Core metadata. Added sections on basic types and contributor workflow. Removed sticky bit, added Fixes-For and Conflicts-With relations and Update-Notes field.

Version 1.16: Minor markup fixes.
</article>