Generalized Website Autodiscovery (PRE-DRAFT)
draft-roundy-generalized-autodiscovery-01

Abstract

This document specifies a generalized method of providing standardized, machine-readable data for autodiscovery of information about a website and the services it provides.

Editorial Notes

This draft HAS NOT been submitted for publication, and does not have any status; it should be referred to as a "pre-draft."

Use the form at http://www.geckotribe.com/info/contact.php to provide input on this document.

Sections called out [[like this]] indicate editorial notes that should be filled in or removed before final publication.

[[examples need to be added throughout]]

Table of Contents
1 Introduction
2 Notational Conventions
3 The "autodisc" directory
4 Autodiscovery files
4.1 index.txt
4.2 subdirectories.txt
4.3 robots.txt
4.4 icons.txt
4.5 description.txt
4.6 feeds.txt
5 IANA considerations
5.1 Registry of autodiscovery files
6 Security considerations
A References
B Author's address
C Revision History
D Intellectual property and Copyright Statements

1 Introduction

Conventions have emerged for using specific filenames in specific locations on websites to store data such as instructions to web crawlers regarding URIs they should and should not access (robots.txt) and site icons (favicon.ico). The purpose of this document is to define additional autodiscovery files and a standardized method for defining and registering future autodiscovery files.

This document also defines a method of delegating autodiscovery authority for subdirectories to files not stored in the root autodiscovery authority location.

2 Notational Conventions

The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT", "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this document are to be interpreted as described in [RFC2119].

3 The "autodisc" directory

All autodiscovery files conforming to this specification MUST reside in a directory named "autodisc" in all lowercase letters. The autodisc directory at the root of a webserver may delegate authority for portions of the server's address space by listing delegated directories in a file named "subdirectories.txt", as specified below.

4 Autodiscovery files

All autodiscovery files specified in this document are plain text files and SHOULD be served with the MIME media type text/plain. They MAY use any combination of carriage returns and line feeds to delimit lines. They MUST be encoded in UTF-8. Unless explicitly specified, this document assigns no significance to whitespace in these files. Files which do not conform to the above MAY be added to the registry created by this document.
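As a non-normative illustration of the line handling described above, the following Python sketch decodes a file and splits it into lines. The function name read_autodisc_lines is hypothetical. Two choices here are assumptions rather than requirements of this document: a leading byte order mark is tolerated, and blank lines and surrounding whitespace are discarded.

```python
import re

def read_autodisc_lines(raw):
    """Decode an autodiscovery file and return its non-empty lines."""
    # Files MUST be UTF-8; "utf-8-sig" also tolerates a leading BOM (assumption).
    text = raw.decode("utf-8-sig")
    # Any combination of CR and LF characters delimits lines.
    lines = re.split(r"[\r\n]+", text)
    # Assumption: blank lines and leading/trailing whitespace are not significant.
    return [line.strip() for line in lines if line.strip()]
```

For example, a file using mixed line endings would yield the same list of lines as one using only line feeds.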

All of the files are OPTIONAL. If not present, the information they define is not defined by this document.

4.1 index.txt

This file lists the other files in the directory, one per line, to reduce the number of requests made to check for the existence of additional files. If this file exists, it SHOULD list all of the other files, and consuming applications MAY ignore any other autodiscovery files which exist but are not listed in it.
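A consuming application might use index.txt roughly as sketched below (non-normative). The function name files_to_fetch and its arguments are hypothetical. The sketch assumes the application takes the option of ignoring unlisted files, and that it probes for every file it is interested in when index.txt is absent.

```python
def files_to_fetch(index_lines, wanted):
    """Return the filenames from `wanted` that are worth requesting.

    index_lines: filenames read from index.txt, or None if index.txt is absent.
    wanted: filenames the application knows how to use.
    """
    if index_lines is None:
        # No index available: each wanted file has to be requested individually.
        return list(wanted)
    listed = set(index_lines)
    # Files not listed in index.txt are ignored (permitted, not required).
    return [name for name in wanted if name in listed]
```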

4.2 subdirectories.txt

This file lists directories on a web server which contain their own "autodisc" subdirectories. One directory is listed on each line. Directory names MAY begin with a forward slash character. Whether they do or not, the path is relative to the directory for which subdirectories.txt has authority. Directory names MAY end with a forward slash character. If they do not, they MUST be processed as if they did end with a forward slash.

The asterisk character in paths is a wildcard which matches any number of characters other than the forward slash path delimiter. For example, a typically configured ISP, whose users' directories are found at the path /~/, could delegate to its users using any of the following path specifiers:

~*
/~*
~*/
/~*/

Autodiscovery authority for all portions of the server's address space beginning with the path for which the delegating file has authority plus the path listed on each line is delegated to files in the directory whose path is created by appending "autodisc" to the end of that path.
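The path rules above (optional leading slash, implied trailing slash, asterisk wildcard, and the location of the delegatee's autodisc directory) can be sketched as follows. This is a non-normative illustration; the names normalize_dir, delegation_pattern and autodisc_dir_for are hypothetical, authority_base is assumed to be the slash-terminated path for which the delegating file has authority, and the trailing slash on the returned directory is a presentation choice.

```python
import re

def normalize_dir(path):
    """Drop an optional leading slash and ensure a trailing slash."""
    if path.startswith("/"):
        path = path[1:]
    if not path.endswith("/"):
        path += "/"
    return path

def delegation_pattern(authority_base, listed):
    """Compile a regex matching URL paths covered by one delegation line."""
    full = authority_base + normalize_dir(listed)
    # "*" matches any run of characters except "/"; all else is literal.
    literal_parts = [re.escape(part) for part in full.split("*")]
    return re.compile("(" + "[^/]*".join(literal_parts) + ")")

def autodisc_dir_for(authority_base, listed, url_path):
    """Return the delegatee autodisc directory for url_path, or None."""
    match = delegation_pattern(authority_base, listed).match(url_path)
    if match is None:
        return None
    # Append "autodisc" to the matched, slash-terminated directory path.
    return match.group(1) + "autodisc/"
```

Under these assumptions, all four path specifiers listed above resolve a request path such as /~alice/blog/post.html to the same directory, /~alice/autodisc/.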

Authority MUST NOT be delegated to a location outside the address space controlled by the delegating authority.

It is an error to delegate to a subdirectory within the address space of another subdirectory to which one is delegating, regardless of the order in which the delegations are done. The results of doing so are undefined.

The delegating directory MAY specify which files to delegate by appending a whitespace separated list of filenames on the same line as the delegatee directory name. If a list of filenames is specified, the values in other files are not inherited from the delegating directory, and if any other files exist in the delegatee directory, they MUST be ignored.

The delegating directory MAY specify which files are not to be delegated by appending whitespace, a minus sign character, and a whitespace separated list of filenames on the same line as the delegatee directory name. If a list of filenames is so specified, the values in all other files are inherited from the delegating directory. If any of the files so listed exist in the delegatee directory, they MUST be ignored.
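The two list forms described above can be parsed along the following lines (non-normative sketch). The names Delegation, parse_delegation and is_delegated are hypothetical. The sketch assumes that directory names contain no whitespace, and it accepts the minus sign either as a separate token or attached to the first filename, since the text above could be read either way.

```python
from typing import NamedTuple, Optional

class Delegation(NamedTuple):
    directory: str
    only: Optional[frozenset]  # files delegated, or None if no such list
    excluded: frozenset        # files withheld from delegation

def parse_delegation(line):
    """Parse one non-empty subdirectories.txt line."""
    tokens = line.split()
    directory, rest = tokens[0], tokens[1:]
    if rest and rest[0].startswith("-"):
        # Exclusion form: "dir - a.txt b.txt" or "dir -a.txt b.txt" (assumption).
        first = rest[0][1:]
        names = ([first] if first else []) + rest[1:]
        return Delegation(directory, None, frozenset(names))
    if rest:
        # Inclusion form: only the listed files are delegated.
        return Delegation(directory, frozenset(rest), frozenset())
    # No list: every file is delegated.
    return Delegation(directory, None, frozenset())

def is_delegated(delegation, filename):
    """Report whether authority for filename passes to the delegatee."""
    if filename == "subdirectories.txt":
        return True  # always delegated; cannot be restricted
    if delegation.only is not None:
        return filename in delegation.only
    return filename not in delegation.excluded
```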

Authority for subdirectories.txt is always delegated, and cannot be restricted.

Delegatee directories MAY delegate authority only for autodiscovery files for which they themselves have authority.

If a delegatee directory does not contain a file which has been delegated to it, the value of that file is not defined by this document. Be aware that files specified outside of this document, such as robots.txt if it exists in the server root directory, MAY still apply to delegatee directories as specified in their own specifications. But if files specified outside of this document exist in the autodisc directory, and not in locations specified by their own specifications, they SHOULD follow this rule.

Applications MAY decline to explore all subdirectories to which authority has been delegated, but MUST NOT apply autodiscovery settings which have been delegated to delegatee directories--if they do not attempt to access the delegatee to discover the settings, they MUST consider them undefined.

4.3 robots.txt

The format and usage of this file are as defined at http://www.robotstxt.org/. In the root autodisc directory, it will typically either be a symbolic link to a file named robots.txt in the root directory of the webserver, or be the target of a link from that file.

4.4 icons.txt

This file lists the URIs of image files to use as an icon for the website. One URI is listed on each line. The images MAY be in any format, and SHOULD use standard filename extensions. The images SHOULD be identical to the extent possible given differences in image formats. The images SHOULD be listed in descending order of the publisher's preference. For example, if image quality or size varies, the best quality or smallest images might be listed first. Consuming applications SHOULD render the first image in the list that they are capable of rendering.

The images SHOULD have square dimensions, and SHOULD be small. A size of 16 by 16 pixels is RECOMMENDED, as it is currently the de facto standard. Consuming applications MAY scale the images, and MAY change their aspect ratios if they are not square.
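The selection rule for icons.txt (render the first listed image the application can handle) might be implemented as in this non-normative sketch. The function name choose_icon is hypothetical, and judging renderability by filename extension alone is an assumption made for brevity; an application could equally inspect the media type returned when fetching each image.

```python
import posixpath
from urllib.parse import urlsplit

def choose_icon(icon_uris, supported_extensions):
    """Return the first URI whose extension the application can render."""
    for uri in icon_uris:
        # Look only at the path component, ignoring any query or fragment.
        extension = posixpath.splitext(urlsplit(uri).path)[1].lower()
        if extension in supported_extensions:
            return uri
    return None  # nothing in the list is renderable
```

An application that supports only ".png" and ".ico" would skip an SVG image listed first and use the next PNG or ICO entry.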

4.5 description.txt

This file contains a single-line description of the website. [[should this be a multiline file to allow multiple languages? only languages supported by the site should appear in this file if so]] The description SHOULD be no more than 80 characters long. Consuming applications MAY truncate the description to any length. The description SHOULD be suitable for clarifying the identity of the website.

One anticipated usage of the description is for web browsers to help their users locate a desired website by entering only the second or third level domain in the address bar. The browser might, for example, attempt to display the website whose domain name is created by appending ".com" to that name and perhaps prepending "www.". After that webpage is displayed, the browser might check for the existence of domain names with other top-level domains, such as ".net", ".org", ".us", etc., and create a menu of such domain names. It might afterward update the menu with the description, or the beginning of the description, taken from the autodisc/description.txt file on each, if one exists.

4.6 feeds.txt

This file lists syndication feeds published by a site, such as RSS and Atom feeds. The URI of each feed is listed on a separate line. Sites MAY list a subset of the feeds they publish. Each line MAY be prepended with the namespace or MIME media type of the format of the feed in square brackets. [[should we use the Atom autodiscovery link element format instead of this? that would also allow for feed titles, which would be useful. should this be an XML file instead of plain text? that would probably be overkill, even though it would look like an XML fragment if we use the Atom autodiscovery format.]] If the path portion of the URI of the feed (the portion terminated by a question mark or hashmark) does not end with a filename extension common to the feed format, a MIME media type SHOULD be specified. If the MIME media type is present, it is an advisory value, and the MIME media type indicated by the HTTP headers when accessing the feed overrides this value.
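A line of feeds.txt, with its optional bracketed namespace or media type, could be parsed as in the following non-normative sketch. The names parse_feed_line and effective_type are hypothetical, and the sketch assumes that feed URIs contain no whitespace and do not themselves begin with a square bracket.

```python
import re

# Optional "[type-or-namespace]" prefix followed by the feed URI.
_FEED_LINE = re.compile(r"^\s*(?:\[([^\]]*)\]\s*)?(\S+)\s*$")

def parse_feed_line(line):
    """Return (uri, advisory_type) for one feeds.txt line."""
    match = _FEED_LINE.match(line)
    if match is None:
        raise ValueError("not a valid feeds.txt line: %r" % line)
    return match.group(2), match.group(1)

def effective_type(advisory_type, http_content_type):
    """The type from the HTTP headers overrides the advisory value."""
    return http_content_type or advisory_type
```

A line without a bracketed prefix yields None as the advisory type, leaving the application to rely on the filename extension or on the HTTP headers.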

5 IANA considerations

5.1 Registry of autodiscovery files

A registry of autodiscovery file types will be used to avoid file naming conflicts and unrestrained growth of autodiscovery files, and to provide a central location for researching their formats. Filenames beginning with "x-" MAY be used for experimental files and unregistered file types. That prefix is reserved for such files. The registry is maintained by [[?????]], and initially contains the six files listed above. Each registry entry lists the filename and a description of the format and usage of the file.

[[add something like this:

New assignments are subject to IESG Approval, as outlined in [RFC2434]. Requests should be made by email to IANA, which will then forward the request to the IESG requesting approval. The request should use the following template:

* Filename:
* Description of file format:
* Security considerations:

Review guidelines:

A different filename from the one requested may be assigned as deemed appropriate by ?????.

Don't accept vendor-specific things--only things generally applicable.

Don't accept files for experimental technologies--only for reasonably established technologies.

Don't accept IP encumbered things--require registrants to make a statement that as far as they know, there are no IP limitations--if they own IP rights, they must release them for license-agreement and payment free usage by anyone]]

6 Security considerations

When processing subdirectories.txt, be sure not to accept paths that attempt to break out of their address space: paths which contain "/../", begin with "../", end with "/..", or consist entirely of "..". [[might people cause problems by symlinking to higher directories? yes, so be careful about doing any delegation--if you're an ISP...don't let people symlink out of their own directory]]
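The path check described above amounts to rejecting any path with a ".." segment, which might be sketched as follows (non-normative). The function name escapes_address_space is hypothetical. Percent-decoding the path before checking is an extra precaution assumed here; it is not required by this document.

```python
from urllib.parse import unquote

def escapes_address_space(path):
    """Report whether a subdirectories.txt path contains a ".." segment."""
    # Assumption: decode percent-escapes so "%2e%2e" is caught as well.
    decoded = unquote(path)
    # Covers "/../" anywhere, a leading "../", a trailing "/..", and "..".
    return ".." in decoded.split("/")
```

A consuming application would discard any delegation line for which this check returns true before applying the remaining processing rules.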

[[don't delegate anything you think might be dangerous--for example, if you want to ensure that certain robots.txt settings are applied everywhere, don't delegate robots.txt]]

A References

[[fill this out]]

B Author's address

[[for now, I'm listing my contact form only]]
Antone Roundy
http://www.geckotribe.com/info/contact.php

C Revision History

20 April, 2005: Initial draft prepared

D Intellectual property and Copyright Statements

Copyright (C) Antone Roundy (2005). All Rights Reserved.

[[Something like the following will probably appear in the final version, and the copyright may be transferred to a standards body:

This document and translations of it may be copied and furnished to others, and derivative works that comment on or otherwise explain it or assist in its implementation may be prepared, copied, published and distributed, in whole or in part, without restriction of any kind, provided that the above copyright notice and this paragraph are included on all such copies and derivative works. However, this document itself may not be modified in any way, such as by removing the copyright notice or references to the Internet Society or other Internet organizations, except as needed for the purpose of developing Internet standards in which case the procedures for copyrights defined in the Internet Standards process must be followed, or as required to translate it into languages other than English.

The limited permissions granted above are perpetual and will not be revoked by the Internet Society or its successors or assigns.

This document and the information contained herein is provided on an "AS IS" basis and THE INTERNET SOCIETY AND THE INTERNET ENGINEERING TASK FORCE DISCLAIMS ALL WARRANTIES, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO ANY WARRANTY THAT THE USE OF THE INFORMATION HEREIN WILL NOT INFRINGE ANY RIGHTS OR ANY IMPLIED WARRANTIES OF MERCHANTABILITY OR FITNESS FOR A PARTICULAR PURPOSE.]]
Gecko Tribe, LLC
PO Box 5835
Grand Island, NE 68802
Voice Mail: 308-646-0543