<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE rfc SYSTEM "rfc2629.dtd" [
<!ENTITY rfc1327 SYSTEM "./rfcrefs/reference.RFC.1327.xml">
<!ENTITY rfc1521 SYSTEM "./rfcrefs/reference.RFC.1521.xml">
<!ENTITY rfc1766 SYSTEM "./rfcrefs/reference.RFC.1766.xml">
<!ENTITY rfc2026 SYSTEM "./rfcrefs/reference.RFC.2026.xml">
<!ENTITY rfc2028 SYSTEM "./rfcrefs/reference.RFC.2028.xml">
<!ENTITY rfc2047 SYSTEM "./rfcrefs/reference.RFC.2047.xml">
<!ENTITY rfc2119 SYSTEM "./rfcrefs/reference.RFC.2119.xml">
<!ENTITY rfc2231 SYSTEM "./rfcrefs/reference.RFC.2231.xml">
<!ENTITY rfc2234 SYSTEM "./rfcrefs/reference.RFC.2234.xml">
<!ENTITY rfc2396 SYSTEM "./rfcrefs/reference.RFC.2396.xml">
<!ENTITY rfc2434 SYSTEM "./rfcrefs/reference.RFC.2434.xml">
<!ENTITY rfc2616 SYSTEM "./rfcrefs/reference.RFC.2616.xml">
<!ENTITY rfc2781 SYSTEM "./rfcrefs/reference.RFC.2781.xml">
<!ENTITY rfc2860 SYSTEM "./rfcrefs/reference.RFC.2860.xml">
<!ENTITY rfc3066 SYSTEM "./rfcrefs/reference.RFC.3066.xml">
<!ENTITY rfc3339 SYSTEM "./rfcrefs/reference.RFC.3339.xml">
<!ENTITY rfc3552 SYSTEM "./rfcrefs/reference.RFC.3552.xml">
<!ENTITY rfc3629 SYSTEM "./rfcrefs/reference.RFC.3629.xml">
<!ENTITY rfc4234 SYSTEM "./rfcrefs/reference.RFC.4234.xml">
]>
<?rfc toc='yes' symrefs='yes' sortrefs='yes'?><rfc ipr="full3978"><front>
<title abbrev="draft-phillips-record-jar-01">The record-jar Format</title>
  <author initials="A" surname="Phillips" fullname="Addison Phillips" role="editor">
     <organization>Yahoo! Inc.</organization>
     <address>
       <email>addison@inter-locale.com</email>
       <uri>http://www.inter-locale.com</uri>
     </address>
  </author>
	<date day="24" month="August" year="2007"/><abstract><t>The record-jar format provides a method of storing multiple records with a variable repertoire of fields in a text format. This document provides a description of the format. Comments are solicited and should be addressed to the mailing list 'record-jar@yahoogroups.com' and/or the author.</t></abstract></front><middle><section title="Introduction" anchor="intro"><t>The record-jar format was originally described by <xref target="AOUP">The Art of Unix Programming</xref>. This format is useful for storing information in a human-readable text form, while making the data available for machine processing. It is a flexible format, since it provides for an arbitrary range of fields in any given record and can be used to store data with variable length and content.</t><t>The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT", "SHOULD", 
"SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this document are to be 
interpreted as described in <xref target="RFC2119"></xref>.</t></section><section title="Format and Grammar" anchor="format"><t>The record-jar format is described by the following ABNF (<xref target="RFC4234"></xref>):</t><figure title="record-jar ABNF"><artwork type="ABNF" name="record-jar">
record-jar   = [encodingSig] [separator] *record
record       = 1*field separator
field        = ( field-name field-sep field-body CRLF )
field-name   = *character
field-sep    = *SP ":" *SP
field-body   = *(continuation 1*character)
continuation = ["\"] [[*SP CRLF] 1*SP]
separator    = [blank-line] *("%%" [comment] CRLF)
comment      = SP *69(character)
character    = SP / ASCCHAR / UNICHAR / ESCAPE
encodingSig  = "%%encoding" field-sep 
                 *(ALPHA / DIGIT / "-" / "_") CRLF
blank-line   = WSP CRLF

; ASCII characters except %x26 (&amp;) and %x5C (\)
ASCCHAR      = %x21-25 / %x27-5B / %x5D-7E 
UNICHAR      = %x80-%x10FFFF                      ; Unicode chars
ESCAPE       = "\" ("\" / "&amp;" / "r" / "n" / "t" )
             / "&amp;#x" 2*6HEXDIG ";"
</artwork></figure><t>The record-jar format consists of character data that forms a sequence of records. Each record is separated from other records by at least one line beginning with the sequence "%%" (%x25.25). Records are made up of one or more fields and a record MAY contain as many or as few fields as are necessary to convey the necessary data. Empty records and blank lines are ignored.</t><t>A field is a single, logical line of characters from the <xref target="Unicode">Universal Character Set (Unicode)</xref>, composed of three parts: the field-name, the field separator, and the field-body.</t><t>The field-name is an identifier. Field-names SHOULD consist only of characters permitted in identifiers according to <xref target="UAX31">Unicode Standard Annex #31 (UAX#31)</xref> and SHOULD start only with characters with the property ID_Start. Often field-names are further restricted to a sequence of letters and digits from the US-ASCII character set <xref target="ISO646"></xref>. A field-name SHOULD be treated as case sensitive and MUST NOT contain any spaces. Upper and lowercase letters are often used to visually break up the name, for example using CamelCase. It is a common convention that field-names use an initial capital letter, although this is not enforced. The hyphen-minus character ("-", %x2D) MAY be used to separate parts of the name visually; however, it MUST NOT appear at the beginning or end of a field-name.</t><t>The field separator (field-sep) is the colon character (":", %x3A). The separator MAY be surrounded on either side by any amount of horizontal whitespace (tab or space characters). The normal convention is one space on each side.</t><t>The field-body contains the data value. Logically, the field-body consists of a single line of text using any combination of characters from the Universal Character Set followed by a CRLF (newline). 
The carriage return, newline, and tab characters, when they occur in the data value stored in the field-body, are represented by their common backslash escapes ("\r", "\n", and "\t" respectively). See <xref target="characters"></xref> for more information on escape sequences.</t><section title="Folding of Field Values" anchor="folding"><t>Some protocols limit total line length. For example, many Internet plain-text protocols limit lines to 72 total bytes. To accommodate such limits or for readability and presentational purposes, the field-body portion of a field can be split into a multiple-line representation; this is called "folding".</t><t>Successive lines in the same field-body begin with one or more whitespace characters. When processing the record-jar format, the linear whitespace (including the newline and any preceding spaces) is consumed by the processor and the two parts of the field-body are joined to form a single, logical line. For example:</t><figure title="Example of Folding" anchor="fold.fig1"><artwork>Eulers-Number : 2.718281828459045235360287471
  352662497757247093699959574966967627724076630353547
  5945713821785251664274274663919320030599218174135...
</artwork></figure><t hangText="NOTE:">Note that imposing a line length limit effectively limits the length of the field-name, since the field separator MUST appear on the same line with the field-name and the field-name MUST NOT be folded. Also, when imposing a line length limit, note that some encodings (including the Unicode encodings) can use a variable number of bytes per character or commonly use more than one byte per character. Characters MUST NOT be folded in the middle of a byte sequence. Furthermore, folding SHOULD NOT be done just prior to a combining character (since this will alter the display of characters in the file and might result in unintentional alteration of the file's semantics).</t><t hangText="NOTE:">In some cases, the field-body contains spaces that are important to the data. To accurately preserve whitespace in the document, an optional line-continuation character (backslash, %x5C) MAY be included to delimit and separate whitespace to be preserved from whitespace that will be removed by the processor. The line-continuation character and any whitespace that follows it (including whitespace at the beginning of the continuing field-body on the next line) MUST be consumed by the processor when reading the file. Whitespace appearing before the line-continuation character MUST NOT be consumed. Use of the line-continuation character makes the whitespace visible in the file. </t><t hangText="NOTE:">In other cases, the field-body might contain natural language text, and, while it is readily apparent that many languages use spaces to separate words, others, such as Japanese or Thai, do not. Implementations MAY, in the absence of line-continuation characters, replace the continuation sequence (the line break and surrounding whitespace) in a folded line with a single ASCII space (%x20); however, implementations SHOULD just remove the continuation sequence altogether in order to avoid causing unnatural breaks in the text. 
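<figure title="Sketch of Unfolding a Field-Body (non-normative)"><preamble>As a non-normative illustration of the folding rules in this section, the following Python sketch joins the physical lines of one folded field-body into a single logical line. The names split_continuation, unfold_body, and join_with_space are hypothetical and are not defined by this specification. The sketch assumes that the caller has already removed the field-name, the field separator, and all line terminators, and that each list item holds the field-body text of one physical line.</preamble><artwork>
```python
def split_continuation(line):
    """Return (text, explicit) for a physical line that is followed by another.

    explicit is True when the line ends with an unescaped backslash,
    optionally followed by spaces or tabs.
    """
    stripped = line.rstrip(" \t")
    # An odd number of trailing backslashes means the last one is a
    # line-continuation character rather than half of an escaped backslash.
    trailing = len(stripped) - len(stripped.rstrip("\\"))
    if trailing % 2 == 1:
        # Drop the backslash and what follows it; keep whitespace before it.
        return stripped[:-1], True
    # No backslash: the trailing whitespace is part of the fold and is consumed.
    return stripped, False


def unfold_body(lines, join_with_space=False):
    """Join the physical lines of one folded field-body into one logical line."""
    parts = []
    final = len(lines) - 1
    for index, line in enumerate(lines):
        if index != 0:
            # Whitespace at the start of a continuing line is always consumed.
            line = line.lstrip(" \t")
        if index != final:
            line, explicit = split_continuation(line)
            if join_with_space and not explicit:
                # Optional behavior: replace an implicit fold with one space.
                line += " "
        parts.append(line)
    return "".join(parts)
```
</artwork><postamble>With join_with_space left at False, the sketch removes the continuation sequence altogether, which is the behavior this document says implementations SHOULD use; passing True selects the permitted alternative of a single ASCII space for folds that have no line-continuation character.</postamble></figure>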
</t><t>Here are some examples:</t><figure anchor="fold.fig2" title="Example of Folding with Preserved Whitespace"><artwork>SomeField : This is some running text \
 that is continued on several lines \
 and which preserves spaces between \
 the words.
%%
AnotherExample: There are three spaces   \
between 'spaces' and 'between' in this record.
%%
SwallowingExample: There are no spaces between \
       the numbers one and two in this example 1\
       2.
%%</artwork></figure><t>Note that entirely blank continuation lines are not permitted. That is, this record is illegal, since the field-body of "SomeText" would be the empty string:</t><figure anchor="foldexamplefig" title="Whitespace Folding Example"><artwork>%%
SomeText:               \
                        \
                        \
%%</artwork></figure></section><section title="Comments" anchor="comments"><t>Comments MAY be included in the body of the record-jar document by placing them at the end of a separator line. The comment MUST be separated by at least one space from the "%%" sequence that introduces the separator.</t><t>Multiple separators MAY appear between records. Logically, this appears to result in records that contain no fields; records containing no fields MUST be ignored by a processor.</t><t>Folding of comments is not permitted; instead, multiple comment lines MUST be used. Comments cannot appear in the body of a record. For example:</t><figure anchor="commentEx" title="Comment example"><artwork name="commentExample">%% this is a comment.
Record: goes here
%%
%% here is another sequence of comments
%% that appear on multiple lines
Record: another record
%% a final comment
%%</artwork></figure></section><section title="Characters, Encodings, and Escapes" anchor="characters"><t>By default, a file containing a record-jar archive uses the UTF-8 character encoding (see <xref target="RFC3629"></xref>). If an application, protocol, or specification permits an encoding other than UTF-8 to be used in the file, it SHOULD also support reading the encoding from the encoding signature. The encoding signature, when present, MUST be the very first line of the file. If the encoding signature is not present, an application or protocol MAY attempt to infer the encoding using other means. Record-jar files SHOULD include an encoding signature, even if one is not required, whenever the application, protocol, or specification permits one.</t><t>A file that uses the UTF-16 or UTF-32 encoding MAY also include a Byte Order Mark (U+FEFF) as the first sequence of two octets (in the case of UTF-16) or four octets (in the case of UTF-32) in the file, just preceding the encoding signature.</t><t>Some applications, protocols, or specifications require that the record-jar file use some other, non-Unicode, legacy character set. In particular, some applications, protocols, or specifications support only the US-ASCII character set (<xref target="ISO646"></xref>).</t><t>Here is an example of the encoding signature for the UTF-8 encoding of Unicode:</t><figure title="Example of an Encoding Signature" anchor="encodingSigFig"><artwork>%%encoding:UTF-8</artwork></figure><t>Printable ASCII characters except backslash ("\") and ampersand ("&amp;") are represented as themselves.</t><t>Non-ASCII values MAY be included in a record-jar file in several ways. For portability, the best mechanism is to use escape sequences in the field-body. 
Exclusive use of escape sequences results in a pure ASCII text file.</t><t>Non-ASCII characters MAY be represented using the character's Unicode value represented using the Numeric Character Reference format adapted from XML; the sequence "&amp;#x" (%x26.23.78) is followed by the character's Unicode scalar value in hex followed directly by the semi-colon character (";", %x3B). Leading zeroes MAY be omitted. For example, the EURO SIGN is U+20AC and could be represented as "&amp;#x20ac;".</t><t>Non-ASCII characters MAY also be represented as their associated octet sequence in the file's character encoding. For example, the EURO SIGN would be represented as the byte sequence %xE2.82.AC in UTF-8.</t><t>The characters for carriage return, newline, and tab when considered as part of the data (and not the file format itself) are represented by the traditional escape sequences "\r" (%x5C.72), "\n" (%x5C.6E), and "\t" (%x5C.74) respectively. The character backslash is represented by "\\" (%x5C.5C), while the ampersand character is represented by "\&amp;" (%x5C.26). A single backslash at the end of a line indicates continuation, as discussed in <xref target="folding"></xref>. Otherwise a single backslash followed by some other character in the data is an error, although a record-jar processor MAY choose to interpret it as a backslash.</t></section></section><section title="Examples" anchor="examples"><t>Here is the canonical example from <xref target="AOUP"></xref>:<figure><artwork>Planet: Mercury
Orbital-Radius: 57,910,000 km
Diameter: 4,880 km
Mass: 3.30e23 kg
%%
Planet: Venus
Orbital-Radius: 108,200,000 km
Diameter: 12,103.6 km
Mass: 4.869e24 kg
%%
Planet: Earth
Orbital-Radius: 149,600,000 km
Diameter: 12,756.3 km
Mass: 5.972e24 kg
Moons: Luna</artwork></figure></t><t>A more complete example, showing more of the format's features, is described in <xref target="RFC4646"></xref>. The data shown here is taken from the Language Subtag Registry defined in that document:<figure><artwork>%%
Type: language
Subtag: ia
Description: Interlingua (International Auxiliary Language \
  Association)
Added: 2005-08-16
%%
Type: language
Subtag: id
Description: Indonesian
Added: 2005-08-16
Suppress-Script: Latn
%%
Type: language
Subtag: nb
Description: Norwegian Bokm&amp;#xE5;l
Added: 2005-08-16
Suppress-Script: Latn
%%
</artwork></figure></t></section></middle><back><references title="Normative References"><reference anchor="RFC4234" target="ftp://ftp.rfc-editor.org/in-notes/rfc4234.txt">
<front>
<title>Augmented BNF for Syntax Specifications: ABNF</title>
<author initials="D" surname="Crocker" fullname="Dave Crocker">
<organization/>
</author>
<author initials="P" surname="Overell" fullname="Paul Overell">
<organization/>
</author>
<date month="October" year="2005"/>

</front>
<seriesInfo name="Internet-Draft" value="draft-crocker-abnf-rfc2234bis-00"/>
<format type="TXT" target="ftp://ftp.rfc-editor.org/in-notes/rfc4234.txt"/>
</reference><reference anchor="Unicode"><front><title>The Unicode Consortium. The Unicode Standard, Version 5.0, (Boston, MA, Addison-Wesley, 2003. ISBN 0-321-49081-0)</title><author><organization>Unicode Consortium</organization></author><date year="2007" day="31" month="January"/></front></reference><reference anchor="UAX31"><front><title>Unicode Standard Annex #31:
Identifier and Pattern Syntax</title><author initials="M" surname="Davis" fullname="Mark Davis"><organization>The Unicode Consortium</organization></author><date year="2006" month="09" day="15"/></front></reference>&rfc2119;&rfc3629;</references><references title="Informative References"><reference anchor="AOUP" target="urn:isbn:0-13-142901-9"><front><title>The Art of Unix Programming</title><author fullname="Eric Steven Raymond" initials="E" surname="Raymond"><organization></organization></author><date year="2003"/><note title="Note about record-jar:"><t>This book contains the reference to the record-jar format in Chapter 5. An online version is here: http://www.faqs.org/docs/artu/ch05s02.html#id2906931.</t></note></front></reference><reference anchor="RFC4646" target="http://www.ietf.org/rfc/rfc4646.txt"><front><title abbrev="draft-ietf-ltru-initial-registry">Tags for the Identification of Languages</title><author initials="A" surname="Phillips" fullname="Addison Phillips" role="editor"><organization>LTRU Working Group</organization></author><author initials="M" surname="Davis" fullname="Mark Davis" role="editor"><organization>LTRU Working Group</organization></author><date day="10" month="September" year="2006"/></front></reference><reference anchor="ISO646"><front><title>ISO/IEC 646:1991, Information technology -- ISO 7-bit coded character set for information interchange. </title><author><organization>International Organization for Standardization</organization></author><date year="1991"/><abstract><t>This standard defines an International Reference Version (IRV) which corresponds exactly to what is widely known as ASCII or US-ASCII. ISO/IEC 646 was based on the earlier standard ECMA-6. ECMA has maintained its standard up to date with respect to ISO/IEC 646 and makes an electronic copy available at http://www.ecma-international.org/publications/standards/Ecma-006.htm. 
ISO/IEC 646 JTC 1/SC 2</t></abstract></front></reference></references><section title="Acknowledgements" anchor="acknowledgements"><t>Thanks to Eric S. Raymond for his gracious permission to both reference and quote The Art of Unix Programming in this document. Without his work, this document would likely not exist.</t><t>Contributors to this document include: Stephane Bortzmeyer, John Cowan, Frank Ellerman, Doug Ewell.</t><t>The IETF LTRU working group adopted the record-jar format on John Cowan's suggestion. That effort required record-jar to be documented, and many people in that group contributed to this work there. The author thanks everyone who participated in that effort, even though names cannot be mustered here.</t></section></back></rfc>
