web-calculus
code Surface Syntax
2002-07-10
This specification describes the mapping of the web-calculus
document model onto a fast, compact binary surface syntax.
The web-calculus code surface syntax is a binary encoding of
the sequence of document events that drive the web-calculus
document state automaton
to reconstruct a specific document.
- The format supports optimized reconstruction of the represented application data structure.
- The encoding and decoding program logic is very simple.
- The format overhead is small enough that efficiency is not a concern in using the
web-calculus document model to
represent any type of application data.
- The format supports the use of an arbitrary character encoding for representing an
Annotation.
Except for the finish() event, all
document event types take a single string parameter. To homogenize
this structure, the finish() event is treated as if it took
a string parameter whose value is always the empty string. The code format therefore consists of a
sequence of strings, one for each document event in the stream of document events necessary to
reconstruct the document.
The state automaton for a document event stream is
simple, involving only one binary decision. The
annotate(Annotation) event may be followed by either the
finish() event or the
assign(Name) event. These two branches are
distinguished in the code format by the fact that the
finish() event requires an empty string
parameter, and the assign(Name) event requires a
non-empty string parameter.
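The single branch described above can be sketched in a few lines. This is an illustration only: the function name and the (event, parameter) tuple it returns are assumptions made for this example and are not defined by the specification.

```python
def event_after_annotate(parameter: str):
    """Pick the event that follows annotate(Annotation).

    Hypothetical helper: an empty string parameter selects finish(),
    any non-empty string selects assign(Name).
    """
    if parameter == "":
        return ("finish", "")
    return ("assign", parameter)
```

For example, event_after_annotate("title") selects the assign branch with Name 'title', while event_after_annotate("") selects finish().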
The full grammar for the code format is:
Document ::= Opcode*
Opcode ::= GetString | PutString
GetString ::= Index
Index ::= ExtensionNumber
PutString ::= '10000000' Charset Chunk* '00000000'
Charset ::= Opcode
Chunk ::= NonZeroLength Octet*
NonZeroLength ::= ExtensionNumber
ExtensionNumber ::= Extension ExtensionNumber | Number
Extension ::= '1' <7 bits>
Number ::= '0' <7 bits>
Octet ::= <8 bits>
ExtensionNumber
An ExtensionNumber encodes an unsigned integer in 7-bit quanta. Each quantum is held
in a single octet. The high bit of each octet, except the last, is set to 1; the high bit of the
last octet is 0. The encoded integer is assembled by concatenating all of the 7-bit quanta. The
quanta are stored in big-endian order, the Internet standard byte ordering.
In an encoding of more than one quantum, the initial quantum MUST NOT contain all
0 bits.
This encoding can represent an unsigned integer of unlimited size while
preserving a compact representation.
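A minimal sketch of the ExtensionNumber encoding, written from the rules above. The function names and the (value, next position) return convention are assumptions made for this example, not part of the specification.

```python
def encode_extension_number(value: int) -> bytes:
    """Encode an unsigned integer as big-endian 7-bit quanta."""
    if value < 0:
        raise ValueError("ExtensionNumber encodes unsigned integers only")
    quanta = [value & 0x7F]                   # last quantum: high bit clear
    value >>= 7
    while value:
        quanta.append(0x80 | (value & 0x7F))  # earlier quanta: high bit set
        value >>= 7
    return bytes(reversed(quanta))            # most significant quantum first


def decode_extension_number(data: bytes, pos: int = 0):
    """Decode one ExtensionNumber at pos; return (value, next position)."""
    first = pos
    value = 0
    while True:
        octet = data[pos]
        if pos == first and octet == 0x80:
            # A multi-quantum number must not start with an all-zero quantum.
            raise ValueError("initial quantum must not be all 0 bits")
        pos += 1
        value = (value << 7) | (octet & 0x7F)
        if not octet & 0x80:                  # high bit clear: final quantum
            return value, pos
```

Values below 128 occupy a single octet; 300 is encoded as the two octets 0x82 0x2C. The encoder never emits a leading all-zero quantum because it stops as soon as no significant bits remain.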
Chunk
String data is encoded in a chunked representation. The chunked representation is a list of zero or
more non-empty chunks, followed by an empty chunk. Each chunk of octets is preceded by the length of
the chunk, measured in octets.
This representation allows string data to be encoded before the full length of
the string is known.
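The chunked representation can be sketched as follows. The helper names are assumptions made for this example, and the encoder takes an iterable of byte strings to mimic a producer that does not know the total length in advance.

```python
def encode_length(n: int) -> bytes:
    """Encode a chunk length as an ExtensionNumber (big-endian 7-bit quanta)."""
    out = [n & 0x7F]
    n >>= 7
    while n:
        out.append(0x80 | (n & 0x7F))
        n >>= 7
    return bytes(reversed(out))


def encode_chunks(pieces) -> bytes:
    """Write each non-empty piece as a chunk, then the empty terminating chunk."""
    out = bytearray()
    for piece in pieces:
        if piece:                  # an empty chunk is reserved as the terminator
            out += encode_length(len(piece)) + piece
    out.append(0x00)               # empty chunk: length 0, no octets
    return bytes(out)


def decode_chunks(data: bytes, pos: int = 0):
    """Read chunks starting at pos; return (payload, next position)."""
    payload = bytearray()
    while True:
        length = 0
        while True:                # read the ExtensionNumber length prefix
            octet = data[pos]
            pos += 1
            length = (length << 7) | (octet & 0x7F)
            if not octet & 0x80:
                break
        if length == 0:            # empty chunk ends the string data
            return bytes(payload), pos
        if pos + length > len(data):
            raise ValueError("truncated chunk")
        payload += data[pos:pos + length]
        pos += length
```

For instance, encode_chunks([b"he", b"llo"]) produces the octets 02 'he' 03 'llo' 00, and decode_chunks on that result recovers b"hello".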
The virtual machine for processing the event stream is composed of a set of reusable meta data strings
and a code segment that contains the sequence of document events. The first element in the meta
data set is the empty string ''. The second element in the meta data set is 'US-ASCII'.
Document events are processed in order until the end of the code segment. Each document event may
access or extend the meta data set, as described below.
start(Schema) event
If the start(Schema) event is a
GetString operation, the
Index is treated as an index into the meta data set. The string at the
specified index is used as the
Schema parameter to the
start(Schema) event.
If the start(Schema) event is a
PutString operation, the specified string is appended to
the meta data set. The specified string is also used as the Schema parameter to the start(Schema) event.
annotate(Annotation) event
If the annotate(Annotation) event is a
GetString operation, the Index
is treated as an index into the meta data set. The string at the specified index is used as the
Annotation parameter to the
annotate(Annotation) event.
If the annotate(Annotation) event is a
PutString operation, the specified string is used as the
Annotation parameter to the
annotate(Annotation) event. The
annotate(Annotation) event does not modify the meta data set.
assign(Name) event
If the assign(Name) event is a
GetString operation, the Index
is treated as an index into the meta data set. The string at the specified index is used as the
Name parameter to the
assign(Name) event.
If the assign(Name) event is a
PutString operation, the specified string is appended to
the meta data set. The specified string is also used as the Name parameter to the assign(Name) event.
finish() event
The finish() event MUST be a
GetString operation. The specified
Index MUST be 0.
The Charset specifier
If the Charset specifier is a GetString operation, the Index is
treated as an index into the meta data set. The string at the specified index is used as the character set name.
If the Charset specifier is a PutString operation, the specified string is appended to the meta
data set. The specified string is also used as the character set name.
A character set name MUST be case-insensitive and composed of only US-ASCII characters. A character
set name MUST be encoded using the US-ASCII character set.
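The per-event rules above can be sketched as a single opcode reader. Everything named here is an assumption made for illustration: the function names, the append flag (true for start(Schema), assign(Name) and the Charset specifier, false for annotate(Annotation)), zero-based indexing of the meta data set, and the choice to append a Charset string before the string it describes, which simply follows reading order. The sketch also assumes that a leading '10000000' octet always marks a PutString, which follows from the grammar together with the rule that an initial quantum must not be all 0 bits. Only US-ASCII is handled.

```python
def read_number(data, pos):
    """Decode an ExtensionNumber; return (value, next position)."""
    value = 0
    while True:
        octet = data[pos]
        pos += 1
        value = (value << 7) | (octet & 0x7F)
        if not octet & 0x80:
            return value, pos


def read_chunks(data, pos):
    """Concatenate chunks up to the empty chunk; return (octets, next position)."""
    payload = bytearray()
    while True:
        length, pos = read_number(data, pos)
        if length == 0:
            return bytes(payload), pos
        payload += data[pos:pos + length]
        pos += length


def read_opcode(data, pos, meta, append):
    """Read one Opcode at pos; return (string, next position).

    meta is the meta data set as a list; append says whether a PutString
    extends it (hypothetical flag chosen by the caller per event type).
    """
    if data[pos] != 0x80:                       # GetString: index into meta
        index, pos = read_number(data, pos)
        return meta[index], pos
    # PutString: marker octet, Charset opcode, chunks, empty chunk.
    charset, pos = read_opcode(data, pos + 1, meta, append=True)
    raw, pos = read_chunks(data, pos)
    if charset.upper() != "US-ASCII":           # names are case-insensitive
        raise NotImplementedError("sketch handles US-ASCII only: " + charset)
    text = raw.decode("ascii")
    if append:
        meta.append(text)
    return text, pos
```

A caller would start with meta = ["", "US-ASCII"], pass append=True for start(Schema) and assign(Name), pass append=False for annotate(Annotation), and check that a finish() opcode is the single octet 0x00.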
Compliant decoders MUST support at least the character sets listed here.
US-ASCII
This character set identifies text data encoded in the US-ASCII character set.
The US-ASCII character set name is 'US-ASCII'.
BASE10
This character set identifies integer data encoded in two's complement. The bytes representing the
integer are stored in big-endian order, the Internet standard byte ordering. When decoded, the
integer MUST be represented as a string containing the base 10 representation of the specified
integer.
The base 10 character set name is 'BASE10'.
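A minimal sketch of the BASE10 conversion, using Python's built-in int.from_bytes and int.to_bytes. The function names are assumptions made for this example. The specification does not say how an empty payload is treated; this sketch decodes it as '0'.

```python
def decode_base10(payload: bytes) -> str:
    """Interpret payload as a big-endian two's complement integer
    and return its base 10 string representation."""
    return str(int.from_bytes(payload, byteorder="big", signed=True))


def encode_base10(text: str) -> bytes:
    """Encode a base 10 integer string as big-endian two's complement.

    The length is always sufficient for the sign bit but is not
    guaranteed to be the shortest possible encoding.
    """
    value = int(text)
    length = (value.bit_length() + 8) // 8
    return value.to_bytes(length, byteorder="big", signed=True)
```

For example, the octets 0x01 0x00 decode to '256' and the single octet 0xFF decodes to '-1'.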
A compliant encoder may choose to use additional character sets. Use of these additional character
sets is separately negotiated with any potential decoder.
IANA
If a character set is a listed
IANA character set, the character set name
MUST be the IANA character set name.
US-ASCII encoding of the Schema and Name
To ensure that any compliant decoder will at least be able to decode the overall form of the encoded
document, the encoder MUST encode all start(Schema) and
assign(Name) events using the US-ASCII character set.
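As a sketch of what such an encoder might emit, the following writes a PutString whose Charset is a GetString of the predefined 'US-ASCII' entry. The function names and the chunk_size parameter are assumptions made for this example, as is the zero-based indexing that places 'US-ASCII' at index 1 of the meta data set.

```python
def encode_number(n: int) -> bytes:
    """Encode an unsigned integer as an ExtensionNumber."""
    out = [n & 0x7F]
    n >>= 7
    while n:
        out.append(0x80 | (n & 0x7F))
        n >>= 7
    return bytes(reversed(out))


def put_string_ascii(text: str, chunk_size: int = 127) -> bytes:
    """Encode text as a PutString in the US-ASCII character set.

    Raises UnicodeEncodeError if text contains non-ASCII characters.
    chunk_size is a hypothetical tuning knob and must be at least 1.
    """
    if chunk_size < 1:
        raise ValueError("chunk_size must be at least 1")
    raw = text.encode("ascii")
    out = bytearray([0x80,     # PutString marker '10000000'
                     0x01])    # Charset: GetString index 1, assumed 'US-ASCII'
    for start in range(0, len(raw), chunk_size):
        piece = raw[start:start + chunk_size]
        out += encode_number(len(piece)) + piece
    out.append(0x00)           # empty chunk terminates the string
    return bytes(out)
```

With these assumptions, a Name of 'item' is written as the octets 80 01 04 'item' 00. A later event that reuses the same Name can then be written as a GetString of the index at which the decoder appended it.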