
Model

web-calculus

code Surface Syntax

2002-07-10

This specification describes the mapping of the web-calculus document model onto a fast, compact binary surface syntax.

Abstract

The web-calculus code surface syntax is a binary encoding of the sequence of document events that drive the web-calculus document state automaton to reconstruct a specific document.

Overview

Design goals

  1. The format supports optimized reconstruction of the represented application data structure.
  2. The encoding and decoding program logic is very simple.
  3. The format overhead is small enough that efficiency is not a concern in using the web-calculus document model to represent any type of application data.
  4. The format supports the use of an arbitrary character encoding for representing an Annotation.

An encoded document event stream

Except for the finish() event, all document event types take a single string parameter. To homogenize this structure, the finish() event is treated as if taking a string parameter whose value is always the empty string. The code format therefore consists of a sequence of strings, one for each document event in the stream of document events necessary to reconstruct the document.

The state automaton for a document event stream is a simple one, involving only one binary decision. The annotate(Annotation) event may be followed by either the finish() event or the assign(Name) event. These two branches are distinguished in the code format by the fact that the finish() event requires an empty string parameter, and the assign(Name) event requires a non-empty string parameter.
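The homogenization and the single binary decision described above can be sketched in Python as follows. This is an illustration only: the `(kind, parameter)` tuple representation and the function names `event_to_string` and `event_after_annotate` are hypothetical and are not defined by this specification.

```python
from typing import Optional, Tuple

# Hypothetical event representation, used only for this illustration:
# (kind, parameter), where the finish event carries no parameter.
Event = Tuple[str, Optional[str]]


def event_to_string(event: Event) -> str:
    """Return the string that stands for an event in the encoded stream."""
    kind, parameter = event
    if kind == "finish":
        return ""  # finish() is treated as taking the empty string
    if parameter is None:
        raise ValueError(f"{kind} requires a string parameter")
    if kind == "assign" and parameter == "":
        raise ValueError("assign(Name) requires a non-empty string")
    return parameter


def event_after_annotate(string: str) -> Event:
    """Resolve the one binary decision: which event follows annotate()."""
    return ("finish", None) if string == "" else ("assign", string)
```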

Description

Grammar

The full grammar for the code format is:

Document ::= Opcode*

Opcode ::= GetString | PutString

GetString ::= Index

Index ::= ExtensionNumber

PutString ::= '10000000' Charset Chunk* '00000000'

Charset ::= Opcode

Chunk ::= NonZeroLength Octet*

NonZeroLength ::= ExtensionNumber

ExtensionNumber ::= Extension ExtensionNumber | Number

Extension ::= '1' <7 bits>

Number ::= '0' <7 bits>

Octet ::= <8 bits>

ExtensionNumber

An ExtensionNumber encodes an unsigned integer in 7-bit quanta. Each quantum is held in the low 7 bits of a single octet. The high bit of each octet, except the last, is set to 1; the high bit of the last octet is 0. The encoded integer is assembled by concatenating all of the 7-bit quanta. The quanta are stored in big-endian order (most significant first), the Internet standard byte ordering.

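In an encoding of more than one quantum, the initial quantum MUST NOT contain all 0 bits. As a consequence, an ExtensionNumber never begins with the octet '10000000', which the grammar uses to mark a PutString.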

This encoding can represent an unsigned integer of unlimited size while keeping small values compact.
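A minimal Python sketch of the ExtensionNumber rules above is given here for illustration. The function names `encode_extension_number` and `decode_extension_number`, and the choice to return the next read position alongside the value, are assumptions of this sketch rather than part of the specification.

```python
from typing import Tuple


def encode_extension_number(value: int) -> bytes:
    """Encode an unsigned integer as big-endian 7-bit quanta."""
    if value < 0:
        raise ValueError("an ExtensionNumber is unsigned")
    quanta = [value & 0x7F]  # last octet: high bit clear
    value >>= 7
    while value:
        quanta.append(0x80 | (value & 0x7F))  # earlier octets: high bit set
        value >>= 7
    # Quanta were collected least significant first; emit big-endian.
    return bytes(reversed(quanta))


def decode_extension_number(data: bytes, pos: int = 0) -> Tuple[int, int]:
    """Decode one ExtensionNumber at pos; return (value, next position)."""
    if data[pos] == 0x80:
        # An initial quantum of all 0 bits is not permitted.
        raise ValueError("initial quantum must not be all 0 bits")
    value = 0
    while True:
        octet = data[pos]
        pos += 1
        value = (value << 7) | (octet & 0x7F)
        if octet & 0x80 == 0:  # high bit clear marks the last octet
            return value, pos
```

Under these rules the integer 300 is encoded as the two octets 0x82 0x2C, and any value below 128 occupies a single octet.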

Chunk

String data is encoded in a chunked representation. The chunked representation is a list of zero or more non-empty chunks, followed by an empty chunk. Each chunk of octets is preceded by the length of the chunk, measured in octets.

This representation allows string data to be encoded before the full length of the string is known.
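The chunked representation can be sketched in Python as shown below. The function names, the `chunk_size` parameter, and its default of 127 octets (chosen only so that each chunk length fits in one octet) are assumptions of this sketch; the specification does not prescribe a chunk size. The sketch covers only the chunk list and its terminating empty chunk, not the PutString marker or the Charset.

```python
from typing import Tuple


def _encode_length(value: int) -> bytes:
    """ExtensionNumber encoding of a chunk length (7-bit quanta, big-endian)."""
    quanta = [value & 0x7F]
    value >>= 7
    while value:
        quanta.append(0x80 | (value & 0x7F))
        value >>= 7
    return bytes(reversed(quanta))


def encode_chunks(data: bytes, chunk_size: int = 127) -> bytes:
    """Encode data as zero or more non-empty chunks plus an empty chunk."""
    if chunk_size < 1:
        raise ValueError("chunk_size must be positive")
    out = bytearray()
    for start in range(0, len(data), chunk_size):
        chunk = data[start:start + chunk_size]
        out += _encode_length(len(chunk))  # length precedes the octets
        out += chunk
    out.append(0x00)  # the empty chunk ends the string
    return bytes(out)


def decode_chunks(code: bytes, pos: int = 0) -> Tuple[bytes, int]:
    """Decode a chunk list at pos; return (string octets, next position)."""
    data = bytearray()
    while True:
        length = 0
        while True:  # read the chunk length as an ExtensionNumber
            octet = code[pos]
            pos += 1
            length = (length << 7) | (octet & 0x7F)
            if octet & 0x80 == 0:
                break
        if length == 0:
            return bytes(data), pos
        data += code[pos:pos + length]
        pos += length
```

Because each chunk carries its own length, an encoder written along these lines could emit chunks as data becomes available, without buffering the whole string first.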

Execution model

The virtual machine for processing the event stream is composed of an indexed set of reusable meta data strings and a code segment that contains the sequence of document events. The first element in the meta data set (index 0) is the empty string, ''. The second element in the meta data set (index 1) is 'US-ASCII'. Document events are processed in order until the end of the code segment. Each document event may access or extend the meta data set, as described below.

start(Schema) event

If the start(Schema) event is a GetString operation, the Index is treated as an index into the meta data set. The string at the specified index is used as the Schema parameter to the start(Schema) event.

If the start(Schema) event is a PutString operation, the specified string is appended to the meta data set. The specified string is also used as the Schema parameter to the start(Schema) event.

annotate(Annotation) event

If the annotate(Annotation) event is a GetString operation, the Index is treated as an index into the meta data set. The string at the specified index is used as the Annotation parameter to the annotate(Annotation) event.

If the annotate(Annotation) event is a PutString operation, the specified string is used as the Annotation parameter to the annotate(Annotation) event. The annotate(Annotation) event does not modify the meta data set.

assign(Name) event

If the assign(Name) event is a GetString operation, the Index is treated as an index into the meta data set. The string at the specified index is used as the Name parameter to the assign(Name) event.

If the assign(Name) event is a PutString operation, the specified string is appended to the meta data set. The specified string is also used as the Name parameter to the assign(Name) event.

finish() event

The finish() event MUST be a GetString operation. The specified Index MUST be 0.

The Charset specifier

If the Charset specifier is a GetString operation, the Index is treated as an index into the meta data set. The string at the specified index is used as the character set name.

If the Charset specifier is a PutString operation, the specified string is appended to the meta data set. The specified string is also used as the character set name.

A character set name MUST be treated as case-insensitive and MUST be composed of only US-ASCII characters. A character set name MUST be encoded using the US-ASCII character set.
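The GetString and PutString handling described in this section can be sketched in Python as follows. Several points are assumptions of the sketch, not statements of the specification: the names `read_number` and `read_opcode`; the `append` flag, which the caller would set to true for start(Schema) and assign(Name) and to false for annotate(Annotation); the order of appends when a PutString carries a PutString Charset (the character set name is appended first, since it is read first); and the use of Python's own codec registry to interpret character set names. Validation of malformed input is omitted, and character sets unknown to Python are not handled.

```python
from typing import List, Tuple

PUT_STRING = 0x80  # the '10000000' marker from the grammar


def read_number(code: bytes, pos: int) -> Tuple[int, int]:
    """Read an ExtensionNumber; return (value, next position)."""
    value = 0
    while True:
        octet = code[pos]
        pos += 1
        value = (value << 7) | (octet & 0x7F)
        if octet & 0x80 == 0:
            return value, pos


def read_opcode(code: bytes, pos: int, meta: List[str],
                append: bool) -> Tuple[str, int]:
    """Read one Opcode; return (string, next position).

    append says whether a PutString extends the meta data set.
    """
    if code[pos] != PUT_STRING:
        index, pos = read_number(code, pos)  # GetString: look up by index
        return meta[index], pos
    # PutString: marker, Charset opcode, chunks, then the empty chunk.
    # A PutString Charset is always appended to the meta data set.
    charset, pos = read_opcode(code, pos + 1, meta, append=True)
    raw = bytearray()
    while True:
        length, pos = read_number(code, pos)
        if length == 0:
            break
        raw += code[pos:pos + length]
        pos += length
    text = raw.decode(charset)  # assumes Python knows this character set
    if append:
        meta.append(text)
    return text, pos
```

Starting from a meta data set of `["", "US-ASCII"]`, the octets 0x80 0x01 0x03 'f' 'o' 'o' 0x00 read with `append=True` would yield 'foo' and leave it at index 2, so that a later GetString with Index 2 returns the same string.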

Required character sets

Compliant decoders MUST support at least the character sets listed here.

US-ASCII

This character set identifies text data encoded in the US-ASCII character set.

The US-ASCII character set name is 'US-ASCII'.

BASE10

This character set identifies integer data encoded in two's complement. The bytes representing the integer are stored in big-endian order, the Internet standard byte ordering. When decoded, the integer MUST be represented as a string containing the base 10 representation of the specified integer.

The base 10 character set name is 'BASE10'.
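A short Python sketch of the BASE10 conversion is given below for illustration. The function names are hypothetical. Two behaviors are assumptions of the sketch, since the specification does not address them: the encoder emits the shortest two's complement form, and the decoder maps an empty octet string to '0'.

```python
def decode_base10(raw: bytes) -> str:
    """Interpret big-endian two's complement octets as a base 10 string."""
    return str(int.from_bytes(raw, byteorder="big", signed=True))


def encode_base10(text: str) -> bytes:
    """Encode a base 10 string as big-endian two's complement octets.

    Uses the fewest octets that can hold the value (an assumption).
    """
    value = int(text)
    length = 1
    # Grow until the value fits in the signed range of `length` octets.
    while not -(1 << (8 * length - 1)) <= value < (1 << (8 * length - 1)):
        length += 1
    return value.to_bytes(length, byteorder="big", signed=True)
```

For example, the single octet 0xFF decodes to '-1', and '128' needs two octets (0x00 0x80) because a lone 0x80 would be read as -128.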

Optional character sets

A compliant encoder may choose to use additional character sets. Use of these additional character sets is separately negotiated with any potential decoder.

IANA

If a character set is a listed IANA character set, the character set name MUST be the IANA character set name.

Encoding conventions

US-ASCII encoding of the Schema and Name

To ensure that any compliant decoder will at least be able to decode the overall form of the encoded document, the encoder MUST encode all start(Schema) and assign(Name) events using the US-ASCII character set.


Copyright 2003 Waterken Inc. All rights reserved.
