web-calculus

Document Model

2005-04-07

This specification defines a document model. A "document model" is the definition of the parse tree that results from parsing an encoded representation of a document. [infoset]

Abstract

A document is an annotated symbolic tree. Each node in the tree is typed with a URI. Each branch in the tree is named with a restricted format string. An annotation string is attached to every branch and node.

Overview

Design goals

  1. A wide variety of application data structures can be conveniently represented.
  2. The represented application data structure is upgradeable.
  3. Comments can be manipulated along with application data.

Tree structure

The document model has a tree structure. As LISP demonstrates, a simple recursive structure can represent a wide variety of application data structures. LISP uses lists of lists; this document model uses a similar tree structure. The main difference is that elements in a list are referenced by position, whereas nodes in a tree can be referenced by a symbolic path.

Symbolic linking

Branches in the tree structure are referenced with symbolic names. This allows new branches to be added to a node, and existing branches to be removed, without affecting the referencing of existing branches. This symbolic referencing can be used to support upgrade of the represented application data structure.
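The effect of symbolic referencing can be illustrated with a small sketch. It assumes, purely for illustration, that a node is held as a Python object whose branches are (annotation, name, child) tuples; the `select` helper and the example.org URIs are hypothetical and not part of this specification.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    schema: str = ""                               # URI, or "" when implicit
    branches: list = field(default_factory=list)   # (annotation, name, child) tuples
    annotation: str = ""


def select(node, path):
    """Follow a sequence of branch names and return every node reached."""
    current = [node]
    for name in path:
        current = [child
                   for n in current
                   for (_, branch_name, child) in n.branches
                   if branch_name == name]
    return current


root = Node("http://example.org/schema/person", [
    ("", "name", Node(annotation="Alice")),
    ("", "email", Node(annotation="alice@example.org")),
])

before = [n.annotation for n in select(root, ["email"])]
# Upgrade the structure: add a new branch in front of the existing ones.
root.branches.insert(0, ("", "nickname", Node(annotation="Al")))
after = [n.annotation for n in select(root, ["email"])]
```

The lookup by name returns the same node before and after the upgrade, whereas a lookup by position (index 1) would now find a different branch.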

Polymorphism

The document model supports polymorphism in the represented application data structure. Each node in the symbolic tree can be associated with a type. A URI is used to represent type. Two different instances of a symbolic tree may have different types of nodes located using the same symbolic path.

Annotation

Annotation of the symbolic tree is used to represent both application data and comments. This puts comments and application data on equal footing, providing a consistent interface for manipulating both.

Description

Grammar

The full grammar for the document model is:

Document ::= Node*

Node ::= Schema Branch* Annotation

Branch ::= Annotation Name Node

Schema ::= URI | ''

Name ::= [a-zA-Z_] [a-zA-Z_0-9]*

Annotation ::= [^#x00]*
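One possible in-memory representation of this grammar is sketched below. The class and function names are illustrative assumptions, not part of the specification, and the sketch does not check that a Schema is a well-formed URI.

```python
import re
from dataclasses import dataclass, field
from typing import List

# Name ::= [a-zA-Z_] [a-zA-Z_0-9]*
NAME = re.compile(r"[a-zA-Z_][a-zA-Z_0-9]*")


def check_annotation(text: str) -> str:
    """Annotation ::= [^#x00]*  (any characters except NUL)."""
    if "\x00" in text:
        raise ValueError("an Annotation must not contain NUL")
    return text


@dataclass
class Node:
    """Node ::= Schema Branch* Annotation"""
    schema: str = ""                                   # Schema ::= URI | ''
    branches: List["Branch"] = field(default_factory=list)
    annotation: str = ""

    def __post_init__(self):
        check_annotation(self.annotation)


@dataclass
class Branch:
    """Branch ::= Annotation Name Node"""
    annotation: str
    name: str
    node: Node

    def __post_init__(self):
        check_annotation(self.annotation)
        if not NAME.fullmatch(self.name):
            raise ValueError("not a valid Name: %r" % self.name)


# Document ::= Node*
Document = List[Node]
```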

Document

A Document is a list of zero or more Nodes.

Node

A Node is a point or vertex in a tree. Every Node is of a particular Schema. A single Annotation is attached to every Node.

A Node contains a list of zero or more Branches. The relative order of Branches with the same Name is significant. The relative order of Branches with different Names is not significant. A Node containing no Branches is called a leaf Node.
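The ordering rule can be made concrete with a comparison sketch. It assumes branches are stored as (annotation, name, child) tuples and groups them by Name before comparing, so that only the relative order within each Name matters. The `canonical` and `equivalent` helpers are hypothetical, and the sketch ignores the whitespace rules for Annotations.

```python
from collections import defaultdict
from dataclasses import dataclass, field


@dataclass
class Node:
    schema: str = ""
    branches: list = field(default_factory=list)   # (annotation, name, child) tuples
    annotation: str = ""


def canonical(node):
    """Build a comparable form that keeps per-Name order and drops cross-Name order."""
    groups = defaultdict(list)
    for annotation, name, child in node.branches:
        groups[name].append((annotation, canonical(child)))
    # Names are unique keys, so sorting only ever compares the names themselves.
    by_name = tuple(sorted((name, tuple(items)) for name, items in groups.items()))
    return (node.schema, node.annotation, by_name)


def equivalent(a, b):
    return canonical(a) == canonical(b)


def leaf(text):
    return Node(annotation=text)


a = Node("", [("", "x", leaf("1")), ("", "y", leaf("2")), ("", "x", leaf("3"))])
b = Node("", [("", "y", leaf("2")), ("", "x", leaf("1")), ("", "x", leaf("3"))])
c = Node("", [("", "x", leaf("3")), ("", "y", leaf("2")), ("", "x", leaf("1"))])
```

Here `a` and `b` differ only in where the `y` Branch sits relative to the `x` Branches, so they compare as equivalent; `c` swaps the two `x` Branches, so it does not.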

Branch

A Branch is a unidirectional edge originating at a parent Node and terminating at a child Node. No two Branches point to the same child Node. A single Annotation is attached to every Branch.

Every Branch has an associated Name. This Name is interpreted within the namespace defined by the Schema of the Branch's parent Node.

Schema

A Schema is a globally unique identifier that identifies a particular type of Node. A Schema MUST be a URI.

Schemas are compared using a straight case-sensitive character comparison. Two Schemas are the same if they are character for character identical.

A child Node with no specified Schema has an implicit Schema uniquely determined by the Schema of the parent Node and the Name of the Branch pointing to the child Node.
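This specification does not say how the implicit Schema is recorded. As a sketch only, assume a lookup table keyed by the parent Schema and the Branch Name; the table, the `effective_schema` helper, and the example.org URIs are all hypothetical.

```python
# Hypothetical table: (parent Schema, Branch Name) -> implicit child Schema.
IMPLICIT = {
    ("http://example.org/schema/person", "name"): "http://example.org/schema/text",
}


def effective_schema(parent_schema, name, child_schema):
    """Return the child's own Schema if specified, else the implicit one."""
    if child_schema:
        return child_schema
    return IMPLICIT[(parent_schema, name)]
```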

The Schema determines the meaning, if any, of the Annotation and list of Branches in a Node.

Name

A Name is a locally unique identifier. The meaning of a Name is local to the defining Schema. Two different Schemas may define identical Names and these definitions are by default independent of each other. Two different Schemas may seek to be polymorphic by defining identical Names with identical definitions.

Names are compared using a straight case-sensitive character comparison. Two Names are the same if they are character for character identical.

A Name SHOULD be formed from one or more words separated by '_'. Each word SHOULD use only lower case letters and SHOULD be spelled using United States English. The sequence of words used SHOULD be descriptive of the Name's definition. This convention is intended to promote interoperation by limiting meaningless variation.
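A checker for the grammar and the naming convention might look like the following sketch. The function name and return values are illustrative assumptions, and the sketch does not attempt to verify spelling or descriptiveness.

```python
import re

GRAMMAR = re.compile(r"[a-zA-Z_][a-zA-Z_0-9]*")   # what a Name must match
CONVENTION = re.compile(r"[a-z]+(_[a-z]+)*")      # lower case words joined by '_'


def check_name(name):
    """Classify a candidate Name as 'invalid', 'unconventional' or 'ok'."""
    if not GRAMMAR.fullmatch(name):
        return "invalid"
    if not CONVENTION.fullmatch(name):
        return "unconventional"
    return "ok"
```

For example, `first_name` follows the convention, `firstName` is a valid Name that does not follow it, and `9lives` is not a Name at all.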

Annotation

An Annotation is a string of zero or more characters. Any Unicode character except NUL (#x00) is a valid component of an Annotation.

An Annotation may be a human readable comment or encoded data that is part of a represented application data structure. The meaning of an Annotation is determined by the defining Schema.

In a leaf Node Annotation, whitespace is significant. In all other Annotations, whitespace is not significant; a document processor may expand or contract any sequence of whitespace characters. A whitespace character is Unicode #x20 (space), #x9 (tab), #xD (carriage return) or #xA (line feed).
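A processor that wants to compare Annotations regardless of insignificant whitespace could normalize them first. The sketch below collapses each run of whitespace to a single space; that particular normal form, and the function name, are assumptions rather than requirements of this specification.

```python
import re

# Only the four whitespace characters listed above; Python's \s would match more.
WHITESPACE_RUN = re.compile(r"[\x20\x09\x0D\x0A]+")


def normalize(annotation, is_leaf_node_annotation):
    """Collapse whitespace runs, except in a leaf Node Annotation."""
    if is_leaf_node_annotation:
        return annotation          # whitespace is significant here
    return WHITESPACE_RUN.sub(" ", annotation)
```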

State automaton

For each Node in the Document, a traversal of the generated parse tree produces a stream of document events that follows this state automaton:

                            ----------------------------
                            |                          |
    begin()-->start(Schema)-->annotate(Annotation)-->finish()-->end()
            |                         |
            -------assign(Name)<-------

start(Schema)

This document event signals the start of a new Node. The event is parameterized with the Schema of the new Node.

annotate(Annotation)

This document event signals an Annotation. If the following event is assign(Name), the Annotation is attached to the next Branch; otherwise, the Annotation is attached to the current Node.

assign(Name)

This document event signals the start of a new Branch. The event is parameterized with the Name of the new Branch.

finish()

This document event signals the end of a Node.
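The event stream can be sketched as a recursive traversal. The sketch assumes branches are stored as (annotation, name, child) tuples and that begin() and end() bracket the traversal of each top-level Node; the event tuples, function names, and the urn:example:list Schema are illustrative only.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    schema: str = ""
    branches: list = field(default_factory=list)   # (annotation, name, child) tuples
    annotation: str = ""


def node_events(node):
    """Yield the events for one Node and, recursively, its children."""
    yield ("start", node.schema)
    for annotation, name, child in node.branches:
        yield ("annotate", annotation)   # followed by assign: belongs to the Branch
        yield ("assign", name)
        yield from node_events(child)
    yield ("annotate", node.annotation)  # followed by finish: belongs to the Node
    yield ("finish",)


def document_events(node):
    """Assumption: begin() and end() wrap each top-level Node."""
    yield ("begin",)
    yield from node_events(node)
    yield ("end",)


root = Node("urn:example:list",
            [("first item", "item", Node(annotation="42"))],
            "a list")
stream = list(document_events(root))
```

Each annotate event is followed either by assign, in which case it carries a Branch Annotation, or by finish, in which case it carries the Annotation of the current Node.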

Footnotes

[infoset] The term "document model" and the W3C's "infoset" term refer to similar concepts.


Copyright 2003-2005 Waterken Inc. All rights reserved.
