At Home                                                 L. Donnerhacke
(private)                                                        Draft
                                                      February 2, 1997

              HDDB - Hierarchical Distributed Data Base
                      draft-hyper-j-hbbd-04.txt

Status
~~~~~~

Ideally, this draft will form the basis for an official Internet
protocol.

Abstract
~~~~~~~~

This draft proposes an enhanced DNS protocol for managing large,
freely defined data sets.  Scalability and fast access are the main
design criteria.

Contents
~~~~~~~~

TBW

Background
~~~~~~~~~~

Spring 1994, Berlin: Heiko Schlichtig wrote down a list of criteria
for a simple UseNet News Administration System (NAS) [1].  The main
idea of this concept was to store the data in TXT Resource Records of
DNS in order to exploit the scalability and decentralization of DNS.
These records were designed to store data such as group names,
archive locations, description lines, moderators and more.  Heiko
came to the conclusion that DNS is not capable of managing large
records and that the caching mechanism of DNS might be damaged,
resulting in DNS servers refusing to handle these queries.

Late autumn 1994, Jena: The local web administration and the
administrator of Religio [2] spent a number of dinners together,
because Religio had grown larger and larger ... unmanageable.  The
idea of a global keyword management system was born: a system of
searchable keywords which can easily be used to index all documents,
comparable to the systematic catalog used for books in a typical
library.  Librarians are still unable to provide such a list of
keywords, for three reasons:

 - Traditional indexing of books requires an extensible and
   consistent list in each library.  This list needs to be up to
   date.
 - Keywords must be available in the librarian's native language.
 - If such a list actually existed, some librarians might even lose
   their jobs.

So the Jena administration checked Hyper-G [3].
But Hyper-G is not scalable: it offers a worldwide database of all
possible links to any document.  Besides, Hyper-G was not available
as source code.  So the idea of a systematic catalog for every
Internet resource came up and was called Hyper-J.  HDDB is the main
part of Hyper-J.

Introduction
~~~~~~~~~~~~

HDDB has been invented in order to access huge sets of data indexed
by a key derived from a hierarchical structure.  The data is accessed
and stored by sets of servers, connected by the hierarchy of keys.
Access to any entry should require no more than the bare minimum
necessary to structure the information and the data itself.  Data are
authoritatively stored on distributed servers, including fallback to
secondary servers.

The main structural ideas are directly derived from DNS [4].  Thus,
similar records for storing this structural information are taken
from DNS, such as SOA, NS, TTL, SERIAL, CONTACT, RETRY and EXPIRE.
The reading direction of the hierarchy elements is inverted due to
cultural requirements; this way the application NAS fits the naming
structure of UseNet in a natural manner.

Unlike DNS, HDDB has an unlimited number of record types for each
hierarchy.  In order to guarantee correct and consistent usage of
self-defined record types, strict syntax checks are provided.  Such
syntax definitions are specified using regular expressions
(regexps) [5].

Each request contains a key mask and a type mask; both are regular
expressions.  Any request should be resolved by exactly one server.
If a server has several matching hierarchy delegations, it should
not resolve these delegations recursively; it should return regular
expressions instead.  On the other hand, the server should add as
many pieces of information as possible to the answer, including
those from its cache.  Thus, a resolver can use more closely located
servers or cached data for further requests.
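As an illustration only (not part of this draft), the request model
described above can be sketched in Python.  All record contents, the
delegation list and the resolve() function are assumptions for the
example; the draft does not define them:

```python
import re

# Illustrative record store for one server: (key, type) -> value.
# Keys read left to right, UseNet style (the inverted DNS order).
RECORDS = {
    ("de.comp.os.unix", "DESC"): "Unix discussion (German)",
    ("de.comp.os.unix", "MOD"):  "none",
    ("de.comp.lang.c",  "DESC"): "C programming (German)",
}

# Sub-hierarchies delegated to other servers, kept as regexps.
DELEGATIONS = [r"de\.sci\..*"]

def resolve(key_mask, type_mask):
    """Match the request's key mask and type mask against local
    records.  Delegations are returned as regexps for the resolver
    to follow itself; the server does not recurse into them."""
    key_re = re.compile(key_mask)
    type_re = re.compile(type_mask)
    answers = [(k, t, v) for (k, t), v in RECORDS.items()
               if key_re.fullmatch(k) and type_re.fullmatch(t)]
    # A real server would check whether a delegation can intersect
    # the key mask; here every delegation is simply referred back.
    referrals = list(DELEGATIONS)
    return answers, referrals

answers, referrals = resolve(r"de\.comp\..*", r"DESC")
# Both DESC records under de.comp match; the de.sci delegation is
# returned as a regexp referral rather than resolved recursively.
```

Note that the sketch deliberately returns the delegation unresolved,
matching the rule above that a server should hand back regular
expressions instead of recursing.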
That is why primary servers (SOA) only provide data to secondary
servers, and only secondary servers (NS) answer resolver requests.
This should reduce international traffic and result in quicker
response times.

Both the resolver and the server can set a maximum number of
returned answers.  All unresolved answers are concatenated into a
single regexp, returned as the last answer.  This summarizing regexp
MUST NOT match already transferred answers and SHOULD NOT match keys
or types known to be nonexistent.

Key hierarchies
~~~~~~~~~~~~~~~

In order to speed up search access to huge data sets, an indexing
key is used.  These keys are called first level keys.  They
reference data records.  If the set of first level keys is too
large, a second key will be provided.  This key to the first level
keys is called a second level key.  This method is applied
recursively until each key points to a small set of keys or data.
This way each set of data indexed by a key can be handled manually
on a single server.

This draft describes a key management system with the following
axioms:

 - Every 1st level key points to exactly one data record and vice
   versa.
 - Every n-level key points to exactly one hierarchy record
   consisting of m-level keys (m