Page 88 - Building Digital Libraries
General-Purpose Technologies Useful for Digital Repositories
used for structuring legal documents to a standardized format. They were
working for IBM at the time, and the goal of the project was to create a
method of structuring documents so that those documents could be read
and acted upon by a computer. At the time, this action was mainly limited to
the formatting and reformatting of data for publication—but the underlying
concept would allow the documents to become more useful because tools
could be built to act upon the structured data. In 1970, IBM extended the
project to encompass general text processing and SGML was born. In the
early 1990s, a group from CERN developed HTML, a small application of SGML
for the publication of linked documents on the Web. This markup language
provided a common tag set for creating markup for online publication, and
it quickly became the lingua franca for publishing on the World Wide Web
(WWW). Finally, in 1996, a group known as the XML Working Group was
formed out of the W3C with the goal of creating a subset of SGML that
would enable generic SGML objects to be processed over the WWW in
much the same way that HTML is now (W3C, www.w3.org/TR/REC-xml/).
In figure 5.1, we see a number of items located below XML. These
represent technologies or metadata schemas that have been
developed out of the XML specification. What’s important to note is that
XML is, in large part, simply a markup language for data. By itself,
XML has no inherent functionality beyond the meaning and context
that it brings to the data that it describes. That alone, however, is a
very powerful function. The ability to give elements within a document
properties and attributes allows one to enhance the available metadata by
creating context for the document elements. What’s more, XML is not a
“flat” metadata structure, but one that allows for the creation of hierarchical
relationships between elements within a document. An XML document can
literally become a digital object in and of itself. This is very different from a
record format like MARC, which functions solely as a container for data
transfer. As a “flat” data structure, MARC lacks the ability to add context to
the bibliographic data that it contains. Moreover, data stored in XML has the
ability to exist separately from the object that it describes and still maintain
meaning, given its contextual and hierarchical nature. However, as stated
above, XML is in essence simply a fancy text file. What makes XML special
and useful as a metadata schema is the ancillary technologies currently
built around the XML specification that can be used to interpret meaning
from the various properties and attributes of a given element. At the heart
of these technologies is the XML parser. Currently, a number of different
XML parsers exist, including Saxon, a Java-based XSLT processor; libxml, the
GNOME XML parsing library; and MSXML, the XML parser currently built
into the Windows operating system. These XML parsers offer users two
primary methods for interacting with an XML document: DOM (Document
Object Model) and SAX (Simple API for XML).
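To make the contrast between the two approaches concrete, the following is a minimal sketch using the DOM and SAX modules in Python's standard library. The record, its element names (record, title, creators, creator), and its attributes (id, lang, role) are hypothetical and invented for illustration; they are not drawn from any real metadata schema. The sketch also shows the hierarchy and attributes described above: each creator element is nested inside creators and carries a role attribute that gives its text context.

```python
import xml.sax
from xml.dom import minidom

# Hypothetical record invented for illustration; the element and attribute
# names are assumptions, not part of any real metadata schema.
RECORD = """<record id="example-1">
  <title lang="en">Sample Title</title>
  <creators>
    <creator role="author">First Person</creator>
    <creator role="editor">Second Person</creator>
  </creators>
</record>"""


def creators_via_dom(xml_text):
    """DOM: load the whole document into a tree, then query the tree."""
    doc = minidom.parseString(xml_text)
    return [
        (node.getAttribute("role"), node.firstChild.data)
        for node in doc.getElementsByTagName("creator")
    ]


class CreatorHandler(xml.sax.ContentHandler):
    """SAX: react to start/end/text events as the parser streams the input."""

    def __init__(self):
        super().__init__()
        self.creators = []
        self._in_creator = False
        self._role = None
        self._text = []

    def startElement(self, name, attrs):
        if name == "creator":
            self._in_creator = True
            self._role = attrs.get("role")
            self._text = []

    def characters(self, content):
        # Text may arrive in several chunks, so collect the pieces.
        if self._in_creator:
            self._text.append(content)

    def endElement(self, name):
        if name == "creator":
            self.creators.append((self._role, "".join(self._text)))
            self._in_creator = False


def creators_via_sax(xml_text):
    """Run the SAX handler over the document and return what it collected."""
    handler = CreatorHandler()
    xml.sax.parseString(xml_text.encode("utf-8"), handler)
    return handler.creators


if __name__ == "__main__":
    print(creators_via_dom(RECORD))
    print(creators_via_sax(RECORD))
```

Both functions return the same list of role and name pairs. The difference lies in how they get there: the DOM version holds the entire document in memory as a tree that can be navigated in any order, while the SAX version sees the document once, as a stream of events, and keeps only what the handler chooses to store.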
1. DOM—The Document Object Model should be familiar
to anyone who does web development using JavaScript.
DOM is a platform-neutral interface with the content