Page 88 - Building Digital Libraries

General-Purpose Technologies Useful for Digital Repositories


                 used for structuring legal documents to a standardized format. They were
                 working for IBM at the time, and the goal of the project was to create a
                 method of structuring documents so that those documents could be read
                 and acted upon by a computer. At the time, this action was mainly limited to
                 the formatting and reformatting of data for publication—but the underlying
                 concept would allow the documents to become more useful because tools
                 could be built to act upon the structured data. In 1970, IBM extended the
                 project to encompass general text processing, and the approach that would
                 become SGML was born. In the
                 early 1990s, a group from CERN developed HTML, a simple application of SGML
                 for the publication of linked documents on the Web. This markup language
                 provided a common tag set for marking up documents for online publication, and
                 it quickly became the lingua franca for publishing on the World Wide Web
                 (WWW). Finally, in 1996, a group known as the XML Working Group was
                 formed out of the W3C with the goal of creating a subset of SGML that
                 would enable generic SGML objects to be processed over the WWW in
                 much the same way that HTML is now (W3C, www.w3.org/TR/REC-xml/).
                     In figure 5.1, we see a number of items located below XML. These
                 represent a number of technologies or metadata schemas that have been
                 developed out of the XML specification. What’s important to note is that
                 XML is, in large part, simply a markup language for data. By itself,
                 XML has no inherent functionality beyond the meaning and context
                 that it brings to the data that it describes. Even so, this is a
                 very powerful function. The ability to give elements within a document
                 properties and attributes allows one to enhance the available metadata by
                 creating context for the document elements. What’s more, XML is not a
                 “flat” metadata structure, but one that allows for the creation of hierarchical
                 relationships between elements within a document. An XML document can
                 literally become a digital object in and of itself. This is very different from a
                 record format like MARC, which functions solely as a container for data
                 transfer. As a “flat” data structure, MARC lacks the ability to add context to
                 the bibliographic data that it contains. Moreover, data stored in XML has the
                 ability to exist separately from the object that it describes and still maintain
                 meaning, given its contextual and hierarchical nature. However, as stated
                 above, XML is in essence simply a fancy text file. What makes XML special
                 and useful as a metadata schema is the set of ancillary technologies currently
                 built around the XML specification that can be used to interpret meaning
                 from the various properties and attributes of a given element. At the heart
                 of these technologies is the XML parser. Currently, a number of different
                 XML parsers exist, including Saxon, a Java XML/XSLT parser; libxml, the
                 GNOME XML parsing library; and MSXML, the XML parser currently built
                 into the Windows operating system. These XML parsers offer users two
                 primary methods for interacting with an XML document: DOM (Document
                 Object Model) and SAX (Simple API for XML).
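
As a rough illustration of these two access styles, the sketch below reads the same small XML record once through a DOM tree and once through SAX events. It is only a sketch under stated assumptions: the record, its element names (record, creators, creator), and the role attribute are hypothetical and invented for this example, and it uses Python's standard-library xml.dom.minidom and xml.sax modules rather than Saxon, libxml, or MSXML.

```python
import xml.dom.minidom
import xml.sax

# Hypothetical record, invented for illustration only. Note the nesting
# (creators > creator) and the attributes that add context to each element.
RECORD = """<record id="r1">
  <title lang="en">Sample Title</title>
  <creators>
    <creator role="author">First Author</creator>
    <creator role="editor">Second Author</creator>
  </creators>
</record>"""


def creators_via_dom(xml_text):
    """DOM style: load the whole document into a tree, then navigate it."""
    doc = xml.dom.minidom.parseString(xml_text)
    return [
        (node.getAttribute("role"), node.firstChild.data)
        for node in doc.getElementsByTagName("creator")
    ]


class CreatorHandler(xml.sax.ContentHandler):
    """SAX style: react to parse events as the document streams past."""

    def __init__(self):
        super().__init__()
        self.creators = []
        self._role = None   # set while inside a <creator> element
        self._text = []     # text chunks collected for the current element

    def startElement(self, name, attrs):
        if name == "creator":
            self._role = attrs.get("role", "")
            self._text = []

    def characters(self, content):
        # Text may arrive in several chunks, so collect and join later.
        if self._role is not None:
            self._text.append(content)

    def endElement(self, name):
        if name == "creator":
            self.creators.append((self._role, "".join(self._text)))
            self._role = None


def creators_via_sax(xml_text):
    handler = CreatorHandler()
    xml.sax.parseString(xml_text.encode("utf-8"), handler)
    return handler.creators


if __name__ == "__main__":
    print(creators_via_dom(RECORD))
    print(creators_via_sax(RECORD))
```

Both functions return the same list of (role, name) pairs. The DOM version holds the entire document in memory and allows random access to any element, while the SAX version keeps only the state the handler chooses to retain, which tends to suit large documents read in a single pass.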

                        1.  DOM—The Document Object Model should be familiar
                           to anyone who does web development using JavaScript.
                           DOM is a platform-neutral interface with the content