Page 84 - Building Digital Libraries

General-Purpose Technologies Useful for Digital Repositories


                 be viewed, edited, validated, and acted upon by XML tools, and it became
                 a part of the HTML5 vocabulary with the revision in 2014.¹ The interest
                 in XHTML lies in its strict validation rules and the ability to process
                 documents using XML parsing engines and XPath. What’s more, due to its
                 strict adherence to XML validation, browsers have an easier time processing
                 XHTML documents. Since XHTML is XML-conforming, it must respect
                 XML’s strict set of document validation rules relating to tags and character
                 data. This is very different from the implementation of HTML, which is in
                 many respects a sloppy markup language that has been loosely interpreted
                 by today’s web-browsing technology. It’s this promise of markup standard-
                 ization and strict validation that continues to excite those developing for the
                 Web. Additionally, the fact that XHTML documents can be validated and
                 acted upon by XML tools and technologies like XPath and XQuery should
                 not be underestimated. XHTML documents give web developers the abil-
                 ity to build documents for display, while still allowing the documents to be
                 parseable and actionable by other groups and users. For institutions that
                 want to make their content available for users to build services or “mash-ups”
                 on top of their collections, but lack the expertise to provide a fully
                 functional web-based API (application programming interface), offering
                 content in XHTML can help to expose their collections or services. Within
                 the library community, XHTML implementations are becoming easier to
                 find. Many integrated library systems (ILSs) offer an interface that renders against an XHTML
                 schema, though the most public and widely used XHTML resource in the
                 library community is OCLC’s WorldCat.org service.² WorldCat.org provides
                 a publicly searchable portal to the OCLC digital properties. However, what’s
                 more interesting is OCLC’s decision to have WorldCat.org generate content
                 in XHTML. By providing content in XHTML, OCLC has provided the minimal
                 tool set necessary to parse, extract, or embed data from the WorldCat.org
                 project into other services. For example, using an XPath statement
                 (which will be defined and explained below), one can easily extract the
                 holding libraries from a document. So how do we identify XHTML records?
                 We can identify them by the document type declared at the top of the
                 source code. If we look at the source for the following record:
                 www.worldcat.org/oclc/26557254&referer=brief_results, we will find the
                 following statement at the top of the XHTML file:


                     <!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN"
                     "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
                     <html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en" lang="en">
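
                 A check for this declaration can be sketched in a few lines of Python.
                 This is only an illustration, not a tool used by WorldCat.org: the
                 function name xhtml_public_id and the sample string are assumptions made
                 for the example, and the sample writes the declaration with straight
                 quotes and the standard W3C DTD address.

```python
import re
from typing import Optional

# Captures the public identifier of a DOCTYPE declaration, for example
# "-//W3C//DTD XHTML 1.0 Transitional//EN".
DOCTYPE_PATTERN = re.compile(
    r'<!DOCTYPE\s+html\s+PUBLIC\s+"([^"]+)"',
    re.IGNORECASE,
)


def xhtml_public_id(source: str) -> Optional[str]:
    """Return the DOCTYPE public identifier if it names an XHTML DTD, else None.

    Hypothetical helper for illustration; it inspects only the start of the
    document, where a DOCTYPE declaration is expected to appear.
    """
    match = DOCTYPE_PATTERN.search(source[:1024])
    if match and "XHTML" in match.group(1).upper():
        return match.group(1)
    return None


# Sample input modeled on the declaration shown above (assumed, not fetched).
sample = (
    '<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" '
    '"http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">'
    '<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en" lang="en">'
    "</html>"
)

print(xhtml_public_id(sample))                          # an XHTML 1.0 identifier
print(xhtml_public_id("<!DOCTYPE html><html></html>"))  # None: no XHTML DTD named
```

                 A pattern match of this kind only reports what the document claims to
                 be; it does not validate the document against the DTD.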

                 The source code identifies itself as XHTML to the web browser in the first
                 line of the source: the DOCTYPE statement names the DTD (document type
                 definition), which specifies the parsing rules and the elements that can
                 legally be used within the document. Because the document is XHTML, each
                 element or tag can be addressed with an XPath expression, in which the
                 <html> tag is the root element of the XML tree.
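
                 As a rough sketch of what this makes possible, the Python example below
                 parses a small XHTML page with a standard XML parser and uses an XPath
                 expression to pull out a list of holding libraries. The page content,
                 the "holdings" class name, and the library names are all hypothetical
                 and do not reflect the actual WorldCat.org markup. Python's ElementTree
                 module supports only a subset of XPath, which is enough for this case.

```python
import xml.etree.ElementTree as ET
from typing import List

# XHTML elements live in this namespace, so XPath steps must be qualified.
XHTML_NS = {"x": "http://www.w3.org/1999/xhtml"}

# Hypothetical record page; the structure and class names are assumptions.
SAMPLE_PAGE = """<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en" lang="en">
  <head><title>Sample record</title></head>
  <body>
    <ul class="holdings">
      <li>Example State University</li>
      <li>Example Public Library</li>
    </ul>
  </body>
</html>"""


def holding_libraries(xhtml_source: str) -> List[str]:
    """Return the text of each list item inside the (assumed) holdings list."""
    # Parsing succeeds only if the page is well-formed XML.
    root = ET.fromstring(xhtml_source)
    # root is the <html> element, so the path is evaluated from the tree's root.
    items = root.findall(".//x:ul[@class='holdings']/x:li", XHTML_NS)
    return [item.text.strip() for item in items if item.text]


root = ET.fromstring(SAMPLE_PAGE)
print(root.tag)                        # namespace-qualified name of the root element
print(holding_libraries(SAMPLE_PAGE))  # the two sample library names
```

                 The same call applied to ordinary, loosely written HTML would typically
                 raise a parse error, which is the practical difference that well-formed
                 XHTML makes for anyone building services on top of a page.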
