Page 84 - Building Digital Libraries
General-Purpose Technologies Useful for Digital Repositories
be viewed, edited, validated, and acted upon by XML tools, and it became
a part of the HTML5 vocabulary with the revision in 2014. The interest
behind XHTML lies in its strict validation rules and the ability to process
documents using XML parsing engines and XPath. What’s more, due to its
strict adherence to XML validation, browsers have an easier time processing
XHTML documents. Since XHTML is XML-conforming, it must respect
XML’s strict set of document validation rules relating to tags and character
data. This is very different from the implementation of HTML, which is in
many respects a sloppy markup language that has been loosely interpreted
by today’s web-browsing technology. It’s this promise of markup standard-
ization and strict validation that continues to excite those developing for the
Web. Additionally, the fact that XHTML documents can be validated and
acted upon by XML tools and technologies like XPath and XQuery should
not be underestimated. XHTML documents give web developers the abil-
ity to build documents for display, while still allowing the documents to be
parseable and actionable by other groups and users. For institutions wanting
to make their content available for users to build services or “mash-ups”
on top of their collections, but which lack the expertise to provide a fully
functional web-based API (application programming interface), offering
content in XHTML can help to expose their collections or services. Within
the library community, XHTML implementations are becoming easier to
find. Many ILS systems offer an interface that renders against an XHTML
schema, though the most public and widely used XHTML resource in the
library community is OCLC’s WorldCat.org service. WorldCat.org provides
a publicly searchable portal to the OCLC digital properties. However, what’s
more interesting is OCLC’s decision to have WorldCat.org generate content
in XHTML. By providing content in XHTML, OCLC has provided the minimal
tool set necessary to parse, extract, or embed data from the WorldCat.org
project into other services. For example, using an XPath statement
(which will be defined and explained below), one can easily extract the
holdings libraries from a document. So how do we identify XHTML records?
By the document type declared at the top of the source code. If we look
at the source for the following record:
www.worldcat.org/oclc/26557254&referer=brief_results, we will find the
following statement at the top of the XHTML file:
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN"
"http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en" lang="en">
The source code identifies itself as XHTML to the web browser in the first
line of the source: the DOCTYPE statement, which names the DTD (document
type definition) that defines the parsing rules and the elements that can
legally be used within the document. Because this is an XHTML document,
each element or tag can be addressed with an XPath expression, in which the
HTML tag is represented as the root tag in the XML tree.
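
The two ideas above, recognizing XHTML by its DOCTYPE and then walking the
document with XPath from the html root, can be sketched in a few lines of
Python. This is a minimal illustration using only the standard library, not
a description of how WorldCat.org is actually structured: the sample page,
the class names "holdings" and "library", and the helper names is_xhtml and
holding_libraries are all assumptions made for the example. Note that
Python's ElementTree supports only a subset of XPath, and that elements must
be addressed through the XHTML namespace declared on the html tag.

```python
import re
import xml.etree.ElementTree as ET

XHTML_NS = "http://www.w3.org/1999/xhtml"

# Hypothetical sample page. The body markup is illustrative only and is
# NOT copied from WorldCat.org's real source.
SAMPLE = """<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN"
 "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en" lang="en">
  <head><title>Sample record</title></head>
  <body>
    <ul class="holdings">
      <li class="library">Example University Library</li>
      <li class="library">Example Public Library</li>
    </ul>
  </body>
</html>"""


def is_xhtml(source):
    """Return True if the DOCTYPE's public identifier names an XHTML DTD."""
    match = re.search(
        r'<!DOCTYPE\s+html\s+PUBLIC\s+"([^"]+)"', source, re.IGNORECASE
    )
    return bool(match) and "XHTML" in match.group(1)


def holding_libraries(source):
    """Parse the page as XML and return the text of each library entry."""
    root = ET.fromstring(source)  # root is the namespaced html element
    ns = {"x": XHTML_NS}          # prefix used in the XPath expression
    items = root.findall(".//x:li[@class='library']", ns)
    return [li.text.strip() for li in items]


if __name__ == "__main__":
    print(is_xhtml(SAMPLE))
    print(holding_libraries(SAMPLE))
```

Because the page is well-formed XML, no HTML-specific "tag soup" parser is
needed; the same expression would fail on loosely written HTML, which is the
practical benefit of strict validation described in this section.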