Page 198 - Building Digital Libraries
Sharing Data—Harvesting, Linking, and Distribution
the digitized content being made available through the HathiTrust. When
selecting materials to be digitized, the libraries used a wide range of criteria
to ensure that the materials selected were of significant value and unique to
the HathiTrust. Additionally, participants had to evaluate the condition and
form of the content to be digitized, since Google’s process was optimized to
handle only the most common material forms in fair to good condition. This
meant that at the conclusion of the project, most libraries had thousands
of items that had been identified as good candidates for digitization but
disqualified because of their condition or form. OSUL was no
different; the libraries had identified a wide range of materials to contribute
to the HathiTrust, but they would need a different workflow to digitize and
upload the content.
To move the project forward, the libraries partnered with the Internet
Archive. Materials were digitized with scanners, and software purchased
from the Internet Archive streamlined the deposit of materials
into the Internet Archive’s management system. This would allow the libraries
to use the Internet Archive’s OpenLibrary platform to provide
wide access to the content. Moreover, once the content had been ingested
into the Internet Archive, materials could be identified and provided to
the HathiTrust through a partnership agreement between the Internet
Archive and the HathiTrust. To make the transfer, OSUL would just need
to generate a specially formatted XML file which included the metadata for
the digitized item in MARCXML, and a specially coded set of fields that
provided the necessary identifier information for the digitized materials at
the Internet Archive.
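
As an illustration of the kind of transfer record described above, the following Python sketch adds one identifier field to an existing MARCXML record. It is a minimal example under stated assumptions, not the format the HathiTrust actually requires: the field tag (955), the subfield code (b), and the identifier value are placeholders chosen for illustration, and the sample record is invented.

```python
import xml.etree.ElementTree as ET

MARC_NS = "http://www.loc.gov/MARC21/slim"
# Serialize MARCXML elements without an "ns0:" prefix.
ET.register_namespace("", MARC_NS)


def add_identifier_field(record_xml, ia_identifier, tag="955", code="b"):
    """Return the MARCXML record with one extra datafield holding the
    Internet Archive identifier. Tag and subfield code are placeholders."""
    record = ET.fromstring(record_xml)
    field = ET.SubElement(
        record,
        f"{{{MARC_NS}}}datafield",
        {"tag": tag, "ind1": " ", "ind2": " "},
    )
    subfield = ET.SubElement(field, f"{{{MARC_NS}}}subfield", {"code": code})
    subfield.text = ia_identifier
    return ET.tostring(record, encoding="unicode")


# Invented sample record; a real one would come from the library catalog.
SAMPLE_RECORD = """<record xmlns="http://www.loc.gov/MARC21/slim">
  <leader>00000nam a2200000 a 4500</leader>
  <datafield tag="245" ind1="1" ind2="0">
    <subfield code="a">Example title</subfield>
  </datafield>
</record>"""

# "exampleitem00osul" is a hypothetical identifier.
transfer_record = add_identifier_field(SAMPLE_RECORD, "exampleitem00osul")
print(transfer_record)
```

Passing the tag and subfield code as parameters keeps the sketch independent of any one partner's field conventions.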
In early testing, the libraries relied on staff to hand-code these transfer
files. MARC data would be converted to MARCXML using a range of tools,
and then a staff member would code the information related directly to the
Internet Archive files that were slated for data transfer. The problem was
that this process didn’t scale. The libraries had identified hundreds of
thousands of potential items to digitize and transfer in this manner, meaning
that an automated process needed to be developed.
To facilitate the process, a plug-in was created in MarcEdit to leverage
the Internet Archive search API. The Internet Archive provides an API that
returns information about items in an XML format. Using this information,
the libraries could create a process that would query the Internet Archive,
retrieve the list of digitized materials over a specified period of time, extract
the specific item data, and then automatically generate the transfer file. (See
figure 7.7.)
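
The query-and-extract steps described above can be sketched in Python. This is not the MarcEdit plug-in's code; it is a rough illustration that assumes the Internet Archive's advanced search endpoint, a `publicdate` range query, and a Solr-style XML response. The collection name, the field list, and the sample response are hypothetical, and no network request is made.

```python
from urllib.parse import urlencode
import xml.etree.ElementTree as ET

# Assumed endpoint; the text says only that the API returns XML.
SEARCH_URL = "https://archive.org/advancedsearch.php"


def build_search_url(collection, start_date, end_date, rows=100):
    """Build a search URL for items in a collection made public within
    a date range. Field names here are assumptions."""
    query = f"collection:{collection} AND publicdate:[{start_date} TO {end_date}]"
    params = [
        ("q", query),
        ("fl[]", "identifier"),
        ("fl[]", "title"),
        ("rows", rows),
        ("output", "xml"),
    ]
    return SEARCH_URL + "?" + urlencode(params)


def extract_items(response_xml):
    """Turn each <doc> element of the response into a dict of its
    named fields."""
    root = ET.fromstring(response_xml)
    items = []
    for doc in root.iter("doc"):
        items.append({el.get("name"): el.text for el in doc if el.get("name")})
    return items


# Invented response in the assumed Solr-style layout.
SAMPLE_RESPONSE = """<response>
  <result name="response" numFound="2" start="0">
    <doc>
      <str name="identifier">exampleitem00osul</str>
      <str name="title">Example title</str>
    </doc>
    <doc>
      <str name="identifier">exampleitem01osul</str>
      <str name="title">Another example title</str>
    </doc>
  </result>
</response>"""

url = build_search_url("examplecollection", "2012-01-01", "2012-01-31")
items = extract_items(SAMPLE_RESPONSE)
for item in items:
    print(item["identifier"], "-", item["title"])
```

Each extracted identifier could then be paired with the item's MARCXML record to generate the transfer file.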
Using the plug-in, the deposit process shifted from a manual process
that could handle a few dozen records a month to an automated process
that was limited only by the organization’s capacity to digitize content. And
because the process was built on shared API and communication
standards for both the Internet Archive and the HathiTrust, the plug-in
was purposefully developed so that its use and output wouldn’t be tied
just to OSUL’s local workflow, but would be generic, so that it could