Page 198 - Building Digital Libraries

Sharing Data—Harvesting, Linking, and Distribution


                 the digitized content being made available through the HathiTrust. When
                 selecting materials to be digitized, the libraries used a wide range of criteria
                 to ensure that the materials selected were of significant value, and unique to
                 the HathiTrust. Additionally, participants had to evaluate the condition and
                 form of the content to be digitized, since Google’s process was optimized to
                 handle only the most common material forms in fair to good condition. This
                 meant that at the conclusion of the project, most libraries had thousands
                 of items that were identified as good candidates for digitization, but had
                 been disqualified due to the condition or form of the items. OSUL was no
                 different; the libraries had identified a wide range of materials to contribute
                 to the HathiTrust, but they would need a different workflow to digitize and
                 upload the content.
                     To move the project forward, the libraries partnered with the Internet
                 Archive. Materials were digitized with scanners, and software purchased
                 from the Internet Archive streamlined depositing the materials in the
                 Internet Archive’s management system. This allowed the libraries
                 to use the Internet Archive’s OpenLibrary platform to provide
                 wide access to the content. Moreover, once the content had been ingested
                 into the Internet Archive, materials could be identified and provided to
                 the HathiTrust through a partnership agreement between the Internet
                 Archive and the HathiTrust. To make the transfer, OSUL would just need
                 to generate a specially formatted XML file which included the metadata for
                 the digitized item in MARCXML, and a specially coded set of fields that
                 provided the necessary identifier information for the digitized materials at
                 the Internet Archive.
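
The shape of such a transfer file can be sketched in a few lines of Python. This is an illustrative sketch only, not the format the HathiTrust requires: the MARCXML namespace is the standard Library of Congress one, but the field tag (`955`), the subfield code (`b`), the function names, and the sample identifier are all assumptions standing in for the "specially coded set of fields" described above.

```python
import xml.etree.ElementTree as ET

# Standard MARCXML namespace published by the Library of Congress.
MARC_NS = "http://www.loc.gov/MARC21/slim"
ET.register_namespace("", MARC_NS)  # serialize without an "ns0:" prefix


def add_ia_identifier(record_xml, ia_identifier, tag="955", code="b"):
    """Append a datafield carrying the Internet Archive identifier.

    The tag and subfield code are placeholders (assumptions); the real
    values come from the receiving repository's submission guidelines.
    """
    record = ET.fromstring(record_xml)
    field = ET.SubElement(
        record, f"{{{MARC_NS}}}datafield", {"tag": tag, "ind1": " ", "ind2": " "}
    )
    subfield = ET.SubElement(field, f"{{{MARC_NS}}}subfield", {"code": code})
    subfield.text = ia_identifier
    return ET.tostring(record, encoding="unicode")


def build_transfer_file(records):
    """Wrap already-coded MARCXML records in a single <collection> element."""
    collection = ET.Element(f"{{{MARC_NS}}}collection")
    for record_xml in records:
        collection.append(ET.fromstring(record_xml))
    return ET.tostring(collection, encoding="unicode")
```

Calling `add_ia_identifier` on each record and passing the results to `build_transfer_file` yields one XML document per batch, which is the step staff initially performed by hand.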
                     In early testing, library staff coded these transfer files by hand.
                 MARC data would be converted to MARCXML using a range of tools,
                 and then a staff member would code the information related directly to the
                 Internet Archive files that were slated for data transfer. This process
                 did not scale: the libraries had identified hundreds of thousands of
                 potential items to digitize and transfer in this manner, so an
                 automated process was needed.
                     To facilitate the process, a plug-in was created in MarcEdit to leverage
                 the Internet Archive search API. The Internet Archive provides an API that
                 returns information about items in an XML format. Using this information,
                 the libraries could create a process that would query the Internet Archive,
                 retrieve the list of digitized materials over a specified period of time, extract
                 the specific item data, and then automatically generate the transfer file. (See
                 figure 7.7.)
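
The query-and-extract steps above might look roughly like the following. This is a minimal sketch, not the MarcEdit plug-in's actual code: it assumes the Internet Archive's `advancedsearch.php` endpoint with XML output, a Lucene-style date range on `publicdate`, and a Solr-style response in which each `<doc>` holds `<str name="...">` fields. The collection name and sample identifier used with it are hypothetical.

```python
from urllib.parse import urlencode
import xml.etree.ElementTree as ET

SEARCH_URL = "https://archive.org/advancedsearch.php"


def build_search_url(collection, start_date, end_date, rows=100):
    """Build a search URL for items made public within a date range.

    The collection name is supplied by the caller; dates are YYYY-MM-DD.
    """
    query = f"collection:{collection} AND publicdate:[{start_date} TO {end_date}]"
    params = [
        ("q", query),
        ("fl[]", "identifier"),  # fields to return for each item
        ("fl[]", "title"),
        ("rows", rows),
        ("output", "xml"),
    ]
    return f"{SEARCH_URL}?{urlencode(params)}"


def extract_items(response_xml):
    """Return one dict per <doc>, mapping field name to its text value.

    Only simple single-valued fields are handled; multi-valued fields
    would need extra handling.
    """
    root = ET.fromstring(response_xml)
    items = []
    for doc in root.iter("doc"):
        items.append({el.get("name"): el.text for el in doc if el.get("name")})
    return items
```

The URL could then be fetched with `urllib.request.urlopen`, and each extracted identifier matched to its MARC record to generate the transfer file automatically.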
                     Using the plug-in, deposit shifted from a manual process
                 that could handle a few dozen records a month to an automated process
                 that was limited only by the organization’s capacity to digitize content.
                 Because the process was built on the shared APIs and communication
                 standards of both the Internet Archive and the HathiTrust, the plug-in
                 was purposefully designed so that its use and output weren’t tied
                 just to OSUL’s local workflow, but were generic, so that it could
