Page 194 - Building Digital Libraries
Sharing Data—Harvesting, Linking, and Distribution
Facilitating Third-Party Indexing
There was a time when supporting protocols like OAI-PMH would result
in a higher likelihood of a digital repository being indexed by the major
commercial search providers. And this still may have some truth, since OAI-
PMH provides a structural entry point into a repository, and a documented
method to traverse all available content. But if this occurs, it’s more due to
the ability of an indexer’s crawler to traverse the OAI-PMH structure, rather
than to any built-in support for the format. Today, most OAI-PMH harvesting
is done by aggregators within the library or cultural heritage domains
to build large indexes of aggregated content, with the two largest being the
Digital Public Library of America (DPLA) and OCLC.
The DPLA uses OAI-PMH as its primary standard for collecting the
discovery and index metadata that content providers publish about their
items. Because OAI-PMH supports incremental harvesting by date range and
delivers large result sets in batches, it provides the minimal
functionality the DPLA needs to keep the metadata for a specific
collection current. OCLC, on the other hand, uses OAI-PMH as a method for
automatically generating MARC data for items in a digital collection. Using
this method, OCLC enables users to map metadata elements harvested
through the OAI-PMH interface to their MARC record equivalents. The
process is messy and often produces very minimal records, but it
does enable organizations to quickly create metadata records for inclusion
in OCLC’s WorldCat database, which is then made available through
search engines and a wide range of OCLC discovery products.
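The incremental harvesting described above can be sketched in a few lines
of Python. This is a minimal illustration under stated assumptions, not the
harvester that the DPLA or OCLC actually runs: the base URL
https://repository.example.org/oai and the function names are hypothetical,
while the ListRecords verb and the from, until, metadataPrefix, and
resumptionToken arguments are taken from the OAI-PMH 2.0 specification.

```python
from urllib.parse import urlencode
import xml.etree.ElementTree as ET

# XML namespace used by OAI-PMH 2.0 responses.
OAI_NS = {"oai": "http://www.openarchives.org/OAI/2.0/"}


def build_list_records_url(base_url, metadata_prefix="oai_dc",
                           from_date=None, until=None,
                           resumption_token=None):
    """Build a ListRecords request URL (hypothetical helper)."""
    if resumption_token:
        # A resumptionToken is an exclusive argument: it is sent with the
        # verb only, and the server continues the earlier request.
        params = {"verb": "ListRecords",
                  "resumptionToken": resumption_token}
    else:
        params = {"verb": "ListRecords", "metadataPrefix": metadata_prefix}
        if from_date:
            params["from"] = from_date   # only records changed on/after
        if until:
            params["until"] = until      # only records changed on/before
    return f"{base_url}?{urlencode(params)}"


def parse_page(xml_text):
    """Return (record identifiers, next resumption token or None)."""
    root = ET.fromstring(xml_text)
    identifiers = [el.text for el in
                   root.findall(".//oai:header/oai:identifier", OAI_NS)]
    token_el = root.find(".//oai:resumptionToken", OAI_NS)
    # An empty or missing token means the list is complete.
    has_token = token_el is not None and token_el.text
    token = token_el.text.strip() if has_token else None
    return identifiers, token
```

A harvester built on this sketch would request the first URL with a from
date equal to its last successful harvest, then keep requesting pages with
the returned token until parse_page reports that no token remains.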
For search services outside of the library domain, indexing has moved
away from OAI-PMH to other technologies like site maps or embedded
linked data using vocabularies like Schema.org. Site maps are essentially
structured text files, most commonly XML, that list a durable URL for each
item, along with minimal metadata such as the date it was last modified,
so that the item can be crawled and indexed. This simplifies the indexing
process for search providers, particularly when working with resources
that generate a lot of dynamic content or are primarily database-driven.
Today, most modern digital library software supports this level of
functionality.
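As an illustration of how little a site map needs to contain, the following
sketch builds one with Python's standard library. The item URLs, the
dictionary keys, and the build_sitemap function are hypothetical; the
urlset, url, loc, and lastmod elements and the namespace come from the
sitemaps.org protocol. Real repository software will typically add index
files and paging for large collections, which this sketch omits.

```python
import xml.etree.ElementTree as ET

# Namespace defined by the sitemaps.org protocol.
SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"


def build_sitemap(items):
    """Serialize items (dicts with 'url' and optional 'last_modified')
    into a site map XML string."""
    urlset = ET.Element("urlset", xmlns=SITEMAP_NS)
    for item in items:
        url = ET.SubElement(urlset, "url")
        # The durable URL the crawler should fetch and index.
        ET.SubElement(url, "loc").text = item["url"]
        # Optional hint so crawlers can skip unchanged items.
        if item.get("last_modified"):
            ET.SubElement(url, "lastmod").text = item["last_modified"]
    return ET.tostring(urlset, encoding="unicode")


sitemap_xml = build_sitemap([
    {"url": "https://repository.example.org/items/42",
     "last_modified": "2024-03-01"},
    {"url": "https://repository.example.org/items/43"},
])
```

Because each entry is a stable URL rather than a database query, the crawler
never has to discover items by following dynamically generated pages.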
The use of structured-data vocabularies, like Schema.org, enables
organizations to embed linked data in a page at the meta tag level. This
information is read only by the indexer, which uses it to enrich its
knowledge graph and to promote relationships between content. While the
use of these vocabularies doesn’t necessarily lead to better indexing, it
does enable greater findability, since the embedded data enables providers
to better classify content and build relationships to items that might not
otherwise have been obvious. For example, tagging an image of fishing on
the Ohio River with a geographical tag would enable the search provider to
know that this image may be relevant to a user in Ohio, regardless of
whether that information shows up anywhere within the visible metadata.
This kind of linking is often done outside of the library community,
particularly in the business community, to easily surface and categorize
information related to locations, websites, hours of operation, and types
of services provided.
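The Ohio River example can be made concrete with a short sketch. It assumes
the embedded data is encoded as JSON-LD inside a script element, which is
only one of several ways to carry Schema.org terms in a page; the item
title, URL, and function names are hypothetical, while Photograph, Place,
and contentLocation are existing Schema.org terms.

```python
import json


def item_to_jsonld(title, url, place_name):
    """Describe an image and the place it depicts using Schema.org terms."""
    return {
        "@context": "https://schema.org",
        "@type": "Photograph",
        "name": title,
        "url": url,
        # The geographical tag: invisible to readers, visible to indexers.
        "contentLocation": {"@type": "Place", "name": place_name},
    }


def render_script_tag(data):
    """Wrap the description in a script element for the page markup."""
    # Escape "</" so the JSON cannot close the script element early.
    payload = json.dumps(data, indent=2).replace("</", "<\\/")
    return f'<script type="application/ld+json">\n{payload}\n</script>'


record = item_to_jsonld(
    "Fishing at dawn",
    "https://repository.example.org/items/42",
    "Ohio River",
)
script_tag = render_script_tag(record)
```

The visible page can stay unchanged; only the generated script element
carries the place name that lets an indexer relate the image to a region.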