This changed around 2017, when OCLC began allowing all valid UTF-8 data to
be entered into cataloging records. From that point, data was no longer
normalized, and records placed into the bibliographic repository could
contain multiple normalization forms within a single record. This has had
significant impacts on both indexing and display within library catalogs,
as some systems assume very specific normalization rules and are unable to
process data logically when mixed normalizations are present within a
record. In those cases, users need to employ a third-party tool, such as
MarcEdit, to ensure that data conforms to a single normalization form, or
develop their own scripts and workflows to normalize their data prior to
ingest.
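For those building their own workflows, the normalization step itself is
small. The following sketch assumes Python and its standard unicodedata
module (how fields are extracted from a record will depend on your MARC
toolchain); it coerces text to a single normalization form and illustrates
why mixed forms break naive string matching:

import unicodedata

def normalize_field(text, form="NFC"):
    """Coerce text to a single Unicode normalization form."""
    return unicodedata.normalize(form, text)

# The same name keyed two ways: a precomposed e-acute versus a plain
# "e" followed by a combining acute accent (U+0301).
precomposed = "P\u00e9rez"
decomposed = "Pe\u0301rez"
print(precomposed == decomposed)                                      # False
print(normalize_field(precomposed) == normalize_field(decomposed))    # True

Applied consistently at ingest, a step like this keeps a repository from
accumulating records in which the same character is represented in two
different ways.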
Compatibility issues
While UTF-8 supports a far wider range of characters, most MARC systems
support only a very limited character set. This means that when moving data
between legacy and UTF-8-encoded systems, a method of representing data
outside the traditional MARC-8 encoding scheme had to be developed. Again,
the Library of Congress provides a set of best practices; in this case, the
NCR (numeric character reference) scheme is used to support lossless
conversion between legacy and UTF-8 systems. Unfortunately, while this has
been the best practice since the early 2000s, many legacy library systems
still fail to recognize NCR encodings in bibliographic data, causing display
and indexing issues. What's more, most digital repository software fails to
decode NCR-encoded data into its proper UTF-8 equivalents, again affecting
indexing (though not display).
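Decoding NCRs on the way into a repository avoids that indexing problem. A
minimal sketch, assuming Python and the hexadecimal &#xXXXX; form described
in the LC guidelines (decimal &#NNNN; references are accepted as well; the
function name is ours, for illustration only):

import re

NCR_PATTERN = re.compile(r"&#x([0-9A-Fa-f]+);|&#([0-9]+);")

def decode_ncrs(text):
    """Replace numeric character references with their UTF-8 characters."""
    def replace(match):
        hex_digits, dec_digits = match.groups()
        codepoint = int(hex_digits, 16) if hex_digits else int(dec_digits)
        return chr(codepoint)
    return NCR_PATTERN.sub(replace, text)

print(decode_ncrs("Les Mis&#x00E9;rables"))   # Les Misérables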
While the move to UTF-8 has created some issues for libraries that rely on
legacy data streams or need to move data between them, the benefit of the
vastly larger character repertoire that the encoding supports has largely
outweighed those issues. What's more, well-established workflows and tools
exist that support the conversion of character data not only between char-
acter encodings (UTF-8 and MARC-8), but also between character normaliza-
tions (compatibility versus canonical), enabling libraries to reuse copious
amounts of legacy data with very little data loss.
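The compatibility-versus-canonical distinction is easy to see in a couple of
lines of Python (an illustration, not a prescription for any particular
tool): canonical normalization preserves characters such as the single-
codepoint "ﬁ" ligature, while compatibility normalization folds them into
their plain equivalents.

import unicodedata

ligature = "\ufb01le"  # "ﬁle", using the single-codepoint ligature U+FB01
print(unicodedata.normalize("NFC", ligature))    # 'ﬁle' -- unchanged
print(unicodedata.normalize("NFKC", ligature))   # 'file' -- folded to f + i

Which form is appropriate depends on whether such distinctions carry meaning
in your data; compatibility folding is lossy by design.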
MARC Directory
The MARC directory, which immediately follows the 24-byte record leader, is
made up of a series of 12-byte entries, each of which represents a single
bibliographic field. Each entry contains a field label (bytes 0–2), the
field length (bytes 3–6), and the field's start position relative to the
bibliographic data (bytes 7–11). Field data is limited to 9,999 bytes, given
that the field length can only be expressed as a 4-byte value. In figure
6.1, we can look at the following example from the directory: 245008200162.
Here, we can break this entry down into the following sections (a short
parsing sketch follows the list):
• Field label: 245
• Field length: 0082
• Start position: 00162
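Because the layout is fixed, splitting an entry apart is mechanical. A
minimal sketch in Python (the function name is ours, not part of any MARC
library):

def parse_directory_entry(entry):
    """Split one 12-byte MARC directory entry into its three parts."""
    if len(entry) != 12:
        raise ValueError("directory entries are exactly 12 bytes")
    return {
        "label": entry[0:3],          # field label, e.g. '245'
        "length": int(entry[3:7]),    # field length in bytes, max 9999
        "start": int(entry[7:12]),    # offset into the bibliographic data
    }

print(parse_directory_entry("245008200162"))
# {'label': '245', 'length': 82, 'start': 162}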