the record. This means that the length of a valid MARC record can never exceed a total length, including directory and field data, of 99,999 bytes. It should also be noted that the length is calculated in bytes, not characters, a distinction that has become more important as more library software transitions to UTF-8. For example, while a precomposed e with an acute accent (é) is a single character, it is encoded in UTF-8 as two distinct bytes. Within the MARC record leader and directory, this single character must therefore be counted as two bytes for the record and field lengths to be valid. What's more, the introduction of UTF-8 support into MARC records has raised other issues, specifically ones related to the indexing of data elements and to compatibility with legacy MARC-8 systems.
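To make the byte-versus-character distinction concrete, the short sketch below (Python 3, standard library only; the variable and constant names are illustrative, not part of any MARC library) shows that a precomposed é counts as one character but two bytes once encoded, and that any check against the 99,999-byte ceiling should therefore be made on the encoded record, not on the character count.

    MARC_MAX_LENGTH = 99999          # maximum total record length, in bytes

    field_value = "Pérez"            # contains a precomposed é (U+00E9)
    print(len(field_value))                  # 5 characters
    print(len(field_value.encode("utf-8")))  # 6 bytes: é encodes as 0xC3 0xA9

    def record_length_ok(record_bytes: bytes) -> bool:
        # Leader and directory lengths count bytes, so validate the encoded form.
        return len(record_bytes) <= MARC_MAX_LENGTH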
Indexing Issues
Indexing issues can be particularly tricky when working with data that originates from MARC records. This is largely because UTF-8 MARC records are designed to preserve compatibility with MARC-8 data. To do this, UTF-8 MARC data is coded using a decomposed Unicode normalization form. What does this mean? Well, let's think about that e with an acute accent again (é). When represented in MARC-8, this value is built from two distinct characters: the modifier {acute} followed by the base letter "e." Together, the system recognizes the {acute}e pair and renders it as a single value. In order to preserve the ability to move between UTF-8 and MARC-8, the Library of Congress has specified that data be coded in the decomposed normalization form. This normalization retains the MARC-8 construction of a base letter plus a separate diacritic, rather than a single precomposed code point; in UTF-8 MARC records, the é is represented as an "e" followed by a combining acute accent (e{acute}). Again, a computer will recognize the presence of the combining character and render the data correctly. However, this introduces indexing issues, because the decomposed pair, not the precomposed é, is what tends to be indexed, so a search using the precomposed form fails to match. For systems that work with legacy MARC data, or that follow the recommended MARC encoding rules when coding data into UTF-8, these character normalization rules lead to significant indexing issues, making it difficult to support search and discovery for non-English scripts.
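To illustrate the mismatch, the sketch below (plain Python 3 using the standard unicodedata module; the variable names are illustrative) builds the precomposed and decomposed forms of é and shows that they do not compare as equal until one side is normalized, which is exactly the problem an index built on decomposed MARC data exhibits when it receives a precomposed query.

    import unicodedata

    composed = "\u00e9"        # é as a single precomposed code point
    decomposed = "e\u0301"     # e followed by a combining acute accent

    print(composed == decomposed)      # False: the underlying code points differ
    print(unicodedata.normalize("NFC", decomposed) == composed)   # True once normalized

    # An index built on the decomposed form will not match a precomposed
    # query unless one side is normalized before comparison.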
In recent years, however, this problem has become progressively more complicated, as many systems no longer require UTF-8 data to be presented in the recommended normalization form, resulting in records that mix normalization forms. At the operating system level, these normalization differences have little visible impact. They only come into play when data is changed or compared, because those operations tend to work in ordinal (or binary) terms. So, while I may have a file that represents an é using both composed and decomposed characters, the system will generally render these items so that they look identical. It is only when these items are edited that it becomes clear that mixed normalization is present. In prior years, this tended not to be a concern, as major bibliographic entities like OCLC normalized UTF-8 data consistently.
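As a concrete illustration, the short sketch below (Python 3 standard library only; the function name index_key is hypothetical) shows two strings that render identically but fail an ordinal comparison, along with the kind of normalization step a system might apply before indexing or matching records that mix normalization forms.

    import unicodedata

    title_a = "Caf\u00e9"      # precomposed é
    title_b = "Cafe\u0301"     # decomposed e + combining acute

    print(title_a, title_b)    # the two titles render identically
    print(title_a == title_b)  # False: ordinal (binary) comparison sees different data

    def index_key(value: str) -> str:
        # Illustrative: collapse mixed normalization to a single form before indexing.
        return unicodedata.normalize("NFC", value)

    print(index_key(title_a) == index_key(title_b))   # True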