Page 134 - Building Digital Libraries

Metadata Formats


                 the record. This means that the total length of a valid MARC record,
                 including directory and field data, can never exceed 99,999 bytes. Note
                 that the length is calculated in bytes, not characters, a distinction
                 that has become more important as more library software transitions to
                 UTF-8. For example, while a precomposed e with an acute accent (é) is a
                 single character, UTF-8 encodes it as two bytes. In the MARC record
                 leader and directory, this single character must therefore be counted
                 as two bytes for the record and field lengths to be valid. What’s more,
                 the introduction of UTF-8 support into MARC records has raised other
                 issues, specifically ones related to the indexing of data elements and
                 to compatibility with legacy MARC-8 systems.
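The byte-versus-character distinction can be checked directly. The following is a minimal Python sketch, not taken from any MARC library: the helper names `char_and_byte_length` and `fits_in_marc_record` and the constant `MAX_RECORD_LENGTH` are hypothetical, and only the 99,999-byte ceiling comes from the text.

```python
# Hypothetical helpers for illustration; only the 99,999-byte limit is from the text.
MAX_RECORD_LENGTH = 99_999


def char_and_byte_length(text):
    """Return (character count, UTF-8 byte count) for a string."""
    return len(text), len(text.encode("utf-8"))


def fits_in_marc_record(record_bytes):
    """Check an already-serialized record against the byte ceiling."""
    return len(record_bytes) <= MAX_RECORD_LENGTH


# A precomposed e-acute is one character but two bytes in UTF-8.
print(char_and_byte_length("\u00e9"))     # (1, 2)
# The gap grows with every non-ASCII character in a field.
print(char_and_byte_length("caf\u00e9"))  # (4, 5)
```

Any length written into a leader or directory would need to come from the byte count, not the character count.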


                 indexing issues
                 Indexing issues can be particularly tricky when working with data that
                 originates from MARC records, largely because UTF-8 MARC records
                 preserve compatibility with MARC-8 data. To do this, UTF-8 MARC data
                 is coded using a compatibility normalization of Unicode. What does
                 this mean? Well, let’s think about that e with an acute accent again
                 (é). In MARC-8, this value is built from two distinct characters: the
                 modifier {acute} and the “e”, with the modifier coded before the
                 letter it modifies ({acute}e). The system recognizes the pair and
                 renders it as a single value. In order to preserve the ability to
                 move between UTF-8 and MARC-8, the Library of Congress has specified
                 that data be coded using the UTF-8 compatibility normalization. This
                 normalization retains the MARC-8 construction, using two distinct
                 characters to represent the e with an acute accent (é) rather than a
                 single code point, although in Unicode the combining mark follows the
                 base letter. This means that in UTF-8 MARC records, the (é) is
                 represented as an e and an {acute} (e{acute}). Again, a computer will
                 recognize the presence of the modifier and render the data correctly.
                 However, this introduces indexing issues, because the modifier, not
                 the (é), is what tends to be indexed. For systems that use legacy MARC
                 data, or that follow the recommended MARC data-encoding rules when
                 coding data into UTF-8, these character normalization rules lead to
                 significant indexing issues, making it difficult to support search
                 and discovery for non-English scripts.
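The difference between the composed and decomposed spellings of é can be seen with Python's standard `unicodedata` module. This is a sketch under stated assumptions: the helper names `code_points` and `index_key` are hypothetical, and folding to NFC before indexing is one possible approach, not something the text or the MARC specification prescribes. For é, canonical (NFD) and compatibility (NFKD) decomposition give the same result, so NFD is used here.

```python
import unicodedata

composed = "\u00e9"     # e-acute as a single code point
decomposed = "e\u0301"  # base letter followed by a combining acute accent


def code_points(text):
    """List the code points in a string, e.g. ['U+0065', 'U+0301']."""
    return [f"U+{ord(ch):04X}" for ch in text]


def index_key(text):
    """Hypothetical index key: compose and case-fold so both spellings match."""
    return unicodedata.normalize("NFC", text).casefold()


print(code_points(composed))    # ['U+00E9']
print(code_points(decomposed))  # ['U+0065', 'U+0301']

# The two spellings render alike but are different strings.
print(composed == decomposed)                                # False
print(unicodedata.normalize("NFD", composed) == decomposed)  # True

# Normalizing before indexing makes them collide on the same key.
print(index_key("Caf\u00e9") == index_key("cafe\u0301"))     # True
```

An indexer that skips this kind of step sees the combining accent as a separate character, which is the behavior the paragraph above describes.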
                     In recent years, however, this problem has grown progressively
                 more complicated, as many systems no longer require UTF-8 data to be
                 presented in the recommended UTF-8 normalization, resulting in
                 records that mix normalization forms. At the operating system level,
                 these normalization rules have little impact on search and discovery
                 of content. They only come into play when making changes to data, as
                 data changes tend to be made using ordinal (binary) comparison. So,
                 while I may have a file that represents an é using both composed and
                 decomposed characters, the system will generally render these items
                 so that they look identical. It’s only when these items are edited
                 that it becomes clear that mixed normalization is present. In prior
                 years, this tended not to be a concern, as major bibliographic
                 entities like OCLC normalized UTF-8.
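To illustrate why mixed normalization only surfaces during edits, here is a small Python sketch. The function names `normalization_forms` and `has_mixed_normalization` are hypothetical, and the check is a rough heuristic: it flags text that is in neither NFC nor NFD form, which is how a file mixing composed and decomposed characters would appear.

```python
import unicodedata


def normalization_forms(text):
    """Return which of NFC/NFD the text already satisfies."""
    return {form for form in ("NFC", "NFD")
            if unicodedata.normalize(form, text) == text}


def has_mixed_normalization(text):
    """Heuristic: text in neither form likely mixes both spellings."""
    return len(normalization_forms(text)) == 0


# The same word twice: once composed, once decomposed. Both render alike.
mixed = "caf\u00e9 / cafe\u0301"
print(has_mixed_normalization(mixed))  # True

# An ordinal (binary) replace only matches the composed spelling.
edited = mixed.replace("caf\u00e9", "coffee")
print(edited == "coffee / cafe\u0301")  # True

# After normalizing to one form, both occurrences are found.
cleaned = unicodedata.normalize("NFC", mixed)
print(cleaned.count("caf\u00e9"))         # 2
print(has_mixed_normalization(cleaned))   # False
```

Normalizing to a single form before a batch edit is one way to avoid the partial match shown here; which form to choose depends on the systems the records must round-trip through.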
