


This changed around 2017 when OCLC allowed all valid UTF-8 data to be entered into cataloging records. At that point, data was no longer normalized, and records placed into the bibliographic repository could have multiple normalizations represented in a single record. This has had significant impacts on both indexing and display within library catalogs, as some systems assume very specific normalization rules and are unable to process data logically when mixed normalizations are present within a record. In those cases, users will need to utilize a third-party tool, like MarcEdit, to ensure that data conforms to a single normalized form, or develop scripts and workflows themselves to normalize their data prior to ingest.
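
For libraries building such workflows themselves, the normalization step is straightforward. The following is a minimal Python sketch, using only the standard unicodedata module, that forces mixed data into a single canonical form; the helper name is our own, and a real workflow would apply it to every field in a record:

    import unicodedata

    def to_nfc(value: str) -> str:
        """Normalize a string to Unicode canonical composed form (NFC)."""
        return unicodedata.normalize("NFC", value)

    # "é" can arrive precomposed (U+00E9) or decomposed (U+0065 + U+0301);
    # the two sequences compare as unequal until they are normalized.
    composed = "\u00e9"
    decomposed = "e\u0301"
    assert composed != decomposed
    assert to_nfc(composed) == to_nfc(decomposed)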


Compatibility issues
While UTF-8 expands support for a far wider range of characters, most MARC systems only support a very limited set. This means that when moving data between legacy and UTF-8-encoded systems, a method of representing data outside of the traditional MARC-8 encoding scheme had to be developed. Again, the Library of Congress provides a set of best practices, and in this case the NCR (numeric character reference) scheme is utilized to support lossless conversion between legacy and UTF-8 systems. Unfortunately, while this has been the best practice since the early 2000s, many legacy library systems still fail to recognize NCR encodings in bibliographic data, causing display and indexing issues. What's more, most digital repository software fails to decode NCR-encoded data into the proper UTF-8 equivalents, again affecting indexing (though not display).
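
To make the round-trip concrete, the sketch below (our own illustration, not the Library of Congress's reference implementation) decodes MARC-style hex NCRs with Python's standard html module and re-encodes non-ASCII characters on the way back out. A production converter would test characters against the full MARC-8 repertoire rather than plain ASCII:

    import html
    import re

    def decode_ncr(value: str) -> str:
        # html.unescape handles numeric character references
        # such as &#x00E9; (as well as named entities).
        return html.unescape(value)

    def encode_ncr(value: str) -> str:
        # Replace anything outside ASCII with a hex NCR; a real
        # converter would check the MARC-8 repertoire instead.
        return re.sub(r"[^\x00-\x7F]",
                      lambda m: "&#x%04X;" % ord(m.group(0)),
                      value)

    print(decode_ncr("Universit&#x00E9; de Montr&#x00E9;al"))
    # Université de Montréal
    print(encode_ncr("Université de Montréal"))
    # Universit&#x00E9; de Montr&#x00E9;al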
While the move to UTF-8 has created some issues for libraries that are intent on utilizing their legacy data streams or wish to move data between systems, the benefit of the vastly expanded character repertoire that the encoding supports has largely outweighed them. What's more, well-established workflows and tools exist that support the conversion of character data not only between character encodings (UTF-8 and MARC-8) but between character normalizations (compatibility versus canonical), enabling libraries to reuse copious amounts of legacy data with very little data loss.
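
The distinction between the two normalization families is easy to see with a compatibility character such as the "fi" ligature: canonical normalization preserves it, while compatibility normalization folds it into plain letters. A brief Python illustration:

    import unicodedata

    s = "\ufb01le"  # "file" written with the U+FB01 "fi" ligature
    print(unicodedata.normalize("NFC", s))   # ﬁle  -- ligature preserved
    print(unicodedata.normalize("NFKC", s))  # file -- ligature expanded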


MARC Directory
The MARC directory, which begins immediately after the 24-byte leader, is made up of a series of 12-byte blocks, where each block represents a single bibliographic field. Each block contains a field label (bytes 0–2), the field length (bytes 3–6), and the field's starting position relative to the beginning of the bibliographic data (bytes 7–11). Field data is limited to 9,999 bytes, given that the field length can only be expressed as a 4-byte value. In figure 6.1, we can look at the following example from the directory: 245008200162. Here, we can break this block down into the following sections (a short parsing sketch follows the list):

• Field label: 245
• Field length: 0082
• Start position: 00162
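
Because the three parts sit at fixed offsets, a directory entry can be unpacked with simple string slicing. The following Python sketch, our own illustration using the entry from figure 6.1, shows this:

    def parse_directory_entry(entry: str) -> dict:
        """Split a 12-byte MARC directory entry into its parts."""
        assert len(entry) == 12
        return {
            "tag": entry[0:3],          # field label, bytes 0-2
            "length": int(entry[3:7]),  # field length, bytes 3-6
            "start": int(entry[7:12]),  # start position, bytes 7-11
        }

    print(parse_directory_entry("245008200162"))
    # {'tag': '245', 'length': 82, 'start': 162}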
