Authority-linking SQL functions don't account for non-numerical prefixes in $0 control number

Bug #2043818 reported by Mackenzie Johnson
32
This bug affects 7 people
Affects Status Importance Assigned to Milestone
Evergreen
New
Undecided
Unassigned

Bug Description

EG version: 3.8.0
PostgreSQL version: 10.something, I believe

The SQL function authority.calculate_authority_linking uses the value in a $0 subfield to match against a related authority record if it already exists in your catalogue. The function runs on newly ingested records, and records the links in the authority.authority_linking table. My expectation is that the identifiers in the $0 subfield yield accurate links between related authorities.

What is happening: The function seems to be written with the assumption that the values in a $0 field would be overwritten by your local Control number identifier in parentheses followed by the linked authority's ID number, as is the current case when manually linking authorities to bib records. But this function appears to run when authority records are being ingested, and modern authority records tend to contain $0 subfields with the control numbers tracing generated by the Library of Congress, Library and Archives Canada, OCLC, etc.

Importantly, these control numbers -- at least for authorities generated by the above three sources -- all have letter-based prefixes. For example, a Canadian Subject Record would have the control number "cash12345" and the $0 would say "(CaOONL)cash12345"; FAST uses "fst1234567" with the $0 showing "(OCoLC)fst1234567" and LC/NACO/SACO authorities using "n 1234567" for name authorities and "sh 1234567" for subjects -- $0 then respectively being "(DLC)n 1234567" and "(DLC)sh 1234567". Yes. Whitespace. These control numbers can also be seen in the 010 field for LC-originating control numbers, 016 for others, or also possibly in the 035 -- I have seen that Library and Archives Canada has changed format for identifiers, so the identifier in the $0 field may only match against an identifier in either the 016 or the 035, but not necessarily both.

The issues this creates is that when this function regex matches on the trailing digits of a given $0 subfield and ONLY the trailing digits, and then tries to match it against an EG-generated record ID, it is likely to create incorrect links unless the local ID does actually match the controlled identifier.

I am not in a position to wait for a patch, so I have started (carefully) exploring fixes I can make at the local level, but of course I want to share any fixes so I'm glad I found the wiki page asking to please file an LP bug report before surprise dropping any pull requests.

The problem subsection of code from the function (which I copied from the main branch in Github, so it appears that this issue persists):

FOR link IN
-- Regex in the below substring call needs to include letters and whitespace; \d only takes digits
            SELECT SUBSTRING( x::TEXT, '\d+$' )
              FROM UNNEST(
                        XPATH(
                            '//*[@tag="'
                                || acsaf.tag
                                || '"]/*[@code="'
                                || acsaf.linking_subfield
                                || '"]/text()',
                            rec_marc_xml
                        )
                    ) x
        LOOP

            -- Ignore links that are null, malformed, circular, or point to
            -- non-existent authority records.
            /* Above points are still important, and I think matching against a local id
             * is still important but matching should ALSO be done against
             * alphanumeric identifiers found in 010, 016, and 035 fields
             * (keeping in mind that LC-based identifiers have a whitespace between
             * the prefix and the control number. Of course the cast to BIGINT
             * would not appreciate the inclusion of text so that condition would need adjusting.*/
            IF link IS NOT NULL AND link::BIGINT <> rec_id THEN
                PERFORM * FROM authority.record_entry WHERE id = link::BIGINT;
                IF FOUND THEN
                    aal.target := link::BIGINT;
                    aal.field := acsaf.id;
                    RETURN NEXT aal;
                END IF;
            END IF;
        END LOOP;
    END LOOP;

Revision history for this message
Mackenzie Johnson (mtjohnsonupei) wrote :

Updated the title, as I discovered that the same regex is used to try and extract and match subject headings in bib records to authority records in the biblio.extract_metabib_field_entry function (and thus could similarly create false matches between a bib record and an authority record).

summary: - authority.calculate_authority_linking doesn't account for non-numerical
- prefixes
+ Authority-linking SQL functions don't account for non-numerical prefixes
+ in $0 control number
tags: added: database
To post a comment you must log in.
This report contains Public information  
Everyone can see this information.

Other bug subscribers

Remote bug watches

Bug watches keep track of this bug in other bug trackers.