Authority-linking SQL functions don't account for non-numerical prefixes in $0 control number
Affects | Status | Importance | Assigned to | Milestone | |
---|---|---|---|---|---|
Evergreen |
New
|
Undecided
|
Unassigned |
Bug Description
EG version: 3.8.0
PostgreSQL version: 10.something, I believe
The SQL function authority.
What is happening: The function seems to be written with the assumption that the values in a $0 field would be overwritten by your local Control number identifier in parentheses followed by the linked authority's ID number, as is the current case when manually linking authorities to bib records. But this function appears to run when authority records are being ingested, and modern authority records tend to contain $0 subfields with the control numbers tracing generated by the Library of Congress, Library and Archives Canada, OCLC, etc.
Importantly, these control numbers -- at least for authorities generated by the above three sources -- all have letter-based prefixes. For example, a Canadian Subject Record would have the control number "cash12345" and the $0 would say "(CaOONL)
The issues this creates is that when this function regex matches on the trailing digits of a given $0 subfield and ONLY the trailing digits, and then tries to match it against an EG-generated record ID, it is likely to create incorrect links unless the local ID does actually match the controlled identifier.
I am not in a position to wait for a patch, so I have started (carefully) exploring fixes I can make at the local level, but of course I want to share any fixes so I'm glad I found the wiki page asking to please file an LP bug report before surprise dropping any pull requests.
The problem subsection of code from the function (which I copied from the main branch in Github, so it appears that this issue persists):
FOR link IN
-- Regex in the below substring call needs to include letters and whitespace; \d only takes digits
SELECT SUBSTRING( x::TEXT, '\d+$' )
FROM UNNEST(
LOOP
-- Ignore links that are null, malformed, circular, or point to
-- non-existent authority records.
/* Above points are still important, and I think matching against a local id
* is still important but matching should ALSO be done against
* alphanumeric identifiers found in 010, 016, and 035 fields
* (keeping in mind that LC-based identifiers have a whitespace between
* the prefix and the control number. Of course the cast to BIGINT
* would not appreciate the inclusion of text so that condition would need adjusting.*/
IF link IS NOT NULL AND link::BIGINT <> rec_id THEN
IF FOUND THEN
END IF;
END IF;
END LOOP;
END LOOP;
Updated the title, as I discovered that the same regex is used to try and extract and match subject headings in bib records to authority records in the biblio. extract_ metabib_ field_entry function (and thus could similarly create false matches between a bib record and an authority record).