authority.extract_headings & authority.heading_field.component_xpath not parsing headings as intended

Bug #2045423 reported by Mackenzie Johnson
This bug affects 3 people

Affects: Evergreen
Status: New
Importance: Undecided
Assigned to: Unassigned
Milestone: (none)

Bug Description

I want to talk about this blob from the authority.extract_headings function, which I've copied over from the main branch (starting from here: https://github.com/evergreen-library-system/Evergreen/blob/d7d32af55cc9edaef390a0eba3407664f59ee38f/Open-ILS/src/sql/Pg/011.schema.authority.sql#L1141C17-L1141C17):

            raw_text := NULL;

            -- now iterate over components of heading
            component_node_list := oils_xpath( idx.component_xpath, heading_node, ARRAY[ARRAY[xfrm.prefix, xfrm.namespace_uri]] );
            FOR component_node IN SELECT x FROM unnest(component_node_list) AS x LOOP
            -- XXX much of this should be moved into oils_xpath_string...
                curr_text := ARRAY_TO_STRING(array_remove(array_remove(
                    oils_xpath( '//text()', -- get the content of all the nodes within the main selected node
                        REGEXP_REPLACE( component_node, E'\\s+', ' ', 'g' ) -- Translate adjacent whitespace to a single space
                    ), ' '), ''), -- throw away morally empty (bankrupt?) strings
                    joiner
                );

                CONTINUE WHEN curr_text IS NULL OR curr_text = '';

                IF raw_text IS NOT NULL THEN
                    raw_text := raw_text || joiner;
                END IF;

                raw_text := COALESCE(raw_text,'') || curr_text;
            END LOOP;

Here, xfrm is the alias for the database table "config.xml_transform" (which stores the XSLT transforms), and idx is the alias for "authority.heading_field" (which one can also access in the staff ILS in Server Administration under the name "Authority Heading Fields").

For the unfamiliar, the full authority.extract_headings function takes a given MARC authority record (as MARCXML), transforms the metadata into MADS, and then uses XPath to sequester each heading (identifying whether it is the preferred, variant, or related heading, and noting the thesaurus).

Then, what the blob I've posted here is supposed to do is take each sequestered heading node, XPath out the child nodes (i.e. the subfields), XPath out the text, clean up some whitespace, and concatenate the text components into one heading string.
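
As a rough illustration of the cleanup-and-join step described above, the following sketch runs the same array_remove / ARRAY_TO_STRING pattern on a made-up component node. It uses PostgreSQL's built-in xpath() in place of Evergreen's oils_xpath so that it can run on a plain PostgreSQL install with XML support, and it assumes a single space as the joiner; the sample node is hypothetical.

```sql
-- Hypothetical component node; the whitespace is deliberately messy.
WITH sample(component_node) AS (
    VALUES ('<name>
               <namePart>Smith,</namePart>
               <namePart>Jane</namePart>
             </name>')
)
SELECT array_to_string(
           array_remove(array_remove(
               -- built-in xpath() stands in for oils_xpath here (assumption)
               xpath('//text()',
                     regexp_replace(component_node, E'\\s+', ' ', 'g')::xml
               )::text[],
           ' '), ''),   -- drop whitespace-only and empty text nodes
           ' '          -- joiner, assumed here to be a single space
       ) AS curr_text
  FROM sample;
```

On this sample the query should yield the single string "Smith, Jane".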

With the default settings, this is not what happens.

The XPath used to extract the component nodes from the heading is stored in the "component_xpath" column in authority.heading_field. Currently, that value is "//mads21:*", where * in this case is not a wildcard but one of name, title, topic, temporal, geographic, or genre, depending on the type of heading (so topical terms have "//mads21:topic"). But in both MARCXML and MADS/XML, subdivisions are not structured as child nodes of the component preceding them; they are following siblings, on the same level. So what actually happens is that only the components matching the heading type are extracted and concatenated, while non-matching components are extracted separately. If, for example, you have a Topical Term authority record with a built-in geographic subdivision, the function will split them into separate strings.

This function is used mainly to populate authority.simple_heading through the authority.simple_heading_set function, and it is also used to generate headings when browsing. As an example of what's going wrong, the Canadian Subject Heading "Prime ministers -- Canada -- Press conferences", when parsed into authority.simple_heading, returns "Prime ministers Press conferences" as one simple heading and "Canada" as a distinct heading (so it is essentially represented in authority.simple_heading as a 150 and a 151 field on the same authority record). This in turn leads to multiple erroneous matches in the metabib.browse_entry_simple_heading table, as the full strings are not getting matched up.
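
To make the selection problem concrete, here is a small sketch that can be run on any PostgreSQL install with XML support. The MADS fragment is a simplified, hypothetical rendering of the heading above (not actual output of Evergreen's transform), the namespace URI is assumed to be the standard MADS v2 one, and the built-in xpath() stands in for oils_xpath.

```sql
-- Simplified, hypothetical MADS for "Prime ministers -- Canada -- Press conferences".
WITH sample(doc) AS (
    SELECT '<mads21:authority xmlns:mads21="http://www.loc.gov/mads/v2">
              <mads21:topic>Prime ministers</mads21:topic>
              <mads21:geographic>Canada</mads21:geographic>
              <mads21:topic>Press conferences</mads21:topic>
            </mads21:authority>'::xml
)
SELECT xpath('//mads21:topic/text()',   -- current default for topical headings
             doc,
             ARRAY[ARRAY['mads21', 'http://www.loc.gov/mads/v2']]
       )::text[] AS topic_components
  FROM sample;
-- The geographic subdivision is a sibling of the topic nodes, not a child,
-- so "//mads21:topic" never selects it.
```

On this sample the query should return only "Prime ministers" and "Press conferences"; "Canada" is left out of the topical heading.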

Thankfully, there is already a viable solution as utilized elsewhere and explained here: https://bugs.launchpad.net/evergreen/+bug/1662541

I am suggesting that the default values of authority.heading_field.component_xpath be altered so that, instead of using "//mads21:topic", for example, the XPath used is //*[local-name(./*[1])="topic"]/*
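
A quick way to sanity-check the suggested expression is to run it through PostgreSQL's built-in xpath() (standing in for oils_xpath) against a simplified, hypothetical MADS fragment for "Prime ministers -- Canada -- Press conferences". The fragment and the namespace URI are assumptions made for illustration, not output from Evergreen's transform.

```sql
-- Simplified, hypothetical MADS fragment; not actual Evergreen output.
WITH sample(doc) AS (
    SELECT '<mads21:authority xmlns:mads21="http://www.loc.gov/mads/v2">
              <mads21:topic>Prime ministers</mads21:topic>
              <mads21:geographic>Canada</mads21:geographic>
              <mads21:topic>Press conferences</mads21:topic>
            </mads21:authority>'::xml
)
SELECT xpath('//*[local-name(./*[1])="topic"]/*/text()',  -- suggested XPath, plus /text()
             doc
       )::text[] AS all_components
  FROM sample;
-- The predicate matches the parent whose first child is a topic,
-- and the trailing /* then takes every child, whatever its element name.
```

On this sample the query should return all three components in document order: "Prime ministers", "Canada", "Press conferences".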

One somewhat significant change (aside from the function doing what it is supposed to do) is that restricting to a particular name type makes the component_xpath value a bit more cumbersome. //mads21:name[@type="personal"] has to become //*[local-name(./*[1])="name"]/*[@type="personal"]/parent::node()/* to return results equivalent to the XPath I suggested in the previous paragraph. If //*[local-name(./*[1])="name"]/* suffices for all name types, great, but I haven't discovered a shorter XPath for names that doesn't run the risk of unnecessary looping.
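
For comparison, the two name expressions can be tried side by side with PostgreSQL's built-in xpath() (again standing in for oils_xpath). The fragment below, a personal name followed by a topical subdivision, is a made-up example, and the namespace URI is assumed.

```sql
-- Hypothetical MADS fragment: personal name heading with a topical subdivision.
WITH sample(doc, ns) AS (
    SELECT '<mads21:authority xmlns:mads21="http://www.loc.gov/mads/v2">
              <mads21:name type="personal"><mads21:namePart>Smith, Jane</mads21:namePart></mads21:name>
              <mads21:topic>Press conferences</mads21:topic>
            </mads21:authority>'::xml,
           ARRAY[ARRAY['mads21', 'http://www.loc.gov/mads/v2']]
)
SELECT -- current style: selects only the name node itself
       array_length(xpath('//mads21:name[@type="personal"]', doc, ns), 1) AS current_nodes,
       -- suggested style: selects the name node and its sibling subdivision
       array_length(xpath(
           '//*[local-name(./*[1])="name"]/*[@type="personal"]/parent::node()/*',
           doc, ns), 1) AS suggested_nodes
  FROM sample;
```

On this sample the first expression should select one node (the name), while the second should select two (the name and the topical subdivision).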
