Browse catalogue titles are doubly escaped?

Bug #1243023 reported by Dan Scott
8
This bug affects 1 person
Affects Status Importance Assigned to Milestone
Evergreen
Fix Released
Undecided
Unassigned

Bug Description

* Evergreen master
* PostgreSQL 9.2.4
* Fedora 19

Load the sample data.
Browse the catalogue for "Piano concertos".
A number of the hits will display full "&" in the title, instead of the expected "&" - for example:

Piano concertos no. 13 & 15

The title in the record detail page itself appears as "&". So it seems that there is some double escaping happening at the browse display level.

Revision history for this message
Dan Wells (dbw2) wrote :

I can't reproduce this bug using 2.5 RC1+ on Debian Squeeze, nor on Ubuntu Lucid (all ampersands in browse appear normal). I'm missing a couple post-RC1 commits on these boxes, but none jump out at me as being particularly relevant. Perhaps this is another case of a newer Fedora prerequisite with slightly different behavior? Or maybe this record got ingested weird somehow? (i.e. maybe the entity-ized string got into metabib.browse_entry somehow? It would at least explain why it shows up right in one place and not the other.)

Revision history for this message
Dan Scott (denials) wrote :

Yeah, still double-escaped for me with a fresh install and --load-all-sample from master.

"Piano concertos 2 & 3" is fine in the search results, record details, and in the autocomplete box. It's just messed up in the browse search (http://localhost/eg/opac/browse?blimit=10;qtype=title;bterm=Piano%20concert;locg=1;bpivot=547 on my fresh install).

SELECT * FROM metabib.browse_entry where value ~ 'Piano concertos 2'; shows

438 | Piano concertos 2 & 3 | '2':3 '3':5 'amp':4 'concerto':2 'piano':1 | piano concertos 2 &amp 3

So... that's a problem :)

Note that I do have to pick #1242999 (the Encode.pm fix) just to get things to ingest and actually be able to test at all, so perhaps that's playing havoc somehow?

Revision history for this message
Dan Wells (dbw2) wrote :

This and #1242999 are certainly in the same area of influence, so it wouldn't surprise me if they are somehow related.

Hmmm, looking at how metabib.browse_entry is populated, I wonder if your record might be double-encoded itself. Do you see anything like "Piano concertos 2 & 3" in that record?

Revision history for this message
Dan Wells (dbw2) wrote :

Another shot in the dark: There's a step in the browse extraction where & is subbed in so that oils_xpath() can complete (which then turns & back into '&'). Does this:

select oils_xpath( '//text()', '<blah>this &amp; that</blah>');

decode the &amp; for you? I'd be pretty surprised if it didn't, but hey, I have no other ideas :)

Revision history for this message
Dan Scott (denials) wrote :

It would be nice to compare the results you get for the same queries I've been running if you use --load-all-sample (with and without 1242999 ideally).

For the record itself:

SELECT marc FROM biblio.record_entry WHERE id = 16;

...<datafield tag="245" ind1="0" ind2="0"><subfield code="a">Piano concertos 2 &amp; 3</subfield></datafield>...

For your shot in the dark:

select oils_xpath( '//text()', '<blah>this &amp; that</blah>');
     oils_xpath
---------------------
 {"this &amp; that"}
(1 row)

Revision history for this message
Dan Scott (denials) wrote :

Okay. Tried it on Ubuntu Precise with and without the branch from 1242999. No problems with escaping either way, and the database entry for metabib.browse_entry ID # 438 is 'Piano concertos 2 & 3'. I upgraded Encode.pm to the most recent available version and the results were the same. So yes, it sounds like it's another newer Fedora prereq at work. PostgreSQL 9.1 on Ubuntu vs. 9.2 on Fedora?

Whatever it may be, it doesn't look like it will have an immediate impact on most production systems for a while.

Revision history for this message
Dan Scott (denials) wrote :

I guess the best way to resolve this would be to add a pgTAP test to ingest the record and ensure that the metabib.browse_entry is as expected, given that we know that there is likely to be some situation in the future where this will become a problem (and we're not really likely to notice it right away, unless there is a unit test).

Revision history for this message
Dan Wells (dbw2) wrote :

Huh, oils_xpath normally decodes that &amp; (i.e. you should get 'this & that', not 'this &amp; that'), so I guess the shot in the dark was accurate. As promised, I'm surprised :)

This behavior has been reported as a possible bug in Postgresql 9.2+:
http://<email address hidden>

The bug hasn't gotten any traction for or against, so perhaps we should join the conversation?

I am not a pro at reading standards, but it seems to me that the XPath standard states pretty clearly that "Entity references to both internal and external entities are expanded. Character references are resolved." I interpret that as supporting the 9.1 behavior of interpreting the character reference before applying the XPath text extraction. Does that sound right to others here? Here is a link to the XPath 1.0 document (the version which PG9.2 still supports): http://www.w3.org/TR/xpath/

Revision history for this message
Dan Scott (denials) wrote :

Sounds like you nailed it! Yes, we should join that conversation. I can be the emissary, seeing as the bug hit me first (but is likely to hit the early adopters who grabbed 9.2 for performance reasons or whatever).

Revision history for this message
Dan Scott (denials) wrote :

By the way, I posted to the pgsql-bugs list at http://www.postgresql.org<email address hidden>

Tom Lane's response was that it was an intentional change in behaviour to be consistent with the PostgreSQL XML type: http://<email address hidden>

My response has thus far been ignored: http://www.postgresql.org<email address hidden>

So I'm not sure if we're going to see any action on PostgreSQL's part to resolve this problem.

Revision history for this message
Dan Wells (dbw2) wrote :

So, a small update for this bug. On PG 9.1, &amp; is not a problem, but metabib data containing &lt; and &gt; is in fact double-encoded. This doesn't just affect browse, but other metabib display extractions (e.g. facets).

The culprit appears to be these lines from biblio.extract_metabib_field_entry():

...
                         REGEXP_REPLACE( -- This escapes all &s not followed by "amp;". Data ise returned from oils_xpath (above) in UTF-8, not entity encoded
                            REGEXP_REPLACE( -- This escapes embeded <s
                                xml_node,
                                $re$(>[^<]+)(<)([^>]+<)$re$,
                                E'\\1&lt;\\3',
                                'g'
                            ),
                            '&(?!amp;)',
                            '&amp;',
                            'g'
                        ),
...

As stated in the comments, this code assumes that the data from oils_xpath is 100% entity-free, and tries to do the minimum to turn it back into valid XML. However, the comment is only half true, as the xpath functionality in PG appears to be already encoding the "famous five" characters needed for valid XML (&amp; &lt; &gt; &quot; &apos;).

Given recent history, this seems likely to be new behavior on the PG side of things. For PG 9.1+, I think we can just drop these replacements entirely. If that seems unsafe, we should at least only encode '&' when not followed by and of the famous five, not just 'amp'. This would allow us to be more generally compatible across possible version differences, at the expense of perhaps unneeded complexity (i.e. maybe this was just a bug all along).

Please note, this proposed change would NOT fix the original "&amp;" issue with PG 9.2+. It is the same symptom, but a different cause.

Revision history for this message
Dan Wells (dbw2) wrote :

After reading the pgbugs thread, and thinking about this a bit more, I now realize what I said above is possibly off because of the way this is all mixed up. Perhaps (even in 9.1) there are configurations where you need the escaping, and other cases where the escaping is harmful. Ugh. I need to think on this more, and probably do more testing.

On the plus side, if the 9.2 behavior sticks, at least we can be confident that the data will always be escaped, even when we don't want it to be ;)

Revision history for this message
Mike Rylander (mrylander) wrote :

Until and unless Pg reverts the change in question, I think we're going to need to sniff the version of postgres, loop over the array of results, and when the value does not start with '<' (is a text node) do the version-appropriate thing.

We can do that transparently (as a Pg superuser, at least) with something like the following:

alter function xpath(text,xml) rename to pgxpath;
alter function xpath(text,xml,text[]) rename to pgxpath;
create function xpath(text,xml) returns xml[] as $$ ... $$;
create function xpath(text,xml,text[]) returns xml[] as $$ ... $$;

Does anyone see a better path, other than the ideal of Pg bringing back spec-ish behavior?

Revision history for this message
Mike Rylander (mrylander) wrote :

To amend my previous comment, we needn't rename the builtin functions if we simply place our replacements in the evergreen schema. Our required search_path change will make them show up first.

Ben Shum (bshum)
Changed in evergreen:
status: New → Confirmed
tags: added: postgresql
Dan Wells (dbw2)
Changed in evergreen:
milestone: none → 2.6.0-rc1
tags: added: 2.6-rc-blocker
Changed in evergreen:
assignee: nobody → Mike Rylander (mrylander)
Revision history for this message
Mike Rylander (mrylander) wrote :
Download full text (3.4 KiB)

I've started looking into this a bit more deeply. I was concerned before that we might have a LOT of uses of the builtin XPATH() function, but it turens out that we don't. Here's what ack-grep says:

miker@foolery:~/git/ILS (version-specific-xpath)$ gack '(?i:\bxpath\()' --type=sql|grep -v upgrade
Open-ILS/src/sql/Pg/030.schema.metabib.sql:1506: FOR tmp_xml IN SELECT XPATH(attr_def.xpath, transformed_xml, ARRAY[ARRAY[xfrm.prefix, xfrm.namespace_uri]]) LOOP
Open-ILS/src/sql/Pg/011.schema.authority.sql:232: WHERE tag IN ( SELECT UNNEST(XPATH('//*[starts-with(@tag,"1")]/@tag',marcxml::XML)::TEXT[]))
Open-ILS/src/sql/Pg/011.schema.authority.sql:340: WHERE tag IN ( SELECT UNNEST(XPATH('//*[starts-with(@tag,"1")]/@tag',marcxml::XML)::TEXT[]))
Open-ILS/src/sql/Pg/011.schema.authority.sql:353: FOR tmp_xml IN SELECT UNNEST(XPATH('//*[@tag="'||tag_used||'"]', marcxml::XML)) LOOP
Open-ILS/src/sql/Pg/011.schema.authority.sql:467: SELECT UNNEST(XPATH('//*[starts-with(@tag,"1")]/@tag',marc::XML)::TEXT[])
Open-ILS/src/sql/Pg/011.schema.authority.sql:493: auth_field := XPATH('//*[@tag="'||main_entry.tag||'"][1]',source_xml::XML);
Open-ILS/src/sql/Pg/011.schema.authority.sql:494: auth_i1 = (XPATH('@ind1',auth_field[1]))[1];
Open-ILS/src/sql/Pg/011.schema.authority.sql:495: auth_i2 = (XPATH('@ind2',auth_field[1]))[1];
Open-ILS/src/sql/Pg/011.schema.authority.sql:502: ) INTO tmp_data FROM UNNEST(XPATH('//*[local-name()="subfield"]', auth_field[1]));
Open-ILS/src/sql/Pg/002.functions.config.sql:169:CREATE OR REPLACE FUNCTION oils_xpath ( TEXT, TEXT, ANYARRAY ) RETURNS TEXT[] AS 'SELECT XPATH( $1, $2::XML, $3 )::TEXT[];' LANGUAGE SQL IMMUTABLE;
Open-ILS/src/sql/Pg/002.functions.config.sql:170:CREATE OR REPLACE FUNCTION oils_xpath ( TEXT, TEXT ) RETURNS TEXT[] AS 'SELECT XPATH( $1, $2::XML )::TEXT[];' LANGUAGE SQL IMMUTABLE;
Open-ILS/src/sql/Pg/002.functions.config.sql:183:CREATE OR REPLACE FUNCTION oils_xpath ( TEXT, TEXT, ANYARRAY ) RETURNS TEXT[] AS 'SELECT XPATH( $1, $2::XML, $3 )::TEXT[];' LANGUAGE SQL IMMUTABLE;
Open-ILS/src/sql/Pg/002.functions.config.sql:184:CREATE OR REPLACE FUNCTION oils_xpath ( TEXT, TEXT ) RETURNS TEXT[] AS 'SELECT XPATH( $1, $2::XML )::TEXT[];' LANGUAGE SQL IMMUTABLE;
Open-ILS/src/sql/Pg/076.functions.url_verify.sql:78: SELECT (XPATH(current_selector.xpath || '/text()', b.marc::XML))[current_url_pos]::TEXT INTO current_url
Open-ILS/src/sql/Pg/076.functions.url_verify.sql:85: SELECT (XPATH(current_selector.xpath || '/../@tag', b.marc::XML))[current_url_pos]::TEXT INTO current_tag
Open-ILS/src/sql/Pg/076.functions.url_verify.sql:96: SELECT (XPATH(current_selector.xpath || '/@code', b.marc::XML))[current_url_pos]::TEXT INTO current_sf
Open-ILS/src/sql/Pg/990.schema.unapi.sql:1451: parts := parts || xpath(sub_xpath, subrec.marc::XML)::TEXT[];
Open-ILS/src/sql/Pg/999.functions.global.sql:1538: XPATH('//*[starts-with(@tag,"1")]/@tag',rec_marc_xml)::TEXT[]
Open-ILS/src/sql/Pg/999.functions.global.sql:1557: (XPATH('//*[@tag="' || acsaf.tag || '"]/*[@code="' ||

Some of those a...

Read more...

Revision history for this message
Mike Rylander (mrylander) wrote :

As promised:

http://git.evergreen-ils.org/?p=working/Evergreen.git;a=shortlog;h=refs/heads/collab/miker/smarter-oils_xpath-entity-decoding

Only the first of the top four commits (in chronological order) is required to fix the OP's issue, but testing of all the changes would be great.

I'm holding off the creation of an upgrade script until we decide exactly how much of this actually needs to go in for 2.6 (some is purely cleanup).

Changed in evergreen:
assignee: Mike Rylander (mrylander) → nobody
tags: added: pullrequest
Dan Wells (dbw2)
Changed in evergreen:
assignee: nobody → Dan Wells (dbw2)
Revision history for this message
Dan Wells (dbw2) wrote :

Mike's branch looks good and is working well in all my testing. I have signed-off, rebased, and also took a stab a creating the upgrade script and some low-level tests to target these changes.

http://git.evergreen-ils.org/?p=working/Evergreen.git;a=shortlog;h=refs/heads/user/dbwells/smarter-oils_xpath-entity-decoding-upgrade-tests-signoffs

I didn't notice anything crazy about it, so I am fine with the cleanup aspects of this going into 2.6.

One thing I wondered is why we are leaving some native XPATH calls here and there. I didn't see any which would be affected by this change (they aren't extracting text from nodes), so this is more curiosity than anything, and I am comfortable getting this in as it stands and touching up later (if needed).

Thanks, Mike!

Changed in evergreen:
assignee: Dan Wells (dbw2) → nobody
Revision history for this message
Mike Rylander (mrylander) wrote :

Thanks for testing, Dan!

The reason I left some native XPATH calls as they were is to avoid an extra level of indirection (and CPU cost) that the intermediate calls incur. Since they don't do anything exotic, where behavior has changed in PG, and aren't related to string extraction for user consumption, that seemed both simpler (less change) and prudent.

Revision history for this message
Dan Wells (dbw2) wrote :

Mike, thanks for the clarification. Just wanted to double-check and see if I was missing anything.

Revision history for this message
Dan Wells (dbw2) wrote :

With fresher eyes, I have force pushed some updates to the upgrade script. It now will:

1) Recreate oils_xslt_process() in the evergreen schema
2) Forcibly drop any public schema versions of these functions hanging around
3) commit instead of rollback (doh!)

Thanks,
Dan

Revision history for this message
Remington Steed (rjs7) wrote :

The following test method worked for me:

- built new db from master
- added pgTAP extension: "CREATE EXTENSION pgtap;"
- ran Dan's pgTAP tests (they fail)
- ran Dan's upgrade script (works fine)
- ran Dan's pgTAP tests (they pass)

Here's my signoff branch:

http://git.evergreen-ils.org/?p=working/Evergreen.git;a=shortlog;h=refs/heads/user/rsteed/smarter-oils_xpath-entity-decoding-upgrade-tests-signoffs

working/user/rsteed/smarter-oils_xpath-entity-decoding-upgrade-tests-signoffs

Revision history for this message
Dan Wells (dbw2) wrote :

Please note, this upgrade script does not include any reingest steps. To see the effect on the browse entries (the original issue), you can do a browse only reingest, e.g.:

select count(metabib.reingest_metabib_field_entries(id, TRUE, FALSE, TRUE)) from biblio.record_entry;

Revision history for this message
Mike Rylander (mrylander) wrote :

To speed up reingest, we might want to just use the evergreen.xml_famous5_to_text() function directly. As an attempted first-pass, at least, we could do something like:

=# UPDATE metabib.browse_entry SET value=evergreen.xml_famous5_to_text(value) WHERE value LIKE '%&%;

This has the potential to fail if the decoded version of the string was saved previous to this problem existing, but in that case the slower browse-only reingest could be run instead.

Revision history for this message
Dan Wells (dbw2) wrote :
Revision history for this message
Ben Shum (bshum) wrote :

I took Dan's branch and ran with it. Pushed to master.

Changed in evergreen:
status: Confirmed → Fix Committed
Changed in evergreen:
status: Fix Committed → Fix Released
no longer affects: evergreen/2.5
To post a comment you must log in.
This report contains Public information  
Everyone can see this information.

Other bug subscribers

Remote bug watches

Bug watches keep track of this bug in other bug trackers.