Bug #1243023 “Browse catalogue titles are doubly escaped?” : Bugs : Evergreen

Revision history for this message

Dan Wells (dbw2) wrote on 2013-10-29:

#1

I can't reproduce this bug using 2.5 RC1+ on Debian Squeeze, nor on Ubuntu Lucid (all ampersands in browse appear normal). I'm missing a couple post-RC1 commits on these boxes, but none jump out at me as being particularly relevant. Perhaps this is another case of a newer Fedora prerequisite with slightly different behavior? Or maybe this record got ingested weird somehow? (i.e. maybe the entity-ized string got into metabib.browse_entry somehow? It would at least explain why it shows up right in one place and not the other.)

Revision history for this message

Dan Scott (denials) wrote on 2013-11-01:

#2

Yeah, still double-escaped for me with a fresh install and --load-all-sample from master.

"Piano concertos 2 & 3" is fine in the search results, record details, and in the autocomplete box. It's just messed up in the browse search (http://localhost/eg/opac/browse?blimit=10;qtype=title;bterm=Piano%20concert;locg=1;bpivot=547 on my fresh install).

SELECT * FROM metabib.browse_entry where value ~ 'Piano concertos 2'; shows

438 | Piano concertos 2 & 3 | '2':3 '3':5 'amp':4 'concerto':2 'piano':1 | piano concertos 2 &amp 3

So... that's a problem :)

Note that I do have to pick #1242999 (the Encode.pm fix) just to get things to ingest and actually be able to test at all, so perhaps that's playing havoc somehow?

Revision history for this message

Dan Wells (dbw2) wrote on 2013-11-01:

#3

This and #1242999 are certainly in the same area of influence, so it wouldn't surprise me if they are somehow related.

Hmmm, looking at how metabib.browse_entry is populated, I wonder if your record might be double-encoded itself. Do you see anything like "Piano concertos 2 &amp; 3" in that record?

Revision history for this message

Dan Wells (dbw2) wrote on 2013-11-01:

#4

Another shot in the dark: There's a step in the browse extraction where & is subbed in so that oils_xpath() can complete (which then turns & back into '&'). Does this:

select oils_xpath( '//text()', '<blah>this & that</blah>');

decode the & for you? I'd be pretty surprised if it didn't, but hey, I have no other ideas :)

Revision history for this message

Dan Scott (denials) wrote on 2013-11-01:

#5

It would be nice to compare the results you get for the same queries I've been running if you use --load-all-sample (with and without 1242999 ideally).

For the record itself:

SELECT marc FROM biblio.record_entry WHERE id = 16;

...<datafield tag="245" ind1="0" ind2="0"><subfield code="a">Piano concertos 2 & 3</subfield></datafield>...

For your shot in the dark:

select oils_xpath( '//text()', '<blah>this & that</blah>');
oils_xpath
---------------------
{"this & that"}
(1 row)

Revision history for this message

Dan Scott (denials) wrote on 2013-11-02:

#6

Okay. Tried it on Ubuntu Precise with and without the branch from 1242999. No problems with escaping either way, and the database entry for metabib.browse_entry ID # 438 is 'Piano concertos 2 & 3'. I upgraded Encode.pm to the most recent available version and the results were the same. So yes, it sounds like it's another newer Fedora prereq at work. PostgreSQL 9.1 on Ubuntu vs. 9.2 on Fedora?

Whatever it may be, it doesn't look like it will have an immediate impact on most production systems for a while.

Revision history for this message

Dan Scott (denials) wrote on 2013-11-02:

#7

I guess the best way to resolve this would be to add a pgTAP test to ingest the record and ensure that the metabib.browse_entry is as expected, given that we know that there is likely to be some situation in the future where this will become a problem (and we're not really likely to notice it right away, unless there is a unit test).

Revision history for this message

Dan Wells (dbw2) wrote on 2013-11-04:

#8

Huh, oils_xpath normally decodes that & (i.e. you should get 'this & that', not 'this & that'), so I guess the shot in the dark was accurate. As promised, I'm surprised :)

This behavior has been reported as a possible bug in Postgresql 9.2+:
http://<email address hidden>

The bug hasn't gotten any traction for or against, so perhaps we should join the conversation?

I am not a pro at reading standards, but it seems to me that the XPath standard states pretty clearly that "Entity references to both internal and external entities are expanded. Character references are resolved." I interpret that as supporting the 9.1 behavior of interpreting the character reference before applying the XPath text extraction. Does that sound right to others here? Here is a link to the XPath 1.0 document (the version which PG9.2 still supports): http://www.w3.org/TR/xpath/

Revision history for this message

Dan Scott (denials) wrote on 2013-11-04:

#9

Sounds like you nailed it! Yes, we should join that conversation. I can be the emissary, seeing as the bug hit me first (but is likely to hit the early adopters who grabbed 9.2 for performance reasons or whatever).

Revision history for this message

Dan Scott (denials) wrote on 2013-11-20:

#10

By the way, I posted to the pgsql-bugs list at http://www.postgresql.org<email address hidden>

Tom Lane's response was that it was an intentional change in behaviour to be consistent with the PostgreSQL XML type: http://<email address hidden>

My response has thus far been ignored: http://www.postgresql.org<email address hidden>

So I'm not sure if we're going to see any action on PostgreSQL's part to resolve this problem.

Revision history for this message

Dan Wells (dbw2) wrote on 2013-11-20:

#11

So, a small update for this bug. On PG 9.1, & is not a problem, but metabib data containing < and > is in fact double-encoded. This doesn't just affect browse, but other metabib display extractions (e.g. facets).

The culprit appears to be these lines from biblio.extract_metabib_field_entry():

...
                         REGEXP_REPLACE( -- This escapes all &s not followed by "amp;". Data ise returned from oils_xpath (above) in UTF-8, not entity encoded
                            REGEXP_REPLACE( -- This escapes embeded <s
                                xml_node,
                                $re$(>[^<]+)(<)([^>]+<)$re$,
                                E'\\1<\\3',
                                'g'
                            ),
                            '&(?!amp;)',
                            '&',
                            'g'
                        ),
...

As stated in the comments, this code assumes that the data from oils_xpath is 100% entity-free, and tries to do the minimum to turn it back into valid XML. However, the comment is only half true, as the xpath functionality in PG appears to be already encoding the "famous five" characters needed for valid XML (& < > " ').

Given recent history, this seems likely to be new behavior on the PG side of things. For PG 9.1+, I think we can just drop these replacements entirely. If that seems unsafe, we should at least only encode '&' when not followed by and of the famous five, not just 'amp'. This would allow us to be more generally compatible across possible version differences, at the expense of perhaps unneeded complexity (i.e. maybe this was just a bug all along).

Please note, this proposed change would NOT fix the original "&" issue with PG 9.2+. It is the same symptom, but a different cause.

Revision history for this message

Dan Wells (dbw2) wrote on 2013-11-20:

#12

After reading the pgbugs thread, and thinking about this a bit more, I now realize what I said above is possibly off because of the way this is all mixed up. Perhaps (even in 9.1) there are configurations where you need the escaping, and other cases where the escaping is harmful. Ugh. I need to think on this more, and probably do more testing.

On the plus side, if the 9.2 behavior sticks, at least we can be confident that the data will always be escaped, even when we don't want it to be ;)

Revision history for this message

Mike Rylander (mrylander) wrote on 2013-11-20:

#13

Until and unless Pg reverts the change in question, I think we're going to need to sniff the version of postgres, loop over the array of results, and when the value does not start with '<' (is a text node) do the version-appropriate thing.

We can do that transparently (as a Pg superuser, at least) with something like the following:

alter function xpath(text,xml) rename to pgxpath;
alter function xpath(text,xml,text[]) rename to pgxpath;
create function xpath(text,xml) returns xml[] as $$ ... $$;
create function xpath(text,xml,text[]) returns xml[] as $$ ... $$;

Does anyone see a better path, other than the ideal of Pg bringing back spec-ish behavior?

Revision history for this message

Mike Rylander (mrylander) wrote on 2014-02-12:

#14

To amend my previous comment, we needn't rename the builtin functions if we simply place our replacements in the evergreen schema. Our required search_path change will make them show up first.

Ben Shum (bshum) on 2014-02-16

Changed in evergreen:
status:	New → Confirmed
tags:	added: postgresql

Dan Wells (dbw2) on 2014-03-03

Changed in evergreen:
milestone:	none → 2.6.0-rc1
tags:	added: 2.6-rc-blocker

Mike Rylander (mrylander) on 2014-03-03

Changed in evergreen:
assignee:	nobody → Mike Rylander (mrylander)

Revision history for this message

Mike Rylander (mrylander) wrote on 2014-03-03:

#15

Download full text (3.4 KiB)

I've started looking into this a bit more deeply. I was concerned before that we might have a LOT of uses of the builtin XPATH() function, but it turens out that we don't. Here's what ack-grep says:

miker@foolery:~/git/ILS (version-specific-xpath)$ gack Open-ILS/src/sql/Pg/030.schema.metabib.sql:1506: Open-ILS/src/sql/Pg/011.schema.authority.sql:232: Open-ILS/src/sql/Pg/011.schema.authority.sql:340: Open-ILS/src/sql/Pg/011.schema.authority.sql:353: Open-ILS/src/sql/Pg/011.schema.authority.sql:467: Open-ILS/src/sql/Pg/011.schema.authority.sql:493: Open-ILS/src/sql/Pg/011.schema.authority.sql:494: Open-ILS/src/sql/Pg/011.schema.authority.sql:495: Open-ILS/src/sql/Pg/011.schema.authority.sql:502: Open-ILS/src/sql/Pg/002.functions.config.sql:169:src/sql/Pg/002.functions.config.sql:170:src/sql/Pg/002.functions.config.sql:183:src/sql/Pg/002.functions.config.sql:184:src/sql/Pg/076.functions.url_verify.sql:78: Open-ILS/src/sql/Pg/076.functions.url_verify.sql:85: Open-ILS/src/sql/Pg/076.functions.url_verify.sql:96: Open-ILS/src/sql/Pg/990.schema.unapi.sql:1451: Open-ILS/src/sql/Pg/999.functions.global.sql:1538: Open-ILS/src/sql/Pg/999.functions.global.sql:1557: '(?i:\bxpath\()' --type=sql|grep -v upgrade
FOR tmp_xml IN SELECT XPATH(attr_def.xpath, transformed_xml, ARRAY[ARRAY[xfrm.prefix, xfrm.namespace_uri]]) LOOP
WHERE tag IN ( SELECT UNNEST(XPATH('//*[starts-with(@tag,"1")]/@tag',marcxml::XML)::TEXT[]))
WHERE tag IN ( SELECT UNNEST(XPATH('//*[starts-with(@tag,"1")]/@tag',marcxml::XML)::TEXT[]))
FOR tmp_xml IN SELECT UNNEST(XPATH('//*[@tag="'||tag_used||'"]', marcxml::XML)) LOOP
SELECT UNNEST(XPATH('//*[starts-with(@tag,"1")]/@tag',marc::XML)::TEXT[])
auth_field := XPATH('//*[@tag="'||main_entry.tag||'"][1]',source_xml::XML);
auth_i1 = (XPATH('@ind1',auth_field[1]))[1];
auth_i2 = (XPATH('@ind2',auth_field[1]))[1];
) INTO tmp_data FROM UNNEST(XPATH('//*[local-name()="subfield"]', auth_field[1]));
/>CREATE OR REPLACE FUNCTION oils_xpath ( TEXT, TEXT, ANYARRAY ) RETURNS TEXT[] AS 'SELECT XPATH( $1, $2::XML, $3 )::TEXT[];' LANGUAGE SQL IMMUTABLE;
/>CREATE OR REPLACE FUNCTION oils_xpath ( TEXT, TEXT ) RETURNS TEXT[] AS 'SELECT XPATH( $1, $2::XML )::TEXT[];' LANGUAGE SQL IMMUTABLE;
/>CREATE OR REPLACE FUNCTION oils_xpath ( TEXT, TEXT, ANYARRAY ) RETURNS TEXT[] AS 'SELECT XPATH( $1, $2::XML, $3 )::TEXT[];' LANGUAGE SQL IMMUTABLE;
/>CREATE OR REPLACE FUNCTION oils_xpath ( TEXT, TEXT ) RETURNS TEXT[] AS 'SELECT XPATH( $1, $2::XML )::TEXT[];' LANGUAGE SQL IMMUTABLE;
SELECT (XPATH(current_selector.xpath || '/text()', b.marc::XML))[current_url_pos]::TEXT INTO current_url
SELECT (XPATH(current_selector.xpath || '/../@tag', b.marc::XML))[current_url_pos]::TEXT INTO current_tag
SELECT (XPATH(current_selector.xpath || '/@code', b.marc::XML))[current_url_pos]::TEXT INTO current_sf
parts := parts || xpath(sub_xpath, subrec.marc::XML)::TEXT[];
XPATH('//*[starts-with(@tag,"1")]/@tag',rec_marc_xml)::TEXT[]
(XPATH('//*[@tag="' || acsaf.tag || '"]/*[@code="' ||

Some of those a...

I've started looking into this a bit more deeply.  I was concerned before that we might have a LOT of uses of the builtin XPATH() function, but it turens out that we don't.  Here's what ack-grep says:

miker@foolery:~/git/ILS (version-specific-xpath)$ gack '(?i:\bxpath\()' --type=sql|grep -v upgrade
Open-ILS/src/sql/Pg/030.schema.metabib.sql:1506:            FOR tmp_xml IN SELECT XPATH(attr_def.xpath, transformed_xml, ARRAY[ARRAY[xfrm.prefix, xfrm.namespace_uri]]) LOOP
Open-ILS/src/sql/Pg/011.schema.authority.sql:232:          WHERE tag IN ( SELECT  UNNEST(XPATH('//*[starts-with(@tag,"1")]/@tag',marcxml::XML)::TEXT[]))
Open-ILS/src/sql/Pg/011.schema.authority.sql:340:          WHERE tag IN ( SELECT  UNNEST(XPATH('//*[starts-with(@tag,"1")]/@tag',marcxml::XML)::TEXT[]))
Open-ILS/src/sql/Pg/011.schema.authority.sql:353:        FOR tmp_xml IN SELECT UNNEST(XPATH('//*[@tag="'||tag_used||'"]', marcxml::XML)) LOOP
Open-ILS/src/sql/Pg/011.schema.authority.sql:467:                    SELECT  UNNEST(XPATH('//*[starts-with(@tag,"1")]/@tag',marc::XML)::TEXT[])
Open-ILS/src/sql/Pg/011.schema.authority.sql:493:        auth_field := XPATH('//*[@tag="'||main_entry.tag||'"][1]',source_xml::XML);
Open-ILS/src/sql/Pg/011.schema.authority.sql:494:        auth_i1 = (XPATH('@ind1',auth_field[1]))[1];
Open-ILS/src/sql/Pg/011.schema.authority.sql:495:        auth_i2 = (XPATH('@ind2',auth_field[1]))[1];
Open-ILS/src/sql/Pg/011.schema.authority.sql:502:                ) INTO tmp_data FROM UNNEST(XPATH('//*[local-name()="subfield"]', auth_field[1]));
Open-ILS/src/sql/Pg/002.functions.config.sql:169:CREATE OR REPLACE FUNCTION oils_xpath ( TEXT, TEXT, ANYARRAY ) RETURNS TEXT[] AS 'SELECT XPATH( $1, $2::XML, $3 )::TEXT[];' LANGUAGE SQL IMMUTABLE;
Open-ILS/src/sql/Pg/002.functions.config.sql:170:CREATE OR REPLACE FUNCTION oils_xpath ( TEXT, TEXT ) RETURNS TEXT[] AS 'SELECT XPATH( $1, $2::XML )::TEXT[];' LANGUAGE SQL IMMUTABLE;
Open-ILS/src/sql/Pg/002.functions.config.sql:183:CREATE OR REPLACE FUNCTION oils_xpath ( TEXT, TEXT, ANYARRAY ) RETURNS TEXT[] AS 'SELECT XPATH( $1, $2::XML, $3 )::TEXT[];' LANGUAGE SQL IMMUTABLE;
Open-ILS/src/sql/Pg/002.functions.config.sql:184:CREATE OR REPLACE FUNCTION oils_xpath ( TEXT, TEXT ) RETURNS TEXT[] AS 'SELECT XPATH( $1, $2::XML )::TEXT[];' LANGUAGE SQL IMMUTABLE;
Open-ILS/src/sql/Pg/076.functions.url_verify.sql:78:            SELECT  (XPATH(current_selector.xpath || '/text()', b.marc::XML))[current_url_pos]::TEXT INTO current_url
Open-ILS/src/sql/Pg/076.functions.url_verify.sql:85:            SELECT  (XPATH(current_selector.xpath || '/../@tag', b.marc::XML))[current_url_pos]::TEXT INTO current_tag
Open-ILS/src/sql/Pg/076.functions.url_verify.sql:96:            SELECT  (XPATH(current_selector.xpath || '/@code', b.marc::XML))[current_url_pos]::TEXT INTO current_sf
Open-ILS/src/sql/Pg/990.schema.unapi.sql:1451:        parts := parts || xpath(sub_xpath, subrec.marc::XML)::TEXT[];
Open-ILS/src/sql/Pg/999.functions.global.sql:1538:                    XPATH('//*[starts-with(@tag,"1")]/@tag',rec_marc_xml)::TEXT[]
Open-ILS/src/sql/Pg/999.functions.global.sql:1557:            (XPATH('//*[@tag="' || acsaf.tag || '"]/*[@code="' ||

Some of those are, of course, the creation of oils_xpath().  So, my current approach will be to clean up the use of XPATH so we always use oils_xpath, and then teach oils_xpath to do the right thing as I described above (decode text nodes).

Code to come soon.

Revision history for this message

Mike Rylander (mrylander) wrote on 2014-03-03:

#16

As promised:

http://git.evergreen-ils.org/?p=working/Evergreen.git;a=shortlog;h=refs/heads/collab/miker/smarter-oils_xpath-entity-decoding

Only the first of the top four commits (in chronological order) is required to fix the OP's issue, but testing of all the changes would be great.

I'm holding off the creation of an upgrade script until we decide exactly how much of this actually needs to go in for 2.6 (some is purely cleanup).

Changed in evergreen:
assignee:	Mike Rylander (mrylander) → nobody

Mike Rylander (mrylander) on 2014-03-03

tags:

added: pullrequest

Dan Wells (dbw2) on 2014-03-10

Changed in evergreen:
assignee:	nobody → Dan Wells (dbw2)

Revision history for this message

Dan Wells (dbw2) wrote on 2014-03-10:

#17

Mike's branch looks good and is working well in all my testing. I have signed-off, rebased, and also took a stab a creating the upgrade script and some low-level tests to target these changes.

http://git.evergreen-ils.org/?p=working/Evergreen.git;a=shortlog;h=refs/heads/user/dbwells/smarter-oils_xpath-entity-decoding-upgrade-tests-signoffs

I didn't notice anything crazy about it, so I am fine with the cleanup aspects of this going into 2.6.

One thing I wondered is why we are leaving some native XPATH calls here and there. I didn't see any which would be affected by this change (they aren't extracting text from nodes), so this is more curiosity than anything, and I am comfortable getting this in as it stands and touching up later (if needed).

Thanks, Mike!

Changed in evergreen:
assignee:	Dan Wells (dbw2) → nobody

Revision history for this message

Mike Rylander (mrylander) wrote on 2014-03-10:

#18

Thanks for testing, Dan!

The reason I left some native XPATH calls as they were is to avoid an extra level of indirection (and CPU cost) that the intermediate calls incur. Since they don't do anything exotic, where behavior has changed in PG, and aren't related to string extraction for user consumption, that seemed both simpler (less change) and prudent.

Revision history for this message

Dan Wells (dbw2) wrote on 2014-03-11:

#19

Mike, thanks for the clarification. Just wanted to double-check and see if I was missing anything.

Revision history for this message

Dan Wells (dbw2) wrote on 2014-03-14:

#20

With fresher eyes, I have force pushed some updates to the upgrade script. It now will:

1) Recreate oils_xslt_process() in the evergreen schema
2) Forcibly drop any public schema versions of these functions hanging around
3) commit instead of rollback (doh!)

Thanks,
Dan

Revision history for this message

Remington Steed (rjs7) wrote on 2014-03-14:

#21

The following test method worked for me:

- built new db from master
- added pgTAP extension: "CREATE EXTENSION pgtap;"
- ran Dan's pgTAP tests (they fail)
- ran Dan's upgrade script (works fine)
- ran Dan's pgTAP tests (they pass)

Here's my signoff branch:

http://git.evergreen-ils.org/?p=working/Evergreen.git;a=shortlog;h=refs/heads/user/rsteed/smarter-oils_xpath-entity-decoding-upgrade-tests-signoffs

working/user/rsteed/smarter-oils_xpath-entity-decoding-upgrade-tests-signoffs

Revision history for this message

Dan Wells (dbw2) wrote on 2014-03-14:

#22

Please note, this upgrade script does not include any reingest steps. To see the effect on the browse entries (the original issue), you can do a browse only reingest, e.g.:

select count(metabib.reingest_metabib_field_entries(id, TRUE, FALSE, TRUE)) from biblio.record_entry;

Revision history for this message

Mike Rylander (mrylander) wrote on 2014-03-15:

#23

To speed up reingest, we might want to just use the evergreen.xml_famous5_to_text() function directly. As an attempted first-pass, at least, we could do something like:

=# UPDATE metabib.browse_entry SET value=evergreen.xml_famous5_to_text(value) WHERE value LIKE '%&%;

This has the potential to fail if the decoded version of the string was saved previous to this problem existing, but in that case the slower browse-only reingest could be run instead.

Revision history for this message

Dan Wells (dbw2) wrote on 2014-03-19:

#24

I took Mike's idea and ran with it:

http://git.evergreen-ils.org/?p=working/Evergreen.git;a=shortlog;h=refs/heads/user/dbwells/smarter_xpath_smarter_upgrade

Revision history for this message

Ben Shum (bshum) wrote on 2014-03-19:

#25

I took Dan's branch and ran with it. Pushed to master.

Changed in evergreen:
status:	Confirmed → Fix Committed

Evergreen Bug Maintenance (bugmaster) on 2014-04-25

Changed in evergreen:
status:	Fix Committed → Fix Released

Evergreen Bug Maintenance (bugmaster) on 2016-08-16

no longer affects:

evergreen/2.5

Evergreen

Browse catalogue titles are doubly escaped?

Bug Description

Other bug subscribers

Remote bug watches