Long browse entries cause index row size error

Bug #1695911 reported by Dan Wells on 2017-06-05
Affects: Evergreen (Importance: Undecided, Assigned to: Unassigned)

Bug Description

EG 2.12
PG 9.4

When trying to ingest a record with an extremely long title, the following error occurs:

DBD::Pg::st execute failed: ERROR: index row size 3000 exceeds maximum 2712 for index "browse_entry_sort_value_value_key"
HINT: Values larger than 1/3 of a buffer page cannot be indexed.
Consider a function index of an MD5 hash of the value, or use full text indexing.
CONTEXT: SQL statement "INSERT INTO metabib.browse_entry
                    ( value, sort_value ) VALUES
                    ( value_prepped, ind_data.sort_value )"
PL/pgSQL function metabib.reingest_metabib_field_entries(bigint,boolean,boolean,boolean) line 64 at SQL statement
SQL statement "SELECT metabib.reingest_metabib_field_entries(NEW.id)"
PL/pgSQL function biblio.indexing_ingest_or_delete() line 55 [...]

browse_entry_sort_value_value_key is a UNIQUE CONSTRAINT on metabib.browse_entry. As such, perhaps using MD5 as suggested would not be the best idea, as it would invite extremely rare but eventual bugs. Beyond that, I do not really understand the particular goals of a constraint across (sort_value, value), so I am not sure what else to suggest (e.g. would 'value' alone be sufficient for well-formed data?).

Perhaps a simple truncation of the data before indexing would be sufficient. It has the same incompleteness problem as using MD5, but at least we know we are punishing the outliers, not entries based on random chance. Even with truncation, it seems extremely unlikely for a title to have its *first* uniqueness from the set more than 2,000 characters in (if that is what the 'size' here amounts to).

Mike Rylander (mrylander) wrote :

This isn't the first time we've run into this sort of issue. For another example, see: mfr. We truncate there (transparently, through a view and query rewriting), FWIW.

The reason we have a unique index over (sort_value, value) rather than just (value) or (sort_value) alone is non-filing prefixes. Two strings may have different value, er, values, but if they come from fields that allow and use non-filing characters, they may have colliding sort_value values. Likewise, you may have two identical value values, but one defines a non-filing character count and so should sort differently. So, both the raw value and the sort_value must be considered together when defining "unique".
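As a minimal illustration (the titles and sort values below are made up, not from real Evergreen data): a non-filing article changes the raw value but not the sort value, so a unique index on sort_value alone would wrongly collapse distinct headings.

```sql
-- Illustrative only: two hypothetical headings with a non-filing article.
-- Different raw values, identical sort values:
--   value = 'The Hobbit'  ->  sort_value = 'hobbit'
--   value = 'Hobbit'      ->  sort_value = 'hobbit'
-- Uniqueness must therefore consider (sort_value, value) together.
SELECT 'The Hobbit' AS value, 'hobbit' AS sort_value
UNION ALL
SELECT 'Hobbit', 'hobbit';
```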

Moving to a calculated value (MD5, SHA1, etc) would be non-good, since the index is used for searching and sorting the data.

Truncation seems reasonable, and, IMO, favoring truncation of the value column is probably preferable up to some minimum length. For instance, perhaps we truncate the sort_value column to no shorter than 2000, and give the balance of the ~2700 bytes to the value column. That allows the non-filing part of the value column to continue acting as a tie breaker, which is how things work today.

Thoughts?
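As a rough sketch of the split proposed above (the 2000/700 division is illustrative; the exact budget would need tuning against the ~2712-byte limit):

```sql
-- Hypothetical truncated unique index, per the proposal above:
-- sort_value keeps at least 2000 characters; value gets the balance
-- and continues to act as the non-filing tie breaker.
CREATE UNIQUE INDEX browse_entry_sort_value_value_key
    ON metabib.browse_entry
    (SUBSTRING(sort_value FROM 1 FOR 2000),
     SUBSTRING(value FROM 1 FOR 700));
```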

> Truncation seems reasonable, and, IMO, favoring truncation of the value
> column is probably preferable up to some minimum length. For instance,
> perhaps we truncate the sort_value column to no shorter than 2000, and
> give the balance of the ~2700 bytes to the value column. That allows
> the non-filing part of the value column to continue acting as a tie
> breaker, which is how things work today.

+1 to truncation.

Dan Scott (denials) wrote :

My first attempt to create a unique index over our data with truncated values was:

CREATE UNIQUE INDEX CONCURRENTLY browse_entry_sort_value_value_key ON metabib.browse_entry (SUBSTRING(sort_value FROM 0 FOR 2048), SUBSTRING(value FROM 0 FOR 512));

This failed with a duplicate of a super-long title, thanks to Eighteenth Century Collections Online (ECCO):

ERROR: could not create unique index "browse_entry_sort_value_value_key"
DETAIL: Key ("substring"(sort_value, 0, 2048), "substring"(value, 0, 512))=(a compleat history of europe or a view of the affairs thereof civil and military for the year 1705 containing all the publick and secret transactions therein the several steps taken by france for an universal monarchy and to enslave her neighbours the wars in italy poland germany netherlands spain &c intermixd with great variety of original papers letters memoirs treaties &c several of which never before made publick with the remarkables of the year including particularly the lives of several eminent persons both at home and abroad that died therein and correc lists of persons in offices or places of trust in her majestys government illustrated with maps to be continued annually, A compleat history of Europe or, a view of the affairs thereof, civil and military, for the year, 1705. Containing all the publick and secret transactions therein; The several Steps taken by France, for an Universal Monarchy, and to Enslave her Neighbours; The Wars in Italy, Poland, Germany, Netherlands, Spain, &c. Intermix'd with great variety of original papers, Letters, Memoirs, Treaties, &c. Several of which never before made Publick. With the remarkables of the year; including particularly the Lives o) is duplicated.

My next attempt added the metabib.browse_entry.id column to the unique index to guarantee uniqueness:

CREATE UNIQUE INDEX CONCURRENTLY browse_entry_sort_value_value_key ON metabib.browse_entry (SUBSTRING(sort_value FROM 0 FOR 2048), SUBSTRING(value FROM 0 FOR 512), id);

Success! But then trying to add the unique constraint failed:

ALTER TABLE metabib.browse_entry ADD CONSTRAINT browse_entry_sort_value_value_key UNIQUE USING INDEX browse_entry_sort_value_value_key;
ERROR: index "browse_entry_sort_value_value_key" contains expressions
LINE 1: ALTER TABLE metabib.browse_entry ADD CONSTRAINT browse_entry...
                                             ^
DETAIL: Cannot create a primary key or unique constraint using such an index.

This is in accordance with https://www.postgresql.org/docs/9.4/static/sql-altertable.html in the section "ADD table_constraint_using_index" that states:
    This form adds a new PRIMARY KEY or UNIQUE constraint to a table based on an existing unique index.

    The index cannot have expression columns nor be a partial index.

So... not sure how to proceed.

(For what it's worth, the only difference between the two metabib.browse_entry rows is that the value of one of the rows has an uppercase 'P' at char position 530 vs. lowercase 'p' in the other row - thus outside of the arbitrary truncation limit, but resulting in an identical sort_value).

Bill Erickson (berick) wrote :

Stumbled on this while doing some local refactoring... Dan, I think with the UNIQUE INDEX the table constraint is not necessary. We have something like this locally:

BEGIN;

ALTER TABLE metabib.browse_entry
    DROP CONSTRAINT browse_entry_sort_value_value_key;
DROP INDEX metabib.browse_entry_sort_value_idx;

CREATE INDEX browse_entry_sort_value_idx
    ON metabib.browse_entry (SUBSTRING(sort_value FROM 1 FOR 2700));

CREATE UNIQUE INDEX browse_entry_sort_value_value_key
    ON metabib.browse_entry
    USING btree (MD5(value), MD5(sort_value));

COMMIT;

Duplicate inserts on metabib.browse_entry will result in unique constraint violations.
If this looks sane, I can push a working branch.

Bill Erickson (berick) wrote :

More testing, and a correction to my previous comment. In practice, truncating the *_idx variant hasn't been necessary, only the *_key variant, which makes sense since the *_idx variant contains about half as much data. So, more along the lines of:

BEGIN;
ALTER TABLE metabib.browse_entry
    DROP CONSTRAINT browse_entry_sort_value_value_key;
DROP INDEX IF EXISTS metabib.browse_entry_sort_value_value_key;

CREATE UNIQUE INDEX browse_entry_sort_value_value_key
    ON metabib.browse_entry
    USING btree (MD5(value), MD5(sort_value));
COMMIT;

Mike Rylander (mrylander) wrote :

Bill, have you confirmed that the truncated index is still used without having to adjust the WHERE clause? In the olden days, I believe that was not true (and part of the reason for the query rewriting stuff we do for mfr, IIRC).

Bill Erickson (berick) wrote :

I may be oversimplifying the issue, but my understanding of the browse_entry_sort_value_value_key index is that it acts purely as a unique constraint. The MD5 variant seems to handle that OK.

In my current code, I have not modified browse_entry_sort_value_idx, the one I assume is used by the search queries. It still contains the full value.

I'd be happy to run a query, Mike, if there's something you'd like me to test. This code is on a dev VM at the moment, but should be up on a large DB soon.

Mike Rylander (mrylander) wrote :

Sorry Bill, I was thinking that the unique index was doing double duty and providing a search function. If not, then the MD5 way seems fine. Do you see any reason not to do it like this:

CREATE UNIQUE INDEX browse_entry_sort_value_value_key
    ON metabib.browse_entry
    USING btree (MD5(value || sort_value));

to avoid the doubled storage cost and only call the MD5 function once per row on each insert/update?
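(One caveat with hashing the bare concatenation, not raised in the thread itself but worth noting: without a delimiter, distinct (value, sort_value) pairs can concatenate to the same string, and thus the same digest, regardless of hash function.)

```sql
-- The pairs ('abc', 'def') and ('abcd', 'ef') both concatenate to
-- 'abcdef', so their digests collide no matter which hash is used:
SELECT MD5('abc' || 'def') = MD5('abcd' || 'ef');               -- true
-- A separator byte that cannot occur in the data avoids this:
SELECT MD5('abc' || E'\x01' || 'def')
     = MD5('abcd' || E'\x01' || 'ef');                          -- false
```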

(NOTE: I tested the speed of a SHA-1 digest, which is less collision-y, but DIGEST(xxx,'sha1') is about 40% slower than the direct MD5() function.)

Bill Erickson (berick) wrote :

Yeah, I was thinking the same thing at first... For _key, MD5(value || sort_value) makes sense. Thanks for the suggestion. I'll test that and push a branch in a bit.

Bill Erickson (berick) wrote :

Expansion of the above proposals:

1.

CREATE UNIQUE INDEX browse_entry_sort_value_value_key
    ON metabib.browse_entry USING btree (DIGEST((value || sort_value), 'sha1'));

I was getting md5 collisions during reingests, so I upgraded to sha1. So far the difference in indexing speed has not been noticeable.

2.

ALTER TABLE metabib.browse_entry ADD COLUMN combo_sort_value TEXT;

The value stored here will be SUBSTR((sort_value || value), 1, 2048);

This replaces 'ORDER BY sort_value, value' incantations.

This gives a single field that can be (safely) indexed, but still retains the secondary sort on 'value' except in extreme cases where the value is really long. In those cases, the sorting may not be perfect, but at that point we're sorting paragraphs of text, so I doubt it matters much. Plus this way the values can actually exist in the database.
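A sketch of how the new column might be populated and indexed (the backfill statement and index name here are illustrative, not taken from the actual branch):

```sql
-- Illustrative backfill for existing rows; the real ingest functions
-- compute this at insert time.
UPDATE metabib.browse_entry
   SET combo_sort_value = SUBSTR((sort_value || value), 1, 2048);

-- Hypothetical supporting index for the new single-column sort:
CREATE INDEX browse_entry_combo_sort_value_idx
    ON metabib.browse_entry (combo_sort_value);

-- Queries then ORDER BY combo_sort_value
-- in place of ORDER BY sort_value, value.
```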

The browse / browse_pivot / bib and auth ingest functions are updated to populate and use the new combo_sort_value field.

3.

Since we still need indexes on value and sort_value, may as well add them more carefully:

CREATE INDEX browse_entry_sort_value_idx
    ON metabib.browse_entry (SUBSTR(sort_value, 1, 2048));

CREATE INDEX browse_entry_value_idx
    ON metabib.browse_entry (SUBSTR(value, 1, 2048));

This will allow extremely long values to be ingested without failure and should (as I understand it) have no impact on entries whose values are shorter than 2048 characters.
