Long browse entries cause index row size error

Bug #1695911 reported by Dan Wells on 2017-06-05
This bug affects 2 people
Affects: Evergreen
Importance: Undecided
Assigned to: Unassigned

Bug Description

EG 2.12
PG 9.4

When trying to ingest a record with an extremely long title, the following error occurs:

DBD::Pg::st execute failed: ERROR: index row size 3000 exceeds maximum 2712 for index "browse_entry_sort_value_value_key"
HINT: Values larger than 1/3 of a buffer page cannot be indexed.
Consider a function index of an MD5 hash of the value, or use full text indexing.
CONTEXT: SQL statement "INSERT INTO metabib.browse_entry
                    ( value, sort_value ) VALUES
                    ( value_prepped, ind_data.sort_value )"
PL/pgSQL function metabib.reingest_metabib_field_entries(bigint,boolean,boolean,boolean) line 64 at SQL statement
SQL statement "SELECT metabib.reingest_metabib_field_entries(NEW.id)"
PL/pgSQL function biblio.indexing_ingest_or_delete() line 55 [...]

browse_entry_sort_value_value_key is a UNIQUE CONSTRAINT on metabib.browse_entry. As such, perhaps using MD5 as suggested would not be the best idea, since hash collisions would invite extremely rare but eventual bugs. Beyond that, I do not really understand the particular goals of a constraint across (sort_value, value), so I am not sure what else to suggest (e.g. would 'value' alone be sufficient for well-formed data?).

Perhaps a simple truncation of the data before indexing would be sufficient. It has the same incompleteness problem as using MD5, but at least we know we are punishing the outliers, not entries based on random chance. Even with truncation, it seems extremely unlikely for a title to have its *first* uniqueness from the set more than 2,000 characters in (if that is what the 'size' here amounts to).
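
(For illustration only, a minimal sketch of what such a truncation might look like in the INSERT that fails above; the 2000/700 split and the use of LEFT() are assumptions, not a tested patch:)

    -- Sketch: cap both columns so the combined (sort_value, value) index entry
    -- stays under the ~2712-byte limit for single-byte data; multibyte UTF-8
    -- values could still approach it, so the exact split is only illustrative.
    INSERT INTO metabib.browse_entry ( value, sort_value )
         VALUES ( LEFT(value_prepped, 700), LEFT(ind_data.sort_value, 2000) );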

Mike Rylander (mrylander) wrote :

This isn't the first time we've run into this sort of issue. For another example, see: mfr. We truncate there (transparently, through a view and query rewriting), FWIW.

The reason we have a unique index over (sort_value, value) rather than just (value) or (sort_value) alone is because of non-filing prefixes. Two strings may have different value values, but if they come from fields that allow and use non-filing characters, they may have colliding sort_value values. And, likewise, you may have two identical value values, but one defines a non-filing character count and so should sort differently. So, both the raw value and the sort_value must be considered together when defining "unique".

Moving to a calculated value (MD5, SHA1, etc.) would not be good, since the index is used for searching and sorting the data.

Truncation seems reasonable, and, IMO, favoring truncation of the value column is probably preferable up to some minimum length. For instance, perhaps we truncate the sort_value column to no shorter than 2000, and give the balance of the ~2700 bytes to the value column. That allows the non-filing part of the value column to continue acting as a tie breaker, which is how things work today.

Thoughts?
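
(Purely illustrative, with made-up titles, of the collision described above: two distinct browse entries can share a sort_value because of a non-filing prefix, so neither column is unique on its own.)

    -- Hypothetical rows, not taken from the report:
    INSERT INTO metabib.browse_entry ( value, sort_value )
         VALUES ( 'The Annual Register', 'annual register' ),  -- "The " non-filing
                ( 'Annual Register',     'annual register' );
    -- A UNIQUE constraint on sort_value alone would reject the second row even
    -- though the display values differ; UNIQUE (sort_value, value) keeps both.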

> Truncation seems reasonable, and, IMO, favoring truncation of the value
> column is probably preferable up to some minimum length. For instance,
> perhaps we truncate the sort_value column to no shorter than 2000, and
> give the balance of the ~2700 bytes to the value column. That allows
> the non-filing part of the value column to continue acting as a tie
> breaker, which is how things work today.

+1 to truncation.

Dan Scott (denials) wrote :

My first attempt to create a unique index over our data with truncated values was:

CREATE UNIQUE INDEX CONCURRENTLY browse_entry_sort_value_value_key ON metabib.browse_entry (SUBSTRING(sort_value FROM 0 FOR 2048), SUBSTRING(value FROM 0 FOR 512));

This failed with a duplicate of a super-long title, thanks to Eighteenth Century Collections Online (ECCO):

ERROR: could not create unique index "browse_entry_sort_value_value_key"
DETAIL: Key ("substring"(sort_value, 0, 2048), "substring"(value, 0, 512))=(a compleat history of europe or a view of the affairs thereof civil and military for the year 1705 containing all the publick and secret transactions therein the several steps taken by france for an universal monarchy and to enslave her neighbours the wars in italy poland germany netherlands spain &c intermixd with great variety of original papers letters memoirs treaties &c several of which never before made publick with the remarkables of the year including particularly the lives of several eminent persons both at home and abroad that died therein and correc lists of persons in offices or places of trust in her majestys government illustrated with maps to be continued annually, A compleat history of Europe or, a view of the affairs thereof, civil and military, for the year, 1705. Containing all the publick and secret transactions therein; The several Steps taken by France, for an Universal Monarchy, and to Enslave her Neighbours; The Wars in Italy, Poland, Germany, Netherlands, Spain, &c. Intermix'd with great variety of original papers, Letters, Memoirs, Treaties, &c. Several of which never before made Publick. With the remarkables of the year; including particularly the Lives o) is duplicated.

My next attempt added the metabib.browse_entry.id column to the unique index to guarantee uniqueness:

CREATE UNIQUE INDEX CONCURRENTLY browse_entry_sort_value_value_key ON metabib.browse_entry (SUBSTRING(sort_value FROM 0 FOR 2048), SUBSTRING(value FROM 0 FOR 512), id);

Success! But then trying to add the unique constraint failed:

ALTER TABLE metabib.browse_entry ADD CONSTRAINT browse_entry_sort_value_value_key UNIQUE USING INDEX browse_entry_sort_value_value_key;
ERROR: index "browse_entry_sort_value_value_key" contains expressions
LINE 1: ALTER TABLE metabib.browse_entry ADD CONSTRAINT browse_entry...
                                             ^
DETAIL: Cannot create a primary key or unique constraint using such an index.

This is in accordance with https://www.postgresql.org/docs/9.4/static/sql-altertable.html in the section "ADD table_constraint_using_index" that states:
    This form adds a new PRIMARY KEY or UNIQUE constraint to a table based on an existing unique index.

    The index cannot have expression columns nor be a partial index.

So... not sure how to proceed.

(For what it's worth, the only difference between the two metabib.browse_entry rows is that the value of one of the rows has an uppercase 'P' at char position 530 vs. lowercase 'p' in the other row - thus outside of the arbitrary truncation limit, but resulting in an identical sort_value).
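
(One possible way forward, sketched here under the assumption that nothing else in the schema depends on the constraint by name; the index name below is made up and this has not been tested against Evergreen. A unique expression index enforces uniqueness on its own, so the table-level constraint could be dropped rather than recreated.)

    -- Untested sketch: replace the oversized two-column constraint with a
    -- truncating unique expression index.  Rows that collide at the chosen
    -- truncation lengths (like the ECCO pair above) would still need to be
    -- de-duplicated first.
    ALTER TABLE metabib.browse_entry
          DROP CONSTRAINT browse_entry_sort_value_value_key;
    CREATE UNIQUE INDEX CONCURRENTLY browse_entry_sort_value_value_trunc_idx
        ON metabib.browse_entry
           (SUBSTRING(sort_value FROM 0 FOR 2048), SUBSTRING(value FROM 0 FOR 512));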
