Rows doubled or tripled in metabib.%_field_entry tables

Bug #1032208 reported by Bob Wicksall on 2012-08-02
6
This bug affects 1 person
Affects Status Importance Assigned to Milestone
Evergreen
Medium
Unassigned
2.3
Medium
Unassigned
2.4
Medium
Unassigned

Bug Description

Evergreen 2.2.0

Rows in metabib.title_field_entry, subject_field_entry, series_field_entry, and author_field_entry are doubled or tripled due to bad logic in biblio.extract_metabib_field_entry. This results in these tables being 2 or more times their correct size.

This was introduced in 2.2.0 when the logic for browse_field and facet_field were added to biblio.extract_metabib_field_entry. 2.1 is not affected.

The duplicates are caused when biblio.extract_metabib_field_entry returns TRUE in the search_field column for all rows even if they should just be facet_field or browse_field.

Bob Wicksall (bwicksall) wrote :

I'm attaching a modified biblio.extract_metabib_field_entry that should resolve this issue. On our test server this has had a fairly significant impact on row counts and table sizes:

  author_field_entry 1786920 rows before and 494144 rows after
  series_field_entry 330018 rows before and 108101 rows after
  subject_field_entry 4506630 rows before and 971917 rows after
  title_field_entry 1032180 rows before and 528558 rows after

I'm not sure if this translates to faster searches but it "feels" faster to me.

Mike Rylander (mrylander) wrote :

This looks great to me. I'll get it into master, 2.3 and 2.2. Thanks, Bob!

Changed in evergreen:
milestone: none → 2.3.0-beta2
milestone: 2.3.0-beta2 → none
Changed in evergreen:
assignee: nobody → Mike Rylander (mrylander)
Mike Rylander (mrylander) wrote :

I've pushed collab/miker/reduce_index_bloat to the working repo for further testing. I reworked the patch a little to reduce the amount of change a little bit. Thanks, Bob!

http://git.evergreen-ils.org/?p=working/Evergreen.git;a=shortlog;h=refs/heads/collab/miker/reduce_index_bloat

tags: added: pullrequest
Changed in evergreen:
assignee: Mike Rylander (mrylander) → nobody
Dan Scott (denials) wrote :
Download full text (3.3 KiB)

A test with just the concerto sample set of 100 bibs shows the following difference:

Before the patch:

evergreen=# SELECT COUNT(*) FROM metabib.title_field_entry;
 count
-------
   448
(1 row)

evergreen=# SELECT COUNT(*) FROM metabib.author_field_entry;
 count
-------
   765
(1 row)

After the patch:

evergreen=# SELECT COUNT(*) FROM metabib.title_field_entry;
 count
-------
   174
(1 row)

evergreen=# SELECT COUNT(*) FROM metabib.author_field_entry;
 count
-------
   161
(1 row)

I started looking at adding a call to the upgrade script that would trim the duplicate fields from an existing 2.2 or 2.3 system that already had dupes, and came up with:

DELETE FROM metabib.title_field_entry
  WHERE id NOT IN (
    SELECT MAX(id) FROM metabib.title_field_entry
    GROUP BY source, field, value, index_vector
  );

But that only eliminated 132 rows - not the 274 I expected. Looking at the results, I see "duplicates" like:

SELECT id, source, field, value FROM metabib.title_field_entry WHERE value = 'Concertos, piano';
 id | source | field | value
-----+--------+-------+------------------
 173 | 45 | 5 | Concertos, piano
 178 | 47 | 5 | Concertos, piano
 265 | 63 | 5 | Concertos, piano
 313 | 72 | 5 | Concertos, piano
(4 rows)

So I thought - hey, don't we want to retain one row for each source? And then, when I recreate the database schema with the duplicate-reduction patch applied, and run the same query, I get:

SELECT id, source, field, value FROM metabib.title_field_entry WHERE value = 'Concertos, piano';
 id | source | field | value
-----+--------+-------+------------------
  68 | 45 | 5 | Concertos, piano
 103 | 63 | 5 | Concertos, piano
 120 | 72 | 5 | Concertos, piano
(3 rows)

.. which seemed _really_ weird. If we want to keep all unique source/field/value rows, shouldn't the "Concertos, piano" from bib #47 be in there too? Of course, the answer is that for bib #47, which has two 730 (uniform title) entries, there is the following entry:

71 | 47 | 5 | Concertos, piano Rapsodie sur un thème de Paganini, piano, orchestra |

... basically, in the function, the raw_text variable gets concatenated with all of the matching values for that metabib field definition, and the final result includes all of the matching text - which is supposed to be added in a single row. Here's what the old (bad) approach created:

SELECT id, source, field, value FROM metabib.title_field_entry WHERE source = 47 and field = 5 ORDER BY id ASC;
 id | source | field | value
-----+--------+-------+----------------------------------------------------------------------
 178 | 47 | 5 | Concertos, piano
 179 | 47 | 5 | Rapsodie sur un thème de Paganini, piano, orchestra
 180 | 47 | 5 | Concertos, piano Rapsodie sur un thème de Paganini, piano, orchestra
(3 rows)

So, refining our clean-up command a bit, we could do the following:

DELETE FROM metabib.title_field_entry
  WHERE id NOT IN (
    SELECT MAX(id) FROM metabib.title_field_entry
    GROUP BY source, field
  );

... and indeed, that del...

Read more...

Dan Scott (denials) wrote :

Note: I'm not 100% sure this is ready for prime time.

If the old, bad approach created:

 id | source | field | value
-----+--------+-------+----------------------------------------------------------------------
 178 | 47 | 5 | Concertos, piano
 179 | 47 | 5 | Rapsodie sur un thème de Paganini, piano, orchestra
 180 | 47 | 5 | Concertos, piano Rapsodie sur un thème de Paganini, piano, orchestra
(3 rows)

and the new approach creates:

 id | source | field | value
-----+--------+-------+----------------------------------------------------------------------
 180 | 47 | 5 | Concertos, piano Rapsodie sur un thème de Paganini, piano, orchestra
(1 rows)

... then it seems like there is a loss of granularity for search purposes (e.g. searching for "starts with Rapsodie").

More testing needed?

Bob Wicksall (bwicksall) wrote :

I think the "old, bad approach" created some additional granularity as a side effect. Is that granularity valid? Is searching for "starts with Rapsodie" a valid search or should it be "starts with Concertos and contains Rapsodie "?

The old approach also created a lot of the following:

field_class | field | source | value

title | 6 | 427639 | Persuasion /
title | 6 | 427639 | Persuasion /
subject | 11 | 427639 | England
subject | 11 | 427639 | England
subject | 11 | 427639 | England
subject | 14 | 427639 | Man-woman relationships
subject | 14 | 427639 | Man-woman relationships
subject | 14 | 427639 | Man-woman relationships

Changed in evergreen:
milestone: none → 2.4.0-alpha
Changed in evergreen:
status: New → Confirmed
Ben Shum (bshum) on 2013-03-03
Changed in evergreen:
milestone: 2.4.0-alpha1 → 2.4.0-beta
Ben Shum (bshum) on 2013-03-17
Changed in evergreen:
milestone: 2.4.0-beta → 2.4.0-rc
importance: Undecided → Medium
Ben Shum (bshum) on 2013-04-27
Changed in evergreen:
milestone: 2.4.0-rc → none
Dan Scott (denials) wrote :

Bob: I _do_ think the granularity has value; the "starts with" in the cases I was looking at referred to individual movements in a symphony, and it's absolutely reasonable to expect that researchers want to search for those songs by name.

That is distinct from the entirely duplicated entries for a single record, which offers no value to researchers and simply bloats the table and indexes.

Dan Scott (denials) wrote :

Over in bug 1187433, I offered up http://git.evergreen-ils.org/?p=working/Evergreen.git;a=shortlog;h=refs/heads/user/dbs/another_metabib_ingest_fix which I believe provides a nice balance between reducing index bloat while maintaining index granularity.

Ben Shum (bshum) on 2013-08-22
no longer affects: evergreen/2.2
Changed in evergreen:
status: Confirmed → Fix Released
To post a comment you must log in.
This report contains Public information  Edit
Everyone can see this information.

Other bug subscribers