A test with just the concerto sample set of 100 bibs shows the following difference:
Before the patch:
evergreen=# SELECT COUNT(*) FROM metabib.title_field_entry;
count
-------
448
(1 row)
evergreen=# SELECT COUNT(*) FROM metabib.author_field_entry;
count
-------
765
(1 row)
After the patch:
evergreen=# SELECT COUNT(*) FROM metabib.title_field_entry;
count
-------
174
(1 row)
evergreen=# SELECT COUNT(*) FROM metabib.author_field_entry;
count
-------
161
(1 row)
I started looking at adding a call to the upgrade script that would trim the duplicate fields from an existing 2.2 or 2.3 system that already had dupes, and came up with:
DELETE FROM metabib.title_field_entry
WHERE id NOT IN (
SELECT MAX(id) FROM metabib.title_field_entry
GROUP BY source, field, value, index_vector
);
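As a sanity check on what that statement actually removes, here's a toy run against an in-memory SQLite table (made-up data; `index_vector` omitted since it's a Postgres tsvector and doesn't exist outside the real schema):

```python
import sqlite3

# Toy stand-in for metabib.title_field_entry with one exact duplicate.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE title_field_entry "
    "(id INTEGER PRIMARY KEY, source INTEGER, field INTEGER, value TEXT)"
)
conn.executemany("INSERT INTO title_field_entry VALUES (?, ?, ?, ?)", [
    (1, 45, 5, "Concertos, piano"),
    (2, 45, 5, "Concertos, piano"),           # exact duplicate of id 1
    (3, 45, 5, "Concertos, piano Rapsodie"),  # distinct value, so it survives
])

# Keep only the highest id within each identical (source, field, value) group.
cur = conn.execute("""
    DELETE FROM title_field_entry
    WHERE id NOT IN (
        SELECT MAX(id) FROM title_field_entry
        GROUP BY source, field, value
    )
""")
print(cur.rowcount)  # 1 -- only the exact duplicate is removed
```

So grouping on the value column can only ever collapse rows whose text is byte-for-byte identical.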
But that only eliminated 132 rows - not the 274 I expected. Looking at the results, I see "duplicates" like:
SELECT id, source, field, value FROM metabib.title_field_entry WHERE value = 'Concertos, piano';
id | source | field | value
-----+--------+-------+------------------
173 | 45 | 5 | Concertos, piano
178 | 47 | 5 | Concertos, piano
265 | 63 | 5 | Concertos, piano
313 | 72 | 5 | Concertos, piano
(4 rows)
So I thought - hey, don't we want to retain one row for each source? And then, when I recreate the database schema with the duplicate-reduction patch applied, and run the same query, I get:
SELECT id, source, field, value FROM metabib.title_field_entry WHERE value = 'Concertos, piano';
id | source | field | value
-----+--------+-------+------------------
68 | 45 | 5 | Concertos, piano
103 | 63 | 5 | Concertos, piano
120 | 72 | 5 | Concertos, piano
(3 rows)
... which seemed _really_ weird. If we want to keep all unique source/field/value rows, shouldn't the "Concertos, piano" from bib #47 be in there too? Of course, the answer is that for bib #47, which has two 730 (uniform title) entries, there is instead the following entry:
71 | 47 | 5 | Concertos, piano Rapsodie sur un thème de Paganini, piano, orchestra
... basically, in the function, the raw_text variable accumulates all of the matching values for that metabib field definition, so the final result contains all of the matching text - and that is what is supposed to be inserted, as a single row. Here's what the old (bad) approach created:
SELECT id, source, field, value FROM metabib.title_field_entry WHERE source = 47 and field = 5 ORDER BY id ASC;
id | source | field | value
-----+--------+-------+----------------------------------------------------------------------
178 | 47 | 5 | Concertos, piano
179 | 47 | 5 | Rapsodie sur un thème de Paganini, piano, orchestra
180 | 47 | 5 | Concertos, piano Rapsodie sur un thème de Paganini, piano, orchestra
(3 rows)
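If I'm reading the function right, the accumulation behaves roughly like this sketch (a Python paraphrase; the names are illustrative, not the actual PL/pgSQL variables):

```python
# Rough paraphrase of the indexing function's accumulation loop.
def index_field(values):
    rows = []
    raw_text = ""
    for value in values:
        # The old (bad) behavior inserted a row for each individual value...
        rows.append(value)
        raw_text = value if not raw_text else raw_text + " " + value
    # ...and then the final row with all matching text concatenated,
    # which is the only row that was ever supposed to be inserted.
    rows.append(raw_text)
    return rows

print(index_field([
    "Concertos, piano",
    "Rapsodie sur un thème de Paganini, piano, orchestra",
]))
# Three rows: the two individual values plus the concatenated one,
# matching ids 178/179/180 above.
```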
So, refining our clean-up command a bit, we could do the following:
DELETE FROM metabib.title_field_entry
WHERE id NOT IN (
SELECT MAX(id) FROM metabib.title_field_entry
GROUP BY source, field
);
... and indeed, that deletes 274 rows.
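To convince myself the GROUP BY source, field version behaves as intended, here's the same toy SQLite simulation seeded with the bib #47 rows from above (table and data are illustrative; the real table lives in the metabib schema):

```python
import sqlite3

# Toy stand-in seeded with the three bib #47 rows plus one bib #45 row.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE title_field_entry "
    "(id INTEGER PRIMARY KEY, source INTEGER, field INTEGER, value TEXT)"
)
conn.executemany("INSERT INTO title_field_entry VALUES (?, ?, ?, ?)", [
    (173, 45, 5, "Concertos, piano"),
    (178, 47, 5, "Concertos, piano"),
    (179, 47, 5, "Rapsodie sur un thème de Paganini, piano, orchestra"),
    (180, 47, 5, "Concertos, piano Rapsodie sur un thème de Paganini, "
                 "piano, orchestra"),
])

# Keep only the highest id per (source, field) -- i.e. the final
# concatenated row, which is the one the function meant to insert.
conn.execute("""
    DELETE FROM title_field_entry
    WHERE id NOT IN (
        SELECT MAX(id) FROM title_field_entry
        GROUP BY source, field
    )
""")
remaining = conn.execute(
    "SELECT id FROM title_field_entry ORDER BY id"
).fetchall()
print(remaining)  # [(173,), (180,)] -> exactly one row per source/field pair
```

This relies on the observation that the concatenated row is always inserted last, so MAX(id) per (source, field) picks it out.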
So, after all that learning -- I've signed off on the initial commit, and added another commit that includes index debloating at user/dbs/reduce_index_bloat in working. If that's good, then let's push it!