A test with just the concerto sample set of 100 bibs shows the following difference:
Before the patch:
evergreen=# SELECT COUNT(*) FROM metabib.title_field_entry;
count
-------
448
(1 row)
evergreen=# SELECT COUNT(*) FROM metabib.author_field_entry;
count
-------
765
(1 row)
After the patch:
evergreen=# SELECT COUNT(*) FROM metabib.title_field_entry;
count
-------
174
(1 row)
evergreen=# SELECT COUNT(*) FROM metabib.author_field_entry;
count
-------
161
(1 row)
I started looking at adding a call to the upgrade script that would trim the duplicate fields from an existing 2.2 or 2.3 system that already had dupes, and came up with:
DELETE FROM metabib.title_field_entry
WHERE id NOT IN (
SELECT MAX(id) FROM metabib.title_field_entry
GROUP BY source, field, value, index_vector
);
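As a sanity check on what that statement actually removes, here's a toy run against an in-memory SQLite table (made-up data; `index_vector` omitted since it's a Postgres tsvector and doesn't exist outside the real schema):

```python
import sqlite3

# Toy stand-in for metabib.title_field_entry with one exact duplicate.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE title_field_entry "
    "(id INTEGER PRIMARY KEY, source INTEGER, field INTEGER, value TEXT)"
)
conn.executemany("INSERT INTO title_field_entry VALUES (?, ?, ?, ?)", [
    (1, 45, 5, "Concertos, piano"),
    (2, 45, 5, "Concertos, piano"),           # exact duplicate of id 1
    (3, 45, 5, "Concertos, piano Rapsodie"),  # distinct value, so it survives
])

# Keep only the highest id within each identical (source, field, value) group.
cur = conn.execute("""
    DELETE FROM title_field_entry
    WHERE id NOT IN (
        SELECT MAX(id) FROM title_field_entry
        GROUP BY source, field, value
    )
""")
print(cur.rowcount)  # 1 -- only the exact duplicate is removed
```

So grouping on the value column can only ever collapse rows whose text is byte-for-byte identical.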
But that only eliminated 132 rows - not the 274 I expected. Looking at the results, I see "duplicates" like:
SELECT id, source, field, value FROM metabib.title_field_entry WHERE value = 'Concertos, piano';
id | source | field | value
-----+--------+-------+------------------
173 | 45 | 5 | Concertos, piano
178 | 47 | 5 | Concertos, piano
265 | 63 | 5 | Concertos, piano
313 | 72 | 5 | Concertos, piano
(4 rows)
So I thought - hey, don't we want to retain one row for each source? And then, when I recreate the database schema with the duplicate-reduction patch applied, and run the same query, I get:
SELECT id, source, field, value FROM metabib.title_field_entry WHERE value = 'Concertos, piano';
id | source | field | value
-----+--------+-------+------------------
68 | 45 | 5 | Concertos, piano
103 | 63 | 5 | Concertos, piano
120 | 72 | 5 | Concertos, piano
(3 rows)
... which seemed _really_ weird. If we want to keep all unique source/field/value rows, shouldn't the "Concertos, piano" from bib #47 be in there too? Of course, the answer is that for bib #47, which has two 730 (uniform title) entries, there is instead the following entry:
71 | 47 | 5 | Concertos, piano Rapsodie sur un thème de Paganini, piano, orchestra
... basically, in the function, the raw_text variable accumulates all of the matching values for that metabib field definition, so the final result contains all of the matching text - and that is what is supposed to be inserted, as a single row. Here's what the old (bad) approach created:
SELECT id, source, field, value FROM metabib.title_field_entry WHERE source = 47 and field = 5 ORDER BY id ASC;
id | source | field | value
-----+--------+-------+----------------------------------------------------------------------
178 | 47 | 5 | Concertos, piano
179 | 47 | 5 | Rapsodie sur un thème de Paganini, piano, orchestra
180 | 47 | 5 | Concertos, piano Rapsodie sur un thème de Paganini, piano, orchestra
(3 rows)
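If I'm reading the function right, the accumulation behaves roughly like this sketch (a Python paraphrase; the names are illustrative, not the actual PL/pgSQL variables):

```python
# Rough paraphrase of the indexing function's accumulation loop.
def index_field(values):
    rows = []
    raw_text = ""
    for value in values:
        # The old (bad) behavior inserted a row for each individual value...
        rows.append(value)
        raw_text = value if not raw_text else raw_text + " " + value
    # ...and then the final row with all matching text concatenated,
    # which is the only row that was ever supposed to be inserted.
    rows.append(raw_text)
    return rows

print(index_field([
    "Concertos, piano",
    "Rapsodie sur un thème de Paganini, piano, orchestra",
]))
# Three rows: the two individual values plus the concatenated one,
# matching ids 178/179/180 above.
```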
So, refining our clean-up command a bit, we could do the following:
DELETE FROM metabib.title_field_entry
WHERE id NOT IN (
SELECT MAX(id) FROM metabib.title_field_entry
GROUP BY source, field
);
... and indeed, that deletes 274 rows.
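To convince myself the GROUP BY source, field version behaves as intended, here's the same toy SQLite simulation seeded with the bib #47 rows from above (table and data are illustrative; the real table lives in the metabib schema):

```python
import sqlite3

# Toy stand-in seeded with the three bib #47 rows plus one bib #45 row.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE title_field_entry "
    "(id INTEGER PRIMARY KEY, source INTEGER, field INTEGER, value TEXT)"
)
conn.executemany("INSERT INTO title_field_entry VALUES (?, ?, ?, ?)", [
    (173, 45, 5, "Concertos, piano"),
    (178, 47, 5, "Concertos, piano"),
    (179, 47, 5, "Rapsodie sur un thème de Paganini, piano, orchestra"),
    (180, 47, 5, "Concertos, piano Rapsodie sur un thème de Paganini, "
                 "piano, orchestra"),
])

# Keep only the highest id per (source, field) -- i.e. the final
# concatenated row, which is the one the function meant to insert.
conn.execute("""
    DELETE FROM title_field_entry
    WHERE id NOT IN (
        SELECT MAX(id) FROM title_field_entry
        GROUP BY source, field
    )
""")
remaining = conn.execute(
    "SELECT id FROM title_field_entry ORDER BY id"
).fetchall()
print(remaining)  # [(173,), (180,)] -> exactly one row per source/field pair
```

This relies on the observation that the concatenated row is always inserted last, so MAX(id) per (source, field) picks it out.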
So, after all that learning -- I've signed off on the initial commit, and added another commit that includes index debloating at user/dbs/reduce_index_bloat in working. If that's good, then let's push it!