Extremely slow batch marc loading in 2.2 Vandelay
Affects | Status | Importance | Assigned to | Milestone
---|---|---|---|---
Evergreen | Fix Released | Undecided | Unassigned |
2.2 | Fix Released | Undecided | Unassigned |
2.3 | Fix Released | Undecided | Unassigned |
Bug Description
Using match sets with more than one variable in Vandelay generally results in extremely slow load times against our production database of approx. 1.4 million bibs. Results of our most recent testing are below. Note that C/W MARS at MassLNC shared similar results with us; their database holds approx. 2.1 million bibs. Lebbeous has already weighed in on this problem in our ESI ticket #19674.
It definitely seems that it's the 020a (ISBN) matching that causes the huge slowdown. Our two test files are a 6-record file and a 70-record file.
1) Record Match Set 035a (matching on 035 subfield a)
6 record MARC file = under a second (found 1 match)
70 record MARC file = 5-6 seconds (found 0 matches)
Performance is comparable to 2.0. 035a is a local control number and we don't seem to use it much, so I don't think this is a useful matchpoint for us. I only tested it to compare with the C/W MARS results.
2) Record Match Set 010a (matching on 010 subfield a -- this is LC Number)
6 record MARC file = under a second (found 3 matches)
70 record MARC file = 5-6 seconds (found 40 matches)
Performance is good and comparable to 2.0. We didn't match on this field in 2.0, but we've recommended this to our sites in the interim, until we can sort out the extreme slowness in ISBN matching.
3) Record Match Set 020a (matching on 020 subfield a -- ISBN)
6 record MARC file = 3-4 minutes (found 6 matches)
70 record MARC file = 35 minutes (found 53 matches)
Loading is abysmally slow at about 2 records per minute.
4) Record Match Set 020a035a010a (matching on 020 subfield a OR 035 subfield a OR 010 subfield a)
6 record MARC file = 4 minutes - 1.5 records/minute (found 6 matches)
70 record MARC file = didn't try this because of time
The 70 record file takes about 45-50 minutes
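The per-minute rates quoted above follow directly from the timings. As a rough check, here is a minimal Python sketch; the function name `records_per_minute` and the `timings` table are illustrative only and not part of Evergreen or Vandelay. The 35-minute and 4-minute figures are copied from tests 3 and 4.

```python
def records_per_minute(records: int, minutes: float) -> float:
    """Return import throughput in records per minute."""
    if minutes <= 0:
        raise ValueError("minutes must be positive")
    return records / minutes


# (record count, elapsed minutes) copied from the test results above.
timings = {
    "020a, 70-record file": (70, 35.0),
    "020a OR 035a OR 010a, 6-record file": (6, 4.0),
}

for label, (records, minutes) in timings.items():
    rate = records_per_minute(records, minutes)
    print(f"{label}: {rate:.1f} records/minute")
```

The same helper can be applied to any other timing in this report to compare match sets on a common scale.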
tags: added: pullrequest
Changed in evergreen:
status: Fix Committed → Fix Released
More testing results from C/W MARS, with 2.1 million bibs in the database.
A batch import with 45 records:
010 only: less than 1 minute
020 only: less than 30 seconds
035 only: less than 30 seconds
010, 020, and 035: 8 minutes - only 8 records per minute
To see whether quality metrics impacted the queuing, I queued the same file with a Record Match Set that included quality metrics. They did not.
020, 035, and 022 WITH quality metrics: 8.5 minutes.
A batch import of 219 records (no attached items) matching on 020, 022 and 035 WITH quality metrics. It took 28 minutes to process and 2 minutes to load for a total of 30 minutes: 7 records per minute.
C/W MARS has added an index for matching on the 020, 022 and 024 fields. Imports with a match set matching on 020 have improved since that time. However, C/W MARS is still seeing slow load times when matching on more than one field at a time.