Incorrect copy status caused by reshelving process colliding with item checkout

Bug #1018011 reported by Bill Erickson
28
This bug affects 6 people
Affects Status Importance Assigned to Milestone
Evergreen
Fix Released
Medium
Unassigned
3.1
Fix Released
Medium
Unassigned
3.2
Fix Released
Medium
Unassigned
3.3
Fix Released
Medium
Unassigned

Bug Description

Evergreen 2.1+

In a busy production system with a large database, the reshelving-complete script can take several minutes to run. This can lead to the following scenario:

copy checked in -> status 7
copy picked up for processing by the reshelving-complete process
copy checked out -> status 1
reshelving-complete process completes -> status 0.

This results in an open circulation on a copy with a status of 0, which apart from just being wrong, prevents renewals of the circulation due to bad copy status.

The occurrence increases the more often the reshelving-complete process is run. For example, running it hourly in a busy system creates many (relatively speaking) opportunities for this to happen.

Tags: checkout
Revision history for this message
Ben Shum (bshum) wrote :

Confirmed to occur in our 2.2 systems. Just had a case reported today, but there may be others lurking.

Changed in evergreen:
status: New → Confirmed
importance: Undecided → Medium
milestone: none → 2.2.1
Changed in evergreen:
milestone: 2.2.1 → 2.2.2
Ben Shum (bshum)
no longer affects: evergreen/2.1
Ben Shum (bshum)
no longer affects: evergreen/master
Changed in evergreen:
milestone: 2.4.0-beta → 2.4.0-rc
Ben Shum (bshum)
Changed in evergreen:
milestone: 2.4.0-rc → none
Ben Shum (bshum)
no longer affects: evergreen/2.2
Revision history for this message
Tim Spindler (tspindler-cwmars) wrote :

It appears we may be seeing this in 2.4.3 version of our system

Bill Erickson (berick)
no longer affects: evergreen/2.4
no longer affects: evergreen/2.3
Revision history for this message
Blake GH (bmagic) wrote :

I found that this query identifies the situation:

select "all",barcode from
(select string_agg(status::text,',' order by audit_id) as "all",barcode from
auditor.asset_copy_history group by barcode)
 as aach
 where "all" ~ '7,1,0'

Finding items that went from re-shelving to checked out to available

Revision history for this message
Blake GH (bmagic) wrote :

Confirmed in 2.6.1

Revision history for this message
Thomas Berezansky (tsbere) wrote :

This branch *might* be a suitable fix for this issue. But I don't know how to test it given the nature of the issue itself. As such I am not adding a pullrequest for it, though if someone else comes up with a testing plan feel free.

http://git.evergreen-ils.org/?p=working/Evergreen.git;a=shortlog;h=refs/heads/user/tsbere/lp1018011

no longer affects: evergreen/2.5
no longer affects: evergreen/2.6
no longer affects: evergreen/2.7
Revision history for this message
Blake GH (bmagic) wrote :

We have been using this branch in production for 3 years, it's working just fine. Thanks Thomas!

tags: added: pullrequest
Revision history for this message
Andrea Neiman (aneiman) wrote :

Given that this branch is nearly 5 years old, was written for (presumably) 2.6, and was not pullrequested by the original author -- I'm removing the pullrequest tag. Blake or anyone else who is using it in production and/or has testing, obviously feel free to post an updated branch for review.

tags: removed: pullrequest
Elaine Hardy (ehardy)
tags: added: checkout
Revision history for this message
Jason Boyer (jboyer) wrote :

Given the size of the change and Blake's prolonged testing I've pushed this to rel_3_1 through master. It's difficult to say that this will 100% stop this race condition from happening, but it does considerably narrow the window where it may potentially happen.

And of course I forgot to add my signoff to the commits before pushing, but rest assured I certainly *intended* to use git cherry-pick -s ... instead :-/

Changed in evergreen:
status: Confirmed → Fix Committed
Revision history for this message
Dan Wells (dbw2) wrote :

I know this is bug is done now, but seeing this commit go through made me curious about a few things.

1) Does anyone know why we are doing the subselect here at all? Is it performance related?

2) I was a little surprised to learn this query structure isn't atomic OOTB, but I guess makes sense for other more typical cases. Another more overt option here might be using "SELECT ... FOR UPDATE" in the subquery.

No sense it beating this dead horse if the bug is resolved; just throwing these out for later cases.

Dan

Revision history for this message
Mike Rylander (mrylander) wrote :

Hi Dan, I'll try to answer your questions.

1) Two basic reasons: because it gives PG the flexibility to plan how it believes it will be fastest, and more generally, doing it with a subselect is almost certainly faster than either supplying a potentially huge IN-list, or looping over copies and doing one at a time from the Perl code.

2) We don't want to use FOR UPDATE because it can block circs (and any other writes) on all target copies for up to the nearly duration of the entire query. The reason the subselect is not necessarily atomic is because the default isolation level in PG (and, I'd bet real money, the isolation level of every Evergreen use of PG) is read-committed. If PG decides to first materialize the result of the subselect then the list of copies may be (and, in the original case sometimes were) stale later on. The status check in the outer WHERE is evaluated against each row from a separate scan of the copy table just as the row is to be updated, and because of read-committed isolation, status may have a different value due to the commit of a circ between subselect materialization and the actual update of a copy. Copies at the "end" of the outer UPDATE scan are more likely to be hit by this.

This does, indeed, completely address the issue because either the circ commits first and the UPDATE sees a status value that it uses to disregard the row, or the UPDATE commits first and the circulation simply overwrites what it does later on.

The one remaining risk is that a copy at the "front" of the list may get "reshelved" (and thus locked by the UPDATE transaction) and the circ will have to wait for the UPDATE to commit before it can move forward and update the row. That can be mitigated in practice by either making the reshelving update fast by running it very often (which is common), or by only ever running it during off-hours when circs are unlikely to happen.

Revision history for this message
Andrea Neiman (aneiman) wrote :

Per JBoyer, this was released for 3.1.15, 3.2.9, 3.3.4, and 3.4-beta1; so I've marked it accordingly (with an assist from and thanks to Galen for temporarily reopening milestones).

Changed in evergreen:
milestone: none → 3.4-beta1
status: Fix Committed → Fix Released
To post a comment you must log in.
This report contains Public information  
Everyone can see this information.

Other bug subscribers

Remote bug watches

Bug watches keep track of this bug in other bug trackers.