Evergreen

Incorrect copy status caused by reshelving process colliding with item checkout

Bug #1018011 reported by Bill Erickson on 2012-06-26

This bug affects 6 people

	Status	Importance	Assigned to	Milestone
Evergreen	Fix Released	Medium	Unassigned	Evergreen 3.4-beta1
3.1	Fix Released	Medium	Unassigned	Evergreen 3.1.15
3.2	Fix Released	Medium	Unassigned	Evergreen 3.2.9
3.3	Fix Released	Medium	Unassigned	Evergreen 3.3.4

Bug Description

Evergreen 2.1+

In a busy production system with a large database, the reshelving-complete script can take several minutes to run. This can lead to the following scenario:

copy checked in -> status 7
copy picked up for processing by the reshelving-complete process
copy checked out -> status 1
reshelving-complete process completes -> status 0.

This results in an open circulation on a copy with a status of 0, which apart from just being wrong, prevents renewals of the circulation due to bad copy status.

The occurrence increases the more often the reshelving-complete process is run. For example, running it hourly in a busy system creates many (relatively speaking) opportunities for this to happen.

Tags:

Revision history for this message

Ben Shum (bshum) wrote on 2012-06-27:

Confirmed to occur in our 2.2 systems. Just had a case reported today, but there may be others lurking.

Changed in evergreen:
status:	New → Confirmed
importance:	Undecided → Medium
milestone:	none → 2.2.1

Jason Stephenson (jstephenson) on 2012-08-07

Changed in evergreen:
milestone:	2.2.1 → 2.2.2

Ben Shum (bshum) on 2012-12-21

no longer affects:

evergreen/2.1

Ben Shum (bshum) on 2013-03-17

no longer affects:	evergreen/master
Changed in evergreen:
milestone:	2.4.0-beta → 2.4.0-rc

Ben Shum (bshum) on 2013-04-27

Changed in evergreen:
milestone:	2.4.0-rc → none

Ben Shum (bshum) on 2013-08-22

no longer affects:

evergreen/2.2

Revision history for this message

Tim Spindler (tspindler-cwmars) wrote on 2014-01-30:

It appears we may be seeing this in 2.4.3 version of our system

Bill Erickson (berick) on 2014-10-14

no longer affects:	evergreen/2.4
no longer affects:	evergreen/2.3

Revision history for this message

Blake GH (bmagic) wrote on 2014-10-15:

I found that this query identifies the situation:

select "all",barcode from
(select string_agg(status::text,',' order by audit_id) as "all",barcode from
auditor.asset_copy_history group by barcode)
as aach
where "all" ~ '7,1,0'

Finding items that went from re-shelving to checked out to available

Revision history for this message

Blake GH (bmagic) wrote on 2014-10-15:

Confirmed in 2.6.1

Revision history for this message

Thomas Berezansky (tsbere) wrote on 2014-10-15:

This branch *might* be a suitable fix for this issue. But I don't know how to test it given the nature of the issue itself. As such I am not adding a pullrequest for it, though if someone else comes up with a testing plan feel free.

http://git.evergreen-ils.org/?p=working/Evergreen.git;a=shortlog;h=refs/heads/user/tsbere/lp1018011

Evergreen Bug Maintenance (bugmaster) on 2016-08-16

no longer affects:	evergreen/2.5
no longer affects:	evergreen/2.6
no longer affects:	evergreen/2.7

Revision history for this message

Blake GH (bmagic) wrote on 2017-09-22:

We have been using this branch in production for 3 years, it's working just fine. Thanks Thomas!

Tiffany Little (tslittle) on 2019-03-04

tags:

added: pullrequest

Revision history for this message

Andrea Neiman (aneiman) wrote on 2019-03-04:

Given that this branch is nearly 5 years old, was written for (presumably) 2.6, and was not pullrequested by the original author -- I'm removing the pullrequest tag. Blake or anyone else who is using it in production and/or has testing, obviously feel free to post an updated branch for review.

tags:

removed: pullrequest

Elaine Hardy (ehardy) on 2019-03-07

tags:

added: checkout

Revision history for this message

Jason Boyer (jboyer) wrote on 2019-09-09:

Given the size of the change and Blake's prolonged testing I've pushed this to rel_3_1 through master. It's difficult to say that this will 100% stop this race condition from happening, but it does considerably narrow the window where it may potentially happen.

And of course I forgot to add my signoff to the commits before pushing, but rest assured I certainly *intended* to use git cherry-pick -s ... instead :-/

Changed in evergreen:
status:	Confirmed → Fix Committed

Revision history for this message

Dan Wells (dbw2) wrote on 2019-09-10:

I know this is bug is done now, but seeing this commit go through made me curious about a few things.

1) Does anyone know why we are doing the subselect here at all? Is it performance related?

2) I was a little surprised to learn this query structure isn't atomic OOTB, but I guess makes sense for other more typical cases. Another more overt option here might be using "SELECT ... FOR UPDATE" in the subquery.

No sense it beating this dead horse if the bug is resolved; just throwing these out for later cases.

Dan

Revision history for this message

Mike Rylander (mrylander) wrote on 2019-09-10:

#10

Hi Dan, I'll try to answer your questions.

1) Two basic reasons: because it gives PG the flexibility to plan how it believes it will be fastest, and more generally, doing it with a subselect is almost certainly faster than either supplying a potentially huge IN-list, or looping over copies and doing one at a time from the Perl code.

2) We don't want to use FOR UPDATE because it can block circs (and any other writes) on all target copies for up to the nearly duration of the entire query. The reason the subselect is not necessarily atomic is because the default isolation level in PG (and, I'd bet real money, the isolation level of every Evergreen use of PG) is read-committed. If PG decides to first materialize the result of the subselect then the list of copies may be (and, in the original case sometimes were) stale later on. The status check in the outer WHERE is evaluated against each row from a separate scan of the copy table just as the row is to be updated, and because of read-committed isolation, status may have a different value due to the commit of a circ between subselect materialization and the actual update of a copy. Copies at the "end" of the outer UPDATE scan are more likely to be hit by this.

This does, indeed, completely address the issue because either the circ commits first and the UPDATE sees a status value that it uses to disregard the row, or the UPDATE commits first and the circulation simply overwrites what it does later on.

The one remaining risk is that a copy at the "front" of the list may get "reshelved" (and thus locked by the UPDATE transaction) and the circ will have to wait for the UPDATE to commit before it can move forward and update the row. That can be mitigated in practice by either making the reshelving update fast by running it very often (which is common), or by only ever running it during off-hours when circs are unlikely to happen.

Revision history for this message

Andrea Neiman (aneiman) wrote on 2020-01-08:

#11

Per JBoyer, this was released for 3.1.15, 3.2.9, 3.3.4, and 3.4-beta1; so I've marked it accordingly (with an assist from and thanks to Galen for temporarily reopening milestones).

Changed in evergreen:
milestone:	none → 3.4-beta1
status:	Fix Committed → Fix Released

Report a bug

This report contains Public information

Everyone can see this information.

You are

Subscribing...

Edit bug mail

Other bug subscribers

Remote bug watches

Bug watches keep track of this bug in other bug trackers.