InUseExceptions in stress tests

Bug #1076517 reported by Peter Beaman
Affects: Akiban Persistit
Status: Fix Released
Importance: High
Assigned to: Peter Beaman
Milestone: 3.2.2

Bug Description

Akiban Persistit r390

InUseException in stress tests: frequently (about 1/3 of the time) when running stress tests on ec2 instances, one of the tests fails with InUseExceptions on either Tree or Buffer instances. These failures are more frequent on ec2 instances than on live machines, and the difference appears to be that virtualized I/O can sometimes take a significant amount of wall-clock time (up to 30 seconds).

Further, instrumented code shows that this is not a deadlock - it appears to be a livelock.

The following is our working hypothesis on how this happens.

Prior to an earlier bug fix, thread A attempts to get page P from the buffer pool; thread B attempts to get page Q, discovers that it needs to read the page from disk, and, before A gets a claim on the buffer, chooses the buffer containing P to evict and reuse to hold Q. In this scenario, thread A does not detect that the buffer it is waiting for has changed until it can get a claim on the buffer. Further, because the pages being awaited by the two threads are different, an actual deadlock can result.

This bug was described by https://bugs.launchpad.net/akiban-persistit/+bug/1021734. However, the fix isn't quite right.

The original fix was for thread A, while waiting for a claim on the buffer containing page P, to wait for only a short interval. A would then recheck the identity of the page contained in the awaited buffer to verify it still contains P before retrying.

The problem with this is that on a very heavily loaded system (or when stress tests run on an ec2 instance with unreliable I/O times), there can be a livelock: each time thread A times out, it loses its place in the queue and goes back to the end. We think this is the mechanism for the occasional timeouts now seen in stress tests, especially on ec2 instances.
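
To make the retry pattern concrete, here is a minimal Java sketch, not Persistit's actual code, using a fair java.util.concurrent lock and hypothetical names in place of the real buffer latches. It shows how each timeout drops the waiter out of the wait queue, so that under heavy contention the retrying thread can be passed over indefinitely:

    // Sketch only: a fair lock stands in for the buffer latch, and the page
    // address check stands in for "does this buffer still contain page P?".
    import java.util.concurrent.TimeUnit;
    import java.util.concurrent.locks.ReentrantLock;

    class TimedRetryClaim {
        private final ReentrantLock latch = new ReentrantLock(true); // fair queue
        private volatile long currentPageAddress;                    // hypothetical page identity

        boolean claim(long expectedPageAddress, long timeoutMillis) throws InterruptedException {
            while (true) {
                // Timed attempt: on timeout the thread cancels its queue node entirely.
                if (latch.tryLock(timeoutMillis, TimeUnit.MILLISECONDS)) {
                    try {
                        return currentPageAddress == expectedPageAddress;
                    } finally {
                        latch.unlock();
                    }
                }
                // Recheck the page identity, as the original fix required, then retry.
                // Each retry re-enters the fair queue at the tail, which is the
                // starvation mechanism described above.
                if (currentPageAddress != expectedPageAddress) {
                    return false; // the buffer was reused for a different page
                }
            }
        }
    }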

Although it is extremely unlikely that a customer will encounter this, I am marking this bug as High because it is interfering with the goal of getting stress test runs to succeed regularly on ec2 instances.

Peter Beaman (pbeaman)
Changed in akiban-persistit:
assignee: nobody → Peter Beaman (pbeaman)
Peter Beaman (pbeaman) wrote :

Probably related: we have newly identified an actual deadlock. The unit test Bug1017957Test fails frequently and quickly on an Ubuntu VM, which is how we found this mechanism. It is possible that some or all of the phenomena described above are caused by this mechanism, which is described here:

Thread A invokes Exchange#raw_RemoveKeyRangeInternal and finds the key in a page using the "quick delete" option. In this code path the thread uses information from the LevelCache to find page P; immediately before using the LevelCache it verifies, by comparing Tree#getGeneration() to _cacheTreeGeneration, that the LevelCache is safe to use.

Subsequent to that check by Thread A, Thread B performs a structure delete of a key range that spans the key being deleted by Thread A; B's deletion causes page P to be placed on the garbage chain. Thread B completes its deletion and then continues into code that inserts new data and therefore attempts to reallocate page P from the garbage chain.

Or more precisely, Thread B learns the address of page P from the garbage chain, but when it attempts to actually get the buffer, it discovers that Thread A has already latched the page.

Thread A, while attempting to delete the key (which is now already moribund because B deleted it along with its siblings), discovers that the value contains a long record, and therefore calls VolumeStructure#harvestLongRecords. The attempt to deallocate a long record chain by Thread A now runs into a latch that Thread B has already taken on the volume head page.

Deadlock.

The fix for this is small: Thread A must recheck the validity of the LevelCache after latching page P; if the tree has changed, the code needs to release P, abandon the quick-delete path, and use the normal index tree deletion process.
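
As a rough illustration, here is a minimal sketch of that recheck, with hypothetical helper names; the real change belongs in the quick-delete path of Exchange#raw_RemoveKeyRangeInternal:

    // Sketch only: Tree#getGeneration() and the cached generation come from the
    // description above; the method shape and the releaseBuffer helper are
    // hypothetical stand-ins, not Persistit's actual code.
    private boolean quickDelete(Tree tree, long cachedTreeGeneration, Buffer page) {
        // page P has just been latched using information from the LevelCache
        if (tree.getGeneration() != cachedTreeGeneration) {
            // The tree changed after the pre-latch check: page P may already be
            // on the garbage chain. Release it and abandon the quick-delete path.
            releaseBuffer(page);
            return false; // caller falls back to the normal index tree deletion
        }
        // LevelCache is still valid; proceed with the quick delete.
        // ...
        return true;
    }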

description: updated
Peter Beaman (pbeaman) wrote :

A fix for the mechanism described on 2012-11-19 has been proposed. I am not marking this as Fix Committed until we have more experience with stress tests.

Changed in akiban-persistit:
milestone: none → 3.2.2
Peter Beaman (pbeaman) wrote :

We found another mechanism.

A thread is blocked while waiting for a wwDependency to resolve. That is, it is attempting to modify an MVV that also contains an uncommitted change by another concurrent thread. Three threads are required to create this deadlock:

Thread A in a transaction:
- modifies value for key K
- attempts to lock the tree T for another operation

Thread B in a transaction:
- locks tree T in order to perform a structure insert
- attempts to modify the value for key K
- blocks in wwDependency waiting for A to commit or abort

Thread C, not in a transaction, after B locks T but before A attempts to lock T:
- attempts to acquire an exclusive claim on T

Both A and B take reader claims on T and normally do not conflict. That's why this deadlock is extremely rare. The key element here is that Thread C enqueues a writer claim on the AbstractQueuedSynchronizer queue before A attempts to claim the tree. Because the synchronizer implements a fair policy, the writer must be serviced before any subsequently enqueued readers.

And note: a fair policy is required; it is easy to observe starvation when a non-fair policy is used.
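
For reference, the queuing behavior is easy to reproduce outside Persistit. The following self-contained demonstration (ordinary java.util.concurrent classes, not Persistit code) shows a fair ReentrantReadWriteLock blocking a late-arriving reader behind a queued writer while an earlier reader still holds the lock, mirroring the A/B/C scenario above:

    import java.util.concurrent.locks.ReentrantReadWriteLock;

    public class FairQueueDemo {
        public static void main(String[] args) throws InterruptedException {
            ReentrantReadWriteLock lock = new ReentrantReadWriteLock(true); // fair policy

            lock.readLock().lock();                 // "B" holds a reader claim on T

            Thread c = new Thread(() -> {           // "C" enqueues a writer claim
                lock.writeLock().lock();
                lock.writeLock().unlock();
            }, "C-writer");
            c.start();
            Thread.sleep(100);                      // let C park in the queue

            Thread a = new Thread(() -> {           // "A" now asks for a reader claim
                lock.readLock().lock();             // blocks behind the queued writer
                lock.readLock().unlock();
            }, "A-reader");
            a.start();
            Thread.sleep(100);

            System.out.println("A while B still reads: " + a.getState()); // WAITING
            lock.readLock().unlock();               // once B releases, C and then A proceed
            a.join();
            c.join();
        }
    }

In Persistit the reader B never releases: it is parked inside wwDependency waiting for A, which is itself parked behind C, hence the deadlock.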

Fortunately the fix is extremely easy: the implementation of timed wwDependency already backs off to release claims on pages; the mistake was not also releasing the reader claim on the tree.
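
A rough sketch of the shape of that fix follows; every name here is a hypothetical stand-in rather than Persistit's real API, and the point is only the ordering: release the tree's reader claim along with the page claims before the timed wait, then reacquire afterwards:

    // Hypothetical sketch only; names do not correspond to Persistit's internal methods.
    interface Claim {
        void release();
        void reacquire() throws InterruptedException;
    }

    class WwBackoff {
        // Before parking in the timed wwDependency wait, release the claims this
        // thread holds -- including the reader claim on the tree, which is the
        // piece the original code forgot -- so the queued writer and then the
        // other transaction can make progress.
        static void backOff(Claim pageClaim, Claim treeReaderClaim, Runnable timedWait)
                throws InterruptedException {
            pageClaim.release();        // already released before this fix
            treeReaderClaim.release();  // the missing step identified above
            try {
                timedWait.run();        // timed wait for the other transaction to commit or abort
            } finally {
                treeReaderClaim.reacquire();
                pageClaim.reacquire();  // caller then re-reads the MVV and retries
            }
        }
    }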

Peter Beaman (pbeaman) wrote :

Adding to the previous comment:

Thread C needs to be non-transactional because it needs to be performing an operation that requires an exclusive claim on the Tree. Currently there are two such operations: adding an index level (happens very rarely) and a range delete that spans pages. Transactional range deletes are not supported yet, so by definition if C is performing a range delete it is not in a transaction.

However, C could be the CLEANUP_MANAGER removing a left-edge antivalue. Therefore this pattern could occur in normal operation of Akiban Server.

Peter Beaman (pbeaman)
Changed in akiban-persistit:
status: Confirmed → Fix Committed
Peter Beaman (pbeaman) wrote :

We have now had two successive 8 x 8-hour stress test runs (a total of 128 hours of experience) without unexplained InUseExceptions. The one InUseException we did see appears to have been caused by extremely slow I/O on an EC2 instance: a journal flush operation took nearly 60 seconds.

I'm going to mark this bug as Fix Released, but we will continue to watch for recurrences of this issue.

Changed in akiban-persistit:
status: Fix Committed → Fix Released