InUseExceptions in stress tests
Affects | Status | Importance | Assigned to | Milestone
---|---|---|---|---
Akiban Persistit | Fix Released | High | Peter Beaman |
Bug Description
Akiban Persistit r390
InUseException in stress tests: frequently (about 1/3 of the time) when running stress tests on ec2 instances, one of the tests fails with InUseExceptions on either Tree or Buffer instances. These failures are more frequent on ec2 instances than on non-virtualized machines, and the difference appears to be that virtualized I/O can sometimes take a significant amount of wall-clock time (up to 30 seconds).
Further, instrumented code shows that this is not a deadlock - it appears to be a livelock.
The following is our working hypothesis on how this happens.
Prior to an earlier bug fix, the following race was possible: thread A attempts to get page P from the buffer pool; thread B attempts to get page Q, discovers that it needs to read the page from disk, and, before A gets a claim on the buffer, chooses the buffer containing P to evict and reuse to hold Q. In this scenario, thread A does not detect that the buffer it is waiting for has changed until it can get a claim on the buffer. Further, because the identities of the pages being awaited by the two threads are different, an actual deadlock can result.
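The race can be sketched as follows. This is a minimal, hypothetical model, not the real Persistit Buffer class: the waiter must re-verify, after acquiring its claim, that the buffer still holds the page it originally looked up, because eviction may have swapped the page in while it was waiting.

```java
import java.util.concurrent.locks.ReentrantLock;

// Hypothetical, simplified model of a buffer-pool entry; the real
// Persistit Buffer is far more involved.
class Buffer {
    final ReentrantLock claim = new ReentrantLock();
    volatile long pageAddress;          // page currently held by this buffer
    Buffer(long pageAddress) { this.pageAddress = pageAddress; }
}

public class BufferClaimSketch {
    // Acquire a claim on `buffer`, but only accept it if the buffer still
    // holds the page we originally looked up; eviction and reuse may have
    // swapped in a different page while we were waiting for the claim.
    static boolean claimIfStillPage(Buffer buffer, long expectedPage)
            throws InterruptedException {
        buffer.claim.lockInterruptibly();
        if (buffer.pageAddress != expectedPage) {
            buffer.claim.unlock();      // buffer was reused for another page
            return false;               // caller must redo the pool lookup
        }
        return true;                    // claim held on the right page
    }

    public static void main(String[] args) throws InterruptedException {
        Buffer b = new Buffer(42L);
        System.out.println(claimIfStillPage(b, 42L)); // true: still page 42
        b.claim.unlock();
        b.pageAddress = 99L;            // simulate eviction + reuse
        System.out.println(claimIfStillPage(b, 42L)); // false: redo lookup
    }
}
```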
This bug was described by https:/
The original fix was for thread A, while waiting for a claim on the buffer containing page P, to wait for only a short interval. A would then recheck the identity of the page contained in the awaited buffer to verify it still contains P before retrying.
The problem with this is that on a very heavily loaded system (or when stress tests run on an ec2 instance with unreliable I/O times), there can be a livelock. Each time thread A times out, it loses its place in the queue and goes back to the end. We think this is the mechanism behind the occasional timeouts now seen in stress tests, especially on ec2 instances.
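The timed-wait-and-recheck loop looks roughly like the sketch below (hypothetical names; the real Persistit code path differs). Even with a fair lock, a waiter whose timed `tryLock` expires leaves the wait queue and rejoins at the tail on the next attempt, behind newly arrived threads, which is how starvation becomes possible under sustained load.

```java
import java.util.concurrent.TimeUnit;
import java.util.concurrent.locks.ReentrantLock;

// Sketch of the retry loop described above. A fair lock queues waiters in
// arrival order, but a timed tryLock that gives up re-enters the queue at
// the tail on retry; on a saturated lock the loop may spin indefinitely.
public class TimedRetrySketch {
    static final ReentrantLock claim = new ReentrantLock(true); // fair lock

    // Returns true once the claim is acquired and the page identity still
    // matches; returns false if the buffer was reused for another page.
    static boolean claimWithRecheck(long expectedPage,
                                    java.util.function.LongSupplier currentPage)
            throws InterruptedException {
        while (true) {
            if (claim.tryLock(10, TimeUnit.MILLISECONDS)) {
                if (currentPage.getAsLong() == expectedPage) {
                    return true;          // success: claim held on page P
                }
                claim.unlock();
                return false;             // page changed: redo the lookup
            }
            // Timed out: we have left the lock's wait queue. On the next
            // iteration we rejoin at the tail, behind any newly arrived
            // threads; under heavy load this can repeat forever (livelock).
            if (currentPage.getAsLong() != expectedPage) {
                return false;             // buffer was reused while waiting
            }
        }
    }

    public static void main(String[] args) throws InterruptedException {
        System.out.println(claimWithRecheck(7L, () -> 7L)); // true: claim held
        claim.unlock();
        System.out.println(claimWithRecheck(7L, () -> 8L)); // false: page changed
    }
}
```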
Although it is extremely unlikely that a customer will encounter this, I am marking it as HIGH because it is interfering with the goal of getting stress test runs to succeed regularly on ec2 instances.
Related branches
- Nathan Williams: Approve
  Diff: 52 lines (+11/-6), 2 files modified
  src/main/java/com/persistit/Exchange.java (+8/-5)
  src/test/java/com/persistit/DumpTaskTest.java (+3/-1)
- Nathan Williams: Approve
  Diff: 118 lines (+17/-11), 3 files modified
  src/main/java/com/persistit/Exchange.java (+9/-2)
  src/main/java/com/persistit/VolumeStructure.java (+3/-4)
  src/test/java/com/persistit/stress/unit/AccumulatorRestart.java (+5/-5)
Changed in akiban-persistit:
assignee: nobody → Peter Beaman (pbeaman)
Changed in akiban-persistit:
status: Confirmed → Fix Committed
Probably related: we have newly identified an actual deadlock. The unit test Bug1017957Test fails frequently and quickly on an Ubuntu VM, which is how we found this mechanism. It is possible that some or all phenomena described above are caused by this mechanism, which is described here:
Thread A invokes Exchange#raw_RemoveKeyRangeInternal and finds the key in a page using the "quick delete" option. In this code path the thread has used information from the LevelCache to find the page P; immediately before using the LevelCache the thread verified, by comparing Tree#getGeneration() to _cacheTreeGeneration, that the LevelCache was safe to use.
Subsequent to that check by Thread A, Thread B performs a structure delete of a key range that spans the key being deleted by Thread A; B's deletion causes page P to be placed on the garbage chain. Thread B completes its deletion and then continues into code that inserts new data, and which therefore attempts to reallocate page P from the garbage chain.
Or more precisely, Thread B learns the address of page P from the garbage chain, but when it attempts to actually get the buffer, discovers that thread A has already latched the page.
Thread A, while attempting to delete the key (which is now already moribund because B deleted it along with its siblings), discovers that the value contains a long record, and therefore calls VolumeStructure#harvestLongRecords. The attempt by Thread A to deallocate the long-record chain now runs into a latch that Thread B has already taken on the volume head page.
Deadlock.
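The cycle above is a classic lock-order inversion, which can be modeled deterministically with two locks (hypothetical names; the real Persistit latches are not ReentrantLocks). Each thread takes its first latch, and then a non-blocking `tryLock` on the other latch shows that each would block forever waiting for the other.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.locks.ReentrantLock;

// Minimal model of the lock-order inversion described above:
//   Thread A: holds the page-P latch, then needs the volume-head latch
//             (to deallocate the long-record chain).
//   Thread B: holds the volume-head latch, then needs the page-P latch
//             (to reallocate P from the garbage chain).
public class DeadlockSketch {
    static final ReentrantLock pageLatch = new ReentrantLock();
    static final ReentrantLock volumeHeadLatch = new ReentrantLock();

    // Drives both threads to the point where each holds its first latch
    // and has tried (and failed) to take the other's; returns the outcomes.
    static boolean[] wouldDeadlock() throws InterruptedException {
        CountDownLatch bothHoldFirst = new CountDownLatch(2);
        CountDownLatch bothAttempted = new CountDownLatch(2);
        boolean[] blocked = new boolean[2];   // [A blocked, B blocked]

        Thread a = new Thread(() -> {
            pageLatch.lock();                 // A: latch page P first
            try {
                bothHoldFirst.countDown();
                bothHoldFirst.await();        // wait until B holds its latch
                blocked[0] = !volumeHeadLatch.tryLock();  // A needs volume head
                if (!blocked[0]) volumeHeadLatch.unlock();
                bothAttempted.countDown();
                bothAttempted.await();
            } catch (InterruptedException ignored) {
            } finally {
                pageLatch.unlock();
            }
        });
        Thread b = new Thread(() -> {
            volumeHeadLatch.lock();           // B: latch volume head first
            try {
                bothHoldFirst.countDown();
                bothHoldFirst.await();        // wait until A holds its latch
                blocked[1] = !pageLatch.tryLock();        // B needs page P
                if (!blocked[1]) pageLatch.unlock();
                bothAttempted.countDown();
                bothAttempted.await();
            } catch (InterruptedException ignored) {
            } finally {
                volumeHeadLatch.unlock();
            }
        });
        a.start(); b.start();
        a.join(); b.join();
        return blocked;
    }

    public static void main(String[] args) throws InterruptedException {
        boolean[] r = wouldDeadlock();
        System.out.println("A would block on volume head: " + r[0]);
        System.out.println("B would block on page P: " + r[1]);
    }
}
```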
The fix for this is small: Thread A must recheck the validity of the LevelCache after latching page P; if the tree has changed, the code must release P, abandon the quick-delete path, and use the normal index tree deletion process.
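The essence of the fix can be sketched as a generation recheck performed after the latch is acquired (simplified, hypothetical names; not the actual Exchange code). The key point is that the check in L35 happened before the latch, so it must be repeated once the latch is held.

```java
import java.util.concurrent.atomic.AtomicLong;
import java.util.concurrent.locks.ReentrantLock;

// Sketch of the fix: after latching the page found via the LevelCache,
// re-verify that the tree generation has not moved. If it has, the cached
// page location may be stale (the page could even be on the garbage
// chain), so release the latch and fall back to the normal delete path.
public class QuickDeleteSketch {
    static final AtomicLong treeGeneration = new AtomicLong();
    static final ReentrantLock pageLatch = new ReentrantLock();

    // Returns true if the quick-delete path may proceed with the latch
    // held; false means the caller must abandon the quick-delete path and
    // use the full top-down index tree delete instead.
    static boolean latchForQuickDelete(long cachedTreeGeneration) {
        pageLatch.lock();
        if (treeGeneration.get() != cachedTreeGeneration) {
            pageLatch.unlock();   // LevelCache is stale: abandon quick delete
            return false;
        }
        return true;              // latch held and cache still valid
    }

    public static void main(String[] args) {
        long cached = treeGeneration.get();
        System.out.println(latchForQuickDelete(cached));  // true: proceed
        pageLatch.unlock();
        treeGeneration.incrementAndGet();  // concurrent structure change
        System.out.println(latchForQuickDelete(cached));  // false: fall back
    }
}
```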