Deadlock detected in nightly stress tests

Bug #1021734 reported by Peter Beaman
Affects: Akiban Persistit
Status: Fix Released
Importance: Critical
Assigned to: Peter Beaman
Milestone: 3.1.8

Bug Description

Deadlock caused the MixtureTxn2 suite to fail a couple of times. Relevant threads:

Stress5 [main] FAILED: com.persistit.exception.InUseException: Thread Thread-205 failed to acquire writer claim on Page 19,103 in volume persistit(/tmp/persistit_tests/persistit) at index 2,773 timestamp=1,428,600,258 status=vdwr1 <CLEANUP_MANAGER> type=Data
        at com.persistit.BufferPool.get(BufferPool.java:745)
        at com.persistit.VolumeStructure.allocPage(VolumeStructure.java:449)
        at com.persistit.Exchange.putLevel(Exchange.java:1783)
        at com.persistit.Exchange.storeInternal(Exchange.java:1532)
        at com.persistit.Exchange.store(Exchange.java:1287)
        at com.persistit.Exchange.store(Exchange.java:2531)
        at com.persistit.stress.unit.Stress5.executeTest(Stress5.java:102)
        at com.persistit.stress.AbstractStressTest.run(AbstractStressTest.java:93)
        at java.lang.Thread.run(Thread.java:662)

Stress1 [main] FAILED: com.persistit.exception.InUseException: Unable to acquire claim on persistit(/tmp/persistit_tests/persistit)
        at com.persistit.VolumeStorageV2.claimHeadBuffer(VolumeStorageV2.java:413)
        at com.persistit.VolumeStructure.allocPage(VolumeStructure.java:423)
        at com.persistit.Exchange.putLevel(Exchange.java:1783)
        at com.persistit.Exchange.storeInternal(Exchange.java:1532)
        at com.persistit.Exchange.store(Exchange.java:1287)
        at com.persistit.Exchange.store(Exchange.java:2531)
        at com.persistit.stress.unit.Stress1.executeTest(Stress1.java:82)
        at com.persistit.stress.AbstractStressTest.run(AbstractStressTest.java:93)
        at java.lang.Thread.run(Thread.java:662)

(Unfortunately, due to a bug in the stress test framework the thread names are incorrect, so we can't tell which of the many stack traces like the second one represents the deadlock partner. The relevant information is that VolumeStructure#allocPage latches the volume head page and then tries to latch some other page, while some other thread holds a latch on that page and wants the volume head page.)
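As a minimal sketch of that lock-order inversion, here is a deliberately deadlocking example using plain JDK locks; these are illustrative stand-ins, not Persistit's actual latch implementation:

    import java.util.concurrent.locks.ReentrantReadWriteLock;

    // Minimal sketch of the lock-order inversion using JDK locks;
    // stand-ins only, not Persistit's actual latch implementation.
    public class LatchOrderDeadlock {
        static final ReentrantReadWriteLock headPage = new ReentrantReadWriteLock();
        static final ReentrantReadWriteLock dataPage = new ReentrantReadWriteLock();

        public static void main(String[] args) {
            // Writer: like allocPage, latches the head page, then a data page.
            new Thread(() -> {
                headPage.writeLock().lock();
                pause(); // widen the race window
                dataPage.writeLock().lock(); // blocks: pruner holds dataPage
            }, "writer").start();

            // Pruner: latches the data page, then wants the head page.
            new Thread(() -> {
                dataPage.writeLock().lock();
                pause();
                headPage.writeLock().lock(); // blocks: writer holds headPage -> deadlock
            }, "pruner").start();
        }

        static void pause() {
            try { Thread.sleep(100); } catch (InterruptedException e) { }
        }
    }

Run as written, both threads block forever; neither latch acquisition can complete because each thread holds the latch the other needs.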

Peter Beaman (pbeaman)
visibility: private → public
Peter Beaman (pbeaman) wrote :

Originally rated this as High due to the relative difficulty of reproducing it. However, the bug mechanism is very simple and could easily occur at a customer site, so I am moving it to Critical. Besides, we need the stress tests to run for days, not hours, and this is a blocker.

Changed in akiban-persistit:
assignee: nobody → Peter Beaman (pbeaman)
importance: High → Critical
Changed in akiban-persistit:
milestone: none → future
Changed in akiban-persistit:
milestone: future → 3.1.8
Peter Beaman (pbeaman) wrote :

Upon further review, the mechanism is not so simple. The page that's unavailable to Thread-205 is owned by the CLEANUP_MANAGER thread. Unfortunately, the state of the CLEANUP_MANAGER thread was not recorded in this report. However, it's likely that CLEANUP_MANAGER is pruning page 19,103 and, in the process, calling VolumeStructure#harvestLongRecords, which in turn attempts to latch the volume head page already owned by Thread-205.
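For context, this is also why the deadlock surfaces as an InUseException in the traces above rather than a permanent hang: the claim acquisition waits only a bounded time. A rough sketch of a timed "claim", assuming a timeout-based acquisition in the spirit of Persistit's latches (the InUseException here is a local stand-in, not com.persistit.exception.InUseException):

    import java.util.concurrent.TimeUnit;
    import java.util.concurrent.locks.ReentrantReadWriteLock;

    // Rough sketch of a timed "claim", assuming timeout-based acquisition.
    // A bounded wait converts a silent deadlock into a diagnosable failure
    // like the InUseException seen in the stress-test output.
    final class TimedClaim {
        private final ReentrantReadWriteLock lock = new ReentrantReadWriteLock();

        void claimWriter(long timeoutMs) throws InUseException {
            try {
                if (!lock.writeLock().tryLock(timeoutMs, TimeUnit.MILLISECONDS)) {
                    throw new InUseException("failed to acquire writer claim");
                }
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                throw new InUseException("interrupted while claiming");
            }
        }

        void release() {
            lock.writeLock().unlock();
        }

        // Local stand-in, not com.persistit.exception.InUseException.
        static class InUseException extends Exception {
            InUseException(String msg) { super(msg); }
        }
    }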

Peter Beaman (pbeaman)
Changed in akiban-persistit:
status: Confirmed → Fix Committed
Peter Beaman (pbeaman) wrote :

Found and committed a fix for a deadlock condition in BufferPool#get(...). At this point it's impossible to prove that this bug was caused by the same mechanism, but it is at least likely. I am marking this "Fix Committed" based on that work. We will be running the stress test suite again nightly, so if there is a different deadlock mechanism it will show up.
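The committed change itself isn't reproduced in this report. As a general pattern, deadlocks of this shape are often broken by giving up the already-held latch when the inner acquisition times out and then retrying; the following is illustrative only, with hypothetical names, and not the actual BufferPool#get change:

    import java.util.concurrent.TimeUnit;
    import java.util.concurrent.locks.ReentrantLock;

    // Illustrative back-off pattern, not the actual BufferPool#get change:
    // if the inner latch cannot be acquired promptly, release the outer
    // latch and retry so the partner thread can make progress.
    public class AllocWithBackoff {
        static final int MAX_RETRIES = 10;
        static final long SHORT_TIMEOUT_MS = 50;

        static boolean allocPage(ReentrantLock headPage, ReentrantLock dataPage)
                throws InterruptedException {
            for (int attempt = 0; attempt < MAX_RETRIES; attempt++) {
                headPage.lock();
                if (dataPage.tryLock(SHORT_TIMEOUT_MS, TimeUnit.MILLISECONDS)) {
                    try {
                        // ... allocate the page while holding both latches ...
                        return true;
                    } finally {
                        dataPage.unlock();
                        headPage.unlock();
                    }
                }
                headPage.unlock(); // back off so the partner can latch the head page
                Thread.yield();
            }
            return false; // caller can surface a timeout error after retries
        }
    }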

Peter Beaman (pbeaman)
Changed in akiban-persistit:
status: Fix Committed → Fix Released