Stress tests occasionally fail spectacularly

Bug #1017957 reported by Peter Beaman
Affects: Akiban Persistit
Status: Fix Released
Importance: Critical
Assigned to: Peter Beaman
Milestone: 3.1.2

Bug Description

During the past week the 8-hour stress test suite has generated several CorruptVolumeExceptions and other related phenomena. Examples:

Stress6 [main] FAILED: com.persistit.exception.CorruptVolumeException: Volume persistit(/tmp/persistit_tests/persistit) level=0 page=15684 initialPage=57164 key=<{"stress6",98,5,"xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx"}> walked right more than 50 pages last page visited=81324
        at com.persistit.Exchange.corrupt(Exchange.java:3884)
        at com.persistit.Exchange.searchLevel(Exchange.java:1250)
        at com.persistit.Exchange.searchTree(Exchange.java:1125)
        at com.persistit.Exchange.storeInternal(Exchange.java:1443)
        at com.persistit.Exchange.store(Exchange.java:1294)
        at com.persistit.Exchange.store(Exchange.java:2534)
        at com.persistit.stress.unit.Stress6.executeTest(Stress6.java:98)
        at com.persistit.stress.AbstractStressTest.run(AbstractStressTest.java:93)
        at java.lang.Thread.run(Thread.java:662)

Stress2txn [main] FAILED: com.persistit.exception.RebalanceException
        at com.persistit.Buffer.join(Buffer.java:2523)
        at com.persistit.Exchange.raw_removeKeyRangeInternal(Exchange.java:3367)
        at com.persistit.Exchange.removeKeyRangeInternal(Exchange.java:3070)
        at com.persistit.Exchange.removeInternal(Exchange.java:2999)
        at com.persistit.Exchange.remove(Exchange.java:2927)
        at com.persistit.stress.unit.Stress2txn.executeTest(Stress2txn.java:231)
        at com.persistit.stress.AbstractStressTest.run(AbstractStressTest.java:93)
        at java.lang.Thread.run(Thread.java:662)

Stress2txn [main] FAILED: com.persistit.exception.CorruptVolumeException: LONG_RECORD chain is invalid at page 111919 - invalid page type: Page 111,919 in volume persistit(/tmp/persistit_tests/persistit) at index 1,559 timestamp=909,787,072 status=vr1 type=Data
        at com.persistit.LongRecordHelper.corrupt(LongRecordHelper.java:243)
        at com.persistit.LongRecordHelper.fetchLongRecord(LongRecordHelper.java:103)
        at com.persistit.Exchange.fetchFixupForLongRecords(Exchange.java:2841)
        at com.persistit.Exchange.fetchFromValueInternal(Exchange.java:2778)
        at com.persistit.Exchange.fetchFromBufferInternal(Exchange.java:2747)
        at com.persistit.Exchange.traverse(Exchange.java:2157)
        at com.persistit.Exchange.traverse(Exchange.java:1960)
        at com.persistit.Exchange.traverse(Exchange.java:1897)
        at com.persistit.Exchange.next(Exchange.java:2330)
        at com.persistit.stress.unit.Stress2txn.executeTest(Stress2txn.java:188)
        at com.persistit.stress.AbstractStressTest.run(AbstractStressTest.java:93)
        at java.lang.Thread.run(Thread.java:662)

Peter Beaman (pbeaman) wrote :

We have the broken volume and journal files but they are too large to store in Launchpad.

Peter Beaman (pbeaman)
Changed in akiban-persistit:
status: New → Confirmed
importance: Undecided → High
Peter Beaman (pbeaman) wrote :

More information. There are four test failures in which a tree appears to be corrupt; we have the volume and journal for each of them. For two of these we can find the root cause of the failures with the integrity checker. Both of these cases are similar and I suspect a common bug mechanism for all four.

In this comment I will paste the details for one of these. The files are currently on donald .../_failed_Mixture3_20120625170544. Mixture3 runs 140 concurrent threads but has no transactions. Thus this failure is unlikely to be related to MVV handling or pruning.

The reported failure during the stress test is:

Stress6 [main] FAILED: com.persistit.exception.CorruptVolumeException: Volume persistit(/tmp/persistit_tests/persistit) level=0 page=15684 initialPage=57164 key=<{"stress6",98,5,"xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx"}> walked right more than 50 pages last page visited=81324
        at com.persistit.Exchange.corrupt(Exchange.java:3884)
        at com.persistit.Exchange.searchLevel(Exchange.java:1250)
        at com.persistit.Exchange.searchTree(Exchange.java:1125)
        at com.persistit.Exchange.storeInternal(Exchange.java:1443)
        at com.persistit.Exchange.store(Exchange.java:1294)
        at com.persistit.Exchange.store(Exchange.java:2534)
        at com.persistit.stress.unit.Stress6.executeTest(Stress6.java:98)
        at com.persistit.stress.AbstractStressTest.run(AbstractStressTest.java:93)
        at java.lang.Thread.run(Thread.java:662)

This exception is entirely consistent with the state of the tree detected by IntegrityCheck.

Analysis of the IntegrityCheck follows:

Volume,Tree,Faults,IndexPages,IndexBytes,DataPages,DataBytes,LongRecordPages,LongRecordBytes,MvvPages,MvvRecords,MvvOverhead,MvvAntiValues,IndexHoles,PrunedPages
"persistit","shared",4,34,330100,18394,158695628,0,0,0,0,0,0,853,0
"*","*",4,34,330100,18394,158695628,0,0,0,0,0,0,853,0

  (Note: no MVV pages, records or overhead, consistent with there being no transactions in Mixture3.)
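
When several failed runs need triage, a summary like the one above can be scanned mechanically. The following is only a sketch: the class and method names are hypothetical, the naive comma split assumes no field contains an embedded comma, and treating the "*","*" row as a totals row is an assumption based on its values matching the single tree row.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper, not part of Persistit: lists trees with a non-zero Faults count.
final class IntegrityCsvSketch {
    // The CSV summary quoted in this comment.
    static final List<String> SAMPLE = Arrays.asList(
        "Volume,Tree,Faults,IndexPages,IndexBytes,DataPages,DataBytes,LongRecordPages,"
            + "LongRecordBytes,MvvPages,MvvRecords,MvvOverhead,MvvAntiValues,IndexHoles,PrunedPages",
        "\"persistit\",\"shared\",4,34,330100,18394,158695628,0,0,0,0,0,0,853,0",
        "\"*\",\"*\",4,34,330100,18394,158695628,0,0,0,0,0,0,853,0");

    static List<String> faultyTrees(List<String> csvLines) {
        String[] header = csvLines.get(0).split(",");
        int faults = indexOf(header, "Faults");
        int holes = indexOf(header, "IndexHoles");
        List<String> result = new ArrayList<>();
        for (String line : csvLines.subList(1, csvLines.size())) {
            String[] cols = line.split(","); // assumption: no embedded commas
            String volume = cols[0].replace("\"", "");
            String tree = cols[1].replace("\"", "");
            if ("*".equals(volume)) {
                continue; // assumption: "*","*" is the totals row
            }
            if (Long.parseLong(cols[faults]) > 0) {
                result.add(volume + ":" + tree + " faults=" + cols[faults]
                    + " indexHoles=" + cols[holes]);
            }
        }
        return result;
    }

    private static int indexOf(String[] header, String name) {
        for (int i = 0; i < header.length; i++) {
            if (header[i].equals(name)) {
                return i;
            }
        }
        throw new IllegalArgumentException("missing column " + name);
    }

    public static void main(String[] args) {
        for (String entry : faultyTrees(SAMPLE)) {
            System.out.println(entry);
        }
    }
}
```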

  Tree persistit:shared Invalid right sibling address in page 81,326 after walking right 719 (path 9,835->110,879->75,079) depth=3

(because index record at page 110,879 offset 740 is bogus. Page 75,079 has a right pointer to 56,164. 57,164 is an empty page.)

  Tree persistit:shared left sibling final key is less than parent key (path 9,835->110,879->57,164) depth=3

(same issue)

  Tree persistit:shared Invalid right sibling address in page 43,595 after walking right 430 (path 9,835->110,879->57,164) depth=3

(same issue - pointer to page 57,164 is bogus, therefore its right pointer chain is also bogus relative to the index page)

  Tree persistit:shared left sibling final key is less than parent key (path 9,835->110,879->56,519) depth=3

(yeah, same thing here, too.)

  Tree persistit:shared has 853 unindexed pages

(probably not - report is most likely due to walking pages from 57,164 that aren't really part of the tree)
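
To make the connection between a bogus index pointer and the "walked right more than 50 pages" exception concrete, here is a minimal sketch of a bounded right-sibling walk. It is not Persistit's code: the Page model, the integer keys and findPage are hypothetical, and only the limit of 50 is taken from the exception message. An index pointer that is slightly stale is repaired by a short walk to the right; one that points far to the left of the correct page exhausts the limit and is reported as corruption.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical model of a sibling-linked page chain, not Persistit's implementation.
final class WalkRightSketch {
    static final int MAX_WALK_RIGHT = 50; // limit taken from the exception message

    static final class Page {
        final int maxKey;        // largest key stored on this page
        final long rightSibling; // 0 means no right sibling
        Page(int maxKey, long rightSibling) {
            this.maxKey = maxKey;
            this.rightSibling = rightSibling;
        }
    }

    // Starting from the page an index record points to, walk right until the key fits.
    static long findPage(Map<Long, Page> pages, long initialPage, int key) {
        long current = initialPage;
        int walked = 0;
        while (true) {
            Page page = pages.get(current);
            if (page == null) {
                throw new IllegalStateException("missing page " + current);
            }
            if (key <= page.maxKey || page.rightSibling == 0) {
                return current;
            }
            if (walked == MAX_WALK_RIGHT) {
                throw new IllegalStateException("walked right more than " + MAX_WALK_RIGHT
                    + " pages last page visited=" + current);
            }
            current = page.rightSibling;
            walked++;
        }
    }

    // Builds pages 1..count; page i holds keys up to i * 10 and links to page i + 1.
    static Map<Long, Page> buildChain(int count) {
        Map<Long, Page> pages = new HashMap<>();
        for (long i = 1; i <= count; i++) {
            pages.put(i, new Page((int) (i * 10), i < count ? i + 1 : 0));
        }
        return pages;
    }

    public static void main(String[] args) {
        Map<Long, Page> pages = buildChain(100);
        System.out.println(findPage(pages, 98, 995)); // slightly stale pointer: short walk
        try {
            findPage(pages, 1, 995); // bogus pointer: far left of the correct page
        } catch (IllegalStateException e) {
            System.out.println(e.getMessage());
        }
    }
}
```

In this model a pointer that lands on an unrelated or empty page behaves like the bogus record at page 110,879: the walk either runs past the limit or follows a chain that no longer corresponds to the index level above it.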

Relevant sections of the pages of interest follow:

The index page above all the trouble:

Page 110,879 in volume persistit(./persistit) at index @17,969 status v type Index1
  type=2 alloc=4,916 slack=0 keyBlockStart=32 keyBlockEnd=78...


Peter Beaman (pbeaman)
Changed in akiban-persistit:
importance: High → Critical
Peter Beaman (pbeaman) wrote :

We were able to isolate the failure and recreate it in seconds with a new class, Bug1017957Test. With that we were able to find and fix the code paths causing the bug.

Changed in akiban-persistit:
status: Confirmed → Fix Committed
Changed in akiban-persistit:
milestone: none → 3.1.2
visibility: private → public
Changed in akiban-persistit:
assignee: nobody → Peter Beaman (pbeaman)
Peter Beaman (pbeaman)
Changed in akiban-persistit:
status: Fix Committed → Fix Released