Very long recovery time due to many temporary tree IT records

Bug #1018526 reported by Peter Beaman
8
This bug affects 1 person
Affects Status Importance Assigned to Milestone
Akiban Persistit
High
Peter Beaman

Bug Description

An Akiban Server instance running in an AMI is taking 20+ minutes to start. During this time the CPU is 100% busy doing this:

"main" prio=10 tid=0x00007f4998007000 nid=0x35a runnable [0x00007f49a0449000]
   java.lang.Thread.State: RUNNABLE
        at sun.nio.cs.UTF_8.updatePositions(Unknown Source)
        at sun.nio.cs.UTF_8$Decoder.xflow(Unknown Source)
        at sun.nio.cs.UTF_8$Decoder.decodeArrayLoop(Unknown Source)
        at sun.nio.cs.UTF_8$Decoder.decodeLoop(Unknown Source)
        at java.nio.charset.CharsetDecoder.decode(Unknown Source)
        at java.lang.StringCoding$StringDecoder.decode(Unknown Source)
        at java.lang.StringCoding.decode(Unknown Source)
        at java.lang.String.<init>(Unknown Source)
        at com.persistit.JournalRecord$IT.getTreeName(JournalRecord.java:805)
        at com.persistit.RecoveryManager.scanIdentifyTree(RecoveryManager.java:995)
        at com.persistit.RecoveryManager.scanOneRecord(RecoveryManager.java:918)
        at com.persistit.RecoveryManager.findAndValidateKeystone(RecoveryManager.java:773)
        at com.persistit.RecoveryManager.buildRecoveryPlan(RecoveryManager.java:1332)
        at com.persistit.Persistit.initializeRecovery(Persistit.java:642)
        at com.persistit.Persistit.initialize(Persistit.java:592)
        - locked <0x0000000350492388> (a com.persistit.Persistit)
        at com.persistit.Persistit.initialize(Persistit.java:559)
        at com.akiban.server.service.tree.TreeServiceImpl.start(TreeServiceImpl.java:169)
        - locked <0x000000035048dae8> (a com.akiban.server.service.tree.TreeServiceImpl)
        at com.akiban.server.service.servicemanager.GuicedServiceManager$2.onStart(GuicedServiceManager.java:278)
        at com.akiban.server.service.servicemanager.GuicedServiceManager$2.onStart(GuicedServiceManager.java:272)
        at com.akiban.server.service.servicemanager.Guicer.startServiceIfApplicable(Guicer.java:281)
        at com.akiban.server.service.servicemanager.Guicer.startService(Guicer.java:259)
        - locked <0x00000003504871f0> (a java.util.Collections$SynchronizedSet)
        at com.akiban.server.service.servicemanager.Guicer.get(Guicer.java:78)
        at com.akiban.server.service.servicemanager.GuicedServiceManager.startServices(GuicedServiceManager.java:68)
        at com.akiban.server.AkServer.main(AkServer.java:175)

Upon further investigation we found that the journal file contains over 300Mbyte of IT (Identify Tree) records, almost entirely trees created in temporary volumes during sort operations. There are 8,406,774 IT records in the journal.

Temporary trees and volumes should not identified in the journal at all.

Further, this system is exhibiting the behavior predicted in https://bugs.launchpad.net/akiban-persistit/+bug/956333.

Related branches

Peter Beaman (pbeaman)
Changed in akiban-persistit:
assignee: nobody → Peter Beaman (pbeaman)
description: updated
Peter Beaman (pbeaman)
Changed in akiban-persistit:
status: Confirmed → In Progress
Peter Beaman (pbeaman)
Changed in akiban-persistit:
status: In Progress → Fix Committed
visibility: private → public
Revision history for this message
Peter Beaman (pbeaman) wrote :

The proposed fix incorporates logic to clean up existing databases. The first startup recovery on a site having this problem using the modified version of Persistit will delete all the temporary trees from the tree map. The next journal file rollover event after that will write a tree map without the temporary tree IT records, and subsequent to that they will not reappear.

Changed in akiban-persistit:
milestone: none → 3.1.2
Peter Beaman (pbeaman)
Changed in akiban-persistit:
status: Fix Committed → Fix Released
To post a comment you must log in.
This report contains Public information  Edit
Everyone can see this information.

Other bug subscribers