Missing txn-revnos in mongodb leads to missing status updates

Bug #1666396 reported by Paul Gear
This bug affects 1 person
Affects    Status     Importance  Assigned to  Milestone
juju-core  Won't Fix  Undecided   Unassigned

Bug Description

Today I found a third occurrence of the missing txn-revno fields in MongoDB that were described in bug #1484105 (https://bugs.launchpad.net/juju-core/+bug/1484105/comments/16 & https://bugs.launchpad.net/juju-core/+bug/1484105/comments/21).

Version history for the affected environment is as follows:

2015-07-06 1.20.14-trusty-amd64
2016-01-12 1.24.7-trusty-amd64
2016-09-06 1.25.6-trusty-amd64 (current version)

I followed the process which Tim walked me through previously (https://pastebin.canonical.com/175298/ https://pastebin.canonical.com/175299/ https://pastebin.canonical.com/175300/) to resolve it.

Tags: canonical-is

Paul Gear (paulgear) wrote :

Short summary of mongodb commands involved:

db.statuses.find({"txn-revno": {$exists:false}}).count()
db.statuses.find({"txn-queue": {$exists:false}}).count()
db.statuses.update({"txn-revno": {$exists:false}},{$set: {"txn-revno": NumberLong(2), "lp1666396-2017-02-21": true }},{multi: true})
db.statuses.update({"txn-queue": {$exists:false}},{$set: {"txn-queue": [], "lp1666396-2017-02-21": true }},{multi: true})
db.statuses.find({"txn-queue": {$exists:false}}).count()
db.statuses.find({"txn-revno": {$exists:false}}).count()
db.statuses.find({"lp1666396-2017-02-21": true}).count()

Note that "lp1666396-2017-02-21" is an arbitrary marker field set on every repaired document; the final query counts it to confirm how many documents were changed.
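
The count, update, recount sequence above can be modelled in plain JavaScript to show which documents each step selects. This is only an illustrative sketch: the function names (missing, repairField, repairStatuses) and the sample documents are hypothetical, plain objects stand in for documents in db.statuses, and nothing here connects to MongoDB. It assumes {$exists: false} matches only documents where the key is entirely absent.

```javascript
// In-memory model of the repair commands above. Plain objects stand in for
// documents in db.statuses; nothing here talks to MongoDB.

// Mirrors the filter {field: {$exists: false}}: the key is absent entirely.
function missing(docs, field) {
  return docs.filter((doc) => !Object.prototype.hasOwnProperty.call(doc, field));
}

// Mirrors update(filter, {$set: {...}}, {multi: true}): every document that
// lacks `field` gets the field plus the marker tag. Returns how many changed.
function repairField(docs, field, makeValue, tag) {
  const targets = missing(docs, field);
  for (const doc of targets) {
    doc[field] = makeValue();
    doc[tag] = true;
  }
  return targets.length;
}

// Count, repair both fields, count again, then count the tagged documents.
function repairStatuses(docs, tag) {
  const count = () => ({
    "txn-revno": missing(docs, "txn-revno").length,
    "txn-queue": missing(docs, "txn-queue").length,
  });
  const before = count();
  repairField(docs, "txn-revno", () => 2, tag); // NumberLong(2) in the shell
  repairField(docs, "txn-queue", () => [], tag); // fresh array per document
  const after = count();
  const tagged = docs.filter((doc) => doc[tag] === true).length;
  return { before, after, tagged };
}

// Hypothetical sample data: "b" lacks both fields, "c" lacks only txn-revno.
const sampleDocs = [
  { _id: "a", status: "active", "txn-revno": 5, "txn-queue": [] },
  { _id: "b", status: "unknown" },
  { _id: "c", status: "unknown", "txn-queue": [] },
];
const report = repairStatuses(sampleDocs, "lp1666396-2017-02-21");
console.log(JSON.stringify(report));
// before: txn-revno 2, txn-queue 1; after: 0 and 0; tagged: 2
```

In this model the tagged count (2) is lower than the sum of the two "before" counts (3), because a document missing both fields is tagged only once. The same should hold when comparing the real counts.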

Assigning to Tim at his request.

Changed in juju-core:
assignee: nobody → Tim Penhey (thumper)
tags: added: canonical-is

Tim Penhey (thumper) wrote :

Alright, I have double checked the code that adds in the unknown statuses, and the upgrade path is definitely doing things right.

You mentioned that this was after deploy of a charm, is that right?

This is deploy into the 1.25.6 environment? As opposed to deploy into the 1.24.7 environment and then upgrade?

Changed in juju-core:
status: New → Incomplete

Paul Gear (paulgear) wrote :

Yes, in each case where I've seen this, it has been during upgrade of the ksplice charm. Deploy in the latest case was months after the 1.25.6 upgrade.

Changed in juju-core:
status: Incomplete → New

Anastasia (anastasia-macmood) wrote :

@Paul Gear (paulgear),

Do you happen to have the logs from the time of charm upgrade (ksplice)? Preferably before, during and after? ;)

Changed in juju-core:
status: New → Incomplete

Tim Penhey (thumper) wrote :

We are pretty gobsmacked about this. Just can't figure out what is going on with the missing txn-revno fields.

Has this failure only been on one environment? Or has it occurred on many? Does it occur on all environments?

Is it just the ksplice charm? Has this been seen with other charms?

Paul Gear (paulgear)
description: updated

Paul Gear (paulgear) wrote :

@anastasia-macmood: I've uploaded logs for the environment which failed to https://private-fileshare.canonical.com/~paulgear/lp1666396/; machine-0.log goes back to 2015.

@thumper: I have seen this situation in 3 different production environments. All are hosted on ProdStack 4.5; the only common circumstance in which I have seen this happen is directly after the upgrade of the ksplice charm.

Changed in juju-core:
status: Incomplete → New

Anastasia (anastasia-macmood) wrote :

@Paul Gear,
I am marking this issue as Won't Fix for now:
1. It's a narrow corner case that only affects upgrades of the ksplice charm;
2. There is a workaround - manually fixing the database;
3. It's 1.25 and we can only address Critical bugs that have no workaround.

Changed in juju-core:
status: New → Won't Fix
assignee: Tim Penhey (thumper) → nobody

Tim Penhey (thumper) wrote :

Finally found the root cause of this: it was the script linked in comment #16 of bug 1516989.

I have updated the script to include the txn-revno and txn-queue fields:

http://pastebin.ubuntu.com/24186473/

See: https://bugs.launchpad.net/juju-core/+bug/1516989/comments/17
