juju upgrade-charm returns ERROR state changing too quickly; try again soon

Bug #1484105 reported by Junien F
This bug affects 10 people
Affects: juju-core
Status: Fix Released
Importance: High
Assigned to: Unassigned

Bug Description

Hi,

When trying to upgrade a charm on a juju environment (deploying only on MAAS nodes), I get the following error:
$ juju upgrade-charm ubuntu-mirror
Added charm "local:precise/ubuntu-mirror-150" to the environment.
ERROR state changing too quickly; try again soon

Logs for the request :
http://pastebin.ubuntu.com/12061378/

I dug a bit and found the following: approximately every 17 minutes, juju detects a new address for each MAAS node, and I see the following in machine-0.log:
juju.worker.instanceupdater updater.go:264 machine "0" has new addresses: [public:foo.example.com]

And in all-machines.log, for each unit of the environment, I get:
unit-foo-XY 2015-08-12 12:11:32 INFO juju.worker.upgrader upgrader.go:121 desired tool version: 1.18.4

We started a 1.16 -> 1.18 upgrade at the end of July; I guess it's not completed yet?

How can I upgrade the charm? I tried restarting jujud-machine-0, to no avail. juju status is clean.

Thanks

Curtis Hovey (sinzui)
tags: added: upgrade-charm
tags: added: upgrade-juju
Curtis Hovey (sinzui)
Changed in juju-core:
status: New → Triaged
importance: Undecided → High
milestone: none → 1.25.0
Revision history for this message
Junien F (axino) wrote :

I think I found the cause of this bug : stale data in the DB.

> db.services.findOne({_id : "ubuntu-mirror"} ).relationcount
8
> db.relations.find({ _id: { $regex: "ubuntu-mirror:" } }).count()
7

When trying to upgrade the charm, one of the transaction operations (txnOp) asserts that "relationcount" in the service document is 7, while in the DB it is 8. So the set of transactions fails; juju retries 4 times and ends up throwing an ExcessiveContention error.

Is my analysis correct? How can I progress with this?

Thanks !
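
A minimal mongo-shell sketch of the check described above, and of the kind of manual counter repair it suggests. This assumes direct access to the juju DB with the same collection layout shown in the queries above; the final $set is a hypothetical, unsupported repair and should only ever be attempted against a backed-up database:

> // compare the cached relationcount with the number of relation documents actually present
> var svc = db.services.findOne({ _id: "ubuntu-mirror" })
> var actual = db.relations.find({ _id: { $regex: "ubuntu-mirror:" } }).count()
> print("cached:", svc.relationcount, "actual:", actual)
> // hypothetical repair: re-sync the stale cached counter (unsupported; back up first)
> db.services.update({ _id: "ubuntu-mirror" }, { $set: { relationcount: actual } })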

Revision history for this message
Junien F (axino) wrote :

Note that I enabled verbose logging on juju-db for a few hours, and no updates to relationcount were made during that time frame, even though I (unsuccessfully) tried to upgrade the charm several times during it.

Revision history for this message
Junien F (axino) wrote :

I also noted the following discrepancy:

> db.units.find({ _id: { $regex: "^ubuntu-mirror/" } }).count()
8
> db.services.findOne({_id : "ubuntu-mirror"} ).unitcount
16

Is this normal/worrying/wrong/fixable?

Curtis Hovey (sinzui)
Changed in juju-core:
milestone: 1.25-alpha1 → 1.25-beta1
Changed in juju-core:
milestone: 1.25-beta1 → 1.25-beta2
Changed in juju-core:
milestone: 1.25-beta2 → 1.25.1
tags: added: bug-squad
Changed in juju-core:
milestone: 1.25.1 → 1.25.2
Changed in juju-core:
milestone: 1.25.2 → 1.25.3
Changed in juju-core:
milestone: 1.25.3 → 1.25.4
Changed in juju-core:
milestone: 1.25.4 → 1.25.5
Changed in juju-core:
milestone: 1.25.5 → 1.25.6
Revision history for this message
Edward Hope-Morley (hopem) wrote :

I'm hitting something similar with Juju 1.25.5 and the Openstack provider:

machine-0: 2016-05-03 19:24:13 ERROR juju.worker.instanceupdater updater.go:275 cannot set addresses on "13": cannot set addresses of machine 13: state changing too quick
machine-0: 2016-05-03 19:24:18 ERROR juju.worker runner.go:223 exited "instancepoller": cannot set addresses of machine 13: state changing too quickly; try again soon

Basically, one unit (out of many) fails to get a private-address even though its network is definitely configured and up. If I reboot the unit it still never gets one, almost as if Juju gets into this cycle and can never get out.

tags: added: sts
Revision history for this message
Cheryl Jennings (cherylj) wrote :

@hopem - the case of not setting the machine address should be solved by bug #1537585.

A significant effort was needed to determine that asserting on a struct can fail due to ordering (see the mailing list discussion here: https://lists.ubuntu.com/archives/juju-dev/2016-June/005602.html)

It looks like there may also be a problem with the assertion in SetCharm which could've caused the original problem reported in this bug.
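
For context, a small mongo-shell illustration of the ordering issue referenced above: an exact sub-document match (the kind of assertion built from a struct) is sensitive to field order, so the same fields in a different order fail the assert. The ordertest collection name is invented for the example:

juju:PRIMARY> db.ordertest.insert({ nested: { x: 1, y: 2 } })
juju:PRIMARY> db.ordertest.find({ nested: { x: 1, y: 2 } }).count()   // 1: exact match, same field order
juju:PRIMARY> db.ordertest.find({ nested: { y: 2, x: 1 } }).count()   // 0: same fields, different order, so the assert fails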

Curtis Hovey (sinzui)
Changed in juju-core:
milestone: 1.25.6 → 1.25.7
tags: removed: sts
Changed in juju-core:
status: Triaged → Won't Fix
Revision history for this message
Junien F (axino) wrote :

Hi !

What's up with "Won't Fix"? This is still happening, and I'm still open to any advice on repairing this environment.

Thank you

Revision history for this message
Anastasia (anastasia-macmood) wrote :

@Junien

This is against 1.25, which is in critical-bug-fixes-only mode. From reading comment #5, this bug was fixed by another one. Are you seeing the problem with 1.25.6? The last communication here was back in June. If this bug is Critical, please comment on why, for example a lack of workaround, etc. We'll be happy to raise it higher \o/

Our current focus is on 2.0. Is this still happening with 2.0-beta16?

Changed in juju-core:
status: Won't Fix → Incomplete
milestone: 1.25.7 → none
Revision history for this message
Junien F (axino) wrote :

This is a pretty critical and touchy environment. Is it safe to try to upgrade it to 1.25.6?

Revision history for this message
Anastasia (anastasia-macmood) wrote :

@Junien,
What Juju version is this environment currently running?

Revision history for this message
Junien F (axino) wrote :

1.18.4 heh - nothing's changed since the original report

Revision history for this message
Tim Penhey (thumper) wrote :

You need to upgrade to 1.20.14 first. And from there you can upgrade to 1.25.6.

Revision history for this message
Launchpad Janitor (janitor) wrote :

[Expired for juju-core because there has been no activity for 60 days.]

Changed in juju-core:
status: Incomplete → Expired
Revision history for this message
Paul Gear (paulgear) wrote :

We're still seeing this on 1.25.9 on an important production site.

Changed in juju-core:
status: Expired → New
tags: added: canonical-is
Revision history for this message
Paul Gear (paulgear) wrote :

Restarting machine 0 had no effect on the situation, and a large number of units have stale charms which should have been upgraded.

Revision history for this message
Paul Gear (paulgear) wrote :

https://private-fileshare.canonical.com/~paulgear/lp:1484105/all-machines.log has the log for this recurrence on 1.25.9. The production system is in a state where 16 units of a given subordinate charm are upgraded and 27 others of the same service are not. All jujuds have been restarted without any change, and the above all-machines.log shows a curious silence between 2016-12-30 and today.

Revision history for this message
Junien F (axino) wrote :

Hi,

This time, it appears to be because a "status" document doesn't have a "txn-revno" field:

juju:PRIMARY> db.statuses.find({ _id: { $regex: "ksplice/31#charm" } }).pretty()
{
        "_id" : "0b399859-692f-4e41-8f6d-c3a837cd26c3:u#ksplice/31#charm",
        "env-uuid" : "0b399859-692f-4e41-8f6d-c3a837cd26c3",
        "status" : "unknown",
        "statusdata" : {

        },
        "statusinfo" : "",
        "txn-queue" : [
                "5855f8181c56b63028334e84_66b7dbac",
                "58613aac1c56b670e01053dd_d6947399",
                "5865225a1c56b670e01bcf58_949b14d0",
                "5865e7ff1c56b670e01e1457_62b06131",
                "586b8f841c56b670e02eb517_35ffc19a"
        ]
}

Let's look at the txn-queue :

juju:PRIMARY> db.txns.find({ _id: ObjectId('5855f8181c56b63028334e84') }).pretty()
{
        "_id" : ObjectId("5855f8181c56b63028334e84"),
        "n" : "165ddd7e",
        "o" : [
                {
                        "c" : "statuses",
                        "d" : "0b399859-692f-4e41-8f6d-c3a837cd26c3:u#ksplice/31#charm",
                        "a" : {
                                "txn-revno" : NumberLong(0)
                        },
                        "u" : {
                                "$set" : {
                                        "env-uuid" : "0b399859-692f-4e41-8f6d-c3a837cd26c3",
                                        "status" : "active",
                                        "statusinfo" : "Effective kernel 3.13.0-105-generic",
                                        "statusdata" : {

                                        },
                                        "updated" : NumberLong("1482029080376315530"),
                                        "neverset" : false
                                }
                        }
                }
        ],
        "s" : 5
}

As you can see, the only assertion is that txn-revno == 0, which I guess fails because there is no "txn-revno" field at all? This causes the transaction state to be set to 5 ("aborted"), and causes the status update to fail.

The other queued txns are similar, e.g.:

juju:PRIMARY> db.txns.find({ _id: ObjectId('58613aac1c56b670e01053dd') }).pretty()
{
        "_id" : ObjectId("58613aac1c56b670e01053dd"),
        "n" : "189c1ad6",
        "o" : [
                {
                        "c" : "statuses",
                        "d" : "0b399859-692f-4e41-8f6d-c3a837cd26c3:u#ksplice/31#charm",
                        "a" : {
                                "txn-revno" : NumberLong(0)
                        },
                        "u" : {
                                "$set" : {
                                        "env-uuid" : "0b399859-692f-4e41-8f6d-c3a837cd26c3",
                                        "status" : "active",
                                        "statusinfo" : "Effective kernel 3.13.0-106-generic",
                                        "statusdata" : {

                                        },
                                        "updated" : NumberLong("1482767020431622924"),
                                        "neverset" : false
                                }
               ...

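For reference, a hedged mongo-shell query that should enumerate every status document stuck in this state, i.e. carrying a pending txn-queue but missing the txn-revno field (assuming the same collection layout as the documents dumped above):

juju:PRIMARY> // list status docs with a non-empty txn-queue but no txn-revno field
juju:PRIMARY> db.statuses.find({ "txn-queue.0": { $exists: true }, "txn-revno": { $exists: false } }, { _id: 1 })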

Revision history for this message
Junien F (axino) wrote :

To be clear: the symptom here is not an inability to upgrade a charm. The symptom is that some units can't update their statuses.

Revision history for this message
Tim Penhey (thumper) wrote :

Wow... this is very screwed.

There should be nothing that removes the txn-revno field in a document.

The assertion that txn-revno == 0 comes from the code that is updating the document. It is attempting to make sure that the document hasn't changed since it was read, on the assumption that txn-revno is always there, because it should be. There is no reason there should be a txn-queue but no txn-revno.

The simplest solution is to manually add a txn-revno to the document. Any positive integer should be fine. If you want to set the correct value, you'd have to scan the txns collection for an operation on that document; the last revno will be in the "r" field. Feel free to contact me on IRC if you need help.
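
A sketch of that repair in the mongo shell, using the ksplice/31 status document shown above as the example. The "o.d" path and the top-level "r" revno array reflect my reading of the mgo/txn schema and the advice above, and NumberLong(2) is just an illustrative positive value; as with any manual surgery on the juju DB, take a backup first:

juju:PRIMARY> // inspect txns that touched the document; applied ones should carry revnos in the top-level "r" array
juju:PRIMARY> db.txns.find({ "o.d": "0b399859-692f-4e41-8f6d-c3a837cd26c3:u#ksplice/31#charm" }, { o: 1, r: 1, s: 1 }).sort({ _id: -1 }).limit(5).pretty()
juju:PRIMARY> // then add the missing field to the broken status document
juju:PRIMARY> db.statuses.update({ _id: "0b399859-692f-4e41-8f6d-c3a837cd26c3:u#ksplice/31#charm" }, { $set: { "txn-revno": NumberLong(2) } })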

Revision history for this message
Paul Gear (paulgear) wrote :

Today I worked with Tim to manually add a txn-revno to all the documents which required it. No explanation was found for how they got this way, but all of the status updates which were previously broken are now working, so there's probably not much point in keeping this open any longer. I've placed detailed logs in cRT#98820.

Changed in juju-core:
status: New → Fix Committed
Revision history for this message
Anastasia (anastasia-macmood) wrote :

Marked as Fix Committed to close the bug as it seems to be the closest fit in this instance.

Changed in juju-core:
status: Fix Committed → Fix Released
Revision history for this message
Paul Gear (paulgear) wrote :

We had a new occurrence of the same mongodb corruption we saw at https://bugs.launchpad.net/juju-core/+bug/1484105/comments/16 and subsequent comments. Following the same steps of manually adding the missing txn-revno and txn-queue fields fixed the issue.

Revision history for this message
John A Meinel (jameinel) wrote :

Paul, you say that you've seen the issue again, with what version of Juju, and what version of Mongo?

Revision history for this message
Paul Gear (paulgear) wrote :

@jameinel, versions at present are juju 1.25.9-trusty-amd64 & juju-mongodb 2.4.9-0ubuntu3 (I assume the dpkg version is the one you need for the latter?)

Here's a version history for this env:

2015-06-11 [1.20.14-trusty-amd64 gc]
2015-12-04 [1.24.7-trusty-amd64 gc]
2016-07-07 [1.25.5-trusty-amd64 gc]
2016-08-12 [1.25.6-trusty-amd64 gc]
2016-11-25 [1.25.8-trusty-amd64 gc]
2016-12-28 [1.25.9-trusty-amd64 gc]

I don't know at what point the missing txn-revnos showed up, but I discovered them on 2017-02-03, after the upgrade to 1.25.9.

Revision history for this message
Paul Gear (paulgear) wrote :

Opened a separate bug for the missing txn-revnos problem: https://bugs.launchpad.net/juju-core/+bug/1666396

Revision history for this message
Simon Kollberg (simonklb) wrote :

I'm experiencing this bug as well.

It seems to happen when removing an application and then deploying the same application again, but not for every charm.

% juju --version
2.2-beta4-xenial-amd64

$ sudo apt list --installed | grep juju-mongodb
juju-mongodb3.2/xenial-updates,now 3.2.12-0ubuntu1~16.04 amd64 [installed]

Juju provider: manual
