juju upgrade-charm returns ERROR state changing too quickly; try again soon

Bug #1484105 reported by Junien F
This bug affects 10 people
Affects: juju-core
Status: Fix Released
Importance: High
Assigned to: Unassigned

Bug Description

Hi,

When trying to upgrade a charm on a juju environment (deploying only on MAAS nodes), I get the following error:
$ juju upgrade-charm ubuntu-mirror
Added charm "local:precise/ubuntu-mirror-150" to the environment.
ERROR state changing too quickly; try again soon

Logs for the request :
http://pastebin.ubuntu.com/12061378/

I dug a bit and found the following: approximately every 17 minutes, juju detects a new address for each MAAS node, and I see the following in machine-0.log:
juju.worker.instanceupdater updater.go:264 machine "0" has new addresses: [public:foo.example.com]

And in all-machines.log, for each unit of the environment, I get:
unit-foo-XY 2015-08-12 12:11:32 INFO juju.worker.upgrader upgrader.go:121 desired tool version: 1.18.4

We started a 1.16 -> 1.18 upgrade at the end of July; I guess it's not completed yet?

How can I upgrade the charm? I tried restarting jujud-machine-0, to no avail. juju status is clean.

Thanks

Curtis Hovey (sinzui)
tags: added: upgrade-charm
tags: added: upgrade-juju
Curtis Hovey (sinzui)
Changed in juju-core:
status: New → Triaged
importance: Undecided → High
milestone: none → 1.25.0
Revision history for this message
Junien F (axino) wrote :

I think I found the cause of this bug : stale data in the DB.

> db.services.findOne({_id : "ubuntu-mirror"} ).relationcount
8
> db.relations.find({ _id: { $regex: "ubuntu-mirror:" } }).count()
7

When trying to upgrade the charm, one of the transaction operations (txnOp) asserts that "relationcount" in the service document is 7, while in the DB it is 8. So the set of transactions fails; juju retries 4 times and ends up throwing an ExcessiveContention error.

Is my analysis correct? How can I progress with this?

Thanks !
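
A minimal mongo-shell sketch of the check described above, and of the kind of manual counter repair it suggests. This assumes direct access to the juju DB with the same collection layout shown in the queries above; the final $set is a hypothetical, unsupported repair and should only ever be attempted against a backed-up database:

> // compare the cached relationcount with the number of relation documents actually present
> var svc = db.services.findOne({ _id: "ubuntu-mirror" })
> var actual = db.relations.find({ _id: { $regex: "ubuntu-mirror:" } }).count()
> print("cached:", svc.relationcount, "actual:", actual)
> // hypothetical repair: re-sync the stale cached counter (unsupported; back up first)
> db.services.update({ _id: "ubuntu-mirror" }, { $set: { relationcount: actual } })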

Revision history for this message
Junien F (axino) wrote :

Note that I enabled verbose logging on juju-db for a few hours, and no updates to relationcount were made during that time frame, even though I (unsuccessfully) tried to upgrade the charm several times during it.

Revision history for this message
Junien F (axino) wrote :

I also noted the following discrepancy:

> db.units.find({ _id: { $regex: "^ubuntu-mirror/" } }).count()
8
> db.services.findOne({_id : "ubuntu-mirror"} ).unitcount
16

Is this normal/worrying/wrong/fixable?

Curtis Hovey (sinzui)
Changed in juju-core:
milestone: 1.25-alpha1 → 1.25-beta1
Changed in juju-core:
milestone: 1.25-beta1 → 1.25-beta2
Changed in juju-core:
milestone: 1.25-beta2 → 1.25.1
tags: added: bug-squad
Changed in juju-core:
milestone: 1.25.1 → 1.25.2
Changed in juju-core:
milestone: 1.25.2 → 1.25.3
Changed in juju-core:
milestone: 1.25.3 → 1.25.4
Changed in juju-core:
milestone: 1.25.4 → 1.25.5
Changed in juju-core:
milestone: 1.25.5 → 1.25.6
Revision history for this message
Edward Hope-Morley (hopem) wrote :

I'm hitting something similar with Juju 1.25.5 and the Openstack provider:

machine-0: 2016-05-03 19:24:13 ERROR juju.worker.instanceupdater updater.go:275 cannot set addresses on "13": cannot set addresses of machine 13: state changing too quick
machine-0: 2016-05-03 19:24:18 ERROR juju.worker runner.go:223 exited "instancepoller": cannot set addresses of machine 13: state changing too quickly; try again soon

Basically, one unit (out of many) fails to get a private-address even though its network is definitely configured and up. If I reboot the unit it still never gets one, almost as if Juju gets into this cycle and can never get out.

tags: added: sts
Revision history for this message
Cheryl Jennings (cherylj) wrote :

@hopem - the case of not setting the machine address should be solved by bug #1537585.

A significant effort was needed to determine that asserting on a struct can fail due to ordering (see the mailing list discussion here: https://lists.ubuntu.com/archives/juju-dev/2016-June/005602.html)

It looks like there may also be a problem with the assertion in SetCharm which could've caused the original problem reported in this bug.
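
For context, a small mongo-shell illustration of the ordering issue referenced above: an exact sub-document match (the kind of assertion built from a struct) is sensitive to field order, so the same fields in a different order fail the assert. The ordertest collection name is invented for the example:

juju:PRIMARY> db.ordertest.insert({ nested: { x: 1, y: 2 } })
juju:PRIMARY> db.ordertest.find({ nested: { x: 1, y: 2 } }).count()   // 1: exact match, same field order
juju:PRIMARY> db.ordertest.find({ nested: { y: 2, x: 1 } }).count()   // 0: same fields, different order, so the assert fails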

Curtis Hovey (sinzui)
Changed in juju-core:
milestone: 1.25.6 → 1.25.7
tags: removed: sts
Changed in juju-core:
status: Triaged → Won't Fix
Revision history for this message
Junien F (axino) wrote :

Hi !

What's up with "Won't Fix"? This is still happening, and I'm still open to any advice on repairing this environment.

Thank you

Revision history for this message
Anastasia (anastasia-macmood) wrote :

@Junien

This is against 1.25, which is in critical-bug-fixes-only mode. From reading comment #5, this bug was fixed by another one. Are you seeing the problem with 1.25.6? The last communication here was back in June. If this bug is Critical, please comment on why, for example a lack of workaround, etc. We'll be happy to raise it higher \o/

Our current focus is on 2.0. Is this still happening with 2.0-beta16?

Changed in juju-core:
status: Won't Fix → Incomplete
milestone: 1.25.7 → none
Revision history for this message
Junien F (axino) wrote :

This is a pretty critical and touchy environment. Is it safe to try to upgrade it to 1.25.6?

Revision history for this message
Anastasia (anastasia-macmood) wrote :

@Junien,
What Juju version is this environment currently running?

Revision history for this message
Junien F (axino) wrote :

1.18.4 heh - nothing's changed since the original report

Revision history for this message
Tim Penhey (thumper) wrote :

You need to upgrade to 1.20.14 first. And from there you can upgrade to 1.25.6.

Revision history for this message
Launchpad Janitor (janitor) wrote :

[Expired for juju-core because there has been no activity for 60 days.]

Changed in juju-core:
status: Incomplete → Expired
Revision history for this message
Paul Gear (paulgear) wrote :

We're still seeing this on 1.25.9 on an important production site.

Changed in juju-core:
status: Expired → New
tags: added: canonical-is
Revision history for this message
Paul Gear (paulgear) wrote :

Restarting machine 0 had no effect on the situation, and a large number of units have stale charms which should have been upgraded.

Revision history for this message
Paul Gear (paulgear) wrote :

https://private-fileshare.canonical.com/~paulgear/lp:1484105/all-machines.log has the log for this recurrence on 1.25.9. The production system is in a state where 16 units of a given subordinate charm are upgraded and 27 others of the same service are not. All jujuds have been restarted without any change, and the above all-machines.log shows a curious silence between 2016-12-30 and today.

Revision history for this message
Junien F (axino) wrote :

Hi,

This time, it appears to be because a "status" document doesn't have a "txn-revno" field:

juju:PRIMARY> db.statuses.find({ _id: { $regex: "ksplice/31#charm" } }).pretty()
{
        "_id" : "0b399859-692f-4e41-8f6d-c3a837cd26c3:u#ksplice/31#charm",
        "env-uuid" : "0b399859-692f-4e41-8f6d-c3a837cd26c3",
        "status" : "unknown",
        "statusdata" : {

        },
        "statusinfo" : "",
        "txn-queue" : [
                "5855f8181c56b63028334e84_66b7dbac",
                "58613aac1c56b670e01053dd_d6947399",
                "5865225a1c56b670e01bcf58_949b14d0",
                "5865e7ff1c56b670e01e1457_62b06131",
                "586b8f841c56b670e02eb517_35ffc19a"
        ]
}

Let's look at the txn-queue :

juju:PRIMARY> db.txns.find({ _id: ObjectId('5855f8181c56b63028334e84') }).pretty()
{
        "_id" : ObjectId("5855f8181c56b63028334e84"),
        "n" : "165ddd7e",
        "o" : [
                {
                        "c" : "statuses",
                        "d" : "0b399859-692f-4e41-8f6d-c3a837cd26c3:u#ksplice/31#charm",
                        "a" : {
                                "txn-revno" : NumberLong(0)
                        },
                        "u" : {
                                "$set" : {
                                        "env-uuid" : "0b399859-692f-4e41-8f6d-c3a837cd26c3",
                                        "status" : "active",
                                        "statusinfo" : "Effective kernel 3.13.0-105-generic",
                                        "statusdata" : {

                                        },
                                        "updated" : NumberLong("1482029080376315530"),
                                        "neverset" : false
                                }
                        }
                }
        ],
        "s" : 5
}

As you can see, the only assertion is that txn-revno == 0, which I guess fails because there is no "txn-revno" field at all? This causes the transaction state to be set to 5 ("aborted"), and causes the status update to fail.

The other queued txns are similar, e.g.:

juju:PRIMARY> db.txns.find({ _id: ObjectId('58613aac1c56b670e01053dd') }).pretty()
{
        "_id" : ObjectId("58613aac1c56b670e01053dd"),
        "n" : "189c1ad6",
        "o" : [
                {
                        "c" : "statuses",
                        "d" : "0b399859-692f-4e41-8f6d-c3a837cd26c3:u#ksplice/31#charm",
                        "a" : {
                                "txn-revno" : NumberLong(0)
                        },
                        "u" : {
                                "$set" : {
                                        "env-uuid" : "0b399859-692f-4e41-8f6d-c3a837cd26c3",
                                        "status" : "active",
                                        "statusinfo" : "Effective kernel 3.13.0-106-generic",
                                        "statusdata" : {

                                        },
                                        "updated" : NumberLong("1482767020431622924"),
                                        "neverset" : false
                                }
               ...

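For reference, a hedged mongo-shell query that should enumerate every status document stuck in this state, i.e. carrying a pending txn-queue but missing the txn-revno field (assuming the same collection layout as the documents dumped above):

juju:PRIMARY> // list status docs with a non-empty txn-queue but no txn-revno field
juju:PRIMARY> db.statuses.find({ "txn-queue.0": { $exists: true }, "txn-revno": { $exists: false } }, { _id: 1 })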

Revision history for this message
Junien F (axino) wrote :

To be clear: the symptom here is not an inability to upgrade a charm. The symptom is that some units can't update their statuses.

Revision history for this message
Tim Penhey (thumper) wrote :

Wow... this is very screwed.

There should be nothing that removes the txn-revno field in a document.

The assertion that txn-revno == 0 comes from the code that is updating the document. It is attempting to make sure that the document hasn't changed since it was read, on the assumption that txn-revno is always there, because it should be. There is no reason there should be a txn-queue but no txn-revno.

The simplest solution is to manually add a txn-revno to the document. Any positive integer should be fine. If you want to set the correct value, you'd have to scan the txns collection for an operation on that document; the last revno will be in the "r" field. Feel free to contact me on IRC if you need help.
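
A sketch of that repair in the mongo shell, using the ksplice/31 status document shown above as the example. The "o.d" path and the top-level "r" revno array reflect my reading of the mgo/txn schema and the advice above, and NumberLong(2) is just an illustrative positive value; as with any manual surgery on the juju DB, take a backup first:

juju:PRIMARY> // inspect txns that touched the document; applied ones should carry revnos in the top-level "r" array
juju:PRIMARY> db.txns.find({ "o.d": "0b399859-692f-4e41-8f6d-c3a837cd26c3:u#ksplice/31#charm" }, { o: 1, r: 1, s: 1 }).sort({ _id: -1 }).limit(5).pretty()
juju:PRIMARY> // then add the missing field to the broken status document
juju:PRIMARY> db.statuses.update({ _id: "0b399859-692f-4e41-8f6d-c3a837cd26c3:u#ksplice/31#charm" }, { $set: { "txn-revno": NumberLong(2) } })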

Revision history for this message
Paul Gear (paulgear) wrote :

Today I worked with Tim to manually add a txn-revno to all the documents which required it. No explanation was found for how they got this way, but all of the status updates which were previously broken are now working, so there's probably not much point in keeping this open any longer. I've placed detailed logs in cRT#98820.

Changed in juju-core:
status: New → Fix Committed
Revision history for this message
Anastasia (anastasia-macmood) wrote :

Marked as Fix Committed to close the bug as it seems to be the closest fit in this instance.

Changed in juju-core:
status: Fix Committed → Fix Released
Revision history for this message
Paul Gear (paulgear) wrote :

We had a new occurrence of the same mongodb corruption we saw at https://bugs.launchpad.net/juju-core/+bug/1484105/comments/16 and subsequent comments. Following the same steps of manually adding the missing txn-revno and txn-queue fields fixed the issue.

Revision history for this message
John A Meinel (jameinel) wrote :

Paul, you say that you've seen the issue again, with what version of Juju, and what version of Mongo?

Revision history for this message
Paul Gear (paulgear) wrote :

@jameinel, versions at present are juju 1.25.9-trusty-amd64 & juju-mongodb 2.4.9-0ubuntu3 (I assume the dpkg version is the one you need for the latter?)

Here's a version history for this env:

2015-06-11 [1.20.14-trusty-amd64 gc]
2015-12-04 [1.24.7-trusty-amd64 gc]
2016-07-07 [1.25.5-trusty-amd64 gc]
2016-08-12 [1.25.6-trusty-amd64 gc]
2016-11-25 [1.25.8-trusty-amd64 gc]
2016-12-28 [1.25.9-trusty-amd64 gc]

I don't know at what point the missing txn-revnos showed up, but I discovered them on 2017-02-03, after the upgrade to 1.25.9.

Revision history for this message
Paul Gear (paulgear) wrote :

Opened a separate bug for the missing txn-revnos problem: https://bugs.launchpad.net/juju-core/+bug/1666396

Revision history for this message
Simon Kollberg (simonklb) wrote :

I'm experiencing this bug as well.

It seems to happen when removing an application and then deploying the same application again, but not for every charm.

% juju --version
2.2-beta4-xenial-amd64

$ sudo apt list --installed | grep juju-mongodb
juju-mongodb3.2/xenial-updates,now 3.2.12-0ubuntu1~16.04 amd64 [installed]

Juju provider: manual
