juju upgrade from 2.3.7 to 2.3.8 failed

Bug #1779682 reported by Junien F
Affects: Canonical Juju
Status: Triaged
Importance: Low
Assigned to: Unassigned

Bug Description

Hi,

I tried upgrading a big juju controller from 2.3.7 to 2.3.8 today, and it failed. The controllers are HA-enabled (3 machines). It looks like it took machine 1 around 10 minutes to notice that there was an update and 16 minutes to start on 2.3.8. After that, it took ~43 minutes to start the upgrade steps.

Logs of machine 0 (OK) : https://pastebin.canonical.com/p/gsgBqkmPkD/
Logs of machine 1 (FAIL) : https://pastebin.canonical.com/p/nMqmNm74Qv/
Logs of machine 2 (OK) : https://pastebin.canonical.com/p/BJq2wQGGVR/

This also led to the creation of weird documents in the upgradeInfo collection:

juju:PRIMARY> db.upgradeInfo.find({targetVersion:"2.3.8"}).pretty()
{
        "_id" : "ObjectIdHex(\"5b39cd55a75667418dbd04f6\")",
        "previousVersion" : "2.3.7",
        "targetVersion" : "2.3.8",
        "status" : "aborted",
        "started" : ISODate("2018-07-02T06:44:27.548Z"),
        "controllersReady" : [
                "0",
                "2"
        ],
        "controllersDone" : [ ],
        "txn-revno" : NumberLong(2),
        "txn-queue" : [ ]
}
{
        "_id" : "current",
        "previousVersion" : "2.3.7",
        "targetVersion" : "2.3.8",
        "status" : "pending",
        "started" : ISODate("2018-07-02T07:41:21.107Z"),
        "controllersReady" : [
                "1"
        ],
        "controllersDone" : [ ],
        "txn-revno" : NumberLong(5),
        "txn-queue" : [ ]
}
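
The _id of the live document appears to be the literal string "current" (the second doc above), so anything else in the collection looks like a leftover, such as the aborted one. A quick way to list those strays (same mongo shell session as above):

juju:PRIMARY> db.upgradeInfo.find({ _id: { $ne: "current" } }).pretty()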

All this generated a lot of churn on the mongodb server (simple requests were taking 5 to 10 seconds), which made interacting with this controller slow overall.
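
To quantify that churn from the mongo side, this diagnostic (a sketch, assuming the same shell access as above; the 5-second threshold matches the slowness seen here) lists in-flight operations that have been running for at least 5 seconds:

juju:PRIMARY> db.currentOp({ active: true, secs_running: { $gte: 5 } })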

Thanks

Anastasia (anastasia-macmood) wrote:

I wonder if it is related to bug #1778614...

Could you please give us more information about the composition of the controllers? i.e. what was deployed on the controller model, what relations, any subordinates?

We'd need to know more to reproduce this, as I am sure that we do test some upgrade scenarios.

Changed in juju:
status: New → Incomplete
Junien F (axino) wrote:

Hi,

Here it is https://pastebin.canonical.com/p/khYn4PftYG/ (Canonical employees only).

Changed in juju:
status: Incomplete → New
John A Meinel (jameinel) wrote:

I wonder two things:

1) We have seen this happen where one controller takes a while to respond to an upgrade request, causing things to go into a 'split' upgrade: two of the machines are trying to upgrade with one doc, and the other is stuck on another doc. Since two of them think the upgrade is aborted, they never report on the new doc, and they all end up pending because they never see all 3 controllers agree that it is time to upgrade (see the sketch after this comment).
2) One reason controllers weren't shutting down is that they weren't disabling new incoming requests when they were in 'shutdown' mode, so they were stuck running because a new login would come in before the last connection was rejected. I believe we already have a patch for that in 2.4 (from Tim) that rejects incoming connections while still allowing graceful shutdown of existing connections.

I wonder if we've shrunk the two-upgrade-doc issue enough that it isn't a problem now.
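
To make (1) concrete with the docs from the report: the upgrade only proceeds once every controller shows up in a doc's controllersReady, and neither doc above ever collects all three. A sketch of that check (mongo shell; the machine ids are the HA controllers from this bug, and the loop is illustrative, not Juju's actual upgrade code):

var controllers = ["0", "1", "2"]; // HA controller machines from this report
db.upgradeInfo.find().forEach(function (doc) {
    // List whichever machines a doc is still waiting on before the
    // upgrade can proceed.
    var missing = controllers.filter(function (id) {
        return doc.controllersReady.indexOf(id) === -1;
    });
    if (missing.length > 0) {
        print(tojson(doc._id) + " (" + doc.status + "): waiting on " + missing.join(", "));
    }
});

Run against the two docs above, this would report the aborted doc waiting on machine 1 and the "current" doc waiting on machines 0 and 2, which is exactly the split state.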

Changed in juju:
status: New → Triaged
Canonical Juju QA Bot (juju-qa-bot) wrote:

This bug has not been updated in 2 years, so we're marking it Low importance. If you believe this is incorrect, please update the importance.

Changed in juju:
importance: Undecided → Low
tags: added: expirebugs-bot