juju upgrade from 2.3.7 to 2.3.8 failed

Bug #1779682 reported by Junien F
Affects: Canonical Juju
Status: Triaged
Importance: Low
Assigned to: Unassigned

Bug Description

Hi,

I tried upgrading a big juju controller from 2.3.7 to 2.3.8 today, and it failed. The controllers are HA-enabled (3 machines). It looks like it took machine 1 around 10 minutes to notice that there was an update and 16 minutes to start on 2.3.8. After that, it took ~43 minutes to start the upgrade steps.

Logs of machine 0 (OK) : https://pastebin.canonical.com/p/gsgBqkmPkD/
Logs of machine 1 (FAIL) : https://pastebin.canonical.com/p/nMqmNm74Qv/
Logs of machine 2 (OK) : https://pastebin.canonical.com/p/BJq2wQGGVR/

This also led to the creation of weird documents in the upgradeInfo collection:

juju:PRIMARY> db.upgradeInfo.find({targetVersion:"2.3.8"}).pretty()
{
        "_id" : "ObjectIdHex(\"5b39cd55a75667418dbd04f6\")",
        "previousVersion" : "2.3.7",
        "targetVersion" : "2.3.8",
        "status" : "aborted",
        "started" : ISODate("2018-07-02T06:44:27.548Z"),
        "controllersReady" : [
                "0",
                "2"
        ],
        "controllersDone" : [ ],
        "txn-revno" : NumberLong(2),
        "txn-queue" : [ ]
}
{
        "_id" : "current",
        "previousVersion" : "2.3.7",
        "targetVersion" : "2.3.8",
        "status" : "pending",
        "started" : ISODate("2018-07-02T07:41:21.107Z"),
        "controllersReady" : [
                "1"
        ],
        "controllersDone" : [ ],
        "txn-revno" : NumberLong(5),
        "txn-queue" : [ ]
}
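
The _id of the live document appears to be the literal string "current" (the second doc above), so anything else in the collection looks like a leftover, such as the aborted one. A quick way to list those strays (same mongo shell session as above):

juju:PRIMARY> db.upgradeInfo.find({ _id: { $ne: "current" } }).pretty()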

All this generated a lot of churn on the mongodb server (simple requests were taking 5 to 10 seconds), which made interacting with this controller slow overall.
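
To quantify that churn from the mongo side, this diagnostic (a sketch, assuming the same shell access as above; the 5-second threshold matches the slowness seen here) lists in-flight operations that have been running for at least 5 seconds:

juju:PRIMARY> db.currentOp({ active: true, secs_running: { $gte: 5 } })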

Thanks

Anastasia (anastasia-macmood) wrote:

I wonder if it is related to bug #1778614...

Could you please give us more information about the composition of the controllers? i.e. what was deployed on the controller model, what relations, any subordinates?

We'd need to know more to reproduce this, as I am sure that we do test some upgrade scenarios.

Changed in juju:
status: New → Incomplete
Junien F (axino) wrote:

Hi,

Here it is https://pastebin.canonical.com/p/khYn4PftYG/ (Canonical employees only).

Changed in juju:
status: Incomplete → New
John A Meinel (jameinel) wrote:

I wonder two things:

1) We have seen this happen where one controller takes a while to respond to an upgrade request, causing things to go into a 'split' upgrade: two of the machines are trying to upgrade with one doc, and the other is stuck on another doc. Since two of them think the upgrade is aborted, they never report on the new doc, and they all end up pending because they never see all 3 controllers agree that it is time to upgrade (see the sketch after this comment).
2) One reason controllers weren't shutting down is that they weren't disabling new incoming requests when they were in 'shutdown' mode, so they were stuck running because a new login would come in before the last connection was rejected. I believe we already have a patch for that in 2.4 (from Tim) that rejects incoming connections while still allowing graceful shutdown of existing connections.

I wonder if we've shrunk the two-upgrade-doc issue enough that it isn't a problem now.
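
To make (1) concrete with the docs from the report: the upgrade only proceeds once every controller shows up in a doc's controllersReady, and neither doc above ever collects all three. A sketch of that check (mongo shell; the machine ids are the HA controllers from this bug, and the loop is illustrative, not Juju's actual upgrade code):

var controllers = ["0", "1", "2"]; // HA controller machines from this report
db.upgradeInfo.find().forEach(function (doc) {
    // List whichever machines a doc is still waiting on before the
    // upgrade can proceed.
    var missing = controllers.filter(function (id) {
        return doc.controllersReady.indexOf(id) === -1;
    });
    if (missing.length > 0) {
        print(tojson(doc._id) + " (" + doc.status + "): waiting on " + missing.join(", "));
    }
});

Run against the two docs above, this would report the aborted doc waiting on machine 1 and the "current" doc waiting on machines 0 and 2, which is exactly the split state.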

Changed in juju:
status: New → Triaged
Canonical Juju QA Bot (juju-qa-bot) wrote:

This bug has not been updated in 2 years, so we're marking it Low importance. If you believe this is incorrect, please update the importance.

Changed in juju:
importance: Undecided → Low
tags: added: expirebugs-bot