problems upgrading from 2.2.8 to 2.2.9

Bug #1742604 reported by Paul Collins on 2018-01-11
This bug affects 3 people
Affects     Importance   Assigned to
juju        High         Unassigned
juju 2.2    High         Unassigned
juju 2.3    High         Unassigned

Bug Description

Today I upgraded a large Juju 2 controller from 2.2.8 to 2.2.9. Machines 0 and 2 registered the requested upgrade, but machine 1 did not. I restarted it manually at 02:38:18 and again at 02:44:19 to try to "unstick" it. After the first restart, it seemingly started consuming a large number of file descriptors. I was not able to capture a full lsof, but jujud-machine-1 had 18992 FDs and the mongod primary process (running on the same machine) had 6340. There were also two panics, with the following backtrace, at 02:47:56 and 02:52:42:

2018-01-11 02:52:42 INFO juju.worker runner.go:483 stopped "apiserver", err: error pinging mongo: Closed explicitly
panic: Session already closed

goroutine 925137 [running]:
gopkg.in/mgo%2ev2.(*Session).cluster(0xc420235ba0, 0x41331f)
        /tmp/tmpOjeNS0/juju-core_2.2.9/src/gopkg.in/mgo.v2/session.go:1624 +0x81
gopkg.in/mgo%2ev2.copySession(0xc420235ba0, 0xc42cc29801, 0x4473f8)
        /tmp/tmpOjeNS0/juju-core_2.2.9/src/gopkg.in/mgo.v2/session.go:540 +0x40
gopkg.in/mgo%2ev2.(*Session).Copy(0xc420235ba0, 0x627)
        /tmp/tmpOjeNS0/juju-core_2.2.9/src/gopkg.in/mgo.v2/session.go:1590 +0x40
github.com/juju/juju/state.(*StatePool).openState(0xc4203b5920, 0xc43a0fdf80, 0x24, 0x800, 0xc46b4d7800, 0x627)
        /tmp/tmpOjeNS0/juju-core_2.2.9/src/github.com/juju/juju/state/pool.go:155 +0x39
github.com/juju/juju/state.(*StatePool).Get(0xc4203b5920, 0xc43a0fdf80, 0x24, 0x0, 0x0, 0x0, 0x0)
        /tmp/tmpOjeNS0/juju-core_2.2.9/src/github.com/juju/juju/state/pool.go:140 +0x332
github.com/juju/juju/state/statemetrics.statePoolShim.Get(0xc4203b5920, 0xc43a0fdf80, 0x24, 0x0, 0x0, 0xc44e6ccb70, 0xc447e28d40, 0x0)
        /tmp/tmpOjeNS0/juju-core_2.2.9/src/github.com/juju/juju/state/statemetrics/state.go:65 +0x43
github.com/juju/juju/state/statemetrics.(*Collector).updateModelMetrics(0xc420013a40, 0x3b7c220, 0xc45f424c00)
        /tmp/tmpOjeNS0/juju-core_2.2.9/src/github.com/juju/juju/state/statemetrics/statemetrics.go:196 +0x1c4
github.com/juju/juju/state/statemetrics.(*Collector).updateMetrics(0xc420013a40)
        /tmp/tmpOjeNS0/juju-core_2.2.9/src/github.com/juju/juju/state/statemetrics/statemetrics.go:147 +0x1fd
github.com/juju/juju/state/statemetrics.(*Collector).Collect(0xc420013a40, 0xc421b3a060)
        /tmp/tmpOjeNS0/juju-core_2.2.9/src/github.com/juju/juju/state/statemetrics/statemetrics.go:128 +0x1ae
github.com/prometheus/client_golang/prometheus.(*Registry).Gather.func2(0xc43688bd70, 0xc421b3a060, 0x3b75320, 0xc420013a40)
        /tmp/tmpOjeNS0/juju-core_2.2.9/src/github.com/prometheus/client_golang/prometheus/registry.go:382 +0x61
created by github.com/prometheus/client_golang/prometheus.(*Registry).Gather
        /tmp/tmpOjeNS0/juju-core_2.2.9/src/github.com/prometheus/client_golang/prometheus/registry.go:383 +0x2ec
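The panic above is mgo's behaviour when Session.Copy is called on a session that has already been closed. A common defensive pattern (an illustrative sketch only, not code from juju or mgo, with hypothetical names) is to guard the shared session with a mutex and a closed flag so a late Copy returns an error instead of panicking:

```go
package main

import (
	"errors"
	"fmt"
	"sync"
)

// guardedSession wraps a shared session-like resource so that Copy
// after Close returns an error rather than panicking the process,
// which is what mgo's "Session already closed" panic does.
type guardedSession struct {
	mu     sync.Mutex
	closed bool
}

// Copy returns a fresh session, or an error if Close already ran.
func (s *guardedSession) Copy() (*guardedSession, error) {
	s.mu.Lock()
	defer s.mu.Unlock()
	if s.closed {
		return nil, errors.New("session already closed")
	}
	return &guardedSession{}, nil
}

// Close marks the session unusable for future Copy calls.
func (s *guardedSession) Close() {
	s.mu.Lock()
	defer s.mu.Unlock()
	s.closed = true
}

func main() {
	s := &guardedSession{}
	s.Close()
	if _, err := s.Copy(); err != nil {
		fmt.Println("copy refused:", err)
	}
}
```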

Here's machine-1.log starting from a few minutes before the upgrade ("upgrade requested from 2.2.8 to 2.2.9" logged at 02:25:28) : https://private-fileshare.canonical.com/~pjdc/machine-1.log-2.2.9-upgrade.xz

And some controller info:

applications: 946
machines: 558
models: 179
units: 2482
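For reference, per-process FD counts like the ones reported above can be sampled without a full lsof by listing /proc/&lt;pid&gt;/fd. A minimal sketch (Linux-specific; the pid value here is just an example):

```go
package main

import (
	"fmt"
	"os"
)

// countFDs returns the number of open file descriptors for a process
// by counting entries in /proc/<pid>/fd. Linux-specific; "self" works
// for the current process.
func countFDs(pid string) (int, error) {
	entries, err := os.ReadDir("/proc/" + pid + "/fd")
	if err != nil {
		return 0, err
	}
	return len(entries), nil
}

func main() {
	n, err := countFDs("self")
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println("open fds:", n)
}
```

Sampling this in a loop would show whether the FD count is still climbing after a restart.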

Joel Sing (jsing) wrote:

Running out of file descriptors sounds like LP#1739380.

Tim Penhey (thumper) wrote:

It does, and it looks like it wasn't just the iterators :-(

John A Meinel (jameinel) wrote:

Looking at the logs, the controller was *very* unhappy before the upgrade triggered. The log file starts with:
machine-log pre upgrade
2018-01-11 02:19:41 INFO juju.api apiclient.go:617 connection established to "wss://localhost:17070/model/972ef575-2d51-4a97-85c6-34b9bfc272ac/api"
2018-01-11 02:19:41 INFO juju.api apiclient.go:617 connection established to "wss://10.25.2.111:17070/model/d8d6172f-2e4b-4456-8feb-3def756c2300/api"
2018-01-11 02:19:41 INFO juju.api apiclient.go:617 connection established to "wss://localhost:17070/model/403fcc26-10a9-4a01-8853-6b2d2dc170b6/api"
2018-01-11 02:19:41 INFO juju.api apiclient.go:617 connection established to "wss://localhost:17070/model/393e0299-2f17-4608-8e66-5c3bdf41d47e/api"
2018-01-11 02:19:41 INFO juju.api apiclient.go:617 connection established to "wss://10.25.2.111:17070/model/f7dd66cc-63ab-4373-8b9c-ef4450bba54c/api"
2018-01-11 02:19:41 INFO juju.api apiclient.go:617 connection established to "wss://10.25.2.111:17070/model/f524d1a3-2c43-4fda-81b0-a90839c413bf/api"
2018-01-11 02:19:41 INFO juju.api apiclient.go:617 connection established to "wss://localhost:17070/model/664c4806-8e91-4a36-832f-97e31679d697/api"
2018-01-11 02:19:41 INFO juju.api apiclient.go:617 connection established to "wss://localhost:17070/model/34445fff-8911-4fb3-879e-484e21ef4352/api"
2018-01-11 02:19:41 INFO juju.api apiclient.go:617 connection established to "wss://10.25.2.109:17070/model/0554bee5-31b3-4dd3-8404-db1b0e68398c/api"
2018-01-11 02:19:41 INFO juju.api apiclient.go:617 connection established to "wss://localhost:17070/model/69f7e364-670e-44ef-898b-b5492ea4f308/api"
2018-01-11 02:19:41 INFO juju.api apiclient.go:617 connection established to "wss://localhost:17070/model/4d8861b0-adf6-4010-82e0-978c8fc58dc5/api"
2018-01-11 02:19:41 INFO juju.api apiclient.go:617 connection established to "wss://localhost:17070/model/1ce99b16-50be-4930-8221-e782cc942463/api"
2018-01-11 02:19:41 INFO juju.api apiclient.go:617 connection established to "wss://10.25.2.109:17070/model/84538afb-d225-43a2-8fd9-76c5c6661d03/api"
...

This looks like lots of workers just starting to connect to the controller, indicating they had either recently started, or been restarted, or something.
It then follows with a lot of messages like:
2018-01-11 02:19:46 ERROR juju.worker.dependency engine.go:551 "api-caller" manifold worker returned unexpected error: cannot open api: try again (try again)
2018-01-11 02:19:46 ERROR juju.worker.dependency engine.go:551 "api-caller" manifold worker returned unexpected error: cannot open api: try again (try again)

That would hint that *some* of these are getting rate limited.

Then we see a huge number of these failures:
2018-01-11 02:19:49 ERROR juju.api.watcher watcher.go:90 error trying to stop watcher: txn watcher restarted

The core txn watcher restarting is a rather serious event, and probably does cause a lot of internal churn.

The "error trying to stop watcher" message feels very weird. Why would we get an error while stopping the watcher? Maybe our teardown is hitting an error mid-way and failing to finish cleaning up what it can?

A couple minutes later we're still getting rate limited:
2018-01-11 0...


Anastasia (anastasia-macmood) wrote:

I am not sure that we are going to do another 2.2.x release. But I'll add it to the bug report for @John A Meinel (jameinel) and @Tim Penhey (thumper) to decide. It might be that we'd need to provide a script to fix broken environments.

I am triaging this as Critical for 2.3 and High for 2.4 as we'd need a fix/upgrade step.

Changed in juju:
status: New → Triaged
importance: Undecided → High
tags: added: upgrade-juju
Tim Penhey (thumper) wrote:

I think this is related to bug 1746265.
