problems upgrading from 2.2.8 to 2.2.9

Bug #1742604 reported by Paul Collins on 2018-01-11
This bug affects 3 people
Affects     Importance   Assigned to
juju        High         Unassigned
juju 2.2    High         Unassigned
juju 2.3    High         Unassigned

Bug Description

Today I upgraded a large Juju 2 controller from 2.2.8 to 2.2.9. Machines 0 and 2 registered the requested upgrade, but machine 1 did not. I restarted it manually at 02:38:18 and again at 02:44:19 to try to "unstick" it. After the first restart, it seemingly started consuming a large number of file descriptors. I was not able to capture a full lsof, but jujud-machine-1 had 18992 FDs and the mongod primary process (running on the same machine) had 6340. There were also two panics, with the following backtrace, at 02:47:56 and 02:52:42:

2018-01-11 02:52:42 INFO juju.worker runner.go:483 stopped "apiserver", err: error pinging mongo: Closed explicitly
panic: Session already closed

goroutine 925137 [running]:
gopkg.in/mgo%2ev2.(*Session).cluster(0xc420235ba0, 0x41331f)
        /tmp/tmpOjeNS0/juju-core_2.2.9/src/gopkg.in/mgo.v2/session.go:1624 +0x81
gopkg.in/mgo%2ev2.copySession(0xc420235ba0, 0xc42cc29801, 0x4473f8)
        /tmp/tmpOjeNS0/juju-core_2.2.9/src/gopkg.in/mgo.v2/session.go:540 +0x40
gopkg.in/mgo%2ev2.(*Session).Copy(0xc420235ba0, 0x627)
        /tmp/tmpOjeNS0/juju-core_2.2.9/src/gopkg.in/mgo.v2/session.go:1590 +0x40
github.com/juju/juju/state.(*StatePool).openState(0xc4203b5920, 0xc43a0fdf80, 0x24, 0x800, 0xc46b4d7800, 0x627)
        /tmp/tmpOjeNS0/juju-core_2.2.9/src/github.com/juju/juju/state/pool.go:155 +0x39
github.com/juju/juju/state.(*StatePool).Get(0xc4203b5920, 0xc43a0fdf80, 0x24, 0x0, 0x0, 0x0, 0x0)
        /tmp/tmpOjeNS0/juju-core_2.2.9/src/github.com/juju/juju/state/pool.go:140 +0x332
github.com/juju/juju/state/statemetrics.statePoolShim.Get(0xc4203b5920, 0xc43a0fdf80, 0x24, 0x0, 0x0, 0xc44e6ccb70, 0xc447e28d40, 0x0)
        /tmp/tmpOjeNS0/juju-core_2.2.9/src/github.com/juju/juju/state/statemetrics/state.go:65 +0x43
github.com/juju/juju/state/statemetrics.(*Collector).updateModelMetrics(0xc420013a40, 0x3b7c220, 0xc45f424c00)
        /tmp/tmpOjeNS0/juju-core_2.2.9/src/github.com/juju/juju/state/statemetrics/statemetrics.go:196 +0x1c4
github.com/juju/juju/state/statemetrics.(*Collector).updateMetrics(0xc420013a40)
        /tmp/tmpOjeNS0/juju-core_2.2.9/src/github.com/juju/juju/state/statemetrics/statemetrics.go:147 +0x1fd
github.com/juju/juju/state/statemetrics.(*Collector).Collect(0xc420013a40, 0xc421b3a060)
        /tmp/tmpOjeNS0/juju-core_2.2.9/src/github.com/juju/juju/state/statemetrics/statemetrics.go:128 +0x1ae
github.com/prometheus/client_golang/prometheus.(*Registry).Gather.func2(0xc43688bd70, 0xc421b3a060, 0x3b75320, 0xc420013a40)
        /tmp/tmpOjeNS0/juju-core_2.2.9/src/github.com/prometheus/client_golang/prometheus/registry.go:382 +0x61
created by github.com/prometheus/client_golang/prometheus.(*Registry).Gather
        /tmp/tmpOjeNS0/juju-core_2.2.9/src/github.com/prometheus/client_golang/prometheus/registry.go:383 +0x2ec
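The panic above is mgo's behaviour when Session.Copy is called on a session that has already been closed. A common defensive pattern (an illustrative sketch only, not code from juju or mgo, with hypothetical names) is to guard the shared session with a mutex and a closed flag so a late Copy returns an error instead of panicking:

```go
package main

import (
	"errors"
	"fmt"
	"sync"
)

// guardedSession wraps a shared session-like resource so that Copy
// after Close returns an error rather than panicking the process,
// which is what mgo's "Session already closed" panic does.
type guardedSession struct {
	mu     sync.Mutex
	closed bool
}

// Copy returns a fresh session, or an error if Close already ran.
func (s *guardedSession) Copy() (*guardedSession, error) {
	s.mu.Lock()
	defer s.mu.Unlock()
	if s.closed {
		return nil, errors.New("session already closed")
	}
	return &guardedSession{}, nil
}

// Close marks the session unusable for future Copy calls.
func (s *guardedSession) Close() {
	s.mu.Lock()
	defer s.mu.Unlock()
	s.closed = true
}

func main() {
	s := &guardedSession{}
	s.Close()
	if _, err := s.Copy(); err != nil {
		fmt.Println("copy refused:", err)
	}
}
```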

Here's machine-1.log starting from a few minutes before the upgrade ("upgrade requested from 2.2.8 to 2.2.9" logged at 02:25:28) : https://private-fileshare.canonical.com/~pjdc/machine-1.log-2.2.9-upgrade.xz

And some controller info:

applications: 946
machines: 558
models: 179
units: 2482
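For reference, per-process FD counts like the ones reported above can be sampled without a full lsof by listing /proc/&lt;pid&gt;/fd. A minimal sketch (Linux-specific; the pid value here is just an example):

```go
package main

import (
	"fmt"
	"os"
)

// countFDs returns the number of open file descriptors for a process
// by counting entries in /proc/<pid>/fd. Linux-specific; "self" works
// for the current process.
func countFDs(pid string) (int, error) {
	entries, err := os.ReadDir("/proc/" + pid + "/fd")
	if err != nil {
		return 0, err
	}
	return len(entries), nil
}

func main() {
	n, err := countFDs("self")
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println("open fds:", n)
}
```

Sampling this in a loop would show whether the FD count is still climbing after a restart.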

Joel Sing (jsing) wrote:

Running out of file descriptors sounds like LP#1739380.

Tim Penhey (thumper) wrote:

It does, and it looks like it wasn't just the iterators :-(

John A Meinel (jameinel) wrote:

Looking at the logs, the controller was *very* unhappy before the upgrade triggered. The log file starts with:
machine-log pre upgrade
2018-01-11 02:19:41 INFO juju.api apiclient.go:617 connection established to "wss://localhost:17070/model/972ef575-2d51-4a97-85c6-34b9bfc272ac/api"
2018-01-11 02:19:41 INFO juju.api apiclient.go:617 connection established to "wss://10.25.2.111:17070/model/d8d6172f-2e4b-4456-8feb-3def756c2300/api"
2018-01-11 02:19:41 INFO juju.api apiclient.go:617 connection established to "wss://localhost:17070/model/403fcc26-10a9-4a01-8853-6b2d2dc170b6/api"
2018-01-11 02:19:41 INFO juju.api apiclient.go:617 connection established to "wss://localhost:17070/model/393e0299-2f17-4608-8e66-5c3bdf41d47e/api"
2018-01-11 02:19:41 INFO juju.api apiclient.go:617 connection established to "wss://10.25.2.111:17070/model/f7dd66cc-63ab-4373-8b9c-ef4450bba54c/api"
2018-01-11 02:19:41 INFO juju.api apiclient.go:617 connection established to "wss://10.25.2.111:17070/model/f524d1a3-2c43-4fda-81b0-a90839c413bf/api"
2018-01-11 02:19:41 INFO juju.api apiclient.go:617 connection established to "wss://localhost:17070/model/664c4806-8e91-4a36-832f-97e31679d697/api"
2018-01-11 02:19:41 INFO juju.api apiclient.go:617 connection established to "wss://localhost:17070/model/34445fff-8911-4fb3-879e-484e21ef4352/api"
2018-01-11 02:19:41 INFO juju.api apiclient.go:617 connection established to "wss://10.25.2.109:17070/model/0554bee5-31b3-4dd3-8404-db1b0e68398c/api"
2018-01-11 02:19:41 INFO juju.api apiclient.go:617 connection established to "wss://localhost:17070/model/69f7e364-670e-44ef-898b-b5492ea4f308/api"
2018-01-11 02:19:41 INFO juju.api apiclient.go:617 connection established to "wss://localhost:17070/model/4d8861b0-adf6-4010-82e0-978c8fc58dc5/api"
2018-01-11 02:19:41 INFO juju.api apiclient.go:617 connection established to "wss://localhost:17070/model/1ce99b16-50be-4930-8221-e782cc942463/api"
2018-01-11 02:19:41 INFO juju.api apiclient.go:617 connection established to "wss://10.25.2.109:17070/model/84538afb-d225-43a2-8fd9-76c5c6661d03/api"
...

This looks like lots of workers just starting to connect to the controller, indicating they had either recently started, or been restarted, or something.
It then follows with a lot of messages like:
2018-01-11 02:19:46 ERROR juju.worker.dependency engine.go:551 "api-caller" manifold worker returned unexpected error: cannot open api: try again (try again)
2018-01-11 02:19:46 ERROR juju.worker.dependency engine.go:551 "api-caller" manifold worker returned unexpected error: cannot open api: try again (try again)

That would hint that *some* of these are getting rate limited.

Then we see a huge number of these failures:
2018-01-11 02:19:49 ERROR juju.api.watcher watcher.go:90 error trying to stop watcher: txn watcher restarted

The core txn watcher restarting is a rather serious event, and probably does cause a lot of internal churn.

The "error trying to stop watcher" message feels very weird. Why would we get an error while stopping the watcher? Maybe our teardown is hitting an error mid-way and failing to finish cleaning up what it can?

A couple minutes later we're still getting rate limited:
2018-01-11 0...


Anastasia (anastasia-macmood) wrote:

I am not sure that we are going to do another 2.2.x release. But I'll add it to the bug report for @John A Meinel (jameinel) and @Tim Penhey (thumper) to decide. It might be that we'd need to provide a script to fix broken environments.

I am triaging this as Critical for 2.3 and High for 2.4 as we'd need a fix/upgrade step.

Changed in juju:
status: New → Triaged
importance: Undecided → High
tags: added: upgrade-juju
Tim Penhey (thumper) wrote:

I think this is related to bug 1746265.
