charm hook failures after controller model upgrade from 2.1.2 to 2.3.4

Bug #1755155 reported by Sandor Zeestraten
This bug affects 3 people
Affects: Canonical Juju
Status: Incomplete
Importance: High
Assigned to: Unassigned

Bug Description

# Quick summary
After upgrading a controller from 2.1 to 2.2 or newer, you must upgrade the model agents as well, or you may see errors such as "unexpected end of JSON input".
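
For this deployment, that meant running something like the following for the affected model:

$ juju upgrade-juju -m openstack --agent-version 2.3.4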

# Issue
I upgraded our juju controllers (3x HA, MAAS cloud) from 2.1.2 to 2.3.4 with `juju upgrade-juju -m controller --agent-version 2.3.4`.
This took about 15 min and the upgrade seemed to be successful on the controller model.
Directly after this, 2 of the charms deployed in another model started reporting hook failures for all of their units.
The charms in question are `nova-cloud-controller` and `nova-compute`, deployed on a model called `openstack`.

Please note that the juju environment and charms were working fine before the upgrade.
I have tested this exact upgrade scenario (2.1.2 to 2.3.4, 3x HA, same cloud, same base deployment of openstack) multiple times in staging previously without running into this issue.
Also, when looking at the logs for the controllers and the failing juju units, there are some seemingly related connection error messages which weren't there before the upgrade.
So I think it is safe to say that this is not a charm issue, but a juju issue.

Any help would be appreciated.

# Logs
N.B. All the units show the same error messages, i.e. nova-cloud-controller/0, nova-cloud-controller/1 and nova-cloud-controller/2 have the same messages and the exact same traceback in the logs.
The same goes for the controllers and the nova-compute charm, so I have only included one excerpt from each.

## Controllers
`juju status -m controller --format yaml`: https://pastebin.com/kaxpuL8M
`juju ssh -m controller 0 'sudo less /var/log/juju/machine-0.log'`: https://pastebin.com/L3hFuZvb
All the controllers show some variation of these error messages. The IPs seem to correspond to connections from the juju controller to the nova-cloud-controller units.

## nova-cloud-controller
`juju status -m openstack nova-cloud-controller --format yaml`: https://pastebin.com/uiS2BXBX
`juju ssh nova-cloud-controller/0 'sudo less /var/log/juju/unit-nova-cloud-controller-0.log'`: https://pastebin.com/GYKBASUu

## nova-compute
`juju status -m openstack nova-compute --format yaml`: https://pastebin.com/6XJcLWNL
`juju ssh -m openstack nova-compute/0 'sudo less /var/log/juju/unit-nova-compute-0.log'`: https://pastebin.com/Pp4PymTa

# Troubleshooting steps
* Tried to restart the juju controllers one by one
* Tried to restart the nova-cloud-controller and nova-compute juju unit services (i.e. jujud-unit-nova-cloud-controller-0.service, jujud-unit-nova-compute-0.service, etc.); see the example commands at the end of this section

* Running the relation-get command manually for nova-cloud-controller:
$ juju run --unit nova-cloud-controller/0 'relation-get --format=json -r identity-service:28 - keystone/0'
{"admin_token":"redacted","api_version":"2","auth_host":"keystone.maas","auth_port":"35357","auth_protocol":"http","private-address":"aa.bb.2.130","service_host":"keystone.maas","service_password":"redacted","service_port":"5000","service_protocol":"http","service_tenant":"services","service_tenant_id":"4baaf52f802a47fa8309b56c10b95e6c","service_username":"nova"}

$ juju run --unit nova-cloud-controller/1 'relation-get --format=json -r identity-service:28 - keystone/0'
{"admin_token":"redacted","api_version":"2","auth_host":"keystone.maas","auth_port":"35357","auth_protocol":"http","private-address":"aa.bb.2.130","service_host":"keystone.maas","service_password":"redacted","service_port":"5000","service_protocol":"http","service_tenant":"services","service_tenant_id":"4baaf52f802a47fa8309b56c10b95e6c","service_username":"nova"}

$ juju run --unit nova-cloud-controller/2 'relation-get --format=json -r identity-service:28 - keystone/0'
{"admin_token":"redacted","api_version":"2","auth_host":"keystone.maas","auth_port":"35357","auth_protocol":"http","private-address":"aa.bb.2.130","service_host":"keystone.maas","service_password":"redacted","service_port":"5000","service_protocol":"http","service_tenant":"services","service_tenant_id":"4baaf52f802a47fa8309b56c10b95e6c","service_username":"nova"}

* Running the relation-get command manually for nova-compute:
$ juju run --unit nova-compute/0 'relation-get --format=json -r cloud-compute:38 network_manager nova-cloud-controller/0'
ERROR timed out waiting for result from: unit nova-compute/0
...
$ juju run --unit nova-compute/19 'relation-get --format=json -r cloud-compute:38 network_manager nova-cloud-controller/0'
ERROR timed out waiting for result from: unit nova-compute/1

As noted above, running the command which is causing these hook failures manually works fine for the nova-cloud-controller units, but not for the nova-compute units.
I do not understand why the former is failing in its hooks at all, while the latter is most likely failing because the relation in question points at nova-cloud-controller, which is itself failing.
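
The restarts in the first two bullets above were essentially of this form (jujud-machine-0.service being the usual controller machine agent service name on systemd hosts; unit numbers vary):

$ juju ssh -m controller 0 'sudo systemctl restart jujud-machine-0.service'
$ juju ssh nova-cloud-controller/0 'sudo systemctl restart jujud-unit-nova-cloud-controller-0.service'
$ juju ssh -m openstack nova-compute/0 'sudo systemctl restart jujud-unit-nova-compute-0.service'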

# Versions
Juju 2.1.2 (before upgrade), 2.3.4 (after upgrade)
MAAS 2.1.2

John A Meinel (jameinel) wrote :

If you have a very large configuration, I think upgrading the agents in the model might fix some of the errors like "unexpected end of JSON input".

IIRC, the issue is that in 2.3 (2.2?) we introduced a json/websocket library that can transmit responses in frames, which ended up not being supported by older clients.

And looking at the status output, it does look like all the agents in the "openstack" model are running 2.1.2 still, while the controller is running 2.3.4.

That should be as simple as:
 juju upgrade-juju -m openstack

(if necessary, you might pass --agent-version=2.3.4)
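
Once that completes, something like:

 juju status -m openstack --format yaml

should show the agents in that model reporting 2.3.4 instead of 2.1.2.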

Sandor Zeestraten (szeestraten) wrote :

Hey John, thank you for taking a look.

We haven't upgraded the deployed model yet as we didn't expect the upgrade of the controller model to have this side effect of breaking the deployed model. As you can probably understand, we didn't want to push through an update while things are in a broken state.

A couple of quick questions:
1. How come none of the many other charms in the "openstack" model are complaining about this?
2. Is there a specific issue/fix related to your hunch? As I said earlier, we didn't see this in multiple rounds of staging, which has only a few fewer machines/charms deployed.
3. Will the agents for the nova-cloud-controller and nova-compute charms in question handle being upgraded even though they are experiencing hook failures?
4. In order to double-check things, will the --dry-run flag do any useful prechecks in this case of upgrading a model?

John A Meinel (jameinel) wrote :

The bug in question is bug #1697936

Essentially, there was an issue between 2.1 and 2.2 which only seems to trigger if your charm has a large config. I believe where we encountered it was nova-compute, which shares a certificate with keystone that can be fairly large.

I do believe the agents can be upgraded even if they are in error. It may require running "juju resolved", but I believe the upgrade supersedes running hooks.
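
For each unit stuck in an error state that would be something like:

 juju resolved nova-cloud-controller/0

(repeated for each failed unit).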

I don't know why you wouldn't have seen it in staging, but maybe you aren't using SSL there? (And thus not generating certificates, which would otherwise increase the size of relation data/config?)

I don't believe that --dry-run will tell you much more than what version it would want to upgrade to.

John A Meinel (jameinel) wrote :

I *think* this is just a symptom of bug #1697936, which can be fixed by upgrading the model agents (vs the controller agents).

Changed in juju:
importance: Undecided → High
status: New → Incomplete
Sandor Zeestraten (szeestraten) wrote :

Hey John, we upgraded the model, which resolved the issue as you suspected. Thank you very much for the help! I'm a bit unsure of the correct way to mark this bug besides duplicate, so I'll leave that to you if that's OK.

I'm adding some of the error messages below for future folks.

Charm log:
ERROR juju.api.watcher watcher.go:87 error trying to stop watcher: connection is shut down
INFO leader-settings-changed error: codec.ReadHeader error: error receiving message: unexpected end of JSON input
ERROR juju.worker.dependency engine.go:547 "api-caller" manifold worker returned unexpected error: codec.ReadHeader error: error receiving message: unexpected end of JSON input

Controller log:
ERROR juju.rpc server.go:562 error writing response: *websocket.netError write tcp aa.bb.2.175:17070->aa.bb.2.171:45406: write: connection reset by peer
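
For future reference, these can be grepped straight out of the unit and machine logs, e.g.:

$ juju ssh nova-compute/0 'sudo grep "unexpected end of JSON" /var/log/juju/unit-nova-compute-0.log'
$ juju ssh -m controller 0 'sudo grep "error writing response" /var/log/juju/machine-0.log'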
