failed unit due to "shutting down: open [...]/charm/metadata.yaml: no such file or directory"

Bug #1882600 reported by Paul Collins on 2020-06-08
This bug affects 2 people

Affects: juju
Status: Triaged
Importance: Medium
Assigned to: Unassigned

Bug Description

I've ended up with a failed k8s workload charm unit in a different model from the one in LP:1882146 ("cannot exec"). However, there are differences.

1) The error is:

application-mattermost: 2020-06-08 21:37:32 DEBUG juju.worker.leadership mattermost/15 waiting for mattermost leadership release gave err: error blocking on leadership release: connection is shut down
application-mattermost: 2020-06-08 21:37:32 DEBUG juju.worker.caasoperator killing "mattermost/15"
application-mattermost: 2020-06-08 21:37:32 INFO juju.worker.caasoperator stopped "mattermost/15", err: leadership failure: error making a leadership claim: connection is shut down
application-mattermost: 2020-06-08 21:37:32 DEBUG juju.worker.caasoperator "mattermost/15" done: leadership failure: error making a leadership claim: connection is shut down
application-mattermost: 2020-06-08 21:37:32 ERROR juju.worker.caasoperator exited "mattermost/15": leadership failure: error making a leadership claim: connection is shut down
application-mattermost: 2020-06-08 21:37:32 DEBUG juju.worker.caasoperator no restart, removing "mattermost/15" from known workers
application-mattermost: 2020-06-08 21:37:40 DEBUG juju.worker.uniter starting uniter for "mattermost/15"
application-mattermost: 2020-06-08 21:37:40 DEBUG juju.worker.caasoperator start "mattermost/15"
application-mattermost: 2020-06-08 21:37:40 INFO juju.worker.caasoperator start "mattermost/15"
application-mattermost: 2020-06-08 21:37:40 DEBUG juju.worker.caasoperator "mattermost/15" started
application-mattermost: 2020-06-08 21:37:40 DEBUG juju.worker.leadership mattermost/15 making initial claim for mattermost leadership
application-mattermost: 2020-06-08 21:37:40 INFO juju.worker.uniter unit "mattermost/15" started
application-mattermost: 2020-06-08 21:37:50 INFO juju.worker.leadership mattermost leadership for mattermost/15 denied
application-mattermost: 2020-06-08 21:37:50 DEBUG juju.worker.leadership mattermost/15 is not mattermost leader
application-mattermost: 2020-06-08 21:37:50 DEBUG juju.worker.leadership mattermost/15 waiting for mattermost leadership release
application-mattermost: 2020-06-08 21:37:51 INFO juju.worker.uniter unit "mattermost/15" shutting down: open /var/lib/juju/agents/unit-mattermost-15/charm/metadata.yaml: no such file or directory
application-mattermost: 2020-06-08 21:37:51 DEBUG juju.worker.uniter.remotestate got leadership change for mattermost/15: leader
application-mattermost: 2020-06-08 21:37:51 INFO juju.worker.caasoperator stopped "mattermost/15", err: open /var/lib/juju/agents/unit-mattermost-15/charm/metadata.yaml: no such file or directory
application-mattermost: 2020-06-08 21:37:51 DEBUG juju.worker.caasoperator "mattermost/15" done: open /var/lib/juju/agents/unit-mattermost-15/charm/metadata.yaml: no such file or directory
application-mattermost: 2020-06-08 21:37:51 ERROR juju.worker.caasoperator exited "mattermost/15": open /var/lib/juju/agents/unit-mattermost-15/charm/metadata.yaml: no such file or directory
application-mattermost: 2020-06-08 21:37:51 INFO juju.worker.caasoperator restarting "mattermost/15" in 3s
application-mattermost: 2020-06-08 21:37:54 INFO juju.worker.caasoperator start "mattermost/15"
application-mattermost: 2020-06-08 21:37:54 DEBUG juju.worker.caasoperator "mattermost/15" started

2) Restarting the controller does not fix the problem.

Next I tried scaling down the application to 0 units. The other two units also got stuck in a similar although perhaps not identical state.
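
A minimal sketch of that scale-down step, assuming the application name from this report:

  juju scale-application mattermost 0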

Then I thought I'd try copying the charm back into the unit directories on the mattermost-operator-0 pod, to see what would happen. This triggered a panic in one of the units and put the other two into a state where bouncing the controller did remove them.

So at least we have a workaround, although since this model is hosting a soon-to-be-production service, it'd be nice not to have to rely on it.
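
A minimal sketch for confirming that the unit's charm copy really is gone inside the operator pod; the <model-namespace> placeholder is an assumption, not taken from the log:

  # Check the path from the error message inside the operator pod.
  kubectl -n <model-namespace> exec mattermost-operator-0 -- \
    ls -l /var/lib/juju/agents/unit-mattermost-15/charm/metadata.yaml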

Paul Collins (pjdc) on 2020-06-08
description: updated
Paul Collins (pjdc) on 2020-06-08
description: updated
Ian Booth (wallyworld) wrote :

Is this juju 2.8.0?
Are the reproduction steps the same as the referenced bug?

Paul Collins (pjdc) wrote :

Yes, Juju 2.8.0. The reproduction steps are not entirely clear to me at this time.

However, it also just happened over the weekend when nobody was working on the model. Here's the "juju debug-log" output; mattermost/64 is the unit that got stuck: https://private-fileshare.canonical.com/~pjdc/lp1882600.txt

I was able to get rid of the stuck unit by copying the charm directory from application-mattermost to unit-mattermost-64 and bouncing the controllers.
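
A minimal sketch of that workaround, assuming kubectl access to the operator pod and a machine-based controller; the namespace placeholder and the exact controller restart step are assumptions, not taken from this report:

  # Copy the application's charm directory back into the stuck unit's agent
  # directory on the operator pod (pod, unit, and paths as described above).
  kubectl -n <model-namespace> exec mattermost-operator-0 -- \
    cp -a /var/lib/juju/agents/application-mattermost/charm \
          /var/lib/juju/agents/unit-mattermost-64/charm

  # Then bounce the controller agent; on a machine-based controller this is
  # roughly the following (the service name varies with the machine ID):
  juju ssh -m controller 0 'sudo systemctl restart jujud-machine-0'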

Barry Price (barryprice) wrote :

Ran into this same issue today after an upgrade to 2.8.3.

Changed in juju:
status: New → Triaged
importance: Undecided → Medium
Paul Collins (pjdc) on 2021-01-05
description: updated