failed unit due to "shutting down: open [...]/charm/metadata.yaml: no such file or directory"

Bug #1882600 reported by Paul Collins
This bug affects 2 people
Affects: Canonical Juju
Status: Fix Released
Importance: High
Assigned to: Ian Booth
Milestone: 2.9.14

Bug Description

I've ended up with a failed k8s workload charm unit in a different model from the one in LP:1882146 ("cannot exec"). However, there are differences.

1) The error is:

application-mattermost: 2020-06-08 21:37:32 DEBUG juju.worker.leadership mattermost/15 waiting for mattermost leadership release gave err: error blocking on leadership release: connection is shut down
application-mattermost: 2020-06-08 21:37:32 DEBUG juju.worker.caasoperator killing "mattermost/15"
application-mattermost: 2020-06-08 21:37:32 INFO juju.worker.caasoperator stopped "mattermost/15", err: leadership failure: error making a leadership claim: connection is shut down
application-mattermost: 2020-06-08 21:37:32 DEBUG juju.worker.caasoperator "mattermost/15" done: leadership failure: error making a leadership claim: connection is shut down
application-mattermost: 2020-06-08 21:37:32 ERROR juju.worker.caasoperator exited "mattermost/15": leadership failure: error making a leadership claim: connection is shut down
application-mattermost: 2020-06-08 21:37:32 DEBUG juju.worker.caasoperator no restart, removing "mattermost/15" from known workers
application-mattermost: 2020-06-08 21:37:40 DEBUG juju.worker.uniter starting uniter for "mattermost/15"
application-mattermost: 2020-06-08 21:37:40 DEBUG juju.worker.caasoperator start "mattermost/15"
application-mattermost: 2020-06-08 21:37:40 INFO juju.worker.caasoperator start "mattermost/15"
application-mattermost: 2020-06-08 21:37:40 DEBUG juju.worker.caasoperator "mattermost/15" started
application-mattermost: 2020-06-08 21:37:40 DEBUG juju.worker.leadership mattermost/15 making initial claim for mattermost leadership
application-mattermost: 2020-06-08 21:37:40 INFO juju.worker.uniter unit "mattermost/15" started
application-mattermost: 2020-06-08 21:37:50 INFO juju.worker.leadership mattermost leadership for mattermost/15 denied
application-mattermost: 2020-06-08 21:37:50 DEBUG juju.worker.leadership mattermost/15 is not mattermost leader
application-mattermost: 2020-06-08 21:37:50 DEBUG juju.worker.leadership mattermost/15 waiting for mattermost leadership release
application-mattermost: 2020-06-08 21:37:51 INFO juju.worker.uniter unit "mattermost/15" shutting down: open /var/lib/juju/agents/unit-mattermost-15/charm/metadata.yaml: no such file or directory
application-mattermost: 2020-06-08 21:37:51 DEBUG juju.worker.uniter.remotestate got leadership change for mattermost/15: leader
application-mattermost: 2020-06-08 21:37:51 INFO juju.worker.caasoperator stopped "mattermost/15", err: open /var/lib/juju/agents/unit-mattermost-15/charm/metadata.yaml: no such file or directory
application-mattermost: 2020-06-08 21:37:51 DEBUG juju.worker.caasoperator "mattermost/15" done: open /var/lib/juju/agents/unit-mattermost-15/charm/metadata.yaml: no such file or directory
application-mattermost: 2020-06-08 21:37:51 ERROR juju.worker.caasoperator exited "mattermost/15": open /var/lib/juju/agents/unit-mattermost-15/charm/metadata.yaml: no such file or directory
application-mattermost: 2020-06-08 21:37:51 INFO juju.worker.caasoperator restarting "mattermost/15" in 3s
application-mattermost: 2020-06-08 21:37:54 INFO juju.worker.caasoperator start "mattermost/15"
application-mattermost: 2020-06-08 21:37:54 DEBUG juju.worker.caasoperator "mattermost/15" started

2) Restarting the controller does not fix the problem.
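
For what it's worth, the missing file can be confirmed from inside the operator pod. This is only a sketch; the namespace (normally the model name) is an assumption on my part:

    # assumption: the operator pod runs in a namespace named after the model
    kubectl exec -n <model-name> mattermost-operator-0 -- \
        ls /var/lib/juju/agents/unit-mattermost-15/charm/metadata.yaml
    # in the broken state this reports "No such file or directory",
    # matching the uniter error in the log above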

Next I tried scaling down the application to 0 units. The other two units also got stuck in a similar, although perhaps not identical, state.
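
(That was with the standard scaling command for a k8s model, roughly:)

    juju scale-application mattermost 0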

Then I thought I'd try copying the charm back into the unit directories on the mattermost-operator-0 pod, to see what would happen. This triggered a panic in one of the units and left the other two in a state where bouncing the controller did remove them.

So at least we have a workaround, although since this model is hosting a soon-to-be-production service, it'd be nice not to have to rely on it.
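
For reference, the copy step looks roughly like this; the source path and the namespace are assumptions on my part, and how you bounce the controller will depend on your deployment:

    # assumption: namespace == model name; source is the operator's application charm dir
    kubectl exec -n <model-name> mattermost-operator-0 -- \
        cp -r /var/lib/juju/agents/application-mattermost/charm \
              /var/lib/juju/agents/unit-mattermost-15/
    # (if a partial charm directory is already present under the unit dir,
    #  copy its contents instead)
    # then bounce (restart) the controller so the stuck unit gets cleaned up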

Paul Collins (pjdc)
description: updated
Paul Collins (pjdc)
description: updated
Ian Booth (wallyworld) wrote :

Is this juju 2.8.0?
Are the reproduction steps the same as the referenced bug?

Paul Collins (pjdc) wrote :

Yes, Juju 2.8.0. The reproduction steps are not entirely clear to me at this time.

However, it also just happened over the weekend when nobody was working on the model. Here's "juju debug-log"; mattermost/64 is the unit that got stuck: https://private-fileshare.canonical.com/~pjdc/lp1882600.txt

I was able to get rid of the stuck unit by copying the charm directory from application-mattermost to unit-mattermost-64 and bouncing the controllers.

Barry Price (barryprice) wrote :

Ran into this same issue today after an upgrade to 2.8.3

Pen Gale (pengale)
Changed in juju:
status: New → Triaged
importance: Undecided → Medium
Paul Collins (pjdc)
description: updated
Ian Booth (wallyworld) wrote :

Seen again today with mattermost charm

Changed in juju:
milestone: none → 2.9.12
importance: Medium → High
Haw Loeung (hloeung) wrote :

Ran into this earlier on a model running Juju 2.8.7. The workaround of copying the charm directory on the operator pod worked.

Changed in juju:
milestone: 2.9.12 → 2.9.13
Ian Booth (wallyworld)
Changed in juju:
assignee: nobody → Ian Booth (wallyworld)
status: Triaged → In Progress
Ian Booth (wallyworld)
Changed in juju:
status: In Progress → Fix Committed
Ian Booth (wallyworld)
Changed in juju:
milestone: 2.9.13 → 2.9.14
Changed in juju:
status: Fix Committed → Fix Released