failed unit due to "cannot exec in a stopped state"

Bug #1882146 reported by Paul Collins
This bug affects 2 people
Affects         Status   Importance  Assigned to  Milestone
Canonical Juju  Triaged  Low         Unassigned

Bug Description

While testing a k8s workload charm with Juju 2.8-rc3, I executed the following changes in quick succession:

juju upgrade-charm [...] # caused pod spec changes, pods recycled
juju config [...] # also pod spec changes, pods recycled again
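
For reference, a minimal sketch of that sequence; the charm path and config key below are hypothetical placeholders rather than the values actually used:

juju upgrade-charm mattermost --path ./mattermost-charm    # hypothetical path; pod spec changes, Deployment rolls, pods recycled
juju config mattermost some-config-key=new-value           # hypothetical key; more pod spec changes, pods recycled again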

Final state:

[agnew(~)] juju status mattermost/6
Model       Controller  Cloud/Region   Version  SLA          Timestamp
mattermost  mm-rc3      k8s/localhost  2.8-rc3  unsupported  09:43:42+12:00

SAAS        Status  Store   URL
postgresql  active  mm-rc3  admin/database.postgresql

App         Version  Status  Scale  Charm       Store  Rev  OS          Address         Notes
mattermost           active  0/2    mattermost  local  4    kubernetes  10.152.183.202

Unit          Workload    Agent   Address    Ports     Message
mattermost/6  terminated  failed  10.1.1.93  8065/TCP  unit stopped by the cloud

[agnew(~)] juju debug-log --replay | grep mattermost/6
application-mattermost: 09:34:31 DEBUG juju.worker.uniter starting uniter for "mattermost/6"
application-mattermost: 09:34:31 DEBUG juju.worker.caasoperator start "mattermost/6"
application-mattermost: 09:34:31 INFO juju.worker.caasoperator start "mattermost/6"
application-mattermost: 09:34:31 DEBUG juju.worker.caasoperator "mattermost/6" started
application-mattermost: 09:34:31 DEBUG juju.worker.leadership mattermost/6 making initial claim for mattermost leadership
application-mattermost: 09:34:31 INFO juju.worker.leadership mattermost leadership for mattermost/6 denied
application-mattermost: 09:34:31 DEBUG juju.worker.leadership mattermost/6 waiting for mattermost leadership release
application-mattermost: 09:34:31 INFO juju.worker.uniter unit "mattermost/6" started
application-mattermost: 09:34:32 DEBUG juju.worker.leadership mattermost/6 is not mattermost leader
application-mattermost: 09:34:32 DEBUG juju.worker.caasoperator trigger running status for caas unit mattermost/6
application-mattermost: 09:34:35 DEBUG unit.mattermost/6.juju-log Legacy hooks/install does not exist.
application-mattermost: 09:34:35 DEBUG unit.mattermost/6.juju-log Emitting Juju event install.
application-mattermost: 09:34:35 DEBUG juju.worker.leadership mattermost/6 is not mattermost leader
application-mattermost: 09:34:39 DEBUG unit.mattermost/6.juju-log db:0: Legacy hooks/db-relation-created does not exist.
application-mattermost: 09:34:39 DEBUG unit.mattermost/6.juju-log db:0: Emitting Juju event db_relation_created.
application-mattermost: 09:34:42 DEBUG unit.mattermost/6.juju-log Legacy hooks/leader-settings-changed does not exist.
application-mattermost: 09:34:42 DEBUG unit.mattermost/6.juju-log Emitting Juju event leader_settings_changed.
application-mattermost: 09:34:42 DEBUG unit.mattermost/6.juju-log mirroring app relation data for relation 0
application-mattermost: 09:34:42 DEBUG juju.worker.leadership mattermost/6 is not mattermost leader
application-mattermost: 09:34:51 INFO juju.worker.uniter unit "mattermost/6" shutting down: executing operation "remote init": attempt count exceeded: container not running not found
application-mattermost: 09:34:51 DEBUG juju.worker.uniter.remotestate got leadership change for mattermost/6: leader
application-mattermost: 09:34:51 INFO juju.worker.caasoperator stopped "mattermost/6", err: executing operation "remote init": attempt count exceeded: container not running not found
application-mattermost: 09:34:51 DEBUG juju.worker.caasoperator "mattermost/6" done: executing operation "remote init": attempt count exceeded: container not running not found
application-mattermost: 09:34:51 ERROR juju.worker.caasoperator exited "mattermost/6": executing operation "remote init": attempt count exceeded: container not running not found
application-mattermost: 09:34:51 INFO juju.worker.caasoperator restarting "mattermost/6" in 3s
application-mattermost: 09:34:54 INFO juju.worker.caasoperator start "mattermost/6"
application-mattermost: 09:34:54 DEBUG juju.worker.caasoperator "mattermost/6" started
application-mattermost: 09:35:13 INFO juju.worker.caasoperator stopped "mattermost/6", err: executing operation "remote init": Internal error occurred: error executing command in container: failed to exec in container: failed to create exec "65308676472d86e7dbbd91c44e2e52ab28b5ed85ae135ce8a57d454d887aefbb": cannot exec in a stopped state: unknown
application-mattermost: 09:35:13 DEBUG juju.worker.caasoperator "mattermost/6" done: executing operation "remote init": Internal error occurred: error executing command in container: failed to exec in container: failed to create exec "65308676472d86e7dbbd91c44e2e52ab28b5ed85ae135ce8a57d454d887aefbb": cannot exec in a stopped state: unknown
application-mattermost: 09:35:13 ERROR juju.worker.caasoperator exited "mattermost/6": executing operation "remote init": Internal error occurred: error executing command in container: failed to exec in container: failed to create exec "65308676472d86e7dbbd91c44e2e52ab28b5ed85ae135ce8a57d454d887aefbb": cannot exec in a stopped state: unknown
application-mattermost: 09:35:13 INFO juju.worker.caasoperator restarting "mattermost/6" in 3s
application-mattermost: 09:35:16 INFO juju.worker.caasoperator start "mattermost/6"
application-mattermost: 09:35:16 DEBUG juju.worker.caasoperator "mattermost/6" started
[agnew(~)] _

Revision history for this message
Paul Collins (pjdc) wrote :

Upgrading the controller to 2.8.0 has somehow made the failed units go away, perhaps simply due to the controller restarting?

Revision history for this message
Ian Booth (wallyworld) wrote :

With Deployments, pods get new identities as they are restarted, so Juju marks the original unit as stopped until the pod is removed from the cluster, and a new unit is created when the replacement pod spins up.

It seems you've found a race in dealing with this when pod updates happen quickly. Restarting the controller causes Juju to re-evaluate the state of the cluster and resync the Juju model. It could also be that the pods were already terminating and transitioned to removed during the controller restart.
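
A minimal way to watch this unit/pod identity churn from both sides, assuming kubectl access to the cluster; the model here is named mattermost, and for a Juju k8s model the namespace normally matches the model name:

watch -n 5 juju status mattermost    # old unit goes to "terminated / unit stopped by the cloud", then a new unit number appears
kubectl -n mattermost get pods -w    # the Deployment's pods are deleted and recreated under new names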

Changed in juju:
milestone: none → 2.8.1
importance: Undecided → High
status: New → Triaged
tags: added: k8s
Revision history for this message
Tim Penhey (thumper) wrote :

Holding off on work like this due to the redesign of where the agents run. Will keep it on 2.8-next for now; it may be marked invalid later once the more awesome design lands.

Changed in juju:
milestone: 2.8.1 → 2.8-next
Revision history for this message
Haw Loeung (hloeung) wrote :

Saw something similar (maybe?) to this with a K8s charm on MicroK8s.

| application-content-cache: 02:35:08 DEBUG juju.worker.caasoperator retrying exec request, in 5 attempt, container not running not found
| application-content-cache: 02:35:08 ERROR juju.worker.uniter resolver loop error: executing operation "remote init": attempt count exceeded: container not running not found
| application-content-cache: 02:35:08 DEBUG juju.worker.uniter [AGENT-STATUS] failed: resolver loop error
| application-content-cache: 02:35:08 INFO juju.worker.uniter unit "content-cache/14" shutting down: executing operation "remote init": attempt count exceeded: container not running not found
| application-content-cache: 02:35:08 DEBUG juju.worker.uniter juju-run listener stopping
| application-content-cache: 02:35:08 DEBUG juju.worker.uniter.remotestate got leadership change for content-cache/14: leader
| application-content-cache: 02:35:08 DEBUG juju.worker.uniter juju-run listener stopped
| application-content-cache: 02:35:08 INFO juju.worker.caasoperator stopped "content-cache/14", err: executing operation "remote init": attempt count exceeded: container not running not found
| application-content-cache: 02:35:08 DEBUG juju.worker.caasoperator "content-cache/14" done: executing operation "remote init": attempt count exceeded: container not running not found
| application-content-cache: 02:35:08 ERROR juju.worker.caasoperator exited "content-cache/14": executing operation "remote init": attempt count exceeded: container not running not found
| application-content-cache: 02:35:08 INFO juju.worker.caasoperator restarting "content-cache/14" in 3s
| application-content-cache: 02:35:09 DEBUG juju.worker.uniter.operation committing operation "remote init"

This is with Juju 2.8.1. No new pods were being provisioned until I bounced some things; in my case, I rebooted the entire VM.

Revision history for this message
Haw Loeung (hloeung) wrote :

| Unit              Workload    Agent       Address     Ports   Message
| content-cache/14  terminated  failed      10.1.75.77  80/TCP  unit stopped by the cloud
| content-cache/15  terminated  executing   10.1.75.78  80/TCP  (remove)
| content-cache/18  waiting     allocating  10.1.75.81  80/TCP  agent initializing
| content-cache/19  waiting     allocating  10.1.75.82  80/TCP  agent initializing

Full Juju status output - https://paste.ubuntu.com/p/tDwxcw3BR3/

Revision history for this message
Ian Booth (wallyworld) wrote :

This could be an issue with the charm: each time the charm sends in a pod spec, k8s needs to do a rolling update of the Deployment, which can trigger the charm to send in a new pod spec, and so on. There was a case recently with the dex-auth charm where the pod spec included a hash of the config, so the pod spec was never stable and it stayed in that loop.
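
A rough way to check for that kind of loop from the cluster side, assuming kubectl access; the model and application names below are placeholders:

MODEL=my-model        # placeholder: the Juju model name / k8s namespace
APP=content-cache     # placeholder: the application / Deployment name
# Each pod spec change Juju applies produces a new Deployment revision; a revision list
# that keeps growing with no real config change suggests an unstable pod spec (e.g. one
# embedding a freshly computed config hash).
kubectl -n "$MODEL" rollout history "deployment/$APP"
# Compare two consecutive revisions (the numbers here are examples) to see what changed:
kubectl -n "$MODEL" rollout history "deployment/$APP" --revision=3
kubectl -n "$MODEL" rollout history "deployment/$APP" --revision=4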

Revision history for this message
Canonical Juju QA Bot (juju-qa-bot) wrote :

This bug has not been updated in 2 years, so we're marking it Low importance. If you believe this is incorrect, please update the importance.

Changed in juju:
importance: High → Low
tags: added: expirebugs-bot