charm container crashes resulting in storage-attach hook error

Bug #1993309 reported by Hemanth Nakkina
This bug affects 3 people
Affects: Canonical Juju
Status: Fix Released
Importance: Critical
Assigned to: Harry Pidcock

Bug Description

The unit is stuck in crash loop backoff (restart failed) in a juju + microk8s environment.

Snap versions of juju and microk8s:
```
$ snap list | egrep 'juju|microk8s'
juju      3.0-rc2  21049  3.0/beta            canonical**  -
microk8s  v1.25.2  4049   1.25-strict/stable  canonical**  -
```

Bundle used: https://github.com/hemanthnakkina/snap-sunbeam/blob/main/bundles/etc/bundles/control-plane.yaml

Status of the application:
```
$ juju status -m openstack glance
Model      Controller          Cloud/Region        Version  SLA          Timestamp
openstack  microk8s-localhost  microk8s/localhost  3.0-rc2  unsupported  13:58:08Z

App     Version  Status   Scale  Charm       Channel  Rev  Address         Exposed  Message
glance           waiting  0/1    glance-k8s  edge     8    10.152.183.227  no       installing agent

Unit      Workload  Agent  Address      Ports  Message
glance/0  error     lost   10.1.136.12         crash loop backoff: back-off 5m0s restarting failed container=charm pod=glance-0_openstack(7352e697-e03c-4d76-8d7f-5ba28dea1380)

```

```
$ microk8s.kubectl -n openstack get po glance-0
NAME READY STATUS RESTARTS AGE
glance-0 1/2 CrashLoopBackOff 6 (4m26s ago) 15m
```

Glance charm container logs:
https://paste.ubuntu.com/p/mJycDSngRb/

Log excerpts of interest:
unit-glance-0: 13:47:19 INFO juju.worker.caasunitterminationworker terminating due to SIGTERM
unit-glance-0: 13:47:19 ERROR juju.worker.uniter.operation hook "install" (via hook dispatching script: dispatch) failed: signal: terminated
unit-glance-0: 13:47:19 INFO juju.worker.uniter awaiting error resolution for "install" hook
...
unit-glance-0: 13:48:45 INFO juju.worker.caasunitterminationworker terminating due to SIGTERM
unit-glance-0: 13:48:45 ERROR juju.worker.uniter.operation hook "local-repository-storage-attached" (via hook dispatching script: dispatch) failed: signal: terminated
unit-glance-0: 13:48:45 INFO juju.worker.uniter awaiting error resolution for "storage-attached" hook
unit-glance-0: 13:48:49 INFO juju.worker.uniter awaiting error resolution for "storage-attached" hook
...
unit-glance-0: 13:49:06 INFO juju.worker.uniter hooks are retried true
unit-glance-0: 13:49:06 INFO juju.worker.uniter awaiting error resolution for "storage-attached" hook
unit-glance-0: 13:49:07 INFO juju.worker.uniter awaiting error resolution for "storage-attached" hook
unit-glance-0: 13:49:11 INFO juju.worker.uniter awaiting error resolution for "storage-attached" hook
unit-glance-0: 13:49:11 INFO juju.worker.uniter awaiting error resolution for "storage-attached" hook
unit-glance-0: 13:49:11 ERROR juju.worker.uniter resolver loop error: preparing operation "run storage-attached (local-repository/0) hook" for glance/0: could not retrieve storage for id: local-repository/0: storage not found
unit-glance-0: 13:49:12 INFO juju.worker.uniter unit "glance/0" shutting down: preparing operation "run storage-attached (local-repository/0) hook" for glance/0: could not retrieve storage for id: local-repository/0: storage not found

Revision history for this message
Hemanth Nakkina (hemanth-n) wrote (last edit):

The charm container is restarted after 30 seconds due to a healthcheck failure. If the restart happens while the storage-attached hook is executing, the hook fails, and further retries of the storage-attached hook then fail with the error "storage not found".

There is no special handling of the storage-attached hook in the glance-k8s charm code.
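
For reference, the hook in question originates from the charm's storage declaration. A minimal sketch of what such a declaration looks like in a charm's metadata.yaml (the storage name local-repository matches the hook above; the exact fields in the published glance-k8s charm are an assumption):

```yaml
# Illustrative metadata.yaml fragment, not the actual glance-k8s charm's
# declaration: a filesystem storage named "local-repository". Attaching it
# is what fires the local-repository-storage-attached hook seen in the logs.
storage:
  local-repository:
    type: filesystem
    minimum-size: 5G   # size is a placeholder
```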

description: updated
Ian Booth (wallyworld)
Changed in juju:
milestone: none → 3.0-rc3
status: New → Triaged
importance: Undecided → High
Ian Booth (wallyworld)
Changed in juju:
importance: High → Critical
Changed in juju:
status: Triaged → In Progress
assignee: nobody → Harry Pidcock (hpidcock)
Revision history for this message
Jose C. Massón (jose-masson) wrote :

I'm seeing the same bug on 2.9.35.

I'm running microk8s and juju in the Multipass blueprint charm-dev: https://github.com/canonical/multipass-blueprints/blob/main/v1/charm-dev.yaml

```
$ snap list | egrep 'juju|microk8s'
juju      2.9.35   20892  2.9/stable   canonical**  classic
microk8s  v1.25.2  4055   1.25/stable  canonical**  classic
```

While deploying grafana-k8s (https://charmhub.io/grafana-k8s?channel=edge) I'm getting:

```
$ juju status --color --relations
Model  Controller  Cloud/Region        Version  SLA          Timestamp
cos    charm-dev   microk8s/localhost  2.9.35   unsupported  19:55:51-03:00

App      Version  Status   Scale  Charm        Channel  Rev  Address         Exposed  Message
grafana  8.2.6    waiting  0/1    grafana-k8s  edge     48   10.152.183.145  no       installing agent

Unit        Workload  Agent  Address      Ports  Message
grafana/0*  error     lost   10.1.157.86         crash loop backoff: back-off 2m40s restarting failed container=charm pod=grafana-0_cos(dde63079-6d8e-4290-b3df-c123a1792747)

Relation provider  Requirer         Interface      Type  Message
grafana:grafana    grafana:grafana  grafana_peers  peer
```

And the debug-log:

unit-grafana-0: 19:54:58.681 DEBUG juju.machinelock machine lock released for grafana/0 uniter (run storage-attached (database/1) hook)
unit-grafana-0: 19:54:58.681 DEBUG juju.worker.uniter.operation lock released for grafana/0
unit-grafana-0: 19:54:58.682 ERROR juju.worker.uniter resolver loop error: preparing operation "run storage-attached (database/1) hook" for grafana/0: could not retrieve storage for id: database/1: storage not found
unit-grafana-0: 19:54:58.682 DEBUG juju.worker.uniter [AGENT-STATUS] failed: resolver loop error
unit-grafana-0: 19:54:58.732 INFO juju.worker.uniter unit "grafana/0" shutting down: preparing operation "run storage-attached (database/1) hook" for grafana/0: could not retrieve storage for id: database/1: storage not found

And show-status-log:

```
$ juju show-status-log grafana/0
Time                        Type       Status     Message
20 Oct 2022 19:49:54-03:00  juju-unit  executing  running grafana-pebble-ready hook
20 Oct 2022 19:49:55-03:00  juju-unit  executing  running litestream-pebble-ready hook
20 Oct 2022 19:49:57-03:00  juju-unit  executing  running grafana-pebble-ready hook
20 Oct 2022 19:49:59-03:00  juju-unit  executing  running database-storage-attached hook
20 Oct 2022 19:50:02-03:00  juju-unit  error      hook failed: "storage-attached"
20 Oct 2022 19:50:07-03:00  juju-unit  failed     resolver loop error
20 Oct 2022 19:50:33-03:00  juju-unit  error      crash loop backoff: back-off 10s restarting failed container=charm pod=grafana-0_cos(dde63079-6d8e-4290-b3df-c123a1792747)
20 Oct 2022 19:50:47-03:00  juju-unit  error      hook failed: "storage-attached"
20 Oct 2022 19:50:52-03:00  juju-unit  failed     resolver loop error
20 Oct 2022 19:51:18-03:00  juju-unit  error      crash loop backoff: back-off 20s restarting failed container=charm pod=grafana-0_cos(dde63079-6d8e-4290-b3df-c123a1792747)
20 Oct 2022 19:51:47-03:00  juju-unit  error      hook failed: "storage-att...
```


Revision history for this message
James Page (james-page) wrote :

I think I'm seeing the same issue:

```
$ snap list | egrep 'juju|microk8s'
juju      3.0-rc3-1353f4f  21095  3.0/edge            canonical**  -
microk8s  v1.25.2          4049   1.25-strict/stable  canonical**  -
```

Pretty much the whole deployment reliably ends up in the same state:

```
glance/0  error  lost  10.1.75.36  crash loop backoff: back-off 5m0s restarting failed container=charm pod=glance-0_sunbeam(e699d731-61f7-4019-9f8d-3a641333d15e)
```

I've attached a full replay of the debug log for this unit.

You can see the unit download and verify the charm, and then it exits:

unit-glance-0: 14:10:22 INFO juju.worker.uniter.charm downloading ch:amd64/focal/glance-k8s-10 from API server
unit-glance-0: 14:10:22 INFO juju.downloader downloading from ch:amd64/focal/glance-k8s-10
unit-glance-0: 14:10:22 INFO juju.downloader download complete ("ch:amd64/focal/glance-k8s-10")
unit-glance-0: 14:10:22 INFO juju.downloader download verified ("ch:amd64/focal/glance-k8s-10")
unit-glance-0: 14:10:52 INFO juju.worker.caasunitterminationworker terminating due to SIGTERM
unit-glance-0: 14:11:04 INFO juju.cmd running containerAgent [3.0-rc3 1353f4f4e91a06d61024d85cfc4d0588110a4838 gc go1.19.2]

Subsequent restarts of the container then show:

unit-glance-0: 14:11:04 INFO juju.worker.leadership glance/0 promoted to leadership of glance
unit-glance-0: 14:11:04 INFO juju.agent.tools ensure jujuc symlinks in /var/lib/juju/tools/unit-glance-0
unit-glance-0: 14:11:04 INFO juju.worker.uniter unit "glance/0" started
unit-glance-0: 14:11:04 INFO juju.worker.caasupgrader abort check blocked until version event received
unit-glance-0: 14:11:04 INFO juju.worker.caasupgrader unblocking abort check
unit-glance-0: 14:11:04 INFO juju.worker.uniter resuming charm install
unit-glance-0: 14:11:04 INFO juju.worker.uniter.charm detected interrupted deploy of charm "ch:amd64/focal/glance-k8s-10"
unit-glance-0: 14:11:35 INFO juju.worker.caasunitterminationworker terminating due to SIGTERM

From the charm container log output:

[glance-0] 2022-10-21T14:20:17.429Z [pebble] Check "readiness" failure 2 (threshold 3): received non-20x status code 418
[glance-0] 2022-10-21T14:20:27.427Z [pebble] Check "readiness" failure 3 (threshold 3): received non-20x status code 418
[glance-0] 2022-10-21T14:20:27.427Z [pebble] Check "readiness" failure threshold 3 hit, triggering action
[glance-0] 2022-10-21T14:20:27.427Z [pebble] Service "container-agent" on-check-failure action is "shutdown", triggering server exit
[glance-0] 2022-10-21T14:20:27.427Z [pebble] Server exiting!
[glance-0] 2022-10-21T14:20:27.485Z [pebble] Stopping all running services.
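
Those Pebble messages come from a check/service configuration along these lines: a readiness check with a failure threshold of 3, wired to an on-check-failure action of shutdown for the container-agent service. A minimal illustrative Pebble layer fragment (the URL and command below are placeholders, not the actual values used by the Juju container agent):

```yaml
# Illustrative Pebble layer fragment, not the layer actually shipped with
# the Juju container agent. After the "readiness" check fails 3 times in a
# row, Pebble runs the on-check-failure action for container-agent, which
# is "shutdown": Pebble exits and Kubernetes restarts the charm container.
checks:
  readiness:
    override: replace
    level: ready
    threshold: 3
    http:
      url: http://localhost:8080/readiness   # placeholder URL
services:
  container-agent:
    override: replace
    command: /charm/bin/containeragent unit  # placeholder command
    on-check-failure:
      readiness: shutdown
```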

Revision history for this message
John A Meinel (jameinel) wrote :

Hm: 3.0-rc3-1353f4f supposedly does have this PR in it:
```
commit 28ec5db9dc263af65ee091b62f9d2eee4eab9536
Merge: a5873009c2 7dfc2004a9
Author: Juju bot <email address hidden>
Date:   Fri Oct 21 02:56:46 2022 +0200

    Merge pull request #14798 from hpidcock/fix-pebble-probes

    https://github.com/juju/juju/pull/14798
```

However, what we really need is the first line of the log with
2022-10-20 20:10:08 INFO juju.cmd supercommand.go:56 running jujud [3.0-rc2 1b20f0ecb018e4478d78dc7d018e7b0d0b07bca4 gc go1.19.2]

to know if we're running the right controller version.

Revision history for this message
John A Meinel (jameinel) wrote :

The glance debug log doesn't look correct to me:

```
<html>
<head>
  <title>OpenID transaction in progress</title>
</head>
...
```

Revision history for this message
James Page (james-page) wrote :

Re-attached with the actual log content.

Revision history for this message
James Page (james-page) wrote :

I've reverted to using microk8s from 1.25/stable and Juju 2.9/stable, but I still seem to be getting the same error. Digging in further now.

Revision history for this message
James Page (james-page) wrote :

Reverted to using 2.9.34 agent binaries and this issue disappears.

Changed in juju:
milestone: 3.0-rc3 → 3.0.1
Harry Pidcock (hpidcock)
Changed in juju:
milestone: 3.0.1 → 2.9.37
Revision history for this message
Harry Pidcock (hpidcock) wrote :
Changed in juju:
status: In Progress → Fix Committed
Ian Booth (wallyworld)
Changed in juju:
status: Fix Committed → In Progress
Harry Pidcock (hpidcock)
Changed in juju:
status: In Progress → Fix Committed
Revision history for this message
Hemanth Nakkina (hemanth-n) wrote :

I saw the issue again with juju 3.0.0:

```
$ snap list | grep juju
juju  3.0.0  21122  3.0/stable  canonical**  -
```

```
$ juju controllers
Use --refresh option with this command to see the latest information.

Controller           Model  User   Access     Cloud/Region        Models  Nodes  HA  Version
microk8s-localhost*  -      admin  superuser  microk8s/localhost  2       1      -   3.0.0
```

Excerpts from log:
unit-keystone-0: 11:20:12 INFO juju.cmd running containerAgent [3.0.0 35c560704ee254219ae0c4a37810bde5278e99bb gc go1.19.2]
unit-keystone-0: 11:20:12 INFO juju.cmd.containeragent.unit start "unit"
unit-keystone-0: 11:20:12 INFO juju.worker.upgradesteps upgrade steps for 3.0.0 have already been run.
...
...
unit-keystone-0: 11:20:42 INFO juju.worker.caasunitterminationworker terminating due to SIGTERM
unit-keystone-0: 11:20:42 ERROR juju.worker.uniter.operation hook "install" (via hook dispatching script: dispatch) failed: signal: terminated
unit-keystone-0: 11:20:42 INFO juju.worker.uniter awaiting error resolution for "install" hook
unit-keystone-0: 11:33:10 INFO juju.cmd running containerAgent [3.0.0 35c560704ee254219ae0c4a37810bde5278e99bb gc go1.19.2]
unit-keystone-0: 11:33:10 INFO juju.cmd.containeragent.unit start "unit"
unit-keystone-0: 11:33:10 INFO juju.worker.upgradesteps upgrade steps for 3.0.0 have already been run.
...
...
unit-keystone-0: 11:33:39 INFO juju.worker.uniter.operation ran "leader-elected" hook (via hook dispatching script: dispatch)
unit-keystone-0: 11:33:40 INFO juju.worker.caasunitterminationworker terminating due to SIGTERM
unit-keystone-0: 11:33:40 ERROR juju.worker.uniter.operation hook "keystone-pebble-ready" (via hook dispatching script: dispatch) failed: signal: terminated
unit-keystone-0: 11:33:40 ERROR juju.worker.uniter pebble poll failed for container "keystone": failed to send pebble-ready event: hook failed
...
...
unit-keystone-0: 11:34:08 INFO juju.worker.uniter awaiting error resolution for "storage-attached" hook
unit-keystone-0: 11:34:08 ERROR juju.worker.uniter resolver loop error: preparing operation "run storage-attached (fernet-keys/6) hook" for keystone/0: could not retrieve storage for id: fernet-keys/6: storage not found
unit-keystone-0: 11:34:08 INFO juju.worker.uniter unit "keystone/0" shutting down: preparing operation "run storage-attached (fernet-keys/6) hook" for keystone/0: could not retrieve storage for id: fernet-keys/6: storage not found

Complete log:
https://pastebin.ubuntu.com/p/q6zZSs2JDt/

Changed in juju:
status: Fix Committed → Fix Released