Status is not consistent with actual status - status is stuck on install

Bug #1980114 reported by Leon
This bug affects 1 person
Affects: Canonical Juju
Status: Triaged
Importance: Low
Assigned to: Unassigned

Bug Description

After a load test has been running for a while, I come back to look at `juju status` and see a unit in maintenance (installing agent):

$ juju status scrape-config/0
Model Controller Cloud/Region Version SLA Timestamp
cos-lite-load-test uk8s microk8s/localhost 2.9.32 unsupported 14:08:36Z

App Version Status Scale Charm Channel Rev Address Exposed Message
scrape-config waiting 1 prometheus-scrape-config-k8s edge 32 10.152.183.198 no installing agent

Unit Workload Agent Address Ports Message
scrape-config/0* maintenance idle 10.1.79.207

On the other hand, `juju show-status-log` suggests that we are already past install and past the start hook:

$ juju show-status-log scrape-config/0
Time Type Status Message
28 Jun 2022 11:05:33Z workload active
28 Jun 2022 13:04:02Z workload maintenance stopping charm software
28 Jun 2022 13:04:02Z juju-unit executing running stop hook
28 Jun 2022 13:04:06Z juju-unit executing running start hook
28 Jun 2022 13:05:44Z juju-unit idle
28 Jun 2022 13:05:44Z workload maintenance

The debug log seems to confirm this:

$ juju debug-log --include unit-scrape-config-0
unit-scrape-config-0: 13:04:08 DEBUG unit.scrape-config/0.juju-log Charm called itself via hooks/start.
unit-scrape-config-0: 13:04:08 DEBUG unit.scrape-config/0.juju-log Legacy hooks/start exited with status 0.
unit-scrape-config-0: 13:04:08 DEBUG unit.scrape-config/0.juju-log yaml does not have libyaml extensions, using slower pure Python yaml loader
unit-scrape-config-0: 13:04:08 DEBUG unit.scrape-config/0.juju-log Using local storage: /var/lib/juju/agents/unit-scrape-config-0/charm/.unit-state.db already exists
unit-scrape-config-0: 13:04:08 DEBUG unit.scrape-config/0.juju-log Emitting Juju event start.
unit-scrape-config-0: 14:00:23 DEBUG unit.scrape-config/0.juju-log Operator Framework 1.5.0+4.g3714655 up and running.
unit-scrape-config-0: 14:00:23 DEBUG unit.scrape-config/0.juju-log Legacy hooks/update-status does not exist.
unit-scrape-config-0: 14:00:23 DEBUG unit.scrape-config/0.juju-log yaml does not have libyaml extensions, using slower pure Python yaml loader
unit-scrape-config-0: 14:00:23 DEBUG unit.scrape-config/0.juju-log Using local storage: /var/lib/juju/agents/unit-scrape-config-0/charm/.unit-state.db already exists
unit-scrape-config-0: 14:00:23 DEBUG unit.scrape-config/0.juju-log Emitting Juju event update_status.

So it seems that `juju status` is not consistent with the actual status.
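
A quick way to see the two views side by side is to read the unit's entry out of `juju status --format=json` and hold it against the tail of `juju show-status-log`. A rough sketch (helper names are mine, and it assumes the 2.9 JSON layout of `juju status`, i.e. applications -> units -> workload-status / juju-status):

#!/usr/bin/env python3
"""Rough sketch: print the workload/agent status that `juju status`
currently reports for one unit, to compare with its status log.
Assumes the juju 2.9 JSON layout; illustrative only."""
import json
import subprocess
import sys


def unit_status(unit):
    """Return the workload and agent status `juju status` reports for `unit`."""
    out = subprocess.run(
        ["juju", "status", "--format=json", unit],
        check=True, capture_output=True, text=True,
    ).stdout
    app = unit.split("/")[0]
    info = json.loads(out)["applications"][app]["units"][unit]
    return {
        "workload": info["workload-status"]["current"],
        "agent": info["juju-status"]["current"],
    }


if __name__ == "__main__":
    unit = sys.argv[1] if len(sys.argv) > 1 else "scrape-config/0"
    print(unit_status(unit))
    # Compare by eye against the last few lines of:
    #   juju show-status-log <unit>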

Revision history for this message
Joseph Phillips (manadart) wrote :

This looks like an issue with the model cache not being updated.
Are there any controller log errors?
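Controller-side errors should show up with something like:

$ juju debug-log -m controller --replay --level ERROR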

Changed in juju:
status: New → Triaged
importance: Undecided → Medium
Revision history for this message
Leon (sed-i) wrote :

There is nothing in `juju debug-log`.

I think I came across another variation of this when scaling down from 2 units to zero: status is stuck on unknown/lost (terminated) forever, even though kubectl reports that the pods are gone:

$ kubectl describe po prometheus-1 -n welcome

Events:
  Type Reason Age From Message
  ---- ------ ---- ---- -------
  Normal Killing 4m21s kubelet Stopping container charm
  Normal Killing 4m21s kubelet Stopping container prometheus
  Warning Unhealthy 4m21s kubelet Readiness probe failed: Get "http://10.1.16.214:38813/v1/health?level=ready": dial tcp 10.1.16.214:38813: connect: connection refused
  Warning Unhealthy 4m21s kubelet Liveness probe failed: Get "http://10.1.16.214:38813/v1/health?level=alive": dial tcp 10.1.16.214:38813: connect: connection refused

... after 5 min ...

$ kubectl describe po prometheus-0 -n welcome
Error from server (NotFound): pods "prometheus-0" not found

$ kubectl describe po prometheus-1 -n welcome
Error from server (NotFound): pods "prometheus-1" not found

$ juju status

Model Controller Cloud/Region Version SLA Timestamp
welcome chdv32 microk8s/localhost 2.9.32 unsupported 10:03:56-04:00

App Version Status Scale Charm Channel Rev Address Exposed Message
prometheus terminated 0 prometheus-k8s 0 10.152.183.138 no unit stopped by the cloud

Unit Workload Agent Address Ports Message
prometheus/0 unknown lost 10.1.16.226 agent lost, see 'juju show-status-log prometheus/0'
prometheus/1 unknown lost 10.1.16.214 agent lost, see 'juju show-status-log prometheus/1'

Relation provider Requirer Interface Type Message
prometheus:prometheus-peers prometheus:prometheus-peers prometheus_peers peer
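
A rough way to flag this variant is to diff the units `juju status` still lists against the pods kubectl actually sees. A sketch (my own helper names; it assumes the sidecar naming where unit <app>/<n> runs in pod <app>-<n>, and that the k8s namespace matches the model name):

#!/usr/bin/env python3
"""Rough check: list juju units whose pods no longer exist in the cluster.
Assumes sidecar pod naming (<app>-<n> for unit <app>/<n>) and that the
namespace equals the model name; sketch only."""
import json
import subprocess


def run_json(cmd):
    """Run a command and parse its stdout as JSON."""
    out = subprocess.run(cmd, check=True, capture_output=True, text=True).stdout
    return json.loads(out)


def stale_units(model):
    """Return units juju still reports whose backing pods are gone."""
    status = run_json(["juju", "status", "-m", model, "--format=json"])
    pods = run_json(["kubectl", "get", "pods", "-n", model, "-o", "json"])
    pod_names = {p["metadata"]["name"] for p in pods["items"]}
    stale = []
    for app_info in status.get("applications", {}).values():
        for unit in app_info.get("units", {}):
            if unit.replace("/", "-") not in pod_names:
                stale.append(unit)
    return stale


if __name__ == "__main__":
    # With the state shown above this would print the two lingering units,
    # e.g. ['prometheus/0', 'prometheus/1'].
    print(stale_units("welcome"))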

Revision history for this message
Canonical Juju QA Bot (juju-qa-bot) wrote :

This Medium-priority bug has not been updated in 60 days, so we're marking it Low importance. If you believe this is incorrect, please update the importance.

Changed in juju:
importance: Medium → Low
tags: added: expirebugs-bot