Unit is stuck in unknown/lost status when scaling down

Bug #1977582 reported by Leon
This bug affects 13 people
Affects         Status        Importance  Assigned to    Milestone
Canonical Juju  Fix Released  High        Thomas Miller
3.1             Fix Released  High        Thomas Miller
3.2             Fix Released  High        Thomas Miller

Bug Description

With Juju 2.9.31, on scale-down, the highest-numbered unit is never removed and is stuck in unknown/lost status.

The message says "agent lost, see 'juju show-status-log prometheus/2'"
however:

$ juju show-status-log prometheus/2
ERROR no status history available

I was able to reproduce this on GitHub Actions and locally.

Examples of integration tests failing for this reason:

https://github.com/canonical/loki-k8s-operator/runs/6715106687?check_suite_focus=true#step:4:331
https://github.com/canonical/prometheus-k8s-operator/runs/6678292826?check_suite_focus=true#step:4:842
https://github.com/canonical/prometheus-k8s-operator/runs/6715955362?check_suite_focus=true#step:4:896
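
For reference, a minimal reproduction sketch based on the failures above (the channel and --trust flags are assumptions and may need adjusting):

```
juju deploy prometheus-k8s prometheus --channel=edge --trust
juju scale-application prometheus 3
# wait for all three units to become active, then scale back down
juju scale-application prometheus 1
juju status   # prometheus/2 is left behind in unknown/lost
```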

Leon (sed-i)
summary: - Unit is Unit stuck in unknown/lost status when scaling down
+ Unit is stuck in unknown/lost status when scaling down
Revision history for this message
Leon (sed-i) wrote :

Everything works fine when I use --agent-version 2.9.29.

Revision history for this message
Jose C. Massón (jose-masson) wrote :

I can confirm this issue with Juju version 2.9.31.

Revision history for this message
John A Meinel (jameinel) wrote :

I know Harry has been looking at teardown; it would be good to know whether there was a specific regression here.

Changed in juju:
assignee: nobody → Harry Pidcock (hpidcock)
importance: Undecided → High
milestone: none → 2.9.33
status: New → Triaged
Revision history for this message
Harry Pidcock (hpidcock) wrote :

After an investigation yesterday, this is caused by the charm taking more than 30 seconds to run its stop, remove, *-relation-departed, etc. hooks during pod termination. Although this looks like a regression, the behaviour was already there but was unfortunately hidden by another bug.

Ultimately this will be properly fixed by the fix for https://bugs.launchpad.net/juju/+bug/1951415, for which work is ongoing.

In the short term, the plan is to increase the pod's terminationGracePeriodSeconds to a higher number. This isn't a great fix, but it may help mitigate the problem for now.
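
For anyone who wants to check what their pods currently allow, the grace period is visible on the pod spec (a sketch; the namespace and pod name below are illustrative):

```
# Show the current termination grace period for a unit pod
# (<model-name> and prometheus-2 are example names)
kubectl -n <model-name> get pod prometheus-2 \
  -o jsonpath='{.spec.terminationGracePeriodSeconds}'
```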

Changed in juju:
milestone: 2.9.33 → 2.9.34
Revision history for this message
Guillermo (gcalvino) wrote :

As discussed with Ian, I also found that pod termination takes longer with Juju 2.9.32 than in previous versions. At first I thought it was related to Pebble, but I tried with a podspec charm and a reactive one and hit the same issue.

To test the issue:

```
juju deploy osm-keystone --channel latest/candidate --resource keystone-image=opensourcemano/keystone:testing-daily -n 3
juju deploy charmed-osm-mariadb-k8s mariadb-k8s -n 3

juju relate osm-keystone mariadb-k8s
```

Then I scale in the application to 1 unit:

```
juju scale-application osm-keystone 1
```

With Juju 2.9.32 the scale-in operation took 5 minutes, while with Juju 2.9.29 it took 32 seconds.

I talked to Ian about including this comment here, but maybe I need to open a new bug. If so, just tell me and I will create it.
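
For what it's worth, one rough way to measure the scale-in time (a sketch; assumes jq is installed and is not how the numbers above were taken):

```
# Time how long it takes until only one unit remains
juju scale-application osm-keystone 1
start=$(date +%s)
until [ "$(juju status --format=json \
    | jq '.applications["osm-keystone"].units | length')" -eq 1 ]; do
  sleep 5
done
echo "scale-in took $(( $(date +%s) - start ))s"
```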

Changed in juju:
milestone: 2.9.34 → 2.9.35
Changed in juju:
milestone: 2.9.35 → 2.9.36
Revision history for this message
Ben Hoyt (benhoyt) wrote (last edit):

Just for the record, I was able to easily reproduce this on the 2.9 branch (version: 2.9.36-ubuntu-amd64, git-commit: ac2f28107be60be5799abda62163d582bb5dffd1) using the following sequence (based on Ryan Barry's Mattermost message):

$ juju bootstrap microk8s k8s
$ juju deploy parca-k8s parca --channel=edge
$ juju scale-application parca 3
# wait for ready
$ juju scale-application parca 1
$ juju status

And juju status said "agent lost" for units parca/1 and parca/2.
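
(For the "wait for ready" step, `juju wait-for` can script it; a sketch, where the query is an example that may need adjusting for the charm's actual status values:)

```
# Block until the application reports active (example query)
juju wait-for application parca --query='status=="active"'
```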

Changed in juju:
milestone: 2.9.36 → 2.9.37
Changed in juju:
milestone: 2.9.37 → 2.9.38
Revision history for this message
Harry Pidcock (hpidcock) wrote :

This problem is due to Pebble SIGKILLing the containeragent after Kubernetes sends SIGTERM to Pebble.
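
One way to observe this from outside (a sketch; the pod name is illustrative, and "charm" is the sidecar container in which Pebble supervises the containeragent):

```
# Inspect what the charm container logged just before it was killed;
# --previous shows logs from the terminated container instance
kubectl -n <model-name> logs parca-1 -c charm --previous | tail -n 20
```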

Changed in juju:
assignee: Harry Pidcock (hpidcock) → Thomas Miller (tlmiller)
milestone: 2.9.38 → 2.9.39
status: Triaged → In Progress
Changed in juju:
milestone: 2.9.39 → 2.9.40
Changed in juju:
milestone: 2.9.40 → 2.9.41
Changed in juju:
milestone: 2.9.41 → 2.9.42
Changed in juju:
milestone: 2.9.42 → 2.9.43
Changed in juju:
milestone: 2.9.43 → 2.9.44
Revision history for this message
Alex Lutay (taurus) wrote :

AFAIK, the fix is already in the `2.9/candidate` snap channel.

Is the last post here correct?
> Changed in juju, milestone: 2.9.43 → 2.9.44
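
(For anyone wanting to try it, the usual snap commands apply; a sketch:)

```
# Check the installed version/channel, then switch to the candidate channel
snap list juju
sudo snap refresh juju --channel=2.9/candidate
```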

Revision history for this message
Alex Lutay (taurus) wrote :

For the record: the issue has been addressed in Juju 2.9/stable (2.9.43).

Revision history for this message
Liam Young (gnuoy) wrote :

Can I confirm that this will be fixed in Juju 3.x as well? I've only tested with 3.2.0, but it still seems to be an issue there. Thanks.

Revision history for this message
Harry Pidcock (hpidcock) wrote :

This will be released in 3.2 soon.

Changed in juju:
status: In Progress → Fix Committed
Changed in juju:
status: Fix Committed → Fix Released