Unit is stuck in unknown/lost status when scaling down

Bug #1977582 reported by Leon
This bug affects 13 people
Affects         Status        Importance  Assigned to    Milestone
Canonical Juju  Fix Released  High        Thomas Miller
3.1             Fix Released  High        Thomas Miller
3.2             Fix Released  High        Thomas Miller

Bug Description

With Juju 2.9.31, on scale-down, the highest-numbered unit is never removed and is stuck in unknown/lost status.

The message says "agent lost, see 'juju show-status-log prometheus/2'"
however:

$ juju show-status-log prometheus/2
ERROR no status history available

I was able to reproduce this on GitHub Actions and locally.

Examples of integration tests failing for this reason:

https://github.com/canonical/loki-k8s-operator/runs/6715106687?check_suite_focus=true#step:4:331
https://github.com/canonical/prometheus-k8s-operator/runs/6678292826?check_suite_focus=true#step:4:842
https://github.com/canonical/prometheus-k8s-operator/runs/6715955362?check_suite_focus=true#step:4:896
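
For reference, a minimal reproduction sketch based on the failures above (the channel and --trust flags are assumptions and may need adjusting):

```
juju deploy prometheus-k8s prometheus --channel=edge --trust
juju scale-application prometheus 3
# wait for all three units to become active, then scale back down
juju scale-application prometheus 1
juju status   # prometheus/2 is left behind in unknown/lost
```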

Leon (sed-i)
summary: - Unit is Unit stuck in unknown/lost status when scaling down
+ Unit is stuck in unknown/lost status when scaling down
Revision history for this message
Leon (sed-i) wrote :

Everything works fine when I use --agent-version 2.9.29.

Revision history for this message
Jose C. Massón (jose-masson) wrote :

I can confirm this issue with Juju version 2.9.31.

Revision history for this message
John A Meinel (jameinel) wrote :

I know Harry has been looking at teardown; it would be good to know whether there was a specific regression here.

Changed in juju:
assignee: nobody → Harry Pidcock (hpidcock)
importance: Undecided → High
milestone: none → 2.9.33
status: New → Triaged
Revision history for this message
Harry Pidcock (hpidcock) wrote :

After an investigation yesterday, this is caused by the charm taking more than 30 seconds to run its stop, remove, *-relation-departed, etc. hooks during pod termination. Although this looks like a regression, the behaviour was already there but was unfortunately hidden by another bug.

Ultimately this will be properly fixed by the fix for https://bugs.launchpad.net/juju/+bug/1951415, for which work is ongoing.

In the short term, the plan is to increase the pod's terminationGracePeriodSeconds to a higher number. This isn't a great fix, but it may help mitigate the problem for now.
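
For anyone who wants to check what their pods currently allow, the grace period is visible on the pod spec (a sketch; the namespace and pod name below are illustrative):

```
# Show the current termination grace period for a unit pod
# (<model-name> and prometheus-2 are example names)
kubectl -n <model-name> get pod prometheus-2 \
  -o jsonpath='{.spec.terminationGracePeriodSeconds}'
```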

Changed in juju:
milestone: 2.9.33 → 2.9.34
Revision history for this message
Guillermo (gcalvino) wrote :

As discussed with Ian, I also found that pod termination takes longer with Juju 2.9.32 than in previous versions. At first I thought it was related to Pebble, but I tried with a podspec charm and a reactive one and hit the same issue.

To test the issue:

```
juju deploy osm-keystone --channel latest/candidate --resource keystone-image=opensourcemano/keystone:testing-daily -n 3
juju deploy charmed-osm-mariadb-k8s mariadb-k8s -n 3

juju relate osm-keystone mariadb-k8s
```

Then I scale in the application to 1 unit:

```
juju scale-application osm-keystone 1
```

With Juju 2.9.32 the scale-in operation took 5 minutes, while with Juju 2.9.29 it took 32 seconds.

I talked to Ian about including this comment here, but maybe I need to open a new bug. If so, just tell me and I will create it.
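
For what it's worth, one rough way to measure the scale-in time (a sketch; assumes jq is installed and is not how the numbers above were taken):

```
# Time how long it takes until only one unit remains
juju scale-application osm-keystone 1
start=$(date +%s)
until [ "$(juju status --format=json \
    | jq '.applications["osm-keystone"].units | length')" -eq 1 ]; do
  sleep 5
done
echo "scale-in took $(( $(date +%s) - start ))s"
```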

Changed in juju:
milestone: 2.9.34 → 2.9.35
Changed in juju:
milestone: 2.9.35 → 2.9.36
Revision history for this message
Ben Hoyt (benhoyt) wrote (last edit):

Just for the record, I was able to easily reproduce this on the 2.9 branch (version: 2.9.36-ubuntu-amd64, git-commit: ac2f28107be60be5799abda62163d582bb5dffd1) using the following sequence (based on Ryan Barry's Mattermost message):

$ juju bootstrap microk8s k8s
$ juju deploy parca-k8s parca --channel=edge
$ juju scale-application parca 3
# wait for ready
$ juju scale-application parca 1
$ juju status

And juju status said "agent lost" for units parca/1 and parca/2.
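
(For the "wait for ready" step, `juju wait-for` can script it; a sketch, where the query is an example that may need adjusting for the charm's actual status values:)

```
# Block until the application reports active (example query)
juju wait-for application parca --query='status=="active"'
```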

Changed in juju:
milestone: 2.9.36 → 2.9.37
Changed in juju:
milestone: 2.9.37 → 2.9.38
Revision history for this message
Harry Pidcock (hpidcock) wrote :

This problem is due to Pebble SIGKILLing the containeragent after Kubernetes sends SIGTERM to Pebble.
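
One way to observe this from outside (a sketch; the pod name is illustrative, and "charm" is the sidecar container in which Pebble supervises the containeragent):

```
# Inspect what the charm container logged just before it was killed;
# --previous shows logs from the terminated container instance
kubectl -n <model-name> logs parca-1 -c charm --previous | tail -n 20
```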

Changed in juju:
assignee: Harry Pidcock (hpidcock) → Thomas Miller (tlmiller)
milestone: 2.9.38 → 2.9.39
status: Triaged → In Progress
Changed in juju:
milestone: 2.9.39 → 2.9.40
Changed in juju:
milestone: 2.9.40 → 2.9.41
Changed in juju:
milestone: 2.9.41 → 2.9.42
Changed in juju:
milestone: 2.9.42 → 2.9.43
Changed in juju:
milestone: 2.9.43 → 2.9.44
Revision history for this message
Alex Lutay (taurus) wrote :

AFAIK, the fix is already in the `2.9/candidate` snap channel.

Is the last post here correct?
> Changed in juju, milestone: 2.9.43 → 2.9.44
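
(For anyone wanting to try it, the usual snap commands apply; a sketch:)

```
# Check the installed version/channel, then switch to the candidate channel
snap list juju
sudo snap refresh juju --channel=2.9/candidate
```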

Revision history for this message
Alex Lutay (taurus) wrote :

For the record: the issue has been addressed in Juju 2.9/stable (2.9.43).

Revision history for this message
Liam Young (gnuoy) wrote :

Can I confirm that this will be fixed in Juju 3.x as well? I've only tested with 3.2.0, but it still seems to be an issue there. Thanks.

Revision history for this message
Harry Pidcock (hpidcock) wrote :

This will be released in 3.2 soon.

Changed in juju:
status: In Progress → Fix Committed
Changed in juju:
status: Fix Committed → Fix Released