Machines stuck in "stopped" agent state

Bug #1818045 reported by Casey Marshall
This bug affects 2 people
Affects: Canonical Juju
Status: Incomplete
Importance: Low
Assigned to: Unassigned
Milestone: (none)

Bug Description

Juju 2.4.7, model stg-omnibus on the prodstack-is controller.

I had cs:~cmars/xenial/kafka-5 deployed and wanted to remove it so I could redeploy a different kafka charm. There were two units with storage attached (cinder, 50G), and there were subordinates related to kafka, at least filebeat and telegraf.

I did `juju remove-application kafka`. The application and units were removed, and the storage volumes were detached. The two machines that hosted the two kafka units went to a "stopped" agent state, but they stayed there and never cleaned up. I can still SSH into the machines with `juju ssh <machine#>`.
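
Roughly, the sequence was as follows (the charm's storage label is not stated here, so 'data' is an illustrative placeholder):

juju deploy cs:~cmars/xenial/kafka-5 kafka -n 2 --storage data=cinder,50G
# ... relate the filebeat and telegraf subordinates ...
juju remove-application kafka
juju status   # both former kafka machines remain in "stopped"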

I've tried the following, with no effect (concrete commands are sketched after this list):
- `juju remove-machine --force` on the machines
- Restarting the jujud agents with systemctl
- Rebooting the machines
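
A minimal sketch of those attempts, with machine 3 as an illustrative machine number (the machine agent's systemd unit follows the jujud-machine-<N> naming pattern):

juju remove-machine 3 --force
juju ssh 3 -- sudo systemctl restart jujud-machine-3
juju ssh 3 -- sudo reboot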

Revision history for this message
Casey Marshall (cmars) wrote :
Revision history for this message
Khanh Nguyen (knguyen93) wrote :

Attached are the juju logs.

tags: added: teardown
Changed in juju:
assignee: nobody → Anastasia (anastasia-macmood)
status: New → Triaged
importance: Undecided → High
Revision history for this message
Martin Hilton (martin-hilton) wrote :

I have a similar problem where a machine is stuck in the "down" state.

Machine State DNS Inst id Series AZ Message
0 started 185.125.191.222 3bbabc15-6672-4127-b2bb-d947c5ddd845 xenial prodstack-zone-1 ACTIVE
1 started 10.15.4.4 5223d439-8f7e-4c3d-96d4-136384c95ac3 xenial prodstack-zone-2 ACTIVE
2 started 10.15.4.5 264b89d9-01cd-463a-9c14-d097f1624ea8 xenial prodstack-zone-1 ACTIVE
3 started 10.15.4.7 8979e984-8e5f-48ae-b0aa-e66e769d88c9 xenial prodstack-zone-1 ACTIVE
4 started 10.15.4.8 4896300d-3689-4a55-b9f8-b03fdf7bd8d3 xenial prodstack-zone-2 ACTIVE
5 started 10.15.4.9 46ca7600-f7a8-4d83-af99-8d2fc1fbffcf xenial prodstack-zone-1 ACTIVE
6 started 10.15.4.10 f058acf6-2e2a-4e9b-b5d4-4abac684a91a xenial prodstack-zone-2 ACTIVE
7 started 10.15.4.11 bfb78b68-c413-40a1-85be-16ca81e2fe4e xenial prodstack-zone-1 ACTIVE
8 started 10.15.4.12 f7a695ee-a1c5-43e7-bbc4-fa9f0ccb3c71 xenial prodstack-zone-2 ACTIVE
9 started 10.15.4.6 008745c8-eb80-4d0d-a270-5f9a5344cd15 xenial prodstack-zone-2 ACTIVE
10 started 10.15.4.13 d89bd084-107a-4232-8f8f-78402aa4aa62 xenial prodstack-zone-2 ACTIVE
12 started 10.15.4.15 ffa3d47a-8c8c-4d6e-9040-b9c519fa8e95 xenial prodstack-zone-1 ACTIVE
13 down 10.15.4.16 6ea0c754-71e1-46f4-8162-2aa121cdcf05 xenial prodstack-zone-1 ACTIVE
14 started 10.15.4.18 93d94d7c-8e5e-44c8-a097-4473a9782f37 xenial prodstack-zone-2 ACTIVE
15 started 10.15.4.20 8ec02b7d-d18f-47a4-8458-de278fe72186 xenial prodstack-zone-1 ACTIVE

See attached log

Revision history for this message
Anastasia (anastasia-macmood) wrote :

Interesting extract from machine-85.log (keeping it here for future reference):

2019-01-09 04:03:12 ERROR juju.worker.storageprovisioner common.go:115 failed to set status: cannot set status: no reachable servers
2019-01-09 04:03:59 ERROR juju.worker.dependency engine.go:632 "unconverted-api-workers" manifold worker returned unexpected error: cannot get machine 85: EOF
2019-01-09 04:03:59 ERROR juju.worker.dependency engine.go:632 "machiner" manifold worker returned unexpected error: cannot read environment config: model "2743e5f9-74f2-492e-89ab-a272135d3328": cannot read settings: EOF
2019-01-09 04:03:59 ERROR juju.worker.dependency engine.go:632 "storage-provisioner" manifold worker returned unexpected error: attaching filesystems: publishing attachment of filesystem 23 to machine 85 to state: cannot set info for filesystem attachment 23:85: cannot get filesystem: EOF

2019-02-21 02:05:12 ERROR juju.worker.dependency engine.go:632 "api-caller" manifold worker returned unexpected error: [2743e5] "machine-85" cannot open api: try again (try again)
2019-02-26 10:56:37 ERROR juju.worker.dependency engine.go:632 "storage-provisioner" manifold worker returned unexpected error: getting life of filesystem-23 attached to machine-85: filesystem "23" on "machine 85" not found

Changed in juju:
status: Triaged → In Progress
Changed in juju:
assignee: Anastasia (anastasia-macmood) → nobody
Revision history for this message
Anastasia (anastasia-macmood) wrote :

I cannot reproduce this at all.

I have tried different versions of Juju - 2.4.8, 2.5.3, 2.6-wip.
I have tried different kafka charms - both from the charm store (without storage) and the ~cmars one with storage, including in combination with subordinates and peers like filebeat and telegraf.

Your version of kafka failed to install and, in fact, all hooks failed. I used 'resolved' to get past the failed hooks and reach a stable model.

What I did see was a bit of flakiness around storage. For example, when storage was attached before the machine had reported as started, the storage was stuck permanently in 'pending' status. I think there is an existing bug for that. However, all machines, applications and models were removed/destroyed successfully.

I have spent most of the time on 2.4 since this is the version that was causing the issue for you. However, whilst I did manage to get a similar error logged (see your extract in comment #4), the machine consistently went away without a problem for me.

In the versions after 2.4, we improved how we deal with /dev device mapping. That could account for the reduced flakiness I saw in later versions.

At this point, without a reproducible scenario to help address the root cause, I prefer to focus on providing you and other users with a '--force' option for 'remove-application' and 'remove-unit'. It is not ideal, since it does not fix whatever got you into that state, but it will at least give you a way forward.

As part of this work, I am also ensuring that 'remove-machine --force' will ignore storage errors and succeed despite them, directly addressing your clean-up attempts.

Changed in juju:
status: In Progress → Triaged
Revision history for this message
Tim Penhey (thumper) wrote :

Can we please get the following information for the model? Thanks.

JUJU_DEV_FEATURE_FLAGS=developer-mode juju dump-db
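
As a sketch, assuming dump-db accepts the standard -m model selector, the output can be captured with:

JUJU_DEV_FEATURE_FLAGS=developer-mode juju dump-db -m stg-omnibus > stg-omnibus-db.txt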

Revision history for this message
Casey Marshall (cmars) wrote :
Revision history for this message
Marco M (mmawaw) wrote :

I don't mean to hijack the ticket, but I am experiencing exactly the same issue (on MAAS). After removing the CDK bundle, all machines are in the stopped state but still very much alive, and all software is still running.

I also tried removing machines individually with --force and restarting the services and VMs.

I am attaching a sanitized database dump.

Revision history for this message
Anastasia (anastasia-macmood) wrote :

We have added several fixes for removals and destructions in Juju 2.6. Additionally, remove-unit, remove-application and destroy-model have gained a '--force' flag. The flag is provided as a hammer for difficult cases where you are 100% sure that you want an entity [machine, unit, application, model] gone. Could you please try that version of Juju?
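
For example (entity names are illustrative):

juju remove-unit kafka/0 --force
juju remove-application kafka --force
juju destroy-model stg-omnibus --force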

I will mark this report as Incomplete. However, if you are still experiencing a similar issue with a newer Juju, please create a new report and include reproduction steps as well as relevant logs.

Changed in juju:
status: Triaged → Incomplete
Revision history for this message
Marco M (mmawaw) wrote :

Thanks, I have updated the juju client to 2.6-rc2. I guess I need to upgrade the controller as well, but the "juju upgrade-controller" command does not let me specify a beta/candidate release, and the controller is stuck at 2.5.4.

My current (working) k8s model has 3 machines in the stopped state which were previously used by worker units that I have since removed. I tried to remove them with --force, without success.

How do I upgrade the controller to 2.6?
Thanks.

Revision history for this message
Anastasia (anastasia-macmood) wrote :

@Marco M (mmawaw),

To upgrade your controller to a release candidate, you'll also need to specify the agent stream, as release candidates are published to the 'devel' stream. For example, to upgrade from 2.5.5, I ran:

'juju upgrade-controller --agent-version 2.6-rc2 --agent-stream devel'

You will also need to upgrade your model once the controller is on the 2.6 release candidate:

'juju upgrade-model -m <YOUR MODEL NAME> --agent-stream devel --agent-version 2.6-rc2'
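
To confirm both upgrades took effect (the exact output layout varies by release):

juju controllers                      # the Version column should show 2.6-rc2
juju status -m <YOUR MODEL NAME>      # the model header should show 2.6-rc2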

Revision history for this message
Canonical Juju QA Bot (juju-qa-bot) wrote :

This bug has not been updated in 2 years, so we're marking it Low importance. If you believe this is incorrect, please update the importance.

Changed in juju:
importance: High → Low
tags: added: expirebugs-bot