Machines stuck in "stopped" agent state

Bug #1818045 reported by Casey Marshall
This bug affects 2 people
Affects: Canonical Juju
Status: Incomplete
Importance: Low
Assigned to: Unassigned
Milestone: (none)

Bug Description

Juju 2.4.7, model stg-omnibus on the prodstack-is controller.

I had cs:~cmars/xenial/kafka-5 deployed and wanted to remove it so I could redeploy a different kafka charm. There were two units with storage attached (cinder, 50G), and there were subordinates related to kafka, at least filebeat and telegraf.

I did `juju remove-application kafka`. The application and units were removed, and the storage volumes were detached. The two machines that hosted the two kafka units went to a "stopped" agent state, but they stayed there and never cleaned up. I can still SSH into the machines with `juju ssh <machine#>`.
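
Roughly, the sequence was as follows (the charm's storage label is not stated here, so 'data' is an illustrative placeholder):

juju deploy cs:~cmars/xenial/kafka-5 kafka -n 2 --storage data=cinder,50G
# ... relate the filebeat and telegraf subordinates ...
juju remove-application kafka
juju status   # both former kafka machines remain in "stopped"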

I've tried the following, with no effect (concrete commands are sketched after this list):
- `juju remove-machine --force` on the machines
- Restarting the jujud agents with systemctl
- Rebooting the machines
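
A minimal sketch of those attempts, with machine 3 as an illustrative machine number (the machine agent's systemd unit follows the jujud-machine-<N> naming pattern):

juju remove-machine 3 --force
juju ssh 3 -- sudo systemctl restart jujud-machine-3
juju ssh 3 -- sudo reboot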

Revision history for this message
Casey Marshall (cmars) wrote :
Revision history for this message
Khanh Nguyen (knguyen93) wrote :

Attached are the juju logs.

tags: added: teardown
Changed in juju:
assignee: nobody → Anastasia (anastasia-macmood)
status: New → Triaged
importance: Undecided → High
Revision history for this message
Martin Hilton (martin-hilton) wrote :

I have a similar problem where a machine is stuck in the "down" state.

Machine State DNS Inst id Series AZ Message
0 started 185.125.191.222 3bbabc15-6672-4127-b2bb-d947c5ddd845 xenial prodstack-zone-1 ACTIVE
1 started 10.15.4.4 5223d439-8f7e-4c3d-96d4-136384c95ac3 xenial prodstack-zone-2 ACTIVE
2 started 10.15.4.5 264b89d9-01cd-463a-9c14-d097f1624ea8 xenial prodstack-zone-1 ACTIVE
3 started 10.15.4.7 8979e984-8e5f-48ae-b0aa-e66e769d88c9 xenial prodstack-zone-1 ACTIVE
4 started 10.15.4.8 4896300d-3689-4a55-b9f8-b03fdf7bd8d3 xenial prodstack-zone-2 ACTIVE
5 started 10.15.4.9 46ca7600-f7a8-4d83-af99-8d2fc1fbffcf xenial prodstack-zone-1 ACTIVE
6 started 10.15.4.10 f058acf6-2e2a-4e9b-b5d4-4abac684a91a xenial prodstack-zone-2 ACTIVE
7 started 10.15.4.11 bfb78b68-c413-40a1-85be-16ca81e2fe4e xenial prodstack-zone-1 ACTIVE
8 started 10.15.4.12 f7a695ee-a1c5-43e7-bbc4-fa9f0ccb3c71 xenial prodstack-zone-2 ACTIVE
9 started 10.15.4.6 008745c8-eb80-4d0d-a270-5f9a5344cd15 xenial prodstack-zone-2 ACTIVE
10 started 10.15.4.13 d89bd084-107a-4232-8f8f-78402aa4aa62 xenial prodstack-zone-2 ACTIVE
12 started 10.15.4.15 ffa3d47a-8c8c-4d6e-9040-b9c519fa8e95 xenial prodstack-zone-1 ACTIVE
13 down 10.15.4.16 6ea0c754-71e1-46f4-8162-2aa121cdcf05 xenial prodstack-zone-1 ACTIVE
14 started 10.15.4.18 93d94d7c-8e5e-44c8-a097-4473a9782f37 xenial prodstack-zone-2 ACTIVE
15 started 10.15.4.20 8ec02b7d-d18f-47a4-8458-de278fe72186 xenial prodstack-zone-1 ACTIVE

See attached log

Revision history for this message
Anastasia (anastasia-macmood) wrote :

Interesting extract from machine-85.log (keeping it here for future reference):

2019-01-09 04:03:12 ERROR juju.worker.storageprovisioner common.go:115 failed to set status: cannot set status: no reachable servers
2019-01-09 04:03:59 ERROR juju.worker.dependency engine.go:632 "unconverted-api-workers" manifold worker returned unexpected error: cannot get machine 85: EOF
2019-01-09 04:03:59 ERROR juju.worker.dependency engine.go:632 "machiner" manifold worker returned unexpected error: cannot read environment config: model "2743e5f9-74f2-492e-89ab-a272135d3328": cannot read settings: EOF
2019-01-09 04:03:59 ERROR juju.worker.dependency engine.go:632 "storage-provisioner" manifold worker returned unexpected error: attaching filesystems: publishing attachment of filesystem 23 to machine 85 to state: cannot set info for filesystem attachment 23:85: cannot get filesystem: EOF

2019-02-21 02:05:12 ERROR juju.worker.dependency engine.go:632 "api-caller" manifold worker returned unexpected error: [2743e5] "machine-85" cannot open api: try again (try again)
2019-02-26 10:56:37 ERROR juju.worker.dependency engine.go:632 "storage-provisioner" manifold worker returned unexpected error: getting life of filesystem-23 attached to machine-85: filesystem "23" on "machine 85" not found

Changed in juju:
status: Triaged → In Progress
Changed in juju:
assignee: Anastasia (anastasia-macmood) → nobody
Revision history for this message
Anastasia (anastasia-macmood) wrote :

I cannot reproduce this at all.

I have tried different versions of Juju - 2.4.8, 2.5.3, 2.6-wip.
I have tried different kafka charms - both from the charm store (without storage) and the ~cmars one with storage, including in combination with subordinates and peers like filebeat and telegraf.

Your version of kafka failed to install and, in fact, all hooks failed. I used 'resolved' to get past the failed hooks and reach a stable model.

What I did see was a bit of flakiness around storage. For example, when storage was attached before the machine had reported as started, the storage was stuck permanently in 'pending' status. I think there is an existing bug for that. However, all machines, applications and models were removed/destroyed successfully.

I have spent most of the time on 2.4 since this is the version that was causing the issue for you. However, whilst I did manage to get a similar error logged (see your extract in comment #4), the machine consistently went away without a problem for me.

In the versions after 2.4, we improved how we deal with /dev device mapping. That could account for the reduced flakiness I saw in later versions.

At this point, without a reproducible scenario to help address the root cause, I prefer to focus on providing you and other users with a '--force' option for 'remove-application' and 'remove-unit'. It is not ideal, since it does not fix whatever got you into that state, but it will at least give you a way forward.

As part of this work, I am also ensuring that 'remove-machine --force' will ignore storage errors and succeed despite them, directly addressing your clean-up attempts.

Changed in juju:
status: In Progress → Triaged
Revision history for this message
Tim Penhey (thumper) wrote :

Can we please get the following information for the model? Thanks.

JUJU_DEV_FEATURE_FLAGS=developer-mode juju dump-db
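
As a sketch, assuming dump-db accepts the standard -m model selector, the output can be captured with:

JUJU_DEV_FEATURE_FLAGS=developer-mode juju dump-db -m stg-omnibus > stg-omnibus-db.txt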

Revision history for this message
Casey Marshall (cmars) wrote :
Revision history for this message
Marco M (mmawaw) wrote :

I don't mean to hijack the ticket, but I am experiencing exactly the same issue (on MAAS). After removing the CDK bundle, all machines are in the stopped state but still very much alive, and all software is still running.

I also tried removing machines individually with --force and restarting the services and VMs.

I am attaching a sanitized database dump.

Revision history for this message
Anastasia (anastasia-macmood) wrote :

We have added several fixes for removals and destructions in Juju 2.6. Additionally, remove-unit, remove-application and destroy-model have gained a '--force' flag. The flag is provided as a hammer for difficult cases where you are 100% sure that you want an entity [machine, unit, application, model] gone. Could you please try that version of Juju?
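
For example (entity names are illustrative):

juju remove-unit kafka/0 --force
juju remove-application kafka --force
juju destroy-model stg-omnibus --force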

I will mark this report as Incomplete. However, if you are still experiencing a similar issue with a newer Juju, please create a new report and include reproduction steps as well as relevant logs.

Changed in juju:
status: Triaged → Incomplete
Revision history for this message
Marco M (mmawaw) wrote :

Thanks, I have updated the juju client to 2.6-rc2. I guess I need to upgrade the controller as well, but the "juju upgrade-controller" command does not let me specify a beta/candidate release, and the controller is stuck at 2.5.4.

My current (working) k8s model has 3 machines in the stopped state which were previously used by worker units that I have since removed. I tried to remove them with --force, without success.

How do I upgrade the controller to 2.6?
Thanks.

Revision history for this message
Anastasia (anastasia-macmood) wrote :

@Marco M (mmawaw),

To upgrade your controller to a release candidate, you'll also need to specify the agent stream, as release candidates are published to the 'devel' stream. For example, to upgrade from 2.5.5, I ran:

'juju upgrade-controller --agent-version 2.6-rc2 --agent-stream devel'

You will also need to upgrade your model once the controller is on the 2.6 release candidate:

'juju upgrade-model -m <YOUR MODEL NAME> --agent-stream devel --agent-version 2.6-rc2'
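
To confirm both upgrades took effect (the exact output layout varies by release):

juju controllers                      # the Version column should show 2.6-rc2
juju status -m <YOUR MODEL NAME>      # the model header should show 2.6-rc2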

Revision history for this message
Canonical Juju QA Bot (juju-qa-bot) wrote :

This bug has not been updated in 2 years, so we're marking it Low importance. If you believe this is incorrect, please update the importance.

Changed in juju:
importance: High → Low
tags: added: expirebugs-bot