Juju messes up when terminating AWS instances

Bug #1768064 reported by Dmitriy Kropivnitskiy
This bug affects 1 person
Affects: Canonical Juju
Status: Triaged
Importance: Low
Assigned to: Unassigned
Milestone: (none)

Bug Description

There appears to be a basic order-of-operations bug in Juju when it terminates multiple AWS instances. I have seen this happen with both the destroy-model and remove-unit commands. The instance seems to be terminated before Juju marks the machine as stopped: I can see the instance being terminated in the AWS console while juju status still shows the machine as "started". Juju then repeatedly tries to communicate with the dead instance. As a result, shutting down even a single instance takes a long time, because Juju does a lot of retries.

A few specifics of my setup should be noted. I am using an existing VPC, so I bootstrapped my controller with vpc-id-force=true. I have defined two spaces, public and private, and my machines are spread between them (this does not seem to make a difference; the issue happens to machines in either space). I am not sure whether it matters, but I am using "instance-type" constraints. The Juju version is 2.3.7 on both the controller and the model.

The model I am using is as follows: one machine is a t2.small that runs easyrsa and kubernetes-master, and three machines are t2.large running 3 units of etcd and 3 units of kubernetes-worker. Everything is tied together with flannel. I am using the latest charms from "containers" for everything.

This should be fairly easy to replicate, but once I am done bringing my cluster back up, I will try to create a minimal repeatable setup for this issue.

description: updated
Anastasia (anastasia-macmood) wrote :

I am moving this bug under "juju" project for triaging. "juju-core" is dedicated to Juju 1.x series exclusively.

no longer affects: juju-core
Anastasia (anastasia-macmood) wrote :

I had a closer look at your description. It is by design that Juju first terminates the cloud instance and only later marks the machine as terminated in Juju.

I do, however, agree that the lag and the retries are not necessary when we can deterministically decide that the machine needs to be marked as stopped.

We have recently introduced a way for providers to use a predefined set of callbacks relevant to the context in which a cloud call is being made. This seems to be a perfect case for adding a callback that marks machines as stopped on successful instance termination.
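
As a rough illustration of the callback idea described above, here is a minimal Python sketch. It is not Juju's actual code or API (Juju itself is written in Go). Every name in it (CallContext, on_instance_terminated, Model, FakeProvider, stop_instances) is hypothetical and chosen only to show the shape of the approach: the provider invokes a callback from the call context as soon as termination succeeds, so the machine is marked as stopped without waiting for retries against a dead instance.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, Iterable, List


@dataclass
class CallContext:
    """Hypothetical bundle of callbacks handed to a provider for a cloud call."""
    on_instance_terminated: Callable[[str], None]


@dataclass
class Model:
    """Toy stand-in for the machine state that 'juju status' would report."""
    machine_status: Dict[str, str] = field(default_factory=dict)

    def mark_stopped(self, instance_id: str) -> None:
        self.machine_status[instance_id] = "stopped"


class FakeProvider:
    """Toy in-memory provider; a real one would call the cloud API here."""

    def __init__(self, running: Iterable[str]) -> None:
        self.running = set(running)

    def stop_instances(self, ctx: CallContext, instance_ids: List[str]) -> None:
        for instance_id in instance_ids:
            if instance_id in self.running:
                self.running.remove(instance_id)
                # Termination succeeded: notify immediately via the callback.
                ctx.on_instance_terminated(instance_id)


model = Model({"i-aaa": "started", "i-bbb": "started"})
provider = FakeProvider(["i-aaa", "i-bbb"])
ctx = CallContext(on_instance_terminated=model.mark_stopped)

provider.stop_instances(ctx, ["i-aaa"])
print(model.machine_status)
```

In this sketch only the terminated instance changes state; the other machine stays "started" because its callback never fires.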

Changed in juju:
status: New → Triaged
importance: Undecided → Medium
tags: added: usability
Canonical Juju QA Bot (juju-qa-bot) wrote :

This bug has not been updated in 2 years, so we're marking it Low importance. If you believe this is incorrect, please update the importance.

Changed in juju:
importance: Medium → Low
tags: added: expirebugs-bot