remove-unit on last unit doesn't always remove machine

Bug #1680936 reported by Aaron Bentley
This bug affects 1 person
Affects: Canonical Juju
Status: Fix Released
Importance: Medium
Assigned to: Unassigned

Bug Description

As seen here:
http://reports.vapour.ws/releases/5102/job/hammer-time-gce-xenial/attempt/19

If all lxc containers are removed from a machine, a subsequent remove-unit may or may not delete the machine. This appears to depend on whether the containers were completely removed when remove-unit was invoked.

I've attached a script to reproduce the issue. By default, it fails to remove machine 0. If WAIT_LXD is set to "true", it waits until the container is removed before trying to remove the machine, and succeeds.
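The attached script is not reproduced in this report. Going only by the commands visible in the transcripts, the sequence it runs can be sketched roughly as follows. The function name `reproduce`, the injectable `run` and `wait_for_removal` callables, and the `settle` parameter are hypothetical stand-ins for illustration; only the juju commands and the one-second sleep come from the transcripts.

```python
import time
from typing import Callable, Sequence


def reproduce(run: Callable[[Sequence[str]], None],
              wait_for_removal: Callable[[str], bool],
              wait_lxd: bool = False,
              settle: float = 1.0) -> bool:
    """Run the reproduction steps; return True if machine 0 ends up removed.

    run              -- executes one command (hypothetical; e.g. a subprocess wrapper)
    wait_for_removal -- blocks until the named machine is gone, returns False on timeout
    wait_lxd         -- mirrors the WAIT_LXD switch described in the report
    """
    run(["juju", "deploy", "ubuntu"])
    time.sleep(settle)  # the transcript shows "sleep 1" at this point
    run(["juju", "add-machine", "lxd:0"])
    run(["juju", "remove-machine", "0/lxd/0"])
    if wait_lxd:
        # Succeeding variant: let the container disappear before remove-unit.
        wait_for_removal("0/lxd/0")
    run(["juju", "remove-unit", "ubuntu/0"])
    # The outcome under test: does machine 0 go away after its last unit is removed?
    return wait_for_removal("0")
```

The only difference between the failing and succeeding runs is whether the wait happens between `remove-machine` and `remove-unit`, which is what makes the behaviour look timing-dependent.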

Example failure:
$ juju add-model container-fun7; ./containers_and_units.bash
+ juju deploy ubuntu
Located charm "cs:ubuntu-10".
Deploying charm "cs:ubuntu-10".
+ sleep 1
+ juju add-machine lxd:0
created container 0/lxd/0
+ juju remove-machine 0/lxd/0
+ echo 'Waiting for removal of machine 0/lxd/0'
Waiting for removal of machine 0/lxd/0
+ '[' false == true ']'
+ juju remove-unit ubuntu/0
+ echo 'Waiting for removal of machine 0.'
Waiting for removal of machine 0.
+ wait_for_null '.machines."0"'
++ date +%s
+ deadline=1491590952
+ set +x
....................................................................................................................................................................................
FAILURE: machine 0 was not removed.
Model           Controller      Cloud/Region   Version
container-fun7  container-fun2  aws/us-west-1  2.1.2

App     Version  Status   Scale  Charm   Store       Rev  OS      Notes
ubuntu           waiting      0  ubuntu  jujucharms   10  ubuntu

Unit  Workload  Agent  Machine  Public address  Ports  Message

Machine  State    DNS            Inst id              Series  AZ
0        started  54.193.18.186  i-0ca37bfbfd450a1de  xenial  us-west-1b

Example success:
$ juju add-model container-fun8; WAIT_LXD=true containers_and_units.bash
Using credential 'credentials' cached in controller
Added 'container-fun8' model on aws/us-west-1 with credential 'credentials' for user 'admin'
+ juju deploy ubuntu
Located charm "cs:ubuntu-10".
Deploying charm "cs:ubuntu-10".
+ sleep 1
+ juju add-machine lxd:0
created container 0/lxd/0
+ juju remove-machine 0/lxd/0
+ echo 'Waiting for removal of machine 0/lxd/0'
Waiting for removal of machine 0/lxd/0
+ '[' true == true ']'
+ wait_for_null '.machines."0".containers."0/lxd/0"'
++ date +%s
+ deadline=1491591412
+ set +x
...................................................................................................
Query .machines."0".containers."0/lxd/0" went null.
Waiting for removal of machine 0.
.........................................
Query .machines."0" went null.
SUCCESS: machine 0 was removed.
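
The `wait_for_null` helper that appears in the transcripts is part of the attached script and is not shown here. A minimal sketch of what such a helper might do, assuming it polls the model status until a given path disappears or a deadline passes, could look like this. The names `lookup` and `get_status`, the tuple form of the path, and the default timeout and interval are all assumptions, not taken from the script.

```python
import time
from typing import Any, Callable, Mapping, Sequence


def lookup(status: Any, path: Sequence[str]) -> Any:
    """Follow the keys in `path`; return None if any step is missing."""
    node = status
    for key in path:
        if not isinstance(node, Mapping) or key not in node:
            return None
        node = node[key]
    return node


def wait_for_null(get_status: Callable[[], Any],
                  path: Sequence[str],
                  timeout: float = 180.0,   # assumed value
                  interval: float = 1.0,    # assumed value
                  clock: Callable[[], float] = time.monotonic,
                  sleep: Callable[[float], None] = time.sleep) -> bool:
    """Poll until `path` is absent from the status; False if the deadline passes."""
    deadline = clock() + timeout
    while True:
        if lookup(get_status(), path) is None:
            return True
        if clock() >= deadline:
            return False
        sleep(interval)
```

In this form, the query `.machines."0".containers."0/lxd/0"` from the transcript would correspond to the path `("machines", "0", "containers", "0/lxd/0")`, and `get_status` could be a wrapper that parses the output of `juju status --format json`.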

John A Meinel (jameinel) wrote : Re: [Bug 1680936] [NEW] remove-unit on last unit doesn't always remove machine

So it doesn't seem surprising that, if you have a container on a machine,
removing a unit from the machine should not destroy the machine. It also
seems to follow that if you deployed just a container and then removed
that container, you wouldn't want to destroy that machine (you're quite
likely to want to create another container on it). Really, the only time
remove-unit seems like a good trigger for removing the machine is when the
machine was provisioned explicitly for that unit and only that unit.

remove-machine does exist as does remove-machine --force if you wanted to
cascade delete everything on the machine.

I'm open to other feedback but having remove-machine kill the host machine
if it is the last container doesn't feel like the right answer. (I'm not
sure that killing the host machine when removing the last unit when it's
been used for containers is the right thing either, TBH.)

John
=:->

On Apr 7, 2017 11:05 PM, "Aaron Bentley" <email address hidden>
wrote:

> Public bug reported:
>
> As seen here:
> http://reports.vapour.ws/releases/5102/job/hammer-time-gce-xenial/attempt/19
>
> If all lxc containers are removed from a machine, a subsequent
> remove-unit may or may not delete the machine. This appears to depend
> on whether the containers were completely removed when remove-unit was
> invoked.
>
> I've attached a script to reproduce the issue. By default, it fails to
> remove machine 0. If WAIT_LXD is set to "true", it waits until the
> container is removed before trying to remove the machine, and succeeds.
>
> Example failure:
> $ juju add-model container-fun7; ./containers_and_units.bash
> + juju deploy ubuntu
> Located charm "cs:ubuntu-10".
> Deploying charm "cs:ubuntu-10".
> + sleep 1
> + juju add-machine lxd:0
> created container 0/lxd/0
> + juju remove-machine 0/lxd/0
> + echo 'Waiting for removal of machine 0/lxd/0'
> Waiting for removal of machine 0/lxd/0
> + '[' false == true ']'
> + juju remove-unit ubuntu/0
> + echo 'Waiting for removal of machine 0.'
> Waiting for removal of machine 0.
> + wait_for_null '.machines."0"'
> ++ date +%s
> + deadline=1491590952
> + set +x
> ............................................................
> ............................................................
> ............................................................
> FAILURE: machine 0 was not removed.
> Model Controller Cloud/Region Version
> container-fun7 container-fun2 aws/us-west-1 2.1.2
>
> App Version Status Scale Charm Store Rev OS Notes
> ubuntu waiting 0 ubuntu jujucharms 10 ubuntu
>
> Unit Workload Agent Machine Public address Ports Message
>
> Machine State DNS Inst id Series AZ
> 0 started 54.193.18.186 i-0ca37bfbfd450a1de xenial us-west-1b
>
> Example success:
> $ juju add-model container-fun8; WAIT_LXD=true containers_and_units.bash
> Using credential 'credentials' cached in controller
> Added 'container-fun8' model on aws/us-west-1 with credential
> 'credentials' for user 'admin'
> + juju deploy ubuntu
> Located charm "cs:ubuntu-10".
> Deploying charm "cs:ubuntu-10".
> + sleep 1
> + juju add-machine lxd:0
> cre...


Aaron Bentley (abentley) wrote :

I do not necessarily think that remove-unit should destroy machines. I just think that its behaviour should be consistent.

This bug is about inconsistent behaviour depending on timing. The same commands are issued in the same order, but different results are produced. It all depends on how much time passes between the remove-machine and remove-unit commands.

AIUI, Juju works by modelling a target state and then performing operations in order to achieve that state. The target state after the "remove-machine" should be identical whether or not the container has actually been removed.

Add-unit appears to be consulting the current state when determining whether to remove the machine. I believe that instead, it should be consulting the target state.

Aaron Bentley (abentley) wrote :

Doh. *remove-unit* appears to be consulting the current state when determining whether to remove the machine. I believe that instead, it should be consulting the target state.
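
To make the current-state versus target-state distinction concrete, here is a toy model. It is not Juju's implementation; the class and function names are hypothetical, and the alive/dying/dead lifecycle is assumed purely for illustration. A container that has been asked to go away but has not yet disappeared is modelled as "dying".

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import List


class Life(Enum):
    ALIVE = "alive"
    DYING = "dying"   # removal requested, not yet complete
    DEAD = "dead"


@dataclass
class Container:
    name: str
    life: Life = Life.ALIVE


@dataclass
class Machine:
    name: str
    containers: List[Container] = field(default_factory=list)
    units: List[str] = field(default_factory=list)


def removable_by_current_state(machine: Machine) -> bool:
    """Any container record still present blocks removal, even one already dying."""
    return not machine.units and not machine.containers


def removable_by_target_state(machine: Machine) -> bool:
    """Only containers that are meant to keep existing block removal."""
    return not machine.units and all(
        c.life is not Life.ALIVE for c in machine.containers
    )
```

With a machine that has no units and one dying container, the first check says "keep" and the second says "remove". Once the container record is gone, both agree. Only the target-state check gives the same answer regardless of how far the container teardown has progressed.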

Ian Booth (wallyworld)
tags: added: teardown
Anastasia (anastasia-macmood) wrote :

I can no longer reproduce this scenario on Juju 2.6. There have been a considerable number of changes in the provisioning, LXD, and entity destruction/removal areas.

When I followed the repro steps, the machine was removed every single time. I am marking this as Fix Committed for 2.6.

Changed in juju:
status: Triaged → Fix Committed
milestone: none → 2.6-rc2
Changed in juju:
status: Fix Committed → Fix Released