juju remove-cloud race condition

Bug #1840685 reported by Kenneth Koski
This bug affects 1 person
Affects: Canonical Juju
Status: Expired
Importance: Medium
Assigned to: Unassigned
Milestone: none

Bug Description

For a k8s model...
I've got a script that runs these two commands sequentially:

    juju destroy-model MODEL --yes --destroy-storage --force
    juju remove-cloud -c CONTROLLER CLOUD

Sometimes, I will get this error message, and the cloud will not get removed:

    ERROR cloud is used by 1 model

If I then manually delete the cloud, it works fine. This appears to be a race condition where destroy-model sometimes doesn't wait long enough before returning.
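Until the cause is understood, one way a script could tolerate this is to retry the removal while the controller still reports the cloud as in use. This is a minimal sketch rather than an official workaround: `remove_cloud_with_retry`, `RETRIES`, and `DELAY` are hypothetical names, and matching on the error text assumes the message keeps the wording quoted above.

```shell
# Hypothetical helper: retry `juju remove-cloud` only while the failure is
# the "cloud is used by N model" error from this report.
# RETRIES and DELAY are illustrative knobs, not Juju settings.
remove_cloud_with_retry() {
    local controller="$1"
    local cloud="$2"
    local retries="${RETRIES:-10}"
    local delay="${DELAY:-3}"
    local attempt=1
    local output

    while [ "$attempt" -le "$retries" ]; do
        if output="$(juju remove-cloud -c "$controller" "$cloud" 2>&1)"; then
            return 0
        fi
        case "$output" in
            *"cloud is used by"*) ;;  # assumed transient: retry after a pause
            *) printf '%s\n' "$output" >&2; return 1 ;;  # anything else: fail fast
        esac
        attempt=$((attempt + 1))
        sleep "$delay"
    done

    printf 'cloud %s still in use after %s attempts\n' "$cloud" "$retries" >&2
    return 1
}

# Usage, mirroring the two commands above:
#   juju destroy-model MODEL --yes --destroy-storage --force
#   remove_cloud_with_retry CONTROLLER CLOUD
```

Retrying only on the known message keeps unrelated failures, such as a misspelled cloud name, from being masked by the loop.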

Tags: remove-cloud
Ian Booth (wallyworld)
description: updated
Changed in juju:
milestone: none → 2.7-beta1
status: New → Triaged
importance: Undecided → High
Changed in juju:
milestone: 2.7-beta1 → 2.7-rc1
Anastasia (anastasia-macmood) wrote :

I am a bit surprised that your script does not check if the model is gone before doing anything else... Having a check will safeguard you from many possible surprises :D
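Such a check could look roughly like the sketch below. It assumes that `juju show-model MODEL` exits non-zero once the model no longer exists; `wait_for_model_gone`, `TIMEOUT`, and `INTERVAL` are hypothetical names chosen for illustration. Note that any other failure of `show-model` (for example, an unreachable controller) would also be read as "gone", so a production script may want a stricter test.

```shell
# Hypothetical helper: poll until `juju show-model` fails for the given model,
# which this sketch treats as "the model is gone".
# TIMEOUT and INTERVAL (seconds) are illustrative knobs, not Juju settings.
wait_for_model_gone() {
    local model="$1"
    local timeout="${TIMEOUT:-120}"
    local interval="${INTERVAL:-2}"
    local waited=0

    while juju show-model "$model" >/dev/null 2>&1; do
        if [ "$waited" -ge "$timeout" ]; then
            echo "model $model still present after ${timeout}s" >&2
            return 1
        fi
        sleep "$interval"
        waited=$((waited + interval))
    done
}

# Usage: only attempt the cloud removal once the model has disappeared.
#   juju destroy-model MODEL --yes --destroy-storage --force
#   wait_for_model_gone MODEL && juju remove-cloud -c CONTROLLER CLOUD
```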

However, about your specific concern....
Destroying a model is a complex multi-step process that has some asynchronous steps. It's possible that occasionally the operation to remove a cloud gets attempted while some model destruction steps have yet to run.

We could potentially add a --force option to the remove-cloud operation but I am not convinced that we REALLY want to force remove a cloud that is still being used by some models... We have no means of "changing" a cloud for a model, so these models might become unusable and indestructible.

Changed in juju:
importance: High → Medium
milestone: 2.7-rc1 → none
tags: added: remove-cloud
Kenneth Koski (knkski) wrote :

I'd agree that a --force option for cloud removal isn't ideal. I think what I really want is to have `juju destroy-model ...` synchronously ensure that the model is completely gone (particularly since the command currently gives off the appearance of being synchronous). Alternatively, it would be nice to have it be completely asynchronous, and just print out a message that the model will be deleted eventually, and it would then offer a `--wait` flag to force the synchronous behavior.

Changed in juju:
assignee: nobody → Anastasia (anastasia-macmood)
status: Triaged → In Progress
Anastasia (anastasia-macmood) wrote :

@Kenneth Koski,

I cannot reproduce locally. Could you please add logs from a run where you are seeing it?

I've traced the code but nothing obvious jumped out at me... We suspect that the problem is that the model cloud reference count is not decremented by the time 'remove-cloud' is called. However, this operation is part of a dying model removal and should have been completed if 'destroy-model' returned successfully.

Changed in juju:
status: In Progress → Incomplete
Anastasia (anastasia-macmood) wrote :

Meanwhile I'll ensure that the destroy-model command does not return until the undertaker is run, i.e. until the model is transitioned from Dying to Dead.

Changed in juju:
status: Incomplete → In Progress
Anastasia (anastasia-macmood) wrote :

So I looked into this further today...

In simple terms, 'destroy-model' command will loop checking model status and reporting its counts until the command can no longer obtain a status, at which point Juju assumes that it is because the model no longer exists. Just for reference, the status will always return for as long as the model is in the database.... even if it's marked as 'Dead'...

The code that seems to misbehave here is when the model is transitioned from Dying to Dead, specifically where we decrease the reference count of models that use a particular cloud. This code is run by an "undertaker".

So as far as I can see, the model correctly and in this order:
* decrements ref count of models in the cloud;
* transitions from Dying to Dead;
* gets removed from the database;
* 'destroy-model' command completes.
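
The ordering above can be mimicked with a toy model. This is purely illustrative and is not Juju's code: plain shell variables stand in for controller state, and every function name here is hypothetical. It only shows that, if the steps really run in the listed order, a cloud removal issued after the destroy step returns should never see a non-zero reference count.

```shell
# Toy model only: shell variables stand in for controller state.
cloud_refcount=1      # number of models that reference the cloud
model_life="dying"    # dying -> dead -> removed

# One pass of the (simulated) undertaker.
undertaker_step() {
    case "$model_life" in
        dying)
            cloud_refcount=$((cloud_refcount - 1))  # decrement ref count
            model_life="dead"                       # transition Dying -> Dead
            ;;
        dead)
            model_life="removed"                    # removed from the database
            ;;
    esac
}

# Simulated destroy-model: returns only once the model is removed.
destroy_model() {
    while [ "$model_life" != "removed" ]; do
        undertaker_step
    done
}

# Simulated remove-cloud: refuses while any model still references the cloud.
remove_cloud() {
    if [ "$cloud_refcount" -gt 0 ]; then
        echo "ERROR cloud is used by $cloud_refcount model" >&2
        return 1
    fi
}
```

In this model, calling `remove_cloud` before `destroy_model` fails and calling it afterwards succeeds, so the reported error would need a path in the real system where the decrement is skipped or delayed, which this sketch does not capture.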

I really do not see how it can fail...

So, in addition to sharing a log, could you please also share your script that fails intermittently? Feel free to sanitize it if it contains private data...

Changed in juju:
status: In Progress → Incomplete
Kenneth Koski (knkski) wrote :

Sorry about the slow response. It happens intermittently, and I wasn't able to trigger it by trying it a number of times just now. I think this bug can probably be closed until I can reproduce with logs.

Changed in juju:
assignee: Anastasia (anastasia-macmood) → nobody
Launchpad Janitor (janitor) wrote :

[Expired for juju because there has been no activity for 60 days.]

Changed in juju:
status: Incomplete → Expired