volumeAttachmentPlan in Dying state causes "state changing too quickly"
Affects | Status | Importance | Assigned to | Milestone
---|---|---|---|---
Canonical Juju | Fix Released | High | John A Meinel |
2.7 | Fix Released | High | John A Meinel |
Bug Description
On prodstack we have some models whose Cleanup actions fail with "state changing too quickly". Using a database dump, we were able to debug the following:
If you end up with a machine that has a volume attached, and try to destroy that machine, it can get into a state where the VolumeAttachmen
This ends up being because the code around "do I have an attachment plan" is:
func (sb *storageBackend) DetachVolume(host names.Tag, volume names.VolumeTag) (err error) {
	...
	buildTxn := func(attempt int) ([]txn.Op, error) {
		...
		if plans, err := sb.machineVolum
			return nil, errors.Trace(err)
		} else {
			if len(plans) > 0 {
				return detachStorageAt
			}
		}
		return detachVolumeOps
	}
	return sb.mb.db(
}
And detachStorageAt
func detachStorageAt
	id := volumeAttachmen
	return []txn.Op{{
		C: volumeAttachmen
		Id: id,
		Assert: isAliveDoc,
		Update: bson.D{{"$set", bson.D{{"life", Dying}}}},
	}}
}
^- in this case, the Cleanup code is asserting that the VolumeAttachmen
It is unclear at this time what the best behavior is (is it to return ErrNoOps, or to return a WaitingForDetac
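The failure mode can be reproduced in isolation. The following is a minimal sketch, not Juju code: `doc`, `op`, `run`, `markDying` and `nRetries` are hypothetical stand-ins for a document, a single-document operation with an assertion, the transaction runner's retry loop, an ops builder, and the retry limit. It assumes the runner re-invokes the builder a fixed number of times and gives up with the "state changing too quickly" error once every attempt's assertion has failed.

```go
package main

import (
	"errors"
	"fmt"
)

// Life mirrors the Alive/Dying/Dead lifecycle referred to in this report.
type Life int

const (
	Alive Life = iota
	Dying
	Dead
)

// errExcessiveContention is a local stand-in for the runner's give-up error.
var errExcessiveContention = errors.New("state changing too quickly; try again soon")

// doc is a hypothetical in-memory document with a life field.
type doc struct{ life Life }

// op is a hypothetical single-document operation: update is applied only
// if assert holds for the document's current state.
type op struct {
	assert func(d *doc) bool
	update func(d *doc)
}

// nRetries is an assumed retry limit for this sketch.
const nRetries = 3

// run calls build up to nRetries times. An attempt fails when the
// assertion does not hold; if all attempts fail, run gives up.
func run(d *doc, build func(attempt int) (op, error)) error {
	for attempt := 0; attempt < nRetries; attempt++ {
		o, err := build(attempt)
		if err != nil {
			return err
		}
		if o.assert(d) {
			o.update(d)
			return nil
		}
	}
	return errExcessiveContention
}

// markDying always asserts Alive, like the ops quoted above, and never
// looks at the document's current life before building the op.
func markDying(attempt int) (op, error) {
	return op{
		assert: func(d *doc) bool { return d.life == Alive },
		update: func(d *doc) { d.life = Dying },
	}, nil
}

func main() {
	d := &doc{life: Alive}
	fmt.Println(run(d, markDying)) // Alive -> Dying: the assertion holds
	fmt.Println(run(d, markDying)) // already Dying: every attempt fails
}
```

Because the builder never re-reads the document's life, retrying cannot help: once the document is Dying, the same failing assertion is produced on every attempt.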
Trying to trace through the code, it seems that the intent is:
1) when a destroy-machine request comes in
1a) mark the VolumeAttachment as dying
1b) which marks the VolumeAttachmen
2) Which gets noticed by the Machine Agent's Storage Provisioner, which triggers processDyingVol
2a) Which will do something based on the attachment type (iscsi triggers a logout)
2b) which then calls back with RemoveVolumeAtt
2c) which lets us proceed to remove the volume attachment itself.
However, if the machine is genuinely dead, something like 'remove-machine --force' should be able to skip the steps that require the machine agent to acknowledge them. It should still try to clean things up with the underlying Provider, so that a volume that was attached and whose lifecycle is tied to the machine rather than the model still gets cleaned up.
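One way to read that requirement, as a sketch only: split teardown into steps that need the machine agent and steps that only need the provider, and let a force flag skip the former. The names `step`, `needsAgent` and `planTeardown` are hypothetical and do not correspond to Juju's implementation; the step descriptions paraphrase the flow described above.

```go
package main

import "fmt"

// step is a hypothetical unit of teardown work.
type step struct {
	name       string
	needsAgent bool // true if the machine agent must acknowledge the step
}

// planTeardown returns the steps to run. With force set, steps that wait
// on the machine agent are dropped; provider-side cleanup is always kept.
func planTeardown(steps []step, force bool) []step {
	var out []step
	for _, s := range steps {
		if force && s.needsAgent {
			continue
		}
		out = append(out, s)
	}
	return out
}

func main() {
	steps := []step{
		{name: "agent detaches the volume (e.g. iscsi logout)", needsAgent: true},
		{name: "remove the volume attachment", needsAgent: false},
		{name: "provider cleans up the machine-scoped volume", needsAgent: false},
	}
	for _, s := range planTeardown(steps, true) {
		fmt.Println(s.name)
	}
}
```

With `force` set to false the plan is unchanged, so the normal path still waits for the agent.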
Other notes:
cleanupDyingEnt
It may be that if cleanupAttachme
However, cleanupAttachme
detachVolumeOps would run into a similar "if it is already dying it will give a state-changing-
func detachVolumeOps
	return []txn.Op{{
		C: volumeAttachmentsC,
		Id: volumeAttachmen
		Assert: isAliveDoc,
		Update: bson.D{{"$set", bson.D{{"life", Dying}}}},
	}}
}
Note that while DetachVolume *does* have a check for volumeattachmen
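One of the options raised above is to return no operations when the document is already Dying. A sketch of that option follows. It is not necessarily what the released fix does, and it uses hypothetical stand-ins rather than Juju's real types: `lifeOp` models the assert/update pair, `detachOps` is an illustrative builder, and `errNoOperations` is a local sentinel, on the assumption that the transaction runner treats such a sentinel as "nothing to do" rather than as a failure.

```go
package main

import (
	"errors"
	"fmt"
)

// Life mirrors the Alive/Dying/Dead lifecycle referred to in this report.
type Life int

const (
	Alive Life = iota
	Dying
	Dead
)

// errNoOperations is a local stand-in for a "nothing to do" sentinel.
var errNoOperations = errors.New("no transaction operations available")

// lifeOp is a hypothetical model of an op that asserts one life value
// and sets another.
type lifeOp struct {
	AssertLife Life
	SetLife    Life
}

// detachOps inspects the current life before building ops. If the
// document is no longer Alive, marking it Dying again is pointless, so
// the builder reports that there is nothing to do instead of emitting an
// op whose assertion can never hold.
func detachOps(current Life) ([]lifeOp, error) {
	if current != Alive {
		return nil, errNoOperations
	}
	return []lifeOp{{AssertLife: Alive, SetLife: Dying}}, nil
}

func main() {
	for _, l := range []Life{Alive, Dying} {
		ops, err := detachOps(l)
		fmt.Println(len(ops), err)
	}
}
```

The key design point is that the life check happens inside the builder, so it is re-evaluated on every attempt instead of being asserted blindly.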
Changed in juju:
status: In Progress → Fix Committed
Changed in juju:
status: Fix Committed → Fix Released