volumeAttachmentPlan in Dying state causes "state changing too quickly"

Bug #1860542 reported by John A Meinel
Affects          Status        Importance  Assigned to    Milestone
Canonical Juju   Fix Released  High        John A Meinel
juju 2.7 series  Fix Released  High        John A Meinel

Bug Description

On prodstack we have some models whose Cleanup actions keep failing with "state changing too quickly". Using a database dump, we were able to debug the following:

If you end up with a machine that has a volume attached and try to destroy that machine, it can get into a state where the VolumeAttachmentPlan is already Dying, and the cleanup job trying to remove the VolumeAttachment then fails with "state changing too quickly".

This ends up happening because the "do I have an attachment plan" logic is:

func (sb *storageBackend) DetachVolume(host names.Tag, volume names.VolumeTag) (err error) {
    ...
    buildTxn := func(attempt int) ([]txn.Op, error) {
        ...
        if plans, err := sb.machineVolumeAttachmentPlans(host, volume); err != nil {
            return nil, errors.Trace(err)
        } else {
            if len(plans) > 0 {
                // Taken whenever a plan document exists, even if that
                // plan is already Dying.
                return detachStorageAttachmentOps(host, volume), nil
            }
        }
        return detachVolumeOps(host, volume), nil
    }
    return sb.mb.db().Run(buildTxn)
}

And detachStorageAttachmentOps is:

func detachStorageAttachmentOps(host names.Tag, v names.VolumeTag) []txn.Op {
    id := volumeAttachmentId(host.Id(), v.Id())
    return []txn.Op{{
        C:      volumeAttachmentPlanC,
        Id:     id,
        Assert: isAliveDoc, // aborts if the plan is already Dying
        Update: bson.D{{"$set", bson.D{{"life", Dying}}}},
    }}
}

^- in this case, the cleanup code is asserting that the VolumeAttachmentPlan is not already Dying, but it is. We need a check in buildTxn that plans[0].Life() is still Alive, to match the assertion being made in detachStorageAttachmentOps.

It is unclear at this time what the best behavior is (return ErrNoOps, or return a WaitingForDetachment error).
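
A minimal sketch of the missing guard in buildTxn (assuming a Life() accessor on the plan; jujutxn.ErrNoOperations from github.com/juju/txn is shown only as one candidate answer to the open question above):

    buildTxn := func(attempt int) ([]txn.Op, error) {
        ...
        plans, err := sb.machineVolumeAttachmentPlans(host, volume)
        if err != nil {
            return nil, errors.Trace(err)
        }
        if len(plans) > 0 {
            // Match the isAliveDoc assertion inside detachStorageAttachmentOps:
            // only attempt the Alive -> Dying transition while the plan is Alive.
            if plans[0].Life() != Alive {
                return nil, jujutxn.ErrNoOperations // or a WaitingForDetachment-style error
            }
            return detachStorageAttachmentOps(host, volume), nil
        }
        return detachVolumeOps(host, volume), nil
    }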

Trying to trace through the code, the intent seems to be (modeled in the sketch after this list):
1) when a machine is being destroyed
1a) mark the VolumeAttachment as Dying
1b) which also marks the VolumeAttachmentPlan as Dying
2) which gets noticed by the machine agent's storage provisioner, triggering processDyingVolumePlans
2a) which does something based on the attachment type (iscsi triggers a logout)
2b) and then calls back with RemoveVolumeAttachmentPlan
2c) which lets us proceed to remove the volume attachment itself.
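
A self-contained toy model of that sequence (these types are illustrative only, not Juju's state package):

package main

import "fmt"

// Life mirrors Juju's Alive/Dying/Dead lifecycle values.
type Life int

const (
    Alive Life = iota
    Dying
    Dead
)

func (l Life) String() string { return [...]string{"alive", "dying", "dead"}[l] }

type volumeAttachment struct{ life Life }
type volumeAttachmentPlan struct{ life Life }

func main() {
    att := &volumeAttachment{life: Alive}
    plan := &volumeAttachmentPlan{life: Alive}

    // 1) destroy-machine: mark the attachment (1a) and its plan (1b) Dying.
    att.life, plan.life = Dying, Dying
    fmt.Println("attachment:", att.life, "plan:", plan.life)

    // 2) the machine agent's storage provisioner notices the Dying plan,
    //    runs processDyingVolumePlans (2a: e.g. an iscsi logout), then
    //    2b: calls back with RemoveVolumeAttachmentPlan, deleting the plan.
    plan = nil

    // 2c) only now is it safe to remove the volume attachment itself.
    att = nil
    fmt.Println("teardown complete:", att == nil && plan == nil)
}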

However, in the case where the machine is genuinely dead, something like 'remove-machine --force' should be able to skip the steps that require the machine agent to acknowledge them. It should still ask the underlying provider to clean things up, so that a volume whose lifecycle is tied to the machine rather than the model still gets removed.
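
Sketched with hypothetical names (force, removeVolumeAttachmentPlanOps are not spelled out by the source), the force path could build ops roughly like this:

    // Hypothetical force path: don't wait for the machine agent to
    // acknowledge the Dying plan; delete the plan document directly and
    // proceed to detach the volume, leaving provider-side cleanup of
    // machine-scoped volumes to run afterwards.
    if force && len(plans) > 0 {
        ops := removeVolumeAttachmentPlanOps(host, volume) // hypothetical helper
        return append(ops, detachVolumeOps(host, volume)...), nil
    }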

Other notes:
cleanupDyingEntityStorage does have some logic that calls RemoveVolumeAttachmentPlan.

It may be that if cleanupAttachmentsForDyingVolume didn't exit with an error, we would be able to proceed to something that would actually clean up the Dying VolumeAttachmentPlan.

However, cleanupAttachmentsForDyingVolume runs either detachStorageAttachmentOps or detachVolumeOps, never both. The callback RemoveVolumeAttachmentPlan deletes the attachment plan and then triggers detachVolumeOps, and detachVolumeOps runs into the same problem: if the attachment is already Dying it gives "state changing too quickly":
func detachVolumeOps(host names.Tag, v names.VolumeTag) []txn.Op {
    return []txn.Op{{
        C:      volumeAttachmentsC,
        Id:     volumeAttachmentId(host.Id(), v.Id()),
        Assert: isAliveDoc, // aborts if the attachment is already Dying
        Update: bson.D{{"$set", bson.D{{"life", Dying}}}},
    }}
}
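
For reference, "state changing too quickly; try again soon" is jujutxn.ErrExcessiveContention from github.com/juju/txn: the runner re-invokes buildTxn a few times, and when the assertions keep aborting the transaction it gives up with that error. A simplified model of that loop (not the real runner):

func run(buildTxn jujutxn.TransactionSource, apply func([]txn.Op) error) error {
    // jujutxn retries a small fixed number of times.
    for attempt := 0; attempt < 3; attempt++ {
        ops, err := buildTxn(attempt)
        if err != nil {
            return err
        }
        // apply aborts when an Assert (such as isAliveDoc) fails, which
        // is exactly what an already-Dying document triggers on every attempt.
        if err := apply(ops); err == nil {
            return nil
        }
    }
    return jujutxn.ErrExcessiveContention // "state changing too quickly; try again soon"
}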

Note that while DetachVolume *does* have a check for volumeAttachment.Life() != Dying, all the other code paths I saw calling detachVolumeOps (such as RemoveVolumeAttachmentPlan and destroyVolumeOps) do not check VolumeAttachment.Life().
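
A hedged sketch of the guard those callers would need before emitting detachVolumeOps (assuming the storage backend exposes the attachment and its Life()):

    att, err := sb.VolumeAttachment(host, volume)
    if err != nil {
        return nil, errors.Trace(err)
    }
    if att.Life() != Alive {
        // Already Dying/Dead: emitting detachVolumeOps would just trip
        // isAliveDoc and end in ErrExcessiveContention.
        return nil, jujutxn.ErrNoOperations
    }
    return detachVolumeOps(host, volume), nil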

Ian Booth (wallyworld)
Changed in juju:
status: In Progress → Fix Committed
Harry Pidcock (hpidcock)
Changed in juju:
status: Fix Committed → Fix Released