ability to delete instances in 'BUILD' state

Bug #907808 reported by Dan Prince on 2011-12-22
This bug affects 1 person
Affects: OpenStack Compute (nova)
Assigned to: Dan Prince

Bug Description

Using nova Essex I can no longer delete instances in 'BUILD' state.

The ability to delete instances in 'BUILD' state is very useful from an end user perspective. We used to support it (in Diablo) and we should in Essex as well.

Dan Prince (dan-prince) on 2011-12-22
Changed in nova:
assignee: nobody → Dan Prince (dan-prince)
importance: Undecided → High

Pick your poison:

1) Allow deleting instances that are still building, which causes interesting problems for nova-compute when the instance it is trying to build is torn down out from under it while it is using it.
2) Disallow it, so an instance that gets stuck in BUILD can't be deleted.

The latter is a bug elsewhere in the code. Instances should never get stuck in BUILD, they should either go to ACTIVE or ERROR.

I'd much rather fix the underlying problem than paper over it.

Gabe Westmaas (westmaas) wrote :

I think we have to do both. Of course we want to fix the underlying problem, but in the meantime, until we know about those problems, I don't want end users to have to call on operations to clean up the issues.

It is true that when they get deleted, we have a harder time troubleshooting, but the customer experience is better. I think we need to look at logging, functional testing and other techniques to continue to identify these cases, but need to be able to clean things up in the meantime.

Mark Washenberger (markwash) wrote :

By the way, we discovered that poison #1 was actually a red herring based on misreading log files.

Gabe Westmaas (westmaas) wrote :

Based on my misreading of a log line, that is.

I'd like to see this addressed now as users are definitely complaining about this. I think we need to sit down and figure out two things:
1) Which actions can we say, for all hypervisors, should not be allowed at the same time?
2) What is the best place to prevent those from happening?

Unfortunately, I think the spec is incorrect in this case and should be updated to reflect how users actually want to use this, once we decide which actions can be allowed at the same time. In any case, our API reports statuses indicating that an action has been queued, so it should be fine to queue things in some way, regardless of the current state of the machine. I think this is a case where we took the spec as definitive for defining state transitions in the blueprint, and we shouldn't have done that. Obviously this discussion would have been ideal on the blueprint discussion, but we missed that.

Aside from that, there are a number of other states that return 409s with the current implementation even though they are reported to the user as ACTIVE.

Prior to making our long term decisions, I would be ok with seeing the change backed out, certainly something should be done.

Poison #1 isn't a red herring. I'm not sure what you are expecting to happen when an instance is deleted out from under nova-compute when it's trying to build it. It's a classic race condition and will fail like one too.

I don't think we made a mistake following the spec, that's what it is there for. However, I agree that the spec is too strict. Users do have an expectation that they won't end up in a situation they can't fix, even if it's a result of bugs on our end.

Since bugs can occur anywhere in nova, we can be left with an instance that cannot be deleted in pretty much any state. We still don't want race conditions in normal situations, so it comes down to figuring out when an instance has been in a state for too long. That's a problem I don't want to try to solve; we will never know how long is long enough.

I favor moving these instances to another project for troubleshooting purposes. This will solve the user experience problem by not having it owned by them anymore, but also gives developers and operations the opportunity to inspect the instance and figure out what went wrong that caused it to get stuck. We may end up with instances that were deleted too early but ended up completing their operation, but that should be easy to detect and automatically cleanup with scripts (the aforementioned race condition). We would be able to provide the experience users expect while still providing the information developers need to troubleshoot the problem.

Gabe Westmaas (westmaas) wrote :

I'd expect the instance to end up in a deleted state, which I believe is what happened, correct? There were of course errors from Xen, but still the end result was what everyone expected.

Yes, I misspoke, the spec was too strict, and it should have been caught sooner, and that was the mistake. Following the spec was not the mistake.

I'm fine with doing that, but right now we need to move forward with something. I don't know if this will be universally accepted, but as long as we have the right configuration available (whether or not to do it, which project to move to, what to do when that account is full from a quota standpoint, etc) we can do this.

In the meantime, I'd like to just take out the decorators on delete. I think the one remaining issue, which we should fix (differently) is that deleting while resizing has some race conditions - most of which will end with a deleted server (as expected), but in some cases may not.

Also, this doesn't fix our inability to rebuild, change password, or do other things while taking a snapshot, along with other such state issues. There are a slew of vm and task states that translate to ACTIVE status where the user still gets a 409 back. Again, this can be fixed in the short term by updating the decorators.

What I propose is:
1) Remove decorators on delete
2) Update other decorators
3) Add in the troubleshooting abilities Johannes mentions above

I think longer term we should look at removing more and more of those restrictions at the API layer and adding more serialization lower down in the stack to resolve these race conditions.
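To make the "decorators on delete" discussion concrete, here is a minimal sketch of the state-check pattern being debated: a decorator that rejects an API action with a 409-style error unless the instance's vm_state (and optionally task_state) is in an allowed set. The names and signatures below are illustrative assumptions, not nova's actual implementation; "removing the decorator on delete" simply means not applying it to that method.

```python
import functools

class InstanceInvalidState(Exception):
    """Surfaced at the API layer as an HTTP 409 Conflict."""

def check_instance_state(vm_states=None, task_states=None):
    """Reject the call unless the instance's states are in the allowed sets."""
    def decorator(func):
        @functools.wraps(func)
        def wrapper(self, context, instance, *args, **kwargs):
            if vm_states is not None and instance["vm_state"] not in vm_states:
                raise InstanceInvalidState(
                    "vm_state %r not allowed" % instance["vm_state"])
            if task_states is not None and instance["task_state"] not in task_states:
                raise InstanceInvalidState(
                    "task_state %r blocks this action" % instance["task_state"])
            return func(self, context, instance, *args, **kwargs)
        return wrapper
    return decorator

class ComputeAPI:
    def delete(self, context, instance):
        # No @check_instance_state here: delete is allowed in any state,
        # including BUILD (the fix proposed in this bug).
        return "queued for delete"

    @check_instance_state(vm_states={"active"}, task_states={None})
    def snapshot(self, context, instance):
        # Still guarded: only allowed when active and idle.
        return "snapshot started"
```

Under this sketch, delete succeeds even for an instance in BUILD, while snapshot still raises when a task is in flight, which is exactly the per-action granularity proposals 1) and 2) above are adjusting.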

Chris Behrens (cbehrens) wrote :

I agree with removing the decorator on delete for now.... and dealing with the consequences. I don't think it's a problem for 'BUILD'. We'll get exceptions, as you mention, but they wouldn't cause any harm. I think for resize we may just need to deal with it until we can come up with something better.

And yeah, it is definitely an issue that 'task_state' is not exposed to the user, but we have restrictions on it and return 409s while vm_state is ACTIVE. The API docs will say ACTIVE is a valid state for the action, yet we might 409 because task_state is 'snapshotting'. The user has no idea what's going on, although the message being returned is somewhat informative.

#3 would be helpful... With #1, allow delete... but move the instance if it's in BUILD state and it is of a certain age.
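The ACTIVE-but-409 situation described above comes from the user-visible status being derived from vm_state alone while the action checks also consult the hidden task_state. A small illustrative sketch (assumed names, not nova's actual mapping):

```python
# Only vm_state drives the status a user sees in the API.
VM_STATE_TO_STATUS = {
    "active": "ACTIVE",
    "building": "BUILD",
    "error": "ERROR",
}

def api_status(instance):
    """User-visible status; task_state is not exposed at all."""
    return VM_STATE_TO_STATUS.get(instance["vm_state"], "UNKNOWN")

def can_snapshot(instance):
    """Action checks look at task_state too, so an ACTIVE server can 409."""
    return instance["vm_state"] == "active" and instance["task_state"] is None
```

So a server with vm_state "active" and task_state "snapshotting" reports ACTIVE yet fails the snapshot check, which is the confusing 409 the thread is complaining about.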

Dan Prince (dan-prince) on 2012-01-17
Changed in nova:
status: New → In Progress

I'm very confused.

Why did this bug sit for weeks without any comment and now it's important enough to rush through a revert of behavior? We even discussed alternative fixes a while ago.

What is wrong with my proposal that can't be implemented instead?

I feel like I'm being ignored.

Chris Behrens (cbehrens) wrote :

I think we agree with your proposal, but complaints are rising about not being able to delete things that have failed. It's become pretty clear that we have to have delete working in all cases. So, I don't think you're being ignored, but I think it's fair to have a better 'customer experience' ASAP and then implement the added troubleshooting helpfulness. (...not that things erroring in the first place is a great customer experience. But people do try to delete the failed instances and then get even more frustrated when they can't clean up their account).

Reviewed: https://review.openstack.org/3112
Committed: http://github.com/openstack/nova/commit/c7d2f020f0fdf04b24bd21668e7a02796f1f5538
Submitter: Jenkins
Branch: master

commit c7d2f020f0fdf04b24bd21668e7a02796f1f5538
Author: Dan Prince <email address hidden>
Date: Tue Jan 17 14:33:16 2012 -0500

    Allow instances in 'BUILD' state to be deleted.

    Fixes LP Bug #907808.

    Change-Id: I4332e9e822db507951af07bd654a27b3e2ce3973

Changed in nova:
status: In Progress → Fix Committed
Thierry Carrez (ttx) on 2012-01-25
Changed in nova:
milestone: none → essex-3
status: Fix Committed → Fix Released
Thierry Carrez (ttx) on 2012-04-05
Changed in nova:
milestone: essex-3 → 2012.1