periodic task for erroring build timeouts tries to set error state on deleted instances

Bug #1501556 reported by Sam Morrison on 2015-10-01
This bug affects 2 people
Affects Status Importance Assigned to Milestone
OpenStack Compute (nova)

Bug Description

In our nova-compute logs we get a ton of these messages, over and over:

2015-10-01 11:01:54.781 30811 WARNING nova.compute.manager [req-f61f4f85-72e7-481b-a8a3-90551bdc4b58 - - - - -] [instance: 75f733b5-842e-4bde-9570-efa2735e6f12] Instance build timed out. Set to error state.

Looking in the DB, the instances in question are all deleted:

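select deleted_at, deleted, vm_state, task_state from instances where uuid = '75f733b5-842e-4bde-9570-efa2735e6f12';
+---------------------+---------+----------+------------+
| deleted_at          | deleted | vm_state | task_state |
+---------------------+---------+----------+------------+
| 2015-08-17 00:47:18 | 12283   | building | deleting   |
+---------------------+---------+----------+------------+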

We have instance_build_timeout = 3600

I think _check_instance_build_time in compute.manager needs to filter out deleted instances, but there may be a reason it checks deleted instances too.
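
As a rough illustration of the filtering suggested above, here is a minimal standalone sketch. It is not the nova code: InstanceRow and timed_out_builds are hypothetical names made up for this example, and the only assumption taken from the report is that a non-zero "deleted" column marks a soft-deleted row, as in the query output above.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Iterable, List


@dataclass
class InstanceRow:
    """Hypothetical stand-in for a row of the instances table."""
    uuid: str
    vm_state: str
    deleted: int          # 0 = live row, non-zero = soft-deleted (assumption)
    created_at: datetime


def timed_out_builds(rows: Iterable[InstanceRow],
                     timeout_seconds: int,
                     now: datetime) -> List[InstanceRow]:
    """Return instances still building past the timeout, skipping deleted rows."""
    cutoff = now - timedelta(seconds=timeout_seconds)
    return [
        row for row in rows
        if row.vm_state == "building"
        and row.deleted == 0          # the extra filter proposed in this report
        and row.created_at < cutoff
    ]
```

With instance_build_timeout = 3600, a row like the one in the query above (deleted = 12283, vm_state = building) would be skipped by this check rather than being set to error state on every run of the periodic task.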

Hans Lindgren (hanlind) wrote :

Looks like vm_state is 'building' although it should be 'deleted' for a deleted instance.

tags: added: compute
jichenjc (jichenjc) wrote :

Agreed, vm_state should be 'DELETED'. Did someone operate on the DB directly?
Otherwise, task_state = deleting with vm_state = building seems weird.

Changed in nova:
status: New → Confirmed
importance: Undecided → Low
Chuck Carmack (chuckcarmack75) wrote :

It seems like Delete was called on the instance while it was in building state, and the instance was destroyed but not saved.

I think save was supposed to update the vm_state and task_state columns, while destroy was able to update the deleted_at column.

Changed in nova:
assignee: nobody → Pushkar Umaranikar (pushkar-umaranikar)

Fix proposed to branch: master

Changed in nova:
status: Confirmed → In Progress
John Garbutt (johngarbutt) wrote :

So I think there is a bug for cells where instances get stuck in the deleting state for some time, and only eventually heal, so that is what is exposing this bug (I am guessing).

Sam Morrison (sorrison) wrote :

I don't think this is cells related. This is happening on the compute nodes, against the local compute DB. It may be that cells causes instances to get into this state in the first place, but the instance build timeout code is all local to the compute node, so cells shouldn't be playing a part here.

Change abandoned by Sean Dague (<email address hidden>) on branch: master
Reason: This review is > 6 weeks without comment, and failed Jenkins the last time it was checked. We are abandoning this for now. Feel free to reactivate the review by pressing the restore button and leaving a 'recheck' comment to get fresh test results.

Maciej Szankin (mszankin) wrote :

This bug report has had an assignee for a while now, but there is no patch
for it. It looks like the chance of getting a patch is low.
I'm going to remove the assignee to signal to others that they can take
over if they like.
If you want to work on this, please:
* add yourself as assignee AND
* set the status to "In Progress" AND
* provide a (WIP) patch within the next 2 weeks after that.
If you need assistance, reach out on the IRC channel #openstack-nova or
use the mailing list.

Changed in nova:
status: In Progress → Confirmed
assignee: Pushkar Umaranikar (pushkar-umaranikar) → nobody