Resize confirm fails if nova-compute is restarted after resize

Bug #1774252 reported by Matthew Booth
Affects: OpenStack Compute (nova)
Status: New
Importance: Undecided
Assigned to: Unassigned

Bug Description

Originally reported in RH bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=1584315

Reproduced on OSP12 (Pike).

After resizing an instance but before the resize is confirmed, update_available_resource will fail on the source compute due to bug 1774249. If nova-compute is restarted at this point, before the resize is confirmed, the update_available_resource periodic task will never have succeeded, and therefore ResourceTracker's compute_nodes dict will not be populated at all.

When confirm calls _delete_allocation_after_move() it will fail with ComputeHostNotFound because there is no entry for the current node in ResourceTracker. The error looks like:

2018-05-30 13:42:19.239 1 ERROR nova.compute.manager [req-4f7d5d63-fc05-46ed-b505-41050d889752 09abbd4893bb45eea8fb1d5e40635339 d4483d13a6ef41b2ae575ddbd0c59141 - default default] [instance: 1374133a-2c08-4a8f-94f6-729d4e58d7e0] Setting instance vm_state to ERROR: ComputeHostNotFound: Compute host compute-1.localdomain could not be found.
2018-05-30 13:42:19.239 1 ERROR nova.compute.manager [instance: 1374133a-2c08-4a8f-94f6-729d4e58d7e0] Traceback (most recent call last):
2018-05-30 13:42:19.239 1 ERROR nova.compute.manager [instance: 1374133a-2c08-4a8f-94f6-729d4e58d7e0] File "/usr/lib/python2.7/site-packages/nova/compute/manager.py", line 7445, in _error_out_instance_on_exception
2018-05-30 13:42:19.239 1 ERROR nova.compute.manager [instance: 1374133a-2c08-4a8f-94f6-729d4e58d7e0] yield
2018-05-30 13:42:19.239 1 ERROR nova.compute.manager [instance: 1374133a-2c08-4a8f-94f6-729d4e58d7e0] File "/usr/lib/python2.7/site-packages/nova/compute/manager.py", line 3757, in _confirm_resize
2018-05-30 13:42:19.239 1 ERROR nova.compute.manager [instance: 1374133a-2c08-4a8f-94f6-729d4e58d7e0] migration.source_node)
2018-05-30 13:42:19.239 1 ERROR nova.compute.manager [instance: 1374133a-2c08-4a8f-94f6-729d4e58d7e0] File "/usr/lib/python2.7/site-packages/nova/compute/manager.py", line 3790, in _delete_allocation_after_move
2018-05-30 13:42:19.239 1 ERROR nova.compute.manager [instance: 1374133a-2c08-4a8f-94f6-729d4e58d7e0] cn_uuid = rt.get_node_uuid(nodename)
2018-05-30 13:42:19.239 1 ERROR nova.compute.manager [instance: 1374133a-2c08-4a8f-94f6-729d4e58d7e0] File "/usr/lib/python2.7/site-packages/nova/compute/resource_tracker.py", line 155, in get_node_uuid
2018-05-30 13:42:19.239 1 ERROR nova.compute.manager [instance: 1374133a-2c08-4a8f-94f6-729d4e58d7e0] raise exception.ComputeHostNotFound(host=nodename)
2018-05-30 13:42:19.239 1 ERROR nova.compute.manager [instance: 1374133a-2c08-4a8f-94f6-729d4e58d7e0] ComputeHostNotFound: Compute host compute-1.localdomain could not be found.
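The failure mode can be sketched in a few lines of plain Python (a minimal illustration, not nova's actual code; the class and method names only mirror the ones in the traceback above):

```python
# Minimal illustration of why an unpopulated compute_nodes dict makes
# get_node_uuid() fail after a nova-compute restart.

class ComputeHostNotFound(Exception):
    pass

class ResourceTracker:
    def __init__(self):
        # Populated only when the update_available_resource periodic
        # task succeeds; empty right after nova-compute restarts.
        self.compute_nodes = {}

    def _populate(self, nodename, uuid):
        # Stand-in for what a successful periodic task run would do.
        self.compute_nodes[nodename] = uuid

    def get_node_uuid(self, nodename):
        # Mirrors resource_tracker.py: a plain cache lookup that raises
        # if the periodic task has never succeeded.
        try:
            return self.compute_nodes[nodename]
        except KeyError:
            raise ComputeHostNotFound(nodename)

rt = ResourceTracker()
# Before the periodic task has run once, any confirm-resize path that
# calls get_node_uuid() fails, exactly as in the traceback:
try:
    rt.get_node_uuid("compute-1.localdomain")
except ComputeHostNotFound as e:
    print("confirm fails:", e)
```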

Revision history for this message
jichenjc (jichenjc) wrote :

If we fix the previous bug, should this be avoidable?
 Otherwise: deleting a migration record should update the compute node, and without a compute node this should raise some exception, so is this working-as-designed behaviour?

tags: added: resource-tracker
tags: added: compute
Revision history for this message
Matthew Booth (mbooth-9) wrote :

jichenjc,

It would be avoided except for a brief race: confirm will only be able to proceed after the periodic task has run for the first time.

I filed this separately because I think this code should be more defensive. We've got code called in an unknown number of ways which will fail unless some other code has run before it, but we don't do anything to ensure that the other code has run first.

Firstly, we should obviously fix the bug which is causing the periodic to fail today. Once we've done that, we should either:

* Move the initial population of ResourceTracker to init_host so that the compute host won't start processing jobs until it has been successfully initialized.

* Make ResourceTracker do something defensive and non-failing if we try to do stuff with it before it has been initialized.

I suspect that the former would be better. Anyway, this is a separate issue from the immediate bug, which is why I filed two bugs.
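The second option might look something like the sketch below (a hedged illustration, not a patch: `DefensiveResourceTracker` and `db_get_compute_node` are hypothetical stand-ins, and nova would go through its objects layer rather than a bare dict for the fallback lookup):

```python
# Sketch of a "defensive" ResourceTracker: if the in-memory cache has
# never been populated, fall back to a fresh lookup instead of raising
# immediately.

FAKE_DB = {"compute-1.localdomain": "8a7c6b5d-node-uuid"}

def db_get_compute_node(nodename):
    # Hypothetical database lookup; in nova this would be an
    # objects.ComputeNode-style query, not a dict.
    return FAKE_DB.get(nodename)

class ComputeHostNotFound(Exception):
    pass

class DefensiveResourceTracker:
    def __init__(self):
        self.compute_nodes = {}  # empty until the periodic task runs

    def get_node_uuid(self, nodename):
        # Defensive path: if the cache is cold, try to populate it on
        # demand before giving up.
        if nodename not in self.compute_nodes:
            uuid = db_get_compute_node(nodename)
            if uuid is None:
                raise ComputeHostNotFound(nodename)
            self.compute_nodes[nodename] = uuid
        return self.compute_nodes[nodename]

rt = DefensiveResourceTracker()
print(rt.get_node_uuid("compute-1.localdomain"))
```

The trade-off is that every cold-cache caller now pays for an extra lookup, which is one reason doing the population once in init_host may be preferable.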

Revision history for this message
jichenjc (jichenjc) wrote :

you mentioned

>>>the update_available_resource period task will never have succeeded, and therefore ResourceTracker's compute_nodes dict will not be populated at all

I think you are suggesting that we run update_available_resource before any request can be scheduled to the compute node, right? But I am a little confused about why the periodic task can't succeed; wouldn't running the population before init_host help it succeed?

>>>It would be avoided with a brief race, which is that confirm will be able to proceed after the periodic task has run for the first time.

>>>Move the initial population of ResourceTracker to init_host so that the compute host won't start processing jobs until it has been successfully initialized.

I am not sure whether that's true, because of
https://github.com/openstack/nova/blob/master/nova/compute/manager.py#L1157

As you can see, the pre-start hook seems to already call the resource update before the service starts listening.

My guess is that
https://github.com/openstack/nova/blob/master/nova/compute/manager.py#L7344
may only log the failure and doesn't prevent the service from starting.
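If that guess is right, the startup path would behave like this sketch (illustrative only; `pre_start_hook` and `update_available_resource` here are simplified stand-ins for the linked nova code, not its real signatures):

```python
import logging

logging.basicConfig()
log = logging.getLogger("sketch")

started = []

def update_available_resource():
    # Stand-in for the periodic task failing as in bug 1774249.
    raise RuntimeError("simulated migration-handling failure")

def pre_start_hook():
    # If the exception is caught and only logged here, startup
    # continues regardless and ResourceTracker stays unpopulated.
    try:
        update_available_resource()
    except Exception:
        log.exception("update_available_resource failed; continuing startup")

pre_start_hook()
started.append(True)  # the service goes on to accept requests anyway
print("service started despite the failed resource update")
```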
