Wallaby: Revert resize wedged with status: VERIFY_RESIZE

Bug #1957939 reported by Will Szumski
This bug affects 5 people
Affects                    Status         Importance  Assigned to  Milestone
OpenStack Compute (nova)   Fix Released   Undecided   Unassigned
Wallaby                    Fix Committed  Undecided   Unassigned

Bug Description

Description
===========

When attempting to revert a resize operation, the instance gets wedged in the VERIFY_RESIZE state.

Steps to reproduce
==================

- Boot instance
- Resize instance to larger flavor
- Wait for status VERIFY_RESIZE
- Run openstack server resize revert <server>
- Instance moves to status REVERT_RESIZE
- Wait some time
- Instance moves back to VERIFY_RESIZE
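
For completeness, here is a minimal scripted version of the reproduction using openstacksdk. The cloud name, server name, and flavor name are placeholders, and the compute-proxy calls used (resize_server, wait_for_server, revert_server_resize) reflect my understanding of the SDK, so treat this as a sketch rather than a verified script:

import time

import openstack

# Placeholders: adjust the clouds.yaml entry, server name and flavor to your environment.
conn = openstack.connect(cloud="mycloud")
server = conn.compute.find_server("test-instance")
flavor = conn.compute.find_flavor("m1.large")

# Resize and wait until nova reports VERIFY_RESIZE.
conn.compute.resize_server(server, flavor)
server = conn.compute.wait_for_server(server, status="VERIFY_RESIZE", wait=600)

# Revert the resize. On an affected deployment the instance moves to
# REVERT_RESIZE and then falls back to VERIFY_RESIZE instead of ACTIVE.
conn.compute.revert_server_resize(server)
for _ in range(60):
    server = conn.compute.get_server(server.id)
    print(server.status)
    if server.status not in ("REVERT_RESIZE", "VERIFY_RESIZE"):
        break
    time.sleep(10)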

Once you hit this condition, there doesn't seem to be a way to recover the server via the CLI:

(rally) stackhpc@cerebro:~/shakespeare$ openstack server resize confirm 9ba8ad66-d16b-4023-adb9-d5898c73b6db
Instance has not been resized. (HTTP 400) (Request-ID: req-53d1e915-ea31-40ee-a9de-126d3089e944)
(rally) stackhpc@cerebro:~/shakespeare$ openstack server resize --revert 9ba8ad66-d16b-4023-adb9-d5898c73b6db

Both of these operations fail.

Environment
===========
1. Exact version of OpenStack you are running. See the following
  list for all releases: http://docs.openstack.org/releases/

   Kolla-ansible deployed wallaby on centos8-stream. This is a source
   build using the following version of nova: `23.1.1.dev6`, installed via
   pip.

2. Which hypervisor did you use?

   libvirt with KVM (libvirt: 7.9.0)

3. Which storage type did you use?

   Ceph Octopus

4. Which networking type did you use?

   Neutron with openvswitch

Tags: neutron resize
Revision history for this message
Will Szumski (willjs) wrote :

We could resize and revert when the instance was resized to the same host with the configuration option `allow_resize_to_same_host = true`.

Revision history for this message
Sylvain Bauza (sylvain-bauza) wrote :

Looks like it's a config issue; your VIF creation fails when you resize:

2022-01-12 16:27:06.063 7 ERROR nova.compute.manager [req-ea986ad9-09c8-4f2e-a685-b1535a4b1601 a877cc12f6b84f9c966a1b21b8cf0dbf 9437e136467242b2b67c35d36a2dcfa0 - default default] [instance: c303e65f-f34c-45ab-bc94-885088537c4c] Failed to allocate network(s): nova.exception.VirtualInterfaceCreateException: Virtual Interface creation failed
2022-01-12 16:27:06.063 7 ERROR nova.compute.manager [instance: c303e65f-f34c-45ab-bc94-885088537c4c] Traceback (most recent call last):
2022-01-12 16:27:06.063 7 ERROR nova.compute.manager [instance: c303e65f-f34c-45ab-bc94-885088537c4c] File "/var/lib/kolla/venv/lib/python3.6/site-packages/nova/virt/libvirt/driver.py", line 7233, in _create_guest_with_network
2022-01-12 16:27:06.063 7 ERROR nova.compute.manager [instance: c303e65f-f34c-45ab-bc94-885088537c4c] post_xml_callback=post_xml_callback)
2022-01-12 16:27:06.063 7 ERROR nova.compute.manager [instance: c303e65f-f34c-45ab-bc94-885088537c4c] File "/usr/lib64/python3.6/contextlib.py", line 88, in __exit__
2022-01-12 16:27:06.063 7 ERROR nova.compute.manager [instance: c303e65f-f34c-45ab-bc94-885088537c4c] next(self.gen)
2022-01-12 16:27:06.063 7 ERROR nova.compute.manager [instance: c303e65f-f34c-45ab-bc94-885088537c4c] File "/var/lib/kolla/venv/lib/python3.6/site-packages/nova/compute/manager.py", line 479, in wait_for_instance_event
2022-01-12 16:27:06.063 7 ERROR nova.compute.manager [instance: c303e65f-f34c-45ab-bc94-885088537c4c] actual_event = event.wait()
2022-01-12 16:27:06.063 7 ERROR nova.compute.manager [instance: c303e65f-f34c-45ab-bc94-885088537c4c] File "/var/lib/kolla/venv/lib/python3.6/site-packages/eventlet/event.py", line 125, in wait
2022-01-12 16:27:06.063 7 ERROR nova.compute.manager [instance: c303e65f-f34c-45ab-bc94-885088537c4c] result = hub.switch()
2022-01-12 16:27:06.063 7 ERROR nova.compute.manager [instance: c303e65f-f34c-45ab-bc94-885088537c4c] File "/var/lib/kolla/venv/lib/python3.6/site-packages/eventlet/hubs/hub.py", line 313, in switch
2022-01-12 16:27:06.063 7 ERROR nova.compute.manager [instance: c303e65f-f34c-45ab-bc94-885088537c4c] return self.greenlet.switch()
2022-01-12 16:27:06.063 7 ERROR nova.compute.manager [instance: c303e65f-f34c-45ab-bc94-885088537c4c] eventlet.timeout.Timeout: 300 seconds
2022-01-12 16:27:06.063 7 ERROR nova.compute.manager [instance: c303e65f-f34c-45ab-bc94-885088537c4c]
2022-01-12 16:27:06.063 7 ERROR nova.compute.manager [instance: c303e65f-f34c-45ab-bc94-885088537c4c] During handling of the above exception, another exception occurred:
2022-01-12 16:27:06.063 7 ERROR nova.compute.manager [instance: c303e65f-f34c-45ab-bc94-885088537c4c]
2022-01-12 16:27:06.063 7 ERROR nova.compute.manager [instance: c303e65f-f34c-45ab-bc94-885088537c4c] Traceback (most recent call last):
2022-01-12 16:27:06.063 7 ERROR nova.compute.manager [instance: c303e65f-f34c-45ab-bc94-885088537c4c] File "/var/lib/kolla/venv/lib/python3.6/site-packages/nova/compute/manager.py", line 2413, in _build_and_run_instance
2022-01-12 16:27:06.063 7 ERROR nova.compute.manager [instance: c303e65f-f34c-45ab-bc94-8850...


Changed in nova:
status: New → Incomplete
tags: added: resize
Revision history for this message
Pablo (pescobar001) wrote :

I tried to reproduce the problem and I cannot find any error in the neutron logs, but I found this other bug report, which is very similar:

https://bugzilla.redhat.com/show_bug.cgi?id=1678681

Revision history for this message
Pablo (pescobar001) wrote :

I have found similar errors in the nova debug log to those reported in the bug report above. To be honest, I don't have the knowledge to identify this as the root cause, but in the previous bug report one of the developers claims that these log entries are related to the problem and that it's "a legit nova bug":

2022-01-21 17:15:09.421 7 DEBUG nova.compute.manager [req-c99a6c87-b072-4960-8ef1-f4ec3a7eaa6c 611779e17c1d46d497c03be4ff61db7e d51060ea788b4430bad788cbc8e0bdd9 - default default] [instance: 5d7e2d79-e20c-4c3c-9a23-b3c54f6d35dc] No waiting events found dispatching network-vif-plugged-64146d55-e450-4329-9340-36409145ab95 pop_instance_event /var/lib/kolla/venv/lib/python3.6/site-packages/nova/compute/manager.py:319

2022-01-21 17:16:20.347 7 DEBUG nova.compute.manager [req-39c7bc03-cf1b-42f7-a3ce-2bf9048a59d1 611779e17c1d46d497c03be4ff61db7e d51060ea788b4430bad788cbc8e0bdd9 - default default] [instance: 5d7e2d79-e20c-4c3c-9a23-b3c54f6d35dc] No waiting events found dispatching network-vif-plugged-64146d55-e450-4329-9340-36409145ab95 pop_instance_event /var/lib/kolla/venv/lib/python3.6/site-packages/nova/compute/manager.py:319

2022-01-21 17:15:44.303 7 DEBUG nova.compute.manager [req-c83d4dc4-9dd6-45d7-8e2a-8857192f8233 648668eab36942a88df88fae42ca1a99 70bb85f24e6e46298e6adc980761f9d9 - default default] [instance: 5d7e2d79-e20c-4c3c-9a23-b3c54f6d35dc] Preparing to wait for external event network-vif-plugged-64146d55-e450-4329-9340-36409145ab95 prepare_for_instance_event /var/lib/kolla/venv/lib/python3.6/site-packages/nova/compute/manager.py:281

2022-01-21 17:17:52.757 7 DEBUG nova.compute.manager [req-06a3a008-388d-417c-9304-83984b71b04a 648668eab36942a88df88fae42ca1a99 70bb85f24e6e46298e6adc980761f9d9 - default default] [instance: 5d7e2d79-e20c-4c3c-9a23-b3c54f6d35dc] Preparing to wait for external event network-vif-plugged-64146d55-e450-4329-9340-36409145ab95 prepare_for_instance_event /var/lib/kolla/venv/lib/python3.6/site-packages/nova/compute/manager.py:281

Changed in nova:
status: Incomplete → New
Revision history for this message
Balazs Gibizer (balazs-gibizer) wrote :

You have to look at the neutron logs to see if neutron successfully plugged the vifs. Without more complete nova and neutron debug logs attached, it is hard to say why the vif plug timed out.

About the secondary error: "ValueError: Circular reference detected". That is probably a valid bug. We had a similar one in the past, but that is fixed in wallaby: https://review.opendev.org/q/topic:bug%252F1864665

I'm setting this to incomplete.

Changed in nova:
status: New → Incomplete
tags: added: neutron
Revision history for this message
Pavel Mracek (mrakotiq) wrote :

I also hit this bug and found some interesting log lines on the node where the VM should have been resized. It looks like the port removal event arrives at a time when the VM has already been removed, and it throws an exception in /usr/lib/python3/dist-packages/nova/compute/manager.py on line 10542 (xena), or in master here: https://github.com/openstack/nova/blob/aad31e6ba489f720f5bdc765c132fd0f059a0329/nova/compute/manager.py#L10834

... but why the event/error is not delivered and the request ends with a timeout, I don't know. (A simplified sketch of this prepare/wait event pattern follows after this comment.)

Hope this helps:

Log from "destination" node
#############################
2022-10-04 11:23:04.610 1978602 INFO nova.virt.libvirt.driver [-] [instance: b6713c7b-78b2-4a98-9fd9-d15bba46e7b6] Instance destroyed successfully.
2022-10-04 11:23:04.691 1978602 INFO nova.virt.libvirt.driver [req-02388e45-5075-4b7c-bd51-813f450d0031 50764bebfebd116f803dd84621c29053b54595d95b67f9d2b5fbb2717d69a656 9a0c0e73b2e04a90a7f0de3a4ba391c9 - default default] [instance: b6713c7b-78b2-4a98-9fd9-d15bba46e7b6] Deleting instance files /var/lib/nova/instances/b6713c7b-78b2-4a98-9fd9-d15bba46e7b6_del
2022-10-04 11:23:05.974 1978602 INFO nova.virt.libvirt.driver [req-02388e45-5075-4b7c-bd51-813f450d0031 50764bebfebd116f803dd84621c29053b54595d95b67f9d2b5fbb2717d69a656 9a0c0e73b2e04a90a7f0de3a4ba391c9 - default default] [instance: b6713c7b-78b2-4a98-9fd9-d15bba46e7b6] Deletion of /var/lib/nova/instances/b6713c7b-78b2-4a98-9fd9-d15bba46e7b6_del complete
2022-10-04 11:23:06.792 1978602 WARNING nova.compute.manager [req-7f65479a-15f5-401d-a966-75b0ab57dccb ffcb10315af847c9bc45971bea122fb7 - - b7913c95910e4431a16009e665db28bc -] [instance: b6713c7b-78b2-4a98-9fd9-d15bba46e7b6] Received unexpected event network-vif-unplugged-1f532f40-5723-4e5b-bb6f-34d544e85896 for instance with vm_state resized and task_state resize_reverting.
2022-10-04 11:23:08.836 1978602 WARNING nova.compute.manager [req-c359ae0a-ee99-422a-a150-2c6be0d18774 ffcb10315af847c9bc45971bea122fb7 - - b7913c95910e4431a16009e665db28bc -] [instance: b6713c7b-78b2-4a98-9fd9-d15bba46e7b6] Received unexpected event network-vif-plugged-1f532f40-5723-4e5b-bb6f-34d544e85896 for instance with vm_state resized and task_state resize_reverting.
vvvvv
2022-10-04 11:23:09.145 1978602 INFO nova.compute.manager [req-793b4785-a4e7-4515-aa6d-050d27726911 ffcb10315af847c9bc45971bea122fb7 - - b7913c95910e4431a16009e665db28bc -] [instance: b6713c7b-78b2-4a98-9fd9-d15bba46e7b6] Failed to process external instance event network-changed-1f532f40-5723-4e5b-bb6f-34d544e85896 due to: Instance b6713c7b-78b2-4a98-9fd9-d15bba46e7b6 could not be found.
^^^^^
2022-10-04 11:23:19.613 1978602 INFO nova.compute.manager [-] [instance: b6713c7b-78b2-4a98-9fd9-d15bba46e7b6] VM Stopped (Lifecycle Event)
2022-10-04 11:28:12.084 1978602 WARNING nova.compute.manager [req-f5ddf196-9dd4-4ec4-ad06-ded090886a0a ffcb10315af847c9bc45971bea122fb7 - - b7913c95910e4431a16009e665db28bc -] [instance: b6713c7b-78b2-4a98-9fd9-d15bba46e7b6] Received unexpected event network-vif-unplugged-1f532f40-5723-4e5b-bb6f-34d544e85896 for instance with vm_state error and task_state None.

Logs from controllers split by req-id:
#####################################...

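To make the mechanism behind these log lines easier to follow, here is a deliberately simplified, self-contained sketch (not nova code) of the "prepare for an external event, then wait with a timeout" pattern: an event that arrives before a waiter is registered, or for an instance nova no longer tracks, is dropped, and the waiter later times out.

import threading

class ExternalEventWaiter:
    """Toy model of the prepare/pop/wait pattern seen in the logs above."""

    def __init__(self):
        # (instance_id, event_name) -> threading.Event
        self._waiters = {}

    def prepare_for_instance_event(self, instance_id, event_name):
        ev = threading.Event()
        self._waiters[(instance_id, event_name)] = ev
        return ev

    def pop_instance_event(self, instance_id, event_name):
        ev = self._waiters.pop((instance_id, event_name), None)
        if ev is None:
            # Mirrors "No waiting events found dispatching ..." in the logs.
            print("No waiting events found dispatching %s" % event_name)
            return
        ev.set()

    @staticmethod
    def wait_for_instance_event(ev, timeout=300):
        if not ev.wait(timeout):
            raise TimeoutError("Timeout waiting for Neutron events after %s seconds" % timeout)


waiter = ExternalEventWaiter()

# The event arrives before anyone has registered for it, so it is dropped ...
waiter.pop_instance_event("b6713c7b", "network-vif-plugged")

# ... and the waiter that is registered afterwards is never woken up.
ev = waiter.prepare_for_instance_event("b6713c7b", "network-vif-plugged")
try:
    ExternalEventWaiter.wait_for_instance_event(ev, timeout=1)  # short timeout for the demo
except TimeoutError as exc:
    print(exc)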

Revision history for this message
Maximilian Stinsky (mstinsky) wrote :

We just upgraded nova to wallaby (23.2.1) in our test environment, and after the upgrade our tempest tests fail for the revert resize tests (e.g. tempest.api.compute.servers.test_server_actions.ServerActionsTestJSON.test_resize_server_revert).
After confirming by manual testing that resize revert no longer works, we searched and found this bug report.

I can confirm exactly what Will Szumski states in this bug report; it's the exact same issue for us.
-> Resize instance to bigger flavor
-> Revert the resize
-> Wait for log event (Timeout waiting for Neutron events: [('network-vif-plugged', '4aa5c785-816b-4ccc-bd4c-54c9cf408ba4')])
-> Instance is back in VERIFY_RESIZE state

As Will Szumski also stated, after that the instance does not seem to be recoverable; both a confirm and a revert resize result in HTTP 400 (Instance has not been resized) afterwards.

First we thought the problem might be that we were still running neutron victoria in our test environment, but I can confirm that this is still an issue with both nova and neutron running wallaby together. As the nova upgrade to wallaby triggered this issue, I assume there might be a bug somewhere in nova.

We are running nova with libvirt 8.0.0 / qemu 4.2.1, and neutron ovs/hybridIptables.

This is a complete showstopper for any further upgrade for us, as users are effectively able to bring VMs into a state which they cannot recover from.

Is there any help we can give to find the root cause of this issue?

Because there are at least 3 cloud operators in this bug report hitting this issue independently, I moved this issue back to New from Incomplete.

Changed in nova:
status: Incomplete → New
Revision history for this message
Maximilian Stinsky (mstinsky) wrote :

I dug around a bit in the wallaby code for changes regarding resize revert, and it seems that the following change is responsible for this bug [1].

After I reverted this change in wallaby (23.2.1) in our test environment, the problem completely disappeared and all resize reverts work again.

I can't explain exactly what is going on, but maybe a nova developer understands what's happening here with this change and why it breaks resize reverts.

[1] https://github.com/openstack/nova/commit/66c7f00e1d9d7c0eebe46eb4b24b2b21f7413789#diff-486ddc37ce8e7c105a51566dd2cffbda034f28fae7eaad2e74abf470ab296352

Revision history for this message
Maximilian Stinsky (mstinsky) wrote :

After thinking a bit about the mentioned patch [1], I think I understand the culprit of this issue.
The patch adds the same function call `self.unplug_vifs(instance, network_info)` twice.
In most setups this results in one call to the function in the new try/except block, which unplugs the vifs.
We then call the same operation a second time, even though the work has already happened, so we time out there.
This also explains why I am seeing unexpected vif-unplug events in nova-compute: it receives the events a second time even though it already did the work.
Furthermore, this explains why Will Szumski does not hit the issue when he resizes to the same host, as in that case the event is never processed twice.

We tested adding a try/except around the second call to `self.unplug_vifs(instance, network_info)`, but that seems a bit like a dirty hack as well.
Maybe someone else knows how to fix this properly.

[1] https://github.com/openstack/nova/commit/66c7f00e1d9d7c0eebe46eb4b24b2b21f7413789#diff-486ddc37ce8e7c105a51566dd2cffbda034f28fae7eaad2e74abf470ab296352
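
As a purely illustrative paraphrase of the behaviour described above (this is not the actual nova code), the duplicated call pattern and the try/except workaround mentioned in the comment look roughly like this:

# Heavily simplified paraphrase -- NOT the actual nova code. The patched revert
# path ends up calling unplug_vifs() twice: the first call does the real work
# (and the corresponding network-vif-unplugged event is consumed), while the
# second call waits for an event that will never arrive and times out.
class FakeDriver:
    def __init__(self):
        self.unplugged = False

    def unplug_vifs(self, instance, network_info):
        if self.unplugged:
            # No vif left to unplug -> no event is ever emitted -> timeout.
            raise TimeoutError("Timeout waiting for network-vif-unplugged")
        self.unplugged = True


driver = FakeDriver()
instance, network_info = object(), object()

# First call: the vifs are actually unplugged.
driver.unplug_vifs(instance, network_info)

# The workaround tested above: swallow the timeout raised by the duplicated
# call. It hides the symptom, but the proper fix turned out to be reverting
# the offending change on stable/wallaby.
try:
    driver.unplug_vifs(instance, network_info)
except TimeoutError:
    pass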

Revision history for this message
Maximilian Stinsky (mstinsky) wrote :

The branch stable/wallaby got a git revert of "Revert resize: wait for events according to hybrid plug" [1].

This seems to fix the overall issue described here. After applying the patch from stable/wallaby I was able to remove my manual backport/revert of patch [2].

[1] https://review.opendev.org/c/openstack/nova/+/828419
[2] https://github.com/openstack/nova/commit/66c7f00e1d9d7c0eebe46eb4b24b2b21f7413789#diff-486ddc37ce8e7c105a51566dd2cffbda034f28fae7eaad2e74abf470ab296352

Revision history for this message
melanie witt (melwitt) wrote :
Changed in nova:
status: New → Fix Released
Revision history for this message
melanie witt (melwitt) wrote :

The stable/wallaby change https://review.opendev.org/c/openstack/nova/+/828419 has merged (thank you Maximilian!).

There will not be an official release for Wallaby as it is in Extended Maintenance (https://releases.openstack.org/) but it can be obtained by pulling the stable/wallaby branch.
