Nova evacuate instance fails as port tries to bind to dead host

Bug #1911474 reported by J-P methot
This bug affects 1 person
Affects: OpenStack Compute (nova)
Status: Expired
Importance: Undecided
Assigned to: Unassigned
Milestone: (none)

Bug Description

Description
===========
Running "nova evacuate <instance ID> <compute>" works once, but fails for the second instance. The logs show that Nova tells Neutron to bind the port on the dead compute node. This has happened twice in production, but we haven't been able to reproduce it outside of production.

Steps to reproduce
==================
1. Have a dead compute node (a node that was simply shut down may be enough?)
2. Execute "nova evacuate EVACUATED_SERVER_NAME HOST_B"
3. Execute "nova evacuate EVACUATED_SERVER_NAME2 HOST_B"
4. Watch in horror as Neutron tries to bind the port to the dead host (see the binding check below)
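
To confirm step 4 from the API side, the port binding can be inspected directly (a minimal check; PORT_ID is a placeholder for whichever port is attached to the second instance):

#openstack port list --server EVACUATED_SERVER_NAME2 -c ID -c Status
#openstack port show PORT_ID -c binding_host_id -c status

If binding_host_id still points at the dead host, Neutron was asked to bind the port there.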

Expected result
===============
A working VM evacuated to a working compute host.

Actual result
=============
A failed evacuation and a VM in error status on the dead compute host.

Environment
===========
CentOS 7.8.2003
OpenStack Rocky
openstack-nova-scheduler-18.3.0-1.el7.noarch
openstack-nova-api-18.3.0-1.el7.noarch
openstack-nova-common-18.3.0-1.el7.noarch
openstack-nova-novncproxy-18.3.0-1.el7.noarch
openstack-nova-placement-api-18.3.0-1.el7.noarch
python-nova-18.3.0-1.el7.noarch
openstack-nova-console-18.3.0-1.el7.noarch
python2-novaclient-11.0.1-1.el7.noarch
openstack-nova-conductor-18.3.0-1.el7.noarch

Libvirt + KVM
libvirt-4.5.0-33.el7_8.1.x86_64
qemu-kvm-ev-2.12.0-44.1.el7_8.1.x86_64

Ceph 14.2.11

Neutron-Openvswitch

Logs & Configs
==============
*These logs are from production, which wasn't in debug mode when the issue happened. Since this is production, I can't really force a reproduction of this issue with debug enabled.*

f3839ca64f58ac779f6f810758c0 61e62a49d34a44f9b1161a338a7f1fdd - default default] Creating event network-vif-unplugged:80371c01-930d-4ea2-9d28-14438e948b65 for instance 4aeb7761-cb23-4c51-93dd-79b55afbc7dc on compute22
2021-01-06 13:31:31.750 2858 INFO nova.osapi_compute.wsgi.server [req-4f9b3e17-1a9d-48f0-961a-bbabdf922ad6 0d0ef3839ca64f58ac779f6f810758c0 61e62a49d34a44f9b1161a338a7f1fdd - default default] 10.30.1.224 "POST /v2.1/os-server-external-events HTTP/1.1" status: 200 len: 1091 time: 0.4987640
2021-01-06 13:31:40.145 2863 INFO nova.osapi_compute.wsgi.server [req-abaac9df-7338-4d10-9326-4006021ff54d 6cb55894e59c47b3800f97a27c9c4ee9 ccfa9d8d76b8409f8c5a8d71ce32625a - default default] 10.30.1.224 "GET /v2.1 HTTP/1.1" status: 302 len: 318 time: 0.0072701
2021-01-06 13:31:40.156 2863 INFO nova.osapi_compute.wsgi.server [req-c393e74b-a118-4a98-8a83-be6007913dc0 6cb55894e59c47b3800f97a27c9c4ee9 ccfa9d8d76b8409f8c5a8d71ce32625a - default default] 10.30.1.224 "GET /v2.1/ HTTP/1.1" status: 200 len: 789 time: 0.0070350
2021-01-06 13:31:43.289 2865 INFO nova.osapi_compute.wsgi.server [req-b87268b7-a673-44c1-9162-f9564647ec33 6cb55894e59c47b3800f97a27c9c4ee9 ccfa9d8d76b8409f8c5a8d71ce32625a - default default] 10.30.1.224 "GET /v2.1/servers/4aeb7761-cb23-4c51-93dd-79b55afbc7dc HTTP/1.1" status: 200 len: 5654 time: 2.7543190
2021-01-06 13:31:43.413 2863 INFO nova.osapi_compute.wsgi.server [req-4cab23ba-c5cb-4dda-bf42-bc452d004783 6cb55894e59c47b3800f97a27c9c4ee9 ccfa9d8d76b8409f8c5a8d71ce32625a - default default] 10.30.1.224 "GET /v2.1/servers/4aeb7761-cb23-4c51-93dd-79b55afbc7dc/os-volume_attachments HTTP/1.1" status: 200 len: 770 time: 0.1135709
2021-01-06 13:31:43.883 2865 INFO nova.osapi_compute.wsgi.server [req-f5e5a586-65f3-4798-b03b-98e01326a00b 6cb55894e59c47b3800f97a27c9c4ee9 ccfa9d8d76b8409f8c5a8d71ce32625a - default default] 10.30.1.224 "GET /v2.1/flavors/574a7152-f079-4337-b1eb-b7eca4370b73 HTTP/1.1" status: 200 len: 877 time: 0.5751688
2021-01-06 13:31:47.194 2864 INFO nova.api.openstack.compute.server_external_events [req-7e639b1f-8408-4e8e-9bb8-54588290edfe 0d0ef3839ca64f58ac779f6f810758c0 61e62a49d34a44f9b1161a338a7f1fdd - default default] Creating event network-vif-plugged:80371c01-930d-4ea2-9d28-14438e948b65 for instance 4aeb7761-cb23-4c51-93dd-79b55afbc7dc on compute22

*I was asked to use the event list to find further logs. However, the event list did not return anything when I queried it. Could this be because the problematic instance was deleted and recreated?*

Revision history for this message
suzhengwei (sue.sam) wrote :

Use the following CLI to find the req_id of the evacuation.
#nova instance-action-list $instance_uuid
Then show the request information.
#nova instance-action $instance_uuid $req_id

More information is also needed: grep the request ID in all nova-api/scheduler log files and in the destination nova-compute log files.
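
For example, assuming the default log locations under /var/log/nova/ (adjust the paths for your deployment):

#grep $req_id /var/log/nova/nova-api.log /var/log/nova/nova-scheduler.log /var/log/nova/nova-conductor.log
#grep $req_id /var/log/nova/nova-compute.log    (run this one on the destination host, HOST_B)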

Revision history for this message
J-P methot (jphilippemethot) wrote :

Per company policy, I am forbidden to search the logs directly on the OpenStack nodes. However, here's what I get when I search our log server for the request ID:

@timestamp,host,type,message,hostname,program
"2021-01-06T18:31:31.802Z","10.30.1.225","syslog","2021-01-06 13:31:31.750 2858 INFO nova.osapi_compute.wsgi.server [req-4f9b3e17-1a9d-48f0-961a-bbabdf922ad6 0d0ef3839ca64f58ac779f6f810758c0 61e62a49d34a44f9b1161a338a7f1fdd - default default] 10.30.1.224 ""POST /v2.1/os-server-external-events HTTP/1.1"" status: 200 len: 1091 time: 0.4987640","controller1-zone2","nova-api"
"2021-01-06T18:31:31.802Z","10.30.1.225","syslog","2021-01-06 13:31:31.705 2858 INFO nova.api.openstack.compute.server_external_events [req-4f9b3e17-1a9d-48f0-961a-bbabdf922ad6 0d0ef3839ca64f58ac779f6f810758c0 61e62a49d34a44f9b1161a338a7f1fdd - default default] Creating event network-vif-unplugged:80371c01-930d-4ea2-9d28-14438e948b65 for instance 4aeb7761-cb23-4c51-93dd-79b55afbc7dc on compute22.zone2.openstack.fr.planethoster.net","controller1-zone2","nova-api"
"2021-01-06T18:31:31.801Z","10.30.1.225","syslog","2021-01-06 13:31:31.704 2858 INFO nova.api.openstack.compute.server_external_events [req-4f9b3e17-1a9d-48f0-961a-bbabdf922ad6 0d0ef3839ca64f58ac779f6f810758c0 61e62a49d34a44f9b1161a338a7f1fdd - default default] Creating event network-changed:7b85335d-2b32-4313-b72d-41b8f71e277a for instance 37869ae7-ce6f-4413-a877-bead0ba1dc8c on compute4.zone2.openstack.fr.planethoster.net","controller1-zone2","nova-api"
"2021-01-06T18:31:31.799Z","10.30.1.175","syslog","2021-01-06 13:31:31.753 3472 WARNING nova.compute.manager [req-4f9b3e17-1a9d-48f0-961a-bbabdf922ad6 0d0ef3839ca64f58ac779f6f810758c0 61e62a49d34a44f9b1161a338a7f1fdd - default default] [instance: 4aeb7761-cb23-4c51-93dd-79b55afbc7dc] Received unexpected event network-vif-unplugged-80371c01-930d-4ea2-9d28-14438e948b65 for instance with vm_state active and task_state rebuilding.","compute46","nova-compute"
"2021-01-06T18:31:31.774Z","10.30.1.133","syslog","2021-01-06 13:31:31.728 26607 WARNING nova.compute.manager [req-4f9b3e17-1a9d-48f0-961a-bbabdf922ad6 0d0ef3839ca64f58ac779f6f810758c0 61e62a49d34a44f9b1161a338a7f1fdd - default default] [instance: 37869ae7-ce6f-4413-a877-bead0ba1dc8c] Received unexpected event network-vif-unplugged-7b85335d-2b32-4313-b72d-41b8f71e277a for instance with vm_state active and task_state None.","compute4","nova-compute"
"2021-01-06T18:31:31.750Z","10.30.1.225","syslog","2021-01-06 13:31:31.703 2858 INFO nova.api.openstack.compute.server_external_events [req-4f9b3e17-1a9d-48f0-961a-bbabdf922ad6 0d0ef3839ca64f58ac779f6f810758c0 61e62a49d34a44f9b1161a338a7f1fdd - default default] Creating event network-vif-unplugged:7b85335d-2b32-4313-b72d-41b8f71e277a for instance 37869ae7-ce6f-4413-a877-bead0ba1dc8c on compute4.zone2.openstack.fr.planethoster.net","controller1-zone2","nova-api"
"2021-01-06T18:31:31.750Z","10.30.1.225","syslog","2021-01-06 13:31:31.701 2858 INFO nova.api.openstack.compute.server_external_events [req-4f9b3e17-1a9d-48f0-961a-bbabdf922ad6 0d0ef3839ca64f58ac779f6f810758c0 61e62a49d34a44f9b1161a338a7f1fdd - default default] Creating event network-changed:32fc9333-96b8-4db7-a547-e226...


Revision history for this message
Lee Yarwood (lyarwood) wrote :

This smells like the original compute host and the network agents on it were still running?

https://docs.openstack.org/api-ref/compute/?expanded=evacuate-server-evacuate-action-detail

As called out in the docs, operators need to ensure the host is fenced before calling for an evacuation. Can you confirm that happens in your prod env?
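
For reference, the rough flow is something like the sketch below (fencing itself, e.g. powering the node off via IPMI or a STONITH agent, has to happen out of band first; compute22 is used here only as an example hostname):

#nova service-force-down compute22 nova-compute
#nova evacuate EVACUATED_SERVER_NAME HOST_B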

Lee Yarwood (lyarwood)
Changed in nova:
status: New → Incomplete
Revision history for this message
Launchpad Janitor (janitor) wrote :

[Expired for OpenStack Compute (nova) because there has been no activity for 60 days.]

Changed in nova:
status: Incomplete → Expired