The nova-live-migration job has 2 nodes and in addition to running live migration tests it also runs evacuate tests for image-backed and volume-backed servers.
Seeing intermittent failures, with debug in this change:
https://review.openstack.org/#/c/571325/
It looks like the network-vif-plugged event is coming to the API before nova-compute has registered a callback for that event, so nova-compute does not process the event and then times out waiting for the event to complete by the time it registered.
The API processes the network-vif-plugged for that server here:
http://logs.openstack.org/25/571325/10/gate/nova-live-migration/52e1cd0/logs/screen-n-api.txt.gz#_Jan_29_01_11_49_707004
Jan 29 01:11:49.707004 ubuntu-xenial-rax-ord-0002201755 <email address hidden>[22319]: DEBUG nova.api.openstack.wsgi [req-3ed0ada1-7328-4d5a-a3ea-da34dcdb252d req-1f5aeede-83c6-44d5-afd4-b5435e71d61f service nova] Action: 'create', calling method: <bound method ServerExternalEventsController.create of <nova.api.openstack.compute.server_external_events.ServerExternalEventsController object at 0x7fb0133aa050>>, body: {"events": [{"status": "completed", "tag": "e241f79f-fb0d-4961-b0c8-aea9de2755bf", "name": "network-vif-plugged", "server_uuid": "2e82ddcd-75b8-4a41-8ecd-ce175adbdc67"}]} {{(pid=22322) _process_stack /opt/stack/new/nova/nova/api/openstack/wsgi.py:520}}
Jan 29 01:11:49.765687 ubuntu-xenial-rax-ord-0002201755 <email address hidden>[22319]: INFO nova.api.openstack.compute.server_external_events [req-3ed0ada1-7328-4d5a-a3ea-da34dcdb252d req-1f5aeede-83c6-44d5-afd4-b5435e71d61f service nova] Creating event network-vif-plugged:e241f79f-fb0d-4961-b0c8-aea9de2755bf for instance 2e82ddcd-75b8-4a41-8ecd-ce175adbdc67 on ubuntu-xenial-rax-ord-0002201758
Jan 29 01:11:49.776578 ubuntu-xenial-rax-ord-0002201755 <email address hidden>[22319]: DEBUG nova.compute.api [req-3ed0ada1-7328-4d5a-a3ea-da34dcdb252d req-1f5aeede-83c6-44d5-afd4-b5435e71d61f service nova] Instance 2e82ddcd-75b8-4a41-8ecd-ce175adbdc67 is migrating, copying events to all relevant hosts: set([u'ubuntu-xenial-rax-ord-0002201758', u'ubuntu-xenial-rax-ord-0002201755']) {{(pid=22322) _get_relevant_hosts /opt/stack/new/nova/nova/compute/api.py:4818}}
Jan 29 01:11:49.786947 ubuntu-xenial-rax-ord-0002201755 <email address hidden>[22319]: INFO nova.api.openstack.requestlog [req-3ed0ada1-7328-4d5a-a3ea-da34dcdb252d req-1f5aeede-83c6-44d5-afd4-b5435e71d61f service nova] 10.210.224.40 "POST /compute/v2.1/os-server-external-events" status: 200 len: 183 microversion: 2.1 time: 0.084393
Which is weird because that's before the vif plugging timeout in the compute logs. From the compute logs, we plugged the vif here:
Jan 29 01:11:55.571807 ubuntu-xenial-rax-ord-0002201755 nova-compute[15903]: INFO os_vif [None req-252304a1-6eff-4ff3-aa4d-b4e0ab87601c demo admin] Successfully plugged vif VIFOpenVSwitch(active=False,address=fa:16:3e:66:03:76,bridge_name='br-int',has_traffic_filtering=True,id=e241f79f-fb0d-4961-b0c8-aea9de2755bf,network=Network(2812d9e2-bfa8-4397-9a81-5dab9fe5a03e),plugin='ovs',port_profile=VIFPortProfileOpenVSwitch,preserve_on_delete=False,vif_name='tape241f79f-fb')
So it looks like maybe we're getting the network-vif-plugged event before we're ready for it and so we miss the event and then timeout? That's pretty weird though because the libvirt driver's spawn method is what should be registering and waiting for the vif plugged event:
https://github.com/openstack/nova/blob/c134feda3d9527dbc9735e4ae9cd35c4782f1fb4/nova/virt/libvirt/driver.py#L5673
Looks like the compute does register the event callback for network-vif-plugged here:
Jan 29 01:11:55.567189 ubuntu-xenial-rax-ord-0002201755 nova-compute[15903]: DEBUG nova.compute.manager [None req-252304a1-6eff-4ff3-aa4d-b4e0ab87601c demo admin] [instance: 2e82ddcd-75b8-4a41-8ecd-ce175adbdc67] Preparing to wait for external event network-vif-plugged-e241f79f-fb0d-4961-b0c8-aea9de2755bf {{(pid=15903) prepare_for_instance_event /opt/stack/new/nova/nova/compute/manager.py:325}}
Which is too late:
Jan 29 01:11:49.707004 ubuntu-xenial-rax-ord-0002201755 <email address hidden>[22319]: DEBUG nova.api.openstack.wsgi [req-3ed0ada1-7328-4d5a-a3ea-da34dcdb252d req-1f5aeede-83c6-44d5-afd4-b5435e71d61f service nova] Action: 'create', calling method: <bound method ServerExternalEventsController.create of <nova.api.openstack.compute.server_external_events.ServerExternalEventsController object at 0x7fb0133aa050>>, body: {"events": [{"status": "completed", "tag": "e241f79f-fb0d-4961-b0c8-aea9de2755bf", "name": "network-vif-plugged", "server_uuid": "2e82ddcd-75b8-4a41-8ecd-ce175adbdc67"}]} {{(pid=22322) _process_stack /opt/stack/new/nova/nova/api/openstack/wsgi.py:520}}
http://logstash.openstack.org/#dashboard/file/logstash.json?query=message%3A%5C%22Timeout%20waiting%20for%20%5B('network-vif-plugged'%5C%22%20AND%20message%3A%5C%22for%20instance%20with%20vm_state%20error%20and%20task_state%20rebuild_spawning.%5C%22%20AND%20tags%3A%5C%22screen-n-cpu.txt%5C%22&from=7d
4 hits in the last 7 days, check and gate, all failures.
We're hitting this in our internal CI, but for revert_resize instead of evacuate [1]. Same root cause though: by the time we've started listening to the event, it's already been received and discarded.
[1] https:/ /bugzilla. redhat. com/show_ bug.cgi? id=1678681