test_server_security_groups failed to reboot with "Domain not found: no domain with matching uuid" because of missing vif event from linuxbridge agent

Bug #1744361 reported by Ihar Hrachyshka
Affects                   Status         Importance  Assigned to   Milestone
OpenStack Compute (nova)  Fix Released   Medium      melanie witt
  Ocata                   Fix Committed  Undecided   Lee Yarwood
  Pike                    Fix Committed  Medium      Lee Yarwood
  Queens                  Fix Released   Medium      melanie witt
neutron                   Invalid        Undecided   Unassigned

Bug Description

This happened in master (Queens) in the tempest-linuxbridge job:

http://logs.openstack.org/56/534456/1/check/neutron-tempest-linuxbridge/7693ca5/logs/testr_results.html.gz

From tempest log:

Traceback (most recent call last):
  File "tempest/api/compute/security_groups/test_security_groups.py", line 105, in test_server_security_groups
    'ACTIVE')
  File "tempest/common/waiters.py", line 96, in wait_for_server_status
    raise lib_exc.TimeoutException(message)
tempest.lib.exceptions.TimeoutException: Request timed out
Details: (SecurityGroupsTestJSON:test_server_security_groups) Server dd2ec3b0-f909-4104-ae2a-9b878d936ed4 failed to reach ACTIVE status and task state "None" within the required time (196 s). Current status: HARD_REBOOT. Current task state: reboot_started_hard.

In nova-compute log, we see this:

Jan 19 10:38:37.869397 ubuntu-xenial-rax-iad-0002010825 nova-compute[28907]: WARNING nova.virt.libvirt.driver [None req-838e0ebe-afe5-4cca-abcc-4ba116889cba service nova] [instance: dd2ec3b0-f909-4104-ae2a-9b878d936ed4] Timeout waiting for vif plugging callback for instance with vm_state active and task_state reboot_started_hard.: Timeout: 300 seconds
Jan 19 10:38:37.880789 ubuntu-xenial-rax-iad-0002010825 nova-compute[28907]: DEBUG nova.compute.manager [None req-838e0ebe-afe5-4cca-abcc-4ba116889cba service nova] [instance: dd2ec3b0-f909-4104-ae2a-9b878d936ed4] Checking state {{(pid=28907) _get_power_state /opt/stack/new/nova/nova/compute/manager.py:1183}}
Jan 19 10:38:37.885265 ubuntu-xenial-rax-iad-0002010825 nova-compute[28907]: ERROR nova.compute.manager [None req-838e0ebe-afe5-4cca-abcc-4ba116889cba service nova] [instance: dd2ec3b0-f909-4104-ae2a-9b878d936ed4] Cannot reboot instance: Domain not found: no domain with matching uuid 'dd2ec3b0-f909-4104-ae2a-9b878d936ed4' (instance-00000023): libvirtError: Domain not found: no domain with matching uuid 'dd2ec3b0-f909-4104-ae2a-9b878d936ed4' (instance-00000023)
Jan 19 10:38:37.981484 ubuntu-xenial-rax-iad-0002010825 nova-compute[28907]: DEBUG nova.compute.manager [None req-838e0ebe-afe5-4cca-abcc-4ba116889cba service nova] [instance: dd2ec3b0-f909-4104-ae2a-9b878d936ed4] Instance has been destroyed from under us while trying to set it to ERROR {{(pid=28907) _set_instance_obj_error_state /opt/stack/new/nova/nova/compute/manager.py:586}}
Jan 19 10:38:38.150498 ubuntu-xenial-rax-iad-0002010825 nova-compute[28907]: ERROR oslo_messaging.rpc.server [None req-838e0ebe-afe5-4cca-abcc-4ba116889cba service nova] Exception during message handling: libvirtError: Domain not found: no domain with matching uuid 'dd2ec3b0-f909-4104-ae2a-9b878d936ed4' (instance-00000023)
Jan 19 10:38:38.150813 ubuntu-xenial-rax-iad-0002010825 nova-compute[28907]: ERROR oslo_messaging.rpc.server Traceback (most recent call last):
Jan 19 10:38:38.151125 ubuntu-xenial-rax-iad-0002010825 nova-compute[28907]: ERROR oslo_messaging.rpc.server File "/usr/local/lib/python2.7/dist-packages/oslo_messaging/rpc/server.py", line 163, in _process_incoming
Jan 19 10:38:38.151388 ubuntu-xenial-rax-iad-0002010825 nova-compute[28907]: ERROR oslo_messaging.rpc.server res = self.dispatcher.dispatch(message)
Jan 19 10:38:38.151668 ubuntu-xenial-rax-iad-0002010825 nova-compute[28907]: ERROR oslo_messaging.rpc.server File "/usr/local/lib/python2.7/dist-packages/oslo_messaging/rpc/dispatcher.py", line 220, in dispatch
Jan 19 10:38:38.151939 ubuntu-xenial-rax-iad-0002010825 nova-compute[28907]: ERROR oslo_messaging.rpc.server return self._do_dispatch(endpoint, method, ctxt, args)
Jan 19 10:38:38.152205 ubuntu-xenial-rax-iad-0002010825 nova-compute[28907]: ERROR oslo_messaging.rpc.server File "/usr/local/lib/python2.7/dist-packages/oslo_messaging/rpc/dispatcher.py", line 190, in _do_dispatch
Jan 19 10:38:38.152491 ubuntu-xenial-rax-iad-0002010825 nova-compute[28907]: ERROR oslo_messaging.rpc.server result = func(ctxt, **new_args)
Jan 19 10:38:38.152705 ubuntu-xenial-rax-iad-0002010825 nova-compute[28907]: ERROR oslo_messaging.rpc.server File "/opt/stack/new/nova/nova/exception_wrapper.py", line 76, in wrapped
Jan 19 10:38:38.152962 ubuntu-xenial-rax-iad-0002010825 nova-compute[28907]: ERROR oslo_messaging.rpc.server function_name, call_dict, binary)
Jan 19 10:38:38.153268 ubuntu-xenial-rax-iad-0002010825 nova-compute[28907]: ERROR oslo_messaging.rpc.server File "/usr/local/lib/python2.7/dist-packages/oslo_utils/excutils.py", line 220, in __exit__
Jan 19 10:38:38.153481 ubuntu-xenial-rax-iad-0002010825 nova-compute[28907]: ERROR oslo_messaging.rpc.server self.force_reraise()
Jan 19 10:38:38.153710 ubuntu-xenial-rax-iad-0002010825 nova-compute[28907]: ERROR oslo_messaging.rpc.server File "/usr/local/lib/python2.7/dist-packages/oslo_utils/excutils.py", line 196, in force_reraise
Jan 19 10:38:38.153942 ubuntu-xenial-rax-iad-0002010825 nova-compute[28907]: ERROR oslo_messaging.rpc.server six.reraise(self.type_, self.value, self.tb)
Jan 19 10:38:38.154238 ubuntu-xenial-rax-iad-0002010825 nova-compute[28907]: ERROR oslo_messaging.rpc.server File "/opt/stack/new/nova/nova/exception_wrapper.py", line 67, in wrapped
Jan 19 10:38:38.154512 ubuntu-xenial-rax-iad-0002010825 nova-compute[28907]: ERROR oslo_messaging.rpc.server return f(self, context, *args, **kw)
Jan 19 10:38:38.154794 ubuntu-xenial-rax-iad-0002010825 nova-compute[28907]: ERROR oslo_messaging.rpc.server File "/opt/stack/new/nova/nova/compute/manager.py", line 186, in decorated_function
Jan 19 10:38:38.155086 ubuntu-xenial-rax-iad-0002010825 nova-compute[28907]: ERROR oslo_messaging.rpc.server "Error: %s", e, instance=instance)
Jan 19 10:38:38.155354 ubuntu-xenial-rax-iad-0002010825 nova-compute[28907]: ERROR oslo_messaging.rpc.server File "/usr/local/lib/python2.7/dist-packages/oslo_utils/excutils.py", line 220, in __exit__
Jan 19 10:38:38.155598 ubuntu-xenial-rax-iad-0002010825 nova-compute[28907]: ERROR oslo_messaging.rpc.server self.force_reraise()
Jan 19 10:38:38.155850 ubuntu-xenial-rax-iad-0002010825 nova-compute[28907]: ERROR oslo_messaging.rpc.server File "/usr/local/lib/python2.7/dist-packages/oslo_utils/excutils.py", line 196, in force_reraise
Jan 19 10:38:38.156133 ubuntu-xenial-rax-iad-0002010825 nova-compute[28907]: ERROR oslo_messaging.rpc.server six.reraise(self.type_, self.value, self.tb)
Jan 19 10:38:38.156369 ubuntu-xenial-rax-iad-0002010825 nova-compute[28907]: ERROR oslo_messaging.rpc.server File "/opt/stack/new/nova/nova/compute/manager.py", line 156, in decorated_function
Jan 19 10:38:38.156603 ubuntu-xenial-rax-iad-0002010825 nova-compute[28907]: ERROR oslo_messaging.rpc.server return function(self, context, *args, **kwargs)
Jan 19 10:38:38.156856 ubuntu-xenial-rax-iad-0002010825 nova-compute[28907]: ERROR oslo_messaging.rpc.server File "/opt/stack/new/nova/nova/compute/utils.py", line 944, in decorated_function
Jan 19 10:38:38.157092 ubuntu-xenial-rax-iad-0002010825 nova-compute[28907]: ERROR oslo_messaging.rpc.server return function(self, context, *args, **kwargs)
Jan 19 10:38:38.157335 ubuntu-xenial-rax-iad-0002010825 nova-compute[28907]: ERROR oslo_messaging.rpc.server File "/opt/stack/new/nova/nova/compute/manager.py", line 214, in decorated_function
Jan 19 10:38:38.157592 ubuntu-xenial-rax-iad-0002010825 nova-compute[28907]: ERROR oslo_messaging.rpc.server kwargs['instance'], e, sys.exc_info())
Jan 19 10:38:38.157845 ubuntu-xenial-rax-iad-0002010825 nova-compute[28907]: ERROR oslo_messaging.rpc.server File "/usr/local/lib/python2.7/dist-packages/oslo_utils/excutils.py", line 220, in __exit__
Jan 19 10:38:38.158114 ubuntu-xenial-rax-iad-0002010825 nova-compute[28907]: ERROR oslo_messaging.rpc.server self.force_reraise()
Jan 19 10:38:38.158316 ubuntu-xenial-rax-iad-0002010825 nova-compute[28907]: ERROR oslo_messaging.rpc.server File "/usr/local/lib/python2.7/dist-packages/oslo_utils/excutils.py", line 196, in force_reraise
Jan 19 10:38:38.158512 ubuntu-xenial-rax-iad-0002010825 nova-compute[28907]: ERROR oslo_messaging.rpc.server six.reraise(self.type_, self.value, self.tb)
Jan 19 10:38:38.158710 ubuntu-xenial-rax-iad-0002010825 nova-compute[28907]: ERROR oslo_messaging.rpc.server File "/opt/stack/new/nova/nova/compute/manager.py", line 202, in decorated_function
Jan 19 10:38:38.158908 ubuntu-xenial-rax-iad-0002010825 nova-compute[28907]: ERROR oslo_messaging.rpc.server return function(self, context, *args, **kwargs)
Jan 19 10:38:38.159229 ubuntu-xenial-rax-iad-0002010825 nova-compute[28907]: ERROR oslo_messaging.rpc.server File "/opt/stack/new/nova/nova/compute/manager.py", line 3213, in reboot_instance
Jan 19 10:38:38.159465 ubuntu-xenial-rax-iad-0002010825 nova-compute[28907]: ERROR oslo_messaging.rpc.server self._set_instance_obj_error_state(context, instance)
Jan 19 10:38:38.159721 ubuntu-xenial-rax-iad-0002010825 nova-compute[28907]: ERROR oslo_messaging.rpc.server File "/usr/local/lib/python2.7/dist-packages/oslo_utils/excutils.py", line 220, in __exit__
Jan 19 10:38:38.161021 ubuntu-xenial-rax-iad-0002010825 nova-compute[28907]: ERROR oslo_messaging.rpc.server self.force_reraise()
Jan 19 10:38:38.161349 ubuntu-xenial-rax-iad-0002010825 nova-compute[28907]: ERROR oslo_messaging.rpc.server File "/usr/local/lib/python2.7/dist-packages/oslo_utils/excutils.py", line 196, in force_reraise
Jan 19 10:38:38.161588 ubuntu-xenial-rax-iad-0002010825 nova-compute[28907]: ERROR oslo_messaging.rpc.server six.reraise(self.type_, self.value, self.tb)
Jan 19 10:38:38.161807 ubuntu-xenial-rax-iad-0002010825 nova-compute[28907]: ERROR oslo_messaging.rpc.server File "/opt/stack/new/nova/nova/compute/manager.py", line 3188, in reboot_instance
Jan 19 10:38:38.162017 ubuntu-xenial-rax-iad-0002010825 nova-compute[28907]: ERROR oslo_messaging.rpc.server bad_volumes_callback=bad_volumes_callback)
Jan 19 10:38:38.162329 ubuntu-xenial-rax-iad-0002010825 nova-compute[28907]: ERROR oslo_messaging.rpc.server File "/opt/stack/new/nova/nova/virt/libvirt/driver.py", line 2478, in reboot
Jan 19 10:38:38.162660 ubuntu-xenial-rax-iad-0002010825 nova-compute[28907]: ERROR oslo_messaging.rpc.server block_device_info)
Jan 19 10:38:38.162986 ubuntu-xenial-rax-iad-0002010825 nova-compute[28907]: ERROR oslo_messaging.rpc.server File "/opt/stack/new/nova/nova/virt/libvirt/driver.py", line 2583, in _hard_reboot
Jan 19 10:38:38.163312 ubuntu-xenial-rax-iad-0002010825 nova-compute[28907]: ERROR oslo_messaging.rpc.server block_device_info=block_device_info)
Jan 19 10:38:38.163642 ubuntu-xenial-rax-iad-0002010825 nova-compute[28907]: ERROR oslo_messaging.rpc.server File "/opt/stack/new/nova/nova/virt/libvirt/driver.py", line 5298, in _create_domain_and_network
Jan 19 10:38:38.163904 ubuntu-xenial-rax-iad-0002010825 nova-compute[28907]: ERROR oslo_messaging.rpc.server destroy_disks_on_failure)
Jan 19 10:38:38.164127 ubuntu-xenial-rax-iad-0002010825 nova-compute[28907]: ERROR oslo_messaging.rpc.server File "/opt/stack/new/nova/nova/virt/libvirt/driver.py", line 5222, in _cleanup_failed_start
Jan 19 10:38:38.164343 ubuntu-xenial-rax-iad-0002010825 nova-compute[28907]: ERROR oslo_messaging.rpc.server if guest and guest.is_active():
Jan 19 10:38:38.164583 ubuntu-xenial-rax-iad-0002010825 nova-compute[28907]: ERROR oslo_messaging.rpc.server File "/opt/stack/new/nova/nova/virt/libvirt/guest.py", line 553, in is_active
Jan 19 10:38:38.164891 ubuntu-xenial-rax-iad-0002010825 nova-compute[28907]: ERROR oslo_messaging.rpc.server return self._domain.isActive()
Jan 19 10:38:38.165124 ubuntu-xenial-rax-iad-0002010825 nova-compute[28907]: ERROR oslo_messaging.rpc.server File "/usr/local/lib/python2.7/dist-packages/eventlet/tpool.py", line 186, in doit
Jan 19 10:38:38.165331 ubuntu-xenial-rax-iad-0002010825 nova-compute[28907]: ERROR oslo_messaging.rpc.server result = proxy_call(self._autowrap, f, *args, **kwargs)
Jan 19 10:38:38.165548 ubuntu-xenial-rax-iad-0002010825 nova-compute[28907]: ERROR oslo_messaging.rpc.server File "/usr/local/lib/python2.7/dist-packages/eventlet/tpool.py", line 144, in proxy_call
Jan 19 10:38:38.165759 ubuntu-xenial-rax-iad-0002010825 nova-compute[28907]: ERROR oslo_messaging.rpc.server rv = execute(f, *args, **kwargs)
Jan 19 10:38:38.166402 ubuntu-xenial-rax-iad-0002010825 nova-compute[28907]: ERROR oslo_messaging.rpc.server File "/usr/local/lib/python2.7/dist-packages/eventlet/tpool.py", line 125, in execute
Jan 19 10:38:38.166610 ubuntu-xenial-rax-iad-0002010825 nova-compute[28907]: ERROR oslo_messaging.rpc.server six.reraise(c, e, tb)
Jan 19 10:38:38.166812 ubuntu-xenial-rax-iad-0002010825 nova-compute[28907]: ERROR oslo_messaging.rpc.server File "/usr/local/lib/python2.7/dist-packages/eventlet/tpool.py", line 83, in tworker
Jan 19 10:38:38.167027 ubuntu-xenial-rax-iad-0002010825 nova-compute[28907]: ERROR oslo_messaging.rpc.server rv = meth(*args, **kwargs)
Jan 19 10:38:38.167254 ubuntu-xenial-rax-iad-0002010825 nova-compute[28907]: ERROR oslo_messaging.rpc.server File "/usr/local/lib/python2.7/dist-packages/libvirt.py", line 1368, in isActive
Jan 19 10:38:38.167456 ubuntu-xenial-rax-iad-0002010825 nova-compute[28907]: ERROR oslo_messaging.rpc.server if ret == -1: raise libvirtError ('virDomainIsActive() failed', dom=self)
Jan 19 10:38:38.167663 ubuntu-xenial-rax-iad-0002010825 nova-compute[28907]: ERROR oslo_messaging.rpc.server libvirtError: Domain not found: no domain with matching uuid 'dd2ec3b0-f909-4104-ae2a-9b878d936ed4' (instance-00000023)

Although nova claims it didn't receive the plugged event, here it is being sent by q-svc:

Jan 19 10:33:35.836759 ubuntu-xenial-rax-iad-0002010825 neutron-server[20644]: DEBUG neutron.notifiers.nova [-] Sending events: [{'status': 'completed', 'tag': u'567525a4-4b16-4004-91d5-545086d01227', 'name': 'network-vif-plugged', 'server_uuid': u'dd2ec3b0-f909-4104-ae2a-9b878d936ed4'}] {{(pid=20733) send_events /opt/stack/new/neutron/neutron/notifiers/nova.py:241}}
Jan 19 10:33:35.838577 ubuntu-xenial-rax-iad-0002010825 neutron-server[20644]: DEBUG novaclient.v2.client [-] REQ: curl -g -i --cacert "/opt/stack/data/ca-bundle.pem" -X POST https://104.239.173.102/compute/v2.1/os-server-external-events -H "User-Agent: python-novaclient" -H "Content-Type: application/json" -H "Accept: application/json" -H "X-OpenStack-Nova-API-Version: 2.1" -H "X-Auth-Token: {SHA1}ba0cdcef7f7e42ba834fc341acb0aca1b904075b" -d '{"events": [{"status": "completed", "tag": "567525a4-4b16-4004-91d5-545086d01227", "name": "network-vif-plugged", "server_uuid": "dd2ec3b0-f909-4104-ae2a-9b878d936ed4"}]}' {{(pid=20733) _http_log_request /usr/local/lib/python2.7/dist-packages/keystoneauth1/session.py:372}}

The nova-compute log contains this message when the event was received:

Jan 19 10:33:35.986648 ubuntu-xenial-rax-iad-0002010825 nova-compute[28907]: DEBUG nova.compute.manager [None req-4bcf94cf-2048-4e99-94e6-646519900469 service nova] [instance: dd2ec3b0-f909-4104-ae2a-9b878d936ed4] No waiting events found dispatching network-vif-plugged-567525a4-4b16-4004-91d5-545086d01227 {{(pid=28907) pop_instance_event /opt/stack/new/nova/nova/compute/manager.py:363}}
Jan 19 10:33:35.987219 ubuntu-xenial-rax-iad-0002010825 nova-compute[28907]: WARNING nova.compute.manager [None req-4bcf94cf-2048-4e99-94e6-646519900469 service nova] [instance: dd2ec3b0-f909-4104-ae2a-9b878d936ed4] Received unexpected event network-vif-plugged-567525a4-4b16-4004-91d5-545086d01227 for instance with vm_state active and task_state rebooting_hard.

Looks like it ignored the event due to instance state, and never got a new one?
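
For context, nova-compute handles these notifications with a prepare/pop pattern: waiters are registered before the operation that should trigger the event, and a notification that nobody registered for is merely logged as "unexpected". A minimal sketch of that pattern, with illustrative names and truncated UUIDs rather than nova's actual classes:

    import threading

    class InstanceEventWaiters(object):
        """Illustrative sketch of the prepare/pop external-event pattern."""

        def __init__(self):
            self._waiters = {}  # (instance_uuid, event_name) -> threading.Event
            self._lock = threading.Lock()

        def prepare(self, instance_uuid, event_name):
            # Registered *before* starting the operation (e.g. plugging a vif).
            event = threading.Event()
            with self._lock:
                self._waiters[(instance_uuid, event_name)] = event
            return event

        def pop(self, instance_uuid, event_name):
            # Called when a notification arrives from neutron; returns None if
            # nothing was registered, i.e. the event is "unexpected".
            with self._lock:
                return self._waiters.pop((instance_uuid, event_name), None)

    waiters = InstanceEventWaiters()
    # A network-vif-plugged notification arrives before anyone called prepare():
    if waiters.pop('dd2ec3b0-f909-...', 'network-vif-plugged-567525a4-...') is None:
        print('No waiting events found / received unexpected event')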

Changed in neutron:
status: New → Confirmed
tags: added: gate-failure
tags: added: linuxbridge
Revision history for this message
Matt Riedemann (mriedem) wrote :

I wonder if this is related to https://review.openstack.org/#/c/400384/.

Revision history for this message
Matt Riedemann (mriedem) wrote :

nova-compute isn't ignoring the event due to the instance state, it's just logging the current state of the instance when it got the event.

Matt Riedemann (mriedem)
tags: added: libvirt
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix proposed to nova (master)

Related fix proposed to branch: master
Review: https://review.openstack.org/538103

Revision history for this message
melanie witt (melwitt) wrote :

Unfortunately the log link from comment 0 is a "File Not Found" today because of zuul issues, but in examining the code I think what's happening is we're waiting for a plug event for a different vif['id'] than the one we're receiving from neutron.

The patch [1] that Matt linked in comment 1 changed the behavior of hard reboot so that it now unplugs the vifs and re-plugs them, whereas previously it did not touch the vifs at all.

So upon receiving a reboot request, we query network_info from neutron and pass it down to the driver's reboot; the driver uses that network_info to unplug the vifs, then uses the same network_info to plug them again while re-creating the virt domain.

When we plug vifs, we only register to wait on events for vifs whose status is not 'active' in the network_info. And when we unplug vifs, we don't wait for any indication from neutron that the unplugging has been completed. So, I wonder if this is a race where we unplug vifs and then immediately try to plug them again before they have transitioned away from 'active'.

The thing that doesn't make sense is why we're waiting for a different vif_id to be plugged instead of not waiting for any events at all. I thought there would only be one vif_id and if it's not 'active', we would have an empty event list and therefore not wait for any events. Do vif_ids change if unplugged and plugged again?
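
For reference, the selection logic described above amounts to filtering network_info on each vif's 'active' flag; a simplified sketch of that shape (an assumption for illustration, not the exact nova code):

    def get_events_to_wait_for(network_info):
        # Only register waiters for vifs that neutron reports as not active.
        # If the (possibly stale) cache still shows the vif as active, the
        # list is empty and no plug event is expected at all.
        return [('network-vif-plugged', vif['id'])
                for vif in network_info
                if vif.get('active', True) is False]

    # With a stale cache the vif is still marked active, so nothing is waited for:
    stale_cache = [{'id': '567525a4-4b16-4004-91d5-545086d01227', 'active': True}]
    assert get_events_to_wait_for(stale_cache) == []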

I've added logging of the event_names to the "Timeout waiting ..." log message in a related patch to help with debugging when it happens again.

[1] https://review.openstack.org/#/c/400384/

Revision history for this message
melanie witt (melwitt) wrote :

Looking at a new set of logs for the same bug today, I found that the message in n-cpu [1]:

Jan 29 16:01:36.673427 ubuntu-xenial-rax-ord-0002240582 nova-compute[29444]: WARNING nova.compute.manager [None req-1cb07971-b6f2-41f9-b34b-bc03b867abdb service nova] [instance: 3fa55d94-b1f4-42f4-8ba0-9ca46c71d7c0] Received unexpected event network-vif-plugged-8de74fd2-a3bc-4d41-9c11-c04f25b52b6d for instance with vm_state active and task_state rebooting_hard.

is received after we have begun the hard reboot, while we are tearing down the virt domain for the instance, *before* we even attempt to unplug the vif and re-plug it. Could this be an erroneous re-sending of the original plug event that occurred during the initial instance creation?

This is the event for the initial instance boot, from q-svc:

Jan 29 16:01:31.012692 ubuntu-xenial-rax-ord-0002240582 neutron-server[21191]: DEBUG neutron.notifiers.nova [-] Sending events: [{'status': 'completed', 'tag': u'8de74fd2-a3bc-4d41-9c11-c04f25b52b6d', 'name': 'network-vif-plugged', 'server_uuid': u'3fa55d94-b1f4-42f4-8ba0-9ca46c71d7c0'}] {{(pid=21288) send_events /opt/stack/new/neutron/neutron/notifiers/nova.py:241}}

We later see a network-changed event just before the reboot in q-svc. Maybe this comes from the :

Jan 29 16:01:35.091047 ubuntu-xenial-rax-ord-0002240582 neutron-server[21191]: DEBUG neutron.notifiers.nova [-] Sending events: [{'tag': u'8de74fd2-a3bc-4d41-9c11-c04f25b52b6d', 'name': 'network-changed', 'server_uuid': u'3fa55d94-b1f4-42f4-8ba0-9ca46c71d7c0'}] {{(pid=21286) send_events /opt/stack/new/neutron/neutron/notifiers/nova.py:241}}

Reboot initiated in n-cpu:

Jan 29 16:01:35.600736 ubuntu-xenial-rax-ord-0002240582 nova-compute[29444]: INFO nova.compute.manager [None req-8935127a-f033-4675-92c5-b1720131f285 tempest-SecurityGroupsTestJSON-504835102 tempest-SecurityGroupsTestJSON-504835102] [instance: 3fa55d94-b1f4-42f4-8ba0-9ca46c71d7c0] Rebooting instance

Then we see neutron send an unexpected plug event, even though we haven't plugged the vif yet:

Jan 29 16:01:36.514905 ubuntu-xenial-rax-ord-0002240582 neutron-server[21191]: DEBUG neutron.notifiers.nova [-] Sending events: [{'status': 'completed', 'tag': u'8de74fd2-a3bc-4d41-9c11-c04f25b52b6d', 'name': 'network-vif-plugged', 'server_uuid': u'3fa55d94-b1f4-42f4-8ba0-9ca46c71d7c0'}] {{(pid=21288) send_events /opt/stack/new/neutron/neutron/notifiers/nova.py:241}}

And then n-cpu logs "Received unexpected event".

Then, n-cpu unplugs the vif,

Jan 29 16:01:37.704617 ubuntu-xenial-rax-ord-0002240582 nova-compute[29444]: INFO os_vif [None req-1cb07971-b6f2-41f9-b34b-bc03b867abdb service nova] Successfully unplugged vif VIFBridge(active=False,address=fa:16:3e:58:eb:2a,bridge_name='brq169b5db1-db',has_traffic_filtering=True,id=8de74fd2-a3bc-4d41-9c11-c04f25b52b6d,network=Network(169b5db1-db8e-4c45-b3ad-23e2aa4e1908),plugin='linux_bridge',port_profile=<?>,preserve_on_delete=False,vif_name='tap8de74fd2-a3')

prepares to wait for the plug event,

Jan 29 16:01:38.080682 ubuntu-xenial-rax-ord-0002240582 nova-compute[29444]: DEBUG nova.compute.manager [None req-125bc0b4-7bac-413f-9a8f-0f56750108bd service nova] [instance: 3fa55d94-b1f4-42f4-8ba0-9ca46c71d...


Revision history for this message
melanie witt (melwitt) wrote :

I think I found the answer. It looks like this was already documented in our code in a different place [1]:

        # NOTE(mriedem): vifs_already_plugged=True here, regardless of whether
        # or not we've migrated to another host, because we unplug VIFs locally
        # and the status change in the port might go undetected by the neutron
        # L2 agent (or neutron server) so neutron may not know that the VIF was
        # unplugged in the first place and never send an event.
        guest = self._create_domain_and_network(context, xml, instance,
                                        network_info,
                                        block_device_info=block_device_info,
                                        power_on=power_on,
                                        vifs_already_plugged=True,
                                        post_xml_callback=gen_confdrive)

So, we shouldn't be waiting for vif plug events if we've only locally unplugged vifs.

[1] https://github.com/openstack/nova/blob/d189e68/nova/virt/libvirt/driver.py#L8088-L8092
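
So the shape of the eventual fix (see the commits below) is to apply the same approach to the hard-reboot path: because the vifs were only unplugged locally, neutron may never send a new plug event, and the reboot should not block waiting for one. Roughly, as an illustrative excerpt rather than the actual diff:

    # In _hard_reboot, create the domain with vifs_already_plugged=True so
    # that no network-vif-plugged waiters are registered (sketch only):
    guest = self._create_domain_and_network(context, xml, instance,
                                            network_info,
                                            block_device_info=block_device_info,
                                            vifs_already_plugged=True)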

Changed in nova:
assignee: nobody → melanie witt (melwitt)
status: New → Triaged
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to nova (master)

Fix proposed to branch: master
Review: https://review.openstack.org/540168

Changed in nova:
status: Triaged → In Progress
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix merged to nova (master)

Reviewed: https://review.openstack.org/538103
Committed: https://git.openstack.org/cgit/openstack/nova/commit/?id=6ea939b792279595e37b80fff3d44c1e712d4fc7
Submitter: Zuul
Branch: master

commit 6ea939b792279595e37b80fff3d44c1e712d4fc7
Author: melanie witt <email address hidden>
Date: Fri Jan 26 01:30:51 2018 +0000

    Log the events we timed out waiting for while plugging vifs

    We currently log that we timed out waiting for vif plugging callbacks
    from neutron but we don't include detail about which vif_ids we were
    waiting for. This adds the event_names to the log message to aid
    in debugging.

    Related-Bug: #1744361

    Change-Id: I8b67f1049b6a968b1dd0d839a7f0b30aa1730eeb
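
As a rough illustration of what this related change adds (assumed shape, not the actual diff), the timeout warning now carries the specific event names that were being waited for:

    import logging

    LOG = logging.getLogger(__name__)

    def warn_vif_plug_timeout(instance_uuid, events):
        # events is a list of (name, tag) pairs, e.g.
        # [('network-vif-plugged', '567525a4-4b16-...')]
        event_names = ['%s-%s' % (name, tag) for name, tag in events]
        LOG.warning('Timeout waiting for %(events)s callbacks for '
                    'instance %(uuid)s',
                    {'events': event_names, 'uuid': instance_uuid})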

tags: added: queens-rc-potential
Revision history for this message
Matt Riedemann (mriedem) wrote :

We'll also have to backport the nova change to stable/pike, since we merged this change, which caused the regression:

https://review.openstack.org/#/c/531407/

Changed in nova:
importance: Undecided → Medium
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to nova (master)

Reviewed: https://review.openstack.org/540168
Committed: https://git.openstack.org/cgit/openstack/nova/commit/?id=236bb54493385bcfe2034e8b0fd7a387fa66635a
Submitter: Zuul
Branch: master

commit 236bb54493385bcfe2034e8b0fd7a387fa66635a
Author: melanie witt <email address hidden>
Date: Thu Feb 1 22:27:57 2018 +0000

    Don't wait for vif plug events during _hard_reboot

    Originally, in change Id188d48609f3d22d14e16c7f6114291d547a8986 we
    added a re-initialization of volumes, encryptors, and vifs to hard
    reboot. When creating the libvirt domain and network, we were waiting
    for vif plug events from neutron when we replugged the vifs. Then, we
    started seeing timeouts in the linuxbridge gate job because compute
    was timing out waiting for plug events from neutron during a hard
    reboot.

    It turns out that the behavior of neutron plug events depends on what
    vif type we're using and we're also using a stale network info_cache
    throughout the hard reboot code path, so we can't be 100% sure we know
    which vifs to wait for plug events from anyway. We coincidentally get
    some info_cache refreshes from network-changed events from neutron,
    but we shouldn't rely on that.

    Ideally, we could do something like wait for an unplug event after we
    unplug the vif, then refresh the network_info cache, then wait for the
    plug event. BUT, in the case of the os-vif linuxbridge unplug method,
    it is a no-op, so I don't think we could expect to get an unplug
    event for it (and we don't see any network-vif-unplugged events sent
    in the q-svc log for the linuxbridge job during a hard reboot).

    Closes-Bug: #1744361

    Change-Id: Ib0cf5d55750f13d0499a570f14024dca551ed4d4

Changed in nova:
status: In Progress → Fix Released
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to nova (stable/pike)

Fix proposed to branch: stable/pike
Review: https://review.openstack.org/540555

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to nova (stable/pike)

Reviewed: https://review.openstack.org/540555
Committed: https://git.openstack.org/cgit/openstack/nova/commit/?id=83ffc76edd360929925fc4bb66e5dcbf8254a30e
Submitter: Zuul
Branch: stable/pike

commit 83ffc76edd360929925fc4bb66e5dcbf8254a30e
Author: melanie witt <email address hidden>
Date: Thu Feb 1 22:27:57 2018 +0000

    Don't wait for vif plug events during _hard_reboot

    Originally, in change Id188d48609f3d22d14e16c7f6114291d547a8986 we
    added a re-initialization of volumes, encryptors, and vifs to hard
    reboot. When creating the libvirt domain and network, we were waiting
    for vif plug events from neutron when we replugged the vifs. Then, we
    started seeing timeouts in the linuxbridge gate job because compute
    was timing out waiting for plug events from neutron during a hard
    reboot.

    It turns out that the behavior of neutron plug events depends on what
    vif type we're using and we're also using a stale network info_cache
    throughout the hard reboot code path, so we can't be 100% sure we know
    which vifs to wait for plug events from anyway. We coincidentally get
    some info_cache refreshes from network-changed events from neutron,
    but we shouldn't rely on that.

    Ideally, we could do something like wait for an unplug event after we
    unplug the vif, then refresh the network_info cache, then wait for the
    plug event. BUT, in the case of the os-vif linuxbridge unplug method,
    it is a no-op, so I don't think we could expect to get an unplug
    event for it (and we don't see any network-vif-unplugged events sent
    in the q-svc log for the linuxbridge job during a hard reboot).

    Closes-Bug: #1744361

    Change-Id: Ib0cf5d55750f13d0499a570f14024dca551ed4d4
    (cherry picked from commit 236bb54493385bcfe2034e8b0fd7a387fa66635a)

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to nova (stable/ocata)

Fix proposed to branch: stable/ocata
Review: https://review.openstack.org/540680

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to nova (stable/ocata)

Reviewed: https://review.openstack.org/540680
Committed: https://git.openstack.org/cgit/openstack/nova/commit/?id=b9f9e0912101f28d4a3706b16f194b4f17ae60c2
Submitter: Zuul
Branch: stable/ocata

commit b9f9e0912101f28d4a3706b16f194b4f17ae60c2
Author: melanie witt <email address hidden>
Date: Thu Feb 1 22:27:57 2018 +0000

    Don't wait for vif plug events during _hard_reboot

    Originally, in change Id188d48609f3d22d14e16c7f6114291d547a8986 we
    added a re-initialization of volumes, encryptors, and vifs to hard
    reboot. When creating the libvirt domain and network, we were waiting
    for vif plug events from neutron when we replugged the vifs. Then, we
    started seeing timeouts in the linuxbridge gate job because compute
    was timing out waiting for plug events from neutron during a hard
    reboot.

    It turns out that the behavior of neutron plug events depends on what
    vif type we're using and we're also using a stale network info_cache
    throughout the hard reboot code path, so we can't be 100% sure we know
    which vifs to wait for plug events from anyway. We coincidentally get
    some info_cache refreshes from network-changed events from neutron,
    but we shouldn't rely on that.

    Ideally, we could do something like wait for an unplug event after we
    unplug the vif, then refresh the network_info cache, then wait for the
    plug event. BUT, in the case of the os-vif linuxbridge unplug method,
    it is a no-op, so I don't think we could expect to get an unplug
    event for it (and we don't see any network-vif-unplugged events sent
    in the q-svc log for the linuxbridge job during a hard reboot).

    Conflicts:
            nova/tests/unit/virt/libvirt/test_driver.py
            nova/virt/libvirt/driver.py

    NOTE(lyarwood): As with the stable/ocata version of
    Id188d48609f3d22d14e16c7f6114291d547a8986 these conflicts are caused by
    _create_domain_and_network using an additional disk_info argument in O.

    Closes-Bug: #1744361

    Change-Id: Ib0cf5d55750f13d0499a570f14024dca551ed4d4
    (cherry picked from commit 236bb54493385bcfe2034e8b0fd7a387fa66635a)
    (cherry picked from commit 83ffc76edd360929925fc4bb66e5dcbf8254a30e)

tags: added: in-stable-ocata
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix proposed to nova (master)

Related fix proposed to branch: master
Review: https://review.openstack.org/541442

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix merged to nova (master)

Reviewed: https://review.openstack.org/541442
Committed: https://git.openstack.org/cgit/openstack/nova/commit/?id=aaf37a26d6caa124f0cc6c3fe21bfdf58ccb8517
Submitter: Zuul
Branch: master

commit aaf37a26d6caa124f0cc6c3fe21bfdf58ccb8517
Author: melanie witt <email address hidden>
Date: Tue Feb 6 20:32:33 2018 +0000

    Refine waiting for vif plug events during _hard_reboot

    In change Ib0cf5d55750f13d0499a570f14024dca551ed4d4, we stopped waiting
    for vif plug events during hard reboot and start, because the behavior
    of neutron plug events depended on the vif type and we couldn't rely on
    the stale network info cache.

    This refines the logic not to wait for vif plug events only for the
    bridge vif type, instead of for all types. We also add a flag to
    _create_domain_and_network to indicate that we're in the middle of a
    reboot and to expect to wait for plug events for all vif types except
    the bridge vif type, regardless of the vif's 'active' status.

    We only query network_info from neutron at the beginning of a reboot,
    before we've unplugged anything, so the majority of the time, the vifs
    will be active=True. The current logic in get_neutron_events will only
    expect events for vifs with 'active' status False. This adds a way to
    override that if we know we've already unplugged vifs as part of a
    reboot.

    Related-Bug: #1744361

    Change-Id: Ib08afad3822f2ca95cfeea18d7f4fc4cb407b4d6
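
A simplified sketch of the refined event selection described in this commit (the names and structure are illustrative, not the exact nova implementation):

    VIF_TYPE_BRIDGE = 'bridge'

    def get_events_to_wait_for(network_info, in_reboot=False):
        if in_reboot:
            # We just unplugged the vifs ourselves, so the cached 'active'
            # flag is unreliable: wait for every vif type except the bridge
            # type, which never generates a plug event here (the linuxbridge
            # unplug is a local no-op).
            return [('network-vif-plugged', vif['id'])
                    for vif in network_info
                    if vif.get('type') != VIF_TYPE_BRIDGE]
        # Normal path: only wait for vifs neutron reports as not yet active.
        return [('network-vif-plugged', vif['id'])
                for vif in network_info
                if vif.get('active', True) is False]

    # Example: during a reboot a bridge vif produces no expected events,
    # while another vif type would.
    vifs = [{'id': 'port-1', 'type': 'bridge', 'active': True},
            {'id': 'port-2', 'type': 'ovs', 'active': True}]
    assert get_events_to_wait_for(vifs, in_reboot=True) == [
        ('network-vif-plugged', 'port-2')]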

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/nova 17.0.0.0rc1

This issue was fixed in the openstack/nova 17.0.0.0rc1 release candidate.

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix proposed to nova (stable/queens)

Related fix proposed to branch: stable/queens
Review: https://review.openstack.org/542738

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix proposed to nova (stable/pike)

Related fix proposed to branch: stable/pike
Review: https://review.openstack.org/542748

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix proposed to nova (stable/ocata)

Related fix proposed to branch: stable/ocata
Review: https://review.openstack.org/542767

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/nova 16.1.0

This issue was fixed in the openstack/nova 16.1.0 release.

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix merged to nova (stable/queens)

Reviewed: https://review.openstack.org/542738
Committed: https://git.openstack.org/cgit/openstack/nova/commit/?id=5a10047f9dd4383f6ec9046d392f26c93f8420d4
Submitter: Zuul
Branch: stable/queens

commit 5a10047f9dd4383f6ec9046d392f26c93f8420d4
Author: melanie witt <email address hidden>
Date: Tue Feb 6 20:32:33 2018 +0000

    Refine waiting for vif plug events during _hard_reboot

    In change Ib0cf5d55750f13d0499a570f14024dca551ed4d4, we stopped waiting
    for vif plug events during hard reboot and start, because the behavior
    of neutron plug events depended on the vif type and we couldn't rely on
    the stale network info cache.

    This refines the logic not to wait for vif plug events only for the
    bridge vif type, instead of for all types. We also add a flag to
    _create_domain_and_network to indicate that we're in the middle of a
    reboot and to expect to wait for plug events for all vif types except
    the bridge vif type, regardless of the vif's 'active' status.

    We only query network_info from neutron at the beginning of a reboot,
    before we've unplugged anything, so the majority of the time, the vifs
    will be active=True. The current logic in get_neutron_events will only
    expect events for vifs with 'active' status False. This adds a way to
    override that if we know we've already unplugged vifs as part of a
    reboot.

    Related-Bug: #1744361

    Change-Id: Ib08afad3822f2ca95cfeea18d7f4fc4cb407b4d6
    (cherry picked from commit aaf37a26d6caa124f0cc6c3fe21bfdf58ccb8517)

tags: added: in-stable-queens
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix merged to nova (stable/pike)

Reviewed: https://review.openstack.org/542748
Committed: https://git.openstack.org/cgit/openstack/nova/commit/?id=e06ad602f3ef98c505e49078dd225dde2a79e2a1
Submitter: Zuul
Branch: stable/pike

commit e06ad602f3ef98c505e49078dd225dde2a79e2a1
Author: melanie witt <email address hidden>
Date: Tue Feb 6 20:32:33 2018 +0000

    Refine waiting for vif plug events during _hard_reboot

    In change Ib0cf5d55750f13d0499a570f14024dca551ed4d4, we stopped waiting
    for vif plug events during hard reboot and start, because the behavior
    of neutron plug events depended on the vif type and we couldn't rely on
    the stale network info cache.

    This refines the logic not to wait for vif plug events only for the
    bridge vif type, instead of for all types. We also add a flag to
    _create_domain_and_network to indicate that we're in the middle of a
    reboot and to expect to wait for plug events for all vif types except
    the bridge vif type, regardless of the vif's 'active' status.

    We only query network_info from neutron at the beginning of a reboot,
    before we've unplugged anything, so the majority of the time, the vifs
    will be active=True. The current logic in get_neutron_events will only
    expect events for vifs with 'active' status False. This adds a way to
    override that if we know we've already unplugged vifs as part of a
    reboot.

    Conflicts:
            nova/virt/libvirt/driver.py

    NOTE(lyarwood): Ica323b87fa85a454fca9d46ada3677f18fe50022 and
    I13de970c3beed29311d43991115a0c6d28ac14e0 are the source of the above
    conflicts in driver.py. The first removed encryptor attach logic from
    _create_domain_and_network in Queens while the second altered the
    signature of _create_domain_and_network in Queens, removing reboot that
    is then reintroduced with this change.

    Related-Bug: #1744361

    Change-Id: Ib08afad3822f2ca95cfeea18d7f4fc4cb407b4d6
    (cherry picked from commit aaf37a26d6caa124f0cc6c3fe21bfdf58ccb8517)
    (cherry picked from commit 5a10047f9dd4383f6ec9046d392f26c93f8420d4)

tags: added: in-stable-pike
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Change abandoned on nova (stable/ocata)

Change abandoned by Lee Yarwood (<email address hidden>) on branch: stable/ocata
Review: https://review.openstack.org/542767

Changed in neutron:
status: Confirmed → Invalid
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/nova 15.1.1

This issue was fixed in the openstack/nova 15.1.1 release.
