TaggedBootDevicesTest.test_tagged_boot_devices intermittently fails waiting for network-vif-plugged event which neutron does not send

Bug #1808171 reported by Matt Riedemann
Affects: neutron
Status: Fix Released
Importance: Medium
Assigned to: Slawek Kaplonski

Bug Description

Seen here:

http://logs.openstack.org/46/623246/3/check/nova-multiattach/efa830b/logs/screen-n-cpu.txt.gz?#_Dec_12_00_01_13_179474

Dec 12 00:01:13.179474 ubuntu-xenial-inap-mtl01-0001136812 nova-compute[29399]: WARNING nova.virt.libvirt.driver [None req-f43f3c35-dc0b-4aa4-bfd0-4046045b315e tempest-TaggedBootDevicesTest-250467933 tempest-TaggedBootDevicesTest-250467933] [instance: 18e5987e-a535-4044-84aa-dd8f023cedcc] Timeout waiting for [('network-vif-plugged', u'472ab433-bc18-41b5-85ae-009611117b70'), ('network-vif-plugged', u'6b476c8d-35b5-4380-a4c0-9e691f568ff5'), ('network-vif-plugged', u'8c3289ef-07aa-4048-8df1-979c2d8af07b'), ('network-vif-plugged', u'5a8d8e13-3dba-4b64-a17d-67724ef420ce'), ('network-vif-plugged', u'727b1a5d-e704-4185-9765-97c8ffb4f559'), ('network-vif-plugged', u'45563a2d-cd6c-4fe3-b5a9-74699bea3737'), ('network-vif-plugged', u'bc284b29-921e-495c-ad53-6f5fe9df6cd8')] for instance with vm_state building and task_state spawning.: Timeout: 300 seconds

I've seen this in a couple of failures, and each time the only one of the 7 ports for which we don't get the network-vif-plugged event is the first one, in this case 472ab433-bc18-41b5-85ae-009611117b70.

The 472ab433-bc18-41b5-85ae-009611117b70 vif is plugged in nova-compute here:

Dec 11 23:56:11.855811 ubuntu-xenial-inap-mtl01-0001136812 nova-compute[29399]: INFO os_vif [None req-f43f3c35-dc0b-4aa4-bfd0-4046045b315e tempest-TaggedBootDevicesTest-250467933 tempest-TaggedBootDevicesTest-250467933] Successfully plugged vif VIFOpenVSwitch(active=False,address=fa:16:3e:fb:dd:9f,bridge_name='br-int',has_traffic_filtering=True,id=472ab433-bc18-41b5-85ae-009611117b70,network=Network(4988ef3d-6a8f-477f-b0a2-298c33019494),plugin='ovs',port_profile=VIFPortProfileOpenVSwitch,preserve_on_delete=False,vif_name='tap472ab433-bc')

Interestingly, right before that in the neutron server logs I see a network-changed event for that port:

Dec 11 23:55:46.981194 ubuntu-xenial-inap-mtl01-0001136812 neutron-server[20899]: DEBUG neutron.notifiers.nova [-] Sending events: [{'tag': u'472ab433-bc18-41b5-85ae-009611117b70', 'name': 'network-changed', 'server_uuid': u'18e5987e-a535-4044-84aa-dd8f023cedcc'}] {{(pid=21001) send_events /opt/stack/new/neutron/neutron/notifiers/nova.py:245}}

It looks like that is due to the port binding, from the neutron agent logs:

Dec 11 23:55:46.383255 ubuntu-xenial-inap-mtl01-0001136812 neutron-openvswitch-agent[21879]: DEBUG neutron.agent.resource_cache [None req-8081683c-7287-4e57-936f-3376788aec97 service neutron] Resource Port 472ab433-bc18-41b5-85ae-009611117b70 updated (revision_number 1->2). Old fields: {'device_owner': u'', 'bindings': [PortBinding(host='',port_id=472ab433-bc18-41b5-85ae-009611117b70,profile={},status='ACTIVE',vif_details=None,vif_type='unbound',vnic_type='normal')]} New fields: {'device_owner': u'compute:nova', 'bindings': [PortBinding(host='ubuntu-xenial-inap-mtl01-0001136812',port_id=472ab433-bc18-41b5-85ae-009611117b70,profile={},status='ACTIVE',vif_details=None,vif_type='unbound',vnic_type='normal')]} {{(pid=21879) record_resource_update /opt/stack/new/neutron/neutron/agent/resource_cache.py:186}}

Then in the neutron agent I see this:

Dec 11 23:56:12.724830 ubuntu-xenial-inap-mtl01-0001136812 neutron-openvswitch-agent[21879]: DEBUG neutron.plugins.ml2.drivers.openvswitch.agent.ovs_neutron_agent [None req-cc26f856-2521-411e-ae10-34ad96b7c665 None None] Processing port: 472ab433-bc18-41b5-85ae-009611117b70 {{(pid=21879) treat_devices_added_or_updated /opt/stack/new/neutron/neutron/plugins/ml2/drivers/openvswitch/agent/ovs_neutron_agent.py:1605}}
Dec 11 23:56:12.725010 ubuntu-xenial-inap-mtl01-0001136812 neutron-openvswitch-agent[21879]: INFO neutron.plugins.ml2.drivers.openvswitch.agent.ovs_neutron_agent [None req-cc26f856-2521-411e-ae10-34ad96b7c665 None None] Port 472ab433-bc18-41b5-85ae-009611117b70 was not found on the integration bridge and will therefore not be processed

Is that a problem which might cause the network-vif-plugged event to not be sent?

And then I see this:

Dec 11 23:56:14.095895 ubuntu-xenial-inap-mtl01-0001136812 neutron-openvswitch-agent[21879]: INFO neutron.plugins.ml2.drivers.openvswitch.agent.ovs_neutron_agent [None req-cc26f856-2521-411e-ae10-34ad96b7c665 None None] Ports set([u'472ab433-bc18-41b5-85ae-009611117b70', u'bc284b29-921e-495c-ad53-6f5fe9df6cd8', u'45563a2d-cd6c-4fe3-b5a9-74699bea3737', u'6b476c8d-35b5-4380-a4c0-9e691f568ff5', u'8c3289ef-07aa-4048-8df1-979c2d8af07b', u'5a8d8e13-3dba-4b64-a17d-67724ef420ce', u'727b1a5d-e704-4185-9765-97c8ffb4f559']) skipped, changing status to down

http://logstash.openstack.org/#dashboard/file/logstash.json?query=message%3A%5C%22tempest-TaggedBootDevicesTest%5C%22%20AND%20message%3A%5C%22Timeout%20waiting%20for%20%5B('network-vif-plugged'%5C%22%20AND%20tags%3A%5C%22screen-n-cpu.txt%5C%22&from=7d

4 hits in 7 days, check and gate, all failures.

Tags: gate-failure
Revision history for this message
sean mooney (sean-k-mooney) wrote :

So, just looking at the logs around the time processing of the port was skipped in the neutron agent, we see that the VM's tap interface tap472ab433-bc, corresponding to port 472ab433-bc18-41b5-85ae-009611117b70, is present in the DB list at http://logs.openstack.org/46/623246/3/check/nova-multiattach/efa830b/logs/screen-q-agt.txt.gz#_Dec_11_23_56_12_715432

Dec 11 23:56:12.715432 ubuntu-xenial-inap-mtl01-0001136812 neutron-openvswitch-agent[21879]: DEBUG ovsdbapp.backend.ovs_idl.transaction [-] Running txn n=1 command(idx=0): DbListCommand(if_exists=True, records=[u'tap6b476c8d-35', u'tap459d08b5-8c', u'tapc4cbed6d-26', u'int-br-ex', u'tap9596db0a-58', u'tapd2e38552-81', u'qr-6ffb4153-0f', u'qg-89b08004-24', u'tap467f555d-fd', u'tap45563a2d-cd', u'qg-88ca1551-8f', u'tap5a8d8e13-3d', u'qr-35a687f0-fb', u'qg-c992632f-da', u'tapf2f4235b-c3', u'tap727b1a5d-e7', u'qg-a51133f8-41', u'tapd883b80b-39', u'tap8c3289ef-07', u'qr-c64db2ce-aa', u'tapb734a414-18', u'tapbc284b29-92', u'qr-3a500395-7f', u'patch-tun', u'tap472ab433-bc'], table=Interface, columns=['name', 'external_ids', 'ofport'], row=False) {{(pid=21879) do_commit /usr/local/lib/python2.7/dist-packages/ovsdbapp/backend/ovs_idl/transaction.py:84}}

Then there is a warning:

ubuntu-xenial-inap-mtl01-0001136812 neutron-openvswitch-agent[21879]: WARNING neutron.agent.common.ovs_lib [None req-cc26f856-2521-411e-ae10-34ad96b7c665 None None] ofport: [] for VIF: 472ab433-bc18-41b5-85ae-009611117b70 is not a positive integer: OVSFWTagNotFound: Cannot get tag for port tap5b6aa631-3b from its other_config: {}

And then the message saying the port was not found:

Dec 11 23:56:12.725010 ubuntu-xenial-inap-mtl01-0001136812 neutron-openvswitch-agent[21879]: INFO neutron.plugins.ml2.drivers.openvswitch.agent.ovs_neutron_agent [None req-cc26f856-2521-411e-ae10-34ad96b7c665 None None] Port 472ab433-bc18-41b5-85ae-009611117b70 was not found on the integration bridge and will therefore not be processed

Now, the strange thing about the warning is that the tap device listed is wrong:
it states "OVSFWTagNotFound: Cannot get tag for port tap5b6aa631-3b from its other_config: {}", but the tap should be tap472ab433-bc.

Revision history for this message
Matt Riedemann (mriedem) wrote :

Looks like these failures started showing up around Dec 11, and looking at neutron commit history I wonder if this could be related:

https://github.com/openstack/neutron/commit/22c9cf3d95c3d14351045bac2920b4e0994ec42f

Revision history for this message
Slawek Kaplonski (slaweq) wrote :

From this warning in q-agt it looks like port tap472ab433-bc wasn't properly created in OVS, and thus it had ofport = -1. Because of that, neutron-openvswitch-agent didn't process it and it was never provisioned by the L2 entity at all.
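To illustrate the condition the agent is reacting to, here is a minimal sketch (hypothetical code, not neutron's actual implementation; the row layout is only illustrative) of the "usable ofport" check:

    # Hypothetical sketch: OVS leaves ofport at -1 when the interface row
    # exists in ovsdb but no matching datapath device has been attached yet
    # (e.g. the tap device was not created properly).
    def has_usable_ofport(ofport):
        """A port is only wireable once OVS assigns a positive ofport."""
        return isinstance(ofport, int) and ofport > 0

    # A row fetched right after os-vif adds the port may look like this:
    row = {'name': 'tap472ab433-bc', 'ofport': -1}

    if not has_usable_ofport(row['ofport']):
        # This is the state the agent reports as "not found on the
        # integration bridge", so the port is skipped in this iteration.
        print("skipping %s: ofport not assigned yet" % row['name'])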

What is strange to me is the "OVSFWTagNotFound: Cannot get tag for port tap5b6aa631-3b from its other_config: {}" message on the same line, as in http://logs.openstack.org/46/623246/3/check/nova-multiattach/efa830b/logs/screen-q-agt.txt.gz?level=WARNING#_Dec_11_23_56_12_718674

I will continue investigating this issue in the next few days.

Changed in neutron:
assignee: nobody → Slawek Kaplonski (slaweq)
Revision history for this message
Slawek Kaplonski (slaweq) wrote :

After investigating this issue further, based on the example from http://logs.openstack.org/46/623246/3/check/nova-multiattach/efa830b/logs/screen-q-agt.txt.gz, I know that the warning that the port has ofport -1 isn't the real problem.
It can happen when neutron-openvswitch-agent asks OVS for the port's details just after libvirt creates the port. And in fact that was the case for e.g. ports 8c3289ef-07aa-4048-8df1-979c2d8af07b and 472ab433-bc18-41b5-85ae-009611117b70 in the analyzed log file.

For most ports (like 8c3289ef-07aa-4048-8df1-979c2d8af07b) the port is marked as DOWN because of this ofport=-1, but in the next iteration it is treated as "added/updated" and configured properly; see http://logs.openstack.org/46/623246/3/check/nova-multiattach/efa830b/logs/screen-q-agt.txt.gz#_Dec_11_23_56_17_619915
The problem is that, for a reason unknown to me, this doesn't happen for port 472ab433-bc18-41b5-85ae-009611117b70, and that one is never configured later at all.
I will continue debugging this and try to find out why that happens.

Revision history for this message
Slawek Kaplonski (slaweq) wrote :

I wonder if this issue isn't somehow related to the os-vif patch https://github.com/openstack/os-vif/commit/165ed325917e5deadb274ad9c122db157c0b55b2#diff-024723487bc761f5ba9d39bbf77f7721.
Please check Numan Siddique's comment in https://review.openstack.org/#/c/449695/6 for details.

Revision history for this message
sean mooney (sean-k-mooney) wrote :

I think it is related, but not in the way you are implying:
I think the os-vif change is triggering a latent bug in the L2 agent.

See my latest comment on patch set 3 (at 12:17AM)
here: https://review.openstack.org/#/c/602432/3

I am seeing another instance of this issue where we have a similar warning,
http://logs.openstack.org/32/602432/3/check/tempest-full/4b76af0/controller/logs/screen-q-agt.txt.gz#_Jan_16_21_35_54_811041
but the port name is incorrect.

The OVSDB monitor first detects the port as inserted into the OVS DB here:
http://logs.openstack.org/32/602432/3/check/tempest-full/4b76af0/controller/logs/screen-q-agt.txt.gz#_Jan_16_21_35_54_775597

Then the action changes to "new":
http://logs.openstack.org/32/602432/3/check/tempest-full/4b76af0/controller/logs/screen-q-agt.txt.gz#_Jan_16_21_35_54_776433

Then, here, it is somehow listed as a current, added, and updated port simultaneously.

The interface associated with the neutron port tapb3d5de79-21 is the first one returned by the list-interfaces command here:
http://logs.openstack.org/32/602432/3/check/tempest-full/4b76af0/controller/logs/screen-q-agt.txt.gz#_Jan_16_21_35_54_809441

But without running any other list command, the neutron agent says it is not present on br-int
shortly after, here:
http://logs.openstack.org/32/602432/3/check/tempest-full/4b76af0/controller/logs/screen-q-agt.txt.gz#_Jan_16_21_35_54_815358

And then the agent chooses to skip the port and set it to DOWN here:
http://logs.openstack.org/32/602432/3/check/tempest-full/4b76af0/controller/logs/screen-q-agt.txt.gz#_Jan_16_21_35_55_052196

I have not stepped through the code on the neutron side to confirm, but the logs look very suspicious to me.

I think the error is likely related to this event,
http://logs.openstack.org/32/602432/3/check/tempest-full/4b76af0/controller/logs/screen-q-agt.txt.gz#_Jan_16_21_35_54_806738
where the port is in the 3 sets (current, added, updated) simultaneously.

This message is logged from here:
https://github.com/openstack/neutron/blob/master/neutron/plugins/ml2/drivers/openvswitch/agent/ovs_neutron_agent.py#L2191

It looks like the port_info object is set here as part of the return value of self.process_port_info:

https://github.com/openstack/neutron/blob/master/neutron/plugins/ml2/drivers/openvswitch/agent/ovs_neutron_agent.py#L2166

but it's actually built by this function:
https://github.com/openstack/neutron/blob/master/neutron/plugins/ml2/drivers/openvswitch/agent/ovs_neutron_agent.py#L1337
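To make the "3 sets" observation above concrete, here is a rough sketch (assumed logic only, not a verbatim copy of process_port_info) of how event-driven bookkeeping can place one brand-new port in 'current', 'added', and 'updated' at the same time:

    # Assumed shape of the event-based scan: the agent derives its port
    # sets from OVSDB monitor events instead of a full bridge poll, so an
    # insert event and an update event arriving in the same interval can
    # put a new port into all three sets together.
    registered_ports = set()  # ports the agent has already wired up

    events = {'added': [{'name': 'tapb3d5de79-21', 'ofport': -1}],
              'modified': [{'name': 'tapb3d5de79-21', 'ofport': -1}]}

    added = {p['name'] for p in events['added']}
    updated = {p['name'] for p in events['modified']}
    current = registered_ports | added  # the new port joins 'current' at once

    port_info = {'current': current,
                 'added': added - registered_ports,
                 'updated': updated & current}
    print(port_info)
    # {'current': {'tapb3d5de79-21'}, 'added': {'tapb3d5de79-21'},
    #  'updated': {'tapb3d5de79-21'}}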

I have not really spent enough time to trace this properly, so this may be unrelated, but the logs feel wrong to me.

Revision history for this message
sean mooney (sean-k-mooney) wrote :

You know what, looking at this again,
assuming what Numan says regarding libvirt is correct
(about libvirt recreating the port very shortly after os-vif does):

the issue in the neutron agent could be similar to
https://bugs.launchpad.net/neutron/+bug/1807239
We might have a similar race where:
1.) os-vif creates the OVS port
2.a) neutron detects it and starts to wire it up
2.b) libvirt in parallel recreates the port with "'--if-exists', 'del-port', dev, '--',
            'add-port', bridge, dev, ..."
3.) neutron sees the port be removed and re-added

While libvirt is replugging the interface, the neutron agent could observe the vswitch state between
the del-port and the add-port, leading to
http://logs.openstack.org/32/602432/3/check/tempest-full/4b76af0/controller/logs/screen-q-agt.txt.gz#_Jan_16_21_35_54_815358, where it says the port is missing even though
before and after that command runs the port would be present on the bridge.

Executing a single ovs-vsctl command with multiple sub-commands, like
"'--if-exists', 'del-port', dev, '--',
            'add-port', bridge, dev, ..."
used to be atomic in OVS, but at some point that changed and each command is now executed individually,
outside of a transaction. That means the port will actually be removed and re-added, which is why we changed os-vif to use 'add-port --may-exist' instead of the '--if-exists del-port' pattern that libvirt is apparently using (see the sketch below).
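To make the difference concrete, here are the two invocation patterns written as argument lists, like the fragments quoted above (the device and bridge names are just examples):

    # Per the analysis above, the sub-commands of a single ovs-vsctl
    # invocation are no longer applied atomically, so the libvirt-style
    # form briefly deletes the port before re-adding it, and the agent's
    # OVSDB monitor can observe that intermediate state.
    dev, bridge = 'tap472ab433-bc', 'br-int'

    libvirt_style = ['ovs-vsctl',
                     '--if-exists', 'del-port', dev, '--',
                     'add-port', bridge, dev]

    # os-vif's replacement: an idempotent add that never removes the port.
    os_vif_style = ['ovs-vsctl',
                    '--may-exist', 'add-port', bridge, dev]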

If this were the root cause, I would have expected https://review.openstack.org/#/c/602432/ to solve it, as libvirt will no longer interact with OVS. But since I am seeing this bug with that patch applied as well, I think there is more to this than a potential race.

Revision history for this message
Slawek Kaplonski (slaweq) wrote :

Hi,

Thanks, Sean, for your analysis. The log saying the port was "not found in br-int" is there because ofport was -1. But I agree that in such a case the port probably shouldn't be added to the "current", "added", or "updated" lists.

I will investigate it on my devstack.

Revision history for this message
Slawek Kaplonski (slaweq) wrote :

So, what I think happens is: when the port is initially added by os-vif, it triggers an OvsdbMonitor event (http://logs.openstack.org/32/602432/3/check/tempest-full/4b76af0/controller/logs/screen-q-agt.txt.gz#_Jan_16_21_35_54_776433), and the port is added to the events['added'] list, so neutron-openvswitch-agent processes it.
That causes issues and some errors in the logs, as the port is not actually configured yet (ofport=-1), so it isn't really processed. But for some reason it is nevertheless added to port_info["current"], and when that happens, the port probably won't be processed later because it is already in the "current" list.

I think we should skip events for ports with ofport=-1; there is no need to trigger the OVS agent when such a port is added to the bridge, IMO.
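A minimal sketch of that filtering idea (assumed shape only; the actual change is in the related fix linked below):

    # Hypothetical event filter: drop OVSDB Interface events whose ofport
    # is still -1 (or unset), since the same interface will generate
    # another event once OVS assigns a real ofport.
    def ofport_assigned(row):
        ofport = row.get('ofport')
        if isinstance(ofport, list):  # ovsdb encodes "unset" as a list
            return False
        return isinstance(ofport, int) and ofport > 0

    events = [{'name': 'tap472ab433-bc', 'ofport': -1},
              {'name': 'tap8c3289ef-07', 'ofport': 42}]

    to_process = [e for e in events if ofport_assigned(e)]
    print([e['name'] for e in to_process])  # ['tap8c3289ef-07']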

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix proposed to neutron (master)

Related fix proposed to branch: master
Review: https://review.openstack.org/631584

Revision history for this message
OpenStack Infra (hudson-openstack) wrote :

Related fix proposed to branch: master
Review: https://review.openstack.org/631763

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix merged to neutron (master)

Reviewed: https://review.openstack.org/631584
Committed: https://git.openstack.org/cgit/openstack/neutron/commit/?id=f8e0a497ad6a36fa0d9510cda9e7344c2ed73763
Submitter: Zuul
Branch: master

commit f8e0a497ad6a36fa0d9510cda9e7344c2ed73763
Author: Slawek Kaplonski <email address hidden>
Date: Thu Jan 17 18:18:49 2019 +0100

    Skip ovsdb_monitor events for ports with ofport=-1

    There is no need to trigger neutron-openvswitch-agent to
    start processing ports in case if invalid port (with
    ofport=-1) will be added to bridge.
    Such port will be processed later when ofport will be
    changed.

    Change-Id: I3465daf4809d5d56565f59b177b5f8870352cc9d
    Related-Bug: #1808171
    Related-Bug: #1811405

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Change abandoned on neutron (master)

Change abandoned by Slawek Kaplonski (<email address hidden>) on branch: master
Review: https://review.openstack.org/631763

Revision history for this message
Slawek Kaplonski (slaweq) wrote :

Is this still an issue with os-vif 1.14.0? I don't see any new examples of such a failure in logstash. The last one I saw was from 24.01, and it was still with os-vif 1.13.1: http://logs.openstack.org/77/616777/9/check/nova-multiattach/93a13d2/logs/devstacklog.txt.gz

Revision history for this message
Slawek Kaplonski (slaweq) wrote :

Marking this as fixed because it looks like this is OK with the newer os-vif version, IMO.

Changed in neutron:
status: Confirmed → Fix Committed
Changed in neutron:
status: Fix Committed → Fix Released