test_agent_resync_on_non_existing_bridge failing intermittently

Bug #2011377 reported by Miro Tomaska
This bug affects 1 person
Affects | Status | Importance | Assigned to | Milestone
neutron | Fix Released | Medium | Miro Tomaska |

Bug Description

Test neutron.tests.functional.agent.ovn.metadata.test_metadata_agent.TestMetadataAgent.test_agent_resync_on_non_existing_bridge is failing intermittently. Example: [0]

I just reintroduced this test [1] into the code. The failure does not happen every time, but I can reproduce it locally by using --until-failure, running with multiple concurrency (big hint), and running the whole TestMetadataAgent class of tests (another hint). Like this:

`tox -e dsvm-functional -- neutron.tests.functional.agent.ovn.metadata.test_metadata_agent.TestMetadataAgent --until-failure --concurrency 0`

When the failure happens, the following exception is found in the logs:

2023-03-10 17:49:11.861 40848 INFO neutron.agent.ovn.metadata.agent [-] Port ovn-port-feb89eb1-fcf8-4b38-8aee-d8dd4b6b497e in datapath ovn-f376a6ca-6f5b-4fa9-9fa9-d5f450bb801b bound to our chassis
2023-03-10 17:49:11.863 40848 INFO neutron.agent.ovn.metadata.agent [-] Provisioning metadata for network ovn-f376a6ca-6f5b-4fa9-9fa9-d5f450bb801b
2023-03-10 17:49:11.917 40848 DEBUG neutron.agent.ovn.metadata.agent [-] Creating VETH tapovn-f376a61 in ovnmeta-ovn-f376a6ca-6f5b-4fa9-9fa9-d5f450bb801b namespace provision_datapath /home/zuul/src/opendev.org/openstack/neutron/neutron/agent/ovn/metadata/agent.py:603
2023-03-10 17:49:11.923 41596 DEBUG neutron.privileged.agent.linux.ip_lib [-] Interface tapovn-f376a60 not found in namespace None get_link_id /home/zuul/src/opendev.org/openstack/neutron/neutron/privileged/agent/linux/ip_lib.py:204
2023-03-10 17:49:12.368 40848 ERROR ovsdbapp.event [-] Unexpected exception in notify_loop: neutron.privileged.agent.linux.ip_lib.NetworkInterfaceNotFound: Network interface tapovn-f376a61 not found in namespace ovnmeta-ovn-f376a6ca-6f5b-4fa9-9fa9-d5f450bb801b.
2023-03-10 17:49:12.368 40848 ERROR ovsdbapp.event Traceback (most recent call last):
2023-03-10 17:49:12.368 40848 ERROR ovsdbapp.event File "/home/zuul/src/opendev.org/openstack/neutron/.tox/dsvm-functional-gate/lib/python3.10/site-packages/ovsdbapp/event.py", line 177, in notify_loop
2023-03-10 17:49:12.368 40848 ERROR ovsdbapp.event match.run(event, row, updates)
2023-03-10 17:49:12.368 40848 ERROR ovsdbapp.event File "/home/zuul/src/opendev.org/openstack/neutron/neutron/agent/ovn/metadata/agent.py", line 110, in run
2023-03-10 17:49:12.368 40848 ERROR ovsdbapp.event self.agent.provision_datapath(row.datapath)
2023-03-10 17:49:12.368 40848 ERROR ovsdbapp.event File "/home/zuul/src/opendev.org/openstack/neutron/neutron/agent/ovn/metadata/agent.py", line 640, in provision_datapath
2023-03-10 17:49:12.368 40848 ERROR ovsdbapp.event ip2.addr.add_multiple(ipv4_cidrs_to_add)
2023-03-10 17:49:12.368 40848 ERROR ovsdbapp.event File "/home/zuul/src/opendev.org/openstack/neutron/neutron/agent/linux/ip_lib.py", line 544, in add_multiple
2023-03-10 17:49:12.368 40848 ERROR ovsdbapp.event add_ip_addresses(cidrs, self.name, self._parent.namespace, scope,
2023-03-10 17:49:12.368 40848 ERROR ovsdbapp.event File "/home/zuul/src/opendev.org/openstack/neutron/neutron/agent/linux/ip_lib.py", line 848, in add_ip_addresses
2023-03-10 17:49:12.368 40848 ERROR ovsdbapp.event privileged.add_ip_addresses(
2023-03-10 17:49:12.368 40848 ERROR ovsdbapp.event File "/home/zuul/src/opendev.org/openstack/neutron/.tox/dsvm-functional-gate/lib/python3.10/site-packages/oslo_privsep/priv_context.py", line 271, in _wrap
2023-03-10 17:49:12.368 40848 ERROR ovsdbapp.event return self.channel.remote_call(name, args, kwargs,
2023-03-10 17:49:12.368 40848 ERROR ovsdbapp.event File "/home/zuul/src/opendev.org/openstack/neutron/.tox/dsvm-functional-gate/lib/python3.10/site-packages/oslo_privsep/daemon.py", line 215, in remote_call
2023-03-10 17:49:12.368 40848 ERROR ovsdbapp.event raise exc_type(*result[2])
2023-03-10 17:49:12.368 40848 ERROR ovsdbapp.event neutron.privileged.agent.linux.ip_lib.NetworkInterfaceNotFound: Network interface tapovn-f376a61 not found in namespace ovnmeta-ovn-f376a6ca-6f5b-4fa9-9fa9-d5f450bb801b.
2023-03-10 17:49:12.368 40848 ERROR ovsdbapp.event

The thing is, this line of code [2] should actually be creating the new namespace, so I am not sure why it is complaining that the interface was not found in that namespace. I suspect there is a race condition or, more likely, test interference caused by test runner concurrency.
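
As an aid for reading the log above: the namespace and VETH names appear to be derived from the datapath name. Below is a minimal sketch of that naming, inferred only from the log lines in this report. The helper names, the prefixes, and the 10-character truncation are assumptions for illustration, not the agent's actual code.

```python
NS_PREFIX = "ovnmeta-"   # assumed from the namespace name seen in the log
TAP_PREFIX = "tap"       # assumed from the VETH names seen in the log


def namespace_name(datapath):
    """Hypothetical helper: namespace name for a datapath."""
    return NS_PREFIX + datapath


def veth_names(datapath):
    """Hypothetical helper: the two ends of the VETH pair for a datapath."""
    return [f"{TAP_PREFIX}{datapath[:10]}{i}" for i in (0, 1)]


datapath = "ovn-f376a6ca-6f5b-4fa9-9fa9-d5f450bb801b"
print(namespace_name(datapath))
print(veth_names(datapath))
```

Under these assumptions, the second VETH name is the device the traceback reports as missing from the namespace.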

[0] https://zuul.opendev.org/t/openstack/build/5d1910037db844e88cf9ef694068cf17
[1] https://review.opendev.org/c/openstack/neutron/+/875586
[2] https://github.com/openstack/neutron/blob/master/neutron/agent/ovn/metadata/agent.py#LL610-L611

Miro Tomaska (mtomaska)
Changed in neutron:
assignee: nobody → Miro Tomaska (mtomaska)
description: updated
Changed in neutron:
status: New → Confirmed
importance: Undecided → Medium
Miro Tomaska (mtomaska) wrote :

OK, so it appears the agent itself is destroying the namespace when the TestMetadataAgent class tests are run concurrently. The agent start process runs the sync() function, which destroys any ovnmeta-* namespaces that are not used by datapaths on the chassis belonging to that agent instance. Since each test generates its own datapath and chassis UUID, concurrent agent starts destroy each other's namespaces. In other words, multiple agent test instances are operating on the same set of ovnmeta-* namespaces. This is why running these tests with --concurrency 1 always makes them pass. This was not a problem when this test originally existed, but 4 months ago we introduced a change [1] that alters the order in which namespaces are cleaned up. If this test had existed when patch [1] went in, it would have started failing the same way.
So this is really a test problem at this point; the agent code is good. I just need to figure out the best way to deal with this. One obvious way is to run this class with --concurrency 1, but I would prefer a better solution if possible.

[1] https://review.opendev.org/c/openstack/neutron/+/864777/2/neutron/agent/ovn/metadata/agent.py#333
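
The interference described in this comment can be sketched with a toy model. This is not the agent's code: FakeHost, FakeAgent, and the datapath names are hypothetical stand-ins, and sync() here only imitates the described behaviour of removing ovnmeta-* namespaces that the calling agent does not own.

```python
NS_PREFIX = "ovnmeta-"


class FakeHost:
    """Stands in for the test node: one set of namespaces shared by all agents."""

    def __init__(self):
        self.namespaces = set()


class FakeAgent:
    """Toy agent that only knows about its own datapaths."""

    def __init__(self, host, datapaths):
        self.host = host
        self.datapaths = set(datapaths)

    def provision(self, datapath):
        # Create the namespace for one of this agent's datapaths.
        self.host.namespaces.add(NS_PREFIX + datapath)

    def sync(self):
        # Remove every ovnmeta-* namespace this agent does not expect,
        # including ones that belong to another agent on the same host.
        expected = {NS_PREFIX + dp for dp in self.datapaths}
        stale = {ns for ns in self.host.namespaces
                 if ns.startswith(NS_PREFIX) and ns not in expected}
        self.host.namespaces -= stale
        return stale


host = FakeHost()
agent_a = FakeAgent(host, ["dp-a"])   # test A's agent
agent_b = FakeAgent(host, ["dp-b"])   # test B's agent, started concurrently

agent_a.provision("dp-a")             # A creates its namespace
removed = agent_b.sync()              # B starts and cleans up "unused" namespaces
print(removed)                        # A's namespace is gone before A configures it
```

In this model, running the two agents one after the other (the equivalent of --concurrency 1) avoids the problem only because each test finishes with its namespace before the next agent's sync() runs.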

OpenStack Infra (hudson-openstack) wrote : Fix proposed to neutron (master)

Fix proposed to branch: master
Review: https://review.opendev.org/c/openstack/neutron/+/877535

Changed in neutron:
status: Confirmed → In Progress
OpenStack Infra (hudson-openstack) wrote : Fix merged to neutron (master)

Reviewed: https://review.opendev.org/c/openstack/neutron/+/877535
Committed: https://opendev.org/openstack/neutron/commit/04d3f889efeef74e8eb8d8bf330f1594913b161a
Submitter: "Zuul (22348)"
Branch: master

commit 04d3f889efeef74e8eb8d8bf330f1594913b161a
Author: Miro Tomaska <email address hidden>
Date: Wed Mar 15 12:26:12 2023 -0500

    Fix metadata agent intermittent test failures

    Metadata agent has been experiencing intermittent failures
    mostly because of test conccurency and how the metadata agent
    code assumes its the only process running on the system and
    operating on the ovnmeta-* namespaces. See comment#1 the
    linked bug for more details. Although I dont like forcing
    --concurrency 1 for this test class, I think that is going
    to be the best solution and any new tests that will be added
    in the future.

    Closes-Bug: #2011377
    Change-Id: Ie7f3b496de6b23be5739fbeba10f53602e8b300d

Changed in neutron:
status: In Progress → Fix Released
OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/neutron 23.0.0.0b2

This issue was fixed in the openstack/neutron 23.0.0.0b2 development milestone.
