test_agent_resync_on_non_existing_bridge failing intermittently sp
Affects | Status | Importance | Assigned to | Milestone
---|---|---|---|---
neutron | Fix Released | Medium | Miro Tomaska |
Bug Description
Test neutron.
I just reintroduced this test [1] into the code. The failure does not happen every time, but I can reproduce it locally with --until-failure when running with a concurrency greater than one (big hint) and running the whole TestMetadataAgent class of tests (another hint). Like this:
`tox -e dsvm-functional -- neutron.
When the failure happens following exception is found in the logs
2023-03-10 17:49:11.861 40848 INFO neutron.
2023-03-10 17:49:11.863 40848 INFO neutron.
2023-03-10 17:49:11.917 40848 DEBUG neutron.
2023-03-10 17:49:11.923 41596 DEBUG neutron.
2023-03-10 17:49:12.368 40848 ERROR ovsdbapp.event [-] Unexpected exception in notify_loop: neutron.
2023-03-10 17:49:12.368 40848 ERROR ovsdbapp.event Traceback (most recent call last):
2023-03-10 17:49:12.368 40848 ERROR ovsdbapp.event File "/home/
2023-03-10 17:49:12.368 40848 ERROR ovsdbapp.event match.run(event, row, updates)
2023-03-10 17:49:12.368 40848 ERROR ovsdbapp.event File "/home/
2023-03-10 17:49:12.368 40848 ERROR ovsdbapp.event self.agent.
2023-03-10 17:49:12.368 40848 ERROR ovsdbapp.event File "/home/
2023-03-10 17:49:12.368 40848 ERROR ovsdbapp.event ip2.addr.
2023-03-10 17:49:12.368 40848 ERROR ovsdbapp.event File "/home/
2023-03-10 17:49:12.368 40848 ERROR ovsdbapp.event add_ip_
2023-03-10 17:49:12.368 40848 ERROR ovsdbapp.event File "/home/
2023-03-10 17:49:12.368 40848 ERROR ovsdbapp.event privileged.
2023-03-10 17:49:12.368 40848 ERROR ovsdbapp.event File "/home/
2023-03-10 17:49:12.368 40848 ERROR ovsdbapp.event return self.channel.
2023-03-10 17:49:12.368 40848 ERROR ovsdbapp.event File "/home/
2023-03-10 17:49:12.368 40848 ERROR ovsdbapp.event raise exc_type(
2023-03-10 17:49:12.368 40848 ERROR ovsdbapp.event neutron.
2023-03-10 17:49:12.368 40848 ERROR ovsdbapp.event
The thing is, this line of code should actually be creating the new namespace [2], so I am not sure why it's complaining that the namespace was not found. I suspect a race condition or, more likely, test interference caused by test runner concurrency.
[0] https:/
[1] https:/
[2] https:/
Changed in neutron: | |
assignee: | nobody → Miro Tomaska (mtomaska) |
description: | updated |
Changed in neutron: | |
status: | New → Confirmed |
importance: | Undecided → Medium |
OK, so it appears the agent itself is destroying the namespace when the TestMetadataAgent class tests are run concurrently. The agent start process runs the sync() function, which destroys any ovnmeta-* namespaces that are not used by datapaths on the chassis belonging to that agent instance. Since each test generates its own datapath and chassis UUID, concurrent agent starts destroy each other's namespaces. In other words, multiple agent test instances are operating on the same set of ovnmeta-* namespaces. This is why running these tests with --concurrency 1 always passes. This was not a problem when the test originally existed, but 4 months ago we introduced this change [1], which changes the order in which namespaces are cleaned up. If this test had existed when patch [1] went in, it would have started failing the same way.
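To make the interleaving concrete, here is a toy model of it. This is only a sketch and not the agent code: `ToyAgent`, `NamespaceMissing`, `simulate_interleaving`, and the `dp-a`/`dp-b` datapath names are all made up for illustration, and a plain Python set stands in for the namespaces on the shared test host.

```python
NS_PREFIX = "ovnmeta-"


class NamespaceMissing(Exception):
    """Hypothetical stand-in for the 'namespace not found' error in the log."""


class ToyAgent:
    """Toy agent: knows only its own datapaths, but shares the host."""

    def __init__(self, host_namespaces, datapaths):
        # The same set object is passed to every agent, modelling one host.
        self.host = host_namespaces
        self.wanted = {NS_PREFIX + dp for dp in datapaths}

    def sync(self):
        # Remove every ovnmeta-* namespace this agent does not own,
        # then make sure its own namespaces exist.
        present = {ns for ns in self.host if ns.startswith(NS_PREFIX)}
        stale = present - self.wanted
        self.host.difference_update(stale)
        self.host.update(self.wanted)
        return stale

    def add_ip(self, namespace):
        # Configuring an address only works if the namespace still exists.
        if namespace not in self.host:
            raise NamespaceMissing(namespace)


def simulate_interleaving():
    host = set()
    agent_a = ToyAgent(host, ["dp-a"])  # test worker 1
    agent_b = ToyAgent(host, ["dp-b"])  # test worker 2

    agent_a.sync()            # A creates ovnmeta-dp-a
    removed = agent_b.sync()  # B starts and treats A's namespace as stale
    try:
        agent_a.add_ip(NS_PREFIX + "dp-a")  # A continues provisioning
        failed = False
    except NamespaceMissing:
        failed = True
    return removed, failed
```

With a single worker, the second agent's sync() never runs in between the first agent's steps, which matches the observation that --concurrency 1 always passes.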
So this is really a test problem at this point; the agent code is good. I just need to figure out the best way to deal with it. One obvious way is to just run this class with --concurrency 1, but I would prefer a better solution if possible.
[1] https://review.opendev.org/c/openstack/neutron/+/864777/2/neutron/agent/ovn/metadata/agent.py#333