HA routers not going to be "primary" at all

Bug #1946187 reported by Slawek Kaplonski
Affects: neutron
Status: Confirmed
Importance: High
Assigned to: Unassigned
Milestone: (none)

Bug Description

From time to time in the CI, many tests fail because the router stays in the backup state the whole time and never transitions to primary on the node.

Examples of the failure:
https://3142cc95d58eb8a4ee07-043369ac575bbfe29758366f4ba498a1.ssl.cf1.rackcdn.com/765072/8/check/neutron-tempest-plugin-scenario-openvswitch/499b47d/controller/logs/screen-q-l3.txt

https://6599da62140c9583e14a-cd7f53ffbb0b86c69deae453da021fe8.ssl.cf5.rackcdn.com/811746/4/check/neutron-tempest-plugin-scenario-openvswitch/3cafcd7/testr_results.html

https://zuul.opendev.org/t/openstack/build/75c056464b6f445ebde18c1b07f5bcce

Example stack trace:

Traceback (most recent call last):
  File "/opt/stack/tempest/.tox/tempest/lib/python3.8/site-packages/neutron_tempest_plugin/common/utils.py", line 80, in wait_until_true
    eventlet.sleep(sleep)
  File "/opt/stack/tempest/.tox/tempest/lib/python3.8/site-packages/eventlet/greenthread.py", line 36, in sleep
    hub.switch()
  File "/opt/stack/tempest/.tox/tempest/lib/python3.8/site-packages/eventlet/hubs/hub.py", line 313, in switch
    return self.greenlet.switch()
eventlet.timeout.Timeout: 600 seconds

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/opt/stack/tempest/.tox/tempest/lib/python3.8/site-packages/neutron_tempest_plugin/scenario/test_basic.py", line 35, in test_basic_instance
    self.setup_network_and_server()
  File "/opt/stack/tempest/.tox/tempest/lib/python3.8/site-packages/neutron_tempest_plugin/scenario/base.py", line 281, in setup_network_and_server
    router = self.create_router_by_client(**kwargs)
  File "/opt/stack/tempest/.tox/tempest/lib/python3.8/site-packages/neutron_tempest_plugin/scenario/base.py", line 209, in create_router_by_client
    cls._wait_for_router_ha_active(router['id'])
  File "/opt/stack/tempest/.tox/tempest/lib/python3.8/site-packages/neutron_tempest_plugin/scenario/base.py", line 228, in _wait_for_router_ha_active
    utils.wait_until_true(_router_active_on_l3_agent,
  File "/opt/stack/tempest/.tox/tempest/lib/python3.8/site-packages/neutron_tempest_plugin/common/utils.py", line 84, in wait_until_true
    raise exception
tempest.lib.exceptions.TimeoutException: Request timed out
Details: Router 1c4ce297-5a04-4794-9720-20fdec9ca4e5 is not active on any of the L3 agents
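For readers unfamiliar with the pattern in the traceback: the test polls a predicate (`_router_active_on_l3_agent`) until it returns true, and raises the supplied exception once the timeout (600 seconds here) expires. The sketch below illustrates that pattern with the standard library only. It is not the neutron-tempest-plugin implementation, which the traceback shows is built on eventlet; the `WaitTimeout` class and the `clock`/`sleeper` parameters are assumptions added for illustration and testability.

```python
import time


class WaitTimeout(Exception):
    """Hypothetical default exception; not part of neutron-tempest-plugin."""


def wait_until_true(predicate, timeout=60, sleep=1, exception=None,
                    clock=time.monotonic, sleeper=time.sleep):
    """Poll predicate() until it returns a true value or timeout expires.

    Illustrative stand-in for the helper named in the traceback. The
    clock and sleeper parameters are assumptions, added so the loop can
    be exercised without real waiting.
    """
    deadline = clock() + timeout
    while not predicate():
        if clock() >= deadline:
            # Mirrors the "raise exception" frame in the traceback: the
            # caller decides which error describes the timeout.
            if exception is not None:
                raise exception
            raise WaitTimeout("predicate not true after %s seconds" % timeout)
        sleeper(sleep)
```

In the failing jobs the predicate can never become true, because the router is never reported as active on any L3 agent, so the loop always runs to the deadline and the test fails with the timeout exception.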

Slawek Kaplonski (slaweq) wrote :

I checked the logs from the failed job, and this is in fact a duplicate of the neutron-ovs-agent crash issue https://bugs.launchpad.net/neutron/+bug/1944201
Routers are not transitioned to "primary" because neutron-ovs-agent is dead in such jobs, so the HA ports of the routers are DOWN. There is no neutron-l3-agent issue in this case at all.
I'm closing this bug as a duplicate of https://bugs.launchpad.net/neutron/+bug/1944201; hopefully it will be fixed with the new os-ken version.
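When triaging a similar failure, one quick check is whether the router's HA ports are DOWN, as described in the comment above. The sketch below filters a list of port dictionaries shaped like Neutron API port objects (`device_owner`, `device_id`, `status`). It assumes HA ports carry the device owner `network:router_ha_interface` and the router ID as `device_id`; the helper name `down_ha_ports` and the sample data are hypothetical, not taken from the failed job.

```python
# Device owner assumed for the HA interface ports of an HA router.
HA_INTERFACE_OWNER = "network:router_ha_interface"


def down_ha_ports(ports, router_id):
    """Return the HA ports of router_id whose status is not ACTIVE.

    ports is a list of dicts shaped like Neutron port objects. This is a
    hypothetical triage helper, not part of neutron or its test suites.
    """
    return [
        port for port in ports
        if port.get("device_owner") == HA_INTERFACE_OWNER
        and port.get("device_id") == router_id
        and port.get("status") != "ACTIVE"
    ]


# Made-up sample data for illustration only.
ROUTER_ID = "1c4ce297-5a04-4794-9720-20fdec9ca4e5"
sample_ports = [
    {"id": "port-a", "device_owner": HA_INTERFACE_OWNER,
     "device_id": ROUTER_ID, "status": "DOWN"},
    {"id": "port-b", "device_owner": "network:router_gateway",
     "device_id": ROUTER_ID, "status": "ACTIVE"},
]
stuck = down_ha_ports(sample_ports, ROUTER_ID)
```

If this returns every HA port of the router, the cause is more likely on the L2 agent side (here, a dead neutron-ovs-agent) than in neutron-l3-agent.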
