Fullstack: neutron.tests.fullstack.test_l3_agent.TestHAL3Agent.test_ha_router fails

Bug #1667055 reported by Jakub Libosvar
8
This bug affects 1 person
Affects Status Importance Assigned to Milestone
neutron
Fix Released
High
Ihar Hrachyshka

Bug Description

20 hits in 7 days

logstash query: http://logstash.openstack.org/#dashboard/file/logstash.json?query=message%3A%5C%222%20!%3D%201%3A%20HA%20router%20must%20be%20scheduled%20to%20both%20nodes%5C%22%20AND%20tags%3Aconsole%20AND%20build_branch%3Amaster

Test failure example: http://logs.openstack.org/41/437041/1/check/gate-neutron-dsvm-fullstack-ubuntu-xenial/2fe600c/console.html#_2017-02-22_17_46_50_795749

2017-02-22 17:46:50.893183 | 2017-02-22 17:46:50.892 | Captured traceback:
2017-02-22 17:46:50.896340 | 2017-02-22 17:46:50.895 | ~~~~~~~~~~~~~~~~~~~
2017-02-22 17:46:50.899564 | 2017-02-22 17:46:50.898 | Traceback (most recent call last):
2017-02-22 17:46:50.902799 | 2017-02-22 17:46:50.902 | File "neutron/tests/base.py", line 116, in func
2017-02-22 17:46:50.906044 | 2017-02-22 17:46:50.905 | return f(self, *args, **kwargs)
2017-02-22 17:46:50.909335 | 2017-02-22 17:46:50.908 | File "neutron/tests/fullstack/test_l3_agent.py", line 202, in test_ha_router
2017-02-22 17:46:50.912626 | 2017-02-22 17:46:50.911 | 'HA router must be scheduled to both nodes')
2017-02-22 17:46:50.915910 | 2017-02-22 17:46:50.915 | File "/opt/stack/new/neutron/.tox/dsvm-fullstack/local/lib/python2.7/site-packages/testtools/testcase.py", line 411, in assertEqual
2017-02-22 17:46:50.919219 | 2017-02-22 17:46:50.918 | self.assertThat(observed, matcher, message)
2017-02-22 17:46:50.922394 | 2017-02-22 17:46:50.921 | File "/opt/stack/new/neutron/.tox/dsvm-fullstack/local/lib/python2.7/site-packages/testtools/testcase.py", line 498, in assertThat
2017-02-22 17:46:50.925527 | 2017-02-22 17:46:50.924 | raise mismatch_error
2017-02-22 17:46:50.928784 | 2017-02-22 17:46:50.927 | testtools.matchers._impl.MismatchError: 2 != 1: HA router must be scheduled to both nodes

Changed in neutron:
assignee: nobody → venkata anil (anil-venkata)
status: New → In Progress
Revision history for this message
Ihar Hrachyshka (ihar-hrachyshka) wrote :

8 hits in last 7 days.

Revision history for this message
Ihar Hrachyshka (ihar-hrachyshka) wrote :

What I see in the logs is:

http://logs.openstack.org/94/487194/2/check/gate-neutron-dsvm-fullstack-ubuntu-xenial/e868687/testr_results.html.gz

When router is created, it's first committed to db, then it's scheduled to agents, and finally the result of the create operation is returned.

If we look at request-id of the create operation, we see the following in neutron-server log:

2017-08-28 23:49:25.722 32562 DEBUG neutron.scheduler.l3_agent_scheduler [req-e20d202e-025a-4bb5-9c64-7f4182037d44 - - - - -] Router 12c8a31d-6077-4096-89cb-d64b98bb88c1 is scheduled to L3 agent 2c2ed21e-e788-482d-844d-d11eb6145788 with binding_index 1 bind_router /opt/stack/new/neutron/neutron/scheduler/l3_agent_scheduler.py:225

We don't see this request scheduling the router to another agent.

Though around same time we see another log message that talks about scheduling the router to that other agent:

2017-08-28 23:49:27.274 32565 DEBUG neutron.scheduler.l3_agent_scheduler [req-eb2181a8-2d02-4fec-847d-e7e870d411c3 - - - - -] Router 12c8a31d-6077-4096-89cb-d64b98bb88c1 is scheduled to L3 agent d8edd5b8-ec1e-4cdf-9e35-93f258e65230 with binding_index 2 bind_router /opt/stack/new/neutron/neutron/scheduler/l3_agent_scheduler.py:225

You can notice that the request-id is different. This request came from the affected l3 agent:

2017-08-28 23:49:25.073 677 DEBUG oslo_service.periodic_task [req-eb2181a8-2d02-4fec-847d-e7e870d411c3 - - - - -] Running periodic task L3NATAgentForTest.periodic_sync_routers_task run_periodic_tasks /opt/stack/new/neutron/.tox/dsvm-fullstack/local/lib/python2.7/site-packages/oslo_service/periodic_task.py:215
2017-08-28 23:49:25.073 677 DEBUG neutron.agent.l3.agent [req-eb2181a8-2d02-4fec-847d-e7e870d411c3 - - - - -] Starting fullsync periodic_sync_routers_task periodic_sync_routers_task /opt/stack/new/neutron/neutron/agent/l3/agent.py:569

I suspect that there is a race condition between those two requests (one from api user, another from l3 agent rpc) that results in a binding missing/not seen by the next api request that fetches l3 agents for the router.

It would be nice if someone from HA folks look at the logs.

Changed in neutron:
importance: Undecided → High
Revision history for this message
Ihar Hrachyshka (ihar-hrachyshka) wrote :

Another thought. Could it be that it works as designed, and api users may expect the number of agents scheduled for an HA router changing in short time since its creation?

I see some other test cases, like test_keepalived_multiple_sighups_does_not_forfeit_mastership, do:

        common_utils.wait_until_true(
            lambda:
            len(self.client.list_l3_agent_hosting_routers(
                router['id'])['agents']) == 2,
            timeout=90)

instead. Should we do the same in test_ha_router?

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to neutron (master)

Fix proposed to branch: master
Review: https://review.openstack.org/500185

Changed in neutron:
assignee: venkata anil (anil-venkata) → Ihar Hrachyshka (ihar-hrachyshka)
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to neutron (master)

Reviewed: https://review.openstack.org/500185
Committed: https://git.openstack.org/cgit/openstack/neutron/commit/?id=f998e8d96552d8c0e2d2421a4e2a8583ad2cfffb
Submitter: Jenkins
Branch: master

commit f998e8d96552d8c0e2d2421a4e2a8583ad2cfffb
Author: Ihar Hrachyshka <email address hidden>
Date: Fri Sep 1 13:21:23 2017 -0700

    test_ha_router: wait until two agents are scheduled

    We need to give some time to neutron-server to schedule the router to
    both agents. This reflects what other fullstack test cases do.

    Change-Id: I3bce907262635c76b5444fab480f7157172e77a2
    Closes-Bug: #1667055

Changed in neutron:
status: In Progress → Fix Released
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to neutron (stable/pike)

Fix proposed to branch: stable/pike
Review: https://review.openstack.org/501522

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to neutron (stable/ocata)

Fix proposed to branch: stable/ocata
Review: https://review.openstack.org/501524

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to neutron (stable/newton)

Fix proposed to branch: stable/newton
Review: https://review.openstack.org/501525

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to neutron (stable/pike)

Reviewed: https://review.openstack.org/501522
Committed: https://git.openstack.org/cgit/openstack/neutron/commit/?id=ebb4274838fc18bf5c1d10e87810d56f032785fa
Submitter: Jenkins
Branch: stable/pike

commit ebb4274838fc18bf5c1d10e87810d56f032785fa
Author: Ihar Hrachyshka <email address hidden>
Date: Fri Sep 1 13:21:23 2017 -0700

    test_ha_router: wait until two agents are scheduled

    We need to give some time to neutron-server to schedule the router to
    both agents. This reflects what other fullstack test cases do.

    Change-Id: I3bce907262635c76b5444fab480f7157172e77a2
    Closes-Bug: #1667055
    (cherry picked from commit f998e8d96552d8c0e2d2421a4e2a8583ad2cfffb)

tags: added: in-stable-pike
tags: added: in-stable-ocata
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to neutron (stable/ocata)

Reviewed: https://review.openstack.org/501524
Committed: https://git.openstack.org/cgit/openstack/neutron/commit/?id=11fd7b35534d11f69d30b9bd13d869837676f9a2
Submitter: Jenkins
Branch: stable/ocata

commit 11fd7b35534d11f69d30b9bd13d869837676f9a2
Author: Ihar Hrachyshka <email address hidden>
Date: Fri Sep 1 13:21:23 2017 -0700

    test_ha_router: wait until two agents are scheduled

    We need to give some time to neutron-server to schedule the router to
    both agents. This reflects what other fullstack test cases do.

    Change-Id: I3bce907262635c76b5444fab480f7157172e77a2
    Closes-Bug: #1667055
    (cherry picked from commit f998e8d96552d8c0e2d2421a4e2a8583ad2cfffb)

tags: added: in-stable-newton
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to neutron (stable/newton)

Reviewed: https://review.openstack.org/501525
Committed: https://git.openstack.org/cgit/openstack/neutron/commit/?id=7b2a0da18caef9aa2878c27ad737f501d498eb4d
Submitter: Jenkins
Branch: stable/newton

commit 7b2a0da18caef9aa2878c27ad737f501d498eb4d
Author: Ihar Hrachyshka <email address hidden>
Date: Fri Sep 1 13:21:23 2017 -0700

    test_ha_router: wait until two agents are scheduled

    We need to give some time to neutron-server to schedule the router to
    both agents. This reflects what other fullstack test cases do.

    Change-Id: I3bce907262635c76b5444fab480f7157172e77a2
    Closes-Bug: #1667055
    (cherry picked from commit f998e8d96552d8c0e2d2421a4e2a8583ad2cfffb)

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/neutron 10.0.4

This issue was fixed in the openstack/neutron 10.0.4 release.

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/neutron 11.0.1

This issue was fixed in the openstack/neutron 11.0.1 release.

tags: added: neutron-proactive-backport-potential
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/neutron 12.0.0.0b1

This issue was fixed in the openstack/neutron 12.0.0.0b1 development milestone.

tags: removed: neutron-proactive-backport-potential
To post a comment you must log in.
This report contains Public information  
Everyone can see this information.

Other bug subscribers

Remote bug watches

Bug watches keep track of this bug in other bug trackers.