Two HA routers in master state during functional test

Bug #1580648 reported by Lubosz Kosnik
This bug affects 4 people
Affects: neutron
Status: Fix Released
Importance: High
Assigned to: Artur Korzeniewski
Milestone: ocata-rc1

Bug Description

Scheduling HA routers ends with two routers in the master state.
The issue was discovered while working on this fix - https://review.openstack.org/#/c/273546 - after preparing a new functional test.

In ha_router.py, the method _get_state_change_monitor_callback() starts a neutron-keepalived-state-change process with the --monitor-interface parameter set to the HA device (ha-xxx) and its IP address.
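
For illustration, a minimal sketch (assumed helper and option names, not the exact neutron code) of what such a callback could look like:

    # Hedged sketch, not the actual neutron implementation: build the command
    # used to spawn neutron-keepalived-state-change for a router's HA device.
    def _get_state_change_monitor_callback(self):
        ha_device = self.get_ha_device_name()    # e.g. 'ha-8aedf0c6-2a'
        ha_cidr = self._get_primary_vip()        # e.g. '169.254.0.1/24'

        def callback(pid_file):
            return ['neutron-keepalived-state-change',
                    '--router_id=%s' % self.router_id,
                    '--namespace=%s' % self.ns_name,
                    '--monitor_interface=%s' % ha_device,
                    '--monitor_cidr=%s' % ha_cidr,
                    '--pid_file=%s' % pid_file]
        return callback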

That application monitors all address changes in that namespace using
"ip netns exec xxx ip -o monitor address"
Each addition of an address on that ha-xxx device produces a call to the neutron-server API saying that this router has become "master".
This produces false results, because that device alone does not tell whether the router is master or not.
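
A rough sketch of the monitoring loop described above (helper names are illustrative); it shows why any address addition on the ha-xxx device ends up being reported as a transition to master:

    # Minimal sketch, assuming hypothetical helpers, of how the state-change
    # monitor interprets 'ip -o monitor address' output.  Any address added on
    # the monitored ha-xxx device is taken as "this node became master", which
    # is the false positive described above.
    import subprocess

    def monitor(namespace, ha_device, notify):
        cmd = ['ip', 'netns', 'exec', namespace, 'ip', '-o', 'monitor', 'address']
        proc = subprocess.Popen(cmd, stdout=subprocess.PIPE,
                                universal_newlines=True)
        for line in proc.stdout:
            # Lines look like: "2: ha-8aedf0c6-2a    inet 169.254.0.1/24 ..."
            # and deletions are prefixed with "Deleted".
            deleted = line.startswith('Deleted')
            fields = line.split()
            device = fields[2] if deleted else fields[1]
            if device == ha_device:
                # An address appearing on the HA device does not by itself
                # prove this router won the VRRP election, yet it is reported
                # as master.
                notify('backup' if deleted else 'master')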

Logs from test_ha_router.L3HATestFailover.test_ha_router_lost_gw_connection

Agent2:
2016-05-10 16:23:20.653 16067 DEBUG neutron.agent.linux.async_process [-] Launching async process [ip netns exec qrouter-962f19e6-f592-49f7-8bc4-add116c0b7a3@agent1@agent2 ip -o monitor address]. start /neutron/neutron/agent/linux/async_process.py:109
2016-05-10 16:23:20.654 16067 DEBUG neutron.agent.linux.utils [-] Running command: ['ip', 'netns', 'exec', 'qrouter-962f19e6-f592-49f7-8bc4-add116c0b7a3@agent1@agent2', 'ip', '-o', 'monitor', 'address'] create_process /neutron/neutron/agent/linux/utils.py:82
2016-05-10 16:23:20.661 16067 DEBUG neutron.agent.l3.keepalived_state_change [-] Monitor: ha-8aedf0c6-2a, 169.254.0.1/24 run /neutron/neutron/agent/l3/keepalived_state_change.py:59
2016-05-10 16:23:20.661 16067 INFO neutron.agent.linux.daemon [-] Process runs with uid/gid: 1000/1000
2016-05-10 16:23:20.767 16067 DEBUG neutron.agent.l3.keepalived_state_change [-] Event: qr-88c93aa9-5a, fe80::c8fe:deff:fead:beef/64, False parse_and_handle_event /neutron/neutron/agent/l3/keepalived_state_change.py:73
2016-05-10 16:23:20.901 16067 DEBUG neutron.agent.l3.keepalived_state_change [-] Event: qg-814d252d-26, fe80::c8fe:deff:fead:beee/64, False parse_and_handle_event /neutron/neutron/agent/l3/keepalived_state_change.py:73
2016-05-10 16:23:21.324 16067 DEBUG neutron.agent.l3.keepalived_state_change [-] Event: ha-8aedf0c6-2a, fe80::2022:22ff:fe22:2222/64, True parse_and_handle_event /neutron/neutron/agent/l3/keepalived_state_change.py:73
2016-05-10 16:23:29.807 16067 DEBUG neutron.agent.l3.keepalived_state_change [-] Event: ha-8aedf0c6-2a, 169.254.0.1/24, True parse_and_handle_event /neutron/neutron/agent/l3/keepalived_state_change.py:73
2016-05-10 16:23:29.808 16067 DEBUG neutron.agent.l3.keepalived_state_change [-] Wrote router 962f19e6-f592-49f7-8bc4-add116c0b7a3 state master write_state_change /neutron/neutron/agent/l3/keepalived_state_change.py:87
2016-05-10 16:23:29.808 16067 DEBUG neutron.agent.l3.keepalived_state_change [-] State: master notify_agent /neutron/neutron/agent/l3/keepalived_state_change.py:93

Agent1:
2016-05-10 16:23:19.417 15906 DEBUG neutron.agent.linux.async_process [-] Launching async process [ip netns exec qrouter-962f19e6-f592-49f7-8bc4-add116c0b7a3@agent1 ip -o monitor address]. start /neutron/neutron/agent/linux/async_process.py:109
2016-05-10 16:23:19.418 15906 DEBUG neutron.agent.linux.utils [-] Running command: ['ip', 'netns', 'exec', 'qrouter-962f19e6-f592-49f7-8bc4-add116c0b7a3@agent1', 'ip', '-o', 'monitor', 'address'] create_process /neutron/neutron/agent/linux/utils.py:82
2016-05-10 16:23:19.425 15906 DEBUG neutron.agent.l3.keepalived_state_change [-] Monitor: ha-22a4d1e0-ad, 169.254.0.1/24 run /neutron/neutron/agent/l3/keepalived_state_change.py:59
2016-05-10 16:23:19.426 15906 INFO neutron.agent.linux.daemon [-] Process runs with uid/gid: 1000/1000
2016-05-10 16:23:19.525 15906 DEBUG neutron.agent.l3.keepalived_state_change [-] Event: qr-88c93aa9-5a, fe80::c8fe:deff:fead:beef/64, False parse_and_handle_event /neutron/neutron/agent/l3/keepalived_state_change.py:73
2016-05-10 16:23:19.645 15906 DEBUG neutron.agent.l3.keepalived_state_change [-] Event: qg-814d252d-26, fe80::c8fe:deff:fead:beee/64, False parse_and_handle_event /neutron/neutron/agent/l3/keepalived_state_change.py:73
2016-05-10 16:23:19.927 15906 DEBUG neutron.agent.l3.keepalived_state_change [-] Event: ha-22a4d1e0-ad, fe80::1034:56ff:fe78:2b5d/64, True parse_and_handle_event /neutron/neutron/agent/l3/keepalived_state_change.py:73
2016-05-10 16:23:28.543 15906 DEBUG neutron.agent.l3.keepalived_state_change [-] Event: ha-22a4d1e0-ad, 169.254.0.1/24, True parse_and_handle_event /neutron/neutron/agent/l3/keepalived_state_change.py:73
2016-05-10 16:23:28.544 15906 DEBUG neutron.agent.l3.keepalived_state_change [-] Wrote router 962f19e6-f592-49f7-8bc4-add116c0b7a3 state master write_state_change /neutron/neutron/agent/l3/keepalived_state_change.py:87
2016-05-10 16:23:28.544 15906 DEBUG neutron.agent.l3.keepalived_state_change [-] State: master notify_agent /neutron/neutron/agent/l3/keepalived_state_change.py:93

Tox logs:
> /neutron/neutron/tests/functional/agent/l3/test_ha_router.py(296)test_ha_router_lost_gw_connection()
-> utils.wait_until_true(lambda: router1.ha_state == 'master')
(Pdb) router1.ha_state, router2.ha_state
('master', 'master')
(Pdb)
('master', 'master')
(Pdb)
('master', 'master')

Lubosz Kosnik (diltram)
Changed in neutron:
assignee: nobody → Lubosz Kosnik (diltram)
Revision history for this message
Assaf Muller (amuller) wrote :

The IP monitor detects changes to the IP address assignment on the HA device because that address is configured as a VIP in keepalived. Whenever keepalived executes a state transition it configures or removes the IP address on the 'ha' device; the IP monitor picks that up and notifies the agent, which then updates the Neutron server.
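
For reference, the VIP mechanism described here corresponds to a keepalived configuration roughly like the following (values are illustrative, not taken from this environment):

    vrrp_instance VR_1 {
        state BACKUP
        interface ha-8aedf0c6-2a
        virtual_router_id 1
        priority 50
        nopreempt
        advert_int 2
        virtual_ipaddress {
            169.254.0.1/24 dev ha-8aedf0c6-2a
        }
    }

When this instance wins the VRRP election it adds 169.254.0.1/24 on the HA device; when it falls back to backup it removes it, which is exactly what the IP monitor observes.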

I'm not aware of any bugs of false positives or out of sync issues between the IP monitor and the actual state of keepalived.

This indicates an issue with the test, not the production code.

summary: - Two HA routers in master state
+ Two HA routers in master state during functional test
Changed in neutron:
status: New → Incomplete
Revision history for this message
Lubosz Kosnik (diltram) wrote :

I was able to get that behavior on a multinode setup: I had two routers in the master state.
I'm going to build up that environment and test it. After getting results I will update this bug.

Revision history for this message
Ann Taraday (akamyshnikova) wrote :

On the scale environment I see the issue with two masters after a node recovers from reboot, but I'm not sure whether it has the same root cause: https://bugs.launchpad.net/mos/10.0.x/+bug/1563298.

Revision history for this message
Lubosz Kosnik (diltram) wrote : Re: [Bug 1580648] Re: Two HA routers in master state during functional test

Probably yes. Every time you restart the agent or the keepalived process you get multiple master routers, because the monitor is checking the HA interface which, as Assaf said, is always plugged into the namespace.

Lubosz Kosnik (diltram)
Changed in neutron:
assignee: Lubosz Kosnik (diltram) → nobody
Revision history for this message
Launchpad Janitor (janitor) wrote :

[Expired for neutron because there has been no activity for 60 days.]

Changed in neutron:
status: Incomplete → Expired
Revision history for this message
Gauvain Pocentek (gpocentek) wrote :

I've faced this problem on a production cluster twice in a few weeks, so setting the bug status back to 'confirmed'.

2 L3 agents were 'active' for the routers, and 1 was inactive (3-node setup).

Changed in neutron:
status: Expired → Confirmed
Revision history for this message
Brian Haley (brian-haley) wrote :

I think this can be closed as a duplicate of a couple of other bugs, like:

https://bugs.launchpad.net/neutron/+bug/1602320

And a patch merged to master the other day for it:

https://review.openstack.org/#/c/366493/

That was also just merged to stable/mitaka and fixed the 2 Active L3-agent issue for me.

Revision history for this message
Lubosz Kosnik (diltram) wrote :

This bug is about L3 HA without DVR, so we cannot merge them.

Revision history for this message
Ann Taraday (akamyshnikova) wrote :

This problem still reproduces, but I consider it a keepalived limitation. There is a comment, https://bugs.launchpad.net/neutron/+bug/1597461/comments/16, with some details on the bug I mentioned above.

Revision history for this message
Hirofumi Ichihara (ichihara-hirofumi) wrote :

It seems to be a keepalived limitation, as Ann said.

Changed in neutron:
status: Confirmed → Opinion
Revision history for this message
Ann Taraday (akamyshnikova) wrote :

I tried with the latest keepalived, v1.2.24; the issue still reproduces, and changing the HA priority for one of the routers also does not help.

Revision history for this message
Brian Haley (brian-haley) wrote :

Assigned to John to determine what to do here.

Changed in neutron:
assignee: nobody → John Schwarz (jschwarz)
Revision history for this message
John Schwarz (jschwarz) wrote :

This seems like a bug to me. I understand that it stands as a limitation that keepalived always selects the node with the higher IP to be master, but then I would expect the other nodes to revert to backup. If this isn't the case (as it seems from what Ann and Gauvain write) then this is a bug.
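
As a simplified illustration of the election rule referred to here (real keepalived behaviour also depends on preemption settings and advertisement timers):

    # Simplified, illustrative model of VRRP master election: highest priority
    # wins, and ties are broken by the highest primary IP address.
    def elect_master(peers):
        # peers: (priority, primary_ip) tuples of nodes that can hear each other
        return max(peers, key=lambda p: (p[0],
                                         tuple(int(o) for o in p[1].split('.'))))

    peers = [(50, '169.254.192.1'), (50, '169.254.192.6')]
    print(elect_master(peers))   # (50, '169.254.192.6') - the higher IP wins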

Reopening.

Changed in neutron:
status: Opinion → Confirmed
importance: Undecided → High
Revision history for this message
John Schwarz (jschwarz) wrote :

I'm unable to reproduce this error. Ann, can you provide some reproduction steps?

I'm using keepalived v1.2.19.

Revision history for this message
Ann Taraday (akamyshnikova) wrote :

The same steps as mentioned in the bug description:
1) Put pdb.set_trace() in https://github.com/openstack/neutron/blob/master/neutron/tests/functional/agent/l3/test_ha_router.py#L324
2) Run tox -e dsvm-functional -- test_ha_router_failover
3) In pdb check router1.ha_state and router2.ha_state; they were both master.

And now it does not reproduce for me either.
(Pdb) router1.ha_state, router2.ha_state
('master', 'backup')

Revision history for this message
John Schwarz (jschwarz) wrote :

Since this doesn't reproduce anymore, I'm closing this bug.

If someone happens to run into this again, please provide:
- Current code version (package version or githash)
- Keepalived version
- Reproduction steps

Changed in neutron:
status: Confirmed → Incomplete
Changed in neutron:
assignee: John Schwarz (jschwarz) → Artur Korzeniewski (artur-korzeniewski)
status: Incomplete → In Progress
Revision history for this message
Artur Korzeniewski (artur-korzeniewski) wrote :

I was able to reproduce it:

- Code version: current master, tested on hash c08766db460ec4808ca16d6e9536c71365dc61eb
- Keepalived version: 1:1.2.23~ubuntu14.04.1
- Reproduction steps:
1. Put pdb.set_trace() in the test_ha_router_failover code, line 342: https://github.com/openstack/neutron/blob/master/neutron/tests/functional/agent/l3/test_ha_router.py#L342
2. run the test with: nosetests -s neutron.tests.functional.agent.l3.test_ha_router:L3HATestFailover.test_ha_router_failover
3. Wait 10-20 seconds, then check ha_state in pdb:
> /opt/stack/neutron/neutron/tests/functional/agent/l3/test_ha_router.py(343)test_ha_router_failover()
-> common_utils.wait_until_true(lambda: router1.ha_state == 'master')
(Pdb) router1.ha_state, router2.ha_state
('master', 'master')
(Pdb) c
F
======================================================================
FAIL: neutron.tests.functional.agent.l3.test_ha_router.L3HATestFailover.test_ha_router_failover
----------------------------------------------------------------------
_StringException: Traceback (most recent call last):
  File "/opt/stack/neutron/neutron/tests/base.py", line 129, in func
    self.fail('Execution of this test timed out: %s' % e)
  File "/usr/local/lib/python2.7/dist-packages/unittest2/case.py", line 690, in fail
    raise self.failureException(msg)
AssertionError: Execution of this test timed out: 60 seconds

I was able to fix it in: https://review.openstack.org/#/c/273546/37/neutron/tests/functional/agent/l3/test_ha_router.py@320

by bringing the veth pair interfaces up:
veth1.link.set_up()
veth2.link.set_up()

I could see that the router namespaces could not reach each other, so keepalived had no connectivity on the HA addresses. In the regular test case router2 started in 'backup' state, so the test was fast enough to positively validate that router1 was master and router2 was backup, but after a short while router2 also became master, leaving both routers as master. I guess the test was passing by pure luck.
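
For context, a sketch of what that test-side fix amounts to (device and namespace names are illustrative, not the exact test code):

    # Hedged sketch of the test fix: connect the two simulated agent namespaces
    # with a veth pair and bring both ends up, so the keepalived instances can
    # actually exchange VRRP advertisements.
    from neutron.agent.linux import ip_lib

    def connect_agent_namespaces(ns1, ns2):
        ip_wrapper = ip_lib.IPWrapper(namespace=ns1)
        # add_veth creates one end in ns1 and its peer in ns2
        veth1, veth2 = ip_wrapper.add_veth('ha-port-1', 'ha-port-2',
                                           namespace2=ns2)
        # Without set_up() the links stay DOWN, no VRRP traffic flows, and each
        # keepalived promotes itself to master.
        veth1.link.set_up()
        veth2.link.set_up()
        return veth1, veth2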

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to neutron (master)

Fix proposed to branch: master
Review: https://review.openstack.org/420693

tags: added: newton-backport-potential
tags: added: ocata-rc-potential
Changed in neutron:
milestone: none → ocata-rc1
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to neutron (master)

Reviewed: https://review.openstack.org/420693
Committed: https://git.openstack.org/cgit/openstack/neutron/commit/?id=8d3f216e2421a01b54a4049c639bdb803df72510
Submitter: Jenkins
Branch: master

commit 8d3f216e2421a01b54a4049c639bdb803df72510
Author: Artur Korzeniewski <email address hidden>
Date: Fri Jan 27 11:19:16 2017 +0100

    Addressing L3 HA keepalived failures in functional tests

    Current testing of Keepalived was not configuring the connectivity
    between 2 agent namespaces.
    Added setting up the veth pair.

    Also, bridges external qg-<id> and internal qr-<id> were removed
    from agent1 namespace and moved to agent2 namespace, because they had
    the same name.
    Added patching the qg and qr bridges name creation to be different for
    functional tests.

    Change-Id: I82b3218091da4feb39a9e820d0e54639ae27c97d
    Closes-Bug: #1580648
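
A rough sketch of the second part of that change - giving each simulated agent distinct qr-/qg- device names - with assumed patch targets (the attributes patched by the merged test may differ):

    # Illustrative only: the patched methods below are assumptions.  The idea
    # is that each simulated agent derives its own device names, so routers
    # hosted by two "agents" on one machine cannot plug identically named
    # qr-/qg- devices.
    import mock
    from neutron.agent.l3 import router_info

    def patch_device_names(test, agent_id):
        # Linux interface names are limited to 15 characters, so replace the
        # standard 'qr-'/'qg-' prefixes instead of appending a suffix.
        internal = mock.patch.object(
            router_info.RouterInfo, 'get_internal_device_name',
            lambda self, port_id: ('r%d-%s' % (agent_id, port_id))[:14])
        external = mock.patch.object(
            router_info.RouterInfo, 'get_external_device_name',
            lambda self, port_id: ('g%d-%s' % (agent_id, port_id))[:14])
        for patcher in (internal, external):
            patcher.start()
            test.addCleanup(patcher.stop)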

Changed in neutron:
status: In Progress → Fix Released
tags: added: neutron-proactive-backport-potential
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/neutron 10.0.0.0rc1

This issue was fixed in the openstack/neutron 10.0.0.0rc1 release candidate.

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to neutron (stable/newton)

Fix proposed to branch: stable/newton
Review: https://review.openstack.org/445375

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to neutron (stable/newton)

Reviewed: https://review.openstack.org/445375
Committed: https://git.openstack.org/cgit/openstack/neutron/commit/?id=dd26b010a6191f8870779fb0ba654b6ed5b094e7
Submitter: Jenkins
Branch: stable/newton

commit dd26b010a6191f8870779fb0ba654b6ed5b094e7
Author: Artur Korzeniewski <email address hidden>
Date: Fri Jan 27 11:19:16 2017 +0100

    Addressing L3 HA keepalived failures in functional tests

    Current testing of Keepalived was not configuring the connectivity
    between 2 agent namespaces.
    Added setting up the veth pair.

    Also, bridges external qg-<id> and internal qr-<id> were removed
    from agent1 namespace and moved to agent2 namespace, because they had
    the same name.
    Added patching the qg and qr bridges name creation to be different for
    functional tests.

    Change-Id: I82b3218091da4feb39a9e820d0e54639ae27c97d
    Closes-Bug: #1580648
    (cherry picked from commit 8d3f216e2421a01b54a4049c639bdb803df72510)

tags: added: in-stable-newton
tags: removed: neutron-proactive-backport-potential newton-backport-potential
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/neutron 9.3.0

This issue was fixed in the openstack/neutron 9.3.0 release.
