Fullstack test test_ha_router_restart_agents_no_packet_lost failing

Bug #1798475 reported by Slawek Kaplonski on 2018-10-17
Affects: neutron | Importance: High | Assigned to: LIU Yulong

Reviewed: https://review.openstack.org/619935
Committed: https://git.openstack.org/cgit/openstack/neutron/commit/?id=c1407db676b6e610a8dc37620a60f45610f4249c
Submitter: Zuul
Branch: master

commit c1407db676b6e610a8dc37620a60f45610f4249c
Author: Slawek Kaplonski <email address hidden>
Date: Sun Nov 25 17:39:12 2018 +0100

    Store journal log from host in fullstack's job results

    Change-Id: Ibd16e111927d4c999cd668de9c04fc4c3146f347
    Related-Bug: #1798475

tags: added: neutron-proactive-backport-potential
Slawek Kaplonski (slaweq) wrote :

I was trying to understand, based on one example, why this failover sometimes happens.

My analysis is based on the test result http://logs.openstack.org/09/608909/20/check/neutron-fullstack/c7b6401/logs/testr_results.html.gz

Two "hosts": host-3f3dad1b and host-6d630618

Router id: 3d3c2c83-234a-4b63-bd6b-c450da34a7d2

First time router was created:
* host-3f3dad1b was backup, router transitioned to backup at 03:37:35.482
 http://logs.openstack.org/09/608909/20/check/neutron-fullstack/c7b6401/logs/dsvm-fullstack-logs/TestHAL3Agent.test_ha_router_restart_agents_no_packet_lost/neutron-l3-agent--2018-11-30--03-37-05-357461.txt.gz#_2018-11-30_03_37_35_482

* host-6d630618 was active, router transitioned first to backup at 03:37:32.245
 http://logs.openstack.org/09/608909/20/check/neutron-fullstack/c7b6401/logs/dsvm-fullstack-logs/TestHAL3Agent.test_ha_router_restart_agents_no_packet_lost/neutron-l3-agent--2018-11-30--03-37-05-693500.txt.gz#_2018-11-30_03_37_32_245
 and later transitioned to master at 03:37:47.489
 http://logs.openstack.org/09/608909/20/check/neutron-fullstack/c7b6401/logs/dsvm-fullstack-logs/TestHAL3Agent.test_ha_router_restart_agents_no_packet_lost/neutron-l3-agent--2018-11-30--03-37-05-693500.txt.gz#_2018-11-30_03_37_47_489

Restarts of agents:
* First restart of the backup agent (host-3f3dad1b) at 03:37:50.546
 http://logs.openstack.org/09/608909/20/check/neutron-fullstack/c7b6401/logs/dsvm-fullstack-logs/TestHAL3Agent.test_ha_router_restart_agents_no_packet_lost.txt.gz#_2018-11-30_03_37_50_546
 Pinging the gateway IP address from the external VM for 1 minute works fine.
* A new agent process is started on this host and the router transitioned to backup at 03:37:59.021:
 http://logs.openstack.org/09/608909/20/check/neutron-fullstack/c7b6401/logs/dsvm-fullstack-logs/TestHAL3Agent.test_ha_router_restart_agents_no_packet_lost/neutron-l3-agent--2018-11-30--03-37-50-590709.txt.gz#_2018-11-30_03_37_59_388

* Then the restart of the master agent happens at 03:38:50.909
 http://logs.openstack.org/09/608909/20/check/neutron-fullstack/c7b6401/logs/dsvm-fullstack-logs/TestHAL3Agent.test_ha_router_restart_agents_no_packet_lost.txt.gz#_2018-11-30_03_38_50_909
 The router is then transitioned to active on host-3f3dad1b at 03:39:02.322
 http://logs.openstack.org/09/608909/20/check/neutron-fullstack/c7b6401/logs/dsvm-fullstack-logs/TestHAL3Agent.test_ha_router_restart_agents_no_packet_lost/neutron-l3-agent--2018-11-30--03-37-50-590709.txt.gz#_2018-11-30_03_39_02_322
 And it is transitioned to backup at 03:39:03.522:
 http://logs.openstack.org/09/608909/20/check/neutron-fullstack/c7b6401/logs/dsvm-fullstack-logs/TestHAL3Agent.test_ha_router_restart_agents_no_packet_lost/neutron-l3-agent--2018-11-30--03-38-50-946978.txt.gz#_2018-11-30_03_39_03_522
 On this host it is transitioned to backup once again at 03:39:19.314
 http://logs.openstack.org/09/608909/20/check/neutron-fullstack/c7b6401/logs/dsvm-fullstack-logs/TestHAL3Agent.test_ha_router_restart_agents_no_packet_lost/neutron-l3-agent--2018-11-30--03-38-50-946978.txt.gz#_2018-11-30_03_39_19_314
 And finally it is transitioned back to active on host-6d630618 at 03:39:36.339
 http://logs.openstack.org/09/608909...


Changed in neutron:
assignee: nobody → LIU Yulong (dragon889)

Fix proposed to branch: master
Review: https://review.openstack.org/625054

Changed in neutron:
status: Confirmed → In Progress

Related fix proposed to branch: master
Review: https://review.openstack.org/627546

LIU Yulong (dragon889) wrote :

It does not seem to hit very often during the last 15 days:
http://logstash.openstack.org/#dashboard/file/logstash.json?query=message%3A%5C%22neutron.tests.fullstack.test_l3_agent.TestHAL3Agent.test_ha_router_restart_agents_no_packet_lost%5C%22%20AND%20message%3A%5C%22FAILED%5C%22

26 failures in total:
https://review.openstack.org/#/c/625054 14 times, fix ongoing
https://review.openstack.org/#/c/626830 2 times, fix ongoing
https://review.openstack.org/#/c/626829 2 times, fix ongoing
https://review.openstack.org/#/c/624709 4 times, the code was not yet complete
https://review.openstack.org/#/c/624708 4 times, the code was not yet complete

Rechecked 10 times; this test case passed every time:
https://review.openstack.org/#/c/627285/

Recheck with nothing changed, still ongoing; the failure has not shown up so far:
https://review.openstack.org/#/c/627738/

still investigating...

Slawek Kaplonski (slaweq) wrote :

@LIU, thanks for the update on this. In comment #8 you wrote "fix ongoing" for some patches. Can you explain what fix is actually ongoing? E.g. https://review.openstack.org/#/c/626829 doesn't look related to this issue.

LIU Yulong (dragon889) wrote :

2019-01-10T17:34:36.627+08:00 {0} neutron.tests.fullstack.test_l3_agent.TestHAL3Agent.test_ha_router_restart_agents_no_packet_lost [65.050745s] ... FAILED refs/changes/08/624708/11
2019-01-10T17:34:36.627+08:00 primary | 2019-01-10 09:34:36.627 | {0} neutron.tests.fullstack.test_l3_agent.TestHAL3Agent.test_ha_router_restart_agents_no_packet_lost [65.050745s] ... refs/changes/08/624708/11
2019-01-10T17:30:34.030+08:00 primary | 2019-01-10 09:30:34.030 | {2} neutron.tests.fullstack.test_l3_agent.TestHAL3Agent.test_ha_router_restart_agents_no_packet_lost [64.724999s] ... refs/changes/09/624709/11
2019-01-10T17:30:34.030+08:00 {2} neutron.tests.fullstack.test_l3_agent.TestHAL3Agent.test_ha_router_restart_agents_no_packet_lost [64.724999s] ... FAILED refs/changes/09/624709/11
2019-01-10T12:46:58.747+08:00 primary | 2019-01-10 04:46:58.747 | {3} neutron.tests.fullstack.test_l3_agent.TestHAL3Agent.test_ha_router_restart_agents_no_packet_lost [140.420692s]... refs/changes/57/628057/2
2019-01-10T12:46:58.747+08:00 {3} neutron.tests.fullstack.test_l3_agent.TestHAL3Agent.test_ha_router_restart_agents_no_packet_lost [140.420692s] ... FAILED refs/changes/57/628057/2
2019-01-08T12:18:42.374+08:00 primary | 2019-01-08 04:18:42.373 | {1} neutron.tests.fullstack.test_l3_agent.TestHAL3Agent.test_ha_router_restart_agents_no_packet_lost [66.065260s] ... refs/changes/57/628857/10
2019-01-08T12:18:42.373+08:00 {1} neutron.tests.fullstack.test_l3_agent.TestHAL3Agent.test_ha_router_restart_agents_no_packet_lost [66.065260s] ... FAILED refs/changes/57/628857/10
2019-01-08T10:08:53.057+08:00 primary | 2019-01-08 02:08:53.057 | {1} neutron.tests.fullstack.test_l3_agent.TestHAL3Agent.test_ha_router_restart_agents_no_packet_lost [65.064567s] ... refs/changes/57/628857/8
2019-01-08T10:08:53.057+08:00 {1} neutron.tests.fullstack.test_l3_agent.TestHAL3Agent.test_ha_router_restart_agents_no_packet_lost [65.064567s] ... FAILED refs/changes/57/628857/8
2019-01-07T18:43:09.519+08:00 primary | 2019-01-07 10:43:09.518 | {3} neutron.tests.fullstack.test_l3_agent.TestHAL3Agent.test_ha_router_restart_agents_no_packet_lost [47.367162s] ... refs/changes/57/628857/5
2019-01-07T18:43:09.518+08:00 {3} neutron.tests.fullstack.test_l3_agent.TestHAL3Agent.test_ha_router_restart_agents_no_packet_lost [47.367162s] ... FAILED refs/changes/57/628857/5
2019-01-07T07:45:45.418+08:00 primary | 2019-01-06 23:45:45.417 | {3} neutron.tests.fullstack.test_l3_agent.TestHAL3Agent.test_ha_router_restart_agents_no_packet_lost [211.961294s]... refs/changes/91/628691/8
2019-01-07T07:45:45.417+08:00 {3} neutron.tests.fullstack.test_l3_agent.TestHAL3Agent.test_ha_router_restart_agents_no_packet_lost [211.961294s] ... FAILED refs/changes/91/628691/8
2019-01-06T17:52:02.597+08:00 primary | 2019-01-06 09:52:02.596 | {3} neutron.tests.fullstack.test_l3_agent.TestHAL3Agent.test_ha_router_restart_agents_no_packet_lost [167.310442s]... refs/changes/29/626829/2
2019-01-06T17:52:02.596+08:00 {3} neutron.tests.fullstack.test_l3_agent.TestHAL3Agent.test_ha_router_restart_agents_no_packet_lost [167.310442s] ... FAILED refs/changes/29/626829/2
2019-01-05T15:24:58.539+08:00 {1} neutron.tests.fullstack.test_l3_agent.TestHAL...


LIU Yulong (dragon889) wrote :

@Slawek,
Thanks for the post.
http://logs.openstack.org/57/628057/2/gate/neutron-fullstack/fe4a1c0/logs/testr_results.html.gz
This one seems to hit a new issue I have not seen before.

In the journal.log, I noticed these logs:
http://logs.openstack.org/57/628057/2/gate/neutron-fullstack/fe4a1c0/logs/journal.log
Keepalived_vrrp 13885 is master at this time:
Jan 10 04:45:42 ubuntu-xenial-limestone-regionone-0001611222 Keepalived_vrrp[13885]: VRRP_Instance(VR_1) Transition to MASTER STATE
and Keepalived_vrrp 14083 is backup at this time:
Jan 10 04:45:41 ubuntu-xenial-limestone-regionone-0001611222 Keepalived_vrrp[14083]: VRRP_Instance(VR_1) Entering BACKUP STATE

Then a state change happened:
Jan 10 04:46:01 ubuntu-xenial-limestone-regionone-0001611222 Keepalived_vrrp[14083]: VRRP_Instance(VR_1) Transition to MASTER STATE
Jan 10 04:46:01 ubuntu-xenial-limestone-regionone-0001611222 Keepalived_vrrp[13885]: VRRP_Instance(VR_1) Received advert with higher priority 50, ours 50
Jan 10 04:46:01 ubuntu-xenial-limestone-regionone-0001611222 Keepalived_vrrp[13885]: VRRP_Instance(VR_1) Entering BACKUP STATE

This log line, "Received advert with higher priority 50, ours 50", is really interesting; it seems that 'non-preemptive' VRRP does not work as expected.
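
For context, when two VRRP peers advertise the same priority, the tie is broken by comparing primary IP addresses (per RFC 5798), which is likely why keepalived can report an equal-priority advert as "higher priority 50, ours 50". A minimal sketch of the election rule (illustrative only, not keepalived's actual code):

```python
import ipaddress


def advert_wins(recv_priority, our_priority, sender_ip, our_ip):
    """Does a received VRRP advert beat the local instance?

    Equal priorities fall back to comparing the primary IP
    addresses, as described in the VRRP specification (RFC 5798).
    """
    if recv_priority != our_priority:
        return recv_priority > our_priority
    return ipaddress.ip_address(sender_ip) > ipaddress.ip_address(our_ip)


# With equal priority 50 on both routers, the peer with the larger
# HA-interface address wins the election (addresses are illustrative).
assert advert_wins(50, 50, '169.254.192.5', '169.254.192.4')
assert not advert_wins(50, 50, '169.254.192.4', '169.254.192.5')
```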

The keepalived version under test is v1.3.9:
Jan 10 04:45:35 ubuntu-xenial-limestone-regionone-0001611222 Keepalived[13880]: Starting Keepalived v1.3.9 (10/21,2017)
Here are some related GitHub issues:
https://github.com/acassen/keepalived/issues/512
https://github.com/osixia/docker-keepalived/issues/20
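
For reference, the neutron L3 agent configures keepalived with equal priorities and non-preemption; the generated config looks roughly like this (a simplified sketch; the interface name, VIP and ids are illustrative):

```
vrrp_instance VR_1 {
    state BACKUP
    interface ha-1234abcd
    virtual_router_id 1
    priority 50
    garp_master_delay 60
    nopreempt
    advert_int 2
    track_interface {
        ha-1234abcd
    }
    virtual_ipaddress {
        169.254.0.1/24 dev ha-1234abcd
    }
}
```

Given equal priorities plus `nopreempt`, neither instance should preempt a running master; the logs above suggest an election was nevertheless re-run.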

LIU Yulong (dragon889) wrote :

This one:
http://logs.openstack.org/91/628691/8/check/neutron-fullstack/977b7b2/logs/testr_results.html.gz
http://logs.openstack.org/91/628691/8/check/neutron-fullstack/977b7b2/logs/journal.log

This shows a new log output, "forcing new election"...

Jan 06 23:44:40 ubuntu-xenial-limestone-regionone-0001530518 Keepalived_vrrp[16293]: VRRP_Instance(VR_1) Received advert with lower priority 50, ours 50, forcing new election
Jan 06 23:44:40 ubuntu-xenial-limestone-regionone-0001530518 Keepalived_vrrp[15969]: VRRP_Instance(VR_1) Received advert with higher priority 50, ours 50

LIU Yulong (dragon889) wrote :

http://logstash.openstack.org/#/dashboard/file/logstash.json?query=message:%5C%22neutron.tests.fullstack.test_l3_agent.TestHAL3Agent.test_ha_router_restart_agents_no_packet_lost%5C%22%20AND%20message:%5C%22FAILED%5C%22

Failed cases were seen at:
(1) https://review.openstack.org/#/c/626829, reason: the VRRP process was killed with -9
(2) https://review.openstack.org/#/c/628691, "forcing new election", "Received advert with higher priority 50, ours 50"
(3) https://review.openstack.org/#/c/628057, "Received advert with higher priority 50, ours 50"
(4) https://review.openstack.org/#/c/628857, the code was not yet complete, so all fullstack cases failed
(5) https://review.openstack.org/#/c/624709, the code was not yet complete, so all fullstack cases failed
(6) https://review.openstack.org/#/c/624708, the code was not yet complete, so all fullstack cases failed

Let's investigate that re-election and the 'non-preemptive' issue. It looks more like a keepalived problem.

LIU Yulong (dragon889) wrote :

So, the "forcing new election" / "Received advert with higher priority 50, ours 50" behavior can be caused by this:
https://github.com/openstack/neutron/blob/master/neutron/agent/l3/agent.py#L277-L280
It sets the HA port down during the L3 agent restart, which can sometimes unexpectedly trigger a VRRP re-election.
I've tested it without that behavior: https://review.openstack.org/#/c/627285/.
The fullstack test passed 17 times there.

LIU Yulong (dragon889) wrote :

Another log entry:
Jan 12 21:46:21 ubuntu-xenial-ovh-gra1-0001668446 Keepalived_vrrp[5667]: VRRP_Instance(VR_1) Received advert with higher priority 50, ours 50

And the ovs-agent log shows the ha-port being set from DOWN to ACTIVE:
http://logs.openstack.org/48/609848/7/check/neutron-fullstack/8dc2558/logs/dsvm-fullstack-logs/TestHAL3Agent.test_ha_router_restart_agents_no_packet_lost/neutron-openvswitch-agent--2019-01-12--21-45-15-507088.txt.gz#_2019-01-12_21_46_21_389

Change abandoned by LIU Yulong (<email address hidden>) on branch: master
Review: https://review.openstack.org/625054

LIU Yulong (dragon889) wrote :

Change abandoned by LIU Yulong (<email address hidden>) on branch: master
Review: https://review.openstack.org/627546
Reason: This is not needed now.

Change abandoned by LIU Yulong (<email address hidden>) on branch: master
Review: https://review.openstack.org/626628
Reason: Restore if needed.

Reviewed: https://review.openstack.org/627285
Committed: https://git.openstack.org/cgit/openstack/neutron/commit/?id=5b7d444b3176dd3f8bf166d332781ac93670a51d
Submitter: Zuul
Branch: master

commit 5b7d444b3176dd3f8bf166d332781ac93670a51d
Author: LIU Yulong <email address hidden>
Date: Tue Dec 25 17:45:05 2018 +0800

    Not set the HA port down at regular l3-agent restart

    If the l3-agent is restarted by a regular action, such as a config
    change, package upgrade, or manual service restart, we should not
    set the HA port down in such scenarios, unless the physical host
    was rebooted, i.e. the VRRP processes were all terminated.

    This patch adds a new RPC call during l3 agent init. It first
    retrieves the hosted HA router count, and then compares the VRRP
    process (keepalived) count and the 'neutron-keepalived-state-change'
    process count with it. If the counts match, the action that sets
    the HA port to 'DOWN' state is not triggered anymore.

    Closes-Bug: #1798475
    Change-Id: I5e2bb64df0aaab11a640a798963372c8d91a06a8
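
The check described in this commit message can be sketched as follows (names are illustrative, not the actual neutron code; the real implementation compares full process listings rather than one command line per router):

```python
# Sketch of the restart check: on l3-agent init, compare the number of
# HA routers hosted on this node with the number of surviving helper
# processes. If everything survived, VRRP state is intact and the HA
# ports should NOT be set DOWN (which would force a re-election).

def ha_state_is_preserved(ha_router_count, cmdlines):
    """Return True if every hosted HA router still has both its
    keepalived and neutron-keepalived-state-change processes."""
    # startswith avoids matching 'neutron-keepalived-state-change'
    # when counting keepalived itself.
    keepalived = sum(1 for c in cmdlines if c.startswith('keepalived'))
    state_change = sum(
        1 for c in cmdlines if 'neutron-keepalived-state-change' in c)
    return ha_router_count == keepalived == state_change


# Regular agent restart: both processes per router survived, so the
# agent skips setting the HA port DOWN.
procs = ['keepalived -P -f /var/lib/neutron/ha_confs/r1/keepalived.conf',
         'neutron-keepalived-state-change --router_id=r1']
assert ha_state_is_preserved(1, procs)

# Host reboot: no keepalived processes survive, so the HA port must be
# reset to DOWN and a fresh election takes place.
assert not ha_state_is_preserved(1, [])
```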

Changed in neutron:
status: In Progress → Fix Released

This issue was fixed in the openstack/neutron 14.0.0.0b2 development milestone.

Change abandoned by Slawek Kaplonski (<email address hidden>) on branch: stable/rocky
Review: https://review.opendev.org/653566
Reason: Based on Bernard's comment I think this will not be possible to backport. As it isn't a very critical thing IMO, let's abandon this backport now.

Change abandoned by Slawek Kaplonski (<email address hidden>) on branch: stable/pike
Review: https://review.opendev.org/653584
Reason: Based on Bernard's comment in https://review.opendev.org/#/c/653566/ I think this will not be possible to backport. As it isn't a very critical thing IMO, let's abandon this backport now.

Change abandoned by Slawek Kaplonski (<email address hidden>) on branch: stable/queens
Review: https://review.opendev.org/653569
Reason: Based on Bernard's comment in https://review.opendev.org/#/c/653566/ I think this will not be possible to backport. As it isn't a very critical thing IMO, let's abandon this backport now.

Changed in neutron:
status: Fix Released → Confirmed

Related fix proposed to branch: master
Review: https://review.opendev.org/660743

Related fix proposed to branch: master
Review: https://review.opendev.org/660905

Reviewed: https://review.opendev.org/660905
Committed: https://git.openstack.org/cgit/openstack/neutron/commit/?id=26388a9952dcd186759b7d63a12d64a60a15ceef
Submitter: Zuul
Branch: master

commit 26388a9952dcd186759b7d63a12d64a60a15ceef
Author: LIU Yulong <email address hidden>
Date: Thu May 23 15:19:56 2019 +0800

    Set neutron-keepalived-state-change proctitle

    So that the processes can be counted correctly.

    Related-Bug: #1798475
    Change-Id: I9c6651ed192669b91a4683f5f3bd2795e8d8276a
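
The motivation can be sketched like this (illustrative, not the neutron code): before the patch the state-change daemon's command line was a generic Python interpreter invocation, so counting it by service name missed it; setting the process title (likely via a setproctitle-style mechanism) makes the name-based count from the fix above reliable:

```python
# Counting helper processes by matching their command lines / titles.
def count_state_change_procs(proctitles):
    return sum(
        1 for t in proctitles if 'neutron-keepalived-state-change' in t)


# Before the patch: the daemon's title is just the interpreter and a
# script path (path is illustrative), so the count comes up short.
before = ['python3 /opt/stack/neutron/cmd/state_change.py', 'keepalived -f k.conf']
assert count_state_change_procs(before) == 0   # undercounted

# After the patch: the title carries the service name, so the count
# used by the HA-port check matches reality.
after = ['neutron-keepalived-state-change (router r1)', 'keepalived -f k.conf']
assert count_state_change_procs(after) == 1
```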

Reviewed: https://review.opendev.org/660592
Committed: https://git.openstack.org/cgit/openstack/neutron/commit/?id=bc073849b6aba62a39846b8be98b0da744caab84
Submitter: Zuul
Branch: master

commit bc073849b6aba62a39846b8be98b0da744caab84
Author: Slawek Kaplonski <email address hidden>
Date: Wed May 22 09:52:26 2019 +0200

    Mark fullstack test_ha_router_restart_agents_no_packet_lost as unstable

    This test is still failing quite often and we don't have the root
    cause yet.
    Let's mark it as unstable for now to make our gate more stable.

    Change-Id: Id7d14b0b399ce7efc32c1524ca0f38d0a2f85410
    Related-Bug: #1798475

Reviewed: https://review.opendev.org/660743
Committed: https://git.openstack.org/cgit/openstack/neutron/commit/?id=426a5b2833966569da39f3ef9a38ba2f451da687
Submitter: Zuul
Branch: master

commit 426a5b2833966569da39f3ef9a38ba2f451da687
Author: LIU Yulong <email address hidden>
Date: Wed May 22 22:38:40 2019 +0800

    Adjust some HA router log

    If the router is concurrently deleted, the HA state
    change log is not necessary and is sometimes
    confusing.
    Also log the pid of the router's
    keepalived-state-change child process.

    Change-Id: Id57dd787c254994af967db17647a3a28925714da
    Related-Bug: #1798475

Reviewed: https://review.opendev.org/668914
Committed: https://git.openstack.org/cgit/openstack/neutron/commit/?id=8d8ce04ed6e0580e758cc9f6e89ba347e93dcee6
Submitter: Zuul
Branch: master

commit 8d8ce04ed6e0580e758cc9f6e89ba347e93dcee6
Author: Slawek Kaplonski <email address hidden>
Date: Wed Jul 3 16:04:25 2019 +0200

    Mark fullstack test_ha_router_restart_agents_no_packet_lost as unstable

    Even after we merged [1], which should have fixed this failing
    test, it is still failing quite often and we don't have the root
    cause yet.

    Let's (again) mark it as unstable for now to make our gate more
    stable.

    [1] https://review.opendev.org/#/c/664629/

    Change-Id: I8ab51afc154a4359d9667112d4563faa82f02fcd
    Related-Bug: #1798475
