Fullstack test test_ha_router_restart_agents_no_packet_lost failing
Affects | Status | Importance | Assigned to | Milestone | |
---|---|---|---|---|---|
neutron | Won't Fix | High | LIU Yulong | | |
Bug Description
Found at least 4 times recently:
http://
http://
http://
http://
It looks like some packet loss sometimes occurs during the L3 agent restart, and that causes the test to fail. We need to investigate this.
Hongbin Lu (hongbin.lu) wrote : | #1 |
OpenStack Infra (hudson-openstack) wrote : Related fix proposed to neutron (master) | #2 |
Related fix proposed to branch: master
Review: https:/
OpenStack Infra (hudson-openstack) wrote : Related fix merged to neutron (master) | #3 |
Reviewed: https:/
Committed: https:/
Submitter: Zuul
Branch: master
commit c1407db676b6e61
Author: Slawek Kaplonski <email address hidden>
Date: Sun Nov 25 17:39:12 2018 +0100
Store journal log from host in fullstack's job results
Change-Id: Ibd16e111927d4c
Related-Bug: #1798475
tags: | added: neutron-proactive-backport-potential |
Slawek Kaplonski (slaweq) wrote : | #4 |
I tried to work through one example to understand why this failover sometimes happens.
I based this on the test result http://
Two „hosts”: host-3f3dad1b and host-6d630618
Router id: 3d3c2c83-
First time router was created:
* host-3f3dad1b was backup, router transitioned to backup at 03:37:35.482
http://
* host-6d630618 was active, router transitioned first to backup at 03:37:32.245
http://
and later transitioned to master at 03:37:47.489
http://
Restarts of agents:
* First restart of backup agent (host-3f3dad1b) at 03:37:50.546
http://
Pinging the gateway IP address from the external VM for 1 minute works fine,
* A new process is started on this host and the router is transitioned to backup at 03:37:59.021:
http://
* Then restart of master agent happens at 03:38:50.909
http://
Router is then transitioned to active on host-3f3dad1b at 03:39:02.322
http://
And it is transitioned to backup at 03:39:03.522:
http://
On this host it is also transitioned to backup once again at 03:39:19.314
http://
And finally it is transitioned back to active on host host-6d630618 at 03:39:36.339
http://
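The timings in the list above can be turned into intervals with a short calculation. The following is a minimal sketch that uses only the timestamps quoted in this comment; the event labels and the `seconds_between` helper are illustrative names, not part of neutron.

```python
from datetime import datetime

FMT = "%H:%M:%S.%f"

# Timestamps copied from the timeline above; the labels are illustrative.
events = [
    ("master agent restarted", "03:38:50.909"),
    ("router active on host-3f3dad1b", "03:39:02.322"),
    ("router transitioned to backup", "03:39:03.522"),
    ("router transitioned to backup again", "03:39:19.314"),
    ("router active on host-6d630618", "03:39:36.339"),
]


def seconds_between(start, end):
    """Return the elapsed seconds between two HH:MM:SS.mmm strings."""
    delta = datetime.strptime(end, FMT) - datetime.strptime(start, FMT)
    return round(delta.total_seconds(), 3)


# Pair each event with the one that follows it and compute the gap.
intervals = [
    (prev[0], cur[0], seconds_between(prev[1], cur[1]))
    for prev, cur in zip(events, events[1:])
]

for before, after, secs in intervals:
    print(f"{before} -> {after}: {secs:.3f} s")
```

With these timestamps, about 11.4 s pass between the master agent restart and the first failover, and about 45.4 s pass before the router is active again on host-6d630618.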
Changed in neutron: | |
assignee: | nobody → LIU Yulong (dragon889) |
OpenStack Infra (hudson-openstack) wrote : Fix proposed to neutron (master) | #5 |
Fix proposed to branch: master
Review: https:/
Changed in neutron: | |
status: | Confirmed → In Progress |
OpenStack Infra (hudson-openstack) wrote : Related fix proposed to neutron (master) | #6 |
Related fix proposed to branch: master
Review: https:/
OpenStack Infra (hudson-openstack) wrote : | #7 |
Related fix proposed to branch: master
Review: https:/
LIU Yulong (dragon889) wrote : | #8 |
It does not seem to have been hit very often during the last 15 days:
http://
26 times in total:
https:/
https:/
https:/
https:/
https:/
Rechecked 10 times; this test case passed every time:
https:/
Recheck with nothing changed, still ongoing; it still does not seem to hit the failure:
https:/
still investigating...
LIU Yulong (dragon889) wrote : | #9 |
Recheck with nothing changed:
https:/
It has also passed 10+ times now:
http://
Slawek Kaplonski (slaweq) wrote : | #10 |
@LIU, thanks for the update on this. In comment #8 you wrote "ongoing" for some patches. Can you explain which fix is actually ongoing? E.g. https:/
Slawek Kaplonski (slaweq) wrote : | #11 |
It just happened again in http://
LIU Yulong (dragon889) wrote : | #12 |
2019-01-
2019-01-
2019-01-
2019-01-
2019-01-
2019-01-
2019-01-
2019-01-
2019-01-
2019-01-
2019-01-
2019-01-
2019-01-
2019-01-
2019-01-
2019-01-
2019-01-
LIU Yulong (dragon889) wrote : | #13 |
@Slawek,
Thanks for the post.
http://
This one seems to have hit a new issue I have not seen before.
In the journal.log, I noticed these logs:
http://
Keepalived_vrrp 13885 is master at this time:
Jan 10 04:45:42 ubuntu-
and Keepalived_vrrp 14083 is backup at this time:
Jan 10 04:45:41 ubuntu-
At this time, a state change happened.
Jan 10 04:46:01 ubuntu-
Jan 10 04:46:01 ubuntu-
Jan 10 04:46:01 ubuntu-
This log message, "Received advert with higher priority 50, ours 50", is really interesting; it seems that 'non-preemptive' VRRP does not work as expected.
The keepalived version used in testing is v1.3.9:
Jan 10 04:45:35 ubuntu-
Here are some github issues:
https:/
https:/
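For context on the "Received advert with higher priority 50, ours 50" message: in the VRRP specification (RFC 5798, section 6.4.3), a master that receives an advertisement with the same priority as its own breaks the tie on the sender's primary IP address, and yields if the sender's address is greater. The sketch below models only that rule. Whether keepalived v1.3.9 reports the tie-break case with this exact message is an assumption here, and the function name and the example addresses are hypothetical.

```python
import ipaddress


def master_should_yield(local_priority, local_ip, advert_priority, advert_ip):
    """Model the RFC 5798 rule for a Master that receives an advertisement.

    Returns True when the local Master should transition to Backup.
    """
    if advert_priority > local_priority:
        return True
    if advert_priority == local_priority:
        # Equal priorities: the greater primary address wins the tie.
        return ipaddress.ip_address(advert_ip) > ipaddress.ip_address(local_ip)
    return False


# Two routers with the same priority 50, as in the log message above.
# The addresses are documentation placeholders, not taken from the logs.
print(master_should_yield(50, "192.0.2.1", 50, "192.0.2.2"))
print(master_should_yield(50, "192.0.2.2", 50, "192.0.2.1"))
```

Under this rule, two instances that both believe they are master with equal priority would not stay that way: the one with the lower address steps down, which may be what shows up as a state change in the journal.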
LIU Yulong (dragon889) wrote : | #14 |
This one:
http://
http://
It produced this new output, "forcing new election"...
Jan 06 23:44:40 ubuntu-
Jan 06 23:44:40 ubuntu-
LIU Yulong (dragon889) wrote : | #15 |
Failed cases were seen at:
(1) https:/
(2) https:/
(3) https:/
(4) https:/
(5) https:/
(6) https:/
Let's investigate that re-election and the 'non-preemptive' issue. It looks more like a keepalived problem.
LIU Yulong (dragon889) wrote : | #16 |
So the "forcing new election" and "Received advert with higher priority 50, ours 50" messages can be caused by this:
https:/
It sets the HA port down during the L3 agent restart, which can sometimes unexpectedly cause a VRRP re-election.
Yes, I've tested it without this behavior: https:/
The fullstack test passed 17 times.
LIU Yulong (dragon889) wrote : | #17 |
Another LOG entry:
Jan 12 21:46:21 ubuntu-
And the ovs-agent has a log entry showing that it set the HA port from DOWN to ACTIVE:
http://
OpenStack Infra (hudson-openstack) wrote : Change abandoned on neutron (master) | #18 |
Change abandoned by LIU Yulong (<email address hidden>) on branch: master
Review: https:/
LIU Yulong (dragon889) wrote : | #19 |
The fix starts here:
https:/
OpenStack Infra (hudson-openstack) wrote : | #20 |
Change abandoned by LIU Yulong (<email address hidden>) on branch: master
Review: https:/
Reason: This is not needed now.
OpenStack Infra (hudson-openstack) wrote : | #21 |
Change abandoned by LIU Yulong (<email address hidden>) on branch: master
Review: https:/
Reason: Restore if needed.
OpenStack Infra (hudson-openstack) wrote : Fix merged to neutron (master) | #22 |
Reviewed: https:/
Committed: https:/
Submitter: Zuul
Branch: master
commit 5b7d444b3176dd3
Author: LIU Yulong <email address hidden>
Date: Tue Dec 25 17:45:05 2018 +0800
Not set the HA port down at regular l3-agent restart
If l3-agent was restarted by a regular action, such as config change,
package upgrade, manually service restart etc. We should not set the
HA port down during such scenarios. Unless the physical host was
rebooted, aka the VRRP processes were all terminated.
This patch adds a new RPC call during l3 agent init, it will try to
retrieve the HA router count first. And then compare the VRRP process
(keepalived) count and 'neutron-
with the hosting router count. If the count matches, then that
set HA port to 'DOWN' state action will not be triggered anymore.
Closes-Bug: #1798475
Change-Id: I5e2bb64df0aaab
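The decision described in this commit message can be sketched as a small predicate. This is a minimal illustration under stated assumptions, not the merged neutron code: the function and parameter names are hypothetical, and the three counts are assumed to have been gathered already (the HA router count through the new RPC call, the two process counts from the host).

```python
def should_set_ha_ports_down(ha_router_count, keepalived_count,
                             state_change_monitor_count):
    """Decide whether HA ports should be set DOWN at l3-agent start.

    Hypothetical helper. Returns False when both per-router process counts
    match the number of HA routers hosted by this agent, which the commit
    message treats as a regular agent restart. Any mismatch (for example
    after a host reboot, when the VRRP processes were terminated) returns
    True.
    """
    counts_match = (keepalived_count == ha_router_count
                    and state_change_monitor_count == ha_router_count)
    return not counts_match


# Regular restart: every hosted HA router still has both processes running.
print(should_set_ha_ports_down(3, 3, 3))
# Host reboot: routers are scheduled here but no processes survived.
print(should_set_ha_ports_down(3, 0, 0))
```

Comparing counts rather than individual process IDs keeps the check cheap at agent start, at the cost of not detecting a case where the totals happen to match while belonging to different routers.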
Changed in neutron: | |
status: | In Progress → Fix Released |
OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/neutron 14.0.0.0b2 | #23 |
This issue was fixed in the openstack/neutron 14.0.0.0b2 development milestone.
OpenStack Infra (hudson-openstack) wrote : Fix proposed to neutron (stable/rocky) | #24 |
Fix proposed to branch: stable/rocky
Review: https:/
OpenStack Infra (hudson-openstack) wrote : Fix proposed to neutron (stable/queens) | #25 |
Fix proposed to branch: stable/queens
Review: https:/
OpenStack Infra (hudson-openstack) wrote : Fix proposed to neutron (stable/pike) | #26 |
Fix proposed to branch: stable/pike
Review: https:/
OpenStack Infra (hudson-openstack) wrote : Change abandoned on neutron (stable/rocky) | #27 |
Change abandoned by Slawek Kaplonski (<email address hidden>) on branch: stable/rocky
Review: https:/
Reason: Based on Bernard's comment, I think it will not be possible to backport this. As it isn't very critical IMO, let's abandon this backport for now.
OpenStack Infra (hudson-openstack) wrote : Change abandoned on neutron (stable/pike) | #28 |
Change abandoned by Slawek Kaplonski (<email address hidden>) on branch: stable/pike
Review: https:/
Reason: Based on Bernard's comment in https:/
OpenStack Infra (hudson-openstack) wrote : Change abandoned on neutron (stable/queens) | #29 |
Change abandoned by Slawek Kaplonski (<email address hidden>) on branch: stable/queens
Review: https:/
Reason: Based on Bernard's comment in https:/
Changed in neutron: | |
status: | Fix Released → Confirmed |
OpenStack Infra (hudson-openstack) wrote : Related fix proposed to neutron (master) | #30 |
Related fix proposed to branch: master
Review: https:/
OpenStack Infra (hudson-openstack) wrote : | #31 |
Related fix proposed to branch: master
Review: https:/
OpenStack Infra (hudson-openstack) wrote : | #32 |
Related fix proposed to branch: master
Review: https:/
OpenStack Infra (hudson-openstack) wrote : Related fix merged to neutron (master) | #33 |
Reviewed: https:/
Committed: https:/
Submitter: Zuul
Branch: master
commit 26388a9952dcd18
Author: LIU Yulong <email address hidden>
Date: Thu May 23 15:19:56 2019 +0800
Set neutron-
Then we can count the process correctly.
Related-Bug: #1798475
Change-Id: I9c6651ed192669
LIU Yulong (dragon889) wrote : | #34 |
OpenStack Infra (hudson-openstack) wrote : | #35 |
Reviewed: https:/
Committed: https:/
Submitter: Zuul
Branch: master
commit bc073849b6aba62
Author: Slawek Kaplonski <email address hidden>
Date: Wed May 22 09:52:26 2019 +0200
Mark fullstack test_ha_
This test is still failing quite often and we don't have root cause
yet.
Lets mark it as unstable for now to make our gate more stable for now.
Change-Id: Id7d14b0b399ce7
Related-Bug: #1798475
OpenStack Infra (hudson-openstack) wrote : Related fix proposed to neutron (master) | #36 |
Related fix proposed to branch: master
Review: https:/
OpenStack Infra (hudson-openstack) wrote : Related fix merged to neutron (master) | #37 |
Reviewed: https:/
Committed: https:/
Submitter: Zuul
Branch: master
commit 426a5b283396656
Author: LIU Yulong <email address hidden>
Date: Wed May 22 22:38:40 2019 +0800
Adjust some HA router log
In case router is concurrently deleted, so the HA
state change LOG is not necessary. It sometimes
makes us confusing.
Also print the log for the pid of router
keepalived-
Change-Id: Id57dd787c25499
Related-Bug: #1798475
OpenStack Infra (hudson-openstack) wrote : | #38 |
Reviewed: https:/
Committed: https:/
Submitter: Zuul
Branch: master
commit 8d8ce04ed6e0580
Author: Slawek Kaplonski <email address hidden>
Date: Wed Jul 3 16:04:25 2019 +0200
Mark fullstack test_ha_
Even after we merged [1] which should fix this failing test,
it is still failing quite often and we don't have root cause yet.
Lets (again) mark it as unstable for now to make our gate more
stable for now.
[1] https:/
Change-Id: I8ab51afc154a43
Related-Bug: #1798475
tags: | removed: neutron-proactive-backport-potential |
Lajos Katona (lajos-katona) wrote : | #39 |
I am closing this for now; the test test_ha_
Changed in neutron: | |
status: | Confirmed → Won't Fix |
One more failure: http://logs.openstack.org/72/599572/13/check/neutron-fullstack/92f2194/logs/testr_results.html.gz