L3 agent unable to update HA router state after race between HA router creating and deleting

Bug #1533454 reported by LIU Yulong
This bug affects 2 people
Affects          Status        Importance  Assigned to  Milestone
neutron          Fix Released  Medium      LIU Yulong
neutron (Kilo)   Fix Released  Undecided   Unassigned

Bug Description

The L3 HA router binding process does not take into account that the port it is binding to the agent can be deleted concurrently.

Details:

When the neutron server deletes all the resources of an
HA router, the L3 agents are not immediately aware of it,
so a race can occur in a sequence like this:
1. The neutron server deletes all resources of an HA router.
2. The RPC fanout reaches L3 agent 1, on which the HA router
   was in the master state.
3. On L3 agent 2 the 'backup' router sets itself to master
   and sends the neutron server an HA router state-change notification.
4. PortNotFound is raised in the function that updates HA router states.
(The original DB error no longer seems to occur.)

How do steps 2 and 3 happen?
Suppose L3 agent 2 hosts many more HA routers than L3 agent 1,
or for any other reason receives or processes the deletion RPC
later than L3 agent 1. L3 agent 1 then removes the HA router's
keepalived process, which the backup router on L3 agent 2 quickly
detects via the VRRP protocol. At that point the router-deletion
RPC is still sitting in L3 agent 2's RouterUpdate queue (or the
agent is part way through its own HA router deletion procedure),
and router_info still holds the router. So L3 agent 2 runs the
state-change procedure, i.e. it notifies the neutron server to
update the router state.
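
To make the ordering above concrete, here is a minimal, self-contained Python
sketch of the race. It is illustrative only, not neutron code: the PortNotFound
class, the in-memory ha_ports "database" and the server_* functions are all
stand-ins for the real server-side machinery.

    # Minimal simulation of the race -- illustrative only, not neutron code.

    class PortNotFound(Exception):
        """Stands in for neutron's PortNotFound exception."""

    # Stand-in for the server DB: router id -> HA port bound to each agent.
    ha_ports = {'router-1': {'agent-1': 'port-a', 'agent-2': 'port-b'}}

    def server_delete_router(router_id):
        # Step 1: the server removes every resource of the HA router,
        # including the HA ports bound to both agents.
        ha_ports.pop(router_id, None)

    def server_update_router_states(router_id, host, state):
        # Step 4: the server tries to record the state reported by an agent
        # on the HA port bound to that agent -- but the port is already gone.
        ports = ha_ports.get(router_id)
        if ports is None or host not in ports:
            raise PortNotFound('HA port for %s on %s was deleted' % (router_id, host))
        ports[host] = state

    # Steps 2/3: agent 1 processes the deletion RPC first and kills keepalived;
    # agent 2 has not processed it yet, sees the master disappear via VRRP,
    # promotes itself and reports the transition to the server.
    server_delete_router('router-1')
    try:
        server_update_router_states('router-1', 'agent-2', 'master')
    except PortNotFound as exc:
        print('race hit: %s' % exc)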

LIU Yulong (dragon889)
summary: - L3 agent unable to update HA router state race after between HA router
+ L3 agent unable to update HA router state after race between HA router
creating and deleting
Changed in neutron:
assignee: nobody → LIU Yulong (dragon889)
status: New → In Progress
tags: added: kilo-backport-potential
tags: added: liberty-backport-potential
tags: added: l3-ha
Revision history for this message
Assaf Muller (amuller) wrote :

Can you show an example TRACE?

Revision history for this message
LIU Yulong (dragon889) wrote :

Currently the DB error is not easy to reproduce, but other exceptions show that the race does exist.
After running the rally task create_and_delete_routers 10 times I got this log trace:
http://paste.openstack.org/show/485366/

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to neutron (master)

Fix proposed to branch: master
Review: https://review.openstack.org/275614

Revision history for this message
LIU Yulong (dragon889) wrote :
description: updated
LIU Yulong (dragon889)
description: updated
LIU Yulong (dragon889)
description: updated
description: updated
Changed in neutron:
importance: Undecided → Medium
description: updated
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to neutron (master)

Reviewed: https://review.openstack.org/265685
Committed: https://git.openstack.org/cgit/openstack/neutron/commit/?id=472d84d25cee0694500e583845718a4f377cc75c
Submitter: Jenkins
Branch: master

commit 472d84d25cee0694500e583845718a4f377cc75c
Author: LIU Yulong <email address hidden>
Date: Mon Jan 11 12:02:55 2016 +0800

    Catch PortNotFound after HA router race condition

    When the neutron server deletes all the resources of an
    HA router, the L3 agents are not immediately aware of it,
    so a race can occur in a sequence like this:
    1. The neutron server deletes all resources of an HA router.
    2. The RPC fanout reaches L3 agent 1, on which the HA
       router was in the master state.
    3. On L3 agent 2 the 'backup' router sets itself to master
       and sends the neutron server an HA router state-change
       notification.
    4. PortNotFound is raised while updating the router HA port status.

    How do steps 2 and 3 happen?
    Suppose L3 agent 2 hosts many more HA routers than L3 agent 1,
    or for any other reason receives or processes the deletion RPC
    later than L3 agent 1. L3 agent 1 then removes the HA router's
    keepalived process, which the backup router on L3 agent 2
    quickly detects via the VRRP protocol. At that point the
    router-deletion RPC is still sitting in L3 agent 2's
    RouterUpdate queue (or the agent is part way through its own
    HA router deletion procedure), and router_info still holds
    the router. So L3 agent 2 runs the state-change procedure,
    i.e. it notifies the neutron server to update the router state.

    This patch deals with the race by catching the
    PortNotFound exception on the neutron-server side.

    Change-Id: I34d7347595bfceb8a70685672a6287e1a44ede6b
    Closes-Bug: #1533454
    Related-Bug: #1523780
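
In rough terms, the fix treats PortNotFound as a benign outcome of the race
when the server records the states reported by an agent. The sketch below
shows that shape only; update_routers_states, set_ha_port_state and their
arguments are hypothetical stand-ins, not the actual functions touched by
the patch.

    # Sketch of the fix's approach -- hypothetical helpers, not the real patch.

    class PortNotFound(Exception):
        """Stands in for neutron's PortNotFound exception."""

    def set_ha_port_state(ha_ports, router_id, host, state):
        # Record 'state' on the HA port binding this router to 'host'.
        ports = ha_ports.get(router_id)
        if ports is None or host not in ports:
            raise PortNotFound('HA port for %s on %s was deleted' % (router_id, host))
        ports[host] = state

    def update_routers_states(ha_ports, states, host):
        # Record the states reported by the L3 agent on 'host'.
        for router_id, state in states.items():
            try:
                set_ha_port_state(ha_ports, router_id, host, state)
            except PortNotFound:
                # The router and its HA port were deleted concurrently, so the
                # reported state is stale; skipping it is the correct outcome.
                continue

    # A stale report from agent-2 no longer propagates an error back to the agent.
    update_routers_states({}, {'router-1': 'master'}, 'agent-2')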

Changed in neutron:
status: In Progress → Fix Released
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to neutron (stable/liberty)

Fix proposed to branch: stable/liberty
Review: https://review.openstack.org/285804

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to neutron (stable/kilo)

Fix proposed to branch: stable/kilo
Review: https://review.openstack.org/285805

Revision history for this message
Thierry Carrez (ttx) wrote : Fix included in openstack/neutron 8.0.0.0b3

This issue was fixed in the openstack/neutron 8.0.0.0b3 development milestone.

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to neutron (stable/kilo)

Reviewed: https://review.openstack.org/285805
Committed: https://git.openstack.org/cgit/openstack/neutron/commit/?id=1c9e1fd772f5bfba24251d51b49f89c38eaffe08
Submitter: Jenkins
Branch: stable/kilo

commit 1c9e1fd772f5bfba24251d51b49f89c38eaffe08
Author: LIU Yulong <email address hidden>
Date: Mon Jan 11 12:02:55 2016 +0800

    Catch PortNotFound after HA router race condition

    When the neutron server deletes all the resources of an
    HA router, the L3 agents are not immediately aware of it,
    so a race can occur in a sequence like this:
    1. The neutron server deletes all resources of an HA router.
    2. The RPC fanout reaches L3 agent 1, on which the HA
       router was in the master state.
    3. On L3 agent 2 the 'backup' router sets itself to master
       and sends the neutron server an HA router state-change
       notification.
    4. PortNotFound is raised while updating the router HA port status.

    How do steps 2 and 3 happen?
    Suppose L3 agent 2 hosts many more HA routers than L3 agent 1,
    or for any other reason receives or processes the deletion RPC
    later than L3 agent 1. L3 agent 1 then removes the HA router's
    keepalived process, which the backup router on L3 agent 2
    quickly detects via the VRRP protocol. At that point the
    router-deletion RPC is still sitting in L3 agent 2's
    RouterUpdate queue (or the agent is part way through its own
    HA router deletion procedure), and router_info still holds
    the router. So L3 agent 2 runs the state-change procedure,
    i.e. it notifies the neutron server to update the router state.

    This patch deals with the race by catching the
    PortNotFound exception on the neutron-server side.

    Change-Id: I34d7347595bfceb8a70685672a6287e1a44ede6b
    Closes-Bug: #1533454
    Related-Bug: #1523780
    (cherry picked from commit 472d84d25cee0694500e583845718a4f377cc75c)

tags: added: in-stable-kilo
Assaf Muller (amuller)
tags: removed: kilo-backport-potential
tags: added: in-stable-liberty
removed: liberty-backport-potential
Revision history for this message
Doug Hellmann (doug-hellmann) wrote : Fix included in openstack/neutron 7.0.4

This issue was fixed in the openstack/neutron 7.0.4 release.

Revision history for this message
Doug Hellmann (doug-hellmann) wrote : Fix included in openstack/neutron 2015.1.4

This issue was fixed in the openstack/neutron 2015.1.4 release.

Revision history for this message
OpenStack Infra (hudson-openstack) wrote :

This issue was fixed in the openstack/neutron 2015.1.4 release.
