HA router sometimes goes into standby mode on all controllers

Bug #1823314 reported by Slawek Kaplonski
This bug affects 3 people
Affects: neutron
Status: Fix Released
Importance: Low
Assigned to: Slawek Kaplonski

Bug Description

Sometimes, when two HA routers are created for the same tenant in a very short time, both routers can end up with the same vr_id assigned. Keepalived then treats them as a single application, and only one of the two routers will be active; the other stays standby on every host.
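For context, keepalived identifies a VRRP group by its virtual_router_id on a network segment and elects exactly one master per group. The toy sketch below (illustrative only, not neutron code) shows how two routers sharing one vr_id on the shared HA network collapse into a single group:

# Toy illustration: VRRP elects a single master per vr_id group.
from collections import defaultdict

routers = [("router-1", 1), ("router-2", 1)]  # duplicate vr_id, same HA net

groups = defaultdict(list)
for name, vr_id in routers:
    groups[vr_id].append(name)

for vr_id, members in groups.items():
    active, *standby = members  # only one master elected per group
    print(f"vr_id={vr_id}: active={active}, standby={standby}")
# vr_id=1: active=router-1, standby=['router-2']
# i.e. one of the two routers shows as standby on every agent.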

When I spotted it, it looked like this:

[stack@undercloud-0 ~]$ neutron l3-agent-list-hosting-router router-2
+--------------------------------------+--------------------------+----------------+-------+----------+
| id                                   | host                     | admin_state_up | alive | ha_state |
+--------------------------------------+--------------------------+----------------+-------+----------+
| 0d654b7c-da42-4847-a24f-6d1df804ca3b | controller-1.localdomain | True           | :-)   | standby  |
| 242e1e81-7e4e-466e-8354-a9c46982ff88 | controller-0.localdomain | True           | :-)   | active   |
| 3d241b02-031a-4623-a179-88e1953b3889 | controller-2.localdomain | True           | :-)   | standby  |
+--------------------------------------+--------------------------+----------------+-------+----------+
[stack@undercloud-0 ~]$ neutron l3-agent-list-hosting-router router-1
+--------------------------------------+--------------------------+----------------+-------+----------+
| id                                   | host                     | admin_state_up | alive | ha_state |
+--------------------------------------+--------------------------+----------------+-------+----------+
| 3d241b02-031a-4623-a179-88e1953b3889 | controller-2.localdomain | True           | :-)   | standby  |
| 0d654b7c-da42-4847-a24f-6d1df804ca3b | controller-1.localdomain | True           | :-)   | standby  |
| 242e1e81-7e4e-466e-8354-a9c46982ff88 | controller-0.localdomain | True           | :-)   | standby  |
+--------------------------------------+--------------------------+----------------+-------+----------+

And in the DB it looks like this:

MariaDB [ovs_neutron]> select * from router_extra_attributes;
+--------------------------------------+-------------+----------------+----+----------+-------------------------+
| router_id                            | distributed | service_router | ha | ha_vr_id | availability_zone_hints |
+--------------------------------------+-------------+----------------+----+----------+-------------------------+
| 6ba430d7-2f9d-4e8e-a59f-4d4fb5644a8e | 0           | 0              | 1  | 1        | []                      |
| ace64e85-5f3b-4815-aeae-3b54c75ef5eb | 0           | 0              | 1  | 1        | []                      |
| cd6b61e1-60c9-47da-8866-169ca29ece20 | 1           | 0              | 0  | 0        | []                      |
+--------------------------------------+-------------+----------------+----+----------+-------------------------+
3 rows in set (0.01 sec)

MariaDB [ovs_neutron]> select * from ha_router_vrid_allocations;
+--------------------------------------+-------+
| network_id                           | vr_id |
+--------------------------------------+-------+
| 45aaae94-ce16-412d-bd74-b3812b16ff6f | 1     |
+--------------------------------------+-------+
1 row in set (0.01 sec)

So there is indeed a possible race when two different routers are created in a very short time.
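As a minimal sketch of why a deterministic "first available" strategy races (names and logic are illustrative, not the actual neutron code): two API workers that read the allocation table before either commits its own row compute the same candidate.

# Illustrative sketch only, not neutron's implementation.
VR_ID_RANGE = set(range(1, 255))  # valid keepalived virtual_router_id values

def first_free_vr_id(allocated):
    """Deterministically pick the lowest vr_id not yet allocated."""
    return min(VR_ID_RANGE - set(allocated))

# Two concurrent create-router requests read the same snapshot of the
# allocation table before either one commits its allocation ...
snapshot = set()  # no vr_ids allocated yet on this HA network
vr_id_router_1 = first_free_vr_id(snapshot)
vr_id_router_2 = first_free_vr_id(snapshot)

# ... so both routers end up with vr_id 1 and keepalived merges them.
assert vr_id_router_1 == vr_id_router_2 == 1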

But when I then created another router, it was created properly with a new vr_id, and everything worked fine for it:

[stack@undercloud-0 ~]$ neutron l3-agent-list-hosting-router router-3
+--------------------------------------+--------------------------+----------------+-------+----------+
| id                                   | host                     | admin_state_up | alive | ha_state |
+--------------------------------------+--------------------------+----------------+-------+----------+
| 0d654b7c-da42-4847-a24f-6d1df804ca3b | controller-1.localdomain | True           | :-)   | standby  |
| 242e1e81-7e4e-466e-8354-a9c46982ff88 | controller-0.localdomain | True           | :-)   | active   |
| 3d241b02-031a-4623-a179-88e1953b3889 | controller-2.localdomain | True           | :-)   | standby  |
+--------------------------------------+--------------------------+----------------+-------+----------+

MariaDB [ovs_neutron]> select * from ha_router_vrid_allocations;
+--------------------------------------+-------+
| network_id                           | vr_id |
+--------------------------------------+-------+
| 45aaae94-ce16-412d-bd74-b3812b16ff6f | 1     |
| 45aaae94-ce16-412d-bd74-b3812b16ff6f | 2     |
+--------------------------------------+-------+

I found this bug on an old version based on the Newton release, but from what I saw in https://github.com/openstack/neutron/blob/master/neutron/db/l3_hamode_db.py#L109 this code hasn't changed much, so I think the same issue may also happen on newer releases.

Changed in neutron:
assignee: nobody → Slawek Kaplonski (slaweq)
Changed in neutron:
importance: Undecided → Low
status: New → Confirmed
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix proposed to neutron (master)

Related fix proposed to branch: master
Review: https://review.openstack.org/651495

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix merged to neutron (master)

Reviewed: https://review.openstack.org/651495
Committed: https://git.openstack.org/cgit/openstack/neutron/commit/?id=a8d0f557d504957aeb91f451105cca9eee2d6adb
Submitter: Zuul
Branch: master

commit a8d0f557d504957aeb91f451105cca9eee2d6adb
Author: Slawek Kaplonski <email address hidden>
Date: Wed Apr 10 12:49:49 2019 +0200

    Choose random value for HA routers' vr_id

    HA routers use keepalived and need to have a virtual_router_id
    configured. As routers belonging to the same tenant use the same
    HA network, those values have to be different for each router.

    Before this patch, the value was always taken as the first available
    value from the available_vr_ids range.
    In some (rare) cases, when more than one router is created in parallel
    for the same tenant, those routers could have the same vr_id chosen,
    so keepalived would treat them as a single application and only one
    router would be ACTIVE on one of the L3 agents.

    This patch changes the behaviour so that a random value from the
    available vr_ids is chosen instead of always taking the first one.
    That should mitigate this rare race condition to the point that it is
    (almost) not noticeable for users.

    However, a proper fix should probably be done as some additional
    constraint in the database layer. But such a solution wouldn't be
    possible to backport to stable branches, so I decided to propose this
    easy patch first.

    Change-Id: Idb0ed744e54976dca23593fb2d7317bf77442e65
    Related-Bug: #1823314
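As a rough sketch of the change described in this commit message (the function name is illustrative; the real change lives in neutron/db/l3_hamode_db.py): the allocator draws a random member of the free set instead of always the lowest one, so two racing workers collide only with probability about 1/len(available) instead of every time.

# Illustrative sketch of the mitigation, not the exact neutron code.
import random

VR_ID_RANGE = set(range(1, 255))

def random_free_vr_id(allocated):
    """Pick a random unallocated vr_id rather than the first one."""
    available = VR_ID_RANGE - set(allocated)
    return random.choice(sorted(available))

This does not remove the race, which is why the message calls it a mitigation and still points at a database-layer constraint as the proper fix.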

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix proposed to neutron (stable/stein)

Related fix proposed to branch: stable/stein
Review: https://review.openstack.org/651983

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix proposed to neutron (stable/rocky)

Related fix proposed to branch: stable/rocky
Review: https://review.openstack.org/651984

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix proposed to neutron (stable/queens)

Related fix proposed to branch: stable/queens
Review: https://review.openstack.org/651986

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix proposed to neutron (stable/pike)

Related fix proposed to branch: stable/pike
Review: https://review.openstack.org/651987

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix proposed to neutron (stable/ocata)

Related fix proposed to branch: stable/ocata
Review: https://review.openstack.org/651988

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix merged to neutron (stable/rocky)

Reviewed: https://review.openstack.org/651984
Committed: https://git.openstack.org/cgit/openstack/neutron/commit/?id=ee2ed681c495c4fc5086d761853731b7dc2fd34f
Submitter: Zuul
Branch: stable/rocky

commit ee2ed681c495c4fc5086d761853731b7dc2fd34f
Author: Slawek Kaplonski <email address hidden>
Date: Wed Apr 10 12:49:49 2019 +0200

    Choose random value for HA routers' vr_id

    HA routers use keepalived and need to have a virtual_router_id
    configured. As routers belonging to the same tenant use the same
    HA network, those values have to be different for each router.

    Before this patch, the value was always taken as the first available
    value from the available_vr_ids range.
    In some (rare) cases, when more than one router is created in parallel
    for the same tenant, those routers could have the same vr_id chosen,
    so keepalived would treat them as a single application and only one
    router would be ACTIVE on one of the L3 agents.

    This patch changes the behaviour so that a random value from the
    available vr_ids is chosen instead of always taking the first one.
    That should mitigate this rare race condition to the point that it is
    (almost) not noticeable for users.

    However, a proper fix should probably be done as some additional
    constraint in the database layer. But such a solution wouldn't be
    possible to backport to stable branches, so I decided to propose this
    easy patch first.

    Change-Id: Idb0ed744e54976dca23593fb2d7317bf77442e65
    Related-Bug: #1823314
    (cherry picked from commit a8d0f557d504957aeb91f451105cca9eee2d6adb)

tags: added: in-stable-rocky
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix merged to neutron (stable/queens)

Reviewed: https://review.openstack.org/651986
Committed: https://git.openstack.org/cgit/openstack/neutron/commit/?id=72c9a7ef8416f894a85a36c6b5bbf995e48599d1
Submitter: Zuul
Branch: stable/queens

commit 72c9a7ef8416f894a85a36c6b5bbf995e48599d1
Author: Slawek Kaplonski <email address hidden>
Date: Wed Apr 10 12:49:49 2019 +0200

    Choose random value for HA routers' vr_id

    HA routers use keepalived and need to have a virtual_router_id
    configured. As routers belonging to the same tenant use the same
    HA network, those values have to be different for each router.

    Before this patch, the value was always taken as the first available
    value from the available_vr_ids range.
    In some (rare) cases, when more than one router is created in parallel
    for the same tenant, those routers could have the same vr_id chosen,
    so keepalived would treat them as a single application and only one
    router would be ACTIVE on one of the L3 agents.

    This patch changes the behaviour so that a random value from the
    available vr_ids is chosen instead of always taking the first one.
    That should mitigate this rare race condition to the point that it is
    (almost) not noticeable for users.

    However, a proper fix should probably be done as some additional
    constraint in the database layer. But such a solution wouldn't be
    possible to backport to stable branches, so I decided to propose this
    easy patch first.

    Conflicts:
        neutron/db/l3_hamode_db.py

    Change-Id: Idb0ed744e54976dca23593fb2d7317bf77442e65
    Related-Bug: #1823314
    (cherry picked from commit a8d0f557d504957aeb91f451105cca9eee2d6adb)
    (cherry picked from commit ee2ed681c495c4fc5086d761853731b7dc2fd34f)

tags: added: in-stable-queens
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix merged to neutron (stable/stein)

Reviewed: https://review.openstack.org/651983
Committed: https://git.openstack.org/cgit/openstack/neutron/commit/?id=2aa200bdb994a72e8159f78c4439b81232436942
Submitter: Zuul
Branch: stable/stein

commit 2aa200bdb994a72e8159f78c4439b81232436942
Author: Slawek Kaplonski <email address hidden>
Date: Wed Apr 10 12:49:49 2019 +0200

    Choose random value for HA routers' vr_id

    HA routers use keepalived and need to have a virtual_router_id
    configured. As routers belonging to the same tenant use the same
    HA network, those values have to be different for each router.

    Before this patch, the value was always taken as the first available
    value from the available_vr_ids range.
    In some (rare) cases, when more than one router is created in parallel
    for the same tenant, those routers could have the same vr_id chosen,
    so keepalived would treat them as a single application and only one
    router would be ACTIVE on one of the L3 agents.

    This patch changes the behaviour so that a random value from the
    available vr_ids is chosen instead of always taking the first one.
    That should mitigate this rare race condition to the point that it is
    (almost) not noticeable for users.

    However, a proper fix should probably be done as some additional
    constraint in the database layer. But such a solution wouldn't be
    possible to backport to stable branches, so I decided to propose this
    easy patch first.

    Change-Id: Idb0ed744e54976dca23593fb2d7317bf77442e65
    Related-Bug: #1823314
    (cherry picked from commit a8d0f557d504957aeb91f451105cca9eee2d6adb)

tags: added: in-stable-stein
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix merged to neutron (stable/ocata)

Reviewed: https://review.openstack.org/651988
Committed: https://git.openstack.org/cgit/openstack/neutron/commit/?id=4a45e1adebee9a72b6e8a36c5fd88a1380f81cb2
Submitter: Zuul
Branch: stable/ocata

commit 4a45e1adebee9a72b6e8a36c5fd88a1380f81cb2
Author: Slawek Kaplonski <email address hidden>
Date: Wed Apr 10 12:49:49 2019 +0200

    Choose random value for HA routers' vr_id

    HA routers use keepalived and need to have a virtual_router_id
    configured. As routers belonging to the same tenant use the same
    HA network, those values have to be different for each router.

    Before this patch, the value was always taken as the first available
    value from the available_vr_ids range.
    In some (rare) cases, when more than one router is created in parallel
    for the same tenant, those routers could have the same vr_id chosen,
    so keepalived would treat them as a single application and only one
    router would be ACTIVE on one of the L3 agents.

    This patch changes the behaviour so that a random value from the
    available vr_ids is chosen instead of always taking the first one.
    That should mitigate this rare race condition to the point that it is
    (almost) not noticeable for users.

    However, a proper fix should probably be done as some additional
    constraint in the database layer. But such a solution wouldn't be
    possible to backport to stable branches, so I decided to propose this
    easy patch first.

    Conflicts:
        neutron/db/l3_hamode_db.py

    Change-Id: Idb0ed744e54976dca23593fb2d7317bf77442e65
    Related-Bug: #1823314
    (cherry picked from commit a8d0f557d504957aeb91f451105cca9eee2d6adb)
    (cherry picked from commit ee2ed681c495c4fc5086d761853731b7dc2fd34f)
    (cherry picked from commit 72c9a7ef8416f894a85a36c6b5bbf995e48599d1)

tags: added: in-stable-ocata
Revision history for this message
Slawek Kaplonski (slaweq) wrote :

I think I know what is going on here.

It is a race condition between creating the HA network and assigning a new vr_id to the router.

Let's assume we are creating 2 different routers (the first 2 HA routers for the tenant).
Each request goes to a different controller, and then:
1. Controller-1, as part of the creation of router-1, creates the HA network; let's call it HA-Net-A.
2. For some reason (I'm not sure what the reason was exactly), controller-1 starts to remove HA-Net-A, but
3. at the same time HA-Net-A is found on controller-2 and router-2 tries to use it.
4. Controller-2 allocates vr_id=1 for router-2 on HA-Net-A.
5. HA-Net-A is finally removed on controller-1, so controller-2 also gets an error and retries configuring router-2.
6. Controller-2 creates a new network, HA-Net-B, but it has already allocated vr_id=1 for router-2 (see step 4); that allocation is stored in a different DB table and has nothing to do with the removed network.
7. Controller-1 tries to allocate a vr_id for router-1. As this time it is on HA-Net-B, vr_id=1 is free on that network, so it is allocated.

And finally both routers have vr_id=1 allocated, and only one of them is active, on a single L3 agent.
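The commit message above already hints at the durable fix: an additional constraint in the database layer. Purely as a hedged sketch (the ha_network_id column and the composite foreign key below are assumptions for illustration, not the actual neutron schema), tying each router's ha_vr_id to a live row in the per-network allocation table would prevent step 6: an allocation could not outlive the HA network it was made on, and two routers could never hold the same (network, vr_id) pair.

# Hedged sketch of a database-level guard; table/column names beyond
# those visible in the queries above are illustrative assumptions.
import sqlalchemy as sa
from sqlalchemy.orm import declarative_base

Base = declarative_base()

class HARouterVRIdAllocation(Base):
    __tablename__ = 'ha_router_vrid_allocations'
    # Composite primary key: a vr_id can be handed out only once
    # per HA network.
    network_id = sa.Column(
        sa.String(36),
        sa.ForeignKey('networks.id', ondelete='CASCADE'),
        primary_key=True)
    vr_id = sa.Column(sa.Integer, primary_key=True)

class RouterExtraAttributes(Base):
    __tablename__ = 'router_extra_attributes'
    router_id = sa.Column(sa.String(36), primary_key=True)
    ha_network_id = sa.Column(sa.String(36))  # hypothetical column
    ha_vr_id = sa.Column(sa.Integer)
    # Hypothetical composite FK: the router's (ha_network_id, ha_vr_id)
    # must reference a live allocation row. Deleting the HA network
    # cascades away its allocations and nulls the stale vr_id, so the
    # value cannot leak onto a newly created HA network.
    __table_args__ = (
        sa.ForeignKeyConstraint(
            ['ha_network_id', 'ha_vr_id'],
            ['ha_router_vrid_allocations.network_id',
             'ha_router_vrid_allocations.vr_id'],
            ondelete='SET NULL'),
    )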

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix merged to neutron (stable/pike)

Reviewed: https://review.openstack.org/651987
Committed: https://git.openstack.org/cgit/openstack/neutron/commit/?id=7f7dee73248b561bdefb289573990e8455a666c4
Submitter: Zuul
Branch: stable/pike

commit 7f7dee73248b561bdefb289573990e8455a666c4
Author: Slawek Kaplonski <email address hidden>
Date: Wed Apr 10 12:49:49 2019 +0200

    Choose random value for HA routers' vr_id

    HA routers use keepalived and need to have a virtual_router_id
    configured. As routers belonging to the same tenant use the same
    HA network, those values have to be different for each router.

    Before this patch, the value was always taken as the first available
    value from the available_vr_ids range.
    In some (rare) cases, when more than one router is created in parallel
    for the same tenant, those routers could have the same vr_id chosen,
    so keepalived would treat them as a single application and only one
    router would be ACTIVE on one of the L3 agents.

    This patch changes the behaviour so that a random value from the
    available vr_ids is chosen instead of always taking the first one.
    That should mitigate this rare race condition to the point that it is
    (almost) not noticeable for users.

    However, a proper fix should probably be done as some additional
    constraint in the database layer. But such a solution wouldn't be
    possible to backport to stable branches, so I decided to propose this
    easy patch first.

    Conflicts:
        neutron/db/l3_hamode_db.py

    Change-Id: Idb0ed744e54976dca23593fb2d7317bf77442e65
    Related-Bug: #1823314
    (cherry picked from commit a8d0f557d504957aeb91f451105cca9eee2d6adb)
    (cherry picked from commit ee2ed681c495c4fc5086d761853731b7dc2fd34f)
    (cherry picked from commit 72c9a7ef8416f894a85a36c6b5bbf995e48599d1)

tags: added: in-stable-pike
tags: added: canonical-bootstack
Changed in neutron:
status: Confirmed → Fix Released