Bug #1823314 “ha router sometime goes in standby mode in all con...” : Bugs : neutron

Slawek Kaplonski (slaweq) on 2019-04-05

Changed in neutron:
assignee:	nobody → Slawek Kaplonski (slaweq)

Bernard Cafarelli (bcafarel) on 2019-04-05

Changed in neutron:
importance:	Undecided → Low
status:	New → Confirmed

Revision history for this message

OpenStack Infra (hudson-openstack) wrote on 2019-04-10: Related fix proposed to neutron (master)

#1

Related fix proposed to branch: master
Review: https://review.openstack.org/651495

Revision history for this message

OpenStack Infra (hudson-openstack) wrote on 2019-04-12: Related fix merged to neutron (master)

#2

Reviewed: https://review.openstack.org/651495
Committed: https://git.openstack.org/cgit/openstack/neutron/commit/?id=a8d0f557d504957aeb91f451105cca9eee2d6adb
Submitter: Zuul
Branch: master

commit a8d0f557d504957aeb91f451105cca9eee2d6adb
Author: Slawek Kaplonski <email address hidden>
Date: Wed Apr 10 12:49:49 2019 +0200

Choose random value for HA routes' vr_id

    HA routers are using keepalived and needs to have virtual_router_id
    configured. As routers which belongs to same tenant are using same
    ha network, those values have to be different for each router.

    Before this patch this value was always taken as first available value
    from available_vr_ids range.
    In some (rare) cases, when more than one router is created in parallel
    for same tenant it may happen that those routers would have same vr_id
    choosen so keepalived would treat them as single application and only
    one router would be ACTIVE on one of L3 agents.

    This patch changes this behaviour that now random value from available
    vr_ids will be chosen instead of taking first value always.
    That should mittigate this rare race condition that it will be (almost)
    not noticable for users.

    However, proper fix should be probably done as some additional
    constraint in database layer. But such solution wouldn't be possible to
    backport to stable branches so I decided to propose this easy patch
    first.

Change-Id: Idb0ed744e54976dca23593fb2d7317bf77442e65
Related-Bug: #1823314

Revision history for this message

OpenStack Infra (hudson-openstack) wrote on 2019-04-12: Related fix proposed to neutron (stable/stein)

#3

Related fix proposed to branch: stable/stein
Review: https://review.openstack.org/651983

Revision history for this message

OpenStack Infra (hudson-openstack) wrote on 2019-04-12: Related fix proposed to neutron (stable/rocky)

#4

Related fix proposed to branch: stable/rocky
Review: https://review.openstack.org/651984

Revision history for this message

OpenStack Infra (hudson-openstack) wrote on 2019-04-12: Related fix proposed to neutron (stable/queens)

#5

Related fix proposed to branch: stable/queens
Review: https://review.openstack.org/651986

Revision history for this message

OpenStack Infra (hudson-openstack) wrote on 2019-04-12: Related fix proposed to neutron (stable/pike)

#6

Related fix proposed to branch: stable/pike
Review: https://review.openstack.org/651987

Revision history for this message

OpenStack Infra (hudson-openstack) wrote on 2019-04-12: Related fix proposed to neutron (stable/ocata)

#7

Related fix proposed to branch: stable/ocata
Review: https://review.openstack.org/651988

Revision history for this message

OpenStack Infra (hudson-openstack) wrote on 2019-04-15: Related fix merged to neutron (stable/rocky)

#8

Reviewed: https://review.openstack.org/651984
Committed: https://git.openstack.org/cgit/openstack/neutron/commit/?id=ee2ed681c495c4fc5086d761853731b7dc2fd34f
Submitter: Zuul
Branch: stable/rocky

commit ee2ed681c495c4fc5086d761853731b7dc2fd34f
Author: Slawek Kaplonski <email address hidden>
Date: Wed Apr 10 12:49:49 2019 +0200

Choose random value for HA routes' vr_id

    HA routers are using keepalived and needs to have virtual_router_id
    configured. As routers which belongs to same tenant are using same
    ha network, those values have to be different for each router.

    Before this patch this value was always taken as first available value
    from available_vr_ids range.
    In some (rare) cases, when more than one router is created in parallel
    for same tenant it may happen that those routers would have same vr_id
    choosen so keepalived would treat them as single application and only
    one router would be ACTIVE on one of L3 agents.

    This patch changes this behaviour that now random value from available
    vr_ids will be chosen instead of taking first value always.
    That should mittigate this rare race condition that it will be (almost)
    not noticable for users.

    However, proper fix should be probably done as some additional
    constraint in database layer. But such solution wouldn't be possible to
    backport to stable branches so I decided to propose this easy patch
    first.

    Change-Id: Idb0ed744e54976dca23593fb2d7317bf77442e65
    Related-Bug: #1823314
    (cherry picked from commit a8d0f557d504957aeb91f451105cca9eee2d6adb)

tags:

added: in-stable-rocky

Revision history for this message

OpenStack Infra (hudson-openstack) wrote on 2019-04-16: Related fix merged to neutron (stable/queens)

#9

Reviewed: https://review.openstack.org/651986
Committed: https://git.openstack.org/cgit/openstack/neutron/commit/?id=72c9a7ef8416f894a85a36c6b5bbf995e48599d1
Submitter: Zuul
Branch: stable/queens

commit 72c9a7ef8416f894a85a36c6b5bbf995e48599d1
Author: Slawek Kaplonski <email address hidden>
Date: Wed Apr 10 12:49:49 2019 +0200

Choose random value for HA routes' vr_id

    HA routers are using keepalived and needs to have virtual_router_id
    configured. As routers which belongs to same tenant are using same
    ha network, those values have to be different for each router.

    Before this patch this value was always taken as first available value
    from available_vr_ids range.
    In some (rare) cases, when more than one router is created in parallel
    for same tenant it may happen that those routers would have same vr_id
    choosen so keepalived would treat them as single application and only
    one router would be ACTIVE on one of L3 agents.

    This patch changes this behaviour that now random value from available
    vr_ids will be chosen instead of taking first value always.
    That should mittigate this rare race condition that it will be (almost)
    not noticable for users.

    However, proper fix should be probably done as some additional
    constraint in database layer. But such solution wouldn't be possible to
    backport to stable branches so I decided to propose this easy patch
    first.

Conflicts:
neutron/db/l3_hamode_db.py

    Change-Id: Idb0ed744e54976dca23593fb2d7317bf77442e65
    Related-Bug: #1823314
    (cherry picked from commit a8d0f557d504957aeb91f451105cca9eee2d6adb)
    (cherry picked from commit ee2ed681c495c4fc5086d761853731b7dc2fd34f)

tags:

added: in-stable-queens

Revision history for this message

OpenStack Infra (hudson-openstack) wrote on 2019-04-16: Related fix merged to neutron (stable/stein)

#10

Reviewed: https://review.openstack.org/651983
Committed: https://git.openstack.org/cgit/openstack/neutron/commit/?id=2aa200bdb994a72e8159f78c4439b81232436942
Submitter: Zuul
Branch: stable/stein

commit 2aa200bdb994a72e8159f78c4439b81232436942
Author: Slawek Kaplonski <email address hidden>
Date: Wed Apr 10 12:49:49 2019 +0200

Choose random value for HA routes' vr_id

    HA routers are using keepalived and needs to have virtual_router_id
    configured. As routers which belongs to same tenant are using same
    ha network, those values have to be different for each router.

    Before this patch this value was always taken as first available value
    from available_vr_ids range.
    In some (rare) cases, when more than one router is created in parallel
    for same tenant it may happen that those routers would have same vr_id
    choosen so keepalived would treat them as single application and only
    one router would be ACTIVE on one of L3 agents.

    This patch changes this behaviour that now random value from available
    vr_ids will be chosen instead of taking first value always.
    That should mittigate this rare race condition that it will be (almost)
    not noticable for users.

    However, proper fix should be probably done as some additional
    constraint in database layer. But such solution wouldn't be possible to
    backport to stable branches so I decided to propose this easy patch
    first.

    Change-Id: Idb0ed744e54976dca23593fb2d7317bf77442e65
    Related-Bug: #1823314
    (cherry picked from commit a8d0f557d504957aeb91f451105cca9eee2d6adb)

tags:

added: in-stable-stein

Revision history for this message

OpenStack Infra (hudson-openstack) wrote on 2019-04-17: Related fix merged to neutron (stable/ocata)

#11

Reviewed: https://review.openstack.org/651988
Committed: https://git.openstack.org/cgit/openstack/neutron/commit/?id=4a45e1adebee9a72b6e8a36c5fd88a1380f81cb2
Submitter: Zuul
Branch: stable/ocata

commit 4a45e1adebee9a72b6e8a36c5fd88a1380f81cb2
Author: Slawek Kaplonski <email address hidden>
Date: Wed Apr 10 12:49:49 2019 +0200

Choose random value for HA routes' vr_id

    HA routers are using keepalived and needs to have virtual_router_id
    configured. As routers which belongs to same tenant are using same
    ha network, those values have to be different for each router.

    Before this patch this value was always taken as first available value
    from available_vr_ids range.
    In some (rare) cases, when more than one router is created in parallel
    for same tenant it may happen that those routers would have same vr_id
    choosen so keepalived would treat them as single application and only
    one router would be ACTIVE on one of L3 agents.

    This patch changes this behaviour that now random value from available
    vr_ids will be chosen instead of taking first value always.
    That should mittigate this rare race condition that it will be (almost)
    not noticable for users.

    However, proper fix should be probably done as some additional
    constraint in database layer. But such solution wouldn't be possible to
    backport to stable branches so I decided to propose this easy patch
    first.

Conflicts:
neutron/db/l3_hamode_db.py

    Change-Id: Idb0ed744e54976dca23593fb2d7317bf77442e65
    Related-Bug: #1823314
    (cherry picked from commit a8d0f557d504957aeb91f451105cca9eee2d6adb)
    (cherry picked from commit ee2ed681c495c4fc5086d761853731b7dc2fd34f)
    (cherry picked from commit 72c9a7ef8416f894a85a36c6b5bbf995e48599d1)

tags:

added: in-stable-ocata

Revision history for this message

Slawek Kaplonski (slaweq) wrote on 2019-04-17:

#12

I think that I know what is going on there.

It is race condition with creating HA network and assigning new vr_id to the router.

Lets assume we are creating 2 different routers (first 2 HA routers for tenant).
Each request goes to different controller and now.
1. Controller-1, as part of creation of router-1, creates HA network, lets call it HA-Net-A,
2. For some reason (I'm not sure what the reason was exactly), controller-1 starts to remove HA-Net-A but
3. in same time on controller 2 HA-Net-A was found and router-2 is trying to use it
4. controller-2 allocates vr_id=1 for router-2 on HA-Net-A,
5. HA-Net-A is finally removed on controller-1 so controller-2 also got some error and retries configure router-2,
6. controller-2 creates new network HA-Net-B but it already have allocated vr_id=1 for router-2 (see p.4), it is stored in different table in db and have nothing to do with removed network,
7. controller-1 tries to allocate vr_id for router-1. As it is for HA-Net-B this time, vr_id=1 is free on this network so it is allocated,

And finally both routers got vr_id=1 allocated and only one of them is active on one L3 agent.

Revision history for this message

OpenStack Infra (hudson-openstack) wrote on 2019-04-18: Related fix merged to neutron (stable/pike)

#13

Reviewed: https://review.openstack.org/651987
Committed: https://git.openstack.org/cgit/openstack/neutron/commit/?id=7f7dee73248b561bdefb289573990e8455a666c4
Submitter: Zuul
Branch: stable/pike

commit 7f7dee73248b561bdefb289573990e8455a666c4
Author: Slawek Kaplonski <email address hidden>
Date: Wed Apr 10 12:49:49 2019 +0200

Choose random value for HA routes' vr_id

    HA routers are using keepalived and needs to have virtual_router_id
    configured. As routers which belongs to same tenant are using same
    ha network, those values have to be different for each router.

    Before this patch this value was always taken as first available value
    from available_vr_ids range.
    In some (rare) cases, when more than one router is created in parallel
    for same tenant it may happen that those routers would have same vr_id
    choosen so keepalived would treat them as single application and only
    one router would be ACTIVE on one of L3 agents.

    This patch changes this behaviour that now random value from available
    vr_ids will be chosen instead of taking first value always.
    That should mittigate this rare race condition that it will be (almost)
    not noticable for users.

    However, proper fix should be probably done as some additional
    constraint in database layer. But such solution wouldn't be possible to
    backport to stable branches so I decided to propose this easy patch
    first.

Conflicts:
neutron/db/l3_hamode_db.py

    Change-Id: Idb0ed744e54976dca23593fb2d7317bf77442e65
    Related-Bug: #1823314
    (cherry picked from commit a8d0f557d504957aeb91f451105cca9eee2d6adb)
    (cherry picked from commit ee2ed681c495c4fc5086d761853731b7dc2fd34f)
    (cherry picked from commit 72c9a7ef8416f894a85a36c6b5bbf995e48599d1)

tags:

added: in-stable-pike

Drew Freiberger (afreiberger) on 2019-07-04

tags:

added: canonical-bootstack

Swaminathan Vasudevan (swaminathan-vasudevan) on 2019-07-08

Changed in neutron:
status:	Confirmed → Fix Released

neutron

ha router sometime goes in standby mode in all controllers

Bug Description

Other bug subscribers

Remote bug watches