[L3][HA] race condition between first two router creations when tenant has no HA network

Bug #2016198 reported by LIU Yulong
This bug affects 5 people
Affects: neutron
Status: Fix Released
Importance: Medium
Assigned to: Unassigned
Milestone: (not set)

Bug Description

When a tenant creates its first HA router (router-1), neutron tries to create the HA network for this project.
The HA network creation procedure has 2 steps:
(1) create the HA network
(2) create the subnet for this HA network
If the creation API call for another router (router-2) arrives when only step (1) is done,
the router-2 call retrieves the project HA network, which still has no subnet.

Because of
"[Open Discussion] neutron can create port from network which has no subnet"
https://bugs.launchpad.net/neutron/+bug/2016197,
router-2 creates its HA ports without a fixed IP.

As a result, HA router-2 never comes UP on the L3 agent side.
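
The interleaving described above can be sketched with a toy in-memory model. This is only an illustration of the ordering problem, not Neutron code: every function name, the dictionaries, and the example CIDR are assumptions made up for the sketch.

```python
# Toy model of the race. All names here are hypothetical, not Neutron's API.
networks = {}  # project_id -> network dict


def create_ha_network_step1(project_id):
    """Step (1): the HA network exists, but it has no subnet yet."""
    networks[project_id] = {"project_id": project_id, "subnets": []}
    return networks[project_id]


def create_ha_subnet_step2(network, cidr="169.254.192.0/18"):
    """Step (2): the subnet is attached later, in a separate operation."""
    network["subnets"].append(cidr)


def create_ha_port(router, network):
    """A port on a network without subnets gets no fixed IP."""
    fixed_ips = ["ip-from-%s" % cidr for cidr in network["subnets"]]
    return {"router": router, "fixed_ips": fixed_ips}


# Interleaving of the two API calls, written out sequentially:
net = create_ha_network_step1("project-a")                  # worker 1, step (1)
port2 = create_ha_port("router-2", networks["project-a"])   # worker 2 finds the network
create_ha_subnet_step2(net)                                 # worker 1, step (2)
port1 = create_ha_port("router-1", net)                     # worker 1 creates its port

print("router-2 fixed IPs:", port2["fixed_ips"])
print("router-1 fixed IPs:", port1["fixed_ips"])
```

In this sketch the router-2 port is created in the window between the two steps, so it ends up with an empty fixed IP list while the router-1 port gets an address.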

We have seen this issue many times in our production environment.

LIU Yulong (dragon889)
summary: - [L3][HA] race condition amoning router creations when tenant has no HA
+ [L3][HA] race condition first two router creations when tenant has no HA
network
summary: - [L3][HA] race condition first two router creations when tenant has no HA
- network
+ [L3][HA] race condition between first two router creations when tenant
+ has no HA network
description: updated
Changed in neutron:
importance: Undecided → Medium
Rodolfo Alonso (rodolfo-alonso-hernandez) wrote :

This bug has been discussed during the Neutron drivers meeting, along with LP#2016197.

The proposal to fix LP#2016198 is to limit the creation of the "L3HARouterNetwork" DB resource [1][2]. The HA network is created per project, and only one should exist at a time. If we limit the number of "L3HARouterNetwork" records per project in the DB engine by making "project_id" unique, we'll prevent the race condition.

[1]https://github.com/openstack/neutron/blob/47d070c71e795e41e698cdb278d99dcfb3448bde/neutron/objects/l3_hamode.py#L58
[2]https://github.com/openstack/neutron/blob/master/neutron/db/models/l3ha.py#L62
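
As a rough illustration of the proposal, the sketch below uses the standard-library sqlite3 module to show how a unique "project_id" lets the database reject a second HA network record for the same project. The two-column table layout and the helper name register_ha_network are simplifying assumptions for this example; the real model is the one referenced in [1][2].

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Simplified, hypothetical layout: project_id is the primary key, hence unique.
conn.execute(
    "CREATE TABLE ha_router_networks ("
    " project_id TEXT NOT NULL PRIMARY KEY,"
    " network_id TEXT NOT NULL)"
)


def register_ha_network(project_id, network_id):
    """Return True if the record was stored, False if the project already has one."""
    try:
        with conn:  # commits on success, rolls back on error
            conn.execute(
                "INSERT INTO ha_router_networks VALUES (?, ?)",
                (project_id, network_id),
            )
        return True
    except sqlite3.IntegrityError:
        # The database, not the application code, detected the duplicate.
        return False


first = register_ha_network("project-a", "net-1")
second = register_ha_network("project-a", "net-2")
rows = conn.execute(
    "SELECT project_id, network_id FROM ha_router_networks"
).fetchall()
print(first, second, rows)
```

The point of pushing the check into the database is that it holds even when two workers race: whichever insert lands second fails, regardless of what either worker observed beforehand.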

Rodolfo Alonso (rodolfo-alonso-hernandez) wrote :

Actually checking the code I found this:
* https://bugs.launchpad.net/neutron/+bug/1548285
* https://review.opendev.org/c/openstack/neutron/+/282876

The problem with this code is that two concurrent operations can each create the network and check "L3HARouterNetwork.count()==1" at the same time. I'll try to reproduce it locally.

Rodolfo Alonso (rodolfo-alonso-hernandez) wrote :

OK, now I really understand what the issue is: the network object is created by worker1. At the same time, worker2 checks for the existence of the HA network while the subnet is still being created. When worker2 tries to use the HA network, worker1 has not created the subnet yet.

I'll keep investigating this issue.

OpenStack Infra (hudson-openstack) wrote : Related fix proposed to neutron-lib (master)

Related fix proposed to branch: master
Review: https://review.opendev.org/c/openstack/neutron-lib/+/881734

Changed in neutron:
status: New → In Progress
OpenStack Infra (hudson-openstack) wrote : Fix proposed to neutron (master)

Fix proposed to branch: master
Review: https://review.opendev.org/c/openstack/neutron/+/881735

OpenStack Infra (hudson-openstack) wrote :

Fix proposed to branch: master
Review: https://review.opendev.org/c/openstack/neutron/+/881742

OpenStack Infra (hudson-openstack) wrote : Related fix proposed to neutron-lib (master)

Related fix proposed to branch: master
Review: https://review.opendev.org/c/openstack/neutron-lib/+/881804

OpenStack Infra (hudson-openstack) wrote : Fix proposed to neutron (master)

Fix proposed to branch: master
Review: https://review.opendev.org/c/openstack/neutron/+/881826

OpenStack Infra (hudson-openstack) wrote : Related fix merged to neutron-lib (master)

Reviewed: https://review.opendev.org/c/openstack/neutron-lib/+/881734
Committed: https://opendev.org/openstack/neutron-lib/commit/d5884bb20baf28b094c299bda3c05162848bd1ec
Submitter: "Zuul (22348)"
Branch: master

commit d5884bb20baf28b094c299bda3c05162848bd1ec
Author: Rodolfo Alonso Hernandez <email address hidden>
Date: Thu Apr 27 15:17:16 2023 +0000

    New ``network-ha`` API definition

    This new network API field, that can be used during the network
    creation, will trigger the creation of a ``ha_router_networks``
    database register. This register binds the project with the
    created network and defines it as the high availability network
    of the project, that is unique per project.

    The default value is "False".

    Related-Bug: #2016198

    Change-Id: Id6e434060a7559026f9083904a91213b39361336

OpenStack Infra (hudson-openstack) wrote :

Reviewed: https://review.opendev.org/c/openstack/neutron-lib/+/881804
Committed: https://opendev.org/openstack/neutron-lib/commit/1ccebda1172150b081f6997044fddf06a06ad8be
Submitter: "Zuul (22348)"
Branch: master

commit 1ccebda1172150b081f6997044fddf06a06ad8be
Author: Rodolfo Alonso Hernandez <email address hidden>
Date: Fri Apr 28 10:28:34 2023 +0200

    Introduce "HasProjectPrimaryUniqueKey" class

    This class is used for the database model definitions. It adds the
    "project_id" field as primary key, non nullable and unique. This
    is the only class that defined this field as unique; that implies
    any register created will be unique per project.

    For example, [1] is introducing this database limitation for the
    "ha_router_networks" registers: only one HA network per project
    can be created.

    [1]https://review.opendev.org/c/openstack/neutron/+/881735/

    Change-Id: Id26acbb29a8ddacf4e621e847f2cc05a7ae7f9d6
    Related-Bug: #2016198

OpenStack Infra (hudson-openstack) wrote : Fix merged to neutron (master)

Reviewed: https://review.opendev.org/c/openstack/neutron/+/881735
Committed: https://opendev.org/openstack/neutron/commit/6a2ccfac32c2993e11ee9d958886d786914131e8
Submitter: "Zuul (22348)"
Branch: master

commit 6a2ccfac32c2993e11ee9d958886d786914131e8
Author: Rodolfo Alonso Hernandez <email address hidden>
Date: Thu Apr 27 13:53:47 2023 +0000

    Make "project_id" in "L3HARouterNetwork" unique constraint

    There can be only one HA network per project. This database
    constraint enforces that limit.

    Partial-Bug: #2016198
    Change-Id: Ieb8aac6244d384b0af522f9ba145e9367de2c8ef

OpenStack Infra (hudson-openstack) wrote : Related fix proposed to neutron-lib (master)

Related fix proposed to branch: master
Review: https://review.opendev.org/c/openstack/neutron-lib/+/886598

OpenStack Infra (hudson-openstack) wrote : Fix merged to neutron (master)

Reviewed: https://review.opendev.org/c/openstack/neutron/+/881742
Committed: https://opendev.org/openstack/neutron/commit/4109ee9bb44a1dbaf520c6bdb051c8f63328e23f
Submitter: "Zuul (22348)"
Branch: master

commit 4109ee9bb44a1dbaf520c6bdb051c8f63328e23f
Author: Rodolfo Alonso Hernandez <email address hidden>
Date: Thu Apr 27 15:36:13 2023 +0000

    Use the new network HA parameter

    This patch implements the new network HA boolean field API extension.
    This field is an input-only parameter for POST operations (creation).
    The default value is "False". When enabled, the Neutron server will
    create a ``ha_router_networks`` register in the same transaction as
    the network creation.

    If by any circumstance (a race condition, for example), another
    ``ha_router_networks`` exists in the same project, a
    ``DBDuplicateEntry`` exception will be raised and the transaction
    will be rolled back.

    Partial-Bug: #2016198
    Change-Id: Ie42c13ecbe4abcad9229b71f6942e393fd0f2e4e
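
The behaviour described in this commit message (HA record created in the same transaction as the network, everything rolled back on a duplicate) can be sketched with sqlite3. This is a minimal sketch under stated assumptions: the table layouts and the create_network helper are hypothetical, and sqlite3.IntegrityError stands in for the ``DBDuplicateEntry`` exception named above.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE networks (id TEXT PRIMARY KEY, project_id TEXT NOT NULL);
    CREATE TABLE ha_router_networks (
        project_id TEXT NOT NULL PRIMARY KEY,
        network_id TEXT NOT NULL REFERENCES networks(id)
    );
    """
)


def create_network(network_id, project_id, ha=False):
    """Create a network and, if requested, its HA record in one transaction."""
    try:
        with conn:  # one transaction: commit both inserts or roll back both
            conn.execute(
                "INSERT INTO networks VALUES (?, ?)", (network_id, project_id)
            )
            if ha:
                conn.execute(
                    "INSERT INTO ha_router_networks VALUES (?, ?)",
                    (project_id, network_id),
                )
        return True
    except sqlite3.IntegrityError:
        # Duplicate HA record for this project: the network insert is undone too.
        return False


created = create_network("net-1", "project-a", ha=True)
duplicate = create_network("net-2", "project-a", ha=True)
network_ids = [row[0] for row in conn.execute("SELECT id FROM networks ORDER BY id")]
print(created, duplicate, network_ids)
```

Because both inserts share a transaction, the losing worker does not leave an orphan network behind: in this sketch only "net-1" remains after the second call fails.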

Changed in neutron:
status: In Progress → Fix Released
OpenStack Infra (hudson-openstack) wrote :

Reviewed: https://review.opendev.org/c/openstack/neutron/+/881826
Committed: https://opendev.org/openstack/neutron/commit/e6fb32e27d1141ca4b00c3758ee44bfe2f060ccf
Submitter: "Zuul (22348)"
Branch: master

commit e6fb32e27d1141ca4b00c3758ee44bfe2f060ccf
Author: Rodolfo Alonso Hernandez <email address hidden>
Date: Fri Apr 28 13:56:02 2023 +0000

    Fix race condition when creating two routers without HA network

    When an HA router is created and the HA network does not exist yet,
    the Neutron server creates the HA network and the corresponding
    subnet before creating the router.

    The HA network cannot be duplicated (see previous patches related to
    this bug). But the subnet, which is created in another database
    transaction, may not be present yet when the router creation call
    tries to create the HA port.

    This patch adds an HA subnet check before creating the router and the
    HA port. If the subnet check fails and the worker tries to create
    this subnet, and the process fails with ``InvalidInput``, that means
    another worker created the subnet first and the current one fails
    because it tries to create the same subnet with the same CIDR.
    In this case, we dismiss the exception and continue with the router
    creation.

    Closes-Bug: #2016198

    Change-Id: I82225fcc6248bb0fd68959ceb1daabff423d81ff
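
The "check, then create, then tolerate the duplicate" logic from this commit message might look roughly like the sketch below. It is not the merged patch: FakePlugin, ensure_ha_subnet, the before_create hook, and the local InvalidInput class are all hypothetical stand-ins used to make the example self-contained.

```python
class InvalidInput(Exception):
    """Local stand-in for the exception named in the commit message."""


class FakePlugin:
    """Hypothetical in-memory plugin: at most one subnet per network."""

    def __init__(self):
        self.subnets = {}  # network_id -> cidr

    def get_subnet(self, network_id):
        return self.subnets.get(network_id)

    def create_subnet(self, network_id, cidr):
        if network_id in self.subnets:
            raise InvalidInput("CIDR %s already used on %s" % (cidr, network_id))
        self.subnets[network_id] = cidr


def ensure_ha_subnet(plugin, network_id, cidr, before_create=None):
    """Make sure the HA subnet exists before the router and HA port are created."""
    if plugin.get_subnet(network_id) is not None:
        return "present"
    if before_create is not None:
        before_create()  # test hook: lets another worker win the race here
    try:
        plugin.create_subnet(network_id, cidr)
        return "created"
    except InvalidInput:
        # Another worker created the same subnet first; dismiss and continue.
        return "created-by-other-worker"


CIDR = "169.254.192.0/18"  # example value only

plugin = FakePlugin()
r1 = ensure_ha_subnet(plugin, "ha-net", CIDR)
r2 = ensure_ha_subnet(plugin, "ha-net", CIDR)

racy = FakePlugin()
r3 = ensure_ha_subnet(
    racy, "ha-net", CIDR,
    before_create=lambda: racy.create_subnet("ha-net", CIDR),
)
print(r1, r2, r3)
```

In all three outcomes the subnet exists when the function returns, which is the precondition the router creation needs before it creates the HA port.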

OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/neutron 23.0.0.0b3

This issue was fixed in the openstack/neutron 23.0.0.0b3 development milestone.
