Concurrent DHCP agent updates can result in a DB lock

Bug #1939432 reported by Rodolfo Alonso
Affects: neutron
Status: Fix Released
Importance: Medium
Assigned to: Rodolfo Alonso

Bug Description

Bugzilla reference: https://bugzilla.redhat.com/show_bug.cgi?id=1982981

When a new network and its first subnet are created, the DHCP agent is updated. The agent scheduler increases the "load" field [1] of the DHCP agent DB register; this field is later used when scheduling new networks to the agent.

If multiple networks (each with its first subnet) are created concurrently, the agent "load" will be modified concurrently. The DB guarantees that only one transaction can increase the agent "load" parameter at a time; the other transactions will fail and be retried. E.g.: https://paste.opendev.org/show/807984/

NOTE: I say "a network and its first subnet" because that is what triggers the spawn of a new dnsmasq process. This is the event that increases the "load" value by 1. Any other subnet added later to this network will modify the dnsmasq config but won't increase the "load" value.

As commented in the "BaseResourceFilter.bind" method [2], "the resource being bound might or might not be of the same type which is accounted for the load. It isn't a problem because "+ 1" here does not meant to predict precisely what the load of the agent will be. The value will be corrected by the agent on the next report interval." In other words, when the DHCP agent reports its status, it accurately updates the number of resources (networks) that it is handling.
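
To illustrate why an inexact "+ 1" is tolerable, here is a minimal, self-contained sketch. It is not Neutron code: the "AgentRecord" class and the "schedule_network" and "report_state" functions are hypothetical names invented for this example. Only the "load" field and the periodic agent report come from the description above.

```python
from dataclasses import dataclass, field


@dataclass
class AgentRecord:
    """Hypothetical stand-in for the agent DB row; only 'load' mirrors [1]."""
    host: str
    load: int = 0
    networks: set = field(default_factory=set)


def schedule_network(agents, network_id, bump_load=True):
    """Bind the network to the least loaded agent and optionally bump 'load'."""
    agent = min(agents, key=lambda a: a.load)
    agent.networks.add(network_id)
    if bump_load:
        agent.load += 1
    return agent


def report_state(agent):
    """Simulate the periodic report: 'load' is overwritten with the real count."""
    agent.load = len(agent.networks)


agents = [AgentRecord("dhcp-1"), AgentRecord("dhcp-2")]
schedule_network(agents, "net-a")                   # "load" is bumped
schedule_network(agents, "net-b", bump_load=False)  # the bump is lost
stale_loads = {a.host: a.load for a in agents}

# The next report interval corrects the value on every agent.
for a in agents:
    report_state(a)
fresh_loads = {a.host: a.load for a in agents}
```

In this sketch the second agent shows a stale "load" only until its next simulated report, which is the window the code comment quoted above accepts.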

This bug proposes catching the DB errors in the "BaseResourceFilter.bind" method [2] to avoid the DB retry action. The retry is unnecessary because the DHCP agent, as commented, will update the "load" value. By skipping this retry, we avoid unnecessary Neutron server and DB operations and command delays (for example, when creating a subnet).
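
A rough sketch of the idea, under stated assumptions: the "skip_exceptions" decorator, the "FakeDBError" exception and the "bump_load" and "bind" functions below are hypothetical and only model the behaviour proposed here. The actual Neutron change may be structured differently.

```python
import functools
import logging

LOG = logging.getLogger(__name__)


class FakeDBError(Exception):
    """Hypothetical stand-in for a DB error raised on a locked row."""


def skip_exceptions(*exceptions):
    """Log and swallow the given exceptions instead of propagating them."""
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            try:
                return func(*args, **kwargs)
            except exceptions as exc:
                LOG.debug("Skipping %s raised in %s",
                          type(exc).__name__, func.__name__)
                return None
        return wrapper
    return decorator


@skip_exceptions(FakeDBError)
def bump_load(agent, fail=False):
    """Best-effort '+ 1'; 'fail' simulates a concurrent transaction winning."""
    if fail:
        raise FakeDBError("row locked by another transaction")
    agent["load"] += 1
    return agent["load"]


def bind(agent, network_id, fail_bump=False):
    """Bind the network even if the 'load' bump could not be stored."""
    bump_load(agent, fail=fail_bump)
    agent["networks"].append(network_id)
    return agent
```

With this shape, a failed bump no longer aborts or retries the whole operation: calling "bind" with "fail_bump=True" still records the network and leaves "load" to be corrected by the next agent report.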

[1] https://github.com/openstack/neutron/blob/0ccfed0ae13182f820e6a8c11a2fa801506f3a3a/neutron/db/models/agent.py#L55
[2] https://github.com/openstack/neutron/blob/0ccfed0ae13182f820e6a8c11a2fa801506f3a3a/neutron/scheduler/base_resource_filter.py#L35-L39

Changed in neutron:
assignee: nobody → Rodolfo Alonso (rodolfo-alonso-hernandez)
description: updated
OpenStack Infra (hudson-openstack) wrote : Fix proposed to neutron (master)

Fix proposed to branch: master
Review: https://review.opendev.org/c/openstack/neutron/+/804218

Changed in neutron:
status: New → In Progress
Miguel Lavalle (minsel)
Changed in neutron:
importance: Undecided → Medium
OpenStack Infra (hudson-openstack) wrote : Fix proposed to neutron (stable/wallaby)

Fix proposed to branch: stable/wallaby
Review: https://review.opendev.org/c/openstack/neutron/+/806568

OpenStack Infra (hudson-openstack) wrote : Fix proposed to neutron (stable/victoria)

Fix proposed to branch: stable/victoria
Review: https://review.opendev.org/c/openstack/neutron/+/806569

OpenStack Infra (hudson-openstack) wrote : Fix proposed to neutron (stable/ussuri)

Fix proposed to branch: stable/ussuri
Review: https://review.opendev.org/c/openstack/neutron/+/806571

OpenStack Infra (hudson-openstack) wrote : Fix proposed to neutron (stable/train)

Fix proposed to branch: stable/train
Review: https://review.opendev.org/c/openstack/neutron/+/806573

OpenStack Infra (hudson-openstack) wrote : Fix merged to neutron (master)

Reviewed: https://review.opendev.org/c/openstack/neutron/+/804218
Committed: https://opendev.org/openstack/neutron/commit/668b1cc652f076e555ef1fc1289684367159186a
Submitter: "Zuul (22348)"
Branch: master

commit 668b1cc652f076e555ef1fc1289684367159186a
Author: Rodolfo Alonso Hernandez <email address hidden>
Date: Wed Aug 11 09:13:55 2021 +0000

    Do not fail if the agent load is not bumped

    When a new network and its first subnet is created, the DHCP agent
    bumps the "load" parameter to reflect the number of networks handled.
    This "load" parameter is modified when:
    - As commented, when the first subnet of a network is created. The
      "load" value is bumped.
    - When periodically the DHCP agent sends the status, informing about
      the current number of networks handled.

    If during the subnet creation this "load" value is not updated, it will
    be in the next periodic update of the agent.

    This "load" value is used by the scheduler to equally distribute the
    objects to be managed by any agent type (DHCP agents manage networks).

    The bug refers to DHCP but is valid for any other agent.

    Change-Id: Ief402048d99d40b64d81fcf58eb2e39b1ba7ebbb
    Closes-Bug: #1939432

Changed in neutron:
status: In Progress → Fix Released
OpenStack Infra (hudson-openstack) wrote : Fix merged to neutron (stable/wallaby)

Reviewed: https://review.opendev.org/c/openstack/neutron/+/806568
Committed: https://opendev.org/openstack/neutron/commit/816aca60b90b89038863d6974b5a9e0ee8983424
Submitter: "Zuul (22348)"
Branch: stable/wallaby

commit 816aca60b90b89038863d6974b5a9e0ee8983424
Author: Rodolfo Alonso Hernandez <email address hidden>
Date: Wed Aug 11 09:13:55 2021 +0000

    Do not fail if the agent load is not bumped

    When a new network and its first subnet is created, the DHCP agent
    bumps the "load" parameter to reflect the number of networks handled.
    This "load" parameter is modified when:
    - As commented, when the first subnet of a network is created. The
      "load" value is bumped.
    - When periodically the DHCP agent sends the status, informing about
      the current number of networks handled.

    If during the subnet creation this "load" value is not updated, it will
    be in the next periodic update of the agent.

    This "load" value is used by the scheduler to equally distribute the
    objects to be managed by any agent type (DHCP agents manage networks).

    The bug refers to DHCP but is valid for any other agent.

    Conflicts:
          neutron/common/utils.py

    Change-Id: Ief402048d99d40b64d81fcf58eb2e39b1ba7ebbb
    Closes-Bug: #1939432
    (cherry picked from commit 668b1cc652f076e555ef1fc1289684367159186a)

tags: added: in-stable-wallaby
tags: added: in-stable-victoria
OpenStack Infra (hudson-openstack) wrote : Fix merged to neutron (stable/victoria)

Reviewed: https://review.opendev.org/c/openstack/neutron/+/806569
Committed: https://opendev.org/openstack/neutron/commit/1eb6b8926a7a5ad442c5af6057c042999e2645f1
Submitter: "Zuul (22348)"
Branch: stable/victoria

commit 1eb6b8926a7a5ad442c5af6057c042999e2645f1
Author: Rodolfo Alonso Hernandez <email address hidden>
Date: Wed Aug 11 09:13:55 2021 +0000

    Do not fail if the agent load is not bumped

    When a new network and its first subnet is created, the DHCP agent
    bumps the "load" parameter to reflect the number of networks handled.
    This "load" parameter is modified when:
    - As commented, when the first subnet of a network is created. The
      "load" value is bumped.
    - When periodically the DHCP agent sends the status, informing about
      the current number of networks handled.

    If during the subnet creation this "load" value is not updated, it will
    be in the next periodic update of the agent.

    This "load" value is used by the scheduler to equally distribute the
    objects to be managed by any agent type (DHCP agents manage networks).

    The bug refers to DHCP but is valid for any other agent.

    Conflicts:
          neutron/common/utils.py

    Change-Id: Ief402048d99d40b64d81fcf58eb2e39b1ba7ebbb
    Closes-Bug: #1939432
    (cherry picked from commit 668b1cc652f076e555ef1fc1289684367159186a)
    (cherry picked from commit 816aca60b90b89038863d6974b5a9e0ee8983424)

OpenStack Infra (hudson-openstack) wrote : Fix merged to neutron (stable/ussuri)

Reviewed: https://review.opendev.org/c/openstack/neutron/+/806571
Committed: https://opendev.org/openstack/neutron/commit/f315f85a7b0ead30877e19988db4e4f80fc960e9
Submitter: "Zuul (22348)"
Branch: stable/ussuri

commit f315f85a7b0ead30877e19988db4e4f80fc960e9
Author: Rodolfo Alonso Hernandez <email address hidden>
Date: Wed Aug 11 09:13:55 2021 +0000

    Do not fail if the agent load is not bumped

    When a new network and its first subnet is created, the DHCP agent
    bumps the "load" parameter to reflect the number of networks handled.
    This "load" parameter is modified when:
    - As commented, when the first subnet of a network is created. The
      "load" value is bumped.
    - When periodically the DHCP agent sends the status, informing about
      the current number of networks handled.

    If during the subnet creation this "load" value is not updated, it will
    be in the next periodic update of the agent.

    This "load" value is used by the scheduler to equally distribute the
    objects to be managed by any agent type (DHCP agents manage networks).

    The bug refers to DHCP but is valid for any other agent.

    Conflicts:
          neutron/common/utils.py
          neutron/scheduler/base_resource_filter.py

    Change-Id: Ief402048d99d40b64d81fcf58eb2e39b1ba7ebbb
    Closes-Bug: #1939432
    (cherry picked from commit 668b1cc652f076e555ef1fc1289684367159186a)
    (cherry picked from commit 816aca60b90b89038863d6974b5a9e0ee8983424)
    (cherry picked from commit 1eb6b8926a7a5ad442c5af6057c042999e2645f1)

tags: added: in-stable-ussuri
OpenStack Infra (hudson-openstack) wrote : Fix merged to neutron (stable/train)

Reviewed: https://review.opendev.org/c/openstack/neutron/+/806573
Committed: https://opendev.org/openstack/neutron/commit/722b9b57e1ff91200636db45e515ca52a769736c
Submitter: "Zuul (22348)"
Branch: stable/train

commit 722b9b57e1ff91200636db45e515ca52a769736c
Author: Rodolfo Alonso Hernandez <email address hidden>
Date: Wed Aug 11 09:13:55 2021 +0000

    Do not fail if the agent load is not bumped

    When a new network and its first subnet is created, the DHCP agent
    bumps the "load" parameter to reflect the number of networks handled.
    This "load" parameter is modified when:
    - As commented, when the first subnet of a network is created. The
      "load" value is bumped.
    - When periodically the DHCP agent sends the status, informing about
      the current number of networks handled.

    If during the subnet creation this "load" value is not updated, it will
    be in the next periodic update of the agent.

    This "load" value is used by the scheduler to equally distribute the
    objects to be managed by any agent type (DHCP agents manage networks).

    The bug refers to DHCP but is valid for any other agent.

    Conflicts:
          neutron/common/utils.py
          neutron/scheduler/base_resource_filter.py

    Change-Id: Ief402048d99d40b64d81fcf58eb2e39b1ba7ebbb
    Closes-Bug: #1939432
    (cherry picked from commit 668b1cc652f076e555ef1fc1289684367159186a)
    (cherry picked from commit 816aca60b90b89038863d6974b5a9e0ee8983424)
    (cherry picked from commit 1eb6b8926a7a5ad442c5af6057c042999e2645f1)
    (cherry picked from commit f315f85a7b0ead30877e19988db4e4f80fc960e9)

tags: added: in-stable-train
OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/neutron 16.4.1

This issue was fixed in the openstack/neutron 16.4.1 release.

OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/neutron 17.2.1

This issue was fixed in the openstack/neutron 17.2.1 release.

OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/neutron 18.1.1

This issue was fixed in the openstack/neutron 18.1.1 release.

OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/neutron 19.0.0.0rc1

This issue was fixed in the openstack/neutron 19.0.0.0rc1 release candidate.

OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/neutron train-eol

This issue was fixed in the openstack/neutron train-eol release.
