dhcp-agent can overwhelm neutron server with dhcp_ready_on_ports RPC calls

Bug #1834257 reported by Brian Haley
This bug affects 3 people
Affects: neutron
Status: Fix Released
Importance: High
Assigned to: Sebastian Lohff

Bug Description

The Neutron dhcp-agent reports all ready ports to the Neutron server via the dhcp_ready_on_ports() RPC call. When the dhcp-agent gets ports ready faster than the server can process them, the number of ports per RPC call can grow so high (e.g. 10000 ports) that the neutron server never has a chance of processing the request before the RPC timeout kills it. The dhcp-agent then sends the request again, possibly with even more ports than before, resulting in an endless loop of dhcp_ready_on_ports() calls. This happens especially on agent startup.

We should either send a smaller fixed number of ports per call, or use an algorithm that reduces the number sent when a message timeout is received.
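
A minimal sketch of the second option (shrinking the batch after a timeout), not the change that was eventually merged. `RpcTimeout`, `send_ready_ports` and the `send` callable are hypothetical names used only for illustration; `send` stands in for whatever wraps the dhcp_ready_on_ports() RPC, and the starting batch size of 64 is an assumption.

```python
class RpcTimeout(Exception):
    """Hypothetical stand-in for the messaging layer's timeout error."""


def send_ready_ports(ready_ports, send, max_batch=64, min_batch=1):
    """Send ready ports in batches, halving the batch size after a timeout.

    `send` is a hypothetical callable that performs one RPC call with the
    given list of port IDs and raises RpcTimeout if the server is too slow.
    Returns the batch size that was in use when everything had been sent.
    """
    pending = list(ready_ports)
    batch_size = max_batch
    while pending:
        batch = pending[:batch_size]
        try:
            send(batch)
        except RpcTimeout:
            if batch_size == min_batch:
                raise  # cannot shrink any further; let the caller decide
            # Retry the same ports, but fewer of them per call.
            batch_size = max(min_batch, batch_size // 2)
            continue
        # Only drop ports from the queue once the server acknowledged them.
        del pending[:len(batch)]
    return batch_size
```

Compared with a fixed limit, this adapts to a slow server at the cost of a few wasted calls while the batch size settles.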

OpenStack Infra (hudson-openstack) wrote : Fix proposed to neutron (master)

Fix proposed to branch: master
Review: https://review.opendev.org/667472

Changed in neutron:
assignee: nobody → Brian Haley (brian-haley)
status: New → In Progress
Bence Romsics (bence-romsics) wrote :

Hi Brian,

Trying to understand the symptoms of this error: beyond producing unwanted load on both the agent and the server, can the RPC timeout prevent ports from reaching the ACTIVE state?

Do you have steps to reproduce the problem?

Cheers,
Bence

Changed in neutron:
assignee: Brian Haley (brian-haley) → Sebastian Lohff (sebageek)
Slawek Kaplonski (slaweq) wrote :

@Bence:
So if you have thousands of ports in your networks and you restart the DHCP agent, it will do a full sync of everything, then put those thousands of ports in the set and send all of them in one RPC message to the server. The server then iterates through all those ports one by one, trying to remove the DHCP provisioning block from each port. That takes a very long time, and the DHCP agent's worker that is responsible for sending these notifications is left waiting. So if you then spawn a new instance, its port will wait 5 minutes to be switched to ACTIVE, and the instance creation will finally finish in ERROR state due to this delay in the DHCP agent.

To reproduce this, you need to have a lot of ports handled by one DHCP agent and then, e.g., restart it. That should be enough to see this problem.

Sebastian Lohff (sebageek) wrote :

I observed the problem in the wild (i.e. in our OpenStack installation). What happened there was that we restarted a neutron-dhcp-agent that had many networks, and especially many ports, scheduled to it. The neutron-dhcp-agent was faster in getting ports ready than neutron-server was in processing the `dhcp_ready_on_ports()` rpc. This led to a point where there were more ports ready per rpc than neutron-server could process in $rpc_timeout (~50s in our case). After each rpc timeout, neutron-dhcp-agent would put the same ports into the call again, leaving all our neutron-server instances busy with processing the `dhcp_ready_on_ports()` calls.
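
The loop described above can be illustrated with a toy model. This is only a sketch under simplifying assumptions: the server handles a fixed number of ports per second, a call that does not finish within the rpc timeout acknowledges nothing, and the agent resends everything plus whatever became ready in the meantime. The function name and all numbers other than the ~50s timeout are hypothetical.

```python
def simulate_unbounded_reports(initial_ports, new_ports_per_round,
                               server_ports_per_second, rpc_timeout, rounds):
    """Return the payload size of each attempt when every call sends all ports."""
    # Number of ports the server can finish before the call times out.
    capacity = server_ports_per_second * rpc_timeout
    pending = initial_ports
    sizes = []
    for _ in range(rounds):
        sizes.append(pending)
        if pending <= capacity:
            pending = 0  # call succeeded, every port was acknowledged
        # After a timeout nothing is acknowledged, so the same ports are
        # retried together with the ports that became ready meanwhile.
        pending += new_ports_per_round
    return sizes


# Hypothetical load: 1500 ports ready at startup, 100 more per round,
# a server handling 20 ports/s and a 50 s rpc timeout (capacity 1000).
print(simulate_unbounded_reports(1500, 100, 20, 50, 4))
```

Once the payload exceeds what the server can handle within one timeout, it only grows, which matches the behaviour seen after the agent restart.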

To reproduce this, I'd suggest getting 500-1000 ports ready on the same dhcp-agent at once and maybe putting the neutron-server under some load as well.

The fix for our infrastructure was limiting the number of ports sent per `dhcp_ready_on_ports()` call to something sane - we chose 64 ports.

From my understanding, this will not prevent ports from getting into the ACTIVE state. neutron-server processes the request, even if it takes up to 50 minutes to handle all ports. But as neutron-dhcp-agent has no feedback other than the return value of `dhcp_ready_on_ports()`, it will try again and again until the call succeeds.
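
A rough sketch of such a per-call limit, assuming the agent keeps ready port IDs in a set. `MAX_PORTS_PER_RPC`, `pop_ready_batch`, `report_ready_ports` and the `send` callable are hypothetical names for illustration and are not taken from the Neutron code.

```python
MAX_PORTS_PER_RPC = 64  # the limit mentioned above; other values may suit other setups


def pop_ready_batch(ready_ports, limit=MAX_PORTS_PER_RPC):
    """Remove and return up to `limit` port IDs from the ready set."""
    batch = set()
    while ready_ports and len(batch) < limit:
        batch.add(ready_ports.pop())
    return batch


def report_ready_ports(ready_ports, send):
    """Report all ready ports, never sending more than the limit per call.

    `send` is a hypothetical callable wrapping one dhcp_ready_on_ports()
    RPC call; it is expected to raise if the call fails or times out.
    """
    while ready_ports:
        batch = pop_ready_batch(ready_ports)
        try:
            send(batch)
        except Exception:
            # Put the ports back so a later attempt can retry them.
            ready_ports.update(batch)
            raise
```

With a bounded batch, a single timeout costs at most one small call's worth of work instead of the whole backlog.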

Rodolfo Alonso (rodolfo-alonso-hernandez) wrote :

There is another patch related to this bug under review: https://review.opendev.org/#/c/659274

OpenStack Infra (hudson-openstack) wrote : Change abandoned on neutron (master)

Change abandoned by Brian Haley (<email address hidden>) on branch: master
Review: https://review.opendev.org/667472
Reason: https://review.opendev.org/#/c/659274/ should be good enough

OpenStack Infra (hudson-openstack) wrote : Fix merged to neutron (master)

Reviewed: https://review.opendev.org/659274
Committed: https://git.openstack.org/cgit/openstack/neutron/commit/?id=76ccdb35d4b09106aa8adc97528b02f0fd8acbcc
Submitter: Zuul
Branch: master

commit 76ccdb35d4b09106aa8adc97528b02f0fd8acbcc
Author: Sebastian Lohff <email address hidden>
Date: Wed May 15 13:39:55 2019 +0200

    Limit max ports per rpc for dhcp_ready_on_ports()

    The Neutron dhcp agents reports all ready ports to the Neutron
    server via the dhcp_ready_on_ports() rpc call. When the dhcp agent
    gets ports ready faster than the server can process them the amount
    of ports per rpc call can grow so high (e.g. 10000 Ports) that the
    neutron server never has a chance of processing the request before
    the rpc timeout kills the request, leading to the dhcp agent
    sending the request again, resulting in an endless loop of
    dhcp_ready_on_ports() calls. This happens especially on agent startup.

    To mitigate this problems we now limit the number of ports sent
    per dhcp_ready_on_ports() call.

    Closes-bug: #1834257
    Change-Id: I407e126e760ebf6aca4c31b9c3ff58dcfa55107f

Changed in neutron:
status: In Progress → Fix Released
OpenStack Infra (hudson-openstack) wrote : Fix proposed to neutron (stable/stein)

Fix proposed to branch: stable/stein
Review: https://review.opendev.org/669942

OpenStack Infra (hudson-openstack) wrote : Fix proposed to neutron (stable/rocky)

Fix proposed to branch: stable/rocky
Review: https://review.opendev.org/669943

OpenStack Infra (hudson-openstack) wrote : Fix proposed to neutron (stable/queens)

Fix proposed to branch: stable/queens
Review: https://review.opendev.org/669944

OpenStack Infra (hudson-openstack) wrote : Fix merged to neutron (stable/stein)

Reviewed: https://review.opendev.org/669942
Committed: https://git.openstack.org/cgit/openstack/neutron/commit/?id=2626baf3d603fcda9fc6be8cc63265530954a370
Submitter: Zuul
Branch: stable/stein

commit 2626baf3d603fcda9fc6be8cc63265530954a370
Author: Sebastian Lohff <email address hidden>
Date: Wed May 15 13:39:55 2019 +0200

    Limit max ports per rpc for dhcp_ready_on_ports()

    Closes-bug: #1834257
    Change-Id: I407e126e760ebf6aca4c31b9c3ff58dcfa55107f
    (cherry picked from commit 76ccdb35d4b09106aa8adc97528b02f0fd8acbcc)

tags: added: in-stable-stein
OpenStack Infra (hudson-openstack) wrote : Fix merged to neutron (stable/rocky)

Reviewed: https://review.opendev.org/669943
Committed: https://git.openstack.org/cgit/openstack/neutron/commit/?id=f9f6ae9c98a9ff9d3c0807b9fd51fb2ffe25fd5d
Submitter: Zuul
Branch: stable/rocky

commit f9f6ae9c98a9ff9d3c0807b9fd51fb2ffe25fd5d
Author: Sebastian Lohff <email address hidden>
Date: Wed May 15 13:39:55 2019 +0200

    Limit max ports per rpc for dhcp_ready_on_ports()

    Closes-bug: #1834257
    Change-Id: I407e126e760ebf6aca4c31b9c3ff58dcfa55107f
    (cherry picked from commit 76ccdb35d4b09106aa8adc97528b02f0fd8acbcc)

tags: added: in-stable-rocky
tags: added: in-stable-queens
OpenStack Infra (hudson-openstack) wrote : Fix merged to neutron (stable/queens)

Reviewed: https://review.opendev.org/669944
Committed: https://git.openstack.org/cgit/openstack/neutron/commit/?id=cd033bee076dabc847c96058da0e40e0989ccba8
Submitter: Zuul
Branch: stable/queens

commit cd033bee076dabc847c96058da0e40e0989ccba8
Author: Sebastian Lohff <email address hidden>
Date: Wed May 15 13:39:55 2019 +0200

    Limit max ports per rpc for dhcp_ready_on_ports()

    Closes-bug: #1834257
    Change-Id: I407e126e760ebf6aca4c31b9c3ff58dcfa55107f
    (cherry picked from commit 76ccdb35d4b09106aa8adc97528b02f0fd8acbcc)

OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/neutron 15.0.0.0b1

This issue was fixed in the openstack/neutron 15.0.0.0b1 development milestone.

OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/neutron 14.0.3

This issue was fixed in the openstack/neutron 14.0.3 release.

OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/neutron 13.0.5

This issue was fixed in the openstack/neutron 13.0.5 release.

OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/neutron 12.1.1

This issue was fixed in the openstack/neutron 12.1.1 release.
