rpc response timeout for agent report_state is not possible

Bug #1948676 reported by Tobias Urdin
6
This bug affects 1 person
Affects Status Importance Assigned to Milestone
neutron
Fix Released
Medium
Tobias Urdin

Bug Description

When hosting a large amount of routers and/or networks the RPC calls from the agents can take a long time which requires us to increase the rpc_response_timeout from the default of 60 seconds to a higher value for the agents to not timeout.

This has the side effect that if a rabbitmq or neutron-server is restarted all agents that is currently reporting there will hang for a long time until report_state times out, during this time neutron-server has not got any reports causing it to set the agent as down.

When it times out and tries again the reporting will succeed but a full sync will be triggered for all agents that was previously dead. This in itself can cause a very high load on the control plane.

Consider the fact that a configuration change is deployed using tooling to all neutron-server nodes which is restarted, all agents will die, when they either 1) come back after rpc_response_timeout is reached and tries again or 2) is restarted manually all of them will do a full sync.

We should have a configuration option that only applies to the rpc timeout for the report_state RPC call from agents because that could be lowered to be within the bounds of the agent not being seen as down.

The old behavior can be kept by simply falling back to rpc_response_timeout by default instead of introducing a new default in this override.

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to neutron (master)

Fix proposed to branch: master
Review: https://review.opendev.org/c/openstack/neutron/+/815310

Changed in neutron:
status: New → In Progress
Revision history for this message
Oleg Bondarev (obondarev) wrote :

Did you investigate "This has the side effect that if a rabbitmq or neutron-server is restarted all agents that is currently reporting there will hang for a long time until report_state times out"? Is it expected behavior from messaging side?

Changed in neutron:
status: In Progress → Opinion
Revision history for this message
Tobias Urdin (tobias-urdin) wrote :
Download full text (4.7 KiB)

It's expected behavior that it's blocking until the timeout (rpc_response_timeout) occurs. Since the neutron-server is when restarted/stopped it will leave the message queue the agent will never get a reply to that message and you'll get, for example:

2021-10-25 10:36:39.104 1590145 ERROR neutron.agent.dhcp.agent [req-7be84bdb-8eb6-4368-8265-f4aaa9833406 - - - - -] Failed reporting state!: oslo_messaging.exceptions.MessagingTimeout: Timed out waiting for a reply to message ID 510dd9e58a274a40a94ac84f0df9f245
2021-10-25 10:36:39.104 1590145 ERROR neutron.agent.dhcp.agent Traceback (most recent call last):
2021-10-25 10:36:39.104 1590145 ERROR neutron.agent.dhcp.agent File "/usr/lib/python3.6/site-packages/oslo_messaging/_drivers/amqpdriver.py", line 433, in get
2021-10-25 10:36:39.104 1590145 ERROR neutron.agent.dhcp.agent return self._queues[msg_id].get(block=True, timeout=timeout)
2021-10-25 10:36:39.104 1590145 ERROR neutron.agent.dhcp.agent File "/usr/lib/python3.6/site-packages/eventlet/queue.py", line 322, in get
2021-10-25 10:36:39.104 1590145 ERROR neutron.agent.dhcp.agent return waiter.wait()
2021-10-25 10:36:39.104 1590145 ERROR neutron.agent.dhcp.agent File "/usr/lib/python3.6/site-packages/eventlet/queue.py", line 141, in wait
2021-10-25 10:36:39.104 1590145 ERROR neutron.agent.dhcp.agent return get_hub().switch()
2021-10-25 10:36:39.104 1590145 ERROR neutron.agent.dhcp.agent File "/usr/lib/python3.6/site-packages/eventlet/hubs/hub.py", line 298, in switch
2021-10-25 10:36:39.104 1590145 ERROR neutron.agent.dhcp.agent return self.greenlet.switch()
2021-10-25 10:36:39.104 1590145 ERROR neutron.agent.dhcp.agent queue.Empty
2021-10-25 10:36:39.104 1590145 ERROR neutron.agent.dhcp.agent
2021-10-25 10:36:39.104 1590145 ERROR neutron.agent.dhcp.agent During handling of the above exception, another exception occurred:
2021-10-25 10:36:39.104 1590145 ERROR neutron.agent.dhcp.agent
2021-10-25 10:36:39.104 1590145 ERROR neutron.agent.dhcp.agent Traceback (most recent call last):
2021-10-25 10:36:39.104 1590145 ERROR neutron.agent.dhcp.agent File "/usr/lib/python3.6/site-packages/neutron/agent/dhcp/agent.py", line 1090, in _report_state
2021-10-25 10:36:39.104 1590145 ERROR neutron.agent.dhcp.agent ctx, self.agent_state, True)
2021-10-25 10:36:39.104 1590145 ERROR neutron.agent.dhcp.agent File "/usr/lib/python3.6/site-packages/neutron/agent/rpc.py", line 103, in report_state
2021-10-25 10:36:39.104 1590145 ERROR neutron.agent.dhcp.agent return method(context, 'report_state', **kwargs)
2021-10-25 10:36:39.104 1590145 ERROR neutron.agent.dhcp.agent File "/usr/lib/python3.6/site-packages/oslo_messaging/rpc/client.py", line 179, in call
2021-10-25 10:36:39.104 1590145 ERROR neutron.agent.dhcp.agent transport_options=self.transport_options)
2021-10-25 10:36:39.104 1590145 ERROR neutron.agent.dhcp.agent File "/usr/lib/python3.6/site-packages/oslo_messaging/transport.py", line 128, in _send
2021-10-25 10:36:39.104 1590145 ERROR neutron.agent.dhcp.agent transport_options=transport_options)
2021-10-25 10:36:39.104 1590145 ERROR neutron.agent.dhcp.agent File "/usr/lib/python3.6/site-packages/oslo_mes...

Read more...

Revision history for this message
Tobias Urdin (tobias-urdin) wrote :

I agree with your comment that a potential report_rpc_response_timeout mightas well be a global configuration option if that's what's decided.

Another alternative would be setting timeout to report_interval instead of using rpc_response_timeout but that would make it hardcoded instead of configurable.

Revision history for this message
Oleg Bondarev (obondarev) wrote :

Actually I think setting state report rpc timeout to report_interval makes sense: it's not heavy for the server and if there is no response during report_interval likely there is something wrong with the server and there is not much sense to wait more.

Changed in neutron:
status: Opinion → In Progress
Revision history for this message
Tobias Urdin (tobias-urdin) wrote :

Yes but it would not be backportable because of the behavior change :(

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to neutron (master)

Reviewed: https://review.opendev.org/c/openstack/neutron/+/815310
Committed: https://opendev.org/openstack/neutron/commit/7d552848c272b4fbfdafdc552e54cefd25b6d46a
Submitter: "Zuul (22348)"
Branch: master

commit 7d552848c272b4fbfdafdc552e54cefd25b6d46a
Author: Tobias Urdin <email address hidden>
Date: Mon Oct 25 13:52:03 2021 +0000

    Set RPC timeout in PluginReportStateAPI to report_interval

    See more details on why this is need in the referenced
    bug #1948676

    Change-Id: I8a95e80ca74edc8f8f394cefc749c4065a8e0575
    Closes-Bug: #1948676

Changed in neutron:
status: In Progress → Fix Released
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to neutron (stable/xena)

Fix proposed to branch: stable/xena
Review: https://review.opendev.org/c/openstack/neutron/+/815879

Changed in neutron:
importance: Undecided → Medium
assignee: nobody → Tobias Urdin (tobias-urdin)
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to neutron (stable/xena)

Reviewed: https://review.opendev.org/c/openstack/neutron/+/815879
Committed: https://opendev.org/openstack/neutron/commit/551d36362aa7129bbca4bde56c6957124a13c835
Submitter: "Zuul (22348)"
Branch: stable/xena

commit 551d36362aa7129bbca4bde56c6957124a13c835
Author: Tobias Urdin <email address hidden>
Date: Mon Oct 25 13:52:03 2021 +0000

    Set RPC timeout in PluginReportStateAPI to report_interval

    See more details on why this is need in the referenced
    bug #1948676

    Change-Id: I8a95e80ca74edc8f8f394cefc749c4065a8e0575
    Closes-Bug: #1948676
    (cherry picked from commit 7d552848c272b4fbfdafdc552e54cefd25b6d46a)

tags: added: in-stable-xena
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to neutron (stable/wallaby)

Fix proposed to branch: stable/wallaby
Review: https://review.opendev.org/c/openstack/neutron/+/817544

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to neutron (stable/victoria)

Fix proposed to branch: stable/victoria
Review: https://review.opendev.org/c/openstack/neutron/+/817578

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to neutron (stable/wallaby)

Reviewed: https://review.opendev.org/c/openstack/neutron/+/817544
Committed: https://opendev.org/openstack/neutron/commit/7f9a3f329340074a24b5d881fb762e5f0a91d47e
Submitter: "Zuul (22348)"
Branch: stable/wallaby

commit 7f9a3f329340074a24b5d881fb762e5f0a91d47e
Author: Tobias Urdin <email address hidden>
Date: Mon Oct 25 13:52:03 2021 +0000

    Set RPC timeout in PluginReportStateAPI to report_interval

    See more details on why this is need in the referenced
    bug #1948676

    Change-Id: I8a95e80ca74edc8f8f394cefc749c4065a8e0575
    Closes-Bug: #1948676
    (cherry picked from commit 7d552848c272b4fbfdafdc552e54cefd25b6d46a)
    (cherry picked from commit 551d36362aa7129bbca4bde56c6957124a13c835)

tags: added: in-stable-wallaby
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to neutron (stable/victoria)

Reviewed: https://review.opendev.org/c/openstack/neutron/+/817578
Committed: https://opendev.org/openstack/neutron/commit/ee62d5b01a3deb61ff2559449563722d5eaa1960
Submitter: "Zuul (22348)"
Branch: stable/victoria

commit ee62d5b01a3deb61ff2559449563722d5eaa1960
Author: Tobias Urdin <email address hidden>
Date: Mon Oct 25 13:52:03 2021 +0000

    Set RPC timeout in PluginReportStateAPI to report_interval

    See more details on why this is need in the referenced
    bug #1948676

    Change-Id: I8a95e80ca74edc8f8f394cefc749c4065a8e0575
    Closes-Bug: #1948676
    (cherry picked from commit 7d552848c272b4fbfdafdc552e54cefd25b6d46a)
    (cherry picked from commit 551d36362aa7129bbca4bde56c6957124a13c835)
    (cherry picked from commit 7f9a3f329340074a24b5d881fb762e5f0a91d47e)

tags: added: in-stable-victoria
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix proposed to neutron (master)

Related fix proposed to branch: master
Review: https://review.opendev.org/c/openstack/neutron/+/819117

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix merged to neutron (master)

Reviewed: https://review.opendev.org/c/openstack/neutron/+/819117
Committed: https://opendev.org/openstack/neutron/commit/a4ffd1a2f98cea425d2e1c887bcaecd655fcd473
Submitter: "Zuul (22348)"
Branch: master

commit a4ffd1a2f98cea425d2e1c887bcaecd655fcd473
Author: yatinkarel <email address hidden>
Date: Wed Nov 24 19:53:19 2021 +0530

    Set report_interval to 0 for ovs agent unit tests

    Unit tests are taking longer than usual since [1] and
    causing timeouts in unit tests jobs intermittently. Set
    report_interval(as this is now used to set RPC timeout)
    to 0 so report_state responds immediately.

    [1] https://review.opendev.org/q/I8a95e80ca74edc8f8f394cefc749c4065a8e0575

    Related-Bug: #1948676
    Change-Id: I653c907a4323e19d5bc381cd3716d42c45a75e15

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix proposed to neutron (stable/xena)

Related fix proposed to branch: stable/xena
Review: https://review.opendev.org/c/openstack/neutron/+/819174

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix proposed to neutron (stable/wallaby)

Related fix proposed to branch: stable/wallaby
Review: https://review.opendev.org/c/openstack/neutron/+/819175

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix proposed to neutron (stable/victoria)

Related fix proposed to branch: stable/victoria
Review: https://review.opendev.org/c/openstack/neutron/+/819177

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix merged to neutron (stable/victoria)

Reviewed: https://review.opendev.org/c/openstack/neutron/+/819177
Committed: https://opendev.org/openstack/neutron/commit/8439348addb649459609f22fecc9a6fb8be7ed75
Submitter: "Zuul (22348)"
Branch: stable/victoria

commit 8439348addb649459609f22fecc9a6fb8be7ed75
Author: yatinkarel <email address hidden>
Date: Wed Nov 24 19:53:19 2021 +0530

    Set report_interval to 0 for ovs agent unit tests

    Unit tests are taking longer than usual since [1] and
    causing timeouts in unit tests jobs intermittently. Set
    report_interval(as this is now used to set RPC timeout)
    to 0 so report_state responds immediately.

    [1] https://review.opendev.org/q/I8a95e80ca74edc8f8f394cefc749c4065a8e0575

    Related-Bug: #1948676
    Change-Id: I653c907a4323e19d5bc381cd3716d42c45a75e15
    (cherry picked from commit a4ffd1a2f98cea425d2e1c887bcaecd655fcd473)

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix merged to neutron (stable/wallaby)

Reviewed: https://review.opendev.org/c/openstack/neutron/+/819175
Committed: https://opendev.org/openstack/neutron/commit/3762c46d8a79555134a6b42daeea1d85cdf0ab1b
Submitter: "Zuul (22348)"
Branch: stable/wallaby

commit 3762c46d8a79555134a6b42daeea1d85cdf0ab1b
Author: yatinkarel <email address hidden>
Date: Wed Nov 24 19:53:19 2021 +0530

    Set report_interval to 0 for ovs agent unit tests

    Unit tests are taking longer than usual since [1] and
    causing timeouts in unit tests jobs intermittently. Set
    report_interval(as this is now used to set RPC timeout)
    to 0 so report_state responds immediately.

    [1] https://review.opendev.org/q/I8a95e80ca74edc8f8f394cefc749c4065a8e0575

    Related-Bug: #1948676
    Change-Id: I653c907a4323e19d5bc381cd3716d42c45a75e15
    (cherry picked from commit a4ffd1a2f98cea425d2e1c887bcaecd655fcd473)

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix merged to neutron (stable/xena)

Reviewed: https://review.opendev.org/c/openstack/neutron/+/819174
Committed: https://opendev.org/openstack/neutron/commit/3bc100140f3b6e19387c1e46e26fa70d34aaa8b8
Submitter: "Zuul (22348)"
Branch: stable/xena

commit 3bc100140f3b6e19387c1e46e26fa70d34aaa8b8
Author: yatinkarel <email address hidden>
Date: Wed Nov 24 19:53:19 2021 +0530

    Set report_interval to 0 for ovs agent unit tests

    Unit tests are taking longer than usual since [1] and
    causing timeouts in unit tests jobs intermittently. Set
    report_interval(as this is now used to set RPC timeout)
    to 0 so report_state responds immediately.

    [1] https://review.opendev.org/q/I8a95e80ca74edc8f8f394cefc749c4065a8e0575

    Related-Bug: #1948676
    Change-Id: I653c907a4323e19d5bc381cd3716d42c45a75e15
    (cherry picked from commit a4ffd1a2f98cea425d2e1c887bcaecd655fcd473)

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/neutron 19.1.0

This issue was fixed in the openstack/neutron 19.1.0 release.

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/neutron 17.3.0

This issue was fixed in the openstack/neutron 17.3.0 release.

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/neutron 18.2.0

This issue was fixed in the openstack/neutron 18.2.0 release.

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/neutron 20.0.0.0rc1

This issue was fixed in the openstack/neutron 20.0.0.0rc1 release candidate.

To post a comment you must log in.
This report contains Public information  
Everyone can see this information.

Other bug subscribers

Remote bug watches

Bug watches keep track of this bug in other bug trackers.