Bug #1948676 “rpc response timeout for agent report_state is not...” : Bugs : neutron

Revision history for this message

OpenStack Infra (hudson-openstack) wrote on 2021-10-25: Fix proposed to neutron (master)

#1

Fix proposed to branch: master
Review: https://review.opendev.org/c/openstack/neutron/+/815310

Changed in neutron:
status:	New → In Progress

Revision history for this message

Oleg Bondarev (obondarev) wrote on 2021-10-25:

#2

Did you investigate "This has the side effect that if a rabbitmq or neutron-server is restarted all agents that is currently reporting there will hang for a long time until report_state times out"? Is it expected behavior from messaging side?

Changed in neutron:
status:	In Progress → Opinion

Revision history for this message

Tobias Urdin (tobias-urdin) wrote on 2021-10-25:

#3

Download full text (4.7 KiB)

It's expected behavior that it's blocking until the timeout (rpc_response_timeout) occurs. Since the neutron-server is when restarted/stopped it will leave the message queue the agent will never get a reply to that message and you'll get, for example:

2021-10-25 10:36:39.104 1590145 ERROR neutron.agent.dhcp.agent [req-7be84bdb-8eb6-4368-8265-f4aaa9833406 - - - - -] Failed reporting state!: oslo_messaging.exceptions.MessagingTimeout: Timed out waiting for a reply to message ID 510dd9e58a274a40a94ac84f0df9f245
2021-10-25 10:36:39.104 1590145 ERROR neutron.agent.dhcp.agent Traceback (most recent call last):
2021-10-25 10:36:39.104 1590145 ERROR neutron.agent.dhcp.agent File "/usr/lib/python3.6/site-packages/oslo_messaging/_drivers/amqpdriver.py", line 433, in get
2021-10-25 10:36:39.104 1590145 ERROR neutron.agent.dhcp.agent return self._queues[msg_id].get(block=True, timeout=timeout)
2021-10-25 10:36:39.104 1590145 ERROR neutron.agent.dhcp.agent File "/usr/lib/python3.6/site-packages/eventlet/queue.py", line 322, in get
2021-10-25 10:36:39.104 1590145 ERROR neutron.agent.dhcp.agent return waiter.wait()
2021-10-25 10:36:39.104 1590145 ERROR neutron.agent.dhcp.agent File "/usr/lib/python3.6/site-packages/eventlet/queue.py", line 141, in wait
2021-10-25 10:36:39.104 1590145 ERROR neutron.agent.dhcp.agent return get_hub().switch()
2021-10-25 10:36:39.104 1590145 ERROR neutron.agent.dhcp.agent File "/usr/lib/python3.6/site-packages/eventlet/hubs/hub.py", line 298, in switch
2021-10-25 10:36:39.104 1590145 ERROR neutron.agent.dhcp.agent return self.greenlet.switch()
2021-10-25 10:36:39.104 1590145 ERROR neutron.agent.dhcp.agent queue.Empty
2021-10-25 10:36:39.104 1590145 ERROR neutron.agent.dhcp.agent
2021-10-25 10:36:39.104 1590145 ERROR neutron.agent.dhcp.agent During handling of the above exception, another exception occurred:
2021-10-25 10:36:39.104 1590145 ERROR neutron.agent.dhcp.agent
2021-10-25 10:36:39.104 1590145 ERROR neutron.agent.dhcp.agent Traceback (most recent call last):
2021-10-25 10:36:39.104 1590145 ERROR neutron.agent.dhcp.agent File "/usr/lib/python3.6/site-packages/neutron/agent/dhcp/agent.py", line 1090, in _report_state
2021-10-25 10:36:39.104 1590145 ERROR neutron.agent.dhcp.agent ctx, self.agent_state, True)
2021-10-25 10:36:39.104 1590145 ERROR neutron.agent.dhcp.agent File "/usr/lib/python3.6/site-packages/neutron/agent/rpc.py", line 103, in report_state
2021-10-25 10:36:39.104 1590145 ERROR neutron.agent.dhcp.agent return method(context, 'report_state', **kwargs)
2021-10-25 10:36:39.104 1590145 ERROR neutron.agent.dhcp.agent File "/usr/lib/python3.6/site-packages/oslo_messaging/rpc/client.py", line 179, in call
2021-10-25 10:36:39.104 1590145 ERROR neutron.agent.dhcp.agent transport_options=self.transport_options)
2021-10-25 10:36:39.104 1590145 ERROR neutron.agent.dhcp.agent File "/usr/lib/python3.6/site-packages/oslo_messaging/transport.py", line 128, in _send
2021-10-25 10:36:39.104 1590145 ERROR neutron.agent.dhcp.agent transport_options=transport_options)
2021-10-25 10:36:39.104 1590145 ERROR neutron.agent.dhcp.agent File "/usr/lib/python3.6/site-packages/oslo_mes...

It's expected behavior that it's blocking until the timeout (rpc_response_timeout) occurs. Since the neutron-server is when restarted/stopped it will leave the message queue the agent will never get a reply to that message and you'll get, for example:

2021-10-25 10:36:39.104 1590145 ERROR neutron.agent.dhcp.agent [req-7be84bdb-8eb6-4368-8265-f4aaa9833406 - - - - -] Failed reporting state!: oslo_messaging.exceptions.MessagingTimeout: Timed out waiting for a reply to message ID 510dd9e58a274a40a94ac84f0df9f245
2021-10-25 10:36:39.104 1590145 ERROR neutron.agent.dhcp.agent Traceback (most recent call last):
2021-10-25 10:36:39.104 1590145 ERROR neutron.agent.dhcp.agent   File "/usr/lib/python3.6/site-packages/oslo_messaging/_drivers/amqpdriver.py", line 433, in get
2021-10-25 10:36:39.104 1590145 ERROR neutron.agent.dhcp.agent     return self._queues[msg_id].get(block=True, timeout=timeout)
2021-10-25 10:36:39.104 1590145 ERROR neutron.agent.dhcp.agent   File "/usr/lib/python3.6/site-packages/eventlet/queue.py", line 322, in get
2021-10-25 10:36:39.104 1590145 ERROR neutron.agent.dhcp.agent     return waiter.wait()
2021-10-25 10:36:39.104 1590145 ERROR neutron.agent.dhcp.agent   File "/usr/lib/python3.6/site-packages/eventlet/queue.py", line 141, in wait
2021-10-25 10:36:39.104 1590145 ERROR neutron.agent.dhcp.agent     return get_hub().switch()
2021-10-25 10:36:39.104 1590145 ERROR neutron.agent.dhcp.agent   File "/usr/lib/python3.6/site-packages/eventlet/hubs/hub.py", line 298, in switch
2021-10-25 10:36:39.104 1590145 ERROR neutron.agent.dhcp.agent     return self.greenlet.switch()
2021-10-25 10:36:39.104 1590145 ERROR neutron.agent.dhcp.agent queue.Empty
2021-10-25 10:36:39.104 1590145 ERROR neutron.agent.dhcp.agent 
2021-10-25 10:36:39.104 1590145 ERROR neutron.agent.dhcp.agent During handling of the above exception, another exception occurred:
2021-10-25 10:36:39.104 1590145 ERROR neutron.agent.dhcp.agent 
2021-10-25 10:36:39.104 1590145 ERROR neutron.agent.dhcp.agent Traceback (most recent call last):
2021-10-25 10:36:39.104 1590145 ERROR neutron.agent.dhcp.agent   File "/usr/lib/python3.6/site-packages/neutron/agent/dhcp/agent.py", line 1090, in _report_state
2021-10-25 10:36:39.104 1590145 ERROR neutron.agent.dhcp.agent     ctx, self.agent_state, True)
2021-10-25 10:36:39.104 1590145 ERROR neutron.agent.dhcp.agent   File "/usr/lib/python3.6/site-packages/neutron/agent/rpc.py", line 103, in report_state
2021-10-25 10:36:39.104 1590145 ERROR neutron.agent.dhcp.agent     return method(context, 'report_state', **kwargs)
2021-10-25 10:36:39.104 1590145 ERROR neutron.agent.dhcp.agent   File "/usr/lib/python3.6/site-packages/oslo_messaging/rpc/client.py", line 179, in call
2021-10-25 10:36:39.104 1590145 ERROR neutron.agent.dhcp.agent     transport_options=self.transport_options)
2021-10-25 10:36:39.104 1590145 ERROR neutron.agent.dhcp.agent   File "/usr/lib/python3.6/site-packages/oslo_messaging/transport.py", line 128, in _send
2021-10-25 10:36:39.104 1590145 ERROR neutron.agent.dhcp.agent     transport_options=transport_options)
2021-10-25 10:36:39.104 1590145 ERROR neutron.agent.dhcp.agent   File "/usr/lib/python3.6/site-packages/oslo_messaging/_drivers/amqpdriver.py", line 682, in send
2021-10-25 10:36:39.104 1590145 ERROR neutron.agent.dhcp.agent     transport_options=transport_options)
2021-10-25 10:36:39.104 1590145 ERROR neutron.agent.dhcp.agent   File "/usr/lib/python3.6/site-packages/oslo_messaging/_drivers/amqpdriver.py", line 670, in _send
2021-10-25 10:36:39.104 1590145 ERROR neutron.agent.dhcp.agent     call_monitor_timeout)
2021-10-25 10:36:39.104 1590145 ERROR neutron.agent.dhcp.agent   File "/usr/lib/python3.6/site-packages/oslo_messaging/_drivers/amqpdriver.py", line 559, in wait
2021-10-25 10:36:39.104 1590145 ERROR neutron.agent.dhcp.agent     message = self.waiters.get(msg_id, timeout=timeout)
2021-10-25 10:36:39.104 1590145 ERROR neutron.agent.dhcp.agent   File "/usr/lib/python3.6/site-packages/oslo_messaging/_drivers/amqpdriver.py", line 437, in get
2021-10-25 10:36:39.104 1590145 ERROR neutron.agent.dhcp.agent     'to message ID %s' % msg_id)
2021-10-25 10:36:39.104 1590145 ERROR neutron.agent.dhcp.agent oslo_messaging.exceptions.MessagingTimeout: Timed out waiting for a reply to message ID 510dd9e58a274a40a94ac84f0df9f245
2021-10-25 10:36:39.104 1590145 ERROR neutron.agent.dhcp.agent 
2021-10-25 10:36:39.110 1590145 WARNING oslo.service.loopingcall [req-7be84bdb-8eb6-4368-8265-f4aaa9833406 - - - - -] Function 'neutron.agent.dhcp.agent.DhcpAgentWithStateReport._report_state' run outlasted interval by 1170.01 sec
2021-10-25 10:36:39.131 1590145 INFO neutron.agent.dhcp.agent [req-3d184045-4294-46cc-a5e1-018290bc44fe - - - - -] Agent has just been revived. Scheduling full sync

Revision history for this message

Tobias Urdin (tobias-urdin) wrote on 2021-10-25:

#4

I agree with your comment that a potential report_rpc_response_timeout mightas well be a global configuration option if that's what's decided.

Another alternative would be setting timeout to report_interval instead of using rpc_response_timeout but that would make it hardcoded instead of configurable.

Revision history for this message

Oleg Bondarev (obondarev) wrote on 2021-10-25:

#5

Actually I think setting state report rpc timeout to report_interval makes sense: it's not heavy for the server and if there is no response during report_interval likely there is something wrong with the server and there is not much sense to wait more.

OpenStack Infra (hudson-openstack) on 2021-10-25

Changed in neutron:
status:	Opinion → In Progress

Revision history for this message

Tobias Urdin (tobias-urdin) wrote on 2021-10-25:

#6

Yes but it would not be backportable because of the behavior change :(

Revision history for this message

OpenStack Infra (hudson-openstack) wrote on 2021-10-28: Fix merged to neutron (master)

#7

Reviewed: https://review.opendev.org/c/openstack/neutron/+/815310
Committed: https://opendev.org/openstack/neutron/commit/7d552848c272b4fbfdafdc552e54cefd25b6d46a
Submitter: "Zuul (22348)"
Branch: master

commit 7d552848c272b4fbfdafdc552e54cefd25b6d46a
Author: Tobias Urdin <email address hidden>
Date: Mon Oct 25 13:52:03 2021 +0000

Set RPC timeout in PluginReportStateAPI to report_interval

See more details on why this is need in the referenced
bug #1948676

Change-Id: I8a95e80ca74edc8f8f394cefc749c4065a8e0575
Closes-Bug: #1948676

Changed in neutron:
status:	In Progress → Fix Released

Revision history for this message

OpenStack Infra (hudson-openstack) wrote on 2021-10-28: Fix proposed to neutron (stable/xena)

#8

Fix proposed to branch: stable/xena
Review: https://review.opendev.org/c/openstack/neutron/+/815879

Oleg Bondarev (obondarev) on 2021-11-01

Changed in neutron:
importance:	Undecided → Medium
assignee:	nobody → Tobias Urdin (tobias-urdin)

Revision history for this message

OpenStack Infra (hudson-openstack) wrote on 2021-11-11: Fix merged to neutron (stable/xena)

#9

Reviewed: https://review.opendev.org/c/openstack/neutron/+/815879
Committed: https://opendev.org/openstack/neutron/commit/551d36362aa7129bbca4bde56c6957124a13c835
Submitter: "Zuul (22348)"
Branch: stable/xena

commit 551d36362aa7129bbca4bde56c6957124a13c835
Author: Tobias Urdin <email address hidden>
Date: Mon Oct 25 13:52:03 2021 +0000

Set RPC timeout in PluginReportStateAPI to report_interval

See more details on why this is need in the referenced
bug #1948676

    Change-Id: I8a95e80ca74edc8f8f394cefc749c4065a8e0575
    Closes-Bug: #1948676
    (cherry picked from commit 7d552848c272b4fbfdafdc552e54cefd25b6d46a)

tags:

added: in-stable-xena

Revision history for this message

OpenStack Infra (hudson-openstack) wrote on 2021-11-11: Fix proposed to neutron (stable/wallaby)

#10

Fix proposed to branch: stable/wallaby
Review: https://review.opendev.org/c/openstack/neutron/+/817544

Revision history for this message

OpenStack Infra (hudson-openstack) wrote on 2021-11-11: Fix proposed to neutron (stable/victoria)

#11

Fix proposed to branch: stable/victoria
Review: https://review.opendev.org/c/openstack/neutron/+/817578

Revision history for this message

OpenStack Infra (hudson-openstack) wrote on 2021-11-15: Fix merged to neutron (stable/wallaby)

#12

Reviewed: https://review.opendev.org/c/openstack/neutron/+/817544
Committed: https://opendev.org/openstack/neutron/commit/7f9a3f329340074a24b5d881fb762e5f0a91d47e
Submitter: "Zuul (22348)"
Branch: stable/wallaby

commit 7f9a3f329340074a24b5d881fb762e5f0a91d47e
Author: Tobias Urdin <email address hidden>
Date: Mon Oct 25 13:52:03 2021 +0000

Set RPC timeout in PluginReportStateAPI to report_interval

See more details on why this is need in the referenced
bug #1948676

    Change-Id: I8a95e80ca74edc8f8f394cefc749c4065a8e0575
    Closes-Bug: #1948676
    (cherry picked from commit 7d552848c272b4fbfdafdc552e54cefd25b6d46a)
    (cherry picked from commit 551d36362aa7129bbca4bde56c6957124a13c835)

tags:

added: in-stable-wallaby

Revision history for this message

OpenStack Infra (hudson-openstack) wrote on 2021-11-16: Fix merged to neutron (stable/victoria)

#13

Reviewed: https://review.opendev.org/c/openstack/neutron/+/817578
Committed: https://opendev.org/openstack/neutron/commit/ee62d5b01a3deb61ff2559449563722d5eaa1960
Submitter: "Zuul (22348)"
Branch: stable/victoria

commit ee62d5b01a3deb61ff2559449563722d5eaa1960
Author: Tobias Urdin <email address hidden>
Date: Mon Oct 25 13:52:03 2021 +0000

Set RPC timeout in PluginReportStateAPI to report_interval

See more details on why this is need in the referenced
bug #1948676

    Change-Id: I8a95e80ca74edc8f8f394cefc749c4065a8e0575
    Closes-Bug: #1948676
    (cherry picked from commit 7d552848c272b4fbfdafdc552e54cefd25b6d46a)
    (cherry picked from commit 551d36362aa7129bbca4bde56c6957124a13c835)
    (cherry picked from commit 7f9a3f329340074a24b5d881fb762e5f0a91d47e)

tags:

added: in-stable-victoria

Revision history for this message

OpenStack Infra (hudson-openstack) wrote on 2021-11-24: Related fix proposed to neutron (master)

#14

Related fix proposed to branch: master
Review: https://review.opendev.org/c/openstack/neutron/+/819117

Revision history for this message

OpenStack Infra (hudson-openstack) wrote on 2021-11-25: Related fix merged to neutron (master)

#15

Reviewed: https://review.opendev.org/c/openstack/neutron/+/819117
Committed: https://opendev.org/openstack/neutron/commit/a4ffd1a2f98cea425d2e1c887bcaecd655fcd473
Submitter: "Zuul (22348)"
Branch: master

commit a4ffd1a2f98cea425d2e1c887bcaecd655fcd473
Author: yatinkarel <email address hidden>
Date: Wed Nov 24 19:53:19 2021 +0530

Set report_interval to 0 for ovs agent unit tests

    Unit tests are taking longer than usual since [1] and
    causing timeouts in unit tests jobs intermittently. Set
    report_interval(as this is now used to set RPC timeout)
    to 0 so report_state responds immediately.

[1] https://review.opendev.org/q/I8a95e80ca74edc8f8f394cefc749c4065a8e0575

Related-Bug: #1948676
Change-Id: I653c907a4323e19d5bc381cd3716d42c45a75e15

Revision history for this message

OpenStack Infra (hudson-openstack) wrote on 2021-11-25: Related fix proposed to neutron (stable/xena)

#16

Related fix proposed to branch: stable/xena
Review: https://review.opendev.org/c/openstack/neutron/+/819174

Revision history for this message

OpenStack Infra (hudson-openstack) wrote on 2021-11-25: Related fix proposed to neutron (stable/wallaby)

#17

Related fix proposed to branch: stable/wallaby
Review: https://review.opendev.org/c/openstack/neutron/+/819175

Revision history for this message

OpenStack Infra (hudson-openstack) wrote on 2021-11-25: Related fix proposed to neutron (stable/victoria)

#18

Related fix proposed to branch: stable/victoria
Review: https://review.opendev.org/c/openstack/neutron/+/819177

Revision history for this message

OpenStack Infra (hudson-openstack) wrote on 2021-11-25: Related fix merged to neutron (stable/victoria)

#19

Reviewed: https://review.opendev.org/c/openstack/neutron/+/819177
Committed: https://opendev.org/openstack/neutron/commit/8439348addb649459609f22fecc9a6fb8be7ed75
Submitter: "Zuul (22348)"
Branch: stable/victoria

commit 8439348addb649459609f22fecc9a6fb8be7ed75
Author: yatinkarel <email address hidden>
Date: Wed Nov 24 19:53:19 2021 +0530

Set report_interval to 0 for ovs agent unit tests

    Unit tests are taking longer than usual since [1] and
    causing timeouts in unit tests jobs intermittently. Set
    report_interval(as this is now used to set RPC timeout)
    to 0 so report_state responds immediately.

[1] https://review.opendev.org/q/I8a95e80ca74edc8f8f394cefc749c4065a8e0575

    Related-Bug: #1948676
    Change-Id: I653c907a4323e19d5bc381cd3716d42c45a75e15
    (cherry picked from commit a4ffd1a2f98cea425d2e1c887bcaecd655fcd473)

Revision history for this message

OpenStack Infra (hudson-openstack) wrote on 2021-11-25: Related fix merged to neutron (stable/wallaby)

#20

Reviewed: https://review.opendev.org/c/openstack/neutron/+/819175
Committed: https://opendev.org/openstack/neutron/commit/3762c46d8a79555134a6b42daeea1d85cdf0ab1b
Submitter: "Zuul (22348)"
Branch: stable/wallaby

commit 3762c46d8a79555134a6b42daeea1d85cdf0ab1b
Author: yatinkarel <email address hidden>
Date: Wed Nov 24 19:53:19 2021 +0530

Set report_interval to 0 for ovs agent unit tests

    Unit tests are taking longer than usual since [1] and
    causing timeouts in unit tests jobs intermittently. Set
    report_interval(as this is now used to set RPC timeout)
    to 0 so report_state responds immediately.

[1] https://review.opendev.org/q/I8a95e80ca74edc8f8f394cefc749c4065a8e0575

    Related-Bug: #1948676
    Change-Id: I653c907a4323e19d5bc381cd3716d42c45a75e15
    (cherry picked from commit a4ffd1a2f98cea425d2e1c887bcaecd655fcd473)

Revision history for this message

OpenStack Infra (hudson-openstack) wrote on 2021-11-25: Related fix merged to neutron (stable/xena)

#21

Reviewed: https://review.opendev.org/c/openstack/neutron/+/819174
Committed: https://opendev.org/openstack/neutron/commit/3bc100140f3b6e19387c1e46e26fa70d34aaa8b8
Submitter: "Zuul (22348)"
Branch: stable/xena

commit 3bc100140f3b6e19387c1e46e26fa70d34aaa8b8
Author: yatinkarel <email address hidden>
Date: Wed Nov 24 19:53:19 2021 +0530

Set report_interval to 0 for ovs agent unit tests

    Unit tests are taking longer than usual since [1] and
    causing timeouts in unit tests jobs intermittently. Set
    report_interval(as this is now used to set RPC timeout)
    to 0 so report_state responds immediately.

[1] https://review.opendev.org/q/I8a95e80ca74edc8f8f394cefc749c4065a8e0575

    Related-Bug: #1948676
    Change-Id: I653c907a4323e19d5bc381cd3716d42c45a75e15
    (cherry picked from commit a4ffd1a2f98cea425d2e1c887bcaecd655fcd473)

Revision history for this message

OpenStack Infra (hudson-openstack) wrote on 2022-01-10: Fix included in openstack/neutron 19.1.0

#22

This issue was fixed in the openstack/neutron 19.1.0 release.

Revision history for this message

OpenStack Infra (hudson-openstack) wrote on 2022-01-13: Fix included in openstack/neutron 17.3.0

#23

This issue was fixed in the openstack/neutron 17.3.0 release.

Revision history for this message

OpenStack Infra (hudson-openstack) wrote on 2022-01-13: Fix included in openstack/neutron 18.2.0

#24

This issue was fixed in the openstack/neutron 18.2.0 release.

Revision history for this message

OpenStack Infra (hudson-openstack) wrote on 2022-03-10: Fix included in openstack/neutron 20.0.0.0rc1

#25

This issue was fixed in the openstack/neutron 20.0.0.0rc1 release candidate.

neutron

rpc response timeout for agent report_state is not possible

Bug Description

Other bug subscribers

Remote bug watches