Agents might be reported as down for 10 minutes after all controllers restart

Bug #1606827 reported by John Schwarz
Affects: neutron
Status: Fix Released
Importance: Medium
Assigned to: John Schwarz
Milestone:

Bug Description

The scenario which initially revealed this issue involved multiple controllers and an extra compute node (4 nodes in total), but it should also reproduce on smaller deployments.

The issue is that if an agent tries to report_state to the neutron-server and the call fails because of a timeout (raising oslo_messaging.MessagingTimeout), then the exponential back-off put in place by [1] kicks in. That feature was intended for heavy RPC calls (like get_routers()) and not for light calls such as report_state, so this can be considered a regression. It can be reproduced by restarting the controllers on a TripleO deployment such as the one described above.

A solution would be to ensure PluginReportStateAPI doesn't use the exponential back-off and instead always times out after rpc_response_timeout.

[1]: https://review.openstack.org/#/c/280595/14/neutron/common/rpc.py
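
For readers unfamiliar with the back-off wrapper, here is a minimal sketch of the behaviour [1] introduced, as it is described in this report: the wait-for-response timeout doubles after every oslo_messaging.MessagingTimeout, with a random sleep before each retry. This is not the actual neutron/common/rpc.py code; the class and parameter names are illustrative only.

    # Minimal sketch of the back-off behaviour described above (illustrative,
    # not the actual neutron/common/rpc.py implementation).
    import random
    import time

    import oslo_messaging


    class BackoffClientSketch(object):
        """Doubles the wait-for-response timeout after every MessagingTimeout."""

        def __init__(self, client, base_timeout=60, max_timeout=600):
            self._client = client            # an oslo_messaging RPCClient
            self._base_timeout = base_timeout
            self._max_timeout = max_timeout
            self._timeout = base_timeout     # grows: 60, 120, 240, ...

        def call(self, context, method, **kwargs):
            while True:
                cctxt = self._client.prepare(timeout=self._timeout)
                try:
                    return cctxt.call(context, method, **kwargs)
                except oslo_messaging.MessagingTimeout:
                    # Double the timeout for the next attempt and sleep a
                    # random interval (bounded by rpc_response_timeout, here
                    # base_timeout) before retrying.
                    self._timeout = min(self._timeout * 2, self._max_timeout)
                    time.sleep(random.uniform(0, self._base_timeout))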

John Schwarz (jschwarz)
description: updated
tags: added: mitaka-backport-potential
tags: added: liberty-backport-potential
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to neutron (master)

Fix proposed to branch: master
Review: https://review.openstack.org/347708

Changed in neutron:
assignee: nobody → John Schwarz (jschwarz)
status: New → In Progress
Changed in neutron:
importance: Undecided → Medium
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to neutron (master)

Reviewed: https://review.openstack.org/347708
Committed: https://git.openstack.org/cgit/openstack/neutron/commit/?id=bb989be99db84a2789abe2849c786a075e3f5ab7
Submitter: Jenkins
Branch: master

commit bb989be99db84a2789abe2849c786a075e3f5ab7
Author: John Schwarz <email address hidden>
Date: Wed Jul 27 12:09:30 2016 +0300

    Don't use exponential back-off for report_state

    If an agent tries to report_state to the neutron-server and it fails
    because of a timeout (raising oslo_messaging.MessagingTimeout), then
    there is an exponential back-off effect, which causes the
    seemingly-simple report_state RPC call to take 60 seconds, then 120,
    then 240 and so on. This can happen if all the controllers are
    restarted simultaneously a number of times, as the bug report describes.

    Since the feature was intended for heavy RPC calls (like get_routers())
    and not for light calls such as report_state, it's safe to reduce the
    timeout to a constant 60 seconds interval.

    Closes-Bug: #1606827
    Change-Id: I15aeea9f8265b859bb1a8ee933b8b2ce1e64b695
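
As a rough illustration of what the commit message describes, the sketch below pins report_state to a single constant timeout instead of the growing one. It is a simplified stand-in, not the merged patch (see the review linked above); the class and parameter names are illustrative.

    # Hedged sketch of the fix's idea: a fixed timeout for report_state.
    # Not the merged patch; names are illustrative.
    class PluginReportStateAPISketch(object):
        def __init__(self, client, response_timeout=60):
            # `client` is an oslo_messaging RPCClient for the plugin topic;
            # `response_timeout` would come from conf.rpc_response_timeout.
            self._client = client
            self._timeout = response_timeout

        def report_state(self, context, agent_state, use_call=False):
            # Pin the timeout so a stuck report_state never waits longer than
            # rpc_response_timeout (60 seconds by default).
            cctxt = self._client.prepare(timeout=self._timeout)
            if use_call:
                return cctxt.call(context, 'report_state',
                                  agent_state=agent_state)
            return cctxt.cast(context, 'report_state',
                              agent_state=agent_state)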

Changed in neutron:
status: In Progress → Fix Released
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to neutron (stable/mitaka)

Fix proposed to branch: stable/mitaka
Review: https://review.openstack.org/349038

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to neutron (stable/liberty)

Fix proposed to branch: stable/liberty
Review: https://review.openstack.org/349039

Revision history for this message
Kevin Benton (kevinbenton) wrote :

I don't think this actually changed anything and we should revert it.

The part that exponentially increases is the amount of time it will wait for a response from the server, not the amount of time it backs off.

Consider this (verified behavior on devstack):

Server goes offline
Agent report state (start timer for 60 seconds)
Agent report state timeout exception
Agent sleeps random(0, rpc.TRANSPORT.conf.rpc_response_timeout)[1]
Agent report state (start timer for 120 seconds)
Agent report state timeout exception
Agent sleeps random(0, rpc.TRANSPORT.conf.rpc_response_timeout)
Agent report state (start timer for 240 seconds)
Server resumes after 150 seconds
Server processes messages in 'reports' queue.
Agent gets report state response.

There is no point in changing the exponential timeout increase, because the server will process the report_state as long as it's in that timeout window.

In the worst possible case, the maximum time an agent will go without a report_state in the queue is rpc.TRANSPORT.conf.rpc_response_timeout.

1. https://github.com/openstack/neutron/blob/9f4f6c8db27f4838a11b4a271e96c372f01118dd/neutron/common/rpc.py#L141
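
To make the arithmetic behind this timeline concrete, here is a small, hypothetical simulation of the time bookkeeping only (wait windows plus back-off sleeps); it does not model real AMQP queues, and the simulate() helper and its numbers are illustrative, not Neutron code.

    # Hypothetical simulation of the timeline above: time bookkeeping only.
    import random

    RPC_RESPONSE_TIMEOUT = 60  # seconds; the default rpc_response_timeout


    def simulate(server_back_at, max_timeout=600):
        """Return (attempt, time) at which report_state finally gets a reply."""
        now, window, attempt = 0.0, RPC_RESPONSE_TIMEOUT, 1
        while True:
            # The agent sends report_state and waits up to `window` seconds.
            if server_back_at <= now + window:
                # The server comes back inside the wait window and answers the
                # queued report_state, so the agent gets its response.
                return attempt, max(now, server_back_at)
            # MessagingTimeout: sleep random(0, rpc_response_timeout), then
            # retry with a doubled wait window (60, 120, 240, ...).
            now += window + random.uniform(0, RPC_RESPONSE_TIMEOUT)
            window = min(window * 2, max_timeout)
            attempt += 1


    if __name__ == '__main__':
        # Server resumes after 150 seconds, as in the timeline above.
        print(simulate(server_back_at=150))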

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix proposed to neutron (master)

Related fix proposed to branch: master
Review: https://review.openstack.org/349113

Revision history for this message
John Schwarz (jschwarz) wrote :

@Kevin, the scenario we detected doesn't concern only the neutron-server being restarted, but also the rabbitmq server. The resulting state is that, for reasons we haven't pinned down, even though the rabbitmq server is up, the report_state RPC doesn't return immediately and in fact stays "hung" until the timeout expires. Then, once the timeout expires and report_state retries itself, the call succeeds almost immediately.

In some cases we've observed that if the timeout is 600 seconds (the maximum), it can take quite a long time (we observed just under 10 minutes) to actually retry the call and succeed.

I'm pasting the traceback we encountered at [1], which demonstrates this issue. Granted, this "restart all the controllers" business went on a few times before the maximum timeout reached 600 seconds, but it's a possible scenario nonetheless (power outages, etc.). Notice that in the traceback there is a just-under-10-minutes sleep at 15:28:45, the call times out at 15:38:30, the code sleeps for a second and executes again, and finally at 15:38:32 the call succeeds immediately.

Thus, the scenario of "once the server restarts, everything will work" is one possible scenario, but this issue describes a different one :) Either way, lowering the timeout as the merged patch did doesn't hurt either scenario.

[1]: http://paste.openstack.org/show/544333/

Revision history for this message
Armando Migliaccio (armando-migliaccio) wrote :

I agree that keeping the report_state timeout at a fixed value can be beneficial, but I cannot see the correlation between the fix and the failure as filed in this bug report.

Revision history for this message
Kevin Benton (kevinbenton) wrote :

@John,

So the oslo.messaging driver is not detecting that AMQP has killed its connection. In this deployment, what did you have heartbeat_timeout_threshold set to? I would like to understand this failure more, because it seems that the messaging client might have a pretty bad limitation in that it doesn't detect connection failures once it's waiting for a response.

Revision history for this message
John Schwarz (jschwarz) wrote :

Sorry for the delay, PTO :) The value of heartbeat_timeout_threshold was 60 on the deployment where we discovered this issue.

@Armando, I agree this is ultimately an oslo.messaging issue; however, I think that waiting until the oslo guys fix it (which might be a lot more complicated and time-consuming) is counter-productive, since the Neutron solution is very straightforward.

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Change abandoned on neutron (master)

Change abandoned by Kevin Benton (<email address hidden>) on branch: master
Review: https://review.openstack.org/349113
Reason: the patch seemed to actually be a workaround for an oslo messaging bug (not detecting lost connection). I think going with this is okay until that's resolved.

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to neutron (stable/mitaka)

Reviewed: https://review.openstack.org/349038
Committed: https://git.openstack.org/cgit/openstack/neutron/commit/?id=bf5b8cafc29f2936c379081d2da9bfa317b44249
Submitter: Jenkins
Branch: stable/mitaka

commit bf5b8cafc29f2936c379081d2da9bfa317b44249
Author: John Schwarz <email address hidden>
Date: Wed Jul 27 12:09:30 2016 +0300

    Don't use exponential back-off for report_state

    If an agent tries to report_state to the neutron-server and it fails
    because of a timeout (raising oslo_messaging.MessagingTimeout), then
    there is an exponential back-off effect, which causes the
    seemingly-simple report_state RPC call to take 60 seconds, then 120,
    then 240 and so on. This can happen if all the controllers are
    restarted simultaneously a number of times, as the bug report describes.

    Since the feature was intended for heavy RPC calls (like get_routers())
    and not for light calls such as report_state, it's safe to reduce the
    timeout to a constant 60 seconds interval.

    Closes-Bug: #1606827
    Change-Id: I15aeea9f8265b859bb1a8ee933b8b2ce1e64b695
    (cherry picked from commit bb989be99db84a2789abe2849c786a075e3f5ab7)

tags: added: in-stable-mitaka
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to neutron (stable/liberty)

Reviewed: https://review.openstack.org/349039
Committed: https://git.openstack.org/cgit/openstack/neutron/commit/?id=1287e86c5f52210945766d3306610269abf59f11
Submitter: Jenkins
Branch: stable/liberty

commit 1287e86c5f52210945766d3306610269abf59f11
Author: John Schwarz <email address hidden>
Date: Wed Jul 27 12:09:30 2016 +0300

    Don't use exponential back-off for report_state

    If an agent tries to report_state to the neutron-server and it fails
    because of a timeout (raising oslo_messaging.MessagingTimeout), then
    there is an exponential back-off effect, which causes the
    seemingly-simple report_state RPC call to take 60 seconds, then 120,
    then 240 and so on. This can happen if all the controllers are
    restarted simultaneously a number of times, as the bug report describes.

    Since the feature was intended for heavy RPC calls (like get_routers())
    and not for light calls such as report_state, it's safe to reduce the
    timeout to a constant 60 seconds interval.

    Closes-Bug: #1606827
    Change-Id: I15aeea9f8265b859bb1a8ee933b8b2ce1e64b695
    (cherry picked from commit bb989be99db84a2789abe2849c786a075e3f5ab7)

tags: added: in-stable-liberty
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/neutron 9.0.0.0b3

This issue was fixed in the openstack/neutron 9.0.0.0b3 development milestone.

tags: removed: liberty-backport-potential mitaka-backport-potential
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/neutron 7.2.0

This issue was fixed in the openstack/neutron 7.2.0 release.

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/neutron 8.3.0

This issue was fixed in the openstack/neutron 8.3.0 release.
