MaxRetriesExceeded sometimes fails with messaging exception

Bug #1837955 reported by Erik Olof Gunnar Andersson on 2019-07-25
This bug affects 1 person
Affects                   Importance  Assigned to
OpenStack Compute (nova)  Medium      Erik Olof Gunnar Andersson
Queens                    Medium      Matt Riedemann
Rocky                     Medium      Matt Riedemann
Stein                     Medium      Matt Riedemann

Bug Description

We are occasionally seeing MaxRetriesExceeded cause an "Exception during message handling" error. This prevents the instance from being set to ERROR state in the database and leaves it stuck in the scheduling state.

Example logs:
WARNING nova.scheduler.client.report [req-] Unable to submit allocation for instance x (409 {"errors": [{"status": 409, "request_id": "req-", "code": "placement.undefined_code", "detail": "There was a conflict when trying to complete your request.\n\n Unable to allocate inventory: Unable to create allocation for 'DISK_GB' on resource provider 'req-'. The requested amount would exceed the capacity. ", "title": "Conflict"}]})
ERROR oslo_messaging.rpc.server [req-] Exception during message handling: MaxRetriesExceeded: Exceeded maximum number of retries. Exhausted all hosts available for retrying build failures for instance x.

Duc Truong (dtruong) wrote:

This is the stack trace for this exception:

TRACE oslo_messaging.rpc.server Traceback (most recent call last):
TRACE oslo_messaging.rpc.server File "/usr/local/openstack/lib/python2.7/site-packages/oslo_messaging/rpc/server.py", line 163, in _process_incoming
TRACE oslo_messaging.rpc.server res = self.dispatcher.dispatch(message)
TRACE oslo_messaging.rpc.server File "/usr/local/openstack/lib/python2.7/site-packages/oslo_messaging/rpc/dispatcher.py", line 265, in dispatch
TRACE oslo_messaging.rpc.server return self._do_dispatch(endpoint, method, ctxt, args)
TRACE oslo_messaging.rpc.server File "/usr/local/openstack/lib/python2.7/site-packages/oslo_messaging/rpc/dispatcher.py", line 194, in _do_dispatch
TRACE oslo_messaging.rpc.server result = func(ctxt, **new_args)
TRACE oslo_messaging.rpc.server File "/usr/local/openstack/lib/python2.7/site-packages/nova/conductor/manager.py", line 676, in build_instances
TRACE oslo_messaging.rpc.server raise exception.MaxRetriesExceeded(reason=msg)
TRACE oslo_messaging.rpc.server MaxRetriesExceeded: Exceeded maximum number of retries. Exhausted all hosts available for retrying build failures for instance X.

Fix proposed to branch: master
Review: https://review.opendev.org/672855

Changed in nova:
assignee: nobody → Erik Olof Gunnar Andersson (eandersson)
status: New → In Progress
melanie witt (melwitt) wrote:

Looks like this bug was introduced back in Queens with this change:

https://review.opendev.org/511358

There is one call site where MaxRetriesExceeded is raised that is not inside a try-except block. This is why the proper notification and setting of instance to ERROR state does not occur in that case.
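The shape of the problem can be sketched in a few lines. This is a minimal, hypothetical illustration of the control flow described above, not nova's actual code; the function names `build_instances_buggy`, `build_instances_fixed`, and the `_cleanup` helper are all illustrative.

```python
# Minimal sketch of the control-flow bug: a raise site outside any
# try-except skips the error-handling path entirely.

class MaxRetriesExceeded(Exception):
    pass


def _cleanup(instance, exc):
    # What the error path should do: set the instance to ERROR with a
    # fault message so it does not stay stuck in scheduling.
    instance["vm_state"] = "ERROR"
    instance["fault"] = str(exc)


def build_instances_buggy(hosts_available, instance):
    if not hosts_available:
        # Bug: raised outside any try-except, so no cleanup runs and
        # the exception bubbles up to the RPC dispatcher; the instance
        # is left in BUILD forever.
        raise MaxRetriesExceeded("Exhausted all hosts available")
    instance["vm_state"] = "ACTIVE"


def build_instances_fixed(hosts_available, instance):
    if not hosts_available:
        # Fix: clean up first, then return instead of letting the
        # exception escape the conductor method.
        _cleanup(instance, MaxRetriesExceeded("Exhausted all hosts available"))
        return
    instance["vm_state"] = "ACTIVE"
```

With the buggy variant, the caller sees the exception but the instance dict still says BUILD; with the fixed variant, the instance ends up in ERROR with a fault recorded.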

Changed in nova:
importance: Undecided → Medium
tags: added: conductor

Related fix proposed to branch: master
Review: https://review.opendev.org/673357

Changed in nova:
assignee: Erik Olof Gunnar Andersson (eandersson) → Matt Riedemann (mriedem)
Matt Riedemann (mriedem) on 2019-07-29
Changed in nova:
assignee: Matt Riedemann (mriedem) → Erik Olof Gunnar Andersson (eandersson)
Changed in nova:
assignee: Erik Olof Gunnar Andersson (eandersson) → Eric Fried (efried)
Eric Fried (efried) on 2019-07-29
Changed in nova:
assignee: Eric Fried (efried) → Erik Olof Gunnar Andersson (eandersson)

Reviewed: https://review.opendev.org/673357
Committed: https://git.openstack.org/cgit/openstack/nova/commit/?id=5cc39fc51eee6e39f584394283f009aa6809d17b
Submitter: Zuul
Branch: master

commit 5cc39fc51eee6e39f584394283f009aa6809d17b
Author: Matt Riedemann <email address hidden>
Date: Mon Jul 29 15:36:42 2019 -0400

    Add functional regression test for bug 1837955

    This adds a functional regression recreate test for
    bug 1837955 which was introduced with change
    Iae904afb6cb4fcea8bb27741d774ffbe986a5fb4 in the Queens
    release.

    In this scenario, the primary (and potentially alternate)
    hosts for a server build fail and reschedule to conductor.
    Eventually all alternate hosts are exhausted and specifically
    trying to claim allocations against the alternates fails,
    probably because between the time of initial scheduling and
    rescheduling something else took up the spare capacity on the
    alternate host.

    When this happens, MaxRetriesExceeded is raised but the
    instance is stuck in BUILD status rather than set to ERROR
    status with a fault message.

    Change-Id: I4ca64dd60d883356880680fb1f04cee4136c2e00
    Related-Bug: #1837955

Reviewed: https://review.opendev.org/672855
Committed: https://git.openstack.org/cgit/openstack/nova/commit/?id=b98d4ba6d54f5ca2999a8fe6b6d7dfcc134061df
Submitter: Zuul
Branch: master

commit b98d4ba6d54f5ca2999a8fe6b6d7dfcc134061df
Author: Erik Olof Gunnar Andersson <email address hidden>
Date: Thu Jul 25 20:19:40 2019 -0700

    Cleanup when hitting MaxRetriesExceeded from no host_available

    Prior to this patch there was a condition when no
    host_available was true and an exception would get
    raised without first cleaning up the instance.
    This causes instances to get indefinitely stuck in
    a scheduling state.

    This patch fixes this by calling the clean up function
    and then exits build_instances using a return statement.

    The related functional regression recreate test is updated
    to show this fixes the bug.

    Change-Id: I6a2c63a4c33e783100208fd3f45eb52aad49e3d6
    Closes-bug: #1837955

Changed in nova:
status: In Progress → Fix Released
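For context, the other MaxRetriesExceeded raise sites in conductor are wrapped so that the exception triggers the error path rather than escaping to RPC. The sketch below is a rough illustration of that wrapping pattern under stated assumptions: `_set_vm_state_and_notify` borrows its name from the helper mentioned in the backport notes, but its body and the `schedule_with_cleanup` wrapper are illustrative, not nova's exact signatures.

```python
# Hypothetical sketch of the try-except wrapping used at the handled
# raise sites: the exception is caught, the instance is marked ERROR,
# and a notification is recorded, instead of the exception escaping
# to the RPC dispatcher.

class MaxRetriesExceeded(Exception):
    pass


def _set_vm_state_and_notify(instance, vm_state, notifications, exc):
    # Illustrative stand-in for nova's helper of the same name.
    instance["vm_state"] = vm_state
    instance["fault"] = str(exc)
    notifications.append(("compute_task.build_instances.error", str(exc)))


def schedule_with_cleanup(select_hosts, instance, notifications):
    try:
        return select_hosts()
    except MaxRetriesExceeded as exc:
        # Wrapped raise site: the error path runs and the caller gets
        # None back rather than an unhandled exception.
        _set_vm_state_and_notify(instance, "ERROR", notifications, exc)
        return None
```

The fix brings the previously unwrapped `host_available` path in line with this behavior: cleanup first, then return.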

Reviewed: https://review.opendev.org/673532
Committed: https://git.openstack.org/cgit/openstack/nova/commit/?id=0d2e27fd018e08617964dc38321ebfa8f1a5a4f9
Submitter: Zuul
Branch: stable/stein

commit 0d2e27fd018e08617964dc38321ebfa8f1a5a4f9
Author: Matt Riedemann <email address hidden>
Date: Mon Jul 29 15:36:42 2019 -0400

    Add functional regression test for bug 1837955

    This adds a functional regression recreate test for
    bug 1837955 which was introduced with change
    Iae904afb6cb4fcea8bb27741d774ffbe986a5fb4 in the Queens
    release.

    In this scenario, the primary (and potentially alternate)
    hosts for a server build fail and reschedule to conductor.
    Eventually all alternate hosts are exhausted and specifically
    trying to claim allocations against the alternates fails,
    probably because between the time of initial scheduling and
    rescheduling something else took up the spare capacity on the
    alternate host.

    When this happens, MaxRetriesExceeded is raised but the
    instance is stuck in BUILD status rather than set to ERROR
    status with a fault message.

    Change-Id: I4ca64dd60d883356880680fb1f04cee4136c2e00
    Related-Bug: #1837955
    (cherry picked from commit 5cc39fc51eee6e39f584394283f009aa6809d17b)

tags: added: in-stable-stein

Reviewed: https://review.opendev.org/673533
Committed: https://git.openstack.org/cgit/openstack/nova/commit/?id=fcc2b9e33ed2b3d9f469b458e0f46011fe9883ac
Submitter: Zuul
Branch: stable/stein

commit fcc2b9e33ed2b3d9f469b458e0f46011fe9883ac
Author: Erik Olof Gunnar Andersson <email address hidden>
Date: Thu Jul 25 20:19:40 2019 -0700

    Cleanup when hitting MaxRetriesExceeded from no host_available

    Prior to this patch there was a condition when no
    host_available was true and an exception would get
    raised without first cleaning up the instance.
    This causes instances to get indefinitely stuck in
    a scheduling state.

    This patch fixes this by calling the clean up function
    and then exits build_instances using a return statement.

    The related functional regression recreate test is updated
    to show this fixes the bug.

    Change-Id: I6a2c63a4c33e783100208fd3f45eb52aad49e3d6
    Closes-bug: #1837955
    (cherry picked from commit b98d4ba6d54f5ca2999a8fe6b6d7dfcc134061df)

Reviewed: https://review.opendev.org/673536
Committed: https://git.openstack.org/cgit/openstack/nova/commit/?id=f292a92a89b452c66b5799ac309a5f623ee7b16c
Submitter: Zuul
Branch: stable/rocky

commit f292a92a89b452c66b5799ac309a5f623ee7b16c
Author: Matt Riedemann <email address hidden>
Date: Mon Jul 29 15:36:42 2019 -0400

    Add functional regression test for bug 1837955

    This adds a functional regression recreate test for
    bug 1837955 which was introduced with change
    Iae904afb6cb4fcea8bb27741d774ffbe986a5fb4 in the Queens
    release.

    In this scenario, the primary (and potentially alternate)
    hosts for a server build fail and reschedule to conductor.
    Eventually all alternate hosts are exhausted and specifically
    trying to claim allocations against the alternates fails,
    probably because between the time of initial scheduling and
    rescheduling something else took up the spare capacity on the
    alternate host.

    When this happens, MaxRetriesExceeded is raised but the
    instance is stuck in BUILD status rather than set to ERROR
    status with a fault message.

    NOTE(mriedem): In this backport the test is changed slightly
    since change Id8b5c48a6e8cf65dc0a7dc13a80a0a72684f70d9 is
    not in Rocky so the ProviderUsageBaseTestCase.neutron
    attribute does not exist.

    Change-Id: I4ca64dd60d883356880680fb1f04cee4136c2e00
    Related-Bug: #1837955
    (cherry picked from commit 5cc39fc51eee6e39f584394283f009aa6809d17b)
    (cherry picked from commit 0d2e27fd018e08617964dc38321ebfa8f1a5a4f9)

tags: added: in-stable-rocky

Reviewed: https://review.opendev.org/673553
Committed: https://git.openstack.org/cgit/openstack/nova/commit/?id=e3b68a1c8bbedd877cfc988898f7e458ab067f28
Submitter: Zuul
Branch: stable/rocky

commit e3b68a1c8bbedd877cfc988898f7e458ab067f28
Author: Erik Olof Gunnar Andersson <email address hidden>
Date: Thu Jul 25 20:19:40 2019 -0700

    Cleanup when hitting MaxRetriesExceeded from no host_available

    Prior to this patch there was a condition when no
    host_available was true and an exception would get
    raised without first cleaning up the instance.
    This causes instances to get indefinitely stuck in
    a scheduling state.

    This patch fixes this by calling the clean up function
    and then exits build_instances using a return statement.

    The related functional regression recreate test is updated
    to show this fixes the bug.

    NOTE(mriedem): There are three changes in this backport. First,
    since bug 1819460 and change I78fc2312274471a7bd85a263de12cc5a0b19fd10
    do not apply to Rocky, _cleanup_when_reschedule_fails is added here.
    Second, the test_bug_1837955 setUp needed to stub notifications
    since change I017d1a31139c9300642dd706eadc265f7c954ca8 is not
    in Rocky to do that in ProviderUsageBaseTestCase. Third, the unit
    test is changed to mock _set_vm_state_and_notify since
    change Ibfb0a6db5920d921c4fc7cabf3f4d2838ea7f421 is not in Rocky
    for compute utils to call notify_about_compute_task_error.

    Change-Id: I6a2c63a4c33e783100208fd3f45eb52aad49e3d6
    Closes-bug: #1837955
    (cherry picked from commit b98d4ba6d54f5ca2999a8fe6b6d7dfcc134061df)
    (cherry picked from commit fcc2b9e33ed2b3d9f469b458e0f46011fe9883ac)

Reviewed: https://review.opendev.org/673567
Committed: https://git.openstack.org/cgit/openstack/nova/commit/?id=26ec2cbe5975f1687cce3e6b81b448ed51569cd6
Submitter: Zuul
Branch: stable/queens

commit 26ec2cbe5975f1687cce3e6b81b448ed51569cd6
Author: Matt Riedemann <email address hidden>
Date: Mon Jul 29 15:36:42 2019 -0400

    Add functional regression test for bug 1837955

    This adds a functional regression recreate test for
    bug 1837955 which was introduced with change
    Iae904afb6cb4fcea8bb27741d774ffbe986a5fb4 in the Queens
    release.

    In this scenario, the primary (and potentially alternate)
    hosts for a server build fail and reschedule to conductor.
    Eventually all alternate hosts are exhausted and specifically
    trying to claim allocations against the alternates fails,
    probably because between the time of initial scheduling and
    rescheduling something else took up the spare capacity on the
    alternate host.

    When this happens, MaxRetriesExceeded is raised but the
    instance is stuck in BUILD status rather than set to ERROR
    status with a fault message.

    NOTE(mriedem): In this backport the test is changed slightly
    since change Iea283322124cb35fc0bc6d25f35548621e8c8c2f is not
    in Queens so ProviderUsageBaseTestCase is still in test_servers.py.

    Change-Id: I4ca64dd60d883356880680fb1f04cee4136c2e00
    Related-Bug: #1837955
    (cherry picked from commit 5cc39fc51eee6e39f584394283f009aa6809d17b)
    (cherry picked from commit 0d2e27fd018e08617964dc38321ebfa8f1a5a4f9)
    (cherry picked from commit f292a92a89b452c66b5799ac309a5f623ee7b16c)

tags: added: in-stable-queens

Reviewed: https://review.opendev.org/673576
Committed: https://git.openstack.org/cgit/openstack/nova/commit/?id=3ec4c5ed9c93ee261a948db9679d1550f58c2882
Submitter: Zuul
Branch: stable/queens

commit 3ec4c5ed9c93ee261a948db9679d1550f58c2882
Author: Erik Olof Gunnar Andersson <email address hidden>
Date: Thu Jul 25 20:19:40 2019 -0700

    Cleanup when hitting MaxRetriesExceeded from no host_available

    Prior to this patch there was a condition when no
    host_available was true and an exception would get
    raised without first cleaning up the instance.
    This causes instances to get indefinitely stuck in
    a scheduling state.

    This patch fixes this by calling the clean up function
    and then exits build_instances using a return statement.

    The related functional regression recreate test is updated
    to show this fixes the bug.

    Change-Id: I6a2c63a4c33e783100208fd3f45eb52aad49e3d6
    Closes-bug: #1837955
    (cherry picked from commit b98d4ba6d54f5ca2999a8fe6b6d7dfcc134061df)
    (cherry picked from commit fcc2b9e33ed2b3d9f469b458e0f46011fe9883ac)
    (cherry picked from commit e3b68a1c8bbedd877cfc988898f7e458ab067f28)

This issue was fixed in the openstack/nova 19.0.2 release.

This issue was fixed in the openstack/nova 17.0.12 release.
