instance stuck in BUILD state due to unhandled exceptions in conductor

Bug #1819460 reported by Balazs Gibizer
This bug affects 1 person
Affects: OpenStack Compute (nova); Importance: Medium; Assigned to: Balazs Gibizer
Affects: Stein; Importance: Medium; Assigned to: Balazs Gibizer

Bug Description

There are two calls[1][2] in ConductorTaskManager.build_instances, used during re-schedule, that can raise exceptions. When they do, the instance is left stuck in BUILD state instead of going to ERROR state.

[1] https://github.com/openstack/nova/blob/892ead1438abc9a8a876209343e6a85c80f0059f/nova/conductor/manager.py#L670
[2] https://github.com/openstack/nova/blob/892ead1438abc9a8a876209343e6a85c80f0059f/nova/conductor/manager.py#L679
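To illustrate the failure mode, here is a minimal sketch in plain Python. It is not nova code: Instance, PlacementError, claim_resources and the two reschedule_* functions are hypothetical stand-ins, and the real build_instances does considerably more. It only shows how an unhandled exception can skip the state update, and how catching it lets the instance reach ERROR.

```python
# Hypothetical, simplified model of the failure mode. None of these names
# are nova's real classes or signatures.


class PlacementError(Exception):
    """Stand-in for any error raised while talking to placement."""


class Instance:
    def __init__(self):
        # A new instance starts in BUILD state.
        self.vm_state = "BUILD"


def claim_resources(instance, host):
    # Stand-in for a call that can fail during re-schedule.
    raise PlacementError("claim failed on %s" % host)


def reschedule_unhandled(instance, host):
    # The exception propagates, so the state is never updated.
    claim_resources(instance, host)
    instance.vm_state = "ACTIVE"


def reschedule_handled(instance, host):
    # Catch the error and move the instance to ERROR explicitly.
    try:
        claim_resources(instance, host)
    except PlacementError:
        instance.vm_state = "ERROR"
        return
    instance.vm_state = "ACTIVE"


stuck = Instance()
try:
    reschedule_unhandled(stuck, "host1")
except PlacementError:
    pass  # assumed here: the caller does nothing with the instance
print(stuck.vm_state)

failed = Instance()
reschedule_handled(failed, "host1")
print(failed.vm_state)
```

In this sketch the first instance stays in BUILD because nothing runs after the exception, while the second one ends up in ERROR.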

Changed in nova:
assignee: nobody → Balazs Gibizer (balazs-gibizer)
importance: Undecided → Low
tags: added: conductor
tags: added: re-schedule
OpenStack Infra (hudson-openstack) wrote : Related fix proposed to nova (master)

Related fix proposed to branch: master
Review: https://review.openstack.org/642444

OpenStack Infra (hudson-openstack) wrote : Fix proposed to nova (master)

Fix proposed to branch: master
Review: https://review.openstack.org/642470

Changed in nova:
status: New → In Progress
OpenStack Infra (hudson-openstack) wrote : Change abandoned on nova (master)

Change abandoned by Balazs Gibizer (<email address hidden>) on branch: master
Review: https://review.openstack.org/642470
Reason: It was a mistake to push this with a new Change-Id; here is the original patch: https://review.openstack.org/#/c/639608/

OpenStack Infra (hudson-openstack) wrote : Related fix merged to nova (master)

Reviewed: https://review.openstack.org/642444
Committed: https://git.openstack.org/cgit/openstack/nova/commit/?id=b63c42a0d4836fd0364cb306145d3474619f1e19
Submitter: Zuul
Branch: master

commit b63c42a0d4836fd0364cb306145d3474619f1e19
Author: Balazs Gibizer <email address hidden>
Date: Mon Mar 11 14:39:10 2019 +0100

    Reproduce bug #1819460 in functional test

    There are two calls during ConductorTaskManager.build_instances,
    used during re-schedule, that could potentially raise exceptions,
    leaving the instance stuck in BUILD state instead of going to
    ERROR state.

    This patch adds two functional test cases to reproduce the problems.

    Change-Id: If80c4e4776b81cc06293989ee41d39b53735352b
    Related-Bug: #1819460
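As a rough illustration of how such a reproduction test can be structured, the sketch below uses unittest.mock to force a call to raise and then asserts the buggy outcome. FakeConductor and its methods are hypothetical stand-ins, not nova's ConductorTaskManager or its functional test fixtures, and RuntimeError is only a placeholder for a placement-related exception.

```python
import unittest
from unittest import mock


class FakeConductor:
    """Hypothetical stand-in; not nova's ConductorTaskManager."""

    def claim_resources(self, instance):
        return True

    def build_instances(self, instance):
        # No exception handling here, mirroring the reported bug.
        self.claim_resources(instance)
        instance["vm_state"] = "ACTIVE"


class ReproduceStuckInBuild(unittest.TestCase):
    def test_claim_raises_leaves_instance_in_build(self):
        conductor = FakeConductor()
        instance = {"vm_state": "BUILD"}
        # Force the call to fail, as a placement error would.
        with mock.patch.object(
                conductor, "claim_resources",
                side_effect=RuntimeError("placement unavailable")):
            with self.assertRaises(RuntimeError):
                conductor.build_instances(instance)
        # Documents the buggy behaviour: the state never moved to ERROR.
        self.assertEqual("BUILD", instance["vm_state"])
```

A test of this shape pins down the current (wrong) behaviour first, so that the later fix can flip the final assertion to expect ERROR.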

OpenStack Infra (hudson-openstack) wrote : Related fix proposed to nova (master)

Related fix proposed to branch: master
Review: https://review.openstack.org/648651

OpenStack Infra (hudson-openstack) wrote : Related fix merged to nova (master)

Reviewed: https://review.openstack.org/648651
Committed: https://git.openstack.org/cgit/openstack/nova/commit/?id=fd3b86d1c35efdd8356233863b8ad5b628df8d29
Submitter: Zuul
Branch: master

commit fd3b86d1c35efdd8356233863b8ad5b628df8d29
Author: Balazs Gibizer <email address hidden>
Date: Fri Mar 29 10:03:48 2019 +0100

    Fix exception type in test_boot_reschedule_fill_provider_mapping_raises

    After Iecbee518444bd282ce5f6fd019db41a322f76a83 merged the
    test_boot_reschedule_fill_provider_mapping_raises does not simulate a
    possible scenario as ConsumerAllocationRetrievalFailed cannot be
    raised from _fill_provider_mapping any more.

    The fault this functional test reproduces still exists, just with
    different exception types. So this patch changes the exception type
    to a valid one that is still raised by _fill_provider_mapping.

    Change-Id: I79cf44aafccb1bc6b13a3109c5a98215811bd04d
    Related-Bug: #1819460

OpenStack Infra (hudson-openstack) wrote : Fix merged to nova (master)

Reviewed: https://review.openstack.org/639608
Committed: https://git.openstack.org/cgit/openstack/nova/commit/?id=b096d9303acfea81ca56394cab681d2b2eed2d91
Submitter: Zuul
Branch: master

commit b096d9303acfea81ca56394cab681d2b2eed2d91
Author: Balazs Gibizer <email address hidden>
Date: Mon Mar 11 15:24:05 2019 +0100

    Handle placement error during re-schedule

    If claim_resources or _fill_provider_mapping calls fail due to placement
    error then the re-scheduling stops and the instance is not put into
    ERROR state.

    This patch adds proper exception handling to these cases.

    Change-Id: I78fc2312274471a7bd85a263de12cc5a0b19fd10
    Closes-Bug: #1819460

Changed in nova:
status: In Progress → Fix Released
Matt Riedemann (mriedem)
Changed in nova:
importance: Low → Medium
OpenStack Infra (hudson-openstack) wrote : Related fix proposed to nova (stable/stein)

Related fix proposed to branch: stable/stein
Review: https://review.opendev.org/657600

OpenStack Infra (hudson-openstack) wrote :

Related fix proposed to branch: stable/stein
Review: https://review.opendev.org/657601

OpenStack Infra (hudson-openstack) wrote : Fix proposed to nova (stable/stein)

Fix proposed to branch: stable/stein
Review: https://review.opendev.org/657602

OpenStack Infra (hudson-openstack) wrote : Related fix merged to nova (stable/stein)

Reviewed: https://review.opendev.org/657600
Committed: https://git.openstack.org/cgit/openstack/nova/commit/?id=07a1a8ff7dcb00283ba7ebb6f59a70002a4ee4db
Submitter: Zuul
Branch: stable/stein

commit 07a1a8ff7dcb00283ba7ebb6f59a70002a4ee4db
Author: Balazs Gibizer <email address hidden>
Date: Mon Mar 11 14:39:10 2019 +0100

    Reproduce bug #1819460 in functional test

    There are two calls during ConductorTaskManager.build_instances,
    used during re-schedule, that could potentially raise exceptions,
    leaving the instance stuck in BUILD state instead of going to
    ERROR state.

    This patch adds two functional test cases to reproduce the problems.

    Conflicts:
           nova/tests/functional/test_servers.py

    Change-Id: If80c4e4776b81cc06293989ee41d39b53735352b
    Related-Bug: #1819460
    (cherry picked from commit b63c42a0d4836fd0364cb306145d3474619f1e19)

tags: added: in-stable-stein
OpenStack Infra (hudson-openstack) wrote :

Reviewed: https://review.opendev.org/657601
Committed: https://git.openstack.org/cgit/openstack/nova/commit/?id=c38a82190242aeea3c3577cd7dde9fba8bb04d72
Submitter: Zuul
Branch: stable/stein

commit c38a82190242aeea3c3577cd7dde9fba8bb04d72
Author: Balazs Gibizer <email address hidden>
Date: Fri Mar 29 10:03:48 2019 +0100

    Fix exception type in test_boot_reschedule_fill_provider_mapping_raises

    After Iecbee518444bd282ce5f6fd019db41a322f76a83 merged the
    test_boot_reschedule_fill_provider_mapping_raises does not simulate a
    possible scenario as ConsumerAllocationRetrievalFailed cannot be
    raised from _fill_provider_mapping any more.

    The fault this functional test reproduces still exists, just with
    different exception types. So this patch changes the exception type
    to a valid one that is still raised by _fill_provider_mapping.

    Change-Id: I79cf44aafccb1bc6b13a3109c5a98215811bd04d
    Related-Bug: #1819460
    (cherry picked from commit fd3b86d1c35efdd8356233863b8ad5b628df8d29)

OpenStack Infra (hudson-openstack) wrote : Fix merged to nova (stable/stein)

Reviewed: https://review.opendev.org/657602
Committed: https://git.openstack.org/cgit/openstack/nova/commit/?id=1c774315a5134937909f674b5c5a65f06efb3cb7
Submitter: Zuul
Branch: stable/stein

commit 1c774315a5134937909f674b5c5a65f06efb3cb7
Author: Balazs Gibizer <email address hidden>
Date: Mon Mar 11 15:24:05 2019 +0100

    Handle placement error during re-schedule

    If claim_resources or _fill_provider_mapping calls fail due to placement
    error then the re-scheduling stops and the instance is not put into
    ERROR state.

    This patch adds proper exception handling to these cases.

    Change-Id: I78fc2312274471a7bd85a263de12cc5a0b19fd10
    Closes-Bug: #1819460
    (cherry picked from commit b096d9303acfea81ca56394cab681d2b2eed2d91)

OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/nova 19.0.1

This issue was fixed in the openstack/nova 19.0.1 release.

Matt Riedemann (mriedem) wrote :

I'll be backporting the non-fill provider mapping part of this to rocky and queens since the code fix and functional tests related to bug 1837955 rely on changes from the series that fixed this bug.

OpenStack Infra (hudson-openstack) wrote : Related fix proposed to nova (stable/rocky)

Related fix proposed to branch: stable/rocky
Review: https://review.opendev.org/673546

OpenStack Infra (hudson-openstack) wrote : Fix proposed to nova (stable/rocky)

Fix proposed to branch: stable/rocky
Review: https://review.opendev.org/673550

Matt Riedemann (mriedem) wrote :

Actually, ignore comment 15: claim_resources didn't raise AllocationUpdateFailed until Stein:

https://github.com/openstack/nova/commit/37301f2f278a3702369eec957402e36d53068973

So the bug doesn't apply to Rocky or Queens.

no longer affects: nova/rocky
no longer affects: nova/queens
OpenStack Infra (hudson-openstack) wrote : Change abandoned on nova (stable/rocky)

Change abandoned by Matt Riedemann (<email address hidden>) on branch: stable/rocky
Review: https://review.opendev.org/673546
Reason: Does not apply to stable/rocky.

OpenStack Infra (hudson-openstack) wrote :

Change abandoned by Matt Riedemann (<email address hidden>) on branch: stable/rocky
Review: https://review.opendev.org/673550
Reason: Does not apply to stable/rocky.

OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/nova 20.0.0.0rc1

This issue was fixed in the openstack/nova 20.0.0.0rc1 release candidate.
