false error log at compute restart during error out stuck instances

Bug #1852759 reported by Balazs Gibizer
This bug affects 1 person
Affects                   Status         Importance  Assigned to     Milestone
OpenStack Compute (nova)  Invalid        Undecided   Unassigned
Pike                      Fix Released   Low         Lee Yarwood
Queens                    Fix Released   Low         s10
Rocky                     Fix Committed  Low         Balazs Gibizer

Bug Description

Since https://review.opendev.org/#/c/687565 was merged to stable/rocky, a compute node without any allocations in placement logs an error at every nova-compute restart.

Nov 15 15:02:41 ubuntu nova-compute[21876]: ERROR nova.compute.manager [None req-0ab61fb0-a780-4b84-ad07-3d6b3216b280 None None] Could not retrieve compute node resource provider 5895faa5-01fd-46ee-8afb-6ddcf136f65e and therefore unable to error out any instances stuck in BUILDING state.

The ERROR log is simply wrong. It happens because the placement report client does not differentiate between an error received from placement and an empty allocation dict received from placement. This only affects stable/rocky and older stable branches, because in stein get_allocations_for_resource_provider() was enhanced to raise instead of returning {} if placement returned an error [1].

[1] https://github.com/openstack/nova/commit/f534495a427d1683bc536cf003ec02edbf6d8a45
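
The ambiguity described above can be illustrated with a minimal sketch. This is not the nova code: FakeResponse, startup_check and the shortened log text are hypothetical stand-ins, and the placement call is reduced to inspecting a canned response. Only the function name get_allocations_for_resource_provider and the "return {} on error" behaviour come from the report.

```python
import logging

LOG = logging.getLogger("sketch")


class FakeResponse:
    """Hypothetical stand-in for an HTTP response from placement."""

    def __init__(self, status_code, body=None):
        self.status_code = status_code
        self._body = body or {}

    def json(self):
        return self._body


def get_allocations_for_resource_provider(resp):
    """Pre-fix behaviour as described in the report: {} in both cases."""
    if resp.status_code != 200:
        return {}  # placement error
    return resp.json().get("allocations", {})  # may legitimately be {}


def startup_check(resp):
    """Return True if the ERROR would be logged (hypothetical helper)."""
    allocations = get_allocations_for_resource_provider(resp)
    if not allocations:
        # The caller cannot tell "error" apart from "no allocations".
        LOG.error("Could not retrieve compute node resource provider")
        return True
    return False


empty_but_ok = FakeResponse(200, {"allocations": {}})
placement_failed = FakeResponse(500)

# Both cases take the error path, which is the false ERROR on an empty compute.
false_error = startup_check(empty_but_ok)
real_error = startup_check(placement_failed)
```

Under these assumptions an empty compute node and a failed placement call are indistinguishable to the caller, so both log the ERROR.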

Tags: compute
Changed in nova:
status: New → Invalid
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to nova (stable/rocky)

Fix proposed to branch: stable/rocky
Review: https://review.opendev.org/694581

OpenStack Infra (hudson-openstack) wrote : Fix proposed to nova (stable/queens)

Fix proposed to branch: stable/queens
Review: https://review.opendev.org/696733

OpenStack Infra (hudson-openstack) wrote : Fix merged to nova (stable/rocky)

Reviewed: https://review.opendev.org/694581
Committed: https://git.openstack.org/cgit/openstack/nova/commit/?id=64f797a0514b0276540d4f6c28cb290383088e35
Submitter: Zuul
Branch: stable/rocky

commit 64f797a0514b0276540d4f6c28cb290383088e35
Author: Balazs Gibizer <email address hidden>
Date: Fri Nov 15 16:31:04 2019 +0100

    Fix false ERROR message at compute restart

    If an empty compute is restarted a false ERROR message was printed in
    the log as the placement report client does not distinguish between
    error from placement from empty allocation dict from placement.

    This patch changes get_allocations_for_resource_provider to return None
    in case of error instead of an empty dict. This is in line with
    @safe_connect that would make the call return None as well. The
    _error_out_instances_whose_build_was_interrupted also is changed to check
    for None instead of empty dict before reporting the ERROR. The only
    other caller of get_allocations_for_resource_provider was already
    checking for None and converting it to an empty dict so from that caller
    perspective this is compatible change on the report client.

    This is stable only change as get_allocations_for_resource_provider was
    improved during stein[1] to raise on placement error.

    [1]I020e7dc47efc79f8907b7bfb753ec779a8da69a1

    Change-Id: I6042e493144d4d5a29ec6ab23ffed6b3e7f385fe
    Closes-Bug: #1852759
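
The approach in the commit message above can be sketched roughly as follows. This is a simplified illustration, not the merged patch: the status_code/body parameters and the should_log_error helper are hypothetical, and only the "return None on error, check for None in the caller" idea is taken from the commit message.

```python
def get_allocations_for_resource_provider(status_code, body):
    """Sketch of the changed contract: None on error, a dict otherwise.

    status_code and body are hypothetical inputs standing in for the
    placement response; the real method talks to the placement API.
    """
    if status_code != 200:
        return None  # error from placement
    return body.get("allocations", {})  # {} now means "no allocations"


def should_log_error(allocations):
    """Hypothetical stand-in for the check made by the caller,
    _error_out_instances_whose_build_was_interrupted, after the change."""
    return allocations is None


# Empty compute node: placement answers successfully with no allocations.
empty = get_allocations_for_resource_provider(200, {"allocations": {}})

# Placement failure: the caller gets None and may report the ERROR.
failed = get_allocations_for_resource_provider(500, {})

# A caller that only wants a dict can keep converting None to {},
# which is why the commit message calls the change compatible.
as_dict = failed or {}
```

Checking `is None` rather than truthiness is the key design point here, since an empty dict is also falsy in Python.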

OpenStack Infra (hudson-openstack) wrote : Fix merged to nova (stable/queens)

Reviewed: https://review.opendev.org/696733
Committed: https://git.openstack.org/cgit/openstack/nova/commit/?id=4fcb7816bc88fd513debe70b95aa60bff74e37fb
Submitter: Zuul
Branch: stable/queens

commit 4fcb7816bc88fd513debe70b95aa60bff74e37fb
Author: Balazs Gibizer <email address hidden>
Date: Fri Nov 15 16:31:04 2019 +0100

    Fix false ERROR message at compute restart

    If an empty compute is restarted a false ERROR message was printed in
    the log as the placement report client does not distinguish between
    error from placement from empty allocation dict from placement.

    This patch changes get_allocations_for_resource_provider to return None
    in case of error instead of an empty dict. This is in line with
    @safe_connect that would make the call return None as well. The
    _error_out_instances_whose_build_was_interrupted also is changed to check
    for None instead of empty dict before reporting the ERROR. The only
    other caller of get_allocations_for_resource_provider was already
    checking for None and converting it to an empty dict so from that caller
    perspective this is compatible change on the report client.

    This is stable only change as get_allocations_for_resource_provider was
    improved during stein[1] to raise on placement error.

    [1]I020e7dc47efc79f8907b7bfb753ec779a8da69a1

    Change-Id: I6042e493144d4d5a29ec6ab23ffed6b3e7f385fe
    Closes-Bug: #1852759
    (cherry picked from commit 64f797a0514b0276540d4f6c28cb290383088e35)

OpenStack Infra (hudson-openstack) wrote : Fix proposed to nova (stable/pike)

Fix proposed to branch: stable/pike
Review: https://review.opendev.org/699496

OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/nova 18.3.0

This issue was fixed in the openstack/nova 18.3.0 release.

OpenStack Infra (hudson-openstack) wrote : Fix merged to nova (stable/pike)

Reviewed: https://review.opendev.org/699496
Committed: https://git.openstack.org/cgit/openstack/nova/commit/?id=511b3b0a446694dbd9fc97a4b5a4263b2bfab2fb
Submitter: Zuul
Branch: stable/pike

commit 511b3b0a446694dbd9fc97a4b5a4263b2bfab2fb
Author: Balazs Gibizer <email address hidden>
Date: Fri Nov 15 16:31:04 2019 +0100

    Fix false ERROR message at compute restart

    If an empty compute is restarted a false ERROR message was printed in
    the log as the placement report client does not distinguish between
    error from placement from empty allocation dict from placement.

    This patch changes get_allocations_for_resource_provider to return None
    in case of error instead of an empty dict. This is in line with
    @safe_connect that would make the call return None as well. The
    _error_out_instances_whose_build_was_interrupted also is changed to check
    for None instead of empty dict before reporting the ERROR. The only
    other caller of get_allocations_for_resource_provider was already
    checking for None and converting it to an empty dict so from that caller
    perspective this is compatible change on the report client.

    This is stable only change as get_allocations_for_resource_provider was
    improved during stein[1] to raise on placement error.

    [1]I020e7dc47efc79f8907b7bfb753ec779a8da69a1

    Conflicts:
          nova/compute/manager.py

    NOTE(mriedem): The conflict and changes to test_compute_mgr.py
    are due to not having change I7891b98f225f97ad47f189afb9110ef31c810717
    in Pike which added the context argument to method
    get_allocations_for_resource_provider.

    Change-Id: I6042e493144d4d5a29ec6ab23ffed6b3e7f385fe
    Closes-Bug: #1852759
    (cherry picked from commit 64f797a0514b0276540d4f6c28cb290383088e35)
    (cherry picked from commit 4fcb7816bc88fd513debe70b95aa60bff74e37fb)

OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/nova pike-eol

This issue was fixed in the openstack/nova pike-eol release.

OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/nova queens-eol

This issue was fixed in the openstack/nova queens-eol release.
