Error updating resources for node ubuntu-bionic-ovh-bhs1-0008373888.: AttributeError: 'NoneType' object has no attribute 'flavorid' - race with resize confirm

Bug #1834349 reported by Matt Riedemann
This bug affects 1 person
Affects                      Status          Importance   Assigned to       Milestone
OpenStack Compute (nova)     Fix Released    Medium       Matt Riedemann
  Stein                      Fix Committed   Medium       Matt Riedemann

Bug Description

Seeing this in CI jobs:

http://logs.openstack.org/45/666845/1/check/neutron-tempest-dvr-ha-multinode-full/5b09053/controller/logs/screen-n-cpu.txt#_Jun_26_13_51_48_973568

Jun 26 13:51:48.973568 ubuntu-bionic-ovh-bhs1-0008373888 nova-compute[27912]: ERROR nova.compute.manager [None req-903b3e73-3ce3-4f5c-9a30-811383077679 None None] Error updating resources for node ubuntu-bionic-ovh-bhs1-0008373888.: AttributeError: 'NoneType' object has no attribute 'flavorid'
Jun 26 13:51:48.973568 ubuntu-bionic-ovh-bhs1-0008373888 nova-compute[27912]: ERROR nova.compute.manager Traceback (most recent call last):
Jun 26 13:51:48.973568 ubuntu-bionic-ovh-bhs1-0008373888 nova-compute[27912]: ERROR nova.compute.manager File "/opt/stack/nova/nova/compute/manager.py", line 8101, in _update_available_resource_for_node
Jun 26 13:51:48.973568 ubuntu-bionic-ovh-bhs1-0008373888 nova-compute[27912]: ERROR nova.compute.manager startup=startup)
Jun 26 13:51:48.973568 ubuntu-bionic-ovh-bhs1-0008373888 nova-compute[27912]: ERROR nova.compute.manager File "/opt/stack/nova/nova/compute/resource_tracker.py", line 735, in update_available_resource
Jun 26 13:51:48.973568 ubuntu-bionic-ovh-bhs1-0008373888 nova-compute[27912]: ERROR nova.compute.manager self._update_available_resource(context, resources, startup=startup)
Jun 26 13:51:48.973568 ubuntu-bionic-ovh-bhs1-0008373888 nova-compute[27912]: ERROR nova.compute.manager File "/usr/local/lib/python3.6/dist-packages/oslo_concurrency/lockutils.py", line 328, in inner
Jun 26 13:51:48.973568 ubuntu-bionic-ovh-bhs1-0008373888 nova-compute[27912]: ERROR nova.compute.manager return f(*args, **kwargs)
Jun 26 13:51:48.973568 ubuntu-bionic-ovh-bhs1-0008373888 nova-compute[27912]: ERROR nova.compute.manager File "/opt/stack/nova/nova/compute/resource_tracker.py", line 783, in _update_available_resource
Jun 26 13:51:48.973568 ubuntu-bionic-ovh-bhs1-0008373888 nova-compute[27912]: ERROR nova.compute.manager self._update_usage_from_migrations(context, migrations, nodename)
Jun 26 13:51:48.973568 ubuntu-bionic-ovh-bhs1-0008373888 nova-compute[27912]: ERROR nova.compute.manager File "/opt/stack/nova/nova/compute/resource_tracker.py", line 1192, in _update_usage_from_migrations
Jun 26 13:51:48.973568 ubuntu-bionic-ovh-bhs1-0008373888 nova-compute[27912]: ERROR nova.compute.manager nodename)
Jun 26 13:51:48.973568 ubuntu-bionic-ovh-bhs1-0008373888 nova-compute[27912]: ERROR nova.compute.manager File "/opt/stack/nova/nova/compute/resource_tracker.py", line 1118, in _update_usage_from_migration
Jun 26 13:51:48.973568 ubuntu-bionic-ovh-bhs1-0008373888 nova-compute[27912]: ERROR nova.compute.manager migration.uuid, itype.flavorid, instance=instance)
Jun 26 13:51:48.973568 ubuntu-bionic-ovh-bhs1-0008373888 nova-compute[27912]: ERROR nova.compute.manager AttributeError: 'NoneType' object has no attribute 'flavorid'
Jun 26 13:51:48.973568 ubuntu-bionic-ovh-bhs1-0008373888 nova-compute[27912]: ERROR nova.compute.manager

I'm not sure if this is a race, but it's this code:

https://github.com/openstack/nova/blob/707deb158996d540111c23afd8c916ea1c18906a/nova/compute/resource_tracker.py#L1118

which was introduced with this change (of mine) in Stein:

https://github.com/openstack/nova/commit/20a46ece5701c9798a5e0df12c944237cb1ece3e

According to logstash this is only showing up in multinode jobs:

http://logstash.openstack.org/#dashboard/file/logstash.json?query=message%3A%5C%22AttributeError%3A%20'NoneType'%20object%20has%20no%20attribute%20'flavorid'%5C%22%20AND%20tags%3A%5C%22screen-n-cpu.txt%5C%22&from=7d

Given this code to get the flavor:

https://github.com/openstack/nova/blob/707deb158996d540111c23afd8c916ea1c18906a/nova/compute/resource_tracker.py#L1439

    def _get_instance_type(self, instance, prefix, migration):
        """Get the instance type from instance."""
        stashed_flavors = migration.migration_type in ('resize',)
        if stashed_flavors:
            return getattr(instance, '%sflavor' % prefix)
        else:
            # NOTE(ndipanov): Certain migration types (all but resize)
            # do not change flavors so there is no need to stash
            # them. In that case - just get the instance flavor.
            return instance.flavor
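The failure mode can be reproduced with a minimal sketch using stand-in classes (these are hypothetical, not nova's real objects): for a resize migration, `_get_instance_type` returns `instance.old_flavor`, and if a concurrent confirm_resize has already nulled that attribute, the later `.flavorid` access raises exactly the AttributeError in the traceback.

```python
class Flavor:
    def __init__(self, flavorid):
        self.flavorid = flavorid


class Instance:
    def __init__(self):
        self.flavor = Flavor('42')   # current (new) flavor
        self.old_flavor = None       # nulled out by a concurrent confirm_resize


def get_instance_type(instance, prefix, migration_type):
    # Mirrors the logic of _get_instance_type quoted above: resize
    # migrations read the stashed '<prefix>flavor' attribute.
    if migration_type == 'resize':
        return getattr(instance, '%sflavor' % prefix)
    return instance.flavor


itype = get_instance_type(Instance(), 'old_', 'resize')
try:
    itype.flavorid
except AttributeError as exc:
    # AttributeError: 'NoneType' object has no attribute 'flavorid'
    print('race reproduced: %s' % exc)
```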

And this in the logs, where instance.old_flavor is lazy-loaded:

Jun 26 13:51:48.911052 ubuntu-bionic-ovh-bhs1-0008373888 nova-compute[27912]: INFO nova.compute.resource_tracker [None req-903b3e73-3ce3-4f5c-9a30-811383077679 None None] [instance: 4342d535-d00f-4ebc-930a-8e634f343eef] Updating resource usage from migration 87280d26-3511-47cb-9a69-6d510fd92718
Jun 26 13:51:48.911397 ubuntu-bionic-ovh-bhs1-0008373888 nova-compute[27912]: DEBUG nova.objects.instance [None req-903b3e73-3ce3-4f5c-9a30-811383077679 None None] Lazy-loading 'old_flavor' on Instance uuid 4342d535-d00f-4ebc-930a-8e634f343eef {{(pid=27912) obj_load_attr /opt/stack/nova/nova/objects/instance.py:1053}}

This is coming from a resize operation. I also see this:

Jun 26 13:51:48.844575 ubuntu-bionic-ovh-bhs1-0008373888 nova-compute[27912]: DEBUG nova.compute.resource_tracker [None req-903b3e73-3ce3-4f5c-9a30-811383077679 None None] Migration for instance 4342d535-d00f-4ebc-930a-8e634f343eef refers to another host's instance! {{(pid=27912) _pair_instances_to_migrations /opt/stack/nova/nova/compute/resource_tracker.py:751}}

That log message was added because of https://review.opendev.org/#/c/382195/, so we might just be racing with the confirm/revert: the update_available_resource periodic task ran while the resize was being confirmed or reverted.

Further up in the logs I also see this, confirming the instance is being resized:

Jun 26 13:51:48.559335 ubuntu-bionic-ovh-bhs1-0008373888 nova-compute[27912]: INFO nova.virt.libvirt.driver [None req-903b3e73-3ce3-4f5c-9a30-811383077679 None None] [instance: 4342d535-d00f-4ebc-930a-8e634f343eef] Periodic task is updating the host stats; it is trying to get disk info for instance-0000006c, but the backing disk was removed by a concurrent operation (task_state=None) and (vm_state=resized)

When confirming a resize (which tempest will do as soon as the server status goes to VERIFY_RESIZE in most tests), the instance.old_flavor is nulled out here:

https://github.com/openstack/nova/blob/707deb158996d540111c23afd8c916ea1c18906a/nova/compute/manager.py#L4031

And then the move claim is dropped on the source compute:

https://github.com/openstack/nova/blob/707deb158996d540111c23afd8c916ea1c18906a/nova/compute/manager.py#L4057

and that's where we remove it from tracked_migrations in the RT:

https://github.com/openstack/nova/blob/707deb158996d540111c23afd8c916ea1c18906a/nova/compute/resource_tracker.py#L464
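The fix that eventually merged (quoted later in this thread, change I43e34b3f) guards the logging with a None check rather than trying to close the race itself. A hedged sketch of that pattern, with a hypothetical helper name standing in for the real resource-tracker code:

```python
class Flavor:
    def __init__(self, flavorid):
        self.flavorid = flavorid


class Instance:
    def __init__(self, old_flavor=None):
        self.old_flavor = old_flavor


def describe_migration_flavor(instance, migration_uuid):
    # Hypothetical helper sketching the merged fix's pattern: only
    # dereference flavorid when old_flavor is actually set.
    itype = instance.old_flavor
    if itype is not None:
        return 'migration %s uses flavor %s' % (migration_uuid, itype.flavorid)
    # old_flavor was nulled by a concurrent confirm/revert: skip the
    # flavorid dereference instead of raising, restoring the pre-Stein
    # behavior of simply not logging the flavor.
    return 'migration %s: old flavor not stashed (racing with confirm?)' % (
        migration_uuid)
```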

Matt Riedemann (mriedem)
summary: Error updating resources for node ubuntu-bionic-ovh-bhs1-0008373888.:
- AttributeError: 'NoneType' object has no attribute 'flavorid'
+ AttributeError: 'NoneType' object has no attribute 'flavorid' - race
+ with resize confirm
tags: added: resize resource-tracker
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to nova (master)

Fix proposed to branch: master
Review: https://review.opendev.org/667687

Changed in nova:
assignee: nobody → Matt Riedemann (mriedem)
status: New → In Progress
Matt Riedemann (mriedem)
Changed in nova:
importance: Undecided → Medium
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to nova (master)

Reviewed: https://review.opendev.org/667687
Committed: https://git.openstack.org/cgit/openstack/nova/commit/?id=818419c9d313bd6151d1a05b3a087a15116f61b8
Submitter: Zuul
Branch: master

commit 818419c9d313bd6151d1a05b3a087a15116f61b8
Author: Matt Riedemann <email address hidden>
Date: Wed Jun 26 13:26:44 2019 -0400

    Fix AttributeError in RT._update_usage_from_migration

    Change Ieb539c9a0cfbac743c579a1633234537a8e3e3ee in Stein
    added some logging in _update_usage_from_migration to log
    the flavor for an inbound and outbound migration.

    If an instance is resized and then the resize is immediately
    confirmed, it's possible to race with ComputeManager._confirm_resize
    setting the instance.old_flavor to None before the migration
    status is changed to "confirmed" while the update_available_resource
    periodic is running which will result in _update_usage_from_migration
    hitting an AttributeError when trying to log instance.old_flavor.flavorid
    since instance.old_flavor is None.

    There are a few key points there:

    - We get into _update_usage_from_migration because the
      _update_available_resource method gets in-progress migrations
      related to the host (in this case the source compute) and the
      migration is considered in-progress until its status is "confirmed".

    - The instance is not in the tracked_instances dict when
      _update_usage_from_migration runs because RT only tracks instances
      where the instance.host matches the RT.host and in this case the
      instance has been resized to another compute and the instance.host
      is pointing at the dest compute.

    The fix here is to simply check if we got the instance.old_flavor and
    not log the message if we do not have it, which gets us back to the old
    behavior.

    This bug was found by noticing it in CI job logs - there is a link to
    hits in logstash in the bug report.

    As for the "incoming and not tracked" case in _update_usage_from_migration
    I have not modified that since I am not sure we have the same race nor
    have I seen it in CI logs.

    Change-Id: I43e34b3ff1424d42632a3e8f842c93508905aa1a
    Closes-Bug: #1834349

Changed in nova:
status: In Progress → Fix Released
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to nova (stable/stein)

Fix proposed to branch: stable/stein
Review: https://review.opendev.org/673856

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to nova (stable/stein)

Reviewed: https://review.opendev.org/673856
Committed: https://git.openstack.org/cgit/openstack/nova/commit/?id=dd09686232c485225951eabadadae143d2b95e7d
Submitter: Zuul
Branch: stable/stein

commit dd09686232c485225951eabadadae143d2b95e7d
Author: Matt Riedemann <email address hidden>
Date: Wed Jun 26 13:26:44 2019 -0400

    Fix AttributeError in RT._update_usage_from_migration

    Change Ieb539c9a0cfbac743c579a1633234537a8e3e3ee in Stein
    added some logging in _update_usage_from_migration to log
    the flavor for an inbound and outbound migration.

    If an instance is resized and then the resize is immediately
    confirmed, it's possible to race with ComputeManager._confirm_resize
    setting the instance.old_flavor to None before the migration
    status is changed to "confirmed" while the update_available_resource
    periodic is running which will result in _update_usage_from_migration
    hitting an AttributeError when trying to log instance.old_flavor.flavorid
    since instance.old_flavor is None.

    There are a few key points there:

    - We get into _update_usage_from_migration because the
      _update_available_resource method gets in-progress migrations
      related to the host (in this case the source compute) and the
      migration is considered in-progress until its status is "confirmed".

    - The instance is not in the tracked_instances dict when
      _update_usage_from_migration runs because RT only tracks instances
      where the instance.host matches the RT.host and in this case the
      instance has been resized to another compute and the instance.host
      is pointing at the dest compute.

    The fix here is to simply check if we got the instance.old_flavor and
    not log the message if we do not have it, which gets us back to the old
    behavior.

    This bug was found by noticing it in CI job logs - there is a link to
    hits in logstash in the bug report.

    As for the "incoming and not tracked" case in _update_usage_from_migration
    I have not modified that since I am not sure we have the same race nor
    have I seen it in CI logs.

    Change-Id: I43e34b3ff1424d42632a3e8f842c93508905aa1a
    Closes-Bug: #1834349
    (cherry picked from commit 818419c9d313bd6151d1a05b3a087a15116f61b8)

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/nova 19.0.2

This issue was fixed in the openstack/nova 19.0.2 release.

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/nova 20.0.0.0rc1

This issue was fixed in the openstack/nova 20.0.0.0rc1 release candidate.

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix proposed to nova (master)

Related fix proposed to branch: master
Review: https://review.opendev.org/744958

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Change abandoned on nova (master)

Change abandoned by Stephen Finucane (<email address hidden>) on branch: master
Review: https://review.opendev.org/741995
Reason: Abandoning this in favor of https://review.opendev.org/#/c/744958/8 which is the same code change but with a better commit message that provides context on a bug the change also addresses.

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix merged to nova (master)

Reviewed: https://review.opendev.org/744958
Committed: https://git.openstack.org/cgit/openstack/nova/commit/?id=44376d2e212e0f9405a58dc7fc4d5b38d70ac42e
Submitter: Zuul
Branch: master

commit 44376d2e212e0f9405a58dc7fc4d5b38d70ac42e
Author: Stephen Finucane <email address hidden>
Date: Wed Aug 5 14:27:06 2020 +0100

    Don't unset Instance.old_flavor, new_flavor until necessary

    Since change Ia6d8a7909081b0b856bd7e290e234af7e42a2b38, the resource
    tracker's 'drop_move_claim' method has been capable of freeing up
    resource usage. However, this relies on accurate resource reporting.
    It transpires that there's a race whereby the resource tracker's
    'update_available_resource' periodic task can end up not accounting for
    usage from migrations that are in the process of being completed. The
    root cause is the resource tracker's reliance on the stashed flavor in a
    given migration record [1]. Previously, this information was deleted by
    the compute manager at the start of the confirm migration operation [2].
    The compute manager would then call the virt driver [3], which could
    take a not insignificant amount of time to return, before finally
    dropping the move claim. If the periodic task ran between the clearing
    of the stashed flavor and the return of the virt driver, it would find a
    migration record with no stashed flavor and would therefore ignore this
    record for accounting purposes [4], resulting in an incorrect record for
    the compute node, and an exception when the 'drop_move_claim' attempts
    to free up the resources that aren't being tracked.

    The solution to this issue is pretty simple. Instead of unsetting the
    old flavor record from the migration at the start of the various move
    operations, do it afterwards.

    [1] https://github.com/openstack/nova/blob/6557d67/nova/compute/resource_tracker.py#L1288
    [2] https://github.com/openstack/nova/blob/6557d67/nova/compute/manager.py#L4310-L4315
    [3] https://github.com/openstack/nova/blob/6557d67/nova/compute/manager.py#L4330-L4331
    [4] https://github.com/openstack/nova/blob/6557d67/nova/compute/resource_tracker.py#L1300

    Change-Id: I4760b01b695c94fa371b72216d398388cf981d28
    Signed-off-by: Stephen Finucane <email address hidden>
    Partial-Bug: #1879878
    Related-Bug: #1834349
    Related-Bug: #1818914
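The reordering this related fix describes can be sketched with stand-in classes (hypothetical names, heavily simplified from the real compute-manager/resource-tracker interaction):

```python
class Instance:
    def __init__(self, old_flavor):
        self.old_flavor = old_flavor


class ResourceTracker:
    """Stand-in tracker: drop_move_claim needs the stashed flavor."""

    def __init__(self):
        self.freed_flavor = None

    def drop_move_claim(self, instance):
        # Accurate accounting relies on old_flavor still being set here.
        self.freed_flavor = instance.old_flavor


def confirm_resize_fixed(instance, rt):
    # Post-fix ordering (change I4760b01b): drop the move claim while
    # the stashed flavor is still available, *then* unset it. The
    # pre-fix code did these two steps in the opposite order, opening
    # the window the periodic task could race into.
    rt.drop_move_claim(instance)
    instance.old_flavor = None


rt = ResourceTracker()
inst = Instance(old_flavor='m1.small')
confirm_resize_fixed(inst, rt)
# The tracker saw the flavor; the instance record is cleaned up after.
```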

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix proposed to nova (stable/ussuri)

Related fix proposed to branch: stable/ussuri
Review: https://review.opendev.org/751352

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix proposed to nova (stable/train)

Related fix proposed to branch: stable/train
Review: https://review.opendev.org/751367

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix merged to nova (stable/ussuri)

Reviewed: https://review.opendev.org/751352
Committed: https://git.openstack.org/cgit/openstack/nova/commit/?id=ce95af2caf69cb1b650459718fd4fa5f00ff28f5
Submitter: Zuul
Branch: stable/ussuri

commit ce95af2caf69cb1b650459718fd4fa5f00ff28f5
Author: Stephen Finucane <email address hidden>
Date: Wed Aug 5 14:27:06 2020 +0100

    Don't unset Instance.old_flavor, new_flavor until necessary

    Since change Ia6d8a7909081b0b856bd7e290e234af7e42a2b38, the resource
    tracker's 'drop_move_claim' method has been capable of freeing up
    resource usage. However, this relies on accurate resource reporting.
    It transpires that there's a race whereby the resource tracker's
    'update_available_resource' periodic task can end up not accounting for
    usage from migrations that are in the process of being completed. The
    root cause is the resource tracker's reliance on the stashed flavor in a
    given migration record [1]. Previously, this information was deleted by
    the compute manager at the start of the confirm migration operation [2].
    The compute manager would then call the virt driver [3], which could
    take a not insignificant amount of time to return, before finally
    dropping the move claim. If the periodic task ran between the clearing
    of the stashed flavor and the return of the virt driver, it would find a
    migration record with no stashed flavor and would therefore ignore this
    record for accounting purposes [4], resulting in an incorrect record for
    the compute node, and an exception when the 'drop_move_claim' attempts
    to free up the resources that aren't being tracked.

    The solution to this issue is pretty simple. Instead of unsetting the
    old flavor record from the migration at the start of the various move
    operations, do it afterwards.

    [1] https://github.com/openstack/nova/blob/6557d67/nova/compute/resource_tracker.py#L1288
    [2] https://github.com/openstack/nova/blob/6557d67/nova/compute/manager.py#L4310-L4315
    [3] https://github.com/openstack/nova/blob/6557d67/nova/compute/manager.py#L4330-L4331
    [4] https://github.com/openstack/nova/blob/6557d67/nova/compute/resource_tracker.py#L1300

    Change-Id: I4760b01b695c94fa371b72216d398388cf981d28
    Signed-off-by: Stephen Finucane <email address hidden>
    Partial-Bug: #1879878
    Related-Bug: #1834349
    Related-Bug: #1818914
    (cherry picked from commit 44376d2e212e0f9405a58dc7fc4d5b38d70ac42e)

tags: added: in-stable-ussuri
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix merged to nova (stable/train)

Reviewed: https://review.opendev.org/c/openstack/nova/+/751367
Committed: https://opendev.org/openstack/nova/commit/75f9b288f8edfd24affe5ecbc1f3efb6a63726e4
Submitter: "Zuul (22348)"
Branch: stable/train

commit 75f9b288f8edfd24affe5ecbc1f3efb6a63726e4
Author: Stephen Finucane <email address hidden>
Date: Wed Aug 5 14:27:06 2020 +0100

    Don't unset Instance.old_flavor, new_flavor until necessary

    Since change Ia6d8a7909081b0b856bd7e290e234af7e42a2b38, the resource
    tracker's 'drop_move_claim' method has been capable of freeing up
    resource usage. However, this relies on accurate resource reporting.
    It transpires that there's a race whereby the resource tracker's
    'update_available_resource' periodic task can end up not accounting for
    usage from migrations that are in the process of being completed. The
    root cause is the resource tracker's reliance on the stashed flavor in a
    given migration record [1]. Previously, this information was deleted by
    the compute manager at the start of the confirm migration operation [2].
    The compute manager would then call the virt driver [3], which could
    take a not insignificant amount of time to return, before finally
    dropping the move claim. If the periodic task ran between the clearing
    of the stashed flavor and the return of the virt driver, it would find a
    migration record with no stashed flavor and would therefore ignore this
    record for accounting purposes [4], resulting in an incorrect record for
    the compute node, and an exception when the 'drop_move_claim' attempts
    to free up the resources that aren't being tracked.

    The solution to this issue is pretty simple. Instead of unsetting the
    old flavor record from the migration at the start of the various move
    operations, do it afterwards.

    Conflicts:
      nova/compute/manager.py

    NOTE(stephenfin): Conflicts are due to a number of missing cross-cell
    resize changes.

    [1] https://github.com/openstack/nova/blob/6557d67/nova/compute/resource_tracker.py#L1288
    [2] https://github.com/openstack/nova/blob/6557d67/nova/compute/manager.py#L4310-L4315
    [3] https://github.com/openstack/nova/blob/6557d67/nova/compute/manager.py#L4330-L4331
    [4] https://github.com/openstack/nova/blob/6557d67/nova/compute/resource_tracker.py#L1300

    Change-Id: I4760b01b695c94fa371b72216d398388cf981d28
    Signed-off-by: Stephen Finucane <email address hidden>
    Partial-Bug: #1879878
    Related-Bug: #1834349
    Related-Bug: #1818914
    (cherry picked from commit 44376d2e212e0f9405a58dc7fc4d5b38d70ac42e)
    (cherry picked from commit ce95af2caf69cb1b650459718fd4fa5f00ff28f5)

tags: added: in-stable-train