Error updating resources for node ubuntu-bionic-ovh-bhs1-0008373888.: AttributeError: 'NoneType' object has no attribute 'flavorid' - race with resize confirm

Bug #1834349 reported by Matt Riedemann on 2019-06-26
6
This bug affects 1 person
Affects Status Importance Assigned to Milestone
OpenStack Compute (nova)
Medium
Matt Riedemann
Stein
Medium
Matt Riedemann

Bug Description

Seeing this in CI jobs:

http://logs.openstack.org/45/666845/1/check/neutron-tempest-dvr-ha-multinode-full/5b09053/controller/logs/screen-n-cpu.txt#_Jun_26_13_51_48_973568

Jun 26 13:51:48.973568 ubuntu-bionic-ovh-bhs1-0008373888 nova-compute[27912]: ERROR nova.compute.manager [None req-903b3e73-3ce3-4f5c-9a30-811383077679 None None] Error updating resources for node ubuntu-bionic-ovh-bhs1-0008373888.: AttributeError: 'NoneType' object has no attribute 'flavorid'
Jun 26 13:51:48.973568 ubuntu-bionic-ovh-bhs1-0008373888 nova-compute[27912]: ERROR nova.compute.manager Traceback (most recent call last):
Jun 26 13:51:48.973568 ubuntu-bionic-ovh-bhs1-0008373888 nova-compute[27912]: ERROR nova.compute.manager File "/opt/stack/nova/nova/compute/manager.py", line 8101, in _update_available_resource_for_node
Jun 26 13:51:48.973568 ubuntu-bionic-ovh-bhs1-0008373888 nova-compute[27912]: ERROR nova.compute.manager startup=startup)
Jun 26 13:51:48.973568 ubuntu-bionic-ovh-bhs1-0008373888 nova-compute[27912]: ERROR nova.compute.manager File "/opt/stack/nova/nova/compute/resource_tracker.py", line 735, in update_available_resource
Jun 26 13:51:48.973568 ubuntu-bionic-ovh-bhs1-0008373888 nova-compute[27912]: ERROR nova.compute.manager self._update_available_resource(context, resources, startup=startup)
Jun 26 13:51:48.973568 ubuntu-bionic-ovh-bhs1-0008373888 nova-compute[27912]: ERROR nova.compute.manager File "/usr/local/lib/python3.6/dist-packages/oslo_concurrency/lockutils.py", line 328, in inner
Jun 26 13:51:48.973568 ubuntu-bionic-ovh-bhs1-0008373888 nova-compute[27912]: ERROR nova.compute.manager return f(*args, **kwargs)
Jun 26 13:51:48.973568 ubuntu-bionic-ovh-bhs1-0008373888 nova-compute[27912]: ERROR nova.compute.manager File "/opt/stack/nova/nova/compute/resource_tracker.py", line 783, in _update_available_resource
Jun 26 13:51:48.973568 ubuntu-bionic-ovh-bhs1-0008373888 nova-compute[27912]: ERROR nova.compute.manager self._update_usage_from_migrations(context, migrations, nodename)
Jun 26 13:51:48.973568 ubuntu-bionic-ovh-bhs1-0008373888 nova-compute[27912]: ERROR nova.compute.manager File "/opt/stack/nova/nova/compute/resource_tracker.py", line 1192, in _update_usage_from_migrations
Jun 26 13:51:48.973568 ubuntu-bionic-ovh-bhs1-0008373888 nova-compute[27912]: ERROR nova.compute.manager nodename)
Jun 26 13:51:48.973568 ubuntu-bionic-ovh-bhs1-0008373888 nova-compute[27912]: ERROR nova.compute.manager File "/opt/stack/nova/nova/compute/resource_tracker.py", line 1118, in _update_usage_from_migration
Jun 26 13:51:48.973568 ubuntu-bionic-ovh-bhs1-0008373888 nova-compute[27912]: ERROR nova.compute.manager migration.uuid, itype.flavorid, instance=instance)
Jun 26 13:51:48.973568 ubuntu-bionic-ovh-bhs1-0008373888 nova-compute[27912]: ERROR nova.compute.manager AttributeError: 'NoneType' object has no attribute 'flavorid'
Jun 26 13:51:48.973568 ubuntu-bionic-ovh-bhs1-0008373888 nova-compute[27912]: ERROR nova.compute.manager

I'm not sure if this is a race, but it's this code:

https://github.com/openstack/nova/blob/707deb158996d540111c23afd8c916ea1c18906a/nova/compute/resource_tracker.py#L1118

which was introduced with this change (of mine) in Stein:

https://github.com/openstack/nova/commit/20a46ece5701c9798a5e0df12c944237cb1ece3e

According to logstash this is only showing up in multinode jobs:

http://logstash.openstack.org/#dashboard/file/logstash.json?query=message%3A%5C%22AttributeError%3A%20'NoneType'%20object%20has%20no%20attribute%20'flavorid'%5C%22%20AND%20tags%3A%5C%22screen-n-cpu.txt%5C%22&from=7d

Given this code to get the flavor:

https://github.com/openstack/nova/blob/707deb158996d540111c23afd8c916ea1c18906a/nova/compute/resource_tracker.py#L1439

    def _get_instance_type(self, instance, prefix, migration):
        """Get the instance type from instance."""
        stashed_flavors = migration.migration_type in ('resize',)
        if stashed_flavors:
            return getattr(instance, '%sflavor' % prefix)
        else:
            # NOTE(ndipanov): Certain migration types (all but resize)
            # do not change flavors so there is no need to stash
            # them. In that case - just get the instance flavor.
            return instance.flavor

And this in the logs to load the instance.old_flavor:

Jun 26 13:51:48.911052 ubuntu-bionic-ovh-bhs1-0008373888 nova-compute[27912]: INFO nova.compute.resource_tracker [None req-903b3e73-3ce3-4f5c-9a30-811383077679 None None] [instance: 4342d535-d00f-4ebc-930a-8e634f343eef] Updating resource usage from migration 87280d26-3511-47cb-9a69-6d510fd92718
Jun 26 13:51:48.911397 ubuntu-bionic-ovh-bhs1-0008373888 nova-compute[27912]: DEBUG nova.objects.instance [None req-903b3e73-3ce3-4f5c-9a30-811383077679 None None] Lazy-loading 'old_flavor' on Instance uuid 4342d535-d00f-4ebc-930a-8e634f343eef {{(pid=27912) obj_load_attr /opt/stack/nova/nova/objects/instance.py:1053}}

This is coming from a resize operation. I also see this:

Jun 26 13:51:48.844575 ubuntu-bionic-ovh-bhs1-0008373888 nova-compute[27912]: DEBUG nova.compute.resource_tracker [None req-903b3e73-3ce3-4f5c-9a30-811383077679 None None] Migration for instance 4342d535-d00f-4ebc-930a-8e634f343eef refers to another host's instance! {{(pid=27912) _pair_instances_to_migrations /opt/stack/nova/nova/compute/resource_tracker.py:751}}

Which was added because of https://review.opendev.org/#/c/382195/ so we might just be racing with the confirm/revert during the update_available_resource task and the resize was confirmed/reverted.

Further up in the logs I also see this, confirming the instance is being resized:

Jun 26 13:51:48.559335 ubuntu-bionic-ovh-bhs1-0008373888 nova-compute[27912]: INFO nova.virt.libvirt.driver [None req-903b3e73-3ce3-4f5c-9a30-811383077679 None None] [instance: 4342d535-d00f-4ebc-930a-8e634f343eef] Periodic task is updating the host stats; it is trying to get disk info for instance-0000006c, but the backing disk was removed by a concurrent operation (task_state=None) and (vm_state=resized)

When confirming a resize (which tempest will do as soon as the server status goes to VERIFY_RESIZE in most tests), the instance.old_flavor is nulled out here:

https://github.com/openstack/nova/blob/707deb158996d540111c23afd8c916ea1c18906a/nova/compute/manager.py#L4031

And then the move claim is dropped on the source compute:

https://github.com/openstack/nova/blob/707deb158996d540111c23afd8c916ea1c18906a/nova/compute/manager.py#L4057

and that's where we remove it from tracked_migrations in the RT:

https://github.com/openstack/nova/blob/707deb158996d540111c23afd8c916ea1c18906a/nova/compute/resource_tracker.py#L464

Matt Riedemann (mriedem) on 2019-06-26
summary: Error updating resources for node ubuntu-bionic-ovh-bhs1-0008373888.:
- AttributeError: 'NoneType' object has no attribute 'flavorid'
+ AttributeError: 'NoneType' object has no attribute 'flavorid' - race
+ with resize confirm
tags: added: resize resource-tracker

Fix proposed to branch: master
Review: https://review.opendev.org/667687

Changed in nova:
assignee: nobody → Matt Riedemann (mriedem)
status: New → In Progress
Matt Riedemann (mriedem) on 2019-06-26
Changed in nova:
importance: Undecided → Medium

Reviewed: https://review.opendev.org/667687
Committed: https://git.openstack.org/cgit/openstack/nova/commit/?id=818419c9d313bd6151d1a05b3a087a15116f61b8
Submitter: Zuul
Branch: master

commit 818419c9d313bd6151d1a05b3a087a15116f61b8
Author: Matt Riedemann <email address hidden>
Date: Wed Jun 26 13:26:44 2019 -0400

    Fix AttributeError in RT._update_usage_from_migration

    Change Ieb539c9a0cfbac743c579a1633234537a8e3e3ee in Stein
    added some logging in _update_usage_from_migration to log
    the flavor for an inbound and outbound migration.

    If an instance is resized and then the resize is immediately
    confirmed, it's possible to race with ComputeManager._confirm_resize
    setting the instance.old_flavor to None before the migration
    status is changed to "confirmed" while the update_available_resource
    periodic is running which will result in _update_usage_from_migration
    hitting an AttributeError when trying to log instance.old_flavor.flavorid
    since instance.old_flavor is None.

    There are a few key points there:

    - We get into _update_usage_from_migration because the
      _update_available_resource method gets in-progress migrations
      related to the host (in this case the source compute) and the
      migration is consider in-progress until its status is "confirmed".

    - The instance is not in the tracked_instances dict when
      _update_usage_from_migration runs because RT only tracks instances
      where the instance.host matches the RT.host and in this case the
      instance has been resized to another compute and the instance.host
      is pointing at the dest compute.

    The fix here is to simply check if we got the instance.old_flavor and
    not log the message if we do not have it, which gets us back to the old
    behavior.

    This bug was found by noticing it in CI job logs - there is a link to
    hits in logstash in the bug report.

    As for the "incoming and not tracked" case in _update_usage_from_migration
    I have not modified that since I am not sure we have the same race nor
    have I seen it in CI logs.

    Change-Id: I43e34b3ff1424d42632a3e8f842c93508905aa1a
    Closes-Bug: #1834349

Changed in nova:
status: In Progress → Fix Released

Reviewed: https://review.opendev.org/673856
Committed: https://git.openstack.org/cgit/openstack/nova/commit/?id=dd09686232c485225951eabadadae143d2b95e7d
Submitter: Zuul
Branch: stable/stein

commit dd09686232c485225951eabadadae143d2b95e7d
Author: Matt Riedemann <email address hidden>
Date: Wed Jun 26 13:26:44 2019 -0400

    Fix AttributeError in RT._update_usage_from_migration

    Change Ieb539c9a0cfbac743c579a1633234537a8e3e3ee in Stein
    added some logging in _update_usage_from_migration to log
    the flavor for an inbound and outbound migration.

    If an instance is resized and then the resize is immediately
    confirmed, it's possible to race with ComputeManager._confirm_resize
    setting the instance.old_flavor to None before the migration
    status is changed to "confirmed" while the update_available_resource
    periodic is running which will result in _update_usage_from_migration
    hitting an AttributeError when trying to log instance.old_flavor.flavorid
    since instance.old_flavor is None.

    There are a few key points there:

    - We get into _update_usage_from_migration because the
      _update_available_resource method gets in-progress migrations
      related to the host (in this case the source compute) and the
      migration is consider in-progress until its status is "confirmed".

    - The instance is not in the tracked_instances dict when
      _update_usage_from_migration runs because RT only tracks instances
      where the instance.host matches the RT.host and in this case the
      instance has been resized to another compute and the instance.host
      is pointing at the dest compute.

    The fix here is to simply check if we got the instance.old_flavor and
    not log the message if we do not have it, which gets us back to the old
    behavior.

    This bug was found by noticing it in CI job logs - there is a link to
    hits in logstash in the bug report.

    As for the "incoming and not tracked" case in _update_usage_from_migration
    I have not modified that since I am not sure we have the same race nor
    have I seen it in CI logs.

    Change-Id: I43e34b3ff1424d42632a3e8f842c93508905aa1a
    Closes-Bug: #1834349
    (cherry picked from commit 818419c9d313bd6151d1a05b3a087a15116f61b8)

This issue was fixed in the openstack/nova 19.0.2 release.

To post a comment you must log in.
This report contains Public information  Edit
Everyone can see this information.

Other bug subscribers