[SRU] (libvirt) KeyError updating resources for some node, guest.uuid is not in BDM list

Bug #1602057 reported by shiliang on 2016-07-12
This bug affects 8 people
Affects                    Importance  Assigned to
OpenStack Compute (nova)   Medium      Dan Smith
  Mitaka                   Undecided   Edward Hope-Morley
  Newton                   Medium      Lee Yarwood
Ubuntu Cloud Archive       Undecided   Unassigned
  Mitaka                   Medium      Edward Hope-Morley
  Newton                   Medium      Unassigned
nova (Ubuntu)              Medium      Unassigned
  Xenial                   Medium      Edward Hope-Morley

Bug Description

[Impact]

A race condition exists in which the compute resource_tracker periodic task polls existing instances and checks their BDMs before any mappings have been created, e.g. the root disk mapping for a new instance. This patch ensures that instances without any BDMs are skipped.
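
The skip described above can be sketched as follows. This is an illustration under stated assumptions, not nova's actual code: `Guest`, `over_committed_size`, `disk_over_committed_total`, the dict shapes, and all values are hypothetical; only the `local_instances[guest.uuid]` / `bdms[guest.uuid]` lookup pattern is taken from the traceback in this report.

```python
from collections import namedtuple

# Hypothetical stand-in for a libvirt guest; only the uuid matters here.
Guest = namedtuple("Guest", ["uuid"])


def over_committed_size(instance, bdm_list):
    # Placeholder per-instance calculation, assumed purely for illustration.
    return instance["virtual_gb"] - instance["allocated_gb"]


def disk_over_committed_total(guests, local_instances, bdms):
    """Sum over-committed disk, skipping guests with no DB or BDM records."""
    total = 0
    skipped = []
    for guest in guests:
        if guest.uuid not in local_instances or guest.uuid not in bdms:
            # The instance is only partially in the database (the race
            # window), so an unguarded dict lookup would raise KeyError.
            skipped.append(guest.uuid)
            continue
        total += over_committed_size(local_instances[guest.uuid],
                                     bdms[guest.uuid])
    return total, skipped


guests = [Guest("aaa"), Guest("bbb")]
local_instances = {
    "aaa": {"virtual_gb": 20, "allocated_gb": 5},
    "bbb": {"virtual_gb": 10, "allocated_gb": 2},
}
bdms = {"aaa": ["root-disk-mapping"]}  # "bbb" has no BDM records yet

total, skipped = disk_over_committed_total(guests, local_instances, bdms)
print(total, skipped)
```

In this sketch the skipped guest's 8 GB is left out of the total until a later tick finds its BDM records, which is the temporary undercount discussed under [Regression Potential].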

[Test Case]
  * deploy OpenStack Mitaka with debug logging enabled (not essential but helps)

  * create an instance

  * delete its BDMs - pastebin.ubuntu.com/24287419/

  * watch /var/log/nova/nova-compute.log on the hypervisor hosting the instance and wait for the next resource_tracker tick

  * ensure that exception mentioned in LP does not occur (happens after "Auditing locally available compute resources for node")
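
The log check in the last two steps could be scripted roughly like this. It is a sketch only: the function name `scan_ticks` is hypothetical, and the marker strings and sample lines are copied from the log excerpts quoted in this report.

```python
import re

AUDIT_MARKER = "Auditing locally available compute resources for node"
ERROR_MARKER = "Error updating resources for node"
# A UUID is 36 characters of lowercase hex digits and dashes.
KEYERROR_RE = re.compile(r"KeyError: '([0-9a-f-]{36})'")


def scan_ticks(lines):
    """Return (ticks, failures, uuids) found in nova-compute log lines."""
    ticks = 0
    failures = 0
    uuids = []
    for line in lines:
        if AUDIT_MARKER in line:
            ticks += 1
        elif ERROR_MARKER in line:
            failures += 1
        else:
            match = KEYERROR_RE.search(line)
            if match:
                uuids.append(match.group(1))
    return ticks, failures, uuids


sample = [
    "2016-07-12 12:34:33.724 3955 INFO nova.compute.resource_tracker "
    "[req-11cba8bf-6613-4d41-8e1d-8bf310942ced - - - - -] Auditing locally "
    "available compute resources for node node1.parking.cloud",
    "2016-07-12 12:34:33.807 3955 ERROR nova.compute.manager "
    "[req-11cba8bf-6613-4d41-8e1d-8bf310942ced - - - - -] Error updating "
    "resources for node node1.parking.cloud.",
    "2016-07-12 12:34:33.807 3955 ERROR nova.compute.manager "
    "KeyError: 'c2d1e02b-2e71-44c9-8d6b-4adb6be0a34f'",
]

result = scan_ticks(sample)
print(result)
```

On a fixed package, passing the open log file to `scan_ticks` should report at least one tick with zero failures and an empty UUID list.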

[Regression Potential]

The scheduler uses resource tracker information when deciding which compute hosts can have instances scheduled to them. With this patch, the resource tracker skips instances that would otherwise contribute to the disk overcommit ratio, so the scheduler may briefly have skewed information about resource consumption on that compute host until the next resource_tracker tick. The likelihood of this race occurring is hopefully slim and, provided the resource_tracker runs at a reasonable frequency, it is unlikely to become a long-term problem because a subsequent tick always corrects the figures (although if the compute host in question were saturated, that would not be fixed until an instance was deleted or migrated).

[Other]
Note that this patch did not make it into the upstream stable/mitaka branch due to the stable cutoff, so the proposal is to carry it in the archive (indefinitely).

--------

2016-07-12 09:54:36.021 10056 ERROR nova.compute.manager [req-d5d5d486-b488-4429-bbb5-24c9f19ff2c0 - - - - -] Error updating resources for node controller.
2016-07-12 09:54:36.021 10056 ERROR nova.compute.manager Traceback (most recent call last):
2016-07-12 09:54:36.021 10056 ERROR nova.compute.manager File "/usr/lib/python2.7/site-packages/nova/compute/manager.py", line 6726, in update_available_resource
2016-07-12 09:54:36.021 10056 ERROR nova.compute.manager rt.update_available_resource(context)
2016-07-12 09:54:36.021 10056 ERROR nova.compute.manager File "/usr/lib/python2.7/site-packages/nova/compute/resource_tracker.py", line 500, in update_available_resource
2016-07-12 09:54:36.021 10056 ERROR nova.compute.manager resources = self.driver.get_available_resource(self.nodename)
2016-07-12 09:54:36.021 10056 ERROR nova.compute.manager File "/usr/lib/python2.7/site-packages/nova/virt/libvirt/driver.py", line 5728, in get_available_resource
2016-07-12 09:54:36.021 10056 ERROR nova.compute.manager disk_over_committed = self._get_disk_over_committed_size_total()
2016-07-12 09:54:36.021 10056 ERROR nova.compute.manager File "/usr/lib/python2.7/site-packages/nova/virt/libvirt/driver.py", line 7397, in _get_disk_over_committed_size_total
2016-07-12 09:54:36.021 10056 ERROR nova.compute.manager local_instances[guest.uuid], bdms[guest.uuid])
2016-07-12 09:54:36.021 10056 ERROR nova.compute.manager KeyError: '0a5c5743-9555-4dfd-b26e-198449ebeee5'
2016-07-12 09:54:36.021 10056 ERROR nova.compute.manager

shiliang (shiliang) on 2016-07-12
Changed in fuel-plugin-contrail:
assignee: nobody → shiliang (shiliang)
shiliang (shiliang) on 2016-07-12
affects: fuel-plugin-contrail → nova
Changed in nova:
status: New → In Progress
Ilia (ipetrov) wrote :

I confirm this case
2016-07-12 12:34:33.724 3955 INFO nova.compute.resource_tracker [req-11cba8bf-6613-4d41-8e1d-8bf310942ced - - - - -] Auditing locally available compute resources for node node1.parking.cloud
2016-07-12 12:34:33.807 3955 ERROR nova.compute.manager [req-11cba8bf-6613-4d41-8e1d-8bf310942ced - - - - -] Error updating resources for node node1.parking.cloud.
2016-07-12 12:34:33.807 3955 ERROR nova.compute.manager Traceback (most recent call last):
2016-07-12 12:34:33.807 3955 ERROR nova.compute.manager File "/usr/lib/python2.7/dist-packages/nova/compute/manager.py", line 6452, in update_available_resource
2016-07-12 12:34:33.807 3955 ERROR nova.compute.manager rt.update_available_resource(context)
2016-07-12 12:34:33.807 3955 ERROR nova.compute.manager File "/usr/lib/python2.7/dist-packages/nova/compute/resource_tracker.py", line 500, in update_available_resource
2016-07-12 12:34:33.807 3955 ERROR nova.compute.manager resources = self.driver.get_available_resource(self.nodename)
2016-07-12 12:34:33.807 3955 ERROR nova.compute.manager File "/usr/lib/python2.7/dist-packages/nova/virt/libvirt/driver.py", line 5376, in get_available_resource
2016-07-12 12:34:33.807 3955 ERROR nova.compute.manager disk_over_committed = self._get_disk_over_committed_size_total()
2016-07-12 12:34:33.807 3955 ERROR nova.compute.manager File "/usr/lib/python2.7/dist-packages/nova/virt/libvirt/driver.py", line 7054, in _get_disk_over_committed_size_total
2016-07-12 12:34:33.807 3955 ERROR nova.compute.manager local_instances[guest.uuid], bdms[guest.uuid])
2016-07-12 12:34:33.807 3955 ERROR nova.compute.manager KeyError: 'c2d1e02b-2e71-44c9-8d6b-4adb6be0a34f'

Changed in nova:
importance: Undecided → Medium
tags: added: libvirt
summary: - Error updating resources for some node
+ (libvirt) KeyError updating resources for some node, guest.uuid is not
+ in BDM list
Changed in nova:
assignee: shiliang (shiliang) → Dan Smith (danms)

Reviewed: https://review.openstack.org/345162
Committed: https://git.openstack.org/cgit/openstack/nova/commit/?id=66246c4c9b6f766f40ee922c38c46f35bb02ae70
Submitter: Jenkins
Branch: master

commit 66246c4c9b6f766f40ee922c38c46f35bb02ae70
Author: shi liang <email address hidden>
Date: Thu Jul 21 12:44:22 2016 +0800

    Fix exception due to BDM race in get_available_resource()

    If we run the resource tracker periodic at the right time, we
    may try to collect BDM info from a newly-created instance before
    we have any BDM records for it. This patch excludes instances
    that have no reported BDMs to avoid choking there. This also
    adds a test which simulates an instance that is partially in
    the database, but is not fully created.

    Closes-Bug: #1602057
    Change-Id: I12c9c1ae6ca27727e8742060647dbe7017cded08

Changed in nova:
status: In Progress → Fix Released

Reviewed: https://review.openstack.org/387859
Committed: https://git.openstack.org/cgit/openstack/nova/commit/?id=73e17c3c2e3041aaaff43896c023d1a63cd0ce1f
Submitter: Jenkins
Branch: stable/newton

commit 73e17c3c2e3041aaaff43896c023d1a63cd0ce1f
Author: shi liang <email address hidden>
Date: Thu Jul 21 12:44:22 2016 +0800

    Fix exception due to BDM race in get_available_resource()

    If we run the resource tracker periodic at the right time, we
    may try to collect BDM info from a newly-created instance before
    we have any BDM records for it. This patch excludes instances
    that have no reported BDMs to avoid choking there. This also
    adds a test which simulates an instance that is partially in
    the database, but is not fully created.

    Closes-Bug: #1602057
    Change-Id: I12c9c1ae6ca27727e8742060647dbe7017cded08
    (cherry picked from commit 66246c4c9b6f766f40ee922c38c46f35bb02ae70)

This issue was fixed in the openstack/nova 14.0.2 release.

This issue was fixed in the openstack/nova 15.0.0.0b1 development milestone.

tags: added: sts


Changed in cloud-archive:
status: New → Fix Released

Change abandoned by Lee Yarwood (<email address hidden>) on branch: stable/mitaka
Review: https://review.openstack.org/405467
Reason: Abandoning this review given it is not suitable for stable/mitaka.

This bug is also affecting me on stable Mitaka. How can we fix it?

2016-12-28 12:20:44.847 9941 ERROR nova.compute.manager [req-d83fa70e-e6a1-49d8-9f15-2ddaaa9c07d7 - - - - -] Error updating resources for node oscomp02.tentails.net.
2016-12-28 12:20:44.847 9941 ERROR nova.compute.manager Traceback (most recent call last):
2016-12-28 12:20:44.847 9941 ERROR nova.compute.manager File "/usr/lib/python2.7/site-packages/nova/compute/manager.py", line 6487, in update_available_resource
2016-12-28 12:20:44.847 9941 ERROR nova.compute.manager rt.update_available_resource(context)
2016-12-28 12:20:44.847 9941 ERROR nova.compute.manager File "/usr/lib/python2.7/site-packages/nova/compute/resource_tracker.py", line 508, in update_available_resource
2016-12-28 12:20:44.847 9941 ERROR nova.compute.manager resources = self.driver.get_available_resource(self.nodename)
2016-12-28 12:20:44.847 9941 ERROR nova.compute.manager File "/usr/lib/python2.7/site-packages/nova/virt/libvirt/driver.py", line 5393, in get_available_resource
2016-12-28 12:20:44.847 9941 ERROR nova.compute.manager disk_over_committed = self._get_disk_over_committed_size_total()
2016-12-28 12:20:44.847 9941 ERROR nova.compute.manager File "/usr/lib/python2.7/site-packages/nova/virt/libvirt/driver.py", line 7074, in _get_disk_over_committed_size_total
2016-12-28 12:20:44.847 9941 ERROR nova.compute.manager local_instances[guest.uuid], bdms[guest.uuid])
2016-12-28 12:20:44.847 9941 ERROR nova.compute.manager KeyError: 'ba7eedbd-55c7-4751-ade9-30d3f52d6163'
2016-12-28 12:20:44.847 9941 ERROR nova.compute.manager

# rpm -qa | egrep nova
openstack-nova-common-13.1.2-1.el7.noarch
python-nova-13.1.2-1.el7.noarch
python2-novaclient-3.3.2-1.el7.noarch
openstack-nova-compute-13.1.2-1.el7.noarch

Any chance of applying the fix proposed in review https://review.openstack.org/405467 ?

Alvaro Uría (aluria) on 2017-02-06
tags: added: canonical-bootstack
no longer affects: ubuntu
no longer affects: Ubuntu Xenial
tags: added: sts-sru
tags: added: sts-sru-needed
removed: sts-sru
Changed in nova (Ubuntu Xenial):
assignee: nobody → Edward Hope-Morley (hopem)
Launchpad Janitor (janitor) wrote :

Status changed to 'Confirmed' because the bug affects multiple users.

Changed in nova (Ubuntu Xenial):
status: New → Confirmed
Changed in nova (Ubuntu):
status: New → Confirmed
summary: - (libvirt) KeyError updating resources for some node, guest.uuid is not
- in BDM list
+ [SRU] (libvirt) KeyError updating resources for some node, guest.uuid is
+ not in BDM list
description: updated
tags: added: sts-sponsor
description: updated
Dave Chiluk (chiluk) wrote :

@hopem

Can you please add dep3 headers to the patch? I know it is going into an LTS release, but for the sake of future reviews it would be helpful to have.

Edward Hope-Morley (hopem) wrote :

@chiluk fixed. I forgot to use the actual git diff (which contains dep3-style info) but have fixed that now.

tags: removed: sts-sponsor
Changed in nova (Ubuntu):
importance: Undecided → Medium
Changed in nova (Ubuntu Xenial):
importance: Undecided → Medium
Brian Murray (brian-murray) wrote :

Has this been fixed in Zesty?

Changed in nova (Ubuntu Xenial):
status: Confirmed → Incomplete
JuanJo Ciarlante (jjo) wrote :

FYI we're also hitting this on trusty/mitaka for what looks
like incompletely deleted instances:

* still running at hypervisor, ie
virsh dominfo UUID # shows it ok

* deleted both at nova 'instances' and 'block_device_mapping' tables.

Once certain it's still running at hypervisor,
our workaround is to revive the instance at nova DB
with something like:

mysql> begin work;
mysql> update instances
  set vm_state='active', deleted=0, deleted_at=NULL
  where uuid='<UUID>';
mysql> update block_device_mapping
  set deleted=0, deleted_at=NULL
  where instance_uuid='<UUID>';
mysql> commit work;

Note also it has happened to us from failed migrations
(ie instance shown at the 'wrong' host at nova DB),
we've fixed those by adding to the 1st SQL

 host='<service_hostname>', node='<hypervisor_hostname>',

with above hostname-s as:
- <service_hostname> from nova service-list
- <hypervisor_hostname> from nova hypervisor-list
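
For anyone wanting to rehearse the workaround above before touching a production database, here is a dry-run sketch against an in-memory SQLite database. The two-table schema is a heavily simplified assumption (the real nova tables live in MySQL and have many more columns), and `revive_instance` and all sample values are hypothetical.

```python
import sqlite3


def revive_instance(conn, uuid, host=None, node=None):
    """Undo the soft delete of one instance and its BDM rows atomically."""
    with conn:  # commits on success, rolls back if any statement raises
        conn.execute(
            "UPDATE instances SET vm_state='active', deleted=0, "
            "deleted_at=NULL WHERE uuid=?",
            (uuid,),
        )
        if host is not None and node is not None:
            # Extra step for the failed-migration case described above.
            conn.execute(
                "UPDATE instances SET host=?, node=? WHERE uuid=?",
                (host, node, uuid),
            )
        conn.execute(
            "UPDATE block_device_mapping SET deleted=0, deleted_at=NULL "
            "WHERE instance_uuid=?",
            (uuid,),
        )


conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE instances (uuid TEXT, vm_state TEXT, deleted INTEGER,
                        deleted_at TEXT, host TEXT, node TEXT);
CREATE TABLE block_device_mapping (instance_uuid TEXT, deleted INTEGER,
                                   deleted_at TEXT);
INSERT INTO instances
    VALUES ('u-1', 'deleted', 7, '2017-01-01', 'old-host', 'old-node');
INSERT INTO block_device_mapping VALUES ('u-1', 3, '2017-01-01');
""")

revive_instance(conn, "u-1", host="svc-host", node="hv-host")
row = conn.execute(
    "SELECT vm_state, deleted, deleted_at, host, node FROM instances"
).fetchone()
bdm = conn.execute(
    "SELECT deleted, deleted_at FROM block_device_mapping"
).fetchone()
print(row, bdm)
```

Running both updates inside one transaction mirrors the `begin work` / `commit work` pair in the SQL above, so a failure part-way through leaves neither table half-updated.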

James Page (james-page) wrote :

Removing sponsors as update is already in the unapproved queue for xenial

Changed in nova (Ubuntu):
status: Confirmed → Fix Released
Changed in nova (Ubuntu Xenial):
status: Incomplete → Triaged

Hello shiliang, or anyone else affected,

Accepted nova into xenial-proposed. The package will build now and be available at https://launchpad.net/ubuntu/+source/nova/2:13.1.3-0ubuntu2 in a few hours, and then in the -proposed repository.

Please help us by testing this new package. See https://wiki.ubuntu.com/Testing/EnableProposed for documentation on how to enable and use -proposed. Your feedback will aid us getting this update out to other Ubuntu users.

If this package fixes the bug for you, please add a comment to this bug, mentioning the version of the package you tested, and change the tag from verification-needed to verification-done. If it does not fix the bug for you, please add a comment stating that, and change the tag to verification-failed. In either case, details of your testing will help us make a better decision.

Further information regarding the verification process can be found at https://wiki.ubuntu.com/QATeam/PerformingSRUVerification . Thank you in advance!

Changed in nova (Ubuntu Xenial):
status: Triaged → Fix Committed
tags: added: verification-needed
Edward Hope-Morley (hopem) wrote :

Fix verified on Xenial Mitaka.

tags: added: verification-done
removed: verification-needed
James Page (james-page) wrote :

Hello shiliang, or anyone else affected,

Accepted nova into mitaka-proposed. The package will build now and be available in the Ubuntu Cloud Archive in a few hours, and then in the -proposed repository.

Please help us by testing this new package. To enable the -proposed repository:

  sudo add-apt-repository cloud-archive:mitaka-proposed
  sudo apt-get update

Your feedback will aid us getting this update out to other Ubuntu users.

If this package fixes the bug for you, please add a comment to this bug, mentioning the version of the package you tested, and change the tag from verification-mitaka-needed to verification-mitaka-done. If it does not fix the bug for you, please add a comment stating that, and change the tag to verification-mitaka-failed. In either case, details of your testing will help us make a better decision.

Further information regarding the verification process can be found at https://wiki.ubuntu.com/QATeam/PerformingSRUVerification . Thank you in advance!

tags: added: verification-mitaka-needed
Launchpad Janitor (janitor) wrote :

This bug was fixed in the package nova - 2:13.1.3-0ubuntu2

---------------
nova (2:13.1.3-0ubuntu2) xenial; urgency=medium

  * Fix exception due to BDM race in get_available_resource() (LP: #1602057)
    - d/p/fix-exception-due-to-bdm-race-in-get_available_resou.patch

 -- Edward Hope-Morley <email address hidden> Fri, 31 Mar 2017 10:38:17 +0100

Changed in nova (Ubuntu Xenial):
status: Fix Committed → Fix Released

The verification of the Stable Release Update for nova has completed successfully and the package has now been released to -updates. Subsequently, the Ubuntu Stable Release Updates Team is being unsubscribed and will not receive messages about this bug report. In the event that you encounter a regression using the package from -updates please report a new bug using ubuntu-bug and tag the bug report regression-update so we can easily find any regressions.

James Page (james-page) wrote :

The verification of the Stable Release Update for nova has completed successfully and the package has now been released to -updates. In the event that you encounter a regression using the package from -updates please report a new bug using ubuntu-bug and tag the bug report regression-update so we can easily find any regressions.

James Page (james-page) wrote :

This bug was fixed in the package nova - 2:13.1.3-0ubuntu2~cloud0
---------------

 nova (2:13.1.3-0ubuntu2~cloud0) trusty-mitaka; urgency=medium
 .
   * New update for the Ubuntu Cloud Archive.
 .
 nova (2:13.1.3-0ubuntu2) xenial; urgency=medium
 .
   * Fix exception due to BDM race in get_available_resource() (LP: #1602057)
     - d/p/fix-exception-due-to-bdm-race-in-get_available_resou.patch

Edward Hope-Morley (hopem) wrote :

fwiw i also tested trusty-mitaka-proposed and lgtm

tags: added: verification-mitaka-done
removed: verification-mitaka-needed
tags: added: sts-sru-done
removed: sts-sru-needed