[SRU] (libvirt) KeyError updating resources for some node, guest.uuid is not in BDM list

Bug #1602057 reported by shiliang
This bug affects 9 people
Affects                    Status         Importance  Assigned to         Milestone
OpenStack Compute (nova)   Fix Released   Medium      Dan Smith
  Mitaka                   Won't Fix      Undecided   Edward Hope-Morley
  Newton                   Fix Committed  Medium      Lee Yarwood
Ubuntu Cloud Archive       Fix Released   Undecided   Unassigned
  Mitaka                   Fix Released   Medium      Edward Hope-Morley
  Newton                   Fix Released   Medium      Unassigned
nova (Ubuntu)              Fix Released   Medium      Unassigned
  Xenial                   Fix Released   Medium      Edward Hope-Morley

Bug Description

[Impact]

There is a race condition in which the compute resource_tracker periodic task polls existing instances and checks their BDMs before any mappings have been created, e.g. the root disk mapping for a new instance. This patch ensures that instances without any BDMs are skipped.
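
A minimal sketch of the failure and the guard, using plain dictionaries in place of nova's objects. The names over_committed_total, local_instances, bdms and disk_size_for are hypothetical stand-ins, not nova's actual code; only the unguarded bdms[guest.uuid] lookup is taken from the traceback in this report.

```python
def over_committed_total(guest_uuids, local_instances, bdms, disk_size_for):
    """Sum a per-instance disk figure, skipping guests with no BDM records.

    Hypothetical stand-in for the loop seen in the traceback: an unguarded
    bdms[uuid] lookup raises KeyError when the periodic task runs before
    any block device mappings exist for a new instance.
    """
    total = 0
    for uuid in guest_uuids:
        if uuid not in local_instances or uuid not in bdms:
            # Instance is unknown or only partially created; skip it and
            # let a later resource_tracker tick account for it.
            continue
        total += disk_size_for(local_instances[uuid], bdms[uuid])
    return total


guests = ["uuid-a", "uuid-b"]
instances = {"uuid-a": {"name": "a"}, "uuid-b": {"name": "b"}}
mappings = {"uuid-a": ["root-disk"]}  # uuid-b has no BDMs yet

# Illustrative sizing rule only: 10 units per mapping.
result = over_committed_total(
    guests, instances, mappings, lambda inst, bdm_list: 10 * len(bdm_list)
)
print(result)  # prints 10
```

The guest without mappings contributes nothing on this pass instead of aborting the whole update.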

[Test Case]
  * deploy OpenStack Mitaka with debug logging enabled (not essential but helps)

  * create an instance

  * delete its BDMs - pastebin.ubuntu.com/24287419/

  * watch /var/log/nova/nova-compute.log on hypervisor hosting instance and wait for next resource_tracker tick

  * ensure that exception mentioned in LP does not occur (happens after "Auditing locally available compute resources for node")
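
The last two steps can be checked with a small script rather than by eye. This is a sketch only: ticks_with_errors is a hypothetical helper, the two marker strings are copied from the log excerpts in this report, and the request IDs in the sample lines are abbreviated for illustration.

```python
AUDIT_MARKER = "Auditing locally available compute resources for node"
ERROR_MARKER = "Error updating resources for node"


def ticks_with_errors(lines):
    """Return (audit_count, error_count) seen in nova-compute log lines."""
    audits = 0
    errors = 0
    for line in lines:
        if AUDIT_MARKER in line:
            audits += 1
        elif ERROR_MARKER in line:
            errors += 1
    return audits, errors


# Sample lines modelled on the excerpts in this report (request IDs shortened).
sample = [
    "2016-07-12 12:34:33.724 3955 INFO nova.compute.resource_tracker [req-...] "
    "Auditing locally available compute resources for node node1.parking.cloud",
    "2016-07-12 12:34:33.807 3955 ERROR nova.compute.manager [req-...] "
    "Error updating resources for node node1.parking.cloud.",
]
print(ticks_with_errors(sample))  # prints (1, 1)
```

On a hypervisor the same function could be fed the lines of /var/log/nova/nova-compute.log; with the fix in place, at least one audit and zero errors would be expected after the next tick.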

[Regression Potential]

The scheduler uses resource tracker information when deciding which compute hosts can have instances scheduled to them. With this patch the resource tracker skips instances that would otherwise contribute to disk overcommit ratios, so the scheduler may momentarily have skewed information about resource consumption on that compute host until the next resource_tracker tick. Since the likelihood of this race condition occurring should be slim, and provided that users run the resource_tracker at a reasonable frequency, the likelihood of this becoming a long-term problem is low: the issue will always be corrected by a subsequent tick (although if the compute host in question were saturated, that would not be fixed until an instance was deleted or migrated).

[Other]
Note that this patch did not make it into the upstream stable/mitaka branch due to the stable cutoff, so the proposal is to carry it in the archive (indefinitely).

--------

2016-07-12 09:54:36.021 10056 ERROR nova.compute.manager [req-d5d5d486-b488-4429-bbb5-24c9f19ff2c0 - - - - -] Error updating resources for node controller.
2016-07-12 09:54:36.021 10056 ERROR nova.compute.manager Traceback (most recent call last):
2016-07-12 09:54:36.021 10056 ERROR nova.compute.manager File "/usr/lib/python2.7/site-packages/nova/compute/manager.py", line 6726, in update_available_resource
2016-07-12 09:54:36.021 10056 ERROR nova.compute.manager rt.update_available_resource(context)
2016-07-12 09:54:36.021 10056 ERROR nova.compute.manager File "/usr/lib/python2.7/site-packages/nova/compute/resource_tracker.py", line 500, in update_available_resource
2016-07-12 09:54:36.021 10056 ERROR nova.compute.manager resources = self.driver.get_available_resource(self.nodename)
2016-07-12 09:54:36.021 10056 ERROR nova.compute.manager File "/usr/lib/python2.7/site-packages/nova/virt/libvirt/driver.py", line 5728, in get_available_resource
2016-07-12 09:54:36.021 10056 ERROR nova.compute.manager disk_over_committed = self._get_disk_over_committed_size_total()
2016-07-12 09:54:36.021 10056 ERROR nova.compute.manager File "/usr/lib/python2.7/site-packages/nova/virt/libvirt/driver.py", line 7397, in _get_disk_over_committed_size_total
2016-07-12 09:54:36.021 10056 ERROR nova.compute.manager local_instances[guest.uuid], bdms[guest.uuid])
2016-07-12 09:54:36.021 10056 ERROR nova.compute.manager KeyError: '0a5c5743-9555-4dfd-b26e-198449ebeee5'
2016-07-12 09:54:36.021 10056 ERROR nova.compute.manager

shiliang (shiliang)
Changed in fuel-plugin-contrail:
assignee: nobody → shiliang (shiliang)
shiliang (shiliang)
affects: fuel-plugin-contrail → nova
Changed in nova:
status: New → In Progress
Revision history for this message
Ilia (ipetrov) wrote :

I confirm this case
2016-07-12 12:34:33.724 3955 INFO nova.compute.resource_tracker [req-11cba8bf-6613-4d41-8e1d-8bf310942ced - - - - -] Auditing locally available compute resources for node node1.parking.cloud
2016-07-12 12:34:33.807 3955 ERROR nova.compute.manager [req-11cba8bf-6613-4d41-8e1d-8bf310942ced - - - - -] Error updating resources for node node1.parking.cloud.
2016-07-12 12:34:33.807 3955 ERROR nova.compute.manager Traceback (most recent call last):
2016-07-12 12:34:33.807 3955 ERROR nova.compute.manager File "/usr/lib/python2.7/dist-packages/nova/compute/manager.py", line 6452, in update_available_resource
2016-07-12 12:34:33.807 3955 ERROR nova.compute.manager rt.update_available_resource(context)
2016-07-12 12:34:33.807 3955 ERROR nova.compute.manager File "/usr/lib/python2.7/dist-packages/nova/compute/resource_tracker.py", line 500, in update_available_resource
2016-07-12 12:34:33.807 3955 ERROR nova.compute.manager resources = self.driver.get_available_resource(self.nodename)
2016-07-12 12:34:33.807 3955 ERROR nova.compute.manager File "/usr/lib/python2.7/dist-packages/nova/virt/libvirt/driver.py", line 5376, in get_available_resource
2016-07-12 12:34:33.807 3955 ERROR nova.compute.manager disk_over_committed = self._get_disk_over_committed_size_total()
2016-07-12 12:34:33.807 3955 ERROR nova.compute.manager File "/usr/lib/python2.7/dist-packages/nova/virt/libvirt/driver.py", line 7054, in _get_disk_over_committed_size_total
2016-07-12 12:34:33.807 3955 ERROR nova.compute.manager local_instances[guest.uuid], bdms[guest.uuid])
2016-07-12 12:34:33.807 3955 ERROR nova.compute.manager KeyError: 'c2d1e02b-2e71-44c9-8d6b-4adb6be0a34f'

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to nova (master)

Fix proposed to branch: master
Review: https://review.openstack.org/345162

Revision history for this message
Matt Riedemann (mriedem) wrote : Re: Error updating resources for some node
Changed in nova:
importance: Undecided → Medium
tags: added: libvirt
summary: - Error updating resources for some node
+ (libvirt) KeyError updating resources for some node, guest.uuid is not
+ in BDM list
Changed in nova:
assignee: shiliang (shiliang) → Dan Smith (danms)
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to nova (master)

Reviewed: https://review.openstack.org/345162
Committed: https://git.openstack.org/cgit/openstack/nova/commit/?id=66246c4c9b6f766f40ee922c38c46f35bb02ae70
Submitter: Jenkins
Branch: master

commit 66246c4c9b6f766f40ee922c38c46f35bb02ae70
Author: shi liang <email address hidden>
Date: Thu Jul 21 12:44:22 2016 +0800

    Fix exception due to BDM race in get_available_resource()

    If we run the resource tracker periodic at the right time, we
    may try to collect BDM info from a newly-created instance before
    we have any BDM records for it. This patch excludes instances
    that have no reported BDMs to avoid choking there. This also
    adds a test which simulates an instance that is partially in
    the database, but is not fully created.

    Closes-Bug: #1602057
    Change-Id: I12c9c1ae6ca27727e8742060647dbe7017cded08

Changed in nova:
status: In Progress → Fix Released
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to nova (stable/newton)

Fix proposed to branch: stable/newton
Review: https://review.openstack.org/387859

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to nova (stable/newton)

Reviewed: https://review.openstack.org/387859
Committed: https://git.openstack.org/cgit/openstack/nova/commit/?id=73e17c3c2e3041aaaff43896c023d1a63cd0ce1f
Submitter: Jenkins
Branch: stable/newton

commit 73e17c3c2e3041aaaff43896c023d1a63cd0ce1f
Author: shi liang <email address hidden>
Date: Thu Jul 21 12:44:22 2016 +0800

    Fix exception due to BDM race in get_available_resource()

    If we run the resource tracker periodic at the right time, we
    may try to collect BDM info from a newly-created instance before
    we have any BDM records for it. This patch excludes instances
    that have no reported BDMs to avoid choking there. This also
    adds a test which simulates an instance that is partially in
    the database, but is not fully created.

    Closes-Bug: #1602057
    Change-Id: I12c9c1ae6ca27727e8742060647dbe7017cded08
    (cherry picked from commit 66246c4c9b6f766f40ee922c38c46f35bb02ae70)

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/nova 14.0.2

This issue was fixed in the openstack/nova 14.0.2 release.

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/nova 15.0.0.0b1

This issue was fixed in the openstack/nova 15.0.0.0b1 development milestone.

tags: added: sts
Revision history for this message
Edward Hope-Morley (hopem) wrote : Re: (libvirt) KeyError updating resources for some node, guest.uuid is not in BDM list

Fix proposed to branch: stable/mitaka
Review: https://review.openstack.org/405467

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to nova (stable/mitaka)

Fix proposed to branch: stable/mitaka
Review: https://review.openstack.org/405467

Changed in cloud-archive:
status: New → Fix Released
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Change abandoned on nova (stable/mitaka)

Change abandoned by Lee Yarwood (<email address hidden>) on branch: stable/mitaka
Review: https://review.openstack.org/405467
Reason: Abandoning this review given it is not suitable for stable/mitaka.

Revision history for this message
Maximiliano (massimo-6) wrote : Re: (libvirt) KeyError updating resources for some node, guest.uuid is not in BDM list

This bug is also affecting me on stable Mitaka. How can we fix it?

2016-12-28 12:20:44.847 9941 ERROR nova.compute.manager [req-d83fa70e-e6a1-49d8-9f15-2ddaaa9c07d7 - - - - -] Error updating resources for node oscomp02.tentails.net.
2016-12-28 12:20:44.847 9941 ERROR nova.compute.manager Traceback (most recent call last):
2016-12-28 12:20:44.847 9941 ERROR nova.compute.manager File "/usr/lib/python2.7/site-packages/nova/compute/manager.py", line 6487, in update_available_resource
2016-12-28 12:20:44.847 9941 ERROR nova.compute.manager rt.update_available_resource(context)
2016-12-28 12:20:44.847 9941 ERROR nova.compute.manager File "/usr/lib/python2.7/site-packages/nova/compute/resource_tracker.py", line 508, in update_available_resource
2016-12-28 12:20:44.847 9941 ERROR nova.compute.manager resources = self.driver.get_available_resource(self.nodename)
2016-12-28 12:20:44.847 9941 ERROR nova.compute.manager File "/usr/lib/python2.7/site-packages/nova/virt/libvirt/driver.py", line 5393, in get_available_resource
2016-12-28 12:20:44.847 9941 ERROR nova.compute.manager disk_over_committed = self._get_disk_over_committed_size_total()
2016-12-28 12:20:44.847 9941 ERROR nova.compute.manager File "/usr/lib/python2.7/site-packages/nova/virt/libvirt/driver.py", line 7074, in _get_disk_over_committed_size_total
2016-12-28 12:20:44.847 9941 ERROR nova.compute.manager local_instances[guest.uuid], bdms[guest.uuid])
2016-12-28 12:20:44.847 9941 ERROR nova.compute.manager KeyError: 'ba7eedbd-55c7-4751-ade9-30d3f52d6163'
2016-12-28 12:20:44.847 9941 ERROR nova.compute.manager

# rpm -qa | egrep nova
openstack-nova-common-13.1.2-1.el7.noarch
python-nova-13.1.2-1.el7.noarch
python2-novaclient-3.3.2-1.el7.noarch
openstack-nova-compute-13.1.2-1.el7.noarch

Any chance of applying the fix proposed in review https://review.openstack.org/405467 ?

Alvaro Uria (aluria)
tags: added: canonical-bootstack
no longer affects: ubuntu
no longer affects: Ubuntu Xenial
tags: added: sts-sru
tags: added: sts-sru-needed
removed: sts-sru
Changed in nova (Ubuntu Xenial):
assignee: nobody → Edward Hope-Morley (hopem)
Revision history for this message
Launchpad Janitor (janitor) wrote :

Status changed to 'Confirmed' because the bug affects multiple users.

Changed in nova (Ubuntu Xenial):
status: New → Confirmed
Changed in nova (Ubuntu):
status: New → Confirmed
summary: - (libvirt) KeyError updating resources for some node, guest.uuid is not
- in BDM list
+ [SRU] (libvirt) KeyError updating resources for some node, guest.uuid is
+ not in BDM list
description: updated
tags: added: sts-sponsor
description: updated
Revision history for this message
Dave Chiluk (chiluk) wrote :

@hopem

Can you please add dep3 headers to the patch? I know it will be going into an LTS release, but for the sake of future reviews it would be helpful to have.

Revision history for this message
Edward Hope-Morley (hopem) wrote :
Revision history for this message
Edward Hope-Morley (hopem) wrote :

@chiluk fixed. I actually forgot to use the actual git diff (which contains dep3-style info) but have fixed now.

tags: removed: sts-sponsor
Mathew Hodson (mhodson)
Changed in nova (Ubuntu):
importance: Undecided → Medium
Changed in nova (Ubuntu Xenial):
importance: Undecided → Medium
Revision history for this message
Brian Murray (brian-murray) wrote :

Has this been fixed in Zesty?

Changed in nova (Ubuntu Xenial):
status: Confirmed → Incomplete
Revision history for this message
JuanJo Ciarlante (jjo) wrote :

FYI we're also hitting this on trusty/mitaka for what looks
like incompletely deleted instances:

* still running at hypervisor, ie
virsh dominfo UUID # shows it ok

* deleted both at nova 'instances' and 'block_device_mapping' tables.

Once certain it's still running at hypervisor,
our workaround is to revive the instance at nova DB
with something like:

mysql> begin work;
mysql> update instances
  set vm_state='active', deleted=0, deleted_at=NULL
  where uuid='<UUID>';
mysql> update block_device_mapping
  set deleted=0, deleted_at=NULL
  where instance_uuid='<UUID>';
mysql> commit work;

Note also it has happened to us from failed migrations
(ie instance shown at the 'wrong' host at nova DB),
we've fixed those by adding to the 1st SQL

 host='<service_hostname>', node='<hypervisor_hostname>',

with above hostname-s as:
- <service_hostname> from nova service-list
- <hypervisor_hostname> from nova hypervisor-list
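
Before applying a workaround like the one above, it helps to list which guests are actually in that state. A rough sketch, assuming the three UUID collections have already been gathered by other means (for example from virsh and from queries against the 'instances' and 'block_device_mapping' tables); orphaned_guests and its argument names are hypothetical.

```python
def orphaned_guests(hypervisor_uuids, db_instance_uuids, db_bdm_uuids):
    """Report guests running on the hypervisor but missing from the nova DB.

    All three arguments are iterables of UUID strings; how they are
    collected is outside the scope of this sketch.
    """
    running = set(hypervisor_uuids)
    return {
        # Running guests with no live row in 'instances'.
        "missing_instance": sorted(running - set(db_instance_uuids)),
        # Running guests with no live rows in 'block_device_mapping'.
        "missing_bdms": sorted(running - set(db_bdm_uuids)),
    }


report = orphaned_guests(
    hypervisor_uuids=["a", "b", "c"],
    db_instance_uuids=["a", "b"],
    db_bdm_uuids=["a"],
)
print(report)
```

Guests listed under missing_bdms are the ones that would trigger the KeyError on an unpatched compute node.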

Revision history for this message
James Page (james-page) wrote :

Removing sponsors as update is already in the unapproved queue for xenial

Changed in nova (Ubuntu):
status: Confirmed → Fix Released
Changed in nova (Ubuntu Xenial):
status: Incomplete → Triaged
Revision history for this message
Brian Murray (brian-murray) wrote : Please test proposed package

Hello shiliang, or anyone else affected,

Accepted nova into xenial-proposed. The package will build now and be available at https://launchpad.net/ubuntu/+source/nova/2:13.1.3-0ubuntu2 in a few hours, and then in the -proposed repository.

Please help us by testing this new package. See https://wiki.ubuntu.com/Testing/EnableProposed for documentation on how to enable and use -proposed. Your feedback will aid us getting this update out to other Ubuntu users.

If this package fixes the bug for you, please add a comment to this bug, mentioning the version of the package you tested, and change the tag from verification-needed to verification-done. If it does not fix the bug for you, please add a comment stating that, and change the tag to verification-failed. In either case, details of your testing will help us make a better decision.

Further information regarding the verification process can be found at https://wiki.ubuntu.com/QATeam/PerformingSRUVerification . Thank you in advance!

Changed in nova (Ubuntu Xenial):
status: Triaged → Fix Committed
tags: added: verification-needed
Revision history for this message
Edward Hope-Morley (hopem) wrote :

Fix verified on Xenial Mitaka.

tags: added: verification-done
removed: verification-needed
Revision history for this message
James Page (james-page) wrote :

Hello shiliang, or anyone else affected,

Accepted nova into mitaka-proposed. The package will build now and be available in the Ubuntu Cloud Archive in a few hours, and then in the -proposed repository.

Please help us by testing this new package. To enable the -proposed repository:

  sudo add-apt-repository cloud-archive:mitaka-proposed
  sudo apt-get update

Your feedback will aid us getting this update out to other Ubuntu users.

If this package fixes the bug for you, please add a comment to this bug, mentioning the version of the package you tested, and change the tag from verification-mitaka-needed to verification-mitaka-done. If it does not fix the bug for you, please add a comment stating that, and change the tag to verification-mitaka-failed. In either case, details of your testing will help us make a better decision.

Further information regarding the verification process can be found at https://wiki.ubuntu.com/QATeam/PerformingSRUVerification . Thank you in advance!

tags: added: verification-mitaka-needed
Revision history for this message
Launchpad Janitor (janitor) wrote :

This bug was fixed in the package nova - 2:13.1.3-0ubuntu2

---------------
nova (2:13.1.3-0ubuntu2) xenial; urgency=medium

  * Fix exception due to BDM race in get_available_resource() (LP: #1602057)
    - d/p/fix-exception-due-to-bdm-race-in-get_available_resou.patch

 -- Edward Hope-Morley <email address hidden> Fri, 31 Mar 2017 10:38:17 +0100

Changed in nova (Ubuntu Xenial):
status: Fix Committed → Fix Released
Revision history for this message
Chris Halse Rogers (raof) wrote : Update Released

The verification of the Stable Release Update for nova has completed successfully and the package has now been released to -updates. Subsequently, the Ubuntu Stable Release Updates Team is being unsubscribed and will not receive messages about this bug report. In the event that you encounter a regression using the package from -updates please report a new bug using ubuntu-bug and tag the bug report regression-update so we can easily find any regressions.

Revision history for this message
James Page (james-page) wrote :

The verification of the Stable Release Update for nova has completed successfully and the package has now been released to -updates. In the event that you encounter a regression using the package from -updates please report a new bug using ubuntu-bug and tag the bug report regression-update so we can easily find any regressions.

Revision history for this message
James Page (james-page) wrote :

This bug was fixed in the package nova - 2:13.1.3-0ubuntu2~cloud0
---------------

 nova (2:13.1.3-0ubuntu2~cloud0) trusty-mitaka; urgency=medium
 .
   * New update for the Ubuntu Cloud Archive.
 .
 nova (2:13.1.3-0ubuntu2) xenial; urgency=medium
 .
   * Fix exception due to BDM race in get_available_resource() (LP: #1602057)
     - d/p/fix-exception-due-to-bdm-race-in-get_available_resou.patch

Revision history for this message
Edward Hope-Morley (hopem) wrote :

fwiw i also tested trusty-mitaka-proposed and lgtm

tags: added: verification-mitaka-done
removed: verification-mitaka-needed
tags: added: sts-sru-done
removed: sts-sru-needed