libvirt: KeyError in _get_disk_over_committed_size_total

Bug #1600031 reported by Matt Riedemann
This bug affects 1 person
Affects: OpenStack Compute (nova)
Status: Fix Released
Importance: Medium
Assigned to: Matt Riedemann
Milestone: (none)

Bug Description

Seen here:

http://logs.openstack.org/21/299621/7/gate/gate-tempest-dsvm-postgres-full/c182c88/logs/screen-n-cpu.txt.gz#_2016-07-07_21_08_11_773

2016-07-07 21:08:11.773 14821 ERROR nova.compute.manager [req-1bfb9ba6-a2de-417e-9ad7-f72053ff9d96 - -] Error updating resources for node ubuntu-trusty-osic-cloud1-2340260.
2016-07-07 21:08:11.773 14821 ERROR nova.compute.manager Traceback (most recent call last):
2016-07-07 21:08:11.773 14821 ERROR nova.compute.manager File "/opt/stack/new/nova/nova/compute/manager.py", line 6338, in update_available_resource_for_node
2016-07-07 21:08:11.773 14821 ERROR nova.compute.manager rt.update_available_resource(context)
2016-07-07 21:08:11.773 14821 ERROR nova.compute.manager File "/opt/stack/new/nova/nova/compute/resource_tracker.py", line 502, in update_available_resource
2016-07-07 21:08:11.773 14821 ERROR nova.compute.manager resources = self.driver.get_available_resource(self.nodename)
2016-07-07 21:08:11.773 14821 ERROR nova.compute.manager File "/opt/stack/new/nova/nova/virt/libvirt/driver.py", line 5349, in get_available_resource
2016-07-07 21:08:11.773 14821 ERROR nova.compute.manager disk_over_committed = self._get_disk_over_committed_size_total()
2016-07-07 21:08:11.773 14821 ERROR nova.compute.manager File "/opt/stack/new/nova/nova/virt/libvirt/driver.py", line 6898, in _get_disk_over_committed_size_total
2016-07-07 21:08:11.773 14821 ERROR nova.compute.manager local_instances[guest.uuid], bdms[guest.uuid])
2016-07-07 21:08:11.773 14821 ERROR nova.compute.manager KeyError: '6b454071-c139-490d-96af-df701521330f'
2016-07-07 21:08:11.773 14821 ERROR nova.compute.manager

It's coming from a DeleteServersAdminTestJSON test. There are two tests in that class, so I'm not sure which one it is.

Tags: libvirt
Revision history for this message
Matt Riedemann (mriedem) wrote :

http://logstash.openstack.org/#dashboard/file/logstash.json?query=message%3A%5C%22_get_disk_over_committed_size_total%5C%22%20AND%20message%3A%5C%22local_instances%5Bguest.uuid%5D%2C%20bdms%5Bguest.uuid%5D)%5C%22%20AND%20message%3A%5C%22KeyError%5C%22%20AND%20tags%3A%5C%22screen-n-cpu.txt%5C%22&from=7d

Shows this is new in the last 24 hours, so we should look for anything that changed in nova in the last 24 hours related to BDMs or the deletion code. Looking at the code, it gets all instances from the host (via libvirt), pulls those from the DB, and then pulls the BDMs from the DB for those instances; if we're racing with a delete, those might not match.

instance uuids from the host:

https://github.com/openstack/nova/blob/d2a89a85354ab7253d034335aa2eb996e81b6f1c/nova/virt/libvirt/driver.py#L6803

and use those to get them from the DB:

https://github.com/openstack/nova/blob/d2a89a85354ab7253d034335aa2eb996e81b6f1c/nova/virt/libvirt/driver.py#L6815

Which includes deleted instances by default.

Then use that instance uuid list to get BDMs:

https://github.com/openstack/nova/blob/d2a89a85354ab7253d034335aa2eb996e81b6f1c/nova/virt/libvirt/driver.py#L6821

Then get the guest object and check to see if it's in the list of instances from the database, which it is:

https://github.com/openstack/nova/blob/d2a89a85354ab7253d034335aa2eb996e81b6f1c/nova/virt/libvirt/driver.py#L6830

But it's not in the BDMs list, maybe because the BDM is deleted already.
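
To make the window concrete, here's a minimal, self-contained sketch of the three non-atomic lookups. The variable names and helper are illustrative, not the real Nova driver code: if the racing delete removes the instance's BDM rows between the instance query and the BDM query, the final lookup fails with exactly the KeyError in the traceback above.

# Hypothetical, simplified stand-ins for the lookups in
# _get_disk_over_committed_size_total; not the actual driver code.

# Step 1: instance uuids for the guests libvirt reports on this host.
guest_uuids = ['6b454071-c139-490d-96af-df701521330f']

# Step 2: those instances loaded from the DB (deleted ones included),
# keyed by uuid.
local_instances = {uuid: {'uuid': uuid} for uuid in guest_uuids}

# Step 3: BDMs loaded from the DB, keyed by instance uuid. If the racing
# delete has already purged the BDM rows, the uuid is simply missing here.
bdms = {}

for uuid in guest_uuids:
    if uuid in local_instances:
        # The real loop guards against local_instances but indexes bdms
        # directly, so a missing entry surfaces as the KeyError above.
        try:
            bdm_list = bdms[uuid]  # passed on to the disk-info calculation
        except KeyError as exc:
            print('KeyError from the bdms lookup: %s' % exc)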

Revision history for this message
Matt Riedemann (mriedem) wrote :

I'm pretty sure this change made yesterday caused the regression:

https://review.openstack.org/#/c/334094/

It creates the BDM but delays setting the instance_uuid field, which matches the problem we're seeing here: we get a list of BDMs, but the instance on the compute isn't represented in that list yet.
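
Purely to illustrate that theory (it's retracted in the next comment), here's a hypothetical sketch: a BDM row created with instance_uuid left unset wouldn't show up when the BDMs are grouped by instance uuid. The row layout and helper below are made up for illustration.

# Hypothetical BDM row created before its instance_uuid is filled in.
bdm_rows = [{'id': 1, 'instance_uuid': None, 'volume_id': 'vol-1'}]

def bdms_by_instance(rows, instance_uuids):
    """Group BDM rows by instance uuid, roughly like the driver does."""
    grouped = {}
    for row in rows:
        if row['instance_uuid'] in instance_uuids:
            grouped.setdefault(row['instance_uuid'], []).append(row)
    return grouped

# The BDM exists, but the grouping keyed by instance uuid comes back empty.
print(bdms_by_instance(bdm_rows, {'6b454071-c139-490d-96af-df701521330f'}))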

Revision history for this message
Matt Riedemann (mriedem) wrote :

Actually, I'm now thinking it's not https://review.openstack.org/#/c/334094/ after all. That change just reworks how we create the instance and BDMs in the DB from the API; at that point we haven't even hit the scheduler or built the instance on the compute, so for the periodic task on the compute to fail there would have to be guests with those instance uuids, which there wouldn't be yet. This is more likely a race with delete, so now I'm thinking it's related to https://review.openstack.org/#/c/335697/ instead.

Revision history for this message
Matt Riedemann (mriedem) wrote :

Bingo:

https://review.openstack.org/#/c/335697/2/nova/compute/api.py@1677

That change assumes that an instance in the error state should get a local delete, but the test that's triggering this resets the instance into the error state:

http://git.openstack.org/cgit/openstack/tempest/tree/tempest/api/compute/servers/test_delete_server.py#n141

Even though the instance still has a host.
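
A minimal sketch of that faulty assumption (simplified; not the actual nova/compute/api.py code): treating the ERROR vm_state alone as grounds for a local delete means a server that was merely reset to ERROR, but still has a host and a running guest, gets its DB records (instance and BDMs) removed without the compute ever tearing anything down, which opens the race described above.

ERROR = 'error'

def should_local_delete(instance):
    # The suspect assumption: ERROR vm_state alone sends the delete down
    # the local-delete path, ignoring whether a compute host is involved.
    return instance['vm_state'] == ERROR

# The tempest test resets a live server to ERROR before deleting it, so the
# instance still has a host (values here are illustrative).
instance = {'uuid': '6b454071-c139-490d-96af-df701521330f',
            'vm_state': ERROR,
            'host': 'ubuntu-trusty-osic-cloud1-2340260'}

print(should_local_delete(instance))  # True, even though 'host' is set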

Matt Riedemann (mriedem)
Changed in nova:
status: Confirmed → Triaged
assignee: nobody → melanie witt (melwitt)
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to nova (master)

Fix proposed to branch: master
Review: https://review.openstack.org/339307

Changed in nova:
status: Triaged → In Progress
Changed in nova:
assignee: melanie witt (melwitt) → Matt Riedemann (mriedem)
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to nova (master)

Reviewed: https://review.openstack.org/340479
Committed: https://git.openstack.org/cgit/openstack/nova/commit/?id=c1a83a3fb8b7a2ad6bd0d0bafb4a38719cf487ae
Submitter: Jenkins
Branch: master

commit c1a83a3fb8b7a2ad6bd0d0bafb4a38719cf487ae
Author: Matt Riedemann <email address hidden>
Date: Mon Jul 11 17:14:16 2016 +0000

    Revert "Detach volume after deleting instance with no host"

    This reverts commit 5ce74fa06c0e7a70fdc927b2c1f364af83f7de1d.

    We think this is causing a race and the postgres job to fail
    since it's erroneously doing local deletes and not cleaning
    up from the computes. We're not totally sure why it would
    only impact the postgres job though, but the original change
    failed the job and was rechecked, and the time it was merged
    coincides with when the postgres job started spiking with
    failures.

    Related-Bug: #1600005
    Closes-Bug: #1600031

    Change-Id: I0ed184b579b8a8d861e4d2a7c317bf0bc0623d50

Changed in nova:
status: In Progress → Fix Released
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Change abandoned on nova (master)

Change abandoned by melanie witt (<email address hidden>) on branch: master
Review: https://review.openstack.org/339307
Reason: This got squashed into re-proposal https://review.openstack.org/340614

Revision history for this message
Doug Hellmann (doug-hellmann) wrote : Fix included in openstack/nova 14.0.0.0b2

This issue was fixed in the openstack/nova 14.0.0.0b2 development milestone.
