Nova Compute Manager (Resource update) fails if a disk is missing

Bug #1824974 reported by Steven Relf
This bug affects 1 person
Affects: OpenStack Compute (nova)
Status: In Progress
Importance: Low
Assigned to: Vladyslav Drok

Bug Description

===Description===

We recently ran into an issue with the periodic resource update on a KVM hypervisor.

If a disk is missing or unreadable for any reason, the periodic resource updater fails with:

2019-04-16 09:38:29.708 151890 ERROR nova.compute.manager [req-a9ee69b3-6e90-4137-8a6a-c4a4a4426b5a 05fafdddb8f7495fbe162e748fe3f63f 13a3d2b57166496e86e9d25ec8967869 - - -] Error updating resources for node ock00043i2.frn00006.cni.ukcloud.com.
2019-04-16 09:38:29.708 151890 ERROR nova.compute.manager Traceback (most recent call last):
2019-04-16 09:38:29.708 151890 ERROR nova.compute.manager File "/usr/lib/python2.7/site-packages/nova/compute/manager.py", line 6460, in update_available_resource_for_node
2019-04-16 09:38:29.708 151890 ERROR nova.compute.manager rt.update_available_resource(context)
2019-04-16 09:38:29.708 151890 ERROR nova.compute.manager File "/usr/lib/python2.7/site-packages/nova/compute/resource_tracker.py", line 511, in update_available_resource
2019-04-16 09:38:29.708 151890 ERROR nova.compute.manager resources = self.driver.get_available_resource(self.nodename)
2019-04-16 09:38:29.708 151890 ERROR nova.compute.manager File "/usr/lib/python2.7/site-packages/nova/virt/libvirt/driver.py", line 5599, in get_available_resource
2019-04-16 09:38:29.708 151890 ERROR nova.compute.manager disk_over_committed = self._get_disk_over_committed_size_total()
2019-04-16 09:38:29.708 151890 ERROR nova.compute.manager File "/usr/lib/python2.7/site-packages/nova/virt/libvirt/driver.py", line 7192, in _get_disk_over_committed_size_total
2019-04-16 09:38:29.708 151890 ERROR nova.compute.manager block_device_info=block_device_info)
2019-04-16 09:38:29.708 151890 ERROR nova.compute.manager File "/usr/lib/python2.7/site-packages/nova/virt/libvirt/driver.py", line 7101, in _get_instance_disk_info
2019-04-16 09:38:29.708 151890 ERROR nova.compute.manager dk_size = disk_api.get_allocated_disk_size(path)
2019-04-16 09:38:29.708 151890 ERROR nova.compute.manager File "/usr/lib/python2.7/site-packages/nova/virt/disk/api.py", line 158, in get_allocated_disk_size
2019-04-16 09:38:29.708 151890 ERROR nova.compute.manager return images.qemu_img_info(path).disk_size
2019-04-16 09:38:29.708 151890 ERROR nova.compute.manager File "/usr/lib/python2.7/site-packages/nova/virt/images.py", line 57, in qemu_img_info
2019-04-16 09:38:29.708 151890 ERROR nova.compute.manager raise exception.DiskNotFound(location=path)
2019-04-16 09:38:29.708 151890 ERROR nova.compute.manager DiskNotFound: No disk at /var/lib/nova/instances/7af164a3-2eb9-4cb3-9a81-3fd57bf087e9/disk
2019-04-16 09:38:29.708 151890 ERROR nova.compute.manager

The DiskNotFound error itself is expected, since the disk is missing. The problem is that the nova.compute.manager periodic resource update then cannot update ANY resource information for this hypervisor, which leads to stale stats.

It would be better if the error were logged and the other stats (CPU, memory) were still updated.
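
To illustrate the behaviour being asked for, here is a minimal sketch of a per-instance loop that logs a missing disk and carries on instead of aborting the whole update. It is not the nova code: `DiskNotFound` below is a local stand-in for the nova exception, and `total_over_committed`, `instance_disks` and `get_over_committed_size` are hypothetical names made up for this example.

```python
import logging

LOG = logging.getLogger(__name__)


class DiskNotFound(Exception):
    """Local stand-in for nova's DiskNotFound (assumption, simplified)."""


def total_over_committed(instance_disks, get_over_committed_size):
    """Sum per-instance disk sizes, skipping disks that cannot be found.

    instance_disks: dict mapping instance UUID -> disk path (hypothetical).
    get_over_committed_size: callable(path) -> int, may raise DiskNotFound.
    Returns (total_bytes, failed_uuids) so the caller can still report
    CPU/memory and decide what to do about the failed instances.
    """
    total = 0
    failed = []
    for uuid, path in instance_disks.items():
        try:
            total += get_over_committed_size(path)
        except DiskNotFound:
            # Log and continue: one bad disk should not block the others.
            LOG.warning("No disk at %s for instance %s; skipping", path, uuid)
            failed.append(uuid)
    return total, failed
```

The list of failed UUIDs is returned rather than swallowed, so the caller could mark the disk stats as stale for those instances.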

Steps to reproduce
===================

1. Boot an instance on a hypervisor.
2. Shut the instance down.
3. Rename the folder the instance's disk is located in.
4. Watch the logs for the above error.
5. Build additional instances on the same hypervisor.
6. Check whether the hypervisor stats are updated.

Expected results
=================
CPU and memory stats should still be updated; disk stats could perhaps be left as they are or marked as stale.

Actual Results
=================
No stats are updated for the hypervisor.

Environment
===================
Newton OpenStack. Looking at the latest code, I think this is still an issue in the latest release.

Matt Riedemann (mriedem)
tags: added: compute resource-tracker
Matt Riedemann (mriedem)
Changed in nova:
status: New → Triaged
importance: Undecided → Low
Matt Riedemann (mriedem) wrote :

I know that this periodic task can hit a failure for one instance, which then blows up the whole task for every instance on the same host, similar to bug 1662867:

https://review.openstack.org/#/c/553067/

In fact, ^ was backported to Ocata but not Newton so are you sure you're not hitting that issue instead?

Otherwise just generically ignoring DiskNotFound could be risky if we hit that for some other reason than the one you describe (the disk on the host is busted).

I'm inclined to mark this bug as Opinion since "It would be better if this was logged, but the other stats CPU/Memory were able to be updated." is an opinion IMO - one could argue that if the disk is corrupted on the host, we shouldn't be reporting stats on the compute since the scheduler could incorrectly select it for a new build.

Maybe there are other options here? Like maybe adding a counter for how many times we trip over this for an instance that's in steady state (task_state is None) and still on the hypervisor - if we hit that say 10 times we consider it fatal and auto-disable the compute until the operator fixes the problem?
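
One way to read the counter idea is sketched below. This is only an illustration under assumptions: `DiskFailureTracker` and its methods are hypothetical names, nothing here exists in nova, and the threshold of 10 is simply the number mentioned above. How the compute service would actually be disabled is left to the caller.

```python
from collections import defaultdict


class DiskFailureTracker:
    """Count missing-disk hits per instance (hypothetical helper)."""

    def __init__(self, threshold=10):
        self.threshold = threshold
        self._counts = defaultdict(int)

    def record_failure(self, instance_uuid, task_state):
        """Record one DiskNotFound hit.

        Returns True once the instance has failed `threshold` times while
        in steady state, meaning the caller should treat it as fatal.
        """
        if task_state is not None:
            # The instance is mid-operation, so a missing disk may be
            # transient; do not count it.
            return False
        self._counts[instance_uuid] += 1
        return self._counts[instance_uuid] >= self.threshold

    def record_success(self, instance_uuid):
        """Reset the counter once the disk is readable again."""
        self._counts.pop(instance_uuid, None)
```

Resetting on success means only consecutive failures count toward the threshold, which avoids disabling a host over an occasional transient error.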

tags: added: libvirt
Matt Riedemann (mriedem) wrote :

Note that this would fix your issue https://review.openstack.org/#/c/571410/.

Changed in nova:
assignee: nobody → Vladyslav Drok (vdrok)
status: Triaged → In Progress