update_available_resource will raise DiskNotFound after resize but before confirm

Bug #1774249 reported by Matthew Booth on 2018-05-30
This bug affects 4 people
Affects                     Importance   Assigned to
OpenStack Compute (nova)    Medium       Vladyslav Drok
  Ocata                     Medium       Unassigned
  Pike                      Medium       Unassigned
  Queens                    Medium       Lee Yarwood
  Rocky                     Medium       Lee Yarwood
  Stein                     Medium       Lee Yarwood

Bug Description

Originally reported in RH Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=1584315

Tested on OSP12 (Pike), but the bug still appears to be present on master. It should only occur if nova-compute is configured to use local file instance storage.

Create instance A on compute X

Resize instance A to compute Y
  Domain is powered off
  /var/lib/nova/instances/<uuid A> renamed to <uuid A>_resize on X
  Domain is *not* undefined

On compute X:
  update_available_resource runs as a periodic task
  First action is to update self
  rt calls driver.get_available_resource()
  ...calls _get_disk_over_committed_size_total
  ...iterates over all defined domains, including the ones whose disks we renamed
  ...fails because a referenced disk no longer exists

Results in errors in nova-compute.log:

    2018-05-30 02:17:08.647 1 ERROR nova.compute.manager [req-bd52371f-c6ec-4a83-9584-c00c5377acd8 - - - - -] Error updating resources for node compute-0.localdomain.: DiskNotFound: No disk at /var/lib/nova/instances/f3ed9015-3984-43f4-b4a5-c2898052b47d/disk
    2018-05-30 02:17:08.647 1 ERROR nova.compute.manager Traceback (most recent call last):
    2018-05-30 02:17:08.647 1 ERROR nova.compute.manager File "/usr/lib/python2.7/site-packages/nova/compute/manager.py", line 6695, in update_available_resource_for_node
    2018-05-30 02:17:08.647 1 ERROR nova.compute.manager rt.update_available_resource(context, nodename)
    2018-05-30 02:17:08.647 1 ERROR nova.compute.manager File "/usr/lib/python2.7/site-packages/nova/compute/resource_tracker.py", line 641, in update_available_resource
    2018-05-30 02:17:08.647 1 ERROR nova.compute.manager resources = self.driver.get_available_resource(nodename)
    2018-05-30 02:17:08.647 1 ERROR nova.compute.manager File "/usr/lib/python2.7/site-packages/nova/virt/libvirt/driver.py", line 5892, in get_available_resource
    2018-05-30 02:17:08.647 1 ERROR nova.compute.manager disk_over_committed = self._get_disk_over_committed_size_total()
    2018-05-30 02:17:08.647 1 ERROR nova.compute.manager File "/usr/lib/python2.7/site-packages/nova/virt/libvirt/driver.py", line 7393, in _get_disk_over_committed_size_total
    2018-05-30 02:17:08.647 1 ERROR nova.compute.manager config, block_device_info)
    2018-05-30 02:17:08.647 1 ERROR nova.compute.manager File "/usr/lib/python2.7/site-packages/nova/virt/libvirt/driver.py", line 7301, in _get_instance_disk_info_from_config
    2018-05-30 02:17:08.647 1 ERROR nova.compute.manager dk_size = disk_api.get_allocated_disk_size(path)
    2018-05-30 02:17:08.647 1 ERROR nova.compute.manager File "/usr/lib/python2.7/site-packages/nova/virt/disk/api.py", line 156, in get_allocated_disk_size
    2018-05-30 02:17:08.647 1 ERROR nova.compute.manager return images.qemu_img_info(path).disk_size
    2018-05-30 02:17:08.647 1 ERROR nova.compute.manager File "/usr/lib/python2.7/site-packages/nova/virt/images.py", line 57, in qemu_img_info
    2018-05-30 02:17:08.647 1 ERROR nova.compute.manager raise exception.DiskNotFound(location=path)
    2018-05-30 02:17:08.647 1 ERROR nova.compute.manager DiskNotFound: No disk at /var/lib/nova/instances/f3ed9015-3984-43f4-b4a5-c2898052b47d/disk

As a result, the resource tracker is no longer updated for that node. We can find lots of these errors in the gate.
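The sequence described above can be reproduced in miniature without nova or libvirt. The following is only a rough sketch: `DiskNotFound`, `allocated_disk_size`, `disk_total` and `simulate` are hypothetical stand-ins written for illustration, not nova's actual code, and the "defined domains" are modelled as a plain list of UUID strings. It shows why a single renamed instance directory aborts the whole disk total rather than just skipping one instance.

```python
import os
import tempfile


class DiskNotFound(Exception):
    """Hypothetical stand-in for nova.exception.DiskNotFound."""


def allocated_disk_size(path):
    # Stand-in for the disk size lookup; here it only inspects the file.
    if not os.path.exists(path):
        raise DiskNotFound("No disk at %s" % path)
    return os.path.getsize(path)


def disk_total(instances_dir, defined_domains):
    # Every domain that is still defined is visited, including one whose
    # directory has been renamed to <uuid>_resize.
    total = 0
    for uuid in defined_domains:
        total += allocated_disk_size(os.path.join(instances_dir, uuid, "disk"))
    return total


def simulate():
    with tempfile.TemporaryDirectory() as instances_dir:
        domains = ["uuid-a", "uuid-b"]
        for uuid in domains:
            os.makedirs(os.path.join(instances_dir, uuid))
            with open(os.path.join(instances_dir, uuid, "disk"), "wb") as f:
                f.write(b"\0" * 1024)

        before = disk_total(instances_dir, domains)

        # Resize away from this host: the directory is renamed, but the
        # domain stays in the list of defined domains.
        os.rename(os.path.join(instances_dir, "uuid-a"),
                  os.path.join(instances_dir, "uuid-a_resize"))

        try:
            disk_total(instances_dir, domains)
        except DiskNotFound as exc:
            # The exception escapes the loop, so no total is produced at all.
            return before, str(exc)
        return before, None
```

Calling `simulate()` returns the total computed before the rename together with the error message raised afterwards; the second instance's disk is never counted once the first lookup fails.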

Note that change Icec2769bf42455853cbe686fb30fda73df791b25 nearly mitigates this, but falls short because task_state is not set while the instance is awaiting confirmation.

jichenjc (jichenjc) on 2018-05-31
Changed in nova:
status: New → Confirmed
importance: Undecided → Medium
assignee: nobody → jichenjc (jichenjc)
Changed in nova:
status: Confirmed → In Progress
Changed in nova:
assignee: jichenjc (jichenjc) → Lee Yarwood (lyarwood)
Changed in nova:
assignee: Lee Yarwood (lyarwood) → jichenjc (jichenjc)
Changed in nova:
assignee: jichenjc (jichenjc) → Lee Yarwood (lyarwood)
Changed in nova:
assignee: Lee Yarwood (lyarwood) → Vladyslav Drok (vdrok)
LWQ (lwqcz) wrote :

Any news regarding this issue? I've read the whole history here and on Red Hat's Bugzilla, and I assume that this issue is not fixed yet. Am I correct? We are seeing a significant number of log records caused by this issue. Please post an update here, thank you.

Matt Riedemann (mriedem) wrote :

Note that https://review.openstack.org/#/c/553067/ is related; it addresses a race during server delete.

tags: added: libvirt resize
Changed in nova:
assignee: Vladyslav Drok (vdrok) → nobody
status: In Progress → Confirmed
Matt Riedemann (mriedem) wrote :
Changed in nova:
status: Confirmed → In Progress
assignee: nobody → Vladyslav Drok (vdrok)
Changed in nova:
assignee: Vladyslav Drok (vdrok) → Lee Yarwood (lyarwood)
Matt Riedemann (mriedem) on 2019-05-20
Changed in nova:
assignee: Lee Yarwood (lyarwood) → Vladyslav Drok (vdrok)

Reviewed: https://review.opendev.org/571410
Committed: https://git.openstack.org/cgit/openstack/nova/commit/?id=966192704c20d1b4e9faf384c8dafac8ea6e06ea
Submitter: Zuul
Branch: master

commit 966192704c20d1b4e9faf384c8dafac8ea6e06ea
Author: jichenjc <email address hidden>
Date: Mon May 21 02:03:51 2018 +0800

    libvirt: Do not reraise DiskNotFound exceptions during resize

    When an instance has VERIFY_RESIZE status, the instance disk on the
    source compute host has moved to <instance_path>/<instance_uuid>_resize
    folder, which leads to disk not found errors if the update available
    resource periodic task on the source compute runs before resize is
    actually confirmed.

    Icec2769bf42455853cbe686fb30fda73df791b25 almost fixed this issue but it
    will only set reraise to False when task_state is not None, that isn't
    the case when an instance is resized but resize is not yet confirmed.
    This patch adds a condition based on vm_state to ensure we don't
    reraise DiskNotFound exceptions while resize is not confirmed.

    Closes-Bug: 1774249
    Co-Authored-By: Vladyslav Drok <email address hidden>
    Change-Id: Id687e11e235fd6c2f99bb647184310dfdce9a08d
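The condition described in the commit message can be sketched as a small decision function. This is an illustration under stated assumptions, not the patch itself: `should_reraise` is a hypothetical helper, and the constant `RESIZED = "resized"` is assumed to match the value nova uses for the vm_state of an instance awaiting resize confirmation.

```python
# Assumed value of the vm_state for an instance whose resize has finished
# but has not yet been confirmed (reported as VERIFY_RESIZE status).
RESIZED = "resized"


def should_reraise(task_state, vm_state):
    """Decide whether a DiskNotFound error should propagate.

    Hypothetical helper mirroring the logic in the commit message.
    """
    # Earlier change: an operation is in flight, so a missing disk is
    # expected and the error is tolerated.
    if task_state is not None:
        return False
    # This patch: a resized-but-unconfirmed instance has task_state None,
    # yet its disk lives under <uuid>_resize on the source host.
    if vm_state == RESIZED:
        return False
    # Otherwise a missing disk is unexpected and should surface.
    return True
```

For example, `should_reraise(None, "resized")` is the case the earlier change missed: task_state is unset, so only the vm_state check prevents the periodic task from failing.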

Changed in nova:
status: In Progress → Fix Released

Reviewed: https://review.opendev.org/660361
Committed: https://git.openstack.org/cgit/openstack/nova/commit/?id=f1280ab849d20819791f7c4030f570a917d3e91d
Submitter: Zuul
Branch: stable/stein

commit f1280ab849d20819791f7c4030f570a917d3e91d
Author: jichenjc <email address hidden>
Date: Mon May 21 02:03:51 2018 +0800

    libvirt: Do not reraise DiskNotFound exceptions during resize

    When an instance has VERIFY_RESIZE status, the instance disk on the
    source compute host has moved to <instance_path>/<instance_uuid>_resize
    folder, which leads to disk not found errors if the update available
    resource periodic task on the source compute runs before resize is
    actually confirmed.

    Icec2769bf42455853cbe686fb30fda73df791b25 almost fixed this issue but it
    will only set reraise to False when task_state is not None, that isn't
    the case when an instance is resized but resize is not yet confirmed.
    This patch adds a condition based on vm_state to ensure we don't
    reraise DiskNotFound exceptions while resize is not confirmed.

    Closes-Bug: 1774249
    Co-Authored-By: Vladyslav Drok <email address hidden>
    Change-Id: Id687e11e235fd6c2f99bb647184310dfdce9a08d
    (cherry picked from commit 966192704c20d1b4e9faf384c8dafac8ea6e06ea)

This issue was fixed in the openstack/nova 19.0.1 release.

Reviewed: https://review.opendev.org/660362
Committed: https://git.openstack.org/cgit/openstack/nova/commit/?id=fd5c45473823105d8572d7940980163c6f09169c
Submitter: Zuul
Branch: stable/rocky

commit fd5c45473823105d8572d7940980163c6f09169c
Author: jichenjc <email address hidden>
Date: Mon May 21 02:03:51 2018 +0800

    libvirt: Do not reraise DiskNotFound exceptions during resize

    When an instance has VERIFY_RESIZE status, the instance disk on the
    source compute host has moved to <instance_path>/<instance_uuid>_resize
    folder, which leads to disk not found errors if the update available
    resource periodic task on the source compute runs before resize is
    actually confirmed.

    Icec2769bf42455853cbe686fb30fda73df791b25 almost fixed this issue but it
    will only set reraise to False when task_state is not None, that isn't
    the case when an instance is resized but resize is not yet confirmed.
    This patch adds a condition based on vm_state to ensure we don't
    reraise DiskNotFound exceptions while resize is not confirmed.

    Closes-Bug: 1774249
    Co-Authored-By: Vladyslav Drok <email address hidden>
    Change-Id: Id687e11e235fd6c2f99bb647184310dfdce9a08d
    (cherry picked from commit 966192704c20d1b4e9faf384c8dafac8ea6e06ea)
    (cherry picked from commit f1280ab849d20819791f7c4030f570a917d3e91d)

This issue was fixed in the openstack/nova 18.2.1 release.

Reviewed: https://review.opendev.org/660363
Committed: https://git.openstack.org/cgit/openstack/nova/commit/?id=d3bdeb26155c2d3b53850b790d3800a2dd78cada
Submitter: Zuul
Branch: stable/queens

commit d3bdeb26155c2d3b53850b790d3800a2dd78cada
Author: jichenjc <email address hidden>
Date: Mon May 21 02:03:51 2018 +0800

    libvirt: Do not reraise DiskNotFound exceptions during resize

    When an instance has VERIFY_RESIZE status, the instance disk on the
    source compute host has moved to <instance_path>/<instance_uuid>_resize
    folder, which leads to disk not found errors if the update available
    resource periodic task on the source compute runs before resize is
    actually confirmed.

    Icec2769bf42455853cbe686fb30fda73df791b25 almost fixed this issue but it
    will only set reraise to False when task_state is not None, that isn't
    the case when an instance is resized but resize is not yet confirmed.
    This patch adds a condition based on vm_state to ensure we don't
    reraise DiskNotFound exceptions while resize is not confirmed.

    Closes-Bug: 1774249
    Co-Authored-By: Vladyslav Drok <email address hidden>
    Change-Id: Id687e11e235fd6c2f99bb647184310dfdce9a08d
    (cherry picked from commit 966192704c20d1b4e9faf384c8dafac8ea6e06ea)
    (cherry picked from commit f1280ab849d20819791f7c4030f570a917d3e91d)
    (cherry picked from commit fd5c45473823105d8572d7940980163c6f09169c)

This issue was fixed in the openstack/nova 17.0.11 release.
