Live block migration in Pike fails due to qemu-img

Bug #1718133 reported by György Szombathelyi
This bug affects 7 people
Affects               Status        Importance  Assigned to  Milestone
Ubuntu Cloud Archive  Fix Released  Critical    James Page
  Pike                Fix Released  Critical    James Page
nova (Ubuntu)         Fix Released  Critical    James Page
  Artful              Fix Released  Critical    James Page
qemu (Ubuntu)         Invalid       Undecided   Unassigned
  Artful              Invalid       Undecided   Unassigned

Bug Description

In Pike from Cloud Archive, Live Block Migrations fail:
Error updating resources for node compute1.: InvalidDiskInfo: Disk info file is invalid: qemu-img failed to execute on /var/lib/nova/instances/ccca487b-d5db-4324-81fb-2665e60da038/disk : Unexpected error while running command.
Command: /usr/bin/python -m oslo_concurrency.prlimit --as=1073741824 --cpu=30 -- env LC_ALL=C LANG=C qemu-img info /var/lib/nova/instances/ccca487b-d5db-4324-81fb-2665e60da038/disk
Exit code: 1
Stdout: u''
Stderr: u'qemu-img: Could not open \'/var/lib/nova/instances/ccca487b-d5db-4324-81fb-2665e60da038/disk\': Failed to get shared "write" lock\nIs another process using the image?\n'
2017-09-18 19:55:55.408 11607 ERROR nova.compute.manager Traceback (most recent call last):
2017-09-18 19:55:55.408 11607 ERROR nova.compute.manager File "/usr/lib/python2.7/dist-packages/nova/compute/manager.py", line 6629, in update_available_resource_for_node
2017-09-18 19:55:55.408 11607 ERROR nova.compute.manager rt.update_available_resource(context, nodename)
2017-09-18 19:55:55.408 11607 ERROR nova.compute.manager File "/usr/lib/python2.7/dist-packages/nova/compute/resource_tracker.py", line 641, in update_available_resource
2017-09-18 19:55:55.408 11607 ERROR nova.compute.manager resources = self.driver.get_available_resource(nodename)
2017-09-18 19:55:55.408 11607 ERROR nova.compute.manager File "/usr/lib/python2.7/dist-packages/nova/virt/libvirt/driver.py", line 5831, in get_available_resource
2017-09-18 19:55:55.408 11607 ERROR nova.compute.manager disk_over_committed = self._get_disk_over_committed_size_total()
2017-09-18 19:55:55.408 11607 ERROR nova.compute.manager File "/usr/lib/python2.7/dist-packages/nova/virt/libvirt/driver.py", line 7330, in _get_disk_over_committed_size_total
2017-09-18 19:55:55.408 11607 ERROR nova.compute.manager config, block_device_info)
2017-09-18 19:55:55.408 11607 ERROR nova.compute.manager File "/usr/lib/python2.7/dist-packages/nova/virt/libvirt/driver.py", line 7248, in _get_instance_disk_info_from_config
2017-09-18 19:55:55.408 11607 ERROR nova.compute.manager backing_file = libvirt_utils.get_disk_backing_file(path)
2017-09-18 19:55:55.408 11607 ERROR nova.compute.manager File "/usr/lib/python2.7/dist-packages/nova/virt/libvirt/utils.py", line 200, in get_disk_backing_file
2017-09-18 19:55:55.408 11607 ERROR nova.compute.manager backing_file = images.qemu_img_info(path, format).backing_file
2017-09-18 19:55:55.408 11607 ERROR nova.compute.manager File "/usr/lib/python2.7/dist-packages/nova/virt/images.py", line 72, in qemu_img_info
2017-09-18 19:55:55.408 11607 ERROR nova.compute.manager raise exception.InvalidDiskInfo(reason=msg)
2017-09-18 19:55:55.408 11607 ERROR nova.compute.manager InvalidDiskInfo: Disk info file is invalid: qemu-img failed to execute on /var/lib/nova/instances/ccca487b-d5db-4324-81fb-2665e60da038/disk : Unexpected error while running command.
2017-09-18 19:55:55.408 11607 ERROR nova.compute.manager Command: /usr/bin/python -m oslo_concurrency.prlimit --as=1073741824 --cpu=30 -- env LC_ALL=C LANG=C qemu-img info /var/lib/nova/instances/ccca487b-d5db-4324-81fb-2665e60da038/disk
2017-09-18 19:55:55.408 11607 ERROR nova.compute.manager Exit code: 1
2017-09-18 19:55:55.408 11607 ERROR nova.compute.manager Stdout: u''
2017-09-18 19:55:55.408 11607 ERROR nova.compute.manager Stderr: u'qemu-img: Could not open \'/var/lib/nova/instances/ccca487b-d5db-4324-81fb-2665e60da038/disk\': Failed to get shared "write" lock\nIs another process using the image?\n'
2017-09-18 19:55:55.408 11607 ERROR nova.compute.manager
2017-09-18 19:56:44.775 11607 INFO nova.compute.manager [req-b6b3e84d-88fd-4052-9e56-bc89a1739ca3 3d4ca720acb84ba19cbbc7d5042d1f56 914a8d5ddac24f1c9b74e646633bab1c - default default] [instance: ccca487b-d5db-4324-81fb-2665e60da038] Terminating instance

Maybe this is a qemu issue rather than a nova one.

Tags: patch
Launchpad Janitor (janitor) wrote :

Status changed to 'Confirmed' because the bug affects multiple users.

Changed in nova (Ubuntu):
status: New → Confirmed
Christian Ehrhardt  (paelzer) wrote :

@James, I found and fixed some of these issues; just look at "fixed in RC" on [1].
Since then all my migrations have worked, so we need to find out what is special in this case.

When reproducing, watch out for:
- Do you have the latest qemu/libvirt matching artful, or is something still waiting to be promoted to pike?
  Just report the libvirt and qemu versions and we can compare.
- Any apparmor denials related to the locking (we generate a k into the rules, but we need to check).
- As usual, the guest XML, since we also found that multi-use of an image can be an issue when locking it.
- As much as possible about the migration type; if you can pull out the (equivalent) virsh migrate command, that would be nice.
- Does it use shared storage or any of the copy-storage options?

[1]: https://wiki.qemu.org/Planning/2.10

James Page (james-page) wrote :

Pike UCA is in sync with Artful versions; I'm working to reproduce so we can debug further.

György Szombathelyi (gyurco) wrote :

Maybe this is the problem (from Qemu 2.10 changelog):

- Image locking is added and enabled by default. Multiple QEMU processes cannot write to the same image as long as the host supports OFD or posix locking, unless options are specified otherwise.

But 'qemu-img info' should not require a write lock, I think. Also I think this newly introduced write-locking will cause other problems with OpenStack, since they are not really tested together.

György Szombathelyi (gyurco) wrote :

It seems a patch set was submitted, but I don't see --no-lock or -L options in qemu-img:
https://lists.gnu.org/archive/html/qemu-block/2016-04/msg00349.html

If such an option were there, nova could call qemu-img --no-lock ...

Launchpad Janitor (janitor) wrote :

Status changed to 'Confirmed' because the bug affects multiple users.

Changed in qemu (Ubuntu):
status: New → Confirmed
Christian Ehrhardt  (paelzer) wrote :

That patch set was never accepted in that form, but reviving it might be an option to work around this. It is also quite old, while the lock rewrite/enablement happened only recently, so a totally new approach might be needed (or might already exist without us knowing).

We might ping qemu-devel about that to avoid making too many assumptions.

To help with that, I found that a repro can be as easy as:
1. start a guest
2. get the blockinfo and pick a disk
3. run qemu-img info on it while running

Example:
#1 (any other kvm guest will do as well)
uvt-simplestreams-libvirt --verbose sync --source http://cloud-images.ubuntu.com/daily arch=amd64 label=daily release=xenial
uvt-kvm create --password=ubuntu xenial-test release=xenial arch=amd64 label=daily
#2 Depending on how you started it you might know the path already
virsh domblklist xenial-test
Target     Source
------------------------------------------------
vda        /var/lib/uvtool/libvirt/images/xenial-test.qcow
vdb        /var/lib/uvtool/libvirt/images/xenial-test-ds.qcow
#3 check img info
$ qemu-img info /var/lib/uvtool/libvirt/images/xenial-test.qcow
qemu-img: Could not open '/var/lib/uvtool/libvirt/images/xenial-test.qcow': Failed to get shared "write" lock
Is another process using the image?

To simplify further - we don't care what the guest does:
$ qemu-img create -f qcow2 /tmp/test.qcow2 1M
$ qemu-system-x86_64 -hda /tmp/test.qcow2 -enable-kvm -nodefaults -nographic &
$ qemu-img info /tmp/test.qcow2
qemu-img: Could not open '/tmp/test.qcow2': Failed to get shared "write" lock
Is another process using the image?

It won't get any easier than that, and it might be a good way to re-trigger the old no-lock thread.

Christian Ehrhardt  (paelzer) wrote :

I sent a mail upstream asking about the outcome of that old effort and describing our current issue with it.
I will add a link to the mailing list archive once I have it.

Christian Ehrhardt  (paelzer) wrote :

The thread has not yet appeared in the mailing list archive, so I have no link, but I do have the reply.

TL;DR: The --no-lock option was later revised into --force-share

So you'd need to revise the calling code to:
$ qemu-img info --force-share ...

If you want or need to probe for lock features to know when to add that flag (other than checking for a version >= 2.10), we had a discussion about that in bug 1716028, so you can read some details there.
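
To illustrate the plain version check (not the lock-feature probing discussed in bug 1716028), a caller could parse the output of qemu-img --version and add the flag only for 2.10 or later. This is a minimal sketch, not the actual nova patch: the function names and the constant are hypothetical, and it assumes the version banner starts with "qemu-img version X.Y.Z".

```python
import re

# First QEMU release with image locking, per the 2.10 changelog quoted above.
# The name of this constant is made up for this sketch.
FORCE_SHARE_MIN_VERSION = (2, 10, 0)


def parse_qemu_img_version(version_output):
    """Extract (major, minor, micro) from `qemu-img --version` text.

    Returns None when no version number can be found.
    """
    match = re.search(r"version\s+(\d+)\.(\d+)(?:\.(\d+))?", version_output)
    if match is None:
        return None
    major, minor, micro = match.groups()
    return (int(major), int(minor), int(micro or 0))


def qemu_img_info_command(path, version):
    """Build the argv for `qemu-img info`, adding --force-share when supported."""
    cmd = ["qemu-img", "info"]
    if version is not None and version >= FORCE_SHARE_MIN_VERSION:
        cmd.append("--force-share")
    cmd.append(path)
    return cmd
```

The resulting list can be handed to whatever process runner the caller already uses. If the version cannot be parsed, the sketch leaves the flag out, which keeps older qemu-img builds working at the cost of hitting the lock error on newer ones.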

James Page (james-page) wrote :

Marking qemu task as invalid - this is really a new feature that nova needs to deal with.

Changed in nova (Ubuntu):
importance: Undecided → Critical
Changed in qemu (Ubuntu):
status: Confirmed → Invalid
Changed in nova (Ubuntu):
status: Confirmed → Triaged
György Szombathelyi (gyurco) wrote :

Yepp, it works!
qemu-img info --force-share disk

It is easy to add this to Nova. We just need to take care of older qemu-img versions somehow.
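
One possible way to cope with older qemu-img versions, sketched here under assumptions and not taken from the eventual nova patch, is to call qemu-img info without the flag first and retry with --force-share only when stderr contains the lock error quoted in this report. The helper names and the injectable runner parameter are hypothetical.

```python
import subprocess

# Error text taken from the qemu-img stderr quoted in this report.
LOCK_ERROR = 'Failed to get shared "write" lock'


def run_qemu_img(args):
    """Run qemu-img with the given arguments; return (exit_code, stdout, stderr)."""
    proc = subprocess.Popen(
        ["qemu-img"] + list(args),
        stdout=subprocess.PIPE,
        stderr=subprocess.PIPE,
        universal_newlines=True,
    )
    out, err = proc.communicate()
    return proc.returncode, out, err


def qemu_img_info(path, runner=run_qemu_img):
    """Return `qemu-img info` output, retrying with --force-share on a lock error."""
    code, out, err = runner(["info", path])
    if code != 0 and LOCK_ERROR in err:
        # The image is held by a running guest; ask qemu-img to open it
        # in shared mode instead.
        code, out, err = runner(["info", "--force-share", path])
    if code != 0:
        raise RuntimeError("qemu-img info failed on %s: %s" % (path, err.strip()))
    return out
```

With this ordering, a qemu-img that predates image locking never sees the new flag, because it never produces the lock error in the first place. The trade-off is one extra process spawn per locked image and a dependency on the exact error wording.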

James Page (james-page) wrote :

I've uploaded test packages to:

  https://launchpad.net/~james-page/+archive/ubuntu/pike

The patch included there will only work with qemu 2.10 as found in artful and the Pike UCA; a larger patch will be needed for submission to Nova to deal with multiple versions.

James Page (james-page) wrote :

Nova/devstack bug 1718295

James Page (james-page)
Changed in nova (Ubuntu Artful):
status: Triaged → In Progress
assignee: nobody → James Page (james-page)
James Page (james-page) wrote :
tags: added: patch
Launchpad Janitor (janitor) wrote :

This bug was fixed in the package nova - 2:16.0.0-0ubuntu2

---------------
nova (2:16.0.0-0ubuntu2) artful; urgency=medium

  * d/p/qemu-2.10-compat.patch: Compatibility patch to resolve issues
    with qemu-img info calls on running instances, which blocks live
    migration of instances among other things (LP: #1718133).

 -- James Page <email address hidden> Wed, 20 Sep 2017 18:12:27 +0100

Changed in nova (Ubuntu Artful):
status: In Progress → Fix Released
James Page (james-page) wrote :

Fix promoted to pike-proposed.

James Page (james-page) wrote :

Verified nova update for pike using tempest live migration tests and some manual exercising of live-migration including block-migration with attached storage volumes from cinder (yikes!).

James Page (james-page) wrote : Update Released

The verification of the Stable Release Update for nova has completed successfully and the package has now been released to -updates. In the event that you encounter a regression using the package from -updates please report a new bug using ubuntu-bug and tag the bug report regression-update so we can easily find any regressions.

James Page (james-page) wrote :

This bug was fixed in the package nova - 2:16.0.0-0ubuntu2~cloud0
---------------

 nova (2:16.0.0-0ubuntu2~cloud0) xenial-pike; urgency=medium
 .
   * New update for the Ubuntu Cloud Archive.
 .
 nova (2:16.0.0-0ubuntu2) artful; urgency=medium
 .
   * d/p/qemu-2.10-compat.patch: Compatibility patch to resolve issues
     with qemu-img info calls on running instances, which blocks live
     migration of instances among other things (LP: #1718133).
