Live snapshot is corrupted (possibly race condition?)

Bug #1530275 reported by Favyen Bastani
Affects: OpenStack Compute (nova)
Status: Fix Released
Importance: Medium
Assigned to: Slawek Kaplonski

Bug Description

We are using nova 2:12.0.0-0ubuntu2~cloud0. Instance disks are stored in qcow2 files on an ext4 filesystem. When we take a live snapshot, the produced image is corrupted about 90% of the time; specifically, the image is only a few megabytes (e.g. 30 MB) in size, while the disk size is several GB. Here is the log from a corrupted snapshot:

2015-12-31 01:40:33.304 16805 INFO nova.compute.manager [req-80187ec9-a3e7-4eaf-80d4-1617da40989e 94b1e02c35204ca89bd5aea99ff5ef2b 8341c85ad9ae49408fa25074adba0480 - - -] [instance: f9d52a00-8466-4436-b5b4-f0244d54dfe1] instance snapshotting
2015-12-31 01:40:33.410 16805 INFO nova.virt.libvirt.driver [req-80187ec9-a3e7-4eaf-80d4-1617da40989e 94b1e02c35204ca89bd5aea99ff5ef2b 8341c85ad9ae49408fa25074adba0480 - - -] [instance: f9d52a00-8466-4436-b5b4-f0244d54dfe1] Beginning live snapshot process
2015-12-31 01:40:34.964 16805 INFO nova.virt.libvirt.driver [req-80187ec9-a3e7-4eaf-80d4-1617da40989e 94b1e02c35204ca89bd5aea99ff5ef2b 8341c85ad9ae49408fa25074adba0480 - - -] [instance: f9d52a00-8466-4436-b5b4-f0244d54dfe1] Snapshot extracted, beginning image upload
2015-12-31 01:40:37.029 16805 INFO nova.virt.libvirt.driver [req-80187ec9-a3e7-4eaf-80d4-1617da40989e 94b1e02c35204ca89bd5aea99ff5ef2b 8341c85ad9ae49408fa25074adba0480 - - -] [instance: f9d52a00-8466-4436-b5b4-f0244d54dfe1] Snapshot image upload complete

The entire operation completes in a couple of seconds, which is unexpectedly fast for a multi-gigabyte disk.

While testing, I added some sleep calls to the _live_snapshot function in virt/libvirt/driver.py to debug the problem. A few live snapshot runs were successful, but I'm not confident that it fixed the problem. Anyway, here is the code that I changed:

        try:
            # NOTE (rmk): blockRebase cannot be executed on persistent
            # domains, so we need to temporarily undefine it.
            # If any part of this block fails, the domain is
            # re-defined regardless.
            if guest.has_persistent_configuration():
                guest.delete_configuration()

            # NOTE (rmk): Establish a temporary mirror of our root disk and
            # issue an abort once we have a complete copy.
            dev.rebase(disk_delta, copy=True, reuse_ext=True, shallow=True)

+           time.sleep(10.0)
            while dev.wait_for_job():
-               time.sleep(0.5)
+               time.sleep(5.0)

            dev.abort_job()
            libvirt_utils.chown(disk_delta, os.getuid())
        finally:
            self._host.write_instance_config(xml)
            if require_quiesce:
                self.unquiesce(context, instance, image_meta)

And here is the resulting log. The operation now takes much longer than just the initial 10-second sleep, which suggests that before the modification wait_for_job was returning False immediately (so the loop exited at once), whereas after the modification it returns True following the initial sleep, keeps waiting, and the snapshot appears to complete correctly:

2015-12-31 01:42:12.438 21232 INFO nova.compute.manager [req-f3cc4b5b-98b0-4315-b514-de36a07cb8ed 94b1e02c35204ca89bd5aea99ff5ef2b 8341c85ad9ae49408fa25074adba0480 - - -] [instance: f9d52a00-8466-4436-b5b4-f0244d54dfe1] instance snapshotting
2015-12-31 01:42:12.670 21232 INFO nova.virt.libvirt.driver [req-f3cc4b5b-98b0-4315-b514-de36a07cb8ed 94b1e02c35204ca89bd5aea99ff5ef2b 8341c85ad9ae49408fa25074adba0480 - - -] [instance: f9d52a00-8466-4436-b5b4-f0244d54dfe1] Beginning live snapshot process
2015-12-31 01:43:02.411 21232 INFO nova.virt.libvirt.driver [req-f3cc4b5b-98b0-4315-b514-de36a07cb8ed 94b1e02c35204ca89bd5aea99ff5ef2b 8341c85ad9ae49408fa25074adba0480 - - -] [instance: f9d52a00-8466-4436-b5b4-f0244d54dfe1] Snapshot extracted, beginning image upload
2015-12-31 01:44:12.893 21232 INFO nova.virt.libvirt.driver [req-f3cc4b5b-98b0-4315-b514-de36a07cb8ed 94b1e02c35204ca89bd5aea99ff5ef2b 8341c85ad9ae49408fa25074adba0480 - - -] [instance: f9d52a00-8466-4436-b5b4-f0244d54dfe1] Snapshot image upload complete

Since sleeping 10 seconds before polling wait_for_job seemed to resolve it, I suspect a race condition where wait_for_job can be called before the block job started by the rebase call is fully initialized. I have not had a chance to explore that possibility further, though.
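
To make the suspected race concrete, here is a minimal illustrative sketch of the polling pattern, written directly against the libvirt-python API (simplified names and structure, not the actual nova code): a bare cur == end completion test passes immediately if blockJobInfo() is queried before QEMU has registered the drive-mirror job.

    # Illustrative sketch only -- a simplified stand-in for nova's polling
    # loop, not the actual nova/virt/libvirt driver code.
    import time
    import libvirt

    def job_looks_complete(dom, disk):
        # blockJobInfo() returns a dict such as {'type': ..., 'bandwidth': ...,
        # 'cur': ..., 'end': ...}, or an empty dict if no job is found.
        status = dom.blockJobInfo(disk, 0)
        if not status:
            return True
        # Suspected race: right after blockRebase(), the job may not be
        # registered yet, so cur == end == 0 and this returns True even
        # though the copy has not actually started.
        return status['cur'] == status['end']

    def live_snapshot_copy(dom, disk, delta_path):
        dom.blockRebase(disk, delta_path, 0,
                        libvirt.VIR_DOMAIN_BLOCK_REBASE_COPY |
                        libvirt.VIR_DOMAIN_BLOCK_REBASE_REUSE_EXT |
                        libvirt.VIR_DOMAIN_BLOCK_REBASE_SHALLOW)
        while not job_looks_complete(dom, disk):
            time.sleep(0.5)
        # If the loop exits on the spurious cur == end == 0 reading, this
        # cancels the mirror before any data has been copied.
        dom.blockJobAbort(disk, 0)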

description: updated
tags: added: libvirt
Eli Qiao (taget-9)
Changed in nova:
assignee: nobody → Eli Qiao (taget-9)
Revision history for this message
Matt Riedemann (mriedem) wrote :

Do the libvirt logs show any errors? How about the domain logs for the instance that you're performing the live snapshot on?

Revision history for this message
Matt Riedemann (mriedem) wrote :

Also, what version of libvirt/qemu are you using?

Changed in nova:
status: New → Incomplete
Revision history for this message
Favyen Bastani (fbastani) wrote :

There are no errors in the logs; what is happening, AFAIK, is that the rebase job hasn't started yet, so when nova gets the status it thinks the job has already completed, and then it prematurely aborts the job.

I am using the ones from cloud-archive:liberty repository:

libvirt 1.2.16-2ubuntu11~cloud0
qemu-kvm 1:2.3+dfsg-5ubuntu9~cloud0

Revision history for this message
Matt Riedemann (mriedem) wrote :

We probably need danpb or kashyap to take a look at this one.

It would be helpful if you could recreate this with libvirtd debug logs; you can see how those are configured in devstack here:

https://github.com/openstack-dev/devstack/blob/76cbbe37aae5d54542d62a5c6deec428a8cdc75e/lib/nova_plugins/functions-libvirt#L101

tags: added: live-snapshot
Changed in nova:
status: Incomplete → New
assignee: Eli Qiao (taget-9) → nobody
status: New → Confirmed
Revision history for this message
Daniel Berrange (berrange) wrote :

Yep, we really need to have the libvirtd debug logs, and also the full corresponding nova compute log from the same time.

Revision history for this message
John Garbutt (johngarbutt) wrote :

Marking as incomplete until we get the information needed to debug this issue.

Changed in nova:
status: Confirmed → Incomplete
Revision history for this message
Slawek Kaplonski (slaweq) wrote :

I have the same issue, and I found that on the first check in dev.wait_for_job() libvirt returns status.end = 0 and status.cur = 0. Because both values are equal, nova exits the while loop and moves on, and the resulting snapshot is broken.

Changed in nova:
assignee: nobody → Slawek Kaplonski (slaweq)
Revision history for this message
Slawek Kaplonski (slaweq) wrote :

I also checked that even after a very short delay (I added some debug logs between the blockRebase call and the while loop), the status.end value was no longer 0 and the snapshot came out fine. I also noticed that status.end kept increasing on every 0.5-second loop iteration, so it had not yet reached its final value at the first status check.
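
For reference, a rough sketch of that kind of instrumentation against the libvirt-python API (the domain name and disk target below are placeholders, not values from this bug): it simply polls blockJobInfo() shortly after the rebase has been started and prints cur/end, so you can watch end grow from 0 over the first iterations.

    # Rough debugging sketch (placeholder domain name and disk target):
    # poll blockJobInfo() after blockRebase() has been issued and print
    # cur/end every half second to watch 'end' grow from its initial 0.
    import time
    import libvirt

    conn = libvirt.open('qemu:///system')
    dom = conn.lookupByName('instance-00000001')    # placeholder domain

    for i in range(20):
        info = dom.blockJobInfo('vda', 0)           # placeholder disk target
        print(i, info)   # e.g. {} or a dict with 'cur', 'end', 'type', 'bandwidth'
        time.sleep(0.5)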

Revision history for this message
Slawek Kaplonski (slaweq) wrote :

Libvirtd log file is attached.

Revision history for this message
Slawek Kaplonski (slaweq) wrote :
Changed in nova:
status: Incomplete → New
status: New → Confirmed
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to nova (master)

Fix proposed to branch: master
Review: https://review.openstack.org/363926

Changed in nova:
status: Confirmed → In Progress
Revision history for this message
Kashyap Chamarthy (kashyapc) wrote :

TL;DR (of a longer analysis that's upcoming): From libvirt's point of
view, the snapshot can be corrupted if libvirt issues the
'block-job-cancel' command too soon, i.e. before the status reported by
the 'query-block-jobs' command has reached "ready": true.

However, Eric Blake (libvirt / QEMU developer) speculates on IRC that
QEMU _might_ have a race condition in which it reports to libvirt that
the live block operation is ready too soon.

- - -

FWIW, I've also just tested this locally, and couldn't reproduce the
issue:

Set this in /etc/nova/nova.conf

    [workarounds]
    disable_libvirt_livesnapshot = False

So that the _live_snapshot() method in nova/virt/libvirt/driver.py
exercises this code path:

           [...]
            dev.rebase(disk_delta, copy=True, reuse_ext=True, shallow=True)

            while dev.wait_for_job():
                time.sleep(0.5)

            dev.abort_job()
           [...]

And the simple reproducer (tested with today's DevStack, with Nova
commit at 1abb6f7)

    $ nova keypair-add oskey1 > oskey1.priv
    $ chmod 600 oskey1.priv
    $ nova boot --flavor 2 --key-name oskey1 --image cirros-0.3.4-x86_64-disk cvm1
    $ nova image-create cvm1 "snap1-of-cvm1" --poll

Server snapshotting... 100% complete
Finished

The snapshot file (uploaded to Glance) doesn't seem to be corrupted.

My libvirt debug log:
https://kashyapc.fedorapeople.org/virt/temp/nova-live-snapshot-libvirtd.log

Revision history for this message
Kashyap Chamarthy (kashyapc) wrote :

A quick note on libvirt IDs that'd be useful to know when reading the
logs -- libvirt increments the ID for each QMP command it issues. E.g.
picking randomly, if you look for the ID "libvirt-99" in the above log,
libvirt uses it to issue the QEMU Monitor Protocol (QMP) command
'query-block' (and, you get a return from QEMU, as seen via the output
of qemuMonitorIOProcess(), with the same ID, libvirt-99).

* * *

Now let's analyze the libvirtd log that is attached in comment #10.
[I've manually line-wrapped the log messages.]

(1) Nova invokes the blockRebase() Python API (in the logs we see the C
equivalent, virDomainBlockRebase(), with these parameters):

    2016-08-31 12:51:48.555+0000: 87266: debug :
    virDomainBlockRebase:9953 : dom=0x7f6938000a10, (VM:
    name=instance-001d7e72, uuid=73aa534b-924d-4136-a4e3-7a99187b5c41),
    disk=/home/instances/73aa534b-924d-4136-a4e3-7a99187b5c41/disk,
    base=/storage/snapshots/tmpnwpMv4/1184d0cb36534371bc4035b251f49f85.delta,
    bandwidth=0, flags=b

(2) This is translated to the QMP command 'drive-mirror' with
equivalent parameters:

    2016-08-31 12:51:48.888+0000: 87266: info : qemuMonitorSend:998 :
    QEMU_MONITOR_SEND_MSG: mon=0x7f6934005a90
    msg={"execute":"drive-mirror","arguments":{"device":"drive-virtio-disk0",
         "target":"/storage/snapshots/tmpnwpMv4/1184d0cb36534371bc4035b251f49f85.delta",
         "sync":"top","mode":"existing","format":"qcow2"},"id":"libvirt-99"}

(3) Then, libvirt keeps polling QEMU to check how far the live block
operation in step (2) has come along, by issuing the 'query-block-jobs'
command:

    2016-08-31 12:51:48.911+0000: 87262: info : qemuMonitorIOWrite:528 :
    QEMU_MONITOR_IO_WRITE: mon=0x7f6934005a90
    buf={"execute":"query-block-jobs","id":"libvirt-100"} len=51 ret=51
    errno=11

(4) It also keeps an eye on the 'ready' field in the output QEMU
returns (so that libvirt can take an action, such as gracefully
completing the live block operation):

    2016-08-31 12:51:48.918+0000: 87262: info : qemuMonitorIOProcess:423
    : QEMU_MONITOR_IO_PROCESS: mon=0x7f6934005a90 buf={"return":
    [{"io-status": "ok", "device": "drive-virtio-disk0", "busy": true,
    "len": 0, "offset": 0, "paused": false, "speed": 0, "ready": false,
    "type": "mirror"}], "id": "libvirt-100"} len=188

NOTE: The 'ready' status is "false" from the above command's result, so
QEMU is still doing some work.

(5) So libvirt continues to query, and on the third poll (of
'query-block-jobs'), with ID "libvirt-102", QEMU tells libvirt that the
mirror job started in step (2) is _now_ 'ready':

    2016-08-31 12:51:49.927+0000: 87262: info : qemuMonitorIOProcess:423 :
    QEMU_MONITOR_IO_PROCESS: mon=0x7f6934005a90 buf={"return":
    [{"io-status": "ok", "device": "drive-virtio-disk0", "busy": true,
    "len": 22740992, "offset": 22740992, "paused": false, "speed": 0,
    "ready": true, "type": "mirror"}], "id": "libvirt-102"} len=201

NOTE: Above, QEMU reports the 'ready' status as "true" -- which means
the source and destination disk contents are the same, and they continue
to synchronize until you take an action.

(6) Now libvirt takes an action -- It a...

Read more...

Revision history for this message
Slawek Kaplonski (slaweq) wrote :

@Kashyap Chamarthy: Sorry for my mistake, but the log I uploaded in comment 10 is from an attempt to create a snapshot with my patch for this issue applied (plus some temporary debug logging I added locally). So in the nova logs I could see that the first check in wait_for_job returned cur=0 and end=0, but nova continued waiting for the job to complete because of my patch.
Tomorrow I will try to reproduce it again (it should be easy on my hosts, where I hit this issue) without the nova patch, and I will attach another libvirtd log here.

Revision history for this message
Slawek Kaplonski (slaweq) wrote :

Here is the libvirtd log from a failed live_snapshot attempt.

Revision history for this message
Kashyap Chamarthy (kashyapc) wrote :

The log attached in comment #15 is the failed case. And, as I guessed
in comment #12, in that log the 'block-job-cancel' command is issued too
soon, before QEMU reports the block device job status as "ready": true.

So, with request id "libvirt-14", when libvirt queries QEMU for the
block device job status:

  2016-09-01 11:11:43.053+0000: 29193: info : qemuMonitorIOWrite:528 :
  QEMU_MONITOR_IO_WRITE: mon=0x7f98c4000e00
  buf={"execute":"query-block-jobs","id":"libvirt-14"} len=50 ret=50
  errno=11

It reports that it is not ready yet (see -- "ready": false):

  2016-09-01 11:11:43.057+0000: 29193: info : qemuMonitorIOProcess:423 :
  QEMU_MONITOR_IO_PROCESS: mon=0x7f98c4000e00 buf={"return":
  [{"io-status": "ok", "device": "drive-virtio-disk0", "busy": true,
  "len": 0, "offset": 0, "paused": false, "speed": 0, "ready": false,
  "type": "mirror"}], "id": "libvirt-14"} len=187

Instead of waiting for the block device job status to turn to "ready":
true, Nova goes ahead and aborts the job (with 'block-job-cancel'), with
id "libvirt-15":

  2016-09-01 11:11:43.058+0000: 29193: info : qemuMonitorIOWrite:528 :
  QEMU_MONITOR_IO_WRITE: mon=0x7f98c4000e00
  buf={"execute":"block-job-cancel","arguments":{"device":"drive-virtio-disk0"},"id":"libvirt-15"}

So, this (aborting the job before it's ready) is what corrupts the
snapshot. If Nova waits a little longer, the snapshot is successful (as
can be seen from my log analysis in comment #13, which is the good
case).

melanie witt (melwitt)
Changed in nova:
importance: Undecided → Medium
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to nova (master)

Reviewed: https://review.openstack.org/363926
Committed: https://git.openstack.org/cgit/openstack/nova/commit/?id=6b20239a5d293f55889cd1bffa59e4792c75edbf
Submitter: Jenkins
Branch: master

commit 6b20239a5d293f55889cd1bffa59e4792c75edbf
Author: Sławek Kapłoński <email address hidden>
Date: Wed Aug 31 20:28:36 2016 +0000

    Fix race condition bug during live_snapshot

    During live_snapshot creation, when nova starts block_rebase
    operation in libvirt there is possibility that block_job is
    not yet started and libvirt blockJobInfo method will return
    status.end = 0 and status.cur = 0. In such case libvirt driver
    does not wait to finish block rebase operation and that causes
    a problem because created snapshot is corrupted.

    This patch adds check if status.end != 0 to return information
    that job is already finished.

    Change-Id: I45ac06eae0b1949f746dae305469718649bfcf23
    Closes-Bug: #1530275
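
For clarity, the gist of the check described in the commit message, as a simplified sketch (not the literal nova patch): a job whose cur equals end is treated as finished only once end is non-zero, so the cur == end == 0 reading taken before the block job has actually started no longer ends the wait loop.

    # Simplified sketch of the idea in the fix above; not the literal patch.
    def job_finished(status):
        if not status:
            return True   # no job info at all -> nothing left to wait for
        # cur == end == 0 means the block job has not actually started yet,
        # so only report "finished" once end has a non-zero value.
        return status['cur'] == status['end'] and status['end'] != 0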

Changed in nova:
status: In Progress → Fix Released
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to nova (stable/mitaka)

Fix proposed to branch: stable/mitaka
Review: https://review.openstack.org/372127

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/nova 14.0.0.0rc1

This issue was fixed in the openstack/nova 14.0.0.0rc1 release candidate.

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Change abandoned on nova (stable/mitaka)

Change abandoned by Lee Yarwood (<email address hidden>) on branch: stable/mitaka
Review: https://review.openstack.org/372127

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Change abandoned on nova (stable/mitaka)

Change abandoned by Matt Riedemann (<email address hidden>) on branch: stable/mitaka
Review: https://review.openstack.org/372127
