Nova snapshot fails when the instance is running (Ceph backend)

Bug #1419734 reported by Vj
This bug affects 3 people
Affects: OpenStack Compute (nova)
Status: Incomplete
Importance: Undecided
Assigned to: Erhan Ekici
Milestone: (none)

Bug Description

I have set up OpenStack with a Ceph storage backend (nova, glance and cinder all use it). Linux Bridge with ML2 is my networking plugin. When I try to take a snapshot while the instance is running, the snapshot fails. It seems that nova freezes the VM and tries to take a cold snapshot, but when the VM resumes, it fails, and the snapshot that was created is deleted immediately. Please check the attached nova.log, which was taken during the snapshot process.

I have checked the glance logs, but there are no errors. If you need other logs, I will attach them. I have already checked https://bugs.launchpad.net/nova/+bug/1334398 and https://bugs.launchpad.net/mos/+bug/1381072; they are both related to live snapshots, but in this case even the cold snapshot is not working. When I stop the VM and take the snapshot, it works.

Revision history for this message
Vj (acetone-black) wrote :
Revision history for this message
jichenjc (jichenjc) wrote :

Some questions, could you please help:
1) Did you pause the instance before the image capture?
2) Does this occur 100% of the time, or did you see it only once?

thanks

jichenjc (jichenjc) wrote :

[instance: adb449ab-ac5e-4fed-9e02-b8335d39fe27] VM Paused (Lifecycle Event)
2015-02-09 16:16:38.145 1197 DEBUG nova.compute.manager [-] [instance: adb449ab-ac5e-4fed-9e02-b8335d39fe27] Synchronizing instance power state after lifecycle event "Paused"; current vm_state: active, current task_state: image_snapshot, current DB power_state: 1, VM power_state: 3 handle_lifecycle_event

[instance: adb449ab-ac5e-4fed-9e02-b8335d39fe27] VM Stopped (Lifecycle Event)
2015-02-09 16:16:39.206 1197 DEBUG nova.compute.manager [-] [instance: adb449ab-ac5e-4fed-9e02-b8335d39fe27] Synchronizing instance power state after lifecycle event "Stopped"; current vm_state: active, current task_state: image_snapshot, current DB power_state: 1, VM power_state: 4 handle_lifecycle_event /usr/lib/python2.7/dist-packages/nova/compute/manager.py:1111

It looks like the instance was paused and then stopped, and the error occurs when the domain is created again:

16:16:47.467 1197 ERROR nova.virt.libvirt.driver [req-30583370-c2d4-4a92-98ce-5f6b54488cfa None] Error launching a defined domain with XML: instance-000000aa adb449ab-ac5e-4fed-9e02-b8335d39fe27 cirros 2015-02-09 10:46:14 256 1 0 0 1 admin admin 262144 262144 1 OpenStack Foundation OpenStack Nova 2014.2.1 a7a9fb94-4349-9d21-ef9f-40fa6edf982f adb449ab-ac5e-4fed-9e02-b8335d39fe27 hvm destroy restart destroy /usr/bin/qemu-system-x86_64
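
The numeric power states in the two lifecycle lines quoted above can be decoded with a small helper. This is a minimal sketch: the function name `power_states` is hypothetical, and the number-to-name table is, to my understanding, the one defined in nova.compute.power_state for this release.

```python
import re

# Numeric power states, as I understand nova.compute.power_state to define them.
POWER_STATE_NAMES = {
    0: "NOSTATE",
    1: "RUNNING",
    3: "PAUSED",
    4: "SHUTDOWN",
    6: "CRASHED",
    7: "SUSPENDED",
}

# Matches the tail of the "Synchronizing instance power state" debug line.
_PATTERN = re.compile(r"current DB power_state: (\d+), VM power_state: (\d+)")


def power_states(log_line):
    """Return (db_state_name, vm_state_name) for a lifecycle log line, or None."""
    match = _PATTERN.search(log_line)
    if match is None:
        return None
    db_state, vm_state = (int(group) for group in match.groups())
    return (
        POWER_STATE_NAMES.get(db_state, "UNKNOWN"),
        POWER_STATE_NAMES.get(vm_state, "UNKNOWN"),
    )
```

Applied to the "Paused" line this gives a database state of RUNNING against a hypervisor state of PAUSED, and for the "Stopped" line RUNNING against SHUTDOWN, which matches the reading that the guest was paused and then stopped.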

Vj (acetone-black) wrote :

1. Nova automatically pauses the VM.

2. Yes, this occurs 100% of the time, not just once.

My suspicion is about this line (line 73 in the log file):

2015-02-09 16:16:39.253 1197 DEBUG nova.openstack.common.processutils [req-30583370-c2d4-4a92-98ce-5f6b54488cfa None] Running cmd (subprocess): qemu-img convert -O raw rbd:vms/adb449ab-ac5e-4fed-9e02-b8335d39fe27_disk:id=cinder:conf=/etc/ceph/ceph.conf /var/lib/nova/instances/snapshots/tmpB8VvsH/dcd0e075f10d4e988e6925d2d75f95ed execute /usr/lib/python2.7/dist-packages/nova/openstack/common/processutils.py:161

Later I see this too (line 325 in the log file):

2015-02-09 16:16:45.264 1197 DEBUG nova.openstack.common.loopingcall [-] Dynamic looping call <bound method Service.periodic_tasks of <nova.service.Service object at 0x7f5858c9dc10>> sleeping for 57.30 seconds _inner /usr/lib/python2.7/dist-packages/nova/openstack/common/loopingcall.py:132
2015-02-09 16:16:47.437 1197 DEBUG nova.openstack.common.processutils [req-30583370-c2d4-4a92-98ce-5f6b54488cfa None] Result was 0 execute /usr/lib/python2.7/dist-packages/nova/openstack/common/processutils.py:195

Immediately after this line is the instance launch failure. So I suspect that nova is trying to convert the disk image to raw before uploading it to glance. Since the disk image is already raw, I think the command fails (Result was 0 execute /usr/lib/python2.7/dist-packages/nova/openstack/common/processutils.py:195), and with it the conversion process.

Why is nova trying to convert an image to raw format when it is already raw?
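
A side note on reading that log line, offered as a minimal sketch rather than a statement about nova's internals: on POSIX systems an exit status of 0 conventionally means the command succeeded, so "Result was 0" suggests that the qemu-img convert step itself finished without error. The helper name `describe_exit` is hypothetical.

```python
import subprocess
import sys


def describe_exit(returncode):
    """Map a process exit status to a coarse label (0 is success by convention)."""
    return "success" if returncode == 0 else "failure"


# Run two trivial child processes and classify their exit statuses.
ok = subprocess.run([sys.executable, "-c", "raise SystemExit(0)"])
bad = subprocess.run([sys.executable, "-c", "raise SystemExit(3)"])
print(describe_exit(ok.returncode))
print(describe_exit(bad.returncode))
```

If that reading is right, the failure is more likely in the step that follows the conversion than in the conversion itself.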

Erhan Ekici (erhan-ekici) wrote :

Can you provide more information about your environment: software levels, OpenStack release, and the nova hypervisor (KVM, QEMU)?

You can focus on the "libvirtError: Cannot get interface MTU on 'brq6913fc1c-69': No such device" error message. On the compute node, confirm that the device exists before snapshotting, and after the snapshot find out why it disappeared (please provide the neutron logs and syslog).
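
One way to run that before-and-after check on the compute node is sketched below. It assumes a Linux host, where each network device appears as an entry under /sys/class/net; the function name `interface_exists` is hypothetical, and the bridge name is the one from the error message.

```python
import os


def interface_exists(name, sysfs_root="/sys/class/net"):
    """Return True if a network interface with this name is present.

    On Linux each network device shows up as an entry under /sys/class/net.
    """
    return os.path.exists(os.path.join(sysfs_root, name))


if __name__ == "__main__":
    # Bridge name taken from the libvirt error message in this report.
    bridge = "brq6913fc1c-69"
    print(bridge, "present" if interface_exists(bridge) else "missing")
```

Running it once while the instance is active and again right after the snapshot attempt would show whether the bridge was removed in between.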

Changed in nova:
assignee: nobody → Erhan Ekici (erhan-ekici)
status: New → Incomplete
Vj (acetone-black) wrote :

The OpenStack release is Juno, and the hypervisor is KVM. The bridge is there while the instance is running; however, when the instance is paused, the bridge disappears. There are no errors in the neutron logs. My suspicion falls entirely on nova, since nova is the one that instructs neutron to remove the bridge while snapshotting; on resume the bridge is not recreated, so the instance launch fails. Normal suspend and resume work perfectly. So can we look into the lines I mentioned before?

Vj (acetone-black) wrote :

Also, when I stop the instance and take a snapshot, there is no issue. I have attached the log for that.

Vj (acetone-black) wrote :

Here is the log

Vj (acetone-black) wrote :

One more piece of information: when there are two running instances, snapshotting works fine. This is because the bridge does not get deleted while the instance is paused.

Vj (acetone-black) wrote :

Now my question is: is there any option in nova that keeps the network bridges in place until the instance is terminated, so that this kind of issue does not occur?

Jeffrey Zhang (jeffrey4l) wrote :

I hit this issue too, and I think I have the same configuration as Vj.

Here is the environment:

* OpenStack Juno + Neutron ( Linux Bridge + ML2)
* Ceph as Nova, Glance, Cinder Backend
* Ubuntu 14.04

How to reproduce the bug:

1. Launch one instance on a compute node that has no other instances.
2. Take a snapshot while the instance is running.
3. The snapshot fails and deletes itself automatically. (Expected: the snapshot is created successfully.)
4. The instance crashes and fails to start; a hard reboot brings it back. (Expected: the instance stays in the running state.)

The root cause is that the Linux bridge is deleted in step (2) and is not recreated in step (4). So the snapshot fails and the instance crashes.

* If there is another instance on the compute node (using the same bridge as the instance launched above), this bug is not reproduced.
* If the instance launched above is stopped, this bug is not reproduced, because in step (4) the instance is not brought back up.
* In an OVS+ML2 environment, this bug is not reproduced either, because the bridge (br-int, created by OVS) is not deleted when there is no instance on it.
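
The reproduction conditions above can be illustrated with a toy model. This is only a sketch of the behaviour described in this report, not nova or neutron code; the class `ToyBridgeHost` and its methods are hypothetical names.

```python
class ToyBridgeHost:
    """Toy compute node whose bridge is removed when its last port goes away."""

    def __init__(self):
        self.bridge_exists = False
        self.ports = set()

    def boot(self, instance):
        # Normal boot path: create the bridge if needed, then attach the port.
        self.bridge_exists = True
        self.ports.add(instance)

    def _unplug(self, instance):
        self.ports.discard(instance)
        if not self.ports:
            # The last port is gone, so the bridge is cleaned up.
            self.bridge_exists = False

    def snapshot_running(self, instance):
        """Stop the guest for the snapshot, then try to start it again."""
        self._unplug(instance)
        if not self.bridge_exists:
            # The restart path expects the bridge and does not recreate it.
            return "failed"
        self.ports.add(instance)
        return "ok"


single = ToyBridgeHost()
single.boot("vm1")
print(single.snapshot_running("vm1"))  # lone instance on the node

shared = ToyBridgeHost()
shared.boot("vm1")
shared.boot("vm2")
print(shared.snapshot_running("vm1"))  # a second instance keeps the bridge alive
```

In this model the lone instance hits the missing bridge on restart, while the node with a second instance does not, which mirrors the first bullet above.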

Vj (acetone-black) wrote :

I am wondering whether this bug appears with OVS.

Vj (acetone-black) wrote :

Sorry, ignore my last comment

Vj (acetone-black) wrote :

Please, guys, any ideas? This must be a simple case for you, but I cannot go through all the code, since I don't know Python.
