OpenStack Compute (nova)

Nova snapshot fails when the instance is running (ceph backend)

Bug #1419734 reported by Vj on 2015-02-09

This bug report is a duplicate of: Bug #1328546: Race condition when hard rebooting instance.

This bug affects 3 people

Affects                    Status      Importance  Assigned to   Milestone
OpenStack Compute (nova)   Incomplete  Undecided   Erhan Ekici

Bug Description

I have set up OpenStack with a Ceph storage backend (nova, glance and cinder all use it). Linux Bridge with ML2 is my networking plugin. When I take a snapshot while the instance is running, the snapshot fails. It seems that nova pauses the VM and tries to take a cold snapshot, but the instance fails to resume and the snapshot that was created is deleted immediately. Please check the attached nova.log, which was taken during the snapshot process.

I have checked the glance logs, but there are no errors. If you need other logs, I will attach them. I have already checked https://bugs.launchpad.net/nova/+bug/1334398 and https://bugs.launchpad.net/mos/+bug/1381072; they are both related to live snapshots, but in this case even the cold snapshot is not working. When I stop the VM and take the snapshot, it works.

Vj (acetone-black) wrote on 2015-02-09:

nova.log (98.5 KiB, text/html)

jichenjc (jichenjc) wrote on 2015-02-09:

Some questions, could you please help:
1) Did you pause the instance before the image capture?
2) Does the issue occur 100% of the time, or have you only seen it once?

Thanks

jichenjc (jichenjc) wrote on 2015-02-09:

[instance: adb449ab-ac5e-4fed-9e02-b8335d39fe27] VM Paused (Lifecycle Event) 2015-02-09 16:16:38.145 1197 DEBUG nova.compute.manager [-] [instance: adb449ab-ac5e-4fed-9e02-b8335d39fe27] Synchronizing instance power state after lifecycle event "Paused"; current vm_state: active, current task_state: image_snapshot, current DB power_state: 1, VM power_state: 3 handle_lifecycle_event

[instance: adb449ab-ac5e-4fed-9e02-b8335d39fe27] VM Stopped (Lifecycle Event) 2015-02-09 16:16:39.206 1197 DEBUG nova.compute.manager [-] [instance: adb449ab-ac5e-4fed-9e02-b8335d39fe27] Synchronizing instance power state after lifecycle event "Stopped"; current vm_state: active, current task_state: image_snapshot, current DB power_state: 1, VM power_state: 4 handle_lifecycle_event /usr/lib/python2.7/dist-packages/nova/compute/manager.py:1111

It looks like the instance was paused and then stopped, and the error occurs when the domain is created:

16:16:47.467 1197 ERROR nova.virt.libvirt.driver [req-30583370-c2d4-4a92-98ce-5f6b54488cfa None] Error launching a defined domain with XML: instance-000000aa adb449ab-ac5e-4fed-9e02-b8335d39fe27 cirros 2015-02-09 10:46:14 256 1 0 0 1 admin admin 262144 262144 1 OpenStack Foundation OpenStack Nova 2014.2.1 a7a9fb94-4349-9d21-ef9f-40fa6edf982f adb449ab-ac5e-4fed-9e02-b8335d39fe27 hvm destroy restart destroy /usr/bin/qemu-system-x86_64
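
As a reading aid for the two lifecycle events quoted above, here is a minimal sketch that decodes the numeric power_state values. The mapping is assumed to mirror the constants in nova.compute.power_state for this release and should be checked against the installed tree; the regular expression and the function name describe_sync are made up for illustration and are not part of nova.

```python
import re

# Assumed to mirror nova.compute.power_state; verify against the installed release.
POWER_STATE = {
    0: "NOSTATE",
    1: "RUNNING",
    3: "PAUSED",
    4: "SHUTDOWN",
    6: "CRASHED",
    7: "SUSPENDED",
}

# Hypothetical pattern for the "Synchronizing instance power state" DEBUG line.
SYNC_RE = re.compile(
    r'lifecycle event "(?P<event>\w+)"; current vm_state: (?P<vm_state>\w+), '
    r'current task_state: (?P<task_state>\w+), '
    r'current DB power_state: (?P<db>\d+), VM power_state: (?P<vm>\d+)'
)


def describe_sync(line):
    """Return the decoded fields of a sync log line, or None if it does not match."""
    m = SYNC_RE.search(line)
    if m is None:
        return None
    return {
        "event": m.group("event"),
        "task_state": m.group("task_state"),
        "db_power_state": POWER_STATE.get(int(m.group("db")), "UNKNOWN"),
        "vm_power_state": POWER_STATE.get(int(m.group("vm")), "UNKNOWN"),
    }
```

Under that assumed mapping, the "Stopped" event reads as: the database still records the instance as RUNNING while the hypervisor reports SHUTDOWN, all during task_state image_snapshot.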

Vj (acetone-black) wrote on 2015-02-09:

1. Nova automatically pauses the VM.

2. Yes, this happens 100% of the time, not just once.

My suspicion is about this line (line 73 in the log file):

2015-02-09 16:16:39.253 1197 DEBUG nova.openstack.common.processutils [req-30583370-c2d4-4a92-98ce-5f6b54488cfa None] Running cmd (subprocess): qemu-img convert -O raw rbd:vms/adb449ab-ac5e-4fed-9e02-b8335d39fe27_disk:id=cinder:conf=/etc/ceph/ceph.conf /var/lib/nova/instances/snapshots/tmpB8VvsH/dcd0e075f10d4e988e6925d2d75f95ed execute /usr/lib/python2.7/dist-packages/nova/openstack/common/processutils.py:161

Later I see this too (line 325 in the log file):

2015-02-09 16:16:45.264 1197 DEBUG nova.openstack.common.loopingcall [-] Dynamic looping call <bound method Service.periodic_tasks of <nova.service.Service object at 0x7f5858c9dc10>> sleeping for 57.30 seconds _inner /usr/lib/python2.7/dist-packages/nova/openstack/common/loopingcall.py:132
2015-02-09 16:16:47.437 1197 DEBUG nova.openstack.common.processutils [req-30583370-c2d4-4a92-98ce-5f6b54488cfa None] Result was 0 execute /usr/lib/python2.7/dist-packages/nova/openstack/common/processutils.py:195

Immediately after this line comes the instance launch failure. So I suspect that nova is trying to convert the disk image to raw before uploading it to glance and, since the disk image is already raw, the command fails (Result was 0 execute /usr/lib/python2.7/dist-packages/nova/openstack/common/processutils.py:195), I think, and so does the conversion process.

Why is nova trying to convert an image to raw format when it is already raw?
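
A note on reading those two lines: as far as I can tell, "Result was N" in the processutils DEBUG output reports the exit status of the subprocess, and by the usual convention an exit status of 0 means the command succeeded. If that holds here, the qemu-img convert call completed without error. The sketch below pairs each "Running cmd" line with the next "Result was" line; the regular expressions and the function name pair_commands are illustrative assumptions, not nova code.

```python
import re

# Hypothetical patterns for the two processutils DEBUG lines quoted above.
RUN_RE = re.compile(r"Running cmd \(subprocess\): (?P<cmd>.+?) execute ")
RESULT_RE = re.compile(r"Result was (?P<code>-?\d+)")


def pair_commands(lines):
    """Yield (command, exit_code) pairs, matching each result to the last command seen."""
    pending = None
    for line in lines:
        run = RUN_RE.search(line)
        if run:
            pending = run.group("cmd")
            continue
        result = RESULT_RE.search(line)
        if result and pending is not None:
            yield pending, int(result.group("code"))
            pending = None
```

This simple pairing assumes commands do not interleave; on a busy compute node it would be safer to key on the request ID (the req-... token) as well.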

Erhan Ekici (erhan-ekici) wrote on 2015-02-10:

Can you provide more information about your environment: software levels, OpenStack release, and the nova hypervisor (KVM or QEMU)?

You can focus on the "libvirtError: Cannot get interface MTU on 'brq6913fc1c-69': No such device" error message. On the compute node, confirm that the device exists before snapshotting and, after the snapshot, find out why it disappeared (please provide the neutron logs and syslog).
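
One way to run the before-and-after check suggested above is to look the device up under /sys/class/net on the compute node. This is a minimal sketch, assuming a Linux host where each network device exposes an mtu file in sysfs; the function name bridge_status and the sysfs_root parameter are hypothetical and exist only to keep the example self-contained.

```python
import os


def bridge_status(name, sysfs_root="/sys/class/net"):
    """Return (exists, mtu) for a network device; mtu is None if the device is absent."""
    try:
        with open(os.path.join(sysfs_root, name, "mtu")) as fh:
            return True, int(fh.read().strip())
    except OSError:
        # No such device (or it disappeared while we were looking).
        return False, None
```

Calling bridge_status("brq6913fc1c-69") once before taking the snapshot and once after the failure would show whether the bridge named in the libvirt error is really gone.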

Changed in nova:
assignee:	nobody → Erhan Ekici (erhan-ekici)
status:	New → Incomplete

Vj (acetone-black) wrote on 2015-02-10:

The OpenStack release is Juno, and the hypervisor is KVM. The bridge is there while the instance is running; however, when the instance is paused, the bridge disappears. There are no errors in the neutron logs. My suspicion falls entirely on nova, since nova is the one that instructs neutron to remove the bridge while snapshotting; on resume the bridge is not recreated, so the instance launch fails. Normal suspend and resume work perfectly. So can we look into the lines I mentioned before?

Vj (acetone-black) wrote on 2015-02-10:

Also, when I stop the instance and take a snapshot, there is no issue. I have attached the log for that.

Vj (acetone-black) wrote on 2015-02-10:

nova.log.1 (5.9 KiB, text/plain)

Here is the log

Vj (acetone-black) wrote on 2015-02-10:

Another piece of information: when there are 2 running instances, snapshotting works fine. This is because the bridge does not get deleted when the instance is paused.

Vj (acetone-black) wrote on 2015-02-10:

#10

Now, my question is: is there any option in nova that keeps the network bridges in place until the instance is terminated, so that this kind of issue does not occur?

Jeffrey Zhang (jeffrey4l) wrote on 2015-02-11:

#11

I hit this issue too, and I think I have the same configuration as Vj.

Here is the environment:

* OpenStack Juno + Neutron (Linux Bridge + ML2)
* Ceph as the Nova, Glance, and Cinder backend
* Ubuntu 14.04

How to reproduce the bug:

1. Launch one instance on a compute node that has no other instances.
2. Take a snapshot while the instance is running.
3. The snapshot fails and deletes itself automatically (expected: the snapshot is created successfully).
4. The instance crashes and fails to start; a hard reboot brings it back (expected: the instance stays in the running state).

The root cause is that the Linux bridge is deleted in step (2) and is not recreated in step (4), so the snapshot fails and the instance crashes.

* If there is another instance on the compute node (using the same bridge as the instance launched above), the bug does not reproduce.
* If the instance launched above is stopped, the bug does not reproduce, because in step (4) the instance is not brought back up.
* In an OVS + ML2 environment, the bug does not reproduce either, because the bridge (br-int, created by OVS) is not deleted when there is no instance on it.
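
The three cases above can be illustrated with a toy model. This is not nova or neutron code: it simply assumes, as described in this comment, that a Linux Bridge brq device is removed once its last instance is unplugged, while the OVS br-int bridge persists. All class and function names are hypothetical.

```python
class Host:
    """Toy compute node: tracks which instances are plugged into which bridge."""

    def __init__(self, persistent_bridges=()):
        self.persistent = set(persistent_bridges)  # e.g. {"br-int"} for OVS
        self.bridges = {}                          # bridge name -> set of instance ids

    def plug(self, bridge, instance):
        self.bridges.setdefault(bridge, set()).add(instance)

    def unplug(self, bridge, instance):
        members = self.bridges.get(bridge, set())
        members.discard(instance)
        # Assumed Linux Bridge behaviour: an empty, non-persistent bridge is deleted.
        if not members and bridge not in self.persistent:
            self.bridges.pop(bridge, None)

    def start_domain(self, bridge, instance):
        # Mirrors the error in the log: the domain cannot start without its bridge.
        if bridge not in self.bridges and bridge not in self.persistent:
            raise RuntimeError(
                "Cannot get interface MTU on %r: No such device" % bridge)
        self.plug(bridge, instance)


def snapshot_running_instance(host, bridge, instance):
    """Snapshot flow as described in this report: stop the domain, then restart it."""
    host.unplug(bridge, instance)  # domain stops, its tap device goes away
    try:
        host.start_domain(bridge, instance)
    except RuntimeError:
        return False
    return True
```

In this model a lone instance on a brq bridge fails to restart, while a second instance on the same bridge, or a persistent bridge such as br-int, keeps the bridge alive and the restart succeeds.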

Vj (acetone-black) wrote on 2015-02-11:

#12

I am wondering whether this bug appears with OVS.

Vj (acetone-black) wrote on 2015-02-11:

#13

Sorry, ignore my last comment

Vj (acetone-black) wrote on 2015-02-12:

#14

Please guys, any ideas? This must be a simple case for you, but I cannot go through all the code, since I don't know Python.
