Nova instance doesn't respond (ICMP/SSH/VNC) after live migration

Bug #1371130 reported by Artem Panchenko
This bug affects 5 people
Affects              Status     Importance  Assigned to   Milestone
Mirantis OpenStack   Confirmed  High        Pavel Boldin
  5.0.x              Won't Fix  High        Pavel Boldin
  5.1.x              Won't Fix  High        Pavel Boldin
  6.0.x              Won't Fix  High        Pavel Boldin
  6.1.x              Won't Fix  High        Pavel Boldin
  7.0.x              Won't Fix  High        Pavel Boldin
  8.0.x              Confirmed  High        Pavel Boldin

Bug Description

QEMU Ubuntu bug:
https://bugs.launchpad.net/ubuntu/+source/qemu/+bug/1493049

-----------------------------------------------------------

api: '1.0'
astute_sha: f5fbd89d1e0e1f22ef9ab2af26da5ffbfbf24b13
auth_required: true
build_id: 2014-09-17_21-40-34
build_number: '11'
feature_groups:
- mirantis
fuellib_sha: d9b16846e54f76c8ebe7764d2b5b8231d6b25079
fuelmain_sha: 8ef433e939425eabd1034c0b70e90bdf888b69fd
nailgun_sha: eb8f2b358ea4bb7eb0b2a0075e7ad3d3a905db0d
ostf_sha: 64cb59c681658a7a55cc2c09d079072a41beb346
production: docker
release: '5.1'

This issue was reproduced on CI during system test 'Check VM backed with ceph migration in simple mode' (both Ubuntu and CentOS):

http://jenkins-product.srt.mirantis.net:8080/view/5.1_swarm/job/5.1_fuelmain.system_test.centos.thread_1/8/testReport/junit/(root)/migrate_vm_backed_with_ceph/migrate_vm_backed_with_ceph/

And I was able to reproduce it manually. Here are the steps:

1. Deploy cluster (Ubuntu, simple, nova flatDHCP, Ceph for volumes & images; 1 controller and 2 compute+ceph nodes)
2. Create new instance, create new volume, attach volume to the instance, associate floating ip address with instance
3. Live-migrate the instance to another compute node (e.g. nova live-migration --disk-over-commit Test01 node-3.test.domain.local)

Expected result:

- instance successfully migrated and it is accessible

Actual:

- instance migration completed and Nova reported it as running, but it couldn't be reached over the network (both public and private) or VNC (the VM didn't respond to key presses; you can find a screenshot in the attachments)

After a hard reboot the instance became accessible and fully operable. I didn't find errors in the libvirt/qemu logs, but the issue seems related to the volume attached before migration, because live migration of an instance without volumes works fine.

Revision history for this message
Irina Povolotskaya (ipovolotskaya) wrote :

Should this be included in the Release Notes?
Is there any workaround?

no longer affects: fuel
no longer affects: fuel/5.0.x
no longer affects: fuel/5.1.x
no longer affects: fuel/6.0.x
tags: added: nova
Revision history for this message
Roman Podoliaka (rpodolyaka) wrote :

The guest vm hit kernel panic: http://paste.openstack.org/show/115102/

Revision history for this message
Roman Podoliaka (rpodolyaka) wrote :

Hmm, looks like this is purely a Cirros kernel issue. I've tried the Fedora cloud image (http://download.fedoraproject.org/pub/fedora/linux/updates/20/Images/x86_64/Fedora-x86_64-20-20140407-sda.qcow2) and it worked like a charm.

Revision history for this message
Roman Podoliaka (rpodolyaka) wrote :

It's still not clear what triggers this error (a live migration might also succeed).

Revision history for this message
Roman Podoliaka (rpodolyaka) wrote :

Ok, so we see this from time to time on CI, but I can't reproduce it easily on deployed environments.

It seems that a guest VM may just randomly panic after a live migration. Everything seems to be ok on the nova/qemu/libvirt side, so this may just be an issue with the guest kernel (TestVM is a Cirros image).

Revision history for this message
Alexander Gubanov (ogubanov) wrote :

I reproduced this bug in MOS 5.1 on an environment with:
- Neutron networking with GRE
- 1 controller and 2 compute nodes sharing /var/lib/nova/instances over NFS
- Cirros and Fedora 19 images

After live migration the VM stops responding to ping/SSH/VNC.
On another environment (for example, with Ceph) it reproduces only intermittently.

I found a warning message in /var/log/libvirt/libvirtd.log:
2014-10-24 14:20:40.177+0000: 25091: warning : qemuDomainObjEnterMonitorInternal:1303 : This thread seems to be the async job owner; entering monitor without asking for a nested job is dangerous

It looks similar to https://bugzilla.redhat.com/show_bug.cgi?id=1018530

So I updated /etc/libvirt/qemu.conf:
migration_port_min = 51152
migration_port_max = 51251

restarted libvirtd:

/etc/init.d/libvirtd restart

added an iptables rule on the controller and compute nodes:

iptables -I INPUT -p tcp -m multiport --ports 51152:51251 -m comment --comment "test: libvirt migration" -j ACCEPT

and successfully live migrated.

Revision history for this message
Dmitry Borodaenko (angdraug) wrote :

If comment #8 is confirmed, this bug should be moved to Fuel.

Revision history for this message
Roman Podoliaka (rpodolyaka) wrote :

Alexander, please give it another try and move the bug to Fuel, if it's confirmed.

Revision history for this message
Roman Podoliaka (rpodolyaka) wrote :

We still have no clue how to reproduce this and what the root cause is.

Revision history for this message
Alexander Gubanov (ogubanov) wrote :

Please ignore my comment #8; the issue is still reproducible occasionally and we haven't found the root cause yet.

Revision history for this message
Roman Podoliaka (rpodolyaka) wrote :

Nastya, did you have a chance to take a look at this?

Revision history for this message
Artem Panchenko (apanchenko-8) wrote :

Reproduced this issue (kernel panic on VM after live migration) on Ubuntu 14.04.1 LTS (kernel version 3.13.0-40):

http://paste.openstack.org/show/149517/

A diagnostic snapshot is attached. If you need access to the environment with this issue, please let me know.

tags: added: release-note
tags: added: release-notes
removed: release-note
Revision history for this message
Roman Podoliaka (rpodolyaka) wrote :

Should be added to release notes:

"Occasionally, after a successfully completed live migration, an instance will hang with a kernel panic. The root cause is not known yet, but it is most likely a qemu-kvm/libvirt issue. It is recommended to use offline migration of instances, if possible."

Revision history for this message
Ryan Moe (rmoe) wrote :

I'm able to reproduce this on a recent 6.0 ISO using Ceph for images, cinder, and ephemeral storage. It takes between 20 and 60 migrations before it occurs. It doesn't appear to matter how many cinder volumes are attached. The only thing that stands out to me in the logs is this:

2014-12-29 23:43:31.706+0000: 19643: debug : qemuMonitorJSONBlockJob:3676 : Requested operation is not valid: No active operation on device: drive-virtio-disk0
2014-12-29 23:43:31.706+0000: 19643: warning : qemuMigrationCancelDriveMirror:1421 : Unable to stop block job on drive-virtio-disk0

There is one message like that for each attached disk (disk1, disk2, etc.). These messages appear even in successful migrations, so I'm not sure whether they're relevant.

Revision history for this message
Andrey Korolyov (xdeller) wrote :

Ryan, this means that for some reason Nova told libvirt to initiate a block migration; with Ceph this is meaningless and will probably lead to image corruption, as the URIs for the disks will likely be the same on the receiving side. Even if the migration succeeds, those messages indicate broken logic inside Nova.

Revision history for this message
Ivan Udovichenko (iudovichenko) wrote :

I've checked it with Fuel 6.1:
http://paste.openstack.org/show/186471/

The problem still persists.
Env configuration is the same.

Some logs from computes (libvirt and qemu):
http://paste.openstack.org/show/186416/
http://paste.openstack.org/show/186417/

I've checked with these guest OSes:
cirros-0.3.1-x86_64
cirros-0.3.3-x86_64
debian-testing-openstack-amd64
Fedora-Cloud-Base-20141203-21.x86_64
openSUSE_13.2_Server.x86_64-0.0.1
trusty-server-cloudimg-amd64

Used software:
qemu 2.0.0+dfsg-2ubuntu1.9
libvirt-bin 1.2.2-ubuntu2

Some screenshots:
https://imgur.com/a/I5PX2

Alex Ermolov (aermolov)
no longer affects: mos/5.1.1-updates
Revision history for this message
Ivan Udovichenko (iudovichenko) wrote :

After updating libvirt to version 1.2.13 and QEMU to 2.2.1, I was able to make 82 successful migrations.

This command was used to boot an instance:
nova boot --flavor 1 --image cirros-0.3.3-x86_64 --key-name key vm1

Migration was marked as successful after connecting to the VM with public IP and making simple operations, like "uname -a".

After the 82nd iteration, a kernel panic was observed.

Revision history for this message
Dmitry Mescheryakov (dmitrymex) wrote :

Too late to fix this in 6.0.1, moving to 6.0.2

Revision history for this message
Alexei Sheplyakov (asheplyakov) wrote :

> Some screenshots: https://imgur.com/a/I5PX2

Could you please collect the complete Oops (preferably as text)? The one in the screenshot is missing much important information (in particular, the kernel version).

A brief explanation of the migration setup (shared storage? if yes, which one?) would also be nice, so it's possible to make a similar setup without OpenStack.

> After the 82 iteration, kernel panic was noticed.

More details would be nice (the guest kernel version, the complete Oops message)

Revision history for this message
Alexei Sheplyakov (asheplyakov) wrote :

> Reproduced this issue (kernel panic on VM after live migration) on Ubuntu 14.04.1 LTS (kernel version 3.13.0-40):

> http://paste.openstack.org/show/149517

This might be similar to https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1379340

Revision history for this message
Ivan Udovichenko (iudovichenko) wrote :

> Also a brief explanation of the migration setup (shared storage? If yes, which one?) would be also nice (so it's possible to make a similar setup without OpenStack)

It is described in "Bug Description".

> More details would be nice (the guest kernel version, the complete Oops message)

3.2.0-68-virtual in cirros-0.3.3-x86_64
It is reproducible; you can check it yourself.

Revision history for this message
Alexei Sheplyakov (asheplyakov) wrote :

> http://paste.openstack.org/show/149517

Can you reproduce this on non-AMD hosts?

Revision history for this message
Ivan Udovichenko (iudovichenko) wrote :

> Can you reproduce this on non-AMD hosts?

http://paste.openstack.org/show/196871/

Revision history for this message
Timofey Durakov (tdurakov) wrote :

This issue is only valid for pure QEMU; migration on KVM works well (more than 1000 migrations). The number of successful live migrations on a QEMU host varies from run to run: it could be anywhere from 13 to 90, with no apparent pattern. Manually patching the guest's proc details also didn't help. As far as I can see, this issue is not related to Nova; it's better to look at the underlying layer (libvirt).

Revision history for this message
Timofey Durakov (tdurakov) wrote :

Moved the issue to the MOS Linux team.

tags: added: upgrades
Revision history for this message
Pavel Boldin (pboldin) wrote :

Since this only affects pure QEMU virtualization and customers use KVM, I'm moving this to 7.0.

tags: added: release-notes-done
removed: release-notes
Revision history for this message
Pavel Boldin (pboldin) wrote :

This seems to be a bug in QEMU itself, so debugging it will require a lot of work, making it impossible to get the fix into the 7.0 timeframe, since Feature Freeze and Soft Code Freeze have already passed.

Moving this to 8.0 and cooperating with Timofey to debug it.

Revision history for this message
Pavel Boldin (pboldin) wrote :

Confirmed. Looks like memory corruption caused by QEMU migration.

Revision history for this message
Pavel Boldin (pboldin) wrote :

The memory is different indeed.

Going to look up what this address contains:

[root@fuel ~]# diff left right -u
--- left 2015-08-28 19:03:09.120661229 +0000
+++ right 2015-08-28 19:03:40.387541127 +0000
@@ -448,7 +448,7 @@
 offset 1be0000:
 014f9f367bf3db962667d0a95d76103b
 offset 1c00000:
-7be31e524dbad8194db204084944f078
+b1643ebb91a73124d53187f1e11f8aa9
 offset 1c20000:
 4a7544de40f6cc0bf4c878d620655434
 offset 1c40000:

Revision history for this message
Pavel Boldin (pboldin) wrote :

Moving the call that freezes the CPUs to *before* the migration takes place seems to help.

Most likely, this means that QEMU has no way to mark as dirty the non-IO pages that were modified after the live migration stage started and before the completion stage.

Revision history for this message
Pavel Boldin (pboldin) wrote :

So, basically, the problem seems to be in the way QEMU dirties pages after the *bulk* migration of the RAM is done by calls to `ram_save_iterate`.

QEMU only dirties a page when there is a *miss* in the TLB cache: the code generated by `tcg_out_tlb_load` (called from inside `tcg_out_qemu_st`) then calls a `helper_le_st*_mmu`, which invalidates the page by calling `x86_cpu_handle_mmu_fault`, which calls `stl_phys_notdirty`. The last call marks the necessary bits in ram_list.dirty_memory[DIRTY_MEMORY_MIGRATION].

If the modified memory page was found in the TLB cache, it is not dirtied. This mostly happens with the *most used* pages, such as the structures of the timer interrupt or `delay_tsc`, and that is exactly where it happens in our case.

This is the reason for the different pages between source and destination of the migration.

Revision history for this message
Pavel Boldin (pboldin) wrote :

Confirmed by making the patch that:
1. Flushes TLB
2. Switches off TLB usage in the QEMU emulation (so all the pages are dirtied correctly).

Revision history for this message
Pavel Boldin (pboldin) wrote :

The actual solution is simply to flush the TLB while the CPUs are paused in `ram_save_setup`.

See attached patch.

Revision history for this message
Pavel Boldin (pboldin) wrote :

Upstream is not vulnerable because the code there flushes the TLBs in `memory_global_dirty_log_start` using the `tcg_commit` handler, which in the TCG case calls `tlb_flush` for each of the CPUs.

Revision history for this message
Pavel Boldin (pboldin) wrote :

The comment was not completely correct:
https://bugs.launchpad.net/mos/7.0.x/+bug/1371130/comments/35

The overall scheme is a little more complex: when a page in the TLB cache is not dirtied, a TLB_NOTDIRTY flag is set in the lowest bits of the `addr` field. When that is the case, the write is done through `io_mem_write`, which uses the 'notdirty' section handlers to write to the page and mark it as dirty.

Pavel Boldin (pboldin)
description: updated