Nova instance doesn't respond (ICMP/SSH/VNC) after live migration

Bug #1371130 reported by Artem Panchenko
This bug affects 5 people
Affects              Status     Importance  Assigned to   Milestone
Mirantis OpenStack   Confirmed  High        Pavel Boldin
  5.0.x              Won't Fix  High        Pavel Boldin
  5.1.x              Won't Fix  High        Pavel Boldin
  6.0.x              Won't Fix  High        Pavel Boldin
  6.1.x              Won't Fix  High        Pavel Boldin
  7.0.x              Won't Fix  High        Pavel Boldin
  8.0.x              Confirmed  High        Pavel Boldin

Bug Description

QEMU Ubuntu bug:
https://bugs.launchpad.net/ubuntu/+source/qemu/+bug/1493049

-----------------------------------------------------------

api: '1.0'
astute_sha: f5fbd89d1e0e1f22ef9ab2af26da5ffbfbf24b13
auth_required: true
build_id: 2014-09-17_21-40-34
build_number: '11'
feature_groups:
- mirantis
fuellib_sha: d9b16846e54f76c8ebe7764d2b5b8231d6b25079
fuelmain_sha: 8ef433e939425eabd1034c0b70e90bdf888b69fd
nailgun_sha: eb8f2b358ea4bb7eb0b2a0075e7ad3d3a905db0d
ostf_sha: 64cb59c681658a7a55cc2c09d079072a41beb346
production: docker
release: '5.1'

This issue was reproduced on CI during system test 'Check VM backed with ceph migration in simple mode' (both Ubuntu and CentOS):

http://jenkins-product.srt.mirantis.net:8080/view/5.1_swarm/job/5.1_fuelmain.system_test.centos.thread_1/8/testReport/junit/(root)/migrate_vm_backed_with_ceph/migrate_vm_backed_with_ceph/

And I was able to reproduce it manually. Here are the steps:

1. Deploy cluster (Ubuntu, simple, nova flatDHCP, Ceph for volumes & images; 1 controller and 2 compute+ceph nodes)
2. Create new instance, create new volume, attach volume to the instance, associate floating ip address with instance
3. Live-migrate the instance to another compute node (e.g. nova live-migration --disk-over-commit Test01 node-3.test.domain.local)

Expected result:

- instance successfully migrated and it is accessible

Actual:

- instance migration completed and Nova reported it as running, but it couldn't be reached over the network (both public and private) or VNC (the VM didn't respond to key presses; you can find a screenshot in the attachments)

After a hard reboot the instance became accessible and fully operable. I didn't find errors in the libvirt/qemu logs, but the issue seems related to the volume attached before migration, because live migration of an instance without volumes works fine.

Revision history for this message
Irina Povolotskaya (ipovolotskaya) wrote :

Should this be included in the Release Notes?
Is there any workaround?

no longer affects: fuel
no longer affects: fuel/5.0.x
no longer affects: fuel/5.1.x
no longer affects: fuel/6.0.x
tags: added: nova
Revision history for this message
Roman Podoliaka (rpodolyaka) wrote :

The guest vm hit kernel panic: http://paste.openstack.org/show/115102/

Revision history for this message
Roman Podoliaka (rpodolyaka) wrote :

Hmm, looks like this is purely a Cirros kernel issue. I've tried the Fedora cloud image (http://download.fedoraproject.org/pub/fedora/linux/updates/20/Images/x86_64/Fedora-x86_64-20-20140407-sda.qcow2) and it worked like a charm.

Revision history for this message
Roman Podoliaka (rpodolyaka) wrote :

It's still not clear what triggers this error (a live migration might also succeed).

Revision history for this message
Roman Podoliaka (rpodolyaka) wrote :

Ok, so we see this from time to time on CI, but I can't reproduce it easily on deployed environments.

It seems that a guest VM may just randomly panic after a live migration. Everything seems to be ok on the nova/qemu/libvirt side, so this may just be an issue with the guest kernel (TestVM is a Cirros image).

Revision history for this message
Alexander Gubanov (ogubanov) wrote :

I reproduced this bug in MOS 5.1 on an environment with:
- Neutron networking with GRE
- 1 controller and 2 compute nodes sharing /var/lib/nova/instances over NFS
- Cirros and Fedora 19 images

After live migration the VM stops responding to ping/SSH/VNC.
On another environment (for example, with Ceph) it reproduces only intermittently.

I found a warning message in /var/log/libvirt/libvirtd.log:
2014-10-24 14:20:40.177+0000: 25091: warning : qemuDomainObjEnterMonitorInternal:1303 : This thread seems to be the async job owner; entering monitor without asking for a nested job is dangerous

It looks similar to https://bugzilla.redhat.com/show_bug.cgi?id=1018530

So I updated /etc/libvirt/qemu.conf:
migration_port_min = 51152
migration_port_max = 51251

restarted libvirtd:

/etc/init.d/libvirtd restart

added an iptables rule on the controller and compute nodes:

iptables -I INPUT -p tcp -m multiport --ports 51152:51251 -m comment --comment "test: libvirt migration" -j ACCEPT

and successfully live migrated.

Revision history for this message
Dmitry Borodaenko (angdraug) wrote :

If comment #8 is confirmed, this bug should be moved to Fuel.

Revision history for this message
Roman Podoliaka (rpodolyaka) wrote :

Alexander, please give it another try and move the bug to Fuel, if it's confirmed.

Revision history for this message
Roman Podoliaka (rpodolyaka) wrote :

We still have no clue how to reproduce this and what the root cause is.

Revision history for this message
Alexander Gubanov (ogubanov) wrote :

Please ignore my comment #8; the issue is still reproducible occasionally and we haven't found the root cause yet.

Revision history for this message
Roman Podoliaka (rpodolyaka) wrote :

Nastya, did you have a chance to take a look at this?

Revision history for this message
Artem Panchenko (apanchenko-8) wrote :

Reproduced this issue (kernel panic on VM after live migration) on Ubuntu 14.04.1 LTS (kernel version 3.13.0-40):

http://paste.openstack.org/show/149517/

A diagnostic snapshot is attached. If you need access to the environment with this issue, please let me know.

tags: added: release-note
tags: added: release-notes
removed: release-note
Revision history for this message
Roman Podoliaka (rpodolyaka) wrote :

Should be added to release notes:

"Occasionally, after a successfully completed live migration, an instance will hang with a kernel panic. The root cause is not known yet, but it is most likely a qemu-kvm/libvirt issue. It is recommended to use offline migration of instances, if possible."

Revision history for this message
Ryan Moe (rmoe) wrote :

I'm able to reproduce this on a recent 6.0 ISO using Ceph for images, cinder, and ephemeral storage. It takes between 20 and 60 migrations before it occurs. It doesn't appear to matter how many cinder volumes are attached. The only thing that stands out to me in the logs is this:

2014-12-29 23:43:31.706+0000: 19643: debug : qemuMonitorJSONBlockJob:3676 : Requested operation is not valid: No active operation on device: drive-virtio-disk0
2014-12-29 23:43:31.706+0000: 19643: warning : qemuMigrationCancelDriveMirror:1421 : Unable to stop block job on drive-virtio-disk0

There is one message like that for each attached disk (disk1, disk2, etc.). These messages appear even in successful migrations, so I'm not sure whether they're relevant.

Revision history for this message
Andrey Korolyov (xdeller) wrote :

Ryan, this means that for some reason Nova told libvirt to initiate a block migration; with Ceph this is meaningless and will probably lead to image corruption, as the URIs for the disks will likely be the same on the receiving side. Even if the migration succeeds, those messages indicate broken logic inside Nova.

Revision history for this message
Ivan Udovichenko (iudovichenko) wrote :

I've checked it with Fuel 6.1:
http://paste.openstack.org/show/186471/

The problem still persists.
Env configuration is the same.

Some logs from computes (libvirt and qemu):
http://paste.openstack.org/show/186416/
http://paste.openstack.org/show/186417/

I've checked with these guest OSes:
cirros-0.3.1-x86_64
cirros-0.3.3-x86_64
debian-testing-openstack-amd64
Fedora-Cloud-Base-20141203-21.x86_64
openSUSE_13.2_Server.x86_64-0.0.1
trusty-server-cloudimg-amd64

Used software:
qemu 2.0.0+dfsg-2ubuntu1.9
libvirt-bin 1.2.2-ubuntu2

Some screenshots:
https://imgur.com/a/I5PX2

Alex Ermolov (aermolov)
no longer affects: mos/5.1.1-updates
Revision history for this message
Ivan Udovichenko (iudovichenko) wrote :

After updating libvirt to version 1.2.13 and QEMU to 2.2.1, I was able to make 82 successful migrations.

This command was used to boot an instance:
nova boot --flavor 1 --image cirros-0.3.3-x86_64 --key-name key vm1

Migration was marked as successful after connecting to the VM with public IP and making simple operations, like "uname -a".

After the 82nd iteration, a kernel panic was observed.

Revision history for this message
Dmitry Mescheryakov (dmitrymex) wrote :

Too late to fix this in 6.0.1, moving to 6.0.2

Revision history for this message
Alexei Sheplyakov (asheplyakov) wrote :

> Some screenshots: https://imgur.com/a/I5PX2

Could you please collect the complete Oops (preferably as text)? The one in the screenshot is missing much important information (in particular, the kernel version).

A brief explanation of the migration setup (shared storage? if yes, which one?) would also be nice, so it's possible to make a similar setup without OpenStack.

> After the 82 iteration, kernel panic was noticed.

More details would be nice (the guest kernel version, the complete Oops message)

Revision history for this message
Alexei Sheplyakov (asheplyakov) wrote :

> Reproduced this issue (kernel panic on VM after live migration) on Ubuntu 14.04.1 LTS (kernel version 3.13.0-40):

> http://paste.openstack.org/show/149517

This might be similar to https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1379340

Revision history for this message
Ivan Udovichenko (iudovichenko) wrote :

> Also a brief explanation of the migration setup (shared storage? If yes, which one?) would be also nice (so it's possible to make a similar setup without OpenStack)

It is described in "Bug Description".

> More details would be nice (the guest kernel version, the complete Oops message)

3.2.0-68-virtual in cirros-0.3.3-x86_64
It is reproducible; you can check it yourself.

Revision history for this message
Alexei Sheplyakov (asheplyakov) wrote :

> http://paste.openstack.org/show/149517

Can you reproduce this on non-AMD hosts?

Revision history for this message
Ivan Udovichenko (iudovichenko) wrote :

> Can you reproduce this on non-AMD hosts?

http://paste.openstack.org/show/196871/

Revision history for this message
Timofey Durakov (tdurakov) wrote :

This issue is only valid for pure QEMU; migration on KVM works well (more than 1000 migrations). The number of successful live migrations on a QEMU host varies from run to run: it could be anywhere from 13 to 90, with no apparent pattern. Manually patching the guest's proc details also didn't help. As far as I can see, this issue is not related to Nova; it's better to look at the underlying layer (libvirt).

Revision history for this message
Timofey Durakov (tdurakov) wrote :

Moved the issue to the MOS Linux team.

tags: added: upgrades
Revision history for this message
Pavel Boldin (pboldin) wrote :

Since this only affects pure QEMU virtualization and customers use KVM, I'm moving this to 7.0.

tags: added: release-notes-done
removed: release-notes
Revision history for this message
Pavel Boldin (pboldin) wrote :

This seems to be a bug in QEMU itself, so debugging it will require a lot of work, making it impossible to get the fix into the 7.0 timeframe, since Feature Freeze and Soft Code Freeze have already passed.

Moving this to 8.0 and cooperating with Timofey to debug it.

Revision history for this message
Pavel Boldin (pboldin) wrote :

Confirmed. Looks like memory corruption caused by QEMU migration.

Revision history for this message
Pavel Boldin (pboldin) wrote :

The memory is different indeed.

Going to look up what this address contains:

[root@fuel ~]# diff left right -u
--- left 2015-08-28 19:03:09.120661229 +0000
+++ right 2015-08-28 19:03:40.387541127 +0000
@@ -448,7 +448,7 @@
 offset 1be0000:
 014f9f367bf3db962667d0a95d76103b
 offset 1c00000:
-7be31e524dbad8194db204084944f078
+b1643ebb91a73124d53187f1e11f8aa9
 offset 1c20000:
 4a7544de40f6cc0bf4c878d620655434
 offset 1c40000:

Revision history for this message
Pavel Boldin (pboldin) wrote :

Moving the call that freezes the CPUs to *before* the migration takes place seems to help.

Most likely, this means that QEMU has no way to mark as dirty the non-IO pages that were modified after the live migration stage started and before the completion stage.

Revision history for this message
Pavel Boldin (pboldin) wrote :

So, basically, the problem seems to be in the way QEMU dirties pages after the *bulk* migration of the RAM is done by calls to `ram_save_iterate`.

QEMU only dirties a page when there is a *miss* in the TLB cache: the code generated by `tcg_out_tlb_load` (called from inside `tcg_out_qemu_st`) then calls a `helper_le_st*_mmu`, which invalidates the page by calling `x86_cpu_handle_mmu_fault`, which calls `stl_phys_notdirty`. The last call marks the necessary bits in ram_list.dirty_memory[DIRTY_MEMORY_MIGRATION].

If the modified memory page was found in the TLB cache, it is not dirtied. This mostly happens with the *most used* pages, such as the structures of the timer interrupt or `delay_tsc`, and that is exactly where it happens in our case.

This is the reason for the different pages between source and destination of the migration.

Revision history for this message
Pavel Boldin (pboldin) wrote :

Confirmed by making the patch that:
1. Flushes TLB
2. Switches off TLB usage in the QEMU emulation (so all the pages are dirtied correctly).

Revision history for this message
Pavel Boldin (pboldin) wrote :

The actual solution is simply to flush the TLB while the CPUs are paused in `ram_save_setup`.

See attached patch.

Revision history for this message
Pavel Boldin (pboldin) wrote :

Upstream is not vulnerable because the code there flushes the TLBs in `memory_global_dirty_log_start` using the `tcg_commit` handler, which in the TCG case calls `tlb_flush` for each of the CPUs.

Revision history for this message
Pavel Boldin (pboldin) wrote :

The comment was not completely correct:
https://bugs.launchpad.net/mos/7.0.x/+bug/1371130/comments/35

The overall scheme is a little more complex: when a page in the TLB cache is not dirtied, a TLB_NOTDIRTY flag is set in the lowest bits of the `addr` field. When that is the case, the write is done through `io_mem_write`, which uses the 'notdirty' section handlers to write to the page and mark it as dirty.

Pavel Boldin (pboldin)
description: updated