memory corruption during live-migration in TCG mode

Bug #1493049 reported by Pavel Boldin on 2015-09-07
This bug affects 9 people
Affects               Importance  Assigned to
qemu (Ubuntu)         High        Unassigned
qemu (Ubuntu Trusty)  High        Unassigned

Bug Description

[Impact]
* Live migration of QEMU instances running in pure-emulation (TCG) mode corrupts guest memory.

[Test Case]
HOW TO REPRODUCE
1. Run a QEMU instance with a simple VM inside it. The VM should run as few daemons as possible.
2. Live-migrate the machine back and forth a few times. Use the monitor command 'migrate "exec:cat>filename"' to migrate the VM out, and the QEMU command-line option '-incoming "exec:cat filename"' to load the migrated state.

EXPECTED BEHAVIOUR
- The VM responds to commands after each migration.

ACTUAL BEHAVIOUR
- The VM kernel crashes in the most-used part of memory after 10 to 50 migrations.

[Additional Information]
qemu:
  Installed: (none)
  Candidate: 2.0.0+dfsg-2ubuntu1.18
  Version table:
     2.0.0+dfsg-2ubuntu1.18 0
        500 http://archive.ubuntu.com/ubuntu/ trusty-proposed/universe amd64 Packages
     2.0.0+dfsg-2ubuntu1.17 0
        500 http://ru.archive.ubuntu.com/ubuntu/ trusty-updates/universe amd64 Packages
        500 http://security.ubuntu.com/ubuntu/ trusty-security/universe amd64 Packages
     2.0.0~rc1+dfsg-0ubuntu3 0
        500 http://ru.archive.ubuntu.com/ubuntu/ trusty/universe amd64 Packages

The migrated memory is corrupted because pages are not properly marked dirty while migration is in progress. In TCG, only pages accessed through the `slow_path` are marked dirty.

If a page is already in the TLB cache, the access takes the fast path and the page is not marked dirty.

To fix this, the TLB cache must be flushed before the VM enters the live-migration state.
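
The failure mode described above can be illustrated with a toy model. This is only a sketch and not QEMU code: `ToyVM`, `guest_write`, `start_dirty_logging` and `write_is_tracked` are hypothetical names invented for this example, and the "TLB" is reduced to one flag per page. The only behaviour it assumes is what the report states: a write that misses the TLB takes the slow path and marks the page dirty, while a write that hits the TLB skips dirty marking.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

#define NPAGES 8

/* Hypothetical toy model, not QEMU's real data structures. */
typedef struct {
    bool tlb_valid[NPAGES];   /* page has a cached fast-path translation */
    bool dirty[NPAGES];       /* migration dirty bitmap */
    bool logging;             /* dirty logging (migration) is active */
} ToyVM;

static void guest_write(ToyVM *vm, int page)
{
    assert(page >= 0 && page < NPAGES);
    if (vm->tlb_valid[page]) {
        return;                 /* fast path: no dirty tracking at all */
    }
    if (vm->logging) {
        vm->dirty[page] = true; /* slow path records the write */
    }
    vm->tlb_valid[page] = true; /* later writes to this page hit the TLB */
}

static void start_dirty_logging(ToyVM *vm, bool flush_tlb)
{
    if (flush_tlb) {
        /* The fix: drop cached translations so every page misses once. */
        memset(vm->tlb_valid, 0, sizeof(vm->tlb_valid));
    }
    memset(vm->dirty, 0, sizeof(vm->dirty));
    vm->logging = true;
}

/* Returns true if a write to a page that was already in the TLB before
 * migration started ends up in the dirty bitmap. */
static bool write_is_tracked(bool flush_tlb)
{
    ToyVM vm;
    memset(&vm, 0, sizeof(vm));
    guest_write(&vm, 3);                 /* warm-up: page 3 enters the TLB */
    start_dirty_logging(&vm, flush_tlb); /* migration begins */
    guest_write(&vm, 3);                 /* write during migration */
    return vm.dirty[3];
}
```

In this model, `write_is_tracked(false)` reproduces the lost write (the page stays clean and would be migrated with stale contents), and `write_is_tracked(true)` shows the flush forcing the write through the slow path.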

See the bug description for details: https://bugs.launchpad.net/mos/7.0.x/+bug/1371130

QEMU versions from 2.0.0 up to, but not including, 2.4.0 seem to be vulnerable.

The bug is fixed by commit http://git.qemu.org/?p=qemu.git;a=commit;h=6f6a5ef3e429f92f987678ea8c396aab4dc6aa19

Pavel Boldin (pboldin) wrote :

The attachment "backported solution" seems to be a patch. If it isn't, please remove the "patch" flag from the attachment, remove the "patch" tag, and if you are a member of the ~ubuntu-reviewers, unsubscribe the team.

[This is an automated message performed by a Launchpad user owned by ~brian-murray, for any issues please contact him.]

tags: added: patch
Serge Hallyn (serge-hallyn) wrote :

thanks for reporting this bug.

Your backported patch skips the part in render_memory_region() where:

- fr.dirty_log_mask = mr->dirty_log_mask;
+ fr.dirty_log_mask = memory_region_get_dirty_log_mask(mr);

Was that on purpose?

Also, in the NOVA bug you said you had filed a 'trusty' bug. From the description it seems as though wily should also be affected. That means I should fix in wily and SRU to trusty. Is there a reason why it need not be fixed in wily?

Changed in qemu (Ubuntu):
status: New → Incomplete
Pavel Boldin (pboldin) wrote :

There is no memory_region_get_dirty_log_mask in 2.0.0; it was only introduced later. Strictly speaking, 2.0.0 is quite different from 2.4.0 in this respect, but I checked the code logic here and it should be good.

Yes, I was not sure which versions it affects, so I only referenced Trusty. If this affects anything else, please make the appropriate edits.

Serge Hallyn (serge-hallyn) wrote :

Great, thank you.

Changed in qemu (Ubuntu):
status: Incomplete → Triaged
importance: Undecided → High
Serge Hallyn (serge-hallyn) wrote :

@pboldin,

the patch you cited was one of several (7?) which appear to be related, including

commit 677e7805cf95f3b2bca8baf0888d1ebed7f0c606
Author: Paolo Bonzini <email address hidden>
Date: Mon Mar 23 10:53:21 2015 +0100
    memory: track DIRTY_MEMORY_CODE in mr->dirty_log_mask
    DIRTY_MEMORY_CODE is only needed for TCG. By adding it directly to
    mr->dirty_log_mask, we avoid testing for TCG everywhere a region is
    checked for the enabled/disabled state of dirty logging.

Are you certain only that one patch is needed?

Pavel Boldin (pboldin) wrote :

@serge-hallyn,

This patch is exactly what fixes the problem for me (was able to do around 150 successful migrations with it).

However, it should be rewritten to contain only the calls to tlb_flush when in TCG mode, so that there is no extra code and no unrelated changes.

The patches differ because the migration and dirty-tracking mechanisms were substantially reworked in newer QEMU.

Specifically, in newer QEMU the KVM code has no `log_global_start' handler, and the regions are marked as DIRTY_MEMORY_MIGRATE by the `log_start' handlers, which are called only starting with the referenced upstream patch. This call happens deep inside memory_region_transaction_commit -> address_space_update_topology -> *_pass.

Regarding DIRTY_MEMORY_CODE: in 2.0.0 it is used only inside the TCG-related code. Every time TranslationBlock code is generated for a piece of VM code, the pages holding that VM code are removed from the TLB cache and marked as 'clean'. So, on the next write to those pages, the TLB lookup misses and the TranslationBlock is updated accordingly.
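
The DIRTY_MEMORY_CODE behaviour described above can be sketched with a toy model. This is not QEMU code: `ToyTCG`, `translate_page`, `guest_write` and `stale_tb_survives_write` are hypothetical names, and "updating" the TranslationBlock is simplified here to discarding it, which is an assumption of the sketch.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

#define NPAGES 8

/* Hypothetical toy model, not QEMU's real data structures. */
typedef struct {
    bool tlb_valid[NPAGES];   /* page has a cached fast-path translation */
    bool code_dirty[NPAGES];  /* stands in for the DIRTY_MEMORY_CODE bit */
    bool has_tb[NPAGES];      /* translated code exists for this page */
} ToyTCG;

static void toy_init(ToyTCG *t)
{
    memset(t, 0, sizeof(*t));
    for (int i = 0; i < NPAGES; i++) {
        t->code_dirty[i] = true;  /* no translated code anywhere yet */
    }
}

/* Generating translated code marks the page 'clean' and evicts it from
 * the TLB, so the next write to it cannot take the fast path. */
static void translate_page(ToyTCG *t, int page)
{
    assert(page >= 0 && page < NPAGES);
    t->has_tb[page] = true;
    t->code_dirty[page] = false;
    t->tlb_valid[page] = false;
}

static void guest_write(ToyTCG *t, int page)
{
    assert(page >= 0 && page < NPAGES);
    if (t->tlb_valid[page]) {
        return;                      /* fast path: nothing is checked */
    }
    if (!t->code_dirty[page]) {
        t->has_tb[page] = false;     /* stale translation is dropped */
        t->code_dirty[page] = true;
    }
    t->tlb_valid[page] = true;
}

/* Returns true if translated code survives a write to its own page. */
static bool stale_tb_survives_write(void)
{
    ToyTCG t;
    toy_init(&t);
    guest_write(&t, 2);     /* page 2 is cached in the TLB */
    translate_page(&t, 2);  /* code on page 2 is translated */
    guest_write(&t, 2);     /* guest overwrites its own code */
    return t.has_tb[2];
}
```

Because `translate_page` evicts the TLB entry, the second write is forced onto the slow path and the stale translation does not survive. The migration bug is the same pattern with the eviction step missing.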

Pavel Boldin (pboldin) wrote :

Here is the updated patch for the bug.

All it does is set the `tcg_commit' function as the `log_global_start' callback. `tcg_commit' then flushes all the appropriate TLBs on the `memory_log_global_start' call.
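
The shape of that change can be sketched as follows. This is a simplified stand-in and not the attached patch: `ToyListener`, `toy_tcg_commit`, `toy_memory_log_global_start` and `flushes_on_log_start` are hypothetical names, and the TLB flush is replaced by a counter so that the effect is observable.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for a memory listener with two callbacks. */
typedef struct ToyListener {
    void (*commit)(struct ToyListener *self);
    void (*log_global_start)(struct ToyListener *self);
    int flush_count;          /* counts simulated TLB flushes */
} ToyListener;

/* Stands in for tcg_commit: here it only records that a flush happened. */
static void toy_tcg_commit(ToyListener *self)
{
    self->flush_count++;
}

/* Stands in for memory_log_global_start: notify every listener that
 * registered a log_global_start callback. */
static void toy_memory_log_global_start(ToyListener **listeners, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        if (listeners[i]->log_global_start != NULL) {
            listeners[i]->log_global_start(listeners[i]);
        }
    }
}

/* Returns how many flushes happen when dirty logging starts. */
static int flushes_on_log_start(bool patched)
{
    ToyListener tcg = {
        .commit = toy_tcg_commit,
        /* The patch idea: reuse the commit handler for log start. */
        .log_global_start = patched ? toy_tcg_commit : NULL,
        .flush_count = 0,
    };
    ToyListener *all[] = { &tcg };
    toy_memory_log_global_start(all, 1);
    assert(tcg.flush_count >= 0);
    return tcg.flush_count;
}
```

Without the callback, starting dirty logging triggers no flush. With it, the flush happens exactly when logging begins, which is the point at which stale fast-path TLB entries would otherwise start hiding writes.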

Pavel Boldin (pboldin) wrote :

With the attached patch applied, I was able to do around 300 migrations back and forth successfully.

Launchpad Janitor (janitor) wrote :

Status changed to 'Confirmed' because the bug affects multiple users.

Changed in qemu (Ubuntu Trusty):
status: New → Confirmed
Changed in qemu (Ubuntu Vivid):
status: New → Confirmed
Launchpad Janitor (janitor) wrote :

This bug was fixed in the package qemu - 1:2.3+dfsg-5ubuntu6

---------------
qemu (1:2.3+dfsg-5ubuntu6) wily; urgency=medium

  * Make qemu-system-common and qemu-utils depend on qemu-block-extra
    to fix errors with missing block backends. (LP: #1495895)
  * Cherry pick fixes for vmdk stream-optimized subformat (LP: #1006655)
  * Apply fix for memory corruption during live-migration in tcg mode
    (LP: #1493049)
  * Apply tracing patch to remove use of custom vtable in newer glibc
    (LP: #1491972)

 -- Ryan Harper <email address hidden> Tue, 15 Sep 2015 09:37:23 -0500

Changed in qemu (Ubuntu):
status: Triaged → Fix Released
no longer affects: qemu (Ubuntu Vivid)

Hello Pavel, or anyone else affected,

Accepted qemu into trusty-proposed. The package will build now and be available at https://launchpad.net/ubuntu/+source/qemu/2.0.0+dfsg-2ubuntu1.23 in a few hours, and then in the -proposed repository.

Please help us by testing this new package. See https://wiki.ubuntu.com/Testing/EnableProposed for documentation on how to enable and use -proposed. Your feedback will aid us in getting this update out to other Ubuntu users.

If this package fixes the bug for you, please add a comment to this bug, mentioning the version of the package you tested, and change the tag from verification-needed to verification-done. If it does not fix the bug for you, please add a comment stating that, and change the tag to verification-failed. In either case, details of your testing will help us make a better decision.

Further information regarding the verification process can be found at https://wiki.ubuntu.com/QATeam/PerformingSRUVerification . Thank you in advance!

Changed in qemu (Ubuntu Trusty):
status: Confirmed → Fix Committed
tags: added: verification-needed
Mathew Hodson (mhodson) on 2016-04-02
Changed in qemu (Ubuntu Trusty):
importance: Undecided → High
Simon Déziel (sdeziel) wrote :

I'm chasing a bug similar in behavior but I'm using "qemu-system-x86_64 -enable-kvm" so it's not TCG, AFAICT. Would it be possible this problem also manifests in KVM mode? Or should I open a new bug?

@pboldin, when you get a chance, could you please check if the 2.0.0+dfsg-2ubuntu1.23 version (now overshadowed by the -security update .24) fixes this bug?
