Regression: Live migrations can still crash after CVE-2016-5403 fix

Bug #1647389 reported by Maik Zumstrull
This bug affects 13 people
Affects          Status         Importance   Assigned to   Milestone
qemu (Ubuntu)    Fix Released   High         Dave Chiluk
Xenial           Fix Released   High         Unassigned

Bug Description

[Impact]

 * Libvirt migrations that use tunnelling fail on the destination with the error: VQ 2 size 0x80 < last_avail_idx 0x9 - used_idx 0xa

 * TBD: justification for backporting the fix to the stable release.
 * TBD: In addition, it is helpful, but not required, to include an
   explanation of how the upload fixes this bug.

[Test Case]
1. Create a VM on shared storage solution. In my case NFS.
2. set start_libvirtd="yes" in /etc/default/libvirt-bin
3. systemctl restart libvirt-bin
4. virsh dommemstat 1 <vm>
5. virsh -c qemu+ssh://${FROM}/system migrate --live --p2p --tunnelled ${VM} qemu+tcp://ubuntu@${TO}/system
6. Repeat until failure to migrate, then check /var/log/libvirt/qemu/<vm>.log for the error above.

* Yes, --live, --p2p, and --tunnelled are all required to reproduce, AFAIK.

[Regression Potential]
TBD
 * discussion of how regressions are most likely to manifest as a result of this change.
 * It is assumed that any SRU candidate patch is well-tested before
   upload and has a low overall risk of regression, but it's important
   to make the effort to think about what ''could'' happen in the
   event of a regression.
 * This both shows the SRU team that the risks have been considered,
   and provides guidance to testers in regression-testing the SRU.

[Other Info]
TBD
 * Anything else you think is useful to include
 * Anticipate questions from users, SRU, +1 maintenance, security teams and the Technical Board
 * and address these questions in advance

___________________ Original Description follows _____________________

See updates at the end of #1612089. Sample error message:

Dec 05 14:41:07 zbk130713 libvirtd[29690]: internal error: early end of file from monitor, possible problem:
2016-12-05T14:41:07.903932Z qemu-system-x86_64: VQ 2 size 0x80 < last_avail_idx 0x9 - used_idx 0xa
2016-12-05T14:41:07.903981Z qemu-system-x86_64: error while loading state for instance 0x0 of device '0000:00:05.0/virtio-balloon'
2016-12-05T14:41:07.905180Z qemu-system-x86_64: load of migration failed: Operation not permitted
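
For anyone checking many instance logs for this failure, the numbers in the VQ line can be pulled out programmatically. The following is a minimal sketch in C, not code from QEMU or libvirt: the function name parse_vq_error is hypothetical, and the format string assumes the message layout shown in the sample above.

```c
#include <stdio.h>
#include <string.h>

/* Hypothetical helper (not part of QEMU or libvirt): extracts the queue
 * number, ring size, last_avail_idx and used_idx from a log line of the form
 *   "... VQ 2 size 0x80 < last_avail_idx 0x9 - used_idx 0xa"
 * Returns 1 if the line matches, 0 otherwise. */
int parse_vq_error(const char *line, unsigned *vq, unsigned *size,
                   unsigned *last_avail_idx, unsigned *used_idx)
{
    /* Skip the timestamp and process-name prefix, if any. */
    const char *p = strstr(line, "VQ ");
    if (p == NULL) {
        return 0;
    }
    /* All four fields must be converted for the line to count as a match. */
    return sscanf(p, "VQ %u size 0x%x < last_avail_idx 0x%x - used_idx 0x%x",
                  vq, size, last_avail_idx, used_idx) == 4;
}
```

Given the sample line above, this yields queue 2, size 0x80, last_avail_idx 0x9 and used_idx 0xa.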

Seems related to this patch series:
https://lists.gnu.org/archive/html/qemu-devel/2016-08/msg03079.html

Maik Zumstrull (m-zumstrull) wrote :

This is with:

$ dpkg-query -W qemu-system-x86
qemu-system-x86 1:2.5+dfsg-5ubuntu10.6
$ qemu-system-x86_64 --version
QEMU emulator version 2.5.0 (Debian 1:2.5+dfsg-5ubuntu10.6), Copyright (c) 2003-2008 Fabrice Bellard

Maik Zumstrull (m-zumstrull) wrote :

See also the thread at https://lists.gnu.org/archive/html/qemu-devel/2016-11/msg02634.html which appears to be about the same issue, and references two commits that might fix this if cherry-picked.

Launchpad Janitor (janitor) wrote :

Status changed to 'Confirmed' because the bug affects multiple users.

Changed in qemu (Ubuntu):
status: New → Confirmed
sean redmond (sean-redmond1) wrote :

I have seen an issue that seems like it could be related:

Dec 5 17:52:53 os-nova-compute libvirtd[40142]: failed to connect to monitor socket: No such process
Dec 5 17:52:53 os-nova-compute virtlogd[6936]: End of file while reading data: Input/output error
Dec 5 17:52:54 os-nova-compute libvirtd[40142]: internal error: End of file from monitor

root@os-nova-compute:~# dpkg-query -W qemu-system-x86
qemu-system-x86 1:2.5+dfsg-5ubuntu10.6
root@os-nova-compute:~# qemu-system-x86_64 --version
QEMU emulator version 2.5.0 (Debian 1:2.5+dfsg-5ubuntu10.6), Copyright (c) 2003-2008 Fabrice Bellard
root@os-nova-compute:~# uname -a
Linux os-nova-compute 4.4.0-47-generic #68-Ubuntu SMP Wed Oct 26 19:39:52 UTC 2016 x86_64 x86_64 x86_64 GNU/Linux
root@os-nova-compute:~#

Robie Basak (racb)
tags: added: regression-update
Changed in qemu (Ubuntu):
importance: Undecided → High
Robie Basak (racb) wrote :

13:39 <rbasak> mdeslaur: are you aware of bug 1647389?
13:40 <rbasak> Claimed second regression from bug 1612089 AFAICT.
13:51 <mdeslaur> rbasak: I saw it, I haven't investigated yet
13:51 <mdeslaur> rbasak: We already have all of the commits that are linked
13:52 <mdeslaur> rbasak: I'll look at it more after I'm back from holiday
13:52 <mdeslaur> rbasak: having a reproducer would help
13:58 <rbasak> OK, thanks

Dave Chiluk (chiluk)
tags: added: sts
Dave Chiluk (chiluk) wrote :

At the moment this appears to be a consequence of 104e70cae78bd4afd95d948c6aff188f10508a9c not being included in the original CVE patchset.

I'm attaching an early debdiff that includes a first attempt at a backport of the above patch, and am requesting comments and code review.

If anyone has a succinct/reliable way to reproduce this, I would greatly appreciate that. Due to lack of a good reproducer this is near impossible to test. As a result the backport was attempted purely based on code inspection and comments from upstream fixes.

Changed in qemu (Ubuntu):
assignee: nobody → Dave Chiluk (chiluk)
Dave Chiluk (chiluk) wrote :

I have created a PPA with the above fix, available at:
https://launchpad.net/~chiluk/+archive/ubuntu/lp1647389

ppa:chiluk/lp1647389

If someone copied on this bug has a reliable way to test this issue, please try it using the qemu from my PPA.

Thank you,

Dave Chiluk (chiluk) wrote :

I have confirmation from a user that 104e70cae does not resolve the issue.

Marc Deslauriers (mdeslaur) wrote :

Is this happening with Windows guests?
Were the Windows guests created with an earlier version of the qemu package?

Alejandro Comisario (alejandro-f) wrote :

Yes, it is only with Windows guests.

Marc Deslauriers (mdeslaur) wrote :
Dave Chiluk (chiluk) wrote :

Marc, we came to a similar conclusion. My backport of 104e70cae included a partial backport of 4eae2a657d1ff5ada56eb9b4966e.

The rest of 4eae2 didn't apply. I was curious whether perhaps the VirtQueueElement isn't being properly initialized or possibly has some dirty data, but I haven't figured that out yet.

Dave.

Marc Deslauriers (mdeslaur) wrote :

I think it has more to do with the first two segments of that commit, which look like they handle the "Windows balloon driver sends memory stats only if the balloon service (blnsvr.exe) is running" issue.

My other suspicion is perhaps 58a83c61496eeb0d31571a07a51bc1947e3379ac needs to be backed out.

Alejandro Comisario (alejandro-f) wrote :

Guys, this is HUGELY CRITICAL on 16.04 production OpenStack with Windows guests. The only workaround is to tell the customer to disable the balloon driver inside the Windows machine before migrating, but when a migration happens because of a failure and the customer can't disable it, a simple task like migrating a volume-backed instance doesn't work.

Please, if you can fix it ASAP, that would be great.

Dave Chiluk (chiluk) wrote :

I have confirmation that this can be worked around by turning off memory statistics via virsh before the migration. After migration memory statistics can be turned back on safely.

The command to turn off memory statistics is
virsh dommemstat --live --period 0 <VM instance name>

The command to turn on memory statistics is
virsh dommemstat --live --period 10 <VM instance name>

Alejandro Comisario (alejandro-f) wrote :

Any news other than workarounds?
It's kind of unprofessional to ask customers to do that before migrating their instances.

Dave Chiluk (chiluk) wrote :

@alejandro-f
You need to run the virsh commands on the compute hosts before doing the migration; customers should not be running the virsh commands.

Unfortunately, this looks to still exist upstream according to https://lists.gnu.org/archive/html/qemu-devel/2016-12/msg02066.html, and is related to memory statistic reporting. If you find a solution we are all ears, I'm sure the upstream qemu project would love to hear about it.

Alejandro Comisario (alejandro-f) wrote :

@chiluk
Unfortunately I don't have the necessary skills to find the code solution; all I can say is that in the qemu packages on RHEL OpenStack Platform version 9 this is fixed.

So it came to my attention that RH, despite being a contributor to the qemu/KVM project, didn't push the solution.

Sadly, on the workaround side, I have no option but to write a step-by-step guide for my customers (they are the cloud admins; I just support them when they can't fix things) to follow before doing a migration.

Dave Chiluk (chiluk) wrote :

We have found another workaround for this in openstack clouds. Somehow this issue seems to be exacerbated by live_migration_tunnelled being on.

You may be able to work around this by setting
live_migration_tunnelled = false
In your nova.conf or nova-compute.conf.

This is set by default for juju deployed openstack clouds.

Alejandro Comisario (alejandro-f) wrote :

@chiluk
Let me try that out to see if everything works as expected.

In the meantime, if you could please update us about a final resolution from the "package" side, that would be amazing.

best.

Dave Chiluk (chiluk) wrote :

Yes, I'm currently working through attempting to bisect this issue. Unfortunately I'm running into lots of issues getting iterations and bounds functioning in a manner that allows me to reliably reproduce the issue.

Alejandro Comisario (alejandro-f) wrote :

@chiluk

I can confirm that the flag provided doesn't work on Ubuntu 16.04 with Mitaka packages.
The first live migration to a compute node works, but if I immediately try to live migrate it back to the same compute node, or to another compute node, the migration sometimes fails, leaving the instance on the new node as "Shut Off" with this log in nova-compute:

2017-02-06 17:11:20.457 12456 ERROR nova.virt.libvirt.driver [req-231bec41-7937-4f3c-ab02-7b9a985cad22 66f68888afb1424a874a0fae3c5c5e52 3d7374bb9d4b4ad6a7db51a5187483a2 - - -] [instance: a3c0256b-290a-4345-865d-d31b7c79894d] Live Migration failure: operation failed: job: unexpectedly failed

And most of the time when you try to put the instance back on the previous compute node, the migration finishes as successful but the instance does not change compute node, with these logs in nova-compute:

2017-02-06 17:41:42.275 10142 ERROR nova.compute.manager [instance: a3c0256b-290a-4345-865d-d31b7c79894d] RemoteError: Remote error: libvirtError Requested operation is not valid: transient domains do not have any persistent config
2017-02-06 17:41:42.275 10142 ERROR nova.compute.manager [instance: a3c0256b-290a-4345-865d-d31b7c79894d] [u'Traceback (most recent call last):\n', u' File "/usr/lib/python2.7/dist-packages/oslo_messaging/rpc/dispatcher.py", line 138, in _dispatch_and_reply\n incoming.message))\n', u' File "/usr/lib/python2.7/dist-packages/oslo_messaging/rpc/dispatcher.py", line 185, in _dispatch\n return self._do_dispatch(endpoint, method, ctxt, args)\n', u' File "/usr/lib/python2.7/dist-packages/oslo_messaging/rpc/dispatcher.py", line 127, in _do_dispatch\n result = func(ctxt, **new_args)\n', u' File "/usr/lib/python2.7/dist-packages/nova/exception.py", line 110, in wrapped\n payload)\n', u' File "/usr/lib/python2.7/dist-packages/oslo_utils/excutils.py", line 220, in __exit__\n self.force_reraise()\n', u' File "/usr/lib/python2.7/dist-packages/oslo_utils/excutils.py", line 196, in force_reraise\n six.reraise(self.type_, self.value, self.tb)\n', u' File "/usr/lib/python2.7/dist-packages/nova/exception.py", line 89, in wrapped\n return f(self, context, *args, **kw)\n', u' File "/usr/lib/python2.7/dist-packages/nova/compute/manager.py", line 5052, in remove_volume_connection\n self._driver_detach_volume(context, instance, bdm, connection_info)\n', u' File "/usr/lib/python2.7/dist-packages/nova/compute/manager.py", line 4802, in _driver_detach_volume\n self.volume_api.roll_detaching(context, volume_id)\n', u' File "/usr/lib/python2.7/dist-packages/oslo_utils/excutils.py", line 220, in __exit__\n self.force_reraise()\n', u' File "/usr/lib/python2.7/dist-packages/oslo_utils/excutils.py", line 196, in force_reraise\n six.reraise(self.type_, self.value, self.tb)\n', u' File "/usr/lib/python2.7/dist-packages/nova/compute/manager.py", line 4790, in _driver_detach_volume\n encryption=encryption)\n', u' File "/usr/lib/python2.7/dist-packages/nova/virt/libvirt/driver.py", line 1459, in detach_volume\n live=live)\n', u' File "/usr/lib/python2.7/dist-packages/nova/virt/libvirt/guest.py", line 327, in detach_device_with_retry\n self.detach_device(conf, persistent, live)\n'...


Vadim Mishustin (vvak)
Changed in qemu (Ubuntu):
status: Confirmed → Fix Released
s10 (vlad-esten) wrote :

This bug is NOT fixed in the Xenial QEMU package.

Changed in qemu (Ubuntu):
status: Fix Released → Confirmed
Vadim Mishustin (vvak) wrote :

Sorry for my mistake.

Dave Chiluk (chiluk) wrote :

@Alejandro.

When trying with live_migration_tunnelled = false, what are you seeing in /var/log/libvirt/qemu/<specific instance>.log? You may be seeing a different issue.

Dave Chiluk (chiluk) wrote :

I have been able to create a smaller recreation environment for this.
1. Create a VM on shared storage solution. In my case NFS.
2. set start_libvirtd="yes" in /etc/default/libvirt-bin
3. systemctl restart libvirt-bin
4. virsh -c qemu+ssh://${FROM}/system migrate --live --p2p --tunnelled ${VM} qemu+tcp://ubuntu@${TO}/system
5. Repeat until failure to migrate, then check /var/log/libvirt/qemu/<vm>.log for error from above.

* Yes, --live, --p2p, and --tunnelled are all required to reproduce, AFAIK.

Using this reproducer I was able to identify upstream commit 4eae2a6 as the first good SHA where the migration starts working again.

Unfortunately this does not cherry-pick cleanly, and it appears that the virtqueue management has changed significantly as well. I'm currently trying to figure out what other patches are needed to make the qemu virtqueue stable.

Alejandro Comisario (alejandro-f) wrote :

Dave, hi.
Thanks for working on this.

The scenarios I described previously work fine on Red Hat and CentOS with live_migration_tunnelled = false set under the libvirt section.

So I still think what I'm seeing is the Ubuntu bug on migration.
Please let me know as soon as you find something else, or if I can help with something or have something to try!

best

Dave Chiluk (chiluk) wrote :

@Alejandro, are your Red Hat/CentOS installations on qemu 2.6+? They are working because the fix 4eae2a6, as well as the CVE fix, is already available in that version of qemu. IMHO, Red Hat and CentOS got lucky with that version of QEMU. It has nothing to do with quality of distribution, and everything to do with when they forked from upstream QEMU. Ubuntu just got unlucky.

When you set
live_migration_tunnelled = false
did you restart nova-compute and your libvirtd services? If your tested migration failed, did you check /var/log/libvirt/qemu/<instance>.log for
"VQ 2 size 0x80 < last_avail_idx 0x9 - used_idx 0xa"
as reported in this bug, or did you hit some other issue?

Thank you,

Alejandro Comisario (alejandro-f) wrote :

@Dave

Let me try again and take a look at the logs you pointed to, and get back to you.
if you have any news for me, let me know.

Dave Chiluk (chiluk) wrote :

Updated bug description with SRU template and test case so that the testcase can be updated as need be.

description: updated
Len (lwhite-5) wrote :

Hi,

I've figured out the actual problem and made a patch that fixes the issue. I'm not sure it will apply cleanly, as mine is based on the RHEL version, but I thought I'd share it since this gave me, and it seems many others, a headache.

This is what happens:
vdev->vq[i].inuse = (uint16_t)(vdev->vq[i].last_avail_idx - vdev->vq[i].used_idx);

if (vdev->vq[i].inuse > vdev->vq[i].vring.num)

Random example with last_avail_idx 0x1, used_idx 0x2, size 0x80:
1 - 2 = -1, but cast to an unsigned 16-bit value it ends up being 65535,
so if (65535 > 0x80) = headache

The patch I made basically checks if it's a negative and sets it to 0 as well as adding inuse to the error_report. I am sure if the error_report initially actually showed the true values being compared and not the source values, it would have been figured out sooner.
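
The arithmetic above can be checked in isolation. This is a minimal sketch, assuming the indices are 16-bit unsigned values as the cast suggests; the function names vq_inuse and vq_inuse_exceeds_ring are hypothetical and do not appear in QEMU.

```c
#include <stdint.h>

/* Mirrors the quoted expression: the operands are promoted to int for the
 * subtraction, and the result is truncated back to 16 bits by the cast. */
uint16_t vq_inuse(uint16_t last_avail_idx, uint16_t used_idx)
{
    return (uint16_t)(last_avail_idx - used_idx);
}

/* Mirrors the quoted check against the ring size (vring.num).
 * Returns 1 when the computed inuse value is larger than the ring. */
int vq_inuse_exceeds_ring(uint16_t last_avail_idx, uint16_t used_idx,
                          unsigned int ring_size)
{
    return vq_inuse(last_avail_idx, used_idx) > ring_size;
}
```

With last_avail_idx 0x1 and used_idx 0x2, vq_inuse returns 65535, and for the values from the reported error (0x9, 0xa, size 0x80) the check evaluates to true.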

Ubuntu Foundations Team Bug Bot (crichton) wrote :

The attachment "qemu.patch" seems to be a patch. If it isn't, please remove the "patch" flag from the attachment, remove the "patch" tag, and if you are a member of the ~ubuntu-reviewers, unsubscribe the team.

[This is an automated message performed by a Launchpad user owned by ~brian-murray, for any issues please contact him.]

tags: added: patch
Dave Chiluk (chiluk) wrote :

@Len

Can you provide links to the RHEL sources that you based your patch on, in order to provide more context and appropriate attribution in the Ubuntu patch?

Thanks,
Dave.

Dave Chiluk (chiluk)
Changed in qemu (Ubuntu Xenial):
status: New → Confirmed
importance: Undecided → High
Len (lwhite-5) wrote :

It's from this package: http://vault.centos.org/centos/7/virt/Source/kvm-common/qemu-kvm-ev-2.6.0-28.el7_3.6.1.src.rpm

With this applied from qemu-git: https://github.com/qemu/qemu/commit/e66bcc408146730958d1a840bda85d7ad51e0cd7.patch

Then the patch I posted here on top of that and recompiled the rpm, but the qemu-git patch isn't necessary for the one I posted to fix the issue.

Dave Chiluk (chiluk) wrote :

So I tested Len's patch, and it does seem to work.

However, I can't seem to understand why the below line is necessary, when upstream qemu has virtually identical code, and does not need this line. It almost makes me wonder if CVE-2016-5403-3.patch is incorrectly decrementing the inuse counter in our version of qemu.

"
vdev->vq[i].inuse = (inuse_tmp < 0 ? 0 : inuse_tmp);
"

@Len, in the failing case are you always seeing an inuse value of -1?

I'm building a test qemu without 2016-5403-3 right now. The risk of removing that would be that we'd have a possible leak. It's at least worth a check.
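
For readers following along, the clamping line quoted above can be sketched on its own. This is only an illustration under stated assumptions, not the attached patch: the function name vq_inuse_clamped and the was_negative output parameter are hypothetical, and inuse_tmp is assumed to be a signed int.

```c
#include <stdint.h>

/* Hypothetical stand-in for the clamping line quoted above.
 * The difference is computed in a signed int so that a negative result is
 * visible, then clamped to 0. *was_negative is set to 1 when clamping
 * happened, so that a caller could report it. */
unsigned int vq_inuse_clamped(uint16_t last_avail_idx, uint16_t used_idx,
                              int *was_negative)
{
    int inuse_tmp = (int)last_avail_idx - (int)used_idx;

    *was_negative = inuse_tmp < 0;
    return inuse_tmp < 0 ? 0u : (unsigned int)inuse_tmp;
}
```

One property of this sketch worth noting: a 16-bit index wrap such as last_avail_idx 0x0001 with used_idx 0xffff also gives a negative signed difference and is clamped to 0. Whether the attached patch behaves the same way is not stated in this thread.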

Marc Deslauriers (mdeslaur) wrote :

I had a feeling perhaps CVE-2016-5403-3.patch needed to be backed out, that's the commit I mentioned in comment #13.

Anxiously awaiting results of the test... :)

Alejandro Comisario (alejandro-f) wrote :

Definitely, I'm anxiously waiting for this to be resolved too!


Dave Chiluk (chiluk) wrote :

I just tested removing CVE-2016-5403-3.patch, but that didn't seem to do it. I still don't understand how upstream qemu functions with the calculation the way it is.

Len (lwhite-5) wrote :

Yes, whenever the bug gets triggered it's because it throws the value into the negative. However without the patch that's not what happens.

Let's take this for example:
2016-12-05T14:41:07.903932Z qemu-system-x86_64: VQ 2 size 0x80 < last_avail_idx 0x9 - used_idx 0xa

Without patch: 0x9 - 0xA = 65535
With patch: 0x9 - 0xA = -1 (reset to 0)

Because the integers in the comparison are unsigned, they cannot hold negative values, so the result wraps around to the top of the range, e.g. 65535. I thought the safest solution was to convert any negative value to 0, because comparing a negative integer against an unsigned one could produce unexpected results elsewhere, as it did here.

From my understanding the bug only gets triggered when you're migrating from an older qemu version or a VM that was originally booted on an older version, so that's likely why upstream still functions most of the time.

I spent an ungodly amount of time debugging this issue, and in every report of a similar problem on Google (with the same error message), the math always put the value into the negative. I started out by doing the same thing as everyone else, trying to find which patch caused the issue or any new commits that could help, instead of analyzing the problem itself.

It's a really annoying issue because, technically speaking, the code is correct; it's the behavior that isn't, because of mixed types. The primary goal was to retain the CVE patch if possible.

Len (lwhite-5) wrote :

I also forgot to mention that in our case it didn't matter whether the migration was tunnelled or not, and turning off the memory stats in virsh before migration didn't help at all. I did not have access to the instance to try playing around with blnsvr.exe, though.


Marc Deslauriers (mdeslaur) wrote :

There are some _untested_ qemu packages that work around this issue in the security team PPA:

https://launchpad.net/~ubuntu-security-proposed/+archive/ubuntu/ppa/+packages

They will be released as security updates once they've been through QA. Possibly in a couple of weeks.

Alejandro Comisario (alejandro-f) wrote :

Looking forward to testing this!

Dave Chiluk (chiluk) wrote :

@Marc

I reviewed your proposed changes, and I really feel you should log an error in the negative case.

sean redmond (sean-redmond1) wrote :

I find that running the below packages does not have this issue:

# dpkg -l | grep -i qemu
ii ipxe-qemu 1.0.0+git-20150424.a25a16d-1ubuntu1 all PXE boot firmware - ROM images for qemu
ii qemu-block-extra:amd64 1:2.5+dfsg-5ubuntu10.5 amd64 extra block backend modules for qemu-system and qemu-utils
ii qemu-slof 20151103+dfsg-1ubuntu1 all Slimline Open Firmware -- QEMU PowerPC version
ii qemu-system 1:2.5+dfsg-5ubuntu10.5 amd64 QEMU full system emulation binaries
ii qemu-system-arm 1:2.5+dfsg-5ubuntu10.5 amd64 QEMU full system emulation binaries (arm)
ii qemu-system-common 1:2.5+dfsg-5ubuntu10.5 amd64 QEMU full system emulation binaries (common files)
ii qemu-system-mips 1:2.5+dfsg-5ubuntu10.5 amd64 QEMU full system emulation binaries (mips)
ii qemu-system-misc 1:2.5+dfsg-5ubuntu10.5 amd64 QEMU full system emulation binaries (miscelaneous)
ii qemu-system-ppc 1:2.5+dfsg-5ubuntu10.5 amd64 QEMU full system emulation binaries (ppc)
ii qemu-system-sparc 1:2.5+dfsg-5ubuntu10.5 amd64 QEMU full system emulation binaries (sparc)
ii qemu-system-x86 1:2.5+dfsg-5ubuntu10.5 amd64 QEMU full system emulation binaries (x86)
ii qemu-utils 1:2.5+dfsg-5ubuntu10.5 amd64 QEMU utilities

But if I run any other version such as 10.6 or 10.10 I hit this most of the time.

Vadim Mishustin (vvak) wrote :

I tested 10.11 with Windows 7, 2008, and 2012 guests. Live migration works more than twice. Thank you.

Launchpad Janitor (janitor) wrote :

This bug was fixed in the package qemu - 1:2.5+dfsg-5ubuntu10.11

---------------
qemu (1:2.5+dfsg-5ubuntu10.11) xenial-security; urgency=medium

  * SECURITY UPDATE: DoS in virtio GPU device
    - debian/patches/CVE-2016-10028.patch: check virgl capabilities
      max_size in hw/display/virtio-gpu-3d.c.
    - CVE-2016-10028
  * SECURITY UPDATE: DoS in virtio GPU device
    - debian/patches/CVE-2016-10029-*.patch: check values in
      hw/display/virtio-gpu.c, hw/display/virtio-gpu-3d.c,
      include/hw/virtio/virtio-gpu.h.
    - CVE-2016-10029
  * SECURITY UPDATE: DoS via 6300esb unplug operations
    - debian/patches/CVE-2016-10155.patch: add exit function in
      hw/watchdog/wdt_i6300esb.c.
    - CVE-2016-10155
  * SECURITY UPDATE: DoS in i.MX Fast Ethernet Controller
    - debian/patches/CVE-2016-7907.patch: limit buffer descriptor count in
      hw/net/imx_fec.c.
    - CVE-2016-7907
  * SECURITY UPDATE: DoS in JAZZ RC4030 chipset emulation
    - debian/patches/CVE-2016-8667.patch: limit interval timer reload value
      in hw/dma/rc4030.c.
    - CVE-2016-8667
  * SECURITY UPDATE: DoS in 16550A UART emulation
    - debian/patches/CVE-2016-8669.patch: check divider value against baud
      base in hw/char/serial.c.
    - CVE-2016-8669
  * SECURITY UPDATE: privilege escalation via ioreq handling
    - debian/patches/CVE-2016-9381.patch: avoid double fetches and add
      bounds checks to xen-hvm.c.
    - CVE-2016-9381
  * SECURITY UPDATE: host filesystem access via virtFS
    - debian/patches/CVE-2016-9602-*.patch: don't follow symlinks in
      hw/9pfs/*.
    - CVE-2016-9602
  * SECURITY UPDATE: arbitrary code execution via Cirrus VGA
    - debian/patches/CVE-2016-9603.patch: remove bitblit support from
      console code in hw/display/cirrus_vga.c, include/ui/console.h,
      ui/console.c, ui/vnc.c.
    - CVE-2016-9603
  * SECURITY UPDATE: infinite loop in ColdFire Fast Ethernet Controller
    - debian/patches/CVE-2016-9776.patch: check receive buffer size
      register value in hw/net/mcf_fec.c.
    - CVE-2016-9776
  * SECURITY UPDATE: information leak in virtio GPU device
    - debian/patches/CVE-2016-9845.patch: properly clear out memory in
      hw/display/virtio-gpu-3d.c.
    - CVE-2016-9845
  * SECURITY UPDATE: DoS via memory leak in virtio GPU device
    - debian/patches/CVE-2016-9846.patch: properly free memory in
      hw/display/virtio-gpu.c.
    - CVE-2016-9846
  * SECURITY UPDATE: DoS via memory leak in USB redirector
    - debian/patches/CVE-2016-9907.patch: properly free memory in
      hw/usb/redirect.c.
    - CVE-2016-9907
  * SECURITY UPDATE: information leak in virtio GPU device
    - debian/patches/CVE-2016-9908.patch: properly clear out memory in
      hw/display/virtio-gpu-3d.c.
    - CVE-2016-9908
  * SECURITY UPDATE: DoS via memory leak in USB EHCI Emulation
    - debian/patches/CVE-2016-9911.patch: properly free memory in
      hw/usb/hcd-ehci.c.
    - CVE-2016-9911
  * SECURITY UPDATE: DoS via memory leak in virtio GPU device
    - debian/patches/CVE-2016-9912.patch: properly free memory in
      hw/display/virtio-gpu.c.
    - CVE-2016-9912
  * SECURITY UPDATE: DoS via virtFS
    - debian/patches/CVE-2016-9913....

Changed in qemu (Ubuntu Xenial):
status: Confirmed → Fix Released
Launchpad Janitor (janitor) wrote :
This bug was fixed in the package qemu - 2.0.0+dfsg-2ubuntu1.33

---------------
qemu (2.0.0+dfsg-2ubuntu1.33) trusty-security; urgency=medium

  * SECURITY UPDATE: DoS via 6300esb unplug operations
    - debian/patches/CVE-2016-10155.patch: add exit function in
      hw/watchdog/wdt_i6300esb.c.
    - CVE-2016-10155
  * SECURITY UPDATE: DoS in JAZZ RC4030 chipset emulation
    - debian/patches/CVE-2016-8667.patch: limit interval timer reload value
      in hw/dma/rc4030.c.
    - CVE-2016-8667
  * SECURITY UPDATE: DoS in 16550A UART emulation
    - debian/patches/CVE-2016-8669.patch: check divider value against baud
      base in hw/char/serial.c.
    - CVE-2016-8669
  * SECURITY UPDATE: privilege escalation via ioreq handling
    - debian/patches/CVE-2016-9381.patch: avoid double fetches and add
      bounds checks to xen-all.c.
    - CVE-2016-9381
  * SECURITY UPDATE: host filesystem access via virtFS
    - debian/patches/CVE-2016-9602-*.patch: don't follow symlinks in
      hw/9pfs/*.
    - CVE-2016-9602
  * SECURITY UPDATE: arbitrary code execution via Cirrus VGA
    - debian/patches/CVE-2016-9603.patch: remove bitblit support from
      console code in hw/display/cirrus_vga.c, include/ui/console.h,
      ui/console.c, ui/vnc.c.
    - CVE-2016-9603
  * SECURITY UPDATE: infinite loop in ColdFire Fast Ethernet Controller
    - debian/patches/CVE-2016-9776.patch: check receive buffer size
      register value in hw/net/mcf_fec.c.
    - CVE-2016-9776
  * SECURITY UPDATE: DoS via memory leak in USB redirector
    - debian/patches/CVE-2016-9907.patch: properly free memory in
      hw/usb/redirect.c.
    - CVE-2016-9907
  * SECURITY UPDATE: DoS via memory leak in USB EHCI Emulation
    - debian/patches/CVE-2016-9911.patch: properly free memory in
      hw/usb/hcd-ehci.c.
    - CVE-2016-9911
  * SECURITY UPDATE: DoS via virtFS
    - debian/patches/CVE-2016-9913.patch: adjust the order of resource
      cleanup in hw/9pfs/virtio-9p-device.c.
    - CVE-2016-9913
  * SECURITY UPDATE: DoS via virtFS
    - debian/patches/CVE-2016-9914-*.patch: add cleanup operations to
      fsdev/file-op-9p.h, hw/9pfs/virtio-9p-device.c.
    - CVE-2016-9914
  * SECURITY UPDATE: DoS via virtFS
    - debian/patches/CVE-2016-9915.patch: add cleanup operation to
      hw/9pfs/virtio-9p-handle.c.
    - CVE-2016-9915
  * SECURITY UPDATE: DoS via virtFS
    - debian/patches/CVE-2016-9916.patch: add cleanup operation to
      hw/9pfs/virtio-9p-proxy.c.
    - CVE-2016-9916
  * SECURITY UPDATE: DoS in Cirrus VGA
    - debian/patches/CVE-2016-9921-9922.patch: check bpp values in
      hw/display/cirrus_vga.c.
    - CVE-2016-9921
    - CVE-2016-9922
  * SECURITY UPDATE: code execution via Cirrus VGA
    - debian/patches/CVE-2017-2615.patch: fix oob access in
      hw/display/cirrus_vga.c.
    - CVE-2017-2615
  * SECURITY UPDATE: code execution via Cirrus VGA
    - debian/patches/CVE-2017-2620-pre.patch: add extra parameter to
      blit_is_unsafe in hw/display/cirrus_vga.c.
    - debian/patches/CVE-2017-2620.patch: add blit destination check to
      hw/display/cirrus_vga.c.
    - CVE-2017-2620
  * SECURITY UPDATE: memory corruption issues in VNC
    - debian/patches/CVE-2017-263...

Changed in qemu (Ubuntu):
status: Confirmed → Fix Released
Alejandro Comisario (alejandro-f) wrote :

Has this fix already been pushed to Xenial?

Dave Chiluk (chiluk) wrote :

Yes, comment #46 shows it was pushed to Xenial, and I checked that it is currently in updates.

Alejandro Comisario (alejandro-f) wrote :

Thanks, Dave.
