NFS: Fail to boot VM out of large snapshots (30GB+)

Bug #1646181 reported by JB Broccard
This bug affects 2 people
Affects                    Status         Importance  Assigned to         Milestone
Cinder                     Fix Released   Undecided   Eric Harney
OpenStack Compute (nova)   Fix Released   Medium      Sean Dague
Newton                     Fix Committed  Medium      Kashyap Chamarthy

Bug Description

Description
===========
Using NFS shared storage, booting a VM from a small snapshot (about 1GB) works fine.
However, booting from a larger snapshot (30GB+) fails, regardless of the OpenStack release (Newton or Mitaka).

Steps to reproduce
==================
A chronological list of steps that reproduce the issue:
* I have OpenStack RDO Newton (or Mitaka) installed and functional
* I boot a VM from a QCOW2 image of about 1GB
* Then I log into that VM and create a large file (33GB) to inflate the VM image
* Then I shut off the VM and take a snapshot of it that I call "largeVMsnapshotImage"

Alternatively to the steps above:
* I have a snapshot from a large VM (30GB+) that I upload to Glance and call "largeVMsnapshotImage"

Then I do:
* I try to boot a new VM from that snapshot using the same network
* Although the image appears to be copied to the compute node, the VM creation fails on the "qemu-img info" command

If I run the same command manually, it works:
/usr/bin/python2 -m oslo_concurrency.prlimit --as=1073741824 --cpu=2 -- env LC_ALL=C LANG=C qemu-img info /var/lib/nova/instances/_base/2b54e1ca13134ceb7fc489d58d7aa6fd321b1885
image: /var/lib/nova/instances/_base/2b54e1ca13134ceb7fc489d58d7aa6fd321b1885
file format: raw
virtual size: 80G (85899345920 bytes)
disk size: 37G
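
As a cross-check, the limits Nova applies can be reproduced directly with oslo.concurrency. A minimal sketch (not part of the original report), assuming oslo.concurrency is available as it is on any Mitaka/Newton compute node:

# Reproduction sketch: run qemu-img info under the same 1 GiB address-space
# and 2 CPU-second limits that Nova's images.py applies. If a limit is hit,
# ProcessExecutionError is raised with exit code -9.
from oslo_concurrency import processutils

QEMU_IMG_LIMITS = processutils.ProcessLimits(
    cpu_time=2,
    address_space=1 * 1024 ** 3)

out, err = processutils.execute(
    'env', 'LC_ALL=C', 'LANG=C', 'qemu-img', 'info',
    '/var/lib/nova/instances/_base/2b54e1ca13134ceb7fc489d58d7aa6fd321b1885',
    prlimit=QEMU_IMG_LIMITS)
print(out)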

However, in the logs it fails and the VM creation is interrupted; see this excerpt from nova-compute.log on the compute node:
...
2016-11-29 17:52:23.581 10284 ERROR nova.compute.manager [instance: d6889ea2-f277-40e5-afdc-b3b0698537ed] BuildAbortException: Build of instance d6889ea2-f277-40e5-afdc-b3b0698537ed aborted: Disk info file is invalid: qemu-img failed to execute on /var/lib/nova/instances/_base/2b54e1ca13134ceb7fc489d58d7aa6fd321b1885 : Unexpected error while running command.
2016-11-29 17:52:23.581 10284 ERROR nova.compute.manager [instance: d6889ea2-f277-40e5-afdc-b3b0698537ed] Command: /usr/bin/python2 -m oslo_concurrency.prlimit --as=1073741824 --cpu=2 -- env LC_ALL=C LANG=C qemu-img info /var/lib/nova/instances/_base/2b54e1ca13134ceb7fc489d58d7aa6fd321b1885
2016-11-29 17:52:23.581 10284 ERROR nova.compute.manager [instance: d6889ea2-f277-40e5-afdc-b3b0698537ed] Exit code: -9
2016-11-29 17:52:23.581 10284 ERROR nova.compute.manager [instance: d6889ea2-f277-40e5-afdc-b3b0698537ed] Stdout: u''
2016-11-29 17:52:23.581 10284 ERROR nova.compute.manager [instance: d6889ea2-f277-40e5-afdc-b3b0698537ed] Stderr: u''
...

Expected result
===============
The VM should have been created and booted from the large snapshot image.

Actual result
=============
The qemu-img info command fails with exit code -9 when Nova runs it, and the instance build is aborted.

Environment
===========
1. Running RDO Newton on Centos 7.2 (or Oracle Linux 7.2) and reproduced on RDO Mitaka as well

   [root@controller ~]# rpm -qa|grep nova
openstack-nova-console-14.0.0-1.el7.noarch
puppet-nova-9.4.0-1.el7.noarch
python-nova-14.0.0-1.el7.noarch
openstack-nova-novncproxy-14.0.0-1.el7.noarch
openstack-nova-conductor-14.0.0-1.el7.noarch
openstack-nova-api-14.0.0-1.el7.noarch
openstack-nova-common-14.0.0-1.el7.noarch
openstack-nova-scheduler-14.0.0-1.el7.noarch
openstack-nova-serialproxy-14.0.0-1.el7.noarch
python2-novaclient-6.0.0-1.el7.noarch
openstack-nova-cert-14.0.0-1.el7.noarch

2. Which hypervisor did you use?
   KVM

   details:
   [root@compute4 nova]# rpm -qa|grep -Ei "kvm|qemu|libvirt"
libvirt-gobject-0.1.9-1.el7.x86_64
libvirt-gconfig-0.1.9-1.el7.x86_64
libvirt-daemon-1.2.17-13.0.1.el7.x86_64
qemu-kvm-common-1.5.3-105.el7.x86_64
qemu-img-1.5.3-105.el7.x86_64
ipxe-roms-qemu-20130517-7.gitc4bce43.el7.noarch
libvirt-client-1.2.17-13.0.1.el7.x86_64
libvirt-daemon-driver-nodedev-1.2.17-13.0.1.el7.x86_64
libvirt-daemon-driver-lxc-1.2.17-13.0.1.el7.x86_64
libvirt-daemon-kvm-1.2.17-13.0.1.el7.x86_64
libvirt-daemon-driver-secret-1.2.17-13.0.1.el7.x86_64
libvirt-python-1.2.17-2.el7.x86_64
libvirt-daemon-config-network-1.2.17-13.0.1.el7.x86_64
libvirt-daemon-config-nwfilter-1.2.17-13.0.1.el7.x86_64
libvirt-daemon-driver-storage-1.2.17-13.0.1.el7.x86_64
qemu-kvm-1.5.3-105.el7.x86_64
libvirt-1.2.17-13.0.1.el7.x86_64
libvirt-daemon-driver-interface-1.2.17-13.0.1.el7.x86_64
libvirt-daemon-driver-network-1.2.17-13.0.1.el7.x86_64
libvirt-daemon-driver-nwfilter-1.2.17-13.0.1.el7.x86_64
libvirt-daemon-driver-qemu-1.2.17-13.0.1.el7.x86_64
libvirt-glib-0.1.9-1.el7.x86_64

3. Which storage type did you use?
   NFS
   From a Sun ZFS appliance or just from a plain NFS server (both environments show the same issue)
   [root@compute4 nova]# mount | grep -i instance
172.31.254.254:/nova on /var/lib/nova/instances type nfs4 (rw,relatime,vers=4.0,rsize=1048576,wsize=1048576,namlen=255,hard,proto=tcp,port=0,timeo=600,retrans=2,sec=sys,clientaddr=172.31.0.114,local_lock=none,addr=172.31.254.254)

4. Which networking type did you use?
   Neutron with OpenVSwitch

Logs & Configs
==============

From nova-compute.log:

2016-11-30 11:34:50.855 29776 INFO nova.compute.resource_tracker [req-dc5e7f75-5ae5-4929-a7b5-f4fdb95c171e - - - - -] Compute_service record updated for c00b05:c00b05
2016-11-30 11:35:10.983 29776 ERROR nova.compute.manager [req-ee0aafda-9ab6-4e8a-b7ad-669f84a5eaf8 7004c133bffa42858f0438c3866c5224 67d87660f58d4cc08c05227b7b668b2d - - -] [instance: 1e1c2b5a-803a-4701-bc71-254699243f7d] Instance failed to spawn
2016-11-30 11:35:10.983 29776 ERROR nova.compute.manager [instance: 1e1c2b5a-803a-4701-bc71-254699243f7d] Traceback (most recent call last):
2016-11-30 11:35:10.983 29776 ERROR nova.compute.manager [instance: 1e1c2b5a-803a-4701-bc71-254699243f7d] File "/usr/lib/python2.7/site-packages/nova/compute/manager.py", line 2078, in _build_resources
2016-11-30 11:35:10.983 29776 ERROR nova.compute.manager [instance: 1e1c2b5a-803a-4701-bc71-254699243f7d] yield resources
2016-11-30 11:35:10.983 29776 ERROR nova.compute.manager [instance: 1e1c2b5a-803a-4701-bc71-254699243f7d] File "/usr/lib/python2.7/site-packages/nova/compute/manager.py", line 1920, in _build_and_run_instance
2016-11-30 11:35:10.983 29776 ERROR nova.compute.manager [instance: 1e1c2b5a-803a-4701-bc71-254699243f7d] block_device_info=block_device_info)
2016-11-30 11:35:10.983 29776 ERROR nova.compute.manager [instance: 1e1c2b5a-803a-4701-bc71-254699243f7d] File "/usr/lib/python2.7/site-packages/nova/virt/libvirt/driver.py", line 2571, in spawn
2016-11-30 11:35:10.983 29776 ERROR nova.compute.manager [instance: 1e1c2b5a-803a-4701-bc71-254699243f7d] admin_pass=admin_password)
2016-11-30 11:35:10.983 29776 ERROR nova.compute.manager [instance: 1e1c2b5a-803a-4701-bc71-254699243f7d] File "/usr/lib/python2.7/site-packages/nova/virt/libvirt/driver.py", line 2975, in _create_image
2016-11-30 11:35:10.983 29776 ERROR nova.compute.manager [instance: 1e1c2b5a-803a-4701-bc71-254699243f7d] fallback_from_host)
2016-11-30 11:35:10.983 29776 ERROR nova.compute.manager [instance: 1e1c2b5a-803a-4701-bc71-254699243f7d] File "/usr/lib/python2.7/site-packages/nova/virt/libvirt/driver.py", line 3075, in _create_and_inject_local_root
2016-11-30 11:35:10.983 29776 ERROR nova.compute.manager [instance: 1e1c2b5a-803a-4701-bc71-254699243f7d] instance, size, fallback_from_host)
2016-11-30 11:35:10.983 29776 ERROR nova.compute.manager [instance: 1e1c2b5a-803a-4701-bc71-254699243f7d] File "/usr/lib/python2.7/site-packages/nova/virt/libvirt/driver.py", line 6537, in _try_fetch_image_cache
2016-11-30 11:35:10.983 29776 ERROR nova.compute.manager [instance: 1e1c2b5a-803a-4701-bc71-254699243f7d] size=size)
2016-11-30 11:35:10.983 29776 ERROR nova.compute.manager [instance: 1e1c2b5a-803a-4701-bc71-254699243f7d] File "/usr/lib/python2.7/site-packages/nova/virt/libvirt/imagebackend.py", line 218, in cache
2016-11-30 11:35:10.983 29776 ERROR nova.compute.manager [instance: 1e1c2b5a-803a-4701-bc71-254699243f7d] *args, **kwargs)
2016-11-30 11:35:10.983 29776 ERROR nova.compute.manager [instance: 1e1c2b5a-803a-4701-bc71-254699243f7d] File "/usr/lib/python2.7/site-packages/nova/virt/libvirt/imagebackend.py", line 565, in create_image
2016-11-30 11:35:10.983 29776 ERROR nova.compute.manager [instance: 1e1c2b5a-803a-4701-bc71-254699243f7d] self.verify_base_size(base, size)
2016-11-30 11:35:10.983 29776 ERROR nova.compute.manager [instance: 1e1c2b5a-803a-4701-bc71-254699243f7d] File "/usr/lib/python2.7/site-packages/nova/virt/libvirt/imagebackend.py", line 265, in verify_base_size
2016-11-30 11:35:10.983 29776 ERROR nova.compute.manager [instance: 1e1c2b5a-803a-4701-bc71-254699243f7d] base_size = self.get_disk_size(base)
2016-11-30 11:35:10.983 29776 ERROR nova.compute.manager [instance: 1e1c2b5a-803a-4701-bc71-254699243f7d] File "/usr/lib/python2.7/site-packages/nova/virt/libvirt/imagebackend.py", line 277, in get_disk_size
2016-11-30 11:35:10.983 29776 ERROR nova.compute.manager [instance: 1e1c2b5a-803a-4701-bc71-254699243f7d] return disk.get_disk_size(name)
2016-11-30 11:35:10.983 29776 ERROR nova.compute.manager [instance: 1e1c2b5a-803a-4701-bc71-254699243f7d] File "/usr/lib/python2.7/site-packages/nova/virt/disk/api.py", line 148, in get_disk_size
2016-11-30 11:35:10.983 29776 ERROR nova.compute.manager [instance: 1e1c2b5a-803a-4701-bc71-254699243f7d] return images.qemu_img_info(path).virtual_size
2016-11-30 11:35:10.983 29776 ERROR nova.compute.manager [instance: 1e1c2b5a-803a-4701-bc71-254699243f7d] File "/usr/lib/python2.7/site-packages/nova/virt/images.py", line 67, in qemu_img_info
2016-11-30 11:35:10.983 29776 ERROR nova.compute.manager [instance: 1e1c2b5a-803a-4701-bc71-254699243f7d] raise exception.InvalidDiskInfo(reason=msg)
2016-11-30 11:35:10.983 29776 ERROR nova.compute.manager [instance: 1e1c2b5a-803a-4701-bc71-254699243f7d] InvalidDiskInfo: Disk info file is invalid: qemu-img failed to execute on /var/lib/nova/instances/_base/4cfca5366c949a8588736b75803acd375751ca52 : Unexpected error while running command.
2016-11-30 11:35:10.983 29776 ERROR nova.compute.manager [instance: 1e1c2b5a-803a-4701-bc71-254699243f7d] Command: /usr/bin/python2 -m oslo_concurrency.prlimit --as=1073741824 --cpu=2 -- env LC_ALL=C LANG=C qemu-img info /var/lib/nova/instances/_base/4cfca5366c949a8588736b75803acd375751ca52
2016-11-30 11:35:10.983 29776 ERROR nova.compute.manager [instance: 1e1c2b5a-803a-4701-bc71-254699243f7d] Exit code: -9
2016-11-30 11:35:10.983 29776 ERROR nova.compute.manager [instance: 1e1c2b5a-803a-4701-bc71-254699243f7d] Stdout: u''
2016-11-30 11:35:10.983 29776 ERROR nova.compute.manager [instance: 1e1c2b5a-803a-4701-bc71-254699243f7d] Stderr: u''
2016-11-30 11:35:10.983 29776 ERROR nova.compute.manager [instance: 1e1c2b5a-803a-4701-bc71-254699243f7d]
2016-11-30 11:35:11.005 29776 WARNING nova.image.glance [req-ee0aafda-9ab6-4e8a-b7ad-669f84a5eaf8 7004c133bffa42858f0438c3866c5224 67d87660f58d4cc08c05227b7b668b2d - - -] No protocol specified in for api_server 'http://10.240.121.13:9292', please update [glance] api_servers with fully qualified url including scheme (http / https)
2016-11-30 11:35:11.016 29776 INFO nova.virt.libvirt.driver [-] [instance: 1e1c2b5a-803a-4701-bc71-254699243f7d] During wait destroy, instance disappeared.
2016-11-30 11:35:11.045 29776 INFO nova.virt.libvirt.driver [req-ee0aafda-9ab6-4e8a-b7ad-669f84a5eaf8 7004c133bffa42858f0438c3866c5224 67d87660f58d4cc08c05227b7b668b2d - - -] [instance: 1e1c2b5a-803a-4701-bc71-254699243f7d] Deleting instance files /var/lib/nova/instances/1e1c2b5a-803a-4701-bc71-254699243f7d_del
2016-11-30 11:35:11.056 29776 INFO nova.virt.libvirt.driver [req-ee0aafda-9ab6-4e8a-b7ad-669f84a5eaf8 7004c133bffa42858f0438c3866c5224 67d87660f58d4cc08c05227b7b668b2d - - -] [instance: 1e1c2b5a-803a-4701-bc71-254699243f7d] Deletion of /var/lib/nova/instances/1e1c2b5a-803a-4701-bc71-254699243f7d_del complete
2016-11-30 11:35:11.194 29776 INFO nova.compute.manager [req-ee0aafda-9ab6-4e8a-b7ad-669f84a5eaf8 7004c133bffa42858f0438c3866c5224 67d87660f58d4cc08c05227b7b668b2d - - -] [instance: 1e1c2b5a-803a-4701-bc71-254699243f7d] Took 0.18 seconds to destroy the instance on the hypervisor.
2016-11-30 11:35:11.610 29776 INFO nova.compute.manager [req-ee0aafda-9ab6-4e8a-b7ad-669f84a5eaf8 7004c133bffa42858f0438c3866c5224 67d87660f58d4cc08c05227b7b668b2d - - -] [instance: 1e1c2b5a-803a-4701-bc71-254699243f7d] Took 0.42 seconds to deallocate network for instance.
2016-11-30 11:35:11.610 29776 WARNING nova.image.glance [req-ee0aafda-9ab6-4e8a-b7ad-669f84a5eaf8 7004c133bffa42858f0438c3866c5224 67d87660f58d4cc08c05227b7b668b2d - - -] No protocol specified in for api_server 'http://10.240.121.13:9292', please update [glance] api_servers with fully qualified url including scheme (http / https)
2016-11-30 11:35:11.789 29776 WARNING nova.image.glance [req-ee0aafda-9ab6-4e8a-b7ad-669f84a5eaf8 7004c133bffa42858f0438c3866c5224 67d87660f58d4cc08c05227b7b668b2d - - -] No protocol specified in for api_server 'http://10.240.121.13:9292', please update [glance] api_servers with fully qualified url including scheme (http / https)
2016-11-30 11:35:11.793 29776 ERROR nova.compute.manager [req-ee0aafda-9ab6-4e8a-b7ad-669f84a5eaf8 7004c133bffa42858f0438c3866c5224 67d87660f58d4cc08c05227b7b668b2d - - -] [instance: 1e1c2b5a-803a-4701-bc71-254699243f7d] Build of instance 1e1c2b5a-803a-4701-bc71-254699243f7d aborted: Disk info file is invalid: qemu-img failed to execute on /var/lib/nova/instances/_base/4cfca5366c949a8588736b75803acd375751ca52 : Unexpected error while running command.
Command: /usr/bin/python2 -m oslo_concurrency.prlimit --as=1073741824 --cpu=2 -- env LC_ALL=C LANG=C qemu-img info /var/lib/nova/instances/_base/4cfca5366c949a8588736b75803acd375751ca52
Exit code: -9
Stdout: u''
Stderr: u''
2016-11-30 11:35:11.793 29776 ERROR nova.compute.manager [instance: 1e1c2b5a-803a-4701-bc71-254699243f7d] Traceback (most recent call last):
2016-11-30 11:35:11.793 29776 ERROR nova.compute.manager [instance: 1e1c2b5a-803a-4701-bc71-254699243f7d] File "/usr/lib/python2.7/site-packages/nova/compute/manager.py", line 1779, in _do_build_and_run_instance
2016-11-30 11:35:11.793 29776 ERROR nova.compute.manager [instance: 1e1c2b5a-803a-4701-bc71-254699243f7d] filter_properties)
2016-11-30 11:35:11.793 29776 ERROR nova.compute.manager [instance: 1e1c2b5a-803a-4701-bc71-254699243f7d] File "/usr/lib/python2.7/site-packages/nova/compute/manager.py", line 1971, in _build_and_run_instance
2016-11-30 11:35:11.793 29776 ERROR nova.compute.manager [instance: 1e1c2b5a-803a-4701-bc71-254699243f7d] reason=e.format_message())
2016-11-30 11:35:11.793 29776 ERROR nova.compute.manager [instance: 1e1c2b5a-803a-4701-bc71-254699243f7d] BuildAbortException: Build of instance 1e1c2b5a-803a-4701-bc71-254699243f7d aborted: Disk info file is invalid: qemu-img failed to execute on /var/lib/nova/instances/_base/4cfca5366c949a8588736b75803acd375751ca52 : Unexpected error while running command.
2016-11-30 11:35:11.793 29776 ERROR nova.compute.manager [instance: 1e1c2b5a-803a-4701-bc71-254699243f7d] Command: /usr/bin/python2 -m oslo_concurrency.prlimit --as=1073741824 --cpu=2 -- env LC_ALL=C LANG=C qemu-img info /var/lib/nova/instances/_base/4cfca5366c949a8588736b75803acd375751ca52
2016-11-30 11:35:11.793 29776 ERROR nova.compute.manager [instance: 1e1c2b5a-803a-4701-bc71-254699243f7d] Exit code: -9
2016-11-30 11:35:11.793 29776 ERROR nova.compute.manager [instance: 1e1c2b5a-803a-4701-bc71-254699243f7d] Stdout: u''
2016-11-30 11:35:11.793 29776 ERROR nova.compute.manager [instance: 1e1c2b5a-803a-4701-bc71-254699243f7d] Stderr: u''
2016-11-30 11:35:11.793 29776 ERROR nova.compute.manager [instance: 1e1c2b5a-803a-4701-bc71-254699243f7d]
2016-11-30 11:35:11.881 29776 INFO nova.compute.manager [req-ee0aafda-9ab6-4e8a-b7ad-669f84a5eaf8 7004c133bffa42858f0438c3866c5224 67d87660f58d4cc08c05227b7b668b2d - - -] [instance: 1e1c2b5a-803a-4701-bc71-254699243f7d] Took 0.09 seconds to deallocate network for instance.
2016-11-30 11:35:12.177 29776 INFO nova.compute.manager [-] [instance: 1e1c2b5a-803a-4701-bc71-254699243f7d] During sync_power_state the instance has a pending task (spawning). Skip.

Revision history for this message
JB Broccard (jbbroccard) wrote :

Anyone?
I'm still investigating the issue and trying to debug images.py and the execute wrapper to see what's causing the issue.

Revision history for this message
JB Broccard (jbbroccard) wrote :

All,
I was able to work around the issue by running the "qemu-img info" command right before the Python code executes it. See my change in /usr/lib/python2.7/site-packages/nova/virt/images.py:

def qemu_img_info(path, format=None):
    """Return an object containing the parsed output from qemu-img info."""
    # TODO(mikal): this code should not be referring to a libvirt specific
    # flag.
    # NOTE(sirp): The config option import must go here to avoid an import
    # cycle
    CONF.import_opt('images_type', 'nova.virt.libvirt.imagebackend',
                    group='libvirt')
    if not os.path.exists(path) and CONF.libvirt.images_type != 'rbd':
        raise exception.DiskNotFound(location=path)

    try:
        cmd = ('env', 'LC_ALL=C', 'LANG=C', 'qemu-img', 'info', path)
        if format is not None:
            cmd = cmd + ('-f', format)
        os.system('(time /usr/bin/qemu-img info ' + str(path) + ') >> /tmp/testfile 2>&1') # <- my change
        out, err = utils.execute(*cmd, prlimit=QEMU_IMG_LIMITS)

With this change the issue no longer occurs.
What I noticed is that accessing files on the NFS-mounted filesystem can be very slow right after the copy/convert; running the same command immediately before the one that triggers the exception was enough to make it succeed.
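
A minimal sketch of the same pre-warm idea without shelling out via os.system (an alternative suggested here, assuming the slowness is indeed the NFS client populating its cache for the freshly written base file):

# Hypothetical alternative to the os.system() hot patch above: read the
# start of the base file so the NFS client has it cached before the
# prlimited qemu-img info call runs.
def prewarm_nfs_file(path, chunk=4 * 1024 * 1024):
    """Read the first few MiB of 'path' to warm the NFS client cache."""
    with open(path, 'rb') as f:
        f.read(chunk)

# Called just before: out, err = utils.execute(*cmd, prlimit=QEMU_IMG_LIMITS)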

I tried to track down where the exit code -9 comes from, but it seems to come from the qemu-img process itself, and the oslo_concurrency libraries do not interpret this "-9".

Could anyone give an opinion on this workaround?

Matt Riedemann (mriedem)
summary: - Fail to boot VM out of large snapshots (30GB+)
+ NFS: Fail to boot VM out of large snapshots (30GB+)
Revision history for this message
Sean Dague (sdague) wrote :

So is the issue the prlimit command is getting in the way? What if you remove the prlimit?

Changed in nova:
status: New → Incomplete
tags: added: nfs
Revision history for this message
Sean Dague (sdague) wrote :

Or increase QEMU_IMG_LIMITS to a larger value for CPU; my guess is that the limits are being hit because the NFS server is so slow that the command bails out.

Revision history for this message
JB Broccard (jbbroccard) wrote :

Sean, how can I increase that value?

Revision history for this message
JB Broccard (jbbroccard) wrote :

I found this in the images.py code. Would it work if I, say, multiply the value by 10:

...
QEMU_IMG_LIMITS = processutils.ProcessLimits(
    cpu_time=2,
    address_space=1 * units.Gi * 10)
...

This still requires modifying the nova code. Is there a way to change it from the outside, such as a parameter in nova.conf?

Revision history for this message
Kashyap Chamarthy (kashyapc) wrote :

Was discussing this issue with QEMU upstream this morning. (Thanks to
Stefan Hajnoczi & Dan Berrange for the chat on IRC).

The "-9" is actually SIGKILL signal, which is expected when the CPU /
memory limits are hit.

Also:

    'Exit status' comes from an exception which is in turn populated
    from python subprocess "returncode" attribute, whichis just
    documented as "Exit status of the child process"
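
For illustration, a small standalone sketch (not from the original discussion) showing where the -9 comes from on Linux: Python's subprocess reports a negative returncode equal to minus the signal number when the child process is killed by a signal.

# Demonstration sketch: a child killed by SIGKILL (signal 9) shows up in
# subprocess with returncode == -9, matching the "Exit code: -9" in the logs.
import signal
import subprocess

proc = subprocess.Popen(['sleep', '60'])
proc.send_signal(signal.SIGKILL)
print(proc.wait())  # prints -9, i.e. -signal.SIGKILL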

---

I was also incidentally triaging a downstream bug report that has
similar error: they seem to notice the said error if they upload images
to Glance that are > ~20GB in size and then booting an instance out of
it.

And, it seems like the reporter downstream has confirmed by increasing
the CPU limit from 2 to 8, they're able to successfully boot instances.

So, can you also try increasing the CPU limit from 2 to 8 (or
similar, based on your machine's specifications)?
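
A minimal sketch of that local edit in nova/virt/images.py (the same change that was later merged as the fix; note that this is a source edit, not a nova.conf option):

# In nova/virt/images.py, where processutils and units are already imported:
# raise the CPU-time limit from 2 to 8 seconds, keeping the 1 GiB
# address-space cap unchanged.
QEMU_IMG_LIMITS = processutils.ProcessLimits(
    cpu_time=8,
    address_space=1 * units.Gi)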

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to nova (master)

Fix proposed to branch: master
Review: https://review.openstack.org/408668

Changed in nova:
assignee: nobody → Sean Dague (sdague)
status: Incomplete → In Progress
Sean Dague (sdague)
Changed in nova:
importance: Undecided → Medium
Revision history for this message
Sean Dague (sdague) wrote :

Please see if the attached patch works for you and makes the issue go away.

Revision history for this message
JB Broccard (jbbroccard) wrote :

Sean,
I did not apply that patch, but rather just changed cpu_time from 2 to 8, and it worked.
Thanks for the help - hopefully this patch will be rolled into the master branch for future releases.

Revision history for this message
JB Broccard (jbbroccard) wrote :

Sean, cpu_time seems to be a tunable value to me; shouldn't it be changeable from the outside (like in nova.conf) instead of having to tune it inside the nova code?

Revision history for this message
Sean Dague (sdague) wrote :

JB, it's really a backstop for "stuff went horribly wrong and we're trying to protect you". I don't think we want to make this configurable.

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to nova (master)

Reviewed: https://review.openstack.org/408668
Committed: https://git.openstack.org/cgit/openstack/nova/commit/?id=b78b1f8ce3aa407307a6adc5c60de1e960547897
Submitter: Jenkins
Branch: master

commit b78b1f8ce3aa407307a6adc5c60de1e960547897
Author: Sean Dague <email address hidden>
Date: Thu Dec 8 10:09:06 2016 -0500

    Bump prlimit cpu time for qemu from 2 to 8

    We've got user reported bugs that when opperating with slow NFS
    backends with large (30+ GB) disk files, the prlimit of cpu_time 2 is
    guessed to be the issue at hand because if folks hot patch a qemu-img
    that runs before the prlimitted one, the prlimitted one succeeds.

    This increases the allowed cpu timeout, as well as tweaking the error
    message so that we return something more prescriptive when the
    qemu-img command fails with prlimit abort.

    The original bug (#1449062) the main mitigation concern here was a
    carefully crafted image that gets qemu-img to generate > 1G of json,
    and hence could be a node attack vector. cpu_time was never mentioned,
    and I think was added originally as a belt and suspenders addition. As
    such, bumping it to 8 seconds shouldn't impact our protection in any
    real way.

    Change-Id: I1f4549b787fd3b458e2c48a90bf80025987f08c4
    Closes-Bug: #1646181

Changed in nova:
status: In Progress → Fix Released
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to nova (stable/newton)

Fix proposed to branch: stable/newton
Review: https://review.openstack.org/409774

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to nova (stable/mitaka)

Fix proposed to branch: stable/mitaka
Review: https://review.openstack.org/409775

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Change abandoned on nova (stable/mitaka)

Change abandoned by Kashyap Chamarthy (<email address hidden>) on branch: stable/mitaka
Review: https://review.openstack.org/409775
Reason: Alright, it skipped me that Mitaka is now Security-supported only, so obviously this doesn't fit into that criteria. Therefore, abandoning.

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to nova (stable/newton)

Reviewed: https://review.openstack.org/409774
Committed: https://git.openstack.org/cgit/openstack/nova/commit/?id=a859481b797048579f69e8c07b09a4628776fb1f
Submitter: Jenkins
Branch: stable/newton

commit a859481b797048579f69e8c07b09a4628776fb1f
Author: Sean Dague <email address hidden>
Date: Thu Dec 8 10:09:06 2016 -0500

    Bump prlimit cpu time for qemu from 2 to 8

    We've got user reported bugs that when opperating with slow NFS
    backends with large (30+ GB) disk files, the prlimit of cpu_time 2 is
    guessed to be the issue at hand because if folks hot patch a qemu-img
    that runs before the prlimitted one, the prlimitted one succeeds.

    This increases the allowed cpu timeout, as well as tweaking the error
    message so that we return something more prescriptive when the
    qemu-img command fails with prlimit abort.

    The original bug (#1449062) the main mitigation concern here was a
    carefully crafted image that gets qemu-img to generate > 1G of json,
    and hence could be a node attack vector. cpu_time was never mentioned,
    and I think was added originally as a belt and suspenders addition. As
    such, bumping it to 8 seconds shouldn't impact our protection in any
    real way.

    Change-Id: I1f4549b787fd3b458e2c48a90bf80025987f08c4
    Closes-Bug: #1646181
    (cherry picked from commit b78b1f8ce3aa407307a6adc5c60de1e960547897)

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/nova 15.0.0.0b2

This issue was fixed in the openstack/nova 15.0.0.0b2 development milestone.

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/nova 14.0.3

This issue was fixed in the openstack/nova 14.0.3 release.

Revision history for this message
Jacolex (jacolex) wrote :

Is it possible to fix this in Mitaka as well?

Revision history for this message
Eric Harney (eharney) wrote :

For Cinder, this same issue affects image->volume and needs the same fix.

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to cinder (master)

Fix proposed to branch: master
Review: https://review.openstack.org/443752

Changed in cinder:
assignee: nobody → Eric Harney (eharney)
status: New → In Progress
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to cinder (master)

Reviewed: https://review.openstack.org/443752
Committed: https://git.openstack.org/cgit/openstack/cinder/commit/?id=52310fa8645cc10b91de7d2b4e10a3b42d4ef073
Submitter: Jenkins
Branch: master

commit 52310fa8645cc10b91de7d2b4e10a3b42d4ef073
Author: Eric Harney <email address hidden>
Date: Thu Mar 9 11:25:53 2017 -0500

    Bump prlimit cpu time for qemu-img from 2 to 8

    Users have reported that the current CPU limit is not
    sufficient for processing large enough images when
    downloading images to volumes.

    This mirrors a similar increase made in Nova (b78b1f8ce).

    Closes-Bug: #1646181

    Change-Id: I5edea7d1d19fd991e51dca963d2beb7004177498

Changed in cinder:
status: In Progress → Fix Released
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to cinder (stable/ocata)

Fix proposed to branch: stable/ocata
Review: https://review.openstack.org/444913

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to cinder (stable/newton)

Fix proposed to branch: stable/newton
Review: https://review.openstack.org/444996

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to cinder (stable/ocata)

Reviewed: https://review.openstack.org/444913
Committed: https://git.openstack.org/cgit/openstack/cinder/commit/?id=2a5e0086e130e5e7960f743b6db8344caf65b9e5
Submitter: Jenkins
Branch: stable/ocata

commit 2a5e0086e130e5e7960f743b6db8344caf65b9e5
Author: Eric Harney <email address hidden>
Date: Thu Mar 9 11:25:53 2017 -0500

    Bump prlimit cpu time for qemu-img from 2 to 8

    Users have reported that the current CPU limit is not
    sufficient for processing large enough images when
    downloading images to volumes.

    This mirrors a similar increase made in Nova (b78b1f8ce).

    Closes-Bug: #1646181

    Change-Id: I5edea7d1d19fd991e51dca963d2beb7004177498
    (cherry picked from commit 52310fa8645cc10b91de7d2b4e10a3b42d4ef073)

tags: added: in-stable-ocata
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to cinder (stable/newton)

Reviewed: https://review.openstack.org/444996
Committed: https://git.openstack.org/cgit/openstack/cinder/commit/?id=46799c4ff57e818fad770c5266f553a30fab6de2
Submitter: Jenkins
Branch: stable/newton

commit 46799c4ff57e818fad770c5266f553a30fab6de2
Author: Eric Harney <email address hidden>
Date: Thu Mar 9 11:25:53 2017 -0500

    Bump prlimit cpu time for qemu-img from 2 to 8

    Users have reported that the current CPU limit is not
    sufficient for processing large enough images when
    downloading images to volumes.

    This mirrors a similar increase made in Nova (b78b1f8ce).

    Closes-Bug: #1646181

    Change-Id: I5edea7d1d19fd991e51dca963d2beb7004177498
    (cherry picked from commit 52310fa8645cc10b91de7d2b4e10a3b42d4ef073)
    (cherry picked from commit 2a5e0086e130e5e7960f743b6db8344caf65b9e5)

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/cinder 9.1.3

This issue was fixed in the openstack/cinder 9.1.3 release.

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/cinder 10.0.1

This issue was fixed in the openstack/cinder 10.0.1 release.

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/cinder 11.0.0.0b1

This issue was fixed in the openstack/cinder 11.0.0.0b1 development milestone.

Revision history for this message
Florian Haas (fghaas) wrote :

I am afraid this issue was prematurely marked as fixed. As far as I can see, it still exists in at least Ocata, just with a shifted threshold. I have a snapshot image that is 400GB in virtual size (124GB qcow2 file size) on NFS, and I am seeing the prlimit error even with --cpu=8.

http://paste.openstack.org/show/719357/

Between lines 10 and 11, it takes 25 minutes to transfer the image to the compute node. Once the transfer is complete, qemu-img info fails and we never get to spawn the instance.

Revision history for this message
Florian Haas (fghaas) wrote :

For anyone following this issue: during the Pike cycle, this limit was bumped to 30 CPU seconds via bug 1705340. The 8-second limit still exists in Ocata.

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix proposed to nova (stable/ocata)

Related fix proposed to branch: stable/ocata
Review: https://review.openstack.org/562145
