[9.x][BVT] Launching of an instance fails on execution of qemu-img info

Bug #1643609 reported by Roman Podoliaka
16
This bug affects 3 people
Affects Status Importance Assigned to Milestone
Mirantis OpenStack
Status tracked in 10.0.x
10.0.x
Fix Committed
Medium
MOS Nova

Bug Description

A few recent builds of 9.0-community.main.ubuntu.bvt_2 (MOS 9.2 as of now)

https://ci.fuel-infra.org/view/BVT/job/9.0-community.main.ubuntu.bvt_2/606/
https://ci.fuel-infra.org/view/BVT/job/9.0-community.main.ubuntu.bvt_2/614/

failed on launching of an instance with the following error:

2016-11-19 04:34:55.805 23824 DEBUG oslo_concurrency.processutils [req-b823b3fc-6435-40f7-87af-2ecfbadc78af d071d93fd9524714b1b1bbb49ed765e9 69393ef97a314fb1a2c41b98283f063c - - -] CMD "/usr/bin/python -m oslo_concurrency.prlimit --as=1073741824 --cpu=2 -- env LC_ALL=C LANG=C qemu-img info /var/lib/nova/instances/_base/3d5a408550fdf3a23c50da67bb61a63e723e7f2d.part" returned: -9 in 5.177s execute /usr/lib/python2.7/dist-packages/oslo_concurrency/processutils.py:374
2016-11-19 04:34:55.806 23824 DEBUG oslo_concurrency.processutils [req-b823b3fc-6435-40f7-87af-2ecfbadc78af d071d93fd9524714b1b1bbb49ed765e9 69393ef97a314fb1a2c41b98283f063c - - -] u'/usr/bin/python -m oslo_concurrency.prlimit --as=1073741824 --cpu=2 -- env LC_ALL=C LANG=C qemu-img info /var/lib/nova/instances/_base/3d5a408550fdf3a23c50da67bb61a63e723e7f2d.part' failed. Not Retrying. execute /usr/lib/python2.7/dist-packages/oslo_concurrency/processutils.py:422
2016-11-19 04:34:56.866 23824 DEBUG nova.compute.utils [req-b823b3fc-6435-40f7-87af-2ecfbadc78af d071d93fd9524714b1b1bbb49ed765e9 69393ef97a314fb1a2c41b98283f063c - - -] [instance: 83ae7e3b-377b-4091-8a33-74487bf21a9b] Disk info file is invalid: qemu-img failed to execute on /var/lib/nova/instances/_base/3d5a408550fdf3a23c50da67bb61a63e723e7f2d.part : Unexpected error while running command.
2016-11-19 04:34:56.867 23824 ERROR nova.compute.manager [req-b823b3fc-6435-40f7-87af-2ecfbadc78af d071d93fd9524714b1b1bbb49ed765e9 69393ef97a314fb1a2c41b98283f063c - - -] [instance: 83ae7e3b-377b-4091-8a33-74487bf21a9b] Build of instance 83ae7e3b-377b-4091-8a33-74487bf21a9b aborted: Disk info file is invalid: qemu-img failed to execute on /var/lib/nova/instances/_base/3d5a408550fdf3a23c50da67bb61a63e723e7f2d.part : Unexpected error while running command.

Note, that it took `qemu-img info` 5.177s to return -9. We intentionally run this command with resource constraints set (see https://review.openstack.org/#/c/327624/ for details):

/usr/bin/python -m oslo_concurrency.prlimit --as=1073741824 --cpu=2
(max 1G of memory and 2 seconds of CPU time)

to prevent possible DoS attacks with malicious images, but in this case the image is correct - VMs are booted successfully in other OSTF test cases.

My current understanding is that sometimes a CI node a test environment is run on can be "slow" (disk? cpu?) and cause such errors.

Tags: area-nova
Revision history for this message
Roman Podoliaka (rpodolyaka) wrote :
description: updated
description: updated
description: updated
Revision history for this message
Roman Podoliaka (rpodolyaka) wrote :
description: updated
Changed in mos:
status: New → Confirmed
importance: Undecided → Medium
assignee: nobody → MOS Nova (mos-nova)
milestone: none → 9.2
tags: added: area-nova
Revision history for this message
Elena Ezhova (eezhova) wrote :

One more recent failure.

Revision history for this message
Elena Ezhova (eezhova) wrote :

Also, this upstream bug [0] can be related.

[0] https://bugs.launchpad.net/nova/+bug/1646181

Revision history for this message
Elena Ezhova (eezhova) wrote :

Example of trace from the most recent failure:
http://paste.openstack.org/show/591653/

Revision history for this message
Roman Podoliaka (rpodolyaka) wrote :
Revision history for this message
Fuel Devops McRobotson (fuel-devops-robot) wrote : Fix merged to openstack/nova (9.0/mitaka)

Reviewed: https://review.fuel-infra.org/29313
Submitter: Pkgs Jenkins <email address hidden>
Branch: 9.0/mitaka

Commit: ed209ae75f0aa9ab964ed0f3421f9f89a9f09ae9
Author: Sean Dague <email address hidden>
Date: Mon Dec 12 15:36:10 2016

Bump prlimit cpu time for qemu from 2 to 8

We've got user reported bugs that when opperating with slow NFS
backends with large (30+ GB) disk files, the prlimit of cpu_time 2 is
guessed to be the issue at hand because if folks hot patch a qemu-img
that runs before the prlimitted one, the prlimitted one succeeds.

This increases the allowed cpu timeout, as well as tweaking the error
message so that we return something more prescriptive when the
qemu-img command fails with prlimit abort.

The original bug (#1449062) the main mitigation concern here was a
carefully crafted image that gets qemu-img to generate > 1G of json,
and hence could be a node attack vector. cpu_time was never mentioned,
and I think was added originally as a belt and suspenders addition. As
such, bumping it to 8 seconds shouldn't impact our protection in any
real way.

Change-Id: I1f4549b787fd3b458e2c48a90bf80025987f08c4
Closes-Bug: #1643609
(cherry picked from commit b78b1f8ce3aa407307a6adc5c60de1e960547897)

Changed in mos:
status: Confirmed → Fix Committed
tags: added: on-verification
Revision history for this message
TatyanaGladysheva (tgladysheva) wrote :

Verified on 9.2 snapshot #656.

Described issue is not reproduced in BVT runs #696-#718:
https://ci.fuel-infra.org/view/BVT/job/9.0-community.main.ubuntu.bvt_2/718/

tags: removed: on-verification
Changed in mos:
status: Fix Committed → Fix Released
Revision history for this message
Roman Podoliaka (rpodolyaka) wrote :
To post a comment you must log in.
This report contains Public information  
Everyone can see this information.

Other bug subscribers

Remote bug watches

Bug watches keep track of this bug in other bug trackers.