Nova can hang when creating a VM with disk injection

Bug #1735823 reported by David Moreau Simard
8
This bug affects 1 person
Affects Status Importance Assigned to Milestone
OpenStack Compute (nova)
In Progress
Medium
Matt Riedemann
Ocata
New
Undecided
Unassigned
Pike
New
Undecided
Unassigned

Bug Description

We are noticing recurring failures in the gate under CentOS and OpenSUSE across Devstack and Packstack jobs on the master branches.
This has been happening only on the OVH cloud regions as far as we know.

Example failures:
- CentOS Devstack: http://logs.openstack.org/46/523646/1/check/legacy-tempest-dsvm-neutron-full-centos-7/5bf092c/job-output.txt#_2017-11-29_03_02_38_031560
- OpenSUSE Devstack: http://logs.openstack.org/23/522423/7/check/legacy-tempest-dsvm-neutron-full-opensuse-423/b6768d7/job-output.txt#_2017-11-27_21_53_13_340319
- CentOS Packstack: http://logs.openstack.org/14/516714/1/check/packstack-integration-scenario002-tempest/7ba8d06/job-output.txt.gz#_2017-10-31_17_55_39_845816

They all fail with the same stack trace:
=====
setUpClass (tempest.api.compute.servers.test_create_server.ServersTestJSON)
---------------------------------------------------------------------------

Captured traceback:
~~~~~~~~~~~~~~~~~~~
    Traceback (most recent call last):
      File "tempest/test.py", line 172, in setUpClass
        six.reraise(etype, value, trace)
      File "tempest/test.py", line 165, in setUpClass
        cls.resource_setup()
      File "tempest/api/compute/servers/test_create_server.py", line 64, in resource_setup
        volume_backed=cls.volume_backed)
      File "tempest/api/compute/base.py", line 190, in create_test_server
        **kwargs)
      File "tempest/common/compute.py", line 258, in create_test_server
        server['id'])
      File "/opt/stack/new/tempest/.tox/tempest/lib/python2.7/site-packages/oslo_utils/excutils.py", line 220, in __exit__
        self.force_reraise()
      File "/opt/stack/new/tempest/.tox/tempest/lib/python2.7/site-packages/oslo_utils/excutils.py", line 196, in force_reraise
        six.reraise(self.type_, self.value, self.tb)
      File "tempest/common/compute.py", line 229, in create_test_server
        clients.servers_client, server['id'], wait_until)
      File "tempest/common/waiters.py", line 96, in wait_for_server_status
        raise lib_exc.TimeoutException(message)
    tempest.lib.exceptions.TimeoutException: Request timed out
    Details: (ServersTestJSON:setUpClass) Server 2f8de011-b218-4e73-b9e3-e7fcf9e9278b failed to reach ACTIVE status and task state "None" within the required time (196 s). Current status: BUILD. Current task state: spawning.
=====

Revision history for this message
David Moreau Simard (dmsimard) wrote : Re: Nova can hang when creating a VM with file injection
summary: - Tempest makes Nova hang when creating a VM with file injection
+ Nova can hang when creating a VM with file injection
description: updated
Revision history for this message
Matt Riedemann (mriedem) wrote :

Tracing the logs on one of these failures, we get here:

http://logs.openstack.org/46/523646/1/check/legacy-tempest-dsvm-neutron-full-centos-7/5bf092c/logs/screen-n-cpu.txt.gz#_Nov_29_02_16_33_785830

https://github.com/openstack/nova/blob/b6a245f0425a07be3871a976952646d2bdd44533/nova/virt/disk/vfs/guestfs.py#L75-L80

And then the logging context switches because all of a sudden we start getting logs for a different instance id with the same request_id:

Nov 29 02:16:33.785677 centos-7-ovh-bhs1-0001106830 nova-compute[7229]: DEBUG nova.virt.disk.vfs.api [None req-f5b10bda-e8cf-44d6-b3c7-9c6fe8da085d service nova] Instance for image image=<LocalFileImage:{'path': '/opt/stack/data/nova/instances/2f8de011-b218-4e73-b9e3-e7fcf9e9278b/disk', 'format': 'qcow2'}> partition=-1 {{(pid=7229) instance_for_image /opt/stack/new/nova/nova/virt/disk/vfs/api.py:50}}
Nov 29 02:16:33.785830 centos-7-ovh-bhs1-0001106830 nova-compute[7229]: DEBUG nova.virt.disk.vfs.api [None req-f5b10bda-e8cf-44d6-b3c7-9c6fe8da085d service nova] Using primary VFSGuestFS {{(pid=7229) instance_for_image /opt/stack/new/nova/nova/virt/disk/vfs/api.py:54}}
Nov 29 02:16:33.826277 centos-7-ovh-bhs1-0001106830 nova-compute[7229]: DEBUG nova.virt.libvirt.driver [None req-f5b10bda-e8cf-44d6-b3c7-9c6fe8da085d service nova] [instance: 57c7ef47-a65d-4041-9e6f-d8b056b0c4ad] End _get_guest_xml xml=<domain type="qemu">

So I think we're hitting something with greenthread switching here or something with the thread pool being used when setting up guestfs.

summary: - Nova can hang when creating a VM with file injection
+ Nova can hang when creating a VM with disk injection
Revision history for this message
Matt Riedemann (mriedem) wrote :

(2:07:07 PM) dmsimard: In EL7 the version of guestfs is libguestfs-1.36.3-6.el7_4.3.x86_64 .. in Ubuntu... I'm not seeing it installed?
(2:08:08 PM) mordred: dmsimard: it's possible that feature isn't enabled in the ubuntu jobs? (which would also explain why we only see the problem on non-ubuntu)
(2:08:36 PM) mriedem: no we do file injection on ubuntu jobs too
(2:08:37 PM) dmsimard: mordred: devstack jobs on centos and ubuntu would run a different set ? /me looks
(2:08:40 PM) mriedem: http://logs.openstack.org/10/523910/1/check/legacy-tempest-dsvm-neutron-full/8cd6149/logs/etc/nova/nova.conf.txt.gz
(2:08:47 PM) mriedem: inject_partition = -1
(2:09:03 PM) mriedem: http://logs.openstack.org/10/523910/1/check/legacy-tempest-dsvm-neutron-full/8cd6149/logs/dpkg-l.txt.gz
(2:09:08 PM) mriedem: ii libguestfs0:amd64 1:1.32.2-4ubuntu2 amd64 guest disk image management system - shared library
(2:09:14 PM) mordred: ah. k
(2:09:34 PM) mordred: darn. I was hoping it was more ammunition against the feature. /me goes back to other things
(2:11:53 PM) openstackgerrit: Jeremy Stanley proposed openstack-infra/puppet-mailman master: Enable modules before starting Apache https://review.openstack.org/524718
(2:11:55 PM) fungi: mordred: ^
(2:11:57 PM) dmsimard: mriedem: so there's a delta between 1.32.2 in Ubuntu and 1.36.3 in CentOS... and opensuse has 1.32.4

tags: added: guestfs injection
Revision history for this message
Matt Riedemann (mriedem) wrote :

This is the change that originally made us run guestfs in a thread pool to inspect it's capabilities:

https://review.openstack.org/#/c/181808/

Revision history for this message
Matt Riedemann (mriedem) wrote :

This is a devstack change which will enable guestfs debugging so we can hopefully get some more information when we see this type of failure:

https://review.openstack.org/#/c/524710/

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix proposed to nova (master)

Related fix proposed to branch: master
Review: https://review.openstack.org/524727

Revision history for this message
Matt Riedemann (mriedem) wrote :
Revision history for this message
David Moreau Simard (dmsimard) wrote :
Matt Riedemann (mriedem)
Changed in nova:
status: New → Confirmed
Revision history for this message
Matt Riedemann (mriedem) wrote :

Well we know we aren't setting the force_tcg settings properly:

http://logs.openstack.org/10/524710/1/check/legacy-tempest-dsvm-neutron-full/a1867f6/logs/screen-n-cpu.txt.gz#_Dec_01_20_11_15_758749

That's sending in "force_tcg" but it turns that into a list("force_tcg") which is wrong.

Revision history for this message
Matt Riedemann (mriedem) wrote :

Similar issue in the libguestfs mailing list:

https://www.redhat.com/archives/libguestfs/2016-August/msg00218.html

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to nova (master)

Fix proposed to branch: master
Review: https://review.openstack.org/524748

Changed in nova:
assignee: nobody → Matt Riedemann (mriedem)
status: Confirmed → In Progress
Matt Riedemann (mriedem)
Changed in nova:
importance: Undecided → Medium
Revision history for this message
Matt Riedemann (mriedem) wrote :

Bug 1737207 is related and I have a separate patch for that.

Revision history for this message
Matt Riedemann (mriedem) wrote :

Oops, comment #12 meant bug 1737214.

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix merged to nova (master)

Reviewed: https://review.openstack.org/524727
Committed: https://git.openstack.org/cgit/openstack/nova/commit/?id=ec5c27a8afec53f167d30d8f81e23c1b42465d54
Submitter: Zuul
Branch: master

commit ec5c27a8afec53f167d30d8f81e23c1b42465d54
Author: Matt Riedemann <email address hidden>
Date: Fri Dec 1 15:56:34 2017 -0500

    Check the return code when forcing TCG mode with libguestfs

    According to the guestfs docs:

    http://libguestfs.org/guestfs.3.html#guestfs_set_backend_settings

    This method returns a 0 on success and -1 on failure. We suspect
    we might not be calling this correctly, so log a warning if it
    returns a non-zero return code.

    Change-Id: I14a5969667a6f6f4ae529aaf0f6b9475fdf25496
    Related-Bug: #1735823

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to nova (master)

Reviewed: https://review.opendev.org/524748
Committed: https://git.openstack.org/cgit/openstack/nova/commit/?id=2a8c0c800ba2107a94597761677007ed660a40a1
Submitter: Zuul
Branch: master

commit 2a8c0c800ba2107a94597761677007ed660a40a1
Author: Matt Riedemann <email address hidden>
Date: Fri Dec 1 17:41:11 2017 -0500

    Fix guestfs.set_backend_settings call

    This method takes a list of strings. It was never failing
    because python-libguestfs takes the input and casts it to
    a list, which turns "force_tcg" into:

      ['f', 'o', 'r', 'c', 'e', '_', 't', 'c', 'g']

    Which is obviously not correct.

    http://libguestfs.org/guestfs.3.html#guestfs_set_backend_settings

    This fixes that issue and leaves a TODO about investigating
    the usage of set_backend_setting for just setting the
    single force_tcg value instead of overrwriting all backend
    settings.

    Credit to Clark Boylan for identifying the issue.

    Change-Id: I36edfaf30fd0cc234f929fd6937d07b6a1b646ae
    Partial-Bug: #1735823

To post a comment you must log in.
This report contains Public information  
Everyone can see this information.

Other bug subscribers

Remote bug watches

Bug watches keep track of this bug in other bug trackers.