libvirt: guestfs api makes nova-compute hang

Bug #1453666 reported by Qin Zhao
8
This bug affects 1 person
Affects Status Importance Assigned to Milestone
OpenStack Compute (nova)
Fix Released
High
Qin Zhao
Kilo
Fix Released
Undecided
Unassigned

Bug Description

Latest Kilo code.

In inspect_capabilities() of nova/virt/disk/vfs/guestfs.py, guestfs api, which is C-extension, will hang nova-compute process when it is invoked. This problem will result in message queue time out error and instance booting failure.

And example of this problem is:

2015-05-09 17:07:08.393 4449 DEBUG nova.virt.disk.vfs.api [req-1f7c1104-2679-43a5-bbcb-f73114ce9103 - - - - -] Using primary VFSGuestFS instance_for_image /usr/lib/python2.7/site-packages/nova/virt/disk/vfs/api.py:50
2015-05-09 17:08:35.443 4449 DEBUG nova.virt.disk.vfs.guestfs [req-1f7c1104-2679-43a5-bbcb-f73114ce9103 - - - - -] Setting up appliance for /var/lib/nova/instances/0517e2a9-469c-43f4-a129-f489fc1c8356/disk qcow2 setup /usr/lib/python2.7/site-packages/nova/virt/disk/vfs/guestfs.py:169
2015-05-09 17:08:35.457 4449 DEBUG nova.openstack.common.periodic_task [req-bb78b74b-bed7-450f-bd40-19686aab2c3e - - - - -] Running periodic task ComputeManager._instance_usage_audit run_periodic_tasks /usr/lib/python2.7/site-packages/nova/openstack/common/periodic_task.py:219
2015-05-09 17:08:35.461 4449 INFO oslo_messaging._drivers.impl_rabbit [req-bb78b74b-bed7-450f-bd40-19686aab2c3e - - - - -] Connecting to AMQP server on 127.0.0.1:5671
2015-05-09 17:08:35.472 4449 ERROR nova.compute.manager [-] Instance failed network setup after 1 attempt(s)
2015-05-09 17:08:35.472 4449 TRACE nova.compute.manager Traceback (most recent call last):
2015-05-09 17:08:35.472 4449 TRACE nova.compute.manager File "/usr/lib/python2.7/site-packages/nova/compute/manager.py", line 1783, in _allocate_network_async
2015-05-09 17:08:35.472 4449 TRACE nova.compute.manager system_metadata=sys_meta)
2015-05-09 17:08:35.472 4449 TRACE nova.compute.manager File "/usr/lib/python2.7/site-packages/nova/compute/manager.py", line 739, in _instance_update
2015-05-09 17:08:35.472 4449 TRACE nova.compute.manager **kwargs)
2015-05-09 17:08:35.472 4449 TRACE nova.compute.manager File "/usr/lib/python2.7/site-packages/nova/conductor/api.py", line 308, in instance_update
2015-05-09 17:08:35.472 4449 TRACE nova.compute.manager updates, 'conductor')
2015-05-09 17:08:35.472 4449 TRACE nova.compute.manager File "/usr/lib/python2.7/site-packages/nova/conductor/rpcapi.py", line 194, in instance_update
2015-05-09 17:08:35.472 4449 TRACE nova.compute.manager service=service)
2015-05-09 17:08:35.472 4449 TRACE nova.compute.manager File "/usr/lib/python2.7/site-packages/oslo_messaging/rpc/client.py", line 156, in call
2015-05-09 17:08:35.472 4449 TRACE nova.compute.manager retry=self.retry)
2015-05-09 17:08:35.472 4449 TRACE nova.compute.manager File "/usr/lib/python2.7/site-packages/oslo_messaging/transport.py", line 90, in _send
2015-05-09 17:08:35.472 4449 TRACE nova.compute.manager timeout=timeout, retry=retry)
2015-05-09 17:08:35.472 4449 TRACE nova.compute.manager File "/usr/lib/python2.7/site-packages/oslo_messaging/_drivers/amqpdriver.py", line 350, in send
2015-05-09 17:08:35.472 4449 TRACE nova.compute.manager retry=retry)
2015-05-09 17:08:35.472 4449 TRACE nova.compute.manager File "/usr/lib/python2.7/site-packages/oslo_messaging/_drivers/amqpdriver.py", line 339, in _send
2015-05-09 17:08:35.472 4449 TRACE nova.compute.manager result = self._waiter.wait(msg_id, timeout)
2015-05-09 17:08:35.472 4449 TRACE nova.compute.manager File "/usr/lib/python2.7/site-packages/oslo_messaging/_drivers/amqpdriver.py", line 243, in wait
2015-05-09 17:08:35.472 4449 TRACE nova.compute.manager message = self.waiters.get(msg_id, timeout=timeout)
2015-05-09 17:08:35.472 4449 TRACE nova.compute.manager File "/usr/lib/python2.7/site-packages/oslo_messaging/_drivers/amqpdriver.py", line 149, in get
2015-05-09 17:08:35.472 4449 TRACE nova.compute.manager 'to message ID %s' % msg_id)
2015-05-09 17:08:35.472 4449 TRACE nova.compute.manager MessagingTimeout: Timed out waiting for a reply to message ID 8ff07520ea8743c997b5017f6638a0df
2015-05-09 17:08:35.472 4449 TRACE nova.compute.manager

Revision history for this message
Qin Zhao (zhaoqin) wrote :

From the first two lines of log, we can know that launch() of guestfs api in inspect_capabilities() takes nearly 90 seconds to complete. And during this time frame, nova-compute cannot switch to other green threads. Since rpc_response_timeout=60, rabbitmq client immediately encounters a timeout error, when it resumes to run.

2015-05-09 17:07:08.393 4449 DEBUG nova.virt.disk.vfs.api [req-1f7c1104-2679-43a5-bbcb-f73114ce9103 - - - - -] Using primary VFSGuestFS instance_for_image /usr/lib/python2.7/site-packages/nova/virt/disk/vfs/api.py:50
2015-05-09 17:08:35.443 4449 DEBUG nova.virt.disk.vfs.guestfs [req-1f7c1104-2679-43a5-bbcb-f73114ce9103 - - - - -] Setting up appliance for /var/lib/nova/instances/0517e2a9-469c-43f4-a129-f489fc1c8356/disk qcow2 setup /usr/lib/python2.7/site-packages/nova/virt/disk/vfs/guestfs.py:169

During my investigation, launch() of guestfs api always takes more than 10 seconds to complete, and will take very long time, if server is busy or its cache file has not been created. Anyway, it does not make sense to invoke a long blocking call in a green thread.

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to nova (master)

Fix proposed to branch: master
Review: https://review.openstack.org/181808

Changed in nova:
assignee: nobody → Qin Zhao (zhaoqin)
status: New → In Progress
Matt Riedemann (mriedem)
tags: added: kilo-backport-potential libvirt
Matt Riedemann (mriedem)
Changed in nova:
importance: Undecided → High
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to nova (master)

Reviewed: https://review.openstack.org/181808
Committed: https://git.openstack.org/cgit/openstack/nova/commit/?id=da9671aa7f597d3dd2cd6d09bb06998b209c962a
Submitter: Jenkins
Branch: master

commit da9671aa7f597d3dd2cd6d09bb06998b209c962a
Author: Qin Zhao <email address hidden>
Date: Mon May 11 15:01:38 2015 +0800

    Libvirt: Use tpool to invoke guestfs api

    In inspect_capabilities() of nova/virt/disk/vfs/guestfs.py, guestfs api, which
    is C-extension, will hang nova-compute process when it is invoked. This problem
    will result in message queue timeout error and instance booting failure. Need
    to use tpool to invoke guestfs api, in order to make nova-compute switch to
    other green threads before guestfs api returns.

    Change-Id: I24187ba01965fb886351b78eee605bc052a94e21
    Closes-Bug: 1453666

Changed in nova:
status: In Progress → Fix Committed
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to nova (stable/kilo)

Fix proposed to branch: stable/kilo
Review: https://review.openstack.org/182044

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to nova (stable/kilo)

Reviewed: https://review.openstack.org/182044
Committed: https://git.openstack.org/cgit/openstack/nova/commit/?id=8e9ecf74c918273033d86342ec2aba419e86f998
Submitter: Jenkins
Branch: stable/kilo

commit 8e9ecf74c918273033d86342ec2aba419e86f998
Author: Qin Zhao <email address hidden>
Date: Mon May 11 15:01:38 2015 +0800

    Libvirt: Use tpool to invoke guestfs api

    In inspect_capabilities() of nova/virt/disk/vfs/guestfs.py, guestfs api, which
    is C-extension, will hang nova-compute process when it is invoked. This problem
    will result in message queue timeout error and instance booting failure. Need
    to use tpool to invoke guestfs api, in order to make nova-compute switch to
    other green threads before guestfs api returns.

    Change-Id: I24187ba01965fb886351b78eee605bc052a94e21
    Closes-Bug: 1453666
    (cherry picked from commit da9671aa7f597d3dd2cd6d09bb06998b209c962a)

tags: added: in-stable-kilo
Thierry Carrez (ttx)
Changed in nova:
milestone: none → liberty-1
status: Fix Committed → Fix Released
Alan Pevec (apevec)
no longer affects: nova/kilo
tags: removed: kilo-backport-potential
Thierry Carrez (ttx)
Changed in nova:
milestone: liberty-1 → 12.0.0
To post a comment you must log in.
This report contains Public information  
Everyone can see this information.

Other bug subscribers

Remote bug watches

Bug watches keep track of this bug in other bug trackers.