n-cpu seems to crash when running with libvirt 1.1.1 from ubuntu cloud archive

Bug #1228977 reported by Sean Dague on 2013-09-23
26
This bug affects 3 people
Affects Status Importance Assigned to Milestone
OpenStack Compute (nova)
High
Unassigned
Ubuntu Cloud Archive
Undecided
Unassigned
libvirt (Ubuntu)
High
Unassigned
Saucy
High
Unassigned

Bug Description

impact
------

any concurrent use of libvirt may lockup libirt

test case
---------

use libvirt concurrently, specifically in nwfilter + createDomain calls. e.g. run devstack-gate against this

regression potential
--------------------

upstream stable branch update - should be low

We experienced a series of jenkins rejects starting overnight on Saturday whose root cause is still not quite tracked down yet. However, they all have a couple of things in common:

1) they are the first attempt to use libvir 1.0.6 from havana cloud archive for ubuntu precise
2) the fails are all related to guests not spawning correctly
3) the n-cpu log just stops about 1/2 way through the tempest log, making my suspect that we did something to either lockup or hard crash n-cpu

After that change went in no devstack/tempest gating project managed to merge a change.

This needs more investigation, but creating this bug to both reverify against, as well as track down this issue.

Sean Dague (sdague) on 2013-09-23
Changed in nova:
importance: Undecided → High
Sean Dague (sdague) wrote :

On further investigation this isn't actually an n-cpu crash, it's actually a hard lockup of the nova-compute process. At some point it just stops responding, and the process running under devstack is not even ^C killable any more after this. So there is some deadlock introduced by the new packages.

Sean Dague (sdague) wrote :

To replicate run an environment with 4 way tempest under testr, it exposes it 100% consistently from my experience.

Jian Wen (wenjianhn) on 2013-09-25
Changed in nova:
status: New → Confirmed
Jian Wen (wenjianhn) wrote :

nova-compute is sleeping here forever.
somebody should wake it up.

jian@ubuntu-dev:~/devstack$ ps aux | grep compute
root 5365 0.0 0.0 16596 1008 pts/8 S+ Sep24 0:00 sg libvirtd /usr/local/bin/nova-compute --config-file /etc/nova/nova.conf
jian 5367 1.2 1.5 1749052 62356 pts/8 Sl+ Sep24 0:49 /usr/bin/python /usr/local/bin/nova-compute --config-file /etc/nova/nova.conf
jian 5494 0.0 0.0 8104 920 pts/1 S+ 00:41 0:00 grep --color=auto compute
jian@ubuntu-dev:~/devstack$ sudo strace -p 5367
Process 5367 attached - interrupt to quit
futex(0x43d934c, FUTEX_WAIT_PRIVATE, 1, NULL

Jian Wen (wenjianhn) wrote :

Restarting nova-compute, it is blocked at 'self._conn.numOfDomains() == 0:'

(Pdb) bt
  /usr/lib/python2.7/dist-packages/eventlet/greenthread.py(194)main()
-> result = function(*args, **kwargs)
  /opt/stack/nova/nova/openstack/common/service.py(65)run_service()
-> service.start()
  /opt/stack/nova/nova/service.py(154)start()
-> self.manager.init_host()
  /opt/stack/nova/nova/compute/manager.py(785)init_host()
-> self._destroy_evacuated_instances(context)
  /opt/stack/nova/nova/compute/manager.py(541)_destroy_evacuated_instances()
-> local_instances = self._get_instances_on_driver(context, filters)
  /opt/stack/nova/nova/compute/manager.py(509)_get_instances_on_driver()
-> driver_uuids = self.driver.list_instance_uuids()
  /opt/stack/nova/nova/virt/libvirt/driver.py(725)list_instance_uuids()
-> for domain_id in self.list_instance_ids():
> /opt/stack/nova/nova/virt/libvirt/driver.py(700)list_instance_ids()
-> def list_instance_ids(self):
(Pdb) n
> /opt/stack/nova/nova/virt/libvirt/driver.py(701)list_instance_ids()
-> if self._conn.numOfDomains() == 0:
(Pdb) n

Jian Wen (wenjianhn) wrote :

libvirt-bin 1.0.6-0ubuntu4~cloud0

The following script will never end.

#!/usr/bin/env python

import libvirt
import sys

conn = libvirt.openReadOnly(None)
if conn == None:
    print 'Failed to open connection to the hypervisor'
    sys.exit(1)

num = conn.numOfDomains() # will never return

Changed in cloud-archive:
status: New → Confirmed
Jian Wen (wenjianhn) wrote :

Steps to reproduce:

On a Ubuntu 12.04 server

# use cloud archive havana
sudo apt-get install ubuntu-cloud-keyring
sudo su -c "echo deb http://ubuntu-cloud.archive.canonical.com/ubuntu precise-updates/havana main > /etc/apt/sources.list.d/OS_H_CA.list"
sudo apt-get update

sudo apt-get install git
git clone https://github.com/openstack-dev/devstack.git
cd devstack
./stack.sh # Use default settings by pressing Enter multiple times

sudo apt-get install testrepository

./rejoin-stack.sh

# create multiple screen windows to execute Tempest tests
cd /opt/stack/tempest
testr init
testr run --parallel tempest

# switch to the n-cpu window
# wait unitil it stops logging

Launchpad Janitor (janitor) wrote :

Status changed to 'Confirmed' because the bug affects multiple users.

Changed in libvirt (Ubuntu):
status: New → Confirmed
Serge Hallyn (serge-hallyn) wrote :

I can't reproduce this by only installing qemu-kvm and libvirt-bin on precise with the cloud-archive. So it appears to be something which is exacerbated by configuration changes done by devstack.

Can you reproduce this on saucy (which has libvirt 1.1.1)?

Jian Wen (wenjianhn) wrote :

Today, I created a new Ubuntu 12.04 VM with larger disk(20G).
After tempest created and deleted 641 instances, it failed to reproduce the bug.

Created about 440 instances with nova commit 2a3b923 which was merged today or yesterday .
Created about 200 instances with nova commit ed8b470 which was merged last Sat.

I will try to reproduce the bug with a smaller disk.

Jian Wen (wenjianhn) wrote :

I managed to reproduce it by using a smaller disk for my devstack VM.

The backtrace of libvirtd is attached.

Looks like it's related to https://bugzilla.redhat.com/show_bug.cgi?format=multiple&id=929412

I guess I can reproduce the bug by deleting and creating instances with
NWFilter configured forever in 4 processes. I will try when I back home.

James Page (james-page) wrote :

Please can this be retested using the 1.1.1 version we now have in cloud-archive for Havana.

Thanks.

Clark Boylan (cboylan) wrote :

Daniel Berrange is proposing that nova stop supporting libvirt older than 0.9.11 as libvirt should be able to provide pip installable python bindings for libvirt >= 0.9.11. The vast majority of our test slaves run on precise which has libvirt 0.9.8 in the base install. The only sane way for us to support libvirt >= 0.9.11 is to use the cloud archive. Can someone test this with cloud archive's libvirt 1.1.1 so that we can move to it if the problem is solved?

ML thread is http://lists.openstack.org/pipermail/openstack-operators/2013-November/003748.html and there is a sister thread at http://lists.openstack.org/pipermail/openstack-dev/2013-November/020172.html.

Jeremy Stanley (fungi) wrote :

Worth noting, when running full tempest on an 8GB DevStack VM in both Rackspace and HPCloud with Ubuntu Cloud Archive added to the sources list, I get numerous job failures with corresponding repetitions of this in the console:

  Traceback (most recent call last):
    File "tempest/scenario/test_server_basic_ops.py", line 107, in test_server_basicops
      self.boot_instance()
    File "tempest/scenario/test_server_basic_ops.py", line 59, in boot_instance
      instance = self.create_server(create_kwargs=create_kwargs)
    File "tempest/scenario/manager.py", line 412, in create_server
      self.status_timeout(client.servers, server.id, 'ACTIVE')
    File "tempest/scenario/manager.py", line 304, in status_timeout
      not_found_exception=not_found_exception)
    File "tempest/scenario/manager.py", line 358, in _status_timeout
      self.config.compute.build_interval):
    File "tempest/test.py", line 290, in call_until_true
      if func():
    File "tempest/scenario/manager.py", line 349, in check_status
      raise exceptions.BuildErrorException(message)
  BuildErrorException: Server %(server_id)s failed to build and is in ERROR status
  Details: <Server: scenario-server--tempest-1807499527> failed to get to expected status. In ERROR state.

Removing UCA from the sources list, all tests complete as expected.

Chuck Short (zulcss) wrote :

Anything in the nova-compute logs when this happens?

Jeremy Stanley (fungi) wrote :

Probably--I wasn't collecting logs, just confirming whether or not gating is going to break if we reenable UCA. If nobody with nova debugging experience or interest in using UCA/newer libvirt has time to repeat that experiment, I can recreate it and find/attach the service logs some time in the next few days.

Jeremy Stanley (fungi) wrote :

I got a few minutes to recreate the failing test run and tarred up the screen-n-*.log files (attached) once it finished.

Jeremy Stanley (fungi) wrote :

Adding the libvirt log as Chuck requested in IRC just now.

Julien Danjou (jdanjou) wrote :

Any progress on this? This is blocking Ceilometer testing.

Jeremy Stanley (fungi) wrote :

Note that we inadvertently "tested" libvirt from UCA on all our CI infrastructure today and ran into bug 1266711 (probably related).

Serge Hallyn (serge-hallyn) wrote :

I don't think I've managed to reproduce this. I do get errors, but they seem to be VMS failing to start which is more likely due to other reasons. I tried to reproduce with the instructions in comment #7, and expected to get libvirt hung to the point that the script in comment #6 would hang. That never happened. If my understanding about how it should fail was wrong then please let me know.

(Was trying on a ec2 c3.large precise instance)

I notice that https://bugzilla.redhat.com/show_bug.cgi?format=multiple&id=929412 (from comment #11) is not yet resolved.

ubuntu cloud archive now has libvirt at version 1.1.1 (specifically 1.1.1-0ubuntu8~cloud2). Looking at:
http://ubuntu-cloud.archive.canonical.com/ubuntu/dists/precise-updates/havana/main/binary-i386/Packages

summary: - n-cpu seems to crash when running with libvirt 1.0.6 from ubuntu cloud
+ n-cpu seems to crash when running with libvirt 1.1.1 from ubuntu cloud
archive
Download full text (3.5 KiB)

https://review.openstack.org/#/c/67564/ has logs with libvirt 1.1.1 from UCA

Here is the specific run

http://logs.openstack.org/64/67564/4/check/check-tempest-dsvm-full/b076c4f/logs/

Here is the hang in screen-n-cpu.log

2014-01-21 04:15:07.082+0000: 27080: debug : virDomainFree:2406 : dom=0x7f31a4001760, (VM: name=instance-00000049, uuid=1a6453d4-edfc-4884-8d8d-c66d6f36a004)
2014-01-21 04:15:07.082+0000: 27080: debug : virObjectUnref:256 : OBJECT_UNREF: obj=0x7f31a4001760
2014-01-21 04:15:07.082+0000: 27080: debug : virObjectUnref:258 : OBJECT_DISPOSE: obj=0x7f31a4001760
2014-01-21 04:15:07.082+0000: 27080: debug : virDomainDispose:262 : release domain 0x7f31a4001760 instance-00000049 1a6453d4-edfc-4884-8d8d-c66d6f36a004
2014-01-21 04:15:07.082+0000: 27080: debug : virObjectUnref:256 : OBJECT_UNREF: obj=0x7f31f4001f00
2014-01-21 04:15:07.102+0000: 27080: debug : virDomainFree:2406 : dom=0x7f31a80015f0, (VM: name=instance-00000050, uuid=2c093d29-ad6e-4486-a470-a2fa09a94388)
2014-01-21 04:15:07.102+0000: 27080: debug : virObjectUnref:256 : OBJECT_UNREF: obj=0x7f31a80015f0
2014-01-21 04:15:07.102+0000: 27080: debug : virObjectUnref:258 : OBJECT_DISPOSE: obj=0x7f31a80015f0
2014-01-21 04:15:07.102+0000: 27080: debug : virDomainDispose:262 : release domain 0x7f31a80015f0 instance-00000050 2c093d29-ad6e-4486-a470-a2fa09a94388
2014-01-21 04:15:07.102+0000: 27080: debug : virObjectUnref:256 : OBJECT_UNREF: obj=0x7f31f4001f00
2014-01-21 04:15:07.112+0000: 27080: debug : virDomainFree:2406 : dom=0x7f31bc001850, (VM: name=instance-00000053, uuid=9199131f-7ae8-4e9e-9065-0b47787180c1)
2014-01-21 04:15:07.112+0000: 27080: debug : virObjectUnref:256 : OBJECT_UNREF: obj=0x7f31bc001850
2014-01-21 04:15:07.112+0000: 27080: debug : virObjectUnref:258 : OBJECT_DISPOSE: obj=0x7f31bc001850
2014-01-21 04:15:07.112+0000: 27080: debug : virDomainDispose:262 : release domain 0x7f31bc001850 instance-00000053 9199131f-7ae8-4e9e-9065-0b47787180c1
2014-01-21 04:15:07.112+0000: 27080: debug : virObjectUnref:256 : OBJECT_UNREF: obj=0x7f31f4001f00
2014-01-21 04:15:07.112+0000: 27080: debug : virDomainFree:2406 : dom=0x7f31ec001840, (VM: name=instance-00000053, uuid=9199131f-7ae8-4e9e-9065-0b47787180c1)
2014-01-21 04:15:07.112+0000: 27080: debug : virObjectUnref:256 : OBJECT_UNREF: obj=0x7f31ec001840
2014-01-21 04:15:07.112+0000: 27080: debug : virObjectUnref:258 : OBJECT_DISPOSE: obj=0x7f31ec001840
2014-01-21 04:15:07.112+0000: 27080: debug : virDomainDispose:262 : release domain 0x7f31ec001840 instance-00000053 9199131f-7ae8-4e9e-9065-0b47787180c1
2014-01-21 04:15:07.112+0000: 27080: debug : virObjectUnref:256 : OBJECT_UNREF: obj=0x7f31f4001f00
2014-01-21 04:15:07.566+0000: 27080: debug : virNWFilterUndefine:18265 : nwfilter=0x7f31e0001650
2014-01-21 04:15:07.566+0000: 27080: debug : virNetMessageNew:44 : msg=0x40150c0 tracked=0
2014-01-21 04:15:07.566+0000: 27080: debug : virNetMessageEncodePayload:373 : Encode length as 92
2014-01-21 04:15:07.566+0000: 27080: debug : virNetClientSendInternal:1952 : RPC_CLIENT_MSG_TX_QUEUE: client=0x7f31f40049d0 len=92 prog=536903814 vers=1 proc=181 type=0 status=0 serial=5238
2014-01-21 04:15:07.566+0000: 27080: debug : virNetClientCa...

Read more...

Daniel Berrange (berrange) wrote :

2014-01-21 04:15:07.566+0000: 27080: debug : virNetClientIO:1714 : Outgoing message prog=536903814 version=1 serial=5238 proc=181 type=0 length=92 dispatch=0x7f31d4001810
2014-01-21 04:15:07.566+0000: 27080: debug : virNetClientIO:1740 : Going to sleep head=0x7f31d4001810 call=0x3ec8050

The interesting message is the first line here - proc=181 corresponds to

    REMOTE_PROC_NWFILTER_UNDEFINE = 181,

So this client log shows that Nova issued a virNWFilterUndefine API call, and libvirtd never responded with any completion. Nova hasn't hung, it is just waiting for libvirtd to finish what its doing. The question is what's libvirtd doing.....

Daniel Berrange (berrange) wrote :

Unfortunately we can't see what libvirtd is doing because the job lacks the libvirtd logs - we need these two changes deployed to the gate env to capture logs

https://review.openstack.org/#/c/65834/
https://review.openstack.org/#/c/61892/

Daniel Berrange (berrange) wrote :

Ah ha, I knew I'd seen this kind of thing somewhere before. There is an outstanding, unsolved race condition in libvirt wrt nwfilters that can cause hangs

https://bugzilla.redhat.com/show_bug.cgi?id=929412

Daniel Berrange (berrange) wrote :

Looking slight further back in the logs before the virNWFilterUndefine API call, I see an attempt to start a guest:

2014-01-21 04:15:07.124+0000: 29622: debug : virDomainCreateWithFlags:9482 : dom=0x7f31a0001900, (VM: name=instance-00000053, uuid=9199131f-7ae8-4e9e-9065-0b47787180c1), flags=0

the logs show that this never completes. This pretty much confirms that we're hitting that libvirt race condition between parallel domain start / nwfilter changes.

@berrange, Thanks for the quick look. I'll get a run with the server side log and report back here

@berrange,

I've recreated the problem with server side logging enabled. please see libvirtd.txt.gz at the following url:

http://logs.openstack.org/64/67564/6/check/check-tempest-dsvm-full/dd0e0de/logs/

The screen-n-cpu.txt.gz has the client-side logs as usual.

Daniel Berrange (berrange) wrote :

Thanks, that confirms my earlier guess. I've managed to reproduce the scenario locally in a way that openstack would definitely hit. Basically concurrent use of virDomainCreate and virNWFilter{Define,Undefine} is racey and unsafe as it stands. I'm working on a fix.

Matt Riedemann (mriedem) on 2014-01-27
tags: added: libvirt
tags: added: libvirt111
tags: added: libvirt1x
removed: libvirt111
tags: added: libvirt-1x
removed: libvirt1x
tags: added: libvirt1x
removed: libvirt-1x
Daniel Berrange (berrange) wrote :

FYI libvirt master GIT repo has the fixed and it is being backported to all stable branches.

I've been able to confirm that the code in 1.1.1-maint libvirt tree works! Please see the last few runs of https://review.openstack.org/#/c/67564/

Can we please request the UCA team to update the UCA for precise+havana?

'deb http://ubuntu-cloud.archive.canonical.com/ubuntu precise-updates/havana main'

To be specific, please see the short log - http://libvirt.org/git/?p=libvirt.git;a=shortlog;h=refs/heads/v1.1.1-maint

We need the following patches from there:
Push nwfilter update locking up to top level v1.1.1-maint
Add a read/write lock implementation
Remove use of virConnectPtr from all remaining nwfilter...
Don't pass virConnectPtr in nwfilter 'struct domUpdateC...
Remove virConnectPtr arg from virNWFilterDefParse*
Don't ignore errors parsing nwfilter rules

Serge Hallyn (serge-hallyn) wrote :

Based on the newest listed patch being 'Push nwfilter update locking up to top level', and this patch being applied in trusty, i will mark this as fix released in trusty.

I'm testing a saucy package with all the v1.1.1-maint patches applied.

Changed in libvirt (Ubuntu):
importance: Undecided → High
status: Confirmed → Fix Released
Changed in libvirt (Ubuntu Saucy):
importance: Undecided → High
status: New → Triaged
Serge Hallyn (serge-hallyn) wrote :

Can anyone please update the bug description with a brief script to reproduce?

Thanks for uploading the fix for this bug report to -proposed. However, when reviewing the package in -proposed and the details of this bug report I noticed that the bug description is missing information required for the SRU process. You can find full details at http://wiki.ubuntu.com/StableReleaseUpdates#Procedure but essentially this bug is missing some of the following: a statement of impact, a test case and details regarding the regression potential. Thanks in advance!

Serge Hallyn (serge-hallyn) wrote :

Could someone please add a test case to the description?

description: updated
description: updated

Hi,

I am trying the icehouse precise proposed libvirt 1.2.1
 'deb http://ubuntu-cloud.archive.canonical.com/ubuntu precise-proposed/icehouse main'

It looks like the following bug is still present.
https://bugzilla.redhat.com/show_bug.cgi?id=1049391

Here's the trace from devstack-gate:
http://logs.openstack.org/89/74889/4/check/check-tempest-dsvm-full/76baed1/logs/screen-n-cpu.txt.gz?level=TRACE
http://logs.openstack.org/89/74889/4/check/check-tempest-dsvm-full/76baed1/logs/libvirtd.txt.gz

Here's the patch which was supposed to have fixed this problem in libvirt git tree
http://libvirt.org/git/?p=libvirt.git;a=commit;h=0e9373a5c04e5c69a95b6d1cae7cad8456332be7

Thanks,
dims

Joe Breu (breu) wrote :

libvirt-bin-1.1.1-0ubuntu8.5~cloud0 for precise in cloud-archive has a bug that is preventing the process from starting inside of a xen vm

https://bugs.launchpad.net/cloud-archive/+bug/1248025

Hello Sean, or anyone else affected,

Accepted libvirt into saucy-proposed. The package will build now and be available at http://launchpad.net/ubuntu/+source/libvirt/1.1.1-0ubuntu8.6 in a few hours, and then in the -proposed repository.

Please help us by testing this new package. See https://wiki.ubuntu.com/Testing/EnableProposed for documentation how to enable and use -proposed. Your feedback will aid us getting this update out to other Ubuntu users.

If this package fixes the bug for you, please add a comment to this bug, mentioning the version of the package you tested, and change the tag from verification-needed to verification-done. If it does not fix the bug for you, please add a comment stating that, and change the tag to verification-failed. In either case, details of your testing will help us make a better decision.

Further information regarding the verification process can be found at https://wiki.ubuntu.com/QATeam/PerformingSRUVerification . Thank you in advance!

Changed in libvirt (Ubuntu Saucy):
status: Triaged → Fix Committed
tags: added: verification-needed
Daniel Berrange (berrange) wrote :

Unfortunately the upstream libvirt patches contained a flaw, and required an additional patch to fix the issue

https://www.redhat.com/archives/libvir-list/2014-March/msg00501.html
https://bugzilla.redhat.com/show_bug.cgi?id=1066801

The issue only appears when you have on the order of 20+ VMs being launched in a short period of time, but would definitely affect openstack.

Serge Hallyn (serge-hallyn) wrote :

Thanks, Daniel! I'm pushing a new proposed package with that additional patch.

Brian Murray (brian-murray) wrote :

Hello Sean, or anyone else affected,

Accepted libvirt into saucy-proposed. The package will build now and be available at http://launchpad.net/ubuntu/+source/libvirt/1.1.1-0ubuntu8.7 in a few hours, and then in the -proposed repository.

Please help us by testing this new package. See https://wiki.ubuntu.com/Testing/EnableProposed for documentation how to enable and use -proposed. Your feedback will aid us getting this update out to other Ubuntu users.

If this package fixes the bug for you, please add a comment to this bug, mentioning the version of the package you tested, and change the tag from verification-needed to verification-done. If it does not fix the bug for you, please add a comment stating that, and change the tag to verification-failed. In either case, details of your testing will help us make a better decision.

Further information regarding the verification process can be found at https://wiki.ubuntu.com/QATeam/PerformingSRUVerification . Thank you in advance!

Marc Deslauriers (mdeslaur) wrote :

Could someone please test the package currently in saucy-proposed?
If it's not tested soon, it will be superseded by a security update...

Chuck Short (zulcss) on 2014-03-25
tags: added: verification-done
removed: verification-needed
Launchpad Janitor (janitor) wrote :

This bug was fixed in the package libvirt - 1.1.1-0ubuntu8.7

---------------
libvirt (1.1.1-0ubuntu8.7) saucy-proposed; urgency=medium

  * cherrypick d/p/add-a-mutex-to-serialize-updates-to-firewall from
    upstream to complete the concurrency issue fix (LP: #1228977)

libvirt (1.1.1-0ubuntu8.6) saucy-proposed; urgency=medium

  * Cherrypick upstream patches to fix concurrency race (LP: #1228977)
 -- Serge Hallyn <email address hidden> Mon, 10 Mar 2014 11:22:20 -0500

Changed in libvirt (Ubuntu Saucy):
status: Fix Committed → Fix Released

The verification of the Stable Release Update for libvirt has completed successfully and the package has now been released to -updates. Subsequently, the Ubuntu Stable Release Updates Team is being unsubscribed and will not receive messages about this bug report. In the event that you encounter a regression using the package from -updates please report a new bug using ubuntu-bug and tag the bug report regression-update so we can easily find any regresssions.

Joe Gordon (jogo) on 2014-06-19
Changed in nova:
status: Confirmed → Fix Released
status: Fix Released → Confirmed
status: Confirmed → Fix Released
James Page (james-page) on 2015-03-24
Changed in cloud-archive:
status: Confirmed → Fix Released
To post a comment you must log in.
This report contains Public information  Edit
Everyone can see this information.

Other bug subscribers

Related blueprints

Remote bug watches

Bug watches keep track of this bug in other bug trackers.