apparent deadlock on lock_bridge in n-cpu

Bug #1349452 reported by Doug Hellmann
This bug affects 3 people
Affects                   Status        Importance  Assigned to              Milestone
OpenStack Compute (nova)  Fix Released  Critical    Davanum Srinivas (DIMS)  2014.2
oslo-incubator            Fix Released  Critical    Unassigned
oslo.concurrency          Fix Released  Critical    Unassigned

Bug Description

It's not clear if n-cpu is dying trying to acquire the lock "lock_bridge" or if it's just hanging.

http://logs.openstack.org/08/109108/1/check/check-tempest-dsvm-full/4417111/logs/screen-n-cpu.txt.gz

The n-cpu logs stop about 15 minutes before the rest of the test run finishes, and every test executed after that point that needs the hypervisor fails, with varying errors.

Tags: compute
Tracy Jones (tjones-i)
Changed in nova:
importance: Undecided → High
Tracy Jones (tjones-i)
tags: added: compute
melanie witt (melwitt)
Changed in nova:
status: New → Confirmed
Joe Gordon (jogo) wrote :
Changed in nova:
importance: High → Critical
Chet Burgess (cfb-n) wrote :

As vishy pointed out, it looks like iptables-save is never returning, or the locks are preventing the greenthread from switching, and thus from returning the data and releasing the lock.

2014-08-18 16:21:25.183 DEBUG nova.openstack.common.lockutils [req-fb34143f-be74-4719-a062-a9a041a46098 FloatingIPsTestXML-369233022 FloatingIPsTestXML-175936678] Got semaphore "4afac4bc-f7d8-470f-9c16-af270582e48e" internal_lock /opt/stack/new/nova/nova/openstack/common/lockutils.py:263
2014-08-18 16:21:25.183 DEBUG nova.openstack.common.lockutils [req-fb34143f-be74-4719-a062-a9a041a46098 FloatingIPsTestXML-369233022 FloatingIPsTestXML-175936678] Got semaphore / lock "do_terminate_instance" inner /opt/stack/new/nova/nova/openstack/common/lockutils.py:324
2014-08-18 16:21:25.183 DEBUG nova.openstack.common.lockutils [req-fb34143f-be74-4719-a062-a9a041a46098 FloatingIPsTestXML-369233022 FloatingIPsTestXML-175936678] Got semaphore "4afac4bc-f7d8-470f-9c16-af270582e48e-events" internal_lock /opt/stack/new/nova/nova/openstack/common/lockutils.py:263
2014-08-18 16:21:25.184 DEBUG nova.openstack.common.lockutils [req-fb34143f-be74-4719-a062-a9a041a46098 FloatingIPsTestXML-369233022 FloatingIPsTestXML-175936678] Got semaphore / lock "_clear_events" inner /opt/stack/new/nova/nova/openstack/common/lockutils.py:324
2014-08-18 16:21:25.184 DEBUG nova.openstack.common.lockutils [req-fb34143f-be74-4719-a062-a9a041a46098 FloatingIPsTestXML-369233022 FloatingIPsTestXML-175936678] Released semaphore "4afac4bc-f7d8-470f-9c16-af270582e48e-events" lock /opt/stack/new/nova/nova/openstack/common/lockutils.py:291
2014-08-18 16:21:25.184 DEBUG nova.openstack.common.lockutils [req-fb34143f-be74-4719-a062-a9a041a46098 FloatingIPsTestXML-369233022 FloatingIPsTestXML-175936678] Semaphore / lock released "_clear_events" inner /opt/stack/new/nova/nova/openstack/common/lockutils.py:328
2014-08-18 16:21:25.277 AUDIT nova.compute.manager [req-fb34143f-be74-4719-a062-a9a041a46098 FloatingIPsTestXML-369233022 FloatingIPsTestXML-175936678] [instance: 4afac4bc-f7d8-470f-9c16-af270582e48e] Terminating instance
2014-08-18 16:21:27.759 DEBUG nova.virt.libvirt.vif [req-fb34143f-be74-4719-a062-a9a041a46098 FloatingIPsTestXML-369233022 FloatingIPsTestXML-175936678] vif_type=bridge instance=Instance(access_ip_v4=None,access_ip_v6=None,architecture=None,auto_disk_config=False,availability_zone=None,cell_name=None,cleaned=False,config_drive='',created_at=2014-08-18T16:21:07Z,default_ephemeral_device=None,default_swap_device=None,deleted=False,deleted_at=None,disable_terminate=False,display_description='floating_server-1914203176',display_name='floating_server-1914203176',ephemeral_gb=0,ephemeral_key_uuid=None,fault=<?>,host='devstack-trusty-hpcloud-b1-1523392',hostname='floating-server-1914203176',id=22,image_ref='af86fee1-9d89-4003-bcfe-7bf1213c8d9a',info_cache=InstanceInfoCache,instance_type_id=6,kernel_id='ccbbcd46-394e-4875-b6eb-245467adf1e4',key_data=None,key_name=None,launch_index=0,launched_at=2014-08-18T16:21:16Z,launched_on='devstack-trusty-hpcloud-b1-1523392',locked=False,locked_by=None,memory_mb=64,metadata={},node='devstack-trusty-hpcloud-b1-1523392',os_type=None,pci_devices=<?>,power_state=1,progress=...

Joe Gordon (jogo) wrote :

01:21 < cburgess> As far as I can tell we always use this threading.Semaphore. We used to use an eventlet
                  one back in the day but we don't anymore.
01:22 < cburgess> I think vishy might be onto something..
01:22 < cburgess> if debugger.enabled():
01:22 < cburgess>     # turn off thread patching to enable the remote debugger
01:22 < cburgess>     eventlet.monkey_patch(os=False, thread=False)
01:22 < cburgess> else:
01:22 < cburgess>     eventlet.monkey_patch(os=False)
01:22 < cburgess> We don't pass thread=false unless the debugger is on.
01:23 < vishy> cburgess: right but the semaphore is way broke anyway
01:23 < vishy> so i think we are monkeypatching it but hitting the deadlock from the bug
01:24 < cburgess> Possibly...
01:26 < vishy> so it used to use eventlet.semaphore.Semaphore
01:27 < cburgess> Yup
01:27 < cburgess> Looks like it got changed a while back.
01:27 < cburgess> This change.
01:27 < cburgess> https://github.com/openstack/nova/commit/a0bcd7b90c38b104cb278223679cedf5cc11c74c
01:28 < vishy> it also added a specific lock
01:29 < vishy> around the threading semaphore
01:29 < vishy> not sure when that was added
01:30 < cburgess> This one https://github.com/openstack/nova/commit/4b6ea1e1b87241918baf21a16817417226a9fc62

Joshua Harlow (harlowja) wrote :

Isn't eventlet.monkey_patch(os=False, thread=False) fraught with madness?

It's so simple to deadlock when that is done.

Example @ https://gist.github.com/harlowja/9c35e443dfa136a4f965
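
In the same spirit as the linked gist, here is a minimal self-contained sketch of the failure mode (an illustration, not the gist's code): with thread=False, threading.Lock stays a native OS lock that eventlet cannot yield around, so two green threads sharing one OS thread can block each other forever.

# Sketch only (assumed illustration, not taken from the gist above).
import eventlet
eventlet.monkey_patch(os=False, thread=False)

import threading

lock = threading.Lock()  # still the native lock; not greened


def worker(name):
    with lock:
        # Yielding here hands control to the other green thread, which
        # then blocks the whole OS thread on the native acquire(); the
        # hub never runs again, so the lock is never released.
        eventlet.sleep(0.1)
        print('%s finished' % name)


eventlet.spawn(worker, 'a')
eventlet.spawn(worker, 'b').wait()  # never returns: deadlock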

Sean Dague (sdague) wrote :

This really seems to point back at lockutils as the core issue; Nova is just seeing this behavior because lockutils is deadlocking.

Changed in oslo-incubator:
status: New → Confirmed
importance: Undecided → Critical
Ben Nemec (bnemec) wrote :

Okay, so we're still thinking this is due to the mismatch of eventlet/threading locks in lockutils? I think we do have an issue there even when eventlet is monkey_patched fully:

>>> import threading
>>> threading.Lock
<built-in function allocate_lock>
>>> threading.Semaphore
<function Semaphore at 0x2355c08>
>>> import eventlet
>>> eventlet.monkey_patch()
>>> threading.Lock
<function allocate_lock at 0x25dd0c8>
>>> threading.Semaphore
<function Semaphore at 0x2355c08>

Apparently eventlet doesn't monkey_patch Semaphore, so we're using an eventlet Lock with a stdlib Semaphore.

So, to fix this without introducing an eventlet dep to lockutils we need to do the following:

1) Conditionally import eventlet
2) Any time we create a semaphore, check if we have eventlet and if it has monkeypatched thread, and explicitly use an eventlet semaphore if so.

Is my understanding of the current situation correct?
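
If that understanding is right, the fix would look roughly like the following sketch. It only illustrates the two steps above; the helper name _get_semaphore() is hypothetical, not the actual lockutils API, though eventlet.patcher.is_monkey_patched() is a real eventlet helper.

import threading

try:
    import eventlet
    from eventlet import patcher
    import eventlet.semaphore
except ImportError:
    eventlet = None


def _get_semaphore():
    # Hypothetical helper: hand out a green semaphore only when eventlet
    # is importable and has monkey patched the thread module; otherwise
    # fall back to the stdlib semaphore.
    if eventlet is not None and patcher.is_monkey_patched('thread'):
        return eventlet.semaphore.Semaphore()
    return threading.Semaphore()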

Joshua Harlow (harlowja) wrote :

Just FYI, the semaphore object is based on a condition and a lock:

http://hg.python.org/releasing/2.7.6/file/ba31940588b6/Lib/threading.py#l422

So if all this monkey patching works as expected, then self.__cond = Condition(Lock()) would be monkey patched by eventlet, so the Semaphore there would actually use an eventlet Lock(). Maybe that's not how it turns out to work, though?

Joshua Harlow (harlowja) wrote :

A little test of this:

>>> import eventlet
>>> eventlet.monkey_patch()
>>> import threading
>>> b = threading.Semaphore()
>>> b._Semaphore__cond
<Condition(<Semaphore c=1 _w[0]>, 0)>
>>> b._Semaphore__cond._Condition__lock
<Semaphore at 0x8704dcc c=1 _w[0]>
>>> a = b._Semaphore__cond._Condition__lock
>>> a.__module__
'eventlet.semaphore'
>>> b.__module__
'threading'
>>>

Changed in oslo-incubator:
status: Confirmed → Fix Committed
milestone: none → juno-rc1
OpenStack Infra (hudson-openstack) wrote : Fix proposed to nova (master)

Fix proposed to branch: master
Review: https://review.openstack.org/119586

Changed in nova:
assignee: nobody → Davanum Srinivas (DIMS) (dims-v)
status: Confirmed → In Progress
OpenStack Infra (hudson-openstack) wrote : Fix merged to nova (master)

Reviewed: https://review.openstack.org/119586
Committed: https://git.openstack.org/cgit/openstack/nova/commit/?id=58d48ded2bbb90b8639a31b47e37e97c276eac87
Submitter: Jenkins
Branch: master

commit 58d48ded2bbb90b8639a31b47e37e97c276eac87
Author: Davanum Srinivas <email address hidden>
Date: Sat Sep 6 22:37:44 2014 -0400

    Sync oslo lockutils to nova

    The commit to be merged:

    942e1aa Use file locks by default again
    ac995be Fix E126 pep8 errors
    15b8352 Remove oslo.log from lockutils

    942e1aa is hopefully the fix for the "n-cpu stop about 15 minutes
    before the rest of the test run" problem at the gate. We need to
    make sure this actually fixes the gate problem

    Closes-Bug: #1349452

    Change-Id: I31265d4e08245879ad72d3e91dcd8be2359ca811
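
For context on "942e1aa Use file locks by default again": a file lock provides mutual exclusion through the filesystem rather than through an in-memory semaphore. A minimal illustrative sketch of that style of lock, assuming POSIX fcntl semantics (not the actual lockutils implementation):

import fcntl
import os


class FileLock(object):
    # Illustrative only. fcntl locks exclude other processes; green
    # threads within one process still need an in-process lock, which
    # is where the semaphore mismatch above was biting.
    def __init__(self, path):
        self.path = path
        self.fd = None

    def acquire(self):
        self.fd = os.open(self.path, os.O_CREAT | os.O_RDWR)
        fcntl.lockf(self.fd, fcntl.LOCK_EX)  # blocks until available

    def release(self):
        fcntl.lockf(self.fd, fcntl.LOCK_UN)
        os.close(self.fd)
        self.fd = None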

Changed in nova:
status: In Progress → Fix Committed
Changed in oslo-incubator:
status: Fix Committed → Fix Released
Thierry Carrez (ttx)
Changed in nova:
milestone: none → juno-rc1
status: Fix Committed → Fix Released
Thierry Carrez (ttx)
Changed in nova:
milestone: juno-rc1 → 2014.2
Ben Nemec (bnemec)
Changed in oslo.concurrency:
status: New → Fix Released
importance: Undecided → Critical