Able to aquire the semaphore used in lockutils.synchronized_with_prefix twice at the same time

Bug #1367941 reported by Joe Gordon on 2014-09-10
6
This bug affects 1 person
Affects Status Importance Assigned to Milestone
OpenStack Compute (nova)
Low
Unassigned
oslo-incubator
Medium
Ben Nemec
oslo.concurrency
Medium
Ben Nemec

Bug Description

In nova-compute the semaphore "compute_resources" is used in lockutils.synchronized_with_prefix('nova-') as part of nova/compute/resource_tracker.py

The "compute_resources" semaphore is acquired once at:

http://logs.openstack.org/58/117258/2/gate/gate-tempest-dsvm-neutron-full/48c8627/logs/screen-n-cpu.txt.gz?#_2014-09-10_20_19_17_176

And then again at:

In http://logs.openstack.org/58/117258/2/gate/gate-tempest-dsvm-neutron-full/48c8627/logs/screen-n-cpu.txt.gz?#_2014-09-10_20_19_52_234

without being released in between. This means lockutils.synchronized_with_prefix('nova-') isn't working as expected.

While https://review.openstack.org/#/c/119586/ is a possible culprit for this issue, a spot check of nova-compute logs from before that patch was merged show this was happening before (although in my spot checking it happened significantly less often, but I only checked one file).

Ben Nemec (bnemec) on 2014-09-10
Changed in nova:
status: New → Confirmed
importance: Undecided → Critical
Changed in oslo-incubator:
status: New → Confirmed
importance: Undecided → Critical
Joe Gordon (jogo) on 2014-09-10
Changed in nova:
milestone: none → juno-rc1
Ben Nemec (bnemec) wrote :

Hmm, I have not been able to reproduce this with lockutils and eventlet. I modified one of the existing test cases for verifying that multiple threads can't execute at the same time and it seems to work fine. I'll attach the test file that I created in incubator/tests/unit so everyone can take a look and tell me if I screwed anything up.

I ran it with "tox -e all -- --until-failure test_with_eventlet" and it didn't fail once in the time I ran it. If I remove the synchronized decorator it fails immediately, as expected.

Joshua Harlow (harlowja) wrote :

Is there anyway we can switch out the usage of the WeakValueDictionary for semaphores and temporarily use another type to see if its the cause?

Something like http://paste.ubuntu.com/8314661/

This would just rule out the usage of WeakValueDictionary as a cause, and likely can't make the situation worse....

Joshua Harlow (harlowja) wrote :

http://paste.ubuntu.com/8314680/ also includes a way to see when the weakref gets destroyed (which could be useful to determine what is going on here...)

Joe Gordon (jogo) wrote :

Test to see how many hits there are for this bug in screen-n-cpu file:

 grep compute_resource screen-n-cpu.txt.gz | grep -Pzo "Got semaphore.*\n.*Got"

Joe Gordon (jogo) wrote :

in recent gate-tempest-dsvm-full jobs for nova patches that landed the hit count is anywhere from 27 to 37 times in a single n-cpu file (sample size of 6 jobs).

Joe Gordon (jogo) wrote :

similar results before https://review.openstack.org/#/c/119586/ was merged

Ben Nemec (bnemec) on 2014-09-11
Changed in oslo.concurrency:
status: New → Triaged
importance: Undecided → Medium
Doug Hellmann (doug-hellmann) wrote :
Changed in oslo.concurrency:
milestone: none → juno-rc1
Changed in oslo-incubator:
milestone: none → juno-rc1
Joe Gordon (jogo) wrote :

This was a bug with logging and not in the actual logic or code, so we should do a incubator sync to get the correct logging when possible

Changed in nova:
importance: Critical → Medium

Fix proposed to branch: master
Review: https://review.openstack.org/120934

Changed in nova:
assignee: nobody → Davanum Srinivas (DIMS) (dims-v)
status: Confirmed → In Progress

So, with the updates to the logging, i am looking at
http://logs.openstack.org/34/120934/1/check/check-tempest-dsvm-neutron-full/4c35ad6/logs/screen-n-cpu.txt.gz

I can see 2 Acquired one after another w/o a Releasing

2014-09-11 23:50:26.641 31439 DEBUG nova.openstack.common.lockutils Acquired semaphore "compute_resources" lock /opt/stack/new/nova/nova/openstack/common/lockutils.py:229
2014-09-11 23:51:10.473 DEBUG nova.openstack.common.lockutils Using existing semaphore "compute_resources" internal_lock /opt/stack/new/nova/nova/openstack/common/lockutils.py:202
2014-09-11 23:51:37.804 DEBUG nova.openstack.common.lockutils Acquired semaphore "compute_resources" lock /opt/stack/new/nova/nova/openstack/common/lockutils.py:229
2014-09-11 23:51:37.830 DEBUG nova.openstack.common.lockutils Releasing semaphore "compute_resources" lock /opt/stack/new/nova/nova/openstack/common/lockutils.py:236

Ben Nemec (bnemec) on 2014-09-12
Changed in oslo-incubator:
importance: Critical → Medium
Ben Nemec (bnemec) wrote :

Somebody correct me if I'm wrong, but I think the exception happening while we're holding the lock means the releasing log message gets skipped because it breaks out of the context before we get to do that. It should still be doing the right thing, but we can't log it because the semaphore's context manager is taking care of cleanup.

I think this is supported by the fact that the first acquisition is _never_ released, according to the log, which almost certainly isn't true because we then re-create the semaphore, which means nobody had a reference to it so it couldn't have been held through that whole time.

The direct link to the relevant section of the logs is http://logs.openstack.org/34/120934/1/check/check-tempest-dsvm-neutron-full/4c35ad6/logs/screen-n-cpu.txt.gz#_2014-09-11_23_50_26_641

Joshua Harlow (harlowja) wrote :

I agree, if the user of the lock blows up then the last log message won't be printed.

This should be fixed @ https://review.openstack.org/121156

Changed in oslo.concurrency:
assignee: nobody → Ben Nemec (bnemec)
Changed in oslo-incubator:
assignee: nobody → Ben Nemec (bnemec)

Reviewed: https://review.openstack.org/120934
Committed: https://git.openstack.org/cgit/openstack/nova/commit/?id=cad742d9525cb4d0a630d4ca4c15037ea29e811e
Submitter: Jenkins
Branch: master

commit cad742d9525cb4d0a630d4ca4c15037ea29e811e
Author: Davanum Srinivas <email address hidden>
Date: Thu Sep 11 19:00:54 2014 -0400

    Clarify logging in lockutils

    The old logging statements in lockutils made for some confusing log
    entries where it would appear a given lock was acquired multiple
    times by different threads at the same time. See referenced bug
    for details.

    In order to alleviate that confusion, this change does a few things:
    1) Adds an explicit "acquired" message inside the lock so it is
       clear when the lock was actually acquired.
    2) Moves the release message inside the semaphore so there's no
       chance of it being logged out of order.
    3) Removes the "Got semaphore" message and splits it into two
       separate messages depending on whether the semaphore was found
       in the weakref dictionary. Making it clear which code path
       was followed should help with future debugging.

    This is the Nova port of oslo-incubator change
    If863c9d3896bc786f3c9c46196ebb0766e57d863

    Change-Id: I530befb42a963a40e7d5bf8d4d81fc097e1cb309
    Partial-Bug: #1367941

Changed in oslo-incubator:
status: Confirmed → Fix Committed
Changed in oslo.concurrency:
status: Triaged → Fix Committed
John Garbutt (johngarbutt) wrote :

Removing from nova juno-rc1 as I don't think this should block the juno release.

Changed in nova:
milestone: juno-rc1 → none
Joe Gordon (jogo) wrote :

this was a red herring in nova, its just logging issues. Last patch: https://review.openstack.org/#/c/121428/

Sean Dague (sdague) on 2014-09-18
Changed in nova:
status: In Progress → Invalid
assignee: Davanum Srinivas (DIMS) (dims-v) → nobody
Joe Gordon (jogo) on 2014-09-18
Changed in nova:
importance: Medium → Low

Reviewed: https://review.openstack.org/121428
Committed: https://git.openstack.org/cgit/openstack/nova/commit/?id=671fe4303994cee5078bfc2716993d20922f295a
Submitter: Jenkins
Branch: master

commit 671fe4303994cee5078bfc2716993d20922f295a
Author: Davanum Srinivas <email address hidden>
Date: Sun Sep 14 20:10:32 2014 -0400

    Always log the releasing, even under failure

    Cherry picked from oslo-incubator change id:
    I4b87b5d9e5c42a1ab4c837ebb4d45b86faf87c21

    Change-Id: Iae1b9035398e283b5837a3214cf0097022af54a9
    Related-Bug: #1367941

Changed in oslo.concurrency:
milestone: next-juno → next-kilo
Changed in oslo-incubator:
status: Fix Committed → Fix Released
Changed in oslo.concurrency:
status: Fix Committed → Fix Released
To post a comment you must log in.
This report contains Public information  Edit
Everyone can see this information.

Other bug subscribers