Error acquiring lock for resource

Bug #1607814 reported by Zane Bitter
This bug affects 4 people
Affects: OpenStack Heat
Status: Fix Released
Importance: High
Assigned to: Crag Wolfe
Milestone: pike-2

Bug Description

Seen once in the gate:

http://logs.openstack.org/18/341818/3/check/gate-heat-dsvm-functional-convg-mysql-lbaasv2/bfb610f/

2016-07-28 20:48:50.246 6821 ERROR heat.engine.resource [req-751a4fbb-96c4-4f85-8c3b-ebd58c7e3a5b e9c5b422acc24f7e9073a2c6c5fccbcf 87cc52a941ac44218303bba477cfd4be - - -] Error acquiring lock for resource id:985 with atomic_key:4, engine_id:None/8a388b6b-3f6a-4b53-b88c-d3d3e793de23

apparently resulting in the test failure:

Details: Stack AodhAlarmTest-664457513 failed to reach DELETE_COMPLETE status within the required time (1200 s).

This could be a race with the metadata, because on the previous line of the log, 2ms before the error, we have this:

2016-07-28 20:48:50.244 6824 DEBUG heat.engine.resource [req-15f33b7a-14bb-49e6-ac11-a03fe001f16a - 87cc52a941ac44218303bba477cfd4be - - -] Setting metadata for AutoScalingResourceGroup "asg" [9f404083-73cc-4c5f-9c60-092a6f3fab2f] Stack "AodhAlarmTest-664457513" [39ce99a5-fc1e-4bd3-a3fd-142156895720] metadata_set /opt/stack/new/heat/heat/engine/resource.py:411

Revision history for this message
Zane Bitter (zaneb) wrote :

Logstash query:

http://logstash.openstack.org/#dashboard/file/logstash.json?query=message%3A%5C%22Error%20acquiring%20lock%20for%20resource%5C%22%20tags%3A%5C%22screen-h-eng%5C%22

This seems to be a relatively common error. However, it's notable that we didn't see it at all for ~2.5 days this week, and it started back up yesterday afternoon.

tags: added: convergence-bugs gate-failure
Revision history for this message
Rabi Mishra (rabi) wrote :

I noticed this error today in a job that was killed. It seems to happen while deleting a resource in an IN_PROGRESS stack in StackValidationTest [1].

[1] http://logs.openstack.org/95/351095/3/check/gate-heat-dsvm-functional-convg-mysql-lbaasv2/e103c01/logs/screen-h-eng.txt.gz#_2016-08-05_08_55_30_684

Revision history for this message
Anant Patil (ananta) wrote :

I investigated and found that this error message is emitted when a worker tries to work on a resource that is still IN_PROGRESS. This happens when an update or delete of the resource from a previous traversal is still running, and it is completely normal. When the locked resource finishes, it automatically propagates the new update's traversal.
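
For illustration, here is a minimal, self-contained sketch of the locking behaviour described above. The names and the in-memory "row" are made up for the example; Heat's real logic lives in heat/engine/resource.py and does the compare-and-swap in the database.

import logging

logging.basicConfig(level=logging.INFO)
LOG = logging.getLogger(__name__)

# One "row" per resource: engine_id records the lock holder, atomic_key is a
# version counter bumped on every write (including metadata writes).
resource_row = {'id': 985, 'engine_id': None, 'atomic_key': 4}


def try_acquire_lock(row, expected_atomic_key, engine_id):
    """Optimistic lock: succeed only if nobody holds the lock and the
    version we observed earlier is still current."""
    if row['engine_id'] is None and row['atomic_key'] == expected_atomic_key:
        row['engine_id'] = engine_id
        row['atomic_key'] += 1
        return True
    return False


def check_resource(row, observed_atomic_key, engine_id):
    if not try_acquire_lock(row, observed_atomic_key, engine_id):
        # Normal during convergence: a previous traversal still holds the
        # lock. When that worker finishes, it looks up the stack's latest
        # traversal and re-triggers this resource, so no work is lost.
        LOG.info('Resource %s is locked; deferring to the current worker',
                 row['id'])
        return False
    # ... perform this traversal's update/delete, then release the lock ...
    return True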

Revision history for this message
Zane Bitter (zaneb) wrote :

That doesn't sound like something an operator should be paged in the middle of the night for, so it shouldn't be logged at ERROR level. (It should probably be logged at INFO level, with a more descriptive message.)

That won't resolve the test failures, but perhaps the update-cancel patch will.
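
Roughly the kind of change suggested here (just a sketch, not an actual patch; the variable names are illustrative, though the oslo.log import is the one Heat normally uses):

from oslo_log import log as logging

LOG = logging.getLogger(__name__)


def _log_lock_failure(rsrc, engine_id):
    # Before: treated as an operator-facing failure.
    # LOG.error('Error acquiring lock for resource id:%s with atomic_key:%s, '
    #           'engine_id:%s/%s', rsrc.id, rsrc.atomic_key,
    #           rsrc.engine_id, engine_id)

    # After: a normal, expected condition during convergence traversals.
    LOG.info('Resource %s is locked by another traversal; deferring',
             rsrc.name)
    LOG.debug('Lock on resource id:%s held by engine %s (atomic_key %s); '
              'this engine is %s',
              rsrc.id, rsrc.engine_id, rsrc.atomic_key, engine_id)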

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix proposed to heat (master)

Related fix proposed to branch: master
Review: https://review.openstack.org/370873

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix merged to heat (master)

Reviewed: https://review.openstack.org/370873
Committed: https://git.openstack.org/cgit/openstack/heat/commit/?id=7f5bd76f7a14ad67b1c8e38061b316dfc509d881
Submitter: Jenkins
Branch: master

commit 7f5bd76f7a14ad67b1c8e38061b316dfc509d881
Author: Zane Bitter <email address hidden>
Date: Thu Sep 15 10:16:54 2016 -0400

    Don't log locked resource at ERROR level

    It's expected that during a convergence traversal, we may encounter a
    resource that is still locked by a previous traversal. Don't log an
    ERROR-level message about what is a normal condition. Instead, log at
    INFO level describing what is happening, with more details at DEBUG
    level.

    Change-Id: I645c2a173b828d4a983ba874037d059ee645955f
    Related-Bug: #1607814

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix proposed to heat (stable/newton)

Related fix proposed to branch: stable/newton
Review: https://review.openstack.org/371732

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix merged to heat (stable/newton)

Reviewed: https://review.openstack.org/371732
Committed: https://git.openstack.org/cgit/openstack/heat/commit/?id=4a57772a74f002273c46172277b5087acf0effa3
Submitter: Jenkins
Branch: stable/newton

commit 4a57772a74f002273c46172277b5087acf0effa3
Author: Zane Bitter <email address hidden>
Date: Thu Sep 15 10:16:54 2016 -0400

    Don't log locked resource at ERROR level

    It's expected that during a convergence traversal, we may encounter a
    resource that is still locked by a previous traversal. Don't log an
    ERROR-level message about what is a normal condition. Instead, log at
    INFO level describing what is happening, with more details at DEBUG
    level.

    Change-Id: I645c2a173b828d4a983ba874037d059ee645955f
    Related-Bug: #1607814
    (cherry picked from commit 7f5bd76f7a14ad67b1c8e38061b316dfc509d881)

tags: added: in-stable-newton
Revision history for this message
Rabi Mishra (rabi) wrote :

It seems that if there is a delete traversal while a resource is locked for a metadata update from a scaling activity, the delete just bails out with 'resource is locked for update; deferring' [1]; then nothing happens (there is no re-triggering) and the stack gets stuck in DELETE_IN_PROGRESS.

We can probably fix the test by waiting for metadata['scaling_in_progress'] to be False (as in the sketch below), but that would not be the right solution.

[1] http://logs.openstack.org/40/374440/3/check/gate-heat-dsvm-functional-convg-mysql-lbaasv2/2149180/logs/screen-h-eng.txt.gz#_2016-09-22_05_18_23_228
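
Just to illustrate that (wrong) workaround: a hypothetical test-side helper that polls the scaling policy's metadata until scaling_in_progress clears before the delete is issued. The metadata-fetching callable is left to the caller; none of this is the functional test suite's actual API.

import time


def wait_for_scaling_to_finish(get_metadata, timeout=120, interval=2):
    """Poll a metadata-fetching callable until scaling_in_progress is False."""
    deadline = time.time() + timeout
    while time.time() < deadline:
        if not get_metadata().get('scaling_in_progress', False):
            return
        time.sleep(interval)
    raise AssertionError("metadata['scaling_in_progress'] never cleared")


# Usage in a test (fetch_policy_metadata is hypothetical):
# wait_for_scaling_to_finish(lambda: fetch_policy_metadata())
# ...then issue the stack delete.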

Changed in heat:
importance: Undecided → High
Revision history for this message
Anant Patil (ananta) wrote :

Rabi, to me it looks like a different issue, and I think we should file another ticket.

I think that when a resource or a nested stack updates itself, like an auto-scaling group does, it doesn't check for a new traversal of the parent stack. It is possible that the nested stack is being updated while a new traversal (update or delete) is running; the new traversal fails to grab the lock and then assumes that the locked resource will re-trigger it after unlocking. The solution that comes to mind immediately is that the update should check whether it was part of the new traversal. If the nested stack update is part of the new traversal then it shouldn't do anything, but if it is a self update triggered by some other event, then it should try to re-trigger the new traversal by looking at the sync point.
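
A hypothetical sketch of that check (the names and the retrigger callable are illustrative, not Heat's actual API):

def on_self_update_done(stack, resource, my_traversal_id, retrigger):
    """Called after a self-triggered update (e.g. a scaling adjustment)
    finishes and releases the resource lock. `retrigger` is whatever
    re-queues the resource check for a traversal."""
    if stack['current_traversal'] == my_traversal_id:
        # We were part of the stack's current traversal; the normal
        # convergence flow continues on its own.
        return
    # We were a self update running outside the current traversal. A newer
    # update/delete traversal may have been deferred while we held the lock,
    # so consult the sync point and re-trigger it ourselves.
    retrigger(stack['current_traversal'], resource)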

Revision history for this message
Rabi Mishra (rabi) wrote : Re: [Bug 1607814] Re: Error acquiring lock for resource

@anant,

I think it's the same issue with the same test. If you are referring to the error message, I assume it has changed after the recent convergence changes; this bug was initially reported two months back.

What you mentioned could be another situation, if we didn't wait for scaling to be completed before starting another update or delete.

However, in this test we do check that scaling has completed (by checking that the resource count has increased), but the scaling_policy resource is still doing a metadata update as part of signal handling [1], and at the same time a delete traversal finds the scaling policy locked (it could not update the DB).

[1] https://github.com/openstack/heat/blob/master/heat/engine/resources/openstack/heat/scaling_policy.py#L206

Revision history for this message
Rabi Mishra (rabi) wrote :

Oh! Probably between resource.py L1806 and L1807 the atomic key is getting updated by the metadata update; that's why we get the misleading 'locked for update' error.
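
A tiny self-contained illustration of that interleaving (names are made up; the real compare-and-swap happens in the database):

row = {'id': 985, 'engine_id': None, 'atomic_key': 4, 'metadata': {}}


def compare_and_swap(row, expected_key, new_engine_id):
    if row['engine_id'] is None and row['atomic_key'] == expected_key:
        row['engine_id'] = new_engine_id
        row['atomic_key'] += 1
        return True
    return False


# The delete traversal reads the row...
observed_key = row['atomic_key']                 # 4

# ...meanwhile the scaling policy's signal handler stores new metadata,
# which also bumps the atomic key:
row['metadata'] = {'scaling_in_progress': True}
row['atomic_key'] += 1                           # now 5

# The delete traversal's lock attempt now fails with a stale key even though
# nobody holds the lock, and the failure is reported as "Error acquiring lock".
assert compare_and_swap(row, observed_key, 'engine-A') is False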

Thomas Herve (therve)
Changed in heat:
milestone: none → ongoing
Hari (hosingh000)
Changed in heat:
assignee: nobody → Hari (hosingh000)
Revision history for this message
Hari (hosingh000) wrote :

I am looking for the steps to reproduce this issue. Is anyone aware of what steps should be performed?

Revision history for this message
Zane Bitter (zaneb) wrote :

I believe that this is fixed by https://review.openstack.org/#/c/449351/

Changed in heat:
assignee: Hari (hosingh000) → Crag Wolfe (cwolfe)
status: New → Fix Released
milestone: ongoing → pike-2
tags: removed: in-stable-newton