Error acquiring lock for resource

Bug #1607814 reported by Zane Bitter
This bug affects 4 people
Affects: OpenStack Heat
Status: Fix Released
Importance: High
Assigned to: Crag Wolfe
Milestone: pike-2

Bug Description

Seen once in the gate:

http://logs.openstack.org/18/341818/3/check/gate-heat-dsvm-functional-convg-mysql-lbaasv2/bfb610f/

2016-07-28 20:48:50.246 6821 ERROR heat.engine.resource [req-751a4fbb-96c4-4f85-8c3b-ebd58c7e3a5b e9c5b422acc24f7e9073a2c6c5fccbcf 87cc52a941ac44218303bba477cfd4be - - -] Error acquiring lock for resource id:985 with atomic_key:4, engine_id:None/8a388b6b-3f6a-4b53-b88c-d3d3e793de23

apparently resulting in the test failure:

Details: Stack AodhAlarmTest-664457513 failed to reach DELETE_COMPLETE status within the required time (1200 s).

This could be a race with the metadata, because on the previous line of the log, 2ms before the error, we have this:

2016-07-28 20:48:50.244 6824 DEBUG heat.engine.resource [req-15f33b7a-14bb-49e6-ac11-a03fe001f16a - 87cc52a941ac44218303bba477cfd4be - - -] Setting metadata for AutoScalingResourceGroup "asg" [9f404083-73cc-4c5f-9c60-092a6f3fab2f] Stack "AodhAlarmTest-664457513" [39ce99a5-fc1e-4bd3-a3fd-142156895720] metadata_set /opt/stack/new/heat/heat/engine/resource.py:411

Revision history for this message
Zane Bitter (zaneb) wrote :

Logstash query:

http://logstash.openstack.org/#dashboard/file/logstash.json?query=message%3A%5C%22Error%20acquiring%20lock%20for%20resource%5C%22%20tags%3A%5C%22screen-h-eng%5C%22

This seems to be a relatively common error. However, it's notable that we didn't see it at all for ~2.5 days this week, and it started back up yesterday afternoon.

tags: added: convergence-bugs gate-failure
Revision history for this message
Rabi Mishra (rabi) wrote :

I noticed this error today in a job that was killed. It seems to happen while deleting a resource in an IN_PROGRESS stack in StackValidationTest [1].

[1] http://logs.openstack.org/95/351095/3/check/gate-heat-dsvm-functional-convg-mysql-lbaasv2/e103c01/logs/screen-h-eng.txt.gz#_2016-08-05_08_55_30_684

Revision history for this message
Anant Patil (ananta) wrote :

I investigated and found that this error message is emitted when a worker tries to work on a resource that is still IN_PROGRESS. This happens when an update or delete of the resource from a previous traversal is still running, and it is completely normal. When the locked resource finishes, it automatically propagates the new update's traversal.
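
For illustration, here is a minimal, self-contained sketch of the locking behaviour described above. The names and the in-memory "row" are made up for the example; Heat's real logic lives in heat/engine/resource.py and does the compare-and-swap in the database.

import logging

logging.basicConfig(level=logging.INFO)
LOG = logging.getLogger(__name__)

# One "row" per resource: engine_id records the lock holder, atomic_key is a
# version counter bumped on every write (including metadata writes).
resource_row = {'id': 985, 'engine_id': None, 'atomic_key': 4}


def try_acquire_lock(row, expected_atomic_key, engine_id):
    """Optimistic lock: succeed only if nobody holds the lock and the
    version we observed earlier is still current."""
    if row['engine_id'] is None and row['atomic_key'] == expected_atomic_key:
        row['engine_id'] = engine_id
        row['atomic_key'] += 1
        return True
    return False


def check_resource(row, observed_atomic_key, engine_id):
    if not try_acquire_lock(row, observed_atomic_key, engine_id):
        # Normal during convergence: a previous traversal still holds the
        # lock. When that worker finishes, it looks up the stack's latest
        # traversal and re-triggers this resource, so no work is lost.
        LOG.info('Resource %s is locked; deferring to the current worker',
                 row['id'])
        return False
    # ... perform this traversal's update/delete, then release the lock ...
    return True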

Revision history for this message
Zane Bitter (zaneb) wrote :

That doesn't sound like something an operator should be paged in the middle of the night for, so it shouldn't be logged at ERROR level. (It should probably be logged at INFO level, with a more descriptive message.)

That won't resolve the test failures, but perhaps the update-cancel patch will.
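
Roughly the kind of change suggested here (just a sketch, not an actual patch; the variable names are illustrative, though the oslo.log import is the one Heat normally uses):

from oslo_log import log as logging

LOG = logging.getLogger(__name__)


def _log_lock_failure(rsrc, engine_id):
    # Before: treated as an operator-facing failure.
    # LOG.error('Error acquiring lock for resource id:%s with atomic_key:%s, '
    #           'engine_id:%s/%s', rsrc.id, rsrc.atomic_key,
    #           rsrc.engine_id, engine_id)

    # After: a normal, expected condition during convergence traversals.
    LOG.info('Resource %s is locked by another traversal; deferring',
             rsrc.name)
    LOG.debug('Lock on resource id:%s held by engine %s (atomic_key %s); '
              'this engine is %s',
              rsrc.id, rsrc.engine_id, rsrc.atomic_key, engine_id)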

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix proposed to heat (master)

Related fix proposed to branch: master
Review: https://review.openstack.org/370873

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix merged to heat (master)

Reviewed: https://review.openstack.org/370873
Committed: https://git.openstack.org/cgit/openstack/heat/commit/?id=7f5bd76f7a14ad67b1c8e38061b316dfc509d881
Submitter: Jenkins
Branch: master

commit 7f5bd76f7a14ad67b1c8e38061b316dfc509d881
Author: Zane Bitter <email address hidden>
Date: Thu Sep 15 10:16:54 2016 -0400

    Don't log locked resource at ERROR level

    It's expected that during a convergence traversal, we may encounter a
    resource that is still locked by a previous traversal. Don't log an
    ERROR-level message about what is a normal condition. Instead, log at
    INFO level describing what is happening, with more details at DEBUG
    level.

    Change-Id: I645c2a173b828d4a983ba874037d059ee645955f
    Related-Bug: #1607814

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix proposed to heat (stable/newton)

Related fix proposed to branch: stable/newton
Review: https://review.openstack.org/371732

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix merged to heat (stable/newton)

Reviewed: https://review.openstack.org/371732
Committed: https://git.openstack.org/cgit/openstack/heat/commit/?id=4a57772a74f002273c46172277b5087acf0effa3
Submitter: Jenkins
Branch: stable/newton

commit 4a57772a74f002273c46172277b5087acf0effa3
Author: Zane Bitter <email address hidden>
Date: Thu Sep 15 10:16:54 2016 -0400

    Don't log locked resource at ERROR level

    It's expected that during a convergence traversal, we may encounter a
    resource that is still locked by a previous traversal. Don't log an
    ERROR-level message about what is a normal condition. Instead, log at
    INFO level describing what is happening, with more details at DEBUG
    level.

    Change-Id: I645c2a173b828d4a983ba874037d059ee645955f
    Related-Bug: #1607814
    (cherry picked from commit 7f5bd76f7a14ad67b1c8e38061b316dfc509d881)

tags: added: in-stable-newton
Revision history for this message
Rabi Mishra (rabi) wrote :

It seems that if there is a delete traversal while a resource is locked for a metadata update from a scaling activity, the delete just bails out with 'resource is locked for update; deferring' [1]; then nothing happens (there is no re-triggering) and the stack gets stuck in DELETE_IN_PROGRESS.

We can probably fix the test by waiting for metadata['scaling_in_progress'] to be False (as in the sketch below), but that would not be the right solution.

[1] http://logs.openstack.org/40/374440/3/check/gate-heat-dsvm-functional-convg-mysql-lbaasv2/2149180/logs/screen-h-eng.txt.gz#_2016-09-22_05_18_23_228
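
Just to illustrate that (wrong) workaround: a hypothetical test-side helper that polls the scaling policy's metadata until scaling_in_progress clears before the delete is issued. The metadata-fetching callable is left to the caller; none of this is the functional test suite's actual API.

import time


def wait_for_scaling_to_finish(get_metadata, timeout=120, interval=2):
    """Poll a metadata-fetching callable until scaling_in_progress is False."""
    deadline = time.time() + timeout
    while time.time() < deadline:
        if not get_metadata().get('scaling_in_progress', False):
            return
        time.sleep(interval)
    raise AssertionError("metadata['scaling_in_progress'] never cleared")


# Usage in a test (fetch_policy_metadata is hypothetical):
# wait_for_scaling_to_finish(lambda: fetch_policy_metadata())
# ...then issue the stack delete.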

Changed in heat:
importance: Undecided → High
Revision history for this message
Anant Patil (ananta) wrote :

Rabi, to me it looks like a different issue, and I think we should file another ticket.

I think that when a resource or a nested stack updates itself, like an auto-scaling group does, it doesn't check for a new traversal of the parent stack. It is possible that the nested stack is being updated while a new traversal (update or delete) is running; the new traversal fails to grab the lock and then assumes that the locked resource will re-trigger it after unlocking. The solution that comes to mind immediately is that the update should check whether it was part of the new traversal. If the nested stack update is part of the new traversal then it shouldn't do anything, but if it is a self update triggered by some other event, then it should try to re-trigger the new traversal by looking at the sync point.
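
A hypothetical sketch of that check (the names and the retrigger callable are illustrative, not Heat's actual API):

def on_self_update_done(stack, resource, my_traversal_id, retrigger):
    """Called after a self-triggered update (e.g. a scaling adjustment)
    finishes and releases the resource lock. `retrigger` is whatever
    re-queues the resource check for a traversal."""
    if stack['current_traversal'] == my_traversal_id:
        # We were part of the stack's current traversal; the normal
        # convergence flow continues on its own.
        return
    # We were a self update running outside the current traversal. A newer
    # update/delete traversal may have been deferred while we held the lock,
    # so consult the sync point and re-trigger it ourselves.
    retrigger(stack['current_traversal'], resource)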

Revision history for this message
Rabi Mishra (rabi) wrote : Re: [Bug 1607814] Re: Error acquiring lock for resource

@anant,

I think it's the same issue with the same test. If you are referring to the error message, I assume it has changed after the recent convergence changes; this bug was initially reported two months back.

What you mentioned could be another situation, if we didn't wait for scaling to be completed before starting another update or delete.

However, in this test we do check that scaling has completed (by checking that the resource count has increased), but the scaling_policy resource is still doing a metadata update as part of signal handling [1], and at the same time a delete traversal finds the scaling policy locked (it could not update the DB).

[1] https://github.com/openstack/heat/blob/master/heat/engine/resources/openstack/heat/scaling_policy.py#L206

Revision history for this message
Rabi Mishra (rabi) wrote :

Oh! Probably between resource.py L1806 and L1807 the atomic key is getting updated by the metadata update; that's why we get the misleading 'locked for update' error.
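
A tiny self-contained illustration of that interleaving (names are made up; the real compare-and-swap happens in the database):

row = {'id': 985, 'engine_id': None, 'atomic_key': 4, 'metadata': {}}


def compare_and_swap(row, expected_key, new_engine_id):
    if row['engine_id'] is None and row['atomic_key'] == expected_key:
        row['engine_id'] = new_engine_id
        row['atomic_key'] += 1
        return True
    return False


# The delete traversal reads the row...
observed_key = row['atomic_key']                 # 4

# ...meanwhile the scaling policy's signal handler stores new metadata,
# which also bumps the atomic key:
row['metadata'] = {'scaling_in_progress': True}
row['atomic_key'] += 1                           # now 5

# The delete traversal's lock attempt now fails with a stale key even though
# nobody holds the lock, and the failure is reported as "Error acquiring lock".
assert compare_and_swap(row, observed_key, 'engine-A') is False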

Thomas Herve (therve)
Changed in heat:
milestone: none → ongoing
Hari (hosingh000)
Changed in heat:
assignee: nobody → Hari (hosingh000)
Revision history for this message
Hari (hosingh000) wrote :

I am looking for the steps to reproduce this issue. Is anyone aware of what steps should be performed?

Revision history for this message
Zane Bitter (zaneb) wrote :

I believe that this is fixed by https://review.openstack.org/#/c/449351/

Changed in heat:
assignee: Hari (hosingh000) → Crag Wolfe (cwolfe)
status: New → Fix Released
milestone: ongoing → pike-2
tags: removed: in-stable-newton