Don't set action to failed if acquire lock failed.

Bug #1648681 reported by Ethan Lynn
This bug affects 1 person
Affects: senlin
Status: Fix Released
Importance: Critical
Assigned to: Ethan Lynn

Bug Description

Currently, when an action tries to lock a cluster, it retries several times. After that, it reports failure.

But now our engine can scan the db, pick up an action, and execute it. A failed action will not be picked up again, even if it failed only because it could not acquire the lock.

To address this, we can just ignore the acquire lock failed error and leave the action at READY status in the db, waiting for the next engine to pick it up and execute it.

There are two places that need to be changed:
1. Let the engine pick up a random ready action instead of the first ready action.
2. Let the acquire lock failed error pass and ignore it.
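
To illustrate the first change, here is a minimal sketch of picking a random READY action rather than the first one. It uses a plain in-memory list as a stand-in for the action table; the names `actions` and `pick_random_ready` are hypothetical and are not Senlin's actual DB API.

```python
import random

READY, RUNNING = "READY", "RUNNING"

# Hypothetical in-memory stand-in for the action table (not Senlin's DB layer).
actions = [
    {"id": "a1", "status": READY},
    {"id": "a2", "status": RUNNING},
    {"id": "a3", "status": READY},
]


def pick_random_ready(actions, rng=random):
    """Return a randomly chosen READY action, or None if there is none."""
    ready = [a for a in actions if a["status"] == READY]
    return rng.choice(ready) if ready else None
```

With a random pick, an action that keeps failing to acquire the lock no longer sits at the head of the queue and blocks the other READY actions behind it.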

Thoughts are welcome.

Ethan Lynn (ethanlynn)
Changed in senlin:
assignee: nobody → Ethan Lynn (ethanlynn)
Yanyan Hu (yanyanhu)
Changed in senlin:
status: New → Triaged
importance: Undecided → Critical
XueFeng Liu (jonnary-liu) wrote :

"To address this, we can just ignore the acquire lock failed error and leave the action at READY"

This may cause a problem: would an action then have no timeout at the db layer? I think some actions need a timeout.

Ethan Lynn (ethanlynn) wrote :

@xuefeng, yes, we might need to add more info in the db to address the timeout problem, such as a retry count. But I haven't figured out in how many cases an action would never manage to acquire a lock.

Actually the case of this issue is:
When multiple actions try to lock a cluster at the same time, some of these actions will fail and won't be executed again.
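
As a rough sketch of the retry-count idea mentioned above (this is only an assumption about how it could work, not something the merged fix implements), each failed lock attempt could be counted on the action record, and the action failed only once a limit is exceeded. The `lock_retries` field, `MAX_LOCK_RETRIES`, and `record_lock_failure` are hypothetical names.

```python
MAX_LOCK_RETRIES = 3  # hypothetical limit, not a Senlin config option


def record_lock_failure(action, max_retries=MAX_LOCK_RETRIES):
    """Count a failed lock attempt; fail the action once the limit is exceeded."""
    action["lock_retries"] = action.get("lock_retries", 0) + 1
    if action["lock_retries"] > max_retries:
        action["status"] = "FAILED"
    else:
        # Leave the action READY so an engine can pick it up again.
        action["status"] = "READY"
    return action["status"]
```

This keeps the "leave it READY" behavior for ordinary contention while still bounding how long an action can wait for a lock.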

OpenStack Infra (hudson-openstack) wrote : Fix proposed to senlin (master)

Fix proposed to branch: master
Review: https://review.openstack.org/409805

Changed in senlin:
status: Triaged → In Progress
OpenStack Infra (hudson-openstack) wrote :

Fix proposed to branch: master
Review: https://review.openstack.org/410095

OpenStack Infra (hudson-openstack) wrote : Fix merged to senlin (master)

Reviewed: https://review.openstack.org/409805
Committed: https://git.openstack.org/cgit/openstack/senlin/commit/?id=f84a1a07dd3c7d33acd04a7869e96ab29a949849
Submitter: Jenkins
Branch: master

commit f84a1a07dd3c7d33acd04a7869e96ab29a949849
Author: Ethan Lynn <email address hidden>
Date: Mon Dec 12 21:51:13 2016 +0800

    Lookup a random action to execute

    This patch change the behavior of scheduler to random pick up
    a 'READY' action instead of the first 'READY' action.

    Change-Id: I470c0aa3776f78273f4a4d623d0140c99b92214f
    Partial-Bug: #1648681

OpenStack Infra (hudson-openstack) wrote :

Reviewed: https://review.openstack.org/410095
Committed: https://git.openstack.org/cgit/openstack/senlin/commit/?id=b6e8d758a0e33547a3fd78c434337e80b0a826fa
Submitter: Jenkins
Branch: master

commit b6e8d758a0e33547a3fd78c434337e80b0a826fa
Author: Ethan Lynn <email address hidden>
Date: Tue Dec 13 16:40:09 2016 +0800

    Remove retry logic from lock_acquire

    No need to retry, just wait for engine to pick action up again.

    The workflow is:
    ActionProc -> action.execute() -> return RES_RETRY
    -> action.set_status -> ignore RES_RETRY and continue

    Change-Id: Ic622a79b754131171cb940aa9f31ec5aef11ee47
    Closes-Bug: #1648681
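
The workflow in the commit message above can be sketched roughly as follows. This is a simplified model, assuming a single non-retrying lock attempt inside `execute()` and a `set_status()` that leaves the status untouched on `RES_RETRY`; the `Action` class, `action_proc` function, and the lock set are hypothetical stand-ins, not Senlin's real code.

```python
RES_OK, RES_ERROR, RES_RETRY = "OK", "ERROR", "RETRY"
READY, SUCCEEDED, FAILED = "READY", "SUCCEEDED", "FAILED"


class Action:
    """Hypothetical stand-in for an action record (not Senlin's Action class)."""

    def __init__(self, cluster_locks, cluster_id):
        self.status = READY
        self.cluster_locks = cluster_locks  # set of currently locked cluster ids
        self.cluster_id = cluster_id

    def execute(self):
        # Single attempt to take the cluster lock: no retry loop here.
        if self.cluster_id in self.cluster_locks:
            return RES_RETRY
        self.cluster_locks.add(self.cluster_id)
        try:
            return RES_OK  # the real work would happen here
        finally:
            self.cluster_locks.discard(self.cluster_id)

    def set_status(self, result):
        if result == RES_RETRY:
            # Ignore RES_RETRY: the action stays READY for a later pick-up.
            return
        self.status = SUCCEEDED if result == RES_OK else FAILED


def action_proc(action):
    """Run one action and record its outcome."""
    result = action.execute()
    action.set_status(result)
    return result
```

For example, calling `action_proc` on an action whose cluster is already locked returns `RES_RETRY` and leaves the status at READY; calling it again after the lock is released completes the action.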

Changed in senlin:
status: In Progress → Fix Released
OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/senlin 3.0.0.0b2

This issue was fixed in the openstack/senlin 3.0.0.0b2 development milestone.
