race, openstack overcloud node introspect --all-manageable --provide may hang

Bug #1846791 reported by Harald Jensås
This bug affects 2 people
Affects: tripleo
Status: Fix Released
Importance: Medium
Assigned to: Steve Baker
Milestone: (not listed)

Bug Description

When introspecting overcloud nodes, the command gets stuck while providing nodes after a successful introspection if the ironic node is locked. The workflow should retry, and fail if repeated attempts keep failing.

Waiting for introspection to finish...
Waiting for messages on queue 'tripleo' with no timeout.
Introspection of node d8c98749-fe43-4407-8bec-550eaf8f7348 failed.
Introspection of node efcb9aac-83ab-4afe-9dba-55b4af3a54ab completed. Status:SUCCESS. Errors:None
Retrying 1 nodes that failed introspection. Attempt 1 of 3
Introspection of node d8c98749-fe43-4407-8bec-550eaf8f7348 completed. Status:SUCCESS. Errors:None
Successfully introspected 1 node(s).

Introspection completed.
Waiting for messages on queue 'tripleo' with no timeout.
[{}, {u'result': u"The action raised an exception [action_ex_id=5807bd14-3096-49ca-a48e-8e8ab53aa814, action_cls='<class 'mistral.actions.action_factory.IronicAction'>', attributes='{u'client_method_name': u'node.set_provision_state'}', params='{u'state': u'provide', u'node_uuid': u'd8c98749-fe43-4407-8bec-550eaf8f7348'}']\n IronicAction.node.set_provision_state failed: Node d8c98749-fe43-4407-8bec-550eaf8f7348 is locked by host undercloud.rdocloud, please retry after the current operation is completed. (HTTP 409)"}]
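
The retry behaviour requested in the description could look roughly like the following. This is only a minimal sketch, not the tripleo workflow code: `NodeLockedError` and the `provide` callable are hypothetical stand-ins for the HTTP 409 "node is locked" response and for whatever call sets the provision state, and the attempt count and delay are arbitrary.

```python
import time


class NodeLockedError(Exception):
    """Hypothetical stand-in for an HTTP 409 'node is locked' response."""


def provide_with_retry(provide, node_uuid, attempts=3, delay=5.0,
                       sleep=time.sleep):
    """Call provide(node_uuid), retrying while the node is locked.

    Re-raises NodeLockedError once all `attempts` calls have failed, so
    the caller sees an error instead of waiting forever.
    """
    for attempt in range(1, attempts + 1):
        try:
            return provide(node_uuid)
        except NodeLockedError:
            if attempt == attempts:
                raise  # repeated attempts keep failing: give up
            sleep(delay)
```

Bounding the number of attempts is the point: a locked node delays the command, but a node that never unlocks produces a failure rather than a hang.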

Changed in tripleo:
milestone: none → ussuri-1
Changed in tripleo:
milestone: ussuri-1 → ussuri-2
wes hayutin (weshayutin)
Changed in tripleo:
milestone: ussuri-2 → ussuri-3
wes hayutin (weshayutin)
Changed in tripleo:
milestone: ussuri-3 → ussuri-rc3
kobig (kobi.ginon) wrote :

Does someone have a fix for this issue?

Waiting for messages on queue 'tripleo' with no timeout.
[{}, {u'result': u"The action raised an exception [action_ex_id=4e2f2d80-76a4-479e-ad62-088164efc367, action_cls='<class 'mistral.actions.action_factory.IronicAction'>', attributes='{u'client_method_name': u'node.set_provision_state'}', params='{u'state': u'provide', u'node_uuid': u'30925bac-bc22-4e74-bd40-de7e77b82db3'}']\n IronicAction.node.set_provision_state failed: Node 30925bac-bc22-4e74-bd40-de7e77b82db3 is locked by host undercloud.localdomain, please retry after the current operation is completed. (HTTP 409)"}, {}, {}, {}, {}]
No JSON object could be decoded

2020-04-01 09:36:47,227 - CbisDeployment - ERROR - e is: error occurred during command:
openstack overcloud node provide --all-manageable
error:
Waiting for messages on queue 'tripleo' with no timeout.
[{}, {u'result': u"The action raised an exception [action_ex_id=4e2f2d80-76a4-479e-ad62-088164efc367, action_cls='<class 'mistral.actions.action_factory.IronicAction'>', attributes='{u'client_method_name': u'node.set_provision_state'}', params='{u'state': u'provide', u'node_uuid': u'30925bac-bc22-4e74-bd40-de7e77b82db3'}']\n IronicAction.node.set_provision_state failed: Node 30925bac-bc22-4e74-bd40-de7e77b82db3 is locked by host undercloud.localdomain, please retry after the current operation is completed. (HTTP 409)"}, {}, {}, {}, {}]
No JSON object could be decoded

wes hayutin (weshayutin)
Changed in tripleo:
milestone: ussuri-rc3 → victoria-1
Changed in tripleo:
milestone: victoria-1 → victoria-3
Steve Baker (steve-stevebaker) wrote :

What might be happening here: a power-off action runs just before the provide, and the lock taken by the power-off is not released until after the provide action has timed out waiting for that lock.

There is a wait loop for the power-off, but according to Julia it is possible that the associated lock might not be released until well after the power state changes, so it might be worth also waiting for the node lock to be released.

I'll post a mistral patch which will be backportable to queens, then I'll take a look at the ansible equivalent.

Changed in tripleo:
assignee: nobody → Steve Baker (steve-stevebaker)
OpenStack Infra (hudson-openstack) wrote : Fix proposed to tripleo-common (stable/ussuri)

Fix proposed to branch: stable/ussuri
Review: https://review.opendev.org/745991

OpenStack Infra (hudson-openstack) wrote : Fix proposed to tripleo-ansible (master)

Fix proposed to branch: master
Review: https://review.opendev.org/746234

Changed in tripleo:
status: Triaged → In Progress
OpenStack Infra (hudson-openstack) wrote : Change abandoned on tripleo-common (stable/ussuri)

Change abandoned by Steve Baker (<email address hidden>) on branch: stable/ussuri
Review: https://review.opendev.org/745991
Reason: Ugh, ussuri is ansible based, will try again on train

OpenStack Infra (hudson-openstack) wrote : Fix proposed to tripleo-common (stable/train)

Fix proposed to branch: stable/train
Review: https://review.opendev.org/746258

OpenStack Infra (hudson-openstack) wrote : Fix merged to tripleo-ansible (master)

Reviewed: https://review.opendev.org/746234
Committed: https://git.openstack.org/cgit/openstack/tripleo-ansible/commit/?id=99b06ea3a2e419702349b5605572747dfe9131fd
Submitter: Zuul
Branch: master

commit 99b06ea3a2e419702349b5605572747dfe9131fd
Author: Steve Baker <email address hidden>
Date: Fri Aug 14 10:32:25 2020 +1200

    Wait for node to be unlocked before provide

    This change reduces the risk of provide having a lock timeout by
    waiting for existing node locks to be released before starting the
    provide.

    Ansible based provide may not be affected by bug #1846791 because
    power-down happens after the provide, not before. However waiting for
    locks to be released is recommended practice, and doing it here may
    improve reliability.

    Change-Id: I5bced3b91e4fa3568185e2bbc85c0a000182394e
    Closes-Bug: #1846791
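
The idea in this commit message, waiting for any existing lock to clear before starting the provide, can be sketched as below. It is an illustration under stated assumptions, not the tripleo-ansible change itself: Ironic reports the lock holder in the node's `reservation` field, while `get_reservation` and `provide` are hypothetical callables supplied by the caller, and the timeout and interval values are arbitrary.

```python
import time


def wait_for_unlock(get_reservation, node_uuid, timeout=120.0, interval=2.0,
                    sleep=time.sleep, clock=time.monotonic):
    """Poll until the node has no reservation; return False on timeout."""
    deadline = clock() + timeout
    while True:
        if get_reservation(node_uuid) is None:
            return True  # no conductor holds the lock any more
        if clock() >= deadline:
            return False
        sleep(interval)


def provide_when_unlocked(get_reservation, provide, node_uuid, **wait_kwargs):
    """Start the provide only after existing node locks are released."""
    if not wait_for_unlock(get_reservation, node_uuid, **wait_kwargs):
        raise TimeoutError("node %s is still locked" % node_uuid)
    return provide(node_uuid)
```

This narrows the race window but does not close it: another operation can still take the lock between the last poll and the provide call, which is why the commit message speaks of reducing the risk.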

Changed in tripleo:
status: In Progress → Fix Released
OpenStack Infra (hudson-openstack) wrote : Fix proposed to tripleo-ansible (stable/ussuri)

Fix proposed to branch: stable/ussuri
Review: https://review.opendev.org/747594

OpenStack Infra (hudson-openstack) wrote : Fix merged to tripleo-ansible (stable/ussuri)

Reviewed: https://review.opendev.org/747594
Committed: https://git.openstack.org/cgit/openstack/tripleo-ansible/commit/?id=14a808ed7ae3796c987883996d6439cf445f101f
Submitter: Zuul
Branch: stable/ussuri

commit 14a808ed7ae3796c987883996d6439cf445f101f
Author: Steve Baker <email address hidden>
Date: Fri Aug 14 10:32:25 2020 +1200

    Wait for node to be unlocked before provide

    This change reduces the risk of provide having a lock timeout by
    waiting for existing node locks to be released before starting the
    provide.

    Ansible based provide may not be affected by bug #1846791 because
    power-down happens after the provide, not before. However waiting for
    locks to be released is recommended practice, and doing it here may
    improve reliability.

    Change-Id: I5bced3b91e4fa3568185e2bbc85c0a000182394e
    Closes-Bug: #1846791
    (cherry picked from commit 32f673b8b73845be8ffa24e2988e3d33c0c79e6c)

tags: added: in-stable-ussuri
OpenStack Infra (hudson-openstack) wrote : Fix proposed to tripleo-common (stable/stein)

Fix proposed to branch: stable/stein
Review: https://review.opendev.org/747797

OpenStack Infra (hudson-openstack) wrote : Fix proposed to tripleo-common (stable/rocky)

Fix proposed to branch: stable/rocky
Review: https://review.opendev.org/747798

OpenStack Infra (hudson-openstack) wrote : Fix proposed to tripleo-common (stable/queens)

Fix proposed to branch: stable/queens
Review: https://review.opendev.org/747799

OpenStack Infra (hudson-openstack) wrote : Fix merged to tripleo-common (stable/train)

Reviewed: https://review.opendev.org/746258
Committed: https://git.openstack.org/cgit/openstack/tripleo-common/commit/?id=bc29d7f938179e593403a29510d9989bd52e647c
Submitter: Zuul
Branch: stable/train

commit bc29d7f938179e593403a29510d9989bd52e647c
Author: Steve Baker <email address hidden>
Date: Thu Aug 13 14:23:16 2020 +1200

    Wait for lock release during power state change

    When power state change is slow, the subsequent provide workflow can
    fail because getting a lock on the node times out.

    Apparently a node can remain locked for some time after a power state
    change, so this issue should be solved by *also* waiting for the node
    to be unlocked in the wait_for_power_state action.

    Change-Id: I26f23330c50ccf7cb11fb9171d0a82279a497d22
    Closes-Bug: #1846791
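
A power-state wait that also requires the node to be unlocked, as this commit message describes, might look like the sketch below. It is not the actual `wait_for_power_state` action from tripleo-common: `get_node` is a hypothetical callable returning a dict with `power_state` and `reservation` keys (named after the corresponding Ironic node fields), and the retry count and interval are arbitrary.

```python
import time


def wait_for_power_state(get_node, node_uuid, target, retries=15,
                         interval=3.0, sleep=time.sleep):
    """Return True once the node reports `target` AND holds no lock.

    Checking the power state alone is not enough: the node can stay
    reserved for a while after the state has already changed.
    """
    for _ in range(retries):
        node = get_node(node_uuid)
        if node["power_state"] == target and node["reservation"] is None:
            return True
        sleep(interval)
    return False
```

With both conditions in the same loop, a follow-on step such as provide only starts after the power-off has fully finished, including the release of its lock.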

tags: added: in-stable-train
OpenStack Infra (hudson-openstack) wrote : Change abandoned on tripleo-common (stable/queens)

Change abandoned by Emilien Macchi (<email address hidden>) on branch: stable/queens
Review: https://review.opendev.org/747799
Reason: The gate is currently hitting the "docker api 429" issue, see #tripleo channel for more details. I'll abandon that patch so it's cleared from the gate. Please do not restore it as I'll take care of it when the gate is stable again. Thanks for your understanding and patience!

OpenStack Infra (hudson-openstack) wrote : Fix merged to tripleo-common (stable/queens)

Reviewed: https://review.opendev.org/747799
Committed: https://git.openstack.org/cgit/openstack/tripleo-common/commit/?id=9e94e72d92aa458b640b92722a2eea4ed55d2cf8
Submitter: Zuul
Branch: stable/queens

commit 9e94e72d92aa458b640b92722a2eea4ed55d2cf8
Author: Steve Baker <email address hidden>
Date: Thu Aug 13 14:23:16 2020 +1200

    Wait for lock release during power state change

    When power state change is slow, the subsequent provide workflow can
    fail because getting a lock on the node times out.

    Apparently a node can remain locked for some time after a power state
    change, so this issue should be solved by *also* waiting for the node
    to be unlocked in the wait_for_power_state action.

    Change-Id: I26f23330c50ccf7cb11fb9171d0a82279a497d22
    Closes-Bug: #1846791
    (cherry picked from commit bc29d7f938179e593403a29510d9989bd52e647c)

tags: added: in-stable-queens
OpenStack Infra (hudson-openstack) wrote : Fix merged to tripleo-common (stable/rocky)

Reviewed: https://review.opendev.org/747798
Committed: https://git.openstack.org/cgit/openstack/tripleo-common/commit/?id=855a7803198c6632a049ac54c26fed0d23e2dadb
Submitter: Zuul
Branch: stable/rocky

commit 855a7803198c6632a049ac54c26fed0d23e2dadb
Author: Steve Baker <email address hidden>
Date: Thu Aug 13 14:23:16 2020 +1200

    Wait for lock release during power state change

    When power state change is slow, the subsequent provide workflow can
    fail because getting a lock on the node times out.

    Apparently a node can remain locked for some time after a power state
    change, so this issue should be solved by *also* waiting for the node
    to be unlocked in the wait_for_power_state action.

    Change-Id: I26f23330c50ccf7cb11fb9171d0a82279a497d22
    Closes-Bug: #1846791
    (cherry picked from commit bc29d7f938179e593403a29510d9989bd52e647c)

tags: added: in-stable-rocky
OpenStack Infra (hudson-openstack) wrote : Fix merged to tripleo-common (stable/stein)

Reviewed: https://review.opendev.org/747797
Committed: https://git.openstack.org/cgit/openstack/tripleo-common/commit/?id=c453aee8a16891aea15702c725a866715fcc2b4a
Submitter: Zuul
Branch: stable/stein

commit c453aee8a16891aea15702c725a866715fcc2b4a
Author: Steve Baker <email address hidden>
Date: Thu Aug 13 14:23:16 2020 +1200

    Wait for lock release during power state change

    When power state change is slow, the subsequent provide workflow can
    fail because getting a lock on the node times out.

    Apparently a node can remain locked for some time after a power state
    change, so this issue should be solved by *also* waiting for the node
    to be unlocked in the wait_for_power_state action.

    Change-Id: I26f23330c50ccf7cb11fb9171d0a82279a497d22
    Closes-Bug: #1846791
    (cherry picked from commit bc29d7f938179e593403a29510d9989bd52e647c)

tags: added: in-stable-stein
OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/tripleo-common 11.5.0

This issue was fixed in the openstack/tripleo-common 11.5.0 release.

OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/tripleo-common rocky-eol

This issue was fixed in the openstack/tripleo-common rocky-eol release.

OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/tripleo-common queens-eol

This issue was fixed in the openstack/tripleo-common queens-eol release.

OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/tripleo-common stein-eol

This issue was fixed in the openstack/tripleo-common stein-eol release.
