Failure to import new nodes using RDO Newton on RHEL

Bug #1634202 reported by David Hill on 2016-10-17
This bug affects 1 person
Affects: tripleo
Status: In Progress
Importance: Medium
Assigned to: David Hill
Milestone: ocata-1

Bug Description

[stack@undercloud-0-newton ~]$ openstack baremetal import --json /home/stack/instackenv.json
Started Mistral Workflow. Execution ID: 021d8507-81d4-4193-8431-b3ad31683425
Exception registering nodes: [{u'__task_execution': {u'id': u'71ceeeed-3046-4f86-ac8d-f70fbb66d8f6', u'name': u'wait_for_provision_state'}}, {u'__task_execution': {u'id': u'29e3703c-94fd-4781-970e-3f0a872582ed', u'name': u'wait_for_provision_state'}}, {u'result': u"Failure caused by error in tasks: set_provision_state\n\n set_provision_state [task_ex_id=28e31bab-fb52-4cfc-b730-698ef946e97f] -> Failed to run action [action_ex_id=9009e42e-9b20-4da6-b591-0381fbc78f75, action_cls='<class 'mistral.actions.action_factory.IronicAction'>', attributes='{u'client_method_name': u'node.set_provision_state'}', params='{u'state': u'manage', u'node_uuid': u'bc83da8c-27a5-4baa-adf0-6b2589cdd870', u'configdrive': None, u'cleansteps': None}']\n IronicAction.node.set_provision_state failed: <class 'ironicclient.common.apiclient.exceptions.Conflict'>: Node bc83da8c-27a5-4baa-adf0-6b2589cdd870 is locked by host localhost.localdomain, please retry after the current operation is completed.\n [action_ex_id=9009e42e-9b20-4da6-b591-0381fbc78f75, idx=0]: Failed to run action [action_ex_id=9009e42e-9b20-4da6-b591-0381fbc78f75, action_cls='<class 'mistral.actions.action_factory.IronicAction'>', attributes='{u'client_method_name': u'node.set_provision_state'}', params='{u'state': u'manage', u'node_uuid': u'bc83da8c-27a5-4baa-adf0-6b2589cdd870', u'configdrive': None, u'cleansteps': None}']\n IronicAction.node.set_provision_state failed: <class 'ironicclient.common.apiclient.exceptions.Conflict'>: Node bc83da8c-27a5-4baa-adf0-6b2589cdd870 is locked by host localhost.localdomain, please retry after the current operation is completed.\n"}, {u'__task_execution': {u'id': u'c7e752df-5278-4fa6-ad20-793613922d53', u'name': u'wait_for_provision_state'}}, {u'__task_execution': {u'id': u'd60b51dd-07cd-48b5-8da6-b79307b03e67', u'name': u'wait_for_provision_state'}}, {u'__task_execution': {u'id': u'1bd734d5-4fd0-434b-b794-e16aafe909af', u'name': u'wait_for_provision_state'}}, {u'__task_execution': {u'id': u'01d14855-e192-4e2f-a972-8a8c7a888d53', u'name': u'wait_for_provision_state'}}]

This didn't happen with Liberty or Mitaka.

David Hill (david-hill-ubisoft) wrote:

I got the same issue with Ocata/Trunk.

affects: cloud-init → tripleo
summary: - Failure to introspect new nodes using RDO Newton on RHEL
+ Failure to import new nodes using RDO Newton on RHEL
Changed in tripleo:
milestone: none → ocata-1
importance: Undecided → Medium
status: New → Triaged
wes hayutin (weshayutin) wrote:

We need to know which yum repos were in play here.

for i in /etc/yum.repos.d/*; do echo "$i"; cat "$i"; done

Changed in tripleo:
assignee: nobody → wes hayutin (weshayutin)
wes hayutin (weshayutin) on 2016-10-17
Changed in tripleo:
status: Triaged → Incomplete
David Hill (david-hill-ubisoft) wrote:

Ocata: delorean-openstack-neutron-f0ca030595cb7484be338c5678738b2e536b2369
Newton: Using current repos (and it's failing)

David Hill (david-hill-ubisoft) wrote:
Changed in tripleo:
status: Incomplete → New
David Hill (david-hill-ubisoft) wrote:

[openstack-newton]
name=OpenStack Newton Testing
baseurl=https://trunk.rdoproject.org/centos7-newton/current-passed-ci/
gpgcheck=0
enabled=1
[openstack-deps-newton]
name=OpenStack Newton Trunk Tested
baseurl=http://buildlogs.centos.org/centos/7/cloud/$basearch/openstack-newton/
gpgcheck=0
enabled=1

Exception registering nodes: [{u'__task_execution': {u'id': u'27a056e2-209c-4e18-8adf-f7d613071351', u'name': u'wait_for_provision_state'}}, {u'__task_execution': {u'id': u'09c9bd29-0bed-4c50-8b72-c36ecbada820', u'name': u'wait_for_provision_state'}}, {u'result': u"Failure caused by error in tasks: set_provision_state\n\n set_provision_state [task_ex_id=73525396-ce62-4563-9230-f2732069d2fb] -> Failed to run action [action_ex_id=d9eb6348-9930-48ca-bb7e-667973dc5c3f, action_cls='<class 'mistral.actions.action_factory.IronicAction'>', attributes='{u'client_method_name': u'node.set_provision_state'}', params='{u'state': u'manage', u'node_uuid': u'9de109de-b211-43c3-a969-797e94c1b5d3', u'configdrive': None, u'cleansteps': None}']\n IronicAction.node.set_provision_state failed: <class 'ironicclient.common.apiclient.exceptions.Conflict'>: Node 9de109de-b211-43c3-a969-797e94c1b5d3 is locked by host localhost.localdomain, please retry after the current operation is completed.\n [action_ex_id=d9eb6348-9930-48ca-bb7e-667973dc5c3f, idx=0]: Failed to run action [action_ex_id=d9eb6348-9930-48ca-bb7e-667973dc5c3f, action_cls='<class 'mistral.actions.action_factory.IronicAction'>', attributes='{u'client_method_name': u'node.set_provision_state'}', params='{u'state': u'manage', u'node_uuid': u'9de109de-b211-43c3-a969-797e94c1b5d3', u'configdrive': None, u'cleansteps': None}']\n IronicAction.node.set_provision_state failed: <class 'ironicclient.common.apiclient.exceptions.Conflict'>: Node 9de109de-b211-43c3-a969-797e94c1b5d3 is locked by host localhost.localdomain, please retry after the current operation is completed.\n"}, {u'__task_execution': {u'id': u'eea023cb-1d1a-4b9e-a1ff-d333c17b519f', u'name': u'wait_for_provision_state'}}, {u'__task_execution': {u'id': u'6db8300e-2748-4324-904b-18faeb53dd3c', u'name': u'wait_for_provision_state'}}, {u'__task_execution': {u'id': u'b7d3c139-8900-416a-a150-d3b54141d954', u'name': u'wait_for_provision_state'}}, {u'__task_execution': {u'id': u'6765afd7-172f-4420-90ae-441bda866d64', u'name': u'wait_for_provision_state'}}]

David Hill (david-hill-ubisoft) wrote:

Previously, I would see those lock error messages, but it seems that with Mistral the workflow no longer retries every second or so to acquire the lock; it fails instead of retrying until it succeeds.
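
For reference, Mistral workbooks support a per-task retry policy that covers exactly this case. A minimal sketch, reusing the set_provision_state task from the workbook quoted later in this report; the delay and count values are illustrative only, not the ones eventually proposed here:

      set_provision_state:
        on-success: wait_for_provision_state
        action: ironic.node_set_provision_state node_uuid=<% $.node_uuid %> state=<% $.state_action %>
        retry:
          delay: 1   # illustrative: wait one second between attempts
          count: 10  # illustrative: stop after ten NodeLocked conflicts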

Leif Madsen (leifmadsen) wrote:

I was debating whether to start a new bug, but I'm seeing a similar issue when attempting to start introspection:

Exception introspecting nodes: {u'result': u'Failed to run task [wf=WorkflowExecution {\'state_info\': None, \'accepted\': False, \'output\': {}, \'name\': u\'tripleo.baremetal.v1.introspect\', \'workflow_name\': u\'tripleo.baremetal.v1.introspect\', \'state\': \'RUNNING\', \'created_at\': \'2016-10-18 01:08:38.325385\', \'runtime_context\': {\'index\': 0}, \'updated_at\': \'2016-10-18 01:08:38.332061\', \'spec\': {u\'input\': [u\'node_uuids\', {u\'queue_name\': u\'tripleo\'}], u\'tasks\': {u\'send_message\': {u\'retry\': u\'count=5 delay=1\', u\'name\': u\'send_message\', u\'version\': \'2.0\', u\'action\': u\'zaqar.queue_post\', u\'input\': {u\'queue_name\': u\'<% $.queue_name %>\', u\'messages\': {u\'body\': {u\'type\': u\'tripleo.baremetal.v1.introspect\', u\'payload\': {u\'status\': u"<% $.get(\'status\', \'SUCCESS\') %>", u\'message\': u"<% $.get(\'message\', \'\') %>", u\'execution\': u\'<% execution() %>\', u\'introspected_nodes\': u"<% $.get(\'introspected_nodes\', []) %>"}}}}, u\'type\': \'direct\'}, u\'set_status_failed_start_introspection\': {u\'version\': \'2.0\', u\'type\': \'direct\', u\'name\': u\'set_status_failed_start_introspection\', u\'publish\': {u\'status\': u\'FAILED\', u\'message\': u\'<% task(start_introspection).result %>\', u\'introspected_nodes\': []}, u\'on-success\': u\'send_message\'}, u\'wait_for_introspection_to_finish\': {u\'name\': u\'wait_for_introspection_to_finish\', u\'on-success\': u\'send_message\', u\'publish\': {u\'status\': u\'SUCCESS\', u\'message\': u\'Successfully introspected nodes.\', u\'introspected_nodes\': u\'<% task(wait_for_introspection_to_finish).result %>\'}, u\'version\': \'2.0\', u\'action\': u\'baremetal_introspection.wait_for_finish\', u\'input\': {u\'uuids\': u\'<% $.node_uuids %>\'}, u\'type\': \'direct\'}, u\'start_introspection\': {u\'with-items\': u\'uuid in <% $.node_uuids %>\', u\'name\': u\'start_introspection\', u\'on-error\': u\'set_status_failed_start_introspection\', u\'on-success\': u\'wait_for_introspection_to_finish\', u\'version\': \'2.0\', u\'action\': u\'baremetal_introspection.introspect uuid=<% $.uuid %>\', u\'type\': \'direct\'}}, u\'description\': u\'Take a list of nodes and move them through introspection.\', u\'version\': u\'2.0\', u\'name\': u\'introspect\'}, \'workflow_id\': u\'07c7a9e2-9e23-4f03-b166-f6d811b7c90d\', \'params\': {\'index\': 0, \'task_execution_id\': u\'d7fccd05-ba45-400e-8811-cecfc1916806\'}, \'context\': {\'openstack\': {u\'project_name\': u\'admin\', u\'user_id\': u\'9f26c755eb2f4b9da21e8ce31718684c\', u\'roles\': [u\'admin\'], u\'auth_uri\': u\'http://172.16.200.10:5000/v3\', u\'auth_cacert\': None, u\'auth_token\': u\'704ea298f58643b1a5f38c754b27feec\', u\'expires_at\': u\'2016-10-18T05:08:35.000000Z\', u\'is_trust_scoped\': False, u\'project_id\': u\'dba8453eda9e45ed86251a7b4086461a\', u\'user_name\': u\'admin\', u\'target_service_catalog\': None}, \'__execution\': {\'id\': u\'5bf6adc2-3606-45a5-9796-ff0db952d166\'}}, \'input\': {u\'queue_name\': u\'04853dce-548d-4222-a165-49c66ef8ed42\', u\'node_uuids\': [u\'d417fe...


David Hill (david-hill-ubisoft) wrote:

I don't think it's the same issue; if I were you, I'd open a new bug for this.

David Hill (david-hill-ubisoft) wrote:

Here is what I see in ironic-api.log:
2016-10-17 22:00:44.108 23439 INFO eventlet.wsgi.server [req-9c5db457-ed0c-45bb-a64d-3d7bb1ba692c admin admin - - -] 192.0.2.1 "GET /v1/nodes/38e34003-aed8-4f27-aca8-9be780b7ca05/validate HTTP/1.1" status: 200 len: 1096 time: 0.3458369
2016-10-17 22:00:44.485 23439 INFO eventlet.wsgi.server [req-fb599685-37cc-4cf4-b137-2ff478780e3e admin admin - - -] 192.0.2.1 "PUT /v1/nodes/38e34003-aed8-4f27-aca8-9be780b7ca05/states/power HTTP/1.1" status: 202 len: 384 time: 0.3724360
2016-10-17 22:00:53.911 23436 DEBUG wsme.api [req-14615c18-1cda-4cc6-a40d-2b82fda88538 admin admin - - -] Client-side error: Node 38e34003-aed8-4f27-aca8-9be780b7ca05 is locked by host localhost.localdomain, please retry after the current operation is completed.
NodeLocked: Node 38e34003-aed8-4f27-aca8-9be780b7ca05 is locked by host localhost.localdomain, please retry after the current operation is completed.
2016-10-17 22:00:53.914 23436 INFO eventlet.wsgi.server [req-14615c18-1cda-4cc6-a40d-2b82fda88538 admin admin - - -] 192.0.2.1 "PUT /v1/nodes/38e34003-aed8-4f27-aca8-9be780b7ca05/states/provision HTTP/1.1" status: 409 len: 555 time: 2.4623940
2016-10-17 22:00:58.045 23436 DEBUG wsme.api [req-373a4840-d365-4134-a113-192e4c490226 admin admin - - -] Client-side error: Node 38e34003-aed8-4f27-aca8-9be780b7ca05 is locked by host localhost.localdomain, please retry after the current operation is completed.
NodeLocked: Node 38e34003-aed8-4f27-aca8-9be780b7ca05 is locked by host localhost.localdomain, please retry after the current operation is completed.
2016-10-17 22:00:58.048 23436 INFO eventlet.wsgi.server [req-373a4840-d365-4134-a113-192e4c490226 admin admin - - -] 192.0.2.1 "PUT /v1/nodes/38e34003-aed8-4f27-aca8-9be780b7ca05/states/provision HTTP/1.1" status: 409 len: 555 time: 2.1229970
2016-10-17 22:01:02.186 23436 DEBUG wsme.api [req-753dd253-33cc-4327-beb2-a78ecfa8573c admin admin - - -] Client-side error: Node 38e34003-aed8-4f27-aca8-9be780b7ca05 is locked by host localhost.localdomain, please retry after the current operation is completed.
NodeLocked: Node 38e34003-aed8-4f27-aca8-9be780b7ca05 is locked by host localhost.localdomain, please retry after the current operation is completed.
2016-10-17 22:01:02.190 23436 INFO eventlet.wsgi.server [req-753dd253-33cc-4327-beb2-a78ecfa8573c admin admin - - -] 192.0.2.1 "PUT /v1/nodes/38e34003-aed8-4f27-aca8-9be780b7ca05/states/provision HTTP/1.1" status: 409 len: 555 time: 2.1337779
2016-10-17 22:01:06.313 23436 DEBUG wsme.api [req-d456ae9f-d62b-4dd6-8b39-bc93b6b8d2c9 admin admin - - -] Client-side error: Node 38e34003-aed8-4f27-aca8-9be780b7ca05 is locked by host localhost.localdomain, please retry after the current operation is completed.
NodeLocked: Node 38e34003-aed8-4f27-aca8-9be780b7ca05 is locked by host localhost.localdomain, please retry after the current operation is completed.
2016-10-17 22:01:06.315 23436 INFO eventlet.wsgi.server [req-d456ae9f-d62b-4dd6-8b39-bc93b6b8d2c9 admin admin - - -] 192.0.2.1 "PUT /v1/nodes/38e34003-aed8-4f27-aca8-9be780b7ca05/states/provision HTTP/1.1" status: 409 len: 555 time: 2.1205130
2016-10-17 22:01:10.452 23436 DEBUG wsme.api [req-6f71229...


Logs from ironic.

---
version: '2.0'
name: tripleo.baremetal.v1
description: TripleO Baremetal Workflows

workflows:

  set_node_state:
    input:
      - node_uuid
      - state_action
      - target_state

    tasks:
      retry_provision_state:
        on-success: wait_for_provision_state
        action: ironic.node_set_provision_state node_uuid=<% $.node_uuid %> state=<% $.state_action %>

      set_provision_state:
        on-success: wait_for_provision_state
        on-error: retry_provision_state
        action: ironic.node_set_provision_state node_uuid=<% $.node_uuid %> state=<% $.state_action %>

      wait_for_provision_state:
        action: ironic.node_get node_id=<% $.node_uuid %>
        timeout: 1200 #20 minutes
        retry:
          delay: 3
          count: 400
          continue-on: <% task(wait_for_provision_state).result.provision_state != $.target_state %>

Updating the baremetal workbook so that set_provision_state retries when it fails because the host is locked solves this issue.

Here is a patch that I've tested; I can confirm it works.

Fix proposed to branch: master
Review: https://review.openstack.org/388920

Changed in tripleo:
assignee: wes hayutin (weshayutin) → David Hill (david-hill-ubisoft)
status: New → In Progress

diff --git a/workbooks/baremetal.yaml b/workbooks/baremetal.yaml
index 2f62413..375e8bc 100644
--- a/workbooks/baremetal.yaml
+++ b/workbooks/baremetal.yaml
@@ -16,6 +16,9 @@ workflows:
       set_provision_state:
         on-success: wait_for_provision_state
         action: ironic.node_set_provision_state node_uuid=<% $.node_uuid %> state=<% $.state_action %>
+        retry:
+          delay: 3
+          count: 3

       wait_for_provision_state:
         action: ironic.node_get node_id=<% $.node_uuid %>
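
Under Mistral's retry semantics, this policy re-runs set_provision_state up to three more times, three seconds apart, rather than failing the whole workflow on the first NodeLocked conflict.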

Dmitry Tantsur (divius) wrote:

David, I took a look at your logs and I'm inclined to think it's a case of https://bugs.launchpad.net/tripleo/+bug/1638281. The manage action times out before the power state was changed. Here's the relevant part:

2016-10-17 22:00:44.142 23471 DEBUG ironic.conductor.manager [req-fb599685-37cc-4cf4-b137-2ff478780e3e - - - - -] RPC change_node_power_state called for node 38e34003-aed8-4f27-aca8-9be780b7ca05. The desired new state is power off. change_node_power_state /usr/lib/python2.7/site-packages/ironic/conductor/manager.py:170
2016-10-17 22:00:44.158 23471 DEBUG ironic.conductor.task_manager [req-fb599685-37cc-4cf4-b137-2ff478780e3e - - - - -] Attempting to get exclusive lock on node 38e34003-aed8-4f27-aca8-9be780b7ca05 (for changing node power state) __init__ /usr/lib/python2.7/site-packages/ironic/conductor/task_manager.py:208
2016-10-17 22:00:44.181 23471 DEBUG ironic.conductor.task_manager [req-fb599685-37cc-4cf4-b137-2ff478780e3e - - - - -] Node 38e34003-aed8-4f27-aca8-9be780b7ca05 successfully reserved for changing node power state (took 0.02 seconds) reserve_node /usr/lib/python2.7/site-packages/ironic/conductor/task_manager.py:252
2016-10-17 22:00:51.713 23471 DEBUG ironic.conductor.task_manager [req-14615c18-1cda-4cc6-a40d-2b82fda88538 - - - - -] Attempting to get exclusive lock on node 38e34003-aed8-4f27-aca8-9be780b7ca05 (for provision action manage) __init__ /usr/lib/python2.7/site-packages/ironic/conductor/task_manager.py:208
2016-10-17 22:00:55.971 23471 DEBUG ironic.conductor.task_manager [req-373a4840-d365-4134-a113-192e4c490226 - - - - -] Attempting to get exclusive lock on node 38e34003-aed8-4f27-aca8-9be780b7ca05 (for provision action manage) __init__ /usr/lib/python2.7/site-packages/ironic/conductor/task_manager.py:208
2016-10-17 22:01:00.109 23471 DEBUG ironic.conductor.task_manager [req-753dd253-33cc-4327-beb2-a78ecfa8573c - - - - -] Attempting to get exclusive lock on node 38e34003-aed8-4f27-aca8-9be780b7ca05 (for provision action manage) __init__ /usr/lib/python2.7/site-packages/ironic/conductor/task_manager.py:208
2016-10-17 22:01:04.245 23471 DEBUG ironic.conductor.task_manager [req-d456ae9f-d62b-4dd6-8b39-bc93b6b8d2c9 - - - - -] Attempting to get exclusive lock on node 38e34003-aed8-4f27-aca8-9be780b7ca05 (for provision action manage) __init__ /usr/lib/python2.7/site-packages/ironic/conductor/task_manager.py:208
2016-10-17 22:01:08.383 23471 DEBUG ironic.conductor.task_manager [req-6f712294-44cb-4b2b-9472-4096f8a50069 - - - - -] Attempting to get exclusive lock on node 38e34003-aed8-4f27-aca8-9be780b7ca05 (for provision action manage) __init__ /usr/lib/python2.7/site-packages/ironic/conductor/task_manager.py:208
2016-10-17 22:01:12.511 23471 DEBUG ironic.conductor.task_manager [req-690d8896-19b6-4cab-89c0-b167408dd90d - - - - -] Attempting to get exclusive lock on node 38e34003-aed8-4f27-aca8-9be780b7ca05 (for provision action manage) __init__ /usr/lib/python2.7/site-packages/ironic/conductor/task_manager.py:208
2016-10-17 22:01:27.333 23471 WARNING ironic.conductor.utils [req-fb599685-37cc-4cf4-b137-2ff478780e3e - - - - -] Not going to change node 38e34003-aed8-4f27-aca8-9be780b7ca05 power state because current s...


Change abandoned by David Hill (<email address hidden>) on branch: master
Review: https://review.openstack.org/388920
Reason: No longer needed
