CI: nonha jobs fails in introspection step because of mistral error

Bug #1609688 reported by Sagi (Sergey) Shnaidman
6
This bug affects 1 person
Affects Status Importance Assigned to Milestone
tripleo
Fix Released
Critical
Unassigned

Bug Description

Periodic and gate nonha jobs fail with error in introspection step:

Failed to run task [wf=WorkflowExecution
{'state_info': None, 'accepted': False, 'name': u'tripleo.baremetal.v1.introspect', 'workflow_name': u'tripleo.baremetal.v1.introspect', 'state': u'RUNNING', 'created_at': '2016-08-04 07:23:32', 'tags': None, 'updated_at': '2016-08-04 07:23:32', 'spec':
{u'tasks':
{u'send_message':
{u'action': u'zaqar.queue_post', u'input':
{u'queue_name': u'<% $.queue_name %>', u'messages':
{u'body':
{u'type': u'tripleo.baremetal.v1.introspect', u'payload':
{u'status': u"<% $.get('status', 'SUCCESS') %>", u'message': u"<% $.get('message', '') %>", u'execution': u'<% execution() %>', u'introspected_nodes': u"<% $.get('introspected_nodes', []) %>"}}}}, u'version': u'2.0', u'type': u'direct', u'name': u'send_message'}, u'set_status_failed_start_introspection':
{u'version': u'2.0', u'type': u'direct', u'name': u'set_status_failed_start_introspection', u'publish':
{u'status': u'FAILED', u'message': u'<% task(start_introspection).result %>', u'introspected_nodes': []}, u'on-success': u'send_message'}, u'wait_for_introspection_to_finish':
{u'name': u'wait_for_introspection_to_finish', u'on-success': u'send_message', u'publish':
{u'status': u'SUCCESS', u'message': u'Succussfully introspected nodes.', u'introspected_nodes': u'<% task(wait_for_introspection_to_finish).result %>'}, u'version': u'2.0', u'action': u'baremetal_introspection.wait_for_finish', u'input':
{u'uuids': u'<% $.node_uuids %>'}, u'type': u'direct'}, u'start_introspection':
{u'with-items': u'uuid in <% $.node_uuids %>', u'name': u'start_introspection', u'on-error': u'set_status_failed_start_introspection', u'on-success': u'wait_for_introspection_to_finish', u'version': u'2.0', u'action': u'baremetal_introspection.introspect uuid=<% $.uuid %>', u'type': u'direct'}}, u'description': u'Take a list of nodes and move them through introspection.', u'version': u'2.0', u'input': [u'node_uuids',
{u'queue_name': u'tripleo'}], u'type': u'direct', u'name': u'introspect'}, 'workflow_id': u'0017e310-f7db-4c08-a317-60e127bd55d4', 'params':
{u'index': 0, u'task_execution_id': u'dc7d53b7-d1b2-4ffc-b1f1-86be0d45e8b4'}, 'context':
{u'openstack':
{u'project_name': u'admin', u'user_id': u'90d4fe8c54c7461b84833b15e8d55d70', u'roles': [u'admin'], u'is_trust_scoped': False, u'auth_cacert': None, u'auth_token': u'512fda7706ac4e77b63fda4a592efb5c', u'auth_uri': u'https://192.0.2.2:13000/v3', u'service_catalog': u'[
{"endpoints": [
{"adminURL": "http://192.0.2.1:8080", "region": "regionOne", "internalURL": "http://192.0.2.1:8080/v1/AUTH_2bd839d8da004c44a4d9c4eb2eb098ec", "publicURL": "https://192.0.2.2:13808/v1/AUTH_2bd839d8da004c44a4d9c4eb2eb098ec"}], "type": "object-store", "name": "swift"},
{"endpoints": [
{"adminURL": "http://192.0.2.1:35357/v2.0", "region": "regionOne", "internalURL": "http://192.0.2.1:5000/v2.0", "publicURL": "https://192.0.2.2:13000/v2.0"}], "type": "identity", "name": "keystone"},
{"endpoints": [
{"adminURL": "http://192.0.2.1:8989/v2", "region": "regionOne", "internalURL": "http://192.0.2.1:8989/v2", "publicURL": "https://192.0.2.2:13989/v2"}], "type": "workflowv2", "name": "mistral"},
{"endpoints": [
{"adminURL": "ws://192.0.2.1:9000/", "region": "regionOne", "internalURL": "ws://192.0.2.1:9000/", "publicURL": "ws://192.0.2.1:9000/"}], "type": "messaging-websocket", "name": "zaqar-websocket"},
{"endpoints": [
{"adminURL": "http://192.0.2.1:9292", "region": "regionOne", "internalURL": "http://192.0.2.1:9292", "publicURL": "https://192.0.2.2:13292"}], "type": "image", "name": "glance"},
{"endpoints": [
{"adminURL": "http://192.0.2.1:9696", "region": "regionOne", "internalURL": "http://192.0.2.1:9696", "publicURL": "https://192.0.2.2:13696"}], "type": "network", "name": "neutron"},
{"endpoints": [
{"adminURL": "http://192.0.2.1:8774/v2.1", "region": "regionOne", "internalURL": "http://192.0.2.1:8774/v2.1", "publicURL": "https://192.0.2.2:13774/v2.1"}], "type": "compute", "name": "nova"},
{"endpoints": [
{"adminURL": "http://192.0.2.1:8888/", "region": "regionOne", "internalURL": "http://192.0.2.1:8888/", "publicURL": "http://192.0.2.1:8888/"}], "type": "messaging", "name": "zaqar"},
{"endpoints": [
{"adminURL": "http://192.0.2.1:8004/v1/2bd839d8da004c44a4d9c4eb2eb098ec", "region": "regionOne", "internalURL": "http://192.0.2.1:8004/v1/2bd839d8da004c44a4d9c4eb2eb098ec", "publicURL": "https://192.0.2.2:13004/v1/2bd839d8da004c44a4d9c4eb2eb098ec"}], "type": "orchestration", "name": "heat"},
{"endpoints": [
{"adminURL": "http://192.0.2.1:6385", "region": "regionOne", "internalURL": "http://192.0.2.1:6385", "publicURL": "https://192.0.2.2:13385"}], "type": "baremetal", "name": "ironic"},
{"endpoints": [
{"adminURL": "http://192.0.2.1:5050", "region": "regionOne", "internalURL": "http://192.0.2.1:5050", "publicURL": "http://192.0.2.1:5050"}], "type": "baremetal-introspection", "name": "ironic-inspector"}]', u'project_id': u'2bd839d8da004c44a4d9c4eb2eb098ec', u'user_name': u'admin'}, u'__execution':
{u'input':
{u'queue_name': u'427f4966-2630-4b77-b215-708b8487a40f', u'node_uuids': [u'0a2500af-16db-48fb-9d74-6e2273315d14', u'c4b035a6-a4d9-43ce-b957-ffc2e2149f96', u'29a4d0a2-d009-4b21-acef-48a529169642']}, u'params':
{u'index': 0, u'task_execution_id': u'dc7d53b7-d1b2-4ffc-b1f1-86be0d45e8b4'}, u'id': u'b8d1b1fd-10d6-4a4d-968b-fbaf989327c3', u'spec':
{u'tasks':
{u'send_message':
{u'action': u'zaqar.queue_post', u'input':
{u'queue_name': u'<% $.queue_name %>', u'messages':
{u'body':
{u'type': u'tripleo.baremetal.v1.introspect', u'payload':
{u'status': u"<% $.get('status', 'SUCCESS') %>", u'message': u"<% $.get('message', '') %>", u'execution': u'<% execution() %>', u'introspected_nodes': u"<% $.get('introspected_nodes', []) %>"}}}}, u'version': u'2.0', u'type': u'direct', u'name': u'send_message'}, u'set_status_failed_start_introspection':
{u'version': u'2.0', u'type': u'direct', u'name': u'set_status_failed_start_introspection', u'publish':
{u'status': u'FAILED', u'message': u'<% task(start_introspection).result %>', u'introspected_nodes': []}, u'on-success': u'send_message'}, u'wait_for_introspection_to_finish':
{u'name': u'wait_for_introspection_to_finish', u'on-success': u'send_message', u'publish':
{u'status': u'SUCCESS', u'message': u'Succussfully introspected nodes.', u'introspected_nodes': u'<% task(wait_for_introspection_to_finish).result %>'}, u'version': u'2.0', u'action': u'baremetal_introspection.wait_for_finish', u'input':
{u'uuids': u'<% $.node_uuids %>'}, u'type': u'direct'}, u'start_introspection':
{u'with-items': u'uuid in <% $.node_uuids %>', u'name': u'start_introspection', u'on-error': u'set_status_failed_start_introspection', u'on-success': u'wait_for_introspection_to_finish', u'version': u'2.0', u'action': u'baremetal_introspection.introspect uuid=<% $.uuid %>', u'type': u'direct'}}, u'description': u'Take a list of nodes and move them through introspection.', u'version': u'2.0', u'input': [u'node_uuids',
{u'queue_name': u'tripleo'}], u'type': u'direct', u'name': u'introspect'}}, u'queue_name': u'427f4966-2630-4b77-b215-708b8487a40f', u'node_uuids': [u'0a2500af-16db-48fb-9d74-6e2273315d14', u'c4b035a6-a4d9-43ce-b957-ffc2e2149f96', u'29a4d0a2-d009-4b21-acef-48a529169642']}, 'input':
{u'queue_name': u'427f4966-2630-4b77-b215-708b8487a40f', u'node_uuids': [u'0a2500af-16db-48fb-9d74-6e2273315d14', u'c4b035a6-a4d9-43ce-b957-ffc2e2149f96', u'29a4d0a2-d009-4b21-acef-48a529169642']}, 'scope': u'private', 'project_id': u'2bd839d8da004c44a4d9c4eb2eb098ec', 'task_execution_id': u'dc7d53b7-d1b2-4ffc-b1f1-86be0d45e8b4', 'id': u'b8d1b1fd-10d6-4a4d-968b-fbaf989327c3', 'runtime_context':
{u'index': 0}, 'description': u'sub-workflow execution'}, task=wait_for_introspection_to_finish]: Invalid input [name=baremetal_introspection.wait_for_finish, class=NoneType, unexpected=[u'uuids']]
Traceback (most recent call last):
  File "/usr/lib/python2.7/site-packages/mistral/engine/task_handler.py", line 45, in run_task
    task.run()
  File "/usr/lib/python2.7/site-packages/osprofiler/profiler.py", line 154, in wrapper
    return f(*args, **kwargs)
  File "/usr/lib/python2.7/site-packages/mistral/engine/tasks.py", line 235, in run
    self._run_new()
  File "/usr/lib/python2.7/site-packages/mistral/engine/tasks.py", line 266, in _run_new
    self._schedule_actions()
  File "/usr/lib/python2.7/site-packages/mistral/engine/tasks.py", line 311, in _schedule_actions
    action.validate_input(input_dict)
  File "/usr/lib/python2.7/site-packages/mistral/engine/actions.py", line 285, in validate_input
    e_utils.validate_input(self.action_def, input_dict)
  File "/usr/lib/python2.7/site-packages/mistral/engine/utils.py", line 58, in validate_input
    msg % tuple(msg_props)
InputException: Invalid input [name=baremetal_introspection.wait_for_finish, class=NoneType, unexpected=[u'uuids']]

http://logs.openstack.org/periodic/periodic-tripleo-ci-centos-7-ovb-nonha/7566671/console.html#_2016-08-04_07_23_40_211292
http://logs.openstack.org/39/345539/2/check-tripleo/gate-tripleo-ci-centos-7-ovb-nonha/92d7e34/console.html#_2016-08-04_07_20_01_227415

Tags: ci
summary: - CI: introspection fails in CI nonha jobs
+ CI: nonha jobs fails in introspection step because of mistral error
Revision history for this message
Juan Antonio Osorio Robles (juan-osorio-robles) wrote :

It's an issue in Mistral and I'm trying to fix it here https://review.openstack.org/#/c/351134/

Changed in tripleo:
assignee: nobody → Juan Antonio Osorio Robles (juan-osorio-robles)
Revision history for this message
Alan Pevec (apevec) wrote :

Regression introduced in https://review.openstack.org/349968

Changed in tripleo:
status: New → In Progress
Revision history for this message
Attila Darazs (adarazs) wrote :

I see the fix from Juan is merged, I think it hit the master branch already but I'm still seeing this error in the latest promotion job. I don't think this is fixed.

Revision history for this message
Juan Antonio Osorio Robles (juan-osorio-robles) wrote :

Is it the same output? AFAIK mistral is broken at the moment because of some major refactoring that's going on. So it might be a different issue.

Revision history for this message
Sagi (Sergey) Shnaidman (sshnaidm) wrote :

@Juan, I think output is different and seems like it's a different issue as well.
Not sure if we need to open a new bug for it, because it's still because of mistral and on the same stage.

http://logs.openstack.org/periodic/periodic-tripleo-ci-centos-7-ovb-ha/f2d3624/console.html#_2016-08-06_07_31_42_904692

tripleo.sh -- Register nodes
#################
You must source a stackrc file for the Undercloud.
Attempting to source /home/jenkins/stackrc
Done

(pymysql.err.InternalError) (1054, u"Unknown column 'unique_key' in 'field list'") [SQL: u'INSERT INTO delayed_calls_v2 (created_at, id, factory_method_path, target_method_name, method_arguments, serializers, unique_key, auth_context, execution_time, processing) VALUES ...

Revision history for this message
Juan Antonio Osorio Robles (juan-osorio-robles) wrote :

So I ran into the same issue. mistral upstream is broken by changes that are adding a new architecture and way of doing things. I was told by the mistral team that on monday/tuesday they would make things work; lets see.

How I ended up fixing this was by going back to a working state of mistral and cherry-picking my fix.

Revision history for this message
Sagi (Sergey) Shnaidman (sshnaidm) wrote :

I think we have a separated bugs both in tripleo and mistral for this issue, so this one could be closed. Thanks, @Juan!

https://bugs.launchpad.net/tripleo/+bug/1610279
https://bugs.launchpad.net/mistral/+bug/1610269

Changed in tripleo:
importance: Undecided → Critical
status: In Progress → Fix Released
Revision history for this message
Emilien Macchi (emilienm) wrote :
tags: added: alert
Changed in tripleo:
status: Fix Released → Confirmed
Changed in tripleo:
assignee: Juan Antonio Osorio Robles (juan-osorio-robles) → nobody
Revision history for this message
Juan Antonio Osorio Robles (juan-osorio-robles) wrote :

I remember this issue involved several things:

* If mistral db-sync (or the part that fills the database with the openstack actions) fails, the errors are silently discarded.
  - We need this output in the logs.
* back then, the python-ironic-inspector required a working keystone in order to be created, if keystone is not available when this command is ran, it won't be able to create a client instance, and the dynamic creation of actions will fail. So we need keystone to be available before both ironic-inspector and before mistral.

Revision history for this message
Juan Antonio Osorio Robles (juan-osorio-robles) wrote :

Also, if SSL is used, we need haproxy to be available before everything too. This was the commit initially fixing those dependencies: https://review.openstack.org/#/c/353865/

Revision history for this message
Juan Antonio Osorio Robles (juan-osorio-robles) wrote :

This time it doesn't seem to be a dependency issue. I checked the execution times (and the line number where the log appears, and it really seems that db-populate is ran after the dependencies have ran already:

# db-populate
3171:2016-12-19 06:14:37.000 | 2016-12-19 06:14:37,111 INFO: Notice: /Stage[main]/Mistral::Db::Sync/Exec[mistral-db-populate]/returns: executed successfully
3172:2016-12-19 06:14:45.000 | 2016-12-19 06:14:45,271 INFO: Notice: /Stage[main]/Mistral::Db::Sync/Exec[mistral-db-populate]: Triggered 'refresh' from 5 events
# ironic-inspector
3168:2016-12-19 06:14:27.000 | 2016-12-19 06:14:27,918 INFO: Notice: /Stage[main]/Ironic::Inspector/Service[ironic-inspector]/ensure: ensure changed 'stopped' to 'running'
3169:2016-12-19 06:14:28.000 | 2016-12-19 06:14:28,575 INFO: Notice: /Stage[main]/Ironic::Inspector/Service[ironic-inspector-dnsmasq]/ensure: ensure changed 'stopped' to 'running'
# haproxy
3060:2016-12-19 06:07:22.000 | 2016-12-19 06:07:22,533 INFO: Notice: /Stage[main]/Haproxy/Haproxy::Instance[haproxy]/Haproxy::Service[haproxy]/File[/etc/sysconfig/haproxy]/content: content changed '{md5}2d8c2707741c581482214fefeed95a47' to '{md5}e2410b4501510c91882f106e6db20a3b'
3061:2016-12-19 06:07:23.000 | 2016-12-19 06:07:23,058 INFO: Notice: /Stage[main]/Haproxy/Haproxy::Instance[haproxy]/Haproxy::Service[haproxy]/Service[haproxy]/ensure: ensure changed 'stopped' to 'running'
# httpd
3083:2016-12-19 06:07:34.000 | 2016-12-19 06:07:34,709 INFO: Notice: /Stage[main]/Apache::Service/Service[httpd]/ensure: ensure changed 'stopped' to 'running'

So it might be an issue in how mistral creates the python-ironic-client Client instance? if someone can reproduce this we could check that out.

Revision history for this message
Julie Pichon (jpichon) wrote :

The action name with the problem in the trace is the same one than at bug 1649350 (baremetal_introspection.introspect) and slightly different from the description here, fwiw.

Changed in tripleo:
milestone: none → ocata-3
tags: added: promotion-blocker
tags: removed: alert promotion-blocker
Revision history for this message
Emilien Macchi (emilienm) wrote :

We reverted a patch in puppet-mistral and the problem is solved in tripleo. We'll have to rework the way we configure Mistral / Ironic in the Puppet modules. Closing it now.

Changed in tripleo:
status: Confirmed → Fix Released
To post a comment you must log in.
This report contains Public information  
Everyone can see this information.

Other bug subscribers

Remote bug watches

Bug watches keep track of this bug in other bug trackers.