Mistral DB/logs Grow Uncontrollably On Auth Failure

Bug #1775309 reported by Dan Sneddon
6
This bug affects 1 person
Affects Status Importance Assigned to Milestone
Mistral
Invalid
Undecided
Unassigned
tripleo
Confirmed
High
Unassigned

Bug Description

I installed the undercloud from RDO TripleO master-ci from May 31st, and used it as an all-in-one OpenStack/Ironic deployment. I used the Ironic instance to deploy a few bare-metal hosts, but did not deploy an overcloud. After the Keystone Admin API became temporarily unavailable, Mistral got caught in a loop, attempting the same tasks over and over again. This caused the Mistral DB to grow to 31GB, and the Mistral logs to grow to over 5GB, and the /var partition filled up.

The count of records in the Mistral DB tables after 3 days was:

mistral.task_executions_v2: 489,413 records
mistral.action_executions_v2: 613,495 records
mistral.workflow_executions_v2: 61,196 records

The actions that were being repeated were related to introspection, and the errors had to do with authentication. At some point the Keystone API was unavailable for a short time, but the errors in Mistral repeated for several days after the Keystone API was restored to operation.

Below is an example from the logs which shows the actions that are repeating over and over again. These were reflected in the database.

| 34a79011-9297-449e-9406-e6d778125936 | complete_swift | NULL | SUCCESS | NULL |
| 34a79bb7-ac2c-4bc7-a9b5-1b13f93c5aa7 | complete_swift | NULL | SUCCESS | NULL |
| 34a7b7ef-69d4-474c-9dc3-157ed30407a2 | check_status | NULL | ERROR | Failed by tasks: [u'send_message'] |
| 34a80ea3-7352-400a-b254-9281086b5157 | merge_payload | NULL | SUCCESS | NULL |
| 34a871e4-ec10-4602-a890-35bca1cdd58c | merge_payload | NULL | SUCCESS | NULL |
| 34a8756c-e41c-4dfe-8229-e8e673e9958c | send_message | NULL | ERROR | Workflow failed due to message status |
| 34a889d2-7eb2-4f65-ab05-cf7f242a7d31 | complete_swift | NULL | SUCCESS | NULL |
| 34a8bd2f-d48f-4944-a1e9-b02aefff0e7e | branch_workflow | NULL | SUCCESS | NULL |
| 34a90005-8525-4f12-8b1c-e2b4d3cd0e81 | merge_payload | NULL | SUCCESS | NULL |
| 34a9032d-a039-403d-b2f3-c5b9073f2316 | branch_workflow | NULL | SUCCESS | NULL |
| 34a91146-c8ec-4e04-b552-924217d9ac3d | send_message | NULL | ERROR | Failed to run action [action_ex_id=f20c7c8e-291b-44a0-b791-f75db7420807, action_cls='<class 'mistral.actions.action_factory.ZaqarAction'>', attributes='{u'client_method_name': u'queue_post'}', params='{u'queue_name': u'tripleo', u'messages': {u'body': {u'type': u'tripleo.baremetal.v1.introspect', u'payload': {u'status': u'FAILED', u'failed_introspection': [u'a5c148b5-1719-4571-ae32-44ad97e1cf16'], u'introspected_nodes': {u'a5c148b5-1719-4571-ae32-44ad97e1cf16': {u'uuid': u'a5c148b5-1719-4571-ae32-44ad97e1cf16', u'links': [{u'href': u'http://10.9.92.130:13050/v1/introspection/a5c148b5-1719-4571-ae32-44ad97e1cf16', u'rel': u'self'}], u'finished_at': None, u'state': u'waiting', u'finished': False, u'error': None, u'started_at': u'2018-06-01T18:09:08'}}, u'message': u'Unhandled workflow error', u'execution': {u'name': u'tripleo.baremetal.v1.introspect', u'created_at': u'2018-06-01 17:48:50', u'updated_at': u'2018-06-03 05:36:34', u'id': u'42c73f81-5270-4e04-9316-80edd1b54dee', u'params': {u'root_execution_id': u'3e257e45-b64c-47de-b662-585c83d5b5fe', u'index': 0, u'namespace': u'', u'task_execution_id': u'1a3110b8-c43e-4c69-a4bc-5d0bf5e2f5f0', u'env': {}}, u'input': {u'node_timeout': 1200, u'run_validations': False, u'node_uuids': [u'a5c148b5-1719-4571-ae32-44ad97e1cf16'], u'queue_name': u'tripleo', u'max_retry_attempts': 2, u'concurrency': 20}, u'spec': {u'tasks': {u'run_validations': {u'name': u'run_validations', u'on-error': u'set_validations_failed', u'workflow': u'tripleo.validations.v1.run_groups', u'on-success': u'introspect_nodes', u'version': u'2.0', u'input': {u'queue_name': u'<% $.queue_name %>', u'group_names': [u'pre-introspection']}, u'type': u'direct'}, u'successful_introspection': {u'on-complete': u'send_message', u'version': u'2.0', u'type': u'direct', u'name': u'successful_introspection', u'publish': {u'status': u'SUCCESS', u'message': u'Successfully introspected <% $.introspected_nodes.len() %> node(s).'}}, u'increase_attempt_counter': {u'on-complete': u'retry_failed_nodes', u'version': u'2.0', u'type': u'direct', u'name': u'increase_attempt_counter', u'publish': {u'introspection_attempt': u'<% $.introspection_attempt + 1 %>'}}, u'send_message': {u'name': u'send_message', u'workflow': u'tripleo.messaging.v1.send', u'on-success': [{u'fail': u'<% $.get(\'status\') = "FAILED" %>'}], u'version': u'2.0', u'input': {u'status': u"<% $.get('status', 'SUCCESS') %>", u'queue_name': u'<% $.queue_name %>', u'message': u"<% $.get('message', '') %>", u'execution': u'<% execution() %>', u'type': u'<% execution().name %>', u'payload': u"<% { introspected_nodes => $.get('introspected_nodes', []), failed_introspection => $.get('failed_introspection', []) } %>"}, u'type': u'direct'}, u'introspect_nodes': {u'with-items': u'uuid in <% $.node_uuids %>', u'name': u'introspect_nodes', u'on-error': [{u'get_introspection_status': u'<% $.introspection_attempt <= $.max_retry_attempts %>'}, {u'max_retry_attempts_reached': u'<% $.introspection_attempt > $.max_retry_attempts %>'}], u'workflow': u'_introspect', u'on-success': u'get_introspection_status', u'version': u'2.0', u'concurrency': u'<% $.concurrency %>', u'input': {u'node_uuid': u'<% $.uuid %>', u'queue_name': u'<% $.queue_name %>', u'timeout': u'<% $.node_timeout %>'}, u'type': u'direct'}, u'set_validations_failed': {u'version': u'2.0', u'type': u'direct', u'name': u'set_validations_failed', u'publish': {u'status': u'FAILED', u'message': u'<% task(run_validations).result %>'}, u'on-success': u'send_message'}, u'retry_failed_nodes': {u'version': u'2.0', u'type': u'direct', u'name': u'retry_failed_nodes', u'publish': {u'status': u'RUNNING', u'message': u"<% 'Retrying {0} nodes that failed introspection. Attempt {1} of {2} '.format($.failed_introspection.len(), $.introspection_attempt, $.max_retry_attempts + 1) %>", u'node_uuids': u'<% $.failed_introspection %>'}, u'on-success': [u'send_message', u'introspect_nodes']}, u'max_retry_attempts_reached': {u'on-complete': u'send_message', u'version': u'2.0', u'type': u'direct', u'name': u'max_retry_attempts_reached', u'publish': {u'status': u'FAILED', u'message': u"<% 'Retry limit reached with {0} nodes still failing introspection'.format($.failed_introspection.len()) %>"}}, u'initialize': {u'on-complete': [{u'run_validations': u'<% $.run_validations %>'}, {u'introspect_nodes': u'<% not $.run_validations %>'}], u'version': u'2.0', u'type': u'direct', u'name': u'initialize', u'publish': {u'introspection_attempt': 1}}, u'unhandled_error': {u'on-complete': u'send_message', u'version': u'2.0', u'type': u'direct', u'name': u'unhandled_error', u'publish': {u'status': u'FAILED', u'message': u'Unhandled workflow error'}}, u'get_introspection_status': {u'with-items': u'uuid in <% $.node_uuids %>', u'name': u'get_introspection_status', u'on-error': u'increase_attempt_counter', u'on-success': [{u'successful_introspection': u'<% $.failed_introspection.len() = 0 %>'}, {u'increase_attempt_counter': u'<% $.failed_introspection.len() > 0 %>'}], u'publish-on-error': {u'failed_introspection': u'<% ($.node_uuids.toSet() - task().result.where(isDict($)).select($.uuid).toSet()) + task().result.where(isDict($)).where($.finished = true and $.error != null).toSet() + task().result.where(isDict($)).where($.finished = false).toSet() %>', u'introspected_nodes': u'<% task().result.where(isDict($)).toDict($.uuid, $) %>'}, u'publish': {u'failed_introspection': u'<% task().result.where($.finished = true and $.error != null).select($.uuid) + task().result.where($.finished = false).select($.uuid) %>', u'introspected_nodes': u'<% task().result.toDict($.uuid, $) %>'}, u'version': u'2.0', u'action': u'baremetal_introspection.get_status', u'input': {u'uuid': u'<% $.uuid %>'}, u'type': u'direct'}}, u'name': u'introspect', u'tags': [u'tripleo-common-managed'], u'task-defaults': {u'on-error': u'unhandled_error'}, u'version': u'2.0', u'input': [u'node_uuids', {u'run_validations': False}, {u'queue_name': u'tripleo'}, {u'concurrency': 20}, {u'max_retry_attempts': 2}, {u'node_timeout': 1200}], u'description': u'Take a list of nodes and move them through introspection.\nBy default each node will attempt introspection up to 3 times (two retries plus the initial attemp) if it fails. This behaviour can be modified by changing the max_retry_attempts input.\nThe workflow will assume the node has timed out after 20 minutes (1200 seconds). This can be changed by passing the node_timeout input in seconds.\n'}}, u'plan_name': None, u'execution_id': u'42c73f81-5270-4e04-9316-80edd1b54dee', u'deployment_status': None}}}}']
 ZaqarAction.queue_post failed: Error response from Zaqar. Code: 401. Text: {"error": {"message": "The request you have made requires authentication.", "code": 401, "title": "Unauthorized"}}. |
| 34a95717-8ea2-4b38-8511-8f21f1fe6ca7 | complete_swift | NULL | SUCCESS | NULL |
| 34a95919-37de-4c23-948e-3c12aeb84283 | branch_workflow | NULL | SUCCESS | NULL |
| 34a9ba97-0df8-4b67-ab97-33b21a8f089f | check_status | NULL | ERROR | Failed by tasks: [u'send_message'] |
| 34a9c58b-721d-463d-bd7f-66eb55e3cfe8 | merge_payload | NULL | SUCCESS | NULL |
| 34a9cc59-ca58-44b2-8bba-7c4540d1d32a | merge_payload | NULL | SUCCESS | NULL |

Revision history for this message
Dan Sneddon (dsneddon) wrote :

Versions installed:

openstack-mistral-engine-7.0.0-0.20180523140054.573d2d0.el7.noarch
openstack-mistral-common-7.0.0-0.20180523140054.573d2d0.el7.noarch
openstack-mistral-executor-7.0.0-0.20180523140054.573d2d0.el7.noarch
openstack-mistral-api-7.0.0-0.20180523140054.573d2d0.el7.noarch
python2-mistral-lib-0.5.0-0.20180425121912.ef3482e.el7.centos.noarch
python2-mistralclient-3.5.0-0.20180518153811.8cc7648.el7.noarch
python-mistral-7.0.0-0.20180523140054.573d2d0.el7.noarch
puppet-mistral-13.0.1-0.20180528142125.a94142c.el7.noarch

Revision history for this message
Dan Sneddon (dsneddon) wrote :

Steps taken:

1) Installed undercloud
2) Imported baremetal hosts.
3) Started introspection
4) Keystone service taken down before introspection completed (introspection eventually failed)
5) Keystone service brought back up
6) Started introspection, introspection succeeds
7) Deployed baremetal hosts with Ironic
8) Left system for 3 days, Mistral DB and logs continued to fill up disk

Revision history for this message
Dougal Matthews (d0ugal) wrote :

I'm almost certain this is a bug with the tripleo workflows and not a bug with Mistral itself. So I have added tripleo and marked it as invalid.

It may be that there is a Mistral feature to help avoid this issue, but lets see if we can track down the issue first.

The introspection workflows contain some iteration, my guess is that we never escape the loop if there is an authentication failure.

Changed in tripleo:
status: New → Confirmed
importance: Undecided → High
Changed in mistral:
status: New → Invalid
Revision history for this message
Juan Antonio Osorio Robles (juan-osorio-robles) wrote :

Is this still an issue?

Revision history for this message
Dan Sneddon (dsneddon) wrote :

@juan-osorio-robles

I don't know if this is still an issue, I haven't gotten a chance to test it recently.

Revision history for this message
Juan Antonio Osorio Robles (juan-osorio-robles) wrote :

Let me know if you do.

To post a comment you must log in.
This report contains Public information  
Everyone can see this information.

Other bug subscribers

Remote bug watches

Bug watches keep track of this bug in other bug trackers.