[TRAIN] mistral workflows that take longer than 10 minutes to generate a message fatal even if it's still running

Bug #1891071 reported by Alex Schultz
6
This bug affects 1 person
Affects Status Importance Assigned to Milestone
tripleo
Fix Released
Medium
Alex Schultz

Bug Description

Seen in CI. The tripleo.deployment.v1.deploy_plan started and took 12 minutes to finished. However since we do not generate messages in the middle of the deploy_plan, the websocket times out at 600 seconds and we failed the deploy even though the execution was still running. Eventually it finished succesfully but we had already failed the deploy.

https://e32e77e5dd9922ea1bc2-0c9507f3c7942864f4e3074b37139748.ssl.cf5.rackcdn.com/741228/9/check/tripleo-ci-centos-8-containers-multinode/e532945/logs/undercloud/home/zuul/overcloud_deploy.log

2020-08-10 15:33:24 | + openstack overcloud deploy --override-ansible-cfg /home/zuul/custom_ansible.cfg --templates /usr/share/openstack-tripleo-heat-templates --libvirt-type qemu --timeout 90 -e /home/zuul/cloud-names.yaml -e /home/zuul/hostnamemap.yaml -e /usr/share/openstack-tripleo-heat-templates/environments/deployed-server-environment.yaml --overcloud-ssh-user zuul -e /home/zuul/containers-prepare-parameter.yaml -e /usr/share/openstack-tripleo-heat-templates/environments/podman.yaml -e /usr/share/openstack-tripleo-heat-templates/ci/environments/multinode-containers.yaml -e /home/zuul/overcloud_network_params.yaml -e /home/zuul/overcloud_storage_params.yaml -e /usr/share/openstack-tripleo-heat-templates/environments/low-memory-usage.yaml -e /home/zuul/src/opendev.org/openstack/tripleo-ci/test-environments/worker-config.yaml -e /usr/share/openstack-tripleo-heat-templates/environments/debug.yaml --validation-errors-nonfatal --roles-file /home/zuul/overcloud_roles.yaml -e /home/zuul/overcloud-topology-config.yaml -e /home/zuul/config-download.yaml --disable-validations -e /home/zuul/overcloud-selinux-config.yaml
2020-08-10 15:35:34 | /usr/lib/python3.6/site-packages/tripleoclient/utils.py:2137: YAMLLoadWarning: calling yaml.load() without Loader=... is deprecated, as the default Loader is unsafe. Please read https://msg.pyyaml.org/load for full details.
2020-08-10 15:35:34 | content = yaml.load(f)
2020-08-10 15:35:34 | WARNING: Following parameter(s) are defined but not currently used in the deployment plan. These parameters may be valid but not in use due to the service or deployment configuration. CephPoolDefaultPgNum, CephPoolDefaultSize, CinderLVMLoopDeviceSize, CinderWorkers, ComputeCount, GnocchiMetricdWorkers, HeatWorkers, ManilaCephFSDataPoolPGNum, ManilaCephFSMetadataPoolPGNum, MistralDockerGroup, NovaComputeExtraConfig, SaharaWorkers, SwiftRingGetTempurl, SwiftRingPutTempurl, SwiftWorkers
2020-08-10 15:45:36 | Timed out waiting for messages from Execution (ID: e443bbba-f480-4c78-be59-a31c608e3236, State: RUNNING). The WebSocket timed out before the Workflow completed.

From the engine logs...

2020-08-10 15:35:34.863 6 INFO workflow_trace [req-8baccced-db86-4153-a5eb-8e4cf95d3420 9c5f32d61b7c43d8a43fea978d071a55 c2fde68cfb084797956f5903bbd002f3 - default default] Workflow 'tripleo.deployment.v1.deploy_plan' [IDLE -> RUNNING, msg=None] (execution_id=e443bbba-f480-4c78-be59-a31c608e3236)
2020-08-10 15:35:34.881 6 DEBUG mistral.engine.tasks [req-8baccced-db86-4153-a5eb-8e4cf95d3420 9c5f32d61b7c43d8a43fea978d071a55 c2fde68cfb084797956f5903bbd002f3 - default default] Starting task [name=send_start_message, init_state=RUNNING, workflow_name=tripleo.deployment.v1.deploy_plan, execution_id=e443bbba-f480-4c78-be59-a31c608e3236] _run_new /usr/lib/python3.6/site-packages/mistral/engine/tasks.py:525
2020-08-10 15:35:37.716 6 INFO workflow_trace [req-8baccced-db86-4153-a5eb-8e4cf95d3420 9c5f32d61b7c43d8a43fea978d071a55 c2fde68cfb084797956f5903bbd002f3 - default default] Task 'send_start_message' (05bf1573-afc4-4118-927f-96c5812861ad) [RUNNING -> SUCCESS, msg=None] (execution_id=e443bbba-f480-4c78-be59-a31c608e3236)
2020-08-10 15:35:37.724 6 DEBUG mistral.engine.tasks [req-8baccced-db86-4153-a5eb-8e4cf95d3420 9c5f32d61b7c43d8a43fea978d071a55 c2fde68cfb084797956f5903bbd002f3 - default default] Starting task [name=add_validation_ssh_key, init_state=RUNNING, workflow_name=tripleo.deployment.v1.deploy_plan, execution_id=e443bbba-f480-4c78-be59-a31c608e3236] _run_new /usr/lib/python3.6/site-packages/mistral/engine/tasks.py:525
2020-08-10 15:35:45.719 6 DEBUG mistral.services.legacy_scheduler [req-109dbc06-596e-4fb5-bdb0-46f5306351fe 9c5f32d61b7c43d8a43fea978d071a55 c2fde68cfb084797956f5903bbd002f3 - default default] Preparing next delayed call. [ID=1ce6fc48-c19c-47a8-aff2-e954716db862, factory_method_path=None, target_method_name=mistral.engine.workflow_handler._check_and_fix_integrity, method_arguments={'wf_ex_id': 'e443bbba-f480-4c78-be59-a31c608e3236'}] _prepare_calls /usr/lib/python3.6/site-packages/mistral/services/legacy_scheduler.py:238
2020-08-10 15:35:54.739 6 INFO workflow_trace [req-8baccced-db86-4153-a5eb-8e4cf95d3420 9c5f32d61b7c43d8a43fea978d071a55 c2fde68cfb084797956f5903bbd002f3 - default default] Task 'add_validation_ssh_key' (b0ad3c1e-d8fc-4327-9aa6-1181a4f2f949) [RUNNING -> SUCCESS, msg=None] (execution_id=e443bbba-f480-4c78-be59-a31c608e3236)
2020-08-10 15:35:54.754 6 DEBUG mistral.engine.tasks [req-8baccced-db86-4153-a5eb-8e4cf95d3420 9c5f32d61b7c43d8a43fea978d071a55 c2fde68cfb084797956f5903bbd002f3 - default default] Starting task [name=container_image_prepare_params, init_state=RUNNING, workflow_name=tripleo.deployment.v1.deploy_plan, execution_id=e443bbba-f480-4c78-be59-a31c608e3236] _run_new /usr/lib/python3.6/site-packages/mistral/engine/tasks.py:525
2020-08-10 15:37:45.887 6 DEBUG mistral.services.legacy_scheduler [req-109dbc06-596e-4fb5-bdb0-46f5306351fe 9c5f32d61b7c43d8a43fea978d071a55 c2fde68cfb084797956f5903bbd002f3 - default default] Preparing next delayed call. [ID=59bf7b42-633b-4157-9b90-8db8c98b4b06, factory_method_path=None, target_method_name=mistral.engine.workflow_handler._check_and_fix_integrity, method_arguments={'wf_ex_id': 'e443bbba-f480-4c78-be59-a31c608e3236'}] _prepare_calls /usr/lib/python3.6/site-packages/mistral/services/legacy_scheduler.py:238
2020-08-10 15:39:45.059 6 DEBUG mistral.services.legacy_scheduler [req-109dbc06-596e-4fb5-bdb0-46f5306351fe 9c5f32d61b7c43d8a43fea978d071a55 c2fde68cfb084797956f5903bbd002f3 - default default] Preparing next delayed call. [ID=b2d5f2e0-dee7-445d-9a5d-4b630b3c48ff, factory_method_path=None, target_method_name=mistral.engine.workflow_handler._check_and_fix_integrity, method_arguments={'wf_ex_id': 'e443bbba-f480-4c78-be59-a31c608e3236'}] _prepare_calls /usr/lib/python3.6/site-packages/mistral/services/legacy_scheduler.py:238
2020-08-10 15:41:45.343 6 DEBUG mistral.services.legacy_scheduler [req-109dbc06-596e-4fb5-bdb0-46f5306351fe 9c5f32d61b7c43d8a43fea978d071a55 c2fde68cfb084797956f5903bbd002f3 - default default] Preparing next delayed call. [ID=846301f1-6dc2-4662-92ce-535368719fdd, factory_method_path=None, target_method_name=mistral.engine.workflow_handler._check_and_fix_integrity, method_arguments={'wf_ex_id': 'e443bbba-f480-4c78-be59-a31c608e3236'}] _prepare_calls /usr/lib/python3.6/site-packages/mistral/services/legacy_scheduler.py:238
2020-08-10 15:43:45.587 6 DEBUG mistral.services.legacy_scheduler [req-109dbc06-596e-4fb5-bdb0-46f5306351fe 9c5f32d61b7c43d8a43fea978d071a55 c2fde68cfb084797956f5903bbd002f3 - default default] Preparing next delayed call. [ID=655d267a-e4cb-4005-b679-bab601f908dd, factory_method_path=None, target_method_name=mistral.engine.workflow_handler._check_and_fix_integrity, method_arguments={'wf_ex_id': 'e443bbba-f480-4c78-be59-a31c608e3236'}] _prepare_calls /usr/lib/python3.6/site-packages/mistral/services/legacy_scheduler.py:238
2020-08-10 15:45:45.752 6 DEBUG mistral.services.legacy_scheduler [req-109dbc06-596e-4fb5-bdb0-46f5306351fe 9c5f32d61b7c43d8a43fea978d071a55 c2fde68cfb084797956f5903bbd002f3 - default default] Preparing next delayed call. [ID=5ac85a3e-6b18-45b0-98e5-f73234c5e5dc, factory_method_path=None, target_method_name=mistral.engine.workflow_handler._check_and_fix_integrity, method_arguments={'wf_ex_id': 'e443bbba-f480-4c78-be59-a31c608e3236'}] _prepare_calls /usr/lib/python3.6/site-packages/mistral/services/legacy_scheduler.py:238
2020-08-10 15:47:38.073 6 INFO workflow_trace [req-8baccced-db86-4153-a5eb-8e4cf95d3420 9c5f32d61b7c43d8a43fea978d071a55 c2fde68cfb084797956f5903bbd002f3 - default default] Task 'container_image_prepare_params' (8f035b31-f7e2-4a0c-8c20-73e376ed1d0c) [RUNNING -> SUCCESS, msg=None] (execution_id=e443bbba-f480-4c78-be59-a31c608e3236)
2020-08-10 15:47:38.081 6 DEBUG mistral.engine.tasks [req-8baccced-db86-4153-a5eb-8e4cf95d3420 9c5f32d61b7c43d8a43fea978d071a55 c2fde68cfb084797956f5903bbd002f3 - default default] Starting task [name=create_swift_rings_backup_plan, init_state=RUNNING, workflow_name=tripleo.deployment.v1.deploy_plan, execution_id=e443bbba-f480-4c78-be59-a31c608e3236] _run_new /usr/lib/python3.6/site-packages/mistral/engine/tasks.py:525
2020-08-10 15:47:45.115 6 DEBUG mistral.services.legacy_scheduler [req-109dbc06-596e-4fb5-bdb0-46f5306351fe 9c5f32d61b7c43d8a43fea978d071a55 c2fde68cfb084797956f5903bbd002f3 - default default] Preparing next delayed call. [ID=7a3cedb0-86b0-4ff6-b380-2caef4c08716, factory_method_path=None, target_method_name=mistral.engine.workflow_handler._check_and_fix_integrity, method_arguments={'wf_ex_id': 'e443bbba-f480-4c78-be59-a31c608e3236'}] _prepare_calls /usr/lib/python3.6/site-packages/mistral/services/legacy_scheduler.py:238
2020-08-10 15:48:01.963 6 INFO workflow_trace [req-8baccced-db86-4153-a5eb-8e4cf95d3420 9c5f32d61b7c43d8a43fea978d071a55 c2fde68cfb084797956f5903bbd002f3 - default default] Task 'create_swift_rings_backup_plan' (3d88d1b2-d623-4a50-a5ef-d06c59bc68c5) [RUNNING -> SUCCESS, msg=None] (execution_id=e443bbba-f480-4c78-be59-a31c608e3236)
2020-08-10 15:48:01.975 6 DEBUG mistral.engine.tasks [req-8baccced-db86-4153-a5eb-8e4cf95d3420 9c5f32d61b7c43d8a43fea978d071a55 c2fde68cfb084797956f5903bbd002f3 - default default] Starting task [name=cell_v2_discover_hosts, init_state=RUNNING, workflow_name=tripleo.deployment.v1.deploy_plan, execution_id=e443bbba-f480-4c78-be59-a31c608e3236] _run_new /usr/lib/python3.6/site-packages/mistral/engine/tasks.py:525
2020-08-10 15:48:12.689 6 INFO workflow_trace [req-8baccced-db86-4153-a5eb-8e4cf95d3420 9c5f32d61b7c43d8a43fea978d071a55 c2fde68cfb084797956f5903bbd002f3 - default default] Task 'cell_v2_discover_hosts' (8f442d62-fc71-4cfb-8bb4-d812891d2f42) [RUNNING -> SUCCESS, msg=None] (execution_id=e443bbba-f480-4c78-be59-a31c608e3236)
2020-08-10 15:48:12.696 6 DEBUG mistral.engine.tasks [req-8baccced-db86-4153-a5eb-8e4cf95d3420 9c5f32d61b7c43d8a43fea978d071a55 c2fde68cfb084797956f5903bbd002f3 - default default] Starting task [name=send_stack_deployment_message, init_state=RUNNING, workflow_name=tripleo.deployment.v1.deploy_plan, execution_id=e443bbba-f480-4c78-be59-a31c608e3236] _run_new /usr/lib/python3.6/site-packages/mistral/engine/tasks.py:525
2020-08-10 15:48:16.748 6 INFO workflow_trace [req-8baccced-db86-4153-a5eb-8e4cf95d3420 9c5f32d61b7c43d8a43fea978d071a55 c2fde68cfb084797956f5903bbd002f3 - default default] Task 'send_stack_deployment_message' (d0eaec13-db6e-411c-9483-35c38fafcc64) [RUNNING -> SUCCESS, msg=None] (execution_id=e443bbba-f480-4c78-be59-a31c608e3236)
2020-08-10 15:48:16.763 6 DEBUG mistral.engine.tasks [req-8baccced-db86-4153-a5eb-8e4cf95d3420 9c5f32d61b7c43d8a43fea978d071a55 c2fde68cfb084797956f5903bbd002f3 - default default] Starting task [name=deploy, init_state=RUNNING, workflow_name=tripleo.deployment.v1.deploy_plan, execution_id=e443bbba-f480-4c78-be59-a31c608e3236] _run_new /usr/lib/python3.6/site-packages/mistral/engine/tasks.py:525
2020-08-10 15:49:14.941 6 INFO workflow_trace [req-8baccced-db86-4153-a5eb-8e4cf95d3420 9c5f32d61b7c43d8a43fea978d071a55 c2fde68cfb084797956f5903bbd002f3 - default default] Task 'deploy' (60391d6b-313c-4143-8338-e8c64f3dd398) [RUNNING -> SUCCESS, msg=None] (execution_id=e443bbba-f480-4c78-be59-a31c608e3236)
2020-08-10 15:49:14.954 6 DEBUG mistral.engine.tasks [req-8baccced-db86-4153-a5eb-8e4cf95d3420 9c5f32d61b7c43d8a43fea978d071a55 c2fde68cfb084797956f5903bbd002f3 - default default] Starting task [name=set_deployment_deploying, init_state=RUNNING, workflow_name=tripleo.deployment.v1.deploy_plan, execution_id=e443bbba-f480-4c78-be59-a31c608e3236] _run_new /usr/lib/python3.6/site-packages/mistral/engine/tasks.py:525
2020-08-10 15:49:15.003 6 INFO workflow_trace [req-8baccced-db86-4153-a5eb-8e4cf95d3420 9c5f32d61b7c43d8a43fea978d071a55 c2fde68cfb084797956f5903bbd002f3 - default default] Task 'set_deployment_deploying' (191ef824-b0d2-4922-921b-128102243369) [RUNNING -> SUCCESS, msg=None] (execution_id=e443bbba-f480-4c78-be59-a31c608e3236)
2020-08-10 15:49:15.012 6 DEBUG mistral.engine.tasks [req-8baccced-db86-4153-a5eb-8e4cf95d3420 9c5f32d61b7c43d8a43fea978d071a55 c2fde68cfb084797956f5903bbd002f3 - default default] Starting task [name=send_message, init_state=RUNNING, workflow_name=tripleo.deployment.v1.deploy_plan, execution_id=e443bbba-f480-4c78-be59-a31c608e3236] _run_new /usr/lib/python3.6/site-packages/mistral/engine/tasks.py:525
2020-08-10 15:49:18.007 6 INFO workflow_trace [req-8baccced-db86-4153-a5eb-8e4cf95d3420 9c5f32d61b7c43d8a43fea978d071a55 c2fde68cfb084797956f5903bbd002f3 - default default] Task 'send_message' (4479eebd-5274-49e0-a525-1a030ab549af) [RUNNING -> SUCCESS, msg=None] (execution_id=e443bbba-f480-4c78-be59-a31c608e3236)
2020-08-10 15:49:18.045 6 DEBUG mistral.engine.workflows [-] Workflow completed [id=e443bbba-f480-4c78-be59-a31c608e3236] check_and_complete /usr/lib/python3.6/site-packages/mistral/engine/workflows.py:454
2020-08-10 15:49:18.090 6 INFO workflow_trace [-] Workflow 'tripleo.deployment.v1.deploy_plan' [RUNNING -> SUCCESS, msg=None] (execution_id=e443bbba-f480-4c78-be59-a31c608e3236)
2020-08-10 15:49:45.282 6 DEBUG mistral.services.legacy_scheduler [req-109dbc06-596e-4fb5-bdb0-46f5306351fe 9c5f32d61b7c43d8a43fea978d071a55 c2fde68cfb084797956f5903bbd002f3 - default default] Preparing next delayed call. [ID=9981443f-f422-43f2-b19a-450f4d3563c6, factory_method_path=None, target_method_name=mistral.engine.workflow_handler._check_and_fix_integrity, method_arguments={'wf_ex_id': 'e443bbba-f480-4c78-be59-a31c608e3236'}] _prepare_calls /usr/lib/python3.6/site-packages/mistral/services/legacy_scheduler.py:238

Revision history for this message
Alex Schultz (alex-schultz) wrote :

Since we dropped mistral for the most part in U, this only affects <= Train

summary: - mistral workflows that take longer than 10 minutes to generate a message
- fatal even if it's still running
+ [TRAIN] mistral workflows that take longer than 10 minutes to generate a
+ message fatal even if it's still running
tags: removed: ussuri-backport-potential
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to python-tripleoclient (stable/train)

Fix proposed to branch: stable/train
Review: https://review.opendev.org/745583

Revision history for this message
Alex Schultz (alex-schultz) wrote :
Changed in tripleo:
assignee: nobody → Alex Schultz (alex-schultz)
status: Triaged → In Progress
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to python-tripleoclient (stable/train)

Reviewed: https://review.opendev.org/745583
Committed: https://git.openstack.org/cgit/openstack/python-tripleoclient/commit/?id=846911ef53528e1f3437a179296bccb57d640387
Submitter: Zuul
Branch: stable/train

commit 846911ef53528e1f3437a179296bccb57d640387
Author: Alex Schultz <email address hidden>
Date: Mon Aug 10 13:45:14 2020 -0600

    [TRAIN-Only] Increase deploy workflow wait

    During the execution of the deploy plan, we will also run container
    image prepare in a dry run. This may still reach out to determine tags
    which can contribute to the process taking longer than 10 minutes in CI.

    We are doubling the wait time from 10 minutes to 20 minutes for now.
    Eventually we should probably not exit if the task is still running.

    We dropped this code in Ussuri so it only applies in Train.

    Change-Id: Ifa94bffcf4fadc5d4a87675b5d4a3357e8808ac3
    Parial-Bug: #1891071

tags: added: in-stable-train
Changed in tripleo:
milestone: victoria-3 → wallaby-1
Changed in tripleo:
milestone: wallaby-1 → wallaby-2
Changed in tripleo:
status: In Progress → Fix Released
To post a comment you must log in.
This report contains Public information  
Everyone can see this information.

Other bug subscribers

Remote bug watches

Bug watches keep track of this bug in other bug trackers.