Deployment workflow: mistral action error because heartbeat wasn't received

Bug #1821611 reported by Emilien Macchi
This bug affects 2 people
Affects: tripleo
Status: Fix Released
Importance: High
Assigned to: Emilien Macchi

Bug Description

Environment: Stein on RHEL8 (Python3).

The "openstack overcloud deploy" command fails during step 1 (or so):

Exception occured while running the command
Traceback (most recent call last):
  File "/usr/lib/python3.6/site-packages/tripleoclient/command.py", line 29, in run
    super(Command, self).run(parsed_args)
  File "/usr/lib/python3.6/site-packages/osc_lib/command/command.py", line 41, in run
    return super(Command, self).run(parsed_args)
  File "/usr/lib/python3.6/site-packages/cliff/command.py", line 184, in run
    return_code = self.take_action(parsed_args) or 0
  File "/usr/lib/python3.6/site-packages/tripleoclient/v1/overcloud_deploy.py", line 949, in take_action
    verbosity=self.app_args.verbose_level)
  File "/usr/lib/python3.6/site-packages/tripleoclient/workflows/deployment.py", line 323, in config_download
    raise exceptions.DeploymentError("Overcloud configuration failed.")
tripleoclient.exceptions.DeploymentError: Overcloud configuration failed.
Overcloud configuration failed.

But the actual deployment is still running (in a Mistral workflow), and it eventually completes successfully.

However, the operator gets an error back from tripleoclient.

Looking at Mistral Engine logs:
engine.log:2019-03-15 20:35:38.214 1 INFO mistral.engine.engine_server [req-b75cf539-88db-4419-9d44-588660cb26a2 - - - - -] Received RPC request 'report_running_actions'[action_ex_ids=['11d76cda-4906-442b-bcc5-f3e118890e55']]
engine.log:2019-03-15 20:40:58.388 1 INFO mistral.services.action_execution_checker [req-940d159f-4ca6-4d06-8351-fe9b6c118840 - - - - -] Actions executions to transit to error, because heartbeat wasn't received
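
To make the timing in these two log lines explicit, here is a minimal sketch that computes the gap between the last 'report_running_actions' request shown and the checker moving the action to error. The timestamps are copied from the log lines above; the variable names are illustrative only.

```python
from datetime import datetime

# Timestamp format used in the Mistral engine log lines quoted above.
FMT = "%Y-%m-%d %H:%M:%S.%f"

# Last heartbeat report shown in the log excerpt.
reported = datetime.strptime("2019-03-15 20:35:38.214", FMT)
# Moment the action execution checker moved the action to error.
errored = datetime.strptime("2019-03-15 20:40:58.388", FMT)

gap_seconds = (errored - reported).total_seconds()
print(f"{gap_seconds:.3f} s between heartbeat report and error transition")
```

The gap comes out at a little over five minutes, while the deployment itself was still running.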

Workaround: increase the heartbeat settings in Mistral:
max_missed_heartbeats = 30
check_interval = 40
first_heartbeat_timeout = 7200
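
As a rough illustration of how these three values might look in a Mistral configuration file, the sketch below renders them as an INI section. The report does not say which file or section the options belong to; the section name `action_heartbeat` used here is an assumption, and the helper name is hypothetical.

```python
import configparser
import io

# Values taken from the workaround above.
WORKAROUND = {
    "max_missed_heartbeats": "30",
    "check_interval": "40",
    "first_heartbeat_timeout": "7200",
}


def render_heartbeat_section(values, section="action_heartbeat"):
    """Render the options as an INI section (section name is an assumption)."""
    parser = configparser.ConfigParser()
    parser[section] = values
    buf = io.StringIO()
    parser.write(buf)
    return buf.getvalue()


rendered = render_heartbeat_section(WORKAROUND)
print(rendered)
```

Check the section name against the Mistral configuration in use before applying anything like this to a real undercloud.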

Changed in tripleo:
status: New → Triaged
importance: Undecided → High
milestone: none → stein-rc1
Changed in tripleo:
assignee: nobody → Emilien Macchi (emilienm)
Changed in tripleo:
status: Triaged → In Progress
OpenStack Infra (hudson-openstack) wrote : Fix merged to tripleo-heat-templates (master)

Reviewed: https://review.openstack.org/647597
Committed: https://git.openstack.org/cgit/openstack/tripleo-heat-templates/commit/?id=374fafd66afa792ba197403b479dadbfa3055bce
Submitter: Zuul
Branch: master

commit 374fafd66afa792ba197403b479dadbfa3055bce
Author: Emilien Macchi <email address hidden>
Date: Mon Mar 25 15:48:47 2019 -0400

    mistral: configure heartbeat parameters to avoid action timeout

    This patch configures and increases the defaults heartbeat parameters in
    Mistral so we don't hit timeouts when an action in a workflow takes
    times to reply back in Mistral, when deploying an Overcloud.

    Parameters added:

      MistralMaxMissedHeartbeats:
        type: number
        default: 15
        description: >
            The maximum amount of missed heartbeats to be allowed.
            If set to 0 then this feature is disabled. See check_interval for more
            details.
        constraints:
          - range: { min: 0 }
      MistralCheckInterval:
        type: number
        default: 20
        description: >
            How often (in seconds) action executions are checked.
            For example when check_interval is 10, check action
            executions every 10 seconds. When the checker runs it will
            transit all running action executions to error if the last
            heartbeat received is older than 10 * max_missed_heartbeats
            seconds. If set to 0 then this feature is disabled.
        constraints:
          - range: { min: 0 }
      MistralFirstHeartbeatTimeout:
        type: number
        default: 3600
        description: >
            The first heartbeat is handled differently, to provide a
            grace period in case there is no available executor to handle
            the action execution. For example when
            first_heartbeat_timeout = 3600, wait 3600 seconds before
            closing the action executions that never received a heartbeat.
        constraints:
          - range: { min: 0 }

    Configuration applied to Undercloud:
    Maximum missed heartbeats: 30 seconds
    Time between interval checks: 40 seconds
    First Heartbeat timeout after 7200 seconds

    Depends-On: I7a2313bed58485e077ae210d222902f4f997f0f0
    Change-Id: Id8663e76b61c9e09547c228da226b706383a3e20
    Closes-Bug: #1821611
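
The parameter descriptions in the commit message imply that a running action is moved to error once its last heartbeat is older than check_interval * max_missed_heartbeats seconds. The sketch below works out that window for the template defaults and for the values applied to the undercloud. It is based only on the description text quoted above, not on Mistral's source code, and the function name is hypothetical.

```python
def heartbeat_window(check_interval, max_missed_heartbeats):
    """Seconds without a heartbeat before a running action is moved to error.

    Returns None when either value is 0, which the parameter
    descriptions say disables the feature.
    """
    if check_interval == 0 or max_missed_heartbeats == 0:
        return None
    return check_interval * max_missed_heartbeats


# Template defaults from the commit: 20 s interval, 15 missed heartbeats.
default_window = heartbeat_window(check_interval=20, max_missed_heartbeats=15)
# Values applied to the undercloud: 40 s interval, 30 missed heartbeats.
undercloud_window = heartbeat_window(check_interval=40, max_missed_heartbeats=30)

print(default_window, undercloud_window)
```

Under this reading the undercloud values widen the window from 300 to 1200 seconds. The gap of roughly 320 seconds between the two engine log lines in the bug description would be consistent with a 300 second window plus up to one check interval, assuming the engine was running with the 20 and 15 values at the time.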

Changed in tripleo:
status: In Progress → Fix Released
OpenStack Infra (hudson-openstack) wrote : Related fix proposed to tripleo-heat-templates (master)

Related fix proposed to branch: master
Review: https://review.openstack.org/651802

Changed in tripleo:
status: Fix Released → In Progress
Changed in tripleo:
milestone: stein-rc1 → train-1
OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/tripleo-heat-templates 10.5.0

This issue was fixed in the openstack/tripleo-heat-templates 10.5.0 release.

OpenStack Infra (hudson-openstack) wrote : Change abandoned on tripleo-heat-templates (master)

Change abandoned by Emilien Macchi (<email address hidden>) on branch: master
Review: https://review.opendev.org/651802

Changed in tripleo:
milestone: train-1 → train-2
Changed in tripleo:
milestone: train-2 → train-3
Changed in tripleo:
milestone: train-3 → ussuri-1
Changed in tripleo:
milestone: ussuri-1 → ussuri-2
wes hayutin (weshayutin)
Changed in tripleo:
milestone: ussuri-2 → ussuri-3
Changed in tripleo:
status: In Progress → Fix Released