M/N Upgrade - major-upgrade-pacemaker times out

Bug #1626628 reported by Michele Baldessari
This bug affects 1 person
Affects | Status | Importance | Assigned to | Milestone
tripleo | Fix Released | Critical | Michele Baldessari |

Bug Description

So after applying https://review.openstack.org/#/c/374791/ to noop the postdeployment, the major-upgrade-pacemaker-init step completes successfully. Interestingly, after running the major-upgrade-pacemaker step:
openstack overcloud deploy --templates /home/stack/tripleo-heat-templates --libvirt-type qemu \
  --control-flavor oooq_control --compute-flavor oooq_compute \
  --ceph-storage-flavor oooq_ceph --timeout 75 \
  --control-scale 3 --neutron-network-type vxlan --neutron-tunnel-types vxlan \
  -e /home/stack/tripleo-heat-templates/overcloud-resource-registry-puppet.yaml \
  -e /home/stack/tripleo-heat-templates/environments/network-isolation.yaml \
  -e /home/stack/tripleo-heat-templates/environments/net-single-nic-with-vlans.yaml \
  -e $HOME/network-environment.yaml \
  -e /home/stack/tripleo-heat-templates/environments/puppet-pacemaker.yaml \
  -e /home/stack/tripleo-heat-templates/environments/major-upgrade-pacemaker.yaml \
  --ntp-server clock.redhat.com \
  ${DEPLOY_ENV_YAML:+-e $DEPLOY_ENV_YAML} || deploy_status=1

It seems that heat times out:
2016-09-22 09:54:49Z [ControllerDeployment]: SIGNAL_COMPLETE Unknown
2016-09-22 09:54:51Z [NetworkDeployment]: SIGNAL_COMPLETE Unknown
2016-09-22 09:54:56Z [2]: SIGNAL_COMPLETE Unknown
2016-09-22 09:54:58Z [ControllerDeployment]: SIGNAL_COMPLETE Unknown
2016-09-22 09:55:00Z [NetworkDeployment]: SIGNAL_COMPLETE Unknown
2016-09-22 11:04:25Z [UpdateWorkflow]: UPDATE_FAILED UPDATE aborted
2016-09-22 11:04:25Z [overcloud]: UPDATE_FAILED Timed out
2016-09-22 11:04:25Z [CephMonUpgradeDeployment]: CREATE_FAILED CREATE aborted
2016-09-22 11:04:25Z [overcloud-UpdateWorkflow-nqo3h2msfua6]: UPDATE_FAILED Operation cancelled
 Stack overcloud UPDATE_FAILED
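
For a rough sense of how long the stack sat idle, the last SIGNAL_COMPLETE event and the UPDATE_FAILED event in the log above can be subtracted. This is only a sketch using two timestamps copied from that log; it says nothing about when the update itself started.

```python
from datetime import datetime

# Timestamp format used by the Heat event log in this report.
FMT = "%Y-%m-%d %H:%M:%SZ"

# Last SIGNAL_COMPLETE event and the UPDATE_FAILED event, copied from the log.
last_signal = datetime.strptime("2016-09-22 09:55:00Z", FMT)
timed_out = datetime.strptime("2016-09-22 11:04:25Z", FMT)

idle = timed_out - last_signal
print(idle)  # 1:09:25
```

That is a little over 69 minutes without any event, against the --timeout 75 (minutes) passed to the deploy command.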

I have reproduced it twice now, and *after* the NetworkDeployment is COMPLETE nothing happens until a timeout kicks in. The resources are:
+--------------------------+------------------------------------+--------------------+----------------------+
| resource_name            | resource_type                      | resource_status    | updated_time         |
+--------------------------+------------------------------------+--------------------+----------------------+
| UpdateWorkflow           | OS::TripleO::Tasks::UpdateWorkflow | UPDATE_IN_PROGRESS | 2016-09-22T11:33:51Z |
| CephMonUpgradeDeployment | OS::Heat::SoftwareDeploymentGroup  | UPDATE_IN_PROGRESS | 2016-09-22T11:33:52Z |
| Controller               | OS::Heat::SoftwareDeployment       | CREATE_IN_PROGRESS | 2016-09-22T11:33:52Z |
| get_param                | OS::Heat::SoftwareDeployment       | CREATE_IN_PROGRESS | 2016-09-22T11:33:53Z |
+--------------------------+------------------------------------+--------------------+----------------------+

It seems heat is constantly stuck in this loop:
2016-09-22 11:44:37.014 1404 DEBUG heat.engine.scheduler [req-c89c0266-ed18-4287-ac26-3b88a2c05220 - - - - -] Task _run_update from SoftwareDeploymentGroup "CephMonUpgradeDeployment" [fcf70ad4-1338-4868-bf4e-e02a59a280df] Stack "overcloud-UpdateWorkflow-nqo3h2msfua6" [c59f2f0c-b30b-482d-a9c7-92e62667f658] running step /usr/lib/python2.7/site-packages/heat/engine/scheduler.py:216
2016-09-22 11:44:37.637 1406 DEBUG heat.engine.scheduler [req-d53c4a24-aa62-41fa-8781-6c092c7478a9 - - - - -] Task update_task from Stack "overcloud-UpdateWorkflow-nqo3h2msfua6-CephMonUpgradeDeployment-j3pp67lv7kx6" [fcf70ad4-1338-4868-bf4e-e02a59a280df] running step /usr/lib/python2.7/site-packages/heat/engine/scheduler.py:216
2016-09-22 11:44:37.638 1406 DEBUG heat.engine.scheduler [req-d53c4a24-aa62-41fa-8781-6c092c7478a9 - - - - -] Task Stack "overcloud-UpdateWorkflow-nqo3h2msfua6-CephMonUpgradeDeployment-j3pp67lv7kx6" [fcf70ad4-1338-4868-bf4e-e02a59a280df] Update running step /usr/lib/python2.7/site-packages/heat/engine/scheduler.py:216
2016-09-22 11:44:37.638 1406 DEBUG heat.engine.scheduler [req-d53c4a24-aa62-41fa-8781-6c092c7478a9 - - - - -] Task _resource_update from Stack "overcloud-UpdateWorkflow-nqo3h2msfua6-CephMonUpgradeDeployment-j3pp67lv7kx6" [fcf70ad4-1338-4868-bf4e-e02a59a280df] Update running step /usr/lib/python2.7/site-packages/heat/engine/scheduler.py:216
2016-09-22 11:44:37.668 1406 DEBUG heat.engine.scheduler [req-d53c4a24-aa62-41fa-8781-6c092c7478a9 - - - - -] Task _resource_update from Stack "overcloud-UpdateWorkflow-nqo3h2msfua6-CephMonUpgradeDeployment-j3pp67lv7kx6" [fcf70ad4-1338-4868-bf4e-e02a59a280df] Update running step /usr/lib/python2.7/site-packages/heat/engine/scheduler.py:216
2016-09-22 11:44:37.692 1406 DEBUG heat.engine.scheduler [req-d53c4a24-aa62-41fa-8781-6c092c7478a9 - - - - -] Task update_task from Stack "overcloud-UpdateWorkflow-nqo3h2msfua6-CephMonUpgradeDeployment-j3pp67lv7kx6" [fcf70ad4-1338-4868-bf4e-e02a59a280df] sleeping _sleep /usr/lib/python2.7/site-packages/heat/engine/scheduler.py:157
2016-09-22 11:44:38.020 1404 DEBUG heat.engine.scheduler [req-c89c0266-ed18-4287-ac26-3b88a2c05220 - - - - -] Task _run_update from SoftwareDeploymentGroup "CephMonUpgradeDeployment" [fcf70ad4-1338-4868-bf4e-e02a59a280df] Stack "overcloud-UpdateWorkflow-nqo3h2msfua6" [c59f2f0c-b30b-482d-a9c7-92e62667f658] running step /usr/lib/python2.7/site-packages/heat/engine/scheduler.py:216

I still have the system where this occurred twice and I am not touching it, in case anyone wants to take a peek.

Tags: upgrade
Changed in tripleo:
status: New → Triaged
Michele Baldessari (michele) wrote :

So if I remove all references to CephMon in extraconfig/tasks/major_upgrade_pacemaker.yaml, I still get the issue and it all hangs on:
UpdateWorkflow | OS::TripleO::Tasks::UpdateWorkflow | UPDATE_IN_PROGRESS | 2016-09-23T08:35:39Z | overcloud
ControllerPacemakerUpgradeDeployment_Step1 | OS::Heat::SoftwareDeploymentGroup | CREATE_IN_PROGRESS | 2016-09-23T08:35:43Z | overcloud-UpdateWorkflow-6otltc4s2uij
Controller | OS::Heat::SoftwareDeployment | CREATE_IN_PROGRESS | 2016-09-23T08:35:44Z | overcloud-UpdateWorkflow-6otltc4s2uij-ControllerPacemakerUpgradeDeployment_Step1-v26irogdbskp
get_param | OS::Heat::SoftwareDeployment | CREATE_IN_PROGRESS | 2016-09-23T08:35:44Z | overcloud-UpdateWorkflow-6otltc4s2uij-ControllerPacemakerUpgradeDeployment_Step1-v26irogdbskp

Michele Baldessari (michele) wrote :

Setting this to Confirmed, since Mathieu is now hitting it as well.

Changed in tripleo:
status: Triaged → Confirmed
Michele Baldessari (michele) wrote :

So even with a much smaller extraconfig/tasks/major_upgrade_pacemaker.yaml:
heat_template_version: 2016-10-14
description: 'Upgrade for Pacemaker deployments'

parameters:
  servers:
    type: json
  input_values:
    type: json
    description: input values for the software deployments

resources:
  BlockStorageUpgradeConfig:
    type: OS::Heat::SoftwareConfig
    properties:
      group: script
      config: {get_file: major_upgrade_block_storage.sh}

  BlockStorageUpgradeDeployment:
    type: OS::Heat::SoftwareDeploymentGroup
    properties:
      servers: {get_param: servers, BlockStorage}
      config: {get_resource: BlockStorageUpgradeConfig}
      input_values: {get_param: input_values}

I can fully reproduce the problem:
UpdateWorkflow | OS::TripleO::Tasks::UpdateWorkflow | CREATE_IN_PROGRESS | overcloud
BlockStorageUpgradeDeployment | OS::Heat::SoftwareDeploymentGroup | CREATE_IN_PROGRESS | overcloud-UpdateWorkflow-n22z54he7m2y
BlockStorage | OS::Heat::SoftwareDeployment | CREATE_IN_PROGRESS | overcloud-UpdateWorkflow-n22z54he7m2y-BlockStorageUpgradeDeployment-7o3ubfbvrcem
get_param | OS::Heat::SoftwareDeployment | CREATE_IN_PROGRESS | overcloud-UpdateWorkflow-n22z54he7m2y-BlockStorageUpgradeDeployment-7o3ubfbvrcem

OpenStack Infra (hudson-openstack) wrote : Fix proposed to tripleo-heat-templates (master)

Fix proposed to branch: master
Review: https://review.openstack.org/375576

Changed in tripleo:
assignee: nobody → Michele Baldessari (michele)
status: Confirmed → In Progress
OpenStack Infra (hudson-openstack) wrote : Fix merged to tripleo-heat-templates (master)

Reviewed: https://review.openstack.org/375576
Committed: https://git.openstack.org/cgit/openstack/tripleo-heat-templates/commit/?id=9393a3e2a57d293eabe1fcfb702b0cecdf5e60ff
Submitter: Jenkins
Branch: master

commit 9393a3e2a57d293eabe1fcfb702b0cecdf5e60ff
Author: Michele Baldessari <email address hidden>
Date: Fri Sep 23 17:31:19 2016 +0200

    get_param calls with multiple arguments need brackets around them

    This issue was spotted during major upgrade where we had calls like
    this:

       servers: {get_param: servers, Controller}

    These get_param calls are hanging indefinitely and make the whole
    upgrade end in a timeout. We need to put brackets around the get_param
    function when there are multiple arguments:
    http://docs.openstack.org/developer/heat/template_guide/hot_spec.html#get-param

    This is already done in most of the tree, and the few places where this
    was not happening were parts not under CI. After this change the
    following grep returns only one false positive:

       grep -ir get_param: |grep -v -- '\[' |grep ','

    Change-Id: I65b23bb44f37b93e017dd15a5212939ffac76614
    Closes-Bug: #1626628
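
A minimal sketch of the same check as the grep in the commit message, written in Python so that it reports the offending line directly. The regular expression and the find_unbracketed helper are illustrative assumptions, not part of tripleo-heat-templates; the sample lines are the unbracketed call quoted in the commit, the bracketed form it asks for, and a single-argument call from the template earlier in this report. For context, YAML parses the unbracketed form as a flow mapping with two keys, get_param and Controller, rather than as a function call with two arguments, which appears consistent with the nested resources named get_param and Controller in the listings above.

```python
import re

# Matches a flow-style get_param whose argument contains a comma but no
# brackets, e.g. {get_param: servers, Controller}.
UNBRACKETED = re.compile(r"\{\s*get_param:\s*[^\[\]{}]*,[^\[\]{}]*\}")


def find_unbracketed(text):
    """Return (line_number, stripped_line) pairs with a suspicious get_param."""
    return [
        (number, line.strip())
        for number, line in enumerate(text.splitlines(), start=1)
        if UNBRACKETED.search(line)
    ]


sample = """\
servers: {get_param: servers, Controller}
servers: {get_param: [servers, Controller]}
input_values: {get_param: input_values}
"""

for number, line in find_unbracketed(sample):
    print(f"{number}: {line}")
```

Only the first sample line is reported; the bracketed call and the single-argument call pass. Unlike the grep, this sketch only looks at get_param calls written on one line inside braces.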

Changed in tripleo:
status: In Progress → Fix Released
OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/tripleo-heat-templates 5.0.0.0rc2

This issue was fixed in the openstack/tripleo-heat-templates 5.0.0.0rc2 release candidate.
