Ocata -> Pike upgrade: upgrade gets stuck on split stack deployments during Deployment_Step2 because the cluster is in maintenance mode

Bug #1725175 reported by Marius Cornea
This bug affects 1 person

Affects: tripleo
Status: Fix Released
Importance: Critical
Assigned to: Marius Cornea

Bug Description

[also discussed at https://bugzilla.redhat.com/show_bug.cgi?id=1503756 ]

Ocata -> Pike upgrade: upgrade gets stuck on split stack deployments during Deployment_Step2 because the cluster is in maintenance mode

Version-Release number of selected component (if applicable):
openstack-tripleo-heat-templates-7.0.3-0.20171014102841.el7ost.noarch

How reproducible:
100%

Steps to Reproduce:
1. Deploy Ocata split stack deployment with 3 ctrl, 3 messaging, 3 db, 2 compute node, 3 ceph nodes
2. Upgrade to Pike

Actual results:
While running major-upgrade-composable-steps-docker the upgrade gets stuck. Checking the Heat stacks:

(undercloud) [stack@undercloud-0 ~]$ openstack stack list --nested | grep PROGRESS
| 8e40a6c7-ebdb-4ccc-85c6-6275f8d3f3c5 | overcloud-AllNodesDeploySteps-wjo2dcwwmosx-AllNodesPostUpgradeSteps-4i7bmylywdnm-DatabaseDeployedServerDeployment_Step2-xcog6pw4h7ot | 08da50fc73114b118f112d645e8631dd | CREATE_IN_PROGRESS | 2017-10-18T15:42:08Z | None | dab455a8-18d2-4eab-8cea-7cabbb1d2659 |
| dab455a8-18d2-4eab-8cea-7cabbb1d2659 | overcloud-AllNodesDeploySteps-wjo2dcwwmosx-AllNodesPostUpgradeSteps-4i7bmylywdnm | 08da50fc73114b118f112d645e8631dd | UPDATE_IN_PROGRESS | 2017-10-18T09:09:53Z | 2017-10-18T15:33:00Z | 2edb57b9-eb04-4147-ae07-e3d766052ca2 |
| 2edb57b9-eb04-4147-ae07-e3d766052ca2 | overcloud-AllNodesDeploySteps-wjo2dcwwmosx | 08da50fc73114b118f112d645e8631dd | UPDATE_IN_PROGRESS | 2017-10-18T06:18:14Z | 2017-10-18T15:31:15Z | f63bd95d-d367-49e6-a83e-d223ee13c991 |
| f63bd95d-d367-49e6-a83e-d223ee13c991 | overcloud | 08da50fc73114b118f112d645e8631dd | UPDATE_IN_PROGRESS | 2017-10-18T06:10:20Z | 2017-10-18T15:23:44Z | None |
(undercloud) [stack@undercloud-0 ~]$
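As a side note, a stuck deployment like the one above can be spotted by polling the nested stacks in a loop; a minimal sketch (the column name and polling interval are illustrative, not from the report):

```shell
# Sketch: poll the nested overcloud stacks until none report IN_PROGRESS.
# If this loop never exits, something (like the issue in this bug) is
# blocking a deployment step.
while openstack stack list --nested -f value -c 'Stack Status' | grep -q IN_PROGRESS; do
    sleep 60
done
echo "no stacks in progress"
```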

Going to the database nodes, we can see that the mysql_init_bundle container has been running for 23 minutes:

[root@database-0 ~]# docker ps
CONTAINER ID IMAGE COMMAND CREATED STATUS PORTS NAMES
4aa5d1cb91f3 192.168.0.1:8787/rhosp12/openstack-mariadb-docker:20171017.1 "/bin/bash -c 'cp -a " 23 minutes ago Up 23 minutes mysql_init_bundle
b9d4c6209a8c 192.168.0.1:8787/rhosp12/openstack-mariadb-docker:20171017.1 "kolla_start" 23 minutes ago Up 23 minutes clustercheck

[root@database-0 ~]# pcs status
Cluster name: tripleo_cluster
Stack: corosync
Current DC: messaging-1 (version 1.1.16-12.el7_4.4-94ff4df) - partition with quorum
Last updated: Wed Oct 18 16:06:43 2017
Last change: Wed Oct 18 15:43:53 2017 by root via cibadmin on controller-0

18 nodes configured
36 resources configured (1 DISABLED)

              *** Resource management is DISABLED ***
  The cluster will not attempt to start, stop or recover services

Online: [ controller-0 controller-1 controller-2 database-0 database-1 database-2 messaging-0 messaging-1 messaging-2 ]

Full list of resources:

 ip-192.168.0.66 (ocf::heartbeat:IPaddr2): Started controller-0 (unmanaged)
 ip-172.16.18.27 (ocf::heartbeat:IPaddr2): Started controller-1 (unmanaged)
 ip-10.0.0.16 (ocf::heartbeat:IPaddr2): Started controller-2 (unmanaged)
 ip-10.0.0.138 (ocf::heartbeat:IPaddr2): Started controller-0 (unmanaged)
 ip-10.0.1.14 (ocf::heartbeat:IPaddr2): Started controller-1 (unmanaged)
 openstack-cinder-volume (systemd:openstack-cinder-volume): Stopped (disabled, unmanaged)
 Docker container set: redis-bundle [192.168.0.1:8787/rhosp12/openstack-redis-docker:pcmklatest] (unmanaged)
   redis-bundle-0 (ocf::heartbeat:redis): Stopped (unmanaged)
   redis-bundle-1 (ocf::heartbeat:redis): Stopped (unmanaged)
   redis-bundle-2 (ocf::heartbeat:redis): Stopped (unmanaged)
 Docker container set: rabbitmq-bundle [192.168.0.1:8787/rhosp12/openstack-rabbitmq-docker:pcmklatest] (unmanaged)
   rabbitmq-bundle-0 (ocf::heartbeat:rabbitmq-cluster): Stopped (unmanaged)
   rabbitmq-bundle-1 (ocf::heartbeat:rabbitmq-cluster): Stopped (unmanaged)
   rabbitmq-bundle-2 (ocf::heartbeat:rabbitmq-cluster): Stopped (unmanaged)
 Docker container set: galera-bundle [192.168.0.1:8787/rhosp12/openstack-mariadb-docker:pcmklatest] (unmanaged)
   galera-bundle-0 (ocf::heartbeat:galera): Stopped (unmanaged)
   galera-bundle-1 (ocf::heartbeat:galera): Stopped (unmanaged)
   galera-bundle-2 (ocf::heartbeat:galera): Stopped (unmanaged)
 Docker container set: haproxy-bundle [192.168.0.1:8787/rhosp12/openstack-haproxy-docker:pcmklatest] (unmanaged)
   haproxy-bundle-docker-0 (ocf::heartbeat:docker): Stopped (unmanaged)
   haproxy-bundle-docker-1 (ocf::heartbeat:docker): Stopped (unmanaged)
   haproxy-bundle-docker-2 (ocf::heartbeat:docker): Stopped (unmanaged)

Daemon Status:
  corosync: active/enabled
  pacemaker: active/enabled
  pcsd: active/enabled

[root@database-0 ~]# pcs property list
Cluster Properties:
 cluster-infrastructure: corosync
 cluster-name: tripleo_cluster
 dc-version: 1.1.16-12.el7_4.4-94ff4df
 have-watchdog: false
 maintenance-mode: true
 redis_REPL_INFO: controller-0
 stonith-enabled: false
Node Attributes:
 controller-0: cinder-volume-role=true haproxy-role=true redis-role=true
 controller-1: cinder-volume-role=true haproxy-role=true redis-role=true
 controller-2: cinder-volume-role=true haproxy-role=true redis-role=true
 database-0: galera-role=true
 database-1: galera-role=true
 database-2: galera-role=true
 messaging-0: rabbitmq-role=true rmq-node-attr-last-known-rabbitmq=rabbit@messaging-0
 messaging-1: rabbitmq-role=true rmq-node-attr-last-known-rabbitmq=rabbit@messaging-1
 messaging-2: rabbitmq-role=true rmq-node-attr-last-known-rabbitmq=rabbit@messaging-2

Expected results:
Upgrade doesn't get stuck.

Additional info:

After running pcs property set maintenance-mode=false the upgrade gets unstuck and the resources get started:

Cluster name: tripleo_cluster
Stack: corosync
Current DC: messaging-1 (version 1.1.16-12.el7_4.4-94ff4df) - partition with quorum
Last updated: Wed Oct 18 16:10:37 2017
Last change: Wed Oct 18 16:09:36 2017 by rabbitmq-bundle-2 via crm_attribute on messaging-2

18 nodes configured
36 resources configured (1 DISABLED)

Online: [ controller-0 controller-1 controller-2 database-0 database-1 database-2 messaging-0 messaging-1 messaging-2 ]
GuestOnline: [ galera-bundle-0@database-0 galera-bundle-1@database-1 galera-bundle-2@database-2 rabbitmq-bundle-0@messaging-0 rabbitmq-bundle-1@messaging-1 rabbitmq-bundle-2@messaging-2 redis-bundle-0@controller-2 redis-bundle-1@controller-0 redis-bundle-2@controller-1 ]

Full list of resources:

 ip-192.168.0.66 (ocf::heartbeat:IPaddr2): Started controller-0
 ip-172.16.18.27 (ocf::heartbeat:IPaddr2): Started controller-1
 ip-10.0.0.16 (ocf::heartbeat:IPaddr2): Started controller-2
 ip-10.0.0.138 (ocf::heartbeat:IPaddr2): Started controller-0
 ip-10.0.1.14 (ocf::heartbeat:IPaddr2): Started controller-1
 openstack-cinder-volume (systemd:openstack-cinder-volume): Stopped (disabled)
 Docker container set: redis-bundle [192.168.0.1:8787/rhosp12/openstack-redis-docker:pcmklatest]
   redis-bundle-0 (ocf::heartbeat:redis): Slave controller-2
   redis-bundle-1 (ocf::heartbeat:redis): Master controller-0
   redis-bundle-2 (ocf::heartbeat:redis): Slave controller-1
 Docker container set: rabbitmq-bundle [192.168.0.1:8787/rhosp12/openstack-rabbitmq-docker:pcmklatest]
   rabbitmq-bundle-0 (ocf::heartbeat:rabbitmq-cluster): Started messaging-0
   rabbitmq-bundle-1 (ocf::heartbeat:rabbitmq-cluster): Started messaging-1
   rabbitmq-bundle-2 (ocf::heartbeat:rabbitmq-cluster): Started messaging-2
 Docker container set: galera-bundle [192.168.0.1:8787/rhosp12/openstack-mariadb-docker:pcmklatest]
   galera-bundle-0 (ocf::heartbeat:galera): Master database-0
   galera-bundle-1 (ocf::heartbeat:galera): Master database-1
   galera-bundle-2 (ocf::heartbeat:galera): Master database-2
 Docker container set: haproxy-bundle [192.168.0.1:8787/rhosp12/openstack-haproxy-docker:pcmklatest]
   haproxy-bundle-docker-0 (ocf::heartbeat:docker): Started controller-0
   haproxy-bundle-docker-1 (ocf::heartbeat:docker): Started controller-1
   haproxy-bundle-docker-2 (ocf::heartbeat:docker): Started controller-2

Daemon Status:
  corosync: active/enabled
  pacemaker: active/enabled
  pcsd: active/enabled
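For anyone hitting this before the fix landed, the manual workaround above can be wrapped in a small check; a sketch only (it assumes pcs is on PATH on a cluster node and that the cluster is otherwise healthy):

```shell
# Workaround sketch: if Pacemaker was left in maintenance mode by the
# stuck upgrade step, clear it so the unmanaged resources can start again.
if pcs property list | grep -q 'maintenance-mode: true'; then
    pcs property set maintenance-mode=false
fi

# Verify the bundles (galera, rabbitmq, redis, haproxy) come back up.
pcs status
```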

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to tripleo-heat-templates (master)

Fix proposed to branch: master
Review: https://review.openstack.org/513654

Changed in tripleo:
assignee: nobody → Marius Cornea (mcornea)
status: Confirmed → In Progress
description: updated
Changed in tripleo:
milestone: none → queens-1
Changed in tripleo:
milestone: queens-1 → queens-2
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to tripleo-heat-templates (master)

Reviewed: https://review.openstack.org/513654
Committed: https://git.openstack.org/cgit/openstack/tripleo-heat-templates/commit/?id=8e92d7c6db6fcae863a250f63b01a98f7a3f3340
Submitter: Zuul
Branch: master

commit 8e92d7c6db6fcae863a250f63b01a98f7a3f3340
Author: Marius Cornea <email address hidden>
Date: Fri Oct 20 10:20:50 2017 +0200

    Do not set cluster in maintenance mode during split stack upgrade

    This change noops ControllerDeployedServer{Pre,Post}Config to avoid
    getting the upgrade of a split stack deployment getting stuck due
    to the cluster being in maintenance mode. For reference a similar
    change has been done for the regular Controller role in:
    https://review.openstack.org/#/c/487313/

    Change-Id: Idd393011b3c4d0d236780e11a04a59d426750de1
    Closes-bug: 1725175
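For context, nooping a resource in tripleo-heat-templates is conventionally done with a resource_registry mapping to OS::Heat::None. A hedged sketch of that pattern for the resources named in the commit message (illustrative only; the actual file and mapping touched by this change are in the linked review):

```yaml
# Illustrative resource_registry fragment: mapping a task resource to
# OS::Heat::None makes Heat skip it entirely, which is how the
# deployed-server pre/post config steps are "nooped".
resource_registry:
  OS::TripleO::Tasks::ControllerDeployedServerPreConfig: OS::Heat::None
  OS::TripleO::Tasks::ControllerDeployedServerPostConfig: OS::Heat::None
```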

Changed in tripleo:
status: In Progress → Fix Released
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to tripleo-heat-templates (stable/pike)

Fix proposed to branch: stable/pike
Review: https://review.openstack.org/518597

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to tripleo-heat-templates (stable/pike)

Reviewed: https://review.openstack.org/518597
Committed: https://git.openstack.org/cgit/openstack/tripleo-heat-templates/commit/?id=bcea7e76f4e766a1c04b1e14048da9ab08d42b86
Submitter: Zuul
Branch: stable/pike

commit bcea7e76f4e766a1c04b1e14048da9ab08d42b86
Author: Marius Cornea <email address hidden>
Date: Fri Oct 20 10:20:50 2017 +0200

    Do not set cluster in maintenance mode during split stack upgrade

    This change noops ControllerDeployedServer{Pre,Post}Config to avoid
    getting the upgrade of a split stack deployment getting stuck due
    to the cluster being in maintenance mode. For reference a similar
    change has been done for the regular Controller role in:
    https://review.openstack.org/#/c/487313/

    Change-Id: Idd393011b3c4d0d236780e11a04a59d426750de1
    Closes-bug: 1725175
    (cherry picked from commit 8e92d7c6db6fcae863a250f63b01a98f7a3f3340)

tags: added: in-stable-pike
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/tripleo-heat-templates 7.0.4

This issue was fixed in the openstack/tripleo-heat-templates 7.0.4 release.

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/tripleo-heat-templates 8.0.0.0b2

This issue was fixed in the openstack/tripleo-heat-templates 8.0.0.0b2 development milestone.
