During redeployment of overcloud, rabbitmq-bundle-0 pacemaker resource goes down

Bug #1933446 reported by Shyam Biradar
Affects: tripleo | Status: New | Importance: Undecided | Assigned to: Unassigned

Bug Description

Description:

With no change to the overcloud deploy command or the cloud, retrying the deploy command after a successful deployment fails with the rabbitmq pacemaker resource in the 'Stopped' state.

==================================================
Here is the pcs status output:

[root@overcloud-controller-0 heat-admin]# pcs status
Cluster name: tripleo_cluster
Cluster Summary:
  * Stack: corosync
  * Current DC: overcloud-controller-0 (version 2.1.0-2.el8-7c3f660707) - partition with quorum
  * Last updated: Thu Jun 24 04:51:06 2021
  * Last change: Wed Jun 23 11:41:32 2021 by hacluster via crmd on overcloud-controller-0
  * 5 nodes configured
  * 21 resource instances configured

Node List:
  * Online: [ overcloud-controller-0 ]
  * GuestOnline: [ galera-bundle-0@overcloud-controller-0 ovn-dbs-bundle-0@overcloud-controller-0 rabbitmq-bundle-0@overcloud-controller-0 redis-bundle-0@overcloud-controller-0 ]

Full List of Resources:
  * ip-192.168.24.10 (ocf::heartbeat:IPaddr2): Started overcloud-controller-0
  * ip-10.0.0.101 (ocf::heartbeat:IPaddr2): Started overcloud-controller-0
  * ip-172.16.2.8 (ocf::heartbeat:IPaddr2): Started overcloud-controller-0
  * ip-172.16.2.28 (ocf::heartbeat:IPaddr2): Started overcloud-controller-0
  * ip-172.16.1.185 (ocf::heartbeat:IPaddr2): Started overcloud-controller-0
  * ip-172.16.3.188 (ocf::heartbeat:IPaddr2): Started overcloud-controller-0
  * Container bundle: haproxy-bundle [cluster.common.tag/centos-binary-haproxy:pcmklatest]:
    * haproxy-bundle-podman-0 (ocf::heartbeat:podman): Started overcloud-controller-0
  * Container bundle: galera-bundle [cluster.common.tag/centos-binary-mariadb:pcmklatest]:
    * galera-bundle-0 (ocf::heartbeat:galera): Master overcloud-controller-0
  * Container bundle: rabbitmq-bundle [cluster.common.tag/centos-binary-rabbitmq:pcmklatest]:
    * rabbitmq-bundle-0 (ocf::heartbeat:rabbitmq-cluster): Stopped overcloud-controller-0
  * Container bundle: redis-bundle [cluster.common.tag/centos-binary-redis:pcmklatest]:
    * redis-bundle-0 (ocf::heartbeat:redis): Master overcloud-controller-0
  * Container bundle: ovn-dbs-bundle [cluster.common.tag/centos-binary-ovn-northd:pcmklatest]:
    * ovn-dbs-bundle-0 (ocf::ovn:ovndb-servers): Master overcloud-controller-0
  * ip-172.16.2.168 (ocf::heartbeat:IPaddr2): Started overcloud-controller-0
  * Container bundle: openstack-cinder-volume [cluster.common.tag/centos-binary-cinder-volume:pcmklatest]:
    * openstack-cinder-volume-podman-0 (ocf::heartbeat:podman): Started overcloud-controller-0

Failed Resource Actions:
  * rabbitmq_start_0 on rabbitmq-bundle-0 'error' (1): call=151216, status='Timed Out', exitreason='', last-rc-change='2021-06-23 12:00:21Z', queued=0ms, exec=200037ms

Daemon Status:
  corosync: active/enabled
  pacemaker: active/enabled
  pcsd: active/enabled

==================================================================
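The key datum in the 'Failed Resource Actions' line above is exec=200037ms: the rabbitmq start operation ran for roughly 200 seconds before pacemaker declared it 'Timed Out'. A minimal sketch (assuming the line format shown above) of pulling that number out of the pcs output:

```shell
#!/bin/bash
# Failed-action line copied from the pcs status output above
line="rabbitmq_start_0 on rabbitmq-bundle-0 'error' (1): call=151216, status='Timed Out', exitreason='', last-rc-change='2021-06-23 12:00:21Z', queued=0ms, exec=200037ms"

# Extract the exec time (milliseconds) and report it in seconds
exec_ms=$(printf '%s' "$line" | sed -n 's/.*exec=\([0-9]*\)ms.*/\1/p')
echo "start op ran for $((exec_ms / 1000))s before timing out"
```

The ~200s run time suggests the start operation hit its configured operation timeout rather than failing outright.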
Here is the rabbitmq container status:
[root@overcloud-controller-0 heat-admin]# podman ps -a | grep rabbit
2e06b68afa42 cluster.common.tag/centos-binary-rabbitmq:pcmklatest /bin/bash /usr/lo... 2 days ago Up 2 days ago rabbitmq-bundle-podman-0
390b82e77949 192.168.24.1:8787/tripleotraincentos8/centos-binary-rabbitmq:current-tripleo kolla_start 58 minutes ago Exited (0) 58 minutes ago rabbitmq_bootstrap
71097f4b73f5 192.168.24.1:8787/tripleotraincentos8/centos-binary-rabbitmq:current-tripleo /container_puppet... 54 minutes ago Exited (1) 18 minutes ago rabbitmq_wait_bundle

====================================================================

Overcloud deployment failure logs:

Attached in a file.

=======================================================

Steps to reproduce:
1. Deploy a TripleO Train CentOS 8 overcloud using TripleO Quickstart.
2. After the successful deployment, run the overcloud-deploy.sh script available on the undercloud node again.
3. The overcloud deployment will fail this time with the error mentioned in the 'Description' section.

======================================================
Expected result:

Redeployment should work fine.

=====================================================
Actual Result:

Redeployment of the cloud without any change is failing.
======================================================

Environment:

TripleO Train CentOS 8 cloud deployed using the TripleO Quickstart tool

=========================================================

Logs and Configs:

Provided in the 'Description' section.

overcloud-deploy.sh:

(undercloud) [stack@undercloud ~]$ cat overcloud-deploy.sh
#!/bin/bash

set -ux

### --start_docs
## Deploying the overcloud
## =======================

## Prepare Your Environment
## ------------------------

## * Source in the undercloud credentials.
## ::

source /home/stack/stackrc

### --stop_docs
# Wait until there are hypervisors available.
while true; do
    count=$(openstack hypervisor stats show -c count -f value)
    if [ $count -gt 0 ]; then
        break
    fi
done

### --start_docs

## * Deploy the overcloud!
## ::
openstack overcloud deploy --stack overcloud \
  --override-ansible-cfg /home/stack/custom_ansible.cfg \
   --templates /usr/share/openstack-tripleo-heat-templates \
  --libvirt-type qemu --timeout 90 -e /home/stack/cloud-names.yaml -e /home/stack/neutronl3ha.yaml -e /usr/share/openstack-tripleo-heat-templates/environments/docker-ha.yaml -e /home/stack/containers-prepare-parameter.yaml -e /usr/share/openstack-tripleo-heat-templates/environments/podman.yaml -e /usr/share/openstack-tripleo-heat-templates/environments/network-isolation.yaml -e /usr/share/openstack-tripleo-heat-templates/environments/net-single-nic-with-vlans.yaml -e /home/stack/network-environment.yaml -e /home/stack/overcloud_storage_params.yaml -e /usr/share/openstack-tripleo-heat-templates/environments/low-memory-usage.yaml -e /home/stack/inject-trust-anchor.yaml -e /usr/share/openstack-tripleo-heat-templates/environments/services/haproxy-public-tls-certmonger.yaml -e /usr/share/openstack-tripleo-heat-templates/environments/ssl/enable-internal-tls.yaml -e /usr/share/openstack-tripleo-heat-templates/environments/ssl/tls-everywhere-endpoints-dns.yaml --validation-warnings-fatal -e /home/stack/overcloud-topology-config.yaml -e /home/stack/overcloud-selinux-config.yaml \
  "$@" && status_code=0 || status_code=$?

### --stop_docs

# Check if the deployment has started. If not, exit gracefully. If yes, check for errors.
if ! openstack stack list | grep -q overcloud; then
    echo "overcloud deployment not started. Check the deploy configurations"
    exit 1

    # We don't always get a useful error code from the openstack deploy command,
    # so check `openstack stack list` for a CREATE_COMPLETE or an UPDATE_COMPLETE
    # status.
elif ! openstack stack list | grep -Eq '(CREATE|UPDATE)_COMPLETE'; then
        # get the failures list
    openstack stack failures list overcloud --long > /home/stack/failed_deployment_list.log || true

    # get any puppet related errors
    for failed in $(openstack stack resource list \
        --nested-depth 5 overcloud | grep FAILED |
        grep 'StructuredDeployment ' | cut -d '|' -f3)
    do
    echo "openstack software deployment show output for deployment: $failed" >> /home/stack/failed_deployments.log
    echo "######################################################" >> /home/stack/failed_deployments.log
    openstack software deployment show $failed >> /home/stack/failed_deployments.log
    echo "######################################################" >> /home/stack/failed_deployments.log
    echo "puppet standard error for deployment: $failed" >> /home/stack/failed_deployments.log
    echo "######################################################" >> /home/stack/failed_deployments.log
    # the sed part removes color codes from the text
    openstack software deployment show $failed -f json |
        jq -r .output_values.deploy_stderr |
        sed -r "s:\x1B\[[0-9;]*[mK]::g" >> /home/stack/failed_deployments.log
    echo "######################################################" >> /home/stack/failed_deployments.log
    # We need to exit with 1 because of the above || true
    done
    exit 1
fi
exit $status_code
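One side note on the script above: the hypervisor wait loop near the top polls `openstack hypervisor stats show` in a tight loop with no pause and no retry cap, so a stuck undercloud hangs the script forever. A hedged sketch of a friendlier variant (the `wait_for_hypervisors` name and the 10-second interval are illustrative, not from the script):

```shell
#!/bin/bash
# Sketch: poll for available hypervisors with a pause and a bounded number
# of retries, instead of spinning until the command succeeds.
wait_for_hypervisors() {
    local retries=${1:-60} count i
    for ((i = 0; i < retries; i++)); do
        count=$(openstack hypervisor stats show -c count -f value 2>/dev/null)
        # ${count:-0} guards against an empty result when the CLI fails
        if [ "${count:-0}" -gt 0 ] 2>/dev/null; then
            return 0
        fi
        sleep 10
    done
    return 1
}
```

The bounded retry count also gives the deploy wrapper a clean failure path instead of an indefinite hang.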

Revision history for this message
Shyam Biradar (shyambiradar) wrote :
Revision history for this message
Michele Baldessari (michele) wrote :

Please paste /var/log/pacemaker/pacemaker.log, /var/log/containers/rabbitmq/*, the journal output, and the time of failure.

Revision history for this message
Shyam (shyam.biradar) wrote :
Revision history for this message
Shyam (shyam.biradar) wrote :
Revision history for this message
Shyam Biradar (shyambiradar) wrote :

Hi Michele,

I have attached rabbitmq and pacemaker logs.
For the journal output, what process name do I need to use here?

Thanks.

Revision history for this message
Michele Baldessari (michele) wrote :

The whole journal ideally, or at least from before the redeploy to the failed start of rabbitmq.
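On the process-name question above: journalctl can be filtered by time rather than by unit, so the whole journal around the failure can be exported in one go. A minimal sketch, assuming the failure time reported further down (2021-06-24T08:10:16) and an illustrative one-hour window on either side:

```shell
#!/bin/bash
# Export the full journal around the rabbitmq start failure.
# The window is illustrative; widen it to cover the start of the redeploy.
start_ts="2021-06-24 07:10:00"
end_ts="2021-06-24 09:10:00"
out="/tmp/journal-rabbitmq-failure.log"

# Guarded so the sketch is a no-op on hosts without systemd-journald
if command -v journalctl >/dev/null 2>&1; then
    journalctl --since "$start_ts" --until "$end_ts" --no-pager > "$out"
fi
```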

Revision history for this message
Shyam (shyam.biradar) wrote :
Revision history for this message
Shyam Biradar (shyambiradar) wrote :

Time of failure: 2021-06-24T08:10:16

Revision history for this message
Shyam Biradar (shyambiradar) wrote :

Hi Michele,

All required logs are attached.

Let me know if you need /var/log/containers/stdouts/rabbitmq*

Thanks,
Shyam

Revision history for this message
Shyam Biradar (shyambiradar) wrote :

Let me know your initial thoughts on this one.

Revision history for this message
Shyam Biradar (shyambiradar) wrote :

Hi,

Any update here?

Thanks
