During redeployment of overcloud, rabbitmq-bundle-0 pacemaker resource goes down

Bug #1933446 reported by Shyam Biradar
Affects: tripleo | Status: New | Importance: Undecided | Assigned to: Unassigned

Bug Description

Description:

With no change to the overcloud deploy command or the cloud, retrying the deploy command after a successful deployment fails with the rabbitmq pacemaker resource in the 'Stopped' state.

==================================================
Here is the pcs status output:

[root@overcloud-controller-0 heat-admin]# pcs status
Cluster name: tripleo_cluster
Cluster Summary:
  * Stack: corosync
  * Current DC: overcloud-controller-0 (version 2.1.0-2.el8-7c3f660707) - partition with quorum
  * Last updated: Thu Jun 24 04:51:06 2021
  * Last change: Wed Jun 23 11:41:32 2021 by hacluster via crmd on overcloud-controller-0
  * 5 nodes configured
  * 21 resource instances configured

Node List:
  * Online: [ overcloud-controller-0 ]
  * GuestOnline: [ galera-bundle-0@overcloud-controller-0 ovn-dbs-bundle-0@overcloud-controller-0 rabbitmq-bundle-0@overcloud-controller-0 redis-bundle-0@overcloud-controller-0 ]

Full List of Resources:
  * ip-192.168.24.10 (ocf::heartbeat:IPaddr2): Started overcloud-controller-0
  * ip-10.0.0.101 (ocf::heartbeat:IPaddr2): Started overcloud-controller-0
  * ip-172.16.2.8 (ocf::heartbeat:IPaddr2): Started overcloud-controller-0
  * ip-172.16.2.28 (ocf::heartbeat:IPaddr2): Started overcloud-controller-0
  * ip-172.16.1.185 (ocf::heartbeat:IPaddr2): Started overcloud-controller-0
  * ip-172.16.3.188 (ocf::heartbeat:IPaddr2): Started overcloud-controller-0
  * Container bundle: haproxy-bundle [cluster.common.tag/centos-binary-haproxy:pcmklatest]:
    * haproxy-bundle-podman-0 (ocf::heartbeat:podman): Started overcloud-controller-0
  * Container bundle: galera-bundle [cluster.common.tag/centos-binary-mariadb:pcmklatest]:
    * galera-bundle-0 (ocf::heartbeat:galera): Master overcloud-controller-0
  * Container bundle: rabbitmq-bundle [cluster.common.tag/centos-binary-rabbitmq:pcmklatest]:
    * rabbitmq-bundle-0 (ocf::heartbeat:rabbitmq-cluster): Stopped overcloud-controller-0
  * Container bundle: redis-bundle [cluster.common.tag/centos-binary-redis:pcmklatest]:
    * redis-bundle-0 (ocf::heartbeat:redis): Master overcloud-controller-0
  * Container bundle: ovn-dbs-bundle [cluster.common.tag/centos-binary-ovn-northd:pcmklatest]:
    * ovn-dbs-bundle-0 (ocf::ovn:ovndb-servers): Master overcloud-controller-0
  * ip-172.16.2.168 (ocf::heartbeat:IPaddr2): Started overcloud-controller-0
  * Container bundle: openstack-cinder-volume [cluster.common.tag/centos-binary-cinder-volume:pcmklatest]:
    * openstack-cinder-volume-podman-0 (ocf::heartbeat:podman): Started overcloud-controller-0

Failed Resource Actions:
  * rabbitmq_start_0 on rabbitmq-bundle-0 'error' (1): call=151216, status='Timed Out', exitreason='', last-rc-change='2021-06-23 12:00:21Z', queued=0ms, exec=200037ms

Daemon Status:
  corosync: active/enabled
  pacemaker: active/enabled
  pcsd: active/enabled

==================================================================
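The key datum in the 'Failed Resource Actions' line above is exec=200037ms: the rabbitmq start operation ran for roughly 200 seconds before pacemaker declared it 'Timed Out'. A minimal sketch (assuming the line format shown above) of pulling that number out of the pcs output:

```shell
#!/bin/bash
# Failed-action line copied from the pcs status output above
line="rabbitmq_start_0 on rabbitmq-bundle-0 'error' (1): call=151216, status='Timed Out', exitreason='', last-rc-change='2021-06-23 12:00:21Z', queued=0ms, exec=200037ms"

# Extract the exec time (milliseconds) and report it in seconds
exec_ms=$(printf '%s' "$line" | sed -n 's/.*exec=\([0-9]*\)ms.*/\1/p')
echo "start op ran for $((exec_ms / 1000))s before timing out"
```

The ~200s run time suggests the start operation hit its configured operation timeout rather than failing outright.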
Here is the rabbitmq container status:
[root@overcloud-controller-0 heat-admin]# podman ps -a | grep rabbit
2e06b68afa42 cluster.common.tag/centos-binary-rabbitmq:pcmklatest /bin/bash /usr/lo... 2 days ago Up 2 days ago rabbitmq-bundle-podman-0
390b82e77949 192.168.24.1:8787/tripleotraincentos8/centos-binary-rabbitmq:current-tripleo kolla_start 58 minutes ago Exited (0) 58 minutes ago rabbitmq_bootstrap
71097f4b73f5 192.168.24.1:8787/tripleotraincentos8/centos-binary-rabbitmq:current-tripleo /container_puppet... 54 minutes ago Exited (1) 18 minutes ago rabbitmq_wait_bundle

====================================================================

Overcloud deployment failure logs:

Attached in a file.

=======================================================

Steps to reproduce:
1. Deploy a TripleO Train CentOS 8 overcloud using TripleO Quickstart.
2. After the successful deployment, run the overcloud-deploy.sh script available on the undercloud node again.
3. The overcloud deployment will fail this time with the error mentioned in the 'Description' section.

======================================================
Expected result:

Redeployment should work fine.

=====================================================
Actual Result:

Redeployment of the cloud without any change is failing.
======================================================

Environment:

TripleO Train CentOS 8 cloud deployed using the TripleO Quickstart tool

=========================================================

Logs and Configs:

Provided in the 'Description' section.

overcloud-deploy.sh:

(undercloud) [stack@undercloud ~]$ cat overcloud-deploy.sh
#!/bin/bash

set -ux

### --start_docs
## Deploying the overcloud
## =======================

## Prepare Your Environment
## ------------------------

## * Source in the undercloud credentials.
## ::

source /home/stack/stackrc

### --stop_docs
# Wait until there are hypervisors available.
while true; do
    count=$(openstack hypervisor stats show -c count -f value)
    if [ $count -gt 0 ]; then
        break
    fi
done

### --start_docs

## * Deploy the overcloud!
## ::
openstack overcloud deploy --stack overcloud \
  --override-ansible-cfg /home/stack/custom_ansible.cfg \
   --templates /usr/share/openstack-tripleo-heat-templates \
  --libvirt-type qemu --timeout 90 -e /home/stack/cloud-names.yaml -e /home/stack/neutronl3ha.yaml -e /usr/share/openstack-tripleo-heat-templates/environments/docker-ha.yaml -e /home/stack/containers-prepare-parameter.yaml -e /usr/share/openstack-tripleo-heat-templates/environments/podman.yaml -e /usr/share/openstack-tripleo-heat-templates/environments/network-isolation.yaml -e /usr/share/openstack-tripleo-heat-templates/environments/net-single-nic-with-vlans.yaml -e /home/stack/network-environment.yaml -e /home/stack/overcloud_storage_params.yaml -e /usr/share/openstack-tripleo-heat-templates/environments/low-memory-usage.yaml -e /home/stack/inject-trust-anchor.yaml -e /usr/share/openstack-tripleo-heat-templates/environments/services/haproxy-public-tls-certmonger.yaml -e /usr/share/openstack-tripleo-heat-templates/environments/ssl/enable-internal-tls.yaml -e /usr/share/openstack-tripleo-heat-templates/environments/ssl/tls-everywhere-endpoints-dns.yaml --validation-warnings-fatal -e /home/stack/overcloud-topology-config.yaml -e /home/stack/overcloud-selinux-config.yaml \
  "$@" && status_code=0 || status_code=$?

### --stop_docs

# Check if the deployment has started. If not, exit gracefully. If yes, check for errors.
if ! openstack stack list | grep -q overcloud; then
    echo "overcloud deployment not started. Check the deploy configurations"
    exit 1

    # We don't always get a useful error code from the openstack deploy command,
    # so check `openstack stack list` for a CREATE_COMPLETE or an UPDATE_COMPLETE
    # status.
elif ! openstack stack list | grep -Eq '(CREATE|UPDATE)_COMPLETE'; then
        # get the failures list
    openstack stack failures list overcloud --long > /home/stack/failed_deployment_list.log || true

    # get any puppet related errors
    for failed in $(openstack stack resource list \
        --nested-depth 5 overcloud | grep FAILED |
        grep 'StructuredDeployment ' | cut -d '|' -f3)
    do
    echo "openstack software deployment show output for deployment: $failed" >> /home/stack/failed_deployments.log
    echo "######################################################" >> /home/stack/failed_deployments.log
    openstack software deployment show $failed >> /home/stack/failed_deployments.log
    echo "######################################################" >> /home/stack/failed_deployments.log
    echo "puppet standard error for deployment: $failed" >> /home/stack/failed_deployments.log
    echo "######################################################" >> /home/stack/failed_deployments.log
    # the sed part removes color codes from the text
    openstack software deployment show $failed -f json |
        jq -r .output_values.deploy_stderr |
        sed -r "s:\x1B\[[0-9;]*[mK]::g" >> /home/stack/failed_deployments.log
    echo "######################################################" >> /home/stack/failed_deployments.log
    # We need to exit with 1 because of the above || true
    done
    exit 1
fi
exit $status_code
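One side note on the script above: the hypervisor wait loop near the top polls `openstack hypervisor stats show` in a tight loop with no pause and no retry cap, so a stuck undercloud hangs the script forever. A hedged sketch of a friendlier variant (the `wait_for_hypervisors` name and the 10-second interval are illustrative, not from the script):

```shell
#!/bin/bash
# Sketch: poll for available hypervisors with a pause and a bounded number
# of retries, instead of spinning until the command succeeds.
wait_for_hypervisors() {
    local retries=${1:-60} count i
    for ((i = 0; i < retries; i++)); do
        count=$(openstack hypervisor stats show -c count -f value 2>/dev/null)
        # ${count:-0} guards against an empty result when the CLI fails
        if [ "${count:-0}" -gt 0 ] 2>/dev/null; then
            return 0
        fi
        sleep 10
    done
    return 1
}
```

The bounded retry count also gives the deploy wrapper a clean failure path instead of an indefinite hang.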

Revision history for this message
Shyam Biradar (shyambiradar) wrote :
Revision history for this message
Michele Baldessari (michele) wrote :

Please paste /var/log/pacemaker/pacemaker.log, /var/log/containers/rabbitmq/*, the journal output, and the time of failure.

Revision history for this message
Shyam (shyam.biradar) wrote :
Revision history for this message
Shyam (shyam.biradar) wrote :
Revision history for this message
Shyam Biradar (shyambiradar) wrote :

Hi Michele,

I have attached rabbitmq and pacemaker logs.
For the journal output, what process name do I need to use here?

Thanks.

Revision history for this message
Michele Baldessari (michele) wrote :

The whole journal ideally, or at least from before the redeploy to the failed start of rabbitmq.
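On the process-name question above: journalctl can be filtered by time rather than by unit, so the whole journal around the failure can be exported in one go. A minimal sketch, assuming the failure time reported further down (2021-06-24T08:10:16) and an illustrative one-hour window on either side:

```shell
#!/bin/bash
# Export the full journal around the rabbitmq start failure.
# The window is illustrative; widen it to cover the start of the redeploy.
start_ts="2021-06-24 07:10:00"
end_ts="2021-06-24 09:10:00"
out="/tmp/journal-rabbitmq-failure.log"

# Guarded so the sketch is a no-op on hosts without systemd-journald
if command -v journalctl >/dev/null 2>&1; then
    journalctl --since "$start_ts" --until "$end_ts" --no-pager > "$out"
fi
```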

Revision history for this message
Shyam (shyam.biradar) wrote :
Revision history for this message
Shyam Biradar (shyambiradar) wrote :

Time of failure: 2021-06-24T08:10:16

Revision history for this message
Shyam Biradar (shyambiradar) wrote :

Hi Michele,

All required logs are attached.

Let me know if you need /var/log/containers/stdouts/rabbitmq*

Thanks,
Shyam

Revision history for this message
Shyam Biradar (shyambiradar) wrote :

Let me know your initial thoughts on this one.

Revision history for this message
Shyam Biradar (shyambiradar) wrote :

Hi,

Any update here?

Thanks
