During redeployment of overcloud, rabbitmq-bundle-0 pacemaker resource goes down
Affects | Status | Importance | Assigned to | Milestone | |
---|---|---|---|---|---|
tripleo | New | Undecided | Unassigned | | |
Bug Description
Description:
With no change to the overcloud deploy command or to the cloud, re-running the deploy command after a successful deployment fails with the rabbitmq pacemaker resource in the 'STOPPED' state.
=======
Here is pcs status command output:
[root@overcloud
Cluster name: tripleo_cluster
Cluster Summary:
* Stack: corosync
* Current DC: overcloud-
* Last updated: Thu Jun 24 04:51:06 2021
* Last change: Wed Jun 23 11:41:32 2021 by hacluster via crmd on overcloud-
* 5 nodes configured
* 21 resource instances configured
Node List:
* Online: [ overcloud-
* GuestOnline: [ galera-
Full List of Resources:
* ip-192.168.24.10 (ocf::heartbeat
* ip-10.0.0.101 (ocf::heartbeat
* ip-172.16.2.8 (ocf::heartbeat
* ip-172.16.2.28 (ocf::heartbeat
* ip-172.16.1.185 (ocf::heartbeat
* ip-172.16.3.188 (ocf::heartbeat
* Container bundle: haproxy-bundle [cluster.
* haproxy-
* Container bundle: galera-bundle [cluster.
* galera-bundle-0 (ocf::heartbeat
* Container bundle: rabbitmq-bundle [cluster.
* rabbitmq-bundle-0 (ocf::heartbeat
* Container bundle: redis-bundle [cluster.
* redis-bundle-0 (ocf::heartbeat
* Container bundle: ovn-dbs-bundle [cluster.
* ovn-dbs-bundle-0 (ocf::ovn:
* ip-172.16.2.168 (ocf::heartbeat
* Container bundle: openstack-
* openstack-
Failed Resource Actions:
* rabbitmq_start_0 on rabbitmq-bundle-0 'error' (1): call=151216, status='Timed Out', exitreason='', last-rc-
Daemon Status:
corosync: active/enabled
pacemaker: active/enabled
pcsd: active/enabled
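
The failed action shown above can be pulled out of a saved or piped `pcs status` dump with a small filter. This is only a minimal sketch: it assumes the section layout seen in the output above (a "Failed Resource Actions:" header followed by entries that start with `*`), and the function name `failed_actions` is hypothetical.

```shell
#!/bin/bash
# Print the entries listed under "Failed Resource Actions:" in pcs status
# output read from stdin. Assumes each entry starts with "*" (optionally
# indented) and that the section ends at the first line that is not an entry.
failed_actions() {
    awk '
        /^[[:space:]]*Failed Resource Actions:/ { in_section = 1; next }
        in_section && /^[[:space:]]*\*/         { print; next }
        in_section                              { in_section = 0 }
    '
}

# Usage on a controller node (requires pcs):
#   pcs status | failed_actions
```

Filtering this way keeps the timed-out `rabbitmq_start_0` entry separate from the long resource list, which makes it easier to paste into a report.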
=======
Here is rabbitmq container status
[root@overcloud
2e06b68afa42 cluster.
390b82e77949 192.168.
71097f4b73f5 192.168.
=======
Here are the overcloud deployment failure logs:
Attached in a file.
=======
Steps to reproduce:
1. Deploy a TripleO Train CentOS 8 overcloud using TripleO Quickstart.
2. After a successful deployment, run the overcloud-deploy.sh script available on the undercloud node again.
3. The overcloud deployment fails this time with the error shown in the 'Description' section above.
=======
Expected result:
Redeployment should work fine.
=======
Actual Result:
Redeployment of the cloud without any change is failing.
=======
Environment:
TripleO Train CentOS 8 cloud deployed with the TripleO Quickstart tool
=======
Logs and Configs:
Provided in Description section.
overcloud-deploy.sh:
(undercloud) [stack@undercloud ~]$ cat overcloud-deploy.sh
#!/bin/bash
set -ux
### --start_docs
## Deploying the overcloud
## =======
## Prepare Your Environment
## -------
## * Source in the undercloud credentials.
## ::
source /home/stack/stackrc
### --stop_docs
# Wait until there are hypervisors available.
while true; do
count=
if [ $count -gt 0 ]; then
break
fi
done
### --start_docs
## * Deploy the overcloud!
## ::
openstack overcloud deploy --stack overcloud \
--override-
--templates /usr/share/
--libvirt-type qemu --timeout 90 -e /home/stack/
"$@" && status_code=0 || status_code=$?
### --stop_docs
# Check if the deployment has started. If not, exit gracefully. If yes, check for errors.
if ! openstack stack list | grep -q overcloud; then
echo "overcloud deployment not started. Check the deploy configurations"
exit 1
# We don't always get a useful error code from the openstack deploy command,
# so check `openstack stack list` for a CREATE_COMPLETE or an UPDATE_COMPLETE
# status.
elif ! openstack stack list | grep -Eq '(CREATE|
# get the failures list
openstack stack failures list overcloud --long > /home/stack/
# get any puppet related errors
for failed in $(openstack stack resource list \
grep 'StructuredDepl
do
echo "openstack software deployment show output for deployment: $failed" >> /home/stack/
echo "######
openstack software deployment show $failed >> /home/stack/
echo "######
echo "puppet standard error for deployment: $failed" >> /home/stack/
echo "######
# the sed part removes color codes from the text
openstack software deployment show $failed -f json |
jq -r .output_
sed -r "s:\x1B\
echo "######
# We need to exit with 1 because of the above || true
done
exit 1
fi
exit $status_code
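
The status check described in the script's comments (look for CREATE_COMPLETE or UPDATE_COMPLETE in `openstack stack list`) can be tried in isolation against captured output. The helper below is a hypothetical sketch, not a line from overcloud-deploy.sh, and the exact pattern the script uses is not visible in the paste above.

```shell
#!/bin/bash
# stack_complete NAME : read `openstack stack list` output on stdin and
# succeed only if a line containing NAME as a whole word also shows
# CREATE_COMPLETE or UPDATE_COMPLETE. Hypothetical helper for illustration.
stack_complete() {
    local name=$1
    grep -w -- "$name" | grep -Eq '(CREATE|UPDATE)_COMPLETE'
}

# Usage on the undercloud (requires stackrc to be sourced):
#   openstack stack list | stack_complete overcloud && echo "stack is healthy"
```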
Please paste /var/log/pacemaker/pacemaker.log + /var/log/containers/rabbitmq/* + the journal output and the time of failure.
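
The requested material could be gathered on the affected controller with something like the following. This is a rough sketch: the helper name `collect_logs`, the output directory, and the journal time window are assumptions to adjust, not values taken from this report.

```shell
#!/bin/bash
# collect_logs OUTDIR PATH... : copy each existing PATH into OUTDIR and
# report the missing ones on stderr. Hypothetical helper for illustration.
collect_logs() {
    local outdir=$1; shift
    mkdir -p "$outdir" || return 1
    local p
    for p in "$@"; do
        if [ -e "$p" ]; then
            cp -r "$p" "$outdir"/
        else
            echo "missing: $p" >&2
        fi
    done
}

# Example run as root on the controller. The --since/--until window is a
# placeholder; set it around the actual time of failure.
#   collect_logs /tmp/rabbitmq-debug \
#       /var/log/pacemaker/pacemaker.log /var/log/containers/rabbitmq
#   journalctl --since "2021-06-23 11:00" --until "2021-06-24 05:00" \
#       > /tmp/rabbitmq-debug/journal.txt
#   tar czf /tmp/rabbitmq-debug.tar.gz -C /tmp rabbitmq-debug
```

Reporting missing paths instead of aborting means a partial bundle is still produced if one of the log locations differs on a given node.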