Observed on a downstream Train deployment.
When running a minor update of an HA overcloud while some HA resources are in a failed state in pacemaker (i.e. some replicas of the resource are not running on some controller nodes because of errors), the script in charge of updating the HA resource locally tries to clean up the failed resource with an invalid pcs call:
"b\"Wed Jun 9 16:45:29 UTC 2021: openstack-cinder-volume is currently not running on 'controller-0', cleaning up its state to restart it if necessary\\nWed Jun 9 16:45:30 UTC 2021: Wait until openstack-cinder-volume is restarted anywhere in the cluster in state Started\\nWed Jun 9 16:45:30 UTC 2021: Will probe resource state with the following XPath pattern: //bundle[@id='openstack-cinder-volume']//resource\\nWed Jun 9 16:45:31 UTC 2021: openstack-cinder-volume successfully restarted\\n\"",
"b\"Error: Specified option '--node' is not supported in this command\\n\"",
"Completed $ podman run --name cinder_volume_restart_bundle --label config_id=tripleo_step5 --label container_name=cinder_volume_restart_bundle --label managed_by=tripleo-ControllerOpenstack --label config_data={\"command\": \"/pacemaker_restart_bundle.sh cinder_volume openstack-cinder-volume openstack-cinder-volume _ Started\", \"config_volume\": \"cinder\", \"detach\": false, \"environment\": {\"TRIPLEO_MINOR_UPDATE\": \"\", \"TRIPLEO_CONFIG_HASH\":
\"773f0006d8da11eb69451d0e2d851517\"}, \"image\": \"undercloud-0.ctlplane.redhat.local:8787/rh-osbs/rhosp16-openstack-cinder-volume:16.1_20210602.1\", \"ipc\": \"host\", \"net\": \"host\", \"start_order\": 2, \"user\": \"root\", \"volumes\": [\"/etc/hosts:/etc/hosts:ro\", \"/etc/localtime:/etc/localtime:ro\", \"/etc/pki/ca-trust/extracted:/etc/pki/ca-trust/extracted:ro\", \"/etc/pki/ca-trust/source/anchors:/etc/pki/ca-trust/source/anchors:ro\", \"/etc/pki/tls/certs/ca-bundle.crt:/etc/pki/tls/certs/ca-bundle.crt:ro\", \"/etc/pki/tls/certs/ca-bundle.trust.crt:/etc/pki/tls/certs/ca-bundle.trust.crt:ro\", \"/etc/pki/tls/cert.pem:/etc/pki/tls/cert.pem:ro\", \"/dev/log:/dev/log\", \"/etc/ipa/ca.crt:/etc/ipa/ca.crt:ro\", \"/var/lib/container-config-scripts/pacemaker_restart_bundle.sh:/pacemaker_restart_bundle.sh:ro\", \"/var/lib/container-config-scripts/pacemaker_wait_bundle.sh:/pacemaker_wait_bundle.sh:ro\", \"/dev/shm:/dev/shm:rw\", \"/etc/puppet:/etc/puppet:ro\", \"/var/lib/config-data/puppet-generated/cinder:/var/lib/kolla/config_files/src:ro\"]} --conmon-pidfile=/var/run/cinder_volume_restart_bundle.pid --log-driver k8s-file --log-opt path=/var/log/containers/stdouts/cinder_volume_restart_bundle.log --env=TRIPLEO_CONFIG_HASH=773f0006d8da11eb69451d0e2d851517 --env=TRIPLEO_MINOR_UPDATE --net=host --ipc=host --user=root --volume=/etc/hosts:/etc/hosts:ro --volume=/etc/localtime:/etc/localtime:ro --volume=/etc/pki/ca-trust/extracted:/etc/pki/ca-trust/extracted:ro --volume=/etc/pki/ca-trust/source/anchors:/etc/pki/ca-trust/source/anchors:ro --volume=/etc/pki/tls/certs/ca-bundle.crt:/etc/pki/tls/certs/ca-bundle.crt:ro --volume=/etc/pki/tls/certs/ca-bundle.trust.crt:/etc/pki/tls/certs/ca-bundle.trust.crt:ro --volume=/etc/pki/tls/cert.pem:/etc/pki/tls/cert.pem:ro --volume=/dev/log:/dev/log --volume=/etc/ipa/ca.crt:/etc/ipa/ca.crt:ro --volume=/var/lib/container-config-scripts/pacemaker_restart_bundle.sh:/pacemaker_restart_bundle.sh:ro 
--volume=/var/lib/container-config-scripts/pacemaker_wait_bundle.sh:/pacemaker_wait_bundle.sh:ro --volume=/dev/shm:/dev/shm:rw --volume=/etc/puppet:/etc/puppet:ro --volume=/var/lib/config-data/puppet-generated/cinder:/var/lib/kolla/config_files/src:ro undercloud-0.ctlplane.redhat.local:8787/rh-osbs/rhosp16-openstack-cinder-volume:16.1_20210602.1 /pacemaker_restart_bundle.sh cinder_volume openstack-cinder-volume openstack-cinder-volume _ Started",
"stdout: Wed Jun 9 16:45:29 UTC 2021: openstack-cinder-volume is currently not running on 'controller-0', cleaning up its state to restart it if necessary",
"",
"stderr: Error: Specified option '--node' is not supported in this command"
]
}
In that case, the minor update continues in sequence without failing, but the resource is actually not restarted, so the minor update isn't given a chance to recover the failed resource as expected.
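The silent failure described above can be illustrated with a small shell sketch. It is not taken from pacemaker_restart_bundle.sh: the helper name run_cleanup is hypothetical, and the sketch only assumes that the cleanup command exits non-zero when it rejects its arguments, as the "Error: Specified option '--node' is not supported in this command" message suggests. The idea is to capture the exit status of the cleanup command and propagate it, so the caller can stop or log the problem instead of going on to wait for a restart that may never have been triggered.

```shell
#!/bin/bash
# Hypothetical helper (not part of tripleo-heat-templates): run a cleanup
# command and surface its failure instead of continuing silently.
run_cleanup() {
    local output rc
    # Capture stdout and stderr so the error text can be reported.
    output=$("$@" 2>&1)
    rc=$?
    if [ "$rc" -ne 0 ]; then
        echo "cleanup command failed (rc=$rc): $output" >&2
        return "$rc"
    fi
    echo "cleanup command succeeded"
}

# Usage sketch: pass the real cleanup command and its arguments, then
# decide what to do when it fails, for example:
#   run_cleanup <cleanup command> <args...> || exit 1
```

Whether the update should abort or only log a warning on such a failure is a design choice left to the script's maintainers; the sketch only shows how the failure could be made visible.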
Fix proposed to branch: master
Review: https://review.opendev.org/c/openstack/tripleo-heat-templates/+/795704