Docker restart in deploy steps causing malfunction in pacemaker services during upgrades
Affects | Status | Importance | Assigned to | Milestone
---|---|---|---|---
tripleo | Fix Released | High | Jose Luis Franco |
Bug Description
This Launchpad bug is based on BZ https:/
Some pacemaker resources are stopped during deploy tasks, causing the upgrade to fail. In this exact case, it was rabbitmq-bundle and the upgrade failed with:
"Error: rabbitmqctl status | grep -F \"{rabbit,\" returned 1 instead of one of [0]",
"Error: /Stage[
"Error: Failed to apply catalog: Command is still failing after 180 seconds expired!",
When checking the resource status via crm_mon, we could see that rabbit-
[root@controller-0 ~]# crm_mon -1
Stack: corosync
Current DC: controller-0 (version 1.1.19-
Last updated: Tue Dec 4 11:42:18 2018
Last change: Tue Dec 4 10:24:32 2018 by hacluster via crmd on controller-0
4 nodes configured
17 resources configured
Online: [ controller-0 ]
GuestOnline: [ galera-
Active resources:
Docker container: rabbitmq-bundle [192.168.
rabbitmq-
Docker container: galera-bundle [192.168.
galera-bundle-0 (ocf::heartbeat
Docker container: redis-bundle [192.168.
redis-bundle-0 (ocf::heartbeat
ip-192.168.24.10 (ocf::heartbeat
ip-10.0.0.101 (ocf::heartbeat
ip-172.17.1.11 (ocf::heartbeat
ip-172.17.1.17 (ocf::heartbeat
ip-172.17.3.10 (ocf::heartbeat
ip-172.17.4.11 (ocf::heartbeat
Docker container: haproxy-bundle [192.168.
haproxy-
Docker container: openstack-
openstack-
Failed Actions:
* rabbitmq-
last-
* galera-
last-
* redis-bundle-
last-
This seemed to be due to a Docker service restart, which could be spotted 5 seconds before the timestamp of the first timed-out service:
[root@controller-0 ~]# journalctl | grep 'ing Docker'
Dec 03 21:05:28 controller-0 systemd[1]: Stopping Docker Application Container Engine...
Dec 03 21:05:29 controller-0 systemd[1]: Starting Docker Storage Setup...
Dec 03 21:05:29 controller-0 systemd[1]: Starting Docker Application Container Engine...
Looking through the upgrade tasks for the task that performed this restart, we find in package_update.log that it is caused by the Ansible role container-registry:
2018-12-03 16:05:26,942 p=16722 u=mistral | TASK [container-registry : add deployment user to docker group] ****************
2018-12-03 16:05:26,942 p=16722 u=mistral | Monday 03 December 2018 16:05:26 -0500 (0:00:00.568) 0:01:44.899 *******
2018-12-03 16:05:26,974 p=16722 u=mistral | skipping: [controller-0] => {"changed": false, "skip_reason": "Conditional result was False"}
2018-12-03 16:05:26,975 p=16722 u=mistral | RUNNING HANDLER [container-registry : restart docker] *******
2018-12-03 16:05:26,976 p=16722 u=mistral | Monday 03 December 2018 16:05:26 -0500 (0:00:00.033) 0:01:44.932 *******
2018-12-03 16:05:27,412 p=16722 u=mistral | changed: [controller-0] => {"changed": true, "cmd": ["/bin/true"], "delta": "0:00:00.014320", "end": "2018-12-03 21:05:27.337391", "rc": 0, "start": "2018-12-03 21:05:27.323071", "stderr": "", "stderr_lines": [], "stdout": "", "stdout_lines": []}
2018-12-03 16:05:27,414 p=16722 u=mistral | RUNNING HANDLER [container-registry : Docker | reload systemd] *****************
2018-12-03 16:05:27,414 p=16722 u=mistral | Monday 03 December 2018 16:05:27 -0500 (0:00:00.437) 0:01:45.370 *******
2018-12-03 16:05:27,910 p=16722 u=mistral | ok: [controller-0] => {"changed": false, "name": null, "status": {}}
2018-12-03 16:05:27,911 p=16722 u=mistral | RUNNING HANDLER [container-registry : Docker | reload docker] ******************
2018-12-03 16:05:27,911 p=16722 u=mistral | Monday 03 December 2018 16:05:27 -0500 (0:00:00.497) 0:01:45.868 *******
This matches the Docker restart timestamp observed on controller-0 above.
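The restart comes from Ansible's handler mechanism: a configuration task in the role notifies a "restart docker" handler, and that handler fires whenever the task reports a change, regardless of where in the upgrade flow the play happens to run. A minimal sketch of the pattern (file paths, template names, and task bodies here are illustrative, not the actual tripleo role, which chains several handlers as the log above shows):

```yaml
# roles/container-registry/tasks/main.yml (illustrative)
- name: write docker daemon configuration
  template:
    src: docker-sysconfig.j2        # hypothetical template name
    dest: /etc/sysconfig/docker
  notify: restart docker            # handler is queued only when the file changes

# roles/container-registry/handlers/main.yml (illustrative)
- name: restart docker
  systemd:
    name: docker
    state: restarted
    daemon_reload: yes
```

Because the handler only fires on change, the role is idempotent in steady state, but an upgrade that ships even one changed configuration value restarts the whole Docker engine, taking every container on the host down with it, including the pacemaker-managed bundles.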
Changed in tripleo:
status: Triaged → In Progress
Reviewed: https://review.openstack.org/622969
Committed: https://git.openstack.org/cgit/openstack/tripleo-heat-templates/commit/?id=5bc5ae944a32d5c24090a4c5f5e44c64c2879c1a
Submitter: Zuul
Branch: master
commit 5bc5ae944a32d5c24090a4c5f5e44c64c2879c1a
Author: Jose Luis Franco Arza <email address hidden>
Date: Wed Dec 5 15:09:07 2018 +0100
Perform docker reconfiguration on upgrade_tasks.
The container-registry role is idempotent: the Docker service is
restarted only if some configuration value has changed.
During the upgrade, host_prep_tasks are being run and if the
new templates bring some configuration change then the Docker
service gets restarted. The issue is the point at which they
get restarted, which is after the upgrade_tasks have already
run and prior to the deploy_tasks. This causes issues with
Pacemaker-handled resources.
For that reason, we include the very same task running in host_prep_tasks
into upgrade_tasks for the docker and docker-registry services,
forcing the Docker service reconfiguration to happen during
upgrade_tasks instead of at a later point.
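In tripleo-heat-templates terms, the fix described above amounts to invoking the same container-registry role from the service's upgrade_tasks block, so that any Docker restart happens during upgrade_tasks, before deploy_tasks start touching pacemaker-managed resources. A rough sketch of the shape (the step condition and surrounding structure are illustrative, not a verbatim excerpt from the merged template):

```yaml
# tripleo-heat-templates docker service fragment (illustrative)
host_prep_tasks:
  - name: configure container registry and docker daemon
    include_role:
      name: container-registry

upgrade_tasks:
  - name: reconfigure docker during upgrade_tasks, before deploy_tasks run
    when: step|int == 3            # illustrative step number
    include_role:
      name: container-registry
```

Running the role at both points is safe precisely because it is idempotent: if upgrade_tasks already applied the new configuration and restarted Docker, the later host_prep_tasks run sees no change and triggers no second restart.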
Closes-Bug: #1807418
Change-Id: I5e6ca987c01ff72a3c7e8900f9572024521164de