tripleo

Docker restart in deploy steps causing malfunction in pacemaker services during upgrades

Bug #1807418 reported by Jose Luis Franco on 2018-12-07

Affects   Status         Importance   Assigned to        Milestone
tripleo   Fix Released   High         Jose Luis Franco   tripleo stein-2

Bug Description

This launchpad bug is based on BZ https://bugzilla.redhat.com/show_bug.cgi?id=1656546:

Some pacemaker resources are stopped during deploy tasks, causing the upgrade to fail. In this exact case, it was rabbitmq-bundle and the upgrade failed with:

        "Error: rabbitmqctl status | grep -F \"{rabbit,\" returned 1 instead of one of [0]",
        "Error: /Stage[main]/Tripleo::Profile::Pacemaker::Rabbitmq_bundle/Exec[rabbitmq-ready]/returns: change from notrun to 0 failed: rabbitmqctl status | grep -F \"{rabbit,\" returned 1 instead of one of [0]",
        "Error: Failed to apply catalog: Command is still failing after 180 seconds expired!",
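The last error line suggests the readiness check was retried until a 180-second budget ran out. A minimal Python sketch of that retry pattern follows, assuming a 10-second pause between attempts (the actual interval is not shown in the report). The names `rabbit_app_running` and `wait_until_ready` are hypothetical and are not part of puppet-tripleo.

```python
import time
from typing import Callable


def rabbit_app_running(status_output: str) -> bool:
    """Fixed-string search, like `grep -F "{rabbit,"` on `rabbitmqctl status` output."""
    return "{rabbit," in status_output


def wait_until_ready(check: Callable[[], bool],
                     timeout: float = 180.0,
                     interval: float = 10.0,  # assumed; not stated in the report
                     clock: Callable[[], float] = time.monotonic,
                     sleep: Callable[[float], None] = time.sleep) -> bool:
    """Re-run `check` until it returns True or `timeout` seconds have elapsed."""
    deadline = clock() + timeout
    while True:
        if check():
            return True
        if clock() >= deadline:
            # Corresponds to "Command is still failing after 180 seconds expired!"
            return False
        sleep(interval)
```

With rabbitmq-bundle-0 stopped, every attempt of such a check fails, so the loop can only end by exhausting its time budget.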

When checking the resource status via crm_mon, we could see that rabbitmq-bundle-docker-0 had timed out while being stopped:

[root@controller-0 ~]# crm_mon -1
Stack: corosync
Current DC: controller-0 (version 1.1.19-8.el7_6.1-c3c624ea3d) - partition with quorum
Last updated: Tue Dec 4 11:42:18 2018
Last change: Tue Dec 4 10:24:32 2018 by hacluster via crmd on controller-0

4 nodes configured
17 resources configured

Online: [ controller-0 ]
GuestOnline: [ galera-bundle-0@controller-0 redis-bundle-0@controller-0 ]

Active resources:

Docker container: rabbitmq-bundle [192.168.24.1:8787/rhosp14/openstack-rabbitmq:pcmklatest]
   rabbitmq-bundle-0 (ocf::heartbeat:rabbitmq-cluster): Stopped controller-0
Docker container: galera-bundle [192.168.24.1:8787/rhosp14/openstack-mariadb:pcmklatest]
   galera-bundle-0 (ocf::heartbeat:galera): Master controller-0
Docker container: redis-bundle [192.168.24.1:8787/rhosp14/openstack-redis:pcmklatest]
   redis-bundle-0 (ocf::heartbeat:redis): Master controller-0
ip-192.168.24.10 (ocf::heartbeat:IPaddr2): Started controller-0
ip-10.0.0.101 (ocf::heartbeat:IPaddr2): Started controller-0
ip-172.17.1.11 (ocf::heartbeat:IPaddr2): Started controller-0
ip-172.17.1.17 (ocf::heartbeat:IPaddr2): Started controller-0
ip-172.17.3.10 (ocf::heartbeat:IPaddr2): Started controller-0
ip-172.17.4.11 (ocf::heartbeat:IPaddr2): Started controller-0
Docker container: haproxy-bundle [192.168.24.1:8787/rhosp14/openstack-haproxy:pcmklatest]
   haproxy-bundle-docker-0 (ocf::heartbeat:docker): Started controller-0
Docker container: openstack-cinder-volume [192.168.24.1:8787/rhosp14/openstack-cinder-volume:pcmklatest]
   openstack-cinder-volume-docker-0 (ocf::heartbeat:docker): Started controller-0

Failed Actions:
* rabbitmq-bundle-docker-0_stop_0 on controller-0 'unknown error' (1): call=72, status=Timed Out, exitreason='',
    last-rc-change='Mon Dec 3 21:09:22 2018', queued=0ms, exec=20007ms
* galera-bundle-docker-0_monitor_60000 on controller-0 'unknown error' (1): call=78, status=Timed Out, exitreason='',
    last-rc-change='Mon Dec 3 21:13:43 2018', queued=0ms, exec=0ms
* redis-bundle-docker-0_monitor_60000 on controller-0 'unknown error' (1): call=51, status=Timed Out, exitreason='',
    last-rc-change='Mon Dec 3 21:09:20 2018', queued=0ms, exec=0ms
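To compare the failures against other logs, it helps to pull the failed actions out of the crm_mon output and sort them by time. The sketch below parses only the text layout shown above. The regular expression and the `FailedAction` type are illustrative assumptions, not a Pacemaker API, and other Pacemaker versions may format these lines differently.

```python
import re
from datetime import datetime
from typing import List, NamedTuple

SAMPLE = """\
Failed Actions:
* rabbitmq-bundle-docker-0_stop_0 on controller-0 'unknown error' (1): call=72, status=Timed Out, exitreason='',
    last-rc-change='Mon Dec 3 21:09:22 2018', queued=0ms, exec=20007ms
* galera-bundle-docker-0_monitor_60000 on controller-0 'unknown error' (1): call=78, status=Timed Out, exitreason='',
    last-rc-change='Mon Dec 3 21:13:43 2018', queued=0ms, exec=0ms
* redis-bundle-docker-0_monitor_60000 on controller-0 'unknown error' (1): call=51, status=Timed Out, exitreason='',
    last-rc-change='Mon Dec 3 21:09:20 2018', queued=0ms, exec=0ms
"""


class FailedAction(NamedTuple):
    resource: str
    operation: str
    interval_ms: int
    node: str
    status: str
    last_rc_change: datetime
    exec_ms: int


# Matches one "* <resource>_<op>_<interval> on <node> ..." entry, which spans two lines.
_ACTION_RE = re.compile(
    r"\*\s+(?P<op_key>\S+)\s+on\s+(?P<node>\S+)\s+'[^']*'\s+\(\d+\):"
    r".*?status=(?P<status>[^,]+),"
    r".*?last-rc-change='(?P<when>[^']+)'"
    r".*?exec=(?P<exec_ms>\d+)ms",
    re.DOTALL,
)


def parse_failed_actions(text: str) -> List[FailedAction]:
    actions = []
    for m in _ACTION_RE.finditer(text):
        # The operation key has the form <resource>_<operation>_<interval in ms>.
        resource, operation, interval_ms = m.group("op_key").rsplit("_", 2)
        actions.append(FailedAction(
            resource=resource,
            operation=operation,
            interval_ms=int(interval_ms),
            node=m.group("node"),
            status=m.group("status").strip(),
            last_rc_change=datetime.strptime(m.group("when"), "%a %b %d %H:%M:%S %Y"),
            exec_ms=int(m.group("exec_ms")),
        ))
    return actions


if __name__ == "__main__":
    for action in sorted(parse_failed_actions(SAMPLE), key=lambda a: a.last_rc_change):
        print(action.last_rc_change, action.resource, action.operation, action.status)
```

Sorting by `last_rc_change` gives the earliest failure time, which is the value to compare against the journal.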

This seemed to be due to a Docker service restart, which appears in the journal a few minutes before the timestamp of the first timed-out action (21:05:28 versus 21:09:20):

[root@controller-0 ~]# journalctl | grep 'ing Docker'
Dec 03 21:05:28 controller-0 systemd[1]: Stopping Docker Application Container Engine...
Dec 03 21:05:29 controller-0 systemd[1]: Starting Docker Storage Setup...
Dec 03 21:05:29 controller-0 systemd[1]: Starting Docker Application Container Engine...

Looking through the upgrade logs for the task that performed this restart, we find in package_update.log that it is caused by the Ansible role container-registry:

2018-12-03 16:05:26,942 p=16722 u=mistral | TASK [container-registry : add deployment user to docker group] ****************
2018-12-03 16:05:26,942 p=16722 u=mistral | Monday 03 December 2018 16:05:26 -0500 (0:00:00.568) 0:01:44.899 *******
2018-12-03 16:05:26,974 p=16722 u=mistral | skipping: [controller-0] => {"changed": false, "skip_reason": "Conditional result was False"}
2018-12-03 16:05:26,975 p=16722 u=mistral | RUNNING HANDLER [container-registry : restart docker] **************************
2018-12-03 16:05:26,976 p=16722 u=mistral | Monday 03 December 2018 16:05:26 -0500 (0:00:00.033) 0:01:44.932 *******
2018-12-03 16:05:27,412 p=16722 u=mistral | changed: [controller-0] => {"changed": true, "cmd": ["/bin/true"], "delta": "0:00:00.014320", "end": "2018-12-03 21:05:27.337391", "rc": 0, "start": "2018-12-03 21:05:27.323071", "stderr": "", "stderr_lines": [], "stdout": "", "stdout_lines": []}
2018-12-03 16:05:27,414 p=16722 u=mistral | RUNNING HANDLER [container-registry : Docker | reload systemd] *****************
2018-12-03 16:05:27,414 p=16722 u=mistral | Monday 03 December 2018 16:05:27 -0500 (0:00:00.437) 0:01:45.370 *******
2018-12-03 16:05:27,910 p=16722 u=mistral | ok: [controller-0] => {"changed": false, "name": null, "status": {}}
2018-12-03 16:05:27,911 p=16722 u=mistral | RUNNING HANDLER [container-registry : Docker | reload docker] ******************
2018-12-03 16:05:27,911 p=16722 u=mistral | Monday 03 December 2018 16:05:27 -0500 (0:00:00.497) 0:01:45.868 *******

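This matches the Docker restart timestamp observed on controller-0 above: 16:05:27 -0500 in package_update.log corresponds to 21:05:27 on the controller, one second before systemd logged the Docker stop.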
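The two logs use different clocks: package_update.log carries a -0500 offset, while the controller journal shows 21:05. A small Python sketch of the conversion follows, assuming the journal is in UTC and belongs to 2018 (the journal lines omit both the year and the zone). The function names are hypothetical.

```python
from datetime import datetime, timedelta, timezone


def ansible_log_time_to_utc(stamp: str, utc_offset_hours: int) -> datetime:
    """Convert a 'YYYY-MM-DD HH:MM:SS,mmm' log stamp with a known UTC offset to UTC."""
    local = datetime.strptime(stamp, "%Y-%m-%d %H:%M:%S,%f")
    local = local.replace(tzinfo=timezone(timedelta(hours=utc_offset_hours)))
    return local.astimezone(timezone.utc)


def journal_time_to_utc(stamp: str, year: int) -> datetime:
    """Parse a 'Mon DD HH:MM:SS' journal stamp; the year and UTC zone are assumptions."""
    parsed = datetime.strptime(f"{year} {stamp}", "%Y %b %d %H:%M:%S")
    return parsed.replace(tzinfo=timezone.utc)


# "restart docker" handler result line from package_update.log (offset -0500).
handler_utc = ansible_log_time_to_utc("2018-12-03 16:05:27,412", -5)
# "Stopping Docker Application Container Engine" line from the controller journal.
docker_stop_utc = journal_time_to_utc("Dec 03 21:05:28", 2018)

gap_seconds = (docker_stop_utc - handler_utc).total_seconds()

if __name__ == "__main__":
    print(f"handler finished {handler_utc.isoformat()}, "
          f"docker stopping {docker_stop_utc.isoformat()}, gap {gap_seconds:.3f}s")
```

Under these assumptions the handler and the Docker stop are less than a second apart, which supports attributing the restart to the container-registry handlers.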

OpenStack Infra (hudson-openstack) on 2018-12-07

Changed in tripleo:
status:	Triaged → In Progress

OpenStack Infra (hudson-openstack) wrote on 2018-12-17: Fix merged to tripleo-heat-templates (master)

Reviewed: https://review.openstack.org/622969
Committed: https://git.openstack.org/cgit/openstack/tripleo-heat-templates/commit/?id=5bc5ae944a32d5c24090a4c5f5e44c64c2879c1a
Submitter: Zuul
Branch: master

commit 5bc5ae944a32d5c24090a4c5f5e44c64c2879c1a
Author: Jose Luis Franco Arza <email address hidden>
Date: Wed Dec 5 15:09:07 2018 +0100

Perform docker reconfiguration on upgrade_tasks.

    The container-registry role is idempotent in that the Docker
    service is restarted only if some configuration value has
    changed.
    During the upgrade, host_prep_tasks are being run and if the
    new templates bring some configuration change then the Docker
    service gets restarted. The issue is the point at which they
    get restarted, which is after the upgrade_tasks have already
    run and prior to the deploy_tasks. This is causing issues with
    Pacemaker handled resources.

    For that reason, we include the very same task running in host_prep_tasks
    into upgrade_tasks for the docker and docker-registry services,
    forcing the Docker service reconfiguration to happen during
    upgrade_tasks instead of at a later point.

Closes-Bug: #1807418
Change-Id: I5e6ca987c01ff72a3c7e8900f9572024521164de

Changed in tripleo:
status:	In Progress → Fix Released

OpenStack Infra (hudson-openstack) wrote on 2018-12-18: Fix proposed to tripleo-heat-templates (stable/rocky)

Fix proposed to branch: stable/rocky
Review: https://review.openstack.org/625821

OpenStack Infra (hudson-openstack) wrote on 2019-01-02: Fix merged to tripleo-heat-templates (stable/rocky)

Reviewed: https://review.openstack.org/625821
Committed: https://git.openstack.org/cgit/openstack/tripleo-heat-templates/commit/?id=033e6f70811f4c5552fb814a2a9cb805110fb8cc
Submitter: Zuul
Branch: stable/rocky

commit 033e6f70811f4c5552fb814a2a9cb805110fb8cc
Author: Jose Luis Franco Arza <email address hidden>
Date: Wed Dec 5 15:09:07 2018 +0100

Perform docker reconfiguration on upgrade_tasks.

This patch also fixes the typo included in the master branch patch
I5e6ca987c01ff72a3c7e8900f9572024521164de that caused LP#1808974.

    Closes-Bug: #1807418
    Related-Bug: #1808974
    Change-Id: I5e6ca987c01ff72a3c7e8900f9572024521164de
    (cherry picked from commit 5bc5ae944a32d5c24090a4c5f5e44c64c2879c1a)