Docker restart in deploy steps causing malfunction in pacemaker services during upgrades

Bug #1807418 reported by Jose Luis Franco
This bug affects 1 person
Affects: tripleo
Status: Fix Released
Importance: High
Assigned to: Jose Luis Franco

Bug Description

This launchpad bug is based on BZ https://bugzilla.redhat.com/show_bug.cgi?id=1656546:

Some pacemaker resources are stopped during deploy tasks, causing the upgrade to fail. In this exact case, it was rabbitmq-bundle and the upgrade failed with:

        "Error: rabbitmqctl status | grep -F \"{rabbit,\" returned 1 instead of one of [0]",
        "Error: /Stage[main]/Tripleo::Profile::Pacemaker::Rabbitmq_bundle/Exec[rabbitmq-ready]/returns: change from notrun to 0 failed: rabbitmqctl status | grep -F \"{rabbit,\" returned 1 instead of one of [0]",
        "Error: Failed to apply catalog: Command is still failing after 180 seconds expired!",

When checking the resource status via crm_mon, we could see that rabbitmq-bundle-docker-0 timed out while being stopped:

[root@controller-0 ~]# crm_mon -1
Stack: corosync
Current DC: controller-0 (version 1.1.19-8.el7_6.1-c3c624ea3d) - partition with quorum
Last updated: Tue Dec 4 11:42:18 2018
Last change: Tue Dec 4 10:24:32 2018 by hacluster via crmd on controller-0

4 nodes configured
17 resources configured

Online: [ controller-0 ]
GuestOnline: [ galera-bundle-0@controller-0 redis-bundle-0@controller-0 ]

Active resources:

 Docker container: rabbitmq-bundle [192.168.24.1:8787/rhosp14/openstack-rabbitmq:pcmklatest]
   rabbitmq-bundle-0 (ocf::heartbeat:rabbitmq-cluster): Stopped controller-0
 Docker container: galera-bundle [192.168.24.1:8787/rhosp14/openstack-mariadb:pcmklatest]
   galera-bundle-0 (ocf::heartbeat:galera): Master controller-0
 Docker container: redis-bundle [192.168.24.1:8787/rhosp14/openstack-redis:pcmklatest]
   redis-bundle-0 (ocf::heartbeat:redis): Master controller-0
 ip-192.168.24.10 (ocf::heartbeat:IPaddr2): Started controller-0
 ip-10.0.0.101 (ocf::heartbeat:IPaddr2): Started controller-0
 ip-172.17.1.11 (ocf::heartbeat:IPaddr2): Started controller-0
 ip-172.17.1.17 (ocf::heartbeat:IPaddr2): Started controller-0
 ip-172.17.3.10 (ocf::heartbeat:IPaddr2): Started controller-0
 ip-172.17.4.11 (ocf::heartbeat:IPaddr2): Started controller-0
 Docker container: haproxy-bundle [192.168.24.1:8787/rhosp14/openstack-haproxy:pcmklatest]
   haproxy-bundle-docker-0 (ocf::heartbeat:docker): Started controller-0
 Docker container: openstack-cinder-volume [192.168.24.1:8787/rhosp14/openstack-cinder-volume:pcmklatest]
   openstack-cinder-volume-docker-0 (ocf::heartbeat:docker): Started controller-0

Failed Actions:
* rabbitmq-bundle-docker-0_stop_0 on controller-0 'unknown error' (1): call=72, status=Timed Out, exitreason='',
    last-rc-change='Mon Dec 3 21:09:22 2018', queued=0ms, exec=20007ms
* galera-bundle-docker-0_monitor_60000 on controller-0 'unknown error' (1): call=78, status=Timed Out, exitreason='',
    last-rc-change='Mon Dec 3 21:13:43 2018', queued=0ms, exec=0ms
* redis-bundle-docker-0_monitor_60000 on controller-0 'unknown error' (1): call=51, status=Timed Out, exitreason='',
    last-rc-change='Mon Dec 3 21:09:20 2018', queued=0ms, exec=0ms

This seemed to be due to a Docker service restart, which appears in the journal a few minutes before the earliest failed-action timestamp (21:05:28 versus 21:09:20):

[root@controller-0 ~]# journalctl | grep 'ing Docker'
Dec 03 21:05:28 controller-0 systemd[1]: Stopping Docker Application Container Engine...
Dec 03 21:05:29 controller-0 systemd[1]: Starting Docker Storage Setup...
Dec 03 21:05:29 controller-0 systemd[1]: Starting Docker Application Container Engine...
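
The relative timing of the Docker stop and the failed Pacemaker actions can be double-checked with a few lines of Python. This is only a minimal sketch: the timestamps are copied by hand from the journalctl and crm_mon output above, it assumes both tools report the same controller-0 clock, and the variable names are illustrative rather than part of any TripleO tooling.

```python
from datetime import datetime

# Docker stop time, copied by hand from the journalctl output above.
docker_stop = datetime(2018, 12, 3, 21, 5, 28)

# last-rc-change values, copied by hand from the crm_mon "Failed Actions" list.
failed_actions = {
    "rabbitmq-bundle-docker-0_stop_0": datetime(2018, 12, 3, 21, 9, 22),
    "galera-bundle-docker-0_monitor_60000": datetime(2018, 12, 3, 21, 13, 43),
    "redis-bundle-docker-0_monitor_60000": datetime(2018, 12, 3, 21, 9, 20),
}

# Time elapsed between the Docker stop and each failed action.
gaps = {name: ts - docker_stop for name, ts in failed_actions.items()}

# Print the failed actions ordered by how soon they followed the Docker stop.
for name, gap in sorted(gaps.items(), key=lambda item: item[1]):
    print(f"{name}: {gap} after the Docker stop")
```

Every failed action is recorded after the Docker stop, which is consistent with the restart being the trigger rather than a consequence.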

Looking through the upgrade logs for the task that performed this restart, we find in package_update.log that it is caused by the Ansible role container-registry:

2018-12-03 16:05:26,942 p=16722 u=mistral | TASK [container-registry : add deployment user to docker group] ****************
2018-12-03 16:05:26,942 p=16722 u=mistral | Monday 03 December 2018 16:05:26 -0500 (0:00:00.568) 0:01:44.899 *******
2018-12-03 16:05:26,974 p=16722 u=mistral | skipping: [controller-0] => {"changed": false, "skip_reason": "Conditional result was False"}
2018-12-03 16:05:26,975 p=16722 u=mistral | RUNNING HANDLER [container-registry : restart docker] **************************
2018-12-03 16:05:26,976 p=16722 u=mistral | Monday 03 December 2018 16:05:26 -0500 (0:00:00.033) 0:01:44.932 *******
2018-12-03 16:05:27,412 p=16722 u=mistral | changed: [controller-0] => {"changed": true, "cmd": ["/bin/true"], "delta": "0:00:00.014320", "end": "2018-12-03 21:05:27.337391", "rc": 0, "start": "2018-12-03 21:05:27.323071", "stderr": "", "stderr_lines": [], "stdout": "", "stdout_lines": []}
2018-12-03 16:05:27,414 p=16722 u=mistral | RUNNING HANDLER [container-registry : Docker | reload systemd] *****************
2018-12-03 16:05:27,414 p=16722 u=mistral | Monday 03 December 2018 16:05:27 -0500 (0:00:00.437) 0:01:45.370 *******
2018-12-03 16:05:27,910 p=16722 u=mistral | ok: [controller-0] => {"changed": false, "name": null, "status": {}}
2018-12-03 16:05:27,911 p=16722 u=mistral | RUNNING HANDLER [container-registry : Docker | reload docker] ******************
2018-12-03 16:05:27,911 p=16722 u=mistral | Monday 03 December 2018 16:05:27 -0500 (0:00:00.497) 0:01:45.868 *******

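The "restart docker" handler ran at 16:05:27 -0500 (21:05:27 on controller-0, per the "end" field of the handler result), which matches the Docker stop recorded in the controller-0 journal at 21:05:28.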
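
The two logs use different time zones, so the comparison needs an offset conversion. The sketch below shows that arithmetic under two assumptions that are not stated in the report: that the -0500 offset printed in package_update.log applies to the handler timestamp, and that the controller-0 journal is in UTC (which the "end" field of the handler result suggests).

```python
from datetime import datetime, timedelta, timezone

# Handler timestamp from package_update.log, logged with a -0500 offset.
log_zone = timezone(timedelta(hours=-5))
handler_local = datetime(2018, 12, 3, 16, 5, 27, tzinfo=log_zone)

# Convert to UTC so it can be compared with the controller journal.
handler_utc = handler_local.astimezone(timezone.utc)

# Docker stop from the controller-0 journal (assumed to be UTC).
docker_stop = datetime(2018, 12, 3, 21, 5, 28, tzinfo=timezone.utc)

# Delay between the handler firing and systemd stopping Docker.
delay = docker_stop - handler_utc
print(handler_utc.isoformat(), delay.total_seconds())
```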

Changed in tripleo:
status: Triaged → In Progress
OpenStack Infra (hudson-openstack) wrote : Fix merged to tripleo-heat-templates (master)

Reviewed: https://review.openstack.org/622969
Committed: https://git.openstack.org/cgit/openstack/tripleo-heat-templates/commit/?id=5bc5ae944a32d5c24090a4c5f5e44c64c2879c1a
Submitter: Zuul
Branch: master

commit 5bc5ae944a32d5c24090a4c5f5e44c64c2879c1a
Author: Jose Luis Franco Arza <email address hidden>
Date: Wed Dec 5 15:09:07 2018 +0100

    Perform docker reconfiguration on upgrade_tasks.

    The container-registry role is idempotent in a way that the
    restarting of the docker service will be done only if some
    configuration value has changed.
    During the upgrade, host_prep_tasks are being run and if the
    new templates bring some configuration change then the Docker
    service gets restarted. The issue is the point at which they
    get restarted, which is after the upgrade_tasks have already
    run and prior to the deploy_tasks. This is causing issues with
    Pacemaker handled resources.

    For that reason, we include the very same task running in host_prep_tasks
    into upgrade_tasks for the docker and docker-registry services,
    forcing the Docker service reconfiguration to happen during
    upgrade_tasks instead of at a latter point.

    Closes-Bug: #1807418
    Change-Id: I5e6ca987c01ff72a3c7e8900f9572024521164de
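
The ordering argument in the commit message can be illustrated with a toy model. This is not TripleO code: the function name, the phase list and the "restart fires once, in the first phase that applies the Docker configuration" rule are simplifying assumptions drawn only from the description above.

```python
# Phase order as described in the commit message: host_prep_tasks run after
# upgrade_tasks and before deploy_tasks.
PHASES = ["upgrade_tasks", "host_prep_tasks", "deploy_tasks"]


def docker_restart_phase(phases_applying_docker_config, config_changed=True):
    """Return the phase in which the Docker restart would fire (toy model).

    The role is treated as idempotent: the restart happens only in the first
    phase that applies a changed configuration; later runs see no change.
    """
    if not config_changed:
        return None
    for phase in PHASES:
        if phase in phases_applying_docker_config:
            return phase
    return None


# Before the fix: only host_prep_tasks reconfigure Docker, so the restart
# lands after upgrade_tasks have already run.
before = docker_restart_phase({"host_prep_tasks"})

# After the fix: upgrade_tasks run the same task first, so the restart moves
# into upgrade_tasks and host_prep_tasks find nothing left to change.
after = docker_restart_phase({"upgrade_tasks", "host_prep_tasks"})

print(before, after)
```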

Changed in tripleo:
status: In Progress → Fix Released
OpenStack Infra (hudson-openstack) wrote : Fix proposed to tripleo-heat-templates (stable/rocky)

Fix proposed to branch: stable/rocky
Review: https://review.openstack.org/625821

OpenStack Infra (hudson-openstack) wrote : Fix merged to tripleo-heat-templates (stable/rocky)

Reviewed: https://review.openstack.org/625821
Committed: https://git.openstack.org/cgit/openstack/tripleo-heat-templates/commit/?id=033e6f70811f4c5552fb814a2a9cb805110fb8cc
Submitter: Zuul
Branch: stable/rocky

commit 033e6f70811f4c5552fb814a2a9cb805110fb8cc
Author: Jose Luis Franco Arza <email address hidden>
Date: Wed Dec 5 15:09:07 2018 +0100

    Perform docker reconfiguration on upgrade_tasks.

    The container-registry role is idempotent in a way that the
    restarting of the docker service will be done only if some
    configuration value has changed.
    During the upgrade, host_prep_tasks are being run and if the
    new templates bring some configuration change then the Docker
    service gets restarted. The issue is the point at which they
    get restarted, which is after the upgrade_tasks have already
    run and prior to the deploy_tasks. This is causing issues with
    Pacemaker handled resources.

    For that reason, we include the very same task running in host_prep_tasks
    into upgrade_tasks for the docker and docker-registry services,
    forcing the Docker service reconfiguration to happen during
    upgrade_tasks instead of at a latter point.

    This patch also fixes the typo included in the master branch patch
    I5e6ca987c01ff72a3c7e8900f9572024521164de that caused LP#1808974.

    Closes-Bug: #1807418
    Related-Bug: #1808974
    Change-Id: I5e6ca987c01ff72a3c7e8900f9572024521164de
    (cherry picked from commit 5bc5ae944a32d5c24090a4c5f5e44c64c2879c1a)

tags: added: in-stable-rocky
OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/tripleo-heat-templates 9.2.0

This issue was fixed in the openstack/tripleo-heat-templates 9.2.0 release.

OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/tripleo-heat-templates 10.3.0

This issue was fixed in the openstack/tripleo-heat-templates 10.3.0 release.
