Pacemaker containers image tagging race condition

Bug #1805826 reported by Jiří Stránský
This bug affects 1 person
Affects: tripleo
Status: Fix Released
Importance: Critical
Assigned to: Jiří Stránský

Bug Description

Basic HA deployment fails. This is not apparent in CI because it is a race condition that only comes into play when you deploy more than one controller.

The race was probably introduced in https://review.openstack.org/#/c/614827/, which moved the image tagging from step 1 to step 2.

The failures look like this:

Failed Actions:
* rabbitmq-bundle-docker-1_start_0 on overcloud-controller-1 'unknown error' (1): call=165, status=complete, exitreason='failed to pull image 192.168.24.1:8787/tripleomaster/centos-binary-rabbitmq:pcmklatest',
    last-rc-change='Thu Nov 29 10:19:16 2018', queued=1ms, exec=300ms
* rabbitmq-bundle-docker-2_start_0 on overcloud-controller-1 'unknown error' (1): call=200, status=complete, exitreason='failed to pull image 192.168.24.1:8787/tripleomaster/centos-binary-rabbitmq:pcmklatest',
    last-rc-change='Thu Nov 29 10:19:17 2018', queued=0ms, exec=353ms
* galera-bundle-docker-0_start_0 on overcloud-controller-1 'unknown error' (1): call=183, status=complete, exitreason='failed to pull image 192.168.24.1:8787/tripleomaster/centos-binary-mariadb:pcmklatest',
    last-rc-change='Thu Nov 29 10:19:16 2018', queued=0ms, exec=251ms
* galera-bundle-docker-1_start_0 on overcloud-controller-1 'unknown error' (1): call=221, status=complete, exitreason='failed to pull image 192.168.24.1:8787/tripleomaster/centos-binary-mariadb:pcmklatest',
    last-rc-change='Thu Nov 29 10:19:20 2018', queued=0ms, exec=176ms
* redis-bundle-docker-0_start_0 on overcloud-controller-1 'unknown error' (1): call=223, status=complete, exitreason='failed to pull image 192.168.24.1:8787/tripleomaster/centos-binary-redis:pcmklatest',
    last-rc-change='Thu Nov 29 10:19:25 2018', queued=0ms, exec=150ms
* redis-bundle-docker-1_start_0 on overcloud-controller-1 'unknown error' (1): call=202, status=complete, exitreason='failed to pull image 192.168.24.1:8787/tripleomaster/centos-binary-redis:pcmklatest',
    last-rc-change='Thu Nov 29 10:19:17 2018', queued=0ms, exec=332ms
* haproxy-bundle-docker-1_start_0 on overcloud-controller-1 'unknown error' (1): call=225, status=complete, exitreason='failed to pull image 192.168.24.1:8787/tripleomaster/centos-binary-haproxy:pcmklatest',
    last-rc-change='Thu Nov 29 10:19:28 2018', queued=0ms, exec=147ms
* haproxy-bundle-docker-2_start_0 on overcloud-controller-1 'unknown error' (1): call=227, status=complete, exitreason='failed to pull image 192.168.24.1:8787/tripleomaster/centos-binary-haproxy:pcmklatest',
    last-rc-change='Thu Nov 29 10:19:28 2018', queued=0ms, exec=130ms
* rabbitmq-bundle-docker-1_start_0 on overcloud-controller-2 'unknown error' (1): call=204, status=complete, exitreason='failed to pull image 192.168.24.1:8787/tripleomaster/centos-binary-rabbitmq:pcmklatest',
    last-rc-change='Thu Nov 29 10:19:17 2018', queued=0ms, exec=312ms
* rabbitmq-bundle-docker-2_start_0 on overcloud-controller-2 'unknown error' (1): call=169, status=complete, exitreason='failed to pull image 192.168.24.1:8787/tripleomaster/centos-binary-rabbitmq:pcmklatest',
    last-rc-change='Thu Nov 29 10:19:16 2018', queued=0ms, exec=315ms
* galera-bundle-docker-0_start_0 on overcloud-controller-2 'unknown error' (1): call=218, status=complete, exitreason='failed to pull image 192.168.24.1:8787/tripleomaster/centos-binary-mariadb:pcmklatest',
    last-rc-change='Thu Nov 29 10:19:17 2018', queued=0ms, exec=194ms
* galera-bundle-docker-1_start_0 on overcloud-controller-2 'unknown error' (1): call=187, status=complete, exitreason='failed to pull image 192.168.24.1:8787/tripleomaster/centos-binary-mariadb:pcmklatest',
    last-rc-change='Thu Nov 29 10:19:16 2018', queued=0ms, exec=244ms
* redis-bundle-docker-0_start_0 on overcloud-controller-2 'unknown error' (1): call=221, status=complete, exitreason='failed to pull image 192.168.24.1:8787/tripleomaster/centos-binary-redis:pcmklatest',
    last-rc-change='Thu Nov 29 10:19:20 2018', queued=0ms, exec=178ms
* redis-bundle-docker-1_start_0 on overcloud-controller-2 'unknown error' (1): call=223, status=complete, exitreason='failed to pull image 192.168.24.1:8787/tripleomaster/centos-binary-redis:pcmklatest',
    last-rc-change='Thu Nov 29 10:19:25 2018', queued=0ms, exec=149ms
* haproxy-bundle-docker-1_start_0 on overcloud-controller-2 'unknown error' (1): call=227, status=complete, exitreason='failed to pull image 192.168.24.1:8787/tripleomaster/centos-binary-haproxy:pcmklatest',
    last-rc-change='Thu Nov 29 10:19:28 2018', queued=0ms, exec=135ms
* haproxy-bundle-docker-2_start_0 on overcloud-controller-2 'unknown error' (1): call=225, status=complete, exitreason='failed to pull image 192.168.24.1:8787/tripleomaster/centos-binary-haproxy:pcmklatest',
    last-rc-change='Thu Nov 29 10:19:28 2018', queued=0ms, exec=148ms

OpenStack Infra (hudson-openstack) wrote : Fix proposed to tripleo-heat-templates (master)

Fix proposed to branch: master
Review: https://review.openstack.org/620918

Changed in tripleo:
assignee: nobody → Jiří Stránský (jistr)
status: Triaged → In Progress
OpenStack Infra (hudson-openstack) wrote : Fix merged to tripleo-heat-templates (master)

Reviewed: https://review.openstack.org/620918
Committed: https://git.openstack.org/cgit/openstack/tripleo-heat-templates/commit/?id=3f57d7380d8ffc6908750efb02ff2d0221839786
Submitter: Zuul
Branch: master

commit 3f57d7380d8ffc6908750efb02ff2d0221839786
Author: Jiri Stransky <email address hidden>
Date: Thu Nov 29 12:02:21 2018 +0100

    Fix pacemaker tagging race condition

    Change I81bc48b53068c3a5ed90266a4fd3e62bfb017835 moved image fetching
    and tagging for pacemaker-managed services from step 1 to step 2. This
    is also a step when the services are started, which probably
    introduced a race condition for environments where pacemaker cluster
    consists of more than one machine.

    During the deployment you can get a lot of pcmk failures like:

    failed to pull image 192.168.24.1:8787/tripleomaster/centos-binary-mariadb:pcmklatest

    This only happens on non-bootstrap nodes. On bootstrap node the order
    is still correct, first download and tag image, and then start the
    pcmk resources. However, if non-bootstrap nodes are slower with
    downloading and tagging, pacemaker there might start the resources
    before the images are tagged (as the starting of resources is
    controlled globally from bootstrap node).

    Change-Id: Id669cc9a296a8366c7c80a5ee509bdb964b62a04
    Closes-Bug: #1805826
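
The ordering problem described in the commit message can be illustrated with a small deterministic model. This is only a sketch under stated assumptions: the `Node` class, the `failed_starts` function, and the timing values are hypothetical and are not taken from tripleo-heat-templates, and the "separate step" branch shows one way to avoid the race, not necessarily what review 620918 changed.

```python
from dataclasses import dataclass


@dataclass
class Node:
    name: str
    tag_seconds: float  # hypothetical time needed to pull and tag the image
    bootstrap: bool = False


def failed_starts(nodes, tag_and_start_in_same_step):
    """Return the nodes where a resource start would happen before tagging is done."""
    bootstrap = next(n for n in nodes if n.bootstrap)
    if tag_and_start_in_same_step:
        # Racy ordering: the bootstrap node triggers the cluster-wide start
        # as soon as its own image is tagged.
        start_at = bootstrap.tag_seconds
    else:
        # Tagging finishes on every node in an earlier step, so the start
        # cannot happen before the slowest node is done.
        start_at = max(n.tag_seconds for n in nodes)
    # A start fails on any node whose tag is not yet in place at start time.
    return [n.name for n in nodes if n.tag_seconds > start_at]


if __name__ == "__main__":
    # Illustrative timings only; the bootstrap node happens to be fastest.
    cluster = [
        Node("overcloud-controller-0", 5.0, bootstrap=True),
        Node("overcloud-controller-1", 8.0),
        Node("overcloud-controller-2", 9.0),
    ]
    print("same step:", failed_starts(cluster, True))
    print("separate steps:", failed_starts(cluster, False))
```

With a single controller the bootstrap node is the only node, so the model never reports a failure, which is consistent with the bug not showing up in single-controller CI jobs.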

Changed in tripleo:
status: In Progress → Fix Released
OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/tripleo-heat-templates 10.3.0

This issue was fixed in the openstack/tripleo-heat-templates 10.3.0 release.
