Pacemaker containers image tagging race condition

Bug #1805826 reported by Jiří Stránský on 2018-11-29

Affects   Status         Importance   Assigned to     Milestone
tripleo   Fix Released   Critical     Jiří Stránský   tripleo stein-2

Bug Description

Basic HA deployment fails. This is not apparent in CI because the failure is a race condition that comes into play only when you deploy more than one controller.

The race was probably introduced in https://review.openstack.org/#/c/614827/, which moved the image tagging from step 1 to step 2.

The failures look like this:

Failed Actions:
* rabbitmq-bundle-docker-1_start_0 on overcloud-controller-1 'unknown error' (1): call=165, status=complete, exitreason='failed to pull image 192.168.24.1:8787/tripleomaster/centos-binary-rabbitmq:pcmklatest',
    last-rc-change='Thu Nov 29 10:19:16 2018', queued=1ms, exec=300ms
* rabbitmq-bundle-docker-2_start_0 on overcloud-controller-1 'unknown error' (1): call=200, status=complete, exitreason='failed to pull image 192.168.24.1:8787/tripleomaster/centos-binary-rabbitmq:pcmklatest',
    last-rc-change='Thu Nov 29 10:19:17 2018', queued=0ms, exec=353ms
* galera-bundle-docker-0_start_0 on overcloud-controller-1 'unknown error' (1): call=183, status=complete, exitreason='failed to pull image 192.168.24.1:8787/tripleomaster/centos-binary-mariadb:pcmklatest',
    last-rc-change='Thu Nov 29 10:19:16 2018', queued=0ms, exec=251ms
* galera-bundle-docker-1_start_0 on overcloud-controller-1 'unknown error' (1): call=221, status=complete, exitreason='failed to pull image 192.168.24.1:8787/tripleomaster/centos-binary-mariadb:pcmklatest',
    last-rc-change='Thu Nov 29 10:19:20 2018', queued=0ms, exec=176ms
* redis-bundle-docker-0_start_0 on overcloud-controller-1 'unknown error' (1): call=223, status=complete, exitreason='failed to pull image 192.168.24.1:8787/tripleomaster/centos-binary-redis:pcmklatest',
    last-rc-change='Thu Nov 29 10:19:25 2018', queued=0ms, exec=150ms
* redis-bundle-docker-1_start_0 on overcloud-controller-1 'unknown error' (1): call=202, status=complete, exitreason='failed to pull image 192.168.24.1:8787/tripleomaster/centos-binary-redis:pcmklatest',
    last-rc-change='Thu Nov 29 10:19:17 2018', queued=0ms, exec=332ms
* haproxy-bundle-docker-1_start_0 on overcloud-controller-1 'unknown error' (1): call=225, status=complete, exitreason='failed to pull image 192.168.24.1:8787/tripleomaster/centos-binary-haproxy:pcmklatest',
    last-rc-change='Thu Nov 29 10:19:28 2018', queued=0ms, exec=147ms
* haproxy-bundle-docker-2_start_0 on overcloud-controller-1 'unknown error' (1): call=227, status=complete, exitreason='failed to pull image 192.168.24.1:8787/tripleomaster/centos-binary-haproxy:pcmklatest',
    last-rc-change='Thu Nov 29 10:19:28 2018', queued=0ms, exec=130ms
* rabbitmq-bundle-docker-1_start_0 on overcloud-controller-2 'unknown error' (1): call=204, status=complete, exitreason='failed to pull image 192.168.24.1:8787/tripleomaster/centos-binary-rabbitmq:pcmklatest',
    last-rc-change='Thu Nov 29 10:19:17 2018', queued=0ms, exec=312ms
* rabbitmq-bundle-docker-2_start_0 on overcloud-controller-2 'unknown error' (1): call=169, status=complete, exitreason='failed to pull image 192.168.24.1:8787/tripleomaster/centos-binary-rabbitmq:pcmklatest',
    last-rc-change='Thu Nov 29 10:19:16 2018', queued=0ms, exec=315ms
* galera-bundle-docker-0_start_0 on overcloud-controller-2 'unknown error' (1): call=218, status=complete, exitreason='failed to pull image 192.168.24.1:8787/tripleomaster/centos-binary-mariadb:pcmklatest',
    last-rc-change='Thu Nov 29 10:19:17 2018', queued=0ms, exec=194ms
* galera-bundle-docker-1_start_0 on overcloud-controller-2 'unknown error' (1): call=187, status=complete, exitreason='failed to pull image 192.168.24.1:8787/tripleomaster/centos-binary-mariadb:pcmklatest',
    last-rc-change='Thu Nov 29 10:19:16 2018', queued=0ms, exec=244ms
* redis-bundle-docker-0_start_0 on overcloud-controller-2 'unknown error' (1): call=221, status=complete, exitreason='failed to pull image 192.168.24.1:8787/tripleomaster/centos-binary-redis:pcmklatest',
    last-rc-change='Thu Nov 29 10:19:20 2018', queued=0ms, exec=178ms
* redis-bundle-docker-1_start_0 on overcloud-controller-2 'unknown error' (1): call=223, status=complete, exitreason='failed to pull image 192.168.24.1:8787/tripleomaster/centos-binary-redis:pcmklatest',
    last-rc-change='Thu Nov 29 10:19:25 2018', queued=0ms, exec=149ms
* haproxy-bundle-docker-1_start_0 on overcloud-controller-2 'unknown error' (1): call=227, status=complete, exitreason='failed to pull image 192.168.24.1:8787/tripleomaster/centos-binary-haproxy:pcmklatest',
    last-rc-change='Thu Nov 29 10:19:28 2018', queued=0ms, exec=135ms
* haproxy-bundle-docker-2_start_0 on overcloud-controller-2 'unknown error' (1): call=225, status=complete, exitreason='failed to pull image 192.168.24.1:8787/tripleomaster/centos-binary-haproxy:pcmklatest',
    last-rc-change='Thu Nov 29 10:19:28 2018', queued=0ms, exec=148ms
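
To see at a glance which nodes and images are affected, the failed-action lines above can be grouped by node. The following is only a minimal sketch: the regular expression is written against the exact line shape shown in this report, and the names FAILED_ACTION and failures_by_node are hypothetical helpers, not part of pcs or pacemaker.

```python
import re
from collections import defaultdict

# Matches one "Failed Actions" entry as it appears in this report.
# The pattern is an assumption based on the lines quoted above only.
FAILED_ACTION = re.compile(
    r"\* (?P<resource>\S+)_start_0 on (?P<node>\S+) .*"
    r"exitreason='failed to pull image (?P<image>[^']+)'"
)


def failures_by_node(text):
    """Map each node name to the sorted list of images it failed to pull."""
    by_node = defaultdict(set)
    for match in FAILED_ACTION.finditer(text):
        by_node[match["node"]].add(match["image"])
    return {node: sorted(images) for node, images in by_node.items()}
```

Applied to the output above, this yields entries only for overcloud-controller-1 and overcloud-controller-2, each with the same four pcmklatest images (rabbitmq, mariadb, redis, haproxy).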

OpenStack Infra (hudson-openstack) wrote on 2018-11-29: Fix proposed to tripleo-heat-templates (master)

Fix proposed to branch: master
Review: https://review.openstack.org/620918

Changed in tripleo:
assignee:	nobody → Jiří Stránský (jistr)
status:	Triaged → In Progress

OpenStack Infra (hudson-openstack) wrote on 2018-12-04: Fix merged to tripleo-heat-templates (master)

Reviewed: https://review.openstack.org/620918
Committed: https://git.openstack.org/cgit/openstack/tripleo-heat-templates/commit/?id=3f57d7380d8ffc6908750efb02ff2d0221839786
Submitter: Zuul
Branch: master

commit 3f57d7380d8ffc6908750efb02ff2d0221839786
Author: Jiri Stransky <email address hidden>
Date: Thu Nov 29 12:02:21 2018 +0100

    Fix pacemaker tagging race condition

    Change I81bc48b53068c3a5ed90266a4fd3e62bfb017835 moved image fetching
    and tagging for pacemaker-managed services from step 1 to step 2. This
    is also the step in which the services are started, which probably
    introduced a race condition in environments where the pacemaker
    cluster consists of more than one machine.

    During the deployment you can get a lot of pcmk failures like:

    failed to pull image 192.168.24.1:8787/tripleomaster/centos-binary-mariadb:pcmklatest

    This only happens on non-bootstrap nodes. On the bootstrap node the
    order is still correct: first download and tag the image, then start
    the pcmk resources. However, if non-bootstrap nodes are slower at
    downloading and tagging, pacemaker there might start the resources
    before the images are tagged (as the starting of resources is
    controlled globally from the bootstrap node).

    Change-Id: Id669cc9a296a8366c7c80a5ee509bdb964b62a04
    Closes-Bug: #1805826
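
The ordering problem described in the commit message can be illustrated with a small deterministic model. This is only a sketch under stated assumptions, not TripleO or pacemaker code: the Node class, the tag_time values, and the function name are hypothetical, and the model assumes that each deployment step finishes on every node before the next step begins, while nodes work independently within a step.

```python
from dataclasses import dataclass


@dataclass
class Node:
    name: str
    tag_time: int  # hypothetical time units this node needs to pull and tag images
    is_bootstrap: bool = False


def nodes_with_failed_starts(nodes, tag_step, start_step=2):
    """Return names of nodes whose resources start before their image is tagged."""
    if tag_step > start_step:
        raise ValueError("this sketch only models tagging at or before the start step")
    if tag_step < start_step:
        # Tagging finished on every node in an earlier step (assumed step barrier),
        # so all images are tagged before any resource is started.
        return []
    # Same step: the bootstrap node tags its own image, then starts the
    # resources cluster-wide, without waiting for the other nodes.
    bootstrap = next(n for n in nodes if n.is_bootstrap)
    start_time = bootstrap.tag_time
    return [n.name for n in nodes if n.tag_time > start_time]
```

With a bootstrap node that finishes tagging before one of its peers, tagging in the same step as the resource start reports that slower peer as failing, whereas tagging in an earlier step (as was the case before the change) reports no failures in this model.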