scenario003-multinode-oooq-container too slow to stop container, makes deployment fail

Bug #1799744 reported by Damien Ciabrini
6
This bug affects 1 person
Affects Status Importance Assigned to Milestone
tripleo
Incomplete
High
Unassigned

Bug Description

From failed run http://logs.openstack.org/77/611677/4/check/tripleo-ci-centos-7-scenario003-multinode-oooq-container/c3d07bc

While deploying a HA containerized overcloud, pacemaker determines that it needs to restart a docker container which will be used to run mysql/galera.

When pacemaker schedules a stop operation, the ocf resource agent that manages docker containers spawns a 'docker stop' command Which takes too long (> 20s) to finish and makes the stop action time out for pacemaker.

A failed stop operation is a blocking operation for pacemaker (fencing is not configured at this stage), so the galera container won't restart, and the deployment eventually fails.

A docker stop operators usually never takes more than 20s to operate, so the host might have been under heavy/abnormal load.

Revision history for this message
Damien Ciabrini (dciabrin) wrote :
Download full text (7.1 KiB)

Having a look at the corosync logs on the overcloud node [1]

puppet-pacemaker enables the galera-bundle resource to allow start the containerized galera

Oct 23 11:46:55 [19679] centos-7-inap-mtl01-0003329495 crmd: info: abort_transition_graph: Transition aborted by deletion of meta_attributes[@id='galera-bundle-meta_attributes']: Configuration change | cib=0.17.0 source=te_update_diff:456 path=/cib/configuration/resources/bundle[@id='galera-bundle']/meta_attributes[@id='galera-bundle-meta_attributes'] complete=true

which let pengine schedule a start action to create the docker container

Oct 23 11:46:55 [19678] centos-7-inap-mtl01-0003329495 pengine: info: RecurringOp: Start recurring monitor (60s) for galera-bundle-docker-0 on centos-7-inap-mtl01-0003329495
Oct 23 11:46:55 [19679] centos-7-inap-mtl01-0003329495 crmd: notice: te_rsc_command: Initiating start operation galera-bundle-docker-0_start_0 locally on centos-7-inap-mtl01-0003329495 | action 27

Oct 23 11:46:55 docker(galera-bundle-docker-0)[40767]: INFO: checking for nsenter, which is required when 'monitor_cmd' is specified
Oct 23 11:46:55 docker(galera-bundle-docker-0)[40767]: INFO: running container galera-bundle-docker-0 for the first time
Oct 23 11:46:56 docker(galera-bundle-docker-0)[40767]: INFO: monitor cmd passed: exit code = 0
Oct 23 11:46:56 docker(galera-bundle-docker-0)[40767]: NOTICE: Container galera-bundle-docker-0 started successfully
Oct 23 11:46:56 [19676] centos-7-inap-mtl01-0003329495 lrmd: info: log_finished: finished - rsc:galera-bundle-docker-0 action:start call_id:17 pid:40767 exit-code:0 exec-time:642ms queue-time:0ms

and then a first recurring monitor operation is schedule to check wehther the docker container holding the galera service is running properly. And it succeeds

Oct 23 11:46:56 [19679] centos-7-inap-mtl01-0003329495 crmd: notice: te_rsc_command: Initiating monitor operation galera-bundle-docker-0_monitor_60000 locally on centos-7-inap-mtl01-0003329495 | action 28
Oct 23 11:46:56 docker(galera-bundle-docker-0)[40962]: INFO: monitor cmd passed: exit code = 0
Oct 23 11:46:56 [19679] centos-7-inap-mtl01-0003329495 crmd: info: process_lrm_event: Result of monitor operation for galera-bundle-docker-0 on centos-7-inap-mtl01-0003329495: 0 (ok) | call=18 key=galera-bundle-docker-0_monitor_60000 confirmed=false cib-update=105

then when the docker container is running, puppet-pacemaker creates a galera ocf resource in the cluster, to be ran in this docker container

Oct 23 11:47:23 [19674] centos-7-inap-mtl01-0003329495 cib: info: cib_process_request: Forwarding cib_apply_diff operation for section 'all' to all (origin=local/cibadmin/2)
Oct 23 11:47:23 [19674] centos-7-inap-mtl01-0003329495 cib: info: cib_perform_op: Diff: --- 0.17.4 2
Oct 23 11:47:23 [19674] centos-7-inap-mtl01-0003329495 cib: info: cib_perform_op: Diff: +++ 0.18.0 (null)
Oct 23 11:47:23 [19674] centos-7-inap-mtl01-0003329495 cib: info: cib_perform_op: + /cib: @epoch=18, @num_updates=0
Oct 23 11:47:23 [19674] centos-7-inap-mtl01-0003329495 cib: info: ci...

Read more...

Changed in tripleo:
importance: Undecided → Critical
milestone: none → stein-1
tags: added: queens-backport-potential rocky-backport-potential
Changed in tripleo:
status: New → Triaged
importance: Critical → High
Changed in tripleo:
milestone: stein-1 → stein-2
Changed in tripleo:
milestone: stein-2 → stein-3
Changed in tripleo:
milestone: stein-3 → stein-rc1
Changed in tripleo:
milestone: stein-rc1 → train-1
Changed in tripleo:
milestone: train-1 → train-2
Changed in tripleo:
milestone: train-2 → train-3
Changed in tripleo:
milestone: train-3 → ussuri-1
Changed in tripleo:
milestone: ussuri-1 → ussuri-2
wes hayutin (weshayutin)
Changed in tripleo:
milestone: ussuri-2 → ussuri-3
wes hayutin (weshayutin)
Changed in tripleo:
status: Triaged → Incomplete
wes hayutin (weshayutin)
Changed in tripleo:
milestone: ussuri-3 → ussuri-rc3
wes hayutin (weshayutin)
Changed in tripleo:
milestone: ussuri-rc3 → victoria-1
Changed in tripleo:
milestone: victoria-1 → victoria-3
To post a comment you must log in.
This report contains Public information  
Everyone can see this information.

Other bug subscribers

Remote bug watches

Bug watches keep track of this bug in other bug trackers.