tripleo

Bug #1799744
Comment #1

Comment 1 for bug 1799744

Revision history for this message

Damien Ciabrini (dciabrin) wrote on 2018-10-24:

Having a look at the corosync logs on the overcloud node [1]

puppet-pacemaker enables the galera-bundle resource to allow start the containerized galera

Oct 23 11:46:55 [19679] centos-7-inap-mtl01-0003329495 crmd: info: abort_transition_graph: Transition aborted by deletion of meta_attributes[@id='galera-bundle-meta_attributes']: Configuration change | cib=0.17.0 source=te_update_diff:456 path=/cib/configuration/resources/bundle[@id='galera-bundle']/meta_attributes[@id='galera-bundle-meta_attributes'] complete=true

which let pengine schedule a start action to create the docker container

Oct 23 11:46:55 [19678] centos-7-inap-mtl01-0003329495 pengine: info: RecurringOp: Start recurring monitor (60s) for galera-bundle-docker-0 on centos-7-inap-mtl01-0003329495
Oct 23 11:46:55 [19679] centos-7-inap-mtl01-0003329495 crmd: notice: te_rsc_command: Initiating start operation galera-bundle-docker-0_start_0 locally on centos-7-inap-mtl01-0003329495 | action 27

Oct 23 11:46:55 docker(galera-bundle-docker-0)[40767]: INFO: checking for nsenter, which is required when 'monitor_cmd' is specified
Oct 23 11:46:55 docker(galera-bundle-docker-0)[40767]: INFO: running container galera-bundle-docker-0 for the first time
Oct 23 11:46:56 docker(galera-bundle-docker-0)[40767]: INFO: monitor cmd passed: exit code = 0
Oct 23 11:46:56 docker(galera-bundle-docker-0)[40767]: NOTICE: Container galera-bundle-docker-0 started successfully
Oct 23 11:46:56 [19676] centos-7-inap-mtl01-0003329495 lrmd: info: log_finished: finished - rsc:galera-bundle-docker-0 action:start call_id:17 pid:40767 exit-code:0 exec-time:642ms queue-time:0ms

and then a first recurring monitor operation is schedule to check wehther the docker container holding the galera service is running properly. And it succeeds

Oct 23 11:46:56 [19679] centos-7-inap-mtl01-0003329495 crmd: notice: te_rsc_command: Initiating monitor operation galera-bundle-docker-0_monitor_60000 locally on centos-7-inap-mtl01-0003329495 | action 28
Oct 23 11:46:56 docker(galera-bundle-docker-0)[40962]: INFO: monitor cmd passed: exit code = 0
Oct 23 11:46:56 [19679] centos-7-inap-mtl01-0003329495 crmd: info: process_lrm_event: Result of monitor operation for galera-bundle-docker-0 on centos-7-inap-mtl01-0003329495: 0 (ok) | call=18 key=galera-bundle-docker-0_monitor_60000 confirmed=false cib-update=105

then when the docker container is running, puppet-pacemaker creates a galera ocf resource in the cluster, to be ran in this docker container

Oct 23 11:47:23 [19674] centos-7-inap-mtl01-0003329495 cib: info: cib_process_request: Forwarding cib_apply_diff operation for section 'all' to all (origin=local/cibadmin/2)
Oct 23 11:47:23 [19674] centos-7-inap-mtl01-0003329495 cib: info: cib_perform_op: Diff: --- 0.17.4 2
Oct 23 11:47:23 [19674] centos-7-inap-mtl01-0003329495 cib: info: cib_perform_op: Diff: +++ 0.18.0 (null)
Oct 23 11:47:23 [19674] centos-7-inap-mtl01-0003329495 cib: info: cib_perform_op: + /cib: @epoch=18, @num_updates=0
Oct 23 11:47:23 [19674] centos-7-inap-mtl01-0003329495 cib: info: cib_perform_op: ++ /cib/configuration/resources/bundle[@id='galera-bundle']: <primitive class="ocf" id="galera" provider="heartbeat" type="galera"/>

then something changes the definition of the galera docker resource in pacemaker:

Oct 23 11:47:23 [19678] centos-7-inap-mtl01-0003329495 pengine: info: rsc_action_digest_cmp: Parameters to galera-bundle-docker-0_start_0 on centos-7-inap-mtl01-0003329495 changed: was d8041953cfb6acc917ec2931feb2ee67 vs. now 46b7a28056fbe69bef8a441022634eb6 (reload:3.0.14) 0:0;27:13:0:a6214ebb-5fc0-4249-a609-ae3747368a41
Oct 23 11:47:23 [19678] centos-7-inap-mtl01-0003329495 pengine: info: rsc_action_digest_cmp: Parameters to galera-bundle-docker-0_monitor_60000 on centos-7-inap-mtl01-0003329495 changed: was 8c00760613f45ebe053623f6a315b207 vs. now 21d84609373192f5c61a2019b46a49d6 (reschedule:3.0.14) 0:0;28:13:0:a6214ebb-5fc0-4249-a609-ae3747368a41

at the next pengine run, the pacemaker remote for galera container and the galera resource are scheduled for start, but the galera container is also scheduled for restart:

Oct 23 11:47:23 [19678] centos-7-inap-mtl01-0003329495 pengine: notice: LogAction: * Restart galera-bundle-docker-0 ( centos-7-inap-mtl01-0003329495 ) due to resource definition change
Oct 23 11:47:23 [19678] centos-7-inap-mtl01-0003329495 pengine: notice: LogAction: * Start galera-bundle-0 ( centos-7-inap-mtl01-0003329495 )
Oct 23 11:47:23 [19678] centos-7-inap-mtl01-0003329495 pengine: notice: LogAction: * Start galera:0 ( galera-bundle-0 )

so pengine schedule a stop of the galera docker container.

Oct 23 11:47:23 [19679] centos-7-inap-mtl01-0003329495 crmd: notice: te_rsc_command: Initiating stop operation galera-bundle-docker-0_stop_0 locally on centos-7-inap-mtl01-0003329495 | action 31

but the associated 'docker stop' command that is triggered by this stop action could not succeed in the next 20s, which resulted in a timeout:

Oct 23 11:47:43 [19676] centos-7-inap-mtl01-0003329495 lrmd: warning: child_timeout_callback: galera-bundle-docker-0_stop_0 process (PID 41798) timed out
Oct 23 11:47:43 [19676] centos-7-inap-mtl01-0003329495 lrmd: warning: operation_finished: galera-bundle-docker-0_stop_0:41798 - timed out after 20000ms
Oct 23 11:47:43 [19676] centos-7-inap-mtl01-0003329495 lrmd: info: log_finished: finished - rsc:galera-bundle-docker-0 action:stop call_id:20 pid:41798 exit-code:1 exec-time:20003ms queue-time:0ms
Oct 23 11:47:43 [19679] centos-7-inap-mtl01-0003329495 crmd: error: process_lrm_event: Result of stop operation for galera-bundle-docker-0 on centos-7-inap-mtl01-0003329495: Timed Out | call=20 key=galera-bundle-docker-0_stop_0 timeout=20000ms

and a failed stop operation being fatal for pacemaker (no fencing configured), the galera resource is never restarted, and the deployment eventually fails:

2018-10-23 12:21:06 | "Error: /usr/bin/clustercheck >/dev/null returned 1 instead of one of [0]",
2018-10-23 12:21:06 | "Error: /Stage[main]/Tripleo::Profile::Pacemaker::Database::Mysql_bundle/Exec[galera-ready]/returns: change from notrun to 0 failed: /usr/bin/clustercheck >/dev/null returned 1 instead of one of [0]",
2018-10-23 12:21:06 | "Error: Failed to apply catalog: Execution of '/usr/bin/mysql --defaults-extra-file=/root/.my.cnf -NBe SELECT CONCAT(User, '@',Host) AS User FROM mysql.user' returned 1: ERROR 2002 (HY000): Can't connect to local MySQL server through socket '/var/lib/mysql/mysql.sock' (2 \"No such file or directory\")",
2018-10-23 12:21:06 | "+ rc=1",

[1] http://logs.openstack.org/77/611677/4/check/tripleo-ci-centos-7-scenario003-multinode-oooq-container/c3d07bc/logs/subnode-2/var/log/cluster/corosync.log.txt.gz
[2] http://logs.openstack.org/77/611677/4/check/tripleo-ci-centos-7-scenario003-multinode-oooq-container/c3d07bc/logs/undercloud/home/zuul/overcloud_deploy.log.txt.gz#_2018-10-23_12_21_06

Having a look at the corosync logs on the overcloud node [1]

puppet-pacemaker enables the galera-bundle resource to allow start the containerized galera

Oct 23 11:46:55 [19679] centos-7-inap-mtl01-0003329495       crmd:     info: abort_transition_graph:	Transition aborted by deletion of meta_attributes[@id='galera-bundle-meta_attributes']: Configuration change | cib=0.17.0 source=te_update_diff:456 path=/cib/configuration/resources/bundle[@id='galera-bundle']/meta_attributes[@id='galera-bundle-meta_attributes'] complete=true

which let pengine schedule a start action to create the docker container

Oct 23 11:46:55 [19678] centos-7-inap-mtl01-0003329495    pengine:     info: RecurringOp:	 Start recurring monitor (60s) for galera-bundle-docker-0 on centos-7-inap-mtl01-0003329495
Oct 23 11:46:55 [19679] centos-7-inap-mtl01-0003329495       crmd:   notice: te_rsc_command:	Initiating start operation galera-bundle-docker-0_start_0 locally on centos-7-inap-mtl01-0003329495 | action 27

Oct 23 11:46:55  docker(galera-bundle-docker-0)[40767]:    INFO: checking for nsenter, which is required when 'monitor_cmd' is specified
Oct 23 11:46:55  docker(galera-bundle-docker-0)[40767]:    INFO: running container galera-bundle-docker-0 for the first time
Oct 23 11:46:56  docker(galera-bundle-docker-0)[40767]:    INFO: monitor cmd passed: exit code = 0
Oct 23 11:46:56  docker(galera-bundle-docker-0)[40767]:    NOTICE: Container galera-bundle-docker-0  started successfully
Oct 23 11:46:56 [19676] centos-7-inap-mtl01-0003329495       lrmd:     info: log_finished:	finished - rsc:galera-bundle-docker-0 action:start call_id:17 pid:40767 exit-code:0 exec-time:642ms queue-time:0ms

and then a first recurring monitor operation is schedule to check wehther the docker container holding the galera service is running properly. And it succeeds

Oct 23 11:46:56 [19679] centos-7-inap-mtl01-0003329495       crmd:   notice: te_rsc_command:	Initiating monitor operation galera-bundle-docker-0_monitor_60000 locally on centos-7-inap-mtl01-0003329495 | action 28
Oct 23 11:46:56  docker(galera-bundle-docker-0)[40962]:    INFO: monitor cmd passed: exit code = 0
Oct 23 11:46:56 [19679] centos-7-inap-mtl01-0003329495       crmd:     info: process_lrm_event:	Result of monitor operation for galera-bundle-docker-0 on centos-7-inap-mtl01-0003329495: 0 (ok) | call=18 key=galera-bundle-docker-0_monitor_60000 confirmed=false cib-update=105

then when the docker container is running, puppet-pacemaker creates a galera ocf resource in the cluster, to be ran in this docker container

Oct 23 11:47:23 [19674] centos-7-inap-mtl01-0003329495        cib:     info: cib_process_request:	Forwarding cib_apply_diff operation for section 'all' to all (origin=local/cibadmin/2)
Oct 23 11:47:23 [19674] centos-7-inap-mtl01-0003329495        cib:     info: cib_perform_op:	Diff: --- 0.17.4 2
Oct 23 11:47:23 [19674] centos-7-inap-mtl01-0003329495        cib:     info: cib_perform_op:	Diff: +++ 0.18.0 (null)
Oct 23 11:47:23 [19674] centos-7-inap-mtl01-0003329495        cib:     info: cib_perform_op:	+  /cib:  @epoch=18, @num_updates=0
Oct 23 11:47:23 [19674] centos-7-inap-mtl01-0003329495        cib:     info: cib_perform_op:	++ /cib/configuration/resources/bundle[@id='galera-bundle']:  <primitive class="ocf" id="galera" provider="heartbeat" type="galera"/>

then something changes the definition of the galera docker resource in pacemaker:

Oct 23 11:47:23 [19678] centos-7-inap-mtl01-0003329495    pengine:     info: rsc_action_digest_cmp:	Parameters to galera-bundle-docker-0_start_0 on centos-7-inap-mtl01-0003329495 changed: was d8041953cfb6acc917ec2931feb2ee67 vs. now 46b7a28056fbe69bef8a441022634eb6 (reload:3.0.14) 0:0;27:13:0:a6214ebb-5fc0-4249-a609-ae3747368a41
Oct 23 11:47:23 [19678] centos-7-inap-mtl01-0003329495    pengine:     info: rsc_action_digest_cmp:	Parameters to galera-bundle-docker-0_monitor_60000 on centos-7-inap-mtl01-0003329495 changed: was 8c00760613f45ebe053623f6a315b207 vs. now 21d84609373192f5c61a2019b46a49d6 (reschedule:3.0.14) 0:0;28:13:0:a6214ebb-5fc0-4249-a609-ae3747368a41

at the next pengine run, the pacemaker remote for galera container and the galera resource are scheduled for start, but the galera container is also scheduled for restart:

Oct 23 11:47:23 [19678] centos-7-inap-mtl01-0003329495    pengine:   notice: LogAction:	 * Restart    galera-bundle-docker-0       ( centos-7-inap-mtl01-0003329495 )   due to resource definition change
Oct 23 11:47:23 [19678] centos-7-inap-mtl01-0003329495    pengine:   notice: LogAction:	 * Start      galera-bundle-0              ( centos-7-inap-mtl01-0003329495 )
Oct 23 11:47:23 [19678] centos-7-inap-mtl01-0003329495    pengine:   notice: LogAction:	 * Start      galera:0                     (                galera-bundle-0 )

so pengine schedule a stop of the galera docker container.

Oct 23 11:47:23 [19679] centos-7-inap-mtl01-0003329495       crmd:   notice: te_rsc_command:	Initiating stop operation galera-bundle-docker-0_stop_0 locally on centos-7-inap-mtl01-0003329495 | action 31

but the associated 'docker stop' command that is triggered by this stop action could not succeed in the next 20s, which resulted in a timeout:

Oct 23 11:47:43 [19676] centos-7-inap-mtl01-0003329495       lrmd:  warning: child_timeout_callback:	galera-bundle-docker-0_stop_0 process (PID 41798) timed out
Oct 23 11:47:43 [19676] centos-7-inap-mtl01-0003329495       lrmd:  warning: operation_finished:	galera-bundle-docker-0_stop_0:41798 - timed out after 20000ms
Oct 23 11:47:43 [19676] centos-7-inap-mtl01-0003329495       lrmd:     info: log_finished:	finished - rsc:galera-bundle-docker-0 action:stop call_id:20 pid:41798 exit-code:1 exec-time:20003ms queue-time:0ms
Oct 23 11:47:43 [19679] centos-7-inap-mtl01-0003329495       crmd:    error: process_lrm_event:	Result of stop operation for galera-bundle-docker-0 on centos-7-inap-mtl01-0003329495: Timed Out | call=20 key=galera-bundle-docker-0_stop_0 timeout=20000ms

and a failed stop operation being fatal for pacemaker (no fencing configured), the galera resource is never restarted, and the deployment eventually fails:

2018-10-23 12:21:06 |         "Error: /usr/bin/clustercheck >/dev/null returned 1 instead of one of [0]", 
2018-10-23 12:21:06 |         "Error: /Stage[main]/Tripleo::Profile::Pacemaker::Database::Mysql_bundle/Exec[galera-ready]/returns: change from notrun to 0 failed: /usr/bin/clustercheck >/dev/null returned 1 instead of one of [0]", 
2018-10-23 12:21:06 |         "Error: Failed to apply catalog: Execution of '/usr/bin/mysql --defaults-extra-file=/root/.my.cnf -NBe SELECT CONCAT(User, '@',Host) AS User FROM mysql.user' returned 1: ERROR 2002 (HY000): Can't connect to local MySQL server through socket '/var/lib/mysql/mysql.sock' (2 \"No such file or directory\")", 
2018-10-23 12:21:06 |         "+ rc=1",

[1] http://logs.openstack.org/77/611677/4/check/tripleo-ci-centos-7-scenario003-multinode-oooq-container/c3d07bc/logs/subnode-2/var/log/cluster/corosync.log.txt.gz 
[2] http://logs.openstack.org/77/611677/4/check/tripleo-ci-centos-7-scenario003-multinode-oooq-container/c3d07bc/logs/undercloud/home/zuul/overcloud_deploy.log.txt.gz#_2018-10-23_12_21_06