Having a look at the corosync logs on the overcloud node [1]
puppet-pacemaker enables the galera-bundle resource to allow start the containerized galera
Oct 23 11:46:55 [19679] centos-7-inap-mtl01-0003329495 crmd: info: abort_transition_graph: Transition aborted by deletion of meta_attributes[@id='galera-bundle-meta_attributes']: Configuration change | cib=0.17.0 source=te_update_diff:456 path=/cib/configuration/resources/bundle[@id='galera-bundle']/meta_attributes[@id='galera-bundle-meta_attributes'] complete=true
which let pengine schedule a start action to create the docker container
Oct 23 11:46:55 [19678] centos-7-inap-mtl01-0003329495 pengine: info: RecurringOp: Start recurring monitor (60s) for galera-bundle-docker-0 on centos-7-inap-mtl01-0003329495
Oct 23 11:46:55 [19679] centos-7-inap-mtl01-0003329495 crmd: notice: te_rsc_command: Initiating start operation galera-bundle-docker-0_start_0 locally on centos-7-inap-mtl01-0003329495 | action 27
Oct 23 11:46:55 docker(galera-bundle-docker-0)[40767]: INFO: checking for nsenter, which is required when 'monitor_cmd' is specified
Oct 23 11:46:55 docker(galera-bundle-docker-0)[40767]: INFO: running container galera-bundle-docker-0 for the first time
Oct 23 11:46:56 docker(galera-bundle-docker-0)[40767]: INFO: monitor cmd passed: exit code = 0
Oct 23 11:46:56 docker(galera-bundle-docker-0)[40767]: NOTICE: Container galera-bundle-docker-0 started successfully
Oct 23 11:46:56 [19676] centos-7-inap-mtl01-0003329495 lrmd: info: log_finished: finished - rsc:galera-bundle-docker-0 action:start call_id:17 pid:40767 exit-code:0 exec-time:642ms queue-time:0ms
and then a first recurring monitor operation is schedule to check wehther the docker container holding the galera service is running properly. And it succeeds
Oct 23 11:46:56 [19679] centos-7-inap-mtl01-0003329495 crmd: notice: te_rsc_command: Initiating monitor operation galera-bundle-docker-0_monitor_60000 locally on centos-7-inap-mtl01-0003329495 | action 28
Oct 23 11:46:56 docker(galera-bundle-docker-0)[40962]: INFO: monitor cmd passed: exit code = 0
Oct 23 11:46:56 [19679] centos-7-inap-mtl01-0003329495 crmd: info: process_lrm_event: Result of monitor operation for galera-bundle-docker-0 on centos-7-inap-mtl01-0003329495: 0 (ok) | call=18 key=galera-bundle-docker-0_monitor_60000 confirmed=false cib-update=105
then when the docker container is running, puppet-pacemaker creates a galera ocf resource in the cluster, to be ran in this docker container
Oct 23 11:47:23 [19674] centos-7-inap-mtl01-0003329495 cib: info: cib_process_request: Forwarding cib_apply_diff operation for section 'all' to all (origin=local/cibadmin/2)
Oct 23 11:47:23 [19674] centos-7-inap-mtl01-0003329495 cib: info: cib_perform_op: Diff: --- 0.17.4 2
Oct 23 11:47:23 [19674] centos-7-inap-mtl01-0003329495 cib: info: cib_perform_op: Diff: +++ 0.18.0 (null)
Oct 23 11:47:23 [19674] centos-7-inap-mtl01-0003329495 cib: info: cib_perform_op: + /cib: @epoch=18, @num_updates=0
Oct 23 11:47:23 [19674] centos-7-inap-mtl01-0003329495 cib: info: cib_perform_op: ++ /cib/configuration/resources/bundle[@id='galera-bundle']: <primitive class="ocf" id="galera" provider="heartbeat" type="galera"/>
then something changes the definition of the galera docker resource in pacemaker:
Oct 23 11:47:23 [19678] centos-7-inap-mtl01-0003329495 pengine: info: rsc_action_digest_cmp: Parameters to galera-bundle-docker-0_start_0 on centos-7-inap-mtl01-0003329495 changed: was d8041953cfb6acc917ec2931feb2ee67 vs. now 46b7a28056fbe69bef8a441022634eb6 (reload:3.0.14) 0:0;27:13:0:a6214ebb-5fc0-4249-a609-ae3747368a41
Oct 23 11:47:23 [19678] centos-7-inap-mtl01-0003329495 pengine: info: rsc_action_digest_cmp: Parameters to galera-bundle-docker-0_monitor_60000 on centos-7-inap-mtl01-0003329495 changed: was 8c00760613f45ebe053623f6a315b207 vs. now 21d84609373192f5c61a2019b46a49d6 (reschedule:3.0.14) 0:0;28:13:0:a6214ebb-5fc0-4249-a609-ae3747368a41
at the next pengine run, the pacemaker remote for galera container and the galera resource are scheduled for start, but the galera container is also scheduled for restart:
Oct 23 11:47:23 [19678] centos-7-inap-mtl01-0003329495 pengine: notice: LogAction: * Restart galera-bundle-docker-0 ( centos-7-inap-mtl01-0003329495 ) due to resource definition change
Oct 23 11:47:23 [19678] centos-7-inap-mtl01-0003329495 pengine: notice: LogAction: * Start galera-bundle-0 ( centos-7-inap-mtl01-0003329495 )
Oct 23 11:47:23 [19678] centos-7-inap-mtl01-0003329495 pengine: notice: LogAction: * Start galera:0 ( galera-bundle-0 )
so pengine schedule a stop of the galera docker container.
Oct 23 11:47:23 [19679] centos-7-inap-mtl01-0003329495 crmd: notice: te_rsc_command: Initiating stop operation galera-bundle-docker-0_stop_0 locally on centos-7-inap-mtl01-0003329495 | action 31
but the associated 'docker stop' command that is triggered by this stop action could not succeed in the next 20s, which resulted in a timeout:
Oct 23 11:47:43 [19676] centos-7-inap-mtl01-0003329495 lrmd: warning: child_timeout_callback: galera-bundle-docker-0_stop_0 process (PID 41798) timed out
Oct 23 11:47:43 [19676] centos-7-inap-mtl01-0003329495 lrmd: warning: operation_finished: galera-bundle-docker-0_stop_0:41798 - timed out after 20000ms
Oct 23 11:47:43 [19676] centos-7-inap-mtl01-0003329495 lrmd: info: log_finished: finished - rsc:galera-bundle-docker-0 action:stop call_id:20 pid:41798 exit-code:1 exec-time:20003ms queue-time:0ms
Oct 23 11:47:43 [19679] centos-7-inap-mtl01-0003329495 crmd: error: process_lrm_event: Result of stop operation for galera-bundle-docker-0 on centos-7-inap-mtl01-0003329495: Timed Out | call=20 key=galera-bundle-docker-0_stop_0 timeout=20000ms
and a failed stop operation being fatal for pacemaker (no fencing configured), the galera resource is never restarted, and the deployment eventually fails:
2018-10-23 12:21:06 | "Error: /usr/bin/clustercheck >/dev/null returned 1 instead of one of [0]",
2018-10-23 12:21:06 | "Error: /Stage[main]/Tripleo::Profile::Pacemaker::Database::Mysql_bundle/Exec[galera-ready]/returns: change from notrun to 0 failed: /usr/bin/clustercheck >/dev/null returned 1 instead of one of [0]",
2018-10-23 12:21:06 | "Error: Failed to apply catalog: Execution of '/usr/bin/mysql --defaults-extra-file=/root/.my.cnf -NBe SELECT CONCAT(User, '@',Host) AS User FROM mysql.user' returned 1: ERROR 2002 (HY000): Can't connect to local MySQL server through socket '/var/lib/mysql/mysql.sock' (2 \"No such file or directory\")",
2018-10-23 12:21:06 | "+ rc=1",
Having a look at the corosync logs on the overcloud node [1]
puppet-pacemaker enables the galera-bundle resource to allow start the containerized galera
Oct 23 11:46:55 [19679] centos- 7-inap- mtl01-000332949 5 crmd: info: abort_transitio n_graph: Transition aborted by deletion of meta_attributes [@id='galera- bundle- meta_attributes ']: Configuration change | cib=0.17.0 source= te_update_ diff:456 path=/cib/ configuration/ resources/ bundle[ @id='galera- bundle' ]/meta_ attributes[ @id='galera- bundle- meta_attributes '] complete=true
which let pengine schedule a start action to create the docker container
Oct 23 11:46:55 [19678] centos- 7-inap- mtl01-000332949 5 pengine: info: RecurringOp: Start recurring monitor (60s) for galera- bundle- docker- 0 on centos- 7-inap- mtl01-000332949 5 7-inap- mtl01-000332949 5 crmd: notice: te_rsc_command: Initiating start operation galera- bundle- docker- 0_start_ 0 locally on centos- 7-inap- mtl01-000332949 5 | action 27
Oct 23 11:46:55 [19679] centos-
Oct 23 11:46:55 docker( galera- bundle- docker- 0)[40767] : INFO: checking for nsenter, which is required when 'monitor_cmd' is specified galera- bundle- docker- 0)[40767] : INFO: running container galera- bundle- docker- 0 for the first time galera- bundle- docker- 0)[40767] : INFO: monitor cmd passed: exit code = 0 galera- bundle- docker- 0)[40767] : NOTICE: Container galera- bundle- docker- 0 started successfully 7-inap- mtl01-000332949 5 lrmd: info: log_finished: finished - rsc:galera- bundle- docker- 0 action:start call_id:17 pid:40767 exit-code:0 exec-time:642ms queue-time:0ms
Oct 23 11:46:55 docker(
Oct 23 11:46:56 docker(
Oct 23 11:46:56 docker(
Oct 23 11:46:56 [19676] centos-
and then a first recurring monitor operation is schedule to check wehther the docker container holding the galera service is running properly. And it succeeds
Oct 23 11:46:56 [19679] centos- 7-inap- mtl01-000332949 5 crmd: notice: te_rsc_command: Initiating monitor operation galera- bundle- docker- 0_monitor_ 60000 locally on centos- 7-inap- mtl01-000332949 5 | action 28 galera- bundle- docker- 0)[40962] : INFO: monitor cmd passed: exit code = 0 7-inap- mtl01-000332949 5 crmd: info: process_lrm_event: Result of monitor operation for galera- bundle- docker- 0 on centos- 7-inap- mtl01-000332949 5: 0 (ok) | call=18 key=galera- bundle- docker- 0_monitor_ 60000 confirmed=false cib-update=105
Oct 23 11:46:56 docker(
Oct 23 11:46:56 [19679] centos-
then when the docker container is running, puppet-pacemaker creates a galera ocf resource in the cluster, to be ran in this docker container
Oct 23 11:47:23 [19674] centos- 7-inap- mtl01-000332949 5 cib: info: cib_process_ request: Forwarding cib_apply_diff operation for section 'all' to all (origin= local/cibadmin/ 2) 7-inap- mtl01-000332949 5 cib: info: cib_perform_op: Diff: --- 0.17.4 2 7-inap- mtl01-000332949 5 cib: info: cib_perform_op: Diff: +++ 0.18.0 (null) 7-inap- mtl01-000332949 5 cib: info: cib_perform_op: + /cib: @epoch=18, @num_updates=0 7-inap- mtl01-000332949 5 cib: info: cib_perform_op: ++ /cib/configurat ion/resources/ bundle[ @id='galera- bundle' ]: <primitive class="ocf" id="galera" provider= "heartbeat" type="galera"/>
Oct 23 11:47:23 [19674] centos-
Oct 23 11:47:23 [19674] centos-
Oct 23 11:47:23 [19674] centos-
Oct 23 11:47:23 [19674] centos-
then something changes the definition of the galera docker resource in pacemaker:
Oct 23 11:47:23 [19678] centos- 7-inap- mtl01-000332949 5 pengine: info: rsc_action_ digest_ cmp: Parameters to galera- bundle- docker- 0_start_ 0 on centos- 7-inap- mtl01-000332949 5 changed: was d8041953cfb6acc 917ec2931feb2ee 67 vs. now 46b7a28056fbe69 bef8a441022634e b6 (reload:3.0.14) 0:0;27: 13:0:a6214ebb- 5fc0-4249- a609-ae3747368a 41 7-inap- mtl01-000332949 5 pengine: info: rsc_action_ digest_ cmp: Parameters to galera- bundle- docker- 0_monitor_ 60000 on centos- 7-inap- mtl01-000332949 5 changed: was 8c00760613f45eb e053623f6a315b2 07 vs. now 21d84609373192f 5c61a2019b46a49 d6 (reschedule:3.0.14) 0:0;28: 13:0:a6214ebb- 5fc0-4249- a609-ae3747368a 41
Oct 23 11:47:23 [19678] centos-
at the next pengine run, the pacemaker remote for galera container and the galera resource are scheduled for start, but the galera container is also scheduled for restart:
Oct 23 11:47:23 [19678] centos- 7-inap- mtl01-000332949 5 pengine: notice: LogAction: * Restart galera- bundle- docker- 0 ( centos- 7-inap- mtl01-000332949 5 ) due to resource definition change 7-inap- mtl01-000332949 5 pengine: notice: LogAction: * Start galera-bundle-0 ( centos- 7-inap- mtl01-000332949 5 ) 7-inap- mtl01-000332949 5 pengine: notice: LogAction: * Start galera:0 ( galera-bundle-0 )
Oct 23 11:47:23 [19678] centos-
Oct 23 11:47:23 [19678] centos-
so pengine schedule a stop of the galera docker container.
Oct 23 11:47:23 [19679] centos- 7-inap- mtl01-000332949 5 crmd: notice: te_rsc_command: Initiating stop operation galera- bundle- docker- 0_stop_ 0 locally on centos- 7-inap- mtl01-000332949 5 | action 31
but the associated 'docker stop' command that is triggered by this stop action could not succeed in the next 20s, which resulted in a timeout:
Oct 23 11:47:43 [19676] centos- 7-inap- mtl01-000332949 5 lrmd: warning: child_timeout_ callback: galera- bundle- docker- 0_stop_ 0 process (PID 41798) timed out 7-inap- mtl01-000332949 5 lrmd: warning: operation_finished: galera- bundle- docker- 0_stop_ 0:41798 - timed out after 20000ms 7-inap- mtl01-000332949 5 lrmd: info: log_finished: finished - rsc:galera- bundle- docker- 0 action:stop call_id:20 pid:41798 exit-code:1 exec-time:20003ms queue-time:0ms 7-inap- mtl01-000332949 5 crmd: error: process_lrm_event: Result of stop operation for galera- bundle- docker- 0 on centos- 7-inap- mtl01-000332949 5: Timed Out | call=20 key=galera- bundle- docker- 0_stop_ 0 timeout=20000ms
Oct 23 11:47:43 [19676] centos-
Oct 23 11:47:43 [19676] centos-
Oct 23 11:47:43 [19679] centos-
and a failed stop operation being fatal for pacemaker (no fencing configured), the galera resource is never restarted, and the deployment eventually fails:
2018-10-23 12:21:06 | "Error: /usr/bin/ clustercheck >/dev/null returned 1 instead of one of [0]", main]/Tripleo: :Profile: :Pacemaker: :Database: :Mysql_ bundle/ Exec[galera- ready]/ returns: change from notrun to 0 failed: /usr/bin/ clustercheck >/dev/null returned 1 instead of one of [0]", extra-file= /root/. my.cnf -NBe SELECT CONCAT(User, '@',Host) AS User FROM mysql.user' returned 1: ERROR 2002 (HY000): Can't connect to local MySQL server through socket '/var/lib/ mysql/mysql. sock' (2 \"No such file or directory\")",
2018-10-23 12:21:06 | "Error: /Stage[
2018-10-23 12:21:06 | "Error: Failed to apply catalog: Execution of '/usr/bin/mysql --defaults-
2018-10-23 12:21:06 | "+ rc=1",
[1] http:// logs.openstack. org/77/ 611677/ 4/check/ tripleo- ci-centos- 7-scenario003- multinode- oooq-container/ c3d07bc/ logs/subnode- 2/var/log/ cluster/ corosync. log.txt. gz logs.openstack. org/77/ 611677/ 4/check/ tripleo- ci-centos- 7-scenario003- multinode- oooq-container/ c3d07bc/ logs/undercloud /home/zuul/ overcloud_ deploy. log.txt. gz#_2018- 10-23_12_ 21_06
[2] http://