A couple of drive-by thoughts on this one.
It seems the issue is more about slowness in general, as pacemaker times out while monitoring all of its services:
Feb 20 18:50:14 [15128] centos-7-inap-mtl01-0002637151 lrmd: warning: child_timeout_callback: haproxy-bundle-docker-0_monitor_60000 process (PID 86216) timed out
Feb 20 18:50:38 [15128] centos-7-inap-mtl01-0002637151 lrmd: warning: operation_finished: haproxy-bundle-docker-0_monitor_60000:86216 - timed out after 20000ms
Feb 20 18:50:38 [15128] centos-7-inap-mtl01-0002637151 lrmd: warning: child_timeout_callback: rabbitmq-bundle-docker-0_monitor_60000 process (PID 86274) timed out
Feb 20 18:50:38 [15128] centos-7-inap-mtl01-0002637151 lrmd: warning: child_timeout_callback: galera-bundle-docker-0_monitor_60000 process (PID 86434) timed out
Feb 20 18:50:38 [15128] centos-7-inap-mtl01-0002637151 lrmd: warning: child_timeout_callback: redis-bundle-docker-0_monitor_60000 process (PID 86625) timed out
Feb 20 18:50:38 [15128] centos-7-inap-mtl01-0002637151 lrmd: warning: operation_finished: galera-bundle-docker-0_monitor_60000:86434 - timed out after 20000ms
Feb 20 18:50:38 [15131] centos-7-inap-mtl01-0002637151 crmd: error: process_lrm_event: Result of monitor operation for galera on galera-bundle-0: Timed Out | call=139 key=galera_monitor_10000 timeout=30000ms
Feb 20 18:50:38 [15128] centos-7-inap-mtl01-0002637151 lrmd: warning: operation_finished: redis-bundle-docker-0_monitor_60000:86625 - timed out after 20000ms
Feb 20 18:50:38 [15128] centos-7-inap-mtl01-0002637151 lrmd: warning: operation_finished: rabbitmq-bundle-docker-0_monitor_60000:86274 - timed out after 20000ms
Feb 20 18:50:38 [15131] centos-7-inap-mtl01-0002637151 crmd: error: process_lrm_event: Result of monitor operation for haproxy-bundle-docker-0 on centos-7-inap-mtl01-0002637151: Timed Out | call=52 key=haproxy-bundle-docker-0_monitor_60000 timeout=20000ms
Feb 20 18:50:38 [15131] centos-7-inap-mtl01-0002637151 crmd: error: process_lrm_event: Result of monitor operation for galera-bundle-docker-0 on centos-7-inap-mtl01-0002637151: Timed Out | call=22 key=galera-bundle-docker-0_monitor_60000 timeout=20000ms
Feb 20 18:50:38 [15131] centos-7-inap-mtl01-0002637151 crmd: error: process_lrm_event: Result of monitor operation for redis-bundle-docker-0 on centos-7-inap-mtl01-0002637151: Timed Out | call=33 key=redis-bundle-docker-0_monitor_60000 timeout=20000ms
Feb 20 18:50:38 [15131] centos-7-inap-mtl01-0002637151 crmd: error: process_lrm_event: Result of monitor operation for rabbitmq-bundle-docker-0 on centos-7-inap-mtl01-0002637151: Timed Out | call=11 key=rabbitmq-bundle-docker-0_monitor_60000 timeout=20000ms
Feb 20 18:50:38 [15131] centos-7-inap-mtl01-0002637151 crmd: info: throttle_check_thresholds: Moderate CPU load detected: 12.020000
Feb 20 18:50:38 [15131] centos-7-inap-mtl01-0002637151 crmd: info: throttle_send_command: New throttle mode: 0010 (was 0000)
What is interesting, though, is that it times out on monitoring for all the docker services but not, for example, the IP resources. This could mean either that the box itself was slowed down *or* that docker was in a slow state.
I'd tend to think that the box itself was particularly slow, based on the following two hints:
1) Haproxy thought that galera was gone (i.e. no pcmk involved)
Feb 20 18:50:18 centos-7-inap-mtl01-0002637151 haproxy[57468]: Backup Server mysql/centos-7-inap-mtl01-0002637151.internalapi.localdomain is DOWN, reason: Layer7 timeout, check duration: 10001ms. 0 active and 0 backup servers left. 10 sessions active, 0 requeued, 0 remaining in queue.
2)
Feb 20 18:50:38 [15131] centos-7-inap-mtl01-0002637151 crmd: info: throttle_check_thresholds: Moderate CPU load detected: 12.020000
Now the above isn't a particularly high value, since the box has 8 CPUs, but it does hint that the box was running slowly.
One thing we *could* do is stop hard-coding all these timeouts and instead make them configurable, so that we could increase them a bit in environments like CI.
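As a rough sketch of what raising the timeouts could look like at the pacemaker level (the specific values are illustrative, not a tested recommendation; the per-resource example assumes the resource names from the logs above and may need adjusting for bundle resources):

```shell
# Raise the default timeout for all resource operations cluster-wide.
# The monitors above fired at the hard-coded 20000ms, so e.g. 60s would
# give slow CI nodes more headroom.
pcs resource op defaults timeout=60s

# Alternatively, override the monitor timeout on a single resource
# (resource and operation names taken from the logs above; bundle
# resources may require different handling).
pcs resource update galera op monitor interval=10s timeout=60s
```

If we made these configurable in the deployment tooling instead of hard-coding them, CI environments could pass larger values without touching the cluster by hand.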