A couple of drive-by thoughts on this one.
It seems the issue is more about slowness in general, as pacemaker times out while monitoring all of its services:
Feb 20 18:50:14 [15128] centos-7-inap-mtl01-0002637151 lrmd: warning: child_timeout_callback: haproxy-bundle-docker-0_monitor_60000 process (PID 86216) timed out
Feb 20 18:50:38 [15128] centos-7-inap-mtl01-0002637151 lrmd: warning: operation_finished: haproxy-bundle-docker-0_monitor_60000:86216 - timed out after 20000ms
Feb 20 18:50:38 [15128] centos-7-inap-mtl01-0002637151 lrmd: warning: child_timeout_callback: rabbitmq-bundle-docker-0_monitor_60000 process (PID 86274) timed out
Feb 20 18:50:38 [15128] centos-7-inap-mtl01-0002637151 lrmd: warning: child_timeout_callback: galera-bundle-docker-0_monitor_60000 process (PID 86434) timed out
Feb 20 18:50:38 [15128] centos-7-inap-mtl01-0002637151 lrmd: warning: child_timeout_callback: redis-bundle-docker-0_monitor_60000 process (PID 86625) timed out
Feb 20 18:50:38 [15128] centos-7-inap-mtl01-0002637151 lrmd: warning: operation_finished: galera-bundle-docker-0_monitor_60000:86434 - timed out after 20000ms
Feb 20 18:50:38 [15131] centos-7-inap-mtl01-0002637151 crmd: error: process_lrm_event: Result of monitor operation for galera on galera-bundle-0: Timed Out | call=139 key=galera_monitor_10000 timeout=30000ms
Feb 20 18:50:38 [15128] centos-7-inap-mtl01-0002637151 lrmd: warning: operation_finished: redis-bundle-docker-0_monitor_60000:86625 - timed out after 20000ms
Feb 20 18:50:38 [15128] centos-7-inap-mtl01-0002637151 lrmd: warning: operation_finished: rabbitmq-bundle-docker-0_monitor_60000:86274 - timed out after 20000ms
Feb 20 18:50:38 [15131] centos-7-inap-mtl01-0002637151 crmd: error: process_lrm_event: Result of monitor operation for haproxy-bundle-docker-0 on centos-7-inap-mtl01-0002637151: Timed Out | call=52 key=haproxy-bundle-docker-0_monitor_60000 timeout=20000ms
Feb 20 18:50:38 [15131] centos-7-inap-mtl01-0002637151 crmd: error: process_lrm_event: Result of monitor operation for galera-bundle-docker-0 on centos-7-inap-mtl01-0002637151: Timed Out | call=22 key=galera-bundle-docker-0_monitor_60000 timeout=20000ms
Feb 20 18:50:38 [15131] centos-7-inap-mtl01-0002637151 crmd: error: process_lrm_event: Result of monitor operation for redis-bundle-docker-0 on centos-7-inap-mtl01-0002637151: Timed Out | call=33 key=redis-bundle-docker-0_monitor_60000 timeout=20000ms
Feb 20 18:50:38 [15131] centos-7-inap-mtl01-0002637151 crmd: error: process_lrm_event: Result of monitor operation for rabbitmq-bundle-docker-0 on centos-7-inap-mtl01-0002637151: Timed Out | call=11 key=rabbitmq-bundle-docker-0_monitor_60000 timeout=20000ms
Feb 20 18:50:38 [15131] centos-7-inap-mtl01-0002637151 crmd: info: throttle_check_thresholds: Moderate CPU load detected: 12.020000
Feb 20 18:50:38 [15131] centos-7-inap-mtl01-0002637151 crmd: info: throttle_send_command: New throttle mode: 0010 (was 0000)
What is interesting, though, is that it times out on monitoring for all the docker services but not, for example, the IP resources. This could mean either that the box itself was slowed down *or* that docker was in a slow state.
I'd tend to think that the box itself was particularly slow, based on the following two hints:
1) Haproxy thought that galera was gone (i.e. no pcmk involved)
Feb 20 18:50:18 centos-7-inap-mtl01-0002637151 haproxy[57468]: Backup Server mysql/centos-7-inap-mtl01-0002637151.internalapi.localdomain is DOWN, reason: Layer7 timeout, check duration: 10001ms. 0 active and 0 backup servers left. 10 sessions active, 0 requeued, 0 remaining in queue.
2)
Feb 20 18:50:38 [15131] centos-7-inap-mtl01-0002637151 crmd: info: throttle_check_thresholds: Moderate CPU load detected: 12.020000
Now the above isn't a particularly high value, since the box has 8 CPUs, but it does hint that the box was running slowly.
One thing we *could* do is stop hard-coding all these timeouts and instead make them configurable, so that we could increase them a bit in environments like CI.
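As a rough sketch of what raising the timeouts could look like at the pacemaker level (the specific values are illustrative, not a tested recommendation; the per-resource example assumes the resource names from the logs above and may need adjusting for bundle resources):

```shell
# Raise the default timeout for all resource operations cluster-wide.
# The monitors above fired at the hard-coded 20000ms, so e.g. 60s would
# give slow CI nodes more headroom.
pcs resource op defaults timeout=60s

# Alternatively, override the monitor timeout on a single resource
# (resource and operation names taken from the logs above; bundle
# resources may require different handling).
pcs resource update galera op monitor interval=10s timeout=60s
```

If we made these configurable in the deployment tooling instead of hard-coding them, CI environments could pass larger values without touching the cluster by hand.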