pacemaker monitoring of mysql causes deployment failure when mysql is fenced

Bug #1750662 reported by Alex Schultz on 2018-02-20
Affects: tripleo
Importance: High
Assigned to: Unassigned

Bug Description

From http://logs.openstack.org/65/544065/2/gate/tripleo-ci-centos-7-scenario002-multinode-oooq-container/da0522d/logs

The job failed because the deployment process got a 500 from keystone when registering endpoints. Digging into this failure, the cause was that the mysql connection went away. Looking into why mysql went away, it appears that the mysql monitor operation timed out and pacemaker restarted mysql.

http://logs.openstack.org/65/544065/2/gate/tripleo-ci-centos-7-scenario002-multinode-oooq-container/da0522d/logs/subnode-2/var/log/pacemaker/bundles/galera-bundle-0/pacemaker.log.txt.gz

Feb 20 18:37:24 [13] centos-7-inap-mtl01-0002637151 pacemaker_remoted: info: log_finished: finished - rsc:galera action:promote call_id:68 pid:576 exit-code:0 exec-time:10084ms queue-time:1ms
Feb 20 18:50:18 [13] centos-7-inap-mtl01-0002637151 pacemaker_remoted: warning: child_timeout_callback: galera_monitor_10000 process (PID 5447) timed out
Feb 20 18:50:18 [13] centos-7-inap-mtl01-0002637151 pacemaker_remoted: warning: operation_finished: galera_monitor_10000:5447 - timed out after 30000ms
Feb 20 18:50:40 [13] centos-7-inap-mtl01-0002637151 pacemaker_remoted: info: lrmd_remote_client_msg: Client disconnect detected in tls msg dispatcher.
Feb 20 18:50:40 [13] centos-7-inap-mtl01-0002637151 pacemaker_remoted: info: cancel_recurring_action: Cancelling ocf operation galera_monitor_10000
Feb 20 18:50:40 [13] centos-7-inap-mtl01-0002637151 pacemaker_remoted: error: crm_send_tls: Connection terminated rc = -53
Feb 20 18:50:40 [13] centos-7-inap-mtl01-0002637151 pacemaker_remoted: error: crm_remote_send: Failed to send remote msg, rc = -53
Feb 20 18:50:40 [13] centos-7-inap-mtl01-0002637151 pacemaker_remoted: error: lrmd_tls_send_msg: Failed to send remote lrmd tls msg, rc = -53
Feb 20 18:50:40 [13] centos-7-inap-mtl01-0002637151 pacemaker_remoted: warning: send_client_notify: Notification of client remote-lrmd-centos-7-inap-mtl01-0002637151:3123/1cd9b0b5-3976-483d-ba4b-4b405ac6c63f failed
Feb 20 18:50:40 [13] centos-7-inap-mtl01-0002637151 pacemaker_remoted: notice: lrmd_remote_client_destroy: LRMD client disconnecting remote client - name: remote-lrmd-centos-7-inap-mtl01-0002637151:3123 id: 1cd9b0b5-3976-483d-ba4b-4b405ac6c63f
Feb 20 18:50:52 [13] centos-7-inap-mtl01-0002637151 pacemaker_remoted: info: crm_log_init: Changed active directory to /var/lib/pacemaker/cores
Feb 20 18:50:52 [13] centos-7-inap-mtl01-0002637151 pacemaker_remoted: info: qb_ipcs_us_publish: server name: lrmd
Feb 20 18:50:52 [13] centos-7-inap-mtl01-0002637151 pacemaker_remoted: notice: lrmd_init_remote_tls_server: Starting a tls listener on port 3123.
Feb 20 18:50:52 [13] centos-7-inap-mtl01-0002637151 pacemaker_remoted: notice: bind_and_listen: Listening on address ::
Feb 20 18:50:52 [13] centos-7-inap-mtl01-0002637151 pacemaker_remoted: info: qb_ipcs_us_publish: server name: cib_ro
Feb 20 18:50:52 [13] centos-7-inap-mtl01-0002637151 pacemaker_remoted: info: qb_ipcs_us_publish: server name: cib_rw
Feb 20 18:50:52 [13] centos-7-inap-mtl01-0002637151 pacemaker_remoted: info: qb_ipcs_us_publish: server name: cib_shm
Feb 20 18:50:52 [13] centos-7-inap-mtl01-0002637151 pacemaker_remoted: info: qb_ipcs_us_publish: server name: attrd
Feb 20 18:50:52 [13] centos-7-inap-mtl01-0002637151 pacemaker_remoted: info: qb_ipcs_us_publish: server name: stonith-ng
Feb 20 18:50:52 [13] centos-7-inap-mtl01-0002637151 pacemaker_remoted: info: qb_ipcs_us_publish: server name: crmd
Feb 20 18:50:52 [13] centos-7-inap-mtl01-0002637151 pacemaker_remoted: info: main: Starting
Feb 20 18:50:52 [13] centos-7-inap-mtl01-0002637151 pacemaker_remoted: info: crm_remote_accept: New remote connection from ::ffff:192.168.24.3
Feb 20 18:50:52 [13] centos-7-inap-mtl01-0002637151 pacemaker_remoted: notice: lrmd_remote_listen: LRMD client connection established. 0x56272f160d70 id: 65e0cf11-e2ce-4e59-ad03-36accac427e7
Feb 20 18:50:53 [13] centos-7-inap-mtl01-0002637151 pacemaker_remoted: info: process_lrmd_get_rsc_info: Resource 'galera' not found (0 active resources)
Feb 20 18:50:53 [13] centos-7-inap-mtl01-0002637151 pacemaker_remoted: info: process_lrmd_get_rsc_info: Resource 'galera:0' not found (0 active resources)
Feb 20 18:50:53 [13] centos-7-inap-mtl01-0002637151 pacemaker_remoted: info: process_lrmd_rsc_register: Added 'galera' to the rsc list (1 active resources)
Feb 20 18:50:54 [13] centos-7-inap-mtl01-0002637151 pacemaker_remoted: notice: check_sbd_timeout: Watchdog may be enabled but stonith-watchdog-timeout is disabled: (null)
Feb 20 18:50:54 [13] centos-7-inap-mtl01-0002637151 pacemaker_remoted: info: check_sbd_timeout: Watchdog functionality is consistent: (null) delay exceeds timeout of -1ms
Feb 20 18:50:55 [13] centos-7-inap-mtl01-0002637151 pacemaker_remoted: info: log_execute: executing - rsc:galera action:start call_id:13

http://logs.openstack.org/65/544065/2/gate/tripleo-ci-centos-7-scenario002-multinode-oooq-container/da0522d/logs/subnode-2/var/log/containers/keystone/keystone.log.txt.gz#_2018-02-20_18_50_18_174

2018-02-20 18:50:18.174 23 ERROR keystone.common.wsgi [req-debd0ebd-196f-441a-822a-62d67fa67d6f - - - - -] (pymysql.err.OperationalError) (2013, 'Lost connection to MySQL server during query'): DBConnectionError: (pymysql.err.OperationalError) (2013, 'Lost connection to MySQL server during query')

http://logs.openstack.org/65/544065/2/gate/tripleo-ci-centos-7-scenario002-multinode-oooq-container/da0522d/logs/postci.txt.gz

            "Error: Execution of '/usr/bin/openstack endpoint create --format shell f428d84729514400844ee45ba9b2bca0 public http://192.168.24.7:8776/v2/%(tenant_id)s --region regionOne' returned 1: An unexpected error prevented the server from fulfilling your request. (HTTP 500) (Request-ID: req-debd0ebd-196f-441a-822a-62d67fa67d6f)",

Michele Baldessari (michele) wrote :

A couple of drive-by thoughts on this one.
The issue seems to be general slowness rather than anything mysql-specific, since pacemaker times out while monitoring all of its services:
Feb 20 18:50:14 [15128] centos-7-inap-mtl01-0002637151 lrmd: warning: child_timeout_callback: haproxy-bundle-docker-0_monitor_60000 process (PID 86216) timed out
Feb 20 18:50:38 [15128] centos-7-inap-mtl01-0002637151 lrmd: warning: operation_finished: haproxy-bundle-docker-0_monitor_60000:86216 - timed out after 20000ms
Feb 20 18:50:38 [15128] centos-7-inap-mtl01-0002637151 lrmd: warning: child_timeout_callback: rabbitmq-bundle-docker-0_monitor_60000 process (PID 86274) timed out
Feb 20 18:50:38 [15128] centos-7-inap-mtl01-0002637151 lrmd: warning: child_timeout_callback: galera-bundle-docker-0_monitor_60000 process (PID 86434) timed out
Feb 20 18:50:38 [15128] centos-7-inap-mtl01-0002637151 lrmd: warning: child_timeout_callback: redis-bundle-docker-0_monitor_60000 process (PID 86625) timed out
Feb 20 18:50:38 [15128] centos-7-inap-mtl01-0002637151 lrmd: warning: operation_finished: galera-bundle-docker-0_monitor_60000:86434 - timed out after 20000ms
Feb 20 18:50:38 [15131] centos-7-inap-mtl01-0002637151 crmd: error: process_lrm_event: Result of monitor operation for galera on galera-bundle-0: Timed Out | call=139 key=galera_monitor_10000 timeout=30000ms
Feb 20 18:50:38 [15128] centos-7-inap-mtl01-0002637151 lrmd: warning: operation_finished: redis-bundle-docker-0_monitor_60000:86625 - timed out after 20000ms
Feb 20 18:50:38 [15128] centos-7-inap-mtl01-0002637151 lrmd: warning: operation_finished: rabbitmq-bundle-docker-0_monitor_60000:86274 - timed out after 20000ms
Feb 20 18:50:38 [15131] centos-7-inap-mtl01-0002637151 crmd: error: process_lrm_event: Result of monitor operation for haproxy-bundle-docker-0 on centos-7-inap-mtl01-0002637151: Timed Out | call=52 key=haproxy-bundle-docker-0_monitor_60000 timeout=20000ms
Feb 20 18:50:38 [15131] centos-7-inap-mtl01-0002637151 crmd: error: process_lrm_event: Result of monitor operation for galera-bundle-docker-0 on centos-7-inap-mtl01-0002637151: Timed Out | call=22 key=galera-bundle-docker-0_monitor_60000 timeout=20000ms
Feb 20 18:50:38 [15131] centos-7-inap-mtl01-0002637151 crmd: error: process_lrm_event: Result of monitor operation for redis-bundle-docker-0 on centos-7-inap-mtl01-0002637151: Timed Out | call=33 key=redis-bundle-docker-0_monitor_60000 timeout=20000ms
Feb 20 18:50:38 [15131] centos-7-inap-mtl01-0002637151 crmd: error: process_lrm_event: Result of monitor operation for rabbitmq-bundle-docker-0 on centos-7-inap-mtl01-0002637151: Timed Out | call=11 key=rabbitmq-bundle-docker-0_monitor_60000 timeout=20000ms
Feb 20 18:50:38 [15131] centos-7-inap-mtl01-0002637151 crmd: info: throttle_check_thresholds: Moderate CPU load detected: 12.020000
Feb 20 18:50:38 [15131] centos-7-inap-mtl01-0002637151 crmd: info: throttle_send_command: New throttle mode: 0010 (was 0000)

What is interesting, though, is that the monitor timeouts hit all of the docker-based services but not the (IPs)...

Read more...

Alex Schultz (alex-schultz) wrote :

+1 to being able to tune timeouts
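On clusters where these timeouts bite, the monitor operation settings can in principle be adjusted through pcs. A configuration sketch (resource name, interval, and timeout values are assumptions for illustration, not values from this environment; check `pcs config` for the actual operation definitions first):

```shell
# Sketch only: raise the monitor timeout on the galera resource so a
# transiently loaded node is not treated as failed. Values are assumed.
pcs resource update galera op monitor interval=10s timeout=60s
```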

Changed in tripleo (milestone history):
rocky-1 → rocky-2 → rocky-3 → rocky-rc1 → stein-1 → stein-2 → stein-3 → stein-rc1 → train-1 → train-2 → train-3 → ussuri-1 → ussuri-2 → ussuri-3 (wes hayutin (weshayutin) on 2020-02-10)