Segfault: pacemaker segfaults randomly on Ubuntu trusty 14.04

Bug #1327222 reported by born2chill
This bug affects 3 people
Affects              Status    Importance   Assigned to   Milestone
corosync (Ubuntu)    Invalid   Undecided    Unassigned    -
pacemaker (Ubuntu)   Invalid   Undecided    Unassigned    -

Bug Description

I'm running a two-node HA cluster with pacemaker/corosync and a pretty simple configuration: only an IP address, one service and two clone sets of resources are managed (see below). However, I run into constant crashes of pacemaker (it looked like corosync at first) on both nodes. At the moment this behaviour makes the cluster unusable.

I attached the cluster config, cib.xml and the crash dumps to the bug; hopefully someone can make something of it.

~# crm_mon -1
Last updated: Fri Jun 6 15:43:14 2014
Last change: Fri Jun 6 10:28:17 2014 via cibadmin on lbsrv52
Stack: corosync
Current DC: lbsrv51 (1) - partition with quorum
Version: 1.1.10-42f2063
2 Nodes configured
6 Resources configured

Online: [ lbsrv51 lbsrv52 ]

 Resource Group: grp_HAProxy-Front-IPs
     res_IPaddr2_Test (ocf::heartbeat:IPaddr2): Started lbsrv51
 res_pdnsd_pdnsd (lsb:pdnsd): Started lbsrv51
 Clone Set: cl_isc-dhcp-server_1 [res_isc-dhcp-server_1]
     Started: [ lbsrv51 lbsrv52 ]
 Clone Set: cl_tftpd-hpa_1 [res_tftpd-hpa_1]
     Started: [ lbsrv51 lbsrv52 ]
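
For reference, the resource layout shown above corresponds to a crm shell configuration roughly like the sketch below. The IP address/netmask and the agent classes of the two cloned services are placeholders (assumptions), and the operation definitions are simplified; the exact configuration and cib.xml are attached.

node lbsrv51
node lbsrv52
# Front-end IP managed by the IPaddr2 agent; address/netmask are placeholders
primitive res_IPaddr2_Test ocf:heartbeat:IPaddr2 \
    params ip=192.0.2.10 cidr_netmask=24 \
    op monitor interval=10s
# DNS proxy as a plain LSB service
primitive res_pdnsd_pdnsd lsb:pdnsd \
    op monitor interval=15s
# Cloned services; the lsb agent names here are assumptions
primitive res_isc-dhcp-server_1 lsb:isc-dhcp-server \
    op monitor interval=15s
primitive res_tftpd-hpa_1 lsb:tftpd-hpa \
    op monitor interval=15s
group grp_HAProxy-Front-IPs res_IPaddr2_Test
clone cl_isc-dhcp-server_1 res_isc-dhcp-server_1
clone cl_tftpd-hpa_1 res_tftpd-hpa_1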

== corosync.log: ==
Jun 06 15:14:56 [2324] lbsrv51 cib: error: pcmk_cpg_dispatch: Connection to the CPG API failed: Library error (2)
Jun 06 15:14:56 [2327] lbsrv51 attrd: error: pcmk_cpg_dispatch: Connection to the CPG API failed: Library error (2)
Jun 06 15:14:56 [2327] lbsrv51 attrd: crit: attrd_cs_destroy: Lost connection to Corosync service!
Jun 06 15:14:56 [2327] lbsrv51 attrd: notice: main: Exiting...
Jun 06 15:14:56 [2324] lbsrv51 cib: error: cib_cs_destroy: Corosync connection lost! Exiting.
Jun 06 15:14:56 [2327] lbsrv51 attrd: notice: main: Disconnecting client 0x7f1f86244a10, pid=2329...
Jun 06 15:14:56 [2324] lbsrv51 cib: info: terminate_cib: cib_cs_destroy: Exiting fast...
Jun 06 15:14:56 [2324] lbsrv51 cib: info: crm_client_destroy: Destroying 0 events
Jun 06 15:14:56 [2327] lbsrv51 attrd: error: attrd_cib_connection_destroy: Connection to the CIB terminated...
Jun 06 15:14:56 [2324] lbsrv51 cib: info: qb_ipcs_us_withdraw: withdrawing server sockets
Jun 06 15:14:56 [2324] lbsrv51 cib: info: crm_client_destroy: Destroying 0 events
Jun 06 15:14:56 [2324] lbsrv51 cib: info: crm_client_destroy: Destroying 0 events
Jun 06 15:14:56 [2325] lbsrv51 stonith-ng: error: crm_ipc_read: Connection to cib_rw failed
Jun 06 15:14:56 [2325] lbsrv51 stonith-ng: error: mainloop_gio_callback: Connection to cib_rw[0x7f52f2d82c10] closed (I/O condition=17)
Jun 06 15:14:56 [2324] lbsrv51 cib: info: qb_ipcs_us_withdraw: withdrawing server sockets
Jun 06 15:14:56 [2324] lbsrv51 cib: info: crm_client_destroy: Destroying 0 events
Jun 06 15:14:56 [2324] lbsrv51 cib: info: qb_ipcs_us_withdraw: withdrawing server sockets
Jun 06 15:14:56 [2324] lbsrv51 cib: info: crm_xml_cleanup: Cleaning up memory from libxml2
Jun 06 15:14:56 [2325] lbsrv51 stonith-ng: notice: cib_connection_destroy: Connection to the CIB terminated. Shutting down.
Jun 06 15:14:56 [2325] lbsrv51 stonith-ng: info: stonith_shutdown: Terminating with 1 clients
Jun 06 15:14:56 [2325] lbsrv51 stonith-ng: info: crm_client_destroy: Destroying 0 events
Jun 06 15:14:56 [2325] lbsrv51 stonith-ng: info: qb_ipcs_us_withdraw: withdrawing server sockets
Jun 06 15:14:56 [2325] lbsrv51 stonith-ng: info: main: Done
Jun 06 15:14:56 [2325] lbsrv51 stonith-ng: info: crm_xml_cleanup: Cleaning up memory from libxml2
Jun 06 15:14:56 [2329] lbsrv51 crmd: error: crm_ipc_read: Connection to cib_shm failed
Jun 06 15:14:56 [2329] lbsrv51 crmd: error: mainloop_gio_callback: Connection to cib_shm[0x7f97ed1f6980] closed (I/O condition=17)
Jun 06 15:14:56 [2329] lbsrv51 crmd: error: crmd_cib_connection_destroy: Connection to the CIB terminated...
Jun 06 15:14:56 [2329] lbsrv51 crmd: error: do_log: FSA: Input I_ERROR from crmd_cib_connection_destroy() received in state S_IDLE
Jun 06 15:14:56 [2329] lbsrv51 crmd: notice: do_state_transition: State transition S_IDLE -> S_RECOVERY [ input=I_ERROR cause=C_FSA_INTERNAL origin=crmd_cib_connection_destroy ]
Jun 06 15:14:56 [2329] lbsrv51 crmd: warning: do_recover: Fast-tracking shutdown in response to errors
Jun 06 15:14:56 [2329] lbsrv51 crmd: warning: do_election_vote: Not voting in election, we're in state S_RECOVERY
Jun 06 15:14:56 [2329] lbsrv51 crmd: info: do_dc_release: DC role released
Jun 06 15:14:56 [2322] lbsrv51 pacemakerd: info: pcmk_child_exit: Child process stonith-ng (2325) exited: OK (0)
Jun 06 15:14:56 [2322] lbsrv51 pacemakerd: info: crm_cs_flush: Sent 0 CPG messages (1 remaining, last=10): Library error (2)
Jun 06 15:14:56 [2322] lbsrv51 pacemakerd: notice: pcmk_process_exit: Respawning failed child process: stonith-ng
Jun 06 15:14:56 [2329] lbsrv51 crmd: info: pe_ipc_destroy: Connection to the Policy Engine released
Jun 06 15:14:56 [2329] lbsrv51 crmd: info: do_te_control: Transitioner is now inactive
Jun 06 15:14:56 [2329] lbsrv51 crmd: error: do_log: FSA: Input I_TERMINATE from do_recover() received in state S_RECOVERY
Jun 06 15:14:56 [2329] lbsrv51 crmd: info: do_state_transition: State transition S_RECOVERY -> S_TERMINATE [ input=I_TERMINATE cause=C_FSA_INTERNAL origin=do_recover ]
Jun 06 15:14:56 [2329] lbsrv51 crmd: info: do_shutdown: Disconnecting STONITH...
Jun 06 15:14:56 [2329] lbsrv51 crmd: info: tengine_stonith_connection_destroy: Fencing daemon disconnected
Jun 06 15:14:56 [2322] lbsrv51 pacemakerd: info: start_child: Forked child 59988 for process stonith-ng
Jun 06 15:14:56 [2329] lbsrv51 crmd: info: stop_recurring_actions: Cancelling op 27 for res_tftpd-hpa_1 (res_tftpd-hpa_1:27)
Jun 06 15:14:56 [2322] lbsrv51 pacemakerd: error: pcmk_child_exit: Child process attrd (2327) exited: Transport endpoint is not connected (107)
Jun 06 15:14:56 [2322] lbsrv51 pacemakerd: notice: pcmk_process_exit: Respawning failed child process: attrd
Jun 06 15:14:56 [2328] lbsrv51 pengine: info: crm_client_destroy: Destroying 0 events
Jun 06 15:14:56 [2322] lbsrv51 pacemakerd: info: start_child: Using uid=111 and group=119 for process attrd
Jun 06 15:14:56 [2322] lbsrv51 pacemakerd: info: start_child: Forked child 59989 for process attrd
Jun 06 15:14:56 [2322] lbsrv51 pacemakerd: info: mcp_quorum_destroy: connection closed
Jun 06 15:14:56 [2322] lbsrv51 pacemakerd: error: pcmk_cpg_dispatch: Connection to the CPG API failed: Library error (2)
Jun 06 15:14:56 [2322] lbsrv51 pacemakerd: error: mcp_cpg_destroy: Connection destroyed
Jun 06 15:14:56 [2322] lbsrv51 pacemakerd: info: crm_xml_cleanup: Cleaning up memory from libxml2
Jun 06 15:14:56 [2326] lbsrv51 lrmd: info: cancel_recurring_action: Cancelling operation res_tftpd-hpa_1_status_15000
Jun 06 15:14:56 [2329] lbsrv51 crmd: info: stop_recurring_actions: Cancelling op 35 for res_IPaddr2_Test (res_IPaddr2_Test:35)
Jun 06 15:14:56 [2326] lbsrv51 lrmd: info: cancel_recurring_action: Cancelling operation res_IPaddr2_Test_monitor_10000
Jun 06 15:14:56 [2329] lbsrv51 crmd: info: stop_recurring_actions: Cancelling op 41 for res_pdnsd_pdnsd (res_pdnsd_pdnsd:41)
Jun 06 15:14:56 [2326] lbsrv51 lrmd: info: cancel_recurring_action: Cancelling operation res_pdnsd_pdnsd_status_15000
Jun 06 15:14:56 [2329] lbsrv51 crmd: info: stop_recurring_actions: Cancelling op 47 for res_isc-dhcp-server_1 (res_isc-dhcp-server_1:47)
Jun 06 15:14:56 [2326] lbsrv51 lrmd: info: cancel_recurring_action: Cancelling operation res_isc-dhcp-server_1_status_15000
Jun 06 15:14:56 [59989] lbsrv51 attrd: notice: crm_cluster_connect: Connecting to cluster infrastructure: corosync
Jun 06 15:14:56 [59989] lbsrv51 attrd: error: cluster_connect_cpg: Could not connect to the Cluster Process Group API: 2
Jun 06 15:14:56 [59989] lbsrv51 attrd: error: main: HA Signon failed
Jun 06 15:14:56 [2329] lbsrv51 crmd: notice: lrm_state_verify_stopped: Stopped 4 recurring operations at (null) (3942893656 ops remaining)
Jun 06 15:14:56 [59989] lbsrv51 attrd: error: main: Aborting startup
Jun 06 15:14:56 [2329] lbsrv51 crmd: notice: lrm_state_verify_stopped: Recurring action res_pdnsd_pdnsd:41 (res_pdnsd_pdnsd_monitor_15000) incomplete at shutdown
Jun 06 15:14:56 [2329] lbsrv51 crmd: notice: lrm_state_verify_stopped: Recurring action res_isc-dhcp-server_1:47 (res_isc-dhcp-server_1_monitor_15000) incomplete at shutdown
Jun 06 15:14:56 [2329] lbsrv51 crmd: notice: lrm_state_verify_stopped: Recurring action res_IPaddr2_Test:35 (res_IPaddr2_Test_monitor_10000) incomplete at shutdown
Jun 06 15:14:56 [2329] lbsrv51 crmd: error: lrm_state_verify_stopped: 3 resources were active at shutdown.
Jun 06 15:14:56 [2329] lbsrv51 crmd: info: do_lrm_control: Disconnecting from the LRM
Jun 06 15:14:56 [2329] lbsrv51 crmd: info: lrmd_api_disconnect: Disconnecting from lrmd service
Jun 06 15:14:56 [2329] lbsrv51 crmd: info: lrmd_ipc_connection_destroy: IPC connection destroyed
Jun 06 15:14:56 [2329] lbsrv51 crmd: info: lrm_connection_destroy: LRM Connection disconnected
Jun 06 15:14:56 [2329] lbsrv51 crmd: info: lrmd_api_disconnect: Disconnecting from lrmd service
Jun 06 15:14:56 [2329] lbsrv51 crmd: notice: do_lrm_control: Disconnected from the LRM
Jun 06 15:14:56 [2329] lbsrv51 crmd: info: crm_cluster_disconnect: Disconnecting from cluster infrastructure: corosync
Jun 06 15:14:56 [2329] lbsrv51 crmd: notice: terminate_cs_connection: Disconnecting from Corosync
Jun 06 15:14:56 [2329] lbsrv51 crmd: info: crm_cluster_disconnect: Disconnected from corosync
Jun 06 15:14:56 [2329] lbsrv51 crmd: info: do_ha_control: Disconnected from the cluster
Jun 06 15:14:56 [2329] lbsrv51 crmd: info: do_cib_control: Disconnecting CIB
Jun 06 15:14:56 [2329] lbsrv51 crmd: info: qb_ipcs_us_withdraw: withdrawing server sockets
Jun 06 15:14:56 [2329] lbsrv51 crmd: info: do_exit: Performing A_EXIT_0 - gracefully exiting the CRMd
Jun 06 15:14:56 [2329] lbsrv51 crmd: info: do_exit: [crmd] stopped (0)
Jun 06 15:14:56 [2329] lbsrv51 crmd: info: crmd_exit: Dropping I_PENDING: [ state=S_TERMINATE cause=C_FSA_INTERNAL origin=do_election_vote ]
Jun 06 15:14:56 [2329] lbsrv51 crmd: info: crmd_exit: Dropping I_RELEASE_SUCCESS: [ state=S_TERMINATE cause=C_FSA_INTERNAL origin=do_dc_release ]
Jun 06 15:14:56 [2329] lbsrv51 crmd: info: crmd_exit: Dropping I_TERMINATE: [ state=S_TERMINATE cause=C_FSA_INTERNAL origin=do_stop ]
Jun 06 15:14:56 [2329] lbsrv51 crmd: info: crmd_quorum_destroy: connection closed
Jun 06 15:14:56 [2329] lbsrv51 crmd: info: crmd_cs_destroy: connection closed
Jun 06 15:14:56 [2329] lbsrv51 crmd: info: crmd_init: 2329 stopped: OK (0)
Jun 06 15:14:56 [2329] lbsrv51 crmd: error: crmd_fast_exit: Could not recover from internal error
Jun 06 15:14:56 [2329] lbsrv51 crmd: info: crm_xml_cleanup: Cleaning up memory from libxml2
Jun 06 15:14:56 [2326] lbsrv51 lrmd: info: crm_client_destroy: Destroying 0 events
Jun 06 15:14:56 [59988] lbsrv51 stonith-ng: info: crm_log_init: Changed active directory to /var/lib/heartbeat/cores/root
Jun 06 15:14:56 [59988] lbsrv51 stonith-ng: info: get_cluster_type: Verifying cluster type: 'corosync'
Jun 06 15:14:56 [59988] lbsrv51 stonith-ng: info: get_cluster_type: Assuming an active 'corosync' cluster
Jun 06 15:14:56 [59988] lbsrv51 stonith-ng: notice: crm_cluster_connect: Connecting to cluster infrastructure: corosync
Jun 06 15:14:56 [59988] lbsrv51 stonith-ng: error: cluster_connect_cpg: Could not connect to the Cluster Process Group API: 2
Jun 06 15:14:56 [59988] lbsrv51 stonith-ng: crit: main: Cannot sign in to the cluster... terminating
Jun 06 15:14:56 [59988] lbsrv51 stonith-ng: info: crm_xml_cleanup: Cleaning up memory from libxml2

== dmesg: ==
[60379.304488] show_signal_msg: 18 callbacks suppressed
[60379.304493] crm_resource[19768]: segfault at 0 ip 00007f276681c0aa sp 00007fffe49ea2a8 error 4 in libc-2.19.so[7f27666db000+1bc000]
[60379.858371] cib[2234]: segfault at 0 ip 00007f59013760aa sp 00007fff0e21a0d8 error 4 in libc-2.19.so[7f5901235000+1bc000]
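
In case it helps with analysing the attached crash dumps: a backtrace can be pulled from one of the cores along these lines (the binary path, core file name and debug package names are assumptions; stonith-ng above logs /var/lib/heartbeat/cores/root as its core directory):

~# apt-get install gdb pacemaker-dbg libc6-dbg
~# gdb /usr/lib/pacemaker/cib /var/lib/heartbeat/cores/root/core
(gdb) bt full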

== syslog: ==
Jun 6 15:14:56 lbsrv51 cibmon[15100]: error: crm_ipc_read: Connection to cib_ro failed
Jun 6 15:14:56 lbsrv51 cibmon[15100]: error: mainloop_gio_callback: Connection to cib_ro[0x7f188c76f240] closed (I/O condition=17)
Jun 6 15:14:56 lbsrv51 cibmon[15100]: error: cib_connection_destroy: Connection to the CIB terminated... exiting
Jun 6 15:14:56 lbsrv51 attrd[59989]: notice: crm_add_logfile: Additional logging available in /var/log/corosync/corosync.log
Jun 6 15:14:56 lbsrv51 crm_simulate[59990]: notice: crm_log_args: Invoked: crm_simulate -s -S -VVVVV -L
Jun 6 15:14:56 lbsrv51 stonith-ng[59988]: notice: crm_add_logfile: Additional logging available in /var/log/corosync/corosync.log
Jun 6 15:14:56 lbsrv51 crm_simulate[60012]: notice: crm_log_args: Invoked: crm_simulate -s -S -VVVVV -L
Jun 6 15:14:56 lbsrv51 crm_simulate[60038]: notice: crm_log_args: Invoked: crm_simulate -s -S -VVVVV -L

Revision history for this message
born2chill (david-gabriel) wrote :
affects: ubuntu → corosync (Ubuntu)
Revision history for this message
born2chill (david-gabriel) wrote :

At the moment I'm running corosync in debug mode, so I should get more logs soon.
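
For the record, debug output was turned on with roughly the following in the logging section of /etc/corosync/corosync.conf (a sketch of the relevant directives, not the exact file):

logging {
    to_logfile: yes
    logfile: /var/log/corosync/corosync.log
    debug: on
}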

description: updated
summary: - Segfault: corosync segfaults randomly on Ubuntu trusty 14.04
+ Segfault: pacemaker segfaults randomly on Ubuntu trusty 14.04
Changed in corosync (Ubuntu):
status: New → Invalid
Changed in pacemaker (Ubuntu):
status: New → Invalid
Revision history for this message
born2chill (david-gabriel) wrote :

I found out that it was not the cluster stack itself causing the issues, but the tool I used to configure the cluster: LCMC. Although LCMC has worked flawlessly for me on older versions of corosync/pacemaker, it seems it hasn't been updated to work with corosync 2.3.x and pacemaker 1.1.x. So watch out until LCMC gets updated (at least 1.6.8, as of 2014-06-16, doesn't work reliably).
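
In case anyone else runs into this: a quick way to check whether a management tool has written a configuration the running stack actually accepts is to validate the live CIB, e.g.:

~# crm_verify --live-check -V

crm_verify ships with pacemaker; -V just makes any warnings or errors it finds more verbose.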
