Comment 8 for bug 1439649

Revision history for this message
Felipe Reyes (freyes) wrote :

I'm seeing this problem in another environment, similar deployment (3 lxc containers)

Apr 20 16:39:26 juju-machine-3-lxc-4 crm_verify[31774]: notice: crm_log_args: Invoked: crm_verify -V -p
Apr 20 16:39:27 juju-machine-3-lxc-4 cibadmin[31786]: notice: crm_log_args: Invoked: cibadmin -p -P
Apr 20 16:50:01 juju-machine-3-lxc-4 cib[780]: error: pcmk_cpg_dispatch: Connection to the CPG API failed: Library error (2)
Apr 20 16:50:01 juju-machine-3-lxc-4 cib[780]: error: cib_cs_destroy: Corosync connection lost! Exiting.
Apr 20 16:50:01 juju-machine-3-lxc-4 crmd[785]: error: crmd_quorum_destroy: connection terminated
Apr 20 16:50:01 juju-machine-3-lxc-4 attrd[783]: error: pcmk_cpg_dispatch: Connection to the CPG API failed: Library error (2)
Apr 20 16:50:01 juju-machine-3-lxc-4 stonith-ng[781]: error: pcmk_cpg_dispatch: Connection to the CPG API failed: Library error (2)
Apr 20 16:50:01 juju-machine-3-lxc-4 crmd[785]: notice: crmd_exit: Forcing immediate exit: Link has been severed (67)
Apr 20 16:50:01 juju-machine-3-lxc-4 lrmd[782]: warning: qb_ipcs_event_sendv: new_event_notification (782-785-6): Bad file descriptor (9)
Apr 20 16:50:01 juju-machine-3-lxc-4 lrmd[782]: warning: send_client_notify: Notification of client crmd/8ad990ba-cf09-4ba3-b74b-a7d05d377a1b failed
Apr 20 16:50:01 juju-machine-3-lxc-4 lrmd[782]: error: crm_abort: crm_glib_handler: Forked child 760 to record non-fatal assert at logging.c:63 : Source ID 4601370 was not found when attempting to remove it
Apr 20 16:50:01 juju-machine-3-lxc-4 pacemakerd[773]: error: pcmk_child_exit: Child process cib (780) exited: Invalid argument (22)
Apr 20 16:50:01 juju-machine-3-lxc-4 pacemakerd[773]: notice: pcmk_process_exit: Respawning failed child process: cib
Apr 20 16:50:01 juju-machine-3-lxc-4 pacemakerd[773]: error: pcmk_child_exit: Child process crmd (785) exited: Link has been severed (67)
Apr 20 16:50:01 juju-machine-3-lxc-4 pacemakerd[773]: notice: pcmk_process_exit: Respawning failed child process: crmd
Apr 20 16:50:01 juju-machine-3-lxc-4 attrd[783]: crit: attrd_cs_destroy: Lost connection to Corosync service!
Apr 20 16:50:01 juju-machine-3-lxc-4 attrd[783]: notice: main: Exiting...
Apr 20 16:50:01 juju-machine-3-lxc-4 attrd[783]: notice: main: Disconnecting client 0x7ff985e478e0, pid=785...
Apr 20 16:50:01 juju-machine-3-lxc-4 pacemakerd[773]: error: pcmk_cpg_dispatch: Connection to the CPG API failed: Library error (2)
Apr 20 16:50:01 juju-machine-3-lxc-4 pacemakerd[773]: error: mcp_cpg_destroy: Connection destroyed
Apr 20 16:50:01 juju-machine-3-lxc-4 attrd[783]: error: attrd_cib_connection_destroy: Connection to the CIB terminated...
Apr 20 16:50:01 juju-machine-3-lxc-4 cib[761]: debug: crm_update_callsites: Enabling callsites based on priority=7, files=(null), functions=(null), formats=(null), tags=(null)
Apr 20 16:50:01 juju-machine-3-lxc-4 crmd[767]: debug: crm_update_callsites: Enabling callsites based on priority=7, files=(null), functions=(null), formats=(null), tags=(null)
Apr 20 16:50:01 juju-machine-3-lxc-4 crmd[767]: notice: main: CRM Git Version: 42f2063
Apr 20 16:50:01 juju-machine-3-lxc-4 stonith-ng[781]: error: stonith_peer_cs_destroy: Corosync connection terminated
Apr 20 16:50:01 juju-machine-3-lxc-4 cib[761]: notice: crm_cluster_connect: Connecting to cluster infrastructure: corosync
Apr 20 16:50:01 juju-machine-3-lxc-4 cib[761]: error: cluster_connect_cpg: Could not connect to the Cluster Process Group API: 2
Apr 20 16:50:01 juju-machine-3-lxc-4 cib[761]: crit: cib_init: Cannot sign in to the cluster... terminating
Apr 20 16:50:02 juju-machine-3-lxc-4 crmd[767]: warning: do_cib_control: Couldn't complete CIB registration 1 times... pause and retry
Apr 20 16:50:05 juju-machine-3-lxc-4 crmd[767]: warning: do_cib_control: Couldn't complete CIB registration 2 times... pause and retry

These are the only processes running in one of the nodes:

root 782 0.0 0.0 81464 1828 ? Ss Feb12 25:13 /usr/lib/pacemaker/lrmd
haclust+ 784 0.0 0.0 73920 776 ? Ss Feb12 8:25 /usr/lib/pacemaker/pengine
root 780 0.8 0.0 130256 4152 ? Ssl 16:50 0:00 /usr/sbin/corosync

A possible explanation could be: http://thread.gmane.org/gmane.linux.highavailability.corosync/592/focus=639

I only have logs for one of the nodes, I'm trying to get logs of the other 2 nodes to get a better understanding of what was happening with the communication.