I'm seeing this problem in another environment with a similar deployment (3 LXC containers):
Apr 20 16:39:26 juju-machine-3-lxc-4 crm_verify[31774]: notice: crm_log_args: Invoked: crm_verify -V -p
Apr 20 16:39:27 juju-machine-3-lxc-4 cibadmin[31786]: notice: crm_log_args: Invoked: cibadmin -p -P
Apr 20 16:50:01 juju-machine-3-lxc-4 cib[780]: error: pcmk_cpg_dispatch: Connection to the CPG API failed: Library error (2)
Apr 20 16:50:01 juju-machine-3-lxc-4 cib[780]: error: cib_cs_destroy: Corosync connection lost! Exiting.
Apr 20 16:50:01 juju-machine-3-lxc-4 crmd[785]: error: crmd_quorum_destroy: connection terminated
Apr 20 16:50:01 juju-machine-3-lxc-4 attrd[783]: error: pcmk_cpg_dispatch: Connection to the CPG API failed: Library error (2)
Apr 20 16:50:01 juju-machine-3-lxc-4 stonith-ng[781]: error: pcmk_cpg_dispatch: Connection to the CPG API failed: Library error (2)
Apr 20 16:50:01 juju-machine-3-lxc-4 crmd[785]: notice: crmd_exit: Forcing immediate exit: Link has been severed (67)
Apr 20 16:50:01 juju-machine-3-lxc-4 lrmd[782]: warning: qb_ipcs_event_sendv: new_event_notification (782-785-6): Bad file descriptor (9)
Apr 20 16:50:01 juju-machine-3-lxc-4 lrmd[782]: warning: send_client_notify: Notification of client crmd/8ad990ba-cf09-4ba3-b74b-a7d05d377a1b failed
Apr 20 16:50:01 juju-machine-3-lxc-4 lrmd[782]: error: crm_abort: crm_glib_handler: Forked child 760 to record non-fatal assert at logging.c:63 : Source ID 4601370 was not found when attempting to remove it
Apr 20 16:50:01 juju-machine-3-lxc-4 pacemakerd[773]: error: pcmk_child_exit: Child process cib (780) exited: Invalid argument (22)
Apr 20 16:50:01 juju-machine-3-lxc-4 pacemakerd[773]: notice: pcmk_process_exit: Respawning failed child process: cib
Apr 20 16:50:01 juju-machine-3-lxc-4 pacemakerd[773]: error: pcmk_child_exit: Child process crmd (785) exited: Link has been severed (67)
Apr 20 16:50:01 juju-machine-3-lxc-4 pacemakerd[773]: notice: pcmk_process_exit: Respawning failed child process: crmd
Apr 20 16:50:01 juju-machine-3-lxc-4 attrd[783]: crit: attrd_cs_destroy: Lost connection to Corosync service!
Apr 20 16:50:01 juju-machine-3-lxc-4 attrd[783]: notice: main: Exiting...
Apr 20 16:50:01 juju-machine-3-lxc-4 attrd[783]: notice: main: Disconnecting client 0x7ff985e478e0, pid=785...
Apr 20 16:50:01 juju-machine-3-lxc-4 pacemakerd[773]: error: pcmk_cpg_dispatch: Connection to the CPG API failed: Library error (2)
Apr 20 16:50:01 juju-machine-3-lxc-4 pacemakerd[773]: error: mcp_cpg_destroy: Connection destroyed
Apr 20 16:50:01 juju-machine-3-lxc-4 attrd[783]: error: attrd_cib_connection_destroy: Connection to the CIB terminated...
Apr 20 16:50:01 juju-machine-3-lxc-4 cib[761]: debug: crm_update_callsites: Enabling callsites based on priority=7, files=(null), functions=(null), formats=(null), tags=(null)
Apr 20 16:50:01 juju-machine-3-lxc-4 crmd[767]: debug: crm_update_callsites: Enabling callsites based on priority=7, files=(null), functions=(null), formats=(null), tags=(null)
Apr 20 16:50:01 juju-machine-3-lxc-4 crmd[767]: notice: main: CRM Git Version: 42f2063
Apr 20 16:50:01 juju-machine-3-lxc-4 stonith-ng[781]: error: stonith_peer_cs_destroy: Corosync connection terminated
Apr 20 16:50:01 juju-machine-3-lxc-4 cib[761]: notice: crm_cluster_connect: Connecting to cluster infrastructure: corosync
Apr 20 16:50:01 juju-machine-3-lxc-4 cib[761]: error: cluster_connect_cpg: Could not connect to the Cluster Process Group API: 2
Apr 20 16:50:01 juju-machine-3-lxc-4 cib[761]: crit: cib_init: Cannot sign in to the cluster... terminating
Apr 20 16:50:02 juju-machine-3-lxc-4 crmd[767]: warning: do_cib_control: Couldn't complete CIB registration 1 times... pause and retry
Apr 20 16:50:05 juju-machine-3-lxc-4 crmd[767]: warning: do_cib_control: Couldn't complete CIB registration 2 times... pause and retry
These are the only processes running on one of the nodes:
root 782 0.0 0.0 81464 1828 ? Ss Feb12 25:13 /usr/lib/pacemaker/lrmd
haclust+ 784 0.0 0.0 73920 776 ? Ss Feb12 8:25 /usr/lib/pacemaker/pengine
root 780 0.8 0.0 130256 4152 ? Ssl 16:50 0:00 /usr/sbin/corosync
A possible explanation could be: http://thread.gmane.org/gmane.linux.highavailability.corosync/592/focus=639
I only have logs for one of the nodes. I'm trying to get logs from the other 2 nodes to get a better understanding of what was happening with the communication.
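Once logs from the other nodes are available, a small script could help line them up by timestamp and show only the error/crit entries. The following is only a sketch: `parse_line`, `merge_logs` and the `year` default are hypothetical names and assumptions of mine, not part of pacemaker or corosync, and the regex assumes the syslog line layout seen in the excerpt above (which carries no year).

```python
import re
from datetime import datetime

# Assumed layout, as in the excerpt above:
#   Apr 20 16:50:01 <host> <daemon>[<pid>]: <level>: <message>
LINE_RE = re.compile(
    r"^(?P<ts>[A-Z][a-z]{2}\s+\d{1,2} \d{2}:\d{2}:\d{2}) "
    r"(?P<host>\S+) (?P<daemon>[\w-]+)\[(?P<pid>\d+)\]:\s+"
    r"(?P<level>\w+): (?P<msg>.*)$"
)


def parse_line(line, year=2000):
    """Parse one syslog line into a dict, or return None if it does not match.

    The year is an assumption: these syslog timestamps do not include one.
    """
    m = LINE_RE.match(line.strip())
    if m is None:
        return None
    rec = m.groupdict()
    rec["time"] = datetime.strptime(f"{year} {rec['ts']}", "%Y %b %d %H:%M:%S")
    return rec


def merge_logs(*line_sources, levels=("error", "crit")):
    """Merge lines from several nodes, keep the given levels, sort by time."""
    records = []
    for lines in line_sources:
        for line in lines:
            rec = parse_line(line)
            if rec is not None and rec["level"] in levels:
                records.append(rec)
    # sorted() is stable, so entries sharing a timestamp keep their input order.
    return sorted(records, key=lambda r: r["time"])
```

Each argument to `merge_logs` is an iterable of lines, for example an open file per node. The result is a list of dicts with `time`, `host`, `daemon`, `pid`, `level` and `msg` keys. This only helps if the clocks on the three containers are reasonably in sync.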