deploy region hacluster-nova-cloud-controller stuck

Bug #1491228 reported by David Britton
This bug affects 1 person
Affects: hacluster (Juju Charms Collection)
Status: New
Importance: Undecided
Assigned to: Unassigned

Bug Description

Problem: the hacluster-nova-cloud-controller hook is stuck; it keeps looping on `crm node list` even though pacemaker is down
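The hook appears to retry `crm node list` indefinitely. A bounded retry would fail the hook instead of hanging forever; here is a minimal sketch (the function name `retry_bounded` and the attempt/delay values are illustrative, not the charm's actual code):

```shell
#!/bin/sh
# Sketch: retry a command a bounded number of times instead of looping forever.
# retry_bounded MAX_TRIES DELAY CMD...  -> returns 0 on the first success,
# 1 once MAX_TRIES attempts have all failed.
retry_bounded() {
    max=$1; delay=$2; shift 2
    i=0
    while [ "$i" -lt "$max" ]; do
        if "$@"; then
            return 0
        fi
        i=$((i + 1))
        sleep "$delay"
    done
    return 1
}

# Example: give up after 3 attempts, 1 second apart.
# In the charm this would wrap something like: crm node list
retry_bounded 3 1 false && echo "succeeded" || echo "gave up after 3 tries"
```

With a cap like this, the hook would have errored out (and surfaced the dead pacemaker) rather than spinning on `crm node list`.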

root@juju-machine-1-lxc-4:/var/log/juju# tail unit-hacluster-nova-cloud-controller-1.log
2015-09-02 03:51:43 DEBUG juju-log hanode:24: Parsing cluster configuration using rid: ha:28, unit: nova-cloud-controller/1
2015-09-02 03:51:43 DEBUG juju-log hanode:24: Configuring and (maybe) restarting corosync
2015-09-02 03:51:43 INFO juju-log hanode:24: Writing file /etc/default/corosync root:root 444
2015-09-02 03:51:43 INFO juju-log hanode:24: Writing file /etc/corosync/authkey root:root 400
2015-09-02 03:51:43 INFO juju-log hanode:24: Writing file /etc/corosync/corosync.conf root:root 444
2015-09-02 03:51:43 INFO hanode-relation-changed * Restarting corosync daemon corosync
2015-09-02 03:51:43 INFO hanode-relation-changed notice [MAIN ] Corosync Cluster Engine ('2.3.3'): started and ready to provide service.
2015-09-02 03:51:43 INFO hanode-relation-changed info [MAIN ] Corosync built-in features: dbus testagents rdma watchdog augeas pie relro bindnow
2015-09-02 03:51:43 INFO hanode-relation-changed ...done.
2015-09-02 03:51:48 INFO hanode-relation-changed Starting Pacemaker Cluster Manager: [ OK ]
root@juju-machine-1-lxc-4:/var/log/juju#

# However, the hook just keeps spawning new `crm node list` commands over and over:

root@juju-machine-1-lxc-4:/var/log/juju# while sleep 2; do ps -ef |grep '[c]rm node list'; done;
root 216447 107295 0 04:22 ? 00:00:00 sh -c { crm node list; } 2>&1
root 216562 107295 0 04:22 ? 00:00:00 sh -c { crm node list; } 2>&1
root 216563 216562 0 04:22 ? 00:00:00 /usr/bin/python /usr/sbin/crm node list
root 216679 107295 0 04:22 ? 00:00:00 sh -c { crm node list; } 2>&1
root 216680 216679 0 04:22 ? 00:00:00 /usr/bin/python /usr/sbin/crm node list
root 216802 107295 0 04:23 ? 00:00:00 sh -c { crm node list; } 2>&1
root 216803 216802 0 04:23 ? 00:00:00 /usr/bin/python /usr/sbin/crm node list
root 217042 107295 0 04:23 ? 00:00:00 sh -c { crm node list; } 2>&1
root 217043 217042 0 04:23 ? 00:00:00 /usr/bin/python /usr/sbin/crm node list
root 217159 107295 0 04:23 ? 00:00:00 sh -c { crm node list; } 2>&1
root 217160 217159 0 04:23 ? 00:00:00 /usr/bin/python /usr/sbin/crm node list
root 217399 107295 0 04:23 ? 00:00:00 sh -c { crm node list; } 2>&1
root 217400 217399 0 04:23 ? 00:00:00 /usr/bin/python /usr/sbin/crm node list
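Every looping `crm node list` invocation above shares the same parent pid (107295); that parent can be identified with a standard ps query (the pid here is a placeholder so the snippet runs anywhere; on the affected machine use 107295 from the capture above):

```shell
# Show the command line of the process spawning the crm calls.
PARENT_PID=$$        # placeholder; on the affected machine use 107295
ps -p "$PARENT_PID" -o pid=,cmd=
```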

# Interacting with the machine directly: pacemaker is not running

root@juju-machine-1-lxc-4:/var/log# service pacemaker status
pacemakerd is stopped
root@juju-machine-1-lxc-4:/var/log# service corosync status
 * corosync is running
root@juju-machine-1-lxc-4:/var/log# ps -ef |grep crm
root 198713 107295 0 04:17 ? 00:00:00 sh -c { crm node list; } 2>&1
root 198714 198713 0 04:17 ? 00:00:00 /usr/bin/python /usr/sbin/crm node list
root 198716 181655 0 04:17 pts/2 00:00:00 grep --color=auto crm
root@juju-machine-1-lxc-4:/var/log# crm node list
ERROR: running cibadmin -Ql: Could not establish cib_rw connection: Connection refused (111)
Signon to CIB failed: Transport endpoint is not connected
Init failed, could not perform requested operations

syslog snippet:

Sep 2 03:50:01 juju-machine-1-lxc-4 CRON[90120]: (root) CMD (/usr/local/lib/nagios/plugins/check_exit_status.pl -s /etc/init.d/apache2 status > /var/lib/nagios/service-check-apache2.txt)
Sep 2 03:50:01 juju-machine-1-lxc-4 CRON[90114]: (CRON) info (No MTA installed, discarding output)
Sep 2 03:51:43 juju-machine-1-lxc-4 corosync[107575]: [MAIN ] Corosync Cluster Engine ('2.3.3'): started and ready to provide service.
Sep 2 03:51:43 juju-machine-1-lxc-4 corosync[107575]: [MAIN ] Corosync built-in features: dbus testagents rdma watchdog augeas pie relro bindnow
Sep 2 03:51:43 juju-machine-1-lxc-4 corosync[107576]: [TOTEM ] Initializing transport (UDP/IP Unicast).
Sep 2 03:51:43 juju-machine-1-lxc-4 corosync[107576]: [TOTEM ] Initializing transmit/receive security (NSS) crypto: none hash: none
Sep 2 03:51:43 juju-machine-1-lxc-4 corosync[107576]: [TOTEM ] The network interface [10.1.54.161] is now up.
Sep 2 03:51:43 juju-machine-1-lxc-4 corosync[107576]: [SERV ] Service engine loaded: corosync configuration map access [0]
Sep 2 03:51:43 juju-machine-1-lxc-4 corosync[107576]: [QB ] server name: cmap
Sep 2 03:51:43 juju-machine-1-lxc-4 corosync[107576]: [SERV ] Service engine loaded: corosync configuration service [1]
Sep 2 03:51:43 juju-machine-1-lxc-4 corosync[107576]: [QB ] server name: cfg
Sep 2 03:51:43 juju-machine-1-lxc-4 corosync[107576]: [SERV ] Service engine loaded: corosync cluster closed process group service v1.01 [2]
Sep 2 03:51:43 juju-machine-1-lxc-4 corosync[107576]: [QB ] server name: cpg
Sep 2 03:51:43 juju-machine-1-lxc-4 corosync[107576]: [SERV ] Service engine loaded: corosync profile loading service [4]
Sep 2 03:51:43 juju-machine-1-lxc-4 corosync[107576]: [WD ] No Watchdog, try modprobe <a watchdog>
Sep 2 03:51:43 juju-machine-1-lxc-4 corosync[107576]: [WD ] no resources configured.
Sep 2 03:51:43 juju-machine-1-lxc-4 corosync[107576]: [SERV ] Service engine loaded: corosync watchdog service [7]
Sep 2 03:51:43 juju-machine-1-lxc-4 corosync[107576]: [QUORUM] Using quorum provider corosync_votequorum
Sep 2 03:51:43 juju-machine-1-lxc-4 corosync[107576]: [SERV ] Service engine loaded: corosync vote quorum service v1.0 [5]
Sep 2 03:51:43 juju-machine-1-lxc-4 corosync[107576]: [QB ] server name: votequorum
Sep 2 03:51:43 juju-machine-1-lxc-4 corosync[107576]: [SERV ] Service engine loaded: corosync cluster quorum service v0.1 [3]
Sep 2 03:51:43 juju-machine-1-lxc-4 corosync[107576]: [QB ] server name: quorum
Sep 2 03:51:43 juju-machine-1-lxc-4 corosync[107576]: [TOTEM ] adding new UDPU member {10.1.54.139}
Sep 2 03:51:43 juju-machine-1-lxc-4 corosync[107576]: [TOTEM ] adding new UDPU member {10.1.54.161}
Sep 2 03:51:43 juju-machine-1-lxc-4 corosync[107576]: [TOTEM ] adding new UDPU member {10.1.54.156}
Sep 2 03:51:43 juju-machine-1-lxc-4 corosync[107576]: [TOTEM ] A new membership (10.1.54.161:4) was formed. Members joined: 1001
Sep 2 03:51:43 juju-machine-1-lxc-4 corosync[107576]: [QUORUM] Members[1]: 1001
Sep 2 03:51:43 juju-machine-1-lxc-4 corosync[107576]: [MAIN ] Completed service synchronization, ready to provide service.
Sep 2 03:51:43 juju-machine-1-lxc-4 corosync[107576]: [TOTEM ] A new membership (10.1.54.156:8) was formed. Members joined: 1002
Sep 2 03:51:43 juju-machine-1-lxc-4 corosync[107576]: [QUORUM] This node is within the primary component and will provide service.
Sep 2 03:51:43 juju-machine-1-lxc-4 corosync[107576]: [QUORUM] Members[2]: 1002 1001
Sep 2 03:51:43 juju-machine-1-lxc-4 corosync[107576]: [MAIN ] Completed service synchronization, ready to provide service.
Sep 2 03:51:44 juju-machine-1-lxc-4 pacemakerd[107585]: notice: mcp_read_config: Configured corosync to accept connections from group 116: Library error (2)
Sep 2 03:51:44 juju-machine-1-lxc-4 pacemakerd[107585]: notice: main: Starting Pacemaker 1.1.10 (Build: 42f2063): generated-manpages agent-manpages ncurses libqb-logging libqb-ipc lha-fencing upstart nagios heartbeat corosync-native snmp libesmtp
Sep 2 03:51:44 juju-machine-1-lxc-4 pacemakerd[107585]: notice: cluster_connect_quorum: Quorum acquired
Sep 2 03:51:44 juju-machine-1-lxc-4 pacemakerd[107585]: notice: corosync_node_name: Unable to get node name for nodeid 1000
Sep 2 03:51:44 juju-machine-1-lxc-4 pacemakerd[107585]: notice: corosync_node_name: Unable to get node name for nodeid 1001
Sep 2 03:51:44 juju-machine-1-lxc-4 pacemakerd[107585]: notice: corosync_node_name: Unable to get node name for nodeid 1002
Sep 2 03:51:44 juju-machine-1-lxc-4 pacemakerd[107585]: notice: corosync_node_name: Unable to get node name for nodeid 1001
Sep 2 03:51:44 juju-machine-1-lxc-4 pacemakerd[107585]: notice: get_node_name: Defaulting to uname -n for the local corosync node name
Sep 2 03:51:44 juju-machine-1-lxc-4 pacemakerd[107585]: notice: corosync_node_name: Unable to get node name for nodeid 1002
Sep 2 03:51:44 juju-machine-1-lxc-4 pacemakerd[107585]: notice: crm_update_peer_state: pcmk_quorum_notification: Node (null)[1002] - state is now member (was (null))
Sep 2 03:51:44 juju-machine-1-lxc-4 pacemakerd[107585]: notice: crm_update_peer_state: pcmk_quorum_notification: Node juju-machine-1-lxc-4[1001] - state is now member (was (null))
Sep 2 03:51:44 juju-machine-1-lxc-4 cib[107587]: notice: main: Using new config location: /var/lib/pacemaker/cib
Sep 2 03:51:44 juju-machine-1-lxc-4 cib[107587]: warning: retrieveCib: Cluster configuration not found: /var/lib/pacemaker/cib/cib.xml
Sep 2 03:51:44 juju-machine-1-lxc-4 cib[107587]: warning: readCibXmlFile: Primary configuration corrupt or unusable, trying backups
Sep 2 03:51:44 juju-machine-1-lxc-4 cib[107587]: warning: readCibXmlFile: Continuing with an empty configuration.
Sep 2 03:51:44 juju-machine-1-lxc-4 stonith-ng[107588]: notice: crm_cluster_connect: Connecting to cluster infrastructure: corosync
Sep 2 03:51:44 juju-machine-1-lxc-4 cib[107587]: notice: crm_cluster_connect: Connecting to cluster infrastructure: corosync
Sep 2 03:51:44 juju-machine-1-lxc-4 attrd[107590]: notice: crm_cluster_connect: Connecting to cluster infrastructure: corosync
Sep 2 03:51:44 juju-machine-1-lxc-4 stonith-ng[107588]: notice: corosync_node_name: Unable to get node name for nodeid 1001
Sep 2 03:51:44 juju-machine-1-lxc-4 stonith-ng[107588]: notice: get_node_name: Defaulting to uname -n for the local corosync node name
Sep 2 03:51:44 juju-machine-1-lxc-4 pacemakerd[107585]: notice: corosync_node_name: Unable to get node name for nodeid 1002
Sep 2 03:51:44 juju-machine-1-lxc-4 pacemakerd[107585]: notice: crm_update_peer_state: pcmk_quorum_notification: Node (null)[1002] - state is now member (was (null))
Sep 2 03:51:44 juju-machine-1-lxc-4 pacemakerd[107585]: notice: crm_update_peer_state: pcmk_quorum_notification: Node juju-machine-1-lxc-4[1001] - state is now member (was (null))
Sep 2 03:51:44 juju-machine-1-lxc-4 cib[107587]: notice: main: Using new config location: /var/lib/pacemaker/cib
Sep 2 03:51:44 juju-machine-1-lxc-4 cib[107587]: warning: retrieveCib: Cluster configuration not found: /var/lib/pacemaker/cib/cib.xml
Sep 2 03:51:44 juju-machine-1-lxc-4 cib[107587]: warning: readCibXmlFile: Primary configuration corrupt or unusable, trying backups
Sep 2 03:51:44 juju-machine-1-lxc-4 cib[107587]: warning: readCibXmlFile: Continuing with an empty configuration.
Sep 2 03:51:44 juju-machine-1-lxc-4 stonith-ng[107588]: notice: crm_cluster_connect: Connecting to cluster infrastructure: corosync
Sep 2 03:51:44 juju-machine-1-lxc-4 cib[107587]: notice: crm_cluster_connect: Connecting to cluster infrastructure: corosync
Sep 2 03:51:44 juju-machine-1-lxc-4 attrd[107590]: notice: crm_cluster_connect: Connecting to cluster infrastructure: corosync
Sep 2 03:51:44 juju-machine-1-lxc-4 stonith-ng[107588]: notice: corosync_node_name: Unable to get node name for nodeid 1001
Sep 2 03:51:44 juju-machine-1-lxc-4 stonith-ng[107588]: notice: get_node_name: Defaulting to uname -n for the local corosync node name
Sep 2 03:51:44 juju-machine-1-lxc-4 crmd[107592]: notice: main: CRM Git Version: 42f2063
Sep 2 03:51:44 juju-machine-1-lxc-4 corosync[107576]: [MAIN ] Denied connection attempt from 110:116
Sep 2 03:51:44 juju-machine-1-lxc-4 corosync[107576]: [QB ] Invalid IPC credentials (107576-107587-2).
Sep 2 03:51:44 juju-machine-1-lxc-4 cib[107587]: error: cluster_connect_cpg: Could not connect to the Cluster Process Group API: 11
Sep 2 03:51:44 juju-machine-1-lxc-4 cib[107587]: crit: cib_init: Cannot sign in to the cluster... terminating
Sep 2 03:51:44 juju-machine-1-lxc-4 corosync[107576]: [MAIN ] Denied connection attempt from 110:116
Sep 2 03:51:44 juju-machine-1-lxc-4 corosync[107576]: [QB ] Invalid IPC credentials (107576-107590-2).
Sep 2 03:51:44 juju-machine-1-lxc-4 attrd[107590]: error: cluster_connect_cpg: Could not connect to the Cluster Process Group API: 11
Sep 2 03:51:44 juju-machine-1-lxc-4 attrd[107590]: error: main: HA Signon failed
Sep 2 03:51:44 juju-machine-1-lxc-4 attrd[107590]: error: main: Aborting startup
Sep 2 03:51:44 juju-machine-1-lxc-4 pacemakerd[107585]: error: pcmk_child_exit: Child process attrd (107590) exited: Network is down (100)
Sep 2 03:51:44 juju-machine-1-lxc-4 pacemakerd[107585]: warning: pcmk_child_exit: Pacemaker child process attrd no longer wishes to be respawned. Shutting ourselves down.
Sep 2 03:51:44 juju-machine-1-lxc-4 pacemakerd[107585]: notice: pcmk_shutdown_worker: Shuting down Pacemaker
Sep 2 03:51:44 juju-machine-1-lxc-4 pacemakerd[107585]: notice: stop_child: Stopping crmd: Sent -15 to process 107592
Sep 2 03:51:44 juju-machine-1-lxc-4 crmd[107592]: warning: do_cib_control: Couldn't complete CIB registration 1 times... pause and retry
Sep 2 03:51:44 juju-machine-1-lxc-4 pacemakerd[107585]: error: pcmk_child_exit: Child process cib (107587) exited: Network is down (100)
Sep 2 03:51:44 juju-machine-1-lxc-4 pacemakerd[107585]: warning: pcmk_child_exit: Pacemaker child process cib no longer wishes to be respawned. Shutting ourselves down.
Sep 2 03:51:44 juju-machine-1-lxc-4 crmd[107592]: notice: crm_shutdown: Requesting shutdown, upper limit is 1200000ms
Sep 2 03:51:44 juju-machine-1-lxc-4 crmd[107592]: warning: do_log: FSA: Input I_SHUTDOWN from crm_shutdown() received in state S_STARTING
Sep 2 03:51:44 juju-machine-1-lxc-4 crmd[107592]: notice: do_state_transition: State transition S_STARTING -> S_STOPPING [ input=I_SHUTDOWN cause=C_SHUTDOWN origin=crm_shutdown ]
Sep 2 03:51:44 juju-machine-1-lxc-4 crmd[107592]: notice: terminate_cs_connection: Disconnecting from Corosync
Sep 2 03:51:44 juju-machine-1-lxc-4 pacemakerd[107585]: notice: stop_child: Stopping pengine: Sent -15 to process 107591
Sep 2 03:51:44 juju-machine-1-lxc-4 pacemakerd[107585]: notice: stop_child: Stopping lrmd: Sent -15 to process 107589
Sep 2 03:51:44 juju-machine-1-lxc-4 pacemakerd[107585]: notice: stop_child: Stopping stonith-ng: Sent -15 to process 107588
Sep 2 03:51:53 juju-machine-1-lxc-4 stonith-ng[107588]: error: setup_cib: Could not connect to the CIB service: Transport endpoint is not connected (-107)
Sep 2 03:51:53 juju-machine-1-lxc-4 pacemakerd[107585]: notice: pcmk_shutdown_worker: Shutdown complete
Sep 2 03:51:53 juju-machine-1-lxc-4 pacemakerd[107585]: notice: pcmk_shutdown_worker: Attempting to inhibit respawning after fatal error
Sep 2 03:53:28 juju-machine-1-lxc-4 corosync[107576]: [TOTEM ] A new membership (10.1.54.139:12) was formed. Members joined: 1000
Sep 2 03:53:29 juju-machine-1-lxc-4 corosync[107576]: [QUORUM] Members[3]: 1000 1002 1001
Sep 2 03:53:29 juju-machine-1-lxc-4 corosync[107576]: [MAIN ] Completed service synchronization, ready to provide service.
Sep 2 03:55:01 juju-machine-1-lxc-4 CRON[118680]: (root) CMD (/usr/local/lib/nagios/plugins/check_exit_status.pl -s /etc/init.d/apache2 status > /var/lib/nagios/service-check-apache2.txt)
Sep 2 03:55:01 juju-machine-1-lxc-4 CRON[118681]: (root) CMD (/usr/local/lib/nagios/plugins/check_exit_status.pl -s /etc/init.d/haproxy status > /var/lib/nagios/service-check-haproxy.txt)

David Britton (dpb)
information type: Proprietary → Public
David Britton (dpb)
no longer affects: landscape
no longer affects: landscape/release-29
tags: added: landscape-release-29
Revision history for this message
David Britton (dpb) wrote :

Hit it a second time, stalled in the same place, looping on crm status with pacemaker dead.

Revision history for this message
Billy Olsen (billy-olsen) wrote :

Looking through the logs, this appears to be what David mentioned and is related to bug 1439649, which affects pacemaker/libqb. I believe that bug is still valid: some issue there prevents the IPC from working without the ACL workaround.

(Note specifically the trace: [QB ] Invalid IPC credentials (107576-107590-2).)

For the hacluster charm-specific change, I'm going to mark this as a duplicate of bug #1490727, which JuanJo opened earlier.
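The ACL workaround referenced above (from bug 1439649) is typically a static corosync `uidgid` stanza, so that pacemaker's daemons are allowed IPC access even when pacemakerd's dynamic group registration fails (the "Denied connection attempt from 110:116" lines in the syslog). A sketch, assuming the stock `hacluster` user and `haclient` group; the drop-in file name is illustrative:

```
# /etc/corosync/uidgid.d/pacemaker (or inside corosync.conf)
uidgid {
    uid: hacluster
    gid: haclient
}
```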
