deploy region hacluster-nova-cloud-controller stuck

Bug #1491228 reported by David Britton
This bug affects 1 person
Affects: hacluster (Juju Charms Collection)
Status: New
Importance: Undecided
Assigned to: Unassigned

Bug Description

Problem: the hacluster-nova-cloud-controller hook is stuck; it keeps looping on `crm node list` even though pacemaker is down
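The hook appears to retry `crm node list` indefinitely. A bounded retry would fail the hook instead of hanging forever; here is a minimal sketch (the function name `retry_bounded` and the attempt/delay values are illustrative, not the charm's actual code):

```shell
#!/bin/sh
# Sketch: retry a command a bounded number of times instead of looping forever.
# retry_bounded MAX_TRIES DELAY CMD...  -> returns 0 on the first success,
# 1 once MAX_TRIES attempts have all failed.
retry_bounded() {
    max=$1; delay=$2; shift 2
    i=0
    while [ "$i" -lt "$max" ]; do
        if "$@"; then
            return 0
        fi
        i=$((i + 1))
        sleep "$delay"
    done
    return 1
}

# Example: give up after 3 attempts, 1 second apart.
# In the charm this would wrap something like: crm node list
retry_bounded 3 1 false && echo "succeeded" || echo "gave up after 3 tries"
```

With a cap like this, the hook would have errored out (and surfaced the dead pacemaker) rather than spinning on `crm node list`.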

root@juju-machine-1-lxc-4:/var/log/juju# tail unit-hacluster-nova-cloud-controller-1.log
2015-09-02 03:51:43 DEBUG juju-log hanode:24: Parsing cluster configuration using rid: ha:28, unit: nova-cloud-controller/1
2015-09-02 03:51:43 DEBUG juju-log hanode:24: Configuring and (maybe) restarting corosync
2015-09-02 03:51:43 INFO juju-log hanode:24: Writing file /etc/default/corosync root:root 444
2015-09-02 03:51:43 INFO juju-log hanode:24: Writing file /etc/corosync/authkey root:root 400
2015-09-02 03:51:43 INFO juju-log hanode:24: Writing file /etc/corosync/corosync.conf root:root 444
2015-09-02 03:51:43 INFO hanode-relation-changed * Restarting corosync daemon corosync
2015-09-02 03:51:43 INFO hanode-relation-changed notice [MAIN ] Corosync Cluster Engine ('2.3.3'): started and ready to provide service.
2015-09-02 03:51:43 INFO hanode-relation-changed info [MAIN ] Corosync built-in features: dbus testagents rdma watchdog augeas pie relro bindnow
2015-09-02 03:51:43 INFO hanode-relation-changed ...done.
2015-09-02 03:51:48 INFO hanode-relation-changed Starting Pacemaker Cluster Manager: [ OK ]
root@juju-machine-1-lxc-4:/var/log/juju#

# However, the hook just keeps spawning new `crm node list` commands over and over:

root@juju-machine-1-lxc-4:/var/log/juju# while sleep 2; do ps -ef |grep '[c]rm node list'; done;
root 216447 107295 0 04:22 ? 00:00:00 sh -c { crm node list; } 2>&1
root 216562 107295 0 04:22 ? 00:00:00 sh -c { crm node list; } 2>&1
root 216563 216562 0 04:22 ? 00:00:00 /usr/bin/python /usr/sbin/crm node list
root 216679 107295 0 04:22 ? 00:00:00 sh -c { crm node list; } 2>&1
root 216680 216679 0 04:22 ? 00:00:00 /usr/bin/python /usr/sbin/crm node list
root 216802 107295 0 04:23 ? 00:00:00 sh -c { crm node list; } 2>&1
root 216803 216802 0 04:23 ? 00:00:00 /usr/bin/python /usr/sbin/crm node list
root 217042 107295 0 04:23 ? 00:00:00 sh -c { crm node list; } 2>&1
root 217043 217042 0 04:23 ? 00:00:00 /usr/bin/python /usr/sbin/crm node list
root 217159 107295 0 04:23 ? 00:00:00 sh -c { crm node list; } 2>&1
root 217160 217159 0 04:23 ? 00:00:00 /usr/bin/python /usr/sbin/crm node list
root 217399 107295 0 04:23 ? 00:00:00 sh -c { crm node list; } 2>&1
root 217400 217399 0 04:23 ? 00:00:00 /usr/bin/python /usr/sbin/crm node list
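Every looping `crm node list` invocation above shares the same parent pid (107295); that parent can be identified with a standard ps query (the pid here is a placeholder so the snippet runs anywhere; on the affected machine use 107295 from the capture above):

```shell
# Show the command line of the process spawning the crm calls.
PARENT_PID=$$        # placeholder; on the affected machine use 107295
ps -p "$PARENT_PID" -o pid=,cmd=
```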

# Interacting with the machine directly: pacemaker is not running

root@juju-machine-1-lxc-4:/var/log# service pacemaker status
pacemakerd is stopped
root@juju-machine-1-lxc-4:/var/log# service corosync status
 * corosync is running
root@juju-machine-1-lxc-4:/var/log# ps -ef |grep crm
root 198713 107295 0 04:17 ? 00:00:00 sh -c { crm node list; } 2>&1
root 198714 198713 0 04:17 ? 00:00:00 /usr/bin/python /usr/sbin/crm node list
root 198716 181655 0 04:17 pts/2 00:00:00 grep --color=auto crm
root@juju-machine-1-lxc-4:/var/log# crm node list
ERROR: running cibadmin -Ql: Could not establish cib_rw connection: Connection refused (111)
Signon to CIB failed: Transport endpoint is not connected
Init failed, could not perform requested operations

syslog snippet:

Sep 2 03:50:01 juju-machine-1-lxc-4 CRON[90120]: (root) CMD (/usr/local/lib/nagios/plugins/check_exit_status.pl -s /etc/init.d/apache2 status > /var/lib/nagios/service-check-apache2.txt)
Sep 2 03:50:01 juju-machine-1-lxc-4 CRON[90114]: (CRON) info (No MTA installed, discarding output)
Sep 2 03:51:43 juju-machine-1-lxc-4 corosync[107575]: [MAIN ] Corosync Cluster Engine ('2.3.3'): started and ready to provide service.
Sep 2 03:51:43 juju-machine-1-lxc-4 corosync[107575]: [MAIN ] Corosync built-in features: dbus testagents rdma watchdog augeas pie relro bindnow
Sep 2 03:51:43 juju-machine-1-lxc-4 corosync[107576]: [TOTEM ] Initializing transport (UDP/IP Unicast).
Sep 2 03:51:43 juju-machine-1-lxc-4 corosync[107576]: [TOTEM ] Initializing transmit/receive security (NSS) crypto: none hash: none
Sep 2 03:51:43 juju-machine-1-lxc-4 corosync[107576]: [TOTEM ] The network interface [10.1.54.161] is now up.
Sep 2 03:51:43 juju-machine-1-lxc-4 corosync[107576]: [SERV ] Service engine loaded: corosync configuration map access [0]
Sep 2 03:51:43 juju-machine-1-lxc-4 corosync[107576]: [QB ] server name: cmap
Sep 2 03:51:43 juju-machine-1-lxc-4 corosync[107576]: [SERV ] Service engine loaded: corosync configuration service [1]
Sep 2 03:51:43 juju-machine-1-lxc-4 corosync[107576]: [QB ] server name: cfg
Sep 2 03:51:43 juju-machine-1-lxc-4 corosync[107576]: [SERV ] Service engine loaded: corosync cluster closed process group service v1.01 [2]
Sep 2 03:51:43 juju-machine-1-lxc-4 corosync[107576]: [QB ] server name: cpg
Sep 2 03:51:43 juju-machine-1-lxc-4 corosync[107576]: [SERV ] Service engine loaded: corosync profile loading service [4]
Sep 2 03:51:43 juju-machine-1-lxc-4 corosync[107576]: [WD ] No Watchdog, try modprobe <a watchdog>
Sep 2 03:51:43 juju-machine-1-lxc-4 corosync[107576]: [WD ] no resources configured.
Sep 2 03:51:43 juju-machine-1-lxc-4 corosync[107576]: [SERV ] Service engine loaded: corosync watchdog service [7]
Sep 2 03:51:43 juju-machine-1-lxc-4 corosync[107576]: [QUORUM] Using quorum provider corosync_votequorum
Sep 2 03:51:43 juju-machine-1-lxc-4 corosync[107576]: [SERV ] Service engine loaded: corosync vote quorum service v1.0 [5]
Sep 2 03:51:43 juju-machine-1-lxc-4 corosync[107576]: [QB ] server name: votequorum
Sep 2 03:51:43 juju-machine-1-lxc-4 corosync[107576]: [SERV ] Service engine loaded: corosync cluster quorum service v0.1 [3]
Sep 2 03:51:43 juju-machine-1-lxc-4 corosync[107576]: [QB ] server name: quorum
Sep 2 03:51:43 juju-machine-1-lxc-4 corosync[107576]: [TOTEM ] adding new UDPU member {10.1.54.139}
Sep 2 03:51:43 juju-machine-1-lxc-4 corosync[107576]: [TOTEM ] adding new UDPU member {10.1.54.161}
Sep 2 03:51:43 juju-machine-1-lxc-4 corosync[107576]: [TOTEM ] adding new UDPU member {10.1.54.156}
Sep 2 03:51:43 juju-machine-1-lxc-4 corosync[107576]: [TOTEM ] A new membership (10.1.54.161:4) was formed. Members joined: 1001
Sep 2 03:51:43 juju-machine-1-lxc-4 corosync[107576]: [QUORUM] Members[1]: 1001
Sep 2 03:51:43 juju-machine-1-lxc-4 corosync[107576]: [MAIN ] Completed service synchronization, ready to provide service.
Sep 2 03:51:43 juju-machine-1-lxc-4 corosync[107576]: [TOTEM ] A new membership (10.1.54.156:8) was formed. Members joined: 1002
Sep 2 03:51:43 juju-machine-1-lxc-4 corosync[107576]: [QUORUM] This node is within the primary component and will provide service.
Sep 2 03:51:43 juju-machine-1-lxc-4 corosync[107576]: [QUORUM] Members[2]: 1002 1001
Sep 2 03:51:43 juju-machine-1-lxc-4 corosync[107576]: [MAIN ] Completed service synchronization, ready to provide service.
Sep 2 03:51:44 juju-machine-1-lxc-4 pacemakerd[107585]: notice: mcp_read_config: Configured corosync to accept connections from group 116: Library error (2)
Sep 2 03:51:44 juju-machine-1-lxc-4 pacemakerd[107585]: notice: main: Starting Pacemaker 1.1.10 (Build: 42f2063): generated-manpages agent-manpages ncurses libqb-logging libqb-ipc lha-fencing upstart nagios heartbeat corosync-native snmp libesmtp
Sep 2 03:51:44 juju-machine-1-lxc-4 pacemakerd[107585]: notice: cluster_connect_quorum: Quorum acquired
Sep 2 03:51:44 juju-machine-1-lxc-4 pacemakerd[107585]: notice: corosync_node_name: Unable to get node name for nodeid 1000
Sep 2 03:51:44 juju-machine-1-lxc-4 pacemakerd[107585]: notice: corosync_node_name: Unable to get node name for nodeid 1001
Sep 2 03:51:44 juju-machine-1-lxc-4 pacemakerd[107585]: notice: corosync_node_name: Unable to get node name for nodeid 1002
Sep 2 03:51:44 juju-machine-1-lxc-4 pacemakerd[107585]: notice: corosync_node_name: Unable to get node name for nodeid 1001
Sep 2 03:51:44 juju-machine-1-lxc-4 pacemakerd[107585]: notice: get_node_name: Defaulting to uname -n for the local corosync node name
Sep 2 03:51:44 juju-machine-1-lxc-4 pacemakerd[107585]: notice: corosync_node_name: Unable to get node name for nodeid 1002
Sep 2 03:51:44 juju-machine-1-lxc-4 pacemakerd[107585]: notice: crm_update_peer_state: pcmk_quorum_notification: Node (null)[1002] - state is now member (was (null))
Sep 2 03:51:44 juju-machine-1-lxc-4 pacemakerd[107585]: notice: crm_update_peer_state: pcmk_quorum_notification: Node juju-machine-1-lxc-4[1001] - state is now member (was (null))
Sep 2 03:51:44 juju-machine-1-lxc-4 cib[107587]: notice: main: Using new config location: /var/lib/pacemaker/cib
Sep 2 03:51:44 juju-machine-1-lxc-4 cib[107587]: warning: retrieveCib: Cluster configuration not found: /var/lib/pacemaker/cib/cib.xml
Sep 2 03:51:44 juju-machine-1-lxc-4 cib[107587]: warning: readCibXmlFile: Primary configuration corrupt or unusable, trying backups
Sep 2 03:51:44 juju-machine-1-lxc-4 cib[107587]: warning: readCibXmlFile: Continuing with an empty configuration.
Sep 2 03:51:44 juju-machine-1-lxc-4 stonith-ng[107588]: notice: crm_cluster_connect: Connecting to cluster infrastructure: corosync
Sep 2 03:51:44 juju-machine-1-lxc-4 cib[107587]: notice: crm_cluster_connect: Connecting to cluster infrastructure: corosync
Sep 2 03:51:44 juju-machine-1-lxc-4 attrd[107590]: notice: crm_cluster_connect: Connecting to cluster infrastructure: corosync
Sep 2 03:51:44 juju-machine-1-lxc-4 stonith-ng[107588]: notice: corosync_node_name: Unable to get node name for nodeid 1001
Sep 2 03:51:44 juju-machine-1-lxc-4 stonith-ng[107588]: notice: get_node_name: Defaulting to uname -n for the local corosync node name
Sep 2 03:51:44 juju-machine-1-lxc-4 pacemakerd[107585]: notice: corosync_node_name: Unable to get node name for nodeid 1002
Sep 2 03:51:44 juju-machine-1-lxc-4 pacemakerd[107585]: notice: crm_update_peer_state: pcmk_quorum_notification: Node (null)[1002] - state is now member (was (null))
Sep 2 03:51:44 juju-machine-1-lxc-4 pacemakerd[107585]: notice: crm_update_peer_state: pcmk_quorum_notification: Node juju-machine-1-lxc-4[1001] - state is now member (was (null))
Sep 2 03:51:44 juju-machine-1-lxc-4 cib[107587]: notice: main: Using new config location: /var/lib/pacemaker/cib
Sep 2 03:51:44 juju-machine-1-lxc-4 cib[107587]: warning: retrieveCib: Cluster configuration not found: /var/lib/pacemaker/cib/cib.xml
Sep 2 03:51:44 juju-machine-1-lxc-4 cib[107587]: warning: readCibXmlFile: Primary configuration corrupt or unusable, trying backups
Sep 2 03:51:44 juju-machine-1-lxc-4 cib[107587]: warning: readCibXmlFile: Continuing with an empty configuration.
Sep 2 03:51:44 juju-machine-1-lxc-4 stonith-ng[107588]: notice: crm_cluster_connect: Connecting to cluster infrastructure: corosync
Sep 2 03:51:44 juju-machine-1-lxc-4 cib[107587]: notice: crm_cluster_connect: Connecting to cluster infrastructure: corosync
Sep 2 03:51:44 juju-machine-1-lxc-4 attrd[107590]: notice: crm_cluster_connect: Connecting to cluster infrastructure: corosync
Sep 2 03:51:44 juju-machine-1-lxc-4 stonith-ng[107588]: notice: corosync_node_name: Unable to get node name for nodeid 1001
Sep 2 03:51:44 juju-machine-1-lxc-4 stonith-ng[107588]: notice: get_node_name: Defaulting to uname -n for the local corosync node name
Sep 2 03:51:44 juju-machine-1-lxc-4 crmd[107592]: notice: main: CRM Git Version: 42f2063
Sep 2 03:51:44 juju-machine-1-lxc-4 corosync[107576]: [MAIN ] Denied connection attempt from 110:116
Sep 2 03:51:44 juju-machine-1-lxc-4 corosync[107576]: [QB ] Invalid IPC credentials (107576-107587-2).
Sep 2 03:51:44 juju-machine-1-lxc-4 cib[107587]: error: cluster_connect_cpg: Could not connect to the Cluster Process Group API: 11
Sep 2 03:51:44 juju-machine-1-lxc-4 cib[107587]: crit: cib_init: Cannot sign in to the cluster... terminating
Sep 2 03:51:44 juju-machine-1-lxc-4 corosync[107576]: [MAIN ] Denied connection attempt from 110:116
Sep 2 03:51:44 juju-machine-1-lxc-4 corosync[107576]: [QB ] Invalid IPC credentials (107576-107590-2).
Sep 2 03:51:44 juju-machine-1-lxc-4 attrd[107590]: error: cluster_connect_cpg: Could not connect to the Cluster Process Group API: 11
Sep 2 03:51:44 juju-machine-1-lxc-4 attrd[107590]: error: main: HA Signon failed
Sep 2 03:51:44 juju-machine-1-lxc-4 attrd[107590]: error: main: Aborting startup
Sep 2 03:51:44 juju-machine-1-lxc-4 pacemakerd[107585]: error: pcmk_child_exit: Child process attrd (107590) exited: Network is down (100)
Sep 2 03:51:44 juju-machine-1-lxc-4 pacemakerd[107585]: warning: pcmk_child_exit: Pacemaker child process attrd no longer wishes to be respawned. Shutting ourselves down.
Sep 2 03:51:44 juju-machine-1-lxc-4 pacemakerd[107585]: notice: pcmk_shutdown_worker: Shuting down Pacemaker
Sep 2 03:51:44 juju-machine-1-lxc-4 pacemakerd[107585]: notice: stop_child: Stopping crmd: Sent -15 to process 107592
Sep 2 03:51:44 juju-machine-1-lxc-4 crmd[107592]: warning: do_cib_control: Couldn't complete CIB registration 1 times... pause and retry
Sep 2 03:51:44 juju-machine-1-lxc-4 pacemakerd[107585]: error: pcmk_child_exit: Child process cib (107587) exited: Network is down (100)
Sep 2 03:51:44 juju-machine-1-lxc-4 pacemakerd[107585]: warning: pcmk_child_exit: Pacemaker child process cib no longer wishes to be respawned. Shutting ourselves down.
Sep 2 03:51:44 juju-machine-1-lxc-4 crmd[107592]: notice: crm_shutdown: Requesting shutdown, upper limit is 1200000ms
Sep 2 03:51:44 juju-machine-1-lxc-4 crmd[107592]: warning: do_log: FSA: Input I_SHUTDOWN from crm_shutdown() received in state S_STARTING
Sep 2 03:51:44 juju-machine-1-lxc-4 crmd[107592]: notice: do_state_transition: State transition S_STARTING -> S_STOPPING [ input=I_SHUTDOWN cause=C_SHUTDOWN origin=crm_shutdown ]
Sep 2 03:51:44 juju-machine-1-lxc-4 crmd[107592]: notice: terminate_cs_connection: Disconnecting from Corosync
Sep 2 03:51:44 juju-machine-1-lxc-4 pacemakerd[107585]: notice: stop_child: Stopping pengine: Sent -15 to process 107591
Sep 2 03:51:44 juju-machine-1-lxc-4 pacemakerd[107585]: notice: stop_child: Stopping lrmd: Sent -15 to process 107589
Sep 2 03:51:44 juju-machine-1-lxc-4 pacemakerd[107585]: notice: stop_child: Stopping stonith-ng: Sent -15 to process 107588
Sep 2 03:51:53 juju-machine-1-lxc-4 stonith-ng[107588]: error: setup_cib: Could not connect to the CIB service: Transport endpoint is not connected (-107)
Sep 2 03:51:53 juju-machine-1-lxc-4 pacemakerd[107585]: notice: pcmk_shutdown_worker: Shutdown complete
Sep 2 03:51:53 juju-machine-1-lxc-4 pacemakerd[107585]: notice: pcmk_shutdown_worker: Attempting to inhibit respawning after fatal error
Sep 2 03:53:28 juju-machine-1-lxc-4 corosync[107576]: [TOTEM ] A new membership (10.1.54.139:12) was formed. Members joined: 1000
Sep 2 03:53:29 juju-machine-1-lxc-4 corosync[107576]: [QUORUM] Members[3]: 1000 1002 1001
Sep 2 03:53:29 juju-machine-1-lxc-4 corosync[107576]: [MAIN ] Completed service synchronization, ready to provide service.
Sep 2 03:55:01 juju-machine-1-lxc-4 CRON[118680]: (root) CMD (/usr/local/lib/nagios/plugins/check_exit_status.pl -s /etc/init.d/apache2 status > /var/lib/nagios/service-check-apache2.txt)
Sep 2 03:55:01 juju-machine-1-lxc-4 CRON[118681]: (root) CMD (/usr/local/lib/nagios/plugins/check_exit_status.pl -s /etc/init.d/haproxy status > /var/lib/nagios/service-check-haproxy.txt)

David Britton (dpb)
information type: Proprietary → Public
David Britton (dpb)
no longer affects: landscape
no longer affects: landscape/release-29
tags: added: landscape-release-29
Revision history for this message
David Britton (dpb) wrote :

Hit it a second time, stalled in the same place, looping on crm status with pacemaker dead.

Revision history for this message
Billy Olsen (billy-olsen) wrote :

Looking through the logs, this appears to be what David mentioned and is related to bug 1439649, which affects pacemaker/libqb. I believe that bug is still valid: some issue there prevents the IPC from working without the ACL workaround.

(Note specifically the trace: [QB ] Invalid IPC credentials (107576-107590-2).)

For the hacluster charm-specific change, I'm going to mark this as a duplicate of bug #1490727, which JuanJo opened earlier.
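The ACL workaround referenced above (from bug 1439649) is typically a static corosync `uidgid` stanza, so that pacemaker's daemons are allowed IPC access even when pacemakerd's dynamic group registration fails (the "Denied connection attempt from 110:116" lines in the syslog). A sketch, assuming the stock `hacluster` user and `haclient` group; the drop-in file name is illustrative:

```
# /etc/corosync/uidgid.d/pacemaker (or inside corosync.conf)
uidgid {
    uid: hacluster
    gid: haclient
}
```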
