Corosync - 100% CPU consumption - HA not working

Bug #1379484 reported by Jorge Niedbalski
This bug affects 5 people
Affects: hacluster (Juju Charms Collection)
Status: Invalid
Importance: Medium
Assigned to: Unassigned

Bug Description

The following scenario was brought to my attention:
------------

One keystone HA deployment has the following errors:

Aug 18 07:54:24 r1-keystone pengine[1232]: warning: stage6: Node r2-keystone-bk is unclean!
Aug 18 07:54:24 r1-keystone pengine[1232]: warning: stage6: YOUR RESOURCES ARE NOW LIKELY COMPROMISED
Aug 18 07:54:24 r1-keystone pengine[1232]: error: stage6: ENABLE STONITH TO KEEP YOUR RESOURCES SAFE
Aug 18 07:54:24 r1-keystone pengine[1232]: error: process_pe_message: Calculated Transition 2315: /var/lib/pacemaker/pengine/pe-error-0.bz2

Keystone cluster CIB :

<cib epoch="19" num_updates="0" admin_epoch="0" validate-with="pacemaker-1.2" crm_feature_set="3.0.7" cib-last-written="Tue Aug 12 08:23:19 2014" update-origin="r2-horizon" update-client="crmd" have-quorum="1">

<cib epoch="19" num_updates="0" admin_epoch="0" validate-with="pacemaker-1.2" crm_feature_set="3.0.7" cib-last-written="Fri Jul 25 06:02:30 2014" update-origin="r2-horizon" update-client="crmd" have-quorum="1" dc-uuid="169738387">

It has never been synchronized.

------------
What is the recommended way to deploy keystone together with pacemaker + corosync?

Tags: openstack cts ha
tags: added: cts
tags: added: ha openstack
Revision history for this message
James Page (james-page) wrote :

This looks like you may have a conflicting cluster configuration within your deployment; specifically, perhaps two services (keystone and horizon) are using the same multicast port?
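For illustration only (not taken from this deployment), such a conflict would mean two otherwise separate clusters rendering corosync.conf files that share the same multicast group; all addresses and ports below are placeholders:

   # Hypothetical /etc/corosync/corosync.conf excerpt on the keystone units;
   # the horizon cluster ends up with the same mcastaddr/mcastport pair
   totem {
       version: 2
       interface {
           ringnumber: 0
           bindnetaddr: 10.0.0.0
           mcastaddr: 226.94.1.1
           mcastport: 5405
       }
   }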

James Page (james-page)
Changed in keystone (Juju Charms Collection):
status: New → Incomplete
Revision history for this message
Juan L. Negron (negronjl) wrote :

Hi Jorge:

To clarify things here, can you paste your configuration?

-Juan

Revision history for this message
Ante Karamatić (ivoks) wrote :

IMO, this was caused by bad corosync configuration and should be fixed in /next hacluster charm.

Revision history for this message
Edward Hope-Morley (hopem) wrote :

Possibly a symptom of https://bugs.launchpad.net/charms/+source/keystone/+bug/1403132. If so, ensuring a different mcastaddr is used for each cluster should fix this.
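If multicast is kept, one way to do that (assuming the hacluster charm's corosync_mcastaddr/corosync_mcastport options and illustrative service names) would be:

   # Give each cluster its own multicast group/port (values are examples only)
   juju set hacluster-keystone corosync_mcastaddr=226.94.1.1 corosync_mcastport=5405
   juju set hacluster-horizon corosync_mcastaddr=226.94.1.2 corosync_mcastport=5406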

Revision history for this message
James Page (james-page) wrote :

I spent time before christmas re-testing the various deployment options for the hacluster charm; I specifically saw this issue when deploying services in LXC containers under the MAAS provider; it would appear that the various bridges and network interfaces involved in connecting a container to the outside world are a bit flaky from a multicast perspective.

My recommendation is that you update to the latest version of the hacluster charm and then switch to using unicast; you'll most likely need to do the following if corosync is spinning at 100% CPU time:

   juju run --service keystone "sudo pkill -9 corosync"
   juju upgrade-charm hacluster-keystone && juju set hacluster-keystone corosync_transport=unicast

After a short while, your cluster should be reconfigured to use unicast, and all services should be back up and running. You'll also need to tidy up the old node records (the unicast ones will start at 1001 onwards - don't delete those):

   sudo crm node list (to list the node IDs)
   sudo crm configure
      delete <id>

This can be done from one of the nodes in the cluster.
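For reference, after switching the transport the rendered /etc/corosync/corosync.conf should move from a multicast interface block to udpu with an explicit nodelist; a sketch with placeholder addresses and nodeids:

   totem {
       version: 2
       transport: udpu
   }
   nodelist {
       node {
           ring0_addr: 10.0.0.11
           nodeid: 1000
       }
       node {
           ring0_addr: 10.0.0.12
           nodeid: 1001
       }
       node {
           ring0_addr: 10.0.0.13
           nodeid: 1002
       }
   }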

Changed in keystone (Juju Charms Collection):
status: Incomplete → Triaged
importance: Undecided → High
James Page (james-page)
summary: - Keystone HA - Corosync - 100% CPU consumption - HA not working
+ Corosync - 100% CPU consumption - HA not working
Revision history for this message
Paul Gear (paulgear) wrote :

I'm seeing this with unicast transport on hacluster charm r41. I'm currently trying to get sufficient logging out of corosync to diagnose the issue.

Revision history for this message
Jorge Niedbalski (niedbalski) wrote :

Quick note: we are seeing this behavior even with unicast transport enabled on LXC containers.

Revision history for this message
Xiang Hui (xianghui) wrote :

We have reproduced this issue with unicast, but so far only with maas+lxc; it has not happened with local+lxc.
Deploy three keystone nodes with hacluster by setting 'corosync_transport=udpu':
unit        hostname              id    ip
keystone/0  juju-machine-1-lxc-0  1002  192.168.100.67
keystone/1  juju-machine-4-lxc-0  1001  192.168.100.69
keystone/2  juju-machine-5-lxc-0  1000  192.168.100.71

After adding the keystone-hacluster relation and waiting for a while, a split brain happened: keystone/1 and keystone/2 form one cluster, while keystone/0 is on its own.

keystone/1/2:
root@juju-machine-5-lxc-0:~# crm status
Online: [ juju-machine-4-lxc-0 juju-machine-5-lxc-0 ]

 Resource Group: grp_ks_vips
     res_ks_eth0_vip (ocf::heartbeat:IPaddr2): Started juju-machine-4-lxc-0
 Clone Set: cl_ks_haproxy [res_ks_haproxy]
     Started: [ juju-machine-4-lxc-0 juju-machine-5-lxc-0 ]

root@juju-machine-5-lxc-0:~# corosync-quorumtool -l
Membership information
----------------------
    Nodeid Votes Name
      1001 1 192.168.100.69
      1000 1 192.168.100.71 (local)

keystone/0:
root@juju-machine-1-lxc-0:~# crm status
Last updated: Fri Feb 6 08:18:26 2015
Last change: Fri Feb 6 07:17:28 2015 via crmd on juju-machine-1-lxc-0
Stack: corosync
Current DC: juju-machine-1-lxc-0 (1002) - partition WITHOUT quorum
Version: 1.1.10-42f2063
1 Nodes configured
2 Resources configured

Online: [ juju-machine-1-lxc-0 ]

 Resource Group: grp_ks_vips
     res_ks_eth0_vip (ocf::heartbeat:IPaddr2): Started juju-machine-1-lxc-0
 Clone Set: cl_ks_haproxy [res_ks_haproxy]
     Started: [ juju-machine-1-lxc-0 ]

root@juju-machine-1-lxc-0:~# corosync-quorumtool -l
(hang forever ....)

Turning to the keystone/0 node and checking the CPU, we find it is 100% used by corosync.
Tasks: 42 total, 2 running, 40 sleeping, 0 stopped, 0 zombie
%Cpu(s):100.0 us, 0.0 sy, 0.0 ni, 0.0 id, 0.0 wa, 0.0 hi, 0.0 si, 0.0 st
KiB Mem: 1017896 total, 932296 used, 85600 free, 19148 buffers
KiB Swap: 1770492 total, 5572 used, 1764920 free. 409312 cached Mem

  PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
18637 root 20 0 704252 199272 34016 R 99.9 19.6 44:40.43 corosync

From the netstat output, one interesting finding is that the Recv-Q size has a value of 320256, which is higher than normal. One explanation found online is that the Recv-Q getting stuck at some point increased the CPU usage of the corosync process that uses the socket; we are not sure whether we should bump the Recv-Q buffer or whether this is related to some kind of bottleneck on the juju-br0 linux bridge interface.
And after simply doing pkill -9 corosync and restarting corosync/pacemaker, the whole cluster is back to normal.

Active Internet connections (only servers)
Proto Recv-Q Send-Q Local Address Foreign Address State PID/Program name
udp 320256 0 192.168.100.67:5434 0.0.0.0:* ...
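If it helps anyone reproducing this, a simple way to keep an eye on the corosync socket queues (generic commands, nothing charm-specific):

   # Poll the UDP sockets owned by corosync and watch Recv-Q/Send-Q grow
   sudo watch -n 5 'netstat -unap | grep corosync'
   # or, with iproute2:
   sudo watch -n 5 'ss -unap | grep corosync'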


Revision history for this message
Xiang Hui (xianghui) wrote :

The attachment syslog.100cpu above is the syslog on keystone/0 when the split brain shows up; syslog.after-restart-corosync is from after restarting corosync, when everything is back to normal on keystone/0.

Revision history for this message
Matt Rae (mattrae) wrote :

Noting that this bug happens with all charms which can use the hacluster subordinate charm, not just keystone.

Changed in hacluster (Juju Charms Collection):
status: New → Triaged
importance: Undecided → Critical
importance: Critical → High
no longer affects: keystone (Juju Charms Collection)
Revision history for this message
James Page (james-page) wrote :

I'd recommend setting the cluster_count configuration to 3 - this will ensure that any isolated node without quorum will shut down its services, rather than assuming that it's the only one left.
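Assuming a subordinate named hacluster-keystone as in the examples above, that would be:

   juju set hacluster-keystone cluster_count=3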

Revision history for this message
Billy Olsen (billy-olsen) wrote :

I spent a while trying to recreate this tonight, and while I didn't recreate the split brain, I did recreate corosync going into a 100% CPU spin. In this scenario, corosync locked up early, pacemaker was unable to communicate with corosync, and crm -S would continually fail. I gathered a core dump of the process while it was running and got the current backtrace of the process from an attached gdb session. Unfortunately corosync was not in debug mode, but I will attach an sosreport and a core dump I triggered from the machine.
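For anyone wanting to capture the same data, a rough sequence with stock gdb (no charm-specific tooling; the core file path is just an example) is:

   # Attach to the spinning corosync process, dump all thread backtraces,
   # and write a core file for offline analysis
   sudo gdb -p $(pidof corosync) \
       -ex 'set pagination off' \
       -ex 'thread apply all bt' \
       -ex 'gcore /tmp/corosync.core' \
       -ex detach -ex quit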

Revision history for this message
Billy Olsen (billy-olsen) wrote :
Revision history for this message
Billy Olsen (billy-olsen) wrote :
Revision history for this message
Billy Olsen (billy-olsen) wrote :
Changed in hacluster (Juju Charms Collection):
importance: High → Critical
Revision history for this message
Jorge Niedbalski (niedbalski) wrote :

From the core generated by @billy-olsen it seems that corosync thread 2 is looping on rrp_* and then jumping into joinlist_inform_clients for every process in the group (crmd, pacemaker, and so on) as part of the sync barrier stage.

What I don't know at this specific point (because there are no logs) is whether the service synchronization was marked as done or was rejected for some reason. What's interesting is that this phase occurs right after the totem ring is formed.

Related to this, there is an interesting upstream patch to make sure that nodes which have left are really removed when the cpg process exits [0].

From @xianghui's syslog, the only interesting thing I am observing is a continuous loop over:

Sep 03 17:28:15 [22111] dev-cluster2-node4 corosync debug [QB ] qb_ipcs_disconnect(22112-22133-19) state:2
Sep 03 17:28:15 [22111] dev-cluster2-node4 corosync debug [QB ] epoll_ctl(del): Bad file descriptor (9)
Sep 03 17:28:15 [22111] dev-cluster2-node4 corosync debug [MAIN ] cs_ipcs_connection_closed()

This last one could be related to this upstream patch [1]; it is worth checking whether the Ubuntu version includes the fixes for both bugs.

[0] http://lists.corosync.org/pipermail/discuss/2014-February/003019.html
[1] https://bugzilla.redhat.com/show_bug.cgi?id=959189

If, in the meantime, anybody from the community wants to help hunt this bug, please proceed as follows:

1) Enable corosync debugging; with juju, use 'juju set hacluster debug=True'

2) Once you reach the split-brain scenario, please run the following commands on all of the cluster's nodes:
2.1) sudo crm_report -f 00:00

This will generate a file named pcmk-<datetime> in the current working directory. Please attach this file to this case.

2.2) sudo killall -USR1 crmd
2.3) sudo killall -TRAP crmd

This command will generate a directory under /var/lib/pacemaker/blackbox/crmd-*. Please attach that directory (/var/lib/pacemaker/blackbox/crmd-*) to this case.

3) Please make sure to grab a full sosreport at this point, and also compress /var/lib/pacemaker and /var/lib/corosync and upload them here (a consolidated sketch of these steps is below).
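For convenience, a rough consolidation of the above, run on each node once the split brain is visible (the tarball name is just an example):

   sudo crm_report -f 00:00        # produces pcmk-<datetime> in the current directory
   sudo killall -USR1 crmd         # enable the crmd blackbox
   sudo killall -TRAP crmd         # dump it under /var/lib/pacemaker/blackbox/crmd-*
   sudo sosreport
   sudo tar czf /tmp/ha-state.tar.gz /var/lib/pacemaker /var/lib/corosync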

Thanks

Revision history for this message
Matt Rae (mattrae) wrote :

We were originally on an older version of the charms when we ran into this issue. Noting that after using a newer version of the charms, we haven't seen this issue so far.

It's difficult to tell the history of the newer charms because it appears that the bzr repositories were reinitialized at one point, so most of the log is missing. I was able to confirm, though, that the newer versions contain the fix for lp: 1394008.

Revision history for this message
Edward Hope-Morley (hopem) wrote :

So we have been seeing this kind of behaviour in a number of places over the last few months, specifically when running corosync clusters in lxc containers. Bumping the lxc veth MTU to a higher value than the one inside the container has proven to be a decent workaround up until now, and support for this exists in Juju (although it is broken and in the process of being fixed [1]). A number of people have pinpointed some issues in corosync itself, but the biggest culprit looks to be [2], for which we now have an SRU in trusty-proposed. Please upgrade to this package (and manually resolve any corosync cluster issues - restarting seems to fix them) and report back if you still see this issue.

[1] https://bugs.launchpad.net/juju-core/+bug/1442257
[2] https://bugs.launchpad.net/ubuntu/+source/libqb/+bug/1341496
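For illustration, the workaround and the package update would look roughly like this (the veth name is a placeholder for the host-side interface of the affected container, the MTU value is just an example, and the libqb runtime package is assumed to be libqb0):

   # On the LXC host: raise the MTU of the container's veth above the MTU inside the container
   sudo ip link set dev vethXXXXXX mtu 9000

   # On the corosync nodes, with trusty-proposed enabled: pull in the libqb fix from [2]
   sudo apt-get update && sudo apt-get install -t trusty-proposed libqb0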

Changed in hacluster (Juju Charms Collection):
importance: Critical → Medium
Revision history for this message
Edward Hope-Morley (hopem) wrote :

Since we have not had reports of this problem for a while, particularly since the aforementioned fixes were released, I am going to mark this as no longer an issue. If anybody hits this issue again, please reopen the ticket.

Changed in hacluster (Juju Charms Collection):
status: Triaged → Invalid