corosync hangs inside libqb
Affects | Status | Importance | Assigned to | Milestone
---|---|---|---|---
libqb (Ubuntu) | Fix Released | Undecided | Kick In |
Trusty | Fix Released | Medium | Unassigned |
Utopic | Won't Fix | Medium | Unassigned |
Bug Description
$ lsb_release -rd
Description: Ubuntu 14.04 LTS
Release: 14.04
$ apt-cache policy libqb0
libqb0:
Installed: 0.16.0.
Candidate: 0.16.0.
Version table:
*** 0.16.0.
500 http://
100 /var/lib/
Corosync sometimes hangs inside libqb. I've looked at a hung process with gdb, and I think I've found the problem.
The problem is the loop here: https:/
This was fixed in 0.17.0, see: https:/
I think bumping to 0.17.0 should fix this (at least in backports? Please?)
-------
[Impact]
* libqb does not currently handle ring buffer alloc errors properly. The
result of this is corosync frequently ending up in an infinite loop
(consuming 100% cpu) as it continuously tries and fails to allocate
space from the ringbuffer due to erroneous logic when an attempt to
reclaim space fails. This patch ensures that when the reclaim fails the
libqb library gracefully errors out and allows corosync to proceed with
execution.
* This is fixed by cherry-picking the following 2 commits:
- https:/
- https:/
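To illustrate the failure mode described above, here is a minimal toy model in Python. It is not libqb's code (libqb is written in C), and the class, method names, and the `reclaim_blocked` flag are all hypothetical. It only sketches the idea of the fix: when an attempt to reclaim space fails, the allocation returns an error instead of retrying forever.

```python
from collections import deque
from typing import Optional


class ToyRingBuffer:
    """Toy fixed-capacity buffer. Illustrative only, not libqb's API."""

    def __init__(self, capacity: int) -> None:
        self.capacity = capacity
        self.used = 0
        self.chunks: deque = deque()  # sizes of committed chunks, oldest first
        self.reclaim_blocked = False  # simulates a reclaim that cannot succeed

    def reclaim_oldest(self) -> bool:
        """Free the oldest chunk; return False if nothing could be freed."""
        if self.reclaim_blocked or not self.chunks:
            return False
        self.used -= self.chunks.popleft()
        return True

    def alloc(self, size: int) -> Optional[int]:
        """Reserve `size` units, or return None if space cannot be made."""
        if size > self.capacity:
            return None
        while self.capacity - self.used < size:
            # The hang corresponds to ignoring this result and looping
            # again; here a failed reclaim ends the attempt with an error.
            if not self.reclaim_oldest():
                return None
        self.chunks.append(size)
        self.used += size
        return size


rb = ToyRingBuffer(capacity=8)
rb.alloc(4)
rb.alloc(4)                # buffer is now full
rb.reclaim_blocked = True  # pretend the reclaim step keeps failing
result = rb.alloc(4)       # returns None instead of spinning at 100% CPU
```

With the check on the reclaim result removed, the `while` loop in this toy model would never terminate once the reclaim is blocked, which is the kind of behaviour the two cherry-picked commits are meant to prevent.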
[Test Case]
There is a test case in comment #2.
A simple way for me to reproduce the problem (I used juju):
1. Deploy a 2 node percona-cluster w/ corosync and pacemaker.
2. Scale the number of units from 2 to 5 nodes.
3. Observe that one of the corosync instances reaches 100% CPU usage and is stuck.
e.g.
juju bootstrap
# install percona-cluster
juju deploy -n 2 cs:trusty/
juju deploy cs:trusty/hacluster
# configure corosync to use unicast for communication
juju set hacluster corosync_
# configure the virtual ip for corosync
juju set percona-cluster vip=<your-vip>
# cause juju to configure the corosync/pacemaker configuration with percona-cluster.
juju add-relation percona-cluster hacluster
# wait for juju debug-log to go quiet.
# then expand the cluster by 3 nodes.
juju add-unit -n 3 percona-cluster
[Regression Potential]
* As a result of the changes, this may cause a blackbox log entry to be
dropped or it may cause a ring to be discarded and a new ring to be
created.
- If a log entry is dropped, some information may be missing from the
blackbox used later for analysis. However, upstream has determined
that missing a log entry is preferable to hanging the corosync
process.
- Rings are discarded as part of the normal corosync communication
process, and corosync already knows how to properly handle this
situation so the risk is small in this area.
Changed in libqb (Ubuntu):
assignee: nobody → Kick In (kick-d)
Changed in libqb (Ubuntu Trusty):
importance: Undecided → Medium
Changed in libqb (Ubuntu Utopic):
importance: Undecided → Medium
Changed in libqb (Ubuntu Trusty):
status: New → In Progress
Changed in libqb (Ubuntu Utopic):
status: New → In Progress
I had the same issue and eventually I ended up with: https://launchpad.net/~claudiu-popescu/+archive/ubuntu/ppa/+packages
I installed it on a testing cluster and it is working ok for now.
I strongly advise against using this directly in production since it is my first library built for Ubuntu.
Maybe someone will make an official release soon, since v0.16.0 is not usable in production environments.