Precise corosync dies if failed_to_recv is set

Bug #1318441 reported by Rafael David Tinoco on 2014-05-12
This bug affects 1 person
Affects            Status        Importance  Assigned to  Milestone
corosync (Ubuntu)  Fix Released  Undecided   Unassigned
  Precise          Fix Released  Medium      Unassigned

Bug Description

[Impact]

 * Under certain conditions the *precise* corosync daemon may quit if it
   detects that it is unable to receive messages. The logic asserts the
   existence of at least one functional node, but the node marks itself as a
   failed node (not following the specification). It is safe to skip this
   assertion when failed_to_recv is set.
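
The condition described above can be illustrated with a small toy model. This is a sketch in Python, not the corosync C source: the function name `consensus_agreed`, its parameters, and the `patched` flag are assumptions made for illustration. It assumes the candidate membership is the processor list minus the failed list, and that the unpatched logic asserts that this set is non-empty.

```python
def consensus_agreed(proc_list, failed_list, consensus, failed_to_recv, patched):
    """Toy model of the membership check (hypothetical, not corosync's code).

    proc_list      -- node ids this node knows about
    failed_list    -- node ids this node considers failed
    consensus      -- dict mapping node id -> True if consensus was received
    failed_to_recv -- True if this node detected it cannot receive messages
    patched        -- True to skip the assertion when failed_to_recv is set
    """
    # Candidate members: known processors that are not marked as failed.
    candidates = [n for n in proc_list if n not in failed_list]

    # Agreement requires consensus from every remaining candidate.
    agreed = all(consensus.get(n, False) for n in candidates)

    if patched and agreed and failed_to_recv:
        # The node has marked itself as failed, so the candidate set may be
        # empty; return early instead of asserting on its size.
        return True

    # Unpatched behaviour in this model: require at least one candidate.
    assert len(candidates) >= 1, "no functional node left in membership"
    return agreed
```

In this model, a node whose failed list covers the whole processor list trips the assertion when `patched` is false and returns normally when it is true.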

[Test Case]

 * Using the "corosync test suite" on the precise-test machine:

   - Make sure to set ssh keys so precise-test can access precise-cluster-{01,02}.
   - Make sure only failed-to-receive-crash.sh is executable in the "tests" directory.
   - Make sure the precise-cluster-{01,02} nodes have the build dependencies for corosync installed.
   - sudo ./run-tests.sh -c flatiron -n "precise-cluster-01 precise-cluster-02"
   - Check the corosync log messages to see corosync dying on precise-cluster-01.
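
The last step can be checked mechanically. The following is a minimal sketch in Python, assuming the log text and `ps -ef` output have already been captured as strings; the helper names `saw_failed_to_receive` and `corosync_running` are hypothetical and are not part of the test suite.

```python
def saw_failed_to_receive(log_text):
    """Return True if any log line reports the 'FAILED TO RECEIVE' event."""
    return any("FAILED TO RECEIVE" in line for line in log_text.splitlines())


def corosync_running(ps_output):
    """Return True if `ps -ef` style output lists a corosync process.

    Assumes the usual eight `ps -ef` columns, with the command last.
    Lines such as `tail -f .../corosync.log` or `grep corosync` do not
    count, because only the executable name is compared.
    """
    for line in ps_output.splitlines():
        fields = line.split(None, 7)
        if len(fields) < 8:
            continue  # not a full ps -ef row
        # First word of the command column, with any directory stripped.
        executable = fields[7].split()[0].rsplit("/", 1)[-1]
        if executable == "corosync":
            return True
    return False
```

A node reproduces the bug when `saw_failed_to_receive` is true for its log and `corosync_running` is false for its process list.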

[Regression Potential]

 * We no longer assert the existence of at least 1 node in the corosync
   cluster. Since there is always 1 node in the cluster (the node itself), it
   is very unlikely that this change alters corosync's membership logic. If it
   does, corosync will likely recover from the error and establish a new
   membership (with 1 or more nodes).

[Other Info]

 * n/a

tags: added: corosync
tags: added: precise
Changed in corosync (Ubuntu):
status: New → In Progress
assignee: nobody → Rafael David Tinoco (inaddy)
description: updated
Rafael David Tinoco (inaddy) wrote :

######## Tests before the patch:

#
# NODE 1
#

--- MARKER --- ./failed-to-receive-crash.sh at 2014-05-09-17:33:04 --- MARKER ---
May 09 17:33:04 corosync [MAIN]: ] Corosync Cluster Engine ('1.4.2'): started and ready to provide service.
May 09 17:33:04 corosync [MAIN]: ] Corosync built-in features: nss
May 09 17:33:04 corosync [MAIN]: ] Successfully read main configuration file '/etc/corosync/corosync.conf'.
May 09 17:33:04 corosync [TOTEM]: ] Initializing transport (UDP/IP Multicast).
May 09 17:33:04 corosync [TOTEM]: ] Initializing transmit/receive security: libtomcrypt SOBER128/SHA1HMAC (mode 0).
May 09 17:33:04 corosync [TOTEM]: ] The network interface [192.168.168.1] is now up.
May 09 17:33:04 corosync [SERV]: ] Service engine loaded: openais checkpoint service B.01.01
May 09 17:33:04 corosync [SERV]: ] Service engine loaded: corosync extended virtual synchrony service
May 09 17:33:04 corosync [SERV]: ] Service engine loaded: corosync configuration service
May 09 17:33:04 corosync [SERV]: ] Service engine loaded: corosync cluster closed process group service v1.01
May 09 17:33:04 corosync [SERV]: ] Service engine loaded: corosync cluster config database access v1.01
May 09 17:33:04 corosync [SERV]: ] Service engine loaded: corosync profile loading service
May 09 17:33:04 corosync [SERV]: ] Service engine loaded: corosync cluster quorum service v0.1
May 09 17:33:04 corosync [MAIN]: ] Compatibility mode set to whitetank. Using V1 and V2 of the synchronization engine.
May 09 17:33:04 corosync [TOTEM]: ] A processor joined or left the membership and a new membership was formed.
May 09 17:33:04 corosync [CPG]: ] chosen downlist: sender r(0) ip(192.168.168.1) ; members(old:0 left:0)
May 09 17:33:04 corosync [MAIN]: ] Completed service synchronization, ready to provide service.
May 09 17:33:05 corosync [TOTEM]: ] A processor joined or left the membership and a new membership was formed.
May 09 17:33:05 corosync [CPG]: ] chosen downlist: sender r(0) ip(192.168.168.1) ; members(old:1 left:0)
May 09 17:33:05 corosync [MAIN]: ] Completed service synchronization, ready to provide service.
May 09 17:33:10 corosync [TOTEM]: ] FAILED TO RECEIVE

# COROSYNC HAS DIED BEFORE TEST CASE TRIES TO STOP IT

root@precise-cluster-01:~# ps -ef | grep corosync
root 1414 1306 0 17:31 pts/0 00:00:00 tail -f /var/log/cluster/corosync.log
root 4712 1306 0 17:33 pts/0 00:00:00 grep --color=auto corosync

######## Tests after the patch:

May 11 22:27:48 corosync [MAIN]: ] Corosync Cluster Engine ('1.4.2'): started and ready to provide service.
May 11 22:27:48 corosync [MAIN]: ] Corosync built-in features: nss
May 11 22:27:48 corosync [MAIN]: ] Successfully read main configuration file '/etc/corosync/corosync.conf'.
May 11 22:27:48 corosync [TOTEM]: ] Initializing transport (UDP/IP Multicast).
May 11 22:27:48 corosync [TOTEM]: ] Initializing transmit/receive security: libtomcrypt SOBER128/SHA1HMAC (mode 0).
May 11 22:27:48 corosync [TOTEM]: ] The network interface [192.168.168.1] is now up.
May 11 22:27:48 corosync [SERV]: ] Service engine loaded: openais checkpoint service B.01.01
May 11 22:27:48 co...


description: updated
Rafael David Tinoco (inaddy) wrote :

Attaching patch.

description: updated
Chris J Arges (arges) on 2014-05-12
Changed in corosync (Ubuntu Precise):
assignee: nobody → Rafael David Tinoco (inaddy)
Changed in corosync (Ubuntu):
status: In Progress → Fix Released
Changed in corosync (Ubuntu Precise):
status: New → In Progress
importance: Undecided → Medium
Changed in corosync (Ubuntu):
assignee: Rafael David Tinoco (inaddy) → nobody
Chris J Arges (arges) wrote :

Sponsored for Precise.

Hello Rafael, or anyone else affected,

Accepted corosync into precise-proposed. The package will build now and be available at http://launchpad.net/ubuntu/+source/corosync/1.4.2-2ubuntu0.2 in a few hours, and then in the -proposed repository.

Please help us by testing this new package. See https://wiki.ubuntu.com/Testing/EnableProposed for documentation on how to enable and use -proposed. Your feedback will aid us in getting this update out to other Ubuntu users.

If this package fixes the bug for you, please add a comment to this bug, mentioning the version of the package you tested, and change the tag from verification-needed to verification-done. If it does not fix the bug for you, please add a comment stating that, and change the tag to verification-failed. In either case, details of your testing will help us make a better decision.

Further information regarding the verification process can be found at https://wiki.ubuntu.com/QATeam/PerformingSRUVerification . Thank you in advance!

Changed in corosync (Ubuntu Precise):
status: In Progress → Fix Committed
tags: added: verification-needed
Rafael David Tinoco (inaddy) wrote :

Brian,

I've run several tests on this and everything works as expected. Changing tag.

Thanks

tags: added: verification-done
removed: verification-needed
Launchpad Janitor (janitor) wrote :

This bug was fixed in the package corosync - 1.4.2-2ubuntu0.2

---------------
corosync (1.4.2-2ubuntu0.2) precise; urgency=medium

  * Fixed consensus being empty in case failed_to_recv is set (LP: #1318441)
 -- Rafael David Tinoco <email address hidden> Mon, 12 May 2014 09:37:06 -0500

Changed in corosync (Ubuntu Precise):
status: Fix Committed → Fix Released

The verification of the Stable Release Update for corosync has completed successfully and the package has now been released to -updates. Subsequently, the Ubuntu Stable Release Updates Team is being unsubscribed and will not receive messages about this bug report. In the event that you encounter a regression using the package from -updates please report a new bug using ubuntu-bug and tag the bug report regression-update so we can easily find any regressions.

Changed in corosync (Ubuntu Precise):
assignee: Rafael David Tinoco (inaddy) → nobody