Corosync reports itself "Started" too early

Bug #1586876 reported by guessi
This bug affects 1 person
Affects            Status        Importance  Assigned to  Milestone
corosync (Ubuntu)  Fix Released  Undecided   Unassigned
Trusty             Won't Fix     Undecided   Unassigned
Xenial             Won't Fix     Undecided   Unassigned
Bionic             Fix Released  Undecided   Unassigned
Disco              Won't Fix     Undecided   Unassigned
Eoan               Fix Released  Undecided   Unassigned
Focal              Fix Released  Undecided   Unassigned

Bug Description

Problem description:
Currently there is no service state check after start-stop-daemon in do_start(), so the init script can report corosync as started too early. If pacemaker starts before corosync has finished initializing, pacemaker assumes a 'heartbeat' based cluster, which is not what we want. The init script should verify that corosync has "really" started before reporting its state.

syslog with wrong state:
May 24 19:53:50 myhost corosync[1018]: [MAIN ] Corosync Cluster Engine ('1.4.2'): started and ready to provide service.
May 24 19:53:50 myhost corosync[1018]: [MAIN ] Corosync built-in features: nss
May 24 19:53:50 myhost corosync[1018]: [MAIN ] Successfully read main configuration file '/etc/corosync/corosync.conf'.
May 24 19:53:50 myhost corosync[1018]: [TOTEM ] Initializing transport (UDP/IP Unicast).
May 24 19:53:50 myhost corosync[1018]: [TOTEM ] Initializing transmit/receive security: libtomcrypt SOBER128/SHA1HMAC (mode 0).
May 24 19:53:50 myhost pacemakerd: [1094]: info: Invoked: pacemakerd
May 24 19:53:50 myhost pacemakerd: [1094]: info: crm_log_init_worker: Changed active directory to /var/lib/heartbeat/cores/root
May 24 19:53:50 myhost pacemakerd: [1094]: info: get_cluster_type: Assuming a 'heartbeat' based cluster
May 24 19:53:50 myhost pacemakerd: [1094]: info: read_config: Reading configure for stack: heartbeat

expected result:
May 24 21:45:02 myhost corosync[1021]: [MAIN ] Completed service synchronization, ready to provide service.
May 24 21:45:02 myhost pacemakerd: [1106]: info: Invoked: pacemakerd
May 24 21:45:02 myhost pacemakerd: [1106]: info: crm_log_init_worker: Changed active directory to /var/lib/heartbeat/cores/root
May 24 21:45:02 myhost pacemakerd: [1106]: info: config_find_next: Processing additional service options...
May 24 21:45:02 myhost pacemakerd: [1106]: info: get_config_opt: Found 'pacemaker' for option: name
May 24 21:45:02 myhost pacemakerd: [1106]: info: get_config_opt: Found '1' for option: ver
May 24 21:45:02 myhost pacemakerd: [1106]: info: get_cluster_type: Detected an active 'classic openais (with plugin)' cluster

Please note the order of the following two lines:
* corosync: [MAIN ] Completed service synchronization, ready to provide service.
* pacemakerd: info: get_cluster_type: ...

affected versions:
ALL (precise, trusty, vivid, wily, xenial, yakkety)

upstream solution: wait_for_ipc()
https://github.com/corosync/corosync/blob/master/init/corosync.in#L84-L99
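The upstream fix polls corosync's IPC layer after start-stop-daemon returns and only then reports success. A minimal sketch of that approach, assuming corosync-cfgtool is available; the helper name matches upstream, but the exact flags and timeout values here are illustrative:

wait_for_ipc() {
    # Poll corosync's IPC (corosync-cfgtool talks to it) until it answers,
    # giving up after roughly 30 seconds.
    try_count=0
    max_try=60
    while ! corosync-cfgtool -s >/dev/null 2>&1; do
        [ "$try_count" -ge "$max_try" ] && return 1
        sleep 0.5
        try_count=$((try_count + 1))
    done
    return 0
}

do_start() {
    start-stop-daemon --start --quiet --exec /usr/sbin/corosync || return 2
    # Only report "started" once the IPC layer is really up, so that
    # pacemakerd cannot race ahead and mis-detect the cluster stack.
    wait_for_ipc || return 2
}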

tags: added: corosync
tags: added: precise
removed: corosync
tags: added: corosync trusty vivid wily xenial yakkety
Revision history for this message
Ubuntu Foundations Team Bug Bot (crichton) wrote :

The attachment "precise.patch" seems to be a patch. If it isn't, please remove the "patch" flag from the attachment, remove the "patch" tag, and if you are a member of the ~ubuntu-reviewers, unsubscribe the team.

[This is an automated message performed by a Launchpad user owned by ~brian-murray, for any issues please contact him.]

tags: added: patch
guessi (guessi)
Changed in corosync (Ubuntu):
assignee: nobody → guessi (guessi)
status: New → In Progress
status: In Progress → Fix Committed
Revision history for this message
Rafael David Tinoco (rafaeldtinoco) wrote :

We have now migrated to systemd, and this bug is pretty old. Nevertheless, I would like to check again whether this condition could still happen with corosync started by systemd and/or the sysv generators (or whether any related conflict could exist). I will mark this as done as soon as I am sure (during this new Ubuntu HA work).

Changed in corosync (Ubuntu):
status: Fix Committed → In Progress
importance: Undecided → Medium
tags: added: ubuntu-ha
Changed in corosync (Ubuntu Focal):
assignee: guessi (guessi) → nobody
importance: Medium → Undecided
Changed in corosync (Ubuntu Trusty):
status: New → Won't Fix
Changed in corosync (Ubuntu Disco):
status: New → Won't Fix
Changed in corosync (Ubuntu Focal):
status: In Progress → Triaged
Changed in corosync (Ubuntu Eoan):
status: New → Triaged
Changed in corosync (Ubuntu Xenial):
status: New → Triaged
Changed in corosync (Ubuntu Bionic):
status: New → Triaged
Revision history for this message
Rafael David Tinoco (rafaeldtinoco) wrote :

From upstream documentation:

"""
Pacemaker used to obtain membership and quorum from a custom Corosync plugin. This plugin also had the capability to start Pacemaker automatically when Corosync was started. Neither behavior is possible with Corosync 2.0 and beyond as support for plugins was removed.

Instead, Pacemaker must be started as a separate job/initscript. Also, since Pacemaker made use of the plugin for message routing, a node using the plugin (Corosync prior to 2.0) cannot talk to one that isn’t (Corosync 2.0+).
Rolling upgrades between these versions are therefore not possible and an alternate strategy must be used.
"""

This shows that, since Ubuntu Trusty, this detection behavior is no longer supported. Nowadays we start both services separately, under systemd.

Corosync starts with a simple single-node (localhost) ring configured:

(c)rafaeldtinoco@clusterdev:~$ systemctl status corosync
● corosync.service - Corosync Cluster Engine
     Loaded: loaded (/lib/systemd/system/corosync.service; enabled; vendor preset: enabled)
     Active: active (running) since Thu 2020-03-19 20:16:49 UTC; 45min ago
       Docs: man:corosync
             man:corosync.conf
             man:corosync_overview
   Main PID: 851 (corosync)
      Tasks: 9 (limit: 23186)
     Memory: 125.9M
     CGroup: /system.slice/corosync.service
             └─851 /usr/sbin/corosync -f

(c)rafaeldtinoco@clusterdev:~$ sudo corosync-quorumtool
Quorum information
------------------
Date: Thu Mar 19 21:02:21 2020
Quorum provider: corosync_votequorum
Nodes: 1
Node ID: 1
Ring ID: 1.5
Quorate: Yes

Votequorum information
----------------------
Expected votes: 1
Highest expected: 1
Total votes: 1
Quorum: 1
Flags: Quorate

Membership information
----------------------
    Nodeid Votes Name
         1 1 node1 (local)

And systemd is responsible for guaranteeing the needed start-up synchronization.
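Specifically, the shipped unit uses systemd's readiness notification, so corosync.service only reaches "active (running)" once the engine signals that it is ready to serve. An illustrative excerpt (the packaged unit may differ in detail):

[Service]
Type=notify
# corosync -f stays in the foreground and calls sd_notify() once ready,
# so units ordered After=corosync.service (e.g. pacemaker) start only then.
ExecStart=/usr/sbin/corosync -f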

----

From pacemaker service unit:

...
After=corosync.service
Requires=corosync.service

...

# If you want Corosync to stop whenever Pacemaker is stopped,
# uncomment the next line too:
#
# ExecStopPost=/bin/sh -c 'pidof pacemaker-controld || killall -TERM corosync'

...

# Pacemaker will restart along with Corosync if Corosync is stopped while
# Pacemaker is running.
# In this case, if you want to be fenced always (if you do not want to restart)
# uncomment ExecStopPost below.
#
# ExecStopPost=/bin/sh -c 'pidof corosync || \
# /usr/bin/systemctl --no-block stop pacemaker'

So there are different options to control start, stop, and restart behavior according to corosync's status.
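For example, to enable the first commented-out behavior above without editing the packaged unit, a systemd drop-in can carry just that line (a sketch; the paths are the systemd defaults):

# sudo systemctl edit pacemaker.service
# This creates /etc/systemd/system/pacemaker.service.d/override.conf:
[Service]
ExecStopPost=/bin/sh -c 'pidof pacemaker-controld || killall -TERM corosync'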

Changed in corosync (Ubuntu Focal):
status: Triaged → Fix Released
Changed in corosync (Ubuntu Eoan):
status: Triaged → Fix Released
Changed in corosync (Ubuntu Bionic):
status: Triaged → Fix Released
Changed in corosync (Ubuntu Xenial):
status: Triaged → Won't Fix