Failed to sign on to LRMd with Heartbeat/Pacemaker

Bug #1251298 reported by Steffen Zieger
This bug affects 8 people
Affects                  Status        Importance  Assigned to  Milestone
cluster-glue (Ubuntu)    Fix Released  Undecided   Unassigned
Trusty                   Incomplete    Undecided   Unassigned

Bug Description

I'm running a two-node Heartbeat/Pacemaker cluster that was working fine on Ubuntu 13.04.
After upgrading from Ubuntu 13.04 to Ubuntu 13.10, Heartbeat/Pacemaker keeps restarting the system: crmd fails to sign on to lrmd, and Heartbeat tries to recover by rebooting.

Since one system is already on Ubuntu 13.10 and the other is still running 13.04, I also tried it without the second node. This leads to the same behavior, and it occurs before any cluster communication happens.

Syslog:
Nov 14 15:53:06 wolverine crmd[2464]: warning: do_lrm_control: Failed to sign on to the LRM 1 (30 max) times
Nov 14 15:53:06 wolverine crmd[2464]: notice: crmd_client_status_callback: Status update: Client wolverine.domain.tld/crmd now has status [join] (DC=false)
Nov 14 15:53:06 wolverine crmd[2464]: notice: crmd_client_status_callback: Status update: Client wolverine.domain.tld/crmd now has status [online] (DC=false)
Nov 14 15:53:06 wolverine crmd[2464]: warning: do_lrm_control: Failed to sign on to the LRM 2 (30 max) times
Nov 14 15:53:06 wolverine crmd[2464]: warning: do_lrm_control: Failed to sign on to the LRM 3 (30 max) times
Nov 14 15:53:07 wolverine stonith-ng[2462]: notice: setup_cib: Watching for stonith topology changes
Nov 14 15:53:07 wolverine stonith-ng[2462]: notice: unpack_config: On loss of CCM Quorum: Ignore
Nov 14 15:53:08 wolverine crmd[2464]: warning: do_lrm_control: Failed to sign on to the LRM 4 (30 max) times
Nov 14 15:53:10 wolverine crmd[2464]: warning: do_lrm_control: Failed to sign on to the LRM 5 (30 max) times
Nov 14 15:53:12 wolverine crmd[2464]: warning: do_lrm_control: Failed to sign on to the LRM 6 (30 max) times
Nov 14 15:53:14 wolverine crmd[2464]: warning: do_lrm_control: Failed to sign on to the LRM 7 (30 max) times
Nov 14 15:53:16 wolverine crmd[2464]: warning: do_lrm_control: Failed to sign on to the LRM 8 (30 max) times
Nov 14 15:53:18 wolverine crmd[2464]: warning: do_lrm_control: Failed to sign on to the LRM 9 (30 max) times
Nov 14 15:53:20 wolverine crmd[2464]: warning: do_lrm_control: Failed to sign on to the LRM 10 (30 max) times
Nov 14 15:53:22 wolverine crmd[2464]: warning: do_lrm_control: Failed to sign on to the LRM 11 (30 max) times
Nov 14 15:53:24 wolverine crmd[2464]: warning: do_lrm_control: Failed to sign on to the LRM 12 (30 max) times
Nov 14 15:53:26 wolverine crmd[2464]: warning: do_lrm_control: Failed to sign on to the LRM 13 (30 max) times
Nov 14 15:53:28 wolverine crmd[2464]: warning: do_lrm_control: Failed to sign on to the LRM 14 (30 max) times
Nov 14 15:53:30 wolverine crmd[2464]: warning: do_lrm_control: Failed to sign on to the LRM 15 (30 max) times
Nov 14 15:53:32 wolverine crmd[2464]: warning: do_lrm_control: Failed to sign on to the LRM 16 (30 max) times
Nov 14 15:53:34 wolverine crmd[2464]: warning: do_lrm_control: Failed to sign on to the LRM 17 (30 max) times
Nov 14 15:53:36 wolverine crmd[2464]: warning: do_lrm_control: Failed to sign on to the LRM 18 (30 max) times
Nov 14 15:53:38 wolverine crmd[2464]: warning: do_lrm_control: Failed to sign on to the LRM 19 (30 max) times
Nov 14 15:53:40 wolverine crmd[2464]: warning: do_lrm_control: Failed to sign on to the LRM 20 (30 max) times
Nov 14 15:53:42 wolverine crmd[2464]: warning: do_lrm_control: Failed to sign on to the LRM 21 (30 max) times
Nov 14 15:53:44 wolverine crmd[2464]: warning: do_lrm_control: Failed to sign on to the LRM 22 (30 max) times
Nov 14 15:53:46 wolverine crmd[2464]: warning: do_lrm_control: Failed to sign on to the LRM 23 (30 max) times
Nov 14 15:53:48 wolverine crmd[2464]: warning: do_lrm_control: Failed to sign on to the LRM 24 (30 max) times
Nov 14 15:53:50 wolverine crmd[2464]: warning: do_lrm_control: Failed to sign on to the LRM 25 (30 max) times
Nov 14 15:53:52 wolverine crmd[2464]: warning: do_lrm_control: Failed to sign on to the LRM 26 (30 max) times
Nov 14 15:53:54 wolverine crmd[2464]: warning: do_lrm_control: Failed to sign on to the LRM 27 (30 max) times
Nov 14 15:53:56 wolverine crmd[2464]: warning: do_lrm_control: Failed to sign on to the LRM 28 (30 max) times
Nov 14 15:53:58 wolverine crmd[2464]: warning: do_lrm_control: Failed to sign on to the LRM 29 (30 max) times
Nov 14 15:54:00 wolverine crmd[2464]: error: do_lrm_control: Failed to sign on to the LRM 30 (max) times
Nov 14 15:54:00 wolverine crmd[2464]: error: do_log: FSA: Input I_ERROR from do_lrm_control() received in state S_STARTING
Nov 14 15:54:00 wolverine crmd[2464]: notice: do_state_transition: State transition S_STARTING -> S_RECOVERY [ input=I_ERROR cause=C_FSA_INTERNAL origin=do_lrm_control ]
Nov 14 15:54:00 wolverine crmd[2464]: warning: do_recover: Fast-tracking shutdown in response to errors

Symlinking lrmd from the pacemaker package partly solved this problem:
root@wolverine ~ # mv /usr/lib/heartbeat/lrmd{,.cluster-glue}
root@wolverine ~ # cd /usr/lib/heartbeat/
root@wolverine /usr/lib/heartbeat # ln -s ../pacemaker/lrmd
root@wolverine /usr/lib/heartbeat # ls -la lrmd
lrwxrwxrwx 1 root root 17 Nov 14 16:35 lrmd -> ../pacemaker/lrmd
root@wolverine /usr/lib/heartbeat # ls -la lrmd*
lrwxrwxrwx 1 root root 17 Nov 14 16:35 lrmd -> ../pacemaker/lrmd
-rwxr-xr-x 1 root root 92816 Jul 18 17:55 lrmd.cluster-glue
root@wolverine /usr/lib/heartbeat #
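
A minimal sketch (not from the report itself): instead of the manual mv, the same swap can be made upgrade-proof with dpkg-divert, assuming the stock cluster-glue file layout and the paths shown above:

# Untested sketch: divert cluster-glue's lrmd aside and point the path at pacemaker's lrmd,
# so that a later cluster-glue upgrade does not put the old binary back.
dpkg-divert --local --rename --divert /usr/lib/heartbeat/lrmd.cluster-glue \
            --add /usr/lib/heartbeat/lrmd
ln -s /usr/lib/pacemaker/lrmd /usr/lib/heartbeat/lrmd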

Stopping heartbeat will still result in an unexpected reboot:
Nov 14 16:37:27 wolverine crmd[2259]: notice: process_lrm_event: LRM operation drbd-backup:1_notify_0 (call=45, rc=0, cib-update=0, confirmed=true) ok
Nov 14 16:37:28 wolverine crmd[2259]: notice: process_lrm_event: LRM operation drbd-rsyslog:1_notify_0 (call=48, rc=0, cib-update=0, confirmed=true) ok
Nov 14 16:37:28 wolverine heartbeat: [2238]: WARN: Client [crm_node] pid 2673 failed authorization [no default client auth]
Nov 14 16:37:28 wolverine heartbeat: [2238]: ERROR: api_process_registration_msg: cannot add client(crm_node)
Nov 14 16:37:28 wolverine attrd[2258]: notice: attrd_trigger_update: Sending flush op to all hosts for: master-drbd-backup:1 (10000)
Nov 14 16:37:28 wolverine attrd[2258]: notice: attrd_perform_update: Sent update 10: master-drbd-backup:1=10000
Nov 14 16:37:28 wolverine crmd[2259]: notice: process_lrm_event: LRM operation drbd-backup:1_monitor_31000 (call=50, rc=0, cib-update=17, confirmed=false) ok
Nov 14 16:37:29 wolverine heartbeat: [2238]: WARN: Client [crm_node] pid 2700 failed authorization [no default client auth]
Nov 14 16:37:29 wolverine heartbeat: [2238]: ERROR: api_process_registration_msg: cannot add client(crm_node)
Nov 14 16:37:29 wolverine attrd[2258]: notice: attrd_trigger_update: Sending flush op to all hosts for: master-drbd-rsyslog:1 (10000)
Nov 14 16:37:29 wolverine attrd[2258]: notice: attrd_perform_update: Sent update 13: master-drbd-rsyslog:1=10000
Nov 14 16:37:29 wolverine crmd[2259]: notice: process_lrm_event: LRM operation drbd-rsyslog:1_monitor_31000 (call=54, rc=0, cib-update=18, confirmed=false) ok
Nov 14 16:37:59 wolverine heartbeat: [2238]: WARN: Client [crm_node] pid 2812 failed authorization [no default client auth]
Nov 14 16:37:59 wolverine heartbeat: [2238]: ERROR: api_process_registration_msg: cannot add client(crm_node)
Nov 14 16:38:00 wolverine heartbeat: [2238]: WARN: Client [crm_node] pid 2839 failed authorization [no default client auth]
Nov 14 16:38:00 wolverine heartbeat: [2238]: ERROR: api_process_registration_msg: cannot add client(crm_node)
Nov 14 16:38:05 wolverine heartbeat: [2238]: info: killing /usr/lib/heartbeat/crmd process group 2259 with signal 15
Nov 14 16:38:05 wolverine crmd[2259]: notice: crm_shutdown: Requesting shutdown, upper limit is 1200000ms
Nov 14 16:38:05 wolverine attrd[2258]: notice: attrd_trigger_update: Sending flush op to all hosts for: shutdown (1384443485)
Nov 14 16:38:05 wolverine attrd[2258]: notice: attrd_perform_update: Sent update 16: shutdown=1384443485
Nov 14 16:38:06 wolverine crmd[2259]: notice: process_lrm_event: LRM operation drbd-backup:1_notify_0 (call=57, rc=0, cib-update=0, confirmed=true) ok
Nov 14 16:38:06 wolverine crmd[2259]: notice: process_lrm_event: LRM operation drbd-rsyslog:1_notify_0 (call=59, rc=0, cib-update=0, confirmed=true) ok
Nov 14 16:38:07 wolverine kernel: [ 255.385984] d-con backup: Requested state change failed by peer: Refusing to be Primary while peer is not outdated (-7)
Nov 14 16:38:07 wolverine kernel: [ 255.386415] d-con backup: peer( Primary -> Unknown ) conn( Connected -> Disconnecting ) disk( UpToDate -> Outdated ) pdsk( UpToDate -> DUnknown )
Nov 14 16:38:07 wolverine kernel: [ 255.386428] d-con backup: asender terminated
Nov 14 16:38:07 wolverine kernel: [ 255.386438] d-con backup: Terminating drbd_a_backup
Nov 14 16:38:07 wolverine kernel: [ 255.386693] d-con backup: Connection closed
Nov 14 16:38:07 wolverine kernel: [ 255.386716] d-con backup: conn( Disconnecting -> StandAlone )
Nov 14 16:38:07 wolverine kernel: [ 255.386718] d-con backup: receiver terminated
Nov 14 16:38:07 wolverine kernel: [ 255.386722] d-con backup: Terminating drbd_r_backup
Nov 14 16:38:07 wolverine kernel: [ 255.386750] block drbd0: disk( Outdated -> Failed )
Nov 14 16:38:07 wolverine kernel: [ 255.409861] block drbd0: bitmap WRITE of 0 pages took 0 jiffies
Nov 14 16:38:07 wolverine kernel: [ 255.409930] block drbd0: 0 KB (0 bits) marked out-of-sync by on disk bit-map.
Nov 14 16:38:07 wolverine kernel: [ 255.409943] block drbd0: disk( Failed -> Diskless )
Nov 14 16:38:07 wolverine kernel: [ 255.410041] block drbd0: drbd_bm_resize called with capacity == 0
Nov 14 16:38:07 wolverine kernel: [ 255.411773] d-con backup: Terminating drbd_w_backup
Nov 14 16:38:07 wolverine kernel: [ 255.466428] d-con rsyslog: Requested state change failed by peer: Refusing to be Primary while peer is not outdated (-7)
Nov 14 16:38:07 wolverine kernel: [ 255.466796] d-con rsyslog: peer( Primary -> Unknown ) conn( Connected -> Disconnecting ) disk( UpToDate -> Outdated ) pdsk( UpToDate -> DUnknown )
Nov 14 16:38:07 wolverine kernel: [ 255.466814] d-con rsyslog: asender terminated
Nov 14 16:38:07 wolverine kernel: [ 255.466832] d-con rsyslog: Terminating drbd_a_rsyslog
Nov 14 16:38:07 wolverine kernel: [ 255.467098] d-con rsyslog: Connection closed
Nov 14 16:38:07 wolverine kernel: [ 255.467121] d-con rsyslog: conn( Disconnecting -> StandAlone )
Nov 14 16:38:07 wolverine kernel: [ 255.467123] d-con rsyslog: receiver terminated
Nov 14 16:38:07 wolverine kernel: [ 255.467128] d-con rsyslog: Terminating drbd_r_rsyslog
Nov 14 16:38:07 wolverine kernel: [ 255.467169] block drbd1: disk( Outdated -> Failed )
Nov 14 16:38:07 wolverine kernel: [ 255.481716] block drbd1: bitmap WRITE of 0 pages took 0 jiffies
Nov 14 16:38:07 wolverine kernel: [ 255.481778] block drbd1: 0 KB (0 bits) marked out-of-sync by on disk bit-map.
Nov 14 16:38:07 wolverine kernel: [ 255.481791] block drbd1: disk( Failed -> Diskless )
Nov 14 16:38:07 wolverine kernel: [ 255.481881] block drbd1: drbd_bm_resize called with capacity == 0
Nov 14 16:38:07 wolverine kernel: [ 255.482011] d-con rsyslog: Terminating drbd_w_rsyslog
Nov 14 16:38:07 wolverine heartbeat: [2238]: WARN: Client [crm_node] pid 2986 failed authorization [no default client auth]
Nov 14 16:38:07 wolverine heartbeat: [2238]: ERROR: api_process_registration_msg: cannot add client(crm_node)
Nov 14 16:38:07 wolverine heartbeat: [2238]: WARN: Client [crm_node] pid 2989 failed authorization [no default client auth]
Nov 14 16:38:07 wolverine heartbeat: [2238]: ERROR: api_process_registration_msg: cannot add client(crm_node)
Nov 14 16:38:07 wolverine attrd[2258]: notice: attrd_trigger_update: Sending flush op to all hosts for: master-drbd-backup:1 (<null>)
Nov 14 16:38:07 wolverine attrd[2258]: notice: attrd_perform_update: Sent delete 18: node=19f64c15-2545-4b18-8d1a-39d9c3a88a56, attr=master-drbd-backup:1, id=<n/a>, set=(null), section=status
Nov 14 16:38:07 wolverine crmd[2259]: notice: process_lrm_event: LRM operation drbd-backup:1_stop_0 (call=64, rc=0, cib-update=19, confirmed=true) ok
Nov 14 16:38:07 wolverine attrd[2258]: notice: attrd_trigger_update: Sending flush op to all hosts for: master-drbd-rsyslog:1 (<null>)
Nov 14 16:38:07 wolverine attrd[2258]: notice: attrd_perform_update: Sent delete 20: node=19f64c15-2545-4b18-8d1a-39d9c3a88a56, attr=master-drbd-rsyslog:1, id=<n/a>, set=(null), section=status
Nov 14 16:38:07 wolverine crmd[2259]: notice: process_lrm_event: LRM operation drbd-rsyslog:1_stop_0 (call=67, rc=0, cib-update=20, confirmed=true) ok
Nov 14 16:38:07 wolverine attrd[2258]: notice: attrd_perform_update: Sent delete 22: node=19f64c15-2545-4b18-8d1a-39d9c3a88a56, attr=master-drbd-backup:1, id=<n/a>, set=(null), section=status
Nov 14 16:38:07 wolverine attrd[2258]: notice: attrd_perform_update: Sent delete 24: node=19f64c15-2545-4b18-8d1a-39d9c3a88a56, attr=master-drbd-rsyslog:1, id=<n/a>, set=(null), section=status
Nov 14 16:38:08 wolverine crmd[2259]: notice: do_state_transition: State transition S_NOT_DC -> S_STOPPING [ input=I_STOP cause=C_HA_MESSAGE origin=route_message ]
Nov 14 16:38:08 wolverine crmd[2259]: notice: lrm_state_verify_stopped: Stopped 0 recurring operations at (null) (1274719202 ops remaining)
Nov 14 16:38:08 wolverine crmd[2259]: notice: do_lrm_control: Disconnected from the LRM
Nov 14 16:38:08 wolverine ccm: [2254]: info: client (pid=2259) removed from ccm
Nov 14 16:38:08 wolverine heartbeat: [2238]: EMERG: Rebooting system. Reason: /usr/lib/heartbeat/crmd

root@wolverine ~ # lsb_release -rd
Description: Ubuntu 13.10
Release: 13.10

root@wolverine ~ # apt-cache policy cluster-glue
cluster-glue:
  Installed: 1.0.11+hg2754-1.1
  Candidate: 1.0.11+hg2754-1.1
  Version table:
 *** 1.0.11+hg2754-1.1 0
        500 http://de.archive.ubuntu.com/ubuntu/ saucy/main amd64 Packages
        100 /var/lib/dpkg/status
root@wolverine ~ #

root@wolverine ~ # apt-cache policy heartbeat
heartbeat:
  Installed: 1:3.0.5-3.1ubuntu1
  Candidate: 1:3.0.5-3.1ubuntu1
  Version table:
 *** 1:3.0.5-3.1ubuntu1 0
        500 http://de.archive.ubuntu.com/ubuntu/ saucy/main amd64 Packages
        100 /var/lib/dpkg/status
root@wolverine ~ #

root@wolverine ~ # apt-cache policy pacemaker
pacemaker:
  Installed: 1.1.10+git20130802-1ubuntu1
  Candidate: 1.1.10+git20130802-1ubuntu1
  Version table:
 *** 1.1.10+git20130802-1ubuntu1 0
        500 http://de.archive.ubuntu.com/ubuntu/ saucy/main amd64 Packages
        100 /var/lib/dpkg/status
root@wolverine ~ #

Expected:
- Working Heartbeat/Pacemaker setup after the Ubuntu upgrade

What happened:
- System reboots after about one minute due to Heartbeat's recovery attempts

Tags: ubuntu-ha
Launchpad Janitor (janitor) wrote :

Status changed to 'Confirmed' because the bug affects multiple users.

Changed in cluster-glue (Ubuntu):
status: New → Confirmed
Andres Rodriguez (andreserl) wrote :

This might be similar to [1]. I'll investigate.

http://www.gossamer-threads.com/lists/linuxha/pacemaker/85946

Changed in cluster-glue (Ubuntu):
assignee: nobody → Andres Rodriguez (andreserl)
Steffen Zieger (saz) wrote :

[1] describes my solution with the symlink.

--

[1] http://www.gossamer-threads.com/lists/linuxha/pacemaker/86023#86023

Jiri Hoogeveen (wica128) wrote :

The symlink works:
mv /usr/lib/heartbeat/lrmd /usr/lib/heartbeat/lrmd-dist
ln -sf /usr/lib/pacemaker/lrmd /usr/lib/heartbeat/lrmd

Jiri Hoogeveen (wica128) wrote :

When will this be fixed? Or do I really have to use CentOS or something like that?

Devin Vance (devin-p) wrote :

Bug is still present in 14.04 and the cluster-glue 1.0.11+hg2754-1.1build1 package.

I am fairly certain this is the same issue mentioned earlier on the mailing list:
http://www.gossamer-threads.com/lists/linuxha/pacemaker/85946

Replacing /usr/lib/heartbeat/lrmd with a symlink to /usr/lib/pacemaker/lrmd seems to resolve the issue, as mentioned earlier in the thread.
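
A quick sanity check after applying the symlink (a sketch, not taken from this thread) is to confirm that heartbeat now actually spawns pacemaker's lrmd:

# Verify the symlink target, restart heartbeat, then check which binary the running lrmd really is
readlink -f /usr/lib/heartbeat/lrmd      # should print /usr/lib/pacemaker/lrmd
service heartbeat restart
readlink /proc/$(pidof lrmd)/exe         # should also resolve to /usr/lib/pacemaker/lrmd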

Changed in cluster-glue (Ubuntu):
assignee: Andres Rodriguez (andreserl) → nobody
Erik Sornes (ejs-m) wrote :

This is still an issue in Ubuntu 14.0.2 with full updates as of 12 Aug 2015.
The symlink resolves the problem, but why not fix the package?

Erik Sornes (ejs-m) wrote :

Sorry, Ubuntu 14.04.

John Leach (johnleach) wrote :

An easier workaround than setting up your own symlinks is to tell heartbeat to use the lrmd binary provided by the pacemaker package directly.

In /etc/heartbeat/ha.cf just add the line:

respawn root /usr/lib/pacemaker/lrmd

The package still needs fixing, though!
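
For reference, a minimal sketch of applying this on one node, assuming the ha.cf path given above and that heartbeat can simply be restarted here:

# Append the respawn directive suggested above and restart heartbeat
echo 'respawn root /usr/lib/pacemaker/lrmd' >> /etc/heartbeat/ha.cf
service heartbeat restart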

Changed in cluster-glue (Ubuntu):
assignee: nobody → Rafael David Tinoco (rafaeldtinoco)
tags: added: ubuntu-ha
Rafael David Tinoco (rafaeldtinoco) wrote :

Thank you for taking the time to report this bug. In an effort to keep an
up-to-date and valid list of bugs to work on, I have reviewed this report
to verify it still requires effort and occurs on an Ubuntu release in
standard support, and it does not.

Judging by the existing comments and the upstream discussion thread linked above, it
appears that the wrong lrmd was being used after the upgrade, which confused the existing
cluster when it connected to the local resource manager daemon.

It is unfortunate that we were unable to resolve this defect, however
there appears to be no further action possible at this time. I am
therefore moving the bug to 'Incomplete'. If you disagree or have
new information, we would be grateful if you could please add a comment
stating why and then change the status of the bug to 'New'.

Changed in cluster-glue (Ubuntu):
status: Confirmed → Incomplete
Changed in cluster-glue (Ubuntu Trusty):
status: New → Incomplete
Changed in cluster-glue (Ubuntu):
status: Incomplete → Fix Released
assignee: Rafael David Tinoco (rafaeldtinoco) → nobody