ClusterMon resource creation core-dumps while created with extra_option -E

Bug #1848834 reported by Manisha
This bug affects 1 person
Affects              Status         Importance   Assigned to   Milestone
pacemaker (Ubuntu)   Fix Released   Undecided    Unassigned
Xenial               Triaged        Low          Unassigned

Bug Description

Hi,
I have a 2-node cluster with a number of resources working fine.
I am using Ubuntu 16.04 with Pacemaker 1.1.14.
The moment I create a ClusterMon resource with extra_options "-E" to run a script, crm_mon crashes and I see the following in dmesg:

[73880.444953] crm_mon[20739]: segfault at 0 ip 00007f948cc5c746 sp 00007ffed0cb0fb8 error 4 in libc-2.23.so[7f948cbd1000+1c0000]

I am using the following command to create the resource:
pcs resource create newRes ClusterMon user="root" extra_options="-E /usr/local/bin/new.sh "
OR
pcs resource create newRes ocf:pacemaker:ClusterMon user="root" extra_options="-E /usr/local/bin/new.sh "

and immediately I see the following in /var/log/messages:

2019-10-19T01:53:11.783763-04:00 master daemon notice crmd 17042 notice: Operation newRes_monitor_0: not running (node=master.dhcp, call=85, rc=7, cib-update=58, confirmed=true)
2019-10-19T01:53:12.097529-04:00 master daemon info systemd - Started Session c75 of user root.
2019-10-19T01:53:12.105468-04:00 master auth info systemd-logind - New session c75 of user root.
2019-10-19T01:53:12.150340-04:00 master daemon notice lrmd 17039 notice: newRes_start_0:30376:stderr [ mesg: ttyname failed: Inappropriate ioctl for device ]
2019-10-19T01:53:12.186340-04:00 master daemon notice crmd 17042 notice: Operation newRes_start_0: ok (node=master.dhcp, call=86, rc=0, cib-update=59, confirmed=true)
2019-10-19T01:53:12.195312-04:00 master kern info kernel - crm_mon[30398]: segfault at 0 ip 00007f9cfbe41746 sp 00007ffd971060e8 error 4 in libc-2.23.so[7f9cfbdb6000+1c0000]
2019-10-19T01:53:12.216644-04:00 master auth info systemd-logind - Removed session c75.
2019-10-19T01:53:12.241439-04:00 master daemon notice lrmd 17039 notice: newRes_monitor_10000:30406:stderr [ /usr/lib/ocf/resource.d/heartbeat/ClusterMon: 155: kill: No such process ]
2019-10-19T01:53:12.241980-04:00 master daemon notice lrmd 17039 notice: newRes_monitor_10000:30406:stderr [ ]
2019-10-19T01:53:12.245273-04:00 master daemon notice crmd 17042 notice: master.dhcp-newRes_monitor_10000:87 [ /usr/lib/ocf/resource.d/heartbeat/ClusterMon: 155: kill: No such process\n\n ]
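
To pick the crash lines out of a log like the one above, a small filter can help. This is a minimal sketch, assuming the kernel message layout seen in this report (`crm_mon[PID]: segfault at ADDR ...`); the function name `crm_mon_segfaults` is made up for illustration. A fault address of 0 usually points at a NULL pointer dereference.

```shell
#!/bin/sh
# Print "pid=<pid> addr=<fault address>" for every crm_mon segfault line on stdin.
# Assumes the kernel message layout seen in this report:
#   crm_mon[PID]: segfault at ADDR ip ... error N in libc-...
# Lines that do not match are silently dropped.
crm_mon_segfaults() {
    sed -n 's/.*crm_mon\[\([0-9][0-9]*\)\]: segfault at \([0-9a-f][0-9a-f]*\) .*/pid=\1 addr=\2/p'
}

# Typical use (not executed here):
#   dmesg | crm_mon_segfaults
#   crm_mon_segfaults < /var/log/messages
```

Both the dmesg line and the /var/log/messages line quoted in this report match the pattern, so either source can be piped in.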

Note:
- All other types of resources (IPAddr, Drbd, systemd) are working fine.
- Also, if newRes is created without -E, it works fine.
- The script has no complicated code. Even without the "echo" command I am seeing the same issue.
cat /usr/local/bin/new.sh
#!/bin/sh
echo "HELLO from Crm_mon script" >> /var/log/messages
exit
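
For anyone wanting the agent script to record what crm_mon actually hands it, here is a slightly more informative sketch. It assumes crm_mon exports `CRM_notify_*` environment variables to the external agent, as described in the Pacemaker 1.1 documentation; this was not checked against 1.1.14 in this report. The function names and the default log path are hypothetical.

```shell
#!/bin/sh
# Sketch of an external agent for crm_mon -E.
# Assumption: crm_mon exports CRM_notify_* variables to the agent, as
# described in the Pacemaker 1.1 documentation; verify on your version.

# Build one log line from the environment, with placeholders for unset values.
format_event() {
    printf 'crm_mon event: node=%s rsc=%s task=%s rc=%s recipient=%s\n' \
        "${CRM_notify_node:-unknown}" \
        "${CRM_notify_rsc:-unknown}" \
        "${CRM_notify_task:-unknown}" \
        "${CRM_notify_rc:-unknown}" \
        "${CRM_notify_recipient:-none}"
}

# Append the line to a log file.
# Hypothetical default path; pass a different one as the first argument.
log_event() {
    format_event >> "${1:-/tmp/crm_mon_events.log}"
}

# In a real agent, the script would end with a call such as:
#   log_event /var/log/crm_mon_events.log
```

Writing to a dedicated file rather than /var/log/messages avoids interleaving with syslog's own writes.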

Manisha (mk-ubuntu2019)
summary: - ClusterMon resource core-dumps while created with extr_option -E
+ ClusterMon resource creation core-dumps while created with extra_option
+ -E
Manisha (mk-ubuntu2019)
affects: corosync (Ubuntu) → pacemaker (Ubuntu)
Paride Legovini (paride) wrote :

Thank you for taking the time to report this issue. Unfortunately I couldn't reproduce the coredump following the steps you provided. What I did is the following:

1. Started an amd64 Xenial virtual machine using multipass
2. Installed pacemaker and pcs
3. Created /usr/local/bin/new.sh identical to yours, and
   marked it as executable
4. Ran: pcs resource create newRes ClusterMon user="root" \
        extra_options="-E /usr/local/bin/new.sh "
5. Ran: pcs resource create newRes ocf:pacemaker:ClusterMon \
        user="root" extra_options="-E /usr/local/bin/new.sh"
6. Checked the kernel messages.

In both cases no coredumps happened. Without a reproducer there isn't enough information for a developer to confirm this issue is a bug, or to begin working on it. Is it possible for you to provide complete steps to reproduce it from scratch on a clean system? You could use multipass to conveniently start a VM, or LXD to start a container (the *host* kernel is used in this case).

I'm marking this report as Incomplete for the moment; please change its status back to New after providing more information. Thanks!

Changed in pacemaker (Ubuntu):
status: New → Incomplete
Manisha (mk-ubuntu2019) wrote :

Hi Paride,
Thanks for the reply.

The issue was resolved by creating the resource with the -e <mail_id> option, i.e.:
pcs resource create newRes ClusterMon user="root" extra_options="-E /usr/local/bin/new.sh -e <mail_id> "

The catch is that pacemaker version 1.1.14 does not report any error when "pcs resource create" is run without the -e option, but /var/log/messages shows that crm_mon segfaults.

However, in pacemaker version 1.1.18 the create command without -e <mail_id> works fine.
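
The workaround can be sketched as a small helper that refuses to build an extra_options value unless both -E and -e are supplied. This is only an illustration: the function names are made up, `admin@example.com` is a placeholder recipient, and the command is printed rather than executed. The crm_mon 1.1 man page appears to list -e as --external-recipient; treat that as something to verify on your system.

```shell
#!/bin/sh
# Build the extra_options value for ClusterMon.
# $1: path to the external agent (crm_mon -E)
# $2: recipient passed with -e; required here because omitting it is what
#     coincided with the segfault on 1.1.14 in this report.
build_extra_options() {
    if [ -z "$1" ] || [ -z "$2" ]; then
        echo "usage: build_extra_options AGENT RECIPIENT" >&2
        return 1
    fi
    printf '%s\n' "-E $1 -e $2"
}

# Print (do not run) the pcs command so it can be reviewed first.
# $1: resource name, $2: agent path, $3: recipient
print_create_command() {
    opts=$(build_extra_options "$2" "$3") || return 1
    printf 'pcs resource create %s ClusterMon user="root" extra_options="%s"\n' "$1" "$opts"
}

# Example (placeholder recipient):
#   print_create_command newRes /usr/local/bin/new.sh admin@example.com
```

Printing the command instead of running it keeps the sketch safe to try on a live cluster node.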

Did you try the steps in Ubuntu 16.04 and pacemaker 1.1.14?

Regards,
Manisha.

Changed in pacemaker (Ubuntu):
status: Incomplete → New
Christian Ehrhardt  (paelzer) wrote :

Yeah he was running on 16.04 which means he was using 1.1.14-2ubuntu1.6

Neither could I reproduce the crashes :-/

But I'm glad to hear that for you it was resolved after using -e.
And even more glad that you found newer versions working better for you.

I hope this bug helps other users who run into this to find your workaround, but like Paride I don't see a crash to debug and fix in the package yet.

Changed in pacemaker (Ubuntu Xenial):
status: New → Incomplete
Changed in pacemaker (Ubuntu):
status: New → Fix Released
Andreas Hasenack (ahasenack) wrote :

Good to know. I also couldn't reproduce it just by using those commands; some more setup is probably needed. If we get reproducible steps, we might be able to identify the commit (or series of commits) that fixed it after 1.1.14. I took a quick look at the ChangeLog file but failed to spot anything obvious.

Manisha (mk-ubuntu2019) wrote :

Hello everyone,
Thanks for the response.
On my setup, I can reproduce it every time.
It is a 2-node cluster running Ubuntu 16.04.
I downloaded pcs, corosync and pacemaker from the xenial repo, in the following revisions:

pacemaker_1.1.14-2ubuntu1.6_amd64.deb
pacemaker-common_1.1.14-2ubuntu1.6_all.deb
pacemaker-cli-utils_1.1.14-2ubuntu1.6_amd64.deb
pacemaker-resource-agents_1.1.14-2ubuntu1.6_all.deb
corosync_2.3.5-3ubuntu2.3_amd64.deb
pcs_0.9.149-1ubuntu1.1_amd64.deb
resource-agents_1%3a3.9.7-1ubuntu1.1_amd64.deb

Ran the following steps:

pcs resource create newRes ocf:pacemaker:ClusterMon user="root" update="30" extra_options="-E /opt/nec/vcpe/scripts/db_clustermon_helper.sh" op monitor on-fail="restart" interval="60"

And I see the following in /var/log/messages:

2019-11-07T00:03:36.384243-05:00 master daemon notice lrmd 1562 notice: newRes_start_0:29654:stderr [ mesg: ttyname failed: Inappropriate ioctl for device ]
2019-11-07T00:03:36.390797-05:00 master daemon notice crmd 1565 notice: Operation newRes_start_0: ok (node=master.clusterdb.cwp, call=33, rc=0, cib-update=22, confirmed=true)
2019-11-07T00:03:36.397064-05:00 master kern warning kernel - show_signal_msg: 4 callbacks suppressed
2019-11-07T00:03:36.397241-05:00 master kern info kernel - crm_mon[29667]: segfault at 0 ip 00007f68b8eb5746 sp 00007ffea7ea4ce8 error 4 in libc-2.23.so[7f68b8e2a000+1c0000]

And for pcs status:
root@master:~/autoci# pcs status | grep newRes
 newRes (ocf::pacemaker:ClusterMon): Stopped
* newRes_monitor_60000 on slave.clusterdb.cwp 'not running' (7): call=90, status=complete, exitreason='none',
* newRes_monitor_60000 on master.clusterdb.cwp 'not running' (7): call=34, status=complete, exitreason='none',
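
When comparing nodes, it can help to reduce the failed-action lines to operation, node and return code. The following is a rough sketch that assumes the line layout shown in the `pcs status` output above (`* OP on NODE 'reason' (RC): ...`); the function name `failed_actions` is hypothetical, and other pcs versions may format these lines differently.

```shell
#!/bin/sh
# Reduce pcs status failed-action lines on stdin to "operation node rc".
# Assumes the layout seen in this report:
#   * newRes_monitor_60000 on NODE 'not running' (7): call=..., ...
failed_actions() {
    awk '$1 == "*" && $3 == "on" {
        rc = $0
        sub(/^[^(]*\(/, "", rc)   # drop everything up to the first "("
        sub(/\).*/, "", rc)       # drop everything from the first ")" onwards
        print $2, $4, rc
    }'
}

# Typical use (not executed here):
#   pcs status | failed_actions
```

Return code 7 is what the logs in this report label "not running".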

Thanks,
Manisha.

Christian Ehrhardt  (paelzer) wrote :

Thanks for the further feedback, still can't reproduce it on my side though :-/

Since you found your workaround (-e) and it is fixed in later versions, I'd call this low prio, as it could end up in a rather long debug session with unclear outcome.

Changed in pacemaker (Ubuntu Xenial):
importance: Undecided → Low
Changed in pacemaker (Ubuntu Xenial):
status: Incomplete → Triaged
assignee: nobody → Rafael David Tinoco (rafaeldtinoco)
Changed in pacemaker (Ubuntu Xenial):
assignee: Rafael David Tinoco (rafaeldtinoco) → nobody