call ca_clear_event gives a seg fault

Bug #1877113 reported by Helge Brands on 2020-05-06
6
This bug affects 1 person
Affects Status Importance Assigned to Milestone
EPICS Base
Undecided
Unassigned

Bug Description

the segmentation fault happends when a client writes to the SYSRESET channel of an IOCSTATS - IOC. This kills the IOC and the call of ca_clear_event() of SYSRESET on the client gives a this error.

What can be seen too: is that all connected channels gets a clean disconnect except SYSRESET.

This happends with our 3.14.12 and with 7.0.3

gdb backtrace of caqtdm(epics 3.14.12):
#0 0x0000000000000000 in ?? ()
#1 0x00007fffee1d2f74 in ca_client_context::eliminateExcessiveSendBacklog (this=0x8f8f20, guard=..., chan=...) at ../ca_client_context.cpp:791
#2 0x00007fffee1d30d1 in eliminateExcessiveSendBacklog (pMon=<value optimized out>) at ../oldAccess.h:482
#3 ca_clear_subscription (pMon=<value optimized out>) at ../ca_client_context.cpp:761
#4 0x00007fffe79e80d6 in ClearMonitor (kData=<value optimized out>) at epicsSubs.c:937
#5 0x00007fffe79e39e5 in Epics3Plugin::pvClearMonitor (this=0x8f74e0, kData=0x7fffffffa600) at epics3_plugin.cpp:86
#6 0x00007ffff7d79f39 in CaQtDM_Lib::closeEvent (this=0x12016bb0, ce=<value optimized out>) at src/caqtdm_lib.cpp:6445
#7 0x00000030c67f5cbc in QWidget::event(QEvent*) () from /usr/lib64/libQtGui.so.4
#8 0x00000030c6b898eb in QMainWindow::event(QEvent*) () from /usr/lib64/libQtGui.so.4
#9 0x00000030c67aaa0c in QApplicationPrivate::notify_helper(QObject*, QEvent*) () from /usr/lib64/libQtGui.so.4
#10 0x00000030c67b094b in QApplication::notify(QObject*, QEvent*) () from /usr/lib64/libQtGui.so.4
#11 0x00000030bfb57d0c in QCoreApplication::notifyInternal(QObject*, QEvent*) () from /usr/lib64/libQtCore.so.4
#12 0x00000030c67fa711 in QWidgetPrivate::close_helper(QWidgetPrivate::CloseMode) () from /usr/lib64/libQtGui.so.4
#13 0x00000030c68134b6 in QApplication::x11ClientMessage(QWidget*, _XEvent*, bool) () from /usr/lib64/libQtGui.so.4
#14 0x00000030c68213c6 in QApplication::x11ProcessEvent(_XEvent*) () from /usr/lib64/libQtGui.so.4
#15 0x00000030c6849e92 in ?? () from /usr/lib64/libQtGui.so.4
#16 0x00000030b8a40555 in g_main_context_dispatch () from /lib64/libglib-2.0.so.0
#17 0x00000030b8a44d08 in ?? () from /lib64/libglib-2.0.so.0
#18 0x00000030b8a44ec2 in g_main_context_iteration () from /lib64/libglib-2.0.so.0
#19 0x00000030bfb7d613 in QEventDispatcherGlib::processEvents(QFlags<QEventLoop::ProcessEventsFlag>) () from /usr/lib64/libQtCore.so.4
#20 0x00000030c6849b5e in ?? () from /usr/lib64/libQtGui.so.4
#21 0x00000030bfb56752 in QEventLoop::processEvents(QFlags<QEventLoop::ProcessEventsFlag>) () from /usr/lib64/libQtCore.so.4
#22 0x00000030bfb56a1c in QEventLoop::exec(QFlags<QEventLoop::ProcessEventsFlag>) () from /usr/lib64/libQtCore.so.4
#23 0x00000030bfb58d09 in QCoreApplication::exec() () from /usr/lib64/libQtCore.so.4
#24 0x000000000040e132 in main ()

gdb backtrace of caqtdm(epics 7.0.3):
#0 0x0000000000000000 in ?? ()
#1 0x00007fffd75df200 in ca_client_context::eliminateExcessiveSendBacklog (this=0x9259c0, guard=..., chan=...) at ../ca_client_context.cpp:785
#2 0x00007fffd75c2949 in eliminateExcessiveSendBacklog (this=0xc55a98, this=0xc55a98, guard=...) at ../oldAccess.h:481
#3 ca_clear_channel (pChan=0xc55a98) at ../access.cpp:371
#4 0x00007fffdc0c590d in ClearMonitor (kData=<optimized out>) at epicsSubs.c:942
#5 0x00007fffdc0c0a3f in Epics3Plugin::pvClearMonitor (this=0x89ba50, kData=0x7fffffffa700) at epics3_plugin.cpp:87
#6 0x00007ffff7b4e531 in CaQtDM_Lib::closeEvent (this=0x1049ff0, ce=<optimized out>) at src/caqtdm_lib.cpp:6445
#7 0x00007ffff5eb5db4 in QWidget::event(QEvent*) () from /lib64/libQtGui.so.4
#8 0x00007ffff62870d3 in QMainWindow::event(QEvent*) () from /lib64/libQtGui.so.4
#9 0x00007ffff5e600ec in QApplicationPrivate::notify_helper(QObject*, QEvent*) () from /lib64/libQtGui.so.4
#10 0x00007ffff5e669c5 in QApplication::notify(QObject*, QEvent*) () from /lib64/libQtGui.so.4
#11 0x00007ffff592e87d in QCoreApplication::notifyInternal(QObject*, QEvent*) () from /lib64/libQtCore.so.4
#12 0x00007ffff5eb1fd9 in QWidgetPrivate::close_helper(QWidgetPrivate::CloseMode) () from /lib64/libQtGui.so.4
#13 0x00007ffff5ed0e32 in QApplication::x11ClientMessage(QWidget*, _XEvent*, bool) () from /lib64/libQtGui.so.4
#14 0x00007ffff5edd98f in QApplication::x11ProcessEvent(_XEvent*) () from /lib64/libQtGui.so.4
#15 0x00007ffff5f04f34 in x11EventSourceDispatch(_GSource*, int (*)(void*), void*) () from /lib64/libQtGui.so.4
#16 0x00007ffff1f3f049 in g_main_context_dispatch () from /lib64/libglib-2.0.so.0
#17 0x00007ffff1f3f3a8 in g_main_context_iterate.isra.19 () from /lib64/libglib-2.0.so.0
#18 0x00007ffff1f3f45c in g_main_context_iteration () from /lib64/libglib-2.0.so.0
#19 0x00007ffff595d3ae in QEventDispatcherGlib::processEvents(QFlags<QEventLoop::ProcessEventsFlag>) () from /lib64/libQtCore.so.4
#20 0x00007ffff5f050b6 in QGuiEventDispatcherGlib::processEvents(QFlags<QEventLoop::ProcessEventsFlag>) () from /lib64/libQtGui.so.4
#21 0x00007ffff592d30f in QEventLoop::processEvents(QFlags<QEventLoop::ProcessEventsFlag>) () from /lib64/libQtCore.so.4
#22 0x00007ffff592d65d in QEventLoop::exec(QFlags<QEventLoop::ProcessEventsFlag>) () from /lib64/libQtCore.so.4
#23 0x00007ffff5932d29 in QCoreApplication::exec() () from /lib64/libQtCore.so.4
#24 0x000000000040defd in main (argc=4, argv=<optimized out>) at src/caQtDM.cpp:390

mdavidsaver (mdavidsaver) wrote :

> #0 0x0000000000000000 in ?? ()

This looks like a jump to NULL, maybe a bad virtual call.

Can this fault be replicated?

If so, can you run with valgrind? This seems like the kind of fault which valgrind could identify.

Helge Brands (helge-brands) wrote :

Hi Michael,

valgrind only tells me that this is uninitialized pointer as you already said. But I guess somebody already thought about this problem: There is a note in cacIO.h about bad practice

The problem can replicated easily. The question is only what is the error in terms of:

1) why is the data structure corrupted (zeroed,deleted...)
2) why there is no callback for the disconnect
3) how this can be detected.

I tried an old EPICS (3.14.12) and a new one (7.0.3.2) in debug, but both are showing the same behavior.
Can you give me some hints where to set some breakpoints to find some problems. It look that this a sync problem between the put and the release of callbacks.

Best regards
     Helge

Ralph Lange (ralph-lange) wrote :

The difference between the SYSRESET and the other channels is that SYSRESET causes the server (IOC) to do an ungraceful exit between the put and the callback.

I.e., for the client library, the order of things happening is different. Only the SYSRESET case has the connection loss happening between the put operation and the reply coming back from the server.

mdavidsaver (mdavidsaver) wrote :

Helge, Can you include (or attach if too long) the entire valgrind message? It might help to narrow down which variable is undefined (eg. a struct member or a local).

Andrew Johnson (anj) wrote :

The SYSRESET PV from the iocStats module is a subroutine record. The rebootProc() code which the subroutine executes calls reboot(), which has a different implementation on each OS.

On VxWorks it actually calls the OS routine that tells the OS to reboot, probably rather violently with respect to the client. On RTEMS it becomes either bsp_reset() or rtemsReboot() depending on the specific target being used, which will probably be equally nasty.

Interestingly on Darwin (macOS) there is actually a reboot() system call which it uses, although unless you're running the IOC as root that won't actually succeed (and if you are running as root it may do some Very Bad Things to your OS – the reboot(2) manpage says that only the reboot(8) program should be using this system-call).

On the other workstation OSs running Base-3.15.1 or later it calls epicsExitLater(0) which should allow the CA server to respond to the client's put and properly shut down. On earlier Base releases that routine doesn't exist, so there it calls epicsExit(0) which probably triggers the same violent shut-down behavior as the embedded OSs.

The fact that you are seeing what appears to be the same behavior from both Base-3.14.12 and the as-yet-unreleased 7.0.3.2 suggests to me that your crash may be related to how your code is reacting to an exception from the CA client library. Do you have an exception callback registered with it? Might that code be trying to do something with a chid that has just been invalidated by the server disconnecting?

When I use caput to trigger the equivalent exit() subroutine in a softIoc running Base-3.14.12.3 I get this exception message:

tux$ caput anj:exit 1
Old : anj:exit 0
New : anj:exit 0
Read operation timed out: PV data was not read.
CA.Client.Exception...............................................
    Warning: "Virtual circuit disconnect"
    Context: "op=0, channel=anj:exit, type=DBR_TIME_DOUBLE, count=1, ctx="tux.aps.anl.gov:5064""
    Source File: ../getCopy.cpp line 92
    Current Time: Fri May 08 2020 15:22:50.617469900
..................................................................
Unexpected problem with CA circuit to server "tux.aps.anl.gov:5064" was "Connection reset by peer" - disconnecting

Can you also try doing the put from a *different* CA client (while still monitoring the IOC with this one as you were before), it might have a different effect.

Andrew Johnson (anj) wrote :

To demonstrate the difference between 3.14.12 and 7.0 this is what caput gives me if the IOC is running a 7.0.x version:

tux$ caput anj:exit 1
Old : anj:exit 0
New : anj:exit 1
tux$

Helge Brands (helge-brands) wrote :
Download full text (3.9 KiB)

put double/float for <vmpc1572-l7:SYSRESET> with data=1,000000 chid=858674064
Unexpected problem with CA circuit to server "vmpc1572-l7.psi.ch:5064" was "Die Verbindung wurde vom Kommunikationspartner zurückgesetzt" - disconnecting
ClearMonitor -- clear channel vmpc1572-l7:SYSRESET index=26
ca_clear_event: vmpc1572-l7:SYSRESET index=26 chid=858674064 evid=844784272
==3204== Invalid read of size 8
==3204== at 0x7C330C7: __dynamic_cast (in /usr/lib64/libstdc++.so.6.0.19)
==3204== by 0x223627C5: ca_client_context::eliminateExcessiveSendBacklog(epicsGuard<epicsMutex>&, cacChannel&) (ca_client_context.cpp:785)
==3204== by 0x22346D9C: oldChannelNotify::eliminateExcessiveSendBacklog(epicsGuard<epicsMutex>&) (oldAccess.h:481)
==3204== by 0x223625FE: ca_clear_subscription (ca_client_context.cpp:755)
==3204== by 0x223465CA: ca_clear_event (access.cpp:435)
==3204== by 0x220F9B72: ClearMonitor (epicsSubs.c:940)
==3204== by 0x220F4B7E: Epics3Plugin::pvClearMonitor(_knobData*) (epics3_plugin.cpp:87)
==3204== by 0x4E6B660: CaQtDM_Lib::closeEvent(QCloseEvent*) (caqtdm_lib.cpp:6463)
==3204== by 0x64B6DB3: QWidget::event(QEvent*) (in /usr/lib64/libQtGui.so.4.8.7)
==3204== by 0x68880D2: QMainWindow::event(QEvent*) (in /usr/lib64/libQtGui.so.4.8.7)
==3204== by 0x64610EB: QApplicationPrivate::notify_helper(QObject*, QEvent*) (in /usr/lib64/libQtGui.so.4.8.7)
==3204== by 0x64679C4: QApplication::notify(QObject*, QEvent*) (in /usr/lib64/libQtGui.so.4.8.7)
==3204== Address 0x6001433304299 is not stack'd, malloc'd or (recently) free'd
==3204==
==3204==
==3204== Process terminating with default action of signal 11 (SIGSEGV)
==3204== General Protection Fault
==3204== at 0x7C330C7: __dynamic_cast (in /usr/lib64/libstdc++.so.6.0.19)
==3204== by 0x223627C5: ca_client_context::eliminateExcessiveSendBacklog(epicsGuard<epicsMutex>&, cacChannel&) (ca_client_context.cpp:785)
==3204== by 0x22346D9C: oldChannelNotify::eliminateExcessiveSendBacklog(epicsGuard<epicsMutex>&) (oldAccess.h:481)
==3204== by 0x223625FE: ca_clear_subscription (ca_client_context.cpp:755)
==3204== by 0x223465CA: ca_clear_event (access.cpp:435)
==3204== by 0x220F9B72: ClearMonitor (epicsSubs.c:940)
==3204== by 0x220F4B7E: Epics3Plugin::pvClearMonitor(_knobData*) (epics3_plugin.cpp:87)
==3204== by 0x4E6B660: CaQtDM_Lib::closeEvent(QCloseEvent*) (caqtdm_lib.cpp:6463)
==3204== by 0x64B6DB3: QWidget::event(QEvent*) (in /usr/lib64/libQtGui.so.4.8.7)
==3204== by 0x68880D2: QMainWindow::event(QEvent*) (in /usr/lib64/libQtGui.so.4.8.7)
==3204== by 0x64610EB: QApplicationPrivate::notify_helper(QObject*, QEvent*) (in /usr/lib64/libQtGui.so.4.8.7)
==3204== by 0x64679C4: QApplication::notify(QObject*, QEvent*) (in /usr/lib64/libQtGui.so.4.8.7)
==3204==
==3204== HEAP SUMMARY:
==3204== in use at exit: 10,702,974 bytes in 93,463 blocks
==3204== total heap usage: 586,990 allocs, 493,488 frees, 72,137,816 bytes allocated
==3204==
==3204== LEAK SUMMARY:
==3204== definitely lost: 67,919 bytes in 114 blocks
==3204== indirectly lost: 415,735 bytes in 17,223 blocks
==3204== possibly lost: 519,210 bytes in 4,022 blocks
==3204== stil...

Read more...

mdavidsaver (mdavidsaver) wrote :

I tried unsuccessfully to replicate on Linux with the cothread python client and a soft ioc.

Based on the valgrind message, this isn't as straightforward as *NULL. Nor does it seem as simple as calling ca_clear_subscription() twice for the same 'evid'. However, it might be the case that ca_clear_channel() has already been called, which invalidates any subscriptions.

mdavidsaver (mdavidsaver) wrote :

My test script. To show that a subscription is created.

import epicscorelibs.path.cothread
from cothread import Event
from cothread.catools import caput, camonitor

PV='LCam1::SYSRESET'

evt = Event()
def show(S, evt=evt):
    print("Update", S)
    evt.Signal(S)

S = camonitor(PV, show)

evt.Wait() # wait for initial update
print('kill')

caput(PV, 1)
S.close()
evt.Wait(1.0) # will timeout

Helge Brands (helge-brands) wrote :

here is a modified caput.c together with a IOCStats IOC you can replicate it, hopefully. Sorry for some quick and dirty things.

mdavidsaver (mdavidsaver) wrote :

I gave this a quick try and find that it crashes at caput.c:627 with a straightforward "ca_clear_subscription(NULL)". This is due to line 446 being commented out.

> //result = ca_add_array_event(dbrType,0,pvs[0].chid,data_event_handler,&pvs[0],0.0,0.0,0.0,&pEventID);

If I uncomment this line, the process completes with code 1. This is in part because the duplicate ca_clear_event() in connect_event_handler() isn't reached. If it were reached, I think the result would be a crash similar to what you report.

Andrew Johnson (anj) wrote :

I also uncommented this line
    result = ca_add_array_event(...,&pEventID);
and (on MacOS) get no issues from running caput against a 7.0 IOC:

woz$ caput anj:exit 1
MODIFIED MODIFIED MODIFIED MODIFIED MODIFIED MODIFIED MODIFIED MODIFIED MODIFIED
Channel: anj:exit
ca_create_channel: Normal successful completion
connect_event_handler
anj:exit has just connected with channel id=0x104259000 count=1 native type=DBF_DOUBLE
channel connected
Old : data_event_handler
anj:exit 0
channel connected
New : data_event_handler
anj:exit 1

I tried removing the call to ca_clear_event() from the connect_event_handler() (although it's never actually called there since the associated message doesn't appear in the above output) and adding a second call to ca_clear_event() in main() and I still get no crash.

Helge Brands (helge-brands) wrote :

yes your both right. I will continue to try

Helge Brands (helge-brands) wrote :

@Michael: I found the pitfall in my code (macro construct for a cleanup) you are right with : I called ca_clear_channel before ca_clear_event. Sorry for the late understanding

Andrew Johnson (anj) on 2020-05-11
Changed in epics-base:
status: New → Invalid
To post a comment you must log in.
This report contains Public information  Edit
Everyone can see this information.

Other bug subscribers

Bug attachments