Gateway sigsegv's when cleaning up channels using ca_clear_channel

Bug #1279147 reported by Murali Shankar
Affects: PV Gateway
Status: Invalid
Importance: Undecided
Assigned to: Unassigned

Bug Description

At LCLS, the archiver appliances connect to the IOCs through a CA gateway. The gateway crashes once in a while. This does not seem to be related to an “out-of-memory” issue or a “Gateway has been running for a long time” issue. Instead, it seems to be related to the gateway cleaning up PVs (SIGSEGV at Feb 07 04:42) from an IOC that is CPU-overloaded and keeps disconnecting (disconnect logged at Feb 07 02:41).

From the gateway logs...
>> Unexpected problem with CA circuit to server "eioc-und1-mp01.slac.stanford.edu:5068" was "Connection reset by peer" - disconnecting
>> Feb 07 02:21:23 Warning: Virtual circuit disconnect eioc-und1-mp01.slac.stanford.edu:5068

>> Feb 07 02:21:23 !!! Errlog message received (message is above)
>> Unexpected problem with CA circuit to server "eioc-und1-mp01.slac.stanford.edu:5068" was "Connection reset by peer" - disconnecting

>> Feb 07 02:41:49 !!! Errlog message received (message is above)
>> Feb 07 02:41:49 Warning: Virtual circuit disconnect eioc-und1-mp01.slac.stanford.edu:5068
>> Feb 07 04:42:32 PV Gateway Aborting (SIGSEGV)

I have core dumps and can examine the variables; the gateway is indeed trying to clean up the PVs from this IOC using ca_clear_channel. However, the crash occurs in a fundamental place in EPICS base (tsDLList.h:238). I can provide more details or the core files if needed.

Regards,
Murali

(gdb) bt
#0 0x0016c410 in __kernel_vsyscall ()
#1 0x0086de30 in raise () from /lib/libc.so.6
#2 0x0086f741 in abort () from /lib/libc.so.6
#3 0x080513a4 in sig_end (sig=11) at ../gateway.cc:300
#4 <signal handler called>
#5 0x0075a8c9 in remove (this=0xaf728260, guard=..., chan=...) at ../../../include/tsDLList.h:238
#6 tcpiiu::uninstallChan (this=0xaf728260, guard=..., chan=...) at ../tcpiiu.cpp:1981
#7 0x007512b7 in nciu::destroy (this=0x17e24b88, guard=...) at ../nciu.cpp:93
#8 0x00768347 in oldChannelNotify::destructor (this=0x17e179f0, guard=...) at ../oldChannelNotify.cpp:71
#9 0x00749039 in ca_clear_channel (pChan=0x17e179f0) at ../access.cpp:386
#10 0x080582e0 in gatePvData::~gatePvData (this=0x157f79b0, __in_chrg=<value optimized out>) at ../gatePv.cc:240
#11 0x08062064 in gatePvNode::destroy (this=0x1ca02110) at ../gateServer.h:69
#12 0x0805d6e7 in gateServer::inactiveDeadCleanup (this=0x925af40) at ../gateServer.cc:1490
#13 0x08060fc8 in gateServer::mainLoop (this=0x925af40) at ../gateServer.cc:285
#14 0x0804ef18 in startEverything (prefix=0xbfd7bbe2 "GWLCLSARCH") at ../gateway.cc:656
#15 0x080511a8 in main (argc=16, argv=0xbfd7b494) at ../gateway.cc:1299
……
(gdb) up
#4 <signal handler called>
(gdb) up
#5 0x0075a8c9 in remove (this=0xaf728260, guard=..., chan=...) at ../../../include/tsDLList.h:238
238 prevNode.pNext = theNode.pNext;
(gdb) print theNode
$1 = (tsDLNode<nciu> &) @0x17e24b98: {pNext = 0x17d44d68, pPrev = 0x0}
(gdb) up
#6 tcpiiu::uninstallChan (this=0xaf728260, guard=..., chan=...) at ../tcpiiu.cpp:1981
1981 this->createReqPend.remove ( chan );
(gdb) print chan
$2 = (nciu &) @0x17e24b88: {<cacChannel> = {_vptr.cacChannel = 0x781168, static priorityMax = 99, static priorityMin = 0, static priorityDefault = 0, static priorityLinksDB = 99,
    static priorityArchive = 49, static priorityOPI = 0, callback = @0x17e179f0}, <chronIntIdRes<nciu>> = {<chronIntId> = {<intId<unsigned int, 8u, 32u>> = {
        id = 833073}, <No data fields>}, <tsSLNode<nciu>> = {pNext = 0x0}, <No data fields>}, <channelNode> = {<tsDLNode<nciu>> = {pNext = 0x17d44d68, pPrev = 0x0},
    listMember = cs_createReqPend}, <privateInterfaceForIO> = {_vptr.privateInterfaceForIO = 0x7811d8}, eventq = {pFirst = 0x0, pLast = 0x0, itemCount = 0}, accessRightState = {
    f_readPermit = false, f_writePermit = false, f_operatorConfirmationRequest = false}, cacCtx = @0x925e2d8, pNameStr = 0x1c5838a8 "BLM:UND1:MP01:XILINX_CELS.LOW", piiu = 0xaf728260,
  sid = 4294967295, count = 0, retry = 1, nameLength = 30, typeCode = 65535, priority = 0 '\000'}
(gdb) quit
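
To illustrate what the dump above may be showing: line 238 writes through the node's pPrev pointer, and gdb reports pPrev = 0x0 with a non-null pNext. The sketch below is a simplified model of a doubly linked list, not the EPICS base sources; the names Node, List, and removeWouldFault are hypothetical. It assumes that remove only skips the pPrev dereference when the node is the list's first element, so a node with pPrev == 0 that is removed from a list it does not head would fault in the same way. Whether that is what actually happened here (for example, the channel sitting on a different list than its listMember tag indicates) is not established by this report.

```cpp
#include <cassert>

// Simplified stand-ins for a doubly linked list node and list.
// Hypothetical model for illustration only; NOT the EPICS base code.
struct Node {
    Node *pNext = nullptr;
    Node *pPrev = nullptr;
};

struct List {
    Node *pFirst = nullptr;
    Node *pLast = nullptr;
    unsigned itemCount = 0;

    // Append n at the tail of this list.
    void add(Node &n) {
        n.pNext = nullptr;
        n.pPrev = pLast;
        if (pLast) {
            pLast->pNext = &n;
        } else {
            pFirst = &n;
        }
        pLast = &n;
        ++itemCount;
    }

    // True if remove(n) on THIS list would dereference a null neighbour
    // pointer, i.e. n is not this list's first/last element but has no
    // pPrev/pNext to patch.
    bool removeWouldFault(const Node &n) const {
        bool nullNext = (pLast != &n) && (n.pNext == nullptr);
        bool nullPrev = (pFirst != &n) && (n.pPrev == nullptr);
        return nullNext || nullPrev;
    }

    // Unlink n, assuming it really is a member of this list.
    void remove(Node &n) {
        assert(!removeWouldFault(n));  // guard added for the sketch
        if (pLast == &n) {
            pLast = n.pPrev;
        } else {
            n.pNext->pPrev = n.pPrev;
        }
        if (pFirst == &n) {
            pFirst = n.pNext;
        } else {
            n.pPrev->pNext = n.pNext;  // the statement analogous to line 238
        }
        --itemCount;
    }
};
```

In this model, a node that heads list A has pPrev == 0 and a valid pNext, which matches the values printed for theNode. Calling remove for it on an unrelated list B reaches the pPrev dereference; removeWouldFault(x) returns true for B and false for A.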

Murali Shankar (mshankar) wrote :

More information
This is PV Gateway Version 2.0.3.0 [Mar 2 2012 09:46:57]
Gateway is built against base-R3-14-12 with a few patches applied (I can provide a full list if needed).
IOC eioc-und1-mp01 runs on RTEMS-4.9.4-slac_p0 on top of EPICS R3.14.12-SLAC_1 $Date 2010/11/27

Murali Shankar (mshankar) wrote :

Results of thread apply all bt in a core.

Changed in epics-base:
assignee: nobody → Ralph Lange (ralph-lange)
no longer affects: epics-base
Ralph Lange (ralph-lange) wrote :
Changed in epics-gateway:
status: New → Won't Fix
status: Won't Fix → Invalid