EPICS Base

CA Client library crash when nproc ulimit reached

Bug #1664302 reported by Andrew Johnson on 2017-02-13

	Status	Importance	Assigned to	Milestone
EPICS Base	Fix Released	Medium	Unassigned
3.14	Fix Released	Undecided	Unassigned
3.15	Fix Released	Undecided	Unassigned	EPICS Base 3.15.6
3.16	Fix Released	Undecided	Unassigned

Bug Description

Michael Ritzert filed a bug https://github.com/epics-extensions/ca-gateway/issues/14 against the CA Gateway that looks like a CA client library bug in Base-3.14.12.6. His core dump shows this:

(gdb) bt
#0 0x00007f67e7a3ed26 in assertIdenticalMutex (this=0x0, guard=..., chan=..., sidIn=4294967295, typeIn=65535, countIn=0) at ../../../include/epicsGuard.h:81
#1 tcpiiu::installChannel (this=0x0, guard=..., chan=..., sidIn=4294967295, typeIn=65535, countIn=0) at ../tcpiiu.cpp:1911
#2 0x00007f67e7a2c2bb in cac::transferChanToVirtCircuit (this=<value optimized out>, cid=<value optimized out>, sid=4294967295, typeCode=65535, count=0, minorVersionNumber=13,
addr=..., currentTime=...) at ../cac.cpp:639
#3 0x00007f67e7a3a4a0 in udpiiu::searchRespAction (this=<value optimized out>, msg=<value optimized out>, addr=<value optimized out>, currentTime=<value optimized out>)
at ../udpiiu.cpp:690
#4 0x00007f67e7a3a5c2 in udpiiu::postMsg (this=0x242d760, net_addr=..., pInBuf=<value optimized out>, blockSize=48, currentTime=...) at ../udpiiu.cpp:857
#5 0x00007f67e7a3c681 in udpRecvThread::run (this=0x243db88) at ../udpiiu.cpp:394
#6 0x00007f67e77dd249 in epicsThreadCallEntryPoint (pPvt=0x243dba8) at ../../../src/libCom/osi/epicsThread.cpp:83

He also noticed that at the time "the user the gateway is running under has reached is nproc ulimit."

Related branches

lp:~epics-core/epics-base/fix-async-dns

Merged into lp:~epics-core/epics-base/3.14

Andrew Johnson: Approve on 2017-04-17

Ralph Lange: Approve on 2017-03-31

Andrew Johnson (anj) on 2017-02-13

Changed in epics-base:
importance:	High → Medium

Ralph Lange (ralph-lange) wrote on 2017-02-14:

He added another core dump:

OK, I have another crash, this time no other circumstances involved (it's on the second PC, no problems with ulimit for sure), just regular operation of the system.
It's in another place, but I'm adding it here, because it also has udpiiu in it:

Program terminated with signal 11, Segmentation fault.
#0 add (this=0xe0fc60, cid=<value optimized out>, sid=168846180, typeCode=65535, count=<value optimized out>, minorVersionNumber=<value optimized out>, addr=...,
    currentTime=...) at ../../../include/tsDLList.h:322
322 lastNode.pNext = &item;
Missing separate debuginfos, use: debuginfo-install glibc-2.12-1.192.el6.x86_64 libgcc-4.4.7-17.el6.x86_64 libstdc++-4.4.7-17.el6.x86_64 ncurses-libs-5.7-4.20090207.el6.x86_64
(gdb) bt
#0 add (this=0xe0fc60, cid=<value optimized out>, sid=168846180, typeCode=65535, count=<value optimized out>, minorVersionNumber=<value optimized out>, addr=...,
    currentTime=...) at ../../../include/tsDLList.h:322
#1 cac::transferChanToVirtCircuit (this=0xe0fc60, cid=<value optimized out>, sid=168846180, typeCode=65535, count=<value optimized out>,
    minorVersionNumber=<value optimized out>, addr=..., currentTime=...) at ../cac.cpp:616
#2 0x00007f434cc014a0 in udpiiu::searchRespAction (this=<value optimized out>, msg=<value optimized out>, addr=<value optimized out>, currentTime=<value optimized out>)
    at ../udpiiu.cpp:690
#3 0x00007f434cc015c2 in udpiiu::postMsg (this=0xe7cc80, net_addr=..., pInBuf=<value optimized out>, blockSize=24, currentTime=...) at ../udpiiu.cpp:857
#4 0x00007f434cc03681 in udpRecvThread::run (this=0xe8d0a8) at ../udpiiu.cpp:394
#5 0x00007f434c9a4249 in epicsThreadCallEntryPoint (pPvt=0xe8d0c8) at ../../../src/libCom/osi/epicsThread.cpp:83
#6 0x00007f434c9aaed3 in start_routine (arg=0xe8d340) at ../../../src/libCom/osi/os/posix/osdThread.c:389
#7 0x00000030ada07aa1 in start_thread () from /lib64/libpthread.so.0
#8 0x00000030ad2e8aad in clone () from /lib64/libc.so.6

I also have another core still to be examined.

Ralph Lange (ralph-lange) wrote on 2017-02-14:

Looks like the same issue, although the arguments are less obviously wrong this time.

The crash appears in the action to a search response. I.e. after asking for a PV, the Gateway got an "I have it" UDP response and is transferring the channel to a virtual circuit (TCP connection to an IOC).

Michael Ritzert (michael-ritzert) wrote on 2017-02-14:

(gdb) info args
item = @0x7f433c054e20
this = 0xe0fd60
(gdb) p lastNode
$1 = (tsDLNode<msgForMultiplyDefinedPV> &) @0x0: <error reading variable>
(gdb) p &item
$2 = (msgForMultiplyDefinedPV *) 0x7f433c054e20
(gdb) info locals
lastNode = @0x0
theNode = @0x7f433c054e28

mdavidsaver (mdavidsaver) wrote on 2017-02-14:

The original stack trace points to a (I think) clear this==NULL bug in cac::transferChanToVirtCircuit. Specifically the piiu->installChannel which looks like it should be conditional on newIIU.

The handling of piiu isn't so straightforward. As I read it, this->serverTable.lookup() returns NULL on failure. piiu is then passed to findOrCreateVirtCircuit() by *reference*. That is, a reference to a pointer. findOrCreateVirtCircuit() returns true if piiu is now non-NULL. So it seems clear that piiu should not be de-referenced unless this boolean newIIU is true.

https://github.com/epics-base/epics-base/blob/3.14/src/ca/cac.cpp#L638

I won't pretend to understand the logic here. The most recent footprints in this area are in 2010 with changes dating from 2008. This commit carries the inspiring message "COMPLETELY UNTESTED" :)

https://github.com/epics-base/epics-base/commit/23612a7afe1c6e0a208bf4a0acecd2a5e2468380

mdavidsaver (mdavidsaver) wrote on 2017-02-15:

Ralph's crash is probably due to corruption of the msgMultiPVList list. cac::mutex is clearly held in transferChanToVirtCircuit when adding to this list, but seems to not be held when entries are removed in cac::pvMultiplyDefinedNotify or cac::~cac.
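
A minimal sketch of the locking rule implied above, assuming a hypothetical MsgList class in place of the real cac code and std::mutex in place of the EPICS mutex types. It only illustrates that every add and every remove on the shared list takes the same lock; it is not the actual fix.

```cpp
#include <cassert>
#include <cstddef>
#include <list>
#include <mutex>
#include <thread>

// Hypothetical stand-in for the owner of msgMultiPVList; not the real cac class.
class MsgList {
public:
    // Add under the lock (per the comment above, the real code already does this).
    std::list<int>::iterator add(int msg) {
        std::lock_guard<std::mutex> guard(mutex_);
        return items_.insert(items_.end(), msg);
    }
    // Remove under the SAME lock; the comment above suspects this was missing.
    void remove(std::list<int>::iterator it) {
        std::lock_guard<std::mutex> guard(mutex_);
        items_.erase(it);
    }
    std::size_t size() {
        std::lock_guard<std::mutex> guard(mutex_);
        return items_.size();
    }
private:
    std::mutex mutex_;
    std::list<int> items_;
};

// Two threads each add and then remove 'n' entries concurrently.
// Returns the final list size, which should be zero.
std::size_t hammer(int n) {
    MsgList list;
    auto worker = [&list, n]() {
        for (int i = 0; i < n; ++i) {
            list.remove(list.add(i));
        }
    };
    std::thread a(worker), b(worker);
    a.join();
    b.join();
    return list.size();
}
```

Calling hammer(10000) should return 0. Dropping the lock_guard from remove() makes the same loop a data race on the list's internal pointers, which is the kind of corruption suspected here.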

mdavidsaver (mdavidsaver) wrote on 2017-02-15:

Also, when providing stack traces, please dump all threads. I suspect that both of these crashes are races in some way, so it would be useful to look for the other racer(s). In GDB run "thread apply all backtrace". This will be large, so please attach the output as a file.

Bruce Hill (bhill) wrote on 2017-02-15:

I agree w/ Michael's take on the 2nd crash, looks like the DLL freelist is probably corrupted. Adding a guard in pvMultiplyDefinedNotify() is essential. Not clear what the best fix for ~cac() is, due to its warnings about deadlocks w/ UDP thread.

On the first crash, findOrCreateVirtCircuit() can return FALSE for a valid piiu ptr, whether the circuit is connected or not. Not clear why it even bothers calling tcpiiu::alive() as it returns FALSE either way.

Also not clear if piiu->installChannel() should be called for a valid piiu ptr when newIIU is FALSE. I'm guessing it should as that looks like it probably happens a lot.

Michael Ritzert (michael-ritzert) wrote on 2017-02-15:

Full backtrace for the second crash. (43.2 KiB, text/plain)

Full backtrace for the *second* crash.

I have the full core file available, and the executable is from an RPM. If you want to debug interactively, I can make both available.

Michael Ritzert (michael-ritzert) wrote on 2017-02-15:

Full backtrace for the first crash. (14.6 KiB, text/plain)

mdavidsaver (mdavidsaver) wrote on 2017-02-16:

#10

> On the first crash, ...

Ha, so newIIU==false can mean either an existing circuit was reused, or that findOrCreateVirtCircuit() error'd. So newIIU==false and piiu==NULL implies that findOrCreateVirtCircuit() failed to create a new circuit.

I see no mention, was there any "CAC: exception during virtual circuit creation" error printed? Though this may get lost in the errlog thread. The only exception I can see which might pop up here is bad_alloc from bheFreeList.allocate(), which seems unlikely.

Otherwise, the only code path I can see which would lead to this is if beaconTable.add fails, which looks like it can't happen.

To further troubleshoot this I'd make the piiu->installChannel call conditional on piiu!=NULL.
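
The three outcomes discussed in this comment, and the proposed guard, can be sketched as follows. Circuit, findOrCreate and transferChannel are hypothetical stand-ins that only mimic the calling convention described in the thread (pointer passed by reference, boolean meaning "a new circuit was created"); this is not the code in cac.cpp.

```cpp
#include <cassert>

// Hypothetical stand-in for tcpiiu; not the real class.
struct Circuit {
    int channels = 0;
    void installChannel() { ++channels; }
};

// Mimics the convention described in the thread: the pointer is passed by
// reference, and the return value is true only when a NEW circuit was made.
//   existing circuit reused -> returns false, p != nullptr
//   creation failed         -> returns false, p == nullptr
//   new circuit created     -> returns true,  p != nullptr
bool findOrCreate(Circuit*& p, Circuit* fresh) {
    if (p != nullptr) return false;      // reuse what the lookup found
    if (fresh == nullptr) return false;  // creation failed, p stays null
    p = fresh;
    return true;
}

// 'found' plays the role of the lookup result, 'fresh' the circuit that
// creation would yield (nullptr simulates a creation failure).
// Returns true if the channel was installed.
bool transferChannel(Circuit* found, Circuit* fresh) {
    Circuit* p = found;
    const bool isNew = findOrCreate(p, fresh);
    (void) isNew;                        // cannot tell reuse from failure by itself
    if (p == nullptr) return false;      // the proposed guard
    p->installChannel();
    return true;
}
```

Because the boolean is false for both reuse and failure, the sketch tests the pointer itself before calling through it, which matches the suggestion to make the call conditional on piiu!=NULL.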

From what I see, this crash would only happen when the first client connects through a gateway to a particular server (no existing circuit). In transferChanToVirtCircuit(), look at pChan->pNameStr to see the name of this channel. This might be helpful in attempting to reproduce this crash.

If this is happening intermittently, it occurs to me that a PV name conflict may be involved (thus msgForMultiplyDefinedPV and the second crash).

mdavidsaver (mdavidsaver) wrote on 2017-02-16:

#11

> Full backtrace for the *second* crash.

2ndcrash.log shows signs of stack corruption on many threads. Which makes me wonder if the linked list corruption might not be a side effect of this instead of a race.

1stcrash.log shows no sign of racers. The stack trace of thread 10 is garbage, the others look ok.

Michael Ritzert (michael-ritzert) wrote on 2017-02-16:

#12

> I see no mention, was there any "CAC: exception during virtual circuit creation" error printed?

What all the crashes have in common is that no message is printed at all. We are running everything with output to log files, and this is the full content of the file:

/opt/epics/ioc/iocBoot/gateway/start.sh: line 3: 3949 Segmentation fault (core dumped) /opt/epics/extensions/bin/linux-x86_64/gateway -sip 192.168.99.44 -cip 10.16.99.255 -log gateway.log -putlog put.log -pvlist PXD.pvlist -archive -no_cache -signore 192.168.99.49

Michael Ritzert (michael-ritzert) wrote on 2017-02-16:

#13

> In transferChanToVirtCircuit(), look at pChan->pNameStr to see the name of this channel.

(gdb) up
#1 cac::transferChanToVirtCircuit (this=0xe0fc60, cid=<value optimized out>, sid=168846180, typeCode=65535, count=<value optimized out>,
    minorVersionNumber=<value optimized out>, addr=..., currentTime=...) at ../cac.cpp:616
616 this->msgMultiPVList.add ( *pMsg );
(gdb) p pChan->pNameStr
$1 = 0x403e000000000000 <Address 0x403e000000000000 out of bounds>
(gdb) p pChan
$2 = (nciu *) 0xe0fdf0
(gdb) p *pChan
$3 = {<cacChannel> = {_vptr.cacChannel = 0xe0ff70, static priorityMax = 99, static priorityMin = 0, static priorityDefault = 0, static priorityLinksDB = 99,
    static priorityArchive = 49, static priorityOPI = 0, callback =
    @0x7f433c0552b8}, <chronIntIdRes<nciu>> = {<chronIntId> = {<intId<unsigned int, 8u, 32u>> = {id = 1007097120}, <No data fields>}, <tsSLNode<nciu>> = {
      pNext = 0x7f434ce28570}, <No data fields>}, <channelNode> = {<tsDLNode<nciu>> = {pNext = 0xe0ffe0, pPrev = 0x7f4332615090},
    listMember = 845172752}, <privateInterfaceForIO> = {_vptr.privateInterfaceForIO = 0x7f434ce2a1b0}, eventq = {pFirst = 0xe10050, pLast = 0x7f433c011308,
    itemCount = 1006692352}, accessRightState = {f_readPermit = false, f_writePermit = false, f_operatorConfirmationRequest = false}, cacCtx = @0x2a5e9580,
  pNameStr = 0x403e000000000000 <Address 0x403e000000000000 out of bounds>, piiu = 0xdff7f8, sid = 14678016, count = 0, retry = 14745728, nameLength = 0,
  typeCode = 0, priority = 240 '\360'}

Michael Ritzert (michael-ritzert) wrote on 2017-02-16:

#14

The above was for the second crash, here is the same for the first:

(gdb) up
#2 0x00007f67e7a2c2bb in cac::transferChanToVirtCircuit (this=<value optimized out>, cid=<value optimized out>, sid=4294967295, typeCode=65535, count=0,
    minorVersionNumber=13, addr=..., currentTime=...) at ../cac.cpp:639
639 guard, *pChan, sid, typeCode, count );
(gdb) p pChan->pNameStr
$1 = 0x2556f90 "PXD:O:RC:Masked:S"
(gdb) p *pChan
$2 = {<cacChannel> = {_vptr.cacChannel = 0x7f67e7c620f0, static priorityMax = 99, static priorityMin = 0, static priorityDefault = 0,
    static priorityLinksDB = 99, static priorityArchive = 49, static priorityOPI = 0, callback =
    @0x242cde8}, <chronIntIdRes<nciu>> = {<chronIntId> = {<intId<unsigned int, 8u, 32u>> = {id = 982}, <No data fields>}, <tsSLNode<nciu>> = {
      pNext = 0x0}, <No data fields>}, <channelNode> = {<tsDLNode<nciu>> = {pNext = 0x7f67e56669c0, pPrev = 0x0},
    listMember = channelNode::cs_none}, <privateInterfaceForIO> = {_vptr.privateInterfaceForIO = 0x7f67e7c621d0}, eventq = {pFirst = 0x0, pLast = 0x0,
    itemCount = 0}, accessRightState = {f_readPermit = false, f_writePermit = false, f_operatorConfirmationRequest = false}, cacCtx = @0x2415480,
  pNameStr = 0x2556f90 "PXD:O:RC:Masked:S", piiu = 0x242d760, sid = 4294967295, count = 0, retry = 9, nameLength = 18, typeCode = 65535, priority = 0 '\000'}

Michael Ritzert (michael-ritzert) wrote on 2017-02-16:

#15

This just occurred to me:
> (gdb) p pChan->pNameStr
> $1 = 0x2556f90 "PXD:O:RC:Masked:S"

The gateway is not supposed to handle that PV. I'm starting with

$ /opt/epics/extensions/bin/linux-x86_64/gateway -sip 10.16.99.100 -cip 10.16.3.255 -log gateway.log -putlog put.log -archive -no_cache -signore 10.16.99.100 -pvlist ONSEN.pvlist

gateway.log confirms
EPICS_CA_ADDR_LIST=10.16.3.255
EPICS_CA_AUTO_ADDR_LIST=NO

but:
$ cainfo PXD:O:RC:Masked:S
PXD:O:RC:Masked:S
State: connected
Host: 192.168.99.49:38811

so it's not from the 10.16.3.255 net.

Michael Ritzert (michael-ritzert) wrote on 2017-02-16:

#16

> The gateway is not supposed to handle that PV. I'm starting with

Ignore this. The gateway machine can also see the PV in the network it is supposed to handle.

Michael Ritzert (michael-ritzert) wrote on 2017-02-16:

#17

Concerning stack corruption: Everything is built with
-Wp,-D_FORTIFY_SOURCE=2 -fstack-protector --param=ssp-buffer-size=4
so the most trivial cases would lead to immediate termination of the program.

Michael Ritzert (michael-ritzert) wrote on 2017-02-16:

#18

> To further troubleshoot this I'd make the piiu->installChannel call conditional on piiu!=NULL.

Do you want me to deploy a patched version?

I'm also considering running a version instrumented with the address sanitizer.

> From what I see, this crash would only happen when the first client connects through a gateway to a particular server (no existing circuit).

I think we can rule out that scenario. The crashes also happen after the gateway has been running for a while, and it certainly has been used in the meantime.

Ralph Lange (ralph-lange) wrote on 2017-02-16:

#19

Wait.
The Gateway sees the PV in its client net in 10.16.3.0/24, and offers it on 10.16.99.100, but another server also offers it on 10.16.99.49?

That does not seem right. (At least very unusual.)

Michael Ritzert (michael-ritzert) wrote on 2017-02-16:

#20

> but another server also offers it on 10.16.99.49?

The other IP is 192.168.99.49. Different prefix.

My confusion arose when cainfo didn't report multiple PVs. It should have seen 192.168.99.49 and 10.16.3.254.

Michael Ritzert (michael-ritzert) wrote on 2017-02-17:

#21

3rdcrash.log (52.9 KiB, text/plain)

Another one. Just for completeness; it looks extremely similar.

Michael Ritzert (michael-ritzert) wrote on 2017-02-21:

#22

Now it gets interesting: On the same machine, I just got this crash from caget:

Program terminated with signal 11, Segmentation fault.
#0  remove (this=0x1dc3040, mfmdpv=..., pChannelName=<value optimized out>, pAcc=<value optimized out>, pRej=<value optimized out>)
    at ../../../include/tsDLList.h:236
236             prevNode.pNext = theNode.pNext;
Missing separate debuginfos, use: debuginfo-install glibc-2.12-1.192.el6.x86_64 libgcc-4.4.7-17.el6.x86_64 libstdc++-4.4.7-17.el6.x86_64 ncurses-libs-5.7-4.20090207.el6.x86_64 readline-6.0-4.el6.x86_64
(gdb) bt
#0  remove (this=0x1dc3040, mfmdpv=..., pChannelName=<value optimized out>, pAcc=<value optimized out>, pRej=<value optimized out>)
    at ../../../include/tsDLList.h:236
#1  cac::pvMultiplyDefinedNotify (this=0x1dc3040, mfmdpv=..., pChannelName=<value optimized out>, pAcc=<value optimized out>, pRej=<value optimized out>)
    at ../cac.cpp:1309
#2  0x00007fe4b476b01e in ipAddrToAsciiEnginePrivate::run (this=0x1dc35d0) at ../../../src/libCom/misc/ipAddrToAsciiAsynchronous.cpp:276
#3  0x00007fe4b476d249 in epicsThreadCallEntryPoint (pPvt=0x1dc3a28) at ../../../src/libCom/osi/epicsThread.cpp:83
#4  0x00007fe4b4773ed3 in start_routine (arg=0x1dc3d90) at ../../../src/libCom/osi/os/posix/osdThread.c:389
#5  0x0000003111c07aa1 in start_thread () from /lib64/libpthread.so.0
#6  0x00000031114e8aad in clone () from /lib64/libc.so.6
(gdb) p prevNode
$1 = (tsDLNode<msgForMultiplyDefinedPV> &) @0x0: <error reading variable>
(gdb) p theNode
$2 = (tsDLNode<msgForMultiplyDefinedPV> &) @0x7fe4ac017038: {pNext = 0x7fe4ac0170d8, pPrev = 0x0}

I'm adding this here because of the similarity of the crash location in tsDLList.h, related to the node handling.

There are only two more threads, which makes this a lot easier to debug:
Thread 3 (Thread 0x7fe4b4734720 (LWP 14595)):
#0  0x0000003111c0b68c in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
#1  0x00007fe4b47747a9 in condWait (pevent=0x1dc3bd0) at ../../../src/libCom/osi/os/posix/osdEvent.c:75
#2  epicsEventWait (pevent=0x1dc3bd0) at ../../../src/libCom/osi/os/posix/osdEvent.c:137
#3  0x00007fe4b476dbac in epicsEvent::wait (this=<value optimized out>) at ../../../src/libCom/osi/epicsEvent.cpp:63
#4  0x00007fe4b476ad27 in ipAddrToAsciiTransactionPrivate::~ipAddrToAsciiTransactionPrivate (this=0x7fe4ac007d80, __in_chrg=<value optimized out>)
    at ../../../src/libCom/misc/ipAddrToAsciiAsynchronous.cpp:319
#5  0x00007fe4b476a96a in ipAddrToAsciiTransactionPrivate::release (this=0x7fe4ac007d80) at ../../../src/libCom/misc/ipAddrToAsciiAsynchronous.cpp:302
#6  0x00007fe4b49e19e2 in msgForMultiplyDefinedPV::~msgForMultiplyDefinedPV (this=0x7fe4ac017030, __in_chrg=<value optimized out>)
    at ../msgForMultiplyDefinedPV.cpp:53
#7  0x00007fe4b49be706 in cac::~cac (this=0x1dc3040, __in_chrg=<value optimized out>) at ../cac.cpp:338
#8  0x00007fe4b49bebb9 in cac::~cac (this=0x1dc3040, __in_chrg=<value optimized out>) at ../cac.cpp:349
#9  0x00007fe4b49d8aee in destroyTarget (this=0x1dc2c50, __in_chrg=<value optimized out>) at ../../../include/epicsMemory.h:52
#10 reset (this=0x1dc2c50, __in_chrg=<value optimized out>) at ../../../include/epicsMemory.h:111
#11 ca_client_context::~ca_client_context (this=0x1dc2c50, __in_chrg=<value optimized out>) at ../ca_client_context.cpp:188
#12 0x00007fe4b49d8eb9 in ca_client_context::~ca_client_context (this=0x1dc2c50, __in_chrg=<value optimized out>) at ../ca_client_context.cpp:193
#13 0x00007fe4b49c1d43 in ca_context_destroy () at ../access.cpp:252
#14 0x0000000000401c40 in main (argc=<value optimized out>, argv=<value optimized out>) at ../caget.c:551

Thread 2 (Thread 0x7fe4b3b30700 (LWP 14604)):
#0  0x0000003111c0ba5e in pthread_cond_timedwait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
#1  0x00007fe4b4774704 in condTimedwait (pevent=0x7fe4a4002900, timeout=0) at ../../../src/libCom/osi/os/posix/osdEvent.c:65
#2  epicsEventWaitWithTimeout (pevent=0x7fe4a4002900, timeout=0) at ../../../src/libCom/osi/os/posix/osdEvent.c:156
#3  0x00007fe4b475ec76 in errlogThread () at ../../../src/libCom/error/errlog.c:507
#4  0x00007fe4b4773ed3 in start_routine (arg=0x7fe4a4005cc0) at ../../../src/libCom/osi/os/posix/osdThread.c:389
#5  0x0000003111c07aa1 in start_thread () from /lib64/libpthread.so.0
#6  0x00000031114e8aad in clone () from /lib64/libc.so.6

I can actually reproduce this by just repeating the caget enough times. The PV is nothing special, just plain ao, r/w access.

Ralph Lange (ralph-lange) wrote on 2017-02-21:

#23

That smells like being related to https://bugs.launchpad.net/epics-base/+bug/1580623
You backported the fix to 3.14, Michael, so I assume your base version has that fix?

Michael Ritzert (michael-ritzert) wrote on 2017-02-21:

#24

I'm using the plain base-3.14.12.6 version that seems to have this patch applied.

Michael Ritzert (michael-ritzert) wrote on 2017-02-21:

#25

About the latest crash in caget: It happens far more frequently when the PV is on localhost than when run from a remote host, and only when multiple replies (from multiple network interfaces on the host) are received. I haven't yet seen a crash when it terminated before a second reply was reported.

mdavidsaver (mdavidsaver) wrote on 2017-02-21:

#26

may be related to https://bugs.launchpad.net/epics-base/+bug/1645301

mdavidsaver (mdavidsaver) wrote on 2017-02-21:

#27

> Do you want me to deploy a patched version?

It would be helpful if you could test possible patches. How this is done is of course up to you. I'd suggest cloning and building Base and the gateway from source.

mdavidsaver (mdavidsaver) wrote on 2017-02-22:

#28

... specifically I'd like you to test with https://github.com/mdavidsaver/epics-base/tree/fix1664302

Michael Ritzert (michael-ritzert) wrote on 2017-02-22:

#29

> ... specifically I'd like you to test with https://github.com/mdavidsaver/epics-base/tree/fix1664302

sorry, that one doesn't help, at least for the caget crash.

Program received signal SIGSEGV, Segmentation fault.
[Switching to Thread 0x7ffff7114700 (LWP 14790)]
remove (this=0x60b040, mfmdpv=..., pChannelName=<value optimized out>, pAcc=<value optimized out>, pRej=<value optimized out>)
at ../../../include/tsDLList.h:251
251 prevNode.pNext = theNode.pNext;
(gdb) info locals
prevNode = @0x0
theNode = @0x7ffff0017038
(gdb) quit
A debugging session is active.

Inferior 1 [process 14786] will be killed.

Quit anyway? (y or n) y
[belle-iocpxd] /tmp/epics-base $ git status
# On branch fix1664302

Michael Ritzert (michael-ritzert) wrote on 2017-02-22:

#30

For the gateway the situation is now a lot worse. It reliably crashes within seconds:

Program received signal SIGSEGV, Segmentation fault.
[Switching to Thread 0x7ffff2a0c700 (LWP 1087)]
0x00007ffff426f250 in remove (item=..., this=0x60380000fd80) at ../../../include/tsDLList.h:251
251 prevNode.pNext = theNode.pNext;
Missing separate debuginfos, use: debuginfo-install glibc-2.12-1.192.el6.x86_64 libasan-4.8.2-15.1.el6.x86_64 libgcc-4.4.7-17.el6.x86_64 libstdc++-4.4.7-17.el6.x86_64 ncurses-libs-5.7-4.20090207.el6.x86_64
(gdb) info locals
prevNode = @0x8: <error reading variable>
theNode = @0x60580000f088: {pNext = 0x60580000f128, pPrev = 0x0}
(gdb) bt
#0 0x00007ffff426f250 in remove (item=..., this=0x60380000fd80) at ../../../include/tsDLList.h:251
#1 cac::pvMultiplyDefinedNotify (this=0x60380000fc80, mfmdpv=..., pChannelName=<optimized out>, pAcc=<optimized out>, pRej=<optimized out>) at ../cac.cpp:1320
#2 0x00007ffff3fd3e15 in ipAddrToAsciiEnginePrivate::run (this=0x60460001ef80) at ../../../src/libCom/misc/ipAddrToAsciiAsynchronous.cpp:276
#3 0x00007ffff3fd6b65 in epicsThreadCallEntryPoint (pPvt=0x60460001f3d8) at ../../../src/libCom/osi/epicsThread.cpp:83

prevNode always seems to be 0x8.

Michael Ritzert (michael-ritzert) wrote on 2017-02-22:

#31

I should add: I'm testing with 3.14.12.6 base with the three latest commits from https://github.com/mdavidsaver/epics-base/commits/fix1664302 applied.

mdavidsaver (mdavidsaver) wrote on 2017-02-22:

#32

> For the gateway the situation is now a lot worse. It reliably crashes within seconds:

That's because I made a mistake (remove() twice). I've pushed updated patches.

Michael Ritzert (michael-ritzert) wrote on 2017-02-22:

#33

I updated the second of the three patches.

This brings no change for caget, but no more immediate crashes for the gateway.

I cannot install the patched version of the gateway right now, because we are taking data. I will update it at the next possibility.

However, the current instance has been up for six days now without another crash, so I don't know when to call the test a success. If of course it still crashes…

mdavidsaver (mdavidsaver) wrote on 2017-02-22:

#34

Ok, I may see what is happening. Even with my changes, in ~cac the msgForMultiplyDefinedPV are removed from the msgMultiPVList list before their destructor is called (where dnsTransaction.release() happens). So pvMultiplyDefinedNotify() may already be queued and run after removal from the list, but before release(). This will blindly try to remove() again, which causes the crash.

I think the fix will be to track whether an msgForMultiplyDefinedPV is in the list.
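
A minimal sketch of the "track whether it is in the list" idea, assuming hypothetical Msg and Owner types and std::list instead of tsDLList. Locking is left out to keep the sketch short; in real code both calls would have to run under the same mutex.

```cpp
#include <cassert>
#include <cstddef>
#include <list>

// Hypothetical stand-in for msgForMultiplyDefinedPV; not the real class.
struct Msg {
    bool inList = false;               // membership flag suggested above
    std::list<Msg*>::iterator pos;     // valid only while inList is true
};

// Hypothetical stand-in for the owner of msgMultiPVList.
class Owner {
public:
    void add(Msg& m) {
        if (m.inList) return;          // already a member
        m.pos = list_.insert(list_.end(), &m);
        m.inList = true;
    }
    // Safe to call more than once: the second call is a no-op.
    // Returns true if the message was actually removed.
    bool remove(Msg& m) {
        if (!m.inList) return false;
        list_.erase(m.pos);
        m.inList = false;
        return true;
    }
    std::size_t size() const { return list_.size(); }
private:
    std::list<Msg*> list_;
};
```

With the flag, a late notification that arrives after the destructor path has already unlinked the message finds inList == false and returns, instead of unlinking a node whose neighbour pointers are no longer valid.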

mdavidsaver (mdavidsaver) wrote on 2017-02-24:

#35

This is another regression caused by the fix for #1527636

Previously the line 'this->ipToAEngine.release ();' in ~cac ensured that there was no race before msgMultiPVList was cleaned.

mdavidsaver (mdavidsaver) wrote on 2017-02-24:

#36

Well, at least for the last cac instance. As with the other regressions/bugs exposed by lp:1527636 this could happen if a process has multiple CA contexts.

mdavidsaver (mdavidsaver) wrote on 2017-02-26:

#37

I've made an attempt to fix the root problem, which is the change in behavior to ipAddrToAsciiEngine::release(). Instead of a singleton, the engine is now a lightweight struct which exists only to track transaction ownership. release() should now have the side-effect of canceling all pending or in-progress transactions created through that engine instance.
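
The ownership idea described above can be sketched roughly as follows. Engine and Transaction here are hypothetical, single-threaded stand-ins and not the ipAddrToAsciiEngine API; a real implementation would also need locking and would have to wait for a callback that is already running.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <memory>
#include <utility>
#include <vector>

// Hypothetical stand-in for a pending address-to-name lookup.
struct Transaction {
    std::function<void()> callback;
    bool cancelled = false;
    // Called by the worker when the lookup finishes.
    void complete() {
        if (!cancelled && callback) callback();
    }
};

// Hypothetical lightweight handle: it only records which transactions were
// created through it, so that release() can cancel exactly those.
class Engine {
public:
    std::shared_ptr<Transaction> create(std::function<void()> cb) {
        auto t = std::make_shared<Transaction>();
        t->callback = std::move(cb);
        owned_.push_back(t);
        return t;
    }
    // After release() no callback from this engine's transactions will run.
    void release() {
        for (auto& t : owned_) t->cancelled = true;
        owned_.clear();
    }
    std::size_t pending() const { return owned_.size(); }
private:
    std::vector<std::shared_ptr<Transaction>> owned_;
};
```

The point of per-instance ownership is that releasing one engine silences only its own transactions, so a process with several CA contexts can tear one down without affecting the others and without a late callback touching a list that is being destroyed.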

Michael Ritzert (michael-ritzert) wrote on 2017-02-27:

#38

Your latest commit b1fb0c4a15b3de6c53413539bc6c4bce800b1bb7 seems to help at least for the caget crash.

mdavidsaver (mdavidsaver) wrote on 2017-02-27:

#39

Good. Unless I hear otherwise, I'll assume you have no further crashes.

mdavidsaver (mdavidsaver) wrote on 2017-02-27:

#40

I've committed the (small) change to findOrCreateVirtCircuit() on the 3.14 branch. http://bazaar.launchpad.net/~epics-core/epics-base/3.14/revision/12701

I'll start a merge proposal for the ipAddrToAsciiEngine change as this is larger and trickier.

mdavidsaver (mdavidsaver) wrote on 2017-02-27:

#41

I picked off part of the msgMultiPVList locking change as http://bazaar.launchpad.net/~epics-core/epics-base/3.14/revision/12702 which is not by itself sufficient to avoid this crash, but should be when combined with epics-base/fix-async-dns/+merge.

Andrew Johnson (anj) on 2017-12-15

Changed in epics-base:
milestone:	3.14.branch → none
status:	New → Fix Released
