EPICS Base

IOC segmentation fault related to CA security

Series 3.16
Bug #1571224

Bug #1571224 reported by mdavidsaver on 2016-04-16

This bug affects 1 person

	Status	Importance	Assigned to	Milestone
EPICS Base	Fix Released	High	mdavidsaver	EPICS Base 3.16.1
3.14	Fix Released	High	mdavidsaver	EPICS Base 3.14.12.6
3.15	Fix Released	High	mdavidsaver	EPICS Base 3.15.4
3.16	Fix Released	High	mdavidsaver	EPICS Base 3.16.1

Bug Description

On 04/15/2016 05:27 PM, Pearson, Matthew R. wrote:
>
> Hi,
>
> I’ve had a few instances of one of my soft IOCs crashing with a segmentation fault when I shutdown another IOC that hosts a PV used as part of the CA security logic on the other crashed IOC.
>
> There is sometimes (but not always) this message printed out before the crash:
>
> dbCa:exceptionCallback stat "Virtual circuit disconnect" channel "unknown" context "cg1d-dassrv1.ornl.gov:5064"
> nativeType DBR_invalid requestType DBR_invalid nativeCount 0 requestCount 0 noReadAccess noWriteAccess
>
> Printing the stack trace:
>
> Core was generated by `../../bin/linux-x86_64/cg1d-parker1 ./st.cmd'.
> Program terminated with signal 11, Segmentation fault.
> #0 0x00007f5321218599 in ellDelete (pList=0x7f52bc000920, pNode=0x7f52ac008ec0) at ../../../src/libCom/ellLib/ellLib.c:87
> 87 pNode->previous->next = pNode->next;
> Missing separate debuginfos, use: debuginfo-install glibc-2.12-1.149.el6_6.9.x86_64 libgcc-4.4.7-11.el6.x86_64 libstdc++-4.4.7-11.el6.x86_64 ncurses-libs-5.7-3.20090208.el6.x86_64 readline-6.0-4.el6.x86_64
> (gdb) bt
> #0 0x00007f5321218599 in ellDelete (pList=0x7f52bc000920, pNode=0x7f52ac008ec0) at ../../../src/libCom/ellLib/ellLib.c:87
> #1 0x00007f532217c289 in casAccessRightsCB (ascpvt=0x7f52b8000db8, type=asClientCOAR) at ../camessage.c:1111
> #2 0x00007f5321d64122 in asComputePvt (asClientPvt=0x7f52b8000db8) at ../asLibRoutines.c:1014
> #3 0x00007f5321d63ea0 in asComputeAsgPvt (pasg=0x1f91ee0) at ../asLibRoutines.c:940
> #4 0x00007f5321d62419 in asComputeAsg (pasg=0x1f91ee0) at ../asLibRoutines.c:455
> #5 0x00007f5321d60482 in connectCallback (arg=...) at ../asCa.c:99
> #6 0x00007f53214c1301 in oldChannelNotify::disconnectNotify (this=0x7f52d0000d10, guard=...) at ../oldChannelNotify.cpp:112
> #7 0x00007f53214abe30 in nciu::unresponsiveCircuitNotify (this=0x7f5323714010, cbGuard=..., guard=...) at ../nciu.cpp:171
> #8 0x00007f53214b7c38 in tcpiiu::disconnectAllChannels (this=0x7f52d40008c0, cbGuard=..., guard=..., discIIU=...) at ../tcpiiu.cpp:1834
> #9 0x00007f532149a042 in cac::destroyIIU (this=0x7f52d000ed20, iiu=...) at ../cac.cpp:1227
> #10 0x00007f53214b27e3 in tcpSendThread::run (this=0x7f52d4000a00) at ../tcpiiu.cpp:229
> #11 0x00007f532122bc51 in epicsThreadCallEntryPoint (pPvt=0x7f52d4000a08) at ../../../src/libCom/osi/epicsThread.cpp:85
> #12 0x00007f53212333ce in start_routine (arg=0x7f52d400a250) at ../../../src/libCom/osi/os/posix/osdThread.c:385
> #13 0x00007f53204a29d1 in start_thread () from /lib64/libpthread.so.0
> #14 0x00007f53207a08fd in clone () from /lib64/libc.so.6
>
>
> The crashed IOC has a CA access security rule that looks like:
>
> ASG(DEFAULT)
> {
> INPA("$(P):Scan:Active")
> RULE(1, READ)
> RULE(1, WRITE)
> {
> CALC("A=0")
> }
> RULE(1, WRITE)
> {
> UAG(epics, beamline, detector)
> HAG(beamline)
> CALC("A=1")
> }
> }
>
> where $(P):Scan:Active is hosted by the IOC that I’m shutting down.
>
> In addition there is a channel access link between the two IOCs involving a CP link.
>
> I can’t reliably reproduce it, but it’s happened a few times today as I was testing it (stopping and starting the IOC hosting $(P):Scan:Active perhaps 20 times).
>
> Anybody have ideas about this?
>
> Our base version is 3.14.12.4 running on RHEL6.

Related branches

lp:~epics-core/epics-base/3.14

Revision history for this message

mdavidsaver (mdavidsaver) wrote on 2016-04-16:

http://www.aps.anl.gov/epics/tech-talk/2016/msg00692.php

Revision history for this message

mdavidsaver (mdavidsaver) wrote on 2016-04-16:

I can recall seeing "dbCa:exceptionCallback" associated with ACF
disconnects before, but don't recall any crashes. This might just be
chance. The 'channel "unknown"' certainly stands out.

Would it be easy for you to take a packet capture of the traffic between
these two IOCs when this crash occurs? This might give some clues about
the specific sequence of events which triggers the crash.

Also, is the communication between these two IOCs direct? You don't
mention any ca gateway.

Revision history for this message

Steven Hartman (hartmansm) wrote on 2016-04-16:

No ca gateway. The two IOCs are on the same subnet.

The ScanSupport IOC serves a PV used in an access security calculation by the Parker1 IOC. The Parker1 IOC is the one with the segfault sometimes when the ScanSupport IOC is restarted. Interestingly, the ScanSupport IOC is also a client of a PV served by the Parker1 IOC.

My supposition is a race condition between notifying clients of the new access security status, and the client disconnect. The reboot of the ScanSupport IOC causes the disappearance of the PV used by the access security calculation. This causes the Parker1 IOC to recompute its access security rules in asComputeAsg(). It then attempts to updates its clients of the new security rules in casAccessRightsCB(). However, when it gets to the client on ScanSupport IOC, that channel has already disappeared because of the reboot. That leads to the segfault in ellDelete().

For now, we have rearranged the database files so the IOC which serves the access security rule PV is not a client of other IOCs.

Revision history for this message

Andrew Johnson (anj) wrote on 2016-04-20:

I don't think the 'channel "unknown"' is an issue, looking at the asCa.c::exceptionCallback() routine that's just what it prints when args.chid is NULL, which I believe is normal for a virtual circuit disconnect. You only get one exception callback per circuit, but CA does call the connection callback for every channel.

I'm trying to replicate this crash, with no joy so far:

ioc1.db loaded into a regular softIoc:
    # This PV is used in an access security calculation by the other IOC:
    record(ao, "anj1:ao") {
        field(VAL, 1)
    }
    # This PV connects to a PV served by the other IOC:
    record(ai, "anj1:ai") {
        field(INP, "anj2:ai CP")
    }

ioc2.db loaded in a softIoc with the rules.asc file below, running under gdb:
    # The default AS rule uses the PV anj1:ao served by the other IOC:
    record(ao, "anj2:ao") {}
    # This PV is the target of a CA link in the other IOC:
    record(ai, "anj2:ai") {}

rules.asc
    # AS Rules file for ioc2
    HAG(tux) {tux, tux.aps.anl.gov}
    ASG(DEFAULT) {
        INPA(anj1:ao)
        RULE(1, READ)
        RULE(1, WRITE) {
            CALC("A=0")
        }
        RULE(1, WRITE) {
            CALC("A=1")
            HAG(tux)
        }
    }

Restarting ioc1 generates dbCa:exceptionCallback messages and various threads stop and start, but so far no crashes. I'll try again tomorrow running the IOCs on different workstations.

Revision history for this message

mdavidsaver (mdavidsaver) wrote on 2016-04-21:

When reading this report I was immediately suspicious of some connection to lp:1091401, which fixes a crash related to ACF and CA disconnects, though in a different situation. The fix for this bug first appeared in 3.14.12.4, which happens to be the version that Matt is using.

So far I've spent some time trying to understand the RSRV code in question, but haven't found any obvious candidates to cause linked list corruption.

Revision history for this message

mdavidsaver (mdavidsaver) wrote on 2016-04-26:

Matt, the next time you can cause this crash in a debugger place print the stack traces of all threads and attach as a file to this issue. With gdb run "thread apply all backtrace" or "thr app all bt". I'd like to know what the CAC thread is racing.

Also, when you stop the IOC serving "$(P):Scan:Active" is this a graceful shutdown, or do you terminate the process?

Revision history for this message

mdavidsaver (mdavidsaver) wrote on 2016-04-26:

The fix for was lp:1091401 to change the order so that asRemoveClient() is called before the channel is removed from a client's chanList. This was done in clear_channel_reply(), but not in destroyAllChannels(), where I think it is also needed.

Revision history for this message

mdavidsaver (mdavidsaver) wrote on 2016-04-26:

avoid double ellDelete Edit (3.4 KiB, text/plain)

If my thinking is correct then this patch should avoid a crash.

Revision history for this message

Matt Pearson (mrpeawork) wrote on 2016-04-27:

thread_backtrace.txt Edit (49.4 KiB, text/plain)

I still can't reliably reproduce it, but I've attached the output from "thread apply all backtrace".

Revision history for this message

mdavidsaver (mdavidsaver) wrote on 2016-04-27:

#10

Thanks Matt. This matches my thinking. So I think you will have even more trouble reproducing this crash with the patch applied :)

> Thread 1 (Thread 0x7f2a54eb7700 (LWP 4077)):
> #0 0x00007f2a6dfd9599 in ellDelete (pList=0x7f2a08000d60, pNode=0x7f2a0000d740) at ../../../src/libCom/ellLib/ellLib.c:87
> #1 0x00007f2a6ef3d289 in casAccessRightsCB (ascpvt=0x7f29ec001ab0, type=asClientCOAR) at ../camessage.c:1111
> #2 0x00007f2a6eb25122 in asComputePvt (asClientPvt=0x7f29ec001ab0) at ../asLibRoutines.c:1014
> #3 0x00007f2a6eb24ea0 in asComputeAsgPvt (pasg=0x10d9bb0) at ../asLibRoutines.c:940
> #4 0x00007f2a6eb23419 in asComputeAsg (pasg=0x10d9bb0) at ../asLibRoutines.c:455
> ...
> #9 0x00007f2a6e25b042 in cac::destroyIIU (this=0x7f2a2400ed20, iiu=...) at ../cac.cpp:1227
> #10 0x00007f2a6e2737e3 in tcpSendThread::run (this=0x7f2a18000a00) at ../tcpiiu.cpp:229
> #11 0x00007f2a6dfecc51 in epicsThreadCallEntryPoint (pPvt=0x7f2a18000a08) at ../../../src/libCom/osi/epicsThread.cpp:85
> ...

> Thread 30 (Thread 0x7f2a54db6700 (LWP 4012)):
> ...
> #5 0x00007f2a6dfede4f in epicsMutexLock (pmutexNode=0x10d9a60) at ../../../src/libCom/osi/epicsMutex.cpp:143
> #6 0x00007f2a6eb23118 in asRemoveClient (asClientPvt=0x7f2a0000d7b0) at ../asLibRoutines.c:391
> #7 0x00007f2a6ef39e8d in destroyAllChannels (client=0x7f2a08000d00, pList=0x7f2a08000d60) at ../caservertask.c:657
> #8 0x00007f2a6ef39fb4 in destroy_tcp_client (client=0x7f2a08000d00) at ../caservertask.c:688
> #9 0x00007f2a6ef3ae51 in camsgtask (pParm=0x7f2a08000d00) at ../camsgtask.c:164
> ...

By the time asRemoveClient() is called the channel has already been removed from chanList, but it's state has not been changed. So casAccessRightsCB() tries to ellDelete it again. My patch adds a new channel state rsrvCS_shutdown for when the channel is not in either of the two possible lists.

Revision history for this message

Matt Pearson (mrpeawork) wrote on 2016-04-27:

#11

I've set up two simple soft IOCs, and running them under procServ I can remotely reboot one of them via procServControl. Now I can reproduce it fairly easily, usually within a few minutes. I'll try the patch now.

Here are the two databases I'm using and the access control file:

IOC 1:

record(ai, "CG1D:Matt:Test1")
{
field(VAL, "1.0")
field(PINI, "YES")
}

IOC 2:

record(ai, "CG1D:Matt:Test2")
{
field(INP, "CG1D:Matt:Test1.VAL CP")
}

record(bo, "CG1D:Matt:Active")
{
field(VAL, "0")
field(PINI, "YES")
}

ACF:

HAG(beamline) { cg1d-dassrv1.ornl.gov, cg1d-dassrv1 }

UAG(epics) { controls, mkp }

ASG(DEFAULT)
{
INPA("$(P):Matt:Active")

RULE(1, READ)

    RULE(1, WRITE)
    {
        CALC("A=0")
    }

    RULE(1, WRITE)
    {
        UAG(epics)
        HAG(beamline)
        CALC("A=1")
    }
}

ASG(BEAMLINE)
{
RULE(1, READ)

    RULE(1, WRITE)
    {
        UAG(epics)
        HAG(beamline)
    }
}

ASG(ALWAYS)
{
RULE(1, READ)
RULE(1, WRITE)
}

Revision history for this message

Matt Pearson (mrpeawork) wrote on 2016-04-27:

#12

FYI, the shutdown was usually a SIGKILL via procServ.

I've applied the patch and currently testing. So far so good!

Before, I was able to generate the crash after a few reboots (perhaps up to 50).

Now I've done >600 reboots with no problem.

Revision history for this message

Andrew Johnson (anj) wrote on 2016-05-02:

#13

Michael, are you still working on this or can I apply your patch to the 3.14 branch?

Revision history for this message

mdavidsaver (mdavidsaver) wrote on 2016-05-02: Re: [Bug 1571224] Re: IOC segmentation fault related to CA security

#14

Download full text (4.2 KiB)

I'd like to get final confirmation from Matt before declaring victory.
On May 2, 2016 1:15 PM, "Andrew Johnson" <email address hidden> wrote:

> Michael, are you still working on this or can I apply your patch to the
> 3.14 branch?
>
> --
> You received this bug notification because you are subscribed to the bug
> report.
> https://bugs.launchpad.net/bugs/1571224
>
> Title:
> IOC segmentation fault related to CA security
>
> Status in EPICS Base:
> New
>
> Bug description:
> On 04/15/2016 05:27 PM, Pearson, Matthew R. wrote:
> >
> > Hi,
> >
> > I’ve had a few instances of one of my soft IOCs crashing with a
> segmentation fault when I shutdown another IOC that hosts a PV used as part
> of the CA security logic on the other crashed IOC.
> >
> > There is sometimes (but not always) this message printed out before
> the crash:
> >
> > dbCa:exceptionCallback stat "Virtual circuit disconnect" channel
> "unknown" context "cg1d-dassrv1.ornl.gov:5064"
> > nativeType DBR_invalid requestType DBR_invalid nativeCount 0
> requestCount 0 noReadAccess noWriteAccess
> >
> > Printing the stack trace:
> >
> > Core was generated by `../../bin/linux-x86_64/cg1d-parker1 ./st.cmd'.
> > Program terminated with signal 11, Segmentation fault.
> > #0 0x00007f5321218599 in ellDelete (pList=0x7f52bc000920,
> pNode=0x7f52ac008ec0) at ../../../src/libCom/ellLib/ellLib.c:87
> > 87 pNode->previous->next = pNode->next;
> > Missing separate debuginfos, use: debuginfo-install
> glibc-2.12-1.149.el6_6.9.x86_64 libgcc-4.4.7-11.el6.x86_64
> libstdc++-4.4.7-11.el6.x86_64 ncurses-libs-5.7-3.20090208.el6.x86_64
> readline-6.0-4.el6.x86_64
> > (gdb) bt
> > #0 0x00007f5321218599 in ellDelete (pList=0x7f52bc000920,
> pNode=0x7f52ac008ec0) at ../../../src/libCom/ellLib/ellLib.c:87
> > #1 0x00007f532217c289 in casAccessRightsCB (ascpvt=0x7f52b8000db8,
> type=asClientCOAR) at ../camessage.c:1111
> > #2 0x00007f5321d64122 in asComputePvt (asClientPvt=0x7f52b8000db8) at
> ../asLibRoutines.c:1014
> > #3 0x00007f5321d63ea0 in asComputeAsgPvt (pasg=0x1f91ee0) at
> ../asLibRoutines.c:940
> > #4 0x00007f5321d62419 in asComputeAsg (pasg=0x1f91ee0) at
> ../asLibRoutines.c:455
> > #5 0x00007f5321d60482 in connectCallback (arg=...) at ../asCa.c:99
> > #6 0x00007f53214c1301 in oldChannelNotify::disconnectNotify
> (this=0x7f52d0000d10, guard=...) at ../oldChannelNotify.cpp:112
> > #7 0x00007f53214abe30 in nciu::unresponsiveCircuitNotify
> (this=0x7f5323714010, cbGuard=..., guard=...) at ../nciu.cpp:171
> > #8 0x00007f53214b7c38 in tcpiiu::disconnectAllChannels
> (this=0x7f52d40008c0, cbGuard=..., guard=..., discIIU=...) at
> ../tcpiiu.cpp:1834
> > #9 0x00007f532149a042 in cac::destroyIIU (this=0x7f52d000ed20,
> iiu=...) at ../cac.cpp:1227
> > #10 0x00007f53214b27e3 in tcpSendThread::run (this=0x7f52d4000a00) at
> ../tcpiiu.cpp:229
> > #11 0x00007f532122bc51 in epicsThreadCallEntryPoint
> (pPvt=0x7f52d4000a08) at ../../../src/libCom/osi/epicsThread.cpp:85
> > #12 0x00007f53212333ce in start_routine (arg=0x7f52d400a250) at
> ../../../src/libCom/osi/os/posix/osdThread.c:385
> > #13 0x00007f53204a29d1 in start_thread (...

I'd like to get final confirmation from Matt before declaring victory.
On May 2, 2016 1:15 PM, "Andrew Johnson" <anj@aps.anl.gov> wrote:

> Michael, are you still working on this or can I apply your patch to the
> 3.14 branch?
>
> --
> You received this bug notification because you are subscribed to the bug
> report.
> https://bugs.launchpad.net/bugs/1571224
>
> Title:
>   IOC segmentation fault related to CA security
>
> Status in EPICS Base:
>   New
>
> Bug description:
>   On 04/15/2016 05:27 PM, Pearson, Matthew R. wrote:
>   >
>   > Hi,
>   >
>   > I’ve had a few instances of one of my soft IOCs crashing with a
> segmentation fault when I shutdown another IOC that hosts a PV used as part
> of the CA security logic on the other crashed IOC.
>   >
>   > There is sometimes (but not always) this message printed out before
> the crash:
>   >
>   > dbCa:exceptionCallback stat "Virtual circuit disconnect" channel
> "unknown" context "cg1d-dassrv1.ornl.gov:5064"
>   > nativeType DBR_invalid requestType DBR_invalid nativeCount 0
> requestCount 0 noReadAccess noWriteAccess
>   >
>   > Printing the stack trace:
>   >
>   > Core was generated by `../../bin/linux-x86_64/cg1d-parker1 ./st.cmd'.
>   > Program terminated with signal 11, Segmentation fault.
>   > #0  0x00007f5321218599 in ellDelete (pList=0x7f52bc000920,
> pNode=0x7f52ac008ec0) at ../../../src/libCom/ellLib/ellLib.c:87
>   > 87              pNode->previous->next = pNode->next;
>   > Missing separate debuginfos, use: debuginfo-install
> glibc-2.12-1.149.el6_6.9.x86_64 libgcc-4.4.7-11.el6.x86_64
> libstdc++-4.4.7-11.el6.x86_64 ncurses-libs-5.7-3.20090208.el6.x86_64
> readline-6.0-4.el6.x86_64
>   > (gdb) bt
>   > #0  0x00007f5321218599 in ellDelete (pList=0x7f52bc000920,
> pNode=0x7f52ac008ec0) at ../../../src/libCom/ellLib/ellLib.c:87
>   > #1  0x00007f532217c289 in casAccessRightsCB (ascpvt=0x7f52b8000db8,
> type=asClientCOAR) at ../camessage.c:1111
>   > #2  0x00007f5321d64122 in asComputePvt (asClientPvt=0x7f52b8000db8) at
> ../asLibRoutines.c:1014
>   > #3  0x00007f5321d63ea0 in asComputeAsgPvt (pasg=0x1f91ee0) at
> ../asLibRoutines.c:940
>   > #4  0x00007f5321d62419 in asComputeAsg (pasg=0x1f91ee0) at
> ../asLibRoutines.c:455
>   > #5  0x00007f5321d60482 in connectCallback (arg=...) at ../asCa.c:99
>   > #6  0x00007f53214c1301 in oldChannelNotify::disconnectNotify
> (this=0x7f52d0000d10, guard=...) at ../oldChannelNotify.cpp:112
>   > #7  0x00007f53214abe30 in nciu::unresponsiveCircuitNotify
> (this=0x7f5323714010, cbGuard=..., guard=...) at ../nciu.cpp:171
>   > #8  0x00007f53214b7c38 in tcpiiu::disconnectAllChannels
> (this=0x7f52d40008c0, cbGuard=..., guard=..., discIIU=...) at
> ../tcpiiu.cpp:1834
>   > #9  0x00007f532149a042 in cac::destroyIIU (this=0x7f52d000ed20,
> iiu=...) at ../cac.cpp:1227
>   > #10 0x00007f53214b27e3 in tcpSendThread::run (this=0x7f52d4000a00) at
> ../tcpiiu.cpp:229
>   > #11 0x00007f532122bc51 in epicsThreadCallEntryPoint
> (pPvt=0x7f52d4000a08) at ../../../src/libCom/osi/epicsThread.cpp:85
>   > #12 0x00007f53212333ce in start_routine (arg=0x7f52d400a250) at
> ../../../src/libCom/osi/os/posix/osdThread.c:385
>   > #13 0x00007f53204a29d1 in start_thread () from /lib64/libpthread.so.0
>   > #14 0x00007f53207a08fd in clone () from /lib64/libc.so.6
>   >
>   >
>   > The crashed IOC has a CA access security rule that looks like:
>   >
>   > ASG(DEFAULT)
>   > {
>   >     INPA("$(P):Scan:Active")
>   >     RULE(1, READ)
>   >     RULE(1, WRITE)
>   >     {
>   >         CALC("A=0")
>   >     }
>   >     RULE(1, WRITE)
>   >     {
>   >         UAG(epics, beamline, detector)
>   >         HAG(beamline)
>   >         CALC("A=1")
>   >     }
>   > }
>   >
>   > where $(P):Scan:Active is hosted by the IOC that I’m shutting down.
>   >
>   > In addition there is a channel access link between the two IOCs
> involving a CP link.
>   >
>   > I can’t reliably reproduce it, but it’s happened a few times today as
> I was testing it (stopping and starting the IOC hosting $(P):Scan:Active
> perhaps 20 times).
>   >
>   > Anybody have ideas about this?
>   >
>   > Our base version is 3.14.12.4 running on RHEL6.
>
> To manage notifications about this bug go to:
> https://bugs.launchpad.net/epics-base/+bug/1571224/+subscriptions
>

Revision history for this message

Matt Pearson (mrpeawork) wrote on 2016-05-02:

#15

It certainly fixed the IOC crash on reboot issue, and I didn't see any other problems.

I left my test script running overnight, and it got through about 7000 reboots without any crash.

Thanks for thinking about this and developing a patch Michael.

Revision history for this message

mdavidsaver (mdavidsaver) wrote on 2016-05-02:

#16

Committed as http://bazaar.launchpad.net/~epics-core/epics-base/3.14/revision/12632

Going forward it would be good to consolidate the channel cleanup sequences in clear_channel_reply() and destroyAllChannels() into one to avoid this sort of situation in future.