'assert (pca->pgetNative)' failed in ../dbCa.c

Bug #541221 reported by Jeff Hill on 2004-12-01
12
This bug affects 1 person
Affects Status Importance Assigned to Milestone
EPICS Base
Status tracked in 7.0
3.14
Undecided
Unassigned
3.15
Undecided
Unassigned
3.16
Undecided
Unassigned
7.0
High
Andrew Johnson

Bug Description

From Michael Abbott [michael atsign araneidae.co.uk]

A call to "assert (pca->pgetNative)" failed in ../dbCa.c line 625.
EPICS Release EPICS R3.14.6 $R3-14-6$ $2004/05/28 19:27:47$.
Current time Fri Nov 19 2004 08:59:06.536066000.
Please E-mail this message to the author or to <email address hidden>
Calling epicsThreadSuspendSelf()

Steps to reproduce:
Actually, it was pretty simple: the IOC that crashed was running one
record, a sub array (with data type FLOAT) pointing to an array on another
IOC (with data type USHORT). I restarted the second IOC, changing the
data type of the array to FLOAT: *!bang!*

Version: R3.14.9

Original Mantis Bug: mantis-164
    http://www.aps.anl.gov/epics/mantis/view_bug_page.php?f_id=164

Jeff Hill (johill-lanl) wrote :

From Emma Shepherd:

I have come across a problem on an R3.14.8.2 IOC that is affecting channel access links - some records are in LINK ERROR and others have CP links that fail to update. When we started investigating we found that the CAC-TCP-recv task was in SUSPEND+I state, and the following messages had been printed to the console:

BL18I-MO-IOC-01.diamond.ac.uk:1 Wed Aug 15 16:37:26 2007 CAC-TCP-recv: A call to "assert (pca->pgetNative)" failed in ../dbCa.c at 629
BL18I-MO-IOC-01.diamond.ac.uk:1 Wed Aug 15 16:37:26 2007 Current time WED AUG 15 2007 15:37:23.708349950.
BL18I-MO-IOC-01.diamond.ac.uk:1 Wed Aug 15 16:37:26 2007 EPICS Release EPICS R3.14.8.2 $R3-14-8-2$ $2006/01/06 15:55:13$.
BL18I-MO-IOC-01.diamond.ac.uk:1 Wed Aug 15 16:37:26 2007 Please E-mail this message and the output from "tt (0x1e0ff9e0)"
BL18I-MO-IOC-01.diamond.ac.uk:1 Wed Aug 15 16:37:26 2007 to the author or to <email address hidden>

Here is the task trace:

BL18I-MO-IOC-01 -> tt 0x1e0ff9e0
231ff8 vxTaskEntry +68 : 1e8cb6e4 ()
1e8cb754 epicsThreadPrivateGet+f8 : epicsThreadCallEntryPoint ()
1e8bd048 epicsThreadCallEntryPoint+15c: 1e88b718 (1)
1e88b718 tcpRecvThread::run(void)+990: 1e88e78c () 1e88e78c tcpiiu::processIncoming(epicsTime const &, callbackManager
&)+408: cac::executeResponse(callbackManager &, tcpiiu &, epicsTime const &, caHdrLargeArray &, char *) ()
1e87a588 cac::executeResponse(callbackManager &, tcpiiu &, epicsTime const &, caHdrLargeArray &, char *)+bc : cac ::eventRespAction(callbackManager &, tcpiiu &, epicsTime const &, caHdrLargeArray const &, void *) ()
1e875fc8 cac::eventRespAction(callbackManager &, tcpiiu &, epicsTime const &, caHdrLargeArray const &, void *)+19 4:
netSubscription::completion(epicsGuard<epicsMutex> &, cacRecycle &, unsigned int, unsigned long, void const *) ()
1e89a364 netSubscription::completion(epicsGuard<epicsMutex> &, cacRecycle &, unsigned int, unsigned long, void co nst *)+84 :
oldSubscription::current(epicsGuard<epicsMutex> &, unsigned int, unsigned long, void const *) ()
1e855ff4 oldSubscription::current(epicsGuard<epicsMutex> &, unsigned int, unsigned long, void const *)+104: 1e815 434 ()
1e8156d0 dbCaGetUnits +790: epicsAssert ()
1e8c9a5c epicsAssert +154: epicsThreadSuspendSelf ()
1e8cb010 epicsThreadSuspendSelf+2c : taskSuspend () value = 0 = 0x0

Any ideas what could have caused this?

Jeff Hill (johill-lanl) wrote :

From Emma Shepherd:

I've seen it occur at least twice (on separate but similarly configured IOCs), but haven't yet figured out how to reproduce it. If I do I will let you know and try to get some more debugging information.

Jeff Hill (johill-lanl) wrote :

the steps to reproduce this can be seen only under the "advanced view"

Andrew Johnson (anj) wrote :

Confirmed bug still present in R3.14.9:

#5 0x00c04850 in epicsAssert (pFile=0x944f3f "../dbCa.c", line=656,
    pExp=0x944f71 "pca->pgetNative", pAuthorName=0xc15973 "the author")
    at ../../../src/libCom/osi/os/default/osdAssert.c:71
        current = {secPastEpoch = 585681673, nsec = 843834202}
        date = "Wed Jul 23 2008 12:21:13.843834202"
        status = <value optimized out>
#6 0x0093ad01 in eventCallback (
        arg = {
            usr = 0x8528738,
            chid = 0x857c430,
            type = 19,
            count = 10,
            dbr = 0x85c7da0,
            status = 1
        }) at ../dbCa.c:656
        pca = (caLink *) 0x8528738
        plink = (DBLINK *) 0x8528400
        size = 40
        precord = (dbCommon *) 0x85282c0
        monitor = (dbCaCallback) 0
        userPvt = (void *) 0x0

(gdb) print *pca
$4 = {
    node = {next = 0x0, previous = 0x0},
    lock = 0x8528820,
    plink = 0x8528400,
    pvname = 0x8528838 "anjHost:array",
    chid = 0x857c430,
    link_action = 0,
    sevr = 3,
    timeStamp = {secPastEpoch = 585681666, nsec = 713450592},
    dbrType = 2,
    nelements = 10,
    hasReadAccess = 1 '\001',
    hasWriteAccess = 1 '\001',
    isConnected = 1 '\001',
    gotFirstConnection = 1 '\001',
    connect = 0,
    monitor = 0,
    userPvt = 0x0,
    putType = 0,
    putCallback = 0,
    putUserPvt = 0x0,
    plinkPutCallback = 0x0,
    gotAttributes = 1 '\001',
    getAttributes = 0,
    getAttributesPvt = 0x0,
    controlLimits = {0, 0},
    displayLimits = {0, 0},
    alarmLimits = {0, 0, 0, 0},
    precision = 0,
    units = "",
    pgetNative = 0x0,
    pgetString = 0x0,
    pputNative = 0x0,
    pputString = 0x0,
    gotInNative = 0 '\0',
    gotInString = 0 '\0',
    gotOutNative = 0 '\0',
    gotOutString = 0 '\0',
    newOutNative = 0 '\0',
    newOutString = 0 '\0',
    nDisconnect = 2,
    nNoWrite = 0
}
(gdb) print *plink
$5 = {
    value = {
        ...,
        pv_link = {
            pvname = 0x8528490 "anjHost:array",
            precord = 0x85282c0,
            pvt = 0x8528738,
            getCvt = 0,
            pvlMask = 8,
            lastGetdbrType = 0},
        ...,
    },
    type = 11
}

Here's what I think is happening: When a dbCa link reconnects with a different type or element count, the connectionCallback() routine frees the original buffer that was allocated to store the incoming data. Usually the buffer will be pca->pgetNative but it could be pca->pgetString instead. There is a comment in the above code that says "Let next dbCaGetLink and/or dbCaPutLink determine options" which seems reasonable, but it doesn't take account of the possibility that a monitor event could come in before the record next gets processed. The above assertion in the eventCallback() routine demonstrates that this is what has happened.

Andrew Johnson (anj) wrote :

Here's the bug: If the native field type changes, we never clear the old native subscription, so after fixing the buffer allocation properly the eventCallback() routine is still getting called with the data from the old subscription type.

Either we need to be less clever (i.e. forget the whole 'native subscription' thing and just request the type that we're looking for), or we could just forget the fact that the data type changed underneath us and keep using the old subscription and buffer [not good if the array size gets bigger because we'll never see the larger array], or else we have to clear the old subscription and create a new one whenever the data type or array size changes. Getting the latter synchronized properly through the dbCa task is not going to be easy, so I'm more inclined to simplify the whole thing.

edited on: 2008-07-24 17:31

Andrew Johnson (anj) wrote :

The fix for this bug is too extensive to be fixed in the imminent R3.14.10 release, I'm going to stop working on it for now and come back afterwards.

Andrew Johnson (anj) on 2010-03-24
Changed in epics-base:
assignee: nobody → Andrew Johnson (anj)
Andrew Johnson (anj) wrote :

This patch seems to fix the problem for me against the 3.14 branch, and it's simple enough that it should merge up to the 7.0 branch without too much trouble.

Changed in epics-base:
status: Confirmed → In Progress
To post a comment you must log in.
This report contains Public information  Edit
Everyone can see this information.

Duplicates of this bug

Other bug subscribers