'assert (pca->pgetNative)' failed in ../dbCa.c at 629

Bug #541329 reported by Jeff Hill
This bug affects 1 person
Affects      Status    Importance  Assigned to  Milestone
EPICS Base   Invalid   Wishlist    Jeff Hill

Bug Description

From Emma Shepherd:

I have come across a problem on an R3.14.8.2 IOC that is affecting Channel Access links: some records are in LINK ERROR and others have CP links that fail to update. When we started investigating, we found that the CAC-TCP-recv task was in SUSPEND+I state and that the following messages had been printed to the console:

BL18I-MO-IOC-01.diamond.ac.uk:1 Wed Aug 15 16:37:26 2007 CAC-TCP-recv: A call to "assert (pca->pgetNative)" failed in ../dbCa.c at 629
BL18I-MO-IOC-01.diamond.ac.uk:1 Wed Aug 15 16:37:26 2007 Current time WED AUG 15 2007 15:37:23.708349950.
BL18I-MO-IOC-01.diamond.ac.uk:1 Wed Aug 15 16:37:26 2007 EPICS Release EPICS R3.14.8.2 $R3-14-8-2$ $2006/01/06 15:55:13$.
BL18I-MO-IOC-01.diamond.ac.uk:1 Wed Aug 15 16:37:26 2007 Please E-mail this message and the output from "tt (0x1e0ff9e0)"
BL18I-MO-IOC-01.diamond.ac.uk:1 Wed Aug 15 16:37:26 2007 to the author or to <email address hidden>

Here is the task trace:

BL18I-MO-IOC-01 -> tt 0x1e0ff9e0
231ff8 vxTaskEntry +68 : 1e8cb6e4 ()
1e8cb754 epicsThreadPrivateGet+f8 : epicsThreadCallEntryPoint ()
1e8bd048 epicsThreadCallEntryPoint+15c: 1e88b718 (1)
1e88b718 tcpRecvThread::run(void)+990: 1e88e78c ()
1e88e78c tcpiiu::processIncoming(epicsTime const &, callbackManager &)+408: cac::executeResponse(callbackManager &, tcpiiu &, epicsTime const &, caHdrLargeArray &, char *) ()
1e87a588 cac::executeResponse(callbackManager &, tcpiiu &, epicsTime const &, caHdrLargeArray &, char *)+bc : cac::eventRespAction(callbackManager &, tcpiiu &, epicsTime const &, caHdrLargeArray const &, void *) ()
1e875fc8 cac::eventRespAction(callbackManager &, tcpiiu &, epicsTime const &, caHdrLargeArray const &, void *)+194: netSubscription::completion(epicsGuard<epicsMutex> &, cacRecycle &, unsigned int, unsigned long, void const *) ()
1e89a364 netSubscription::completion(epicsGuard<epicsMutex> &, cacRecycle &, unsigned int, unsigned long, void const *)+84 : oldSubscription::current(epicsGuard<epicsMutex> &, unsigned int, unsigned long, void const *) ()
1e855ff4 oldSubscription::current(epicsGuard<epicsMutex> &, unsigned int, unsigned long, void const *)+104: 1e815434 ()
1e8156d0 dbCaGetUnits +790: epicsAssert ()
1e8c9a5c epicsAssert +154: epicsThreadSuspendSelf ()
1e8cb010 epicsThreadSuspendSelf+2c : taskSuspend ()
value = 0 = 0x0

Any ideas what could have caused this?

Original Mantis Bug: mantis-299
    http://www.aps.anl.gov/epics/mantis/view_bug_page.php?f_id=299

Tags: db 3.14
Jeff Hill (johill-lanl) wrote :

It's difficult at this point to isolate this to a subsystem. The assertion failure in dbCa.c initially points to a logic error in the database CA link code, or to a race condition, possibly a data structure that is being used after it was deleted. It could also be generalized corruption, or a failure in another subsystem (possibly the CA client library). I am not intimately familiar with the dbCa.c code, so this may require some time spent looking at the sources.

Have you seen this occur more than once?

If the problem is repeatable, is it possible to reproduce it with a small database along with a well-defined recipe of external circumstances? If the problem is repeatable, but not with a small database, you might also obtain further details (a stack trace with arguments and possibly the contents of related data structures) by building base for debugging and then attaching to the crashed thread using the Tornado debugger.

Jeff Hill (johill-lanl) wrote :

From Emma Shepherd:

I've seen it occur at least twice (on separate but similarly configured IOCs), but haven't yet figured out how to reproduce it. If I do, I will let you know and try to get some more debugging information.

Jeff Hill (johill-lanl) wrote :

From Andrew:

Michael Abbott reported a failure of the same assertion back in 2004; mantis-164 describes what he was doing. Is there any chance that someone changed the data type of a waveform or genSub array field that a record in this IOC was linking to? That seemed to be the trigger in his case.
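
For readers unfamiliar with this failure mode, the following is a minimal sketch of the kind of sequence being asked about. It is not the dbCa.c source: demoLink and the demo* functions are hypothetical names, and the sketch simply assumes that a link's native-type buffer is released when the remote field's type changes. Under that assumption, a monitor update that arrives before a new buffer exists would trip an assertion like the one reported, whereas an explicit check lets the stale update be ignored.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-in for a CA link; NOT the real caLink from dbCa.c. */
typedef struct {
    void  *pgetNative;   /* buffer holding the latest value in native type */
    size_t nativeSize;   /* bytes allocated at pgetNative */
    int    gotInNative;  /* set once a value has been copied in */
} demoLink;

/* Connection established: allocate a buffer sized for the remote field. */
static int demoConnect(demoLink *pl, size_t elemSize, size_t count)
{
    pl->pgetNative  = calloc(count, elemSize);
    pl->nativeSize  = pl->pgetNative ? elemSize * count : 0;
    pl->gotInNative = 0;
    return pl->pgetNative != NULL;
}

/* Assumed behaviour: the remote type changed, so the old buffer is dropped. */
static void demoTypeChanged(demoLink *pl)
{
    free(pl->pgetNative);
    pl->pgetNative  = NULL;
    pl->nativeSize  = 0;
    pl->gotInNative = 0;
}

/* Monitor update. Returns 1 if the value was copied, 0 if the update was
 * ignored because no suitable buffer exists. An assert(pl->pgetNative)
 * here would instead stop the calling thread on a stale update. */
static int demoEvent(demoLink *pl, const void *data, size_t size)
{
    if (pl->pgetNative == NULL || size > pl->nativeSize)
        return 0;
    memcpy(pl->pgetNative, data, size);
    pl->gotInNative = 1;
    return 1;
}
```

Calling demoConnect, then demoEvent, then demoTypeChanged, then demoEvent again walks through the suspected sequence: the first update is stored and the second is dropped rather than asserting. Whether the real code can reach this state is exactly the open question in this report.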

Andrew Johnson (anj) wrote :

R3.14.10 released.
