From Emma Shepherd:
I have come across a problem on an R3.14.8.2 IOC that is affecting Channel Access links: some records are in LINK ERROR and others have CP links that fail to update. When we started investigating, we found that the CAC-TCP-recv task was in the SUSPEND+I state and that the following messages had been printed to the console:
BL18I-MO-IOC-01.diamond.ac.uk:1 Wed Aug 15 16:37:26 2007 CAC-TCP-recv: A call to "assert (pca->pgetNative)" failed in ../dbCa.c at 629
BL18I-MO-IOC-01.diamond.ac.uk:1 Wed Aug 15 16:37:26 2007 Current time WED AUG 15 2007 15:37:23.708349950.
BL18I-MO-IOC-01.diamond.ac.uk:1 Wed Aug 15 16:37:26 2007 EPICS Release EPICS R3.14.8.2 $R3-14-8-2$ $2006/01/06 15:55:13$.
BL18I-MO-IOC-01.diamond.ac.uk:1 Wed Aug 15 16:37:26 2007 Please E-mail this message and the output from "tt (0x1e0ff9e0)"
BL18I-MO-IOC-01.diamond.ac.uk:1 Wed Aug 15 16:37:26 2007 to the author or to <email address hidden>
Here is the task trace:
BL18I-MO-IOC-01 -> tt 0x1e0ff9e0
231ff8 vxTaskEntry +68 : 1e8cb6e4 ()
1e8cb754 epicsThreadPrivateGet+f8 : epicsThreadCallEntryPoint ()
1e8bd048 epicsThreadCallEntryPoint+15c: 1e88b718 (1)
1e88b718 tcpRecvThread::run(void)+990: 1e88e78c ()
1e88e78c tcpiiu::processIncoming(epicsTime const &, callbackManager &)+408: cac::executeResponse(callbackManager &, tcpiiu &, epicsTime const &, caHdrLargeArray &, char *) ()
1e87a588 cac::executeResponse(callbackManager &, tcpiiu &, epicsTime const &, caHdrLargeArray &, char *)+bc : cac::eventRespAction(callbackManager &, tcpiiu &, epicsTime const &, caHdrLargeArray const &, void *) ()
1e875fc8 cac::eventRespAction(callbackManager &, tcpiiu &, epicsTime const &, caHdrLargeArray const &, void *)+194: netSubscription::completion(epicsGuard<epicsMutex> &, cacRecycle &, unsigned int, unsigned long, void const *) ()
1e89a364 netSubscription::completion(epicsGuard<epicsMutex> &, cacRecycle &, unsigned int, unsigned long, void const *)+84 : oldSubscription::current(epicsGuard<epicsMutex> &, unsigned int, unsigned long, void const *) ()
1e855ff4 oldSubscription::current(epicsGuard<epicsMutex> &, unsigned int, unsigned long, void const *)+104: 1e815434 ()
1e8156d0 dbCaGetUnits +790: epicsAssert ()
1e8c9a5c epicsAssert +154: epicsThreadSuspendSelf ()
1e8cb010 epicsThreadSuspendSelf+2c : taskSuspend ()
value = 0 = 0x0
Any ideas what could have caused this?
Original Mantis Bug: mantis-299
http://www.aps.anl.gov/epics/mantis/view_bug_page.php?f_id=299
It is difficult at this point to isolate this to a subsystem. The failed assert in dbCa.c points first to a logic error in the database CA link code, or to a race condition, possibly a data structure being used after it was deleted. Alternatively, this might be more general memory corruption, or a failure in another subsystem (possibly the CA client library). I am not intimately familiar with the dbCa.c code, so this may require some time spent looking at the sources.
Have you seen this occur more than once?
If the problem is repeatable, is it possible to reproduce it with a small database and a well-defined recipe of external circumstances? If it is repeatable, but not with a small database, you might still obtain further details (a stack trace with arguments and possibly the contents of related data structures) by building base for debugging and then attaching to the crashed thread using the Tornado debugger.
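
To make the race-condition hypothesis above concrete, here is a minimal sketch of one ordering that would leave a buffer pointer NULL when a monitor event arrives. This is not the dbCa.c code and not a diagnosis: `demo_link`, `demo_connect`, `demo_disconnect` and `demo_event` are hypothetical names invented for illustration, and only the field name `pgetNative` is borrowed from the assert message. The real code asserts at this point; the sketch returns -1 instead so the failing path can be observed without suspending the caller.

```c
#include <stdlib.h>

/* Hypothetical stand-in for a CA link's private data (not the real caLink). */
typedef struct {
    double *pgetNative;  /* buffer for the most recent monitored value */
    int     connected;
} demo_link;

/* Hypothetical connection handler: allocates the value buffer. */
static void demo_connect(demo_link *l)
{
    l->pgetNative = calloc(1, sizeof *l->pgetNative);
    l->connected = (l->pgetNative != NULL);
}

/* Hypothetical teardown: releases the buffer and clears the pointer. */
static void demo_disconnect(demo_link *l)
{
    free(l->pgetNative);
    l->pgetNative = NULL;
    l->connected = 0;
}

/*
 * Hypothetical monitor callback. Returns 0 if the value was stored, or -1
 * where a check such as assert(pca->pgetNative) would have failed.
 */
static int demo_event(demo_link *l, double value)
{
    if (l->pgetNative == NULL)
        return -1;       /* event delivered with no buffer in place */
    *l->pgetNative = value;
    return 0;
}
```

Calling `demo_connect`, then `demo_event`, stores the value and returns 0. Calling `demo_disconnect` and then `demo_event` again returns -1, which is the situation an event callback would be in if it ran after the link's buffer had been released, or before it had been allocated.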