ipAddrToAscii immediate callback deadlock

Bug #541293 reported by Jeff Hill
Affects:      EPICS Base
Status:       Fix Released
Importance:   High
Assigned to:  Jeff Hill

Bug Description

From Kay:

I'm looking at the hangup with multiply defined PVs on a 32-bit machine now, with the basic 'camonitor', so this is nothing specific to 64-bit or the archiver code.

I simply started the same softIoc twice. It has a database created like this:
for ($i=0; $i<500; ++$i)
{
    print "record(calc, \"count$i\")\n";
    print "{\n";
    print "    field(SCAN, \"1 second\")\n";
    print "    field(INPA, \"count$i\")\n";
    print "    field(CALC, \"A+1\")\n";
    print "}\n";
}
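
For reference, one way to generate the database and start the two colliding IOCs (a sketch only; the file names here are made up, and it assumes the softIoc from this EPICS Base build is on the PATH):

    perl gendb.pl > count.db     # gendb.pl is the generator shown above
    softIoc -d count.db &        # first instance
    softIoc -d count.db &        # second instance serving the same PV names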

When I run

    camonitor count0 count1 count2 count3

I get warnings about the multiple PVs, but then the updates arrive OK.

But when I monitor more, as the archive engine would do:
  perl >test
  print "camonitor ";
  for ($i=0; $i<500; ++$i) { print "count$i "; }
  print "n";
  ^D
  sh ./test

... the camonitor hangs. Sometimes immediately, sometimes after a bunch of warnings about identical PVs. But always after a few seconds.

I didn't see the problem with R3.14.7.
I do see it with the basic R3.14.8.2.
When attaching the debugger, there's not much info:

(gdb) thread 1
[Switching to thread 1 (Thread -1222069600 (LWP 21363))]#0
0xb75326e1 in __lll_mutex_lock_wait () from /lib/tls/libpthread.so.0
(gdb) bt
#0 0xb75326e1 in __lll_mutex_lock_wait () from /lib/tls/libpthread.so.0
#1 0xb752f7a0 in _L_mutex_lock_78 () from /lib/tls/libpthread.so.0
#2 0xb758b850 in __JCR_LIST__ () from /ade/epics/supTop/base/R3.14.8.2/lib/linux-x86/libCom.so
#3 0x0804da14 in ?? ()
#4 0xbfffa598 in ?? ()
#5 0xb7579375 in epicsMutexOsdLock (pmutex=0xb758b850) at ../../../src/libCom/osi/os/posix/osdMutex.c:44
Previous frame identical to this frame (corrupt stack?)
(gdb) thread 2
[Switching to thread 2 (Thread -1225532496 (LWP 21371))]#0
0xb75301fb in pthread_cond_wait@@GLIBC_2.3.2 () from /lib/tls/libpthread.so.0
(gdb) bt
#0 0xb75301fb in pthread_cond_wait@@GLIBC_2.3.2 () from /lib/tls/libpthread.so.0
#1 0xb7579786 in epicsEventWait (pevent=0xb700a8d0) at ../../../src/libCom/osi/os/posix/osdEvent.c:75
#2 0xb7566fac in errlogThread () at ../../../src/libCom/error/errlog.c:468
#3 0xb7578d55 in start_routine (arg=0xb700afc0) at ../../../src/libCom/osi/os/posix/osdThread.c:320
#4 0xb752ddec in start_thread () from /lib/tls/libpthread.so.0
#5 0xb736ce8a in clone () from /lib/tls/libc.so.6
(gdb) thread 3
[Switching to thread 3 (Thread -1224737872 (LWP 21368))]#0
0xb75301fb in pthread_cond_wait@@GLIBC_2.3.2 () from /lib/tls/libpthread.so.0
(gdb) bt
#0 0xb75301fb in pthread_cond_wait@@GLIBC_2.3.2 () from /lib/tls/libpthread.so.0
#1 0xb7579786 in epicsEventWait (pevent=0xb70045f8) at ../../../src/libCom/osi/os/posix/osdEvent.c:75
#2 0xb757320a in epicsEvent::wait (this=0xfffffffc) at ../../../src/libCom/osi/epicsEvent.cpp:63
#3 0xb75c096a in tcpSendThread::run (this=0x8085528) at ../tcpiiu.cpp:85
#4 0xb75715e4 in epicsThreadCallEntryPoint (pPvt=0x808552c) at ../../../src/libCom/osi/epicsThread.cpp:59
#5 0xb7578d55 in start_routine (arg=0xb70004d0) at ../../../src/libCom/osi/os/posix/osdThread.c:320
#6 0xb752ddec in start_thread () from /lib/tls/libpthread.so.0
#7 0xb736ce8a in clone () from /lib/tls/libc.so.6
(gdb) thread 4
[Switching to thread 4 (Thread -1222997072 (LWP 21367))]#0
0xb75326e1 in __lll_mutex_lock_wait () from /lib/tls/libpthread.so.0
(gdb) bt
#0 0xb75326e1 in __lll_mutex_lock_wait () from /lib/tls/libpthread.so.0
#1 0xb752f7a0 in _L_mutex_lock_78 () from /lib/tls/libpthread.so.0
#2 0xb758b850 in __JCR_LIST__ () from /ade/epics/supTop/base/R3.14.8.2/lib/linux-x86/libCom.so
#3 0x0804d9dc in ?? ()
#4 0xb71a8548 in ?? ()
#5 0xb7579375 in epicsMutexOsdLock (pmutex=0xb758b850) at ../../../src/libCom/osi/os/posix/osdMutex.c:44
Previous frame identical to this frame (corrupt stack?)
(gdb) thread 5
[Switching to thread 5 (Thread -1222730832 (LWP 21366))]#0
0xb75326e1 in __lll_mutex_lock_wait () from /lib/tls/libpthread.so.0
(gdb) bt
#0 0xb75326e1 in __lll_mutex_lock_wait () from /lib/tls/libpthread.so.0
#1 0xb752f7a0 in _L_mutex_lock_78 () from /lib/tls/libpthread.so.0
#2 0xb758b850 in __JCR_LIST__ () from /ade/epics/supTop/base/R3.14.8.2/lib/linux-x86/libCom.so
#3 0x0804da14 in ?? ()
#4 0xb71e9198 in ?? ()
#5 0xb7579375 in epicsMutexOsdLock (pmutex=0xb758b850) at ../../../src/libCom/osi/os/posix/osdMutex.c:44
Previous frame identical to this frame (corrupt stack?)
(gdb) thread 6
[Switching to thread 6 (Thread -1222206544 (LWP 21365))]#0
0xb75326e1 in __lll_mutex_lock_wait () from /lib/tls/libpthread.so.0
(gdb) bt
#0 0xb75326e1 in __lll_mutex_lock_wait () from /lib/tls/libpthread.so.0
#1 0xb752f7a0 in _L_mutex_lock_78 () from /lib/tls/libpthread.so.0
#2 0xb758b850 in __JCR_LIST__ () from /ade/epics/supTop/base/R3.14.8.2/lib/linux-x86/libCom.so
#3 0x0804d9dc in ?? ()
#4 0xb72696f8 in ?? ()
#5 0xb7579375 in epicsMutexOsdLock (pmutex=0xb758b850) at ../../../src/libCom/osi/os/posix/osdMutex.c:44
Previous frame identical to this frame (corrupt stack?)
(gdb) thread 7
[Switching to thread 7 (Thread -1222071376 (LWP 21364))]#0
0xb75326e1 in __lll_mutex_lock_wait () from /lib/tls/libpthread.so.0
(gdb) bt
#0 0xb75326e1 in __lll_mutex_lock_wait () from /lib/tls/libpthread.so.0
#1 0xb752f7a0 in _L_mutex_lock_78 () from /lib/tls/libpthread.so.0
#2 0xb758b850 in __JCR_LIST__ () from /ade/epics/supTop/base/R3.14.8.2/lib/linux-x86/libCom.so
#3 0x0804d9dc in ?? ()
#4 0xb728a5f8 in ?? ()
#5 0xb7579375 in epicsMutexOsdLock (pmutex=0xb758b850) at ../../../src/libCom/osi/os/posix/osdMutex.c:44
Previous frame identical to this frame (corrupt stack?)

So they're almost all waiting and have a corrupted stack.
I only get errors when trying to print the "pmutex".

When switching to the 64-bit computer, with tsFreeList in EPICS_FREELIST_DEBUG mode and the recent Mantis patches for the ipAddrToAsciiAsync.cpp mutex and for ca/searchTimer.cpp and tcpRecvWatchdog.cpp, there is a bit more detail.
I'll pack that in a follow-up email.

Original Mantis Bug: mantis-260
    http://www.aps.anl.gov/epics/mantis/view_bug_page.php?f_id=260

Tags: ca 3.14
Revision history for this message
Jeff Hill (johill-lanl) wrote :

I see that the primary mutex is held when initiating the IO for the IP address to ASCII translation. Unfortunately, the ioInitiate function will call the callback directly if the queue quota is exceeded. Since this callback takes the callback mutex, we can have a lock inversion and therefore a deadlock.
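
To make the inversion concrete, here is a minimal standalone sketch in plain C++ (std::mutex and std::thread stand in for the EPICS primitives; none of these names come from cac.cpp or the ipAddrToAscii code). Thread A takes the primary lock and then, because the queue quota is exceeded, runs the callback synchronously, which needs the callback lock; thread B takes the two locks in the opposite order, so each thread can end up waiting for the lock the other holds:

    #include <mutex>
    #include <thread>

    std::mutex primaryMutex;   // stands in for the CA client's primary mutex
    std::mutex callbackMutex;  // stands in for the callback mutex

    // Callback that the IO-initiate path may run immediately when its queue is full.
    void immediateCallback ()
    {
        std::lock_guard < std::mutex > cb ( callbackMutex );
        // ... format the "multiply defined PV" warning ...
    }

    void threadA ()
    {
        // Holds the primary mutex, then initiates the address-to-name IO.
        std::lock_guard < std::mutex > primary ( primaryMutex );
        // Queue quota exceeded: the callback runs right here, so callbackMutex
        // is requested while primaryMutex is still held.
        immediateCallback ();
    }

    void threadB ()
    {
        // Ordinary callback dispatch takes callbackMutex first,
        // then needs the primary mutex.
        std::lock_guard < std::mutex > cb ( callbackMutex );
        std::lock_guard < std::mutex > primary ( primaryMutex );
    }

    int main ()
    {
        std::thread a ( threadA ), b ( threadB );
        a.join ();
        b.join ();   // with unlucky timing neither join returns: deadlock
    }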

Revision history for this message
Jeff Hill (johill-lanl) wrote :

Here is the fix that I am testing:

cvs diff -u -wb -i -- cac.cpp (in directory D:users/hill/R3.14.fix)
Index: cac.cpp
===================================================================
RCS file: /net/phoebus/epicsmgr/cvsroot/epics/base/src/ca/cac.cpp,v
retrieving revision 1.180.2.33
diff -c -u -w -b -i -r1.180.2.33 cac.cpp
cvs diff: conflicting specifications of output style
--- cac.cpp 13 Feb 2006 16:33:46 -0000 1.180.2.33
+++ cac.cpp 30 May 2006 19:32:10 -0000
@@ -497,6 +497,12 @@
         if ( ! sockAddrAreIdentical ( &addr, &chanAddr ) ) {
             char acc[64];
             pChan->getPIIU(guard)->getHostName ( guard, acc, sizeof ( acc ) );
+ // It is possible for the ioInitiate call below to
+ // call the callback directly if queue quota is exceeded.
+ // This callback takes the callback lock and therefore we
+ // must release the primary mutex here to avoid a lock
+ // hierarchy inversion.
+ epicsGuardRelease < epicsMutex > unguard ( guard );
             msgForMultiplyDefinedPV * pMsg = new ( this->mdpvFreeList )
                 msgForMultiplyDefinedPV ( this->ipToAEngine,
                     *this, pChan->pName ( guard ), acc );

***** CVS exited normally with code 1 *****

edited on: 2006-05-30 14:03
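
As background on the patch above (a generic sketch assuming the usual epicsGuard/epicsGuardRelease semantics from libCom; the class and function names below are illustrative, not taken from cac.cpp): constructing an epicsGuardRelease inside a scope temporarily unlocks the mutex held by the enclosing epicsGuard and re-locks it when the release object goes out of scope, so callback-prone work can run without the primary lock held.

    #include "epicsMutex.h"
    #include "epicsGuard.h"

    struct example {                 // illustrative only, not the real cac class
        epicsMutex primaryMutex;

        void callbackProneWork ( epicsGuard < epicsMutex > & guard )
        {
            {
                // Temporarily drop the primary mutex around code that may
                // invoke a callback taking another lock.
                epicsGuardRelease < epicsMutex > unguard ( guard );
                // ... initiate IO that might call back immediately ...
            }
            // the primary mutex is held again from here on
        }

        void run ()
        {
            epicsGuard < epicsMutex > guard ( this->primaryMutex );
            this->callbackProneWork ( guard );
        }
    };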

Revision history for this message
Jeff Hill (johill-lanl) wrote :

I believe that we have tested the multiply defined PVs message before, so I suspect that this is influenced by the last round of patches we installed against R3.14.8.

edited on: 2006-05-30 14:11

Revision history for this message
Jeff Hill (johill-lanl) wrote :

fixed in R3.14.9

Revision history for this message
Andrew Johnson (anj) wrote :

R3.14.9 Released.
