IOC shutdown issues in dbCa from ca_context_destroy()

Bug #541345 reported by Andrew Johnson
6
This bug affects 1 person
Affects Status Importance Assigned to Milestone
EPICS Base
Fix Released
High
Jeff Hill

Bug Description

When closing an IOC using exit/Ctrl-D from the iocsh, I'm getting a failure when dbCaTask is shutting down. The shutdown code checks that all dbCa links have been cleared and then calls ca_context_destroy(). The problem isn't in dbCa itself, it's inside the ca_context_destroy() call.

The IOC console message I get is:

epicsMutex pthread_mutex_lock failed: error Invalid argument
epicsMutexOsdLockThread dbCaLink (0x...) can't proceed, suspending.

Steps to reproduce:
Run a soft IOC on Linux that contains connected CA links to a different IOC. Shut it down with exit or Control-D.

Not seen on Solaris, but now occurs every time on Linux.

Additional information:
I've seen two different stack traces for the dbCaTask, this being the most recent:

(gdb) where
#0 0x00d19402 in __kernel_vsyscall ()
#1 0x00ca81a6 in pthread_cond_wait@@GLIBC_2.3.2 () from /lib/libpthread.so.0
#2 0x004d28cd in pthread_cond_wait@@GLIBC_2.3.2 () from /lib/libc.so.6
#3 0x007c65c7 in epicsEventWait (pevent=0x9d6eb70) at ../../../src/libCom/osi/os/posix/osdEvent.c:75
#4 0x007c5242 in epicsThreadSuspendSelf () at ../../../src/libCom/osi/os/posix/osdThread.c:485
#5 0x007b966c in cantProceed (msg=0x7d4619 "epicsMutexOsdLock") at ../../../src/libCom/misc/cantProceed.c:69
#6 0x007c5e77 in epicsMutexOsdLock (pmutex=0x9d70240) at ../../../src/libCom/osi/os/posix/osdMutex.c:113
#7 0x007bdf60 in epicsMutexLock (pmutexNode=0x9d6fa50) at ../../../src/libCom/osi/epicsMutex.cpp:145
#8 0x007be67f in epicsMutex::lock (this=0x9d6ed10) at ../../../src/libCom/osi/epicsMutex.cpp:238
#9 0x007bb15a in ipAddrToAsciiEnginePrivate::release (this=0x9dcb438) at ../../../src/libCom/cxxTemplates/epicsGuard.h:71
#10 0x00d5fa21 in ~cac (this=0x9db2c10) at ../cac.cpp:289
#11 0x0014d6cf in ~dbContext (this=0x9d6f670) at ../../../include/epicsMemory.h:52
#12 0x00d7f9ca in ~ca_client_context (this=0x9d6f4a8) at ../../../include/epicsMemory.h:52
#13 0x00d64e19 in ca_context_destroy () at ../access.cpp:251
#14 0x0014a91b in dbCaTask (arg=0x0) at ../dbCa.c:986
#15 0x007c512b in start_routine (arg=0x9d6eb10) at ../../../src/libCom/osi/os/posix/osdThread.c:318
#16 0x00ca43db in start_thread () from /lib/libpthread.so.0
#17 0x004c626e in clone () from /lib/libc.so.6
(gdb)

I have also seen this in the ~cac() routine's call to timerQueue.release() which causes a failure in epicsThreadOnceOsd instead. I'll attach additional stack traces if I trap that one again.

The problem has to be due to the order of shutdown, but this isn't easy to track down.

OS: Linux

Platform: x86

Version: R3.14.9 R3-14-10-CVS

Original Mantis Bug: mantis-316
    http://www.aps.anl.gov/epics/mantis/view_bug_page.php?f_id=316

Revision history for this message
Andrew Johnson (anj) wrote :

This bug still present in R3-14-10-pre1. This is the backtrace from a 64-bit linux build:

#0 0x0000003a15e0a8f9 in pthread_cond_wait@@GLIBC_2.3.2 ()
   from /lib64/libpthread.so.0
#1 0x000000000106cb11 in epicsEventWait (pevent=0x160f040)
    at ../../../src/libCom/osi/os/posix/osdEvent.c:77
#2 0x000000000106ba6b in epicsThreadSuspendSelf ()
    at ../../../src/libCom/osi/os/posix/osdThread.c:486
#3 0x000000000106119b in cantProceed (msg=<value optimized out>)
    at ../../../src/libCom/misc/cantProceed.c:69
#4 0x000000000106c523 in epicsMutexOsdLock (pmutex=<value optimized out>)
    at ../../../src/libCom/osi/os/posix/osdMutex.c:114
#5 0x0000000001065b1c in epicsMutex::lock (this=<value optimized out>)
    at ../../../src/libCom/osi/epicsMutex.cpp:238
#6 0x00000000010629dd in ipAddrToAsciiEnginePrivate::release (
    this=<value optimized out>)
    at ../../../src/libCom/cxxTemplates/epicsGuard.h:71
#7 0x00000000066ea541 in ~cac (this=0x1633fb0) at ../cac.cpp:289
#8 0x000000000378cef9 in ~dbContext (this=0x160f5a0)
    at ../../../include/epicsMemory.h:52
#9 0x0000000006705e88 in ~ca_client_context (this=0x160f290)
    at ../../../include/epicsMemory.h:52
#10 0x00000000066eea76 in ca_context_destroy () at ../access.cpp:251
#11 0x000000000378a406 in dbCaTask (arg=<value optimized out>)
    at ../dbCa.c:986
#12 0x000000000106b973 in start_routine (arg=<value optimized out>)
    at ../../../src/libCom/osi/os/posix/osdThread.c:319
#13 0x0000003a15e06407 in start_thread () from /lib64/libpthread.so.0
#14 0x0000003a152d4b0d in clone () from /lib64/libc.so.6

Revision history for this message
Andrew Johnson (anj) wrote :

Since this bug is now occurring every time a Linux IOC that has external links is shut down, it must be fixed before the R3.14.10 release.

Revision history for this message
Jeff Hill (johill-lanl) wrote :

I am guessing that its at line 196 in ipAddrToAsciiAsynchronous.cp, and the problem is that pGlobalMutex is no longer valid - as follows.

    epicsGuard < epicsMutex > guard ( * ipAddrToAsciiEnginePrivate::pGlobalMutex );

Furthermore, pGlobalMutex is no longer valid because epicsAtExit has run its exit handler - as follows.

    epicsAtExit ( ipAddrToAsciiEngineGlobalMutexDestruct, 0 );

Revision history for this message
Jeff Hill (johill-lanl) wrote :

I had a look at the db ca code, and it does install its exit handler in the correct order - only after initializing ca.

Closer investigation reveals that the db snap-in for the ca client lib only initializes a network snap-in for the ca client lib when the first channel lacking direct linkage to the database is created.

This can lead to reversed exit handler initialization order.

Revision history for this message
Jeff Hill (johill-lanl) wrote :
Download full text (4.5 KiB)

I committed this patch

cvs diff -wb -- ipAddrToAsciiAsynchronous.cpp (in directory C:\hill\R3.14.dll_hell_fix\epics\base\src\libCom\misc\)
Index: ipAddrToAsciiAsynchronous.cpp
===================================================================
RCS file: /net/phoebus/epicsmgr/cvsroot/epics/base/src/libCom/misc/ipAddrToAsciiAsynchronous.cpp,v
retrieving revision 1.21.2.17
diff -u -b -w -b -r1.21.2.17 ipAddrToAsciiAsynchronous.cpp
--- ipAddrToAsciiAsynchronous.cpp 28 Jul 2008 18:32:32 -0000 1.21.2.17
+++ ipAddrToAsciiAsynchronous.cpp 26 Sep 2008 21:56:25 -0000
@@ -82,7 +82,7 @@

 extern "C" {
 static void ipAddrToAsciiEngineGlobalMutexConstruct ( void * );
-static void ipAddrToAsciiEngineGlobalMutexDestruct ( void * );
+static void ipAddrToAsciiEngineShutdownRequest ( void * );
 }

 // - this class executes the synchronous DNS query
@@ -111,6 +111,7 @@
     static epicsMutex * pGlobalMutex;
     static ipAddrToAsciiEnginePrivate * pEngine;
     static unsigned numberOfReferences;
+ static bool shutdownRequest;
     ipAddrToAsciiTransaction & createTransaction ();
     void release ();
     void run ();
@@ -118,13 +119,14 @@
  ipAddrToAsciiEnginePrivate & operator = ( const ipAddrToAsciiEngine & );
     friend class ipAddrToAsciiEngine;
     friend class ipAddrToAsciiTransactionPrivate;
- friend void ipAddrToAsciiEngineGlobalMutexDestruct ( void * );
+ friend void ipAddrToAsciiEngineShutdownRequest ( void * );
     friend void ipAddrToAsciiEngineGlobalMutexConstruct ( void * );
 };

 epicsMutex * ipAddrToAsciiEnginePrivate :: pGlobalMutex = 0;
 ipAddrToAsciiEnginePrivate * ipAddrToAsciiEnginePrivate :: pEngine = 0;
 unsigned ipAddrToAsciiEnginePrivate :: numberOfReferences = 0u;
+bool ipAddrToAsciiEnginePrivate :: shutdownRequest = false;
 static epicsThreadOnceId ipAddrToAsciiEngineGlobalMutexOnceFlag = 0;

 // the users are not required to supply a show routine
@@ -136,15 +138,19 @@
 ipAddrToAsciiTransaction::~ipAddrToAsciiTransaction () {}
 ipAddrToAsciiEngine::~ipAddrToAsciiEngine () {}

-static void ipAddrToAsciiEngineGlobalMutexDestruct ( void * )
+static void ipAddrToAsciiEngineShutdownRequest ( void * )
 {
+ epicsGuard < epicsMutex > guard ( * ipAddrToAsciiEnginePrivate::pGlobalMutex );
+ ipAddrToAsciiEnginePrivate :: shutdownRequest = true;
+ if ( ipAddrToAsciiEnginePrivate::numberOfReferences == 0u ) {
     delete ipAddrToAsciiEnginePrivate :: pGlobalMutex;
 }
+}

 static void ipAddrToAsciiEngineGlobalMutexConstruct ( void * )
 {
     ipAddrToAsciiEnginePrivate :: pGlobalMutex = new epicsMutex ();
- epicsAtExit ( ipAddrToAsciiEngineGlobalMutexDestruct, 0 );
+ epicsAtExit ( ipAddrToAsciiEngineShutdownRequest, 0 );
 }

 // for now its probably sufficent to allocate one
@@ -156,6 +162,16 @@
     epicsThreadOnce (
         & ipAddrToAsciiEngineGlobalMutexOnceFlag,
         ipAddrToAsciiEngineGlobalMutexConstruct, 0 );
+ // since we must not own lock when checking this flag
+ // this diagnostic has imperfect detection, but never
+ // incorrect detection
+ if ( ipAddrToAsciiEnginePrivate :: shutdownRequest ) {
+ throw std :: runtime_error (
+ "ipAddrToAsciiEngine::allocate (): "...

Read more...

Revision history for this message
Jeff Hill (johill-lanl) wrote :

After committing the above fix I see that the hang during IOC exit on Linux has been eliminated

Revision history for this message
Jeff Hill (johill-lanl) wrote :

fixed in (the bug was occurring as a side effect of changes prepared for) R3.14.10

Revision history for this message
Andrew Johnson (anj) wrote :

R3.14.10 released.

To post a comment you must log in.
This report contains Public information  
Everyone can see this information.

Other bug subscribers

Remote bug watches

Bug watches keep track of this bug in other bug trackers.