Channel access client deadlock in tcpRecvThread

Bug #672665 reported by florent.paitrault
This bug affects 1 person
Affects: EPICS Base
Status: Won't Fix
Importance: Low
Assigned to: Jeff Hill
Milestone: (none)

Bug Description

In a multi-threaded application I get a deadlock situation with the EPICS CA library.

I extracted the following stack trace showing where the lock occurs:
============================================================================
Thread 37 (Thread 0xaa4dbb90 (LWP 4604)):
#0 0x00bc6422 in __kernel_vsyscall ()
#1 0x002280b9 in __lll_lock_wait () from /lib/libpthread.so.0
#2 0x00223864 in _L_lock_824 () from /lib/libpthread.so.0
#3 0x0022371d in pthread_mutex_lock () from /lib/libpthread.so.0
#4 0x0071eca6 in pthread_mutex_lock () from /lib/libc.so.6
#5 0x048f4746 in epicsMutexOsdLock (pmutex=0x88b0668) at ../../../src/libCom/osi/os/posix/osdMutex.c:44
#6 0x048ec700 in epicsMutexLock (pmutexNode=0x88f0bf8) at ../../../src/libCom/osi/epicsMutex.cpp:145
#7 0x048ece1f in epicsMutex::lock (this=0x88f0b9c) at ../../../src/libCom/osi/epicsMutex.cpp:238
#8 0x0560df9c in tcpRecvThread::run (this=0x89388dc) at ../../../include/epicsGuard.h:71
#9 0x048ec50a in epicsThreadCallEntryPoint (pPvt=0x89388e0) at ../../../src/libCom/osi/epicsThread.cpp:85
#10 0x048f360b in start_routine (arg=0x88f6f20) at ../../../src/libCom/osi/os/posix/osdThread.c:282
#11 0x0022149b in start_thread () from /lib/libpthread.so.0
#12 0x0071242e in clone () from /lib/libc.so.6
============================================================================

The only messages on stderr, with DEBUG logging enabled, are:
============================================================================
CAS Response: cmd=0 id=0 typ=0 cnt=11 psz=0 avail=0 outBuf ptr=0x9b822b8
Socket outgoing: 16 byte socket: 0000001C status: 16
Socket incoming: 16 byte socket: 0000002A
CAS outgoing: 16 byte reply to 152.81.45.163:56305
CAS Response: cmd=22 id=1 typ=0 cnt=0 psz=0 avail=3 outBuf ptr=0x9b822b8
CAS Response: cmd=18 id=1 typ=4 cnt=2008 psz=0 avail=1 outBuf ptr=0x9b822c8
Socket outgoing: 32 byte socket: 0000001C status: 32 <========= This is never received by CA
CAS outgoing: 32 byte reply to 152.81.45.163:56305
CAS Response: cmd=1 id=1 typ=19 cnt=1 psz=16 avail=1e outBuf ptr=0x9b922b8
CAS Response: cmd=1 id=1 typ=19 cnt=1 psz=16 avail=26 outBuf ptr=0x9b922d8
CAS Response: cmd=1 id=1 typ=19 cnt=1 psz=16 avail=1e outBuf ptr=0x9b922f8
CAS Response: cmd=1 id=1 typ=19 cnt=1 psz=16 avail=20 outBuf ptr=0x9b92318
CAS Response: cmd=1 id=1 typ=19 cnt=1 psz=16 avail=26 outBuf ptr=0x9b8a2b8
CAS Response: cmd=1 id=1 typ=19 cnt=1 psz=16 avail=102 outBuf ptr=0x9b8a2d8
CAS Response: cmd=1 id=1 typ=19 cnt=1 psz=16 avail=29 outBuf ptr=0x9b8a2f8
CAS Response: cmd=1 id=1 typ=19 cnt=1 psz=16 avail=105 outBuf ptr=0x9b8a318
CAS Response: cmd=1 id=1 typ=20 cnt=1 psz=24 avail=2f outBuf ptr=0x9b8a338
CAS Response: cmd=1 id=1 typ=20 cnt=1 psz=24 avail=10b outBuf ptr=0x9b8a360
CAS Response: cmd=1 id=1 typ=19 cnt=1 psz=16 avail=26 outBuf ptr=0x9b8a388
CAS Response: cmd=1 id=1 typ=19 cnt=1 psz=16 avail=102 outBuf ptr=0x9b8a3a8
CAS Response: cmd=1 id=1 typ=19 cnt=1 psz=16 avail=23 outBuf ptr=0x9b8a3c8
CAS Response: cmd=1 id=1 typ=0 cnt=1 psz=16 avail=e7 outBuf ptr=0x9b8a3e8
Socket outgoing: 336 byte socket: 0000001B status: 336
Socket incoming: 336 byte socket: 000000A2
============================================================================

Tags: ca 3.14.11
Revision history for this message
Andrew Johnson (anj) wrote :

Please indicate which version of EPICS Base you were using, which target architecture for both the CA client and server, which server (was it an IOC or a CAS tool such as the PV Gateway) and tell us more about the client program.

tags: added: ca
Changed in epics-base:
status: New → Incomplete
Revision history for this message
florent.paitrault (florent-paitrault) wrote :

- EPICS Base version is 3.14.11.
- Target architecture: linux-x86 for both client & server.
- The server application is a multi-threaded application developed in-house using the CAS library.
- The client program is an EPICS variable recorder developed in-house.

Both server & client are running on the same computer.
We are using CentOS 5.2 as the Linux distribution, with kernel 2.6.29.6.

Revision history for this message
Andrew Johnson (anj) wrote :

Is the deadlock occurring in the CAS server application or in the CA client program?

I've just assigned and subscribed Jeff Hill to this bug as both libraries are his code; hopefully he'll be able to help when he gets a chance.

Changed in epics-base:
assignee: nobody → Jeff Hill (johill-lanl)
tags: added: 3.14.11
Revision history for this message
florent.paitrault (florent-paitrault) wrote :

The deadlock occurs in the CA client program (in tcpRecvThread::run() from ca/tcpiiu.cpp).

Revision history for this message
Jeff Hill (johill-lanl) wrote :

To establish a deadlock condition we need to show that something like the following is occurring:

Thread 1 holds lock A and is requesting lock B
Thread 2 holds lock B and is requesting lock A

Since we have only one thread's stack trace, it's quite difficult to guess at a cause. Please use the gdb command "thread apply all backtrace" against the wedged client-side application and place the result in this bug report (or send an email to <email address hidden>).

At this point it's difficult to rule out other causes. For example, this stack trace would occur if the send thread were blocked because outgoing data isn't being accepted by the socket. It would also be quite useful at this point to provide output from netstat, which will show whether bytes are pending in the TCP circuits and, if so, on which side of the circuit.

> #8 0x0560df9c in tcpRecvThread::run (this=0x89388dc) at ../../../include/epicsGuard.h:71

Unfortunately, gdb doesn't provide a line number in tcpRecvThread::run, but I can narrow it down to lines 443, 482, 504, 600, 593, 547, and 585 of tcpiiu.cpp, all of which take the per-circuit lock.
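
To make the deadlock criterion above concrete, here is a minimal generic sketch written with the libCom primitives that appear in the stack trace (illustration only, not code from the CA library): thread 1 takes A then B, thread 2 takes B then A, and each blocks forever in its second lock call.

#include "epicsMutex.h"
#include "epicsThread.h"

static epicsMutexId lockA;
static epicsMutexId lockB;

static void thread1 ( void * )
{
    epicsMutexLock ( lockA );          /* holds A */
    epicsThreadSleep ( 0.1 );          /* widen the race window */
    epicsMutexLock ( lockB );          /* requests B - blocks forever */
    epicsMutexUnlock ( lockB );
    epicsMutexUnlock ( lockA );
}

static void thread2 ( void * )
{
    epicsMutexLock ( lockB );          /* holds B */
    epicsThreadSleep ( 0.1 );
    epicsMutexLock ( lockA );          /* requests A - blocks forever */
    epicsMutexUnlock ( lockA );
    epicsMutexUnlock ( lockB );
}

int main ()
{
    lockA = epicsMutexCreate ();
    lockB = epicsMutexCreate ();
    epicsThreadCreate ( "t1", epicsThreadPriorityMedium,
        epicsThreadGetStackSize ( epicsThreadStackSmall ), thread1, 0 );
    epicsThreadCreate ( "t2", epicsThreadPriorityMedium,
        epicsThreadGetStackSize ( epicsThreadStackSmall ), thread2, 0 );
    epicsThreadSleep ( 1.0 );
    epicsThreadShowAll ( 1 );          /* dump thread info while the two threads are wedged */
    return 0;
}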

Revision history for this message
florent.paitrault (florent-paitrault) wrote :

Here is the stack trace of every thread in the application:

Thread 16 (Thread 0xb6793b90 (LWP 2730)):
#0 0x0084e422 in __kernel_vsyscall ()
#1 0x0098d973 in poll () from /lib/libc.so.6
#2 0x081d3f9b in omni::SocketCollection::Select (this=0x91e4cd0) at SocketCollection.cc:500
#3 0x081fbb52 in omni::tcpEndpoint::AcceptAndMonitor (this=<value optimized out>, func=<value optimized out>, cookie=<value optimized out>) at ./tcp/tcpEndpoint.cc:659
#4 0x081b8e1d in omni::giopRendezvouser::execute (this=<value optimized out>) at giopRendezvouser.cc:92
#5 0x08156e8e in omniAsyncWorker::real_run (this=<value optimized out>) at invoker.cc:232
#6 0x0815601b in omniAsyncWorkerInfo::run (this=<value optimized out>) at invoker.cc:280
#7 0x0815712a in omniAsyncWorker::run (this=Could not find the frame base for "omniAsyncWorker::run(void*)".
) at invoker.cc:159
#8 0x002be8ed in omni_thread_wrapper (ptr=<value optimized out>) at posix.cc:456
#9 0x00a4049b in start_thread () from /lib/libpthread.so.0
#10 0x0099742e in clone () from /lib/libc.so.6

Thread 15 (Thread 0xb5f92b90 (LWP 2731)):
#0 0x0084e422 in __kernel_vsyscall ()
#1 0x00a448c2 in pthread_cond_timedwait@@GLIBC_2.3.2 () from /lib/libpthread.so.0
#2 0x009a3b84 in pthread_cond_timedwait@@GLIBC_2.3.2 () from /lib/libc.so.6
#3 0x002bd478 in omni_condition::timedwait (this=<value optimized out>, secs=<value optimized out>, nanosecs=<value optimized out>) at posix.cc:172
#4 0x081aeb73 in omni::Scavenger::execute (this=<value optimized out>) at giopStrand.cc:719
#5 0x08156e8e in omniAsyncWorker::real_run (this=<value optimized out>) at invoker.cc:232
#6 0x0815601b in omniAsyncWorkerInfo::run (this=<value optimized out>) at invoker.cc:280
#7 0x0815712a in omniAsyncWorker::run (this=Could not find the frame base for "omniAsyncWorker::run(void*)".
) at invoker.cc:159
#8 0x002be8ed in omni_thread_wrapper (ptr=<value optimized out>) at posix.cc:456
#9 0x00a4049b in start_thread () from /lib/libpthread.so.0
#10 0x0099742e in clone () from /lib/libc.so.6

Thread 14 (Thread 0xafdffb90 (LWP 2782)):
#0 0x0084e422 in __kernel_vsyscall ()
#1 0x00990211 in select () from /lib/libc.so.6
#2 0x0610432e in fdManager::process (this=0x6133e60, delay=0.00050000000000000001) at ../../../src/libCom/fdmgr/fdManager.cpp:129
#3 0x054d77ac in epicsThread () at /home/to52638/sources/gc/trunk/vs/impl/epics/src/epicsVs.cc:47
#4 0x0611860b in start_routine (arg=0x9254ac0) at ../../../src/libCom/osi/os/posix/osdThread.c:282
#5 0x00a4049b in start_thread () from /lib/libpthread.so.0
#6 0x0099742e in clone () from /lib/libc.so.6

Thread 13 (Thread 0xafdbeb90 (LWP 2783)):
#0 0x0084e422 in __kernel_vsyscall ()
#1 0x00a44595 in pthread_cond_wait@@GLIBC_2.3.2 () from /lib/libpthread.so.0
#2 0x009a3b3d in pthread_cond_wait@@GLIBC_2.3.2 () from /lib/libc.so.6
#3 0x06119ee7 in epicsEventWait (pevent=0x925f2f8) at ../../../src/libCom/osi/os/posix/osdEvent.c:75
#4 0x061123bf in epicsEvent::wait (this=0x929114c) at ../../../src/libCom/osi/epicsEvent.cpp:63
#5 0x0610ed74 in ipAddrToAsciiEnginePrivate::run (this=0x9290d28) at ../../../src/libCom/misc/ipAddrToAsciiAsynchronous.cpp:305
#6 0x0611150a in epicsThreadCallEntryPoi...

visibility: public → private
Revision history for this message
Jeff Hill (johill-lanl) wrote :

> bash-3.2# netstat -tn
> Active Internet connections (w/o servers)
> Proto Recv-Q Send-Q Local Address Foreign Address State
> tcp 0 0 XX.XX.XX.XX:45972 XX.XX.XX.XX:44569 ESTABLISHED
> tcp 0 0 XX.XX.XX.XX:52151 XX.XX.XX.XX:33041 ESTABLISHED
> tcp 0 0 XX.XX.XX.XX:49477 XX.XX.XX.XX:5064 ESTABLISHED
> tcp 0 0 XX.XX.XX.XX:50852 XX.XX.XX.XX:54359 ESTABLISHED
> tcp 0 0 XX.XX.XX.XX:60181 XX.XX.XX.XX:59399 ESTABLISHED
> .
> .

So we can see there is no TCP/IP backlog.

Revision history for this message
Jeff Hill (johill-lanl) wrote :

Only this thread is blocked on a lock, which is a contra-indicator for a deadlock; instead this could be a compromised lock, or alternatively see below. A lock gets compromised if it isn't unlocked when traversing an unusual code path such as a C++ exception. That scenario is less likely in this code because it uses an automatic instance of a Guard class to manage locking and unlocking, but see below for another possibility.

Thread 3 (Thread 0xa9e72b90 (LWP 2860)):
#0 0x0084e422 in __kernel_vsyscall ()
#1 0x00a470b9 in __lll_lock_wait () from /lib/libpthread.so.0
#2 0x00a42864 in _L_lock_824 () from /lib/libpthread.so.0
#3 0x00a4271d in pthread_mutex_lock () from /lib/libpthread.so.0
#4 0x009a3ca6 in pthread_mutex_lock () from /lib/libc.so.6
#5 0x06119746 in epicsMutexOsdLock (pmutex=0x9179320) at ../../../src/libCom/osi/os/posix/osdMutex.c:44
#6 0x06111700 in epicsMutexLock (pmutexNode=0x9179340) at ../../../src/libCom/osi/epicsMutex.cpp:145
#7 0x06111e1f in epicsMutex::lock (this=0x91792a4) at ../../../src/libCom/osi/epicsMutex.cpp:238
#8 0x051a2f9c in tcpRecvThread::run (this=0x9333524) at ../../../include/epicsGuard.h:71
#9 0x0611150a in epicsThreadCallEntryPoint (pPvt=0x9333528) at ../../../src/libCom/osi/epicsThread.cpp:85
#10 0x0611860b in start_routine (arg=0x92f3678) at ../../../src/libCom/osi/os/posix/osdThread.c:282
#11 0x00a4049b in start_thread () from /lib/libpthread.so.0
#12 0x0099742e in clone () from /lib/libc.so.6

A lock might be taken at any of lines 443, 482, 504, 600, 593, 547, and 585 in tcpiiu.cpp, all of which take the per-circuit lock.

Looking closer, I see that line 502 also has "callbackManager mgr ( this->ctxNotify, this->cbMutex );". This takes the callback manager lock, the traffic cop which, in non-preemptive callback mode, prevents a callback from proceeding unless the application is executing in the library.
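
For reference, the Guard pattern mentioned above works roughly like this; a simplified sketch only, not the actual tcpiiu.cpp or cac.h code, and processCircuitResponse is a hypothetical function name:

#include "epicsMutex.h"
#include "epicsGuard.h"

void processCircuitResponse ( epicsMutex & circuitMutex )
{
    /* constructor takes the lock */
    epicsGuard < epicsMutex > guard ( circuitMutex );
    /* ... work done while holding the per-circuit lock ... */
}   /* destructor releases the lock, even if the scope is left via a C++
       exception, provided the destructor actually runs during unwinding */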

Is this application running in non-preemptive or preemptive callback mode? If so, does it call ca_poll periodically to allow ca client library background activity to proceed?

Revision history for this message
Jeff Hill (johill-lanl) wrote :

Should have said "If its operating in non-preemptive callback mode (the default), does it call ca_poll periodically to allow ca client library background activity to proceed?"
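
For reference, a minimal sketch of the two modes using the public cadef.h API (not code from the application in this report):

#include "cadef.h"

int main ()
{
    /* preemptive mode: the library's own threads may run callbacks at
       any time, so no periodic ca_poll() is required */
    ca_context_create ( ca_enable_preemptive_callback );

    /* in non-preemptive mode (ca_disable_preemptive_callback, the
       default) the application would instead have to call ca_poll()
       or ca_pend_event() periodically so that background activity,
       including callbacks, can proceed */

    /* ... create channels, subscriptions, etc. ... */

    ca_context_destroy ();
    return 0;
}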

Revision history for this message
florent.paitrault (florent-paitrault) wrote :

After adding some traces around line 502, I can confirm that the lock occurs at this line.
We are operating in preemptive callback mode, so normally we should not have to call ca_poll periodically.

Revision history for this message
Jeff Hill (johill-lanl) wrote :

Sorry, some more questions:

1) Do you have any thread delete calls in your application? (This could cause the callback control lock to become compromised.)
2) When does this failure occur? Is it after the application has been running for a long time, during startup, during shutdown, etc.?

Revision history for this message
florent.paitrault (florent-paitrault) wrote :

Some threads are deleted during application execution, but they terminate naturally at the end of their loop and are synchronized using pthread_join. The only part I don't have visibility into is the CORBA layer with its thread pool.
This failure occurs systematically while reconfiguring our application.

Should I do a "ca_detach_context" before exiting any thread in my application?

Revision history for this message
Jeff Hill (johill-lanl) wrote :

In the debugger, or from the source code, we might get some more information if we call these functions. In particular, this might tell us which thread is the owner of the callback control mutex.

epicsThreadShowAll ( 1 /* interest level */ );
epicsMutexShowAll ( 1 /* interest level */ );

int ca_client_status ( unsigned level ); // uses the current thread's ca client context

int ca_context_status ( struct ca_client_context *,
       unsigned interestLevel ); // any specified ca client context
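
One way to wire these diagnostics into the client application is sketched below (a sketch only; the epicsMutexShowAll call shown uses an onlyLocked flag plus an interest level, so check epicsThread.h, epicsMutex.h and cadef.h for the exact prototypes in your version of Base):

#include "epicsThread.h"
#include "epicsMutex.h"
#include "cadef.h"

void dumpCaDiagnostics ()
{
    epicsThreadShowAll ( 1 );       /* all EPICS threads, interest level 1 */
    epicsMutexShowAll ( 1, 1 );     /* only locked mutexes, interest level 1 */
    ca_client_status ( 1 );         /* the calling thread's ca client context */
}

This can also be invoked from gdb against the wedged process with "call dumpCaDiagnostics()".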

Revision history for this message
Jeff Hill (johill-lanl) wrote :

If the application is in preemptive callback mode then the callback control lock is used only to enforce that only one thread at a time may call a ca callback.

The threads that call callbacks are
o the tcp receive threads
o the udp receive threads (during client ctx shutdown and only in one other rare situation - ECA_NOSEARCHADDR message)
o asynchronous dns thread (when we see the multiply defined PV message)
o the timer queue threads (send watchdog expire, receive watchdog expire)
o the thread that destroys a ca context (during context shutdown)

We see that one of the ca client library's tcp receive threads is stuck waiting for the callback control lock, but all of the others are waiting in the normal place for new response messages (so they shouldn't own the callback control lock).
The UDP receive thread is parked in the normal place waiting for new messages (so it shouldn't own the callback control lock).
The timer queue threads are parked in the normal place waiting for a timer to expire (so they shouldn't own the callback control lock).
The asynchronous dns thread is parked in the normal place waiting for a new request (so it shouldn't own the callback control lock).

Revision history for this message
Jeff Hill (johill-lanl) wrote :

The most likely scenario would be that a thread was somehow destroyed while it owned the callback control lock, or alternatively that somehow the g++ object code neglected to run the destructor for an automatic instance of the guard class when unwinding the stack because of a C++ exception. Are there any messages just prior to the failure?

Does the issue occur when building the application with g++ optimization turned off? If so, which version of g++ is in use (use "g++ -v")?

Maybe this is what is different with this application; the client library is communicating with an in-process ca server. In an IOC the ca client library uses direct in-process function calls to communicate with the database so communication with an in-process ca server is maybe less commonly used. Is your ca client side application connecting directly to the in-process ca server?

Revision history for this message
Jeff Hill (johill-lanl) wrote :

> Some threads are deleted during application execution, but they terminate naturally at the end of their loop
> and are synchronized using pthread_join.

We are only concerned about poorly synchronized uses of, for example, pthread_cancel or pthread_kill so perhaps this is not the issue.

> The only part I don't have visibility is the CORBA layer with its thread pool.

Do any of the CORBA threads interact directly with the ca client context? In practice it may not matter, because in preemptive callback mode they (the CORBA threads) wouldn't be manipulating the callback control lock when they make ca calls. There may, however, be some unique failure scenarios involving thread-private variables (which the client library does use) and thread pools.

> This failure occurs systematically while reconfiguring our application.

This could be a scenario where some object is being manipulated after it was destroyed. It might help to run valgrind, Purify, etc. Also, running one of {ca_client_status, ca_context_status} might identify corruption issues (admittedly we might see an exception instead of a thread pending on a lock in that scenario), and/or possibly identify which thread still owns a compromised lock (that would be a substantial clue).

> Should I do a "ca_detach_context" before exiting any thread in my application?

I don't see any reason why this would be required, because the client library doesn't have any auto context destroy when the last thread using a context exits, but it also shouldn't be harmful.
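
For completeness, the usual pattern for a short-lived worker thread that shares a preemptive ca context is sketched below using the public cadef.h and epicsThread.h APIs (a sketch only, not code from the application in this report):

#include "cadef.h"
#include "epicsThread.h"

static void worker ( void * arg )
{
    /* join the context created by the main thread */
    ca_attach_context ( ( struct ca_client_context * ) arg );
    /* ... ca_get / ca_put / subscription work here ... */
    ca_detach_context ();   /* not required, but harmless, as noted above */
}

int main ()
{
    ca_context_create ( ca_enable_preemptive_callback );
    struct ca_client_context * ctx = ca_current_context ();

    epicsThreadCreate ( "worker", epicsThreadPriorityMedium,
        epicsThreadGetStackSize ( epicsThreadStackSmall ), worker, ctx );

    epicsThreadSleep ( 1.0 );   /* crude stand-in for joining the worker */
    ca_context_destroy ();
    return 0;
}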

visibility: private → public
Changed in epics-base:
importance: Undecided → Low
Revision history for this message
Andrew Johnson (anj) wrote :

No updates in 7 years => Won't Fix.

Changed in epics-base:
status: Incomplete → Won't Fix