Channel access client deadlock in tcpRecvThread

Bug #672665 reported by florent.paitrault
This bug affects 1 person
Affects: EPICS Base
Status: Won't Fix
Importance: Low
Assigned to: Jeff Hill
Milestone: (none)

Bug Description

In a multi-threaded application I get a deadlock situation with the EPICS CA library.

I extracted the following stack trace showing where the lock occurs:
============================================================================
Thread 37 (Thread 0xaa4dbb90 (LWP 4604)):
#0 0x00bc6422 in __kernel_vsyscall ()
#1 0x002280b9 in __lll_lock_wait () from /lib/libpthread.so.0
#2 0x00223864 in _L_lock_824 () from /lib/libpthread.so.0
#3 0x0022371d in pthread_mutex_lock () from /lib/libpthread.so.0
#4 0x0071eca6 in pthread_mutex_lock () from /lib/libc.so.6
#5 0x048f4746 in epicsMutexOsdLock (pmutex=0x88b0668) at ../../../src/libCom/osi/os/posix/osdMutex.c:44
#6 0x048ec700 in epicsMutexLock (pmutexNode=0x88f0bf8) at ../../../src/libCom/osi/epicsMutex.cpp:145
#7 0x048ece1f in epicsMutex::lock (this=0x88f0b9c) at ../../../src/libCom/osi/epicsMutex.cpp:238
#8 0x0560df9c in tcpRecvThread::run (this=0x89388dc) at ../../../include/epicsGuard.h:71
#9 0x048ec50a in epicsThreadCallEntryPoint (pPvt=0x89388e0) at ../../../src/libCom/osi/epicsThread.cpp:85
#10 0x048f360b in start_routine (arg=0x88f6f20) at ../../../src/libCom/osi/os/posix/osdThread.c:282
#11 0x0022149b in start_thread () from /lib/libpthread.so.0
#12 0x0071242e in clone () from /lib/libc.so.6
============================================================================

The only messages on stderr, with DEBUG logging enabled, are:
============================================================================
CAS Response: cmd=0 id=0 typ=0 cnt=11 psz=0 avail=0 outBuf ptr=0x9b822b8
Socket outgoing: 16 byte socket: 0000001C status: 16
Socket incoming: 16 byte socket: 0000002A
CAS outgoing: 16 byte reply to 152.81.45.163:56305
CAS Response: cmd=22 id=1 typ=0 cnt=0 psz=0 avail=3 outBuf ptr=0x9b822b8
CAS Response: cmd=18 id=1 typ=4 cnt=2008 psz=0 avail=1 outBuf ptr=0x9b822c8
Socket outgoing: 32 byte socket: 0000001C status: 32 <========= This is never received by CA
CAS outgoing: 32 byte reply to 152.81.45.163:56305
CAS Response: cmd=1 id=1 typ=19 cnt=1 psz=16 avail=1e outBuf ptr=0x9b922b8
CAS Response: cmd=1 id=1 typ=19 cnt=1 psz=16 avail=26 outBuf ptr=0x9b922d8
CAS Response: cmd=1 id=1 typ=19 cnt=1 psz=16 avail=1e outBuf ptr=0x9b922f8
CAS Response: cmd=1 id=1 typ=19 cnt=1 psz=16 avail=20 outBuf ptr=0x9b92318
CAS Response: cmd=1 id=1 typ=19 cnt=1 psz=16 avail=26 outBuf ptr=0x9b8a2b8
CAS Response: cmd=1 id=1 typ=19 cnt=1 psz=16 avail=102 outBuf ptr=0x9b8a2d8
CAS Response: cmd=1 id=1 typ=19 cnt=1 psz=16 avail=29 outBuf ptr=0x9b8a2f8
CAS Response: cmd=1 id=1 typ=19 cnt=1 psz=16 avail=105 outBuf ptr=0x9b8a318
CAS Response: cmd=1 id=1 typ=20 cnt=1 psz=24 avail=2f outBuf ptr=0x9b8a338
CAS Response: cmd=1 id=1 typ=20 cnt=1 psz=24 avail=10b outBuf ptr=0x9b8a360
CAS Response: cmd=1 id=1 typ=19 cnt=1 psz=16 avail=26 outBuf ptr=0x9b8a388
CAS Response: cmd=1 id=1 typ=19 cnt=1 psz=16 avail=102 outBuf ptr=0x9b8a3a8
CAS Response: cmd=1 id=1 typ=19 cnt=1 psz=16 avail=23 outBuf ptr=0x9b8a3c8
CAS Response: cmd=1 id=1 typ=0 cnt=1 psz=16 avail=e7 outBuf ptr=0x9b8a3e8
Socket outgoing: 336 byte socket: 0000001B status: 336
Socket incoming: 336 byte socket: 000000A2
============================================================================

Tags: ca 3.14.11
Revision history for this message
Andrew Johnson (anj) wrote :

Please indicate which version of EPICS Base you were using, which target architecture for both the CA client and server, which server (was it an IOC or a CAS tool such as the PV Gateway) and tell us more about the client program.

tags: added: ca
Changed in epics-base:
status: New → Incomplete
Revision history for this message
florent.paitrault (florent-paitrault) wrote :

- EPICS Base version is 3.14.11.
- Target architecture: linux-x86 for both client & server.
- The server application is a multi-threaded application developed in-house using the CAS library.
- The client program is an EPICS variable recorder developed in-house.

Both server & client are running on the same computer.
We are using CentOS 5.2 as the Linux distribution, with kernel 2.6.29.6.

Revision history for this message
Andrew Johnson (anj) wrote :

Is the deadlock occurring in the CAS server application or in the CA client program?

I've just assigned and subscribed Jeff Hill to this bug as both libraries are his code; hopefully he'll be able to help when he gets a chance.

Changed in epics-base:
assignee: nobody → Jeff Hill (johill-lanl)
tags: added: 3.14.11
Revision history for this message
florent.paitrault (florent-paitrault) wrote :

The deadlock occurs in the CA client program (in tcpRecvThread::run() from ca/tcpiiu.cpp).

Revision history for this message
Jeff Hill (johill-lanl) wrote :

To establish a deadlock condition we need to show that something like the following is occurring:

Thread 1 holds lock A and is requesting lock B
Thread 2 holds lock B and is requesting lock A

Since we have only one thread's stack trace, it's quite difficult to guess at a cause. Please use the gdb command "thread apply all backtrace" against the wedged client-side application and place the result in this bug report (or send an email to <email address hidden>).

At this point it's difficult to rule out other causes. For example, this stack trace would occur if the send thread were blocked because outgoing data isn't being accepted by the socket. It would also be quite useful at this point to provide output from netstat, which will show whether bytes are pending in the TCP circuits and, if so, on which side of the circuit.

> #8 0x0560df9c in tcpRecvThread::run (this=0x89388dc) at ../../../include/epicsGuard.h:71

Unfortunately, gdb doesn't provide a line number in tcpRecvThread::run, but I can narrow it down to lines 443, 482, 504, 600, 593, 547, and 585 of tcpiiu.cpp, all of which take the per-circuit lock.
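
To make the deadlock criterion above concrete, here is a minimal generic sketch written with the libCom primitives that appear in the stack trace (illustration only, not code from the CA library): thread 1 takes A then B, thread 2 takes B then A, and each blocks forever in its second lock call.

#include "epicsMutex.h"
#include "epicsThread.h"

static epicsMutexId lockA;
static epicsMutexId lockB;

static void thread1 ( void * )
{
    epicsMutexLock ( lockA );          /* holds A */
    epicsThreadSleep ( 0.1 );          /* widen the race window */
    epicsMutexLock ( lockB );          /* requests B - blocks forever */
    epicsMutexUnlock ( lockB );
    epicsMutexUnlock ( lockA );
}

static void thread2 ( void * )
{
    epicsMutexLock ( lockB );          /* holds B */
    epicsThreadSleep ( 0.1 );
    epicsMutexLock ( lockA );          /* requests A - blocks forever */
    epicsMutexUnlock ( lockA );
    epicsMutexUnlock ( lockB );
}

int main ()
{
    lockA = epicsMutexCreate ();
    lockB = epicsMutexCreate ();
    epicsThreadCreate ( "t1", epicsThreadPriorityMedium,
        epicsThreadGetStackSize ( epicsThreadStackSmall ), thread1, 0 );
    epicsThreadCreate ( "t2", epicsThreadPriorityMedium,
        epicsThreadGetStackSize ( epicsThreadStackSmall ), thread2, 0 );
    epicsThreadSleep ( 1.0 );
    epicsThreadShowAll ( 1 );          /* dump thread info while the two threads are wedged */
    return 0;
}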

Revision history for this message
florent.paitrault (florent-paitrault) wrote :

Here is the stack trace of every thread in the application:

Thread 16 (Thread 0xb6793b90 (LWP 2730)):
#0 0x0084e422 in __kernel_vsyscall ()
#1 0x0098d973 in poll () from /lib/libc.so.6
#2 0x081d3f9b in omni::SocketCollection::Select (this=0x91e4cd0) at SocketCollection.cc:500
#3 0x081fbb52 in omni::tcpEndpoint::AcceptAndMonitor (this=<value optimized out>, func=<value optimized out>, cookie=<value optimized out>) at ./tcp/tcpEndpoint.cc:659
#4 0x081b8e1d in omni::giopRendezvouser::execute (this=<value optimized out>) at giopRendezvouser.cc:92
#5 0x08156e8e in omniAsyncWorker::real_run (this=<value optimized out>) at invoker.cc:232
#6 0x0815601b in omniAsyncWorkerInfo::run (this=<value optimized out>) at invoker.cc:280
#7 0x0815712a in omniAsyncWorker::run (this=Could not find the frame base for "omniAsyncWorker::run(void*)".
) at invoker.cc:159
#8 0x002be8ed in omni_thread_wrapper (ptr=<value optimized out>) at posix.cc:456
#9 0x00a4049b in start_thread () from /lib/libpthread.so.0
#10 0x0099742e in clone () from /lib/libc.so.6

Thread 15 (Thread 0xb5f92b90 (LWP 2731)):
#0 0x0084e422 in __kernel_vsyscall ()
#1 0x00a448c2 in pthread_cond_timedwait@@GLIBC_2.3.2 () from /lib/libpthread.so.0
#2 0x009a3b84 in pthread_cond_timedwait@@GLIBC_2.3.2 () from /lib/libc.so.6
#3 0x002bd478 in omni_condition::timedwait (this=<value optimized out>, secs=<value optimized out>, nanosecs=<value optimized out>) at posix.cc:172
#4 0x081aeb73 in omni::Scavenger::execute (this=<value optimized out>) at giopStrand.cc:719
#5 0x08156e8e in omniAsyncWorker::real_run (this=<value optimized out>) at invoker.cc:232
#6 0x0815601b in omniAsyncWorkerInfo::run (this=<value optimized out>) at invoker.cc:280
#7 0x0815712a in omniAsyncWorker::run (this=Could not find the frame base for "omniAsyncWorker::run(void*)".
) at invoker.cc:159
#8 0x002be8ed in omni_thread_wrapper (ptr=<value optimized out>) at posix.cc:456
#9 0x00a4049b in start_thread () from /lib/libpthread.so.0
#10 0x0099742e in clone () from /lib/libc.so.6

Thread 14 (Thread 0xafdffb90 (LWP 2782)):
#0 0x0084e422 in __kernel_vsyscall ()
#1 0x00990211 in select () from /lib/libc.so.6
#2 0x0610432e in fdManager::process (this=0x6133e60, delay=0.00050000000000000001) at ../../../src/libCom/fdmgr/fdManager.cpp:129
#3 0x054d77ac in epicsThread () at /home/to52638/sources/gc/trunk/vs/impl/epics/src/epicsVs.cc:47
#4 0x0611860b in start_routine (arg=0x9254ac0) at ../../../src/libCom/osi/os/posix/osdThread.c:282
#5 0x00a4049b in start_thread () from /lib/libpthread.so.0
#6 0x0099742e in clone () from /lib/libc.so.6

Thread 13 (Thread 0xafdbeb90 (LWP 2783)):
#0 0x0084e422 in __kernel_vsyscall ()
#1 0x00a44595 in pthread_cond_wait@@GLIBC_2.3.2 () from /lib/libpthread.so.0
#2 0x009a3b3d in pthread_cond_wait@@GLIBC_2.3.2 () from /lib/libc.so.6
#3 0x06119ee7 in epicsEventWait (pevent=0x925f2f8) at ../../../src/libCom/osi/os/posix/osdEvent.c:75
#4 0x061123bf in epicsEvent::wait (this=0x929114c) at ../../../src/libCom/osi/epicsEvent.cpp:63
#5 0x0610ed74 in ipAddrToAsciiEnginePrivate::run (this=0x9290d28) at ../../../src/libCom/misc/ipAddrToAsciiAsynchronous.cpp:305
#6 0x0611150a in epicsThreadCallEntryPoi...

visibility: public → private
Revision history for this message
Jeff Hill (johill-lanl) wrote :

> bash-3.2# netstat -tn
> Active Internet connections (w/o servers)
> Proto Recv-Q Send-Q Local Address Foreign Address State
> tcp 0 0 XX.XX.XX.XX:45972 XX.XX.XX.XX:44569 ESTABLISHED
> tcp 0 0 XX.XX.XX.XX:52151 XX.XX.XX.XX:33041 ESTABLISHED
> tcp 0 0 XX.XX.XX.XX:49477 XX.XX.XX.XX:5064 ESTABLISHED
> tcp 0 0 XX.XX.XX.XX:50852 XX.XX.XX.XX:54359 ESTABLISHED
> tcp 0 0 XX.XX.XX.XX:60181 XX.XX.XX.XX:59399 ESTABLISHED
> .
> .

So we can see there is no TCP/IP backlog.

Revision history for this message
Jeff Hill (johill-lanl) wrote :

Only this thread is blocked on a lock, which is a contra-indicator for a deadlock; instead this could be a compromised lock, or alternatively see below. A lock gets compromised if it isn't unlocked when traversing an unusual code path such as a C++ exception. That scenario is less likely in this code because it uses an automatic instance of a Guard class to manage locking and unlocking, but see below for another possibility.

Thread 3 (Thread 0xa9e72b90 (LWP 2860)):
#0 0x0084e422 in __kernel_vsyscall ()
#1 0x00a470b9 in __lll_lock_wait () from /lib/libpthread.so.0
#2 0x00a42864 in _L_lock_824 () from /lib/libpthread.so.0
#3 0x00a4271d in pthread_mutex_lock () from /lib/libpthread.so.0
#4 0x009a3ca6 in pthread_mutex_lock () from /lib/libc.so.6
#5 0x06119746 in epicsMutexOsdLock (pmutex=0x9179320) at ../../../src/libCom/osi/os/posix/osdMutex.c:44
#6 0x06111700 in epicsMutexLock (pmutexNode=0x9179340) at ../../../src/libCom/osi/epicsMutex.cpp:145
#7 0x06111e1f in epicsMutex::lock (this=0x91792a4) at ../../../src/libCom/osi/epicsMutex.cpp:238
#8 0x051a2f9c in tcpRecvThread::run (this=0x9333524) at ../../../include/epicsGuard.h:71
#9 0x0611150a in epicsThreadCallEntryPoint (pPvt=0x9333528) at ../../../src/libCom/osi/epicsThread.cpp:85
#10 0x0611860b in start_routine (arg=0x92f3678) at ../../../src/libCom/osi/os/posix/osdThread.c:282
#11 0x00a4049b in start_thread () from /lib/libpthread.so.0
#12 0x0099742e in clone () from /lib/libc.so.6

A lock might be taken at any of lines 443, 482, 504, 600, 593, 547, and 585 in tcpiiu.cpp, all of which take the per-circuit lock.

Looking closer, I see that line 502 also has "callbackManager mgr ( this->ctxNotify, this->cbMutex );". This takes the callback manager lock, the traffic cop which, in non-preemptive callback mode, prevents a callback from proceeding unless the application is executing in the library.
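
For reference, the Guard pattern mentioned above works roughly like this; a simplified sketch only, not the actual tcpiiu.cpp or cac.h code, and processCircuitResponse is a hypothetical function name:

#include "epicsMutex.h"
#include "epicsGuard.h"

void processCircuitResponse ( epicsMutex & circuitMutex )
{
    /* constructor takes the lock */
    epicsGuard < epicsMutex > guard ( circuitMutex );
    /* ... work done while holding the per-circuit lock ... */
}   /* destructor releases the lock, even if the scope is left via a C++
       exception, provided the destructor actually runs during unwinding */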

Is this application running in non-preemptive or preemptive callback mode? If so, does it call ca_poll periodically to allow ca client library background activity to proceed?

Revision history for this message
Jeff Hill (johill-lanl) wrote :

Should have said "If its operating in non-preemptive callback mode (the default), does it call ca_poll periodically to allow ca client library background activity to proceed?"
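
For reference, a minimal sketch of the two modes using the public cadef.h API (not code from the application in this report):

#include "cadef.h"

int main ()
{
    /* preemptive mode: the library's own threads may run callbacks at
       any time, so no periodic ca_poll() is required */
    ca_context_create ( ca_enable_preemptive_callback );

    /* in non-preemptive mode (ca_disable_preemptive_callback, the
       default) the application would instead have to call ca_poll()
       or ca_pend_event() periodically so that background activity,
       including callbacks, can proceed */

    /* ... create channels, subscriptions, etc. ... */

    ca_context_destroy ();
    return 0;
}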

Revision history for this message
florent.paitrault (florent-paitrault) wrote :

After adding some traces around line 502, I can confirm that the lock occurs at this line.
We are operating in preemptive callback mode, so normally we should not have to call ca_poll periodically.

Revision history for this message
Jeff Hill (johill-lanl) wrote :

Sorry, some more questions:

1) Do you have any thread delete calls in your application? (This could cause the callback control lock to become compromised.)
2) When does this failure occur? Is it after the application has been running for a long time, during startup, during shutdown, etc.?

Revision history for this message
florent.paitrault (florent-paitrault) wrote :

Some threads are deleted during application execution, but they terminate naturally at the end of their loop and are synchronized using pthread_join. The only part I don't have visibility into is the CORBA layer with its thread pool.
This failure occurs systematically while reconfiguring our application.

Should I do a "ca_detach_context" before exiting any thread in my application?

Revision history for this message
Jeff Hill (johill-lanl) wrote :

In the debugger, or from the source code, we might get some more information if we call these functions. In particular, this might tell us which thread is the owner of the callback control mutex.

epicsThreadShowAll ( 1 /* interest level */ );
epicsMutexShowAll ( 1 /* interest level */ );

int ca_client_status ( unsigned level ); // uses the current thread's ca client context

int ca_context_status ( struct ca_client_context *,
       unsigned interestLevel ); // any specified ca client context
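
One way to wire these diagnostics into the client application is sketched below (a sketch only; the epicsMutexShowAll call shown uses an onlyLocked flag plus an interest level, so check epicsThread.h, epicsMutex.h and cadef.h for the exact prototypes in your version of Base):

#include "epicsThread.h"
#include "epicsMutex.h"
#include "cadef.h"

void dumpCaDiagnostics ()
{
    epicsThreadShowAll ( 1 );       /* all EPICS threads, interest level 1 */
    epicsMutexShowAll ( 1, 1 );     /* only locked mutexes, interest level 1 */
    ca_client_status ( 1 );         /* the calling thread's ca client context */
}

This can also be invoked from gdb against the wedged process with "call dumpCaDiagnostics()".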

Revision history for this message
Jeff Hill (johill-lanl) wrote :

If the application is in preemptive callback mode then the callback control lock is used only to enforce that only one thread at a time may call a ca callback.

The threads that call callbacks are
o the tcp receive threads
o the udp receive threads (during client ctx shutdown and only in one other rare situation - ECA_NOSEARCHADDR message)
o asynchronous dns thread (when we see the multiply defined PV message)
o the timer queue threads (send watchdog expire, receive watchdog expire)
o the thread that destroys a ca context (during context shutdown)

We see that one of the ca client library's tcp receive threads is stuck waiting for the callback control lock, but all of the others are waiting in the normal place for new response messages (so they shouldn't own the callback control lock).
The UDP receive thread is parked in the normal place waiting for new messages (so it shouldn't own the callback control lock).
The timer queue threads are parked in the normal place waiting for a timer to expire (so they shouldn't own the callback control lock).
The asynchronous dns thread is parked in the normal place waiting for a new request (so it shouldn't own the callback control lock).

Revision history for this message
Jeff Hill (johill-lanl) wrote :

The most likely scenario would be that a thread was somehow destroyed while it owned the callback control lock, or alternatively that somehow the g++ object code neglected to run the destructor for an automatic instance of the guard class when unwinding the stack because of a C++ exception. Are there any messages just prior to the failure?

Does the issue occur when building the application with g++ optimization turned off? If so, which version of g++ is in use (use "g++ -v")?

Maybe this is what is different with this application; the client library is communicating with an in-process ca server. In an IOC the ca client library uses direct in-process function calls to communicate with the database so communication with an in-process ca server is maybe less commonly used. Is your ca client side application connecting directly to the in-process ca server?

Revision history for this message
Jeff Hill (johill-lanl) wrote :

> Some threads are deleted during application execution, but they terminate naturally at the end of their loop
> and are synchronized using pthread_join.

We are only concerned about poorly synchronized uses of, for example, pthread_cancel or pthread_kill so perhaps this is not the issue.

> The only part I don't have visibility is the CORBA layer with its thread pool.

Do any of the CORBA threads interact directly with the ca client context? In practice it may not matter, because in preemptive callback mode they (the CORBA threads) wouldn't be manipulating the callback control lock when they make ca calls. There may, however, be some unique failure scenarios involving thread-private variables (which the client library does use) and thread pools.

> This failure occurs systematically while reconfiguring our application.

This could be a scenario where some object is being manipulated after it was destroyed. It might help to run valgrind, Purify, etc. Also, running one of {ca_client_status, ca_context_status} might identify corruption issues (admittedly we might see an exception instead of a thread pending on a lock in that scenario), and/or possibly identify which thread still owns a compromised lock (that would be a substantial clue).

> Should I do a "ca_detach_context" before exiting any thread in my application?

I don't see any reason why this would be required, because the client library doesn't have any auto context destroy when the last thread using a context exits, but it also shouldn't be harmful.
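
For completeness, the usual pattern for a short-lived worker thread that shares a preemptive ca context is sketched below using the public cadef.h and epicsThread.h APIs (a sketch only, not code from the application in this report):

#include "cadef.h"
#include "epicsThread.h"

static void worker ( void * arg )
{
    /* join the context created by the main thread */
    ca_attach_context ( ( struct ca_client_context * ) arg );
    /* ... ca_get / ca_put / subscription work here ... */
    ca_detach_context ();   /* not required, but harmless, as noted above */
}

int main ()
{
    ca_context_create ( ca_enable_preemptive_callback );
    struct ca_client_context * ctx = ca_current_context ();

    epicsThreadCreate ( "worker", epicsThreadPriorityMedium,
        epicsThreadGetStackSize ( epicsThreadStackSmall ), worker, ctx );

    epicsThreadSleep ( 1.0 );   /* crude stand-in for joining the worker */
    ca_context_destroy ();
    return 0;
}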

visibility: private → public
Changed in epics-base:
importance: Undecided → Low
Revision history for this message
Andrew Johnson (anj) wrote :

No updates in 7 years => Won't Fix.

Changed in epics-base:
status: Incomplete → Won't Fix