My analysis so far:
This is the stack of the failing CAS-client thread:
#0 0xb7be13bc in __pthread_cond_wait (cond=0xb5c588c0, mutex=0xb5c588a8) at pthread_cond_wait.c:153
#1 0xb7cf5978 in __pthread_cond_wait (cond=<optimized out>, mutex=<optimized out>) at forward.c:139
#2 0x0fd3e374 in condWait (mutexId=0xb5c588a8, condId=0xb5c588c0) at ../../../src/libCom/osi/os/posix/osdEvent.c:75
#3 epicsEventWait (pevent=0xb5c588a8) at ../../../src/libCom/osi/os/posix/osdEvent.c:137
#4 0x0fd3c68c in epicsThreadSuspendSelf () at ../../../src/libCom/osi/os/Linux/osdThread.c:580
#5 0x0fd3a4fc in epicsAssert (pFile=0xfe870e0 "../dbEvent.c", line=531, pExp=0xfe85d68 "status == epicsMutexLockOK", pAuthorName=0xfd4c894 "the author") at ../../../src/libCom/osi/os/default/osdAssert.c:50
#6 0x0fe75ba4 in db_cancel_event (es=0x10d0d990) at ../dbEvent.c:531
#7 0x0ff05610 in destroyAllChannels (client=0x10d09e28, pList=0x10d09e68) at ../caservertask.c:653
#8 0x0ff06508 in destroy_tcp_client (client=0x10d09e28) at ../caservertask.c:697
#9 0x0ff073f8 in camsgtask (pParm=0x10d09e28) at ../camsgtask.c:160
#10 0x0fd3bbcc in start_routine (arg=0xb5c08258) at ../../../src/libCom/osi/os/Linux/osdThread.c:408
#11 0xb7bdc82c in start_thread (arg=0xb22ba480) at pthread_create.c:306
#12 0xb7ce70e4 in clone () from /lib/libc.so.6
camsgtask (camsgtask.c:41) exits due to an error
151: epicsPrintf ("CAS: forcing disconnect from %s\n", buf);
152: break;
153: }
154: }
155:
156: LOCK_CLIENTQ;
157: ellDelete ( &clientQ, &client->node );
158: UNLOCK_CLIENTQ;
159:
160: destroy_tcp_client ( client );
destroy_tcp_client (caservertask.c:676) calls destroyAllChannels
679: destroyAllChannels ( client, & client->chanList );
destroyAllChannels (caservertask.c:619) calls db_cancel_event
640: while ( TRUE ) {
641: /*
642: * AS state change could be using this list
643: */
644: epicsMutexMustLock ( client->eventqLock );
645: pevext = (struct event_ext *) ellGet ( &pciu->eventq );
646: epicsMutexUnlock ( client->eventqLock );
647:
648: if ( ! pevext ) {
649: break;
650: }
651:
652: if ( pevext->pdbev ) {
653: db_cancel_event (pevext->pdbev);
db_cancel_event (dbEvent.c:518) locks the event queue, which fails:
531:    LOCKEVQUE ( pevent->ev_que )
This macro expands to: epicsMutexMustLock((pevent->ev_que)->writelock); and the "status == epicsMutexLockOK" assertion seen in frame #5 of the stack is raised by epicsMutexMustLock.
So at this time the mutex seems to be invalid. Why?
Content of the mutex:
(gdb) print *(( struct evSubscrip * ) es) -> ev_que->writelock
$6 = {node = {next = 0x10459d08, previous = 0xb5c58410}, id = 0x11013848, pFileName = 0xfe870e0 "../dbEvent.c", lineno = 298}
The only possibility that makes sense is that it had been destroyed.
event_task (dbEvent.c:915) (the "CAS-event" thread) destroys the writelock mutexes when it exits:
928:    do{
...
961:        pendexit = evUser->pendexit;
962:        epicsMutexUnlock ( evUser->lock );
963:
964:    } while( ! pendexit );
965:
966:    epicsMutexDestroy(evUser->firstque.writelock);
967:
968:    {
969:        struct event_que *nextque;
970:
971:        ev_que = evUser->firstque.nextque;
972:        while(ev_que){
973:            nextque = ev_que->nextque;
974:            epicsMutexDestroy(ev_que->writelock);
975:            freeListFree(dbevEventQueueFreeList, ev_que);
976:            ev_que = nextque;
977:        }
978:    }
The loop exits and the mutexes get destroyed after evUser->pendexit becomes TRUE.
This happens in db_close_events (dbEvent.c:336):
348:    epicsMutexMustLock ( evUser->lock );
349:    evUser->pendexit = TRUE;
350:    epicsMutexUnlock ( evUser->lock );
351:    /* notify the waiting task */
352:    epicsEventSignal(evUser->ppendsem);
353:}
So who calls db_close_events? destroy_tcp_client (caservertask.c:676) does it:
697:    destroyAllChannels ( client, & client->chanList );
698:    destroyAllChannels ( client, & client->chanPendingUpdateARList );
699:
700: if ( client->evuser ) {
701: db_close_events (client->evuser);
702: }
703:
704: destroy_client ( client );
705:}
But it does so only *after* destroyAllChannels in line 697, where it already crashed!
So where else is db_close_events called? In dbContext.cpp.
a) In dbContext::subscribe, when creating the "CAC-event" thread failed:
214: int status = db_start_events ( tmpctx, "CAC-event",
215: cacAttachClientCtx, ca_current_context (), above );
216: if ( status ) {
217: db_close_events ( tmpctx );
218: throw std::bad_alloc ();
This would probably not have the observed effect, because the "CAC-event" thread which destroys the mutex is not running.
b) In dbContext::subscribe, when another thread called dbContext::subscribe at the same time:
221: if ( this->ctx ) {
222: // another thread tried to simultaneously setup
223: // the event system
224: db_close_events ( tmpctx );
225: }
226: else {
227: this->ctx = tmpctx;
228: }
I think this can never happen because the whole dbContext::subscribe function is guarded by a mutex.
c) In the ~dbContext destructor:
68:dbContext::~dbContext ()
69:{
70:    delete [] this->pStateNotifyCache;
71:    if ( this->ctx ) {
72:        db_close_events ( this->ctx );
This is the only possibility left. So where is the dbContext destroyed?
Only when epics_auto_ptr < cacContext > cacContextNotify::pServiceContext gets destroyed, assigned, or reset.
The assignment operator is not used. The destructor is not called explicitly, so it only runs at the end of its owner's destructor.
reset is called in the ca_client_context constructor (ca_client_context.cpp:69), but there it is still empty.
The only other place is the ~ca_client_context destructor (ca_client_context.cpp:171).
So this means that when the ca_client_context gets destroyed and calls the ~dbContext destructor, the "CAS-client" thread may still be active.
BTW: Do the threads use dbContext resources? If so, they must finish before the ~dbContext destructor completes.
More important here:
We must make sure that "CAS-event" (event_task) does not destroy the mutex before "CAS-client" (camsgtask) finishes.