Comment 1 for bug 1773373

Dirk Zimoch (dirk.zimoch) wrote:

My analysis so far:

This is the stack of the failing CAS-client thread:

#0 0xb7be13bc in __pthread_cond_wait (cond=0xb5c588c0, mutex=0xb5c588a8) at pthread_cond_wait.c:153
#1 0xb7cf5978 in __pthread_cond_wait (cond=<optimized out>, mutex=<optimized out>) at forward.c:139
#2 0x0fd3e374 in condWait (mutexId=0xb5c588a8, condId=0xb5c588c0) at ../../../src/libCom/osi/os/posix/osdEvent.c:75
#3 epicsEventWait (pevent=0xb5c588a8) at ../../../src/libCom/osi/os/posix/osdEvent.c:137
#4 0x0fd3c68c in epicsThreadSuspendSelf () at ../../../src/libCom/osi/os/Linux/osdThread.c:580
#5 0x0fd3a4fc in epicsAssert (pFile=0xfe870e0 "../dbEvent.c", line=531, pExp=0xfe85d68 "status == epicsMutexLockOK", pAuthorName=0xfd4c894 "the author") at ../../../src/libCom/osi/os/default/osdAssert.c:50
#6 0x0fe75ba4 in db_cancel_event (es=0x10d0d990) at ../dbEvent.c:531
#7 0x0ff05610 in destroyAllChannels (client=0x10d09e28, pList=0x10d09e68) at ../caservertask.c:653
#8 0x0ff06508 in destroy_tcp_client (client=0x10d09e28) at ../caservertask.c:697
#9 0x0ff073f8 in camsgtask (pParm=0x10d09e28) at ../camsgtask.c:160
#10 0x0fd3bbcc in start_routine (arg=0xb5c08258) at ../../../src/libCom/osi/os/Linux/osdThread.c:408
#11 0xb7bdc82c in start_thread (arg=0xb22ba480) at pthread_create.c:306
#12 0xb7ce70e4 in clone () from /lib/libc.so.6
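
Side note on why this "crash" shows up as a thread waiting in pthread_cond_wait (frames #0-#4): epicsAssert does not abort the process, it logs the failed assertion and then suspends the offending thread. Roughly (a paraphrase, not the literal osdAssert.c source):

/* Paraphrase of the default epicsAssert(); the real code prints more
 * context.  On failure it parks the thread instead of terminating the
 * IOC, which is why frame #0 ends up in pthread_cond_wait. */
void epicsAssert (const char *pFile, const unsigned line,
                  const char *pExp, const char *pAuthorName)
{
    errlogPrintf ("A call to \"assert(%s)\" failed in %s line %u.\n",
                  pExp, pFile, line);
    errlogPrintf ("Please report this problem to %s.\n", pAuthorName);
    epicsThreadSuspendSelf ();  /* -> epicsEventWait -> pthread_cond_wait */
}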

camsgtask (camsgtask.c:41) exits due to an error:
151: epicsPrintf ("CAS: forcing disconnect from %s\n", buf);
152: break;
153: }
154: }
155:
156: LOCK_CLIENTQ;
157: ellDelete ( &clientQ, &client->node );
158: UNLOCK_CLIENTQ;
159:
160: destroy_tcp_client ( client );

destroy_tcp_client (caservertask.c:676) calls destroyAllChannels
697: destroyAllChannels ( client, & client->chanList );

destroyAllChannels (caservertask.c:619) calls db_cancel_event
640: while ( TRUE ) {
641: /*
642: * AS state change could be using this list
643: */
644: epicsMutexMustLock ( client->eventqLock );
645: pevext = (struct event_ext *) ellGet ( &pciu->eventq );
646: epicsMutexUnlock ( client->eventqLock );
647:
648: if ( ! pevext ) {
649: break;
650: }
651:
652: if ( pevext->pdbev ) {
653: db_cancel_event (pevext->pdbev);

db_cancel_event (dbEvent.c:518) tries to lock the event queue, and this lock fails:
531: LOCKEVQUE ( pevent->ev_que )
This macro expands to: epicsMutexMustLock((pevent->ev_que)->writelock);
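
epicsMutexMustLock in turn is (approximately, paraphrasing epicsMutex.h) a lock attempt followed by an assert, and that assert is exactly the "status == epicsMutexLockOK" failure visible in frame #5:

/* Approximate expansion, paraphrased from epicsMutex.h: */
#define epicsMutexMustLock(ID) { \
    epicsMutexLockStatus status = epicsMutexLock(ID); \
    assert(status == epicsMutexLockOK); \
}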

So at this time the mutex seems to be invalid. Why?
Content of the mutex:
(gdb) print *(( struct evSubscrip * ) es) -> ev_que->writelock
$6 = {node = {next = 0x10459d08, previous = 0xb5c58410}, id = 0x11013848, pFileName = 0xfe870e0 "../dbEvent.c", lineno = 298}
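
What gdb prints here is the libCom bookkeeping record behind the epicsMutexId, not the OS mutex itself. The layout below is my reading of libCom/osi/epicsMutex.cpp, with the field meanings inferred from the output above:

/* Assumed layout, matching the fields gdb shows above: */
struct epicsMutexParm {
    ELLNODE node;           /* entry in the global list of mutexes */
    epicsMutexOSD *id;      /* handle of the underlying OS mutex */
    const char *pFileName;  /* where epicsMutexCreate() was called ... */
    int lineno;             /* ... here: dbEvent.c line 298 */
};

So the record itself does not look like corrupted memory; it even still tells us where the mutex was created. What fails is the lock operation on it.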

The only possibility that makes sense is that it had already been destroyed.

event_task (dbEvent.c:915) ("CAS-event" thread) destroys writelock mutexes when it exits:
928: do{
...
961: pendexit = evUser->pendexit;
962: epicsMutexUnlock ( evUser->lock );
963:
964: } while( ! pendexit );
965:
966: epicsMutexDestroy(evUser->firstque.writelock);
967:
968: {
969: struct event_que *nextque;
970:
971: ev_que = evUser->firstque.nextque;
972: while(ev_que){
973: nextque = ev_que->nextque;
974: epicsMutexDestroy(ev_que->writelock);
975: freeListFree(dbevEventQueueFreeList, ev_que);
976: ev_que = nextque;
977: }
978: }

The loop exits and the mutexes get destroyed after evUser->pendexit becomes TRUE.
This happens in db_close_events (dbEvent.c:336):
348: epicsMutexMustLock ( evUser->lock );
349: evUser->pendexit = TRUE;
350: epicsMutexUnlock ( evUser->lock );
351: /* notify the waiting task */
352: epicsEventSignal(evUser->ppendsem);
353:}

So who calls db_close_events?
destroy_tcp_client (caservertask.c:676) does it:
697: destroyAllChannels ( client, & client->chanList );
698: destroyAllChannels ( client, & client->chanPendingUpdateARList );
699:
700: if ( client->evuser ) {
701: db_close_events (client->evuser);
702: }
703:
704: destroy_client ( client );
705:}

But it does so only *after* the destroyAllChannels call in line 697, where it had already crashed!

So where else is db_close_events called? In dbContext.cpp.
a) In dbContext::subscribe when creating the "CAC-event" thread failed:
214: int status = db_start_events ( tmpctx, "CAC-event",
215: cacAttachClientCtx, ca_current_context (), above );
216: if ( status ) {
217: db_close_events ( tmpctx );
218: throw std::bad_alloc ();

This would probably not have the observed effect, because in that case the "CAC-event" thread, which destroys the mutex, is not running.

b) In dbContext::subscribe when another thread called dbContext::subscribe at the same time.
221: if ( this->ctx ) {
222: // another thread tried to simultaneously setup
223: // the event system
224: db_close_events ( tmpctx );
225: }
226: else {
227: this->ctx = tmpctx;
228: }

I think this can never happen because the whole dbContext::subscribe function is guarded with a mutex.

c) In the ~dbContext destructor:
68:dbContext::~dbContext ()
69:{
70: delete [] this->pStateNotifyCache;
71: if ( this->ctx ) {
72: db_close_events ( this->ctx );

This is the only possibility left. So where is the dbContext destroyed?
Only when epics_auto_ptr < cacContext > cacContextNotify::pServiceContext gets destroyed, assigned, or reset.
The assignment operator is not used. The destructor is not called explicitly, so it only runs at the end of its owner's destructor.
reset is called in the ca_client_context constructor (ca_client_context.cpp:69), but at that point it is still empty.
The only other place is in the ~ca_client_context destructor (ca_client_context.cpp:171).

So this means that when the ca_client_context gets destroyed and calls the ~dbContext destructor, the "CAS-client" thread may still be active.
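
To summarize the destruction chain traced above (my reading of the code, not verified in a debugger):

~ca_client_context (ca_client_context.cpp:171)
 -> epics_auto_ptr<cacContext> pServiceContext is destroyed
   -> ~dbContext (dbContext.cpp:68)
     -> db_close_events(this->ctx)   sets evUser->pendexit = TRUE
       -> event_task leaves its loop
         -> epicsMutexDestroy(ev_que->writelock)
            ... while "CAS-client" may still be inside db_cancel_event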

BTW: Are the threads using dbContext resources? If that is the case, they must finish before the ~dbContext destructor completes.

More important here:
We must make sure that "CAS-event" (event_task) does not destroy the mutex before "CAS-client" (camsgtask) finishes.
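
Purely to illustrate that ordering requirement (not a proposed patch; clientDone and both functions below are hypothetical names), the pattern would be: the last user of the mutex signals completion, and only then does the owning side destroy it:

#include "epicsEvent.h"
#include "epicsMutex.h"

/* Hypothetical example only: clientDone would have to live in whatever
 * structure owns the writelock mutexes (e.g. the evUser). */
static epicsEventId clientDone;   /* created with epicsEventCreate(epicsEventEmpty) */
static epicsMutexId writelock;

/* "CAS-client" side (camsgtask / destroy_tcp_client), hypothetical: */
static void clientFinished(void)
{
    /* ... last db_cancel_event() has returned, writelock no longer needed ... */
    epicsEventSignal(clientDone);
}

/* "CAS-event" side (event_task shutdown path), hypothetical: */
static void eventTaskExit(void)
{
    epicsEventMustWait(clientDone);   /* wait until the client is really done */
    epicsMutexDestroy(writelock);     /* only now is it safe to destroy */
    epicsEventDestroy(clientDone);
}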