race condition when destroying subscripion in preemptive callback mode application

Bug #1179642 reported by Jeff Hill
6
This bug affects 1 person
Affects Status Importance Assigned to Milestone
EPICS Base
Fix Released
Undecided
Unassigned
3.14
Fix Released
Undecided
Unassigned

Bug Description

From Michael Abbot

see http://www.aps.anl.gov/epics/tech-talk/2012/msg00584.php

> > In the attached test IOC I repeatedly create 500 subscriptions to 500
> > locally published PVs, pause a few hundred microseconds, and then
> > proceed to tear them all down again. The context pointer I pass
> > (args.usr) just contains a validity flag which I reset after
> > ca_clear_subscription returns -- and which I test in the callback.
> >
> > Below is a typical run:
> >
> > $ ./test 10 500
> > dbLoadDatabase("dbd/TEST.dbd", NULL, NULL)
> > TEST_registerRecordDeviceDriver(pdbbase)
> > dbLoadRecords("db/TEST.db", NULL)
> > iocInit()
> > Starting iocInit
> >
> ###############################################################
> ########
> > #####
> > ## EPICS R3.14.11 $R3-14-11$ $2009/08/28 18:47:36$
> > ## EPICS Base built Nov 4 2011
> >
> ###############################################################
> ########
> > #####
> > iocRun: All initialization complete
> > All channels connected
> > Testing 10 cycles, interval 500 us
> > [......................................................................
> > .......................................................................
> > .......................................................................
> > .......................................................................
> > .......................................................................
> > .......................................................................
> > ...............................................................whoops!
> > ][
> >
> >
> > The two arguments to `test` are number of times to try and how long to
> > pause between create and clear (in microseconds, passed to usleep(3)).
> > [ and ] are printed at the start and end of a cycle (so [ is
> > immediately followed by a burst of ca_create_subscription() calls) and
> > each . represents a successful callback. An unsuccessful (invalid)
> > callback is shown by 'whoops!' which is followed by an exit() call.
> >
> > This test can be very delicate and difficult to reproduce, and may need
> > to be run many times with slightly different pause intervals before
> > being even partially repeatable -- the fault only appears to show when
> > there isn't time for all 500 PVs to complete their initial updates, but
> > there has to be enough time for them all to make the effort.
> >
> > Another interesting detail follows from some locking I'm doing. Here
> > is an extract of the relevant code (LOCK() is just
> > pthread_mutex_lock(3p) on a global mutex):
> >
> > 1 static void on_update(struct event_handler_args)
> > 2 {
> > 3 struct event *event = args.usr;
> > 4 LOCK();
> > 5 bool valid = event->valid;
> > 6 UNLOCK();
> > 7 if (valid) ...
> > 8 }
> >
> > ...
> >
> > 9 LOCK(); // This should trigger deadlock
> > 10 ca_clear_subscription(event->event_id);
> > 11 event->valid = false;
> > 12 UNLOCK();
> >
> > It seems to me that if ca_clear_subscription() is correctly doing what
> > we discussed a year ago, which is to say, if it is waiting for all
> > outstanding callbacks to complete before returning, then the LOCK() on
> > line 9 should trigger a deadlock when ca_clear_subscription() is called
> > with its associated callback still only on line 3 (or earlier). But I
> > never see my test deadlock.
> >
> > I'm seeing this problem occur on test code which is repeatedly creating
> > and destroying subscriptions, but I've previously reported this on CA
> > client shutdown, so it does look to me like there is a general
> > synchronisation problem here. I believe I have a workaround, which is
> > to delay releasing the callback context to give time for outstanding
> > callbacks to complete, but this is a bit worrysome...
>
>

Tags: ca client library
Jeff Hill (johill-lanl)
tags: added: ca client library
Jeff Hill (johill-lanl)
Changed in epics-base:
status: New → Confirmed
Revision history for this message
Michael Abbott (michael-araneidae) wrote :

Here's the attachment from the original e-mail.

Revision history for this message
Jeff Hill (johill-lanl) wrote :
Download full text (5.6 KiB)

I added this new regression test based on Michaels test

#define multiSubscrDestroyNoLateCallbackEventCount 500

struct MultiSubscrDestroyNoLateCallbackEventData {
    evid m_id;
    size_t m_nCallback;
    int m_callbackIsOk;
    struct MultiSubscrDestroyNoLateCallbackTestData * m_pTestData;
};

struct MultiSubscrDestroyNoLateCallbackTestData {
    const char * m_pChanName;
    chid m_chan;
    epicsMutexId m_mutex;
    epicsEventId m_testDoneEvent;
    unsigned m_interestLevel;
    struct MultiSubscrDestroyNoLateCallbackEventData
        m_eventData [multiSubscrDestroyNoLateCallbackEventCount];
};

static void noLateCallbackDetect ( struct event_handler_args args )
{
    int callbackIsOk;
    struct MultiSubscrDestroyNoLateCallbackEventData * const pEventData = args.usr;
    epicsMutexLockStatus lockStatus = epicsMutexLock ( pEventData->m_pTestData->m_mutex );
    callbackIsOk = pEventData->m_callbackIsOk;
    pEventData->m_nCallback++;
    epicsMutexUnlock ( pEventData->m_pTestData->m_mutex );
    verify ( lockStatus == epicsMutexLockOK );
    verify ( callbackIsOk );
}

static void multiSubscrDestroyNoLateCallbackThread ( void * pParm )
{
    struct MultiSubscrDestroyNoLateCallbackTestData * const pTestData =
              ( struct MultiSubscrDestroyNoLateCallbackTestData * ) pParm;
    unsigned i, j;
    int status;

    status = ca_context_create ( ca_enable_preemptive_callback );
    verify ( status == ECA_NORMAL );

    status = ca_create_channel ( pTestData->m_pChanName, 0, 0,
                           CA_PRIORITY_DEFAULT, &pTestData->m_chan );
    status = ca_pend_io ( timeoutToPendIO );
    SEVCHK ( status, "multiSubscrDestroyLateNoCallbackTest: channel connect failed" );
    verify ( status == ECA_NORMAL );

    /*
     * create a set of subscriptions
     */
    for ( i=0; i < 10000; i++ ) {
        unsigned int priorityOfTestThread;
        for ( j=0; j < multiSubscrDestroyNoLateCallbackEventCount; j++ ) {
            epicsMutexLockStatus lockStatus = epicsMutexLock ( pTestData->m_mutex );
            verify ( lockStatus == epicsMutexLockOK );
            pTestData->m_eventData[j].m_nCallback = 0;
            pTestData->m_eventData[j].m_callbackIsOk = TRUE;
            pTestData->m_eventData[j].m_pTestData = pTestData;
            epicsMutexUnlock ( pTestData->m_mutex );
            SEVCHK ( ca_add_event ( DBR_GR_FLOAT, pTestData->m_chan, noLateCallbackDetect,
                        &pTestData->m_eventData[j], &pTestData->m_eventData[j].m_id ) , NULL );
        }
        SEVCHK ( ca_flush_io(), NULL );

        /*
         * raise the priority of the current thread hoping to improve our
         * likelyhood of detecting a bug
         */
        priorityOfTestThread = epicsThreadGetPrioritySelf ();
        epicsThreadSetPriority ( epicsThreadGetIdSelf(), epicsThreadPriorityHigh );

        /*
         * wait for the first subscription update to arrive
         */
        {
            epicsMutexLockStatus lockStatus = epicsMutexLock ( pTestData->m_mutex );
            verify ( lockStatus == epicsMutexLockOK );
            while ( pTestData->m_eventData[0].m_nCallback == 0 ) {
                epicsMutexUnlock ( pTestData-...

Read more...

Changed in epics-base:
milestone: none → 3.15.branch
Jeff Hill (johill-lanl)
Changed in epics-base:
status: Confirmed → Fix Committed
Revision history for this message
Jeff Hill (johill-lanl) wrote :

In summary, the primary client library mutex must be released when calling the user's callback, and therefore a finite interval in time exists when we are moving forward with the intent to call the users callback but the users IO object could also be deleted by the user during this interval. To prevent, in all situations, the user's callback from being called after he destroys his IO object, we must install a fix that acquires the callback control mutex in the client library when destroying any IO object.

Andrew Johnson (anj)
Changed in epics-base:
status: Fix Committed → Fix Released
To post a comment you must log in.
This report contains Public information  
Everyone can see this information.

Other bug subscribers

Remote bug watches

Bug watches keep track of this bug in other bug trackers.