Race condition when destroying subscription in preemptive callback mode application
Affects | Status | Importance | Assigned to | Milestone
---|---|---|---|---
EPICS Base | Fix Released | Undecided | Unassigned |
3.14 | Fix Released | Undecided | Unassigned |
Bug Description
From Michael Abbott:
see http://
> > In the attached test IOC I repeatedly create 500 subscriptions to 500
> > locally published PVs, pause a few hundred microseconds, and then
> > proceed to tear them all down again. The context pointer I pass
> > (args.usr) just contains a validity flag which I reset after
> > ca_clear_
> >
> > Below is a typical run:
> >
> > $ ./test 10 500
> > dbLoadDatabase(
> > TEST_registerRe
> > dbLoadRecords(
> > iocInit()
> > Starting iocInit
> >
> #######
> ########
> > #####
> > ## EPICS R3.14.11 $R3-14-11$ $2009/08/28 18:47:36$
> > ## EPICS Base built Nov 4 2011
> >
> #######
> ########
> > #####
> > iocRun: All initialization complete
> > All channels connected
> > Testing 10 cycles, interval 500 us
> > [..................................................][
> >
> >
> > The two arguments to `test` are number of times to try and how long to
> > pause between create and clear (in microseconds, passed to usleep(3)).
> > [ and ] are printed at the start and end of a cycle (so [ is
> > immediately followed by a burst of ca_create_
> > each . represents a successful callback. An unsuccessful (invalid)
> > callback is shown by 'whoops!' which is followed by an exit() call.
> >
> > This test can be very delicate and difficult to reproduce, and may need
> > to be run many times with slightly different pause intervals before
> > being even partially repeatable -- the fault only appears to show when
> > there isn't time for all 500 PVs to complete their initial updates, but
> > there has to be enough time for them all to make the effort.
> >
> > Another interesting detail follows from some locking I'm doing. Here
> > is an extract of the relevant code (LOCK() is just
> > pthread_
> >
> > 1 static void on_update(struct event_handler_args args)
> > 2 {
> > 3 struct event *event = args.usr;
> > 4 LOCK();
> > 5 bool valid = event->valid;
> > 6 UNLOCK();
> > 7 if (valid) ...
> > 8 }
> >
> > ...
> >
> > 9 LOCK(); // This should trigger deadlock
> > 10 ca_clear_
> > 11 event->valid = false;
> > 12 UNLOCK();
> >
> > It seems to me that if ca_clear_
> > we discussed a year ago, which is to say, if it is waiting for all
> > outstanding callbacks to complete before returning, then the LOCK() on
> > line 9 should trigger a deadlock when ca_clear_
> > with its associated callback still only on line 3 (or earlier). But I
> > never see my test deadlock.
> >
> > I'm seeing this problem occur on test code which is repeatedly creating
> > and destroying subscriptions, but I've previously reported this on CA
> > client shutdown, so it does look to me like there is a general
> > synchronisation problem here. I believe I have a workaround, which is
> > to delay releasing the callback context to give time for outstanding
> > callbacks to complete, but this is a bit worrisome...
tags: added: ca client library
Changed in epics-base: status: New → Confirmed
Changed in epics-base: status: Confirmed → Fix Committed
Changed in epics-base: status: Fix Committed → Fix Released
Here's the attachment from the original e-mail.