EPICS Base

pcas deadlocks in casEventSys

Bug #1830957 reported by Till Straumann on 2019-05-29

This bug affects 1 person

Affects      Status   Importance   Assigned to   Milestone
EPICS Base   New      Undecided    Unassigned    -

Bug Description

We observe a deadlock situation in the pcas server:

In the traces below, 1) and 2) are two threads; the indented lines show each thread's call stack.

1) Application calls casPV::postEvent();
     casPVI::postEvent() takes casPVI::.mutex
        ...
          casEventSys::postEvent() takes casEventSys::.mutex

2) server thread runs fileDescriptorManager.process(..)
     ...
       casEventSys::process() takes casEventSys::.mutex
          ...
             casAsyncWriteIOI::cbFuncAsyncIO()
                this->chan.uninstallIO()
                    ..
                        casPVI::uninstallIO() takes casPVI::.mutex

Thus, we have the classical case of two threads trying to acquire two locks in opposite order.
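
To make the lock-order inversion concrete, here is a minimal standalone sketch. It does not use pcas at all: pvMutex and evSysMutex are plain std::mutex stand-ins for casPVI::.mutex and casEventSys::.mutex, and demonstrateLockInversion() is a hypothetical helper. Each thread takes its first lock, waits until the other thread holds its own first lock, and then uses try_lock() on the second one so that the demonstration reports the conflict instead of hanging.

```cpp
#include <atomic>
#include <functional>
#include <mutex>
#include <thread>

// Result of the experiment: did each thread manage to take its second lock?
struct InversionResult {
    bool appGotSecond;
    bool serverGotSecond;
};

// Hypothetical demonstration, not pcas code.
InversionResult demonstrateLockInversion()
{
    std::mutex pvMutex;     // stand-in for casPVI::.mutex
    std::mutex evSysMutex;  // stand-in for casEventSys::.mutex
    std::atomic<int> holdingFirst{0};
    std::atomic<int> doneTrying{0};
    InversionResult result{true, true};

    auto worker = [&](std::mutex &first, std::mutex &second, bool &gotSecond) {
        std::lock_guard<std::mutex> guard(first);
        // Wait until both threads hold their first lock.
        ++holdingFirst;
        while (holdingFirst.load() < 2) std::this_thread::yield();
        // A blocking lock() here would be the deadlock; try_lock() just fails.
        gotSecond = second.try_lock();
        if (gotSecond) second.unlock();
        // Keep holding 'first' until the other thread has made its attempt.
        ++doneTrying;
        while (doneTrying.load() < 2) std::this_thread::yield();
    };

    // Thread 1 (application): PV lock first, then event system lock.
    std::thread app(worker, std::ref(pvMutex), std::ref(evSysMutex),
                    std::ref(result.appGotSecond));
    // Thread 2 (server): event system lock first, then PV lock.
    std::thread server(worker, std::ref(evSysMutex), std::ref(pvMutex),
                       std::ref(result.serverGotSecond));
    app.join();
    server.join();
    return result;
}
```

Both flags come back false: neither thread can obtain its second lock while the other holds it. With blocking lock() calls, as in the traces above, both threads would wait forever.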

Note that this bug has already been experienced and discussed on tech-talk (I could not find a Launchpad bug report, though):

https://epics.anl.gov/tech-talk/2016/msg01930.php
https://github.com/paulscherrerinstitute/pcaspy/issues/29

A "solution" to the particular race condition reported then has been put in place.
This "solution" is, IMHO, a mere hack that works around one particular scenario.

(Another potential race condition is in casPVI::updateEnumStringTableAsyncCompletion()
when it is called from casAsyncReadIOI::cbFuncAsyncIO(), and there may be more.)

The deeper problem is -- again IMHO -- a design flaw in the event processing loop which
holds on to the casEventSys::.mutex while working on the callbacks.

It is not unreasonable (and quite common in other event processing systems I have seen)
for an application to post to an asynchronous facility from a guarded code section
and for callbacks to be synchronized using the same (application) lock:

{ guard( myLock );
  POST_TO_ASYNC_FACILITY( somewhere, myCallback );
  other_guarded_business();
}

and

myCallback()
{ guard( myLock );
  do_something();
}

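This pattern is not possible with pcas: casEventSys::process() holds casEventSys::.mutex while it runs the callback, so the application lock and the event system lock can be taken in opposite order.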

-> I believe the casEventSys::process() loop should be reviewed:
    - release casEventSys::.mutex while working on the callback
    - remove the epicsGuard< evSysMutex > & argument from casEvent::cbFunc()
      (this is super ugly anyway; a callback should not have to know about
      the locking semantics of the event loop)
