pcas deadlocks in casEventSys

Bug #1830957 reported by Till Straumann
This bug affects 1 person
Affects     Status  Importance  Assigned to  Milestone
EPICS Base  New     Undecided   Unassigned

Bug Description

We observe a deadlock situation in the pcas server:

Indented lines represent the call stack; 1) and 2) are two separate threads.

1) Application calls casPV::postEvent();
     casPVI::postEvent() takes casPVI::.mutex
        ...
          casEventSys::postEvent() takes casEventSys::.mutex

2) server thread runs fileDescriptorManager.process(..)
     ...
       casEventSys::process() takes casEventSys::.mutex
          ...
             casAsyncWriteIOI::cbFuncAsyncIO()
                this->chan.uninstallIO()
                    ..
                        casPVI::uninstallIO() takes casPVI::.mutex

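Thus, we have the classic case of two threads trying to acquire two locks in opposite order: thread 1) takes casPVI::.mutex and then casEventSys::.mutex, while thread 2) takes casEventSys::.mutex and then casPVI::.mutex.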
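
The inversion can be reproduced outside of pcas with two plain mutexes. The following is only a minimal sketch: pvMutex and evMutex are hypothetical stand-ins for casPVI::.mutex and casEventSys::.mutex, and demonstrateInversion() is a made-up name. It uses std::timed_mutex with a timeout so that the second acquisition reports failure instead of hanging forever.

```cpp
#include <atomic>
#include <cassert>
#include <chrono>
#include <functional>
#include <mutex>
#include <thread>

// Result of the experiment: did either thread obtain its second lock?
struct InversionResult {
    bool posterGotSecond;
    bool serverGotSecond;
};

// Spin until 'counter' reaches 'target' (a tiny two-thread barrier).
inline void waitFor(std::atomic<int>& counter, int target) {
    while (counter.load() < target) std::this_thread::yield();
}

InversionResult demonstrateInversion() {
    // Hypothetical stand-ins for casPVI::.mutex and casEventSys::.mutex.
    std::timed_mutex pvMutex, evMutex;
    std::atomic<int> holdingFirst{0}, doneTrying{0};
    InversionResult r{false, false};

    auto worker = [&](std::timed_mutex& first, std::timed_mutex& second,
                      bool& gotSecond) {
        std::lock_guard<std::timed_mutex> guard(first);
        ++holdingFirst;
        waitFor(holdingFirst, 2);   // both threads now hold their first lock
        // With a plain lock() this call would block forever.
        gotSecond = second.try_lock_for(std::chrono::milliseconds(50));
        if (gotSecond) second.unlock();
        ++doneTrying;
        waitFor(doneTrying, 2);     // keep 'first' held until both have tried
    };

    // "poster" mimics thread 1): PV lock first, then event system lock.
    std::thread poster(worker, std::ref(pvMutex), std::ref(evMutex),
                       std::ref(r.posterGotSecond));
    // "server" mimics thread 2): event system lock first, then PV lock.
    std::thread server(worker, std::ref(evMutex), std::ref(pvMutex),
                       std::ref(r.serverGotSecond));
    poster.join();
    server.join();
    assert(doneTrying.load() == 2); // both threads ran to completion
    return r;
}
```

Calling demonstrateInversion() should yield false for both fields: each thread holds the lock the other one needs, so neither can make progress without the timeout.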

Note that this bug has been encountered before and was discussed on tech-talk and in the pcaspy issue tracker (I could not find a Launchpad bug report, though):

  https://epics.anl.gov/tech-talk/2016/msg01930.php
  https://github.com/paulscherrerinstitute/pcaspy/issues/29

and a "solution" to the particular race condition reported then has been put in place.
This "solution" is, IMHO, but a mere hack which works around one particular scenario.

(Another potential race condition is casPVI::updateEnumStringTableAsyncCompletion()
when called from casAsyncReadIOI::cbFuncAsyncIO(), and there may be more.)

The deeper problem is -- again IMHO -- a design flaw in the event processing loop which
holds on to the casEventSys::.mutex while working on the callbacks.

It is not unreasonable (and quite common in other event processing systems I have seen)
for an application to post to an asynchronous facility from a guarded code section
and for callbacks to be synchronized using the same (application) lock:

{ guard( myLock );
  POST_TO_ASYNC_FACILITY( somewhere, myCallback );
  other_guarded_business();
}

and

myCallback()
{ guard( myLock );
  do_something();
}

This pattern is not possible with pcas, because the callback runs while casEventSys::.mutex is held.

-> I believe the casEventSys::process() loop should be reviewed:
    - release casEventSys::.mutex while working on the callback
    - remove the epicsGuard< evSysMutex > & argument from casEvent::cbFunc()
      (this is super ugly anyway; callbacks should not have to know about
      the locking semantics of the event loop)
