Deadlock in ca_clear_subscription()

Bug #1751380 reported by Andrew Johnson on 2018-02-23
6
This bug affects 1 person
Affects Status Importance Assigned to Milestone
EPICS Base
Status tracked in 7.0
3.14
Undecided
Unassigned
3.15
Undecided
Unassigned
3.16
Undecided
Unassigned
7.0
Medium
Ben Franksen

Bug Description

A preemptive CA client program that acquires a mutex inside its subscription event callback and also holds that same mutex in its main thread when calling ca_clear_subscription() can deadlock with the cbMutex in cac. There is nothing in the reference documentation that makes this practice illegal.

The attached C program demonstrates the issue. To run it, first start an IOC with the following database to generate a continuous stream of monitor updates from X1:

record(calc, "X1") {
    field(INPA, "X1.VAL CP")
    field(CALC, "A+1")
}

Then in a second window run "bin/<host-arch>/deadlock X1" and wait for it to stop (it usually only takes a few seconds). Here's an example of the output I get (the ^C shows where it hung):

tux% bin/linux-x86_64/deadlock X1

Connecting #0

accessRightsCallback
  pv: X1 type(-1) nelements(0) host(tux.aps.anl.gov:5064)
      read(1) write(1) state(0)

connectionCallback
  pv: X1 type(6) nelements(1) host(tux.aps.anl.gov:5064)
      read(1) write(1) state(2)
Event Callback: X1 = 177085

Disconnecting:
Unsubscribing from X1 ... done
Disconnected from X1

Connecting #1

accessRightsCallback
  pv: X1 type(-1) nelements(0) host(tux.aps.anl.gov:5064)
      read(1) write(1) state(0)

connectionCallback
  pv: X1 type(6) nelements(1) host(tux.aps.anl.gov:5064)
      read(1) write(1) state(2)
Event Callback: X1 = 179442
Event Callback: X1 = 179443
Event Callback: X1 = 179444
Event Callback: X1 = 179445
Event Callback: X1 = 179446
Event Callback: X1 = 179447
Event Callback: X1 = 179448
Event Callback: X1 = 179449
Event Callback: X1 = 179450
Event Callback: X1 = 179451
Event Callback: X1 = 179452
Event Callback: X1 = 179453
Event Callback: X1 = 179454
Event Callback: X1 = 179455
Event Callback: X1 = 179456
Event Callback: X1 = 179457
Event Callback: X1 = 179458
Event Callback: X1 = 179459
Event Callback: X1 = 179460
Event Callback: X1 = 179461

Disconnecting:
Unsubscribing from X1 ... ^C

I don't know how this is related to the fix to lp: #1179642, but I get a segfault from the call to ca_name() inside the eventCallback() routine instead of the hang if I build deadlock against a version of Base that pre-dates that fix.

The deadlock is caused by different lock ordering. During monitor callbacks the client program's mutex is acquired after the context's cbMutex, but in the main thread the client program's mutex is acquired before the context's cbMutex.

Given the complexity of this area an acceptable fix might just be to document a requirement that ca_clear_subscription() must not be called while owning a mutex that is also taken inside the event callback subroutine.

Tags: ca Edit Tag help

Related branches

Andrew Johnson (anj) wrote :
Ben Franksen (bfrk) wrote :

As similar problem with ca_clear_channel has been known to me for a long time, I remember stumbling over it in the sequencer. I agree that it makes the most sense to clearly document these behaviors.

The current behavior has been introduced as a fix for a bug. IIRC, this fix was made somewhere between 3.14.10 and 3.14.11. The problem was that callbacks could not rely on the chid being valid, because the chid was invalidated while a callback was still running (in another thread). Jeff Hill changed it so that ca_clear_channel waits for any active callback to complete before invalidating the chid, which is the correct behavior. OTOH it means that you must not hold a mutex that is also taken by your callback when calling ca_clear_channel.

I believe things are similar for ca_clear_subscription.

Andrew Johnson (anj) wrote :

Thanks for the description of the earlier fix, that confirms my understanding.

I have just confirmed that removing the call to ca_clear_subscription() in the main() function of the program I attached above allows the deadlock to occur in the call to ca_clear_channel() instead.

The fix is thus to describe these limitations in CAref.html on the 3.14 branch. Patches or merge requests are welcome from anyone with the time to do this.

Changed in epics-base:
milestone: none → 3.14.branch
Ben Franksen (bfrk) wrote :

I can do it but I need advice how to contribute with git on launchpad.

Ralph Lange (ralph-lange) wrote :

That should have enough information. Does it?

https://epics.anl.gov/core-talk/2018/msg00243.php

Ralph Lange (ralph-lange) wrote :

If you want to be able to create repos in the common epics-core area, Andrew has to add you to the epics-core group so that you get write access.

Andrew Johnson (anj) wrote :

Just as with Bazaar, every Launchpad user can push to their own git repositories for any LP project without needing to belong to the project itself. Thus as LP user "bfrk" Ben can do

  git remote add lp-3.14 git+ssh://<email address hidden>/~bfrk/epics-base/+git/base-3.14

and push to that to create a git repo for base-3.14 changes. The remote name and final path component of the URL are of course up to Ben to choose.

Repo's with URLs in the form …/~epics-core/… that Ralph's email suggested belong to the Core Developers team and so require membership of that team to be able to push to them (which is better for joint development work). I would be happy to add Ben to the team if he wants, but that won't be true for everyone who creates merge requests for Base, so the above URL pattern is a better solution for the general case.

Thanks.

Ralph Lange (ralph-lange) wrote :

Yes, sure, obviously I wasn't considering Ben to be the general case.

Ben Franksen (bfrk) wrote :

Thanks Andrew, Ralph, I think this will get me going.

Ben Franksen (bfrk) wrote :

So this note will be attached to:

- ca_clear_channel
- ca_clear_subscription

and I guess also

- ca_context_destroy

as presumably it probably calls the above two functions for each channel (or at least does something with the same effect).

Andrew Johnson (anj) wrote :

I think I just confirmed that the deadlock happens with ca_context_destroy() too, thanks!

To post a comment you must log in.
This report contains Public information  Edit
Everyone can see this information.

Other bug subscribers