EPICS Base

Deadlock in ca_clear_subscription()

Bug #1751380 reported by Andrew Johnson on 2018-02-23

This bug affects 1 person

	Status	Importance	Assigned to	Milestone
EPICS Base	Status tracked in 7.0
3.14	Fix Released	Undecided	Unassigned	EPICS Base 3.14.12.8
3.15	Fix Released	Undecided	Unassigned	EPICS Base 3.15.6
3.16	Fix Released	Undecided	Unassigned	EPICS Base 3.16.2
7.0	Fix Released	Medium	Ben Franksen	EPICS Base 7.0.2

Bug Description

A preemptive CA client program that acquires a mutex inside its subscription event callback and also holds that same mutex in its main thread when calling ca_clear_subscription() can deadlock with the cbMutex in cac. There is nothing in the reference documentation that makes this practice illegal.

The attached C program demonstrates the issue. To run it, first start an IOC with the following database to generate a continuous stream of monitor updates from X1:

record(calc, "X1") {
field(INPA, "X1.VAL CP")
field(CALC, "A+1")
}

Then in a second window run "bin/<host-arch>/deadlock X1" and wait for it to stop (it usually only takes a few seconds). Here's an example of the output I get (the ^C shows where it hung):

tux% bin/linux-x86_64/deadlock X1

Connecting #0

accessRightsCallback
pv: X1 type(-1) nelements(0) host(tux.aps.anl.gov:5064)
read(1) write(1) state(0)

connectionCallback
pv: X1 type(6) nelements(1) host(tux.aps.anl.gov:5064)
read(1) write(1) state(2)
Event Callback: X1 = 177085

Disconnecting:
Unsubscribing from X1 ... done
Disconnected from X1

Connecting #1

accessRightsCallback
pv: X1 type(-1) nelements(0) host(tux.aps.anl.gov:5064)
read(1) write(1) state(0)

connectionCallback
pv: X1 type(6) nelements(1) host(tux.aps.anl.gov:5064)
read(1) write(1) state(2)
Event Callback: X1 = 179442
Event Callback: X1 = 179443
Event Callback: X1 = 179444
Event Callback: X1 = 179445
Event Callback: X1 = 179446
Event Callback: X1 = 179447
Event Callback: X1 = 179448
Event Callback: X1 = 179449
Event Callback: X1 = 179450
Event Callback: X1 = 179451
Event Callback: X1 = 179452
Event Callback: X1 = 179453
Event Callback: X1 = 179454
Event Callback: X1 = 179455
Event Callback: X1 = 179456
Event Callback: X1 = 179457
Event Callback: X1 = 179458
Event Callback: X1 = 179459
Event Callback: X1 = 179460
Event Callback: X1 = 179461

Disconnecting:
Unsubscribing from X1 ... ^C

I don't know how this is related to the fix to lp: #1179642, but I get a segfault from the call to ca_name() inside the eventCallback() routine instead of the hang if I build deadlock against a version of Base that pre-dates that fix.

The deadlock is caused by different lock ordering. During monitor callbacks the client program's mutex is acquired after the context's cbMutex, but in the main thread the client program's mutex is acquired before the context's cbMutex.

Given the complexity of this area an acceptable fix might just be to document a requirement that ca_clear_subscription() must not be called while owning a mutex that is also taken inside the event callback subroutine.

Tags:

Revision history for this message

Andrew Johnson (anj) wrote on 2018-02-23:

Program demonstrating the deadlock Edit (4.6 KiB, text/x-csrc)

Revision history for this message

Ben Franksen (bfrk) wrote on 2018-04-09:

As similar problem with ca_clear_channel has been known to me for a long time, I remember stumbling over it in the sequencer. I agree that it makes the most sense to clearly document these behaviors.

The current behavior has been introduced as a fix for a bug. IIRC, this fix was made somewhere between 3.14.10 and 3.14.11. The problem was that callbacks could not rely on the chid being valid, because the chid was invalidated while a callback was still running (in another thread). Jeff Hill changed it so that ca_clear_channel waits for any active callback to complete before invalidating the chid, which is the correct behavior. OTOH it means that you must not hold a mutex that is also taken by your callback when calling ca_clear_channel.

I believe things are similar for ca_clear_subscription.

Revision history for this message

Andrew Johnson (anj) wrote on 2018-04-09:

Thanks for the description of the earlier fix, that confirms my understanding.

I have just confirmed that removing the call to ca_clear_subscription() in the main() function of the program I attached above allows the deadlock to occur in the call to ca_clear_channel() instead.

The fix is thus to describe these limitations in CAref.html on the 3.14 branch. Patches or merge requests are welcome from anyone with the time to do this.

Changed in epics-base:
milestone:	none → 3.14.branch

Revision history for this message

Ben Franksen (bfrk) wrote on 2018-04-10:

I can do it but I need advice how to contribute with git on launchpad.

Revision history for this message

Ralph Lange (ralph-lange) wrote on 2018-04-10:

That should have enough information. Does it?

https://epics.anl.gov/core-talk/2018/msg00243.php

Revision history for this message

Ralph Lange (ralph-lange) wrote on 2018-04-10:

If you want to be able to create repos in the common epics-core area, Andrew has to add you to the epics-core group so that you get write access.

Revision history for this message

Andrew Johnson (anj) wrote on 2018-04-10:

Just as with Bazaar, every Launchpad user can push to their own git repositories for any LP project without needing to belong to the project itself. Thus as LP user "bfrk" Ben can do

git remote add lp-3.14 git+ssh://<email address hidden>/~bfrk/epics-base/+git/base-3.14

and push to that to create a git repo for base-3.14 changes. The remote name and final path component of the URL are of course up to Ben to choose.

Repo's with URLs in the form …/~epics-core/… that Ralph's email suggested belong to the Core Developers team and so require membership of that team to be able to push to them (which is better for joint development work). I would be happy to add Ben to the team if he wants, but that won't be true for everyone who creates merge requests for Base, so the above URL pattern is a better solution for the general case.

Thanks.

Revision history for this message

Ralph Lange (ralph-lange) wrote on 2018-04-10:

Yes, sure, obviously I wasn't considering Ben to be the general case.

Revision history for this message

Ben Franksen (bfrk) wrote on 2018-04-11:

Thanks Andrew, Ralph, I think this will get me going.

Revision history for this message

Ben Franksen (bfrk) wrote on 2018-04-11:

#10

So this note will be attached to:

- ca_clear_channel
- ca_clear_subscription

and I guess also

- ca_context_destroy

as presumably it probably calls the above two functions for each channel (or at least does something with the same effect).

Revision history for this message

Andrew Johnson (anj) wrote on 2018-04-11:

#11

I think I just confirmed that the deadlock happens with ca_context_destroy() too, thanks!

Report a bug

This report contains Public information

Everyone can see this information.

You are

Subscribing...

Edit bug mail

Other bug subscribers

Bug attachments

Program demonstrating the deadlock Edit

Add attachment

Remote bug watches

Bug watches keep track of this bug in other bug trackers.