Deadlock in ca_clear_subscription()
Affects | Status | Importance | Assigned to | Milestone
---|---|---|---|---
EPICS Base | Status tracked in 7.0 | | |
3.14 | Fix Released | Undecided | Unassigned |
3.15 | Fix Released | Undecided | Unassigned |
3.16 | Fix Released | Undecided | Unassigned |
7.0 | Fix Released | Medium | Ben Franksen |
Bug Description
A preemptive CA client program that acquires a mutex inside its subscription event callback and also holds that same mutex in its main thread when calling ca_clear_
The attached C program demonstrates the issue. To run it, first start an IOC with the following database to generate a continuous stream of monitor updates from X1:
record(calc, "X1") {
    field(INPA, "X1.VAL CP")
    field(CALC, "A+1")
}
Then in a second window run "bin/<host-
tux% bin/linux-
Connecting #0
accessRightsCal
pv: X1 type(-1) nelements(0) host(tux.
read(1) write(1) state(0)
connectionCallback
pv: X1 type(6) nelements(1) host(tux.
read(1) write(1) state(2)
Event Callback: X1 = 177085
Disconnecting:
Unsubscribing from X1 ... done
Disconnected from X1
Connecting #1
accessRightsCal
pv: X1 type(-1) nelements(0) host(tux.
read(1) write(1) state(0)
connectionCallback
pv: X1 type(6) nelements(1) host(tux.
read(1) write(1) state(2)
Event Callback: X1 = 179442
Event Callback: X1 = 179443
Event Callback: X1 = 179444
Event Callback: X1 = 179445
Event Callback: X1 = 179446
Event Callback: X1 = 179447
Event Callback: X1 = 179448
Event Callback: X1 = 179449
Event Callback: X1 = 179450
Event Callback: X1 = 179451
Event Callback: X1 = 179452
Event Callback: X1 = 179453
Event Callback: X1 = 179454
Event Callback: X1 = 179455
Event Callback: X1 = 179456
Event Callback: X1 = 179457
Event Callback: X1 = 179458
Event Callback: X1 = 179459
Event Callback: X1 = 179460
Event Callback: X1 = 179461
Disconnecting:
Unsubscribing from X1 ... ^C
I don't know how this is related to the fix to lp: #1179642, but I get a segfault from the call to ca_name() inside the eventCallback() routine instead of the hang if I build deadlock against a version of Base that pre-dates that fix.
The deadlock is caused by inconsistent lock ordering: during monitor callbacks the client program's mutex is acquired after the context's cbMutex, whereas in the main thread the client program's mutex is acquired before the context's cbMutex.
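The ordering problem can be reproduced without libca. The following is only a minimal sketch using plain POSIX threads: cb_mutex and app_mutex are hypothetical stand-ins for the context's cbMutex and the client program's mutex, and count_blocked_acquisitions() is a name made up for this illustration. Each thread takes its first lock, both meet at a barrier, and each then uses pthread_mutex_trylock() for its second lock so that the sketch reports the conflict instead of hanging.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <errno.h>
#include <pthread.h>
#include <stddef.h>

/* Hypothetical stand-ins, NOT the real libca objects. */
static pthread_mutex_t cb_mutex  = PTHREAD_MUTEX_INITIALIZER; /* models the context's cbMutex */
static pthread_mutex_t app_mutex = PTHREAD_MUTEX_INITIALIZER; /* models the client's own mutex */
static pthread_barrier_t rendezvous;

/* Callback path: cbMutex first, then the client mutex. */
static void *callback_path(void *arg)
{
    int *blocked = arg;

    pthread_mutex_lock(&cb_mutex);
    pthread_barrier_wait(&rendezvous);   /* both threads now hold their first lock */
    if (pthread_mutex_trylock(&app_mutex) == EBUSY)
        *blocked = 1;                    /* a blocking lock would wait here forever */
    else
        pthread_mutex_unlock(&app_mutex);
    pthread_barrier_wait(&rendezvous);   /* both attempts are finished */
    pthread_mutex_unlock(&cb_mutex);
    return NULL;
}

/* Main-thread path: client mutex first, then cbMutex (the reverse order).
 * Returns how many of the two second-lock attempts found the lock busy. */
int count_blocked_acquisitions(void)
{
    pthread_t tid;
    int cb_blocked = 0, main_blocked = 0;
    int rc;

    pthread_barrier_init(&rendezvous, NULL, 2);
    rc = pthread_create(&tid, NULL, callback_path, &cb_blocked);
    assert(rc == 0);                     /* sketch-level error handling only */
    (void)rc;

    pthread_mutex_lock(&app_mutex);
    pthread_barrier_wait(&rendezvous);
    if (pthread_mutex_trylock(&cb_mutex) == EBUSY)
        main_blocked = 1;
    else
        pthread_mutex_unlock(&cb_mutex);
    pthread_barrier_wait(&rendezvous);
    pthread_mutex_unlock(&app_mutex);

    pthread_join(tid, NULL);
    pthread_barrier_destroy(&rendezvous);
    return cb_blocked + main_blocked;
}
```

Calling count_blocked_acquisitions() from a test program returns 2 in this model: each thread finds its second lock held by the other, which is exactly the state in which two blocking lock calls would never return.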
Given the complexity of this area an acceptable fix might just be to document a requirement that ca_clear_
A similar problem with ca_clear_channel has been known to me for a long time; I remember stumbling over it in the sequencer. I agree that it makes the most sense to clearly document these behaviors.
The current behavior was introduced as a fix for a bug. IIRC, this fix was made somewhere between 3.14.10 and 3.14.11. The problem was that callbacks could not rely on the chid being valid, because the chid could be invalidated while a callback was still running (in another thread). Jeff Hill changed it so that ca_clear_channel waits for any active callback to complete before invalidating the chid, which is the correct behavior. OTOH, it means that you must not hold a mutex that is also taken by your callback when calling ca_clear_channel.
I believe things are similar for ca_clear_subscription.
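For illustration, here is a rough model of the rule "do not hold a mutex that your callback also takes while clearing". It uses plain POSIX threads and is not libca code: lib_cb_lock, handle_valid, model_clear() and safe_shutdown() are all hypothetical names, with model_clear() standing in for a clear call that waits for any running callback before invalidating the handle. The main thread reads shared state under app_mutex, releases it, and only then calls model_clear().

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <pthread.h>
#include <stddef.h>
#include <time.h>

/* Hypothetical stand-ins, NOT the real libca objects. */
static pthread_mutex_t lib_cb_lock = PTHREAD_MUTEX_INITIALIZER; /* held while a callback runs */
static pthread_mutex_t app_mutex   = PTHREAD_MUTEX_INITIALIZER; /* the client's own mutex */
static int handle_valid;   /* protected by lib_cb_lock */
static int events_seen;    /* protected by app_mutex */

static void short_pause(void)
{
    struct timespec ts = { 0, 1000000 };   /* 1 ms */
    nanosleep(&ts, NULL);
}

/* Models the library thread that delivers monitor updates. */
static void *callback_thread(void *unused)
{
    (void)unused;
    for (;;) {
        pthread_mutex_lock(&lib_cb_lock);
        if (!handle_valid) {               /* handle was cleared: stop delivering */
            pthread_mutex_unlock(&lib_cb_lock);
            return NULL;
        }
        pthread_mutex_lock(&app_mutex);    /* the user callback takes the client mutex */
        events_seen++;
        pthread_mutex_unlock(&app_mutex);
        pthread_mutex_unlock(&lib_cb_lock);
        short_pause();
    }
}

/* Models a clear call that waits for an active callback to complete. */
static void model_clear(void)
{
    pthread_mutex_lock(&lib_cb_lock);
    handle_valid = 0;
    pthread_mutex_unlock(&lib_cb_lock);
}

/* Waits for at least min_events updates, then shuts down safely.
 * Returns the total number of updates delivered. */
int safe_shutdown(int min_events)
{
    pthread_t tid;
    int seen = 0;
    int rc;

    handle_valid = 1;
    events_seen = 0;
    rc = pthread_create(&tid, NULL, callback_thread, NULL);
    assert(rc == 0);                       /* sketch-level error handling only */
    (void)rc;

    while (seen < min_events) {
        pthread_mutex_lock(&app_mutex);
        seen = events_seen;
        pthread_mutex_unlock(&app_mutex);  /* released BEFORE model_clear() */
        short_pause();
    }
    model_clear();                         /* called with app_mutex not held */
    pthread_join(tid, NULL);
    return events_seen;
}
```

If safe_shutdown() instead called model_clear() while still holding app_mutex, the model could hang in the same way as the report describes: model_clear() would wait for the callback, and the callback would wait for app_mutex.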