Dereference nullptr in notifyCallback in dbNotify.c
Affects | Status | Importance | Assigned to | Milestone | |
---|---|---|---|---|---|
EPICS Base | Incomplete | Undecided | Unassigned | | |
Bug Description
We had several crashes at random times on a softIOC. Each crash was reported in dmesg:
st.cmd[29147]: segfault at 18 ip 00007fa5f508b02c sp 00007fa5f30ffe10 error 4 in libdbIoc.
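"error 4" indicates a user-mode read of an unmapped page; together with the fault address 0x18, this points to a member access through a nullptr.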
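The "error" field in the dmesg line is the x86 page-fault error code, which is a bit mask. The following is a minimal sketch of how its low bits can be decoded; `struct pf_info` and `decode_pf_error` are hypothetical names used only for illustration, not part of EPICS or the kernel API.

```c
#include <stdbool.h>

/* Hypothetical helper type: decoded view of an x86 page-fault error code. */
struct pf_info {
    bool protection_violation; /* bit 0: set = access to a present page was denied,
                                  clear = the page was not present */
    bool write;                /* bit 1: set = write access, clear = read access */
    bool user_mode;            /* bit 2: set = fault happened in user mode */
    bool instruction_fetch;    /* bit 4: set = fault on an instruction fetch */
};

/* Split the error code printed by the kernel into its individual flags. */
static struct pf_info decode_pf_error(unsigned long err)
{
    struct pf_info info;
    info.protection_violation = (err & 0x1UL) != 0;
    info.write                = (err & 0x2UL) != 0;
    info.user_mode            = (err & 0x4UL) != 0;
    info.instruction_fetch    = (err & 0x10UL) != 0;
    return info;
}
```

For the value 4 only the user-mode bit is set, so the fault was a read from user space of a page that is not mapped. With a fault address as low as 0x18, that is what reading a structure member through a null pointer looks like.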
Running addr2line:
addr2line -e libdbIoc.so.3.14 0x1e02c # 0x1e02c = ip - load address of libdbIoc.so.3.14
/opt/epics/
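The address given to addr2line is the faulting instruction pointer minus the address at which the shared library was mapped. A minimal sketch of that arithmetic follows; `module_offset` is a hypothetical helper name, and the load address in the note after the block is not quoted in the report but is the value implied by the ip and the 0x1e02c offset above.

```c
#include <stdint.h>

/* Offset of an instruction inside a shared library, i.e. the value
 * addr2line expects for a position-independent .so file.
 * ip:        faulting instruction pointer from the dmesg line
 * load_base: address at which the library was mapped into the process */
static uint64_t module_offset(uint64_t ip, uint64_t load_base)
{
    return ip - load_base;
}
```

With ip = 0x7fa5f508b02c and an assumed load address of 0x7fa5f506d000, `module_offset` returns 0x1e02c, the value used in the addr2line call.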
Finally, gdb shows that pputNotifyPvt was a nullptr when it was dereferenced in the assert (see attachment).
The IOC does not contain any user-defined callbacks, but we have a few standalone clients connected to it. These clients use synchronous groups from libca to read and write data on the IOC. I'm not sure whether the faulty callbacks are generated by those functions (ca_sg_put/get, ca_sg_array_put/get and ca_sg_block). I could not reproduce the error, since the crashes are infrequent (one every 2-3 days or so).
We are using the following:
EPICS 3.14.12.7
Scientific Linux 7
x86_64
It also happened on the same system using EPICS 3.14.12.6.
Best regards,
Hao
Hi Hao,
I'm going to guess that this problem is related to the fact that you're using the synchronous groups feature of libCa, which has not received much use or testing in recent years (I last remember writing code that used it about 20 years ago). CA Client applications can generally implement the behavior of synchronous groups for themselves using libCa's callback APIs, which are extremely well used and tested.
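As a rough illustration of what "implementing synchronous groups yourself" could look like, the sketch below shows only the bookkeeping: count each request when it is issued and count it off again in its completion handler. The names `put_group` and `put_group_*` are hypothetical. The sketch assumes a non-preemptive CA context, where handlers run only while the client thread is inside a CA call such as ca_pend_event, so no locking is shown.

```c
#include <stdbool.h>

/* Hypothetical bookkeeping for one batch of requests. */
struct put_group {
    unsigned pending; /* requests issued but not yet completed */
    unsigned failed;  /* completions that reported an error */
};

static void put_group_init(struct put_group *g)
{
    g->pending = 0;
    g->failed = 0;
}

/* Call once per request, just before issuing it
 * (for example before a ca_array_put_callback call). */
static void put_group_issued(struct put_group *g)
{
    g->pending++;
}

/* Call from the completion handler of each request; in a CA handler
 * 'ok' would be (args.status == ECA_NORMAL). */
static void put_group_completed(struct put_group *g, bool ok)
{
    if (g->pending > 0)
        g->pending--;
    if (!ok)
        g->failed++;
}

/* True once every issued request has reported completion. */
static bool put_group_done(const struct put_group *g)
{
    return g->pending == 0;
}
```

A client would pass a pointer to the group as the user argument of each request, call ca_pend_event in a loop until `put_group_done` returns true or its own timeout expires, and then inspect `failed`. With a preemptive-callback context the counters would additionally need a mutex or an event to be safe.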
There must be some internal state generated by the interaction of CA's synchronous groups and the putNotify subsystem which the code doesn't expect or handle properly. If you still have a coredump from one of these crashes, could you load it into gdb and do a 'thread apply all bt' so we can see what the other threads were doing at the time of the crash?
To be honest I'm not very confident that we'll be able to find and fix this problem though, as the authors of both the CA and putNotify subsystems have now moved on to other things. If you can produce some code that replicates the problem (presumably that would be some combination of an IOC database and a simple CA client application) we would be able to look at it more carefully. Problems caused by interactions like this are not easy to track down though, and the Core Developers have many other higher priority issues that we're working on.
It is possible that a newer version of Base has already fixed the issue inside the IOC code, so you could also try updating to see if that helps, but I would wait for the upcoming Base-3.16.2 release before doing that. If you can't upgrade the IOC, I would recommend removing the use of synchronous groups from your CA client applications. This would probably be quicker than trying to replicate and fix the problem inside the IOC code.
I realize this is probably a disappointing response, but I see it as the reality of the current state of the Channel Access code maintenance in EPICS Base.
Regards,
- Andrew