Dereference nullptr in notifyCallback in dbNotify.c

Bug #1775444 reported by Hao Yin
This bug affects 1 person
Affects     Status      Importance  Assigned to  Milestone
EPICS Base  Incomplete  Undecided   Unassigned

Bug Description

We had several crashes at random times on a softIOC. Each crash was reported in dmesg:
  st.cmd[29147]: segfault at 18 ip 00007fa5f508b02c sp 00007fa5f30ffe10 error 4 in libdbIoc.so.3.14[7fa5f506d000+3e000]

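"error 4" indicates a user-mode read of a page that is not mapped. Together with the fault address 0x18, this points to a member access through a nullptr.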
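
As a rough illustration of how that dmesg line can be read, here is a minimal sketch in C. It assumes the usual x86 page-fault error code bits (bit 0 = protection violation, bit 1 = write, bit 2 = user mode). The helper name looks_like_null_deref and the page_size parameter are hypothetical and not part of EPICS or the kernel.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Low bits of the x86 page-fault error code printed after "error" in dmesg. */
#define PF_PROT  0x1u  /* set: protection violation; clear: page not present */
#define PF_WRITE 0x2u  /* set: write access; clear: read access */
#define PF_USER  0x4u  /* set: fault was raised in user mode */

/*
 * Hypothetical helper: returns true when a fault looks like a member access
 * through a null pointer, i.e. a user-mode access to a non-present page at
 * an address inside the first page of the address space.
 */
static bool looks_like_null_deref(uint64_t fault_addr, unsigned err,
                                  uint64_t page_size)
{
    bool not_present = (err & PF_PROT) == 0;
    bool user_mode   = (err & PF_USER) != 0;

    assert(page_size > 0);
    return not_present && user_mode && fault_addr < page_size;
}
```

For the crash above, looks_like_null_deref(0x18, 4, 4096) is true: the code 4 has only the user-mode bit set, and 0x18 is a small offset from address zero, as expected for a field read through a null struct pointer.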

Running addr2line:
addr2line -e libdbIoc.so.3.14 0x1e02c    # 0x1e02c = ip - load address of libdbIoc.so.3.14 (0x7fa5f508b02c - 0x7fa5f506d000)
/opt/epics/base-3.14.12.7/src/db/O.linux-x86_64/../dbNotify.c:257
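
The address passed to addr2line is the instruction pointer minus the library's load address, both taken from the dmesg line (the value before the "+" inside the brackets is the load address). A small sketch of that arithmetic in C, with a hypothetical helper name:

```c
#include <assert.h>
#include <stdint.h>

/*
 * Hypothetical helper: convert the absolute instruction pointer from a
 * segfault line into an offset inside the shared library, which is the
 * form addr2line expects for a shared object.
 */
static uint64_t module_offset(uint64_t ip, uint64_t load_base)
{
    assert(ip >= load_base);  /* the ip must lie inside the mapped library */
    return ip - load_base;
}
```

With the values from this report, module_offset(0x7fa5f508b02c, 0x7fa5f506d000) gives 0x1e02c.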

Finally gdb shows that pputNotifyPvt was a nullptr, dereferenced in the assert (see attachment).

The IOC does not contain any user-defined callbacks, but we have a few standalone clients connected to the IOC. These clients use synchronous groups from libca to read and write data from/to the IOC. I'm not sure whether the faulty callbacks are generated by those functions (ca_sg_put/get, ca_sg_array_put/get and ca_sg_block). I could not reproduce the error, since the crashes are infrequent (one every ~2-3 days).

We are using the following:
  EPICS 3.14.12.7
  Scientific Linux 7
  x86_64

It also happened on the same system using EPICS 3.14.12.6.

Best regards,
 Hao

Hao Yin (hyin86) wrote :
Andrew Johnson (anj) wrote :

Hi Hao,

I'm going to guess that this problem is related to the fact that you're using the synchronous groups feature of libCa, which has not received much use or testing in recent years (I last remember writing code that used it about 20 years ago). CA Client applications can generally implement the behavior of synchronous groups for themselves using libCa's callback APIs, which are extremely well used and tested.
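
For readers wondering what "implementing synchronous groups with callbacks" might look like, here is a minimal, libca-free sketch of the bookkeeping only. It is an assumption about one possible approach, not code from EPICS Base. All names in it (request_group, group_issue, group_complete, group_wait, fake_poll) are hypothetical. In a real client, group_issue() would be called before each ca_array_put_callback() or ca_array_get_callback(), group_complete() would be called from the completion callback with args.status == ECA_NORMAL as the ok flag, and the poll function would wrap ca_pend_event() with a short timeout.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical bookkeeping for one batch of callback-based requests. */
typedef struct {
    unsigned pending;  /* requests issued but not yet completed */
    unsigned failed;   /* completions that reported an error */
} request_group;

static void group_init(request_group *g)
{
    assert(g != NULL);
    g->pending = 0;
    g->failed = 0;
}

/* Call once per request, just before issuing it. */
static void group_issue(request_group *g)
{
    assert(g != NULL);
    g->pending++;
}

/* Call from the completion callback of each request. */
static void group_complete(request_group *g, int ok)
{
    assert(g != NULL);
    if (g->pending > 0)
        g->pending--;
    if (!ok)
        g->failed++;
}

static int group_done(const request_group *g)
{
    return g->pending == 0;
}

/* Stand-in for "let the library run callbacks for a short while". */
typedef void (*poll_fn)(void *ctx);

/*
 * Poll until every request has completed or max_polls is exhausted.
 * Returns 1 when the whole group completed, 0 on timeout.
 */
static int group_wait(request_group *g, poll_fn poll, void *ctx,
                      unsigned max_polls)
{
    unsigned i;
    for (i = 0; i < max_polls && !group_done(g); i++)
        poll(ctx);
    return group_done(g);
}

/* Test stand-in only: pretends one request completes successfully per poll. */
static void fake_poll(void *ctx)
{
    group_complete((request_group *)ctx, 1);
}
```

This sketch assumes callbacks run in the same thread that polls, as with a non-preemptive CA context. If preemptive callbacks are enabled, the counters would need a mutex or similar protection.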

There must be some internal state generated by the interaction of CA's synchronous groups and the putNotify subsystem which the code doesn't expect or handle properly. If you still have a coredump from one of these crashes, could you load it into gdb and do a 'thread apply all bt' so we can see what the other threads are doing at the time of the crash?

To be honest I'm not very confident that we'll be able to find and fix this problem though, as the authors of both the CA and putNotify subsystems have now moved on to other things. If you can produce some code that replicates the problem (presumably that would be some combination of an IOC database and a simple CA client application) we would be able to look at it more carefully. Problems caused by interactions like this are not easy to track down though, and the Core Developers have many other higher priority issues that we're working on.

It is possible that a newer version of Base might have fixed the issue inside the IOC code, so you could also try updating to see if that helps, but I would wait for the up-coming Base-3.16.2 release before doing that. If you can't upgrade the IOC, I would recommend removing the use of synchronous groups from your CA client applications. This would probably be quicker than trying to replicate and fix the problem inside the IOC code.

I realize this is probably a disappointing response, but I see it as the reality of the current state of the Channel Access code maintenance in EPICS Base.

Regards,

- Andrew

Changed in epics-base:
status: New → Incomplete
Hao Yin (hyin86) wrote :

Hi Andrew,

I guess I'll switch to callbacks. Thanks for the explanation.

Regards,
  Hao

Andrew Johnson (anj) wrote :

Core Group review at ESS: We would look at this if it can be reproduced against a 3.15 or higher release, where we made significant changes to dbNotify.
