EPICS Base

Dereference nullptr in notifyCallback in dbNotify.c

Bug #1775444 reported by Hao Yin on 2018-06-06

Affects		Status	Importance	Assigned to	Milestone
	EPICS Base	Incomplete	Undecided	Unassigned

Bug Description

We had several crashes at random times on a softIOC. Each crash was reported in dmesg:
st.cmd[29147]: segfault at 18 ip 00007fa5f508b02c sp 00007fa5f30ffe10 error 4 in libdbIoc.so.3.14[7fa5f506d000+3e000]

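"error 4" is the page-fault error code for a user-mode read of an unmapped page; together with the fault address 0x18, this points to a member access through a null pointer.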
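
For reference, the error value in the dmesg line is a bit field printed in hexadecimal. The following is a minimal sketch of decoding it, assuming the usual x86 page-fault bit layout (bit 0: protection violation rather than page not present, bit 1: write, bit 2: user mode, bit 3: reserved bit set, bit 4: instruction fetch). The function name decode_page_fault_error is hypothetical and only serves the illustration.

```python
def decode_page_fault_error(code: int) -> dict:
    """Decode the low bits of an x86 page-fault error code as shown by dmesg.

    Assumes the standard x86 bit layout; dmesg prints the value in hex,
    so parse the text with int(text, 16) before calling this.
    """
    return {
        "cause": "protection violation" if code & 0x1 else "page not present",
        "access": "write" if code & 0x2 else "read",
        "mode": "user" if code & 0x4 else "kernel",
        "reserved_bit": bool(code & 0x8),
        "instruction_fetch": bool(code & 0x10),
    }


# "error 4" from the report: only the user-mode bit is set.
info = decode_page_fault_error(int("4", 16))
print(info)
```

The null pointer itself is suggested by the fault address (0x18), not by the error code: a small address like this is typical of reading a structure member through a NULL base pointer.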

Running addr2line:
addr2line -e libdbIoc.so.3.14 0x1e02c # 0x1e02c = ip - load address of libdbIoc.so.3.14 (0x7fa5f506d000)
/opt/epics/base-3.14.12.7/src/db/O.linux-x86_64/../dbNotify.c:257
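
The offset passed to addr2line can be derived mechanically from the dmesg line. The sketch below assumes the line has exactly the format quoted above and that the bracketed value is the address at which the library is mapped; the regular expression and the helper name library_offset are illustrative, not part of any EPICS or kernel tool.

```python
import re
from typing import Tuple

# Matches the "segfault at ... in lib[base+size]" format quoted in this report.
SEGFAULT_RE = re.compile(
    r"segfault at (?P<addr>[0-9a-f]+) ip (?P<ip>[0-9a-f]+) sp (?P<sp>[0-9a-f]+) "
    r"error (?P<error>[0-9a-f]+) in (?P<lib>\S+)\[(?P<base>[0-9a-f]+)\+(?P<size>[0-9a-f]+)\]"
)


def library_offset(dmesg_line: str) -> Tuple[str, int]:
    """Return (library name, instruction pointer offset inside that library)."""
    m = SEGFAULT_RE.search(dmesg_line)
    if m is None:
        raise ValueError("not a segfault line in the expected format")
    ip = int(m["ip"], 16)
    base = int(m["base"], 16)
    size = int(m["size"], 16)
    # Sanity check: the faulting instruction must lie inside the mapping.
    if not base <= ip < base + size:
        raise ValueError("instruction pointer is outside the reported mapping")
    return m["lib"], ip - base


line = ("st.cmd[29147]: segfault at 18 ip 00007fa5f508b02c sp 00007fa5f30ffe10 "
        "error 4 in libdbIoc.so.3.14[7fa5f506d000+3e000]")
lib, offset = library_offset(line)
print(f"addr2line -e {lib} {offset:#x}")
```

For the line from this report, the computed offset is 0x7fa5f508b02c - 0x7fa5f506d000 = 0x1e02c, which is the value used in the addr2line call.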

Finally, gdb shows that pputNotifyPvt was a null pointer, dereferenced in the assert (see attachment).

The IOC does not contain any user-defined callbacks, but we have a few standalone clients connected to it. The clients use synchronous groups from libca to read and write data on the IOC. I'm not sure whether the faulty callbacks are generated by those functions (ca_sg_put/get, ca_sg_array_put/get and ca_sg_block). I have not been able to reproduce the error on demand, since the crashes are infrequent (roughly every 2-3 days).

We are using the following:
  EPICS 3.14.12.7
  Scientific Linux 7
  x86_64

It also happened on the same system using EPICS 3.14.12.6.

Best regards,
Hao

Hao Yin (hyin86) wrote on 2018-06-06:

Attachment: crash (427.7 KiB, image/png)

Andrew Johnson (anj) wrote on 2018-06-22:

Hi Hao,

I'm going to guess that this problem is related to the fact that you're using the synchronous groups feature of libCa, which has not received much use or testing in recent years (I last remember writing code that used it about 20 years ago). CA Client applications can generally implement the behavior of synchronous groups for themselves using libCa's callback APIs, which are extremely well used and tested.
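
As a rough illustration of that idea, the pattern behind a synchronous group is a counter of outstanding operations plus a blocking wait that returns once every completion callback has fired. The sketch below shows only that pattern in Python and does not call libCa: the class CallbackGroup and its methods are hypothetical names, and the timers stand in for asynchronous puts or gets whose completion callback would, in a real client, be passed to libCa's callback-based calls such as ca_array_put_callback or ca_array_get_callback.

```python
import threading


class CallbackGroup:
    """Counts outstanding asynchronous operations and blocks until all finish."""

    def __init__(self):
        self._cond = threading.Condition()
        self._pending = 0

    def add(self):
        """Register one operation; call this before issuing it."""
        with self._cond:
            self._pending += 1

    def completed(self, *_args):
        """Completion callback; must be called exactly once per operation."""
        with self._cond:
            self._pending -= 1
            if self._pending == 0:
                self._cond.notify_all()

    def block(self, timeout):
        """Wait for all registered operations; returns False on timeout."""
        with self._cond:
            return self._cond.wait_for(lambda: self._pending == 0, timeout)


group = CallbackGroup()
for delay in (0.01, 0.02, 0.03):
    group.add()
    # Stand-in for an asynchronous put whose completion invokes group.completed.
    threading.Timer(delay, group.completed).start()

all_done = group.block(timeout=2.0)
print("all operations completed:", all_done)
```

Registering each operation before issuing it avoids a race in which a fast completion drops the counter to zero while later operations are still being queued.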

There must be some internal state generated by the interaction of CA's synchronous groups and the putNotify subsystem which the code doesn't expect or handle properly. If you still have a coredump from one of these crashes, could you load it into gdb and do a 'thread apply all bt' so we can see what the other threads are doing at the time of the crash?

To be honest I'm not very confident that we'll be able to find and fix this problem though, as the authors of both the CA and putNotify subsystems have now moved on to other things. If you can produce some code that replicates the problem (presumably that would be some combination of an IOC database and a simple CA client application) we would be able to look at it more carefully. Problems caused by interactions like this are not easy to track down though, and the Core Developers have many other higher priority issues that we're working on.

It is possible that a newer version of Base might have fixed the issue inside the IOC code, so you could also try updating to see if that helps, but I would wait for the upcoming Base-3.16.2 release before doing that. If you can't upgrade the IOC, I would recommend removing the use of synchronous groups from your CA client applications. This would probably be quicker than trying to replicate and fix the problem inside the IOC code.

I realize this is probably a disappointing response, but I see it as the reality of the current state of the Channel Access code maintenance in EPICS Base.

Regards,

- Andrew

Changed in epics-base:
status:	New → Incomplete

Hao Yin (hyin86) wrote on 2018-06-23:

Hi Andrew,

I guess I'll switch to callbacks. Thanks for the explanation.

Regards,
Hao

Andrew Johnson (anj) wrote on 2018-09-05:

Core Group review at ESS: We would look at this if it can be reproduced against a 3.15 or higher release, where we made significant changes to dbNotify.
