Gateway segfaults on empty (zero-length) arrays

Bug #1415938 reported by Ralph Lange on 2015-01-29
8
This bug affects 1 person
Affects Status Importance Assigned to Milestone
PV Gateway
Fix Released
Critical
Ralph Lange

Bug Description

From Kay-Uwe Kasemir:

We have a gateway crashing with arrays that are INVALID/UDF, NORD=0, NELM>0.
As a CA client, the gateway receives an update with event.count = 0, but it tries to read ca_element_count(chID) >> 0.
Sometimes, that serendipitously works, but at other times it results in a crash because after all it is plain wrong.

Stack trace:

(gdb) bt
#0 0x00007f570a16a8fa in memcpy () from /lib64/libc.so.6
#1 0x0000000000411975 in gatePvData::eventLongCB (this=0x1024cc0, dbr=0x7f56f4025060) at ../gatePv.cc:2040
#2 0x0000000000413d6a in gatePvData::runEventCB (this=0x1024cc0, data=0x7f56f4025060) at ../gatePv.h:208
#3 0x000000000040f863 in gatePvData::eventCB (args=...) at ../gatePv.cc:1439
#4 0x000000000040b914 in eventCB (args=...) at ../gatePv.cc:82
#5 0x00007f570bdfaab3 in oldSubscription::current (this=0x7f56ec001fe8, guard=..., type=19, count=0, pData=0x7f56f4025060) at ../oldSubscription.cpp:68
#6 0x00007f570bdf1d91 in netSubscription::completion (this=0x7f56ec00d650, guard=..., typeIn=19, countIn=0, pDataIn=0x7f56f4025060) at ../netSubscription.cpp:142
#7 0x00007f570bdcff04 in cac::eventRespAction (this=0xf88930, iiu=..., hdr=..., pMsgBdy=0x7f56f4025060) at ../cac.cpp:946
#8 0x00007f570bdd0dd3 in cac::executeResponse (this=0xf88930, mgr=..., iiu=..., currentTime=..., hdr=..., pMshBody=0x7f56f4025060 "\021") at ../cac.cpp:1194
#9 0x00007f570bdecbcb in tcpiiu::processIncoming (this=0x7f56f4001740, currentTime=..., mgr=...) at ../tcpiiu.cpp:1252
#10 0x00007f570bdea3ce in tcpRecvThread::run (this=0x7f56f4001828) at ../tcpiiu.cpp:519

(gdb) fra 4
#4 0x000000000040b914 in eventCB (args=...) at ../gatePv.cc:82
82 gatePvData::eventCB(args);
(gdb) print args
$1 = {usr = 0x1024cc0, chid = 0xf9a750, type = 19, count = 0, dbr = 0x7f56f4025060, status = 1}

==> We received a monitor update with count = 0 elements.

(gdb) fra 1
#1 0x0000000000411975 in gatePvData::eventLongCB (this=0x1024cc0, dbr=0x7f56f4025060) at ../gatePv.cc:2040
2040 memcpy(nd,d,count*sizeof(aitInt32));
(gdb) list
2035 if(count>1)
2036 {
2037 aitInt32 *d,*nd;
2038 nd=new aitInt32[count];
2039 d=(aitInt32*)&ts->value;
2040 memcpy(nd,d,count*sizeof(aitInt32));
2041 value=new gddAtomic(GR->appValue,aitEnumInt32,1,&count);
2042 value->putRef(nd,new gateIntDestruct());
2043 }
2044 else

It crashes in memcpy, line 2040.
'nd' is good, it was just allocated.

(gdb) print count
$2 = 18088

'count' matches the correct PV size. That PV is indeed that long

(gdb) print *this
$3 = {eio = {pFirst = 0x0, pLast = 0x0, itemCount = 0}, callback_list = {pFirst = 0x0, pLast = 0x0, itemCount = 0}, mrg = 0xf73ec0, vc = 0x102b510, asentry = 0xf70760, max_elements = 18088, status = 0,
  pv_name = 0x1024de0 "BL16B:Det:N1:Det1:XY:Array:ArrayData", chID = 0xf9a750, evID = 0x7f56ec001fe8, logID = 0x0, alhID = 0x0, event_type = 19, data_type = 33, pv_state = gatePvActive, event_count = 0,
  static pv_state_names = 0x427de0, event_func = (gdd *(gatePvData::*)(gatePvData *, void *)) 0x4118b6 <gatePvData::eventLongCB(void*)>, data_func = (gdd *(gatePvData::*)(gatePvData *,
    void *)) 0x4112b2 <gatePvData::dataLongCB(void*)>, value_data_func = (gdd *(gatePvData::*)(gatePvData *, void *)) 0x41253c <gatePvData::valueDataLongCB(void*)>, mon_state = 1, log_mon_state = 0, ctrl_get_state = 0,
  time_get_state = 0, alh_mon_state = 0, alh_get_state = 0, abort_flag = 0, complete_flag = 1, no_connect_time = 1420748021, dead_alive_time = 1420748021, last_trans_time = 1420748021, bytes = 4, select_mask = {
    mask = 7}, alh_mask = {mask = 4}, value_mask = {mask = 1}, value_alarm_mask = {mask = 5}, value_log_mask = {mask = 3}}

But the 'ts' resp. 'd' only has 0 value elements in it according to the eventCB (args=…) , while the memcpy tries to read count=18088 elements.
We have 0 value elements because the channel is currently INVALID/UDF:

(gdb) print ts
$5 = (dbr_time_long *) 0x7f56f4025060
(gdb) print *ts
$6 = {status = 17, severity = 3, stamp = {secPastEpoch = 0, nsec = 0}, value = 0}

A caget of the NELM gives 18088, but NORD of that record is 0.

Overall, this section of code in gatePv.cc has 2 problems:

gdd* gatePvData::eventLongCB(void *dbr)
{
        gateDebug0(10,"gatePvData::eventLongCB\n");
        dbr_time_long* ts = (dbr_time_long*)dbr;
        aitIndex count = totalElements();
        gdd* value;

        // DBR_TIME_LONG response
        // set up the value
        if(count>1)
        {
                aitInt32 *d,*nd;
                nd=new aitInt32[count];
                d=(aitInt32*)&ts->value;
                memcpy(nd,d,count*sizeof(aitInt32));
                value=new gddAtomic(GR->appValue,aitEnumInt32,1,&count);
                value->putRef(nd,new gateIntDestruct());
        }
        else
        {
                value = new gddScalar(GR->appValue,aitEnumInt32);
                *value=ts->value;
        }
        value->setStatSevr(ts->status,ts->severity);
        value->setTimeStamp(&ts->stamp);
        return value;
}

1) The 'count' should be the count of the received eventCB (args), not the ca_element_count(chID) from totalElements().

2) For arrays, that count can be 0.
So it needs to use ca_element_count(chID) from totalElements() to decide if it's an array.
For a scalar, the received count should always be 1.
For arrays, the number of actual elements received could be 0, 1, 2, …

Same for eventShortCB, eventDoubleCB, …
We have the 2.0.4 version of the gateway. Not the very latest, but from the release notes the changes are in CAPutLogging and Base 3.15.1, nothing about arrays, and I looked at gatePv.h and gatePv.cc of 2.0.5.1, it should have the same problem.

Thanks,
Kay

Ralph Lange (ralph-lange) wrote :

I admire your command line gdb abilities and agree with your analysis and results.

Before variable length arrays were added to CA, all incoming arrays had the full size. (That code was once young and good, you know. It just wasn't adaptive enough when aging.)

Ralph Lange (ralph-lange) wrote :

From Kay Kasemir:

Since we're still developing a lot and thus reboot IOCs, which then serve zero-sized UDF arrays for a while, the crashing gateway has turned into a nuisance.

I've patched our copy to pass the received array element count down from the CA event handler to the CPP code.
Once there, I'm only handling it for 'short' and 'long' arrays by zeroing the memory, then copying in what's been received.
So this is not a perfect fix, but it keeps it from crashing for those 2 data types.

Maybe the patch safes you some typing to pass the received array element count.
Otherwise, by all means ignore it and do the right thing.

Changed in epics-gateway:
status: New → Confirmed
Ralph Lange (ralph-lange) wrote :

I fixed the segfaults with a mixture of your patch and changes from Bruce Hill's vararray branch.

While working on this, I also saw strange error messages in the log that I was able to trace back to a bug in gdd (lp:1415908). These should go away once that fix made its way through EPICS Base.

Changed in epics-gateway:
status: Confirmed → Fix Committed
Changed in epics-gateway:
status: Fix Committed → Fix Released
To post a comment you must log in.
This report contains Public information  Edit
Everyone can see this information.

Other bug subscribers

Patches