casExpandRecvBuffer modifies the contents of the buffer

Bug #1815908 reported by Michael Ritzert
6
This bug affects 1 person
Affects Status Importance Assigned to Milestone
EPICS Base
Fix Released
High
mdavidsaver

Bug Description

For the following data base run in a SoftIOC (3.16.2):

record(waveform, "PXD:H9999:D9:memoffsetdata_array:A:set"){
    field(FTVL, "UCHAR")
    field(NELM, "262144")
}

record(waveform, "PXD:H9999:D9:memswdata_array:A:set"){
    field(FTVL, "UCHAR")
    field(NELM, "262144")
}

the sequence of PV accesses created by this minimal sequence:

// vi: ft=cpp
program seqtest("")

option +c; /* dont wait for db connections */
option +r; /* make functions reentrant, multiple instances */
option +d; /* toggle runtime debugging messages */
option +W; /* extra warnings. */

%%int i;

unsigned char memswdata_array_set[16384];
assign memswdata_array_set to "PXD:H9999:D9:memswdata_array:A:set";

unsigned char memoffsetdata_array_set_small[32768];
assign memoffsetdata_array_set_small to "PXD:H9999:D9:memoffsetdata_array:A:set";

ss dhhseq_small {
  state init {
    entry {
      for( i = 0; i < sizeof memswdata_array_set; ++i ) memswdata_array_set[i] = i % 256;
      for( i = 0; i < sizeof memoffsetdata_array_set_small; ++i ) memoffsetdata_array_set_small[i] = 255 - i % 256;
      pvPut(memswdata_array_set, SYNC);
      pvPut(memoffsetdata_array_set_small, SYNC);
    }
    when() {} state done
  }

  state done {
        when () {} state done
  }
}

gives this error message on the softIOC side:

CAS: request from 192.168.0.1:44512 => CAS: Missaligned protocol rejected
CAS: Request from 192.168.0.1:44512 => cmmd=30720 cid=0x7800002c type=17023 count=0 postsize=44
CAS: Request from 192.168.0.1:44512 => available=0x427f0000 N=0 paddr=(nil)
CAS: forcing disconnect from 192.168.0.1:44512

and a segfault on the sequence side.

This report is (for now) about the softIOC.

I can reliably produce this behavior on two different systems (SL7 and Debian) when both the softIOC and the sequence run on the same host. Communication is via the lo interface, so the data packets are large.

During the second pvPut, the receive buffer is to be expanded a second time, and during this call, the contents change:

before casExpandRecvBuffer
  start of buffer=00 13 ffffff80
after casExpandRecvBuffer
  start of buffer=78 00 00

I will try to follow this even further, but for now I'm creating this report to have a place to track my findings.

Might there be a relation to https://bugs.launchpad.net/epics-base/+bug/1706703 ?

Revision history for this message
Michael Ritzert (michael-ritzert) wrote :
Revision history for this message
mdavidsaver (mdavidsaver) wrote :

Could you try repeating your test with EPICS_CA_AUTO_ARRAY_BYTES=0 in the environment? This might help to narrow down which code path is involved.

Can you confirm that your "start of buffer=..." takes into account the offset pointer 'message_buffer::stk'?

Revision history for this message
Michael Ritzert (michael-ritzert) wrote :

EPICS_CA_AUTO_ARRAY_BYTES=0 gives me

CAS: request from 192.168.0.1:46046 => CAS: server unable to load large request message
CAS: Request from 192.168.0.1:46046 => cmmd=19 cid=0x1 type=4 count=32768 postsize=32768
CAS: Request from 192.168.0.1:46046 => available=0x2 N=0 paddr=0x7f8be0012120

and no crash on sequence side.

I did not actually take stk into account, but I had seen before that it is 0. To be certain, I just added the actual value to the debug message (@ …) and adjusted the buffer dump with this offset. It is still 0:

casExpandRecvBuffer
  start of buffer @ 0 = 00 13 ffffff80
after casExpandRecvBuffer
  start of buffer @ 0 = 78 00 00

Revision history for this message
Michael Ritzert (michael-ritzert) wrote :

BTW: I cannot confirm if this is a constant, but the data seem to be right from byte 32 in this case.

Revision history for this message
Michael Ritzert (michael-ritzert) wrote :

With right, I mean: As I would expect in this place. Not shifted by this amount.

I.e. now dumping bytes 0, 1, 2 and 31, 32, 33:
before casExpandRecvBuffer
  start of buffer @ 0 = 00 13 ffffff80 .. fffffff0 ffffffef ffffffee
after casExpandRecvBuffer
  start of buffer @ 0 = 78 00 00 .. 00 ffffffef ffffffee

Revision history for this message
Michael Ritzert (michael-ritzert) wrote :

I found the bug. It's a use-after-realloc problem. There is a code patch realloc, then copy from the old (new deallocated) buffer. I fixed it with the attached patch that takes into account that realloc already moved the data to the new buffer for us.

Revision history for this message
Michael Ritzert (michael-ritzert) wrote :

This is the patch for 7.0.3. Just the location of the file changes.

Revision history for this message
mdavidsaver (mdavidsaver) wrote :

Sorry, this ticket fell through a crack. Looking at your proposed fix I think I see what the logic bug is. The only change I'll make when applying is to test if realloc() returns NULL.

Changed in epics-base:
assignee: nobody → mdavidsaver (mdavidsaver)
milestone: none → 7.0.3.1
status: New → In Progress
Revision history for this message
mdavidsaver (mdavidsaver) wrote :

Committed as 69035629154c1ea20e8dacf5d122802d3d3277e5

Changed in epics-base:
status: In Progress → Fix Committed
importance: Undecided → High
Andrew Johnson (anj)
Changed in epics-base:
status: Fix Committed → Fix Released
To post a comment you must log in.
This report contains Public information  
Everyone can see this information.

Other bug subscribers

Remote bug watches

Bug watches keep track of this bug in other bug trackers.