EPICS Base

casExpandRecvBuffer modifies the contents of the buffer

Bug #1815908 reported by Michael Ritzert on 2019-02-14

This bug affects 1 person

Affects		Status	Importance	Assigned to	Milestone
	EPICS Base	Fix Released	High	mdavidsaver	EPICS Base 7.0.3.1

Bug Description

For the following data base run in a SoftIOC (3.16.2):

record(waveform, "PXD:H9999:D9:memoffsetdata_array:A:set"){
field(FTVL, "UCHAR")
field(NELM, "262144")
}

record(waveform, "PXD:H9999:D9:memswdata_array:A:set"){
field(FTVL, "UCHAR")
field(NELM, "262144")
}

the sequence of PV accesses created by this minimal sequence:

// vi: ft=cpp
program seqtest("")

option +c; /* dont wait for db connections */
option +r; /* make functions reentrant, multiple instances */
option +d; /* toggle runtime debugging messages */
option +W; /* extra warnings. */

%%int i;

unsigned char memswdata_array_set[16384];
assign memswdata_array_set to "PXD:H9999:D9:memswdata_array:A:set";

unsigned char memoffsetdata_array_set_small[32768];
assign memoffsetdata_array_set_small to "PXD:H9999:D9:memoffsetdata_array:A:set";

ss dhhseq_small {
  state init {
    entry {
      for( i = 0; i < sizeof memswdata_array_set; ++i ) memswdata_array_set[i] = i % 256;
      for( i = 0; i < sizeof memoffsetdata_array_set_small; ++i ) memoffsetdata_array_set_small[i] = 255 - i % 256;
      pvPut(memswdata_array_set, SYNC);
      pvPut(memoffsetdata_array_set_small, SYNC);
    }
    when() {} state done
  }

  state done {
        when () {} state done
  }
}

gives this error message on the softIOC side:

CAS: request from 192.168.0.1:44512 => CAS: Missaligned protocol rejected
CAS: Request from 192.168.0.1:44512 => cmmd=30720 cid=0x7800002c type=17023 count=0 postsize=44
CAS: Request from 192.168.0.1:44512 => available=0x427f0000 N=0 paddr=(nil)
CAS: forcing disconnect from 192.168.0.1:44512

and a segfault on the sequence side.

This report is (for now) about the softIOC.

I can reliably produce this behavior on two different systems (SL7 and Debian) when both the softIOC and the sequence run on the same host. Communication is via the lo interface, so the data packets are large.

During the second pvPut, the receive buffer is to be expanded a second time, and during this call, the contents change:

before casExpandRecvBuffer
start of buffer=00 13 ffffff80
after casExpandRecvBuffer
start of buffer=78 00 00

I will try to follow this even further, but for now I'm creating this report to have a place to track my findings.

Might there be a relation to https://bugs.launchpad.net/epics-base/+bug/1706703 ?

Revision history for this message

Michael Ritzert (michael-ritzert) wrote on 2019-02-14:

minimal code to demonstrate the problem Edit (4.0 KiB, application/x-tar)

Revision history for this message

mdavidsaver (mdavidsaver) wrote on 2019-02-14:

Could you try repeating your test with EPICS_CA_AUTO_ARRAY_BYTES=0 in the environment? This might help to narrow down which code path is involved.

Can you confirm that your "start of buffer=..." takes into account the offset pointer 'message_buffer::stk'?

Revision history for this message

Michael Ritzert (michael-ritzert) wrote on 2019-02-14:

EPICS_CA_AUTO_ARRAY_BYTES=0 gives me

CAS: request from 192.168.0.1:46046 => CAS: server unable to load large request message
CAS: Request from 192.168.0.1:46046 => cmmd=19 cid=0x1 type=4 count=32768 postsize=32768
CAS: Request from 192.168.0.1:46046 => available=0x2 N=0 paddr=0x7f8be0012120

and no crash on sequence side.

I did not actually take stk into account, but I had seen before that it is 0. To be certain, I just added the actual value to the debug message (@ …) and adjusted the buffer dump with this offset. It is still 0:

casExpandRecvBuffer
start of buffer @ 0 = 00 13 ffffff80
after casExpandRecvBuffer
start of buffer @ 0 = 78 00 00

Revision history for this message

Michael Ritzert (michael-ritzert) wrote on 2019-02-14:

BTW: I cannot confirm if this is a constant, but the data seem to be right from byte 32 in this case.

Revision history for this message

Michael Ritzert (michael-ritzert) wrote on 2019-02-14:

With right, I mean: As I would expect in this place. Not shifted by this amount.

I.e. now dumping bytes 0, 1, 2 and 31, 32, 33:
before casExpandRecvBuffer
start of buffer @ 0 = 00 13 ffffff80 .. fffffff0 ffffffef ffffffee
after casExpandRecvBuffer
start of buffer @ 0 = 78 00 00 .. 00 ffffffef ffffffee

Revision history for this message

Michael Ritzert (michael-ritzert) wrote on 2019-03-01:

caservertask-realloc-use-after-free.patch Edit (727 bytes, text/plain)

I found the bug. It's a use-after-realloc problem. There is a code patch realloc, then copy from the old (new deallocated) buffer. I fixed it with the attached patch that takes into account that realloc already moved the data to the new buffer for us.

Revision history for this message

Michael Ritzert (michael-ritzert) wrote on 2019-10-14:

caservertask-realloc-use-after-free.patch Edit (908 bytes, text/plain)

This is the patch for 7.0.3. Just the location of the file changes.

Revision history for this message

mdavidsaver (mdavidsaver) wrote on 2019-10-16:

Sorry, this ticket fell through a crack. Looking at your proposed fix I think I see what the logic bug is. The only change I'll make when applying is to test if realloc() returns NULL.