Monitor callback fails for large arrays

Bug #541245 reported by evans
Affects: EPICS Base
Status: Fix Released
Importance: High
Assigned to: Jeff Hill

Bug Description

Get a monitor callback with status != ECA_NORMAL when trying to
access a large array (count=262144, type=DBF_LONG) even though
EPICS_CA_MAX_ARRAY_BYTES is 1048600 (or even as large as 2048000).
This happens with MEDM and Camonitor, but not with Caget. It happens
with several PVs of the same count.

Additional information:
The MEDM message is:

Fri Jun 10 18:06:24 CDT 2005
medmUpdateChannelCb: Bad status [72] for cf2:turn:gtr:waveform0: The requested data transfer is greater than available memory or EPICS_CA_MAX_ARRAY_BYTES

The stack trace (note values for count and max_bytes) is:

  [1] medmUpdateChannelCb(args = RECORD), line 765 in "medmCA.c"
  [2] oldSubscription::exception(this = 0x404dc8, guard = CLASS, status = 72, _ARG4 = 0x36d128 "server unable to load read (or subscription update) response into protocol buffer PV="cf2:turn:gtr:waveform0" max bytes=1048600", type = 19U, count = 262144U), line 92 in "oldSubscription.cpp"
  [3] netSubscription::exception(this = 0x409dd8, guard = CLASS, recycle = CLASS, status = 72, pContext = 0x36d128 "server unable to load read (or subscription update) response into protocol buffer PV="cf2:turn:gtr:waveform0" max bytes=1048600", typeIn = 19U, countIn = 262144U), line 128 in "netSubscription.cpp"
  [4] cac::ioExceptionNotify(this = 0x2c3c50, idIn = 2U, status = 72, pContext = 0x36d128 "server unable to load read (or subscription update) response into protocol buffer PV="cf2:turn:gtr:waveform0" max bytes=1048600", type = 19U, count = 262144U), line 687 in "cac.cpp"
  [5] cac::eventAddExcep(this = 0x2c3c50, _ARG2 = CLASS, _ARG3 = CLASS, hdr = STRUCT, pCtx = 0x36d128 "server unable to load read (or subscription update) response into protocol buffer PV="cf2:turn:gtr:waveform0" max bytes=1048600", status = 72U), line 947 in "cac.cpp"
=>[6] cac::exceptionRespAction(this = 0x2c3c50, cbMutexIn = CLASS, iiu = CLASS, _ARG4 = CLASS, hdr = STRUCT, pMsgBdy = 0x36d110), line 1030 in "cac.cpp"
  [7] cac::executeResponse(this = 0x2c3c50, mgr = CLASS, iiu = CLASS, currentTime = CLASS, hdr = STRUCT, pMshBody = 0x36d110 ""), line 1124 in "cac.cpp"
  [8] tcpiiu::processIncoming(this = 0x366ba8, currentTime = CLASS, mgr = CLASS), line 1188 in "tcpiiu.cpp"
  [9] tcpRecvThread::run(this = 0x366c64), line 527 in "tcpiiu.cpp"
  [10] epicsThreadCallEntryPoint(pPvt = 0x366c68), line 59 in "epicsThread.cpp"
  [11] start_routine(arg = 0x365348), line 301 in "osdThread.c"

Original Mantis Bug: mantis-203
    http://www.aps.anl.gov/epics/mantis/view_bug_page.php?f_id=203

Tags: ca 3.14
Jeff Hill (johill-lanl) wrote :

The server is responding with an ECA_TOLARGE exception.

Jeff Hill (johill-lanl) wrote :

> I've got EPICS_CA_MAX_ARRAY_BYTES set to 1048576 on my IOC and
> on my workstation. The caget command-line program has no troubles
> reading the waveform record, but neither the command-line camonitor
> program nor medm can read it. The waveform record consists of
> 262144 LONG (4-byte) values.

Perhaps this problem comes from cruising at treetop altitude, with no margin to spare. You need 262144 * 4 = 1048576 bytes, which is exactly your EPICS_CA_MAX_ARRAY_BYTES setting. However, programs like MEDM, and plausibly also camonitor, fetch a compound type such as DBR_GR_LONG, which adds around 36 bytes of overhead.
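
The arithmetic can be checked with a small sketch. It assumes the figure of roughly 36 bytes of metadata overhead quoted above; the helper names and the constant are illustrative and are not part of EPICS Base.

```c
#include <assert.h>
#include <stddef.h>

/* Assumption: metadata overhead of the compound DBR type, taken from the
 * "around 36 bytes" figure in this comment. The real value depends on the
 * DBR type that the client requests. */
#define ASSUMED_DBR_OVERHEAD_BYTES 36u

/* Bytes needed for `count` elements of `elem_size` bytes plus metadata. */
static size_t required_bytes(size_t count, size_t elem_size, size_t overhead)
{
    assert(elem_size > 0);
    return count * elem_size + overhead;
}

/* Returns 1 if the transfer fits within max_array_bytes, 0 otherwise. */
static int fits(size_t count, size_t elem_size, size_t overhead,
                size_t max_array_bytes)
{
    return required_bytes(count, elem_size, overhead) <= max_array_bytes;
}
```

With count = 262144 and 4-byte elements, the plain array needs 1048576 bytes and fits a limit of 1048576, but the same array plus 36 bytes of metadata needs 1048612 bytes and does not.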

Ken did mention that a larger EPICS_CA_MAX_ARRAY_BYTES setting was tried, but was this set for both the client and the server?

edited on: 2005-06-13 11:12

Jeff Hill (johill-lanl) wrote :

From Eric,

You are correct. Adding a little more 'slop' on the server side gets
things working. I believe that the code in base already adds some
space to the EPICS_CA_MAX_ARRAY_BYTES, doesn't it? If so, perhaps this
slop factor needs to be a little larger.

Jeff Hill (johill-lanl) wrote :

> I believe that the code in base already adds some space to
> the EPICS_CA_MAX_ARRAY_BYTES, doesn't it? If so, perhaps
> this slop factor needs to be a little larger.

I do internally pad the user's request by 16 bytes for the protocol header. That’s necessary because the user shouldn’t need to know about such details.

Otherwise, we could consider adding some slop to cover the largest scalar DBR type. The argument for this would be perhaps less confusion and fewer trouble calls. The argument against, I suppose, is that we seem to always regret trying to second-guess the user. By how much should we pad past the user's request? What if there were new scalar DBR types that were very large?

Jeff Hill (johill-lanl) wrote :

If the code isn't changed we should at least include a warning about this particular situation in the documentation - possibly under configuring large arrays and also under troubleshooting.

Jeff Hill (johill-lanl) wrote :

From Ken,

My opinion is that common sense would dictate trying a little (to a lot) more than the minimum required, esp. if the minimum required didn't work. We did this on the client side but forgot about the server side. I'm not sure there is a real problem here.

However, there *is* a problem to the extent that the error message did not inform us what was really wrong. At a minimum it could have said what size the request was and what size was allowed. In addition, presumably the client side knew it was a server problem and could have said so. A good error message is a lot better than putting it in the documentation and troubleshooting. (I didn't look at either of those for this problem.)

Jeff Hill (johill-lanl) wrote :

> However, there *is* a problem to the extent that the error
> message did not inform us what was really wrong.

Note that MEDM does receive an excellent exception message, but that it chooses to throw it away and present the user with a less informative message.

Here is MEDM's message:

medmUpdateChannelCb: Bad status [72] for cf2:turn:gtr:waveform0: The requested data transfer is greater than available memory or EPICS_CA_MAX_ARRAY_BYTES

Here is the exception being delivered to the client (I can see it on the stack):

[2] oldSubscription::exception(this = 0x404dc8, guard = CLASS, status = 72, _ARG4 = 0x36d128 "server unable to load read (or subscription update) response into protocol buffer PV="cf2:turn:gtr:waveform0" max bytes=1048600", type = 19U, count = 262144U), line 92 in "oldSubscription.cpp"

Jeff Hill (johill-lanl) wrote :

From Ken,

     Yeah, I saw that on the stack. MEDM uses:

    if(args.status != ECA_NORMAL) {
        medmPostMsg(0,"medmUpdateChannelCb: Bad status [%d] for %s: %s\n",
          args.status,
          ca_name(args.chid)?ca_name(args.chid):"Name Unknown",
          ca_message(args.status));
        return;
    }

What should I be using?

Jeff Hill (johill-lanl) wrote :

To avoid discarding useful information that the user may need, and to do at least the minimum that the built-in exception handler does, it's probably a good idea to print all of the fields in the exception_handler_args structure. In this situation the string stored in the "ctx" field contained particularly important details.
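
A minimal sketch of that suggestion follows. So that it builds without the EPICS headers, it uses a stand-in structure rather than the real exception_handler_args; the field names are assumed to match cadef.h and should be checked against the header, and pv_name is a hypothetical replacement for ca_name(args.chid).

```c
#include <assert.h>
#include <stdio.h>

/* Stand-in for struct exception_handler_args so the sketch compiles
 * without EPICS. Field names are assumed to follow cadef.h; verify them
 * before use. pv_name is hypothetical and replaces ca_name(args.chid). */
struct exception_args_sketch {
    const char *pv_name;
    long type;           /* requested DBR type */
    long count;          /* requested element count */
    long stat;           /* CA status code */
    long op;             /* operation that failed */
    const char *ctx;     /* context string with the details */
    const char *pFile;   /* source file, may be NULL */
    unsigned lineNo;     /* source line, meaningful only if pFile is set */
};

/* Substitute a printable placeholder for a missing string. */
static const char *or_unknown(const char *s)
{
    return s ? s : "unknown";
}

/* Format every field into buf; returns the snprintf result. */
static int format_exception(char *buf, size_t len,
                            const struct exception_args_sketch *a)
{
    assert(buf != NULL && a != NULL);
    return snprintf(buf, len,
        "CA exception: pv=%s type=%ld count=%ld stat=%ld op=%ld "
        "ctx=\"%s\" file=%s line=%u",
        or_unknown(a->pv_name), a->type, a->count, a->stat, a->op,
        or_unknown(a->ctx), or_unknown(a->pFile), a->lineNo);
}
```

In a real client the formatted string would go to whatever logging call the program already uses, for example medmPostMsg in MEDM.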

Jeff Hill (johill-lanl) wrote :

From Ken:

     This happens in a *subscription* callback with *event_handler_args*. MEDM did *not* get an exception in its exception handler when this happened. That handler prints the ctx field you mention and would have printed the message if it had been called. It wasn't called.

edited on: 2005-06-15 14:51

Jeff Hill (johill-lanl) wrote :

Yes, you are correct. Sorry about the confusion. I saw the context on the stack and just assumed that this was an exception callback. I had a look at the code involved, and the new interface allows the context to be passed along to the user, but the old interface (event_handler_args) doesn’t.

Perhaps we could use the dbr pointer for that purpose (it's currently set to nil if the status isn’t ECA_NORMAL), but that might be a bit too confusing.

Andrew Johnson (anj) wrote :

R3.14.8 Release.
