underflow in casr diagnostic

Bug #667386 reported by Jeff Hill
6
This bug affects 1 person
Affects Status Importance Assigned to Milestone
EPICS Base
Invalid
Undecided
Unassigned

Bug Description

Today I saw a situation where casr was printing "Undelivered response bytes=4294967272". this value is 0xFFFFFFE8; suspiciously like unsigned integer underflow. Code could be added to prevent this situation, but I should also sweep through the source and try to determine how the structure members involved could actually get into this state. Another possibility is corruption, nbut this is contraindicated because other members of the surrounding data structure are ok.

> This is the output of "casr 100"
>
> epics> casr 100
> Channel Access Server V4.11
> Connected circuits:
> TCP 164.54.160.82:57075(corvette.cars.aps.anl.gov): User="epics", V4.11,
> 19 Channels, Priority=0
> Task Id=0x13736780, Socket FD=12
> Secs since last send 1937.69, Secs since last receive 2059.90
> Unprocessed request bytes=0, Undelivered response bytes=0
> State=up
> 4968 bytes allocated
> dxpXMAP:PixelsPerBuffer(1rw) dxpXMAP:PixelsPerBuffer_RBV(1rw)
> dxpXMAP:AutoPixelsPerBuffer(1rw)
> dxpXMAP:BufferSize_RBV(1rw) dxpXMAP:CollectMode(1rw)
> dxpXMAP:PixelAdvanceMode(1rw)
> dxpXMAP:SyncCount(1rw) dxpXMAP:IgnoreGate(1rw)
> dxpXMAP:InputLogicPolarity(1rw)
> dxpXMAP:dxp1:CurrentPixel(1rw) dxpXMAP:NextPixel(1rw)
> dxpXMAP:PixelsPerRun(1rw)
> dxpXMAP:AutoApply(1rw) dxpXMAP:Apply(1rw)
> dxpXMAP:MBytesRead_RBV(1rw)
> dxpXMAP:ReadRate_RBV(1rw) dxpXMAP:SyncMasterChannel(1rw)
> dxpXMAP:GateMasterChannel(1rw)
> dxpXMAP:LBUSMasterChannel(1rw)
> Send Lock
> epicsMutexId 0x13748318 source ../caservertask.c line 732
> ownerTid 0x0 count 0 owned 0
> Put Notify Lock
> epicsMutexId 0x137355b0 source ../caservertask.c line 733
> ownerTid 0x0 count 0 owned 0
> Address Queue Lock
> epicsMutexId 0x13735630 source ../caservertask.c line 734
> ownerTid 0x0 count 0 owned 0
> Event Queue Lock
> epicsMutexId 0x13736238 source ../caservertask.c line 735
> ownerTid 0x0 count 0 owned 0
> Block Semaphore
> TCP 164.54.160.82:57112(corvette.cars.aps.anl.gov): User="epics", V4.11,
> 25 Channels, Priority=0
> Task Id=0x100ebf38, Socket FD=3
> Secs since last send 581.47, Secs since last receive 868.66
> Unprocessed request bytes=0, Undelivered response bytes=0
> State=up
> 6480 bytes allocated
> dxpXMAP:EraseAll(1rw) dxpXMAP:EraseStart(1rw)
> dxpXMAP:StartAll(1rw)
> dxpXMAP:StopAll(1rw) dxpXMAP:Acquiring(1rw)
> dxpXMAP:Acquiring(1rw)
> dxpXMAP:EnableClientWait(1rw) dxpXMAP:ClientWait(1rw)
> dxpXMAP:StatusAllOnce(1rw)
> dxpXMAP:StatusAll(1rw) dxpXMAP:ReadAll(1rw)
> dxpXMAP:ReadAllOnce(1rw)
> dxpXMAP:ReadLLParams(1rw) dxpXMAP:ReadLLParams(1rw)
> dxpXMAP:PresetMode(1rw)
> dxpXMAP:ElapsedReal(1rw) dxpXMAP:PresetReal(1rw)
> dxpXMAP:ElapsedLive(1rw)
> dxpXMAP:PresetLive(1rw) dxpXMAP:PresetEvents(1rw)
> dxpXMAP:PresetTriggers(1rw)
> dxpXMAP:IDeadTime(1rw) dxpXMAP:DeadTime(1rw)
> dxpXMAP:PollTime_RBV(1rw)
> dxpXMAP:PollTime(1rw)
> Send Lock
> epicsMutexId 0x1010c8e8 source ../caservertask.c line 732
> ownerTid 0x0 count 0 owned 0
> Put Notify Lock
> epicsMutexId 0x1010c5a8 source ../caservertask.c line 733
> ownerTid 0x0 count 0 owned 0
> Address Queue Lock
> epicsMutexId 0x139d57d0 source ../caservertask.c line 734
> ownerTid 0x0 count 0 owned 0
> Event Queue Lock
> epicsMutexId 0x1010ce60 source ../caservertask.c line 735
> ownerTid 0x0 count 0 owned 0
> Block Semaphore
> TCP 164.54.160.87:2797(XIA-Detector): User="epics", V4.11, 2329 Channels,
> Priority=0
> Task Id=0x13a9f098, Socket FD=13
> Secs since last send 584.48, Secs since last receive 584.48
> Unprocessed request bytes=8256, Undelivered response
> bytes=4294967272
> State=up
> 478560 bytes allocated
> dxpXMAP:PresetMode(0rw) dxpXMAP:PresetReal(0rw) dxpXMAP:mca1(1rw)
> dxpXMAP:mca1(1r-) dxpXMAP:mca1(1rw) dxpXMAP:mca1(1rw)
...
> dxpXMAP:mca16(0rw) dxpXMAP:mca16(0rw) dxpXMAP:mca16(0rw)
> dxpXMAP:ElapsedLive(1rw) dxpXMAP:ElapsedReal(1rw)
> dxpXMAP:Acquiring(1rw)
> dxpXMAP:ClientWait(1rw) dxpXMAP:EraseAll(0rw)
> dxpXMAP:StartAll(0rw)
> dxpXMAP:ReadAll(0rw)
> Send Lock
> epicsMutexId 0x13a9e888 source ../caservertask.c line 732
> ownerTid 0x13a9f010 count 1 owned 1
> Put Notify Lock
> epicsMutexId 0x13a9e980 source ../caservertask.c line 733
> ownerTid 0x0 count 0 owned 0
> Address Queue Lock
> epicsMutexId 0x13a9ea78 source ../caservertask.c line 734
> ownerTid 0x0 count 0 owned 0
> Event Queue Lock
> epicsMutexId 0x13a9eb70 source ../caservertask.c line 735
> ownerTid 0x0 count 0 owned 0
> Block Semaphore
> UDP Server:
> UDP 164.54.160.100:1518(): User="", V4.0, 0 Channels, Priority=0
> Task Id=0x133721f0, Socket FD=9
> Secs since last send 936.94, Secs since last receive 0.58
> Unprocessed request bytes=16, Undelivered response bytes=0
> State=up
> 180 bytes allocated
> Send Lock
> epicsMutexId 0x133725a8 source ../caservertask.c line 732
> ownerTid 0x0 count 0 owned 0
> Put Notify Lock
> epicsMutexId 0x133726a0 source ../caservertask.c line 733
> ownerTid 0x0 count 0 owned 0
> Address Queue Lock
> epicsMutexId 0x13372798 source ../caservertask.c line 734
> ownerTid 0x0 count 0 owned 0
> Event Queue Lock
> epicsMutexId 0x13372890 source ../caservertask.c line 735
> ownerTid 0x0 count 0 owned 0
> Block Semaphore
> There are currently 286036 bytes on the server's free list
> 4 client(s), 699 channel(s), 1424 event(s) (monitors) 0 putNotify(s)
> 10 small buffers (16384 bytes ea), and 0 jumbo buffers (1000024 bytes ea)
> The server's resource id conversion table:
> Bucket entries in use = 2373 bytes in use = 54372
> Bucket entries/hash id - mean = 0.579346 std dev = 0.493664 max = 1
> The server's array size limit is 1000024 bytes max
> Channel Access Address List
> 164.54.160.255:5065
> epics>
>
> 164.54.160.87 is the IP address of the machine running both the IOC and
> the IDL client. For that client connection it is saying:
>
> Unprocessed request bytes=8256, Undelivered response
> bytes=4294967272
> State=up
>
> Can you provide any hints on what could be going wrong? Other diagnostics
> to print out?

Tags: cas cleanup rsrv
Revision history for this message
Jeff Hill (johill-lanl) wrote :

see also bug 667384

Revision history for this message
Jeff Hill (johill-lanl) wrote :

This code in camessage.c looks sketchy. It seems that the else part of the if clause does not consider where stk is, and could push stk to be greater than cnt.

    if ( client->recvBytesToDrain ) {
        if ( client->recvBytesToDrain >= client->recv.cnt ) {
            client->recvBytesToDrain -= client->recv.cnt;
            client->recv.stk = client->recv.cnt;
            return RSRV_OK;
        }
        else {
            client->recv.stk += client->recvBytesToDrain;
            client->recvBytesToDrain = 0u;
        }
    }

Revision history for this message
Jeff Hill (johill-lanl) wrote :

Nevertheless, on line 80 of camsgtask.c I see that it always sets stk to zero just before calling the camessage function, and the camessage is called on behalf of tcp only from just this one location. So While the code is old, and it doesn't appear to be up to my current standards for robustness, this particular wart in the source doesn't appear appear to cause a bug.

Revision history for this message
Jeff Hill (johill-lanl) wrote :

I do see that the symptoms observed would occur if the thread was trapped by the IP kernel in this function at line 67 of camsgtask.c

status = socket_ioctl (client->sock, FIONREAD, &nchars);

This is because recv.stk isnt cleared until line 80 of camsgtask.c

Revision history for this message
Jeff Hill (johill-lanl) wrote :

Since the send lock is also held I am starting to suspect that the event (send) thread is trapped in send by the ip kernel, and the receive thread is trapped in socket_ioctl by the ip kernel.

Revision history for this message
Jeff Hill (johill-lanl) wrote :

To confirm this scenario, one would need to build HOST_OP=NO and attach to the process with cygwin gdb. One can type "info threads" to see what is running, and "thread nnn" to change the thread context. Once in the right thread one can type "bt" to see the stack.

We are looking to see if a thread "CAS-TCP" (whose task id is seen in casr) is lodged in the function ioctl, and also if the associated "CAS-event" thread (running at almost the same priority) is lodged in the function send.

tags: added: cleanup
Changed in epics-base:
status: New → Invalid
To post a comment you must log in.
This report contains Public information  
Everyone can see this information.

Other bug subscribers

Remote bug watches

Bug watches keep track of this bug in other bug trackers.