Bug #541316 (mantis-284) “channel clear protocol warning message...” : Bugs : EPICS Base

Revision history for this message

Jeff Hill (johill-lanl) wrote on 2007-02-28:

#1

> CAS: CAS Request: epics on cosylab.com: cmd=12 cid=4294967295 typ=0
> cnt=0 psz=0 avail=1

This indicates a client is trying to delete a channel that does not exist in the server. I can think of three scenarios.

In the first one we have a client application that is deleting a channel that has already been deleted. That might even be possible since a channel id is only a pointer and the data structure involved is returned to a free list where it might remain unmolested for some time. If the application deletes the same channel twice the library might be tricked into sending the delete request a second time. The solution will be of course be to fix the client application.

In the second one something is wrong with the client library. If that were the case then the first question to ask is what version of the client library is sending these requests (from epics on cosylab.com). Another possible explanation might be that some client with a protocol other than CA is using the TCP port of the gateway's server.

The third one has the gateway code deleting the PV, and the client library either doesn’t know how to deal with the server deletes the channel protocol, or the client application tries to delete the channel during a short time window existing only while the client library hasn’t yet received the server deletes the channel message.

Since we see "cid=4294967295" on multiple lines one could certainly guess that the client has "issues" involved with repeatedly deleting the same channel. Unfortunately, from this diagnostic we don’t know if this is multiple deletes of the same channel within the same client lifespan or else a client being repeatedly run that somehow always stumbles when attempting to delete the same channel id. That makes it somewhat harder to distinguish between the first two cases above and the third one. Still, I think we can conclude that the timing associated with the 3rd scenario would probably make it less likely to occur repeatedly with the same channel id. So we suspect the first two scenarios, and of those two I would currently place my money on the first one if I had to make a choice, but that guess is mostly based on conjecture.

> CAS: CAS Request: epics on cosylab.com: cmd=12 cid=4294967295 typ=0 
> cnt=0 psz=0 avail=1

This indicates a client is trying to delete a channel that does not exist in the server. I can think of three scenarios.

In the first one we have a client application that is deleting a channel that has already been deleted. That might even be possible since a channel id is only a pointer and the data structure involved is returned to a free list where it might remain unmolested for some time. If the application deletes the same channel twice the library might be tricked into sending the delete request a second time. The solution will be of course be to fix the client application.

In the second one something is wrong with the client library. If that were the case then the first question to ask is what version of the client library is sending these requests (from epics on cosylab.com). Another possible explanation might be that some client with a protocol other than CA is using the TCP port of the gateway's server.

The third one has the gateway code deleting the PV, and the client library either doesn’t know how to deal with the server deletes the channel protocol, or the client application tries to delete the channel during a short time window existing only while the client library hasn’t yet received the server deletes the channel message.

Since we see "cid=4294967295" on multiple lines one could certainly guess that the client has "issues" involved with repeatedly deleting the same channel. Unfortunately, from this diagnostic we don’t know if this is multiple deletes of the same channel within the same client lifespan or else a client being repeatedly run that somehow always stumbles when attempting to delete the same channel id. That makes it somewhat harder to distinguish between the first two cases above and the third one. Still, I think we can conclude that the timing associated with the 3rd scenario would probably make it less likely to occur repeatedly with the same channel id. So we suspect the first two scenarios, and of those two I would currently place my money on the first one if I had to make a choice, but that guess is mostly based on conjecture.

Revision history for this message

Jeff Hill (johill-lanl) wrote on 2007-02-28:

#2

From Gasper:

BTW the clients from my last email are caget and camonitor that comes with the epics base.

Revision history for this message

Andrew Johnson (anj) wrote on 2008-11-19:

#3

We are seeing this same issue at the APS with versions of the gateway built on both R3.14.8.2 and R3.14.10. In this case the client subscribes to alarm monitors for large number of channels and then basically just sits waiting in ca_pend_event(). We've had error messages logged by the gateway after midnight when there was definitely nobody starting or stopping the client (so it's not actively clearing channels), and on the client it looks like we're getting alarm value changes reported, although it's harder for me to be sure exactly what is happening there.

It's not obvious to me whether this is a server or a client library problem. If it helps we might be able to attach wireshark to the TCP circuit to track down what's actually happening.

Revision history for this message

Andrew Johnson (anj) wrote on 2008-11-19:

#4

PS to my above note: The client was built against a pre-R3.14.10 CVS version in both cases. We'll be upgrading it to the official R3.14.10 release soon.

Revision history for this message

Jeff Hill (johill-lanl) wrote on 2009-07-08:

#5

One can observe that a cid of 4294967295 is equivalent to 0xffffffff

Revision history for this message

Jeff Hill (johill-lanl) wrote on 2009-07-08:

#6

One can also observe that the server id of a newly created channel in the client library is initialized to UINT_MAX

nciu::nciu ( cac & cacIn, netiiu & iiuIn, cacChannelNotify & chanIn,
            const char *pNameIn, cacChannel::priLev pri ) :
    cacChannel ( chanIn ),
    cacCtx ( cacIn ),
    piiu ( & iiuIn ),
    sid ( UINT_MAX ),
    count ( 0 ),
    retry ( 0u ),
    nameLength ( 0u ),
    typeCode ( USHRT_MAX ),
    priority ( static_cast <ca_uint8_t> ( pri ) )
{

edited on: 2009-07-08 10:01

Revision history for this message

Jeff Hill (johill-lanl) wrote on 2009-07-08:

#7

the client lib has this logic to prevent sending the channel delete message to the server if the channel isn't yet installed there

void nciu::destroy (
    epicsGuard < epicsMutex > & guard )
{
    while ( baseNMIU * pNetIO = this->eventq.first () ) {
        assert ( this->cacCtx.destroyIO (
            guard, pNetIO->getId (), *this ) );
    }

    // if the claim reply has not returned yet then we will issue
    // the clear channel request to the server when the claim reply
    // arrives and there is no matching nciu in the client
    if ( this->channelNode::isInstalledInServer ( guard ) ) {
        this->getPIIU(guard)->clearChannelRequest (
            guard, this->sid, this->id );
    }
    this->piiu->uninstallChan ( guard, *this );
    this->cacCtx.destroyChannel ( guard, *this );
}

Revision history for this message

Jeff Hill (johill-lanl) wrote on 2009-07-08:

#8

this function in nciu also sets the server id to all ones

void nciu::setServerAddressUnknown ( netiiu & newiiu,
                                epicsGuard < epicsMutex > & guard )
{
    guard.assertIdenticalMutex ( this->cacCtx.mutexRef () );
    this->piiu = & newiiu;
    this->retry = 0;
    this->typeCode = USHRT_MAX;
    this->count = 0u;
    this->sid = UINT_MAX;
    this->accessRightState.clrReadPermit();
    this->accessRightState.clrWritePermit();
}

Revision history for this message

Jeff Hill (johill-lanl) wrote on 2009-07-08:

#9

Download full text (4.7 KiB)

committed a fix

cvs diff -wb -- tcpiiu.cpp (in directory C:\hill\R3.14.dll_hell_fix\epics\base\src\ca\)
Index: tcpiiu.cpp
===================================================================
RCS file: /net/phoebus/epicsmgr/cvsroot/epics/base/src/ca/tcpiiu.cpp,v
retrieving revision 1.154.2.41
diff -u -b -w -b -r1.154.2.41 tcpiiu.cpp
--- tcpiiu.cpp 7 Jul 2009 23:47:38 -0000 1.154.2.41
+++ tcpiiu.cpp 8 Jul 2009 17:37:20 -0000
@@ -1775,8 +1775,10 @@
}

     while ( nciu * pChan = this->createRespPend.get () ) {
- this->clearChannelRequest ( guard,
- pChan->getSID(guard), pChan->getCID(guard) );
+ // we dont yet know the server's id so we cant
+ // send a channel delete request and will instead
+ // trust that the server can do the proper cleanup
+ // when the circuit disconnects
         discIIU.installDisconnectedChannel ( guard, *pChan );
     }

@@ -1788,19 +1790,35 @@

     while ( nciu * pChan = this->subscripReqPend.get () ) {
         pChan->disconnectAllIO ( cbGuard, guard );
+ this->clearChannelRequest ( guard,
+ pChan->getSID(guard), pChan->getCID(guard) );
         discIIU.installDisconnectedChannel ( guard, *pChan );
         pChan->unresponsiveCircuitNotify ( cbGuard, guard );
     }

     while ( nciu * pChan = this->connectedList.get () ) {
         pChan->disconnectAllIO ( cbGuard, guard );
+ this->clearChannelRequest ( guard,
+ pChan->getSID(guard), pChan->getCID(guard) );
         discIIU.installDisconnectedChannel ( guard, *pChan );
         pChan->unresponsiveCircuitNotify ( cbGuard, guard );
     }

     while ( nciu * pChan = this->unrespCircuit.get () ) {
+ // if we know that the circuit is unresponsive
+ // then we dont send a channel delete request and
+ // will instead trust that the server can do the
+ // proper cleanup when the circuit disconnects
+ pChan->disconnectAllIO ( cbGuard, guard );
+ discIIU.installDisconnectedChannel ( guard, *pChan );
+ }
+
+ while ( nciu * pChan = this->subscripUpdateReqPend.get () ) {
         pChan->disconnectAllIO ( cbGuard, guard );
+ this->clearChannelRequest ( guard,
+ pChan->getSID(guard), pChan->getCID(guard) );
         discIIU.installDisconnectedChannel ( guard, *pChan );
+ pChan->unresponsiveCircuitNotify ( cbGuard, guard );
     }

this->channelCountTot = 0u;
@@ -1816,20 +1834,18 @@
guard.assertIdenticalMutex ( this->mutex );

     while ( nciu * pChan = this->createReqPend.get () ) {
- // with server prior to V42 IO could exit here
- pChan->disconnectAllIO ( cbGuard, guard );
         pChan->serviceShutdownNotify ( cbGuard, guard );
     }

while ( nciu * pChan = this->createRespPend.get () ) {
- pChan->disconnectAllIO ( cbGuard, guard );
- this->clearChannelRequest ( guard,
- pChan->getSID(guard), pChan->getCID(guard) );
+ // we dont yet know the server's id so we cant
+ // send a channel delete request and will instead
+ // trust that the server can do the proper cleanup
+ // when the circuit disconnects
...

committed a fix

cvs diff -wb -- tcpiiu.cpp (in directory C:\hill\R3.14.dll_hell_fix\epics\base\src\ca\)
Index: tcpiiu.cpp
===================================================================
RCS file: /net/phoebus/epicsmgr/cvsroot/epics/base/src/ca/tcpiiu.cpp,v
retrieving revision 1.154.2.41
diff -u -b -w -b -r1.154.2.41 tcpiiu.cpp
--- tcpiiu.cpp	7 Jul 2009 23:47:38 -0000	1.154.2.41
+++ tcpiiu.cpp	8 Jul 2009 17:37:20 -0000
@@ -1775,8 +1775,10 @@
     }
 
     while ( nciu * pChan = this->createRespPend.get () ) {
-        this->clearChannelRequest ( guard, 
-            pChan->getSID(guard), pChan->getCID(guard) );
+        // we dont yet know the server's id so we cant
+        // send a channel delete request and will instead 
+        // trust that the server can do the proper cleanup
+        // when the circuit disconnects
         discIIU.installDisconnectedChannel ( guard, *pChan );
     }
     
@@ -1788,19 +1790,35 @@
 
     while ( nciu * pChan = this->subscripReqPend.get () ) {
         pChan->disconnectAllIO ( cbGuard, guard );
+        this->clearChannelRequest ( guard, 
+            pChan->getSID(guard), pChan->getCID(guard) );
         discIIU.installDisconnectedChannel ( guard, *pChan );
         pChan->unresponsiveCircuitNotify ( cbGuard, guard );
     }
 
     while ( nciu * pChan = this->connectedList.get () ) {
         pChan->disconnectAllIO ( cbGuard, guard );
+        this->clearChannelRequest ( guard, 
+            pChan->getSID(guard), pChan->getCID(guard) );
         discIIU.installDisconnectedChannel ( guard, *pChan );
         pChan->unresponsiveCircuitNotify ( cbGuard, guard );
     }
 
     while ( nciu * pChan = this->unrespCircuit.get () ) {
+        // if we know that the circuit is unresponsive
+        // then we dont send a channel delete request and 
+        // will instead trust that the server can do the 
+        // proper cleanup when the circuit disconnects
+        pChan->disconnectAllIO ( cbGuard, guard );
+        discIIU.installDisconnectedChannel ( guard, *pChan );
+    }
+    
+    while ( nciu * pChan = this->subscripUpdateReqPend.get () ) {
         pChan->disconnectAllIO ( cbGuard, guard );
+        this->clearChannelRequest ( guard, 
+            pChan->getSID(guard), pChan->getCID(guard) );
         discIIU.installDisconnectedChannel ( guard, *pChan );
+        pChan->unresponsiveCircuitNotify ( cbGuard, guard );
     }
 
     this->channelCountTot = 0u;
@@ -1816,20 +1834,18 @@
     guard.assertIdenticalMutex ( this->mutex );
 
     while ( nciu * pChan = this->createReqPend.get () ) {
-        // with server prior to V42 IO could exit here
-        pChan->disconnectAllIO ( cbGuard, guard );
         pChan->serviceShutdownNotify ( cbGuard, guard );
     }
 
     while ( nciu * pChan = this->createRespPend.get () ) {
-        pChan->disconnectAllIO ( cbGuard, guard );
-        this->clearChannelRequest ( guard, 
-            pChan->getSID(guard), pChan->getCID(guard) );
+        // we dont yet know the server's id so we cant
+        // send a channel delete request and will instead 
+        // trust that the server can do the proper cleanup
+        // when the circuit disconnects
         pChan->serviceShutdownNotify ( cbGuard, guard );
     }
     
     while ( nciu * pChan = this->v42ConnCallbackPend.get () ) {
-        pChan->disconnectAllIO ( cbGuard, guard );
         this->clearChannelRequest ( guard, 
             pChan->getSID(guard), pChan->getCID(guard) );
         pChan->serviceShutdownNotify ( cbGuard, guard );
@@ -1837,16 +1853,31 @@
 
     while ( nciu * pChan = this->subscripReqPend.get () ) {
         pChan->disconnectAllIO ( cbGuard, guard );
+        this->clearChannelRequest ( guard, 
+            pChan->getSID(guard), pChan->getCID(guard) );
         pChan->serviceShutdownNotify ( cbGuard, guard );
     }
 
     while ( nciu * pChan = this->connectedList.get () ) {
         pChan->disconnectAllIO ( cbGuard, guard );
+        this->clearChannelRequest ( guard, 
+            pChan->getSID(guard), pChan->getCID(guard) );
         pChan->serviceShutdownNotify ( cbGuard, guard );
     }
 
     while ( nciu * pChan = this->unrespCircuit.get () ) {
         pChan->disconnectAllIO ( cbGuard, guard );
+        // if we know that the circuit is unresponsive
+        // then we dont send a channel delete request and 
+        // will instead trust that the server can do the 
+        // proper cleanup when the circuit disconnects
+        pChan->serviceShutdownNotify ( cbGuard, guard );
+    }
+    
+     while ( nciu * pChan = this->subscripUpdateReqPend.get () ) {
+        pChan->disconnectAllIO ( cbGuard, guard );
+        this->clearChannelRequest ( guard, 
+            pChan->getSID(guard), pChan->getCID(guard) );
         pChan->serviceShutdownNotify ( cbGuard, guard );
     }

Revision history for this message

Jeff Hill (johill-lanl) wrote on 2009-07-08:

#10

fixed in R3.14.11

Revision history for this message

Andrew Johnson (anj) wrote on 2009-08-28:

#11

R3.14.11 released.

EPICS Base

channel clear protocol warning messages from PCAS

Bug Description

Other bug subscribers

Remote bug watches