Bug #541181 (mantis-111) “non-preemptive clients disconnect if c...” : Bugs : EPICS Base

Revision history for this message

evans (evans) wrote on 2005-04-08:

#1

Download full text (4.8 KiB)

I have been looking at the logic in Channel Access for determining if
a circuit is unresponsive (in non-premptive mode, the mode currently
used by most clients). I have found that it does not behave as
expected if the delay between calls to ca_poll (or the equivalent) is
long, and I believe I have fixed it.

Keep in mind that the delay between calls to ca_poll can be long for
several reasons, including (1) the programmer did not ca_poll
frequently because that was not required for 3.13, (2) it may be
impractical to call ca_poll on a regular basis in some situations, and
(3) owing to an unexpected event in an otherwise well-behaved client
ca_poll may not be called on time. Channel Access should be robust to
all these conditions, and, in particular, it should not say the
circuit is unresponsive when, in fact, it *is* responsive. Doing that
confuses people and causes undesirable behavior in the software as
well.

I have convinced myself that if the algorithm is working correctly,
the circuit will not be found to be unresponsive when ca_poll is not
called frequently unless it is truly unresponsive. To explain this,
let me first state what the algorithm is: Set a timer for 30 sec, then
reset it for 30 sec whenever an event comes in that indicates a
response. If the timer expires, then send a probe to the server and
reset the timer for 5 sec. If the probe response comes back or
another event happens, reset it for 30 sec, else if it expires, mark
it as unresponsive. This is fairly simple and straightforward.

The problems occur because the callbacks for the timer, the probe
response, and most other events do not get processed until ca_poll is
called. However, if done correctly, the algorithm should still work.
The basic reason is that for a responsive circuit, even though the
30-sec timer may expire, the probe response will always come back
before the 5-sec timer expires, independently of when the processing
of these events occurs.

As an example, assume ca_poll is called every 60 sec. First the timer
is set and expires at 30 sec. At 60 sec, the expire routine is
processed, a probe is sent, and the timer is reset to expire at 65
sec. The probe response then comes in, at say 61 sec, and the timer
expires at 65 sec (without knowing this). At 120 sec, the probe
response is processed, and the timer is reset to expire at 150 sec
(120 + 30). It continues in the same manner and is never marked as
unresponsive unless the probe response is not received, in which case
it truly *is* unresponsive.

I have had to fix three things to make it work this way.

1. The expire time was being set to the time the timer was received
plus the 30 or 5-sec delay, rather than being set to the time it was
processed plus the delay. When it was processed late, the expire time
was in the past, so that it expired immediately, giving the probe no
chance and generally screwing up the whole logic. This was fixed by
setting the expire time to the actual current time plus the delay.
This requires an extra call to epicsTime::getCurrent() and thus it is
more costly to do this. My tests with the Gateway have indicated this
delay is insignificant, and the price should be wo...

I have been looking at the logic in Channel Access for determining if
a circuit is unresponsive (in non-premptive mode, the mode currently
used by most clients).  I have found that it does not behave as
expected if the delay between calls to ca_poll (or the equivalent) is
long, and I believe I have fixed it.

Keep in mind that the delay between calls to ca_poll can be long for
several reasons, including (1) the programmer did not ca_poll
frequently because that was not required for 3.13, (2) it may be
impractical to call ca_poll on a regular basis in some situations, and
(3) owing to an unexpected event in an otherwise well-behaved client
ca_poll may not be called on time.  Channel Access should be robust to
all these conditions, and, in particular, it should not say the
circuit is unresponsive when, in fact, it *is* responsive.  Doing that
confuses people and causes undesirable behavior in the software as
well.

I have convinced myself that if the algorithm is working correctly,
the circuit will not be found to be unresponsive when ca_poll is not
called frequently unless it is truly unresponsive.  To explain this,
let me first state what the algorithm is: Set a timer for 30 sec, then
reset it for 30 sec whenever an event comes in that indicates a
response.  If the timer expires, then send a probe to the server and
reset the timer for 5 sec.  If the probe response comes back or
another event happens, reset it for 30 sec, else if it expires, mark
it as unresponsive.  This is fairly simple and straightforward.

The problems occur because the callbacks for the timer, the probe
response, and most other events do not get processed until ca_poll is
called.  However, if done correctly, the algorithm should still work.
The basic reason is that for a responsive circuit, even though the
30-sec timer may expire, the probe response will always come back
before the 5-sec timer expires, independently of when the processing
of these events occurs.

As an example, assume ca_poll is called every 60 sec.  First the timer
is set and expires at 30 sec.  At 60 sec, the expire routine is
processed, a probe is sent, and the timer is reset to expire at 65
sec.  The probe response then comes in, at say 61 sec, and the timer
expires at 65 sec (without knowing this).  At 120 sec, the probe
response is processed, and the timer is reset to expire at 150 sec
(120 + 30).  It continues in the same manner and is never marked as
unresponsive unless the probe response is not received, in which case
it truly *is* unresponsive.

I have had to fix three things to make it work this way.

1. The expire time was being set to the time the timer was received
plus the 30 or 5-sec delay, rather than being set to the time it was
processed plus the delay.  When it was processed late, the expire time
was in the past, so that it expired immediately, giving the probe no
chance and generally screwing up the whole logic.  This was fixed by
setting the expire time to the actual current time plus the delay.
This requires an extra call to epicsTime::getCurrent() and thus it is
more costly to do this.  My tests with the Gateway have indicated this
delay is insignificant, and the price should be worth having a more
robust algorithm in any event.  In the normal case it now takes
slightly longer (on the order of ms) to be declared unresponsive, but
declaring the circuit unresponsive should be done conservatively,
making this acceptable.

2. The probe response and the timer expire happen in different threads
and they were randomly not being processed in the order in which they
happened.  If the expire happened to be processed before the probe
response, it would think the probe response had not been received,
even though it had actually been received first.  This was fixed by
resetting the timer for an additional delay (I used 1 sec) when it was
expecting a probe response and the elapsed time between the expire
time and the current time was larger than a fixed amount (I used 100
ms).  This is enough to allow it to process any queued probe response
before declaring the circuit unresponsive.

3. Resetting the timer does not always prevent a previously expired,
but unprocessed timer from being processed.  It does set the expire
time in the timer, however.  This was fixed by not processing timers
with expire times in the future.

With these changes, my tests indicate there are no "Virtual circuit
unresponsive" exceptions even when ca_poll is called irregularly,
unless the circuit does *not* respond to the probe and is truly
unresponsive.  I have also tested the changes with a normal client
(PerfStat) updating 100 PVs at 100,000 Hz through the Gateway, both
built with the changes.  There is no discernable difference in
performance, indicating these changes do not affect normal operation
where ca_poll is called frequently as is preferable and as is done in
most of the important clients (when they are behaving normally).

Revision history for this message

Jeff Hill (johill-lanl) wrote on 2005-04-14:

#2

There is one potential negative associated with fixing the CA client library so that it will not mark circuits unresponsive if a CA function isn't called at least once ever EPICS_CA_CONN_TMO (30) seconds.

That negative might be that developers will feel that it's perfectly ok for non-preemptive callback client applications to subscribe for monitor updates, and then not call a ca function for a long time allowing monitor updates to back up into the IOC and consume mbufs there. You can't stop users from doing that, but the behavior change you are advocating might also be encouraging (facilitating) naughty coding habits.

edited on: 2005-04-13 18:03

Revision history for this message

Jeff Hill (johill-lanl) wrote on 2005-04-14:

#3

From Ken Evans:

Your point about notifying clients about not calling a CA function is
valid and I agree with it (strongly). However, that should be done
elsewhere (i.e. in the CA functions). Marking a responsive circuit as
unresponsive is not the way to accomplish that end. A notice such as
happens when the client cannot contact CaRepeater is what is appropriate.
Also, keep in mind that it may not always be absolutely necessary to do it
the suggested way. Clients like Probe worked fine before. (Which is not to
say I recommend the old way.)

Revision history for this message

Jeff Hill (johill-lanl) wrote on 2005-04-14:

#4

Printing a warning message might be the best approach

Revision history for this message

Jeff Hill (johill-lanl) wrote on 2005-04-14:

#5

From Andrew Johnson:

One solution to this issue is for the client context (and hence probably
each Virtual Circuit) to behave differently depending whether the client
application actually has any live monitor subscriptions or not. If not,
the client should not have to call ca_pend_event() or similar to avoid
getting unnecessary errors; if if does, it should get a Virtual Circuit
Neglected error - this is different to the current Virtual Circuit
Unresponsive error as it indicates that the error is caused by the
client end code behaviour, not a problem communicating with the server.

Of course I'm not sure how much work it would be to implement this, but
I suspect it would be a compromise behaviour that would satisfy both
sides if it can be done without too much work. If not, could we at
least change the error that gets generated to a Virtual Circuit
Neglected one which distinguishes the two problems.

Revision history for this message

Jeff Hill (johill-lanl) wrote on 2005-04-14:

#6

From Ken Evans:

I should also have said in the previous email that in some cases, such as
with the random "Virtual circuit disconnects" we are getting here, that a
"Virtual circuit neglected" message would help in debugging the situation.
Then we would know it is the client and not the server or network. Some
thought should probably go into how to implement this message in order to
keep it useful without its being annoying.

See mantis-194

Revision history for this message

Jeff Hill (johill-lanl) wrote on 2005-04-14:

#7

More from Ken Evans:

Jeff's point about encouraging bad coding habits is perfectly valid.
Clients should usually be calling a CA function every 100 ms or more often.
However, as Andrew points out, there are cases where not doing that may be
perfectly reasonable and non-destructive. Probe, for example, worked well
for many years without calling a CA function unless it needed to.

What I would prefer is a message like the one that happens when a client
cannot contact CaRepeater. That happens once per client invocation and gets
the message across. What would be bad is a "Virtual circuit neglected"
message every 30 sec.

The not calling a CA function and the "Virtual circuit unresponsive"
watchdog actually occur in logically different parts of the code, and using
"Virtual circuit neglected" instead of "Virtual circuit unresponsive" is
probably not the best solution. (IMHO, of course ;)

Revision history for this message

Jeff Hill (johill-lanl) wrote on 2005-05-09:

#8

Download full text (17.2 KiB)

From Ken Evans:

53 krypton:ca>pwd
/home/phoebus/EVANS/epics/base314Debug/src/ca
54 krypton:ca>cvs diff -c
cvs diff: Diffing .
Index: tcpRecvWatchdog.cpp
===================================================================
RCS file: /net/phoebus/epicsmgr/cvsroot/epics/base/src/ca/tcpRecvWatchdog.cpp,v
retrieving revision 1.33.2.14
diff -c -r1.33.2.14 tcpRecvWatchdog.cpp
*** tcpRecvWatchdog.cpp 19 Oct 2004 20:54:19 -0000 1.33.2.14
--- tcpRecvWatchdog.cpp 7 Apr 2005 17:32:08 -0000
***************
*** 1,3 ****
--- 1,4 ----
+
  /*************************************************************************
  * Copyright (c) 2002 The University of Chicago, as Operator of Argonne
  * National Laboratory.
***************
*** 18,29 ****
--- 19,147 ----
   * Author: Jeff Hill
   */

+ #define DEBUG_TIMES 0
+ #define DEBUG_SHUTDOWN 0
+ #define DEBUG_THREAD 0
+
+ #if DEBUG_TIMES || DEBUG_THREAD || DEBUG_SHUTDOWN
+ # define DEBUG
+ #endif
+
+ #define USE_CURRENT_TIME 1
+ #define USE_CHECK_TIMER_EXPIRATION 1
+ #define USE_EXTRA_EXPIRE_TIMEOUT 1
+
+ #if USE_EXTRA_EXPIRE_TIMEOUT
+ # define ALLOWED_EXPIRE_DELAY 0.1
+ # define EXTRA_EXPIRE_TIMEOUT 1.0
+ #endif
+
#define epicsAssertAuthor "Jeff Hill <email address hidden>"

  #include "iocinf.h"
  #include "cac.h"
  #include "virtualCircuit.h"

+ #if DEBUG_TIMES || DEBUG_SHUTDOWN
+ #define PROBE 1
+ #define ANOMALY 2
+ #define BYTES 4
+ #define PROBETMO 8
+ #define ECHO 16
+ #define SHUTDOWN 32
+
+ #define FLAGS(x) (
+ (x->probeResponsePending?PROBE:0) |
+ (x->beaconAnomaly?ANOMALY:0) |
+ (x->iiu.bytesArePendingInOS()?BYTES:0) |
+ (x->probeTimeoutDetected?PROBETMO:0) |
+ (x->shuttingDown?SHUTDOWN:0))
+
+ static const char format[]="%H:%M:%S.%04f";
+
+ static void printTimeStamp(unsigned flags)
+ {
+ char tsString[50];
+
+ epicsTimeToStrftime(tsString,sizeof(tsString),
+ format,&(epicsTimeStamp)epicsTime::getCurrent());
+
+ printf("%s %c%c%c%c%c: ",tsString,
+ flags&PROBE?'p':'-',
+ flags&PROBETMO?'t':'-',
+ flags&ANOMALY?'a':'-',
+ flags&BYTES?'b':'-',
+ flags&SHUTDOWN?'s':'-');
+ }
+
+ static void printTimeStamp(unsigned flags, epicsTime time)
+ {
+ char timeString[50];
+ char tsString[50];
+ static const char zeroString[]="00:00:00.000";
+
+ epicsTimeToStrftime(tsString,sizeof(tsString),
+ format,&(epicsTimeStamp)epicsTime::getCurrent());
+
+ epicsTimeToStrftime(timeString,sizeof(timeString),
+ format,&(epicsTimeStamp)time);
+
+ printf("%s [%s] %c%c%c%c%c: ",tsString,timeString,
+ flags&PROBE?'p':'-',
+ flags&PROBETMO?'t':'-',
+ flags&ANOMALY?'a':'-',
+ flags&BYTES?'b':'-',
+ flags&SHUTDOWN?'s':'-');
+ }
+
+ static char *printTime(epicsTime time)
+ /* Gets current time and puts it in a static array
+ * The calling program should copy it to a safe place
+ * e.g. strcpy(savetime,timestamp()); */
+ {
+ static char timeStampStr[50];
+
+ epicsTimeToStrftime(timeStampStr,sizeof(timeStampStr),
+ format,&(epicsTimeStamp)time);
+
+ return timeStampStr;
+ }
+
+ static void printReset(epicsTime end, double interval)
+ {
+ #if USE_CURREN...

From Ken Evans:

53 krypton:ca>pwd
/home/phoebus/EVANS/epics/base314Debug/src/ca
54 krypton:ca>cvs diff -c
cvs diff: Diffing .
Index: tcpRecvWatchdog.cpp
===================================================================
RCS file: /net/phoebus/epicsmgr/cvsroot/epics/base/src/ca/tcpRecvWatchdog.cpp,v
retrieving revision 1.33.2.14
diff -c -r1.33.2.14 tcpRecvWatchdog.cpp
*** tcpRecvWatchdog.cpp	19 Oct 2004 20:54:19 -0000	1.33.2.14
--- tcpRecvWatchdog.cpp	7 Apr 2005 17:32:08 -0000
***************
*** 1,3 ****
--- 1,4 ----
+ 
  /*************************************************************************
  * Copyright (c) 2002 The University of Chicago, as Operator of Argonne
  *     National Laboratory.
***************
*** 18,29 ****
--- 19,147 ----
   *  Author: Jeff Hill
   */
  
+ #define DEBUG_TIMES 0
+ #define DEBUG_SHUTDOWN 0
+ #define DEBUG_THREAD 0
+ 
+ #if DEBUG_TIMES || DEBUG_THREAD  || DEBUG_SHUTDOWN
+ # define DEBUG
+ #endif
+ 
+ #define USE_CURRENT_TIME 1
+ #define USE_CHECK_TIMER_EXPIRATION 1
+ #define USE_EXTRA_EXPIRE_TIMEOUT 1
+ 
+ #if USE_EXTRA_EXPIRE_TIMEOUT
+ # define ALLOWED_EXPIRE_DELAY 0.1
+ # define EXTRA_EXPIRE_TIMEOUT 1.0
+ #endif
+ 
  #define epicsAssertAuthor "Jeff Hill johill@lanl.gov"
  
  #include "iocinf.h"
  #include "cac.h"
  #include "virtualCircuit.h"
  
+ #if DEBUG_TIMES || DEBUG_SHUTDOWN
+ #define PROBE    1
+ #define ANOMALY  2
+ #define BYTES    4
+ #define PROBETMO 8
+ #define ECHO     16
+ #define SHUTDOWN 32
+ 
+ #define FLAGS(x) (
+ (x->probeResponsePending?PROBE:0) |
+ (x->beaconAnomaly?ANOMALY:0) |
+ (x->iiu.bytesArePendingInOS()?BYTES:0) |
+ (x->probeTimeoutDetected?PROBETMO:0) |
+ (x->shuttingDown?SHUTDOWN:0))
+ 
+ static const char format[]="%H:%M:%S.%04f";
+ 
+ static void printTimeStamp(unsigned flags)
+ {
+     char tsString[50];
+     
+     epicsTimeToStrftime(tsString,sizeof(tsString),
+       format,&(epicsTimeStamp)epicsTime::getCurrent());
+ 
+     printf("%s %c%c%c%c%c: ",tsString,
+       flags&PROBE?'p':'-',
+       flags&PROBETMO?'t':'-',
+       flags&ANOMALY?'a':'-',
+       flags&BYTES?'b':'-',
+       flags&SHUTDOWN?'s':'-');
+ }
+ 
+ static void printTimeStamp(unsigned flags, epicsTime time)
+ {
+     char timeString[50];
+     char tsString[50];
+     static const char zeroString[]="00:00:00.000";
+     
+     epicsTimeToStrftime(tsString,sizeof(tsString),
+       format,&(epicsTimeStamp)epicsTime::getCurrent());
+ 
+     epicsTimeToStrftime(timeString,sizeof(timeString),
+       format,&(epicsTimeStamp)time);
+ 
+     printf("%s [%s] %c%c%c%c%c: ",tsString,timeString,
+       flags&PROBE?'p':'-',
+       flags&PROBETMO?'t':'-',
+       flags&ANOMALY?'a':'-',
+       flags&BYTES?'b':'-',
+       flags&SHUTDOWN?'s':'-');
+ }
+ 
+ static char *printTime(epicsTime time)
+   /* Gets current time and puts it in a static array
+    * The calling program should copy it to a safe place
+    *   e.g. strcpy(savetime,timestamp()); */
+ {
+     static char timeStampStr[50];
+     
+     epicsTimeToStrftime(timeStampStr,sizeof(timeStampStr),
+       format,&(epicsTimeStamp)time);
+ 
+     return timeStampStr;
+ }
+ 
+ static void printReset(epicsTime end, double interval)
+ {
+ #if USE_CURRENT_TIME
+     static const char fix[]=" [FIX]";
+ #else
+     static const char fix[]="";
+ #endif
+     printf("  Timer reset to %s (%f sec from now) with interval %f%sn",
+       printTime(end),
+       end-epicsTime::getCurrent(),
+       interval,fix);
+ }
+ 
+ static void printResetViaReturn(double interval)
+ {
+ #if USE_CURRENT_TIME
+     static const char fix[]=" [FIX]";
+ #else
+     static const char fix[]="";
+ #endif
+     printf("  Timer reset (via return) to %s (%f sec from now)%sn",
+       printTime(epicsTime::getCurrent()+interval),
+       interval,fix);
+ }
+ 
+ static void printExpire(epicsTime time)
+ {
+     printf("expire:  Timer for %s (%f sec from now) expiredn",
+       printTime(time),
+       time-epicsTime::getCurrent());
+ }
+ 
+ #endif // DEBUG_TIMES
+ 
+ 
  //
  // the recv watchdog timer is active when this object is created
  //
***************
*** 45,58 ****
  }
  
  epicsTimerNotify::expireStatus
! tcpRecvWatchdog::expire ( const epicsTime & /* currentTime */ ) // X aCC 361
  {
      callbackManager mgr ( this->ctxNotify, this->cbMutex );
      epicsGuard < epicsMutex > guard ( this->mutex );
      if ( this->shuttingDown ) {
          return noRestart;
      }
      if ( this->probeResponsePending ) {
          if ( this->iiu.bytesArePendingInOS() ) {
              this->iiu.printf ( mgr.cbGuard,
      "The CA client library's server inactivity timer initiated server disconnectn" );
--- 163,210 ----
  }
  
  epicsTimerNotify::expireStatus
! tcpRecvWatchdog::expire ( const epicsTime & currentTime) // X aCC 361
  {
      callbackManager mgr ( this->ctxNotify, this->cbMutex );
      epicsGuard < epicsMutex > guard ( this->mutex );
      if ( this->shuttingDown ) {
          return noRestart;
      }
+ 
+   // KE: To make things faster this check could be put inside the if (
+   // this->probeResponsePending ) branch where it is most needed.
+ #if USE_CHECK_TIMER_EXPIRATION
+   // Check if this timer came too early
+     double elapsed = epicsTime::getCurrent() - timer.getExpireInfo().expireTime;
+     if(elapsed < 0) {
+ #if DEBUG_TIMES
+ 	printTimeStamp(FLAGS(this),currentTime);
+ 	printExpire(timer.getExpireInfo().expireTime);
+         debugPrintf ( ( "  Expire time is in the futuren") );
+ 	printResetViaReturn(-elapsed);
+ #endif
+         return expireStatus ( restart, -elapsed );
+     }
+ #endif
+ 
      if ( this->probeResponsePending ) {
+ #if USE_EXTRA_EXPIRE_TIMEOUT
+       // If it has been too long after the timer expired, reset it for
+       // another delay.  This will help with clients that are not
+       // calling ca_poll fast enough.  It is needed because the
+       // tcpRecvWatchdog sometimes gets the expire before the
+       // probeResponse, even though they happened in the opposite
+       // order.
+ 	if(elapsed > ALLOWED_EXPIRE_DELAY) {
+ #if DEBUG_TIMES
+ 	    printTimeStamp(FLAGS(this),currentTime);
+ 	    printExpire(timer.getExpireInfo().expireTime);
+ 	    debugPrintf ( ( "  Circuit timed out - elapsed time too long, restartingn") );
+ 	    printResetViaReturn(EXTRA_EXPIRE_TIMEOUT);
+ #endif
+ 	    return expireStatus ( restart, EXTRA_EXPIRE_TIMEOUT );
+ 	}
+ #endif // USE_EXTRA_EXPIRE_TIMEOUT
          if ( this->iiu.bytesArePendingInOS() ) {
              this->iiu.printf ( mgr.cbGuard,
      "The CA client library's server inactivity timer initiated server disconnectn" );
***************
*** 67,79 ****
      "o application is blocked in a callback from the client libraryn" );
          }
          {
! #           ifdef DEBUG
!                 char hostName[128];
!                 this->iiu.getHostName ( guard, hostName, sizeof (hostName) );
!                 debugPrintf ( ( "CA server "%s" unresponsive after %g inactive sec"
!                             "- disconnecting.n", 
!                     hostName, this->period ) );
! #           endif
              this->iiu.receiveTimeoutNotify ( mgr, guard );
              this->probeTimeoutDetected = true;
          }
--- 219,238 ----
      "o application is blocked in a callback from the client libraryn" );
          }
          {
! #if DEBUG_TIMES
! 	    char hostName[128];
! 	    this->iiu.getHostName ( guard, hostName, sizeof (hostName) );
! 	    printTimeStamp(FLAGS(this),currentTime);
! 	    printExpire(timer.getExpireInfo().expireTime);
! 	    debugPrintf ( ( "  Probe timed out - setting "%s" unresponsiven",
! 			    hostName ) );
! #if 0
! 	    debugPrintf ( ( "  CA server "%s" unresponsive after %g inactive sec "
! 			    "- disconnecting.n", 
! 			    hostName, this->period ) );
! #endif
! 	    debugPrintf ( ( "  Timer stoppedn") );
! # endif
              this->iiu.receiveTimeoutNotify ( mgr, guard );
              this->probeTimeoutDetected = true;
          }
***************
*** 82,101 ****
      else {
          this->probeTimeoutDetected = false;
          this->probeResponsePending = this->iiu.setEchoRequestPending ( guard );
!         debugPrintf ( ("circuit timed out - sending echo requestn") );
          return expireStatus ( restart, CA_ECHO_TIMEOUT );
      }
  }
  
  void tcpRecvWatchdog::beaconArrivalNotify ( 
!     epicsGuard < epicsMutex > & guard, const epicsTime & currentTime )
  {
      guard.assertIdenticalMutex ( this->mutex );
      if ( ! ( this->shuttingDown || this->beaconAnomaly || this->probeResponsePending ) ) {
          epicsGuardRelease < epicsMutex > unguard ( guard );
          this->timer.start ( *this, currentTime + this->period );
!         debugPrintf ( ("saw a normal beacon - reseting circuit receive watchdogn") );
      }
  }
  
  //
--- 241,282 ----
      else {
          this->probeTimeoutDetected = false;
          this->probeResponsePending = this->iiu.setEchoRequestPending ( guard );
! #if DEBUG_TIMES
! 	printTimeStamp(FLAGS(this),currentTime);
! 	printExpire(timer.getExpireInfo().expireTime);
!         debugPrintf ( ( "  Circuit timed out - sending proben") );
! 	printResetViaReturn(CA_ECHO_TIMEOUT);
! #endif
! #if DEBUG_THREAD
! 	epicsThreadShowAll(4);
! #endif
          return expireStatus ( restart, CA_ECHO_TIMEOUT );
      }
  }
  
  void tcpRecvWatchdog::beaconArrivalNotify ( 
!    epicsGuard < epicsMutex > & guard, const epicsTime & currentTime )
  {
      guard.assertIdenticalMutex ( this->mutex );
      if ( ! ( this->shuttingDown || this->beaconAnomaly || this->probeResponsePending ) ) {
          epicsGuardRelease < epicsMutex > unguard ( guard );
+ #if USE_CURRENT_TIME
+         this->timer.start ( *this, this->period );
+ #else
          this->timer.start ( *this, currentTime + this->period );
! #endif
! #if DEBUG_TIMES
! 	printTimeStamp(FLAGS(this),currentTime);
!         debugPrintf ( ("Saw a normal beacon - reseting circuit receive watchdogn") );
! 	printReset(timer.getExpireInfo().expireTime,this->period);
! #endif
      }
+ #if DEBUG_TIMES
+     else {
+ 	printTimeStamp(FLAGS(this),currentTime);
+         debugPrintf ( ("Saw a normal beacon - doing nothingn") );
+     }
+ #endif
  }
  
  //
***************
*** 110,116 ****
--- 291,300 ----
  {
      guard.assertIdenticalMutex ( this->mutex );
      this->beaconAnomaly = true;
+ #if DEBUG_TIMES
+     printTimeStamp(FLAGS(this));
      debugPrintf ( ("Saw an abnormal beaconn") );
+ #endif
  }
  
  void tcpRecvWatchdog::messageArrivalNotify ( 
***************
*** 126,133 ****
          // of expire() which takes the lock - it take also 
          // the callback lock
          epicsGuardRelease < epicsMutex > unguard ( guard );
          this->timer.start ( *this, currentTime + this->period );
!         debugPrintf ( ("received a message - reseting circuit recv watchdogn") );
      }
  }
  
--- 310,328 ----
          // of expire() which takes the lock - it take also 
          // the callback lock
          epicsGuardRelease < epicsMutex > unguard ( guard );
+ #if USE_CURRENT_TIME
+         this->timer.start ( *this, this->period );
+ #else
          this->timer.start ( *this, currentTime + this->period );
! #endif
! #if DEBUG_TIMES
! 	printTimeStamp(FLAGS(this),currentTime);
!         debugPrintf ( ("Received a message - reseting circuit recv watchdogn") );
! 	printReset(timer.getExpireInfo().expireTime,this->period);
! #endif
! #if DEBUG_THREAD
! 	epicsThreadShowAll(4);
! #endif
      }
  }
  
***************
*** 145,157 ****
                  this->probeTimeoutDetected = false;
                  this->probeResponsePending = this->iiu.setEchoRequestPending ( guard );
                  restartDelay = CA_ECHO_TIMEOUT;
!                 debugPrintf ( ("late probe response - sending another probe requestn") );
              }
              else {
                  this->probeResponsePending = false;
                  restartDelay = this->period;
                  this->iiu.responsiveCircuitNotify ( cbGuard, guard );
!                 debugPrintf ( ("probe response on time - circuit was tagged reponsive if unresponsiven") );
              }
          }
      }
--- 340,358 ----
                  this->probeTimeoutDetected = false;
                  this->probeResponsePending = this->iiu.setEchoRequestPending ( guard );
                  restartDelay = CA_ECHO_TIMEOUT;
! #if DEBUG_TIMES
! 		printTimeStamp(FLAGS(this),currentTime);
!                 debugPrintf ( ("Probe response late - sending another proben") );
! #endif
              }
              else {
                  this->probeResponsePending = false;
                  restartDelay = this->period;
                  this->iiu.responsiveCircuitNotify ( cbGuard, guard );
! #if DEBUG_TIMES
! 		printTimeStamp(FLAGS(this),currentTime);
!                 debugPrintf ( ("Probe response on time - circuit was tagged reponsive if unresponsiven") );
! #endif
              }
          }
      }
***************
*** 159,167 ****
          // timer callback takes the callback mutex and the cac mutex
          epicsGuardRelease < epicsMutex > cbGuardRelease ( cbGuard );
          epicsTime expireTime = currentTime + restartDelay;
          this->timer.start ( *this, expireTime );
!         debugPrintf ( ("recv wd restarted with delay %fn", restartDelay) );
!     }
  }
  
  //
--- 360,377 ----
          // timer callback takes the callback mutex and the cac mutex
          epicsGuardRelease < epicsMutex > cbGuardRelease ( cbGuard );
          epicsTime expireTime = currentTime + restartDelay;
+ #if USE_CURRENT_TIME
+         this->timer.start ( *this, restartDelay );
+ #else
          this->timer.start ( *this, expireTime );
! #endif
! #if DEBUG_TIMES
! 	printReset(timer.getExpireInfo().expireTime,restartDelay);
! #endif
!     }
! #if DEBUG_THREAD
!     epicsThreadShowAll(4);
! #endif
  }
  
  //
***************
*** 194,201 ****
          // until a recv wd timer expire callback completes, and 
          // this callback takes the lock
          epicsGuardRelease < epicsMutex > unguard ( guard );
          this->timer.start ( *this, currentTime + CA_ECHO_TIMEOUT );
!         debugPrintf ( ("saw heavy send backlog - reseting circuit recv watchdogn") );
      }
  }
  
--- 404,419 ----
          // until a recv wd timer expire callback completes, and 
          // this callback takes the lock
          epicsGuardRelease < epicsMutex > unguard ( guard );
+ #if USE_CURRENT_TIME
+         this->timer.start ( *this, CA_ECHO_TIMEOUT );
+ #else
          this->timer.start ( *this, currentTime + CA_ECHO_TIMEOUT );
! #endif
! #if DEBUG_TIMES
! 	printTimeStamp(FLAGS(this),currentTime);
!         debugPrintf ( ("Saw heavy send backlog - reseting circuit recv watchdogn") );
! 	printReset(timer.getExpireInfo().expireTime,CA_ECHO_TIMEOUT);
! #endif
      }
  }
  
***************
*** 210,216 ****
          epicsGuardRelease < epicsMutex > guardRelease ( guard );
          this->timer.start ( *this, this->period );
      }
!     debugPrintf ( ("connected to the server - initiating circuit recv watchdogn") );
  }
  
  void tcpRecvWatchdog::sendTimeoutNotify ( 
--- 428,438 ----
          epicsGuardRelease < epicsMutex > guardRelease ( guard );
          this->timer.start ( *this, this->period );
      }
! #if DEBUG_TIMES
!     printTimeStamp(FLAGS(this));
!     debugPrintf ( ("Connected to the server - initiating circuit recv watchdogn") );
!     printReset(timer.getExpireInfo().expireTime,this->period);
! #endif
  }
  
  void tcpRecvWatchdog::sendTimeoutNotify ( 
***************
*** 229,244 ****
          epicsGuardRelease < epicsMutex > unguard ( guard );
          {
              epicsGuardRelease < epicsMutex > cbUnguard ( cbGuard );
              this->timer.start ( *this, currentTime + CA_ECHO_TIMEOUT );
          }
      }
!     debugPrintf ( ("TCP send timed out - sending echo requestn") );
  }
  
  void tcpRecvWatchdog::cancel ()
  {
      this->timer.cancel ();
!     debugPrintf ( ("canceling TCP recv watchdogn") );
  }
  
  void tcpRecvWatchdog::shutdown ()
--- 451,483 ----
          epicsGuardRelease < epicsMutex > unguard ( guard );
          {
              epicsGuardRelease < epicsMutex > cbUnguard ( cbGuard );
+ #if USE_CURRENT_TIME
+             this->timer.start ( *this, CA_ECHO_TIMEOUT );
+ #else
              this->timer.start ( *this, currentTime + CA_ECHO_TIMEOUT );
+ #endif
          }
      }
! #if DEBUG_TIMES
!     printTimeStamp(FLAGS(this),currentTime);
!     if(restartNeeded) {
!         debugPrintf ( ( "sendTimeoutNotify: EchoRequestPending -> %sn",
! 			this->probeResponsePending?"true":"false") );
! 	debugPrintf ( ("TCP send timed out - sending probe, resetting timern") );
! 	printReset(timer.getExpireInfo().expireTime,CA_ECHO_TIMEOUT);
!     } else {
! 	debugPrintf ( ("TCP send timed out - no action (sendTimeoutNotify)n") );
!     }
! #endif
  }
  
  void tcpRecvWatchdog::cancel ()
  {
      this->timer.cancel ();
! #if DEBUG_TIMES
!     printTimeStamp(FLAGS(this));
!     debugPrintf ( ("Cancel: Canceling TCP recv watchdogn") );
! #endif
  }
  
  void tcpRecvWatchdog::shutdown ()
***************
*** 248,254 ****
          this->shuttingDown = true;
      }
      this->timer.cancel ();
!     debugPrintf ( ("canceling TCP recv watchdogn") );
  }
  
  double tcpRecvWatchdog::delay () const
--- 487,496 ----
          this->shuttingDown = true;
      }
      this->timer.cancel ();
! #if DEBUG_TIMES || DEBUG_SHUTDOWN
!     printTimeStamp(FLAGS(this));
!     debugPrintf ( ("Shutdown: Canceling TCP recv watchdogn") );
! #endif
  }
  
  double tcpRecvWatchdog::delay () const
55 krypton:ca>

Revision history for this message

Jeff Hill (johill-lanl) wrote on 2005-10-26:

#9

1) I am confused by USE_CHECK_TIMER_EXPIRATION. I don’t think that the timer expire would ever be run by the timer library if epicsTime::getCurrent() - timer.getExpireInfo().expireTime was negative. I'm not fond of adding code that does not serve a purpose.

2) I also worry about USE_EXTRA_EXPIRE_TIMEOUT. This change seems to never allow the circuit to disconnect if the timer expire runs ALLOWED_EXPIRE_DELAY (0.1 seconds) after when the timer was specified to expire. That means that if the system gets busy, the circuits will not disconnect. I am inclined to give disconnect the benefit of the doubt. That produces a more robust behavior. If they are relying on the subscription update to track the state of a remote system, and take evasive action if the state isn’t appropriate, then they should probably receive a disconnect notification when a higher priority thread is using too much cpu. What you have done here is decided that disconnects are bad under these circumstance and turned them off. However other users might consider disconnects in this situation to be good. If this change was added we would probably need to enable it only if preemptive callback was disable and we knew the time since the last call to ca_poll was too long.

3) I think that I do understand USE_CURRENT_TIME. This will cause the scheduling of subsequent timers to be more accurate should the timer expire not get a chance to run immediately. This will add some extra overhead but arguably operate more robustly in an environment where they don’t call ca_poll regularly.

Revision history for this message

evans (evans) wrote on 2005-10-26:

#10

All of these changes are necessary and serve a purpose. See the note
dated 2005-04-08 for a further explanation of what they do. I tested
the changes pretty thoroughly. That is why there is so much debugging
code added. They do solve the problem and do not seem to cause
problems with the normal case where ca_poll is called regularly and
where there are no network problems. If you are unsure of the
consequences, it would probably be best to test if your concerns
occur.

There is some extra overhead, but I believe it is negligible and
should for the most part only occur in the unusual situations where
there are problems.

Revision history for this message

Jeff Hill (johill-lanl) wrote on 2005-10-31:

#11

Fixed in R3.14.8 (and also on the CVS main trunk)

The fix forces the receive thread to run before the receive watchdog
thread if they are both waiting for the preemptive callback control mutex.
I installed a circuit specific mutex to enforcement the new order.

This fix should avoid a disconnect if the user does not call ca_poll
regularly, but will continue to disconnect if a higher priority thread
uses too much CPU. This distinction is particularly applicable to a
fault where the thread that calls ca_context_create() uses all available
CPU. That thread typically runs at a priority above the receive thread
and below the receive watchdog thread. In that situation the receive
watchdog would disconnect, and not wait for the receive thread. Ken's
patch does not allow for this.

I did take Ken's advice and used a more current time stamp to schedule
subsequent timer expiration in the receive watchdog.

Revision history for this message

Andrew Johnson (anj) wrote on 2005-11-22:

#12

R3.14.8 Release.

EPICS Base

non-preemptive clients disconnect if ca_poll() isnt called regularly

Bug Description

Other bug subscribers

Remote bug watches