non-preemptive clients disconnect if ca_poll() isnt called regularly

Bug #541181 reported by Jeff Hill
6
This bug affects 1 person
Affects Status Importance Assigned to Milestone
EPICS Base
Fix Released
Wishlist
Jeff Hill

Bug Description

With the R3.14 CA client library non-preemptive mode
CA clients disconnect if ca_poll() isnt called
regularly. This is a behavior change compared to R3.13.

Additional information:
Technically, we have always advised non-preemptive CA clients to call ca_poll() regularly.

Original Mantis Bug: mantis-111
    http://www.aps.anl.gov/epics/mantis/view_bug_page.php?f_id=111

Tags: ca 3.14
Revision history for this message
evans (evans) wrote :
Download full text (4.8 KiB)

I have been looking at the logic in Channel Access for determining if
a circuit is unresponsive (in non-premptive mode, the mode currently
used by most clients). I have found that it does not behave as
expected if the delay between calls to ca_poll (or the equivalent) is
long, and I believe I have fixed it.

Keep in mind that the delay between calls to ca_poll can be long for
several reasons, including (1) the programmer did not ca_poll
frequently because that was not required for 3.13, (2) it may be
impractical to call ca_poll on a regular basis in some situations, and
(3) owing to an unexpected event in an otherwise well-behaved client
ca_poll may not be called on time. Channel Access should be robust to
all these conditions, and, in particular, it should not say the
circuit is unresponsive when, in fact, it *is* responsive. Doing that
confuses people and causes undesirable behavior in the software as
well.

I have convinced myself that if the algorithm is working correctly,
the circuit will not be found to be unresponsive when ca_poll is not
called frequently unless it is truly unresponsive. To explain this,
let me first state what the algorithm is: Set a timer for 30 sec, then
reset it for 30 sec whenever an event comes in that indicates a
response. If the timer expires, then send a probe to the server and
reset the timer for 5 sec. If the probe response comes back or
another event happens, reset it for 30 sec, else if it expires, mark
it as unresponsive. This is fairly simple and straightforward.

The problems occur because the callbacks for the timer, the probe
response, and most other events do not get processed until ca_poll is
called. However, if done correctly, the algorithm should still work.
The basic reason is that for a responsive circuit, even though the
30-sec timer may expire, the probe response will always come back
before the 5-sec timer expires, independently of when the processing
of these events occurs.

As an example, assume ca_poll is called every 60 sec. First the timer
is set and expires at 30 sec. At 60 sec, the expire routine is
processed, a probe is sent, and the timer is reset to expire at 65
sec. The probe response then comes in, at say 61 sec, and the timer
expires at 65 sec (without knowing this). At 120 sec, the probe
response is processed, and the timer is reset to expire at 150 sec
(120 + 30). It continues in the same manner and is never marked as
unresponsive unless the probe response is not received, in which case
it truly *is* unresponsive.

I have had to fix three things to make it work this way.

1. The expire time was being set to the time the timer was received
plus the 30 or 5-sec delay, rather than being set to the time it was
processed plus the delay. When it was processed late, the expire time
was in the past, so that it expired immediately, giving the probe no
chance and generally screwing up the whole logic. This was fixed by
setting the expire time to the actual current time plus the delay.
This requires an extra call to epicsTime::getCurrent() and thus it is
more costly to do this. My tests with the Gateway have indicated this
delay is insignificant, and the price should be wo...

Read more...

Revision history for this message
Jeff Hill (johill-lanl) wrote :

There is one potential negative associated with fixing the CA client library so that it will not mark circuits unresponsive if a CA function isn't called at least once ever EPICS_CA_CONN_TMO (30) seconds.

That negative might be that developers will feel that it's perfectly ok for non-preemptive callback client applications to subscribe for monitor updates, and then not call a ca function for a long time allowing monitor updates to back up into the IOC and consume mbufs there. You can't stop users from doing that, but the behavior change you are advocating might also be encouraging (facilitating) naughty coding habits.

edited on: 2005-04-13 18:03

Revision history for this message
Jeff Hill (johill-lanl) wrote :

From Ken Evans:

     Your point about notifying clients about not calling a CA function is
valid and I agree with it (strongly). However, that should be done
elsewhere (i.e. in the CA functions). Marking a responsive circuit as
unresponsive is not the way to accomplish that end. A notice such as
happens when the client cannot contact CaRepeater is what is appropriate.
Also, keep in mind that it may not always be absolutely necessary to do it
the suggested way. Clients like Probe worked fine before. (Which is not to
say I recommend the old way.)

Revision history for this message
Jeff Hill (johill-lanl) wrote :

Printing a warning message might be the best approach

Revision history for this message
Jeff Hill (johill-lanl) wrote :

From Andrew Johnson:

One solution to this issue is for the client context (and hence probably
each Virtual Circuit) to behave differently depending whether the client
application actually has any live monitor subscriptions or not. If not,
the client should not have to call ca_pend_event() or similar to avoid
getting unnecessary errors; if if does, it should get a Virtual Circuit
Neglected error - this is different to the current Virtual Circuit
Unresponsive error as it indicates that the error is caused by the
client end code behaviour, not a problem communicating with the server.

Of course I'm not sure how much work it would be to implement this, but
I suspect it would be a compromise behaviour that would satisfy both
sides if it can be done without too much work. If not, could we at
least change the error that gets generated to a Virtual Circuit
Neglected one which distinguishes the two problems.

Revision history for this message
Jeff Hill (johill-lanl) wrote :

From Ken Evans:

I should also have said in the previous email that in some cases, such as
with the random "Virtual circuit disconnects" we are getting here, that a
"Virtual circuit neglected" message would help in debugging the situation.
Then we would know it is the client and not the server or network. Some
thought should probably go into how to implement this message in order to
keep it useful without its being annoying.

See mantis-194

Revision history for this message
Jeff Hill (johill-lanl) wrote :

More from Ken Evans:

Jeff's point about encouraging bad coding habits is perfectly valid.
Clients should usually be calling a CA function every 100 ms or more often.
However, as Andrew points out, there are cases where not doing that may be
perfectly reasonable and non-destructive. Probe, for example, worked well
for many years without calling a CA function unless it needed to.

What I would prefer is a message like the one that happens when a client
cannot contact CaRepeater. That happens once per client invocation and gets
the message across. What would be bad is a "Virtual circuit neglected"
message every 30 sec.

The not calling a CA function and the "Virtual circuit unresponsive"
watchdog actually occur in logically different parts of the code, and using
"Virtual circuit neglected" instead of "Virtual circuit unresponsive" is
probably not the best solution. (IMHO, of course ;)

Revision history for this message
Jeff Hill (johill-lanl) wrote :
Download full text (17.2 KiB)

From Ken Evans:

53 krypton:ca>pwd
/home/phoebus/EVANS/epics/base314Debug/src/ca
54 krypton:ca>cvs diff -c
cvs diff: Diffing .
Index: tcpRecvWatchdog.cpp
===================================================================
RCS file: /net/phoebus/epicsmgr/cvsroot/epics/base/src/ca/tcpRecvWatchdog.cpp,v
retrieving revision 1.33.2.14
diff -c -r1.33.2.14 tcpRecvWatchdog.cpp
*** tcpRecvWatchdog.cpp 19 Oct 2004 20:54:19 -0000 1.33.2.14
--- tcpRecvWatchdog.cpp 7 Apr 2005 17:32:08 -0000
***************
*** 1,3 ****
--- 1,4 ----
+
  /*************************************************************************
  * Copyright (c) 2002 The University of Chicago, as Operator of Argonne
  * National Laboratory.
***************
*** 18,29 ****
--- 19,147 ----
   * Author: Jeff Hill
   */

+ #define DEBUG_TIMES 0
+ #define DEBUG_SHUTDOWN 0
+ #define DEBUG_THREAD 0
+
+ #if DEBUG_TIMES || DEBUG_THREAD || DEBUG_SHUTDOWN
+ # define DEBUG
+ #endif
+
+ #define USE_CURRENT_TIME 1
+ #define USE_CHECK_TIMER_EXPIRATION 1
+ #define USE_EXTRA_EXPIRE_TIMEOUT 1
+
+ #if USE_EXTRA_EXPIRE_TIMEOUT
+ # define ALLOWED_EXPIRE_DELAY 0.1
+ # define EXTRA_EXPIRE_TIMEOUT 1.0
+ #endif
+
  #define epicsAssertAuthor "Jeff Hill <email address hidden>"

  #include "iocinf.h"
  #include "cac.h"
  #include "virtualCircuit.h"

+ #if DEBUG_TIMES || DEBUG_SHUTDOWN
+ #define PROBE 1
+ #define ANOMALY 2
+ #define BYTES 4
+ #define PROBETMO 8
+ #define ECHO 16
+ #define SHUTDOWN 32
+
+ #define FLAGS(x) (
+ (x->probeResponsePending?PROBE:0) |
+ (x->beaconAnomaly?ANOMALY:0) |
+ (x->iiu.bytesArePendingInOS()?BYTES:0) |
+ (x->probeTimeoutDetected?PROBETMO:0) |
+ (x->shuttingDown?SHUTDOWN:0))
+
+ static const char format[]="%H:%M:%S.%04f";
+
+ static void printTimeStamp(unsigned flags)
+ {
+ char tsString[50];
+
+ epicsTimeToStrftime(tsString,sizeof(tsString),
+ format,&(epicsTimeStamp)epicsTime::getCurrent());
+
+ printf("%s %c%c%c%c%c: ",tsString,
+ flags&PROBE?'p':'-',
+ flags&PROBETMO?'t':'-',
+ flags&ANOMALY?'a':'-',
+ flags&BYTES?'b':'-',
+ flags&SHUTDOWN?'s':'-');
+ }
+
+ static void printTimeStamp(unsigned flags, epicsTime time)
+ {
+ char timeString[50];
+ char tsString[50];
+ static const char zeroString[]="00:00:00.000";
+
+ epicsTimeToStrftime(tsString,sizeof(tsString),
+ format,&(epicsTimeStamp)epicsTime::getCurrent());
+
+ epicsTimeToStrftime(timeString,sizeof(timeString),
+ format,&(epicsTimeStamp)time);
+
+ printf("%s [%s] %c%c%c%c%c: ",tsString,timeString,
+ flags&PROBE?'p':'-',
+ flags&PROBETMO?'t':'-',
+ flags&ANOMALY?'a':'-',
+ flags&BYTES?'b':'-',
+ flags&SHUTDOWN?'s':'-');
+ }
+
+ static char *printTime(epicsTime time)
+ /* Gets current time and puts it in a static array
+ * The calling program should copy it to a safe place
+ * e.g. strcpy(savetime,timestamp()); */
+ {
+ static char timeStampStr[50];
+
+ epicsTimeToStrftime(timeStampStr,sizeof(timeStampStr),
+ format,&(epicsTimeStamp)time);
+
+ return timeStampStr;
+ }
+
+ static void printReset(epicsTime end, double interval)
+ {
+ #if USE_CURREN...

Revision history for this message
Jeff Hill (johill-lanl) wrote :

1) I am confused by USE_CHECK_TIMER_EXPIRATION. I don’t think that the timer expire would ever be run by the timer library if epicsTime::getCurrent() - timer.getExpireInfo().expireTime was negative. I'm not fond of adding code that does not serve a purpose.

2) I also worry about USE_EXTRA_EXPIRE_TIMEOUT. This change seems to never allow the circuit to disconnect if the timer expire runs ALLOWED_EXPIRE_DELAY (0.1 seconds) after when the timer was specified to expire. That means that if the system gets busy, the circuits will not disconnect. I am inclined to give disconnect the benefit of the doubt. That produces a more robust behavior. If they are relying on the subscription update to track the state of a remote system, and take evasive action if the state isn’t appropriate, then they should probably receive a disconnect notification when a higher priority thread is using too much cpu. What you have done here is decided that disconnects are bad under these circumstance and turned them off. However other users might consider disconnects in this situation to be good. If this change was added we would probably need to enable it only if preemptive callback was disable and we knew the time since the last call to ca_poll was too long.

3) I think that I do understand USE_CURRENT_TIME. This will cause the scheduling of subsequent timers to be more accurate should the timer expire not get a chance to run immediately. This will add some extra overhead but arguably operate more robustly in an environment where they don’t call ca_poll regularly.

Revision history for this message
evans (evans) wrote :

All of these changes are necessary and serve a purpose. See the note
dated 2005-04-08 for a further explanation of what they do. I tested
the changes pretty thoroughly. That is why there is so much debugging
code added. They do solve the problem and do not seem to cause
problems with the normal case where ca_poll is called regularly and
where there are no network problems. If you are unsure of the
consequences, it would probably be best to test if your concerns
occur.

There is some extra overhead, but I believe it is negligible and
should for the most part only occur in the unusual situations where
there are problems.

Revision history for this message
Jeff Hill (johill-lanl) wrote :

Fixed in R3.14.8 (and also on the CVS main trunk)

The fix forces the receive thread to run before the receive watchdog
thread if they are both waiting for the preemptive callback control mutex.
I installed a circuit specific mutex to enforcement the new order.

This fix should avoid a disconnect if the user does not call ca_poll
regularly, but will continue to disconnect if a higher priority thread
uses too much CPU. This distinction is particularly applicable to a
fault where the thread that calls ca_context_create() uses all available
CPU. That thread typically runs at a priority above the receive thread
and below the receive watchdog thread. In that situation the receive
watchdog would disconnect, and not wait for the receive thread. Ken's
patch does not allow for this.

I did take Ken's advice and used a more current time stamp to schedule
subsequent timer expiration in the receive watchdog.

Revision history for this message
Andrew Johnson (anj) wrote :

R3.14.8 Release.

To post a comment you must log in.
This report contains Public information  
Everyone can see this information.

Other bug subscribers

Remote bug watches

Bug watches keep track of this bug in other bug trackers.