non-preemptive clients disconnect if ca_poll() isn't called regularly
Bug #541181 reported by Jeff Hill
Affects | Status | Importance | Assigned to | Milestone
---|---|---|---|---
EPICS Base | Fix Released | Wishlist | Jeff Hill |
Bug Description
With the R3.14 CA client library in non-preemptive mode,
CA clients disconnect if ca_poll() isn't called
regularly. This is a behavior change compared to R3.13.
Additional information:
Technically, we have always advised non-preemptive CA clients to call ca_poll() regularly.
Original Mantis Bug: mantis-111
http://
I have been looking at the logic in Channel Access for determining if
a circuit is unresponsive (in non-preemptive mode, the mode currently
used by most clients). I have found that it does not behave as
expected if the delay between calls to ca_poll (or the equivalent) is
long, and I believe I have fixed it.
Keep in mind that the delay between calls to ca_poll can be long for
several reasons, including (1) the programmer did not call ca_poll
frequently because that was not required for 3.13, (2) it may be
impractical to call ca_poll on a regular basis in some situations, and
(3) owing to an unexpected event in an otherwise well-behaved client,
ca_poll may not be called on time. Channel Access should be robust to
all these conditions, and, in particular, it should not say the
circuit is unresponsive when, in fact, it *is* responsive. Doing that
confuses people and causes undesirable behavior in the software as
well.
I have convinced myself that if the algorithm is working correctly,
the circuit will not be found to be unresponsive when ca_poll is not
called frequently unless it is truly unresponsive. To explain this,
let me first state what the algorithm is: Set a timer for 30 sec, then
reset it for 30 sec whenever an event comes in that indicates a
response. If the timer expires, then send a probe to the server and
reset the timer for 5 sec. If the probe response comes back or
another event happens, reset it for 30 sec, else if it expires, mark
it as unresponsive. This is fairly simple and straightforward.
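The algorithm just described can be sketched as a small state machine. This is only an illustrative model, not the CA library's code: the class name CircuitWatchdog, its method names, and the returned action strings are hypothetical, and only the 30-sec and 5-sec values come from the description above.

```python
from dataclasses import dataclass

RESPONSIVE_TIMEOUT = 30.0  # seconds of silence tolerated before a probe is sent
PROBE_TIMEOUT = 5.0        # seconds allowed for the probe response


@dataclass
class CircuitWatchdog:
    """Hypothetical model of the watchdog described above (not a CA class)."""
    expire_at: float = RESPONSIVE_TIMEOUT
    probe_pending: bool = False
    unresponsive: bool = False

    def on_activity(self, now: float) -> None:
        # Any event that indicates a response (including a probe reply)
        # re-arms the 30-sec timer.
        self.probe_pending = False
        self.expire_at = now + RESPONSIVE_TIMEOUT

    def on_timer_expired(self, now: float) -> str:
        if not self.probe_pending:
            # First expiry: send a probe and give the server 5 sec to answer.
            self.probe_pending = True
            self.expire_at = now + PROBE_TIMEOUT
            return "send_probe"
        # Second expiry with no activity in between: give up on the circuit.
        self.unresponsive = True
        return "mark_unresponsive"


wd = CircuitWatchdog()
print(wd.on_timer_expired(30.0))  # silence for 30 sec
wd.on_activity(31.0)              # probe reply (or any other event) arrives
print(wd.on_timer_expired(61.0))  # silence again
print(wd.on_timer_expired(66.0))  # no reply within 5 sec
```

In this sketch the caller supplies the current time, which keeps the question of when the callbacks actually run separate from the state transitions themselves.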
The problems occur because the callbacks for the timer, the probe
response, and most other events do not get processed until ca_poll is
called. However, if done correctly, the algorithm should still work.
The basic reason is that for a responsive circuit, even though the
30-sec timer may expire, the probe response will always come back
before the 5-sec timer expires, independently of when the processing
of these events occurs.
As an example, assume ca_poll is called every 60 sec. First the timer
is set and expires at 30 sec. At 60 sec, the expire routine is
processed, a probe is sent, and the timer is reset to expire at 65
sec. The probe response then comes in, at say 61 sec, and the timer
expires at 65 sec (without knowing the response has arrived). At 120 sec, the probe
response is processed, and the timer is reset to expire at 150 sec
(120 + 30). It continues in the same manner and is never marked as
unresponsive unless the probe response is not received, in which case
it truly *is* unresponsive.
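The timeline in this example can be reproduced with a small simulation. It is a simplified sketch under several assumptions: the function simulate and its parameters are hypothetical, the only events modeled are the timer and the probe reply (no other traffic), pending events are handled in arrival order at each poll, and every expire time is computed from the moment of processing.

```python
RESPONSIVE_TIMEOUT = 30.0  # sec
PROBE_TIMEOUT = 5.0        # sec


def simulate(poll_interval, duration, probe_round_trip=1.0, server_alive=True):
    """Return a log of (time, action) for a client that only processes
    events when it polls. Hypothetical model, not CA library code."""
    log = []
    expire_at = RESPONSIVE_TIMEOUT
    probe_pending = False
    reply_at = None  # arrival time of an unprocessed probe reply, if any
    now = poll_interval
    while now <= duration:
        # Handle, in arrival order, everything that became due since the last poll.
        while True:
            reply_due = reply_at is not None and reply_at <= now
            timer_due = expire_at <= now
            if reply_due and (not timer_due or reply_at <= expire_at):
                # The reply arrived before the probe timer ran out.
                reply_at = None
                probe_pending = False
                expire_at = now + RESPONSIVE_TIMEOUT
                log.append((now, "probe reply processed"))
            elif timer_due and not probe_pending:
                probe_pending = True
                expire_at = now + PROBE_TIMEOUT
                if server_alive:
                    reply_at = now + probe_round_trip
                log.append((now, "probe sent"))
            elif timer_due:
                log.append((now, "marked unresponsive"))
                return log
            else:
                break
        now += poll_interval
    return log


for when, action in simulate(poll_interval=60, duration=240):
    print(when, action)
```

With a 60-sec poll interval and a live server, the log alternates between a probe sent (at 60 and 180 sec) and a probe reply processed (at 120 and 240 sec), and the circuit is never marked unresponsive. Passing server_alive=False makes the model mark it unresponsive at 120 sec.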
I have had to fix three things to make it work this way.
1. The expire time was being set to the time the timer was received
plus the 30 or 5-sec delay, rather than being set to the time it was
processed plus the delay. When it was processed late, the expire time
was in the past, so that it expired immediately, giving the probe no
chance and generally screwing up the whole logic. This was fixed by
setting the expire time to the actual current time plus the delay.
This requires an extra call to epicsTime::getCurrent( ) and thus it is
more costly to do this. My tests with the Gateway have indicated this
delay is insignificant, and the price should be wo...