ca fails to connect

Bug #584951 reported by Jeff Hill
This bug affects 1 person
Affects: EPICS Base
Status: Invalid
Importance: Low
Assigned to: Unassigned
Milestone: (none listed)

Bug Description

Dirk mentioned at the codeathon that in a few rare instances (maybe twice) a client-side IOC never reconnected its db ca link after the IOC at the other end of the link dropped channel access because of excessive CPU load at the server end. Removing the server-end CPU load, and even rebooting the server, did not help. Writing to the link name field on the client side in the database _did_ cause the link to reconnect.

We may need to get a stack trace for the timer queue thread running at similar priority to the db ca link ca client context in order to get this fixed.

Tags: cleanup
Jeff Hill (johill-lanl) wrote :

see also 649469

Jeff Hill (johill-lanl) wrote :

Dirk mentioned that a saturated CPU (I assume in the client-side IOC, but this is not confirmed) seems to be a precipitating circumstance. If so, perhaps the cause is a malfunction in the timer queue code?

Changed in epics-base:
importance: Undecided → High
status: New → Triaged
Jeff Hill (johill-lanl) wrote :

see also 584939

Jeff Hill (johill-lanl) wrote :

also bug 541358

A complete list of related entries is bug 541358, bug 649469, bug 584939.

Jeff Hill (johill-lanl) wrote :

I am writing a regression test that verifies a reconnect after N additional threads in the system use all of the CPU.
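
The shape of such a test can be sketched as below. This is a minimal illustration under stated assumptions, not the actual regression test: `CpuLoad` and `waitUntil` are hypothetical helper names, the load threads run at default priority using only the C++ standard library, and the "is the channel connected" check is left as a caller-supplied predicate rather than a real CA call.

```cpp
#include <atomic>
#include <cassert>
#include <chrono>
#include <functional>
#include <thread>
#include <vector>

// Hypothetical helper: keeps n_threads busy-looping until stop() is called.
class CpuLoad {
public:
    explicit CpuLoad(unsigned n_threads) {
        assert(n_threads > 0);
        for (unsigned i = 0; i < n_threads; ++i) {
            workers_.emplace_back([this] {
                // Spin at least once, then until asked to stop. The shared
                // counter keeps the loop from being optimized away.
                do {
                    spins_.fetch_add(1, std::memory_order_relaxed);
                } while (!stop_.load(std::memory_order_relaxed));
            });
        }
    }
    ~CpuLoad() { stop(); }

    // Ask all load threads to exit and wait for them.
    void stop() {
        stop_.store(true);
        for (auto &t : workers_) {
            if (t.joinable()) t.join();
        }
    }

    unsigned long long spins() const { return spins_.load(); }

private:
    std::atomic<bool> stop_{false};
    std::atomic<unsigned long long> spins_{0};
    std::vector<std::thread> workers_;
};

// Hypothetical helper: polls `connected` until it returns true or the
// timeout elapses. Returns the final result of the predicate.
bool waitUntil(const std::function<bool()> &connected,
               std::chrono::milliseconds timeout,
               std::chrono::milliseconds poll = std::chrono::milliseconds(10)) {
    const auto deadline = std::chrono::steady_clock::now() + timeout;
    while (std::chrono::steady_clock::now() < deadline) {
        if (connected()) return true;
        std::this_thread::sleep_for(poll);
    }
    return connected();
}
```

A test built on this would start `CpuLoad load(N)`, wait for the channel to disconnect, call `load.stop()`, and then require `waitUntil(...)` to report a reconnect within some bound. In an EPICS build the predicate would presumably wrap a check such as `ca_state(chid) == cs_conn`. Threads at default priority may not starve the CA threads the way a high-priority task does, so this is only an approximation of the reported load.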

Jeff Hill (johill-lanl) wrote :

Attempts to write a test that forces a disconnect by using CPU on the client side haven't been successful on Windows, probably in part because the client library is written to detect server-side issues as disconnects but to correctly decide that client-side CPU consumption is not the cause of a disconnect.

Stopping and starting the server process in the debugger doesn't reproduce the issue.

Jeff Hill (johill-lanl) wrote :

I also wrote a simple program that uses 100% of the CPU at a high priority on a vxWorks 6 system here. We observed that camonitor does detect a disconnect when this thread is running (after EPICS_CA_CONN_TMO seconds) and that camonitor does immediately reconnect when the CPU load is removed.

I will also leave the cpu load in place overnight and see if the camonitor client successfully reconnects in the morning.
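
The timeout behaviour described here can be modelled roughly as follows. This is a simplified sketch, not the CA client library's implementation: `ConnWatchdog` is a hypothetical name, time is passed in explicitly as seconds so the logic is easy to check, and the timeout value stands in for EPICS_CA_CONN_TMO.

```cpp
#include <cassert>

// Simplified model: a circuit counts as connected only while traffic from
// the server has been seen within the last conn_tmo_sec seconds.
class ConnWatchdog {
public:
    explicit ConnWatchdog(double conn_tmo_sec) : tmo_(conn_tmo_sec) {
        assert(conn_tmo_sec > 0.0);
    }

    // Record that something was received from the server at time `now`.
    void noteActivity(double now) {
        last_ = now;
        seen_ = true;
    }

    // True if activity was seen no more than tmo_ seconds before `now`.
    bool connected(double now) const {
        return seen_ && (now - last_) <= tmo_;
    }

private:
    double tmo_;
    double last_ = 0.0;
    bool seen_ = false;
};
```

With a 30 second timeout (an example value), activity at t = 10 keeps the model connected at t = 39 and disconnected at t = 41, and new activity at t = 45 reconnects it at once. That matches the camonitor behaviour observed above: a disconnect after the timeout while the server is starved, and an immediate reconnect once it responds again.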

Jeff Hill (johill-lanl) wrote :

The common thread in all of this is that a database link, unlike other types of ca clients, does not reconnect.

Jeff Hill (johill-lanl) wrote :

There is a small possibility that this is another manifestation of Bug #878372.

Jeff Hill (johill-lanl) wrote :

Of course this couldn't be Bug #878372 because the client and server are on different hosts.

Jeff Hill (johill-lanl) wrote :

I have completed some additional testing with no success reproducing this issue here.

I started two CompactRIO R3.14.11 IOCs running vxWorks 6 on PPC processors. One of them contained a record with a db ca link to a record in the other one. On the server side of this db ca link I started a priority 100 thread that used all of the CPU. This caused the db ca link to immediately disconnect. I left this higher priority thread running for 24 hours. The next day I killed the thread that was using all of the CPU and observed that the db ca link immediately reconnected.

Jeff Hill (johill-lanl) wrote :

Considering this further: if rebooting the server didn't help but rewriting the name of the ca link did, then it seems that either ca didn't find out that the socket had disconnected and take appropriate action, or the client's IP kernel didn't detect the TCP circuit disconnect when the server rebooted.

The code paths in ca that detect a socket disconnect are frequently exercised, so if there is a failure there, it's a more rarely occurring type of bug, such as a race condition.
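
For the second hypothesis (the client's IP stack never noticing the dead circuit), the generic mechanism for detecting a vanished peer on an idle connection is TCP keepalive. The sketch below only shows the standard POSIX socket option and how to read it back. Whether and how the ca client library configures it is not stated in this report, and the helper names are hypothetical.

```cpp
#include <cassert>
#include <netinet/in.h>
#include <sys/socket.h>
#include <unistd.h>

// Ask the kernel to probe an idle TCP connection so that a peer that has
// gone away is eventually reported as an error on the socket.
// Returns true if the option was set.
bool enableKeepAlive(int sock) {
    assert(sock >= 0);
    const int on = 1;
    return setsockopt(sock, SOL_SOCKET, SO_KEEPALIVE, &on, sizeof on) == 0;
}

// Read the option back. Returns true only if keepalive is enabled.
bool keepAliveEnabled(int sock) {
    int value = 0;
    socklen_t len = sizeof value;
    if (getsockopt(sock, SOL_SOCKET, SO_KEEPALIVE, &value, &len) != 0) {
        return false;
    }
    return value != 0;
}
```

Checking `keepAliveEnabled()` on the client's circuit socket on an affected system would be one way to narrow down which of the two hypotheses applies. The probe interval is a system-wide kernel setting on many platforms, so detection can still take a long time.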

Changed in epics-base:
status: Triaged → New
Andrew Johnson (anj)
Changed in epics-base:
status: New → Incomplete
importance: High → Low
tags: added: cleanup
Changed in epics-base:
status: Incomplete → Invalid