EPICS Base

ca fails to connect

Bug #584951 reported by Jeff Hill on 2010-05-24

This bug affects 1 person

Affects      Status    Importance   Assigned to   Milestone
EPICS Base   Invalid   Low          Unassigned

Bug Description

Dirk mentioned at the codeathon that in some limited number of rare instances (maybe twice) an IOC never reconnected the client side when the IOC that it had a db ca link to dropped channel access because of too high CPU load at the server end. Removing the server end CPU load and rebooting alone did not help. Writing to the link name field on the client side in the database _did_ cause the link to reconnect.

We may need to get a stack trace for the timer queue thread running at similar priority to the db ca link ca client context in order to get this fixed.

Revision history for this message

Jeff Hill (johill-lanl) wrote on 2010-09-28:

#1

see also 649469

Jeff Hill (johill-lanl) wrote on 2011-10-03:

#2

Dirk mentioned that a saturated CPU (in, I assume, the client-side IOC, though this is not confirmed) seems to be a precipitating circumstance. If so, perhaps the cause is a malfunction in the timer queue code?

Changed in epics-base:
importance:	Undecided → High
status:	New → Triaged

Jeff Hill (johill-lanl) wrote on 2011-11-11:

#3

see also 584939

Jeff Hill (johill-lanl) wrote on 2011-11-11:

#4

also bug 541358

A complete list of related entries is bug 541358, bug 649469, bug 584939

Jeff Hill (johill-lanl) wrote on 2011-11-11:

#5

I am writing a regression test that verifies a reconnect after N additional threads in the system use all of the CPU.
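
The shape of such a test can be sketched as follows. This is only an illustrative outline, not the regression test referred to above: `burn_cpu`, `wait_until`, `reconnects_after_load`, and the `is_connected` callable are all hypothetical names, and `is_connected` stands in for whatever query of the CA link or channel state a real test would use. Note also that under CPython's global interpreter lock these threads saturate roughly one core; a real test on vxWorks would use native tasks at a chosen priority.

```python
import threading
import time


def burn_cpu(stop_event):
    """Spin until asked to stop, consuming as much CPU as the scheduler allows."""
    while not stop_event.is_set():
        pass


def wait_until(predicate, timeout, poll_interval=0.05):
    """Poll predicate() until it returns True or timeout seconds elapse."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        if predicate():
            return True
        time.sleep(poll_interval)
    return bool(predicate())


def reconnects_after_load(is_connected, n_threads, load_seconds, reconnect_timeout):
    """Apply CPU load for load_seconds, remove it, then check for a reconnect.

    is_connected is a hypothetical callable supplied by the caller (assumption:
    it reports whether the link under test is currently connected).
    """
    stop = threading.Event()
    burners = [
        threading.Thread(target=burn_cpu, args=(stop,), daemon=True)
        for _ in range(n_threads)
    ]
    for thread in burners:
        thread.start()
    try:
        time.sleep(load_seconds)
    finally:
        # Always remove the load, even if the sleep is interrupted.
        stop.set()
        for thread in burners:
            thread.join()
    return wait_until(is_connected, reconnect_timeout)
```

A caller would pass its own connection check, for example `reconnects_after_load(my_check, n_threads=4, load_seconds=60, reconnect_timeout=30)`, where `my_check` is again a placeholder.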

Jeff Hill (johill-lanl) wrote on 2011-11-14:

#6

Attempts to write a test that forces a disconnect by using CPU on the client side haven't been successful on Windows, probably in part because the client library is written to detect server-side issues as disconnects but to correctly decide that client-side CPU consumption is not the cause of a disconnect.

Stopping and starting the server process in the debugger doesn't reproduce the issue.

Jeff Hill (johill-lanl) wrote on 2011-11-14:

#7

I also wrote a simple program that uses 100% of the CPU at a high priority in a vxWorks 6 system here. We observed that camonitor does detect a disconnect when this thread is running (after EPICS_CA_CONN_TMO seconds) and that camonitor does immediately reconnect when the CPU load is removed.

I will also leave the cpu load in place overnight and see if the camonitor client successfully reconnects in the morning.

Jeff Hill (johill-lanl) wrote on 2011-11-15:

#8

Another possibly related thread

http://www.aps.anl.gov/epics/tech-talk/2009/msg01167.php

Jeff Hill (johill-lanl) wrote on 2011-11-15:

#9

The common thread with all of this is that a database link, but not other types of ca clients, does not reconnect

Jeff Hill (johill-lanl) wrote on 2011-11-15:

#10

There is a small possibility that this is another manifestation of Bug #878372

Jeff Hill (johill-lanl) wrote on 2011-11-15:

#11

Of course this couldn't be Bug #878372 because the client and server are on different hosts.

Jeff Hill (johill-lanl) wrote on 2011-11-16:

#12

I have completed some additional testing with no success reproducing this issue here.

I started two compact RIO R3.14.11 IOCs running vxWorks 6 on PPC processors. One of them contained a record with a db ca link to a record in the other one. On the server side of this db ca link I started a priority 100 thread that used all of the CPU. This caused the db ca link to immediately disconnect. I left this higher priority thread running for 24 hours. The next day I killed the thread that was using all of the CPU and observed that the db ca link immediately reconnected.

Jeff Hill (johill-lanl) wrote on 2011-11-16:

#13

Considering this further: if rebooting the server didn't help, but rewriting the name of the ca link did help, then it seems that the nature of the failure would be that either ca didn't find out that the socket had disconnected (and so never took the appropriate action), or the client's IP kernel didn't detect the TCP circuit disconnect when the server rebooted.

The code paths in ca that detect a socket disconnect are frequently exercised, so if there is a failure there it's a more rarely occurring type of bug, like a race condition.
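
The two failure modes above differ in what the client's socket can observe, which a small generic sketch may make concrete. This is not code from the ca client library; `probe` is a hypothetical helper, and a local `socket.socketpair()` stands in for the TCP circuit. An orderly close by the peer shows up as an end-of-stream read, whereas a peer that simply goes quiet (for example, a host that reboots without the client's kernel ever seeing a FIN or RST) is indistinguishable from an idle one until some timeout, such as the EPICS_CA_CONN_TMO mentioned earlier, expires.

```python
import socket


def probe(sock, timeout):
    """Classify what a single timed read on sock reveals about the peer."""
    sock.settimeout(timeout)
    try:
        data = sock.recv(1)
    except socket.timeout:
        # Nothing arrived: the peer may be idle, overloaded, or gone
        # without the local kernel having been told.
        return "silent"
    except OSError:
        # The kernel reported a hard failure on the circuit.
        return "error"
    # An empty read means the peer closed the circuit in an orderly way.
    return "closed" if data == b"" else "data"


if __name__ == "__main__":
    client, server = socket.socketpair()
    print(probe(client, 0.1))  # server end is open but has sent nothing
    server.close()
    print(probe(client, 0.1))  # server end has been closed
    client.close()
```

Only the "closed" and "error" outcomes give the client a definite signal; the "silent" case has to be resolved by an application-level timeout.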

Changed in epics-base:
status:	Triaged → New

Andrew Johnson (anj) on 2013-11-20

Changed in epics-base:
status:	New → Incomplete
importance:	High → Low

mdavidsaver (mdavidsaver) on 2015-12-03

tags:	added: cleanup
Changed in epics-base:
status:	Incomplete → Invalid
