log client uses CPU attempting to connect if there isn't a default route

Bug #541302 reported by Jeff Hill
This bug affects 1 person
Affects: EPICS Base
Status: Fix Released
Importance: Medium
Assigned to: Jeff Hill

Bug Description

Reported by Gaspar Jansa:

If there is no default route (presumably, if no routing option is available on the local host), the log client can get into a tight loop attempting to connect and use too much CPU.

Additional information:

The CPU usage on the machine then pegs at 100%.

However, with a default route installed (a destination of 0.0.0.0 listed in the output from 'netstat -nr') I get this instead:

> epics> epicsEnvSet EPICS_IOC_LOG_INET 192.168.123.45
> epics> iocLogInit
> log client: unable to connect to "192.168.123.45:7004" for 2.0 seconds
> epics> iocLogShow
> log client: disconnected from log server at "192.168.123.45:7004"

After a while I also get this message:

> log client: unable to connect to "192.168.123.45:7004" because 110="Connection timed out"

The machine CPU usage remains normal in this case.

I would therefore recommend that you set a default route on your IOC machine as a workaround until we get a fix into Base.

The iocLogShow command I used above is not callable from iocsh in any released version of Base; I just added the relevant iocsh table entries that make it callable and committed the changes to CVS.

Do any of the other EPICS core developers claim ownership of the src/libCom/logClient/logClient.c code, which is where the problem lies?

The 101 value above would appear to be ENETUNREACH, which is not mentioned in any osdSock.h file and is not expected by the code in logClientConnect(). I might expect similar problems to appear if someone specifies an unreachable IP address in any of the network configuration settings, such as EPICS_CA_ADDR_LIST, but I haven't tried them to confirm that.
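
As an illustration of the point about unexpected errno values, here is a minimal sketch in C of a connect-error classifier whose default branch is a delayed retry, so that a code nobody anticipated (such as ENETUNREACH) cannot cause an immediate re-attempt. This is not the code in logClientConnect(); the enum and the function name classify_connect_errno are hypothetical, and only the standard errno constants are real.

```c
#include <assert.h>
#include <errno.h>

/* Hypothetical outcome categories; these names are not from EPICS Base. */
enum connect_action {
    CONNECT_DONE,        /* connection established */
    CONNECT_WAIT,        /* attempt still in progress, check again later */
    CONNECT_RETRY_LATER  /* attempt failed, sleep before trying again */
};

/*
 * Map the errno left by a connect() call to an action.
 * Pass 0 when connect() succeeded.
 */
enum connect_action classify_connect_errno(int err)
{
    assert(err >= 0); /* errno values are never negative */

    switch (err) {
    case 0:
        return CONNECT_DONE;
    case EINPROGRESS:
    case EALREADY:
    case EINTR:
        /* The connection attempt continues asynchronously. */
        return CONNECT_WAIT;
    case ENETUNREACH:   /* no route to the network, e.g. no default route */
    case EHOSTUNREACH:
    case ECONNREFUSED:
    case ETIMEDOUT:
    default:
        /* Unknown codes are also treated as "retry after a delay",
         * never as "retry immediately". */
        return CONNECT_RETRY_LATER;
    }
}
```

The explicit cases are listed only for documentation; the design choice that matters is that the fallback path always includes a delay.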

Original Mantis Bug: mantis-269
    http://www.aps.anl.gov/epics/mantis/view_bug_page.php?f_id=269

Tags: libcom 3.14
Jeff Hill (johill-lanl) wrote :

After closer inspection of CVS, there have been some large changes recently from Ben which appear to have caused this problem. I believe that the root cause is a race condition: if the circuit connect fails, it sets a semaphore that immediately wakes up the log server's thread that is responsible for reconnecting, and this can get into a CPU-consuming loop.
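
One general way to keep a reconnect thread from spinning when it is woken immediately after a failed attempt is to enforce a minimum spacing between attempts. The sketch below, in C, shows only that idea; it is not the fix that went into R3.14.9, and the names reconnect_wait and MIN_RETRY_INTERVAL, as well as the 5 second value, are assumptions chosen for illustration.

```c
#include <assert.h>

/* Hypothetical minimum spacing between connection attempts, in seconds. */
#define MIN_RETRY_INTERVAL 5.0

/*
 * Return how many seconds the reconnect thread should still sleep
 * before its next attempt, given the current time and the time of the
 * previous attempt (both in seconds on the same monotonic clock).
 * A wakeup that arrives right after a failure yields a full interval,
 * so a semaphore signalled by the failure itself cannot cause a busy loop.
 */
double reconnect_wait(double now, double last_attempt)
{
    double elapsed;

    assert(now >= last_attempt); /* the clock must not run backwards */

    elapsed = now - last_attempt;
    if (elapsed >= MIN_RETRY_INTERVAL) {
        return 0.0;
    }
    return MIN_RETRY_INTERVAL - elapsed;
}
```

A reconnect thread would call this after each wakeup and sleep for the returned time before trying to connect again.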

Jeff Hill (johill-lanl) wrote :

Fixed in R3.14.9.

Andrew Johnson (anj) wrote :

R3.14.9 Released.
