Race condition on IOC start leaves rsrv unresponsive

Bug #1862328 reported by Ralph Lange on 2020-02-07
Status tracked in 7.0

Bug Description

We have recently been seeing an IOC that occasionally appears to boot fine and does not print the usual "cas warning: Configured TCP port was unavailable. [...]" messages that indicate it is not the first IOC on the host, but then emits

CAS: Listen error: Address already in use
Thread CAS-TCP (0x3337a20) suspended

and becomes unresponsive to CA.

This is a race condition: using a script that starts two IOCs in parallel, we see the effect in about 1 out of 50 runs.

When rsrv starts, there is a window between checking the port with bind() and gaining exclusive access with listen(), the point from which further checks by other IOCs fail. It turns out that only an active listening socket prevents another bind() when SO_REUSEADDR is set on the sockets. From the socket(7) manpage:

Indicates that the rules used in validating addresses supplied in a bind(2) call should allow reuse of local addresses. For AF_INET sockets this means that a socket may bind, except when there is an active listening socket bound to the address.
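
The rule quoted above can be checked in a single process. The following is a minimal sketch, assuming Linux semantics as described in socket(7) (other systems may behave differently); the helper names bind_reuse() and try_second_bind() are made up for this illustration and are not part of rsrv.

```c
#include <errno.h>
#include <netinet/in.h>
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>

/* Hypothetical helper: create a TCP socket with SO_REUSEADDR set and bind
 * it to 127.0.0.1:port (port 0 lets the OS choose). Stores the port that
 * was actually bound in *bound. Returns the descriptor, or -1 with errno
 * set to the reason of the failure. */
static int bind_reuse(unsigned short port, unsigned short *bound)
{
    struct sockaddr_in addr;
    socklen_t len = sizeof(addr);
    int yes = 1;
    int err;
    int fd = socket(AF_INET, SOCK_STREAM, 0);

    if (fd < 0)
        return -1;
    memset(&addr, 0, sizeof(addr));
    addr.sin_family = AF_INET;
    addr.sin_addr.s_addr = htonl(INADDR_LOOPBACK);
    addr.sin_port = htons(port);
    if (setsockopt(fd, SOL_SOCKET, SO_REUSEADDR, &yes, sizeof(yes)) < 0 ||
        bind(fd, (struct sockaddr *)&addr, sizeof(addr)) < 0 ||
        getsockname(fd, (struct sockaddr *)&addr, &len) < 0) {
        err = errno;            /* keep the original error across close() */
        close(fd);
        errno = err;
        return -1;
    }
    *bound = ntohs(addr.sin_port);
    return fd;
}

/* Bind a first socket, optionally put it into the listening state, then
 * try to bind a second socket to the same address and port.
 * Returns 0 if the second bind() succeeded, the errno value if it failed,
 * or -1 if the first socket could not be set up. */
int try_second_bind(int first_is_listening)
{
    unsigned short port = 0, unused = 0;
    int result;
    int second;
    int first = bind_reuse(0, &port);

    if (first < 0)
        return -1;
    if (first_is_listening && listen(first, 5) < 0) {
        close(first);
        return -1;
    }
    second = bind_reuse(port, &unused);
    result = (second < 0) ? errno : 0;
    if (second >= 0)
        close(second);
    close(first);
    return result;
}
```

Under the stated assumption, try_second_bind(0) reports success because the first socket is bound but not yet listening, while try_second_bind(1) reports EADDRINUSE.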

If the second IOC calls bind() before the first IOC has called listen(), the second bind() will succeed and the second IOC will fail later, when it calls listen() itself. Currently it decides to go deaf (suspend the receiving thread) at that point, but it really should go back to the phase of testing bind() instead.
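
As an illustration of the recovery suggested here, the sketch below treats EADDRINUSE from listen() as a lost race and falls back to a port assigned by the OS instead of giving up. This is not the code from the actual fix: the function names bind_reuse() and listen_or_rebind() are hypothetical, the fallback to port 0 is an assumption about what "testing bind() again" would lead to, and Linux semantics are assumed.

```c
#include <errno.h>
#include <netinet/in.h>
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>

/* Hypothetical helper: create a TCP socket with SO_REUSEADDR set and bind
 * it to 127.0.0.1:port (port 0 lets the OS choose). Stores the port that
 * was actually bound in *bound. Returns the descriptor, or -1 with errno
 * set to the reason of the failure. */
static int bind_reuse(unsigned short port, unsigned short *bound)
{
    struct sockaddr_in addr;
    socklen_t len = sizeof(addr);
    int yes = 1;
    int err;
    int fd = socket(AF_INET, SOCK_STREAM, 0);

    if (fd < 0)
        return -1;
    memset(&addr, 0, sizeof(addr));
    addr.sin_family = AF_INET;
    addr.sin_addr.s_addr = htonl(INADDR_LOOPBACK);
    addr.sin_port = htons(port);
    if (setsockopt(fd, SOL_SOCKET, SO_REUSEADDR, &yes, sizeof(yes)) < 0 ||
        bind(fd, (struct sockaddr *)&addr, sizeof(addr)) < 0 ||
        getsockname(fd, (struct sockaddr *)&addr, &len) < 0) {
        err = errno;            /* keep the original error across close() */
        close(fd);
        errno = err;
        return -1;
    }
    *bound = ntohs(addr.sin_port);
    return fd;
}

/* Put an already bound socket into the listening state. If listen() fails
 * with EADDRINUSE (another process started listening on the same port
 * after our bind() succeeded), close the socket and listen on a fresh
 * socket bound to an OS-assigned port instead.
 * On entry *port is the port fd is bound to; on success *port is the port
 * of the returned listening socket. Returns the listening descriptor, or
 * -1 on failure (fd is closed in that case). */
int listen_or_rebind(int fd, int backlog, unsigned short *port)
{
    if (listen(fd, backlog) == 0)
        return fd;              /* we won the race, *port is unchanged */

    if (errno != EADDRINUSE) {
        close(fd);
        return -1;              /* unrelated error, do not retry */
    }

    /* Lost the race: go back to the bind() phase with a dynamic port. */
    close(fd);
    fd = bind_reuse(0, port);
    if (fd < 0)
        return -1;
    if (listen(fd, backlog) < 0) {
        close(fd);
        return -1;
    }
    return fd;
}
```

A caller would bind to the configured port with bind_reuse(), pass the descriptor to listen_or_rebind(), and report a changed port in the same way as when the initial bind() fails.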

See also https://epics.anl.gov/core-talk/2020/msg00110.php

Ralph Lange (ralph-lange) wrote :

Fix prepared during the 2020 Codeathon by Bryan Tester (thanks!), committed as 4844fbb to the 3.15 branch.

Changed in epics-base:
milestone: none → 3.15.8