Race condition on IOC start leaves rsrv unresponsive
Affects | Status | Importance | Assigned to | Milestone
---|---|---|---|---
EPICS Base | Status tracked in 7.0 | | |
3.15 | Fix Released | Medium | Unassigned |
7.0 | Fix Released | Medium | Unassigned |
Bug Description
Lately we have been seeing an IOC that occasionally appears to boot fine and does not print the usual "cas warning: Configured TCP port was unavailable. [...]" messages that show up when it is not the first IOC on the host, but then emits
CAS: Listen error: Address already in use
Thread CAS-TCP (0x3337a20) suspended
and stops responding to CA requests.
This is a race condition: with a script that starts two IOCs in parallel, we can reproduce the effect in about 1 out of 50 runs.
When rsrv starts, there is a window between checking that the port is available and actually getting exclusive access to it, i.e. the point from which checks by other processes fail. It turns out that only an actively listening socket prevents another bind() when SO_REUSEADDR is set on the sockets. From the socket(7) manpage:
SO_REUSEADDR
Indicates that the rules used in validating addresses supplied in a bind(2) call should allow reuse of local addresses. For AF_INET sockets this means that a socket may bind, except when there is an active listening socket bound to the address.
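The quoted rule can be checked with two sockets in a single process. This is only a minimal sketch, assuming Linux semantics for SO_REUSEADDR; the helper names (reuseaddr_socket, provoke_listen_conflict) are made up for illustration and are not part of rsrv. On Linux the second bind() is expected to succeed and the error to show up only at the second listen(); other platforms may already refuse the second bind().

```python
import errno
import socket


def reuseaddr_socket():
    """Create a TCP socket with SO_REUSEADDR set, as described in the report."""
    s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    s.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
    return s


def provoke_listen_conflict(host="127.0.0.1"):
    """Return (stage, errno) of the second socket's failure, or (None, None)."""
    first = reuseaddr_socket()
    second = reuseaddr_socket()
    try:
        first.bind((host, 0))          # let the OS pick a free port
        port = first.getsockname()[1]
        try:
            second.bind((host, port))  # nobody is listening on the port yet
        except OSError as exc:
            return ("bind", exc.errno)
        first.listen(5)                # the first socket starts listening
        try:
            second.listen(5)           # the second socket now conflicts
        except OSError as exc:
            return ("listen", exc.errno)
        return (None, None)
    finally:
        first.close()
        second.close()


if __name__ == "__main__":
    stage, err = provoke_listen_conflict()
    print(stage, errno.errorcode.get(err, err))
```

Doing both steps in one process removes the timing dependency, so the conflict shows up on every run instead of only occasionally.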
If the second IOC calls bind() before the first IOC has called listen(), the second bind() succeeds and the second IOC fails later, when it calls listen(). Currently it reacts by going deaf (suspending the receiving thread) at that point, but it really should go back to the bind() testing phase instead.
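The idea of going back to the bind() phase can be sketched as follows. This is not the rsrv code and not the committed fix, just an illustration under assumptions: open_listener and listen_with_fallback are hypothetical names, and falling back to an OS-assigned port (port 0) is assumed to be what "Configured TCP port was unavailable" leads to. The point is that EADDRINUSE from listen() is handled the same way as EADDRINUSE from bind().

```python
import errno
import socket


def open_listener(host, port, backlog):
    """bind() and listen() on (host, port); close the socket on any failure."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
    try:
        sock.bind((host, port))
        sock.listen(backlog)  # may fail if another process started listening first
    except OSError:
        sock.close()
        raise
    return sock


def listen_with_fallback(host, preferred_port, backlog=20):
    """Return (sock, port), retrying on an OS-assigned port if needed."""
    try:
        sock = open_listener(host, preferred_port, backlog)
    except OSError as exc:
        if exc.errno != errno.EADDRINUSE:
            raise
        # Port taken at bind() time, or race lost at listen() time:
        # start over with a port chosen by the OS instead of giving up.
        sock = open_listener(host, 0, backlog)
    return sock, sock.getsockname()[1]
```

A caller would use the returned port number for whatever announcement it makes afterwards, since it can differ from the configured one.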
Fix prepared during the 2020 Codeathon by Bryan Tester (thanks!), committed as 4844fbb to the 3.15 branch.