IOC server binding to single IP not able to receive broadcasts

Bug #1466129 reported by Ralph Lange
Affects      Status         Importance   Assigned to   Milestone
EPICS Base   Fix Released   Medium       mdavidsaver
3.15         Fix Released   Undecided    Unassigned
3.16         Fix Released   Medium       mdavidsaver

Bug Description

An IOC CA server (rsrv) that binds to a single IP address using the EPICS_CAS_INTF_ADDR_LIST environment variable (feature added in revision 12508) is not able to receive broadcast UDP name resolution messages sent to that IP address / interface.

Only unicast messages directed to the specific IP address will be received.

CA clients that want to connect to such an IOC have to set EPICS_CA_ADDR_LIST to the specific IP address of the IOC to connect.

Code in the CAS C++ server (casDGIntfIO.cc, lines 200 ff.) seems to suggest that the CA server has to separately bind to the broadcast address when it binds to a specific IP address.

This issue does not apply to Windows, where a socket binding to a specific IP address also receives broadcast traffic for that address.

Tags: rsrv

mdavidsaver (mdavidsaver) wrote :

This seems to be a design feature of BSD sockets which I hadn't come across before. Broadcasts are only received by sockets bound to INADDR_ANY and/or when SO_BINDTODEVICE is set (a privileged operation).

As I see it, instead of binding the socket to x.x.x.x, the same effect can be achieved by binding INADDR_ANY and checking the source address of received packets. This does require that the netmask of the interface be known.

Ralph Lange (ralph-lange) wrote :

Should read "... and checking the destination address of received packets."

Agreed. That is simpler than running another thread for receiving broadcasts or doing the FD handling (like CAS).

Ralph Lange (ralph-lange) wrote :

No way.

When receiving messages through the socket, you can only find out the source address. There is no way to get the destination address of incoming messages at that level.

You can bind a socket to a physical interface device, but that needs root rights and would collect much more traffic that would have to be filtered manually.

There is no way around having two sockets, one bound to the specific IP, one bound to the broadcast address of the network of that IP.
For rsrv, there are basically two options to handle this:
1. Do socket multiplexing using poll() or select() - as CAS does.
2. Start a separate UDP receiver thread for each socket.

Opinions?
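Option 1 above can be sketched as follows. This is a minimal Python illustration of multiplexing several UDP sockets with select(), not the rsrv or CAS code; the helper names `open_udp` and `receive_ready` are hypothetical. In a real server one socket would be bound to the node IP and the other to the subnet broadcast address.

```python
import select
import socket


def open_udp(addr, port=0):
    """Create a UDP socket bound to (addr, port); port 0 picks a free port."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    sock.bind((addr, port))
    return sock


def receive_ready(socks, timeout=1.0):
    """Wait on all sockets at once; return [(local_addr, payload, sender)]."""
    readable, _, _ = select.select(socks, [], [], timeout)
    results = []
    for sock in readable:
        payload, sender = sock.recvfrom(1024)
        results.append((sock.getsockname(), payload, sender))
    return results
```

A single thread calling `receive_ready([unicast_sock, broadcast_sock])` in a loop sees traffic for both addresses, at the cost of depending on select(), whose portability is exactly what option 2 avoids.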

Andrew Johnson (anj) wrote :

The issue is only with the UDP socket as implemented in cast_server.c, isn't it? Jeff moved away from doing the poll()/select() thing in the 3.14 version of libca, and I do remember having portability issues with select() when we used it. Whichever we pick, this should let us support including multiple interfaces in EPICS_CAS_INTF_ADDR_LIST. I suspect it's going to be conceptually simpler to have a separate thread for each socket.

Hmm, I just tried setting EPICS_CAS_INTF_ADDR_LIST to the broadcast address of my subnet and found I couldn't connect, does that work for anyone else?

Andrew Johnson (anj) wrote :

When the IOC is listening with EPICS_CAS_INTF_ADDR_LIST set to my broadcast address the client immediately reports

tux% caget -w 10 anj:exit
CAC: Unable to connect because "Connection refused"
CA.Client.Exception...............................................
    Warning: "Virtual circuit disconnect"
    Context: "tux.aps.anl.gov:5064"
    Source File: ../cac.cpp line 1215
    Current Time: Thu Jun 18 2015 11:04:51.085505655
..................................................................

Thus it tries to connect but fails. Hmm, we have 2 sockets involved here, and we're currently binding both the UDP and TCP socket to the same address, but it's probably not kosher to bind a TCP socket to a broadcast address...

I just did an experiment: In cast_server.c I changed the call to ellFirst(&casIntfAddrList) to make it ellLast(&casIntfAddrList) and set EPICS_CAS_INTF_ADDR_LIST to my node address followed by my broadcast address. The IOC prints the warning at startup, but it now binds the TCP socket to the node address and the UDP socket to the broadcast address. With this configuration my clients can connect to the IOC.

On Linux at least we do not have to set the SO_BROADCAST option to receive broadcast packets on a socket bound to the broadcast address — I was talking about that issue with Ralph before Michael connected to the Hangout, but it seems we don't need to worry about that.

Conclusion: We still need to be listening to multiple UDP sockets, and we also have to make sure that we bind the TCP socket to the node address, not to the subnet broadcast address.

Do we currently have a way to connect or derive those addresses from each other?
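For reference, deriving the broadcast address from the node address is simple arithmetic once the netmask is known. The sketch below uses Python's standard `ipaddress` module and is not the EPICS API; the function names and the example addresses are made up for illustration.

```python
import ipaddress


def broadcast_for(node_ip, netmask):
    """Return the subnet broadcast address for a node address and netmask."""
    iface = ipaddress.IPv4Interface(f"{node_ip}/{netmask}")
    return str(iface.network.broadcast_address)


def is_broadcast_of(candidate, node_ip, netmask):
    """True if candidate is the broadcast address of node_ip's subnet."""
    return candidate == broadcast_for(node_ip, netmask)
```

Going the other way (broadcast address to node address) is not arithmetic: it requires looking up which local interface owns that broadcast address.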

mdavidsaver (mdavidsaver) wrote :

So the trouble with this approach is that it doesn't allow unicast search messages from hosts not in a white-listed subnet.

Ralph Lange (ralph-lange) wrote :

For that you could add the iptables trick that converts all incoming unicast packets to have a broadcast destination address.
But if we want to support multiple entries in EPICS_CAS_INTF_ADDR_LIST, we are back to the problem of having to bind to multiple sockets, so a general solution has advantages.

Andrew Johnson (anj) wrote :

I'm getting confused by conversations happening in two places at once. Can we stick to using Launchpad for discussion of this bug, please?

I do not like the idea Michael posed in email and code on github of using something like an ACL on the IOC. I think we need to do the multiple socket thing, for which the API is the existing EPICS_CAS_INTF_ADDR_LIST variable, as long as we can detect and/or convert between broadcast and node IP addresses, since the user might put either or even both in the list (we don't have to accept either/both [but it would be better if we can], but we would need to be able to detect the other and abort).

mdavidsaver (mdavidsaver) wrote :

I like that the comments page doesn't refresh when I post a comment, very web 2.x. The fact that it doesn't pull down new comments w/o a refresh, not so much. Then again, I don't really need to read what you guys are saying. It's not like I'm going to spend a day going down the wrong path...

mdavidsaver (mdavidsaver) wrote :

> I do not like the idea Michael posed in email and code on github of using something like an ACL on the IOC. I think we need to do the multiple socket thing

Agreed, please disregard my previous as I completely missed an important piece of this problem.

mdavidsaver (mdavidsaver) wrote :

I think I have something testable.

When binding to a specific interface, two CAS-UDP threads are started. Each receives from a different socket, but both send replies through one socket (the one not bound to the broadcast address). I think this should be safe.

Also, the globals IOC_cast_sock and prsrv_cast_client are eliminated. Each client creates its own 'struct client*', and receiving socket. The socket created by the first cast_server to start is then used by the second to send replies.
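The arrangement described above (one receiver thread per socket, all replies leaving through the unicast-bound socket) can be sketched like this. It is a Python illustration of the idea only, not the rsrv implementation; `serve` and `start_receivers` are hypothetical names.

```python
import socket
import threading


def serve(recv_sock, reply_sock, stop):
    """Receive datagrams on recv_sock; send every reply through reply_sock."""
    recv_sock.settimeout(0.2)  # wake up regularly to check the stop flag
    while not stop.is_set():
        try:
            payload, sender = recv_sock.recvfrom(1024)
        except socket.timeout:
            continue
        reply_sock.sendto(b"reply:" + payload, sender)


def start_receivers(recv_socks, reply_sock):
    """Start one daemon thread per receiving socket; return (stop, threads)."""
    stop = threading.Event()
    threads = [
        threading.Thread(target=serve, args=(s, reply_sock, stop), daemon=True)
        for s in recv_socks
    ]
    for thread in threads:
        thread.start()
    return stop, threads
```

Because every reply is sent from the unicast-bound socket, clients always see the node address as the source, regardless of whether their request arrived as a unicast or a broadcast.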

As a TODO, the reporting on the UDP client(s) is commented out in casr() at the moment.

Changed in epics-base:
status: New → Confirmed
assignee: nobody → mdavidsaver (mdavidsaver)
importance: Undecided → Medium
Ralph Lange (ralph-lange) wrote :

This looks quite promising.
I have tested on a VM with Linux 64bit (Debian Testing), two (virtual) network interfaces plus localhost.

ok - Everything normal when not setting EPICS_CAS_INTF_ADDR_LIST.
ok - Setting EPICS_CAS_INTF_ADDR_LIST to one existing IP does work as expected.

fail - Setting EPICS_CAS_INTF_ADDR_LIST to 127.0.0.1 does not create the listener for the loopback broadcast address, IOC does not receive requests to the loopback broadcast address [1]
fail - Setting EPICS_CAS_INTF_ADDR_LIST to a broadcast address prints an irritating error message ("CAS UDP: failed to find interface broadcast address") and creates a TCP listener on the broadcast address. This results in "Virtual circuit disconnect" exceptions on the client, which is able to resolve the name but not to connect.

enh - Why not remove the limitation to the first IP address in EPICS_CAS_INTF_ADDR_LIST and allow binding to multiple addresses, creating a thread pair for each?

[1] Even though the loopback interface does not have the BROADCAST flag set, using 127.255.255.255 works on recent Linux. It does not work on OS X (according to Stack Overflow).

mdavidsaver (mdavidsaver) wrote :

Looking at this again I have some questions about what should be allowed in EPICS_CAS_INTF_ADDR_LIST, and how localhost should be handled.

> fail - Setting EPICS_CAS_INTF_ADDR_LIST to 127.0.0.1 does not create the listener for the loopback broadcast address

As you mention, this is because loopback does not usually set IFF_BROADCAST, although it can. Of course the fact that osiSockDiscoverBroadcastAddresses() explicitly ignores interfaces with IFF_LOOPBACK set seems like the real problem.

I've made a change which addresses this, but it needs testing on all non-Linux targets except win32.

https://github.com/mdavidsaver/epics-base/commit/c5783e3af454bad32a98707f7431c27be13c3890

> fail - Setting EPICS_CAS_INTF_ADDR_LIST to a broadcast address

Should this be allowed? What should it mean? My inclination is to say no.

> enh - Why not remove the limitation to the first IP address

One thing at a time :)

Ralph Lange (ralph-lange) wrote :

>> fail - Setting EPICS_CAS_INTF_ADDR_LIST to a broadcast address

> Should this be allowed? What should it mean? My inclination is to say no.

Of course not.
But it should print a meaningful failure message, and ignore this entry (instead of creating a thread with a weird non-working configuration).
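One way to picture the requested behaviour is sketched below in Python: broadcast entries are skipped with a message naming the node address to use instead. This is an illustration only, not what rsrv does; `filter_intf_list` and its `interfaces` argument (a list of "address/prefix" strings for the local NICs) are hypothetical.

```python
import ipaddress


def filter_intf_list(entries, interfaces):
    """Split list entries into usable node addresses and warning messages."""
    nodes = {}
    bcasts = {}
    for spec in interfaces:
        iface = ipaddress.IPv4Interface(spec)
        nodes[str(iface.ip)] = iface
        bcasts[str(iface.network.broadcast_address)] = iface

    accepted, warnings = [], []
    for entry in entries:
        if entry in nodes:
            accepted.append(entry)
        elif entry in bcasts:
            owner = bcasts[entry]
            warnings.append(
                f"ignoring {entry}: broadcast address of {owner.network}; "
                f"use node address {owner.ip} instead"
            )
        else:
            warnings.append(f"ignoring {entry}: not a local interface address")
    return accepted, warnings
```

No listener thread is created for a rejected entry, so the server never ends up with a TCP socket bound to a broadcast address.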

>> enh - Why not remove the limitation to the first IP address

> One thing at a time :)

Sure.
But adding this would make the rsrv server work the same as CAS, which was one of the original intentions.

It is also the last show stopper for a CA Gateway implementation based on rsrv, which would not only be possibly faster and superior to the existing Gateway, but would also remove the last reason for keeping CAS and gdd in base.

Ralph Lange (ralph-lange) wrote :

I have re-tested your branch.

Good news: On Linux, everything seems to be working fine. Multiple selected NIFs are supported.

On my laptop under Cygwin, I can no longer run the IOC without the EPICS_CAS_INTF_ADDR_LIST setting.
The regular 3.16 tip prints some messages but yields a running IOC:

$ ./bin/cygwin-x86_64/softIoc.exe -d ~/Work/test.db
Starting iocInit
############################################################################
## EPICS R3.16.0-DEV $$Date$$
## EPICS Base built Sep 25 2015
############################################################################
iocRun: All initialization complete
../../../src/ioc/rsrv/online_notify.c: CA beacon routing (connect to "169.254.255.255:5065") error was "Network is unreachable"
../../../src/ioc/rsrv/online_notify.c: CA beacon routing (connect to "169.254.255.255:5065") error was "Network is unreachable"
../../../src/ioc/rsrv/online_notify.c: CA beacon routing (connect to "169.254.255.255:5065") error was "Network is unreachable"
../../../src/ioc/rsrv/online_notify.c: CA beacon routing (connect to "169.254.255.255:5065") error was "Network is unreachable"

The rsrvbindiface branch gets:

$ ./bin/cygwin-x86_64/softIoc.exe -d ~/Work/test.db
Starting iocInit
############################################################################
## EPICS R3.16.0-DEV $$Date$$
## EPICS Base built Sep 25 2015
############################################################################
../../../src/ioc/rsrv/caservertask.c: CA beacon routing (connect to "169.254.255.255:0") error was "Network is unreachable"
Thread _main_ (0x600010590) can't proceed, suspending.

The rsrvbindiface branch does not compile using the Microsoft compiler:

[...]
"Installing created executable ../../../bin/windows-x64/podToHtml.pl"
"Installing created executable ../../../bin/windows-x64/podRemove.pl"
"Installing created executable ../../../bin/windows-x64/registerRecordDeviceDriver.pl"
"Installing html ../../../html/./style.css"
mkdir ../../../html
mkdir ../../../html/.
perl -CSD ../../../src/tools/podToHtml.pl -s -o Getopts.html ../EPICS/Getopts.pm
Can't open ../O.Common/EPICS/: No such file or directory.
"Installing generated html ../../../html/./EPICS/Getopts.html"
installEpics.pl: No such file '../O.Common/EPICS/Getopts.html' at ../../../src/tools/installEpics.pl line 53.
mkdir ../../../html/./EPICS
mingw32-make[3]: *** [../../../html/./EPICS/Getopts.html] Error 2

Might be solved with a rebase??!

I can't run cross machine tests on my Windows laptop - not authorized to change the firewall settings.

mdavidsaver (mdavidsaver) wrote :

There is one outstanding issue with this branch on Linux. Beacons aren't sent properly when binding to INADDR_ANY. This will need a special case.

wrt. cygwin, I know the error handling needs to be less fatal :) There are quite a few places where rsrv_init() can fail, and I wasn't sure how to handle them all and left cantProceed()s as placeholders. In this case connect()ing the UDP socket for beacon broadcasts has failed. Should this interface then be skipped altogether, or should setup continue w/o the beacon socket? My inclination is to skip interfaces which can't be fully set up.

wrt. msvc. A rebase isn't difficult.

mdavidsaver (mdavidsaver) wrote :

Also, the most recent commit on the rsrvbindiface branch is a fairly large rework of the RSRV initialization sequence. It consolidates socket and thread creation into rsrv_init(), which I think gives a clearer picture of what all is being done.

It also changes what sockets/threads are created. For each interface it creates 3-4 sockets: UDP beacon sender, UDP name receiver (maybe two), and TCP listener. I'm probably going to change this to decouple the beacon socket.

Each UDP receiver and TCP listener socket gets a thread. However, they all still share the global clientQ and clientQlock as I don't have a good reason to make them independent.

casr() is expanded to print some information about each listener.

Ralph Lange (ralph-lange) wrote :

I noted and like the casr() expansion a *lot*.
I also found the thread/socket creation situation a lot clearer now than it was before.

How much time will you need to complete the implementation?

This fix is the one reason why I will package 3.15.3 (to go into CCSv5.2), so I am depending on this.
My schedule was to package -pre1 next week, but I can slip this, as I still have some leeway.

Ralph Lange (ralph-lange) wrote :

Oh, yes. 3.15 is my target for the fix.

Maybe split the simple fix (3.15) from the later rewrites (later 3.15 or 3.16)?

My use case is binding to a single NIF. This is what I need now. Everything else - nice, but not mandatory.

mdavidsaver (mdavidsaver) wrote :

> I know the error handling needs to be less fatal :)

Errors should be less fatal now.

> I'm probably going to change this to decouple the beacon socket.

Beacons are now decoupled. sendto() is used with one socket instead of many connect()'d sockets.

I also think I have the handling of parameters as it was.
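The decoupled beacon sender can be pictured with the following Python sketch: one unconnected UDP socket, one sendto() per destination, and a per-destination error that is reported rather than fatal. It illustrates the technique only and is not the rsrv code; `send_beacons` is a hypothetical name.

```python
import socket


def send_beacons(sock, payload, destinations):
    """Send payload to each (host, port); return [(dest, error_or_None)]."""
    results = []
    for dest in destinations:
        try:
            sock.sendto(payload, dest)
            results.append((dest, None))
        except OSError as exc:
            # e.g. "Network is unreachable": note it and keep going
            results.append((dest, exc))
    return results
```

To send to real broadcast addresses the socket would also need the SO_BROADCAST option set; the sketch leaves socket creation to the caller.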

Ralph Lange (ralph-lange) wrote :

Excellent. Will test tomorrow. (As 30 seconds are not enough...)

Ralph Lange (ralph-lange) wrote :

Situation for 3.15.3:

The issue has been fixed as part of Michael's rsrvbindiface branch.
However, the changes on that branch are not tested thoroughly enough on all target platforms to go into 3.15.3.

I pulled the minimal changes to fix this issue into a patch file that I am attaching to this bug - this will also be available from the download sites. This is a patch tested on Linux only, and should be used with care - especially on other target systems.

We expect the complete rsrvbindiface branch to be tested and merged before 3.15.4.

Ralph Lange (ralph-lange) wrote :

Patch to fix bug #1466129 - only tested on Linux 64bit.

Changed in epics-base:
status: Confirmed → In Progress
Andrew Johnson (anj) wrote :

I edited this patch file slightly before publishing it from the 3.15.3 Known Problems page since our standard patch instructions tell the user to run "patch -p0 file.patch" which doesn't work with git patches that have the leading a/ and b/ on the pathnames.
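The kind of edit involved can be sketched in Python as below. This is an illustration of stripping the git-style prefixes, not the procedure actually used; `strip_git_prefixes` is a hypothetical helper, and it only rewrites a `--- a/` line that is immediately followed by a `+++ b/` line so that ordinary diff content is left alone.

```python
def strip_git_prefixes(patch_text):
    """Drop the leading a/ and b/ from file header lines of a git patch."""
    lines = patch_text.splitlines(keepends=True)
    for i in range(len(lines) - 1):
        if lines[i].startswith("--- a/") and lines[i + 1].startswith("+++ b/"):
            lines[i] = "--- " + lines[i][len("--- a/"):]
            lines[i + 1] = "+++ " + lines[i + 1][len("+++ b/"):]
    return "".join(lines)
```

The alternative, for users who have the unmodified git patch, is to apply it with `-p1`, which tells patch to strip the first path component itself.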

Changed in epics-base:
milestone: none → 3.16.branch
Andrew Johnson (anj)
Changed in epics-base:
status: In Progress → Fix Committed