Broadcast address detection is unreliable on Linux

Bug #1910148 reported by Michael Ritzert
6
This bug affects 1 person
Affects Status Importance Assigned to Milestone
EPICS Base
Fix Released
Low
Michael Ritzert

Bug Description

I noticed that I have a Linux system where broadcasts are not sent to all configured broadcast addresses. The reason is easily identified: EPICS base still uses the SIOCGIFBRDADDR way to identify the addresses in osiSockDiscoverBroadcastAddresses. This fails in recent systems, when several IPs are added to one interface e.g. via "ip addr add" as NetworkManager likes to do:

2: eno1: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc fq_codel state UP group default qlen 1000
    link/ether 44:39:d4:9d:25:b6 brd ff:ff:ff:ff:ff:ff
    inet 10.49.126.61/24 brd 10.49.126.255 scope global dynamic noprefixroute eno1
       valid_lft 834648sec preferred_lft 834648sec
    inet 10.49.125.2/24 brd 10.49.125.255 scope global eno1
       valid_lft forever preferred_lft forever

In this case, only 10.49.126.255 is detected by EPICS (twice):

ifreqNext() pifreq 0x55ccbc0cc5f0, size 0x28, ifr 0x0x55ccbc0cc618
osiSockDiscoverBroadcastAddresses(): found IFACE: lo len: 0x10 current_ifreqsize: 0x28
osiSockDiscoverBroadcastAddresses(): net intf "lo" flags: 49
osiSockDiscoverBroadcastAddresses(): ignoring loopback interface: "lo"
ifreqNext() pifreq 0x55ccbc0cc618, size 0x28, ifr 0x0x55ccbc0cc640
osiSockDiscoverBroadcastAddresses(): found IFACE: eno1 len: 0x10 current_ifreqsize: 0x28
osiSockDiscoverBroadcastAddresses(): net intf "eno1" flags: 1043
found broadcast addr = a317eff
osiSockDiscoverBroadcastAddresses(): net intf "eno1" found
ifreqNext() pifreq 0x55ccbc0cc640, size 0x28, ifr 0x0x55ccbc0cc668
osiSockDiscoverBroadcastAddresses(): found IFACE: eno1 len: 0x10 current_ifreqsize: 0x28
osiSockDiscoverBroadcastAddresses(): net intf "eno1" flags: 1043
found broadcast addr = a317eff
osiSockDiscoverBroadcastAddresses(): net intf "eno1" found

The workaround here would be to put the second IP on another interface, i.e. eno1:1, but I believe the software shouldn't dictate that.

The more modern way to identify all broadcast addresses is via AF_NETLINK, and the code is not overly complex. I actually do have a first patch ready, see the attachment. Obviously, this code should not be in the generic directory, and it should probably only be selected when #ifdef AF_NETLINK to also consider older versions of Linux (I think kernel ≥ 2.2 (from 1999) should work, but the oldest kernel I actually tested with the netlink code was 3.2.102). So this is just an RFC for now. Please excuse the bad formatting. I wasn't sure if there is a style file available for clang-format or the likes and didn't bother to do it by hand for this first demo.

With the patch, the behavior is changed in one subtle way: Interfaces that are down are not ignored, because this information is not provided for RTM_GETADDR but for RTM_GETLINK. So I'd have to keep track of which interfaces are down in between these two calls. For other corner cases (loopback, PTP), the behavior should be identical as explained in the code.

Obviously, osiLocalAddrOnce also still uses the old interface, but since it should only find any address, this should be OK.

Revision history for this message
Michael Ritzert (michael-ritzert) wrote :
Revision history for this message
Andrew Johnson (anj) wrote :

@mdavidsaver please comment. Is there any overlap here with RTEMS-5?

I suspect we'd want anything like this to get fixed on the 3.15 branch as well, in which case we should target this bug to both the 3.15 and 7.0 series.

Revision history for this message
mdavidsaver (mdavidsaver) wrote :

fyi. the ioctl() interface is also used for portability with targets using the BSD network stack (including OSX and RTEMS) and vxWorks, so it can't be entirely removed in favor of the Linux specific AF_NETLINK. ioctl() is also the only option with RTEMS 4.x and vxWorks (as far as I know).

While I don't have an objection to adding usage of AF_NETLINK directly, my preference would be to use the getifaddrs() wrapper (cf. "man getifaddrs"). All of the libc's that I know of on Linux provide it, as well as OSX and RTEMS >=5 with the new network stack.

Changed in epics-base:
status: New → Triaged
importance: Undecided → Low
assignee: nobody → Michael Ritzert (michael-ritzert)
Revision history for this message
Martin Konrad (info-martin-konrad) wrote :

Unfortunately there doesn't seem to be a platform-independent way of obtaining all IP addresses/broadcast addresses for a network interface. C++'s Networking TS (a.k.a. Boost Asio) also seems to stay away from it. getifaddrs() sounds like the way to go on Linux and BSD. I feel like the additional abstraction is worth it.

Revision history for this message
Michael Ritzert (michael-ritzert) wrote :

OK, that makes the change even more straightforward.

I also tested this patch on FreeBSD.

Revision history for this message
mdavidsaver (mdavidsaver) wrote :

Do you mean your patch to be a proof of concept, or to be accepted as is? If the later, compatibility with the embedded RTOS targets (RTEMS 4.x and vxWorks) needs to be maintained. These targets do not provide getifaddrs(), and would fail to compile with getifaddrs.patch.

Revision history for this message
Michael Ritzert (michael-ritzert) wrote :

Defitively proof-of-concept at this stage. Now that it's cross-platform, I know even less how to correctly put in in the buildsystem. So I'll have to leave this to somebody who knows better. It's probably not going to be easy since there seems to be no #define that can be used to find out if getifaddrs is available,

Revision history for this message
Andrew Johnson (anj) wrote :

If all of our supported Linux targets provide this functionality then you could just move your modified version of this source file into the osi/os/Linux directory where all Linux builds will find it but our other targets will continue to use the original. You don't need to adjust the Makefile at all, the normal build will find it there. If it works on FreeBSD too then you should add a copy to the osi/os/freebsd directory as well, but note that we don't currently have anyone who tests EPICS on that target so that port isn't really supported.

Revision history for this message
mdavidsaver (mdavidsaver) wrote :

If you do feel like trying, I'd suggest starting a (draft) github PR [1]. This will trigger CI build+test for Linux, Mac, RTEMS, and Windows in various configurations. There is some test coverage in modules/libcom/test/osiSockTest.c.

Right now, the pre-processor recipe I would start with in .../os/default/osdNetIntf.c is:

#if !defined(__rtems__) && !defined(vxWorks)

This already excludes windows, which is handled by .../os/WIN32/osdNetIntf.c

The other strategy would be to rename the existing default/osdNetIntf.c as RTEMS/osdNetIntf.c, add your new implimentation as default/osdNetIntf.c, and finally add a stub vxWorks/osdNetIntf.c with (I think) '#include "../RTEMS/osdNetIntf.c"'.

[1] https://github.com/epics-base/epics-base

Revision history for this message
Michael Ritzert (michael-ritzert) wrote :

So that's two options:
a) a positive list of targets with the new code as proposed in #8.
b) a negative list of targets as proposed in #9.

Any preferences?

As far as I can tell, glibc 2.3 (2002) is the first version to include getifaddrs. Are older Linux platforms still of interest?

Revision history for this message
mdavidsaver (mdavidsaver) wrote :

I'll be happy if the end result avoids multiple ~identical copies of any osdNetIntf.c implementation.

So far we haven't stated a minimum glibc version. I would say don't worry about this now.

Revision history for this message
Andrew Johnson (anj) wrote :

I'll be happy if the end result avoids source files in osi/os/default containing multiple implementations of the same routines wrapped in #ifdef #else #endif switches.

One way to handle that might be to put the new code in a file osi/os/osdNetIfAdrs.c say (which will not itself be listed in any Makefile). Rename the existing osi/os/default/osdNetIntf.c to osi/os/default/osdNetIfConf.c say, then add stub osdNetIntf.c files to the more specific osi/os directories which just include their required implementation. Putting one copy in osi/os/posix might work for the Unix-like OSs instead of all the more specific ones provided they all support the newer API.

Is the patch complete? There is very similar code in the osiLocalAddrOnce() routine below osiSockDiscoverBroadcastAddresses() which hasn't been changed, should it be? The answer to this determines how much the two implementations differ; that routine might have to be pulled out into a separate file to satisfy Michael D's happiness...

I did just try building and running the libCom tests on my Mac (Mojave) with the second patch version applied and it passed everything, so code-wise the darwin-x86 target should be fine with it.

Revision history for this message
Michael Ritzert (michael-ritzert) wrote :

https://github.com/epics-base/epics-base/pull/98

I included the old code in the default directory, and the new code for platforms for which I could find out that getifaddrs should be available. This information could be wrong or additional includes could be required, so some additional checking is certainly required.

After fixing the stupid bug of having the new code in both files, at least the autobuild results are now looking good.

Also I couldn't easily identify a code path that leads to osiLocalAddr, so this function is completely untested.

One could also consider moving the old code back to default instead of just including it from there. For now it's only used in this one place.

Revision history for this message
Andrew Johnson (anj) wrote :

Thanks! The Appveyor (Windows) builds take something like 11 hours to finish although any failures now would probably not have anything to do with your changes, sometimes our tests just fail randomly there so don't worry if you see an email about that, we'll take a look.

osiLocalAddr() is used by the CA client library code, in modules/ca/src/client/udpiiu.cpp (I found that using git grep from the top of the tree).

I hope Michael D will take another and comment sometime, either here or on GitHub.

Changed in epics-base:
status: Triaged → In Progress
Revision history for this message
mdavidsaver (mdavidsaver) wrote :

> osiLocalAddr() is used by the CA client library code, in modules/ca/src/client/udpiiu.cpp

I checked and yes, this code can actually be called. It's reached from a (hard coded) 10 second timer, so eg. 'caget' with the default timeout won't try. Longer running CA clients will though.

Based on the (lengthy) comments in udpiiu.cpp and repeater.cpp this is a legacy behavior <= 3.13.0-b11 (CA proto version <= 8). We currently claim support for CA proto version >=4.

While I suspect that there would no great harm in removing this compatibility. Anyone running a 20+ year old *nix installation probably can't build recent Base anyway. I don't see it as priority though. We've removed support for older CA proto versions so far only for security reasons. Both the Base and CAJ repeater implementations appear to correctly reject non-local registrations though, so I don't think that applies.

Andrew Johnson (anj)
Changed in epics-base:
status: In Progress → Fix Released
To post a comment you must log in.
This report contains Public information  
Everyone can see this information.

Other bug subscribers

Remote bug watches

Bug watches keep track of this bug in other bug trackers.