VRF support to solve routing problems associated with multi-homing

Bug #1737428 reported by Dmitrii Shcherbakov
This bug affects 14 people
Affects | Status | Importance | Assigned to
Canonical Juju | Incomplete | Wishlist | Unassigned
MAAS | Invalid | Wishlist | Unassigned
linux (Ubuntu) | Incomplete | Wishlist | Unassigned
netplan.io (Ubuntu) | Fix Released | Medium | Unassigned

Bug Description

Problem description:

* a host is multi-homed if it has multiple network interfaces (physical or virtual) with L3 addresses configured; this is the natural state of OpenStack nodes, regardless of IPv4/IPv6, and of IPv6 hosts in general;

(see 3.3.4 Local Multihoming https://tools.ietf.org/html/rfc1122#page-60 and 3.3.4.2 Multihoming Requirements)

* if all hosts that need to participate in L3 communication are located on the same L2 network there is no need for a routing device to be present. ARP/NDP and auto-created directly connected routes are enough;

* multi-homing with hosts located on different L2 networks requires more intelligent routing:
  - "directly connected" routes are no longer enough to talk to all relevant hosts in the same network space;
  - a default gateway in the main routing table may not be the correct routing device that knows where to forward traffic (management network traffic goes to a management switch and router, other traffic goes to L3 ToR switch but may go via different bonds);
  - even if a default gateway knows where to forward traffic, it may not be the intended physical path (storage replication traffic must go through a specific outgoing interface, not the same interface as storage access traffic although both interfaces are connected to the same ToR);
  - there is no longer a single "default gateway": applications need either per-logical-direction routers or to become routers themselves (if destination == X, forward to next-hop Y). Leaf-spine architecture is a good example of how multiple L2 networks force you to use spaces whose VLANs live in different switch fabrics, resulting in one or more hops between hosts with interfaces associated with the same network space;
  - while network spaces implicitly require L3 reachability between each host that has a NIC associated with a network space, the current definition does not mention routing infrastructure required for that. For a single L2 this problem is hidden by directly connected routes, for multi-L2, no solution is provided or discussed;

* existing solutions to multi-homing require routing table management on a given host: complex static routing rules, dynamic routing (e.g. running an OSPF or BGP daemon on a host);

* using static routes is rigid and requires network planning (i.e. working with network engineers, who may have varying degrees of experience, doing VLSM planning, etc.);

* using dynamic routing requires a broader integration into an organization's L3 network infrastructure. Routing can be implemented differently across different organizations and it is a security and operational burden to integrate with a company's routing infrastructure.

Summary: a mechanism is needed to associate an interface with a forwarding table (FIB) that has its own default gateway, and to make an application with a listen(2)ing socket(2) return connected sockets associated with different FIBs. In other words, applications need to implicitly gain source/destination-based routing capabilities without having to use static routing schemes or dynamic routing, and with minimal or no modifications to the applications themselves.
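To illustrate the "no modifications" case, here is a minimal sketch (plain Python; loopback addresses stand in for real VRF-enslaved interfaces, so no VRF is actually involved): a service that binds to INADDR_ANY needs no code changes, and with net.ipv4.tcp_l3mdev_accept=1 this same wildcard-bound listener would accept connections arriving through any VRF, each accepted socket being implicitly scoped to the FIB its traffic came through.

```python
import socket
import threading

def serve_once(lsock):
    # Accept one connection and echo one message back.
    conn, _ = lsock.accept()
    with conn:
        conn.sendall(conn.recv(1024))

# Bind to INADDR_ANY (0.0.0.0); a VRF-unaware service like this needs no
# changes to work across VRFs once tcp_l3mdev_accept=1 is set.
listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
listener.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
listener.bind(("0.0.0.0", 0))   # port 0: let the kernel pick a free port
listener.listen(1)
port = listener.getsockname()[1]

t = threading.Thread(target=serve_once, args=(listener,))
t.start()

# Client side: the connected socket the server gets back from accept() is
# what would carry the per-FIB association in a real VRF setup.
client = socket.create_connection(("127.0.0.1", port))
client.sendall(b"ping")
reply = client.recv(1024)
client.close()
t.join()
listener.close()
```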

Goals:

* avoid turning individual hosts into routers;
* avoid complex static rules;
* better support multi-fabric deployments with minimum effort (Juju, charms, MAAS, applications, network infrastructure);
* reduce operational complexity (custom L3 infrastructure integration for each deployment);
* reduce delivery risks (L3 infrastructure, L3 department responsiveness varies);
* avoid any form of L2 stretching at the infrastructure level - this is inefficient for various reasons.

NOTE: https://cumulusnetworks.com/blog/vrf-for-linux/ - I recommend reading this post to understand the suggestions below.

How to solve it?

What does it mean for Juju to support VRF devices?

* enslave certain devices at provisioning time based on network space information (physical NICs, VLAN devices, bonds AND bridges created for containers must all be considered) - VRF devices logically enslave devices similarly to bridges, but operate at L3, not L2;
* the above is per network namespace so it will work equally well in a LXD container;

Conceptually:

# echo 'net.ipv4.tcp_l3mdev_accept = 1' >> /etc/sysctl.conf
# echo 'net.ipv4.udp_l3mdev_accept = 1' >> /etc/sysctl.conf
# sysctl -p

# # create additional routing tables
# cat >> /etc/iproute2/rt_tables.d/vrf.conf <<EOF
1 mgmt
10 pub
20 storacc
30 storrepl
EOF

# # add and bring up VRF devices
# ip link add mgmt type vrf table 1 && ip link set dev mgmt up
# ip link add pub type vrf table 10 && ip link set dev pub up
# ip link add storacc type vrf table 20 && ip link set dev storacc up
# ip link add storrepl type vrf table 30 && ip link set dev storrepl up

# # enslave actual devices to VRF devices
# ip link set mgmtbr0 master mgmt
# ip link set pubbr0 master pub
# ip link set storaccbr0 master storacc
# ip link set storreplbr0 master storrepl

# # populate per-routing-table default gateways (done after enslavement
# # so that each next hop is reachable via a device in its table)
# ip route add default via 192.168.0.1 table mgmt
# ip route add default via 172.16.0.1 table pub
# ip route add default via 10.10.4.1 table storacc
# ip route add default via 10.10.5.1 table storrepl

# make your services use INADDR_ANY for listening sockets in charms if not done already (use 0.0.0.0)

charm-related:

* (no-op) services with listening sockets on INADDR_ANY will not need any modifications either on the charm side or at the application level - this is the cheapest way to solve multi-homing problems;

* (later) a more advanced functionality for applications that do not use INADDR_ANY but bind a listening socket to a specific address - this requires `ip vrf exec` functionality in iproute2 or application modifications.
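For the second case, a hedged sketch of what `ip vrf exec` usage could look like (it assumes a VRF device named mgmt already exists, iproute2 >= v4.10, and a kernel with CGROUPS/CGROUP_BPF enabled):

```shell
# run a service that binds to a specific address inside the mgmt VRF;
# every socket the process creates is implicitly bound to the VRF device
ip vrf exec mgmt /usr/sbin/sshd -D

# verify which VRF a process/shell is associated with
ip vrf exec mgmt bash -c 'ip vrf identify $$'
```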

Notes:

* Let's follow rule number 6 (https://tools.ietf.org/html/rfc1925) and move routing problems to L3 departments. `juju deploy router` is a different scenario, which should reside in a model separate from IaaS;
* We are not turning hosts into routers with this - this is a way to move routing decisions to the next hop which is available on a directly connected route. The problem we are solving here is N next hops instead of just one. Those hops can worry about administrative distance/different routing protocols, route costs/metrics, routing protocol peer authentication etc.
* Linux kernel functionality was mostly upstreamed in 4.4;
* This is Linux-kernel-only functionality, while a unit agent can also run on Windows (nothing we can do there).

Implementation description:

1. Kernel

4.4 (GA xenial)

* CONFIG_NET_VRF=m - present in xenial GA kernels
http://kernel.ubuntu.com/git/ubuntu/ubuntu-xenial.git/tree/debian.master/config/config.common.ubuntu?id=2c5158e82d497c5eb90d6e2b8aaf07d36cb175f6#n5172

* CONFIG_NET_L3_MASTER_DEV=y - present in xenial GA kernels
http://kernel.ubuntu.com/git/ubuntu/ubuntu-xenial.git/tree/debian.master/config/config.common.ubuntu?id=2c5158e82d497c5eb90d6e2b8aaf07d36cb175f6#n5109

backports needed from 4.5 - required for VRF-unaware applications that use INADDR_ANY:

6dd9a14e92e54895e143f10fef4d0b9abe109aa9 (tcp_l3mdev_accept)
63a6fff353d01da5a22b72670c434bf12fa0e3b8 (udp_l3mdev_accept)

only `ip vrf exec` related - NOT required for baseline functionality:

* http://man7.org/linux/man-pages/man8/ip-vrf.8.html CGROUPS and CGROUP_BPF enabled - xenial HWE only (not HWE-edge)

2. User space (iproute2)

iproute2 supports the vrf keyword in a version packaged with Ubuntu 16.04.

More specific functionality like `ip vrf exec <vrf-name>` is available in later versions:

https://git.kernel.org/pub/scm/linux/kernel/git/shemminger/iproute2.git/commit/?id=1949f82cdf62c074562f04acfbce40ada0aac7e0
git tag --contains=1949f82cdf62c074562f04acfbce40ada0aac7e0
v4.10.0
v4.11.0
...

3. MAAS - already hands over per-subnet default gateways

https://github.com/maas/maas/blob/2.3.0/src/maasserver/models/node.py#L3325-L3360
https://github.com/maas/maas/blob/2.3.0/src/maasserver/api/machines.py#L363-L378

4. Juju and/or MAAS:

* create per-network-space routing tables (default gateways must be taken from subnets in MAAS - subnets related to the same space will have different default gateways)
* create VRF devices relevant to network spaces;
* enslave interfaces to VRF devices (this includes Linux bridges created by Juju for containers).

5. Nothing for baseline functionality other than configuring software to use 0.0.0.0 (INADDR_ANY or "all interfaces") for listening sockets.

(future work) when INADDR_ANY is not used, wrap software with `ip vrf exec` so it gains VRF awareness even if it does not support VRFs directly.

See https://www.kernel.org/doc/Documentation/networking/vrf.txt, note that setsockopt requirement is worked around via `ip vrf exec` in iproute2 (no need to rewrite every application):

"Applications that are to work within a VRF need to bind their socket to the VRF device:

setsockopt(sd, SOL_SOCKET, SO_BINDTODEVICE, dev, strlen(dev)+1);

or to specify the output device using cmsg and IP_PKTINFO.

TCP & UDP services running in the default VRF context (ie., not bound to any VRF device) can work across ***all VRF domains*** by enabling the tcp_l3mdev_accept and udp_l3mdev_accept sysctl options:

sysctl -w net.ipv4.tcp_l3mdev_accept=1
sysctl -w net.ipv4.udp_l3mdev_accept=1"

http://man7.org/linux/man-pages/man8/ip-vrf.8.html
"This ip-vrf command is a helper to run a command against a specific VRF with the VRF association ***inherited parent to child***."
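As a sketch of the setsockopt() call quoted from the kernel documentation above (in Python for brevity; "lo" stands in for a VRF device name such as "mgmt", and on kernels before 5.7 this option requires CAP_NET_RAW):

```python
import socket

def bind_to_device(sock, dev):
    """Bind a socket to a device (e.g. a VRF device) via SO_BINDTODEVICE.

    Returns True on success, False if the kernel denies it (kernels
    before 5.7 require CAP_NET_RAW for this socket option).
    """
    try:
        sock.setsockopt(socket.SOL_SOCKET, socket.SO_BINDTODEVICE,
                        dev.encode() + b"\0")
        return True
    except PermissionError:
        return False

s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
# "lo" stands in here for a VRF device such as "mgmt".
ok = bind_to_device(s, "lo")
s.close()
```

This is exactly the per-application change that `ip vrf exec` avoids: the helper performs the VRF association for an unmodified program.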

References:

https://en.wikipedia.org/wiki/Multihoming
http://blog.ipspace.net/2016/04/host-to-network-multihoming-kludges.html
http://blog.ipspace.net/2010/09/ribs-and-fibs.html

https://cumulusnetworks.com/blog/vrf-for-linux/ <--- this is a must-read

https://docs.cumulusnetworks.com/display/DOCS/Virtual+Routing+and+Forwarding+-+VRF

http://netdevconf.org/1.2/session.html?david-ahern-talk

https://www.kernel.org/doc/Documentation/networking/vrf.txt

https://github.com/Mellanox/mlxsw/wiki/Virtual-Routing-and-Forwarding-%28VRF%29

http://blog.ipspace.net/2016/02/running-bgp-on-servers.html
https://tools.ietf.org/html/rfc7938

http://www.routereflector.com/2016/11/working-with-vrf-on-linux/ (usage example on 16.04)

Revision history for this message
Dmitrii Shcherbakov (dmitriis) wrote :

For Ubuntu kernel this is a backport request.

Revision history for this message
Ubuntu Kernel Bot (ubuntu-kernel-bot) wrote : Missing required logs.

This bug is missing log files that will aid in diagnosing the problem. While running an Ubuntu kernel (not a mainline or third-party kernel) please enter the following command in a terminal window:

apport-collect 1737428

and then change the status of the bug to 'Confirmed'.

If, due to the nature of the issue you have encountered, you are unable to run this command, please add a comment stating that fact and change the bug status to 'Confirmed'.

This change has been made by an automated script, maintained by the Ubuntu Kernel Team.

Changed in linux (Ubuntu):
status: New → Incomplete
description: updated
Revision history for this message
Andres Rodriguez (andreserl) wrote :

Hi Dmitrii,

This request is something we have never seen before, and it does seem to be quite a complex thing to do. It is not currently, and never has been, in our plans to implement.

I'm going to mark this as Wishlist and Incomplete until we understand what the needs and the impact on the project are. I would suggest you raise this during the sprint.

Changed in maas:
status: New → Incomplete
importance: Undecided → Wishlist
milestone: none → next
tags: added: kernel-da-key
Changed in linux (Ubuntu):
importance: Undecided → Wishlist
description: updated
Revision history for this message
Dmitrii Shcherbakov (dmitriis) wrote :

Andres,

I'm not going to be at the sprint but the problems described need a proper solution in MAAS and Juju at least from the end host perspective. Similar to how VLANs are supported natively in MAAS & Juju, L3 virtualization technologies like VRF should be as well. I hope the information I will give here will be enough to understand the use-cases and past experience in this field.

The concept is very similar to VLANs but for L3 which is probably less familiar and spans many hosts and routers/L3 switches within a single organization instead of being tied to a given switch fabric and either the same process or a group of processes on a host need to (1) receive & respond and (2) send data using different L3 topologies. Instead of virtual broadcast domains you get virtual paths because of per-virtual-L3 routing topologies. Good L2 analogies are Multiple Spanning Tree Protocol (MSTP) or PVST+ that were created to avoid blocking of switchports depending on logical L2 topologies related to a VLAN or group of VLANs (this is hidden on L2 though - no end host modifications required).

The use-cases I am talking about are not new - they were not used as much in data center networks until a certain point. They were used in service provider networks for multi-site L3 VPN for many years (https://tools.ietf.org/html/rfc4364). There are still many deployments which rely on large L2 domains where those problems do not occur as much because routing is done trivially via using directly connected routes and ARP broadcasts (there is never a hop between a source and destination host in most cases).

I may be wrong, but it seems to me that Network Spaces were originally designed with multi-homing in mind, yet with only limited support for multi-L2 topologies and routing (I don't judge - VRFs are fairly new to the Linux kernel). They are not that far from supporting this, though, thanks to the recent upstream kernel work.

With leaf-spine you are building a complex L3 network with different virtual topologies for different purposes and different SLAs for various kinds of traffic (IOW, a multi-tenant network). This is a typical service provider scenario with different customers on a shared infrastructure. You need to build many parallel dedicated communication lines but since infrastructure is shared it is not possible physically, however, you still need to do load-sharing across links, use distinct paths for different kinds of traffic and other optimizations to make sure your physical links are utilized and clients get certain quality of service and are separated from each other. In this case L3 VPNs are built not for clients (companies "x" and "y") but for different purposes: general purpose data, storage access or replication, management, public API traffic (originally, this was done for voice/video/data, see the first two paragraphs in the "background" section https://www.google.ch/patents/US8457117).

I can describe this in many ways, i.e. we need:

* multi-point L3VPN between racks to simulate L3 virtual circuits/pseudowires for different types of traffic;
* virtual routing domains (VRFs);
* traffic and routing separation for multi-L2 segment networks;
* L3 network mul...


Revision history for this message
Anastasia (anastasia-macmood) wrote :

Marking as Incomplete Wishlist as per 'maas'

Changed in juju:
status: New → Incomplete
importance: Undecided → Wishlist
Revision history for this message
Sandor Zeestraten (szeestraten) wrote :

I'd like to chime in and add that this is something that we've been missing in our Juju and MAAS deployment of OpenStack.

One of our problem areas, for example, is properly routing management, storage and public traffic for our OpenStack deployment without complex static rules and a lot of annoying workarounds.

I hope that both the Juju and MAAS team take a proper look at supporting these use cases.

Revision history for this message
John A Meinel (jameinel) wrote : Re: [Bug 1737428] Re: VRF support to solve routing problems associated with multi-homing

It would be good to have a clearer discussion of what issues you are
running into with routes. There are several ways that we *could* tackle the
issue. Static routes were the mechanism that we started modeling because
that was the ask from the field (because, as I understand it, that was the
solution they were using manually).
There have been discussions about enabling things like BGP. It would be
possible to push harder on the modeling aspect, give ways to model
routing as part of the network and space model, and then have Juju track
and update those (e.g., route traffic from this space to that space via
these gateways).


Revision history for this message
Dmitrii Shcherbakov (dmitriis) wrote :

John,

Interfaces of a host carry enough information to be used to make routing decisions - that's the core idea of host and router-side VRF implementations. Network spaces as of now do not help you to solve routing problems in any way unless you have one big L2 network and "routing" is done without routers: via ARP/NDP in a single broadcast domain.

Static routes are not flexible enough and are a workaround for the lack of VRF support. They require many additional steps from a deployer's perspective to worry about: one should just take a set of VLANs and subnets to configure in MAAS and assign them to a network space. With a default gateway per subnet there is always a next hop to delegate a routing decision to for a given network space from a host's perspective. Charms and potentially applications do need to be VRF-aware (discussed above on how).

BGP on a host, while feasible in some scenarios, is not always doable in practice: not every network and/or security department will give you an ability to deploy something and set up peering with their BGP-enabled routers.

I'd be happy to discuss scenarios in depth here or out of band, but the idea is that Network Spaces need to learn how to assist with the Routing and Forwarding parts - currently they solve only end-to-end discovery via relation data.

Revision history for this message
james beedy (jamesbeedy) wrote :

BUMP! +1000 for this feature. We need VRF support in MAAS to be able to use it within a BGP/VRF stack. @dmitriis thank you for the thorough write-up and detailed explanation of the problem/solution here!

Revision history for this message
Sandor Zeestraten (szeestraten) wrote :

Any news or progress on these issues now that we are a year and a half into this bug?

Revision history for this message
Dmitrii Shcherbakov (dmitriis) wrote :

Sandor,

Not on the VRF usage side, but there is a feature in MAAS 2.6 that provides a better way to work in multi-homed environments (for bionic+ machines):

https://docs.maas.io/2.6/en/intro-new
"Networking - Multiple default gateways"

It relies on "routing policy database" (RPDB) functionality
https://paste.ubuntu.com/p/xg6vFm8Hx7/ (netplan config, routing-policy sections are defined only for subnets that have a gateway configured in MAAS)

At the target machine you will see something like this:

# ip rule
0: from all lookup local
0: from 10.232.24.0/21 to 10.232.24.0/21 lookup main
0: from 10.232.40.0/21 to 10.232.40.0/21 lookup main
100: from 10.232.24.0/21 lookup 2
100: from 10.232.40.0/21 lookup 1
32766: from all lookup main
32767: from all lookup default

# ip route show table 1
default via 10.232.40.1 dev b-enp4s0f0-2730 proto static

# ip route show table 2
default via 10.232.24.1 dev b-enp4s0f0-2731 proto static

This works well for TCP when responding to traffic (even when software listens on 0.0.0.0). For UDP a frequent server use-case is DNS servers and bind9 binds its UDP sockets to interface addresses directly as opposed to using 0.0.0.0 (some other DNS servers do the same, e.g. PowerDNS - they even have a post about it https://blog.powerdns.com/2012/10/08/on-binding-datagram-udp-sockets-to-the-any-addresses/).

For sending, the policy rules will also kick in provided that a client socket (TCP or UDP) is bound to a specific address (so that the source IP is not automatically selected). This requires that the target software supports binding client sockets to specific addresses unfortunately.
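That client-side binding can be shown with a minimal sketch (plain Python; loopback addresses stand in for real interface addresses, so no RPDB rule actually fires here):

```python
import socket

# Server socket standing in for a remote peer.
server = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
server.bind(("127.0.0.1", 0))
server_addr = server.getsockname()

# Client socket explicitly bound to a source address *before* sending,
# so "from <prefix>" policy rules (ip rule) can select the right table.
client = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
client.bind(("127.0.0.1", 0))   # a real app would bind its interface address
client.sendto(b"hello", server_addr)

data, (src_ip, src_port) = server.recvfrom(1024)
client_addr = client.getsockname()
client.close()
server.close()
```

Software that always sends from an unbound socket leaves source-address selection to the kernel, which is exactly the case the policy rules cannot help with.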

So far using static routes to summarized prefixes has been a solution for east-west traffic (because we control nodes managed by MAAS) and using the approach above for client responses to arbitrary networks (via https://jaas.ai/u/canonical-bootstack/policy-routing).

After juju starts supporting this new MAAS feature https://bugs.launchpad.net/juju/+bug/1829150 we can stop using charm-policy-routing.

I hope that helps while VRF functionality is not implemented.

Ante Karamatić (ivoks)
tags: added: sts
Revision history for this message
Björn Tillenius (bjornt) wrote :

This sounds great, but it's not a bug report. It's a feature request. For MAAS, we track feature requests at https://discourse.maas.io/c/features, so I'm going to mark this bug report as Invalid for MAAS.

Changed in maas:
status: Incomplete → Invalid
Revision history for this message
Billy Olsen (billy-olsen) wrote :

@bjornt Are you going to copy it over to the MAAS discourse feature set then?

Revision history for this message
Björn Tillenius (bjornt) wrote :

On Tue, May 05, 2020 at 06:04:28PM -0000, Billy Olsen wrote:
> @bjornt Are you going to copy it over to the MAAS discourse feature set
> then?

We would prefer that one of the stakeholders actually would add it to
discourse. That will ensure that the stakeholders are still in the loop,
if there are questions about the feature.

Otherwise it might be that the MAAS team will have a discussion with
itself :)

Changed in netplan.io (Ubuntu):
status: New → Confirmed
importance: Undecided → Medium
Revision history for this message
Dimitri John Ledkov (xnox) wrote :

https://www.freedesktop.org/software/systemd/man/systemd.network.html#VRF=

VRF= The name of the VRF to add the link to. See systemd.netdev(5).

vrf A Virtual Routing and Forwarding (VRF) interface to create separate routing and forwarding domains.

[VRF] Section Options
The [VRF] section only applies for netdevs of kind "vrf" and accepts the following key:

Table=
The numeric routing table identifier. This setting is compulsory.

Example 15. /etc/systemd/network/25-vrf.netdev

Create a VRF interface with table 42.

[NetDev]
Name=vrf-test
Kind=vrf

[VRF]
Table=42

So there is backend support in networkd to create tables/VRFs and to add a link to a given VRF. Not sure whether routes are hooked up in networkd too.
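For the link side, a hypothetical companion .network file could look like this (file and interface names invented); judging by the systemd.network documentation, routes can be placed into the VRF's table via a [Route] section's Table= option, though that is an assumption worth verifying:

```ini
# /etc/systemd/network/30-eno1.network (hypothetical)
[Match]
Name=eno1

[Network]
VRF=vrf-test

[Route]
Gateway=192.168.0.1
Table=42
```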

Changed in maas:
milestone: next → none
Revision history for this message
Lukas Märdian (slyon) wrote :

Basic VRF support landed in netplan v0.105: https://github.com/canonical/netplan/pull/285
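For reference, a sketch of the syntax added in v0.105 (device name, table number and gateway invented here; see the linked PR for the authoritative format):

```yaml
network:
  version: 2
  ethernets:
    eno1: {}
  vrfs:
    vrf-mgmt:
      table: 1
      interfaces: [eno1]
      routes:
        - to: default
          via: 192.168.0.1
```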

Changed in netplan.io (Ubuntu):
status: Confirmed → Fix Released