Static routes are not per-interface, which breaks some deployments

Bug #1758919 reported by Gábor Mészáros on 2018-03-26
This bug affects 2 people
Affects       Importance   Assigned to
MAAS          High         Mike Pontillo
  2.3         High         Mike Pontillo
cloud-init    Medium       Unassigned

Bug Description

When Juju deploys an LXD container on a MAAS-managed machine, the machine loses all of its static routes, because ifdown/ifup is issued and /etc/network/interfaces (e/n/i) holds no saved data about the original state.

Machine with no lxd container deployed:
root@4-compute-4:~# ip r
default via 100.68.4.254 dev bond2 onlink
100.68.4.0/24 dev bond2 proto kernel scope link src 100.68.4.1
100.68.5.0/24 via 100.68.4.254 dev bond2
100.68.6.0/24 via 100.68.4.254 dev bond2
100.84.4.0/24 dev bond1 proto kernel scope link src 100.84.4.2
100.84.5.0/24 via 100.84.4.254 dev bond1
100.84.6.0/24 via 100.84.4.254 dev bond1
100.99.4.0/24 dev bond0 proto kernel scope link src 100.99.4.101
100.99.5.0/24 via 100.99.4.254 dev bond0
100.99.6.0/24 via 100.99.4.254 dev bond0
100.107.0.0/24 via 100.99.4.254 dev bond0

After Juju deploys a container, the static routes disappear:
root@4-management-1:~# ip r
default via 100.68.100.254 dev bond2 onlink
10.177.144.0/24 dev lxdbr0 proto kernel scope link src 10.177.144.1
100.68.100.0/24 dev bond2 proto kernel scope link src 100.68.100.26
100.84.4.0/24 dev br-bond1 proto kernel scope link src 100.84.4.1
100.99.4.0/24 dev br-bond0 proto kernel scope link src 100.99.4.3

After a host reboot, the routes are NOT restored; they are still gone:
root@4-management-1:~# ip r s
default via 100.68.100.254 dev bond2 onlink
100.68.100.0/24 dev bond2 proto kernel scope link src 100.68.100.26
100.84.4.0/24 dev br-bond1 proto kernel scope link src 100.84.4.1
100.84.5.0/24 via 100.84.4.254 dev br-bond1
100.84.6.0/24 via 100.84.4.254 dev br-bond1
100.99.4.0/24 dev br-bond0 proto kernel scope link src 100.99.4.3
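To compare snapshots like the ones above, it can help to count the gateway ("via") routes in each. The helper below is a hypothetical sketch, not part of MAAS, Juju, or iproute2: it does a simple whitespace split of `ip route` output and only extracts the `via` and `dev` fields. The first snapshot comes from 4-compute-4 and the second from 4-management-1, so comparing counts is more meaningful than comparing exact prefixes.

```python
# Hypothetical helper: summarize the gateway ("via") routes in `ip route` output.
# It is a simplified whitespace parser, not a full iproute2 grammar.

def parse_ip_route(output: str) -> list[dict]:
    """Turn each `ip route` line into {"prefix", "via", "dev"} (missing fields are None)."""
    routes = []
    for line in output.strip().splitlines():
        tokens = line.split()
        if not tokens:
            continue
        route = {"prefix": tokens[0], "via": None, "dev": None}
        # Fields appear as keyword/value pairs after the prefix, e.g. "via 1.2.3.4 dev bond0".
        for key in ("via", "dev"):
            if key in tokens:
                idx = tokens.index(key)
                if idx + 1 < len(tokens):
                    route[key] = tokens[idx + 1]
        routes.append(route)
    return routes


def static_routes(output: str) -> list[dict]:
    """Keep only non-default routes that go through a gateway."""
    return [r for r in parse_ip_route(output)
            if r["via"] is not None and r["prefix"] != "default"]


# Snapshots copied from the bug description.
BEFORE = """\
default via 100.68.4.254 dev bond2 onlink
100.68.4.0/24 dev bond2 proto kernel scope link src 100.68.4.1
100.68.5.0/24 via 100.68.4.254 dev bond2
100.68.6.0/24 via 100.68.4.254 dev bond2
100.84.4.0/24 dev bond1 proto kernel scope link src 100.84.4.2
100.84.5.0/24 via 100.84.4.254 dev bond1
100.84.6.0/24 via 100.84.4.254 dev bond1
100.99.4.0/24 dev bond0 proto kernel scope link src 100.99.4.101
100.99.5.0/24 via 100.99.4.254 dev bond0
100.99.6.0/24 via 100.99.4.254 dev bond0
100.107.0.0/24 via 100.99.4.254 dev bond0
"""

AFTER_DEPLOY = """\
default via 100.68.100.254 dev bond2 onlink
10.177.144.0/24 dev lxdbr0 proto kernel scope link src 10.177.144.1
100.68.100.0/24 dev bond2 proto kernel scope link src 100.68.100.26
100.84.4.0/24 dev br-bond1 proto kernel scope link src 100.84.4.1
100.99.4.0/24 dev br-bond0 proto kernel scope link src 100.99.4.3
"""

if __name__ == "__main__":
    for label, snapshot in (("before", BEFORE), ("after deploy", AFTER_DEPLOY)):
        print(label, len(static_routes(snapshot)), "static routes")
    # before 7 static routes
    # after deploy 0 static routes
```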


Ante Karamatić (ivoks) on 2018-03-26
tags: added: cpe-onsite
tags: added: 4010

Attached are the original and the Juju-modified interfaces files.

The routes are attached to the wrong bond (bond2), even though their gateways are on br-bond0. In MAAS, too, the routes are defined on the proper subnets.

On nodes without containers, the configuration is written to /etc/network/interfaces.d/50-cloud-init.cfg. That file is present on all nodes, but on the container hosts it gets overridden.

Ante Karamatić (ivoks) wrote :

ifup brings interfaces up serially. With Juju's ENI, this means it brings up bond0 before br-bond0 and br-bond1. Since the layer-3 addresses provided by br-bond0 and br-bond1 do not exist yet when post-up runs, the post-up commands fail. Because of '|| true', this does not make ifup itself fail, but it leaves the machine without routes.

I believe MAAS always adds 'post-up' static routes to the last interface (a reasonable approach until netplan solves this). That means Juju should do the same: pick up the post-up routes from the bottom of the ENI and place them at the end of the last bridge it creates.
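The ordering problem and the proposed move can be illustrated with a toy simulation. This is a sketch under stated assumptions: the interface order, addresses, and the two sample routes are reconstructed from the `ip r` output in the description, not taken from the attached files; a post-up route is treated as succeeding only if its gateway already lies in the subnet of an interface that is up; and a failure is silently skipped, as `|| true` does.

```python
import ipaddress

# Hypothetical sample routes; their gateways sit on the bridged subnets.
ROUTES = [("100.99.5.0/24", "100.99.4.254"), ("100.84.5.0/24", "100.84.4.254")]

# Assumed e/n/i order after Juju bridging: (iface, address or None, post-up routes).
AS_RENDERED = [
    ("bond0", None, []),
    ("bond1", None, []),
    ("bond2", "100.68.100.26/24", ROUTES),   # routes attached to bond2, as reported
    ("br-bond0", "100.99.4.3/24", []),
    ("br-bond1", "100.84.4.1/24", []),
]

# Same layout, but with the routes moved to the end of the last bridge.
ON_LAST_BRIDGE = [
    ("bond0", None, []),
    ("bond1", None, []),
    ("bond2", "100.68.100.26/24", []),
    ("br-bond0", "100.99.4.3/24", []),
    ("br-bond1", "100.84.4.1/24", ROUTES),
]


def ifup_serially(eni):
    """Bring interfaces up in file order and return the routes that got installed."""
    on_link = []     # subnets of interfaces that are already up
    installed = []
    for _iface, address, post_up in eni:
        if address is not None:
            on_link.append(ipaddress.ip_interface(address).network)
        for prefix, gateway in post_up:
            if any(ipaddress.ip_address(gateway) in net for net in on_link):
                installed.append((prefix, gateway))
            # Otherwise the kernel rejects the unreachable gateway;
            # '|| true' hides the error, so ifup carries on without the route.
    return installed


if __name__ == "__main__":
    print(ifup_serially(AS_RENDERED))
    # []
    print(ifup_serially(ON_LAST_BRIDGE))
    # [('100.99.5.0/24', '100.99.4.254'), ('100.84.5.0/24', '100.84.4.254')]
```

Placing the routes last works only because every bridge happens to be up by then; it still depends on file order, which is the objection raised later in the thread.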

Why is adding it to the "last interface" correct? Wouldn't it be more
correct to attach routes to the interface that contains that route?

E.g., in your scenario above, bond0 is getting 100.99.4.3/24, so routes that
use 100.99.4.254 as the gateway should be attached to bond0?

On Mon, Mar 26, 2018 at 5:59 PM, Gábor Mészáros <email address hidden> wrote:

> ** Attachment added: "50-cloud-init.cfg"
> https://bugs.launchpad.net/juju/+bug/1758919/+attachment/5091207/+files/50-cloud-init.cfg
>
> --
> You received this bug notification because you are a member of Canonical
> Field Critical, which is subscribed to the bug report.
> Matching subscriptions: juju bugs
> https://bugs.launchpad.net/bugs/1758919
>
> Title:
> static routes get lost when lxd container being deployed [MAAS
> environment]
>
> To manage notifications about this bug go to:
> https://bugs.launchpad.net/juju/+bug/1758919/+subscriptions
>

John A Meinel (jameinel) wrote :

Note that
https://launchpadlibrarian.net/362140579/50-cloud-init.cfg
is misleading: it looks as if it only sets up the "post-up" routes "after
all interfaces", but in reality all of them are explicitly attached to
bond2 (even though they are not indented, ifup and friends do not use
indentation to decide which stanza an option belongs to).
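The indentation point can be illustrated with a deliberately simplified parser. It is not ifupdown's real parser; it only mimics the rule that matters here: leading whitespace is discarded, so an option line belongs to the most recent `iface` stanza. The e/n/i snippet is hypothetical, and its route line is not copied from the attachment.

```python
# Simplified sketch of how option lines are assigned to e/n/i stanzas.
# Not ifupdown's actual parser: only 'iface' opens a stanza here, and
# 'auto', 'allow-*', 'source*' and 'mapping' lines are treated as closing it.

def stanza_options(eni_text: str) -> dict[str, list[str]]:
    """Map each iface name to the option lines that follow its 'iface' line."""
    options: dict[str, list[str]] = {}
    current = None
    for raw in eni_text.splitlines():
        line = raw.strip()   # indentation is discarded before anything else
        if not line or line.startswith("#"):
            continue
        keyword = line.split()[0]
        if keyword == "iface":
            current = line.split()[1]
            options.setdefault(current, [])
        elif keyword in ("auto", "mapping") or keyword.startswith(("allow-", "source")):
            current = None
        elif current is not None:
            options[current].append(line)
    return options


# Hypothetical file: the route is written at column 0, as if it were global.
ENI = """\
auto bond2
iface bond2 inet static
    address 100.68.100.26/24
    gateway 100.68.100.254

# static routes
post-up ip route add 100.99.5.0/24 via 100.99.4.254 || true
"""

if __name__ == "__main__":
    print(stanza_options(ENI)["bond2"][-1])
    # post-up ip route add 100.99.5.0/24 via 100.99.4.254 || true
```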

Is this actually Field Critical? Isn't simply moving the post-up lines to a
different stanza enough to work around the field issue?

You are right; as soon as I wrote the comment I realized I was wrong (I confused it with using iptables in post-up).

16:07 < ivoks> so, ideally, cloud-config would be smarter here
16:08 < ivoks> and place those routes where they belong
16:08 < ivoks> well, whoever generates that cloud-init.cfg should be a wee smarter

Routes should be placed on the interfaces that provide access to gateways for those routes.
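One way to apply that rule is to pick, for each route, the interface whose configured subnet contains the route's gateway. Below is a minimal sketch, assuming each interface has a single static IPv4 address; the addresses and routes come from the 4-compute-4 `ip r` output in the description, and the rendered `post-up ip route add ... dev ...` lines are only one possible e/n/i form, not what MAAS or cloud-init actually emit.

```python
import ipaddress

# Addresses and routes taken from the 4-compute-4 `ip r` output in the description.
INTERFACES = {
    "bond0": "100.99.4.101/24",
    "bond1": "100.84.4.2/24",
    "bond2": "100.68.4.1/24",
}
ROUTES = [
    ("100.68.5.0/24", "100.68.4.254"),
    ("100.68.6.0/24", "100.68.4.254"),
    ("100.84.5.0/24", "100.84.4.254"),
    ("100.84.6.0/24", "100.84.4.254"),
    ("100.99.5.0/24", "100.99.4.254"),
    ("100.99.6.0/24", "100.99.4.254"),
    ("100.107.0.0/24", "100.99.4.254"),
]


def assign_routes(interfaces, routes):
    """Attach each (prefix, gateway) to the first interface whose subnet holds the gateway."""
    networks = {name: ipaddress.ip_interface(addr).network
                for name, addr in interfaces.items()}
    assigned = {name: [] for name in interfaces}
    unassigned = []   # gateway is not on-link on any interface
    for prefix, gateway in routes:
        gw = ipaddress.ip_address(gateway)
        owner = next((name for name, net in networks.items() if gw in net), None)
        if owner is None:
            unassigned.append((prefix, gateway))
        else:
            assigned[owner].append((prefix, gateway))
    return assigned, unassigned


def render_post_up(assigned):
    """Emit one post-up line per route, meant for the stanza of the owning interface."""
    return {
        name: [f"post-up ip route add {prefix} via {gateway} dev {name}"
               for prefix, gateway in routes]
        for name, routes in assigned.items()
    }


if __name__ == "__main__":
    assigned, unassigned = assign_routes(INTERFACES, ROUTES)
    for name, lines in render_post_up(assigned).items():
        print(f"iface {name} ...")
        for line in lines:
            print("    " + line)
```

Because each route then lives in the stanza of the interface that reaches its gateway, its post-up command runs only after that interface's subnet exists, regardless of the order of the other stanzas.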

IMHO, that's an obvious MAAS fault: it always writes the routes to the last device rather than to the device the routes are 'attached' to. In this scenario, running ifdown bond2 (an interface that has absolutely nothing to do with the static routes) would bring the routes down. Moreover, the assumption that the order of the devices in e/n/i is the order in which they are brought up may be incorrect. IMHO, this should be fixed in MAAS.

Ante Karamatić (ivoks) on 2018-03-26
Changed in juju:
status: New → Invalid
Changed in maas (Ubuntu):
status: New → Triaged
importance: Undecided → High
no longer affects: juju
no longer affects: maas (Ubuntu)
summary: - static routes get lost when lxd container being deployed [MAAS
- environment]
+ Static routes are not per-interface, which breaks some deployments
Changed in maas:
status: New → Triaged
importance: Undecided → High
assignee: nobody → Mike Pontillo (mpontillo)
milestone: none → 2.4.0beta2
tags: added: field-critical
Changed in maas:
status: Triaged → In Progress
Mike Pontillo (mpontillo) wrote :

IMHO, this should also be fixed in cloud-init. If the input netplan contains "global" routes, the renderer (or whatever pre-processes the netplan before rendering) should determine which interface has an on-link prefix matching each global route's gateway, and automatically render the route at interface scope instead of "global".

Arguably, if a route's gateway address doesn't match an on-link prefix, the route should not be installed at all (the kernel will reject it anyway, unless the `onlink` flag is supplied, which tells the kernel to treat the address as on-link even if it doesn't appear to be). The only useful scenario I can see for supporting the `onlink` flag is installing a route on an interface that will get its IP address via DHCP.
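The pre-processing described above could look roughly like the sketch below. The input layout is an assumption: a netplan-like dict with an `interfaces` map and a top-level `routes` list standing in for "global" routes; this is not the real cloud-init network-config or netplan schema. The `to`, `via`, and `on-link` keys mirror netplan route fields, while `interface` is a hypothetical hint used only for explicitly on-link routes, and `eno3` is a made-up DHCP interface.

```python
import copy
import ipaddress


def localize_routes(config):
    """Move hypothetical top-level 'routes' onto the interface that has the gateway on-link.

    Returns (new_config, rejected_routes); the input dict is not modified.
    """
    cfg = copy.deepcopy(config)
    ifaces = cfg.get("interfaces", {})
    rejected = []
    for route in cfg.pop("routes", []):
        gw = ipaddress.ip_address(route["via"])
        owner = None
        for name, spec in ifaces.items():
            if any(gw in ipaddress.ip_interface(a).network for a in spec.get("addresses", [])):
                owner = name
                break
        if owner is None and route.get("on-link") and route.get("interface") in ifaces:
            owner = route["interface"]   # explicit on-link route, e.g. for a DHCP interface
        if owner is None:
            rejected.append(route)       # the kernel would refuse this gateway anyway
            continue
        entry = {k: v for k, v in route.items() if k != "interface"}
        ifaces[owner].setdefault("routes", []).append(entry)
    return cfg, rejected


# Hypothetical input; addresses and the first two routes follow the description.
CONFIG = {
    "interfaces": {
        "bond0": {"addresses": ["100.99.4.101/24"]},
        "bond2": {"addresses": ["100.68.4.1/24"]},
        "eno3": {"dhcp4": True},
    },
    "routes": [
        {"to": "100.99.5.0/24", "via": "100.99.4.254"},
        {"to": "100.68.5.0/24", "via": "100.68.4.254"},
        {"to": "192.0.2.0/24", "via": "198.51.100.1", "on-link": True, "interface": "eno3"},
        {"to": "203.0.113.0/24", "via": "198.51.100.1"},
    ],
}

if __name__ == "__main__":
    new_cfg, rejected = localize_routes(CONFIG)
    print(rejected)
    # [{'to': '203.0.113.0/24', 'via': '198.51.100.1'}]
```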

Changed in maas:
status: In Progress → Fix Committed
Changed in maas:
milestone: 2.4.0beta2 → 2.4.0beta1
Changed in maas:
status: Fix Committed → Fix Released
Ryan Harper (raharper) on 2019-07-19
Changed in cloud-init:
importance: Undecided → Medium
status: New → Triaged