ppc64el / arm64 - issues with cloud-init setting default route

Bug #1879933 reported by Andrew McLeod on 2020-05-21
This bug affects 1 person
Affects      Importance   Assigned to
MAAS         Undecided    Unassigned
cloud-init   Undecided    Unassigned
netplan      Undecided    Unassigned

Bug Description

This is quite possibly a cloud-init bug.

MAAS version: 2.6.2 (7841-ga10625be3-0ubuntu1~18.04.1)

This problem manifests whether the machine is deployed with Juju or manually via the MAAS UI.

This problem is intermittent, and I have only seen it affect arm64 and ppc64el machines (out of 29 machines in total). All of these machines have 2 interfaces connected to the same fabric in the same subnet: one is set to unassigned, to be used as a bridge port / data port for OpenStack deployments, and the other is set to auto-assign.

This problem occurs with bionic, eoan and focal deployments.

I have recommissioned the affected machines numerous times, including attempts to update firmware.

Symptoms: when the machine comes up after it is deployed there is no default gateway, e.g.

ubuntu@node-mawhile:/var/log$ ip route
10.245.168.0/21 dev enP5p9s0f0 proto kernel scope link src 10.245.168.63

The rsyslog on the MAAS server shows that the machine is being configured correctly:

https://pastebin.ubuntu.com/p/ZZzQ4q2ZCT/

But the cloud-init log on the machine does not have a default gateway:

https://pastebin.ubuntu.com/p/cCJbF7zhtK/

Additional info:

Something I have observed is that the machines where this problem occurs seem to sometimes have the 'unassigned' interface as the PXE interface, and sometimes the auto-assigned interface. I've tried to force this but the PXE interface moves around by itself.

Lee Trager (ltrager) wrote :

MAAS passes network config to cloud-init which writes it to /etc/netplan/50-cloud-init.yaml and uses netplan to actually apply it once the system has booted. netplan is non-blocking and I've seen cloud-init output incomplete network information even though netplan hasn't finished applying network config.

* Have you verified the network configuration isn't correct by logging onto the affected system and checking routes with `route`?
* When you log in, is the netplan process running?
* Can you post full Curtin output? You can get this with
maas $PROFILE machine get-curtin-config $SYSTEM_ID

Changed in maas:
status: New → Incomplete
Ryan Harper (raharper) wrote :

> netplan is non-blocking and I've seen cloud-init output incomplete network information even though netplan hasn't finished applying network config

cloud-init calls netplan generate which reads the config passed in from MAAS, and writes out all of the networkd files per the config; this happens before network-online.target is reached, so systemd-networkd runs and cloud-init will not proceed until systemd-networkd-wait-online.service is complete;

systemd-networkd-wait-online.service will wait for all interfaces which have configuration on them.

From the config posted, there's no config for eno1, so this appears to be
output from one config and input from a different system. Can you provide
the failing output, and the /etc/netplan/50-cloud-init.yaml and
/etc/cloud/cloud.cfg.d/50-curtin-networking.cfg files?

> cloud-init log on the machine does not have a default gateway

   0 | 0.0.0.0 | 10.245.168.1 | 0.0.0.0 | eno1 | UG |

Is this not the default gateway?

And lastly, if your config is using non-standard routing tables like the
paste you supplied, ip route will only show routes in the default table,
and the default route appears to be in table 1.

routes:
  - table: 1
    to: 0.0.0.0/0
    via: 10.245.168.1
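
To see routes in every table rather than only the main one, `ip route show table all` can be used. The helper below is a minimal sketch and not part of the original report: `default_route_tables` is a hypothetical name, and it assumes the usual iproute2 text output, in which routes outside the main table carry a `table <id>` token and main-table routes do not.

```shell
# Print the routing table of every default route found on stdin.
# Assumption: iproute2 text output, where routes outside the main
# table include a "table <id>" token and main-table routes do not.
default_route_tables() {
    awk '$1 == "default" {
        tbl = "main"
        for (i = 2; i < NF; i++)
            if ($i == "table") tbl = $(i + 1)
        print tbl
    }'
}

# Typical use on a live system:
#   ip -4 route show table all | default_route_tables
```

If the only table printed is something other than `main`, plain `ip route` will not list a default route even though one is configured.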

I took the config from your paste and put it in a container, then ran netplan apply:

root@g1:~# netplan --debug apply
** (generate:5092): DEBUG: 21:55:25.895: Processing input file /etc/netplan/50-cloud-init.yaml..
** (generate:5092): DEBUG: 21:55:25.895: starting new processing pass
** (generate:5092): DEBUG: 21:55:25.895: We have some netdefs, pass them through a final round of validation
** (generate:5092): DEBUG: 21:55:25.895: eth0: setting default backend to 1
** (generate:5092): DEBUG: 21:55:25.895: Configuration is valid
** (generate:5092): DEBUG: 21:55:25.895: Generating output files..
** (generate:5092): DEBUG: 21:55:25.895: NetworkManager: definition eth0 is not for us (backend 1)
(generate:5092): GLib-DEBUG: 21:55:25.895: posix_spawn avoided (fd close requested)
DEBUG:netplan generated networkd configuration changed, restarting networkd
DEBUG:no netplan generated NM configuration exists
DEBUG:eth0 not found in {}
DEBUG:Merged config:
network:
  bonds: {}
  bridges: {}
  ethernets:
    eth0:
      addresses:
      - 10.245.168.63/21
      match:
        macaddress: 00:16:3e:39:6c:f7
      mtu: 1500
      routes:
      - table: 1
        to: 0.0.0.0/0
        via: 10.245.168.1
      routing-policy:
      - from: 10.245.168.0/21
        priority: 100
        table: 1
      - from: 10.245.168.0/21
        table: 254
        to: 10.245.168.0/21
  vlans: {}
  wifis: {}

DEBUG:Skipping non-physical interface: lo
DEBUG:{}
DEBUG:netplan triggering .link rules for lo
DEBUG:netplan triggering .link rules for eth0
root@g1:~# ip a
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN group default qlen 1000
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
    inet 127.0.0.1/8 scope host lo
       valid_lft forever preferred_lft forever
    inet6 ::1/128 scope host
       valid_lft forever preferred_lft forever
6: eth0@if7: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP gro...

Ryan Harper (raharper) wrote :

I believe the cloud-init task is invalid, but let's wait for some more information from submitter.

Changed in cloud-init:
status: New → Incomplete
Andrew McLeod (admcleod) wrote :

Perhaps the bug title should be changed from 'no default route' to 'default route doesn't seem to be relevant if it is not in the default routing table':

I wasn't aware that the default route would be in another table - it is, but it doesn't work. If I add the route to the default table it does work.

ubuntu@node-mawhile:~$ ip route
10.140.121.0/24 dev lxdbr0 proto kernel scope link src 10.140.121.1
10.245.168.0/21 dev enP5p9s0f0 proto kernel scope link src 10.245.168.63
ubuntu@node-mawhile:~$ ip route list table 1
default via 10.245.168.1 dev enP5p9s0f0 proto static

ubuntu@node-mawhile:~$ ip route get 8.8.8.8
RTNETLINK answers: Network is unreachable
ubuntu@node-mawhile:~$ sudo ip route add default via 10.245.168.1 dev enP5p9s0f0
ubuntu@node-mawhile:~$ ip route get 8.8.8.8
8.8.8.8 via 10.245.168.1 dev enP5p9s0f0 src 10.245.168.63 uid 1000
    cache

pastebin for /etc/netplan/50-cloud-init.yaml
https://pastebin.ubuntu.com/p/DbHvkCHtNr/

pastebin for /etc/cloud/cloud.cfg.d/50-curtin-networking.cfg
https://pastebin.ubuntu.com/p/KH4R2XMTCN/

Here is the replicated cloud-init output (note the lxd route is there because I launched some containers after adding the route) - note no default route in this output.

https://pastebin.ubuntu.com/p/XtwmcVZxV3/

Running netplan apply --debug doesn't make a difference: the default route is still where it should be, in table 1, but nothing external is reachable.

Lee Trager (ltrager) on 2020-06-04
Changed in cloud-init:
status: Incomplete → New
Paride Legovini (paride) wrote :

Hi,

My question now is: isn't it the expected behavior that non-default routing tables are not used by default? IIUC, non-default tables need rules that specify when they should be used, e.g.

  ip rule add from <ip> table <table>
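
Whether such a rule is actually installed can be checked from `ip rule list`. The following is a minimal sketch, not something taken from the affected machines: `rules_for_table` is a hypothetical helper name, and it assumes the rule format shown elsewhere in this report (`<priority>: from <selector> lookup <table>`).

```shell
# Print every policy rule on stdin that looks up the given table.
# Assumption: `ip rule list` text output, for example
#   100: from 10.245.168.0/21 lookup 1
rules_for_table() {
    awk -v tbl="$1" '{
        for (i = 1; i < NF; i++)
            if ($i == "lookup" && $(i + 1) == tbl) { print; break }
    }'
}

# Typical use on a live system:
#   ip rule list | rules_for_table 1
```

Empty output would mean that no rule points at the table, so routes placed there are never consulted.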

Also, you wrote in the bug description that the problem is intermittent. I think it would be really interesting to see what the config files look like and how routing is configured when everything does happen to work. Do you think you can collect the relevant logs?

Thanks!

Changed in cloud-init:
status: New → Incomplete
Andrew McLeod (admcleod) wrote :

It took about 12 deploys - I did nothing but release/deploy (focal) - and I managed to get one that had a functional network:

ubuntu@node-gengar:~$ ip route get 8.8.8.8
8.8.8.8 via 10.245.168.1 dev enP5p9s0f1 src 10.245.168.27 uid 1000
    cache

ubuntu@node-gengar:~$ ip route
default via 10.245.168.1 dev enP5p9s0f1 proto static
10.245.168.0/21 dev enP5p9s0f1 proto kernel scope link src 10.245.168.27
ubuntu@node-gengar:~$ ip route list table 1
Error: ipv4: FIB table does not exist.
Dump terminated
ubuntu@node-gengar:~$ ip rule list
0: from all lookup local
32766: from all lookup main
32767: from all lookup default

/etc/netplan/50-cloud-init.yaml
https://pastebin.ubuntu.com/p/qxFJCSkyfn/

/etc/cloud/cloud.cfg.d/50-curtin-networking.cfg
https://pastebin.ubuntu.com/p/zdDwgVbSJd/

cloud-init
https://pastebin.ubuntu.com/p/pySk8r6Cp3/

I'm going to leave this one and a broken one up in case anyone wants any other logs etc.
