[Focal] cloud-init service never gets the network activated during MaaS deploy.

Bug #1869181 reported by Alex Tu
This bug affects 3 people
Affects               Status   Importance  Assigned to  Milestone
OEM Priority Project  New      High        Alex Tu
cloud-init            Expired  Undecided   Unassigned

Bug Description

The MaaS server waits for cloud-init on the target to report its status.
This works well on Bionic desktop but fails on Focal desktop.
It might be caused by the ordering of systemd services, because the network service is always started after the cloud-init service.

Journalctl:
Mar 26 18:34:18 CANONICALID cloud-init[816]: Cloud-init v. 20.1-10-g71af48df-0ubuntu2 running 'init' at Thu, 26 Mar 2020 10:34:18 +0000. Up 6.59 seconds.
Mar 26 18:34:18 CANONICALID cloud-init[816]: ci-info: ++++++++++++++++++++++++++++++++Net device info++++++++++++++++++++++++++++++++
Mar 26 18:34:18 CANONICALID cloud-init[816]: ci-info: +-----------------+-------+-----------+-----------+-------+-------------------+
Mar 26 18:34:18 CANONICALID cloud-init[816]: ci-info: |      Device     |   Up  |  Address  |    Mask   | Scope |     Hw-Address    |
Mar 26 18:34:18 CANONICALID cloud-init[816]: ci-info: +-----------------+-------+-----------+-----------+-------+-------------------+
Mar 26 18:34:18 CANONICALID cloud-init[816]: ci-info: | enx00e04c70045f | False |     .     |     .     |   .   | 00:e0:4c:70:04:5f |
Mar 26 18:34:18 CANONICALID cloud-init[816]: ci-info: |        lo       |  True | 127.0.0.1 | 255.0.0.0 |  host |         .         |
Mar 26 18:34:18 CANONICALID cloud-init[816]: ci-info: |        lo       |  True |  ::1/128  |     .     |  host |         .         |
Mar 26 18:34:18 CANONICALID cloud-init[816]: ci-info: |      wlp2s0     | False |     .     |     .     |   .   | 9c:b6:d0:8e:90:81 |
Mar 26 18:34:18 CANONICALID cloud-init[816]: ci-info: +-----------------+-------+-----------+-----------+-------+-------------------+
.....[skip]....
Mar 26 18:34:18 CANONICALID cloud-init[816]: 2020-03-26 10:34:18,361 - handlers.py[WARNING]: failed posting event: start: init-network/check-cache: attempting to read from cache [trust]
.....[skip]....
Mar 26 18:36:25 CANONICALID systemd[1]: Starting Network Manager...

Tags: oem-priority
Alex Tu (alextu)
Changed in oem-priority:
importance: Undecided → Critical
Paride Legovini (paride) wrote :

Hello Alex, is it possible for you to collect the full cloud-init logs (using the `cloud-init collect-logs` command) and the full NetworkManager logs, and attach them to this report? Thanks!

Paul Larson (pwlars) wrote :

I had this problem with the stock image for both eoan and focal as well. I don't think it's really the "recommended" solution, but the only way to fix it that I've found is to remove "Before=sysinit.target" from /lib/systemd/system/cloud-init.service in the target install. I haven't seen any negative effects from this so far.
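
A minimal sketch of that edit, for illustration only. The function name `drop_before_sysinit` and the idea of passing the mounted target root as an argument are assumptions, not something from this report. It also assumes the directive sits on a line of its own in the unit file and that GNU sed is available (as on Ubuntu).

```shell
#!/bin/sh
# Hypothetical helper: name and argument are assumptions for illustration.
# Deletes the exact line "Before=sysinit.target" from cloud-init.service
# under the given target root and leaves every other line untouched.
drop_before_sysinit() {
    target_root="$1"
    unit="$target_root/lib/systemd/system/cloud-init.service"
    if [ ! -f "$unit" ]; then
        echo "unit file not found: $unit" >&2
        return 1
    fi
    # -i edits the file in place (GNU sed).
    sed -i '/^Before=sysinit\.target$/d' "$unit"
}
```

It would be called as `drop_before_sysinit /path/to/target/root` (the path here is a placeholder). A package upgrade may restore the original unit file, so this is a workaround rather than a fix.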

Dan Watkins (oddbloke) wrote :

Marking this as Incomplete as we're waiting on the logs requested in comment #1.

Changed in cloud-init:
status: New → Incomplete
Alex Tu (alextu) wrote :
Alex Tu (alextu) wrote :

The NetworkManager logs are included in the journalctl log.

Changed in cloud-init:
status: Incomplete → New
Alex Tu (alextu) wrote :

Sorry for the late reply; I had some busy days after my vacation.
I have provided the logs and changed the status back to New so that it gets noticed.

Changed in oem-priority:
assignee: nobody → Alex Tu (alextu)
Ryan Harper (raharper) wrote :

Something related to NetworkManager is preventing systemd-networkd from starting.

networkd-dispatcher[1163]: WARNING: systemd-networkd is not running, output will be incomplete.

Normally, after network-pre.target, systemd-networkd-wait-online.service runs
and blocks until systemd-networkd has brought up the configured interfaces.
systemd-networkd did not start, so when cloud-init.service runs, the network is
*not* up as it expects.

systemd[1]: Finished Initial cloud-init job (pre-networking).
systemd[1]: Reached target Network (Pre).
systemd[1]: Starting Initial cloud-init job (metadata service crawler)...
cloud-init[931]: Cloud-init v. 20.1-10-g71af48df-0ubuntu2 running 'init' at Wed, 08 Apr 2020 05:45:26 +0000. Up 31.04 seconds.
cloud-init[931]: ci-info: +++++++++++++++++++++++++++++Net device info+++++++++++++++++++++++++++++
cloud-init[931]: ci-info: +-----------+-------+-----------+-----------+-------+-------------------+
cloud-init[931]: ci-info: |   Device  |   Up  |  Address  |    Mask   | Scope |     Hw-Address    |
cloud-init[931]: ci-info: +-----------+-------+-----------+-----------+-------+-------------------+
cloud-init[931]: ci-info: |   enp2s0  | False |     .     |     .     |   .   | 6c:2b:59:59:9d:ca |

Notice that NetworkManager later takes over the ethernet device:

kernel: r8169 0000:02:00.0 enp2s0: Link is Up - 1Gbps/Full - flow control rx/tx
kernel: IPv6: ADDRCONF(NETDEV_CHANGE): enp2s0: link becomes ready
NetworkManager[1035]: <info> [1586324867.0369] device (enp2s0): carrier: link connected
NetworkManager[1035]: <info> [1586324867.0380] device (enp2s0): state change: unavailable -> disconnected (reason 'carrier-changed', sys-iface-state: 'managed')
NetworkManager[1035]: <info> [1586324867.0403] policy: auto-activating connection 'netplan-enp2s0' (7ea6f90b-3495-3533-948a-ef0035687c34)
NetworkManager[1035]: <info> [1586324867.0425] device (enp2s0): Activation: starting connection 'netplan-enp2s0' (7ea6f90b-3495-3533-948a-ef0035687c34)
NetworkManager[1035]: <info> [1586324867.0430] device (enp2s0): state change: disconnected -> prepare (reason 'none', sys-iface-state: 'managed')
NetworkManager[1035]: <info> [1586324867.0445] manager: NetworkManager state is now CONNECTING
NetworkManager[1035]: <info> [1586324867.0454] device (enp2s0): state change: prepare -> config (reason 'none', sys-iface-state: 'managed')
NetworkManager[1035]: <info> [1586324867.0470] device (enp2s0): state change: config -> ip-config (reason 'none', sys-iface-state: 'managed')
avahi-daemon[1011]: Joining mDNS multicast group on interface enp2s0.IPv4 with address 192.168.101.86.

The network config MAAS sends is for Ubuntu Server, which defaults ethernet
devices to be controlled by systemd-networkd. I suspect Desktop has a
different policy in place which puts all network devices under NetworkManager
control, so the config that MAAS sends does not fit this image. (Also, this
MAAS looks old: it is sending network config v1, not v2/netplan.)

network:
    config:
    -   id: enp2s0
        mac_address: 6c:2b:59:59:9d:ca
        mtu: 1500
        name: enp2s0
        subnets:
        -   address: 192.168.101.86/24
            dns_nameservers:
            - 192.168.101.1
      ...


Ryan Harper (raharper) wrote :

Please confirm if such a file is present in the desktop image.

Changed in cloud-init:
status: New → Incomplete
Alex Tu (alextu) wrote :

Thanks a lot! It works with the change from #7.

I can use a preseed/script for that change, but I am also wondering whether it is possible to make cloud-init natively compatible with an environment that has NetworkManager installed.

Ryan Harper (raharper) wrote :

Possibly; what's the origin of the image? Is there an "official" Ubuntu Desktop image? And if so, does it *only* have network-manager? or both? Can we get the tarball of the image you're deploying?

I suspect there is a netplan bug here for the network-manager renderer, where it does not add network-manager-wait-online.service into the correct targets (it should run before network-online.target, so that NM can bring up networking before the target).

Alex Tu (alextu) wrote :

I would like to revise #9. The steps of the original test I did were:
1. Trigger the MaaS deploy and wait until the first login after provisioning is done.
2. MaaS still shows "deploying" when I manually log in to the target machine.
3. Manually apply the change from #7, then reboot.
4. After this reboot the machine responded to MaaS, and the status changed to "deployed" on MaaS.

But when I inject the change from #7 into the MaaS image, so that the change is already in place before the first login, the network never comes up. So I cannot simply use the change from #7 so far.

The image originates from an OEM image, which is almost the same as stock Ubuntu Focal but with dell-recovery added. We let the deployed machine first boot into recovery mode to install the cloud-init related packages, so that the target machine can respond to the MaaS server on the next boot (an approach learned from the certification team's SRU process).

For the question "And if so, does it *only* have network-manager? or both?":
I just checked, and both networkd-dispatcher and network-manager are installed (the same as stock Ubuntu desktop).
 - dpkg -l : https://paste.ubuntu.com/p/5m5wsq9y5g/

I can share the tarball, but it is 3 GB. I'm not sure whether I can attach such a big file on Launchpad; I will try. BTW, by its nature dell-recovery blocks machines not produced by Dell, so I am afraid you cannot test it if you don't have a Dell laptop.

For a MaaS image converted from the official stock Ubuntu desktop, pwlars might have more experience than me, because the certification team has needed it for SRU as well.

Ryan Harper (raharper) wrote :

Thanks for the additional details.

From the logs and some local testing, the best choice for
getting things to work with the desktop image via MAAS is
to remove the 01-network-manager-all.yaml file before rebooting into the
target. This will allow networking to be fully managed by networkd.
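
As a rough sketch, the file removal described above could be scripted as follows. The function name `remove_nm_all_yaml` and the target-root argument are assumptions for illustration; the file path is the one that appears in the netplan debug output in this report.

```shell
#!/bin/sh
# Hypothetical helper: name and argument are assumptions for illustration.
# Removes the netplan file that hands all devices to NetworkManager, so that
# the remaining netplan config is rendered for systemd-networkd instead.
remove_nm_all_yaml() {
    target_root="$1"
    # -f keeps the call harmless when the file is already absent.
    rm -f "$target_root/etc/netplan/01-network-manager-all.yaml"
}
```

This would have to run against the target's root filesystem before the reboot into the deployed system, for example from a preseed or late-command style hook; which hook is appropriate depends on the deployment and is not specified here.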

We will likely need to file some bugs to get NetworkManager and cloud-init
and netplan to work together.

1) NetworkManager.service and NetworkManager-wait-online.service run too late
to work with cloud-init. This can be fixed by:

   a) Adding DefaultDependencies=no to both NM service files
   b) Having cloud-init.service also add After=NetworkManager-wait-online.service

   Note, both of these changes can be done via systemd drop-in files, but
   ideally they would go into the packages themselves.
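
A minimal sketch of the drop-in variant of (a) and (b), under stated assumptions: the function name `write_nm_ordering_dropins`, the root-directory argument, and the drop-in file names are all hypothetical. Whether this ordering is sufficient on its own was not confirmed in this report.

```shell
#!/bin/sh
# Hypothetical helper: function name, argument and drop-in file names are
# assumptions for illustration. Writes systemd drop-ins under ROOT so that
# (a) both NM units get DefaultDependencies=no and
# (b) cloud-init.service is ordered after NetworkManager-wait-online.service.
write_nm_ordering_dropins() {
    root="$1"
    base="$root/etc/systemd/system"

    for unit in NetworkManager.service NetworkManager-wait-online.service; do
        mkdir -p "$base/$unit.d" || return 1
        printf '[Unit]\nDefaultDependencies=no\n' \
            > "$base/$unit.d/10-no-default-deps.conf" || return 1
    done

    mkdir -p "$base/cloud-init.service.d" || return 1
    printf '[Unit]\nAfter=NetworkManager-wait-online.service\n' \
        > "$base/cloud-init.service.d/10-after-nm-wait-online.conf" || return 1
}
```

When written into an image, the drop-ins are picked up on the next boot; on a running system, `systemctl daemon-reload` would be needed first.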

2) The rendered NM configuration is not complete enough to auto-enable
the configuration.

cloud-init writes a netplan config file and calls netplan generate, which,
with the NetworkManager renderer, populates /run/NetworkManager with config:

# find /run/NetworkManager/ -type f
/run/NetworkManager/resolv.conf
/run/NetworkManager/no-stub-resolv.conf
/run/NetworkManager/devices/30
/run/NetworkManager/conf.d/10-globally-managed-devices.conf
/run/NetworkManager/system-connections/netplan-eth0.nmconnection

NetworkManager.service comes up and it can "see" an 'eth0' connection
and a 'netplan-eth0' connection. The 'eth0' connection is enabled.

# nmcli conn
NAME          UUID                                  TYPE      DEVICE
eth0          e8ffebad-dba7-4035-a9a8-29d9e004e6c1  ethernet  eth0
netplan-eth0  626dd384-8b3d-3690-9511-192b2c79b3fd  ethernet  --

eth0 has no IP (it's configured for dhcp)

After running netplan apply, we can see this change:
# netplan --debug apply
** (generate:476): DEBUG: 15:39:02.332: Processing input file /etc/netplan/01-network-manager-all.yaml..
** (generate:476): DEBUG: 15:39:02.332: starting new processing pass
** (generate:476): DEBUG: 15:39:02.332: Processing input file /etc/netplan/50-cloud-init.yaml..
** (generate:476): DEBUG: 15:39:02.332: starting new processing pass
** (generate:476): DEBUG: 15:39:02.332: eth0: setting default backend to 2
** (generate:476): DEBUG: 15:39:02.332: Configuration is valid
** (generate:476): DEBUG: 15:39:02.332: Generating output files..
** (generate:476): DEBUG: 15:39:02.332: networkd: definition eth0 is not for us (backend 2)
(generate:476): GLib-DEBUG: 15:39:02.332: posix_spawn avoided (fd close requested)
DEBUG:netplan generated networkd configuration changed, restarting networkd
DEBUG:netplan generated NM configuration changed, restarting NM
DEBUG:eth0 not found in {}
DEBUG:Merged config:
network:
  bonds: {}
  bridges: {}
  ethernets:
    eth0:
      dhcp4: true
      match:
        macaddress: 00:16:3e:0d:4c:a7
      set-name: eth0
  vlans: {}
  wifis: {}

DEBUG:Skipping non-physical interface: lo
DEBUG:device eth0 operstate is up, not changing
DEBUG:{}
DEBUG:netplan triggering .link rules for lo
DEBUG:netplan triggering .link rules for eth0

# nmcli conn
NAME          UUID                                  TYPE      DEVICE
netplan-eth0  626dd384-8b3d-3690-9511-192b2c79b3fd  ethernet  eth0

Looking into NM...


Rex Tsai (chihchun)
tags: added: oem-priority
Changed in oem-priority:
importance: Critical → High
James Falcon (falcojr) wrote :
Changed in cloud-init:
status: Incomplete → Expired