Setting "optional: true" to overcome the timeout "Job systemd-networkd-wait-online" no longer works with latest noble image

Bug #2060311 reported by Frank Heimes
This bug affects 6 people
Affects                  Status         Importance  Assigned to    Milestone
Netplan                  Fix Committed  High        Lukas Märdian
Ubuntu on IBM z Systems  Fix Released   Medium      Unassigned
netplan.io (Ubuntu)      Fix Released   High        Lukas Märdian
  Noble                  Fix Released   High        Lukas Märdian
systemd (Ubuntu)         Invalid        High        Unassigned
  Noble                  Invalid        High        Unassigned

Bug Description

Especially on s390x (but not limited to s390x) it's often the case that a system has network devices that are not necessarily connected during boot-up and one gets such a 2 min timeout:
"Job systemd-networkd-wait-online. Start running (1min 59s / no limit)"

In the past I could avoid that by setting "optional: true" post-install (not perfect, but it worked),
but this no longer seems to work using the latest noble ISO image (Apr 5th).

Setting 'optional: true' in /etc/netplan/50-cloud-init.yaml looks like this for me:

# This file is generated from information provided by the datasource. Changes
# to it will not persist across an instance reboot. To disable cloud-init's
# network configuration capabilities, write a file
# /etc/cloud/cloud.cfg.d/99-disable-network-config.cfg with the following:
# network: {config: disabled}
network:
    ethernets:
        enP1p0s0:
            optional: true
            dhcp4: true
        enP1p0s0d1:
            optional: true
            dhcp4: true
        enP2p0s0:
            optional: true
            dhcp4: true
        enP2p0s0d1:
            optional: true
            dhcp4: true
        encc000: {}
    version: 2
    vlans:
        encc000.2653:
            addresses:
            - 10.11.12.15/24
            gateway4: 10.11.12.1
            id: 2653
            link: encc000
            nameservers:
                addresses:
                - 10.11.12.1

... can be set fine (also --dry-run does not moan, except about dhcp4).

This worked in the past on noble, but also on older Ubuntu releases like jammy.

Frank Heimes (fheimes) wrote :
Changed in ubuntu-z-systems:
importance: Undecided → Medium
summary: - Setting optional: true to bypass timeout "Job systemd-networkd-wait-
- online" does no longer work with latest noble image
+ Setting "optional: true" to overcome the timeout "Job systemd-networkd-
+ wait-online" no longer works with latest noble image
Nick Rosbrook (enr0n) wrote :

There has been confusion in this area in the past. But the stance of upstream is that RequiredForOnline=no => "interface is _ignored_ by systemd-networkd-wait-online". Hence, if every interface is optional: true, it is expected that systemd-networkd-wait-online will time out.

I have recently suggested that netplan move to a strategy where, if interface ethX is optional: false, we enable <email address hidden>. And, if ethX has optional: true, we do nothing (namely do _not_ set RequiredForOnline=no). We would then probably need to either disable systemd-networkd-wait-online.service by default, or change the default flags.
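
For anyone wanting to check this on a running system, here is a minimal sketch that lists the networkd units carrying RequiredForOnline=no, i.e. the links that systemd-networkd-wait-online ignores. It assumes Netplan's rendered units live under /run/systemd/network (the usual runtime location, adjust if yours differs); the function name `list_ignored_links` is made up for illustration.

```shell
#!/bin/sh
# Print every networkd .network file in a directory that contains
# RequiredForOnline=no, i.e. links that wait-online will ignore.
# The default directory is an assumption: it is where Netplan normally
# renders its networkd configuration at runtime.
list_ignored_links() {
    dir="${1:-/run/systemd/network}"
    for f in "$dir"/*.network; do
        # Skip the unexpanded glob when the directory has no matching files.
        [ -e "$f" ] || continue
        if grep -q '^RequiredForOnline=no' "$f"; then
            printf '%s\n' "$f"
        fi
    done
}
```

If every rendered unit shows up in that list, wait-online has no interface left that it considers, which matches the timeout described above.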

tags: added: rls-nn-incoming
Frank Heimes (fheimes) wrote :

I see (deep in my mind I remember that such a discussion happened or at least started somewhere).

Just note that one interface is still _not_ optional, here in my case: encc000

And the behavior changed recently: with the above config I did not hit the timeout in the past (even with earlier noble daily images).

Nick Rosbrook (enr0n) wrote :

Okay. Looking at the logs though, it doesn't seem that encc000 ever gets configured. If it's optional: false, then it has RequiredForOnline=yes. If it's not getting configured, then it's also expected that systemd-networkd-wait-online times out. Or am I missing something else?

If we don't want network-online.target to block boot, then we need to change cloud-init's systemd units. The dependency ordering in their units is what leads to the possibility of boot blocking on network-online.target.

Lukas Märdian (slyon) wrote :

Also see bug #2036358

Nick Rosbrook (enr0n)
Changed in netplan:
importance: Undecided → Critical
importance: Critical → High
Changed in systemd (Ubuntu):
importance: Undecided → High
Launchpad Janitor (janitor) wrote :

Status changed to 'Confirmed' because the bug affects multiple users.

Changed in systemd (Ubuntu):
status: New → Confirmed
Ubuntu QA Website (ubuntuqa) wrote :

This bug has been reported on the Ubuntu ISO testing tracker.

A list of all reports related to this bug can be found here:
http://iso.qa.ubuntu.com/qatracker/reports/bugs/2060311

tags: added: iso-testing
Lukas Märdian (slyon)
Changed in netplan:
assignee: nobody → Lukas Märdian (slyon)
tags: added: foundations-todo
removed: rls-nn-incoming
Lukas Märdian (slyon) wrote :

I started some work to help with this here: https://github.com/canonical/netplan/pull/455

Changed in netplan:
status: New → In Progress
Changed in systemd (Ubuntu):
milestone: none → ubuntu-24.04
Changed in netplan.io (Ubuntu):
milestone: none → ubuntu-24.04
Lukas Märdian (slyon) wrote :

Can somebody please confirm that Netplan from this PPA fixes the problem? https://launchpad.net/~slyon/+archive/ubuntu/lp2060311/+packages

Heinrich Schuchardt (xypron) wrote :

On riscv64 preinstalled images we have

$ sudo cat 50-cloud-init.yaml
# This file is generated from information provided by the datasource. Changes
# to it will not persist across an instance reboot. To disable cloud-init's
# network configuration capabilities, write a file
# /etc/cloud/cloud.cfg.d/99-disable-network-config.cfg with the following:
# network: {config: disabled}
network:
    ethernets:
        zz-all-en:
            dhcp4: true
            match:
                name: en*
            optional: true
        zz-all-eth:
            dhcp4: true
            match:
                name: eth*
            optional: true
    version: 2

Before the change cloud-init finds an IPv4 address. With the change cloud-init sees no IPv4 address. So it seems that 'optional: true' is observed now.

After login network is available with IPv4 address.

How can we ensure that at least one Ethernet port is set up?

Lukas Märdian (slyon) wrote :

Thanks for testing!

Heinrich confirmed offline that the IPv4 address will come online asynchronously, as expected for an "optional: true" definition.

Launchpad Janitor (janitor) wrote :

Status changed to 'Confirmed' because the bug affects multiple users.

Changed in netplan.io (Ubuntu):
status: New → Confirmed
Lukas Märdian (slyon) wrote :

cloud-init seems to order After=systemd-networkd-wait-online.service AND Before=network-online.target. So the proposed solution is a no-go.

Lukas Märdian (slyon) wrote :

New attempt that should be transparent to cloud-init: we're just creating a /run/systemd/system/systemd-networkd-wait-online.service.d/10-netplan.conf override config, specifying the non-optional interfaces as "/lib/systemd/systemd-networkd-wait-online -i eth0 -i eth2 -i ..", while keeping the overall service in place.

https://github.com/canonical/netplan/pull/456

Please test the ~ppa3 build from https://launchpad.net/~slyon/+archive/ubuntu/lp2060311/+packages
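
As a rough sketch of the override described above (the exact file Netplan writes may well differ), the drop-in could be rendered like this. The function name `render_wait_online_dropin` and the interface names are placeholders; only the `-i` option and the binary path are taken from the description.

```shell
#!/bin/sh
# Render a drop-in for systemd-networkd-wait-online.service that waits only
# for the interfaces given as arguments. The layout is an assumption based
# on the description above, not a copy of Netplan's generated output.
render_wait_online_dropin() {
    args=""
    for iface in "$@"; do
        args="$args -i $iface"
    done
    printf '[Service]\n'
    # An empty ExecStart= clears the command inherited from the main unit.
    printf 'ExecStart=\n'
    printf 'ExecStart=/lib/systemd/systemd-networkd-wait-online%s\n' "$args"
}

# Example: wait for two non-optional interfaces (placeholder names).
render_wait_online_dropin eth0 eth2
```

The empty `ExecStart=` line resets the command list from the packaged unit, so only the interface-specific invocation runs.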

Talha Can Havadar (tchavadar) wrote :

Hi Lukas,
I tested the package in your PPA with the following configuration:

```
network:
        ethernets:
            all:
                dhcp4: true
                match:
                    name: e*
                optional: true
        version: 2
```

With the version `1.0-2build1` it hit the timeout even though I set all the interfaces on my DUT(arm64) optional.

Here is the journalctl log for the relevant occurrence:
```
Apr 16 11:47:11 kria systemd[1]: Starting systemd-networkd-wait-online.service - Wait for Network to be Configured...
Apr 16 11:49:11 kria systemd-networkd-wait-online[1055]: Timeout occurred while waiting for network connectivity.
Apr 16 11:49:11 kria systemd[1]: systemd-networkd-wait-online.service: Main process exited, code=exited, status=1/FAILURE
Apr 16 11:49:11 kria systemd[1]: systemd-networkd-wait-online.service: Failed with result 'exit-code'.
Apr 16 11:49:11 kria systemd[1]: Failed to start systemd-networkd-wait-online.service - Wait for Network to be Configured.
```

But after installing version `1.0-2ubuntu1~ppa3`, `systemd-networkd-wait-online` is not triggered, since all interfaces are set to optional. Therefore there is no timeout:
```
ubuntu@kria:~$ apt policy netplan.io
netplan.io:
  Installed: 1.0-2ubuntu1~ppa3
  Candidate: 1.0-2ubuntu1~ppa3
  Version table:
 *** 1.0-2ubuntu1~ppa3 500
        500 https://ppa.launchpadcontent.net/slyon/lp2060311/ubuntu noble/main arm64 Packages
        100 /var/lib/dpkg/status
     1.0-2build1 500
        500 http://ports.ubuntu.com/ubuntu-ports noble/main arm64 Packages
```

```
ubuntu@kria:~$ journalctl -u systemd-networkd-wait-online --no-pager -b
-- No entries --
```

It looks like the package you have fixes the issue.

Lukas Märdian (slyon) wrote :

Thanks for testing! There is a failure in your systemd-networkd-wait-online.service logs:
> systemd-networkd-wait-online.service: Main process exited, code=exited, status=1/FAILURE

I fixed this failure in the ~ppa4 version. Could you confirm the failure is gone with that newer version and still no delay?

Lukas Märdian (slyon) wrote :

With version ~ppa5 we're now skipping the activation of systemd-networkd-wait-online.service in case all Netplan interfaces are defined to be "optional: true", using "ConditionPathIsSymbolicLink=" on Netplan's s-n-wait-online.service enablement link, that's only set when we have non-optional interfaces.

All the magic happens in /run/systemd/system/systemd-networkd-wait-online.service.d/10-netplan.conf override under the generic "systemd-networkd-wait-online.service" umbrella.

Code is up-to-date in https://github.com/canonical/netplan/pull/456
Test builds in PPA: https://launchpad.net/~slyon/+archive/ubuntu/lp2060311/+packages
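
`ConditionPathIsSymbolicLink=` is satisfied only when the given path exists and is a symlink, so its effect can be approximated in shell with `[ -L ... ]`. A small sketch, with the link path passed as an argument since it is not spelled out here; `would_wait_online_run` is a hypothetical helper name.

```shell
#!/bin/sh
# Mimic the check behind ConditionPathIsSymbolicLink=: succeed only if the
# given path exists and is a symbolic link. The path is supplied by the
# caller because the exact enablement link is not named in this comment.
would_wait_online_run() {
    if [ -L "$1" ]; then
        echo "condition met: unit would start"
    else
        echo "condition not met: unit would be skipped"
        return 1
    fi
}
```

When the condition is not met, systemd skips the unit rather than failing it, which would explain why no wait-online job appears at boot when all interfaces are optional.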

Frank Heimes (fheimes) wrote :

I gave ~ppa5 a try on my s390x system.

If I set all interfaces to "optional: true" (incl. encc000), except encc000.2653, I no longer hit the timeout. But if I additionally UNset "optional: true" for encc000, I run into the timeout again.

In the past it was okay to NOT have "optional: true" set for both: encc000 and encc000.2653 (and I found that logical, since both interfaces are needed in a VLAN context).

Knowing now what's missing, I could live with that (even if it's a change in behavior).

Lukas Märdian (slyon) wrote :

> In the past it was okay to NOT have "optional: true" set for both: encc000 and encc000.2653 (and I found that logical, since both interfaces are needed in a VLAN context).
>
> Knowing now what's missing, I could live with that (even if it's a change in behavior).

Interesting.. I suspect some infrastructure changes here. The default for systemd-networkd-wait-online is to wait on the "degraded" operational state, i.e. having a link-local address assigned to the interface.

If there is SLAAC or IPv6 RA enabled on the other side of "encc000", the interface might come online without "optional: true". But lacking such a setup, it would be stuck in the "configuring" state, as it waits for an (IPv6) link-local address. Marking it as "optional: true" helps in that case with our recent changes.
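
To spot a link stuck in "configuring", the `networkctl list` output can be filtered. A minimal sketch, assuming the usual column order (IDX LINK TYPE OPERATIONAL SETUP); `links_still_configuring` is a made-up helper name.

```shell
#!/bin/sh
# Read `networkctl --no-legend list` output on stdin and print the names of
# links whose SETUP column is still "configuring".
# Assumed column order: IDX LINK TYPE OPERATIONAL SETUP.
links_still_configuring() {
    awk '$5 == "configuring" { print $2 }'
}

# Usage on a live system:
#   networkctl --no-legend list | links_still_configuring
```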

Lukas Märdian (slyon)
Changed in systemd (Ubuntu Noble):
status: Confirmed → Invalid
tags: added: block-proposed update-excuse
removed: foundations-todo
Changed in netplan.io (Ubuntu Noble):
status: Confirmed → In Progress
assignee: nobody → Lukas Märdian (slyon)
Lukas Märdian (slyon) wrote :

Extensive testing, from different teams and individuals, has happened in this bug report and especially in the upstream PR https://github.com/canonical/netplan/pull/456. This is in addition to the newly added build-time tests and autopkgtests.

This change affects the "systemd-networkd-wait-online" behavior via "/run/systemd/system/systemd-networkd-wait-online.service.d/10-netplan.conf" and should not affect core-networking. If anything goes wrong the change can easily be disabled (at runtime):
$ mkdir -p /etc/systemd/system/systemd-networkd-wait-online.service.d
$ ln -s /dev/null /etc/systemd/system/systemd-networkd-wait-online.service.d/10-netplan.conf

I went ahead and uploaded this as https://launchpad.net/ubuntu/+source/netplan.io/1.0-2ubuntu1

This bug is still "block-proposed", to give a last chance to anybody who feels like there is still a blocker in this upload.

Changed in netplan.io (Ubuntu Noble):
importance: Undecided → High
Lukas Märdian (slyon)
Changed in netplan.io (Ubuntu Noble):
status: In Progress → Fix Committed
Lukas Märdian (slyon) wrote :

I didn't hear about any blocker and in the "Foundation Leadership Sync" meeting people were overall positive about the change. I'm dropping the "block-proposed" tag.

tags: removed: block-proposed
Chad Smith (chad.smith) wrote :

Thanks Lukas: +1 on this changeset not degrading early boot install scenarios where systemd services are ordered After=systemd-networkd-wait-online.service

Preliminary testing on the Azure platform looks good. With accelerated networking enabled, dual-NIC tests confirm that the primary network devices are correctly configured and brought up, with systemd-networkd-wait-online blocking while it awaits the matching hv_netvsc devices.
The tests confirm that systemd-networkd-wait-online.service properly awaits the individual devices. Also, initial cloud-init integration test runs on Azure noble with netplan.io 1.0-2ubuntu1 have not exposed any regressions so far.

cloud-init Ec2 testing of this same proposed netplan.io package doesn't show any degradation of behavior in detecting Ec2's IMDS in early boot.

Launchpad Janitor (janitor) wrote :

This bug was fixed in the package netplan.io - 1.0-2ubuntu1

---------------
netplan.io (1.0-2ubuntu1) noble; urgency=medium

  * debian/patches/lp2060311/, LP: #2060311
    Fix wait-online via s-n-wait-online.service.d/10-netplan.conf.
    Using an override config file for systemd-networkd-wait-online.service,
    specifing all the individual, non-optional interfaces to wait for and not
    enabling the s-n-wait-online.service at all when all interfaces are
    optional.
  * d/libnetplan1.symbols: Update for new (private) symbol

 -- Lukas Märdian <email address hidden> Thu, 18 Apr 2024 14:07:08 +0200

Changed in netplan.io (Ubuntu Noble):
status: Fix Committed → Fix Released
Frank Heimes (fheimes)
Changed in ubuntu-z-systems:
status: New → Fix Released
Lukas Märdian (slyon)
Changed in netplan:
status: In Progress → Fix Committed