netplan dbus returns "false" in io.netplan.Netplan.Config.Set with networkd

Bug #1967084 reported by Michael Vogt
8
This bug affects 1 person
Affects Status Importance Assigned to Milestone
netplan
Fix Committed
Undecided
Unassigned
netplan.io (Ubuntu)
Fix Released
Undecided
Unassigned
Focal
Fix Released
Undecided
Unassigned

Bug Description

[Impact]
* netplan-dbus can time-out and return FALSE after 5 sec (irrespective of the given timeout parameter) on calling the io.netplan.Netplan.Config.Try DBus method
* This happens if a network device is set up, which never finishes configuring in systemd-networkd (e.g. an empty bridge, using DHCP).

[Test Plan]
$ cat ./repro.sh
set -ev
P=$(dbus-send --type=method_call --system --print-reply --dest=io.netplan.Netplan /io/netplan/Netplan io.netplan.Netplan.Config | grep "object path" | cut -d'"' -f2)
echo $P
dbus-send --type=method_call --system --print-reply --dest=io.netplan.Netplan $P io.netplan.Netplan.Config.Set string:bridges.br54.dhcp4=true string:90-foo
dbus-send --type=method_call --system --print-reply --dest=io.netplan.Netplan $P io.netplan.Netplan.Config.Try uint32:300

$ chmod +x repro.sh
$ time ./repro.sh
P=$(dbus-send --type=method_call --system --print-reply --dest=io.netplan.Netplan /io/netplan/Netplan io.netplan.Netplan.Config | grep "object path" | cut -d'"' -f2)
echo $P
/io/netplan/Netplan/config/K0PEL1
dbus-send --type=method_call --system --print-reply --dest=io.netplan.Netplan $P io.netplan.Netplan.Config.Set string:bridges.br54.dhcp4=true string:90-foo
method return time=1651225120.656093 sender=:1.48 -> destination=:1.49 serial=6 reply_serial=2
   boolean true
dbus-send --type=method_call --system --print-reply --dest=io.netplan.Netplan $P io.netplan.Netplan.Config.Try uint32:300
method return time=1651225126.820211 sender=:1.48 -> destination=:1.50 serial=8 reply_serial=2
   boolean true

real 0m6.352s
user 0m0.019s
sys 0m0.020s

=> Make sure the final io.netplan.Netplan.Config.Try() method call returns TRUE.

[Where problems could occur]
This change touches netplan's dbus daemon (netplan-dbus), so if anything goes wrong problems could occur interacting with the io.netplan.Netplan DBus interface, but should not affect the netplan generator at a system level.

[Other Info]
* The issue has already been fixed for Core20 in netplan.io 0.104-0ubuntu2~20.04.1+core1 via https://launchpad.net/~canonical-foundations/+archive/ubuntu/ubuntu-image – So no need to rush this as an SRU for Focal
* The fix is already staged for Focal SRU at: https://git.launchpad.net/~ubuntu-core-dev/netplan/+git/ubuntu/commit/?h=ubuntu/focal&id=0fa3404c9627d69adf647db9911cf36cdf9b90b2

=== Original description ===

As part of our snapd core integration testing we run the spread test for netplan on a UC20 PI device.

On this device our test fails with:
"""
2022-03-25 14:33:00 Error executing external:ubuntu-core-20-arm-32:tests/core/netplan-cfg (external:ubuntu-core-20-arm-32) :
-----
+ echo 'Getting version works'
Getting version works
+ snap get system system.network.netplan.network.version
+ MATCH '^2$'
+ echo 'Getting the full document works and it is valid json'
Getting the full document works and it is valid json
+ jq .
+ snap get -d system system.network.netplan
{
  "system.network.netplan": {
    "network": {
      "ethernets": {
        "eth0": {
          "dhcp4": true,
          "match": {
            "macaddress": "b8:27:eb:41:42:43"
          },
          "set-name": "eth0"
        }
      },
      "version": 2
    }
  }
}
+ echo 'Check that setting adding a br54 interface via netplan works'
Check that setting adding a br54 interface via netplan works
+ snap set system system.network.netplan.network.bridges.br54.dhcp4=true
error: cannot perform the following tasks:
- Run configure hook of "core" snap (run hook "configure": cannot try netplan config: no specific reason returned from netplan)
-----
"""

This error means that snapd called io.netplan.Netplan.Config.Set and got a "false" without an error from that call (c.f. https://github.com/snapcore/snapd/blob/master/overlord/configstate/configcore/netplan.go#L218)

[edit: now this also fails in GCE testing on amd64 and this looks like a regression as this used to work].

Michael Vogt (mvo)
description: updated
Revision history for this message
Michael Vogt (mvo) wrote :

We also observe this behavior now on UC20 systems that run with amd64, e.g. in this run: https://github.com/snapcore/snapd/runs/5876437750?check_suite_focus=true

The log looks like this:
"""
2022-04-07T22:28:41.1746417Z Apr 07 22:28:33 apr072148-447340 snapd[3454]: netplan.go:122: DEBUG: using netplan config snapshot /io/netplan/Netplan/config/UY2RJ1
2022-04-07T22:28:41.1747073Z Apr 07 22:28:34 apr072148-447340 snapd[3454]: netplan.go:122: DEBUG: using netplan config snapshot /io/netplan/Netplan/config/FKW9J1
2022-04-07T22:28:41.1747878Z Apr 07 22:28:34 apr072148-447340 snapd[3454]: retry.go:49: DEBUG: Retrying https://api.snapcraft.io/v2/snaps/info/core?architecture=amd64&fields=download, attempt 1, elapsed time=20.694µs
...
2022-04-07T22:28:41.1830683Z Apr 07 22:28:34 apr072148-447340 snapd[3454]: retry.go:61: DEBUG: The retry loop for https://api.snapcraft.io/api/v1/snaps/download/99T7MUlRhtI3U0QFgl5mXXESAiSwt776_12834.snap finished after 1 retries, elapsed time=229.430054ms, status: 200
2022-04-07T22:28:41.1831580Z Apr 07 22:28:34 apr072148-447340 snapd[3454]: netplan.go:228: DEBUG: store reachable before netplan changes: true (tried 0 times)
2022-04-07T22:28:41.1832435Z Apr 07 22:28:40 apr072148-447340 snapd[3454]: task.go:343: DEBUG: 2022-04-07T22:28:40Z ERROR run hook "configure": cannot try netplan config: no specific reason returned from netplan
"""

Revision history for this message
Michael Vogt (mvo) wrote :

It's pretty consistent now, here is a strace log from uc20 in google that shows the calls from dbus: https://paste.ubuntu.com/p/hmjDQYYc9S/

Revision history for this message
Michael Vogt (mvo) wrote :

Sergio reported that he sees "systemd-networkd[804]: Could not set hostname: Permission denied"

Revision history for this message
Michael Vogt (mvo) wrote :

I'm sorry, my initial comment was wrong - we do have a "false" path in "Try()" now, I was looking at old code.

Revision history for this message
Michael Vogt (mvo) wrote :

I was wrong and looked at old code. There is now a case where the dbus Try() returns "false". It was added in e19441c3 and has:
"""
+ /* wait for the /run/netplan/netplan-try.ready stamp file to appear */
+ guint poll_timeout = 500;
+ if (seconds > 0 && seconds < 5)
+ poll_timeout = seconds * 100;
+ /* Timeout after up to 5 sec of waiting for the stamp file */
+ for (int i = 0; i < poll_timeout; i++) {
+ if (stat(netplan_try_stamp, &buf) == 0)
+ break;
+ usleep(1000 * 10);
+ }
+ if (stat(netplan_try_stamp, &buf) != 0) {
+ g_debug("cannot find %s stamp file", netplan_try_stamp);
+ return sd_bus_reply_method_return(m, "b", false);
+ }
"""

Maybe 5s is a bit low on slow systems, maybe something else is doing on. I also wonder if netplan should report some sort of error in this case to provide more context?

description: updated
Revision history for this message
Simon Chopin (schopin) wrote :

I'd need to ask slyon as to why we're having an upper bound to the timeout value, it's oddly specific, but in the mean time it's worth trying out a version where we can get the full 30s as specified by the snapd code.

Revision history for this message
Michael Vogt (mvo) wrote :

Fwiw I was able to reproduce this on core20 via:
"""
$ dbus-send --type=method_call --system --print-reply --dest=io.netplan.Netplan /io/netplan/Netplan io.netplan.Netplan.Config
[got the /io/netplan/Netplan/config/4ZMNK1 path here]

$ dbus-send --type=method_call --system --print-reply --dest=io.netplan.Netplan /io/netplan/Netplan/config/4ZMNK1 io.netplan.Netplan.Config.Set string:"bridges.br54.dhcp4=true" string:"90-foo"
[got true here]

$ dbus-send --type=method_call --system --print-reply --dest=io.netplan.Netplan /io/netplan/Netplan/config/4ZMNK1 io.netplan.Netplan.Config.Try uint32:300
[got false here]
"""

Revision history for this message
Michael Vogt (mvo) wrote :

I managed to capture a debug log from netplan when run via the ddbus comments in #8. It was a bit tricky (bind mounted utils.py and set debug = True there) because netplan is coming from a RO location and no way to use the environment to enable debugging. Unfortunately the log gives me not many hints:
"""
qemu:ubuntu-core-20-64 .../tests/core/netplan-cfg# dbus-send --type=method_call --system --print-reply --dest=io.netplan.Netplan /io/netplan/Netplan/config/PCJPK1 io.netplan.Netplan.Config.Try uint32:300
DEBUG:all-en not found in {}
DEBUG:all-eth not found in {'all-en': {'match': {'name': 'en*'}, 'dhcp4': True}}
DEBUG:br54 not found in {}
DEBUG:Merged config:
network:
  bridges:
    br54:
      dhcp4: true
  ethernets:
    all-en:
      dhcp4: true
      match:
        name: en*
    all-eth:
      dhcp4: true
      match:
        name: eth*
  version: 2

DEBUG:New interfaces: set()
DEBUG:netplan generated networkd configuration changed, reloading networkd
DEBUG:all-en not found in {}
DEBUG:all-eth not found in {'all-en': {'match': {'name': 'en*'}, 'dhcp4': True}}
DEBUG:br54 not found in {}
DEBUG:Merged config:
network:
  bridges:
    br54:
      dhcp4: true
  ethernets:
    all-en:
      dhcp4: true
      match:
        name: en*
    all-eth:
      dhcp4: true
      match:
        name: eth*
  version: 2

DEBUG:no netplan generated NM configuration exists
DEBUG:all-en not found in {}
DEBUG:all-eth not found in {'all-en': {'match': {'name': 'en*'}, 'dhcp4': True}}
DEBUG:br54 not found in {}
DEBUG:Merged config:
network:
  bridges:
    br54:
      dhcp4: true
  ethernets:
    all-en:
      dhcp4: true
      match:
        name: en*
    all-eth:
      dhcp4: true
      match:
        name: eth*
  version: 2

DEBUG:Link changes: {}
DEBUG:all-en not found in {}
DEBUG:all-eth not found in {'all-en': {'match': {'name': 'en*'}, 'dhcp4': True}}
DEBUG:Merged config:
network:
  ethernets:
    all-en:
      dhcp4: true
      match:
        name: en*
    all-eth:
      dhcp4: true
      match:
        name: eth*
  version: 2

DEBUG:netplan triggering .link rules for lo
DEBUG:netplan triggering .link rules for ens3
DEBUG:netplan triggering .link rules for br54
DEBUG:all-en not found in {}
DEBUG:all-eth not found in {'all-en': {'match': {'name': 'en*'}, 'dhcp4': True}}
DEBUG:br54 not found in {}
DEBUG:Merged config:
network:
  bridges:
    br54:
      dhcp4: true
  ethernets:
    all-en:
      dhcp4: true
      match:
        name: en*
    all-eth:
      dhcp4: true
      match:
        name: eth*
  version: 2

method return time=1649661702.221874 sender=:1.40 -> destination=:1.45 serial=12 reply_serial=2
   boolean false
"""
but maybe it's more useful for you?

Revision history for this message
Michael Vogt (mvo) wrote :

Here is a reproducer script
"""
#!/bin/sh

set -ex

objpath=$(dbus-send --type=method_call --system --print-reply --dest=io.netplan.Netplan /io/netplan/Netplan io.netplan.Netplan.Config | grep "object path" | cut -d'"' -f2)
dbus-send --type=method_call --system --print-reply --dest=io.netplan.Netplan "$objpath" io.netplan.Netplan.Config.Set string:"bridges.br54.dhcp4=true" string:"90-foo"
dbus-send --type=method_call --system --print-reply --dest=io.netplan.Netplan "$objpath" io.netplan.Netplan.Config.Try uint32:300
"""

Revision history for this message
Michael Vogt (mvo) wrote :

So it looks like the timeout in dbus.c that is hardcoded to 5s is the problem. I patched dbus.c to use a 15s timeout (values was arbitrary) and run the reproducer I get:
""""
# time ./repo.sh
+ cut -d" -f2
+ grep object path
+ dbus-send --type=method_call --system --print-reply --dest=io.netplan.Netplan /io/netplan/Netplan io.netplan.Netplan.Config
+ objpath=/io/netplan/Netplan/config/ENTPK1
+ dbus-send --type=method_call --system --print-reply --dest=io.netplan.Netplan /io/netplan/Netplan/config/ENTPK1 io.netplan.Netplan.Config.Set string:bridges.br54.dhcp4=true string:90-foo
method return time=1649663748.713135 sender=:1.93 -> destination=:1.95 serial=6 reply_serial=2
   boolean true
+ dbus-send --type=method_call --system --print-reply --dest=io.netplan.Netplan /io/netplan/Netplan/config/ENTPK1 io.netplan.Netplan.Config.Try uint32:300
method return time=1649663754.663485 sender=:1.93 -> destination=:1.96 serial=8 reply_serial=2
   boolean true

real 0m6.054s
user 0m0.008s
sys 0m0.001s
"""

So it seems like the timeout just needs to be bigger, probably a lot bigger given that it takes 6s on a reasonable fast amd64 VM. This code will run on pi2 like ARM or RISCV systems :)

Revision history for this message
Michael Vogt (mvo) wrote :

Fwiw, it might be interessting to see where this time is actually spend - 6s is long for a relatively small config, maybe there is a bug hidding here somewhere that is just masked when we increase the timeout?

Revision history for this message
Michael Vogt (mvo) wrote :

The (super naive) diff I used to increase the timeout:
"""
diff --git a/src/dbus.c b/src/dbus.c
index 4d9b17a..78cdafe 100644
--- a/src/dbus.c
+++ b/src/dbus.c
@@ -512,8 +512,8 @@ method_try(sd_bus_message *m, void *userdata, sd_bus_error *ret_error)
         // LCOV_EXCL_STOP

     /* wait for the /run/netplan/netplan-try.ready stamp file to appear */
- guint poll_timeout = 500;
- if (seconds > 0 && seconds < 5)
+ guint poll_timeout = 1500;
+ if (seconds > 0 && seconds < 15)
         poll_timeout = seconds * 100;
     /* Timeout after up to 5 sec of waiting for the stamp file */
     for (int i = 0; i < poll_timeout; i++) {
"""
(just for completness)

Revision history for this message
Michael Vogt (mvo) wrote :

I can also reproduce this on a minimal 20.04 VM image (created with autopkgtest-buildvm). Running the reproducer script above triggers the same "false" return so it seems to be related to the renderer. I was not able to reproduce it on my desktop with network-manager.

Revision history for this message
Michael Vogt (mvo) wrote :

The deeper reason appears to be:

https://git.launchpad.net/ubuntu/+source/netplan.io/tree/debian/patches/0006-cli-apply-give-some-extra-time-for-networkctl-reload.patch?h=applied/ubuntu/focal-updates#n25

It was a bit hard to find because for the "networkctl" calls in strace there was no corresponding code in the github repo. But after inspecting the package the issue seems to be clear. So there is a 5s timeout that is hit here because the networkctl output is:
"""
qemu:ubuntu-core-20-64 .../tests/core/netplan-cfg# networkctl
IDX LINK TYPE OPERATIONAL SETUP
  1 lo loopback carrier unmanaged
  2 ens3 ether routable configured
  3 br54 bridge degraded configuring

3 links listed.
"""
I don't know if our test-case is pathological in the sense that nooone would create a br54 interface and then not use it or if this is a real bug. But in any case, that seems to be the root issue.

Michael Vogt (mvo)
summary: - netplan dbus returns "false" in io.netplan.Netplan.Config.Set
+ netplan dbus returns "false" in io.netplan.Netplan.Config.Set with
+ networkd
Simon Chopin (schopin)
tags: added: fr-2238
Revision history for this message
Lukas Märdian (slyon) wrote :
Revision history for this message
Lukas Märdian (slyon) wrote (last edit ):

Phew... looks like we have plenty of layered problems here:

1) calling "netplan set network.bridges.br54.dhcp4=true" will create an empty bridge that will never get its DHCP4 configuration, thus will stay stuck in the "configuring" state of "networkctl". Having empty bridges is not really a production setup, and for test suites a simple workaround would be to define "dhcp4=false" instead, thus it will not trigger this problem.

2) networkctl reload/reconfigure is not super mature in Focal/systemd v245 yet and there are some races that happen when netplan calls "networkctl reload && networkctl reconfigure" right after each other during its "netplan apply" call. I started going down this rabbit hole, patching systemd-networkd, but this just leads to more problems. IMO we should not heavily patch systemd-networkd, as long as it works "good enough". (It works correctly on newer versions of systemd)

3) netplan-dbus does not apply the user's timeout parameter while waiting for 'netplan try' to execute. this should be fixed by this upstream PR: https://github.com/canonical/netplan/pull/271

4) The distro patch applied to the Focal package (https://git.launchpad.net/ubuntu/+source/netplan.io/tree/debian/patches/0006-cli-apply-give-some-extra-time-for-networkctl-reload.patch?h=applied/ubuntu/focal-updates) implements a 5 second timeout (due to the networkctl reload/reconfigure race-condition, mentioned in (2)) that is in conflict with the 5 seconds timeout in netplan-dbus (from (3)). This only applies to Focal.

So my proposal to fix this is:
- change the test-suite to call 'netplan set bridges.br54.dhcp4=false'
- The upstream patch from PR#271 was already applied to the netplan packaging for Core20, and should be included with the next Focal SRU (and potentially the other series, too – but with less urgency, as the conflicting distro patch is not in place there).

TL;DR: The root cause is a distro patch that was included in Focal, but we need this patch to work around systemd-networkd race conditions in systemd v245, so should implement other means to avoid this issue in Focal. This issue does not affect newer series of Ubuntu, due to improved systemd-networkd.

Changed in netplan:
status: New → Fix Committed
Changed in netplan.io (Ubuntu):
status: New → Fix Released
Lukas Märdian (slyon)
description: updated
Revision history for this message
Michael Vogt (mvo) wrote :

The snapd test change is up in https://github.com/snapcore/snapd/pull/11810

Revision history for this message
Steve Langasek (vorlon) wrote : Please test proposed package

Hello Michael, or anyone else affected,

Accepted netplan.io into focal-proposed. The package will build now and be available at https://launchpad.net/ubuntu/+source/netplan.io/0.104-0ubuntu2~20.04.2 in a few hours, and then in the -proposed repository.

Please help us by testing this new package. See https://wiki.ubuntu.com/Testing/EnableProposed for documentation on how to enable and use -proposed. Your feedback will aid us getting this update out to other Ubuntu users.

If this package fixes the bug for you, please add a comment to this bug, mentioning the version of the package you tested, what testing has been performed on the package and change the tag from verification-needed-focal to verification-done-focal. If it does not fix the bug for you, please add a comment stating that, and change the tag to verification-failed-focal. In either case, without details of your testing we will not be able to proceed.

Further information regarding the verification process can be found at https://wiki.ubuntu.com/QATeam/PerformingSRUVerification . Thank you in advance for helping!

N.B. The updated package will be released to -updates after the bug(s) fixed by this package have been verified and the package has been in -proposed for a minimum of 7 days.

Changed in netplan.io (Ubuntu Focal):
status: New → Fix Committed
tags: added: verification-needed verification-needed-focal
Revision history for this message
Ubuntu SRU Bot (ubuntu-sru-bot) wrote : Autopkgtest regression report (netplan.io/0.104-0ubuntu2~20.04.2)

All autopkgtests for the newly accepted netplan.io (0.104-0ubuntu2~20.04.2) for focal have finished running.
The following regressions have been reported in tests triggered by the package:

initramfs-tools/unknown (s390x)

Please visit the excuses page listed below and investigate the failures, proceeding afterwards as per the StableReleaseUpdates policy regarding autopkgtest regressions [1].

https://people.canonical.com/~ubuntu-archive/proposed-migration/focal/update_excuses.html#netplan.io

[1] https://wiki.ubuntu.com/StableReleaseUpdates#Autopkgtest_Regressions

Thank you!

Revision history for this message
Lukas Märdian (slyon) wrote :

I've tested netplan.io 0.104-0ubuntu2~20.04.2 from focal-proposed. The reproducer script passes the test.

root@ff-sru:~# apt list *netplan*
Listing... Done
libnetplan-dev/focal-proposed 0.104-0ubuntu2~20.04.2 amd64
libnetplan0/focal-proposed,now 0.104-0ubuntu2~20.04.2 amd64 [installed,automatic]
netplan.io/focal-proposed,now 0.104-0ubuntu2~20.04.2 amd64 [installed]
root@ff-sru:~# time ./repro.sh
P=$(dbus-send --type=method_call --system --print-reply --dest=io.netplan.Netplan /io/netplan/Netplan io.netplan.Netplan.Config | grep "object path" | cut -d'"' -f2)
echo $P
/io/netplan/Netplan/config/GTLTO1
dbus-send --type=method_call --system --print-reply --dest=io.netplan.Netplan $P io.netplan.Netplan.Config.Set string:bridges.br54.dhcp4=true string:90-foo
method return time=1657700713.032547 sender=:1.12 -> destination=:1.13 serial=6 reply_serial=2
   boolean true
dbus-send --type=method_call --system --print-reply --dest=io.netplan.Netplan $P io.netplan.Netplan.Config.Try uint32:300
method return time=1657700718.730106 sender=:1.12 -> destination=:1.14 serial=8 reply_serial=2
   boolean true

real 0m5.869s
user 0m0.007s
sys 0m0.007s

tags: added: verification-done verification-done-focal
removed: verification-needed verification-needed-focal
Revision history for this message
Chris Halse Rogers (raof) wrote : Update Released

The verification of the Stable Release Update for netplan.io has completed successfully and the package is now being released to -updates. Subsequently, the Ubuntu Stable Release Updates Team is being unsubscribed and will not receive messages about this bug report. In the event that you encounter a regression using the package from -updates please report a new bug using ubuntu-bug and tag the bug report regression-update so we can easily find any regressions.

Revision history for this message
Launchpad Janitor (janitor) wrote :

This bug was fixed in the package netplan.io - 0.104-0ubuntu2~20.04.2

---------------
netplan.io (0.104-0ubuntu2~20.04.2) focal; urgency=medium

  * Cherry pick d/p/dbus-Remove-the-upper-limit-on-try-timeout.patch
    (LP: #1967084)
  * Cherry-pick fix for rendering WPA3 password (8934a1b), LP: #1975576
    + d/p/0010-nm-fix-rendering-of-password-for-unknown-passthrough.patch
  * Backport offloading tristate patches (LP: #1956264)
    + d/p/0003-Add-tristate-type-for-offload-options-LP-1956264-270.patch
    + d/p/0004-tests-ethernets-fix-autopkgtest-with-alternating-def.patch
    + d/t/control: add 'ethtool' test-dep for link offloading tests

 -- Lukas Märdian <email address hidden> Wed, 29 Jun 2022 17:54:23 +0200

Changed in netplan.io (Ubuntu Focal):
status: Fix Committed → Fix Released
To post a comment you must log in.
This report contains Public information  
Everyone can see this information.

Other bug subscribers