netplan crashes on UbuntuCore with the network-manager snap installed

Bug #1959570 reported by Michael Vogt
22
This bug affects 3 people
Affects Status Importance Assigned to Milestone
netplan
Fix Released
Undecided
Lukas Märdian
netplan.io (Ubuntu)
Fix Released
High
Unassigned
Focal
In Progress
Undecided
Unassigned
Jammy
Won't Fix
Undecided
Unassigned
Kinetic
Invalid
Undecided
Unassigned
Lunar
Fix Released
High
Unassigned

Bug Description

[ Impact ]

 * When using the netplan-dbus API on Ubuntu Core and the network-manager renderer (NM snap) the "netplan-dbus" application will crash. Leaving the device unreachable.

 * Netplan's CLI is unable to find the network-manager.nmcli binary from the NetworkManager snap, as it's not in netplan-dbus's PATH, therefore it cannot properly restart NM's connections to re-activate the network after changes were applied.

[ Test Plan ]

==============================
Boot an Ubuntu Core 20 system, which this fix as part of the core20 base snap
==============================
$ snap install network-manager --channel=20/stable
$ nmcli # VERIFY br54 is not there
ens3: connected to netplan-ens3
        "ens3"
        ethernet (virtio_net), 52:54:00:12:34:56, hw, mtu 1500
        ip4 default
        inet4 10.0.2.15/24
        route4 0.0.0.0/0
        route4 10.0.2.0/24
        inet6 fe80::5054:ff:fe12:3456/64
        route6 fe80::/64

lo: unmanaged
        "lo"
        loopback (unknown), 00:00:00:00:00:00, sw, mtu 65536

DNS configuration:
        servers: 10.0.2.3
        interface: ens3
$ sudo snap set system system.network.netplan.network.bridges.br54.dhcp4=true
==============================
 Network/SSH connection will be interrupted briefly, but reconnect after some seconds.
==============================
$ netplan get # VERIFY br54 is there
network:
  version: 2
  renderer: NetworkManager
  ethernets:
    ens3:
      dhcp4: true
  bridges:
    br54:
      dhcp4: true
$ sudo journalctl | grep FileNotFound # EMPTY => no errors
$ nmcli # VERIFY br54 is there
ens3: connected to netplan-ens3
        "ens3"
        ethernet (virtio_net), 52:54:00:12:34:56, hw, mtu 1500
        ip4 default, ip6 default
        inet4 10.0.2.15/24
        route4 0.0.0.0/0
        route4 10.0.2.0/24
        inet6 fec0::5054:ff:fe12:3456/64
        inet6 fe80::5054:ff:fe12:3456/64
        route6 fe80::/64
        route6 ::/0
        route6 fec0::/64

br54: connecting (getting IP configuration) to netplan-br54
        "br54"
        bridge, 12:E6:74:FA:8A:75, sw, mtu 1500

lo: unmanaged
        "lo"
        loopback (unknown), 00:00:00:00:00:00, sw, mtu 65536

DNS configuration:
        servers: 10.0.2.3
        interface: ens3

[ Where problems could occur ]

 * Netplan is a core component, thus changing it bears a high risk
 * This upload modifies Netplan's CLI only, therefor the system boot-up and initial network connection should not be affect if anything goes wrong.
 * Netplan's CLI could be affected, breaking the snapd integration (dbus-netplan) and CLI commands like "netplan apply"

[ Other Info ]

 * This is addressed in Jammy via bug #2025519 (SRU/backport of 0.106.1)
 * This has been fixed upstream in https://github.com/canonical/netplan/pull/336
 * A test PPA is available here: https://launchpad.net/~slyon/+archive/ubuntu/lp1959570

=== original bug report ===

When using the netplan-dbus API on Ubuntu Core and the network-manager renderer the "netplan-dbus" appication will crash. To reproduce take a UC20 or UC22 system and run:

```
$ sudo snap set system system.network.netplan.network.bridges.br54.dhcp4=true
```
(which in effect just drives the dbus API of netplan).

After that the network is lost. When manually restarting it the error is:
```
error: cannot perform the following tasks:
- Run configure hook of "core" snap (run hook "configure": cannot try netplan config: no specific reason returned from netplan and cannot cancel netplan config: Unknown object '/io/netplan/Netplan/config/VDDSG1'.)
```

and the journal reports:
```
Jan 31 10:46:04 ubuntu sudo[1746]: mvo : TTY=pts/0 ; PWD=/home/mvo ; USER=root ; COMMAND=/usr/bin/snap set system system.network.netplan.network.bridges.br54.dhcp4=true
Jan 31 10:46:04 ubuntu sudo[1746]: pam_unix(sudo:session): session opened for user root(uid=0) by mvo(uid=1000)
Jan 31 10:46:05 ubuntu io.netplan.Netplan[1763]: Traceback (most recent call last):
Jan 31 10:46:05 ubuntu io.netplan.Netplan[1763]: File "/usr/share/netplan/netplan/cli/commands/try_command.py", line 99, in command_try
Jan 31 10:46:05 ubuntu io.netplan.Netplan[1763]: NetplanApply().command_apply(run_generate=True, sync=True, exit_on_error=False, state_dir=self.state)
Jan 31 10:46:05 ubuntu io.netplan.Netplan[1763]: File "/usr/share/netplan/netplan/cli/commands/apply.py", line 272, in command_apply
Jan 31 10:46:05 ubuntu io.netplan.Netplan[1763]: if b'\nconnected' in subprocess.check_output(cmd, env=env):
Jan 31 10:46:05 ubuntu io.netplan.Netplan[1763]: File "/usr/lib/python3.9/subprocess.py", line 424, in check_output
Jan 31 10:46:05 ubuntu io.netplan.Netplan[1763]: return run(*popenargs, stdout=PIPE, timeout=timeout, check=True,
Jan 31 10:46:05 ubuntu io.netplan.Netplan[1763]: File "/usr/lib/python3.9/subprocess.py", line 505, in run
Jan 31 10:46:05 ubuntu io.netplan.Netplan[1763]: with Popen(*popenargs, **kwargs) as process:
Jan 31 10:46:05 ubuntu io.netplan.Netplan[1763]: File "/usr/lib/python3.9/subprocess.py", line 951, in __init__
Jan 31 10:46:05 ubuntu io.netplan.Netplan[1763]: self._execute_child(args, executable, preexec_fn, close_fds,
Jan 31 10:46:05 ubuntu io.netplan.Netplan[1763]: File "/usr/lib/python3.9/subprocess.py", line 1821, in _execute_child
Jan 31 10:46:05 ubuntu io.netplan.Netplan[1763]: raise child_exception_type(errno_num, err_msg, err_filename)
Jan 31 10:46:05 ubuntu io.netplan.Netplan[1763]: FileNotFoundError: [Errno 2] No such file or directory: 'nmcli'
Jan 31 10:46:05 ubuntu io.netplan.Netplan[1763]: During handling of the above exception, another exception occurred:
Jan 31 10:46:05 ubuntu io.netplan.Netplan[1763]: Traceback (most recent call last):
Jan 31 10:46:05 ubuntu io.netplan.Netplan[1763]: File "/usr/sbin/netplan", line 23, in <module>
Jan 31 10:46:05 ubuntu io.netplan.Netplan[1763]: netplan.main()
Jan 31 10:46:05 ubuntu io.netplan.Netplan[1763]: File "/usr/share/netplan/netplan/cli/core.py", line 50, in main
Jan 31 10:46:05 ubuntu io.netplan.Netplan[1763]: self.run_command()
Jan 31 10:46:05 ubuntu io.netplan.Netplan[1763]: File "/usr/share/netplan/netplan/cli/utils.py", line 315, in run_command
Jan 31 10:46:05 ubuntu io.netplan.Netplan[1763]: self.func()
Jan 31 10:46:05 ubuntu io.netplan.Netplan[1763]: File "/usr/share/netplan/netplan/cli/commands/try_command.py", line 81, in run
Jan 31 10:46:05 ubuntu io.netplan.Netplan[1763]: self.run_command()
Jan 31 10:46:05 ubuntu io.netplan.Netplan[1763]: File "/usr/share/netplan/netplan/cli/utils.py", line 315, in run_command
Jan 31 10:46:05 ubuntu io.netplan.Netplan[1763]: self.func()
Jan 31 10:46:05 ubuntu io.netplan.Netplan[1763]: File "/usr/share/netplan/netplan/cli/commands/try_command.py", line 113, in command_try
Jan 31 10:46:05 ubuntu io.netplan.Netplan[1763]: self.revert()
Jan 31 10:46:05 ubuntu io.netplan.Netplan[1763]: File "/usr/share/netplan/netplan/cli/commands/try_command.py", line 143, in revert
Jan 31 10:46:05 ubuntu io.netplan.Netplan[1763]: NetplanApply().command_apply(run_generate=False, sync=True, exit_on_error=False, state_dir=tempdir)
Jan 31 10:46:05 ubuntu io.netplan.Netplan[1763]: File "/usr/share/netplan/netplan/cli/commands/apply.py", line 272, in command_apply
Jan 31 10:46:05 ubuntu io.netplan.Netplan[1763]: if b'\nconnected' in subprocess.check_output(cmd, env=env):
Jan 31 10:46:05 ubuntu io.netplan.Netplan[1763]: File "/usr/lib/python3.9/subprocess.py", line 424, in check_output
Jan 31 10:46:05 ubuntu io.netplan.Netplan[1763]: return run(*popenargs, stdout=PIPE, timeout=timeout, check=True,
Jan 31 10:46:05 ubuntu io.netplan.Netplan[1763]: File "/usr/lib/python3.9/subprocess.py", line 505, in run
Jan 31 10:46:05 ubuntu io.netplan.Netplan[1763]: with Popen(*popenargs, **kwargs) as process:
Jan 31 10:46:05 ubuntu io.netplan.Netplan[1763]: File "/usr/lib/python3.9/subprocess.py", line 951, in __init__
Jan 31 10:46:05 ubuntu io.netplan.Netplan[1763]: self._execute_child(args, executable, preexec_fn, close_fds,
Jan 31 10:46:05 ubuntu io.netplan.Netplan[1763]: File "/usr/lib/python3.9/subprocess.py", line 1821, in _execute_child
Jan 31 10:46:05 ubuntu io.netplan.Netplan[1763]: raise child_exception_type(errno_num, err_msg, err_filename)
Jan 31 10:46:05 ubuntu io.netplan.Netplan[1763]: FileNotFoundError: [Errno 2] No such file or directory: 'nmcli'
Jan 31 10:46:10 ubuntu io.netplan.Netplan[884]: 'netplan try' exited with status: 0
Jan 31 10:46:10 ubuntu snapd[605]: taskrunner.go:271: [change 13 "Run configure hook of \"core\" snap" task] failed: run hook "configure": cannot try netplan config: no specific reason returned from netplan and cannot cancel netplan config: Unknown object '/io/netplan/Netplan/config/LX32G1'.
```

The full journal log is attached.

I guess it's a bit the question if this is a netplan or a nm-snap bug (it might need to provide an nmcli alias?). But in any case the crash and losing network is worth fixing regardless I think.

Fwiw, I also suspect this is the issue that prevents my spread tests in https://github.com/snapcore/snapd/compare/master...mvo5:netplan-read-write-nm?expand=1 from working.

Revision history for this message
Michael Vogt (mvo) wrote :
description: updated
Revision history for this message
Alfonso Sanchez-Beato (alfonsosanchezbeato) wrote :

FTR, there is an alias registered in the store for network-mananger.nmcli

Revision history for this message
Lukas Märdian (slyon) wrote :

There seems to be a patch in place (for a long time) to handle the fallback to network-manager.nmcli. I wonder why it does not work (anymore)?

https://github.com/canonical/netplan/commit/443039b5bd71152b9848c6df096300097e343466

@alfonsosanchezbeato do you think it would be an option to expose the nmcli command from the NM snap?

tags: added: fr-2041
Revision history for this message
Launchpad Janitor (janitor) wrote :

Status changed to 'Confirmed' because the bug affects multiple users.

Changed in netplan.io (Ubuntu):
status: New → Confirmed
Revision history for this message
Isaac True (itrue) wrote :

Hi all,

We have a customer running into this same issue. Is a fix for this on the radar at all?

In the meantime, I've suggested that they just use `netplan set` directly as a workaround.

Revision history for this message
Lukas Märdian (slyon) wrote :

I do not yet fully understand what's going on here.

The 'nmcli' command is exported from the network-manager snap and can be called from the shell without a problem. So I have the impression that for some reason the call stack/execution environment is different here.

AFAIU "snap set" drives netplan, using its DBus API, so is calling "io.netplan.Netplan.Try/Apply" eventually. This will automatically start the "netplan-dbus" process, which will spawn the "netplan apply" command.

The "netplan apply" command has some special case logic if execute inside a snap (detected by reading the "SNAP" environment variable), if inside a snap, it will re-execute itself, using the o.netplan.Netplan.Apply DBus API, in order to "break" out of the snap confinement and be executed in the Core20/22 root system.

=> See we might have an infinite loop here, if the SNAP environment variable is not detected/passed correctly.

When the "netplan apply" command is excuted outside a snap in the Core20/22 base system, it will apply its changes and restart systemd-networkd.service and NetworkManager.service (or the respecitive service that is provided by the network-manager snap) and call into "nmcli" to wait for it to be ready.

For some reason it seems to be unable to find the "nmcli" binary in its PATH, so my assumption is that it is not able to break out of the snap, but rather is executed within the snapd snap's confinement.

This needs more investigation to fully understand what's going on.

Revision history for this message
Lukas Märdian (slyon) wrote :

After some more investigation in snappy land, I think we have two stacked problems here:

1) The netplan-dbus daemon doesn't have PATH set in its environment, thus not finding the `nmcli` or `network-manager.nmlci` binary:

slyon@ubuntu:~$ ps faux | grep netplan
slyon 938 0.0 0.0 5188 720 pts/0 S+ 16:55 0:00 \_ grep --color=auto netplan
root 830 0.0 0.2 11200 4440 ? S 16:54 0:00 /lib/netplan/netplan-dbus
slyon@ubuntu:~$ sudo cat /proc/830/environ
DBUS_STARTER_ADDRESS=unix:path=/var/run/dbus/system_bus_socketDBUS_STARTER_BUS_TYPE=system

2) Ubuntu Core being slow (probably due to the repeated 'run configure hook of "core" snap' action). When first calling into nmcli, the NM service is not yet up and running, thus crashing the subprocess.check_output(nmcli...) command. When patched to handle nmcli's "not running" return code (8) gracefully, giving some more time generally and adding some debug output, I can see the following output:

```diff
             utils.systemctl_network_manager('start', sync=sync)
             if sync:
                 # wait up to 2 sec for 'STATE=connected (site/local-only)' or
                 # 'STATE=connected' to appear in 'nmcli general' STATE
                 env = dict(os.environ, LC_ALL='C')
                 cmd = ['nmcli', 'general', 'status']
                 for _ in range(20):
- if b'\nconnected' in subprocess.check_output(cmd, env=env):
+ sys.stderr.write("ITER\n")
+ out = subprocess.run(cmd, capture_output=True, universal_newlines=True, env=env)
+ sys.stderr.write(str(out.returncode)+": "+str(out.stdout)+"\n"+str(out.stderr)+"\n")
+ sys.stderr.flush()
+ if out.returncode == 8:
+ time.sleep(1)
+ continue
+ if '\nconnected' in str(out.stdout):
                         break
- time.sleep(0.1)
+ time.sleep(0.5)
```

OUTPUT:

slyon@ubuntu:~$ sudo snap set system system.network.netplan.network.bridges.br54.dhcp4=false
Run configure hook of "core" snap \
ITER
Run configure hook of "core" snap |
8:
Error: NetworkManager is not running.

Run configure hook of "core" snap –
ITER
Run configure hook of "core" snap /
0: STATE CONNECTIVITY WIFI-HW WIFI WWAN-HW WWAN
disconnected none enabled enabled enabled enabled

Run configure hook of "core" snap –
ITER
Run configure hook of "core" snap \
0: STATE CONNECTIVITY WIFI-HW WIFI WWAN-HW WWAN
connected full enabled enabled enabled enabled

PS: launched `netplan-dbus` manually (to make the PATH & environment from my shell available) and patched the apply.py command to add debug output in a running core20 snap, by using `sudo mount -o bind apply.py /usr/share/netplan/netplan/cli/commands/apply.py`

Revision history for this message
Lukas Märdian (slyon) wrote :
Changed in netplan.io (Ubuntu):
status: Confirmed → In Progress
status: In Progress → Confirmed
Changed in netplan:
status: New → In Progress
assignee: nobody → Lukas Märdian (slyon)
Changed in netplan.io (Ubuntu):
status: Confirmed → Triaged
importance: Undecided → High
Revision history for this message
Lukas Märdian (slyon) wrote :

The fix from PR#336 seems to be working generally.

There's still a brief interruption of the network connectivity, while restarting the "snap.network-manager.networkmanager.service" daemon (obviously), but the SSH connection is recovered after a few seconds.

I've built a core22 base snap, using https://launchpad.net/~slyon/+archive/ubuntu/lp1959570 and a UC22 image out of this to run this test (all executed over an SSH connection):

slyon@ubuntu:~$ sudo snap set system system.network.netplan.network.bridges.br54.dhcp4=true
slyon@ubuntu:~$ sudo netplan get
network:
  version: 2
  renderer: NetworkManager
  ethernets:
    enp0s2:
      dhcp4: true
  bridges:
    br54:
      dhcp4: true
slyon@ubuntu:~$ sudo snap set system system.network.netplan.network.bridges.br54.dhcp4=false
slyon@ubuntu:~$ sudo netplan get
network:
  version: 2
  renderer: NetworkManager
  ethernets:
    enp0s2:
      dhcp4: true
  bridges:
    br54:
      dhcp4: false

Lukas Märdian (slyon)
Changed in netplan:
status: In Progress → Fix Committed
Revision history for this message
Lukas Märdian (slyon) wrote :

Fixed upstream and uploaded into Lunar as 0.106-0ubuntu3.

Not relevant for Kinetic, due to affecting Ubuntu Core (based on LTS releases). Supposed to be backported into Jammy via Netplan's 0.106 SRU.

Changed in netplan.io (Ubuntu Lunar):
status: Triaged → Fix Committed
Changed in netplan.io (Ubuntu Kinetic):
status: New → Invalid
Revision history for this message
Launchpad Janitor (janitor) wrote :

This bug was fixed in the package netplan.io - 0.106-0ubuntu3

---------------
netplan.io (0.106-0ubuntu3) lunar; urgency=medium

  * Cherry-pick upstream commit fc12872 (PR#336) to fix crash in 'snapd set'
    (LP: #1959570)

 -- Lukas Märdian <email address hidden> Mon, 03 Apr 2023 16:29:22 +0200

Changed in netplan.io (Ubuntu Lunar):
status: Fix Committed → Fix Released
Lukas Märdian (slyon)
Changed in netplan:
status: Fix Committed → Fix Released
Revision history for this message
Launchpad Janitor (janitor) wrote :

Status changed to 'Confirmed' because the bug affects multiple users.

Changed in netplan.io (Ubuntu Focal):
status: New → Confirmed
Changed in netplan.io (Ubuntu Jammy):
status: New → Confirmed
Lukas Märdian (slyon)
tags: added: sru-next
Lukas Märdian (slyon)
tags: added: foundations-todo
Revision history for this message
Lukas Märdian (slyon) wrote :

Jammy is addressed via the 0.106.1 backport in bug #2025519

Changed in netplan.io (Ubuntu Jammy):
status: Confirmed → Won't Fix
Revision history for this message
Lukas Märdian (slyon) wrote :

I've prepared a Focal upload in https://launchpad.net/~slyon/+archive/ubuntu/lp1959570/+packages, which seems to be working fine, after building a core20 base snap + UC20 image from it.

$ snap install network-manager --channel=20/stable
$ nmcli
ens3: connected to netplan-ens3
        "ens3"
        ethernet (virtio_net), 52:54:00:12:34:56, hw, mtu 1500
        ip4 default
        inet4 10.0.2.15/24
        route4 0.0.0.0/0
        route4 10.0.2.0/24
        inet6 fe80::5054:ff:fe12:3456/64
        route6 fe80::/64

lo: unmanaged
        "lo"
        loopback (unknown), 00:00:00:00:00:00, sw, mtu 65536

DNS configuration:
        servers: 10.0.2.3
        interface: ens3
$ sudo snap set system system.network.netplan.network.bridges.br54.dhcp4=true
==============================
 Network connection will be interrupted briefly, but reconnect after some seconds.
==============================
$ netplan get
network:
  version: 2
  renderer: NetworkManager
  ethernets:
    ens3:
      dhcp4: true
  bridges:
    br54:
      dhcp4: true
$ sudo journalctl | grep FileNotFound # EMPTY
$ nmcli
ens3: connected to netplan-ens3
        "ens3"
        ethernet (virtio_net), 52:54:00:12:34:56, hw, mtu 1500
        ip4 default, ip6 default
        inet4 10.0.2.15/24
        route4 0.0.0.0/0
        route4 10.0.2.0/24
        inet6 fec0::5054:ff:fe12:3456/64
        inet6 fe80::5054:ff:fe12:3456/64
        route6 fe80::/64
        route6 ::/0
        route6 fec0::/64

br54: connecting (getting IP configuration) to netplan-br54
        "br54"
        bridge, 12:E6:74:FA:8A:75, sw, mtu 1500

lo: unmanaged
        "lo"
        loopback (unknown), 00:00:00:00:00:00, sw, mtu 65536

DNS configuration:
        servers: 10.0.2.3
        interface: ens3

Revision history for this message
Lukas Märdian (slyon) wrote :
description: updated
description: updated
description: updated
Changed in netplan.io (Ubuntu Focal):
status: Confirmed → In Progress
Revision history for this message
Łukasz Zemczak (sil2100) wrote :

Would it be possible to have a special test case for this as part of the unit-testing/integration-testing suite? Also, I'd feel a bit better if as part of the validation we did some smoke testing of the CLI on a non-core related system.

To post a comment you must log in.
This report contains Public information  
Everyone can see this information.

Duplicates of this bug

Other bug subscribers

Remote bug watches

Bug watches keep track of this bug in other bug trackers.