consider changing definition of "has network" to include DNS being set up

Bug #1927695 reported by Alfonso Sanchez-Beato
This bug affects 1 person
Affects: subiquity
Status: New
Importance: Undecided
Assigned to: Unassigned
Milestone: (none)

Bug Description

Subiquity currently assumes that if at least one NIC has a default route, then the network is up enough to be used during installation. This definitely isn't always true -- one case is where the device being installed was netbooted in an isolated environment (as a workaround you can configure things so it doesn't get a default route, but still). The reasonable suggestion is that waiting for both a default route and a DNS server is a better indication of the network being usable.

Currently we monitor netlink to wait for a default route. I think we can talk D-Bus to systemd-resolved to wait for a DNS server, but I don't really have any practical experience with D-Bus.
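
To make the proposed definition concrete, here is a minimal sketch of "has a default route and has a nameserver". It is not subiquity's code: subiquity watches netlink, whereas this sketch parses text in the format of /proc/net/route and of a resolv.conf-style file, which is a simpler stand-in technique. The function names are hypothetical, and the assumption that /run/systemd/resolve/resolv.conf lists the upstream servers known to systemd-resolved would need checking in the installer environment.

```python
from typing import List

RTF_UP = 0x1  # route is usable (from <linux/route.h>)


def has_default_route(proc_net_route: str) -> bool:
    """Return True if /proc/net/route-style text contains an IPv4 default route."""
    for line in proc_net_route.splitlines()[1:]:  # first line is the column header
        fields = line.split()
        if len(fields) < 8:
            continue
        destination, flags, mask = fields[1], int(fields[3], 16), fields[7]
        if destination == "00000000" and mask == "00000000" and flags & RTF_UP:
            return True
    return False


def nameservers(resolv_conf: str) -> List[str]:
    """Return the addresses on 'nameserver' lines of resolv.conf-style text."""
    servers = []
    for line in resolv_conf.splitlines():
        fields = line.split()
        if len(fields) >= 2 and fields[0] == "nameserver":
            servers.append(fields[1])
    return servers


def has_network(proc_net_route: str, resolv_conf: str) -> bool:
    """Proposed definition: a default route AND at least one nameserver."""
    return has_default_route(proc_net_route) and bool(nameservers(resolv_conf))
```

A caller would pass in the contents of /proc/net/route and of whichever resolv.conf file reflects the real upstream servers. With the isolated netboot setup described here (default route present, no DNS), has_network would return False.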

--- original description ---

I am installing on an arm64 server using the autoinstaller (with the subiquity snap in the edge channel, version 21.04.2+git11.dabd3198, revision 2405) with command line:

linux /vmlinuz console=hvc0 console=ttyAMA0 earlycon=pl011,0x01000000 fixrtc ip=dhcp url=http://192.168.100.1/tftp/bluefield.iso autoinstall ds="nocloud-net;s=http://192.168.100.1/tftp/" cloud-config-url=/dev/null

meta-data is an empty file, while user-data has content:

#cloud-config
autoinstall:
  version: 1
  # ubuntu/ubuntu
  identity:
    hostname: mynic
    username: ubuntu
    password: "$6$rounds=4096$PCrfo.ggdf4ubP$REjyaoY2tUWH2vjFJjvLs3rDxVTszGR9P7mhH9sHb2MsELfc53uV/v15jDDOJU/9WInfjjTKJPlD5URhX5Mix0"
  ssh:
    allow-pw: true
    authorized-keys:
      - "ssh-rsa AAAAB3Nz..."
    install-server: true

The error I see on the console is:

Ubuntu 20.04.2 LTS ubuntu-server ttyAMA0

connecting...
start: subiquity/Meta/status_GET
finish: subiquity/Network/apply_autoinstall_config/apply_config: Command '['netplan', 'apply']' returned non-zero exit status 1.
start: subiquity/ErrorReporter/1616017201.889178991.network_fail/add_info
finish: subiquity/Network/apply_autoinstall_config/wait_for_apply: Command '['netplan', 'apply']' returned non-zero exit status 1.
finish: subiquity/Network/apply_autoinstall_config: Command '['netplan', 'apply']' returned non-zero exit status 1.
finish: subiquity/apply_autoinstall_config: Command '['netplan', 'apply']' returned non-zero exit status 1.
finish: subiquity/ErrorReporter/1616017201.889178991.network_fail/add_info
start: subiquity/Meta/status_GET
start: subiquity/Meta/status_GET
start: subiquity/Meta/status_GET

See also LP: #1926132

Tags: fr-1358
Alfonso Sanchez-Beato (alfonsosanchezbeato) wrote :

It would be nice to have a way to get into the device to get more info, but I do not know how to do that atm.

description: updated
Alfonso Sanchez-Beato (alfonsosanchezbeato) wrote :

If I create a network-config file, things move forward a bit:

#cloud-config
version: 2
ethernets:
  tmfifo_net0:
    dhcp4: true

But, I get this error:

finish: subiquity/Install/install/curtin_install/cmd-install/stage-curthooks: configuring installed system
finish: subiquity/Install/install/curtin_install/cmd-install: curtin command install
finish: subiquity/Install/install/curtin_install: Command '['systemd-cat', '--level-prefix=false', '--identifier=subiquity_log.3142', '/snap/subiquity/2402/usr/bin/python3', '-m', 'curtin', '--showtrace', '-c', '/var/log/installer/subiquity-curtin-install.conf', 'install']' returned non-zero exit status 3.
finish: subiquity/Install/install: Command '['systemd-cat', '--level-prefix=false', '--identifier=subiquity_log.3142', '/snap/subiquity/2402/usr/bin/python3', '-m', 'curtin', '--showtrace', '-c', '/var/log/installer/subiquity-curtin-install.conf', 'install']' returned non-zero exit status 3.
start: subiquity/ErrorReporter/1616017379.612478018.install_fail/add_info

As things had moved forward I was able to grab some logs from curtin:
Console output: https://paste.ubuntu.com/p/YCBmP9NzCY/
curtin-install-cfg.yaml: https://paste.ubuntu.com/p/KQt5zPpQ52/
curtin-install.log: https://paste.ubuntu.com/p/Vjv344TYmd/

The errors show DNS failures that make curtin fail in the end.

Michael Hudson-Doyle (mwhudson) wrote :

So the issue here is that if there _is_ a network connection at all (or, more precisely, if there is a default route on any interface), subiquity expects to be able to access the archive or at least an archive mirror (possibly via a proxy). But if you just have a very isolated network set up so you can ssh in and debug an install failure because you only have serial console access, this is a bit of a catch 22. IOW, I think we should fix bug 1926132.

Another workaround would be to leave the network down during install and bring it up with error-commands. But eesh.

Possibly there should be a way for debugging purposes to get subiquity to do an offline install even if there is a network. But that's a bit iffy in some ways too.

Alfonso Sanchez-Beato (alfonsosanchezbeato) wrote :

@Michael, yes, that is the situation: there is a default route but no DNS. Here we are doing netboot and the kernel is configuring the interface, as we are setting "ip=dhcp" in the kernel command line so the initramfs can retrieve the ISO image.

Interestingly, it would be possible to actually have a working DNS as the kernel is storing the value returned by the DHCP server (see [1]) in
/proc/net/pnp
but the system is not using it. I wonder if there is a way networkd could use this. But, even if that is possible, I think that having a default route but no external connectivity is a situation that can easily happen in some set-ups. I also wonder what would happen if no default gateway is returned by DHCP?

[1] https://www.kernel.org/doc/html/latest/admin-guide/nfs/nfsroot.html
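
For illustration, a minimal sketch of reading the DNS servers that the kernel recorded in /proc/net/pnp. It assumes the file uses resolv.conf-like "nameserver <address>" lines alongside a "#PROTO:" comment, which matches my reading of the kernel documentation in [1]; the function name and the sample text are hypothetical. Whether networkd or resolved could then be told to use these servers is left open, as in the comment above.

```python
import ipaddress
from typing import List


def pnp_nameservers(pnp_text: str) -> List[str]:
    """Return valid, non-zero nameserver addresses from /proc/net/pnp-style text."""
    servers = []
    for line in pnp_text.splitlines():
        fields = line.split()
        if len(fields) != 2 or fields[0] != "nameserver":
            continue  # skips '#PROTO:', 'domain', 'bootserver' and blank lines
        try:
            address = ipaddress.ip_address(fields[1])
        except ValueError:
            continue  # not a parseable IP address
        if not address.is_unspecified:  # ignore 0.0.0.0 placeholders
            servers.append(str(address))
    return servers


# Hypothetical sample of what the file might contain after "ip=dhcp".
SAMPLE = """#PROTO: DHCP
domain example.lan
nameserver 192.168.100.1
bootserver 192.168.100.1
"""

if __name__ == "__main__":
    print(pnp_nameservers(SAMPLE))
```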

Michael Hudson-Doyle (mwhudson) wrote :

Oh hm. Yeah netboot but no useful network is an interesting sub-case.

(For the side question: a DHCP server not providing a gateway will be handled in the same way as there being no network at all, I expect. But I haven't tried it.)

Alfonso Sanchez-Beato (alfonsosanchezbeato) wrote :

I can confirm that if the DHCP server does not provide a gateway (no routers option) the installation finishes successfully.

I wonder if we could get this same behaviour by checking whether DNS is configured?

tags: added: fr-1358
Michael Hudson-Doyle (mwhudson) wrote :

So fwiw I still don't really have a clear plan for what to do about this. Defining "has network" to be "has a default route _and_ has a nameserver configured" would make sense but I don't know how to do it really, other than polling 'resolvectl status' or something.

summary: - Cannot apply netplan config when running autoinstaller
+ consider changing definition of "has network" to include DNS being set
+ up
description: updated
Michael Hudson-Doyle (mwhudson) wrote :

So "gdbus monitor -y -o /org/freedesktop/resolve1 --dest org.freedesktop.resolve1" shows DNS servers appearing and disappearing as the UI is manipulated so if we can translate that to python-dbus language this might not be so bad...
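
A rough sketch of what that translation might look like with dbus-python, untested against a real systemd-resolved. It assumes the Manager interface exposes a "DNS" property of type a(iiay) (interface index, address family, raw address bytes) and that changes arrive as PropertiesChanged signals, which is what the gdbus output suggests. The function names are hypothetical, and the GLib main loop is only for the sketch; subiquity would presumably need this wired into its own event loop.

```python
import ipaddress
import socket
from typing import List, Tuple


def decode_dns_property(entries) -> List[Tuple[int, str]]:
    """Turn a(iiay) entries into (ifindex, address-string) pairs."""
    result = []
    for ifindex, family, raw in entries:
        if int(family) not in (socket.AF_INET, socket.AF_INET6):
            continue  # unknown address family, skip it
        # 4 raw bytes become an IPv4 address, 16 raw bytes an IPv6 address.
        address = ipaddress.ip_address(bytes(raw))
        result.append((int(ifindex), str(address)))
    return result


def watch_dns(on_change) -> None:
    """Call on_change(servers) now and whenever resolved reports new DNS servers."""
    # Imported here so decode_dns_property works without dbus-python installed.
    import dbus
    from dbus.mainloop.glib import DBusGMainLoop
    from gi.repository import GLib

    DBusGMainLoop(set_as_default=True)
    bus = dbus.SystemBus()
    obj = bus.get_object("org.freedesktop.resolve1", "/org/freedesktop/resolve1")
    props = dbus.Interface(obj, "org.freedesktop.DBus.Properties")

    def properties_changed(interface, changed, invalidated):
        if "DNS" in changed:
            on_change(decode_dns_property(changed["DNS"]))

    props.connect_to_signal("PropertiesChanged", properties_changed)
    # Report the current state once before waiting for changes.
    on_change(decode_dns_property(props.Get("org.freedesktop.resolve1.Manager", "DNS")))
    GLib.MainLoop().run()
```

With something like watch_dns(print) running, "has a nameserver configured" would then be "the most recent list is non-empty".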
