DNS doesn't work in no-cloud as launched by ubuntu

Bug #1734167 reported by Michael Lyle on 2017-11-23
This bug affects 2 people
Affects                  Status                      Importance  Assigned to                 Milestone
cloud-init                                           High        Unassigned
cloud-init (Ubuntu)                                  Critical    Unassigned
  Zesty                                              Undecided   Unassigned
  Artful                                             Critical    Unassigned
systemd (Ubuntu)         Status tracked in Bionic
  Zesty                                              Undecided   Unassigned
  Artful                                             High        Unassigned
  Bionic                                             Critical    Canonical Foundations Team

Bug Description

[Impact]

 * resolved does not start early enough in the boot process, so DNS resolution is not operational during early boot. Special early stages of cloud-init, for example, require it, and without it the instance fails to boot or provision fully.

[Test Case]

 * Boot container or a VM with a nocloud-net data source, and a URL pointing to the datasource as explained below
 * Observe that boot completes and provisioning is successful
 * Check that there are no dns-resolution errors in the cloud-init log / boot log
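
The last check can be scripted. The following is a minimal sketch, not part of cloud-init: the function name is hypothetical, and the pattern only matches the resolver error text quoted later in this report, so other failure modes would need their own patterns.

```python
import re

# ASSUMPTION: this pattern only covers the glibc resolver message that
# appears in the cloud-init log excerpt quoted in this bug report.
DNS_ERROR = re.compile(r"Temporary failure in name resolution")


def find_dns_errors(log_text):
    """Return the log lines that mention a name-resolution failure."""
    return [line for line in log_text.splitlines() if DNS_ERROR.search(line)]
```

To use it, read /var/log/cloud-init.log into a string and pass it to find_dns_errors(); an empty list means no matching errors were logged.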

[Regression Potential]

 * Starting resolved earlier may prevent it from connecting to dbus, and it may require a restart later on when re-triggered over dbus. This applies to artful only, as in bionic resolved has gained the ability to reconnect to dbus post-start. Backporting that, however, is too large for an SRU as it requires sd-bus changes.

[Other Info]

 * Original bug report.

I use no-cloud to test the kernel in CI (I am maintainer of the bcache subsystem), and have been running it successfully under 16.04 cloud images from qemu, using a qemu command that includes:

-smbios "type=1,serial=ds=nocloud-net;s=https://raw.githubusercontent.com/mlyle/mlyle/master/cloud-metadata/linuxtst/"

As documented here:

http://cloudinit.readthedocs.io/en/latest/topics/datasources/nocloud.html

Under the new 17.10 cloud images, this doesn't work: the network comes up, but name resolution doesn't work-- /etc/resolv.conf is a symlink to a nonexistent file at this point of the boot and systemd-resolved is not running. When I manually hack /etc/resolv.conf in the cloud image to point to 4.2.2.1 it works fine.
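
The dangling-symlink state described above can be checked with a few lines of Python. This is only a sketch for diagnosing an affected image; the function name is made up for illustration.

```python
import os


def is_dangling_symlink(path):
    """Return True if path is a symlink whose target does not exist."""
    # os.path.islink() looks at the link itself, while os.path.exists()
    # follows the link and returns False when the target is missing.
    return os.path.islink(path) and not os.path.exists(path)
```

Calling is_dangling_symlink("/etc/resolv.conf") early in boot on an affected image would be expected to return True, given the symptoms described above.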

I don't know if nameservice not working is by design, but it seems like it should work. The documentation states:

"With ds=nocloud-net, the seedfrom value must start with http://, https:// or ftp://"

And https is not going to work for a raw IP address.
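
For reference, here is a rough sketch of how the smbios serial string above breaks down. It is not cloud-init's actual parser. The short key aliases follow the NoCloud documentation as I understand it (only `s` appears in this report), and the function names are hypothetical.

```python
# ASSUMPTION: alias table taken from the NoCloud documentation; only "s"
# is actually used in this bug report.
ALIASES = {"h": "local-hostname", "i": "instance-id", "s": "seedfrom"}

# Prefixes the documentation requires for ds=nocloud-net.
NET_PREFIXES = ("http://", "https://", "ftp://")


def parse_ds_string(serial):
    """Split 'ds=nocloud-net;s=URL' into (datasource, settings dict)."""
    fields = serial.split(";")
    if not fields[0].startswith("ds="):
        raise ValueError("not a datasource string: %r" % serial)
    datasource = fields[0][len("ds="):]
    settings = {}
    for field in fields[1:]:
        if not field:
            continue
        # Split on the first "=" only, so values may contain "=".
        key, _, value = field.partition("=")
        settings[ALIASES.get(key, key)] = value
    return datasource, settings


def seedfrom_is_valid_for_net(settings):
    """Check the documented prefix rule for the seedfrom value."""
    return settings.get("seedfrom", "").startswith(NET_PREFIXES)
```

With the string from the qemu command line, this yields the datasource "nocloud-net" and a seedfrom URL that passes the prefix check, so the configuration itself is valid and the failure lies in name resolution.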

Related bugs:
 * bug 1734939: #include fails silently.

Michael Lyle (mlyle) wrote :

Entire command lines of how I'm doing this:

build@nestvirt:~$ qemu-img create -f qcow2 -b artful-server-cloudimg-amd64.img cloudy.img 20G
build@nestvirt:~$ kvm -nographic -machine pc-i440fx-zesty,accel=kvm,usb=off,dump-guest-core=off -m 4096 -smp 3 -cpu Opteron_G3 -device virtio-net-pci,netdev=hostnet0,id=net0,mac=52:54:00:31:33:70,bus=pci.0,addr=0x3 -netdev bridge,id=hostnet0 -drive file=cloudy.img,if=virtio -smbios "type=1,serial=ds=nocloud-net;s=https://raw.githubusercontent.com/mlyle/mlyle/master/cloud-metadata/linuxtst/" -kernel bzImage -append "root=/dev/vda1 ro console=ttyS0"

Michael Lyle (mlyle) wrote :

I'm not using the included kernel or initrd, so I decided to test without that.

kvm -machine pc-i440fx-zesty,accel=kvm,usb=off,dump-guest-core=off -m 4096 -smp 3 -cpu Opteron_G3 -device virtio-net-pci,netdev=hostnet0,id=net0,mac=52:54:00:31:33:70,bus=pci.0,addr=0x3 -netdev bridge,id=hostnet0 -drive file=testful.img,if=virtio -smbios "type=1,serial=ds=nocloud-net;s=https://raw.githubusercontent.com/mlyle/mlyle/master/cloud-metadata/linuxtst/"

Properly gets the hostname of 'linuxtst' and all associated configuration on xenial, but not on artful.

Scott Moser (smoser) wrote :

$ wget http://cloud-images.ubuntu.com/artful/20171122/artful-server-cloudimg-amd64.img

## set up dns locally for 'qemu-host' to the default ip for user networking.
$ grep qemu-host /etc/hosts
10.0.2.2 qemu-host

$ cat data/user-data
#cloud-config
password: passw0rd
chpasswd: { expire: False }
ssh_pwauth: True

$ cat data/meta-data
instance-id: i-test

## webserv is http://bazaar.launchpad.net/~curtin-dev/curtin/trunk/view/head:/tools/webserv
$ webserv 44225 data
:: 44225

## backdoor the image so you can login with 'backdoor:passw0rd'
# backdoor-image is http://bazaar.launchpad.net/~smoser/+junk/backdoor-image/view/head:/backdoor-image

$ sudo backdoor-image -v --password=passw0rd

$ url="http://qemu-host:44225/"

$ qemu-system-x86_64 -enable-kvm \
   -device virtio-net-pci,netdev=net00 \
   -netdev type=user,id=net00 \
   -drive file=artful-server-cloudimg-amd64.img,id=disk00,if=none,format=qcow2,index=0 \
   -device virtio-blk,drive=disk00,serial=artful-server-cloudimg-amd64.img \
   -vga none -nographic -snapshot -echr 0x5 \
   -smbios "type=1,serial=ds=nocloud-net;s=$url" -m 768

## console does show
## [ 20.388179] cloud-init[606]: 2017-11-24 17:03:13,786 - util.py[WARNING]: Getting data from <class 'cloudinit.sources.DataSourceNoCloud.DataSourceNoCloudNet'> failed

## login
$ pastebinit /var/log/cloud-init.log
http://paste.ubuntu.com/26035544/

## interesting part of that is
2017-11-24 17:03:12,779 - url_helper.py[DEBUG]: [9/11] open 'http://qemu-host:44667/meta-data' with {'url': 'http://qemu-host:44667/meta-data', 'allow_redirects': True, 'method': 'GET', 'headers': {'User-Agent': 'Cloud-Init/17.1'}} configuration
2017-11-24 17:03:12,782 - url_helper.py[DEBUG]: Please wait 1 seconds while we wait to try again
2017-11-24 17:03:13,783 - url_helper.py[DEBUG]: [10/11] open 'http://qemu-host:44667/meta-data' with {'url': 'http://qemu-host:44667/meta-data', 'allow_redirects': True, 'method': 'GET', 'headers': {'User-Agent': 'Cloud-Init/17.1'}} configuration
2017-11-24 17:03:13,786 - handlers.py[DEBUG]: finish: init-network/search-NoCloudNet: FAIL: no network data found from DataSourceNoCloudNet
2017-11-24 17:03:13,786 - util.py[WARNING]: Getting data from <class 'cloudinit.sources.DataSourceNoCloud.DataSourceNoCloudNet'> failed
2017-11-24 17:03:13,794 - util.py[DEBUG]: Getting data from <class 'cloudinit.sources.DataSourceNoCloud.DataSourceNoCloudNet'> failed
Traceback (most recent call last):
  File "/usr/lib/python3/dist-packages/cloudinit/sources/__init__.py", line 332, in find_source
    if s.get_data():
  File "/usr/lib/python3/dist-packages/cloudinit/sources/DataSourceNoCloud.py", line 157, in get_data
    (md_seed, ud) = util.read_seeded(seedfrom, timeout=None)
  File "/usr/lib/python3/dist-packages/cloudinit/util.py", line 932, in read_seeded
    md_resp = read_file_or_url(md_url, timeout, retries, file_retries)
  File "/usr/lib/python3/dist-packages/cloudinit/util.py", line 892, in read_file_or_url
    exception_cb=exception_cb)
  File "/usr/lib/python3/dist-packages/cloudinit/url_helper.py", line 270, in readurl
    raise excps[-1]
cloudinit.url_helper.UrlError: HTTPConnectionPool(host='qemu-host', port=44667): Max retrie...


Scott Moser (smoser) wrote :

Here's some more info from a failed system running bionic.
$ sudo journalctl -o short-monotonic --no-pager | pastebinit
http://paste.ubuntu.com/26035621/

$ sudo base64 /run/log/journal/7ba07d79c32c4103aefee168e433d847/system@e9ae467d022046f0a034147c78254ae9-0000000000000001-00055ebdb4f0260b.journal | pastebinit
http://paste.ubuntu.com/26035632/

Changed in cloud-init:
status: New → Confirmed
importance: Undecided → High
Changed in cloud-init (Ubuntu):
status: New → Confirmed
importance: Undecided → High
Changed in systemd (Ubuntu):
status: New → Confirmed
importance: Undecided → High
Scott Moser (smoser) wrote :

I think the primary issue is that cloud-init.service is depending on using the network fully.
cloud-init.service runs:
  After=networking.service
  After=systemd-networkd-wait-online.service
  Before=network-online.target

But systemd-resolved.service runs
 After=systemd-networkd.service network.target
 Before=network-online.target nss-lookup.target

I tried adding to cloud-init.service.
 After=systemd-resolved.service
but that did not help things.
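
To make the ordering relationship above easier to see, here is a small sketch that turns the quoted After=/Before= lines into "starts before" edges and checks whether anything forces one unit ahead of another. It only models the directives quoted in this comment; the real units carry more dependencies (including implicit ones), and the helper names are made up for illustration.

```python
def ordering_edges(units):
    """Turn After=/Before= lists into a set of (earlier, later) pairs."""
    edges = set()
    for name, directives in units.items():
        for other in directives.get("After", []):
            edges.add((other, name))
        for other in directives.get("Before", []):
            edges.add((name, other))
    return edges


def is_ordered_before(edges, first, second):
    """True if some chain of edges forces `first` to start before `second`."""
    stack, seen = [first], set()
    while stack:
        node = stack.pop()
        if node == second:
            return True
        if node in seen:
            continue
        seen.add(node)
        stack.extend(later for earlier, later in edges if earlier == node)
    return False


# Only the directives quoted in this comment; not the full unit files.
UNITS = {
    "cloud-init.service": {
        "After": ["networking.service", "systemd-networkd-wait-online.service"],
        "Before": ["network-online.target"],
    },
    "systemd-resolved.service": {
        "After": ["systemd-networkd.service", "network.target"],
        "Before": ["network-online.target", "nss-lookup.target"],
    },
}
```

With these inputs, is_ordered_before(ordering_edges(UNITS), "systemd-resolved.service", "cloud-init.service") is False: both units are ordered before network-online.target, but nothing orders resolved ahead of cloud-init.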

Dimitri John Ledkov (xnox) wrote :

<xnox> smoser, yeah, so like cloud-init.service should want/after systemd-resolved.service; or e.g. systemd-resolved.service should declare itself before cloud-init.service
<xnox> smoser, i think changing it in systemd unit might be better.

Scott Moser (smoser) wrote :

zesty does not show this problem. neither does xenial. I reflected that in the status.

Changed in cloud-init (Ubuntu Artful):
status: New → Confirmed
importance: Undecided → Medium
importance: Medium → High
Changed in systemd (Ubuntu Artful):
status: New → Confirmed
importance: Undecided → High
Changed in systemd (Ubuntu Zesty):
status: New → Confirmed
status: Confirmed → Fix Released
Changed in cloud-init (Ubuntu Zesty):
status: New → Fix Released
Scott Moser (smoser) wrote :

$ sudo journalctl -b -o short-monotonic | pastebinit
http://paste.ubuntu.com/26035779/
$ sudo journalctl -o short-precise | pastebinit
http://paste.ubuntu.com/26035774/

Nov 24 17:49:25.193028 ubuntu systemd[1]: systemd-resolved.service: Found ordering cycle on basic.target/start
Nov 24 17:49:25.193038 ubuntu systemd[1]: systemd-resolved.service: Found dependency on paths.target/start
Nov 24 17:49:25.193050 ubuntu systemd[1]: systemd-resolved.service: Found dependency on acpid.path/start
Nov 24 17:49:25.193060 ubuntu systemd[1]: systemd-resolved.service: Found dependency on sysinit.target/start

Scott Moser (smoser) wrote :

That ordering cycle appears if we add 'After=systemd-resolved.service' to cloud-init.service.

Scott Moser (smoser) wrote :

To be clear, the suggestion that xnox made causes an ordering cycle.
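
The cycle can be reproduced on paper with a small depth-first search. This is an illustrative sketch, not how systemd detects cycles. The chain from systemd-resolved.service to sysinit.target follows the journal excerpt above; the closing edge (sysinit.target waiting for cloud-init.service) is an assumption, since the journal excerpt stops at sysinit.target.

```python
def find_cycle(starts_after):
    """Return one ordering cycle as a list of unit names, or None."""
    visiting, done = [], set()

    def visit(unit):
        if unit in done:
            return None
        if unit in visiting:
            # Back at a unit on the current path: that slice is the cycle.
            return visiting[visiting.index(unit):] + [unit]
        visiting.append(unit)
        for dep in starts_after.get(unit, []):
            cycle = visit(dep)
            if cycle:
                return cycle
        visiting.pop()
        done.add(unit)
        return None

    for unit in starts_after:
        cycle = visit(unit)
        if cycle:
            return cycle
    return None


# Each key starts after the units listed as its value.
STARTS_AFTER = {
    "cloud-init.service": ["systemd-resolved.service"],  # the proposed After= line
    "systemd-resolved.service": ["basic.target"],
    "basic.target": ["paths.target"],
    "paths.target": ["acpid.path"],
    "acpid.path": ["sysinit.target"],
    # ASSUMPTION: cloud-init.service is ordered before sysinit.target.
    "sysinit.target": ["cloud-init.service"],
}
```

find_cycle(STARTS_AFTER) returns a path that starts and ends at the same unit. Dropping the "cloud-init.service" entry, that is, not adding the After= line, makes it return None.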

Changed in systemd (Ubuntu Bionic):
assignee: nobody → Canonical Foundations Team (canonical-foundations)
Ryan Harper (raharper) wrote :

I suspect that, because the resolvconf package is missing in bionic/artful, the systemd-resolved service ends up starting later in boot. The systemd-resolved-update-resolvconf.{service,path} units require /sbin/resolvconf to run; this service had a path-based trigger that fired whenever DHCP clients called resolvconf to kick off a DNS update once configuration was available.
I also suspect that systemd-networkd itself isn't poking the DNS service properly after acquiring information.

The dependency loop comes from systemd-resolved using default dependencies, which order it after the point at which cloud-init.service would run.

systemd-resolved therefore needs to specify DefaultDependencies=no, and something like network-online.target needs to require systemd-resolved.

I modified cloud-init.service to include After=systemd-resolved.service, but some other service may also require DNS, so I feel this is a property of network-online.target.

Steve Langasek (vorlon) wrote :

I agree that systemd-resolved should be DefaultDependencies=no.

Of the individual dependencies of sysinit.target.wants, I'm guessing it should be After=systemd-journald.service systemd-machine-id-commit.service and possibly After=systemd-random-seed.service.

Ryan Harper (raharper) wrote :

We will still need something that helps ensure systemd-resolved runs before we reach network-online.target; and I suspect (though I've not validated yet) that we really want systemd-resolved to be running prior to systemd-networkd, so that systemd-networkd can relay DNS configuration info retrieved from DHCP results, à la how resolvconf was hooked on networking config touching files in /run.

Scott Moser (smoser) wrote :

I've verified that this is reproducible within lxc, and then filed a bug I saw (bug 1734939) as a result.

Here's a trivial reproducer:

## just showing content of the url.
$ curl --silent https://hastebin.com/raw/coladicuva
#!/bin/sh
cat /proc/uptime | tee /run/user-script-uptime

$ name=btest
$ lxc launch ubuntu-daily:bionic $name \
   "--config=user.user-data=#include https://hastebin.com/raw/coladicuva"

$ sleep 20
$ lxc exec $name grep WARN /var/log/cloud-init.log
2017-11-28 16:49:12,251 - user_data.py[WARNING]: HTTPSConnectionPool(host='hastebin.com', port=443): Max retries exceeded with url: /raw/coladicuva (Caused by NewConnectionError('<urllib3.connection.VerifiedHTTPSConnection object at 0x7f20736a4e80>: Failed to establish a new connection: [Errno -3] Temporary failure in name resolution',)) for url: https://hastebin.com/raw/coladicuva
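
The "[Errno -3] Temporary failure in name resolution" in that warning is the resolver's error surfacing through Python; on glibc, -3 should correspond to EAI_AGAIN. A minimal sketch for checking resolution from inside the instance follows. The function name and the injectable `resolver` parameter are made up for illustration and are not part of cloud-init.

```python
import socket


def resolve_or_explain(host, port=443, resolver=socket.getaddrinfo):
    """Return (True, addresses) on success, or (False, reason) on failure."""
    try:
        infos = resolver(host, port)
    except socket.gaierror as exc:
        # gaierror carries the resolver's error code and message.
        return False, "[Errno %s] %s" % (exc.errno, exc.strerror)
    # Each getaddrinfo() entry ends with a sockaddr whose first item is the IP.
    return True, sorted({info[4][0] for info in infos})
```

Running resolve_or_explain("hastebin.com") inside an affected container would be expected to return False with the same reason text as the log line above.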

Changed in cloud-init (Ubuntu Bionic):
importance: High → Critical
Changed in cloud-init (Ubuntu Artful):
importance: High → Critical
Changed in systemd (Ubuntu Bionic):
importance: High → Critical
Scott Moser (smoser) on 2017-11-29
description: updated
Changed in systemd (Ubuntu Bionic):
status: Confirmed → Fix Committed
Changed in systemd (Ubuntu Artful):
status: Confirmed → In Progress
Scott Moser (smoser) wrote :

Dimitri,
What is the fix that you put in? I assume it was to systemd?

Launchpad Janitor (janitor) wrote :

This bug was fixed in the package systemd - 235-3ubuntu3

---------------
systemd (235-3ubuntu3) bionic; urgency=medium

  * netwokrd: add support for RequiredForOnline stanza. (LP: #1737570)
  * resolved.service: set DefaultDependencies=no (LP: #1734167)
  * systemd.postinst: enable persistent journal. (LP: #1618188)
  * core: add support for non-writable unified cgroup hierarchy for container support.
    (LP: #1734410)

 -- Dimitri John Ledkov <email address hidden> Tue, 12 Dec 2017 13:25:32 +0000

Changed in systemd (Ubuntu Bionic):
status: Fix Committed → Fix Released
Scott Moser (smoser) wrote :

Marked as fix-released.
I tested today with 20180115.1 image from bionic.

wget http://cloud-images.ubuntu.com/bionic/20180115.1/bionic-server-cloudimg-amd64.img -O bionic-server-cloudimg-amd64.img

url="https://smoser.brickies.net/ubuntu/nocloud/"
qemu-system-x86_64 -enable-kvm -m 768 \
   -net nic -net user \
   -drive file=disk.img,if=virtio \
   -smbios "type=1,serial=ds=nocloud-net;s=$url"

Just for info, showing:
$ curl https://smoser.brickies.net/ubuntu/nocloud/user-data
#cloud-config
password: passw0rd
chpasswd: { expire: False }
ssh_pwauth: True

$ curl https://smoser.brickies.net/ubuntu/nocloud/meta-data
instance-id: iid-brickies-nocloud

no longer affects: cloud-init (Ubuntu Bionic)
Changed in cloud-init (Ubuntu):
status: Confirmed → Fix Released
tags: added: id-5a1c7e7be1c6883c5a843d1f
description: updated

Hello Michael, or anyone else affected,

Accepted systemd into artful-proposed. The package will build now and be available at https://launchpad.net/ubuntu/+source/systemd/234-2ubuntu12.3 in a few hours, and then in the -proposed repository.

Please help us by testing this new package. See https://wiki.ubuntu.com/Testing/EnableProposed for documentation on how to enable and use -proposed. Your feedback will aid us in getting this update out to other Ubuntu users.

If this package fixes the bug for you, please add a comment to this bug, mentioning the version of the package you tested and change the tag from verification-needed-artful to verification-done-artful. If it does not fix the bug for you, please add a comment stating that, and change the tag to verification-failed-artful. In either case, without details of your testing we will not be able to proceed.

Further information regarding the verification process can be found at https://wiki.ubuntu.com/QATeam/PerformingSRUVerification . Thank you in advance!

Changed in systemd (Ubuntu Artful):
status: In Progress → Fix Committed
tags: added: verification-needed verification-needed-artful
Scott Moser (smoser) wrote :
tags: added: verification-done verification-done-artful
removed: verification-needed verification-needed-artful
Scott Moser (smoser) wrote :

See my attached log for verification of artful.

Launchpad Janitor (janitor) wrote :

This bug was fixed in the package systemd - 234-2ubuntu12.3

---------------
systemd (234-2ubuntu12.3) artful; urgency=medium

  [ Dimitri John Ledkov ]
  * Fix test-functions failing with Ubuntu units. LP: #1750608
  * tests: switch to using ext4 by default, instead of ext3. LP: #1750608
  * Fix kdump service not starting, due to systemd not loading dropins.
    Cherrypick a fix from upstream. (LP: #1708409)
  * systemd-fsckd: Fix ADT tests to work on s390x too. (LP: #1736955)
  * netwokrd: add support for RequiredForOnline stanza. (LP: #1737570)
  * resolved.service: set DefaultDependencies=no (LP: #1734167)
  * systemd.postinst: enable persistent journal. (LP: #1618188)
  * core: add support for non-writable unified cgroup hierarchy for container support.
    Rebase and de-fuzz. (LP: #1734410)
  * Prevent MemoryDenyWriteExecution policy bypass, by disallowing pkey_mprotect when mprotect is disallowed.
    CVE-2017-15908 (LP: #1725348)
  * networkd: enable promote_secondaries on networkd managed dhcp links.
    This fixes failing to renew DHCP lease, on networkd managed devices.
    (LP: #1721223)

  [ Kleber Sacilotto de Souza ]
  * systemd-rfkill service times out when a new rfkill device is added
    - rfkill-fix-erroneous-behavior-when-polling-the-udev-.patch: Comparing
    udev_device_get_sysname(device) and sysname will always return true. We need to
    check the device received from udev monitor instead.
    - rfkill-fix-typo.patch: Fix typo in rfkill log message. (LP: #1734908)

 -- Dimitri John Ledkov <email address hidden> Tue, 20 Feb 2018 16:11:58 +0000

Changed in systemd (Ubuntu Artful):
status: Fix Committed → Fix Released

The verification of the Stable Release Update for systemd has completed successfully and the package has now been released to -updates. Subsequently, the Ubuntu Stable Release Updates Team is being unsubscribed and will not receive messages about this bug report. In the event that you encounter a regression using the package from -updates please report a new bug using ubuntu-bug and tag the bug report regression-update so we can easily find any regressions.
