DNS doesn't work in no-cloud as launched by ubuntu

Bug #1734167 reported by Michael Lyle
This bug affects 2 people
Affects                  Status         Importance  Assigned to                 Milestone
cloud-init               Fix Released   High        Unassigned
cloud-init (Ubuntu)      Fix Released   Critical    Unassigned
  Zesty                  Fix Released   Undecided   Unassigned
  Artful                 Won't Fix      Critical    Unassigned
systemd (Ubuntu)         Fix Released   Critical    Canonical Foundations Team
  Zesty                  Fix Released   Undecided   Unassigned
  Artful                 Fix Released   High        Unassigned
  Bionic                 Fix Released   Critical    Canonical Foundations Team

Bug Description

[Impact]

 * systemd-resolved does not start early enough in the boot process, so DNS resolution is not yet working during early boot. Some early stages of cloud-init need DNS, for example to fetch a nocloud-net seed from a URL by hostname. As a result, the instance fails to boot or provision fully.

[Test Case]

 * Boot a container or VM with a nocloud-net datasource whose seed URL points to the datasource by hostname, as explained below.
 * Observe that boot completes and provisioning succeeds.
 * Check that there are no DNS resolution errors in the cloud-init log or boot log.

[Regression Potential]

 * Starting systemd-resolved earlier may prevent it from connecting to D-Bus. In that case it may need a restart later, when it is re-triggered over D-Bus. This affects artful only, because in bionic resolved gained the ability to reconnect to D-Bus after start-up. Backporting that change is too large for an SRU, however, because it requires sd-bus changes.

[Other Info]

 * Original bug report.

I use no-cloud to test the kernel in CI (I am the maintainer of the bcache subsystem). I have been running it successfully under 16.04 cloud images from qemu, using a qemu command that includes:

-smbios "type=1,serial=ds=nocloud-net;s=https://raw.githubusercontent.com/mlyle/mlyle/master/cloud-metadata/linuxtst/"

As documented here:

http://cloudinit.readthedocs.io/en/latest/topics/datasources/nocloud.html
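The value after `serial=` in the `-smbios type=1,...` option reaches the guest as the SMBIOS system serial number. NoCloud can pick up a `ds=` string from there. The following is a minimal parsing sketch. `parse_nocloud_serial` is a hypothetical helper, not cloud-init's own code. The short-key aliases (`s` for seedfrom, `h` for local-hostname, `i` for instance-id) are an assumption based on the NoCloud documentation.

```python
from __future__ import annotations

# Assumed aliases for the short keys used in "ds=nocloud-net;s=..." strings.
SHORT_KEYS = {"s": "seedfrom", "h": "local-hostname", "i": "instance-id"}


def parse_nocloud_serial(serial: str) -> tuple[str, dict[str, str]]:
    """Split 'ds=NAME;key=value;...' into (NAME, {long_key: value})."""
    head, *fields = serial.split(";")
    if not head.startswith("ds="):
        raise ValueError(f"not a datasource string: {serial!r}")
    options: dict[str, str] = {}
    for field in fields:
        if not field:
            continue  # tolerate an empty field, e.g. a trailing ';'
        key, sep, value = field.partition("=")
        if not sep:
            raise ValueError(f"malformed option: {field!r}")
        # Only the first '=' separates key from value, so URLs containing '=' survive.
        options[SHORT_KEYS.get(key, key)] = value
    return head[len("ds="):], options


if __name__ == "__main__":
    serial = ("ds=nocloud-net;"
              "s=https://raw.githubusercontent.com/mlyle/mlyle/master/cloud-metadata/linuxtst/")
    print(parse_nocloud_serial(serial))
```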

Under the new 17.10 cloud images, this doesn't work. The network comes up, but name resolution doesn't. At this point in the boot, /etc/resolv.conf is a symlink to a nonexistent file, and systemd-resolved is not running. When I manually hack /etc/resolv.conf in the cloud image to point to 4.2.2.1, it works fine.

I don't know whether name service not working is by design, but it seems like it should work. The documentation states:

"With ds=nocloud-net, the seedfrom value must start with http://, https:// or ftp://"

And https is not going to work with a raw IP address, since the server's certificate names a hostname, not an IP address.
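Whether a nocloud-net seed needs working DNS at all depends only on the host part of the URL. The sketch below, `seed_needs_dns`, is a hypothetical helper and not part of cloud-init. It enforces the documented scheme restriction quoted above. It then reports whether the host is a name that must be resolved, or an IP literal that needs no lookup.

```python
import ipaddress
from urllib.parse import urlsplit

# Schemes the NoCloud documentation allows for ds=nocloud-net seedfrom values.
ALLOWED_SCHEMES = ("http", "https", "ftp")


def seed_needs_dns(seedfrom: str) -> bool:
    """Return True if fetching this seed URL requires a hostname lookup."""
    parts = urlsplit(seedfrom)
    if parts.scheme not in ALLOWED_SCHEMES:
        raise ValueError(f"seedfrom must start with http://, https:// or ftp://: {seedfrom!r}")
    host = parts.hostname  # lower-cased, brackets stripped from IPv6 literals
    if not host:
        raise ValueError(f"seedfrom has no host: {seedfrom!r}")
    try:
        ipaddress.ip_address(host)
    except ValueError:
        return True  # a hostname: the guest needs a working resolver
    return False  # an IP literal: no DNS lookup happens
```

An IP-literal seed avoids the resolver entirely. In practice, that limits it to plain http:// or ftp://.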

Related bugs:
 * bug 1734939: #include fails silently.
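Bug 1734939 concerns `#include` user-data. In that format, the lines after the `#include` marker are URLs that cloud-init fetches in turn, and a failed fetch there produced only a log warning. The sketch below extracts the URLs up front, so a test harness can report each one explicitly. `include_urls` is hypothetical and is not cloud-init's parser. It also accepts the one-line `#include URL` form used in this report's lxc reproducer.

```python
from __future__ import annotations

# Longest marker first, so '#include-once' does not leave '-once' behind.
MARKERS = ("#include-once", "#include")


def include_urls(user_data: str) -> list[str]:
    """Return the URLs named in '#include' user-data, in order."""
    lines = user_data.splitlines()
    if not lines or not lines[0].startswith("#include"):
        raise ValueError("user-data does not start with #include")
    urls = []
    for line in lines:
        for marker in MARKERS:
            if line.startswith(marker):
                line = line[len(marker):]  # allow '#include URL' on one line
                break
        line = line.strip()
        if line and not line.startswith("#"):  # skip blanks and comments
            urls.append(line)
    return urls
```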

CVE References

Michael Lyle (mlyle) wrote :

Entire command lines of how I'm doing this:

build@nestvirt:~$ qemu-img create -f qcow2 -b artful-server-cloudimg-amd64.img cloudy.img 20G
build@nestvirt:~$ kvm -nographic -machine pc-i440fx-zesty,accel=kvm,usb=off,dump-guest-core=off -m 4096 -smp 3 -cpu Opteron_G3 -device virtio-net-pci,netdev=hostnet0,id=net0,mac=52:54:00:31:33:70,bus=pci.0,addr=0x3 -netdev bridge,id=hostnet0 -drive file=cloudy.img,if=virtio -smbios "type=1,serial=ds=nocloud-net;s=https://raw.githubusercontent.com/mlyle/mlyle/master/cloud-metadata/linuxtst/" -kernel bzImage -append "root=/dev/vda1 ro console=ttyS0"

Michael Lyle (mlyle) wrote :

I'm not using the included kernel or initrd, so I decided to test without that.

kvm -machine pc-i440fx-zesty,accel=kvm,usb=off,dump-guest-core=off -m 4096 -smp 3 -cpu Opteron_G3 -device virtio-net-pci,netdev=hostnet0,id=net0,mac=52:54:00:31:33:70,bus=pci.0,addr=0x3 -netdev bridge,id=hostnet0 -drive file=testful.img,if=virtio -smbios "type=1,serial=ds=nocloud-net;s=https://raw.githubusercontent.com/mlyle/mlyle/master/cloud-metadata/linuxtst/"

Properly gets the hostname of 'linuxtst' and all associated configuration on xenial, but not on artful.
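The `;` inside the `-smbios` value is a shell command separator, which is why the value is quoted in the commands above. One way to avoid shell quoting entirely is to build the argument vector in Python and hand it to `subprocess.run` without a shell. The sketch below is simplified: `kvm_argv` is hypothetical and keeps only a few of the options used above. `shlex.join` prints the equivalent, correctly quoted shell command.

```python
from __future__ import annotations

import shlex


def kvm_argv(image: str, seed_url: str, memory_mb: int = 4096) -> list[str]:
    """Build a kvm command line that boots `image` with a nocloud-net seed URL."""
    smbios = f"type=1,serial=ds=nocloud-net;s={seed_url}"
    return [
        "kvm", "-nographic",
        "-m", str(memory_mb),
        "-drive", f"file={image},if=virtio",
        "-smbios", smbios,  # one argv element, so the ';' never reaches a shell
    ]


if __name__ == "__main__":
    argv = kvm_argv(
        "testful.img",
        "https://raw.githubusercontent.com/mlyle/mlyle/master/cloud-metadata/linuxtst/",
    )
    print(shlex.join(argv))
    # To actually boot: subprocess.run(argv, check=True)
```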

Scott Moser (smoser) wrote :

$ wget http://cloud-images.ubuntu.com/artful/20171122/artful-server-cloudimg-amd64.img

## set up dns locally for 'qemu-host' to the default ip for user networking.
$ grep qemu-host /etc/hosts
10.0.2.2 qemu-host

$ cat data/user-data
#cloud-config
password: passw0rd
chpasswd: { expire: False }
ssh_pwauth: True

$ cat data/meta-data
instance-id: i-test

## webserv is http://bazaar.launchpad.net/~curtin-dev/curtin/trunk/view/head:/tools/webserv
$ webserve 44225 data
:: 44225

## backdoor the image so you can login with 'backdoor:passw0rd'
# backdoor-image is http://bazaar.launchpad.net/~smoser/+junk/backdoor-image/view/head:/backdoor-image

$ sudo backdoor-image -v --password=passw0rd

$ url="http://qemu-host:44225/"

$ qemu-system-x86_64 -enable-kvm \
   -device virtio-net-pci,netdev=net00 \
   -netdev type=user,id=net00 \
   -drive file=artful-server-cloudimg-amd64.img,id=disk00,if=none,format=qcow2,index=0 \
   -device virtio-blk,drive=disk00,serial=artful-server-cloudimg-amd64.img \
   -vga none -nographic -snapshot -echr 0x5 \
   -smbios "type=1,serial=ds=nocloud-net;s=$url" -m 768

## console does show
## [ 20.388179] cloud-init[606]: 2017-11-24 17:03:13,786 - util.py[WARNING]: Getting data from <class 'cloudinit.sources.DataSourceNoCloud.DataSourceNoCloudNet'> failed

## login
$ pastebinit /var/log/cloud-init.log
http://paste.ubuntu.com/26035544/

## interesting part of that is
2017-11-24 17:03:12,779 - url_helper.py[DEBUG]: [9/11] open 'http://qemu-host:44667/meta-data' with {'url': 'http://qemu-host:44667/meta-data', 'allow_redirects': True, 'method': 'GET', 'headers': {'User-Agent': 'Cloud-Init/17.1'}} configuration
2017-11-24 17:03:12,782 - url_helper.py[DEBUG]: Please wait 1 seconds while we wait to try again
2017-11-24 17:03:13,783 - url_helper.py[DEBUG]: [10/11] open 'http://qemu-host:44667/meta-data' with {'url': 'http://qemu-host:44667/meta-data', 'allow_redirects': True, 'method': 'GET', 'headers': {'User-Agent': 'Cloud-Init/17.1'}} configuration
2017-11-24 17:03:13,786 - handlers.py[DEBUG]: finish: init-network/search-NoCloudNet: FAIL: no network data found from DataSourceNoCloudNet
2017-11-24 17:03:13,786 - util.py[WARNING]: Getting data from <class 'cloudinit.sources.DataSourceNoCloud.DataSourceNoCloudNet'> failed
2017-11-24 17:03:13,794 - util.py[DEBUG]: Getting data from <class 'cloudinit.sources.DataSourceNoCloud.DataSourceNoCloudNet'> failed
Traceback (most recent call last):
  File "/usr/lib/python3/dist-packages/cloudinit/sources/__init__.py", line 332, in find_source
    if s.get_data():
  File "/usr/lib/python3/dist-packages/cloudinit/sources/DataSourceNoCloud.py", line 157, in get_data
    (md_seed, ud) = util.read_seeded(seedfrom, timeout=None)
  File "/usr/lib/python3/dist-packages/cloudinit/util.py", line 932, in read_seeded
    md_resp = read_file_or_url(md_url, timeout, retries, file_retries)
  File "/usr/lib/python3/dist-packages/cloudinit/util.py", line 892, in read_file_or_url
    exception_cb=exception_cb)
  File "/usr/lib/python3/dist-packages/cloudinit/url_helper.py", line 270, in readurl
    raise excps[-1]
cloudinit.url_helper.UrlError: HTTPConnectionPool(host='qemu-host', port=44667): Max retrie...
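The session above serves `data/` with curtin's `webserv` script. Python's standard library can stand in for that tool; this is an assumed substitute, not what was used above. The sketch writes the same two seed files and serves them over HTTP. cloud-init then requests `meta-data` and `user-data` relative to the seed URL, as the log shows for `meta-data`. `write_seed` and `serve_seed` are hypothetical names.

```python
from __future__ import annotations

import functools
import http.server
import pathlib
import threading

USER_DATA = """#cloud-config
password: passw0rd
chpasswd: { expire: False }
ssh_pwauth: True
"""
META_DATA = "instance-id: i-test\n"


def write_seed(directory: pathlib.Path) -> None:
    """Create the user-data and meta-data files a nocloud-net seed URL points at."""
    directory.mkdir(parents=True, exist_ok=True)
    (directory / "user-data").write_text(USER_DATA)
    (directory / "meta-data").write_text(META_DATA)


def serve_seed(directory: pathlib.Path, port: int = 44225,
               host: str = "") -> http.server.ThreadingHTTPServer:
    """Serve `directory` over HTTP from a background thread; port 0 picks a free port."""
    handler = functools.partial(http.server.SimpleHTTPRequestHandler,
                                directory=str(directory))
    server = http.server.ThreadingHTTPServer((host, port), handler)
    threading.Thread(target=server.serve_forever, daemon=True).start()
    return server


if __name__ == "__main__":
    seed_dir = pathlib.Path("data")
    write_seed(seed_dir)
    server = serve_seed(seed_dir)
    print("serving", seed_dir, "on port", server.server_address[1])
    threading.Event().wait()  # keep serving until interrupted
```

With QEMU user networking, the guest reaches the host at 10.0.2.2, which is the address the `qemu-host` entry above maps to.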


Scott Moser (smoser) wrote :

Here's some more info from a failed system using bionic:
$ sudo journalctl -o short-monotonic --no-pager | pastebinit
http://paste.ubuntu.com/26035621/

$ sudo base64 /run/log/journal/7ba07d79c32c4103aefee168e433d847/system@e9ae467d022046f0a034147c78254ae9-0000000000000001-00055ebdb4f0260b.journal | pastebinit
http://paste.ubuntu.com/26035632/

Changed in cloud-init:
status: New → Confirmed
importance: Undecided → High
Changed in cloud-init (Ubuntu):
status: New → Confirmed
importance: Undecided → High
Changed in systemd (Ubuntu):
status: New → Confirmed
importance: Undecided → High
Scott Moser (smoser) wrote :

I think the primary issue is that cloud-init.service depends on having a fully working network.
cloud-init.service has:
  After=networking.service
  After=systemd-networkd-wait-online.service
  Before=network-online.target

But systemd-resolved.service has:
 After=systemd-networkd.service network.target
 Before=network-online.target nss-lookup.target

Both are ordered before network-online.target, but nothing orders systemd-resolved.service relative to cloud-init.service.

I tried adding the following to cloud-init.service:
 After=systemd-resolved.service
but that did not help.
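The directives above can be collected mechanically, for example to check what two units say about each other's ordering. The sketch below is hypothetical and simplified: it ignores line continuations and drop-in files. It accumulates repeated `After=`/`Before=` keys, which `configparser` would either reject as duplicates or overwrite.

```python
from __future__ import annotations

ORDERING_KEYS = ("After", "Before")


def parse_ordering(unit_text: str) -> dict[str, list[str]]:
    """Collect After=/Before= values from unit-file text, keeping repeats in order."""
    result: dict[str, list[str]] = {key: [] for key in ORDERING_KEYS}
    for raw in unit_text.splitlines():
        line = raw.strip()
        if not line or line.startswith(("#", ";", "[")):
            continue  # blank lines, comments and section headers
        key, sep, value = line.partition("=")
        key = key.strip()
        if sep and key in ORDERING_KEYS:
            # One directive may name several units separated by spaces.
            result[key].extend(value.split())
    return result


def ordered_between(a: dict[str, list[str]], a_name: str,
                    b: dict[str, list[str]], b_name: str) -> bool:
    """True if either unit orders itself explicitly against the other."""
    return (b_name in a["After"] or b_name in a["Before"]
            or a_name in b["After"] or a_name in b["Before"])
```

Fed only the directives quoted above, `ordered_between` returns False. The two units share network-online.target as a later point but say nothing about each other.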

Dimitri John Ledkov (xnox) wrote :

<xnox> smoser, yeah, so like cloud-init.service should want/after systemd-resolved.service; or e.g. systemd-resolved.service should declare itself before cloud-init.service
<xnox> smoser, i think changing it in systemd unit might be better.

Scott Moser (smoser) wrote :

zesty does not show this problem. neither does xenial. I reflected that in the status.

Changed in cloud-init (Ubuntu Artful):
status: New → Confirmed
importance: Undecided → Medium
importance: Medium → High
Changed in systemd (Ubuntu Artful):
status: New → Confirmed
importance: Undecided → High
Changed in systemd (Ubuntu Zesty):
status: New → Confirmed
status: Confirmed → Fix Released
Changed in cloud-init (Ubuntu Zesty):
status: New → Fix Released
Scott Moser (smoser) wrote :

zesty does not show this problem. neither does xenial. I reflected that in the status.

$ sudo journalctl -b -o short-monotonic | pastebinit
http://paste.ubuntu.com/26035779/
$ sudo journalctl -o short-precise | pastebinit
http://paste.ubuntu.com/26035774/

Nov 24 17:49:25.193028 ubuntu systemd[1]: systemd-resolved.service: Found ordering cycle on basic.target/start
Nov 24 17:49:25.193038 ubuntu systemd[1]: systemd-resolved.service: Found dependency on paths.target/start
Nov 24 17:49:25.193050 ubuntu systemd[1]: systemd-resolved.service: Found dependency on acpid.path/start
Nov 24 17:49:25.193060 ubuntu systemd[1]: systemd-resolved.service: Found dependency on sysinit.target/start

Scott Moser (smoser) wrote :

That ordering cycle appears if we add 'After=systemd-resolved.service' to cloud-init.service.

Scott Moser (smoser) wrote :

To be clear, the suggestion that xnox made causes an ordering cycle.
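The journal lines above describe a chain of ordering edges. That chain closes into a loop once cloud-init.service is ordered after systemd-resolved.service. The sketch below models it with Python's `graphlib` (Python 3.9+). The graph is a simplified assumption, built from three parts:

* systemd-resolved.service is ordered after basic.target, as its default dependencies imply.
* The basic.target, paths.target, acpid.path, sysinit.target chain comes from the journal.
* cloud-init.service is assumed to be ordered before sysinit.target.

```python
from __future__ import annotations

from graphlib import CycleError, TopologicalSorter

# Simplified, assumed model: each unit maps to the units it is ordered After=.
ORDERING = {
    "systemd-resolved.service": {"basic.target"},  # from DefaultDependencies=yes
    "basic.target": {"paths.target"},
    "paths.target": {"acpid.path"},
    "acpid.path": {"sysinit.target"},
    "sysinit.target": {"cloud-init.service"},  # assumes Before=sysinit.target
    "cloud-init.service": {"systemd-resolved.service"},  # the proposed After=
}


def ordering_cycle(graph: dict[str, set[str]]) -> list[str] | None:
    """Return one ordering cycle as a list of units, or None if none exists."""
    try:
        # static_order() raises CycleError when the graph contains a loop.
        tuple(TopologicalSorter(graph).static_order())
    except CycleError as err:
        return err.args[1]  # graphlib stores the cycle's nodes here
    return None


if __name__ == "__main__":
    print(ordering_cycle(ORDERING))
```

Dropping the proposed cloud-init.service edge removes the loop. This matches the observation that only the added After= line triggers the cycle.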

Changed in systemd (Ubuntu Bionic):
assignee: nobody → Canonical Foundations Team (canonical-foundations)
Ryan Harper (raharper) wrote :

I suspect that because bionic/artful are missing the resolvconf package, the systemd-resolved service ends up starting later in boot. The systemd-resolved-update-resolveconf.{service,path} units require /sbin/resolvconf to run. This service had a path-based trigger that fired whenever DHCP clients called resolvconf to kick off a DNS update once config was available.
I suspect that systemd-networkd itself isn't poking the DNS service properly after acquiring information.

The dependency loop comes from systemd-resolved using default dependencies, which order it after the point where cloud-init.service runs.

So systemd-resolved needs to specify DefaultDependencies=No, and something like network-online.target needs to require systemd-resolved.

I modified cloud-init.service to include After=systemd-resolved.service. However, other services may also require DNS, so I feel this is a property of network-online.target.

Steve Langasek (vorlon) wrote :

I agree that systemd-resolved should be DefaultDependencies=no.

Of the individual dependencies of sysinit.target.wants, I'm guessing it should be After=systemd-journald.service systemd-machine-id-commit.service and possibly After=systemd-random-seed.service.
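Setting DefaultDependencies=no removes the implicit edges that close the loop. The sketch below uses a simplified, assumed model of the ordering, with Python 3.9+ `graphlib`. In this model, cloud-init.service is ordered before sysinit.target, and default dependencies are reduced to sysinit.target and basic.target. Without default dependencies, the sketch orders systemd-resolved.service only after the units suggested above. It then checks that a valid start order puts resolved ahead of cloud-init.service.

```python
from __future__ import annotations

from graphlib import CycleError, TopologicalSorter

# Explicit ordering suggested in the comment above.
EXPLICIT_AFTER = {
    "systemd-journald.service",
    "systemd-machine-id-commit.service",
    "systemd-random-seed.service",
}
# Simplified stand-in for what DefaultDependencies=yes adds to a service.
DEFAULT_AFTER = {"sysinit.target", "basic.target"}


def start_order(default_dependencies: bool) -> list[str] | None:
    """Return one valid start order, or None if the ordering has a cycle."""
    resolved_after = set(EXPLICIT_AFTER)
    if default_dependencies:
        resolved_after |= DEFAULT_AFTER
    graph = {
        "systemd-resolved.service": resolved_after,
        "cloud-init.service": {"systemd-resolved.service"},
        "sysinit.target": {"cloud-init.service"},  # assumes Before=sysinit.target
        "basic.target": {"sysinit.target"},
    }
    try:
        return list(TopologicalSorter(graph).static_order())
    except CycleError:
        return None
```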

Ryan Harper (raharper) wrote :

We will still need something that ensures systemd-resolved runs before we reach network-online.target. I also suspect (though I've not validated it yet) that we really want systemd-resolved running before systemd-networkd. Then systemd-networkd can relay DNS configuration retrieved from DHCP, similar to how resolvconf was hooked on networking config touching files in /run.

Scott Moser (smoser) wrote :

I've verified that this is reproducible within lxc, and filed a bug I saw as a result (bug 1734939).

Here's a trivial reproducer:

## just showing content of the url.
$ curl --silent https://hastebin.com/raw/coladicuva
#!/bin/sh
cat /proc/uptime | tee /run/user-script-uptime

$ name=btest
$ lxc launch ubuntu-daily:bionic $name \
   "--config=user.user-data=#include https://hastebin.com/raw/coladicuva"

$ sleep 20
$ lxc exec $name grep WARN /var/log/cloud-init.log
2017-11-28 16:49:12,251 - user_data.py[WARNING]: HTTPSConnectionPool(host='hastebin.com', port=443): Max retries exceeded with url: /raw/coladicuva (Caused by NewConnectionError('<urllib3.connection.VerifiedHTTPSConnection object at 0x7f20736a4e80>: Failed to establish a new connection: [Errno -3] Temporary failure in name resolution',)) for url: https://hastebin.com/raw/coladicuva
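`[Errno -3]` in that warning is glibc's `EAI_AGAIN` ("Temporary failure in name resolution"). A lookup typically returns this when no resolver answers, as happens here before systemd-resolved is up. The diagnostic sketch below tells that case apart from a name that simply does not exist. `describe_lookup_error` and `check_resolution` are hypothetical helpers, not cloud-init code.

```python
import socket


def describe_lookup_error(err: socket.gaierror) -> str:
    """Turn a getaddrinfo() failure into a short diagnosis."""
    if err.errno == socket.EAI_AGAIN:
        return "temporary failure: no resolver answered (is systemd-resolved running?)"
    if err.errno == socket.EAI_NONAME:
        return "name not known: the resolver answered, but the host does not exist"
    return f"lookup failed: {err.strerror}"


def check_resolution(host: str, port: int = 443) -> str:
    """Resolve `host` once and return 'ok' or a diagnosis instead of raising."""
    try:
        socket.getaddrinfo(host, port, type=socket.SOCK_STREAM)
    except socket.gaierror as err:
        return describe_lookup_error(err)
    return "ok"


if __name__ == "__main__":
    print(check_resolution("hastebin.com"))
```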

Changed in cloud-init (Ubuntu Bionic):
importance: High → Critical
Changed in cloud-init (Ubuntu Artful):
importance: High → Critical
Changed in systemd (Ubuntu Bionic):
importance: High → Critical
Scott Moser (smoser)
description: updated
Changed in systemd (Ubuntu Bionic):
status: Confirmed → Fix Committed
Changed in systemd (Ubuntu Artful):
status: Confirmed → In Progress
Scott Moser (smoser) wrote :

Dimitri,
What is the fix that you put in? I assume it was to systemd?

Launchpad Janitor (janitor) wrote :

This bug was fixed in the package systemd - 235-3ubuntu3

---------------
systemd (235-3ubuntu3) bionic; urgency=medium

  * networkd: add support for RequiredForOnline stanza. (LP: #1737570)
  * resolved.service: set DefaultDependencies=no (LP: #1734167)
  * systemd.postinst: enable persistent journal. (LP: #1618188)
  * core: add support for non-writable unified cgroup hierarchy for container support.
    (LP: #1734410)

 -- Dimitri John Ledkov <email address hidden> Tue, 12 Dec 2017 13:25:32 +0000

Changed in systemd (Ubuntu Bionic):
status: Fix Committed → Fix Released
Scott Moser (smoser) wrote :

Marked as fix-released.
I tested today with 20180115.1 image from bionic.

wget http://cloud-images.ubuntu.com/bionic/20180115.1/bionic-server-cloudimg-amd64.img -O bionic-server-cloudimg-amd64.img

url="https://smoser.brickies.net/ubuntu/nocloud/"
qemu-system-x86_64 -enable-kvm -m 768 \
   -net nic -net user \
   -drive file=disk.img,if=virtio \
   -smbios "type=1,serial=ds=nocloud-net;s=$url"

Just for info, showing:
$ curl https://smoser.brickies.net/ubuntu/nocloud/user-data
#cloud-config
password: passw0rd
chpasswd: { expire: False }
ssh_pwauth: True

$ curl https://smoser.brickies.net/ubuntu/nocloud/meta-data
instance-id: iid-brickies-nocloud
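The two files above make up the whole seed, and a quick sanity check before booting can catch a broken seed early. The sketch below is hypothetical and deliberately naive: it does not parse YAML. It checks only two things. First, that meta-data names an instance-id, which is assumed here to be required. Second, that non-empty user-data starts with one of a few recognised markers.

```python
from __future__ import annotations

# A subset of the user-data format markers cloud-init recognises.
USER_DATA_MARKERS = ("#cloud-config", "#include", "#!")


def seed_problems(meta_data: str, user_data: str) -> list[str]:
    """Return problems found in a NoCloud seed; an empty list means none were found."""
    problems = []
    keys = {
        line.split(":", 1)[0].strip()
        for line in meta_data.splitlines()
        if ":" in line and not line.lstrip().startswith("#")
    }
    if "instance-id" not in keys:
        problems.append("meta-data has no instance-id")
    if user_data.strip() and not user_data.startswith(USER_DATA_MARKERS):
        problems.append("user-data does not start with a recognised marker")
    return problems
```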

no longer affects: cloud-init (Ubuntu Bionic)
Changed in cloud-init (Ubuntu):
status: Confirmed → Fix Released
tags: added: id-5a1c7e7be1c6883c5a843d1f
description: updated
Brian Murray (brian-murray) wrote : Please test proposed package

Hello Michael, or anyone else affected,

Accepted systemd into artful-proposed. The package will build now and be available at https://launchpad.net/ubuntu/+source/systemd/234-2ubuntu12.3 in a few hours, and then in the -proposed repository.

Please help us by testing this new package. See https://wiki.ubuntu.com/Testing/EnableProposed for documentation on how to enable and use -proposed. Your feedback will help us get this update out to other Ubuntu users.

If this package fixes the bug for you, please add a comment to this bug, mentioning the version of the package you tested and change the tag from verification-needed-artful to verification-done-artful. If it does not fix the bug for you, please add a comment stating that, and change the tag to verification-failed-artful. In either case, without details of your testing we will not be able to proceed.

Further information regarding the verification process can be found at https://wiki.ubuntu.com/QATeam/PerformingSRUVerification . Thank you in advance!

Changed in systemd (Ubuntu Artful):
status: In Progress → Fix Committed
tags: added: verification-needed verification-needed-artful
Scott Moser (smoser) wrote :
tags: added: verification-done verification-done-artful
removed: verification-needed verification-needed-artful
Scott Moser (smoser) wrote :

See my attached log for verification of artful.

Launchpad Janitor (janitor) wrote :

This bug was fixed in the package systemd - 234-2ubuntu12.3

---------------
systemd (234-2ubuntu12.3) artful; urgency=medium

  [ Dimitri John Ledkov ]
  * Fix test-functions failing with Ubuntu units. LP: #1750608
  * tests: switch to using ext4 by default, instead of ext3. LP: #1750608
  * Fix kdump service not starting, due to systemd not loading dropins.
    Cherrypick a fix from upstream. (LP: #1708409)
  * systemd-fsckd: Fix ADT tests to work on s390x too. (LP: #1736955)
  * networkd: add support for RequiredForOnline stanza. (LP: #1737570)
  * resolved.service: set DefaultDependencies=no (LP: #1734167)
  * systemd.postinst: enable persistent journal. (LP: #1618188)
  * core: add support for non-writable unified cgroup hierarchy for container support.
    Rebase and de-fuzz. (LP: #1734410)
  * Prevent MemoryDenyWriteExecution policy bypass, by disallowing pkey_mprotect when mprotect is disallowed.
    CVE-2017-15908 (LP: #1725348)
  * networkd: enable promote_secondaries on networkd managed dhcp links.
    This fixes failing to renew DHCP lease, on networkd managed devices.
    (LP: #1721223)

  [ Kleber Sacilotto de Souza ]
  * systemd-rfkill service times out when a new rfkill device is added
    - rfkill-fix-erroneous-behavior-when-polling-the-udev-.patch: Comparing
    udev_device_get_sysname(device) and sysname will always return true. We need to
    check the device received from udev monitor instead.
    - rfkill-fix-typo.patch: Fix typo in rfkill log message. (LP: #1734908)

 -- Dimitri John Ledkov <email address hidden> Tue, 20 Feb 2018 16:11:58 +0000

Changed in systemd (Ubuntu Artful):
status: Fix Committed → Fix Released
Łukasz Zemczak (sil2100) wrote : Update Released

The verification of the Stable Release Update for systemd has completed successfully and the package has now been released to -updates. Subsequently, the Ubuntu Stable Release Updates Team is being unsubscribed and will not receive messages about this bug report. In the event that you encounter a regression using the package from -updates please report a new bug using ubuntu-bug and tag the bug report regression-update so we can easily find any regressions.

Joshua Powers (powersj)
Changed in cloud-init:
status: Confirmed → Fix Released
Changed in cloud-init (Ubuntu Artful):
status: Confirmed → Won't Fix
James Falcon (falcojr) wrote :