systemd-networkd runs too late for cloud-init.service (net)

Bug #1636912 reported by Ryan Harper on 2016-10-26
Affects            Status         Importance   Assigned to    Milestone
systemd            Fix Released   Unknown
systemd (Ubuntu)                  Medium       Martin Pitt
  Xenial                          High         Unassigned
  Yakkety                         Medium       Unassigned

Bug Description

Ubuntu Core 16 images using cloud-init fail to function when the DataSource is over the network (like OpenStack), as networking is not yet available when cloud-init.service runs.

cloud-init service unit deps look like this:

[Unit]
Description=Initial cloud-init job (metadata service crawler)
DefaultDependencies=no
Wants=cloud-init-local.service
Wants=local-fs.target
Wants=sshd-keygen.service
Wants=sshd.service
After=cloud-init-local.service
After=networking.service
Requires=networking.service
Before=basic.target
Before=dbus.socket
Before=network-online.target
Before=sshd-keygen.service
Before=sshd.service
Before=systemd-user-sessions.service
Conflicts=shutdown.target

Here's networkd unit deps:

[Unit]
Description=Network Service
Documentation=man:systemd-networkd.service(8)
ConditionCapability=CAP_NET_ADMIN
DefaultDependencies=no
# dbus.service can be dropped once on kdbus, and systemd-udevd.service can be
# dropped once tuntap is moved to netlink
After=systemd-udevd.service dbus.service network-pre.target systemd-sysusers.service systemd-sysctl.service
Before=network.target multi-user.target shutdown.target
Conflicts=shutdown.target
Wants=network.target

# On kdbus systems we pull in the busname explicitly, because it
# carries policy that allows the daemon to acquire its name.
Wants=org.freedesktop.network1.busname
After=org.freedesktop.network1.busname

And a critical-chain output:

root@snap-test7:~# systemd-analyze critical-chain systemd-networkd
Failed to get ID: Unit name systemd-networkd is not valid.
The time after the unit is active or started is printed after the "@" character.
The time the unit takes to start is printed after the "+" character.

root@snap-test7:~# systemd-analyze critical-chain systemd-networkd.service
The time after the unit is active or started is printed after the "@" character.
The time the unit takes to start is printed after the "+" character.

systemd-networkd.service +440ms
└─dbus.service @11.461s
  └─basic.target @11.403s
    └─sockets.target @11.401s
      └─dbus.socket @11.398s
        └─cloud-init.service @10.127s +1.266s
          └─networking.service @9.305s +799ms
            └─network-pre.target @9.295s
              └─cloud-init-local.service @3.822s +5.469s
                └─local-fs.target @3.813s
                  └─run-cgmanager-fs.mount @12.687s
                    └─local-fs-pre.target @1.393s
                      └─systemd-tmpfiles-setup-dev.service @1.116s +195ms
                        └─kmod-static-nodes.service @887ms +193ms
                          └─system.slice @783ms
                            └─-.slice @721ms

cloud-init would need networkd to run at or before 'networking.service' so it can raise networking to then find and use network-based datasources.
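
To compare the timings in a critical-chain listing without reading them by eye, the output can be parsed. This is only a rough sketch: the function name and the regular expression are made up for illustration, it handles just the "ms" and "s" suffixes seen above, and the sample is an excerpt of the output quoted in this report.

```python
import re

# Matches e.g. "cloud-init.service @10.127s +1.266s". Only the "ms" and "s"
# suffixes that appear in the output above are handled (an assumption).
LINE = re.compile(
    r"(?P<unit>[\w@\-.]+\.(?:service|target|socket|mount|slice))"
    r"(?:\s+@(?P<at>[\d.]+)(?P<at_unit>ms|s))?"
    r"(?:\s+\+(?P<took>[\d.]+)(?P<took_unit>ms|s))?"
)


def to_seconds(value, unit):
    """Convert a captured number plus suffix to seconds; None stays None."""
    if value is None:
        return None
    return float(value) / 1000 if unit == "ms" else float(value)


def parse_critical_chain(text):
    """Return {unit: (active_at_seconds, startup_seconds)}; missing fields are None."""
    chain = {}
    for line in text.splitlines():
        match = LINE.search(line)
        if match:
            chain[match["unit"]] = (
                to_seconds(match["at"], match["at_unit"]),
                to_seconds(match["took"], match["took_unit"]),
            )
    return chain


# Excerpt of the critical-chain output from this report.
SAMPLE = """\
systemd-networkd.service +440ms
└─dbus.service @11.461s
  └─basic.target @11.403s
    └─sockets.target @11.401s
      └─dbus.socket @11.398s
        └─cloud-init.service @10.127s +1.266s
          └─networking.service @9.305s +799ms
"""

chain = parse_critical_chain(SAMPLE)
cloud_init_active = chain["cloud-init.service"][0]
dbus_active = chain["dbus.service"][0]
# networkd is ordered after dbus.service, so it cannot start before dbus_active.
print("cloud-init active before dbus:", cloud_init_active < dbus_active)
```
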

# grep systemd /usr/share/snappy/dpkg.list
ii libnss-resolve:amd64 229-4ubuntu11 amd64 nss module to resolve names via systemd-resolved
ii libpam-systemd:amd64 229-4ubuntu11 amd64 system and service manager - PAM module
ii libsystemd0:amd64 229-4ubuntu11 amd64 systemd utility library
ii systemd 229-4ubuntu11 amd64 system and service manager
ii systemd-sysv 229-4ubuntu11 amd64 system and service manager - SysV links

# grep cloud-init /usr/share/snappy/dpkg.list
ii cloud-init 0.7.8-201610260005-gf7a5756-0ubuntu1~trunk~ubuntu16.04.1 all Init scripts for cloud instances

SRU INFORMATION FOR systemd
===========================
Fix: For xenial it is sufficient to drop systemd-networkd's After=dbus.service (https://github.com/systemd/systemd/commit/5f004d1e32) and (for xenial only) drop the useless org.freedesktop.network1.busname unit (which is always "condition failed" as there is no kdbus, but it moves systemd-networkd.service after sockets.target, which is too late for cloud-init).

Regression potential: Low. networkd is not widely being used outside of netplan/snappy in xenial. Running it before dbus.service is running has two consequences:
 - It cannot immediately expose its D-Bus status interface. But it will retry every 5 s until that succeeds, so the D-Bus status interface will continue to work. (see test case)
 - If a DHCP response with a hostname or timezone is received before dbus.service is running, it cannot talk to systemd-hostnamed/systemd-timedated to set these properties (if enabled). However, this is broken in xenial anyway as it fails on polkit permissions (this and retrying this configuration after D-Bus is up has been fixed in upstream master now).

As for removing the "*.busname" units in xenial: kdbus has never been part of any distribution; there was only an experimental DKMS package for it in some PPA. It's dead as an upstream project, so dropping the *.busname unit(s) from xenial should have no practical effect, as these never start and always end in "condition failed". Yakkety's systemd already has them removed.

Test case:
 - Install nplan, set up a netplan configuration and remove /etc/network/interfaces.
 - Upgrade to the proposed packages.
 - Ensure that the network is still functional and "busctl" shows org.freedesktop.network1, i. e. networkd successfully connected to the bus.
 - Check the journal that systemd-networkd.service starts before dbus.service, which should usually be the case with this fix. Check "journalctl -b" for "Started Network Service." vs. "Started D-Bus System Message Bus."

  If it repeatedly starts the other way around, you can force it with "sudo systemctl edit systemd-networkd.service" and
   [Unit]
   Before=sysinit.target

  (This is effectively what cloud-init.service will do soon.)
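
The journal check in the test case could be scripted roughly as follows. This is a sketch, not part of the SRU: the function names are made up, the two message strings are the ones quoted in the test case, and `current_boot_journal()` assumes `journalctl` is on the PATH.

```python
import subprocess


def first_index(lines, message):
    """Index of the first journal line containing `message`, or None."""
    for index, line in enumerate(lines):
        if message in line:
            return index
    return None


def networkd_before_dbus(journal_lines):
    """True if networkd's start message precedes D-Bus's, False if it
    follows it, None if either message is missing from the journal."""
    networkd = first_index(journal_lines, "Started Network Service.")
    dbus = first_index(journal_lines, "Started D-Bus System Message Bus.")
    if networkd is None or dbus is None:
        return None
    return networkd < dbus


def current_boot_journal():
    """Return the lines of "journalctl -b" for the running system."""
    result = subprocess.run(
        ["journalctl", "-b", "--no-pager"],
        capture_output=True, text=True, check=True,
    )
    return result.stdout.splitlines()
```

On a test machine, `networkd_before_dbus(current_boot_journal())` should then come back True with the fixed packages in the usual case.
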

Martin Pitt (pitti) wrote :

cloud-init already has *very* strong dependencies:

  Requires=networking.service
  Before=basic.target

(which is sorting the early boot fairly strictly). But I guess in the same vein, if cloud-init wants to run in between networkd and basic.target, it needs to grow an After=systemd-networkd.service. I also suggest to replace the "Requires=networking.service" with "After=networking.service" as ifupdown is not mandatory any more.

However, due to networkd's After=dbus.service this wouldn't work yet, as dbus.service runs in late boot -- I proposed/tried to change this, but it's not ready for that (bug 1629797, https://bugs.freedesktop.org/show_bug.cgi?id=98254).

networkd can run without D-Bus in principle, but if dbus.service is not running yet while dbus.socket is, we'll run into deadlocks again -- and if we start networkd before D-Bus, we need to teach it to connect to D-Bus once it becomes available.

Changed in systemd (Ubuntu):
status: New → Triaged
Changed in cloud-init (Ubuntu):
status: New → Triaged
Steve Langasek (vorlon) on 2016-10-26
Changed in systemd (Ubuntu):
importance: Undecided → High
assignee: nobody → Martin Pitt (pitti)
Changed in cloud-init (Ubuntu):
assignee: nobody → Martin Pitt (pitti)
importance: Undecided → High
Steve Langasek (vorlon) wrote :

Note that the systemd dependencies shown for cloud-init in xenial (on which Ubuntu Core 16 is based) don't match those listed in the bug description. Instead, xenial currently has:

After=cloud-init-local.service networking.service
Before=network-online.target sshd.service sshd-keygen.service systemd-user-sessions.service
Requires=networking.service
Wants=local-fs.target cloud-init-local.service sshd.service sshd-keygen.service

That's certainly an easier target for fixing the ordering on.

I'm surprised that cloud-init is being ordered before dbus in 16.10+. This appears to have been done in response to bug #1629797. This change to ordering on account of nss-resolve does not give me the warm fuzzies. There's feedback in the end of that bug that cloud-init should *not* be using Before=basic.target / Before=dbus.socket, but instead use Before=sysinit.target. That seems like a change we should be making. But I also didn't understand that we would be using nss-resolve - I missed a nuance in <https://lists.ubuntu.com/archives/ubuntu-devel/2016-June/039406.html>, I thought that once resolved supported DNS, we would use that /instead of/ the NSS module. Is there really a benefit to using both?

Regardless, none of that seems to account for the specific problem reported here, on Ubuntu Core 16; because Ubuntu Core 16 does contain libnss-resolve, but does *not* contain the cloud-init with the DefaultDependencies=no.

Ryan Harper (raharper) wrote :

On Wed, Oct 26, 2016 at 6:41 PM, Steve Langasek <email address hidden> wrote:

> Note that the systemd dependencies shown for cloud-init in xenial (on
> which Ubuntu Core 16 is based) don't match those listed in the bug
> description. Instead, xenial currently has:
>
> After=cloud-init-local.service networking.service
> Before=network-online.target sshd.service sshd-keygen.service
> systemd-user-sessions.service
> Requires=networking.service
> Wants=local-fs.target cloud-init-local.service sshd.service
> sshd-keygen.service
>
> That's certainly an easier target for fixing the ordering on.
>

We're preparing an SRU for Xenial, but the issue is related to systemd-resolved which is NOT in Xenial, so the failing behavior of a 120 second timeout for resolving hosts when dbus isn't up yet doesn't materialize on Xenial, only yakkety.

 https://bugs.launchpad.net/ubuntu/+source/cloud-init/+bug/1629797

>
> I'm surprised that cloud-init is being ordered before dbus in 16.10+.
> This appears to have been done in response to bug #1629797. This change
> to ordering on account of nss-resolve does not give me the warm fuzzies.
> There's feedback in the end of that bug that cloud-init should *not* be
> using Before=basic.target / Before=dbus.socket, but instead use
> Before=sysinit.target. That seems like a change we should be making. But
> I also didn't understand that we would be using nss-resolve - I missed a
> nuance in <https://lists.ubuntu.com/archives/ubuntu-devel/2016-June/039406.html>,
> I thought that once resolved supported DNS, we would
> use that /instead of/ the NSS module. Is there really a benefit to using
> both?
>

systemd-resolved doesn't run soon enough where cloud-init needs to resolve metadata service endpoints, like in GCE.

I don't know enough to say if resolved and NSS are 100% swappable; but at a minimum, resolved would need to be able to run as early as NSS runs to allow cloud-init to resolve metadata service URLs very early in boot.

>
> Regardless, none of that seems to account for the specific problem
> reported here, on Ubuntu Core 16; because Ubuntu Core 16 does contain
> libnss-resolve, but does *not* contain the cloud-init with the
> DefaultDependencies=no.
>

We're specifically introducing a newer cloud-init which adds snap create-user support and which also includes the change in unit dependencies from upstream cloud-init. Those changes to unit dependencies will also be SRUed to Xenial.

>
> --
> You received this bug notification because you are subscribed to the bug
> report.
> https://bugs.launchpad.net/bugs/1636912
>
> Title:
> systemd-networkd runs too late for cloud-init.service (net)
>
> To manage notifications about this bug go to:
> https://bugs.launchpad.net/ubuntu/+source/cloud-init/+bug/1636912/+subscriptions
>

Martin Pitt (pitti) wrote :

@Steve:

> There's feedback in the end of that bug that cloud-init should *not* be using Before=basic.target / Before=dbus.socket, but instead use Before=sysinit.target.

Correct, that would avoid starting it in between sysinit.target and basic.target when the sockets start, and avoid the deadlock in a simpler way.

> I thought that once resolved supported DNS, we would use that /instead of/ the NSS module. Is there really a benefit to using both?

Both "resolve" and "dns" are NSS modules, and you can't not use NSS for name resolution (as that's what glibc's gethostbyname() and friends use). Maybe you meant "instead of the dns module"? This is a required fallback for either (1) foreign architecture programs which don't have the corresponding nss-resolve:arch installed, or (2) early boot when D-Bus and hence resolved are not yet running.

> because Ubuntu Core 16 does contain libnss-resolve,

"contains" yes, but we don't use nss-resolve in Ubuntu 16.04 and hence snappy. This got introduced step-wise in 16.10, isn't completely finished yet, and IMHO not yet ready for backporting (if we ever actually want that). If snappy's /etc/nsswitch.conf contains "resolve", this is NOT intended.

@Ryan:

> I don't know enough to say if resolved and NSS are 100% swappable;

Again, I figure you mean s/NSS/dns/; dns does not do DNSSEC, caching and the like, but none of this should be relevant at early boot (if you need network that early, you are basically on your own and not guaranteed to succeed anyway -- "dns" is by far good enough for that).

Anyway, all of this is not the primary focus for this bug -- this bug is about the ordering of cloud-init vs. networkd, resolved is not in this picture.

Martin Pitt (pitti) wrote :

So did I understand this right:

 * In current xenial, cloud-init runs in late boot, so there is no principal ordering problem between cloud-init and networkd, other than that cloud-init.service should declare After=systemd-networkd.service similar to what it does for ifupdown (After=networking.service).

 * Thus this would not be a blocker for Ubuntu Core 16 right now, and merely adding that After= should suffice for the moment.

 * This *is* an issue for 16.10/zesty as cloud-init.service runs in early boot and dbus/networkd can't.

 * This *will become* an issue for 16.04 as these cloud-init changes are meant to be backported soon. Will that happen for the Ubuntu Core 16 GA in a few days already? I. e. is this something which we need to crowbar in ASAP, or do we have some time to figure out a proper solution how to start networkd first, and dbus later on?

Martin Pitt (pitti) wrote :

The "networkd after D-Bus" ordering was introduced in https://github.com/systemd/systemd/commit/1346b1f038 and later refined in https://github.com/systemd/systemd/commit/bcbca8291f .

So with the latter, removing this ordering would break the "UseHostname: yes" flag (when you receive/set your host name from what DHCP gives you), i. e. it would silently not work. We don't use that feature in the distro itself, but it would be a shame to break it for everyone even when cloud-init is not involved at all.

So this at least gives us a quick way out for 16.04 -- we can simply drop the "After=dbus.service" from systemd-networkd.service without much trouble, but for devel I'd at least discuss this with upstream.

Changed in systemd (Ubuntu):
importance: High → Medium
Changed in systemd (Ubuntu Xenial):
importance: Undecided → High
status: New → Triaged
Changed in cloud-init (Ubuntu):
assignee: Martin Pitt (pitti) → nobody
Ryan Harper (raharper) wrote :

On Thu, Oct 27, 2016 at 5:59 AM, Martin Pitt <email address hidden> wrote:

> So did I understand this right:
>
> * In current xenial, cloud-init runs in late boot, so there is no
> principal ordering problem between cloud-init and networkd, other than
> that cloud-init.service should declare After=systemd-networkd.service
> similar to what it does for ifupdown (After=networking.service).
>
> * Thus this would not be a blocker for Ubuntu Core 16 right now, and
> merely adding that After= should suffice for the moment.
>
> * This *is* an issue for 16.10/zesty as cloud-init.service runs in
> early boot and dbus/networkd can't.
>

I suppose we could cherry pick the upstream snap create-user feature and other fixes but I expect the SRU of cloud-init to Xenial to include the change in ordering to enable running cloud-init earlier in boot; without resolved in 16.04 though the change isn't strictly needed but to keep the backported branch closer to trunk, I would expect those to come in.

However, it does sound like it's reasonable to just add the After=systemd-networkd.service as you say; I can test this out now and confirm if that's sufficient to get a picture of what this should look like.

>
> * This *will become* an issue for 16.04 as these cloud-init changes are
> meant to be backported soon. Will that happen for the Ubuntu Core 16 GA
> in a few days already? I. e. is this something which we need to crowbar
> in ASAP, or do we have some time to figure out a proper solution how to
> start networkd first, and dbus later on?
>

The official Ubuntu Core 16 GA images won't have cloud-init enabled by default. I'm currently working on a separate build of UC16 *with* cloud-init enabled, focusing on primary public cloud support. In this build, I'll include at least a snapshot of cloud-init trunk as of a few days ago to include the snap create-user support needed for function.

I can add a task against cloud-init for this bug and get an MR with the change to the cloud-init.service file after testing that the change achieves the goal.

Scott Moser (smoser) wrote :

I really want to avoid having different cloud-init in any of xenial/yakkety/zesty.

I'm willing to carry a patch, and more so if we believe it to just be a temporary thing, but it's less than ideal. The goal is to have one cloud-init on each. I realize maybe that is too idealistic for reality, so if we need to change some systemd config files, then we need to.

That just makes testing trunk all that much more difficult though.

Ryan Harper (raharper) wrote :

On Thu, Oct 27, 2016 at 8:43 AM, Ryan Harper <email address hidden>
wrote:

>
>
> On Thu, Oct 27, 2016 at 5:59 AM, Martin Pitt <email address hidden>
> wrote:
>
>> So did I understand this right:
>>
>> * In current xenial, cloud-init runs in late boot, so there is no
>> principal ordering problem between cloud-init and networkd, other than
>> that cloud-init.service should declare After=systemd-networkd.service
>> similar to what it does for ifupdown (After=networking.service).
>>
>>
I'm playing with this now, and we've got some more ordering to do.

I've used After and Requires; but these are focused on when the units start rather than when we can expect networking to be up.

networkd runs, but it takes a few seconds to get DHCP responses and have the interfaces come up. What I'm seeing is that systemd runs networkd, which then allows cloud-init.service to run; cloud-init then checks the network interfaces and finds that eth0 isn't up yet. Several seconds later eth0 does come up, but not before cloud-init.service has run.

There is a network-online.target, which I think we'd want cloud-init.service to run After; oddly though, when using the ifupdown 'networking.service' we don't need to use that target, and cloud-init.service explicitly runs Before=network-online.target.

That seems wrong, since cloud-init.service may scan for network metadata services. If I remove that line, I'll see how that affects both the ifupdown and networkd paths.

Ryan

Ryan

Ryan Harper (raharper) wrote :

networkd bringing up eth0 (virtio) on qemu user-net is taking like 40 seconds... why?

root@localhost:~# journalctl --unit systemd-networkd.service | egrep "(Started|Configured)"
Oct 27 16:31:59 localhost.localdomain systemd[1]: Started Network Service.
Oct 27 16:32:32 localhost.localdomain systemd[1]: Started Network Service.
Oct 27 16:32:45 localhost.localdomain systemd-networkd[1307]: eth0: Configured

Ryan Harper (raharper) wrote :

It appears that the networkd in Xenial is sensitive to dbus service being available; it times out a bit waiting for dbus before continuing; this is the delay.

If I drop cloud-init.service 'Before=dbus.socket' and 'Before=basic.target'; Add 'After=systemd-networkd-wait-online.service'; then ensure that we have a symlink to systemd-networkd.service and systemd-networkd-wait-online.service in /etc/systemd/system/network-online.target.wants/ then I can get reliable service; networkd comes up, waits for eth0 to DHCP, then when complete, cloud-init init runs and eth0 is up.
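
For anyone wanting to reproduce the symlink step in a scratch image, here is a minimal sketch. The two unit names and the .wants directory are the ones named above; the `/lib/systemd/system` unit directory and the `root` parameter (so the links can be created under a mounted or temporary tree rather than the live system) are assumptions of this sketch.

```python
import os

UNIT_DIR = "/lib/systemd/system"  # assumption: where the packaged units live
WANTS_DIR = "/etc/systemd/system/network-online.target.wants"
UNITS = ("systemd-networkd.service", "systemd-networkd-wait-online.service")


def link_units(root="/"):
    """Create the .wants symlinks under `root`; return the links created."""
    wants = os.path.join(root, WANTS_DIR.lstrip("/"))
    os.makedirs(wants, exist_ok=True)
    created = []
    for unit in UNITS:
        link = os.path.join(wants, unit)
        if not os.path.islink(link):  # leave existing links alone
            os.symlink(os.path.join(UNIT_DIR, unit), link)
            created.append(link)
    return created
```

Calling `link_units("/mnt/image")` (a hypothetical mount point) would only touch that tree; running it a second time creates nothing new.
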

This of course diverges cloud-init.service from trunk;

Ideally we'd be able to run networkd without dbus *and* without timeouts.

Ryan Harper (raharper) wrote :

Oct 27 19:22:27 localhost.localdomain systemd[1]: writable.mount: Unit is bound to inactive unit dev-vda3.device. Stopping, too.
Oct 27 19:22:27 localhost.localdomain systemd[1]: systemd-networkd.service: Found ordering cycle on systemd-networkd.service/start
Oct 27 19:22:27 localhost.localdomain systemd[1]: systemd-networkd.service: Found dependency on org.freedesktop.network1.busname/start
Oct 27 19:22:27 localhost.localdomain systemd[1]: systemd-networkd.service: Found dependency on sysinit.target/start
Oct 27 19:22:27 localhost.localdomain systemd[1]: systemd-networkd.service: Found dependency on cloud-init.service/start
Oct 27 19:22:27 localhost.localdomain systemd[1]: systemd-networkd.service: Found dependency on systemd-networkd-wait-online.service/start
Oct 27 19:22:27 localhost.localdomain systemd[1]: systemd-networkd.service: Found dependency on systemd-networkd.service/start
Oct 27 19:22:27 localhost.localdomain systemd[1]: systemd-networkd.service: Breaking ordering cycle by deleting job org.freedesktop.network1.busname/start
Oct 27 19:22:27 localhost.localdomain systemd[1]: org.freedesktop.network1.busname: Job org.freedesktop.network1.busname/start deleted to break ordering cycle starting with systemd-networkd.service/start
Oct 27 19:22:27 localhost.localdomain systemd[1]: snapd.socket: Found ordering cycle on snapd.socket/start
Oct 27 19:22:27 localhost.localdomain systemd[1]: snapd.socket: Found dependency on sysinit.target/start
Oct 27 19:22:27 localhost.localdomain systemd[1]: snapd.socket: Found dependency on cloud-init.service/start
Oct 27 19:22:27 localhost.localdomain systemd[1]: snapd.socket: Found dependency on systemd-networkd-wait-online.service/start
Oct 27 19:22:27 localhost.localdomain systemd[1]: snapd.socket: Found dependency on systemd-networkd.service/start
Oct 27 19:22:27 localhost.localdomain systemd[1]: snapd.socket: Found dependency on dbus.service/start
Oct 27 19:22:27 localhost.localdomain systemd[1]: snapd.socket: Found dependency on basic.target/start
Oct 27 19:22:27 localhost.localdomain systemd[1]: snapd.socket: Found dependency on sockets.target/start
Oct 27 19:22:27 localhost.localdomain systemd[1]: snapd.socket: Found dependency on snapd.socket/start
Oct 27 19:22:27 localhost.localdomain systemd[1]: snapd.socket: Breaking ordering cycle by deleting job cloud-init.service/start
Oct 27 19:22:27 localhost.localdomain systemd[1]: cloud-init.service: Job cloud-init.service/start deleted to break ordering cycle starting with snapd.socket/start
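
The second cycle in this log can be reproduced with a small model. This is a sketch only: the `AFTER` table is read off the "Found dependency on" lines for snapd.socket (each unit is ordered after the next one listed), `find_cycle` is a plain depth-first search written for illustration, and none of it is systemd's actual algorithm.

```python
def find_cycle(after):
    """Return one ordering cycle as a list of units (first == last), or None.

    `after` maps each unit to the units it must start after.
    """
    WHITE, GREY, BLACK = 0, 1, 2  # unvisited, on the current path, finished
    state = {}
    path = []

    def visit(unit):
        state[unit] = GREY
        path.append(unit)
        for dep in after.get(unit, ()):
            if state.get(dep, WHITE) == GREY:
                # dep is already on the current path: that closes a cycle.
                return path[path.index(dep):] + [dep]
            if state.get(dep, WHITE) == WHITE:
                found = visit(dep)
                if found:
                    return found
        path.pop()
        state[unit] = BLACK
        return None

    for unit in list(after):
        if state.get(unit, WHITE) == WHITE:
            found = visit(unit)
            if found:
                return found
    return None


# Ordering edges taken from the snapd.socket cycle in the log above.
AFTER = {
    "snapd.socket": ["sysinit.target"],
    "sysinit.target": ["cloud-init.service"],
    "cloud-init.service": ["systemd-networkd-wait-online.service"],
    "systemd-networkd-wait-online.service": ["systemd-networkd.service"],
    "systemd-networkd.service": ["dbus.service"],
    "dbus.service": ["basic.target"],
    "basic.target": ["sockets.target"],
    "sockets.target": ["snapd.socket"],
}

print(" -> ".join(find_cycle(AFTER)))
```

Removing the cloud-init.service edge from the table makes `find_cycle` return None, which mirrors how systemd resolved the cycle here by deleting the cloud-init.service/start job.
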

Martin Pitt (pitti) wrote :

> I really want to avoid having different cloud-init in any of xenial/yakkety/zesty.

Me too, and I don't see why this would be conceptually required? "After=systemd-networkd.service" is appropriate in all releases (as that's what you intend to do), after (!) we drop the After=dbus.service from networkd. Simplifying the "Before=basic.target dbus.socket" to "Before=sysinit.target" is something that should be done in y/z and then cleanly backports to x as well.

> I've used After and Requires; but these are focused on when the units start rather than when we can expect networking to be up.

Correct. You can't use network-online.target in early boot, that would be too strong a requirement (as e. g. NetworkManager implements this as well). In early boot there can't ever be a *guarantee* that networking works, so the best that you can do is to wait a little while to see whether you get a default route, e. g. with /lib/systemd/systemd-networkd-wait-online. Of course in desktop or snappy systems you might never get a connection even in late boot, so there need to be some sensible timeouts.
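
A "wait a little while for a default route, with a timeout" loop could look roughly like this. It is not how systemd-networkd-wait-online works internally; it is a substitute technique that polls the kernel's IPv4 routing table in /proc/net/route, and the function names, defaults and injectable `read`/`sleep`/`clock` parameters are made up for this sketch.

```python
import time

RTF_UP = 0x1  # route flag: route is usable


def has_default_route(route_table):
    """True if /proc/net/route-style text contains an IPv4 default route that is up."""
    for line in route_table.splitlines()[1:]:  # skip the header row
        fields = line.split()
        if len(fields) < 8:
            continue
        destination, flags, mask = fields[1], int(fields[3], 16), fields[7]
        if destination == "00000000" and mask == "00000000" and flags & RTF_UP:
            return True
    return False


def read_proc_route():
    """Read the kernel's IPv4 routing table (Linux only)."""
    with open("/proc/net/route") as handle:
        return handle.read()


def wait_for_default_route(timeout=30.0, interval=0.5, read=read_proc_route,
                           sleep=time.sleep, clock=time.monotonic):
    """Poll until a default route appears; return False once `timeout` seconds pass."""
    deadline = clock() + timeout
    while True:
        if has_default_route(read()):
            return True
        if clock() >= deadline:
            return False
        sleep(interval)
```

The caller decides what a False result means, for example skipping network datasources instead of blocking the rest of the boot.
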

> oddly though when using the ifupdown 'networking.service'; we don't need to use that target.

Yes, that's a Type=oneshot, as it just calls "ifup -a". So that's more or less equivalent to s-n-wait-online --timeout=30 or After=s-n-wait-online.service. But the latter would block the entire boot process for that long if there is no network (and this *did* hit us in snappy already, like bug 1431836) -- my gut feeling is that this can be handled more gracefully/asynchronously in code.

> networkd bringing up eth0 (virtio) on qemu user-net is taking like 40 seconds... why?

That's certainly unusual, it should only take ~ 5s or so. I suggest filing a separate bug for that with your precise config (my suspicion is that you enabled DHCPv6 or similar and it's waiting/timing out for that, or something similar).

Steve Langasek (vorlon) wrote :

On Thu, Oct 27, 2016 at 01:43:05PM -0000, Ryan Harper wrote:

> without resolved in 16.04 though the change isn't strictly needed but to
> keep the backported branch closer to trunk, I would expect those to come
> in.

Except it's not correct that we don't have resolved in 16.04. We're not
using resolved in the *desktop or server* in 16.04. Ubuntu Core 16 has
explicitly seeded libnss-resolve in their image, you can see it configured
in /etc/nsswitch.conf; for some reason the symlink that libnss-resolve sets
up for /etc/systemd/system/multi-user.target.wants/systemd-resolved.service
→ /lib/systemd/system/systemd-resolved.service is absent, but this may be a
bug related to the bind mounts over /etc/systemd and maybe shouldn't be
relied on. I certainly don't know what to expect from the current resolved
setup in Ubuntu Core 16, combined with the cloud-init systemd changes from
zesty.

(Martin, I don't suppose the snappy team talked to you before seeding
libnss-resolve, which is a universe package in xenial, in their images...?)

Ryan Harper (raharper) wrote :

On Thu, Oct 27, 2016 at 2:50 PM, Martin Pitt <email address hidden> wrote:

> > oddly though when using the ifupdown 'networking.service'; we don't
> need to use that target.
>
> Yes, that's a Type=oneshot, as it just calls "ifup -a". So that's more
> or less equivalent to s-n-wait-online --timeout=30 or After=s-n-wait-
> online.service. But the latter would block the entire boot process for
> that long if there is no network (and this *did* hit us in snappy
> already, like bug 1431836) -- my gut feeling is that this can be handled
> more gracefully/asynchronously in code.
>

Where though? cloud-init expects networking to be up, like 'networking.service', before it runs. So why shouldn't we use networkd-wait-online?

Additionally, cloud-init needs to wait for networking to be up, whether the system is using ifupdown/networking.service or netplan/networkd. Adding After= for both of these appears to be problematic; we really want something like

After=networking|networkd-wait-online

which handles determining if networkd was supposed to run or not.

Maybe a conditional After would be nice here; we could see if networkd was expected to start.

It's possible that this isn't an issue outside of Ubuntu Core 16.

For Xenial cloud-images, we don't yet have networkd/resolved and netplan to replace the ifupdown setup.
For Y+ cloud-images, we can move to that if we want, since all of the parts are there too.

For UC16 on Xenial, it *does* have networkd/netplan and expects to use that by default; however, it currently comes in with a dep on ifupdown, which could be dropped if cloud-init has enough support for network yaml v2/netplan for fallback networking (though the UC16 image has a built-in network config like the older cloud-images did).

Martin Pitt (pitti) wrote :

> Ubuntu Core 16 has explicitly seeded libnss-resolve in their image

Eek -- this was most definitely not intended. resolved still has quirks in 231 (plus a lot of backported patches), it's not even remotely ready in xenial's 229.

> Martin, I don't suppose the snappy team talked to you before seeding libnss-resolve

No, in fact I asked them to not do this in xenial, for the reasons above.

Martin Pitt (pitti) wrote :

> cloud-init expects networking to be up, like 'networking.service' before it runs..

I think it needs to make up its mind -- why does it want to run Before=network-online.target then? I thought the idea was that cloud-init is able to *provide* a network configuration (through the YAML). It seems to me that this might have to be split into two parts then -- one that can provide network config which runs early and does not require networking, and one that can use the network to configure other bits?

> So why shouldn't we use networkd-wait-online ?

You can use the program. I said that it might not be the best idea to use After=s-n-wait-online.service, as that would block the entire cloud-init.service and with it the entire boot (as cloud-init.service has very strong dependencies) if there is no network available.

Ryan Harper (raharper) wrote :

On Thu, Oct 27, 2016 at 3:57 PM, Martin Pitt <email address hidden> wrote:

> > cloud-init expects networking to be up, like 'networking.service'
> before it runs..
>
>
> I think it needs to make up its mind -- why does it want to run
> Before=network-online.target then? I thought the idea was that cloud-init
> is able to *provide* a network configuration (through the YAML). It seems
> to me that this might have to be split into two parts then -- one that can
> provide network config which runs early and does not require networking,
> and one that can use the network to configure other bits?
>

We have to do both.

In the case that we have a local config which provides networking configuration, we can emit it, and then we want to "bring it up".

In other cases, we may need to bring up fallback networking (i.e., dhcp on eth0) to look for a metadata service on the network, which may also provide network configuration (which we write out) and then want to "bring it up".

>
> > So why shouldn't we use networkd-wait-online ?
>
> You can use the program. I said that it might not be the best idea to
> use After=s-n-wait-online.service, as that would block the entire cloud-
> init.service and with it the entire boot (as cloud-init.service has very
> strong dependencies) if there is no network available.
>

cloud-init.service *requires* network; it's designed to block until then, otherwise you just time out trying to reach network endpoints like the OpenStack metadata service.

The "hang" in Ubuntu Core was primarily related to *not* providing a nocloud seed in the image; when it was booted without a seed and not disabled, it goes and *looks* for a seed, first local, then over the net.

Ryan Harper (raharper) wrote :

On Thu, Oct 27, 2016 at 3:57 PM, Martin Pitt <email address hidden> wrote:

> > cloud-init expects networking to be up, like 'networking.service'
> before it runs..
>
>
> It seems to me that this might have to be split into two parts then -- one
> that can provide network config which runs early and does not require
> networking, and one that can use the network to configure other bits?
>

We already do this. cloud-init-local.service runs before networking; it
examines for *local* non-network seeds (like nocloud-net, or a config-drive
)
If that's present, it's sourced and used.

However, if there isn't a local seed, then we must search again *once*
networking is up.

This works just fine with 'networking.service' due to the "atomic" nature
of ifup, where once the oneshot service runs, we can assume that networking
is up.

However, networkd runs and brings up networking asynchronously, which is
fine, but we no longer have a clear checkpoint at which cloud-init can run
with networking up but before we're at the full 'network-online.target'.

I'm not sure how to close the subtle distinction between 'networking' and
'systemd-networkd'
but it's clearly different with no obvious way to make them equivalent.

>
> > So why shouldn't we use networkd-wait-online ?
>
> You can use the program. I said that it might not be the best idea to
> use After=s-n-wait-online.service, as that would block the entire cloud-
> init.service and with it the entire boot (as cloud-init.service has very
> strong dependencies) if there is no network available.
>

It actually works quite well, except the netplan generator only creates a
Wants= for systemd-networkd, so nothing *wants* networkd-wait-online
unless we add it; this is problematic for cloud-init on a system which
doesn't have a netplan config (where networkd isn't going to run).


Changed in systemd:
status: Unknown → New
Martin Pitt (pitti) wrote :

> However, if there isn't a local seed, then we must search again *once* networking is up.

Fair enough, but you can then of course not use that unit to configure the network. But this "if there isn't a local seed" isn't something you can express as a static condition, hence my thought that it might be better if c-i calls s-n-wait-online if and only if it's necessary. But YMMV.

> This works just fine with 'networking.service'

This did/does not really work "fine" IMHO -- all of our cloud images hang for a long time at boot unless you give them a local data source or disable cloud-init. It also imposes the restriction that you must be online during boot, which is fine for a cloud environment, but rather unfriendly for other scenarios.

> due to the "atomic" nature of ifup where once the oneshot service runs, we can assume that networking is up. However, networkd runs and asynchronously brings up networking; which is fine but we now no longer have a clear checkpoint at which cloud-init can run with networking up but before.

Again -- s-n-wait-online.service is exactly the networkd counterpart of networking.service for ifupdown, that gives you the "network is fully configured" synchronization point. The issue is not that it doesn't exist, but that I think that it's not a good thing to depend on either one.

> we really want something like
> After=networking|networkd-wait-online
> which handles determining if networkd was supposed to run or not

That already exists, it's network-online.target -- whatever "implements" it (ifupdown, networkd, NM) will hook itself into this target. Nothing more, nothing less, so if cloud-init just wants to wait until it's online, then just make it Requires/After=network-online.target instead of Before= it. (But again -- this is a very strong dependency which is very inconvenient anywhere but cloud environments with essentially one virtual ethernet card).

BTW, I'm not sure if it came across -- if you play around with this, please drop systemd-networkd.service's After=dbus.service; that will get rid of the worst dependency cycles, and it's something which we can do in Xenial rather easily (not so easy for devel, that's the part we need to discuss with upstream or decide if we care enough about this feature, but eventually I figure we want to get rid of it either way).

Ryan Harper (raharper) wrote :

On Fri, Oct 28, 2016 at 1:02 PM, Martin Pitt <email address hidden> wrote:

> > However, if there isn't a local seed, then we must search again *once*
> networking is up.
>
> Fair enough, but you can then of course not use that unit to configure
> the network.

Of course we can. We need to cycle the network though.

> But this "if there isn't a local seed" isn't something you
> can express as a static condition, hence my thought that it might be
> better if c-i calls s-n-wait-online if and only if it's necessary. But
> YMMV.
>

Right, though it's not clear to me that we can express this in unit terms.
It may have to be done internally; that is, it's possible that in cloud-init
"net" mode we need to block ourselves until networking is up.

>
>
> > This works just fine with 'networking.service'
>
> This did/does not really work "fine" IMHO -- all of our cloud images
> hang for a long time at boot unless you give them a local data source or
> disable cloud-init.

This is by-design. cloud-init is *interposing* itself on purpose.

> It also imposes the restriction that you must be online during boot,
> which is fine for a cloud environment, but rather unfriendly for other
> scenarios.
>

No, you need to provide a datasource, or indicate (via boot params) that
you're not interested in cloud-init running.

It's certainly true that if someone just runs qemu-system-x86 -hda cloud.img,
it's going to hang. But those folks are explicitly booting a *cloud* image
without a cloud.

We handle this fine with uvt-kvm which provides a nocloud-net seed when
booting.

> > due to the "atomic" nature of ifup where once the oneshot service
> runs, we can assume that networking is up. However, networkd runs and
> asynchronously brings up networking; which is fine but we now no longer
> have a clear checkpoint at which cloud-init can run with networking up
> but before.
>
> Again -- s-n-wait-online.service is exactly the networkd counterpart of
> networking.service for ifupdown, that gives you the "network is fully
> configured" synchronization point. The issue is not that it doesn't
> exist, but that I think that it's not a good thing to depend on either
> one.
>

It is, but it's a separate unit: "networking" == "networkd" +
"networkd-wait-online".
However, the netplan generator only emits the "systemd-networkd" target
wants, so if we use After=systemd-networkd-wait-online, it is never run
since nothing wants it.
If we add it explicitly, then it runs even when networkd doesn't.

>
> > we really want something like
> > After=networking|networkd-wait-online
> > which handles determining if networkd was supposed to run or not
>
> That already exists, it's network-online.target -- whatever "implements"
> it (ifupdown, networkd, NM) will hook itself into this target. Nothing
> more, nothing less, so if cloud-init just wants to wait until it's
> online, then just make it Requires/After=network-online.target instead
> of Before= it. (But again -- this is a very strong dependency which is
> very inconvenient anywhere but cloud environments with essentially one
> virtual ethernet card).
>

It may be that network-online.target is the right place. Scott had some
reason
for n...


Steve Langasek (vorlon) wrote :

On Fri, Oct 28, 2016 at 06:02:29PM -0000, Martin Pitt wrote:
> > However, if there isn't a local seed, then we must search again *once*
> networking is up.

> But this "if there isn't a local seed" isn't something you can express as
> a static condition, hence my thought that it might be better if c-i calls
> s-n-wait-online if and only if it's necessary. But YMMV.

Yes, it's not a static condition; and even if it *were*, cloud-init should
apply sensible timeouts in the event that no network source is available.
So s-n-wait-online is still the better answer.

> > This works just fine with 'networking.service'

> This did/does not really work "fine" IMHO -- all of our cloud images
> hang for a long time at boot unless you give them a local data source or
> disable cloud-init.

That is the image working as designed, when booted in an environment it's
not designed for.

> It also imposes the restriction that you must be online during boot, which
> is fine for a cloud environment, but rather unfriendly for other
> scenarios.

cloud-init does not *require* you to be online. It *requires* you to
provide a data source; as you've already pointed out, it can be a local disk
or it can be a network source. Sometimes you have a disk source, sometimes
you have a network source; this is not a design decision of cloud-init, it's
a function of the *cloud environment* where you're booting the image, and
it's out of the scope of this bug to redesign cloud-init to be something
other than it is - a tool for provisioning generic images when booting
noninteractively in a cloud.

Steve Langasek (vorlon) wrote :

On Fri, Oct 28, 2016 at 06:46:07PM -0000, Ryan Harper wrote:
> > > we really want something like
> > > After=networking|networkd-wait-online
> > > which handles determining if networkd was supposed to run or not

> > That already exists, it's network-online.target -- whatever "implements"
> > it (ifupdown, networkd, NM) will hook itself into this target. Nothing
> > more, nothing less, so if cloud-init just wants to wait until it's
> > online, then just make it Requires/After=network-online.target instead
> > of Before= it. (But again -- this is a very strong dependency which is
> > very inconvenient anywhere but cloud environments with essentially one
> > virtual ethernet card).

> It may be that network-online.target is the right place. Scott had some
> reason for not using that explicitly before; I expect some details from
> him.

Because network-online.target is not guaranteed to be reached for the
reasons Martin mentions, so cloud-init depending on it will forever block
other services from starting up, /even if/ you had a valid local data
source.

Martin Pitt (pitti) wrote :

Steve Langasek [2016-10-29 4:38 -0000]:
> cloud-init does not *require* you to be online. It *requires* you to
> provide a data source

Yes, I know, that's exactly my point -- by design it should/does not
require you to be online, but by making it wait for
s-n-wait-online.service or network-online.target you would introduce
this requirement. Hence my suggestion to dynamically call the wait
binaries when appropriate.
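
One way to read that suggestion, as a minimal sketch rather than cloud-init's actual implementation: block on networkd only when no local seed was found. The function name, its arguments, the WAIT_ONLINE override and the 120-second default are all hypothetical; the binary path is assumed to be the Ubuntu location. The --timeout= option is an existing option of systemd-networkd-wait-online.

```shell
# Hypothetical helper, not part of cloud-init.
# $1: "yes" if a local datasource seed was found, anything else otherwise
# $2: timeout in seconds for the wait (assumed default: 120)
# WAIT_ONLINE may be set to override the wait binary (assumed Ubuntu path below).
wait_for_network_if_needed() {
    have_local_seed="$1"
    timeout_secs="${2:-120}"
    wait_bin="${WAIT_ONLINE:-/lib/systemd/systemd-networkd-wait-online}"

    if [ "$have_local_seed" = "yes" ]; then
        # A local seed exists, so there is no reason to block boot on the network.
        return 0
    fi

    # No local seed: wait until networkd reports the links as configured,
    # or give up after the timeout; the exit status is passed to the caller.
    "$wait_bin" --timeout="$timeout_secs"
}
```

Calling `wait_for_network_if_needed no 60` would wait at most 60 seconds; the caller can then act on a non-zero exit status instead of hanging the boot.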

(I think we actually agree, and just talk past each other ☺ )

Martin Pitt (pitti) wrote :

Ryan Harper [2016-10-28 18:46 -0000]:
> netplan generator only emits the "systemd-networkd" target wants, so
> if we use After=systemd-networkd-wait-online; that's never run since
> nothing wants it.

Right, if you want it, you need to pull it in yourself.

> If we add it explicitly, then it runs even when networkd doesn't

Oh, indeed -- it quickly ends with "Dependency failed" as networkd is
just a Requisite=, not a Requires=. So this actually does give you
exactly the semantics that we want, no? *If* networkd is running then
it waits for it, otherwise it's a no-op.

@Steve: ^ So I take back my previous comment -- it seems
Requires/After=s-n-wait-online.service is actually exactly what we
want after all.

> I did play with it, but the networkd in xenial blocks for some non-trivial
> amount of time (10s of seconds)
> if dbus.service is not up.

OK, that's something to look into then. It doesn't hang if there's no
configuration (i.e. not trivially reproducible); I'll check if I can
reproduce this with a real config ASAP.

Martin Pitt (pitti) wrote :

> I did play with it, but the networkd in xenial blocks for some non-trivial
> amount of time (10s of seconds) if dbus.service is not up.

I cannot reproduce this. I removed /etc/network/interface*, created

$ cat /etc/systemd/network/ens3.network
[Match]
Name=e*

[Network]
DHCP=yes

then dropped the After=dbus.service from systemd-networkd.service and enabled it. Booting and bringing up the ethernet is fast. Even with "systemctl mask dbus.service" it is fast, just that logind and thermald fail to start (expectedly). So how did you get this hang?

Martin Pitt (pitti) wrote :

For the record: the "hang" was the ~ 10s timeout for IPv6 RA. I was testing with Yakkety/Zesty's QEMU whose "user" net has a builtin RA (you get an fec0::* address), while xenial's doesn't.

We enable RA (on the client side) by default, and IMHO should really do so -- selling a new solution in 2016 which does not speak IPv6 would be hilarious. You also can't significantly reduce the timeout, as this would make RA unreliable, and the timeout is presumably also specified somewhere.

So AFAICS the remaining issue is just to make networkd run before dbus.service/socket in systemd, and add "After=networking.service systemd-networkd-wait-online.service" and drop "Wants=networking.service" in cloud-init.service.
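
The cloud-init.service side of that change could be tried locally with a drop-in. This is only a sketch under stated assumptions, not the change that was uploaded: the function name and the drop-in file name 10-network-ordering.conf are made up for illustration, and the target directory is passed in so that nothing is written to /etc unless the caller asks for it. A drop-in can only add dependencies, so removing the existing dependency on networking.service would still need an edit to the unit file itself.

```shell
# Hypothetical helper: write an ordering drop-in for cloud-init.service.
# $1: systemd unit directory to write into (e.g. /etc/systemd/system)
write_cloud_init_ordering_dropin() {
    dropin_dir="$1/cloud-init.service.d"
    mkdir -p "$dropin_dir" || return 1

    # The file name is an assumption; only the After= line comes from the
    # comment above.
    cat > "$dropin_dir/10-network-ordering.conf" <<'EOF'
[Unit]
After=networking.service systemd-networkd-wait-online.service
EOF
}
```

After running `write_cloud_init_ordering_dropin /etc/systemd/system` as root, `systemctl daemon-reload` makes systemd pick up the new ordering.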

Changed in systemd (Ubuntu):
milestone: none → ubuntu-16.11
status: Triaged → In Progress
Changed in systemd (Ubuntu Xenial):
status: Triaged → In Progress
assignee: nobody → Martin Pitt (pitti)
Scott Moser (smoser) wrote :

I've added a MP for some doc on cloud-init boot and what it is trying to accomplish at
 https://code.launchpad.net/~smoser/cloud-init/+git/cloud-init/+merge/310386

Martin Pitt (pitti) wrote :

UseHostname: is currently broken anyway (https://github.com/systemd/systemd/issues/4646), so letting networkd start earlier than dbus.service does not actually regress anything for now. This will be different once #4646 gets fixed, as then this will break functionality.

Launchpad Janitor (janitor) wrote :

This bug was fixed in the package cloud-init - 0.7.8-45-g584b843-0ubuntu1

---------------
cloud-init (0.7.8-45-g584b843-0ubuntu1) zesty; urgency=medium

  * New upstream snapshot.
    - pep8: fix style errors reported by pycodestyle 2.1.0 [Scott Moser]
    - systemd: drop both Wants and After local-fs.target [Scott Moser]
    - systemd: networking service adjustments. (LP: #1636912)
    - systemd: replace Before=basic.target, dbus.target with sysinit.target
      (LP: #1629797)
    - doc: Add documentation on stages of boot.
    - doc: make the RST files consistently formated and other improvements.
    - Ec2: fix syntax and tox in previous commit.
    - Ec2: protect against non-dictionary in block-device-mapping.
    - doc: fixed example to not overwrite /etc/hosts [Chris Glass]
    - Doc: fix spelling / typos in ca_certs and scripts_vendor.

 -- Scott Moser <email address hidden> Thu, 10 Nov 2016 21:04:09 -0500

Changed in cloud-init (Ubuntu):
status: Triaged → Fix Released
Martin Pitt (pitti) on 2016-11-11
Changed in systemd (Ubuntu Xenial):
status: In Progress → Triaged
Martin Pitt (pitti) wrote :

Dropping the After=dbus.service isn't regression free after all -- if networkd starts before dbus.service, then you will lose networkd's D-Bus control interface. So it seems either of https://bugs.freedesktop.org/show_bug.cgi?id=98254 or https://github.com/systemd/systemd/issues/4504 is necessary after all, and this isn't simple to do.

Changed in systemd (Ubuntu):
status: In Progress → Triaged
Scott Moser (smoser) on 2016-11-15
Changed in cloud-init (Ubuntu Xenial):
status: New → Confirmed
importance: Undecided → Medium
Scott Moser (smoser) on 2016-11-15
Changed in cloud-init (Ubuntu):
status: Fix Released → Triaged
Kristian Jensen (spexxter) wrote :

Hi,

Can we get some attention to this? We are unable to use Xenial, since the instance boots up without a default gateway. Not sure if we are hit harder because we are using vmware as hypervisor, but should not be a problem since Trusty works fine.

Martin Pitt (pitti) wrote :

@Kristian: Whatever problem you have, it can almost certainly not be this one. This is a "future feature" for xenial. I suggest reporting a new bug with details.

Kristian Jensen (spexxter) wrote :

@Martin: Okay sorry, I'm pretty new in this world.

Martin Pitt (pitti) on 2016-11-22
Changed in systemd (Ubuntu):
status: Triaged → In Progress
Martin Pitt (pitti) wrote :

I just landed the last PR (https://github.com/systemd/systemd/pull/4710) in upstream master that fully fixes networkd for early boot. The complete set is too intrusive to backport, but we don't need to in xenial: Transient (DHCP-acquired) hostname and timezone have never worked in xenial, and it should already have the "try to reconnect to D-Bus every 5s" behaviour (I'll verify this). Then we only need to backport https://github.com/systemd/systemd/commit/5f004d1e32 .

Changed in systemd (Ubuntu):
milestone: ubuntu-16.11 → none
status: In Progress → Fix Committed
Martin Pitt (pitti) on 2016-11-24
description: updated
Martin Pitt (pitti) on 2016-11-24
description: updated
description: updated
Martin Pitt (pitti) wrote :
description: updated
Changed in systemd (Ubuntu Yakkety):
status: New → In Progress
Changed in systemd:
status: New → Fix Released

Hello Ryan, or anyone else affected,

Accepted systemd into yakkety-proposed. The package will build now and be available at https://launchpad.net/ubuntu/+source/systemd/231-9ubuntu2 in a few hours, and then in the -proposed repository.

Please help us by testing this new package. See https://wiki.ubuntu.com/Testing/EnableProposed for documentation on how to enable and use -proposed. Your feedback will aid us in getting this update out to other Ubuntu users.

If this package fixes the bug for you, please add a comment to this bug, mentioning the version of the package you tested, and change the tag from verification-needed to verification-done. If it does not fix the bug for you, please add a comment stating that, and change the tag to verification-failed. In either case, details of your testing will help us make a better decision.

Further information regarding the verification process can be found at https://wiki.ubuntu.com/QATeam/PerformingSRUVerification . Thank you in advance!

Changed in systemd (Ubuntu Yakkety):
status: In Progress → Fix Committed
tags: added: verification-needed
Changed in systemd (Ubuntu Xenial):
status: In Progress → Fix Committed
Timo Aaltonen (tjaalton) wrote :

Hello Ryan, or anyone else affected,

Accepted systemd into xenial-proposed. The package will build now and be available at https://launchpad.net/ubuntu/+source/systemd/229-4ubuntu13 in a few hours, and then in the -proposed repository.

Please help us by testing this new package. See https://wiki.ubuntu.com/Testing/EnableProposed for documentation on how to enable and use -proposed. Your feedback will aid us in getting this update out to other Ubuntu users.

If this package fixes the bug for you, please add a comment to this bug, mentioning the version of the package you tested, and change the tag from verification-needed to verification-done. If it does not fix the bug for you, please add a comment stating that, and change the tag to verification-failed. In either case, details of your testing will help us make a better decision.

Further information regarding the verification process can be found at https://wiki.ubuntu.com/QATeam/PerformingSRUVerification . Thank you in advance!

Martin Pitt (pitti) on 2016-11-28
Changed in systemd (Ubuntu Xenial):
assignee: Martin Pitt (pitti) → nobody
David Glasser (glasser) wrote :

Hi. This issue affected us on Xenial; we explicitly enable systemd-networkd on our images (when creating our AMI), and after a recent AMI rebuild we were no longer able to start our AMIs. When I looked at the system console we saw things that looked like:

[ 52.866176] cloud-init[721]: Cloud-init v. 0.7.8 running 'init' at Wed, 30 Nov 2016 03:13:22 +0000. Up 51.74 seconds.
[ 52.873058] cloud-init[721]: ci-info: +++++++++++++++++++++++++++Net device info++++++++++++++++++++++++++++
[ 52.879734] cloud-init[721]: ci-info: +--------+-------+-----------+-----------+-------+-------------------+
[ 52.886030] cloud-init[721]: ci-info: | Device | Up | Address | Mask | Scope | Hw-Address |
[ 52.892162] cloud-init[721]: ci-info: +--------+-------+-----------+-----------+-------+-------------------+
[ 52.897909] cloud-init[721]: ci-info: | lo | True | 127.0.0.1 | 255.0.0.0 | . | . |
[ 52.904408] cloud-init[721]: ci-info: | lo | True | ::1/128 | . | host | . |
[ 52.910315] cloud-init[721]: ci-info: | ens3 | False | . | . | . | 0a:c6:90:b1:76:26 |
[ 52.916070] cloud-init[721]: ci-info: +--------+-------+-----------+-----------+-------+-------------------+
[ 52.921096] cloud-init[721]: 2016-11-30 03:13:23,567 - url_helper.py[WARNING]: Calling 'http://169.254.169.254/2009-04-04/meta-data/instance-id' failed [0/120s]: request error [HTTPConnectionPool(host='169.254.169.254', port=80): Max retries exceeded with url: /2009-04-04/meta-data/instance-id (Caused by NewConnectionError('<requests.packages.urllib3.connection.HTTPConnection object at 0x7f4feee32cf8>: Failed to establish a new connection: [Errno 101] Network is unreachable',))]

I eventually noticed that (in comparison to the system log for an older working AMI) the "Starting Network Service" line was missing and found this bug. (Text above included mostly in case anybody else sees the same issue and searches for the error.)

I tested with xenial-proposed and 229-4ubuntu13, and it fixed the issue. I'd love to see this fix in stable xenial soon!

Launchpad Janitor (janitor) wrote :

This bug was fixed in the package systemd - 232-7

---------------
systemd (232-7) unstable; urgency=medium

  [ Michael Biebl ]
  * Mark liblz4-tool build dependency as <!nocheck>
  * udev: Try mount -n -o move first
    initramfs-tools is not actually using util-linux mount (yet), so making
    mount -n --move the first alternative would trigger an error message if
    users have built their initramfs without busybox support.

  [ Alexander Kurtz ]
  * debian/extra/kernel-install.d/85-initrd.install: Remove an unnecessary
    variable. (Closes: #845977)

  [ Martin Pitt ]
  * Drop systemd-networkd's "After=dbus.service" ordering, so that it can
    start during early boot (for cloud-init.service). It will auto-connect to
    D-Bus once it becomes available later, and transient (from DHCP) hostname
    and timezone setting do not currently work anyway. (LP: #1636912)
  * Run hwdb/parse_hwdb.py during package build.
  * Package libnss-systemd
  * Make libnss-* depend on the same systemd package version.

 -- Martin Pitt <email address hidden> Wed, 30 Nov 2016 14:38:36 +0100

Changed in systemd (Ubuntu):
status: Fix Committed → Fix Released
Ryan Harper (raharper) wrote :

I built a new UC16 image with the systemd proposed package. Initially networkd running early is fine. However, on closer inspection, in a networkd-only image, DNS (resolvconf) was not running early enough for DNS service to be available at the time that cloud-init.service runs (which may look up resources via hostnames).

After some discussion, the following change is also needed in resolvconf to ensure that, in a networkd-based image, we get DNS early along with networkd early.

% diff -u resolvconf.service.orig resolvconf.service
--- resolvconf.service.orig 2016-12-06 04:58:43.202698062 -0600
+++ resolvconf.service 2016-12-06 04:58:50.367042811 -0600
@@ -3,6 +3,7 @@
 Documentation=man:resolvconf(8)
 DefaultDependencies=no
 Before=networking.service
+Before=systemd-networkd.service

 [Service]
 RemainAfterExit=yes

Martin Pitt (pitti) wrote :

> Before=networking.service
> +Before=systemd-networkd.service

FTR, this should be generalized to Before=network-pre.target. This is also applicable to Debian.

Changed in resolvconf (Ubuntu):
status: New → Triaged
Changed in resolvconf (Ubuntu Xenial):
status: New → Triaged
Changed in resolvconf (Ubuntu Yakkety):
status: New → Triaged
Ryan Harper (raharper) wrote :

I've tested Before=network-pre.target; that works fine. However, for the networkd case, systemd-networkd-wait-online.service should ensure that systemd-networkd-resolvconf-update.service has run first; otherwise there might be a window where interfaces are configured, but DNS is not.

The following change should go against systemd-networkd-wait-online.service

+ # Ensure that DNS is working before reaching online target
+ After=systemd-networkd-resolvconf-update.service

Ryan Harper (raharper) wrote :

Adding xenial debdiff for resolvconf changes.

Ryan Harper (raharper) wrote :

Adding yakkety debdiff for resolvconf changes.

Martin Pitt (pitti) wrote :

resolvconf uploaded to zesty and x/y SRU review queues. Please forward to Debian too.

Changed in resolvconf (Ubuntu):
assignee: nobody → Ryan Harper (raharper)
status: Triaged → Fix Committed
Ryan Harper (raharper) wrote :

Testing in Debian shows that to use network-pre.target, resolvconf needs to include a Wants=network-pre.target to ensure that it gets pulled in as a unit. I'm attaching updated debdiffs for xenial and yakkety.
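
Put together, the resolvconf.service change discussed here would look roughly like the drop-in below. This is a sketch, not the attached debdiff: the helper name and the file name 10-network-pre.conf are hypothetical, and the directory is a parameter so the example can be tried outside /etc.

```shell
# Hypothetical helper: order resolvconf.service before network-pre.target
# and pull that target into the boot transaction.
# $1: systemd unit directory to write into (e.g. /etc/systemd/system)
write_resolvconf_network_pre_dropin() {
    dropin_dir="$1/resolvconf.service.d"
    mkdir -p "$dropin_dir" || return 1

    # Before= gives the ordering; Wants= makes sure network-pre.target is
    # actually pulled in, which the testing described above showed is needed.
    cat > "$dropin_dir/10-network-pre.conf" <<'EOF'
[Unit]
Before=network-pre.target
Wants=network-pre.target
EOF
}
```

Using the generic target rather than naming networking.service or systemd-networkd.service keeps the ordering valid whichever of the two brings up the network.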

Martin Pitt (pitti) wrote :

Reuploaded x/y/z to add missing Wants=network-pre.target.

Ryan Harper (raharper) wrote :

Add a Wants=network-pre.target for resolvconf

Martin Pitt (pitti) on 2016-12-08
Changed in resolvconf (Ubuntu Xenial):
status: Triaged → In Progress
Changed in resolvconf (Ubuntu Yakkety):
status: Triaged → In Progress
Ryan Harper (raharper) wrote :

Add a Wants=network-pre.target to resolvconf service.

Changed in resolvconf (Debian):
status: Unknown → New
Launchpad Janitor (janitor) wrote :

This bug was fixed in the package resolvconf - 1.79ubuntu4

---------------
resolvconf (1.79ubuntu4) zesty; urgency=medium

  * debian/resolvconf.service: Add missing Wants=network-pre.target.

 -- Martin Pitt <email address hidden> Thu, 08 Dec 2016 10:21:12 +0100

Changed in resolvconf (Ubuntu):
status: Fix Committed → Fix Released

Ryan Harper [2016-12-06 12:54 -0000]:
> The following change should go against systemd-networkd-wait-
> online.service
>
> + # Ensure that DNS is working before reaching online target
> + After=systemd-networkd-resolvconf-update.service

For the record, this should be the other way around -- add
Before=systemd-networkd-wait-online.service to
s-n-resolvconf-update.service. The latter is a Debian downstream unit
and thus avoids carrying a patch to an upstream unit that refers to a
downstream one.

Ryan Harper (raharper) wrote :

On Tue, Dec 13, 2016 at 10:02 AM, Martin Pitt <email address hidden>
wrote:

> Ryan Harper [2016-12-06 12:54 -0000]:
> > The following change should go against systemd-networkd-wait-
> > online.service
> >
> > + # Ensure that DNS is working before reaching online target
> > + After=systemd-networkd-resolvconf-update.service
>
> For the record, this should be the other way around -- add
> Before=systemd-networkd-wait-online.service to
> s-n-resolvconf-update.service. The latter is a Debian downstream unit
> and thus avoids carrying a patch to an upstream unit that refers to a
> downstream one.
>

Well, ideally we'd have both. Part of the challenge in dealing with
systemd units is that it's very difficult to determine the ordering
if one doesn't look at the right file.

I won't push for a delta but I do think that these unit relationships ought
to be explicit on both sides.
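
To illustrate the "right file" problem, here is a rough sketch of a helper that lists every Before=/After= line mentioning a given unit, whichever unit file declares it. The function name is hypothetical and it only scans the paths it is given; it does not resolve drop-ins, generators or implicit dependencies.

```shell
# Hypothetical helper: show which unit files order themselves against a unit.
# $1: unit name to look for (matched as a fixed string)
# remaining arguments: unit files or directories to search recursively
find_ordering_refs() {
    unit_name="$1"
    shift
    # First keep only Before=/After= lines, then keep those naming the unit.
    grep -rE '^(Before|After)=' "$@" 2>/dev/null | grep -F -- "$unit_name"
}
```

For example, `find_ordering_refs systemd-networkd-wait-online.service /lib/systemd/system /etc/systemd/system` prints matching lines prefixed with the declaring file. On a running system, `systemctl show -p Before -p After systemd-networkd-wait-online.service` gives the merged view that systemd itself computed.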


Ryan Harper (raharper) wrote :

I've opened a new bug for the DNS networkd/resolvconf issue:

https://bugs.launchpad.net/ubuntu/+source/resolvconf/+bug/1649931

We'll track a new SRU for fixing that issue separately.

Ryan Harper (raharper) on 2016-12-14
tags: added: verification-done
removed: verification-needed
Launchpad Janitor (janitor) wrote :

This bug was fixed in the package systemd - 229-4ubuntu13

---------------
systemd (229-4ubuntu13) xenial; urgency=medium

  [ Martin Pitt ]
  * Backport graphical-session{,-pre}.target user units, for future usage from
    snaps. (LP: #1640293)
  * debian/rules: Clean up *.busname units. They are useless in 16.04 as they
    will always be "condition failed" as kdbus has never existed. But they add
    ordering constraints which make it impossible to start
    systemd-networkd.service during early boot, which is an upcoming
    requirement for cloud-init. (Part of LP: #1636912)
  * Drop systemd-networkd's "After=dbus.service" ordering so that it can start
    during early boot (for cloud-init.service). It will auto-connect to D-Bus
    once it becomes available later, and transient (from DHCP) hostname and
    timezone setting do not work in 16.04 anyway. (LP: #1636912)

  [ Dan Streetman ]
  * rules: introduce disk/by-id (wwid and model_serial) symlinks
    for NVMe drives (LP: #1642903)

 -- Martin Pitt <email address hidden> Thu, 24 Nov 2016 12:41:23 +0100

Changed in systemd (Ubuntu Xenial):
status: Fix Committed → Fix Released

The verification of the Stable Release Update for systemd has completed successfully and the package has now been released to -updates. Subsequently, the Ubuntu Stable Release Updates Team is being unsubscribed and will not receive messages about this bug report. In the event that you encounter a regression using the package from -updates please report a new bug using ubuntu-bug and tag the bug report regression-update so we can easily find any regressions.

Launchpad Janitor (janitor) wrote :

This bug was fixed in the package systemd - 231-9ubuntu2

---------------
systemd (231-9ubuntu2) yakkety; urgency=medium

  [ Dan Streetman ]
  * rules: introduce disk/by-id (model_serial) symlinks for NVMe drives
    (LP: #1642903)

  [ Martin Pitt ]
  * Drop systemd-networkd's "After=dbus.service" ordering, so that it can
    start during early boot (for cloud-init.service). It will auto-connect to
    D-Bus once it becomes available later, and transient (from DHCP) hostname
    and timezone setting do not work in 16.10 anyway. (LP: #1636912)

 -- Martin Pitt <email address hidden> Thu, 24 Nov 2016 13:21:05 +0100

Changed in systemd (Ubuntu Yakkety):
status: Fix Committed → Fix Released

The attachment "xenial_resolvconf-lp1636912.debdiff" seems to be a debdiff. The ubuntu-sponsors team has been subscribed to the bug report so that they can review and hopefully sponsor the debdiff. If the attachment isn't a patch, please remove the "patch" flag from the attachment, remove the "patch" tag, and if you are member of the ~ubuntu-sponsors, unsubscribe the team.

[This is an automated message performed by a Launchpad user owned by ~brian-murray, for any issue please contact him.]

tags: added: patch
Changed in resolvconf (Debian):
status: New → Fix Committed

Hello Ryan, or anyone else affected,

Accepted resolvconf into xenial-proposed. The package will build now and be available at https://launchpad.net/ubuntu/+source/resolvconf/1.78ubuntu3 in a few hours, and then in the -proposed repository.

Please help us by testing this new package. See https://wiki.ubuntu.com/Testing/EnableProposed for documentation on how to enable and use -proposed. Your feedback will aid us in getting this update out to other Ubuntu users.

If this package fixes the bug for you, please add a comment to this bug, mentioning the version of the package you tested, and change the tag from verification-needed to verification-done. If it does not fix the bug for you, please add a comment stating that, and change the tag to verification-failed. In either case, details of your testing will help us make a better decision.

Further information regarding the verification process can be found at https://wiki.ubuntu.com/QATeam/PerformingSRUVerification . Thank you in advance!

Changed in resolvconf (Ubuntu Xenial):
status: In Progress → Fix Committed
tags: removed: verification-done
tags: added: verification-needed
Steve Langasek (vorlon) wrote :

The resolvconf portion of this issue has been moved to bug #1649931. I'm removing resolvconf 1.78ubuntu3 from xenial-proposed and will replace it with a resolvconf 1.78ubuntu4 with the correct bug ref.

Changed in resolvconf (Ubuntu Xenial):
status: Fix Committed → Invalid
Changed in resolvconf (Ubuntu Yakkety):
status: In Progress → Invalid
Changed in systemd (Ubuntu Yakkety):
importance: Undecided → Medium
no longer affects: resolvconf (Ubuntu Yakkety)
no longer affects: resolvconf (Ubuntu Xenial)
no longer affects: resolvconf (Ubuntu)
affects: resolvconf (Debian) → ubuntu-translations
Changed in ubuntu-translations:
importance: Unknown → Undecided
status: Fix Committed → New
no longer affects: ubuntu-translations
Changed in cloud-init (Ubuntu Yakkety):
importance: Undecided → Medium
Tobias Wolf (towolf) wrote :

Thanks for bricking our servers: http://i.imgur.com/DFFrSs1.png

This fixes it:

cat /etc/systemd/system/systemd-networkd.service.d/override.conf
[Unit]
After=dbus.service
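
For anyone wanting to apply the same workaround, a minimal sketch of creating that drop-in follows. The helper name write_networkd_dropin and its optional directory argument are hypothetical, introduced here only for illustration; the file path and contents are the ones quoted above. Whether this ordering is the right long-term fix is not settled in this bug.

```shell
#!/bin/sh
# Sketch only: write the drop-in quoted above into a target directory.
# write_networkd_dropin is a hypothetical helper name, not part of systemd.
write_networkd_dropin() {
    # Default to the path shown in the comment; a different directory
    # can be passed as the first argument (e.g. for a dry run).
    dropin_dir="${1:-/etc/systemd/system/systemd-networkd.service.d}"
    mkdir -p "$dropin_dir" || return 1
    # Order systemd-networkd after dbus.service.
    printf '[Unit]\nAfter=dbus.service\n' > "$dropin_dir/override.conf"
}

# On an affected host (as root), one would then run:
#   write_networkd_dropin
#   systemctl daemon-reload
```

The daemon-reload step makes systemd re-read unit files so the new ordering applies on the next start of the service. Running `systemctl edit systemd-networkd.service` and entering the same two lines should produce an equivalent override.conf.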

tags: removed: verification-needed
Scott Moser (smoser) on 2017-03-01
no longer affects: cloud-init
no longer affects: cloud-init (Ubuntu)
no longer affects: cloud-init (Ubuntu Xenial)
no longer affects: cloud-init (Ubuntu Yakkety)
Steve Langasek (vorlon) wrote :

Tobias, sorry for being so long in coming back around to this, but I followed a pointer to this bug from another one and am now trying to understand the regression that you're describing.

You described this as "bricking" your servers, but per the SRU regression analysis:

  Running [networkd] before dbus.service is running has two consequences:
   - It cannot immediately expose its D-Bus status interface. But it will retry every 5 s until that succeeds, so the D-Bus status interface will continue to work. (see test case)

That seems to have some relation to your screenshot, which shows systemd-networkd failing and being retried. But I wouldn't have understood from the SRU description that the service would *fail* to start and be retried by systemd, and that doesn't explain it leading to a "brick" situation.

The screenshot points to systemctl status systemd-networkd.service for details. Are you able to capture any of those details? Can you give us information on how to reproduce this hang so that we might debug it?
