snapd.boot-ok.service hangs eternally on cloud image upgrades

Bug #1621336 reported by Martin Pitt on 2016-09-08
This bug affects 35 people
Affects                  Importance  Assigned to
cloud-init (Ubuntu)      High        Unassigned
  Xenial                 Medium      Unassigned
snapd (Ubuntu)           Critical    Unassigned
  Xenial                 Medium      Eric Desrochers

Bug Description

==== Begin SRU Template [cloud-init] ====
[Impact]
One of cloud-init's features is to upgrade the system during first boot so that it is fully up to date when the user code starts running.

[Test Case]
Launch an old 16.04 instance that needs a snapd update, with
user-data indicating that a package upgrade should be done.

$ lxc image show ubuntu:74a491804877
autoupdate: false
properties:
  aliases: 16.04,default,lts,x,xenial
  architecture: amd64
  description: ubuntu 16.04 LTS amd64 (release) (20160830)
  label: release
  os: ubuntu
  release: xenial
  serial: "20160830"
  version: "16.04"
public: true

$ printf "#%s\n%s\n" cloud-config "packages: [snapd]" > user-data

$ lxc launch ubuntu:74a491804877 xrecreate "--config=user.user-data=$(cat user-data)"
$ lxc exec xrecreate -- tail -f /var/log/cloud-init-output.log

# you will see the output log hang at:
# Setting up snapd (2.14.2~16.04) ...

## Now get new container and patch in cloud-init
$ lxc launch ubuntu:74a491804877 xpatched
# let it boot, with no user-data saying to update.
$ sleep 10

# update the container to new cloud-init, then clean it to make
# it look like first boot again.
$ lxc file push - xpatched/etc/cloud/cloud.cfg.d/update.cfg < user-data
$ lxc exec xpatched -- sh -c '
    p=/etc/apt/sources.list.d/proposed.list
    echo deb http://archive.ubuntu.com/ubuntu xenial-proposed main > "$p" &&
    apt-get update -q && apt-get -qy install cloud-init'
$ lxc exec xpatched -- sh -c '
    cd /var/lib/cloud && for d in *; do [ "$d" = "seed" ] || rm -Rf "$d"; done
    rm -Rf /var/log/cloud-init*'

$ lxc exec xpatched reboot
$ lxc exec xpatched -- tail -f /var/log/cloud-init-output.log

# snapd installed and a 'Cloud-init finished' message.
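
The `tail -f` steps above need someone to watch the log. For a scripted check, a minimal sketch of a polling helper follows. The function name `wait_for_log_line`, the default timeout, and the grep pattern in the usage note are assumptions for illustration and are not part of the SRU test case.

```shell
#!/bin/sh
# Hypothetical helper: poll a log file until a pattern appears or a
# timeout (in seconds) expires. Returns 0 on a match, 1 on timeout.
wait_for_log_line() {
    pattern=$1
    logfile=$2
    timeout=${3:-300}   # assumed default of five minutes
    waited=0
    while :; do
        # Check first, so a timeout of 0 still performs one check.
        if [ -r "$logfile" ] && grep -q -- "$pattern" "$logfile"; then
            return 0
        fi
        if [ "$waited" -ge "$timeout" ]; then
            return 1
        fi
        sleep 1
        waited=$((waited + 1))
    done
}
```

Run inside the container, something like `wait_for_log_line 'Cloud-init .*finished' /var/log/cloud-init-output.log 600` would return 0 on the patched instance and time out on the hanging one. The exact wording of the final log line may differ between cloud-init versions, so the pattern should be adjusted to match.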

[Regression Potential]
Running package installation later in boot will likely affect some things. However, a larger set of things was unreliable before, so this change should make things more reliable overall.
==== End SRU Template [cloud-init] ====

I reproducibly run into an eternal hang when deploying services with Juju, when it prepares a new xenial testbed. The current xenial cloud image does not have the latest snapd, so snapd gets dist-upgraded:

Preparing to unpack .../snapd_2.14.2~16.04_amd64.deb ...
Warning: Stopping snapd.service, but it can still be activated by:
  snapd.socket
Unpacking snapd (2.14.2~16.04) over (2.13) ...
Setting up snapd (2.14.2~16.04) ...
[...] hangs

The postinst tries to start snapd.boot-ok.service on upgrade:

           |-cloud-init(311)-+-apt-get(577)---dpkg(845)---snapd.postinst(846)---perl(919)---systemctl(922)
           | `-sh(354)---tee(355)

root 922 0.0 0.0 25316 1412 pts/0 S+ 06:09 0:00 /bin/systemctl start snapd.boot-ok.service

This hangs eternally because:

 - cloud-init's dist-upgrade runs *during* the boot process, so that the system is not fully booted yet when this happens (see bug 1576692); thus multi-user.target is *not* yet active

 - snapd.boot-ok.service is After=multi-user.target

 - "systemctl start" is synchronous by default, i. e. it waits until the service is started unless you use --no-block.

Thus snapd.postinst waits on snapd.boot-ok.service, which waits on multi-user.target, which waits on cloud-init to finish, which waits on snapd.postinst to finish.
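
The circular wait can be modelled as a small "waits on" graph. The sketch below is only an illustration of the reasoning, not a diagnostic tool: the function names and the edge-list format are made up, and the edges are transcribed by hand from the description rather than read from systemd.

```shell
#!/bin/sh
# Each line of an edge list reads "<waiter> <thing it waits on>".
BOOT_EDGES='snapd.postinst snapd.boot-ok.service
snapd.boot-ok.service multi-user.target
multi-user.target cloud-init
cloud-init snapd.postinst'

# Print what node $1 waits on, according to the edge list in $2.
waits_on() {
    printf '%s\n' "$2" | awk -v n="$1" '$1 == n { print $2; exit }'
}

# Return 0 if following the chain from $1 leads back to $1.
has_wait_cycle() {
    start=$1
    edges=$2
    cur=$1
    steps=0
    # A chain longer than the number of edges cannot return to start.
    max=$(printf '%s\n' "$edges" | awk 'END { print NR }')
    while [ "$steps" -lt "$max" ]; do
        cur=$(waits_on "$cur" "$edges")
        if [ -z "$cur" ]; then
            return 1    # chain ends: nothing left to wait on
        fi
        if [ "$cur" = "$start" ]; then
            return 0    # back where we began: deadlock
        fi
        steps=$((steps + 1))
    done
    return 1
}
```

With the edges above, `has_wait_cycle snapd.postinst "$BOOT_EDGES"` succeeds. Removing any single edge, for example the first one (which is what not starting the unit from the postinst amounts to), makes it fail, which is why the loop can be broken in more than one place.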

I think conceptually you shouldn't start snapd.boot-ok.service in the postinst; if the system is already booted (manual dist-upgrade) it should already be running, and if it does get upgraded during boot (with cloud-init) then you shouldn't pretend that booting is already finished. So I suggest using dh_installinit with --no-scripts for snapd.boot-ok.service.
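
For maintainer scripts that really do have to start a unit, the `--no-block` point from the list above could be applied as in the following sketch. This is a hypothetical policy, not what snapd's postinst does and not a replacement for the suggestion of not starting the unit at all; the function names are invented for illustration. It relies only on `systemctl is-system-running` and `systemctl --no-block start`.

```shell
#!/bin/sh
# Map a state reported by "systemctl is-system-running" to a start
# strategy. Hypothetical policy for illustration only.
start_strategy() {
    case "$1" in
        running|degraded) echo "block" ;;     # boot has finished
        *)                echo "no-block" ;;  # still booting, or unknown
    esac
}

# Start a unit without waiting on a target that cannot be reached
# until the caller itself has finished.
start_unit_safely() {
    unit=$1
    # is-system-running exits non-zero unless the state is "running".
    state=$(systemctl is-system-running 2>/dev/null || true)
    if [ "$(start_strategy "$state")" = "block" ]; then
        systemctl start "$unit"
    else
        systemctl --no-block start "$unit"
    fi
}
```

During the cloud-init upgrade described here the reported state would be expected to be something like `starting`, so the job would only be queued and the postinst would return immediately.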


Launchpad Janitor (janitor) wrote :

Status changed to 'Confirmed' because the bug affects multiple users.

Changed in snapd (Ubuntu):
status: New → Confirmed
Michael Vogt (mvo) wrote :

Thanks! This is indeed an oversight that this gets started in postinst.

Changed in snapd (Ubuntu):
importance: Undecided → Critical
status: Confirmed → Triaged

FTR, we are using "enable-os-upgrade: false" in ~/.juju/environments.yaml to avoid this bug.

tags: added: oil
Axel Kämpfe (akaempfe) wrote :

for us, i found a tiny "workaround" which works, as for now

echo "bash -c 'service snapd.boot-ok start'" | at now + 4 min

where the 4 minutes is of course up to you, depending on how long you want to wait and how many upgrades need to be processed

Scott Moser (smoser) on 2016-09-09
Changed in cloud-init (Ubuntu):
status: New → Confirmed
importance: Undecided → High
Martin Pitt (pitti) wrote :

The proposed cloud-init change will "accidentally" fix this by breaking the loop at a different place -- but conceptually it's still wrong to start the "boot ok" marker on package install/upgrade.

Axel Kämpfe (akaempfe) wrote :

yes, i know, the "fix" is not actually a fix, it is "bending the rules" :D

but for my use case, since in my case, the system does a full reboot anyway after the upgrade, it works for me :D

Launchpad Janitor (janitor) wrote :

This bug was fixed in the package cloud-init - 0.7.7-28-g34a26f7-0ubuntu1

---------------
cloud-init (0.7.7-28-g34a26f7-0ubuntu1) yakkety; urgency=medium

  * New upstream snapshot.
    - systemd: Better support package and upgrade.
      (LP: #1576692, #1621336)
    - tests: cleanup tempdirs in apt_source tests

 -- Scott Moser <email address hidden> Fri, 09 Sep 2016 16:01:13 -0400

Changed in cloud-init (Ubuntu):
status: Confirmed → Fix Released
Scott Moser (smoser) on 2016-09-12
Changed in snapd (Ubuntu Xenial):
status: New → In Progress
importance: Undecided → Medium
Scott Moser (smoser) on 2016-09-13
Changed in cloud-init (Ubuntu Xenial):
status: New → In Progress
importance: Undecided → Medium
Chris J Arges (arges) on 2016-09-13
Changed in cloud-init (Ubuntu Xenial):
status: In Progress → Fix Committed
Scott Moser (smoser) on 2016-09-13
description: updated

Hello Martin, or anyone else affected,

Accepted cloud-init into xenial-proposed. The package will build now and be available at https://launchpad.net/ubuntu/+source/cloud-init/0.7.7-31-g65ace7b-0ubuntu1~16.04.1 in a few hours, and then in the -proposed repository.

Please help us by testing this new package. See https://wiki.ubuntu.com/Testing/EnableProposed for documentation how to enable and use -proposed. Your feedback will aid us getting this update out to other Ubuntu users.

If this package fixes the bug for you, please add a comment to this bug, mentioning the version of the package you tested, and change the tag from verification-needed to verification-done. If it does not fix the bug for you, please add a comment stating that, and change the tag to verification-failed. In either case, details of your testing will help us make a better decision.

Further information regarding the verification process can be found at https://wiki.ubuntu.com/QATeam/PerformingSRUVerification . Thank you in advance!

tags: added: verification-needed
Scott Moser (smoser) wrote :

I walked through the lxc example above. All good.

tags: added: verification-done
removed: verification-needed
Martin Pitt (pitti) wrote :

Hello Martin, or anyone else affected,

Accepted cloud-init into xenial-proposed. The package will build now and be available at https://launchpad.net/ubuntu/+source/cloud-init/0.7.8-1-g3705bb5-0ubuntu1~16.04.1 in a few hours, and then in the -proposed repository.

Please help us by testing this new package. See https://wiki.ubuntu.com/Testing/EnableProposed for documentation how to enable and use -proposed. Your feedback will aid us getting this update out to other Ubuntu users.

If this package fixes the bug for you, please add a comment to this bug, mentioning the version of the package you tested, and change the tag from verification-needed to verification-done. If it does not fix the bug for you, please add a comment stating that, and change the tag to verification-failed. In either case, details of your testing will help us make a better decision.

Further information regarding the verification process can be found at https://wiki.ubuntu.com/QATeam/PerformingSRUVerification . Thank you in advance!

tags: removed: verification-done
tags: added: verification-needed
Achim Behrens (k1l) wrote :

A user just had the "snapd always hanging on install/reinstall and blocking apt" issue.

After some fiddling we used the workaround from comment #4 https://bugs.launchpad.net/ubuntu/+source/snapd/+bug/1621336/comments/4 :

Start a root shell with "sudo -i", then run "echo "bash -c 'service snapd.boot-ok start'" | at now + 4 min", then "apt install snapd" (if it complains about interrupted dpkg processes, run "dpkg --configure -a"). Then wait at least 4 minutes.

The hang should be gone then.

Scott Moser (smoser) wrote :

verified cloud-init_0.7.8-1-g3705bb5-0ubuntu1~16.04.1 as in sru

tags: added: verification-done
removed: verification-needed
Launchpad Janitor (janitor) wrote :

This bug was fixed in the package cloud-init - 0.7.8-1-g3705bb5-0ubuntu1~16.04.1

---------------
cloud-init (0.7.8-1-g3705bb5-0ubuntu1~16.04.1) xenial-proposed; urgency=medium

  * New upstream release 0.7.8.
  * New upstream snapshot.
    - systemd: put cloud-init.target After multi-user.target (LP: #1623868)

cloud-init (0.7.7-31-g65ace7b-0ubuntu1~16.04.2) xenial-proposed; urgency=medium

  * debian/control: add Breaks of older versions of walinuxagent (LP: #1623570)

cloud-init (0.7.7-31-g65ace7b-0ubuntu1~16.04.1) xenial-proposed; urgency=medium

  * debian/control: fix missing dependency on python3-serial,
    and make SmartOS datasource work.
  * debian/cloud-init.templates fix capitalisation in template so
    dpkg-reconfigure works to select OpenStack. (LP: #1575727)
  * d/README.source, d/control, d/new-upstream-snapshot, d/rules: sync
    with yakkety for changes due to move to git.
  * d/rules: change PYVER=python3 to PYVER=3 to adjust to upstream change.
  * debian/rules, debian/cloud-init.install: remove install file
    to ensure expected files are collected into cloud-init deb.
    (LP: #1615745)
  * debian/dirs: remove obsolete / unused file.
  * upstream move from bzr to git.
  * New upstream snapshot.
    - Allow link type of null in network_data.json [Jon Grimm] (LP: #1621968)
    - DataSourceOVF: fix user-data as base64 with python3 (LP: #1619394)
    - remove obsolete .bzrignore
    - systemd: Better support package and upgrade. (LP: #1576692, #1621336)
    - tests: cleanup tempdirs in apt_source tests
    - apt config conversion: treat empty string as not provided. (LP: #1621180)
    - Fix typo in default keys for phone_home [Roland Sommer] (LP: #1607810)
    - salt minion: update default pki directory for newer salt minion.
      (LP: #1609899)
    - bddeb: add --release flag to specify the release in changelog.
    - apt-config: allow both old and new format to be present.
      [Christian Ehrhardt] (LP: #1616831)
    - python2.6: fix dict comprehension usage in _lsb_release. [Joshua Harlow]
    - Add a module that can configure spacewalk. [Joshua Harlow]
    - add install option for openrc [Matthew Thode]
    - Generate a dummy bond name for OpenStack (LP: #1605749)
    - network: fix get_interface_mac for bond slave, read_sys_net for ENOTDIR
    - azure dhclient-hook cleanups
    - Minor cleanups to atomic_helper and add unit tests.
    - Fix Gentoo net config generation [Matthew Thode]
    - distros: fix get_primary_arch method use of os.uname [Andrew Jorgensen]
    - Apt: add new apt configuration format [Christian Ehrhardt]
    - Get Azure endpoint server from DHCP client [Brent Baude]
    - DigitalOcean: use the v1.json endpoint [Ben Howard]
    - MAAS: add vendor-data support (LP: #1612313)
    - Upgrade to a configobj package new enough to work [Joshua Harlow]
    - ConfigDrive: recognize 'tap' as a link type. (LP: #1610784)
    - NoCloud: fix bug providing network-interfaces via meta-data.
      (LP: 1577982)
    - Add distro tags on config modules that should have it [Joshua Harlow]
    - ChangeLog: update changelog for previous commit.
    - add ntp config module [Ryan Harper]
    - SmartOS: more improvement...


Changed in cloud-init (Ubuntu Xenial):
status: Fix Committed → Fix Released

The verification of the Stable Release Update for cloud-init has completed successfully and the package has now been released to -updates. Subsequently, the Ubuntu Stable Release Updates Team is being unsubscribed and will not receive messages about this bug report. In the event that you encounter a regression using the package from -updates please report a new bug using ubuntu-bug and tag the bug report regression-update so we can easily find any regressions.

Erik Damrose (damrose) wrote :

Any update when this will be fixed in the snapd package? We use a script to update packages during the boot process, and run into the same loop described in the original report.

Laryllan (laryllan) wrote :

I have the same problem, but cloud-init is not installed.
Had this problem again while updating to snapd-2.16+16.10ubuntu1.2.

Eric Desrochers (slashd) wrote :

Today it was brought to my attention that the problem is still present.

Any update on pitti's suggestion to use dh_installinit with --noscripts for snapd.boot-ok.service ?

Eric Desrochers (slashd) wrote :

@pitti,

As mentioned in the description:
"...So I suggest to use dh_installinit with --no-scripts for snapd.boot-ok.service."

Were you referring to something like the following ?

diff -Nru snapd-2.17.1/debian/rules snapd-2.17.1ubuntu1/debian/rules
--- snapd-2.17.1/debian/rules 2016-11-04 12:40:03.000000000 -0400
+++ snapd-2.17.1ubuntu1/debian/rules 2016-11-23 15:33:37.000000000 -0500
@@ -107,6 +107,9 @@
                -psnapd \
                snapd.autoimport.service

+override_dh_installinit:
+       dh_installinit -psnapd.boot-ok --noscripts
+
 override_dh_install:
        # we do not need this in the package, its just needed during build
        rm -rf ${CURDIR}/debian/tmp/usr/bin/xgettext-go

Eric

Martin Pitt (pitti) wrote :

@Eric: Right, that's what I meant. It should be mitigated by that recent cloud-init reorganization, but even if snapd.boot-ok.service now stopped failing on upgrade I still think it does not make sense to run this on package upgrade, only on boot.

Eric Desrochers (slashd) wrote :

@pitti, ok I will start preparing debdiff(s) for snapd and then start the SRU process for Z/Y/X release.

Eric

Erik Damrose (damrose) wrote :

In my scenario the following patch worked.

diff -Naur snapd-2.16ubuntu3.orig//debian/rules snapd-2.16ubuntu3/debian/rules
--- snapd-2.16ubuntu3.orig//debian/rules 2016-10-28 12:42:06.204048938 +0200
+++ snapd-2.16ubuntu3/debian/rules 2016-10-28 13:45:59.726079099 +0200
@@ -77,6 +77,7 @@
 override_dh_systemd_start:
        # start boot-ok
        dh_systemd_start \
+               --no-start \
                -psnapd \
                snapd.boot-ok.service
        # we want to start the auto-update timer

Erik Damrose (damrose) wrote :

@Eric: I tested your patch, unfortunately it does not work in my scenario. dh_systemd_start modifies the snapd.postinst and the boot hangs while waiting for multi-user.target. Please consider applying my patch.
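
One way to see which of the two patches actually changes the generated maintainer script is to look for a start request in it. The helper below is a rough sketch; the function name is made up, and the exact snippet that dh_systemd_start writes varies between debhelper versions, so the match is deliberately loose.

```shell
#!/bin/sh
# Hypothetical check: does a maintainer script contain a line that both
# names the unit and asks for it to be started?
postinst_starts_unit() {
    script=$1
    unit=$2
    grep -F -- "$unit" "$script" |
        grep -Eq '(^|[[:space:]])start([[:space:]]|$)'
}
```

On an installed system the script normally lives under /var/lib/dpkg/info, so `postinst_starts_unit /var/lib/dpkg/info/snapd.postinst snapd.boot-ok.service && echo "postinst still starts boot-ok"` would show whether a given build is still affected.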

Michael Vogt (mvo) wrote :

In current snapd 2.17+ the boot-ok systemd unit is no longer used or needed.

Michael Vogt (mvo) wrote :
Eric Desrochers (slashd) wrote :

As per mvo's previous comment (#23)...

In current snapd 2.17+[1] found in xenial-proposed the boot-ok systemd unit is no longer used or needed.

Could someone affected by the issue please enable the -proposed repository[2] and install version 2.17.1?

Note that positive feedback about this package in the LP bug will help move the package out of -proposed and into its final destination, -updates.

[1] - rmadison output:
snapd | 2.17.1 | xenial-proposed | source, amd64, arm64, i386, powerpc, ppc64el, s390x

[2] - HOWTO enable -proposed
https://wiki.ubuntu.com/Testing/EnableProposed

Commit reference :
- https://github.com/snapcore/snapd/commit/e5011eb

Regards,
Eric

Eric Desrochers (slashd) on 2016-11-24
tags: added: verification-needed
removed: verification-done
Changed in snapd (Ubuntu):
status: Triaged → In Progress
Erik Damrose (damrose) wrote :

Verified: Works with snapd 2.17.1 from xenial-proposed

Eric Desrochers (slashd) wrote :

Thanks damrose for your feedback, I will work on making the package land in xenial-updates.

tags: added: verification-done
removed: verification-needed
Eric Desrochers (slashd) on 2016-11-24
Changed in snapd (Ubuntu Xenial):
assignee: nobody → Eric Desrochers (slashd)
Martin Pitt (pitti) wrote :

This is apparently fixed in xenial-proposed, but the release is blocked (see bug 1637215).

Changed in snapd (Ubuntu):
status: In Progress → Fix Committed
Changed in snapd (Ubuntu Xenial):
status: In Progress → Fix Committed
Eric Desrochers (slashd) wrote :

The following was brought to my attention by someone who also tried the 2.17.1 package:

"I'm glad to announce that I could test the procedure during the boot (same conditions as when everything hang) ... It works ! YES!!

# sudo apt-cache policy snapd
snapd:
  Installed: 2.17.1
  Candidate: 2.17.1
  Version table:
 *** 2.17.1 500
        500 http://archive.ubuntu.com/ubuntu xenial-proposed/main amd64 Packages
        100 /var/lib/dpkg/status
     2.16ubuntu3 500
        500 http://ch.archive.ubuntu.com/ubuntu xenial-updates/main amd64 Packages
     2.0.2 500
        500 http://ch.archive.ubuntu.com/ubuntu xenial/main amd64 Packages"

Eric

Eric Desrochers (slashd) wrote :

The LP bug status hasn't switched to "Fix Released" yet, but I confirmed that the package that addresses this bug has landed in -updates[1].

You can now install the package if you are experiencing this snapd bug.

[1] $ rmadison snapd --suite=xenial-updates
snapd | 2.17.1ubuntu1 | xenial-updates | source, amd64, arm64, armhf, i386, powerpc, ppc64el, s390x

Changed in snapd (Ubuntu):
status: Fix Committed → Confirmed