cloud-init sometimes fails on dpkg lock due to concurrent apt-daily.service execution

Bug #1693361 reported by Jim Browne
50
This bug affects 6 people
Affects Status Importance Assigned to Milestone
APT
Fix Released
Unknown
cloud-init
Fix Released
Medium
Unassigned
apt (Ubuntu)
Invalid
Undecided
Unassigned
cloud-init (Ubuntu)
Fix Released
Medium
Unassigned
Xenial
Fix Released
Medium
Unassigned
Yakkety
Won't Fix
Medium
Unassigned
Zesty
Fix Released
Medium
Unassigned
Artful
Fix Released
Medium
Unassigned

Bug Description

=== Begin SRU Template ===
[Impact]
A cloud-config that contains packages to install (see below) or
'package_upgrade' will run 'apt-get update'. That can sometimes fail as a
result of contention with the apt-daily.service that updates that information.

Cloud-config showing the problem is just like:

  $ cat my.yaml
  #cloud-config
  packages: ['hello']

[Test Case]
lxc-proposed-snapshot is
  https://git.launchpad.net/~smoser/cloud-init/+git/sru-info/tree/bin/lxc-proposed-snapshot
It publishes an image to lxd with proposed enabled and cloud-init upgraded.

a.) launch an instance with proposed version of cloud-init and some user-data.
   This is platform independent. The test case demonstrates lxd.
   $ printf "%s\n%s\n%s\n" "#cloud-config" "packages: ['hello']" \
       "package_upgrade: true" > config.yaml
   $ release=xenial
   $ ref=proposed-$release
   $ ./lxc-proposed-snapshot --proposed --publish $release $ref;

b.) start the instance
   $ name=$release-1693361
   $ lxc launch my-xenial "--config=user.user-data=$(cat config.yaml)
   $ sleep 1
   $ lxc exec $name -- tail -f /var/log/cloud-init.log /var/log/cloud-init-output.log
   # watch this boot.

 c.) Look for evidence of systemd failure
   journalctl -o short-precise | grep -i break
   journalctl -o short-precise | grep -i order

[Regression Potential]
Regression chance here is low. Its possible that ordering loops
could occur. When that does happen, journalctl will mention it. Unfortunately
in such cases systemd somewhat randomly picks a service to kil so behavior
is somewhat undefined.

[Other Info]
Upstream commit at
  https://git.launchpad.net/cloud-init/commit/?id=11121fe4

=== End SRU Template ===

apt-daily is now a systemd service rather than being invoked by cron.daily. If one builds a custom AMI it is possible that the apt-daily.timer will fire during boot. This can fire at the same time cloud-init is running and if cloud-init loses the race the invocation of apt (e.g. use of "packages:" in the config) will fail.

There is a lot of discussion online about this change to apt-daily (e.g. unattended upgrades happening during business hours, delaying boot, etc.) and discussion of potential systemd changes regarding timers firing during boot (c.f. https://github.com/systemd/systemd/issues/5659).

While it would be better to solve this in apt itself, I suggest that cloud-init be defensive when calling apt and implement some retry mechanism.

Various instances of people running into this issue:

https://github.com/chef/bento/issues/609
https://clusterhq.atlassian.net/browse/FLOC-4486
https://github.com/boxcutter/ubuntu/issues/73
https://unix.stackexchange.com/questions/315502/how-to-disable-apt-daily-service-on-ubuntu-cloud-vm-image

Related branches

Revision history for this message
Steve Langasek (vorlon) wrote : Re: [Bug 1693361] [NEW] cloud-init sometimes fails on dpkg lock due to concurrent apt-daily.service execution

On Wed, May 24, 2017 at 09:10:37PM -0000, Jim Browne wrote:
> While it would be better to solve this in apt itself, I suggest that
> cloud-init be defensive when calling apt and implement some retry
> mechanism.

I would suggest instead that cloud-init should declare itself
Before=apt-daily.service / apt-daily.timer, so that cloud-init takes
precedence over apt-daily on first boot.

Revision history for this message
Jim Browne (jbrowne) wrote :

My concern is another apt dependent task being added somewhere else in systemd that winds up triggering during boot. IMO it's better to be generically defensive about the use of apt, but others certainly have more context and information than I do.

Revision history for this message
Scott Moser (smoser) wrote :

I suspect that Steve's suggestion should fix this mostly for cloud-init.
Apt does of course have a general locking problem that really does need addressing.

We've all seen workarounds/retries at all sorts of levels to address the problems that
a.) you basically have to run 'apt-get update' before you run 'apt-get install' (bug 1429285), which results in the over-usage of that fairly heavy resource.

b.) if another process is running 'apt-get install' or 'apt-get remove' when you attempt, you will fail with the lock contention.

These things should be solved in apt, not worked around in yet another process that uses it.

Revision history for this message
Chris White (cwprogram) wrote :

Some research on this indicates:

* `/etc/init.d/rc` is set to run services in parallel via `startpar`
* However there is a portion with concurrency disabled in the same file
* Assuming `cloud-init` was added as part of the non-concurrent part of the file would this prevent the issue?
* `aptdcon` along with `aptd` appears to allow various `apt-get` operations in a queue like system. Unfortunately I can't tell what happens when a standard `apt-get` package install happens while `aptd` is doing its thing. Not only that but it would increase dependencies on cloud images.

David Britton (dpb)
Changed in cloud-init:
importance: Undecided → High
Scott Moser (smoser)
Changed in cloud-init:
status: New → Confirmed
importance: High → Medium
Changed in cloud-init (Ubuntu Xenial):
status: New → Confirmed
Changed in cloud-init (Ubuntu Yakkety):
status: New → Confirmed
Changed in cloud-init (Ubuntu Zesty):
status: New → Confirmed
Changed in cloud-init (Ubuntu Artful):
status: New → Confirmed
Changed in cloud-init (Ubuntu Xenial):
importance: Undecided → Medium
Changed in cloud-init (Ubuntu Yakkety):
importance: Undecided → Medium
Changed in cloud-init (Ubuntu Zesty):
importance: Undecided → Medium
Changed in cloud-init (Ubuntu Artful):
importance: Undecided → High
importance: High → Medium
David Britton (dpb)
no longer affects: apt (Ubuntu Zesty)
no longer affects: apt (Ubuntu Yakkety)
no longer affects: apt (Ubuntu Xenial)
no longer affects: apt (Ubuntu Artful)
Revision history for this message
Julian Andres Klode (juliank) wrote :

We eventually want wait locking in apt, but I don't think it really solves all issues, especially in scripts with multiple apt invocations. Which is why apt-daily got an additional flock lock for the upcoming SRUs. (see artful).

Feel free.to wait on the same.lock and probably add some ordering against apt-daily and apt-daily-upgrade services.

Changed in apt:
status: Unknown → New
Changed in apt:
status: New → Confirmed
David Britton (dpb)
Changed in cloud-init (Ubuntu Artful):
status: Confirmed → Fix Committed
Revision history for this message
Launchpad Janitor (janitor) wrote :
Download full text (3.9 KiB)

This bug was fixed in the package cloud-init - 0.7.9-197-gebc9ecbc-0ubuntu1

---------------
cloud-init (0.7.9-197-gebc9ecbc-0ubuntu1) artful; urgency=medium

  * debian/control: add build dependency python3-jsonschema (LP: #1695318)
  * New upstream snapshot.
    - Azure: Add network-config, Refactor net layer to handle duplicate macs.
      [Ryan Harper]
    - Tests: Simplify the check on ssh-import-id [Joshua Powers]
    - tests: update ntp tests after sntp added [Joshua Powers]
    - FreeBSD: Make freebsd a variant, fix unittests and
      tools/build-on-freebsd.
    - FreeBSD: fix test failure
    - FreeBSD: replace ifdown/ifup with "ifconfig down" and "ifconfig up".
      [Hongjiang Zhang] (LP: #1697815)
    - FreeBSD: fix cdrom mounting failure if /mnt/cdrom/secure did not exist.
      [Hongjiang Zhang] (LP: #1696295)
    - main: Don't use templater to format the welcome message
      [Andrew Jorgensen]
    - docs: Automatically generate module docs form schema if present.
      [Chad Smith]
    - debian: fix path comment in /etc/hosts template.
      [Jens Sandmann] (LP: #1606406)
    - suse: add hostname and fully qualified domain to template.
      [Jens Sandmann]
    - write_file(s): Print permissions as octal, not decimal [Andrew Jorgensen]
    - ci deps: Add --test-distro to read-dependencies to install all deps
      [Chad Smith]
    - tools/run-centos: cleanups and move to using read-dependencies
    - pkg build ci: Add make ci-deps-<distro> target to install pkgs
      [Chad Smith]
    - systemd: make cloud-final.service run before apt daily services.
      (LP: #1693361)
    - selinux: Allow restorecon to be non-fatal. [Ryan Harper] (LP: #1686751)
    - net: Allow netinfo subprocesses to return 0 or 1.
      [Ryan Harper] (LP: #1686751)
    - net: Allow for NetworkManager configuration [Ryan McCabe] (LP: #1693251)
    - Use distro release version to determine if we use systemd in redhat spec
      [Ryan Harper]
    - net: normalize data in network_state object
    - Integration Testing: tox env, pyxld 2.2.3, and revamp framework
      [Wesley Wiedenmeier]
    - Chef: Update omnibus url to chef.io, minor doc changes. [JJ Asghar]
    - tools: add centos scripts to build and test [Joshua Powers]
    - Drop cheetah python module as it is not needed by trunk [Ryan Harper]
    - rhel/centos spec cleanups.
    - cloud.cfg: move to a template. setup.py changes along the way.
    - Makefile: add deb-src and srpm targets. use PYVER more places.
    - makefile: fix python 2/3 detection in the Makefile [Chad Smith]
    - snap: Removing snapcraft plug line [Joshua Powers] (LP: #1695333)
    - RHEL/CentOS: Fix default routes for IPv4/IPv6 configuration.
      [Andreas Karis] (LP: #1696176)
    - test: Fix pyflakes complaint of unused import.
      [Joshua Powers] (LP: #1695918)
    - NoCloud: support seed of nocloud from smbios information
      [Vladimir Pouzanov] (LP: #1691772)
    - net: when selecting a network device, use natural sort order
      [Marc-Aurèle Brothier]
    - fix typos and remove whitespace in various docs [Stephan Telling]
    - systemd: Fix typo in comment in cloud-init.target. [Chen-Han Hsiao]
    - Tests: Skip jso...

Read more...

Changed in cloud-init (Ubuntu Artful):
status: Fix Committed → Fix Released
Scott Moser (smoser)
description: updated
Steve Langasek (vorlon)
description: updated
Revision history for this message
Steve Langasek (vorlon) wrote : Please test proposed package

Hello Jim, or anyone else affected,

Accepted cloud-init into zesty-proposed. The package will build now and be available at https://launchpad.net/ubuntu/+source/cloud-init/0.7.9-153-g16a7302f-0ubuntu1~17.04.2 in a few hours, and then in the -proposed repository.

Please help us by testing this new package. See https://wiki.ubuntu.com/Testing/EnableProposed for documentation on how to enable and use -proposed.Your feedback will aid us getting this update out to other Ubuntu users.

If this package fixes the bug for you, please add a comment to this bug, mentioning the version of the package you tested and change the tag from verification-needed-zesty to verification-done-zesty.If it does not fix the bug for you, please add a comment stating that, and change the tag to verification-failed-zesty. In either case, details of your testing will help us make a better decision.

Further information regarding the verification process can be found at https://wiki.ubuntu.com/QATeam/PerformingSRUVerification . Thank you in advance!

Changed in cloud-init (Ubuntu Zesty):
status: Confirmed → Fix Committed
tags: added: verification-needed verification-needed-zesty
Revision history for this message
Steve Langasek (vorlon) wrote :

Hello Jim, or anyone else affected,

Accepted cloud-init into yakkety-proposed. The package will build now and be available at https://launchpad.net/ubuntu/+source/cloud-init/0.7.9-153-g16a7302f-0ubuntu1~16.10.2 in a few hours, and then in the -proposed repository.

Please help us by testing this new package. See https://wiki.ubuntu.com/Testing/EnableProposed for documentation on how to enable and use -proposed.Your feedback will aid us getting this update out to other Ubuntu users.

If this package fixes the bug for you, please add a comment to this bug, mentioning the version of the package you tested and change the tag from verification-needed-yakkety to verification-done-yakkety.If it does not fix the bug for you, please add a comment stating that, and change the tag to verification-failed-yakkety. In either case, details of your testing will help us make a better decision.

Further information regarding the verification process can be found at https://wiki.ubuntu.com/QATeam/PerformingSRUVerification . Thank you in advance!

Changed in cloud-init (Ubuntu Yakkety):
status: Confirmed → Fix Committed
tags: added: verification-needed-yakkety
Changed in cloud-init (Ubuntu Xenial):
status: Confirmed → Fix Committed
tags: added: verification-needed-xenial
Revision history for this message
Steve Langasek (vorlon) wrote :

Hello Jim, or anyone else affected,

Accepted cloud-init into xenial-proposed. The package will build now and be available at https://launchpad.net/ubuntu/+source/cloud-init/0.7.9-153-g16a7302f-0ubuntu1~16.04.2 in a few hours, and then in the -proposed repository.

Please help us by testing this new package. See https://wiki.ubuntu.com/Testing/EnableProposed for documentation on how to enable and use -proposed.Your feedback will aid us getting this update out to other Ubuntu users.

If this package fixes the bug for you, please add a comment to this bug, mentioning the version of the package you tested and change the tag from verification-needed-xenial to verification-done-xenial.If it does not fix the bug for you, please add a comment stating that, and change the tag to verification-failed-xenial. In either case, details of your testing will help us make a better decision.

Further information regarding the verification process can be found at https://wiki.ubuntu.com/QATeam/PerformingSRUVerification . Thank you in advance!

Revision history for this message
Scott Moser (smoser) wrote :

$ for rel in xenial yakkety zesty; do lxc-proposed-snapshot --proposed $rel proposed-$rel --publish || break; done

$ for rel in xenial yakkety zesty; do lxc launch proposed-$rel "--config=user.user-data=$(cat config.yaml)" test-$rel || break; done

$ sleep 2m

$ for rel in xenial yakkety zesty; do mkdir $rel && ( cd $rel && lxc exec test-$rel -- journalctl -o short-precise > journal.log && lxc exec test-$rel -- dpkg-query --show cloud-init > cloud-init-dpkg.txt && lxc file pull test-$rel/var/log/cloud-init.log cloud-init.log && lxc file pull test-$rel/var/log/cloud-init-output.log cloud-init-output.log ) || break; done

$ for rel in xenial yakkety zesty; do tar -czf /tmp/1693361-$rel.tar.gz $rel; done

Revision history for this message
Scott Moser (smoser) wrote :
Revision history for this message
Scott Moser (smoser) wrote :
tags: added: verification-done-xenial verification-done-yakkety verification-done-zesty
removed: verification-needed verification-needed-xenial verification-needed-yakkety verification-needed-zesty
Revision history for this message
Launchpad Janitor (janitor) wrote :

This bug was fixed in the package cloud-init - 0.7.9-153-g16a7302f-0ubuntu1~16.04.2

---------------
cloud-init (0.7.9-153-g16a7302f-0ubuntu1~16.04.2) xenial-proposed; urgency=medium

  * debian/patches/ds-identify-behavior-xenial.patch: refresh patch.
  * cherry-pick 5fb49bac: azure: identify platform by well known value
    in chassis asset (LP: #1693939)
  * cherry-pick 003c6678: net: remove systemd link file writing from eni
    renderer
  * cherry-pick 1cd4323b: azure: remove accidental duplicate line in
    merge.
  * cherry-pick ebc9ecbc: Azure: Add network-config, Refactor net layer
    to handle duplicate macs. (LP: #1690430)
  * cherry-pick 11121fe4: systemd: make cloud-final.service run before
    apt daily (LP: #1693361)

 -- Scott Moser <email address hidden> Wed, 28 Jun 2017 17:17:18 -0400

Changed in cloud-init (Ubuntu Xenial):
status: Fix Committed → Fix Released
Revision history for this message
Steve Langasek (vorlon) wrote : Update Released

The verification of the Stable Release Update for cloud-init has completed successfully and the package has now been released to -updates. Subsequently, the Ubuntu Stable Release Updates Team is being unsubscribed and will not receive messages about this bug report. In the event that you encounter a regression using the package from -updates please report a new bug using ubuntu-bug and tag the bug report regression-update so we can easily find any regressions.

Revision history for this message
Launchpad Janitor (janitor) wrote :

This bug was fixed in the package cloud-init - 0.7.9-153-g16a7302f-0ubuntu1~17.04.2

---------------
cloud-init (0.7.9-153-g16a7302f-0ubuntu1~17.04.2) zesty-proposed; urgency=medium

  * cherry-pick 5fb49bac: azure: identify platform by well known value
    in chassis asset (LP: #1693939)
  * cherry-pick 003c6678: net: remove systemd link file writing from eni
    renderer
  * cherry-pick 1cd4323b: azure: remove accidental duplicate line in
    merge.
  * cherry-pick ebc9ecbc: Azure: Add network-config, Refactor net layer
    to handle duplicate macs. (LP: #1690430)
  * cherry-pick 11121fe4: systemd: make cloud-final.service run before
    apt daily (LP: #1693361)

 -- Scott Moser <email address hidden> Wed, 28 Jun 2017 17:20:51 -0400

Changed in cloud-init (Ubuntu Zesty):
status: Fix Committed → Fix Released
Steve Langasek (vorlon)
Changed in cloud-init (Ubuntu Yakkety):
status: Fix Committed → Won't Fix
Revision history for this message
Scott Moser (smoser) wrote : Fixed in Cloud-init 17.1

This bug is believed to be fixed in cloud-init in 17.1. If this is still a problem for you, please make a comment and set the state back to New

Thank you.

Changed in cloud-init:
status: Confirmed → Fix Released
Revision history for this message
Julian Andres Klode (juliank) wrote :

Nothing actionable here for apt, so I'll close this. We should consider making frontend locking more flexible for scripts using apt, though, so scripts can hold the lock all the time and drive apt.

Changed in apt (Ubuntu):
status: New → Invalid
Changed in apt:
status: Confirmed → Fix Released
Revision history for this message
David Reis (dryd) wrote :

This is not fixed, it just affected me on Ubuntu 20.04.3 LTS, resulting in the the subsequent server configuration failing completely because awscli and jq were missing.

Output:

Cloud-init v. 21.3-1-g6803368d-0ubuntu1~20.04.4 running 'modules:config' at Fri, 12 Nov 2021 11:05:29 +0000. Up 18.13 seconds.
Get:1 http://eu-central-1.ec2.archive.ubuntu.com/ubuntu focal InRelease [265 kB]
[... more Gets]
E: Could not get lock /var/lib/dpkg/lock-frontend. It is held by process 2764 (unattended-upgr)
E: Unable to acquire the dpkg frontend lock (/var/lib/dpkg/lock-frontend), is another process using it?
Cloud-init v. 21.3-1-g6803368d-0ubuntu1~20.04.4 running 'modules:final' at Fri, 12 Nov 2021 11:05:30 +0000. Up 19.15 seconds.
2021-11-12 11:05:38,955 - util.py[WARNING]: Package upgrade failed
E: Could not get lock /var/lib/dpkg/lock-frontend. It is held by process 2764 (unattended-upgr)
E: Unable to acquire the dpkg frontend lock (/var/lib/dpkg/lock-frontend), is another process using it?
2021-11-12 11:05:38,999 - util.py[WARNING]: Failed to install packages: ['awscli', 'nmap', 'tcpdump', 'bind9utils', 'curl', 'wget', 'vim', 'jq', 'htop', 'tmux', 'git', 'iotop', 'iftop', 'fail2ban']
2021-11-12 11:05:38,999 - cc_package_update_upgrade_install.py[WARNING]: 2 failed with exceptions, re-raising the last one
2021-11-12 11:05:39,000 - util.py[WARNING]: Running module package-update-upgrade-install (<module 'cloudinit.config.cc_package_update_upgrade_install' from '/usr/lib/python3/dist-packages/cloudinit/config/cc_package_update_upgrade_install.py'>) failed

Note: Before=apt-daily.service is only set on cloud-final.service.

Revision history for this message
Julian Andres Klode (juliank) wrote :

Arguably it should run before apt-daily-upgrade too. apt-daily-upgrade is the one locking dpkg; apt-daily locks apt lists (and cache) directory.

Revision history for this message
David Reis (dryd) wrote :

Ah, thanks, I wasn't aware they're distinct. So would simply adding apt-daily-upgrade.service to the Before via cloud-init's bootcmd and then issuing a daemon-reload be a suitable workaround? There's a 30s window until the upgrade process starts if apt's history.log is to be trusted. That is probably enough to be somewhat reliable.

Revision history for this message
Koen Serneels (eskubu) wrote :

From cloud-init point of view the solution now implemented make sense: to run it before the apt-daily-upgrade. However, I wanted to add that there are other use cases as well such as SSM documents being executed on instances. These can be executed in batch at any time and may also require installation of packages and thus interfere with these unattended upgrades.

The execution of documents is not linked directly to cloud-init and may be ran after the instances has been booted, so this falls in the other category of having some kind queuing system or at least a centralized way to obtain a lock to be able to use apt. At the moment there are dozens of different possibilities how to get a mutex to be able to execute apt, but somehow we couldn't find a bullet proof way that works *every time*.

So maybe this does not really fit into this ticket, but to address that this is only a partial fix to a bigger problem.

Revision history for this message
James Falcon (falcojr) wrote :

Not sure if this helps, but we recently added behavior to wait for an apt lock when doing apt commands. This will be included in our next release: https://github.com/canonical/cloud-init/pull/1034

If there are still remaining issues, please open a new bug rather than commenting here. This bug won't be re-opened.

Revision history for this message
Julian Andres Klode (juliank) wrote :

Since 20.04, apt can wait for a lock.

The apt(8) command automatically waits for a lock for 120 seconds (non-interactive) or infinitely.

The apt-get(8) command can be configured to wait as well by passing the -o DPkg::Lock::Timeout=<number of seconds>, where <number of seconds> may also be -1 for infinite.

This avoids any races you'd get by doing the lock yourself and then invoking apt.

Revision history for this message
Jesús Gómez (jgomo3) wrote :

2022 still happens on AWS Ubuntu 20.04.

But in my case, is 100% of the time, not sometimes.

This user-data:

```
#cloud-config

package_update: true
package_upgrade: true

packages:
  - awscli
  - jq
```

```
sudo cloud-init status
status: error
```
Logs collected and attached.

To post a comment you must log in.
This report contains Public information  
Everyone can see this information.

Duplicates of this bug

Other bug subscribers

Remote bug watches

Bug watches keep track of this bug in other bug trackers.