cloud-init

eatmydata enabled by default results in apt packages not correctly installed

Bug #2007400 reported by Andrew Martin on 2023-02-15

This bug affects 1 person

Affects      Status    Importance   Assigned to   Milestone
cloud-init   Invalid   Medium       Unassigned

Bug Description

I'm using LXD 5.0 (LTS) on Ubuntu 22.04 to launch containers and virtual machines. I am using cloud-init in LXD's user.user-data field to install some apt packages when the container or VM is launched. For example, I have the following cloud-config section:
    package_update: true
    packages:
      - openssh-server

The server is launched as follows:
lxc launch images:ubuntu/jammy/desktop myhostname --vm

However, when the server is launched, the apt package is listed as successfully installed, but there are no files associated with it:
util.py[DEBUG]: apt-install [eatmydata apt-get --option=Dpkg::Options::=--force-confold --option=Dpkg::options::=--force-unsafe-io --assume-yes --quiet install openssh-server ... took 25.031 seconds
handlers.py[DEBUG]: finish: modules-final/config-package-update-upgrade-install: SUCCESS: config-package-update-upgrade-install ran successfully

Yet "dpkg -L openssh-server" shows:
Package 'openssh-server' does not contain any files (!)

This results in an unusable system. I don't know what "eatmydata" is or why it is enabled by default, but it does indeed appear to be eating this data and leaving the server unusable.
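
A quick way to check for this symptom from a script is to count the paths that `dpkg -L` reports. The sketch below is illustrative only: the helper name `count_listed_files` is hypothetical, and it simply counts lines beginning with `/` on standard input.

```shell
# Hypothetical helper: count the absolute paths in `dpkg -L`-style output
# read from standard input. A count of 0 matches the broken state above.
count_listed_files() {
  # grep -c prints the number of matching lines; it exits non-zero when
  # that number is 0, so `|| true` keeps the helper usable under `set -e`.
  grep -c '^/' || true
}
```

Inside the guest it could be used as `dpkg -L openssh-server | count_listed_files`, treating a result of 0 as a failed install.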

It looks like there's a sparsely documented option called apt_get_wrapper which seems to let you disable "eatmydata":
https://cloudinit.readthedocs.io/en/18.5/topics/examples.html?highlight=eatmydata#additional-apt-configuration

I configured this to disable "eatmydata" and, sure enough, I am now able to create a virtual machine successfully. Is it possible to better document the use of eatmydata in cloud-init, and perhaps to disable it by default so that users must opt in to get the performance benefits it provides (while accepting the risk of possible data corruption)?
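
For readers looking for the shape of such an override, here is a minimal sketch that writes a cloud-config file with the wrapper turned off. The file name `user-data.yaml` is hypothetical, and the placement of `apt_get_wrapper` under `system_info` is an assumption based on the linked example, so please confirm both the key and its placement against the documentation for your cloud-init version.

```shell
# Write a cloud-config file that turns the apt-get wrapper off.
# Assumptions: "user-data.yaml" is a hypothetical file name, and the
# option is placed under `system_info`; verify the key name and its
# placement against the docs for your cloud-init version.
cat > user-data.yaml <<'EOF'
#cloud-config
system_info:
  apt_get_wrapper:
    command: eatmydata
    enabled: false
package_update: true
packages:
  - openssh-server
EOF
```

The resulting file can then be supplied as the instance's `user.user-data`, in the same way as the cloud-config shown earlier in this report.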

Thanks!

Andrew Martin (asmartin) wrote on 2023-02-15:

cloud-init.log.txt (155.1 KiB, text/plain)

Brett Holman (holmanb) wrote on 2023-02-16 (last edit on 2023-02-16):

Hi Andrew,

Thanks for filing this bug.

It sounds like there is a bug here, but since this seems reproducible for you (which is surprising given the nature of eatmydata), I'd like to reproduce it myself. I don't think we've had previous errors related to eatmydata, so closer inspection is required before making behavior changes.

Since these images do not have cloud-init installed, some post-boot steps must be required after pulling the image. Could you please share the steps you took to reproduce? We attempted to reproduce it and have not been able to so far.

As for docs, yes this should probably be better documented.

Changed in cloud-init:
status:	New → Incomplete

Andrew Martin (asmartin) wrote on 2023-02-16:

Hi Brett,

Yes, I have this LXD profile:
config:
  user.user-data: |
    #cloud-config
    disable_root: false
    apt:
      preserve_sources_list: true
    package_update: true
    packages:
      - openssh-server
description: install openssh-server
devices: {}
name: myprofile
used_by:

I then ran the following steps to first launch the image, then install cloud-init, and once it was installed apply the profile that uses it:
lxc launch images:ubuntu/jammy/desktop myhostname --vm
lxc exec myhostname -- bash -c 'let i=0; let r=1; while [ $r -ne 0 ] && [ $i -lt 6 ]; do apt-get update; r=$?; let i=i+1; sleep 10; done'
lxc exec myhostname -- apt-get install -y cloud-init
lxc profile assign myhostname default,myprofile
lxc restart myhostname

I have been using this code for quite a while with no issues, and then this problem started occurring within the past week or so. I'm not sure whether a change in the upstream image triggered it, but if so, that may make it more challenging to reproduce.

James Falcon (falcojr) wrote on 2023-02-16:

I went through the scenario outlined here and cannot reproduce the behavior. What's your output for `lxd --version` and `lxc storage show default`?

Generally, the only time eatmydata should cause an issue is if there's a power failure after the apt install. Is there any interruption to the VM between the install and when you look for the files? Otherwise, is there anything strange happening during the boot of your VM that could be emulating this?

Andrew Martin (asmartin) wrote on 2023-02-16:

lxd --version is 5.0.2 and here is the storage config:
config:
  source: tank
  volatile.initial_source: tank
  volume.zfs.remove_snapshots: "true"
  zfs.clone_copy: "false"
  zfs.pool_name: tank
description: ""
name: default
driver: zfs
used_by:

As far as power failures, as you can see above I do an "lxc restart" after assigning the profile because that seemed to be necessary to get cloud-init to go through the full boot/init process. After that "lxc restart", I just run lxc exec myhostname -- bash -c "cloud-init status --wait" and wait for cloud-init to finish. Otherwise, there's nothing strange happening with this host.

James Falcon (falcojr) wrote on 2023-02-17:

Even using that version of LXD and a zfs storage config, I still can't reproduce the behavior. I'm really not sure how eatmydata could cause an issue like this as the machine is still up since cloud-init has run. I could try adding a manual "sync" after running the apt commands to see if that improves anything, but it's really just a shot in the dark.

Andrew Martin (asmartin) wrote on 2023-02-20:

I'm not sure either; this config had been working fine for a couple of months and then suddenly about a week ago this failure started occurring on every attempt to launch servers with LXD.

At a minimum, I think adding more documentation about the presence of eatmydata in the pipeline and how to disable it would be valuable in case others encounter this in the future.

Brett Holman (holmanb) wrote on 2023-02-21:

Andrew, since you are able to reproduce this issue and we are not, could you please provide a dmesg log of a freshly booted system that has just experienced it? Typically this should not happen, but since it has, I would be curious to see what the guest kernel is reporting.

Another possible cause of this besides power failures is memory pressure events, which may cause the kernel to attempt evicting dirty pages.

> At a minimum, I think adding more documentation about the presence of eatmydata in the pipeline and how to disable it would be valuable in case others encounter this in the future.

Agreed, I've added a ticket to the backlog to ensure that this happens.

Andrew Martin (asmartin) wrote on 2023-02-22:

dmesg.txt (62.9 KiB, text/plain)

Sure, attached is the dmesg from a freshly-booted VM (using the above desktop image) which exhibits this problem. There is a ton of free memory on the host so memory contention doesn't seem to be the problem in this case.

James Falcon (falcojr) wrote on 2023-02-24:

#10

Thanks for the updated log. I'm triaging the documentation aspect of this bug. Since we can't reproduce the issue, I don't think there's a specific code change we'll make for this.

Changed in cloud-init:
status:	Incomplete → Triaged
importance:	Undecided → Medium
tags:	added: bitesize

Andrew Martin (asmartin) wrote on 2023-02-27:

#11

Sounds good; thanks for doing the investigation.

Brett Holman (holmanb) wrote on 2023-02-27:

#12

There are signs of disk corruption in your logs:

[ 3.702957] EXT4-fs (sda2): INFO: recovery required on readonly filesystem
[ 3.702967] EXT4-fs (sda2): write access will be enabled during recovery
[ 3.995853] EXT4-fs (sda2): orphan cleanup on readonly fs
[ 3.996145] EXT4-fs (sda2): 1 orphan inode deleted
[ 3.996150] EXT4-fs (sda2): recovery complete

and

[ 6.221353] FAT-fs (sda1): Volume was not properly unmounted. Some data may be corrupt. Please run fsck.

You might try deleting and redownloading your base image.
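
As a rough sketch of how to spot these messages in other logs, the helper below filters `dmesg`-style text for the lines quoted above. The function name `scan_fs_warnings` is hypothetical, and the patterns cover only the messages seen in this report, so an empty result does not prove the filesystem is healthy.

```shell
# Hypothetical helper: print kernel log lines that hint at an unclean
# filesystem, reading `dmesg`-style text from standard input.
# The patterns cover only the messages quoted in this report.
scan_fs_warnings() {
  # `|| true` keeps the exit status 0 when nothing matches.
  grep -E 'EXT4-fs .*(recovery required|orphan)|FAT-fs .*not properly unmounted' || true
}
```

For example, `lxc exec myhostname -- dmesg | scan_fs_warnings` would print any matching lines from the guest kernel log.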

Andrew Martin (asmartin) wrote on 2023-02-28:

#13

Confirmed - I can no longer reproduce it when using eatmydata. I see a new version of the ubuntu/jammy/desktop image was released recently (and therefore is the version I am now using), so either something in the upstream image was fixed or the act of refreshing/updating to the latest version fixed corruption in my local cache.

Brett Holman (holmanb) wrote on 2023-03-01:

#14

Thanks for confirming. Glad to hear that.

Chad Smith (chad.smith) wrote on 2023-03-03:

#15

Marking invalid as it seems this is unrelated to cloud-init. Please re-open if this surfaces again.