[Ubuntu 22.04]cloud-init failed to complete after 10 minutes of waiting was shown during Installation via iDRAC Virtual Console

Bug #1986781 reported by Shubhakar Gowda P S
58
This bug affects 9 people
Affects Status Importance Assigned to Milestone
cloud-init
Invalid
Undecided
Unassigned
subiquity
Invalid
Undecided
Unassigned
casper (Ubuntu)
Fix Released
High
Dan Bungert
Focal
Triaged
Low
Mauricio Faria de Oliveira
Jammy
Fix Released
Low
Mauricio Faria de Oliveira
Kinetic
Won't Fix
Low
Mauricio Faria de Oliveira

Bug Description

[Impact]

 * Users that install Ubuntu Server through slow
   media (eg, virtual optical drive over network,
   which may be common on enterprise deployments)
   might hit the following subiquity startup error:

  'cloud-init failed to complete after 10 minutes of waiting'

 * (That in addition to 10 minutes of waiting themselves.)

 * This happens because casper-md5check.service is
   (slowly) verifying the integrity of install media,
   which blocks `multi-user.target`,
   which blocks `cloud-final.service`,
   which blocks `cloud-init status --wait`
   which is used in subiquity / `waiting on cloud-init`).

[Fix]

 * The adopted solution (merged on lunar) is simply
   not to block `multi-user.target`, but rather run
   _after_ it.

[Test Steps]

 For a synthetic reproducer of slowness of casper-md5check:

 * boot with `break=init` to break into initramfs-tools
   before exec() systemd.
 * chroot /root /bin/bash
 * edit /usr/lib/systemd/system/casper-md5check.service
 * prepend `strace --inject read:delay_enter=5s` to the
   command in `ExecStart`, to introduce a 5 secs delay
   to every read() syscall performed by casper-md5check.
 * exit twice (chroot, initramfs shell) to resume boot.

  See comment 37 for examples.

[Regression Potential]

 * Functionality related to install media integrity check.

 * Users with corrupted install media might not realize
   this until later on; but this is rarely the case and
   even w/out the fix, there's a lot that runs _before_
   we even get to casper-md5check, so they may (still)
   see errors early anyway.

 * There's a cosmetic glitch in the proposed solution:
   the systemd line when casper-md5check finishes
   shows up on top of subiquity's menu (screenshot):

   "[ OK ] Finished casper-md5check Verify Live ISO checksums."

   Dan Bungert mentioned this is known and should be
   addressed in a future change to subiquity, and is
   not supposed to block the SRU for Jammy / 22.04.2.

[Original Description]

Description:

On Dell EMC PowerEdge system when Install Ubuntu 22.04 via iDRAC Virtual Console, cloud-init failed to complete after 10 minutes of waiting.

Steps to Reproduce:

1. Login to iDRAC and Launch Virtual Console.
2. Connect to Virtual Media and Map ubuntu 22.04 iso file using Map CD/DVD option.
3. Try Installing Ubuntu server.
4. "cloud-init" failed to complete after 10 minutes of waiting was shown during Installation.

Expected Results :-

Installation should be successful.

Related branches

no longer affects: bash-completion (Ubuntu)
Revision history for this message
Shubhakar Gowda P S (shubhakara) wrote :
information type: Public → Private
Revision history for this message
James Falcon (falcojr) wrote :

I took a look at the logs, and this doesn't appear to be a problem with cloud-init specifically. Cloud-init runs multiple times during boot, and each run has its own set of systemd services that it depends on having run first. If we look at cloud-init.log, we can see that there's a time where one cloud-init instance exits, then a greater than 10 minute gap before cloud-init's final service runs:

2022-08-17 06:38:53,429 - util.py[DEBUG]: cloud-init mode 'modules' took 0.139 seconds (0.14)
2022-08-17 06:38:53,429 - handlers.py[DEBUG]: finish: modules-config: SUCCESS: running modules for config
2022-08-17 06:52:29,424 - util.py[DEBUG]: Cloud-init v. 22.1-14-g2e17a0d6-0ubuntu1~22.04.5 running 'modules:final' at Wed, 17 Aug 2022 06:52:29 +0000. Up 999.23 seconds.

This can happen if one of cloud-init's dependent services has stalled or failed. To see if another service has stalled or failed, you could try the following commands:
systemd-analyze blame
systemctl --failed

Does the entire installation fail, or just cloud-init?

Changed in cloud-init:
status: New → Incomplete
Revision history for this message
Shubhakar Gowda P S (shubhakara) wrote :

James Falcon,

Does the entire installation fail, or just cloud-init?

Only cloud-init will fail. Once issue occurs when hit 'cancel' we can proceed with the further Installation successfully.

information type: Private → Public
Revision history for this message
Chad Smith (chad.smith) wrote :

Thanks for the info, shubhakara.

To debug further what we need from you on the failed installed system once it boots is the output of
 - systemd-analyze blame
 - systemd-analyze show
 - systemd-analyze critical-chain
 - systemctl --failed
 - journalctl -b 0

This will help us determine what other systemd service ordering seems to be delaying cloud-init boot stages from running.

Most of he data above gets collected by running `cloud-init collect-logs` which emits a tar.gz file that can be attached to this bug if you don't want to run the commands individually.

Given that you mentioned clicking "cancel" and there is a /var/log/intaller subdirectory, this looks like you are using the ubuntu server live-install (subiquity) to install this system. I'm adding subiquity project too just to keep devs aware.

Looking through /var/log/installer/subiquity-server-debug*log I can see `Starting Subiquity server revision 3359` is blocking on `cloud-init status --wait` which alludes to the 10 minute wait you referenced in this bug.

# timeout logs on cloud-init status --wait
2022-08-17 06:39:32,689 DEBUG subiquitycore.utils:77 arun_command called: ['cloud-init', 'status', '--wait']
2022-08-17 06:39:32,861 DEBUG curtin.reporting.start.subiquity/Meta/status_GET:45 start: subiquity/Meta/status_GET:
2022-08-17 06:39:32,863 DEBUG curtin.reporting.finish.subiquity/Meta/status_GET:45 finish: subiquity/Meta/status_GET: SUCCESS: 200 {"state": "CLOUD_INIT_WAIT", "confirming_tty": "", "error": null, "cloud_init...
2022-08-17 06:39:32,863 INFO aiohttp.access:233 [17/Aug/2022:06:39:32 +0000] "GET /meta/status?cur=null HTTP/1.1" 200 419 "-" "Python/3.8 aiohttp/3.6.2"
2022-08-17 06:39:33,347 DEBUG curtin.reporting.start.subiquity/Meta/status_GET:45 start: subiquity/Meta/status_GET:
2022-08-17 06:49:32,690 DEBUG subiquity.server.server:519 waited 600.0012300014496s for cloud-init
2022-08-17 06:49:32,691 DEBUG subiquity.server.server:535 cloud-init status: '<timeout>', assumed disabled

Revision history for this message
Chad Smith (chad.smith) wrote :

Please set the bug status back to "New" when further comments have been added to make sure our team gets alerted about the updated comments so we can respond more quickly.

Changed in subiquity:
status: New → Incomplete
information type: Public → Private
information type: Private → Public
Changed in cloud-init:
status: Incomplete → New
Revision history for this message
Shubhakar Gowda P S (shubhakara) wrote :

Chad Smith,

File "cloud-init-install-fail.tar" in comment #6 contains the cloud init logs collected under /var/log.
File "cloud-init.tar.gz" in comment #7 contains the cloud init logs collected using the command "cloud-init collect-logs".

Revision history for this message
Chad Smith (chad.smith) wrote :

Thanks Shubhakar Gowda P S (shubhakara) for getting back to us.

    From journal.txt in your attached logs in comment #7 is looks like the 10 minute timeout is not strictly related to cloud-init, but cloud-init waiting on snaps being seeded.

I can see a snapd log in your journalctl -b 0 output that says snapd was timed out after an inordinate amount of time waiting for NTP:

Aug 25 08:55:42.976933 ubuntu-server snapd[1642]: devicemgr.go:1927: no NTP sync after 10m0s,

I also see a whole bunch of ntpd and network.timed

The systemd sevice which performs cloud configuration cloud-config.service has a clause which makes it wait until snapd.seeded.service completes in the booting environment so cloud-init use of any snap tools/services installed:

After=snapd.seeded.service

So this 10 minute wait is due to something in either networking setup in the ephemeral/early install environment not being configured appropriately for snapd to come up. Once snapd unblocks, cloud-init can complete.

You can confirm that the problem is snap seeding on your system with the commands we listed above in comments #2 and #4:

 - systemd-analyze blame
 - systemd-analyze show
 - systemd-analyze critical-chain
 - systemctl --failed

That should point to snap seeding being the problem in this early/ephemeral install environment.

Also in your journalctl -b 0 output (in cloud-init collect-logs /var/log/journal.txt)

A red flag I'm also seeing is repeated logs in your journal that look like thrashing of time-related services:
Aug 25 08:52:55.229686 ubuntu-server dbus-daemon[1630]: [system] Successfully activated service 'org.freedesktop.timedate1'
Aug 25 08:52:55.229931 ubuntu-server systemd[1]: Started Time & Date Service.
Aug 25 08:53:55.291021 ubuntu-server systemd[1]: systemd-timedated.service: Deactivated successfully.

Something is amiss with ntp or network configuration on this installation and it looks like it is causing issues for snapd which cause issue for cloud-init to make progress on the install of this system.

In checking /var/log/cloud-init-output.log I see that the two NICs are listed as link down:
ci-info: | eno1 | False | . | . | . | f4:ee:08:19:37:49 |
ci-info: | eno2 | False | . | . | . | f4:ee:08:19:37:4a |

But journalctl -b 0 showed them as "UP":
Aug 25 08:45:04.600250 ubuntu-server systemd-networkd[1594]: eno2: Link UP
Aug 25 08:45:04.635710 ubuntu-server systemd-networkd[1594]: eno1: Link UP

Something seems to be turning off the NICs on this device maybe?

Can you please provide:
1. Was the a customized install Ubuntu 22.04 installer ISO?
2. if public ISO can you provide the URL to the ISO used for install?
3. Run the separate commands and attach that output
 - systemd-analyze blame
 - systemd-analyze show
 - systemd-analyze critical-chain
 - systemctl --failed
 - ls /etc/netplan
 - cat /etc/netplan/*
 - ip addr
 - ip route

Changed in cloud-init:
status: New → Incomplete
Revision history for this message
Michael Hudson-Doyle (mwhudson) wrote :

What's going on here is I think that the md5 check casper is doing is using all the (minimal) amount of IO bandwidth to the virtual ISO and starving other processes from making progress. We should probably make the md5 check run under ionice or just abort itself if it's slow or something.

Revision history for this message
Michael Hudson-Doyle (mwhudson) wrote :

Would someone be able to test a custom ISO that I can make with https://code.launchpad.net/~mwhudson/casper/+git/casper/+merge/429055 incorporated?

Revision history for this message
Narendra K (knarendra) wrote :

Hi Michael,

Thank you for taking a look. Please provide the custom ISO. We will test and share findings.

Revision history for this message
Narendra K (knarendra) wrote :
Download full text (7.2 KiB)

Please find the details:

systemd-analyze-blame-output

12min 33.112s casper-md5check.service
      48.873s snapd.service
      48.004s snapd.seeded.service
      35.847s pollinate.service
      31.663s udisks2.service
      28.538s cloud-init-local.service
      23.592s e2scrub_reap.service
      23.281s thermald.service
      22.361s dev-sr0.device
      17.649s polkit.service
      16.971s dev-loop9.device
      16.768s networkd-dispatcher.service
      16.547s snapd.hold.service
      15.221s dev-loop6.device
      15.003s dev-loop8.device
      14.037s snapd.apparmor.service
      12.682s dev-loop4.device
      12.680s dev-loop5.device
      12.677s dev-loop0.device
      12.093s ModemManager.service
      10.873s snap.lxd.activate.service
      10.661s dev-loop7.device
       5.310s rsyslog.service
       5.231s apport.service
       5.168s systemd-logind.service
       4.633s secureboot-db.service
       4.523s cloud-init.service
       4.297s ua-timer.service
       2.867s console-setup.service
       2.712s systemd-resolved.service
       2.689s systemd-udev-trigger.service
       2.523s systemd-udevd.service
       2.365s multipathd.service
       2.263s systemd-user-sessions.service
       1.742s keyboard-setup.service
       1.590s systemd-modules-load.service
       1.483s systemd-sysusers.service
       1.433s snap-core20-1405.mount
       1.304s snap-lxd-22923.mount
       1.211s snap-snapd-15534.mount
       1.198s update-notifier-download.service
        950ms systemd-journal-flush.service
        840ms systemd-sysctl.service
        837ms setvtrgb.service
        664ms systemd-random-seed.service
  634ms snap-subiquity-3359.mount
        606ms systemd-timesyncd.service
        558ms systemd-tmpfiles-setup.service
        495ms systemd-networkd.service
        461ms cloud-config.service
        418ms systemd-tmpfiles-setup-dev.service
        357ms systemd-update-utmp.service
        340ms systemd-remount-fs.service
        327ms cloud-final.service
        317ms sys-kernel-config.mount
        315ms sys-fs-fuse-connections.mount
        306ms plymouth-read-write.service
        224ms plymouth-quit.service
        199ms lvm2-monitor.service
        181ms systemd-journald.service
        117ms finalrd.service
        111ms ufw.service
         89ms dev-hugepages.mount
         88ms dev-mqueue.mount
         87ms sys-kernel-debug.mount
         87ms sys-kernel-tracing.mount
         83ms lvm2-pvscan@8:3.service
         78ms kmod-static-nodes.service
         76ms <email address hidden>
         76ms <email address hidden>
         75ms <email address hidden>
         26ms ssh.service
         20ms systemd-tmpfiles-clean.service
         17ms plymouth-quit-wait.service
         11ms systemd-update-utmp-runlevel.service
          8ms snapd.socket
          8ms systemd-networkd-wait-online.service
          3ms tmp.mount
         51us blk-availability.service

"systemd-analyze-critical-chain-output"

The time when unit became active or started is printed after the "@" character.
The time the unit took to start is printed after the "+" character.

graphical.target @13min 17.133s
`-multi-user.target @13min 17.132s
  `-casper-md5che...

Read more...

Revision history for this message
Narendra K (knarendra) wrote :
Revision history for this message
Narendra K (knarendra) wrote :

Hi Chad/Michael,

When issue occurs, the installation stops as shown in the screen attached in comment #14. If "Close" option is selected, installation completes successfully and OS completes successfully. Does choosing "Close" option and proceeding with installation have any side effects or could it be recommended as a work around till a solution is included in the installation ISO ?

Narendra K (knarendra)
Changed in cloud-init:
status: Incomplete → New
Changed in subiquity:
status: Incomplete → New
Revision history for this message
Michael Hudson-Doyle (mwhudson) wrote :

If you could try booting the ISO from https://cdimage.ubuntu.com/ubuntu-server/daily-live/pending/ I would be very interested to hear if it is better (and if it is we can backport to 22.04).

> When issue occurs, the installation stops as shown in the screen attached in comment #14. If
> "Close" option is selected, installation completes successfully and OS completes successfully.
> Does choosing "Close" option and proceeding with installation have any side effects or could it
> be recommended as a work around till a solution is included in the installation ISO ?

No, if the install completes successfully then there should be no deficiencies. It mostly impacts things about the installer environment, like autoinstall and ssh access to the installer.

Revision history for this message
Narendra K (knarendra) wrote :

>No, if the install completes successfully then there should be no deficiencies.

Thank you for clarifying.

>It mostly impacts things about the installer environment, like autoinstall and ssh access to the installer.

Any thoughts on how users using autoinstall scenarios could work around this issue ?

>ssh access to the installer

Any thoughts on how users requiring ssh could work around this issue ? Or should it confirmed first if this scenario an issue by trying to ssh when issue occurs ?

Revision history for this message
Shubhakar Gowda P S (shubhakara) wrote :
Download full text (6.6 KiB)

Hi Michael,

As per your assist downloaded 64-bit PC (AMD64) server install image from "https://cdimage.ubuntu.com/ubuntu-server/daily-live/pending/" and tested it, Issue was still observed.

Please find the details:

"systemd-analyze-blame-output"

      13min 33.597s casper-md5check.service
      41.640s snapd.seeded.service
      29.360s snapd.service
      24.189s pollinate.service
      18.638s udisks2.service
      16.918s dev-sr0.device
      10.718s snapd.hold.service
      10.585s ModemManager.service
      10.085s snap.lxd.activate.service
       9.802s cloud-init-local.service
       9.375s snapd.apparmor.service
       9.059s thermald.service
       9.044s console-setup.service
       8.591s polkit.service
       7.693s e2scrub_reap.service
       7.348s systemd-logind.service
       7.220s rsyslog.service
       7.097s dev-loop8.device
       6.720s lvm2-monitor.service
       6.331s dev-loop7.device
       6.050s dev-loop6.device
       6.000s secureboot-db.service
       5.983s apport.service
       5.363s systemd-tmpfiles-setup.service
       4.766s dev-loop9.device
       4.074s dev-loop5.device
       4.073s dev-loop0.device
       4.072s dev-loop4.device
       3.766s snap-lxd-23537.mount
       3.432s multipathd.service
       3.372s snap-snapd-16292.mount
       2.970s systemd-resolved.service
       2.927s snap-core20-1611.mount
       2.376s plymouth-read-write.service
       2.234s keyboard-setup.service
       2.212s systemd-user-sessions.service
       2.040s systemd-udev-trigger.service
       1.672s finalrd.service
       1.483s systemd-sysusers.service
       1.460s cloud-init.service
       1.303s systemd-sysctl.service
       1.296s ufw.service
       1.261s systemd-journal-flush.service
       1.219s setvtrgb.service
       1.147s systemd-random-seed.service
       1.131s plymouth-quit.service
       1.103s systemd-modules-load.service
       1.032s dev-hugepages.mount
       1.030s dev-mqueue.mount
       1.029s sys-kernel-debug.mount
       1.028s sys-kernel-tracing.mount
       1.020s kmod-static-nodes.service
       1.019s systemd-journald.service
       1.014s modprobe@chromeos_pstore.service
       1.013s <email address hidden>
       1.012s <email address hidden>
       1.010s modprobe@efi_pstore.service
       1.009s <email address hidden>
       1.008s modprobe@pstore_blk.service
       1.006s modprobe@pstore_zone.service
       1.005s <email address hidden>
        801ms systemd-remount-fs.service
        712ms systemd-udevd.service
        700ms snap-subiquity-3798.mount
        558ms systemd-networkd.service
        471ms update-notifier-download.service
        378ms systemd-tmpfiles-setup-dev.service
        368ms cloud-config.service
        335ms cloud-final.service
        279ms sys-kernel-config.mount
        269ms sys-fs-fuse-connections.mount
        215ms systemd-timesyncd.service
        107ms systemd-update-utmp.service
         23ms systemd-tmpfiles-clean.service
         11ms systemd-update-utmp-runlevel.service
          9ms snapd.socket
          4ms tmp.mount
          4ms systemd-networkd-wait-online.service
          3ms plymouth-quit-wait.service
         17us blk-availability.s...

Read more...

Revision history for this message
Narendra K (knarendra) wrote :

Hi Chad,

2022-08-17 06:49:32,690 DEBUG subiquity.server.server:519 waited 600.0012300014496s for cloud-init
2022-08-17 06:49:32,691 DEBUG subiquity.server.server:535 cloud-init status: '<timeout>', assumed disabled

Do the logs provide any details on why cloud-init takes more than 10 minutes to complete ? Is cloud-init waiting for casper-md5check.service to complete ? casper-md5check.service seems to be taking > 10 minutes when issue occurs.

Hi Michael,

Could subiquity be made to continue without interrupting the installation and reschedule a check to verify if cloud-init completed ? Or does it need to wait for cloud-init complete before proceeding with installation ?

Also, any thoughts on query from comment #17 ?

Revision history for this message
Michael Hudson-Doyle (mwhudson) wrote :

Hmm it's a shame those casper changes didn't help but not completely surprising.

I haven't verified this but my theory on what is going on is that the casper-md5 check is using all of the small amount of IO the virtual bmc feature provides and thus starving other services like cloud-init and snap seeding from making progress. Nothing is explicitly waiting for the casper-md5 check to complete -- if my theory is correct it's not something that can be fixed by tweaking unit dependencies.

I should have mentioned this before, but one workaround for this is to pass 'tomem' on the kernel command line. This makes the initrd copy the entire ISO into RAM before pivoting to the real rootfs which makes the first part of the boot pretty slow but should make all other parts including the md5 check much faster. It obviously increases RAM usage too but most "real" servers probably have much more RAM than is needed.

Other workarounds we are looking at is making a much smaller ISO that can be provided as a virtual ISO that will then download the main ISO from the network (as the network is probably much much faster than the virtual ISO). But it would be nice to fix this properly, probably by having the casper md5 check abort if it detects that it is only making very slow progress.

Revision history for this message
Chad Smith (chad.smith) wrote :

Narendra K (knarendra),

   As Michael mentioned, this really looks more like a system that is under load and not responding well to all the systemd services that are being run during the install process.

The time values I'm seeing in comment #18 from systemd-analyze are 10 times greater than what I typically see on laptops, containers, VMs and bare metal.

   From the systemd-analyze blame output from comment in #18, I can confirm the reason for cloud-config.service "waiting" is the sum of the systemd snap*services setup time:

      41.640s snapd.seeded.service
      29.360s snapd.service
...
      16.547s snapd.hold.service

Those values account for the minute and a half of waiting... which aligns with the attached cloud-init analyze blame output as well:

starting stage: modules-config
|`->config-ssh-import-id ran successfully @76.25500s +00.00000s

Which tells us that cloud-init's modules-config stage started at 76 seconds after boot began (around the same amount as the combined snap services).

When looking as cloud-init analyze output we see the cloud-final stage doesn't start until 8817 seconds (13 min) after boot which is greater than subiquity's 10 minute timeout.

Starting stage: modules-final
|`->config-scripts-per-once ran successfully @817.63600s +00.00000s

/lib/systemd/system/cloud-final.service config is only started once the following condition is met:

After=multi-user.target

And multi-user.target depends on that casper-md5check.service which is taking 13 minutes for one reason of another.
We can confirm what's slowing up that final stage of cloud-init with the following command on your system:
$ systemd-analyze critical-chain cloud-final.service

Revision history for this message
Chad Smith (chad.smith) wrote :

Ultimately, each md5 file md5 comparison seems to take around 1/2 second on this system. If we are unable to determine what is slowing down the casper-md5check or an early exit in the event of painful md5checks, can we either provide a config option to disable the casper md5 check, or make that cloud-init timeout configurable to > 600 seconds[1]?

References for cloud-init timeout:
https://github.com/canonical/subiquity/blob/main/subiquity/server/server.py#L523

Revision history for this message
Michael Hudson-Doyle (mwhudson) wrote :

You can also put fsck.mode=skip on the kernel command line to get the md5check to skip itself, I should have mentioned that too!

Revision history for this message
Shubhakar Gowda P S (shubhakara) wrote :

Hi Michael,

Tried passing kernel work around parameter "tomem" as suggested in the comment #20, Issue was still observed in this case. And also Tried passing kernel work around parameter "fsck.mode=skip" as suggested in the comment #23, Issue was not seen in this scenario.

Also, we tried reproducing the Issue in 'Ubuntu 20.04.4' in this case Issue was not seen.

Revision history for this message
Shubhakar Gowda P S (shubhakara) wrote :

Chad Smith,

Please find the details as requested in comment #21.

"systemd-analyze critical-chain cloud-final.service output"

The time when unit became active or started is printed after the "@" character.
The time the unit took to start is printed after the "+" character.

cloud-final.service +266ms
`-multi-user.target @12min 38.156s
  `-casper-md5check.service @39.106s +11min 59.049s
    `-basic.target @39.104s
      `-sockets.target @39.103s
        `-snap.lxd.user-daemon.unix.socket @1min 38.160s
          `-sysinit.target @39.068s
            `-cloud-init.service @35.645s +3.417s
              `-systemd-networkd-wait-online.service @35.634s +8ms
                `-systemd-networkd.service @35.265s +365ms
                  `-network-pre.target @35.262s
                    `-cloud-init-local.service @12.178s +23.078s
                      `-systemd-remount-fs.service @6.689s +483ms
                        `-systemd-journald.socket @6.401s
                          `--.mount @6.360s
                            `--.slice @6.360s

Revision history for this message
Chad Smith (chad.smith) wrote :

Shubhakar Gowda P S (shubhakara)
1. thank you for the specific critical-chain of cloud-final.service. It does confirm the main thing blocking cloud-init from completing is the casper-md5check.service on this system. Without that service delaying the multi-user.target, cloud-final,service can complete in around 50
seconds which beats that 10 minute timeout in the installer.

2. > Also, we tried reproducing the Issue in 'Ubuntu 20.04.4' in this case Issue was not seen.
Do you mean that stock 20.04.4 images without any kernel command-line options did not reproduce this issue on the same system type?

I did notice in running the daily server live image for 20.04 today that in the console output it shows me the `fsck.mode=skip` So that is probably why this didn't hit you on the 20.04 install.

Thanks for the kernel cmdline tip Michael, I didn't know about `fsck.mode=skip` and seems like that would avoid this issue on platforms which are being hit with general IO/performance concerns leading to multi-user.target not being reached for an extended period of time.

Maybe no need to add a config setting for the cloud-init --wait timeout in subiquity as the other 'workaround' seems to work here.

cloud-init squad should be touching the `cloud-init status --wait` call https://github.com/canonical/subiquity/blob/main/subiquity/server/server.py#L521 to introspect cloud-init status a bit better once cloud-init releases support for `cloud-init status --format json`. In the event of a timeout, we can also make sure we touch this section to ensure a timeout results in a report of what stage cloud-init is 'stuck' and give suggested debug steps for bug filing.

Revision history for this message
Shubhakar Gowda P S (shubhakara) wrote :

Chad Smith,

Once After Ubuntu 20.04.4 CD Image Installation got completed, by switching over to a shell using 'Ctrl-Z + F2' keyboard shortcut key we have collected the output of following details.

output "systemd-analyze critical-chain cloud-final.service"

The time when unit became active or started is printed after the "@" character.
The time the unit took to start is printed after the "+" character.

cloud-final.service +475ms
`-cloud-config.service @22.760s +473ms
  `-snapd.seeded.service @12.689s +10.067s
    `-basic.target @9.526s
      `-sockets.target @9.526s
        `-snap.lxd.daemon.unix.socket @14.228s
          `-sysinit.target @9.512s
            `-cloud-init.service @8.350s +1.161s
              `-systemd-networkd-wait-online.service @8.341s +6ms
                `-systemd-journald.socket @1.304s
                  `-system.slice @1.300s
                    `--.slice @1.300s

output "systemctl status casper-md5check.service"

Unit casper-md5check.service could not be found

output "/var/log/casper.log"

stdin: Invalid argument
stdin: Invalid argument
stdin: Invalid argument
/init: line 49: can't open /dev/sda: No medium found
/init: line 49: can't open /dev/sda: No medium found
passwd: password expiry information changed.
Using CD-ROM mount point /cdrom/
Identifying... [bdb176fb941b5200237520bff1ab61c3-2]
Scanning disc for index files...
Found 2 package indexes, 0 source indexes, 0 translation indexes and 1 signatures
Found label 'Ubuntu-Server 20.04.4 LTS _Focal Fossa_ - Release amd64 (20220223.1)'
This disc is called:
'Ubuntu-Server 20.04.4 LTS _Focal Fossa_ - Release amd64 (20220223.1)'
Copying package lists...gpgv: Signature made Wed Feb 23 09:26:53 2022 UTC
gpgv: using RSA key 843938DF228D22F7B3742BC0D94AA3F0EFE21092
gpgv: Good signature from "Ubuntu CD Image Automatic Signing Key (2012) <email address hidden>"
Reading Package Indexes... 0%
Reading Package Indexes... Done
Writing new source list
Source list entries for this disc are:
deb cdrom:[Ubuntu-Server 20.04.4 LTS _Focal Fossa_ - Release amd64 (20220223.1)]/ focal main restricted
Repeat this process for the rest of the CDs in your set.

Revision history for this message
Narendra K (knarendra) wrote :

Hi Chad/Michael,

Any updates or feedback on findings in comment #27 ?

It seems like in Ubuntu 20.04,

a) cloud-init does not depend on multi-user.target
b) casper-md5check.service is not present

Could Ubuntu 22.04 installation be modified to match with Ubuntu 20.04 ? Any thoughts ?

Hi Michael,

Could 'fsck.mode=skip' be documented in Ubuntu 22.04 release notes ?

Revision history for this message
Brett Holman (holmanb) wrote :

Since the issue has been narrowed down to casper-md5check, which is out of scope of cloud-init, I'm marking the cloud-init side of this bug as "Invalid", since there is nothing actionable to do to in cloud-init to resolve this bug, and the plans to increase debugging transparency that Chad mentioned were already planned and scheduled.

Changed in cloud-init:
status: New → Invalid
Revision history for this message
Narendra K (knarendra) wrote :

Hi Michael,

Any updates on the queries in comment #28 ?

It seems like in Ubuntu 20.04,

a) cloud-init does not depend on multi-user.target
b) casper-md5check.service is not present

Could Ubuntu 22.04 installation be modified to match with Ubuntu 20.04 ? Any thoughts ?

Revision history for this message
Dan Bungert (dbungert) wrote :

We propose no longer having the md5check block multi-user. I'm opening a matching MP.

Changed in casper (Ubuntu):
status: New → In Progress
importance: Undecided → High
assignee: nobody → Dan Bungert (dbungert)
Revision history for this message
Narendra K (knarendra) wrote :

Hi Dan,

Would the patch get included in Ubuntu 22.04.2 ?

Revision history for this message
Dan Bungert (dbungert) wrote :

Hi Narendra - yes, the plan is to have this done in time for 22.04.2. Thanks for specifically pointing this out, we need to make sure it's included in the Jammy Stable Release Updates.

Changed in subiquity:
status: New → Invalid
Revision history for this message
Launchpad Janitor (janitor) wrote :

This bug was fixed in the package casper - 1.476

---------------
casper (1.476) lunar; urgency=medium

  * Change casper-md5check to Type=Simple to take advantage of the
    alternate interpretation of what started means relative to Type=oneshot.
    This should allow us to reach multi-user.target sooner, especially for
    users where the md5 check is slow. (LP: #1986781)

 -- Dan Bungert <email address hidden> Wed, 23 Nov 2022 16:10:57 +1300

Changed in casper (Ubuntu):
status: In Progress → Fix Released
Changed in casper (Ubuntu Kinetic):
status: New → Triaged
importance: Undecided → Low
assignee: nobody → Mauricio Faria de Oliveira (mfo)
Changed in casper (Ubuntu Jammy):
status: New → Triaged
importance: Undecided → Low
assignee: nobody → Mauricio Faria de Oliveira (mfo)
Changed in casper (Ubuntu Focal):
status: New → Triaged
importance: Undecided → Low
assignee: nobody → Mauricio Faria de Oliveira (mfo)
tags: added: se-sponsor-mfo sts
Revision history for this message
Dan Bungert (dbungert) wrote :

Thanks for moving the SRU forward Mauricio.

I see you nominated this fix for Kinetic, I think it won't help much as we probably won't be creating new Kinetic install media. Or do you have something else in mind?

Revision history for this message
Mauricio Faria de Oliveira (mfo) wrote :

Hi Dan o/
Yes, it'd be just for documentation purposes for the SRU team's review stage.
I'll mark it as Won't Fix with the rationale described, in the SRU template.

Revision history for this message
Mauricio Faria de Oliveira (mfo) wrote :
Download full text (5.8 KiB)

Reproducer/Verification for SRU to Jammy,
based on strace delay injection on read().

Uploading to jammy.
Attaching debdiff for reference.

...

Launch a VM with the Jammy daily live server:

 $ wget https://cdimage.ubuntu.com/ubuntu-server/jammy/daily-live/current/jammy-live-server-amd64.iso

 $ ISO=jammy-live-server-amd64.iso
 $ VM=casper-jammy
 $ virt-install --name $VM --cdrom $ISO --vcpus 2 --memory 2048 --disk none --osinfo ubuntu-stable-latest

Press e to edit, append "break=init console=ttyS0", press ctrl-x, close window.

Open the serial console and chroot:

 $ virsh console $VM
 ...
 (initramfs) chroot /root /bin/bash

Test strace delay injection:

 # time strace --trace read cat /dev/null
 read(3, "\177ELF\2\1\1\3\0\0\0\0\0\0\0\0\3\0>\0\1\0\0\0\3206\2\0\0\0\0\0"..., 832) = 832
 read(3, "", 131072) = 0
 +++ exited with 0 +++

 real 0m0.041s
 user 0m0.009s
 sys 0m0.030s

 # time strace --trace read --inject read:delay_enter=5s cat /dev/null
 read(3, "\177ELF\2\1\1\3\0\0\0\0\0\0\0\0\3\0>\0\1\0\0\0\3206\2\0\0\0\0\0"..., 832) = 832 (DELAYED)
 read(3, "", 131072) = 0 (DELAYED)
 +++ exited with 0 +++

 real 0m10.041s
 user 0m0.010s
 sys 0m0.033s

Modify the casper-md5check service:

 # sed -i '/^ExecStart=/ s,=,=/usr/bin/strace --inject read:delay_enter=60s ,' /usr/lib/systemd/system/casper-md5check.service

 # cat /usr/lib/systemd/system/casper-md5check.service
 [Unit]
 Description=casper-md5check Verify Live ISO checksums

 [Service]
 Type=oneshot
 ExecStart=/usr/bin/strace --inject read:delay_enter=60s /usr/lib/casper/casper-md5check /cdrom /cdrom/md5sum.txt
 RemainAfterExit=yes

 [Install]
 WantedBy=multi-user.target

Exit and wait; the issue happens:

 # exit
 (initramfs) exit

 ...

 Ubuntu 22.04.1 ubuntu-server ttyS0

 connecting...
 waiting for cloud-init... /

 <10 minutes later>

 ================================================================================
   Serial [ Help ]
 ================================================================================

   As the installer is running on a serial console, it has started in basic
   mode, using only the ASCII character set and black and white colours.

    ┌────────────────────────────────────────────────────────────────────────
    │ │
    │ cloud-init failed to complete after 10 minutes of waiting. This │
    │ suggests a bug, which we would appreciate help understanding. If │
    │ you could file a bug at │
    │ https://bugs.launchpad.net/subiquity/+filebug and attach the │
    │ contents of /var/log, it would be most appreciated. │
    │ │
    │ [ Switch to a shell ] │
    │ [ Close ] │
    │ │
    └───────────────────────────────────────────────────────────────...

Read more...

description: updated
description: updated
Revision history for this message
Mauricio Faria de Oliveira (mfo) wrote :

cosmetic glitch (see description's regression potential section).

Revision history for this message
Mauricio Faria de Oliveira (mfo) wrote :

For documentation purposes,

The critical chain of systemd units/time
before/after the fix, with a fake 'sleep 30'
as the command in casper-md5check.service.

Before:

root@ubuntu-server:/# systemd-analyze critical-chain cloud-final.service
The time when unit became active or started is printed after the "@" character.
The time the unit took to start is printed after the "+" character.

cloud-final.service +492ms
`-multi-user.target @37.968s
  `-casper-md5check.service @7.954s +30.010s
    `-basic.target @7.947s
      `-sockets.target @7.946s
        `-snap.lxd.user-daemon.unix.socket @15.740s
          `-sysinit.target @7.892s
            `-cloud-init.service @5.854s +2.032s
              `-systemd-networkd-wait-online.service @5.835s +6ms
                `-systemd-networkd.service @5.713s +119ms
                  `-network-pre.target @5.709s
                    `-cloud-init-local.service @1.943s +3.765s
                      `-systemd-remount-fs.service @1.172s +130ms
                        `-systemd-journald.socket @988ms
                          `--.mount @942ms
                            `--.slice @941ms

After:

root@ubuntu-server:/# systemd-analyze critical-chain cloud-final.service
The time when unit became active or started is printed after the "@" character.
The time the unit took to start is printed after the "+" character.

cloud-final.service +901ms
`-multi-user.target @21.831s
  `-ssh.service @21.712s +38ms
    `-basic.target @8.255s
      `-sockets.target @8.254s
        `-snap.lxd.user-daemon.unix.socket @17.452s
          `-sysinit.target @8.202s
            `-cloud-init.service @6.970s +1.225s
              `-systemd-networkd-wait-online.service @6.958s +9ms
                `-systemd-networkd.service @6.842s +113ms
                  `-network-pre.target @6.838s
                    `-cloud-init-local.service @1.938s +4.898s
                      `-systemd-remount-fs.service @1.160s +174ms
                        `-systemd-journald.socket @941ms
                          `-system.slice @894ms
                            `--.slice @894ms

Revision history for this message
Mauricio Faria de Oliveira (mfo) wrote :
Changed in casper (Ubuntu Kinetic):
status: Triaged → Won't Fix
Changed in casper (Ubuntu Jammy):
status: Triaged → In Progress
Changed in casper (Ubuntu Jammy):
milestone: none → ubuntu-22.04.2
Revision history for this message
Mauricio Faria de Oliveira (mfo) wrote :

Re-uploading to Jammy with changelog updates suggested by Timo.

Revision history for this message
Mauricio Faria de Oliveira (mfo) wrote :
Revision history for this message
Timo Aaltonen (tjaalton) wrote : Please test proposed package

Hello Shubhakar, or anyone else affected,

Accepted casper into jammy-proposed. The package will build now and be available at https://launchpad.net/ubuntu/+source/casper/1.470.2 in a few hours, and then in the -proposed repository.

Please help us by testing this new package. To properly test it you will need to obtain and boot a daily build of a Live CD for jammy. Your feedback will aid us getting this update out to other Ubuntu users.

If this package fixes the bug for you, please add a comment to this bug, mentioning the version of the package you tested, what testing has been performed on the package and change the tag from verification-needed-jammy to verification-done-jammy. If it does not fix the bug for you, please add a comment stating that, and change the tag to verification-failed-jammy. In either case, without details of your testing we will not be able to proceed.

Further information regarding the verification process can be found at https://wiki.ubuntu.com/QATeam/PerformingSRUVerification . Thank you in advance for helping!

N.B. The updated package will be released to -updates after the bug(s) fixed by this package have been verified and the package has been in -proposed for a minimum of 7 days.

Changed in casper (Ubuntu Jammy):
status: In Progress → Fix Committed
tags: added: verification-needed verification-needed-jammy
Revision history for this message
Mauricio Faria de Oliveira (mfo) wrote :

Verification done for jammy-proposed.

...

Booting jammy-live-server-amd64.iso with `break=init`
(more details in comment #37).

(initramfs) chroot /root /bin/bash
root@ubuntu-server:/# dhclient enp1s0

root@ubuntu-server:/# add-apt-repository -y -p proposed

root@ubuntu-server:/# apt install -y casper/jammy-proposed

root@ubuntu-server:/# apt policy casper
casper:
  Installed: 1.470.2
  Candidate: 1.470.2
  Version table:
 *** 1.470.2 500
        500 http://security.ubuntu.com/ubuntu jammy-proposed/main amd64 Packages
        500 http://archive.ubuntu.com/ubuntu jammy-proposed/main amd64 Packages
        100 /var/lib/dpkg/status
...
root@ubuntu-server:/# add-apt-repository -y -p proposed -r

...

root@ubuntu-server:/# sed -i '/^ExecStart=/ s,=,=/usr/bin/strace --inject read:delay_enter=60s ,' /usr/lib/systemd/system/casper-md5check.service

root@ubuntu-server:/# cat /usr/lib/systemd/system/casper-md5check.service
[Unit]
Description=casper-md5check Verify Live ISO checksums
After=multi-user.target

[Service]
Type=oneshot
ExecStart=/usr/bin/strace --inject read:delay_enter=60s /usr/lib/casper/casper-md5check /cdrom /cdrom/md5sum.txt
...

root@ubuntu-server:/# exit
exit
(initramfs) exit
...

...
[ OK ] Reached target Multi-User System.
         Starting casper-md5check Verify Live ISO checksums...
         Starting Execute cloud user/final scripts...
...
[ OK ] Finished Execute cloud user/final scripts.
[ OK ] Reached target Cloud-init target.

Ubuntu 22.04.1 LTS ubuntu-server ttyS0

connecting...
<a few seconds later>

================================================================================
  Serial [ Help ]
================================================================================

  As the installer is running on a serial console, it has started in basic
  mode, using only the ASCII character set and black and white colours.

  If you are connecting from a terminal emulator such as gnome-terminal that
  supports unicode and rich colours you can switch to "rich mode" which uses
  unicode, colours and supports many languages.

  You can also connect to the installer over the network via SSH, which will
  allow use of rich mode.

                          [ Continue in rich mode > ]
                          [ Continue in basic mode > ]
                          [ View SSH instructions ]

tags: added: verification-done verification-done-jammy
removed: verification-needed verification-needed-jammy
Revision history for this message
Mike Rushton (leftyfb) wrote :

"To properly test it you will need to obtain and boot a daily build of a Live CD for jammy"

If that is all that is needed then it is failing for me. If, however, we need to stop the installer, enable the proposed repository, update casper, disable the proposed repository and then continue the installer, then it looks like that has been validated above by Mauricio.

Revision history for this message
Mauricio Faria de Oliveira (mfo) wrote :

Hi Mike,

Right; the package version in -proposed isn't in the daily build yet,
so it needed more manual steps, but once this makes it to -updates
(or if a daily build is built w/ -proposed enabled), then it'll be in.

@ https://cdimage.ubuntu.com/ubuntu-server/jammy/daily-live/20230131/jammy-live-server-amd64.manifest
> casper 1.470.1

cheers,
Mauricio

Revision history for this message
Dan Bungert (dbungert) wrote :

Verification done for jammy-proposed.

I have purposefully done a different test procedure, based on injecting the proposed casper version and using systemd-analyze.

Above, in comments such as https://bugs.launchpad.net/cloud-init/+bug/1986781/comments/25, we can see that casper-md5check needs to complete before hitting multi-user.target.

With casper 1.470.2 in place, it looks like:

The time when unit became active or started is printed after the "@" character.
The time the unit took to start is printed after the "+" character.

casper-md5check.service +13.837s
└─multi-user.target @18.459s
  └─ssh.service @18.336s +30ms
    └─basic.target @7.833s
      └─sockets.target @7.831s
        └─snap.lxd.user-daemon.unix.socket @15.746s
          └─sysinit.target @7.778s
            └─cloud-init.service @6.080s +1.695s
              └─systemd-networkd-wait-online.service @6.061s +10ms
                └─systemd-networkd.service @5.950s +109ms
                  └─network-pre.target @5.948s
                    └─cloud-init-local.service @1.615s +4.332s
                      └─systemd-remount-fs.service @884ms +135ms
                        └─systemd-journald.socket @705ms
                          └─system.slice @624ms
                            └─-.slice @624ms

This is the expected behavior, so the SRU LGTM.

Revision history for this message
Launchpad Janitor (janitor) wrote :

This bug was fixed in the package casper - 1.470.2

---------------
casper (1.470.2) jammy; urgency=medium

  [ Dan Bungert ]
  * d/casper.casper-md5check.service: order it After=multi-user.target.
    This should allow us to reach multi-user.target sooner, especially for
    users where the md5 check is slow. (LP: #1986781)

 -- Mauricio Faria de Oliveira <email address hidden> Fri, 27 Jan 2023 15:31:09 -0300

Changed in casper (Ubuntu Jammy):
status: Fix Committed → Fix Released
Revision history for this message
Łukasz Zemczak (sil2100) wrote : Update Released

The verification of the Stable Release Update for casper has completed successfully and the package is now being released to -updates. Subsequently, the Ubuntu Stable Release Updates Team is being unsubscribed and will not receive messages about this bug report. In the event that you encounter a regression using the package from -updates please report a new bug using ubuntu-bug and tag the bug report regression-update so we can easily find any regressions.

Revision history for this message
James Falcon (falcojr) wrote :
To post a comment you must log in.
This report contains Public information  
Everyone can see this information.

Other bug subscribers

Remote bug watches

Bug watches keep track of this bug in other bug trackers.