Bug #1986781 “[Ubuntu 22.04]cloud-init failed to complete after ...” : Bugs : cloud-init

Shubhakar Gowda P S (shubhakara) on 2022-08-17

no longer affects:

bash-completion (Ubuntu)

Revision history for this message

Shubhakar Gowda P S (shubhakara) wrote on 2022-08-17:

#1

Attaching cloud-init Installation logs under /var/log Edit (10.2 MiB, application/x-tar)

information type:

Public → Private

Revision history for this message

James Falcon (falcojr) wrote on 2022-08-17:

#2

I took a look at the logs, and this doesn't appear to be a problem with cloud-init specifically. Cloud-init runs multiple times during boot, and each run has its own set of systemd services that it depends on having run first. If we look at cloud-init.log, we can see that there's a time where one cloud-init instance exits, then a greater than 10 minute gap before cloud-init's final service runs:

2022-08-17 06:38:53,429 - util.py[DEBUG]: cloud-init mode 'modules' took 0.139 seconds (0.14)
2022-08-17 06:38:53,429 - handlers.py[DEBUG]: finish: modules-config: SUCCESS: running modules for config
2022-08-17 06:52:29,424 - util.py[DEBUG]: Cloud-init v. 22.1-14-g2e17a0d6-0ubuntu1~22.04.5 running 'modules:final' at Wed, 17 Aug 2022 06:52:29 +0000. Up 999.23 seconds.

This can happen if one of cloud-init's dependent services has stalled or failed. To see if another service has stalled or failed, you could try the following commands:
systemd-analyze blame
systemctl --failed

Does the entire installation fail, or just cloud-init?

Changed in cloud-init:
status:	New → Incomplete

Revision history for this message

Shubhakar Gowda P S (shubhakara) wrote on 2022-08-22:

#3

James Falcon,

Does the entire installation fail, or just cloud-init?

Only cloud-init will fail. Once issue occurs when hit 'cancel' we can proceed with the further Installation successfully.

Shubhakar Gowda P S (shubhakara) on 2022-08-24

information type:

Private → Public

Revision history for this message

Chad Smith (chad.smith) wrote on 2022-08-24:

#4

Thanks for the info, shubhakara.

To debug further what we need from you on the failed installed system once it boots is the output of
- systemd-analyze blame
- systemd-analyze show
- systemd-analyze critical-chain
- systemctl --failed
- journalctl -b 0

This will help us determine what other systemd service ordering seems to be delaying cloud-init boot stages from running.

Most of he data above gets collected by running `cloud-init collect-logs` which emits a tar.gz file that can be attached to this bug if you don't want to run the commands individually.

Given that you mentioned clicking "cancel" and there is a /var/log/intaller subdirectory, this looks like you are using the ubuntu server live-install (subiquity) to install this system. I'm adding subiquity project too just to keep devs aware.

Looking through /var/log/installer/subiquity-server-debug*log I can see `Starting Subiquity server revision 3359` is blocking on `cloud-init status --wait` which alludes to the 10 minute wait you referenced in this bug.

# timeout logs on cloud-init status --wait
2022-08-17 06:39:32,689 DEBUG subiquitycore.utils:77 arun_command called: ['cloud-init', 'status', '--wait']
2022-08-17 06:39:32,861 DEBUG curtin.reporting.start.subiquity/Meta/status_GET:45 start: subiquity/Meta/status_GET:
2022-08-17 06:39:32,863 DEBUG curtin.reporting.finish.subiquity/Meta/status_GET:45 finish: subiquity/Meta/status_GET: SUCCESS: 200 {"state": "CLOUD_INIT_WAIT", "confirming_tty": "", "error": null, "cloud_init...
2022-08-17 06:39:32,863 INFO aiohttp.access:233 [17/Aug/2022:06:39:32 +0000] "GET /meta/status?cur=null HTTP/1.1" 200 419 "-" "Python/3.8 aiohttp/3.6.2"
2022-08-17 06:39:33,347 DEBUG curtin.reporting.start.subiquity/Meta/status_GET:45 start: subiquity/Meta/status_GET:
2022-08-17 06:49:32,690 DEBUG subiquity.server.server:519 waited 600.0012300014496s for cloud-init
2022-08-17 06:49:32,691 DEBUG subiquity.server.server:535 cloud-init status: '<timeout>', assumed disabled

Thanks for the info, shubhakara.

To debug further what we need from you on the failed installed system once it boots is the output of
 - systemd-analyze blame
 - systemd-analyze show
 - systemd-analyze critical-chain
 - systemctl --failed
 - journalctl -b 0

This will help us determine what other systemd service ordering seems to be delaying cloud-init boot stages from running.

Most of he data above gets collected by running `cloud-init collect-logs` which emits a tar.gz file that can be attached to this bug if you don't want to run the commands individually.

Given that you mentioned clicking "cancel" and there is a /var/log/intaller subdirectory, this looks like you are using the ubuntu server live-install (subiquity) to install this system. I'm adding subiquity project too just to keep devs aware.

Looking through /var/log/installer/subiquity-server-debug*log I can see `Starting Subiquity server revision 3359` is blocking on `cloud-init status --wait` which alludes to the 10 minute wait you referenced in this bug.

# timeout logs on cloud-init status --wait
2022-08-17 06:39:32,689 DEBUG subiquitycore.utils:77 arun_command called: ['cloud-init', 'status', '--wait']
2022-08-17 06:39:32,861 DEBUG curtin.reporting.start.subiquity/Meta/status_GET:45 start: subiquity/Meta/status_GET: 
2022-08-17 06:39:32,863 DEBUG curtin.reporting.finish.subiquity/Meta/status_GET:45 finish: subiquity/Meta/status_GET: SUCCESS: 200 {"state": "CLOUD_INIT_WAIT", "confirming_tty": "", "error": null, "cloud_init...
2022-08-17 06:39:32,863 INFO aiohttp.access:233  [17/Aug/2022:06:39:32 +0000] "GET /meta/status?cur=null HTTP/1.1" 200 419 "-" "Python/3.8 aiohttp/3.6.2"
2022-08-17 06:39:33,347 DEBUG curtin.reporting.start.subiquity/Meta/status_GET:45 start: subiquity/Meta/status_GET: 
2022-08-17 06:49:32,690 DEBUG subiquity.server.server:519 waited 600.0012300014496s for cloud-init
2022-08-17 06:49:32,691 DEBUG subiquity.server.server:535 cloud-init status: '<timeout>', assumed disabled

Revision history for this message

Chad Smith (chad.smith) wrote on 2022-08-24:

#5

Please set the bug status back to "New" when further comments have been added to make sure our team gets alerted about the updated comments so we can respond more quickly.

Changed in subiquity:
status:	New → Incomplete

Shubhakar Gowda P S (shubhakara) on 2022-08-25

information type:	Public → Private
information type:	Private → Public

Shubhakar Gowda P S (shubhakara) on 2022-08-25

Changed in cloud-init:
status:	Incomplete → New

Revision history for this message

Shubhakar Gowda P S (shubhakara) wrote on 2022-08-25:

#8

Chad Smith,

File "cloud-init-install-fail.tar" in comment #6 contains the cloud init logs collected under /var/log.
File "cloud-init.tar.gz" in comment #7 contains the cloud init logs collected using the command "cloud-init collect-logs".

Revision history for this message

Chad Smith (chad.smith) wrote on 2022-08-26:

#9

Thanks Shubhakar Gowda P S (shubhakara) for getting back to us.

From journal.txt in your attached logs in comment #7 is looks like the 10 minute timeout is not strictly related to cloud-init, but cloud-init waiting on snaps being seeded.

I can see a snapd log in your journalctl -b 0 output that says snapd was timed out after an inordinate amount of time waiting for NTP:

Aug 25 08:55:42.976933 ubuntu-server snapd[1642]: devicemgr.go:1927: no NTP sync after 10m0s,

I also see a whole bunch of ntpd and network.timed

The systemd sevice which performs cloud configuration cloud-config.service has a clause which makes it wait until snapd.seeded.service completes in the booting environment so cloud-init use of any snap tools/services installed:

After=snapd.seeded.service

So this 10 minute wait is due to something in either networking setup in the ephemeral/early install environment not being configured appropriately for snapd to come up. Once snapd unblocks, cloud-init can complete.

You can confirm that the problem is snap seeding on your system with the commands we listed above in comments #2 and #4:

- systemd-analyze blame
- systemd-analyze show
- systemd-analyze critical-chain
- systemctl --failed

That should point to snap seeding being the problem in this early/ephemeral install environment.

Also in your journalctl -b 0 output (in cloud-init collect-logs /var/log/journal.txt)

A red flag I'm also seeing is repeated logs in your journal that look like thrashing of time-related services:
Aug 25 08:52:55.229686 ubuntu-server dbus-daemon[1630]: [system] Successfully activated service 'org.freedesktop.timedate1'
Aug 25 08:52:55.229931 ubuntu-server systemd[1]: Started Time & Date Service.
Aug 25 08:53:55.291021 ubuntu-server systemd[1]: systemd-timedated.service: Deactivated successfully.

Something is amiss with ntp or network configuration on this installation and it looks like it is causing issues for snapd which cause issue for cloud-init to make progress on the install of this system.

In checking /var/log/cloud-init-output.log I see that the two NICs are listed as link down:
ci-info: | eno1 | False | . | . | . | f4:ee:08:19:37:49 |
ci-info: | eno2 | False | . | . | . | f4:ee:08:19:37:4a |

But journalctl -b 0 showed them as "UP":
Aug 25 08:45:04.600250 ubuntu-server systemd-networkd[1594]: eno2: Link UP
Aug 25 08:45:04.635710 ubuntu-server systemd-networkd[1594]: eno1: Link UP

Something seems to be turning off the NICs on this device maybe?

Can you please provide:
1. Was the a customized install Ubuntu 22.04 installer ISO?
2. if public ISO can you provide the URL to the ISO used for install?
3. Run the separate commands and attach that output
- systemd-analyze blame
- systemd-analyze show
- systemd-analyze critical-chain
- systemctl --failed
- ls /etc/netplan
- cat /etc/netplan/*
- ip addr
- ip route

Thanks Shubhakar Gowda P S (shubhakara) for getting back to us.

From journal.txt in your attached logs in comment #7 is looks like the 10 minute timeout is not strictly related to cloud-init, but cloud-init waiting on snaps being seeded.

I can see a snapd log in your journalctl -b 0 output that says snapd was timed out after an inordinate amount of time waiting for NTP:

Aug 25 08:55:42.976933 ubuntu-server snapd[1642]: devicemgr.go:1927: no NTP sync after 10m0s,

I also see a whole bunch of ntpd and network.timed

The systemd sevice which performs cloud configuration cloud-config.service has a clause which makes it wait until snapd.seeded.service completes in the booting environment so cloud-init use of any snap tools/services installed:

After=snapd.seeded.service

So this 10 minute wait is due to something in either networking setup in the ephemeral/early install environment not being configured appropriately for snapd to come up. Once snapd unblocks, cloud-init can complete.

You can confirm that the problem is snap seeding on your system with the commands we listed above in comments #2 and #4:

- systemd-analyze blame
 - systemd-analyze show
 - systemd-analyze critical-chain
 - systemctl --failed

That should point to snap seeding being the problem in this early/ephemeral install environment.

Also in your journalctl -b 0 output (in cloud-init collect-logs /var/log/journal.txt)

A red flag I'm also seeing is repeated logs in your journal that look like thrashing of time-related services:
Aug 25 08:52:55.229686 ubuntu-server dbus-daemon[1630]: [system] Successfully activated service 'org.freedesktop.timedate1'
Aug 25 08:52:55.229931 ubuntu-server systemd[1]: Started Time & Date Service.
Aug 25 08:53:55.291021 ubuntu-server systemd[1]: systemd-timedated.service: Deactivated successfully.

Something is amiss with ntp or network configuration on this installation and it looks like it is causing issues for snapd which cause issue for cloud-init to make progress on the install of this system.

In checking /var/log/cloud-init-output.log I see that the two NICs are listed as link down:
ci-info: |  eno1  | False |     .     |     .     |   .   | f4:ee:08:19:37:49 |
ci-info: |  eno2  | False |     .     |     .     |   .   | f4:ee:08:19:37:4a |

But journalctl -b 0 showed them as "UP":
Aug 25 08:45:04.600250 ubuntu-server systemd-networkd[1594]: eno2: Link UP
Aug 25 08:45:04.635710 ubuntu-server systemd-networkd[1594]: eno1: Link UP

Something seems to be turning off the NICs on this device maybe?

Can you please provide:
1. Was the a customized install Ubuntu 22.04 installer ISO?
2. if public ISO can you provide the URL to the ISO used for install?
3. Run the separate commands and attach that output
 - systemd-analyze blame
 - systemd-analyze show
 - systemd-analyze critical-chain
 - systemctl --failed
 - ls /etc/netplan
 - cat /etc/netplan/*
 - ip addr
 - ip route

Changed in cloud-init:
status:	New → Incomplete

Revision history for this message

Michael Hudson-Doyle (mwhudson) wrote on 2022-08-29:

#10

What's going on here is I think that the md5 check casper is doing is using all the (minimal) amount of IO bandwidth to the virtual ISO and starving other processes from making progress. We should probably make the md5 check run under ionice or just abort itself if it's slow or something.

Revision history for this message

Michael Hudson-Doyle (mwhudson) wrote on 2022-08-29:

#11

Would someone be able to test a custom ISO that I can make with https://code.launchpad.net/~mwhudson/casper/+git/casper/+merge/429055 incorporated?

Revision history for this message

Narendra K (knarendra) wrote on 2022-08-30:

#12

Hi Michael,

Thank you for taking a look. Please provide the custom ISO. We will test and share findings.

Revision history for this message

Narendra K (knarendra) wrote on 2022-08-30:

#13

Download full text (7.2 KiB)

Please find the details:

systemd-analyze-blame-output

12min 33.112s casper-md5check.service
      48.873s snapd.service
      48.004s snapd.seeded.service
      35.847s pollinate.service
      31.663s udisks2.service
      28.538s cloud-init-local.service
      23.592s e2scrub_reap.service
      23.281s thermald.service
      22.361s dev-sr0.device
      17.649s polkit.service
      16.971s dev-loop9.device
      16.768s networkd-dispatcher.service
      16.547s snapd.hold.service
      15.221s dev-loop6.device
      15.003s dev-loop8.device
      14.037s snapd.apparmor.service
      12.682s dev-loop4.device
      12.680s dev-loop5.device
      12.677s dev-loop0.device
      12.093s ModemManager.service
      10.873s snap.lxd.activate.service
      10.661s dev-loop7.device
       5.310s rsyslog.service
       5.231s apport.service
       5.168s systemd-logind.service
       4.633s secureboot-db.service
       4.523s cloud-init.service
       4.297s ua-timer.service
       2.867s console-setup.service
       2.712s systemd-resolved.service
       2.689s systemd-udev-trigger.service
       2.523s systemd-udevd.service
       2.365s multipathd.service
       2.263s systemd-user-sessions.service
       1.742s keyboard-setup.service
       1.590s systemd-modules-load.service
       1.483s systemd-sysusers.service
       1.433s snap-core20-1405.mount
       1.304s snap-lxd-22923.mount
       1.211s snap-snapd-15534.mount
       1.198s update-notifier-download.service
        950ms systemd-journal-flush.service
        840ms systemd-sysctl.service
        837ms setvtrgb.service
        664ms systemd-random-seed.service
  634ms snap-subiquity-3359.mount
        606ms systemd-timesyncd.service
        558ms systemd-tmpfiles-setup.service
        495ms systemd-networkd.service
        461ms cloud-config.service
        418ms systemd-tmpfiles-setup-dev.service
        357ms systemd-update-utmp.service
        340ms systemd-remount-fs.service
        327ms cloud-final.service
        317ms sys-kernel-config.mount
        315ms sys-fs-fuse-connections.mount
        306ms plymouth-read-write.service
        224ms plymouth-quit.service
        199ms lvm2-monitor.service
        181ms systemd-journald.service
        117ms finalrd.service
        111ms ufw.service
         89ms dev-hugepages.mount
         88ms dev-mqueue.mount
         87ms sys-kernel-debug.mount
         87ms sys-kernel-tracing.mount
         83ms lvm2-pvscan@8:3.service
         78ms kmod-static-nodes.service
         76ms <email address hidden>
         76ms <email address hidden>
         75ms <email address hidden>
         26ms ssh.service
         20ms systemd-tmpfiles-clean.service
         17ms plymouth-quit-wait.service
         11ms systemd-update-utmp-runlevel.service
          8ms snapd.socket
          8ms systemd-networkd-wait-online.service
          3ms tmp.mount
         51us blk-availability.service

"systemd-analyze-critical-chain-output"

The time when unit became active or started is printed after the "@" character.
The time the unit took to start is printed after the "+" character.

graphical.target @13min 17.133s
`-multi-user.target @13min 17.132s
`-casper-md5che...

Hi Michael,

As per your assist downloaded 64-bit PC (AMD64) server install image from "https://cdimage.ubuntu.com/ubuntu-server/daily-live/pending/" and tested it, Issue was still observed.

Please find the details:

"systemd-analyze-blame-output"

13min 33.597s casper-md5check.service
      41.640s snapd.seeded.service
      29.360s snapd.service
      24.189s pollinate.service
      18.638s udisks2.service
      16.918s dev-sr0.device
      10.718s snapd.hold.service
      10.585s ModemManager.service
      10.085s snap.lxd.activate.service
       9.802s cloud-init-local.service
       9.375s snapd.apparmor.service
       9.059s thermald.service
       9.044s console-setup.service
       8.591s polkit.service
       7.693s e2scrub_reap.service
       7.348s systemd-logind.service
       7.220s rsyslog.service
       7.097s dev-loop8.device
       6.720s lvm2-monitor.service
       6.331s dev-loop7.device
       6.050s dev-loop6.device
       6.000s secureboot-db.service
       5.983s apport.service
       5.363s systemd-tmpfiles-setup.service
       4.766s dev-loop9.device
       4.074s dev-loop5.device
       4.073s dev-loop0.device
       4.072s dev-loop4.device
       3.766s snap-lxd-23537.mount
       3.432s multipathd.service
       3.372s snap-snapd-16292.mount
       2.970s systemd-resolved.service
       2.927s snap-core20-1611.mount
       2.376s plymouth-read-write.service
       2.234s keyboard-setup.service
       2.212s systemd-user-sessions.service
       2.040s systemd-udev-trigger.service
       1.672s finalrd.service
       1.483s systemd-sysusers.service
       1.460s cloud-init.service
       1.303s systemd-sysctl.service
       1.296s ufw.service
       1.261s systemd-journal-flush.service
       1.219s setvtrgb.service
       1.147s systemd-random-seed.service
       1.131s plymouth-quit.service
       1.103s systemd-modules-load.service
       1.032s dev-hugepages.mount
       1.030s dev-mqueue.mount
       1.029s sys-kernel-debug.mount
       1.028s sys-kernel-tracing.mount
       1.020s kmod-static-nodes.service
       1.019s systemd-journald.service
       1.014s modprobe@chromeos_pstore.service
       1.013s modprobe@configfs.service
       1.012s modprobe@drm.service
       1.010s modprobe@efi_pstore.service
       1.009s modprobe@fuse.service
       1.008s modprobe@pstore_blk.service
       1.006s modprobe@pstore_zone.service
       1.005s modprobe@ramoops.service
        801ms systemd-remount-fs.service
        712ms systemd-udevd.service
        700ms snap-subiquity-3798.mount
        558ms systemd-networkd.service
        471ms update-notifier-download.service
        378ms systemd-tmpfiles-setup-dev.service
        368ms cloud-config.service
        335ms cloud-final.service
        279ms sys-kernel-config.mount
        269ms sys-fs-fuse-connections.mount
        215ms systemd-timesyncd.service
        107ms systemd-update-utmp.service
         23ms systemd-tmpfiles-clean.service
         11ms systemd-update-utmp-runlevel.service
          9ms snapd.socket
          4ms tmp.mount
          4ms systemd-networkd-wait-online.service
          3ms plymouth-quit-wait.service
         17us blk-availability.service

"systemd-analyze-critical-chain-output"

The time when unit became active or started is printed after the "@" character.
The time the unit took to start is printed after the "+" character.

graphical.target @14min 7.154s
`-multi-user.target @14min 7.153s
  `-casper-md5check.service @33.554s +13min 33.597s
    `-basic.target @33.551s
      `-sockets.target @33.550s
        `-snap.lxd.user-daemon.unix.socket @1min 32.661s
          `-sysinit.target @33.513s
            `-cloud-init.service @32.046s +1.460s
              `-systemd-networkd-wait-online.service @32.038s +4ms
                `-systemd-networkd.service @31.473s +558ms
                  `-network-pre.target @31.470s
                    `-cloud-init-local.service @21.655s +9.802s
                      `-systemd-remount-fs.service @4.284s +801ms
                        `-systemd-journald.socket @4.031s
                          `--.mount @3.958s
                            `--.slice @3.958s

"cloud-init-analyze-blame-output"

-- Boot Record 01 --
     00.54400s (init-network/config-ssh)
     00.43600s (init-local/search-NoCloud)
     00.04200s (init-network/config-users-groups)
     00.02700s (modules-config/config-set-passwords)
     00.00800s (modules-final/config-final-message)
     00.00500s (init-network/check-cache)
     00.00100s (modules-final/config-scripts-user)
     00.00100s (modules-final/config-keys-to-console)
     00.00100s (init-network/consume-user-data)
     00.00100s (init-network/activate-datasource)
     00.00000s (modules-final/config-ssh-authkey-fingerprints)
     00.00000s (modules-final/config-scripts-per-once)
     00.00000s (modules-config/config-ssh-import-id)
     00.00000s (init-network/setup-datasource)
     00.00000s (init-network/consume-vendor-data2)
     00.00000s (init-network/consume-vendor-data)
     00.00000s (init-local/check-cache)

1 boot records analyzed

"cloud-init-analyze-show-output"

-- Boot Record 01 --
The total time elapsed since completing an event is printed after the "@" character.
The time the event takes is printed after the "+" character.

Starting stage: init-local
|`->no cache found @00.00300s +00.00000s
|`->found local data from DataSourceNoCloud @00.01500s +00.43600s
Finished stage: (init-local) 01.57200 seconds

Starting stage: init-network
|`->restored from cache with run check: DataSourceNoCloud [seed=/var/lib/cloud/seed/nocloud][dsmode=net] @02.94400s +00.00500s
|`->setting up datasource @02.98900s +00.00000s
|`->reading and applying user-data @02.99800s +00.00100s
|`->reading and applying vendor-data @02.99900s +00.00000s
|`->reading and applying vendor-data2 @02.99900s +00.00000s
|`->activating datasource @03.02400s +00.00100s
|`->config-users-groups ran successfully @03.05100s +00.04200s
|`->config-ssh ran successfully @03.09300s +00.54400s
Finished stage: (init-network) 01.16100 seconds

Starting stage: modules-config
|`->config-ssh-import-id ran successfully @76.25500s +00.00000s
|`->config-set-passwords ran successfully @76.25500s +00.02700s
Finished stage: (modules-config) 00.04800 seconds

Starting stage: modules-final
|`->config-scripts-per-once ran successfully @817.63600s +00.00000s
|`->config-scripts-user ran successfully @817.63600s +00.00100s
|`->config-ssh-authkey-fingerprints ran successfully @817.63700s +00.00000s
|`->config-keys-to-console ran successfully @817.63700s +00.00100s
|`->config-final-message ran successfully @817.63800s +00.00800s
Finished stage: (modules-final) 00.02800 seconds

Total Time: 2.80900 seconds

1 boot records analyzed

Revision history for this message

Narendra K (knarendra) wrote on 2022-09-05:

#19

Hi Chad,

2022-08-17 06:49:32,690 DEBUG subiquity.server.server:519 waited 600.0012300014496s for cloud-init
2022-08-17 06:49:32,691 DEBUG subiquity.server.server:535 cloud-init status: '<timeout>', assumed disabled

Do the logs provide any details on why cloud-init takes more than 10 minutes to complete ? Is cloud-init waiting for casper-md5check.service to complete ? casper-md5check.service seems to be taking > 10 minutes when issue occurs.

Hi Michael,

Could subiquity be made to continue without interrupting the installation and reschedule a check to verify if cloud-init completed ? Or does it need to wait for cloud-init complete before proceeding with installation ?

Also, any thoughts on query from comment #17 ?

Revision history for this message

Michael Hudson-Doyle (mwhudson) wrote on 2022-09-05:

#20

Hmm it's a shame those casper changes didn't help but not completely surprising.

I haven't verified this but my theory on what is going on is that the casper-md5 check is using all of the small amount of IO the virtual bmc feature provides and thus starving other services like cloud-init and snap seeding from making progress. Nothing is explicitly waiting for the casper-md5 check to complete -- if my theory is correct it's not something that can be fixed by tweaking unit dependencies.

I should have mentioned this before, but one workaround for this is to pass 'tomem' on the kernel command line. This makes the initrd copy the entire ISO into RAM before pivoting to the real rootfs which makes the first part of the boot pretty slow but should make all other parts including the md5 check much faster. It obviously increases RAM usage too but most "real" servers probably have much more RAM than is needed.

Other workarounds we are looking at is making a much smaller ISO that can be provided as a virtual ISO that will then download the main ISO from the network (as the network is probably much much faster than the virtual ISO). But it would be nice to fix this properly, probably by having the casper md5 check abort if it detects that it is only making very slow progress.

Revision history for this message

Chad Smith (chad.smith) wrote on 2022-09-07:

#21

Narendra K (knarendra),

As Michael mentioned, this really looks more like a system that is under load and not responding well to all the systemd services that are being run during the install process.

The time values I'm seeing in comment #18 from systemd-analyze are 10 times greater than what I typically see on laptops, containers, VMs and bare metal.

From the systemd-analyze blame output from comment in #18, I can confirm the reason for cloud-config.service "waiting" is the sum of the systemd snap*services setup time:

      41.640s snapd.seeded.service
      29.360s snapd.service
...
      16.547s snapd.hold.service

Those values account for the minute and a half of waiting... which aligns with the attached cloud-init analyze blame output as well:

starting stage: modules-config
|`->config-ssh-import-id ran successfully @76.25500s +00.00000s

Which tells us that cloud-init's modules-config stage started at 76 seconds after boot began (around the same amount as the combined snap services).

When looking as cloud-init analyze output we see the cloud-final stage doesn't start until 8817 seconds (13 min) after boot which is greater than subiquity's 10 minute timeout.

Starting stage: modules-final
|`->config-scripts-per-once ran successfully @817.63600s +00.00000s

/lib/systemd/system/cloud-final.service config is only started once the following condition is met:

After=multi-user.target

And multi-user.target depends on that casper-md5check.service which is taking 13 minutes for one reason of another.
We can confirm what's slowing up that final stage of cloud-init with the following command on your system:
$ systemd-analyze critical-chain cloud-final.service

Revision history for this message

Chad Smith (chad.smith) wrote on 2022-09-07:

#22

Ultimately, each md5 file md5 comparison seems to take around 1/2 second on this system. If we are unable to determine what is slowing down the casper-md5check or an early exit in the event of painful md5checks, can we either provide a config option to disable the casper md5 check, or make that cloud-init timeout configurable to > 600 seconds[1]?

References for cloud-init timeout:
https://github.com/canonical/subiquity/blob/main/subiquity/server/server.py#L523

Revision history for this message

Michael Hudson-Doyle (mwhudson) wrote on 2022-09-08:

#23

You can also put fsck.mode=skip on the kernel command line to get the md5check to skip itself, I should have mentioned that too!

Revision history for this message

Shubhakar Gowda P S (shubhakara) wrote on 2022-09-08:

#24

Hi Michael,

Tried passing kernel work around parameter "tomem" as suggested in the comment #20, Issue was still observed in this case. And also Tried passing kernel work around parameter "fsck.mode=skip" as suggested in the comment #23, Issue was not seen in this scenario.

Also, we tried reproducing the Issue in 'Ubuntu 20.04.4' in this case Issue was not seen.

Revision history for this message

Shubhakar Gowda P S (shubhakara) wrote on 2022-09-08:

#25

Chad Smith,

Please find the details as requested in comment #21.

"systemd-analyze critical-chain cloud-final.service output"

The time when unit became active or started is printed after the "@" character.
The time the unit took to start is printed after the "+" character.

cloud-final.service +266ms
`-multi-user.target @12min 38.156s
  `-casper-md5check.service @39.106s +11min 59.049s
    `-basic.target @39.104s
      `-sockets.target @39.103s
        `-snap.lxd.user-daemon.unix.socket @1min 38.160s
          `-sysinit.target @39.068s
            `-cloud-init.service @35.645s +3.417s
              `-systemd-networkd-wait-online.service @35.634s +8ms
                `-systemd-networkd.service @35.265s +365ms
                  `-network-pre.target @35.262s
                    `-cloud-init-local.service @12.178s +23.078s
                      `-systemd-remount-fs.service @6.689s +483ms
                        `-systemd-journald.socket @6.401s
                          `--.mount @6.360s
                            `--.slice @6.360s

Revision history for this message

Chad Smith (chad.smith) wrote on 2022-09-08:

#26

Shubhakar Gowda P S (shubhakara)
1. thank you for the specific critical-chain of cloud-final.service. It does confirm the main thing blocking cloud-init from completing is the casper-md5check.service on this system. Without that service delaying the multi-user.target, cloud-final,service can complete in around 50
seconds which beats that 10 minute timeout in the installer.

2. > Also, we tried reproducing the Issue in 'Ubuntu 20.04.4' in this case Issue was not seen.
Do you mean that stock 20.04.4 images without any kernel command-line options did not reproduce this issue on the same system type?

I did notice in running the daily server live image for 20.04 today that in the console output it shows me the `fsck.mode=skip` So that is probably why this didn't hit you on the 20.04 install.

Thanks for the kernel cmdline tip Michael, I didn't know about `fsck.mode=skip` and seems like that would avoid this issue on platforms which are being hit with general IO/performance concerns leading to multi-user.target not being reached for an extended period of time.

Maybe no need to add a config setting for the cloud-init --wait timeout in subiquity as the other 'workaround' seems to work here.

cloud-init squad should be touching the `cloud-init status --wait` call https://github.com/canonical/subiquity/blob/main/subiquity/server/server.py#L521 to introspect cloud-init status a bit better once cloud-init releases support for `cloud-init status --format json`. In the event of a timeout, we can also make sure we touch this section to ensure a timeout results in a report of what stage cloud-init is 'stuck' and give suggested debug steps for bug filing.

Revision history for this message

Shubhakar Gowda P S (shubhakara) wrote on 2022-09-09:

#27

Chad Smith,

Once After Ubuntu 20.04.4 CD Image Installation got completed, by switching over to a shell using 'Ctrl-Z + F2' keyboard shortcut key we have collected the output of following details.

output "systemd-analyze critical-chain cloud-final.service"

The time when unit became active or started is printed after the "@" character.
The time the unit took to start is printed after the "+" character.

cloud-final.service +475ms
`-cloud-config.service @22.760s +473ms
  `-snapd.seeded.service @12.689s +10.067s
    `-basic.target @9.526s
      `-sockets.target @9.526s
        `-snap.lxd.daemon.unix.socket @14.228s
          `-sysinit.target @9.512s
            `-cloud-init.service @8.350s +1.161s
              `-systemd-networkd-wait-online.service @8.341s +6ms
                `-systemd-journald.socket @1.304s
                  `-system.slice @1.300s
                    `--.slice @1.300s

output "systemctl status casper-md5check.service"

Unit casper-md5check.service could not be found

output "/var/log/casper.log"

stdin: Invalid argument
stdin: Invalid argument
stdin: Invalid argument
/init: line 49: can't open /dev/sda: No medium found
/init: line 49: can't open /dev/sda: No medium found
passwd: password expiry information changed.
Using CD-ROM mount point /cdrom/
Identifying... [bdb176fb941b5200237520bff1ab61c3-2]
Scanning disc for index files...
Found 2 package indexes, 0 source indexes, 0 translation indexes and 1 signatures
Found label 'Ubuntu-Server 20.04.4 LTS _Focal Fossa_ - Release amd64 (20220223.1)'
This disc is called:
'Ubuntu-Server 20.04.4 LTS _Focal Fossa_ - Release amd64 (20220223.1)'
Copying package lists...gpgv: Signature made Wed Feb 23 09:26:53 2022 UTC
gpgv: using RSA key 843938DF228D22F7B3742BC0D94AA3F0EFE21092
gpgv: Good signature from "Ubuntu CD Image Automatic Signing Key (2012) <email address hidden>"
Reading Package Indexes... 0%
Reading Package Indexes... Done
Writing new source list
Source list entries for this disc are:
deb cdrom:[Ubuntu-Server 20.04.4 LTS _Focal Fossa_ - Release amd64 (20220223.1)]/ focal main restricted
Repeat this process for the rest of the CDs in your set.

Chad Smith,

Once After Ubuntu 20.04.4 CD Image Installation got completed, by switching over to a shell using 'Ctrl-Z + F2' keyboard shortcut key we have collected the output of following details.

output "systemd-analyze critical-chain cloud-final.service"

The time when unit became active or started is printed after the "@" character.
The time the unit took to start is printed after the "+" character.

cloud-final.service +475ms
`-cloud-config.service @22.760s +473ms
  `-snapd.seeded.service @12.689s +10.067s
    `-basic.target @9.526s
      `-sockets.target @9.526s
        `-snap.lxd.daemon.unix.socket @14.228s
          `-sysinit.target @9.512s
            `-cloud-init.service @8.350s +1.161s
              `-systemd-networkd-wait-online.service @8.341s +6ms
                `-systemd-journald.socket @1.304s
                  `-system.slice @1.300s
                    `--.slice @1.300s

output "systemctl status casper-md5check.service"

Unit casper-md5check.service could not be found

output "/var/log/casper.log"

stdin: Invalid argument
stdin: Invalid argument
stdin: Invalid argument
/init: line 49: can't open /dev/sda: No medium found
/init: line 49: can't open /dev/sda: No medium found
passwd: password expiry information changed.
Using CD-ROM mount point /cdrom/
Identifying... [bdb176fb941b5200237520bff1ab61c3-2]
Scanning disc for index files...
Found 2 package indexes, 0 source indexes, 0 translation indexes and 1 signatures
Found label 'Ubuntu-Server 20.04.4 LTS _Focal Fossa_ - Release amd64 (20220223.1)'
This disc is called:
'Ubuntu-Server 20.04.4 LTS _Focal Fossa_ - Release amd64 (20220223.1)'
Copying package lists...gpgv: Signature made Wed Feb 23 09:26:53 2022 UTC
gpgv:                using RSA key 843938DF228D22F7B3742BC0D94AA3F0EFE21092
gpgv: Good signature from "Ubuntu CD Image Automatic Signing Key (2012) <cdimage@ubuntu.com>"
Reading Package Indexes... 0%
Reading Package Indexes... Done
Writing new source list
Source list entries for this disc are:
deb cdrom:[Ubuntu-Server 20.04.4 LTS _Focal Fossa_ - Release amd64 (20220223.1)]/ focal main restricted
Repeat this process for the rest of the CDs in your set.

Revision history for this message

Narendra K (knarendra) wrote on 2022-09-16:

#28

Hi Chad/Michael,

Any updates or feedback on findings in comment #27 ?

It seems like in Ubuntu 20.04,

a) cloud-init does not depend on multi-user.target
b) casper-md5check.service is not present

Could Ubuntu 22.04 installation be modified to match with Ubuntu 20.04 ? Any thoughts ?

Hi Michael,

Could 'fsck.mode=skip' be documented in Ubuntu 22.04 release notes ?

Revision history for this message

Brett Holman (holmanb) wrote on 2022-09-16:

#29

Since the issue has been narrowed down to casper-md5check, which is out of scope of cloud-init, I'm marking the cloud-init side of this bug as "Invalid", since there is nothing actionable to do to in cloud-init to resolve this bug, and the plans to increase debugging transparency that Chad mentioned were already planned and scheduled.

Changed in cloud-init:
status:	New → Invalid

Revision history for this message

Narendra K (knarendra) wrote on 2022-09-21:

#30

Hi Michael,

Any updates on the queries in comment #28 ?

It seems like in Ubuntu 20.04,

a) cloud-init does not depend on multi-user.target
b) casper-md5check.service is not present

Could Ubuntu 22.04 installation be modified to match with Ubuntu 20.04 ? Any thoughts ?

Revision history for this message

Dan Bungert (dbungert) wrote on 2022-11-01:

#31

We propose no longer having the md5check block multi-user. I'm opening a matching MP.

Changed in casper (Ubuntu):
status:	New → In Progress
importance:	Undecided → High
assignee:	nobody → Dan Bungert (dbungert)

Revision history for this message

Narendra K (knarendra) wrote on 2022-11-16:

#32

Hi Dan,

Would the patch get included in Ubuntu 22.04.2 ?

Revision history for this message

Dan Bungert (dbungert) wrote on 2022-11-16:

#33

Hi Narendra - yes, the plan is to have this done in time for 22.04.2. Thanks for specifically pointing this out, we need to make sure it's included in the Jammy Stable Release Updates.

Changed in subiquity:
status:	New → Invalid

Revision history for this message

Launchpad Janitor (janitor) wrote on 2022-11-27:

#34

This bug was fixed in the package casper - 1.476

---------------
casper (1.476) lunar; urgency=medium

  * Change casper-md5check to Type=Simple to take advantage of the
    alternate interpretation of what started means relative to Type=oneshot.
    This should allow us to reach multi-user.target sooner, especially for
    users where the md5 check is slow. (LP: #1986781)

-- Dan Bungert <email address hidden> Wed, 23 Nov 2022 16:10:57 +1300

Changed in casper (Ubuntu):
status:	In Progress → Fix Released

Mauricio Faria de Oliveira (mfo) on 2022-11-29

Changed in casper (Ubuntu Kinetic):
status:	New → Triaged
importance:	Undecided → Low
assignee:	nobody → Mauricio Faria de Oliveira (mfo)
Changed in casper (Ubuntu Jammy):
status:	New → Triaged
importance:	Undecided → Low
assignee:	nobody → Mauricio Faria de Oliveira (mfo)
Changed in casper (Ubuntu Focal):
status:	New → Triaged
importance:	Undecided → Low
assignee:	nobody → Mauricio Faria de Oliveira (mfo)
tags:	added: se-sponsor-mfo sts

Revision history for this message

Dan Bungert (dbungert) wrote on 2022-11-29:

#35

Thanks for moving the SRU forward Mauricio.

I see you nominated this fix for Kinetic, I think it won't help much as we probably won't be creating new Kinetic install media. Or do you have something else in mind?

Revision history for this message

Mauricio Faria de Oliveira (mfo) wrote on 2022-11-29:

#36

Hi Dan o/
Yes, it'd be just for documentation purposes for the SRU team's review stage.
I'll mark it as Won't Fix with the rationale described, in the SRU template.

Revision history for this message

Mauricio Faria de Oliveira (mfo) wrote on 2023-01-23:

#37

Download full text (5.8 KiB)

Reproducer/Verification for SRU to Jammy,
based on strace delay injection on read().

Uploading to jammy.
Attaching debdiff for reference.

...

Launch a VM with the Jammy daily live server:

$ wget https://cdimage.ubuntu.com/ubuntu-server/jammy/daily-live/current/jammy-live-server-amd64.iso

$ ISO=jammy-live-server-amd64.iso
$ VM=casper-jammy
$ virt-install --name $VM --cdrom $ISO --vcpus 2 --memory 2048 --disk none --osinfo ubuntu-stable-latest

Press e to edit, append "break=init console=ttyS0", press ctrl-x, close window.

Open the serial console and chroot:

$ virsh console $VM
...
(initramfs) chroot /root /bin/bash

Test strace delay injection:

# time strace --trace read cat /dev/null
read(3, "\177ELF\2\1\1\3\0\0\0\0\0\0\0\0\3\0>\0\1\0\0\0\3206\2\0\0\0\0\0"..., 832) = 832
read(3, "", 131072) = 0
+++ exited with 0 +++

real 0m0.041s
user 0m0.009s
sys 0m0.030s

# time strace --trace read --inject read:delay_enter=5s cat /dev/null
read(3, "\177ELF\2\1\1\3\0\0\0\0\0\0\0\0\3\0>\0\1\0\0\0\3206\2\0\0\0\0\0"..., 832) = 832 (DELAYED)
read(3, "", 131072) = 0 (DELAYED)
+++ exited with 0 +++

real 0m10.041s
user 0m0.010s
sys 0m0.033s

Modify the casper-md5check service:

# sed -i '/^ExecStart=/ s,=,=/usr/bin/strace --inject read:delay_enter=60s ,' /usr/lib/systemd/system/casper-md5check.service

# cat /usr/lib/systemd/system/casper-md5check.service
[Unit]
Description=casper-md5check Verify Live ISO checksums

[Service]
Type=oneshot
ExecStart=/usr/bin/strace --inject read:delay_enter=60s /usr/lib/casper/casper-md5check /cdrom /cdrom/md5sum.txt
RemainAfterExit=yes

[Install]
WantedBy=multi-user.target

Exit and wait; the issue happens:

# exit
(initramfs) exit

...

Ubuntu 22.04.1 ubuntu-server ttyS0

connecting...
waiting for cloud-init... /

<10 minutes later>

================================================================================
Serial [ Help ]
================================================================================

As the installer is running on a serial console, it has started in basic
mode, using only the ASCII character set and black and white colours.

    ┌────────────────────────────────────────────────────────────────────────┐
    │ │
    │ cloud-init failed to complete after 10 minutes of waiting. This │
    │ suggests a bug, which we would appreciate help understanding. If │
    │ you could file a bug at │
    │ https://bugs.launchpad.net/subiquity/+filebug and attach the │
    │ contents of /var/log, it would be most appreciated. │
    │ │
    │ [ Switch to a shell ] │
    │ [ Close ] │
    │ │
    └───────────────────────────────────────────────────────────────...

Reproducer/Verification for SRU to Jammy,
based on strace delay injection on read().

Uploading to jammy.
Attaching debdiff for reference.

...

Launch a VM with the Jammy daily live server:

$ wget https://cdimage.ubuntu.com/ubuntu-server/jammy/daily-live/current/jammy-live-server-amd64.iso

$ ISO=jammy-live-server-amd64.iso
	$ VM=casper-jammy
	$ virt-install  --name $VM --cdrom $ISO --vcpus 2 --memory 2048 --disk none --osinfo ubuntu-stable-latest

Press e to edit, append "break=init console=ttyS0", press ctrl-x, close window.

Open the serial console and chroot:

$ virsh console $VM
	...
	(initramfs) chroot /root /bin/bash

Test strace delay injection:

# time strace --trace read cat /dev/null
	read(3, "\177ELF\2\1\1\3\0\0\0\0\0\0\0\0\3\0>\0\1\0\0\0\3206\2\0\0\0\0\0"..., 832) = 832
	read(3, "", 131072)                     = 0
	+++ exited with 0 +++

real	0m0.041s
	user	0m0.009s
	sys	0m0.030s

# time strace --trace read --inject read:delay_enter=5s cat /dev/null
	read(3, "\177ELF\2\1\1\3\0\0\0\0\0\0\0\0\3\0>\0\1\0\0\0\3206\2\0\0\0\0\0"..., 832) = 832 (DELAYED)
	read(3, "", 131072)                     = 0 (DELAYED)
	+++ exited with 0 +++

real	0m10.041s
	user	0m0.010s
	sys	0m0.033s

Modify the casper-md5check service:

# sed -i '/^ExecStart=/ s,=,=/usr/bin/strace --inject read:delay_enter=60s ,' /usr/lib/systemd/system/casper-md5check.service

# cat /usr/lib/systemd/system/casper-md5check.service
	[Unit]
	Description=casper-md5check Verify Live ISO checksums

[Service]
	Type=oneshot
	ExecStart=/usr/bin/strace --inject read:delay_enter=60s /usr/lib/casper/casper-md5check /cdrom /cdrom/md5sum.txt
	RemainAfterExit=yes

[Install]
	WantedBy=multi-user.target

Exit and wait; the issue happens:

# exit
	(initramfs) exit

...

Ubuntu 22.04.1 ubuntu-server ttyS0

connecting...
	waiting for cloud-init... /

<10 minutes later>

================================================================================
	  Serial                                                              [ Help ]
	================================================================================

As the installer is running on a serial console, it has started in basic
	  mode, using only the ASCII character set and black and white colours.

┌────────────────────────────────────────────────────────────────────────┐
	   │                                                                        │
	   │  cloud-init failed to complete after 10 minutes of waiting. This       │
	   │  suggests a bug, which we would appreciate help understanding.  If     │
	   │  you could file a bug at                                               │
	   │  https://bugs.launchpad.net/subiquity/+filebug and attach the          │
	   │  contents of /var/log, it would be most appreciated.                   │
	   │                                                                        │
	   │                         [ Switch to a shell ]                          │
	   │                         [ Close             ]                          │
	   │                                                                        │
	   └────────────────────────────────────────────────────────────────────────┘

[ Continue in rich mode  > ]
		                  [ Continue in basic mode > ]

...

Similarly, repeat, this time with test packages from ppa:mfo/lp1986781:

...

Open the serial console and chroot:

$ virsh console $VM
	...
	(initramfs) chroot /root /bin/bash

Install the test package:

# dhclient enp1s0
	# wget https://launchpad.net/~mfo/+archive/ubuntu/lp1986781/+build/25512466/+files/casper_1.470.2_amd64.deb
	# dpkg -i casper_*.deb

Modify the casper-md5check service:

# sed -i '/^ExecStart=/ s,=,=/usr/bin/strace --inject read:delay_enter=60s ,' /usr/lib/systemd/system/casper-md5check.service

# cat /usr/lib/systemd/system/casper-md5check.service
	Description=casper-md5check Verify Live ISO checksums
	After=multi-user.target

[Service]
	Type=oneshot
	ExecStart=/usr/bin/strace --inject read:delay_enter=60s /usr/lib/casper/casper-md5check /cdrom /cdrom/md5sum.txt
	RemainAfterExit=yes

[Install]
	WantedBy=multi-user.target

Exit and wait; the issue does _not_ happen anymore:

# exit
	(initramfs) exit

...

Ubuntu 22.04.1 ubuntu-server ttyS0

connecting...
	waiting for cloud-init... /

================================================================================
	  Serial                                                              [ Help ]
	================================================================================

As the installer is running on a serial console, it has started in basic
	  mode, using only the ASCII character set and black and white colours.

If you are connecting from a terminal emulator such as gnome-terminal that
	  supports unicode and rich colours you can switch to "rich mode" which uses
	  unicode, colours and supports many languages.

You can also connect to the installer over the network via SSH, which will
	  allow use of rich mode.

[ Continue in rich mode  > ]
		                  [ Continue in basic mode > ]
		                  [ View SSH instructions    ]

Checking the casper-md5sum service is still running:

Help > Enter shell.

# systemctl status casper-md5check.service --no-pager | grep 'Active:'
	     Active: activating (start) since Mon 2023-01-23 18:56:50 UTC; 3min 26s ago

And it should not be a problem, as its start timeout is not limited.

# systemctl show casper-md5check.service | grep -i timeout
	TimeoutStartUSec=infinity
	TimeoutStopUSec=1min 30s
	TimeoutAbortUSec=1min 30s
	TimeoutStartFailureMode=terminate
	TimeoutStopFailureMode=terminate
	TimeoutCleanUSec=infinity
	JobTimeoutUSec=infinity
	JobRunningTimeoutUSec=infinity
	JobTimeoutAction=none

Quit, cleanup the VM.

Press ctrl-]

$ virsh destroy $VM
	$ virsh undefine $VM

description:	updated
description:	updated

Revision history for this message

Mauricio Faria de Oliveira (mfo) wrote on 2023-01-23:

#39

Screenshot from 2023-01-23 13-08-33.png Edit (21.9 KiB, image/png)

cosmetic glitch (see description's regression potential section).

Revision history for this message

Mauricio Faria de Oliveira (mfo) wrote on 2023-01-23:

#40

For documentation purposes,

The critical chain of systemd units/time
before/after the fix, with a fake 'sleep 30'
as the command in casper-md5check.service.

Before:

root@ubuntu-server:/# systemd-analyze critical-chain cloud-final.service
The time when unit became active or started is printed after the "@" character.
The time the unit took to start is printed after the "+" character.

cloud-final.service +492ms
`-multi-user.target @37.968s
  `-casper-md5check.service @7.954s +30.010s
    `-basic.target @7.947s
      `-sockets.target @7.946s
        `-snap.lxd.user-daemon.unix.socket @15.740s
          `-sysinit.target @7.892s
            `-cloud-init.service @5.854s +2.032s
              `-systemd-networkd-wait-online.service @5.835s +6ms
                `-systemd-networkd.service @5.713s +119ms
                  `-network-pre.target @5.709s
                    `-cloud-init-local.service @1.943s +3.765s
                      `-systemd-remount-fs.service @1.172s +130ms
                        `-systemd-journald.socket @988ms
                          `--.mount @942ms
                            `--.slice @941ms

After:

root@ubuntu-server:/# systemd-analyze critical-chain cloud-final.service
The time when unit became active or started is printed after the "@" character.
The time the unit took to start is printed after the "+" character.

cloud-final.service +901ms
`-multi-user.target @21.831s
  `-ssh.service @21.712s +38ms
    `-basic.target @8.255s
      `-sockets.target @8.254s
        `-snap.lxd.user-daemon.unix.socket @17.452s
          `-sysinit.target @8.202s
            `-cloud-init.service @6.970s +1.225s
              `-systemd-networkd-wait-online.service @6.958s +9ms
                `-systemd-networkd.service @6.842s +113ms
                  `-network-pre.target @6.838s
                    `-cloud-init-local.service @1.938s +4.898s
                      `-systemd-remount-fs.service @1.160s +174ms
                        `-systemd-journald.socket @941ms
                          `-system.slice @894ms
                            `--.slice @894ms

For documentation purposes,

The critical chain of systemd units/time
before/after the fix, with a fake 'sleep 30'
as the command in casper-md5check.service.

Before:

root@ubuntu-server:/# systemd-analyze critical-chain cloud-final.service
The time when unit became active or started is printed after the "@" character.
The time the unit took to start is printed after the "+" character.

cloud-final.service +492ms
`-multi-user.target @37.968s
  `-casper-md5check.service @7.954s +30.010s
    `-basic.target @7.947s
      `-sockets.target @7.946s
        `-snap.lxd.user-daemon.unix.socket @15.740s
          `-sysinit.target @7.892s
            `-cloud-init.service @5.854s +2.032s
              `-systemd-networkd-wait-online.service @5.835s +6ms
                `-systemd-networkd.service @5.713s +119ms
                  `-network-pre.target @5.709s
                    `-cloud-init-local.service @1.943s +3.765s
                      `-systemd-remount-fs.service @1.172s +130ms
                        `-systemd-journald.socket @988ms
                          `--.mount @942ms
                            `--.slice @941ms
                            
After:

root@ubuntu-server:/# systemd-analyze critical-chain cloud-final.service
The time when unit became active or started is printed after the "@" character.
The time the unit took to start is printed after the "+" character.

cloud-final.service +901ms
`-multi-user.target @21.831s
  `-ssh.service @21.712s +38ms
    `-basic.target @8.255s
      `-sockets.target @8.254s
        `-snap.lxd.user-daemon.unix.socket @17.452s
          `-sysinit.target @8.202s
            `-cloud-init.service @6.970s +1.225s
              `-systemd-networkd-wait-online.service @6.958s +9ms
                `-systemd-networkd.service @6.842s +113ms
                  `-network-pre.target @6.838s
                    `-cloud-init-local.service @1.938s +4.898s
                      `-systemd-remount-fs.service @1.160s +174ms
                        `-systemd-journald.socket @941ms
                          `-system.slice @894ms
                            `--.slice @894ms

Revision history for this message

Mauricio Faria de Oliveira (mfo) wrote on 2023-01-23:

#41

lp1986781-casper-jammy.debdiff Edit (1.3 KiB, text/plain)

Changed in casper (Ubuntu Kinetic):
status:	Triaged → Won't Fix
Changed in casper (Ubuntu Jammy):
status:	Triaged → In Progress

Dimitri John Ledkov (xnox) on 2023-01-24

Changed in casper (Ubuntu Jammy):
milestone:	none → ubuntu-22.04.2

Revision history for this message

Mauricio Faria de Oliveira (mfo) wrote on 2023-01-27:

#42

Re-uploading to Jammy with changelog updates suggested by Timo.

Revision history for this message

Mauricio Faria de Oliveira (mfo) wrote on 2023-01-27:

#43

lp1986781-casper-jammy-v2.debdiff Edit (1.1 KiB, text/plain)

Revision history for this message

Timo Aaltonen (tjaalton) wrote on 2023-01-28: Please test proposed package

#44

Hello Shubhakar, or anyone else affected,

Accepted casper into jammy-proposed. The package will build now and be available at https://launchpad.net/ubuntu/+source/casper/1.470.2 in a few hours, and then in the -proposed repository.

Please help us by testing this new package. To properly test it you will need to obtain and boot a daily build of a Live CD for jammy. Your feedback will aid us getting this update out to other Ubuntu users.

If this package fixes the bug for you, please add a comment to this bug, mentioning the version of the package you tested, what testing has been performed on the package and change the tag from verification-needed-jammy to verification-done-jammy. If it does not fix the bug for you, please add a comment stating that, and change the tag to verification-failed-jammy. In either case, without details of your testing we will not be able to proceed.

Further information regarding the verification process can be found at https://wiki.ubuntu.com/QATeam/PerformingSRUVerification . Thank you in advance for helping!

N.B. The updated package will be released to -updates after the bug(s) fixed by this package have been verified and the package has been in -proposed for a minimum of 7 days.

Changed in casper (Ubuntu Jammy):
status:	In Progress → Fix Committed
tags:	added: verification-needed verification-needed-jammy

Revision history for this message

Mauricio Faria de Oliveira (mfo) wrote on 2023-01-30:

#45

Verification done for jammy-proposed.

...

Booting jammy-live-server-amd64.iso with `break=init`
(more details in comment #37).

(initramfs) chroot /root /bin/bash
root@ubuntu-server:/# dhclient enp1s0

root@ubuntu-server:/# add-apt-repository -y -p proposed

root@ubuntu-server:/# apt install -y casper/jammy-proposed

root@ubuntu-server:/# apt policy casper
casper:
  Installed: 1.470.2
  Candidate: 1.470.2
  Version table:
*** 1.470.2 500
        500 http://security.ubuntu.com/ubuntu jammy-proposed/main amd64 Packages
        500 http://archive.ubuntu.com/ubuntu jammy-proposed/main amd64 Packages
        100 /var/lib/dpkg/status
...
root@ubuntu-server:/# add-apt-repository -y -p proposed -r

...

root@ubuntu-server:/# sed -i '/^ExecStart=/ s,=,=/usr/bin/strace --inject read:delay_enter=60s ,' /usr/lib/systemd/system/casper-md5check.service

root@ubuntu-server:/# cat /usr/lib/systemd/system/casper-md5check.service
[Unit]
Description=casper-md5check Verify Live ISO checksums
After=multi-user.target

[Service]
Type=oneshot
ExecStart=/usr/bin/strace --inject read:delay_enter=60s /usr/lib/casper/casper-md5check /cdrom /cdrom/md5sum.txt
...

root@ubuntu-server:/# exit
exit
(initramfs) exit
...

...
[ OK ] Reached target Multi-User System.
Starting casper-md5check Verify Live ISO checksums...
Starting Execute cloud user/final scripts...
...
[ OK ] Finished Execute cloud user/final scripts.
[ OK ] Reached target Cloud-init target.

Ubuntu 22.04.1 LTS ubuntu-server ttyS0

connecting...
<a few seconds later>

================================================================================
Serial [ Help ]
================================================================================

As the installer is running on a serial console, it has started in basic
mode, using only the ASCII character set and black and white colours.

  If you are connecting from a terminal emulator such as gnome-terminal that
  supports unicode and rich colours you can switch to "rich mode" which uses
  unicode, colours and supports many languages.

You can also connect to the installer over the network via SSH, which will
allow use of rich mode.

                          [ Continue in rich mode > ]
                          [ Continue in basic mode > ]
                          [ View SSH instructions ]

Verification done for jammy-proposed.

...

Booting jammy-live-server-amd64.iso with `break=init`
(more details in comment #37).

(initramfs) chroot /root /bin/bash
root@ubuntu-server:/# dhclient enp1s0

root@ubuntu-server:/# add-apt-repository -y -p proposed

root@ubuntu-server:/# apt install -y casper/jammy-proposed

root@ubuntu-server:/# apt policy casper
casper:
  Installed: 1.470.2
  Candidate: 1.470.2
  Version table:
 *** 1.470.2 500
        500 http://security.ubuntu.com/ubuntu jammy-proposed/main amd64 Packages
        500 http://archive.ubuntu.com/ubuntu jammy-proposed/main amd64 Packages
        100 /var/lib/dpkg/status
...
root@ubuntu-server:/# add-apt-repository -y -p proposed -r

...

root@ubuntu-server:/# sed -i '/^ExecStart=/ s,=,=/usr/bin/strace --inject read:delay_enter=60s ,' /usr/lib/systemd/system/casper-md5check.service

root@ubuntu-server:/# cat /usr/lib/systemd/system/casper-md5check.service
[Unit]
Description=casper-md5check Verify Live ISO checksums
After=multi-user.target

[Service]
Type=oneshot
ExecStart=/usr/bin/strace --inject read:delay_enter=60s /usr/lib/casper/casper-md5check /cdrom /cdrom/md5sum.txt
...

root@ubuntu-server:/# exit
exit
(initramfs) exit
...

...
[  OK  ] Reached target Multi-User System.
         Starting casper-md5check Verify Live ISO checksums...
         Starting Execute cloud user/final scripts...
...
[  OK  ] Finished Execute cloud user/final scripts.
[  OK  ] Reached target Cloud-init target.

Ubuntu 22.04.1 LTS ubuntu-server ttyS0

connecting...
<a few seconds later>

================================================================================
  Serial                                                              [ Help ]
================================================================================

As the installer is running on a serial console, it has started in basic
  mode, using only the ASCII character set and black and white colours.

If you are connecting from a terminal emulator such as gnome-terminal that
  supports unicode and rich colours you can switch to "rich mode" which uses
  unicode, colours and supports many languages.

You can also connect to the installer over the network via SSH, which will
  allow use of rich mode.

[ Continue in rich mode  > ]
                          [ Continue in basic mode > ]
                          [ View SSH instructions    ]

tags:

added: verification-done verification-done-jammy
removed: verification-needed verification-needed-jammy

Revision history for this message

Mike Rushton (leftyfb) wrote on 2023-01-31:

#46

"To properly test it you will need to obtain and boot a daily build of a Live CD for jammy"

If that is all that is needed then it is failing for me. If, however, we need to stop the installer, enable the proposed repository, update casper, disable the proposed repository and then continue the installer, then it looks like that has been validated above by Mauricio.

Revision history for this message

Mauricio Faria de Oliveira (mfo) wrote on 2023-01-31:

#47

Hi Mike,

Right; the package version in -proposed isn't in the daily build yet,
so it needed more manual steps, but once this makes it to -updates
(or if a daily build is built w/ -proposed enabled), then it'll be in.

@ https://cdimage.ubuntu.com/ubuntu-server/jammy/daily-live/20230131/jammy-live-server-amd64.manifest
> casper 1.470.1

cheers,
Mauricio

Revision history for this message

Dan Bungert (dbungert) wrote on 2023-02-02:

#48

Verification done for jammy-proposed.

I have purposefully done a different test procedure, based on injecting the proposed casper version and using systemd-analyze.

Above, in comments such as https://bugs.launchpad.net/cloud-init/+bug/1986781/comments/25, we can see that casper-md5check needs to complete before hitting multi-user.target.

With casper 1.470.2 in place, it looks like:

The time when unit became active or started is printed after the "@" character.
The time the unit took to start is printed after the "+" character.

casper-md5check.service +13.837s
└─multi-user.target @18.459s
  └─ssh.service @18.336s +30ms
    └─basic.target @7.833s
      └─sockets.target @7.831s
        └─snap.lxd.user-daemon.unix.socket @15.746s
          └─sysinit.target @7.778s
            └─cloud-init.service @6.080s +1.695s
              └─systemd-networkd-wait-online.service @6.061s +10ms
                └─systemd-networkd.service @5.950s +109ms
                  └─network-pre.target @5.948s
                    └─cloud-init-local.service @1.615s +4.332s
                      └─systemd-remount-fs.service @884ms +135ms
                        └─systemd-journald.socket @705ms
                          └─system.slice @624ms
                            └─-.slice @624ms

This is the expected behavior, so the SRU LGTM.

Revision history for this message

Launchpad Janitor (janitor) wrote on 2023-02-07:

#49

This bug was fixed in the package casper - 1.470.2

---------------
casper (1.470.2) jammy; urgency=medium

  [ Dan Bungert ]
  * d/casper.casper-md5check.service: order it After=multi-user.target.
    This should allow us to reach multi-user.target sooner, especially for
    users where the md5 check is slow. (LP: #1986781)

-- Mauricio Faria de Oliveira <email address hidden> Fri, 27 Jan 2023 15:31:09 -0300

Changed in casper (Ubuntu Jammy):
status:	Fix Committed → Fix Released

Revision history for this message

Łukasz Zemczak (sil2100) wrote on 2023-02-07: Update Released

#50

The verification of the Stable Release Update for casper has completed successfully and the package is now being released to -updates. Subsequently, the Ubuntu Stable Release Updates Team is being unsubscribed and will not receive messages about this bug report. In the event that you encounter a regression using the package from -updates please report a new bug using ubuntu-bug and tag the bug report regression-update so we can easily find any regressions.

Revision history for this message

James Falcon (falcojr) wrote on 2023-05-12:

#51

Tracked in Github Issues as https://github.com/canonical/cloud-init/issues/4009

cloud-init

[Ubuntu 22.04]cloud-init failed to complete after 10 minutes of waiting was shown during Installation via iDRAC Virtual Console

Bug Description

Related branches

Duplicates of this bug

Other bug subscribers

Patches

Bug attachments

Remote bug watches

	Status	Importance	Assigned to	Milestone
cloud-init	Invalid	Undecided	Unassigned
subiquity	Invalid	Undecided	Unassigned
casper (Ubuntu)	Fix Released	High	Dan Bungert
Focal	Triaged	Low	Mauricio Faria de Oliveira
Jammy	Fix Released	Low	Mauricio Faria de Oliveira	Ubuntu ubuntu-22.04.2
Kinetic	Won't Fix	Low	Mauricio Faria de Oliveira