Comment 0 for bug 2057965

Catherine Redfield (catred) wrote :

New GCP daily images are failing startup-script tests because the network is not fully set up when the startup scripts run. The failure can be reproduced as follows:

Using startup_script.sh:
#!/bin/bash
cp /etc/apt/sources.list /tmp/startup-sources.list

$ gcloud compute instances create startup-test --image daily-ubuntu-2204-jammy-v20240314 --image-project ubuntu-os-cloud-devel --metadata-from-file=startup-script=startup_script.sh
[...]
$ ssh [INSTANCE IP]
> diff /tmp/startup-sources.list /etc/apt/sources.list
0a1,8
> ## Note, this file is written by cloud-init on first boot of an instance
> ## modifications made here will not survive a re-bundle.
> ## if you wish to make changes you can:
> ## a.) add 'apt_preserve_sources_list: true' to /etc/cloud/cloud.cfg
> ## or do the same in user-data
> ## b.) add sources in /etc/apt/sources.list.d
> ## c.) make changes to template file /etc/cloud/templates/sources.list.tmpl
>
3,4c11,12
< deb http://archive.ubuntu.com/ubuntu/ jammy main restricted
< # deb-src http://archive.ubuntu.com/ubuntu/ jammy main restricted
---
> deb http://us-central1.gce.archive.ubuntu.com/ubuntu/ jammy main restricted
> # deb-src http://us-central1.gce.archive.ubuntu.com/ubuntu/ jammy main restricted
8,9c16,17
< deb http://archive.ubuntu.com/ubuntu/ jammy-updates main restricted
< # deb-src http://archive.ubuntu.com/ubuntu/ jammy-updates main restricted
---
[...]

Earlier images (such as ubuntu-2204-jammy-v20240307 in ubuntu-os-cloud) do not show this behaviour. The difference comes from a change in ubuntu-pro 31 (see https://github.com/canonical/ubuntu-pro-client/blob/dfe1f1ed4678c50240d4e251f41d33bb4034135e/debian/changelog#L40 for details) that removes a systemd ordering on cloud-config.service. A side effect of this change was the removal of cloud-config.service (and ubuntu-advantage.service) from systemd's critical chain for google-startup-scripts.service.

On v20240307 (startup scripts execute correctly):
catred@startup-test-control:~$ systemd-analyze critical-chain google-startup-scripts.service
The time when unit became active or started is printed after the "@" character.
The time the unit took to start is printed after the "+" character.

google-startup-scripts.service +18.262s
└─multi-user.target @28.480s
  └─ubuntu-advantage.service @28.480s
    └─cloud-config.service @27.372s +1.095s
      └─snapd.seeded.service @20.048s +7.312s
        └─snapd.service @12.469s +7.555s
          └─basic.target @11.558s
            └─sockets.target @11.540s
              └─snap.lxd.daemon.unix.socket @24.376s
                └─sysinit.target @10.825s
                  └─cloud-init.service @8.432s +2.267s
                    └─systemd-networkd-wait-online.service @6.467s +1.935s
                      └─systemd-networkd.service @6.347s +112ms
                        └─network-pre.target @6.328s
                          └─cloud-init-local.service @4.309s +2.006s
                            └─systemd-remount-fs.service @1.829s +68ms
                              └─systemd-fsck-root.service @1.587s +160ms
                                └─systemd-journald.socket @1.292s
                                  └─system.slice @1.068s
                                    └─-.slice @1.068s

On v20240314 (startup scripts fail):
catred@startup-test:~$ systemd-analyze critical-chain google-startup-scripts.service
The time when unit became active or started is printed after the "@" character.
The time the unit took to start is printed after the "+" character.

google-startup-scripts.service +260ms
└─multi-user.target @29.237s
  └─chrony.service @30.240s +56ms
    └─basic.target @13.364s
      └─sockets.target @13.225s
        └─snap.lxd.user-daemon.unix.socket @26.765s
          └─sysinit.target @12.550s
            └─cloud-init.service @7.933s +4.503s
              └─systemd-networkd-wait-online.service @6.741s +1.171s
                └─systemd-networkd.service @6.593s +124ms
                  └─network-pre.target @6.573s
                    └─cloud-init-local.service @4.478s +2.083s
                      └─systemd-remount-fs.service @1.717s +64ms
                        └─systemd-fsck-root.service @1.510s +95ms
                          └─systemd-journald.socket @1.193s
                            └─-.mount @974ms
                              └─-.slice @974ms

This can be fixed by adding an explicit `After=cloud-config.service` to the google-startup-scripts.service file, which enforces the correct ordering between google-startup-scripts and cloud-init.
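
To try this ordering on a test instance without editing the packaged unit file, a systemd drop-in can carry the same `After=` line. This is only a sketch: it uses a drop-in rather than the change to the unit file proposed above, and the function name `write_ordering_dropin`, the drop-in file name `10-after-cloud-config.conf`, and the default path `/etc/systemd/system` are assumptions for illustration, not part of the guest agent packaging.

```shell
#!/bin/bash
# Hypothetical helper: write a drop-in that orders google-startup-scripts.service
# after cloud-config.service.
# $1 = systemd configuration root (assumed to be /etc/systemd/system on a real instance).
write_ordering_dropin() {
    local root="${1:-/etc/systemd/system}"
    local dir="${root}/google-startup-scripts.service.d"
    mkdir -p "${dir}"
    # After= only affects ordering; it does not pull cloud-config.service into the boot.
    printf '[Unit]\nAfter=cloud-config.service\n' > "${dir}/10-after-cloud-config.conf"
}

# On a test instance (as root), the drop-in could then be applied and checked with:
#   write_ordering_dropin
#   systemctl daemon-reload
#   systemd-analyze critical-chain google-startup-scripts.service
```

If the ordering takes effect, cloud-config.service would be expected to show up again in the critical chain after the next boot, as it does on v20240307.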