google-startup-scripts runs before cloud-init finished network setup

Bug #2057965 reported by Catherine Redfield
16
This bug affects 2 people
Affects Status Importance Assigned to Milestone
google-guest-agent (Ubuntu)
Confirmed
Undecided
Unassigned
Xenial
New
Undecided
Unassigned
Bionic
New
Undecided
Unassigned
Focal
New
Undecided
Unassigned
Jammy
New
Undecided
Unassigned
Mantic
New
Undecided
Unassigned
Noble
Confirmed
Undecided
Unassigned

Bug Description

[ Impact ]

In certain situations (consistently with ubuntu-pro=31.2 and cloud-init=23.4.4), cloud-config.service has not completed before google-startup-scripts.service runs. This can cause startup scripts that rely on apt to fail, as cloud-init is responsible for reconfiguring sources.list to point at the GCE archives.

Since pro and cloud-init are backported to all older releases, this bug will affect them too.

The change that results in this race condition is the removal an ordering condition between pro and cloud-init, so adding `After=cloud-final.service` to google-startup-scripts.service should ensure that the startup scripts are correctly run regardless of the ordering (or lack thereof) between other services.

[ Test Plan ]

To reproduce:

Using startup_script.sh:
#!/bin/bash
cp /etc/apt/sources.list /tmp/startup-sources.list

$ gcloud compute instances create startup-test --image daily-ubuntu-2204-jammy-v20240314 --image-project ubuntu-os-cloud-devel --metadata-from-file=startup-script=startup_script.sh
[...]
$ ssh [INSTANCE IP]
> diff /tmp/startup-sources.list /etc/apt/sources.list
0a1,8
> ## Note, this file is written by cloud-init on first boot of an instance
> ## modifications made here will not survive a re-bundle.
> ## if you wish to make changes you can:
> ## a.) add 'apt_preserve_sources_list: true' to /etc/cloud/cloud.cfg
> ## or do the same in user-data
> ## b.) add sources in /etc/apt/sources.list.d
> ## c.) make changes to template file /etc/cloud/templates/sources.list.tmpl
>
3,4c11,12
< deb http://archive.ubuntu.com/ubuntu/ jammy main restricted
< # deb-src http://archive.ubuntu.com/ubuntu/ jammy main restricted
---
> deb http://us-central1.gce.archive.ubuntu.com/ubuntu/ jammy main restricted
> # deb-src http://us-central1.gce.archive.ubuntu.com/ubuntu/ jammy main restricted
8,9c16,17
< deb http://archive.ubuntu.com/ubuntu/ jammy-updates main restricted
< # deb-src http://archive.ubuntu.com/ubuntu/ jammy-updates main restricted
---
[...]

Since this bug particularly effects first boot (once sources.list is configured with the GCE mirrors on first boot it will remain correctly configured), the best way to test that fix is correctly created will be to create an image with pro pinned at 31.2, cloud-init pinned at 23.4.4, and google-guest-agent install from proposed. The test would be:

1. Create an instance with startup script as above
$ gcloud compute instances create startup-test --image [IMAGE_NAME] --image-project [IMAGE PROJECT] --metadata-from-file=startup-script=startup_script.sh

2. SSH into the instance and verify pro/cloud-init/google-guest-agent versions/source
> pro --version
32.1~[RELEASE]
> cloud-init --version
/usr/bin/cloud-init 23.4.4-0ubuntu0~[RELEASE]
> apt-cache policy google-guest-agent
[ensure from -proposed]

3. Verify startup script ran correctly after cloud-config.service.
> diff /tmp/startup-sources.list /etc/apt/sources.list
>

[ Where problems could occur ]

Since this introduces a new ordering constraint, it will likely have performance impacts (google-startup-scripts will run later). This seems preferable to breaking a subset of startup scripts in some situations; it is not uncommon to use startup scripts to install packages so it's important for the mirrors to be correctly configured.

[ Other Info ]

Original bug report retained below.

New GCP dailies are failing startup-script tests, due to configuration via cloud-init not being fully completed, apt sources for example, when startup scripts are run. The failure can be reproduced as follows:

Using startup_script.sh:
#!/bin/bash
cp /etc/apt/sources.list /tmp/startup-sources.list

$ gcloud compute instances create startup-test --image daily-ubuntu-2204-jammy-v20240314 --image-project ubuntu-os-cloud-devel --metadata-from-file=startup-script=startup_script.sh
[...]
$ ssh [INSTANCE IP]
> diff /tmp/startup-sources.list /etc/apt/sources.list
0a1,8
> ## Note, this file is written by cloud-init on first boot of an instance
> ## modifications made here will not survive a re-bundle.
> ## if you wish to make changes you can:
> ## a.) add 'apt_preserve_sources_list: true' to /etc/cloud/cloud.cfg
> ## or do the same in user-data
> ## b.) add sources in /etc/apt/sources.list.d
> ## c.) make changes to template file /etc/cloud/templates/sources.list.tmpl
>
3,4c11,12
< deb http://archive.ubuntu.com/ubuntu/ jammy main restricted
< # deb-src http://archive.ubuntu.com/ubuntu/ jammy main restricted
---
> deb http://us-central1.gce.archive.ubuntu.com/ubuntu/ jammy main restricted
> # deb-src http://us-central1.gce.archive.ubuntu.com/ubuntu/ jammy main restricted
8,9c16,17
< deb http://archive.ubuntu.com/ubuntu/ jammy-updates main restricted
< # deb-src http://archive.ubuntu.com/ubuntu/ jammy-updates main restricted
---
[...]

On earlier images (such as ubuntu-2204-jammy-v20240307 in ubuntu-os-cloud) do not show this behaviour. The change is due to a change in ubuntu-pro 31 (see https://github.com/canonical/ubuntu-pro-client/blob/dfe1f1ed4678c50240d4e251f41d33bb4034135e/debian/changelog#L40 for details) that removes a systemd ordering on cloud-config.service. As side effect of this change was the removal of cloud-config.service (and ubuntu-advantage.service) from systemd's critical chain.

On v20240307 (startup scripts execute correctly):
catred@startup-test-control:~$ systemd-analyze critical-chain google-startup-scripts.service
The time when unit became active or started is printed after the "@" character.
The time the unit took to start is printed after the "+" character.

google-startup-scripts.service +18.262s
└─multi-user.target @28.480s
  └─ubuntu-advantage.service @28.480s
    └─cloud-config.service @27.372s +1.095s
      └─snapd.seeded.service @20.048s +7.312s
        └─snapd.service @12.469s +7.555s
          └─basic.target @11.558s
            └─sockets.target @11.540s
              └─snap.lxd.daemon.unix.socket @24.376s
                └─sysinit.target @10.825s
                  └─cloud-init.service @8.432s +2.267s
                    └─systemd-networkd-wait-online.service @6.467s +1.935s
                      └─systemd-networkd.service @6.347s +112ms
                        └─network-pre.target @6.328s
                          └─cloud-init-local.service @4.309s +2.006s
                            └─systemd-remount-fs.service @1.829s +68ms
                              └─systemd-fsck-root.service @1.587s +160ms
                                └─systemd-journald.socket @1.292s
                                  └─system.slice @1.068s
                                    └─-.slice @1.068s

On v20240314 (startup scripts fail):
catred@startup-test:~$ systemd-analyze critical-chain google-startup-scripts.service
The time when unit became active or started is printed after the "@" characte>
The time the unit took to start is printed after the "+" character.

google-startup-scripts.service +260ms
└─multi-user.target @29.237s
  └─chrony.service @30.240s +56ms
    └─basic.target @13.364s
      └─sockets.target @13.225s
        └─snap.lxd.user-daemon.unix.socket @26.765s
          └─sysinit.target @12.550s
            └─cloud-init.service @7.933s +4.503s
              └─systemd-networkd-wait-online.service @6.741s +1.171s
                └─systemd-networkd.service @6.593s +124ms
                  └─network-pre.target @6.573s
                    └─cloud-init-local.service @4.478s +2.083s
                      └─systemd-remount-fs.service @1.717s +64ms
                        └─systemd-fsck-root.service @1.510s +95ms
                          └─systemd-journald.socket @1.193s
                            └─-.mount @974ms
                              └─-.slice @974ms

This can be fixed by adding an explict `After=cloud-config.service` to the google-startup-scripts.service file, which enforces the correct ordering between google-startup-scripts and cloud-init.

Tags: patch
Revision history for this message
Launchpad Janitor (janitor) wrote :

Status changed to 'Confirmed' because the bug affects multiple users.

Changed in google-guest-agent (Ubuntu):
status: New → Confirmed
Revision history for this message
Brett Holman (holmanb) wrote :

> due to network not being fully set up when startup scripts are run.

google-startup-scripts.service starts well after the network is online, so I think that the actual reason is that cc_apt_configure hasn't ran yet when the startup script starts. The diff looks like some apt mirrors configured by cloud-init, possibly via GCE's vendordata?

> This can be fixed by adding an explict `After=cloud-config.service` to the google-startup-scripts.service file, which enforces the correct ordering between google-startup-scripts and cloud-init.

+1 agreed, that should fix it. However if ssh login needs to be possible (which I do not know) when the script runs, then you may want to order after systemd-user-sessions.service instead (see this bug[1] for context).

That all said, the second critical chain looks unusual - what took ~16.6s between sysinit.target and multi-user.target?

[1] https://bugs.launchpad.net/uvtool/+bug/2039441

Revision history for this message
Philip Roche (philroche) wrote :

@holmanb

> google-startup-scripts.service starts well after the network is online

You are correct. I have updated the bug description now.

> due to configuration via cloud-init not being fully completed, apt sources for example, when startup scripts are run

description: updated
Revision history for this message
Catherine Redfield (catred) wrote :

We are seeing similar failures across older images (mantic and earlier) but not in noble, perhaps because noble is using cloud-init 24.1 (which runs faster due to not waiting on snap seeding). However, even in noble, we don't see cloud-config.service in the critical chain:

catred@startup-test-noble:~$ systemd-analyze critical-chain
The time when unit became active or started is printed after the "@" characte>
The time the unit took to start is printed after the "+" character.

graphical.target @22.713s
└─multi-user.target @22.713s
  └─snapd.seeded.service @19.697s +3.013s
    └─snapd.service @12.837s +6.842s
      └─basic.target @12.232s
        └─sockets.target @12.213s
          └─snapd.socket @12.098s +94ms
            └─sysinit.target @11.916s
              └─cloud-init.service @8.520s +2.561s
                └─systemd-networkd-wait-online.service @8.464s +20ms
                  └─systemd-networkd.service @8.332s +82ms
                    └─network-pre.target @8.300s
                      └─cloud-init-local.service @4.820s +3.460s
                        └─systemd-remount-fs.service @1.780s +92ms
                          └─systemd-fsck-root.service @1.391s +245ms
                            └─systemd-journald.socket @1.056s
                              └─-.mount @763ms
                                └─-.slice @762ms

Revision history for this message
Catherine Redfield (catred) wrote :

I believe the attached patch should fix this issue.

description: updated
Revision history for this message
Ubuntu Foundations Team Bug Bot (crichton) wrote :

The attachment "0006-order-startup-scripts-after-cloud-final.patch" seems to be a patch. If it isn't, please remove the "patch" flag from the attachment, remove the "patch" tag, and if you are a member of the ~ubuntu-reviewers, unsubscribe the team.

[This is an automated message performed by a Launchpad user owned by ~brian-murray, for any issues please contact him.]

tags: added: patch
description: updated
Revision history for this message
Alberto Contreras (aciba) wrote :

Why the patch in https://bugs.launchpad.net/ubuntu/+source/google-guest-agent/+bug/2057965/comments/5 and Impact description order gga after cloud-final.service but the merged (temporal) cpc_packaging.extra MRs order gga after cloud-config.service?

Revision history for this message
Catherine Redfield (catred) wrote :

> Why the patch in https://bugs.launchpad.net/ubuntu/+source/google-guest-agent/+bug/2057965/comments/5 and Impact description order gga after cloud-final.service but the merged (temporal) cpc_packaging.extra MRs order gga after cloud-config.service?

cloud-final.service runs after cloud-config.service so either will work. I proposed cloud-final.service in my patch since prior to `debian/patches/0005-make-service-directive-explicit.patch` being added, that was the ordering constraint in place. The 0005 patch removes it due to redundancy, which has since clearly been changed but I thought it was cleaner to revert to the original ordering. Either ordering (cloud-config and cloud-final) appear to solve the issue in testing.

To post a comment you must log in.
This report contains Public information  
Everyone can see this information.

Other bug subscribers

Remote bug watches

Bug watches keep track of this bug in other bug trackers.