snapd.seeded.service blocks for a long time preventing cloud-init from completing

Bug #1834190 reported by Jeremy Collin on 2019-06-25
This bug affects 2 people
Affects: cloud-init (Importance: Undecided, Assigned to: Unassigned)
Affects: snapd (Importance: High, Assigned to: Paweł Stołowski)

Bug Description

We use qcow2 images to install bare-metal servers; our datasource is a small ConfigDrive partition containing meta_data.json and vendor_data.json, which is necessary for us.
The vendor_data.json provides information about default user password settings.

This works well for every Linux distribution we use, except when cloud-init is 19.1.

The module stages no longer run:

cat /run/cloud-init/status.json
{
 "v1": {
  "datasource": "DataSourceConfigDrive [net,ver=2][source=/dev/sda4]",
  "init": {
   "errors": [],
   "finished": 1561465042.7556307,
   "start": 1561465042.1868145
  },
  "init-local": {
   "errors": [],
   "finished": 1561465035.4210315,
   "start": 1561465034.724825
  },
  "modules-config": {
   "errors": [],
   "finished": null,
   "start": null
  },
  "modules-final": {
   "errors": [],
   "finished": null,
   "start": null
  },
  "modules-init": {
   "errors": [],
   "finished": null,
   "start": null
  },
  "stage": null
 }
}

and in the /var/lib/cloud/instance/sem directory we can see that the modules have no semaphore files.
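(Editor's note: a quick way to check which modules actually ran is to look for their semaphore files. A minimal sketch, simulated in a temp directory so it runs anywhere; on an affected host SEM would be /var/lib/cloud/instance/sem, and the module names follow the ones in the logs in this report:)

```shell
# Stand-in for /var/lib/cloud/instance/sem on a real host.
SEM=$(mktemp -d)

# Pretend only the init-network modules ran, as in this bug report:
touch "$SEM/config_seed_random" "$SEM/consume_data"

# Check a module that ran and one that never did:
for mod in config_seed_random config_set_passwords; do
    if [ -e "$SEM/$mod" ]; then
        echo "$mod: semaphore present (module ran)"
    else
        echo "$mod: no semaphore (module never ran)"
    fi
done
```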

Ryan Harper (raharper) wrote :

Thanks for filing a bug.

Could you run 'cloud-init collect-logs' and attach the tarball it creates?

Changed in cloud-init:
status: New → Incomplete
Jeremy Collin (jcollin) wrote :

It will not be on the same server, but the problem is the same.

Jeremy Collin (jcollin) wrote :
Ryan Harper (raharper) wrote :

Thanks for the logs. Could you be more specific about which files you mention are present in which path?

/var/lib/cloud/instance is a symlink that points to the current instance-id under /var/lib/cloud/instances. And in your case the log shows this:

2019-06-27 07:20:18,617 - util.py[DEBUG]: Creating symbolic link from '/var/lib/cloud/instance' => '/var/lib/cloud/instances/93471835-2d79-4c91-bed6-4a733a60a185'

And then cloud-init logs all of the semaphores it writes:

% grep instances cloud-init.log
2019-06-27 07:20:18,617 - util.py[DEBUG]: Creating symbolic link from '/var/lib/cloud/instance' => '/var/lib/cloud/instances/93471835-2d79-4c91-bed6-4a733a60a185'
2019-06-27 07:20:18,618 - util.py[DEBUG]: Reading from /var/lib/cloud/instances/93471835-2d79-4c91-bed6-4a733a60a185/datasource (quiet=False)
2019-06-27 07:20:18,618 - util.py[DEBUG]: Writing to /var/lib/cloud/instances/93471835-2d79-4c91-bed6-4a733a60a185/datasource - wb: [644] 75 bytes
2019-06-27 07:20:24,975 - util.py[DEBUG]: Creating symbolic link from '/var/lib/cloud/instance' => '/var/lib/cloud/instances/93471835-2d79-4c91-bed6-4a733a60a185'
2019-06-27 07:20:24,975 - util.py[DEBUG]: Reading from /var/lib/cloud/instances/93471835-2d79-4c91-bed6-4a733a60a185/datasource (quiet=False)
2019-06-27 07:20:24,976 - util.py[DEBUG]: Read 75 bytes from /var/lib/cloud/instances/93471835-2d79-4c91-bed6-4a733a60a185/datasource
2019-06-27 07:20:24,976 - util.py[DEBUG]: Writing to /var/lib/cloud/instances/93471835-2d79-4c91-bed6-4a733a60a185/datasource - wb: [644] 75 bytes
2019-06-27 07:20:25,004 - util.py[DEBUG]: Writing to /var/lib/cloud/instances/93471835-2d79-4c91-bed6-4a733a60a185/user-data.txt - wb: [600] 0 bytes
2019-06-27 07:20:25,021 - util.py[DEBUG]: Writing to /var/lib/cloud/instances/93471835-2d79-4c91-bed6-4a733a60a185/user-data.txt.i - wb: [600] 308 bytes
2019-06-27 07:20:25,022 - util.py[DEBUG]: Writing to /var/lib/cloud/instances/93471835-2d79-4c91-bed6-4a733a60a185/vendor-data.txt - wb: [600] 113 bytes
2019-06-27 07:20:25,024 - util.py[DEBUG]: Writing to /var/lib/cloud/instances/93471835-2d79-4c91-bed6-4a733a60a185/vendor-data.txt.i - wb: [600] 418 bytes
2019-06-27 07:20:25,025 - util.py[DEBUG]: Writing to /var/lib/cloud/instances/93471835-2d79-4c91-bed6-4a733a60a185/sem/consume_data - wb: [644] 24 bytes
2019-06-27 07:20:25,025 - helpers.py[DEBUG]: Running consume_data using lock (<FileLock using file '/var/lib/cloud/instances/93471835-2d79-4c91-bed6-4a733a60a185/sem/consume_data'>)
2019-06-27 07:20:25,027 - util.py[DEBUG]: Writing to /var/lib/cloud/instances/93471835-2d79-4c91-bed6-4a733a60a185/cloud-config.txt - wb: [600] 0 bytes
2019-06-27 07:20:25,030 - util.py[DEBUG]: Writing to /var/lib/cloud/instances/93471835-2d79-4c91-bed6-4a733a60a185/vendor-cloud-config.txt - wb: [600] 159 bytes
2019-06-27 07:20:25,105 - util.py[DEBUG]: Writing to /var/lib/cloud/instances/93471835-2d79-4c91-bed6-4a733a60a185/sem/config_seed_random - wb: [644] 24 bytes
2019-06-27 07:20:25,105 - helpers.py[DEBUG]: Running config-seed_random using lock (<FileLock using file '/var/lib/cloud/instances/93471835-2d79-4c91-bed6-4a733a60a185/sem/config_seed_random'>)
2019-06-27 07:20:25,106 - util.py[DEBUG]...


Jeremy Collin (jcollin) wrote :

We are using the official images (https://cloud-images.ubuntu.com/disco/current/disco-server-cloudimg-amd64.img, and https://cloud-images.ubuntu.com/cosmic/current/cosmic-server-cloudimg-amd64.img)

For the process we use:
1/ we create a configDrive
2/ we rsync the image onto the remaining disk
3/ we reboot the server

At this point we take no other actions. This process worked well until cloud-init was upgraded to 19.1.

About the files I mention: they are the semaphore files, which show that actions have been completed.

In the vendor_data.json we have the cc_set_passwords settings, but the module is not run; this is the main problem.
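(Editor's note: a hypothetical sketch of what such a vendor_data.json could look like, assuming the OpenStack-style convention where a "cloud-init" key carries a cloud-config payload; the user name and password below are placeholders, not values from this report:)

```json
{
  "cloud-init": "#cloud-config\nchpasswd:\n  expire: false\n  list: |\n    ubuntu:changeme"
}
```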

If you need any other information, just ask.

Chad Smith (chad.smith) wrote :

Hi Jeremy,
thanks for filing the bug and for making cloud-init better. Taking your cloud-init.log and running cloud-init analyze show on it, it seems your environment isn't running the modules-config or modules-final stages, just the init-local and init-network stages, which is why status.json has no timestamps for those boot stages. I'm wondering if something in the environment has disabled or avoided running those stages (like a broken systemd dependency chain).

Starting stage: init-local
|`->no cache found @00.00400s +00.00000s
|`->found local data from DataSourceConfigDrive @00.01400s +00.38300s
Finished stage: (init-local) 00.82400 seconds

Starting stage: init-network
|`->restored from cache with run check: DataSourceConfigDrive [net,ver=2][source=/dev/sda4] @06.75200s +00.00400s
|`->setting up datasource @06.80200s +00.00000s
|`->reading and applying user-data @06.82300s +00.00200s
|`->reading and applying vendor-data @06.82500s +00.00400s
|`->activating datasource @06.84900s +00.00100s
|`->config-migrator ran successfully @06.90200s +00.00000s
|`->config-seed_random ran successfully @06.90300s +00.00000s
|`->config-bootcmd ran successfully @06.90400s +00.00000s
|`->config-write-files ran successfully @06.90400s +00.00100s
|`->config-growpart ran successfully @06.90500s +00.01900s
|`->config-resizefs ran successfully @06.92400s +00.37100s
|`->config-disk_setup ran successfully @07.29500s +00.00100s
|`->config-mounts ran successfully @07.29600s +00.14700s
|`->config-set_hostname ran successfully @07.44300s +00.00100s
|`->config-update_hostname ran successfully @07.44500s +00.00000s
|`->config-update_etc_hosts ran successfully @07.44600s +00.00000s
|`->config-ca-certs ran successfully @07.44600s +00.00100s
|`->config-rsyslog ran successfully @07.44700s +00.00000s
|`->config-users-groups ran successfully @07.44800s +00.88100s
|`->config-ssh ran successfully @08.32900s +00.30600s
Finished stage: (init-network) 01.90700 seconds

Total Time: 2.73100 seconds

-- Boot Record 02 --
The total time elapsed since completing an event is printed after the "@" character.
The time the event takes is printed after the "+" character.

Starting stage: init-local
|`->cache invalid in datasource: DataSourceConfigDrive [net,ver=2][source=/dev/sda4] @00.02200s +00.05000s
|`->found local data from DataSourceConfigDrive @00.07400s +00.48400s
Finished stage: (init-local) 00.70800 seconds

Starting stage: init-network
|`->restored from cache with run check: DataSourceConfigDrive [net,ver=2][source=/dev/sda4] @06.43400s +00.00300s
|`->setting up datasource @06.48600s +00.00000s
|`->reading and applying user-data @06.51700s +00.01000s
|`->reading and applying vendor-data @06.52700s +00.00400s
|`->activating datasource @06.55100s +00.00100s
|`->config-migrator ran successfully @06.59200s +00.00100s
|`->config-seed_random previously ran @06.59300s +00.00000s
|`->config-bootcmd ran successfully @06.59300s +00.00100s
|`->config-write-files previously ran @06.59400s +00.00000s
|`->config-growpart ran successfully @06.59400s +00.01900s
|`->config-resizefs ran successfully @06.61400s +00.28000s
|`->config-disk_setup previously ran @06.89500s +00.00000s
|`->config-m...


Ryan Harper (raharper) wrote :

@Jeremy

2019-06-27 07:20:18,586 - openstack.py[DEBUG]: Failed reading optional path /run/cloud-init/tmp/tmp9gnk8kxy/openstack/latest/network_data.json due to: [Errno 2] No such file or directory: '/run/cloud-init/tmp/tmp9gnk8kxy/openstack/latest/network_data.json'

That looks like what you're seeing.

Could you dump/attach a sanitized version of your ConfigDrive so we can reproduce?
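(Editor's note: for anyone reproducing this, a minimal sketch of the ConfigDrive tree cloud-init expects. The layout follows the OpenStack config-drive convention; the uuid reuses the instance-id from the logs above, and packing the tree into an ISO labelled config-2 with genisoimage/mkisofs is left out so the sketch runs anywhere:)

```shell
# Stand-in for the config-drive filesystem root.
DRIVE=$(mktemp -d)
mkdir -p "$DRIVE/openstack/latest"

# Minimal meta_data.json; vendor_data.json would carry the cloud-config payload.
echo '{"uuid": "93471835-2d79-4c91-bed6-4a733a60a185"}' \
    > "$DRIVE/openstack/latest/meta_data.json"
echo '{}' > "$DRIVE/openstack/latest/vendor_data.json"

ls "$DRIVE/openstack/latest"
```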

Jeremy Collin (jcollin) wrote :

Oh, is the network_data.json mandatory now?

Jeremy Collin (jcollin) wrote :

I just tried an empty network_data.json and an empty user_data in the ConfigDrive; the result is still the same. I'll attach an ISO file of the config drive.

Jeremy Collin (jcollin) wrote :
Ryan Harper (raharper) on 2019-07-01
Changed in cloud-init:
status: Incomplete → Triaged
Ryan Harper (raharper) wrote :

I cannot recreate the failure with your config drive and the current disco or cosmic image; however, it does allow you to log in before cloud-init is complete. You can watch/wait for cloud-init to complete with 'cloud-init status --long --wait', and 'systemctl list-jobs' will show you any pending units that are not yet complete. To see the extra time it takes, look at this:

% systemd-analyze critical-chain cloud-final.service
cloud-final.service +481ms
└─cloud-config.service @1min 10.653s +5.742s
  └─snapd.seeded.service @23.132s +47.511s
    └─snapd.service @54.339s +3.108s
      └─snap-lxd-10972.mount @58.097s +28ms
        └─dev-loop1.device @58.116s +47ms

47 seconds for snapd.seeded.service to install the two snaps. And cloud-final.service will wait until snapd.seeded.service is complete, as cloud-init defers package and snap installs until after snapd is up.
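(Editor's note: the ordering Ryan describes comes from systemd unit dependencies. The unit fragment below is an assumption of roughly what Ubuntu's cloud-config.service declares; on a real system 'systemctl cat cloud-config.service' shows the authoritative version. It is written to a temp file here so the sketch runs anywhere:)

```shell
# Assumed unit fragment: cloud-config orders itself after snapd seeding,
# which is why a stalled snapd.seeded.service blocks the later stages.
UNIT=$(mktemp)
cat > "$UNIT" <<'EOF'
[Unit]
After=snapd.seeded.service
EOF

grep '^After=' "$UNIT"
```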

On your failing instance log, I see a snapd message about an error installing the core snap. Could you run:

snap list
snap changes

And for any errors in the snap changes output, please run:

snap tasks <ID> to get an error message.

Changed in cloud-init:
status: Triaged → Incomplete
Jeremy Collin (jcollin) wrote :

---
root@pwet:~# cloud-init status --long --wait
status: error
time: Tue, 02 Jul 2019 07:35:35 +0000
detail:
'NoneType' object has no attribute 'version'

root@pwet:~# snap list
No snaps are installed yet. Try 'snap install hello-world'.

root@pwet:~# snap changes
ID Status Spawn Ready Summary
1 Error today at 07:35 UTC today at 07:35 UTC Initialize system state
2 Done today at 07:35 UTC today at 07:35 UTC Initialize device

root@pwet:~# systemctl list-jobs
JOB UNIT TYPE STATE
107 snapd.seeded.service start running
103 snapd.autoimport.service start waiting
  2 multi-user.target start waiting
105 systemd-update-utmp-runlevel.service start waiting
111 cloud-config.service start waiting
  1 graphical.target start waiting
109 cloud-init.target start waiting
110 cloud-final.service start waiting

8 jobs listed.

root@pwet:~# systemd-analyze critical-chain cloud-final.service
Bootup is not yet finished (org.freedesktop.systemd1.Manager.FinishTimestampMonotonic=0).
Please try again later.
Hint: Use 'systemctl list-jobs' to see active jobs
---

I found something (https://bugs.launchpad.net/ubuntu/+source/snapd/+bug/1779948) and it seems quite similar to the problem I have: snapd.seeded.service does not start properly; I still have to find out why.

I purged snapd and squashfs-tools, reinstalled them, and cloud-init finished its startup process (modules-config and modules-final).

I now have to figure out why.

Thanks

Jeremy Collin (jcollin) wrote :

I've masked snapd before the first boot and everything works fine now.
The problem is not in cloud-init; it's just snapd.seeded.service that is stalling and preventing cloud-init from finishing its work.
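(Editor's note: masking before first boot can be done against the image's filesystem without booting it, since 'systemctl mask' is equivalent to linking the unit to /dev/null under /etc/systemd/system. A minimal sketch; ROOT stands in for the mounted image root, here a temp dir so it runs anywhere, and the exact set of snapd units to mask is an assumption:)

```shell
ROOT=$(mktemp -d)              # stand-in for the mounted qcow2 image root
mkdir -p "$ROOT/etc/systemd/system"

for unit in snapd.service snapd.socket snapd.seeded.service; do
    # Same effect as running 'systemctl mask $unit' inside the image.
    ln -s /dev/null "$ROOT/etc/systemd/system/$unit"
done

ls -l "$ROOT/etc/systemd/system"
```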

Ryan Harper (raharper) wrote :

I've added snapd to this bug so that team can look into what's going on with snapd.seeded.service.

Changed in cloud-init:
status: Incomplete → Invalid
summary: - Modules steps not taken into account since 19.1
+ snapd.seeded.service blocks for a long time preventing cloud-init from
+ completing
Jeremy Collin (jcollin) wrote :

Thanks

Zygmunt Krynicki (zyga) wrote :

This topic was discussed at the Paris engineering sprint and we have work scheduled to understand the slowest parts and optimise them with an additional pass that can be performed on a built image.

I'm assigning this bug to Pawel as he is currently looking into it.

Changed in snapd:
status: New → Confirmed
assignee: nobody → Paweł Stołowski (stolowski)
status: Confirmed → In Progress
importance: Undecided → High
