status symlink non-atomicity traceback with status --wait
Affects | Status | Importance | Assigned to | Milestone |
---|---|---|---|---|
cloud-init | Fix Released | Medium | Adam Collard | |
Bug Description
MAAS runs system tests on a regular basis, using LXD containers to set things up.
A recent run failed waiting for the LXD container to finish booting, with a traceback from cloud-init.
After launching the container, the script runs `timeout 2000 cloud-init status --wait --long` to ensure that the commands we pass through as user-data are complete.
Here's a redacted snippet from the logs:
2022-02-23 17:21:00 INFO : Waiting for boot to finish...
2022-02-23 17:21:00 INFO timeout 2000 cloud-init status --wait --long
2022-02-23 17:21:04 INFO ..........
2022-02-23 17:21:04 INFO Traceback (most recent call last):
2022-02-23 17:21:04 INFO File "/usr/bin/
2022-02-23 17:21:04 INFO load_entry_
2022-02-23 17:21:04 INFO File "/usr/lib/
2022-02-23 17:21:04 INFO retval = util.log_time(
2022-02-23 17:21:04 INFO File "/usr/lib/
2022-02-23 17:21:04 INFO ret = func(*args, **kwargs)
2022-02-23 17:21:04 INFO File "/usr/lib/
2022-02-23 17:21:04 INFO status, status_detail, time = _get_status_
2022-02-23 17:21:04 INFO File "/usr/lib/
2022-02-23 17:21:04 INFO status_v1 = load_json(
2022-02-23 17:21:04 INFO File "/usr/lib/
2022-02-23 17:21:04 INFO with open(fname, 'rb') as ifh:
2022-02-23 17:21:04 INFO FileNotFoundError: [Errno 2] No such file or directory: '/run/cloud-
You can see from the `..........` progress dots that there were a few successful attempts to wait before it failed.
Changed in cloud-init: status: Fix Committed → Fix Released
Reading the code (https://github.com/canonical/cloud-init/blob/main/cloudinit/cmd/status.py#L141-L144), the status command tries to guard against the file not existing, but clearly this run slipped through: the file existed when we called os.path.exists() on it, but not when we called open() on it :|
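To illustrate the check-then-open gap described above, here is a minimal sketch of a reader that skips the existence check and handles the missing file at open() time instead. `load_status_or_none` is a hypothetical helper written for this illustration, not a cloud-init function, and it is not necessarily how the bug was fixed.

```python
import json
import os
import tempfile


def load_status_or_none(path):
    """Return the parsed JSON at path, or None if the file is missing.

    Hypothetical helper (not cloud-init's API). Opening the file directly
    and catching FileNotFoundError leaves no gap between an existence
    check and the open() call.
    """
    try:
        with open(path, "rb") as ifh:
            return json.loads(ifh.read().decode("utf-8"))
    except FileNotFoundError:
        return None


# Usage: a missing file yields None instead of a traceback.
demo_dir = tempfile.mkdtemp()
demo_path = os.path.join(demo_dir, "status.json")
missing = load_status_or_none(demo_path)

with open(demo_path, "w") as ofh:
    ofh.write('{"v1": {"stage": null}}')
present = load_status_or_none(demo_path)
```

A caller that polls, as `--wait` does, could treat None as "not ready yet" and retry on the next iteration.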
Looking deeper, we can see that cmd.main.status_wrapper atomically writes JSON (yay!) to the 'data_d' (/var/lib/cloud/data), then symlinks that status file in the 'link_d' (/run/cloud-init). All seems reasonable, but let's look at how it does that symlinking:
https://github.com/canonical/cloud-init/blob/main/cloudinit/cmd/main.py#L755-L758
Note the `force=True`, and refer to the implementation of sym_link:
https://github.com/canonical/cloud-init/blob/2837b835f101d81704f018a4f872b1d660eb6f3e/cloudinit/util.py#L1887-L1891
... which deletes the symlink, then re-creates it.
This is non-atomic: it is entirely possible for a reader (such as `--wait`) to see the entry in the filesystem, for the deletion to occur, and only then for the reader to try to open it.
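One common way to avoid the delete-then-create window is to create the new symlink under a temporary name and rename it over the old one, since rename replaces the destination atomically on POSIX filesystems. The sketch below assumes that approach. `atomic_symlink` is a hypothetical name used for illustration, and this is not necessarily the change that landed in cloud-init.

```python
import os
import tempfile


def atomic_symlink(target, link_path):
    """Point link_path at target with no moment where link_path is absent.

    Hypothetical helper (not cloud-init's sym_link). The new symlink is
    created under a temporary name on the same filesystem, then renamed
    over link_path.
    """
    link_dir = os.path.dirname(os.path.abspath(link_path))
    # A private temporary directory guarantees the temporary name is unused.
    tmp_dir = tempfile.mkdtemp(dir=link_dir)
    tmp_link = os.path.join(tmp_dir, "link")
    try:
        os.symlink(target, tmp_link)
        # rename(2) acts on the symlink itself and replaces link_path atomically.
        os.replace(tmp_link, link_path)
    finally:
        # Clean up if the rename failed and the temporary symlink is still there.
        if os.path.lexists(tmp_link):
            os.unlink(tmp_link)
        os.rmdir(tmp_dir)


# Usage: repoint an existing link from one status file to another.
demo_dir = tempfile.mkdtemp()
first = os.path.join(demo_dir, "a.json")
second = os.path.join(demo_dir, "b.json")
for path, body in ((first, "a"), (second, "b")):
    with open(path, "w") as ofh:
        ofh.write(body)

link = os.path.join(demo_dir, "status.json")
atomic_symlink(first, link)
atomic_symlink(second, link)
```

With this pattern a concurrent reader sees either the old link or the new one, never a missing entry, provided the temporary name and the final link live on the same filesystem.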