Running out of `/tmp` space on cloud workers - post-mortem

Bug #2061141 reported by Skia

Affects: Auto Package Testing
Status: Fix Released
Importance: Undecided
Assigned to: Unassigned

Bug Description

During the week from 2024-04-08 to 2024-04-12, we experienced a lot of "No space
left on device" errors in a lot of different jobs. Here is a description of the
various aspects of that issue, which can act as a kind of post-mortem in case we
face a similar situation again.
This is described from my own point of view, with my current understanding. I
don't pretend to understand every part of the problem, but I think I can give a
pretty broad overview.

This was mostly due to a combination of things:
 1. the fallout of the `time_t` transition and the `xz-utils` incident just
    before the Noble beta release led to huge autopkgtest queues that we needed
    to consume as fast as possible.
 2. we thus increased the number of workers running on our cloud units, after IS
    increased our quota.
 3. increasing the number of running jobs without increasing the size of the
    main working directory can obviously lead to disaster.
 4. this was amplified by really bad timing: the queue contained, in parallel,
    a lot of tests for the following packages: libreoffice, systemd, and
    llvm-toolchain-{15,16,17,18}. All those packages require at least 1.5GB for
    their working directory on the cloud worker.
 5. this was amplified by the worker sometimes failing to clean its working
    directory, leaving dangling folders that are only cleaned up after 30 days
    on our units (a quick way to spot those is sketched right after this list).
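
On point 5, here is a minimal sketch to list directories that look like leftover
test runs. It assumes the working directories live directly under `/tmp` and
contain an `out/` subdirectory, as the commands further down suggest; the
one-day threshold is just an example:
```
# List /tmp entries that look like dangling test working directories:
# top-level directories with an out/ subdirectory, untouched for more than a day.
find /tmp -mindepth 1 -maxdepth 1 -type d -mtime +1 | while read -r d; do
    [ -d "$d/out" ] && ls -ld "$d"
done
```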

This combination of things resulted in a whole week of regularly "fixing" the
issue, only to discover 12 hours later that it was still there, then digging
further, taking new actions, and getting more and more depressed as users kept
coming to tell us there were more ENOSPC errors to report.

Now here is the list of actions that were taken to remediate each of those points:

 1. This was mostly down to the wider context, and besides queue cleaning, there
    isn't much to be done.
 2. and 3. This was just a matter of reducing the number of jobs per worker: the
    `n-workers` config was taken from 110 to 90, for a 200GB `/tmp`. The value
    is still under observation (see the quick back-of-envelope calculation after
    this list).
 4. As the Ceph-based `/tmp` folder isn't very fast, making `du` and `rm` very
    slow, precise and very effective cleaning was required. Here are the
    commands I ended up with to remove `libreoffice` directories older than a
    day:
```
# List the working directory of every libreoffice test under /tmp
grep -H '^libreoffice ' /tmp/*/out/testpkg-version | cut -d'/' -f-3 > /tmp/tests
# Remove those directories when older than one day, showing inodes and free space before and after
touch -d '1 day ago' /tmp/1.day.ago
df -i /tmp; df -h /tmp
for p in $(cat /tmp/tests); do [ "$p" -ot /tmp/1.day.ago ] && sudo rm -rf "$p"; done
df -i /tmp; df -h /tmp
```
    This was quite effective at very quickly bringing back a lot of free space
    and inodes.
 5. This was fixed by this MP:
    https://code.launchpad.net/~hyask/autopkgtest-cloud/+git/autopkgtest-cloud/+merge/463993
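
For points 2 and 3, here is a quick back-of-envelope calculation with the
figures above (these are only averages, real usage per job is very uneven):
```
# Rough per-job /tmp budget before and after the n-workers change
TMP_GB=200
echo "with 110 jobs: $(echo "scale=2; $TMP_GB / 110" | bc) GB per job"
echo "with 90 jobs:  $(echo "scale=2; $TMP_GB / 90" | bc) GB per job"
```
With the big packages from point 4 needing at least 1.5GB each, the 1.81GB
average left very little headroom, and every dangling directory from point 5 ate
directly into it.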

All in all, pretty simple solutions, but the main difficulty really was
investigating the multiple causes of the issue and their cascading effects, like
when the worker throws an exception while removing its working directory, fails
to delete it entirely, and leaves the next runs with less and less space.
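
As a generic illustration of that failure mode and of the defensive-cleanup idea
behind the fix (this is only a sketch, not the content of the MP linked in point
5), the removal can be retried and failures logged instead of silently leaving
the whole directory behind:
```
# Hypothetical helper: retry the removal a few times and log a failure
# instead of letting a single error leave the working directory dangling.
cleanup_workdir() {
    dir="$1"
    for attempt in 1 2 3; do
        rm -rf "$dir" && return 0
        sleep 5
    done
    echo "failed to clean $dir; the 30-day cleanup will pick it up" >&2
    return 1
}
```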

Skia (hyask) wrote:

https://code.launchpad.net/~andersson123/autopkgtest-cloud/+git/autopkgtest-cloud/+merge/464343

This MP adds documentation on the process we followed to increase the capacity of the `/tmp` partitions.

Tim Andersson (andersson123) wrote:

Given that we've mostly sorted this issue out, should we mark this as fix released?

Skia (hyask) wrote:

Yes, I've done so.

Changed in auto-package-testing:
status: New → Fix Released