worker auto restarting should be handled by systemd unit, not by cronjob
Affects | Status | Importance | Assigned to | Milestone | |
---|---|---|---|---|---|
Auto Package Testing | Fix Released | Medium | Łukasz Zemczak | | |
Bug Description
autopkgtest worker units are automatically reconfigured daily: a cronjob pkills any running worker processes and then runs various maintenance jobs, the last of which starts any stopped systemd units.
For any unit that didn't stop right away because its test is still running, the unit start at the end of the maintenance window has no effect, since the unit is still active at that point. But because the process has already received SIGTERM, it exits once it finishes the current test, leaving the unit stopped.
It then remains stopped for up to 6 hours (depending on how long it took to finish), until the next cronjob restarts it.
We should fix the systemd units to auto-respawn when stopped by SIGTERM, and adjust the maintenance cronjob to kill the runners at the end of the job, not at the beginning.
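The auto-respawn part of this proposal could be sketched as a systemd drop-in like the one below. This is a minimal sketch, not the change that was actually merged: the unit name `autopkgtest-worker@.service`, the drop-in path, and the `RestartSec` value are assumptions made for illustration. `Restart=always` is a standard systemd option that restarts the service whenever its process exits, including a clean exit after an external SIGTERM.

```ini
# Hypothetical drop-in file (unit name and path are assumptions):
#   /etc/systemd/system/autopkgtest-worker@.service.d/restart.conf

[Service]
# Restart the worker whenever its process exits, whether it exited
# cleanly, with an error, or because of a signal such as SIGTERM.
Restart=always

# Wait a few seconds before restarting (value chosen arbitrarily here).
RestartSec=5
```

With this in place, a worker that exits after finishing its test would be started again by systemd rather than waiting for the next cronjob. systemd does not restart a unit that was stopped through systemd itself (for example with `systemctl stop`), so deliberate stops still work. A `systemctl daemon-reload` is needed before the drop-in takes effect.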
Related branches
- Iain Lane: Approve
- Diff: 13 lines (+1/-1), 1 file modified: deployment/charms/xenial/autopkgtest-cloud-worker/hooks/start (+1/-1)
Changed in auto-package-testing:
importance: Undecided → Medium
Changed in auto-package-testing:
assignee: nobody → Łukasz Zemczak (sil2100)
status: Triaged → In Progress
I'm not sure why we pkill in this job at all; do you know?
If there were code changes in the worker, the pkill would cause them to be picked up, but ops know to SIGHUP when upgrading anyway.
So I suggest that we stop pkilling and just have a 'build new images' daily job in addition to the regular recovery one.
...?