worker auto restarting should be handled by systemd unit, not by cronjob

Bug #1735878 reported by Steve Langasek
6
This bug affects 1 person
Affects Status Importance Assigned to Milestone
Auto Package Testing
Fix Released
Medium
Łukasz Zemczak

Bug Description

autopkgtest worker units are automatically reconfigured on a daily basis by pkilling any running worker processes and then running various maintenance jobs the last of which, at the end, starts any stopped systemd units.

For any units that didn't actually stop right away because the associated test is still running, the unit start at the end of the maintenance window has no effect. But because the process was killed with SIGTERM, once it finishes the current test it exits, leaving the job stopped.

It then remains stopped for up to 6 hours (depending on how long it took to finish), until the next cronjob restarts it.

We should fix the systemd units to auto-respawn when stopped by SIGTERM, and adjust the maintenance cronjob to kill the runners at the end of the job, not at the beginning.

Related branches

Steve Langasek (vorlon)
Changed in auto-package-testing:
importance: Undecided → Medium
Revision history for this message
Iain Lane (laney) wrote :

I'm not sure why we pkill in this job at all; do you know?

If there were code changes in the worker it'd cause those to be picked up, but ops know to SIGHUP when upgrading anyway.

So I suggest that we stop pkilling and just have a 'build new images' daily job in addition to the regular recovery one.

...?

Changed in auto-package-testing:
status: New → Triaged
Changed in auto-package-testing:
assignee: nobody → Łukasz Zemczak (sil2100)
status: Triaged → In Progress
Revision history for this message
Łukasz Zemczak (sil2100) wrote :

I think Laney is right. I can't find anything in the code that might warrant a SIGTERM. Not sure if it was actually meant for picking up code changes, as that could be done with a periodical SIGHUP instead. If it was to make sure that during the maintenance window we do not start any new worker runs with the old images, I don't think the added complexity has enough merits. We are re-building images once-per-day, so having a few tests running older images for this small window longer shouldn't be a big deal.

I have poked Martin to confirm if there are any other reasons for the pkill.

If our assumptions are correct then I will be removing the pkill as a fix for this bug.

Revision history for this message
Steve Langasek (vorlon) wrote :

I certainly don't know anything different about why the pkill is there.

I don't think stopping runners currently running against old images is the point. Indeed, if that was the point, then we're not killing nearly hard enough.

When you speak of a periodic SIGHUP, that to me seems to imply still using pkill but with a different signal. That doesn't thrill me, I think that the lifecycle should be managed via the systemd units. But it's possible that both approaches have the same problem, that signalling the runner gives it time to shut down cleanly and finish its current work, and that as a result systemd gives up on restarting the job automatically by the time the shutdown is finished?

Also one gap in the current runner lifecycle is that we never trigger a reload of the actual unit config by systemd ('systemctl daemon-reload'). While the unit contents should change infrequently, it's a gap that this is not automated; ideally we would fix this while touching this area of the code.

Revision history for this message
Łukasz Zemczak (sil2100) wrote :

Me and Laney personally don't think systemd unit reloading should be happening automatically, similarly to refreshing the workers. Whenever there's a code change to the worker or the systemd units or actually any part of the autopkgtest-cloud code-base, it's currently anyway a manual process. Someone needs to pull in the changes to the instance, so the operator is responsible for performing proper refresh of the bits that he/she expects to change. It doesn't make much sense automating this part if everything else is a manual process anyway.

From the talk I had with Martin about the pkill (if I understand him correctly), it seems that it was originally put in place so that images were not removed from under working workers. Right now this shouldn't be possible since we always keep 2 images around with inactive workers always picking up the newest (unless an autopkgtest is running over 24h, in which case it's completely busted anyway).

I will proceed with the pkill removal to resolve the bug in mention.

Without the pkill I don't think we should change the unit behavior. As per design a SIGTERM of the worker means we should not restart the unit and stay dead until re-run, with the idea that it has been killed purposely and should normally not start instantly without investigation. SIGHUP is what triggers an auto-respawn of the unit (after 5 minutes). I don't think we should change this design without any explicit reason.

Revision history for this message
Łukasz Zemczak (sil2100) wrote :

Done. If we decide we do want to change this behavior, I'd propose opening a new bug for that.

Changed in auto-package-testing:
status: In Progress → Fix Released
To post a comment you must log in.
This report contains Public information  
Everyone can see this information.

Other bug subscribers

Remote bug watches

Bug watches keep track of this bug in other bug trackers.