SIGTERM to a busy daemon has no impact
Affects | Status | Importance | Assigned to | Milestone
---|---|---|---|---
OpenStack Object Storage (swift) | Fix Released | High | Hisashi Osanai |
Bug Description
This problem also manifests with object-expirer as detailed here:
http://
Original notes:
During some testing, Paul Luse and I found that swift-init stop on some nodes in a cluster was failing repeatedly. It sends a SIGTERM to all swift processes, but the reconstructor wasn't stopping, and after 15 seconds swift-init timed out. Multiple attempts also failed.
Stracing the reconstructor showed it was chugging away doing operations. Concurrency on the reconstructor may have been set fairly high.
It seems likely it's batched some jobs and is waiting for them to complete before it acknowledges the HUP and eventually shuts down, but it felt like it was on the order of many minutes if not longer before it would get to that point. Perhaps there's some room to be more graceful here, otherwise folks are just going to start using swift-init kill more frequently.
summary:
- SIGHUP to a busy reconstructor has no impact
+ SIGTERM to a busy reconstructor has no impact
description: updated
Changed in swift:
importance: Undecided → Medium
Changed in swift:
status: New → Confirmed
Changed in swift:
importance: Medium → High
status: Confirmed → In Progress
summary:
- SIGTERM to a busy reconstructor has no impact
+ SIGTERM to a busy daemon has no impact
description: updated
Changed in swift:
assignee: nobody → Hisashi Osanai (osanai-hisashi)
I updated this bug s/SIGHUP/SIGTERM/ after confirming that in all cases (reload, stop, restart) swift *daemons* receive signal number 15 (SIGTERM), not signal number 1 (SIGHUP), which is what the proxy and the other servers that support graceful shutdown handle.
I think Caleb was seeing a busy reconstructor not shutting down after receiving SIGTERM - which is just as bad.
In dev the reconstructor shuts down responsively to `swift-init object-reconstructor stop` (the console output confirms signal # 15). Both reload and restart also use signal 15, and the reconstructor responds appropriately and responsively. So this is not a trivial/obvious issue.
The reconstructor (and all swift daemons) do install a signal handler for SIGTERM [1] in order to ensure the TERM is forwarded to the process group to kill any forked children if they have any. It's possible that this signal handler isn't doing the right thing - but more likely the reconstructor had just gotten itself into some uninterruptible state.
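To make the forwarding described above concrete, here is a minimal sketch of a SIGTERM handler that relays the signal to the whole process group. It is an illustration written for this report, not a copy of the code in swift.common.daemon; `make_term_handler`, `install_term_forwarder` and the injectable `killpg`/`exit_fn` parameters are hypothetical names, used so the handler can be exercised without actually killing anything.

```python
import os
import signal


def make_term_handler(killpg=os.killpg, exit_fn=os._exit):
    """Build a SIGTERM handler that forwards TERM to the process group.

    killpg and exit_fn are hypothetical injection points (not Swift API)
    so the handler can be tested without signalling real processes.
    """
    def handler(signum, frame):
        # Ignore further TERMs in this process so the group-wide signal
        # sent below does not re-enter this handler.
        signal.signal(signal.SIGTERM, signal.SIG_IGN)
        # Process group 0 means "the caller's own process group", which
        # includes any forked children that did not change groups.
        killpg(0, signal.SIGTERM)
        exit_fn(0)
    return handler


def install_term_forwarder():
    """Install the forwarding handler; call from the main thread."""
    signal.signal(signal.SIGTERM, make_term_handler())
```

One thing worth keeping in mind when reading a handler like this: CPython only runs Python-level signal handlers in the main thread, between bytecode instructions. If the main thread is stuck inside a long-running C call, the handler is delayed until that call returns, which would be consistent with the "uninterruptible state" guess above.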
It may be simple to reproduce by loading some data into a development environment, getting the reconstructors going, then trying to stop them. If that doesn't work we'll have to set the bug to Incomplete and get some more information, e.g.: when you kill from the command line (the default is SIGTERM to just the parent pid), is the behavior different? Maybe the process group SIGTERM (you can do a process group kill from the command line by making the pid negative, I believe) is less good for the reconstructor/pyeclib? If you SIGKILL (-9), do they die or become zombies?
[1] swift.common.daemon.Daemon.run
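The comparison suggested above (SIGTERM to just the parent pid versus SIGTERM to the whole process group) could be scripted roughly as follows in a dev environment. This is a sketch under assumptions: `send_term` is a hypothetical helper, it expects a `subprocess.Popen` stand-in rather than a real swift daemon, and the 15 second default only mirrors the timeout mentioned in the original notes.

```python
import os
import signal
import subprocess


def send_term(proc, whole_group=False, timeout=15.0):
    """Send SIGTERM to proc or to its process group.

    Returns True if proc exited within timeout seconds, else False.
    """
    if whole_group:
        pgid = os.getpgid(proc.pid)
        if pgid == os.getpgrp():
            # Safety net for the sketch: never TERM the caller's group.
            raise ValueError("refusing to signal our own process group")
        # Same effect as `kill -TERM -- -<pgid>` in a shell.
        os.killpg(pgid, signal.SIGTERM)
    else:
        # Same effect as a plain `kill <pid>`, which defaults to SIGTERM.
        os.kill(proc.pid, signal.SIGTERM)
    try:
        proc.wait(timeout=timeout)
        return True
    except subprocess.TimeoutExpired:
        return False
```

If a process survives both variants, following up with `proc.kill()` (SIGKILL) and checking whether it is reaped or lingers as a zombie would answer the last question.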