Comment 8 for bug 1710278

Mike Pontillo (mpontillo) wrote :

I'm +1 on throttling reloads; I think that is the most obvious and critical work item for the MAAS team to address. I have filed that as bug #1710308.
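To make the throttling idea concrete, here is a minimal sketch of one way reload requests could be coalesced. It is an illustration only, not MAAS's actual implementation: the class name `ReloadThrottle`, the `reload_fn` callback, and the 5-second default interval are all assumptions made for the example.

```python
import time


class ReloadThrottle:
    """Coalesce reload requests so at most one runs per min_interval seconds.

    Hypothetical helper for illustration; reload_fn stands in for whatever
    actually asks the init system to reload bind9.
    """

    def __init__(self, reload_fn, min_interval=5.0, clock=time.monotonic):
        self._reload_fn = reload_fn
        self._min_interval = min_interval  # assumed value, tune as needed
        self._clock = clock
        self._last_reload = None
        self._pending = False

    def request_reload(self):
        """Reload now if the interval has elapsed; otherwise defer."""
        now = self._clock()
        if (self._last_reload is None
                or now - self._last_reload >= self._min_interval):
            self._last_reload = now
            self._pending = False
            self._reload_fn()
            return True
        # Too soon: remember that a reload is wanted, but do not run it.
        self._pending = True
        return False

    def flush(self):
        """Call periodically; runs one deferred reload once allowed."""
        if self._pending:
            return self.request_reload()
        return False
```

With this shape, any number of requests arriving inside the interval collapse into a single deferred reload, which is what narrows the window in which the deadlock can be triggered.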

I'm also +1 on better service monitoring using actual queries; I've filed that as bug #1710310. I think something equivalent to 'dig @127.0.0.1 <test-query>' on the region should be enough to detect a deadlock condition. I also like the idea of monitoring it from the rack's perspective, though a failure there feels more like a non-fatal warning: we don't want to restart bind9 in the event of random firewall hiccups.
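As a rough sketch of what such a query-based check might look like, the snippet below shells out to dig with a short timeout and treats "no reply" as unhealthy. The function names, the 2-second timeout, and the injectable `run` parameter are assumptions for illustration; the name to query is left to the caller. It relies on dig exiting with status 0 when it receives any reply (including NXDOMAIN), so it only detects a server that is not answering at all.

```python
import subprocess


def build_dig_command(server, name, timeout_s=2):
    """Build a dig invocation that gives up quickly instead of retrying."""
    return ["dig", f"@{server}", name, f"+time={timeout_s}", "+tries=1"]


def dns_responds(server, name, timeout_s=2, run=subprocess.run):
    """Return True if the server answered the query at all.

    Hypothetical helper: 'run' is injectable so the check can be tested
    without a real DNS server.
    """
    cmd = build_dig_command(server, name, timeout_s)
    try:
        result = run(
            cmd,
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
            timeout=timeout_s + 1,  # hard stop in case dig itself hangs
        )
    except (subprocess.TimeoutExpired, OSError):
        # dig hung, or is not installed: treat as "no answer".
        return False
    return result.returncode == 0
```

On the region this would be called as `dns_responds("127.0.0.1", name)`. A single failed probe is probably too weak a signal to act on; requiring several consecutive failures before restarting anything would be a reasonable refinement.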

Finally, I think your last bullet requires more discussion before we can work on it. MAAS currently uses sudoers rules specific to the init system to start and stop services like bind9; we do not have permission to 'kill -9' arbitrary processes. I'm concerned that if we go down that road, MAAS could come to believe, erroneously or due to a malicious attack, that bind9 isn't working and repeatedly kill it without good cause, or be convinced to 'kill -9' the wrong process.

In summary, I think the most urgent thing for MAAS to do is throttle reloads. That should greatly reduce the window of opportunity for the deadlock to occur. In parallel, this should be addressed upstream in bind9.