Comment 3 for bug 1891586

Ian Booth (wallyworld) wrote :

That "Failed to list unit files: Connection timed out;" error would be coming from systemctl itself as it tries to run the list-unit-files command. Juju uses that command at controller startup to check that mongo etc. is installed, as well as when deploying units.

You could guess that it would be related to load on the machine. The timeout causes the agent to restart because of how Juju manages its worker routines: by design, it is considered better to restart when there's an error than to try to maintain state and recover. There's perhaps an argument to be made that I/O timeout type errors should result in the operation being retried after a backoff. However, that would involve cleanly identifying the root cause error in the 100s of places where errors can occur, and adding code to each of those places to handle the retry. It's more feasible to keep the agent restart but back off at that point, to avoid contributing to the load on the machine which is causing the issue.
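
To make the "restart, but back off" idea concrete, here is a rough sketch in Python. It is not Juju's actual code (Juju is written in Go), and the names backoff_delays and run_with_restart, as well as the default delays, are made up for illustration. The point is only that the worker is started again from scratch on any error, with a growing, capped pause between attempts so a loaded machine is not hit with a tight restart loop.

```python
import time
from typing import Callable, Iterator


def backoff_delays(initial: float = 1.0, factor: float = 2.0,
                   maximum: float = 60.0) -> Iterator[float]:
    """Yield exponentially growing delays, capped at `maximum` seconds.

    The default values are illustrative assumptions, not Juju settings.
    """
    delay = initial
    while True:
        yield min(delay, maximum)
        delay = min(delay * factor, maximum)


def run_with_restart(worker: Callable[[], None], max_restarts: int,
                     sleep: Callable[[float], None] = time.sleep) -> int:
    """Run `worker`; on any error, pause and then start it again from scratch.

    No attempt is made to work out which error occurred or to recover
    state: the whole worker is simply restarted. Returns the number of
    restarts that were needed, and re-raises the last error once
    `max_restarts` has been used up.
    """
    delays = backoff_delays()
    restarts = 0
    while True:
        try:
            worker()
            return restarts
        except Exception:
            if restarts >= max_restarts:
                raise
            # Back off before restarting so we do not add to the load
            # that may have caused the timeout in the first place.
            sleep(next(delays))
            restarts += 1
```

The trade-off is the one described above: the backoff lives in a single place (the restart loop) instead of needing retry handling at every call site that can time out.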