Activity log for bug #1810714

Date Who What changed Old value New value Message
2019-01-07 06:40:08 Joel Sing bug added bug
2019-01-07 06:40:30 Joel Sing description
    Old value: Multiple bugs (most recently lp#1810313) have resulted in the juju agents becoming wedged and failing to operate - in particular, either wedging while establishing a connection to the API server, or failing to reconnect to the API server. With any real deployment this is a significant problem and annoyance. The solution currently is to manually restart 100s or 1000s of juju agents, across 100s of machines. In other situations it is not possible for the operators of the controller to force agents to restart, instead having to wait for the owners of those machines to notice problems and force restarts. The juju agent should have a healthcheck and restart the jujud process if the API server is unreachable for an extended period of time (e.g. 2-6 hours). This could be implemented either as an external sitter process, or as a watchdog inside the jujud process. The end result should be that the process is sent a SIGTERM, followed by a SIGKILL if it has not terminated within a reasonable time (e.g. 30 seconds). Ideally the current juju_goroutines output would be opportunistically stored, the problem logged and metrics recorded so that this can be detected via monitoring and analysed/investigated at a later time. However, none of this should prevent a restart (for example, being out of disk space should not prevent the watchdog/sitter from restarting the jujud).
    New value: Multiple bugs (most recently lp#1810313) have resulted in the juju agents becoming wedged and failing to operate - in particular, either wedging while establishing a connection to the API server, or failing to reconnect to the API server. With any real deployment this is a significant problem and annoyance. The solution currently is to manually restart 100s or 1000s of juju agents, across 100s of machines. In other situations it is not possible for the operators of the controller to force agents to restart, instead having to wait for the owners of those machines to notice problems and force restarts. The juju agent should have a watchdog and restart the jujud process if the API server is unreachable for an extended period of time (e.g. 2-6 hours). This could be implemented either as an external sitter process, or as a watchdog inside the jujud process. The end result should be that the process is sent a SIGTERM, followed by a SIGKILL if it has not terminated within a reasonable time (e.g. 30 seconds). Ideally the current juju_goroutines output would be opportunistically stored, the problem logged and metrics recorded so that this can be detected via monitoring and analysed/investigated at a later time. However, none of this should prevent a restart (for example, being out of disk space should not prevent the watchdog/sitter from restarting the jujud).
2019-01-07 06:44:00 Joel Sing description
    Old value: Multiple bugs (most recently lp#1810313) have resulted in the juju agents becoming wedged and failing to operate - in particular, either wedging while establishing a connection to the API server, or failing to reconnect to the API server. With any real deployment this is a significant problem and annoyance. The solution currently is to manually restart 100s or 1000s of juju agents, across 100s of machines. In other situations it is not possible for the operators of the controller to force agents to restart, instead having to wait for the owners of those machines to notice problems and force restarts. The juju agent should have a watchdog and restart the jujud process if the API server is unreachable for an extended period of time (e.g. 2-6 hours). This could be implemented either as an external sitter process, or as a watchdog inside the jujud process. The end result should be that the process is sent a SIGTERM, followed by a SIGKILL if it has not terminated within a reasonable time (e.g. 30 seconds). Ideally the current juju_goroutines output would be opportunistically stored, the problem logged and metrics recorded so that this can be detected via monitoring and analysed/investigated at a later time. However, none of this should prevent a restart (for example, being out of disk space should not prevent the watchdog/sitter from restarting the jujud).
    New value: Multiple bugs (most recently lp#1810712) have resulted in the juju agents becoming wedged and failing to operate - in particular, either wedging while establishing a connection to the API server, or failing to reconnect to the API server. With any real deployment this is a significant problem and annoyance. The solution currently is to manually restart 100s or 1000s of juju agents, across 100s of machines. In other situations it is not possible for the operators of the controller to force agents to restart, instead having to wait for the owners of those machines to notice problems and force restarts. The juju agent should have a watchdog and restart the jujud process if the API server is unreachable for an extended period of time (e.g. 2-6 hours). This could be implemented either as an external sitter process, or as a watchdog inside the jujud process. The end result should be that the process is sent a SIGTERM, followed by a SIGKILL if it has not terminated within a reasonable time (e.g. 30 seconds). Ideally the current juju_goroutines output would be opportunistically stored, the problem logged and metrics recorded so that this can be detected via monitoring and analysed/investigated at a later time. However, none of this should prevent a restart (for example, being out of disk space should not prevent the watchdog/sitter from restarting the jujud).
2019-01-08 08:48:02 Barry Price bug added subscriber Barry Price
2019-01-08 16:31:51 Richard Harding juju: status New Triaged
2019-01-08 16:31:57 Richard Harding juju: importance Undecided Medium
2019-01-08 16:32:07 Richard Harding juju: importance Medium Wishlist
2019-01-30 22:31:44 Haw Loeung bug added subscriber The Canonical Sysadmins
2019-01-30 22:32:07 Haw Loeung description
    Old value: Multiple bugs (most recently lp#1810712) have resulted in the juju agents becoming wedged and failing to operate - in particular, either wedging while establishing a connection to the API server, or failing to reconnect to the API server. With any real deployment this is a significant problem and annoyance. The solution currently is to manually restart 100s or 1000s of juju agents, across 100s of machines. In other situations it is not possible for the operators of the controller to force agents to restart, instead having to wait for the owners of those machines to notice problems and force restarts. The juju agent should have a watchdog and restart the jujud process if the API server is unreachable for an extended period of time (e.g. 2-6 hours). This could be implemented either as an external sitter process, or as a watchdog inside the jujud process. The end result should be that the process is sent a SIGTERM, followed by a SIGKILL if it has not terminated within a reasonable time (e.g. 30 seconds). Ideally the current juju_goroutines output would be opportunistically stored, the problem logged and metrics recorded so that this can be detected via monitoring and analysed/investigated at a later time. However, none of this should prevent a restart (for example, being out of disk space should not prevent the watchdog/sitter from restarting the jujud).
    New value: Multiple bugs (most recently lp: #1810712) have resulted in the juju agents becoming wedged and failing to operate - in particular, either wedging while establishing a connection to the API server, or failing to reconnect to the API server. With any real deployment this is a significant problem and annoyance. The solution currently is to manually restart 100s or 1000s of juju agents, across 100s of machines. In other situations it is not possible for the operators of the controller to force agents to restart, instead having to wait for the owners of those machines to notice problems and force restarts. The juju agent should have a watchdog and restart the jujud process if the API server is unreachable for an extended period of time (e.g. 2-6 hours). This could be implemented either as an external sitter process, or as a watchdog inside the jujud process. The end result should be that the process is sent a SIGTERM, followed by a SIGKILL if it has not terminated within a reasonable time (e.g. 30 seconds). Ideally the current juju_goroutines output would be opportunistically stored, the problem logged and metrics recorded so that this can be detected via monitoring and analysed/investigated at a later time. However, none of this should prevent a restart (for example, being out of disk space should not prevent the watchdog/sitter from restarting the jujud).
2022-11-03 16:22:22 Canonical Juju QA Bot juju: importance Wishlist Low
2022-11-03 16:22:23 Canonical Juju QA Bot tags expirebugs-bot
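The behaviour requested in the bug description above is a SIGTERM-then-SIGKILL escalation once the API server has been unreachable for an extended period. Below is a minimal, illustrative sketch of that escalation in Go; it is not Juju code, and the function name, the placeholder PID and the one-second polling loop are assumptions for illustration. Detecting the unreachable API server, opportunistically storing juju_goroutines output and recording metrics would be the job of the surrounding sitter/watchdog and are not shown.

// restart_sketch.go - hypothetical sketch of the SIGTERM-then-SIGKILL
// escalation described in this bug; not actual jujud or watchdog code.
package main

import (
	"fmt"
	"syscall"
	"time"
)

// terminateThenKill asks the process to exit with SIGTERM and, if it is still
// alive after gracePeriod (the bug suggests ~30 seconds), forces it down with
// SIGKILL so that the service supervisor can start a fresh jujud.
func terminateThenKill(pid int, gracePeriod time.Duration) error {
	if err := syscall.Kill(pid, syscall.SIGTERM); err != nil {
		return fmt.Errorf("sending SIGTERM to %d: %w", pid, err)
	}

	deadline := time.Now().Add(gracePeriod)
	for time.Now().Before(deadline) {
		// Signal 0 checks for process existence without delivering a
		// signal; ESRCH means the process has already exited.
		if err := syscall.Kill(pid, 0); err == syscall.ESRCH {
			return nil // exited within the grace period
		}
		time.Sleep(time.Second)
	}

	// Still running after the grace period: escalate to SIGKILL.
	if err := syscall.Kill(pid, syscall.SIGKILL); err != nil && err != syscall.ESRCH {
		return fmt.Errorf("sending SIGKILL to %d: %w", pid, err)
	}
	return nil
}

func main() {
	// Hypothetical usage: the sitter/watchdog would call this only after the
	// API server has been unreachable for an extended period (e.g. 2-6 hours).
	const wedgedPID = 12345 // placeholder PID of the wedged jujud
	if err := terminateThenKill(wedgedPID, 30*time.Second); err != nil {
		fmt.Println("restart escalation failed:", err)
	}
}

Whether this runs as an external sitter process or as an in-process watchdog, the restart path itself should depend on as little as possible (per the description, even an out-of-disk condition should not prevent it), which is an argument for keeping the escalation logic this simple.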