Activity log for bug #1810714

Date Who What changed Old value New value Message
2019-01-07 06:40:08 Joel Sing bug added bug
2019-01-07 06:40:30 Joel Sing description
    Old value: Multiple bugs (most recently lp#1810313) have resulted in the juju agents becoming wedged and failing to operate - in particular, either wedging while establishing a connection to the API server, or failing to reconnect to the API server. With any real deployment this is a significant problem and annoyance. The solution currently is to manually restart 100s or 1000s of juju agents, across 100s of machines. In other situations it is not possible for the operators of the controller to force agents to restart, instead having to wait for the owners of those machines to notice problems and force restarts. The juju agent should have a healthcheck and restart the jujud process if the API server is unreachable for an extended period of time (e.g. 2-6 hours). This could be implemented either as an external sitter process, or as a watchdog inside the jujud process. The end result should be that the process is sent a SIGTERM, followed by a SIGKILL if it has not terminated within a reasonable time (e.g. 30 seconds). Ideally the current juju_goroutines output would be opportunistically stored, the problem logged and metrics recorded so that this can be detected via monitoring and analysed/investigated at a later time. However, none of this should prevent a restart (for example, being out of disk space should not prevent the watchdog/sitter from restarting the jujud).
    New value: Multiple bugs (most recently lp#1810313) have resulted in the juju agents becoming wedged and failing to operate - in particular, either wedging while establishing a connection to the API server, or failing to reconnect to the API server. With any real deployment this is a significant problem and annoyance. The solution currently is to manually restart 100s or 1000s of juju agents, across 100s of machines. In other situations it is not possible for the operators of the controller to force agents to restart, instead having to wait for the owners of those machines to notice problems and force restarts. The juju agent should have a watchdog and restart the jujud process if the API server is unreachable for an extended period of time (e.g. 2-6 hours). This could be implemented either as an external sitter process, or as a watchdog inside the jujud process. The end result should be that the process is sent a SIGTERM, followed by a SIGKILL if it has not terminated within a reasonable time (e.g. 30 seconds). Ideally the current juju_goroutines output would be opportunistically stored, the problem logged and metrics recorded so that this can be detected via monitoring and analysed/investigated at a later time. However, none of this should prevent a restart (for example, being out of disk space should not prevent the watchdog/sitter from restarting the jujud).
2019-01-07 06:44:00 Joel Sing description
    Old value: Multiple bugs (most recently lp#1810313) have resulted in the juju agents becoming wedged and failing to operate - in particular, either wedging while establishing a connection to the API server, or failing to reconnect to the API server. With any real deployment this is a significant problem and annoyance. The solution currently is to manually restart 100s or 1000s of juju agents, across 100s of machines. In other situations it is not possible for the operators of the controller to force agents to restart, instead having to wait for the owners of those machines to notice problems and force restarts. The juju agent should have a watchdog and restart the jujud process if the API server is unreachable for an extended period of time (e.g. 2-6 hours). This could be implemented either as an external sitter process, or as a watchdog inside the jujud process. The end result should be that the process is sent a SIGTERM, followed by a SIGKILL if it has not terminated within a reasonable time (e.g. 30 seconds). Ideally the current juju_goroutines output would be opportunistically stored, the problem logged and metrics recorded so that this can be detected via monitoring and analysed/investigated at a later time. However, none of this should prevent a restart (for example, being out of disk space should not prevent the watchdog/sitter from restarting the jujud).
    New value: Multiple bugs (most recently lp#1810712) have resulted in the juju agents becoming wedged and failing to operate - in particular, either wedging while establishing a connection to the API server, or failing to reconnect to the API server. With any real deployment this is a significant problem and annoyance. The solution currently is to manually restart 100s or 1000s of juju agents, across 100s of machines. In other situations it is not possible for the operators of the controller to force agents to restart, instead having to wait for the owners of those machines to notice problems and force restarts. The juju agent should have a watchdog and restart the jujud process if the API server is unreachable for an extended period of time (e.g. 2-6 hours). This could be implemented either as an external sitter process, or as a watchdog inside the jujud process. The end result should be that the process is sent a SIGTERM, followed by a SIGKILL if it has not terminated within a reasonable time (e.g. 30 seconds). Ideally the current juju_goroutines output would be opportunistically stored, the problem logged and metrics recorded so that this can be detected via monitoring and analysed/investigated at a later time. However, none of this should prevent a restart (for example, being out of disk space should not prevent the watchdog/sitter from restarting the jujud).
2019-01-08 08:48:02 Barry Price bug added subscriber Barry Price
2019-01-08 16:31:51 Richard Harding juju: status New Triaged
2019-01-08 16:31:57 Richard Harding juju: importance Undecided Medium
2019-01-08 16:32:07 Richard Harding juju: importance Medium Wishlist
2019-01-30 22:31:44 Haw Loeung bug added subscriber The Canonical Sysadmins
2019-01-30 22:32:07 Haw Loeung description
    Old value: Multiple bugs (most recently lp#1810712) have resulted in the juju agents becoming wedged and failing to operate - in particular, either wedging while establishing a connection to the API server, or failing to reconnect to the API server. With any real deployment this is a significant problem and annoyance. The solution currently is to manually restart 100s or 1000s of juju agents, across 100s of machines. In other situations it is not possible for the operators of the controller to force agents to restart, instead having to wait for the owners of those machines to notice problems and force restarts. The juju agent should have a watchdog and restart the jujud process if the API server is unreachable for an extended period of time (e.g. 2-6 hours). This could be implemented either as an external sitter process, or as a watchdog inside the jujud process. The end result should be that the process is sent a SIGTERM, followed by a SIGKILL if it has not terminated within a reasonable time (e.g. 30 seconds). Ideally the current juju_goroutines output would be opportunistically stored, the problem logged and metrics recorded so that this can be detected via monitoring and analysed/investigated at a later time. However, none of this should prevent a restart (for example, being out of disk space should not prevent the watchdog/sitter from restarting the jujud).
    New value: Multiple bugs (most recently lp: #1810712) have resulted in the juju agents becoming wedged and failing to operate - in particular, either wedging while establishing a connection to the API server, or failing to reconnect to the API server. With any real deployment this is a significant problem and annoyance. The solution currently is to manually restart 100s or 1000s of juju agents, across 100s of machines. In other situations it is not possible for the operators of the controller to force agents to restart, instead having to wait for the owners of those machines to notice problems and force restarts. The juju agent should have a watchdog and restart the jujud process if the API server is unreachable for an extended period of time (e.g. 2-6 hours). This could be implemented either as an external sitter process, or as a watchdog inside the jujud process. The end result should be that the process is sent a SIGTERM, followed by a SIGKILL if it has not terminated within a reasonable time (e.g. 30 seconds). Ideally the current juju_goroutines output would be opportunistically stored, the problem logged and metrics recorded so that this can be detected via monitoring and analysed/investigated at a later time. However, none of this should prevent a restart (for example, being out of disk space should not prevent the watchdog/sitter from restarting the jujud).
2022-11-03 16:22:22 Canonical Juju QA Bot juju: importance Wishlist Low
2022-11-03 16:22:23 Canonical Juju QA Bot tags expirebugs-bot
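The behaviour requested in the bug description above is a SIGTERM-then-SIGKILL escalation once the API server has been unreachable for an extended period. Below is a minimal, illustrative sketch of that escalation in Go; it is not Juju code, and the function name, the placeholder PID and the one-second polling loop are assumptions for illustration. Detecting the unreachable API server, opportunistically storing juju_goroutines output and recording metrics would be the job of the surrounding sitter/watchdog and are not shown.

// restart_sketch.go - hypothetical sketch of the SIGTERM-then-SIGKILL
// escalation described in this bug; not actual jujud or watchdog code.
package main

import (
	"fmt"
	"syscall"
	"time"
)

// terminateThenKill asks the process to exit with SIGTERM and, if it is still
// alive after gracePeriod (the bug suggests ~30 seconds), forces it down with
// SIGKILL so that the service supervisor can start a fresh jujud.
func terminateThenKill(pid int, gracePeriod time.Duration) error {
	if err := syscall.Kill(pid, syscall.SIGTERM); err != nil {
		return fmt.Errorf("sending SIGTERM to %d: %w", pid, err)
	}

	deadline := time.Now().Add(gracePeriod)
	for time.Now().Before(deadline) {
		// Signal 0 checks for process existence without delivering a
		// signal; ESRCH means the process has already exited.
		if err := syscall.Kill(pid, 0); err == syscall.ESRCH {
			return nil // exited within the grace period
		}
		time.Sleep(time.Second)
	}

	// Still running after the grace period: escalate to SIGKILL.
	if err := syscall.Kill(pid, syscall.SIGKILL); err != nil && err != syscall.ESRCH {
		return fmt.Errorf("sending SIGKILL to %d: %w", pid, err)
	}
	return nil
}

func main() {
	// Hypothetical usage: the sitter/watchdog would call this only after the
	// API server has been unreachable for an extended period (e.g. 2-6 hours).
	const wedgedPID = 12345 // placeholder PID of the wedged jujud
	if err := terminateThenKill(wedgedPID, 30*time.Second); err != nil {
		fmt.Println("restart escalation failed:", err)
	}
}

Whether this runs as an external sitter process or as an in-process watchdog, the restart path itself should depend on as little as possible (per the description, even an out-of-disk condition should not prevent it), which is an argument for keeping the escalation logic this simple.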