juju agent needs to be more resilient

Bug #1810714 reported by Joel Sing
This bug affects 2 people
Affects: Canonical Juju
Status: Triaged
Importance: Low
Assigned to: Unassigned
Milestone: (none listed)

Bug Description

Multiple bugs (most recently lp: #1810712) have left juju agents wedged and unable to operate. In particular, agents either wedge while establishing a connection to the API server or fail to reconnect to it. In any real deployment this is a significant problem and annoyance.

The current solution is to manually restart hundreds or thousands of juju agents across hundreds of machines. In other situations the operators of the controller cannot force agents to restart, and instead have to wait for the owners of those machines to notice problems and force restarts.

The juju agent should have a watchdog and restart the jujud process if the API server is unreachable for an extended period of time (e.g. 2-6 hours). This could be implemented either as an external sitter process, or as a watchdog inside the jujud process. The end result should be that the process is sent a SIGTERM, followed by a SIGKILL if it has not terminated within a reasonable time (e.g. 30 seconds).
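
To make the proposal above concrete, here is a minimal sketch of the external sitter variant. It is written in Python purely for illustration and is not Juju code. The names ApiWatchdog and stop_process, and the idea that something else (for example the service manager) starts the process again after it exits, are assumptions. It relies on POSIX signals.

```python
import signal
import subprocess
import sys
import time


class ApiWatchdog:
    """Tracks how long the API server has been unreachable (hypothetical helper)."""

    def __init__(self, max_unreachable_seconds, clock=time.monotonic):
        self._max = max_unreachable_seconds
        self._clock = clock
        self._last_ok = clock()

    def record_success(self):
        # Call this whenever the agent has successfully talked to the API server.
        self._last_ok = self._clock()

    def should_restart(self):
        # True once the API server has been unreachable for the whole window.
        return self._clock() - self._last_ok >= self._max


def stop_process(proc, grace_seconds=30.0):
    """Send SIGTERM, then SIGKILL if proc has not exited within grace_seconds.

    proc is a subprocess.Popen handle owned by the sitter.
    Returns the name of the last signal that was sent.
    """
    proc.send_signal(signal.SIGTERM)
    try:
        proc.wait(timeout=grace_seconds)
        return "SIGTERM"
    except subprocess.TimeoutExpired:
        proc.kill()  # SIGKILL on POSIX; cannot be caught or ignored
        proc.wait()
        return "SIGKILL"
```

A sitter loop would call record_success() on every successful API contact, poll should_restart() with a window of somewhere between 2 and 6 hours, and call stop_process() once it returns True. Injecting the clock keeps the timeout logic testable without waiting hours.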

Ideally the current juju_goroutines output would be opportunistically stored, the problem logged, and metrics recorded, so that this can be detected via monitoring and analysed/investigated at a later time. However, none of this should prevent a restart (for example, being out of disk space should not prevent the watchdog/sitter from restarting jujud).
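
The "best effort, but never block the restart" requirement above could look roughly like the following sketch. Again this is illustrative Python, not Juju code. The function names, the step names, and the per-step timeout are assumptions. Each diagnostic step runs in a daemon thread so that a failing or hanging step (full disk, stuck I/O) cannot stop the restart from happening.

```python
import threading


def run_best_effort(step, timeout_seconds):
    """Run step() in a daemon thread; return True only if it finished cleanly in time."""
    outcome = {"ok": False}

    def target():
        try:
            step()
            outcome["ok"] = True
        except Exception:
            # Diagnostics are opportunistic: any error (e.g. ENOSPC) is swallowed.
            pass

    worker = threading.Thread(target=target, daemon=True)
    worker.start()
    worker.join(timeout_seconds)
    return outcome["ok"] and not worker.is_alive()


def restart_with_diagnostics(diagnostic_steps, restart, step_timeout=5.0):
    """Try each (name, step) pair, then always call restart().

    Returns the names of the steps that failed or timed out.
    """
    failed = []
    for name, step in diagnostic_steps:
        if not run_best_effort(step, step_timeout):
            failed.append(name)
    restart()  # reached regardless of what happened to the diagnostics
    return failed
```

In this sketch the steps might be "store goroutine dump", "write log entry" and "record metric", with restart being whatever sends the SIGTERM/SIGKILL sequence. The worst-case delay before the restart is bounded by the number of steps times step_timeout.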

Joel Sing (jsing)
description: updated
description: updated
Richard Harding (rharding) wrote :

Thanks, I can't disagree with what you've got here. It's something we should discuss so that we can plot out a proper path to some self-healing around this.

Changed in juju:
status: New → Triaged
importance: Undecided → Medium
importance: Medium → Wishlist
Haw Loeung (hloeung)
description: updated
Canonical Juju QA Bot (juju-qa-bot) wrote :

This bug has not been updated in 2 years, so we're marking it Low importance. If you believe this is incorrect, please update the importance.

Changed in juju:
importance: Wishlist → Low
tags: added: expirebugs-bot