juju doesn't close connections to machines it thinks are dead
| Affects | Status | Importance | Assigned to | Milestone |
|---|---|---|---|---|
| Canonical Juju | Triaged | Low | Unassigned | |
Bug Description
We had a large deployment (~250 machines on MAAS) that was killed all at once with "juju destroy-model". Looking at the output of that command, it slowly reaps a bunch of applications, and then finally starts marking machines as going away, and then marks the model as dead.
Looking at the database, there are no more machine records, nor a model record.
However, it seems that something went wrong with actually shutting those machines down: MAAS reported that ~200 machines were still running.
Watching the API, we saw several calls being hammered:
- LeadershipClai
- Metrics.
- RetryStrategy.
- Upgrader.SetTools
The last one, Upgrader.SetTools, is the first call the Upgrader worker makes when it starts up. So the belief is that we had an active API connection whose associated machine/unit records had already been deleted; the Upgrader therefore got an error trying to SetTools for a machine that no longer existed, and the worker bounced, restarted, and tried again.
However, if we had actually dropped the TCP connections, we would have expected all of those agents to be trying to Login as machines that no longer exist, so we would expect to see Login calls rather than worker API calls.
This hints that we aren't closing active connections to machines that we have otherwise treated as Dead. Those connections likely can't do much, because most requests check auth and would find that the caller owns nothing, since the machine it thinks it is no longer exists. But it does seem cleaner to get rid of them.
Do you have a repro scenario? Could you easily test it on the latest Juju version?
Since there have been a few changes in this area since the original report was filed, I wonder whether it is still relevant.
I'll mark this as Incomplete until we get a confirmation.