juju doesn't close connections to machines it thinks are dead

Bug #1698187 reported by John A Meinel on 2017-06-15
This bug affects 1 person
Affects: juju
Importance: Medium
Assigned to: Unassigned

Bug Description

We had a large deployment (~250 machines on MAAS) that was killed all at once with "juju destroy-model". Looking at the output of that command, it slowly reaps a bunch of applications, and then finally starts marking machines as going away, and then marks the model as dead.

Looking at the database, there are no more machine records, nor a model record.

However, it seems that something went wrong wrt actually shutting those machines down (MAAS saw that ~200 machines were still running).

Watching the API, we were seeing several APIs getting hammered:
 LeadershipClaimLeadership
 Metrics.WatchMeterStatus
 RetryStrategy.RetryStrategy
 Upgrader.SetTools

The last one is the first thing that the Upgrader worker does when it starts up. So the belief is that we had an active API connection, the associated Machine/Unit records were already deleted, so the Upgrader was getting an error trying to SetTools for a machine that no longer existed. That worker was then bouncing and restarting, and doing it again.

However, if we had actually dropped the TCP connection, we would have expected all of those agents to be trying to Login as machines that no longer exist. In that case we would expect to see Login APIs being called, not Worker APIs.

This hints that we aren't closing active connections to machines that we have otherwise treated as Dead. Likely those connections cannot do much, because we should check the auth for most requests and find that they aren't the owner of anything because the machine they think they are exists no more. But it does seem cleaner to get rid of them.
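One possible shape for the cleanup, sketched in Go under the assumption that the server can index open connections by the entity they authenticated as. The `connRegistry` type, its methods, and the `machine-42` tag are hypothetical and only illustrate the idea; this is not how Juju's API server is known to be structured.

```go
package main

import (
	"fmt"
	"sync"
)

// conn is a minimal stand-in for an agent's API connection.
type conn struct {
	closed bool
}

// Close marks the connection as closed.
func (c *conn) Close() { c.closed = true }

// connRegistry tracks open connections by the entity tag they
// authenticated as (hypothetical type).
type connRegistry struct {
	mu    sync.Mutex
	conns map[string][]*conn
}

func newConnRegistry() *connRegistry {
	return &connRegistry{conns: make(map[string][]*conn)}
}

// Add records a newly authenticated connection for tag.
func (r *connRegistry) Add(tag string, c *conn) {
	r.mu.Lock()
	defer r.mu.Unlock()
	r.conns[tag] = append(r.conns[tag], c)
}

// EntityRemoved closes and forgets every connection held for tag and
// returns how many connections were closed.
func (r *connRegistry) EntityRemoved(tag string) int {
	r.mu.Lock()
	defer r.mu.Unlock()
	closed := 0
	for _, c := range r.conns[tag] {
		c.Close()
		closed++
	}
	delete(r.conns, tag)
	return closed
}

func main() {
	reg := newConnRegistry()
	c := &conn{}
	reg.Add("machine-42", c)
	n := reg.EntityRemoved("machine-42")
	fmt.Println(n, c.closed)
}
```

With something like this, an agent whose machine was removed would lose its connection, and its next attempt would be a Login that fails, instead of a stream of Worker API calls.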

Anastasia (anastasia-macmood) wrote :

Do you have a repro scenario? Could you easily test it on the latest Juju version?

Since there have been a few changes in the area since the original report was filed, I wonder if it is still relevant...

I'll mark this as Incomplete until we get a confirmation.

Changed in juju:
status: Triaged → Incomplete

This seems like a case where destroy-machine --force gets you into trouble
because it doesn't wait for an ack from the machine agent before it starts
removing machine records. I would have thought we would still ensure that
machine instances are destroyed.

I don't have a particular reproduction mechanism.

On Tue, May 7, 2019, 09:00 Anastasia <email address hidden>
wrote:

> Do you have a repro scenario? Could you easily test it on the latest
> Juju version?
>
> Since there have been a few changes in the area since the original
> report was filed, I wonder if it is still relevant...
>
> I'll mark this as Incomplete until we get a confirmation.
>
> ** Changed in: juju
> Status: Triaged => Incomplete

Anastasia (anastasia-macmood) wrote :

I see. I'll try to reproduce to confirm if it is still an issue... maybe get some metric of performance for benchmarking.

Changed in juju:
status: Incomplete → Triaged