Thanks for the update, Jorge.  What you're saying makes sense.  Here's a breakdown of what's going on:

1. when the machine agent starts, it creates the connection to the API server that it will use for its entire life
  https://github.com/juju/juju/blob/1.25/cmd/jujud/agent/machine.go#L444

2. the connection to the API server can't be made because of a credentials problem, so worker.ErrTerminateAgent is returned
  https://github.com/juju/juju/blob/1.25/worker/apicaller/open.go#L170

3. the machine agent's runner treats the error as fatal and stops

4. the machine agent waits for its runner to stop

5. the machine agent handles the worker.ErrTerminateAgent by "uninstalling" the agent
  https://github.com/juju/juju/blob/1.25/cmd/jujud/agent/machine.go#L455

6. the init services for Juju and mongod are removed and the agent data dir is deleted
  https://github.com/juju/juju/blob/1.25/cmd/jujud/agent/machine.go#L1737
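
To make the flow above concrete, here is a rough sketch in Go.  It is not the actual Juju code: ErrTerminateAgent is a local stand-in for worker.ErrTerminateAgent, and openAPI, runAgent, and the uninstall callback are hypothetical names used only for illustration.

```go
package main

import (
	"errors"
	"fmt"
)

// ErrTerminateAgent is a local stand-in for worker.ErrTerminateAgent.
var ErrTerminateAgent = errors.New("agent should be terminated")

// openAPI is a hypothetical stand-in for steps 1-2: opening the API
// connection fails with ErrTerminateAgent when the credentials are bad.
func openAPI(credsValid bool) error {
	if !credsValid {
		return ErrTerminateAgent
	}
	return nil
}

// runAgent is a hypothetical model of steps 3-6: the error is treated as
// fatal, and ErrTerminateAgent in particular triggers the uninstall
// (removing the init services and deleting the agent data dir).
func runAgent(credsValid bool, uninstall func() error) error {
	err := openAPI(credsValid)
	if errors.Is(err, ErrTerminateAgent) {
		return uninstall()
	}
	return err
}

func main() {
	uninstalled := false
	err := runAgent(false, func() error {
		uninstalled = true // steps 5-6 would happen here
		return nil
	})
	fmt.Println("err:", err, "uninstalled:", uninstalled)
}
```

The point of the sketch is that the uninstall is a direct consequence of one specific error value, so any change to the clean-up behavior would hook in where that error is handled.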


Steps 1-4 are correct behavior since the agent cannot run without an API connection.  The question is whether or not 5 and 6 are desirable.  Let's condense that to what happens in 6 and break that down further:

A. the jujud init service is removed
B. the mongod service installed for jujud is removed
C. the agent data dir is deleted

A and B kind of make sense.  If the agent was terminated then it should stay off.  Uninstalling the services achieves that.  Regardless, we don't want jujud to try starting back up on its own, since we expect it will fail in the same way.  If that were not the case then we would certainly want to find a solution for juju to fix itself.  As things are now, at this point the machine is in a "lost" state and will stay that way until it gets manual intervention.

For C it's kind of the same story.  We want to avoid wasting instance resources.  The files in the agent data dir are for the sake of the agent, so we clean them up since the agent is dead.  If they are otherwise useful then that should be addressed.

I suppose it boils down to why you want A, B, or C to not happen.  Is it so that you have a chance to manually fix the problem and then revive the "dead" agent?  Given your bug report here, I expect that is the case.

First of all, I have yet to determine the history behind why we uninstall and clean up when we get worker.ErrTerminateAgent.  The original rationale should be considered before we make any changes here.

That said, here are some options:

1. support a per-machine DO_NOT_UNINSTALL or DO_NOT_CLEAN_UP setting and respect that in the machine agent code (step 5/6)
2. don't ever clean up
3. don't delete the data dir but do uninstall the services
4. "disable" the agent
  - disable but do not uninstall the services
  - move the data dir out of the way (e.g. into /home/ubuntu/...)
5. like 4, but also add a "juju enable-agent" command that will ssh to the instance, rebuild the agent dir, and re-enable the services
    (or add an "enable-juju" command on the machine to do that)
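
As a rough illustration of how options 1-4 relate, the sketch below models a per-machine setting that selects which of A, B, and C happen.  Everything here is hypothetical: CleanupPolicy, CleanupPlan, planFor, and the policy values are names made up for this example and do not exist in Juju.

```go
package main

import "fmt"

// CleanupPolicy is a hypothetical per-machine setting (option 1) that
// selects what happens after worker.ErrTerminateAgent.
type CleanupPolicy string

const (
	PolicyUninstall   CleanupPolicy = "uninstall"     // current behavior
	PolicyKeepAll     CleanupPolicy = "keep-all"      // option 2
	PolicyKeepDataDir CleanupPolicy = "keep-data-dir" // option 3
	PolicyDisable     CleanupPolicy = "disable"       // option 4
)

// CleanupPlan lists the actions the agent would take on termination.
type CleanupPlan struct {
	RemoveServices  bool // A and B: remove the jujud and mongod services
	DisableServices bool // option 4: disable the services but keep them
	DeleteDataDir   bool // C: delete the agent data dir
	MoveDataDir     bool // option 4: move the data dir out of the way
}

// planFor maps a policy to a plan. Unknown values fall back to the
// current behavior (uninstall everything).
func planFor(p CleanupPolicy) CleanupPlan {
	switch p {
	case PolicyKeepAll:
		return CleanupPlan{}
	case PolicyKeepDataDir:
		return CleanupPlan{RemoveServices: true}
	case PolicyDisable:
		return CleanupPlan{DisableServices: true, MoveDataDir: true}
	default:
		return CleanupPlan{RemoveServices: true, DeleteDataDir: true}
	}
}

func main() {
	policies := []CleanupPolicy{
		PolicyUninstall, PolicyKeepAll, PolicyKeepDataDir, PolicyDisable,
	}
	for _, p := range policies {
		fmt.Printf("%-14s %+v\n", p, planFor(p))
	}
}
```

Falling back to the current behavior for unknown values is just one possible choice; it keeps existing machines unchanged unless they opt in.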

These are just some ideas and would need more thought before we could move forward.  #1 would probably be the least invasive fix, so it might be easier to justify targeting in the short term, even if we want some other solution long-term.  All of this also depends on why you care about the clean-up behavior for a "dead" agent.