Machine agent randomly uninstalled itself

Bug #1827009 reported by Barry Price
This bug affects 8 people
Affects         Status        Importance  Assigned to                Milestone
Canonical Juju  Fix Released  Critical    Achilleas Anagnostopoulos
2.5             Fix Released  Critical    Achilleas Anagnostopoulos
2.6             Fix Released  Critical    Achilleas Anagnostopoulos

Bug Description

This is in a Juju 2.4.7 environment using AWS.

I noticed one of our machines was "down" in Juju status, with failed unit hooks also visible.

SSH'd to the unit to see what was up, and found that the jujud-machine-3 systemd service files no longer exist, and that /var/lib/juju/tools has also gone, likely among others.

The only unusual lines in jujud-machine-3.log are right at the end, with no apparent cause:

2019-04-27 05:47:15 INFO juju.agent uninstall.go:36 marking agent ready for uninstall
2019-04-27 05:47:15 INFO juju.worker.stateconfigwatcher manifold.go:119 tomb dying
2019-04-27 05:47:15 INFO juju.agent uninstall.go:47 agent already marked ready for uninstall
2019-04-27 05:47:15 INFO juju.cmd.jujud machine.go:1238 uninstalling agent
2019-04-27 05:47:16 INFO cmd supercommand.go:465 command finished

Revision history for this message
Barry Price (barryprice) wrote :

I've now come across this in another two units in a separate (Juju 2.5.1 on GCE) environment also, so this isn't AWS specific, nor is it 2.4 specific.

Revision history for this message
Richard Harding (rharding) wrote :

Can you validate the disk usage of the machines missing agents? There's another bug report going around which mentions that if the disk filled up, the agent was removed, and I'm wondering if we can get some hint/confirmation of what's going on.

Revision history for this message
Barry Price (barryprice) wrote :

As far as our logging shows, none of the three units ever had a root partition full enough to cause alerts (threshold for that is 80%).

It's possible that disk space filled up very briefly and that the same process then recovered the space quickly enough to pass without triggering an alert, but this doesn't seem likely to have happened on all three units.

Revision history for this message
Tim Penhey (thumper) wrote :

We need to remove the codepath where the agent decides to uninstall itself. This is dangerous, and as shown here and in other similar bugs, the agent can get removed in situations where it is clearly wrong to do so.

Changed in juju:
status: New → Triaged
importance: Undecided → High
milestone: none → 2.6.1
Changed in juju:
milestone: 2.6.1 → 2.6.2
Changed in juju:
milestone: 2.6.2 → 2.6.3
Revision history for this message
Richard Harding (rharding) wrote :

We've all agreed to stop uninstalling, so I'm marking this Critical and getting it targeted for releases.

Changed in juju:
importance: High → Critical
milestone: 2.6.3 → none
assignee: nobody → Achilleas Anagnostopoulos (achilleasa)
Revision history for this message
Barry Price (barryprice) wrote :

For the record, I saw this for the first time today on a controller unit (2.5.4) rather than within an application in a non-controller model.

Unfortunately it was a non-HA setup, and as well as /var/lib/juju/tools being removed, /var/lib/juju/agents and /var/lib/juju/db are also no more.

This pretty much leaves the environment in an unrecoverable state.

Disk usage peaked at around 80% full on a 50GB volume; I don't believe we ever ran out of space entirely.

Revision history for this message
John A Meinel (jameinel) wrote :

Is it possible to get the tail of the machine log where it decides to uninstall itself (possibly out of band if you don't want to attach it to a public bug), so that we might see why it thought it should be uninstalled?

Revision history for this message
John A Meinel (jameinel) wrote :

I believe the original change for this was because of
https://bugs.launchpad.net/bugs/1611453

Changed in juju:
milestone: none → 2.7-beta1
Revision history for this message
John A Meinel (jameinel) wrote :
Revision history for this message
Barry Price (barryprice) wrote :

^ (machine-0.log supplied to John directly)

Revision history for this message
John A Meinel (jameinel) wrote :

The log has these lines at the end:
2019-05-15 02:21:41 INFO juju.worker.httpserver worker.go:168 shutting down HTTP server
2019-05-15 02:22:15 INFO juju.agent uninstall.go:47 agent already marked ready for uninstall
2019-05-15 02:22:15 INFO juju.cmd.jujud machine.go:1224 uninstalling agent

That log line is found in:
func CanUninstall(a Agent) bool {
 if _, err := os.Stat(uninstallFile(a)); err != nil {
  logger.Debugf("agent not marked ready for uninstall")
  return false
 }
 logger.Infof("agent already marked ready for uninstall")
 return true
}

and that function is only called in
func (a *MachineAgent) uninstallAgent() error {
 // We should only uninstall if the uninstall file is present.
 if !agent.CanUninstall(a) {
  logger.Infof("ignoring uninstall request")
  return nil
 }
 logger.Infof("uninstalling agent")

which is only called by:
func (a *MachineAgent) Run(*cmd.Context) (err error) {
...
 close(a.workersStarted)
 err = a.runner.Wait()
 switch errors.Cause(err) {
 case jworker.ErrTerminateAgent:
  err = a.uninstallAgent()

However, there are 30 or so references to ErrTerminateAgent. Trying to restrict it to the ones that actually return that error:
agent/machine.go:693
 if err := a.setupContainerSupport(runner, apiConn, agentConfig); err != nil {
  cause := errors.Cause(err)
  if params.IsCodeDead(cause) || cause == jworker.ErrTerminateAgent {
   return nil, jworker.ErrTerminateAgent
  }
  return nil, errors.Annotate(err, "setting up container support")
 }

...
func (a *MachineAgent) setControllerNetworkConfig(apiConn api.Connection) error {
 // TODO(bootstrap): do we need this for k8s???
 machine, err := a.machine(apiConn)
 if errors.IsNotFound(err) || err == nil && machine.Life() == params.Dead {
  return jworker.ErrTerminateAgent
 }

...
func (a *MachineAgent) updateSupportedContainers(
 runner *worker.Runner,
 st api.Connection,
 containers []instance.ContainerType,
 agentConfig agent.Config,
) error {
 pr := apiprovisioner.NewState(st)
 tag := agentConfig.Tag().(names.MachineTag)
 result, err := pr.Machines(tag)
 if err != nil {
  return errors.Annotatef(err, "cannot load machine %s from state", tag)
 }
 if len(result) != 1 {
  return errors.Errorf("expected 1 result, got %d", len(result))
 }
 if errors.IsNotFound(result[0].Err) || (result[0].Err == nil && result[0].Machine.Life() == params.Dead) {
  return jworker.ErrTerminateAgent
 }
...
func openStatePool(
 agentConfig agent.Config,
 dialOpts mongo.DialOpts,
 runTransactionObserver state.RunTransactionObserverFunc,
) (_ *state.StatePool, _ *state.Machine, err error) {

...
 st := pool.SystemState()
 m0, err := st.FindEntity(agentConfig.Tag())
 if err != nil {
  if errors.IsNotFound(err) {
   err = jworker.ErrTerminateAgent
  }
  return nil, nil, err
 }
 m := m0.(*state.Machine)
 if m.Life() == state.Dead {
  return nil, nil, jworker.ErrTerminateAgent
 }
...
 if !m.CheckProvisioned(agentConfig.Nonce()) {
  // The agent is running on a different machine to the one it
  // should be according to state. It must stop immediately.
  logger.Errorf("running machine %v agent on inappropriate instance", m)
  return nil, nil, jworker.ErrTerminateAgent
 }
^- not this one because we don't see that error message

machiner.go:
func (mr *Machiner) SetUp() (watcher.No...

Read more...

Revision history for this message
John A Meinel (jameinel) wrote :

Other interesting lines:
2019-05-15 02:21:35 INFO juju.state.presence presence.go:194 watcher loop failed: write tcp 127.0.0.1:38766->127.0.0.1:37017: i/o timeout
...
2019-05-15 02:21:35 INFO juju.state multiwatcher.go:212 store manager loop failed: get unit "neutron-openvswitch/29": cannot get unit "neutron-openvswitch/29": write tcp 127.0.0.1:39172->127.0.0.1:37017: i/o timeout

All of that indicates the database had stopped responding to queries.
2019-05-15 02:21:36 ERROR juju.worker.dependency engine.go:636 "is-responsible-flag" manifold worker returned unexpected error: lease manager stopped
...
2019-05-15 02:21:37 ERROR juju.worker.dependency engine.go:636 "is-responsible-flag" manifold worker returned unexpected error: lease manager stopped

Within one second, that line is repeated 409 times.

While that is happening, we do see a line like:
2019-05-15 02:21:37 INFO juju.apiserver.connection request_notifier.go:96 agent login: unit-nrpe-physical-24 for 24ecac33-8390-4ad6-80b6-6394a88c74e6

So some sort of login is working.
2019-05-15 02:21:38 WARNING juju.environs.config config.go:1570 unknown config field "tools-metadata-url"

^- this is repeated *many* times. The config field is supposed to be "agent-metadata-url" (it was migrated from "tools-metadata-url" in the past).

...
2019-05-15 02:21:40 INFO juju.apiserver.connection request_notifier.go:125 agent disconnected: unit-neutron-openvswitch-26 for 24ecac33-8390-4ad6-80b6-6394a88c74e6
2019-05-15 02:21:40 INFO juju.agent uninstall.go:36 marking agent ready for uninstall
2019-05-15 02:21:40 INFO juju.worker.stateconfigwatcher manifold.go:119 tomb dying

^- That is the call to SetCanUninstall
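
For context, SetCanUninstall and CanUninstall form a simple marker-file handshake. A minimal, self-contained sketch of that mechanism (reconstructed from the uninstall.go log messages quoted above, not a verbatim excerpt of the Juju source; the marker name and paths here are assumptions):

package main

import (
    "fmt"
    "os"
    "path/filepath"
)

// markerName mirrors the uninstall marker implied by the log lines above;
// dataDir stands in for the agent's data dir (/var/lib/juju on a real machine).
const markerName = "uninstall-agent"

// setCanUninstall drops the marker file ("marking agent ready for uninstall").
func setCanUninstall(dataDir string) error {
    return os.WriteFile(filepath.Join(dataDir, markerName), nil, 0644)
}

// canUninstall reports whether the marker exists; in the snippets above,
// uninstallAgent only proceeds when it does ("agent already marked ready
// for uninstall").
func canUninstall(dataDir string) bool {
    _, err := os.Stat(filepath.Join(dataDir, markerName))
    return err == nil
}

func main() {
    dir, err := os.MkdirTemp("", "uninstall-demo")
    if err != nil {
        panic(err)
    }
    defer os.RemoveAll(dir)

    fmt.Println(canUninstall(dir)) // false: nothing has asked for an uninstall yet
    if err := setCanUninstall(dir); err != nil {
        panic(err)
    }
    fmt.Println(canUninstall(dir)) // true: a later ErrTerminateAgent shutdown would now uninstall
}

The important property is that the marker is persistent: once anything writes it, the next ErrTerminateAgent shutdown becomes a full uninstall rather than a plain stop.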

But note that there are two ways that we can get to SetCanUninstall. Namely:
 connectFilter := func(err error) error {
  cause := errors.Cause(err)
  if cause == apicaller.ErrConnectImpossible {
   err2 := coreagent.SetCanUninstall(config.Agent)
   if err2 != nil {
    return errors.Trace(err2)
   }
   return jworker.ErrTerminateAgent
  } else if cause == apicaller.ErrChangedPassword {
   return dependency.ErrBounce
  }
  return err
 }

and
 w, err := NewMachiner(Config{
  MachineAccessor: accessor,
  Tag: tag.(names.MachineTag),
  ClearMachineAddressesOnStart: ignoreMachineAddresses,
  NotifyMachineDead: func() error {
   return agent.SetCanUninstall(a)
  },
 })

The latter is if the Machiner notices that the database record is actually flagged as Dead.

The only caller of NotifyMachineDead is Machiner.Handle, which should only invoke it after it has called Machine.EnsureDead(). That means it isn't a transitory failure; something really is saying "this machine should be removed".
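
To make that ordering concrete, here is a small self-contained sketch of the flow just described (a paraphrase for illustration, not Juju source; the names and types are simplified stand-ins):

package main

import (
    "errors"
    "fmt"
)

// errTerminateAgent stands in for jworker.ErrTerminateAgent.
var errTerminateAgent = errors.New("agent should be terminated")

// machine models just enough of the API-side machine for this sketch.
type machine struct{ life string }

// In this toy model, EnsureDead only succeeds once the record is already Dead.
func (m *machine) EnsureDead() error {
    if m.life != "dead" {
        return fmt.Errorf("machine is still %s", m.life)
    }
    return nil
}

// handleDead mirrors the claimed order of operations: EnsureDead first, then
// the NotifyMachineDead hook (agent.SetCanUninstall in Juju), and only then
// the terminate error that shuts the whole agent down.
func handleDead(m *machine, notifyMachineDead func() error) error {
    if err := m.EnsureDead(); err != nil {
        return err // a transitory failure stops here; no uninstall marker is written
    }
    if err := notifyMachineDead(); err != nil {
        return err
    }
    return errTerminateAgent
}

func main() {
    notify := func() error {
        fmt.Println("marking agent ready for uninstall")
        return nil
    }
    fmt.Println(handleDead(&machine{life: "alive"}, notify)) // machine is still alive
    fmt.Println(handleDead(&machine{life: "dead"}, notify))  // marker written, then terminate
}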

...
2019-05-15 02:21:40 INFO juju.apiserver.connection request_notifier.go:125 agent disconnected: unit-ubuntu-0 for 7cc4a184-7867-412d-8c06-9c9780fd26a1
2019-05-15 02:21:40 INFO juju.worker.machineundertaker undertaker.go:131 tearing down machine undertaker
2019-05-15 02:21:40 INFO juju.apiserver.connection request_notifier.go:96 agent login: unit-landscape-131 for 24ecac33-8390-4ad6-80b6-6394a88c74e6
...

Revision history for this message
John A Meinel (jameinel) wrote :

If this were the code path that triggers when we cannot log in, we would expect to see the error line logged here:
func ScaryConnect(a agent.Agent, apiOpen api.OpenFunc) (_ api.Connection, err error) {
 agentConfig := a.CurrentConfig()
 info, ok := agentConfig.APIInfo()
 if !ok {
  return nil, errors.New("API info not available")
 }
 oldPassword := agentConfig.OldPassword()

 defer func() {
  cause := errors.Cause(err)
  switch {
  case cause == apiagent.ErrDenied:
  case cause == errAgentEntityDead:
  case params.IsCodeUnauthorized(cause):
  case params.IsCodeNotProvisioned(cause):
  default:
   return
  }
  logger.Errorf("Failed to connect to controller: %v", err)
  err = ErrConnectImpossible
 }()

That is the only code path that yields ErrConnectImpossible, which means that we should see "Failed to connect to controller" in the logs.

And I *don't* see that in the log file. The log file *does* cover many hours, so it seems unlikely that ErrConnectImpossible happened so much earlier that it took from 17:07 until 02:21 to actually respond to it.

Revision history for this message
John A Meinel (jameinel) wrote :

Now, we did confirm that "juju remove-machine -m controller 0" will tell you "cannot remove machine 0 as it is the only controller machine".

But we are seeing that machine 0 is getting removed and we don't see the line "Failed to connect" in the logs that we have access to.

Revision history for this message
Achilleas Anagnostopoulos (achilleasa) wrote :

Is it possible to get CodeNotProvisioned in ScaryConnect if for some reason the state of a machine has been corrupted?

There are quite a few errors.NotProvisioned[f] calls in state/machine.go that could potentially emit that error code.

Revision history for this message
Achilleas Anagnostopoulos (achilleasa) wrote :

Ah, sorry, I missed the comment that the logs do not contain a "Failed to connect" line.

Changed in juju:
status: Triaged → In Progress
Revision history for this message
Ian Booth (wallyworld) wrote :

I've landed a fix to the 2.6 branch which removes the agent uninstall code entirely. Manual machine agents now simply stop when the machine is removed from Juju. We will need to add an additional juju command to explicitly clean up the agent and associated files on the manual machine.
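
For reference, a purely hypothetical sketch of the kind of cleanup such a command would have to perform on a manual machine, using the paths reported missing earlier in this bug (this is not the actual command or Juju's implementation, and the exact systemd unit path is an assumption):

package main

import (
    "fmt"
    "os"
)

func main() {
    // Hypothetical paths, taken from the files this bug reports as removed;
    // the real unit file location and directory set may differ per machine.
    paths := []string{
        "/etc/systemd/system/jujud-machine-3.service",
        "/var/lib/juju/tools",
        "/var/lib/juju/agents",
    }
    for _, p := range paths {
        if err := os.RemoveAll(p); err != nil {
            fmt.Fprintf(os.Stderr, "failed to remove %s: %v\n", p, err)
        }
    }
}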

Revision history for this message
Anastasia (anastasia-macmood) wrote :

Since the fixes have been merged into 2.6 and develop, I am marking this report accordingly.

Changed in juju:
status: In Progress → Fix Committed
Tim Penhey (thumper)
Changed in juju:
status: Fix Committed → Fix Released