Bug #1827009 “Machine agent randomly uninstalled itself” : Bugs : Canonical Juju

Revision history for this message

Barry Price (barryprice) wrote on 2019-04-30:

#1

I've now come across this in another two units in a separate (Juju 2.5.1 on GCE) environment also, so this isn't AWS specific, nor is it 2.4 specific.

Revision history for this message

Richard Harding (rharding) wrote on 2019-04-30:

#2

Can you validate the disk usage of the machines missing agents. There's another bug report going around that mentions that if the disk filled up the agent was removed and I'm wondering if we can get some hint/confirmation of what's going on.

Revision history for this message

Barry Price (barryprice) wrote on 2019-05-03:

#3

As far as our logging shows, none of the three units ever had a root partition full enough to cause alerts (threshold for that is 80%).

It's possible that disk space filled up very briefly and that the same process then recovered the space quickly enough to pass without triggering an alert, but this doesn't seem likely to have happened on all three units.

Revision history for this message

Tim Penhey (thumper) wrote on 2019-05-05:

#4

We need to remove the codepath where the agent decides to uninstall itself. This is dangerous, and as shown here and in other similar bugs, the agent can get removed in situations where it is clearly wrong to do so.

Changed in juju:
status:	New → Triaged
importance:	Undecided → High
milestone:	none → 2.6.1

Canonical Juju QA Bot (juju-qa-bot) on 2019-05-13

Changed in juju:
milestone:	2.6.1 → 2.6.2

Canonical Juju QA Bot (juju-qa-bot) on 2019-05-14

Changed in juju:
milestone:	2.6.2 → 2.6.3

Revision history for this message

Richard Harding (rharding) wrote on 2019-05-15:

#5

We've all agreed to stop uninstalling so marking critical and getting it targeted for releases.

Changed in juju:
importance:	High → Critical
milestone:	2.6.3 → none
assignee:	nobody → Achilleas Anagnostopoulos (achilleasa)

Revision history for this message

Barry Price (barryprice) wrote on 2019-05-15:

#6

For the record, I saw this for the first time today on a controller unit (2.5.4) rather than within an application in a non-controller model.

Unfortunately it was a non-HA setup, and as well as /var/lib/juju/tools being removed, /var/lib/juju/agents and /var/lib/juju/db are also no more.

This pretty much leaves the environment in an unrecoverable state.

Disk usage peaked at around 80% full on a 50GB volume, I don't believe we ever ran out of space entirely.

Revision history for this message

John A Meinel (jameinel) wrote on 2019-05-15:

#7

Is it possible to get the tail of the machine log where it decides to uninstall itself (possibly out of band if you don't want to attach it to a public bug), so that we might see why it thought it should be uninstalled?

Revision history for this message

John A Meinel (jameinel) wrote on 2019-05-15:

#8

I believe the original change for this was because of
https://bugs.launchpad.net/bugs/1611453

Changed in juju:
milestone:	none → 2.7-beta1

Revision history for this message

John A Meinel (jameinel) wrote on 2019-05-15:

#9

https://github.com/juju/juju/pull/10193 is a PR against 2.5

Revision history for this message

Barry Price (barryprice) wrote on 2019-05-15:

#10

^ (machine-0.log supplied to John directly)

Revision history for this message

John A Meinel (jameinel) wrote on 2019-05-15:

#11

Download full text (5.5 KiB)

The log has these lines at the end:
2019-05-15 02:21:41 INFO juju.worker.httpserver worker.go:168 shutting down HTTP server
2019-05-15 02:22:15 INFO juju.agent uninstall.go:47 agent already marked ready for uninstall
2019-05-15 02:22:15 INFO juju.cmd.jujud machine.go:1224 uninstalling agent

That log line is found in:
func CanUninstall(a Agent) bool {
if _, err := os.Stat(uninstallFile(a)); err != nil {
logger.Debugf("agent not marked ready for uninstall")
return false
}

and that function is only called in
func (a *MachineAgent) uninstallAgent() error {
// We should only uninstall if the uninstall file is present.
if !agent.CanUninstall(a) {
logger.Infof("ignoring uninstall request")
return nil
}
logger.Infof("uninstalling agent")

which is only called by:
func (a *MachineAgent) Run(*cmd.Context) (err error) {
...
close(a.workersStarted)
err = a.runner.Wait()
switch errors.Cause(err) {
case jworker.ErrTerminateAgent:
err = a.uninstallAgent()

However, there are 30 or so references to ErrTerminateAgent. Trying to restrict it to ones that are returning that error
agent/machine.go:693
if err := a.setupContainerSupport(runner, apiConn, agentConfig); err != nil {
  cause := errors.Cause(err)
  if params.IsCodeDead(cause) || cause == jworker.ErrTerminateAgent {
   return nil, jworker.ErrTerminateAgent
  }
  return nil, errors.Annotate(err, "setting up container support")
}

...
func (a *MachineAgent) setControllerNetworkConfig(apiConn api.Connection) error {
// TODO(bootstrap): do we need this for k8s???
machine, err := a.machine(apiConn)
if errors.IsNotFound(err) || err == nil && machine.Life() == params.Dead {
return jworker.ErrTerminateAgent
}

...
func (a *MachineAgent) updateSupportedContainers(
runner *worker.Runner,
st api.Connection,
containers []instance.ContainerType,
agentConfig agent.Config,
) error {
pr := apiprovisioner.NewState(st)
tag := agentConfig.Tag().(names.MachineTag)
result, err := pr.Machines(tag)
if err != nil {
  return errors.Annotatef(err, "cannot load machine %s from state", tag)
}
if len(result) != 1 {
  return errors.Errorf("expected 1 result, got %d", len(result))
}
if errors.IsNotFound(result[0].Err) || (result[0].Err == nil && result[0].Machine.Life() == params.Dead) {
  return jworker.ErrTerminateAgent
}
...
func openStatePool(
agentConfig agent.Config,
dialOpts mongo.DialOpts,
runTransactionObserver state.RunTransactionObserverFunc,
) (_ *state.StatePool, _ *state.Machine, err error) {

...
st := pool.SystemState()
m0, err := st.FindEntity(agentConfig.Tag())
if err != nil {
  if errors.IsNotFound(err) {
   err = jworker.ErrTerminateAgent
  }
  return nil, nil, err
}
m := m0.(*state.Machine)
if m.Life() == state.Dead {
  return nil, nil, jworker.ErrTerminateAgent
}
...
if !m.CheckProvisioned(agentConfig.Nonce()) {
  // The agent is running on a different machine to the one it
  // should be according to state. It must stop immediately.
  logger.Errorf("running machine %v agent on inappropriate instance", m)
  return nil, nil, jworker.ErrTerminateAgent
}
^- not this one because we don't see that error message

machiner.go:
func (mr *Machiner) SetUp() (watcher.No...

The log has these lines at the end:
2019-05-15 02:21:41 INFO juju.worker.httpserver worker.go:168 shutting down HTTP server
2019-05-15 02:22:15 INFO juju.agent uninstall.go:47 agent already marked ready for uninstall
2019-05-15 02:22:15 INFO juju.cmd.jujud machine.go:1224 uninstalling agent

That log line is found in:
func CanUninstall(a Agent) bool {
	if _, err := os.Stat(uninstallFile(a)); err != nil {
		logger.Debugf("agent not marked ready for uninstall")
		return false
	}

and that function is only called in
func (a *MachineAgent) uninstallAgent() error {
	// We should only uninstall if the uninstall file is present.
	if !agent.CanUninstall(a) {
		logger.Infof("ignoring uninstall request")
		return nil
	}
	logger.Infof("uninstalling agent")

which is only called by:
func (a *MachineAgent) Run(*cmd.Context) (err error) {
...
	close(a.workersStarted)
	err = a.runner.Wait()
	switch errors.Cause(err) {
	case jworker.ErrTerminateAgent:
		err = a.uninstallAgent()

However, there are 30 or so references to ErrTerminateAgent. Trying to restrict it to ones that are returning that error
agent/machine.go:693
	if err := a.setupContainerSupport(runner, apiConn, agentConfig); err != nil {
		cause := errors.Cause(err)
		if params.IsCodeDead(cause) || cause == jworker.ErrTerminateAgent {
			return nil, jworker.ErrTerminateAgent
		}
		return nil, errors.Annotate(err, "setting up container support")
	}

...
func (a *MachineAgent) setControllerNetworkConfig(apiConn api.Connection) error {
	// TODO(bootstrap): do we need this for k8s???
	machine, err := a.machine(apiConn)
	if errors.IsNotFound(err) || err == nil && machine.Life() == params.Dead {
		return jworker.ErrTerminateAgent
	}

...
func (a *MachineAgent) updateSupportedContainers(
	runner *worker.Runner,
	st api.Connection,
	containers []instance.ContainerType,
	agentConfig agent.Config,
) error {
	pr := apiprovisioner.NewState(st)
	tag := agentConfig.Tag().(names.MachineTag)
	result, err := pr.Machines(tag)
	if err != nil {
		return errors.Annotatef(err, "cannot load machine %s from state", tag)
	}
	if len(result) != 1 {
		return errors.Errorf("expected 1 result, got %d", len(result))
	}
	if errors.IsNotFound(result[0].Err) || (result[0].Err == nil && result[0].Machine.Life() == params.Dead) {
		return jworker.ErrTerminateAgent
	}
...
func openStatePool(
	agentConfig agent.Config,
	dialOpts mongo.DialOpts,
	runTransactionObserver state.RunTransactionObserverFunc,
) (_ *state.StatePool, _ *state.Machine, err error) {

...
	st := pool.SystemState()
	m0, err := st.FindEntity(agentConfig.Tag())
	if err != nil {
		if errors.IsNotFound(err) {
			err = jworker.ErrTerminateAgent
		}
		return nil, nil, err
	}
	m := m0.(*state.Machine)
	if m.Life() == state.Dead {
		return nil, nil, jworker.ErrTerminateAgent
	}
...
	if !m.CheckProvisioned(agentConfig.Nonce()) {
		// The agent is running on a different machine to the one it
		// should be according to state. It must stop immediately.
		logger.Errorf("running machine %v agent on inappropriate instance", m)
		return nil, nil, jworker.ErrTerminateAgent
	}
^- not this one because we don't see that error message

machiner.go:
func (mr *Machiner) SetUp() (watcher.NotifyWatcher, error) {
	// Find which machine we're responsible for.
	m, err := mr.config.MachineAccessor.Machine(mr.config.Tag)
	if params.IsCodeNotFoundOrCodeUnauthorized(err) {
		return nil, jworker.ErrTerminateAgent
...
func (mr *Machiner) Handle(_ <-chan struct{}) error {
	if err := mr.machine.Refresh(); params.IsCodeNotFoundOrCodeUnauthorized(err) {
		// NOTE(axw) we can distinguish between NotFound and CodeUnauthorized,
		// so we could call NotifyMachineDead here in case the agent failed to
		// call NotifyMachineDead directly after setting the machine Dead in
		// the first place. We're not doing that to be cautious: the machine
		// could be missing from state due to invalid global state.
		return jworker.ErrTerminateAgent
...
	if mr.config.NotifyMachineDead != nil {
		if err := mr.config.NotifyMachineDead(); err != nil {
			return errors.Annotate(err, "reporting machine death")
		}
	}
	return jworker.ErrTerminateAgent
...
manifolds.go:
	connectFilter := func(err error) error {
		cause := errors.Cause(err)
		if cause == apicaller.ErrConnectImpossible {
			err2 := coreagent.SetCanUninstall(config.Agent)
			if err2 != nil {
				return errors.Trace(err2)
			}
			return jworker.ErrTerminateAgent

Note there are 2 places where we call SetCanUninstall which is the only way we actually delete the code.
The above code
and this one
...
	accessor := APIMachineAccessor{apimachiner.NewState(apiCaller)}
	w, err := NewMachiner(Config{
		MachineAccessor:              accessor,
		Tag:                          tag.(names.MachineTag),
		ClearMachineAddressesOnStart: ignoreMachineAddresses,
		NotifyMachineDead: func() error {
			return agent.SetCanUninstall(a)
		},
	})

...
func (u *Uniter) stopUnitError() error {
	logger.Debugf("u.modelType: %s", u.modelType)
	if u.modelType == model.CAAS {
		return ErrCAASUnitDead
	}
	return jworker.ErrTerminateAgent
}

...
func (w *RemoteStateWatcher) setUp(unitTag names.UnitTag) error {
	// TODO(axw) move this logic
	var err error
	defer func() {
		cause := errors.Cause(err)
		if params.IsCodeNotFoundOrCodeUnauthorized(cause) {
			// We only want to terminate the agent for IAAS models.
			if w.modelType == model.IAAS {
				err = jworker.ErrTerminateAgent
			}
		}
	}()

terminationworker/worker.go:
func (w *terminationWorker) loop(c <-chan os.Signal) (err error) {
	select {
	case <-c:
		return jworker.ErrTerminateAgent
	case <-w.tomb.Dying():
		return tomb.ErrDying
	}
}

That seems to be all the references.

Revision history for this message

John A Meinel (jameinel) wrote on 2019-05-15:

#12

Other interesting lines:
2019-05-15 02:21:35 INFO juju.state.presence presence.go:194 watcher loop failed: write tcp 127.0.0.1:38766->127.0.0.1:37017: i/o timeout
...
2019-05-15 02:21:35 INFO juju.state multiwatcher.go:212 store manager loop failed: get unit "neutron-openvswitch/29": cannot get unit "neutron-openvswitch/29": write tcp 127.0.0.1:39172->127.0.0.1:37017: i/o timeout

all of that indicates the database started to stop responding to queries
2019-05-15 02:21:36 ERROR juju.worker.dependency engine.go:636 "is-responsible-flag" manifold worker returned unexpected error: lease manager stopped
...
2019-05-15 02:21:37 ERROR juju.worker.dependency engine.go:636 "is-responsible-flag" manifold worker returned unexpected error: lease manager stopped

in 1 second that line is repeated 409 times.

while that is happening we do see a line like:
2019-05-15 02:21:37 INFO juju.apiserver.connection request_notifier.go:96 agent login: unit-nrpe-physical-24 for 24ecac33-8390-4ad6-80b6-6394a88c74e6

so some sort of login is working.
2019-05-15 02:21:38 WARNING juju.environs.config config.go:1570 unknown config field "tools-metadata-url"

^- this is repeated *many* times. The config field is supposed to be "agent-metadata-url". (but it was migrated from tools-metadata-url in the past.)

...
2019-05-15 02:21:40 INFO juju.apiserver.connection request_notifier.go:125 agent disconnected: unit-neutron-openvswitch-26 for 24ecac33-8390-4ad6-80b6-6394a88c74e6
2019-05-15 02:21:40 INFO juju.agent uninstall.go:36 marking agent ready for uninstall
2019-05-15 02:21:40 INFO juju.worker.stateconfigwatcher manifold.go:119 tomb dying

^- That is the call to SetCanUninstall

But note that there are 2 ways that we get SetCanUninstall. Namely:
connectFilter := func(err error) error {
  cause := errors.Cause(err)
  if cause == apicaller.ErrConnectImpossible {
   err2 := coreagent.SetCanUninstall(config.Agent)
   if err2 != nil {
    return errors.Trace(err2)
   }
   return jworker.ErrTerminateAgent
  } else if cause == apicaller.ErrChangedPassword {
   return dependency.ErrBounce
  }
  return err
}

and
w, err := NewMachiner(Config{
  MachineAccessor: accessor,
  Tag: tag.(names.MachineTag),
  ClearMachineAddressesOnStart: ignoreMachineAddresses,
  NotifyMachineDead: func() error {
   return agent.SetCanUninstall(a)
  },
})

The latter is if the Machiner notices that the database record is actually flagged as Dead.

The only caller to NotifyMachineDead is in Machiner.Handle which should only happen after it has called Machine.EnsureDead() which means it isn't a transitory failure, it really is something saying "this machine should be removed".

...
2019-05-15 02:21:40 INFO juju.apiserver.connection request_notifier.go:125 agent disconnected: unit-ubuntu-0 for 7cc4a184-7867-412d-8c06-9c9780fd26a1
2019-05-15 02:21:40 INFO juju.worker.machineundertaker undertaker.go:131 tearing down machine undertaker
2019-05-15 02:21:40 INFO juju.apiserver.connection request_notifier.go:96 agent login: unit-landscape-131 for 24ecac33-8390-4ad6-80b6-6394a88c74e6
...

Other interesting lines:
2019-05-15 02:21:35 INFO juju.state.presence presence.go:194 watcher loop failed: write tcp 127.0.0.1:38766->127.0.0.1:37017: i/o timeout
...
2019-05-15 02:21:35 INFO juju.state multiwatcher.go:212 store manager loop failed: get unit "neutron-openvswitch/29": cannot get unit "neutron-openvswitch/29": write tcp 127.0.0.1:39172->127.0.0.1:37017: i/o timeout

all of that indicates the database started to stop responding to queries
2019-05-15 02:21:36 ERROR juju.worker.dependency engine.go:636 "is-responsible-flag" manifold worker returned unexpected error: lease manager stopped
...
2019-05-15 02:21:37 ERROR juju.worker.dependency engine.go:636 "is-responsible-flag" manifold worker returned unexpected error: lease manager stopped

in 1 second that line is repeated 409 times.

while that is happening we do see a line like:
2019-05-15 02:21:37 INFO juju.apiserver.connection request_notifier.go:96 agent login: unit-nrpe-physical-24 for 24ecac33-8390-4ad6-80b6-6394a88c74e6

so some sort of login is working.
2019-05-15 02:21:38 WARNING juju.environs.config config.go:1570 unknown config field "tools-metadata-url"

^- this is repeated *many* times. The config field is supposed to be "agent-metadata-url". (but it was migrated from tools-metadata-url in the past.)

...
2019-05-15 02:21:40 INFO juju.apiserver.connection request_notifier.go:125 agent disconnected: unit-neutron-openvswitch-26 for 24ecac33-8390-4ad6-80b6-6394a88c74e6
2019-05-15 02:21:40 INFO juju.agent uninstall.go:36 marking agent ready for uninstall
2019-05-15 02:21:40 INFO juju.worker.stateconfigwatcher manifold.go:119 tomb dying

^- That is the call to SetCanUninstall

But note that there are 2 ways that we get SetCanUninstall. Namely:
	connectFilter := func(err error) error {
		cause := errors.Cause(err)
		if cause == apicaller.ErrConnectImpossible {
			err2 := coreagent.SetCanUninstall(config.Agent)
			if err2 != nil {
				return errors.Trace(err2)
			}
			return jworker.ErrTerminateAgent
		} else if cause == apicaller.ErrChangedPassword {
			return dependency.ErrBounce
		}
		return err
	}

and
	w, err := NewMachiner(Config{
		MachineAccessor:              accessor,
		Tag:                          tag.(names.MachineTag),
		ClearMachineAddressesOnStart: ignoreMachineAddresses,
		NotifyMachineDead: func() error {
			return agent.SetCanUninstall(a)
		},
	})

The latter is if the Machiner notices that the database record is actually flagged as Dead.

The only caller to NotifyMachineDead is in Machiner.Handle which should only happen after it has called Machine.EnsureDead() which means it isn't a transitory failure, it really is something saying "this machine should be removed".

...
2019-05-15 02:21:40 INFO juju.apiserver.connection request_notifier.go:125 agent disconnected: unit-ubuntu-0 for 7cc4a184-7867-412d-8c06-9c9780fd26a1
2019-05-15 02:21:40 INFO juju.worker.machineundertaker undertaker.go:131 tearing down machine undertaker
2019-05-15 02:21:40 INFO juju.apiserver.connection request_notifier.go:96 agent login: unit-landscape-131 for 24ecac33-8390-4ad6-80b6-6394a88c74e6
...

Revision history for this message

John A Meinel (jameinel) wrote on 2019-05-15:

#13

If this was the code that triggers when we cannot login, we would expect to see a line like:
func ScaryConnect(a agent.Agent, apiOpen api.OpenFunc) (_ api.Connection, err error) {
agentConfig := a.CurrentConfig()
info, ok := agentConfig.APIInfo()
if !ok {
return nil, errors.New("API info not available")
}
oldPassword := agentConfig.OldPassword()

defer func() {
  cause := errors.Cause(err)
  switch {
  case cause == apiagent.ErrDenied:
  case cause == errAgentEntityDead:
  case params.IsCodeUnauthorized(cause):
  case params.IsCodeNotProvisioned(cause):
  default:
   return
  }
  logger.Errorf("Failed to connect to controller: %v", err)
  err = ErrConnectImpossible
}()

That is the only code path that yields ErrConnectImpossible
which means that we should see Failed to connect to controller.

And I *don't* see that in the log file.
And the log file *does* cover many hours, so it would be unlikely that the ErrConnectImpossible happened so much earlier that we took from 17:07 until 02:21 to actually respond to it.

Revision history for this message

John A Meinel (jameinel) wrote on 2019-05-15:

#14

Now, we did confirm that "juju remove-machine -m controller 0" will tell you "cannot remove machine 0 as it is the only controller machine"

But we are seeing that machine 0 is getting removed and we don't see the line "Failed to connect" in the logs that we have access to.

Revision history for this message

Achilleas Anagnostopoulos (achilleasa) wrote on 2019-05-15:

#15

Is it possible to get CodeNotProvisioned in ScaryConnect if for some
reason the state of a machine has been corrupted?

There are quite a few errors.NotProvisioned[f] calls in state/machine.go
that could potentially emit that error code.

Revision history for this message

Achilleas Anagnostopoulos (achilleasa) wrote on 2019-05-15:

#16

Ah, sorry missed the comment that the logs do not contain a "Failed to connect" line.

Achilleas Anagnostopoulos (achilleasa) on 2019-05-29

Changed in juju:
status:	Triaged → In Progress

Revision history for this message

Ian Booth (wallyworld) wrote on 2019-06-06:

#17

I've landed a fix to 2.6 branch which removes the agent uninstall code entirely. Manual machine agents now simply stop when the machine is removed from juju. We need to add an additional juju command to explicitly clean up the agent and associated files from the manual machine.

Revision history for this message

Anastasia (anastasia-macmood) wrote on 2019-06-14:

#18

Since the fixes have been merged into 2.6 and develop, I am marking this report accordingly.

Changed in juju:
status:	In Progress → Fix Committed

Tim Penhey (thumper) on 2019-12-06

Changed in juju:
status:	Fix Committed → Fix Released

Canonical Juju

Machine agent randomly uninstalled itself

Bug Description

Duplicates of this bug

Other bug subscribers

Remote bug watches

	Status	Importance	Assigned to	Milestone
Canonical Juju	Fix Released	Critical	Achilleas Anagnostopoulos	Canonical Juju 2.7-beta1
2.5	Fix Released	Critical	Achilleas Anagnostopoulos	Canonical Juju 2.5.8
2.6	Fix Released	Critical	Achilleas Anagnostopoulos	Canonical Juju 2.6.4