Comment 12 for bug 1827009

John A Meinel (jameinel) wrote :

Other interesting lines:
2019-05-15 02:21:35 INFO juju.state.presence presence.go:194 watcher loop failed: write tcp 127.0.0.1:38766->127.0.0.1:37017: i/o timeout
...
2019-05-15 02:21:35 INFO juju.state multiwatcher.go:212 store manager loop failed: get unit "neutron-openvswitch/29": cannot get unit "neutron-openvswitch/29": write tcp 127.0.0.1:39172->127.0.0.1:37017: i/o timeout

All of that indicates that the database had started failing to respond to queries.
2019-05-15 02:21:36 ERROR juju.worker.dependency engine.go:636 "is-responsible-flag" manifold worker returned unexpected error: lease manager stopped
...
2019-05-15 02:21:37 ERROR juju.worker.dependency engine.go:636 "is-responsible-flag" manifold worker returned unexpected error: lease manager stopped

Within 1 second that line is repeated 409 times.

While that is happening we do see a line like:
2019-05-15 02:21:37 INFO juju.apiserver.connection request_notifier.go:96 agent login: unit-nrpe-physical-24 for 24ecac33-8390-4ad6-80b6-6394a88c74e6

So some sort of login is working.
2019-05-15 02:21:38 WARNING juju.environs.config config.go:1570 unknown config field "tools-metadata-url"

^- This is repeated *many* times. The config field is supposed to be "agent-metadata-url" (it was migrated from "tools-metadata-url" in the past).

...
2019-05-15 02:21:40 INFO juju.apiserver.connection request_notifier.go:125 agent disconnected: unit-neutron-openvswitch-26 for 24ecac33-8390-4ad6-80b6-6394a88c74e6
2019-05-15 02:21:40 INFO juju.agent uninstall.go:36 marking agent ready for uninstall
2019-05-15 02:21:40 INFO juju.worker.stateconfigwatcher manifold.go:119 tomb dying

^- That is the call to SetCanUninstall

But note that there are 2 ways that we reach SetCanUninstall, namely:
 connectFilter := func(err error) error {
  cause := errors.Cause(err)
  if cause == apicaller.ErrConnectImpossible {
   err2 := coreagent.SetCanUninstall(config.Agent)
   if err2 != nil {
    return errors.Trace(err2)
   }
   return jworker.ErrTerminateAgent
  } else if cause == apicaller.ErrChangedPassword {
   return dependency.ErrBounce
  }
  return err
 }

and
 w, err := NewMachiner(Config{
  MachineAccessor: accessor,
  Tag: tag.(names.MachineTag),
  ClearMachineAddressesOnStart: ignoreMachineAddresses,
  NotifyMachineDead: func() error {
   return agent.SetCanUninstall(a)
  },
 })

The latter happens if the Machiner notices that the database record is actually flagged as Dead.

The only caller of NotifyMachineDead is Machiner.Handle, which should only call it after it has called Machine.EnsureDead(). That means it isn't a transitory failure; it really is something saying "this machine should be removed".

...
2019-05-15 02:21:40 INFO juju.apiserver.connection request_notifier.go:125 agent disconnected: unit-ubuntu-0 for 7cc4a184-7867-412d-8c06-9c9780fd26a1
2019-05-15 02:21:40 INFO juju.worker.machineundertaker undertaker.go:131 tearing down machine undertaker
2019-05-15 02:21:40 INFO juju.apiserver.connection request_notifier.go:96 agent login: unit-landscape-131 for 24ecac33-8390-4ad6-80b6-6394a88c74e6
...