peergrouper and HA a bit confused about who is part of the controller

Bug #1765387 reported by John A Meinel
This bug affects 1 person
Affects         Status        Importance  Assigned to
Canonical Juju  Fix Released  Medium      John A Meinel

Bug Description

This is in 2.4 beta after having created and killed a bunch of machines.

I also did

juju enable-ha; juju enable-ha; juju enable-ha

I ran it three times so that machines 5, 6 and 7 would be created, because 5 hadn't 'started' yet.

I've also issued a "juju remove-machine 6" but *not* with --force

At this point juju status says:
Machine  State    DNS           Inst id        Series  AZ  Message
1        started  10.16.17.211  juju-ebff95-1  xenial      Running
3        started  10.16.17.189  juju-ebff95-3  xenial      Running
6        stopped  10.16.17.22   juju-ebff95-6  xenial      Running
7        started  10.16.17.120  juju-ebff95-7  xenial      Running

debug-log says:
machine-6: 16:02:22 DEBUG juju.worker.machiner "machine-6" is now dying
machine-6: 16:02:22 DEBUG juju.worker.dependency "machiner" manifold worker stopped: machine-6 failed to set machine to dead: machine 6 is still a controller member
machine-6: 16:02:22 ERROR juju.worker.dependency "machiner" manifold worker returned unexpected error: machine-6 failed to set machine to dead: machine 6 is still a controller member
machine-6: 16:02:22 DEBUG juju.worker.dependency stack trace:
machine 6 is still a controller member
github.com/juju/juju/worker/machiner/machiner.go:194: machine-6 failed to set machine to dead
machine-3: 16:02:23 DEBUG juju.worker.peergrouper controller machines in state: []string{"1", "3", "6", "7"}

show-controller says:
  controller-machines:
    "1":
      instance-id: juju-ebff95-1
      ha-status: ha-enabled
    "3":
      instance-id: juju-ebff95-3
      ha-status: ha-enabled
    "7":
      instance-id: juju-ebff95-7
      ha-status: ha-enabled

I don't know why 'show-controller' doesn't think that machine 6 is part of the controller. Certainly HasVote is false for it, and I guess WantsVote is false as well? From the machine document:
        "life" : 1, # Dying
        "jobs" : [
                1,
                2 # JobManageModel
        ],
        "novote" : true,
        "hasvote" : false,

Maybe we filter out nodes with "novote".

Ah, I think I know why it is not being cleaned up:
 for _, removedTracker := range removed {
  if removedTracker.stm.Life() != state.Alive {
   logger.Debugf("removing dying controller machine %s", removedTracker.Id())
   if err := w.config.State.RemoveControllerMachine(removedTracker.stm); err != nil {
    logger.Errorf("failed to remove dying machine as a controller after removing its vote: %v", err)
   }
  } else {
   logger.Debugf("vote removed from %v but machine is %s", removedTracker.Id(), state.Alive)
  }
 }

but 'removed' is detected with:

 for id, hasVote := range desired.machineVoting {
  m := info.machines[id]
  switch {
  case hasVote && !m.stm.HasVote():
   added = append(added, m)
  case !hasVote && m.stm.HasVote():
   removed = append(removed, m)
  }
 }

^- if !hasVote && m.stm.HasVote()

This machine never had the vote, because it came up already stripped of the vote due to the multiple "enable-ha" calls issued before it had even started.
As such it never lands in 'removed': it did not lose the vote, it simply never had it.
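The gap can be reproduced in isolation. The following is a minimal sketch, not the peergrouper code itself: the `machine` type and `diffVotes` function are hypothetical stand-ins that only mirror the shape of the switch quoted above.

```go
package main

import "fmt"

// machine is a hypothetical stand-in for the peergrouper's machine tracker.
type machine struct {
	id      string
	alive   bool
	hasVote bool // the vote currently recorded for the machine
}

// diffVotes mirrors the shape of the switch quoted above: a machine is only
// reported when its desired vote differs from its current vote.
func diffVotes(desired map[string]bool, machines map[string]machine) (added, removed []string) {
	for id, wantVote := range desired {
		m := machines[id]
		switch {
		case wantVote && !m.hasVote:
			added = append(added, id)
		case !wantVote && m.hasVote:
			removed = append(removed, id)
		}
	}
	return added, removed
}

func main() {
	machines := map[string]machine{
		"1": {id: "1", alive: true, hasVote: true},
		"3": {id: "3", alive: true, hasVote: true},
		"6": {id: "6", alive: false, hasVote: false}, // dying, never had the vote
		"7": {id: "7", alive: true, hasVote: true},
	}
	desired := map[string]bool{"1": true, "3": true, "6": false, "7": true}

	added, removed := diffVotes(desired, machines)
	// Machine 6 matches neither case, so both slices are empty and the
	// cleanup loop over 'removed' never sees it.
	fmt.Println(len(added), len(removed))
}
```

Because machine 6 is both "not desired as a voter" and "not currently a voter", neither case fires and it is silently skipped.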

John A Meinel (jameinel)
Changed in juju:
assignee: nobody → John A Meinel (jameinel)
status: Triaged → In Progress
milestone: none → 2.4-beta2
John A Meinel (jameinel) wrote :

I still don't know why "juju show-controller" isn't showing machine 6, even though it does have JobManageModel. I'm guessing it's because WantsVote() == false.

Regardless, you can get into this situation in a much easier way.

$ juju bootstrap
$ juju enable-ha
$ juju remove-machine 2

At this point, both machine 1 and machine 2 will lose their vote, because we keep an odd number of voters. But we don't delete machine 1 because we expect it to participate later.
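The vote arithmetic in this repro can be pictured with a small sketch. `maxVoters` is a hypothetical helper, not a function from the peergrouper, and it assumes only that the voter count is never allowed to be even.

```go
package main

import "fmt"

// maxVoters returns how many of the given candidates may hold a vote when
// the voter count must never be even. Hypothetical helper for illustration.
func maxVoters(candidates int) int {
	if candidates <= 0 {
		return 0
	}
	if candidates%2 == 0 {
		return candidates - 1
	}
	return candidates
}

func main() {
	// After enable-ha there are three candidates: machines 0, 1 and 2.
	fmt.Println(maxVoters(3))
	// After remove-machine 2 only machines 0 and 1 remain as candidates,
	// so a single voter is kept and machine 1 loses its vote as well.
	fmt.Println(maxVoters(2))
}
```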

If you then do it again with:
$ juju remove-machine 1

Then we will *fail* to reap machine 1.

This is because the check we had only iterated over machines that had *just* lost their vote.
We need to iterate over all machines and check whether each one is no longer Alive and no longer has its vote.
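A minimal sketch of that check, assuming simplified types: `life`, `controllerMachine` and `reapable` are hypothetical names for illustration and are not taken from the committed fix.

```go
package main

import "fmt"

// life is a hypothetical stand-in for the machine lifecycle value.
type life int

const (
	alive life = iota
	dying
	dead
)

// controllerMachine is a hypothetical, simplified view of a controller machine.
type controllerMachine struct {
	id      string
	life    life
	hasVote bool
}

// reapable returns the ids of every machine that is no longer alive and
// holds no vote, whether or not its vote changed in the current pass.
func reapable(machines []controllerMachine) []string {
	var ids []string
	for _, m := range machines {
		if m.life != alive && !m.hasVote {
			ids = append(ids, m.id)
		}
	}
	return ids
}

func main() {
	machines := []controllerMachine{
		{id: "1", life: alive, hasVote: true},
		{id: "3", life: alive, hasVote: true},
		{id: "6", life: dying, hasVote: false}, // never had the vote
		{id: "7", life: alive, hasVote: true},
	}
	fmt.Println(reapable(machines)) // prints [6]
}
```

Scanning every machine rather than only the ones in 'removed' means a dying machine is picked up even when its vote was stripped in an earlier pass, or was never granted at all.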

John A Meinel (jameinel) wrote :
John A Meinel (jameinel)
Changed in juju:
status: In Progress → Fix Committed
Changed in juju:
status: Fix Committed → Fix Released