peergrouper and HA a bit confused about who is part of the controller
Affects | Status | Importance | Assigned to | Milestone
---|---|---|---|---
Canonical Juju | Fix Released | Medium | John A Meinel |
Bug Description
This is in 2.4 beta after having created and killed a bunch of machines.
I also did
juju enable-ha; juju enable-ha; juju enable-ha
so that we would create machines 5, 6, and 7, because 5 hadn't 'started' yet.
I've also issued a "juju remove-machine 6" but *not* with --force
At this point juju status says:
Machine State DNS Inst id Series AZ Message
1 started 10.16.17.211 juju-ebff95-1 xenial Running
3 started 10.16.17.189 juju-ebff95-3 xenial Running
6 stopped 10.16.17.22 juju-ebff95-6 xenial Running
7 started 10.16.17.120 juju-ebff95-7 xenial Running
debug-log says:
machine-6: 16:02:22 DEBUG juju.worker.
machine-6: 16:02:22 DEBUG juju.worker.
machine-6: 16:02:22 ERROR juju.worker.
machine-6: 16:02:22 DEBUG juju.worker.
machine 6 is still a controller member
github.
machine-3: 16:02:23 DEBUG juju.worker.
show-controller says:
controller-machines:
  "1":
    instance-id: juju-ebff95-1
    ha-status: ha-enabled
  "3":
    instance-id: juju-ebff95-3
    ha-status: ha-enabled
  "7":
    instance-id: juju-ebff95-7
    ha-status: ha-enabled
I don't know why 'show-controller' doesn't think that machine 6 is part of the controller. Certainly it doesn't have the vote (HasVote is false), and I guess it doesn't want the vote (WantsVote is false) either:
"life" : 1, # Dying
"jobs" : [
1,
2 # JobManageModel
],
"novote" : true,
"hasvote" : false,
Maybe we filter out nodes with "novote".
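As a guess at what that filtering looks like, here's a hypothetical, stand-alone sketch; the names (machineInfo, listedMachines, wantsVote) are mine, not the actual show-controller implementation:

package main

import "fmt"

// machineInfo is a stand-in for whatever the controller facade returns;
// the real Juju types differ.
type machineInfo struct {
    id        string
    wantsVote bool
}

// listedMachines sketches the suspected filter: only machines that still
// want the vote show up under controller-machines, so machine 6
// (novote: true) would be dropped.
func listedMachines(all []machineInfo) []string {
    var ids []string
    for _, m := range all {
        if m.wantsVote {
            ids = append(ids, m.id)
        }
    }
    return ids
}

func main() {
    all := []machineInfo{{"1", true}, {"3", true}, {"6", false}, {"7", true}}
    fmt.Println(listedMachines(all)) // [1 3 7]
}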
Ah, I think I know why it is not being cleaned up:
for _, removedTracker := range removed {
    if removedTracker.
        logger.
        if err := w.config.
            logger.
        }
    } else {
        logger.
    }
}
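For readability, here's a rough paraphrase of what that (truncated) loop appears to do; the identifiers (tracker, dying, removeMachine) are placeholders I've made up, not the actual Juju source:

package main

import "log"

// tracker is a placeholder for the peergrouper's machine tracker type.
type tracker struct {
    id    string
    dying bool
}

// cleanupRemoved: a machine is only reaped if it is on the 'removed' list
// (it just lost its vote) AND it is no longer alive; otherwise it is only
// logged and left in place.
func cleanupRemoved(removed []tracker, removeMachine func(string) error) {
    for _, t := range removed {
        if t.dying {
            log.Printf("removing dying controller machine %s", t.id)
            if err := removeMachine(t.id); err != nil {
                log.Printf("failed to remove machine %s: %v", t.id, err)
            }
        } else {
            log.Printf("machine %s lost its vote but is still alive; leaving it", t.id)
        }
    }
}

func main() {
    cleanupRemoved([]tracker{{id: "6", dying: true}}, func(id string) error { return nil })
}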
but 'removed' is detected with:
for id, hasVote := range desired.
    m := info.machines[id]
    switch {
    case hasVote && !m.stm.HasVote():
        added = append(added, m)
    case !hasVote && m.stm.HasVote():
        removed = append(removed, m)
    }
}
^- if !hasVote && m.stm.HasVote()
This machine never had the vote, because it came up stripped of the vote due to multiple "enable-ha" calls before it even started.
As such it never lands on the 'removed' list: it is Dying, but it never had the vote to lose, so `!hasVote && m.stm.HasVote()` never matches.
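To make the gap concrete, here's a tiny stand-alone sketch of that switch with machine 6's values plugged in (assuming the desired vote is false and HasVote() is false, matching the novote/hasvote fields above):

package main

import "fmt"

func main() {
    // Machine 6: enable-ha was re-run before it started, so the desired
    // state no longer wants it voting, and it never actually got the vote.
    hasVoteDesired := false // the 'hasVote' value from 'desired'
    hasVoteCurrent := false // m.stm.HasVote()

    switch {
    case hasVoteDesired && !hasVoteCurrent:
        fmt.Println("added")
    case !hasVoteDesired && hasVoteCurrent:
        fmt.Println("removed")
    default:
        // Neither case matches, so the dying machine never lands on the
        // 'removed' list and is never cleaned up.
        fmt.Println("neither added nor removed")
    }
}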
Changed in juju:
assignee: nobody → John A Meinel (jameinel)
status: Triaged → In Progress
milestone: none → 2.4-beta2
Changed in juju:
status: In Progress → Fix Committed
Changed in juju:
status: Fix Committed → Fix Released
I still don't know why "juju show-controller" isn't showing machine 6, even though it does have JobManageModel. I'm guessing it's because WantsVote() is false.
Regardless, you can get into this situation much more easily:
$ juju bootstrap
$ juju enable-ha
$ juju remove-machine 2
At this point, both machine 1 and machine 2 will lose their vote, because we never keep an even number of voters. But we don't delete machine 1, because we expect it to participate again later.
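As a rough illustration of that rule (my own sketch, not the peergrouper's actual calculation): with three controllers, removing one leaves two candidates, and since an even voter count is avoided, one more machine is demoted.

package main

import "fmt"

// voterTarget sketches the "never an even number of voters" rule:
// if the available controller machines would give an even count,
// one of them is demoted.
func voterTarget(available int) int {
    if available%2 == 0 {
        return available - 1
    }
    return available
}

func main() {
    fmt.Println(voterTarget(3)) // 3: machines 0, 1, 2 all vote
    fmt.Println(voterTarget(2)) // 1: machine 2 removed, machine 1 demoted too
}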
If you then do it again with:
$ juju remove-machine 1
Then we will *fail* to reap machine 1.
This is because the check we had was only iterating over machines that *just* lost their vote.
We need to iterate over all of the machines, checking whether each one is currently not Alive and no longer has its vote.
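A minimal sketch of that fix, assuming each controller machine exposes something like life/has-vote/wants-vote state (placeholder names here, not the exact Juju worker code):

package main

import "fmt"

// controllerMachine is a placeholder for the worker's machine tracker;
// the real code would consult Life(), HasVote() and WantsVote().
type controllerMachine struct {
    id        string
    alive     bool
    hasVote   bool
    wantsVote bool
}

// machinesToReap scans every controller machine, not just those that lost
// their vote this pass: anything that is no longer alive and neither has
// nor wants the vote can be cleaned up.
func machinesToReap(all []controllerMachine) []string {
    var ids []string
    for _, m := range all {
        if !m.alive && !m.hasVote && !m.wantsVote {
            ids = append(ids, m.id)
        }
    }
    return ids
}

func main() {
    all := []controllerMachine{
        {id: "0", alive: true, hasVote: true, wantsVote: true},
        {id: "1", alive: false, hasVote: false, wantsVote: false}, // never voted, now dying
    }
    fmt.Println(machinesToReap(all)) // [1]
}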