Comment 8 for bug 2016868

Joseph Phillips (manadart) wrote (last edit ):

Reproduced on 3.0

Status:
https://pastebin.canonical.com/p/Mw4SmjHfP8/

Force removal of the primary (machine 0, IP 10.246.27.32). It does a step-down, then calculates a peer-group change with a single voter to maintain an odd number. It reports success, *but* it is confused as to which node is the primary: 10.246.27.45 (machine 1) is reported as self=true.
https://pastebin.canonical.com/p/98PVCrsjbx/
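The odd-voter calculation above can be sketched roughly as follows. This is a hypothetical illustration of the rule (remove a member, then demote a voter if the remaining voter count is even), not the actual logic in Juju's replicaset package; the `Member` type and `removeKeepingOddVoters` helper are invented for this sketch.

```go
package main

import "fmt"

// Member is a simplified stand-in for a replica-set member
// (hypothetical; Juju's real type lives in its replicaset package).
type Member struct {
	Address string
	Voting  bool
}

// removeKeepingOddVoters drops the member at addr and, if the remaining
// number of voters would be even, demotes one voter so the count stays odd.
func removeKeepingOddVoters(members []Member, addr string) []Member {
	out := make([]Member, 0, len(members))
	for _, m := range members {
		if m.Address != addr {
			out = append(out, m)
		}
	}
	voters := 0
	for _, m := range out {
		if m.Voting {
			voters++
		}
	}
	if voters > 0 && voters%2 == 0 {
		// Demote the last voter to restore an odd voter count.
		for i := len(out) - 1; i >= 0; i-- {
			if out[i].Voting {
				out[i].Voting = false
				break
			}
		}
	}
	return out
}

func main() {
	members := []Member{
		{"10.246.27.32:37017", true},  // machine 0, the primary being removed
		{"10.246.27.45:37017", true},  // machine 1
		{"10.246.27.136:37017", true}, // machine 2
	}
	// Removing one of three voters leaves two, so one is demoted,
	// leaving a single voter, as described above.
	for _, m := range removeKeepingOddVoters(members, "10.246.27.32:37017") {
		fmt.Println(m.Address, "voting:", m.Voting)
	}
}
```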

Despite reporting success, the replica-set does not change.
https://pastebin.canonical.com/p/FB8ntXsrn2/

Over on machine 1, which Mongo reports as the primary when you log in, the agent seems to think that it is machine 0.
https://pastebin.canonical.com/p/d94BpdkpKN/

It is trying to remove member 1 (machine 0) but keeps failing, despite being primary.
https://pastebin.canonical.com/p/3rYK5JZJGc/

It certainly says PRIMARY when you connect directly to it.
https://pastebin.canonical.com/p/gwbzRd34Nd/
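The cross-check being done here can be sketched as follows: given rs.status()-style member states, find which address claims PRIMARY and whether that is also the member marked self. The `statusMember` struct and `primaryMismatch` helper are hypothetical, invented to illustrate the self=true confusion described above, not code from Juju or the mongo driver.

```go
package main

import "fmt"

// statusMember mimics the fields of interest from Mongo's rs.status()
// output (hypothetical struct, for illustration only).
type statusMember struct {
	Address string
	State   string // e.g. "PRIMARY", "SECONDARY"
	Self    bool   // true for the node that answered the query
}

// primaryMismatch returns the primary's address and whether the node
// claiming PRIMARY is also the one marked self.
func primaryMismatch(members []statusMember) (primary string, selfIsPrimary bool) {
	for _, m := range members {
		if m.State == "PRIMARY" {
			primary = m.Address
			selfIsPrimary = m.Self
		}
	}
	return primary, selfIsPrimary
}

func main() {
	// In the broken state observed above, machine 1 answers as PRIMARY,
	// but self=true is attributed to a different member.
	members := []statusMember{
		{"10.246.27.32:37017", "SECONDARY", true}, // wrongly marked self
		{"10.246.27.45:37017", "PRIMARY", false},
		{"10.246.27.136:37017", "SECONDARY", false},
	}
	p, ok := primaryMismatch(members)
	fmt.Println("primary:", p, "self is primary:", ok)
}
```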

I think we're somehow using sessions with crossed wires.

At this point machine 0 is still not gone. If you force delete the container you get into the situation we observed 2 days ago.
https://pastebin.canonical.com/p/G2jdnf2hzs/

And no matter how many times you try to force delete machine 0, even with the container gone, it will not go away.

Restarted the whole container under machine 1 to see what happens. It reports success changing the replica-set, but it thinks it's machine 2.
https://pastebin.canonical.com/p/tNvrBBVNRH/

Tried bouncing the container for machine 2 as well. No improvement.

All the while we keep reporting the same replica-set.

2023-04-20 09:29:04 DEBUG juju.replicaset replicaset.go:669 current replicaset config: {
  Name: juju,
  Version: 3,
  Term: 11,
  Protocol Version: 1,
  Members: {
    {1 "10.246.27.32:37017" juju-machine-id:0 voting},
    {2 "10.246.27.45:37017" juju-machine-id:1 voting},
    {3 "10.246.27.136:37017" juju-machine-id:2 voting},
  },
}

As an added bonus in this case, Raft gets borked, and we have no singular controller lease. So running enable-ha again adds a machine that can never be provisioned.

An altogether miserable state.
https://pastebin.canonical.com/p/VPbggMVwdZ/