Mongo replica-set not reestablished after losing leader

Bug #2016868 reported by Joseph Phillips
Affects         Status     Importance  Assigned to  Milestone
Canonical Juju  Won't Fix  High        Unassigned
2.9             Won't Fix  High        Ian Booth
3.1             Won't Fix  High        Unassigned
3.2             Won't Fix  High        Unassigned

Bug Description

Using Juju 3.2, I established HA, then removed machine 0, which was leader for both MongoDB and Dqlite.

I then re-ran enable-ha. Status took a very long time to quiesce, and eventually Dqlite settled its cluster:

- ID: 3297041220608546238
  Address: 10.246.27.193:17666
  Role: 2
- ID: 11462370002709127048
  Address: 10.246.27.31:17666
  Role: 0
- ID: 179389882111230348
  Address: 10.246.27.194:17666
  Role: 0
- ID: 16076074371255822966
  Address: 10.246.27.27:17666
  Role: 0

As expected, the original node is marked as a standby, and the 3 healthy nodes are all voters.

But Mongo cannot update its replica-set.

Controllers show repeated attempts to contact the old leader:
https://pastebin.canonical.com/p/XHXjsmkpS4/

Having gone down to 1 voter, machine 1 cannot update the replica-set, because it cannot get a quorum:
https://pastebin.canonical.com/p/XKcKqmKXY4/

Machine 3 (the one added by re-running enable-ha) keeps removing the vote from machine 2 after the desired peer group is calculated, in order to maintain an odd number of voters.

The peer-grouper runs successfully, but with a different replica-set:
https://pastebin.canonical.com/p/VP3VqRycnW/

rs.status() on machines 1 and 2 is the same:
https://pastebin.canonical.com/p/DpKDzvKhqY/

I can't connect to Mongo on 3:
https://pastebin.canonical.com/p/HpmTHFZWCX/

Status says we're all healthy:
https://pastebin.canonical.com/p/NjDRkSnrDX/

So it appears that machine 3 has a Mongo instance segmented from the rest, with no way to rejoin.
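For reference, a quick way to see how each member views the set, from a node that does still accept connections (machine 1 or 2 here), is to summarise rs.status() in the mongo shell. This is a generic diagnostic sketch, not output from this environment:

// Sketch: print each member's state and health as this node sees them.
rs.status().members.forEach(function (m) {
    print(m._id + " " + m.name + " state=" + m.stateStr + " health=" + m.health);
});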

Changed in juju:
status: New → Triaged
importance: Undecided → High
summary: - Mongo replica-set not reestablished after looding leader
+ Mongo replica-set not reestablished after losing leader
Revision history for this message
John A Meinel (jameinel) wrote :

Joe and I dug into this for a bit today. It looks like what is happening is that we start with HA 3, and machine-0,-1,-2 are all happily in the replicaset.
However, when we 'juju remove-machine 0', it goes into dying, and then the replicaset notices that it wants to remove machine 0, so it kicks off the process to have machine 0 lose its primary status (which it does; in this case the primary moved to machine 1).
At which point, we notice that we can remove machine 0 safely, and that ultimately we want to go to HA 2, where we have only 1 voter, and another non-voter backup.

So we end up going from:
{
  0: true,
  1: true,
  2: true,
}

to
{
  1: true,
  2: false,
}

However, Mongo refuses that change, because it changes 2 voters simultaneously (machine 0 is going from voting to non-voting and being removed, and machine 2 is going to non-voting).
When we logged into the mongo shell and directly changed only machine 0 so that it was no longer in the replicaset, the rest of the peergrouper's work kicked off and we ended up in the stable desired target. (To prove this, I suppose we could try the opposite: only downgrade machine 2 to non-voting and see whether the peergrouper then successfully removes machine 0.)
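For reference, that manual change (dropping only machine 0's member from the replica-set and leaving everything else untouched) looks roughly like the following in the mongo shell on the primary. This is a sketch rather than a transcript from the affected environment; it assumes the member carries the juju-machine-id tag that Juju sets, as seen in the rs.config() output later in this bug, and the real values should be checked with rs.conf() first.

// Sketch: remove only the member tagged as juju machine 0, then reapply
// the otherwise-unchanged config so Mongo sees a single membership change.
cfg = rs.conf();
cfg.members = cfg.members.filter(function (m) {
    return !(m.tags && m.tags["juju-machine-id"] === "0");
});
rs.reconfig(cfg);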

I think the logic that we have to 'only do one change at a time' is missing something about handling removals.

The other piece is that when we tried to issue the request for removal and downgrade directly on the mongodb commandline, we got an error from mongo. For whatever reason, the peergrouper is not reporting an error in the logs. So somehow we are suppressing the error.
It may be that we are using 'force' and mongo isn't erroring but also isn't respecting our request, or there is some code path where we fail to notice the response from mongo.
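One way to see exactly what mongo says about such a reconfig, rather than relying on the peergrouper's logging, is to issue the command directly and print the raw reply. Again a sketch: newCfg below is a placeholder for whatever config document the peergrouper would have proposed.

// Sketch: run the reconfig as a raw command so the reply is visible,
// including any error even when 'force' is used. newCfg is hypothetical.
var res = db.adminCommand({replSetReconfig: newCfg, force: true});
printjson(res);  // res.ok is 0 on failure, with errmsg/code set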

Revision history for this message
John A Meinel (jameinel) wrote :

I *think* this is something that is happening in 2.9 since we use the same peergrouper logic to be able to support Mongo 4.4.

We also should ultimately get a CI test that tests going into and out of HA, and checks that Mongo is happy, not just that Juju thinks it is happy.

(In this case, Juju kept running because the peergrouper still had 2 nodes in it that could both vote, and the third voter just never responded. That allowed us to try to go into HA again, but then we couldn't get the right entities involved in the decision.)

Revision history for this message
Ian Booth (wallyworld) wrote :

Initial comment, RE:

So we end up going from:
{
  0: true,
  1: true,
  2: true,
}

to
{
  1: true,
  2: false,
}

---

If machine 0 is removed and is the mongo primary, I think we are supposed to call "StepDownPrimary()" and then re-evaluate things. I would hope that this would then result in going from:
{
  1: true,
  2: true,
}

to
{
  1: true,
  2: false,
}

which is only one change. So maybe the logic to evaluate the new desired replicaset config has a bug such that it doesn't set the "stepDownPrimary" flag when it should.
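For comparison, the manual equivalent of that step-down, run from the mongo shell on the current primary, is just the following (a sketch; 60 seconds is an arbitrary window during which the node refuses to be re-elected):

// Sketch: ask the current primary to step down so a secondary is elected
// before the membership change is attempted.
rs.stepDown(60);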

Revision history for this message
Ian Booth (wallyworld) wrote :

On juju 2.9, I tried to reproduce this and it seemed to work - rs.config() and rs.status() confirmed it. After enable-ha, I first tested removing machine 1 (a secondary).

Maybe it's an issue introduced in 3.2.

So we had

0: primary
2: secondary, non-voting

Then enable-ha created a replacement node

0: primary
2: secondary, voting
3: secondary, voting

Then I removed machine 0, the primary. In this case, the API server does drop (so my debug-log session terminated and my open mongo client connection became invalid). But after a short time (yes, it takes a few seconds), the replicaset was repaired and was configured as expected:

2: primary
3: secondary, non-voting

Opening a mongo client connection to machine 2:
juju:PRIMARY> rs.config()
{
        "_id" : "juju",
        "version" : 9,
        "term" : 3,
        "protocolVersion" : NumberLong(1),
        "writeConcernMajorityJournalDefault" : true,
        "members" : [
                {
                        "_id" : 3,
                        "host" : "10.240.0.9:37017",
                        "arbiterOnly" : false,
                        "buildIndexes" : true,
                        "hidden" : false,
                        "priority" : 1,
                        "tags" : {
                                "juju-machine-id" : "2"
                        },
                        "slaveDelay" : NumberLong(0),
                        "votes" : 1
                },
                {
                        "_id" : 4,
                        "host" : "10.240.0.13:37017",
                        "arbiterOnly" : false,
                        "buildIndexes" : true,
                        "hidden" : false,
                        "priority" : 0,
                        "tags" : {
                                "juju-machine-id" : "3"
                        },
                        "slaveDelay" : NumberLong(0),
                        "votes" : 0
                }
        ],
}
juju:PRIMARY> rs.status()
{
        "set" : "juju",
        "date" : ISODate("2023-04-20T01:36:23.784Z"),
        "myState" : 1,
        "term" : NumberLong(3),
        "members" : [
                {
                        "_id" : 3,
                        "name" : "10.240.0.9:37017",
                        "health" : 1,
                        "state" : 1,
                        "stateStr" : "PRIMARY",
                },
                {
                        "_id" : 4,
                        "name" : "10.240.0.13:37017",
                        "health" : 1,
                        "state" : 2,
                        "stateStr" : "SECONDARY",
                }
        ],
        "ok" : 1,
}

And I can use enable-ha to set up the full replicaset again

2: primary
3: secondary, voting
4: secondary, voting

(rs config and status are as expected)

Revision history for this message
Ian Booth (wallyworld) wrote :

Testing on juju 3.1 also was successful.

Revision history for this message
Ian Booth (wallyworld) wrote :

And no problems when I tried with 3.2

If the failure can be reproduced, we'll need to do so with the relevant logging ramped up:

"juju.replicaset=TRACE;juju.worker.peergrouper=TRACE"

Revision history for this message
Joseph Phillips (manadart) wrote :

It is revealed by removing the primary node.

Seems to be OK on 2.9 and 3.0. I have reproduced it 2 out of 2 times on 3.1.

Revision history for this message
Joseph Phillips (manadart) wrote (last edit ):

Reproduced on 3.0

Status:
https://pastebin.canonical.com/p/Mw4SmjHfP8/

Force removal of the primary (0), IP 10.246.27.32. It does a step-down, and calculates a peer-group change with a single voter to maintain an odd number. It reports success, *but* it is confused as to who the primary is. 10.246.27.45 (machine 1) is reported as self=true.
https://pastebin.canonical.com/p/98PVCrsjbx/

Despite reporting success, the replica-set does not change.
https://pastebin.canonical.com/p/FB8ntXsrn2/

Over on machine 1, which is reported as the primary when you log into Mongo, it seems to think that it is machine 0.
https://pastebin.canonical.com/p/d94BpdkpKN/

It is trying to remove member 1 (machine 0) but keeps failing, despite being primary.
https://pastebin.canonical.com/p/3rYK5JZJGc/

It certainly says PRIMARY when you connect directly to it.
https://pastebin.canonical.com/p/gwbzRd34Nd/

I think we're somehow using sessions with crossed wires.

At this point machine 0 is still not gone. If you force delete the container you get into the situation we observed 2 days ago.
https://pastebin.canonical.com/p/G2jdnf2hzs/

And no matter how many times you try to force-delete machine 0, even with the container gone, it will not go away.

I restarted the whole container under machine 1 to see what happens. It reports success changing the replica-set, but it thinks it's machine 2.
https://pastebin.canonical.com/p/tNvrBBVNRH/

Tried bouncing the container for machine 2 as well. No improvement.

All the while we keep reporting the same replica-set.

2023-04-20 09:29:04 DEBUG juju.replicaset replicaset.go:669 current replicaset config: {
  Name: juju,
  Version: 3,
  Term: 11,
  Protocol Version: 1,
  Members: {
    {1 "10.246.27.32:37017" juju-machine-id:0 voting},
    {2 "10.246.27.45:37017" juju-machine-id:1 voting},
    {3 "10.246.27.136:37017" juju-machine-id:2 voting},
  },
}

As an added bonus in this case, Raft gets borked and we have no singular controller lease, so running enable-ha again adds a machine that can never be provisioned.

An altogether miserable state.
https://pastebin.canonical.com/p/VPbggMVwdZ/

Revision history for this message
Joseph Phillips (manadart) wrote :

Ah, I see. It reports success because, if we are not the primary, we just return the desired replica-set without attempting to effect the change.
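For anyone digging into this further, whether a given node actually believes it is the primary (the premise of that early return) can be checked directly in the mongo shell on that node; a sketch:

// Sketch: check what this node reports about primacy before trusting a
// reconfig "success" that may have been a no-op.
var hello = db.isMaster();  // db.hello() on newer servers/shells
print("ismaster=" + hello.ismaster + " primary=" + hello.primary + " me=" + hello.me);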

Revision history for this message
Joseph Phillips (manadart) wrote (last edit ):

Seems not to happen on 2.9. Tried 3 times.

Unlike later versions, the machine does go away and the correct replica-set is achieved.

2.9 uses github.com/juju/mgo/v2 and github.com/juju/replicaset/v2, whereas 3.0 moved both to v3.

Revision history for this message
Ian Booth (wallyworld) wrote :

The replicaset bump was to align with mgo v3.

mgo v3 commit 5c00a1b (https://github.com/juju/mgo/pull/20) fixes unit tests and some sstxn issues, but also

  - Non primary write errors force a cluster sync and discard the
  primary socket for that session. If there is a current
  transaction, abort it.

This seems worth investigating more. It messes with sessions and cluster sync etc.

Revision history for this message
Joseph Phillips (manadart) wrote :

I am beginning to suspect that this is a characteristic related to forced destruction and the implications for the controller unit, which 2.9 does not have.

With normal machine removal, it was fine 2/2 on 3.0. Yesterday, I had 100% reproduction with --force.

Revision history for this message
Joseph Phillips (manadart) wrote :

I've indicated that we will not pursue a fix for this issue.

It will go away as part of replacing Mongo, and we are not seeing examples of it from the field.

Changed in juju:
status: Triaged → Won't Fix