Mongo replica-set not reestablished after losing leader

Bug #2016868 reported by Joseph Phillips
Affects         Status     Importance  Assigned to  Milestone
Canonical Juju  Won't Fix  High        Unassigned
2.9             Won't Fix  High        Ian Booth
3.1             Won't Fix  High        Unassigned
3.2             Won't Fix  High        Unassigned

Bug Description

Using Juju 3.2, I established HA, then removed machine 0, which was leader for both MongoDB and Dqlite.

I then re-ran enable-ha. Status took a very long time to quiesce, and eventually Dqlite settled its cluster:

- ID: 3297041220608546238
  Address: 10.246.27.193:17666
  Role: 2
- ID: 11462370002709127048
  Address: 10.246.27.31:17666
  Role: 0
- ID: 179389882111230348
  Address: 10.246.27.194:17666
  Role: 0
- ID: 16076074371255822966
  Address: 10.246.27.27:17666
  Role: 0

As expected, the original node is marked as a standby, and the 3 healthy nodes are all voters.

But Mongo cannot update its replica-set.

Controllers show repeated attempts to contact the old leader:
https://pastebin.canonical.com/p/XHXjsmkpS4/

Having gone down to 1 voter, machine 1 cannot update the replica-set, because it cannot get a quorum:
https://pastebin.canonical.com/p/XKcKqmKXY4/

Machine 3 (the one added by re-running enable-ha) keeps removing the vote from machine 2 after the desired peer group is calculated, in order to maintain an odd number of voters.

The peer-grouper runs successfully, but with a different replica-set:
https://pastebin.canonical.com/p/VP3VqRycnW/

rs.status() on machines 1 and 2 is the same:
https://pastebin.canonical.com/p/DpKDzvKhqY/

I can't connect to Mongo on 3:
https://pastebin.canonical.com/p/HpmTHFZWCX/

Status says we're all healthy:
https://pastebin.canonical.com/p/NjDRkSnrDX/

So it appears that machine 3 has a Mongo instance segmented from the rest, with no way to rejoin.
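For reference, a quick way to see how each member views the set, from a node that does still accept connections (machine 1 or 2 here), is to summarise rs.status() in the mongo shell. This is a generic diagnostic sketch, not output from this environment:

// Sketch: print each member's state and health as this node sees them.
rs.status().members.forEach(function (m) {
    print(m._id + " " + m.name + " state=" + m.stateStr + " health=" + m.health);
});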

Changed in juju:
status: New → Triaged
importance: Undecided → High
summary: - Mongo replica-set not reestablished after looding leader
+ Mongo replica-set not reestablished after losing leader
Revision history for this message
John A Meinel (jameinel) wrote :

Joe and I dug into this for a bit today. It looks like what is happening is that we start with HA 3, and machine-0,-1,-2 are all happily in the replicaset.
However, when we 'juju remove-machine 0', it goes into dying, and then the replicaset notices that it wants to remove machine 0, so it kicks off the process to have machine 0 lose its primary status (which it does; in this case the primary moved to machine 1).
At which point, we notice that we can remove machine 0 safely, and that ultimately we want to go to HA 2, where we have only 1 voter, and another non-voter backup.

So we end up going from:
{
  0: true,
  1: true,
  2: true,
}

to
{
  1: true,
  2: false,
}

However, Mongo refuses that change, because it changes 2 voters simultaneously (machine 0 is going from voting to non-voting and being removed, and machine 2 is going to non-voting).
When we logged into the mongo shell and directly changed only machine 0 so that it was no longer in the replicaset, the rest of the peergrouper's work kicked off and we ended up in the stable desired target. (To prove this, I suppose we could try the opposite: only downgrade machine 2 to non-voting and see whether the peergrouper then successfully removes machine 0.)
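For reference, that manual change (dropping only machine 0's member from the replica-set and leaving everything else untouched) looks roughly like the following in the mongo shell on the primary. This is a sketch rather than a transcript from the affected environment; it assumes the member carries the juju-machine-id tag that Juju sets, as seen in the rs.config() output later in this bug, and the real values should be checked with rs.conf() first.

// Sketch: remove only the member tagged as juju machine 0, then reapply
// the otherwise-unchanged config so Mongo sees a single membership change.
cfg = rs.conf();
cfg.members = cfg.members.filter(function (m) {
    return !(m.tags && m.tags["juju-machine-id"] === "0");
});
rs.reconfig(cfg);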

I think the logic that we have to 'only do one change at a time' is missing something about handling removals.

The other piece is that when we tried to issue the request for removal and downgrade directly on the mongodb commandline, we got an error from mongo. For whatever reason, the peergrouper is not reporting an error in the logs. So somehow we are suppressing the error.
It may be that we are using 'force' and mongo isn't erroring but also isn't respecting our request, or there is some code path where we fail to notice the response from mongo.
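One way to see exactly what mongo says about such a reconfig, rather than relying on the peergrouper's logging, is to issue the command directly and print the raw reply. Again a sketch: newCfg below is a placeholder for whatever config document the peergrouper would have proposed.

// Sketch: run the reconfig as a raw command so the reply is visible,
// including any error even when 'force' is used. newCfg is hypothetical.
var res = db.adminCommand({replSetReconfig: newCfg, force: true});
printjson(res);  // res.ok is 0 on failure, with errmsg/code set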

Revision history for this message
John A Meinel (jameinel) wrote :

I *think* this is something that is happening in 2.9 since we use the same peergrouper logic to be able to support Mongo 4.4.

We also should ultimately get a CI test that tests going into and out of HA, and checks that Mongo is happy, not just that Juju thinks it is happy.

(In this case, Juju kept running because the peergrouper still had 2 nodes in it that could both vote, and the third voter just never responded. That allowed us to try to go into HA again, but then we couldn't get the right entities involved in the decision.)

Revision history for this message
Ian Booth (wallyworld) wrote :

Initial comment, RE:

So we end up going from:
{
  0: true,
  1: true,
  2: true,
}

to
{
  1: true,
  2: false,
}

---

If machine 0 is removed and is the mongo primary, I think we are supposed to call "StepDownPrimary()" and then re-evaluate things. I would hope that this would then result in going from:
{
  1: true,
  2: true,
}

to
{
  1: true,
  2: false,
}

which is only one change. So maybe the logic to evaluate the new desired replicaset config has a bug such that it doesn't set the "stepDownPrimary" flag when it should.
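For comparison, the manual equivalent of that step-down, run from the mongo shell on the current primary, is just the following (a sketch; 60 seconds is an arbitrary window during which the node refuses to be re-elected):

// Sketch: ask the current primary to step down so a secondary is elected
// before the membership change is attempted.
rs.stepDown(60);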

Revision history for this message
Ian Booth (wallyworld) wrote :

On juju 2.9, I tried to reproduce this and it seemed to work - rs.config() and rs.status() confirmed it. After enable-ha, I first tested removing machine 1 (a secondary).

Maybe it's an issue introduced in 3.2.

So we had

0: primary
2: secondary, non-voting

Then enable-ha created a replacement node

0: primary
2: secondary, voting
3: secondary, voting

Then I removed machine 0, the primary. In this case, the API server does drop (so my debug-log session terminated and my open mongo client connection became invalid). But after a short time (yes, it takes a few seconds), the replicaset was repaired and was configured as expected:

2: primary
3: secondary, non-voting

Opening a mongo client connection to machine 2:
juju:PRIMARY> rs.config()
{
        "_id" : "juju",
        "version" : 9,
        "term" : 3,
        "protocolVersion" : NumberLong(1),
        "writeConcernMajorityJournalDefault" : true,
        "members" : [
                {
                        "_id" : 3,
                        "host" : "10.240.0.9:37017",
                        "arbiterOnly" : false,
                        "buildIndexes" : true,
                        "hidden" : false,
                        "priority" : 1,
                        "tags" : {
                                "juju-machine-id" : "2"
                        },
                        "slaveDelay" : NumberLong(0),
                        "votes" : 1
                },
                {
                        "_id" : 4,
                        "host" : "10.240.0.13:37017",
                        "arbiterOnly" : false,
                        "buildIndexes" : true,
                        "hidden" : false,
                        "priority" : 0,
                        "tags" : {
                                "juju-machine-id" : "3"
                        },
                        "slaveDelay" : NumberLong(0),
                        "votes" : 0
                }
        ],
}
juju:PRIMARY> rs.status()
{
        "set" : "juju",
        "date" : ISODate("2023-04-20T01:36:23.784Z"),
        "myState" : 1,
        "term" : NumberLong(3),
        "members" : [
                {
                        "_id" : 3,
                        "name" : "10.240.0.9:37017",
                        "health" : 1,
                        "state" : 1,
                        "stateStr" : "PRIMARY",
                },
                {
                        "_id" : 4,
                        "name" : "10.240.0.13:37017",
                        "health" : 1,
                        "state" : 2,
                        "stateStr" : "SECONDARY",
                }
        ],
        "ok" : 1,
}

And I can use enable-ha to set up the full replicaset again

2: primary
3: secondary, voting
4: secondary, voting

(rs config and status are as expected)

Revision history for this message
Ian Booth (wallyworld) wrote :

Testing on juju 3.1 also was successful.

Revision history for this message
Ian Booth (wallyworld) wrote :

And no problems when I tried with 3.2

If the failure can be reproduced, we'll need to do so with the relevant logging ramped up:

"juju.replicaset=TRACE;juju.worker.peergrouper=TRACE"

Revision history for this message
Joseph Phillips (manadart) wrote :

It is revealed by removing the primary node.

Seems to be OK on 2.9 and 3.0. I have reproduced it 2 out of 2 times on 3.1.

Revision history for this message
Joseph Phillips (manadart) wrote (last edit ):

Reproduced on 3.0

Status:
https://pastebin.canonical.com/p/Mw4SmjHfP8/

Force removal of the primary (0), IP 10.246.27.32. It does a step-down, and calculates a peer-group change with a single voter to maintain an odd number. It reports success, *but* it is confused as to who the primary is. 10.246.27.45 (machine 1) is reported as self=true.
https://pastebin.canonical.com/p/98PVCrsjbx/

Despite reporting success, the replica-set does not change.
https://pastebin.canonical.com/p/FB8ntXsrn2/

Over on machine 1, which is reported as the primary when you log into Mongo, it seems to think that it is machine 0.
https://pastebin.canonical.com/p/d94BpdkpKN/

It is trying to remove member 1 (machine 0) but keeps failing, despite being primary.
https://pastebin.canonical.com/p/3rYK5JZJGc/

It certainly says PRIMARY when you connect directly to it.
https://pastebin.canonical.com/p/gwbzRd34Nd/

I think we're somehow using sessions with crossed wires.

At this point machine 0 is still not gone. If you force delete the container you get into the situation we observed 2 days ago.
https://pastebin.canonical.com/p/G2jdnf2hzs/

And no matter how many times you try to force-delete machine 0, even with the container gone, it will not go away.

I restarted the whole container under machine 1 to see what happens. It reports success changing the replica-set, but it thinks it's machine 2.
https://pastebin.canonical.com/p/tNvrBBVNRH/

Tried bouncing the container for machine 2 as well. No improvement.

All the while we keep reporting the same replica-set.

2023-04-20 09:29:04 DEBUG juju.replicaset replicaset.go:669 current replicaset config: {
  Name: juju,
  Version: 3,
  Term: 11,
  Protocol Version: 1,
  Members: {
    {1 "10.246.27.32:37017" juju-machine-id:0 voting},
    {2 "10.246.27.45:37017" juju-machine-id:1 voting},
    {3 "10.246.27.136:37017" juju-machine-id:2 voting},
  },
}

As an added bonus in this case, Raft gets borked and we have no singular controller lease, so running enable-ha again adds a machine that can never be provisioned.

An altogether miserable state.
https://pastebin.canonical.com/p/VPbggMVwdZ/

Revision history for this message
Joseph Phillips (manadart) wrote :

Ah, I see. It reports success because, if we are not the primary, we just return the desired replica-set without attempting to effect the change.
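For anyone digging into this further, whether a given node actually believes it is the primary (the premise of that early return) can be checked directly in the mongo shell on that node; a sketch:

// Sketch: check what this node reports about primacy before trusting a
// reconfig "success" that may have been a no-op.
var hello = db.isMaster();  // db.hello() on newer servers/shells
print("ismaster=" + hello.ismaster + " primary=" + hello.primary + " me=" + hello.me);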

Revision history for this message
Joseph Phillips (manadart) wrote (last edit ):

Seems not to happen on 2.9. Tried 3 times.

Unlike later versions, the machine does go away and the correct replica-set is achieved.

2.9 uses github.com/juju/mgo/v2 and github.com/juju/replicaset/v2, whereas 3.0 moved both to v3.

Revision history for this message
Ian Booth (wallyworld) wrote :

The replicaset bump was to align with mgo v3.

mgo v3 commit 5c00a1b (https://github.com/juju/mgo/pull/20) fixes unit tests and some sstxn issues, but also

  - Non primary write errors force a cluster sync and discard the
  primary socket for that session. If there is a current
  transaction, abort it.

This seems worth investigating more. It messes with sessions and cluster sync etc.

Revision history for this message
Joseph Phillips (manadart) wrote :

I am beginning to suspect that this is a characteristic related to forced destruction and the implications for the controller unit, which 2.9 does not have.

With normal machine removal, it was fine 2/2 on 3.0. Yesterday, I had 100% reproduction with --force.

Revision history for this message
Joseph Phillips (manadart) wrote :

I've indicated that we will not pursue a fix for this issue.

It will go away as part of replacing Mongo, and we are not seeing examples of it from the field.

Changed in juju:
status: Triaged → Won't Fix