juju enable-ha fails to cluster on 2.9.18 manual machines

Bug #1951813 reported by Peter Jose De Sousa
This bug affects 1 person
Affects: Canonical Juju
Status: Fix Released
Importance: Critical
Assigned to: Ian Booth

Bug Description

Hi,

When deploying the latest Juju with fan networking, Juju fails to cluster successfully: it initially clusters, then falls in and out of the cluster. This issue is present when deploying manual machines on multiple cloud substrates.

To replicate:
JUJU_FAN_CONFIG="172.12.1.0/24"
JUJU_FAN_OVERLAY="252.0.0.0/16"

1. Deploy manual controllers
2. Add manual cloud
3. juju bootstrap dsv-apic-dev-juju-controller dsv-apic-dev-juju-controller --model-default container-networking-method=fan --model-default fan-config=${JUJU_FAN_CONFIG}=${JUJU_FAN_OVERLAY} # machine-0.log
4. juju add-machine -m controller ssh:machineONE # machine-1.log
5. juju add-machine -m controller ssh:machineTWO # machine-2.log
6. juju enable-ha --to=1,2
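
The fan-config value in step 3 is the underlay and overlay CIDRs joined with `=`. A minimal sketch of composing that value and printing (not running) the resulting bootstrap command for inspection; the helper name `build_fan_config` and the variable `FAN_CONFIG_VALUE` are hypothetical and not part of Juju:

```shell
#!/bin/sh
# Hypothetical helper: join the underlay and overlay CIDRs into the
# "underlay=overlay" form used for fan-config in the steps above.
build_fan_config() {
    printf '%s=%s\n' "$1" "$2"
}

JUJU_FAN_CONFIG="172.12.1.0/24"
JUJU_FAN_OVERLAY="252.0.0.0/16"
FAN_CONFIG_VALUE="$(build_fan_config "$JUJU_FAN_CONFIG" "$JUJU_FAN_OVERLAY")"

# Print the bootstrap command instead of executing it, so the expanded
# fan-config value can be checked before touching any machine.
echo "juju bootstrap dsv-apic-dev-juju-controller dsv-apic-dev-juju-controller" \
     "--model-default container-networking-method=fan" \
     "--model-default fan-config=${FAN_CONFIG_VALUE}"
```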

Observe the logs in /var/log/juju/machine-N.log
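
To narrow down the relevant log lines on each controller machine, something like the following may help. It is only a sketch: the function name `scan_machine_logs` is hypothetical, and the search terms `mongo` and `replica` are an assumption about which lines are interesting, not strings confirmed by this report.

```shell
#!/bin/sh
# Hypothetical helper: print lines mentioning mongo or replica from
# every machine-N.log under the given directory (default /var/log/juju).
scan_machine_logs() {
    # -H prefixes each match with its file name, -i ignores case,
    # -E enables the alternation in the pattern.
    grep -H -i -E 'mongo|replica' "${1:-/var/log/juju}"/machine-*.log
}

# Usage on a controller machine (reading the logs may require sudo):
#   scan_machine_logs
#   scan_machine_logs /path/to/copied/logs
```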

Logs from my repro deployment + crashdump here: https://private-fileshare.canonical.com/~pjds/lp-1951813/

[Workaround]
**UPDATED WORKAROUND**

We found that the previous workaround fixed mongo but did not resolve the Juju clustering issues.

@mastier1 found that deploying 2.9.14 works around the issue. To apply the workaround:

JUJU_FAN_CONFIG="172.12.1.0/24"
JUJU_FAN_OVERLAY="252.0.0.0/16"

1. Deploy manual controllers
2. Add manual cloud
3. juju bootstrap dsv-apic-dev-juju-controller dsv-apic-dev-juju-controller --agent-version=2.9.14 --model-default container-networking-method=fan --model-default fan-config=${JUJU_FAN_CONFIG}=${JUJU_FAN_OVERLAY} # machine-0.log
4. juju add-machine -m controller ssh:machineONE # machine-1.log
5. juju add-machine -m controller ssh:machineTWO # machine-2.log
6. juju enable-ha --to=1,2

Thanks,

Peter

summary: - juju enable-ha does not work on >2.8.16 with fan networking
+ juju enable-ha fails to cluster on >2.8.16
summary: - juju enable-ha fails to cluster on >2.8.16
+ juju enable-ha fails to cluster on >2.9.16
description: updated
description: updated
Revision history for this message
Peter Jose De Sousa (pjds) wrote : Re: juju enable-ha fails to cluster on 2.9.18

Subscribing field critical, as this is blocking a customer deployment and no known workaround works in all cases.

summary: - juju enable-ha fails to cluster on >2.9.16
+ juju enable-ha fails to cluster on 2.9.18
description: updated
summary: - juju enable-ha fails to cluster on 2.9.18
+ juju enable-ha fails to cluster on 2.9.18 manual machines
description: updated
Revision history for this message
Peter Jose De Sousa (pjds) wrote :

Downgrading to field high, as a workaround has been found.

Revision history for this message
Peter Jose De Sousa (pjds) wrote :

Subscribed field high as a workaround was found, updating the bug.

description: updated
description: updated
Revision history for this message
Bartosz Woronicz (mastier1) wrote :

Small summary:

2.9.18: all machines got a correct mongodb replica, cluster not assembled
2.9.16: added manual machines fail to start mongodb
2.9.14: all machines got a correct mongodb replica, cluster assembled
```
ubuntu@segotl4892:~/volvo_hcl/cpe-deployments$ juju controllers --refresh
Controller                     Model    User   Access     Cloud/Region                Models  Nodes  HA  Version
volvo-manual-juju-controller*  default  admin  superuser  volvo-manual-cloud/default       2      3   3  2.9.14
```

steps to reproduce
1. add manual cloud
$ cat <<EOF > cloud.yaml
clouds:
  manual-cloud:
    type: manual
    endpoint: ubuntu@x.x.x.x
    regions:
      default: {}
EOF
$ juju add-cloud -f cloud.yaml --client
2. bootstrap manual cloud (note: the tested setup had proxy settings in juju-model.yaml)
$ juju --debug bootstrap --agent-version=2.9.x --config=bundles/juju-model.yaml manual-cloud manual-juju-controller

3. observe
$ watch -c juju machines -m controller --color
wait for all machines to show green status

4. add other two machines
$ juju add-machine -m controller ssh:ubuntu@y.y.y.y
$ juju add-machine -m controller ssh:ubuntu@z.z.z.z

5. observe
$ watch -c juju machines -m controller --color
wait for all machines to show green status

6. enable HA
$ juju enable-ha --to 1,2

7. observe
$ watch -c juju machines -m controller --color
wait for all machines to show green status
check the logs on the machines at /var/log/juju/machine-*.log

Revision history for this message
Bartosz Woronicz (mastier1) wrote :

After more trials I see the following:
2.9.16 and 2.9.18 both fail to start mongodb, even before enable-ha.

2.9.14 works every time.

Revision history for this message
Peter Jose De Sousa (pjds) wrote :

Updated the workaround as per mastier1's suggestion

description: updated
Revision history for this message
Peter Jose De Sousa (pjds) wrote :

Moving to field critical, as the latest version of Juju is inoperable on manual machines.

Ian Booth (wallyworld)
Changed in juju:
milestone: none → 2.9.20
status: New → Triaged
importance: Undecided → Critical
Changed in juju:
assignee: nobody → Achilleas Anagnostopoulos (achilleasa)
Revision history for this message
Achilleas Anagnostopoulos (achilleasa) wrote :

We have tracked down the root cause of this bug and are working on a fix.

Revision history for this message
Simon Richardson (simonrichardson) wrote :
Revision history for this message
Bartosz Woronicz (mastier1) wrote :

Hi, I attached logs from different versions.
Up to 2.9.16 seems to work most of the time.
2.9.17 won't convert machines after enable-ha.
Since 2.9.18, the manual machines go into the down state even before enable-ha.

Revision history for this message
Haw Loeung (hloeung) wrote :

It may be of relevance here, LP:1942804 was filed to exclude the FAN addresses/interfaces.

Revision history for this message
Ian Booth (wallyworld) wrote :

This PR

https://github.com/juju/replicaset/pull/16

fixes the problem upstream. A key issue was that we initially set the first node in the replicaset using the container address and later attempted to reconfigure it using the fan address. That update failed because of a bug in some rework done to accommodate mongo 4.4.

Now, I realise there's another request not to use the fan addresses to configure the replicaset - that's a separate Juju fix not related to this work.
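
When checking which kind of address a replicaset member is registered under, a small CIDR test can tell a fan overlay address apart from others. This is a rough sketch only: the helper names `ip_to_int` and `in_cidr` are hypothetical, the overlay value is the one from this report's description, and it does not reflect how Juju or the replicaset library selects addresses.

```shell
#!/bin/sh
# Convert a dotted-quad IPv4 address to a 32-bit integer.
# The function body runs in a subshell so the IFS change stays local.
ip_to_int() (
    IFS=.
    set -- $1
    echo $(( ($1 << 24) | ($2 << 16) | ($3 << 8) | $4 ))
)

# Return 0 if address $1 falls inside CIDR block $2, otherwise non-zero.
in_cidr() (
    ip_int="$(ip_to_int "$1")"
    net_int="$(ip_to_int "${2%/*}")"
    prefix="${2#*/}"
    if [ "$prefix" -eq 0 ]; then
        mask=0
    else
        mask=$(( (0xFFFFFFFF << (32 - $prefix)) & 0xFFFFFFFF ))
    fi
    [ $(( $ip_int & $mask )) -eq $(( $net_int & $mask )) ]
)

# Usage, with the overlay from the bug description:
#   JUJU_FAN_OVERLAY="252.0.0.0/16"
#   in_cidr 252.0.12.1 "$JUJU_FAN_OVERLAY" && echo "fan overlay address"
```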

Changed in juju:
assignee: Achilleas Anagnostopoulos (achilleasa) → Ian Booth (wallyworld)
Ian Booth (wallyworld)
Changed in juju:
status: Triaged → Fix Committed
Ian Booth (wallyworld)
Changed in juju:
milestone: 2.9.20 → 2.9.21
Changed in juju:
status: Fix Committed → Fix Released