juju agent on the controller does not complete after bootstrap

Bug #2039436 reported by Yoshi Kadokawa
This bug affects 1 person
Affects Status Importance Assigned to Milestone
Canonical Juju
Fix Released
Joseph Phillips
Fix Released
Joseph Phillips
Fix Released
Joseph Phillips

Bug Description

When bootstrapping on a machine (manual provider) with more than 3 networks,
the bootstrap command completes without any errors. However, juju status for the controller unit stays at the `agent initialising` message, and the bootstrap process never completes.

$ juju status -m controller
Model Controller Cloud/Region Version SLA Timestamp
controller manual-default manual/default 3.1.6 unsupported 11:05:47Z

App Version Status Scale Charm Channel Rev Exposed Message
controller waiting 0/1 juju-controller 3.1/stable 14 no agent initialising

Unit Workload Agent Machine Public address Ports Message
controller/0 waiting allocating 0 agent initialising

Machine State Address Inst id Base AZ Message
0 started manual: ubuntu@22.04

$ juju ssh -m controller 0 'ip -br a'
lo UNKNOWN ::1/128
enp1s0 UP fe80::5054:ff:fe0a:9e30/64
enp2s0 UP fe80::5054:ff:febd:ca9/64
enp3s0 DOWN
enp4s0 DOWN
enp5s0 DOWN
enp6s0 UP fe80::5054:ff:fe21:cadf/64
enp7s0 DOWN
Connection to closed.

I have confirmed that this does not happen when bootstrapping on a machine with 2 networks.

For now, the workaround is to prepare the machine with fewer than 2 networks and add the necessary networks after bootstrap has completed.
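
Before bootstrapping, it may help to check how many non-loopback interfaces are up on the target machine. The following is a minimal sketch and not part of the original report: `count_up_nics` is a hypothetical helper that reads `ip -br a` output on standard input and counts the interfaces, other than `lo`, whose state is UP. It assumes the brief output format shown above (name, state, addresses).

```shell
#!/bin/sh
# Hypothetical helper (not part of Juju): count non-loopback interfaces
# reported as UP in `ip -br a` output read from standard input.
count_up_nics() {
    awk '$1 != "lo" && $2 == "UP" { n++ } END { print n + 0 }'
}

# Example, to be run on the machine that will be bootstrapped:
# ip -br a | count_up_nics
```

Applied to the listing in this report, the helper counts enp1s0, enp2s0 and enp6s0.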

Yoshi Kadokawa (yoshikadokawa) wrote :
Joseph Phillips (manadart) wrote :

The peer-grouper worker, which maintains the MongoDB cluster, needs to identify a unique local-cloud scoped address to which MongoDB is bound.

Where multiple such addresses are available, the one to use is specified by setting the juju-ha-space controller configuration option.

This could be made to work on MAAS, where the subnets/spaces are provider-sourced, and therefore known at model creation.

However this is not the case with the manual provider, as there are no space definitions.

As part of the Juju 4.0 work we are moving some of this behaviour to the controller charm itself, so we will endeavour to accommodate this scenario.

Until that time, the stated work-around is the avenue to take.
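
The selection rule described above can be illustrated with a small sketch. This is not Juju's peer-grouper code: `pick_mongo_address` is a hypothetical function, and the input format (one "address scope space" triple per line) is an assumption made for illustration. The optional argument stands in for the juju-ha-space setting.

```shell
#!/bin/sh
# Illustrative sketch only; this is not Juju's peer-grouper implementation.
# stdin: one "address scope space" triple per line (assumed format).
# $1:    optional space name, standing in for juju-ha-space.
# Prints the single usable address; fails if there is none or more than one.
pick_mongo_address() {
    awk -v space="$1" '
        $2 == "local-cloud" && (space == "" || $3 == space) { addr[n++] = $1 }
        END {
            if (n == 1) { print addr[0]; exit 0 }
            exit 1   # no candidate, or an ambiguous choice
        }'
}
```

With two local-cloud addresses and no space given, the function fails, which mirrors the situation on the manual provider where no space definitions exist to disambiguate.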

Changed in juju:
status: New → Triaged
assignee: nobody → Joseph Phillips (manadart)
importance: Undecided → Low
Joseph Phillips (manadart) wrote :

The work-around mentioned above creates a latent issue.

Progressive subnet discovery was added for the manual provider under this patch:

The problem is that it will only discover subnets from new *NICs*.

This means that since all devices but one were disabled in order to bootstrap and enter HA, they were already in Juju without addresses. Once we added addresses to them, the subnets for those addresses were not added to Juju.

If they were, we would have been able to carve the different subnets into spaces, and set one of those spaces as configuration for "juju-ha-space", which would have ensured a unique local-cloud address that the peer-grouper could use to maintain the Mongo control plane.

Once we had a restart (soft or hard), the peer-grouper, now in an error loop, could not broadcast the address information needed to establish the Raft transport. No Raft - no leases - no API.
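
The latent issue can be modelled with a short sketch. It is only an illustration of the "new NICs only" rule as described in this comment, not the actual Juju code: `discover_subnets` is a hypothetical function, its first argument is an assumed space-separated list of device names already recorded, and standard input is an assumed list of "device subnet" pairs currently observed on the machine.

```shell
#!/bin/sh
# Illustrative model only; names and data formats are assumptions.
# $1:    space-separated device names already known (with or without addresses).
# stdin: "device subnet-cidr" pairs observed on the machine now.
# Prints the subnets that a "new NICs only" rule would discover.
discover_subnets() {
    awk -v known="$1" '
        BEGIN { n = split(known, k, " "); for (i = 1; i <= n; i++) seen[k[i]] = 1 }
        NF >= 2 && !($1 in seen) { print $2 }'
}
```

In this model, a device that was recorded while disabled and later given an address contributes no subnet, because the device itself is not new.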

Nobuto Murata (nobuto) wrote :

Subscribing ~field-critical.

This is blocking an engagement and a proposed workaround didn't work after restarting jujud. There is no known workaround at this point and this issue invalidates some practical scenarios in customer engagements, namely:

- bootstrapping Juju with pre-created VMs
- bootstrapping Juju before having any provider
  - for microcloud charm or microk8s charm where there are only 3 physical hosts without MAAS beforehand

Joseph Phillips (manadart) wrote :

If the addresses enabled at bootstrap/HA-entry are acceptable for:

1. Mongo control plane.
2. Agent to controller communication.

Then post-bootstrap you can issue:

juju model-config -m controller ignore-machine-addresses=true

This is verified working for the Tamkeen cloud.

John A Meinel (jameinel) wrote :

Unsubscribing field-critical because we have the workaround. Feedback is welcome as to whether this should still be field-high.

Joseph Phillips (manadart) wrote :

I've created https://bugs.launchpad.net/juju/+bug/2052598 for the specific issue of subsequent subnet discovery.

Joseph Phillips (manadart) wrote :

The other part of this issue intersects with https://bugs.launchpad.net/juju/+bug/1990724.

Changed in juju:
importance: Low → High
status: Triaged → In Progress
Changed in juju:
milestone: none → 3.1.8
Joseph Phillips (manadart) wrote :
Changed in juju:
status: In Progress → Fix Committed
Changed in juju:
status: Fix Committed → Fix Released