juju enable-ha fails to allocate machines with multiple spaces

Bug #2055170 reported by Nobuto Murata
This bug affects 4 people
Affects: Canonical Juju
Status: Fix Released
Importance: High
Assigned to: Joseph Phillips
Milestone: 3.4.6

Bug Description

$ juju version
3.4.0-genericlinux-amd64

How to reproduce:
1. prepare a MAAS provider
2. prepare 3 machines with two interfaces with two network spaces
3. bootstrap Juju with the following command

$ juju bootstrap maas maas-controller \
    --model-default logging-config='<root>=DEBUG' \
    --model-default default-space=space-first \
    --bootstrap-constraints tags=juju \
    --config juju-mgmt-space=space-first \
    --config juju-ha-space=space-isolated

4. enable-ha

$ juju enable-ha
maintaining machines: 0
adding machines: 1, 2

Expected: Juju picks two additional machines and composes the HA controller cluster
Actual: Juju fails to allocate two additional machines for HA

There is no error in the constraints since `add-machine` works with the same constraints.

$ juju status -m controller
Model Controller Cloud/Region Version SLA Timestamp
controller maas-controller maas/default 3.4.0 unsupported 14:29:30Z

App Version Status Scale Charm Channel Rev Exposed Message
controller waiting 1/3 juju-controller 3.4/stable 79 no waiting for machine

Unit Workload Agent Machine Public address Ports Message
controller/0* active idle 0 192.168.151.117
controller/1 waiting allocating 1 waiting for machine
controller/2 waiting allocating 2 waiting for machine

Machine State Address Inst id Base AZ Message
0 started 192.168.151.117 major-beetle ubuntu@22.04 default Deployed
1 down pending ubuntu@22.04
2 down pending ubuntu@22.04

$ juju ssh -m controller 0 -- ip -br a
lo UNKNOWN 127.0.0.1/8 ::1/128
ens4 UP 192.168.151.117/24 fe80::5054:ff:fedc:11ce/64
ens8 UP 192.168.152.101/24 fe80::5054:ff:fe43:591c/64

$ juju spaces -m controller
Name Space ID Subnets
alpha 0
space-first 1 192.168.151.0/24
space-isolated 2 192.168.152.0/24
undefined 3 10.0.9.0/24

$ juju controller-config | grep space
juju-ha-space space-isolated
juju-mgmt-space space-first

$ juju constraints -m controller controller
arch=amd64 mem=3584M tags=juju spaces=space-first,space-isolated

Revision history for this message
Nobuto Murata (nobuto) wrote :

Subscribing ~field-high.

It's happening in a customer environment and it's reproducible on a test bed too. There is a manual workaround, though: run `juju add-machine` by hand and then `juju enable-ha --to`.

Revision history for this message
Nobuto Murata (nobuto) wrote :

When trying the same steps with 3.1, juju status shows a different message but still fails.

  "1":
    juju-status:
      current: down
      message: agent is not communicating with the server
      since: 27 Feb 2024 14:47:35Z
    instance-id: pending
    machine-status:
      current: provisioning error
      message: 'matching subnets to zones: cannot use space "alpha" as deployment
        target: no subnets'
      since: 27 Feb 2024 14:47:35Z
    modification-status:
      current: idle
      since: 27 Feb 2024 14:47:30Z
    base:
      name: ubuntu
      channel: "22.04"
    constraints: mem=3584M tags=juju spaces=space-first,space-isolated
    controller-member-status: adding-vote

Nobuto Murata (nobuto)
description: updated
Revision history for this message
Nobuto Murata (nobuto) wrote :

Explicitly adding --model-default default-space=space-first didn't help either. It sets the endpoint-bindings to the desired space instead of "alpha", but enable-ha still fails to allocate machines.

$ juju show-application -m controller controller
controller:
  charm: juju-controller
  base: ubuntu@22.04
  channel: 3.4/stable
  constraints:
    arch: amd64
    mem: 3584
    tags:
    - juju
    spaces:
    - space-first
    - space-isolated
  principal: true
  exposed: false
  remote: false
  life: alive
  endpoint-bindings:
    "": space-first
    dashboard: space-first
    metrics-endpoint: space-first
    website: space-first

It's the same thing, but somehow `juju enable-ha --constraints spaces=space-first` works to allocate machines, so I suppose `enable-ha` is doing something unnecessary.

Revision history for this message
Nicolas Vinuesa (nvinuesa) wrote :

We are currently taking a look at this in two different places: one is https://bugs.launchpad.net/juju/+bug/2052598 and the other is being done by ~manadart in https://github.com/juju/juju/pull/16891. I'll keep you posted and update this bug as needed.

Revision history for this message
Nobuto Murata (nobuto) wrote :

> We are currently taking a look at this in two different places: one is https://bugs.launchpad.net/juju/+bug/2052598 and the other is being done by ~manadart in https://github.com/juju/juju/pull/16891. I'll keep you posted and update this bug as needed.

I might be wrong, but those two issues (yes, I'm familiar, since both hit our previous projects) don't sound related to this one: they were basically about the manual provider, while this is with the MAAS provider, where proper network-space support is available.

Revision history for this message
Joseph Phillips (manadart) wrote (last edit):

Were there bootstrap constraints supplied initially?

Edit: I see there was a tag constraint.

When enabling HA, if no constraints are supplied, we use the initial controller as a reference and add its constraints.

Is it possible that the tag constraint was not able to be satisfied for the new machines? If that's the case, supplying the explicit space constraint would explain the work-around.
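The fallback described above can be sketched roughly like this (an illustrative Go model; `Constraints` and `constraintsForNewControllers` are hypothetical names, not Juju's actual types):

```go
package main

import "fmt"

// Constraints is a cut-down, illustrative model of machine constraints;
// the real type lives in github.com/juju/juju/core/constraints.
type Constraints struct {
	Tags   []string
	Spaces []string
}

// Empty reports whether no constraint values were supplied.
func (c Constraints) Empty() bool {
	return len(c.Tags) == 0 && len(c.Spaces) == 0
}

// constraintsForNewControllers sketches the behaviour described above:
// when enable-ha is given no constraints, fall back to those of the
// existing controller machine; otherwise use what was supplied.
func constraintsForNewControllers(supplied, existing Constraints) Constraints {
	if supplied.Empty() {
		return existing
	}
	return supplied
}

func main() {
	existing := Constraints{
		Tags:   []string{"juju"},
		Spaces: []string{"space-first", "space-isolated"},
	}
	// No constraints supplied: inherit tags=juju and both spaces.
	got := constraintsForNewControllers(Constraints{}, existing)
	fmt.Println(got.Tags, got.Spaces) // [juju] [space-first space-isolated]

	// Explicit constraint supplied (the reporter's work-around): use it as-is.
	got = constraintsForNewControllers(Constraints{Spaces: []string{"space-first"}}, existing)
	fmt.Println(got.Tags, got.Spaces) // [] [space-first]
}
```

Under this model, the inherited tag and space constraints would be applied verbatim to the new machines, which is why an unsatisfiable inherited constraint would explain the failure.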

Revision history for this message
Nobuto Murata (nobuto) wrote :

> When enabling HA, if no constraints are supplied, we use the initial controller as a reference and add its constraints.
>
> Is it possible that the tag constraint was not able to be satisfied for the new machines? If that's the case, supplying the explicit space constraint would explain the work-around.

As you can see in the description, the "controller" application has the following constraints based on the initial input at the bootstrap time.

$ juju constraints -m controller controller
arch=amd64 mem=3584M tags=juju spaces=space-first,space-isolated

`juju enable-ha` somehow cannot find a new machine based on those constraints. However, `juju add-machine --constraints 'arch=amd64 mem=3584M tags=juju spaces=space-first,space-isolated'` can find a new machine. So I think the constraints themselves are correct.

Revision history for this message
Joseph Phillips (manadart) wrote :

Based on the show-machine output from 3.1, we are somehow attempting to factor the alpha space into the bindings to consider for provisioning info.

Looking at the bindings for the controller application:

  endpoint-bindings:
    "": space-first
    dashboard: space-first
    metrics-endpoint: space-first
    website: space-first

There's no sensible reason for this.

Changed in juju:
status: New → Triaged
importance: Undecided → High
assignee: nobody → Joseph Phillips (manadart)
Revision history for this message
Joseph Phillips (manadart) wrote :

I've replicated this, and will fix it as I am able.

Revision history for this message
Joseph Phillips (manadart) wrote (last edit):

The reason for this is as follows.

The peer-grouper detects the controller machine before it has been seen by the provisioner. When it checks its addresses, it sees that there are no addresses in the configured juju-ha-space, and sets its status with an appropriate message.

The problem is that the peer-grouper assumes that the machine is "running" and uses that as the status value along with the message. This causes the machine to be ignored thereafter by the provisioner, which only attempts to provision machines that are still "pending".
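The interaction can be sketched like this (an illustrative Go model; the `Status` values and function names are hypothetical stand-ins for Juju's internals, not its actual API):

```go
package main

import "fmt"

// Status models the machine status values involved in this bug.
type Status string

const (
	Pending Status = "pending"
	Running Status = "running"
)

type Machine struct {
	ID     string
	Status Status
}

// shouldProvision mirrors the provisioner's gate described above:
// it only acts on machines whose status is still "pending".
func shouldProvision(m Machine) bool {
	return m.Status == Pending
}

// reportHASpaceProblem models the buggy peer-grouper behaviour: it
// records the missing-address message but also overwrites the status
// with "running", silently removing the machine from the
// provisioner's consideration.
func reportHASpaceProblem(m *Machine, msg string) {
	m.Status = Running // the bug: the status should be left as "pending"
	fmt.Printf("machine %s: %s (status now %q)\n", m.ID, msg, m.Status)
}

func main() {
	m := Machine{ID: "1", Status: Pending}
	fmt.Println("provision before:", shouldProvision(m)) // provision before: true

	reportHASpaceProblem(&m, "no addresses in space space-isolated")
	fmt.Println("provision after:", shouldProvision(m)) // provision after: false
}
```

In this sketch the machine is never provisioned after the peer-grouper runs, matching the "pending" instance-id and "down" agent seen in the status output above.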

Revision history for this message
Joseph Phillips (manadart) wrote :
Changed in juju:
milestone: none → 3.4.6
status: Triaged → Fix Committed
Revision history for this message
Jeff Hillman (jhillman) wrote :

Can we get this patch released, specifically to Juju 3.5? We're hitting this in multiple customer environments.

Revision history for this message
Ian Booth (wallyworld) wrote :

The planned release schedule has 3.4.6 this week and 3.5.4 next week.

Revision history for this message
Ante Karamatić (ivoks) wrote :

FWIW 3.5/edge (3.5.4-84ede84) does not seem to solve this problem...

Changed in juju:
status: Fix Committed → Fix Released