Juju controller keeps restarting when deployed with juju-ha-space and juju-mgmt-space

Bug #1966983 reported by Sandor Zeestraten
This bug affects 2 people
Affects: Canonical Juju
Status: Fix Released
Importance: High
Assigned to: Ian Booth
Milestone: 2.9.28

Bug Description

# Version
Juju 2.9.27
MAAS 2.9.2

# Problem
After bootstrapping a Juju controller on MAAS with juju-ha-space and juju-mgmt-space set, the Juju controller agent restarts every few minutes. The controller machine logs contain many "connection broken unexpectedly" messages, timeouts, and other unexpected errors.

When bootstrapping in the same environment without these controller config options, the controller agent works OK (no restarts and no issues in the logs).

# Controller log and IP info
ubuntu@hcc-admin26-vm01:~$ pastebinit /var/log/juju/machine-0.log https://paste.ubuntu.com/p/ZgQDJMjMHf/

ubuntu@hcc-admin26-vm01:~$ ip --br addr
lo UNKNOWN 127.0.0.1/8 ::1/128
eth0 UP 10.42.198.122/23 fe80::5054:ff:fe77:21b9/64
eth1 UP 10.42.200.16/23 fe80::5054:ff:fe27:c344/64
ubuntu@hcc-admin26-vm01:~$ ip --br route
default via 10.42.200.1 dev eth1 proto static
10.42.198.0/23 dev eth0 proto kernel scope link src 10.42.198.122
10.42.200.0/23 dev eth1 proto kernel scope link src 10.42.200.16

# Other info
ubuntu@hcc-admin23:~$ juju spaces
Name Space ID Subnets
alpha 0
site2-oob 1 10.42.196.0/23
site2-os-public 2 10.42.208.0/23
site2-neutron-external 3 10.42.210.0/23
site2-os-data 4 172.17.100.0/23
site2-os-internal 5 172.17.102.0/23
site2-ceph-public 6 172.17.104.0/23
site2-ceph-cluster 7 172.17.106.0/23
site2-oam 8 10.42.200.0/23
site2-provision 9 10.42.198.0/23

ubuntu@hcc-admin23:~$ cat controller-config.yaml
default-space: site2-oam
juju-ha-space: site2-oam
juju-mgmt-space: site2-oam
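
For readers following along, here is a minimal sketch of how the two controller addresses map onto the spaces listed above. It only restates the `juju spaces` and `ip --br addr` output from this report; the helper name `addresses_in_space` is hypothetical, and this is not how Juju itself selects addresses.

```python
import ipaddress

# Subnets copied from the `juju spaces` output in this report.
SPACES = {
    "site2-oam": ["10.42.200.0/23"],
    "site2-provision": ["10.42.198.0/23"],
}

# Addresses copied from `ip --br addr` on the controller machine.
CONTROLLER_ADDRESSES = ["10.42.198.122", "10.42.200.16"]


def addresses_in_space(space, addresses, spaces=SPACES):
    """Return the addresses that fall inside any subnet of the named space.

    Hypothetical helper for illustration only; not part of Juju.
    """
    networks = [ipaddress.ip_network(cidr) for cidr in spaces[space]]
    return [
        addr
        for addr in addresses
        if any(ipaddress.ip_address(addr) in net for net in networks)
    ]


if __name__ == "__main__":
    for name in SPACES:
        print(name, addresses_in_space(name, CONTROLLER_ADDRESSES))
```

With juju-ha-space and juju-mgmt-space both set to site2-oam, only 10.42.200.16 (eth1) lies in the configured space, while 10.42.198.122 (eth0) lies in site2-provision.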

ubuntu@hcc-admin23:~$ juju bootstrap prodmaas-site2 prodmaas-site2-controller --bootstrap-constraints "tags=virtual,ssd" --config controller-config.yaml
Creating Juju controller "prodmaas-site2-controller" on prodmaas-site2/default
Looking for packaged Juju agent version 2.9.27 for amd64
Located Juju agent version 2.9.27-ubuntu-amd64 at https://streams.canonical.com/juju/tools/agent/2.9.27/juju-2.9.27-ubuntu-amd64.tgz
Launching controller instance(s) on prodmaas-site2/default...
 - ccy63p (arch=amd64 mem=8G cores=2)
Installing Juju agent on bootstrap instance
Fetching Juju Dashboard 0.8.1
Waiting for address
Attempting to connect to 10.42.198.122:22
Attempting to connect to 10.42.200.16:22
Connected to 10.42.198.122
Running machine configuration script...
Bootstrap agent now started
Contacting Juju controller at 10.42.198.122 to verify accessibility...

Bootstrap complete, controller "prodmaas-site2-controller" is now available
Controller machines are in the "controller" model
Initial model "default" added

ubuntu@hcc-admin23:~$ juju show-controller prodmaas-site2-controller
prodmaas-site2-controller:
  details:
    uuid: 1bc6a86b-09e9-4121-8da8-297f350e8447
    controller-uuid: 1bc6a86b-09e9-4121-8da8-297f350e8447
    api-endpoints: ['10.42.200.16:17070', '10.42.198.122:17070']
    cloud: prodmaas-site2
    region: default
    agent-version: 2.9.27
    agent-git-commit: acb32588d1752e813b36e3491f0eb44cde7c0684
    controller-model-version: 2.9.27
    mongo-version: 4.4.11
    ca-fingerprint: 95:B7:5A:76:F7:1C:6F:79:51:77:E7:BB:50:F3:D2:D3:B7:7E:04:F9:83:63:18:01:70:9B:7E:8E:B7:3D:85:07
    ca-cert: |
      -----BEGIN CERTIFICATE-----
      MIIEEjCCAnqgAwIBAgIUEuYxJXtVXTc7eHGVpeUqnOfW/hwwDQYJKoZIhvcNAQEL
      BQAwITENMAsGA1UEChMESnVqdTEQMA4GA1UEAxMHanVqdS1jYTAeFw0yMjAzMjkx
      MzE5MTNaFw0zMjAzMjkxMzI0MTNaMCExDTALBgNVBAoTBEp1anUxEDAOBgNVBAMT
      B2p1anUtY2EwggGiMA0GCSqGSIb3DQEBAQUAA4IBjwAwggGKAoIBgQCjtdziOTkm
      bw/jYojYF5A0pSCCr1Qi3R0d05jUMFNwbLqb+DncAGyVbeevOtZQEFaFQWXMr0Y4
      hAb9Bg59Keagr6rWQmsrbH4mAdhrbFXKmGAyDP2PsJlcVlgvT0m9Pjwue52lk8rn
      SlUj3bq4uyqcm1VgiRnE5JJASwIOYqymE3Au4RO3Ki7M1GYx4o9FcB6x8+tFFpWt
      8ByadTXslytNFFIMFMtR492JR/Ggz+kvjB0S9VxZNovcw3tReMEbwQo3XWV1j6hu
      VFf0WwxNJPQb9t5PmZreRhtNcTSBkcDpmOXcDjVqypc52QKFFwTP3AQA+7bOTtZQ
      FdAIy9u1vHf2bl63Qt6ABJHgxW8nIuenC66DyL+uyLNhoBi3H649QkwTvSNnvVfy
      8oPLj4Y0zTgp60vAoaF7HJ+d/WkhLIC2XFVV0vdEPrRxF5qlgrbBY8tRIIwHq3WI
      G7mIYtNMR4c1Cp8cYNsPOsMUqqx6zJuQtainqauK1Td+AUqcZ72QfjMCAwEAAaNC
      MEAwDgYDVR0PAQH/BAQDAgKkMA8GA1UdEwEB/wQFMAMBAf8wHQYDVR0OBBYEFAQs
      hK4JNFdQcou+2DSCykQ1fuK7MA0GCSqGSIb3DQEBCwUAA4IBgQCXMOM5OtkUEQLb
      HHcNIHm1K25JXmYMe3vK0dpKFlo+wi8BRFBdMJB+uLVadMTPPI3czIVtXIEp3nVk
      8LCIYgTBUDYo7jjAv3wqePWHg4uBfa32XzWQQ4tES+E3PbuWaYSgI42zOUxqo9/N
      G6GkxqBZwLVueLCjp34rEyqq3K9K3YpWszIZPeYHcxFCWSRT+scLUjlXJxq1VeDa
      jbC1MQpZWrXGPN88zJbr1/TB3VA16lQO/9gy8uSJoEc0VkZtLJUMnYKLDlvjF6ox
      XWL8jP7ozIrrbOFN+DFa3JbhTJjwTbeD+CZhrhg0oyCtBsa4Z7PpNsKzze5YPDlS
      1S7fA/BLL55QeXcYZMS/ydQgG0R1VkC5sxtiy2oZ1w/ffZ3YqfW94zjrPdI4Ikp/
      /Zs6QNXF/iZah6N8fWO5epQYDUgow8/5Ccs9CGlBRut2nsCzDXtAmZQiQie8COzL
      3O4ZOsZo6B8ZmlYY0Rd2BlHH7tFeDAKIBRWK51mqYrHtRo4ShoY=
      -----END CERTIFICATE-----
  controller-machines:
    "0":
      instance-id: ccy63p
  models:
    controller:
      uuid: bcc708d7-ac36-441a-8f47-4e96af3a316d
      model-uuid: bcc708d7-ac36-441a-8f47-4e96af3a316d
      machine-count: 1
      core-count: 2
    default:
      uuid: 252a0d89-ee0b-4e61-8daa-32a672741454
      model-uuid: 252a0d89-ee0b-4e61-8daa-32a672741454
  current-model: admin/default
  account:
    user: admin
    access: superuser

ubuntu@hcc-admin23:~$ juju controller-config
Attribute Value
agent-logfile-max-backups 2
agent-logfile-max-size 100M
api-port 17070
api-port-open-delay 2s
audit-log-capture-args false
audit-log-exclude-methods ReadOnlyMethods
audit-log-max-backups 10
audit-log-max-size 300M
auditing-enabled true
batch-raft-fsm false
ca-cert |
  -----BEGIN CERTIFICATE-----
  MIIEEjCCAnqgAwIBAgIUEuYxJXtVXTc7eHGVpeUqnOfW/hwwDQYJKoZIhvcNAQEL
  BQAwITENMAsGA1UEChMESnVqdTEQMA4GA1UEAxMHanVqdS1jYTAeFw0yMjAzMjkx
  MzE5MTNaFw0zMjAzMjkxMzI0MTNaMCExDTALBgNVBAoTBEp1anUxEDAOBgNVBAMT
  B2p1anUtY2EwggGiMA0GCSqGSIb3DQEBAQUAA4IBjwAwggGKAoIBgQCjtdziOTkm
  bw/jYojYF5A0pSCCr1Qi3R0d05jUMFNwbLqb+DncAGyVbeevOtZQEFaFQWXMr0Y4
  hAb9Bg59Keagr6rWQmsrbH4mAdhrbFXKmGAyDP2PsJlcVlgvT0m9Pjwue52lk8rn
  SlUj3bq4uyqcm1VgiRnE5JJASwIOYqymE3Au4RO3Ki7M1GYx4o9FcB6x8+tFFpWt
  8ByadTXslytNFFIMFMtR492JR/Ggz+kvjB0S9VxZNovcw3tReMEbwQo3XWV1j6hu
  VFf0WwxNJPQb9t5PmZreRhtNcTSBkcDpmOXcDjVqypc52QKFFwTP3AQA+7bOTtZQ
  FdAIy9u1vHf2bl63Qt6ABJHgxW8nIuenC66DyL+uyLNhoBi3H649QkwTvSNnvVfy
  8oPLj4Y0zTgp60vAoaF7HJ+d/WkhLIC2XFVV0vdEPrRxF5qlgrbBY8tRIIwHq3WI
  G7mIYtNMR4c1Cp8cYNsPOsMUqqx6zJuQtainqauK1Td+AUqcZ72QfjMCAwEAAaNC
  MEAwDgYDVR0PAQH/BAQDAgKkMA8GA1UdEwEB/wQFMAMBAf8wHQYDVR0OBBYEFAQs
  hK4JNFdQcou+2DSCykQ1fuK7MA0GCSqGSIb3DQEBCwUAA4IBgQCXMOM5OtkUEQLb
  HHcNIHm1K25JXmYMe3vK0dpKFlo+wi8BRFBdMJB+uLVadMTPPI3czIVtXIEp3nVk
  8LCIYgTBUDYo7jjAv3wqePWHg4uBfa32XzWQQ4tES+E3PbuWaYSgI42zOUxqo9/N
  G6GkxqBZwLVueLCjp34rEyqq3K9K3YpWszIZPeYHcxFCWSRT+scLUjlXJxq1VeDa
  jbC1MQpZWrXGPN88zJbr1/TB3VA16lQO/9gy8uSJoEc0VkZtLJUMnYKLDlvjF6ox
  XWL8jP7ozIrrbOFN+DFa3JbhTJjwTbeD+CZhrhg0oyCtBsa4Z7PpNsKzze5YPDlS
  1S7fA/BLL55QeXcYZMS/ydQgG0R1VkC5sxtiy2oZ1w/ffZ3YqfW94zjrPdI4Ikp/
  /Zs6QNXF/iZah6N8fWO5epQYDUgow8/5Ccs9CGlBRut2nsCzDXtAmZQiQie8COzL
  3O4ZOsZo6B8ZmlYY0Rd2BlHH7tFeDAKIBRWK51mqYrHtRo4ShoY=
  -----END CERTIFICATE-----
charmstore-url https://api.jujucharms.com/charmstore
controller-name prodmaas-site2-controller
controller-uuid 1bc6a86b-09e9-4121-8da8-297f350e8447
juju-db-snap-channel 4.4/stable
juju-ha-space site2-oam
juju-mgmt-space site2-oam
max-agent-state-size 524288
max-charm-state-size 2.097152e+06
max-debug-log-duration 24h0m0s
max-prune-txn-batch-size 1e+06
max-prune-txn-passes 100
max-txn-log-size 10M
metering-url https://api.jujucharms.com/omnibus/v3
migration-agent-wait-time 15m
model-logfile-max-backups 2
model-logfile-max-size 10M
model-logs-size 20M
mongo-memory-profile default
non-synced-writes-to-raft-log false
prune-txn-query-count 1000
prune-txn-sleep-time 10ms
set-numa-control-policy false
state-port 37017

ubuntu@hcc-admin23:~$ juju status -m controller --debug
15:42:19 INFO juju.cmd supercommand.go:56 running juju [2.9.27 0 acb32588d1752e813b36e3491f0eb44cde7c0684 gc go1.17.8]
15:42:19 DEBUG juju.cmd supercommand.go:57 args: []string{"/snap/juju/18573/bin/juju", "status", "-m", "controller", "--debug"}
15:42:19 INFO juju.juju api.go:78 connecting to API addresses: [10.42.198.122:17070 10.42.200.16:17070]
15:42:19 DEBUG juju.api apiclient.go:1153 successfully dialed "wss://10.42.198.122:17070/model/bcc708d7-ac36-441a-8f47-4e96af3a316d/api"
15:42:19 INFO juju.api apiclient.go:688 connection established to "wss://10.42.198.122:17070/model/bcc708d7-ac36-441a-8f47-4e96af3a316d/api"
Model Controller Cloud/Region Version SLA Timestamp
controller prodmaas-site2-controller prodmaas-site2/default 2.9.27 unsupported 15:42:19+02:00

Machine State DNS Inst id Series AZ Message
0 started 10.42.198.122 hcc-admin26-vm01 focal hcc-rack11

15:42:19 DEBUG juju.api monitor.go:35 RPC connection died
15:42:19 INFO cmd supercommand.go:544 command finished

DUFOUR Olivier (odufourc) wrote :

Hello,

I notice your controller has 2 network interfaces/spaces.

I may be having the same issue right now.
Just to be sure, can you provide the log from /var/snap/juju-db/common/logs/mongodb.log?

Another good test would be to check whether your controller is stable if you set juju-ha-space to site2-provision instead when bootstrapping.

Sandor Zeestraten (szeestraten) wrote :

Hi Olivier,

Thanks for your quick reply.

Yes, the controller has 2 networks where site2-provision is the MAAS PXE network.
Here is the mongodb.log from the controller: https://paste.ubuntu.com/p/mQ7w5hBZvS/

Bootstrapping with juju-ha-space=site2-provision and juju-mgmt-space=site2-provision is stable.

Ian Booth (wallyworld) wrote :

This PR

https://github.com/juju/juju/pull/13911

brings in a fix

https://github.com/juju/replicaset/pull/18

from the upstream library. It was QA'ed with an LXD setup using a profile that creates 3 NICs on each container.

Changed in juju:
milestone: none → 2.9.28
assignee: nobody → Ian Booth (wallyworld)
importance: Undecided → High
status: New → In Progress
Ian Booth (wallyworld)
Changed in juju:
status: In Progress → Fix Committed
Sandor Zeestraten (szeestraten) wrote :

Thank you @wallyworld! Any idea when it'll show up in the 2.9/edge snap channel so we can give it a test?

Edit:
2.9.28 (18717) became available today in 2.9/candidate. I tested with the same bootstrap configuration and it seems to work now. Thanks!
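
As a rough aid for anyone checking their own 2.9 controller, the sketch below compares an agent-version string (as shown by `juju show-controller`) against 2.9.28, the milestone in which the fix landed. The function names are hypothetical, and it assumes a plain numeric `major.minor.patch` string with no suffix.

```python
def parse_version(text):
    """Split a plain 'major.minor.patch' string into a tuple of ints.

    Assumption: no suffixes such as '-ubuntu-amd64' or '-beta1'.
    """
    return tuple(int(part) for part in text.split("."))


# Milestone recorded on this bug for the fix.
FIXED_IN = parse_version("2.9.28")


def has_fix(agent_version):
    """Return True if the given 2.9-series agent version is 2.9.28 or later."""
    return parse_version(agent_version) >= FIXED_IN


if __name__ == "__main__":
    for version in ("2.9.27", "2.9.28"):
        print(version, has_fix(version))
```

Comparing integer tuples rather than raw strings avoids ordering mistakes such as "2.9.9" sorting after "2.9.28".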

DUFOUR Olivier (odufourc) wrote :

I can confirm that the issue is now gone with 2.9.28 on my side as well.

Changed in juju:
status: Fix Committed → Fix Released