[VRF] juju bootstrap io timeout

Bug #2031313 reported by Peter Jose De Sousa
Affects: Canonical Juju
Status: Triaged
Importance: Wishlist
Assigned to: Joseph Phillips

Bug Description

Hello,

When bootstrapping Juju onto a VRF-enabled machine, the juju agent script tries to configure itself on the default VRF. This is a problem because the agent then fails to connect to some internal services, since 127.0.0.0/8 addresses are not routable from within the VRF.

For example take the following setup:

Node 0:

Default VRF: 10.10.32.3
MGMT VRF: 10.10.33.2

Default VRF routes:

default via 10.10.32.1 dev eth0 proto static
10.10.32.0/24 dev eth0 proto kernel scope link src 10.10.32.2

MGMT VRF Routes:

default via 10.10.33.1 dev eth1 proto static
10.10.33.0/24 dev eth1 proto kernel scope link src 10.10.33.2
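
A quick way to see the loopback problem (a sketch; assumes the mgmt VRF above and some local listener on loopback, e.g. the controller's MongoDB on port 37017):

# from the default VRF context, loopback behaves as usual
nc -z -w 2 127.0.0.1 37017 && echo reachable

# inside the mgmt VRF the socket is bound to the vrf device, and the VRF
# routing table has no 127.0.0.0/8 route, so the connect times out
sudo ip vrf exec mgmt nc -z -w 2 127.0.0.1 37017 || echo unreachable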

Cloud:

manual-vrf-test:
    type: manual
    endpoint: 10.10.33.2
    regions:
      default: {}
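
With that definition in place, the failure reproduces with something like the following (a sketch; the YAML file name is assumed, and the manual endpoint may need a user prefix such as ubuntu@10.10.33.2):

juju add-cloud manual-vrf-test ./manual-vrf-test.yaml
juju bootstrap manual-vrf-test --debug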

The cloud-init script starts and works because the default VRF has a default route, so it is able to fetch its resources.

I also saw behaviour here where Juju overwrote my systemd unit files, breaking my SSH configuration (removing my VRF configuration).

[Configuration]

sshd unit file
[Unit]
Description=OpenBSD Secure Shell server
Documentation=man:sshd(8) man:sshd_config(5)
After=network.target auditd.service
ConditionPathExists=!/etc/ssh/sshd_not_to_be_run

[Service]
EnvironmentFile=-/etc/default/ssh
ExecStartPre=/usr/sbin/sshd -t
ExecStart=/usr/sbin/ip vrf exec mgmt /usr/sbin/sshd -D $SSHD_OPTS
ExecReload=/usr/sbin/sshd -t
ExecReload=/bin/kill -HUP $MAINPID
KillMode=process
Restart=on-failure
RestartPreventExitStatus=255
Type=notify
RuntimeDirectory=sshd
RuntimeDirectoryMode=0755

[Install]
WantedBy=multi-user.target
Alias=sshd.service

(Overriding didn't work here, as it was not possible to override the ExecStart argument; see the drop-in sketch below.)
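
For reference, the standard drop-in pattern for replacing ExecStart (clear it with an empty ExecStart=, then re-set it) would look like this sketch; per the notes above, it did not work in this case:

sudo mkdir -p /etc/systemd/system/ssh.service.d
printf '%s\n' '[Service]' 'ExecStart=' \
  'ExecStart=/usr/sbin/ip vrf exec mgmt /usr/sbin/sshd -D $SSHD_OPTS' | \
  sudo tee /etc/systemd/system/ssh.service.d/10-vrf.conf
sudo systemctl daemon-reload
sudo systemctl restart ssh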

VRF Config:

network:
    ethernets:
        eth0:
            addresses:
            - 10.10.32.27/24
            routes:
            - to: default
              via: 10.10.32.1
            match:
                macaddress: 00:16:3e:dd:b5:66
            mtu: 1500
            nameservers:
                addresses:
                - 10.10.32.2
                search:
                - vlan32.maas
            set-name: eth0
        eth1:
            addresses:
            - 10.10.33.2/24
            match:
                macaddress: 00:16:3e:45:26:86
            mtu: 1500
            nameservers:
                addresses:
                - 10.10.32.2
                search:
                - vlan32.maas
            routes:
            - to: default
              via: 10.10.32.1
            set-name: eth1
    vrfs:
        mgmt:
          interfaces: [eth1]
          table: 14

    version: 2
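
Once applied, the VRF wiring can be sanity-checked with iproute2 (expected values follow from the config above):

sudo netplan apply
ip vrf show              # expect a "mgmt" entry with table 14
ip route show vrf mgmt   # expect eth1's default and connected routes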

[ss output (Default VRF)]

ss -plnt
State Recv-Q Send-Q Local Address:Port Peer Address:Port Process
LISTEN 0 4096 127.0.0.53%lo:53 0.0.0.0:*
LISTEN 0 128 0.0.0.0:22 0.0.0.0:*
LISTEN 0 4096 0.0.0.0:37017 0.0.0.0:*
LISTEN 0 128 [::]:22 [::]:*
LISTEN 0 4096 [::]:37017 [::]:*

[Steps to reproduce]

1. Deploy an Ubuntu Jammy machine
2. Set up the VRF (netplan config above, or the iproute2 sketch below)
3. Attempt to bootstrap a manual controller onto this machine
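
For step 2, the netplan stanza above is one option; the by-hand equivalent with plain iproute2 is roughly:

sudo ip link add mgmt type vrf table 14
sudo ip link set dev mgmt up
sudo ip link set dev eth1 master mgmt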

[Error output]
ERROR cannot store binary file: cannot clean up after failed storage operation because: read tcp 127.0.0.1:35942->127.0.0.1:37017: i/o timeout: cannot add resource "buckets/fb62d6e2-44c7-466e-852a-dd82f503dad8/tools/2.9.44-ubuntu-amd64-8cd50bb83047ffaaa816d154cba671da26f7e694f51a9fcb5bce968b84a5331d" to store at storage path "b2c36a68-7bd8-4e8f-8688-f5e3616a1509": failed to flush data: read tcp 127.0.0.1:35942->127.0.0.1:37017: i/o timeout
2023-08-14 12:53:05 DEBUG cmd supercommand.go:537 error stack:
read tcp 127.0.0.1:35942->127.0.0.1:37017: i/o timeout
github.com/juju/blobstore/v2.(*gridFSStorage).Put:69: failed to flush data
github.com/juju/blobstore/v2.(*managedStorage).putForEnvironment:258: cannot add resource "buckets/fb62d6e2-44c7-466e-852a-dd82f503dad8/tools/2.9.44-ubuntu-amd64-8cd50bb83047ffaaa816d154cba671da26f7e694f51a9fcb5bce968b84a5331d" to store at storage path "b2c36a68-7bd8-4e8f-8688-f5e3616a1509"
github.com/juju/blobstore/v2.cleanupResourceCatalog:191: cannot clean up after failed storage operation because: read tcp 127.0.0.1:35942->127.0.0.1:37017: i/o timeout
github.com/juju/juju/state/binarystorage.(*binaryStorage).Add:54: cannot store binary file
github.com/juju/juju/cmd/jujud/agent.(*BootstrapCommand).populateTools:540:
github.com/juju/juju/cmd/jujud/agent.(*BootstrapCommand).Run:364:
2023-08-14 12:53:05 DEBUG juju.cmd.jujud main.go:283 jujud complete, code 0, err <nil>

[Workaround]

Apply the VRF configuration after Juju has bootstrapped.
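
In practice that means something like this (a sketch; keep the vrfs stanza out of netplan until bootstrap completes):

juju bootstrap manual-vrf-test   # bootstrap while only the default VRF exists
sudo netplan apply               # then move eth1 into the mgmt VRF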

Thanks,
Peter

Tags: vrf
Joseph Phillips (manadart) wrote :

Interestingly, this would probably also fail with the way we currently have Dqlite working on our "main" branch, earmarked for Juju 4.0.

We will bind Dqlite to a local-cloud address *if there is a unique one available*, otherwise falling back to the loopback IP and rebinding when we enter HA.

We have plans to move some behaviour into the controller charm, whereby we could use a cluster peer relation and bindings to allow this unique selection from the outset.

I doubt we will address this on the current 3.x major, but we will keep it in mind as we progress.

Changed in juju:
status: New → Triaged
importance: Undecided → Wishlist
assignee: nobody → Joseph Phillips (manadart)