Unable to install a subcloud due to VLAN networking failure

Bug #2013372 reported by Kyle MacLeod
Affects: StarlingX
Status: Fix Released
Importance: High
Assigned to: Kyle MacLeod
Milestone: (none)

Bug Description

Brief Description

When trying to install a subcloud in 22.12, we observe that ostree pull fails because of network issues.

NOTE: This is an IPv4 network

Severity

Critical: System/Feature is not usable after the defect

Steps to Reproduce

Install the system controller with OAM and mgmt on different VLANs on the same physical pxeboot interface. The subclouds should have the same configuration.

The system consists of 2 system controllers and a worker, plus one subcloud (simplex) with a networking configuration similar to the controllers.

Now try to deploy the subcloud. The installation fails at the ostree pull step.

Expected Behavior

Subcloud should be deployed

Actual Behavior

Subcloud fails to deploy.

Reproducibility

100%

System Configuration

2 system controllers, 1 worker

1 simplex subcloud

OAM and MGMT are both on the pxeboot interface via VLANs

TOR switch, with the VLANs configured on it.

Load info (eg: 2022-03-10_20-00-07)

starlingx master

Workaround

The root cause is a bug in miniboot.cfg: it adds the default route on the underlying management device (${mgmt_dev}) instead of the VLAN interface (${mgmt_iface}).
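One way to confirm which device the default route ended up on, assuming you have shell access to the affected subcloud during the failed install, is to read it from the output of `ip route show default`. The helper below is only a sketch: the function name default_route_dev is hypothetical and is not part of miniboot.cfg.

```shell
#!/bin/sh
# Sketch: print the device of the first default route read from stdin.
# default_route_dev is a hypothetical helper name, not a StarlingX tool.
default_route_dev() {
    awk '$1 == "default" {
        for (i = 2; i < NF; i++)
            if ($i == "dev") { print $(i + 1); exit }
    }'
}

# On a live system (not run here):
#   ip -4 route show default | default_route_dev
```

With the bug present, the printed name would be expected to be the physical device rather than the VLAN interface.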

We can fix this by overriding the miniboot.cfg file used during remote installation.

Steps

The following steps are done on the active system controller.

1. Copy miniboot.cfg into /var/miniboot/kickstart-override (create the directory first):

1a) If /var/www/pages/feed/rel-22.12/kickstart/miniboot.cfg exists:

sudo mkdir -p /var/miniboot/kickstart-override
sudo cp /var/www/pages/feed/rel-22.12/kickstart/miniboot.cfg /var/miniboot/kickstart-override/

OR

1b) If /var/www/pages/feed/rel-22.12/kickstart/miniboot.cfg does not exist yet, extract it from the current load ISO.

Assuming you have already done a load-import, you can extract miniboot.cfg from the ISO as follows:

sudo mkdir -p /var/miniboot/kickstart-override
sudo mkdir /mnt/iso
sudo mount -o loop /opt/dc-vault/loads/22.12/bootimage.iso /mnt/iso
sudo cp /mnt/iso/kickstart/miniboot.cfg /var/miniboot/kickstart-override/
sudo umount /mnt/iso
sudo rmdir /mnt/iso
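
Steps 1a and 1b can be combined into a single helper. This is a sketch and not part of the original workaround: the function name stage_miniboot_override and the SUDO variable are hypothetical, and the default paths are the ones quoted above.

```shell
#!/bin/sh
# Sketch: stage miniboot.cfg into the override directory (steps 1a / 1b).
# SUDO is a hypothetical knob; set SUDO="" to try this on scratch paths.
SUDO=${SUDO-sudo}

stage_miniboot_override() {
    feed_cfg=${1:-/var/www/pages/feed/rel-22.12/kickstart/miniboot.cfg}
    iso=${2:-/opt/dc-vault/loads/22.12/bootimage.iso}
    override_dir=${3:-/var/miniboot/kickstart-override}

    $SUDO mkdir -p "$override_dir" || return 1

    if [ -f "$feed_cfg" ]; then
        # 1a: the feed already contains miniboot.cfg
        $SUDO cp "$feed_cfg" "$override_dir/"
        return
    fi

    # 1b: extract miniboot.cfg from the imported load ISO
    mnt=$(mktemp -d) || return 1
    $SUDO mount -o loop "$iso" "$mnt" || { rmdir "$mnt"; return 1; }
    $SUDO cp "$mnt/kickstart/miniboot.cfg" "$override_dir/"
    rc=$?
    $SUDO umount "$mnt"
    rmdir "$mnt"
    return $rc
}
```

Called with no arguments on the active system controller, it uses the paths from this report. The ISO branch uses a temporary mount point instead of /mnt/iso so that nothing is left behind if the copy fails.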

2. Edit the file to change the default route setting

Run the following command, which replaces 'mgmt_dev' with 'mgmt_iface' on lines 1589 and 1590:

sudo sed -i.orig '1589,+1s|dev ${mgmt_dev}|dev ${mgmt_iface}|' /var/miniboot/kickstart-override/miniboot.cfg

The original file can be compared with the new file to ensure that it was successful. You should see the following:

[sysadmin@controller-0 ~(keystone_admin)]$ diff /var/miniboot/kickstart-override/miniboot.cfg.orig /var/miniboot/kickstart-override/miniboot.cfg
1589,1590c1589,1590
< ilog "ip ${BOOTPARAM_IP_VER} route add default ${BOOTPARAM_ROUTE_OPTIONS} dev ${mgmt_dev} ${BOOTPARAM_METRIC}"
< ip ${BOOTPARAM_IP_VER} route add default ${BOOTPARAM_ROUTE_OPTIONS} dev ${mgmt_dev} ${BOOTPARAM_METRIC}
---
> ilog "ip ${BOOTPARAM_IP_VER} route add default ${BOOTPARAM_ROUTE_OPTIONS} dev ${mgmt_iface} ${BOOTPARAM_METRIC}"
> ip ${BOOTPARAM_IP_VER} route add default ${BOOTPARAM_ROUTE_OPTIONS} dev ${mgmt_iface} ${BOOTPARAM_METRIC}
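
The sed command above depends on the two lines sitting at 1589 and 1590, which may not hold for a different build of miniboot.cfg. A pattern-based variant is sketched below. It assumes the only default-route lines that reference ${mgmt_dev} are the two shown in the diff; the function names patch_default_route and verify_default_route are hypothetical.

```shell
#!/bin/sh
# Sketch: patch the default-route lines by pattern instead of line number.
# Function names are hypothetical; review the diff against .orig afterwards.

patch_default_route() {
    cfg=$1
    # Only lines containing "route add default" are edited; a .orig
    # backup is kept, as with the line-number based command.
    sed -i.orig '/route add default/s|dev ${mgmt_dev}|dev ${mgmt_iface}|' "$cfg"
}

verify_default_route() {
    cfg=$1
    routes=$(grep 'route add default' "$cfg")
    # Require at least one default route on the VLAN interface ...
    printf '%s\n' "$routes" | grep -qF 'dev ${mgmt_iface}' || return 1
    # ... and none left on the underlying device.
    ! printf '%s\n' "$routes" | grep -qF 'dev ${mgmt_dev}'
}

# Example (run with sudo on the active system controller):
#   patch_default_route /var/miniboot/kickstart-override/miniboot.cfg
#   verify_default_route /var/miniboot/kickstart-override/miniboot.cfg && echo patched
```

Other uses of ${mgmt_dev} in the file are left untouched, since only the default route needs to move to the VLAN interface.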

3. Run the remote installation

OpenStack Infra (hudson-openstack) wrote : Fix proposed to metal (master)

Fix proposed to branch: master
Review: https://review.opendev.org/c/starlingx/metal/+/879068

Changed in starlingx:
status: New → In Progress
OpenStack Infra (hudson-openstack) wrote : Fix merged to metal (master)

Reviewed: https://review.opendev.org/c/starlingx/metal/+/879068
Committed: https://opendev.org/starlingx/metal/commit/5bd181cdcfa770273112d411b412a9645035a54c
Submitter: "Zuul (22348)"
Branch: master

commit 5bd181cdcfa770273112d411b412a9645035a54c
Author: Kyle MacLeod <email address hidden>
Date: Thu Mar 30 13:21:40 2023 -0400

    miniboot: fix incorrect vlan interface applied for default route

    This commit fixes a bug where the ip route add default is referencing
    the management device name instead of the interface name (containing
    vlan tag).

    The issue is only seen when the OAM network is on a VLAN and
    is a separate network (requires nexthop_gateway setting in
    install values).

    The fix is to apply the route on the vlan interface, not the top-level
    network device interface.

    Test Plan
    PASS:
    - Verify installation on system with OAM network on separate VLAN using
      nexthop_gateway
    - Verify installation on system with vlan but no nexthop_gateway

    Closes-Bug: 2013372
    Change-Id: Ic3febbd0cb77dd21435f23859e6d228e6ab95a8c
    Signed-off-by: Kyle MacLeod <email address hidden>

Changed in starlingx:
status: In Progress → Fix Released
tags: added: stx.9.0 stx.me
tags: added: stx.metal
removed: stx.me
Ghada Khalil (gkhalil)
Changed in starlingx:
importance: Undecided → High
assignee: nobody → Kyle MacLeod (kmacleod)