IPv6: All hosts remain offline after booting off the controller-0

Bug #1915050 reported by Ghada Khalil
Affects: StarlingX
Status: Fix Released
Importance: Critical
Assigned to: Yue Tao

Bug Description

Brief Description
-----------------
controller-1 and the worker nodes remained offline after being PXE-booted and installed from controller-0. This is seen on multi-node systems w/ IPv6.

For some reason, the hosts' mgmt address has an incorrect prefix length (/128):

10: vlan133@ens801f0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP group default qlen 1000
    link/ether 90:e2:ba:b0:e9:f4 brd ff:ff:ff:ff:ff:ff
    inet6 fd01:1::4/128 scope global dynamic
       valid_lft 5526sec preferred_lft 5226sec
    inet6 fe80::92e2:baff:feb0:e9f4/64 scope link
       valid_lft forever preferred_lft forever

It should have been inet6 fd01:1::4/64.
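
One plausible reading of why the prefix length matters (not confirmed in this report) is that a /128 address implies no on-link subnet, so peers on the management network are not treated as directly reachable unless another route covers them. A minimal sketch using Python's standard ipaddress module; the peer address fd01:1::2 is a hypothetical example, not taken from the affected system:

```python
import ipaddress

# Address as seen on the affected host (from the interface output above).
bad = ipaddress.ip_interface("fd01:1::4/128")
# Address as it should have been configured.
good = ipaddress.ip_interface("fd01:1::4/64")

# Hypothetical peer on the same management subnet (illustrative only).
peer = ipaddress.ip_address("fd01:1::2")


def is_on_link(iface, addr):
    """Return True if addr falls inside the network implied by iface's prefix."""
    return addr in iface.network


print(bad.network, is_on_link(bad, peer))
print(good.network, is_on_link(good, peer))
```

With /128 the implied network contains only the host's own address, while /64 covers the whole fd01:1::/64 subnet.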

Severity
--------
Critical

Steps to Reproduce
------------------
On a multi-node system w/ IPv6 config:
- Install controller-0
- Install/Configure the rest of the nodes

Expected Behavior
------------------
All nodes should install successfully and go to an online state

Actual Behavior
----------------
Controller-1 and worker nodes install, but remain in an offline state

Reproducibility
---------------
100% Reproducible on multi-node systems w/ IPv6 config

System Configuration
--------------------
multi-node systems w/ IPv6 config

Branch/Pull Time/Commit
-----------------------
stx master: Feb 1, 2021

Last Pass
---------
stx master: January 18, 2021
Note: There were build failures due to CENGN corruption from Jan 21 to Jan 28, hence the gap between the last pass and when the issue was discovered.

Timestamp/Logs
--------------
Issue is reproducible

Test Activity
-------------
Sanity

Workaround
----------
None

Revision history for this message
Ghada Khalil (gkhalil) wrote :

Adding investigation by Don Penney:
The nodes are installing fine, and the kickstarts are generating initial network-scripts with BOOTPROTO=dhcp. On the initial boot of the node post-install, the interface configuration is then done via DHCP.

The DHCP package was recently upversioned, Jan 22, from 4.2.5-68 to 4.2.5-82:
https://review.opendev.org/c/starlingx/integ/+/771752
https://review.opendev.org/c/starlingx/tools/+/771744 (dependency update)

Perhaps something in this update is causing the DHCP-assigned address to get a /128 prefix length.

tags: added: stx.5.0 stx.distro.other
Changed in starlingx:
importance: Undecided → Critical
status: New → Triaged
Revision history for this message
Ghada Khalil (gkhalil) wrote :

stx.5.0 / critical - issue affects all IPv6 multi-node systems

The above update should be reverted and tested to see if that addresses the issue. Then the issue can be investigated further before re-introducing this version of dhcp.

Changed in starlingx:
assignee: nobody → Yue Tao (wrytao)
Revision history for this message
Zhixiong Chi (zhixiongchi) wrote :

This issue appears to have been introduced by the patch named "dhcp-dhclient_ipv6_prefix.patch", which is included in the new version of dhcp from the CentOS srpm mirror (https://review.opendev.org/c/starlingx/integ/+/771752). This patch changes the default value of the IPv6 prefixlen to 128 when using dhcp.
The previous value was 64.
As a next step, we will revert the patch dhcp-dhclient_ipv6_prefix.patch to keep the previous value.
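
To check whether a node shows the symptom described above, one option is to scan the interface listing for global-scope IPv6 addresses that carry a /128 prefix length. This is a rough sketch, not part of the fix: the sample text is the output quoted in the bug description, and the function name is made up for illustration.

```python
import re

# Sample taken from the bug description; on a live node this text would come
# from the interface listing instead.
SAMPLE = """\
10: vlan133@ens801f0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP group default qlen 1000
    link/ether 90:e2:ba:b0:e9:f4 brd ff:ff:ff:ff:ff:ff
    inet6 fd01:1::4/128 scope global dynamic
       valid_lft 5526sec preferred_lft 5226sec
    inet6 fe80::92e2:baff:feb0:e9f4/64 scope link
       valid_lft forever preferred_lft forever
"""

# Captures address, prefix length, and scope from each "inet6" line.
INET6_RE = re.compile(r"^\s*inet6\s+(\S+)/(\d+)\s+scope\s+(\S+)", re.MULTILINE)


def global_host_prefix_addrs(ip_output):
    """Return global-scope IPv6 addresses that have a /128 prefix length."""
    return [
        addr
        for addr, plen, scope in INET6_RE.findall(ip_output)
        if scope == "global" and int(plen) == 128
    ]


print(global_host_prefix_addrs(SAMPLE))
```

A non-empty result on a DHCP-configured mgmt interface would match the behavior reported here; the link-local /64 address is ignored because its scope is not global.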

Revision history for this message
Ghada Khalil (gkhalil) wrote :

For now, we agreed to back out the dhcp package upversion completely to enable IPv6 multi-node configurations:
https://review.opendev.org/c/starlingx/integ/+/775056
https://review.opendev.org/c/starlingx/tools/+/775058

Changed in starlingx:
status: Triaged → Won't Fix
status: Won't Fix → In Progress
Revision history for this message
Ghada Khalil (gkhalil) wrote :

Revert was merged.

Changed in starlingx:
status: In Progress → Fix Released
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to tools (f/centos8)

Fix proposed to branch: f/centos8
Review: https://review.opendev.org/c/starlingx/tools/+/792229

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Change abandoned on tools (f/centos8)

Change abandoned by "Chuck Short <email address hidden>" on branch: f/centos8
Review: https://review.opendev.org/c/starlingx/tools/+/792229
Reason: Updated merge coming

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to tools (f/centos8)

Fix proposed to branch: f/centos8
Review: https://review.opendev.org/c/starlingx/tools/+/793627

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to integ (f/centos8)

Fix proposed to branch: f/centos8
Review: https://review.opendev.org/c/starlingx/integ/+/793754

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to tools (f/centos8)

Reviewed: https://review.opendev.org/c/starlingx/tools/+/793627
Committed: https://opendev.org/starlingx/tools/commit/d701c6f896dfe440566cc942e3dd71be1f19ae5d
Submitter: "Zuul (22348)"
Branch: f/centos8

commit 7b5f3a45e663866a3c0ca3ca86eb3c92bc7f0210
Author: Scott Little <email address hidden>
Date: Wed May 5 09:56:33 2021 -0400

    fix bad flockflock url pt 2

    A stray '}' character found its way into my prior update
    titled 'fix bad flockflock url' after testing. The result was
    the following error

    sed: -e expression #1, char 15: unexpected `}'

    This removes the unwanted '}', restoring the prior update
    to its intended form.

    Closes-bug: 1926987
    Signed-off-by: Scott Little <email address hidden>
    Change-Id: I48f4721ccaf121679916b01747243deedf5836cd

commit ac05493480f6df6f31d071d29380c1b4f35b70a9
Author: Scott Little <email address hidden>
Date: Tue May 4 12:42:36 2021 -0400

    fix git-review within docker build environment

    'tb create' fails to create a build environment since
    upstream git-review was updated of Apr 26.

    Fix is to install/update pbr ahead of git-review.

    Also, to reduce the likelihood of this recurring, lock
    down specific versions of the pypi supplied tools we
    know to work.

    Closes-bug: 1927137
    Signed-off-by: Scott Little <email address hidden>
    Change-Id: Ib9fe6fd33de4d637f254ac421cc0427ee6131b65

commit b96ebc83d859a4a7802a462504817ecec6182a7b
Author: Scott Little <email address hidden>
Date: Mon May 3 13:16:53 2021 -0400

    fix bad flockflock url

    download_mirror.sh fails due to a bad path containing
    ‘stx-tools/centos-mirror-tools/config/centos/flockflock’

    The path is constructed, and the trigger is when an EOL is missing
    from a centos_build_layer.cfg file, causing 'cat' to merge the last
    line of the offending file with the first line of the next file.

    Switch 'cat' to 'grep', which will always ensure an EOL is present.
    Along the way, we can filter out empty lines and comments.

    Closes-bug: 1926987
    Signed-off-by: Scott Little <email address hidden>
    Change-Id: I2404b3415f0f3e2f395c2bcb7a527aa01a488f61

commit 4c3ee114bcbff710c2049626044dd1ddc756cbd9
Author: Joe Slater <email address hidden>
Date: Tue Apr 27 18:50:53 2021 -0400

    screen: fix CVE-2021-26937 segfault

    Advance to screen-4.1.0-0.27.20120314git3c2946.el7_9.x86_64.rpm.

    Closes-bug: 1926372
    Change-Id: I41834e7b1e16153b0632751f59f7ac9f503389da
    Signed-off-by: Joe Slater <email address hidden>

commit e31e0dda7a4c09143d41cd518ab97ea6112d7fb5
Author: Li Zhou <email address hidden>
Date: Tue Apr 13 04:53:50 2021 -0400

    systemd: Upgrade to version 219-78.el7_9.3

    Refer the lst entries to the new version.

    Partial-Bug: #1924691
    Signed-off-by: Li Zhou <email address hidden>
    Change-Id: I557eff6a47f341cc67de02fd59024b28bb6cac84

commit 26db2859dd3a5c060c337b886fd16c4d2d9f93af
Author: Scott Little <email address hidden>
Date: Mon Apr 12 11:21:31 2021 -0400

    Replace basearch references in y...

tags: added: in-f-centos8
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to integ (f/centos8)

Reviewed: https://review.opendev.org/c/starlingx/integ/+/793754
Committed: https://opendev.org/starlingx/integ/commit/a13966754d4e19423874ca31bf1533f057380c52
Submitter: "Zuul (22348)"
Branch: f/centos8

commit b310077093fd567944c6a46b7d0adcabe1f2b4b9
Author: Mihnea Saracin <email address hidden>
Date: Sat May 22 18:19:54 2021 +0300

    Fix resize of filesystems in puppet logical_volume

    After system reinstalls there is stale data on the disk
    and puppet fails when resizing, reporting some wrong filesystem
    types. In our case docker-lv was reported as drbd when
    it should have been xfs.

    This problem was solved in some cases e.g:
    when doing a live fs resize we wipe the last 10MB
    at the end of partition:
    https://opendev.org/starlingx/stx-puppet/src/branch/master/puppet-manifests/src/modules/platform/manifests/filesystem.pp#L146

    Our issue happened here:
    https://opendev.org/starlingx/stx-puppet/src/branch/master/puppet-manifests/src/modules/platform/manifests/filesystem.pp#L65
    Resize can happen at unlock when a bigger size is detected for the
    filesystem and the 'logical_volume' will resize it.
    To fix this we have to wipe the last 10MB of the partition after the
    'lvextend' cmd in the 'logical_volume' module.

    Tested the following scenarios:

    B&R on SX with default sizes of filesystems and cgts-vg.

    B&R on SX with docker-lv of size 50G, backup-lv also 50G and
    cgts-vg with additional physical volumes:

    - name: cgts-vg
      physicalVolumes:
      - path: /dev/disk/by-path/pci-0000:00:0d.0-ata-1.0
        size: 50
        type: partition
      - path: /dev/disk/by-path/pci-0000:00:0d.0-ata-1.0
        size: 30
        type: partition
      - path: /dev/disk/by-path/pci-0000:00:0d.0-ata-3.0
        type: disk

    B&R on DX system with backup of size 70G and cgts-vg
    with additional physical volumes:

    physicalVolumes:
    - path: /dev/disk/by-path/pci-0000:00:0d.0-ata-1.0
      size: 50
      type: partition
    - path: /dev/disk/by-path/pci-0000:00:0d.0-ata-1.0
      size: 30
      type: partition
    - path: /dev/disk/by-path/pci-0000:00:0d.0-ata-3.0
      type: disk

    Closes-Bug: 1926591
    Change-Id: I55ae6954d24ba32e40c2e5e276ec17015d9bba44
    Signed-off-by: Mihnea Saracin <email address hidden>

commit 3225570530458956fd642fa06b83360a7e4e2e61
Author: Mihnea Saracin <email address hidden>
Date: Thu May 20 14:33:58 2021 +0300

    Execute once the ceph services script on AIO

    The MTC client manages ceph services via ceph.sh which
    is installed on all node types in
    /etc/service.d/{controller,worker,storage}/ceph.sh

    Since the AIO controllers have both controller and worker
    personalities, the MTC client will execute the ceph script
    twice (/etc/service.d/worker/ceph.sh,
    /etc/service.d/controller/ceph.sh).
    This behavior will generate some issues.

    We fix this by exiting the ceph script if it is the one from
    /etc/services.d/worker on AIO systems.

    Closes-Bug: 1928934
    Change-Id: I3e4dc313cc3764f870b8f6c640a60338...
