Two unlocks required when converting a single-nic system to enable SR-IOV on the underlying interface

Bug #1926366 reported by Steven Webster
This bug affects 1 person
Affects: StarlingX
Status: Fix Released
Importance: Low
Assigned to: Steven Webster
Milestone: (none listed)

Bug Description

Brief Description
-----------------
If a system is converted to a shared-nic configuration with SR-IOV enabled on the underlying physical interface, the system will undergo two reboots after the host is unlocked.

Severity
--------
Minor: The system recovers automatically, but it is rebooted twice after the first unlock.

Steps to Reproduce
------------------

1. Consider a system with mgmt and oam vlan interfaces on top of a physical ethernet platform interface:

system host-if-modify controller-0 eth0 -c platform
system host-if-add -V 10 controller-0 oam0 vlan eth0
system interface-network-assign controller-0 oam0 oam
system host-if-add -V 11 controller-0 mgmt0 vlan eth0
system interface-network-assign controller-0 mgmt0 mgmt

2. The system is then unlocked:

system host-unlock controller-0

3. When the system comes back up, the ethernet platform interface is then converted to be of class pci-sriov with 16 VFs:

system host-lock controller-0
system host-if-modify controller-0 eth0 -c pci-sriov -N 16

4. The system is then unlocked:

system host-unlock controller-0

5. When the controller manifest is applied, note that ceph-mon and pmond fail to bind to the management address, and the system is rebooted.

6. After the reboot, the system recovers.

Expected Behavior
------------------
The system should require only one unlock/reboot to apply the configuration.

Actual Behavior
----------------
The system goes through a second reboot because the controller manifest fails after the first reboot.

Reproducibility
---------------
100%

System Configuration
--------------------
This should apply to all configs (AIO/Standard). On an IPv6 system, the vlan interfaces lose their IPv6 addresses as well as the default route, if any. On an IPv4 system, the default route associated with the management interface is lost.

Branch/Pull Time/Commit
-----------------------
master 2021-01-27 or later

Last Pass
---------
N/A. Support for a single-nic configuration with SR-IOV is a recent feature.

Timestamp/Logs
--------------
Observe the puppet logs from the controller manifest application (on an AIO) or worker manifest application (on a Standard system)

Test Activity
-------------
Feature testing

Workaround
----------
The workaround is to configure the physical interface as pci-sriov in Step 1 of 'Steps to Reproduce', before the first unlock.
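
As a rough sketch of that workaround, the commands from Step 1 can be issued with the SR-IOV class and VF count set on eth0 from the start. The interface names, VLAN IDs, and VF count are the ones used in the steps above. The run helper and the DRY_RUN variable are hypothetical additions that only print each command, so the sequence can be reviewed before it is run on a real system.

```shell
# Hypothetical dry-run helper: prints each command unless DRY_RUN=0.
run() {
    if [ "${DRY_RUN:-1}" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

configure_single_nic_sriov() {
    # Set the pci-sriov class and VF count up front instead of 'platform'.
    run system host-if-modify controller-0 eth0 -c pci-sriov -N 16
    run system host-if-add -V 10 controller-0 oam0 vlan eth0
    run system interface-network-assign controller-0 oam0 oam
    run system host-if-add -V 11 controller-0 mgmt0 vlan eth0
    run system interface-network-assign controller-0 mgmt0 mgmt
    # A single unlock then applies the whole configuration.
    run system host-unlock controller-0
}

configure_single_nic_sriov
```

With the default DRY_RUN=1 the script prints the six commands without touching the system.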

Steven Webster (swebster-wr) wrote :

Triage:

This issue is ultimately caused by the apply-network-config step of the controller/worker manifest. This step launches a script that detects differences between the puppet view of what the ifcfg files in /etc/sysconfig/network-scripts should contain and what those files actually contain on the system. If there are differences, the puppet view of the interface configuration is copied to the system network-scripts directory and the interface is brought down and up to apply the config. If there are no differences between the puppet view and the system view, the interface is left alone.
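
The comparison described above can be illustrated with a minimal sketch. This is not the script shipped in stx-puppet: the function name, the directory arguments, and the sample file contents (including the placeholder pre-up line) are assumptions made for illustration only.

```shell
# Print the names of ifcfg files in $1 (puppet view) that are missing from,
# or differ from, the copy in $2 (system view). Names are hypothetical.
changed_ifcfg_files() {
    puppet_dir=$1
    system_dir=$2
    for f in "$puppet_dir"/ifcfg-*; do
        [ -e "$f" ] || continue      # the glob matched nothing
        name=${f##*/}
        # cmp -s exits non-zero when the files differ or one is missing.
        if ! cmp -s "$f" "$system_dir/$name" 2>/dev/null; then
            echo "$name"
        fi
    done
}

# Demo with throwaway directories standing in for the two views.
tmp=$(mktemp -d)
mkdir "$tmp/puppet" "$tmp/system"
printf 'DEVICE=eth0\n' > "$tmp/system/ifcfg-eth0"
# The extra line is a placeholder for the added pre-up option.
printf 'DEVICE=eth0\npre_up_placeholder=set-vf-count\n' > "$tmp/puppet/ifcfg-eth0"
printf 'DEVICE=vlan11\n' > "$tmp/system/ifcfg-vlan11"
printf 'DEVICE=vlan11\n' > "$tmp/puppet/ifcfg-vlan11"
changed_ifcfg_files "$tmp/puppet" "$tmp/system"   # prints: ifcfg-eth0
rm -rf "$tmp"
```

Only the file for eth0 is reported, so only eth0 would be brought down and up. The vlan file is unchanged and is left alone.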

When the underlying physical interface is configured for SR-IOV, commands to set the number of virtual functions are added to the pre-up option in the corresponding network-script. Puppet detects this change, copies the config, and brings the interface down and up. This causes the upper vlan interfaces to lose their IPv6 addresses and default route. On an IPv4 system, the default route is lost, which could be an issue in a distributed cloud environment.

Changed in starlingx:
status: New → Triaged
Ghada Khalil (gkhalil) wrote :

screening: marking minor / not gating given the system recovers automatically

Changed in starlingx:
importance: Undecided → Low
OpenStack Infra (hudson-openstack) wrote : Fix proposed to stx-puppet (master)

Fix proposed to branch: master
Review: https://review.opendev.org/c/starlingx/stx-puppet/+/788515

Changed in starlingx:
status: Triaged → In Progress
OpenStack Infra (hudson-openstack) wrote : Fix merged to stx-puppet (master)

Reviewed: https://review.opendev.org/c/starlingx/stx-puppet/+/788515
Committed: https://opendev.org/starlingx/stx-puppet/commit/69b9809465b5e7a837917cce7d0a731ddf257f0d
Submitter: "Zuul (22348)"
Branch: master

commit 69b9809465b5e7a837917cce7d0a731ddf257f0d
Author: Steven Webster <email address hidden>
Date: Tue Apr 27 17:54:24 2021 -0400

    Fix interface (re)configuration for single-nic system

    Currently, the apply-network-config manifest step launches a script
    that detects differences between puppet's view of what the
    ifcfg-* network scripts should be and what the value
    of the ifcfg files actually are in the /etc/sysconfig/network-scripts/
    directory.

    If there are differences, the puppet representation of the interface
    configuration is copied to the system network-scripts directory and
    the interface is brought down and up to apply the config.
    If there are no changes between the puppet view and the system view,
    the interface is left alone.

    An issue can occur in a single-nic system comprising a physical
    lower ethernet interface configured for SR-IOV with upper vlan
    interfaces (oam, mgmt, etc). If the lower interface is
    re-configured, it is subsequently brought down/up to apply
    the changes. This causes the upper vlan interfaces to also
    be brought down by the kernel. In the case of an IPv6 system,
    the interfaces will lose their addresses as well as any configured
    default route. In the case of an IPv4 system, the default route
    will be wiped out, which could cause issues in a distributed cloud
    environment.

    This commit addresses the issue by detecting whether any lower
    interface associated with a vlan interface has been marked for
    re-configuration. If this is the case, the vlan interface is
    also added to the up/down list to cause it to re-apply the
    existing static configuration (if it is not already in the list).

    Closes-Bug: 1926366
    Signed-off-by: Steven Webster <email address hidden>
    Change-Id: I40177900ef58a9619fecb34ceffc412f31d1a965
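
The approach in the commit message can be sketched as follows. This is an illustration under stated assumptions, not the code merged in stx-puppet: the function name and the "vlan:lower" pair format are hypothetical, and the interface names are the ones from the steps to reproduce.

```shell
# add_dependent_vlans CHANGED PAIRS
#   CHANGED: space-separated interfaces already marked for down/up
#   PAIRS:   space-separated "vlan:lower" pairs (hypothetical input format)
# Prints the updated space-separated down/up list.
add_dependent_vlans() {
    updated=$1
    for pair in $2; do
        vlan=${pair%%:*}
        lower=${pair#*:}
        case " $updated " in
            *" $lower "*)
                # The lower interface will be bounced, so the vlan must
                # re-apply its static config; add it unless already listed.
                case " $updated " in
                    *" $vlan "*) ;;
                    *) updated="$updated $vlan" ;;
                esac
                ;;
        esac
    done
    echo "$updated"
}

# eth0 changed; oam0 and mgmt0 sit on eth0, data0 sits on eth1.
add_dependent_vlans "eth0" "oam0:eth0 mgmt0:eth0 data0:eth1"
# prints: eth0 oam0 mgmt0
```

Because the vlan interfaces are appended after their lower interface, they are brought up again with their addresses and routes once eth0 has been reconfigured.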

Changed in starlingx:
status: In Progress → Fix Released
Ghada Khalil (gkhalil)
description: updated
OpenStack Infra (hudson-openstack) wrote : Fix proposed to stx-puppet (f/centos8)

Fix proposed to branch: f/centos8
Review: https://review.opendev.org/c/starlingx/stx-puppet/+/792009

OpenStack Infra (hudson-openstack) wrote : Change abandoned on stx-puppet (f/centos8)

Change abandoned by "Chuck Short <email address hidden>" on branch: f/centos8
Review: https://review.opendev.org/c/starlingx/stx-puppet/+/792009

OpenStack Infra (hudson-openstack) wrote : Fix proposed to stx-puppet (f/centos8)

Fix proposed to branch: f/centos8
Review: https://review.opendev.org/c/starlingx/stx-puppet/+/792013

OpenStack Infra (hudson-openstack) wrote : Change abandoned on stx-puppet (f/centos8)

Change abandoned by "Chuck Short <email address hidden>" on branch: f/centos8
Review: https://review.opendev.org/c/starlingx/stx-puppet/+/792013

OpenStack Infra (hudson-openstack) wrote : Fix proposed to stx-puppet (f/centos8)

Fix proposed to branch: f/centos8
Review: https://review.opendev.org/c/starlingx/stx-puppet/+/792018

OpenStack Infra (hudson-openstack) wrote : Change abandoned on stx-puppet (f/centos8)

Change abandoned by "Chuck Short <email address hidden>" on branch: f/centos8
Review: https://review.opendev.org/c/starlingx/stx-puppet/+/792018

OpenStack Infra (hudson-openstack) wrote : Fix proposed to stx-puppet (f/centos8)

Fix proposed to branch: f/centos8
Review: https://review.opendev.org/c/starlingx/stx-puppet/+/792029

OpenStack Infra (hudson-openstack) wrote : Fix merged to stx-puppet (f/centos8)

Reviewed: https://review.opendev.org/c/starlingx/stx-puppet/+/792029
Committed: https://opendev.org/starlingx/stx-puppet/commit/2b026190a3cb6d561b6ec4a46dfb3add67f1fa69
Submitter: "Zuul (22348)"
Branch: f/centos8

commit 3e3940824dfb830ebd39fd93265b983c6a22fc51
Author: Dan Voiculeasa <email address hidden>
Date: Thu May 13 18:03:45 2021 +0300

    Enable kubelet support for pod pid limit

    Enable limiting the number of pids inside of pods.

    Add a default value to protect against a missing value.
    Default to 750 pids limit to align with service parameter default
    value for most resource consuming StarlingX optional app (openstack).
    In fact any value above service parameter minimum value is good for the
    default.

    Closes-Bug: 1928353
    Signed-off-by: Dan Voiculeasa <email address hidden>
    Change-Id: I10c1684fe3145e0a46b011f8e87f7a23557ddd4a

commit 0c16d288fbc483103b7ba5dad7782e97f59f4e17
Author: Jessica Castelino <email address hidden>
Date: Tue May 11 10:21:57 2021 -0400

    Safe restart of the etcd SM service in etcd upgrade runtime class

    While upgrading the central cloud of a DC system, activation failed
    because there was an unexpected SWACT to controller-1. This was due
    to the etcd upgrade script. Part of this script runs the etcd
    manifest. This triggers a reload/restart of the etcd service. As this
    is done outside of the sm, sm saw the process failure and triggered
    the SWACT.

    This commit modifies platform::etcd::upgrade::runtime puppet class
    to do a safe restart of the etcd SM service and thus, solve the
    issue.

    Change-Id: I3381b6976114c77ee96028d7d96a00302ad865ec
    Signed-off-by: Jessica Castelino <email address hidden>
    Closes-Bug: 1928135

commit eec3008f600aeeb69a42338ed44332228a862d11
Author: Mihnea Saracin <email address hidden>
Date: Mon May 10 13:09:52 2021 +0300

    Serialize updates to global_filter in the AIO manifest

    Right now, looking at the aio manifest:
    https://review.opendev.org/c/starlingx/stx-puppet/+/780600/15/puppet-manifests/src/manifests/aio.pp
    there are 3 classes that update
    in parallel the lvm global_filter:
    - include ::platform::lvm::controller
    - include ::platform::worker::storage
    - include ::platform::lvm::compute
    And this generates some errors.

    We fix this by adding dependencies between the above classes
    in order to update the global_filter in a serial mode.

    Closes-Bug: 1927762
    Signed-off-by: Mihnea Saracin <email address hidden>
    Change-Id: If6971e520454cdef41138b2f29998c036d8307ff

commit 97371409b9b2ae3f0db6a6a0acaeabd74927160e
Author: Steven Webster <email address hidden>
Date: Fri May 7 15:33:43 2021 -0400

    Add SR-IOV rate-limit dependency

    Currently, the binding of an SR-IOV virtual function (VF) to a
    driver has a dependency on platform::networking. This is needed
    to ensure that SR-IOV is enabled (VFs created) before actually
    doing the bind.

    This dependency does not exist for configuring the VF rate-limits
    however. There is a cha...

tags: added: in-f-centos8