ACC100 device unavailable to kubernetes after lock/unlock

Bug #2045149 reported by Steven Webster
This bug affects 1 person
Affects    Status        Importance  Assigned to     Milestone
StarlingX  Fix Released  Low         Steven Webster

Bug Description

Brief Description
-----------------
After a restore operation and a few weeks of running, a system was locked and unlocked. After the unlock, pods using an ACC100 FEC device could not obtain an SR-IOV VF from the device.

Note that this is not easily reproducible.

Severity
--------
Minor: System/Feature is usable with minor issue

Steps to Reproduce
------------------
Note that this is not easily reproducible, but the following steps were done on an affected system.

- Perform a backup/restore on an AIO-SX system
- Observe that the pci_device inventory for the ACC100 device has its sriov_numvfs and sriov_vf_driver entries cleared.
- Perform a lock/unlock (this will probably require a second unlock attempt)
- After the system comes up, any pods that were previously using an SR-IOV VF of an ACC100 FEC device fail to start.

Expected Behavior
------------------
Pods that were using an SR-IOV VF from an ACC100 device can start.

Actual Behavior
----------------
Pods that were using an SR-IOV VF from an ACC100 device can't start.

Reproducibility
---------------
Seen once. Based on the logs, I reproduced the scenario by modifying the database directly to force the issue.

System Configuration
--------------------
AIO-SX

Branch/Pull Time/Commit
-----------------------
stx-8.0

Test Activity
-------------
Field operation

Workaround
----------
If this scenario is encountered, an extra lock/unlock would likely resolve the issue.

OpenStack Infra (hudson-openstack) wrote : Fix proposed to config (master)

Fix proposed to branch: master
Review: https://review.opendev.org/c/starlingx/config/+/902161

Changed in starlingx:
status: New → In Progress
OpenStack Infra (hudson-openstack) wrote : Fix merged to config (master)

Reviewed: https://review.opendev.org/c/starlingx/config/+/902161
Committed: https://opendev.org/starlingx/config/commit/305dc493af2b4912bae246b69cf2eb119b790acb
Submitter: "Zuul (22348)"
Branch: master

commit 305dc493af2b4912bae246b69cf2eb119b790acb
Author: Steven Webster <email address hidden>
Date: Tue Nov 28 15:09:06 2023 -0500

    Fix FEC pcidp resource generation with expected data

    An issue was seen after restoring a system making use of the
    ACC100 FEC device.

    After restore, the values for sriov_numvfs and sriov_vf_driver
    in the database were 0/None.

    I am unsure if this is an error in the restore process, or whether
    there was some subsequent database corruption. The issue was seen
    on only one system out of many.

    In any case, this seems to have been handled in the generation of
    the actual ACC100 device config in the past. That is, we store
    the 'expected' value of the sriov_numvfs and sriov_vf_driver in
    the extra_info field of the pci_device table. These values are
    preferred over the actual values in the DB.

    The issue here is that in generating the SR-IOV device plugin
    resource data for puppet, the 'actual' values are used, rather
    than the 'expected' values. Under the current logic, this causes
    the hieradata generation to skip the device, as its
    sriov_vf_driver is NULL.

    This commit makes the generation of the SR-IOV device plugin
    resource data consistent with the method used for the actual
    configuration data of the device, based on a preference for the
    'expected' data in the extra_info field of the device.

    Test Plan:

    Force the issue seen in the field by setting the sriov_numvfs=0
    and sriov_vf_driver=None in the database.

      - Lock/unlock host and ensure that the hieradata is based on
        the expected_vf_driver.
      - The unit test cases should cover all cases of the modified
        function

    Closes-Bug: #2045149

    Change-Id: Ic7beb4e6a6fd69901db3a012649461fc445380ee
    Signed-off-by: Steven Webster <email address hidden>
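
The preference described in the commit message can be illustrated with a minimal Python sketch. This is not the code from the actual change: the helper names are hypothetical, the assumption that extra_info holds a JSON object is mine, and the key expected_numvfs is assumed by analogy with the expected_vf_driver name mentioned in the test plan.

```python
import json


def effective_sriov_settings(device):
    """Return (numvfs, vf_driver) for a device row given as a dict.

    Hypothetical helper: 'expected' values stored in extra_info
    (assumed here to be a JSON object) win over the 'actual' columns.
    """
    raw = device.get("extra_info")
    try:
        expected = json.loads(raw) if raw else {}
    except ValueError:
        # Unparseable extra_info: fall back to the actual columns.
        expected = {}
    if not isinstance(expected, dict):
        expected = {}

    # 'expected_numvfs' is an assumed key name.
    numvfs = expected.get("expected_numvfs")
    if numvfs is None:
        numvfs = device.get("sriov_numvfs") or 0

    vf_driver = expected.get("expected_vf_driver")
    if vf_driver is None:
        vf_driver = device.get("sriov_vf_driver")

    return int(numvfs), vf_driver


def include_in_pcidp_resources(device):
    """Keep a device only if the effective settings give it usable VFs."""
    numvfs, vf_driver = effective_sriov_settings(device)
    return numvfs > 0 and vf_driver is not None


# A row in the state described in this bug: actual columns cleared,
# expected values still present in extra_info.
restored = {
    "sriov_numvfs": 0,
    "sriov_vf_driver": None,
    "extra_info": json.dumps(
        {"expected_numvfs": 1, "expected_vf_driver": "vfio"}
    ),
}
print(include_in_pcidp_resources(restored))
```

With only the actual columns consulted, the row above would be skipped because sriov_vf_driver is None; consulting the expected values first keeps the device in the generated resource data.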

Changed in starlingx:
status: In Progress → Fix Released
Ghada Khalil (gkhalil)
Changed in starlingx:
importance: Undecided → Low
tags: added: stx.9.0