Database shows incorrect device (ACC100) inventory after restore/reboot, takes multiple lock/unlock to correct that
Affects | Status | Importance | Assigned to | Milestone | |
---|---|---|---|---|---|
StarlingX | Fix Released | Medium | Tara Nath Subedi | | |
Bug Description
Brief Description
-----------------
The reported ACC100 device inventory is incorrect (num_vfs=0, driver=None) after the server is restored from backup, even though the backup contains the correct configuration (num_vfs=1, driver=igb_uio). As a result, no ACC100 device config is passed to Kubernetes on the next reboot.
Severity
--------
Major
Steps to Reproduce
------------------
1. Take a backup of the node after configuring the ACC100 device (driver:igb_uio, vf-driver:igb_uio, N:1).
2. Restore the node from backup.
3. Issue was seen after the node came back up (system host-device-
Expected Behavior
------------------
The ACC100 device inventory should be reflected in the database as configured (driver:igb_uio, vf-driver:igb_uio, N:1).
Actual Behavior
----------------
The ACC100 device inventory in the database is shown as driver:None, vf-driver:None, N:0.
Reproducibility
---------------
Intermittent
frequency of occurrence: NA
System Configuration
-------
One node system, Multi-node system
Branch/Pull Time/Commit
-------
STX master 2024-01-03_10:44:56
Last Pass
---------
NA
This happens only in the restore case, and only intermittently, which may be why it was not caught until now.
Timestamp/Logs
--------------
2024-01-
sysinv 2024-01-04 18:16:43.277 37578 INFO sysinv.
sysinv 2024-01-04 18:16:45.421 71618 ERROR sysinv.
sysinv 2024-01-04 18:16:46.575 37578 WARNING wsme.api [-] Client-side error: Expecting sriov_numvfs=1 for FEC device pciaddr=
sysinv 2024-01-04 18:17:17.160 71618 INFO sysinv.
reboot: around 2024-01-
sysinv 2024-01-04 18:22:12.044 4968 INFO sysinv.
sysinv 2024-01-04 18:22:12.044 4968 INFO sysinv.
sysinv 2024-01-04 18:22:13.570 4968 WARNING sysinv.agent.pci [-] Port speed detected as -1 for: ens1f2 (link operstate: down)
sysinv 2024-01-04 18:23:15.130 4968 INFO sysinv.
2024-01-
2024-01-
sysinv 2024-01-04 18:25:30.237 4968 INFO sysinv.
sysinv 2024-01-04 18:25:30.238 4968 INFO sysinv.
sysinv 2024-01-04 18:25:34.424 95811 INFO sysinv.
sysinv 2024-01-04 18:26:28.008 95811 ERROR sysinv.
reboot: around 2024-01-
The Kubernetes config does not have the ACC100 device config.
Test Activity
-------------
Developer Testing
Workaround
----------
The server requires multiple lock/unlocks to recover.
Changed in starlingx: | |
assignee: | nobody → Tara Nath Subedi (tsubedi) |
Changed in starlingx: | |
status: | New → In Progress |
Changed in starlingx: | |
status: | In Progress → Fix Released |
Reviewed: https://review.opendev.org/c/starlingx/config/+/908520
Committed: https://opendev.org/starlingx/config/commit/0b8956bf1fe6ce3d398184a304e8ecf7e14810c1
Submitter: "Zuul (22348)"
Branch: master
commit 0b8956bf1fe6ce3d398184a304e8ecf7e14810c1
Author: Tara Subedi <email address hidden>
Date: Thu Feb 8 14:51:50 2024 -0500
Report port and device inventory after the worker manifest
The SR-IOV configuration of a device is not retained across reboots,
until puppet manifests bind/enable completes. The sysinv-agent should
not report device inventory at any time after it is started, it should
wait until puppet worker manifest completes.
Upon reboot, SR-IOV configuration (of ACC100) (sriov_numvfs=0) is
updated to intended configuration by puppet worker manifest. In this
case, there is a small chance that the sysinv-agent audit (every 60
seconds) will run before the driver configuration. Since the agent will
only actually report the port and device inventory once, the SR-IOV
configuration data is not accurately reflected in the db, thus
requiring additional lock/unlock(s) to force correction.
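To illustrate what an early audit would observe, here is a minimal sketch (not sysinv code) that reads the SR-IOV state of a PCI device from Linux sysfs. The function name `read_sriov_state` and the `sysfs_root` parameter are hypothetical and exist only for this example; the `sriov_numvfs` attribute and the `driver` symlink are standard sysfs entries for PCI devices.

```python
import os


def read_sriov_state(pci_addr, sysfs_root="/sys/bus/pci/devices"):
    """Return the VF count and bound driver of a PCI device as seen right now.

    Hypothetical helper for illustration; sysinv-agent's real inventory
    code is structured differently.
    """
    dev_dir = os.path.join(sysfs_root, pci_addr)

    # sriov_numvfs holds the number of currently enabled VFs.
    # Treat a missing or unreadable attribute as 0 VFs.
    try:
        with open(os.path.join(dev_dir, "sriov_numvfs")) as f:
            num_vfs = int(f.read().strip())
    except (OSError, ValueError):
        num_vfs = 0

    # "driver" is a symlink to the bound driver's directory; it is absent
    # while no driver is bound to the device.
    driver_link = os.path.join(dev_dir, "driver")
    driver = None
    if os.path.islink(driver_link):
        driver = os.path.basename(os.readlink(driver_link))

    return {"num_vfs": num_vfs, "driver": driver}
```

If this kind of read happens before the puppet worker manifest has bound the driver and enabled the VFs, it yields num_vfs=0 and driver=None, which matches the incorrect values seen in the database.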
After restore and reboot, the /etc/platform/.initial_worker_config_complete
and /var/run/.worker_config_complete files did not exist until the puppet
worker manifest completed. The sysinv-agent audit happened to read the
device inventory before the driver configuration (i.e. before the worker
manifest completed), so the inventory was not accurately reflected in
the db.
This commit ensures that port and device configuration are only
reported after the worker manifest has completed.
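The gating idea can be sketched as follows. This is an illustration under stated assumptions, not the code from the commit: the `InventoryReporter` class and its callbacks are hypothetical, and the choice of the /var/run flag file as the gate is an assumption based on the file names mentioned in this commit message.

```python
import os

# Assumed gate: the per-boot flag file named in the commit message.
# Which flag the real fix checks is an assumption of this sketch.
WORKER_CONFIG_COMPLETE = "/var/run/.worker_config_complete"


class InventoryReporter:
    """Hypothetical report-once audit that waits for the worker manifest."""

    def __init__(self, read_inventory, send_report,
                 flag_path=WORKER_CONFIG_COMPLETE):
        self.read_inventory = read_inventory  # callable returning inventory
        self.send_report = send_report        # callable taking the inventory
        self.flag_path = flag_path
        self.reported = False

    def audit(self):
        """Run one audit cycle; return True only if a report was sent."""
        if self.reported:
            # Inventory is reported only once per agent lifetime.
            return False
        if not os.path.exists(self.flag_path):
            # Worker manifest not finished: SR-IOV state may still be
            # unconfigured, so defer to the next audit cycle.
            return False
        self.send_report(self.read_inventory())
        self.reported = True
        return True
```

With this shape, an audit that fires before the flag file exists does nothing and does not consume the single report, so the later audit reports the configured state.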
TEST PLAN:
PASS: Restore node from backup (ACC100 device config:
driver: igb_uio, vf-driver: igb_uio, N:1); once the node
comes back up, check host-device-list/show for after-boot
update time and num_vfs = 1.
Closes-Bug: 2053149
Change-Id: If35ae3c9359139db9128859e49df20ed2ecd86af
Signed-off-by: Tara Nath Subedi <email address hidden>