Database shows incorrect device (ACC100) inventory after restore/reboot, takes multiple lock/unlock to correct that

Bug #2053149 reported by Tara Nath Subedi
Affects: StarlingX
Status: Fix Released
Importance: Medium
Assigned to: Tara Nath Subedi

Bug Description

Brief Description
-----------------
The reported ACC100 device inventory is incorrect (num_vfs=0, driver=None) after the server is restored from a backup, even though the backup contains the correct configuration (num_vfs=1, driver=igb_uio). As a result, no ACC100 device configuration is passed to Kubernetes on the next reboot.

Severity
--------
Major

Steps to Reproduce
------------------
1. Take a backup of the node after configuring the ACC100 device (driver:igb_uio, vf-driver:igb_uio, N:1).
2. Restore the node from the backup.
3. After the node comes back up, check the update time and VFs with system host-device-list/show; the issue is visible at this point.

Expected Behavior
------------------
The ACC100 device inventory in the database should reflect the configured values (driver:igb_uio, vf-driver:igb_uio, N:1).

Actual Behavior
----------------
The ACC100 device inventory in the database shows driver:None, vf-driver:None, N:0.

Reproducibility
---------------
Intermittent
frequency of occurrence: NA

System Configuration
--------------------
One node system, Multi-node system

Branch/Pull Time/Commit
-----------------------
STX master 2024-01-03_10:44:56

Last Pass
---------
NA
It happens only in the restore case, and only intermittently, which may be why it was not caught until now.

Timestamp/Logs
--------------
2024-01-04-18-05-24_restore/puppet.log
sysinv 2024-01-04 18:16:43.277 37578 INFO sysinv.api.controllers.v1.host [-] Expecting sriov_numvfs=1 for FEC device pciaddr=0000:4b:00.0. Please wait a few minutes for inventory update a
sysinv 2024-01-04 18:16:45.421 71618 ERROR sysinv.puppet.kubernetes [-] Failed to get device id for pci device 0000:4b:00.0
sysinv 2024-01-04 18:16:46.575 37578 WARNING wsme.api [-] Client-side error: Expecting sriov_numvfs=1 for FEC device pciaddr=0000:4b:00.0. Please wait a few minu
sysinv 2024-01-04 18:17:17.160 71618 INFO sysinv.conductor.manager [-] update 0000:4b:00.0 attr: 'sriov_numvfs': '1\n', 'sriov_vfs_pci_address': '0000:4c:00.0', 'sriov_vf_driver': None, 'sriov_vf_pdevice_id': '0d5d', 'driver': 'igb_uio'

reboot: around 2024-01-04-18-22-07_aio
sysinv 2024-01-04 18:22:12.044 4968 INFO sysinv.agent.manager [-] _report_to_conductor initial_reports_required={'pv', 'pci_device', 'port', 'memory', 'cpu', 'disk', 'lvg', 'numa'}
sysinv 2024-01-04 18:22:12.044 4968 INFO sysinv.agent.manager [-] Sysinv Agent audit running inv_get_and_report.
sysinv 2024-01-04 18:22:13.570 4968 WARNING sysinv.agent.pci [-] Port speed detected as -1 for: ens1f2 (link operstate: down)
sysinv 2024-01-04 18:23:15.130 4968 INFO sysinv.agent.manager [-] get_ihost_by_macs rpc Timeout.
2024-01-04T18:24:05.496 controller-0 kernel: info [ 148.418787] igb_uio 0000:4b:00.0: enabling device (0140 -> 0142)
2024-01-04T18:24:08.253 controller-0 kernel: info [ 151.175570] igb_uio 0000:4c:00.0: enabling device (0000 -> 0002)
sysinv 2024-01-04 18:25:30.237 4968 INFO sysinv.agent.manager [-] get_ihost recovered from RPC timeout.
sysinv 2024-01-04 18:25:30.238 4968 INFO sysinv.agent.manager [-] _report_to_conductor initial_reports_required=={'pv', 'pci_device', 'port', 'memory', 'cpu', 'disk', 'lvg', 'numa'}
sysinv 2024-01-04 18:25:34.424 95811 INFO sysinv.conductor.manager [-] update 0000:4b:00.0 attr: 'sriov_numvfs': '0\n', 'sriov_vfs_pci_address': '', 'sriov_vf_driver': None, 'sriov_vf_pdevice_id': None, 'driver': None,
sysinv 2024-01-04 18:26:28.008 95811 ERROR sysinv.puppet.kubernetes [-] Failed to get device id for pci device 0000:4b:00.0

reboot: around 2024-01-10-17-14-42_aio
kubernetes config does not have ACC100 device config
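
To compare what the conductor recorded before and after the reboot, the "update ... attr:" lines quoted above can be extracted from the log with a short script. This is a minimal sketch and not part of sysinv: the regular expressions are assumptions based only on the log lines in this report, and parse_update is a hypothetical helper name.

```python
import re

# Patterns inferred from the conductor log lines quoted in this report.
UPDATE_RE = re.compile(
    r"^sysinv (?P<ts>\S+ \S+) \d+ INFO sysinv\.conductor\.manager \[-\] "
    r"update (?P<pci>\S+) attr: (?P<attrs>.*)$"
)
NUMVFS_RE = re.compile(r"'sriov_numvfs': '(\d+)")
# The leading quote keeps this from matching 'sriov_vf_driver'.
DRIVER_RE = re.compile(r"'driver': (?:None|'([^']*)')")


def parse_update(line):
    """Return timestamp, PCI address, VF count and driver, or None."""
    m = UPDATE_RE.match(line)
    if m is None:
        return None  # not a conductor device-update line
    numvfs = NUMVFS_RE.search(m.group("attrs"))
    driver = DRIVER_RE.search(m.group("attrs"))
    return {
        "timestamp": m.group("ts"),
        "pci": m.group("pci"),
        "sriov_numvfs": int(numvfs.group(1)) if numvfs else None,
        "driver": driver.group(1) if driver else None,
    }
```

Applied to the two conductor lines above, the 18:17:17 entry would show one VF bound to igb_uio, while the 18:25:34 entry after the reboot would show zero VFs and no driver, which is the stale record that ends up in the database.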

Test Activity
-------------
Developer Testing

Workaround
----------
The server requires multiple lock/unlocks to recover.

Changed in starlingx:
assignee: nobody → Tara Nath Subedi (tsubedi)
Changed in starlingx:
status: New → In Progress
OpenStack Infra (hudson-openstack) wrote : Fix merged to config (master)

Reviewed: https://review.opendev.org/c/starlingx/config/+/908520
Committed: https://opendev.org/starlingx/config/commit/0b8956bf1fe6ce3d398184a304e8ecf7e14810c1
Submitter: "Zuul (22348)"
Branch: master

commit 0b8956bf1fe6ce3d398184a304e8ecf7e14810c1
Author: Tara Subedi <email address hidden>
Date: Thu Feb 8 14:51:50 2024 -0500

    Report port and device inventory after the worker manifest

    The SR-IOV configuration of a device is not retained across reboots,
    until puppet manifests bind/enable completes. The sysinv-agent should
    not report device inventory at any time after it is started, it should
    wait until puppet worker manifest completes.

    Upon reboot, SR-IOV configuration (of ACC100) (sriov_numvfs=0) is
    updated to intended configuration by puppet worker manifest. In this
    case, there is a small chance that the sysinv-agent audit (every 60
    seconds) will run before the driver configuration. Since the agent will
    only actually report the port and device inventory once, the SR-IOV
    configuration data is not accurately reflected in the db, thus
    requiring additional lock/unlock(s) to force correction.

    After restore and reboot, there was no
    /etc/platform/.initial_worker_config_complete and
    /var/run/.worker_config_complete files until puppet worker manifest
    completes. sysinv-agent audit happened to read device inventory before
    the driver configuration (i.e. before worker manifest completed), thus
    not accurately reflected in the db.

    This commit fixes such that port and device configuration are only
    reported after the worker manifest has completed.

    TEST PLAN:
       PASS: Restore node from backup (ACC100 device config::
             driver:igb_uio, vf-driver:igb_uio, N:1), once node
             come back up, check host-device-list/show for after-boot
             update time and num_vfs = 1.

    Closes-Bug: 2053149
    Change-Id: If35ae3c9359139db9128859e49df20ed2ecd86af
    Signed-off-by: Tara Nath Subedi <email address hidden>
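
The race described in this commit message can be illustrated with a toy model. The sketch below is not the sysinv-agent code: AgentSketch and simulate are hypothetical names, the flag file lives in a temporary directory standing in for /var/run, and the only behaviour modelled is "report once, optionally gated on the worker manifest flag".

```python
import os
import tempfile

# File name taken from the commit message; the directory used below is a
# temporary stand-in for /var/run so the sketch can run anywhere.
WORKER_CONFIG_COMPLETE = ".worker_config_complete"


class AgentSketch:
    """Toy agent that reports device inventory only once."""

    def __init__(self, run_dir, gate_on_manifest):
        self.run_dir = run_dir
        self.gate_on_manifest = gate_on_manifest
        self.reported = False
        self.db = None  # stands in for the inventory database record

    def audit(self, live_device_state):
        """Run one audit pass; return True if inventory was reported."""
        if self.reported:
            return False  # already reported once, never refreshed
        flag = os.path.join(self.run_dir, WORKER_CONFIG_COMPLETE)
        if self.gate_on_manifest and not os.path.exists(flag):
            return False  # defer until the worker manifest has finished
        self.db = dict(live_device_state)
        self.reported = True
        return True


def simulate(gate_on_manifest):
    """Audit once before and once after the manifest configures the device."""
    before = {"driver": None, "sriov_numvfs": 0}
    after = {"driver": "igb_uio", "sriov_numvfs": 1}
    with tempfile.TemporaryDirectory() as run_dir:
        agent = AgentSketch(run_dir, gate_on_manifest)
        agent.audit(before)  # audit fires before the driver is configured
        # The manifest completes: device configured, flag file written.
        open(os.path.join(run_dir, WORKER_CONFIG_COMPLETE), "w").close()
        agent.audit(after)  # the next periodic audit
        return agent.db
```

Under these assumptions, simulate(False) leaves the record at driver None with zero VFs, matching the symptom in this bug, while simulate(True) records igb_uio with one VF because the first audit is deferred.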

Changed in starlingx:
status: In Progress → Fix Released
Ghada Khalil (gkhalil)
Changed in starlingx:
status: Fix Released → In Progress
Ghada Khalil (gkhalil) wrote :

Re-opening: the above commit was reverted because it breaks port inventory on initial install.
Revert: https://review.opendev.org/c/starlingx/config/+/909092

The original fix will need to be reworked.

OpenStack Infra (hudson-openstack) wrote : Fix proposed to config (master)

Fix proposed to branch: master
Review: https://review.opendev.org/c/starlingx/config/+/909476

OpenStack Infra (hudson-openstack) wrote : Fix merged to config (master)

Reviewed: https://review.opendev.org/c/starlingx/config/+/909476
Committed: https://opendev.org/starlingx/config/commit/9c3bf050cd57916325a3e7218a4816ba575b63e4
Submitter: "Zuul (22348)"
Branch: master

commit 9c3bf050cd57916325a3e7218a4816ba575b63e4
Author: Tara Subedi <email address hidden>
Date: Thu Feb 8 14:51:50 2024 -0500

    Report port and device inventory after the worker manifest

    The SR-IOV configuration of a device is not retained across reboots,
    until puppet manifests bind/enable completes. The sysinv-agent should
    not report device inventory at any time after it is started, it should
    wait until puppet worker manifest completes. Though during bootstrap
    (fresh install), restore, network-boot and subsequent reboots in case
    of non-worker roles (controller, storage) sysinv-agent can report at
    any time it is started.

    Upon reboot, SR-IOV configuration (of ACC100) (sriov_numvfs=0) is
    updated to intended configuration by puppet worker manifest. In this
    case, there is a small chance that the sysinv-agent audit (every 60
    seconds) will run before the driver configuration. Since the agent will
    only actually report the port and device inventory once, the SR-IOV
    configuration data is not accurately reflected in the db, thus
    requiring additional lock/unlock(s) to force correction.

    After fresh-install/restore/network-boot and reboot, there was no
    /etc/platform/.initial_worker_config_complete and
    /var/run/.worker_config_complete files until puppet worker manifest
    completes. sysinv-agent audit happened to read device inventory before
    the driver configuration (i.e. before worker manifest completed), thus
    not accurately reflected in the db.

    This commit fixes such that port and device configuration are only
    reported after the worker manifest has completed, in case the host is
    being configured as worker subfunction.

    TEST PLAN:
       PASS: Fresh install node (that has ACC100 device) AIO, check
             host-device-list/show (before config/unlock) to see
             ACC100 device config:: driver:None, vf-driver:None, N:0.

       PASS: After above, update config (ACC100 device config::
             driver:igb_uio, vf-driver:igb_uio, N:1) and also use
             host-label-assign as sriovdp=enabled and unlock, for
             subsequent reboots validate device config as
             (driver:igb_uio, vf-driver:igb_uio, N:1) and validate
             content of /etc/pcidp/config.json.

       PASS: Restore node from backup (ACC100 device config::
             driver:igb_uio, vf-driver:igb_uio, N:1 and also
             host-label-assign as sriovdp=enabled), once node
             comes back up, check host-device-list/show for after-boot
             update time and num_vfs = 1. Also validate content of
             /etc/pcidp/config.json.

       PASS: In AIO-DX setup, ports and devices can be listed and
             the second worker node can be unlocked after the
             network-boot.

    Closes-Bug: 2053149
    Change-Id: I69d483041bd75ea0abbd68cedccfbc5f10062c75
    Signed-off-by: Tara Nath S...
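
The reworked rule in this commit message ("only reported after the worker manifest has completed, in case the host is being configured as worker subfunction") can be read as a small decision function. This is one interpretation of the commit text, not the merged code; may_report_inventory and its parameters are hypothetical names.

```python
def may_report_inventory(subfunctions, worker_manifest_done):
    """Decide whether port/device inventory may be reported now.

    subfunctions: iterable of role strings, e.g. ["controller", "worker"].
    worker_manifest_done: True once the worker manifest flag file exists.
    """
    if "worker" not in subfunctions:
        # Controller-only and storage hosts have no worker manifest to wait for.
        return True
    # Hosts with the worker subfunction wait for the manifest to finish.
    return worker_manifest_done
```

For example, a storage host would be allowed to report immediately, while an AIO host (controller plus worker) would be held back until the flag file appears.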


Changed in starlingx:
status: In Progress → Fix Released
Ghada Khalil (gkhalil) wrote :

Re-opening as the code changes introduced another restore issue. An incremental fix is required. Review: https://review.opendev.org/c/starlingx/config/+/914142

Changed in starlingx:
status: Fix Released → In Progress
importance: Undecided → Medium
tags: added: stx.10.0 stx.networking stx.update
OpenStack Infra (hudson-openstack) wrote :

Reviewed: https://review.opendev.org/c/starlingx/config/+/914142
Committed: https://opendev.org/starlingx/config/commit/933d3a3a73e923efc86d7ac8b8a059a598e6fbe1
Submitter: "Zuul (22348)"
Branch: master

commit 933d3a3a73e923efc86d7ac8b8a059a598e6fbe1
Author: Tara Subedi <email address hidden>
Date: Mon Mar 25 13:51:26 2024 -0400

    Report port and device inventory after the worker manifest

    This is incremental fix of bug:2053149.
    Upon network boot (first boot) of worker node, agent manager is
    supposed to report ports/devices, without waiting for worker manifest,
    as that would never run on first boot. Without this, after system
    restore, it will be unable to unlock compute node due to sriov config
    update.

    Kickstart records first boot as "/etc/platform/.first_boot". Agent
    manager deletes this file. If agent manager crashes, it will
    start again. This time, agent manager doesn't see the .first_boot
    file, doesn't know this is still first boot, and won't report
    inventory for the worker node.

    This commit fixes this issue by creating volatile file
    "/var/run/.first_boot" before deleting "/etc/platform/.first_boot", and
    agent relies on both files to figure out whether it is first boot or
    not. This presents the same logic across multiple crashes/restarts of
    agent manager.

    TEST PLAN:
    PASS: AIO-DX bootstrap has no issues. lock/unlock has no issues.
    PASS: Network-boot worker node, before doing unlock, restart agent
          manager (sysinv-agent), check sysinv.log to see ports are reported.

    Closes-Bug: 2053149
    Change-Id: Iace5576575388a6ed3403590dbeec545c25fc0e0
    Signed-off-by: Tara Nath Subedi <email address hidden>
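
The two-marker handoff described in this commit message can be sketched as follows. This is an illustration under stated assumptions rather than the merged change: is_first_boot is a hypothetical helper, and the marker paths are passed in as parameters so the sketch can be tried against a temporary directory instead of /etc/platform and /var/run.

```python
import os


def is_first_boot(persistent_marker, volatile_marker):
    """Return True while the host is still in its first boot.

    persistent_marker: file written at install time (survives reboots).
    volatile_marker: file in a directory that is cleared on reboot.
    """
    if os.path.exists(persistent_marker):
        # Create the volatile marker BEFORE deleting the persistent one, so
        # a crash between the two steps never loses the first-boot state.
        open(volatile_marker, "w").close()
        os.remove(persistent_marker)
        return True
    # After an agent restart the persistent marker is gone, but the volatile
    # marker still says "first boot" until the next reboot clears it.
    return os.path.exists(volatile_marker)
```

Calling the function again after an agent restart within the same boot would still return True; once a reboot clears the volatile marker it would return False, so the worker manifest gate applies from then on.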

Changed in starlingx:
status: In Progress → Fix Released