VFs not created at ACC100, following device changes and host unlock

Bug #1927089 reported by Douglas Henrique Koerich
This bug affects 1 person
Affects    Status        Importance  Assigned to               Milestone
StarlingX  Fix Released  Medium      Douglas Henrique Koerich

Bug Description

Brief Description
-----------------
With the host locked, and after other SRIOV-related configuration (interfaces, data networks, etc.), the ACC100 device is configured to provide SRIOV VFs and the host is then unlocked. After the reboot, the VFs are not created on the device.

Severity
--------
Critical

Steps to Reproduce
------------------
1) Lock the host;
2) Set up labels for SRIOV and the CPU/memory constraints required by ACC100;
3) Configure data networks, SRIOV interfaces, and the related associations;
4) Configure ACC100 with SRIOV VFs and other device settings (e.g. driver);
5) Unlock the host;
6) After reboot, look for VFs created at the ACC100 device.
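
Step 6 can be checked directly against the Linux sysfs SR-IOV attributes. The following is a minimal sketch, not part of the original report: the helper name count_vfs and the PCI address used in the usage note are hypothetical, and the ACC100 address on a given host has to be looked up first.

```python
from pathlib import Path


def count_vfs(pci_address, sysfs_root="/sys/bus/pci/devices"):
    """Return (configured, enumerated) VF counts for one physical function.

    `configured` is the value of the kernel's sriov_numvfs attribute and
    `enumerated` is the number of virtfnN links under the device directory.
    Returns None when the device is absent or exposes no sriov_numvfs file.
    """
    dev = Path(sysfs_root) / pci_address
    numvfs_file = dev / "sriov_numvfs"
    if not numvfs_file.exists():
        return None
    configured = int(numvfs_file.read_text().strip())
    enumerated = len(list(dev.glob("virtfn*")))
    return configured, enumerated
```

Calling count_vfs("0000:85:00.0") with the real address of the device (the one shown here is a placeholder) returning (0, 0) after the unlock would match the actual behavior described in this report.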

Expected Behavior
------------------
Because of step #4, VFs should be found at step #6.

Actual Behavior
----------------
No VFs found at step #6.

Reproducibility
---------------
Reproducible

System Configuration
--------------------
Irrespective

Branch/Pull Time/Commit
-----------------------
master

Last Pass
---------
First time seeing this issue.

Timestamp/Logs
--------------
From sysinv.log one can see that, while the ACC100 configuration is still being applied, the unlock action is allowed because only the earlier SRIOV interface configurations are checked:

sysinv 2021-04-27 14:10:49.526 8525 INFO sysinv.agent.manager [-] Agent config applied 67bf4109-5f5d-4675-b129-d3cf256c8afd
sysinv 2021-04-27 14:10:49.547 98271 INFO sysinv.conductor.manager [-] _remove_config_from_reboot_config_list host: 4b96963f-99c4-4672-b8cc-57d575a907a2,config_uuid: 67bf4109-5f5d-4675-b129-d3cf256c8afd
sysinv 2021-04-27 14:10:49.575 98271 WARNING sysinv.conductor.manager [-] controller-0: iconfig out of date: target 728019d8-f574-42d3-a699-8fb16a517723, applied 67bf4109-5f5d-4675-b129-d3cf256c8afd
sysinv 2021-04-27 14:10:49.576 98271 WARNING sysinv.conductor.manager [-] SYS_I Raise system config alarm: host controller-0 config applied: 67bf4109-5f5d-4675-b129-d3cf256c8afd vs. target: 728019d8-f574-42d3-a699-8fb16a517723.
sysinv 2021-04-27 14:10:49.596 8525 INFO sysinv.agent.manager [-] config_apply_runtime_manifest: 3d81aa38-99d7-47d0-93bd-1aea0a500c14 {u'inventory_update': u'pci_sriov_config', u'classes': [u'platform::interfaces::sriov::runtime', u'platform::devices::fpga::fec::runtime'], u'force': True, u'personalities': [u'controller', u'worker'], u'host_uuids': [u'4b96963f-99c4-4672-b8cc-57d575a907a2']} controller
sysinv 2021-04-27 14:10:49.596 8525 INFO sysinv.agent.manager [-] controller-active
sysinv 2021-04-27 14:10:49.597 8525 INFO sysinv.agent.manager [-] _apply_runtime_manifest with hieradata_path = '/opt/platform/puppet/20.06/hieradata'
sysinv 2021-04-27 14:10:49.714 122184 INFO sysinv.api.controllers.v1.host [-] check sriov_numvfs=8 sriov_vfs_pci_address=0000:18:09.0,0000:18:09.1,0000:18:09.2,0000:18:09.3,0000:18:09.4,0000:18:09.5,0000:18:09.6,0000:18:09.7
sysinv 2021-04-27 14:10:50.053 122184 INFO sysinv.api.controllers.v1.host [-] check sriov_numvfs=8 sriov_vfs_pci_address=0000:18:11.0,0000:18:11.1,0000:18:11.2,0000:18:11.3,0000:18:11.4,0000:18:11.5,0000:18:11.6,0000:18:11.7
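
To read the excerpt above more easily, the two relevant kinds of entries can be pulled out with a small parser. This is only a sketch for triaging similar logs: the regular expressions are derived from the lines shown here, and the function name summarize is hypothetical.

```python
import re

# Patterns taken from the log lines quoted in this report.
OUT_OF_DATE = re.compile(
    r"iconfig out of date: target (?P<target>[0-9a-f-]{36}), "
    r"applied (?P<applied>[0-9a-f-]{36})"
)
VF_CHECK = re.compile(
    r"check sriov_numvfs=(?P<numvfs>\d+) "
    r"sriov_vfs_pci_address=(?P<addrs>\S+)"
)


def summarize(lines):
    """Collect (target, applied) config pairs and per-port VF checks.

    Each VF check is returned as (requested_numvfs, number_of_vf_addresses).
    """
    pending, checks = [], []
    for line in lines:
        match = OUT_OF_DATE.search(line)
        if match:
            pending.append((match["target"], match["applied"]))
            continue
        match = VF_CHECK.search(line)
        if match:
            addresses = match["addrs"].split(",")
            checks.append((int(match["numvfs"]), len(addresses)))
    return pending, checks
```

On the excerpt above, the check entries report 8 requested VFs with 8 VF addresses for each port, while the config target still differs from the applied config, which is consistent with the unlock being allowed too early.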

Test Activity
-------------
Developer Testing

Workaround
----------
Perform the ACC100 device configuration alone, in a later lock/configure/unlock cycle.

Changed in starlingx:
assignee: nobody → Douglas Henrique Koerich (dkoerich-wr)
status: New → In Progress
OpenStack Infra (hudson-openstack) wrote : Fix proposed to config (master)

Fix proposed to branch: master
Review: https://review.opendev.org/c/starlingx/config/+/789583

Ghada Khalil (gkhalil)
tags: added: stx.networking
Ghada Khalil (gkhalil) wrote :

screening: Marking for stx.6.0. A fix in stx master is sufficient. Given there is an easy workaround, a cherrypick to r/stx.5.0 is not required.

Changed in starlingx:
importance: Undecided → Medium
tags: added: stx.6.0
OpenStack Infra (hudson-openstack) wrote : Change abandoned on config (master)

Change abandoned by "Douglas Henrique Koerich <email address hidden>" on branch: master
Review: https://review.opendev.org/c/starlingx/config/+/789583
Reason: To be replaced by change to solve ACC100 issue only

OpenStack Infra (hudson-openstack) wrote : Fix proposed to config (master)

Fix proposed to branch: master
Review: https://review.opendev.org/c/starlingx/config/+/790044

OpenStack Infra (hudson-openstack) wrote : Fix merged to config (master)

Reviewed: https://review.opendev.org/c/starlingx/config/+/790044
Committed: https://opendev.org/starlingx/config/commit/858aee342dcb94654c9fcd8f0731a2838d845eb1
Submitter: "Zuul (22348)"
Branch: master

commit 858aee342dcb94654c9fcd8f0731a2838d845eb1
Author: Douglas Henrique Koerich <email address hidden>
Date: Thu May 6 07:35:56 2021 -0400

    Check for FEC device configuration pending

    SRIOV configuration of the Intel ACC100 device was not being correctly
    applied when it followed a series of other configurations (data
    networks, SRIOV interfaces, etc.) and came shortly before the host
    unlock, because the runtime manifest did not finish before system
    shutdown.

    "This issue came about because of the way we handle 'pci_devices' in
    contrast to 'interfaces/ports'. For the interfaces/ports, when
    configuring SR-IOV, the interface sriov_numvfs will represent the user
    requested value, while the underlying port sriov_numvfs will represent
    the actual system value. Then, before a system can be unlocked, the
    values are compared for equality. This ensures that the runtime manifest
    that is setting the value of sriov_numvfs has a chance to run. For
    'pci_devices' like the FPGA cards we support, there is no concept of an
    'upper interface'. Therefore, the value of sriov_numvfs for a pci_device
    represents the system value rather than the user requested value. This
    can cause issues when performing an unlock right after configuring a
    pci_device with SR-IOV. There's a chance that the system populates its
    hieradata and unlocks before the runtime manifest has had a chance to
    configure the SR-IOV value for the pci_device." (Steven Webster)

    This change appends a "/APPLYING" string to the 'extra_info' field of
    the FEC device when it gets configured via the API, and removes it
    only when the corresponding inventory is reported back by the SYSINV
    agent. Until then, attempts to unlock the host will find that
    substring in the field and be rejected by an additional semantic
    check.

    Closes-Bug: 1927089
    Signed-off-by: Douglas Henrique Koerich <email address hidden>
    Change-Id: I175bc01a2a51808c4dc7b821905c7417660bf286
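
The mechanism described in this commit message can be sketched as follows. This is an illustration under stated assumptions, not the merged sysinv code: devices and interfaces are modeled as plain dicts, and every function name and the port_sriov_numvfs key are hypothetical. Only the "/APPLYING" suffix, the 'extra_info' field and 'sriov_numvfs' come from the commit message.

```python
APPLYING = "/APPLYING"  # suffix named in the commit message


def interfaces_ready(interfaces):
    """Interface/port rule: the requested VF count must equal the port's."""
    return all(i["sriov_numvfs"] == i["port_sriov_numvfs"] for i in interfaces)


def mark_applying(device):
    """Assumed hook: runs when the FEC device is configured via the API."""
    info = device.get("extra_info") or ""
    if not info.endswith(APPLYING):
        device["extra_info"] = info + APPLYING


def clear_applying(device):
    """Assumed hook: runs when the agent reports inventory back."""
    info = device.get("extra_info") or ""
    if info.endswith(APPLYING):
        device["extra_info"] = info[: -len(APPLYING)]


def can_unlock(interfaces, fec_devices):
    """Semantic check: refuse the unlock while any FEC device is applying."""
    if not interfaces_ready(interfaces):
        return False
    return not any(APPLYING in (d.get("extra_info") or "") for d in fec_devices)
```

In this model any call to clear_applying removes the flag, whether or not the report that triggered it carries the new FEC configuration.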

Changed in starlingx:
status: In Progress → Fix Released
Ghada Khalil (gkhalil)
Changed in starlingx:
status: Fix Released → In Progress
Douglas Henrique Koerich (dkoerich-wr) wrote :

Reviewing the logs, I realized the issue can still occur when the queue of SRIOV interfaces being configured before the Mt. Bryce device is very long. The merged fix then still fails, because inventory reports from the tail of that queue can invalidate the flag introduced to track the pending runtime manifest apply.

To solve that, the improvement suggested as a follow-up in https://review.opendev.org/c/starlingx/config/+/790044 should be implemented now.

OpenStack Infra (hudson-openstack) wrote : Fix proposed to config (master)

Fix proposed to branch: master
Review: https://review.opendev.org/c/starlingx/config/+/791531

OpenStack Infra (hudson-openstack) wrote : Fix merged to config (master)

Reviewed: https://review.opendev.org/c/starlingx/config/+/791531
Committed: https://opendev.org/starlingx/config/commit/9d8cdc5bb33f9e8ae4dfd658a0c9b216e7557431
Submitter: "Zuul (22348)"
Branch: master

commit 9d8cdc5bb33f9e8ae4dfd658a0c9b216e7557431
Author: Douglas Henrique Koerich <email address hidden>
Date: Fri May 14 14:45:29 2021 -0400

    Replace applying flag by dict for FEC device check

    This change replaces the previous solution for checking the FEC device
    before an unlock action, which relied on an '/APPLYING' flag. In
    certain asynchronous scenarios, that flag could be cleared earlier
    than expected, either by a late inventory report unrelated to the FEC
    device configuration (which might happen when configuring a long queue
    of SRIOV port changes) or by the periodic sysinv report.
    The solution still uses the 'extra_info' field of PCI devices, this
    time "stringifying" a dictionary with an 'expected_numvfs' entry that
    keeps the programmed number of VFs of the FEC device in that field
    (without clearing it). That value is then compared with the actual
    sriov_numvfs of the device from the inventory report, similar to how
    SRIOV interfaces (from the database) are currently compared to ports
    (from the device).

    Closes-bug: 1927089
    Signed-off-by: Douglas Henrique Koerich <email address hidden>
    Change-Id: I380bd66a8229a72ef1981cbefa3a0543c28d7f30
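
The revised check can be sketched in the same spirit. Again this is an assumption-laden illustration rather than the merged code: the device is a plain dict, the function names are hypothetical, and JSON is used here as one possible way to "stringify" the dictionary (the commit message does not say which serialization is used).

```python
import json


def set_expected_numvfs(device, numvfs):
    """Assumed hook: store the requested VF count when the API configures it."""
    device["extra_info"] = json.dumps({"expected_numvfs": numvfs})


def on_inventory_report(device, reported_numvfs):
    """Assumed hook: record the system value; extra_info is left untouched."""
    device["sriov_numvfs"] = reported_numvfs


def fec_device_ready(device):
    """Ready when no VF count was requested or the system value matches it."""
    info = device.get("extra_info")
    if not info:
        return True
    try:
        expected = json.loads(info).get("expected_numvfs")
    except (ValueError, AttributeError):
        return True  # field holds something other than the expected dict
    if expected is None:
        return True
    return device.get("sriov_numvfs") == expected
```

Because the expected value is never cleared, an inventory report that arrives late or is unrelated to the FEC device cannot make the device look ready: the check only passes once the reported sriov_numvfs equals the requested value.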

Changed in starlingx:
status: In Progress → Fix Released
OpenStack Infra (hudson-openstack) wrote : Fix proposed to config (f/centos8)

Fix proposed to branch: f/centos8
Review: https://review.opendev.org/c/starlingx/config/+/793460

OpenStack Infra (hudson-openstack) wrote :

Fix proposed to branch: f/centos8
Review: https://review.opendev.org/c/starlingx/config/+/793696

OpenStack Infra (hudson-openstack) wrote :

Fix proposed to branch: f/centos8
Review: https://review.opendev.org/c/starlingx/config/+/794611

OpenStack Infra (hudson-openstack) wrote :

Fix proposed to branch: f/centos8
Review: https://review.opendev.org/c/starlingx/config/+/794906

OpenStack Infra (hudson-openstack) wrote : Change abandoned on config (f/centos8)

Change abandoned by "Chuck Short <email address hidden>" on branch: f/centos8
Review: https://review.opendev.org/c/starlingx/config/+/794611

OpenStack Infra (hudson-openstack) wrote : Fix merged to config (f/centos8)

Reviewed: https://review.opendev.org/c/starlingx/config/+/794906
Committed: https://opendev.org/starlingx/config/commit/75758b37a5a23c8811355b67e2a430a1713cd85b
Submitter: "Zuul (22348)"
Branch: f/centos8

commit 9e420d9513e5fafb1df4d29567bc299a9e04d58d
Author: Bin Qian <email address hidden>
Date: Mon May 31 14:45:52 2021 -0400

    Add more logging to run docker login

    Add error log for running docker login. The new log could
    help identify docker login failure.

    Closes-Bug: 1930310
    Change-Id: I8a709fb6665de8301fbe3022563499a92b2a0211
    Signed-off-by: Bin Qian <email address hidden>

commit 31c77439d2cea590dfcca13cfa646522665f8686
Author: albailey <email address hidden>
Date: Fri May 28 13:42:42 2021 -0500

    Fix controller-0 downgrade failing to kill ceph

    kill_ceph_storage_monitor tried to manipulate a pmon
    file that does not exist in an AIO-DX environment.

    We no longer invoke kill_ceph_storage_monitor in an
    AIO SX or DX env.

    This allows: "system host-downgrade controller-0"
    to proceed in an AIO-DX environment where that second
    controller (controller-0) was upgraded.

    Partial-Bug: 1929884
    Signed-off-by: albailey <email address hidden>
    Change-Id: I633853f75317736084feae96b5b849c601204c13

commit 0dc99eee608336fe01b58821ea404286371f1408
Author: albailey <email address hidden>
Date: Fri May 28 11:05:43 2021 -0500

    Fix file permissions failure during duplex upgrade abort

    When issuing a downgrade for controller-0 in a duplex upgrade
    abort and rollback scenario, the downgrade command was failing
    because the sysinv API does not have root permissions to set
    a file flag.
    The fix is to use RPC so the conductor can create the flag
    and allow the downgrade for controller-0 to get further.

    Partial-Bug: 1929884
    Signed-off-by: albailey <email address hidden>
    Change-Id: I913bcad73309fe887a12cbb016a518da93327947

commit 7ef3724dad173754e40b45538b1cc726a458cc1c
Author: Chen, Haochuan Z <email address hidden>
Date: Tue May 25 16:16:29 2021 +0800

    Fix bug rook-ceph provision with multi osd on one host

    Test case:
    1, deploy simplex system
    2, apply rook-ceph with below override value
    value.yaml
    cluster:
      storage:
        nodes:
        - name: controller-0
          devices:
          - name: sdb
          - name: sdc
    3, reboot

    Without this fix, only osd pod could launch successfully after boot
    as vg start with ceph could not correctly add in sysinv-database

    Closes-bug: 1929511

    Change-Id: Ia5be599cd168d13d2aab7b5e5890376c3c8a0019
    Signed-off-by: Chen, Haochuan Z <email address hidden>

commit 23505ba77d76114cf8a0bf833f9a5bcd05bc1dd1
Author: Angie Wang <email address hidden>
Date: Tue May 25 18:49:21 2021 -0400

    Fix issue in partition data migration script

    The created partition dictionary partition_map is not
    an ordered dict so we need to sort it by its key -
    device node when iterating it to adjust the device
    nodes/paths for user created extra partitions to ensure
    the number of device node...

tags: added: in-f-centos8
OpenStack Infra (hudson-openstack) wrote : Change abandoned on config (f/centos8)

Change abandoned by "Chuck Short <email address hidden>" on branch: f/centos8
Review: https://review.opendev.org/c/starlingx/config/+/793696

OpenStack Infra (hudson-openstack) wrote :

Change abandoned by "Chuck Short <email address hidden>" on branch: f/centos8
Review: https://review.opendev.org/c/starlingx/config/+/793460
