Linux 5.10 kernel: Failing to create vfs for sriov interface

Bug #1945396 reported by Jiping Ma
Affects    Status        Importance  Assigned to  Milestone
StarlingX  Fix Released  Medium      Jiping Ma

Bug Description

Brief Description

During an install in lab wp13-14, an AIO-DX DIRECT system, the install failed at host-unlock of controller-0 with this error:

E0921 19:38:57.222101 1 common.go:138] controller/host "msg"="user error" "error"="failed to unlock host: 0309e708-72d9-43b2-8b22-6884250457a7, {\"action\":\"unlock\"}: Bad request with: [PATCH http://[abcd:204::1]:6385/v1/ihosts/0309e708-72d9-43b2-8b22-6884250457a7], error message: {\"error_message\": \"{\\\"debuginfo\\\": null, \\\"faultcode\\\": \\\"Client\\\", \\\"faultstring\\\": \\\"Expecting number of interface sriov_numvfs=32. Please wait a few minutes for inventory update and retry host-unlock.\\\"}\"}" "request"={"Namespace":"deployment","Name":"controller-0"}

System status after failed install:

[sysadmin@controller-0 ~(keystone_admin)]$ system host-list
+----+--------------+-------------+----------------+-------------+--------------+
| id | hostname     | personality | administrative | operational | availability |
+----+--------------+-------------+----------------+-------------+--------------+
| 1  | controller-0 | controller  | locked         | disabled    | online       |
+----+--------------+-------------+----------------+-------------+--------------+
[sysadmin@controller-0 ~(keystone_admin)]$

Severity

System install fails.
The following workaround was tried without success:

system host-if-delete controller-0 sriov1;
system host-if-modify controller-0 sriov0 -c none;
system host-unlock controller-0
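
The unlock error above says the inventory expected sriov_numvfs=32. On Linux, the kernel exposes the current and maximum VF counts of an SR-IOV capable PCI device through the sysfs attributes sriov_numvfs and sriov_totalvfs. The sketch below is not part of the original report; it only illustrates how the actual count could be compared with the expected one when debugging this kind of failure. The function names and the sysfs_root parameter are hypothetical, and the PCI address 0000:19:00.0 is borrowed from the dmesg excerpt in the fix commit purely as an example (the report does not state which device backs sriov0).

```python
from pathlib import Path


def read_vf_counts(pci_addr, sysfs_root="/sys/bus/pci/devices"):
    """Return (numvfs, totalvfs) for a PCI device.

    Returns None if the device is missing or has no readable SR-IOV attributes.
    sysfs_root is a hypothetical parameter, added so the function can be
    pointed at a fake tree for testing.
    """
    dev = Path(sysfs_root) / pci_addr
    try:
        numvfs = int((dev / "sriov_numvfs").read_text().strip())
        totalvfs = int((dev / "sriov_totalvfs").read_text().strip())
    except (FileNotFoundError, ValueError):
        return None
    return numvfs, totalvfs


def vfs_match(pci_addr, expected, sysfs_root="/sys/bus/pci/devices"):
    """True only if the device currently has exactly `expected` VFs enabled."""
    counts = read_vf_counts(pci_addr, sysfs_root)
    return counts is not None and counts[0] == expected


if __name__ == "__main__":
    # Example address only; substitute the address of the NIC under test.
    print(read_vf_counts("0000:19:00.0"))
```

If the first value stays at 0 while the inventory expects 32, the kernel never created the VFs, which would be consistent with the unlock being rejected regardless of how long one waits.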

Steps to Reproduce

Install a build on an AIO-DUPLEX-DIRECT system.

Expected Behavior

Build installs successfully

Actual Behavior

Build install fails.

Reproducibility

100%

System Configuration

AIO-DUPLEX-DIRECT IPv6

Branch/Pull Time/Commit

StarlingX master, build date and time: 21-09-2021 8:37 PM

Timestamp/Logs

Logs attached.

Alarms

Test Activity

Developer Testing

Workaround

The workaround listed above did not work.

Jiping Ma (jma11)
Changed in starlingx:
assignee: nobody → Jiping Ma (jma11)
OpenStack Infra (hudson-openstack) wrote : Fix proposed to metal (master)

Fix proposed to branch: master
Review: https://review.opendev.org/c/starlingx/metal/+/811552

Changed in starlingx:
status: New → In Progress
Ghada Khalil (gkhalil)
description: updated
tags: added: stx.6.0 stx.distro.other
Changed in starlingx:
importance: Undecided → Medium
OpenStack Infra (hudson-openstack) wrote : Change abandoned on metal (master)

Change abandoned by "Jiping Ma <email address hidden>" on branch: master
Review: https://review.opendev.org/c/starlingx/metal/+/811552

OpenStack Infra (hudson-openstack) wrote : Fix proposed to kernel (master)

Fix proposed to branch: master
Review: https://review.opendev.org/c/starlingx/kernel/+/815194

OpenStack Infra (hudson-openstack) wrote : Fix merged to kernel (master)

Reviewed: https://review.opendev.org/c/starlingx/kernel/+/815194
Committed: https://opendev.org/starlingx/kernel/commit/4e2b94e3451bfd24fc95562e5386f7477f98d78e
Submitter: "Zuul (22348)"
Branch: master

commit 4e2b94e3451bfd24fc95562e5386f7477f98d78e
Author: Jiping Ma <email address hidden>
Date: Sat Oct 23 11:17:17 2021 -0400

    Enable CONFIG_PCI_REALLOC_ENABLE_AUTO

    PCI code could not reallocate enough mmio due to BIOS limitations or
    errors. All VF mmio space allocation will fail because there are not
    sufficient resources to map all possible VFS.

    We enable CONFIG_PCI_REALLOC_ENABLE_AUTO to realloc pci resource if
    alloc resource failed, and make sure the machine and NIC work normally
    even if it meets the BIOS issue.

    Error info in dmesg:
    DMAR: [Firmware Bug]: No firmware reserved region can cover this RMRR
    [0x0000000067ff0000-0x0000000067ff2fff], contact BIOS vendor for fixes
    2021-09-21T17:22:34.527 localhost kernel: warning [ 3.417365] DMAR:
    [Firmware Bug]: Your BIOS is broken; bad RMRR
    [0x0000000067ff0000-0x0000000067ff2fff]
    2021-09-21T17:22:34.527 localhost kernel: warning [ 3.417365] BIOS
    vendor: Intel Corporation; Ver: SE5C620.86B.02.01.0009.092820190230;
    Product Version: R2208WFQZS
    ...
    pci_bus 0000:85: Some PCI device resources are unassigned, try booting
    with pci=realloc
    ...
    i40e 0000:19:00.0: not enough MMIO resources for SR-IOV

    Closes-Bug: 1945396

    Signed-off-by: Jiping Ma <email address hidden>
    Change-Id: I87de6b323f5737549435ec95c952dbc6fcf20bb4
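
The commit message above lists the dmesg lines that accompanied the failure. As an illustration only, a small scan like the following could flag the same signatures in a captured kernel log. The patterns are taken from the excerpt above; the names SIGNATURES and scan_dmesg are hypothetical and not part of the fix.

```python
import re

# Patterns derived from the dmesg excerpt quoted in the commit message.
SIGNATURES = {
    "firmware_bug": re.compile(r"\[Firmware Bug\]"),
    "realloc_hint": re.compile(r"try booting\s+with pci=realloc"),
    "sriov_mmio": re.compile(r"not enough MMIO resources for SR-IOV"),
}


def scan_dmesg(text):
    """Return a dict mapping each signature name to the list of matching lines."""
    hits = {name: [] for name in SIGNATURES}
    for line in text.splitlines():
        for name, pattern in SIGNATURES.items():
            if pattern.search(line):
                hits[name].append(line.strip())
    return hits


if __name__ == "__main__":
    sample = (
        "pci_bus 0000:85: Some PCI device resources are unassigned, "
        "try booting with pci=realloc\n"
        "i40e 0000:19:00.0: not enough MMIO resources for SR-IOV\n"
    )
    for name, lines in scan_dmesg(sample).items():
        print(name, len(lines))
```

On a live system the input text could come from the output of the dmesg command or from a saved kern.log; the scan works line by line, so it assumes each kernel message occupies a single line rather than being wrapped as in the commit message.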

Changed in starlingx:
status: In Progress → Fix Released
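
To confirm that a given kernel build carries the merged change, one could check whether CONFIG_PCI_REALLOC_ENABLE_AUTO is set to y in that kernel's config. The sketch below is an assumption-laden illustration: config_enabled is a hypothetical helper, and the /boot/config-<release> path is a common distribution convention that may not hold on a StarlingX host.

```python
import platform
from pathlib import Path


def config_enabled(config_text, option="CONFIG_PCI_REALLOC_ENABLE_AUTO"):
    """True if the kernel config text contains `<option>=y` as a whole line.

    A line such as '# CONFIG_X is not set' does not count as enabled.
    """
    wanted = f"{option}=y"
    return any(line.strip() == wanted for line in config_text.splitlines())


if __name__ == "__main__":
    # Assumed location; adjust for the system being inspected.
    path = Path("/boot") / f"config-{platform.release()}"
    if path.exists():
        print(config_enabled(path.read_text()))
    else:
        print(f"{path} not found; the config location is an assumption")
```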