installation (linux kernel 5.10) failed because system failed to find disk

Bug #1947313 reported by Jiping Ma
Affects: StarlingX
Status: Fix Released
Importance: High
Assigned to: Jiping Ma

Bug Description

Brief Description
-----------------
Installation on ml350_g10 (Linux kernel 5.10) failed because the system failed to find the disk.

The issue is not seen with Linux kernel 3.10.

Severity
--------
Major

Steps to Reproduce
-------------------
install controller-0
ansible bootstrap on controller-0
configure and unlock controller-0

Expected Behavior
-----------------
controller-0 is unlocked and in available status

Actual Behavior
---------------
installation failed because the system failed to find disk

Reproducibility
----------------
reproducible (happened 2/2)

System Configuration
--------------------
ml350_g10; the issue was seen on this specific hardware (HPE ML350)

Branch/Pull Time/Commit
-----------------------
recent stx master load after the 5.10 kernel merged

Last Pass
----------
This used to work with an older version of the kernel (linux kernel 3.10)

Timestamp/Logs
--------------
[sysadmin@controller-0 ~(keystone_admin)]$ system host-list

+----+--------------+-------------+----------------+-------------+--------------+
| id | hostname     | personality | administrative | operational | availability |
+----+--------------+-------------+----------------+-------------+--------------+
| 1  | controller-0 | controller  | locked         | disabled    | online       |
+----+--------------+-------------+----------------+-------------+--------------+

[sysadmin@controller-0 ~(keystone_admin)]$ system host-unlock controller-0

Expecting number of interface sriov_numvfs=7. Please wait a few minutes for inventory update and retry host-unlock.

[sysadmin@controller-0 ~(keystone_admin)]$ kubectl get pods -A

NAMESPACE NAME READY STATUS RESTARTS AGE
armada armada-api-778fc65fd6-2qz78 2/2 Running 0 59m
cert-manager cm-cert-manager-785cb658cd-ldldr 1/1 Running 0 51m
cert-manager cm-cert-manager-cainjector-544d67bcb8-djggd 1/1 Running 0 51m
cert-manager cm-cert-manager-webhook-d47d89c8-dlz6j 1/1 Running 0 51m
flux-helm helm-controller-74667bfd95-wfht4 1/1 Running 1 59m
flux-helm source-controller-7d448db5b4-s8n82 1/1 Running 1 59m
kube-system calico-kube-controllers-5cd4695574-hz279 1/1 Running 1 59m
kube-system calico-node-xf2tr 1/1 Running 0 59m
kube-system coredns-666cb94996-d86hv 1/1 Running 0 59m
kube-system ic-nginx-ingress-ingress-nginx-controller-ffsbj 1/1 Running 0 54m
kube-system kube-apiserver-controller-0 1/1 Running 0 59m
kube-system kube-controller-manager-controller-0 1/1 Running 0 59m
kube-system kube-multus-ds-amd64-w5xzl 1/1 Running 0 59m
kube-system kube-proxy-qdx9w 1/1 Running 0 59m
kube-system kube-scheduler-controller-0 1/1 Running 0 59m
kube-system kube-sriov-cni-ds-amd64-2xklc 1/1 Running 0 59m
kube-system kube-sriov-device-plugin-amd64-klkk6 0/1 CrashLoopBackOff 13 45m
platform-deployment-manager platform-deployment-manager-0 2/2 Running 2 51m

Test Activity
--------------
Lab install

Workaround
----------
Unknown
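
The fix commit recorded later in this report attributes the failure to SATA drives showing an all-zero SAS address in their /dev/disk/by-path names. As a diagnostic aid (not a workaround), the following is a minimal sketch that flags such names. The helper names `zero_sas_links` and `list_by_path` and the regular expression are illustrative assumptions, not part of StarlingX or the kernel.

```python
import os
import re

# Matches the SAS address field in names such as
# pci-0000:3b:00.0-sas-0x31402ec001d92983-lun-0 (format taken from this report).
SAS_LINK = re.compile(r"-sas-0x([0-9a-fA-F]{16})-lun-\d+")


def zero_sas_links(names):
    """Return the by-path names whose SAS address field is all zeroes."""
    flagged = []
    for name in names:
        match = SAS_LINK.search(name)
        if match and int(match.group(1), 16) == 0:
            flagged.append(name)
    return flagged


def list_by_path(directory="/dev/disk/by-path"):
    """List by-path entries; return an empty list if the directory is absent."""
    try:
        return sorted(os.listdir(directory))
    except FileNotFoundError:
        return []


if __name__ == "__main__":
    for name in zero_sas_links(list_by_path()):
        print(name)
```

Any name printed on an affected host indicates a disk whose by-path link cannot be told apart from other SATA disks behind the same controller.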

Jiping Ma (jma11)
Changed in starlingx:
assignee: nobody → Jiping Ma (jma11)
OpenStack Infra (hudson-openstack) wrote : Fix proposed to kernel (master)

Fix proposed to branch: master
Review: https://review.opendev.org/c/starlingx/kernel/+/814107

Changed in starlingx:
status: New → In Progress
Ghada Khalil (gkhalil)
summary:
- installation (linux kernel 5.10) failed because DM failed to find disk
+ installation (linux kernel 5.10) failed because system failed to find disk
description: updated
Changed in starlingx:
importance: Undecided → Critical
importance: Critical → High
tags: added: stx.6.0 stx.distro.other
OpenStack Infra (hudson-openstack) wrote : Fix merged to kernel (master)

Reviewed: https://review.opendev.org/c/starlingx/kernel/+/814107
Committed: https://opendev.org/starlingx/kernel/commit/e7fffa117b93bec1bc31f24ab0f98e765715251d
Submitter: "Zuul (22348)"
Branch: master

commit e7fffa117b93bec1bc31f24ab0f98e765715251d
Author: jma1 <email address hidden>
Date: Fri Oct 15 06:23:00 2021 +0000

    scsi: smartpqi: Enable sas_address sys fs for SATA device type.

    After we updated the Linux kernel to 5.10, DM complains that it cannot
    find the disk specified in the deployment config file.
    The error is "failed to find disk for path /dev/disk/by-path/
    pci-0000:3b:00.0-sas-0x31402ec001d92983-lun-0"

    This happens because device type SATA is excluded from being
    processed by the function pqi_is_device_with_sas_address,
    which causes all SATA disk drives to appear the same, with an
    all-zero SAS address in the by-path name: /dev/disk/by-path/
    pci-0000:3b:00.0-sas-0x0000000000000000-lun-0

    We can add type SA_DEVICE_TYPE_SATA to class device_with_sas_address,
    since it also gets its sas_address from the wwid, and this works
    transparently with the old kernel without gaps.

    Successfully installed on Wind River lab ml350_g10 which
    contains a P816i-a SR Gen10 disk controller. This lab
    previously exhibited the problem.

    Closes-Bug: #1947313

    Signed-off-by: Jiping Ma <email address hidden>
    Change-Id: I4445bac52a4706b923ac16865bd24a82783512ba
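
The effect described in the commit message can be illustrated with a simplified Python model. This is a sketch under stated assumptions, not the driver code: the `DeviceType` enum, the two type sets, and `by_path_name` are hypothetical names, and the "before" set lists only SAS because the report does not say which other device types the real function accepts. The PCI address and SAS address are the ones quoted in the commit message.

```python
from enum import Enum, auto


class DeviceType(Enum):
    """Hypothetical stand-in for the driver's device type constants."""
    SAS = auto()
    SATA = auto()


# Assumption: illustrative sets only. The commit states that SATA was
# excluded before the fix and included after it.
WITH_SAS_ADDRESS_BEFORE = frozenset({DeviceType.SAS})
WITH_SAS_ADDRESS_AFTER = WITH_SAS_ADDRESS_BEFORE | {DeviceType.SATA}


def by_path_name(pci, device_type, wwid_sas_address, lun, with_sas_address):
    """Build a by-path style name; excluded types report a zero address."""
    if device_type in with_sas_address:
        address = wwid_sas_address
    else:
        address = 0
    return f"pci-{pci}-sas-0x{address:016x}-lun-{lun}"


if __name__ == "__main__":
    args = ("0000:3b:00.0", DeviceType.SATA, 0x31402EC001D92983, 0)
    print(by_path_name(*args, WITH_SAS_ADDRESS_BEFORE))
    print(by_path_name(*args, WITH_SAS_ADDRESS_AFTER))
```

In this model, a SATA drive evaluated against the "before" set yields the all-zero name quoted in the commit message, while the "after" set yields the name the deployment config expects.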

Changed in starlingx:
status: In Progress → Fix Released