NVMe device not tested

Bug #1774828 reported by Ike Panhc
This bug affects 3 people

Affects: Checkbox Provider - Resource
Status: Invalid
Importance: High
Assigned to: Unassigned

Sylvain Pineau (sylvain-pineau) wrote:

I see only one disk (sda) in lshw, sdb being the USB stick. disk/disk_stress_ng_sda should then be the test corresponding to your NVMe disk if it's the only available internal drive on this system. Please confirm.

Changed in checkbox-ng:
status: New → Incomplete
assignee: nobody → Ike Panhc (ikepanhc)

Ike Panhc (ikepanhc) wrote:

From the output of lsblk:

KNAME="sda" TYPE="disk" MOUNTPOINT=""
KNAME="sda1" TYPE="part" MOUNTPOINT="/boot/efi"
KNAME="sda2" TYPE="part" MOUNTPOINT="/"
KNAME="sdb" TYPE="disk" MOUNTPOINT=""
KNAME="nvme0n1" TYPE="disk" MOUNTPOINT=""
KNAME="nvme0n1p1" TYPE="part" MOUNTPOINT=""

sda is the SATA disk, sdb is the USB stick, and nvme0n1 is the disk that is not tested.
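
For reference, that KNAME/TYPE/MOUNTPOINT listing matches the format of lsblk -P -o KNAME,TYPE,MOUNTPOINT. Here is a minimal sketch (my own, not part of Checkbox) that reproduces and parses such a listing:

import shlex
import subprocess

# Reproduce the listing above; assumes util-linux's lsblk is available.
out = subprocess.check_output(
    ['lsblk', '-P', '-o', 'KNAME,TYPE,MOUNTPOINT'], text=True)
for line in out.splitlines():
    # Each line looks like: KNAME="sda" TYPE="disk" MOUNTPOINT=""
    fields = dict(token.split('=', 1) for token in shlex.split(line))
    if fields['TYPE'] == 'disk':
        print(fields['KNAME'], fields['MOUNTPOINT'] or '(not mounted)')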

Changed in checkbox-ng:
status: Incomplete → Confirmed

Ike Panhc (ikepanhc) wrote:

Rod Smith (rodsmith) wrote:

I've encountered what may be another manifestation of this bug. I configured a server (via MAAS) in an admittedly crazy way:

- /dev/md0 -- software RAID 5, mounted at /mnt/raid, consisting of:
  - nvme0n1p1
  - nvme1n1p1
  - nvme2n1p1
- /dev/nvme3n1 -- conventionally partitioned:
  - /dev/nvme3n1p1 -- /boot/efi
  - /dev/nvme3n1p2 -- /boot
  - /dev/nvme3n1p3 -- /
- vg0 -- LVM VG, split across /dev/sda1, /dev/sdb1, and /dev/sdc1;
  contains a single LV, mounted at /home

The server (aitken, a Lenovo SR650 in 18T) has Ubuntu 18.04 installed on it. When I ran test-storage, Checkbox detected and tested only /dev/dm-0; it did not detect the conventionally partitioned NVMe device (/dev/nvme3n1), nor the disks backing the LVM volume group. Here's the submission:

https://certification.canonical.com/hardware/201902-26829/submission/160158/

I don't see a full test submission for this server with a more conventional disk configuration, so I don't know what it would detect with all disks partitioned and mounted in a flat configuration, without LVM or software RAID in the picture. I'll test that eventually, but for now I need the server configured this way to torture-test a new script.

Our procedures specify using flat partitioning, not LVM or software RAID. The configuration I used therefore violates our own policies, and it would make it impossible for our tests to produce accurate results on the LVM disks and, to a lesser extent, the RAID disks, so I'm not too concerned about those problems. That said, I'd expect Checkbox to correctly detect the NVMe devices as disk devices.

Jeff Lane (bladernr)
Changed in checkbox-ng:
assignee: Ike Panhc (ikepanhc) → nobody
importance: Undecided → High

Rod Smith (rodsmith) wrote:

I've re-deployed aitken with a more conventional (for certification) layout, and certify-advanced detected all the disk devices. Clearly, more investigation is required....


Jeff Lane (bladernr) wrote:

The first place to check would be plainbox-provider-resource-generic/bin/block_device_resource. Specifically, it does this:

from glob import glob
import re

for path in glob('/sys/block/*/device') + glob('/sys/block/*/dm'):
    name = re.sub('.*/(.*?)/(device|dm)', r'\g<1>', path)
    state = device_state(name)
    usb2 = usb_support(name, 2.00)
    usb3 = usb_support(name, 3.00)
    rotation = device_rotation(name)
    smart = smart_support(name)
    # print completed here to match the sample output below
    # (the original excerpt was truncated mid-statement):
    print("name: {}\nstate: {}\nusb2: {}\nusb3: {}\n"
          "rotation: {}\nsmart: {}\n".format(
              name, state, usb2, usb3, rotation, smart))

So it is worth re-creating this failure and, with the failing configuration in place, verifying that /sys/block/*/device and /sys/block/*/dm exist for each expected device.
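
As a quick cross-check, something along these lines (my own diagnostic sketch, not shipped with the provider) lists which /sys/block entries expose those subdirectories:

import os

# Diagnostic sketch (not part of plainbox-provider-resource-generic):
# report, for every /sys/block entry, whether the 'device' or 'dm'
# subdirectory the resource script globs for is present.
for entry in sorted(os.listdir('/sys/block')):
    has_device = os.path.exists(os.path.join('/sys/block', entry, 'device'))
    has_dm = os.path.exists(os.path.join('/sys/block', entry, 'dm'))
    print('{}: device={} dm={}'.format(entry, has_device, has_dm))

Any expected disk that shows device=False dm=False there would never be picked up by the glob above.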

You can also run the script manually to see whether it outputs the expected devices and info:
$ /usr/lib/plainbox-provider-resource-generic/bin/block_device_resource
name: sdd
state: internal
usb2: unsupported
usb3: unsupported
rotation: no
smart: False

name: sdb
state: internal
usb2: unsupported
usb3: unsupported
rotation: yes
smart: False

name: sdc
state: internal
usb2: unsupported
usb3: unsupported
rotation: yes
smart: False

name: sda
state: internal
usb2: unsupported
usb3: unsupported
rotation: no
smart: False

affects: checkbox-ng → plainbox-provider-resource

Rod Smith (rodsmith) wrote:

We may be looking at two issues here:

* Ike's original bug report, which has an uncertain (to me) cause.
* The problem I encountered with similar symptoms, as reported in post #4.
  This one is actually caused by udevadm.py in checkbox-support, which
  deliberately removes non-device-mapper devices if any device-mapper
  devices are found. (Search for a comment including the word "precedence"
  for the relevant code.) This was done to circumvent problems caused by
  the way disks are handled on some Power systems.

The second problem may be worked around by a new test we're developing to check that disks are partitioned correctly, so it may not really be all that important. Alternatively, the udevadm.py device-mapper tests could be made smarter.
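
For illustration, the precedence behavior described above amounts to something like the following sketch (my paraphrase of the reported behavior, not the actual udevadm.py code; the function name is hypothetical):

# Paraphrase of the reported precedence rule, not actual udevadm.py code:
# if any device-mapper devices are present, every non-device-mapper disk
# is dropped from the device list.
def apply_dm_precedence(disks):
    """disks: kernel block-device names, e.g. ['sda', 'nvme3n1', 'dm-0']."""
    dm_disks = [d for d in disks if d.startswith('dm-')]
    return dm_disks if dm_disks else disks

# With the layout from post #4, this hides every physical disk:
print(apply_dm_precedence(['sda', 'sdb', 'sdc', 'nvme3n1', 'dm-0']))
# -> ['dm-0']

A smarter rule would only drop disks that are actually members of a device-mapper set, leaving independent disks such as /dev/nvme3n1 visible.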

The first problem appears to be caused by an inconsistency in disk detection. The info/disk_partitions attachment to the results linked to in post #3 shows that the NVMe device is properly partitioned; however, the disk/disk_stress_ng_nvme0n1 test output shows that the stress-ng test was unable to find that partition. Perhaps this is a bug in the test script and it was unable to mount an unmounted partition (if it was indeed unmounted); or maybe there's some other cause.
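
When reproducing, a quick way to cross-check which partitions the kernel reports for that disk (a diagnostic sketch of my own, not a Checkbox test) is:

import os

# List the partitions sysfs reports for a disk, e.g. nvme0n1 -> nvme0n1p1,
# to cross-check what the stress test sees. Diagnostic only.
def sysfs_partitions(disk):
    base = os.path.join('/sys/block', disk)
    return sorted(entry for entry in os.listdir(base)
                  if os.path.exists(os.path.join(base, entry, 'partition')))

print(sysfs_partitions('nvme0n1'))  # expect ['nvme0n1p1'] on the system from the original report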

Ike, if possible, could you please try to reproduce the original results on the original hardware, or at least something similar?

If you use the stable certification PPA (ppa:hardware-certification/public), you'll use the same test script you used when you first encountered this problem (give or take whatever bug fixes have been made since then). However, just today I submitted an entirely new stress-ng wrapper script that supplants the old one. If you use the certification development PPA (ppa:checkbox-dev/ppa) beginning tomorrow (Wednesday, 2020-02-26; I don't know exactly when builds occur), you'll get the new script. (Look for a script called /usr/lib/plainbox-provider-checkbox/bin/disk_stress_ng, which is the old one, or /usr/lib/plainbox-provider-checkbox/bin/stress_ng_test, which is the new one.)

If the old script fails but the new one works, then this bug can be considered squashed. If you can't reproduce the original behavior, then we'll be in ambiguous territory. If the new script doesn't fix the problem, then closing this bug report will require more work.

Jeff Lane (bladernr)
tags: added: hwcert-server

Jeff Lane (bladernr) wrote:

Ike, is this still a problem?

Changed in plainbox-provider-resource:
status: Confirmed → Incomplete

Ike Panhc (ikepanhc) wrote:

No, I have not hit this problem for a while and cannot reproduce it reliably. I will close it for now and re-open it if I can reproduce it stably.

Changed in plainbox-provider-resource:
status: Incomplete → Invalid