[UC20] disk/disk_stress_ng_* fail on encrypted partitions if not run right after installing the OS

Bug #1959077 reported by Pierre Equoy
This bug affects 2 people
Affects | Status | Importance | Assigned to | Milestone
Checkbox Provider - Base | Expired | High | Unassigned |

Bug Description

Summary
=======

On first boot after install, Checkbox correctly detects encrypted partitions based on udev output. On subsequent reboots, Checkbox fails to detect the encrypted partition, leading to errors when running disk/disk_stress_ng_* tests.

This is apparently due to the way udev rules are defined for dm partitions (see comments #2 and #3).

Original description
--------------------

This is probably related to lp:1948384.

CID: 202109-29473 (but it was reported with other devices and projects)
image: uc20-x03 for this project

1. Install the OEM image on the device:
    sudo snap install checkbox20 && sudo snap install checkbox-carlsbad --edge --devmode
2. Run Stress Test plan

Expected result
===============

The disk/disk_stress_ng_* test(s) pass.

Actual result
=============

There is one test, disk/disk_stress_ng_nvme0n1, and it fails with the following output:

=====================================
STRESS_NG_DISK_TIME env var is not found, stress_ng disk running time is default value
WARNING:root:Warning: nvme0n1p3 is less than 10 GiB in size!
ERROR:root:Disk is too small to test. Aborting test!
** Unable to find a suitable partition! Aborting!
retval is 1
**************************************************************
** stress-ng test failed!
**************************************************************
=====================================

Full submission: https://certification.canonical.com/hardware/202109-29473/submission/246678/

3. Reinstall the image, reinstall Checkbox, and run another Test Plan (Automated, for instance)
4. Abandon the current session, re-launch Checkbox and select the Stress Test Plan.

→ Sometimes, the job `disk/disk_stress_ng_nvme0n1` disappears and is replaced with the job `disk/disk_stress_ng_dm-0`.

In that case, the job will pass, as seen in this submission: https://certification.canonical.com/hardware/202109-29473/submission/246843/

Tags: cbox-21
Pierre Equoy (pieq)
Changed in plainbox-provider-checkbox:
milestone: none → 0.64.0
importance: Undecided → High
description: updated
Matias Piipari (mz2)
tags: added: cbox-21
Pierre Equoy (pieq) wrote :

It happened again to me, on another project using UC20 and full disk encryption.

The first time I installed UC20, and Checkbox20, running:

checkbox-<project>.checkbox-cli run com.canonical.certification::device

returned the following item in the DISK category:

(...)
path: /devices/pci0000:00/0000:00:1c.0/mmc_host/mmc0/mmc0:0001
name: mmcblk0
bus: mmc
category: DISK
driver: mmcblk
product: DG4032
product_slug: DG4032
(...)

Because of this, `disk/disk_stress_ng_mmcblk0` failed:

=============================
STRESS_NG_DISK_TIME env var is not found, stress_ng disk running time is default value
WARNING:root:Warning: mmcblk0p3 is less than 10 GiB in size!
ERROR:root:Disk is too small to test. Aborting test!
** Unable to find a suitable partition! Aborting!
retval is 1
**************************************************************
** stress-ng test failed!
**************************************************************
=============================

After reinstalling the OS and Checkbox, the device resource job output was:

(...)
path: /devices/virtual/block/dm-0
name: dm-0
bus: block
category: DISK
product: dm-0
product_slug: dm-0

path: /devices/virtual/block/dm-1
name: dm-1
bus: block
category: DISK
product: dm-1
product_slug: dm-1
(...)

and this time, the `disk/disk_stress_ng_*` jobs can be run properly.

Pierre Equoy (pieq) wrote :

I spent more time investigating by re-installing UC20 and Checkbox several times and observing the output of lsblk and udevadm from both outside the checkbox snap and inside.

The output of `lsblk -i -n -P -o KNAME,TYPE,MOUNTPOINT` is consistent regardless of whether Checkbox finds dm-0 and dm-1, so I think we can rule this one out.
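
For anyone repeating this comparison, here is a minimal Python sketch that parses the KEY="value" pairs printed by `lsblk -P` so two runs can be compared programmatically. The sample text is illustrative only (it was not captured from the affected device), and `parse_lsblk_pairs` is a hypothetical helper name, not a Checkbox function.

```python
import shlex


def parse_lsblk_pairs(text):
    """Parse `lsblk -P` output (KEY="value" pairs) into a list of dicts."""
    rows = []
    for line in text.splitlines():
        if not line.strip():
            continue
        row = {}
        # shlex handles the double quotes, including empty values like MOUNTPOINT=""
        for token in shlex.split(line):
            key, _, value = token.partition("=")
            row[key] = value
        rows.append(row)
    return rows


# Illustrative sample only; device names and mount points are assumptions.
SAMPLE = '''KNAME="mmcblk0" TYPE="disk" MOUNTPOINT=""
KNAME="mmcblk0p3" TYPE="part" MOUNTPOINT="/run/mnt/ubuntu-boot"
KNAME="dm-0" TYPE="crypt" MOUNTPOINT="/run/mnt/data"
'''

rows = parse_lsblk_pairs(SAMPLE)
crypt_devices = [r["KNAME"] for r in rows if r["TYPE"] == "crypt"]
print(crypt_devices)
```

Comparing the parsed rows from a "working" run and a "failing" run should show no difference if lsblk is indeed not the source of the problem.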

However, when Checkbox cannot find dm-0 and dm-1 and only returns mmcblk0, the `udevadm info -e` output looks like this:

P: /devices/virtual/block/dm-0
N: dm-0
L: 0
E: DEVPATH=/devices/virtual/block/dm-0
E: SUBSYSTEM=block
E: DEVNAME=/dev/dm-0
E: DEVTYPE=disk
E: MAJOR=253
E: MINOR=0
E: USEC_INITIALIZED=35139639
E: DM_UDEV_DISABLE_SUBSYSTEM_RULES_FLAG=1
E: DM_UDEV_DISABLE_DISK_RULES_FLAG=1
E: DM_UDEV_DISABLE_OTHER_RULES_FLAG=1
E: SYSTEMD_READY=0
E: TAGS=:systemd:

P: /devices/virtual/block/dm-1
N: dm-1
L: 0
E: DEVPATH=/devices/virtual/block/dm-1
E: SUBSYSTEM=block
E: DEVNAME=/dev/dm-1
E: DEVTYPE=disk
E: MAJOR=253
E: MINOR=1
E: USEC_INITIALIZED=35147314
E: DM_UDEV_DISABLE_SUBSYSTEM_RULES_FLAG=1
E: DM_UDEV_DISABLE_DISK_RULES_FLAG=1
E: DM_UDEV_DISABLE_OTHER_RULES_FLAG=1
E: SYSTEMD_READY=0
E: TAGS=:systemd:

When things work as expected, the output looks like this:

P: /devices/virtual/block/dm-0
N: dm-0
L: 0
S: mapper/ubuntu-data-f1465ded-50aa-4a10-9dd8-4d2d9f97f30c
E: DEVPATH=/devices/virtual/block/dm-0
E: SUBSYSTEM=block
E: DEVNAME=/dev/dm-0
E: DEVTYPE=disk
E: MAJOR=253
E: MINOR=0
E: USEC_INITIALIZED=36869231
E: DM_UDEV_DISABLE_SUBSYSTEM_RULES_FLAG=1
E: DM_UDEV_DISABLE_DISK_RULES_FLAG=1
E: DM_UDEV_DISABLE_OTHER_RULES_FLAG=1
E: DM_UDEV_RULES=1
E: DM_UDEV_RULES_VSN=2
E: DM_NAME=ubuntu-data-f1465ded-50aa-4a10-9dd8-4d2d9f97f30c
E: DM_UUID=CRYPT-LUKS2-d571d7583a6e43d8bc2c8b6b4491a2e1-ubuntu-data-f1465ded-50aa-4a10-9dd8-4d2d9f97f30c
E: DM_SUSPENDED=0
E: SYSTEMD_READY=0
E: DEVLINKS=/dev/mapper/ubuntu-data-f1465ded-50aa-4a10-9dd8-4d2d9f97f30c
E: TAGS=:systemd:

P: /devices/virtual/block/dm-1
N: dm-1
L: 0
S: mapper/ubuntu-save-1b475f05-6de0-4abb-9373-1b79e7324197
E: DEVPATH=/devices/virtual/block/dm-1
E: SUBSYSTEM=block
E: DEVNAME=/dev/dm-1
E: DEVTYPE=disk
E: MAJOR=253
E: MINOR=1
E: USEC_INITIALIZED=36878289
E: DM_UDEV_DISABLE_SUBSYSTEM_RULES_FLAG=1
E: DM_UDEV_DISABLE_DISK_RULES_FLAG=1
E: DM_UDEV_DISABLE_OTHER_RULES_FLAG=1
E: DM_UDEV_RULES=1
E: DM_UDEV_RULES_VSN=2
E: DM_NAME=ubuntu-save-1b475f05-6de0-4abb-9373-1b79e7324197
E: DM_UUID=CRYPT-LUKS2-e3cfac5e00184c40ae4a4880a37416cf-ubuntu-save-1b475f05-6de0-4abb-9373-1b79e7324197
E: DM_SUSPENDED=0
E: SYSTEMD_READY=0
E: DEVLINKS=/dev/mapper/ubuntu-save-1b475f05-6de0-4abb-9373-1b79e7324197
E: TAGS=:systemd:

If we only focus on dm-0, we can see the following elements are only present when things work as expected:

S: mapper/ubuntu-data-f1465ded-50aa-4a10-9dd8-4d2d9f97f30c
E: DM_UDEV_RULES=1
E: DM_UDEV_RULES_VSN=2
E: DM_NAME=ubuntu-data-f1465ded-50aa-4a10-9dd8-4d2d9f97f30c
E: DM_UUID=CRYPT-LUKS2-d571d7583a6e43d8bc2c8b6b4491a2e1-ubuntu-data-f1465ded-50aa-4a10-9dd8-4d2d9f97f30c
E: DM_SUSPENDED=0
E: DEVLINKS=/dev/mapper/ubuntu-data-f1465ded-50aa-4a10-9dd8-4d2d9f97f30c
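
The same comparison can be made mechanically. The following Python sketch parses two `udevadm info` records (abridged from the dm-0 output above) and lists the properties that only appear in the working case. `parse_udev_record` is a hypothetical helper written for this illustration; it is not how Checkbox itself parses udev output.

```python
def parse_udev_record(text):
    """Split one `udevadm info` record into symlinks (S:) and properties (E:)."""
    symlinks, props = [], {}
    for line in text.splitlines():
        prefix, _, rest = line.partition(": ")
        if prefix == "S":
            symlinks.append(rest)
        elif prefix == "E":
            key, _, value = rest.partition("=")
            props[key] = value
    return symlinks, props


# Abridged dm-0 record from the failing case.
BROKEN_DM0 = """\
P: /devices/virtual/block/dm-0
N: dm-0
L: 0
E: DEVPATH=/devices/virtual/block/dm-0
E: SUBSYSTEM=block
E: DEVNAME=/dev/dm-0
E: SYSTEMD_READY=0
E: TAGS=:systemd:
"""

# Working case: same record plus the extra lines seen when the dm rules ran.
WORKING_DM0 = BROKEN_DM0 + """\
S: mapper/ubuntu-data-f1465ded-50aa-4a10-9dd8-4d2d9f97f30c
E: DM_UDEV_RULES=1
E: DM_UDEV_RULES_VSN=2
E: DM_NAME=ubuntu-data-f1465ded-50aa-4a10-9dd8-4d2d9f97f30c
E: DM_UUID=CRYPT-LUKS2-d571d7583a6e43d8bc2c8b6b4491a2e1-ubuntu-data-f1465ded-50aa-4a10-9dd8-4d2d9f97f30c
E: DM_SUSPENDED=0
E: DEVLINKS=/dev/mapper/ubuntu-data-f1465ded-50aa-4a10-9dd8-4d2d9f97f30c
"""

broken_links, broken_props = parse_udev_record(BROKEN_DM0)
working_links, working_props = parse_udev_record(WORKING_DM0)

# Property names present only when things work as expected.
missing = sorted(set(working_props) - set(broken_props))
print(missing)
print(working_links)
```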

Pierre Equoy (pieq) wrote :

In /usr/lib/udev/rules.d/55-dm.rules we can see the following:

------------------------------------------------------------
(...)
# Device created, major and minor number assigned - "add" event generated.
# Table loaded - no event generated.
# Device resumed (or renamed) - "change" event generated.
# Device removed - "remove" event generated.
#
# The dm-X nodes are always created, even on "add" event, we can't suppress
# that (the node is created even earlier with devtmpfs). All the symlinks
# (e.g. /dev/mapper) are created in right time after a device has its table
# loaded and is properly resumed. For this reason, direct use of dm-X nodes
# is not recommended.
ACTION!="add|change", GOTO="dm_end"

(...)

ENV{DM_UDEV_DISABLE_DM_RULES_FLAG}!="1", ENV{DM_NAME}=="?*", SYMLINK+="mapper/$env{DM_NAME}"

(...)

LABEL="dm_end"
------------------------------------------------------------

This is consistent with the findings in the previous comment: on the first boot, the disk is "added" for the first time, so udev generates a bunch of information (SYMLINK, ENV{DM_NAME}, etc.). On subsequent boots, the disk remains the same, so neither "add" nor "change" events are generated, and udev skips all the steps and goes directly to the bottom of the rules file ("dm_end").

Now, that comment is saying it is not recommended to use dm-X nodes directly. I'm wondering if we could find a better way to do this in Checkbox, then.
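
One possible direction, sketched here only as an idea and not as the eventual fix: the kernel exposes the device-mapper name and UUID under `/sys/block/dm-X/dm/`, independently of whether the udev dm rules ran. The Python sketch below reads those two files. The helper names (`dm_info`, `is_luks`) are hypothetical, and whether this sysfs path is readable from inside the Checkbox snap confinement is an untested assumption. The demo builds a fake sysfs tree in a temporary directory so it can run anywhere.

```python
import os
import tempfile


def dm_info(kname, sysfs_root="/sys"):
    """Return (name, uuid) for a dm-X node read from sysfs, or None if missing."""
    dm_dir = os.path.join(sysfs_root, "block", kname, "dm")
    try:
        with open(os.path.join(dm_dir, "name")) as f:
            name = f.read().strip()
        with open(os.path.join(dm_dir, "uuid")) as f:
            uuid = f.read().strip()
    except OSError:
        return None
    return name, uuid


def is_luks(uuid):
    """LUKS mappings carry a CRYPT-LUKS prefix in their DM UUID (see DM_UUID above)."""
    return uuid.startswith("CRYPT-LUKS")


# Demo against a fake sysfs tree, using the dm-0 values from the previous comment.
with tempfile.TemporaryDirectory() as root:
    dm_dir = os.path.join(root, "block", "dm-0", "dm")
    os.makedirs(dm_dir)
    with open(os.path.join(dm_dir, "name"), "w") as f:
        f.write("ubuntu-data-f1465ded-50aa-4a10-9dd8-4d2d9f97f30c\n")
    with open(os.path.join(dm_dir, "uuid"), "w") as f:
        f.write("CRYPT-LUKS2-d571d7583a6e43d8bc2c8b6b4491a2e1-"
                "ubuntu-data-f1465ded-50aa-4a10-9dd8-4d2d9f97f30c\n")

    found = dm_info("dm-0", sysfs_root=root)
    absent = dm_info("dm-9", sysfs_root=root)

print(found[0], is_luks(found[1]))
print(absent)
```

On a real system the call would simply be `dm_info("dm-0")`; the result could then be mapped to `/dev/mapper/<name>` instead of relying on the udev-generated symlink properties.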

Pierre Equoy (pieq)
description: updated
summary: - disk/disk_stress_ng_* fail on encrypted partitions, but not always
+ [UC20] disk/disk_stress_ng_* fail on encrypted partitions if not run
+ right after installing the OS
Rod Smith (rodsmith) wrote :

I have a couple of thoughts/comments on this.

First, I notice that the output includes the complaint that the "Disk is too small to test" (10 GiB being the apparent cutoff). This is handled in the disk_support.py script in Checkbox. That device has a size of 1536000, according to attachment_files/com.canonical.certification__sysfs_attachment, but it's not quite clear what the units are. If that's 512-byte sectors (as I think most likely, offhand), then the partition is 750 MiB in size, which might be a /boot partition or something similarly small. There's also a /dev/nvme0n1p5 with a size of 7492530824, which is about 3.5 TiB (again, assuming 512-byte sectors as the unit size). The disk tests were originally designed (before my time at Canonical) to be run on a disk with a single big partition, or at least on a disk that's dominated by one big partition with perhaps one or two smaller ones. I vaguely recall looking at the relevant code at some point, and it tried to find the largest partition that it could read on a disk, starting with the whole-disk device (e.g., /dev/sda or /dev/nvme0n1) as an input. My guess is that in this case, /dev/nvme0n1p5 is encrypted, so the code to locate the largest readable partition is falling back on that too-small partition. Linux's model for naming and accessing encrypted partitions obscures where they're located, so if my memory of this algorithm is correct, it's not surprising that it's failing to locate a testable partition.
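
The size arithmetic above can be checked with a few lines of Python. This is a sketch under the same assumption made in the comment (sizes expressed in 512-byte sectors); the 10 GiB threshold is taken from the warning in the test output, and the constant names are made up for this example.

```python
SECTOR_BYTES = 512          # assumption: sizes are in 512-byte sectors
MIN_TEST_BYTES = 10 * 2**30  # 10 GiB cutoff mentioned in the test output


def sectors_to_bytes(sectors, sector_bytes=SECTOR_BYTES):
    """Convert a sector count to a size in bytes."""
    return sectors * sector_bytes


small = sectors_to_bytes(1536000)      # the partition the test fell back on
large = sectors_to_bytes(7492530824)   # /dev/nvme0n1p5

print(f"small: {small / 2**20:.0f} MiB, testable: {small >= MIN_TEST_BYTES}")
print(f"large: {large / 2**40:.2f} TiB, testable: {large >= MIN_TEST_BYTES}")
```

Under that assumption, only the larger partition clears the 10 GiB cutoff, which matches the "Disk is too small to test" message when the smaller one is selected.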

Second, udev sets up a bunch of symlinks between /dev/dm-* devices and other names, typically in /dev/mapper and perhaps elsewhere (like /dev/{lvmname} for LVM configurations). It's conceivable that you're seeing inconsistency in the name Checkbox is reporting for this reason; there could be variability in the order in which udev creates these symlinks or the order in which Checkbox is detecting them because Checkbox is just getting a list of devices and going through them one after the other without first sorting the list. This is VERY tentative speculation, though; I haven't yet tried to locate the relevant code and figure out what it's doing.

So: I may be wrong in this, but my suspicion is that we'll need to significantly alter how Checkbox locates disk devices for testing if it's to handle encrypted devices. It was simply built on the ASSUMPTION that it would be used to test disks with conventional unencrypted partitions, because it's easiest to locate such partitions when starting from raw disk devices (which are what we want to test, really), and because that's how servers and desktop/laptop computers were typically configured several years ago. In the IoT era, that assumption may not be a good one any longer.

As a workaround in the meantime, if it's possible to configure the computer to have an unencrypted 10GiB+ partition, in addition to or instead of the current big encrypted partition, then the tests should be able to locate the big unencrypted partition and test the disk using that partition.

Changed in plainbox-provider-checkbox:
status: New → Fix Released
Pierre Equoy (pieq)
Changed in plainbox-provider-checkbox:
status: Fix Released → Confirmed
milestone: 0.64.0 → 0.65.0
Changed in plainbox-provider-checkbox:
milestone: 0.65.0 → 0.66.0
Pierre Equoy (pieq)
Changed in plainbox-provider-checkbox:
milestone: 2.0.0 → 2.1.0
Maksim Beliaev (beliaev-maksim) wrote :

Bug was migrated to GitHub: https://github.com/canonical/checkbox/issues/93.
This bug is no longer monitored here.

Changed in plainbox-provider-checkbox:
status: Confirmed → Expired