Bug #1959077 “[UC20] disk/disk_stress_ng_* fail on encrypted par...” : Bugs : Checkbox Provider - Base

Pierre Equoy (pieq) on 2022-01-26

Changed in plainbox-provider-checkbox:
milestone:	none → 0.64.0
importance:	Undecided → High
description:	updated

Matias Piipari (mz2) on 2022-03-14

tags:	added: cbox-21

Pierre Equoy (pieq) wrote on 2022-03-21:

#1

It happened again to me, on another project using UC20 and full disk encryption.

The first time I installed UC20 and Checkbox20, running:

checkbox-<project>.checkbox-cli run com.canonical.certification::device

returned the following item in the DISK category:

(...)
path: /devices/pci0000:00/0000:00:1c.0/mmc_host/mmc0/mmc0:0001
name: mmcblk0
bus: mmc
category: DISK
driver: mmcblk
product: DG4032
product_slug: DG4032
(...)

Because of this, `disk/disk_stress_ng_mmcblk0` failed:

=============================
STRESS_NG_DISK_TIME env var is not found, stress_ng disk running time is default value
WARNING:root:Warning: mmcblk0p3 is less than 10 GiB in size!
ERROR:root:Disk is too small to test. Aborting test!
** Unable to find a suitable partition! Aborting!
retval is 1
**************************************************************
** stress-ng test failed!
**************************************************************
=============================

After reinstalling the OS and Checkbox, the device resource job output was:

(...)
path: /devices/virtual/block/dm-0
name: dm-0
bus: block
category: DISK
product: dm-0
product_slug: dm-0

path: /devices/virtual/block/dm-1
name: dm-1
bus: block
category: DISK
product: dm-1
product_slug: dm-1
(...)

and this time, the `disk/disk_stress_ng_*` jobs could be run properly.

Pierre Equoy (pieq) wrote on 2022-03-23:

#2

I spent more time investigating by re-installing UC20 and Checkbox several times and observing the output of lsblk and udevadm from both outside the checkbox snap and inside.

The output of `lsblk -i -n -P -o KNAME,TYPE,MOUNTPOINT` seems pretty consistent regardless of whether dm-0 and dm-1 are found by Checkbox, so I guess we can rule this one out.
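
For anyone who wants to repeat that comparison, here is a minimal sketch that parses the `KEY="value"` pairs printed by `lsblk -P` and compares two captures. The sample lines and the function names are hypothetical (they are not output taken from the affected device), so treat this as an illustration only.

```python
import shlex


def parse_lsblk_pairs(text):
    """Turn `lsblk -P` output (KEY="value" pairs) into a list of dicts."""
    rows = []
    for line in text.splitlines():
        if not line.strip():
            continue
        # shlex strips the quotes, leaving tokens such as KNAME=mmcblk0
        rows.append(dict(token.split("=", 1) for token in shlex.split(line)))
    return rows


def same_block_devices(first, second):
    """True when two captures list the same (KNAME, TYPE) pairs."""
    def as_set(rows):
        return {(r["KNAME"], r["TYPE"]) for r in rows}
    return as_set(first) == as_set(second)


# Hypothetical capture, NOT taken from the device discussed in this bug.
SAMPLE = '''\
KNAME="mmcblk0" TYPE="disk" MOUNTPOINT=""
KNAME="mmcblk0p3" TYPE="part" MOUNTPOINT=""
KNAME="dm-0" TYPE="crypt" MOUNTPOINT="/writable"
'''

rows = parse_lsblk_pairs(SAMPLE)
print(rows)
print(same_block_devices(rows, parse_lsblk_pairs(SAMPLE)))
```

If two captures (one from a working install, one from a failing one) compare as equal here, that supports ruling lsblk out.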

However, when Checkbox cannot find dm-0 and dm-1 and only returns mmcblk0, the `udevadm info -e` output looks like this:

P: /devices/virtual/block/dm-0
N: dm-0
L: 0
E: DEVPATH=/devices/virtual/block/dm-0
E: SUBSYSTEM=block
E: DEVNAME=/dev/dm-0
E: DEVTYPE=disk
E: MAJOR=253
E: MINOR=0
E: USEC_INITIALIZED=35139639
E: DM_UDEV_DISABLE_SUBSYSTEM_RULES_FLAG=1
E: DM_UDEV_DISABLE_DISK_RULES_FLAG=1
E: DM_UDEV_DISABLE_OTHER_RULES_FLAG=1
E: SYSTEMD_READY=0
E: TAGS=:systemd:

P: /devices/virtual/block/dm-1
N: dm-1
L: 0
E: DEVPATH=/devices/virtual/block/dm-1
E: SUBSYSTEM=block
E: DEVNAME=/dev/dm-1
E: DEVTYPE=disk
E: MAJOR=253
E: MINOR=1
E: USEC_INITIALIZED=35147314
E: DM_UDEV_DISABLE_SUBSYSTEM_RULES_FLAG=1
E: DM_UDEV_DISABLE_DISK_RULES_FLAG=1
E: DM_UDEV_DISABLE_OTHER_RULES_FLAG=1
E: SYSTEMD_READY=0
E: TAGS=:systemd:

When things work as expected, the output looks like this:

P: /devices/virtual/block/dm-0
N: dm-0
L: 0
S: mapper/ubuntu-data-f1465ded-50aa-4a10-9dd8-4d2d9f97f30c
E: DEVPATH=/devices/virtual/block/dm-0
E: SUBSYSTEM=block
E: DEVNAME=/dev/dm-0
E: DEVTYPE=disk
E: MAJOR=253
E: MINOR=0
E: USEC_INITIALIZED=36869231
E: DM_UDEV_DISABLE_SUBSYSTEM_RULES_FLAG=1
E: DM_UDEV_DISABLE_DISK_RULES_FLAG=1
E: DM_UDEV_DISABLE_OTHER_RULES_FLAG=1
E: DM_UDEV_RULES=1
E: DM_UDEV_RULES_VSN=2
E: DM_NAME=ubuntu-data-f1465ded-50aa-4a10-9dd8-4d2d9f97f30c
E: DM_UUID=CRYPT-LUKS2-d571d7583a6e43d8bc2c8b6b4491a2e1-ubuntu-data-f1465ded-50aa-4a10-9dd8-4d2d9f97f30c
E: DM_SUSPENDED=0
E: SYSTEMD_READY=0
E: DEVLINKS=/dev/mapper/ubuntu-data-f1465ded-50aa-4a10-9dd8-4d2d9f97f30c
E: TAGS=:systemd:

P: /devices/virtual/block/dm-1
N: dm-1
L: 0
S: mapper/ubuntu-save-1b475f05-6de0-4abb-9373-1b79e7324197
E: DEVPATH=/devices/virtual/block/dm-1
E: SUBSYSTEM=block
E: DEVNAME=/dev/dm-1
E: DEVTYPE=disk
E: MAJOR=253
E: MINOR=1
E: USEC_INITIALIZED=36878289
E: DM_UDEV_DISABLE_SUBSYSTEM_RULES_FLAG=1
E: DM_UDEV_DISABLE_DISK_RULES_FLAG=1
E: DM_UDEV_DISABLE_OTHER_RULES_FLAG=1
E: DM_UDEV_RULES=1
E: DM_UDEV_RULES_VSN=2
E: DM_NAME=ubuntu-save-1b475f05-6de0-4abb-9373-1b79e7324197
E: DM_UUID=CRYPT-LUKS2-e3cfac5e00184c40ae4a4880a37416cf-ubuntu-save-1b475f05-6de0-4abb-9373-1b79e7324197
E: DM_SUSPENDED=0
E: SYSTEMD_READY=0
E: DEVLINKS=/dev/mapper/ubuntu-save-1b475f05-6de0-4abb-9373-1b79e7324197
E: TAGS=:systemd:

If we only focus on dm-0, we can see the following elements are only present when things work as expected:

S: mapper/ubuntu-data-f1465ded-50aa-4a10-9dd8-4d2d9f97f30c
E: DM_UDEV_RULES=1
E: DM_UDEV_RULES_VSN=2
E: DM_NAME=ubuntu-data-f1465ded-50aa-4a10-9dd8-4d2d9f97f30c
E: DM_UUID=CRYPT-LUKS2-d571d7583a6e43d8bc2c8b6b4491a2e1-ubuntu-data-f1465ded-50aa-4a10-9dd8-4d2d9f97f30c
E: DM_SUSPENDED=0
E: DEVLINKS=/dev/mapper/ubuntu-data-f1465ded-50aa-4a10-9dd8-4d2d9f97f30c
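
The difference listed above could be checked automatically. Below is a rough sketch that splits `udevadm info -e` output into records and reports device-mapper nodes that have no DM_NAME property. The function names are hypothetical, and the sample text is a shortened mix of the two cases shown earlier (dm-0 from the failing run, dm-1 from the working run), not a real capture.

```python
def parse_udevadm_export(text):
    """Split `udevadm info -e` output into one dict per device record."""
    devices = []
    for block in text.strip().split("\n\n"):
        record = {"path": None, "symlinks": [], "env": {}}
        for line in block.splitlines():
            prefix, _, value = line.partition(": ")
            if prefix == "P":
                record["path"] = value
            elif prefix == "S":
                record["symlinks"].append(value)
            elif prefix == "E":
                key, _, val = value.partition("=")
                record["env"][key] = val
        if record["path"]:
            devices.append(record)
    return devices


def unnamed_dm_devices(devices):
    """Return paths of dm-X nodes whose record has no DM_NAME property."""
    return [
        d["path"]
        for d in devices
        if d["path"].startswith("/devices/virtual/block/dm-")
        and "DM_NAME" not in d["env"]
    ]


# Shortened, mixed sample built from the two outputs quoted above.
SAMPLE = """\
P: /devices/virtual/block/dm-0
N: dm-0
E: DEVNAME=/dev/dm-0
E: SYSTEMD_READY=0
E: TAGS=:systemd:

P: /devices/virtual/block/dm-1
N: dm-1
S: mapper/ubuntu-save-1b475f05-6de0-4abb-9373-1b79e7324197
E: DEVNAME=/dev/dm-1
E: DM_NAME=ubuntu-save-1b475f05-6de0-4abb-9373-1b79e7324197
"""

print(unnamed_dm_devices(parse_udevadm_export(SAMPLE)))
```

On a real system the text would come from running `udevadm info -e` and feeding its standard output to `parse_udevadm_export`.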

Pierre Equoy (pieq) wrote on 2022-03-29:

#3

In /usr/lib/udev/rules.d/55-dm.rules we can see the following:

------------------------------------------------------------
(...)
# Device created, major and minor number assigned - "add" event generated.
# Table loaded - no event generated.
# Device resumed (or renamed) - "change" event generated.
# Device removed - "remove" event generated.
#
# The dm-X nodes are always created, even on "add" event, we can't suppress
# that (the node is created even earlier with devtmpfs). All the symlinks
# (e.g. /dev/mapper) are created in right time after a device has its table
# loaded and is properly resumed. For this reason, direct use of dm-X nodes
# is not recommended.
ACTION!="add|change", GOTO="dm_end"

(...)

ENV{DM_UDEV_DISABLE_DM_RULES_FLAG}!="1", ENV{DM_NAME}=="?*", SYMLINK+="mapper/$env{DM_NAME}"

(...)

LABEL="dm_end"
------------------------------------------------------------

This is aligned with the findings in the previous comment: on the first boot, the disk is "added" for the first time, so udev generates a bunch of information (SYMLINK, ENV{DM_NAME}, etc.). On subsequent boots, the disk remains the same, so neither "add" nor "change" events are generated, and udev skips all the steps and jumps directly to the bottom of the rules file ("dm_end").

Now, that comment is saying it is not recommended to use dm-X nodes directly. I'm wondering if we could find a better way to do this in Checkbox, then.
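
One possible direction, offered only as a sketch and not as what Checkbox does today: the kernel exposes the device-mapper name of each dm-X node in sysfs (`/sys/block/dm-X/dm/name`) and the backing devices under `/sys/block/dm-X/slaves/`, and these do not depend on the udev properties discussed above. The function name `dm_nodes` is hypothetical, and whether these paths are readable from inside the snap confinement is an assumption that would need checking.

```python
from pathlib import Path


def dm_nodes(sysfs_block=Path("/sys/block")):
    """Map each dm-X node to its device-mapper name and backing devices."""
    result = {}
    for node in sorted(Path(sysfs_block).glob("dm-*")):
        name_file = node / "dm" / "name"
        slaves_dir = node / "slaves"
        result[node.name] = {
            # None when the kernel attribute is missing or unreadable
            "name": name_file.read_text().strip() if name_file.is_file() else None,
            # kernel names of the devices the mapping sits on
            "slaves": sorted(p.name for p in slaves_dir.iterdir())
            if slaves_dir.is_dir()
            else [],
        }
    return result


if __name__ == "__main__":
    for node, info in dm_nodes().items():
        print(node, info["name"], info["slaves"])
```

The name read this way corresponds to the entry under /dev/mapper when that symlink exists, and the slaves list would let a test tie an encrypted volume back to the physical disk it lives on.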

Pierre Equoy (pieq) on 2022-03-29

description:	updated
summary:	- disk/disk_stress_ng_* fail on encrypted partitions, but not always
+ [UC20] disk/disk_stress_ng_* fail on encrypted partitions if not run
+ right after installing the OS

Rod Smith (rodsmith) wrote on 2022-03-30:

#4

I have a couple of thoughts/comments on this.

First, I notice that the output includes the complaint that the "Disk is too small to test" (10 GiB being the apparent cutoff). This is handled in the disk_support.py script in Checkbox. That device has a size of 1536000, according to attachment_files/com.canonical.certification__sysfs_attachment, but it's not quite clear what the units are. If that's 512-byte sectors (as I think most likely, offhand), then the partition is 750 MiB in size, which might be a /boot partition or something similarly small. There's also a /dev/nvme0n1p5 with a size of 7492530824, which is about 3.5 TiB (again, assuming 512-byte sectors as the unit size). The disk tests were originally designed (before my time at Canonical) to be run on a disk with a single big partition, or at least on a disk that's dominated by one big partition with perhaps one or two smaller ones. I vaguely recall looking at the relevant code at some point, and it tried to find the largest partition that it could read on a disk, starting with the whole-disk device (e.g., /dev/sda or /dev/nvme0n1) as an input. My guess is that in this case, /dev/nvme0n1p5 is encrypted, so the code to locate the largest readable partition is falling back on that too-small partition. Linux's model for naming and accessing encrypted partitions obscures where they're located, so if my memory of this algorithm is correct, it's not surprising that it's failing to locate a testable partition.
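
The size arithmetic above can be double-checked with a few lines of Python. This is just a sketch: the 512-byte unit is the same assumption made in the paragraph, the constant and function names are made up for illustration, and the 10 GiB threshold is the cutoff quoted in the log rather than a value read from disk_support.py.

```python
SECTOR_BYTES = 512            # assumption: sizes are counted in 512-byte sectors
MIN_TEST_BYTES = 10 * 2**30   # the 10 GiB cutoff mentioned in the test log


def sectors_to_bytes(sectors, sector_bytes=SECTOR_BYTES):
    """Convert a sector count into bytes."""
    return sectors * sector_bytes


for label, sectors in [("small partition", 1536000), ("nvme0n1p5", 7492530824)]:
    size = sectors_to_bytes(sectors)
    print(
        f"{label}: {size / 2**20:.0f} MiB, {size / 2**40:.2f} TiB, "
        f"large enough to test: {size >= MIN_TEST_BYTES}"
    )
```

Under that assumption, the first size is exactly 750 MiB (below the cutoff) and the second is roughly 3.5 TiB (well above it).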

Second, udev sets up a bunch of symlinks between /dev/dm-* devices and other names, typically in /dev/mapper and perhaps elsewhere (like /dev/{lvmname} for LVM configurations). It's conceivable that you're seeing inconsistency in the name Checkbox is reporting for this reason; there could be variability in the order in which udev creates these symlinks or the order in which Checkbox is detecting them because Checkbox is just getting a list of devices and going through them one after the other without first sorting the list. This is VERY tentative speculation, though; I haven't yet tried to locate the relevant code and figure out what it's doing.

So: I may be wrong in this, but my suspicion is that we'll need to significantly alter how Checkbox locates disk devices for testing if it's to handle encrypted devices. It was simply built on the ASSUMPTION that it would be used to test disks with conventional unencrypted partitions, because it's easiest to locate such partitions when starting from raw disk devices (which are what we want to test, really), and because that's how servers and desktop/laptop computers were typically configured several years ago. In the IoT era, that assumption may not be a good one any longer.

As a workaround in the meantime, if it's possible to configure the computer to have an unencrypted 10GiB+ partition, in addition to or instead of the current big encrypted partition, then the tests should be able to locate the big unencrypted partition and test the disk using that partition.

Devices Certification Bot (ce-certification-qa) on 2022-04-15

Changed in plainbox-provider-checkbox:
status:	New → Fix Released

Pierre Equoy (pieq) on 2022-04-26

Changed in plainbox-provider-checkbox:
status:	Fix Released → Confirmed
milestone:	0.64.0 → 0.65.0

Sylvain Pineau (sylvain-pineau) on 2022-06-16

Changed in plainbox-provider-checkbox:
milestone:	0.65.0 → 0.66.0

Pierre Equoy (pieq) on 2022-11-18

Changed in plainbox-provider-checkbox:
milestone:	2.0.0 → 2.1.0

Maksim Beliaev (beliaev-maksim) wrote on 2022-11-28:

#5

Bug was migrated to GitHub: https://github.com/canonical/checkbox/issues/93.
This bug is no longer monitored here.

Changed in plainbox-provider-checkbox:
status:	Confirmed → Expired

Checkbox Provider - Base

[UC20] disk/disk_stress_ng_* fail on encrypted partitions if not run right after installing the OS