5.4.0-11 crash on cryptsetup open

Bug #1860231 reported by Claudio Matsuoka on 2020-01-18
This bug affects 1 person
Affects          Status          Importance  Assigned to    Milestone
linux (Ubuntu)   Fix Released    High        Stefan Bader
  Xenial         Fix Committed   High        Unassigned
  Bionic         Fix Committed   High        Unassigned
  Disco          Fix Committed   High        Unassigned
  Eoan           Fix Committed   High        Unassigned

Bug Description

[Impact]

An attempt to run cryptsetup open on a newly created LUKS partition on Ubuntu Core 20 causes a kernel crash. This happens in 100% of the attempts on the snapd Core 20 installation test, but on an image created to reproduce this bug it happens only when certain parameters are passed to cryptsetup. Both images are built similarly so the reason for this discrepancy is unknown. The kernel was installed from pc-kernel_374.snap.

[Test Case]

$ dir=$(mktemp -d /tmp/lp1860231.XXXXX)
$ dmsetup create lp1860231 --notable
$ mount -t ext4 \
  "/dev/dm-$(dmsetup info -c -o minor --noheadings lp1860231)" "$dir"

Now check the logs for a backtrace.
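
The log check can be scripted. The sketch below is a hypothetical helper, not part of the original test case. It assumes the crash is logged as a kernel NULL pointer dereference (in line with the analysis in the comments that make_request_fn was NULL); the exact wording of that log line is an assumption and differs between kernel versions.

```shell
# Hypothetical helper: read kernel log text on stdin and succeed if it
# contains a NULL pointer dereference report. The two spellings matched
# here are assumptions about how the kernel words the message.
has_null_deref() {
    grep -Eq 'BUG: (unable to handle )?kernel NULL pointer dereference'
}

# Example use after running the test case (needs permission to read the log):
#   dmesg | has_null_deref && echo "crash reproduced"
```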

[Regression Potential]

The currently proposed fix introduces no chance of stability regressions. There is a chance of a very small performance regression, since an additional pointer comparison is performed on each block layer request, but this is unlikely to be noticeable.

[Original Report]

Linux version 5.4.0-11-generic (buildd@lgw01-amd64-021) (gcc version 9.2.1 20200104 (Ubuntu 9.2.1-22ubuntu2)) #14-Ubuntu SMP Thu Jan 9 16:14:26 UTC 2020

Version signature: Ubuntu 5.4.0-11.14-generic 5.4.8

How to reproduce the crash in 3 "easy" steps:

1. Build a Core 20 image using the attached model file:
   1.1. Install the ubuntu-image from latest/edge
        $ sudo snap install --channel latest/edge ubuntu-image
   1.2. Build the image
        $ sudo ubuntu-image --image-size=4G ubuntu-core-20-amd64.model

2. Boot the image in kvm
   2.1. Install ovmf version 0~20190606.20d2e5a1-2ubuntu1 or newer (the
        stock ovmf from bionic may not work)
   2.2. Boot the image
        $ sudo kvm -snapshot -m 2048 -smp 4 \
          -netdev user,id=mynet0,hostfwd=tcp::8022-:22,hostfwd=tcp::8090-:80 \
          -device virtio-net-pci,netdev=mynet0 \
          -drive file=pc.img,if=virtio \
          -bios /usr/share/OVMF/OVMF_CODE.ms.fd
   2.3. In the grub menu, edit the default option to include parameter
        "systemd.debug-shell=1" in the kernel command line
   2.4. Boot the kernel

3. Crash the kernel
   3.1. When the system boots to the "Press enter to configure"
        message, press ALT-F9 to enter the debug shell.
   3.2. The system should have two partitions in /dev/vda. Create a
        third one with fdisk.
   3.3. Create a LUKS encrypted partition:
        # echo 123|cryptsetup luksFormat -q --type luks2 --key-file - --pbkdf argon2i --iter-time 1 /dev/vda3
        (the system will complain about a missing locking directory,
        just ignore it.)
   3.4. Open the encrypted device:
        # echo 123|cryptsetup open --key-file - /dev/vda3 name

        The Core 20 images contain the following udev rule which causes
        the new block device to be mounted automatically. This mount is
        what triggers the BUG:
        ACTION=="add", SUBSYSTEM=="block", KERNEL!="loop*", KERNEL!="ram*" \
        RUN+="/usr/bin/unshare -m /usr/bin/snap auto-import --mount=/dev/%k"
   3.5. Read the crash message

The attached screenshots show these steps being executed.

A few notes:

- The backtrace seems very similar to the one reported in bug #1835279; however, that problem was possibly caused by a race between partition creation and LUKS formatting. That doesn't seem to be the case this time: delays between commands don't help here.
- In the test case above, using large values of KDF iter-time may prevent the crash. I successfully opened the device in kernel 5.4.0-9 with --iter-time larger than 100, but 5.4.0-11 seems to require values closer to 1000. Regardless of the --iter-time value used, the crash always happens when running the test in a spread-driven automated environment (same kernel, with the image built in the same way; some other variable seems to be disturbing the system).
- All necessary modules are loaded before the LUKS partition creation (i.e. it doesn't seem to be caused by a race between dm-crypt loading and cryptsetup luksFormat for example).
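
One way to confirm the module state before running luksFormat is sketched below. The helper name is made up for illustration, and it assumes dm-crypt is built as a loadable module (a built-in driver would not be listed in /proc/modules).

```shell
# Hypothetical helper: succeed if the named module appears in the module list.
# The list defaults to /proc/modules; the optional second argument exists only
# so the check can be run against a saved copy of that file.
module_loaded() {
    # /proc/modules spells module names with underscores (dm-crypt -> dm_crypt)
    name=$(printf '%s' "$1" | sed 's/-/_/g')
    grep -q "^${name} " "${2:-/proc/modules}"
}

# Example use in the debug shell:
#   module_loaded dm-crypt || echo "dm-crypt is not loaded yet"
```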

Claudio Matsuoka (cmatsuoka) wrote :

Snap versions used to build the image:

pc-kernel_374.snap
pc_83.snap
snapd_6113.snap
core20_322.snap

This bug is missing log files that will aid in diagnosing the problem. While running an Ubuntu kernel (not a mainline or third-party kernel) please enter the following command in a terminal window:

apport-collect 1860231

and then change the status of the bug to 'Confirmed'.

If, due to the nature of the issue you have encountered, you are unable to run this command, please add a comment stating that fact and change the bug status to 'Confirmed'.

This change has been made by an automated script, maintained by the Ubuntu Kernel Team.

Changed in linux (Ubuntu):
status: New → Incomplete
Claudio Matsuoka (cmatsuoka) wrote :

This model file is used to build both the spread test image and the manually built image. The two generated images, however, seem to behave differently regarding the conditions leading to the crash: while it always happens in the spread test, higher KDF iteration time values allow the encrypted device to be opened correctly in the image created using the steps described in the bug description.

Michael Vogt also reports that the crash doesn't happen in a classic system running the same kernel version.

Andrea Righi (arighi) on 2020-01-20
Changed in linux (Ubuntu):
assignee: nobody → Andrea Righi (arighi)
Tyler Hicks (tyhicks) on 2020-01-20
description: updated
Andrea Righi (arighi) wrote :

After a first look at the kernel bug trace, it seems that q->make_request_fn(q, bio) (block/blk-core.c:1064) became NULL.

The reason might be a race with a block device not yet properly initialized when some I/O requests were submitted (or a block device de-registered too early while some I/O was still in progress), but, considering it was triggered upon a mount, I would say the former scenario is more likely to be the case.

I haven't noticed any potentially related fix in DM or in the core block layer. I'll keep investigating.

Michael Vogt (mvo) wrote :

I was able to reproduce this with the attached snapd, which is essentially PR#7999 plus the following patch:
"""
diff --git a/cmd/snap/cmd_auto_import.go b/cmd/snap/cmd_auto_import.go
index 7408371e11..f6b8f1d0d0 100644
--- a/cmd/snap/cmd_auto_import.go
+++ b/cmd/snap/cmd_auto_import.go
@@ -38,7 +38,6 @@ import (
        "github.com/snapcore/snapd/i18n"
        "github.com/snapcore/snapd/logger"
        "github.com/snapcore/snapd/osutil"
-       "github.com/snapcore/snapd/release"
 )

 const autoImportsName = "auto-import.assert"
@@ -264,11 +263,6 @@ func (x *cmdAutoImport) Execute(args []string) error {
                return ErrExtraArgs
        }

-       if release.OnClassic && !x.ForceClassic {
-               fmt.Fprintf(Stderr, "auto-import is disabled on classic\n")
-               return nil
-       }
-
        for _, path := range x.Mount {
                // udev adds new /dev/loopX devices on the fly when a
                // loop mount happens and there is no loop device left.
"""

and then running the following spread test as shell code:
"""
    echo "Setup the image as a block device"
    # without -P this test will not work, then /dev/loop1p? will be missing
    losetup -fP fake.img
    losetup -a |grep fake.img|cut -f1 -d: > loop.txt
    LOOP="$(cat loop.txt)"

    echo "Create an empty partition header"
    echo "label: gpt" | sfdisk "$LOOP"

    echo "Get the UC20 gadget"
    snap download --channel=20/edge pc

    unsquashfs -d gadget-dir pc_*.snap
    LOOP="$(cat loop.txt)"
    echo "Run the snap-bootstrap tool"
    /usr/lib/snapd/snap-bootstrap create-partitions --encrypt --mount --key-file keyfile ./gadget-dir "$LOOP"
"""

Michael Vogt (mvo) wrote :

I reproduced it successfully with the following spread commandline using PR#7999 plus the patch in the previous comment:

$ spread -debug qemu:ubuntu-20.04-64:tests/main/uc20-snap-recovery-encrypt

Stefan Bader (smb) wrote :

With additional data it is basically a bug in either the mount syscall, generic_make_request_checks, or dm.c. Device-mapper is set up in two stages: the initial device creation and the table load. Somewhere around v4.1 things were changed to defer setting the make-request function of the device (queue) until the mapping table gets loaded.

One can create such an intermediate setup using "dmsetup create -n <name>". Then a "mount /dev/dm-?" triggers the bug, since generic_make_request_checks has a check for device->queue == NULL but not for device->queue->make_request_fn == NULL.

Interestingly neither blkid nor dd would trigger this. Likely because they first check the size which is still 0 at that time. Only mount seems to go off and try to read superblock info regardless.
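
The size check mentioned above can be sketched in shell. This is a hypothetical guard, not the kernel fix, and not a claim about how blkid or dd are implemented; it only illustrates why a caller that looks at the device size first never sends I/O to a device that has no table yet. Both function names are invented for this example; "blockdev --getsize64" is the util-linux command that prints a block device's size in bytes.

```shell
# Hypothetical helper: succeed only if SIZE (bytes) is a positive integer.
# Per the comment above, a dm device without a loaded table still has size 0.
size_allows_io() {
    case "$1" in
        ''|*[!0-9]*) return 1 ;;   # empty or not a plain decimal number
    esac
    [ "$1" -gt 0 ]
}

# Hypothetical wrapper: mount DEV on DIR only if DEV reports a non-zero size.
guarded_mount() {
    size=$(blockdev --getsize64 "$1") || return 1
    if ! size_allows_io "$size"; then
        echo "$1: size is 0, table probably not loaded; not mounting" >&2
        return 1
    fi
    mount "$1" "$2"
}
```

On an affected kernel, calling guarded_mount on the device from the test case would be expected to refuse the mount instead of reaching the block layer, assuming the size is indeed still 0 at that point.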

Stefan Bader (smb) on 2020-01-22
Changed in linux (Ubuntu):
status: Incomplete → Triaged
importance: Undecided → High
Tyler Hicks (tyhicks) on 2020-01-22
description: updated
description: updated
Tyler Hicks (tyhicks) wrote :
Changed in linux (Ubuntu):
assignee: Andrea Righi (arighi) → Stéphane Graber (stgraber)
assignee: Stéphane Graber (stgraber) → Stefan Bader (smb)
Tyler Hicks (tyhicks) wrote :

Upstream submission:

  https://<email address hidden>/T/#t

tags: added: verification-needed-focal
Stefan Bader (smb) wrote :

Upstream fixed this in device-mapper with:

Author: Mike Snitzer <email address hidden>
  dm: fix potential for q->make_request_fn NULL pointer

This is to be included in:

Xenial: Ubuntu-4.4.0-177.207 (committed)
Bionic: Ubuntu-4.15.0-92.93 (committed, not prepared yet)
Eoan: Ubuntu-5.3.0-43.35 (committed)
Focal: Ubuntu-5.4.0-15.18 (released, revert of SAUCE committed)

Changed in linux (Ubuntu Xenial):
status: New → Fix Committed
Changed in linux (Ubuntu Bionic):
status: New → Fix Committed
Changed in linux (Ubuntu Disco):
status: New → Fix Committed
Changed in linux (Ubuntu Xenial):
importance: Undecided → High
Changed in linux (Ubuntu Bionic):
importance: Undecided → High
Changed in linux (Ubuntu Disco):
importance: Undecided → High
Changed in linux (Ubuntu Eoan):
status: New → Fix Committed
Changed in linux (Ubuntu):
status: Triaged → Fix Released
Changed in linux (Ubuntu Eoan):
importance: Undecided → High

This bug is awaiting verification that the kernel in -proposed solves the problem. Please test the kernel and update this bug with the results. If the problem is solved, change the tag 'verification-needed-xenial' to 'verification-done-xenial'. If the problem still exists, change the tag 'verification-needed-xenial' to 'verification-failed-xenial'.

If verification is not done by 5 working days from today, this fix will be dropped from the source code, and this bug will be closed.

See https://wiki.ubuntu.com/Testing/EnableProposed for documentation how to enable and use -proposed. Thank you!

tags: added: verification-needed-xenial

This bug is awaiting verification that the kernel in -proposed solves the problem. Please test the kernel and update this bug with the results. If the problem is solved, change the tag 'verification-needed-bionic' to 'verification-done-bionic'. If the problem still exists, change the tag 'verification-needed-bionic' to 'verification-failed-bionic'.

If verification is not done by 5 working days from today, this fix will be dropped from the source code, and this bug will be closed.

See https://wiki.ubuntu.com/Testing/EnableProposed for documentation how to enable and use -proposed. Thank you!

tags: added: verification-needed-bionic

This bug is awaiting verification that the kernel in -proposed solves the problem. Please test the kernel and update this bug with the results. If the problem is solved, change the tag 'verification-needed-eoan' to 'verification-done-eoan'. If the problem still exists, change the tag 'verification-needed-eoan' to 'verification-failed-eoan'.

If verification is not done by 5 working days from today, this fix will be dropped from the source code, and this bug will be closed.

See https://wiki.ubuntu.com/Testing/EnableProposed for documentation how to enable and use -proposed. Thank you!

tags: added: verification-needed-eoan