4.15.0-54.58-generic 4.15.18: oops/BUG on LUKS open

Bug #1835279 reported by Claudio Matsuoka
This bug affects 1 person
Affects          Status       Importance  Assigned to   Milestone
linux (Ubuntu)   Invalid      Undecided   Unassigned
Bionic           In Progress  Medium      Connor Kuehl

Bug Description

This is Linux version 4.15.0-54-generic (buildd@lgw01-amd64-014) (gcc version 7.4.0 (Ubuntu 7.4.0-1ubuntu1~18.04.1)) #58-Ubuntu SMP Mon Jun 24 10:55:24 UTC 2019 (Ubuntu 4.15.0-54.58-generic 4.15.18), from pc-kernel_240.snap

Version signature: 4.15.0-54.58-generic 4.15.18

Issue is non-deterministic, and happens in roughly 20% of the attempts.

Running on qemu-kvm, with the following command line:
kvm \
  -bios /usr/share/ovmf/OVMF.fd \
  -smp 2 -m 512 -netdev user,id=mynet0,hostfwd=tcp::8022-:22,hostfwd=tcp::8090-:80 \
  -device virtio-net-pci,netdev=mynet0 \
  -drive file=pc.img,format=raw

Commands that caused the problem:
cryptsetup -q --type luks2 --key-file <keyfile> luksFormat /dev/sda4
LD_PRELOAD=/lib/no-udev.so cryptsetup --type luks2 --key-file <keyfile> open /dev/sda4 crypt-data

Notes:
- See https://bugs.launchpad.net/ubuntu/+source/cryptsetup/+bug/1589083 for more information on the no-udev workaround.
- The commands are scripted. I also tried adding a 200 ms and a 1 s delay before opening the device.

Tags: bionic cscc
Claudio Matsuoka (cmatsuoka) wrote :
description: updated
Ubuntu Kernel Bot (ubuntu-kernel-bot) wrote : Missing required logs.

This bug is missing log files that will aid in diagnosing the problem. While running an Ubuntu kernel (not a mainline or third-party kernel) please enter the following command in a terminal window:

apport-collect 1835279

and then change the status of the bug to 'Confirmed'.

If, due to the nature of the issue you have encountered, you are unable to run this command, please add a comment stating that fact and change the bug status to 'Confirmed'.

This change has been made by an automated script, maintained by the Ubuntu Kernel Team.

Changed in linux (Ubuntu):
status: New → Incomplete
tags: added: bionic
Claudio Matsuoka (cmatsuoka) wrote :

This problem happens on Ubuntu Core (Core 20 development branch based on Core 18) and apport is not available.

Changed in linux (Ubuntu):
status: Incomplete → Confirmed
Claudio Matsuoka (cmatsuoka) wrote :

Further testing suggests that the problem is caused by the kernel not having enough time to settle after partx -u exits (race between acknowledging the new block device and actually having it available?). Adding a 1s pause after the partition table is re-read seems to prevent this crash.

Connor Kuehl (connork)
Changed in linux (Ubuntu Bionic):
importance: Undecided → Medium
assignee: nobody → Connor Kuehl (connork)
Changed in linux (Ubuntu):
status: Confirmed → Invalid
Changed in linux (Ubuntu Bionic):
status: New → In Progress
Terry Rudd (terrykrudd)
Changed in linux (Ubuntu):
importance: Undecided → Medium
assignee: nobody → Connor Kuehl (connork)
Andrea Righi (arighi) wrote :

It would be interesting to try this fix:
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=37f9579f4c31a6d698dbf3016d7bf132f9288d30

I've uploaded a test kernel (based on 4.15.0-54 + the fix mentioned above):
https://kernel.ubuntu.com/~arighi/LP-1835279/

Claudio, it would be great if you could do a test with this kernel. Thanks in advance!

Claudio Matsuoka (cmatsuoka) wrote :

Thanks Andrea. The new kernel seems to address the problem in such a way that the block device made available is still not immediately usable, but the kernel crash is prevented (at least in the test runs so far). Now errors can be handled from userspace to retry the operation in case of failure.
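
Handling the failure from userspace could look roughly like the sketch below: a bounded retry wrapper around the open command. This is only an illustration, assuming that a failed open returns a non-zero exit status; `retry` is a hypothetical helper, and the attempt count and delay are arbitrary.

```shell
#!/bin/sh
# retry MAX_ATTEMPTS DELAY_SECONDS COMMAND [ARGS...]
# Runs COMMAND until it exits 0 or MAX_ATTEMPTS is reached.
# Returns 0 on success, otherwise the exit status of the last failed attempt.
# Hypothetical helper, not part of cryptsetup or any tool named in this report.
retry() {
    max_attempts=$1
    delay=$2
    shift 2
    attempt=1
    while :; do
        "$@" && return 0
        status=$?
        if [ "$attempt" -ge "$max_attempts" ]; then
            echo "retry: giving up after $attempt attempts: $*" >&2
            return "$status"
        fi
        echo "retry: attempt $attempt failed (status $status), retrying in ${delay}s" >&2
        attempt=$((attempt + 1))
        sleep "$delay"
    done
}

# Example (KEYFILE is a placeholder for the key file path):
#   retry 5 1 cryptsetup --type luks2 --key-file "$KEYFILE" open /dev/sda4 crypt-data
```

A bounded loop keeps a genuinely broken device from hanging the script forever, while still covering the window in which the block device is visible but not yet usable.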

Claudio Matsuoka (cmatsuoka) wrote :

We're now opening the LUKS device using the master key directly, and the crash is happening again in both the original 4.15.0-54-generic (buildd@lgw01-amd64-014) kernel and the patched 4.15.0-54-generic (arighi@kathleen) kernel. The backtrace is very similar, but unlike the previous scenario, it happens even if delays as large as 5000 ms are placed between re-reading the partition table and running the commands.

The LUKS device is created and accessed using the following commands:

# cryptsetup -q luksFormat --type luks2 --pbkdf-memory 10000 --master-key-file <keyfile> <device>

# LD_PRELOAD=/lib/no-udev.so cryptsetup open --master-key-file <keyfile> <device> <name>

Claudio Matsuoka (cmatsuoka) wrote :

The udev workaround preloading is unrelated and can be removed; crashes also happen without it.

Andrea Righi (arighi) wrote :

I've found other upstream fixes that might be interesting to test:
https://git.launchpad.net/~arighi/+git/bionic-linux/log/?h=lp-1835279

I've uploaded a new test kernel here (still based on 4.15.0-54 + the backported fixes): https://kernel.ubuntu.com/~arighi/LP-1835279/

It'd be great if you could repeat your test and see if the NULL pointer dereference is still happening. Thanks!

Claudio Matsuoka (cmatsuoka) wrote :

Thanks. I tested the proposed kernel and it does not prevent the crash, which still occurs under the same circumstances when cryptsetup open runs. It's possible, however, that the patch decreases the likelihood of a crash. In a very unscientific test, the latest patched kernel crashed in 3 out of 10 attempts, while the stock unpatched kernel crashed in 6 out of 10 attempts.
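
For a rough sense of how much weight 3 out of 10 versus 6 out of 10 can carry, a two-sided Fisher exact test on the 2x2 table of crashed and non-crashed runs can be sketched in shell with awk. This assumes the runs are independent; `fisher_exact` is a hypothetical helper name, not an existing tool.

```shell
#!/bin/sh
# fisher_exact A B C D
# Two-sided Fisher exact test for the 2x2 table
#               crashed   did not crash
#   kernel 1       A            B
#   kernel 2       C            D
# Prints the p-value with three decimals. Hypothetical helper for illustration.
fisher_exact() {
    awk -v a="$1" -v b="$2" -v c="$3" -v d="$4" '
    # Natural log of the binomial coefficient C(n, k), as a sum of logs.
    function lchoose(n, k,    i, s) {
        s = 0
        for (i = 1; i <= k; i++) s += log(n - k + i) - log(i)
        return s
    }
    # Hypergeometric probability of x crashes in row 1, margins fixed.
    function prob(x) {
        return exp(lchoose(col1, x) + lchoose(n - col1, row1 - x) - lchoose(n, row1))
    }
    BEGIN {
        row1 = a + b; col1 = a + c; n = a + b + c + d
        lo = row1 - (n - col1); if (lo < 0) lo = 0
        hi = (row1 < col1) ? row1 : col1
        observed = prob(a)
        p = 0
        # Sum every table at least as unlikely as the observed one.
        for (x = lo; x <= hi; x++)
            if (prob(x) <= observed * (1 + 1e-9)) p += prob(x)
        printf "%.3f\n", p
    }'
}
```

`fisher_exact 3 7 6 4` prints 0.370, so with only 10 runs per kernel the difference between the two crash rates is well within what chance alone could produce.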

To mitigate the issue, with both the original and patched kernels, I'm manually loading the dm-crypt module before running cryptsetup luksFormat. Note that manually loading the module between cryptsetup luksFormat and cryptsetup open does not prevent the crash.
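
The ordering of the mitigation can be sketched as follows. This is an illustration only: `setup_luks` and the `RUN` dry-run hook are hypothetical names, the arguments are placeholders, and the cryptsetup options are the ones from the bug description.

```shell
#!/bin/sh
# setup_luks DEVICE KEYFILE NAME
# Mitigation order: load dm-crypt first, then format, then open.
# RUN is a hypothetical hook: set RUN=echo to print the commands instead of
# executing them (the real commands need root and destroy data on DEVICE).
setup_luks() {
    device=$1
    keyfile=$2
    name=$3
    # Load the module before luksFormat; loading it between luksFormat and
    # open was reported not to prevent the crash.
    ${RUN:-} modprobe dm-crypt || return 1
    ${RUN:-} cryptsetup -q --type luks2 --key-file "$keyfile" luksFormat "$device" || return 1
    ${RUN:-} cryptsetup --type luks2 --key-file "$keyfile" open "$device" "$name"
}

# Dry run:
#   RUN=echo; setup_luks /dev/sda4 /path/to/keyfile crypt-data
```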

Brad Figg (brad-figg)
tags: added: cscc
Connor Kuehl (connork)
Changed in linux (Ubuntu):
assignee: Connor Kuehl (connork) → nobody
importance: Medium → Undecided
Connor Kuehl (connork) wrote :

Hey Claudio,

Did you experience this only after upgrading from a previous kernel? If so, what was the kernel that you upgraded from?

I've been trying to reproduce this issue locally in a VM but haven't had any luck so far. So, I've been working from the kernel Oops in comment #1 and I am pretty sure I see where the NULL pointer is coming from, but I am not sure yet where it's being set to NULL. Since this smells a little bit like a race condition (unconfirmed), I'm trying to find ways to narrow down the search radius.

I have a couple of options in mind for moving forward:

1. We can have you try out a few mainline kernels. If you experience the bug in one of the kernels but not another, we can "bisect" between mainline kernels to zero in on the culpable patch. I would be very interested to see if you experience the same regression on the latest mainline kernel, as that may indicate that the regression also exists in mainline.

2. I can continue reading through the code and prepare a debug kernel that will hopefully give us more information when the bug is triggered. My goal behind this would be to narrow down precisely where/how that NULL pointer is being set.

Let me know if either of those options sounds feasible to you.
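
Because the crash is intermittent, a kernel can only be marked "good" during a bisect after several crash-free runs in a row. A rough sketch of that arithmetic, assuming independent runs with a constant crash rate; `clean_runs_needed` is a hypothetical helper name.

```shell
#!/bin/sh
# clean_runs_needed CRASH_RATE ALPHA
# Smallest n with (1 - CRASH_RATE)^n <= ALPHA, i.e. the number of consecutive
# crash-free runs after which a still-buggy kernel would be mislabeled "good"
# with probability at most ALPHA. Hypothetical helper for illustration.
clean_runs_needed() {
    awk -v p="$1" -v alpha="$2" 'BEGIN {
        n = log(alpha) / log(1 - p)
        c = int(n)
        if (c < n) c++      # round up to a whole number of runs
        print c
    }'
}
```

With the roughly 20% crash rate from the bug description, `clean_runs_needed 0.2 0.05` prints 14; at a 60% crash rate, `clean_runs_needed 0.6 0.05` prints 4.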

Claudio Matsuoka (cmatsuoka) wrote :

Thanks Connor. My test configuration is a custom core system using the official core18 kernel snap on a KVM-based virtual machine. The kernel snap is installed from channel 18 (snap download --channel 18 pc-kernel) and my current revision is 240. That said, it's easy to test new kernels as long as I can convert them to a snap package. Foundations is currently preparing a kernel snap based on eoan, so I can check with this newer kernel whether the problems persist (it could be a first check for a bisect), but I could also boot an instrumented kernel that would give us better insight into what could be happening.

So, to summarize: we can use either approach, but my feeling is that the debug kernel and the eoan kernel could be a good start to help us to gather more data points. Ideally I could start testing that next week after returning from Toronto. Would that work for you?

Connor Kuehl (connork) wrote :

That sounds awesome, Claudio, thank you! We'll start with Eoan and I will work on getting a debug kernel ready in the meantime.

Claudio Matsuoka (cmatsuoka) wrote :

I tested the eoan kernel 5.2.0-10-generic and the suspected problems in the block layer didn't happen anymore after removing the workarounds (artificial delays and module insertion) needed by the bionic kernel 4.15.0-54-generic used in core18. 5.3.0-050300rc4-generic also works without any immediately visible problem.
