linux-crashdump is not capable of mounting zfs root pools

Bug #1913639 reported by Anders Aagaard
This bug affects 1 person
Affects: linux (Ubuntu)
Status: Confirmed
Importance: Undecided
Assigned to: Unassigned
Milestone: (not set)

Bug Description

I could not get any crash logs after using this package.

I tried using it and triggering a crash from a terminal. The alternate kernel boots successfully, but the initrd for linux-crashdump is not capable of loading my rpool.

running: linux-crashdump:amd64 5.4.0.64.67

ProblemType: Bug
DistroRelease: Ubuntu 20.04
Package: linux-crashdump:amd64 (not installed)
ProcVersionSignature: Ubuntu 5.4.0-65.73-generic 5.4.78
Uname: Linux 5.4.0-65-generic x86_64
NonfreeKernelModules: zfs zunicode zavl icp zcommon znvpair
ApportVersion: 2.20.11-0ubuntu27.14
Architecture: amd64
CasperMD5CheckResult: skip
CurrentDesktop: KDE
Date: Thu Jan 28 20:15:19 2021
InstallationDate: Installed on 2019-08-15 (532 days ago)
InstallationMedia: Ubuntu 18.04.3 LTS "Bionic Beaver" - Release amd64 (20190805)
IwConfig:
 eno1 no wireless extensions.

 lo no wireless extensions.
MachineType: MSI MS-7885
ProcFB: 0 amdgpudrmfb
ProcKernelCmdLine: BOOT_IMAGE=/BOOT/ubuntu_fnpezh@/vmlinuz-5.4.0-65-generic root=ZFS=rpool/ROOT/ubuntu_fnpezh ro usbcore.autosuspend=-1 quiet splash intel_iommu=on iommu=pt vfio-pci.ids=10de:1b81,10de:10f0,1b21:1242 video=efifb:off,vesafb=off,simplefb=off nouveau.modeset=0 rd.driver.blacklist=nouveau,nvidia,nvidia_uvm,nvidia_drm,nvidia_modeset crashkernel=2096M-:2096M vt.handoff=1
RelatedPackageVersions:
 linux-restricted-modules-5.4.0-65-generic N/A
 linux-backports-modules-5.4.0-65-generic N/A
 linux-firmware 1.187.8
RfKill:
 0: hci0: Bluetooth
  Soft blocked: no
  Hard blocked: no
SourcePackage: linux
UpgradeStatus: Upgraded to focal on 2020-04-22 (281 days ago)
dmi.bios.date: 06/15/2018
dmi.bios.vendor: American Megatrends Inc.
dmi.bios.version: 1.E0
dmi.board.asset.tag: Default string
dmi.board.name: X99A SLI PLUS(MS-7885)
dmi.board.vendor: MSI
dmi.board.version: 1.0
dmi.chassis.asset.tag: Default string
dmi.chassis.type: 3
dmi.chassis.vendor: MSI
dmi.chassis.version: 1.0
dmi.modalias: dmi:bvnAmericanMegatrendsInc.:bvr1.E0:bd06/15/2018:svnMSI:pnMS-7885:pvr1.0:rvnMSI:rnX99ASLIPLUS(MS-7885):rvr1.0:cvnMSI:ct3:cvr1.0:
dmi.product.family: Default string
dmi.product.name: MS-7885
dmi.product.sku: Default string
dmi.product.version: 1.0
dmi.sys.vendor: MSI
modified.conffile..etc.default.apport:
 # set this to 0 to disable apport, or to 1 to enable it
 # you can temporarily override this with
 # sudo service apport start force_start=1
 enabled=0
mtime.conffile..etc.default.apport: 2020-03-23T10:46:45.197962

Anders Aagaard (aagaande) wrote :
affects: ubuntu → linux (Ubuntu)
Ubuntu Kernel Bot (ubuntu-kernel-bot) wrote : Status changed to Confirmed

This change was made by a bot.

Changed in linux (Ubuntu):
status: New → Confirmed
Guilherme G. Piccoli (gpiccoli) wrote :

Hi Anders, thanks for your report! The issue may lie in the initramfs-tools package failing to recognize the need for ZFS when MODULES=dep is used. In the kdump initrd, we restrict the included modules to the needed subset, to make the initrd smaller and consume less memory from the crashkernel reserved range. This is achieved by setting the MODULES=dep directive in the config file for the kdump initrd.

I'd like to ask you to run some tests so we can verify this theory. Please run the following commands as the root user:

ls -lh /var/lib/kdump/initrd.img*

kdump-config unload
rm -f /var/lib/kdump/initrd.img*

sed -i 's/MODULES=dep/MODULES=most/g' /etc/kernel/postinst.d/kdump-tools

kdump-config load

ls -lh /var/lib/kdump/initrd.img*

The first command shows the kdump initrd images you currently have; the next commands delete them and recreate one for the current kernel, using "MODULES=most" to include a larger set of modules in the kdump initrd. The last command shows the size of the newly created initrd; please make sure it is bigger than the one you deleted.
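
As a rough sketch of how to check the theory directly rather than by size alone, the kdump initrd listing can be searched for the ZFS module. The helper name `listing_has_module` is made up for this illustration and is not part of kdump-tools; `lsinitramfs` comes with initramfs-tools, and the initrd path is the one used in the commands above.

```shell
# Hypothetical helper (not part of kdump-tools): reads an lsinitramfs
# listing on stdin and succeeds if it contains the named kernel module.
# Compressed variants such as zfs.ko.zst are accepted as well.
listing_has_module() {
    grep -Eq "/$1\.ko(\.[a-z0-9]+)?\$"
}

# Usage on the affected machine (as root):
#   lsinitramfs /var/lib/kdump/initrd.img-$(uname -r) | listing_has_module zfs \
#       && echo "zfs.ko present" || echo "zfs.ko missing"
```

If the module is missing with MODULES=dep but present with MODULES=most, that would support the theory; if it is present in both cases, the cause is probably elsewhere.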

After that, try a kdump and let's see if it works!
Cheers,

Guilherme

Anders Aagaard (aagaande) wrote :

That made my kdump initrd grow from 46 MB to 86 MB. However, it still fails.

An additional complication is that kdump does something to my USB keyboard, so I can't type anything when I get to the initramfs prompt, which makes it hard to debug what's going on.

My zpool/rpool setup looks like this:
```
  pool: bpool
 state: ONLINE
  scan: scrub repaired 0B in 0 days 00:00:04 with 0 errors on Sun Jan 10 00:24:05 2021
config:

        NAME STATE READ WRITE CKSUM
        bpool ONLINE 0 0 0
          sdc1 ONLINE 0 0 0

errors: No known data errors

  pool: rpool
 state: ONLINE
  scan: scrub repaired 0B in 0 days 00:18:34 with 0 errors on Sun Jan 10 00:42:35 2021
config:

        NAME STATE READ WRITE CKSUM
        rpool ONLINE 0 0 0
          sdc2 ONLINE 0 0 0

```

sdc is just a standard SATA drive, so it should work out of the box with any fairly recent kernel. I haven't needed any non-standard changes for that drive across many kernel versions.

Can I easily modify the scripts built into this initrd to print some debug info to stdout?

Guilherme G. Piccoli (gpiccoli) wrote :

Hi Anders, thanks for your testing. You can send all the initrd scripts' debug output to the console by adding a kernel parameter such as "debug=vc". But I don't suggest doing that here, since you don't seem able to capture the full console log in plain-text form. Pictures/snapshots might help to spot some issues, but they are very limited.

So, instead of dumping the output to the console, I suggest two more tests:

a) Add the following parameter to the kernel command line: "debug". With that, boot normally and save the file "/run/initramfs/initramfs.debug" so you can attach it here. It contains the full output of the initrd scripts executed. After that, try the kdump. You need to be sure it inherited the "debug" flag (check the kdump command line in the "kdump-config show" output to validate it's there). Kdump will fail, and you can try to copy the file out over SSH from the (initramfs) failure shell (or even dump it there using the "cat" command and take some pictures, especially of the final portions). You may also try disconnecting and reconnecting the keyboard, or using another USB port, to see if you can get it working in the failing (initramfs) shell.

b) Please execute the following two commands and attach the two resulting files here:

lsinitramfs /boot/initrd.img-$(uname -r) > regular_initrd.out
lsinitramfs /var/lib/kdump/initrd.img-$(uname -r) > kdump_initrd.out
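
Once both listings exist, a quick way to spot ZFS-related differences between them is to compare the sorted files. This is only a sketch: the helper name `zfs_only_in_first` is invented here, and it assumes the two `.out` files produced by the commands above.

```shell
# Hypothetical helper: print entries mentioning "zfs" that appear in the
# first lsinitramfs listing but not in the second. Returns non-zero if
# there are no such entries.
zfs_only_in_first() {
    a=$(mktemp) && b=$(mktemp) || return 1
    sort -u "$1" > "$a"
    sort -u "$2" > "$b"
    # comm -23 keeps only the lines unique to the first sorted input.
    comm -23 "$a" "$b" | grep -i 'zfs'
    rc=$?
    rm -f "$a" "$b"
    return "$rc"
}

# Usage:
#   zfs_only_in_first regular_initrd.out kdump_initrd.out
#   zfs_only_in_first kdump_initrd.out regular_initrd.out
```

Running it in both directions shows which ZFS files each initrd has that the other lacks.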

Thanks in advance!

Guilherme G. Piccoli (gpiccoli) wrote :

In fact, Anders, I just thought of a potential alternative issue, so would you mind running a third test?

(c) After running tests (a) and (b) from the last comment, please boot the machine and edit the file "/etc/default/grub.d/kdump-tools.cfg", changing the crashkernel parameter there to be exactly "crashkernel=384M". After that, please execute "sudo update-grub", reboot the machine, and finally try a kdump again.

With that test, we can rule out a memory issue in the kdump environment, since ZFS might consume more memory than the kdump boot has reserved.
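
Tests (a) and (c) both come down to checking what is actually on a kernel command line. A minimal sketch of such a check follows; the helper name `cmdline_param` is hypothetical, and it assumes parameters are separated by whitespace (quoted values containing spaces are not handled).

```shell
# Hypothetical helper: look up a parameter on a kernel command line read
# from stdin. Prints the value for name=value parameters, prints nothing
# for bare flags such as "debug", and returns 1 if the parameter is absent.
cmdline_param() {
    awk -v name="$1" '{
        for (i = 1; i <= NF; i++) {
            if ($i == name) {
                found = 1
            } else if (index($i, name "=") == 1) {
                print substr($i, length(name) + 2)
                found = 1
            }
        }
    } END { exit !found }'
}

# Usage for the running kernel:
#   cmdline_param debug < /proc/cmdline && echo "debug is set"
#   cmdline_param crashkernel < /proc/cmdline
```

After the reboot, `/sys/kernel/kexec_crash_size` reports the size of the reserved crash kernel region in bytes, which is another way to confirm the reservation took effect.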
Thanks,

Guilherme

Anders Aagaard (aagaande) wrote :

A : I can confirm that debug was there, and the output on the console is different. However, plugging in the keyboard again didn't work, regardless of which USB port I used, and I can't get a response over SSH either (or by pinging the machine, for that matter). Any ideas? I've attached a photo of what happens, which is surprisingly different: no errors, it just drops me at the initramfs prompt. As for getting USB working, there is no console output whatsoever when I replug the keyboard.

B : I've attached these.

C : I realize I had already been running with crashkernel=2096M-:2096M from previous debugging, so I don't think it's a memory issue?

Anders Aagaard (aagaande) wrote :

I also just spent $5 on a USB-to-PS/2 adapter; that should make it a bit easier to figure out what's going on.

Anders Aagaard (aagaande) wrote :

So I got the PS/2 adapter and... it looks like this has nothing to do with ZFS.

I have 4 SATA drives in this machine.
sda - ST2000DM001-1ER1 - zpool storage
sdb - ST2000DM008-2FR1 - zpool storage
sdc - Samsung SSD 860 - my main drive. zpool rpool and bpool.
sdd - Samsung SSD 850 - Windows - used with vfio + qemu. (This is also where my kernel command-line arguments come from, isolating one GPU and a USB controller. The IOMMU groups are clean and contain only the GPU and the USB controller, nothing else.)

When in the crashkernel, I can see only sda and sdb, which now map to the Samsung SSD 850 and one of the storage pool drives. So I get half of my storage pool and the Windows drive; the other two SATA drives are not there. I also see a lot of errors like this:
ata5.00: failed to IDENTIFY (I/O error, err_mask=0x100)
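
To see which ATA ports are affected once a text copy of the kernel log is available, the IDENTIFY failures can be tallied per device. This is a sketch only; the helper name `identify_failures` is invented here, and it matches just the message format quoted above.

```shell
# Hypothetical helper: read kernel log text on stdin and print, for each
# ATA device, how many "failed to IDENTIFY" errors were logged.
identify_failures() {
    grep -Eo 'ata[0-9]+\.[0-9]+: failed to IDENTIFY' |
        cut -d: -f1 | sort | uniq -c | awk '{ print $2, $1 }'
}

# Usage, against a saved log or the live kernel log:
#   identify_failures < dmesg.kdump
#   dmesg | identify_failures
```

Comparing the listed ports against the ports the four drives use in a normal boot would show whether the missing drives are the ones failing to identify.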

The motherboard is an MSI X99A SLI PLUS. I'm happy to close this issue at this point. If there's anything quick I can do to help debug this, I absolutely will. But I feel this has moved from the category of "zfs initrd script needs patching" into "weird motherboard bug", on a motherboard I'm planning to replace anyway.

Guilherme G. Piccoli (gpiccoli) wrote :

Hi Anders, I really appreciate your debug data, and thanks a bunch for buying the PS/2 adapter, that is very useful. So, it seems you are observing an issue with the SATA controller. Could you collect a dmesg from the kdump environment, in this minimal "(initramfs)" shell?

Also, thanks for the previous data. I found an interesting difference between the contents of the initrds: the regular one has a DKMS zfs module, whereas the kdump one has the packaged zfs modules. This doesn't seem to be what is causing you trouble, although it would be interesting to understand why...

Let me know if you can collect the dmesg in the failed kdump boot; that'd be really helpful, as it might turn out to be a kernel bug in the end.
And finally, disregard my suggestion (c). Thanks for pointing out that the amount of memory reserved is enough, good point.
Cheers,

Guilherme

Anders Aagaard (aagaande) wrote :

I attached a picture of the dmesg output in the previous update. Do you need more of that output?

Guilherme G. Piccoli (gpiccoli) wrote :

Yeah, I need the full output. A suggestion: you seem to have two devices (sda and sdb) at the kdump failure point, right? So you could try to mount one of them and save the dmesg there, with something like "dmesg > /mnt/dmesg.kdump". That'd be perfect!
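
Spelled out as a sketch, the suggestion could look like the following. The helper name `save_dmesg`, the `DRY_RUN` switch, and the example partition are all assumptions for illustration; the partition has to hold a filesystem that the kdump initrd is actually able to mount, which may not be the case for every visible disk.

```shell
# Hypothetical helper: save the kernel log onto a disk that is still
# visible from the failure shell.
#   $1: partition to mount (an assumption, e.g. /dev/sdb1)
#   $2: mount point (defaults to /mnt)
# With DRY_RUN set, the steps are printed instead of executed.
save_dmesg() {
    dev=$1
    mnt=${2:-/mnt}
    if [ -n "${DRY_RUN:-}" ]; then
        echo "mkdir -p $mnt"
        echo "mount $dev $mnt"
        echo "dmesg > $mnt/dmesg.kdump"
        echo "sync"
        echo "umount $mnt"
        return 0
    fi
    mkdir -p "$mnt" &&
        mount "$dev" "$mnt" &&
        dmesg > "$mnt/dmesg.kdump" &&
        sync &&
        umount "$mnt"
}

# Usage:
#   DRY_RUN=1 save_dmesg /dev/sdb1   # show the steps only
#   save_dmesg /dev/sdb1             # actually write /mnt/dmesg.kdump
```

The `sync` and `umount` at the end matter because the machine will most likely be reset straight afterwards.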

Thanks again

Anders Aagaard (aagaande) wrote :

Hi

Sorry about the lack of reply on this. I had some hardware issues with this setup and I've been waiting for the time to have a proper go at fixing it. Unfortunately I failed at that yesterday: this machine no longer boots, so it's hard for me to get the output.

Guilherme G. Piccoli (gpiccoli) wrote :

Oh, I am sorry to hear that. I wish you luck in fixing the HW problem. When that happens, let us know and we can continue debugging this if possible.
Cheers,

Guilherme
