Installations with multipath/LVM disks sometimes leaves the system in a non-bootable stage

Bug #1940687 reported by Frank Heimes
This bug affects 1 person
Affects                  Status        Importance  Assigned to                 Milestone
Ubuntu on IBM z Systems  Fix Released  Critical    Canonical Foundations Team
subiquity                Fix Released  Undecided   Unassigned

Bug Description

While doing some autoinstall tests on an IBM Z / s390x LPAR with FCP/SCSI storage using multipath (multipath is mandatory when using FCP/SCSI), I ran from time to time into a problem where the system does not come up after the post-install reboot and hangs in busybox with the message:

Gave up waiting for root file system device. Common problems:
- Boot args (cat /proc/cmdline)
- Check rootdelay= (did the system wait long enough?)
- Missing modules (cat /proc/modules; ls /dev)
ALERT! /dev/disk/by-id/dm-uuid-LVM-qf611Cct0k4CFKieysi1dCPG3DMo27GRHB7nJ8lOvIK1SvdIFoSIyDIqQG2QpOSk does not exist. Dropping to a shell!

BusyBox v1.30.1 (Ubuntu 1:1.30.1-4ubuntu6.3) built-in shell (ash)
Enter 'help' for a list of built-in commands.

(initramfs) ls -l /dev/disk/by-id/dm-uuid-LVM-qf611Cct0k4CFKieysi1dCPG3DMo27GRHB7nJ8lO*
ls: /dev/disk/by-id/dm-uuid-LVM-qf611Cct0k4CFKieysi1dCPG3DMo27GRHB7nJ8lO*: No such file or directory
(initramfs) cat /proc/cmdline
root=/dev/disk/by-id/dm-uuid-LVM-qf611Cct0k4CFKieysi1dCPG3DMo27GRHB7nJ8lOvIK1SvdIFoSIyDIqQG2QpOSk

(initramfs) ls -l /dev/disk/by-id/dm-uuid-LVM-qf611Cct0k4CFKieysi1dCPG3DMo27GRHB7nJ8lOvIK1Svd
ls: /dev/disk/by-id/dm-uuid-LVM-qf611Cct0k4CFKieysi1dCPG3DMo27GRHB7nJ8lOvIK1Svd: No such file or directory
(initramfs) ls -l /dev/disk/by-id/dm-uuid-LVM-qf611Cct0k4CFKieysi1dCPG3DMo27GRHB7nJ8lOvIK1Svd
ls: /dev/disk/by-id/dm-uuid-LVM-qf611Cct0k4CFKieysi1dCPG3DMo27GRHB7nJ8lOvIK1Svd: No such file or directory
(initramfs) ls -l /dev/disk/by-id/dm-uuid-LVM-*
ls: /dev/disk/by-id/dm-uuid-LVM-*: No such file or directory
(initramfs) ls -l /dev/disk/by-id/
lrwxrwxrwx 1 10 wwn-0x6005076306ffd6b60000000000002602 -> ../../dm-0
lrwxrwxrwx 1 10 dm-name-mpatha -> ../../dm-0
lrwxrwxrwx 1 10 dm-uuid-mpath-36005076306ffd6b60000000000002602 -> ../../dm-0
lrwxrwxrwx 1 10 scsi-36005076306ffd6b60000000000002602 -> ../../dm-0
lrwxrwxrwx 1 10 lvm-pv-uuid-BAnixI-pV6E-ntyb-IdEQ-PwPK-R93a-zflLEP -> ../../dm-2
lrwxrwxrwx 1 10 wwn-0x6005076306ffd6b60000000000002602-part2 -> ../../dm-2
lrwxrwxrwx 1 10 dm-uuid-part2-mpath-36005076306ffd6b60000000000002602 -> ../../dm-2
lrwxrwxrwx 1 10 dm-name-mpatha-part2 -> ../../dm-2
lrwxrwxrwx 1 10 scsi-36005076306ffd6b60000000000002602-part2 -> ../../dm-2
lrwxrwxrwx 1 10 wwn-0x6005076306ffd6b60000000000002602-part1 -> ../../dm-1
lrwxrwxrwx 1 10 dm-uuid-part1-mpath-36005076306ffd6b60000000000002602 -> ../../dm-1
lrwxrwxrwx 1 10 dm-name-mpatha-part1 -> ../../dm-1
lrwxrwxrwx 1 10 scsi-36005076306ffd6b60000000000002602-part1 -> ../../dm-1
lrwxrwxrwx 1 9 scsi-SIBM_2107900_75DXP712602 -> ../../sdb
lrwxrwxrwx 1 10 scsi-SIBM_2107900_75DXP712602-part2 -> ../../sdb2
lrwxrwxrwx 1 10 scsi-SIBM_2107900_75DXP712602-part1 -> ../../sdc1
(initramfs) dmsetup ls --tree -o nodevice
mpatha-part2
 `-mpatha
mpatha-part1
 `-mpatha
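
The manual check done in the shell session above can be sketched in Python: take the root= value from the kernel command line and test whether the matching link name shows up in the /dev/disk/by-id/ listing. This is only an illustrative sketch; the function names root_device and root_link_present are hypothetical, and the list of link names is an abbreviated copy of the listing above.

```python
from typing import List, Optional

BY_ID = "/dev/disk/by-id/"


def root_device(cmdline: str) -> Optional[str]:
    """Return the value of the root= parameter, or None if it is absent."""
    for token in cmdline.split():
        if token.startswith("root="):
            return token[len("root="):]
    return None


def root_link_present(cmdline: str, by_id_names: List[str]) -> bool:
    """True if the by-id link named in root= is among the given link names."""
    root = root_device(cmdline)
    if root is None or not root.startswith(BY_ID):
        return False
    return root[len(BY_ID):] in by_id_names


# Values copied from the failing boot above (listing abbreviated).
cmdline = (
    "root=/dev/disk/by-id/dm-uuid-LVM-"
    "qf611Cct0k4CFKieysi1dCPG3DMo27GRHB7nJ8lOvIK1SvdIFoSIyDIqQG2QpOSk"
)
names = [
    "dm-name-mpatha",
    "dm-name-mpatha-part1",
    "dm-name-mpatha-part2",
    "lvm-pv-uuid-BAnixI-pV6E-ntyb-IdEQ-PwPK-R93a-zflLEP",
]

print(root_link_present(cmdline, names))
```

Note that the lvm-pv-uuid link is present in the listing while no dm-uuid-LVM-* link exists, which matches what the log shows: the physical volume is visible, but the logical volume device is missing.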

It happens in about one out of 4 cases:

1) failed with missing
   ALERT! /dev/disk/by-id/dm-uuid-LVM-qf611C...
2) fine
3) fine
4) fine
5) failed with missing
   ALERT! /dev/disk/by-id/dm-uuid-LVM-k4CFKi...
6) fine
7) fine
8) ALERT! /dev/disk/by-id/dm-uuid-LVM-NdJQQ6vcagcQESXKnyUI8fHg8KCfeqQw912SyHTJ00LKZEAEHwzpaaoQla3k4NuF does not exist
9) fine
10) /dev/disk/by-id/dm-uuid-LVM-Vnda6LhYVr8Yj4OjePUOhF2fv6lbva7Xg2e2v30TJ28M1mwq8nW0i5DewX14B1cs
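
The tally above works out to 4 failures in 10 runs, assuming run 10 (which lists only the device path) was also a failure. The following is a small sketch of that arithmetic, together with the chance of seeing a streak of clean boots purely by luck if the failure rate stayed the same, which is worth keeping in mind when judging whether a later image really fixes the problem. The function names are hypothetical, and the streak estimate assumes the boots are independent of each other.

```python
# Outcomes of the ten autoinstall runs listed above (True = failed to boot).
# Run 10 is ASSUMED to be a failure; the report only shows the device path.
runs = [True, False, False, False, True, False, False, True, False, True]


def failure_rate(outcomes):
    """Fraction of runs that ended in the initramfs shell."""
    return sum(outcomes) / len(outcomes)


def clean_streak_probability(rate, boots):
    """Chance that `boots` consecutive boots all succeed at a given failure rate."""
    return (1.0 - rate) ** boots


rate = failure_rate(runs)
print(f"observed failure rate: {rate:.0%}")
print(f"chance of 10 clean boots by luck: {clean_streak_probability(rate, 10):.2%}")
```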

(I always did exactly the same, just kicking off the autoinstall, on the exact same system with the same disks etc., so it 'smells' like a race condition...)

I used the daily "19.1" 20.04.3 RC ISO, as referenced in the QA Tracker:
http://cdimage.ubuntu.com/ubuntu-server/focal/daily-live/20210819.1/focal-live-server-s390x.iso

Frank Heimes (fheimes) wrote :

Forgot to add that I didn't face this issue on normal/interactive FCP installations (neither on z/VM nor on LPAR); so far it has only occurred when doing autoinstalls.
(And autoinstall using just DASDs seems to work fine.)

Changed in ubuntu-z-systems:
assignee: nobody → Canonical Foundations Team (canonical-foundations)
Ubuntu QA Website (ubuntuqa) wrote :

This bug has been reported on the Ubuntu ISO testing tracker.

A list of all reports related to this bug can be found here:
http://iso.qa.ubuntu.com/qatracker/reports/bugs/1940687

tags: added: iso-testing
Frank Heimes (fheimes)
description: updated
description: updated
Łukasz Zemczak (sil2100) wrote :

Hey Frank! Thanks for reporting this. How frequently does this issue actually occur?

Michael Hudson-Doyle (mwhudson) wrote :

Hmm. When this happens, does it happen on every boot of the installed system? You're probably right that it's a race, but I can't tell if it's a race during the install or during boot. It's a bit hard to imagine why this might only happen with an autoinstall.

As you know, we changed some stuff around LVM filters recently to get encrypted LVM on multipath working, so it's possible we broke something here. Is this a regression from previous behaviour?

Is there any way you can set things up so I can poke around in a system that's failed to boot like this?

Frank Heimes (fheimes) wrote :

Well, I can't say for sure that it really only happens with autoinstall, since it happens so sporadically. But during the few tests I did, I only noticed this issue with autoinstalls, hence I then tried even more autoinstalls to get a better idea of what's happening...

Once in busybox I can do a reboot (-f) and the system may come up properly now and then.
And when the system came up properly and I 'reboot' again, the system may end up again in busybox.
(that's here a 50/50 chance)
Hmm :-/

So it is very sporadic, and with that it may be systemd/udev or even kernel related.

I'm trying to find more details from within busybox (which is a bit challenging, since I need to use all the low-level commands like 'lvm' ...)

Frank Heimes (fheimes) wrote :

I tried to gather some more details, see attachment, but don't know exactly how
/dev/disk/by-id/dm-uuid-LVM-Vnda6LhYVr8Yj4OjePUOhF2fv6lbva7Xg2e2v30TJ28M1mwq8nW0i5DewX14B1cs
(and /dev/disk/by-id/dm-uuid-part1-mpath-36005076306ffd6b60000000000002602 )
are assembled.
The lower levels seem to be fine.
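
On the question of how the dm-uuid-LVM-... name is put together: as far as I know, the device-mapper UUID of a plain logical volume is the prefix "LVM-" followed by the 32-character volume group UUID and the 32-character logical volume UUID, both written without hyphens, and udev then exposes it as /dev/disk/by-id/dm-uuid-<UUID>. The sketch below splits such a UUID back into the two LVM UUIDs in the usual 6-4-4-4-4-4-6 notation (the same notation as the lvm-pv-uuid link in the description). The function names are hypothetical, and UUIDs with a suffix (as used for some internal sub-volumes) are deliberately rejected.

```python
from typing import Tuple

LVM_PREFIX = "LVM-"
GROUPS = (6, 4, 4, 4, 4, 4, 6)  # LVM's usual hyphenated UUID layout


def format_lvm_uuid(raw: str) -> str:
    """Insert hyphens into a 32-character LVM UUID."""
    if len(raw) != sum(GROUPS):
        raise ValueError("expected 32 characters, got %d" % len(raw))
    parts, pos = [], 0
    for size in GROUPS:
        parts.append(raw[pos:pos + size])
        pos += size
    return "-".join(parts)


def split_lvm_dm_uuid(dm_uuid: str) -> Tuple[str, str]:
    """Split 'LVM-<vg><lv>' into (VG UUID, LV UUID), both hyphenated."""
    if not dm_uuid.startswith(LVM_PREFIX):
        raise ValueError("not an LVM device-mapper UUID")
    body = dm_uuid[len(LVM_PREFIX):]
    if len(body) != 64:
        raise ValueError("only plain LV UUIDs (64 characters) are handled")
    return format_lvm_uuid(body[:32]), format_lvm_uuid(body[32:])


vg_uuid, lv_uuid = split_lvm_dm_uuid(
    "LVM-Vnda6LhYVr8Yj4OjePUOhF2fv6lbva7Xg2e2v30TJ28M1mwq8nW0i5DewX14B1cs"
)
print("VG UUID:", vg_uuid)
print("LV UUID:", lv_uuid)
```

The resulting values could then be compared against the UUIDs that the 'lvm' tool reports for the volume group and logical volume from within busybox.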

@mwhudson, yes you should be able to access that system.
It is properly up right now, hence you may just ssh into it.

If I reboot it a couple of times I'm sure I'll get it into busybox again. In that case you need to connect to its console via the HMC (task 'Operating Systems Messages'), and I think you should have HMC access...

Let's chat about details via MM ...

Frank Heimes (fheimes) wrote :

After some more investigations I think this is a more severe problem than I initially thought - and btw. not limited to autoinstall, but affects all (FCP/)multipath installations - it just happened after a reboot of a z/VM guest that uses (FCP/)multipath disk storage, too.

I wanted to add some kernel parameters to enable systemd debugging and direct logging to the console (systemd.log_level=debug systemd.log_target=console),
but I wasn't able because I couldn't re-write the bootloader, due to the fact that it couldn't find any kernel or initrd, because /boot is empty (not properly mounted).
(Right now I even do not understand how the system could come up that way...)
And this is not only after an autoinstall with (FCP/)multipath, but also after a normal interactive installation (that I thought were fine, which is not the case due to the empty /boot).

The following output shows some system states when the system came up and ran into BusyBox (shown first, before each '---') compared to a situation where it pretended to come up /properly/ (shown after each '---'):

ls -l /dev/dm*
brw------- 1 253, 2 /dev/dm-2
brw------- 1 253, 1 /dev/dm-1
brw------- 1 253, 0 /dev/dm-0
---
$ ls -l /dev/dm*
brw-rw---- 1 root disk 253, 0 Aug 23 10:40 /dev/dm-0

dmsetup ls --tree
mpatha-part2 (253:2)
 `-mpatha (253:0)
    |- (8:16)
    |- (8:0)
    |- (8:48)
    `- (8:32)
mpatha-part1 (253:1)
 `-mpatha (253:0)
    |- (8:16)
    |- (8:0)
    |- (8:48)
    `- (8:32)
---
$ sudo dmsetup ls --tree
ubuntu--vg-ubuntu--lv (253:0)
 └─ (8:2)
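
The (major:minor) pairs in the two trees above can be decoded to device names, which makes the difference easier to see. Block major 8 is the SCSI disk driver, with 16 minors per disk, so minor // 16 selects the disk and minor % 16 the partition. The sketch below applies that rule; the function name sd_name is hypothetical, and it only covers major 8 (the first 16 SCSI disks).

```python
import string


def sd_name(major: int, minor: int) -> str:
    """Map a major-8 block device number to its sdX[N] name."""
    if major != 8:
        raise ValueError("only SCSI disk major 8 is handled in this sketch")
    disk, partition = divmod(minor, 16)
    name = "sd" + string.ascii_lowercase[disk]
    return name + (str(partition) if partition else "")


# Paths below mpatha in the BusyBox case, then the device below the LV
# in the boot that pretended to come up properly.
for pair in [(8, 16), (8, 0), (8, 48), (8, 32), (8, 2)]:
    print(pair, "->", sd_name(*pair))
```

Read this way, mpatha sits on four whole-disk paths in the BusyBox case, while in the other case ubuntu--vg-ubuntu--lv appears to sit directly on a partition of a single path rather than on mpatha-part2.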

dmsetup ls
mpatha (253:0)
mpatha-part2 (253:2)
mpatha-part1 (253:1)
---
$ sudo dmsetup ls
ubuntu--vg-ubuntu--lv (253:0)

dmsetup status
mpatha: 0 134217728 multipath 2 0 1 0 1 1 A 0 4 2 8:32 A 0 0 1 8:48 A 0 0 1 8:0 A 0 0 1 8:16 A 0 0 1
mpatha-part2: 0 132116480 linear
mpatha-part1: 0 2097152 linear
---
$ sudo dmsetup status
ubuntu--vg-ubuntu--lv: 0 66060288 linear

dmsetup info
Name: mpatha
State: ACTIVE
Read Ahead: 256
Tables present: LIVE
Open count: 2
Event number: 0
Major, minor: 253, 0
Number of targets: 1
UUID: mpath-36005076306ffd6b60000000000002602
Name: mpatha-part2
State: ACTIVE
Read Ahead: 256
Tables present: LIVE
Open count: 0
Event number: 0
Major, minor: 253, 2
Number of targets: 1
UUID: part2-mpath-36005076306ffd6b60000000000002602
Name: mpatha-part1
State: ACTIVE
Read Ahead: 256
Tables present: LIVE
Open count: 0
Event number: 0
Major, minor: 253, 1
Number of targets: 1
UUID: part1-mpath-36005076306ffd6b60000000000002602
---
$ sudo dmsetup info
Name: ubuntu--vg-ubuntu--lv
State: ACTIVE
Read Ahead: 256
Tables present: LIVE
Open count: 1
Event number: 0
Major, minor: 253, 0
Number of targets: 1
UUID: LVM-Vnda6LhYVr8Yj4OjePUOhF2fv6lbva7Xg2e2v30TJ28M1mwq8nW0i5DewX14B1cs

ls -l /dev/disk/by-id/*
lrwxrwxrwx 1 10 /dev/disk/by-id/wwn-0x6005076306ffd6b60000000000002602-part2 -> ../../dm-2
lrwxrwxrwx 1 10 /dev/disk/by-id/wwn-0x6005076306ffd6b60000000000002602-part1 -> ../../dm-1
lrwxrwxrwx 1 10 /dev/disk/by-id/wwn-0x6005076306ffd6b600...


Frank Heimes (fheimes)
summary: - autoinstall with multipath/LVM disks sometimes leaves the system in a
+ Installations with multipath/LVM disks sometimes leaves the system in a
non-bootable stage
Frank Heimes (fheimes)
Changed in ubuntu-z-systems:
importance: Undecided → Critical
Server Team CI bot (server-team-bot) wrote :

This bug is fixed with commit 809817fd to curtin on branch master.
To view that commit see the following URL:
https://git.launchpad.net/curtin/commit/?id=809817fd

Changed in subiquity:
status: New → In Progress
Frank Heimes (fheimes)
Changed in ubuntu-z-systems:
status: New → In Progress
Frank Heimes (fheimes) wrote :

With the QA Tracker ISO from Aug 24th:
http://cdimage.ubuntu.com/ubuntu-server/focal/daily-live/20210824/focal-live-server-s390x.iso
this is solved.
Hence updating the status to Fix Committed.

Changed in subiquity:
status: In Progress → Fix Committed
Changed in ubuntu-z-systems:
status: In Progress → Fix Committed
Frank Heimes (fheimes) wrote :

Since 20.04.3 got released today I'm closing this ticket as Fix Released.

Changed in subiquity:
status: Fix Committed → Fix Released
Changed in ubuntu-z-systems:
status: Fix Committed → Fix Released