mantic daily (on s390x) does not reboot (post-install) from correct disk

Bug #2029479 reported by Frank Heimes
This bug affects 3 people
Affects Status Importance Assigned to Milestone
Ubuntu on IBM z Systems
Fix Released
High
Skipper Bug Screeners
subiquity
Fix Released
Undecided
Olivier Gayot

Bug Description

Trying to install the latest mantic daily from 'current' (from July 11th)
(and after upgrading the installer manually with:
snap refresh subiquity --channel=edge/s390x-shutdown --ignore-running)
let me complete the installation process itself up to "Reboot now",
but the following post-install reboot does not bring the system up from the newly installed disk,
but starts the install system again?!

However, if I force a reboot from the newly installed disk (via the z/VM hypervisor with '#cp i 200' in my case), the newly installed system comes up properly (and it really is the newly installed system; I checked the date/time when it got installed).

The console and shell output does not look suspicious, but the reboot just lands here:

[ 2.900096] raid6: using algorithm vx128x8 gen() 19123 MB/s
[ 3.070038] raid6: .... xor() 12292 MB/s, rmw enabled
[ 3.070055] raid6: using s390xc recovery algorithm
[ 3.071513] xor: automatically using best checksumming function xc
done.
Begin: Running /scripts/init-premount ... done.
Begin: Mounting root file system ... Begin: Running /scripts/nfs-top ... done.
Begin: Running /scripts/nfs-premount ... done.
Begin: Running /scripts/casper-premount ... done.
done.
Unable to find a medium containing a live file system
Attempt interactive netboot from a URL?
yes no (default yes):

I'm attaching the log from the system after I manually booted it.

Revision history for this message
Frank Heimes (fheimes) wrote :
Changed in ubuntu-z-systems:
assignee: nobody → Skipper Bug Screeners (skipper-screen-team)
tags: added: mantic
Revision history for this message
Michael Hudson-Doyle (mwhudson) wrote :

So this sounds like the chreipl command isn't working. Do you have any idea why this might be?

The fact that we run chreipl just before reboot is kind of counter to how other platforms work but well. Hard to see how that would matter.

Revision history for this message
Frank Heimes (fheimes) wrote :

I had chreipl in mind too,
but the image contains the exact same s390-tools version (2.26.0-0ubuntu1) as we have in lunar, and everything is fine with lunar.

Maybe a combination of multiple things (incl. kernel/driver?) - in case nothing has changed with the chreipl call itself.
(Btw. can that call be found in the logs? Just to check it's called against the correct disk.)

Revision history for this message
Frank Heimes (fheimes) wrote :

I did some more tests:

My standard z/VM guests are usually installations with 3 (ECKD) DASDs, addresses 200, 300 and 400.
I set aside about 1.5G on the first (200) as a primary partition for /boot and partition the rest (of 200), as well as the entire space of the remaining ones (300, 400), into one big LVM (btw. I've been doing that since xenial).
(I've also some with FCP/SCSI disks, that I need to try as well with mantic.)

Now I wanted to test chreipl (even though it's the same v2.26.0-0ubuntu1 as on lunar, but anyway).
(The new v2.26.0-0ubuntu1 landed just yesterday in the mantic archive and is not yet in the ISO image.
And btw. the ISO image is pretty old anyway; it's from July 11th, but still the latest in 'current'.)

So I installed mantic now on the entire 200 (w/o LVM) - and ran into the 'post-install reboots again into installer' problem again (so also w/o LVM).

I then installed a lunar on disk 400 (again w/o LVM) which worked flawlessly.

From lunar I did a chreipl to the mantic disk and could boot mantic w/o problems.

And from there I did again a chreipl but to disk 400 and could boot lunar again w/o any problems.

With that I believe we can be sure that chreipl is working fine.

Just note that I'm upgrading the installer while doing mantic installations right now with:
snap refresh subiquity --channel=edge/s390x-shutdown --ignore-running
to overcome an issue where hitting the 'Reboot now' button at the end of the installation did not trigger the post-install reboot at all.
Well, even if that 'not triggered reboot' is solved by edge/s390x-shutdown, this shows that changes happened in that larger area ...

Revision history for this message
Frank Heimes (fheimes) wrote :

Forgot to mention that once I've booted into mantic (either via chreipl from a different disk, as explained above from lunar, or via forcing the guest to boot from the mantic disk from the z/VM hypervisor using '#cp i 200'), I can reboot the system without problems, meaning it boots from the correct mantic disk.

Revision history for this message
Dan Bungert (dbungert) wrote :

> Just notice that I upgrade the installer while doing mantic installations right now with:
> snap refresh subiquity --channel=edge/s390x-shutdown --ignore-running

That fix has been merged to main, and as of today the resulting snap promoted to latest/stable/ubuntu-23.10, so the refresh to the edge channel test snap should no longer be necessary with ISOs built after today.

Frank Heimes (fheimes)
Changed in ubuntu-z-systems:
status: New → Fix Committed
Changed in subiquity:
status: New → Fix Committed
Changed in ubuntu-z-systems:
status: Fix Committed → New
Changed in subiquity:
status: Fix Committed → New
Frank Heimes (fheimes)
summary: - mantic daily (on s390x z/VM) ends again in install system after post-
- install reboot
+ mantic daily (on s390x) does not reboot (post-install) from correct disk
Revision history for this message
Frank Heimes (fheimes) wrote :

Just tried to install a mantic daily (latest pending from Aug 8th) on LPAR.
(The last thing I did with this LPAR was a 22.04.3 test install on DASD ECKD disks.)
Now with the new installation I explicitly installed to an FCP/SCSI disk/LUN, and the installation process completed and hitting the "Reboot now" button triggered a reboot.

When the system came up I noticed that the previous installation that I did on DASD (the 22.04.3 installation) came up, and not the newly installed mantic from the LUN.

So there must be an issue with calling chreipl correctly (I tested the chreipl tool itself which worked fine, see comment #4).

(With that I changed the title, since this is not limited to z/VM, but also happens on LPARs - so a more general issue ...)

tags: added: rls-mm-incoming
Revision history for this message
Frank Heimes (fheimes) wrote :

Still happens with image 20230817.

tags: added: fgoundations-todo
removed: rls-mm-incoming
tags: added: foundations-todo
removed: fgoundations-todo
Frank Heimes (fheimes)
Changed in ubuntu-z-systems:
importance: Undecided → High
Revision history for this message
Frank Heimes (fheimes) wrote (last edit ):

This still happens with the mantic ISO image from Sep 14th.
I tried on a z/VM installation.
Installation completed fine and the post-install reboot was triggered, but the z/VM guest ended up again in the installer
(without manually updating the installer to anything else - and the installer itself also did not report a newer version):
"
Unable to find a medium containing a live file system
Attempt interactive netboot from a URL?
yes no (default yes):
"
Well, since we are close to beta I think this is now getting pretty important.

(And the chreipl tool was tested a few times; things look okay there, all tests were fine ...)

Revision history for this message
Frank Heimes (fheimes) wrote (last edit ):

So I tried to narrow this issue down a bit.
I installed (here on a z/VM guest) and ended up at the installer again after the post-install reboot.
I repeated the installation and the same happened again - I ended up at the installer.
I repeated the installation once more, but prior to hitting the 'Reboot Now' button, I went to the installer shell and executed chreipl manually:

I've activated only a single disk for this installation, it's "0200":

root@ubuntu-server:/# lsdasd
Bus-ID Status Name Device Type BlkSz Size Blocks
================================================================================
0.0.0200 active dasda 94:0 ECKD 4096 7042MB 1802880

The IPL (boot) target device was "000c" which is the vmrdr (the "card reader", where the installation files are taken/read from):

root@ubuntu-server:/# lsreipl
Re-IPL type: ccw
Device: 0.0.000c
Loadparm: ""
Bootparms: ""
clear: 0

I manually changed the IPL/boot target to the disk that I've used for the installation:

root@ubuntu-server:/# chreipl 0.0.0200
Re-IPL type: ccw
Device: 0.0.0200
Loadparm: ""
Bootparms: ""
clear: 0

and double-checked whether the target had really changed (and it had, from "000c" to "0200"):

root@ubuntu-server:/# lsreipl
Re-IPL type: ccw
Device: 0.0.0200
Loadparm: ""
Bootparms: ""
clear: 0
root@ubuntu-server:/#

I closed the installer shell (exit) and hit the "Reboot Now" button
and the system came up properly - from the right disk.

So this lets me assume that chreipl is not properly executed by the installation procedure (or not called at all), no?!

[I will do a similar test on LPAR, to see if the situation reported at LP#2029388 is similar - which I believe so ...]

Revision history for this message
Frank Heimes (fheimes) wrote :

Btw. it would be interesting to know when chreipl is called by the installation - is it called before the last screen (with "Reboot Now") is shown, or after "Reboot Now" was pressed?

Revision history for this message
Olivier Gayot (ogayot) wrote :

IIRC, it should be called after "Reboot Now" is pressed (just before the system reboots). If you have access to logs of a successful install (they should be stored under /var/log/installer), I believe subiquity-server-debug.log should mention the call to chreipl.

Revision history for this message
Frank Heimes (fheimes) wrote (last edit ):

That's what I tried to find earlier, but even grep-ing through the entire /var/log (grep -ri chreipl /var/log/installer/*) didn't bring up anything.
But there is also no output on a jammy installation - looks like 'chreipl' is not directly mentioned in the logs ...

Revision history for this message
Dan Bungert (dbungert) wrote :

chreipl is one of the few things that should not make it into the logs. We copy log files from the installation system to the target system not long before reboot. If you want to see that I think you'd have to divert the chreipl command.
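
A diversion like Dan suggests could look roughly like this - sketched here against a throwaway directory with a fake chreipl, so nothing real is touched (in the live installer you would move the actual binary aside and wrap that instead; all paths here are illustrative):

```shell
# Divert a command through a logging wrapper to capture its invocation.
# Demonstrated with a fake chreipl in a temp dir, not the real /sbin one.
dir=$(mktemp -d)

# Stand-in for the real chreipl binary.
printf '#!/bin/sh\necho "real chreipl ran: $@"\n' > "$dir/chreipl.real"
chmod +x "$dir/chreipl.real"

# The wrapper: log every invocation, then hand off to the real tool.
cat > "$dir/chreipl" <<'EOF'
#!/bin/sh
echo "chreipl called with: $@" >> "$(dirname "$0")/chreipl-trace.log"
exec "$(dirname "$0")/chreipl.real" "$@"
EOF
chmod +x "$dir/chreipl"

"$dir/chreipl" /target/boot
cat "$dir/chreipl-trace.log"
```

After a run, the trace file shows exactly which arguments (if any) the installer passed, and whether the call happened at all.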

Revision history for this message
Frank Heimes (fheimes) wrote :

I completed some further tests on LPAR.
And I am now sure that LP#2029388 is a duplicate of this (LP#2029479).
The result of an LPAR installation is just a bit different.

I did a similar test to the one above, just here with:
root@ubuntu-server:/# lsdasd
Bus-ID Status Name Device Type BlkSz Size Blocks
================================================================================
0.0.162f active dasda 94:0 ECKD 4096 21129MB 5409180
root@ubuntu-server:/# lsreipl
Re-IPL type: fcp
WWPN: 0x50050763061b16b6
LUN: 0x4026400300000000
Device: 0.0.e000
bootprog: 0
br_lba: 0
Loadparm: ""
Bootparms: ""
root@ubuntu-server:/# chreipl 0.0.162f
Re-IPL type: ccw
Device: 0.0.162f
Loadparm: ""
clear: 0
root@ubuntu-server:/# lsreipl
Re-IPL type: ccw
Device: 0.0.162f
Loadparm: ""
clear: 0
root@ubuntu-server:/#

and the system booted from the newly installed disk - after I had called chreipl manually.

Since an LPAR install uses ftp/network rather than the reader as on z/VM,
the (re-)IPL target on LPAR is simply the previously successfully booted system (there is no reader on LPAR; that's a z/VM thing).
Hence on LPAR it just boots from the previous disk (in my test a SCSI LUN) rather than from the reader as on z/VM.

The root cause for both is that the chreipl step does not run as expected (or needed).

(I'll mark LP#2029388 as duplicate of this now.)

Revision history for this message
Olivier Gayot (ogayot) wrote :

Based on tests Frank and I did yesterday, the following command (which is what subiquity calls) is indeed supposed to do the job fine.

# chreipl /target/boot

However, I did another test where I delayed the call to /sbin/reboot and /sbin/poweroff.

Here's the output:

2023-09-19 10:36:43,825 DEBUG subiquitycore.utils:95 run_command ['chreipl', '/target/boot'] exited with code 1

It didn't log the output, which is a bit of a shame, but it seems that /target was already unmounted when the call ran - which would explain the failure.
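
The suspected sequence can be reproduced in miniature. This is only an illustration of the "path argument disappears before the call" failure mode, using a plain directory in place of the real /target mount (no actual chreipl involved):

```shell
# Simulate the failure: once /target is torn down, any command given
# /target/boot as an argument can only fail, however correct it is.
target=$(mktemp -d)       # stands in for the mounted /target
mkdir "$target/boot"

test -d "$target/boot" && echo "before teardown: path exists"

rm -r "$target"           # stands in for 'umount --recursive /target'

test -d "$target/boot" || echo "after teardown: path gone - chreipl would exit 1"
```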

Revision history for this message
Olivier Gayot (ogayot) wrote (last edit ):

Above theory is confirmed by the following sequence:

2023-09-19 10:36:43,116 DEBUG subiquitycore.utils:151 astart_command called: ['systemd-run', '--wait', '--same-dir', '--property', 'SyslogIdentifier=subiquity_log.71135', '--setenv', 'PATH=/snap/subiquity/5108/bin:/snap/subiquity/5108/usr/bin:/snap/subiquity/5108/usr/sbin:/snap/subiquity/5108/usr/bin:/snap/subiquity/5108/sbin:/snap/subiquity/5108/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/usr/games:/usr/local/games:/snap/bin:/snap/subiquity/5108/bin:/snap/subiquity/5108/sbin', '--setenv', 'PYTHONPATH=:/snap/subiquity/5108/lib/python3.10/site-packages', '--setenv', 'PYTHON=/snap/subiquity/5108/usr/bin/python3.10', '--setenv', 'SNAP=/snap/subiquity/5108', '--', 'umount', '--recursive', '/target']
2023-09-19 10:36:43,822 DEBUG subiquitycore.utils:76 run_command called: ['chreipl', '/target/boot']

https://github.com/canonical/subiquity/commit/895f6a6384b55054e7c98106e071c5af20604374 is probably the commit that introduced the regression

Revision history for this message
Michael Hudson-Doyle (mwhudson) wrote :

AAAARGH

Is there any reason we delay the call to chreipl so late? Can we not run it as soon as the install is complete? At the least, running it before we copy logs to the target would seem like a good idea...

Revision history for this message
Frank Heimes (fheimes) wrote :

From a chreipl point of view it can be called (almost) any time - even immediately after the new target boot device is known (it just needs to be enabled; in theory it does not even need to be formatted or installed - well, that would not make too much sense).
So with that, it can technically be called as soon as the install is complete (e.g. just before the 'Reboot Now' button becomes active?!).

Revision history for this message
Olivier Gayot (ogayot) wrote :

I opened https://github.com/canonical/subiquity/pull/1799, which moves the call to chreipl earlier, at the end of the postinstall phase.
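
The effect of moving the call earlier can be sketched in a few lines; fake_chreipl is a made-up stand-in that only checks its path argument (the real tool needs s390x firmware interfaces), and plain directories stand in for the mount:

```shell
# Fixed ordering: set the re-IPL target while /target is still present,
# and only tear the tree down afterwards.
target=$(mktemp -d)
mkdir "$target/boot"

# Hypothetical stand-in for 'chreipl /target/boot'.
fake_chreipl() { [ -d "$1" ] && echo "re-IPL target set from $1"; }

msg=$(fake_chreipl "$target/boot")   # end of postinstall: tree still intact
echo "$msg"

rm -r "$target"                      # the later 'umount --recursive /target'
```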

Changed in subiquity:
assignee: nobody → Olivier Gayot (ogayot)
status: New → In Progress
Revision history for this message
Frank Heimes (fheimes) wrote :

That sounds very reasonable - thx! (happy to try a test build)

Changed in ubuntu-z-systems:
status: New → In Progress
Dan Bungert (dbungert)
Changed in subiquity:
status: In Progress → Fix Committed
Dan Bungert (dbungert)
Changed in subiquity:
status: Fix Committed → Fix Released
Frank Heimes (fheimes)
Changed in ubuntu-z-systems:
status: In Progress → Fix Released