[2.6] Unable to reboot s390x KVM machine after initial deploy

Bug #1859656 reported by Sean Feole on 2020-01-14
This bug affects 2 people
Affects                    Importance   Assigned to
MAAS                       Low          Lee Trager
QEMU                       Undecided    Unassigned
Ubuntu on IBM z Systems    High         MAAS

Bug Description

MAAS version: 2.6.1 (7832-g17912cdc9-0ubuntu1~18.04.1)
Arch: S390x

It appears that MAAS cannot find the s390x bootloader to boot from disk; I'm not sure how MAAS determines this. However, this was working in the past. I had originally thought that once a machine was deployed in MAAS, it defaulted to booting from disk.

If I force the VM to boot from disk, the VM starts up as expected.

Reproduce:

- Deploy Disco on S390x KVM instance
- Reboot it

on the KVM console...

Connected to domain s2lp6g001
Escape character is ^]
done
  Using IPv4 address: 10.246.75.160
  Using TFTP server: 10.246.72.3
  Bootfile name: 'boots390x.bin'
  Receiving data: 0 KBytes
  TFTP error: file not found: boots390x.bin
Trying pxelinux.cfg files...
  Receiving data: 0 KBytes
  Receiving data: 0 KBytes
Failed to load OS from network

==> /var/log/maas/rackd.log <==
2020-01-14 18:21:24 provisioningserver.rackdservices.tftp: [info] boots390x.bin requested by 10.246.75.160
2020-01-14 18:21:24 provisioningserver.rackdservices.tftp: [info] s390x/65a9ca43-9541-49be-b315-e2ca85936ea2 requested by 10.246.75.160
2020-01-14 18:21:24 provisioningserver.rackdservices.tftp: [info] s390x/01-52-54-00-e5-d7-bb requested by 10.246.75.160
2020-01-14 18:21:24 provisioningserver.rackdservices.tftp: [info] s390x/0AF64BA0 requested by 10.246.75.160
2020-01-14 18:21:24 provisioningserver.rackdservices.tftp: [info] s390x/0AF64BA requested by 10.246.75.160
2020-01-14 18:21:24 provisioningserver.rackdservices.tftp: [info] s390x/0AF64B requested by 10.246.75.160
2020-01-14 18:21:24 provisioningserver.rackdservices.tftp: [info] s390x/0AF64 requested by 10.246.75.160
2020-01-14 18:21:24 provisioningserver.rackdservices.tftp: [info] s390x/0AF6 requested by 10.246.75.160
2020-01-14 18:21:24 provisioningserver.rackdservices.tftp: [info] s390x/0AF requested by 10.246.75.160
2020-01-14 18:21:24 provisioningserver.rackdservices.tftp: [info] s390x/0A requested by 10.246.75.160
2020-01-14 18:21:24 provisioningserver.rackdservices.tftp: [info] s390x/0 requested by 10.246.75.160
2020-01-14 18:21:24 provisioningserver.rackdservices.tftp: [info] s390x/default requested by 10.246.75.160
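
For readers unfamiliar with the request sequence in the rackd.log above: it matches the usual pxelinux.cfg lookup order (client UUID, then 01-<MAC>, then the IPv4 address in hex shortened one digit at a time, then "default"). Below is a minimal sketch that reproduces that list, assuming this order; the function name pxelinux_candidates is made up for illustration and is not part of MAAS or QEMU.

```python
import ipaddress


def pxelinux_candidates(prefix, uuid, mac, ip):
    """Return config paths in the order a pxelinux-style client asks for them.

    Hypothetical helper for illustration only; not MAAS or QEMU code.
    """
    paths = [f"{prefix}/{uuid}"]
    # "01" is the ARP hardware type for Ethernet, followed by the MAC with dashes.
    paths.append(f"{prefix}/01-" + mac.lower().replace(":", "-"))
    # IPv4 address as 8 upper-case hex digits, then shortened one digit at a time.
    hex_ip = f"{int(ipaddress.IPv4Address(ip)):08X}"
    for length in range(8, 0, -1):
        paths.append(f"{prefix}/{hex_ip[:length]}")
    paths.append(f"{prefix}/default")
    return paths


if __name__ == "__main__":
    for path in pxelinux_candidates(
        "s390x",
        "65a9ca43-9541-49be-b315-e2ca85936ea2",
        "52:54:00:e5:d7:bb",
        "10.246.75.160",
    ):
        print(path)
```

Every one of these requests comes back without a usable config, which is why the console ends with "Failed to load OS from network".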

Sean Feole (sfeole) wrote :
summary: - [2.6] Unable to reboot s390x machine after initial deploy
+ [2.6] Unable to reboot s390x KVM machine after initial deploy
Sean Feole (sfeole) wrote :

Powering off the machine after its initial deployment renders the machine unusable.

Workaround:
Release & Redeploy again.

Sean Feole (sfeole) on 2020-01-14
description: updated
Frank Heimes (fheimes) on 2020-01-16
Changed in ubuntu-z-systems:
status: New → Triaged
importance: Undecided → High
assignee: nobody → MAAS (maas)
Changed in maas:
assignee: nobody → Lee Trager (ltrager)
Sean Feole (sfeole) wrote :

Please note that I have updated MAAS to version 2.6.2 from the proposed PPA and this problem still exists.

Lee Trager (ltrager) wrote :

boots390x.bin is a placeholder; the file shouldn't exist. It's defined as a way to allow MAAS to know the architecture of the machine being booted. The bootloader itself is shipped with kvm. You do need a newer version of kvm. Bionic should work; I can't remember if it was backported to Xenial.

Changed in maas:
status: New → Incomplete
Lee Trager (ltrager) wrote :

I forgot to add that each Pod/kvm host has to have Bionic or newer. This blog post explains it pretty well:

http://ubuntu-on-big-iron.blogspot.com/2019/08/maas-kvm-on-s390x-part1.html

Sean Feole (sfeole) wrote :

Hey Lee,
I took a look at that document. I want to make a few points here. This has worked in the past, with earlier versions of MAAS. Nothing has ever changed on my lpar that is hosting the VMs.
The host lpar is Bionic, 18.04. This was working for months and suddenly stopped; the only change in my labs has been me updating the MAAS servers to newer versions.

According to the libvirt documentation, the s390x arch only respects the first <boot> param under <os> in the XML.
  <os>
    <type arch='s390x' machine='s390-ccw-virtio-bionic'>hvm</type>
    <boot dev='network'/>
    <boot dev='hd'/>
  </os>

In normal circumstances, net boot fails and the VM defaults to the HD, but on s390x that's not the case. Once net boot fails, as here:

Connected to domain s2lp6g001
Escape character is ^]
done
  Using IPv4 address: 10.246.75.177
  Using TFTP server: 10.246.72.3
  Bootfile name: 'boots390x.bin'
  Receiving data: 0 KBytes
  TFTP error: file not found: boots390x.bin
Trying pxelinux.cfg files...
  Receiving data: 0 KBytes
  Receiving data: 0 KBytes
Failed to load OS from network

you're stuck there in the off state forever. Changing the XML so that the VM boots from <HD> first works; however, that's not really acceptable in this use case.

I would imagine that for this to work MAAS-dhcp would have to instruct the s390x VM to boot from "local" (disk) once it's already deployed. All by means of the /var/lib/maas/dhcpd.conf

Is that not how this is designed to work?

Lee Trager (ltrager) wrote :

You are correct that the XML shouldn't have to be changed to work with MAAS. All architectures MAAS currently supports always boot from the network. MAAS gives the network boot loader a config file which tells it if it should boot into an ephemeral environment over the network or local boot.

From the log you posted, MAAS is doing everything it should do. MAAS specifies boots390x.bin as a placeholder so we know what architecture is booting[1]. When the kvm-provided bootloader runs, MAAS returns an empty config file because that should instruct the kvm bootloader to boot from disk[2].

I added the qemu-kvm project, as the only thing I can think of is that qemu-kvm has updated its bootloader in a way that broke MAAS. Do you know which version of qemu-kvm previously worked?

[1] https://git.launchpad.net/maas/tree/src/provisioningserver/boot/s390x.py#n76
[2] https://git.launchpad.net/maas/tree/src/provisioningserver/boot/s390x.py#n129

affects: qemu-kvm → qemu

qemu-kvm hasn't existed for years; I have marked it for qemu instead.
Thanks Frank for making me aware.

Sean got everything right in comment #6: it can only boot one entry, and that is the first boot entry.
There is no fallback/fallthrough on s390x.

If you stick with global boot options, the host would need to change the XML to boot from disk in this case. (BTW, that has been the case since the beginning; the comment is from libvirt 3.5, somewhere around zesty I think.)

P.S. if this ever worked it was a bug that is not to be relied upon (but I'd wonder)

But that doesn't mean it won't work, just not with that XML format.
I've never tested it but I think you might be able to get away with a proper bootorder config.
An example can be found here [1] that you might try (do not implement it directly, give it a test please)

[1]: https://libvirt.org/git/?p=libvirt.git;a=blob;f=tests/qemuxml2xmloutdata/machine-loadparm-multiple-disks-nets-s390.xml;h=c4e08fd4401bf5bf448ee45ab8890b3e44057f97;hb=HEAD

Changed in qemu:
status: New → Incomplete
Sean Feole (sfeole) wrote :

I also tried the workaround mentioned by Christian above last week; I did not mention that in comment #6.

I tried using the boot order configuration as outlined in the example (comment #9).

After the machine deploys, the same symptom occurs. So we are still stuck.

Domain s2lp6g001 started

Connected to domain s2lp6g001
Escape character is ^]
done
  Using IPv4 address: 10.246.75.152
  Using TFTP server: 10.246.72.3
  Bootfile name: 'boots390x.bin'
  Receiving data: 0 KBytes
  TFTP error: file not found: boots390x.bin
Trying pxelinux.cfg files...
  Receiving data: 0 KBytes
  Receiving data: 0 KBytes
Failed to load OS from network

Frank Heimes (fheimes) on 2020-01-22
Changed in maas:
status: Incomplete → New
Frank Heimes (fheimes) on 2020-01-23
tags: added: s390x

First check - as assumed - the old style config always failed.
It went into netboot, netboot fails and then it bails out.

root@testkvm-bionic-from:~# virsh start netboot --console
Domain netboot started
Connected to domain netboot
Escape character is ^]
done
  Using IPv4 address: 192.168.122.33
  Using TFTP server: 192.168.122.1
Trying pxelinux.cfg files...
  TFTP error: ICMP ERROR "port unreachable"
  TFTP error: ICMP ERROR "port unreachable"
  TFTP error: ICMP ERROR "port unreachable"
  TFTP error: ICMP ERROR "port unreachable"
  TFTP error: ICMP ERROR "port unreachable"
  TFTP error: ICMP ERROR "port unreachable"
  Receiving data: 0 KBytes
Repeating TFTP read request...
  TFTP error: ICMP ERROR "port unreachable"
  TFTP error: ICMP ERROR "port unreachable"
  TFTP error: ICMP ERROR "port unreachable"
  TFTP error: ICMP ERROR "port unreachable"
  Receiving data: 0 KBytes
Repeating TFTP read request...
  TFTP error: ICMP ERROR "port unreachable"
Failed to load OS from network

root@testkvm-bionic-from:~#
root@testkvm-bionic-from:~# virsh list --all
 Id Name State
----------------------------------------------------
 - netboot shut off

-- -- -- --

The suggested config with bootindex (let's see if that would work on s390x)

root@testkvm-bionic-from:~# virsh start netboot --console
Domain netboot started
Connected to domain netboot
Escape character is ^]
done
  Using IPv4 address: 192.168.122.33
  Using TFTP server: 192.168.122.1
Trying pxelinux.cfg files...
  TFTP error: ICMP ERROR "port unreachable"
  TFTP error: ICMP ERROR "port unreachable"
  TFTP error: ICMP ERROR "port unreachable"
  TFTP error: ICMP ERROR "port unreachable"
  TFTP error: ICMP ERROR "port unreachable"
  TFTP error: ICMP ERROR "port unreachable"
  Receiving data: 0 KBytes
Repeating TFTP read request...
  TFTP error: ICMP ERROR "port unreachable"
  TFTP error: ICMP ERROR "port unreachable"
  TFTP error: ICMP ERROR "port unreachable"
  TFTP error: ICMP ERROR "port unreachable"
  Receiving data: 0 KBytes
Repeating TFTP read request...
  TFTP error: ICMP ERROR "port unreachable"
Failed to load OS from network

This confirms sfeole's finding again (comment #9 this time).

So no easy workarounds present.

TBH I'd really want to see how this worked as we didn't push anything to Bionic that would have changed this recently.

I checked and the only change in that regard is really old.
It was for bug 1790901 which was a prereq for real IPXE on s390x.
So I doubt that MAAS could have worked before that.
Nevertheless, to be sure, I tried the old version (which needed an odd bundle of kernel+initrd built into what you reply with on netboot).

But even then, if netboot fails, it does not fall through (as I'd expected, but I wanted to be sure).

root@testkvm-bionic-from:~# virsh start netboot --console
Domain netboot started
Connected to domain netboot
Escape character is ^]
done
  Using IPv4 address: 192.168.122.33
  Requesting file "" via TFTP from 192.168.122.1
  Receiving data: 0 KBytesICMP ERROR "port unreachable"
Failed to load OS from network

Frank Heimes (fheimes) wrote :

I took the time and recreated a MAAS setup (latest stable 2.6.2) on s390x and it looks like this:
- I could start a deployment and ran through the states:
   - Power On, Commissioning (Performing PXE boot)
   - Power On, Commissioning (Gathering Information)
   - Power On, Ready
   - Power Off, Ready
  (I may have missed some states in between.)
- Power Off, Ready is the final state at that point
  and on the console it's:
$ virsh list --all
 Id Name State
----------------------------------------------------
 - vm1 shut off
- xml VM definition is:
$ virsh dumpxml vm1
<domain type='kvm'>
  <name>vm1</name>
  <uuid>0f7d1d61-9368-4bfe-8c65-c709e90e8780</uuid>
  <memory unit='KiB'>1048576</memory>
  <currentMemory unit='KiB'>1048576</currentMemory>
  <vcpu placement='static'>1</vcpu>
  <os>
    <type arch='s390x' machine='s390-ccw-virtio-bionic'>hvm</type>
    <boot dev='network'/>
    <boot dev='hd'/>
  </os>
  <features>
    <acpi/>
    <apic/>
    <pae/>
  </features>
  <clock offset='utc'/>
  <on_poweroff>destroy</on_poweroff>
  <on_reboot>restart</on_reboot>
  <on_crash>destroy</on_crash>
  <devices>
    <emulator>/usr/bin/kvm</emulator>
    <disk type='file' device='disk'>
      <driver name='qemu' type='raw'/>
      <source file='/var/lib/libvirt/maas-images/6addbfeb-ff2c-4350-b34d-11a56ea34f1d'/>
      <target dev='vda' bus='virtio'/>
      <serial>6addbfeb-ff2c-4350-b34d-11a56ea34f1d</serial>
      <address type='ccw' cssid='0xfe' ssid='0x0' devno='0x0002'/>
    </disk>
    <interface type='network'>
      <mac address='52:54:00:ea:11:5f'/>
      <source network='default'/>
      <model type='virtio'/>
      <address type='ccw' cssid='0xfe' ssid='0x0' devno='0x0001'/>
    </interface>
    <console type='pty'>
      <log file='/var/log/libvirt/qemu/vm1-serial0.log' append='off'/>
      <target type='sclp' port='0'/>
    </console>
    <memballoon model='virtio'>
      <address type='ccw' cssid='0xfe' ssid='0x0' devno='0x0000'/>
    </memballoon>
    <panic model='s390'/>
  </devices>
</domain>

So it largely looks like assumed (after initially reading the bug),
PXE itself seems to work, but the boot issue is due to:
    <boot dev='network'/>
    <boot dev='hd'/>

That confirms the situation (on s390x and MAAS 2.6.2), but it still raises the question of why it seems to have worked with 2.6.0.

Frank Heimes (fheimes) wrote :

The general issue with multiple boot elements on s390x was indeed already identified back in 2017, and a ticket was opened and reverse mirrored to IBM (so it should never have worked that way):
LP 1736511 (and btw. RH ticket is referenced there as well)

I asked around a bit without going into the "why" and got confirmation from IBM that fallthrough from netboot never existed.
In addition a RH engineer jumped in and said that they have this bug as well and would appreciate if IBM would implement it (that is the RH ticket I added to the other bug Frank has mentioned above).

This makes it even more puzzling how this ever worked, Frank is trying to test a 2.6.0 build Adam has set up ...

Lee Trager (ltrager) wrote :

The S390X KVM boot driver in MAAS really hasn't changed since it was committed in 2018[1]. I doubt changing the version of MAAS will show it working. I would try using older versions of qemu.

[1] https://git.launchpad.net/maas/log/src/provisioningserver/boot/s390x.py

Frank Heimes (fheimes) wrote :

It took some time (due to travel), but I was now able to do a setup based on the old 2.6.0 version [2.6.0 (7803-g6fc5f26eb-0ubuntu1~18.04.1)] for testing.

And with the combination:

$ apt-cache policy maas
maas:
  Installed: 2.6.0-7803-g6fc5f26eb-0ubuntu1~18.04.1
  Candidate: 2.6.0-7803-g6fc5f26eb-0ubuntu1~18.04.1
  Version table:
 *** 2.6.0-7803-g6fc5f26eb-0ubuntu1~18.04.1 500
        500 http://ppa.launchpad.net/maas-maintainers/testing/ubuntu bionic/main s390x Packages
        100 /var/lib/dpkg/status
     2.4.2-7034-g2f5deb8b8-0ubuntu1 500
        500 http://us.ports.ubuntu.com/ubuntu-ports bionic-updates/main s390x Packages
     2.4.0~beta2-6865-gec43e47e6-0ubuntu1 500
        500 http://us.ports.ubuntu.com/ubuntu-ports bionic/main s390x Packages
and:
$ apt-cache policy qemu
qemu:
  Installed: (none)
  Candidate: 1:2.11+dfsg-1ubuntu7.21
  Version table:
     1:2.11+dfsg-1ubuntu7.21 500
        500 http://us.ports.ubuntu.com/ubuntu-ports bionic-updates/universe s390x Packages
     1:2.11+dfsg-1ubuntu7.20 500
        500 http://ports.ubuntu.com/ubuntu-ports bionic-security/universe s390x Packages
     1:2.11+dfsg-1ubuntu7 500
        500 http://us.ports.ubuntu.com/ubuntu-ports bionic/universe s390x Packages
the system seems to commission successfully, similar to the latest MAAS 2.6.2 version (see above).
But then the VM ends again in state Ready / off.

virsh shows the VM as shutoff:
ubuntu@s1lp11:~$ sudo -H -u maas bash -c 'virsh -c qemu+ssh://ubuntu@192.168.122.1/system list --all'
ubuntu@192.168.122.1's password:
 Id Name State
----------------------------------------------------
 - vm1 shut off

The os element looks like this - with the two entries:
  <os>
    <type arch='s390x' machine='s390-ccw-virtio-bionic'>hvm</type>
    <boot dev='network'/>
    <boot dev='hd'/>
  </os>

A manual start (with the help of virsh, console enabled) shows that it network boots (see attachment).

Removing the network entry and booting didn't work - looks like no OS deployed on disk yet.

To sum it up - also not working on this 2.6.0 env. that I've just created.

Frank Heimes (fheimes) wrote :

I double-checked again today with sfeole and he confirmed that it worked for him as well (and not only for me) when he just started using MAAS KVM on s390x and had his initial setup...

Anyway, I'm not sure how much effort we should spend on recreating the old situation (I may try some old qemu/libvirt versions, in case they are still available in the archive; maybe MAAS pulls in further packages that need to be back-dated, too?),
but I think in parallel we should check if this can be solved on the latest MAAS versions, to get the kt testing unblocked.

Is there an option to make it work - like with MAAS changing the xml config?

I think that an upstream solution for LP 1736511 will just take way too long ...

Frank Heimes (fheimes) wrote :

@Lee, do you think that the boot log from comment #18 looks fine?
https://bugs.launchpad.net/ubuntu-z-systems/+bug/1859656/+attachment/5323507/+files/boot_console.txt

What would be the usual next step for MAAS KVM on s390x if the above is successful (me not really knowing the exact internal flow)?
Is it then netbooting again and redirecting to the disk (same on all platforms)?

Frank Heimes (fheimes) wrote :

In the meantime I found the time to set up an env. built upon older releases:

$ lsb_release -a
No LSB modules are available.
Distributor ID: Ubuntu
Description: Ubuntu 18.04.1 LTS
Release: 18.04
Codename: bionic

$ dpkg -l | grep -i qemu
ii qemu-block-extra:s390x 1:2.11+dfsg-1ubuntu7 s390x extra block backend modules for qemu-system and qemu-utils
ii qemu-kvm 1:2.11+dfsg-1ubuntu7 s390x QEMU Full virtualization on x86 hardware
ii qemu-system-common 1:2.11+dfsg-1ubuntu7 s390x QEMU full system emulation binaries (common files)
ii qemu-system-s390x 1:2.11+dfsg-1ubuntu7 s390x QEMU full system emulation binaries (s390x)
ii qemu-utils 1:2.11+dfsg-1ubuntu7 s390x QEMU utilities

$ apt-cache policy maas
maas:
  Installed: 2.6.0-7803-g6fc5f26eb-0ubuntu1~18.04.1
  Candidate: 2.6.0-7803-g6fc5f26eb-0ubuntu1~18.04.1
  Version table:
 *** 2.6.0-7803-g6fc5f26eb-0ubuntu1~18.04.1 500
        500 http://ppa.launchpad.net/maas-maintainers/testing/ubuntu bionic/main s390x Packages
        100 /var/lib/dpkg/status
     2.4.2-7034-g2f5deb8b8-0ubuntu1 500
        500 http://us.ports.ubuntu.com/ubuntu-ports bionic-updates/main s390x Packages
        500 http://ports.ubuntu.com/ubuntu-ports bionic-updates/main s390x Packages
        500 http://aus.ports.ubuntu.com/ubuntu-ports bionic-updates/main s390x Packages
     2.4.0~beta2-6865-gec43e47e6-0ubuntu1 500
        500 http://us.ports.ubuntu.com/ubuntu-ports bionic/main s390x Packages
        500 http://ports.ubuntu.com/ubuntu-ports bionic/main s390x Packages

In this environment MAAS is not able to Commission ("Failed commissioning").
Trying to start the vm manually with virsh ends up with:

$ virsh start vm1 --console
Domain vm1 started
Connected to domain vm1
Escape character is ^]
done
  Using IPv4 address: 192.168.122.102
  Requesting file "boots390x.bin" via TFTP from 192.168.122.1
  Receiving data: 0 KBytesfile not found: boots390x.bin
Failed to load OS from network

So that is expected, since the qemu packages at version 1:2.11+dfsg-1ubuntu7 were initially used: the GA version, which is not known to work; the needed patch came later.

The first qemu packages that should be good are the ones with version 1:2.11+dfsg-1ubuntu7.7.
But since (after discussing with cpaelzer) the qemu packages haven't really changed since 1:2.11+dfsg-1ubuntu7.7, I decided to upgrade to the latest ones (1:2.11+dfsg-1ubuntu7.22):

$ sudo apt install qemu-block-extra qemu-kvm qemu-system-common qemu-system-s390x qemu-utils
Reading package lists... Done
Building dependency tree
Reading state information... Done
Suggested packages:
  debootstrap
Recommended packages:
  sharutils
The following packages will be upgraded:
  qemu-block-extra qemu-kvm qemu-system-common qemu-system-s390x qemu-utils
5 upgraded, 0 newly installed, 0 to remove and 167 not upgraded.
Need to get 3,249 kB of archives.
After this operation, 32.8 kB of additional disk space will be used.
Get:1 http://us.ports.ubuntu.com/ubuntu-port...


The one that currently is deployed is using the same "list network and hd" which should not work.
It will boot network but should not internally fall back to disk.
  <os>
    <type arch='s390x' machine='s390-ccw-virtio-bionic'>hvm</type>
    <boot dev='network'/>
    <boot dev='hd'/>
  </os>

Now let's understand how/what works here...

Qemu is given both boot options (we know it will ignore the second ... or at least we think and are told so).
   ... -boot strict=on ... id=virtio-disk0,bootindex=2 ... mac=52:54:00:02:a3:f9,devno=fe.0.0001,bootindex=1

I'd expect this one to "just" netboot, but we need to understand how it got "up" from there.
Fortunately there was a full log of the serial console on disk.
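
As a side note on reading the bootindex values on the QEMU command line above: the device with the lowest bootindex is the one tried first, which here is the network interface. The following is only a rough sketch of that rule, using the two fragments quoted above as input; first_boot_device is a made-up helper name, and the elided parts of the command line are deliberately left out.

```python
import re


def first_boot_device(device_args):
    """Return the device argument carrying the lowest bootindex, or None.

    Hypothetical helper for illustration; it only inspects the option strings.
    """
    indexed = []
    for arg in device_args:
        match = re.search(r"(?:^|,)bootindex=(\d+)", arg)
        if match:
            indexed.append((int(match.group(1)), arg))
    # The lowest bootindex wins; devices without a bootindex are ignored.
    return min(indexed)[1] if indexed else None


if __name__ == "__main__":
    fragments = [
        "id=virtio-disk0,bootindex=2",
        "mac=52:54:00:02:a3:f9,devno=fe.0.0001,bootindex=1",
    ]
    print(first_boot_device(fragments))
```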

Attaching files from this test ...


Here are the interesting bits from the log:

LOADPARM=[........]
Network boot device detected

Network boot starting...
Using MAC address: 52:54:00:02:a3:f9
Requesting information via DHCP: done
Using IPv4 address: 192.168.122.102
Using TFTP server: 192.168.122.1
Bootfile name: 'boots390x.bin'
Receiving data: 0 KBytes
TFTP error: file not found: boots390x.bin
Trying pxelinux.cfg files...
...
TFTP: Received s390x/01-52-54-00-02-a3-f9 (581 bytes)
Loading pxelinux.cfg entry 'execute'
...
TFTP: Received ubuntu/s390x/ga-19.04/disco/daily/boot-kernel (4318 KBytes)
...
TFTP: Received ubuntu/s390x/ga-19.04/disco/daily/boot-initrd (19360 KBytes)
Network loading done, starting kernel...

[ 0.439873] Linux version 5.0.0-38-generic (buildd@bos02-s390x-020) (gcc version 8.3.0 (Ubuntu 8.3.0-6ubuntu1)) #41-Ubuntu SMP Tue Dec 3 00:26:40 UTC 2019 (Ubuntu 5.0.0-38.41-generic 5.0.21)

...

[ 0.451953] Kernel command line: nomodeset ro root=squash:http://192.168.122.1:5248/images/ubuntu/s390x/ga-19.04/disco/daily/squashfs ip=::::vm1:BOOTIF ip6=off overlayroot=tmpfs overlayroot_cfgdisk=disabled cc:{'datasource_list': ['MAAS']}end_cc cloud-config-url=http://192-168-122-0--24.maas-internal:5248/MAAS/metadata/latest/by-id/wpr3yp/?op=get_preseed apparmor=0 log_host=192.168.122.1 log_port=5247 --- console=tty1 console=ttyS0 BOOTIF=01-52-54-00-02-a3-f9

...

Begin: Mounting root file system ... Begin: Running /scripts/local-top ... IP-Config: enc1 hardware address 52:54:00:02:a3:f9 mtu 1500 DHCP RARP
hostname vm1 IP-Config: no response after 2 secs - giving up
IP-Config: enc1 hardware address 52:54:00:02:a3:f9 mtu 1500 DHCP RARP
hostname vm1 hostname vm1 IP-Config: enc1 complete (dhcp from 192.168.122.1):
address: 192.168.122.102 broadcast: 192.168.122.255 netmask: 255.255.255.0
gateway: 192.168.122.254 dns0 : 192.168.122.1 dns1 : 10.245.236.13
domain : maas
rootserver: 192.168.122.1 rootpath:
filename : lpxelinux.0
:: root=squash:http://192.16...

The assumption from here was that this only appeared to be working due to:

a) Deploy = netboot + reboot from disk = working

but at the same time

b) Start = netboot (fail) + no fallback = fail

To get that from Maas UI we stopped the guest (it went down as expected).
Then from Maas we said "power on" again.

There on (b) it failed as maas didn't provide it with an install image.
If you track it in the console you see:

$virsh start vm1 --console
setlocale: No such file or directory
Domain vm1 started
Connected to domain vm1
Escape character is ^]
done
  Using IPv4 address: 192.168.122.102
  Using TFTP server: 192.168.122.1
  Bootfile name: 'boots390x.bin'
  Receiving data: 0 KBytes
  TFTP error: file not found: boots390x.bin
Trying pxelinux.cfg files...
  Receiving data: 0 KBytes
  Receiving data: 0 KBytes
Failed to load OS from network

Maas tries a few times as we see the guest flip between "shut off" and "paused" state.
But then gives up.

The super-TL;DR matching the current insights is:
- deploy s390x Maas-KVM @ s390x worked and still does
- poweroff/poweron s390x Maas-KVM @ s390x never worked and still does not

To fix the latter we either need
a) upstream to implement a fallback to the next boot mechanism
b) maas to modify the XML after deploy to boot from disk

I flipped
    <boot dev='network'/>
    <boot dev='hd'/>

to

    <boot dev='hd'/>
    <boot dev='network'/>

And JFH started it from the MAAS UI again.
Now things work (obviously as expected)
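
For anyone wanting to script that flip rather than hand-edit the XML, here is a minimal sketch using only the Python standard library. It assumes the domain XML has already been obtained as a string (for example from "virsh dumpxml") and would be re-applied with "virsh define" afterwards; boot_from_disk_first is a hypothetical helper name, and none of this is what MAAS itself does today.

```python
import xml.etree.ElementTree as ET


def boot_from_disk_first(domain_xml):
    """Return the domain XML with <boot dev='hd'/> moved ahead of other boot entries.

    Hypothetical helper for illustration; works on an XML string only.
    """
    root = ET.fromstring(domain_xml)
    os_elem = root.find("os")
    if os_elem is None:
        return domain_xml
    boots = os_elem.findall("boot")
    # Stable sort: 'hd' entries first, everything else keeps its relative order.
    ordered = sorted(boots, key=lambda b: b.get("dev") != "hd")
    for boot in boots:
        os_elem.remove(boot)
    for boot in ordered:
        os_elem.append(boot)
    return ET.tostring(root, encoding="unicode")


if __name__ == "__main__":
    sample = (
        "<domain type='kvm'><os>"
        "<type arch='s390x' machine='s390-ccw-virtio-bionic'>hvm</type>"
        "<boot dev='network'/><boot dev='hd'/>"
        "</os></domain>"
    )
    print(boot_from_disk_first(sample))
```

Note that this only helps after the initial deployment; a machine that still needs to netboot for commissioning or deploy would need the original order back.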

@sfeole - after initial deployment just do the change to your guest XMLs you see in comment #27

@maas - as I said in comment #26 (and before) this needs coding in maas to switch the XML content (or waiting a long time on IBM)

OR
Maas can boot from network (always) and if not deploying just issue a "reboot from disk" command

Andrew Cloke (andrew-cloke) wrote :

My understanding from the MAAS design was that the suggestion in comment #29 "Maas can boot from network (always) and if not deploying just issue a "reboot from disk" command" was the intended design.

...and the "reboot from disk" command was the supply of an empty (zero byte?) pxelinux.cfg.

But I'll let Lee respond and correct.

Sean Feole (sfeole) wrote :

@paelzer, Aye and thanks for your comment #27 , I was already aware of that, and yes that does work. However, it's a shoddy workaround at best and if this is going to be a solution to be presented to a customer MAAS would be scoffed at.

I'm aware of the issue at hand here. I think the problem now is that a decision needs to be made on how to fix this going forward. I was about to suggest that what makes the most sense IMO, and is the least invasive, is the suggestion by @paelzer from comment #29:

Maas can boot from network (always) and if not deploying just issue a "reboot from disk" command

Sean Feole (sfeole) wrote :

To add to this discussion today, I noticed that some of the maas deployments for s390x are working. I took a look and I was able to successfully deploy 19.10/18.04/20.04

I have not changed anything on the MAAS host, I have not upgraded / altered any packages.
I have not upgraded libvirt,

The only thing that's different, to my knowledge, is that the images MAAS is booting are -dailies and are updated quite often.

I don't have free time today to look into this; however, now I'm wondering what has changed.

Sean Feole (sfeole) wrote :

Here are my pkg versions as things are now.

maas:
  Installed: 2.6.2-7841-ga10625be3-0ubuntu1~18.04.1
  Candidate: 2.6.2-7841-ga10625be3-0ubuntu1~18.04.1

On the s390x Lpar, Bionic, Linux s2lp6 4.15.0-74-generic #84-Ubuntu SMP Thu Dec 19 08:05:42 UTC 2019 s390x s390x s390x GNU/Linux

ubuntu@s2lp6:~$ dpkg -l | grep libvirt
ii libvirt-clients 4.0.0-1ubuntu8.14 s390x Programs for the libvirt library
ii libvirt-daemon 4.0.0-1ubuntu8.14 s390x Virtualization daemon
ii libvirt-daemon-driver-storage-rbd 4.0.0-1ubuntu8.14 s390x Virtualization daemon RBD storage driver
ii libvirt-daemon-system 4.0.0-1ubuntu8.14 s390x Libvirt daemon configuration files
ii libvirt0:s390x 4.0.0-1ubuntu8.14 s390x library for interfacing with different virtualization systems
ii python-libvirt 4.0.0-1 s390x libvirt Python bindings
ii uvtool-libvirt 0~git140-0ubuntu1 all Library and tools for using Ubuntu Cloud Images with libvirt

Sean Feole (sfeole) wrote :

After looking at the original description, it does appear that I upgraded MAAS since originally filing this bug; that upgrade was done to work around a different issue, which has been resolved since 2.6.1.

Changed in maas:
status: New → Triaged
importance: Undecided → Low
Andrew Cloke (andrew-cloke) wrote :

After discussing, I realise I had a misunderstanding in comment #30 that I'd like to correct.

I had incorrectly assumed that feeding the PXEBooting KVM guest a zero length pxelinux.cfg file *instructed* it to boot from the local disk.

I now realise that is incorrect. Feeding the PXEBooting KVM guest a zero length pxelinux.cfg file only tells the guest to *fail* its netboot attempt.

It's at this stage that the architecture specific behaviour kicks in.

On amd64, the netboot failure will force the KVM guest to move down to its second specified boot option, namely the local disk.

However, s390x will NEVER move to its second specified boot option. If the first boot option (netbooting) fails, it abandons the attempt and powers the guest off.

IBM has been informed of this difference in behaviour, but it is unlikely to be able to address it soon.
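
To make the difference concrete, here is a toy model of the two behaviours described above. It is purely illustrative: first_bootable and its fallthrough flag are invented names, and the model ignores everything except "does this boot option succeed".

```python
def first_bootable(boot_order, bootable, fallthrough):
    """Return the device that ends up booting, or None if the guest stops.

    boot_order  -- devices in the order listed in the domain XML
    bootable    -- set of devices that would actually boot if tried
    fallthrough -- True models the amd64 behaviour, False the s390x behaviour
    """
    for dev in boot_order:
        if dev in bootable:
            return dev
        if not fallthrough:
            # s390x as described above: the first failure is final.
            return None
    return None


if __name__ == "__main__":
    # Deployed machine: the netboot attempt fails on purpose, the disk is bootable.
    print(first_bootable(["network", "hd"], {"hd"}, fallthrough=True))
    print(first_bootable(["network", "hd"], {"hd"}, fallthrough=False))
```

With the same boot order and the same empty pxelinux.cfg, the amd64-style guest ends up on disk while the s390x-style guest never gets past the failed netboot.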

Sean Feole (sfeole) wrote :

It would appear that this bug is once again causing problems with some of our automated testing.
S390x KVM deployments are failing for Focal. When attempting to investigate, I found that it is indeed this bug.

Our MAAS Server is Version:

maas:
  Installed: 2.7.0-8232-g.6e1dba4ab-0ubuntu1~18.04.1
  Candidate: 2.7.0-8232-g.6e1dba4ab-0ubuntu1~18.04.1
  Version table:

I've attached the console log of the -KVM machine deploying.

On MAAS the rack controller reports the following:
sfeole@bsg75:~$ cat focal-s390x-maas.txt
==> rackd.log <==
2020-04-09 14:14:59 provisioningserver.rackdservices.tftp: [info] boots390x.bin requested by 10.246.75.177
2020-04-09 14:14:59 provisioningserver.rackdservices.tftp: [info] s390x/65a9ca43-9541-49be-b315-e2ca85936ea2 requested by 10.246.75.177
2020-04-09 14:14:59 provisioningserver.rackdservices.tftp: [info] s390x/01-52-54-00-e5-d7-bb requested by 10.246.75.177

==> regiond.log <==
2020-04-09 14:14:59 maasserver.rpc.leases: [info] Lease update: commit for 10.246.75.177 on 52:54:0:e5:d7:bb at 2020-04-09 14:14:59 (lease time: 600s)

==> rackd.log <==
2020-04-09 14:14:59 provisioningserver.rackdservices.tftp: [info] s390x/0AF64BB1 requested by 10.246.75.177
2020-04-09 14:14:59 provisioningserver.rackdservices.tftp: [info] s390x/0AF64BB requested by 10.246.75.177
2020-04-09 14:14:59 provisioningserver.rackdservices.tftp: [info] s390x/0AF64B requested by 10.246.75.177
2020-04-09 14:14:59 provisioningserver.rackdservices.tftp: [info] s390x/0AF64 requested by 10.246.75.177
2020-04-09 14:14:59 provisioningserver.rackdservices.tftp: [info] s390x/0AF6 requested by 10.246.75.177
2020-04-09 14:15:00 provisioningserver.rackdservices.tftp: [info] s390x/0AF requested by 10.246.75.177
2020-04-09 14:15:00 provisioningserver.rackdservices.tftp: [info] s390x/0A requested by 10.246.75.177
2020-04-09 14:15:00 provisioningserver.rackdservices.tftp: [info] s390x/0 requested by 10.246.75.177
2020-04-09 14:15:00 provisioningserver.rackdservices.tftp: [info] s390x/default requested by 10.246.75.177

Sean Feole (sfeole) wrote :

Also, please note that the libvirt version did not change on the s390x Virtual Machine Host.

On S390x VM Host

ii libvirt-clients 4.0.0-1ubuntu8.14 s390x Programs for the libvirt library
ii libvirt-daemon 4.0.0-1ubuntu8.14 s390x Virtualization daemon
ii libvirt-daemon-driver-storage-rbd 4.0.0-1ubuntu8.14 s390x Virtualization daemon RBD storage driver
ii libvirt-daemon-system 4.0.0-1ubuntu8.14 s390x Libvirt daemon configuration files
ii libvirt0:s390x 4.0.0-1ubuntu8.14 s390x library for interfacing with different virtualization systems
ii python-libvirt 4.0.0-1 s390x libvirt Python bindings

Frank Heimes (fheimes) wrote :

I still think that #29 could be a viable fix for this.
