[1.9.1] Deployment for IBM S822LC 8335-GTA and S812L TN71-BP012 fails to boot local disk following curtin install

Bug #1558747 reported by Larry Michel
This bug affects 4 people
Affects   Status    Importance   Assigned to      Milestone
MAAS      Invalid   Undecided    Newell Jensen
curtin    Invalid   Medium       Unassigned

Bug Description

The curtin install on the S822LC 8335-GTA succeeds, but the subsequent boot to local disk fails. The event log shows a PXE local boot request, but nothing happens.

Trying to boot directly from the PXE NIC gives "ERROR: kexec load failed".

Booting directly from the grub menu option for the disk in the petitboot menu works, and the system goes to the deployed state.

Powering on from MAAS also fails because it sets the temporary IPMI boot option to Network.

This is the petitboot menu:

****************************************************************************
  [Disk: sda2 / 7fb98a45-aa02-4142-a288-b9135cf30ae1]
    Ubuntu, with Linux 3.13.0-83-generic (recovery mode)
    Ubuntu, with Linux 3.13.0-83-generic
    Ubuntu, with Linux 3.19.0-56-generic (recovery mode)
    Ubuntu, with Linux 3.19.0-56-generic
    Ubuntu
  [Network: enp1s0f0 / 98:be:94:01:0f:4c]
    netboot enp1s0f0 (pxelinux.0)

  System information
  System configuration
  Language
  Rescan devices
  Retrieve config from URL
 *Exit to shell

 ──────────────────────────────────────────────────────────────────────────────
 Enter=accept, e=edit, n=new, x=exit, l=language, h=help
 Info: Processing enp1s0f0 complete
****************************************************************************

This is the result of trying to boot directly from the network:

****************************************************************************
 Petitboot (dev.20151105) 8335-GTA 0000000000000000
 ──────────────────────────────────────────────────────────────────────────────
  [Disk: sda2 / 7fb98a45-aa02-4142-a288-b9135cf30ae1]
    Ubuntu, with Linux 3.13.0-83-generic (recovery mode)
    Ubuntu, with Linux 3.13.0-83-generic
    Ubuntu, with Linux 3.19.0-56-generic (recovery mode)
    Ubuntu, with Linux 3.19.0-56-generic
    Ubuntu
  [Network: enp1s0f0 / 98:be:94:01:0f:4c]
 * netboot enp1s0f0 (pxelinux.0)

  System information
  System configuration
  Language
  Rescan devices
  Retrieve config from URL
  Exit to shell

 ──────────────────────────────────────────────────────────────────────────────
 Enter=accept, e=edit, n=new, x=exit, l=language, h=help
 Error: kexec load failed
****************************************************************************

This is the MAAS version:

****************************************************************************
 dpkg -l '*maas*'|cat
Desired=Unknown/Install/Remove/Purge/Hold
| Status=Not/Inst/Conf-files/Unpacked/halF-conf/Half-inst/trig-aWait/Trig-pend
|/ Err?=(none)/Reinst-required (Status,Err: uppercase=bad)
||/ Name Version Architecture Description
+++-================================-================================-============-===============================================================================
ii maas 1.9.1+bzr4543-0ubuntu1~trusty1 all MAAS server all-in-one metapackage
ii maas-cli 1.9.1+bzr4543-0ubuntu1~trusty1 all MAAS command line API tool
ii maas-cluster-controller 1.9.1+bzr4543-0ubuntu1~trusty1 all MAAS server cluster controller
ii maas-common 1.9.1+bzr4543-0ubuntu1~trusty1 all MAAS server common files
ii maas-dhcp 1.9.1+bzr4543-0ubuntu1~trusty1 all MAAS DHCP server
ii maas-dns 1.9.1+bzr4543-0ubuntu1~trusty1 all MAAS DNS server
ii maas-proxy 1.9.1+bzr4543-0ubuntu1~trusty1 all MAAS Caching Proxy
ii maas-region-controller 1.9.1+bzr4543-0ubuntu1~trusty1 all MAAS server complete region controller
ii maas-region-controller-min 1.9.1+bzr4543-0ubuntu1~trusty1 all MAAS Server minimum region controller
ii python-django-maas 1.9.1+bzr4543-0ubuntu1~trusty1 all MAAS server Django web framework
ii python-maas-client 1.9.1+bzr4543-0ubuntu1~trusty1 all MAAS python API client
ii python-maas-provisioningserver 1.9.1+bzr4543-0ubuntu1~trusty1 all MAAS server provisioning libraries
****************************************************************************

Logs attached.

Larry Michel (lmic) wrote :
Larry Michel (lmic) wrote :

Wrong attachment. Here are the logs.

Larry Michel (lmic)
description: updated
description: updated
Larry Michel (lmic) wrote :

This is the curtin version:

ubuntu@maas-integration-september:~$ dpkg -l |grep curtin
ii curtin 0.1.0~bzr359-0ubuntu1 all Library and tools for the curtin installer
ii curtin-common 0.1.0~bzr359-0ubuntu1 all Library and tools for curtin installer
ii python-curtin 0.1.0~bzr359-0ubuntu1 all Library and tools for curtin installer
ii python3-curtin 0.1.0~bzr359-0ubuntu1 all Library and tools for curtin installer

Larry Michel (lmic) wrote :

/etc/maas/preseeds and /etc/maas/templates attached.

Larry Michel (lmic) wrote :

Petitboot logs: https://pastebin.canonical.com/152293/

Running command:
 exe: /usr/bin/tftp
 argv: '/usr/bin/tftp' '-g' '-l' '/tmp/pb-XsMOPF' '-r' '/ppc64el/pxelinux.cfg/01-98-be-94-01-0f-4c' '10.244.192.10' '69'
boot option enp1s0f0#netboot enp1s0f0 (pxelinux.0) is resolved, sending to clients
Sending renew...
Lease of 10.244.192.175 obtained, lease time 43200
deleting routers
adding dns 10.244.192.10
Sending renew...
Lease of 10.244.192.175 obtained, lease time 43200
deleting routers
adding dns 10.244.192.10
Sending renew...
Lease of 10.244.192.175 obtained, lease time 43200
deleting routers
adding dns 10.244.192.10
Running command:
 exe: /usr/bin/tftp
 argv: '/usr/bin/tftp' '-g' '-l' '/tmp/pb-ebec6t' '-r' '/pxelinux.0' '10.244.192.10' '69'
running boot hook 01-create-default-dtb
Running command:
 exe: /etc/petitboot/boot.d/01-create-default-dtb
 argv: '/etc/petitboot/boot.d/01-create-default-dtb'
Warning (reg_format): "reg" property in /ibm,opal/flash@0 has invalid length (8 bytes) (#address-cells == 0, #size-cells == 0)
boot hook 01-create-default-dtb specified boot_dtb=/tmp/tmp.TUFj3U
running boot hook 20-set-stdout
Running command:
 exe: /etc/petitboot/boot.d/20-set-stdout
 argv: '/etc/petitboot/boot.d/20-set-stdout'
running boot hook 90-sort-dtb
Running command:
 exe: /etc/petitboot/boot.d/90-sort-dtb
 argv: '/etc/petitboot/boot.d/90-sort-dtb'
Warning (reg_format): "reg" property in /ibm,opal/flash@0 has invalid length (8 bytes) (#address-cells == 0, #size-cells == 0)
Running command:
 exe: /usr/sbin/kexec
 argv: '/usr/sbin/kexec' '-l' '--dtb=/tmp/tmp.TUFj3U' '/tmp/pb-ebec6t'
load_kernel: /tmp/pb-ebec6t is not a valid ELF file
kexec_load: failed: (256)
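
The last lines of the log show why the network entry fails: petitboot fetches /pxelinux.0 over TFTP and hands the downloaded file to kexec as if it were a kernel, and kexec rejects it as not being a valid ELF file. A minimal sketch of that kind of check, assuming only the standard 4-byte ELF magic number; the function names are illustrative and are not part of petitboot or kexec:

```python
ELF_MAGIC = b"\x7fELF"  # first four bytes of every ELF image


def has_elf_magic(header: bytes) -> bool:
    """Return True if the given leading bytes start with the ELF magic."""
    return header[:4] == ELF_MAGIC


def file_is_elf(path: str) -> bool:
    """Read the first four bytes of a file and test them for the ELF magic."""
    with open(path, "rb") as f:
        return has_elf_magic(f.read(4))
```

Running `file_is_elf` on a downloaded boot file before passing it to kexec would show whether the server handed out a kernel image or something else, such as an x86 PXE loader.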

Larry Michel (lmic) wrote :

On a different MAAS server, I am not able to get the node to commission.

 Device: (*) Specify paths/URLs manually

 Kernel: tftp://10.245.0.10/ppc64el/ubuntu/amd64/hwe-t/trusty/release/boot-kernel
 Initrd: tftp://10.245.0.10/ppc64el/ubuntu/amd64/hwe-t/trusty/release/boot-initrd
 Device tree:
 Boot arguments: nomodeset iscsi_target_name=iqn.2004-05.com.ubuntu:maas:epheme

                 [ OK ] [ Help ] [ Cancel ]

2016-03-24 07:04:28+0000 [RemoteOriginReadSession (UDP)] Final ACK received, transfer successful
2016-03-24 07:04:28+0000 [-] (UDP Port 60223 Closed)
2016-03-24 07:04:28+0000 [-] Stopping protocol <tftp.bootstrap.RemoteOriginReadSession instance at 0x7f37a5f20200>
2016-03-24 07:04:29+0000 [-] Timed during option negotiation process
2016-03-24 07:04:29+0000 [-] Timed during option negotiation process
2016-03-24 07:04:29+0000 [-] Timed during option negotiation process
2016-03-24 07:04:29+0000 [-] (UDP Port 54560 Closed)
2016-03-24 07:04:29+0000 [-] Stopping protocol <tftp.bootstrap.RemoteOriginReadSession instance at 0x7f37a56ed3f8>
2016-03-24 07:04:29+0000 [-] (UDP Port 57487 Closed)
2016-03-24 07:04:29+0000 [-] Stopping protocol <tftp.bootstrap.RemoteOriginReadSession instance at 0x7f37a7975488>
2016-03-24 07:04:29+0000 [-] (UDP Port 37997 Closed)
2016-03-24 07:04:29+0000 [-] Stopping protocol <tftp.bootstrap.RemoteOriginReadSession instance at 0x7f37a79753b0>
2016-03-24 07:04:30+0000 [TFTP (UDP)] Datagram received from ('10.245.0.237', 36123): <RRQDatagram(filename=/ppc64el/pxelinux.cfg/01-98-be-94-01-0f-4c, mode=octet, options={'tsize': '0'})>
2016-03-24 07:04:30+0000 [TFTP (UDP)] Datagram received from ('10.245.0.237', 36123): <RRQDatagram(filename=/ppc64el/pxelinux.cfg/01-98-be-94-01-0f-4c, mode=octet, options={'tsize': '0'})>
2016-03-24 07:04:30+0000 [HTTPPageGetter,client] RemoteOriginReadSession starting on 49182
2016-03-24 07:04:30+0000 [HTTPPageGetter,client] Starting protocol <tftp.bootstrap.RemoteOriginReadSession instance at 0x7f37a79757a0>
2016-03-24 07:04:30+0000 [HTTPPageGetter,client] RemoteOriginReadSession starting on 56952
2016-03-24 07:04:30+0000 [HTTPPageGetter,client] Starting protocol <tftp.bootstrap.RemoteOriginReadSession instance at 0x7f37a7964098>
2016-03-24 07:04:30+0000 [RemoteOriginReadSession (UDP)] (UDP Port 56952 Closed)
2016-03-24 07:04:30+0000 [RemoteOriginReadSession (UDP)] Stopping protocol <tftp.bootstrap.RemoteOriginReadSession instance at 0x7f37a7964098>
2016-03-24 07:04:30+0000 [RemoteOriginReadSession (UDP)] Final ACK received, transfer successful
2016-03-24 07:04:30+0000 [-] (UDP Port 49182 Closed)
2016-03-24 07:04:30+0000 [-] Stopping protocol <tftp.bootstrap.RemoteOriginReadSession instance at 0x7f37a79757a0>
2016-03-24 07:04:31+0000 [-] Unhandled Error
        Traceback (most recent call last):
          File "/usr/lib/python2.7/dist-packages/twisted/application/app.py", line 392, in startReactor
            self.config, oldstdout, oldstderr, self.profiler, reactor)
          File "/usr/lib/python2.7/dist-packages/twisted/application/app.py", line 313, in runReactorWithLogging
     ...

Larry Michel (lmic) wrote :

clusterd log file for failure to commission.

Larry Michel (lmic) wrote :

Both MAAS servers are at the same version level, but the other one, where commissioning worked, was using the right ephemeral image. I have also reproduced the failure to commission on a 1.8.3 system. Attached is the clusterd log from that system:

ubuntu@maas18:~$ dpkg -l |grep maas
ii maas 1.8.3+bzr4053-0ubuntu1~trusty1 all MAAS server all-in-one metapackage
ii maas-cli 1.8.3+bzr4053-0ubuntu1~trusty1 all MAAS command line API tool
ii maas-cluster-controller 1.8.3+bzr4053-0ubuntu1~trusty1 all MAAS server cluster controller
ii maas-common 1.8.3+bzr4053-0ubuntu1~trusty1 all MAAS server common files
ii maas-dhcp 1.8.3+bzr4053-0ubuntu1~trusty1 all MAAS DHCP server
ii maas-dns 1.8.3+bzr4053-0ubuntu1~trusty1 all MAAS DNS server
ii maas-enlist 0.4+bzr38-0ubuntu1 amd64 MAAS enlistment tool
ii maas-proxy 1.8.3+bzr4053-0ubuntu1~trusty1 all MAAS Caching Proxy
ii maas-region-controller 1.8.3+bzr4053-0ubuntu1~trusty1 all MAAS server complete region controller
ii maas-region-controller-min 1.8.3+bzr4053-0ubuntu1~trusty1 all MAAS Server minimum region controller
ii python-django-maas 1.8.3+bzr4053-0ubuntu1~trusty1 all MAAS server Django web framework
ii python-maas-client 1.8.3+bzr4053-0ubuntu1~trusty1 all MAAS python API client
ii python-maas-provisioningserver 1.8.3+bzr4053-0ubuntu1~trusty1 all MAAS server provisioning libraries

Larry Michel (lmic) wrote :

This is the clusterd log from the system where commissioning and the curtin install work but local boot fails.

Mike Rushton (leftyfb)
tags: added: blocks-hwcert-server
Newell Jensen (newell-jensen) wrote :

I think the issue is that you were trying to deploy a generic trusty release. For power8 to work you need to use a vivid kernel or higher. Can you please re-commission and deploy using hwe-v or higher?

Changed in maas:
assignee: nobody → Newell Jensen (newell-jensen)
Larry Michel (lmic) wrote :

clusterd.log attached.

Ryan Harper (raharper) wrote :

Hi Larry,

Can you attach the curtin configuration from maas?

maas <maasuser> node get-curtin-config <system id>

Changed in curtin:
importance: Undecided → Medium
status: New → Incomplete
Changed in maas:
status: New → Triaged
status: Triaged → In Progress
Larry Michel (lmic) wrote :

Ryan, I have attached the curtin config.

Larry Michel (lmic) wrote :

Updating the bug with the petitboot and BMC debug data requested by the developer.

This is the BMC info returned by ipmitool.

ubuntu@conserv:~$ ipmitool -I lanplus -H 192.168.224.118 -U ADMIN -P ADMIN mc info
Device ID : 32
Device Revision : 1
Firmware Revision : 2.13
IPMI Version : 2.0
Manufacturer ID : 0
Manufacturer Name : Unknown
Product ID : 43707 (0xaabb)
Product Name : Unknown (0xAABB)
Device Available : yes
Provides Device SDRs : no
Additional Device Support :
    Sensor Device
    SDR Repository Device
    SEL Device
    FRU Inventory Device
    IPMB Event Receiver
    IPMB Event Generator
    Chassis Device
Aux Firmware Rev Info :
    0xab
    0x66
    0x01
    0x00
ubuntu@conserv:~$ ipmitool -I lanplus -H 192.168.224.118 -U ADMIN -P ADMIN raw 0x00 0x09 0x05 0x00 0x00
 01 05 80 04 00 00 00
ubuntu@conserv:~$ ipmitool -I lanplus -H 192.168.224.118 -U ADMIN -P ADMIN raw 0x00 0x09 0x03 0x00 0x00
 01 03 1f
ubuntu@conserv:~$ dpkg -l |grep ipmitool

ii ipmitool 1.8.13-1ubuntu0.6 amd64 utility for IPMI control with kernel driver or LAN interface
ubuntu@conserv:~$

Also, this was executed in the petitboot command shell to enable more verbose logging; the logs are attached.

nvram --update-config=petitboot,debug?=true

One thing to note is that I dropped into the command shell to execute this command before PXE boot. I then PXE booted by selecting the NIC; after curtin finished installing, the system booted to local disk on its own and moved to the deployed state. After re-deploying, we hit the issue, and I collected the attached logs after the system made no attempt to PXE or local boot once the curtin installation completed.

Mike Rushton (leftyfb) wrote :

This also affects the IBM Habanero S812L TN71-BP012. It does not, however, affect the E850 (E8E) "Alpine" in PowerVM mode.

summary: - [1.9.1] Deployment for IBM S822LC 8335-GTA fails to boot local disk
- following curtin install
+ [1.9.1] Deployment for IBM S822LC 8335-GTA and S812L TN71-BP012 fails
+ to boot local disk following curtin install
Larry Michel (lmic) wrote :

It looks like the issue may be that when next boot is set to PXE, the setting never gets cleared. So on the following boot, next boot is still set to PXE, and so on.

This is the scenario that I tested:
1. Deploy system from MAAS
2. While system boots and before it gets to petitboot menu, do:
    a. Reset IPMI back to factory default from ASM web console.
    b. Power off system using ipmitool.
    c. Again with ipmitool, set next boot device to pxe and power system back on.
3. Observe system pxe booting, doing curtin install and rebooting.
4. When system hits petitboot, it does not attempt to PXE boot nor boot from disk as previously observed.

Here is the execution data:
================================================================================================
1. Deployed IBM S822LC 8335-GTA node from MAAS UI.
2. Used ipmitool to power off and power on after ipmi is restored to factory settings. This wipes out anything maas wrote through ipmi-chassis-config.
$ ipmitool -I lanplus -H 192.168.224.118 -U ADMIN -P admin raw 0x00 0x09 0x03 0x00 0x00
 01 03 00
$ ipmitool -I lanplus -H 192.168.224.118 -U ADMIN -P admin raw 0x00 0x09 0x05 0x00 0x00
 01 05 00 00 00 00 00
$ ipmitool -I lanplus -H 192.168.224.118 -U ADMIN -P admin chassis bootdev pxe
Set Boot Device to pxe
$ ipmitool -I lanplus -H 192.168.224.118 -U ADMIN -P admin raw 0x00 0x09 0x03 0x00 0x00
 01 03 00
$ ipmitool -I lanplus -H 192.168.224.118 -U ADMIN -P admin raw 0x00 0x09 0x05 0x00 0x00
 01 05 80 04 00 00 00
$ ipmitool -I lanplus -H 192.168.224.118 -U ADMIN -P admin chassis power on
Chassis Power Control: Up/On
3. I let the deployment proceed and monitored IPMI raw settings during the curtin installation and prior to system being rebooted.
$ ipmitool -I lanplus -H 192.168.224.118 -U ADMIN -P admin raw 0x00 0x09 0x05 0x00 0x00
 01 05 80 04 00 00 00
$ ipmitool -I lanplus -H 192.168.224.118 -U ADMIN -P admin raw 0x00 0x09 0x03 0x00 0x00
 01 03 00
4. After system hangs in petitboot, again queried raw boot settings using ipmitool:
$ ipmitool -I lanplus -H 192.168.224.118 -U ADMIN -P admin raw 0x00 0x09 0x05 0x00 0x00
 01 05 80 04 00 00 00
$ ipmitool -I lanplus -H 192.168.224.118 -U ADMIN -P admin raw 0x00 0x09 0x03 0x00 0x00
 01 03 00
================================================================================================
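
The raw responses above can be read by hand. A minimal sketch of a decoder for the parameter 5 (boot flags) response, assuming the byte layout of the IPMI 2.0 "Get System Boot Options" command as I understand it: parameter version, parameter selector, then the data bytes, with bit 7 of the first data byte meaning "boot flags valid", bit 6 meaning "persistent", and bits 5:2 of the second data byte selecting the device. The function name and the device labels are illustrative only:

```python
# Boot device selector values (bits 5:2 of the second data byte); partial list.
BOOT_DEVICES = {
    0b0000: "none",
    0b0001: "pxe",
    0b0010: "disk",
    0b0101: "cdrom",
    0b0110: "bios",
}


def decode_boot_flags(raw: str) -> dict:
    """Decode the hex bytes printed by `ipmitool raw 0x00 0x09 0x05 0x00 0x00`."""
    data = [int(b, 16) for b in raw.split()]
    # data[0] is the parameter version, data[1] the parameter selector.
    if len(data) < 4 or (data[1] & 0x7F) != 5:
        raise ValueError("not a boot flags (parameter 5) response")
    flags, selector = data[2], data[3]
    device = (selector >> 2) & 0x0F
    return {
        "valid": bool(flags & 0x80),       # override is armed
        "persistent": bool(flags & 0x40),  # False means next boot only
        "device": BOOT_DEVICES.get(device, "other (0x%x)" % device),
    }
```

Under those assumptions, `01 05 80 04 00 00 00` decodes to a valid, non-persistent PXE override, and `01 05 00 00 00 00 00` to no override at all. The same valid PXE value being returned before and after the reboot in steps 3 and 4 is consistent with the override never being cleared.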

So with MAAS essentially ruled out, I wanted to try a variation of this test. Basically, I retried steps 1 through 4 (with 2c modified to skip setting next boot to PXE). Without the next boot device set to PXE, I would wait for the petitboot menu, manually select the PXE boot device (its execute entry), then hit ENTER:

Here is the data:
================================================================================================
1. Again deployed the node from MAAS.
2. While the system was still booting but before it reached petitboot, I restored IPMI to factory settings from the web console. I then used ipmitool to power off and power on (taking MAAS out of the equation, as in the earlier test).
ubuntu@maas-trusty-back-may22:/usr/lib/python2.7/dist-packages/provisioningserver$ ipmitool -I lanplus -H 192.168.224.118 -U ADMIN -P admin chassis power off
Chassis Power...


Larry Michel (lmic) wrote :

My other observation is that NICs don't have execute entries in the petitboot menu following the curtin installation. Since the system is meant to boot from disk at that point, this may be by design. Nonetheless, it's worth noting since it could explain why there's no attempt to boot: it wants to PXE boot, but there's nothing to PXE boot from, so it sits there.

Newell Jensen (newell-jensen) wrote :

This has been shown to be a problem with the host firmware (petitboot) rather than with MAAS. While we wait for the fix to be released, there is a workaround that can be followed:

1) Boot order set to "Any Disk Device"
2) Create a curtin_userdata_ubuntu_ppc64el curtin preseed in /etc/maas/preseeds on your MAAS region controller that adds the lines below to the late commands. A preseed file needs to be created for each release, and it can be customized per system by adding the system name.
  ipmi_01_install_ipmitool: ['apt-get', '-y', 'install', 'ipmitool']
  ipmi_02_modprob_devintf: ['modprobe', 'ipmi_devintf']
  ipmi_03_ipmitool_bootdev_none: ['ipmitool', 'chassis', 'bootdev', 'none']

This will allow you to deploy with MAAS.

Changed in maas:
status: In Progress → Invalid
Newell Jensen (newell-jensen) wrote :

Set to invalid because this was not an issue with MAAS or curtin

Changed in curtin:
status: Incomplete → Invalid
Mike Rushton (leftyfb)
Changed in maas:
status: Invalid → Confirmed
Mike Rushton (leftyfb) wrote :

In further testing, I have found that this is in fact an issue with MAAS.

There are 2 issues that I have noticed in testing:

1. My understanding of how MAAS works is:
    - After phase 1 of deployment the machine reboots
    - The machine attempts to PXE to MAAS
    - MAAS offers up (in the case of x86) a grub config to boot to the local HDD
    - The machine boots to local HDD to continue on with phase 2 of deployment (curtin post-install)
   In the case of petitboot on openpower, after phase 1 of deployment, MAAS is not serving up anything via tftpboot that petitboot understands (pxelinux.0) so it will never see PXE as an option and not boot.

2. Since this issue has never been addressed with petitboot on openpower, I do not think MAAS knows how to talk to petitboot to tell it to boot to the local HDD only on the next reboot and not on subsequent boots.

To work around this temporarily (finish deployments) we can configure petitboot to boot to "any disk" as a backup when PXE is not an option.

Mike Rushton (leftyfb) wrote :
Mike Rushton (leftyfb) wrote :

Attached are the rackd.log (above) and the petitboot discovery log from the second phase of deployment. From what I can tell, MAAS might not be offering up PXE files to boot from.

Andres Rodriguez (andreserl) wrote :

Hi Mike,

Can you please do the following:

tail -f /var/log/maas/rackd.log

And then, 'deploy' the system. I'm interested in seeing the whole log through the process. That said, I looked at the discovery log above and:

 argv: '/usr/bin/tftp' '-g' '-l' '/tmp/pb-H5aAGS' '-r' '/ppc64el/pxelinux.cfg/01-e4-1d-2d-25-90-c1' '10.1.10.2' '69'
 argv: '/usr/bin/tftp' '-g' '-l' '/tmp/pb-5zDW50' '-r' '/ppc64el/pxelinux.cfg/0A010A5' '10.1.10.2' '69'
 argv: '/usr/bin/tftp' '-g' '-l' '/tmp/pb-7yBDiA' '-r' '/ppc64el/pxelinux.cfg/0A010A' '10.1.10.2' '69'
 argv: '/usr/bin/tftp' '-g' '-l' '/tmp/pb-Cf5kv9' '-r' '/ppc64el/pxelinux.cfg/0A010' '10.1.10.2' '69'
 argv: '/usr/bin/tftp' '-g' '-l' '/tmp/pb-c1d3HI' '-r' '/ppc64el/pxelinux.cfg/0A01' '10.1.10.2' '69'
 argv: '/usr/bin/tftp' '-g' '-l' '/tmp/pb-XlRLUh' '-r' '/ppc64el/pxelinux.cfg/0A0' '10.1.10.2' '69'
 argv: '/usr/bin/tftp' '-g' '-l' '/tmp/pb-Zl0u7Q' '-r' '/ppc64el/pxelinux.cfg/0A' '10.1.10.2' '69'
 argv: '/usr/bin/tftp' '-g' '-l' '/tmp/pb-bBtekq' '-r' '/ppc64el/pxelinux.cfg/0' '10.1.10.2' '69'
 argv: '/usr/bin/tftp' '-g' '-l' '/tmp/pb-5PnYwZ' '-r' '/ppc64el/pxelinux.cfg/default' '10.1.10.2' '69'

The above should only happen if the 'MAC' address was never found.
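
The request sequence in that log follows the usual PXELINUX-style lookup: a file named after the MAC address (prefixed with `01-`), then the client IP in uppercase hex with one digit dropped per attempt, and finally `default`. A minimal sketch that generates the candidate list, assuming that convention; the function name is hypothetical, and because the client IP is not given in the thread, `10.1.10.80` in the usage note is a made-up address chosen only because it matches the `0A010A5` prefix in the log:

```python
import ipaddress


def pxelinux_config_candidates(mac, ip, prefix="/ppc64el/pxelinux.cfg/"):
    """Return config paths in the order a PXELINUX-style client requests them."""
    # ARP hardware type 01 (Ethernet) followed by the dash-separated MAC.
    names = ["01-" + mac.lower().replace(":", "-")]
    # IPv4 address as 8 uppercase hex digits, then progressively shorter prefixes.
    hex_ip = "%08X" % int(ipaddress.IPv4Address(ip))
    names += [hex_ip[:n] for n in range(8, 0, -1)]
    names.append("default")
    return [prefix + name for name in names]
```

Calling `pxelinux_config_candidates("e4:1d:2d:25:90:c1", "10.1.10.80")` yields the MAC-based path first and the `default` path last. A client only walks past the first entry when the server has no config for its MAC address, which is the point made above.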

Andrew Cloke (andrew-cloke) wrote :

AIUI, in the standard x86/IPMI case, MAAS always requests that the node PXEBoot. MAAS supplies a grub bootloader to the node, and MAAS then selects whether the node continues to boot from the network or from local disk via a MAAS-supplied grub.conf.

Is this same procedure also used in the case of Power when petitboot is involved? If so, why isn't the MAAS server responding to the Power node's PXEBoot request?

Newell Jensen (newell-jensen) wrote :

For petitboot to work the MAC address should be found as Andres mentions above so this should be investigated first.

Andy, for petitboot, MAAS will supply a grub.conf to the booting node. Since PowerNV doesn't support the LOCALBOOT flag, we send an empty config. In this case, petitboot should be picking the first boot device in its boot order, which should be the newly installed system. If the firmware is not doing this, then that is an issue outside of MAAS.

HTH

Andrew Cloke (andrew-cloke) wrote :

Thanks Newell, and apologies - my posting crossed Andres'. As well as following the steps Andres has described, Mike is also going to try to use wireshark to help get an understanding of the packet flow between the MAAS server and the Power node.

Jeremy Kerr (jk-ozlabs) wrote :

> Andy, for petitboot, MAAS will supply a grub.conf to the booting node.
> Since powerNV doesn't support the LOCALBOOT flag we send an empty config.
> In this case, petitboot should be picking the first boot device in its
> boot order which should be the newly installed system.

Yes, and that behaviour should work fine with petitboot. Either the packet capture, or the petitboot logs (from /var/log/petitboot/) would help to determine what's going on here.

Jeff Lane  (bladernr)
tags: added: hwcert-server
removed: blocks-hwcert-server
Jeff Lane  (bladernr) wrote :

So what's the next step here? We're running into this still on this hardware in regression testing.

Newell Jensen (newell-jensen) wrote :

When MAAS team provided support for PowerNV (i.e. petitboot) it was under the agreement that the petitboot firmware would pick the first local disk to boot from when it gets an empty grub.cfg file from MAAS as PowerNV doesn't support the LOCALBOOT flag.

As mentioned above, have you guys gotten the packet capture as well as the petitboot logs?

Jeff Lane  (bladernr) wrote :

This will have to be done by Jason Hobbs or Chris Gregan, I no longer have the hardware. Chris has an S812LC and Jason has an S822LC, both of which are affected according to the info above.

Jeff Lane  (bladernr) wrote :

OK, I have access to the system... where would one find /var/log/petitboot/? Mike did all the hands-on, and since he's no longer with us, I'm having to learn this as I go, so apologies for not having much hands-on experience with these machines.

I deployed the node, and there is no /var/log/petitboot/ in the installed OS.

I also SSH'd into the BMC and there is no /var/log/petitboot/ present there either (limited busybox interface).

Likewise, there is no /var/log/petitboot/ on the MAAS server.

Newell asked for rackd.log during a deployment... it's attached.

This is Xenial deployed from MAAS 2.1.4+bzr5591-0ubuntu1

One thing I noticed was a lot of tracebacks in the rackd log, I don't think that's related to this bug though.

Jeff Lane  (bladernr) wrote :

Just to summarize, here is where I am right now. I can deploy, however, on deployment, I think it is hanging at petitboot.

Initially, it PXE boots and does the deployment, but on reboot, it hangs at the petitboot menu waiting for someone to choose a boot option. Currently this is the boot order:

Petitboot System Configuration
 ──────────────────────────────────────────────────────────────────────────────

  Boot Order (0) Any Network device
                 (1) Any Device:

                 [ Add Device: ]
                 [ Clear & Boot Any ]
                 [ Clear ]

  Timeout: 10 seconds

  Temporary IPMI boot option: Network
  Clear option: [ ]

  Network: ( ) DHCP on all active interfaces
                 (*) DHCP on a specific interface
                 ( ) Static IP configuration

  Device: (*) enP4p1s0f0 [98:be:94:01:0f:4c, link up]

 ──────────────────────────────────────────────────────────────────────────────
 tab=next, shift+tab=previous, x=exit, h=help

Mike Rushton (leftyfb) wrote :

Try clearing the config and, to start with, add in "Any Network Device" and whatever the terminology for "Any Hard Disk Device" is. See if that works. What you're showing should be working, but maybe "Any Device" is getting tripped up for some reason.

Newell Jensen (newell-jensen) wrote :

Just to clarify, it is not grub.cfg but pxelinux.cfg, as petitboot is its own bootloader. This is still sent back empty and the firmware should be booting from the first local disk by default as this was the agreement that was in place when PowerNV was implemented for MAAS.

Jeff Lane  (bladernr) wrote :

So clearing the config and setting it to "Any Disk" didn't work either. It hangs at the petitboot menu until I manually choose something.

I've re-opened the private bug for this and we'll just have to wait for an update...

For the record, I agree with Newell: at this point the issue is petitboot not doing what it's supposed to do.

I wanted to add current info here since the private bug references this one.

Next step, I suppose, will be to validate that this node has the patched firmware (provided that patched firmware is publicly available).

Jeff Lane  (bladernr) wrote :

Verified that this was fixed with updated firmware.

Changed in maas:
status: Confirmed → Invalid