kernel panic on IBM Power8 PPC MAAS ephemeral image

Bug #1425699 reported by Mike Rushton on 2015-02-25
This bug affects 4 people
Affects           Status   Importance   Assigned to   Milestone
linux (Ubuntu)             High         Unassigned
Trusty                     High         Unassigned

Bug Description

The kernel running on the 14.04 ephemeral image for MAAS when commissioning an IBM Power8 PPC server causes a periodic kernel panic.

This happens intermittently and cannot be reproduced on demand.

The kernel version on the image is 3.13.0-27.50-generic according to the boot process.

Attached is the boot process log with the kernel panics.

Mike Rushton (leftyfb) wrote :

This bug is missing log files that will aid in diagnosing the problem. From a terminal window please run:

apport-collect 1425699

and then change the status of the bug to 'Confirmed'.

If, due to the nature of the issue you have encountered, you are unable to run this command, please add a comment stating that fact and change the bug status to 'Confirmed'.

This change has been made by an automated script, maintained by the Ubuntu Kernel Team.

Changed in linux (Ubuntu):
status: New → Incomplete
Mike Rushton (leftyfb) wrote :

Sorry, due to the nature of the bug (kernel panic), I am not able to log in to or access the ephemeral image to grab any logs.

Changed in linux (Ubuntu):
status: Incomplete → Confirmed
Changed in linux (Ubuntu):
importance: Undecided → High
tags: added: kernel-key trusty
Joseph Salisbury (jsalisbury) wrote :

Was there a prior Trusty kernel that did not have the panic?

Also, does this happen with newer releases, such as Utopic or Vivid?

Joseph Salisbury (jsalisbury) wrote :

Looking at the commission.log, it appears that the h/w is reporting something is broken and that is not handled properly. Do you happen to know if this happens on more than one machine?

Joseph Salisbury (jsalisbury) wrote :

Also, you say this happens intermittently. If that is the case, we may be able to test other kernel versions if we can provision and login.

Joseph Salisbury (jsalisbury) wrote :

One additional note. The kernel version in the image is pretty out of date. The current version in -updates is 3.13.0-46.77. We may want to ask whoever maintains the images to create an image with a newer kernel for Trusty.
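To make the "out of date" comparison concrete, here is a minimal sketch that parses the two version strings quoted in this report and compares them. The helper names are hypothetical, and the parser assumes only the `X.Y.Z-ABI[.UPLOAD][-flavour]` shape seen in this bug.

```python
import re


def parse_kernel_version(version):
    """Split a version such as '3.13.0-27.50-generic' into integers.

    Returns e.g. (3, 13, 0, 27, 50). The upload number after the ABI
    is optional, so '3.16.0-31-generic' yields (3, 16, 0, 31).
    """
    match = re.match(r"^(\d+)\.(\d+)\.(\d+)-(\d+)(?:\.(\d+))?", version)
    if match is None:
        raise ValueError("unrecognised kernel version: %r" % version)
    return tuple(int(part) for part in match.groups() if part is not None)


def is_older(image_version, reference_version):
    """True when the image kernel sorts before the reference kernel."""
    return parse_kernel_version(image_version) < parse_kernel_version(reference_version)


image_kernel = "3.13.0-27.50-generic"   # reported by the ephemeral image
updates_kernel = "3.13.0-46.77"         # quoted as current in -updates
print(is_older(image_kernel, updates_kernel))
```

Tuple comparison is used so that ABI 46 sorts after ABI 27 numerically rather than as text.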

Joseph Salisbury (jsalisbury) wrote :

This looks very similar to bug 1354459 . I think we should really test a more recent kernel.

Mike Rushton (leftyfb) wrote :

I have tested using the daily images for MAAS 1.7.1

The kernel from this image is 3.13.0-46.77-generic

Still getting the kernel panic. See attached.

I am finding that the kernel panics occur only during the commissioning process. Once we get a good boot and commission (which is intermittent), we can deploy multiple different images without issue.

We have only the one Power 8 server in the certification lab that we can test with.

Joseph Salisbury (jsalisbury) wrote :

Thanks for the update, Mike. Does this panic only happen with the Trusty images? Is it possible for you to see if it is also happening with the Utopic and Vivid images?

Changed in linux (Ubuntu Trusty):
importance: Undecided → High
status: New → Confirmed
Joseph Salisbury (jsalisbury) wrote :

Comparing the commission logs from the old kernel and the current kernel, it seems the new kernel is now hitting a different issue.

The 3.13.0-46 kernel is getting a panic with an NIP of: power7_enter_nap_mode

The 3.13.0-27 kernel was dumping a trace on what looked like a H/W error:
 EEH: Frozen PE#5 detected on PHB#1

It would be good to know if Utopic and Vivid are getting a panic in the same way.
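Telling the two failure modes apart can be done mechanically from a console log. The following is a rough sketch, not a tool used in this bug: it assumes the `NIP [addr] symbol+0x...` and `EEH: Frozen PE#n detected on PHB#n` line shapes that appear in the logs quoted in this report, and the function name is hypothetical.

```python
import re

# Line shapes taken from the logs quoted in this report.
NIP_RE = re.compile(r"NIP \[[0-9a-f]+\] (\S+?)\+0x")
EEH_RE = re.compile(r"EEH: Frozen PE#(\d+) detected on PHB#(\d+)")


def classify(log_text):
    """Return a sorted list of distinct failure signatures found in a log."""
    signatures = set()
    for match in NIP_RE.finditer(log_text):
        # Strip the leading dot that ppc64 function symbols may carry.
        signatures.add("oops at " + match.group(1).lstrip("."))
    for match in EEH_RE.finditer(log_text):
        signatures.add("EEH frozen PE#%s on PHB#%s" % match.groups())
    return sorted(signatures)


sample = (
    "[ 0.195869] NIP [c000000001598930] power7_enter_nap_mode+0x0/0x18\n"
    "[ 14.997063] EEH: Frozen PE#5 detected on PHB#1\n"
)
for signature in classify(sample):
    print(signature)
```

Running this over the commission logs from Utopic and Vivid would show whether they fail with the same signature.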

tags: added: kernel-da-key
removed: kernel-key
Mike Rushton (leftyfb) wrote :

Using the hwe-u kernel, which I think is 3.16.0-31-generic, seems to have resolved the kernel panicking.

Frédéric Bonnard (frediz) wrote :

I also got this on the latest Ubuntu 14.04; it loops over this indefinitely:

[ 0.193973]
[ 0.194023] Oops: Exception in kernel mode, sig: 4 [#3]
[ 0.194128] SMP NR_CPUS=2048 NUMA PowerNV
[ 0.194225] Modules linked in:
[ 0.194316] CPU: 3 PID: 0 Comm: swapper/3 Tainted: G D 3.13.0-48-generic #80-Ubuntu
[ 0.194403] task: c0000007f26957c0 ti: c0000007f2728000 task.ti: c0000007f2728000
[ 0.194477] NIP: c000000001598930 LR: c00000000001897c CTR: c00000000002abfc
[ 0.194551] REGS: c0000007f272b800 TRAP: 0e40 Tainted: G D (3.13.0-48-generic)
[ 0.194651] MSR: 9000000000081001 <SF,HV,ME,LE> CR: 22004088 XER: 00000000
[ 0.194788] CFAR: c00000000002ace4 SOFTE: 0
GPR00: 0000000000000000 c0000007f272ba80 c000000001650610 0000000022004088
GPR04: 0000000000000001 9000000000001001 9000000000001031 c000000001598930
GPR08: 0000000000000000 9000000000001033 c00000000002abfc 00000000ffff8ae2
GPR12: 0000000022004028 c00000000fe40a80 c0000007f272bf90 0000000000000000
GPR16: 0000000000000000 0000000000000000 0000000000000000 0000000000000000
GPR20: c0000000009a18d0 0000000000000001 c00000000150e367 c00000000150f708
GPR24: c0000007f2728080 c0000000016e4a80 c0000000016e4f0c 0000000000000008
GPR28: c0000007f2728000 0000000000000003 9000000000009033 c000000001708190
[ 0.195869] NIP [c000000001598930] power7_enter_nap_mode+0x0/0x18
[ 0.196001] LR [c00000000001897c] .arch_cpu_idle+0x6c/0x160
[ 0.196092] Call Trace:
[ 0.196143] [c0000007f272bd70] [c00000000001897c] .arch_cpu_idle+0x6c/0x160
[ 0.196285] [c0000007f272bdf0] [c000000000105c94] .cpu_startup_entry+0x1d4/0x300
[ 0.196462] [c0000007f272bec0] [c000000000040084] .start_secondary+0x344/0x380
[ 0.196626] [c0000007f272bf90] [c000000000009a6c] .start_secondary_prolog+0x10/0x14
[ 0.196787] Instruction dump:
[ 0.196859] XXXXXXXX XXXXXXXX XXXXXXXX XXXXXXXX XXXXXXXX XXXXXXXX XXXXXXXX XXXXXXXX
[ 0.197109] XXXXXXXX XXXXXXXX XXXXXXXX XXXXXXXX XXXXXXXX XXXXXXXX XXXXXXXX XXXXXXXX

See the attached log for more.

Jeff Lane (bladernr) wrote :

Added this to maas-images. From chats with Mike, I believe that the Vivid ephemerals are stable on Power8.

I suspect (I have not looked too deeply into this, so this is just a guess for now) that the ephemerals are built at GA time, and that after that point only the filesystem tarball is rebuilt for the point releases and daily images.

So even at 14.04.2, you're installing a 14.04.2 filesystem but using a 14.04.0 ephemeral to do so.

Unfortunately, we cannot use Vivid ephemerals to certify Trusty, so we need a full stack of Trusty images for the Power machines.

This bug is currently gating PowerNV certification so I consider it a critical bug.

Scott Moser (smoser) wrote :

Jeff,
  If you mark a system in maas as 'generic/hwe-u', then you will boot installation with the hwe-u kernel, the install will install the hwe-u kernel, and everything is happy.

generally speaking my experience with power is you need hwe-u. hwe-t kernels are not stable on ppc64el.

Scott Moser (smoser) wrote :

You can update the node in MAAS by setting:
 architecture=ppc64el/hwe-u

with maas node update or in the gui.
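As an illustration only, the sketch below assembles that update as a command line. It assumes the MAAS 1.x CLI form `maas <profile> node update <system-id> key=value`; the profile name and system ID are hypothetical placeholders, and the exact syntax should be checked against the CLI's own help output for the installed MAAS version.

```python
import shlex


def build_node_update_command(profile, system_id, architecture):
    """Build the argv for setting a node's architecture/subarchitecture.

    Assumption: MAAS 1.x style 'maas <profile> node update <id> key=value'.
    """
    return ["maas", profile, "node", "update", system_id,
            "architecture=" + architecture]


# 'my-profile' and 'node-0000' are placeholders, not values from this bug.
cmd = build_node_update_command("my-profile", "node-0000", "ppc64el/hwe-u")
print(" ".join(shlex.quote(part) for part in cmd))
```

Building the argument list rather than a single string keeps the value safe to hand to `subprocess.run` without shell quoting problems.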

Jeff Lane (bladernr) wrote :

Ok... forgive my ignorance, but will setting that also apply to the commissioning and enlistment phases? Those seem to be the most problematic: we're only able to boot and commission about 50% of the time or so.

Scott Moser (smoser) wrote :

It should then be used for commissioning on that machine.
I don't know that there is a way to set the default arch/subarch for commissioning or enlistment.

Jeff Lane (bladernr) wrote :

Summary:
1: Need to add 14.10 images from Releases Stream to even see hwe-u as an option
2: Even though hwe-u is an option, it is currently unusable for commissioning (from Releases).
3: We really need this usable in Releases, it's gating the Power 8 work.

This morning I tried working on this a bit. Keep in mind, I did this on a non-Power system because I don't have easy access to it right now, and Mike is working on that.

First, I had to install 14.10 installation media from the Releases stream. Once that was done, hwe-u became an option.

Next, I clicked acquire node. Then Edit node.

In the Edit screen, I set the ARCH to amd64/hwe-u, saved that and then clicked Commission to start the Commissioning process.

The node powered up and failed to boot. The console says

Booting under MAAS direction...
nomodeset iscsi_target_name=iqn.2004-05.com.ubuntu:maas:ephemeral-ubuntu-amd64-hwe-u-trusty-no-such-image iscsi_target_ip=10.0.0.1 iscsi_target_port=3260 iscsi_initiator=these-grandmother ip=::::these-grandmother:BOOTIF ro root=/dev/disk/by-path/ip-10.0.0.1:3260-iscsi-iqn.2004-05.com.ubuntu:maas:ephemeral-ubuntu-amd64-hwe-u-trusty-no-such-image-lun-1 overlayroot=tmpfs cloud-config-url=http://10.0.0.1/MAAS/metadata/latest/by-id/node-d4c27b9-d7ee-11e4-8a03-eca86bfb9f66/?op=get_preseed log_host=10.0.0.1 log_port=514
Could not find kernel image: ubuntu/amd64/hwe-u/trusty/no-such-image/boot-kernel

boot:

So since setting the arch to amd64/hwe-u failed to boot commissioning, I changed it to amd64/hwe-t and retried. hwe-t was successful in booting the server and commissioning it.
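The failing path on the console can be split into its components to make the missing piece visible. This is a minimal sketch: the field names (os/arch/subarch/release/label/file) are an interpretation of the one path shown in this comment, not a documented MAAS contract, and it assumes the literal label `no-such-image` is what appears when no matching image was found.

```python
def parse_boot_kernel_path(path):
    """Split a boot path into named fields.

    The field names are assumptions inferred from the single path in this
    report; they are not taken from MAAS documentation.
    """
    parts = path.strip().split("/")
    if len(parts) != 6:
        raise ValueError("expected 6 path components, got %d" % len(parts))
    keys = ("os", "arch", "subarch", "release", "label", "file")
    return dict(zip(keys, parts))


def image_missing(path):
    """Return True when the label is the placeholder seen in the failed boot."""
    return parse_boot_kernel_path(path)["label"] == "no-such-image"


failed = "ubuntu/amd64/hwe-u/trusty/no-such-image/boot-kernel"
fields = parse_boot_kernel_path(failed)
print(fields["subarch"], fields["release"], image_missing(failed))
```

Read this way, the request was for an hwe-u kernel on trusty, and the placeholder label suggests no such image had been imported from the Releases stream.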

Following commissioning, I attempted to just install the node using hwe-u after it was commissioned using hwe-t.

I clicked acquire node. Then Edit Node.

On the Edit Page, I set the arch to amd64/hwe-u, and left the OS and Release to Default (which is currently set to 14.04). I then tried to start the node for deployment.

The node powered on and attempted to TFTP boot, but instead got a TFTP timeout error. It then automatically moved to the next NIC and again got the TFTP timeout, and at this point it just booted what was on the hard disk.

Could not find kernel image: ubuntu/amd64/hwe-u/trusty/no-such-image/boot-kernel

Next, for completeness, I re-acquired the node, left arch at hwe-u and set the OS to Ubuntu and Release to 14.10 to see what would happen.

With those settings, I was able to successfully deploy the system with Utopic.

Mike Rushton (leftyfb) wrote :

I can confirm that after enabling the 14.10 images in MAAS, I now have
hwe-u as an option. I can also confirm that this does not in fact work
in the case of the Power 8 when commissioning:

Error: Couldn't load kernel image

Jeff Lane (bladernr) wrote :

Next, I deleted all the images I had and switched over to the Daily stream. First, I downloaded ONLY 14.04 amd64.

Once that was ready, I ensured the Arch was set to amd64/hwe-u and again attempted commissioning.

So, using only the 14.04 images from Daily, I was able to successfully commission using the hwe-u kernel, according to the boot messages, which said it was booting ephemeral-ubuntu-amd64-hwe-u-trusty-daily.

To verify this, I then deployed with the node set to amd64/hwe-u, Default OS, Default Release (14.04LTS). After deployment I confirmed that it was running what appears to be 14.04.2 with a 3.16 kernel (3.16.0-37).

Jeff Lane (bladernr) wrote :

Per Mike's testing, this is now resolved by pushing more recent images to Releases that enable hwe-u in Trusty for PPC64EL.

So from the initial bug point of view, this is fixed and could be marked as such.

However, as this also affects everything that's NOT ppc64el, I'd suggest leaving it open until such time as the images have been tested and pushed for the other archs as well.

Haw Loeung (hloeung) wrote :

We ran into this while setting up an OpenStack cloud using IBM POWER8 machines[1]. Unfortunately, hwe-v for Trusty isn't available (LP:1504066) but hwe-u does indeed fix the "Oops: Exception in kernel mode, sig: 4 [#47]" boot loop we ran into.

Since it's been advised here and in several other bugs that POWER8/ppc64el doesn't work with the main Trusty 3.13 kernel, perhaps we could make MAAS images default to hwe-u and above for ppc64el?

[1]http://paste.ubuntu.com/12713436/

Haw Loeung (hloeung) on 2015-10-10
Changed in maas-images:
status: New → Confirmed
Scott Moser (smoser) wrote :

I've dropped the maas-images task as that is being addressed under bug 1508565

no longer affects: maas-images
Stefan Bader (smb) wrote :

I will not change the status of this bug because I cannot reliably rule out that the initial issue still exists. That issue was reported as intermittent and, judging from the stack trace, involved some locking issue in the PCI SCSI adapter's driver.
Since comment #9, however, the reported stack trace shows a completely different issue. This turned out to be a regression since 3.13.0-46, and in order to cleanly separate the problems I opened LP: #1589910 to track the fix for it. As soon as that fix is available, there should be no need to enforce a HWE kernel for commissioning and/or provisioning of ppc64el hosts.

Should the initially reported problem persist after that, please report back here so we can try to fix that, too. Note that the stack trace should involve the IPR adapter in some way, like:

[ 14.995435] ipr: IBM Power RAID SCSI Device Driver version: 2.6.0 (November 16, 2012)
[ 14.995630] ipr 0001:04:00.0: Found IOA with IRQ: 0
[ 14.995828] ipr 0001:04:00.0: Using 64-bit DMA iommu bypass
[ 14.996992] pnv_pci_dump_phb_diag_data: Unrecognized ioType 33554432
[ 14.997063] EEH: Frozen PE#5 detected on PHB#1
[ 14.997124] CPU: 11 PID: 911 Comm: systemd-udevd Not tainted 3.13.0-27-generic #50-Ubuntu
[ 14.997207] Call Trace:
[ 14.997242] [c000000fe325af70] [c000000000016af0] .show_stack+0x170/0x290 (unreliable)
[ 14.997340] [c000000fe325b060] [c000000000966fc0] .dump_stack+0x88/0xb4
[ 14.997427] [c000000fe325b0e0] [c0000000000364b0] .eeh_dev_check_failure+0x430/0x480
[ 14.997526] [c000000fe325b190] [c000000000036584] .eeh_check_failure+0x84/0xe0
[ 14.997630] [c000000fe325b220] [d00000000ef233e0] .ipr_mask_and_clear_interrupts+0x190/0x1d0 [ipr]
[ 14.997747] [c000000fe325b2d0] [d00000000ef2a394] .ipr_probe_ioa+0xc24/0x1370 [ipr]
[ 14.997857] [c000000fe325b400] [d00000000ef325c4] .ipr_probe+0x44/0x4c0 [ipr]
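Whether a given trace "involves the IPR adapter" can be checked by looking for frames tagged with the `[ipr]` module. The sketch below is only an assumption-laden helper (hypothetical function name, frame format inferred from the trace above), not something used in triaging this bug.

```python
import re

# A call-trace frame looks like '] .symbol+0xOFF/0xLEN' optionally
# followed by ' [module]' when the symbol lives in a loadable module.
FRAME_RE = re.compile(r"\] \.?(\S+?)\+0x[0-9a-f]+/0x[0-9a-f]+(?: \[(\w+)\])?")


def modules_in_trace(log_text):
    """Return the set of module names tagged on call-trace frames."""
    modules = set()
    for match in FRAME_RE.finditer(log_text):
        if match.group(2):
            modules.add(match.group(2))
    return modules


trace = (
    "[ 14.997242] [c000000fe325af70] [c000000000016af0] .show_stack+0x170/0x290 (unreliable)\n"
    "[ 14.997526] [c000000fe325b190] [c000000000036584] .eeh_check_failure+0x84/0xe0\n"
    "[ 14.997747] [c000000fe325b2d0] [d00000000ef2a394] .ipr_probe_ioa+0xc24/0x1370 [ipr]\n"
)
print("ipr" in modules_in_trace(trace))
```

A trace from the separate nap-mode regression has no module-tagged frames at all, so this check returns an empty set for it.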

Stefan Bader (smb) on 2016-06-07
Changed in linux (Ubuntu):
status: Confirmed → Fix Released