kernel crash on m400 arm64 server cartridge

Bug #1502946 reported by Scott Moser on 2015-10-05
This bug affects 2 people
Affects          Status   Importance   Assigned to   Milestone
linux (Ubuntu)            Critical     Tim Gardner
Wily                      Critical     Tim Gardner

Bug Description

We've made some progress on bug 1499869, such that on Friday I actually had successful installs.
An attempt at an install today shows a kernel stack trace early in the boot; the initramfs then finds no network devices and so fails to mount the iSCSI root.

The trace looks like this:
[ 0.000000] ------------[ cut here ]------------
[ 0.000000] WARNING: CPU: 0 PID: 0 at /build/linux-Nwr5zx/linux-4.2.0/arch/arm64/mm/numa.c:449 numa_init+0x90/0x398()
[ 0.000000] Modules linked in:
[ 0.000000] CPU: 0 PID: 0 Comm: swapper Not tainted 4.2.0-14-generic #16-Ubuntu
[ 0.000000] Hardware name: HP ProLiant m400 Server Cartridge (DT)
[ 0.000000] Call trace:
[ 0.000000] [<ffffffc00008a6e8>] dump_backtrace+0x0/0x150
[ 0.000000] [<ffffffc00008a858>] show_stack+0x20/0x30
[ 0.000000] [<ffffffc00086cc7c>] dump_stack+0x7c/0x98
[ 0.000000] [<ffffffc0000bd900>] warn_slowpath_common+0xa0/0xe0
[ 0.000000] [<ffffffc0000bda70>] warn_slowpath_null+0x38/0x50
[ 0.000000] [<ffffffc000c05b40>] numa_init+0x8c/0x398
[ 0.000000] [<ffffffc000c05e7c>] arm64_numa_init+0x30/0x40
[ 0.000000] [<ffffffc000c04b24>] bootmem_init+0x60/0x104
[ 0.000000] [<ffffffc000c052a0>] paging_init+0x198/0x224
[ 0.000000] [<ffffffc000c01e98>] setup_arch+0x274/0x5f8
[ 0.000000] [<ffffffc000bfe6d0>] start_kernel+0xdc/0x3f4
[ 0.000000] ---[ end trace f24b6c88ae00fa9a ]---
---
AlsaDevices: Error: command ['ls', '-l', '/dev/snd/'] failed with exit code 2: ls: cannot access /dev/snd/: No such file or directory
AplayDevices: Error: [Errno 2] No such file or directory
ApportVersion: 2.14.1-0ubuntu3.15
Architecture: arm64
ArecordDevices: Error: [Errno 2] No such file or directory
CRDA: Error: [Errno 2] No such file or directory
CurrentDmesg: [ 33.356615] init: plymouth-upstart-bridge main process ended, respawning
DistroRelease: Ubuntu 14.04
IwConfig:
 lo no wireless extensions.

 eth1 no wireless extensions.

 eth0 no wireless extensions.
Lsusb: Error: command ['lsusb'] failed with exit code 1: unable to initialize libusb: -99
Package: linux (not installed)
PciMultimedia:

ProcEnviron:
 TERM=screen
 PATH=(custom, no user)
 XDG_RUNTIME_DIR=<set>
 LANG=en_US.UTF-8
 SHELL=/bin/bash
ProcFB:

ProcKernelCmdLine: console=ttyS0,9600n8r ro
ProcVersionSignature: User Name 3.13.0-63.103-generic 3.13.11-ckt25
RelatedPackageVersions:
 linux-restricted-modules-3.13.0-63-generic N/A
 linux-backports-modules-3.13.0-63-generic N/A
 linux-firmware 1.127.15
RfKill: Error: [Errno 2] No such file or directory
Tags: trusty uec-images
Uname: Linux 3.13.0-63-generic aarch64
UpgradeStatus: No upgrade log present (probably fresh install)
UserGroups: adm audio cdrom dialout dip floppy netdev plugdev sudo video
_MarkForUpload: True

Scott Moser (smoser) wrote :
no longer affects: cloud-init

This bug is missing log files that will aid in diagnosing the problem. From a terminal window please run:

apport-collect 1502946

and then change the status of the bug to 'Confirmed'.

If, due to the nature of the issue you have encountered, you are unable to run this command, please add a comment stating that fact and change the bug status to 'Confirmed'.

This change has been made by an automated script, maintained by the Ubuntu Kernel Team.

Changed in linux (Ubuntu):
status: New → Incomplete
Scott Moser (smoser) on 2015-10-05
Changed in linux (Ubuntu):
status: Incomplete → Confirmed

apport information

tags: added: apport-collected trusty uec-images
description: updated

Scott Moser (smoser) wrote :

marked as confirmed.
I collected the info while running the trusty kernel, as the current wily kernel won't boot.

Craig Magina (craig.magina) wrote :

ARM64 NUMA was added and enabled in 4.2.0-13.15, which apparently breaks on the m400.

Craig Magina (craig.magina) wrote :

According to the trace, the stack dump comes from this check:
 if (WARN_ON(nodes_empty(node_possible_map)))

Craig Magina (craig.magina) wrote :

Here is a bit more of that code for context:

static int __init numa_register_memblks(struct numa_meminfo *mi)
{
 unsigned long uninitialized_var(pfn_align);
 int i, nid;

 /* Account for nodes with cpus and no memory */
 node_possible_map = numa_nodes_parsed;
 numa_nodemask_from_meminfo(&node_possible_map, mi);
 if (WARN_ON(nodes_empty(node_possible_map)))
  return -EINVAL;
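
To see why this check can fire, here is a minimal userspace sketch of the logic. It is not the kernel code: it assumes a plain 64-bit mask in place of the kernel's nodemask_t, and the names `model_nodemask`, `model_nodes_empty` and `model_register_memblks` are hypothetical. If neither the firmware description nor the memory blocks contribute any node, the mask stays empty and the function returns -EINVAL, which corresponds to the WARN_ON path seen in the trace.

```c
#include <errno.h>
#include <stdint.h>

/* Hypothetical stand-in for the kernel's nodemask_t: one bit per NUMA node. */
typedef uint64_t model_nodemask;

/* Models nodes_empty(): true when no node bit is set. */
int model_nodes_empty(model_nodemask mask)
{
	return mask == 0;
}

/*
 * Models the start of numa_register_memblks(): the possible-node map is the
 * union of the nodes parsed from firmware and the nodes that own memory.
 */
int model_register_memblks(model_nodemask nodes_parsed,
			   model_nodemask nodes_with_memory)
{
	model_nodemask possible = nodes_parsed | nodes_with_memory;

	if (model_nodes_empty(possible))
		return -EINVAL;	/* the kernel additionally emits a WARN here */
	return 0;
}
```

Calling `model_register_memblks(0, 0)` returns -EINVAL, while any non-zero argument makes it return 0.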

Scott Moser (smoser) on 2015-10-06
Changed in linux (Ubuntu):
importance: Undecided → Critical
Joseph Salisbury (jsalisbury) wrote :

Can you see if the crash still occurs if you boot with the 4.2.0-12.14 kernel? It can be downloaded from:
https://launchpad.net/ubuntu/+source/linux/4.2.0-12.14

tags: added: kernel-key
Craig Magina (craig.magina) wrote :

The crash does not reproduce when running with kernel 4.2.0-12.14.

Tim Gardner (timg-tpi) wrote :

Craig - shall I revert the arm64 NUMA patches?

Changed in linux (Ubuntu Wily):
assignee: nobody → Tim Gardner (timg-tpi)
status: Confirmed → In Progress
dann frazier (dannf) wrote :

@Tim: Please do - the dtb interface for numa is still in discussion, so I think we can expect the OS/fw interface to change anyway.

Tim Gardner (timg-tpi) wrote :

@dann - check tip of Wily master-next to review my reverts.

dann frazier (dannf) wrote :

fyi, though I do see this backtrace when booting up an X-Gene, in my case it isn't fatal.

Rather, I'm hitting an issue later on where modules cannot load, and *that* is fatal. It persists even w/ the numa patches removed.

Loading, please wait...
[ 2.144740] systemd-udevd[128]: starting version 225
[ 2.150901] random: systemd-udevd urandom read with 1 bits of entropy availae
[ 2.192721] module gpio_xgene_sb: unsupported RELA relocation: 275
[ 2.193609] module xgene_enet: unsupported RELA relocation: 275
[ 2.249402] module libahci: unsupported RELA relocation: 275
[ 2.249628] module xgene_enet: unsupported RELA relocation: 275
[ 2.359451] module xgene_enet: unsupported RELA relocation: 275
[ 2.389444] module xgene_enet: unsupported RELA relocation: 275
[ 3.473766] module linear: unsupported RELA relocation: 275
[ 3.543252] module multipath: unsupported RELA relocation: 275
[ 3.593268] module raid0: unsupported RELA relocation: 275
[ 3.663695] module raid1: unsupported RELA relocation: 275
[ 3.713964] module raid6_pq: unsupported RELA relocation: 275
[ 3.763983] module raid6_pq: unsupported RELA relocation: 275
[ 3.803975] module raid6_pq: unsupported RELA relocation: 275
[ 3.853881] module raid10: unsupported RELA relocation: 275
[ 3.924962] module raid6_pq: unsupported RELA relocation: 275

dann frazier (dannf) wrote :

This appears to be the regression. Rebuilding the kernel w/ CONFIG_ARM64_ERRATUM_843419=n works around it.

commit bf0cdf4bfb785129798bc42d6ee8e2558f0934da
Author: Will Deacon <email address hidden>
Date: Tue Mar 17 12:15:02 2015 +0000

    arm64: errata: add module build workaround for erratum #843419

    commit df057cc7b4fa59e9b55f07ffdb6c62bf02e99a00 upstream.
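
For context on the log lines above: in the AArch64 ELF ABI, relocation type 275 is R_AARCH64_ADR_PREL_PG_HI21, the page-relative relocation carried by an ADRP instruction. The sketch below is a simplified model, not the kernel's module loader. It assumes, as a reading of the workaround commit, that a kernel built with the erratum option refuses ADRP page relocations in modules; `model_reloc_supported` and its `erratum_843419` flag are hypothetical names.

```c
#include <stdbool.h>

/* AArch64 ELF relocation numbers (values from the AArch64 ELF ABI). */
enum {
	MODEL_R_AARCH64_ADR_PREL_LO21       = 274,
	MODEL_R_AARCH64_ADR_PREL_PG_HI21    = 275,
	MODEL_R_AARCH64_ADR_PREL_PG_HI21_NC = 276,
	MODEL_R_AARCH64_ADD_ABS_LO12_NC     = 277,
};

/*
 * Returns true if this simplified loader would accept the relocation.
 * Assumption: with the erratum workaround enabled, ADRP page relocations
 * are rejected because modules are expected to be built without ADRP.
 */
bool model_reloc_supported(unsigned int type, bool erratum_843419)
{
	switch (type) {
	case MODEL_R_AARCH64_ADR_PREL_PG_HI21:
	case MODEL_R_AARCH64_ADR_PREL_PG_HI21_NC:
		return !erratum_843419;
	case MODEL_R_AARCH64_ADR_PREL_LO21:
	case MODEL_R_AARCH64_ADD_ABS_LO12_NC:
		return true;
	default:
		return false;	/* relocation types not covered by this sketch */
	}
}
```

With the flag set, type 275 is rejected, which is consistent with the "unsupported RELA relocation: 275" messages; with the flag cleared, as with CONFIG_ARM64_ERRATUM_843419=n, the same relocation is accepted.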

Tim Gardner (timg-tpi) wrote :

So, no need to revert the NUMA patches?

Tim Gardner (timg-tpi) wrote :

UBUNTU: [Config] CONFIG_ARM64_ERRATUM_843419=n

Changed in linux (Ubuntu Wily):
status: In Progress → Fix Committed
Launchpad Janitor (janitor) wrote :

This bug was fixed in the package linux - 4.2.0-16.19

---------------
linux (4.2.0-16.19) wily; urgency=low

  [ Tim Gardner ]

  * Release Tracking Bug
    - LP: #1504143
  * [Config] CONFIG_X86_LEGACY_VM86=y, CONFIG_VM86=y for i386
    - LP: #1499089
  * [Config] CONFIG_MODIFY_LDT_SYSCALL=y
    - LP: #1499089
  * SAUCE: intel_pstate: Allow manually forcing the use of HWP on Skylake-S
  * [Config] CONFIG_ARM64_ERRATUM_843419=n
    - LP: #1502946
  * [Config] CONFIG_CAVIUM_ERRATUM_22375=y, CONFIG_CAVIUM_ERRATUM_23154=y

  [ Christophe Lombard ]

  * SAUCE: (noup) cxl: Fix number of allocated pages in SPA
    - LP: #1499849

  [ Matthew R. Ochs ]

  * SAUCE: (noup) cxlflash: Fix to avoid corrupting port selection mask

  [ Robert Richter ]

  * SAUCE: (noup) irqchip/gicv3-its: Add range check for number of
    allocated pages
  * SAUCE: (noup) irqchip/gicv3: Workaround for Cavium ThunderX erratum
    23154
  * SAUCE: (noup) irqchip/gicv3-its: Read typer register outside the loop
  * SAUCE: (noup) irqchip/gicv3-its: Add HW revision detection and
    configuration
  * SAUCE: (noup) irqchip/gicv3-its: Workaround for Cavium ThunderX errata
    22375, 24313

  [ Upstream Kernel Changes ]

  * x86/compat: Define ARCH_WANT_OLD_COMPAT_IPC only for 32-bit compat
    - LP: #1499089
  * x86/compat: Clean up HAVE_UID16 config
    - LP: #1499089
  * x86/compat: Separate ia32 and x32 compat ABIs
    - LP: #1499089
  * x86/entry/vm86: Clean up saved_fs/gs
    - LP: #1499089
  * x86/entry/vm86: Preserve 'orig_ax'
    - LP: #1499089
  * x86/entry/vm86: Move userspace accesses to do_sys_vm86()
    - LP: #1499089
  * x86/kconfig/32: Rename CONFIG_VM86 and default it to 'n'
    - LP: #1499089
  * x86/ldt: Make modify_ldt() optional
    - LP: #1499089
  * x86/vm86: Move vm86 fields out of 'thread_struct'
    - LP: #1499089
  * x86/vm86: Move fields from 'struct kernel_vm86_struct' to 'struct vm86'
    - LP: #1499089
  * x86/vm86: Eliminate 'struct kernel_vm86_struct'
    - LP: #1499089
  * x86/vm86: Use the normal pt_regs area for vm86
    - LP: #1499089
  * x86/vm86: Move the vm86 IRQ definitions to vm86.h
    - LP: #1499089
  * x86/vm86: Clean up vm86.h includes
    - LP: #1499089
  * x86/vm86: Rename vm86->vm86_info to user_vm86
    - LP: #1499089
  * x86/vm86: Rename vm86->v86flags and v86mask
    - LP: #1499089
  * x86/selftests, x86/vm86: Improve entry_from_vm86 selftest
    - LP: #1499089
  * selftests/x86/vm86: Fix entry_from_vm86 test on 64-bit kernels
    - LP: #1499089
  * x86/vm86: Block non-root vm86(old) if mmap_min_addr != 0
    - LP: #1499089
  * x86/vm86: Fix the misleading CONFIG_VM86 Kconfig help text
    - LP: #1499089
  * netfilter: conntrack: use nf_ct_tmpl_free in CT/synproxy error paths
    - LP: #1503902

linux (4.2.0-15.18) wily; urgency=low

  [ Tim Gardner ]

  * Release Tracking Bug
    - LP: #1503692

  [ Andy Whitcroft ]

  * Revert "SAUCE: aufs3: mmap: Fix races in madvise_remove() and sys_msync()"
    Was incorrectly backported.

  [ Ben Hutchings ]

  * SAUCE: aufs3: mmap: Fix races in madvise_remove() and sys_msync()
    - CVE-2015-7312

  [ Tim Gardner ]

  * [Debian] config-check and prepare using ${DEBIAN}/config/annotations
...


Changed in linux (Ubuntu Wily):
status: Fix Committed → Fix Released
Luis Henriques (henrix) wrote :

This bug is awaiting verification that the kernel in -proposed solves the problem. Please test the kernel and update this bug with the results. If the problem is solved, change the tag 'verification-needed-wily' to 'verification-done-wily'.

If verification is not done within 5 working days from today, this fix will be dropped from the source code, and this bug will be closed.

See https://wiki.ubuntu.com/Testing/EnableProposed for documentation on how to enable and use -proposed. Thank you!

tags: added: verification-needed-wily
tags: added: verification-done-wily
removed: verification-needed-wily