arm64 soft lock crashes on nova-compute charm running

Bug #1775732 reported by Ryan Finnie
This bug affects 4 people
Affects          Status     Importance  Assigned to  Milestone
linux (Ubuntu)   Confirmed  High        Unassigned
  Bionic         Confirmed  High        Unassigned

Bug Description

Discovered on bionic, arm64 (Moonshot, verified on multiple swirlix cartridges), 4.15.0-22-generic.

After deploying the nova-compute Juju charm, on subsequent reboots everything freezes within a few seconds of boot completing, and the serial console eventually displays the following (just these lines, no traces):

[ 188.010510] watchdog: BUG: soft lockup - CPU#1 stuck for 23s! [juju-log:2272]
[ 216.010292] watchdog: BUG: soft lockup - CPU#1 stuck for 23s! [juju-log:2272]

(From here on, "lock up" refers to that sequence: boot a kernel, it completes boot to login prompt, then everything freezes a few seconds later, then BUGs.)
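For anyone collecting serial console logs from several cartridges, a small parser can pull out which CPU and which process each watchdog report names. This is only a sketch: the pattern is inferred from the two lines quoted above, and the function name `parse_soft_lockup` and its field names are made up for illustration.

```python
import re

# Pattern inferred from the two console lines quoted in this report;
# other kernels may format the message differently.
SOFT_LOCKUP = re.compile(
    r"\[\s*(?P<uptime>\d+\.\d+)\]\s+watchdog: BUG: soft lockup - "
    r"CPU#(?P<cpu>\d+) stuck for (?P<stuck>\d+)s! "
    r"\[(?P<comm>.+):(?P<pid>\d+)\]"
)


def parse_soft_lockup(line):
    """Return the fields of a soft lockup line, or None if it does not match."""
    match = SOFT_LOCKUP.search(line)
    if match is None:
        return None
    return {
        "uptime": float(match.group("uptime")),        # seconds since boot
        "cpu": int(match.group("cpu")),
        "stuck_seconds": int(match.group("stuck")),
        "comm": match.group("comm"),                   # process name
        "pid": int(match.group("pid")),
    }


console_lines = [
    "[ 188.010510] watchdog: BUG: soft lockup - CPU#1 stuck for 23s! [juju-log:2272]",
    "[ 216.010292] watchdog: BUG: soft lockup - CPU#1 stuck for 23s! [juju-log:2272]",
]
events = [parse_soft_lockup(line) for line in console_lines]
```

Grouping the resulting `events` by `comm` would show how often the stuck process is juju-log as opposed to something else.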

The stuck process is usually, but not always, juju-log; sometimes it is relation-ids or similar. I was able to briefly notice that it was in its startup config-changed hook.

I've separated out and tested nearly everything it does during its startup config-changed (sets up bridging, writes some config files, restarts libvirtd/nova-compute/etc) without being able to trigger the bug, but I suspect proximity to boot is a factor. If I disable jujud-unit-nova-compute startup, boot, log in, re-enable and start (by which time over a minute or so has elapsed from boot finish), it will not lock up. Similarly, if I wrap the jujud startup in a `strace -Ff -o /var/log/strace.log` (which slows it down massively), it will not lock up. Watched pot syndrome.

I've tried kernels from http://kernel.ubuntu.com/~kernel-ppa/mainline/ . I noticed that most of the recent arm64 mainline kernels had failed builds and notified the kernel team channel; apw fixed the issue and started some rebuilds.

What I've discovered (after many dead ends and a futile bisection) is that mainline builds before the rebuilds lock up, but fixed mainline builds initiated by apw DO NOT lock up. e.g. 4.16.3-041603.201804190730 locks up, but 4.16.6-041606.201806042022 does not lock up. (4.16.4 and 4.16.5 appear to have never been rebuilt and don't have arm64 debs, and that period is what I tried to bisect after figuring a fix must be in there.)
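Since the difference seems to track when a package was built rather than which upstream version it is, it can help to split the mainline version strings into their parts. The sketch below assumes the trailing field is a build timestamp in YYYYMMDDHHMM form, which matches the two strings quoted above; `parse_mainline_build` is a hypothetical helper, not part of any Ubuntu tooling.

```python
from datetime import datetime


def parse_mainline_build(version):
    """Split e.g. '4.16.3-041603.201804190730' into upstream version and build time.

    Assumption: the last dot-separated field is a YYYYMMDDHHMM build stamp.
    """
    upstream, rest = version.split("-", 1)
    _padded, stamp = rest.split(".", 1)
    return {
        "upstream": tuple(int(part) for part in upstream.split(".")),
        "built": datetime.strptime(stamp, "%Y%m%d%H%M"),
    }


# Results as reported in this bug, keyed by package version.
observed = {
    "4.16.3-041603.201804190730": "locks up",
    "4.16.6-041606.201806042022": "does not lock up",
}

# List the builds in build-time order alongside the observed behaviour.
for version in sorted(observed, key=lambda v: parse_mainline_build(v)["built"]):
    info = parse_mainline_build(version)
    print(info["built"].date(), version, observed[version])
```

Sorting by build time rather than by upstream version makes it easier to line results up against the point where the rebuilds started.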

But when I try to compile any of these recent kernels myself, they lock up when booted. I used the same kernel configs, built on both bionic and in a cosmic chroot, and tried both native arm64 compiles and cross-compiles from amd64. For example, 4.16.6-041606.201806042022 from k.u.c does not lock up, but when I build it myself, it does.

To be clear, I've verified lock ups on the following kernels (all using the kernel configs from their respective Ubuntu or k.u.c mainline builds):

- 4.15.0-22-generic from bionic (both Ubuntu-provided and my own recompile)
- v4.16 (and all point releases)
- v4.17

As I write this, my compiled v4.10 DOES NOT appear to lock up. I will attempt to bisect at a macro level from 4.10..4.15 and dig deeper.
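The macro-level bisect over release tags amounts to a binary search between a known-good and a known-bad kernel. A rough sketch follows, assuming v4.10 really is good, v4.15 really is bad, and the behaviour flips exactly once in between; none of that is confirmed yet. The `first_bad` function and the `locks_up` callback are illustrative names, and the callback stands in for "build, boot, and watch the console".

```python
def first_bad(versions, locks_up):
    """Return the first version for which locks_up(version) is true.

    Assumes versions[0] is known good, versions[-1] is known bad, and the
    result changes from good to bad exactly once along the list.
    """
    good, bad = 0, len(versions) - 1
    while bad - good > 1:
        mid = (good + bad) // 2
        if locks_up(versions[mid]):
            bad = mid       # first bad version is at mid or earlier
        else:
            good = mid      # first bad version is after mid
    return versions[bad]


# Release tags spanning the range mentioned above.
releases = ["v4.10", "v4.11", "v4.12", "v4.13", "v4.14", "v4.15"]
```

Calling `first_bad(releases, locks_up)` with a callback that reports the outcome of a manual boot test would need at most three test boots for this range. A finer `git bisect` between the two neighbouring tags would follow from there.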

ProblemType: Bug
DistroRelease: Ubuntu 18.04
Package: linux-image-4.15.0-22-generic 4.15.0-22.24
ProcVersionSignature: Ubuntu 4.15.0-22.24-generic 4.15.17
Uname: Linux 4.15.0-22-generic aarch64
AlsaDevices:
 total 0
 crw-rw---- 1 root audio 116, 1 Jun 2 04:22 seq
 crw-rw---- 1 root audio 116, 33 Jun 2 04:22 timer
AplayDevices: Error: [Errno 2] No such file or directory: 'aplay': 'aplay'
ApportVersion: 2.20.9-0ubuntu7.2
Architecture: arm64
ArecordDevices: Error: [Errno 2] No such file or directory: 'arecord': 'arecord'
AudioDevicesInUse: Error: command ['fuser', '-v', '/dev/snd/seq', '/dev/snd/timer'] failed with exit code 1:
Date: Fri Jun 8 00:13:05 2018
IwConfig: Error: [Errno 2] No such file or directory: 'iwconfig': 'iwconfig'
Lsusb: Error: command ['lsusb'] failed with exit code 1:
PciMultimedia:

ProcEnviron:
 TERM=xterm-256color
 PATH=(custom, no user)
 LANG=C.UTF-8
 SHELL=/bin/bash
ProcFB:

ProcKernelCmdLine: console=ttyS0,9600n8r ro
RelatedPackageVersions:
 linux-restricted-modules-4.15.0-22-generic N/A
 linux-backports-modules-4.15.0-22-generic N/A
 linux-firmware 1.173.1
RfKill: Error: [Errno 2] No such file or directory: 'rfkill': 'rfkill'
SourcePackage: linux
UpgradeStatus: No upgrade log present (probably fresh install)

Ryan Finnie (fo0bar) wrote :
Ubuntu Kernel Bot (ubuntu-kernel-bot) wrote : Status changed to Confirmed

This change was made by a bot.

Changed in linux (Ubuntu):
status: New → Confirmed
Changed in linux (Ubuntu):
importance: Undecided → High
Changed in linux (Ubuntu Bionic):
importance: Undecided → High
status: New → Confirmed
tags: added: kernel-key
Joseph Salisbury (jsalisbury) wrote :

Just curious whether you made progress with the bisect. If needed, I can assist with the bisect and build test kernels for you.

tags: added: kernel-da-key
removed: kernel-key
Junien F (axino) wrote :

4.19 appears to fix this problem

dann frazier (dannf) wrote :

@Ryan: Is this still a problem w/ 4.15? Do you need a full OpenStack setup to reproduce it, or is just a single node nova model enough?

Brad Figg (brad-figg)
tags: added: cscc