[arm64] lockups some time after booting

Bug #1531768 reported by Martin Pitt
This bug affects 2 people
Affects                Status        Importance  Assigned to      Milestone
Auto Package Testing   Fix Released  Medium      Unassigned
linux (Ubuntu)         Fix Released  Medium      Colin Ian King

Bug Description

I created an 8 CPU arm64 instance on Canonical's Scalingstack (which I want to use for armhf autopkgtesting in LXD). I started with wily as that has lxd available (it's not yet available in trusty nor the PPA for arm64).

However, pretty much any LXD task that I do (I haven't tried much else) on this machine takes unbearably long. A simple "lxc profile set default raw.lxc lxc.seccomp=" or "lxc list" takes several minutes.

I see tons of

[ 1020.971955] rcu_sched kthread starved for 6000 jiffies! g1095 c1094 f0x0
[ 1121.166926] INFO: task fsnotify_mark:69 blocked for more than 120 seconds.

in dmesg (the attached apport info has the complete dmesg).

ProblemType: Bug
DistroRelease: Ubuntu 15.10
Package: linux-image-4.2.0-22-generic 4.2.0-22.27
ProcVersionSignature: User Name 4.2.0-22.27-generic 4.2.6
Uname: Linux 4.2.0-22-generic aarch64
AlsaDevices:
 total 0
 crw-rw---- 1 root audio 116, 1 Jan 7 09:18 seq
 crw-rw---- 1 root audio 116, 33 Jan 7 09:18 timer
AplayDevices: Error: [Errno 2] No such file or directory: 'aplay'
ApportVersion: 2.19.1-0ubuntu5
Architecture: arm64
ArecordDevices: Error: [Errno 2] No such file or directory: 'arecord'
AudioDevicesInUse: Error: command ['fuser', '-v', '/dev/snd/seq', '/dev/snd/timer'] failed with exit code 1:
CRDA: N/A
Date: Thu Jan 7 09:24:01 2016
IwConfig:
 eth0 no wireless extensions.

 lo no wireless extensions.

 lxcbr0 no wireless extensions.
Lspci:
 00:00.0 Host bridge [0600]: Red Hat, Inc. Device [1b36:0008]
  Subsystem: Red Hat, Inc Device [1af4:1100]
  Control: I/O- Mem- BusMaster- SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- DisINTx-
  Status: Cap- 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
Lsusb: Error: command ['lsusb'] failed with exit code 1: unable to initialize libusb: -99
PciMultimedia:

ProcEnviron:
 TERM=screen
 PATH=(custom, no user)
 XDG_RUNTIME_DIR=<set>
 LANG=en_US.UTF-8
 SHELL=/bin/bash
ProcFB:

ProcKernelCmdLine: BOOT_IMAGE=/boot/vmlinuz-4.2.0-22-generic root=LABEL=cloudimg-rootfs earlyprintk
RelatedPackageVersions:
 linux-restricted-modules-4.2.0-22-generic N/A
 linux-backports-modules-4.2.0-22-generic N/A
 linux-firmware 1.149.3
RfKill: Error: [Errno 2] No such file or directory: 'rfkill'
SourcePackage: linux
UdevLog: Error: [Errno 2] No such file or directory: '/var/log/udev'
UpgradeStatus: No upgrade log present (probably fresh install)

Revision history for this message
Martin Pitt (pitti) wrote :
description: updated
Revision history for this message
Martin Pitt (pitti) wrote :

"reboot" also takes too long to be practical (I killed the instance after waiting for 10 mins, as it didn't even begin to shut down).

Revision history for this message
Brad Figg (brad-figg) wrote : Missing required logs.

This bug is missing log files that will aid in diagnosing the problem. From a terminal window please run:

apport-collect 1531768

and then change the status of the bug to 'Confirmed'.

If, due to the nature of the issue you have encountered, you are unable to run this command, please add a comment stating that fact and change the bug status to 'Confirmed'.

This change has been made by an automated script, maintained by the Ubuntu Kernel Team.

Changed in linux (Ubuntu):
status: New → Incomplete
Martin Pitt (pitti)
tags: added: bot-stop-nagging
Changed in linux (Ubuntu):
status: Incomplete → Confirmed
Revision history for this message
Martin Pitt (pitti) wrote : Re: kernel 4.2/wily on arm64 and multiple CPUs is unusably slow

I tried to install the current xenial kernel (http://ports.ubuntu.com/pool/main/l/linux/linux-image-4.3.0-5-generic_4.3.0-5.16_arm64.deb). Package installation fails due to

  Processing triggers for initramfs-tools (0.120ubuntu6) ...
  update-initramfs: Generating /boot/initrd.img-4.3.0-5-generic
  Unsupported platform.
  run-parts: /etc/initramfs/post-update.d//flash-kernel exited with return code 1

but nevertheless I can boot with "reboot -f" (as a normal reboot hangs). Now lxc fails on lxc-net:

Jan 07 10:03:15 lxd-armhf1w systemd[1]: Starting LXC network bridge setup...
Jan 07 10:03:16 lxd-armhf1w lxc-net[10651]: RTNETLINK answers: Operation not supported
Jan 07 10:03:16 lxd-armhf1w lxc-net[10651]: Failed to setup lxc-net.

In particular, it fails on "ip link add dev lxcbr0 type bridge".

dmesg error for this:

[ 199.168466] module x_tables: unsupported RELA relocation: 275
[ 199.232020] module llc: unsupported RELA relocation: 275

which might be a regression in xenial's arm64 kernel, or lxc might need to be adjusted to it. Either way, this is a deal-breaker, so with the xenial kernel I can't do much.

However, operations like "lxc list" work without the lxcbr0 bridge. They still take several minutes, but I don't get the dmesg errors any more.

Revision history for this message
Martin Pitt (pitti) wrote :

I tried to nova boot a trusty instance and dist-upgrade the userspace packages to xenial, keeping the 3.19 linux kernel. Same effect: "lxc list" and other operations take many minutes, and lxd.service itself keeps timing out too.

summary: - kernel 4.2/wily on arm64 and multiple CPUs is unusably slow
+ arm64 kernel and multiple CPUs is unusably slow with lxd operations
Revision history for this message
Martin Pitt (pitti) wrote : Re: arm64 kernel and multiple CPUs is unusably slow with lxd operations

I retried the same on m1.medium with 2 CPUs and 4 GB RAM, and lxd works fine there with the 4.2 kernel on wily. Unfortunately that's too small for my purposes. m1.large with 4 CPUs/8 GB RAM also seems to work well; I can make do with that.

William points out that the hosts on bos01 only have 8 CPUs. So maybe this starts happening if the guest gets at least as many CPUs as the host?

summary: - arm64 kernel and multiple CPUs is unusably slow with lxd operations
+ arm64 kernel and >= 8 CPUS (>= host CPU count?) is unusably slow with
+ lxd operations
Revision history for this message
Martin Pitt (pitti) wrote : Re: arm64 kernel and multiple CPUs is unusably slow

I take that back. It does survive for much longer, but after some 15 minutes of running I again run into tons of

[ 2424.611668] INFO: task systemd-udevd:1320 blocked for more than 120 seconds.
[ 2424.613514] Tainted: G W 4.2.0-22-generic #27-Ubuntu
[ 2424.615183] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 2424.617166] systemd-udevd D ffffffc000086ee4 0 1320 1 0x0000000c
[ 2424.617176] Call trace:
[ 2424.617842] [<ffffffc000086ee4>] __switch_to+0x94/0xa8
[ 2424.617851] [<ffffffc0008a7100>] __schedule+0x2b0/0x7b8
[ 2424.617854] [<ffffffc0008a7644>] schedule+0x3c/0x98
[ 2424.617859] [<ffffffc0008aa954>] schedule_timeout+0x1ec/0x280
[ 2424.617862] [<ffffffc0008a8364>] wait_for_common+0xcc/0x1a0
[ 2424.617866] [<ffffffc0008a8460>] wait_for_completion+0x28/0x38
[ 2424.617870] [<ffffffc000120654>] __synchronize_srcu+0x9c/0x180
[ 2424.617873] [<ffffffc000120770>] synchronize_srcu+0x38/0x48
[ 2424.617877] [<ffffffc00028b7b4>] fsnotify_destroy_group+0x2c/0x60
[ 2424.617880] [<ffffffc00028de3c>] inotify_release+0x34/0x78
[ 2424.617885] [<ffffffc00024537c>] __fput+0xa4/0x248
[ 2424.617887] [<ffffffc000245598>] ____fput+0x20/0x30
[ 2424.617892] [<ffffffc0000e0604>] task_work_run+0xbc/0xf8
[ 2424.617896] [<ffffffc0000c29d0>] do_exit+0x2f0/0xa48
[ 2424.617898] [<ffffffc0000c31bc>] do_group_exit+0x44/0xe8
[ 2424.617902] [<ffffffc0000d09b8>] get_signal+0x3d8/0x578
[ 2424.617906] [<ffffffc000089f20>] do_signal+0x90/0x530
[ 2424.617909] [<ffffffc00008a640>] do_notify_resume+0x70/0x78

for all kinds of processes.

summary: - arm64 kernel and >= 8 CPUS (>= host CPU count?) is unusably slow with
- lxd operations
+ arm64 kernel and multiple CPUs is unusably slow
Revision history for this message
Chris J Arges (arges) wrote :

Martin,
Can you collect apport information from the host system as well?
Do you get the same effects with a single vCPU?
--chris

Changed in linux (Ubuntu):
importance: Undecided → Medium
tags: added: kernel-da-key
Revision history for this message
Martin Pitt (pitti) wrote :

> Can you collect apport information from the host system as well?

Sorry, I can't. I can create Scalingstack instances, but I have no access to the host systems. The IS team certainly can, though.

> Do you get the same effects with a single vCPU?

So far that test system is holding up and I haven't seen processes getting locked up.

I do see that networking within containers is totally busted (transmitting 20 bytes took some 10 minutes), but with the multi-CPU instance I didn't even get that far. This also happens with 3.19 (cannot test 4.3 due to its regression of creating bridges). Either way, this seems to be a separate bug.

Revision history for this message
Martin Pitt (pitti) wrote :

FTR, the "networking broken in containers" was an MTU mismatch, worked around now. Thanks to Andy for figuring this out!

Revision history for this message
Martin Pitt (pitti) wrote :

I split out the xenial bridge regression into bug 1534545, so that this can keep focussing on the "processes become slow and hang after a while" main aspect.

Revision history for this message
Andy Whitcroft (apw) wrote :

The lxc hangs component looks to be an lxd-related issue. Specifically, the Go libraries in use consume a large amount of entropy and hang waiting for it to become available. Installing haveged seems to resolve these hangs.
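
For reference, installing it and checking the entropy pool is roughly the following (a sketch; the exact commands aren't recorded here):

  # check how much entropy the kernel pool has; very low values explain
  # why reads of /dev/random (and Go's crypto initialisation) block
  cat /proc/sys/kernel/random/entropy_avail
  # install the userspace entropy daemon; the Ubuntu package should start it automatically
  sudo apt-get install -y haveged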

Revision history for this message
Andy Whitcroft (apw) wrote :

The rcu messages, though annoying, do seem to be benign, as they do not increase over time.

Revision history for this message
Martin Pitt (pitti) wrote :

I did install haveged, which indeed seems to help quite a bit. But now, after having used an xlarge (8 CPU) instance for a while, I again get hanging processes, like

ubuntu 2317 0.0 0.0 0 0 pts/0 D+ 16:14 0:00 [tail]

I used that tail on /var/log/lxd/lxd.log to see what's going on. lxd itself stopped responding much earlier (but not in kernel "D" state, it's sleeping).

Sorry, this is still very unspecific..

Revision history for this message
Martin Pitt (pitti) wrote :

Another data point: I tried to install 3.19 (the kernel that we have on the buildds) on the xlarge instance, and lxc list now hangs there as well.

I haven't yet seen lxc list hang on a large (4 CPUs) instance, but the whole thing (running tests in containers) is still very slow. TBC on Monday...

Revision history for this message
Martin Pitt (pitti) wrote :

lxd-armhf1 (8 CPUs) is again in a state where "lxc list" and even "top" hang forever. lxd-armhf2 was unfortunately shut down in the previous days, so I just booted it again.

Revision history for this message
Martin Pitt (pitti) wrote : Re: lxd and other commands get stuck on arm64 kernel and multiple CPUs

Retitling. The "unusably slow" part was fixed by installing haveged, so what remains is that the 8x CPU instance gets into this lockup state after some time.

On the 4x instance I'm now running adt-run in a loop; so far it's through ~ 10 iterations. I'll let it run overnight and see how it keeps up.

summary: - arm64 kernel and multiple CPUs is unusably slow
+ lxd and other commands get stuck on arm64 kernel and multiple CPUs
Revision history for this message
Martin Pitt (pitti) wrote :

I managed to get the 4x CPU instance into the same locked up state now, so AFAICS the problem isn't fundamentally different between 4 and 8 cores.

Revision history for this message
Martin Pitt (pitti) wrote :

Reducing the number of threads that Go uses seems to help a bit:

$ cat /etc/systemd/system/lxd.service.d/override.conf
[Service]
Environment=GOMAXPROCS=1

(GOMAXPROCS defaults to the number of CPUs). But Stéphane is still able to lock up LXD pretty fast even with that.
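
As a usage note, applying such a drop-in follows the standard systemd steps (a sketch, not commands taken from this bug):

  $ sudo mkdir -p /etc/systemd/system/lxd.service.d
  $ printf '[Service]\nEnvironment=GOMAXPROCS=1\n' | sudo tee /etc/systemd/system/lxd.service.d/override.conf
  $ sudo systemctl daemon-reload
  $ sudo systemctl restart lxd.service
  $ sudo systemctl show lxd.service -p Environment   # confirm the override took effect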

Revision history for this message
Stéphane Graber (stgraber) wrote :

Very much looks like it's related to threading and futexes somehow.

Forcing golang to use a single thread rather than one per container made things more stable using a very simple test (infinite loop of "lxc list"), though starting containers then still caused the hang to happen.

I've seen a similar hang on futex when running (lxc-tests package):
lxc-test-concurrent -j 8 -i 50

This creates and spawns 8 containers in parallel using threads and attempts that 50 times in a row. This is done entirely in C so doesn't touch golang.

Martin Pitt (pitti)
summary: - lxd and other commands get stuck on arm64 kernel and multiple CPUs
+ [arm64] multithreaded processes get locked up in futexes
Revision history for this message
Martin Pitt (pitti) wrote : Re: [arm64] multithreaded processes get locked up in futexes

Some good news: With bug 1534545 fixed I was now able to upgrade to the Xenial 4.4 kernel. On the 4x CPU instance two parallel adt-run loops have now run for about two hours without any dmesg spew. Stéphane has run "lxc-test-concurrent -j 16 -i 10" twice on the 8x CPU instance successfully too.

Bad news: I rebooted the 8x CPU instance (also xenial du jour with the 4.4 kernel) and didn't do anything on it. After just sitting idle for an hour or two, ssh stopped responding and nova console-log shows http://paste.ubuntu.com/14857144/ (only a hard reboot helped). So it wet its pants without actually doing anything.

So it appears it's not fully fixed yet, but muuch better. I'll do some more smoke testing, and if 4x CPU instances work, this is good enough to put this into production. I'll keep the old Calxeda instances alive as a fallback for a while, of course.

Revision history for this message
Martin Pitt (pitti) wrote :

Darn, I now get the "instance kills itself after some time" on the 4x CPU as well. nova console-log shows the blurb below and ssh and lxd ports are dead (so I can't learn anything further from the box than console-log).

Ubuntu Xenial Xerus (development branch) lxd-armhf2 ttyAMA0

lxd-armhf2 login: [ 954.144506] INFO: rcu_sched detected stalls on CPUs/tasks:
[ 954.145743] 1-...: (79 GPs behind) idle=284/0/0 softirq=407182/407182 fqs=1
[ 954.147202] (detected by 3, t=15002 jiffies, g=21817, c=21816, q=1563)
[ 954.148590] Call trace:
[ 954.149123] rcu_sched kthread starved for 15002 jiffies! g21817 c21816 f0x0 s3 ->state=0x1
[ 3000.217089] INFO: task systemd:1 blocked for more than 120 seconds.
[ 3000.218529] Not tainted 4.4.0-2-generic #16-Ubuntu
[ 3000.219628] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 3000.221310] Call trace:
[ 3000.222562] INFO: task kworker/0:2:12463 blocked for more than 120 seconds.
[ 3000.223985] Not tainted 4.4.0-2-generic #16-Ubuntu
[ 3000.225146] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 3000.226741] Call trace:
[ 3000.227306] INFO: task (d-logind):15441 blocked for more than 120 seconds.
[ 3000.228685] Not tainted 4.4.0-2-generic #16-Ubuntu
[ 3000.229834] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 3000.231469] Call trace:
[ 3120.231067] INFO: task systemd:1 blocked for more than 120 seconds.
[ 3120.232501] Not tainted 4.4.0-2-generic #16-Ubuntu
[ 3120.233629] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 3120.235393] Call trace:
[ 3120.236702] INFO: task kworker/0:2:12463 blocked for more than 120 seconds.
[ 3120.238188] Not tainted 4.4.0-2-generic #16-Ubuntu
[ 3120.239398] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 3120.241140] Call trace:
[ 3120.241716] INFO: task (d-logind):15441 blocked for more than 120 seconds.
[ 3120.243223] Not tainted 4.4.0-2-generic #16-Ubuntu
[ 3120.244366] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 3120.245945] Call trace:
[ 3240.244955] INFO: task systemd:1 blocked for more than 120 seconds.
[ 3240.246398] Not tainted 4.4.0-2-generic #16-Ubuntu
[ 3240.247526] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 3240.249272] Call trace:
[ 3240.250568] INFO: task kworker/0:2:12463 blocked for more than 120 seconds.
[ 3240.252060] Not tainted 4.4.0-2-generic #16-Ubuntu
[ 3240.253280] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 3240.254966] Call trace:
[ 3240.255549] INFO: task (d-logind):15441 blocked for more than 120 seconds.
[ 3240.257073] Not tainted 4.4.0-2-generic #16-Ubuntu
[ 3240.258259] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 3240.259906] Call trace:
[ 3360.258901] INFO: task systemd:1 blocked for more than 120 seconds.
[ 3360.260349] Not tainted 4.4.0-2-generic #16-Ubuntu
[ 3360.261475] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 3360.263224] Call trace:

Revision history for this message
Martin Pitt (pitti) wrote :

For the record, this "auto-destruct" behaviour with the xenial kernel happens just by itself: reboot the instance, let it sit there for 15 or 60 minutes, then this kernel spew starts happening and it locks up, losing network/ssh access. There was no actual payload on these.

Revision history for this message
Martin Pitt (pitti) wrote : Re: [arm64] locks up a few minutes after booting
Download full text (44.4 KiB)

I re-tried with the current kernel 4.4.0-8, and merely booting a pristine cloud image with "nova boot --poll --image ubuntu/ubuntu-xenial-daily-arm64-server-20160227-uefi1.img --flavor m1.large" and letting it sit there for some 20 minutes is still auto-destructing:

error: no suitable video mode found.
EFI stub: Booting Linux Kernel...
EFI stub: Using DTB from configuration table
EFI stub: Exiting boot services and installing virtual address map...
[ 0.000000] Booting Linux on physical CPU 0x0
[ 0.000000] Initializing cgroup subsys cpuset
[ 0.000000] Initializing cgroup subsys cpu
[ 0.000000] Initializing cgroup subsys cpuacct
[ 0.000000] Linux version 4.4.0-8-generic (buildd@beebe) (gcc version 5.3.1 20160222 (Ubuntu/Linaro 5.3.1-9ubuntu3) ) #23-Ubuntu SMP Wed Feb 24 20:51:39 UTC 2016 (Ubuntu 4.4.0-8.23-generic 4.4.2)
[ 0.000000] Boot CPU: AArch64 Processor [500f0001]
[ 0.000000] efi: Getting EFI parameters from FDT:
[ 0.000000] EFI v2.40 by EDK II
[ 0.000000] efi:
[ 0.000000] psci: probing for conduit method from DT.
[ 0.000000] psci: PSCIv0.2 detected in firmware.
[ 0.000000] psci: Using standard PSCI v0.2 function IDs
[ 0.000000] psci: Trusted OS migration not required
[ 0.000000] PERCPU: Embedded 17 pages/cpu @ffff8001fff7d000 s31128 r8192 d30312 u69632
[ 0.000000] Detected PIPT I-cache on CPU0
[ 0.000000] Built 1 zonelists in Zone order, mobility grouping on. Total pages: 2064384
[ 0.000000] Kernel command line: BOOT_IMAGE=/boot/vmlinuz-4.4.0-8-generic root=LABEL=cloudimg-rootfs vt.handoff=7
[ 0.000000] log_buf_len individual max cpu contribution: 4096 bytes
[ 0.000000] log_buf_len total cpu_extra contributions: 12288 bytes
[ 0.000000] log_buf_len min size: 16384 bytes
[ 0.000000] log_buf_len: 32768 bytes
[ 0.000000] early log buf free: 14588(89%)
[ 0.000000] PID hash table entries: 4096 (order: 3, 32768 bytes)
[ 0.000000] Dentry cache hash table entries: 1048576 (order: 11, 8388608 bytes)
[ 0.000000] Inode-cache hash table entries: 524288 (order: 10, 4194304 bytes)
[ 0.000000] software IO TLB [mem 0xfbffb000-0xffffb000] (64MB) mapped at [ffff8000bbffb000-ffff8000bfffafff]
[ 0.000000] Memory: 8142020K/8388608K available (8552K kernel code, 1007K rwdata, 3736K rodata, 748K init, 783K bss, 246588K reserved, 0K cma-reserved)
[ 0.000000] Virtual kernel memory layout:
[ 0.000000] vmalloc : 0xffff000000000000 - 0xffff7bffbfff0000 (126974 GB)
[ 0.000000] vmemmap : 0xffff7bffc0000000 - 0xffff7fffc0000000 ( 4096 GB maximum)
[ 0.000000] 0xffff7bffc1000000 - 0xffff7bffc9000000 ( 128 MB actual)
[ 0.000000] fixed : 0xffff7ffffa7fd000 - 0xffff7ffffac00000 ( 4108 KB)
[ 0.000000] PCI I/O : 0xffff7ffffae00000 - 0xffff7ffffbe00000 ( 16 MB)
[ 0.000000] modules : 0xffff7ffffc000000 - 0xffff800000000000 ( 64 MB)
[ 0.000000] memory : 0xffff800000000000 - 0xffff800200000000 ( 8192 MB)
[ 0.000000] .init : 0x...

summary: - [arm64] multithreaded processes get locked up in futexes
+ [arm64] locks up a few minutes after booting
Revision history for this message
Martin Pitt (pitti) wrote :

I tried this again yesterday evening with an up-to-date xenial arm64 image, and lo and behold: both setting it up and running two parallel loops calling adt-run overnight worked; they went through ~ 500 iterations without a hitch.

So I suppose this was fixed by some newer kernel, the new glibc, some changed Scalingstack configuration, or the tooth fairy :-)

Whatever it was, I'm closing this now, as this bug has become a bit unwieldy. I'll report a new bug if this happens again.

Changed in linux (Ubuntu):
status: Confirmed → Fix Released
Revision history for this message
Martin Pitt (pitti) wrote :

Meh, of course this came back. Only not a few minutes after booting any more, but after two days.

Changed in linux (Ubuntu):
status: Fix Released → Confirmed
summary: - [arm64] locks up a few minutes after booting
+ [arm64] locks up some time after booting
Revision history for this message
Martin Pitt (pitti) wrote : Re: [arm64] locks up some time after booting

This is a syslog from one boot up to the point where I hard-rebooted the instance because it was completely hanging. The kernel errors still look by and large like the ones from the original report (in JournalErrors.txt).

Changed in linux (Ubuntu):
assignee: nobody → Colin Ian King (colin-king)
Revision history for this message
Colin Ian King (colin-king) wrote :

Martin, I'm going on a hunch; can you try these kernels to see if they help:

http://kernel.ubuntu.com/~cking/lp1531768/

Revision history for this message
Martin Pitt (pitti) wrote : Re: [Bug 1531768] Re: [arm64] locks up some time after booting

Colin Ian King [2016-05-13 11:07 -0000]:
> Martin, I'm going on a hunch; can you try these kernels to see if they
> help:
>
> http://kernel.ubuntu.com/~cking/lp1531768/

I installed that on two out of three lxd hosts. I'll watch the
notification emails of my watchdog and will let you know next week!
(Usually they don't survive more than a few hours).

Many thanks, you rock!

Revision history for this message
Martin Pitt (pitti) wrote : Re: [arm64] locks up some time after booting

Two boxes (lxd-armhf{1,2}) have been running for three days with cking's kernel; box 3 is running the standard xenial kernel. Boxes 1 and 3 don't respond to ssh connections any more; 2 still does, but several processes are in 'D' state:

root 47 0.0 0.0 0 0 ? D May13 0:00 [fsnotify_mark]
root 12892 0.0 0.0 0 0 ? D May13 0:00 [kworker/1:0]
root 26785 0.0 0.0 0 0 ? D May13 0:00 [kworker/1:2]

These are in uninterruptible kernel sleep, and userspace processes such as lxd and systemd (pid 1) are stuck the same way. Calling "top" hangs as well (but "ps" works), and there are a ton of zombie processes (including lxd, check-new-release, sshd, socat).
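
Since "ps" still works while "top" does not, the stuck tasks can be enumerated roughly like this (a generic sketch, not output from these boxes):

  # list tasks in uninterruptible sleep together with their kernel wait channel
  ps axo pid,stat,wchan:32,comm | awk '$2 ~ /^D/'
  # if sysrq is enabled, dump blocked tasks with stack traces into dmesg
  echo w | sudo tee /proc/sysrq-trigger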

All three continue to have lots of dmesg like

[203722.697873] rcu_sched kthread starved for 15002 jiffies! g50919 c50918 f0x0 s3 ->state=0x1
[209670.455074] INFO: rcu_sched detected stalls on CPUs/tasks:
[209670.456446] 3-...: (126 GPs behind) idle=a7e/0/0 softirq=30811/30811 fqs=1
[209670.457910] (detected by 2, t=15002 jiffies, g=50962, c=50961, q=8377)
[209670.459430] Call trace:

So I'd say that this kernel did not really help (but also did not make things worse for sure).

Revision history for this message
Colin Ian King (colin-king) wrote :

OK, that's not so great. Can you see if the latest 4.6 kernel shows any improvement?

http://kernel.ubuntu.com/~cking/lp1531768/4.6

Revision history for this message
Martin Pitt (pitti) wrote :

After installing the 4.6 kernel and rebooting, lxd-bridge.service failed the same way on two boxes:

May 17 11:27:26 lxd-armhf1 lxd-bridge.start[2112]: Failed to setup lxd-bridge.
May 17 11:27:26 lxd-armhf1 lxd-bridge.start[2129]: RTNETLINK answers: Operation not supported
May 17 11:27:26 lxd-armhf1 lxd-bridge.start[2129]: Failed to setup lxd-bridge.

A simple "brctl addbr foo" does work though, so I guess this happens later on when trying to add veths to it or setting its IP etc. Unfortunately there is no dmesg output about this at all. But after another reboot this curiously succeeded (I verified uname -a that in both cases I was running 4.6.0).

I now let one box run with 4.6.0, let's see how long it'll hold up.

Another note: with this new kernel, and I think also with the previous 4.4 one, I'm getting an awful lot of "failure to fork" errors, e. g. when trying to run apt-get install; up to the point that right after a fresh boot I can't even install a simple package like "bridge-utils" any more. "ps aux" looks fairly harmless, just 150 lines, which includes all the kernel threads. It could of course be that over the course of all the automatic reboots the file system got corrupted in some way -- but usually another reboot then gets lucky and apt-get install (as well as setting up the above bridge) works.

Revision history for this message
Martin Pitt (pitti) wrote :

> I now let one box run with 4.6.0, let's see how long it'll hold up.

Checking again, I got a dozen auto-reboots from the watchdog, and after disabling the watchdog and inspecting the box after 4 hours I see exactly the same symptoms.

Revision history for this message
Colin Ian King (colin-king) wrote :

Hi Martin, I've built a Xenial kernel now with a load of debug enabled; it may catch some kind of issue or provide a hint of what's going on.

http://kernel.ubuntu.com/~cking/lp1531768/4.4-debug/

Care to try this out? It won't fix anything, but it may capture some interesting bug info if it sees anything out of the ordinary on locking etc.

Revision history for this message
Martin Pitt (pitti) wrote :

I installed the debug kernel on one arm64 box last night, and it has now run for 12 hours. lxd is still running, no hung processes, and dmesg shows nothing unusual, i. e. no extra debug messages. Yay heisenbug?

The other box with the standard xenial kernel has locked up as always; I'll install the debug kernel on that too now.

Thanks Colin!

Revision history for this message
Martin Pitt (pitti) wrote :

One box completely froze again (ssh does not respond any more), attaching dmesg. However, I'm not sure that this actually contains what you were looking for -- there is a lot of the usual chatter from starting/stopping containers, and then some traces about hung tasks when trying to flush the file system (I have /srv on btrfs so that containers have some acceptable performance -- but NB that this also happened in my earlier experiments on plain ext4).

Do I need to do anything to enable the extra debugging you added? Or is that perhaps not in dmesg?

Revision history for this message
Martin Pitt (pitti) wrote :

This happens without any --user-data or any particular interaction, just by plainly booting a standard image.

Revision history for this message
Martin Pitt (pitti) wrote :

Colin had the hunch that this actually happens if the CPUs do *not* have anything to do, i. e. want to go into a low freq/power state. We successfully ran the full stress-ng test suite on the instance without triggering any of this, but a few minutes after it was done, hung processes started to appear again.

We currently run a test whether booting with "nohz=off" works around that. So far it has survived for some 30 minutes (which is very promising).

This also has never triggered on a single-CPU instance (m1.small).

Changed in auto-package-testing:
status: New → Triaged
importance: Undecided → Medium
assignee: nobody → Martin Pitt (pitti)
Revision history for this message
Martin Pitt (pitti) wrote :

Neither did this trigger on two trusty 4-cpu instances so far.

Revision history for this message
Martin Pitt (pitti) wrote :

These have held up throughout the night \o/

I added the workaround to the worker setup script: https://git.launchpad.net/~ubuntu-release/+git/autopkgtest-cloud/commit/?id=50583f06

Changed in auto-package-testing:
assignee: Martin Pitt (pitti) → nobody
status: Triaged → Fix Released
summary: - [arm64] locks up some time after booting
+ [arm64] locks up some time after booting when idle if tickless (nohz=on)
+ is used
summary: - [arm64] locks up some time after booting when idle if tickless (nohz=on)
- is used
+ [arm64] lockups when idle if tickless (nohz=on) is used
Revision history for this message
Colin Ian King (colin-king) wrote : Re: [arm64] lockups when idle if tickless (nohz=on) is used

That's great news! I'll try and figure out what the root cause is. Let me know if there are other issues.

Revision history for this message
Paul Gear (paulgear) wrote :

We've also been running into this issue on ScalingStack instances recently; I got this traceback which seems to strongly implicate nohz as the problem area: https://pastebin.canonical.com/158640/ Presently testing @pitti's workaround on a number of different sized instances to confirm.

Revision history for this message
Colin Ian King (colin-king) wrote :

It may be worth trying nohz=off on the host as well, just as an experiment to see if this also improves things.

Revision history for this message
Martin Pitt (pitti) wrote :

Oh noes! I'm still getting "task * blocked for more than 120 seconds" hangs even with nohz=off :-( Is there another option which I could try?

Changed in auto-package-testing:
status: Fix Released → Triaged
Revision history for this message
Martin Pitt (pitti) wrote :

> It may be worth trying nohz=off on the host as well

Junien did that on the nova compute host, and no change. Processes in the instance still freeze.

This is actually also consistent with the observation that this apparently does not happen with the trusty kernel.

Revision history for this message
Martin Pitt (pitti) wrote :

FTR, running the trusty kernel on xenial userspace does not work: http://paste.ubuntu.com/17392362/

cking | pitti, syscall 384 on aarch64 is getrandom() and that does not exist on trusty

Revision history for this message
Martin Pitt (pitti) wrote :

Hang still occurs with xenial kernel and one instance of

   nice -n 19 dd if=/dev/zero of=/dev/null bs=1024 &

I have now rebooted and started four dd's, so that all four CPUs should remain busy constantly.
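
One way to start one such loop per CPU (a sketch; the exact invocation used for the four dd's isn't shown here):

  $ for i in $(seq "$(nproc)"); do nice -n 19 dd if=/dev/zero of=/dev/null bs=1024 & done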

Revision history for this message
Colin Ian King (colin-king) wrote :

If 4 dd's work OK, it may be worth running a minimal sleep loop:

while true; do sleep 0.5; done

Revision history for this message
Martin Pitt (pitti) wrote :

Further notekeeping:
 - 4 dd's (xenial+nohz=off) survived for half a day; then the instance crashed on something else.
 - trusty and vivid kernels with nohz=off have survived for a full day without any lockups. lxd on the trusty kernel causes a lot of leaked "FREEZED/FREEZING" containers, but that's unrelated and does not happen with the vivid kernel. So it's unclear whether this combination is stable, whether the lockups are just reduced, or whether it was just lucky.
 - trusty kernel without the nohz option locked up a few minutes after reboot, without actually running any test.

Revision history for this message
Colin Ian King (colin-king) wrote :

I'm trying to get a reliable reproducer on a similarly sized aarch64 host. Just so that I'm not missing anything, what is the entire command line being used on the host to run the VM?

Also, what is /proc/cmdline on the VM?

Revision history for this message
Colin Ian King (colin-king) wrote :

And the /proc/cmdline info from the host would be of some use to see if anything special there is being used.

Revision history for this message
Martin Pitt (pitti) wrote : Re: [Bug 1531768] Re: [arm64] lockups when idle if tickless (nohz=on) is used

Colin Ian King [2016-06-17 10:50 -0000]:
> I'm trying to get a reliable reproducer on a similarly sized aarch64
> host. Just so that I'm not missing anything, what is the entire command
> line being used on the host to run the VM?

I can't determine this. I asked Junien on IRC to put it here.

> Also, what is /proc/cmdline on the VM?

Aside from the "nohz=off" it's rather unsurprising:

  BOOT_IMAGE=/boot/vmlinuz-4.2.0-38-generic root=UUID=b98e4d93-8d8f-4349-a6ce-b5a87cdb2edd ro nohz=off

Revision history for this message
Colin Ian King (colin-king) wrote : Re: [arm64] lockups when idle if tickless (nohz=on) is used

I'd like to rule out whether we are missing IRQs on the host and inside the VM, so can both be booted with the kernel parameter irqpoll?

Unfortunately this can eat more CPU cycles, so I'm reluctant to ask for it to be used, but I'm wondering whether the host or the VM is occasionally missing timer wakeups.

Revision history for this message
Junien F (axino) wrote :
Revision history for this message
Martin Pitt (pitti) wrote :

>- trusty and vivid kernels with nohz=off have survived for a full day without any lockups.

They both hung last night.

So in summary: Neither nohz=off nor older kernels help here. This really seems to be a matter of luck/what's going on on the host system.

summary: - [arm64] lockups when idle if tickless (nohz=on) is used
+ [arm64] lockups some time after booting
Revision history for this message
Colin Ian King (colin-king) wrote :

I've been running Xenial host + Xenial VM on an mcdivitt 8-core box and have not been able to reproduce this issue. I'm going to keep it running for one more day.

Do we have any idea what the host(s) hardware is? I'm starting to wonder if it is a host/VM interaction issue.

Revision history for this message
William Grant (wgrant) wrote :

The production hardware is mcdivitt as well, running trusty with lts-vivid or lts-wily.

Revision history for this message
Colin Ian King (colin-king) wrote :

Thanks William, I'm going to soak test with those older kernels and see if I can trip the hang on these.

Revision history for this message
Colin Ian King (colin-king) wrote :

Finally able to trip an rcu timeout: 3.19.0-61-generic kernel on the host, xenial server image in the VM, host busy with async I/O requests (via stress-ng):

[ 825.195520] systemd[1]: Started Journal Service.
[ 900.108730] INFO: rcu_sched detected stalls on CPUs/tasks:
[ 900.110254] 0-...: (4 GPs behind) idle=750/0/0 softirq=4668/4668 fqs=1
[ 900.111742] (detected by 2, t=15002 jiffies, g=1980, c=1979, q=90)
[ 900.113384] Call trace:
[ 900.114035] rcu_sched kthread starved for 15001 jiffies! g1980 c1979 f0x0 s3 ->state=0x1
[ 900.108730] INFO: rcu_sched detected stalls on CPUs/tasks:
[ 900.110254] 0-...: (4 GPs behind) idle=750/0/0 softirq=4668/4668 fqs=1
[ 900.111742] (detected by 2, t=15002 jiffies, g=1980, c=1979, q=90)
[ 900.113366] Task dump for CPU 0:
[ 900.113375] swapper/0 R running task 0 0 0 0x00000000
[ 900.113384] Call trace:
[ 900.113987] [<ffff800000086ad0>] __switch_to+0x90/0xa8
[ 900.113999] [<ffff800000143c40>] __tick_nohz_idle_enter+0x50/0x3f0
[ 900.114003] [<ffff800000144238>] tick_nohz_idle_enter+0x40/0x70
[ 900.114010] [<ffff80000010b8b0>] cpu_startup_entry+0x288/0x2d8
[ 900.114018] [<ffff8000008f18f0>] rest_init+0x80/0x88
[ 900.114025] [<ffff800000cb59f8>] start_kernel+0x3e8/0x414
[ 900.114029] [<00000000408fb000>] 0x408fb000
[ 900.114035] rcu_sched kthread starved for 15001 jiffies! g1980 c1979 f0x0 s3 ->state=0x1

Let me see if I can repro this on other kernels now

Revision history for this message
Colin Ian King (colin-king) wrote :

Can trip it with stress-ng context switching with 4.2.0-38-generic
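
The exact stress-ng invocation isn't recorded here, but a host load along these lines (assuming stress-ng's --switch stressor) produces the kind of context-switch pressure described:

  # one context-switch stressor per online CPU on the host, for 10 minutes
  $ stress-ng --switch 0 --timeout 10m --metrics-brief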

Revision history for this message
Colin Ian King (colin-king) wrote :

Testing with 4.4 on the host and the VM is showing:

[ 335.699014] sched: RT throttling activated
[ 337.600831] hrtimer: interrupt took 2939683820 ns

..which shows us that the host is suffering from some very large scheduling latency issues that are causing the VM some grief.

Revision history for this message
Colin Ian King (colin-king) wrote :

Can't repro the bug on 4.4 kernel on host. Will try 4.3 now

Revision history for this message
Colin Ian King (colin-king) wrote :

I wonder if it is possible to test with a recent 4.4 Xenial kernel on the host to see if that helps.

Revision history for this message
Colin Ian King (colin-king) wrote :

Bisecting is proving problematic as 4.3 kernels don't boot.

Revision history for this message
Colin Ian King (colin-king) wrote :

On an idle Xenial cloud image I'm seeing:

[ 1485.236760] [<ffff800000086ad0>] __switch_to+0x90/0xa8
[ 1485.236772] [<ffff800000143e80>] __tick_nohz_idle_enter+0x50/0x3f0
[ 1485.236776] [<ffff800000144478>] tick_nohz_idle_enter+0x40/0x70
[ 1485.236785] [<ffff80000010baf0>] cpu_startup_entry+0x288/0x2d8
[ 1485.236791] [<ffff80000008fca8>] secondary_start_kernel+0x120/0x130
[ 1485.236795] [<000000004008290c>] 0x4008290c

after a while I get:

[ 2462.806971] rcu_sched kthread starved for 15002 jiffies! g2579 c2578 f0x0 s3 ->state=0x1
[ 2667.835351] INFO: rcu_sched detected stalls on CPUs/tasks:
[ 2667.836918] 0-...: (66 GPs behind) idle=cf0/0/0 softirq=5177/5177 fqs=0
[ 2667.838801] 2-...: (0 ticks this GP) idle=73a/0/0 softirq=4570/4570 fqs=0
[ 2667.840696] 3-...: (64 GPs behind) idle=eba/0/0 softirq=4654/4654 fqs=0
[ 2667.842533] (detected by 1, t=15002 jiffies, g=2638, c=2637, q=4389)

and at this point sleeping blocks; for example, strace on sleep(1) in the VM shows nanosleep({1, 0}) sleeping forever, and one has to SIGINT it as it never times out.

Also, the secondary_start_kernel() in the trace indicates that the VM puts CPUs to sleep and wakes them on a timer.

I can trigger this more often with more CPUs in the VM and also by loading the host; for example, producing a lot of cache or memory activity triggers the initial hangs more frequently than an idle host does.

So, I suspect there is a cpuhotplug and nohz combo causing issues here.

Revision history for this message
Colin Ian King (colin-king) wrote :

A bit more digging: I see that the CPU goes into idle either via a single WFI (wait for interrupt) shallow sleep or a deeper arm_cpuidle_suspend(); the latter is akin to turning off the CPU. I wonder if we're seeing issues with the wakeup latency taking a long time inside QEMU when the host is loaded, causing the rcu_sched issues.

Revision history for this message
Colin Ian King (colin-king) wrote :
Revision history for this message
Colin Ian King (colin-king) wrote :

This article throws some light onto things:

https://lwn.net/Articles/518953/

"Second, the greater the number of idle CPUs, the more work RCU must do when forcing quiescent states. Yes, the busier the system, the less work RCU needs to do! The reason for the extra work is that RCU is not permitted to disturb idle CPUs for energy-efficiency reasons. RCU must therefore probe a per-CPU data structure to read out idleness state during each grace period, likely incurring a cache miss on each such probe."

Just to add, I'm running the VM with, say, 4 CPUs, all of which are idle.

In my experiments on the 3.19 and 4.2 kernels, kvm is not being used on the host, so we have QEMU emulating N CPUs on just 1 host CPU. On top of that, a loaded host means that this single CPU is busy and we get potentially large latencies serving the N virtual CPUs in the VM. I think that's part of the issue: large latencies from the host with an N-to-1 virt-to-host mapping mean that we are tripping the RCU grace periods.

To help keep the RCU kthreads from suffering from these delays, I added the following kernel parameters to the VM:

rcu_nocb_poll rcutree.kthread_prio=90 rcuperf.verbose=1

I was able to run an 8 CPU VM without any RCU issues while the host CPU was being hammered to death with stress-ng. I then also cranked down the RCU stall grace period to just 5 seconds to see how easily I could trip the issue with this more extreme setting, using:

echo 5 > /sys/module/rcupdate/parameters/rcu_cpu_stall_timeout

and again, no RCU issues.

@Martin,

can you try using the following kernel parameters on the VM and see if this helps:

rcu_nocb_poll rcutree.kthread_prio=90 rcuperf.verbose=1
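
A sketch of making these parameters persistent on the VM, assuming the cloud image boots via GRUB as the console log above suggests (how they actually get applied is up to you):

  # append the parameters to the default kernel command line
  $ sudo sed -i 's/^GRUB_CMDLINE_LINUX_DEFAULT="/&rcu_nocb_poll rcutree.kthread_prio=90 rcuperf.verbose=1 /' /etc/default/grub
  $ sudo update-grub
  $ sudo reboot
  # after reboot, confirm they took effect
  $ cat /proc/cmdline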

Revision history for this message
Colin Ian King (colin-king) wrote :

Also, can we clarify something: do the ARM hosts provide kvm? If not, one should really run the VMs with just one CPU.

Revision history for this message
Martin Pitt (pitti) wrote :

Thanks Colin, great work! I'll deploy this ASAP.

FYI, at least some of the VM hosts in scalingstack got updated to a 4.4 kernel. Not sure how much that changes your investigations.

Revision history for this message
Martin Pitt (pitti) wrote :

lxd-armhf1 (on swirlix01) has run without any lockup since the host kernel update to 4.4. I created a new lxd-armhf2 yesterday (on swirlix08) which also survived without any workaround. At the same time I created a new lxd-armhf3 (on swirlix16) which has locked up pretty well every < 15 minutes (I got a ton of watchdog reboot messages over the night). I deployed the "rcu_nocb_poll rcutree.kthread_prio=90 rcuperf.verbose=1" thing on that now.

I also asked #is about the host kernels. My suspicion is that swirlix{01,08} got updated to 4.4 while swirlix16 is still running an older one.

Revision history for this message
Martin Pitt (pitti) wrote :

hloeung | pitti: yeah, I believe work was done to get swirlix01-09 to 4.4

Revision history for this message
Martin Pitt (pitti) wrote :

[hloeung@ragnar tmp]$ for i in {01..09} 16; do ssh swirlix${i}.bos01.scalingstack "uname -a"; done
Linux swirlix01 4.4.0-30-generic #49~14.04.1-Ubuntu SMP Thu Jun 30 22:20:09 UTC 2016 aarch64 aarch64 aarch64 GNU/Linux
Linux swirlix02 4.4.0-30-generic #49~14.04.1-Ubuntu SMP Thu Jun 30 22:20:09 UTC 2016 aarch64 aarch64 aarch64 GNU/Linux
Linux swirlix03 4.4.0-30-generic #49~14.04.1-Ubuntu SMP Thu Jun 30 22:20:09 UTC 2016 aarch64 aarch64 aarch64 GNU/Linux
Linux swirlix04 4.4.0-30-generic #49~14.04.1-Ubuntu SMP Thu Jun 30 22:20:09 UTC 2016 aarch64 aarch64 aarch64 GNU/Linux
Linux swirlix05 4.4.0-30-generic #49~14.04.1-Ubuntu SMP Thu Jun 30 22:20:09 UTC 2016 aarch64 aarch64 aarch64 GNU/Linux
Linux swirlix06 4.4.0-30-generic #49~14.04.1-Ubuntu SMP Thu Jun 30 22:20:09 UTC 2016 aarch64 aarch64 aarch64 GNU/Linux
Linux swirlix07 4.4.0-30-generic #49~14.04.1-Ubuntu SMP Thu Jun 30 22:20:09 UTC 2016 aarch64 aarch64 aarch64 GNU/Linux
Linux swirlix08 4.4.0-30-generic #49~14.04.1-Ubuntu SMP Thu Jun 30 22:20:09 UTC 2016 aarch64 aarch64 aarch64 GNU/Linux
Linux swirlix09 4.4.0-30-generic #49~14.04.1-Ubuntu SMP Thu Jun 30 22:20:09 UTC 2016 aarch64 aarch64 aarch64 GNU/Linux
Linux swirlix16 4.2.0-36-generic #42~14.04.1-Ubuntu SMP Fri May 13 17:26:22 UTC 2016 aarch64 aarch64 aarch64 GNU/Linux

Revision history for this message
Martin Pitt (pitti) wrote :

> can you try using the following kernel parameters on the VM and see if this helps:
> rcu_nocb_poll rcutree.kthread_prio=90 rcuperf.verbose=1

the instance on swirlix16 (on 4.2 kernel) hung again (twice), with the attached console log. This now has the above kernel parameters, but I'm afraid it doesn't look any different than before.

Revision history for this message
Junien F (axino) wrote :

Hi,

I'm sorry but 4.4 is too unstable on the hosts. We have to reboot and/or power cycle them multiple times a day. We're back on 4.2 everywhere.

Haw gathered some perf data on a failing 4.4 host; perhaps we can start digging into the issue from there? Perhaps it should be a separate bug as well.

Thanks

Revision history for this message
Colin Ian King (colin-king) wrote :

yep, file a separate bug, the perf data will be useful. Thanks.

Revision history for this message
Junien F (axino) wrote :

Filed LP#1602577 for the host instability issue on 4.4

Revision history for this message
Colin Watson (cjwatson) wrote :

We're seeing rather similar symptoms on Launchpad builders after upgrading the guests from wily to xenial (console-log not very informative, e.g. https://pastebin.canonical.com/160898/plain/; build output appears hung; I can't tell for sure that it's the same thing, this is just a guess). These are on the same scalingstack hosts as Martin is using, so if anything the surprise is that we weren't seeing it before (we had different problems, mainly occasional hangs on boot). KVM is enabled, host is 4.2 per Junien's comments above, only interesting kernel parameter we're using at present is "compat_uts_machine=armv7l".

Revision history for this message
Colin Ian King (colin-king) wrote :

I'm reproducing rcu_sched timeouts all the time with a 4.4 kernel on a far slower ARM64 host with the same cloud images.

[ 157.555837] INFO: rcu_sched self-detected stall on CPU
[ 157.561551] INFO: rcu_sched detected stalls on CPUs/tasks:
[ 157.562669] 2-...: (14960 ticks this GP) idle=5b5/140000000000001/0 softirq=37/37 fqs=1354
[ 157.563478] (detected by 3, t=15002 jiffies, g=-251, c=-252, q=302)
[ 157.564033] Task dump for CPU 2:
[ 157.564468] swapper/0 R running task 0 1 0 0x00000002
[ 157.565071] Call trace:
[ 157.565469] [<ffff800000086ad0>] __switch_to+0x90/0xa8
[ 157.566052] [<ffff80000011b714>] generic_handle_irq+0x34/0x50
[ 157.566256] [<ffff80000011ba70>] __handle_domain_irq+0x68/0xc0

Revision history for this message
Colin Ian King (colin-king) wrote :

..and for one more datapoint, QEMU seems to be hung spinning on a futex:

futex(0xb05520, FUTEX_WAIT_PRIVATE, 2, NULL) = -1 EAGAIN (Resource temporarily unavailable)
futex(0xb054f4, FUTEX_CMP_REQUEUE_PRIVATE, 1, 2147483647, 0xb05520, 1104376) = 1

Revision history for this message
Colin Ian King (colin-king) wrote :

OK, ignore the last two messages; it eventually booted. It just seems that the host was rather slow.

Revision history for this message
dann frazier (dannf) wrote :

I wonder if this might be a dupe of LP: #1549494? We fixed that in xenial, but haven't backported the fix to wily. I haven't been able to reproduce this issue myself, but I uploaded a wily kernel w/ a backported fix to ppa:dannf/test, in case someone else can test it. It corresponds to the git branch here:

   https://code.launchpad.net/~dannf/ubuntu/+source/linux/+git/wily/+ref/lp1531768
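
Testing that kernel would look roughly like this (a sketch; the exact kernel package name in the PPA isn't listed here, so dist-upgrade is used to pick up whatever newer kernel it provides):

  $ sudo add-apt-repository ppa:dannf/test
  $ sudo apt-get update
  $ sudo apt-get dist-upgrade
  $ sudo reboot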

Revision history for this message
Colin Watson (cjwatson) wrote :

@dannf, this bug seems to be *worse* in xenial than in wily, so I don't think backporting a change from xenial to wily is going to help matters?

Revision history for this message
dann frazier (dannf) wrote :

@cjwatson: Ah, ok. I may have misread the history here. I had gleaned that the xenial kernel (as a host) was more unstable - but for different reasons.

Regardless, I have pulled the wily backport build I prepared, because it was frequently triggering a WARN() condition. Looks like my backport attempt was too naive and would need some work. But perhaps I should pause that effort until we confirm if 4.4 is impacted by this same issue.

Revision history for this message
Martin Pitt (pitti) wrote :

For the record, I now use two arm64 xenial (4.4) instances on a host with kernel 4.8, and things are looking really good. See latest posts to bug 1602577.

Revision history for this message
Martin Pitt (pitti) wrote :

I propose to close this. This is clearly fixed with 4.4 on the host, and rolling that out is covered by bug 1602577.

It can be closed for auto-package-testing either way as our arm64 nova compute nodes now run 4.4.23.

Changed in linux (Ubuntu):
status: Confirmed → Fix Released
Changed in auto-package-testing:
status: Triaged → Fix Released