[arm64] lockups some time after booting

Bug #1531768 reported by Martin Pitt
This bug affects 2 people
Affects                Status        Importance  Assigned to      Milestone
Auto Package Testing   Fix Released  Medium      Unassigned
linux (Ubuntu)         Fix Released  Medium      Colin Ian King

Bug Description

I created an 8 CPU arm64 instance on Canonical's Scalingstack (which I want to use for armhf autopkgtesting in LXD). I started with wily as that has lxd available (it's not yet available in trusty nor the PPA for arm64).

However, pretty much any LXD task that I do (I haven't tried much else) on this machine takes unbearably long. A simple "lxc profile set default raw.lxc lxc.seccomp=" or "lxc list" takes several minutes.

I see tons of

[ 1020.971955] rcu_sched kthread starved for 6000 jiffies! g1095 c1094 f0x0
[ 1121.166926] INFO: task fsnotify_mark:69 blocked for more than 120 seconds.

in dmesg (the attached apport info has the complete dmesg).

ProblemType: Bug
DistroRelease: Ubuntu 15.10
Package: linux-image-4.2.0-22-generic 4.2.0-22.27
ProcVersionSignature: User Name 4.2.0-22.27-generic 4.2.6
Uname: Linux 4.2.0-22-generic aarch64
AlsaDevices:
 total 0
 crw-rw---- 1 root audio 116, 1 Jan 7 09:18 seq
 crw-rw---- 1 root audio 116, 33 Jan 7 09:18 timer
AplayDevices: Error: [Errno 2] No such file or directory: 'aplay'
ApportVersion: 2.19.1-0ubuntu5
Architecture: arm64
ArecordDevices: Error: [Errno 2] No such file or directory: 'arecord'
AudioDevicesInUse: Error: command ['fuser', '-v', '/dev/snd/seq', '/dev/snd/timer'] failed with exit code 1:
CRDA: N/A
Date: Thu Jan 7 09:24:01 2016
IwConfig:
 eth0 no wireless extensions.

 lo no wireless extensions.

 lxcbr0 no wireless extensions.
Lspci:
 00:00.0 Host bridge [0600]: Red Hat, Inc. Device [1b36:0008]
  Subsystem: Red Hat, Inc Device [1af4:1100]
  Control: I/O- Mem- BusMaster- SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- DisINTx-
  Status: Cap- 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
Lsusb: Error: command ['lsusb'] failed with exit code 1: unable to initialize libusb: -99
PciMultimedia:

ProcEnviron:
 TERM=screen
 PATH=(custom, no user)
 XDG_RUNTIME_DIR=<set>
 LANG=en_US.UTF-8
 SHELL=/bin/bash
ProcFB:

ProcKernelCmdLine: BOOT_IMAGE=/boot/vmlinuz-4.2.0-22-generic root=LABEL=cloudimg-rootfs earlyprintk
RelatedPackageVersions:
 linux-restricted-modules-4.2.0-22-generic N/A
 linux-backports-modules-4.2.0-22-generic N/A
 linux-firmware 1.149.3
RfKill: Error: [Errno 2] No such file or directory: 'rfkill'
SourcePackage: linux
UdevLog: Error: [Errno 2] No such file or directory: '/var/log/udev'
UpgradeStatus: No upgrade log present (probably fresh install)

Revision history for this message
Martin Pitt (pitti) wrote :
description: updated
Revision history for this message
Martin Pitt (pitti) wrote :

"reboot" also takes too long to be practical (I killed the instance after waiting for 10 mins, as it didn't even begin to shut down).

Revision history for this message
Brad Figg (brad-figg) wrote : Missing required logs.

This bug is missing log files that will aid in diagnosing the problem. From a terminal window please run:

apport-collect 1531768

and then change the status of the bug to 'Confirmed'.

If, due to the nature of the issue you have encountered, you are unable to run this command, please add a comment stating that fact and change the bug status to 'Confirmed'.

This change has been made by an automated script, maintained by the Ubuntu Kernel Team.

Changed in linux (Ubuntu):
status: New → Incomplete
Martin Pitt (pitti)
tags: added: bot-stop-nagging
Changed in linux (Ubuntu):
status: Incomplete → Confirmed
Revision history for this message
Martin Pitt (pitti) wrote : Re: kernel 4.2/wily on arm64 and multiple CPUs is unusably slow

I tried to install the current xenial kernel (http://ports.ubuntu.com/pool/main/l/linux/linux-image-4.3.0-5-generic_4.3.0-5.16_arm64.deb). Package installation fails due to

  Processing triggers for initramfs-tools (0.120ubuntu6) ...
  update-initramfs: Generating /boot/initrd.img-4.3.0-5-generic
  Unsupported platform.
  run-parts: /etc/initramfs/post-update.d//flash-kernel exited with return code 1

but nevertheless I can boot with "reboot -f" (as a normal reboot hangs). Now lxc fails on lxc-net:

Jan 07 10:03:15 lxd-armhf1w systemd[1]: Starting LXC network bridge setup...
Jan 07 10:03:16 lxd-armhf1w lxc-net[10651]: RTNETLINK answers: Operation not supported
Jan 07 10:03:16 lxd-armhf1w lxc-net[10651]: Failed to setup lxc-net.

In particular, it fails on "ip link add dev lxcbr0 type bridge".

dmesg error for this:

[ 199.168466] module x_tables: unsupported RELA relocation: 275
[ 199.232020] module llc: unsupported RELA relocation: 275

which might be a regression in xenial's arm64 kernel, or lxc might need to be adjusted to it. Either way, this is a deal-breaker, so with the xenial kernel I can't do much.

However, operations like "lxc list" work without the lxcbr0 bridge. They still take several minutes, but I don't get the dmesg errors any more.

Revision history for this message
Martin Pitt (pitti) wrote :

I tried to nova boot a trusty instance and dist-upgrade the userspace packages to xenial, keeping the 3.19 linux kernel. Same effect: "lxc list" and other operations take many minutes, and lxd.service itself keeps timing out too.

summary: - kernel 4.2/wily on arm64 and multiple CPUs is unusably slow
+ arm64 kernel and multiple CPUs is unusably slow with lxd operations
Revision history for this message
Martin Pitt (pitti) wrote : Re: arm64 kernel and multiple CPUs is unusably slow with lxd operations

I retried the same on m1.medium with 2 CPUs and 4 GB RAM, and lxd works fine there with the 4.2 kernel on wily. Unfortunately that's too small for my purposes. m1.large with 4 CPUs/8 GB RAM also seems to work well; I can make do with that.

William points out that the hosts on bos01 only have 8 CPUs. So maybe this starts happening if the guest gets at least as many CPUs as the host?

summary: - arm64 kernel and multiple CPUs is unusably slow with lxd operations
+ arm64 kernel and >= 8 CPUS (>= host CPU count?) is unusably slow with
+ lxd operations
Revision history for this message
Martin Pitt (pitti) wrote : Re: arm64 kernel and multiple CPUs is unusably slow

I take that back. It does survive for much longer, but after some 15 minutes of running I again run into tons of

[ 2424.611668] INFO: task systemd-udevd:1320 blocked for more than 120 seconds.
[ 2424.613514] Tainted: G W 4.2.0-22-generic #27-Ubuntu
[ 2424.615183] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 2424.617166] systemd-udevd D ffffffc000086ee4 0 1320 1 0x0000000c
[ 2424.617176] Call trace:
[ 2424.617842] [<ffffffc000086ee4>] __switch_to+0x94/0xa8
[ 2424.617851] [<ffffffc0008a7100>] __schedule+0x2b0/0x7b8
[ 2424.617854] [<ffffffc0008a7644>] schedule+0x3c/0x98
[ 2424.617859] [<ffffffc0008aa954>] schedule_timeout+0x1ec/0x280
[ 2424.617862] [<ffffffc0008a8364>] wait_for_common+0xcc/0x1a0
[ 2424.617866] [<ffffffc0008a8460>] wait_for_completion+0x28/0x38
[ 2424.617870] [<ffffffc000120654>] __synchronize_srcu+0x9c/0x180
[ 2424.617873] [<ffffffc000120770>] synchronize_srcu+0x38/0x48
[ 2424.617877] [<ffffffc00028b7b4>] fsnotify_destroy_group+0x2c/0x60
[ 2424.617880] [<ffffffc00028de3c>] inotify_release+0x34/0x78
[ 2424.617885] [<ffffffc00024537c>] __fput+0xa4/0x248
[ 2424.617887] [<ffffffc000245598>] ____fput+0x20/0x30
[ 2424.617892] [<ffffffc0000e0604>] task_work_run+0xbc/0xf8
[ 2424.617896] [<ffffffc0000c29d0>] do_exit+0x2f0/0xa48
[ 2424.617898] [<ffffffc0000c31bc>] do_group_exit+0x44/0xe8
[ 2424.617902] [<ffffffc0000d09b8>] get_signal+0x3d8/0x578
[ 2424.617906] [<ffffffc000089f20>] do_signal+0x90/0x530
[ 2424.617909] [<ffffffc00008a640>] do_notify_resume+0x70/0x78

for all kinds of processes.

summary: - arm64 kernel and >= 8 CPUS (>= host CPU count?) is unusably slow with
- lxd operations
+ arm64 kernel and multiple CPUs is unusably slow
Revision history for this message
Chris J Arges (arges) wrote :

Martin,
Can you collect apport information from the host system as well?
Do you get the same effects with a single vCPU?
--chris

Changed in linux (Ubuntu):
importance: Undecided → Medium
tags: added: kernel-da-key
Revision history for this message
Martin Pitt (pitti) wrote :

> Can you collect apport information from the host system as well?

Sorry, I can't. I can create Scalingstack instances, but I have no access to the host systems. The IS team certainly can, though.

> Do you get the same effects with a single vCPU?

So far that test system is holding up and I haven't seen processes getting locked up.

I do see that networking within containers is totally busted (transmitting 20 bytes took some 10 minutes), but with the multi-CPU instance I didn't even get that far. This also happens with 3.19 (cannot test 4.3 due to its regression of creating bridges). Either way, this seems to be a separate bug.

Revision history for this message
Martin Pitt (pitti) wrote :

FTR, the "networking broken in containers" was an MTU mismatch, worked around now. Thanks to Andy for figuring this out!

Revision history for this message
Martin Pitt (pitti) wrote :

I split out the xenial bridge regression into bug 1534545, so that this can keep focussing on the "processes become slow and hang after a while" main aspect.

Revision history for this message
Andy Whitcroft (apw) wrote :

The lxc hangs component looks to be an lxd-related issue. Specifically, the Go libraries in use consume a large amount of entropy and hang waiting for it to become available. Installing haveged seems to resolve these hangs.
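
For reference, installing it and checking the entropy pool is roughly the following (a sketch; the exact commands aren't recorded here):

  # check how much entropy the kernel pool has; very low values explain
  # why reads of /dev/random (and Go's crypto initialisation) block
  cat /proc/sys/kernel/random/entropy_avail
  # install the userspace entropy daemon; the Ubuntu package should start it automatically
  sudo apt-get install -y haveged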

Revision history for this message
Andy Whitcroft (apw) wrote :

The rcu messages, though annoying, do seem to be benign, as they do not increase over time.

Revision history for this message
Martin Pitt (pitti) wrote :

I did install haveged, which indeed seems to help quite a bit. But now, after having used an xlarge (8 CPU) instance for a while, I again get hanging processes, like

ubuntu 2317 0.0 0.0 0 0 pts/0 D+ 16:14 0:00 [tail]

I used that tail on /var/log/lxd/lxd.log to see what's going on. lxd itself stopped responding much earlier (but not in kernel "D" state, it's sleeping).

Sorry, this is still very unspecific..

Revision history for this message
Martin Pitt (pitti) wrote :

Another data point: I tried to install 3.19 (the kernel that we have on the buildds) on the xlarge instance, and lxc list now hangs there as well.

I haven't yet seen lxc list hang on a large (4 CPUs) instance, but the whole thing (running tests in containers) is still very slow. TBC on Monday...

Revision history for this message
Martin Pitt (pitti) wrote :

lxd-armhf1 (8 CPUs) is again in a state where "lxc list" and even "top" hang forever. lxd-armhf2 was unfortunately shut down in the previous days, so I just booted it again.

Revision history for this message
Martin Pitt (pitti) wrote : Re: lxd and other commands get stuck on arm64 kernel and multiple CPUs

Retitling. The "unusably slow" part was fixed by installing haveged, so what remains is that the 8x CPU instance gets into this lockup state after some time.

On the 4x instance I'm now running adt-run in a loop; so far it's through ~ 10 iterations. I'll let it run overnight and see how it keeps up.

summary: - arm64 kernel and multiple CPUs is unusably slow
+ lxd and other commands get stuck on arm64 kernel and multiple CPUs
Revision history for this message
Martin Pitt (pitti) wrote :

I managed to get the 4x CPU instance into the same locked up state now, so AFAICS the problem isn't fundamentally different between 4 and 8 cores.

Revision history for this message
Martin Pitt (pitti) wrote :

Reducing the number of threads that Go uses seems to help a bit:

$ cat /etc/systemd/system/lxd.service.d/override.conf
[Service]
Environment=GOMAXPROCS=1

(GOMAXPROCS defaults to the number of CPUs). But Stéphane is still able to lock up LXD pretty fast even with that.
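
As a usage note, applying such a drop-in follows the standard systemd steps (a sketch, not commands taken from this bug):

  $ sudo mkdir -p /etc/systemd/system/lxd.service.d
  $ printf '[Service]\nEnvironment=GOMAXPROCS=1\n' | sudo tee /etc/systemd/system/lxd.service.d/override.conf
  $ sudo systemctl daemon-reload
  $ sudo systemctl restart lxd.service
  $ sudo systemctl show lxd.service -p Environment   # confirm the override took effect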

Revision history for this message
Stéphane Graber (stgraber) wrote :

Very much looks like it's related to threading and futexes somehow.

Forcing golang to use a single thread rather than one per container made things more stable using a very simple test (infinite loop of "lxc list"), though starting containers then still caused the hang to happen.

I've seen a similar hang on futex when running (lxc-tests package):
lxc-test-concurrent -j 8 -i 50

This creates and spawns 8 containers in parallel using threads and attempts that 50 times in a row. This is done entirely in C so doesn't touch golang.

Martin Pitt (pitti)
summary: - lxd and other commands get stuck on arm64 kernel and multiple CPUs
+ [arm64] multithreaded processes get locked up in futexes
Revision history for this message
Martin Pitt (pitti) wrote : Re: [arm64] multithreaded processes get locked up in futexes

Some good news: With bug 1534545 fixed I was now able to upgrade to the Xenial 4.4 kernel. On the 4x CPU instance two parallel adt-run loops have now run for about two hours without any dmesg spew. Stéphane has run "lxc-test-concurrent -j 16 -i 10" twice on the 8x CPU instance successfully too.

Bad news: I rebooted the 8x CPU instance (also xenial du jour with the 4.4 kernel) and didn't do anything on it. After just sitting idle for an hour or two, ssh stopped responding and nova console-log shows http://paste.ubuntu.com/14857144/ (only a hard reboot helped). So it wet its pants without actually doing anything.

So it appears it's not fully fixed yet, but muuch better. I'll do some more smoke testing, and if 4x CPU instances work, this is good enough to put this into production. I'll keep the old Calxeda instances alive as a fallback for a while, of course.

Revision history for this message
Martin Pitt (pitti) wrote :

Darn, I now get the "instance kills itself after some time" on the 4x CPU as well. nova console-log shows the blurb below and ssh and lxd ports are dead (so I can't learn anything further from the box than console-log).

Ubuntu Xenial Xerus (development branch) lxd-armhf2 ttyAMA0

lxd-armhf2 login: [ 954.144506] INFO: rcu_sched detected stalls on CPUs/tasks:
[ 954.145743] 1-...: (79 GPs behind) idle=284/0/0 softirq=407182/407182 fqs=1
[ 954.147202] (detected by 3, t=15002 jiffies, g=21817, c=21816, q=1563)
[ 954.148590] Call trace:
[ 954.149123] rcu_sched kthread starved for 15002 jiffies! g21817 c21816 f0x0 s3 ->state=0x1
[ 3000.217089] INFO: task systemd:1 blocked for more than 120 seconds.
[ 3000.218529] Not tainted 4.4.0-2-generic #16-Ubuntu
[ 3000.219628] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 3000.221310] Call trace:
[ 3000.222562] INFO: task kworker/0:2:12463 blocked for more than 120 seconds.
[ 3000.223985] Not tainted 4.4.0-2-generic #16-Ubuntu
[ 3000.225146] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 3000.226741] Call trace:
[ 3000.227306] INFO: task (d-logind):15441 blocked for more than 120 seconds.
[ 3000.228685] Not tainted 4.4.0-2-generic #16-Ubuntu
[ 3000.229834] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 3000.231469] Call trace:
[ 3120.231067] INFO: task systemd:1 blocked for more than 120 seconds.
[ 3120.232501] Not tainted 4.4.0-2-generic #16-Ubuntu
[ 3120.233629] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 3120.235393] Call trace:
[ 3120.236702] INFO: task kworker/0:2:12463 blocked for more than 120 seconds.
[ 3120.238188] Not tainted 4.4.0-2-generic #16-Ubuntu
[ 3120.239398] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 3120.241140] Call trace:
[ 3120.241716] INFO: task (d-logind):15441 blocked for more than 120 seconds.
[ 3120.243223] Not tainted 4.4.0-2-generic #16-Ubuntu
[ 3120.244366] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 3120.245945] Call trace:
[ 3240.244955] INFO: task systemd:1 blocked for more than 120 seconds.
[ 3240.246398] Not tainted 4.4.0-2-generic #16-Ubuntu
[ 3240.247526] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 3240.249272] Call trace:
[ 3240.250568] INFO: task kworker/0:2:12463 blocked for more than 120 seconds.
[ 3240.252060] Not tainted 4.4.0-2-generic #16-Ubuntu
[ 3240.253280] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 3240.254966] Call trace:
[ 3240.255549] INFO: task (d-logind):15441 blocked for more than 120 seconds.
[ 3240.257073] Not tainted 4.4.0-2-generic #16-Ubuntu
[ 3240.258259] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 3240.259906] Call trace:
[ 3360.258901] INFO: task systemd:1 blocked for more than 120 seconds.
[ 3360.260349] Not tainted 4.4.0-2-generic #16-Ubuntu
[ 3360.261475] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 3360.263224] Call trace:

Revision history for this message
Martin Pitt (pitti) wrote :

For the record, this "auto-destruct" behaviour with the xenial kernel happens just by itself: reboot the instance, let it sit there for 15 or 60 minutes, then this kernel spew starts happening and it locks up, losing network/ssh access. There was no actual payload on these.

Revision history for this message
Martin Pitt (pitti) wrote : Re: [arm64] locks up a few minutes after booting
Download full text (44.4 KiB)

I re-tried with the current kernel 4.4.0-8, and merely booting a pristine cloud image with "nova boot --poll --image ubuntu/ubuntu-xenial-daily-arm64-server-20160227-uefi1.img --flavor m1.large" and letting it sit there for some 20 minutes is still auto-destructing:

error: no suitable video mode found.
EFI stub: Booting Linux Kernel...
EFI stub: Using DTB from configuration table
EFI stub: Exiting boot services and installing virtual address map...
[ 0.000000] Booting Linux on physical CPU 0x0
[ 0.000000] Initializing cgroup subsys cpuset
[ 0.000000] Initializing cgroup subsys cpu
[ 0.000000] Initializing cgroup subsys cpuacct
[ 0.000000] Linux version 4.4.0-8-generic (buildd@beebe) (gcc version 5.3.1 20160222 (Ubuntu/Linaro 5.3.1-9ubuntu3) ) #23-Ubuntu SMP Wed Feb 24 20:51:39 UTC 2016 (Ubuntu 4.4.0-8.23-generic 4.4.2)
[ 0.000000] Boot CPU: AArch64 Processor [500f0001]
[ 0.000000] efi: Getting EFI parameters from FDT:
[ 0.000000] EFI v2.40 by EDK II
[ 0.000000] efi:
[ 0.000000] psci: probing for conduit method from DT.
[ 0.000000] psci: PSCIv0.2 detected in firmware.
[ 0.000000] psci: Using standard PSCI v0.2 function IDs
[ 0.000000] psci: Trusted OS migration not required
[ 0.000000] PERCPU: Embedded 17 pages/cpu @ffff8001fff7d000 s31128 r8192 d30312 u69632
[ 0.000000] Detected PIPT I-cache on CPU0
[ 0.000000] Built 1 zonelists in Zone order, mobility grouping on. Total pages: 2064384
[ 0.000000] Kernel command line: BOOT_IMAGE=/boot/vmlinuz-4.4.0-8-generic root=LABEL=cloudimg-rootfs vt.handoff=7
[ 0.000000] log_buf_len individual max cpu contribution: 4096 bytes
[ 0.000000] log_buf_len total cpu_extra contributions: 12288 bytes
[ 0.000000] log_buf_len min size: 16384 bytes
[ 0.000000] log_buf_len: 32768 bytes
[ 0.000000] early log buf free: 14588(89%)
[ 0.000000] PID hash table entries: 4096 (order: 3, 32768 bytes)
[ 0.000000] Dentry cache hash table entries: 1048576 (order: 11, 8388608 bytes)
[ 0.000000] Inode-cache hash table entries: 524288 (order: 10, 4194304 bytes)
[ 0.000000] software IO TLB [mem 0xfbffb000-0xffffb000] (64MB) mapped at [ffff8000bbffb000-ffff8000bfffafff]
[ 0.000000] Memory: 8142020K/8388608K available (8552K kernel code, 1007K rwdata, 3736K rodata, 748K init, 783K bss, 246588K reserved, 0K cma-reserved)
[ 0.000000] Virtual kernel memory layout:
[ 0.000000] vmalloc : 0xffff000000000000 - 0xffff7bffbfff0000 (126974 GB)
[ 0.000000] vmemmap : 0xffff7bffc0000000 - 0xffff7fffc0000000 ( 4096 GB maximum)
[ 0.000000] 0xffff7bffc1000000 - 0xffff7bffc9000000 ( 128 MB actual)
[ 0.000000] fixed : 0xffff7ffffa7fd000 - 0xffff7ffffac00000 ( 4108 KB)
[ 0.000000] PCI I/O : 0xffff7ffffae00000 - 0xffff7ffffbe00000 ( 16 MB)
[ 0.000000] modules : 0xffff7ffffc000000 - 0xffff800000000000 ( 64 MB)
[ 0.000000] memory : 0xffff800000000000 - 0xffff800200000000 ( 8192 MB)
[ 0.000000] .init : 0x...

summary: - [arm64] multithreaded processes get locked up in futexes
+ [arm64] locks up a few minutes after booting
Revision history for this message
Martin Pitt (pitti) wrote :

I tried this again yesterday evening with an up-to-date xenial arm64 image, and lo and behold: both setting it up and running two parallel loops calling adt-run overnight worked; they went through ~ 500 iterations without a hitch.

So I suppose this was fixed by some newer kernel, the new glibc, some changed Scalingstack configuration, or the tooth fairy :-)

Whatever it was, I'm closing this now, as this bug has become a bit unwieldy. I'll report a new bug if this happens again.

Changed in linux (Ubuntu):
status: Confirmed → Fix Released
Revision history for this message
Martin Pitt (pitti) wrote :

Meh, of course this came back. Only not a few minutes after booting any more, but after two days.

Changed in linux (Ubuntu):
status: Fix Released → Confirmed
summary: - [arm64] locks up a few minutes after booting
+ [arm64] locks up some time after booting
Revision history for this message
Martin Pitt (pitti) wrote : Re: [arm64] locks up some time after booting

This is a syslog from one boot up to the point where I hard-rebooted the instance because it was completely hanging. The kernel errors still look by and large like the ones from the original report (in JournalErrors.txt).

Changed in linux (Ubuntu):
assignee: nobody → Colin Ian King (colin-king)
Revision history for this message
Colin Ian King (colin-king) wrote :

Martin, I'm going on a hunch; can you try these kernels to see if they help:

http://kernel.ubuntu.com/~cking/lp1531768/

Revision history for this message
Martin Pitt (pitti) wrote : Re: [Bug 1531768] Re: [arm64] locks up some time after booting

Colin Ian King [2016-05-13 11:07 -0000]:
> Martin, I'm going on a hunch; can you try these kernels to see if they
> help:
>
> http://kernel.ubuntu.com/~cking/lp1531768/

I installed that on two out of three lxd hosts. I'll watch the
notification emails of my watchdog and will let you know next week!
(Usually they don't survive more than a few hours).

Many thanks, you rock!

Revision history for this message
Martin Pitt (pitti) wrote : Re: [arm64] locks up some time after booting

Two boxes (lxd-armhf{1,2}) have been running for three days with cking's kernel; box 3 is running the standard xenial kernel. Boxes 1 and 3 don't respond to ssh connections any more; 2 still does, but several processes are in 'D' state:

root 47 0.0 0.0 0 0 ? D May13 0:00 [fsnotify_mark]
root 12892 0.0 0.0 0 0 ? D May13 0:00 [kworker/1:0]
root 26785 0.0 0.0 0 0 ? D May13 0:00 [kworker/1:2]

These are in uninterruptible kernel sleep, and userspace processes such as lxd and systemd (pid 1) are stuck the same way. Calling "top" hangs as well (but "ps" works), and there are a ton of zombie processes (including lxd, check-new-release, sshd, socat).
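
Since "ps" still works while "top" does not, the stuck tasks can be enumerated roughly like this (a generic sketch, not output from these boxes):

  # list tasks in uninterruptible sleep together with their kernel wait channel
  ps axo pid,stat,wchan:32,comm | awk '$2 ~ /^D/'
  # if sysrq is enabled, dump blocked tasks with stack traces into dmesg
  echo w | sudo tee /proc/sysrq-trigger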

All three continue to have lots of dmesg like

[203722.697873] rcu_sched kthread starved for 15002 jiffies! g50919 c50918 f0x0 s3 ->state=0x1
[209670.455074] INFO: rcu_sched detected stalls on CPUs/tasks:
[209670.456446] 3-...: (126 GPs behind) idle=a7e/0/0 softirq=30811/30811 fqs=1
[209670.457910] (detected by 2, t=15002 jiffies, g=50962, c=50961, q=8377)
[209670.459430] Call trace:

So I'd say that this kernel did not really help (but also did not make things worse for sure).

Revision history for this message
Colin Ian King (colin-king) wrote :

OK, that's not so great. Can you see if the latest 4.6 kernel shows any improvement?

http://kernel.ubuntu.com/~cking/lp1531768/4.6

Revision history for this message
Martin Pitt (pitti) wrote :

After installing the 4.6 kernel and rebooting, lxd-bridge.service failed the same way on two boxes:

May 17 11:27:26 lxd-armhf1 lxd-bridge.start[2112]: Failed to setup lxd-bridge.
May 17 11:27:26 lxd-armhf1 lxd-bridge.start[2129]: RTNETLINK answers: Operation not supported
May 17 11:27:26 lxd-armhf1 lxd-bridge.start[2129]: Failed to setup lxd-bridge.

A simple "brctl addbr foo" does work though, so I guess this happens later on when trying to add veths to it or setting its IP etc. Unfortunately there is no dmesg output about this at all. But after another reboot this curiously succeeded (I verified uname -a that in both cases I was running 4.6.0).

I now let one box run with 4.6.0, let's see how long it'll hold up.

Another note: with this new kernel, and I think also with the previous 4.4 one, I'm getting an awful lot of "failure to fork" errors, e. g. when trying to run apt-get install; up to the point that right after a fresh boot I can't even install a simple package like "bridge-utils" any more. "ps aux" looks fairly harmless, just 150 lines, which includes all the kernel threads. It could of course be that over the course of all the automatic reboots the file system got corrupted in some way -- but usually another reboot then gets lucky and apt-get install (as well as setting up the above bridge) works.

Revision history for this message
Martin Pitt (pitti) wrote :

> I now let one box run with 4.6.0, let's see how long it'll hold up.

Checking again, I got a dozen auto-reboots from the watchdog, and after disabling the watchdog and inspecting the box after 4 hours I see exactly the same symptoms.

Revision history for this message
Colin Ian King (colin-king) wrote :

Hi Martin, I've built a Xenial kernel now with a load of debug enabled; it may catch some kind of issue or provide a hint of what's going on.

http://kernel.ubuntu.com/~cking/lp1531768/4.4-debug/

Care to try this out? It won't fix anything, but it may capture some interesting bug info if it sees anything out of the ordinary on locking etc.

Revision history for this message
Martin Pitt (pitti) wrote :

I installed the debug kernel on one arm64 box last night, and it has now run for 12 hours. lxd is still running, no hung processes, and dmesg shows nothing unusual, i. e. no extra debug messages. Yay heisenbug?

The other box with the standard xenial kernel has locked up as always; I'll install the debug kernel on that too now.

Thanks Colin!

Revision history for this message
Martin Pitt (pitti) wrote :

One box completely froze again (ssh does not respond any more), attaching dmesg. However, I'm not sure that this actually contains what you were looking for -- there is a lot of the usual chatter from starting/stopping containers, and then some traces about hung tasks when trying to flush the file system (I have /srv on btrfs so that containers have some acceptable performance -- but NB that this also happened in my earlier experiments on plain ext4).

Do I need to do anything to enable the extra debugging you added? Or is that perhaps not in dmesg?

Revision history for this message
Martin Pitt (pitti) wrote :

This happens without any --user-data or any particular interaction, just by plainly booting a standard image.

Revision history for this message
Martin Pitt (pitti) wrote :

Colin had the hunch that this actually happens if the CPUs do *not* have anything to do, i. e. want to go into a low freq/power state. We successfully ran the full stress-ng test suite on the instance without triggering any of this, but a few minutes after it was done, hung processes started to appear again.

We currently run a test whether booting with "nohz=off" works around that. So far it has survived for some 30 minutes (which is very promising).

This also has never triggered on a single-CPU instance (m1.small).

Changed in auto-package-testing:
status: New → Triaged
importance: Undecided → Medium
assignee: nobody → Martin Pitt (pitti)
Revision history for this message
Martin Pitt (pitti) wrote :

Neither did this trigger on two trusty 4-cpu instances so far.

Revision history for this message
Martin Pitt (pitti) wrote :

These have held up throughout the night \o/

I added the workaround to the worker setup script: https://git.launchpad.net/~ubuntu-release/+git/autopkgtest-cloud/commit/?id=50583f06

Changed in auto-package-testing:
assignee: Martin Pitt (pitti) → nobody
status: Triaged → Fix Released
summary: - [arm64] locks up some time after booting
+ [arm64] locks up some time after booting when idle if tickless (nohz=on)
+ is used
summary: - [arm64] locks up some time after booting when idle if tickless (nohz=on)
- is used
+ [arm64] lockups when idle if tickless (nohz=on) is used
Revision history for this message
Colin Ian King (colin-king) wrote : Re: [arm64] lockups when idle if tickless (nohz=on) is used

That's great news! I'll try and figure out what the root cause is. Let me know if there are other issues.

Revision history for this message
Paul Gear (paulgear) wrote :

We've also been running into this issue on ScalingStack instances recently; I got this traceback which seems to strongly implicate nohz as the problem area: https://pastebin.canonical.com/158640/ Presently testing @pitti's workaround on a number of different sized instances to confirm.

Revision history for this message
Colin Ian King (colin-king) wrote :

It may be worth trying nohz=off on the host as well, just as an experiment to see if this also improves things.

Revision history for this message
Martin Pitt (pitti) wrote :

Oh noes! I'm still getting "task * blocked for more than 120 seconds" hangs even with nohz=off :-( Is there another option which I could try?

Changed in auto-package-testing:
status: Fix Released → Triaged
Revision history for this message
Martin Pitt (pitti) wrote :

> It may be worth trying nohz=off on the host as well

Junien did that on the nova compute host, and no change. Processes in the instance still freeze.

This is actually also consistent with the observation that this apparently does not happen with the trusty kernel.

Revision history for this message
Martin Pitt (pitti) wrote :

FTR, running the trusty kernel on xenial userspace does not work: http://paste.ubuntu.com/17392362/

cking | pitti, syscall 384 on aarch64 is getrandom() and that does not exist on trusty

Revision history for this message
Martin Pitt (pitti) wrote :

Hang still occurs with xenial kernel and one instance of

   nice -n 19 dd if=/dev/zero of=/dev/null bs=1024 &

I have now rebooted and started four dd's, so that all four CPUs should remain busy constantly.
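
One way to start one such loop per CPU (a sketch; the exact invocation used for the four dd's isn't shown here):

  $ for i in $(seq "$(nproc)"); do nice -n 19 dd if=/dev/zero of=/dev/null bs=1024 & done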

Revision history for this message
Colin Ian King (colin-king) wrote :

If 4 dd's work OK, it may be worth running a minimal sleep loop:

while true; do sleep 0.5; done

Revision history for this message
Martin Pitt (pitti) wrote :

Further notekeeping:
 - 4 dd's (xenial+nohz=off) survived for half a day; then the instance crashed on something else.
 - trusty and vivid kernels with nohz=off have survived for a full day without any lockups. lxd on the trusty kernel causes a lot of leaked "FREEZED/FREEZING" containers, but that's unrelated and does not happen with the vivid kernel. So it's unclear whether this combination is stable, whether the lockups are just reduced, or whether it was just lucky.
 - trusty kernel without the nohz option locked up a few minutes after reboot, without actually running any test.

Revision history for this message
Colin Ian King (colin-king) wrote :

I'm trying to get a reliable reproducer on a similarly sized aarch64 host. Just so that I'm not missing anything, what is the entire command line being used on the host to run the VM?

Also, what is /proc/cmdline on the VM?

Revision history for this message
Colin Ian King (colin-king) wrote :

And the /proc/cmdline info from the host would be of some use to see if anything special there is being used.

Revision history for this message
Martin Pitt (pitti) wrote : Re: [Bug 1531768] Re: [arm64] lockups when idle if tickless (nohz=on) is used

Colin Ian King [2016-06-17 10:50 -0000]:
> I'm trying to get a reliable reproducer on a similarly sized aarch64
> host. Just so that I'm not missing anything, what is the entire command
> line being used on the host to run the VM?

I can't determine this. I asked Junien on IRC to put it here.

> Also, what is /proc/cmdline on the VM?

Aside from the "nohz=off" it's rather unsurprising:

  BOOT_IMAGE=/boot/vmlinuz-4.2.0-38-generic root=UUID=b98e4d93-8d8f-4349-a6ce-b5a87cdb2edd ro nohz=off

Revision history for this message
Colin Ian King (colin-king) wrote : Re: [arm64] lockups when idle if tickless (nohz=on) is used

I'd like to rule out whether we are missing IRQs on the host and inside the VM, so can both be booted with the kernel parameter irqpoll?

Unfortunately this can eat more CPU cycles, so I'm reluctant to ask for it to be used, but I'm wondering whether the host or the VM is occasionally missing timer wakeups.

Revision history for this message
Junien F (axino) wrote :
Revision history for this message
Martin Pitt (pitti) wrote :

>- trusty and vivid kernels with nohz=off have survived for a full day without any lockups.

They both hung last night.

So in summary: Neither nohz=off nor older kernels help here. This really seems to be a matter of luck/what's going on on the host system.

summary: - [arm64] lockups when idle if tickless (nohz=on) is used
+ [arm64] lockups some time after booting
Revision history for this message
Colin Ian King (colin-king) wrote :

I've been running Xenial host + Xenial VM on an mcdivitt 8-core box and have not been able to reproduce this issue. I'm going to keep it running for one more day.

Do we have any idea what the host(s) hardware is? I'm starting to wonder if it is a host/VM interaction issue.

Revision history for this message
William Grant (wgrant) wrote :

The production hardware is mcdivitt as well, running trusty with lts-vivid or lts-wily.

Revision history for this message
Colin Ian King (colin-king) wrote :

Thanks William, I'm going to soak test with those older kernels and see if I can trip the hang on these.

Revision history for this message
Colin Ian King (colin-king) wrote :

Finally able to trip an rcu timeout: 3.19.0-61-generic kernel on the host, xenial server image in the VM, host busy with async I/O requests (via stress-ng):

[ 825.195520] systemd[1]: Started Journal Service.
[ 900.108730] INFO: rcu_sched detected stalls on CPUs/tasks:
[ 900.110254] 0-...: (4 GPs behind) idle=750/0/0 softirq=4668/4668 fqs=1
[ 900.111742] (detected by 2, t=15002 jiffies, g=1980, c=1979, q=90)
[ 900.113384] Call trace:
[ 900.114035] rcu_sched kthread starved for 15001 jiffies! g1980 c1979 f0x0 s3 ->state=0x1
[ 900.108730] INFO: rcu_sched detected stalls on CPUs/tasks:
[ 900.110254] 0-...: (4 GPs behind) idle=750/0/0 softirq=4668/4668 fqs=1
[ 900.111742] (detected by 2, t=15002 jiffies, g=1980, c=1979, q=90)
[ 900.113366] Task dump for CPU 0:
[ 900.113375] swapper/0 R running task 0 0 0 0x00000000
[ 900.113384] Call trace:
[ 900.113987] [<ffff800000086ad0>] __switch_to+0x90/0xa8
[ 900.113999] [<ffff800000143c40>] __tick_nohz_idle_enter+0x50/0x3f0
[ 900.114003] [<ffff800000144238>] tick_nohz_idle_enter+0x40/0x70
[ 900.114010] [<ffff80000010b8b0>] cpu_startup_entry+0x288/0x2d8
[ 900.114018] [<ffff8000008f18f0>] rest_init+0x80/0x88
[ 900.114025] [<ffff800000cb59f8>] start_kernel+0x3e8/0x414
[ 900.114029] [<00000000408fb000>] 0x408fb000
[ 900.114035] rcu_sched kthread starved for 15001 jiffies! g1980 c1979 f0x0 s3 ->state=0x1

Let me see if I can repro this on other kernels now

Revision history for this message
Colin Ian King (colin-king) wrote :

Can trip it with stress-ng context switching with 4.2.0-38-generic
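
The exact stress-ng invocation isn't recorded here, but a host load along these lines (assuming stress-ng's --switch stressor) produces the kind of context-switch pressure described:

  # one context-switch stressor per online CPU on the host, for 10 minutes
  $ stress-ng --switch 0 --timeout 10m --metrics-brief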

Revision history for this message
Colin Ian King (colin-king) wrote :

Testing with 4.4 on the host and the VM is showing:

[ 335.699014] sched: RT throttling activated
[ 337.600831] hrtimer: interrupt took 2939683820 ns

..which shows us that the host is suffering from some very large scheduling latency issues that are causing the VM some grief.

Revision history for this message
Colin Ian King (colin-king) wrote :

Can't repro the bug on 4.4 kernel on host. Will try 4.3 now

Revision history for this message
Colin Ian King (colin-king) wrote :

I wonder if it is possible to test with a recent 4.4 Xenial kernel on the host to see if that helps.

Revision history for this message
Colin Ian King (colin-king) wrote :

Bisecting is proving problematic as 4.3 kernels don't boot.

Revision history for this message
Colin Ian King (colin-king) wrote :

On an idle Xenial cloud image I'm seeing:

[ 1485.236760] [<ffff800000086ad0>] __switch_to+0x90/0xa8
[ 1485.236772] [<ffff800000143e80>] __tick_nohz_idle_enter+0x50/0x3f0
[ 1485.236776] [<ffff800000144478>] tick_nohz_idle_enter+0x40/0x70
[ 1485.236785] [<ffff80000010baf0>] cpu_startup_entry+0x288/0x2d8
[ 1485.236791] [<ffff80000008fca8>] secondary_start_kernel+0x120/0x130
[ 1485.236795] [<000000004008290c>] 0x4008290c

after a while I get:

[ 2462.806971] rcu_sched kthread starved for 15002 jiffies! g2579 c2578 f0x0 s3 ->state=0x1
[ 2667.835351] INFO: rcu_sched detected stalls on CPUs/tasks:
[ 2667.836918] 0-...: (66 GPs behind) idle=cf0/0/0 softirq=5177/5177 fqs=0
[ 2667.838801] 2-...: (0 ticks this GP) idle=73a/0/0 softirq=4570/4570 fqs=0
[ 2667.840696] 3-...: (64 GPs behind) idle=eba/0/0 softirq=4654/4654 fqs=0
[ 2667.842533] (detected by 1, t=15002 jiffies, g=2638, c=2637, q=4389)

and at this point sleeping blocks; for example, strace on sleep(1) in the VM shows nanosleep({1, 0}) sleeping forever, and one has to SIGINT it as it never times out.

Also, the secondary_start_kernel() in the trace indicates that the VM puts CPUs to sleep and wakes them on a timer.

I can trigger this more often with more CPUs in the VM and also by loading the host; for example, producing a lot of cache or memory activity triggers the initial hangs more frequently than an idle host does.

So, I suspect there is a cpuhotplug and nohz combo causing issues here.

Revision history for this message
Colin Ian King (colin-king) wrote :

A bit more digging: I see that the CPU goes into idle either via a single WFI (wait for interrupt) shallow sleep or a deeper arm_cpuidle_suspend(); the latter is akin to turning off the CPU. I wonder if we're seeing issues with the wakeup latency taking a long time inside QEMU when the host is loaded, causing the rcu_sched issues.

Revision history for this message
Colin Ian King (colin-king) wrote :
Revision history for this message
Colin Ian King (colin-king) wrote :

This article throws some light onto things:

https://lwn.net/Articles/518953/

"Second, the greater the number of idle CPUs, the more work RCU must do when forcing quiescent states. Yes, the busier the system, the less work RCU needs to do! The reason for the extra work is that RCU is not permitted to disturb idle CPUs for energy-efficiency reasons. RCU must therefore probe a per-CPU data structure to read out idleness state during each grace period, likely incurring a cache miss on each such probe."

Just to add, I'm running the VM with, say, 4 CPUs, all of which are idle.

In my experiments on the 3.19 and 4.2 kernels, kvm is not being used on the host, so we have QEMU emulating N CPUs on just 1 host CPU. On top of that, a loaded host means that this single CPU is busy and we get potentially large latencies serving the N virtual CPUs in the VM. I think that's part of the issue: large latencies from the host with an N-to-1 virt-to-host mapping mean that we are tripping the RCU grace periods.

To help keep the RCU kthreads from suffering from these delays, I added the following kernel parameters to the VM:

rcu_nocb_poll rcutree.kthread_prio=90 rcuperf.verbose=1

I was able to run an 8 CPU VM without any RCU issues while the host CPU was being hammered to death with stress-ng. I then also cranked down the RCU stall grace period to just 5 seconds to see how easily I could trip the issue with this more extreme setting, using:

echo 5 > /sys/module/rcupdate/parameters/rcu_cpu_stall_timeout

and again, no RCU issues.

@Martin,

can you try using the following kernel parameters on the VM and see if this helps:

rcu_nocb_poll rcutree.kthread_prio=90 rcuperf.verbose=1
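
A sketch of making these parameters persistent on the VM, assuming the cloud image boots via GRUB as the console log above suggests (how they actually get applied is up to you):

  # append the parameters to the default kernel command line
  $ sudo sed -i 's/^GRUB_CMDLINE_LINUX_DEFAULT="/&rcu_nocb_poll rcutree.kthread_prio=90 rcuperf.verbose=1 /' /etc/default/grub
  $ sudo update-grub
  $ sudo reboot
  # after reboot, confirm they took effect
  $ cat /proc/cmdline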

Revision history for this message
Colin Ian King (colin-king) wrote :

Also, can we clarify something: do the ARM hosts provide kvm? If not, one should really run the VMs with just one CPU.

Revision history for this message
Martin Pitt (pitti) wrote :

Thanks Colin, great work! I'll deploy this ASAP.

FYI, at least some of the VM hosts in scalingstack got updated to a 4.4 kernel. Not sure how much that changes your investigations.

Revision history for this message
Martin Pitt (pitti) wrote :

lxd-armhf1 (on swirlix01) has run without any lockup since the host kernel update to 4.4. I created a new lxd-armhf2 yesterday (on swirlix08) which also survived without any workaround. At the same time I created a new lxd-armhf3 (on swirlix16) which has locked up pretty well every < 15 minutes (I got a ton of watchdog reboot messages over the night). I deployed the "rcu_nocb_poll rcutree.kthread_prio=90 rcuperf.verbose=1" thing on that now.

I also asked #is about the host kernels. My suspicion is that swirlix{01,08} got updated to 4.4 while swirlix16 is still running an older one.

Revision history for this message
Martin Pitt (pitti) wrote :

hloeung | pitti: yeah, I believe work was done to get swirlix01-09 to 4.4

Revision history for this message
Martin Pitt (pitti) wrote :

[hloeung@ragnar tmp]$ for i in {01..09} 16; do ssh swirlix${i}.bos01.scalingstack "uname -a"; done
Linux swirlix01 4.4.0-30-generic #49~14.04.1-Ubuntu SMP Thu Jun 30 22:20:09 UTC 2016 aarch64 aarch64 aarch64 GNU/Linux
Linux swirlix02 4.4.0-30-generic #49~14.04.1-Ubuntu SMP Thu Jun 30 22:20:09 UTC 2016 aarch64 aarch64 aarch64 GNU/Linux
Linux swirlix03 4.4.0-30-generic #49~14.04.1-Ubuntu SMP Thu Jun 30 22:20:09 UTC 2016 aarch64 aarch64 aarch64 GNU/Linux
Linux swirlix04 4.4.0-30-generic #49~14.04.1-Ubuntu SMP Thu Jun 30 22:20:09 UTC 2016 aarch64 aarch64 aarch64 GNU/Linux
Linux swirlix05 4.4.0-30-generic #49~14.04.1-Ubuntu SMP Thu Jun 30 22:20:09 UTC 2016 aarch64 aarch64 aarch64 GNU/Linux
Linux swirlix06 4.4.0-30-generic #49~14.04.1-Ubuntu SMP Thu Jun 30 22:20:09 UTC 2016 aarch64 aarch64 aarch64 GNU/Linux
Linux swirlix07 4.4.0-30-generic #49~14.04.1-Ubuntu SMP Thu Jun 30 22:20:09 UTC 2016 aarch64 aarch64 aarch64 GNU/Linux
Linux swirlix08 4.4.0-30-generic #49~14.04.1-Ubuntu SMP Thu Jun 30 22:20:09 UTC 2016 aarch64 aarch64 aarch64 GNU/Linux
Linux swirlix09 4.4.0-30-generic #49~14.04.1-Ubuntu SMP Thu Jun 30 22:20:09 UTC 2016 aarch64 aarch64 aarch64 GNU/Linux
Linux swirlix16 4.2.0-36-generic #42~14.04.1-Ubuntu SMP Fri May 13 17:26:22 UTC 2016 aarch64 aarch64 aarch64 GNU/Linux

Revision history for this message
Martin Pitt (pitti) wrote :

> can you try using the following kernel parameters on the VM and see if this helps:
> rcu_nocb_poll rcutree.kthread_prio=90 rcuperf.verbose=1

the instance on swirlix16 (on 4.2 kernel) hung again (twice), with the attached console log. This now has the above kernel parameters, but I'm afraid it doesn't look any different than before.

Revision history for this message
Junien F (axino) wrote :

Hi,

I'm sorry but 4.4 is too unstable on the hosts. We have to reboot and/or power cycle them multiple times a day. We're back on 4.2 everywhere.

Haw gathered some perf data on a failing 4.4 host; perhaps we can start digging into the issue from there? Perhaps it should be a separate bug as well.

Thanks

Revision history for this message
Colin Ian King (colin-king) wrote :

yep, file a separate bug, the perf data will be useful. Thanks.

Revision history for this message
Junien F (axino) wrote :

Filed LP#1602577 for the host instability issue on 4.4

Revision history for this message
Colin Watson (cjwatson) wrote :

We're seeing rather similar symptoms on Launchpad builders after upgrading the guests from wily to xenial (console-log not very informative, e.g. https://pastebin.canonical.com/160898/plain/; build output appears hung; I can't tell for sure that it's the same thing, this is just a guess). These are on the same scalingstack hosts as Martin is using, so if anything the surprise is that we weren't seeing it before (we had different problems, mainly occasional hangs on boot). KVM is enabled, host is 4.2 per Junien's comments above, only interesting kernel parameter we're using at present is "compat_uts_machine=armv7l".

Revision history for this message
Colin Ian King (colin-king) wrote :

I'm reproducing rcu_sched timeouts all the time with a 4.4 kernel on a far slower ARM64 host with the same cloud images.

[ 157.555837] INFO: rcu_sched self-detected stall on CPU
[ 157.561551] INFO: rcu_sched detected stalls on CPUs/tasks:
[ 157.562669] 2-...: (14960 ticks this GP) idle=5b5/140000000000001/0 softirq=37/37 fqs=1354
[ 157.563478] (detected by 3, t=15002 jiffies, g=-251, c=-252, q=302)
[ 157.564033] Task dump for CPU 2:
[ 157.564468] swapper/0 R running task 0 1 0 0x00000002
[ 157.565071] Call trace:
[ 157.565469] [<ffff800000086ad0>] __switch_to+0x90/0xa8
[ 157.566052] [<ffff80000011b714>] generic_handle_irq+0x34/0x50
[ 157.566256] [<ffff80000011ba70>] __handle_domain_irq+0x68/0xc0

Revision history for this message
Colin Ian King (colin-king) wrote :

..and for one more datapoint, QEMU seems to be hung spinning on a futex:

futex(0xb05520, FUTEX_WAIT_PRIVATE, 2, NULL) = -1 EAGAIN (Resource temporarily unavailable)
futex(0xb054f4, FUTEX_CMP_REQUEUE_PRIVATE, 1, 2147483647, 0xb05520, 1104376) = 1

Revision history for this message
Colin Ian King (colin-king) wrote :

OK, ignore the last two messages; it eventually booted. It just seems that the host was rather slow.

Revision history for this message
dann frazier (dannf) wrote :

I wonder if this might be a dupe of LP: #1549494? We fixed that in xenial, but haven't backported the fix to wily. I haven't been able to reproduce this issue myself, but I uploaded a wily kernel w/ a backported fix to ppa:dannf/test, in case someone else can test it. It corresponds to the git branch here:

   https://code.launchpad.net/~dannf/ubuntu/+source/linux/+git/wily/+ref/lp1531768
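
Testing that kernel would look roughly like this (a sketch; the exact kernel package name in the PPA isn't listed here, so dist-upgrade is used to pick up whatever newer kernel it provides):

  $ sudo add-apt-repository ppa:dannf/test
  $ sudo apt-get update
  $ sudo apt-get dist-upgrade
  $ sudo reboot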

Revision history for this message
Colin Watson (cjwatson) wrote :

@dannf, this bug seems to be *worse* in xenial than in wily, so I don't think backporting a change from xenial to wily is going to help matters?

Revision history for this message
dann frazier (dannf) wrote :

@cjwatson: Ah, ok. I may have misread the history here. I had gleaned that the xenial kernel (as a host) was more unstable - but for different reasons.

Regardless, I have pulled the wily backport build I prepared, because it was frequently triggering a WARN() condition. Looks like my backport attempt was too naive and would need some work. But perhaps I should pause that effort until we confirm if 4.4 is impacted by this same issue.

Revision history for this message
Martin Pitt (pitti) wrote :

For the record, I now use two arm64 xenial (4.4) instances on a host with kernel 4.8, and things are looking really good. See latest posts to bug 1602577.

Revision history for this message
Martin Pitt (pitti) wrote :

I propose to close this. This is clearly fixed with 4.4 on the host, and rolling that out is covered by bug 1602577.

It can be closed for auto-package-testing either way as our arm64 nova compute nodes now run 4.4.23.

Changed in linux (Ubuntu):
status: Confirmed → Fix Released
Changed in auto-package-testing:
status: Triaged → Fix Released