stress smoke test hang with dev test on AWS Xenial kernel

Bug #1741409 reported by Po-Hsu Lin
This bug affects 1 person
Affects              Status   Importance  Assigned to  Milestone
ubuntu-kernel-tests  Invalid  Medium      Unassigned
linux-aws (Ubuntu)   Invalid  High        Unassigned

Bug Description

The test hangs at "dev STARTING" and is then killed by the test timeout.

DEBUG| [stdout] dccp PASSED
DEBUG| [stdout] dentry STARTING
DEBUG| [stdout] dentry RETURNED 0
DEBUG| [stdout] dentry PASSED
DEBUG| [stdout] dev STARTING

No interesting output in dmesg:
[ 8.281861] random: nonblocking pool is initialized
[ 8.338335] ppdev: user-space parallel port driver
[ 11.662848] cgroup: new mount options do not match the existing superblock, will be ignored
[ 12.272168] systemd[1]: Started ACPI event daemon.

ProblemType: Bug
DistroRelease: Ubuntu 16.04
Package: linux-image-4.4.0-1045-aws 4.4.0-1045.54
ProcVersionSignature: User Name 4.4.0-1045.54-aws 4.4.98
Uname: Linux 4.4.0-1045-aws x86_64
ApportVersion: 2.20.1-0ubuntu2.14
Architecture: amd64
Date: Fri Jan 5 06:47:45 2018
Ec2AMI: ami-a2e544da
Ec2AMIManifest: (unknown)
Ec2AvailabilityZone: us-west-2b
Ec2InstanceType: t2.nano
Ec2Kernel: unavailable
Ec2Ramdisk: unavailable
ProcEnviron:
 TERM=xterm-256color
 PATH=(custom, no user)
 XDG_RUNTIME_DIR=<set>
 LANG=en_US.UTF-8
 SHELL=/bin/bash
SourcePackage: linux-aws
UpgradeStatus: No upgrade log present (probably fresh install)

Po-Hsu Lin (cypressyew) wrote :
description: updated
Po-Hsu Lin (cypressyew) wrote :

Manually tested with an older kernel (4.4.0-1043-aws); the issue can still be reproduced.

The node reboots when it reaches this test.

Changed in linux-aws (Ubuntu):
importance: Undecided → High
status: New → In Progress
Colin Ian King (colin-king) wrote :

This is locking up on opening a specific device. It is not a race condition as I originally suspected, but a lockup on a simple read open of a device on just AWS.

Colin Ian King (colin-king) wrote :

4.4.0-73 has the same issue, so it's not an aws specific kernel issue per se.

Colin Ian King (colin-king) wrote :

issue occurs with v4.15-rc7 upstream kernel too

Colin Ian King (colin-king) wrote :

..and way back to v4.0

Colin Ian King (colin-king) wrote :

Do you mind re-running the test to see if we get past this stress test now?

Po-Hsu Lin (cypressyew) wrote :

Tested with the 4.4.0-109 lowlatency kernel; the dev test passes now.

I will leave this bug open, as discussed on IRC.

Colin Ian King (colin-king) wrote :

I can reproduce this with 4.16-rc2. I've debugged it down to:

drivers/char/hpet.c, hpet_timer_set_irq():

        if (irq < HPET_MAX_IRQ) {
                spin_lock_irq(&hpet_lock);
                v = readl(&timer->hpet_config);
                v |= irq << Tn_INT_ROUTE_CNF_SHIFT;
                writel(v, &timer->hpet_config);

.. the writel to hpet_config causes the reboot.
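
To make the register write above easier to follow, here is a minimal sketch of the bit arithmetic involved. The mask and shift values are the ones from the HPET timer configuration register layout (they should match the kernel's include/linux/hpet.h, but check your tree); the helper names hpet_route_irq(), hpet_routed_irq() and hpet_is_level_triggered() are hypothetical and exist only for illustration. This is plain user-space arithmetic and does not touch any hardware.

```c
#include <stdint.h>

/* Timer N configuration register bits (values assumed to match
 * include/linux/hpet.h and the HPET register layout). */
#define Tn_INT_TYPE_CNF_MASK   0x0002UL  /* bit 1: 1 = level, 0 = edge    */
#define Tn_INT_ENB_CNF_MASK    0x0004UL  /* bit 2: timer interrupt enable */
#define Tn_INT_ROUTE_CNF_MASK  0x3e00UL  /* bits 13:9: interrupt routing  */
#define Tn_INT_ROUTE_CNF_SHIFT 9

/* Hypothetical helper: the value hpet_timer_set_irq() would write back
 * after OR-ing the chosen irq into the routing field. */
uint32_t hpet_route_irq(uint32_t config, unsigned int irq)
{
    return config | ((uint32_t)irq << Tn_INT_ROUTE_CNF_SHIFT);
}

/* Hypothetical helper: extract the routed irq from a config value. */
unsigned int hpet_routed_irq(uint32_t config)
{
    return (config & Tn_INT_ROUTE_CNF_MASK) >> Tn_INT_ROUTE_CNF_SHIFT;
}

/* Hypothetical helper: non-zero if the config requests level triggering. */
int hpet_is_level_triggered(uint32_t config)
{
    return (config & Tn_INT_TYPE_CNF_MASK) != 0;
}
```

For example, routing irq 5 on a config that already has the level-triggered bit set yields 0x0a02: bit 1 stays set and the value 5 lands in bits 13:9.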

How to reproduce this issue:

git clone git://kernel.ubuntu.com/cking/stress-ng
cd stress-ng
git revert 0124b250ec205ea3cd6d9d68fb96c03ac294d12f
make
sudo ./stress-ng --dev 1

.. wait a while and it will eventually get around to /dev/hpet; opening it causes the hang.

The minimal reproducer is:

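#include <sys/types.h>
#include <sys/stat.h>
#include <fcntl.h>
#include <unistd.h>
#include <stdlib.h>

int main(void)
{
	int fd;

	/* On an affected guest, the open() alone is enough to trigger the crash. */
	fd = open("/dev/hpet", O_RDONLY | O_NONBLOCK);
	if (fd >= 0)	/* 0 is a valid file descriptor */
		close(fd);

	exit(0);
}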

Run this as root and it will cause the reboot.

Colin Ian King (colin-king) wrote :

dmesg of guest:

[ 0.000000] Linux version 4.16.0-rc2+ (cking@gloin) (gcc version 7.3.0 (Ubuntu 7.3.0-3ubuntu1)) #7 SMP Tue Feb 20 14:27:20 UTC 2018
[ 0.000000] Command line: BOOT_IMAGE=/boot/vmlinuz-4.16.0-rc2+ root=UUID=b6adc449-5e3d-4331-ba6b-6e99a75fa48e ro console=tty1 console=ttyS0 nvme.io_timeout=4294967295
[ 0.000000] KERNEL supported cpus:
[ 0.000000] Intel GenuineIntel
[ 0.000000] AMD AuthenticAMD
[ 0.000000] Centaur CentaurHauls
[ 0.000000] x86/fpu: Supporting XSAVE feature 0x001: 'x87 floating point registers'
[ 0.000000] x86/fpu: Supporting XSAVE feature 0x002: 'SSE registers'
[ 0.000000] x86/fpu: Supporting XSAVE feature 0x004: 'AVX registers'
[ 0.000000] x86/fpu: xstate_offset[2]: 576, xstate_sizes[2]: 256
[ 0.000000] x86/fpu: Enabled xstate features 0x7, context size is 832 bytes, using 'standard' format.
[ 0.000000] e820: BIOS-provided physical RAM map:
[ 0.000000] BIOS-e820: [mem 0x0000000000000000-0x000000000009dfff] usable
[ 0.000000] BIOS-e820: [mem 0x000000000009e000-0x000000000009ffff] reserved
[ 0.000000] BIOS-e820: [mem 0x00000000000e0000-0x00000000000fffff] reserved
[ 0.000000] BIOS-e820: [mem 0x0000000000100000-0x000000003fffffff] usable
[ 0.000000] BIOS-e820: [mem 0x00000000fc000000-0x00000000ffffffff] reserved
[ 0.000000] NX (Execute Disable) protection: active
[ 0.000000] random: fast init done
[ 0.000000] SMBIOS 2.7 present.
[ 0.000000] DMI: Xen HVM domU, BIOS 4.2.amazon 08/24/2006
[ 0.000000] Hypervisor detected: Xen HVM
[ 0.000000] Xen version 4.2.
[ 0.000000] Xen Platform PCI: I/O protocol version 1
[ 0.000000] Netfront and the Xen platform PCI driver have been compiled for this kernel: unplug emulated NICs.
[ 0.000000] Blkfront and the Xen platform PCI driver have been compiled for this kernel: unplug emulated disks.
               You might have to change the root device
               from /dev/hd[a-d] to /dev/xvd[a-d]
               in your root= kernel command line option
[ 0.000000] HVMOP_pagetable_dying not supported
[ 0.000000] e820: update [mem 0x00000000-0x00000fff] usable ==> reserved
[ 0.000000] e820: remove [mem 0x000a0000-0x000fffff] usable
[ 0.000000] e820: last_pfn = 0x40000 max_arch_pfn = 0x400000000
[ 0.000000] MTRR default type: write-back
[ 0.000000] MTRR fixed ranges enabled:
[ 0.000000] 00000-9FFFF write-back
[ 0.000000] A0000-BFFFF write-combining
[ 0.000000] C0000-FFFFF write-back
[ 0.000000] MTRR variable ranges enabled:
[ 0.000000] 0 base 0000F0000000 mask 3FFFF8000000 uncachable
[ 0.000000] 1 base 0000F8000000 mask 3FFFFC000000 uncachable
[ 0.000000] 2 disabled
[ 0.000000] 3 disabled
[ 0.000000] 4 disabled
[ 0.000000] 5 disabled
[ 0.000000] 6 disabled
[ 0.000000] 7 disabled
[ 0.000000] x86/PAT: Configuration [0-7]: WB WC UC- UC WB WP UC- WT
[ 0.000000] found SMP MP-table at [mem 0x000fbc20-0x000fbc2f] mapped at [ (ptrval)]
[ 0.000000] Scanning 1 areas for low memory corruption
[ 0.000000] Base memory trampoline at [ (ptrval)] 98000 size 24576
[ 0.000000] BRK [0x0ff42000, 0x0ff42fff] PGTAB...

Stefan Bader (smb) wrote :

I was able to observe the crash on an Ubuntu Xenial Xen host, which produced the following text on the host console:

(XEN) domain_crash called from hpet.c:387
(XEN) Domain 2 (vcpu#1) crashed on cpu#4:
(XEN) ----[ Xen-4.6.5 x86_64 debug=n Not tainted ]----
(XEN) CPU: 4
(XEN) RIP: 0010:[<ffffffff81532dc1>]
(XEN) RFLAGS: 0000000000010002 CONTEXT: hvm guest (d2v1)
(XEN) rax: 0000000000000032 rbx: ffff880034e41a00 rcx: 0000000000000000
(XEN) rdx: 0000000000000001 rsi: 0000000000000032 rdi: ffffffff821fdfb0
(XEN) rbp: ffff8800e90afc10 rsp: ffff8800e90afbd8 r8: 0000000000000003
(XEN) r9: 0000000000000000 r10: 000000000000000a r11: 0000000000000000
(XEN) r12: ffff880107a4a8f8 r13: ffffffff821fdfb0 r14: ffffc90000002140
(XEN) r15: ffffffff81a7c600 cr0: 0000000080050033 cr4: 0000000000360670
(XEN) cr3: 000000003492c000 cr2: 00007fcc1156a030
(XEN) ds: 0000 es: 0000 fs: 0000 gs: 0000 ss: 0018 cs: 0010

Will investigate further (whether this persists in newer xen versions)

Stefan Bader (smb) wrote :

I booted the same Xenial-based HVM guest on the same host (but this time running Bionic / Xen 4.9). This combination does not crash the domain when opening HPET, though the check and the code that would do it are still there. I also found a bug report against xenserver, which I believe is based on the same Xen version as we have in Xenial (4.6.5): https://bugs.xenserver.org/browse/XSO-809?focusedCommentId=16484&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel

Both that report and the code say that the domain is crashed because HPET is set to use an unsupported interrupt method (edge/level). Since the Linux guest is the same in both cases, as are the test and the crash code, either the hypervisor or maybe seabios seems to use a different default.

Stefan Bader (smb) wrote :

Darn, ok, I take everything back. Somehow the compiled reproducer was mangled in such a way that it may no longer have done what it was intended to do. Anyhow, with freshly generated reproducers, even Xen 4.9 has the crash. :(

Changed in linux-aws (Ubuntu):
assignee: Colin Ian King (colin-king) → Stefan Bader (smb)
Stefan Bader (smb) wrote :

Right now I do not think there is much choice for fixing this (other than not touching /dev/hpet on AWS). The Linux kernel deliberately wants to set a level-triggered interrupt. The Xen hypervisor has no support for that (there might be some addition done, but certainly not in any released version of Xen), and as "error handling" it forcefully crashes the domain.
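
As an illustration of the "do not touch /dev/hpet" workaround on the test side, here is a minimal sketch of a guard a test tool could apply before opening a device. It assumes the guest kernel exposes /sys/hypervisor/type and that the file reads "xen" on Xen guests; that may not hold for every kernel configuration. The helper names hypervisor_is_xen() and should_skip_device() are hypothetical, and this is not necessarily how stress-ng handles it.

```c
#include <stdio.h>
#include <string.h>

/* Returns 1 if the file at `path` starts with "xen", 0 otherwise
 * (including when the file is missing or unreadable).
 * Assumption: on a Xen guest, /sys/hypervisor/type contains "xen". */
int hypervisor_is_xen(const char *path)
{
    char buf[16] = "";
    FILE *fp = fopen(path, "r");

    if (!fp)
        return 0;
    if (!fgets(buf, sizeof(buf), fp))
        buf[0] = '\0';
    fclose(fp);

    return strncmp(buf, "xen", 3) == 0;
}

/* Hypothetical policy: skip /dev/hpet when the guest reports Xen,
 * since opening it can crash the domain on affected hypervisors. */
int should_skip_device(const char *dev, const char *hv_type_path)
{
    return strcmp(dev, "/dev/hpet") == 0 && hypervisor_is_xen(hv_type_path);
}
```

A caller would check should_skip_device(path, "/sys/hypervisor/type") before each open() and skip the device when it returns non-zero. The guard is deliberately conservative: it also skips /dev/hpet on Xen versions that may already handle level-triggered HPET interrupts.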

Sean Feole (sfeole) wrote :

I've been sorting through many of the ubuntu-kernel-tests bugs.

This is one of the few that is actually being worked on.

Stefan, any update on this? Should this be fixed, or has it been already? I can revisit once I finish cleaning up the list.

Changed in ubuntu-kernel-tests:
status: New → In Progress
assignee: nobody → Sean Feole (sfeole)
importance: Undecided → Medium
Stefan Bader (smb) wrote :

It might be fixed if AWS runs a Xen hypervisor that includes the following patch (this is from the development tree of upstream Xen, so it will be part of Xen-4.12).

commit be07023be115c94b7fbb51d2ef6f421ddd680de8
Author: Roger Pau Monné <email address hidden>
Date: Tue Jul 24 15:54:18 2018 +0200

    x86/vhpet: add support for level triggered interrupts

One can never say for sure what AWS runs, so whether it's fixed or not can only be found out by trial and error.

Sean Feole (sfeole) wrote :

We updated our test instances to run on the latest hardware made available in AWS, and I have not seen this reoccur in the xenial testing.

Closing bug.

Changed in ubuntu-kernel-tests:
status: In Progress → Invalid
Changed in linux-aws (Ubuntu):
status: In Progress → Invalid
Changed in ubuntu-kernel-tests:
assignee: Sean Feole (sfeole) → nobody
Changed in linux-aws (Ubuntu):
assignee: Stefan Bader (smb) → nobody