linux autopkgtest: eternal hang/ BUG: soft lockup

Bug #1634519 reported by Martin Pitt
6
This bug affects 1 person
Affects Status Importance Assigned to Milestone
linux (Ubuntu)
Confirmed
Undecided
Colin Ian King

Bug Description

Following up to bug 1632252, tests still hang (but not due to the suspend issue any more). In a local QEMU run:

$ autopkgtest --testname ubuntu-regression-suite linux -- qemu --ram-size=4096 -o /var/cache/martin /srv/vm/autopkgtest-yakkety-amd64.img
[...]
14:46:41 DEBUG| [stdout] Test icebp [Ok]
14:46:41 DEBUG| [stdout] Test int 3 trap [Ok]
14:46:41 DEBUG| [stdout] selftests: breakpoint_test [PASS]

and since then (1 hour) it's hung. But now I still can log into ttyS0. dmesg is almost empty:

[ 5501.499217] ata2.01: NODEV after polling detection
[ 5501.500138] ata2.00: configured for MWDMA2

(something in the test clears the ring buffer), and journalctl confirms that suspend/resume worked fine:

Oct 18 14:47:11 autopkgtest kernel: PM: suspend of devices complete after 65.273 msecs
Oct 18 14:47:11 autopkgtest kernel: PM: late suspend of devices complete after 0.172 msecs
Oct 18 14:47:11 autopkgtest kernel: PM: noirq suspend of devices complete after 1.963 msecs
Oct 18 14:47:11 autopkgtest kernel: ACPI: Preparing to enter system sleep state S3
Oct 18 14:47:11 autopkgtest kernel: PM: Saving platform NVS memory
Oct 18 14:47:11 autopkgtest kernel: Disabling non-boot CPUs ...
Oct 18 14:47:11 autopkgtest kernel: kvm-clock: cpu 0, msr 1:3fff4001, primary cpu clock, resume
Oct 18 14:47:11 autopkgtest kernel: ACPI: Low-level resume complete
Oct 18 14:47:11 autopkgtest kernel: PM: Restoring platform NVS memory
Oct 18 14:47:11 autopkgtest kernel: ACPI: Waking up from system sleep state S3
Oct 18 14:47:11 autopkgtest kernel: PM: noirq resume of devices complete after 6.973 msecs
Oct 18 14:47:11 autopkgtest kernel: PM: early resume of devices complete after 0.105 msecs
Oct 18 14:47:11 autopkgtest kernel: pci 0000:00:01.0: PIIX3: Enabling Passive Release
Oct 18 14:47:11 autopkgtest kernel: rtc_cmos 00:00: System wakeup disabled by ACPI
Oct 18 14:47:11 autopkgtest kernel: PM: resume of devices complete after 8.898 msecs
Oct 18 14:47:11 autopkgtest kernel: PM: Finishing wakeup.
Oct 18 14:47:11 autopkgtest systemd[1]: Time has been changed
Oct 18 14:47:11 autopkgtest systemd[1]: apt-daily.timer: Adding 10h 56min 28.694439s random time.
Oct 18 14:47:11 autopkgtest systemd[3986]: Time has been changed
Oct 18 14:47:11 autopkgtest kernel: Restarting tasks ... done.
Oct 18 14:47:11 autopkgtest sudo[30731]: pam_unix(sudo:session): session closed for user root
Oct 18 14:47:11 autopkgtest kernel: ata2.01: NODEV after polling detection
Oct 18 14:47:11 autopkgtest kernel: ata2.00: configured for MWDMA2

But also, no messages beyond that (last message one hour ago).

On the currently running amd64 yakkety test on the ifrastructure, the int trap bug fails:

13:15:46 DEBUG| [stdout] Test icebp [Ok]
13:15:46 DEBUG| [stdout] Test int 3 trap [Failed]
13:15:46 DEBUG| [stdout] selftests: breakpoint_test [PASS]

and then it's apparently hung at exactly the same problem. But unlike with local QEMU I cannot ssh in any more (Connection refused), and console-log shows trouble:

autopkgtest login: [ 1927.192886] AppArmor: change_hat: Invalid input '^open'
[ 1927.194060] AppArmor: change_hat: Invalid input '^'[ 1927.198501] AppArmor: change_hat: Invalid input, NULL hat and NULL magic
[ 1927.203181] AppArmor: change_hat: Invalid input, NULL hat and NULL magic
[ 1927.207249] AppArmor: change_hat: Invalid input '^open'AppArmor: change_hat: Invalid input '^'
[ 1927.208826] AppArmor: change_hat: Invalid input '^'[[0;32m OK [0m] Stopped target Graphical Interface.
[[0;32m OK [0m] Stopped target Multi-User System.
         Stopping OpenBSD Secure Shell server...
         Stopping autopkgtest root shell on ttyS1...
[ 2909.360975] NMI watchdog: BUG: soft lockup - CPU#3 stuck for 22s! [swapper/3:0]
[ 2937.360965] NMI watchdog: BUG: soft lockup - CPU#3 stuck for 22s! [swapper/3:0]
[ 2941.368025] INFO: rcu_sched self-detected stall on CPU
[ 2941.368095] 3-...: (1 GPs behind) idle=ea7/2/0 softirq=544443/544443 fqs=6600
[ 2941.368095] (t=15000 jiffies g=205236 c=205235 q=134637)
[ 2969.360984] NMI watchdog: BUG: soft lockup - CPU#3 stuck for 22s! [swapper/3:0]
[ 2997.360987] NMI watchdog: BUG: soft lockup - CPU#3 stuck for 22s! [swapper/3:0]
[ 3021.796060] INFO: task systemd:1 blocked for more than 120 seconds.
[ 3021.797626] Tainted: G W OEL 4.8.0-25-generic #27-Ubuntu
[ 3021.797627] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 3021.797727] INFO: task systemd-journal:285 blocked for more than 120 seconds.
[ 3021.797728] Tainted: G W OEL 4.8.0-25-generic #27-Ubuntu
[ 3021.797728] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 3021.797772] INFO: task kworker/1:2:303 blocked for more than 120 seconds.
[ 3021.797772] Tainted: G W OEL 4.8.0-25-generic #27-Ubuntu
[ 3021.797773] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 3021.797814] INFO: task kworker/2:2:350 blocked for more than 120 seconds.
[ 3021.797815] Tainted: G W OEL 4.8.0-25-generic #27-Ubuntu
[ 3021.797815] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 3021.797853] INFO: task acpid:659 blocked for more than 120 seconds.
[ 3021.797853] Tainted: G W OEL 4.8.0-25-generic #27-Ubuntu
[ 3021.797854] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 3021.797940] INFO: task (exim4):31259 blocked for more than 120 seconds.
[ 3021.797941] Tainted: G W OEL 4.8.0-25-generic #27-Ubuntu
[ 3021.797941] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 3025.361006] NMI watchdog: BUG: soft lockup - CPU#3 stuck for 22s! [swapper/3:0]
[ 3053.360962] NMI watchdog: BUG: soft lockup - CPU#3 stuck for 22s! [swapper/3:0]
[ 3081.360962] NMI watchdog: BUG: soft lockup - CPU#3 stuck for 22s! [swapper/3:0]
[ 3109.360971] NMI watchdog: BUG: soft lockup - CPU#3 stuck for 22s! [swapper/3:0]

... and a couple of similar repetitions.

Martin Pitt (pitti)
Changed in linux (Ubuntu):
status: New → Confirmed
assignee: nobody → Colin Ian King (colin-king)
tags: added: bot-stop-nagging
Revision history for this message
Martin Pitt (pitti) wrote :

The local QEMU run also doesn't time out cleanly:

qqemu-system-x86_64: /build/qemu-3KpUoe/qemu-2.6.1+dfsg/hw/char/serial.c:231: serial_xmit: Assertion `!(s->lsr & UART_LSR_TEMT)' failed.
autopkgtest [16:03:27]: ERROR: timed out on command "[...]" (kind: test)
autopkgtest [16:03:28]: test ubuntu-regression-suite: -----------------------]
Traceback (most recent call last):
  File "/tmp/autopkgtest-virt-qemu._36vlf46/runcmd", line 63, in <module>
    s.connect('/tmp/autopkgtest-virt-qemu._36vlf46/ttyS1')
ConnectionRefusedError: [Errno 111] Connection refused

I. e. it also breaks the status of ttyS1 somehow.

Revision history for this message
Colin Ian King (colin-king) wrote :

Hi Martin,

I've pushed another minor change to our tests, do you mind re-running this. I've run it on QEMU on my system and it runs through to completion, so I am hopeful that this may resolve the issues.

Revision history for this message
Martin Pitt (pitti) wrote :

The tests (both yakkety and zesty) are still looping, I'm afraid. I didn't see the "BUG soft lockup" in the one instance I was watching. There were no obvious kernel call traces in dmesg, and the test output tail is just:

06:27:14 DEBUG| [stdout]
Killed
DEBUG - Leave sh
ERROR - The command (AUTOTEST_PATH=/home/ubuntu/autotest sudo -E autotest/client/autotest-local --verbose autotest/client/tests/ubuntu_qrt_apparmor/control) return
ed a non-zero exit status (137).
ERROR - 06:03:39 INFO | Writing results to /tmp/autopkgtest.XDxmzj/build.2sj/linux-4.8.0/autotest/client/results/default
ERROR - 06:03:39 DEBUG| Initializing the state engine
ERROR - 06:03:39 DEBUG| Persistent state client.steps now set to []
ERROR - 06:03:39 DEBUG| Persistent option harness now set to None
ERROR - 06:03:39 DEBUG| Persistent option harness_args now set to None
ERROR - 06:03:39 DEBUG| Selected harness: standalone
ERROR - 06:03:39 INFO | START ---- ---- timestamp=1477029819 localtime=Oct 21 06:03:39
ERROR - 06:03:39 DEBUG| Persistent state client._record_indent now set to 1
ERROR - 06:03:39 DEBUG| Test has timeout: 7200 sec.
ERROR - 06:03:39 INFO | START ubuntu_qrt_apparmor.setup ubuntu_qrt_apparmor.setup timestamp=1477029819 timeout=7200 localtime=Oct 21 06
:03:39
ERROR - 06:03:39 DEBUG| Persistent state client._record_indent now set to 2
ERROR - 06:03:39 DEBUG| Persistent state client.unexpected_reboot now set to ('ubuntu_qrt_apparmor.setup', 'ubuntu_qrt_apparmor.setup')
ERROR - 06:03:39 DEBUG| Waiting for pid 2611 for 7200 seconds
ERROR - 06:03:39 DEBUG| Running 'apt-get update'
[...]
ERROR - 06:05:02 DEBUG| [stdout] Unpacking libgraphite2-3:i386 (1.3.8-1ubuntu1) ...
ERROR - 06:05:02 DEBUG| [stdout] Selecting previously unselected package libharfbuzz0b:i386.
ERROR - 06:05:02 DEBUG| [stdout] Preparing to unpack .../035-libharfbuzz0b_1.2.7-1_i386.deb ...
ERROR - 06:05:02 DEBUG| [stdout] Unpacking libharfbuzz0b:i386 (1.2.7-1) ...
ERROR - 06:05:02 DEBUG| [stdout] Selecting previously unselected package libpangoft2-1.0-0:i386.
ERROR - 06:05:02 DEBUG| [stdout] Preparing to unpack .../036-libpangoft2-1.0-0_1.40.3-2_i386.deb ...
ERROR - 06:05:02 DEBUG| [stdout] Unpacking libpangoft2-1.0-0:i386 (1.40.3-2) ...
Killed
Connection to 10.220.46.18 closed by remote host.
Exit request sent.
autopkgtest [06:31:15]: ERROR: testbed failure: testbed auxverb failed with exit code 255

So this time it seems apt gets killed. The VM went away after that, so I don't have it any more for investigation.

I'm watching another current run and will get its journalctl -f and test output tail, and will then kill the current runs again as there are a lot of them, they loop, and they block large amounts of capacity.

Revision history for this message
Martin Pitt (pitti) wrote :
Download full text (7.3 KiB)

Second run: Nothing useful at all in "nova console-log" (just the normal boot up to "login:"), test output tail is

07:32:01 DEBUG| [stdout] nice PASSED
07:32:11 DEBUG| [stdout] null PASSED
07:32:21 DEBUG| [stdout] opcode PASSED
07:32:31 DEBUG| [stdout] open PASSED
07:32:41 DEBUG| [stdout] personality PASSED
07:32:51 DEBUG| [stdout] pipe PASSED
07:33:01 DEBUG| [stdout] poll PASSED

dmesg tail is more interesting, as it has some trace:

[ 1.568250] AppArmor: AppArmor sha1 policy hashing enabled [21/1103]
[ 1.569433] ima: No TPM chip found, activating TPM-bypass!
[ 1.570642] evm: HMAC attrs: 0x1
[ 1.571783] Magic number: 4:640:161
[ 1.572860] rtc_cmos 00:00: setting system clock to 2016-10-21 04:10:09 UTC (1477023009)
[ 1.574813] BIOS EDD facility v0.16 2004-Jun-25, 0 devices found
[ 1.576127] EDD information not available.
[ 1.577057] PM: Hibernation image not present or could not be loaded.
[ 1.653017] isapnp: No Plug & Play device found
[ 1.654764] Freeing unused kernel memory: 984K (d3c21000 - d3d17000)
[ 1.656407] Write protecting the kernel text: 8252k
[ 1.657711] Write protecting the kernel read-only data: 3324k
[ 1.658992] NX-protecting the kernel data: 6084k
[ 1.660357] ------------[ cut here ]------------
[ 1.661352] WARNING: CPU: 0 PID: 1 at /build/linux-yFroJZ/linux-4.8.0/arch/x86/mm/dump_pagetables.c:225 note_page+0x6a1/0x880
[ 1.663740] x86/mm: Found insecure W+X mapping at address c00a0000/0xc00a0000
[ 1.665280] Modules linked in:
[ 1.666020] CPU: 0 PID: 1 Comm: swapper/0 Not tainted 4.8.0-26-generic #28-Ubuntu
[ 1.667603] Hardware name: OpenStack Foundation OpenStack Nova, BIOS Bochs 01/01/2011
[ 1.669270] d3b54967 625585ec 00200286 d7cabe70 d33da625 d7cabeb4 d3a38b68 d7cabea0
[ 1.669270] d3071b9a d3a38b30 d7cabed4 00000001 d3a38b68 000000e1 d30682a1 000000e1
[ 1.669270] d7cabf34 80000000 00000000 d7cabec0 d3071c06 00000009 00000000 d7cabeb4
[ 1.669270] Call Trace:
[ 1.669270] [<d33da625>] dump_stack+0x58/0x73
[ 1.669270] [<d3071b9a>] __warn+0xea/0x110
[ 1.669270] [<d30682a1>] ? note_page+0x6a1/0x880
[ 1.669270] [<d3071c06>] warn_slowpath_fmt+0x46/0x60
[ 1.669270] [<d30682a1>] note_page+0x6a1/0x880
[ 1.669270] [<d3068675>] ptdump_walk_pgd_level_core+0x1f5/0x300
[ 1.669270] [<d30687b6>] ptdump_walk_pgd_level_checkwx+0x16/0x20
[ 1.669270] [<d305e04d>] mark_rodata_ro+0xfd/0x130
[ 1.669270] [<d3801aac>] kernel_init+0x2c/0x100
[ 1.669270] [<d309b4b1>] ? schedule_tail+0x11/0x50
[ 1.669270] [<d380c3e2>] ret_from_kernel_thread+0xe/0x24
[ 1.669270] [<d3801a80>] ? rest_init+0x70/0x70
[ 1.688344] ---[ end trace c05a9abbb5340752 ]---
[ 1.689410] x86/mm: Checked W+X mappings: FAILED, 96 W+X pages found.
[ 1.708425] random: systemd-udevd: uninitialized urandom read (16 bytes read)
[ 1.710325] random: udevadm: uninitialized urandom read (16 bytes read)
[ 1.710853] random: udevadm: uninitialized urandom read (16...

Read more...

Revision history for this message
Colin Ian King (colin-king) wrote :

This is reproduceable on my QEMU instance:

13:16:59 DEBUG| [stdout] opcode PASSED
13:17:10 DEBUG| [stdout] open PASSED
13:17:20 DEBUG| [stdout] personality PASSED
13:17:30 DEBUG| [stdout] pipe PASSED
13:17:40 DEBUG| [stdout] poll PASSED

And we get:

[ 5361.167961] Kernel panic - not syncing: stack-protector: Kernel stack is corrupted in: ffffffff84f6b18e
[ 5361.167961]
[ 5361.171800] CPU: 1 PID: 19248 Comm: Tainted: G W OE 4.8.0-26-generic #28-Ubuntu
[ 5361.173179] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Ubuntu-1.8.2-1ubuntu2 04/01/2014
[ 5361.175119] 0000000000000086 0000000094818d37 ffff8b7b72acbc78 ffffffff8502f5b2
[ 5361.176740] 0000000094f6b400 ffffffff85883520 ffff8b7b72acbd00 ffffffff84d9e71c
[ 5361.178356] ffff8b7b00000010 ffff8b7b72acbd10 ffff8b7b72acbca8 0000000094818d37
[ 5361.179938] Call Trace:
[ 5361.180420] [<ffffffff8502f5b2>] dump_stack+0x63/0x81
[ 5361.181401] [<ffffffff84d9e71c>] panic+0xe4/0x226
[ 5361.182310] [<ffffffff84f6b18e>] ? proc_keys_show+0x3ce/0x3d0
[ 5361.183487] [<ffffffff84c82b89>] __stack_chk_fail+0x19/0x30
[ 5361.184478] [<ffffffff84f6b18e>] proc_keys_show+0x3ce/0x3d0
[ 5361.185476] [<ffffffff84f68690>] ? key_validate+0x50/0x50
[ 5361.187078] [<ffffffff84f65d50>] ? key_default_cmp+0x20/0x20
[ 5361.188277] [<ffffffff84e599f9>] seq_read+0x2e9/0x3c0
[ 5361.189303] [<ffffffff84ea62e2>] proc_reg_read+0x42/0x70
[ 5361.190234] [<ffffffff84e32428>] __vfs_read+0x18/0x40
[ 5361.190813] [<ffffffff84e32b86>] vfs_read+0x96/0x130
[ 5361.191732] [<ffffffff84e34065>] SyS_read+0x55/0xc0
[ 5361.192739] [<ffffffff8549f076>] entry_SYSCALL_64_fastpath+0x1e/0xa8
[ 5361.194353] Kernel Offset: 0x3c00000 from 0xffffffff81000000 (relocation range: 0xffffffff80000000-0xffffffffbfffffff)
[ 5361.197761] ---[ end Kernel panic - not syncing: stack-protector: Kernel stack is corrupted in: ffffffff84f6b18e

This is because of bug 1634496.

To post a comment you must log in.
This report contains Public information  
Everyone can see this information.

Other bug subscribers

Remote bug watches

Bug watches keep track of this bug in other bug trackers.