[SRU][Zesty] fix soft lockup on overcommitted hugepages

Bug #1696165 reported by Manoj Iyer on 2017-06-06
This bug affects 1 person
Affects          Status  Importance  Assigned to  Milestone
linux (Ubuntu)           High        Manoj Iyer
Zesty                    High        Unassigned

Bug Description

[Impact]
On failing to migrate a page, soft_offline_huge_page() performs the necessary update to the hugepage ref-count.

But when !hugepage_migration_supported(), unmap_and_move_hugepage() also decrements the page ref-count for the hugepage. The combined behaviour leaves the ref-count in an inconsistent state.

This leads to soft lockups when running the overcommitted hugepage test from the mce-test suite.
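
The double decrement described above can be illustrated with a toy model. This is a minimal sketch, not kernel code: the HugePage class, the migrate() and soft_offline() functions, the buggy flag, and the starting count of 1 are all hypothetical stand-ins chosen for illustration.

```python
class HugePage:
    """Toy stand-in for a hugepage with a reference count (not kernel code)."""

    def __init__(self, refcount=1):
        # Assumption: the caller holds exactly one reference on entry.
        self.refcount = refcount

    def put(self):
        # Drop one reference; a negative value means it was dropped twice.
        self.refcount -= 1


def migrate(page, migration_supported, buggy):
    """Hypothetical model of the migrate step; returns True on success."""
    if not migration_supported:
        if buggy:
            # Pre-fix behaviour: the migrate path drops a reference
            # before reporting the failure to its caller.
            page.put()
        return False
    return True


def soft_offline(page, migration_supported, buggy):
    """Hypothetical model of the caller, which owns the failure cleanup."""
    if not migrate(page, migration_supported, buggy):
        # The caller performs the ref-count update on failure.
        page.put()
    return page.refcount


print(soft_offline(HugePage(), migration_supported=False, buggy=True))   # -1
print(soft_offline(HugePage(), migration_supported=False, buggy=False))  # 0
```

With the extra drop in the migrate step, the single reference is released twice; without it, only the caller releases it.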

[Testing]
Run mce-test/cases/function/hwpoison/run_hugepage_overcommit.sh; you should see a soft lockup if hugepage migration support is not enabled.
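
When checking a test run, it may help to scan the captured kernel log for stall messages instead of reading it by eye. The following is a rough sketch under assumptions: the function name find_stalls and the two patterns are illustrative, based only on the message formats quoted in this report, and other kernels may word these messages differently.

```python
import re

# Patterns modelled on the kernel messages quoted in this report.
STALL_PATTERNS = [
    re.compile(r"BUG: soft lockup - CPU#(\d+) stuck for (\d+)s"),
    re.compile(r"rcu_sched (?:self-)?detected stalls? on CPU"),
]


def find_stalls(log_text):
    """Return the log lines that match any of the stall patterns."""
    return [
        line
        for line in log_text.splitlines()
        if any(pattern.search(line) for pattern in STALL_PATTERNS)
    ]


sample_log = """\
TestCase ./thugetlb_overcommit 1
[ 1628.254754] NMI watchdog: BUG: soft lockup - CPU#8 stuck for 22s! [thugetlb_overco:3154]
[ 1660.668149] INFO: rcu_sched self-detected stall on CPU
[ 1660.672210] INFO: rcu_sched detected stalls on CPUs/tasks:
"""

for line in find_stalls(sample_log):
    print(line)
```

An empty result for a full run would be consistent with the fix working, though it is only as reliable as the patterns themselves.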

[Fix]
upstream commit:
30809f559a0d mm/migrate: fix refcount handling when !hugepage_migration_supported()

[Regression Potential]

This bug is missing log files that will aid in diagnosing the problem. From a terminal window please run:

apport-collect 1696165

and then change the status of the bug to 'Confirmed'.

If, due to the nature of the issue you have encountered, you are unable to run this command, please add a comment stating that fact and change the bug status to 'Confirmed'.

This change has been made by an automated script, maintained by the Ubuntu Kernel Team.

Changed in linux (Ubuntu):
status: New → Incomplete
Manoj Iyer (manjo) wrote :

== Without the patch ==
ubuntu@ubuntu:~/testing/mce-test/cases/function/hwpoison$ sudo ./run_hugepage_overcommit.sh
[sudo] password for ubuntu:

***************************************************************************
Pay attention:

This test checks that hugepage soft-offlining works under overcommitting.
***************************************************************************

-------------------------------------
TestCase ./thugetlb_overcommit 1
[ 1628.254754] NMI watchdog: BUG: soft lockup - CPU#8 stuck for 22s! [thugetlb_overco:3154]
[ 1660.668149] INFO: rcu_sched self-detected stall on CPU
[ 1660.672210] INFO: rcu_sched detected stalls on CPUs/tasks:
[ 1660.672216] 8-...: (14998 ticks this GP) idle=72f/140000000000001/0 softirq=1348/1348 fqs=7389
[ 1660.672217] (detected by 18, t=15002 jiffies, g=3147, c=3146, q=503)
[ 1660.692986] 8-...: (14998 ticks this GP) idle=72f/140000000000001/0 softirq=1348/1348 fqs=7392
[ 1660.701752] (t=15009 jiffies g=3147 c=3146 q=503)

[ 1840.695633] INFO: rcu_sched self-detected stall on CPU
[ 1840.699810] INFO: rcu_sched detected stalls on CPUs/tasks:
[ 1840.699818] 8-...: (59995 ticks this GP) idle=72f/140000000000001/0 softirq=1348/1348 fqs=27921
[ 1840.699818] (t=60007 jiffies g=3147 c=3146 q=1101)
[ 1840.719086] 8-...: (60000 ticks this GP) idle=72f/140000000000000/0 softirq=1348/1348 fqs=27921
[ 1840.727935] (detected by 1, t=60007 jiffies, g=3147, c=3146, q=1101)

Manoj Iyer (manjo) wrote :

== With patch applied ==
ubuntu@ubuntu:~/testing/mce-test/cases/function/hwpoison$ uname -a
Linux ubuntu 4.10.0-22-generic #24~lp1696165+softlockup.1 SMP Wed Jun 14 19:05:07 UTC 2017 aarch64 aarch64 aarch64 GNU/Linux

ubuntu@ubuntu:~/testing/mce-test/cases/function/hwpoison$ sudo ./run_hugepage_overcommit.sh
[sudo] password for ubuntu:
hwpoison-inject module is loaded.

***************************************************************************
Pay attention:

This test checks that hugepage soft-offlining works under overcommitting.
***************************************************************************

-------------------------------------
TestCase ./thugetlb_overcommit 1
FAIL: migration failed.
Unpoisoning.

 Num of Executed Test Case: 1 Num of Failed Case: 1

Testcase failure is expected because hugepage migration is not enabled in the Ubuntu configs. Please note that we no longer see soft lockups. The patch fixed that bug.

Manoj Iyer (manjo) wrote :

Test Kernel is available in PPA: https://launchpad.net/~centriq-team/+archive/ubuntu/lp1696165/

Boot tested on Power8:
ubuntu@manjo-srutest:~$ uname -a
Linux manjo-srutest 4.10.0-22-generic #24~lp1696165+softlockup.1-Ubuntu SMP Wed Jun 14 19:58:24 UTC 20 ppc64le ppc64le ppc64le GNU/Linux

Boot tested on AMD64:
ubuntu@adib:~$ uname -a
Linux adib 4.10.0-22-generic #24~lp1696165+softlockup.1-Ubuntu SMP Wed Jun 14 20:01:20 UTC 20 x86_64 x86_64 x86_64 GNU/Linux
ubuntu@adib:~$

Stefan Bader (smb) on 2017-06-21
Changed in linux (Ubuntu Zesty):
importance: Undecided → High
status: New → Fix Committed

This bug is awaiting verification that the kernel in -proposed solves the problem. Please test the kernel and update this bug with the results. If the problem is solved, change the tag 'verification-needed-zesty' to 'verification-done-zesty'. If the problem still exists, change the tag 'verification-needed-zesty' to 'verification-failed-zesty'.

If verification is not done within 5 working days from today, this fix will be dropped from the source code, and this bug will be closed.

See https://wiki.ubuntu.com/Testing/EnableProposed for documentation on how to enable and use -proposed. Thank you!

tags: added: verification-needed-zesty
Manoj Iyer (manjo) wrote :

== Testing kernel in -proposed ==

The kernel installs and boots fine on the QDF2400 platform.

$ uname -a
Linux awsdp0 4.10.0-28-generic #32-Ubuntu SMP Fri Jun 30 05:33:10 UTC 2017 aarch64 aarch64 aarch64 GNU/Linux

$ apt policy linux-image-4.10.0-28-generic
linux-image-4.10.0-28-generic:
  Installed: 4.10.0-28.32
  Candidate: 4.10.0-28.32
  Version table:
 *** 4.10.0-28.32 500
        500 http://ports.ubuntu.com/ubuntu-ports zesty-proposed/main arm64 Packages
        100 /var/lib/dpkg/status

The test case needs the hwpoison-inject module to run, but it is not enabled in the kernel configs by default.

$ sudo ./run_hugepage_overcommit.sh
sysctl: cannot stat /proc/sys/vm/memory_failure_early_kill: No such file or directory
modprobe: FATAL: Module hwpoison-inject not found in directory /lib/modules/4.10.0-28-generic
DIE: Failed to load hwpoison-inject module. Abort.
DIE: Failed to load hwpoison-inject module. Abort.

I can confirm that after rebuilding the kernel with the hwpoison-inject module and running the test, the kernel does not hit soft lockups and the test behaves as expected.
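
Before running the test, one way to see whether a kernel was built with the injector is to inspect its config file (on Ubuntu, typically /boot/config-<release>). The sketch below is an assumption-laden illustration: the helper names are hypothetical, and the sample text is made up for the example, not copied from any Ubuntu config.

```python
def parse_kernel_config(text):
    """Map CONFIG_* names to their values from kernel config text."""
    options = {}
    for line in text.splitlines():
        line = line.strip()
        if line.startswith("# CONFIG_") and line.endswith(" is not set"):
            # Disabled options appear as comments: "# CONFIG_FOO is not set".
            name = line[2:-len(" is not set")]
            options[name] = "n"
        elif line.startswith("CONFIG_") and "=" in line:
            name, value = line.split("=", 1)
            options[name] = value
    return options


def option_enabled(options, name):
    """True if the option is built in ('y') or built as a module ('m')."""
    return options.get(name, "n") in ("y", "m")


# Made-up sample, for illustration only.
sample_config = """\
CONFIG_HUGETLB_PAGE=y
# CONFIG_HWPOISON_INJECT is not set
"""

options = parse_kernel_config(sample_config)
print(option_enabled(options, "CONFIG_HWPOISON_INJECT"))
```

On a real system the text would be read from the config file for the running kernel instead of the sample string.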

tags: added: verification-done-zesty
removed: verification-needed-zesty
Manoj Iyer (manjo) wrote :

== with hwpoison-inject module ==

The test case fails as expected and does not result in soft lockups.
$ uname -r
4.10.0-28-generic

$ sudo ./run_hugepage_overcommit.sh
sudo: unable to resolve host awsdp0
hwpoison-inject module is loaded.

***************************************************************************
Pay attention:

This test checks that hugepage soft-offlining works under overcommitting.
***************************************************************************

-------------------------------------
TestCase ./thugetlb_overcommit 1
FAIL: migration failed.
Unpoisoning.

 Num of Executed Test Case: 1 Num of Failed Case: 1

Launchpad Janitor (janitor) wrote :

This bug was fixed in the package linux - 4.10.0-28.32

---------------
linux (4.10.0-28.32) zesty; urgency=low

  * linux: 4.10.0-28.32 -proposed tracker (LP: #1701013)

  * KILLER1435-S[0489:e0a2] BT cannot search BT 4.0 device (LP: #1699651)
    - Bluetooth: btusb: Add support for 0489:e0a2 QCA_ROME device

  * aacraid driver may return uninitialized stack data to userspace
    (LP: #1700077)
    - SAUCE: scsi: aacraid: Don't copy uninitialized stack memory to userspace

  * CVE-2017-9605
    - drm/vmwgfx: Make sure backup_handle is always valid

  * CVE-2017-1000380
    - ALSA: timer: Fix race between read and ioctl
    - ALSA: timer: Fix missing queue indices reset at SNDRV_TIMER_IOCTL_SELECT

  * XDP eBPF programs fail to verify on Zesty ppc64el (LP: #1699627)
    - [Config] ppc64el: build for Power8 not Power7

  * AACRAID for power9 platform (LP: #1689980)
    - scripts/spelling.txt: add "therfore" pattern and fix typo instances
    - scsi: aacraid: fix PCI error recovery path
    - scsi: aacraid: pci_alloc_consistent() failures on ARM64
    - scsi: aacraid: Remove __GFP_DMA for raw srb memory
    - scsi: aacraid: Fix DMAR issues with iommu=pt
    - scsi: aacraid: Added 32 and 64 queue depth for arc natives
    - scsi: aacraid: Set correct Queue Depth for HBA1000 RAW disks
    - scsi: aacraid: Remove reset support from check_health
    - scsi: aacraid: Change wait time for fib completion
    - scsi: aacraid: Log count info of scsi cmds before reset
    - scsi: aacraid: Print ctrl status before eh reset
    - scsi: aacraid: Using single reset mask for IOP reset
    - scsi: aacraid: Rework IOP reset
    - scsi: aacraid: Add periodic checks to see IOP reset status
    - scsi: aacraid: Rework SOFT reset code
    - scsi: aacraid: Rework aac_src_restart
    - scsi: aacraid: Use correct function to get ctrl health
    - scsi: aacraid: Make sure ioctl returns on controller reset
    - scsi: aacraid: Enable ctrl reset for both hba and arc
    - scsi: aacraid: Add reset debugging statements
    - scsi: aacraid: Remove reference to Series-9
    - scsi: aacraid: Update driver version to 50834

  * arm64 kernel crashdump support (LP: #1694859)
    - memblock: add memblock_clear_nomap()
    - memblock: add memblock_cap_memory_range()
    - arm64: limit memory regions based on DT property, usable-memory-range
    - arm64: kdump: reserve memory for crash dump kernel
    - arm64: mm: add set_memory_valid()
    - arm64: mm: use phys_addr_t instead of unsigned long in __map_memblock
    - arm64: kdump: protect crash dump kernel memory
    - arm64: hibernate: preserve kdump image around hibernation
    - arm64: kdump: implement machine_crash_shutdown()
    - arm64: kdump: add VMCOREINFO's for user-space tools
    - [Config] CONFIG_CRASH_DUMP=y on arm64
    - arm64: kdump: provide /proc/vmcore file
    - Documentation: kdump: describe arm64 port
    - Documentation: dt: chosen properties for arm64 kdump
    - efi/libstub/arm*: Set default address and size cells values for an empty dtb

  * hibmc driver does not include "pci:" prefix in bus ID (LP: #1698700)
    - SAUCE: drm: hibmc: Use set_busid function from drm core

  * Processes in "D" state due to za...


Changed in linux (Ubuntu Zesty):
status: Fix Committed → Fix Released