task-size-overrun test will hang on IBM s390x zKVM / zVM

Bug #1718107 reported by Po-Hsu Lin
This bug affects 1 person
Affects                    Status         Importance   Assigned to             Milestone
Ubuntu on IBM z Systems    Fix Released   Undecided    bugproxy
linux (Ubuntu)             Fix Released   Undecided    Canonical Kernel Team

Bug Description

This issue does not exist on other amd64 / i386 / arm64 boxes.

Steps:
1. Clone git://kernel.ubuntu.com/ubuntu/autotest-client-tests
2. Untar libhugetlbfs/libhugetlbfs-2.20.tar.gz
3. Run `sudo BUILDTYPE=NATIVEONLY make check` in the extracted libhugetlbfs-2.20 directory
4. Hit Ctrl + C to terminate it
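
Because the run has to be interrupted by hand, it can help to wrap step 3 in a timeout so that a hang is reported automatically. The following is only a minimal sketch and not part of the autotest framework; the helper name `run_with_timeout` and its parameters are assumptions made for illustration.

```python
import os
import signal
import subprocess

def run_with_timeout(cmd, timeout_s, cwd=None, env=None, grace_s=5):
    """Run cmd in its own process group; return (hung, returncode).

    hung is True when the command was still running after timeout_s seconds
    and had to be terminated; returncode is None in that case.
    """
    proc = subprocess.Popen(cmd, cwd=cwd, env=env, start_new_session=True)
    try:
        return False, proc.wait(timeout=timeout_s)
    except subprocess.TimeoutExpired:
        pass
    # Signal the whole group so that tests spawned by make are stopped too.
    for sig in (signal.SIGTERM, signal.SIGKILL):
        try:
            os.killpg(proc.pid, sig)
        except ProcessLookupError:
            break  # group already gone
        try:
            proc.wait(timeout=grace_s)
            break
        except subprocess.TimeoutExpired:
            continue
    proc.wait()
    return True, None
```

When the script itself runs as root, a call such as `run_with_timeout(["make", "check"], 1800, cwd="libhugetlbfs-2.20", env={**os.environ, "BUILDTYPE": "NATIVEONLY"})` would correspond to step 3; the 1800-second limit is an arbitrary assumption. Running the script as root rather than spawning `sudo` avoids the case where an unprivileged caller is not allowed to signal the root-owned process group.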

From the output you will see it get stuck:
truncate_above_4GB (1024K: 64): PASS
brk_near_huge (1024K: 64): brk_near_huge: malloc.c:2427: sysmalloc: Assertion `(old_top == initial_top (av) && old_size == 0) || ((unsigned long) (old_size) >= MINSIZE && prev_inuse (old_top) && ((unsigned long) old_end & (pagesize - 1)) == 0)' failed.

task-size-overrun (1024K: 64):

From the syslog you can see brk_near_huge has crashed:
Sep 19 07:28:25 s2lp6g004 rsyslogd-2007: action 'action 10' suspended, next retry is Tue Sep 19 07:29:55 2017 [v8.16.0 try http://www.rsyslog.com/e/2007 ]
Sep 19 07:28:36 s2lp6g004 AutotestCrashHandler: Application brk_near_huge, PID 10474 crashed
Sep 19 07:28:36 s2lp6g004 AutotestCrashHandler: Writing core files to ['/home/ubuntu/autotest/client/results/default/libhugetlbfs/debug/crash.brk_near_huge.10474']
Sep 19 07:28:36 s2lp6g004 AutotestCrashHandler: Could not determine from which application core file /home/ubuntu/autotest/client/results/default/libhugetlbfs/debug/crash.brk_near_huge.10474/core is from

After hitting Ctrl + C to interrupt, a kernel trace can be found in syslog:
Sep 19 07:30:02 s2lp6g004 kernel: [11224.325504] report_user_fault: 35 callbacks suppressed
Sep 19 07:30:02 s2lp6g004 kernel: [11224.325507] User process fault: interruption code 0010 ilc:3 in stack_grow_into_huge[12f880000+4000]
Sep 19 07:30:02 s2lp6g004 kernel: [11224.325520] Failing address: 000003ffe72fe000 TEID: 000003ffe72fe400
Sep 19 07:30:02 s2lp6g004 kernel: [11224.325521] Fault in primary space mode while using user ASCE.
Sep 19 07:30:02 s2lp6g004 kernel: [11224.325523] AS:00000000551701c7 R3:0000000048d88007 S:0000000000000020
Sep 19 07:30:02 s2lp6g004 kernel: [11224.325527] CPU: 0 PID: 10481 Comm: stack_grow_into Tainted: P D O 4.13.0-11-generic #12-Ubuntu
Sep 19 07:30:02 s2lp6g004 kernel: [11224.325527] Hardware name: IBM 2964 N63 400 (KVM/Linux)
Sep 19 07:30:02 s2lp6g004 kernel: [11224.325528] task: 0000000062450000 task.stack: 0000000062b3c000
Sep 19 07:30:02 s2lp6g004 kernel: [11224.325529] User PSW : 0705300180000000 000000012f8819a8
Sep 19 07:30:02 s2lp6g004 kernel: [11224.325530] R:0 T:1 IO:1 EX:1 Key:0 M:1 W:0 P:1 AS:0 CC:3 PM:0 RI:0 EA:3
Sep 19 07:30:02 s2lp6g004 kernel: [11224.325531] User GPRS: 000002001c51e2f8 fffffffffffff000 0000000000000000 0000000000000004
Sep 19 07:30:02 s2lp6g004 kernel: [11224.325531] 000003fff72fe5c0 0000000000000000 0000000000000003 0000000157806000
Sep 19 07:30:02 s2lp6g004 kernel: [11224.325532] fffffffffff00000 0000000000100000 000003fff7000000 000003fff72fe520
Sep 19 07:30:02 s2lp6g004 kernel: [11224.325532] 000002001c426000 0000000000000003 000000012f8819a2 000003ffe72fe518
Sep 19 07:30:02 s2lp6g004 kernel: [11224.325540] User Code: 000000012f881996: e548b0a80001 mvghi 168(%r11),1
Sep 19 07:30:02 s2lp6g004 kernel: [11224.325540] 000000012f88199c: c0e5fffffc86 brasl %r14,12f8812a8
Sep 19 07:30:02 s2lp6g004 kernel: [11224.325540] #000000012f8819a2: c2f410000008 slgfi %r15,268435464
Sep 19 07:30:02 s2lp6g004 kernel: [11224.325540] >000000012f8819a8: e54cf0a00001 mvhi 160(%r15),1
Sep 19 07:30:02 s2lp6g004 kernel: [11224.325540] 000000012f8819ae: 4110f0a0 la %r1,160(%r15)
Sep 19 07:30:02 s2lp6g004 kernel: [11224.325540] 000000012f8819b2: ec1afff8a065 clgrj %r1,%r10,10,12f8819a2
Sep 19 07:30:02 s2lp6g004 kernel: [11224.325540] 000000012f8819b8: b24f0010 ear %r1,%a0
Sep 19 07:30:02 s2lp6g004 kernel: [11224.325540] 000000012f8819bc: eb110020000d sllg %r1,%r1,32
Sep 19 07:30:02 s2lp6g004 kernel: [11224.325549] Last Breaking-Event-Address:
Sep 19 07:30:02 s2lp6g004 kernel: [11224.325552] [<000002001c7f26e8>] 0x2001c7f26e8
Sep 19 07:30:02 s2lp6g004 kernel: [11224.325562] Process 10481(stack_grow_into) has RLIMIT_CORE set to 1
Sep 19 07:30:02 s2lp6g004 kernel: [11224.325563] Aborting core
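
Note that this trace names `stack_grow_into_huge`, not `task-size-overrun`. As a cross-check of the numbers in it, the short calculation below recomputes the faulting page from the register dump. It is only a reading of the log, assuming the GPRs are printed in order r0 to r15 (so the last value is r15, the stack pointer) and that the reported failing address is rounded down to a 4 KiB page.

```python
PAGE_SIZE = 4096  # assumed 4 KiB base page size

# Values copied from the kernel trace above
r15 = 0x000003ffe72fe518             # last GPR in the dump (assumed to be r15)
failing_address = 0x000003ffe72fe000  # "Failing address" line
slgfi_immediate = 268435464           # from "slgfi %r15,268435464"
mvhi_displacement = 160               # from "mvhi 160(%r15),1"

# Address written by the faulting "mvhi 160(%r15),1" instruction
store_target = r15 + mvhi_displacement
# Round down to the start of the page containing that address
faulting_page = store_target & ~(PAGE_SIZE - 1)

print(hex(store_target))
print(hex(faulting_page), faulting_page == failing_address)
# The slgfi immediate is 256 MiB plus 8 bytes
print(slgfi_immediate == 256 * 1024 * 1024 + 8)
```

Under those assumptions the store target falls in the page reported as the failing address, which suggests the fault comes from the store right after the stack pointer was lowered by 256 MiB plus 8 bytes.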

Please see the attachments for a more detailed syslog and the core dump file.

ProblemType: Bug
DistroRelease: Ubuntu 17.10
Package: linux-image-4.13.0-11-generic 4.13.0-11.12
ProcVersionSignature: Ubuntu 4.13.0-11.12-generic 4.13.1
Uname: Linux 4.13.0-11-generic s390x
NonfreeKernelModules: zfs zunicode zavl zcommon znvpair
AlsaDevices: Error: command ['ls', '-l', '/dev/snd/'] failed with exit code 2: ls: cannot access '/dev/snd/': No such file or directory
AplayDevices: Error: [Errno 2] No such file or directory: 'aplay'
ApportVersion: 2.20.7-0ubuntu1
Architecture: s390x
ArecordDevices: Error: [Errno 2] No such file or directory: 'arecord'
CRDA: Error: command ['iw', 'reg', 'get'] failed with exit code 1: nl80211 not found.
CurrentDmesg:

Date: Tue Sep 19 07:16:55 2017
IwConfig: Error: [Errno 2] No such file or directory: 'iwconfig'
Lspci:

Lsusb: Error: command ['lsusb'] failed with exit code 1:
PciMultimedia:

ProcEnviron:
 TERM=xterm-256color
 PATH=(custom, no user)
 XDG_RUNTIME_DIR=<set>
 LANG=C
 SHELL=/bin/bash
ProcFB: Error: [Errno 2] No such file or directory: '/proc/fb'
ProcKernelCmdLine: root=UUID=d86dc8de-6ac8-4e4b-aba8-92872a209a8c crashkernel=196M
RelatedPackageVersions:
 linux-restricted-modules-4.13.0-11-generic N/A
 linux-backports-modules-4.13.0-11-generic N/A
 linux-firmware 1.168
RfKill: Error: [Errno 2] No such file or directory: 'rfkill'
SourcePackage: linux
UpgradeStatus: No upgrade log present (probably fresh install)

Revision history for this message
Po-Hsu Lin (cypressyew) wrote :

Complete output of command:
sudo BUILDTYPE=NATIVEONLY make check

Revision history for this message
Po-Hsu Lin (cypressyew) wrote :
  • core (2.1 MiB, application/octet-stream)

Program: Unknown
PID: 10474
Signal: 6
Hostname: s2lp6g004
Time of the crash (according to kernel): Tue Sep 19 07:28:36 2017
Program backtrace:
Could not determine backtrace for core file /home/ubuntu/autotest/client/results/default/libhugetlbfs/debug/crash.brk_near_huge.10474/core

Revision history for this message
Ubuntu Kernel Bot (ubuntu-kernel-bot) wrote : Missing required logs.

This bug is missing log files that will aid in diagnosing the problem. While running an Ubuntu kernel (not a mainline or third-party kernel) please enter the following command in a terminal window:

apport-collect 1718107

and then change the status of the bug to 'Confirmed'.

If, due to the nature of the issue you have encountered, you are unable to run this command, please add a comment stating that fact and change the bug status to 'Confirmed'.

This change has been made by an automated script, maintained by the Ubuntu Kernel Team.

Changed in linux (Ubuntu):
status: New → Incomplete
Po-Hsu Lin (cypressyew)
description: updated
Frank Heimes (fheimes)
Changed in ubuntu-z-systems:
assignee: nobody → bugproxy (bugproxy)
bugproxy (bugproxy)
tags: added: architecture-s39064 bugnameltc-158964 severity-high targetmilestone-inin1710
Frank Heimes (fheimes)
Changed in ubuntu-z-systems:
status: New → Triaged
Revision history for this message
bugproxy (bugproxy) wrote : Comment bridged from LTC Bugzilla

------- Comment From <email address hidden> 2017-09-20 08:56 EDT-------
It doesn't look like "brk_near_huge" is hanging, but rather the next test "task-size-overrun". The Assertion failure originates from "brk_near_huge", but that is probably unrelated to the hanging "task-size-overrun".

Also, "task-size-overrun" may only appear to be hanging. IIRC, this test tries to determine TASK_SIZE by successively allocating 4K pages for as long as possible. We already had an issue there on s390 with our dynamic pagetable upgrade feature, which was fixed by the following libhugetlbfs commit:

commit a06eeed7e005af579169f3020d3980198eb3ba30
Author: Gerald Schaefer <email address hidden>
Date: Thu Mar 7 15:41:58 2013 +0100

task-size-overrun: fix problem with dynamic pagetable upgrade on s390x

The strategy to find out TASK_SIZE won't work on s390x anymore, starting
with kernel 3.9. We will dynamically increase the pagetable levels on
s390x on access beyond TASK_SIZE, effectively increasing TASK_SIZE from
2^42 to 2^53, but /proc/self/maps won't reflect this.

With the current strategy that means that find_task_size() would loop
for a very long time, from 2^42 to 2^53. To fix this, increase addr
in the loop for s390x as soon as we exceed the 2^42 limit.

Signed-off-by: Gerald Schaefer <email address hidden>
Signed-off-by: Eric B Munson <email address hidden>

Since we only recently added 5-level pagetable support, I guess this fix no longer helps, as we will now loop from 2^53 to 2^64. I'll look into this, we probably need to update "task-size-overrun" again to reflect the 5-level pagetable update.
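
To give a feel for why the test only appears to hang, the sketch below counts how many 4K probes a page-by-page scan would need across the ranges mentioned above, and simulates the "jump ahead" idea from the commit message. It is a simplified model written for illustration, not the libhugetlbfs code; the function names and the way the skip is expressed are assumptions.

```python
PAGE_SIZE = 4096  # the test probes one 4K page at a time

def linear_probe_count(start, limit):
    """Probes a page-by-page scan needs to get from start to limit."""
    return (limit - start) // PAGE_SIZE

def scan_with_skips(start, limit, skips):
    """Simulate the scan; skips is a list of (low, high) address pairs.

    Once addr lies strictly between low and high, the scan jumps to high.
    Returns the number of probes made before addr reaches limit.
    """
    addr, probes = start, 0
    while addr < limit:
        probes += 1          # stands in for one mapping attempt at addr
        addr += PAGE_SIZE
        for low, high in skips:
            if low < addr < high:
                addr = high  # jump over the range known to succeed
    return probes

T42, T53, T64 = 1 << 42, 1 << 53, 1 << 64

print(linear_probe_count(T42, T53))  # 2**41 - 2**30 probes
print(linear_probe_count(T53, T64))  # 2**52 - 2**41 probes
# Starting four pages below 2^42 and skipping (2^42, 2^53) takes only a few probes
print(scan_with_skips(T42 - 4 * PAGE_SIZE, T53, [(T42, T53)]))
```

In this model the unskipped range from 2^53 to 2^64 needs roughly 4.5 * 10^15 probes. At an assumed rate of one million probes per second that would take well over a century, which matches the observation that the test looks hung rather than failed.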

Revision history for this message
bugproxy (bugproxy) wrote : syslog
  • syslog (45.6 KiB, application/octet-stream)

Default Comment by Bridge

Revision history for this message
bugproxy (bugproxy) wrote : AudioDevicesInUse.txt

Default Comment by Bridge

Revision history for this message
bugproxy (bugproxy) wrote : Dependencies.txt

Default Comment by Bridge

Revision history for this message
bugproxy (bugproxy) wrote : JournalErrors.txt

Default Comment by Bridge

Revision history for this message
bugproxy (bugproxy) wrote : ProcCpuinfo.txt

Default Comment by Bridge

Revision history for this message
bugproxy (bugproxy) wrote : ProcCpuinfoMinimal.txt

Default Comment by Bridge

Revision history for this message
bugproxy (bugproxy) wrote : ProcInterrupts.txt

Default Comment by Bridge

Revision history for this message
bugproxy (bugproxy) wrote : ProcModules.txt

Default Comment by Bridge

Revision history for this message
bugproxy (bugproxy) wrote : UdevDb.txt

Default Comment by Bridge

Revision history for this message
bugproxy (bugproxy) wrote : WifiSyslog.txt

Default Comment by Bridge

Revision history for this message
bugproxy (bugproxy) wrote : test-result.txt

Default Comment by Bridge

Revision history for this message
bugproxy (bugproxy) wrote : core
  • core (2.1 MiB, application/octet-stream)

Default Comment by Bridge

Revision history for this message
bugproxy (bugproxy) wrote : Comment bridged from LTC Bugzilla

------- Comment From <email address hidden> 2017-09-26 11:17 EDT-------
*** Bug 159257 has been marked as a duplicate of this bug. ***

Revision history for this message
Frank Heimes (fheimes) wrote : Re: brk_near_huge test will hang on IBM s390x zKVM / zVM

According to #5: Please let us know when an updated fix becomes available and upstream accepted - thx.

Changed in linux (Ubuntu):
status: Incomplete → Confirmed
Changed in ubuntu-z-systems:
status: Triaged → Confirmed
Revision history for this message
bugproxy (bugproxy) wrote : Comment bridged from LTC Bugzilla

------- Comment From <email address hidden> 2017-09-26 14:11 EDT-------
(In reply to comment #20)
> According to #5: Please let us know when an updated fix becomes available
> and upstream accepted - thx.

Sure, I already posted a fix for this on the libhugetlbfs mailing list, but it doesn't look like there is very much traffic or attention on this (new) list:
https://groups.google.com/forum/?hl=en#!forum/libhugetlbfs

BTW, someone from Power apparently got triggered by my patch (or pure coincidence) and posted a similar fix for Power, so I assume you would see the same issue ("hanging" task-size-overrun) also on Power.

BTW2, the title of this bug is wrong, it is not "brk_near_huge" that seems to be hanging, but rather "task-size-overrun".

Frank Heimes (fheimes)
summary: - brk_near_huge test will hang on IBM s390x zKVM / zVM
+ task-size-overrun test will hang on IBM s390x zKVM / zVM
Revision history for this message
bugproxy (bugproxy) wrote :

------- Comment From <email address hidden> 2017-09-27 12:27 EDT-------
*** Bug 159257 has been marked as a duplicate of this bug. ***

Revision history for this message
Dimitri John Ledkov (xnox) wrote :

Whilst we are waiting on upstream, can the kernel team pick up the proposed patches for power & s390 as patches into the autotest-client-tests framework?

Changed in linux (Ubuntu):
assignee: nobody → Canonical Kernel Team (canonical-kernel-team)
Revision history for this message
Po-Hsu Lin (cypressyew) wrote :

Will do.

Revision history for this message
Po-Hsu Lin (cypressyew) wrote :

Tested with Gerald's patch. It fixes the task-size-overrun hang on s390x.
The test report can be found here: http://pastebin.ubuntu.com/25765114/

I will check with powerpc later. Thanks

Revision history for this message
Po-Hsu Lin (cypressyew) wrote :

Tested on powerpc, looking good.
We have applied these two patches as SAUCE patches in our repo:
git://kernel.ubuntu.com/ubuntu/autotest-client-tests
With commits:
    84c3f660ebc1b68436136ab8f58e1a5e542d9c96
    2b8f225706575e54ee76faca0b57c5486c18d243

As a result I am closing this bug with Fix-released.
Thank you.

Po-Hsu Lin (cypressyew)
Changed in ubuntu-z-systems:
status: Confirmed → Fix Released
Changed in linux (Ubuntu):
status: Confirmed → Fix Released
Revision history for this message
bugproxy (bugproxy) wrote :

------- Comment From <email address hidden> 2017-11-06 04:54 EDT-------
IBM Bugzilla status -> closed, fix released by Canonical
