memory_stress_ng failing for Power architecture for 16.04

Bug #1573062 reported by Mike Rushton
Affects                    Status        Importance  Assigned to   Milestone
Checkbox Provider - Base   Fix Released  Critical    Jeff Lane
linux (Ubuntu)             Confirmed     High        Unassigned
linux (Ubuntu) Xenial      Confirmed     High        Unassigned
linux (Ubuntu) Yakkety     Won't Fix     High        Unassigned

Bug Description

memory_stress_ng, run as part of server certification, is failing for the IBM Power S812LC (TN71-BP012) in bare metal mode. "Failing" in this case means the test locks up the server in an unrecoverable state that only a reboot will fix.

I will be attaching screen and kern logs for the failures and a successful run on 14.04 on the same server.

Revision history for this message
Mike Rushton (leftyfb) wrote :

screen session of failure

Revision history for this message
Mike Rushton (leftyfb) wrote :

kern.log from failure

Revision history for this message
Mike Rushton (leftyfb) wrote :

screen session log from successful test on 14.04

Revision history for this message
Mike Rushton (leftyfb) wrote :

kern.log from successful test on 14.04

Revision history for this message
Brad Figg (brad-figg) wrote : Missing required logs.

This bug is missing log files that will aid in diagnosing the problem. From a terminal window please run:

apport-collect 1573062

and then change the status of the bug to 'Confirmed'.

If, due to the nature of the issue you have encountered, you are unable to run this command, please add a comment stating that fact and change the bug status to 'Confirmed'.

This change has been made by an automated script, maintained by the Ubuntu Kernel Team.

Changed in linux (Ubuntu):
status: New → Incomplete
Mike Rushton (leftyfb)
Changed in linux (Ubuntu):
status: Incomplete → Confirmed
Changed in linux (Ubuntu):
assignee: nobody → Canonical Kernel Team (canonical-kernel-team)
importance: Undecided → High
status: Confirmed → Triaged
Revision history for this message
Colin Ian King (colin-king) wrote : Re: memory_stress_ng failing for IBM Power S812LC(TN71-BP012) for 16.04

I don't see any evidence of a hang, just evidence of a machine being power-cycled.

Apr 19 21:26:33 ubuntu kernel: [19749.994340] [45742] 1000 45742 369 18 6 3 14 0 stress-ng-brk
Apr 19 21:26:33 ubuntu kernel: [19749.994342] [45743] 1000 45743 369 18 6 3 14 0 stress-ng-brk
Apr 19 21:26:33 ubuntu kernel: [19749.994344] Out of memory: Kill process 45583 (stress-ng-brk) score 28 or sacrifice child
Apr 19 21:26:33 ubuntu kernel: [19749.994566] Killed process 45583 (stress-ng-brk) total-vm:7976960kB, anon-rss:7048512kB, file-rss:1152kB
Apr 20 14:28:33 binacle kernel: [ 0.000000] opal: OPAL V3 detected !
Apr 20 14:28:33 binacle kernel: [ 0.000000] Allocated 4980736 bytes for 2048 pacas at c00000000fb40000
Apr 20 14:28:33 binacle kernel: [ 0.000000] Using PowerNV machine description
Apr 20 14:28:33 binacle kernel: [ 0.000000] Page sizes from device-tree:
Apr 20 14:28:33 binacle kernel: [ 0.000000] base_shift=12: shift=12, sllp=0x0000, avpnm=0x00000000, tlbiel=1, penc=0
Apr 20 14:28:33 binacle kernel: [ 0.000000] base_shift=12: shift=16, sllp=0x0000, avpnm=0x00000000, tlbiel=1, penc=7
Apr 20 14:28:33 binacle kernel: [ 0.000000] base_shift=12: shift=24, sllp=0x0000, avpnm=0x00000000, tlbiel=1, penc=56
Apr 20 14:28:33 binacle kernel: [ 0.000000] base_shift=16: shift=16, sllp=0x0110, avpnm=0x00000000, tlbiel=1, penc=1

Can you supply details of how stress-ng is being run?

For this brk stressor test, stress-ng performs rapid heap expansion using the brk() system call, which forces the system to consume all available memory and then transition into a swapping phase. This can lead to an apparently "hung" situation (for example, your shell may be swapped out) even though the system is still active and has not crashed.

The kernel messages you see in kern.log are just the kernel OOM killer killing off the best candidate process to free up memory; this freed memory will rapidly be consumed again by other contending brk stressors, so the system will appear to be hung while it is busy cycling through killing and respawning these stressors.

Was the machine pingable? If so, it's not dead/hung. Perhaps it got power-cycled prematurely. What evidence was there that the machine was hung?
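One way to tell a genuinely hung machine from one that is thrashing through the OOM kill/respawn cycle described above is to watch kernel OOM events from a second console while the stressor runs. A sketch (assumes `dmesg --follow` from a recent util-linux; it may require root):

```shell
# Watch for OOM-killer activity live; steady output here means the kernel
# is still alive and cycling through kills rather than hard-hung.
dmesg --follow | grep -E 'Out of memory|Killed process'
```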

Revision history for this message
Mike Rushton (leftyfb) wrote :

The very first line of the screen session shows how the test is being run:

/usr/lib/plainbox-provider-checkbox/bin/memory_stress_ng

The machine is completely unresponsive. No ssh, no ping and the console over ipmi is locked up. Hitting CTRL+C over the sol console does nothing at all. This is after leaving the test running for 18+ hours where it should only take around 1-3 hours tops to complete depending on the machine. As you can see on the same machine running the same test on 14.04, the test took about an hour to complete with no lockup at the end.

Revision history for this message
Rod Smith (rodsmith) wrote :

Note that memory_stress_ng is a script in the Checkbox certification suite. The actual stress-ng command line should be:

timeout -s 9 $end_time stress-ng --aggressive --verify --timeout $runtime --brk 0

The exact values of $end_time and $runtime vary with the amount of memory in the system -- it's 300 seconds plus 10 seconds per GiB for $runtime and 50% more for $end_time.

Note also that Mike's "1-3 hours tops" refers to the ENTIRE memory_stress_ng run; the "stress-ng... --brk 0" run would of course be much shorter than that, since the script runs a series of stressors in sequence.
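Rod's description of the timeout computation can be sketched in shell. This is a hypothetical reconstruction from his comment, not the actual memory_stress_ng source; the variable names are assumptions:

```shell
#!/bin/sh
# Derive the stress-ng timeouts from installed RAM:
# 300 s base plus 10 s per GiB for $runtime, and 50% more for $end_time.
mem_gib=$(awk '/^MemTotal:/ {printf "%d", $2 / 1048576}' /proc/meminfo)
runtime=$(( 300 + 10 * mem_gib ))
end_time=$(( runtime + runtime / 2 ))
echo "runtime=${runtime}s end_time=${end_time}s"
# e.g. on a 32 GiB machine: runtime=620s end_time=930s
# The stressor itself would then be invoked roughly as:
#   timeout -s 9 "$end_time" stress-ng --aggressive --verify --timeout "$runtime" --brk 0
```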

Revision history for this message
Colin Ian King (colin-king) wrote :

Is this being run as a normal user or with root privileges?

Revision history for this message
Rod Smith (rodsmith) wrote :

It should be run as root.

Revision history for this message
Colin Ian King (colin-king) wrote :

Ahah, perhaps one should examine the stress-ng manual:

"Running stress-ng with root privileges will adjust out of memory settings on Linux systems to make the stressors unkillable in low memory situations, so use this judiciously. "

We may be just being a bit too demanding...
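The manual's claim can be checked directly: on Linux, a process's OOM exemption shows up in /proc/<pid>/oom_score_adj, where -1000 means the OOM killer will never pick it. A quick sketch for inspecting running stressors (whether stress-ng actually sets -1000 when run as root is inferred from the manual text quoted above, not verified here):

```shell
# Print the OOM score adjustment of every running stress-ng process.
for pid in $(pgrep stress-ng); do
    printf '%-8s %s\n' "$pid" "$(cat /proc/"$pid"/oom_score_adj)"
done
```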

Revision history for this message
Mike Rushton (leftyfb) wrote :

The tests run manually above were NOT run as root. The test during certification does run as root. I have run the test as root and as the ubuntu user, both with the same results.

Also, as shown above, this test ran fine manually while running Ubuntu 14.04 without root and during proper certification testing as root.

tags: added: kernel-key
Revision history for this message
Joseph Salisbury (jsalisbury) wrote :

I'd like to perform a bisect to figure out what commit caused this regression. We need to identify the earliest kernel where the issue started happening as well as the latest kernel that did not have this issue.

Can you test the following kernels and report back? We are looking for the first kernel version that exhibits this bug:

3.16: http://kernel.ubuntu.com/~kernel-ppa/mainline/v3.16-utopic/
3.19: http://kernel.ubuntu.com/~kernel-ppa/mainline/v3.19-vivid/
4.2: http://kernel.ubuntu.com/~kernel-ppa/mainline/v4.2-wily/

You don't have to test every kernel, just up until the kernel that first has this bug.

Once we know the results of those kernels, we can narrow it down further.

Thanks in advance!

tags: added: performing-bisect
Revision history for this message
Mike Rushton (leftyfb) wrote :

Are there ppc64le versions of those kernels?

Revision history for this message
Tim Gardner (timg-tpi) wrote :
Revision history for this message
Mike Rushton (leftyfb) wrote :

Confirmed the above kernel fails in the same manner.

Revision history for this message
Joseph Salisbury (jsalisbury) wrote :

Can you test the 4.1 kernel next? It can be downloaded from:

http://kernel.ubuntu.com/~kernel-ppa/mainline/v4.1-wily/

We may also want to test the latest mainline kernel to see if this bug is already fixed there. If it is, we can perform a "Reverse" bisect to find the fix. The mainline kernel is available from:

http://kernel.ubuntu.com/~kernel-ppa/mainline/v4.6-rc6-wily/

Revision history for this message
Mike Rushton (leftyfb) wrote :

Ubuntu 14.04
3.19.0-58-generic - pass
4.2.0-35-generic - fails
4.1.0-040100-generic - fail
4.6.0-040600rc6-generic - pass

Ubuntu 16.04
4.4.0-22-generic - fail
4.1.0-040100-generic - fail
4.6.0-040600rc6-generic - pass

Revision history for this message
Joseph Salisbury (jsalisbury) wrote :

That is good news that the bug is fixed in mainline. We can perform a reverse bisect and find the commit that fixes this. We first need to narrow down which version fixes the issue. Can you test the following kernels next:

v4.5: http://kernel.ubuntu.com/~kernel-ppa/mainline/v4.5-wily/
v4.6-rc1: http://kernel.ubuntu.com/~kernel-ppa/mainline/v4.6-rc1-wily/

Changed in linux (Ubuntu):
status: Triaged → In Progress
assignee: Canonical Kernel Team (canonical-kernel-team) → Joseph Salisbury (jsalisbury)
Changed in linux (Ubuntu Xenial):
status: New → In Progress
Changed in linux (Ubuntu Wily):
status: New → In Progress
Changed in linux (Ubuntu Xenial):
importance: Undecided → High
Changed in linux (Ubuntu Wily):
importance: Undecided → High
Changed in linux (Ubuntu Xenial):
assignee: nobody → Joseph Salisbury (jsalisbury)
Changed in linux (Ubuntu Wily):
assignee: nobody → Joseph Salisbury (jsalisbury)
tags: added: kernel-da-key
removed: kernel-key
Revision history for this message
Colin Ian King (colin-king) wrote :

I'd like to chip in with my $0.02 worth. This may be a sporadic hang, so it may be worth double checking each bisection point in case it's a bug that may be a little more racy than we'd like.

Revision history for this message
Mike Rushton (leftyfb) wrote :

I have noticed that it has been sporadic only on the Alpine hardware within an LPAR. I assume that is due to the additional hardware resources, which can better handle the memory stress test. All other tests have been run 2 or 3 times each, no more than that. I can run the above kernel tests, but mind you, each test takes several hours to half a day. Running each test multiple times could take about a week to complete. We really need to get this sorted out ASAP since it is holding up all Power certification.

Revision history for this message
Mike Rushton (leftyfb) wrote :

4.5.0-040500.201603140130_ppc64el - fail
4.6.0-040600rc1.201603261930_ppc64el - pass

4.6 was run twice and passed both times.

tags: added: kernel-key
removed: kernel-da-key
Revision history for this message
Joseph Salisbury (jsalisbury) wrote :

I started a "Reverse" kernel bisect between v4.5 final and v4.6-rc1. The kernel bisect will require testing of about 10-12 test kernels.

I built the first test kernel, up to the following commit:

The test kernel can be downloaded from:
http://kernel.ubuntu.com/~jsalisbury/lp1573062

Can you test that kernel and report back if it has the bug or not? I will build the next test kernel based on your test results.

In parallel to the bisect, I'll review the git logs to see if the fix sticks out. There are 166 powerpc commits between these two versions, so a bisect may be the only way to identify the exact commit.

Thanks in advance
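For reference, the "reverse" bisect Joseph describes maps onto git's alternate bisect terms (available since git 2.7), where the first *fixed* kernel plays the role normally taken by the first bad one. A sketch, assuming a mainline kernel tree; the term names are one possible choice:

```shell
# Bisect for the commit that FIXED the hang between v4.5 and v4.6-rc1.
git bisect start --term-old=broken --term-new=fixed
git bisect broken v4.5       # last release that still hangs
git bisect fixed v4.6-rc1    # first release known to pass
# Build and test the commit git checks out, then mark it:
#   git bisect broken    # still hangs
#   git bisect fixed     # passes
# Repeat until git names the fixing commit.
```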

Revision history for this message
Mike Rushton (leftyfb) wrote :

Testing now...

Revision history for this message
Mike Rushton (leftyfb) wrote :

So far, after 2 runs, 4.5.0-040500-generic_4.5.0-040500.201605161244 seems to be working. I'm running it a 3rd time now just to be sure. It takes about 7 hours to finish.

Revision history for this message
Joseph Salisbury (jsalisbury) wrote :

Thanks for the update, Mike. If you think the two test runs are enough to prove that kernel is good, I'll build the next one. Otherwise, I'll wait for the results of the third run.

Revision history for this message
Mike Rushton (leftyfb) wrote :

I would go ahead and assume it's good. I haven't yet had the test pass twice in a row on a bad kernel.

Revision history for this message
Joseph Salisbury (jsalisbury) wrote :

I built the next test kernel, up to the following commit:
53d2e6976bd4042672ed7b90dfbf4b31635b7dcf

The test kernel can be downloaded from:
http://kernel.ubuntu.com/~jsalisbury/lp1573062

Can you test that kernel and report back if it has the bug or not? I will build the next test kernel based on your test results.

Thanks in advance

Revision history for this message
Joseph Salisbury (jsalisbury) wrote :

Commit 6b5f04b6cf8ebab9a65d9c0026c650bb2538fd0f was actually bad. So the test kernel in comment #28 is invalid. I'll re-update the bisect and build the next kernel.

Revision history for this message
Joseph Salisbury (jsalisbury) wrote :

I reset the bisect and built the first test kernel, up to the following commit:
96b9b1c95660d4bc5510c5d798d3817ae9f0b391

The test kernel can be downloaded from:
http://kernel.ubuntu.com/~jsalisbury/lp1573062

Can you test that kernel and report back if it has the bug or not? I will build the next test kernel based on your test results.

Thanks in advance

Revision history for this message
Joseph Salisbury (jsalisbury) wrote :

Sorry, we should actually test the kernel in comment #28 after all. This is because we are performing a "Reverse" bisect, not a regular bisect. The test kernel is available from:

http://kernel.ubuntu.com/~jsalisbury/lp1573062

The .deb file name is:
linux-image-4.5.0-040500-generic_4.5.0-040500.201605171117_ppc64el.deb

Thanks, and sorry for the confusion.

Revision history for this message
Mike Rushton (leftyfb) wrote :

4.5.0-040500.201605171117 failed

Revision history for this message
Joseph Salisbury (jsalisbury) wrote :

I built the next test kernel, up to the following commit:
2e11590171683c6b12193fe4b0ede1e6201b7f45

The test kernel can be downloaded from:
http://kernel.ubuntu.com/~jsalisbury/lp1573062

Can you test that kernel and report back if it has the bug or not? I will build the next test kernel based on your test results.

Thanks in advance

bugproxy (bugproxy)
tags: added: architecture-ppc64le bugnameltc-141723 severity-critical targetmilestone-inin16041
bugproxy (bugproxy)
tags: added: severity-high
removed: severity-critical
Revision history for this message
Mike Rushton (leftyfb) wrote :

PASS,2016-05-23-13-21-37,4.5.0-040500-generic-201605230752
FAIL,2016-05-23-21-06-07,4.5.0-040500-generic-201605230752
PASS,2016-05-24-04-32-39,4.5.0-040500-generic-201605230752
FAIL,2016-05-24-13-31-01,4.5.0-040500-generic-201605230752
FAIL,2016-05-25-04-06-51,4.5.0-040500-generic-201605230752

Revision history for this message
Joseph Salisbury (jsalisbury) wrote :

I built the next test kernel, up to the following commit:
5a010c73cdb760c9bdf37b28824b6566789cc005

The test kernel can be downloaded from:
http://kernel.ubuntu.com/~jsalisbury/lp1573062

Can you test that kernel and report back if it has the bug or not? I will build the next test kernel based on your test results.

Thanks in advance

Revision history for this message
Mike Rushton (leftyfb) wrote :

4.5.0-040500-generic-201605251558 failed

Revision history for this message
Joseph Salisbury (jsalisbury) wrote :

I built the next test kernel, up to the following commit:
8f40842e4260f73792c156aded004197a19135ee

The test kernel can be downloaded from:
http://kernel.ubuntu.com/~jsalisbury/lp1573062

Can you test that kernel and report back if it has the bug or not? I will build the next test kernel based on your test results.

Thanks in advance

Revision history for this message
bugproxy (bugproxy) wrote : Comment bridged from LTC Bugzilla

------- Comment From <email address hidden> 2016-05-27 03:40 EDT-------
Hi, where could I get the source/binary of memory_stress_ng? I will try to reproduce it on local Power servers.

Revision history for this message
Michael Ellerman (michael-ellerman) wrote : Re: memory_stress_ng failing for IBM Power S812LC(TN71-BP012) for 16.04

Can you please post the bisection log using upstream commit ids.

Revision history for this message
Mike Rushton (leftyfb) wrote :

8f40842e4260f73792c156aded004197a19135ee failed

no longer affects: linux (Ubuntu Wily)
tags: added: kernel-da-key
removed: kernel-key
tags: added: kernel-key
removed: kernel-da-key
tags: added: kernel-da-key
removed: kernel-key
Revision history for this message
Mike Rushton (leftyfb) wrote :

Kalpana:

The above failed test was run on only 1 of the 4 machines we have been testing with for months now. The other OpenPOWER server is a Habanero with 256G. I am running tests on that as we speak and will include the kern.log as well.

Just to clarify, the memory test I'm running is /usr/lib/plainbox-provider-checkbox/bin/memory_stress_ng

Revision history for this message
Balbir Singh (bsingharora) wrote :

I am unable to reproduce the failure either, but on your system with 32G and 128 threads it looks like the test would start 128 hogs, each hogging up to 32GB. How much swap do you have on them? Could you post the dmesg so we can see what failed, along with the logs around it? It looks like the stack stressor failed.

Revision history for this message
Mike Rushton (leftyfb) wrote :

This is a failed test from today on the Habanero. Find the test output and kern.log attached:

ubuntu@binacle:~/4.4.0-34-generic-53~lp1573062PATCHED$ free -m
               total    used     free  shared  buff/cache  available
Mem:          261533     621   259912      20        1000     259938
Swap:           8191       0     8191

Revision history for this message
Mike Rushton (leftyfb) wrote :

kern.log from the Habanero 4.4.0-34-generic-53~lp1573062PATCHED test

Revision history for this message
Mike Rushton (leftyfb) wrote :

This is a failed test from today on the Firestone. Find the test output and kern.log attached:

ubuntu@gulpin:~/4.4.0-34-generic-53~lp1573062PATCHED$ free -m
               total    used     free  shared  buff/cache  available
Mem:           32565     342    30092      21        2131      30918
Swap:           8191       0     8191

Revision history for this message
Mike Rushton (leftyfb) wrote :

kern.log from the Firestone 4.4.0-34-generic-53~lp1573062PATCHED test

Revision history for this message
Balbir Singh (bsingharora) wrote :

These logs show something I've not seen here in my testing. They show that we are stuck doing an up_write() on root->rwsem in the anon_vma path. It looks like we are contending on the rwsem's sem->wait_lock. I don't have a reproduction of this issue; it will be interesting to examine what is causing the heavy contention.

Revision history for this message
Mike Rushton (leftyfb) wrote :

Out of the 4 tests last night (Habanero(NV), Firestone(NV), Tuleta(NV), Alpine(VM)), only the Firestone failed. Attached is the output from the test, kern.log and the dmesg output I was running in a while loop during testing.

Revision history for this message
Mike Rushton (leftyfb) wrote :

kern.log from failed test on Firestone

Revision history for this message
Mike Rushton (leftyfb) wrote :

dmesg from failed test on Firestone

Mike Rushton (leftyfb)
summary: - memory_stress_ng failing for IBM Power S812LC(TN71-BP012) for 16.04
+ memory_stress_ng failing for Power architecture for 16.04
Revision history for this message
Jeff Lane  (bladernr) wrote :

Just an additional data point, I ran this test (using Mike's script in a loop) on other architectures:

amd64 bare metal: 30 times 4.4.0, no failures
amd64 virtual machine: 30 times, 4.4.0, no failures

s390 4GB RAM: 40 times on both 4.4.0-31 and 4.7.0, no failures
s390 20GB RAM: 40 times on both 4.4.0-31 and 4.7.0, no failures.

I'm still running on arm64 and waiting on final results.

Revision history for this message
Balbir Singh (bsingharora) wrote :

Thanks Jeff. I see that the ARM64 might have failed - https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1610320

Can we know the git commit id that fixed the ARM64 failure in mainline?

BTW, could you please share the full machine configurations - threads+RAM+swap for each of the other architectures you ran this on?

tags: added: kernel-key
removed: performing-bisect
Revision history for this message
bugproxy (bugproxy) wrote :

This comment is to inform you that mirroring has been disabled between Launchpad bug 1573062 and LTC bug 141723 due to the size of this LP bug's attachments.

If you want to make this LP bug bridgeable again, I suggest you remove unwanted attachments or compress some of the existing ones (several large uncompressed text log files currently exist).

Revision history for this message
Balbir Singh (bsingharora) wrote :

I just posted another patch @ http://<email address hidden>/msg1219903.html, I am testing this patch at the moment.

Revision history for this message
Kalpana S Shetty (kalshett) wrote :

Joseph: Is it possible for you to post a test kernel with Balbir's patches, and place it at the earlier location, http://kernel.ubuntu.com/~jsalisbury/lp1573062/ ?

Revision history for this message
Mike Rushton (leftyfb) wrote :

Attached is a compilation of all the kernel testing we have done.

Revision history for this message
Balbir Singh (bsingharora) wrote :

Thank you for the excellent summary. Questions:

1. Can we get the configurations of the machines?
2. Is the first column the number of times the test ran?
3. I see that 4.4.0-31-generic-50-Ubuntu passed on all machines across several runs; is that true?
4. Did any of the tests result in a system hang? Can we find out from the summary?

Would it be fair to assume that tests run against 4.4.0-31-generic-50-Ubuntu would pass again, and that we should work off of that?

My interest is in 4.4.0-34-generic-53~lp1573062PATCHED. With that, I notice that gulpin saw failures in mmapfork and probably a hang there, for which I posted a scheduler try_to_wake_up fix upstream. Generally binacle sees brk stress test failures; it will be interesting to see its machine configuration and why the test failed.

One observation at my end is that we should reboot between runs, as I think some tests can kill important tasks in the system, and I am not sure the system is guaranteed to carry on operating correctly after, for example, all the tasks killed by the OOM killer are re-spawned.

Revision history for this message
Mike Rushton (leftyfb) wrote :

Sorry, I should have taken out the 4.4.0-31 tests. Those runs were not using stress-ng but our original memory stress test.

tags: removed: kernel-key
Revision history for this message
Balbir Singh (bsingharora) wrote :

Can we please get answers to questions 1, 2 and 4 from comment #128? Also, Kalpana has a request for a new kernel build.

Thanks,

Revision history for this message
Mike Rushton (leftyfb) wrote :

1. Can we get the configurations of the machines.

XML files for all 4 machines with hardware information from the certification process will be attached in the next few comments

2. The first column is the number of times the test ran?

Yes

4. Did any of the tests result in system hang? Can we find out from the summary?

All failed tests failed with a complete system hang. The only way to recover would be to reboot.

Revision history for this message
Mike Rushton (leftyfb) wrote :
Revision history for this message
Mike Rushton (leftyfb) wrote :
Revision history for this message
Mike Rushton (leftyfb) wrote :
Revision history for this message
Joseph Salisbury (jsalisbury) wrote :

Balbir, Kalpana,

I can build the kernel for you. Can you post a list of all the patches you would like included? That way we ensure they are all included.

Revision history for this message
Balbir Singh (bsingharora) wrote :

Can we build the latest 4.8? Maybe we should wait for 4.8-rc7. I've got all the fixes upstream, the latest being 135e8c9250dd5c8c9aae5984fde6f230d0cbfeaf.

Revision history for this message
Joseph Salisbury (jsalisbury) wrote :

Sure, we can wait for 4.8-rc7. It should be available on Monday. I'll post a link to the download location once it's available.

Revision history for this message
Balbir Singh (bsingharora) wrote :

FYI, I think the patch made it into 4.4 stable as well.

Revision history for this message
Joseph Salisbury (jsalisbury) wrote :

Balbir, the v4.8 final kernel is now available and can be downloaded from:
http://kernel.ubuntu.com/~kernel-ppa/mainline/v4.8/

The latest upstream stable 4.4 kernel is also available from:
http://kernel.ubuntu.com/~kernel-ppa/mainline/v4.4.23

Revision history for this message
Mike Rushton (leftyfb) wrote :

output of stress_ng test running on Power 8 Tuleta LPAR

Revision history for this message
Mike Rushton (leftyfb) wrote :

dmesg output while stress_ng test was running on Power 8 Tuleta LPAR

Revision history for this message
Mike Rushton (leftyfb) wrote :

`ps -ax` output while stress_ng test was running on Power 8 Tuleta LPAR

Revision history for this message
Colin Ian King (colin-king) wrote :

We're getting zombies here which aren't being reaped:

130428 ? Z 0:00 [stress-ng-brk] <defunct>
130432 ? Z 0:00 [stress-ng-brk] <defunct>
130434 ? Z 0:00 [stress-ng-brk] <defunct>
130436 ? Z 0:00 [stress-ng-brk] <defunct>

The reason for this is that memory stressors like brk have a parent that forks off a child. The child performs the stressing, and if it gets OOM-killed the parent can spawn off another stressor. So I think the SIGKILL on the stress-ng brk stressor is killing the parent, but the child (which is still holding onto a load of memory on the heap) is not being waited for and hence is left in a memory-hogging zombie state. We may end up in a pathologically memory-hogging state because the zombies may be holding brk regions that were swapped out to disk under memory pressure, leaving us in a low-memory state that is never cleaned up.

I suggest modifying the test bash script as follows:

1. run stress-ng with -k flag (so that all the processes have the same stress-ng name)
2. kill with ALRM first
3. then kill with KILL all the stress-ng processes after a small grace period.
4. report on unkillable stressors

refer to the changes I made to https://launchpadlibrarian.net/296974522/disk_stress_ng

Revision history for this message
bugproxy (bugproxy) wrote : Comment bridged from LTC Bugzilla

------- Comment From <email address hidden> 2016-12-20 22:55 EDT-------
Could the Ubuntu team check if this is still an issue with the 4.8 kernel?

Changed in linux (Ubuntu):
assignee: Joseph Salisbury (jsalisbury) → nobody
Changed in linux (Ubuntu Xenial):
assignee: Joseph Salisbury (jsalisbury) → nobody
Changed in linux (Ubuntu Yakkety):
assignee: Joseph Salisbury (jsalisbury) → nobody
status: In Progress → Incomplete
Changed in linux (Ubuntu Xenial):
status: In Progress → Incomplete
Changed in linux (Ubuntu):
status: In Progress → Incomplete
Revision history for this message
Jeff Lane  (bladernr) wrote :

Ok, so we're now using stress-ng 0.07.21, which resolved a lot of issues, and we have also adopted Colin's suggestions into our wrapper script to change how stress-ng is called and culled.

This seems to resolve the problems that led to this.

Changed in plainbox-provider-checkbox:
status: New → Fix Committed
importance: Undecided → Critical
assignee: nobody → Mike Rushton (leftyfb)
milestone: none → 0.36.0
Jeff Lane  (bladernr)
Changed in plainbox-provider-checkbox:
milestone: 0.36.0 → 0.35.0
Pierre Equoy (pieq)
Changed in plainbox-provider-checkbox:
status: Fix Committed → Fix Released
Revision history for this message
Mike Rushton (leftyfb) wrote :

This issue, or at least similar symptoms, doesn't seem to be fixed. This is with the stock kernel and stock stress-ng, but with the suggested changes to the memory_stress_ng test script. The machine locks up completely in maybe 1 out of every 5-10 runs. See the attached dmesg for errors.

Kernel: 4.4.0-64-generic-85-Ubuntu
stress-ng Version: 0.07.21-1~ppa

Mike Rushton (leftyfb)
Changed in plainbox-provider-checkbox:
status: Fix Released → Confirmed
assignee: Mike Rushton (leftyfb) → nobody
Revision history for this message
Jeff Lane  (bladernr) wrote :

OK, I have a Firestone available, so I'm re-trying this with a cycle of 10 runs to see what happens. Currently using 16.04 w/ HWE-Edge (4.10)

Changed in plainbox-provider-checkbox:
assignee: nobody → Jeff Lane (bladernr)
milestone: 0.35.0 → future
Revision history for this message
Jeff Lane  (bladernr) wrote :

OK... so I tried three runs of the stress-ng memory test (the one that has been used successfully on every other system we certify for 16.04), and so far two of the three runs have resulted in the system locking up and dumping a pile of stack traces to the console.

In both cases, I had to power cycle the system to reboot it into a usable state.

All three attempts were 16.04 w/ hwe-edge (4.10) deployed by MAAS.

System info:

Linux oil-entei 4.10.0-20-generic #22~16.04.1-Ubuntu SMP Thu Apr 20 10:30:58 UTC 2017 ppc64le ppc64le ppc64le GNU/Linux

stress-ng:
  Installed: 0.07.21-1~ppa
(We keep an updated version of stress-ng in the cert PPA, it's not modified from Colin's code)

I've attached logs (kernel and syslog, other info) in a tarball to this.

Revision history for this message
Jeff Lane  (bladernr) wrote :

Resetting all the tasks to Confirmed rather than Incomplete, as I've tested and am still waiting for a resolution.

Changed in linux (Ubuntu):
status: Incomplete → Confirmed
Changed in linux (Ubuntu Xenial):
status: Incomplete → Confirmed
Changed in linux (Ubuntu Yakkety):
status: Incomplete → Confirmed
Revision history for this message
Andy Whitcroft (apw) wrote : Closing unsupported series nomination.

This bug was nominated against a series that is no longer supported, ie yakkety. The bug task representing the yakkety nomination is being closed as Won't Fix.

This change has been made by an automated script, maintained by the Ubuntu Kernel Team.

Changed in linux (Ubuntu Yakkety):
status: Confirmed → Won't Fix
Jeff Lane  (bladernr)
Changed in plainbox-provider-checkbox:
status: Confirmed → Fix Committed
Changed in plainbox-provider-checkbox:
milestone: future → 0.40.0
Changed in plainbox-provider-checkbox:
status: Fix Committed → Fix Released
Brad Figg (brad-figg)
tags: added: cscc
Jeff Lane  (bladernr)
tags: removed: blocks-hwcert-server