Heavy I/O on Sandybridge systems with small memory footprints causes system hangs

Bug #755066 reported by Kent Baxley
This bug affects 5 people
Affects               Status        Importance  Assigned to     Milestone
OEM Priority Project  Fix Released  Critical    Chris Van Hoof
linux (Ubuntu)        Fix Released  Critical    Colin Ian King
Natty                 Fix Released  Critical    Colin Ian King

Bug Description

System hangs have been observed when performing heavy I/O on Sandybridge systems with small memory footprints (less than 4GB of memory in this case).

The issue was first discovered when running a customized Natty installer from an OEM. The installer first copies all the files to a partition on the hard drive, and the system is then installed from the hard disk. The systems hang during the file copy phase. The hang is either indefinite, or the system recovers and completes the installation after several minutes.

To reproduce this behavior with a vanilla Natty AMD64 installer image, you will need a Sandybridge system with UMA graphics and preferably only 1GB of memory installed. 1GB seems to reproduce it most frequently.

1) Boot the image in single user mode
2) Run the attached script

It sometimes doesn't happen on the first try, but it usually happens by the second or third try (a telltale sign is that the HDD indicator freezes).

The script wraps the ubiquity file copy routine, which is very I/O intensive because everything is md5'ed as it's copied. It first copies from the USB stick to a 2G partition, and then from that 2G partition to another partition.
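
As a rough illustration of why this workload is so I/O heavy, here is a minimal Python sketch of a copy that hashes every file while writing it. It is not the ubiquity routine or the attached script; `copy_tree_with_md5` and its parameters are hypothetical names made up for this example.

```python
import hashlib
import os


def copy_tree_with_md5(src_root, dst_root, chunk_size=1 << 20):
    """Copy every regular file under src_root to dst_root, hashing as we go.

    Hypothetical helper (not the ubiquity code). Returns a dict mapping each
    relative path to the MD5 hex digest of the bytes that were copied.
    """
    digests = {}
    for dirpath, _dirnames, filenames in os.walk(src_root):
        rel_dir = os.path.relpath(dirpath, src_root)
        target_dir = os.path.normpath(os.path.join(dst_root, rel_dir))
        os.makedirs(target_dir, exist_ok=True)
        for name in filenames:
            md5 = hashlib.md5()
            src_path = os.path.join(dirpath, name)
            dst_path = os.path.join(target_dir, name)
            with open(src_path, "rb") as src, open(dst_path, "wb") as dst:
                # Read in chunks so large files never sit fully in memory.
                for chunk in iter(lambda: src.read(chunk_size), b""):
                    md5.update(chunk)
                    dst.write(chunk)
            rel_path = os.path.normpath(os.path.join(rel_dir, name))
            digests[rel_path] = md5.hexdigest()
    return digests
```

Every byte is read, hashed and written once, so the page cache fills quickly on a machine with little memory.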

Tony Espy (awe)
Changed in oem-priority:
importance: Undecided → Critical
Chris Van Hoof (vanhoof)
tags: added: hwe-blocker
Changed in oem-priority:
assignee: nobody → Canonical Platform QA Team (canonical-platform-qa)
Chris Van Hoof (vanhoof)
Changed in linux (Ubuntu):
assignee: nobody → Chris Van Hoof (vanhoof)
importance: Undecided → Critical
Chris Van Hoof (vanhoof) wrote :

I did some testing on this issue, and initially thought I had a culprit with mkfs.ext4:

http://people.canonical.com/~vanhoof/lazy_itable_init_testing_plus_dirty/

However it has been proven to occur even with a more generic file copy operation.

I was able to grab some perf top results today during an instance of this lockup:

http://people.canonical.com/~vanhoof/lazy_itable_init_testing_plus_dirty/perftop/

These results show a large spike in shrink_slab when the hang hits.

Andy Whitcroft (apw) wrote :

@Chris --- I wonder if we are getting an OOM or similar. It would be great if we could examine the dmesg when this is occurring. Perhaps configure a network console, or serial, and see if we capture anything.

Seth Forshee (sforshee) wrote :

@Chris, might also be useful to see /proc/meminfo and /proc/slabinfo in that state to get an idea of where all the memory is being used.
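
One possible way to capture that state repeatedly is sketched below in Python. The helper names are hypothetical. It only handles /proc/meminfo, since /proc/slabinfo has a different layout and is usually readable only by root.

```python
def parse_meminfo(text):
    """Parse /proc/meminfo-style text into {field name: integer value}.

    Values are in kB for most fields; a few (such as HugePages_Total) are
    plain counts. Lines without a "name: value" shape are skipped.
    """
    fields = {}
    for line in text.splitlines():
        name, sep, rest = line.partition(":")
        if not sep or not rest.strip():
            continue
        fields[name.strip()] = int(rest.split()[0])
    return fields


def snapshot(path="/proc/meminfo"):
    """Read and parse the current memory counters (Linux only)."""
    with open(path) as handle:
        return parse_meminfo(handle.read())
```

Logging `snapshot()` every few seconds during the copy, and comparing fields such as `MemFree`, `Cached` and `Slab` between samples, would show where memory goes as the hang approaches.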

Chris Van Hoof (vanhoof) wrote :

@Seth here ya go:

Full set of various commands, top, vmstat, iostat, mpstat, slabinfo, meminfo etc from the start of a test, to the point where I hang:

http://people.canonical.com/~vanhoof/lazy_itable_init_testing_plus_dirty/debug/

--chris

Seth Forshee (sforshee) wrote :

Sorry, I don't think that information really tells us anything new. Lots of memory used for the page cache, which isn't surprising. slabinfo does show a large increase in the size of the buffer_head slab at the end, which indicates a lot of stuff waiting to undergo I/O, and again not surprising.

Well, it was worth a try. I've seen systems that ground to a halt because of an in-kernel memory leak or similar creating enormous system-wide memory pressure. But that doesn't seem to be what's happening here unless something went horribly wrong after the logs stop.

Chris Van Hoof (vanhoof) wrote :

An update from today's testing:

We went through the test case today, and began testing older kernel releases to see if we could find a bisection point. These tests were run on three different SandyBridge machines with 1GB, 2GB, and 3GB of memory, all using a dual-core chip.

In every case where there was a failure, we hung in fewer than 5 iterations of the reproducer.
In every case where there was success, we went beyond 20 (even 35) iterations.

Using the Kernel Team's mainline builds from here:
 - http://kernel.ubuntu.com/~kernel-ppa/mainline/

Working: v2.6.37-natty & v2.6.37.6-natty
Failing: v2.6.38-rc1-natty

We've even seen the same failure up to 2.6.39-rc3

There are ~8000 patches between 2.6.37 and 2.6.38-rc1, requiring about 13 bisection steps from what I gather.
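
The figure of 13 matches a simple worst-case estimate: each bisection step halves the candidate range, so about log2(8000) steps are needed. A small Python check follows, assuming a linear history (git's own estimate on a branching history can differ slightly); `bisect_steps` is a hypothetical helper name.

```python
def bisect_steps(commit_count):
    """Worst-case number of halvings needed to isolate one commit.

    Equivalent to ceil(log2(commit_count)), computed with integers so there
    is no floating-point rounding at exact powers of two.
    """
    if commit_count < 1:
        raise ValueError("commit_count must be at least 1")
    return (commit_count - 1).bit_length()


print(bisect_steps(8000))  # prints 13, since 2**12 = 4096 < 8000 <= 8192 = 2**13
```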

Jeremy Foshee (jeremyfoshee) wrote :

changed state to In Progress so it doesn't scare me every time I look at the hot list. :)

~JFo

Changed in linux (Ubuntu):
status: New → In Progress
Anthony Batchelor (toeknee) wrote :

I get the same problem, but not when installing. Copying a large number of files, or importing a lot of photos into shotwell, also causes the system to hang.

I have tried several of the mainline kernels, including:
 * http://kernel.ubuntu.com/~kernel-ppa/mainline/daily/2011-04-13-natty/
 * http://kernel.ubuntu.com/~kernel-ppa/mainline/v2.6.39-rc4-natty/

These kernels do not hang.

The machine is a Dell 15R with 3G of RAM : http://www.dell.com/uk/p/inspiron-r/pd?p=inspiron-r&view=pdetails

Let me know if I can give any more information to help.

Luis Carlos Cobo (luiscarlos) wrote :

I could install too, and copy a large amount of data until I got the first failure. From then, failures were frequent and easily reproducible (an rsync would fail in less than 5 minutes), giving me "general protection fault".

However, I have been unable to reinstall from scratch again: the installation now fails at different stages, both with Natty beta 2 and with Maverick. So right now I am inclined to think my machine might be botched. It was a brand new Thinkpad t420s.

Luis Carlos Cobo (luiscarlos) wrote :

Do your systems install Maverick fine?

As I said, I was able to install Natty beta 2 and transfer lots of data (around 70GB) before it cracked. But now I am unable to reinstall Natty, Maverick or even Windows 7 retail (unfortunately I do not have a recovery CD). This was on a 5 hour old Thinkpad t420s. Tech support said they did not support Linux (obviously) and that it was normal that Windows 7 cracked because it was the retail (not the OEM) version, but I think the guy just thought I did not know what I was doing.

Chris Van Hoof (vanhoof)
Changed in linux (Ubuntu):
assignee: Chris Van Hoof (vanhoof) → Colin King (colin-king)
Anton¡o Sch¡fano (skiantoz) wrote :

Hello,

I am experiencing a very similar problem with my laptop (AMD fusion based, HP Pavilion dm1-3101 - 4GB RAM, CPU AMD E350 dual core), a fresh Natty installation with updates and kernel 2.6.38-9-generic amd64.
When trying to rsync a few GB directories from my main PC to the laptop, after a while (random) the laptop hangs. Nothing is displayed on screen and I cannot find any trace in the logs.

After some hours (sic!) spent on this problem, I have noticed that:
- The system hangs when free memory gets low (monitored with 'top' on a terminal).
- As free memory decreases, kswapd0 starts consuming more and more CPU time, until it reaches 60%-80%. Swap is not used however, and I was able to reproduce the hang even with swap off.
- As a workaround, I eventually found that dropping the disk caches when kswapd0 starts misbehaving (echo 1 >/proc/sys/vm/drop_caches) prevents the kernel from freezing and, thanks to that, I could finally complete the rsync.
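
For anyone who wants to automate that workaround during a long copy, here is a rough Python sketch. It triggers on a `MemFree` threshold rather than on kswapd0's CPU usage, which is a simplifying assumption of this example; the function names and the threshold are hypothetical, and writing to drop_caches requires root.

```python
import time


def mem_free_kb(meminfo_path="/proc/meminfo"):
    """Return the MemFree value (in kB) from a meminfo-style file."""
    with open(meminfo_path) as handle:
        for line in handle:
            if line.startswith("MemFree:"):
                return int(line.split()[1])
    raise RuntimeError("MemFree not found in " + meminfo_path)


def drop_page_cache(drop_path="/proc/sys/vm/drop_caches"):
    """Same effect as `echo 1 > /proc/sys/vm/drop_caches`; needs root."""
    with open(drop_path, "w") as handle:
        handle.write("1\n")


def watch(threshold_kb, interval_s=5.0, rounds=None,
          meminfo_path="/proc/meminfo",
          drop_path="/proc/sys/vm/drop_caches"):
    """Poll MemFree and drop the page cache whenever it falls too low.

    rounds=None polls forever; returns how many times the cache was dropped.
    """
    drops = 0
    polled = 0
    while rounds is None or polled < rounds:
        if mem_free_kb(meminfo_path) < threshold_kb:
            drop_page_cache(drop_path)
            drops += 1
        polled += 1
        time.sleep(interval_s)
    return drops
```

Run as root alongside the rsync, something like `watch(threshold_kb=50000)` would mimic the manual step; the right threshold is a guess and would need tuning per machine.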

Do you think this is the same problem, or should I open a new bug?

Laurens Bosscher (laurens-laurensbosscher) wrote :

I have the same problem on a sandybridge XPS15 laptop with 4 GB ram. It almost exclusively happens when Dropbox is synchronizing files and it's very annoying.

So I guess it's not just limited to devices with a low amount of memory.

Colin Ian King (colin-king) wrote :

Update: I worked with upstream on a bunch of the patches for this issue and the good news is that we have two fixes that hit GregKH's stable 2.6.38.8 tree a few days ago:

http://git.kernel.org/?p=linux/kernel/git/stable/linux-2.6.38.y.git;a=commit;h=2020aa625c559d371518040290b5476356e7aacf
http://git.kernel.org/?p=linux/kernel/git/stable/linux-2.6.38.y.git;a=commit;h=28663b64187171a869bf991b20e3dc24f88067d4

so hopefully these will be picked up on the Natty stable updates at some point in the future (a few weeks).

Colin Ian King (colin-king) wrote :

Just to add, kudos to Minchan Kim, Johannes Weiner and Mel Gorman, to name but a few, for working on this issue.

Chris Van Hoof (vanhoof)
Changed in oem-priority:
assignee: Canonical Platform QA Team (canonical-platform-qa) → Chris Van Hoof (vanhoof)
status: New → Confirmed
Colin Ian King (colin-king) wrote :

SRU Request:

System hangs have been observed when performing heavy I/O on
sandybridge systems with small memory footprints (for example,
less than 2GB of memory). kswapd consumes all the CPU and the
machine effectively becomes unusable because kswapd is missing
every cond_resched(). Also, we need to invert the logic in
commit 1741c877 ("mm: kswapd: keep kswapd awake for high-order
allocations until a percentage of the node is balanced") to
allow kswapd to go to sleep when balanced for high orders.

Testing involved repeatedly copying ~600MB of files from an install
image on a laptop with 2GB of memory. A 2.5 hour (150 iteration)
soak test cannot trip the hang with these patches, whereas
without them the bug occurs in the first 5 to 30 iterations.

Test Case:

600MB of files from Natty ISO image copied to a 1GB ext4 partition on /dev/sda3 and copied via the attached bash
script to an ext4 partition on /dev/sda4. This script contains elements of the original ubiquity installer Python script
that originally tripped the hang.

Without the patches, the system hangs after 5 to 30 iterations: kswapd chews up CPU and the machine becomes unusable. With the
patches, the script can run for hours and hundreds of copy iterations.
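
The iteration loop of such a soak test could look roughly like the Python sketch below. This is not the attached bash script; `soak`, its parameters and the 60-second stall threshold are assumptions made for illustration. A hard hang cannot be reported from the hung machine itself, so the sketch only flags iterations that stall and then recover.

```python
import shutil
import time


def soak(src_dir, dst_dir, iterations, slow_after_s=60.0):
    """Copy src_dir to dst_dir repeatedly and report suspiciously slow runs.

    Returns a list of (iteration number, elapsed seconds) for every copy
    that took longer than slow_after_s.
    """
    slow = []
    for iteration in range(1, iterations + 1):
        # Start each run from an empty destination, as a fresh copy would.
        shutil.rmtree(dst_dir, ignore_errors=True)
        started = time.monotonic()
        shutil.copytree(src_dir, dst_dir)
        elapsed = time.monotonic() - started
        if elapsed > slow_after_s:
            slow.append((iteration, elapsed))
    return slow
```

For a setup like the one described, `src_dir` and `dst_dir` would be mount points of the two ext4 partitions, with `iterations` set to 150 or more.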

summary: - heavy i/o on sandybridge systems with small memory footprints causes
+ Heavy I/O on Sandybridge systems with small memory footprints causes
system hangs
Tim Gardner (timg-tpi)
Changed in linux (Ubuntu):
status: In Progress → Fix Released
Changed in linux (Ubuntu Natty):
assignee: nobody → Colin King (colin-king)
status: New → Fix Committed
Chris Van Hoof (vanhoof)
Changed in oem-priority:
status: Confirmed → Fix Committed
Chris Van Hoof (vanhoof)
Changed in linux (Ubuntu Natty):
importance: Undecided → Critical
Herton R. Krzesinski (herton) wrote :

This bug is awaiting verification that the kernel in -proposed solves the problem. Please test the kernel and update this bug with the results. If the problem is solved, change the tag 'verification-needed-natty' to 'verification-done-natty'.

If verification is not done by one week from today, this fix will be dropped from the source code, and this bug will be closed.

See https://wiki.ubuntu.com/Testing/EnableProposed for documentation on how to enable and use -proposed. Thank you!

tags: added: verification-needed-natty
Colin Ian King (colin-king) wrote :

Just to make things a little more confusing, the patches in the bug got replaced by a better fix and these are being tracked in bug 808509 namely: "SRU: Stop kswapd consuming 100% CPU when highest zone is small".

I've tested the -proposed kernel and it addresses this bug (since it's essentially a manifestation of the same root bug). Testing was as follows:

~4 hours of running a script (see https://bugs.launchpad.net/ubuntu/+source/linux/+bug/808509/comments/7 ) giving 500 iterations of a copy loop that copies ~795MB of files from one ext4 partition to another. Exhaustive test passed. Without the fix the test would fail after a few tens of iterations. So I will mark this as verified.

tags: added: natty-verification-done
removed: verification-needed-natty
tags: added: verification-done-natty
removed: natty-verification-done
Launchpad Janitor (janitor) wrote :

This bug was fixed in the package linux - 2.6.38-11.48

---------------
linux (2.6.38-11.48) natty-proposed; urgency=low

  [Herton R. Krzesinski]

  * Release Tracking Bug
    - LP: #818175

  [ Upstream Kernel Changes ]

  * Revert "HID: magicmouse: ignore 'ivalid report id' while switching
    modes"
    - LP: #814250

linux (2.6.38-11.47) natty-proposed; urgency=low

  [Steve Conklin]

  * Release Tracking Bug
    - LP: #811180

  [ Keng-Yu Lin ]

  * SAUCE: Revert: "dell-laptop: Toggle the unsupported hardware
    killswitch"
    - LP: #775281

  [ Ming Lei ]

  * SAUCE: fix yama_ptracer_del lockdep warning
    - LP: #791019

  [ Stefan Bader ]

  * SAUCE: Re-enable RODATA for i386 virtual
    - LP: #809838

  [ Tim Gardner ]

  * [Config] Add grub-efi as a recommended bootloader for server and
    generic
    - LP: #800910
  * SAUCE: rtl8192se: Force a build for a 2.6/3.0 kernel
    - LP: #805494

  [ Upstream Kernel Changes ]

  * Revert "bridge: Forward reserved group addresses if !STP"
    - LP: #793702
  * Fix up ABI directory
  * bonding: Incorrect TX queue offset, CVE-2011-1581
    - LP: #792312
    - CVE-2011-1581
  * fs/partitions/efi.c: corrupted GUID partition tables can cause kernel
    oops
    - LP: #795418
    - CVE-2011-1577
  * usbnet/cdc_ncm: add missing .reset_resume hook
    - LP: #793892
  * ath5k: Disable fast channel switching by default
    - LP: #767192
  * mm: vmscan: correctly check if reclaimer should schedule during
    shrink_slab
    - LP: #755066
  * mm: vmscan: correct use of pgdat_balanced in sleeping_prematurely
    - LP: #755066
  * ALSA: hda - Use LPIB for ATI/AMD chipsets as default
    - LP: #741825
  * ALSA: hda - Enable snoop bit for AMD controllers
    - LP: #741825
  * ALSA: hda - Enable sync_write workaround for AMD generically
    - LP: #741825
  * cpuidle: menu: fixed wrapping timers at 4.294 seconds
    - LP: #774947
  * drm/i915: Fix gen6 (SNB) missed BLT ring interrupts.
    - LP: #761065
  * USB: ehci: remove structure packing from ehci_def
    - LP: #791552
  * drm/i915: disable PCH ports if needed when disabling a CRTC
    - LP: #791752
  * kmemleak: Do not return a pointer to an object that kmemleak did not
    get
    - LP: #793702
  * kmemleak: Initialise kmemleak after debug_objects_mem_init()
    - LP: #793702
  * Fix _OSC UUID in pcc-cpufreq
    - LP: #793702
  * CPU hotplug, re-create sysfs directory and symlinks
    - LP: #793702
  * Fix memory leak in cpufreq_stat
    - LP: #793702
  * net: recvmmsg: Strip MSG_WAITFORONE when calling recvmsg
    - LP: #793702
  * ftrace: Only update the function code on write to filter files
    - LP: #793702
  * qla2xxx: Fix hang during driver unload when vport is active.
    - LP: #793702
  * qla2xxx: Fix virtual port failing to login after chip reset.
    - LP: #793702
  * qla2xxx: Fix vport delete hang when logins are outstanding.
    - LP: #793702
  * powerpc/kdump64: Don't reference freed memory as pacas
    - LP: #793702
  * powerpc/kexec: Fix memory corruption from unallocated slaves
    - LP: #793702
  * x86, cpufeature: Fix cpuid leaf 7 feature detection
    - LP: #793702
  * ath9k_hw: do noise floor calibration only on required chain...

Changed in linux (Ubuntu Natty):
status: Fix Committed → Fix Released
Chris Van Hoof (vanhoof)
Changed in oem-priority:
status: Fix Committed → Fix Released