transparent hugepages and thrashing on amd64

Bug #1013807 reported by garyr
This bug affects 1 person
Affects: linux (Ubuntu)
Status: Expired
Importance: Medium
Assigned to: Unassigned

Bug Description

I seem to have found a solution to a severe thrashing/swapping/freezing problem that I've been having for months now. I guess the real question is - should I turn it into a bug report and what would be useful data to include if so.

This is a quad-core AMD Phenom system with 4 GB of RAM, dual monitors and a single 1 TB WD Caviar Black HD. It had been behaving normally until something broke sometime late in the 11.x release cycle, and the problem continues in the current 12.04 LTS. The symptoms: running a moderate load of apps (Firefox with ~8 tabs, a terminal or 2, and AisleRiot solitaire, for example) leads to system freezes where the entire UI becomes totally unresponsive for anywhere from 20 seconds to 5 minutes, with solid disk activity. Trying to figure out what was going on via iotop and top shows jbd2 and kswapd accounting for the largest load, but since iotop freezes like everything else I can't tell what's going on during the worst storms. Googling around shows a fair number of other people with similar problems, most of them with multi-core amd64 systems.

The other day I spotted this report on opensuse that looked similar but not identical:

http://lists.opensuse.org/opensuse/2012-03/msg00657.html

I booted with the grub parameter transparent_hugepage=never yesterday and the problem went away and hasn't come back. I've stressed the system by running a bunch of flash/java tabs in Firefox, running a large java based stock app (ThinkorSwim) in another workspace and playing a 1080p 60fps movie in a third workspace. This certainly causes swapping, but not freezing or stumbling. It actually did a bit of swapping a minute ago while I was typing and it managed to make Pandora radio stumble for a moment, but that's orders of magnitude better than it has been.

I think there may be a fundamental problem with how transparent hugepages are handled with some AMD CPUs. I think this problem started when this feature was implemented and enabled by default. The manpage for madvise() says this was added in 2.6.38, but I don't know if it was enabled by default at that point.
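
Whether transparent hugepages are actually enabled on a given boot can be read back from sysfs: the kernel exposes the setting in /sys/kernel/mm/transparent_hugepage/enabled and marks the active choice with square brackets (for example "always [madvise] never"). A minimal sketch of reading it follows; the function name thp_mode is made up for illustration, and the optional path argument exists only so the parsing can be tried on a sample file.

```shell
# Print the active transparent hugepage mode (always, madvise or never).
# $1: file to read; defaults to the kernel's THP control file.
# thp_mode is a hypothetical helper name, not an existing tool.
thp_mode() {
    file="${1:-/sys/kernel/mm/transparent_hugepage/enabled}"
    # The active choice is the word inside [brackets]; print only that word.
    sed -n 's/.*\[\([^]]*\)].*/\1/p' "$file"
}

# Usage on a running system:
#   thp_mode
```

After booting with transparent_hugepage=never, this should report "never"; if it still reports something else, the boot parameter did not take effect.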

Here's a partial list of things that haven't worked well in the past:

Playing with the swappiness value: setting swappiness to very low values makes the problem take longer to surface, but (unsurprisingly) makes it even worse once it does.

swapoff -a ; swapon -a: this makes it go away for a while. A potentially interesting thing is that as soon as I can get the system to act on the swapoff -a, the system becomes responsive again. It pegs one CPU core at 100% and the HD grinds like crazy, but it stops freezing right away.

Moving swap from the HD to a USB thumb drive: Obviously I didn't expect that to be faster but wanted to see if segregating swap to a different device on a different bus would make it swap more smoothly - it didn't.

Playing with nice and ionice priorities for jbd2 and kswapd: the fact that running these processes at a lower priority than anything else on the system makes no difference leads me to think they were just symptoms and not the root of the problem.
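
Since iotop and top freeze along with everything else, one lightweight way to put numbers on the swapping is to sample the pswpin and pswpout counters in /proc/vmstat (pages swapped in and out since boot) before and after a test. This is only a sketch: the helper names swap_counters and swap_delta are invented here, and the file argument is there so the parsing can be tried on a sample file.

```shell
# Print pages swapped in and out since boot as "in out".
# $1: vmstat-style file; defaults to /proc/vmstat.
# swap_counters and swap_delta are hypothetical helper names.
swap_counters() {
    awk 'BEGIN {i = 0; o = 0}
         $1 == "pswpin"  {i = $2}
         $1 == "pswpout" {o = $2}
         END {print i, o}' "${1:-/proc/vmstat}"
}

# Print how many pages were swapped in and out between two samples.
# $1: earlier output of swap_counters, $2: later output.
swap_delta() {
    set -- $1 $2    # split the two samples into four numbers
    echo "$(( $3 - $1 )) $(( $4 - $2 ))"
}

# Usage on a running system:
#   before=$(swap_counters); sleep 10; after=$(swap_counters)
#   swap_delta "$before" "$after"
```

Comparing the deltas with and without transparent_hugepage=never under the same load would show whether the amount of swapping itself changes, or only how badly the system stalls while it happens.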

I think this may be the tip of the iceberg and there may be a lot of others having this problem. Looking around I see a fair number of reports, most of them unsolved. Some may have been fixed by just adding enough RAM that dirty hugepages just don't collect. Some may have been fixed by changing filesystems: ext4 seems like something a lot of people with this problem have in common.

Workaround:
hold down the spacebar during boot in order to bring up the grub menu, edit the command line and add
transparent_hugepage=never

If this fixes the problem you can make it permanent by editing /etc/default/grub, adding transparent_hugepage=never to the GRUB_CMDLINE_LINUX_DEFAULT line, and then running update-grub.
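
The permanent edit can be sketched as a small script. This assumes the GRUB_CMDLINE_LINUX_DEFAULT value is wrapped in double quotes (as in a stock Ubuntu /etc/default/grub), that GNU sed is available, and that the parameter contains no "/" or "&" characters; the function name add_grub_param is made up for illustration.

```shell
# Append a kernel parameter to GRUB_CMDLINE_LINUX_DEFAULT if it is not there yet.
# $1: parameter, e.g. transparent_hugepage=never
# $2: config file; defaults to /etc/default/grub
# add_grub_param is a hypothetical helper name.
add_grub_param() {
    param="$1"
    file="${2:-/etc/default/grub}"
    # Do nothing if the parameter is already on the line.
    if grep '^GRUB_CMDLINE_LINUX_DEFAULT=' "$file" | grep -qF -- "$param"; then
        return 0
    fi
    cp "$file" "$file.bak"    # keep a backup of the original file
    # Insert the parameter just before the closing double quote.
    sed -i "s/^\(GRUB_CMDLINE_LINUX_DEFAULT=\"[^\"]*\)\"/\1 $param\"/" "$file"
}

# Usage (as root), followed by regenerating the grub configuration:
#   add_grub_param transparent_hugepage=never && update-grub
```

The backup copy makes it easy to revert if the machine later gets a kernel where the workaround is no longer needed.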

Problems with this workaround:
1) transparent hugepages should work. This may cause a small performance hit in some situations and a larger hit in others.
2) If you do this you will probably never know when or if it actually gets fixed.

PS: Lars Müller (Samba Team, SUSE Linux, Nürnberg, Germany) is looking for bugzilla reports on this too.
---
ApportVersion: 2.19.1-0ubuntu5
Architecture: amd64
AudioDevicesInUse:
 USER PID ACCESS COMMAND
 /dev/snd/controlC0: garyrich 2339 F.... pulseaudio
 /dev/snd/controlC1: garyrich 2339 F.... pulseaudio
CurrentDesktop: Unity
DistroRelease: Ubuntu 15.10
EcryptfsInUse: Yes
HibernationDevice: RESUME=UUID=5b22d42d-33c8-435c-bea0-c1fbba7f88bf
InstallationDate: Installed on 2010-01-22 (2149 days ago)
InstallationMedia: Ubuntu 9.10 "Karmic Koala" - Release amd64 (20091027)
IwConfig:
 eth0 no wireless extensions.

 lo no wireless extensions.
MachineType: System manufacturer System Product Name
Package: linux (not installed)
ProcFB: 0 radeondrmfb
ProcKernelCmdLine: BOOT_IMAGE=/boot/vmlinuz-4.2.0-19-generic root=UUID=ab3cda85-2125-40b3-a1d0-f27222cc9ff6 ro quiet splash elevator=cfq vt.handoff=7
ProcVersionSignature: Ubuntu 4.2.0-19.23-generic 4.2.6
RelatedPackageVersions:
 linux-restricted-modules-4.2.0-19-generic N/A
 linux-backports-modules-4.2.0-19-generic N/A
 linux-firmware 1.149.3
RfKill:

Tags: wily
Uname: Linux 4.2.0-19-generic x86_64
UpgradeStatus: No upgrade log present (probably fresh install)
UserGroups: adm admin audio cdrom debian-tor dialout dip fax fuse games lpadmin messagebus netdev plugdev sambashare ssh staff syslog tape users video
_MarkForUpload: True
dmi.bios.date: 04/13/2011
dmi.bios.vendor: American Megatrends Inc.
dmi.bios.version: 3503
dmi.board.asset.tag: To Be Filled By O.E.M.
dmi.board.name: M4A78T-E
dmi.board.vendor: ASUSTeK Computer INC.
dmi.board.version: Rev 1.xx
dmi.chassis.asset.tag: Asset-1234567890
dmi.chassis.type: 3
dmi.chassis.vendor: Chassis Manufacture
dmi.chassis.version: Chassis Version
dmi.modalias: dmi:bvnAmericanMegatrendsInc.:bvr3503:bd04/13/2011:svnSystemmanufacturer:pnSystemProductName:pvrSystemVersion:rvnASUSTeKComputerINC.:rnM4A78T-E:rvrRev1.xx:cvnChassisManufacture:ct3:cvrChassisVersion:
dmi.product.name: System Product Name
dmi.product.version: System Version
dmi.sys.vendor: System manufacturer

Revision history for this message
Ubuntu Foundations Team Bug Bot (crichton) wrote :

Thank you for taking the time to report this bug and helping to make Ubuntu better. It seems that your bug report is not filed about a specific source package though, rather it is just filed against Ubuntu in general. It is important that bug reports be filed about source packages so that people interested in the package can find the bugs about it. You can find some hints about determining what package your bug might be about at https://wiki.ubuntu.com/Bugs/FindRightPackage. You might also ask for help in the #ubuntu-bugs irc channel on Freenode.

To change the source package that this bug is filed about visit https://bugs.launchpad.net/ubuntu/+bug/1013807/+editstatus and add the package name in the text box next to the word Package.

[This is an automated message. I apologize if it reached you inappropriately; please just reply to this message indicating so.]

tags: added: bot-comment
garyr (garyrich)
affects: ubuntu → libhugetlbfs (Ubuntu)
Revision history for this message
garyr (garyrich) wrote :

Thank you "Bug Bot". Though I knew that logging this against Ubuntu generically was incorrect, it was not obvious how to change it when making a question=>bug report (metabug?). In reality I doubt it even has much to do with the distro. It looks like people are having similar problems in SUSE, Arch, Mandriva and Red Hat.

Revision history for this message
Geoffrey Thomas (geofft) wrote :

Since you seem to be using transparent hugepages and not libhugetlbfs (LD_PRELOAD=libhugetlbfs.so), I'm going to reassign this bug to the Linux kernel package.

(It's also entirely possible this bug got solved in the last 3 years, to be fair)

affects: libhugetlbfs (Ubuntu) → linux (Ubuntu)
Revision history for this message
Brad Figg (brad-figg) wrote : Missing required logs.

This bug is missing log files that will aid in diagnosing the problem. From a terminal window please run:

apport-collect 1013807

and then change the status of the bug to 'Confirmed'.

If, due to the nature of the issue you have encountered, you are unable to run this command, please add a comment stating that fact and change the bug status to 'Confirmed'.

This change has been made by an automated script, maintained by the Ubuntu Kernel Team.

Changed in linux (Ubuntu):
status: New → Incomplete
Revision history for this message
garyr (garyrich) wrote : AlsaInfo.txt

apport information

tags: added: apport-collected wily
description: updated
Revision history for this message
garyr (garyrich) wrote : CRDA.txt, CurrentDmesg.txt, JournalErrors.txt, Lspci.txt, Lsusb.txt, ProcCpuinfo.txt, ProcEnviron.txt, ProcInterrupts.txt, ProcModules.txt, PulseList.txt, UdevDb.txt, UdevLog.txt, WifiSyslog.txt

apport information (one attachment per comment)

Revision history for this message
garyr (garyrich) wrote :

What may be useful to a 3 year old bug report other than the log files? The problem still exists on that computer. I added zram swap shortly after this bug report and it helped some. Some. I'm using hugepages currently and there are times when it's easier to walk away and do something else for 10 minutes while some I/O intensive cron job like mlocate (since uninstalled) thrashes. Obscuring the problem is that some apt update set the scheduler to "deadline" rather than "cfq", so ionice values don't work. Forcing it back to "cfq" helps some: at least the entire user interface isn't locked up for half an hour.
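
To see which scheduler each disk is actually using after an update, the per-device file /sys/block/<device>/queue/scheduler lists the available schedulers with the active one in square brackets (for example "noop deadline [cfq]"). The following is a rough sketch; the function name list_schedulers is invented for illustration, and the directory argument exists only so it can be tried against a sample tree.

```shell
# Print "<device> <active scheduler>" for every block device.
# $1: sysfs block directory; defaults to /sys/block.
# list_schedulers is a hypothetical helper name.
list_schedulers() {
    root="${1:-/sys/block}"
    for f in "$root"/*/queue/scheduler; do
        [ -r "$f" ] || continue
        dev=${f#"$root"/}    # drop the leading directory
        dev=${dev%%/*}       # keep only the device name
        # The active scheduler is the word inside [brackets].
        active=$(sed -n 's/.*\[\([^]]*\)].*/\1/p' "$f")
        echo "$dev ${active:-none}"
    done
}

# Usage on a running system:
#   list_schedulers
```

On kernels that still ship cfq, writing the name into the same file as root (echo cfq > /sys/block/sda/queue/scheduler, with sda standing in for the actual disk) switches it at runtime, and the elevator=cfq boot parameter seen in ProcKernelCmdLine above sets it at boot.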

Revision history for this message
Joseph Salisbury (jsalisbury) wrote :

Would it be possible for you to test the latest upstream kernel? Refer to https://wiki.ubuntu.com/KernelMainlineBuilds . Please test the latest v4.4 kernel[0].

If this bug is fixed in the mainline kernel, please add the following tag 'kernel-fixed-upstream'.

If the mainline kernel does not fix this bug, please add the tag: 'kernel-bug-exists-upstream'.

Once testing of the upstream kernel is complete, please mark this bug as "Confirmed".

Thanks in advance.

[0] http://kernel.ubuntu.com/~kernel-ppa/mainline/v4.4-rc5-wily

Changed in linux (Ubuntu):
importance: Undecided → Medium
Revision history for this message
garyr (garyrich) wrote :

Installed the low latency version (since that's what I use) of 4.4-rc5. It may be a few days before I can confirm or deny that anything changed.

Revision history for this message
garyr (garyrich) wrote :

Using
4.4.0-040400rc5-lowlatency #201512140221 SMP PREEMPT
kernel has made a dramatic change to this computer. So much so that I can't think of a test that would restrict it to the hugepage problem. Even under heavy memory load it hits the disk drive far far less. Something has certainly been fixed. Loading memory to 90% and then pointing Firefox to www.netflix.com (as of 12/18/2015 a huge resource hog) still causes a stumble. Gnome system monitor (for instance) still goes grey and stops updating and the mouse refuses to move from one display to the second one -- but it only does that for a few seconds and then back to normal.

Something has certainly been fixed and I'm going to tag it fixed upstream. I'm not 100% sure what has been fixed though.
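
One way to narrow a test down to hugepage behaviour, offered only as a sketch, is to compare the kernel's own THP statistics before and after a stress run: /proc/vmstat carries counters whose names start with thp_, and /proc/meminfo reports AnonHugePages, the amount of anonymous memory currently backed by transparent hugepages. The helper names thp_counters and anon_hugepages_kb are made up here, and the file arguments exist only so the parsing can be tried on sample files.

```shell
# Print every THP-related counter from a vmstat-style file.
# $1: file to read; defaults to /proc/vmstat.
# thp_counters and anon_hugepages_kb are hypothetical helper names.
thp_counters() {
    grep '^thp_' "${1:-/proc/vmstat}"
}

# Print the AnonHugePages value (in kB) from a meminfo-style file.
# $1: file to read; defaults to /proc/meminfo.
anon_hugepages_kb() {
    awk '$1 == "AnonHugePages:" {print $2}' "${1:-/proc/meminfo}"
}

# Usage on a running system:
#   thp_counters; anon_hugepages_kb
```

Which thp_ counters exist depends on the kernel version, so the sketch prints whatever is present rather than naming specific ones.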

tags: added: kernel-fixed-upstream
Revision history for this message
Launchpad Janitor (janitor) wrote :

[Expired for linux (Ubuntu) because there has been no activity for 60 days.]

Changed in linux (Ubuntu):
status: Incomplete → Expired