Bug #1518457 “kswapd0 100% CPU usage” : Bugs : linux package : Ubuntu

Revision history for this message

Sam Lade (sam-sentynel) wrote on 2015-11-20:

#1

CurrentDmesg.txt (35.3 KiB, text/plain; charset="utf-8")
Dependencies.txt (2.6 KiB, text/plain; charset="utf-8")
JournalErrors.txt (183.2 KiB, text/plain; charset="utf-8")
Lspci.txt (3.0 KiB, text/plain; charset="utf-8")
ProcCpuinfo.txt (799 bytes, text/plain; charset="utf-8")
ProcInterrupts.txt (1.7 KiB, text/plain; charset="utf-8")
ProcModules.txt (2.3 KiB, text/plain; charset="utf-8")
UdevDb.txt (92.1 KiB, text/plain; charset="utf-8")
WifiSyslog.txt (61.1 KiB, text/plain; charset="utf-8")

Brad Figg (brad-figg) wrote on 2015-11-20: Status changed to Confirmed

#2

This change was made by a bot.

Changed in linux (Ubuntu):
status:	New → Confirmed

mecat (habdankm) wrote on 2015-11-22:

#3

Same issue here:
root@orangepi:/var/log# uname -a
Linux orangepi 3.4.39 #2 SMP PREEMPT Mon Oct 12 12:03:03 CEST 2015 armv7l armv7l armv7l GNU/Linux
root@orangepi:/var/log# cat /etc/lsb-release
DISTRIB_ID=Ubuntu
DISTRIB_RELEASE=15.10
DISTRIB_CODENAME=wily
DISTRIB_DESCRIPTION="Ubuntu 15.10"
Also, running
echo 1 > /proc/sys/vm/drop_caches
temporarily resolves the issue.
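
For reference, the value written to drop_caches selects what is dropped: 1 frees the page cache, 2 frees reclaimable slab objects such as dentries and inodes, and 3 frees both. A minimal sketch of the workaround as a shell function follows. The function name and the optional second argument (an alternate target file, handy for a dry run) are assumptions made for illustration and are not part of any existing tool.

```shell
# drop_caches MODE [TARGET]
# MODE: 1 = page cache, 2 = dentries and inodes, 3 = both.
# TARGET defaults to the real sysctl file; the override is a hypothetical
# convenience for dry runs.
drop_caches() {
    mode=$1
    target=${2:-/proc/sys/vm/drop_caches}
    case "$mode" in
        1|2|3) ;;
        *) echo "drop_caches: mode must be 1, 2 or 3" >&2; return 2 ;;
    esac
    sync    # write out dirty pages first; only clean cache can be dropped
    echo "$mode" > "$target"
}
```

As root, `drop_caches 3` applies the strongest variant. It only discards clean, reclaimable cache, so it relieves the symptom rather than fixing the underlying reclaim problem.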

Joseph Salisbury (jsalisbury) wrote on 2015-11-23:

#4

Did this issue start happening after an update/upgrade? Was there a prior kernel version where you were not having this particular problem?

Would it be possible for you to test the latest upstream kernel? Refer to https://wiki.ubuntu.com/KernelMainlineBuilds . Please test the latest v4.4 kernel[0].

If this bug is fixed in the mainline kernel, please add the following tag 'kernel-fixed-upstream'.

If the mainline kernel does not fix this bug, please add the tag: 'kernel-bug-exists-upstream'.

Once testing of the upstream kernel is complete, please mark this bug as "Confirmed".

Thanks in advance.

[0] http://kernel.ubuntu.com/~kernel-ppa/mainline/v4.4-rc2+cod1-wily/

Changed in linux (Ubuntu):
importance:	Undecided → Medium
status:	Confirmed → Incomplete
tags:	added: kernel-da-key

Sam Lade (sam-sentynel) wrote on 2015-11-24:

#5

This was a clean build, so I don't have any information about previous versions unfortunately. (The previous server, which didn't have this issue, was different AWS hardware and the previous Ubuntu version.)

I've tested with the latest mainline kernel and this is still occurring.

Changed in linux (Ubuntu):
status:	Incomplete → Confirmed
tags:	added: kernel-bug-exists-upstream

Joseph Salisbury (jsalisbury) on 2015-11-24

Changed in linux (Ubuntu):
status:	Confirmed → Triaged

Sean Groarke (sgroarke) wrote on 2015-11-26:

#6

Pretty much the same description here. It started when I upgraded an Amazon instance to 15.10.

It is causing a lot of disruption. I am also available to test if it helps move us forward.

Joseph Salisbury (jsalisbury) wrote on 2015-11-30:

#7

I'd like to perform a bisect to figure out what commit caused this regression. We need to identify the earliest kernel where the issue started happening as well as the latest kernel that did not have this issue.

Can you test the following kernels and report back? We are looking for the first kernel version that exhibits this bug:

4.0 Final: http://kernel.ubuntu.com/~kernel-ppa/mainline/v4.0-vivid/
4.1 Final: http://kernel.ubuntu.com/~kernel-ppa/mainline/v4.1-wily/
4.2 Final: http://kernel.ubuntu.com/~kernel-ppa/mainline/v4.2-wily/

You don't have to test every kernel, just up until the kernel that first has this bug. We can then narrow down further by testing some release candidates.

Thanks in advance!

Changed in linux (Ubuntu):
importance:	Medium → High

Sam Lade (sam-sentynel) wrote on 2015-11-30:

#8

Okay, I cloned my server and tried kernel versions. The latest version which does _not_ exhibit the issue is 3.12.51. The first which does is 3.13-rc1.

Joseph Salisbury (jsalisbury) wrote on 2015-12-01:

#9

Thanks for testing, Sam. Could you also test the 3.12 final version, since 3.13-rc1 is the next linear version after 3.12 final? The kernel can be downloaded from:

http://kernel.ubuntu.com/~kernel-ppa/mainline/v3.12-trusty/

Sam Lade (sam-sentynel) wrote on 2015-12-01:

#10

3.12 final doesn't exhibit the issue either.

Joseph Salisbury (jsalisbury) wrote on 2015-12-10:

#11

I started a kernel bisect between v3.12 final and v3.13-rc1. The kernel bisect will require testing of about 7-10 test kernels.

I built the first test kernel, up to the following commit:
5cbb3d216e2041700231bcfc383ee5f8b7fc8b74

The test kernel can be downloaded from:
http://kernel.ubuntu.com/~jsalisbury/lp1518457

Can you test that kernel and report back if it has the bug or not? I will build the next test kernel based on your test results.

Thanks in advance
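
The procedure described above is the standard `git bisect` workflow. A rough sketch, assuming a clone of the mainline kernel tree as the working directory; the helper names `bisect_start` and `bisect_steps` are hypothetical, and `bisect_steps` only estimates the worst case by repeated halving:

```shell
# bisect_start GOOD BAD: begin bisecting between a known-good and a
# known-bad ref, e.g. bisect_start v3.12 v3.13-rc1
bisect_start() {
    git bisect start &&
    git bisect bad "$2" &&
    git bisect good "$1"
}

# After building and booting each candidate kernel, record the verdict:
#   git bisect good   (bug absent)
#   git bisect bad    (bug present)
#   git bisect skip   (kernel cannot be tested, e.g. it crashes on boot)

# bisect_steps N: worst-case number of test kernels for N candidate
# commits, found by halving the range until one commit remains.
bisect_steps() {
    n=$1
    steps=0
    while [ "$n" -gt 1 ]; do
        n=$(( (n + 1) / 2 ))
        steps=$(( steps + 1 ))
    done
    echo "$steps"
}
```

For example, `bisect_steps 128` prints 7 and `bisect_steps 1024` prints 10, so an estimate of 7-10 test kernels corresponds to a range of roughly a hundred to a thousand candidate commits.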

Sam Lade (sam-sentynel) wrote on 2015-12-10:

#12

No bug on that version.

Joseph Salisbury (jsalisbury) wrote on 2015-12-10:

#13

I built the next test kernel, up to the following commit:
e1f56c89b040134add93f686931cc266541d239a

The test kernel can be downloaded from:
http://kernel.ubuntu.com/~jsalisbury/lp1518457

Can you test that kernel and report back if it has the bug or not? I will build the next test kernel based on your test results.

Thanks in advance

Sam Lade (sam-sentynel) wrote on 2015-12-10:

#14

No bug on that version.

Joseph Salisbury (jsalisbury) wrote on 2015-12-11:

#15

I built the next test kernel, up to the following commit:
9073e1a804c3096eda84ee7cbf11d1f174236c75

The test kernel can be downloaded from:
http://kernel.ubuntu.com/~jsalisbury/lp1518457

Can you test that kernel and report back if it has the bug or not? I will build the next test kernel based on your test results.

Thanks in advance

Sam Lade (sam-sentynel) wrote on 2015-12-11:

#16

Bug is present in this version.

Joseph Salisbury (jsalisbury) on 2015-12-11

Changed in linux (Ubuntu):
assignee:	nobody → Joseph Salisbury (jsalisbury)
status:	Triaged → In Progress

Joseph Salisbury (jsalisbury) wrote on 2015-12-11:

#17

I built the next test kernel, up to the following commit:
ab0169bb5cc4a5c86756dde662087f9d12302eb0

The test kernel can be downloaded from:
http://kernel.ubuntu.com/~jsalisbury/lp1518457

Can you test that kernel and report back if it has the bug or not? I will build the next test kernel based on your test results.

Thanks in advance

Sam Lade (sam-sentynel) wrote on 2015-12-13:

#18

No bug in that version.

Joseph Salisbury (jsalisbury) wrote on 2015-12-14:

#19

I built the next test kernel, up to the following commit:
f080480488028bcc25357f85e8ae54ccc3bb7173

The test kernel can be downloaded from:
http://kernel.ubuntu.com/~jsalisbury/lp1518457

Can you test that kernel and report back if it has the bug or not? I will build the next test kernel based on your test results.

Thanks in advance

Sam Lade (sam-sentynel) wrote on 2015-12-15:

#20

Bug is present in this version.

Joseph Salisbury (jsalisbury) wrote on 2015-12-16:

#21

I built the next test kernel, up to the following commit:
b746f9c7941f227ad582b4f0bc981f3adcbc46b2

The test kernel can be downloaded from:
http://kernel.ubuntu.com/~jsalisbury/lp1518457

Can you test that kernel and report back if it has the bug or not? I will build the next test kernel based on your test results.

Thanks in advance

Sam Lade (sam-sentynel) wrote on 2015-12-16:

#22

No bug in that version.

Joseph Salisbury (jsalisbury) wrote on 2015-12-16:

#23

I built the next test kernel, up to the following commit:
72c1253574a1854b0b6f196e24cd0dd08c1ad9b9

The test kernel can be downloaded from:
http://kernel.ubuntu.com/~jsalisbury/lp1518457

Can you test that kernel and report back if it has the bug or not? I will build the next test kernel based on your test results.

Thanks in advance

Sam Lade (sam-sentynel) wrote on 2015-12-16:

#24

kswapd0-kernel-oops.txt (32.4 KiB, text/plain)

It's crashing on boot with this version. It's related to paging, so it might be relevant to the issue, so I've attached the full dmesg and here's the actual crash:

[    3.716345] BUG: unable to handle kernel paging request at 000060ffc0002370
[    3.720056] IP: [<ffffffff811a6ae1>] mem_cgroup_move_account+0xd1/0x250
[    3.720056] PGD 0 
[    3.720056] Oops: 0000 [#1] SMP 
[    3.720056] Modules linked in: ipt_MASQUERADE iptable_nat nf_nat_ipv4 nf_nat nf_conntrack_ipv4 nf_defrag_ipv4 xt_tcpudp xt_recent xt_conntrack nf_conntrack iptable_filter ip_tables x_tables autofs4 crct10dif_pclmul crc32_pclmul ghash_clmulni_intel aesni_intel aes_x86_64 lrw gf128mul glue_helper ablk_helper cryptd psmouse floppy pata_acpi
[    3.720056] CPU: 0 PID: 4 Comm: kworker/0:0 Not tainted 3.12.0-031200rc2-generic #201512161751
[    3.720056] Hardware name: Xen HVM domU, BIOS 4.2.amazon 12/07/2015
[    3.720056] Workqueue: events css_killed_work_fn
[    3.720056] task: ffff88003da946b0 ti: ffff88003daac000 task.ti: ffff88003daac000
[    3.720056] RIP: 0010:[<ffffffff811a6ae1>]  [<ffffffff811a6ae1>] mem_cgroup_move_account+0xd1/0x250
[    3.720056] RSP: 0000:ffff88003daadcc8  EFLAGS: 00010046
[    3.720056] RAX: 0000000000000246 RBX: ffff88003d803a60 RCX: 000000000000053e
[    3.720056] RDX: 000060ffc0002358 RSI: 0000000000000001 RDI: ffff88003c4e822c
[    3.720056] RBP: ffff88003daadd20 R08: ffff88003cc55000 R09: 0000000000000004
[    3.720056] R10: ffff88003c4e8000 R11: 0000000000000001 R12: 0000000000000000
[    3.720056] R13: ffffea0000e0e980 R14: ffff88003c4e8000 R15: 0000000000000001
[    3.720056] FS:  0000000000000000(0000) GS:ffff88003fc00000(0000) knlGS:0000000000000000
[    3.720056] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[    3.720056] CR2: 000060ffc0002370 CR3: 0000000036754000 CR4: 00000000001406f0
[    3.720056] Stack:
[    3.720056]  ffffffff811a7bc8 ffff88003fffb780 ffffea0000e0e980 ffff88003cc55000
[    3.720056]  ffff88003c4e8000 ffff88003c4e822c ffff880036c1da00 ffffea0000e0e980
[    3.720056]  ffff88003fffbcc0 ffff88003d803a60 ffffea0000e0e9a0 ffff88003daadda8
[    3.720056] Call Trace:
[    3.720056]  [<ffffffff811a7bc8>] ? mem_cgroup_page_lruvec+0x28/0x90
[    3.720056]  [<ffffffff811a8427>] mem_cgroup_reparent_charges+0x257/0x460
[    3.720056]  [<ffffffff811a87df>] mem_cgroup_css_offline+0xaf/0x220
[    3.720056]  [<ffffffff810de897>] offline_css+0x27/0x50
[    3.720056]  [<ffffffff810e199d>] css_killed_work_fn+0x2d/0xa0
[    3.720056]  [<ffffffff81081032>] process_one_work+0x182/0x450
[    3.720056]  [<ffffffff81081dc1>] worker_thread+0x121/0x410
[    3.720056]  [<ffffffff81081ca0>] ? rescuer_thread+0x3d0/0x3d0
[    3.720056]  [<ffffffff81088ba0>] kthread+0xc0/0xd0
[    3.720056]  [<ffffffff81088ae0>] ? kthread_create_on_node+0x120/0x120
[    3.720056]  [<ffffffff816ff4fc>] ret_from_fork+0x7c/0xb0
[    3.720056]  [<ffffffff81088ae0>] ? kthread_create_on_node+0x120/0x120
[    3.720056] Code: d6 00 55 00 4d 85 e4 4c 8b 55 c8 4c 8b 45 c0 0f 85 a5 00 00 00 41 8b 55 18 85 d2 0f 88 99 00 00 00 49 8b 96 30 02 00 00 45 89 fb <4c> 39 5a 18 0f 8c c2 00 00 00 44 89 f9 f7 d9 89 ce 65 48 01 72 
[    3.720056] RIP  [<ffffffff811a6ae1>] mem_cgroup_move_account+0xd1/0x250
[    3.720056]  RSP <ffff88003daadcc8>
[    3.720056] CR2: 000060ffc0002370
[    3.720056] ---[ end trace 9ea086b6da9e6208 ]---

Joseph Salisbury (jsalisbury) wrote on 2015-12-17:

#25

Thanks for testing. I skipped that commit in the bisect, in case it's not related to the bug.

I built the next test kernel, up to the following commit:
cbbc58d4fdfab1a39a6ac1b41fcb17885952157a

The test kernel can be downloaded from:
http://kernel.ubuntu.com/~jsalisbury/lp1518457

Can you test that kernel and report back if it has the bug or not? I will build the next test kernel based on your test results.

Thanks in advance

Sam Lade (sam-sentynel) wrote on 2015-12-17:

#26

Same crash on that version.

Joseph Salisbury (jsalisbury) wrote on 2015-12-18:

#27

I built the next test kernel, up to the following commit:
3b7834743f9492e3509930feb4ca47135905e640

The test kernel can be downloaded from:
http://kernel.ubuntu.com/~jsalisbury/lp1518457

Can you test that kernel and report back if it has the bug or not? I will build the next test kernel based on your test results.

Thanks in advance

Sam Lade (sam-sentynel) wrote on 2015-12-20:

#28

That version has also crashed.

mm (mtl-0) wrote on 2015-12-30:

#29

This bug is also affecting me on two identical Xubuntu 15.10 systems:

uname -a: ### 4.2.0-22-generic #27-Ubuntu SMP Thu Dec 17 22:57:08 UTC 2015 x86_64 x86_64 x86_64 GNU/Linux
cat /etc/lsb-release
DISTRIB_ID=Ubuntu
DISTRIB_RELEASE=15.10
DISTRIB_CODENAME=wily
DISTRIB_DESCRIPTION="Ubuntu 15.10"

Also, running
echo 1 > /proc/sys/vm/drop_caches
or
echo 3 > /proc/sys/vm/drop_caches
temporarily resolves the issue.

mm (mtl-0) wrote on 2015-12-30:

#30

Also see here: https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1476211

Joseph Salisbury (jsalisbury) wrote on 2016-01-04:

#31

I built the next test kernel, up to the following commit:
d7876f1be40a16223a44355740de625849504eb5

The test kernel can be downloaded from:
http://kernel.ubuntu.com/~jsalisbury/lp1518457

Can you test that kernel and report back if it has the bug or not? I will build the next test kernel based on your test results.

Thanks in advance

Sam Lade (sam-sentynel) wrote on 2016-01-04:

#32

Crashed again.

Joseph Salisbury (jsalisbury) wrote on 2016-01-06:

#33

I built the next test kernel, up to the following commit:
732e563373ffc57d38a8a3b6d55f2de865182117

The test kernel can be downloaded from:
http://kernel.ubuntu.com/~jsalisbury/lp1518457

Can you test that kernel and report back if it has the bug or not? I will build the next test kernel based on your test results.

Thanks in advance

Sam Lade (sam-sentynel) wrote on 2016-01-06:

#34

Crashed again.

Joseph Salisbury (jsalisbury) wrote on 2016-01-06:

#35

I built the next test kernel, up to the following commit:
56aba608257b451f663d25313d5ecae134d5557f

The test kernel can be downloaded from:
http://kernel.ubuntu.com/~jsalisbury/lp1518457

Can you test that kernel and report back if it has the bug or not? I will build the next test kernel based on your test results.

Thanks in advance

Sam Lade (sam-sentynel) wrote on 2016-01-06:

#36

Crashed again.

Joseph Salisbury (jsalisbury) wrote on 2016-01-06:

#37

I built the next test kernel, up to the following commit:
59ab5a8f4445699e238c4c46b3da63bb9dc02897

The test kernel can be downloaded from:
http://kernel.ubuntu.com/~jsalisbury/lp1518457

Can you test that kernel and report back if it has the bug or not? I will build the next test kernel based on your test results.

Thanks in advance

Sam Lade (sam-sentynel) wrote on 2016-01-07:

#38

Crashed again.

Joseph Salisbury (jsalisbury) wrote on 2016-01-07:

#39

I built the next test kernel, up to the following commit:
98fda169290b3b28c0f2db2b8f02290c13da50ef

The test kernel can be downloaded from:
http://kernel.ubuntu.com/~jsalisbury/lp1518457

Can you test that kernel and report back if it has the bug or not? I will build the next test kernel based on your test results.

Thanks in advance

Sam Lade (sam-sentynel) wrote on 2016-01-07:

#40

Crashed again.

Joseph Salisbury (jsalisbury) on 2016-01-19

Changed in linux (Ubuntu):
status:	In Progress → Confirmed
assignee:	Joseph Salisbury (jsalisbury) → nobody

In Linux Kernel Bug Tracker #65201, ivanov.maxim (ivanov.maxim-linux-kernel-bugs) wrote on 2016-03-09:

#179

Created attachment 208431
/proc/vmstat (time 0)

In Linux Kernel Bug Tracker #65201, ivanov.maxim (ivanov.maxim-linux-kernel-bugs) wrote on 2016-03-09:

#180

Created attachment 208441
/proc/vmstat (time 5s)

In Linux Kernel Bug Tracker #65201, ivanov.maxim (ivanov.maxim-linux-kernel-bugs) wrote on 2016-03-09:

#181

Created attachment 208451
/proc/zoneinfo

In Linux Kernel Bug Tracker #65201, ivanov.maxim (ivanov.maxim-linux-kernel-bugs) wrote on 2016-03-09:

#182

Created attachment 208461
/proc/pagetypeinfo

In Linux Kernel Bug Tracker #65201, ivanov.maxim (ivanov.maxim-linux-kernel-bugs) wrote on 2016-03-09:

#183

Created attachment 208471
/proc/buddyinfo

In Linux Kernel Bug Tracker #65201, ivanov.maxim (ivanov.maxim-linux-kernel-bugs) wrote on 2016-03-09:

#184

Created attachment 208481
vmstat -m (time 0)

In Linux Kernel Bug Tracker #65201, ivanov.maxim (ivanov.maxim-linux-kernel-bugs) wrote on 2016-03-09:

#185

Created attachment 208491
vmstat -m (time 5s)

In Linux Kernel Bug Tracker #65201, ivanov.maxim (ivanov.maxim-linux-kernel-bugs) wrote on 2016-03-09:

#186

I am able to semi-reliably reproduce this (or a very similar?) problem on a setup very close to the one in comment #21:

- kernel: 4.2.0-30-generic (ubuntu 15.10)
- 2 GB RAM, 1 CPU, running under Xen (EC2 t2.small instance)
- docker with LVM thin-pool storage backend, running 3 containers, no memory limits set for their memcg's
- server is mostly idling (load average 0.0-0.1)

To reproduce it I have to:

1. set vm.overcommit_memory=1
2. initiate some disk activity:
find / -xdev -type f |xargs -P10 -n1 md5sum &>/dev/null &
find /var/lib/docker -type f |xargs -P10 -n1 md5sum &>/dev/null &

3. run some memory allocations until you hit OOM
for x in {1..200}; do ./memalloc & : ; done

memalloc above is a simple C program which allocates 100MB and memsets it with 'x':

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

int main(void)
{
    /* 100 MiB per process; size_t avoids overflow if block_mb grows */
    size_t block_mb = 100;
    size_t len = block_mb * 1024 * 1024;
    char *buf;

    printf("allocing %zuMB: ", block_mb);
    buf = malloc(len);
    if (!buf) {
        printf("FAILED!\n");
        exit(EXIT_FAILURE);
    }
    printf("ok\n");
    /* touch every page so the memory is actually committed */
    memset(buf, 'x', len);
    sleep(180);
    return 0;
}

Once you hit OOM, the console slows down. At that point press CTRL+C, run pkill memalloc, and then check top. Many times `kswapd0` spins and then recovers within tens of seconds, but once in a while it stays there for hours (I didn't have the patience to check for longer).

Once I triggered the bug, I tried to get as much information as possible from the running system. I am attaching /proc/*info files (some taken 5 s apart), ftrace output for the event tracer (vmscan events only), and ftrace output for the function_graph tracer. Let me know if you need more information.

To recover from the situation, enough memory needs to be freed in a short period of time. Sometimes dropping caches helps, sometimes I needed to close applications/containers as well, but I never had to reboot to recover.
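
Whether kswapd0 has recovered or is still spinning can also be checked without watching top, by sampling its accumulated CPU time from /proc. A small sketch, assuming the standard /proc/PID/stat layout (utime and stime are fields 14 and 15); the function name `cpu_ticks` is hypothetical:

```shell
# cpu_ticks STATFILE: print utime + stime (in clock ticks) from a
# /proc/PID/stat file.
cpu_ticks() {
    # Drop "pid (comm) " first so a process name containing spaces cannot
    # shift the field numbers; utime and stime are then fields 12 and 13.
    sed 's/^.*) //' "$1" | awk '{ print $12 + $13 }'
}

# Example (not run here): sample kswapd0 twice, five seconds apart.
#   pid=$(pgrep -x kswapd0)
#   a=$(cpu_ticks "/proc/$pid/stat"); sleep 5; b=$(cpu_ticks "/proc/$pid/stat")
#   echo "kswapd0 used $((b - a)) ticks in 5 s"
```

With the common value of 100 ticks per second (see `getconf CLK_TCK`), a difference near 500 over five seconds means kswapd0 is using a full CPU.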

In Linux Kernel Bug Tracker #65201, ivanov.maxim (ivanov.maxim-linux-kernel-bugs) wrote on 2016-03-09:

#187

It would be very helpful if there was a way to get output similar to the ftrace function_graph tracer, but with function args and return values. From the look of it, `pgdat_balanced` for some reason keeps returning false even though /proc/zoneinfo shows that the number of free pages is much higher than any watermark.

The problem description and recovery method closely resemble the discussion around kernel 3.7 (https://lkml.org/lkml/2012/11/28/88):

> The zonelist reclaim in kswapd would do
> nothing because all high watermarks are met, but the compaction logic
> would find its own requirements unmet and loop over the zones again.
> Indefinitely, until some third party would free enough memory to help
> meet the higher compaction watermark.

In Linux Kernel Bug Tracker #65201, hdefendme (hdefendme-linux-kernel-bugs) wrote on 2016-04-30:

#188

(In reply to Anatoli Sakhnik from comment #4)
> My Acer C720 too suffers occasionally. Turning swap on/off doesn't help.
> Dropping caches *does* help:
>
> # echo 3 > /proc/sys/vm/drop_caches # 1 isn't enough
>
> Next my guess would be to try to deactivate zswap.

The workaround above works for me on kernel 4.4.2, Debian jessie.

The bug happens randomly after heavy web browser use on kernel 4.5.
After downgrading to the 3.16 stable jessie kernel, the bug was gone.
After upgrading to 4.4.2, the bug came back.

In Linux Kernel Bug Tracker #65201, mail+kernel-bugzilla (mail+kernel-bugzilla-linux-kernel-bugs) wrote on 2016-07-25:

#189

Same thing on Thinkpad X220 with 8 GB RAM running Ubuntu 14.04, with Ubuntu's Kernel 3.16.0-77-generic.

Swap is disabled.

kswapd0 runs on high CPU and the HD light is on all the time during this (no idea why).

After 20 (!) minutes the OOM killer manages to kill a process to resolve the situation.

In Linux Kernel Bug Tracker #65201, n.sherlock (n.sherlock-linux-kernel-bugs) wrote on 2016-08-25:

#190

Same problem on Amazon's t2.nano instance (512MB of RAM). Seemed to be triggered by doing a bunch of file IO. This is a brand new install of Ubuntu 16.04. I have no swap enabled, and yet:

top - 06:42:57 up 1:58, 1 user, load average: 2.43, 2.66, 2.31
Tasks: 125 total, 3 running, 122 sleeping, 0 stopped, 0 zombie
%Cpu(s): 2.1 us, 6.9 sy, 0.0 ni, 0.0 id, 0.9 wa, 0.0 hi, 0.0 si, 90.1 st
KiB Mem : 498416 total, 348096 free, 49772 used, 100548 buff/cache
KiB Swap: 0 total, 0 free, 0 used. 411900 avail Mem

  PID USER      PR  NI    VIRT    RES    SHR S %CPU %MEM     TIME+ COMMAND
   29 root      20   0       0      0      0 R 65.0  0.0 103:16.64 kswapd0
14343 root      20   0       0      0      0 R  2.9  0.0   0:00.82 python

Running "echo 1 > /proc/sys/vm/drop_caches" didn't fix the problem, but running it with "3" instead fixed it immediately.

Also, my /tmp isn't full at all (6.5GB / 85% left on root).

In Linux Kernel Bug Tracker #65201, n.sherlock (n.sherlock-linux-kernel-bugs) wrote on 2016-08-25:

#191

A workaround for machines running under Xen has been found over on Ubuntu's bug tracker, see comment #69:

https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1518457

The workaround is to disable hot-add of memory:

touch /etc/udev/rules.d/40-vm-hotadd.rules
reboot
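
This works because udev gives a rules file in /etc/udev/rules.d precedence over a file with the same name in /lib/udev/rules.d, so an empty file masks the packaged rules. A cautious sketch of the same step; the function name `mask_udev_rule` and its optional directory argument are illustrative assumptions:

```shell
# mask_udev_rule NAME [DIR]: create an empty rules file NAME in DIR
# (default /etc/udev/rules.d) so that it shadows a packaged file of the
# same name. Refuses to touch an existing non-empty file.
mask_udev_rule() {
    dir=${2:-/etc/udev/rules.d}
    if [ -s "$dir/$1" ]; then
        echo "mask_udev_rule: $dir/$1 exists and is not empty" >&2
        return 1
    fi
    touch "$dir/$1"
}
```

As root, `mask_udev_rule 40-vm-hotadd.rules` followed by a reboot is equivalent to the two commands above.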

In Linux Kernel Bug Tracker #65201, dek94 (dek94-linux-kernel-bugs) wrote on 2016-08-30:

#192

I tried the same Ubuntu-inspired "disable hot-add of memory" (and CPU) workaround under AWS EC2 HVM, CentOS 7.x with the mainline (elrepo) 4.4.15 kernel: no such luck, I still see this occasionally.

Dan Streetman (ddstreet) on 2016-10-01

Changed in linux (Ubuntu):
assignee:	nobody → Dan Streetman (ddstreet)

In Linux Kernel Bug Tracker #65201, ddstreet (ddstreet-linux-kernel-bugs) wrote on 2016-10-01:

#193

I detailed why this bug happens here:
https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1518457/comments/126

This appears to be fixed by Mel Gorman's patch series to change memory reclaim from "per zone" to "per node":
https://marc.info/?l=linux-mm&m=146797052519026

So this bug should be fixed with the latest kernel.

In Linux Kernel Bug Tracker #65201, mail+kernel-bugzilla (mail+kernel-bugzilla-linux-kernel-bugs) wrote on 2016-10-02:

#194

(In reply to Dan Streetman from comment #40)
> I detailed why this bug happens here:
> https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1518457/comments/126
>
> So this bug should be fixed with the latest kernel.

Can you clarify? The link you mention seems to talk mainly about Xen. Do you think the latest kernel will also fix it for non-Xen machines?

In Linux Kernel Bug Tracker #65201, ddstreet (ddstreet-linux-kernel-bugs) wrote on 2016-10-02:

#195

(In reply to mail+kernel-bugzilla from comment #41)
> (In reply to Dan Streetman from comment #40)
> > I detailed why this bug happens here:
> > https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1518457/comments/126
> >
> > So this bug should be fixed with the latest kernel.
>
> Can you clarify, the link you mention seems to talk mainly about Xen. Do you
> think the latest kernel will fix it also for non-Xen machines?

What does your /proc/zoneinfo look like? Do you have a system with (approx) <= 4 GB of RAM and a Normal zone with few managed pages?
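
One way to answer this is to summarize /proc/zoneinfo per zone. The sketch below prints free pages, the min/low/high watermarks, and managed pages for every zone. It assumes the usual zoneinfo layout ("Node N, zone NAME" followed by "pages free", "min", "low", "high" and "managed" lines), and the function name `zone_summary` is hypothetical:

```shell
# zone_summary [FILE]: one line per memory zone with free pages,
# watermarks and managed pages. FILE defaults to /proc/zoneinfo.
zone_summary() {
    awk '
        function emit() {
            if (zone != "") {
                printf "%s free=%s min=%s low=%s high=%s managed=%s\n", zone, free, min, low, high, managed
            }
        }
        # A new "Node N, zone NAME" header closes the previous zone.
        $1 == "Node" && $3 == "zone" {
            emit()
            zone = $4
            free = min = low = high = managed = "?"
            next
        }
        $1 == "pages" && $2 == "free" { free = $3; next }
        # Per-cpu pageset lines use "high:" with a colon, so they do not match.
        NF == 2 && $1 == "min"     { min = $2 }
        NF == 2 && $1 == "low"     { low = $2 }
        NF == 2 && $1 == "high"    { high = $2 }
        NF == 2 && $1 == "managed" { managed = $2 }
        END { emit() }
    ' "${1:-/proc/zoneinfo}"
}
```

Running `zone_summary` with no argument reads the live /proc/zoneinfo. A Normal zone whose managed count is small relative to the other zones would be the kind of layout the question is about.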

In Linux Kernel Bug Tracker #65201, mail+kernel-bugzilla (mail+kernel-bugzilla-linux-kernel-bugs) wrote on 2016-10-02:

#196

(In reply to Dan Streetman from comment #42)
> what does your /proc/zoneinfo look like? do you have a system with (approx)
> <= 4g and Normal zone with few managed pages?

My zoneinfo file right now looks like this: https://gist.github.com/nh2/7ba7375d5c8de797714f7a909e6f0c94

(I upgraded from 8 GB to 16 GB memory recently though, after I wrote comment #36.)

In Linux Kernel Bug Tracker #65201, ddstreet (ddstreet-linux-kernel-bugs) wrote on 2016-10-02:

#197

(In reply to mail+kernel-bugzilla from comment #43)
> (In reply to Dan Streetman from comment #42)
> > what does your /proc/zoneinfo look like? do you have a system with
> (approx)
> > <= 4g and Normal zone with few managed pages?
>
> My zoneinfo file right now looks like this:
> https://gist.github.com/nh2/7ba7375d5c8de797714f7a909e6f0c94
>
> (I upgraded from 8 GB to 16 GB memory recently though, after I wrote comment
> #36.)

That zoneinfo doesn't look like you're seeing the same problem, so if you are seeing consistent, sustained (not just transient) 100% cpu from kswapd, I think it's a different problem from what I described in comment 40.

Seth Forshee (sforshee) on 2016-10-12

Changed in linux (Ubuntu Xenial):
status:	New → Fix Committed
importance:	Undecided → High
assignee:	nobody → Dan Streetman (ddstreet)

Seth Forshee (sforshee) on 2016-10-12

Changed in linux (Ubuntu Yakkety):
status:	Confirmed → Fix Committed

Andy Whitcroft (apw) on 2016-10-13

Changed in linux (Ubuntu Yakkety):
status:	Fix Committed → Invalid

Launchpad Janitor (janitor) on 2016-10-13

Changed in linux (Ubuntu Xenial):
status:	Fix Committed → Fix Released

In Linux Kernel Bug Tracker #65201, samkostka (samkostka-linux-kernel-bugs) wrote on 2016-10-13:

#198

I'm assuming by latest kernel you mean 4.8? If so I'm looking forward to Arch pushing it through testing :)

Seth Forshee (sforshee) on 2016-10-18

tags:

added: verification-needed-yakkety

Øystein Gisnås (oystein-gisnas) on 2016-10-20

tags:

added: verification-done-yakkety
removed: verification-needed-yakkety

Launchpad Janitor (janitor) on 2016-11-09

Changed in linux (Ubuntu Yakkety):
status:	Invalid → Fix Released

In Linux Kernel Bug Tracker #65201, jc (jc-linux-kernel-bugs) wrote on 2016-11-15:

#199

I am having the same issue on Fedora 24 with kernel 4.8.6. So I guess it has not been pushed there, or it does not fix anything.
It is a huge job stopper as I need to transfer many files between two USB disks.
kswapd0 appears at the top of the process list after a while, and slowly degrades overall performance until I have to hard reboot the machine in the middle of some transfer.

In Linux Kernel Bug Tracker #65201, samkostka (samkostka-linux-kernel-bugs) wrote on 2016-11-15:

#200

My guess is Fedora didn't put the changes through or something, because 4.8 has DEFINITELY fixed it for me. I used to have to reboot about twice daily due to this, but ever since I upgraded to 4.8 it hasn't happened once.

In Linux Kernel Bug Tracker #65201, me (me-linux-kernel-bugs) wrote on 2016-11-20:

#201

I'm on openSUSE with 4.8.8 and still have this issue.

Launchpad Janitor (janitor) on 2016-12-06

Changed in linux (Ubuntu):
status:	Invalid → Fix Released

In Linux Kernel Bug Tracker #65201, 00cpxxx (00cpxxx-linux-kernel-bugs) wrote on 2016-12-09:

#202

I'm on Debian with 4.8.7 and still have this issue.

In Linux Kernel Bug Tracker #65201, Wilhelm.Buchmueller (wilhelm.buchmueller-linux-kernel-bugs) wrote on 2017-01-06:

#203

4.8.13-100.fc23.i686+PAE #1
/dev/sda is Samsung SSD 850 EVO 250GB

swapoff -va
sysctl vm.drop_caches=3

Problem: these always cause heavy kswapd0 load:
  cat /dev/sda >> /dev/zero
  hdparm -t /dev/sda
  ddrescue /dev/sda /dev/zero -vf
  hexdump /dev/sda
  dd if=/dev/sda of=/dev/zero
  etc.

No problem (read speed ~500MB/s, except hdparm):
  hdparm --direct -t /dev/sda
  dd iflag=direct if=/dev/sda of=/dev/zero bs=1073741824
  ddrescue --direct /dev/sda /dev/zero -vf -b 4096 -c 8192
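
The difference between the two groups is that the first reads through the page cache, while the second uses direct I/O and bypasses it, so those reads should not fill the cache. A small sketch for observing this on a test file, assuming GNU dd; the function names are hypothetical, and direct reads may fail on filesystems without O_DIRECT support:

```shell
# cached_kb [MEMINFO]: print the "Cached" value (KiB) from /proc/meminfo.
cached_kb() {
    awk '$1 == "Cached:" { print $2 }' "${1:-/proc/meminfo}"
}

# read_buffered FILE: read FILE through the page cache.
read_buffered() {
    dd if="$1" of=/dev/null bs=1M 2>/dev/null
}

# read_direct FILE: read FILE with O_DIRECT, bypassing the page cache.
read_direct() {
    dd if="$1" of=/dev/null bs=1M iflag=direct 2>/dev/null
}

# Example (not run here):
#   before=$(cached_kb); read_buffered /path/to/large/file; after=$(cached_kb)
#   echo "page cache changed by $((after - before)) KiB"
```

Repeating the example with `read_direct` in place of `read_buffered` gives a rough comparison of how much each read mode grows the page cache.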

In Linux Kernel Bug Tracker #65201, dclowes1 (dclowes1-linux-kernel-bugs) wrote on 2017-01-08:

#204

I am not sure if this is the same bug, but for me kswapd0 goes high-cpu following a page allocation failure in xhci_segment_alloc and I think that this has been occurring since moving to 4.8 on Fedora 24. I don't remember experiencing it before that. Currently on 4.8.15.

I normally boot with 3 or 4 USB 3.0 disks attached and, after the upgrade to 4.8.x noticed that kswapd0 was running at 100%. I went back to 4.7.x and no problem. Searches on this issue frequently referred to USB disks so I unplugged and rebooted.

If I unplug all of my USB 3.0 devices I get a normal boot, even with a USB weather station, keyboard, and mouse. Sometimes one or two USB 3.0 disks are OK too. If I boot with all of the USB 3.0 disks included, I get a kworker page allocation failure and after boot kswapd0 is high-cpu, usually split across 2-4 cores.

If I boot with two USB 3.0 disks and get a normal boot (no page allocation failure and normal kswapd) and then plug in a hub with the rest of the disks (and a USB 3.0 card reader) I get the page allocation failure at that point and kswapd0 goes high-cpu.

I have not looked at them all, but whenever I see kswapd0 high-cpu and I do look, there is the page allocation failure in the log.

The 'perf top' command seems to show different information from time to time but the top contenders are frequently 'shrink_inactive_list', 'inactive_list_is_low', 'find_next_bit', 'shrink_node_memcg', '_raw_spin_lock' to name a few.

Makes me wonder if the xhci allocation failure is the trigger, and fails to clean up on the error exit path, and kswapd0 is just a hapless victim. There is a stack trace (on ubuntu kernel) of the page allocation failure in the dmesg attached to https://bugzilla.redhat.com/show_bug.cgi?id=1395825 on this issue but I have more if it would help.

I have 19GiB free on a 24GiB machine so there should be no memory shortage to prompt swapping or the page allocation failure.

I had also noticed frequently that not all of my USB disks were mounted after boot and that I had to remove and reinsert a disk to use it. IIRC this affected my USB 2.0 disks too and from before the upgrade to 4.8 too.