swap storms kills interactive use

Bug #1861359 reported by Seth Arnold
46
This bug affects 6 people
Affects Status Importance Assigned to Milestone
linux (Ubuntu)
Fix Released
High
Andrea Righi
Focal
Fix Released
High
Andrea Righi

Bug Description

[Impact]

The i915 shrinker can make the system unresponsive and kill interactive performance by swapping out dirty objects to reclaim more memory.

[Test case]

The problem can be easily reproduced running an I/O intensive writer, for example:

$ dd if=/dev/zero of=/tmp/zero bs=1G count=32

When the system is running out of memory and the i915 shrinker is called we can experience noticeable lags in the normal interactive activities (mouse not responding for 5-10 seconds, screen not refreshing, etc.).

[Fix]

Avoid the lags by preventing to swap out dirty objects in the i915 shrinker (other cached objects can still be reclaimed).

[Regression potential]

The right fix would be to backport commit 2850748ef876 ("drm/i915: Pull i915_vma_pin under the vm->mutex"), available in v5.5+. However the backport is not trivial and the risk to introduce other bugs is high.

A simple change like this (it's basically a one-liner that removes a flag) can be a reasonable compromise to prevent the problem and avoid the risk of introducing other bugs.

[Original report]

Hello, several times since upgrading to focal from 19.04 I've found my computer entirely unresponsive for periods of twenty or thirty seconds. No mouse movement, no keyboard input, the screen output does not change.

My computer was using swap space and despite very slow writeout speeds well below what the NVME drive can handle, the computer was unusable.

I've captured some vmstat 1 output and top output that I started collecting during the event. (Normally one very long painful period is followed by several shorter periods of uselessness.)

Thanks

ProblemType: Bug
DistroRelease: Ubuntu 20.04
Package: linux-image-5.4.0-12-generic 5.4.0-12.15
ProcVersionSignature: Ubuntu 5.4.0-12.15-generic 5.4.8
Uname: Linux 5.4.0-12-generic x86_64
NonfreeKernelModules: zfs zunicode zavl icp zcommon znvpair
ApportVersion: 2.20.11-0ubuntu15
Architecture: amd64
Date: Wed Jan 29 23:44:05 2020
ProcEnviron:
 TERM=rxvt-unicode-256color
 PATH=(custom, no user)
 XDG_RUNTIME_DIR=<set>
 LANG=en_US.UTF-8
 SHELL=/bin/bash
SourcePackage: linux-signed-5.4
UpgradeStatus: Upgraded to focal on 2020-01-24 (5 days ago)
---
ProblemType: Bug
AlsaVersion: Advanced Linux Sound Architecture Driver Version k5.4.0-12-generic.
ApportVersion: 2.20.11-0ubuntu16
Architecture: amd64
AudioDevicesInUse:
 USER PID ACCESS COMMAND
 /dev/snd/controlC0: sarnold 2734 F.... pulseaudio
 /dev/snd/controlC1: sarnold 2734 F.... pulseaudio
Card0.Amixer.info:
 Card hw:0 'PCH'/'HDA Intel PCH at 0x2fe1028000 irq 145'
   Mixer name : 'Realtek ALC285'
   Components : 'HDA:10ec0285,17aa225c,00100002 HDA:8086280b,80860101,00100000'
   Controls : 53
   Simple ctrls : 15
Card1.Amixer.info:
 Card hw:1 'Audio'/'Generic ThinkPad Dock USB Audio at usb-0000:00:14.0-4.2.4, high speed'
   Mixer name : 'USB Mixer'
   Components : 'USB17ef:306f'
   Controls : 9
   Simple ctrls : 4
DistroRelease: Ubuntu 20.04
HibernationDevice: RESUME=none
IwConfig: Error: [Errno 2] No such file or directory: 'iwconfig': 'iwconfig'
MachineType: LENOVO 20KHCTO1WW
NonfreeKernelModules: zfs zunicode zavl icp zcommon znvpair
Package: linux (not installed)
ProcEnviron:
 TERM=rxvt-unicode-256color
 PATH=(custom, no user)
 XDG_RUNTIME_DIR=<set>
 LANG=en_US.UTF-8
 SHELL=/bin/bash
ProcFB: 0 i915drmfb
ProcKernelCmdLine: BOOT_IMAGE=/BOOT/ubuntu@/vmlinuz-5.4.0-12-generic root=ZFS=rpool/ROOT/ubuntu ro root=ZFS=rpool/ROOT/ubuntu quiet splash acpi_osi=! "acpi_osi=Windows 2015" vt.handoff=1
ProcVersionSignature: Ubuntu 5.4.0-12.15-generic 5.4.8
RelatedPackageVersions:
 linux-restricted-modules-5.4.0-12-generic N/A
 linux-backports-modules-5.4.0-12-generic N/A
 linux-firmware 1.185
Tags: focal
Uname: Linux 5.4.0-12-generic x86_64
UpgradeStatus: Upgraded to focal on 2020-01-24 (5 days ago)
UserGroups: adm cdrom libvirt lpadmin plugdev sambashare sbuild sudo
_MarkForUpload: True
dmi.bios.date: 11/25/2019
dmi.bios.vendor: LENOVO
dmi.bios.version: N23ET69W (1.44 )
dmi.board.asset.tag: Not Available
dmi.board.name: 20KHCTO1WW
dmi.board.vendor: LENOVO
dmi.board.version: SDK0J40709 WIN
dmi.chassis.asset.tag: No Asset Information
dmi.chassis.type: 10
dmi.chassis.vendor: LENOVO
dmi.chassis.version: None
dmi.modalias: dmi:bvnLENOVO:bvrN23ET69W(1.44):bd11/25/2019:svnLENOVO:pn20KHCTO1WW:pvrThinkPadX1Carbon6th:rvnLENOVO:rn20KHCTO1WW:rvrSDK0J40709WIN:cvnLENOVO:ct10:cvrNone:
dmi.product.family: ThinkPad X1 Carbon 6th
dmi.product.name: 20KHCTO1WW
dmi.product.sku: LENOVO_MT_20KH_BU_Think_FM_ThinkPad X1 Carbon 6th
dmi.product.version: ThinkPad X1 Carbon 6th
dmi.sys.vendor: LENOVO
---
ProblemType: Bug
AlsaVersion: Advanced Linux Sound Architecture Driver Version k5.4.0-12-generic.
ApportVersion: 2.20.11-0ubuntu16
Architecture: amd64
AudioDevicesInUse:
 USER PID ACCESS COMMAND
 /dev/snd/controlC0: sarnold 2734 F.... pulseaudio
 /dev/snd/controlC1: sarnold 2734 F.... pulseaudio
Card0.Amixer.info:
 Card hw:0 'PCH'/'HDA Intel PCH at 0x2fe1028000 irq 145'
   Mixer name : 'Realtek ALC285'
   Components : 'HDA:10ec0285,17aa225c,00100002 HDA:8086280b,80860101,00100000'
   Controls : 53
   Simple ctrls : 15
Card1.Amixer.info:
 Card hw:1 'Audio'/'Generic ThinkPad Dock USB Audio at usb-0000:00:14.0-4.2.4, high speed'
   Mixer name : 'USB Mixer'
   Components : 'USB17ef:306f'
   Controls : 9
   Simple ctrls : 4
DistroRelease: Ubuntu 20.04
HibernationDevice: RESUME=none
IwConfig: Error: [Errno 2] No such file or directory: 'iwconfig': 'iwconfig'
MachineType: LENOVO 20KHCTO1WW
NonfreeKernelModules: zfs zunicode zavl icp zcommon znvpair
Package: linux (not installed)
ProcEnviron:
 TERM=rxvt-unicode-256color
 PATH=(custom, no user)
 XDG_RUNTIME_DIR=<set>
 LANG=en_US.UTF-8
 SHELL=/bin/bash
ProcFB: 0 i915drmfb
ProcKernelCmdLine: BOOT_IMAGE=/BOOT/ubuntu@/vmlinuz-5.4.0-12-generic root=ZFS=rpool/ROOT/ubuntu ro root=ZFS=rpool/ROOT/ubuntu quiet splash acpi_osi=! "acpi_osi=Windows 2015" vt.handoff=1
ProcVersionSignature: Ubuntu 5.4.0-12.15-generic 5.4.8
RelatedPackageVersions:
 linux-restricted-modules-5.4.0-12-generic N/A
 linux-backports-modules-5.4.0-12-generic N/A
 linux-firmware 1.185
Tags: focal
Uname: Linux 5.4.0-12-generic x86_64
UpgradeStatus: Upgraded to focal on 2020-01-24 (5 days ago)
UserGroups: adm cdrom libvirt lpadmin plugdev sambashare sbuild sudo
_MarkForUpload: True
dmi.bios.date: 11/25/2019
dmi.bios.vendor: LENOVO
dmi.bios.version: N23ET69W (1.44 )
dmi.board.asset.tag: Not Available
dmi.board.name: 20KHCTO1WW
dmi.board.vendor: LENOVO
dmi.board.version: SDK0J40709 WIN
dmi.chassis.asset.tag: No Asset Information
dmi.chassis.type: 10
dmi.chassis.vendor: LENOVO
dmi.chassis.version: None
dmi.modalias: dmi:bvnLENOVO:bvrN23ET69W(1.44):bd11/25/2019:svnLENOVO:pn20KHCTO1WW:pvrThinkPadX1Carbon6th:rvnLENOVO:rn20KHCTO1WW:rvrSDK0J40709WIN:cvnLENOVO:ct10:cvrNone:
dmi.product.family: ThinkPad X1 Carbon 6th
dmi.product.name: 20KHCTO1WW
dmi.product.sku: LENOVO_MT_20KH_BU_Think_FM_ThinkPad X1 Carbon 6th
dmi.product.version: ThinkPad X1 Carbon 6th
dmi.sys.vendor: LENOVO

CVE References

Revision history for this message
Seth Arnold (seth-arnold) wrote :
Revision history for this message
Seth Arnold (seth-arnold) wrote :
Revision history for this message
Seth Arnold (seth-arnold) wrote :
Steve Beattie (sbeattie)
affects: linux-signed-5.4 (Ubuntu) → linux (Ubuntu)
Revision history for this message
Ubuntu Kernel Bot (ubuntu-kernel-bot) wrote : Missing required logs.

This bug is missing log files that will aid in diagnosing the problem. While running an Ubuntu kernel (not a mainline or third-party kernel) please enter the following command in a terminal window:

apport-collect 1861359

and then change the status of the bug to 'Confirmed'.

If, due to the nature of the issue you have encountered, you are unable to run this command, please add a comment stating that fact and change the bug status to 'Confirmed'.

This change has been made by an automated script, maintained by the Ubuntu Kernel Team.

Changed in linux (Ubuntu):
status: New → Incomplete
Revision history for this message
Seth Arnold (seth-arnold) wrote : AlsaDevices.txt

apport information

tags: added: apport-collected
description: updated
Revision history for this message
Seth Arnold (seth-arnold) wrote : AplayDevices.txt

apport information

Revision history for this message
Seth Arnold (seth-arnold) wrote : AlsaDevices.txt

apport information

description: updated
Revision history for this message
Seth Arnold (seth-arnold) wrote : AplayDevices.txt

apport information

Revision history for this message
Seth Arnold (seth-arnold) wrote : ArecordDevices.txt

apport information

Revision history for this message
Seth Arnold (seth-arnold) wrote : CRDA.txt

apport information

Revision history for this message
Seth Arnold (seth-arnold) wrote : Card0.Amixer.values.txt

apport information

Revision history for this message
Seth Arnold (seth-arnold) wrote : Card0.Codecs.codec.0.txt

apport information

Revision history for this message
Seth Arnold (seth-arnold) wrote : Card0.Codecs.codec.2.txt

apport information

Revision history for this message
Seth Arnold (seth-arnold) wrote : Card1.Amixer.values.txt

apport information

Revision history for this message
Seth Arnold (seth-arnold) wrote : CurrentDmesg.txt

apport information

Revision history for this message
Seth Arnold (seth-arnold) wrote : Lspci.txt

apport information

Revision history for this message
Seth Arnold (seth-arnold) wrote : Lsusb.txt

apport information

Revision history for this message
Seth Arnold (seth-arnold) wrote : Lsusb-t.txt

apport information

Revision history for this message
Seth Arnold (seth-arnold) wrote : Lsusb-v.txt

apport information

Revision history for this message
Seth Arnold (seth-arnold) wrote : PciMultimedia.txt

apport information

Revision history for this message
Seth Arnold (seth-arnold) wrote : ProcCpuinfo.txt

apport information

Revision history for this message
Seth Arnold (seth-arnold) wrote : ProcCpuinfoMinimal.txt

apport information

Revision history for this message
Seth Arnold (seth-arnold) wrote : ProcInterrupts.txt

apport information

Revision history for this message
Seth Arnold (seth-arnold) wrote : ProcModules.txt

apport information

Revision history for this message
Seth Arnold (seth-arnold) wrote : PulseList.txt

apport information

Revision history for this message
Seth Arnold (seth-arnold) wrote : RfKill.txt

apport information

Revision history for this message
Seth Arnold (seth-arnold) wrote : UdevDb.txt

apport information

Revision history for this message
Seth Arnold (seth-arnold) wrote : WifiSyslog.txt

apport information

Changed in linux (Ubuntu):
status: Incomplete → Confirmed
Revision history for this message
Jamie Strandboge (jdstrand) wrote :

Seth and I talked about this and I marked this as affects me. If it helps, I saw this on eoan and focal doesn't make a difference (which might suggest the change is between disco and eoan).

Revision history for this message
Jamie Strandboge (jdstrand) wrote :

FYI, I decided to do this:

$ sudo swapoff -a && sudo swapon -a
$ free -h
              total used free shared buff/cache available
Mem: 15Gi 5.9Gi 4.8Gi 2.0Gi 4.8Gi 7.2Gi
Swap: 15Gi 348Mi 15Gi

Even though I am no where near needing to swap, after the swapoff/swapon, it immediately climbed back up to 348M.

Oddly, I did it again after flushing the page cache, and the swap stayed at zero:

$ sudo sync ; sudo sh -c 'echo 1 > /proc/sys/vm/drop_caches'
$ sudo swapoff -a && sudo swapon -a
$ free -h
              total used free shared buff/cache available
Mem: 15Gi 5.6Gi 6.8Gi 2.6Gi 3.1Gi 7.0Gi
Swap: 15Gi 0B 15Gi

I don't know if flushing the page cache is related or coincidence with the second swap clearance.

Revision history for this message
Jamie Strandboge (jdstrand) wrote :

I forgot to mention, I also have nvme.

Changed in linux (Ubuntu):
assignee: nobody → Colin Ian King (colin-king)
importance: Undecided → High
Revision history for this message
Seth Arnold (seth-arnold) wrote :

I had a vmstat 1 running; the entire time the system was swapping out, X was unusable.

Thanks

Andrea Righi (arighi)
Changed in linux (Ubuntu):
assignee: Colin Ian King (colin-king) → Andrea Righi (arighi)
Revision history for this message
Andrea Righi (arighi) wrote :

Hello Seth, thanks for reporting the problem. I was wondering if this could be related to the memory cgroup controller.

As a simple test could you try to reboot the system adding cgroup_disable=memory to the kernel boot parameters?

In this way if the problem goes away at least we know it's related to the memory cgroup and we can investigate more (it could a misconfiguration enforced by systemd or a recent change in the kernel that introduced this different behavior).

Revision history for this message
Seth Arnold (seth-arnold) wrote :
Download full text (4.9 KiB)

This kernel command line parameter didn't appear to help:

Moments after loading a new URL into firefox:

1 0 0 699176 17804 3699820 0 0 0 0 2289 863 2 1 98 0 0
 0 0 0 702860 17804 3699780 0 0 0 2288 492 1817 0 1 99 0 0
 0 0 0 702860 17804 3699780 0 0 0 0 232 525 0 0 100 0 0
 0 0 0 703112 17804 3699748 0 0 0 0 220 564 0 0 100 0 0
 2 0 0 707792 17804 3699748 0 0 0 0 384 734 1 0 99 0 0
 0 0 0 712468 17804 3699996 0 0 0 0 1005 854 2 0 97 0 0
 0 0 0 712500 17804 3699988 0 0 0 2308 497 1908 0 0 99 0 0
 0 0 0 712752 17804 3699748 0 0 0 0 240 567 0 0 100 0 0
 0 0 256 632540 17804 3763996 0 232 4 476 1106 2138 2 1 96 0 0
 1 0 4352 649568 17804 3761060 0 2980 0 8132 5030 17862 1 7 92 0 0
 1 0 7168 644588 17804 3758040 0 3032 0 9104 5304 20299 0 7 93 0 0
 1 0 10240 654444 17804 3755152 0 2892 0 12864 5367 21353 0 8 92 1 0
 1 0 13312 651200 17804 3752216 0 2932 0 8804 5073 18847 0 5 95 0 0
 1 0 16128 660636 17804 3749200 0 3016 0 9048 5183 19639 1 6 94 0 0
 0 0 18944 652444 17804 3746180 0 3020 0 9060 5221 19613 0 5 95 0 0
 2 0 22016 645228 17804 3743168 0 3012 0 9032 5219 20287 0 7 93 0 0
 0 0 25600 674648 17804 3740152 0 3016 0 11148 5290 20518 0 7 93 0 0
procs -----------------------memory---------------------- ---swap-- -----io---- -system-- --------cpu--------
 r b swpd free buff cache si so bi bo in cs us sy id wa st
 1 0 28160 667968 17804 3737076 0 3076 0 9228 5336 19913 0 7 93 0 0
 0 0 31488 663008 17804 3734048 0 3028 0 9084 5377 19518 0 6 94 0 0
 2 0 34560 655620 17804 3731020 0 3028 0 9080 5356 20395 0 7 93 0 0
 0 0 37632 649356 17804 3727996 0 3024 0 9076 5258 19858 0 6 94 0 0
 1 0 40192 677924 17804 3724948 0 3064 0 11388 5503 22313 0 8 92 0 0
 0 0 43264 670968 17804 3721944 0 3004 0 9016 5211 19591 0 7 93 0 0
 2 0 46336 665016 17804 3718996 0 2948 0 8844 5167 18978 0 7 93 0 0
 2 0 49664 658068 17804 3715896 0 3100 0 9300 5304 20485 0 7 93 0 0
 1 0 52480 650784 17804 3712944 0 2984 0 8952 5150 19377 0 5 95 0 ...

Read more...

Revision history for this message
Seth Arnold (seth-arnold) wrote :

$ systemctl show '*.slice' | grep -e '^Slice' -e '^ControlGroup' -e "^DefaultMem" -e "^Memory"
Slice=system.slice
ControlGroup=/system.slice/system-systemd\x2dfsck.slice
MemoryCurrent=[not set]
MemoryAccounting=yes
DefaultMemoryLow=0
DefaultMemoryMin=0
MemoryMin=0
MemoryLow=0
MemoryHigh=infinity
MemoryMax=infinity
MemorySwapMax=infinity
MemoryLimit=infinity
Slice=system.slice
ControlGroup=/system.slice/system-getty.slice
MemoryCurrent=[not set]
MemoryAccounting=yes
DefaultMemoryLow=0
DefaultMemoryMin=0
MemoryMin=0
MemoryLow=0
MemoryHigh=infinity
MemoryMax=infinity
MemorySwapMax=infinity
MemoryLimit=infinity
Slice=-.slice
ControlGroup=/user.slice
MemoryCurrent=[not set]
MemoryAccounting=yes
DefaultMemoryLow=0
DefaultMemoryMin=0
MemoryMin=0
MemoryLow=0
MemoryHigh=infinity
MemoryMax=infinity
MemorySwapMax=infinity
MemoryLimit=infinity
ControlGroup=/
MemoryCurrent=14869471232
MemoryAccounting=yes
DefaultMemoryLow=0
DefaultMemoryMin=0
MemoryMin=0
MemoryLow=0
MemoryHigh=infinity
MemoryMax=infinity
MemorySwapMax=infinity
MemoryLimit=infinity
Slice=user.slice
ControlGroup=/user.slice/user-1000.slice
MemoryCurrent=[not set]
MemoryAccounting=yes
DefaultMemoryLow=0
DefaultMemoryMin=0
MemoryMin=0
MemoryLow=0
MemoryHigh=infinity
MemoryMax=infinity
MemorySwapMax=infinity
MemoryLimit=infinity
Slice=system.slice
ControlGroup=/system.slice/system-ubxtool.slice
MemoryCurrent=[not set]
MemoryAccounting=yes
DefaultMemoryLow=0
DefaultMemoryMin=0
MemoryMin=0
MemoryLow=0
MemoryHigh=infinity
MemoryMax=infinity
MemorySwapMax=infinity
MemoryLimit=infinity
Slice=system.slice
ControlGroup=/system.slice/system-systemd\x2dcryptsetup.slice
MemoryCurrent=[not set]
MemoryAccounting=yes
DefaultMemoryLow=0
DefaultMemoryMin=0
MemoryMin=0
MemoryLow=0
MemoryHigh=infinity
MemoryMax=infinity
MemorySwapMax=infinity
MemoryLimit=infinity
Slice=system.slice
ControlGroup=/system.slice/system-systemd\x2dbacklight.slice
MemoryCurrent=[not set]
MemoryAccounting=yes
DefaultMemoryLow=0
DefaultMemoryMin=0
MemoryMin=0
MemoryLow=0
MemoryHigh=infinity
MemoryMax=infinity
MemorySwapMax=infinity
MemoryLimit=infinity
Slice=-.slice
ControlGroup=/system.slice
MemoryCurrent=[not set]
MemoryAccounting=yes
DefaultMemoryLow=0
DefaultMemoryMin=0
MemoryMin=0
MemoryLow=0
MemoryHigh=infinity
MemoryMax=infinity
MemorySwapMax=infinity
MemoryLimit=infinity

Revision history for this message
Andrea Righi (arighi) wrote :

OK, so we know that it's not related to the memory cgroup subsystem.

Another reason of such unexpected swapping activity could be due to memory compaction code that is triggering some direct memory reclaim and forcing to swap out pages.

What do you have in /proc/sys/vm/swappiness? Could you try to set that to 0 (kernel prefers to drop file-backed pages instead of swapping out anonymous pages) and see if the swap out activity is still happening?

Moreover, we can do some more advanced tracing of the direct memory reclaim code, for example:
 $ sudo kprobe-perf -s 'p:shrink_node'

This may give us useful information if and how direct memory reclaim code was triggered (this command should print out some stack traces when notice the swap out activity).

Revision history for this message
Seth Arnold (seth-arnold) wrote : Re: [Bug 1861359] Re: swap storms kills interactive use

On Mon, Feb 10, 2020 at 07:53:20AM -0000, Andrea Righi wrote:
> OK, so we know that it's not related to the memory cgroup subsystem.

But this is a good instinct. It does seem to happen when eg firefox or git
is in heavy memory use, not the system as a whole.

> Another reason of such unexpected swapping activity could be due to
> memory compaction code that is triggering some direct memory reclaim and
> forcing to swap out pages.
>
> What do you have in /proc/sys/vm/swappiness? Could you try to set that
> to 0 (kernel prefers to drop file-backed pages instead of swapping out
> anonymous pages) and see if the swap out activity is still happening?

$ cat /proc/sys/vm/swappiness
60

I'll try 0.

> Moreover, we can do some more advanced tracing of the direct memory reclaim code, for example:
> $ sudo kprobe-perf -s 'p:shrink_node'
>

This doesn't work:

# kprobe-perf -s 'p:shrink_node'
ERROR: func shrink_node not in /sys/kernel/debug/tracing/available_filter_functions.
Either it doesn't exist, or, it might be unsafe to kprobe. Exiting. Use -F to override.
# ls -l /sys/kernel/debug/tracing/available_filter_functions
ls: cannot access '/sys/kernel/debug/tracing/available_filter_functions': No such file or directory
# find /sys/kernel/debug/tracing
/sys/kernel/debug/tracing
/sys/kernel/debug/tracing/instances
/sys/kernel/debug/tracing/trace_stat
/sys/kernel/debug/tracing/per_cpu
/sys/kernel/debug/tracing/per_cpu/cpu7
/sys/kernel/debug/tracing/per_cpu/cpu6
/sys/kernel/debug/tracing/per_cpu/cpu5
/sys/kernel/debug/tracing/per_cpu/cpu4
/sys/kernel/debug/tracing/per_cpu/cpu3
/sys/kernel/debug/tracing/per_cpu/cpu2
/sys/kernel/debug/tracing/per_cpu/cpu1
/sys/kernel/debug/tracing/per_cpu/cpu0
/sys/kernel/debug/tracing/options
#

Thanks

Revision history for this message
Seth Arnold (seth-arnold) wrote :

> What do you have in /proc/sys/vm/swappiness? Could you try to set that
> to 0 (kernel prefers to drop file-backed pages instead of swapping out
> anonymous pages) and see if the swap out activity is still happening?

Unfortunately, this did not solve the problem.

Setting swappiness to 0 and re-enabling my swap nearly immediately caused
the whole computer to lag whenever firefox had focus -- and swap use
jumped to 380M, despite having roughly five gigs free space.

Turning swap back off again helped immensely.

Thanks

Revision history for this message
Andrea Righi (arighi) wrote :

Weird that kprobe-perf isn't working... I've just tried it on a fresh new installed 20.04 instance and:

ubuntu@ubuntu:~$ uname -a
Linux ubuntu 5.4.0-12-generic #15-Ubuntu SMP Tue Jan 21 15:12:29 UTC 2020 x86_64 x86_64 x86_64 GNU/Linux
ubuntu@ubuntu:~$ sudo ls -l /sys/kernel/debug/tracing/available_filter_functions
-r--r--r-- 1 root root 0 Feb 12 11:32 /sys/kernel/debug/tracing/available_filter_functions
ubuntu@ubuntu:~$ sudo kprobe-perf -s 'p:shrink_node'
Tracing kprobe shrink_node. Ctrl-C to end.
^C
Ending tracing...
ubuntu@ubuntu:~$

Are we using a different kernel?

What are your kernel .config options about ftrace (grep FTRACE /boot/config-`uname -r`)?

Revision history for this message
Seth Arnold (seth-arnold) wrote :

$ uname -a
Linux millbarge 5.4.0-12-generic #15-Ubuntu SMP Tue Jan 21 15:12:29 UTC 2020 x86_64 x86_64 x86_64 GNU/Linux
$ grep FTRACE /boot/config-`uname -r`
CONFIG_KPROBES_ON_FTRACE=y
CONFIG_HAVE_KPROBES_ON_FTRACE=y
CONFIG_STM_SOURCE_FTRACE=m
# CONFIG_PSTORE_FTRACE is not set
CONFIG_HAVE_DYNAMIC_FTRACE=y
CONFIG_HAVE_DYNAMIC_FTRACE_WITH_REGS=y
CONFIG_HAVE_FTRACE_MCOUNT_RECORD=y
CONFIG_FTRACE=y
CONFIG_FTRACE_SYSCALLS=y
CONFIG_DYNAMIC_FTRACE=y
CONFIG_DYNAMIC_FTRACE_WITH_REGS=y
CONFIG_FTRACE_MCOUNT_RECORD=y
# CONFIG_FTRACE_STARTUP_TEST is not set

$ sudo ls -l /sys/kernel/debug/tracing/available_filter_functions
ls: cannot access '/sys/kernel/debug/tracing/available_filter_functions': No such file or directory
$ sudo kprobe-perf -s 'p:shrink_node'
ERROR: func shrink_node not in /sys/kernel/debug/tracing/available_filter_functions.
Either it doesn't exist, or, it might be unsafe to kprobe. Exiting. Use -F to override.

Are you using secure boot?

$ dmesg | grep -i secure
[ 0.000000] secureboot: Secure boot enabled
[ 0.000000] Kernel is locked down from EFI Secure Boot mode; see man kernel_lockdown.7
[ 0.019867] secureboot: Secure boot enabled
[ 0.965284] integrity: Loaded X.509 cert 'millbarge Secure Boot Module Signature key: effb550c01bc71fba674d2a0b19930ccde197df8'
[ 72.131681] Bluetooth: hci0: Secure boot is enabled

Thanks

Revision history for this message
Andrea Righi (arighi) wrote :

Ah! You're right, that's the reason! When the kernel is locked down ftrace is explicitly disabled. To confirm that, you should have 0 in /proc/sys/kernel/ftrace_enabled.

Can you try to set it back to 1 and see if kprobe-perf works after that? Otherwise I'll figure out an alternative way to trace down the swapping issue. Thanks.

Revision history for this message
Seth Arnold (seth-arnold) wrote :

$ uname -a
Linux millbarge 5.4.0-12-generic #15-Ubuntu SMP Tue Jan 21 15:12:29 UTC 2020 x86_64 x86_64 x86_64 GNU/Linux
$ cat /proc/sys/kernel/ftrace_enabled
1
$ sudo kprobe-perf -s 'p:shrink_node'
[sudo] password for sarnold:
ERROR: func shrink_node not in /sys/kernel/debug/tracing/available_filter_functions.
Either it doesn't exist, or, it might be unsafe to kprobe. Exiting. Use -F to override.
$ sudo kprobe-perf -F -s 'p:shrink_node'
Tracing kprobe shrink_node. Ctrl-C to end.
/usr/sbin/kprobe-perf: line 227: current_tracer: Permission denied
/usr/sbin/kprobe-perf: line 228: kprobe_events: Permission denied
ERROR: adding kprobe "p:shrink_node shrink_node".
Last 2 dmesg entries (might contain reason):
    [1039094.850875] usb 1-4.4.1.3: Manufacturer: u-blox AG - www.u-blox.com
    [1039094.860068] cdc_acm 1-4.4.1.3:1.0: ttyACM1: USB ACM device
Exiting.

Does the /proc/sys/kernel/ftrace_enabled output need a new bug report?

Thanks

Revision history for this message
Andrea Righi (arighi) wrote :

I guess we can't use ftrace and secure boot at the same time then... would it be possible to disable secure boot / kernel lockdown on your side and run a test using that kprobe-perf command?

If it's not possible or too complicated we'll find an alternative way, maybe I can create a custom kernel and add some specific code to trace shrink_node().

Revision history for this message
Seth Arnold (seth-arnold) wrote :

This web page may be a good reproducer candidate:

https://platform.leolabs.space/visualizations/conjunction?type=conjunction&reportId=2004981040

Loading it in firefox would make my computer unresponsive for over a minute. (Be careful with firefox reloading it when re-opening firefox.)

Loading it in chromium-browser made my computer unresponsive for 45 seconds.

Thanks

Revision history for this message
Andrea Righi (arighi) wrote :

Many thanks for the reproducer Seth! I've been able to reproduce the swapping issue on my laptop! Now I can investigate more on my side. I'll keep you posted!

Revision history for this message
Andrea Righi (arighi) wrote :

Seth, can you try to see if you can reproduce the problem with the latest unstable kernel (5.4.0-17.21)? https://launchpad.net/~canonical-kernel-team/+archive/ubuntu/unstable

I can't reproduce the problem with it. I have not verified yet, but I suspect it might be related to this commit:

  d92ff805127f mm: move_pages: report the number of non-attempted pages

Thanks!

Revision history for this message
Steve Beattie (sbeattie) wrote :

On Sat, Feb 29, 2020 at 10:06:42AM -0000, Andrea Righi wrote:
> Seth, can you try to see if you can reproduce the problem with the
> latest unstable kernel (5.4.0-17.21)? https://launchpad.net/~canonical-
> kernel-team/+archive/ubuntu/unstable
>
> I can't reproduce the problem with it. I have not verified yet, but I
> suspect it might be related to this commit:
>
> d92ff805127f mm: move_pages: report the number of non-attempted pages

I've installed this kernel, but haven't rebooted into it. I noticed
while installing it, I got a zillion messages:

  depmod: ERROR: ../libkmod/libkmod.c:515 lookup_builtin_file() could not open builtin file '/lib/modules/5.4.0-17-generic/modules.builtin.bin'

Both linux-modules-5.4.0-17-generic and
linux-modules-extra-5.4.0-17-generic were installed, so I'm not sure
what's gone wrong here.

--
Steve Beattie
<email address hidden>
http://NxNW.org/~steve/

Revision history for this message
Andrea Righi (arighi) wrote :

The depmod error messages have been fixed in initramfs-tools (see https://bugs.launchpad.net/bugs/1863261). It doesn't actually prevent the kernel from booting, so you can safely reboot.

Revision history for this message
Seth Arnold (seth-arnold) wrote :

Andrea, this new kernel looks promising.

Linux millbarge 5.4.0-17-generic #21-Ubuntu SMP Fri Feb 28 16:18:44 UTC 2020 x86_64 x86_64 x86_64 GNU/Linux

Unfortunately I didn't check the reproducer before rebooting: the error message I get with it now suggests that it might not work any more.

However, this visualization https://platform.leolabs.space/visualizations/leo sure feels like it would have the heft to reproduce the problem. It worked fine for me in both firefox and chromium-browser:

sarnold 4987 14.1 3.0 3042112 495860 ? Sl 03:42 0:46 /usr/lib/firefox/firefox
sarnold 5070 7.3 1.8 2651760 293108 ? Sl 03:42 0:23 /usr/lib/firefox/firefox -contentproc -childID 1
sarnold 5175 2.5 0.9 2559560 155888 ? Sl 03:42 0:08 /usr/lib/firefox/firefox -contentproc -childID 2
sarnold 5736 5.6 2.9 4144456 476876 ? Sl 03:42 0:17 /usr/lib/firefox/firefox -contentproc -childID 4
sarnold 9668 0.3 0.4 2382156 71616 ? Sl 03:46 0:00 /usr/lib/firefox/firefox -contentproc -childID 6

sarnold 10394 6.7 1.1 3036576 182812 ? Sl 03:48 0:04 /snap/chromium/1040/usr/lib/chrom
sarnold 10672 0.0 0.2 407644 47472 ? S 03:48 0:00 /snap/chromium/1040/usr/lib/chrom
sarnold 10674 0.0 0.0 407644 12932 ? S 03:48 0:00 /snap/chromium/1040/usr/lib/chrom
sarnold 10690 17.9 2.3 1269704 372880 ? Sl 03:48 0:10 /snap/chromium/1040/usr/lib/chrom
sarnold 10694 0.8 0.4 1014184 77752 ? Sl 03:48 0:00 /snap/chromium/1040/usr/lib/chrom
sarnold 10735 60.2 3.2 6219072 521464 ? Sl 03:48 0:36 /snap/chromium/1040/usr/lib/chrom
sarnold 10780 0.0 0.3 5498108 51232 ? Sl 03:48 0:00 /snap/chromium/1040/usr/lib/chrom

I also did several git fast-forward merges on various Linux git trees; these have not been reliable reproducers, but they also completed without trouble.

An uptime of 30 minutes is perhaps too short to declare success but this is certainly promising.

Thanks

Revision history for this message
Seth Arnold (seth-arnold) wrote :

Andrea, unfortunately this updated kernel hasn't fixed the problem:

 01:02:48 up 21:21, 9 users, load average: 1.45, 0.98, 0.58

Linux millbarge 5.4.0-17-generic #21-Ubuntu SMP Fri Feb 28 16:18:44 UTC 2020 x86_64 x86_64 x86_64 GNU/Linux

I was able to reproduce the swap growth and hangs with Firefox, doing a few google image searches.

Thanks

Revision history for this message
Andrea Righi (arighi) wrote :

@seth-arnold ok I'll do this tests also on my side and see if I can reproduce the problem. If you find a specific web page that can trigger the problem easily let me know. Thanks!

Revision history for this message
Seth Arnold (seth-arnold) wrote :

There is an interesting (to me, anyway) change of behaviour with the -17 kernel: while earlier kernels would appear to be locked solid for 30-60 seconds before the screen could update, -17 allows screen updates every six seconds or so.

I have an always-running mosh session to a remote host running irssi. This session has a clock that updates every second. When this problem strikes I can glance at this clock and by watching the time jumps, gauge how long the hangs are.

When the -17 kernel hangs during these events, this clock updates roughly every six seconds:

eg on seconds 36, 42, 48, 54, 00, 06, 13, 19, 25, 31, 37, 43, 50, 56..

The system is by no means usable during this time, but it is interesting that enough userspace ran for mosh to accept packets and redraw the urxvt terminal, and xorg to update the screen.

My most recent hang happened while firefox loaded https://www.google.com/maps/place/20%C2%B054'35.7%22S+55%C2%B035'05.0%22E/@-20.9099167,55.5826253,2281m/data=!3m2!1e3!4b1!4m5!3m4!1s0x0:0x0!8m2!3d-20.90991!4d55.58471 but it might be the case that this is just what pushed my browser over the edge, after loading several pages of google docs content and a PDF.

Thanks

Revision history for this message
Jamie Strandboge (jdstrand) wrote :

Using 5.4.0-17.21-generic, my laptop has 16G of ram. If I launch 3 vms (xenial desktop (768M), bionic desktop (1.5G) and focal desktop (2.6G)) then load this page: https://people.canonical.com/~ubuntu-security/oval/com.ubuntu.xenial.cve.oval.xml, at some point while the page is loading, the desktop interactivity degrades. top tells me I am swapping 2.5G. (I have other tabs open in firefox of course). Eventually the system settles down and interactivity is decent again.

top - 07:59:36 up 4 days, 21:27, 1 user, load average: 2.37, 2.69, 2.43
Tasks: 456 total, 2 running, 454 sleeping, 0 stopped, 0 zombie
%Cpu(s): 10.8 us, 8.1 sy, 0.0 ni, 79.1 id, 0.0 wa, 0.0 hi, 2.0 si, 0.0 st
MiB Mem : 15859.5 total, 3358.4 free, 10540.5 used, 1960.5 buff/cache
MiB Swap: 16232.0 total, 13749.7 free, 2482.2 used. 4008.9 avail Mem

    PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
 200286 jamie 20 0 9007480 4.2g 159064 S 0.0 27.3 6:36.74 Web Content
 410962 libvirt+ 20 0 3428276 572112 956 S 0.0 3.5 0:16.71 qemu-system-+
 199841 jamie 20 0 3704896 502596 138168 S 2.7 3.1 34:35.49 firefox-bin
   6292 jamie 20 0 1173892 481776 2760 S 0.0 3.0 0:33.38 gnome-softwa+
 270490 jamie 20 0 2855220 453944 52804 S 0.0 2.8 5:28.90 chrome
   4798 jamie 20 0 4703264 386268 102868 S 10.7 2.4 123:08.13 gnome-shell
 200005 jamie 20 0 3168928 383332 82760 S 1.3 2.4 8:57.08 Web Content
 411248 libvirt+ 20 0 5171760 376240 1152 S 14.7 2.3 1:01.78 qemu-system-+
 200257 jamie 20 0 3072596 346168 36604 S 4.0 2.1 18:06.12 Web Content
 411125 libvirt+ 20 0 3944844 296928 2248 R 2.7 1.8 0:42.79 qemu-system-+
 200235 jamie 20 0 3129932 293856 135852 S 0.0 1.8 8:12.52 Web Content
 200198 jamie 20 0 2980312 268940 40484 S 1.3 1.7 13:50.98 Web Content
   5396 jamie 20 0 1032416 250628 0 S 0.0 1.5 11:31.16 multipass.gui
 200112 jamie 20 0 3001112 246340 42248 S 0.0 1.5 3:49.98 WebExtensions
 185759 jamie 20 0 2350136 159272 3984 S 0.0 1.0 2:57.77 rhythmbox
   4458 jamie 20 0 1161904 158884 124960 S 12.0 1.0 172:18.63 Xorg
...

Revision history for this message
Seth Arnold (seth-arnold) wrote :

BTW, this is still happening in:

Linux millbarge 5.4.0-20-generic #24-Ubuntu SMP Mon Mar 23 20:55:46 UTC 2020 x86_64 x86_64 x86_64 GNU/Linux

I've seen it both with firefox in trello, firefox in launchpad (typing this comment) and doing two sequential wgets of http://releases.ubuntu.com/18.04/ubuntu-18.04.4-desktop-amd64.iso (a 1.8 GB file).

Thanks

Revision history for this message
Seth Arnold (seth-arnold) wrote :

I'm adding the champagne tag to this bug to bring it to a potential wider audience; I think we may need to take more drastic steps like disabling swap on upgrades, not offering swap in our installers, etc., to try to have a better experience.

Thanks

tags: added: champagne
Revision history for this message
Sultan Alsawaf (kerneltoast) wrote :

This problem is caused by an upstream memory management feature called watermark boosting. Normally, when a memory allocation fails and falls back to the page allocator, the page allocator will wake up kswapd to free up pages in order to make the memory allocation succeed. kswapd tries to free memory until it reaches a minimum amount of memory for each memory zone called the high watermark.

What watermark boosting does is try to preemptively fire up kswapd to free memory when there hasn't been an allocation failure. It does this by increasing kswapd's high watermark goal and then firing up kswapd. The reason why this causes freezes is because, with the increased high watermark goal, kswapd will steal memory from processes that need it in order to make forward progress. These processes will, in turn, try to allocate memory again, which will cause kswapd to steal necessary pages from those processes again, in a positive feedback loop known as page thrashing. When page thrashing occurs, your system is essentially livelocked until the necessary forward progress can be made to stop processes from trying to continuously allocate memory and trigger kswapd to steal it back.

This problem already occurs with kswapd *without* watermark boosting, but it's usually only encountered on machines with a small amount of memory and/or a slow CPU. Watermark boosting just makes the existing problem worse enough to notice on higher spec'd machines.

To fix the issue in this bug, watermark boosting can be disabled with the following:
# echo 0 > /proc/sys/vm/watermark_boost_factor

There's really no harm in doing so, because watermark boosting is an inherently broken feature...

Andrea Righi (arighi)
description: updated
Revision history for this message
Sultan Alsawaf (kerneltoast) wrote :

FYI, this bug has nothing to do with the use of swap. It just happens that the slow writeback incurred by using a swap device backed by non-volatile memory makes kswapd's bouts of page thrashing last longer, enough to the point where there's a visual freeze.

Revision history for this message
Seth Arnold (seth-arnold) wrote :

Sultan, thanks for the advice.

I set this watermark boost factor to zero as you suggested, and then decided to try a stupid simple benchmark of my storage -- my swap is a zfs dataset on nvme. zfs means it'll go slower than raw nvme block access:

$ dd if=ubuntu-18.04.4-desktop-amd64.iso of=foo bs=$((1024*1024)) conv=fsync
2028+1 records in
2028+1 records out
2126544896 bytes (2.1 GB, 2.0 GiB) copied, 6.91399 s, 308 MB/s

And this is a sequential workload with large blocks rather than random with pages, but it gives the flavour of the speed well enough.

Anyway, my system hung again when re-focusing firefox even after changing the watermark boost factor to zero. It was unusable for ten or twenty seconds. I have loads of free memory, it shouldn't be in swap:

$ free -h
              total used free shared buff/cache available
Mem: 15Gi 11Gi 1.6Gi 643Mi 2.2Gi 2.8Gi
Swap: 2.0Gi 23Mi 2.0Gi

Thanks

Revision history for this message
Sultan Alsawaf (kerneltoast) wrote :

As a note: https://platform.leolabs.space/visualizations/leo is not a valid reproducer for this bug, since the lags it causes are from overloading the GPU, not from stressing memory.

Revision history for this message
Seth Arnold (seth-arnold) wrote :

Sultan put together a kernel with some debugging for me:

[101616.889859] __alloc_pages_nodemask: stall of 3683ms for order-0, mask: 0x100dca
[101616.889863] Call Trace:
[101616.889880] __alloc_pages_nodemask+0x34f/0x3b0
[101616.889887] alloc_pages_vma+0x7f/0x200
[101616.889893] do_anonymous_page+0x118/0x650
[101616.889898] __handle_mm_fault+0x760/0x7a0
[101616.889903] handle_mm_fault+0xca/0x200
[101616.889911] do_user_addr_fault+0x1f9/0x450
[101616.889917] __do_page_fault+0x58/0x90
[101616.889925] ? syscall_slow_exit_work+0x123/0x150
[101616.889930] do_page_fault+0x2c/0xe0
[101616.889938] page_fault+0x34/0x40
[101616.889943] RIP: 0033:0x7f0e38f80190
[101616.889949] Code: 78 04 41 bf 01 00 00 00 44 39 ef 0f 87 4f 01 00 00 41 83 c5 01 48 ba 00 00 00 00 00 80 fa ff 44 89 ee 48 8d 0c f9 48 c1 e6 03 <48> 89 11 48 83 c1 08 48 8b 7b 18 48 01 f7 48 39 f9 75 ed 44 89 68
[101616.889951] RSP: 002b:00007ffd1b151780 EFLAGS: 00010202
[101616.889955] RAX: 00007f0e27619000 RBX: 000006b3e615a810 RCX: 00007f0e27634000
[101616.889958] RDX: fffa800000000000 RSI: 000000000001aff8 RDI: 00000000000035fe
[101616.889960] RBP: 0000000000006bfd R08: 00007ffd1b151820 R09: 00000000292d95be
[101616.889962] R10: 0000000000000011 R11: 00007f0e2632d000 R12: 00007f0e3082e000
[101616.889964] R13: 00000000000035ff R14: 0000000000000001 R15: 0000000000000001

The trigger in this case was firefox loading https://code.launchpad.net/~ahasenack/qa-regression-testing/+git/qa-regression-testing/+merge/381582

Recent calls to journalctl > /tmp/foo had lead to a minute-long hang (though the mouse did make ~10ms motion every six or seven seconds), swap growth from 1MB to 62M, but no stacktraces.

Thanks

Stefan Bader (smb)
Changed in linux (Ubuntu Focal):
status: Confirmed → Fix Committed
Revision history for this message
Mason Loring Bliss (y-mason) wrote :

Reporter hasn't confirmed that it's corrected yet... "Fix committed" seems premature.

Changed in linux (Ubuntu Focal):
status: Fix Committed → Confirmed
Revision history for this message
Seth Arnold (seth-arnold) wrote :
Download full text (7.0 KiB)

Stefan, while recent kernels seem happier than previous kernels (I think -14 era was terrible), I don't think this problem is fixed yet:

sarnold@millbarge:/tmp$ uname -a
Linux millbarge 5.4.0-21-generic #25-Ubuntu SMP Sat Mar 28 13:10:28 UTC 2020 x86_64 x86_64 x86_64 GNU/Linux
sarnold@millbarge:/tmp$ uptime
 17:18:02 up 14:59, 8 users, load average: 0.36, 0.27, 0.16
sarnold@millbarge:/tmp$ time journalctl > foo

real 0m47.135s
user 0m36.886s
sys 0m10.180s
sarnold@millbarge:/tmp$ ls -lh /tmp/foo
-rw-rw-r-- 1 sarnold sarnold 841M Apr 3 17:15 /tmp/foo
sarnold@millbarge:/tmp$ wc -l /tmp/foo
4045530 /tmp/foo

During that journalctl run, my mouse could not move for twenty seconds. I left an iovisor tool running to log high-latency __alloc_pages_nodemask() calls (identified by Sultan as a likely cause):

# funcslower-bpfcc __alloc_pages_nodemask
Tracing function calls slower than 1 ms... Ctrl+C to quit.
COMM PID LAT(ms) RVAL FUNC
dp_sync_taskq 716 7.72 fffff2d109115040 __alloc_pages_nodemask
systemd-udevd 44671 1.59 fffff2d10575b800 __alloc_pages_nodemask
git 46218 4.27 fffff2d10500edc0 __alloc_pages_nodemask
z_wr_iss 683 4.90 fffff2d101b4cf40 __alloc_pages_nodemask
systemd-udevd 51359 4.27 fffff2d10ec12800 __alloc_pages_nodemask
systemd-udevd 56585 2.48 fffff2d10d3b0500 __alloc_pages_nodemask
systemd-udevd 56577 2.55 fffff2d106f73380 __alloc_pages_nodemask
systemd-udevd 72552 2.09 fffff2d105249680 __alloc_pages_nodemask
systemd-udevd 76964 1.53 fffff2d1017fd440 __alloc_pages_nodemask
systemd-udevd 80900 3.94 fffff2d102675200 __alloc_pages_nodemask
systemd-udevd 88669 2.09 fffff2d1015fb5c0 __alloc_pages_nodemask
kthreadd 2 4.41 fffff2d10d6f96c0 __alloc_pages_nodemask
kthreadd 2 4.65 fffff2d10d38c3c0 __alloc_pages_nodemask
kthreadd 2 3.97 fffff2d10d3cc780 __alloc_pages_nodemask
updatedb.mloca 106194 2.44 fffff2d10e1c2680 __alloc_pages_nodemask
kthreadd 2 1.20 fffff2d10d36ff40 __alloc_pages_nodemask
kthreadd 2 4.13 fffff2d10d495940 __alloc_pages_nodemask
kthreadd 2 6.34 fffff2d10d317180 __alloc_pages_nodemask
kthreadd 2 4.76 fffff2d110e85000 __alloc_pages_nodemask
systemd-udevd 113822 1.51 fffff2d10f5c8e00 __alloc_pages_nodemask
systemd-udevd 113820 1.91 fffff2d10e611800 __alloc_pages_nodemask
systemd-udevd 122017 3.13 fffff2d10e539700 __alloc_pages_nodemask
systemd-udevd 141102 2.43 fffff2d104487200 __alloc_pages_nodemask
systemd-udevd 144565 4.82 fffff2d106b84040 __alloc_pages_nodemask
systemd-udevd 1156 1.28 fffff2d102325c00 __alloc_pages_nodemask
systemd-udevd 152486 3.07 fffff2d101516500 __alloc_pages_nodemask
systemd-udevd 152485 4.94 fffff2d1015cbbc0 __alloc_pages_nodemask
systemd-udevd 172807 1.33 fffff2d1015f4700 __alloc_pages_nodemask
systemd-udevd 185503 4.46 fffff2d10d321240 __alloc_pages_nodemask
systemd-udevd 189244 5.12 fffff2d10e7f0b00 __alloc_pages_nodemask
systemd-udevd 205973 5.36 fffff2d10d4171c0 __alloc_pages_nodemask
systemd-udevd 210133 5.19 ff...

Read more...

Revision history for this message
Andrea Righi (arighi) wrote :

TL;DR @seth-arnold, as a test can you try to set the following options?

  $ echo $((32 * 1024 * 1024)) | sudo tee /proc/sys/vm/dirty_bytes
  $ echo $((32 * 1024 * 1024)) | sudo tee /proc/sys/vm/dirty_background_bytes

Repeat the test and see if the system is still unresponsive.

Details below.

This is what I think it's happening in this last scenario: interactive performance killed when a large I/O writer is running.

The large I/O writer generates a lot of dirty pages, nothing is forcing to sync those pages to the backing store until the dirty_ratio (=20%) / dirty_background_ratio (=10%) thresholds are hit. And they can be quite high with the default settings in systems with a lot of RAM.

For example in a system with 16GB of free/reclaimable memory, the amount of dirty memory that is allowed before a writer is actively forced to flush those pages to the backing store is: 16GB * 20 / 100 = 3.2GB. Flusher threads are started when the amount of dirty pages is 16GB * 10 / 100 = 1.6GB of dirty memory.

So, if the writer doesn't stop, it will consume all the free pages in the system and at that point we are going to have a lot of dirty pages. Then the kernel needs to decide what to do to free up some pages.

Reclaimable memory is the first choice: cached clean pages that already have a copy on the corresponding backing store are easy to reclaim, because they just need to be dropped from the page cache (no I/O involved). Dirty pages are more expensive to reclaim, because they need to be flushed to the backing store before freeing up the page. Same with anonymous memory that needs to be flushed to the swap device, before being able to re-use the page.

So when the system starts to reclaim some pages, we see some swap activity and we also see some I/O due to the flushing of the dirty pages. I think the system becomes sluggish, because there are too many dirty pages, the kernel is spending too much time to select the right pages to reclaim and interactive performance is killed.

This looks like a bug/regression in the kernel and I think we should definitely investigate more and track down the reason of the problem. In the meantime, as a test to prove this thoery I think we could try to reduce the amount of allowed dirty pages in the system, tuning the dirty thresholds: vm.dirty_bytes and vm.dirty_background_bytes (using the *_bytes tuners to have a more fine-grained control on those thresholds) and see if there are some benefits in the specific scenario reported by Seth.

Revision history for this message
Seth Arnold (seth-arnold) wrote :

Thanks Andrea, I don't think that helped. I'll attach a file with vmstat 1 output and funclatency output, along with a few notes on the testing.

Thanks

Revision history for this message
Seth Arnold (seth-arnold) wrote :

I should point out that the period bursts of writes every five seconds in my vmstat 1 output is due to zfs's flushing mechanism; by default it flushes dirty pages every five seconds.

Revision history for this message
Andrea Righi (arighi) wrote :

Hi Seth, sorry for my late response.

I did more tests this morning on my laptop tracing the callers of __alloc_pages_nodemask() and I noticed that pretty much all the time it is called by the i915 shrinker. So I tried to disable it and I have to say that on my laptop (at least) the system is always very responsive now even when the system is swapping out pages and I generate a lot of I/O.

I've uploaded this test kernel here (5.4.0-24.28+lp1861359v1):
https://kernel.ubuntu.com/~arighi/LP-1861359/

Can you do one more test with it and verify if it does some improvements also in your case?

Probably it's not yet a valid solution, but if it works it would tell us that the problem is exactly there. Thanks!

Revision history for this message
Andrea Righi (arighi) wrote :

I've uploaded another test kernel (5.4.0-24.28+lp1861359v2):
https://kernel.ubuntu.com/~arighi/LP-1861359/

In this one, instead of completely disabling the i915 shrinker, I'm only preventing to swap out the i915 caches when the system is short on memory.

I'm testing this new one on my laptop right now and it seems to be a better compromise, since it still allows to free up some i915 memory, but it also prevents to completely kill interactive performance.

Revision history for this message
Launchpad Janitor (janitor) wrote :
Download full text (35.2 KiB)

This bug was fixed in the package linux - 5.4.0-24.28

---------------
linux (5.4.0-24.28) focal; urgency=medium

  * focal/linux: 5.4.0-24.28 -proposed tracker (LP: #1871939)

  * getitimer returns it_value=0 erroneously (LP: #1349028)
    - [Config] CONTEXT_TRACKING_FORCE policy should be unset

  * 12d1:1038 Dual-Role OTG device on non-HNP port - unable to enumerate USB
    device on port 1 (LP: #1047527)
    - [Config] USB_OTG_FSM policy not needed

  * Add DCPD backlight support for HP CML system (LP: #1871589)
    - SAUCE: drm/i915: Force DPCD backlight mode for HP CML 2020 system

  * Backlight brightness cannot be adjusted using keys (LP: #1860303)
    - SAUCE drm/i915: Force DPCD backlight mode for HP Spectre x360 Convertible
      13t-aw100

  * CVE-2020-11494
    - slcan: Don't transmit uninitialized stack data in padding

  * Ubuntu Kernel Support for OpenPOWER NV Secure & Trusted Boot (LP: #1866909)
    - powerpc: Detect the secure boot mode of the system
    - powerpc/ima: Add support to initialize ima policy rules
    - powerpc: Detect the trusted boot state of the system
    - powerpc/ima: Define trusted boot policy
    - ima: Make process_buffer_measurement() generic
    - certs: Add wrapper function to check blacklisted binary hash
    - ima: Check against blacklisted hashes for files with modsig
    - powerpc/ima: Update ima arch policy to check for blacklist
    - powerpc/ima: Indicate kernel modules appended signatures are enforced
    - powerpc/powernv: Add OPAL API interface to access secure variable
    - powerpc: expose secure variables to userspace via sysfs
    - x86/efi: move common keyring handler functions to new file
    - powerpc: Load firmware trusted keys/hashes into kernel keyring
    - x86/efi: remove unused variables

  * [roce-0227]sync mainline kernel 5.6rc3 roce patchset into ubuntu HWE kernel
    branch (LP: #1864950)
    - RDMA/hns: Cleanups of magic numbers
    - RDMA/hns: Optimize eqe buffer allocation flow
    - RDMA/hns: Add the workqueue framework for flush cqe handler
    - RDMA/hns: Delayed flush cqe process with workqueue
    - RDMA/hns: fix spelling mistake: "attatch" -> "attach"
    - RDMA/hns: Initialize all fields of doorbells to zero
    - RDMA/hns: Treat revision HIP08_A as a special case
    - RDMA/hns: Use flush framework for the case in aeq
    - RDMA/hns: Stop doorbell update while qp state error
    - RDMA/hns: Optimize qp destroy flow
    - RDMA/hns: Optimize qp context create and destroy flow
    - RDMA/hns: Optimize qp number assign flow
    - RDMA/hns: Optimize qp buffer allocation flow
    - RDMA/hns: Optimize qp param setup flow
    - RDMA/hns: Optimize kernel qp wrid allocation flow
    - RDMA/hns: Optimize qp doorbell allocation flow
    - RDMA/hns: Check if depth of qp is 0 before configure

  * [hns3-0316]sync mainline kernel 5.6rc4 hns3 patchset into ubuntu HWE kernel
    branch (LP: #1867586)
    - net: hns3: modify an unsuitable print when setting unknown duplex to fibre
    - net: hns3: add enabled TC numbers and DWRR weight info in debugfs
    - net: hns3: add support for dump MAC ID and loopback status in debugfs
    - net: hns3: add missing help info for QS shaper...

Changed in linux (Ubuntu Focal):
status: Confirmed → Fix Released
Revision history for this message
Jamie Strandboge (jdstrand) wrote :

This entry:

  * swap storms kills interactive use (LP: #1861359)
    - SAUCE: mm/page_alloc.c: disable memory reclaim watermark boosting by default

closed this bug, but per latest comments, that isn't sufficient to address the issue. Putting back to Confirmed.

Changed in linux (Ubuntu Focal):
status: Fix Released → Confirmed
Revision history for this message
Seth Arnold (seth-arnold) wrote :

Andrea, I've been running the v1 kernel for a day or so now:

[ 0.000000] Linux version 5.4.0-24-generic (arighi@sita) (gcc version 9.3.0 (Ubuntu 9.3.0-10ubuntu1)) #28+lp1861359v1 SMP Wed Apr 15 14:49:33 UTC 2020 (Ubuntu 5.4.0-24.28+lp1861359v1-generic 5.4.30)

$ uptime
 02:21:45 up 1 day, 9 min, 14 users, load average: 0.42, 0.25, 0.23

and have not seen any hangs or stalls of any sort in the ~ten hours that I've been using it.

Normally these problems start to appear about a day into use, so it's perhaps still early to declare success, but this is looking promising.

Even loading https://platform.leolabs.space/visualization in firefox just causes momentary glitches in mouse movement, like 50ms kind of range. (I know this range is pretty poor for humans to tell, but it sure feels like less than half a second by a lot.)

Thanks

Revision history for this message
Andrea Righi (arighi) wrote :

Seth, thanks for the update!

JFYI, I've just upladed also a v3 kernel (5.4.0-24.28+lp1861359v3) that I'm currently testing on my laptop with positive result. This change is even smaller than the previous one (v2), because we simply disable the direct swap out in the i915 shrinker (I915_SHRINK_WRITEBACK). It's really a one-line change, so the regression potential is very minimal.

If I don't find any problem I would consider to include this last fix. It would be great if you could also do a test with this one (so please ignore the v2 kernel). Thanks!

Andrea Righi (arighi)
description: updated
Revision history for this message
Launchpad Janitor (janitor) wrote :

This bug was fixed in the package linux - 5.4.0-26.30

---------------
linux (5.4.0-26.30) focal; urgency=medium

  * focal/linux: 5.4.0-26.30 -proposed tracker (LP: #1873882)

  * Packaging resync (LP: #1786013)
    - update dkms package versions

  * swap storms kills interactive use (LP: #1861359)
    - SAUCE: drm/i915: prevent direct writeback from the shrinker

  * 5.4.0-24.28 does not seem to apply rtprio, whereas -21 does. (LP: #1873315)
    - [Config] lowlatency: turn off RT_GROUP_SCHED

  * [RTL810xE] No ethernet connection (LP: #1871182)
    - net: phy: realtek: fix handling of RTL8105e-integrated PHY

 -- Andrea Righi <email address hidden> Mon, 20 Apr 2020 18:33:49 +0200

Changed in linux (Ubuntu Focal):
status: Confirmed → Fix Released
Revision history for this message
Seth Arnold (seth-arnold) wrote :

I was reminded of this bug earlier today -- Andrea, Sultan, thanks so much for fixing my issues. I've been happily running along for months now. :) Thanks!

Jeff Lane  (bladernr)
tags: added: ubuntu-certified
To post a comment you must log in.