le9h.patch revision 4: removed CONFIG_PROC_FS / CONFIG_IKCONFIG_PROC because I remembered that IKCONFIG_PROC was initially used by le9e.patch as a personal preference(so I'd know whether CONFIG_LE9D_PATCH was part of the kernel ie. was patch applied to the running kernel?), but for everyone else it's not needed because the presence of the sysctl values gives away that the patch was applied or not! revision 3: replaced CONFIG_IKCONFIG_PROC dependency with CONFIG_PROC_FS ^ thanks to mikhailnov for pointing it out here https://gist.github.com/constantoverride/84eba764f487049ed642eb2111a20830#gistcomment-3015618 revision 2: fixed 1 compilation warning. warning: there's a regression that is fixed by reverting commit aa56a292ce623734ddd30f52d73f527d1f3529b5 which is required to use this patch successfully(even tho this regression happens without this patch too!) see https://bugzilla.kernel.org/show_bug.cgi?id=203317#c4 ^ already reverted in v5.3.0 kernel! thanks! this is licensed under all/any of: Apache License, Version 2.0 MIT license 0BSD CC0 UNLICENSE Interesting, a phoronix user here https://www.phoronix.com/forums/forum/phoronix/general-discussion/1118164-yes-linux-does-bad-in-low-ram-memory-pressure-situations-on-the-desktop/page17#post1120291 linked to two older patches(that I was completely unaware of) which are supposedly doing this same thing (9yrs ago for the first, Oct 23, 2018 or earlier for the second): https://lore.kernel.org/patchwork/patch/222042/ https://gitlab.freedesktop.org/seanpaul/dpu-staging/commit/0b992f2dbb044896c3584e10bd5b97cf41e2ec6d they seem way more complicated though, at first glance. but hey, they were at least aware of the issue and how to fix it! see also: https://gitlab.freedesktop.org/hadess/low-memory-monitor/ earlyoom etc. ------ ok, on i87k, with commit aa56a292ce623734ddd30f52d73f527d1f3529b5 reverted, I still hit that bug and the second time it crashed kernel (kdump one seemingly failed to get a good dump, assuming it was stuck with high cpu usage and no disk activity): first time: Sep 01 10:19:49 i87k gpg-agent[1173]: handler 0x71943a213700 for fd 10 terminated Sep 01 10:20:24 i87k systemd-journald[414]: Missed 5 kernel messages Sep 01 10:20:24 i87k kernel: WARNING: CPU: 6 PID: 32590 at fs/ext4/inode.c:3941 ext4_set_page_dirty+0x39/0x50 Sep 01 10:20:24 i87k kernel: Modules linked in: tcp_diag udp_diag inet_diag af_packet_diag vfat fat xt_comment msr usb_storage x t_TCPMSS iptable_mangle iptable_security iptable_nat nf_nat iptable_raw nf_log_ipv4 nf_log_common xt_conntrack xt_LOG xt_connlim it nf_conncount nf_conntrack nf_defrag_ipv4 xt_hashlimit xt_multiport xt_owner snd_hda_codec_hdmi xt_addrtype snd_hda_codec_real tek snd_hda_codec_generic intel_rapl_msr intel_rapl_common x86_pkg_temp_thermal i915 intel_powerclamp coretemp crct10dif_pclmul i2c_algo_bit snd_hda_intel crc32_pclmul drm_kms_helper snd_hda_codec crc32c_intel syscopyarea snd_hwdep sysfillrect snd_hda_core ghash_clmulni_intel sysimgblt snd_pcm fb_sys_fops intel_cstate iTCO_wdt snd_timer drm iTCO_vendor_support intel_uncore snd bfq intel_rapl_perf e1000e soundcore pcspkr i2c_i801 drm_panel_orientation_quirks xhci_pci xhci_hcd Sep 01 10:20:24 i87k kernel: CPU: 6 PID: 32590 Comm: kworker/u24:2 Kdump: loaded Tainted: G U 5.3.0-rc4-gd45331b0 0ddb #51 Sep 01 10:20:24 i87k kernel: Hardware name: System manufacturer System Product Name/PRIME Z370-A, BIOS 2201 05/27/2019 Sep 01 10:20:24 i87k kernel: Workqueue: i915 __i915_gem_free_work [i915] Sep 01 10:20:24 i87k kernel: RIP: 0010:ext4_set_page_dirty+0x39/0x50 Sep 01 10:20:24 i87k kernel: Code: 48 8b 00 a8 01 75 16 48 8b 57 08 48 8d 42 ff 83 e2 01 48 0f 44 c7 48 8b 00 a8 08 74 0d 48 8b 07 f6 c4 20 74 0f e9 37 ef f8 ff <0f> 0b 48 8b 07 f6 c4 20 75 f1 0f 0b e9 26 ef f8 ff 66 0f 1f 44 00 Sep 01 10:20:24 i87k kernel: RSP: 0018:ffffb33981fbfd90 EFLAGS: 00010246 Sep 01 10:20:24 i87k kernel: RAX: 002fffc000002056 RBX: ffffa43a43edf000 RCX: 0000000000000000 Sep 01 10:20:24 i87k kernel: RDX: 0000000000000000 RSI: 000000037a133000 RDI: fffff61192614b80 Sep 01 10:20:24 i87k kernel: RBP: fffff61192614b80 R08: 0000000000000000 R09: 0000000000000000 Sep 01 10:20:24 i87k kernel: R10: 0000000000000000 R11: 0000000000000000 R12: 000000000049852e Sep 01 10:20:24 i87k kernel: R13: ffffa44089d8a400 R14: ffffa43fb394f4c0 R15: 0000000000000000 Sep 01 10:20:24 i87k kernel: FS: 0000000000000000(0000) GS:ffffa4416d980000(0000) knlGS:0000000000000000 Sep 01 10:20:24 i87k kernel: CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 Sep 01 10:20:24 i87k kernel: CR2: 00005cf34686d000 CR3: 00000002b700a005 CR4: 00000000003606e0 Sep 01 10:20:24 i87k kernel: DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 Sep 01 10:20:24 i87k kernel: DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 Sep 01 10:20:24 i87k kernel: Call Trace: Sep 01 10:20:24 i87k kernel: i915_gem_userptr_put_pages+0x135/0x180 [i915] Sep 01 10:20:24 i87k kernel: __i915_gem_object_put_pages+0x50/0x90 [i915] Sep 01 10:20:24 i87k kernel: __i915_gem_free_objects+0x115/0x1d0 [i915] Sep 01 10:20:24 i87k kernel: __i915_gem_free_work+0x5f/0x90 [i915] Sep 01 10:20:24 i87k kernel: process_one_work+0x16b/0x2a0 Sep 01 10:20:24 i87k kernel: worker_thread+0x48/0x390 Sep 01 10:20:24 i87k kernel: kthread+0xee/0x130 Sep 01 10:20:24 i87k kernel: ? process_one_work+0x2a0/0x2a0 Sep 01 10:20:24 i87k kernel: ? kthread_park+0x70/0x70 Sep 01 10:20:24 i87k kernel: ret_from_fork+0x1f/0x30 Sep 01 10:20:24 i87k kernel: ---[ end trace 1d52011dba5f6fdf ]--- Sep 01 10:20:40 i87k glibc64:../sysdeps/posix/getaddrinfo.c:2201/getaddrinfo[38799]: curl[38799](full:'curl') for user root(0(ef f:root(0))) 1of2 attempting to resolve (requested)hostname: raw.githubusercontent.com and a little later: Sep 01 11:52:42 i87k systemd-journald[414]: Missed 23 kernel messages Sep 01 11:52:42 i87k kernel: WARNING: CPU: 11 PID: 12590 at fs/ext4/inode.c:3942 ext4_set_page_dirty+0x43/0x50 Sep 01 11:52:42 i87k kernel: Modules linked in: tcp_diag udp_diag inet_diag af_packet_diag vfat fat xt_comment msr usb_storage xt_TCPMSS iptable_mangle iptable_security iptable_nat nf_nat iptable_raw nf_log_ipv4 nf_log_common xt_conntrack xt_LOG xt_connlimit nf_conncount nf_conntrack nf_defrag_ipv4 xt_hashlimit xt_multiport xt_owner snd_hda_codec_hdmi xt_addrtype snd_hda_codec_realtek snd_hda_codec_generic intel_rapl_msr intel_rapl_common x86_pkg_temp_thermal i915 intel_powerclamp coretemp crct10dif_pclmul i2c_algo_bit snd_hda_intel crc32_pclmul drm_kms_helper snd_hda_codec crc32c_intel syscopyarea snd_hwdep sysfillrect snd_hda_core ghash_clmulni_intel sysimgblt snd_pcm fb_sys_fops intel_cstate iTCO_wdt snd_timer drm iTCO_vendor_support intel_uncore snd bfq intel_rapl_perf e1000e soundcore pcspkr i2c_i801 drm_panel_orientation_quirks xhci_pci xhci_hcd Sep 01 11:52:42 i87k kernel: CPU: 11 PID: 12590 Comm: kworker/u24:1 Kdump: loaded Tainted: G U W 5.3.0-rc4-gd45331b00ddb #51 Sep 01 11:52:42 i87k kernel: Hardware name: System manufacturer System Product Name/PRIME Z370-A, BIOS 2201 05/27/2019 Sep 01 11:52:42 i87k kernel: Workqueue: i915 __i915_gem_free_work [i915] Sep 01 11:52:42 i87k kernel: RIP: 0010:ext4_set_page_dirty+0x43/0x50 Sep 01 11:52:42 i87k kernel: Code: 08 48 8d 42 ff 83 e2 01 48 0f 44 c7 48 8b 00 a8 08 74 0d 48 8b 07 f6 c4 20 74 0f e9 37 ef f8 ff 0f 0b 48 8b 07 f6 c4 20 75 f1 <0f> 0b e9 26 ef f8 ff 66 0f 1f 44 00 00 41 54 55 53 49 89 fc 89 d5 Sep 01 11:52:42 i87k kernel: RSP: 0018:ffffb3398acd7d90 EFLAGS: 00010246 Sep 01 11:52:42 i87k kernel: RAX: 002fffc000020016 RBX: ffffa43a9404e000 RCX: 0000000000000000 Sep 01 11:52:42 i87k kernel: RDX: 0000000000000000 RSI: 0000000607983000 RDI: fffff6119b863fc0 Sep 01 11:52:42 i87k kernel: RBP: fffff6119b863fc0 R08: 0000000000000000 R09: 0000000000000000 Sep 01 11:52:42 i87k kernel: R10: 0000000000000018 R11: 0000000000000000 R12: 00000000006e18ff Sep 01 11:52:42 i87k kernel: R13: ffffa43a7ff0b600 R14: ffffa43fbc823ee0 R15: 0000000000000000 Sep 01 11:52:42 i87k kernel: FS: 0000000000000000(0000) GS:ffffa4416dac0000(0000) knlGS:0000000000000000 Sep 01 11:52:42 i87k kernel: CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 Sep 01 11:52:42 i87k kernel: CR2: 00005cf34711a620 CR3: 00000002b700a004 CR4: 00000000003606e0 Sep 01 11:52:42 i87k kernel: DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 Sep 01 11:52:42 i87k kernel: DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 Sep 01 11:52:42 i87k kernel: Call Trace: Sep 01 11:52:42 i87k kernel: i915_gem_userptr_put_pages+0x135/0x180 [i915] Sep 01 11:52:42 i87k kernel: __i915_gem_object_put_pages+0x50/0x90 [i915] Sep 01 11:52:42 i87k kernel: __i915_gem_free_objects+0x115/0x1d0 [i915] Sep 01 11:52:42 i87k kernel: __i915_gem_free_work+0x5f/0x90 [i915] Sep 01 11:52:42 i87k kernel: process_one_work+0x16b/0x2a0 Sep 01 11:52:42 i87k kernel: worker_thread+0x48/0x390 Sep 01 11:52:42 i87k kernel: kthread+0xee/0x130 Sep 01 11:52:42 i87k kernel: ? process_one_work+0x2a0/0x2a0 Sep 01 11:52:42 i87k kernel: ? kthread_park+0x70/0x70 Sep 01 11:52:42 i87k kernel: ret_from_fork+0x1f/0x30 Sep 01 11:52:42 i87k kernel: ---[ end trace 1d52011dba5f6fe0 ]--- Sep 01 11:52:54 i87k gpg-agent[1173]: handler 0x71943aa14700 for fd 10 started Sep 01 11:52:54 i87k gpg-agent[1173]: handler 0x71943aa14700 for fd 10 terminated Sep 01 11:53:34 i87k sudo[13566]: user : TTY=pts/25 ; PWD=/home/user/build/1packages/4used/chro3/ungoogled-chromium-archlinux ; USER=root ; COMMAND=validate Sep 01 11:53:34 i87k sudo[13568]: user : TTY=pts/25 ; PWD=/home/user/build/1packages/4used/chro3/ungoogled-chromium-archlinux ; USER=root ; COMMAND=/usr/bin/journalctl --sync --flush Sep 01 11:53:34 i87k sudo[13568]: pam_unix(sudo:session): session opened for user root by (uid=0) Sep 01 11:53:34 i87k sudo[13568]: pam_unix(sudo:session): session closed for user root diff --git a/Documentation/admin-guide/sysctl/vm.rst b/Documentation/admin-guide/sysctl/vm.rst index 64aeee1009ca..d0f3f7080f03 100644 --- a/Documentation/admin-guide/sysctl/vm.rst +++ b/Documentation/admin-guide/sysctl/vm.rst @@ -68,6 +68,7 @@ Currently, these files are in /proc/sys/vm: - numa_stat - swappiness - unprivileged_userfaultfd +- unevictable_activefile_kbytes - user_reserve_kbytes - vfs_cache_pressure - watermark_boost_factor @@ -848,6 +849,69 @@ privileged users (with SYS_CAP_PTRACE capability). The default value is 1. +unevictable_activefile_kbytes +============================= + +How many kilobytes of `Active(file)` to never evict during high-pressure +low-memory situations. ie. never evict active file pages if under this value. +This will help prevent disk thrashing caused by Active(file) being close to zero +in such situations, especially when no swap is used. + +As 'nivedita' (phoronix user) put it: +"Executables and shared libraries are paged into memory, and can be paged out +even with no swap. [...] The kernel is dumping those pages and [...] immediately +reading them back in when execution continues." +^ and that's what's causing the disk thrashing during memory pressure. + +unevictable_activefile_kbytes=X will prevent X kbytes of those most used pages +from being evicted. + +The default value is 65536. That's 64 MiB. + +Set it to 0 to keep the default behaviour, as if this option was never +implemented, so you can see the disk thrashing as usual. + +To get an idea what value to use here for your workload(eg. xfce4 with idle +terminals) to not disk thrash at all, run this:: + + $ echo 1 | sudo tee /proc/sys/vm/drop_caches; grep -F 'Active(file)' /proc/meminfo + 1 + Active(file): 203444 kB + +so, using vm.unevictable_activefile_kbytes=203444 would be a good idea here. +(you can even add a `sleep` before the grep to get a slightly increased value, +which might be useful if something is compiling in the background and you want +to account for that too) + +But you can probably go with the default value of just 65536 (aka 64 MiB) +as this will eliminate most disk thrashing anyway, unless you're not using +an SSD, in which case it might still be noticeable (I'm guessing?). + +Note that `echo 1 | sudo tee /proc/sys/vm/drop_caches` can still cause +Active(file) to go a under the vm.unevictable_activefile_kbytes value. +It's not an issue and this is how you know how much the value for +vm.unevictable_activefile_kbytes should be, at the time/workload when you ran it. + +The value of `Active(file)` can be gotten in two ways:: + + $ grep -F 'Active(file)' /proc/meminfo + Active(file): 2712004 kB + +and:: + + $ grep nr_active_file /proc/vmstat + nr_active_file 678001 + +and multiply that with MAX_NR_ZONES (which is 4), ie. `nr_active_file * MAX_NR_ZONES` +so 678001*4=2712004 kB + +MAX_NR_ZONES is 4 as per: +`include/generated/bounds.h:10:#define MAX_NR_ZONES 4 /* __MAX_NR_ZONES */` +and is unlikely the change in the future. + +The hub of disk thrashing tests/explanations is here: +https://gist.github.com/constantoverride/84eba764f487049ed642eb2111a20830 + user_reserve_kbytes =================== diff --git a/kernel/sysctl.c b/kernel/sysctl.c index 078950d9605b..c2726324a176 100644 --- a/kernel/sysctl.c +++ b/kernel/sysctl.c @@ -110,6 +110,15 @@ extern int core_uses_pid; extern char core_pattern[]; extern unsigned int core_pipe_limit; #endif +#if defined(CONFIG_RESERVE_ACTIVEFILE_TO_PREVENT_DISK_THRASHING) +unsigned long sysctl_unevictable_activefile_kbytes __read_mostly = +#if CONFIG_RESERVE_ACTIVEFILE_KBYTES < 0 +#error "CONFIG_RESERVE_ACTIVEFILE_KBYTES should be >= 0" +#else + CONFIG_RESERVE_ACTIVEFILE_KBYTES +#endif +; +#endif extern int pid_max; extern int pid_max_min, pid_max_max; extern int percpu_pagelist_fraction; @@ -1691,6 +1701,15 @@ static struct ctl_table vm_table[] = { .extra1 = SYSCTL_ZERO, .extra2 = SYSCTL_ONE, }, +#endif +#if defined(CONFIG_RESERVE_ACTIVEFILE_TO_PREVENT_DISK_THRASHING) + { + .procname = "unevictable_activefile_kbytes", + .data = &sysctl_unevictable_activefile_kbytes, + .maxlen = sizeof(sysctl_unevictable_activefile_kbytes), + .mode = 0644, + .proc_handler = proc_doulongvec_minmax, + }, #endif { .procname = "user_reserve_kbytes", diff --git a/mm/Kconfig b/mm/Kconfig index 56cec636a1fc..d21b737ca32e 100644 --- a/mm/Kconfig +++ b/mm/Kconfig @@ -63,6 +63,39 @@ config SPARSEMEM_MANUAL endchoice +config RESERVE_ACTIVEFILE_TO_PREVENT_DISK_THRASHING + bool "Reserve some `Active(file)` to prevent disk thrashing" + depends on SYSCTL + def_bool y + help + Keep `Active(file)`(/proc/meminfo) pages in RAM so as to avoid system freeze + due to the disk thrashing(disk reading only) that occurrs because the running + executables's code is being evicted during low-mem conditions which is + why it also prevents oom-killer from triggering until 10s of minutes later + on some systems. + + Please see the value of CONFIG_RESERVE_ACTIVEFILE_KBYTES to set how many + KiloBytes of Active(file) to keep by default in the sysctl setting + vm.unevictable_activefile_kbytes + see Documentation/admin-guide/sysctl/vm.rst for more info + +config RESERVE_ACTIVEFILE_KBYTES + int "Set default value for vm.unevictable_activefile_kbytes" + depends on RESERVE_ACTIVEFILE_TO_PREVENT_DISK_THRASHING + default "65536" + help + This is the default value(in KiB) that vm.unevictable_activefile_kbytes gets. + A value of at least 65536 or at most 262144 is recommended for users + of xfce4 to avoid disk thrashing on low-memory/memory-pressure conditions, + ie. mouse freeze with constant disk activity (but you can still sysrq+f to + trigger oom-killer though, even without this mitigation) + + You can still sysctl set vm.unevictable_activefile_kbytes to a value of 0 + to disable this whole feature at runtime. + + see Documentation/admin-guide/sysctl/vm.rst for more info + see also CONFIG_RESERVE_ACTIVEFILE_TO_PREVENT_DISK_THRASHING + config DISCONTIGMEM def_bool y depends on (!SELECT_MEMORY_MODEL && ARCH_DISCONTIGMEM_ENABLE) || DISCONTIGMEM_MANUAL diff --git a/mm/vmscan.c b/mm/vmscan.c index dbdc46a84f63..0dcd4e2dc02d 100644 --- a/mm/vmscan.c +++ b/mm/vmscan.c @@ -2445,6 +2445,16 @@ static void get_scan_count(struct lruvec *lruvec, struct mem_cgroup *memcg, BUG(); } +#if defined(CONFIG_RESERVE_ACTIVEFILE_TO_PREVENT_DISK_THRASHING) + extern unsigned int sysctl_unevictable_activefile_kbytes; //FIXME: warning: ISO C90 forbids mixed declarations and code [-Wdeclaration-after-statement] //this just needs to be moved earlier in this same function (too lazy to, atm) + if (LRU_ACTIVE_FILE == lru) { //doneFIXME: warning: comparison between ‘enum node_stat_item’ and ‘enum lru_list’ [-Wenum-compare] //fixed: replaced NR_ACTIVE_FILE with LRU_ACTIVE_FILE (they are both == 3 ) + long long kib_active_file_now=global_node_page_state(NR_ACTIVE_FILE) * MAX_NR_ZONES; + if (kib_active_file_now <= sysctl_unevictable_activefile_kbytes) { + nr[lru] = 0; //ie. don't reclaim any Active(file) (see /proc/meminfo) if they are under sysctl_unevictable_activefile_kbytes see Documentation/admin-guide/sysctl/vm.rst and CONFIG_RESERVE_ACTIVEFILE_TO_PREVENT_DISK_THRASHING and CONFIG_RESERVE_ACTIVEFILE_KBYTES + continue; + } + } +#endif *lru_pages += size; nr[lru] = scan; }