Bug #1833281 “System freeze when memory is put on SWAP in Linux ...” : Bugs : linux package : Ubuntu

Revision history for this message

In Linux Kernel Bug Tracker #196729, netwiz (netwiz-linux-kernel-bugs) wrote on 2017-08-22:

#20

I have 10Gb of RAM in this system and run Fedora 26. If I launch Cities:
Skylines with no swap space, things run well performance wise until I get an
OOM - and it all dies - which is expected.

When I turn on swap to /dev/sda2 which resides on an SSD, I get complete
system freezes while swap is being accessed.

The first swap was after loading a saved game, then launching kmail in the
background. This caused ~500Mb to be swapped to /dev/sda2 on an SSD. The
system froze for about 8 minutes - barely being able to move the mouse. The
HDD LED was on constantly during the entire time.

To hopefully rule out the above glibc issue, I started the game via jemalloc -
but experienced even more severe freezes while swapping. I gave up waiting
after 13 minutes of non-responsiveness - not even being able to move the mouse
properly.

During these hangs, I could type into a Konsole window, and some of the
typing took 3+ minutes to display on the screen (yay for buffers?).

I have tested this with both the default vm.swappiness values, as well as the
following:
vm.swappiness = 1
vm.min_free_kbytes = 32768
vm.vfs_cache_pressure = 60

I noticed that when I do eventually get screen updates, all 8 cpus (4 cores /
2 threads) show 100% CPU usage - and kswapd is right up there in the process
list for CPU usage. Sadly I haven't been able to capture this information
fully yet due to said unresponsiveness.

(more to come in comments & attachments)

In Linux Kernel Bug Tracker #196729, netwiz (netwiz-linux-kernel-bugs) wrote on 2017-08-22:

#21

First - using kernel 4.10.17 - which does not show any issues in swapping:

I tried doing: swapoff /dev/sda2

Attached output as vmstat-4.10.17-10Gb-noswap.log

18:27:00 - Launched Cities: Skylines
18:27:30 - Started loading the saved game
18:28:25 - About this time the game started doing its thing. Started scrolling
around.
18:28:47 - System stopped responding and then the C:S was killed by the OOM
handler

In Linux Kernel Bug Tracker #196729, netwiz (netwiz-linux-kernel-bugs) wrote on 2017-08-22:

#22

Created attachment 258045
vmstat-4.10.17-10Gb-noswap.log (OK - OOM running)

In Linux Kernel Bug Tracker #196729, netwiz (netwiz-linux-kernel-bugs) wrote on 2017-08-22:

#23

Created attachment 258047
vmstat-4.10.17-10Gb.log (OK with swapping)

Second test, same kernel with swap turned on:

I have attached the vmstat output that goes with the following timestamps for
system utilisation:

15:32:30 - Launch Skylines
15:33:00 - Load the saved game
15:34:11 - Saved game loaded ok.
15:35:00 - Launch Chrome.
15:35:36 - Chrome launched - System responding ok.
15:36:00 - Browsing a few web sites
15:36:50 - Exit Chrome
15:37:30 - Exit Cities: Skylines.

You'll note that there are very few missing vmstat lines - however I did
notice the following missing:
        15:35:10
        15:35:12
        15:35:15
        15:35:20
        15:35:26
        15:35:29
        15:35:30

Attachment is vmstat-4.10.17-10Gb.log
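
Holes like the ones listed above can be found mechanically rather than by eye. A minimal Python sketch, assuming the log was sampled once per second, the timestamps have already been extracted as HH:MM:SS strings, and the run does not cross midnight (the function name and the sample data are illustrative, not taken from the attachment):

```python
from datetime import datetime, timedelta


def missing_samples(timestamps, interval_s=1):
    """Return the HH:MM:SS stamps absent between the first and last sample."""
    fmt = "%H:%M:%S"
    seen = sorted(datetime.strptime(stamp, fmt) for stamp in timestamps)
    if not seen:
        return []
    present = set(seen)
    missing = []
    cursor = seen[0]
    while cursor <= seen[-1]:
        if cursor not in present:
            missing.append(cursor.strftime(fmt))
        cursor += timedelta(seconds=interval_s)
    return missing


# Hypothetical excerpt: three one-second samples were never written.
sample = ["15:35:08", "15:35:09", "15:35:11", "15:35:13", "15:35:14", "15:35:16"]
print(missing_samples(sample))  # ['15:35:10', '15:35:12', '15:35:15']
```

A long run of consecutive missing stamps points at a stall; isolated single misses are more likely scheduling jitter.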

In Linux Kernel Bug Tracker #196729, netwiz (netwiz-linux-kernel-bugs) wrote on 2017-08-22:

#24

Created attachment 258049
vmstat-20Gb.log (OK - all in RAM)

Now using kernel 4.11.x (same happens with 4.12.x) - and testing with 20Gb of RAM in the system - meaning no swapping.

Attached as: vmstat-20Gb.log

Timestamps of events:
21:57 - launch the game from within Steam.
21:58:00 - Load the saved game.
21:58:48 - Saved game is loaded and I'm scrolling around in the map.
22:00:00 - Hit the quit to desktop button.
22:00:31 - Am back to desktop with all RAM free again.

In Linux Kernel Bug Tracker #196729, netwiz (netwiz-linux-kernel-bugs) wrote on 2017-08-22:

#25

Created attachment 258051
vmstat-10Gb.log (NOT OK - System Unresponsive)

I now drop back to 10Gb of RAM to test the swapping under 4.11.x kernel.

Log attached as vmstat-10Gb.log

Timestamps:
22:10:00 - Launched the game from within Steam
22:11:00 - Load the same saved game from the previous log
22:12:01 - Saved game is loaded and I can scroll around. Noted a slight
pause when swpd went to 256 - but otherwise all is well.
22:13:00 - Launched Google Chrome browser to make the system swap.

After this point, the whole system went to hell. You'll note many missing
vmstat entries up until around 22:22 when I managed to exit from the game
back to desktop via the normal means (and not getting annoyed and doing a
pkill from tty2).

As such, the system went nuts for ~9 minutes until I was able to exit the
game to stop things going nuts.

I note that with 20Gb RAM - as the system never touches swap, I can still
play the game, browse the web with the Chrome browser, read / write
email, and even watch a DVB-T broadcast in VLC without having any more
than a minor pause in the game for less than a second.

In Linux Kernel Bug Tracker #196729, netwiz (netwiz-linux-kernel-bugs) wrote on 2017-08-22:

#26

So overall, this seems to indicate a regression between kernel 4.10.x (I'm pretty sure I tested all ok with 4.10.15?) and the newer 4.11 and 4.12 builds.

I made contact with Rik van Riel and Ying Huang (whom I will attempt to add to this as CCs for comment). They don't believe it is a swapping issue; Rik's view is that:

> > There is ZERO swap space in use.
> >
> > In other words, it is not actually swapping,
> > but thrashing through the page cache.

You may want to email the people who worked on page cache
replacement stuff recently, and the linux-mm mailing list
as well.

In Linux Kernel Bug Tracker #196729, akpm (akpm-linux-kernel-bugs) wrote on 2017-08-22:

#27

(switched to email. Please respond via emailed reply-to-all, not via the
bugzilla web interface).

On Tue, 22 Aug 2017 11:17:08 +0000 <email address hidden> wrote:

> https://bugzilla.kernel.org/show_bug.cgi?id=196729
>
> Bug ID: 196729
> Summary: System becomes unresponsive when swapping - Regression
> since 4.10.x
> Product: Memory Management
> Version: 2.5
> Kernel Version: 4.11.x / 4.12.x
> Hardware: All
> OS: Linux
> Tree: Mainline
> Status: NEW
> Severity: normal
> Priority: P1
> Component: Page Allocator
> Assignee: <email address hidden>
> Reporter: <email address hidden>
> Regression: No

So it's "Regression: yes". More info at the bugzilla link.

> I have 10Gb of RAM in this system and run Fedora 26. If I launch Cities:
> Skylines with no swap space, things run well performance wise until I get an
> OOM - and it all dies - which is expected.
>
> When I turn on swap to /dev/sda2 which resides on an SSD, I get complete
> system freezes while swap is being accessed.
>
> The first swap was after loading a saved game, then launching kmail in the
> background. This caused ~500Mb to be swapped to /dev/sda2 on an SSD. The
> system froze for about 8 minutes - barely being able to move the mouse. The
> HDD LED was on constantly during the entire time.
>
> To hopefully rule out the above glibc issue, I started the game via jemalloc -
> but experienced even more severe freezes while swapping. I gave up waiting
> after 13 minutes of non-responsiveness - not even being able to move the mouse
> properly.
>
> During these hangs, I could type into a Konsole window, and some of the
> typing took 3+ minutes to display on the screen (yay for buffers?).
>
> I have tested this with both the default vm.swappiness values, as well as the
> following:
> vm.swappiness = 1
> vm.min_free_kbytes = 32768
> vm.vfs_cache_pressure = 60
>
> I noticed that when I do eventually get screen updates, all 8 cpus (4 cores /
> 2 threads) show 100% CPU usage - and kswapd is right up there in the process
> list for CPU usage. Sadly I haven't been able to capture this information
> fully yet due to said unresponsiveness.
>
> (more to come in comments & attachments)
>
> --
> You are receiving this mail because:
> You are the assignee for the bug.


In Linux Kernel Bug Tracker #196729, mhocko (mhocko-linux-kernel-bugs) wrote on 2017-08-23:

#28

Created attachment 258067
read_vmstat.c

On Tue 22-08-17 15:55:30, Andrew Morton wrote:
>
> (switched to email. Please respond via emailed reply-to-all, not via the
> bugzilla web interface).
>
> On Tue, 22 Aug 2017 11:17:08 +0000 <email address hidden> wrote:
[...]
> Sadly I haven't been able to capture this information
> > fully yet due to said unresponsiveness.

Please try to collect /proc/vmstat in the background and provide the
collected data. Something like

while true
do
  cat /proc/vmstat > vmstat.$(date +%s)
  sleep 1s
done

If the system turns out so busy that it won't be able to fork a process
or write the output (which you will see by checking timestamps of files
and looking for holes) then you can try the attached proggy
./read_vmstat output_file timeout output_size

Note you might need to increase the mlock rlimit to lock everything into
memory.
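
The attached read_vmstat.c is not reproduced in this thread, so the following is not that program. It is only a rough Python sketch of the underlying idea: keep a single long-lived process with both files already open, so that no new process has to be forked for each sample. The function name, the `=== <epoch>` separator and the parameters are assumptions made for illustration, and unlike the suggestion above it does not lock its memory with mlock.

```python
import time


def collect_vmstat(out_path, samples, interval_s=1.0, src="/proc/vmstat"):
    """Append `samples` timestamped snapshots of `src` to `out_path`."""
    with open(src) as stat, open(out_path, "a") as out:
        for i in range(samples):
            stat.seek(0)  # rewind so the next read returns fresh counters
            out.write(f"=== {int(time.time())}\n")
            out.write(stat.read())
            out.flush()
            if i + 1 < samples:
                time.sleep(interval_s)


# Example (not run here): collect_vmstat("vmstat.log", samples=600)
```

Gaps between consecutive `===` timestamps in the output then show when the collector itself was stalled.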

In Linux Kernel Bug Tracker #196729, netwiz (netwiz-linux-kernel-bugs) wrote on 2017-08-23:

#29

Created attachment 258069
8Gb-noswap.tar.gz

On Wednesday, 23 August 2017 11:38:48 PM AEST Michal Hocko wrote:
> On Tue 22-08-17 15:55:30, Andrew Morton wrote:
> > (switched to email. Please respond via emailed reply-to-all, not via the
> > bugzilla web interface).
>
> > On Tue, 22 Aug 2017 11:17:08 +0000 <email address hidden> wrote:
> [...]
> > Sadly I haven't been able to capture this information
> > > fully yet due to said unresponsiveness.
>
> Please try to collect /proc/vmstat in the background and provide the
> collected data. Something like
>
> while true
> do
>   cat /proc/vmstat > vmstat.$(date +%s)
>   sleep 1s
> done
>
> If the system turns out so busy that it won't be able to fork a process
> or write the output (which you will see by checking timestamps of files
> and looking for holes) then you can try the attached proggy
> ./read_vmstat output_file timeout output_size
>
> Note you might need to increase the mlock rlimit to lock everything into
> memory.

Thanks Michal,

I have upgraded PCs since I initially put together this data - however I was
able to get strange behaviour by pulling out an 8Gb RAM stick in my new system
- leaving it with only 8Gb of RAM.

All these tests are performed with Fedora 26 and kernel 4.12.8-300.fc26.x86_64

I have attached 3 files with output.

8Gb-noswap.tar.gz contains the output of /proc/vmstat running on 8Gb of RAM
with no swap. Under this scenario, I was expecting the OOM reaper to just kill
the game when memory allocated became too high for the amount of physical RAM.
Interestingly, you'll notice a massive hang in the output before the game is
terminated. I didn't see this before.

8Gb-swap-on-file.tar.gz contains the output of /proc/vmstat still with 8Gb of
RAM - but creating a file with swap on the PCIe SSD /swapfile with size 8Gb
via:
# dd if=/dev/zero of=/swapfile bs=1G count=8
# mkswap /swapfile
# swapon /swapfile

Some times (all in UTC+10):
23:58:30 - Start loading the saved game
23:59:38 - Load ok, all running fine
00:00:15 - Load Chrome
00:01:00 - Quit the game

The game seemed to run ok with no real issue - and a lot was swapped to the
swap file. I'm wondering if it was purely the speed of the PCIe SSD that
caused this appearance - as the creation of the file with dd completed at
~1.4GB/sec.

8Gb-swap-on-ssd.tar.gz contains adding a 32Gb SATA based SSD to the system and
using the entire block device as swap via:
# mkswap -f /dev/sda
# swapon /dev/sda

There are many pauses and unresponsiveness issues while this was loading -
however we eventually got there.

Some timings (all in UTC+10 again):
00:06:33 - Load the saved game
00:11:22 - Saved game loaded - somewhat responsive
00:12:00 - Load Chrome
00:13:07 - Quit the game + chrome

For the sake of information, the following is a speed test on the SSD in
question:
# dd if=/dev/zero of=/dev/sda bs=1M count=8192 conv=fsync
8192+0 records in
8192+0 records out
8589934592 bytes (8.6 GB, 8.0 GiB) copied, 44.923 s, 191 MB/s
# dd if=/dev/sda of=/dev/null bs=1M count=8192 conv=fsync
dd: fsync failed for '/dev/null': Invalid argument
8192+0 records in
8192+0 records out
8589934592 bytes (8....

Created attachment 258079
signature.asc

On Thursday, 24 August 2017 10:41:39 PM AEST Michal Hocko wrote:
> On Thu 24-08-17 00:30:40, Steven Haigh wrote:
> > On Wednesday, 23 August 2017 11:38:48 PM AEST Michal Hocko wrote:
> > > On Tue 22-08-17 15:55:30, Andrew Morton wrote:
> > > > (switched to email.  Please respond via emailed reply-to-all, not via
> > > > the
> > > > bugzilla web interface).
> > > > 
> > > > On Tue, 22 Aug 2017 11:17:08 +0000 bugzilla-daemon@bugzilla.kernel.org
> > 
> > wrote:
> > > [...]
> > > 
> > > > Sadly I haven't been able to capture this information
> > > > 
> > > > > fully yet due to said unresponsiveness.
> > > 
> > > Please try to collect /proc/vmstat in the background and provide the
> > > collected data. Something like
> > > 
> > > while true
> > > do
> > >   cat /proc/vmstat > vmstat.$(date +%s)
> > >   sleep 1s
> > > done
> > > 
> > > If the system turns out so busy that it won't be able to fork a process
> > > or write the output (which you will see by checking timestamps of files
> > > and looking for holes) then you can try the attached proggy
> > > ./read_vmstat output_file timeout output_size
> > > 
> > > Note you might need to increase the mlock rlimit to lock everything into
> > > memory.
> > 
> > Thanks Michal,
> > 
> > I have upgraded PCs since I initially put together this data - however I
> > was able to get strange behaviour by pulling out an 8Gb RAM stick in my
> > new system - leaving it with only 8Gb of RAM.
> > 
> > All these tests are performed with Fedora 26 and kernel
> > 4.12.8-300.fc26.x86_64
> > 
> > I have attached 3 files with output.
> > 
> > 8Gb-noswap.tar.gz contains the output of /proc/vmstat running on 8Gb of
> > RAM
> > with no swap. Under this scenario, I was expecting the OOM reaper to just
> > kill the game when memory allocated became too high for the amount of
> > physical RAM. Interestingly, you'll notice a massive hang in the output
> > before the game is terminated. I didn't see this before.
> 
> I have checked a few gaps, e.g. vmstat.1503496391 to vmstat.1503496451, which
> is one minute. The most notable thing is that there are only very few
> pagecache pages
>                         [base]          [diff]
> nr_active_file          1641            3345
> nr_inactive_file        1630            4787
> 
> So there is not much to reclaim without swap. The more important thing
> is that we keep reclaiming and refaulting that memory
> 
> workingset_activate     5905591         1616391
> workingset_refault      33412538        10302135
> pgactivate              42279686        13219593
> pgdeactivate            48175757        14833350
> 
> pgscan_kswapd           379431778       126407849
> pgsteal_kswapd          49751559        13322930
> 
> so we are effectively thrashing over the very small amount of
> reclaimable memory. This is something that we cannot detect right now.
> It is even questionable whether the OOM killer would be an appropriate
> action. Your system has recovered and then it is always hard to decide
> whether a disruptive action is more appropriate. One minute of
> unresponsiveness is certainly annoying though. Your system is obviously
> underprovisioned for the load you want to run.
> 
> It is quite interesting to see that we do not really have too many
> direct reclaimers during this time period
> allocstall_normal       30              1
> allocstall_movable      490             88
> pgscan_direct_throttle  0               0
> pgsteal_direct          24434           4069
> pgscan_direct           38678           5868

Yes, I understand that the system is really not suitable - however I believe 
the test is useful - even from an informational point of view :)
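
For anyone wanting to repeat that kind of comparison on their own captures, here is a small Python sketch that diffs two /proc/vmstat snapshots and reports how many scanned pages were actually reclaimed. The helper names are made up for illustration, and it assumes the [base] column quoted above is the counter value in the earlier snapshot and [diff] is the change over the gap; the example reuses only the two kswapd counters quoted above.

```python
def parse_vmstat(text):
    """Turn '/proc/vmstat'-style 'name value' lines into a dict of ints."""
    counters = {}
    for line in text.splitlines():
        parts = line.split()
        if len(parts) == 2 and parts[1].isdigit():
            counters[parts[0]] = int(parts[1])
    return counters


def diff_vmstat(before, after):
    """Per-counter change between two parsed snapshots."""
    return {name: after[name] - before[name] for name in after if name in before}


def reclaim_efficiency(delta, source="kswapd"):
    """Fraction of scanned pages that were reclaimed, or None if none scanned."""
    scanned = delta.get(f"pgscan_{source}", 0)
    stolen = delta.get(f"pgsteal_{source}", 0)
    return stolen / scanned if scanned else None


# Values reconstructed from the quoted table: base, then base + diff.
before = parse_vmstat("pgscan_kswapd 379431778\npgsteal_kswapd 49751559\n")
after = parse_vmstat("pgscan_kswapd 505839627\npgsteal_kswapd 63074489\n")
delta = diff_vmstat(before, after)
print(delta)
print(reclaim_efficiency(delta))
```

With those numbers only roughly one scanned page in ten was reclaimed during that minute, which fits the description of kswapd churning over a very small reclaimable set.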

> > 8Gb-swap-on-file.tar.gz contains the output of /proc/vmstat still with 8Gb
> > of RAM - but creating a file with swap on the PCIe SSD /swapfile with
> > size 8Gb via:
> >     # dd if=/dev/zero of=/swapfile bs=1G count=8
> >     # mkswap /swapfile
> >     # swapon /swapfile
> > 
> > Some times (all in UTC+10):
> > 23:58:30 - Start loading the saved game
> > 23:59:38 - Load ok, all running fine
> > 00:00:15 - Load Chrome
> > 00:01:00 - Quit the game
> > 
> > The game seemed to run ok with no real issue - and a lot was swapped to
> > the
> > swap file. I'm wondering if it was purely the speed of the PCIe SSD that
> > caused this appearance - as the creation of the file with dd completed at
> > ~1.4GB/sec.
> 
> Swap IO tends to be really scattered and the IO performance is not really
> great even on a fast storage AFAIK.
> 
> Anyway your original report sounded like a regression. Were you able to
> run the _same_ workload on an older kernel without these issues?

When I try the same tests with swap on an SSD under kernel 4.10.x (I believe 
the latest I tried was 4.10.25?) - then swap using the SSD did not cause any 
issues or periods of system unresponsiveness.

The file attached in the original bug report "vmstat-4.10.17-10Gb.log" was 
taken on my old system with 10Gb of RAM - and there were no significant pauses 
while swapping.

I do find it interesting that the newer '8Gb-swap-on-file.tar.gz' does not 
show any issues. I wonder if it would be helpful to attempt the same using a 
file on the SSD that was a swap disk in the '8Gb-swap-on-ssd.tar.gz' so we 
have a constant device - but with a file on the SSD instead of the entire 
block device. That would at least expose any issues on the same device in file 
vs block mode? Or maybe even if there's a difference just having the file on a 
much (much!) faster drive?

In Linux Kernel Bug Tracker #196729, ying.huang (ying.huang-linux-kernel-bugs) wrote on 2017-08-28:

#35

Comparing

a) vmstat-4.10.17-10Gb.log (OK with swapping) and
b) vmstat-10Gb.log (NOT OK - System Unresponsive)

si/so is low in both files, and si/so in a) is higher than in b), so the problem may be that we swap less than before?

bi stays high in b). I guess we encountered thrashing for file pages.
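
A comparison like this can be made less eyeball-driven by averaging the relevant columns of each log. The Python sketch below assumes the attachments are plain vmstat output with the usual column-name line (r b swpd free ... si so bi bo ...); the function name and the sample rows are invented for illustration and are not taken from either attachment.

```python
def column_means(log_text, columns=("si", "so", "bi")):
    """Average the named vmstat columns over every data row in log_text."""
    header = None
    totals = {name: 0 for name in columns}
    rows = 0
    for line in log_text.splitlines():
        fields = line.split()
        if all(name in fields for name in columns):
            header = fields  # column-name line; vmstat repeats it periodically
            continue
        if header is None or len(fields) < len(header):
            continue  # banner line, blank line, or truncated row
        try:
            row = {name: int(fields[header.index(name)]) for name in columns}
        except ValueError:
            continue  # not a numeric data row
        for name in columns:
            totals[name] += row[name]
        rows += 1
    return {name: totals[name] / rows for name in columns} if rows else {}


# Invented two-row sample in the standard vmstat layout.
SAMPLE = """\
procs -----------memory---------- ---swap-- -----io---- -system-- ------cpu-----
 r  b   swpd   free   buff  cache   si   so    bi    bo   in   cs us sy id wa st
 1  0    256 120000   1000 500000    0   10   200    50  900 1500 10  5 80  5  0
 2  1    512 110000   1000 480000    4   30   600    70  950 1600 12  6 70 12  0
"""
print(column_means(SAMPLE))
```

Running it over both logs and comparing the si/so and bi means side by side gives a quick check of the "swapping less, reading more" pattern described above.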

In Linux Kernel Bug Tracker #196729, netwiz (netwiz-linux-kernel-bugs) wrote on 2017-11-27:

#36

To give this a bit of a nudge, I've been seeing reports of others having similar issues. See:
https://www.reddit.com/r/Fedora/comments/7f0dht/system_freezes_for_45min_in_lowmemory_conditions/

Also lodged on the RH BZ a while ago:
https://bugzilla.redhat.com/show_bug.cgi?id=1472336

In Linux Kernel Bug Tracker #196729, code (code-linux-kernel-bugs) wrote on 2018-04-02:

#37

Hi,

I’ve experienced what I believe is the same problem. The problem has gone away completely for me after I bumped vm.min_free_kbytes way up to 393216.

As soon as the system ran out of physical memory, the system would freeze for at least 2 minutes and often up to 45 minutes. GNOME desktop would stop. I could move the mouse cursor, and ping the system from a remote computer; but not connect over SSH or do anything other than wave the mouse about. The system clock on the top of GNOME would stop updating for 45 minutes. (Maaaybe it would move forward 1 minute after 20 minutes and still be 19 minutes out of sync.)

I've been having this issue for years on multiple different computer configurations with 8+ GiB of memory and large SWAP partitions. I never saw more than maybe 5 MiB in use on the SWAP partition. After tuning min_free_kbytes, the SWAP partition is now being used properly and the system only does the occasional (and expected) 1 second stutter when running low on physical memory.

I also run Fedora and have kept up with the latest stable release.

Aside 1: The issue would persist with SWAPOFF, just like Steven Haigh describes.
Aside 2: The problem happened much more frequently when I used BtrFS. After switching to XFS, it happens less frequently (weekly instead of daily).
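
To put the 393216 figure in perspective, the arithmetic can be sketched in a few lines of Python. The 8 GiB total is an assumption based on the "8+ GiB" machines mentioned above, and the helper name is made up for illustration:

```python
def min_free_share(min_free_kb, mem_total_kb):
    """Fraction of physical memory held back by vm.min_free_kbytes."""
    return min_free_kb / mem_total_kb


reserve_kb = 393216              # value from the comment above
mem_total_kb = 8 * 1024 * 1024   # assumed 8 GiB machine
print(reserve_kb // 1024)                        # 384 (MiB)
print(min_free_share(reserve_kb, mem_total_kb))  # 0.046875
```

So the setting reserves 384 MiB, a little under 5% of RAM on such a machine, compared with the 32 MiB (32768) tried earlier in this report.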

In Linux Kernel Bug Tracker #196729, ultra10e (ultra10e-linux-kernel-bugs) wrote on 2018-05-07:

#38

Please refer also to this bug report. It is the same problem and has existed for eleven (11!) years if one can believe that.

https://bugs.launchpad.net/ubuntu/+source/linux/+bug/159356

I personally experience this on two 4GB laptops running live versions (no swap) of Debian 8.6 - 9.4, Fedora 25 - 28, Ubuntu, with a myriad of shells; Gnome, Mate, Cinnamon, KDE, etc. (One laptop I expanded to 8GB now but it doesn't matter- it just takes a little longer to freeze the system.)

Various browsers from Firefox52 to current Developers 60, Chrome and Chromium

Certain combos eat up memory faster (Gnome has a memory leak for example in which it consumes memory for every window drawn and NEVER relinquishes that memory, without a restart to gnome-shell), new Firefox or Chrom* vs older ESR versions (54 and under) of FF eat up memory much more quickly.

Under the best combo/circumstances, I can open up 25-30 FF tabs before the system SUDDENLY SEIZES (observe the "USB live stick" light flashing non-stop, as if swapping, even tho no swap on Live versions). If not caught within literal seconds to Ctrl-Alt-F5 to an opened root console where I can kill the FF ps and save this "live" session, the computer is entirely unresponsive and requires power cycling.

In rare instances, some 10's of minutes later or even hours (4, 8, 12) later, the system *might* finally respond to the request to drop to the console. Keystrokes to issue the kill command can take minutes per key, but if successful, I've seen the load reported after the kill as high as 75.

Truly amazing.

It's difficult to fathom this critical a bug in memory management has gone un-addressed/un-noticed for so long but alas. I can't recall but I've read this behavior ONLY occurs on 64-bit kernels, and is un-reproducible on 32-bit kernels.

Also, on non-"live" installs, with swap configured, one can watch the hard drive light come on and remain solid to the same effect. Power cycle time. I've read from others, that they've determined swap isn't even really being used, so not sure what the "read" thrashing going on is (and it must be read thrashing because on Live versions there's no swap and the USB drive light is steady active also).

I just run Linux to not run Windows. Basic browsing, text document editing, file management, a few cli's and an instant messaging program typically opened simultaneously. Nothing computationally heavy but memory intensive (at least for the web browser) for sure.

STILL- difficult to believe the OS cannot handle this situation with some sort of message, or killing a window/throwing an error about an opened Firefox tab or something-- rather, it simply fills up the memory (I watch on gnome-system-status/Resources tab now) to 99% and then it's too late.

I really don't know technically how the Out Of Memory killer works/is supposed to work, but it sure isn't doing anything here.


In Linux Kernel Bug Tracker #196729, korbin.freedman (korbin.freedman-linux-kernel-bugs) wrote on 2018-05-18:

#39

I experience this too. I've tested with kernels 4.17 rc8, 4.16.8, 4.12.8 and 4.14 across Manjaro Linux, Ubuntu Linux, openSUSE Leap 15 and Fedora 28.

Steps to trigger:
-Open firefox with many tabs, or any other high memory usage program
-Wait a second
-System freezes. Sometimes the only fix is a hard reboot

Other findings:
-I notice really high cpu load averages if the system unfreezes
-If the system is not frozen, it is highly unresponsive on high memory usage when swapping
-Hard drive indicator light stays solidly on when system is frozen (excessive hard disk use)
-The reason the system freezes is because it is swapping

Tested on a
Intel i5 520m with 4gb ram/ 4gb swap (Lenovo t410)
Intel E6400 with 3gb ram/ 3gb swap

This bug is really hard to deal with because it usually requires a hard restart. Please fix ASAP if possible

In Linux Kernel Bug Tracker #196729, korbin.freedman (korbin.freedman-linux-kernel-bugs) wrote on 2018-05-18:

#40

My fedora report here : https://bugzilla.redhat.com/show_bug.cgi?id=1577528

Luca Osvaldo Mastromatteo (lukycrociato) on 2019-06-18

description:

updated

Ubuntu Kernel Bot (ubuntu-kernel-bot) wrote on 2019-06-18: Missing required logs.

#1

This bug is missing log files that will aid in diagnosing the problem. While running an Ubuntu kernel (not a mainline or third-party kernel) please enter the following command in a terminal window:

apport-collect 1833281

and then change the status of the bug to 'Confirmed'.

If, due to the nature of the issue you have encountered, you are unable to run this command, please add a comment stating that fact and change the bug status to 'Confirmed'.

This change has been made by an automated script, maintained by the Ubuntu Kernel Team.

Changed in linux (Ubuntu):
status:	New → Incomplete

Luca Osvaldo Mastromatteo (lukycrociato) wrote on 2019-06-18: AlsaInfo.txt

#2

AlsaInfo.txt (65.8 KiB, text/plain)

apport information

tags:	added: apport-collected bionic
description:	updated

Luca Osvaldo Mastromatteo (lukycrociato) wrote on 2019-06-18: CRDA.txt

#3

CRDA.txt (468 bytes, text/plain)

apport information

Luca Osvaldo Mastromatteo (lukycrociato) wrote on 2019-06-18: CurrentDmesg.txt

#4

CurrentDmesg.txt (76.1 KiB, text/plain)

apport information

Luca Osvaldo Mastromatteo (lukycrociato) wrote on 2019-06-18: Lspci.txt

#5

Lspci.txt (20.0 KiB, text/plain)

apport information

Luca Osvaldo Mastromatteo (lukycrociato) wrote on 2019-06-18: Lsusb.txt

#6

Lsusb.txt (681 bytes, text/plain)

apport information

Luca Osvaldo Mastromatteo (lukycrociato) wrote on 2019-06-18: ProcCpuinfo.txt

#7

ProcCpuinfo.txt (16.2 KiB, text/plain)

apport information

Revision history for this message

Luca Osvaldo Mastromatteo (lukycrociato) wrote on 2019-06-18: ProcCpuinfoMinimal.txt

#8

ProcCpuinfoMinimal.txt Edit (1.4 KiB, text/plain)

apport information

Revision history for this message

Luca Osvaldo Mastromatteo (lukycrociato) wrote on 2019-06-18: ProcEnviron.txt

#9

ProcEnviron.txt Edit (325 bytes, text/plain)

apport information

Revision history for this message

Luca Osvaldo Mastromatteo (lukycrociato) wrote on 2019-06-18: ProcInterrupts.txt

#10

ProcInterrupts.txt Edit (6.7 KiB, text/plain)

apport information

Revision history for this message

Luca Osvaldo Mastromatteo (lukycrociato) wrote on 2019-06-18: ProcModules.txt

#11

ProcModules.txt Edit (4.9 KiB, text/plain)

apport information

Revision history for this message

Luca Osvaldo Mastromatteo (lukycrociato) wrote on 2019-06-18: PulseList.txt

#12

PulseList.txt Edit (38.6 KiB, text/plain)

apport information

Revision history for this message

Luca Osvaldo Mastromatteo (lukycrociato) wrote on 2019-06-18: UdevDb.txt

#13

UdevDb.txt Edit (229.9 KiB, text/plain)

apport information

Luca Osvaldo Mastromatteo (lukycrociato) on 2019-06-18

Changed in linux (Ubuntu):
status:	Incomplete → Confirmed

Luca Osvaldo Mastromatteo (lukycrociato) wrote on 2019-06-18: Re: Complete system freeze even when swapping small amounts of memory

#14

A small update on this.

I tried another distribution with the same kernel version, and the freeze does not happen there: 0 times out of 10 attempts. I don't really know why.

Also, when the freeze occurs, triggering the OOM killer with SysRq+F resolves it, but it also shows that almost no swap was in use, as if the system couldn't swap properly. To be precise, it had swapped only 64 MB on my last attempt, instead of 200 MB.

Kevin Remisoski (kremisoski) wrote on 2019-06-22:

#15

Does this happen to you while using a web browser? This is my issue and it doesn't matter which browser or which kernel. Absolute garbage. Not sure what's happened with Ubuntu or why this is happening, but I'll try to figure out why since clearly I'm not the only person this is happening to and therefore it has nothing to do with my kernel updates.

Luca Osvaldo Mastromatteo (lukycrociato) wrote on 2019-06-26:

#16

Not necessarily; I can sometimes reproduce this with other stress tests.

Luca Osvaldo Mastromatteo (lukycrociato) wrote on 2019-06-26:

#17

Are you on a Ryzen platform too?

Luca Osvaldo Mastromatteo (lukycrociato) wrote on 2019-06-28:

#18

I tried on FreeBSD today, and this behavior is not present

Luca Osvaldo Mastromatteo (lukycrociato) wrote on 2019-06-29:

#19

Update on this. Looks like it's an old upstream bug.

https://bugzilla.kernel.org/show_bug.cgi?id=196729

But now I noticed this also happens on my laptop, although there it is more "recoverable", after something like 25-30 minutes.

Still, this problem is NOT present in kernels older than 4.10.

summary:

- Complete system freeze even when swapping small amounts of memory
+ System freeze when memory is put on SWAP in Linux >4.10.x

Bug Watch Updater (bug-watch-updater) on 2019-06-29

Changed in linux:
importance:	Unknown → Medium
status:	Unknown → Confirmed

In Linux Kernel Bug Tracker #196729, m.novosyolov (m.novosyolov-linux-kernel-bugs) wrote on 2019-07-24:

#64

I encountered the same or a similar bug on BTRFS + a 5400 RPM HDD + swap on a separate partition. Unfortunately, that notebook is not mine and is far away from me, which makes it hard to run experiments. Kernels 4.15 and 4.18 both had this problem, which can be reliably reproduced by running

# stress --vm 2 --vm-bytes 2000M --vm-keep

on that notebook with 4 GB RAM.
After running stress, the load average jumps above 11.0 and the whole system freezes.
Information about its hardware: https://linux-hardware.org/index.php?probe=af73180d0c
(The HDD is in bad condition, as you can see from the smartctl output at the URL above, but I still believe that is not the reason why swap is not being used properly.)

Some other users complain about similar problems here: https://forum.rosalinux.ru/viewtopic.php?t=9387 (in Russian). Andreas17 there also uses BTRFS. Here, in my case, in bug#199763, and in another similar case that I know of, a suspiciously large share of the affected people use BTRFS, but this may be a coincidence.
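
To put a number on "swap is not being used properly" while a reproducer like the stress command above runs, one can read SwapTotal and SwapFree from /proc/meminfo. What follows is a minimal Python sketch, not something taken from the report; the helper names parse_meminfo, swap_used_kib and read_meminfo are made up for illustration.

```python
def parse_meminfo(text):
    """Parse /proc/meminfo-style text into {field name: integer value}.

    Most fields are reported in kB; a few (e.g. HugePages_Total) are plain counts.
    """
    fields = {}
    for line in text.splitlines():
        name, sep, rest = line.partition(":")
        if not sep:
            continue  # skip lines without a "Name:" prefix
        parts = rest.split()
        if parts:
            fields[name.strip()] = int(parts[0])
    return fields


def swap_used_kib(fields):
    """Swap currently in use, in KiB, derived from SwapTotal and SwapFree."""
    return fields["SwapTotal"] - fields["SwapFree"]


def read_meminfo(path="/proc/meminfo"):
    """Read and parse the live meminfo file (Linux only)."""
    with open(path) as f:
        return parse_meminfo(f.read())
```

Calling swap_used_kib(read_meminfo()) once per second during the test would show whether swap usage stays at a few MB while RAM fills up, which is the symptom described in this thread.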

In Linux Kernel Bug Tracker #196729, m.novosyolov (m.novosyolov-linux-kernel-bugs) wrote on 2019-07-24:

#65

Did anybody try to reproduce it in a virtual environment? That would allow bisecting the kernel automatically.

In Linux Kernel Bug Tracker #196729, lukycrociato (lukycrociato-linux-kernel-bugs) wrote on 2019-07-24:

#66

(In reply to Mikhail Novosyolov from comment #43)
> I encountered the same or a similar bug on BTRFS + HDD 5400 RPM + swap on a
> separate partition. Unfortunately, that notebook is not mine and is far away
> from me, what makes it hard to make experiments, kernels 4.15 and 4.18 both
> did have this problem, which can be reliably reproduced by running
>
> # stress --vm 2 --vm-bytes 2000M --vm-keep
>
> on that notebook with 4 GB RAM.
> After running stress Load Average bumps above 11.0 and the whole system
> freezes.
> Information about its hardware:
> https://linux-hardware.org/index.php?probe=af73180d0c
> (HDD is in bad condition as you may see in smartctl by the URL above, but I
> still believe it's not the reason why swap is not being used properly).
>
> Some other users complain about similar problems here:
> https://forum.rosalinux.ru/viewtopic.php?t=9387 (in Russian). Andreas17
> there also uses BTRFS. I see that here, in my case, in bug#199763 and also
> in another case of similar problem that I know there are too many people
> with BTRFS, but this may be a coincidence.

It's not a coincidence: I ran many tests, and the freezes occurred only on machines with a btrfs root partition, and on all of them.

In Linux Kernel Bug Tracker #196729, m.novosyolov (m.novosyolov-linux-kernel-bugs) wrote on 2019-07-24:

#67

Did anyone try to reproduce it on (open)SUSE, especially with their LTS kernel 4.12?
SUSE uses BTRFS by default and develops it, so there is a chance that they have caught and fixed or worked around this problem, or that their default scheduler, kernel options, etc. prevent it.

In Linux Kernel Bug Tracker #196729, jim (jim-linux-kernel-bugs) wrote on 2019-08-06:

#68

This bug is being discussed on lkml:
https://lkml.org/lkml/2019/8/4/15

I'm not going to participate there, but someone should point them to this bug and point out that everything worked fine until 4.10. Sometimes things that used to work and then got broken rate a higher priority.

In Linux Kernel Bug Tracker #196729, m.novosyolov (m.novosyolov-linux-kernel-bugs) wrote on 2019-08-06:

#69

(In reply to Jim Rees from comment #47)
> This bug is being discussed on lkml:
> https://lkml.org/lkml/2019/8/4/15
>
> I'm not going to participate there, but someone should point them to this
> bug and point out that everything worked fine until 4.10. Sometimes things
> that used to work and then got broken rate a higher priority.

I believe it is not relevant. They are discussing the problem of memory allocation in general, and that behaviour did not change much in recent kernels. What we are discussing is a regression in swap usage in kernels >= 4.10 (or >= 4.11?). In that thread the topic starter suggests running swapoff, whereas we are discussing the opposite: swap that is enabled but not used.

In Linux Kernel Bug Tracker #196729, howaboutsynergy (howaboutsynergy-linux-kernel-bugs) wrote on 2019-08-14:

#70

Just an idea, try reproducing with kernel patch `le9g.patch`:

```
diff --git a/mm/vmscan.c b/mm/vmscan.c
index dbdc46a84f63..7a0b7e32ff45 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -2445,6 +2445,13 @@ static void get_scan_count(struct lruvec *lruvec, struct mem_cgroup *memcg,
 			BUG();
 		}
 
+		if (NR_ACTIVE_FILE == lru) {
+			long long kib_active_file_now=global_node_page_state(NR_ACTIVE_FILE) * MAX_NR_ZONES;
+			if (kib_active_file_now <= 256*1024) {
+				nr[lru] = 0; //don't reclaim any Active(file) (see /proc/meminfo) if they are under 256MiB
+				continue;
+			}
+		}
 		*lru_pages += size;
 		nr[lru] = scan;
 	}
```

see: https://gist.github.com/constantoverride/84eba764f487049ed642eb2111a20830#gistcomment-2997481
(^ scroll a bit up for some details of what the patch does)
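
The unit arithmetic in the patch above is easy to misread, so here is a small Python sketch of the same threshold check. It assumes 4 KiB pages, and it assumes that multiplying the page count by MAX_NR_ZONES is meant as a pages-to-KiB conversion, which only works when that constant happens to be 4. That reading comes from the variable name and the 256MiB code comment, not from anything stated in this thread. The function names are made up for illustration.

```python
# Assumption: 4 KiB pages, so one page equals 4 KiB.
KIB_PER_PAGE = 4
# The patch compares against 256*1024, i.e. 256 MiB expressed in KiB.
THRESHOLD_KIB = 256 * 1024


def active_file_kib(active_file_pages, kib_per_page=KIB_PER_PAGE):
    """Convert an Active(file) page count to KiB."""
    return active_file_pages * kib_per_page


def skip_active_file_reclaim(active_file_pages, threshold_kib=THRESHOLD_KIB):
    """True when Active(file) is at or below the threshold, so the patched
    get_scan_count() would set nr[lru] = 0 and not scan that list."""
    return active_file_kib(active_file_pages) <= threshold_kib
```

Under these assumptions, 65536 active file pages are exactly 256 MiB and would be protected, while one more page would let reclaim proceed as before.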

In Linux Kernel Bug Tracker #196729, m.novosyolov (m.novosyolov-linux-kernel-bugs) wrote on 2019-08-28:

#71

(In reply to ValdikSS from comment #36)
> Those who experience the issue, try to set the following sysctl settings:
>
> vm.swappiness=100
> vm.watermark_scale_factor=200
>
> It greatly helps on my PC.

It did not change anything. Still only about 15 MB is swapped while 3.5 out of 4 GB of RAM is used. This results in a high load average (9-15) when loading new tabs in Chromium.
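
When comparing results between machines, it helps to record which sysctl values were actually in effect. Each sysctl name maps to a file under /proc/sys, with dots replaced by slashes. A minimal Python sketch follows; the helper names sysctl_path and read_sysctl are hypothetical.

```python
def sysctl_path(name, root="/proc/sys"):
    """Map a sysctl name such as 'vm.swappiness' to its procfs path."""
    return root + "/" + name.replace(".", "/")


def read_sysctl(name, root="/proc/sys"):
    """Return the current value of a sysctl as a stripped string (Linux only)."""
    with open(sysctl_path(name, root)) as f:
        return f.read().strip()
```

For example, read_sysctl("vm.watermark_scale_factor") returns the value as a string, which can be logged next to each test run.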

In Linux Kernel Bug Tracker #196729, m.novosyolov (m.novosyolov-linux-kernel-bugs) wrote on 2019-08-28:

#72

Created attachment 284677
log of read_vmstat when filling RAM with new tabs in Chromium

(In reply to Michal Hocko from comment #8)
> Created attachment 258067 [details]
> read_vmstat.c
>
> On Tue 22-08-17 15:55:30, Andrew Morton wrote:
> >
> > (switched to email. Please respond via emailed reply-to-all, not via the
> > bugzilla web interface).
> >
> > On Tue, 22 Aug 2017 11:17:08 +0000 <email address hidden> wrote:
> [...]
> > > Sadly I haven't been able to capture this information
> > > fully yet due to said unresponsiveness.
>
> Please try to collect /proc/vmstat in the background and provide the
> collected data. Something like
>
> while true
> do
>     cat /proc/vmstat > vmstat.$(date +%s)
>     sleep 1s
> done
>
> If the system turns out so busy that it won't be able to fork a process
> or write the output (which you will see by checking timestamps of files
> and looking for holes) then you can try the attached proggy
> ./read_vmstat output_file timeout output_size
>
> Note you might need to increase the mlock rlimit to lock everything into
> memory.

I am facing the following issue on this [1] hardware:
- when new tabs are opened in Chromium, swap (on an SSD) is not used at first: 0K of swap is in use. After about 2.5 GB out of the total 4 GB of RAM is used, about 15-50 MB of swap may be used, sometimes up to ~570 MB, but not more;
- as a result, the load average jumps from the normal ~0.9 to 9-15 when loading a new tab in Chromium;
- so the system as a whole freezes from time to time when working in the web browser and switching between tabs and/or opening new ones

I ran your program read_vmstat (./read_vmstat vmstat.log 5s) and then ran a script that opened many tabs in Chromium, one new tab every 5 s. After all tabs were opened, about 3.5 GB of RAM was in use, but only about 15 MB was swapped; then I stopped ./read_vmstat with Ctrl+C. The collected log is attached.

Kernel was 4.15.0-58-generic in Ubuntu.

[1] https://linux-hardware.org/?probe=414558f152
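
For readers who want to look at swap activity in such a log, the relevant /proc/vmstat counters are pswpin and pswpout (pages swapped in and out since boot). Below is a minimal Python sketch that samples the file in a single process and computes the difference between two samples. It is not a replacement for the attached read_vmstat.c, whose source is not shown here, and the function names are made up for illustration.

```python
import time


def parse_vmstat(text):
    """Parse /proc/vmstat-style text ("name value" per line) into a dict."""
    counters = {}
    for line in text.splitlines():
        parts = line.split()
        if len(parts) == 2:
            counters[parts[0]] = int(parts[1])
    return counters


def swap_delta(before, after):
    """Pages swapped in and out between two parsed samples."""
    return {key: after[key] - before[key] for key in ("pswpin", "pswpout")}


def sample(path="/proc/vmstat", interval=1.0, count=5):
    """Take `count` timestamped samples, reusing one open file handle so that
    no new process is forked per sample (Linux only)."""
    samples = []
    with open(path) as f:
        for _ in range(count):
            f.seek(0)
            samples.append((time.time(), parse_vmstat(f.read())))
            time.sleep(interval)
    return samples
```

A pswpout delta that stays near zero while memory fills up would match the "only about 15 MB were swapped" observation above.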

In Linux Kernel Bug Tracker #196729, m.novosyolov (m.novosyolov-linux-kernel-bugs) wrote on 2019-08-28:

#73

I forgot to mention that if I run
$ stress --vm 2 --vm-bytes 1000M --vm-keep
swap is eventually used normally.

In Linux Kernel Bug Tracker #196729, iam (iam-linux-kernel-bugs) wrote on 2019-09-12:

#74

Recently, since about kernel 5.2.7, the issue is either gone or much less pronounced.
Right now I'm running kernel 5.2.11 and can finally keep Firefox and VirtualBox running at the same time, with 3G+ in swap, without the system freezing.

Could anyone affected by this issue try newer kernels?

Mikhail, I have a spare laptop which I can set up for you for tests. Do you have the time and desire to investigate this issue?

In Linux Kernel Bug Tracker #196729, m.novosyolov (m.novosyolov-linux-kernel-bugs) wrote on 2019-09-12:

#75

(In reply to ValdikSS from comment #53)
> Recently, about since kernel 5.2.7, the issue is either gone or present to
> much less extent.
> Right now I'm running kernel 5.2.11 and finally I can keel Firefox and
> VirtualBox running at the same time, with 3G+ in swap, and the system does
> not freeze.
>
> Could anyone affected by this issue try newer kernels?

First of all, did you reset all your custom sysctls to default values?

1) One user (ilfat@) reported that:
1.1) on kernel 4.19.57 and Ubuntu kernel 4.15.0-54 his system did not swap correctly: all memory was full but only a small part of the swap was used, which led to freezes
1.2) on kernel 4.19.67 and Ubuntu kernels 4.15.0-60 and 4.15.0-62 the system generally swaps normally, but he tweaked vm.watermark_scale_factor to make it swap better
1.3) Ilfat had exactly the same problem on both ext4 and btrfs

So, something was fixed in upstream, backported to LTS kernel 4.19 and to Ubuntu kernel. I don't know what. And that issue is 100% not in BTRFS but is another problem or another aspect of the problem.

2) I did not see much difference between Ubuntu kernels 54 and 60, 62 in what is described in comment#51. So that mysterious fix did not help.

3) another user (anreas@) reports the same as in (2):
https://forum.rosalinux.ru/viewtopic.php?p=101903&sid=621857320f4d1a566e0cfa6e80ff4a8c#p101903

4) trying to overcome the issues from comment#51, I built Ubuntu kernel 4.15.0 with 2 patches:

* https://abf.io/mikhailnov/kernel-desktop-4.15/blob/master/le9-rosa.patch
* https://abf.io/mikhailnov/kernel-desktop-4.15/blob/master/Chromium-OS-low-memory-patchset.patch

and set kernel options:
# https://bugzilla.kernel.org/show_bug.cgi?id=196729#c36
vm.watermark_scale_factor=100
vm.unevictable_activefile_kbytes=100000
#vm.swappiness=80
# https://bugs.chromium.org/p/chromium/issues/detail?id=263561#c16
# Disable swap read-ahead
vm.page-cluster=0

After that, I _think_ it became a bit better (I cannot prove it with numbers; I was simply unable to make the system unresponsive, although the load average remained the same). Unfortunately, the main user of that notebook has not used it for some days, so right now I can't say whether, according to her feedback, the situation has improved and the system no longer microfreezes from time to time. Let's wait a bit more. And still, that may be a coincidence rather than the result of the patches and/or tweaked sysctls.

I was able to deadlock that system by opening too many tabs in Chromium, but that is not what those patches were meant to solve. nohang/earlyoom would probably have helped if it had been in use.

>
> Mikhail, I have a spare laptop which I can setup for you for tests. Do you
> have time and wish to investigate this issue?

I have no idea how to investigate it, or how to measure the result. Maybe PSI (pressure stall information) can tell us something; I have not tried looking at it. Moreover, I don't understand what the problem is ;)
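
On the PSI idea: kernels built with PSI support (4.20 or newer, so not the 4.15 kernels tested above) expose memory stall figures in /proc/pressure/memory as two lines, "some" and "full", each with avg10, avg60, avg300 and total fields. A minimal Python sketch for parsing that format follows; the function names are made up for illustration, and this is only one possible way to quantify the freezes.

```python
def parse_pressure(text):
    """Parse PSI text such as
        some avg10=0.00 avg60=0.00 avg300=0.00 total=0
        full avg10=0.00 avg60=0.00 avg300=0.00 total=0
    into {"some": {...}, "full": {...}} with float values."""
    result = {}
    for line in text.splitlines():
        parts = line.split()
        if not parts:
            continue
        kind, pairs = parts[0], parts[1:]
        result[kind] = {}
        for pair in pairs:
            key, _, value = pair.partition("=")
            result[kind][key] = float(value)
    return result


def read_memory_pressure(path="/proc/pressure/memory"):
    """Read and parse the live PSI memory file (needs a PSI-enabled kernel)."""
    with open(path) as f:
        return parse_pressure(f.read())
```

The "full" avg10 value is the share of the last 10 seconds in which all non-idle tasks were stalled on memory at once, which is probably the closest match to a perceived freeze.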

In Linux Kernel Bug Tracker #196729, howaboutsynergy (howaboutsynergy-linux-kernel-bugs) wrote on 2019-09-12:

#76

On an unrelated note (but since btrfs was thought to be a problem at some point): I've discovered that btrfs with zstd:5 (or worse, zstd:15) can cause (at least) mouse cursor stuttering, as if it were skipping frames, while zstd:1 doesn't. This happens regardless of how fast or how many writes are hitting the SSD (2-6M/s with zstd:15, 38-50+M/s with zstd:1) and is apparently due to the high CPU usage during compression, which zstd:1 keeps low. (Leaving the zstd level unspecified means zstd:3, the default.)

Normal CPU usage by itself (e.g. during compiling) doesn't cause such stuttering, though. I've tested this on a Lenovo Ideapad Z575, 16G RAM, Kingston SSD SA400S37240G firmware SBFK71F1, and I've personally switched to zstd:1

i.e.
```
diff --git a/fs/btrfs/zstd.c b/fs/btrfs/zstd.c
index 6b9e29d050f3..02ffdb27c360 100644
--- a/fs/btrfs/zstd.c
+++ b/fs/btrfs/zstd.c
@@ -22,7 +22,7 @@
 
 #define ZSTD_BTRFS_MAX_WINDOWLOG 17
 #define ZSTD_BTRFS_MAX_INPUT (1 << ZSTD_BTRFS_MAX_WINDOWLOG)
-#define ZSTD_BTRFS_DEFAULT_LEVEL 3
+#define ZSTD_BTRFS_DEFAULT_LEVEL 1
 #define ZSTD_BTRFS_MAX_LEVEL 15
 /* 307s to avoid pathologically clashing with transaction commit */
 #define ZSTD_BTRFS_RECLAIM_JIFFIES (307 * HZ)
```

but zstd:1 in /etc/fstab should also work, unless the kernel is too old to know about it (which is why I prefer using the patch anyway)

In Linux Kernel Bug Tracker #196729, m.novosyolov (m.novosyolov-linux-kernel-bugs) wrote on 2019-09-12:

#77

(In reply to howaboutsynergy from comment #55)
> On an unrelated note(but since btrfs was thought to be a problem at some
> point), I've discovered that btrfs with zstd:5 (or worse zstd:15) can cause
> (at least) mouse cursor stuttering(like it was skipping frames), while
> zstd:1 doesn't(likely because of the low CPU usage during compression),
> regardless of how fast/many writes are happening on the SSD(2-6M/s with
> zstd:15, 38-50+M/s with zstd:1), apparently due to high CPU usage during the
> compression. (zstd unspecified means zstd:3 aka default)

I tried moving the Chromium cache from btrfs to tmpfs; nothing improved, and the load average peaks even seem to have become a bit higher (regarding what is described in comment#51).

In Linux Kernel Bug Tracker #196729, m.novosyolov (m.novosyolov-linux-kernel-bugs) wrote on 2019-09-12:

#78

(In reply to Mikhail Novosyolov from comment #54)
>
> So, something was fixed in upstream, backported to LTS kernel 4.19 and to
> Ubuntu kernel. I don't know what. And that issue is 100% not in BTRFS but is
> another problem or another aspect of the problem.
>
I suspect commit 2c012a4ad1a2cd3fb5a0f9307b9d219f84eda1fa "mm: vmscan: scan anonymous pages on file refaults"

https://github.com/torvalds/linux/commit/2c012a4ad1a2cd3fb5a0f9307b9d219f84eda1fa

It appeared in kernel 5.3 and was backported to 4.19.62 and Ubuntu kernel 4.15.0-59

(In reply to ValdikSS from comment #53)
> Recently, about since kernel 5.2.7, the issue is either gone or present to
> much less extent.
> Right now I'm running kernel 5.2.11 and finally I can keel Firefox and
> VirtualBox running at the same time, with 3G+ in swap, and the system does
> not freeze.
... and to 5.2.3

But that commit "fixes" commits from the era of kernels 3.8 and 3.9, so probably it is not the one I'm looking for.

In Linux Kernel Bug Tracker #196729, GoodMirek (goodmirek) wrote on 2019-10-09:

#79

Same issue here, although I run Fedora 31 with KDE Plasma DE.
Swapping with 5.2.17 works fine, even when using 4GB of swap for several days. Swapping with 5.3.2 causes a complete freeze, i.e. screen freezes up, no mouse movement, no TTY access. I did not try SYSRQ keys.

System under test:
HP Elitebook 850 G4
CPU: Intel i5-7200U with embedded GPU
RAM: 4GB unbuffered, memtest OK
Disk: SSD Samsung PM961 (256GB), LVM+LUKS
Swap: swapping to file of size 20GB at path /swapfile

Pål Bergström (palbergstrom) wrote on 2019-11-15:

#80

I have a similar problem, but with shorter freezes and lags. Is updating the graphics drivers a solution? For some it seems to be. If so, why?

https://askubuntu.com/questions/1185491/ubuntu-19-10-freezes-and-lags-reguarly

In Linux Kernel Bug Tracker #196729, iam (iam-linux-kernel-bugs) wrote on 2019-12-27:

#81

I have an idea why this bug is much worse with BTRFS than with EXT4: BTRFS has much bigger read/write amplification, up to 10x higher than EXT4.

In Linux Kernel Bug Tracker #196729, m.novosyolov (m.novosyolov-linux-kernel-bugs) wrote on 2019-12-27:

#82

(In reply to ValdikSS from comment #58)
> I have an idea why this bug is much worse with BTRFS than with EXT4: BTRFS
> has much bigger read/write amplification, up to 10x higher than EXT4.

You mean that when e.g. Chromium browser writes its cache, it loads IO "up to 10x higher than EXT4", and, when IO is also loaded by swapping, it causes microfreezes? Are you sure that operations that cause write amplifications are not done in background?

In Linux Kernel Bug Tracker #196729, iam (iam-linux-kernel-bugs) wrote on 2019-12-27:

#83

(In reply to Mikhail Novosyolov from comment #59)
> You mean that when e.g. Chromium browser writes its cache, it loads IO "up
> to 10x higher than EXT4"

Yes. This is also true for read operations, not only for write.

> and, when IO is also loaded by swapping, it causes
> microfreezes?

Yes, probably

> Are you sure that operations that cause write amplifications
> are not done in background?

I'm not sure, but people on the ROSA forum blamed BTRFS for this bug. I'm pretty sure it's not directly tied to BTRFS, but write and read amplification may explain why the lags are more severe with this FS.

Check https://arxiv.org/pdf/1707.08514.pdf and https://habr.com/ru/post/476414/

In Linux Kernel Bug Tracker #196729, m.novosyolov (m.novosyolov-linux-kernel-bugs) wrote on 2019-12-27:

#84

(In reply to ValdikSS from comment #60)
> I'm not sure, but people on Rosa forum blamed BTRFS for this bug. I'm pretty
> sure it's not directly tight with BTRFS, but write and read amplification
> may explain why lags are more severe with this FS.

People here and on the ROSA forum blamed BTRFS for the problem of swap not being used, not for the microfreezes...

In Linux Kernel Bug Tracker #196729, cfeck (cfeck-linux-kernel-bugs) wrote on 2019-12-29:

#85

I don't use btrfs, but only ext4 on an SSD. Since updating my system from kernel 5.1.7 to kernel 5.3.12 in Tumbleweed, I get regular ~1-2 second freezes (e.g. mouse pointer hangs in X11, or characters don't appear while typing in Konsole) while Blender renders and swaps out finished tiles. This didn't happen with the previous kernel.

Corben (tobias-krummen) wrote on 2020-02-05:

#86

I've been experiencing the same issue since I upgraded to kernel 5.3.0 on Ubuntu 18.04 LTS via the HWE stack.
It still does not happen with kernel 4.15 and, IIRC, didn't happen with kernel 5.0.0 (which was replaced by 5.3.0 through the HWE stack).
I see a heavy increase in swap file usage after a freeze; the swap file is located on the internal drive of my Surface Pro 3. When swapping starts, it uses up to 500 MB at once, while with kernel 4.15 swapping starts later and even then uses only a few MB.

I have 8 GB of RAM, and I've set vm.swappiness to 40 instead of 60.

The freezes were mitigated somewhat after I reset the GNOME settings via gnome-tweaks and enabled trim on the swap partition, but they still happen and tend to increase after a while.
I'll try enabling zram and see if that mitigates it further, so that it doesn't interrupt the workflow on this device.

Nico R (u-nico-c) wrote on 2020-02-12:

#87

I can confirm this on Kubuntu 19.10 on a Core i5-4310U. My RAM is large enough for what I do every day, but as soon as the tiniest bit of swapping occurs, the machine becomes nearly unusable: stuttering window animations and a stuttering/hanging mouse cursor, often for 1-2 minutes. I see the bug is already marked as confirmed, so I just want to add this for the record.

In Linux Kernel Bug Tracker #196729, russianneuromancer (russianneuromancer-linux-kernel-bugs) wrote on 2020-03-04:

#89

Christoph, your issue is different. Please file a separate bug report.

In Linux Kernel Bug Tracker #196729, russianneuromancer (russianneuromancer-linux-kernel-bugs) wrote on 2020-03-10:

#90

On an HP Stream 7 tablet (1GB RAM) there was a similar regression, but it started with Linux 4.12 instead of 4.10. Last year I tried bisecting and several workarounds such as autostart cleanup, sysctl tweaks, zram, etc. In the end it seems that Linux 5.5.8 solved the issue: at least in this particular use case (a device with 1GB RAM), 5.5.8 makes swapping perform as it did with 4.11.

In Linux Kernel Bug Tracker #196729, bugzilla (bugzilla-linux-kernel-bugs) wrote on 2020-03-15:

#91

I primarily test by building webkitgtk [1], and I experience the same loss of system responsiveness whether / is ext4 or Btrfs. But I do see a difference in top and iotop.
https://drive.google.com/open?id=12jpQeskPsvHmfvDjWSPOwIWSz09JIUlk

This is an extreme case of refaulting: the system is out of memory and swap, and since kswapd and the btrfs threads are using a lot of CPU, I'm guessing the faults are a mix of anonymous pages and file pages. At this point the system is really lost, which is why the UX is the same with ext4 and btrfs; but behind the scenes more does seem to be going on. There might be other workloads which aren't as extreme and would thereby expose the difference. Two possible sources of the heavy CPU usage by btrfs threads are decompression and checksumming. If it's true that reclaim is happening nearly constantly, a read is not a simple 4K minimum but a 128K minimum, because all Btrfs compressed files use a 128K extent size; the extent is then decompressed, and the read also requires reading the csum tree and computing a csum to compare. Ordinarily this is cheap, but in this situation it is possibly causing a lot of extra congestion. This is the limit of my knowledge, so it's just speculation.
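
To make the 4K versus 128K point concrete, here is a small worked calculation in Python. It is a worst-case sketch that assumes every refaulted 4K page sits in a different compressed 128K extent and nothing is served from cache; real workloads will fall somewhere below this. The function names are made up for illustration.

```python
def read_amplification(extent_kib=128, page_kib=4):
    """How many times more data is read when one page fault pulls in a
    whole compressed extent instead of a single page."""
    return extent_kib // page_kib


def refault_read_kib(n_faults, unit_kib):
    """Total data read for n_faults refaults when each one reads unit_kib."""
    return n_faults * unit_kib
```

With the defaults the factor is 32: a thousand refaults read about 4000 KiB at 4K granularity but 128000 KiB if each one has to fetch and decompress a full 128K extent.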

Btrfs write amplification is a known issue (wandering trees problem). But that appears to not be the issue in this example.

It might be that this problem is better dealt with by cgroupsv2, protecting certain tasks from reclaim and thus reducing the problem on any file system. But Btrfs alone (for now) does have more sophisticated cgroupsv2 IO isolation control as well.
https://www.spinics.net/lists/cgroups/msg24743.html

The upstream GNOME and KDE developers are aware of the loss of responsiveness problem and have done quite a lot of preliminary work in GNOME 3.34 with more work on the way.
https://blogs.gnome.org/benzea/2019/10/01/gnome-3-34-is-now-managed-using-systemd/

You can today take advantage of this cgroupsv2 work by running resource hungry tasks as a systemd user unit in Fedora 31.
https://blogs.gnome.org/benzea/2019/10/01/gnome-3-34-is-now-managed-using-systemd/#comment-14833

I expect in the next 6-12 months (it's a guesstimate) there will be additional work in GNOME to protect the user session or what I vaguely call the "GUI stack" from reclaim, and thus improve its responsiveness at the expense of the resource hungry process.

[1] first two lines; set -j to RAM in GiB +2 GiB; i.e. if you have 8G RAM, use -j 10; more jobs makes the problem happen faster.
https://trac.webkit.org/wiki/BuildingGtk
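
The `-j` rule of thumb from footnote [1] can be written down as a small sketch. The function names are hypothetical, and rounding MemTotal to the nearest GiB is my assumption, since /proc/meminfo reports slightly less than the installed RAM.

```python
def read_mem_total_kib(meminfo_text):
    """Extract MemTotal (in KiB) from the text of /proc/meminfo."""
    for line in meminfo_text.splitlines():
        if line.startswith("MemTotal:"):
            return int(line.split()[1])
    raise ValueError("MemTotal not found in meminfo text")


def stress_jobs(mem_total_kib, extra=2):
    """Job count for the reproducer: RAM in GiB (rounded) plus `extra`."""
    ram_gib = round(mem_total_kib / (1024 * 1024))
    return ram_gib + extra
```

Feed `read_mem_total_kib()` the contents of `/proc/meminfo` and pass the result to `stress_jobs()`. For a machine with 8 GiB of RAM this gives the `-j 10` mentioned above.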

Revision history for this message

In Linux Kernel Bug Tracker #196729, bugzilla (bugzilla-linux-kernel-bugs) wrote on 2020-03-15:

#92

Another resource, quite long but has a tl;dr, and reviewed by some of the cgroups/resource control folks:
https://chrisdown.name/2018/01/02/in-defence-of-swap.html

I'll use somewhat technically sloppy language, but hopefully a useful metaphor: there's incidental swap and heavy swap. Incidental swap is when some file or anonymous page really isn't needed, and it's good to evict it to free up memory. But in the case of heavy swap (or even reclaim), not at all incidental, it becomes a serious performance impediment. The reality is that there are some tasks that just take gobs of memory and swap isn't a good substitute. But some swap is useful for freeing up memory for things that need to stay in memory.

I'm finding that for the incidental swap need, swap-on-ZRAM is quite useful. You are exchanging an IO bound task for a memory+CPU task; it also forces such pages to be pinned into memory. So there's no free lunch. Conservative use would be a ZRAM device around 1/4 of RAM, up to a max of 50% of RAM. It's not bad or wrong to use more, it's just that some workloads, like the webkitgtk example, have such significant need that they'll actually turn 1/2 of memory into swap, which in effect means you have 50% less RAM for that task. It actually makes the problem worse. Really, the near-term options are a) build with flags that cause fewer resources to be used in the first place, or b) build in a systemd user session and limit resources that way, or c) buy more RAM, or d) use a conventional swap partition that's big enough for the task to eventually complete, possibly with zswap to keep the most frequently used pages in a small RAM cache pool, and just suffer the ensuing lack of GUI responsiveness.
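
As a rough sketch of that sizing rule (about 1/4 of RAM, never more than 50%), something like the following could be used to pick a ZRAM device size. The function name and its parameters are hypothetical; it only does the arithmetic and does not configure any device.

```python
GIB = 1024 ** 3


def zram_size_bytes(ram_bytes, fraction=0.25, max_fraction=0.5):
    """Suggested ZRAM device size: `fraction` of RAM, capped at `max_fraction`."""
    if ram_bytes <= 0:
        raise ValueError("ram_bytes must be positive")
    # Clamp the requested fraction into the range [0, max_fraction].
    capped = min(max(fraction, 0.0), max_fraction)
    return int(ram_bytes * capped)
```

With 8 GiB of RAM the default works out to a 2 GiB device, and asking for more than half is clamped to 4 GiB.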

Anyway, it's a bit complicated, with lots of moving parts. It probably needs more sophisticated use of perf, and maybe even bpf could be useful in figuring out where the various bottlenecks are.

Revision history for this message

Corben (tobias-krummen) wrote on 2020-03-26:

#88

It seems that zram mitigates the problem for me during a work day. As long as the system doesn't have to write swap to the disk, it seems fine. The longer I use the system, the sooner it runs out of (z)ram and starts swapping to disk as well. As soon as this happens, I get these freezes again. I have the feeling that the amount of data swapped out at once is causing this. It looks to me like I/O is blocking.
My impression is that with a previous kernel the blocks were smaller, so the system didn't block for long and it didn't really interrupt work.

Revision history for this message

Kai-Heng Feng (kaihengfeng) wrote on 2020-03-31:

#93

Possible duplicate of LP: #1861359.

Revision history for this message

In Linux Kernel Bug Tracker #196729, admin (admin-linux-kernel-bugs) wrote on 2020-09-24:

#94

@lou +1

Sorry for my Google translate.

This bug is many years old. This is *Bug 12309*!!! Why didn't anyone remember?

I'm tired of it. It was there in the early 2010s with 512 MB and 2 GB of RAM, and in the mid 2010s with 4 GB of RAM. And for the past five years, with 16 GB of RAM, it has not gone anywhere.

There is a funny article in Russian about this bug, lurkmore.to/12309, with links to the original bug reports and the "problem solutions" in each new kernel. "According to the anonymous author, this useful feature is lovingly and carefully carried over into each fresh kernel."

Earlier I tried rebuilding kernels following advice from the Internet (BFQ). Sometimes it helped a little. Some builds gave me enough time to urgently close programs. But the most effective was vm.swappiness = 90 and a script with a notification when approaching this value to clean up memory (conky + dialog).
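
The original conky + dialog script isn't shown, so the following is only a sketch of the general idea: parse /proc/meminfo and decide when to warn. Reading the "value" as a percentage of memory in use, and reusing 90 as the default threshold, are both my assumptions. All function names are hypothetical.

```python
def parse_meminfo(text):
    """Parse /proc/meminfo text into a dict of integer values (mostly kB)."""
    values = {}
    for line in text.splitlines():
        key, sep, rest = line.partition(":")
        fields = rest.split()
        if sep and fields:
            values[key.strip()] = int(fields[0])
    return values


def memory_used_percent(meminfo):
    """Share of RAM not reported as available, in percent."""
    total = meminfo["MemTotal"]
    available = meminfo["MemAvailable"]
    return 100.0 * (total - available) / total


def should_warn(meminfo, threshold_percent=90.0):
    """True once memory use reaches the (assumed) warning threshold."""
    return memory_used_percent(meminfo) >= threshold_percent
```

Run periodically against the contents of `/proc/meminfo`, this can trigger whatever notification you prefer, so there is still time to close programs before the machine starts swapping heavily.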

Revision history for this message

Kevin Wortman (kwortman) wrote on 2020-12-24:

#95

I am experiencing this bug on a stock Dell Inspiron 5000 (5482) with 8 GB RAM and the factory SSD. It is completely debilitating. Having 1-2 completely unpredictable, 30-minute-plus, hard freezes per day is a showstopper. I can't trust this environment for professional work, or even to take notes in gedit during a phone call.

I've been an Ubuntu user and advocate at work and home for 10+ years, but I guess I have to quit :(

I tried different flavors, and the bug manifested under all of them: Ubuntu LTS 20.04, Ubuntu 20.10, Ubuntu MATE 20.10, and Debian 10.

Fedora 33 does not suffer from this bug, so I'm using that. I get frequent "Gah! Your tab just crashed" errors in Firefox any time I have more than about 5 tabs open. I conjecture there is some difference between the Red Hat-based and Debian-based kernels that causes the Red Hat-based ones to kill a tab process in the circumstances that cause the Debian-based ones to lock up. The Fedora experience is acceptable, though it's still disappointing that Linux performance has regressed to the point of being inadequate for basic web browsing.

Takeaway: Fedora seems to be an acceptable workaround for people experiencing this bug.

Revision history for this message

In Linux Kernel Bug Tracker #196729, egorfedorovichletov (egorfedorovichletov-linux-kernel-bugs) wrote on 2022-01-20:

#96

Are there any changes?

Revision history for this message

In Linux Kernel Bug Tracker #196729, hi-angel (hi-angel-linux-kernel-bugs) wrote on 2022-03-27:

#97

Modern kernels handle swapping situations much better.

Also, these days the Multi-LRU patchset should pretty much resolve the problem. It is not yet upstream, but is used in downstream kernels such as linux-zen and liquorix-kernel. There is hope it will be merged by 5.19¹

1: https://www.phoronix.com/scan.php?page=news_item&px=MGLRU-Not-For-5.18

Revision history for this message

In Linux Kernel Bug Tracker #196729, hi-angel (hi-angel-linux-kernel-bugs) wrote on 2024-08-02:

#99

(In reply to Konstantin Kharlamov from comment #70)
> Also, these days the Multi-LRU patchset should pretty much resolve the
> problem. It is not yet upstream, but is used in downstream kernels such as
> linux-zen and liquorix-kernel. There is hope it will be merged by 5.19¹

MLRU patches were merged long ago, should perhaps this issue be closed? Is Steven Haigh (the OP) by any chance still here?

Revision history for this message

In Linux Kernel Bug Tracker #196729, ikalvachev (ikalvachev-linux-kernel-bugs) wrote on 2024-08-02:

#100

I commented here 6 years ago, having similar issues with swap thrashing.

I can attest that despite web browsers managing to eat more memory than ram available, my system hasn't gone unresponsive recently.

Revision history for this message

In Linux Kernel Bug Tracker #196729, netwiz (netwiz-linux-kernel-bugs) wrote on 2024-08-03:

#101

Yeah - I'm still around.

These days, it's become cheap enough to just have more RAM - so to be honest, I haven't seen an issue like this again in a number of years - but now I don't bother with less than 32 GB of RAM for a desktop.

Likely, this has just become obsolete now - so closing as such.

Bug Watch Updater (bug-watch-updater) on 2024-08-03

Changed in linux:
status:	Confirmed → Expired

Revision history for this message

In Linux Kernel Bug Tracker #196729, hi-angel (hi-angel-linux-kernel-bugs) wrote on 2024-08-03:

#102

To future readers: if you're still seeing this, make sure the file `/sys/kernel/mm/lru_gen/enabled` exists and its value is `0x0007` (that is just the value I tested). AFAIK the kernel doesn't enable MLRU by default, but AFAIK all major distros enable it in their kernels.
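
A small sketch of that check, for anyone who wants to script it. The function name is hypothetical. The only things it relies on are the path above and the value being printed in hex. Treating bit 0 as the main on/off switch is my reading of the kernel's multi-gen LRU admin guide, so double-check it against your kernel's documentation.

```python
from pathlib import Path

LRU_GEN_ENABLED = Path("/sys/kernel/mm/lru_gen/enabled")


def read_lru_gen_mask(path=LRU_GEN_ENABLED):
    """Return the lru_gen feature mask as an int, or None if unavailable."""
    try:
        raw = Path(path).read_text().strip()
        return int(raw, 16)  # the file holds a hex string such as "0x0007"
    except (OSError, ValueError):
        # Missing file (kernel built without MGLRU) or unexpected content.
        return None


if __name__ == "__main__":
    mask = read_lru_gen_mask()
    if mask is None:
        print("lru_gen not available on this kernel")
    else:
        # Assumption: bit 0 is the main switch.
        print(f"lru_gen mask = {mask:#06x}, enabled = {bool(mask & 1)}")
```

A result of `None` means the file is absent or unreadable, which would suggest the running kernel was built without the feature.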

Revision history for this message

In Linux Kernel Bug Tracker #196729, marc (marc-linux-kernel-bugs) wrote on 2024-08-05:

#103

(In reply to Steven Haigh from comment #74)
> These days, its become cheap enough to just have more RAM - so to be honest,
> I haven't seen an issue like this again in a number of years - but now I
> don't bother with less than 32Gb of RAM for a desktop.
>
> Likely, this has just become obsolete now - so closing as such.

Sorry to object, but I'm reading this thread because the issue is still fresh and relevant to me. Some people are stuck with 4 GB for everyday use.

However
(In reply to Konstantin Kharlamov from comment #75)
This is helpful. As for others above, the issue has not shown up badly in the last few months using Mint 21.3 and kernel 6.8 (and perhaps since 5.15 or 5.19). I cannot easily revert to an older kernel just to test, but the machine currently has `/sys/kernel/mm/lru_gen/enabled` and its value is `0x0007`, as suggested.

The latter is fair enough to justify closure.

Affects		Status	Importance	Assigned to	Milestone
	Linux	Expired	Medium	linux-kernel-bugs #196729
	linux (Ubuntu)	Confirmed	Undecided	Unassigned