Comment 97 for bug 1655842

Pete Cheslock (pete-cheslock) wrote :

I have seemingly solved this issue, at least on linux-aws version 4.4.0-1016-aws. The specific issue I was seeing was order-2 allocations failing when the OOM killer triggered. At the time I thought the issue was due to XFS and memory fragmentation caused by the very large number of memory-mapped files in Elasticsearch/Lucene. When we moved to ext4 the rate of OOM killer firings dropped, but did not stop. We made the following two sysctl changes, which have effectively stopped higher-order memory allocations from failing and the OOM killer from firing.

These settings were used on i3.2xlarge hosts that have 60G of RAM, so your mileage may vary. We also do not run swap on our servers; adding swap would likely have helped, but that is not an option for us.

vm.min_free_kbytes = 1000000 # We set this to leave about 1G of RAM available for the kernel, in the hope that even if memory was heavily fragmented there would still be enough free for Linux to satisfy a higher-order allocation before the OOM killer acts.

vm.zone_reclaim_mode = 1 # Our hope here was to make the kernel more aggressive in reclaiming memory.
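
For anyone who wants to check whether higher-order free memory is actually scarce on their own hosts before and after changing these sysctls, here is a minimal Python sketch that parses the text of /proc/buddyinfo. The function name free_blocks_at_or_above and its min_order parameter are made up for illustration and are not part of any tool mentioned in this bug. It assumes the usual layout of one line per zone, with the free block counts listed from order 0 upward.

```python
def free_blocks_at_or_above(buddyinfo_text, min_order=2):
    """Count free blocks of order >= min_order for each (node, zone).

    buddyinfo_text is the raw contents of /proc/buddyinfo. Each line is
    assumed to look like:
        Node 0, zone   Normal   c0 c1 c2 ...
    where c0 is the number of free order-0 blocks, c1 order-1, and so on.
    """
    result = {}
    for line in buddyinfo_text.splitlines():
        parts = line.split()
        # Skip anything that does not match the expected layout.
        if len(parts) < 5 or parts[0] != "Node" or parts[2] != "zone":
            continue
        node = parts[1].rstrip(",")
        zone = parts[3]
        counts = [int(x) for x in parts[4:]]
        # Sum the columns for min_order and every higher order.
        result[(node, zone)] = sum(counts[min_order:])
    return result
```

To use it on a live host, read /proc/buddyinfo and pass the contents to the function. A zone that reports very few blocks at order 2 and above while still having plenty of order-0 pages would be consistent with the fragmentation described above, although it does not prove that fragmentation is the cause.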