Hi Jiatong,
Thanks for emailing me, happy to answer questions anytime.
> 1. why linux-hwe-4.15.0 source code is used?
If you look closely at the oops in the description, the customer I was working with was running:
4.15.0-106-generic #107~16.04.1-Ubuntu
This is the Xenial (16.04) HWE kernel. I used the linux-hwe-4.15.0 source code so that the source matched the debug symbol package exactly.
In your case:
4.15.0-72-generic #81-Ubuntu
you are running the 4.15 kernel on normal Bionic (18.04), so we can use the normal linux-4.15.0 source code.
> 2. we are using linux-4.15.0-unsigned and by skimming through the source code, looks like try_get_page is not defined at that time?
Yes! You are correct, the original mainline 4.15 kernel did not have try_get_page() defined at:
https://elixir.bootlin.com/linux/v4.15/source/mm/gup.c#L156
But if you look closely at the actual kernel sources for 4.15.0-72-generic:
https://git.launchpad.net/~ubuntu-kernel/ubuntu/+source/linux/+git/bionic/tree/mm/gup.c?h=Ubuntu-4.15.0-72.81#n156
We see that try_get_page() is there. That is because we backported:
commit 8fde12ca79aff9b5ba951fce1a2641901b8d8e64
Author: Linus Torvalds <email address hidden>
Date: Thu Apr 11 10:49:19 2019 -0700
Subject: mm: prevent get_user_pages() from overflowing page refcount
Link: https://github.com/torvalds/linux/commit/8fde12ca79aff9b5ba951fce1a2641901b8d8e64
Ubuntu 4.15 backport link: https://paste.ubuntu.com/p/2bF5WWQy2r/
That commit first turned up in 4.15.0-59-generic, via upstream-stable.
Anyway, let's have a look at your stack trace:
4.15.0-72-generic #81-Ubuntu
RIP: 0010:follow_page_pte+0x663/0x6d0
I downloaded the debug symbols:
http://ddebs.ubuntu.com/ubuntu/pool/main/l/linux/linux-image-unsigned-4.15.0-72-generic-dbgsym_4.15.0-72.81_amd64.ddeb
Extracted them:
dpkg -x linux-image-unsigned-4.15.0-72-generic-dbgsym_4.15.0-72.81_amd64.ddeb debug
and looked up:
$ eu-addr2line -e ./vmlinux-4.15.0-72-generic -f follow_page_pte+0x663
try_get_page inlined at /build/linux-E6MDAa/linux-4.15.0/mm/gup.c:156 in follow_page_pte
/build/linux-E6MDAa/linux-4.15.0/mm/gup.c:138
We see that you hit try_get_page() in mm/gup.c:156:
155     if (flags & FOLL_GET) {
156         if (unlikely(!try_get_page(page))) {
157             page = ERR_PTR(-ENOMEM);
158             goto out;
159         }
Looking at try_get_page() in include/linux/mm.h:
static inline __must_check bool try_get_page(struct page *page)
{
    page = compound_head(page);
    if (WARN_ON_ONCE(page_ref_count(page) <= 0))
        return false;
    page_ref_inc(page);
    return true;
}
We see that you hit the exact same WARN_ON_ONCE(), the one that fires when page_ref_count(page) <= 0.
So whatever page you are trying to access has a reference count that is zero or negative, which suggests that it has either wrapped around or been decremented too many times.
Looking at your error log, I can't tell for sure if it is the zero_page, but it's quite likely going to be. The zero_page is a frequently used page in the system, and it is used outside of ksm; it's just that ksm is a heavy user of the zero_page. If you are constantly allocating large amounts of new memory, you will be using the zero_page in a similar way to ksm, and the reference counter will eventually overflow.
I think there is a good chance that the fix I submitted in 4.15.0-118-generic will solve your problems. Please run "apt update" and "apt upgrade" to move to a newer kernel, the newer the better; it will most likely fix the problem.
Let me know if you have any more questions.
Thanks,
Matthew