Comment 17 for bug 1837810

Revision history for this message
Matthew Ruffell (mruffell) wrote :

Hi Jiatong,

Thanks for emailing me, happy to answer questions anytime.

> 1. why linux-hwe-4.15.0 source code is used?

If you look closely at the oops in the description, the customer I was working with was running:

4.15.0-106-generic #107~16.04.1-Ubuntu

This is the Xenial (16.04) HWE kernel. I was using the linux-hwe-4.15.0 source code to make sure the debug symbols used for the debug symbol package matched exactly.

In your case:

4.15.0-72-generic #81-Ubuntu

you are running the 4.15 kernel on normal Bionic (18.04), so we can use the normal linux-4.15.0 source code.

> 2. we are using linux-4.15.0-unsigned and by skimming through the source code, looks like try_get_page is not defined at that time?

Yes! You are correct, the original mainline 4.15 kernel did not have try_get_page() defined at:

https://elixir.bootlin.com/linux/v4.15/source/mm/gup.c#L156

But if you look closely at the actual kernel sources for 4.15.0-72-generic:

https://git.launchpad.net/~ubuntu-kernel/ubuntu/+source/linux/+git/bionic/tree/mm/gup.c?h=Ubuntu-4.15.0-72.81#n156

We see that try_get_page() is there. That is because we backported:

commit 8fde12ca79aff9b5ba951fce1a2641901b8d8e64
Author: Linus Torvalds <email address hidden>
Date: Thu Apr 11 10:49:19 2019 -0700
Subject: mm: prevent get_user_pages() from overflowing page refcount
Link:https://github.com/torvalds/linux/commit/8fde12ca79aff9b5ba951fce1a2641901b8d8e64

Ubuntu 4.15 backport link: https://paste.ubuntu.com/p/2bF5WWQy2r/

That commit first turned up in 4.15.0-59-generic, via upstream-stable.

Anyway, let's have a look at your stack trace:

4.15.0-72-generic #81-Ubuntu
RIP: 0010:follow_page_pte+0x663/0x6d0

I downloaded the debug symbols:

http://ddebs.ubuntu.com/ubuntu/pool/main/l/linux/linux-image-unsigned-4.15.0-72-generic-dbgsym_4.15.0-72.81_amd64.ddeb

Extracted them:

dpkg -x linux-image-unsigned-4.15.0-72-generic-dbgsym_4.15.0-72.81_amd64.ddeb debug

and looked up:

$ eu-addr2line -e ./vmlinux-4.15.0-72-generic -f follow_page_pte+0x663
try_get_page inlined at /build/linux-E6MDAa/linux-4.15.0/mm/gup.c:156 in follow_page_pte
/build/linux-E6MDAa/linux-4.15.0/mm/gup.c:138

We see that you hit try_get_page() in mm/gup.c:156

 155 if (flags & FOLL_GET) {
 156 if (unlikely(!try_get_page(page))) {
 157 page = ERR_PTR(-ENOMEM);
 158 goto out;
 159 }

Looking at try_get_page() in include/linux/mm.h:

 854 static inline __must_check bool try_get_page(struct page *page)
 855 {
 856 page = compound_head(page);
 857 if (WARN_ON_ONCE(page_ref_count(page) <= 0))
 858 return false;
 859 page_ref_inc(page);
 860 return true;
 861 }

We see that you hit the exact same WARN_ON_ONCE for the page_ref_count(page) <= 0).

So, whatever page you are trying to access, has its reference counter in the negatives, which suggests that has either wrapped around, or has been decremented too many times.

Looking at your error log, I can't tell for sure if it is the zero_page, but its quite likely going to be. The zero_page is a frequently used page in the system, and it is used outside of ksm, it's just that ksm is a heavy user of the zero_page. If you are constantly allocating large amounts of new memory, you will be be using the zero_page similar to ksm, and the reference counter will eventually overflow.

I think there is a good chance that the fix I submitted in 4.15.0-118-generic will solve your problems. Please do a "apt update" and "apt upgrade" and upgrade to a newer kernel, the newer the better, and it will most likely fix the problem.

Let me know if you have any more questions.

Thanks,
Matthew