Memory leak in net/xfrm/xfrm_state.c - 8 pages per ipsec connection

Bug #1853197 reported by MIKE OLLIFF on 2019-11-19
This bug affects 3 people
Affects          Importance  Assigned to
linux (Ubuntu)   Undecided   Unassigned
Bionic           High        Stefan Bader
Disco            High        Stefan Bader
Eoan             High        Stefan Bader

Bug Description

[SRU Justification]

== Impact ==

An upstream change in v4.11 made xfrm leak memory (8 pages per IPsec connection). This was fixed in v5.4 by:
  commit 86c6739eda7d "xfrm: Fix memleak on xfrm state destroy"

== Fix ==

Pick the upstream fix into all affected series.

== Testcase ==

see below

== Risk of Regression ==

Low, the change adds a single memory release case in one driver. The effect can be verified.

---

Ubuntu linux distro, 4.15.0-62 kernel, server platform.
This OS is used as an IPSec VPN gateway. It serves up to several hundred concurrent connections.

In an attempt to upgrade from the 4.4 kernel to 4.15, the team noticed that VPN gateway VMs were running out of physical memory after 12-48 hours, depending on load.

Attachments, taken from a server machine in this state, are in the attached leakinfo.txt:
output of free -t
output of /proc/meminfo in the out-of-memory condition
output of slabtop -o -sc
/sys/kernel/debug/page_owner, sorted and aggregated, after the server ran for 12 hours and ran out of memory
Patches for 4.15 and 5.4

Highlight from page_owner: the leak is a buffer associated with the IPsec implementation. Each connection leaks 32 KiB of memory (8 pages) through an order-3 page allocation.

100960 times:
Page allocated via order 3, mask 0x1085220(GFP_ATOMIC|__GFP_NOWARN|__GFP_NORETRY|__GFP_COMP)
 get_page_from_freelist+0xd64/0x1250
 __alloc_pages_nodemask+0x11c/0x2e0
 alloc_pages_current+0x6a/0xe0
 skb_page_frag_refill+0x71/0x100
 esp_output_head+0x265/0x3e0 [esp4]
 esp_output+0xbc/0x180 [esp4]
 xfrm_output_resume+0x179/0x530
 xfrm_output+0x8e/0x230
 xfrm4_output_finish+0x2b/0x30
 __xfrm4_output+0x3a/0x50
 xfrm4_output+0x43/0xc0
 ip_forward_finish+0x51/0x80
 ip_forward+0x38a/0x480
 ip_rcv_finish+0x122/0x410
 ip_rcv+0x292/0x360
 __netif_receive_skb_core+0x815/0xbd0
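
To put the count above in perspective, here is a minimal sketch of how a page_owner header line translates into leaked bytes. The helper names are hypothetical (they belong to no kernel or Ubuntu tool), and the 4096-byte page size is an assumption chosen to match the "8 pages = 32k" figure in this report.

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>

/*
 * Extract the allocation order from a page_owner header line such as
 * "Page allocated via order 3, mask 0x1085220(...)".
 * Returns 0 on success, -1 if the line does not match.
 * Hypothetical helper, for illustration only.
 */
int parse_page_owner_order(const char *line, unsigned int *order)
{
	return sscanf(line, "Page allocated via order %u", order) == 1 ? 0 : -1;
}

/*
 * Bytes held by `count` leaked allocations of 2^order pages each.
 * page_size is an assumption: 4096 matches the 32k-per-connection figure.
 */
uint64_t leaked_bytes(uint64_t count, unsigned int order, uint64_t page_size)
{
	assert(order < 32);
	return count * ((uint64_t)1 << order) * page_size;
}
```

With the figures from the trace, leaked_bytes(100960, 3, 4096) comes to 3308257280 bytes, roughly 3.1 GiB held by this single call site.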

Patch to fix this issue in 4.15 (tested and verified on same server exhibiting above leak):
diff --git a/net/xfrm/xfrm_state.c b/net/xfrm/xfrm_state.c
index 728272f..7842f83 100644
--- a/net/xfrm/xfrm_state.c
+++ b/net/xfrm/xfrm_state.c
@@ -451,6 +451,10 @@ static void xfrm_state_gc_destroy(struct xfrm_state *x)
        }
        xfrm_dev_state_free(x);
        security_xfrm_state_free(x);
+
+       if(x->xfrag.page)
+               put_page(x->xfrag.page);
+
        kfree(x);
}

Patch for master branch (5.4 I believe) from Paul Wouters (<email address hidden>)

diff --git a/net/xfrm/xfrm_state.c b/net/xfrm/xfrm_state.c
index c6f3c4a1bd99..f3423562d933 100644
--- a/net/xfrm/xfrm_state.c
+++ b/net/xfrm/xfrm_state.c
@@ -495,6 +495,8 @@ static void ___xfrm_state_destroy(struct xfrm_state *x)
                x->type->destructor(x);
                xfrm_put_type(x->type);
        }
+       if (x->xfrag.page)
+               put_page(x->xfrag.page);
        xfrm_dev_state_free(x);
        security_xfrm_state_free(x);
        xfrm_state_free(x);
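
Both patches make the same change: the destroy path drops the reference on the page cached in x->xfrag. The following userspace sketch is a toy model of that pattern, not kernel code. All toy_* names are hypothetical, and it assumes a simplified world in which the state holds the only reference to its cached page.

```c
#include <assert.h>
#include <stdlib.h>

/* Toy stand-ins for struct page and struct xfrm_state (hypothetical). */
struct toy_page { int refcount; };
struct toy_state { struct toy_page *frag_page; };

int toy_live_pages; /* pages allocated and not yet freed */

struct toy_page *toy_alloc_page(void)
{
	struct toy_page *p = malloc(sizeof(*p));

	if (!p)
		return NULL;
	p->refcount = 1;
	toy_live_pages++;
	return p;
}

/* Drop one reference; free the page when the last reference goes away. */
void toy_put_page(struct toy_page *p)
{
	assert(p->refcount > 0);
	if (--p->refcount == 0) {
		free(p);
		toy_live_pages--;
	}
}

struct toy_state *toy_state_new(void)
{
	return calloc(1, sizeof(struct toy_state));
}

/* Output path: the first packet makes the state cache a page. */
int toy_state_output(struct toy_state *x)
{
	if (!x->frag_page)
		x->frag_page = toy_alloc_page();
	return x->frag_page ? 0 : -1;
}

/* Destroy path: with_fix selects whether the cached page reference is dropped. */
void toy_state_destroy(struct toy_state *x, int with_fix)
{
	if (with_fix && x->frag_page)
		toy_put_page(x->frag_page);
	free(x);
}
```

Destroying a state with with_fix set to 0 leaves toy_live_pages one higher than before, which mirrors the per-connection leak. With with_fix set to 1 the counter returns to its previous value.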

Severity: Critical - we are unable to use any kernel later than 4.11, and are sticking with 4.4 in production.

MIKE OLLIFF (mikeo-symc) wrote :

This bug is missing log files that will aid in diagnosing the problem. While running an Ubuntu kernel (not a mainline or third-party kernel) please enter the following command in a terminal window:

apport-collect 1853197

and then change the status of the bug to 'Confirmed'.

If, due to the nature of the issue you have encountered, you are unable to run this command, please add a comment stating that fact and change the bug status to 'Confirmed'.

This change has been made by an automated script, maintained by the Ubuntu Kernel Team.

Changed in linux (Ubuntu):
status: New → Incomplete
MIKE OLLIFF (mikeo-symc) wrote :

All VPN servers have been rolled back to 4.4
Additional log collection is not possible.
Setting status to confirmed.

Changed in linux (Ubuntu):
status: Incomplete → Confirmed
MIKE OLLIFF (mikeo-symc) on 2019-11-19
description: updated
Kai-Heng Feng (kaihengfeng) wrote :

commit 86c6739eda7d2a03f2db30cbee67a5fb81afa8ba
Author: Steffen Klassert <email address hidden>
Date: Wed Nov 6 08:13:49 2019 +0100

    xfrm: Fix memleak on xfrm state destroy

    We leak the page that we use to create skb page fragments
    when destroying the xfrm_state. Fix this by dropping a
    page reference if a page was assigned to the xfrm_state.

    Fixes: cac2661c53f3 ("esp4: Avoid skb_cow_data whenever possible")
    Reported-by: JD <email address hidden>
    Reported-by: Paul Wouters <email address hidden>
    Signed-off-by: Steffen Klassert <email address hidden>

This commit will be picked up automatically by a later kernel update, since it has a "Fixes" tag.

MIKE OLLIFF (mikeo-symc) wrote :

That fix is in the master branch - can it be backported?

Stefan Bader (smb) wrote :

Setting this to invalid for Focal. The fix is in upstream v5.4 and we will move to that version soon.

Changed in linux (Ubuntu Bionic):
importance: Undecided → High
Changed in linux (Ubuntu Disco):
importance: Undecided → High
Changed in linux (Ubuntu Eoan):
importance: Undecided → High
Changed in linux (Ubuntu Bionic):
status: New → Triaged
Changed in linux (Ubuntu Disco):
status: New → Triaged
Changed in linux (Ubuntu Eoan):
status: New → Triaged
Changed in linux (Ubuntu):
status: Confirmed → Invalid
Stefan Bader (smb) on 2019-11-29
description: updated
Changed in linux (Ubuntu Eoan):
assignee: nobody → Stefan Bader (smb)
Changed in linux (Ubuntu Disco):
assignee: nobody → Stefan Bader (smb)
Changed in linux (Ubuntu Bionic):
assignee: nobody → Stefan Bader (smb)
Changed in linux (Ubuntu Bionic):
status: Triaged → Fix Committed
Changed in linux (Ubuntu Disco):
status: Triaged → Fix Committed
Changed in linux (Ubuntu Eoan):
status: Triaged → Fix Committed

This bug is awaiting verification that the kernel in -proposed solves the problem. Please test the kernel and update this bug with the results. If the problem is solved, change the tag 'verification-needed-disco' to 'verification-done-disco'. If the problem still exists, change the tag 'verification-needed-disco' to 'verification-failed-disco'.

If verification is not done by 5 working days from today, this fix will be dropped from the source code, and this bug will be closed.

See https://wiki.ubuntu.com/Testing/EnableProposed for documentation how to enable and use -proposed. Thank you!

tags: added: verification-needed-disco
tags: added: verification-needed-bionic

This bug is awaiting verification that the kernel in -proposed solves the problem. Please test the kernel and update this bug with the results. If the problem is solved, change the tag 'verification-needed-bionic' to 'verification-done-bionic'. If the problem still exists, change the tag 'verification-needed-bionic' to 'verification-failed-bionic'.

This bug is awaiting verification that the kernel in -proposed solves the problem. Please test the kernel and update this bug with the results. If the problem is solved, change the tag 'verification-needed-eoan' to 'verification-done-eoan'. If the problem still exists, change the tag 'verification-needed-eoan' to 'verification-failed-eoan'.

tags: added: verification-needed-eoan
Bernd Schütte (pent1ckel) wrote :

It has been running for five days and memory consumption looks normal (not leaking).

Changed in linux (Ubuntu Bionic):
status: Fix Committed → Confirmed
MIKE OLLIFF (mikeo-symc) wrote :

Tested 4.15 bionic with original use case. Memory leak is resolved.

Stefan Bader (smb) on 2019-12-10
Changed in linux (Ubuntu Bionic):
status: Confirmed → Fix Committed
tags: added: verification-done-bionic
removed: verification-needed-bionic
Bernd Schütte (pent1ckel) wrote :

Does it help when we test disco and eoan as well? The test case is very easy and those kernels are affected as well.

Stefan Bader (smb) wrote :

If there is an easy way to get those releases set up and tested, it helps to build confidence. In this case I think the chances are not that high that the change has a different effect in different kernel versions. But if someone either is already on Eoan/5.3 or has time to double check, that sure has value. I would not bother about Disco/5.0 that much because that is going end of life soon.
