Hipersocket page allocation failure on Ubuntu 20.04 based SSC environments

Bug #1959529 reported by bugproxy
This bug affects 2 people
Affects                  Status        Importance  Assigned to            Milestone
Ubuntu on IBM z Systems  Fix Released  High        Skipper Bug Screeners
linux (Ubuntu)           Fix Released  High        Krzysztof Kozlowski
Focal                    Fix Released  High        Krzysztof Kozlowski

Bug Description

== Comment: #0 - D. Gary Chapman <email address hidden> - 2022-01-26 12:43:25 ==
---Problem Description---
IBM IDAA customer exposes hipersocket page allocation failure on Ubuntu 20.04 based SSC

Contact Information = Gary Chapman (<email address hidden>)

---uname output---
5.4.0-73-generic #82-Ubuntu

Machine Type = IBM Z in IDAA SSC-mode lpar

---Debugger---
A debugger is not configured

----------

IBM SSC LEVEL: 4.1.5
IBM IDAA LEVEL: 7.5.6

On a client system we are observing this:

Jan 19 16:25:57 data5 kernel: kworker/u760:28: page allocation failure: order:0, mode:0xa20(GFP_ATOMIC), nodemask=(null),cpuset=/,mems_allowed=0
Jan 19 16:25:57 data5 kernel: CPU: 20 PID: 4137988 Comm: kworker/u760:28 Kdump: loaded Tainted: G OE 5.4.0-73-generic #82-Ubuntu
Jan 19 16:25:57 data5 kernel: Hardware name: IBM 8561 T01 727 (LPAR)
Jan 19 16:25:57 data5 kernel: Workqueue: kcryptd/253:11 kcryptd_crypt [dm_crypt]
Jan 19 16:25:57 data5 kernel: Call Trace:
Jan 19 16:25:57 data5 kernel: ([<0000006b6d63e092>] show_stack+0x7a/0xc0)
Jan 19 16:25:57 data5 kernel: [<0000006b6d64588a>] dump_stack+0x8a/0xb8
Jan 19 16:25:57 data5 kernel: [<0000006b6cfd8262>] warn_alloc+0xe2/0x160

IBM LTC Networking team has identified the upstream commit 714c9108851743bb718fbc1bfb81290f12a53854 as the root cause.

This patch shows up in the Ubuntu kernel source tree:
https://git.launchpad.net/~ubuntu-kernel/ubuntu/+source/linux/commit/?id=714c9108851743bb718fbc1bfb81290f12a53854
but has not been ported to Ubuntu 20.04 / kernel 5.4

IDAA on SSC requests backport to focal.

bugproxy (bugproxy)
tags: added: architecture-s39064 bugnameltc-196116 severity-high targetmilestone-inin---
Changed in ubuntu:
assignee: nobody → Skipper Bug Screeners (skipper-screen-team)
affects: ubuntu → linux (Ubuntu)
Frank Heimes (fheimes)
summary: - IBM IDAA customer exposes hypersocket page allocation failure on Ubuntu
+ IBM IDAA customer exposes hipersocket page allocation failure on Ubuntu
20.04 based SSC
Changed in ubuntu-z-systems:
assignee: nobody → Skipper Bug Screeners (skipper-screen-team)
Changed in linux (Ubuntu):
importance: Undecided → High
Changed in ubuntu-z-systems:
importance: Undecided → High
Frank Heimes (fheimes) wrote : Re: IBM IDAA customer exposes hipersocket page allocation failure on Ubuntu 20.04 based SSC

A simple cherry-pick of commit 714c91088517 does not apply cleanly.
Although the commit itself is relatively small, the code in qeth_core_main.c changed significantly between kernel 5.7 (where the commit was accepted upstream) and focal's 5.4.
Hence a backport of commit 714c91088517 to focal's master-next tree is needed.
(changing the bug to Incomplete for now)

Changed in linux (Ubuntu):
status: New → Incomplete
Changed in ubuntu-z-systems:
status: New → Incomplete
Frank Heimes (fheimes) wrote :
bugproxy (bugproxy) wrote : Comment bridged from LTC Bugzilla

------- Comment From <email address hidden> 2022-02-03 03:44 EDT-------
Here's the backported patch, based on git://git.launchpad.net/~ubuntu-kernel/ubuntu/+source/linux/+git/focal

===============================================================================

Author: Julian Wiedmann <email address hidden>
Date: Wed Mar 18 13:54:45 2020 +0100

s390/qeth: use memory reserves to back RX buffers

Use dev_alloc_page() for backing the RX buffers with pages. This way we
pick up __GFP_MEMALLOC.

Signed-off-by: Julian Wiedmann <email address hidden>
Signed-off-by: David S. Miller <email address hidden>
(backported from commit 714c9108851743bb718fbc1bfb81290f12a53854)
Signed-off-by: Alexandra Winter <email address hidden>

diff --git a/drivers/s390/net/qeth_core_main.c b/drivers/s390/net/qeth_core_main.c
index ec8c7a640d9e..61372e5c279b 100644
--- a/drivers/s390/net/qeth_core_main.c
+++ b/drivers/s390/net/qeth_core_main.c
@@ -227,7 +227,7 @@ static int qeth_alloc_buffer_pool(struct qeth_card *card)
 			return -ENOMEM;
 		}
 		for (j = 0; j < QETH_MAX_BUFFER_ELEMENTS(card); ++j) {
-			ptr = (void *) __get_free_page(GFP_KERNEL);
+			ptr = (void *) __dev_alloc_page(GFP_KERNEL);
 			if (!ptr) {
 				while (j > 0)
 					free_page((unsigned long)
@@ -2612,7 +2612,7 @@ static struct qeth_buffer_pool_entry *qeth_find_free_buffer_pool_entry(
 			struct qeth_buffer_pool_entry, list);
 	for (i = 0; i < QETH_MAX_BUFFER_ELEMENTS(card); ++i) {
 		if (page_count(virt_to_page(entry->elements[i])) > 1) {
-			page = alloc_page(GFP_ATOMIC);
+			page = dev_alloc_page();
 			if (!page) {
 				return NULL;
 			} else {

Frank Heimes (fheimes)
description: updated
Frank Heimes (fheimes)
summary: - IBM IDAA customer exposes hipersocket page allocation failure on Ubuntu
- 20.04 based SSC
+ Hipersocket page allocation failure on Ubuntu 20.04 based SSC
+ environments.
summary: Hipersocket page allocation failure on Ubuntu 20.04 based SSC
- environments.
+ environments
Frank Heimes (fheimes) wrote :

backport of 714c91088517 to focal 5.4

Frank Heimes (fheimes) wrote :

SRU request submitted to the Ubuntu kernel team mailing list for focal:
https://lists.ubuntu.com/archives/kernel-team/2022-February/thread.html#127681
Changing status to 'In Progress'.

Changed in linux (Ubuntu):
status: Incomplete → In Progress
Changed in ubuntu-z-systems:
status: Incomplete → In Progress
Changed in linux (Ubuntu):
assignee: Skipper Bug Screeners (skipper-screen-team) → Canonical Kernel Team (canonical-kernel-team)
tags: added: patch
Frank Heimes (fheimes) wrote :

Hi Alexandra, the kernel team has a minor objection to this modification.
Would you please have a look at their response here:
https://lists.ubuntu.com/archives/kernel-team/2022-February/127682.html
Thx

bugproxy (bugproxy) wrote :

------- Comment From <email address hidden> 2022-02-04 09:19 EDT-------
Actually the customer problem is fixed by the second part of the patch.
(alloc_page can issue warnings, but dev_alloc_page does not).

Would it be a solution to only backport the second part?
Are there any conventions for documenting that in the commit header?

e.g. like this:
===================================================================
Author: Julian Wiedmann <email address hidden>
Date: Wed Mar 18 13:54:45 2020 +0100

s390/qeth: use memory reserves to back RX buffers

Use dev_alloc_page() for backing the RX buffers with pages. This way we
pick up __GFP_MEMALLOC.

Signed-off-by: Julian Wiedmann <email address hidden>
Signed-off-by: David S. Miller <email address hidden>
(cherry-picked from commit 714c9108851743bb718fbc1bfb81290f12a53854)
[__GFP_NOWARN for running devices]
Signed-off-by: Alexandra Winter <email address hidden>

diff --git a/drivers/s390/net/qeth_core_main.c b/drivers/s390/net/qeth_core_main.c
index ec8c7a640d9e..e106db961c43 100644
--- a/drivers/s390/net/qeth_core_main.c
+++ b/drivers/s390/net/qeth_core_main.c
@@ -2612,7 +2612,7 @@ static struct qeth_buffer_pool_entry *qeth_find_free_buffer_pool_entry(
 			struct qeth_buffer_pool_entry, list);
 	for (i = 0; i < QETH_MAX_BUFFER_ELEMENTS(card); ++i) {
 		if (page_count(virt_to_page(entry->elements[i])) > 1) {
-			page = alloc_page(GFP_ATOMIC);
+			page = dev_alloc_page();
 			if (!page) {
 				return NULL;
 			} else {
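
For readers unfamiliar with the flags involved, here is a small userspace sketch of why the second hunk silences the warning. It models the allocation masks as plain integers. The `M_GFP_*` constants and `model_*` helpers are hypothetical names, and the bit values are assumptions recalled from the v5.4 `include/linux/gfp.h`; the only value this report itself confirms is `mode:0xa20(GFP_ATOMIC)` from the log. The composition of the new mask (`GFP_ATOMIC | __GFP_NOWARN`, plus `__GFP_COMP | __GFP_MEMALLOC`) follows the `dev_alloc_page()` helpers in `include/linux/skbuff.h`.

```c
#include <assert.h>

/* Assumed bit values (recalled from v5.4 include/linux/gfp.h). The M_ prefix
 * marks them as model constants, not the kernel's own macros. */
#define M_GFP_HIGH           0x20u
#define M_GFP_ATOMIC         0x200u
#define M_GFP_KSWAPD_RECLAIM 0x800u
#define M_GFP_NOWARN         0x2000u
#define M_GFP_MEMALLOC       0x20000u
#define M_GFP_COMP           0x40000u

/* GFP_ATOMIC: high priority, atomic context, may wake kswapd. */
static inline unsigned int model_gfp_atomic(void)
{
	return M_GFP_HIGH | M_GFP_ATOMIC | M_GFP_KSWAPD_RECLAIM;
}

/* Mask of the old call, alloc_page(GFP_ATOMIC). */
static inline unsigned int model_old_rx_mask(void)
{
	return model_gfp_atomic();
}

/* Mask of the new call, dev_alloc_page(): no warning, may use reserves. */
static inline unsigned int model_new_rx_mask(void)
{
	return model_gfp_atomic() | M_GFP_NOWARN | M_GFP_COMP | M_GFP_MEMALLOC;
}

/* warn_alloc() returns early when __GFP_NOWARN is set in the mask. */
static inline int model_would_warn(unsigned int mask)
{
	return !(mask & M_GFP_NOWARN);
}

/* Sanity check of the assumed values against the log line in this report. */
static inline void model_check_against_log(void)
{
	assert(model_old_rx_mask() == 0xa20u);
}
```

With these assumed values, `model_old_rx_mask()` evaluates to 0xa20, matching the log, and `model_would_warn()` is false only for the new mask, which also carries the memory-reserve bit named in the commit message.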

Frank Heimes (fheimes) wrote :

Thanks to Krzysztof, who spent some time getting this properly in,
there is now a patched focal/20.04 kernel 5.4 available here:
https://launchpad.net/~krzk/+archive/ubuntu/linux-testing
https://launchpad.net/~krzk/+archive/ubuntu/linux-testing/+build/23143306
Since more effort was spent than originally planned,
it would be highly appreciated if this kernel could be tested
and if we could get feedback on whether it solves the issue.

bugproxy (bugproxy) wrote :

------- Comment From <email address hidden> 2022-02-08 17:25 EDT-------
Acknowledging we (IBM SSC, IDAA) saw this and are working on verification. Successfully pulled 5.4.0-100.113~test2 patches for packages included in SSC (linux-cloud-tools-common, linux-libc-dev, linux-tools-common, linux-tools-host).

bugproxy (bugproxy) wrote :

------- Comment From <email address hidden> 2022-02-09 09:45 EDT-------
Did a quick test of HiperSockets and OSA with
add-apt-repository ppa:krzk/linux-testing
apt install linux-image-unsigned-5.4.0-100-generic/focal
Looks good to me.

I did not exercise any memory pressure though. So it makes sense that
Gary tries this with the original (IBM SSC, IDAA) scenario and verifies that
the panic on warning no longer occurs.

bugproxy (bugproxy) wrote :

------- Comment From <email address hidden> 2022-02-10 08:49 EDT-------
IDAA team tested the new image with stress on the Hipersocket and did not observe any problem. We need this fix to progress to focal-updates as quickly as possible. Thank you for your rapid turnaround on this customer issue.

Krzysztof Kozlowski (krzk) wrote :
Changed in linux (Ubuntu):
assignee: Canonical Kernel Team (canonical-kernel-team) → Krzysztof Kozlowski (krzk)
Changed in linux (Ubuntu Focal):
assignee: nobody → Krzysztof Kozlowski (krzk)
status: New → In Progress
importance: Undecided → High
bugproxy (bugproxy) wrote :

------- Comment From <email address hidden> 2022-02-11 09:02 EDT-------
Thanks again to Krzysztof and Thadeu for the careful review of the backport.
I'm embarrassed to admit that my initial tests didn't show the problem; possibly a build error on my side. I tried again today and KASAN showed the issue.
I also took the opportunity to educate myself more about get_free_page,
alloc_page and GFP flags. Thanks again.
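
A rough userspace model of the distinction mentioned here, offered as a sketch and not as a statement of what the review said: `__get_free_page()` returns the page's virtual address as an unsigned long, while `alloc_page()` and `__dev_alloc_page()` return a `struct page *` that has to go through `page_address()` before it can be used as a buffer address. The qeth code in the diffs passes `entry->elements[i]` to `virt_to_page()`, so the pool is expected to hold addresses. All `model_*` names below are hypothetical stand-ins for the kernel functions named in their comments.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define MODEL_PAGE_SIZE 4096u
#define MODEL_NR_PAGES  4u

/* Stand-in for struct page: metadata about a page, not its contents. */
struct model_page { int refcount; };

static struct model_page model_mem_map[MODEL_NR_PAGES];
static unsigned char model_ram[MODEL_NR_PAGES][MODEL_PAGE_SIZE];
static unsigned int model_next_free;

/* Like page_address(): descriptor -> address of the data it describes. */
static inline void *model_page_address(struct model_page *page)
{
	return model_ram[page - model_mem_map];
}

/* Like virt_to_page(): data address -> descriptor. Only meaningful for
 * addresses inside model_ram, which the assert enforces in this model. */
static inline struct model_page *model_virt_to_page(void *addr)
{
	uintptr_t base = (uintptr_t)model_ram;
	uintptr_t a = (uintptr_t)addr;

	assert(a >= base && a < base + sizeof(model_ram));
	return &model_mem_map[(a - base) / MODEL_PAGE_SIZE];
}

/* Like alloc_page() / __dev_alloc_page(): returns a descriptor. */
static inline struct model_page *model_alloc_page(void)
{
	if (model_next_free >= MODEL_NR_PAGES)
		return NULL;
	model_mem_map[model_next_free].refcount = 1;
	return &model_mem_map[model_next_free++];
}

/* Like __get_free_page(): returns the data address directly. */
static inline uintptr_t model_get_free_page(void)
{
	struct model_page *page = model_alloc_page();

	return page ? (uintptr_t)model_page_address(page) : 0;
}
```

In this model, a value from `model_get_free_page()` can be stored in the pool as is, whereas a value from `model_alloc_page()` needs `model_page_address()` first. Casting the descriptor pointer straight to `void *` would hand `model_virt_to_page()` an address outside the data area.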

bugproxy (bugproxy)
tags: added: targetmilestone-inin2004
removed: targetmilestone-inin---
Stefan Bader (smb)
Changed in linux (Ubuntu Focal):
status: In Progress → Fix Committed
Changed in linux (Ubuntu):
status: In Progress → Fix Released
Frank Heimes (fheimes)
Changed in ubuntu-z-systems:
status: In Progress → Fix Committed
Ubuntu Kernel Bot (ubuntu-kernel-bot) wrote :

This bug is awaiting verification that the linux/5.4.0-102.115 kernel in -proposed solves the problem. Please test the kernel and update this bug with the results. If the problem is solved, change the tag 'verification-needed-focal' to 'verification-done-focal'. If the problem still exists, change the tag 'verification-needed-focal' to 'verification-failed-focal'.

If verification is not done by 5 working days from today, this fix will be dropped from the source code, and this bug will be closed.

See https://wiki.ubuntu.com/Testing/EnableProposed for documentation how to enable and use -proposed. Thank you!

tags: added: verification-needed-focal
bugproxy (bugproxy) wrote :

------- Comment From <email address hidden> 2022-02-25 12:03 EDT-------
I installed and tested:
# add-apt-repository ppa:fheimes/lp1959529
# apt install linux-image-unsigned-5.4.0-102-generic/focal-proposed

(The instructions in https://wiki.ubuntu.com/Testing/EnableProposed did not work for me)

bugproxy (bugproxy)
tags: added: verification-done-focal
removed: verification-needed-focal
Launchpad Janitor (janitor) wrote :

This bug was fixed in the package linux - 5.4.0-105.119

---------------
linux (5.4.0-105.119) focal; urgency=medium

  * CVE-2022-0847
    - lib/iov_iter: initialize "flags" in new pipe_buffer

  * Broken network on some AWS instances with focal/impish kernels
    (LP: #1961968)
    - SAUCE: Revert "PCI/MSI: Mask MSI-X vectors only on success"

  * [UBUNTU 20.04] kernel: Add support for CPU-MF counter second version 7
    (LP: #1960182)
    - s390/cpumf: Support for CPU Measurement Facility CSVN 7
    - s390/cpumf: Support for CPU Measurement Sampling Facility LS bit

  * Hipersocket page allocation failure on Ubuntu 20.04 based SSC environments
    (LP: #1959529)
    - s390/qeth: use memory reserves to back RX buffers

  * CVE-2022-0516
    - KVM: s390: Return error on SIDA memop on normal guest

  * CVE-2022-0435
    - tipc: improve size validations for received domain records

  * CVE-2022-0492
    - cgroup-v1: Require capabilities to set release_agent

  * Recalled NFSv4 files delegations overwhelm server (LP: #1957986)
    - NFSv4: Fix delegation handling in update_open_stateid()
    - NFSv4: nfs4_callback_getattr() should ignore revoked delegations
    - NFSv4: Delegation recalls should not find revoked delegations
    - NFSv4: fail nfs4_refresh_delegation_stateid() when the delegation was
      revoked
    - NFS: Rename nfs_inode_return_delegation_noreclaim()
    - NFSv4: Don't remove the delegation from the super_list more than once
    - NFSv4: Hold the delegation spinlock when updating the seqid
    - NFSv4: Clear the NFS_DELEGATION_REVOKED flag in
      nfs_update_inplace_delegation()
    - NFSv4: Update the stateid seqid in nfs_revoke_delegation()
    - NFSv4: Revoke the delegation on success in nfs4_delegreturn_done()
    - NFSv4: Ignore requests to return the delegation if it was revoked
    - NFSv4: Don't reclaim delegations that have been returned or revoked
    - NFSv4: nfs4_return_incompatible_delegation() should check delegation
      validity
    - NFSv4: Fix nfs4_inode_make_writeable()
    - NFS: nfs_inode_find_state_and_recover() fix stateid matching
    - NFSv4: Fix races between open and delegreturn
    - NFSv4: Handle NFS4ERR_OLD_STATEID in delegreturn
    - NFSv4: Don't retry the GETATTR on old stateid in nfs4_delegreturn_done()
    - NFSv4: nfs_inode_evict_delegation() should set NFS_DELEGATION_RETURNING
    - NFS: Clear NFS_DELEGATION_RETURN_IF_CLOSED when the delegation is returned
    - NFSv4: Try to return the delegation immediately when marked for return on
      close
    - NFSv4: Add accounting for the number of active delegations held
    - NFSv4: Limit the total number of cached delegations
    - NFSv4: Ensure the delegation is pinned in nfs_do_return_delegation()
    - NFSv4: Ensure the delegation cred is pinned when we call delegreturn

  * Focal update: v5.4.174 upstream stable release (LP: #1960566)
    - HID: uhid: Fix worker destroying device without any protection
    - HID: wacom: Reset expected and received contact counts at the same time
    - HID: wacom: Ignore the confidence flag when a touch is removed
    - HID: wacom: Avoid using stale array indicies to read contact count
    - f2fs: fix to ...

Changed in linux (Ubuntu Focal):
status: Fix Committed → Fix Released
Frank Heimes (fheimes)
Changed in ubuntu-z-systems:
status: Fix Committed → Fix Released