Guest crashes post migration with migrate_misplaced_folio+0x4cc/0x5d0

Bug #2076866 reported by bugproxy
Affects                           Status         Importance  Assigned to
The Ubuntu-power-systems project  In Progress    High        Ubuntu on IBM Power Systems Bug Triage
linux (Ubuntu)                    (status tracked in Oracular)
  Noble                           In Progress    Undecided   Canonical Kernel Team
  Oracular                        Fix Committed  High        Unassigned

Bug Description

SRU Justification:

[ Impact ]

 * A KVM guest (VM) that is live migrated between two Power10 systems
   (using nested virtualization, i.e. KVM on top of PowerVM) will very
   likely crash after about an hour.

 * At that point the live migration itself appeared to have completed
   successfully, but it had not, and the crash is caused by it.

[ Test Plan ]

 * Set up two Power10 systems (with firmware level FW1060 or newer,
   which supports nested KVM) with Ubuntu Server 24.04 for ppc64el.

 * Set up a QEMU/KVM environment that allows live migrating a KVM
   guest from one P10 system to the other.

 * (The disk type does not seem to matter; NFS-based disk storage
    can be used, for example.)

 * After about an hour the live-migrated guest is likely to crash.
   Hence wait for 2 hours (which increases the likelihood), and
   a crash due to:
   "migrate_misplaced_folio+0x540/0x5d0"
   occurs.

[ Where problems could occur ]

 * The fix avoids calling folio_likely_mapped_shared in cases where the
   folio might already have been unmapped, by moving the checks under
   the page table lock (PTL). If done wrong, this could affect page
   table locking, which may lead to wrong locks, blocked memory and
   finally crashes.

 * The direct folio calls in mm/huge_memory.c and mm/memory.c are now
   indirect, which may lead to different behaviour and side effects.
   However, isolation is still done, just slightly differently:
   instead of using numamigrate_isolate_folio, it now happens in (the
   renamed) migrate_misplaced_folio_prepare (see the sketch after the
   links below).

 * Further upstream conversations:
   https://<email address hidden>
   https://<email address hidden>
   https://<email address hidden>
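
 * For illustration, a minimal sketch of the reworked fault-path flow
   (based on commit ee86814b0562; simplified, not the verbatim upstream
   code, and the function wrapper here is hypothetical):

    /* sketch: NUMA hinting fault path after ee86814b0562 */
    static void numa_migrate_sketch(struct vm_fault *vmf,
                                    struct folio *folio, int target_nid)
    {
            /* Isolation and the "is this folio mapped shared?" checks now
             * run while the page table lock (PTL) is still held, so the
             * folio cannot be concurrently unmapped under us. */
            if (migrate_misplaced_folio_prepare(folio, vmf->vma, target_nid)) {
                    pte_unmap_unlock(vmf->pte, vmf->ptl);
                    return;         /* isolation failed, do not migrate */
            }

            /* Only the actual migration runs after the PTL is dropped. */
            pte_unmap_unlock(vmf->pte, vmf->ptl);
            migrate_misplaced_folio(folio, target_nid);
    }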

 * Fixing a confusing return code, so that 0 is now simply returned on
   success, clarifies the return code handling and usage, and was mainly
   done in preparation for further changes; however, it could have bad
   side effects if the return code were still used elsewhere under the
   old convention (see the sketch after the links below).

 * Further upstream conversations:
   https://<email address hidden>
   https://<email address hidden>
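
 * A minimal sketch of the calling-convention change (based on commit
   4b88c23ab8c9; the caller pattern is simplified and the argument list
   abbreviated):

    /* before: a non-zero return meant "folio was isolated and migrated" */
    if (migrate_misplaced_folio(folio, vma, target_nid))
            nid = target_nid;

    /* after: 0 means success, so the caller's check is inverted */
    if (!migrate_misplaced_folio(folio, vma, target_nid))
            nid = target_nid;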

 * NUMA balancing used to prohibit mTHP
   (multi-size Transparent Hugepage Support), which seems unreasonable
   when the mapping is exclusive.
   Allowing it brings significant performance improvements
   (see the commit message of d2136d749d76), but introduces significant
   changes to PTE mapping and modification, and relies on further commits:
   859d4adc3415 ("mm: numa: do not trap faults on shared data section pages")
   80d47f5de5e3 ("mm: don't try to NUMA-migrate COW pages that have other uses")
   This could cause issues on systems configured for THP,
   may confuse the ordering, and may even lead to memory corruption.
   It would especially hit (NUMA) systems with high core counts,
   where balancing is needed more often.
   (A sketch of the core idea follows the links below.)

 * Further upstream conversations:
   https://<email address hidden>/
   https://lkml.kernel.org/r/c33a5c0b0a0323b1f8ed53772f50501f4b196e25<email address hidden>
   https://lkml.kernel.org/r/d28d276d599c26df7f38c9de8446f60e22dd1950<email address hidden>
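
 * A heavily simplified sketch of the core idea of d2136d749d76 (clearly
   not the upstream diff; the label and surrounding code are hypothetical):

    /* NUMA hinting fault: PTE-mapped large folios (mTHP) are no longer
     * skipped, but only exclusively mapped ones are migration candidates */
    if (folio_test_large(folio)) {
            if (folio_likely_mapped_shared(folio))
                    goto out_map;   /* shared mTHP: leave it in place */
            /* exclusive mTHP: handle all PTEs of the folio as one unit */
    }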

 * The refactoring that moves the NUMA mapping rebuilding code into a
   new helper seems straightforward, since the active code stays
   unchanged. The new function needs to be callable, which is the case
   since it all lives in mm/memory.c (see the sketch after the links
   below).

 * Further upstream conversations:
   https://<email address hidden>
   https://<email address hidden>
   https://lkml.kernel.org/r/8bc2586bdd8dbbe6d83c09b77b360ec8fcac3736<email address hidden>
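
 * A minimal sketch of such a helper (modelled on commit 6b0ed7b3c775;
   simplified, and the parameter list is an approximation of the
   upstream one):

    /* rebuild the PTE with the VMA's real protections once the NUMA
     * hinting fault has been handled */
    static void numa_rebuild_single_mapping(struct vm_fault *vmf,
                                            struct vm_area_struct *vma,
                                            bool writable)
    {
            pte_t pte, old_pte;

            old_pte = ptep_modify_prot_start(vma, vmf->address, vmf->pte);
            pte = pte_modify(old_pte, vma->vm_page_prot);
            pte = pte_mkyoung(pte);
            if (writable)
                    pte = pte_mkwrite(pte, vma);
            ptep_modify_prot_commit(vma, vmf->address, vmf->pte, old_pte, pte);
            update_mmu_cache_range(vmf, vma, vmf->address, vmf->pte, 1);
    }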

 * The refactoring of folio_estimated_sharers to folio_likely_mapped_shared
   is more significant, since the logic changed from
   (folio_estimated_sharers) 'estimate the number of sharers of a folio' to
   (folio_likely_mapped_shared) 'estimate if the folio is mapped into the page
   tables of more than one MM'.

 * Since this is an estimation, the results may be unpredictable
   (especially for bigger folios), and not as expected or assumed
   (the code comments of ebb34f78d72c contain quite a few side notes
   that mention potentially fuzzy results), hence this
   may lead to unforeseen behavior.

 * The condition statements became clearer, since they are now based on
   (more or less obvious) number counts, but can still be erroneous in
   case the underlying mapping-count estimation is incorrect
   (see the sketch after the links below).

 * Further upstream conversations:
   https://<email address hidden>
   https://<email address hidden>
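
 * A sketch of the semantic change, reduced to its core (the upstream
   helper gained more corner-case handling for large folios; this is
   only the essential before/after):

    /* old: a raw estimate of the number of sharers of a folio */
    static inline int folio_estimated_sharers(struct folio *folio)
    {
            return page_mapcount(folio_page(folio, 0));
    }

    /* new: a yes/no guess whether the folio is mapped by more than one
     * MM; precise for small folios, an estimate for large ones */
    static inline bool folio_likely_mapped_shared(struct folio *folio)
    {
            return page_mapcount(folio_page(folio, 0)) > 1;
    }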

 * Commit 133d04b1eee9 extends commit bda420b98505 ("numa balancing: migrate
   on fault among multiple bound nodes") from allowing NUMA fault migrations
   when the executing node is part of the policy mask for MPOL_BIND,
   to also supporting the MPOL_PREFERRED_MANY policy.
   Both cases (MPOL_BIND and MPOL_PREFERRED_MANY) are treated in the same
   way (see the sketch after the links below).
   If the NUMA topology is not correctly considered, changes here
   may lead to decreased memory performance.
   However, the code changes themselves are relatively easy to follow.

 * Further upstream conversations:
   https://lkml.kernel.org/r/<email address hidden>
   https://lkml.kernel.org/r/<email address hidden>
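
 * A simplified sketch of the resulting logic in mpol_misplaced() (the
   two cases are merged and abbreviated here for illustration; not the
   verbatim upstream code):

    switch (pol->mode) {
    /* ... other policies ... */
    case MPOL_BIND:
    case MPOL_PREFERRED_MANY:       /* now handled like MPOL_BIND here */
            /* with "migrate on fault" enabled, a page may follow the
             * task: if the executing node is in the policy node mask,
             * migrating towards it is allowed */
            if ((pol->flags & MPOL_F_MORON) &&
                node_isset(thisnid, pol->nodes))
                    break;          /* consider migration to thisnid */
            goto out;               /* otherwise leave the page alone */
    /* ... */
    }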

 * Finally, commit f8fd525ba3a2 ("mm/mempolicy: use numa_node_id() instead
   of cpu_to_node()") is part of a patch set to further optimize
   cross-socket memory access with the MPOL_PREFERRED_MANY policy.
   The mpol_misplaced changes mainly move from cpu_to_node to
   numa_node_id, and with that make the code more NUMA aware.
   Based on that, vm_fault/vmf needs to be passed in instead of
   vm_area_struct/vma.
   This may have consequences for the memory policy itself
   (see the sketch after the links below).

 * Further upstream conversations:
   https://<email address hidden>
   https://lkml.kernel.org/r/<email address hidden>
   https://<email address hidden>
   https://lkml.kernel.org/r/<email address hidden>
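
 * A minimal sketch of the two aspects of this change (prototypes
   abbreviated; the exact upstream signatures may differ slightly):

    /* before: derive the local node from the current CPU id */
    int thisnid = cpu_to_node(raw_smp_processor_id());

    /* after: ask directly for the local NUMA node */
    int thisnid = numa_node_id();

    /* and the fault context is handed down instead of just the VMA:
     *   old: int mpol_misplaced(struct folio *folio,
     *                           struct vm_area_struct *vma, unsigned long addr);
     *   new: int mpol_misplaced(struct folio *folio,
     *                           struct vm_fault *vmf, unsigned long addr);
     */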

 * The overall patch set touches quite a bit of common code,
   but the modifications were discussed extensively with many experts
   in the various mailing-list threads referenced above.

[ Other Info ]

 * The first two "mm/migrate" commits are the newest and were
   accepted upstream with kernel v6.11(-rc1);
   all others have been upstream since v6.10(-rc1).

 * Hence oracular (with a planned target kernel of 6.11) is not affected,
   and the SRU is for noble only.

 * And since (nested) KVM virtualization on ppc64el was (re-)introduced
   only with noble, no Ubuntu releases older than noble are affected.

__________

== Comment: #0 - SEETEENA THOUFEEK <email address hidden> - 2024-08-12 23:50:17 ==
+++ This bug was initially created as a clone of Bug #207985 +++

---Problem Description---
Post Migration Non-MDC L1 eralp1 crashed with migrate_misplaced_folio+0x4cc/0x5d0

Machine Type = na

Contact Information = <email address hidden>

---Steps to Reproduce---
Problem description:
After 1 hour of successful migration from doodlp1 [MDC mode] to eralp1 [non-MDC mode], the eralp1 guest crashed
and a dump was collected.

---uname output---
na

---Debugger---
A debugger is not configured

[281827.975244] NIP [c0000000005f0620] migrate_misplaced_folio+0x4f0/0x5d0
[281827.975251] LR [c0000000005f067c] migrate_misplaced_folio+0x54c/0x5d0
[281827.975258] Call Trace:
[281827.975260] [c000001e19ff7140] [c0000000005f0670] migrate_misplaced_folio+0x540/0x5d0 (unreliable)
[281827.975268] [c000001e19ff71d0] [c00000000054c9f0] __handle_mm_fault+0xf70/0x28e0
[281827.975276] [c000001e19ff7310] [c00000000054e478] handle_mm_fault+0x118/0x400
[281827.975284] [c000001e19ff7360] [c00000000053598c] __get_user_pages+0x1ec/0x5b0
[281827.975291] [c000001e19ff7420] [c000000000536920] get_user_pages_unlocked+0x120/0x4f0
[281827.975298] [c000001e19ff74c0] [c00800001894ea9c] hva_to_pfn+0xf4/0x630 [kvm]
[281827.975316] [c000001e19ff7550] [c008000018b4efc4] kvmppc_book3s_instantiate_page+0xec/0x790 [kvm_hv]
[281827.975326] [c000001e19ff7660] [c008000018b4f750] kvmppc_book3s_radix_page_fault+0xe8/0x380 [kvm_hv]
[281827.975335] [c000001e19ff7700] [c008000018b488fc] kvmppc_book3s_hv_page_fault+0x294/0xd60 [kvm_hv]
[281827.975344] [c000001e19ff77e0] [c008000018b43f5c] kvmppc_vcpu_run_hv+0xf94/0x11d0 [kvm_hv]
[281827.975352] [c000001e19ff78a0] [c00800001896131c] kvmppc_vcpu_run+0x34/0x48 [kvm]
[281827.975365] [c000001e19ff78c0] [c00800001895c164] kvm_arch_vcpu_ioctl_run+0x39c/0x570 [kvm]
[281827.975379] [c000001e19ff7950] [c00800001894a104] kvm_vcpu_ioctl+0x20c/0x9a8 [kvm]
[281827.975391] [c000001e19ff7b30] [c000000000683974] sys_ioctl+0x574/0x16a0
[281827.975395] [c000001e19ff7c30] [c000000000030838] system_call_exception+0x168/0x310
[281827.975400] [c000001e19ff7e50] [c00000000000d05c] system_call_vectored_common+0x15c/0x2ec
[281827.975406] --- interrupt: 3000 at 0x7fffb7d4d2bc

Mirroring to distro as per message in group channel

Please pick these patches for this bug:

ee86814b0562 ("mm/migrate: move NUMA hinting fault folio isolation + checks under PTL")
4b88c23ab8c9 ("mm/migrate: make migrate_misplaced_folio() return 0 on success")
d2136d749d76 ("mm: support multi-size THP numa balancing")
6b0ed7b3c775 ("mm: factor out the numa mapping rebuilding into a new helper")
ebb34f78d72c ("mm: convert folio_estimated_sharers() to folio_likely_mapped_shared()")
133d04b1eee9 ("mm/numa_balancing: allow migrate on protnone reference with MPOL_PREFERRED_MANY policy")
f8fd525ba3a2 ("mm/mempolicy: use numa_node_id() instead of cpu_to_node()")

Thanks,
Amit

bugproxy (bugproxy)
tags: added: architecture-ppc64le bugnameltc-208549 severity-high targetmilestone-inin---
Changed in ubuntu:
assignee: nobody → Ubuntu on IBM Power Systems Bug Triage (ubuntu-power-triage)
affects: ubuntu → kernel-package (Ubuntu)
Frank Heimes (fheimes)
affects: kernel-package (Ubuntu) → linux (Ubuntu)
Frank Heimes (fheimes)
Changed in ubuntu-power-systems:
assignee: nobody → Ubuntu on IBM Power Systems Bug Triage (ubuntu-power-triage)
Changed in linux (Ubuntu):
assignee: Ubuntu on IBM Power Systems Bug Triage (ubuntu-power-triage) → nobody
importance: Undecided → High
Changed in ubuntu-power-systems:
importance: Undecided → High
Changed in linux (Ubuntu):
status: New → Triaged
Changed in ubuntu-power-systems:
status: New → Triaged
Changed in linux (Ubuntu Noble):
status: New → Triaged
Frank Heimes (fheimes)
summary: - ISST-LTE:KOP:1060.1:doodlp1g8:Post Migration Non-MDC L1 eralp1 crashed
- with migrate_misplaced_folio+0x4cc/0x5d0
+ Guest crahses post migration with migrate_misplaced_folio+0x4cc/0x5d0
Kleber Sacilotto de Souza (kleber-souza) wrote : Re: Guest crahses post migration with migrate_misplaced_folio+0x4cc/0x5d0

The patches mentioned above were applied upstream either for v6.10-rc1 or for v6.11-rc1, therefore they are included in the kernel currently in oracular-proposed (6.11.0-5.5).

Changed in linux (Ubuntu Oracular):
status: Triaged → Fix Committed
Frank Heimes (fheimes)
description: updated
Frank Heimes (fheimes)
description: updated
Frank Heimes (fheimes)
summary: - Guest crahses post migration with migrate_misplaced_folio+0x4cc/0x5d0
+ Guest crashes post migration with migrate_misplaced_folio+0x4cc/0x5d0
Frank Heimes (fheimes)
description: updated
Frank Heimes (fheimes) wrote :

Pull request submitted to kernel team's mailing list:
https://lists.ubuntu.com/archives/kernel-team/2024-September/thread.html#153390
changing status to 'In Progress'.

Changed in linux (Ubuntu Noble):
status: Triaged → In Progress
Changed in ubuntu-power-systems:
status: Triaged → In Progress
Changed in linux (Ubuntu Noble):
assignee: nobody → Canonical Kernel Team (canonical-kernel-team)