NULL Pointer Dereference During KVM MMU Page Invalidation

Bug #2035166 reported by Chengen Du
6
This bug affects 1 person
Affects Status Importance Assigned to Milestone
linux (Ubuntu)
In Progress
Undecided
Unassigned
Jammy
Fix Released
High
Chengen Du

Bug Description

[Impact]
During VM live migration, there is a potential risk of dereferencing a NULL pointer,
which can lead to memory access issues and result in an unstable environment.

[Fix]
The call trace is as follows:

kernel: BUG: kernel NULL pointer dereference, address: 0000000000000008
kernel: #PF: supervisor write access in kernel mode
kernel: #PF: error_code(0x0002) - not-present page
kernel: PGD 0 P4D 0
kernel: Oops: 0002 [#1] SMP NOPTI
kernel: CPU: 29 PID: 4063601 Comm: CPU 0/KVM Tainted: G IOE 5.15.0-53-generic #59~20.04.1-Ubuntu
kernel: Hardware name: Dell Inc. PowerEdge R640/0H28RR, BIOS 2.12.2 07/09/2021
kernel: RIP: 0010:__handle_changed_spte+0x3a9/0x620 [kvm]
kernel: Code: 48 8b 58 28 44 0f b6 63 24 48 8b 43 28 41 83 e4 0f 48 89 45 a0 0f 1f 44 00 00 45 84 d2 0f 85 06 02 00 00 48 8b 43 08 48 8b 13 <48> 89 42 08 48 89 10 44 0f b6 6b 23 48 b8 00 01 00 00 00 00 ad de
kernel: RSP: 0018:ffffb580320278a8 EFLAGS: 00010246
kernel: RAX: 0000000000000000 RBX: ffffa0fe29e94c38 RCX: 0000000000000027
kernel: RDX: 0000000000000000 RSI: 0000000000000004 RDI: ffffb5801e24ba58
kernel: RBP: ffffb58032027930 R08: 0000000000000000 R09: 0000000000000004
kernel: R10: 0000000000000001 R11: 0000000000000000 R12: 0000000000000003
kernel: R13: 0000000000000004 R14: 0000000000000000 R15: ffffb5801e235000
kernel: FS: 00007f1553fff700(0000) GS:ffffa20eff780000(0000) knlGS:0000000000000000
kernel: CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
kernel: CR2: 0000000000000008 CR3: 000000e7f7544004 CR4: 00000000007726e0
kernel: PKRU: 55555554
kernel: Call Trace:
kernel: <TASK>
kernel: ? __switch_to_xtra+0x109/0x510
kernel: zap_gfn_range+0x218/0x360 [kvm]
kernel: ? __smp_call_single_queue+0x59/0x90
kernel: ? alloc_cpumask_var_node+0x1/0x30
kernel: ? kvm_make_vcpus_request_mask+0x150/0x1d0 [kvm]
kernel: kvm_tdp_mmu_zap_invalidated_roots+0x5b/0xb0 [kvm]
kernel: kvm_mmu_zap_all_fast+0x19a/0x1d0 [kvm]
--
kernel: RAX: ffffffffffffffda RBX: 000000004020ae46 RCX: 00007f15aa26e3ab
kernel: RDX: 00007f1553ffe050 RSI: 000000004020ae46 RDI: 000000000000002f
kernel: RBP: 00005602a885a410 R08: 00005602a82ad000 R09: 00007f154c087470
kernel: R10: 0000000000000000 R11: 0000000000000246 R12: 00007f1553ffe050
kernel: R13: 00007f1553ffe160 R14: 0000000000000000 R15: 0000000000800000
kernel: </TASK>

The error occurred randomly in different production environments of the customer, all with the same call trace.
Therefore, the likelihood of other processes contaminating memory is low.
After analyzing the call trace with the help of debug symbols, we can pinpoint the source of the error.

root@focal:~/ddeb# eu-addr2line -ifae ./usr/lib/debug/lib/modules/5.15.0-53-generic/kernel/arch/x86/kvm/kvm.ko __handle_changed_spte+0x3a9
0x0000000000068109
__list_del inlined at /build/linux-hwe-5.15-ZCQu4B/linux-hwe-5.15-5.15.0/include/linux/list.h:135:2 in __handle_changed_spte
/build/linux-hwe-5.15-ZCQu4B/linux-hwe-5.15-5.15.0/include/linux/list.h:112:13
__list_del_entry
/build/linux-hwe-5.15-ZCQu4B/linux-hwe-5.15-5.15.0/include/linux/list.h:135:2
list_del
/build/linux-hwe-5.15-ZCQu4B/linux-hwe-5.15-5.15.0/include/linux/list.h:146:2
tdp_mmu_unlink_page
/build/linux-hwe-5.15-ZCQu4B/linux-hwe-5.15-5.15.0/arch/x86/kvm/mmu/tdp_mmu.c:305:2
handle_removed_tdp_mmu_page
/build/linux-hwe-5.15-ZCQu4B/linux-hwe-5.15-5.15.0/arch/x86/kvm/mmu/tdp_mmu.c:340:2
__handle_changed_spte
/build/linux-hwe-5.15-ZCQu4B/linux-hwe-5.15-5.15.0/arch/x86/kvm/mmu/tdp_mmu.c:491:3

The error occurred when the kernel attempted to delete an entry from a list.
This issue may potentially be related to timing and has proven challenging to reproduce consistently, making it difficult for us to pinpoint the cause.
It's worth noting that the current kernel has replaced the list_head with atomic_t, as indicated by the following commit.

d25ceb926436 KVM: x86/mmu: Track the number of TDP MMU pages, but not the actual pages

While this patch doesn't modify the triggering logic, it replaces the problematic section with a more reliable approach while keeping the original logic unchanged.
If the issue persists, it should not result in any memory access problems.
We also requested the customer to set up a test environment and simulate a workload similar to the production environment.
The patch worked well and did not introduce any adverse effects.

[Test Plan]
Reproducing the issue has proven to be challenging.
Simulating heavy live migration activity in the customer's production environment is the appropriate approach to ensure that applying the patch will not result in any adverse effects.

[Where problems could occur]
The patch will impact the live migration workflow, but it only modifies the data structure in use, and no functionality will be altered.

Chengen Du (chengendu)
Changed in linux (Ubuntu Jammy):
assignee: nobody → Chengen Du (chengendu)
Revision history for this message
Ubuntu Kernel Bot (ubuntu-kernel-bot) wrote : Missing required logs.

This bug is missing log files that will aid in diagnosing the problem. While running an Ubuntu kernel (not a mainline or third-party kernel) please enter the following command in a terminal window:

apport-collect 2035166

and then change the status of the bug to 'Confirmed'.

If, due to the nature of the issue you have encountered, you are unable to run this command, please add a comment stating that fact and change the bug status to 'Confirmed'.

This change has been made by an automated script, maintained by the Ubuntu Kernel Team.

Changed in linux (Ubuntu):
status: New → Incomplete
Changed in linux (Ubuntu Jammy):
status: New → Incomplete
Chengen Du (chengendu)
Changed in linux (Ubuntu Jammy):
status: Incomplete → In Progress
Changed in linux (Ubuntu):
status: Incomplete → New
Changed in linux (Ubuntu):
status: New → Incomplete
Stefan Bader (smb)
Changed in linux (Ubuntu):
status: Incomplete → Invalid
Changed in linux (Ubuntu Jammy):
importance: Undecided → Medium
importance: Medium → High
Chengen Du (chengendu)
Changed in linux (Ubuntu):
status: Invalid → In Progress
Changed in linux (Ubuntu Jammy):
status: In Progress → Fix Committed
Revision history for this message
Ubuntu Kernel Bot (ubuntu-kernel-bot) wrote :

This bug is awaiting verification that the linux/5.15.0-88.98 kernel in -proposed solves the problem. Please test the kernel and update this bug with the results. If the problem is solved, change the tag 'verification-needed-jammy-linux' to 'verification-done-jammy-linux'. If the problem still exists, change the tag 'verification-needed-jammy-linux' to 'verification-failed-jammy-linux'.

If verification is not done by 5 working days from today, this fix will be dropped from the source code, and this bug will be closed.

See https://wiki.ubuntu.com/Testing/EnableProposed for documentation how to enable and use -proposed. Thank you!

tags: added: kernel-spammed-jammy-linux-v2 verification-needed-jammy-linux
Revision history for this message
Chengen Du (chengendu) wrote :

The kernels (5.15.0-88.98) have been tested without any issues.

tags: added: verification-done-jammy-linux
removed: verification-needed-jammy-linux
Revision history for this message
Launchpad Janitor (janitor) wrote :
Download full text (22.7 KiB)

This bug was fixed in the package linux - 5.15.0-88.98

---------------
linux (5.15.0-88.98) jammy; urgency=medium

  * jammy/linux: 5.15.0-88.98 -proposed tracker (LP: #2038055)

  * CVE-2023-4244
    - netfilter: nf_tables: don't skip expired elements during walk
    - netfilter: nf_tables: adapt set backend to use GC transaction API
    - netfilter: nft_set_hash: mark set element as dead when deleting from packet
      path
    - netfilter: nf_tables: GC transaction API to avoid race with control plane
    - netfilter: nf_tables: remove busy mark and gc batch API
    - netfilter: nf_tables: don't fail inserts if duplicate has expired
    - netfilter: nf_tables: fix kdoc warnings after gc rework
    - netfilter: nf_tables: fix GC transaction races with netns and netlink event
      exit path
    - netfilter: nf_tables: GC transaction race with netns dismantle
    - netfilter: nf_tables: GC transaction race with abort path
    - netfilter: nf_tables: use correct lock to protect gc_list
    - netfilter: nf_tables: defer gc run if previous batch is still pending
    - netfilter: nft_dynset: disallow object maps
    - netfilter: nft_set_rbtree: skip sync GC for new elements in this transaction

  * CVE-2023-42756
    - netfilter: ipset: Fix race between IPSET_CMD_CREATE and IPSET_CMD_SWAP

  * CVE-2023-4623
    - net/sched: sch_hfsc: Ensure inner classes have fsc curve

  * PCI BARs larger than 128GB are disabled (LP: #2037403)
    - PCI: Support BAR sizes up to 8TB

  * Fix unstable audio at low levels on Thinkpad P1G4 (LP: #2037077)
    - ALSA: hda/realtek - ALC287 I2S speaker platform support

  * Check for changes relevant for security certifications (LP: #1945989)
    - [Packaging] Add a new fips-checks script

  * Jammy update: v5.15.126 upstream stable release (LP: #2037593)
    - io_uring: gate iowait schedule on having pending requests
    - perf: Fix function pointer case
    - net/mlx5: Free irqs only on shutdown callback
    - arm64: errata: Add workaround for TSB flush failures
    - arm64: errata: Add detection for TRBE write to out-of-range
    - [Config] updateconfigs for ARM64_ERRATUM_ and
      ARM64_WORKAROUND_TSB_FLUSH_FAILURE
    - iommu/arm-smmu-v3: Work around MMU-600 erratum 1076982
    - iommu/arm-smmu-v3: Document MMU-700 erratum 2812531
    - iommu/arm-smmu-v3: Add explicit feature for nesting
    - iommu/arm-smmu-v3: Document nesting-related errata
    - arm64: dts: imx8mn-var-som: add missing pull-up for onboard PHY reset pinmux
    - word-at-a-time: use the same return type for has_zero regardless of
      endianness
    - KVM: s390: fix sthyi error handling
    - wifi: cfg80211: Fix return value in scan logic
    - net/mlx5: DR, fix memory leak in mlx5dr_cmd_create_reformat_ctx
    - net/mlx5e: fix return value check in mlx5e_ipsec_remove_trailer()
    - bpf: Add length check for SK_DIAG_BPF_STORAGE_REQ_MAP_FD parsing
    - rtnetlink: let rtnl_bridge_setlink checks IFLA_BRIDGE_MODE length
    - net: dsa: fix value check in bcm_sf2_sw_probe()
    - perf test uprobe_from_different_cu: Skip if there is no gcc
    - net: sched: cls_u32: Fix match key mis-addressing
    - mISDN: hfcpci: Fix potential deadlock on &hc->l...

Changed in linux (Ubuntu Jammy):
status: Fix Committed → Fix Released
Revision history for this message
Ubuntu Kernel Bot (ubuntu-kernel-bot) wrote :

This bug is awaiting verification that the linux-gkeop/5.15.0-1032.38 kernel in -proposed solves the problem. Please test the kernel and update this bug with the results. If the problem is solved, change the tag 'verification-needed-jammy-linux-gkeop' to 'verification-done-jammy-linux-gkeop'. If the problem still exists, change the tag 'verification-needed-jammy-linux-gkeop' to 'verification-failed-jammy-linux-gkeop'.

If verification is not done by 5 working days from today, this fix will be dropped from the source code, and this bug will be closed.

See https://wiki.ubuntu.com/Testing/EnableProposed for documentation how to enable and use -proposed. Thank you!

tags: added: kernel-spammed-jammy-linux-gkeop-v2 verification-needed-jammy-linux-gkeop
Revision history for this message
Ubuntu Kernel Bot (ubuntu-kernel-bot) wrote :

This bug is awaiting verification that the linux-bluefield/5.15.0-1029.31 kernel in -proposed solves the problem. Please test the kernel and update this bug with the results. If the problem is solved, change the tag 'verification-needed-jammy-linux-bluefield' to 'verification-done-jammy-linux-bluefield'. If the problem still exists, change the tag 'verification-needed-jammy-linux-bluefield' to 'verification-failed-jammy-linux-bluefield'.

If verification is not done by 5 working days from today, this fix will be dropped from the source code, and this bug will be closed.

See https://wiki.ubuntu.com/Testing/EnableProposed for documentation how to enable and use -proposed. Thank you!

tags: added: kernel-spammed-jammy-linux-bluefield-v2 verification-needed-jammy-linux-bluefield
Revision history for this message
Ubuntu Kernel Bot (ubuntu-kernel-bot) wrote :

This bug is awaiting verification that the linux-intel-iotg/5.15.0-1044.50 kernel in -proposed solves the problem. Please test the kernel and update this bug with the results. If the problem is solved, change the tag 'verification-needed-jammy-linux-intel-iotg' to 'verification-done-jammy-linux-intel-iotg'. If the problem still exists, change the tag 'verification-needed-jammy-linux-intel-iotg' to 'verification-failed-jammy-linux-intel-iotg'.

If verification is not done by 5 working days from today, this fix will be dropped from the source code, and this bug will be closed.

See https://wiki.ubuntu.com/Testing/EnableProposed for documentation how to enable and use -proposed. Thank you!

Revision history for this message
Ubuntu Kernel Bot (ubuntu-kernel-bot) wrote :

This bug is awaiting verification that the linux-raspi/5.15.0-1042.45 kernel in -proposed solves the problem. Please test the kernel and update this bug with the results. If the problem is solved, change the tag 'verification-needed-jammy-linux-raspi' to 'verification-done-jammy-linux-raspi'. If the problem still exists, change the tag 'verification-needed-jammy-linux-raspi' to 'verification-failed-jammy-linux-raspi'.

If verification is not done by 5 working days from today, this fix will be dropped from the source code, and this bug will be closed.

See https://wiki.ubuntu.com/Testing/EnableProposed for documentation how to enable and use -proposed. Thank you!

tags: added: kernel-spammed-jammy-linux-intel-iotg-v2 verification-needed-jammy-linux-intel-iotg
tags: added: kernel-spammed-jammy-linux-raspi-v2 verification-needed-jammy-linux-raspi
Revision history for this message
Ubuntu Kernel Bot (ubuntu-kernel-bot) wrote :

This bug is awaiting verification that the linux-intel-iot-realtime/5.15.0-1042.44 kernel in -proposed solves the problem. Please test the kernel and update this bug with the results. If the problem is solved, change the tag 'verification-needed-jammy-linux-intel-iot-realtime' to 'verification-done-jammy-linux-intel-iot-realtime'. If the problem still exists, change the tag 'verification-needed-jammy-linux-intel-iot-realtime' to 'verification-failed-jammy-linux-intel-iot-realtime'.

If verification is not done by 5 working days from today, this fix will be dropped from the source code, and this bug will be closed.

See https://wiki.ubuntu.com/Testing/EnableProposed for documentation how to enable and use -proposed. Thank you!

tags: added: kernel-spammed-jammy-linux-intel-iot-realtime-v2 verification-needed-jammy-linux-intel-iot-realtime
Revision history for this message
Ubuntu Kernel Bot (ubuntu-kernel-bot) wrote :

This bug is awaiting verification that the linux-nvidia-tegra/5.15.0-1019.19 kernel in -proposed solves the problem. Please test the kernel and update this bug with the results. If the problem is solved, change the tag 'verification-needed-jammy-linux-nvidia-tegra' to 'verification-done-jammy-linux-nvidia-tegra'. If the problem still exists, change the tag 'verification-needed-jammy-linux-nvidia-tegra' to 'verification-failed-jammy-linux-nvidia-tegra'.

If verification is not done by 5 working days from today, this fix will be dropped from the source code, and this bug will be closed.

See https://wiki.ubuntu.com/Testing/EnableProposed for documentation how to enable and use -proposed. Thank you!

tags: added: kernel-spammed-jammy-linux-nvidia-tegra-v2 verification-needed-jammy-linux-nvidia-tegra
Revision history for this message
Ubuntu Kernel Bot (ubuntu-kernel-bot) wrote :

This bug is awaiting verification that the linux-gcp-tcpx/5.15.0-1002.2 kernel in -proposed solves the problem. Please test the kernel and update this bug with the results. If the problem is solved, change the tag 'verification-needed-focal-linux-gcp-tcpx' to 'verification-done-focal-linux-gcp-tcpx'. If the problem still exists, change the tag 'verification-needed-focal-linux-gcp-tcpx' to 'verification-failed-focal-linux-gcp-tcpx'.

If verification is not done by 5 working days from today, this fix will be dropped from the source code, and this bug will be closed.

See https://wiki.ubuntu.com/Testing/EnableProposed for documentation how to enable and use -proposed. Thank you!

tags: added: kernel-spammed-focal-linux-gcp-tcpx-v2 verification-needed-focal-linux-gcp-tcpx
Revision history for this message
Ubuntu Kernel Bot (ubuntu-kernel-bot) wrote :

This bug is awaiting verification that the linux-xilinx-zynqmp/5.15.0-1025.29 kernel in -proposed solves the problem. Please test the kernel and update this bug with the results. If the problem is solved, change the tag 'verification-needed-jammy-linux-xilinx-zynqmp' to 'verification-done-jammy-linux-xilinx-zynqmp'. If the problem still exists, change the tag 'verification-needed-jammy-linux-xilinx-zynqmp' to 'verification-failed-jammy-linux-xilinx-zynqmp'.

If verification is not done by 5 working days from today, this fix will be dropped from the source code, and this bug will be closed.

See https://wiki.ubuntu.com/Testing/EnableProposed for documentation how to enable and use -proposed. Thank you!

tags: added: kernel-spammed-jammy-linux-xilinx-zynqmp-v2 verification-needed-jammy-linux-xilinx-zynqmp
Revision history for this message
Ubuntu Kernel Bot (ubuntu-kernel-bot) wrote :

This bug is awaiting verification that the linux-nvidia-tegra-igx/5.15.0-1006.6 kernel in -proposed solves the problem. Please test the kernel and update this bug with the results. If the problem is solved, change the tag 'verification-needed-jammy-linux-nvidia-tegra-igx' to 'verification-done-jammy-linux-nvidia-tegra-igx'. If the problem still exists, change the tag 'verification-needed-jammy-linux-nvidia-tegra-igx' to 'verification-failed-jammy-linux-nvidia-tegra-igx'.

If verification is not done by 5 working days from today, this fix will be dropped from the source code, and this bug will be closed.

See https://wiki.ubuntu.com/Testing/EnableProposed for documentation how to enable and use -proposed. Thank you!

tags: added: kernel-spammed-jammy-linux-nvidia-tegra-igx-v2 verification-needed-jammy-linux-nvidia-tegra-igx
Revision history for this message
Ubuntu Kernel Bot (ubuntu-kernel-bot) wrote :

This bug is awaiting verification that the linux-nvidia-tegra-5.15/5.15.0-1019.19~20.04.1 kernel in -proposed solves the problem. Please test the kernel and update this bug with the results. If the problem is solved, change the tag 'verification-needed-focal-linux-nvidia-tegra-5.15' to 'verification-done-focal-linux-nvidia-tegra-5.15'. If the problem still exists, change the tag 'verification-needed-focal-linux-nvidia-tegra-5.15' to 'verification-failed-focal-linux-nvidia-tegra-5.15'.

If verification is not done by 5 working days from today, this fix will be dropped from the source code, and this bug will be closed.

See https://wiki.ubuntu.com/Testing/EnableProposed for documentation how to enable and use -proposed. Thank you!

tags: added: kernel-spammed-focal-linux-nvidia-tegra-5.15-v2 verification-needed-focal-linux-nvidia-tegra-5.15
Revision history for this message
Ubuntu Kernel Bot (ubuntu-kernel-bot) wrote :

This bug is awaiting verification that the linux-azure-fips/5.15.0-1053.61+fips1 kernel in -proposed solves the problem. Please test the kernel and update this bug with the results. If the problem is solved, change the tag 'verification-needed-jammy-linux-azure-fips' to 'verification-done-jammy-linux-azure-fips'. If the problem still exists, change the tag 'verification-needed-jammy-linux-azure-fips' to 'verification-failed-jammy-linux-azure-fips'.

If verification is not done by 5 working days from today, this fix will be dropped from the source code, and this bug will be closed.

See https://wiki.ubuntu.com/Testing/EnableProposed for documentation how to enable and use -proposed. Thank you!

tags: added: kernel-spammed-jammy-linux-azure-fips-v2 verification-needed-jammy-linux-azure-fips
To post a comment you must log in.
This report contains Public information  
Everyone can see this information.

Other bug subscribers

Remote bug watches

Bug watches keep track of this bug in other bug trackers.