Azure: TDX enabled hyper-visors cause segfault

Bug #2003714 reported by Tim Gardner
Affects: linux-azure (Ubuntu)
Status: Fix Released
Importance: High
Assigned to: Tim Gardner

Bug Description

SRU Justification

[Impact]

Microsoft TDX-enabled hypervisors cause segfaults in guest userspace due to an upstream glibc bug. This can be worked around with a kernel patch.

Issue Description:

When I start an Intel TDX Ubuntu 22.04 (or RHEL 9.0) guest on Hyper-V, the guest always hits segfaults and can’t boot up. Here the kernel running in the guest is the upstream kernel + my TDX patchset, or the 5.19.0-azure kernel + the same TDX patchset:

[Fix]

We confirmed the segfault also happens to TDX guests on the KVM hypervisor. After I checked with more Intel folks, it turns out this is indeed a glibc bug (https://sourceware.org/bugzilla/show_bug.cgi?id=28784), which has been fixed in upstream glibc, but Ubuntu 22.04 and newer haven’t picked up the glibc fix yet.

I got a temporary kernel-side workaround from Intel: https://github.com/dcui/tdx/commit/16218cf73491e867fd39c16c9e4b8aa926cbda68, which is on the same existing branch “decui/upstream-kinetic-22.10/master-next/1209”.

[ 21.081453] Run /init as init process
[ 21.086896] with arguments:
[ 21.095790] /init
[ 21.100982] with environment:
[ 21.106611] HOME=/
[ 21.112463] TERM=linux
[ 21.119850] BOOT_IMAGE=/boot/vmlinuz-6.1.0-rc7-decui+

Loading, please wait...

Starting version 249.11-0ubuntu3.6

[ 21.253908] udevadm[144]: segfault at 56538d61e0c0 ip 00007f8f5899efeb sp 00007ffd08fb7648 error 6 in libc.so.6[7f8f58820000+195000] likely on CPU 0 (core 0, socket 0)
[ 21.316549] Code: 07 62 e1 7d 48 e7 4f 01 62 e1 7d 48 e7 67 40 62 e1 7d 48 e7 6f 41 62 61 7d 48 e7 87 00 20 00 00 62 61 7d 48 e7 8f 40 20 00 00 <62> 61 7d 48 e7 a7 00 30 00 00 62 61 7d 48 e7 af 40 30 00 00 48 83

Segmentation fault

[ 22.499317] setfont[153]: segfault at 55ef3b91b000 ip 00007f5899899fa4 sp 00007ffc8008f628 error 4 in libc.so.6[7f589971b000+195000] likely on CPU 0 (core 0, socket 0)
[ 22.602677] Code: 06 62 e1 fe 48 6f 4e 01 62 e1 fe 48 6f 66 40 62 e1 fe 48 6f 6e 41 62 61 fe 48 6f 86 00 20 00 00 62 61 fe 48 6f 8e 40 20 00 00 <62> 61 fe 48 6f a6 00 30 00 00 62 61 fe 48 6f ae 40 30 00 00 48 83
[ 22.732413] loadkeys[156]: segfault at 563ffe292000 ip 00007fbff957afa4 sp 00007ffe31453808 error 4 in libc.so.6[7fbff93fc000+195000] likely on CPU 0 (core 0, socket 0)
[ 22.833061] Code: 06 62 e1 fe 48 6f 4e 01 62 e1 fe 48 6f 66 40 62 e1 fe 48 6f 6e 41 62 61 fe 48 6f 86 00 20 00 00 62 61 fe 48 6f 8e 40 20 00 00 <62> 61 fe 48 6f a6 00 30 00 00 62 61 fe 48 6f ae 40 30 00 00 48 83

The segfault only happens with recent glibc versions (e.g. v2.35 in Ubuntu 22.04, and v2.34 in RHEL 9.0). It doesn’t happen with v2.31 in Ubuntu 20.04, or v2.32 in Ubuntu 20.10. So something in glibc must have changed between v2.32 (good) and v2.34+ (not working for TDX). The oddity is: when I run the same Ubuntu 22.04/RHEL 9.0 image as a regular non-TDX guest, the segfault never happens.
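
The segfault lines logged above share a fixed format, so they can be compared mechanically. The following is a minimal sketch, not part of the original report: the function name `parse_segfault` and the returned field names are made up for illustration. It assumes the bracketed `[base+size]` values are the start address and length of the mapping that contains the instruction pointer, and it decodes the low three bits of the x86 page-fault error code (bit 0: page present, bit 1: write access, bit 2: user mode).

```python
import re

# Matches the kernel's user-space segfault message as it appears in the logs above.
SEGFAULT_RE = re.compile(
    r"(?P<comm>\S+)\[(?P<pid>\d+)\]: segfault at (?P<addr>[0-9a-f]+) "
    r"ip (?P<ip>[0-9a-f]+) sp (?P<sp>[0-9a-f]+) error (?P<err>[0-9a-f]+) "
    r"in (?P<obj>[^\[]+)\[(?P<base>[0-9a-f]+)\+(?P<size>[0-9a-f]+)\]"
)


def parse_segfault(line):
    """Return a dict describing one segfault log line, or None if it does not match."""
    m = SEGFAULT_RE.search(line)
    if m is None:
        return None
    err = int(m["err"], 16)
    ip = int(m["ip"], 16)
    base = int(m["base"], 16)
    return {
        "comm": m["comm"],
        "pid": int(m["pid"]),
        "fault_addr": int(m["addr"], 16),
        "object": m["obj"],
        # Offset of the faulting instruction from the start of the reported mapping.
        "ip_offset": ip - base,
        # Low bits of the x86 page-fault error code.
        "page_present": bool(err & 1),
        "write": bool(err & 2),
        "user_mode": bool(err & 4),
    }


if __name__ == "__main__":
    sample = (
        "[ 22.499317] setfont[153]: segfault at 55ef3b91b000 ip 00007f5899899fa4 "
        "sp 00007ffc8008f628 error 4 in libc.so.6[7f589971b000+195000] "
        "likely on CPU 0 (core 0, socket 0)"
    )
    info = parse_segfault(sample)
    print(info["comm"], hex(info["ip_offset"]), info["write"], info["page_present"])
```

Applied to the setfont and loadkeys lines, this gives the same offset into the libc.so.6 mapping for both, which suggests that the two processes fault at the same instruction. "error 6" decodes to a user-mode write to a page that is not present, and "error 4" to a user-mode read of a page that is not present.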

If I boot up an Ubuntu 20.04 TDX guest (which works fine), mount an Ubuntu 22.04 VHD image (“mount /dev/sdd1 /mnt”) and try to run “chroot /mnt”, I hit the same segfault:

[ 109.478556] EXT4-fs (sdd1): mounted filesystem with ordered data mode. Quota mode: none.
[ 129.224444] bash[2112]: segfault at 556987854000 ip 00007f88468c4ea4 sp 00007ffc22ecf158 error 6 in libc.so.6[7f8846828000+195000] likely on CPU 48 (core 0, socket 48)
[ 129.242434] Code: e7 bf 30 10 00 00 66 44 0f e7 87 00 20 00 00 66 44 0f e7 8f 10 20 00 00 66 44 0f e7 97 20 20 00 00 66 44 0f e7 9f 30 20 00 00 <66> 44 0f e7 a7 00 30 00 00 66 44 0f e7 af 10 30 00 00 66 44 0f e7

It looks like the application is referencing a memory location that somehow triggers a page fault, which is converted to a SIGSEGV signal that terminates the application (I’m not sure where the “movntdq” instructions below come from):

root@decui-u2004-u28:/opt/linus-0824# echo 'Code: e7 bf 30 10 00 00 66 44 0f e7 87 00 20 00 00 66 44 0f e7 8f 10 20 00 00 66 44 0f e7 97 20 20 00 00 66 44 0f e7 9f 30 20 00 00 <66> 44 0f e7 a7 00 30 00 00 66 44 0f e7 af 10 30 00 00 66 44 0f e7' | scripts/decodecode

Code: e7 bf 30 10 00 00 66 44 0f e7 87 00 20 00 00 66 44 0f e7 8f 10 20 00 00 66 44 0f e7 97 20 20 00 00 66 44 0f e7 9f 30 20 00 00 <66> 44 0f e7 a7 00 30 00 00 66 44 0f e7 af 10 30 00 00 66 44 0f e7

All code
========
   0: e7 bf out %eax,$0xbf
   2: 30 10 xor %dl,(%rax)
   4: 00 00 add %al,(%rax)
   6: 66 44 0f e7 87 00 20 movntdq %xmm8,0x2000(%rdi)
   d: 00 00
   f: 66 44 0f e7 8f 10 20 movntdq %xmm9,0x2010(%rdi)
  16: 00 00
  18: 66 44 0f e7 97 20 20 movntdq %xmm10,0x2020(%rdi)
  1f: 00 00
  21: 66 44 0f e7 9f 30 20 movntdq %xmm11,0x2030(%rdi)
  28: 00 00
  2a:* 66 44 0f e7 a7 00 30 movntdq %xmm12,0x3000(%rdi) <-- trapping instruction
  31: 00 00
  33: 66 44 0f e7 af 10 30 movntdq %xmm13,0x3010(%rdi)
  3a: 00 00
  3c: 66 data16
  3d: 44 rex.R
  3e: 0f .byte 0xf
  3f: e7 .byte 0xe7

Code starting with the faulting instruction
===========================================

   0: 66 44 0f e7 a7 00 30 movntdq %xmm12,0x3000(%rdi)
   7: 00 00
   9: 66 44 0f e7 af 10 30 movntdq %xmm13,0x3010(%rdi)
  10: 00 00
  12: 66 data16
  13: 44 rex.R
  14: 0f .byte 0xf
  15: e7 .byte 0xe7
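
The decoded window can be tied back to the fault address with a little arithmetic. The sketch below is illustrative only: it assumes the "segfault at" value in the bash log line is the effective address of the trapping store, so that %rdi can be recovered by subtracting the 0x3000 displacement. The helper names `store_targets` and `page_of` are hypothetical.

```python
PAGE_SIZE = 0x1000

FAULT_ADDR = 0x556987854000   # "segfault at" value from the bash log line
TRAP_DISPLACEMENT = 0x3000    # from: movntdq %xmm12,0x3000(%rdi)

# Displacements of the fully decoded movntdq stores, in program order.
DISPLACEMENTS = [0x2000, 0x2010, 0x2020, 0x2030, 0x3000, 0x3010]


def page_of(addr):
    """Return the base address of the page that contains addr."""
    return addr & ~(PAGE_SIZE - 1)


def store_targets(rdi):
    """Return (displacement, target address, page base) for each decoded store."""
    return [(d, rdi + d, page_of(rdi + d)) for d in DISPLACEMENTS]


# Assumption: the fault address equals rdi + displacement of the trapping store.
RDI = FAULT_ADDR - TRAP_DISPLACEMENT

if __name__ == "__main__":
    print("rdi =", hex(RDI))
    for disp, target, page in store_targets(RDI):
        marker = "  <-- trapping" if target == FAULT_ADDR else ""
        print(f"{disp:#06x}(%rdi) -> {target:#x} (page {page:#x}){marker}")
```

Under that assumption, the four stores before the trap all land in the page starting at 0x556987853000, and the trapping store is the first one in the window to touch the page starting at 0x556987854000.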

After I added a delay (“sleep 2 minutes”) in the kernel’s arch/x86/mm/fault.c: show_signal_msg(), it turned out that the application is somehow trying to write to the end of the heap area: the faulting address 556987854000 is exactly where the [heap] mapping ends, so it is not mapped in the process’s address space, and the segfault is triggered:

[ 129.224444] bash[2112]: segfault at 556987854000 ip 00007f88468c4ea4 sp 00007ffc22ecf158 error 6 in libc.so.6[7f8846828000+195000] likely on CPU 48 (core 0, socket 48)

root@decui-u2004-u28:/proc/2112# cat maps

5569874a9000-5569874d8000 r--p 00000000 08:31 1582 /mnt/usr/bin/bash
5569874d8000-5569875b7000 r-xp 0002f000 08:31 1582 /mnt/usr/bin/bash
5569875b7000-5569875f1000 r--p 0010e000 08:31 1582 /mnt/usr/bin/bash
5569875f2000-5569875f6000 r--p 00148000 08:31 1582 /mnt/usr/bin/bash
5569875f6000-5569875ff000 rw-p 0014c000 08:31 1582 /mnt/usr/bin/bash
5569875ff000-55698760a000 rw-p 00000000 00:00 0
556987833000-556987854000 rw-p 00000000 00:00 0 [heap]
7f8846400000-7f88466e9000 r--p 00000000 08:31 6124 /mnt/usr/lib/locale/locale-archive
7f8846800000-7f8846828000 r--p 00000000 08:31 4966 /mnt/usr/lib/x86_64-linux-gnu/libc.so.6
7f8846828000-7f88469bd000 r-xp 00028000 08:31 4966 /mnt/usr/lib/x86_64-linux-gnu/libc.so.6
7f88469bd000-7f8846a15000 r--p 001bd000 08:31 4966 /mnt/usr/lib/x86_64-linux-gnu/libc.so.6
7f8846a15000-7f8846a19000 r--p 00214000 08:31 4966 /mnt/usr/lib/x86_64-linux-gnu/libc.so.6
7f8846a19000-7f8846a1b000 rw-p 00218000 08:31 4966 /mnt/usr/lib/x86_64-linux-gnu/libc.so.6
7f8846a1b000-7f8846a28000 rw-p 00000000 00:00 0
7f8846b09000-7f8846b10000 r--s 00000000 08:31 3841 /mnt/usr/lib/x86_64-linux-gnu/gconv/gconv-modules.cache
7f8846b10000-7f8846b13000 rw-p 00000000 00:00 0
7f8846b13000-7f8846b21000 r--p 00000000 08:31 4729 /mnt/usr/lib/x86_64-linux-gnu/libtinfo.so.6.3
7f8846b21000-7f8846b32000 r-xp 0000e000 08:31 4729 /mnt/usr/lib/x86_64-linux-gnu/libtinfo.so.6.3
7f8846b32000-7f8846b40000 r--p 0001f000 08:31 4729 /mnt/usr/lib/x86_64-linux-gnu/libtinfo.so.6.3
7f8846b40000-7f8846b44000 r--p 0002c000 08:31 4729 /mnt/usr/lib/x86_64-linux-gnu/libtinfo.so.6.3
7f8846b44000-7f8846b45000 rw-p 00030000 08:31 4729 /mnt/usr/lib/x86_64-linux-gnu/libtinfo.so.6.3
7f8846b4b000-7f8846b4d000 rw-p 00000000 00:00 0
7f8846b4d000-7f8846b4f000 r--p 00000000 08:31 4960 /mnt/usr/lib/x86_64-linux-gnu/ld-linux-x86-64.so.2
7f8846b4f000-7f8846b79000 r-xp 00002000 08:31 4960 /mnt/usr/lib/x86_64-linux-gnu/ld-linux-x86-64.so.2
7f8846b79000-7f8846b84000 r--p 0002c000 08:31 4960 /mnt/usr/lib/x86_64-linux-gnu/ld-linux-x86-64.so.2
7f8846b85000-7f8846b87000 r--p 00037000 08:31 4960 /mnt/usr/lib/x86_64-linux-gnu/ld-linux-x86-64.so.2
7f8846b87000-7f8846b89000 rw-p 00039000 08:31 4960 /mnt/usr/lib/x86_64-linux-gnu/ld-linux-x86-64.so.2
7ffc22eb1000-7ffc22ed2000 rw-p 00000000 00:00 0 [stack]
7ffc22fcd000-7ffc22fd1000 r--p 00000000 00:00 0 [vvar]
7ffc22fd1000-7ffc22fd3000 r-xp 00000000 00:00 0 [vdso]
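
The relationship between the fault address and the mappings can be checked mechanically. This is a minimal sketch: the excerpt copies a few lines from the listing above, with the [heap] label on the same line as its range (the way /proc/<pid>/maps normally prints it), and the helper names `parse_maps` and `find_region` are made up for illustration.

```python
MAPS_EXCERPT = """\
5569875f6000-5569875ff000 rw-p 0014c000 08:31 1582 /mnt/usr/bin/bash
5569875ff000-55698760a000 rw-p 00000000 00:00 0
556987833000-556987854000 rw-p 00000000 00:00 0 [heap]
7f8846400000-7f88466e9000 r--p 00000000 08:31 6124 /mnt/usr/lib/locale/locale-archive
"""

FAULT_ADDR = 0x556987854000  # "segfault at" value from the bash log line


def parse_maps(text):
    """Parse maps-style lines into (start, end, perms, name) tuples."""
    regions = []
    for line in text.splitlines():
        fields = line.split()
        if len(fields) < 2 or "-" not in fields[0]:
            continue  # skip blank or malformed lines
        start_s, end_s = fields[0].split("-")
        name = fields[5] if len(fields) > 5 else ""
        regions.append((int(start_s, 16), int(end_s, 16), fields[1], name))
    return regions


def find_region(regions, addr):
    """Return the region containing addr (end address is exclusive), or None."""
    for region in regions:
        if region[0] <= addr < region[1]:
            return region
    return None


if __name__ == "__main__":
    regions = parse_maps(MAPS_EXCERPT)
    print("containing region:", find_region(regions, FAULT_ADDR))
    for start, end, perms, name in regions:
        if end == FAULT_ADDR:
            print(f"fault address is the first byte past {name or 'an anonymous mapping'}")
```

Because the end address of a mapping is exclusive, the lookup finds no region for 556987854000, while the last byte before it still belongs to [heap]. That matches the "error 6" in the log line: a write to a page that is not present.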

[Test Plan]

Microsoft tested

[Where things could go wrong]

TDX is a new feature and is unlikely to have regressions.

Tim Gardner (timg-tpi)
affects: linux (Ubuntu) → linux-azure (Ubuntu)
Changed in linux-azure (Ubuntu):
assignee: nobody → Tim Gardner (timg-tpi)
importance: Undecided → High
status: New → In Progress
Tim Gardner (timg-tpi)
description: updated
description: updated
Dexuan Cui (decui) wrote :

FYI, the glibc bug is not https://sourceware.org/bugzilla/show_bug.cgi?id=28784; instead, it's Bug 30037 - glibc 2.34 and newer segfault if CPUID leaf 0x2 reports zero (https://sourceware.org/bugzilla/show_bug.cgi?id=30037)

Ubuntu Kernel Bot (ubuntu-kernel-bot) wrote :

This bug is awaiting verification that the linux-azure/5.19.0-1020.21 kernel in -proposed solves the problem. Please test the kernel and update this bug with the results. If the problem is solved, change the tag 'verification-needed-kinetic' to 'verification-done-kinetic'. If the problem still exists, change the tag 'verification-needed-kinetic' to 'verification-failed-kinetic'.

If verification is not done by 5 working days from today, this fix will be dropped from the source code, and this bug will be closed.

See https://wiki.ubuntu.com/Testing/EnableProposed for documentation how to enable and use -proposed. Thank you!

tags: added: kernel-spammed-kinetic-linux-azure verification-needed-kinetic
Tim Gardner (timg-tpi)
tags: added: verification-done-kinetic
removed: verification-needed-kinetic
Launchpad Janitor (janitor) wrote :

This bug was fixed in the package linux-azure - 5.19.0-1020.21

---------------
linux-azure (5.19.0-1020.21) kinetic; urgency=medium

  * kinetic/linux-azure: 5.19.0-1020.21 -proposed tracker (LP: #2004085)

  * Azure: Fix TDX backport (LP: #2004087)
    - SAUCE: TDX: Fixed botched backport

linux-azure (5.19.0-1019.20) kinetic; urgency=medium

  * kinetic/linux-azure: 5.19.0-1019.20 -proposed tracker (LP: #2003415)

  * Azure: TDX enabled hyper-visors cause segfault (LP: #2003714)
    - SAUCE: TDX: Work around the segfault issue in glibc 2.35 in Ubuntu 22.04.

  [ Ubuntu: 5.19.0-31.32 ]

  * kinetic/linux: 5.19.0-31.32 -proposed tracker (LP: #2003423)
  * amdgpu: framebuffer is destroyed and the screen freezes with unsupported IP
    blocks (LP: #2003524)
    - drm/amd: Delay removal of the firmware framebuffer
  * Revoke & rotate to new signing key (LP: #2002812)
    - [Packaging] Revoke and rotate to new signing key

linux-azure (5.19.0-1018.19) kinetic; urgency=medium

  * kinetic/linux-azure: 5.19.0-1018.19 -proposed tracker (LP: #2001745)

  * Kinetic update: upstream stable patchset 2022-11-14 (LP: #1996540)
    - [Config] azure: updateconfigs after rebase

  * Packaging resync (LP: #1786013)
    - debian/dkms-versions -- update from kernel-versions (main/2023.01.02)

  * Kinetic linux-azure - Enable TDX guest driver w/MSFT Hyper-v (LP: #2002658)
    - clocksource/drivers/hyperv: add data structure for reference TSC MSR
    - Revert "UBUNTU: SAUCE: x86/tdx: Add TDX Guest attestation interface driver"
    - Revert "UBUNTU: SAUCE: selftests: tdx: Test GetReport TDX attestation
      feature"
    - Revert "x86/hyper-v: Add hyperv Isolation VM check in the cc_platform_has()"
    - SAUCE: x86/tdx: Add a wrapper to get TDREPORT0 from the TDX Module
    - SAUCE: virt: Add TDX guest driver
    - SAUCE: selftests/tdx: Test TDX attestation GetReport support
    - SAUCE: tdx: enable DEBUG: tools/testing/selftests/tdx/tdx_guest_test.c
    - SAUCE: tdx: swiotlb: check set_memory_decrypted()'s return value
    - SAUCE: tdx: x86/sev: mem_encrypt_free_decrypted_mem(): encrypt the pages for
      AMD SME only
    - SAUCE: tdx: x86/hyperv: Do not run swiotlb_update_mem_attributes() in
      hyperv_init()
    - SAUCE: tdx: x86/tdx: Retry TDVMCALL_MAP_GPA() when needed
    - SAUCE: tdx: x86/tdx: Support vmalloc() for tdx_enc_status_changed()
    - SAUCE: tdx: x86/hyperv: Add hv_isolation_type_tdx() to detect TDX guests
    - SAUCE: tdx: x86/tdx: Expand __tdx_hypercall() to handle more arguments
    - SAUCE: tdx: x86/hyperv: Support hypercalls for TDX guests
    - SAUCE: tdx: Drivers: hv: vmbus: Support TDX guests
    - SAUCE: tdx: x86/hyperv: Fix serial console interrupts for TDX guests
    - [Config] azure: Enable TDX guest driver
    - SAUCE: tdx: Drivers: hv: vmbus:: Fix the ARM64 build caused by recent TDX
      patches

  [ Ubuntu: 5.19.0-30.31 ]

  * kinetic/linux: 5.19.0-30.31 -proposed tracker (LP: #2001756)
  * Packaging resync (LP: #1786013)
    - [Packaging] update helper scripts
    - debian/dkms-versions -- update from kernel-versions (main/2023.01.02)
  * Add some ACPI device IDs for Intel HID device (LP: #1995453)
    - platform/x86/intel/h...

Changed in linux-azure (Ubuntu):
status: In Progress → Fix Released
Ubuntu Kernel Bot (ubuntu-kernel-bot) wrote :

This bug is awaiting verification that the linux-azure/6.2.0-1009.9 kernel in -proposed solves the problem. Please test the kernel and update this bug with the results. If the problem is solved, change the tag 'verification-needed-lunar' to 'verification-done-lunar'. If the problem still exists, change the tag 'verification-needed-lunar' to 'verification-failed-lunar'.

If verification is not done by 5 working days from today, this fix will be dropped from the source code, and this bug will be closed.

See https://wiki.ubuntu.com/Testing/EnableProposed for documentation how to enable and use -proposed. Thank you!

tags: added: kernel-spammed-lunar-linux-azure verification-needed-lunar
Tim Gardner (timg-tpi)
tags: added: verification-done-lunar
removed: verification-needed-lunar