zap_pid_ns_processes() gets stuck in a busy loop when zombie processes are in namespace

Bug #2077044 reported by Matthew Ruffell
Affects               Status         Importance  Assigned to      Milestone
linux (Ubuntu)        Fix Released   Undecided   Unassigned
linux (Ubuntu) Jammy  Fix Committed  Medium      Matthew Ruffell
linux (Ubuntu) Noble  Fix Committed  Medium      Matthew Ruffell

Bug Description

BugLink: https://bugs.launchpad.net/bugs/2077044

[Impact]

zap_pid_ns_processes() can get stuck in a busy loop, which can hang the system because RCU never makes progress.

zap_pid_ns_processes() busy-loops calling kernel_wait4() on the children of the namespace init task, waiting for them to exit. The problem is that it clears TIF_SIGPENDING but not TIF_NOTIFY_SIGNAL, so signal_pending() keeps returning true and kernel_wait4() returns immediately instead of sleeping. The loop then spins forever: a child is asleep in synchronize_rcu() and is never woken, because the parent is stuck in the busy loop and never calls schedule() or rcu_note_context_switch(), so the RCU grace period never completes.

The resulting soft lockup report is:

Watchdog: BUG: soft lockup - CPU#3 stuck for 276s! [rcudeadlock:1836]
CPU: 3 PID: 1836 Comm: rcudeadlock Tainted: G L 5.15.0-117-generic #127-Ubuntu
RIP: 0010:_raw_read_lock+0xe/0x30
Code: f0 0f b1 17 74 08 31 c0 5d c3 cc cc cc cc b8 01 00 00 00 5d c3 cc cc cc cc 0f 1f 00 0f 1f 44 00 00 b8 00 02 00 00 f0 0f c1 07 <a9> ff 01 00 00 75 05 c3 cc cc cc cc 55 48 89 e5 e8 4d 79 36 ff 5d
CR2: 000000c0002b0000
Call Trace:
 <IRQ>
 ? show_trace_log_lvl+0x1d6/0x2ea
 ? show_trace_log_lvl+0x1d6/0x2ea
 ? kernel_wait4+0xaf/0x150
 ? show_regs.part.0+0x23/0x29
 ? show_regs.cold+0x8/0xd
 ? watchdog_timer_fn+0x1be/0x220
 ? lockup_detector_update_enable+0x60/0x60
 ? __hrtimer_run_queues+0x107/0x230
 ? read_hv_clock_tsc_cs+0x9/0x30
 ? hrtimer_interrupt+0x101/0x220
 ? hv_stimer0_isr+0x20/0x30
 ? __sysvec_hyperv_stimer0+0x32/0x70
 ? sysvec_hyperv_stimer0+0x7b/0x90
 </IRQ>
 <TASK>
 ? asm_sysvec_hyperv_stimer0+0x1b/0x20
 ? _raw_read_lock+0xe/0x30
 ? do_wait+0xa0/0x310
 kernel_wait4+0xaf/0x150
 ? thread_group_exited+0x50/0x50
 zap_pid_ns_processes+0x111/0x1a0
 forget_original_parent+0x348/0x360
 exit_notify+0x4a/0x210
 do_exit+0x24f/0x3c0
 do_group_exit+0x3b/0xb0
 get_signal+0x150/0x900
 arch_do_signal_or_restart+0xde/0x100
 ? __x64_sys_futex+0x78/0x1e0
 exit_to_user_mode_loop+0xc4/0x160
 exit_to_user_mode_prepare+0xa3/0xb0
 syscall_exit_to_user_mode+0x27/0x50
 ? x64_sys_call+0x1022/0x1fa0
 do_syscall_64+0x63/0xb0
 ? __io_uring_add_tctx_node+0x111/0x1a0
 ? fput+0x13/0x20
 ? __do_sys_io_uring_enter+0x10d/0x540
 ? __smp_call_single_queue+0x59/0x90
 ? exit_to_user_mode_prepare+0x37/0xb0
 ? syscall_exit_to_user_mode+0x2c/0x50
 ? x64_sys_call+0x1819/0x1fa0
 ? do_syscall_64+0x63/0xb0
 ? try_to_wake_up+0x200/0x5a0
 ? wake_up_q+0x50/0x90
 ? futex_wake+0x159/0x190
 ? do_futex+0x162/0x1f0
 ? __x64_sys_futex+0x78/0x1e0
 ? switch_fpu_return+0x4e/0xc0
 ? exit_to_user_mode_prepare+0x37/0xb0
 ? syscall_exit_to_user_mode+0x2c/0x50
 ? x64_sys_call+0x1022/0x1fa0
 ? do_syscall_64+0x63/0xb0
 ? do_user_addr_fault+0x1e7/0x670
 ? exit_to_user_mode_prepare+0x37/0xb0
 ? irqentry_exit_to_user_mode+0xe/0x20
 ? irqentry_exit+0x1d/0x30
 ? exc_page_fault+0x89/0x170
 entry_SYSCALL_64_after_hwframe+0x6c/0xd6
 </TASK>

There is no known workaround.

[Fix]

This was fixed upstream in 6.10-rc5 by the commit below:

commit 7fea700e04bd3f424c2d836e98425782f97b494e
Author: Oleg Nesterov <email address hidden>
Date: Sat Jun 8 14:06:16 2024 +0200
Subject: zap_pid_ns_processes: clear TIF_NOTIFY_SIGNAL along with TIF_SIGPENDING
Link: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=7fea700e04bd3f424c2d836e98425782f97b494e

This patch has made its way to upstream stable, and is already applied to Ubuntu
kernels.
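For reference, the upstream change amounts to a one-line addition to the kernel_wait4() retry loop in zap_pid_ns_processes() (a paraphrased sketch of the commit above, not the verbatim patch):

```diff
--- a/kernel/pid_namespace.c
+++ b/kernel/pid_namespace.c
@@ zap_pid_ns_processes()
 	do {
 		clear_thread_flag(TIF_SIGPENDING);
+		clear_thread_flag(TIF_NOTIFY_SIGNAL);
 		rc = kernel_wait4(-1, NULL, __WALL, NULL);
 	} while (rc != -ECHILD);
```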

[Testcase]

There are two reproducers for this issue, both courtesy of Rachel Menge, from her GitHub repo:

https://github.com/rlmenge/rcu-soft-lock-issue-repro

Start a Jammy or Noble VM on Azure; a D8sV3 instance is plenty.

$ git clone https://github.com/rlmenge/rcu-soft-lock-issue-repro.git

npm repro:

Install Docker.

$ sudo docker run telescope.azurecr.io/issue-repro/zombie:v1.1.11
$ ./rcu-npm-repro.sh

go repro:

$ go mod init rcudeadlock.go
$ go mod tidy
$ CGO_ENABLED=0 go build -o ./rcudeadlock ./
$ sudo ./rcudeadlock

Watch dmesg. After a few minutes, you should see the soft lockup report from the [Impact] section.

[Where problems can occur]

The fix clears TIF_NOTIFY_SIGNAL (in addition to TIF_SIGPENDING) in the exiting namespace init task, so that signal_pending() returns false and kernel_wait4() can block instead of busy-waiting.
This change should work as intended.

If a regression were to occur, it could potentially affect all processes in namespaces.

[Other Info]

Upstream mailing list discussion:
https://lore<email address hidden>/T/

Tags: sts
Changed in linux (Ubuntu):
status: New → Fix Released
Changed in linux (Ubuntu Jammy):
status: New → Fix Committed
Changed in linux (Ubuntu Noble):
status: New → Fix Committed
Changed in linux (Ubuntu Jammy):
importance: Undecided → Medium
Changed in linux (Ubuntu Noble):
importance: Undecided → Medium
Changed in linux (Ubuntu Jammy):
assignee: nobody → Matthew Ruffell (mruffell)
Changed in linux (Ubuntu Noble):
assignee: nobody → Matthew Ruffell (mruffell)
description: updated
tags: added: sts
Revision history for this message
Matthew Ruffell (mruffell) wrote :

This should land in 5.15.0-121-generic and 6.8.0-44-generic.
