Activity log for bug #1882623

Date Who What changed Old value New value Message
2020-06-09 03:28:34 Rakesh Ginjupalli bug added bug
2020-11-24 16:16:39 Launchpad Janitor linux-azure (Ubuntu): status New Confirmed
2021-01-22 23:10:29 Guilherme G. Piccoli bug added subscriber Guilherme G. Piccoli
2021-01-22 23:17:35 Guilherme G. Piccoli linux-azure (Ubuntu): status Confirmed In Progress
2021-01-22 23:17:37 Guilherme G. Piccoli linux-azure (Ubuntu): assignee Guilherme G. Piccoli (gpiccoli)
2021-07-23 18:09:15 Guilherme G. Piccoli description Create VM on Azure using Canonical UbuntuServer 14.04.5-LTS latest using size Standard DS2 v2 Install kexec-tools kdump-tools makedumpfile, configure boot kernel parameter crashkernel=512M Run sed -i 's/USE_KDUMP=0/USE_KDUMP=1/g' /etc/default/kdump-tools and sed -i 's/LOAD_KEXEC=true/LOAD_KEXEC=false/g' /etc/default/kexec Reboot VM, make sure crash kernel memory reserved successfully Trigger a crash by run echo 1 > /proc/sys/kernel/sysrq, then echo c > /proc/sysrq-trigger, VM will reboot automatically and after reboot, check crash dump is generated under /var/crash Install linux-azure kernel Enable private-ppa canonical-kernel-esm Install kernel linux-azure, reboot VM, kernel version 4.15.0-1084-azure Trigger a crash by run echo 1 > /proc/sys/kernel/sysrq, then echo c > /proc/sysrq-trigger, VM entered into a hung status. Attach whole serial console [ 65.452007] INFO: rcu_sched detected stalls on CPUs/tasks: [ 65.456004] 1-...!: (0 ticks this GP) idle=488/0/0 softirq=1/1 fqs=0 [ 65.456004] (detected by 0, t=15002 jiffies, g=707, c=706, q=8457) [ 65.456004] rcu_sched kthread starved for 15002 jiffies! g707 c706 f0x0 RCU_GP_WAIT_FQS(3) ->state=0x402 ->cpu=1 [ 360.020026] INFO: rcu_sched detected stalls on CPUs/tasks: [ 360.024015] 1-...!: (10 GPs behind) idle=b34/0/0 softirq=1/1 fqs=1 [ 360.024015] (detected by 0, t=15002 jiffies, g=717, c=716, q=6429) [ 360.024015] rcu_sched kthread starved for 15000 jiffies! g717 c716 f0x0 RCU_GP_WAIT_FQS(3) ->state=0x402 ->cpu=1 [ 420.048010] INFO: rcu_sched detected stalls on CPUs/tasks: [ 420.052006] 1-...!: (0 ticks this GP) idle=f94/0/0 softirq=1/1 fqs=0 [ 420.052006] (detected by 0, t=15002 jiffies, g=718, c=717, q=6429) [ 420.052006] rcu_sched kthread starved for 15002 jiffies! g718 c717 f0x0 RCU_GP_WAIT_FQS(3) ->state=0x402 ->cpu=1 [Impact] * When kdumping on trusty/4.15 in an Azure instance, we observe quite frequently a stall on the kdump kernel, it gets blocked and soon we see a stack like the following: [ 65.452007] INFO: rcu_sched detected stalls on CPUs/tasks: [ 65.456004] 1-...!: (0 ticks this GP) idle=488/0/0 softirq=1/1 fqs=0 [ 65.456004] (detected by 0, t=15002 jiffies, g=707, c=706, q=8457) [ 65.456004] rcu_sched kthread starved for 15002 jiffies! g707 c706 f0x0 RCU_GP_WAIT_FQS(3) ->state=0x402 ->cpu=1 * By using the Azure serial console, we collected a sysrq-w when the issue happens: [ 529.515013] sysrq: Show Blocked State [ 529.517730] task PC stack pid father [ 529.519006] kworker/u4:2 D 0 94 2 0x80000000 [ 529.519006] Workqueue: events_unbound fsnotify_mark_destroy_workfn [ 529.519006] Call Trace: [ 529.519006] __schedule+0x292/0x880 [ 529.519006] schedule+0x36/0x80 [ 529.519006] schedule_timeout+0x1d5/0x2f0 [ 529.519006] ? check_preempt_wakeup+0x162/0x260 [ 529.519006] wait_for_completion+0xa5/0x110 [ 529.519006] ? wake_up_q+0x80/0x80 [ 529.519006] __synchronize_srcu.part.14+0x67/0x80 [ 529.519006] ? trace_raw_output_rcu_utilization+0x50/0x50 [ 529.519006] ? __switch_to_asm+0x41/0x70 [ 529.519006] synchronize_srcu+0xd1/0xd6 [ 529.519006] fsnotify_mark_destroy_workfn+0x6d/0xc0 [ 529.519006] process_one_work+0x14e/0x390 [ 529.519006] worker_thread+0x1cc/0x3d0 [ 529.519006] kthread+0x105/0x140 [ 529.519006] ? max_active_store+0x60/0x60 [ 529.519006] ? kthread_bind+0x20/0x20 [ 529.519006] ret_from_fork+0x35/0x40 [ 529.519006] udevadm D 0 544 1 0x00000000 [ 529.519006] Call Trace: [ 529.519006] __schedule+0x292/0x880 [ 529.519006] schedule+0x36/0x80 [ 529.519006] schedule_timeout+0x1d5/0x2f0 [ 529.519006] ? try_to_wake_up+0x4a/0x460 [ 529.519006] ? try_to_wake_up+0x4a/0x460 [ 529.519006] wait_for_completion+0xa5/0x110 [ 529.519006] ? wake_up_q+0x80/0x80 [ 529.519006] __flush_work.isra.29+0x119/0x1b0 [ 529.519006] ? destroy_worker+0x90/0x90 [ 529.519006] flush_delayed_work+0x3f/0x50 [ 529.519006] fsnotify_wait_marks_destroyed+0x15/0x20 [ 529.519006] fsnotify_destroy_group+0x4e/0xc0 [ 529.519006] inotify_release+0x1e/0x50 [ 529.519006] __fput+0xea/0x220 [ 529.519006] ____fput+0xe/0x10 [ 529.519006] task_work_run+0x8c/0xb0 [ 529.519006] exit_to_usermode_loop+0x70/0xa9 [ 529.519006] do_syscall_64+0x1b5/0x1e0 [ 529.519006] entry_SYSCALL_64_after_hwframe+0x41/0xa6 [ 529.519006] dhclient D 0 573 572 0x00000000 [ 529.519006] Call Trace: [ 529.519006] __schedule+0x292/0x880 [ 529.519006] schedule+0x36/0x80 [ 529.519006] schedule_timeout+0x1d5/0x2f0 [ 529.519006] ? aa_profile_af_perm+0xb4/0xf0 [ 529.519006] wait_for_completion+0xa5/0x110 [ 529.519006] ? wake_up_q+0x80/0x80 [ 529.519006] __wait_rcu_gp+0x123/0x150 [ 529.519006] synchronize_sched+0x4e/0x60 [ 529.519006] ? __call_rcu+0x2f0/0x2f0 [ 529.519006] ? trace_raw_output_rcu_utilization+0x50/0x50 [ 529.519006] synchronize_net+0x1c/0x30 [ 529.519006] __unregister_prot_hook+0xcd/0xf0 [ 529.519006] packet_do_bind+0x1bd/0x250 [ 529.519006] packet_bind+0x2f/0x50 [ 529.519006] SYSC_bind+0xd8/0x110 [ 529.519006] ? sock_alloc_file+0x91/0x130 [ 529.519006] SyS_bind+0xe/0x10 [ 529.519006] do_syscall_64+0x80/0x1e0 [ 529.519006] entry_SYSCALL_64_after_hwframe+0x41/0xa6 * Bisecting mainline kernels, we found that v4.17-rc1 didn't reproduce the issue, whereas v4.16 reproduced. Then, a fine-grained git bisect led us to the fix - the following patch, when backported to a problematic version, fixes the issue: d8e462e19305 ("Drivers: hv: vmbus: Implement Direct Mode for stimer0") * In Azure/Hyper-V, before the aforementioned commit, timer interrupts were passed to the hypervisor through a vmbus message, a mechanism of communication of hyper-v guests/hypervisor. With the patch, a check is made (through MSR-like mechanism) and if the hypervisor supports, a direct timer IRQ mechanism is put in-place instead of the vmbus channel. * Our theory is that on kdump kernel, specially due to the single cpu nature, the vmbus-messaged timer IRQ could interfere with scheduling and create a dead-lock condition, which is what we observe from the stack traces. Hence, we hereby propose to backport such patch to azure 4.15 kernels. * Our hypothesis about why this wasn't observed in Bionic is that changes in the kdump mechanism make it start earlier, and this might hide the schedule issue. Lack of testing also could be the cause of the non-observability. [Test Case] * Perform a kdump in Trusty/4.15 and after some rounds, you should see the mentioned RCU stack trace. With the patch, no issue is observed, and we could perform ~30 kdumps in a row. [Where problems could occur] * Since the patch touches IRQ code (although restricted to Hyper-v), it's a somewhat invasive change. It could lead to issues in the IRQ subsystem, although the patch is upstream since 2018 and no fixes for that were found (we count on Fixes tag for that).
2021-07-23 18:09:25 Guilherme G. Piccoli nominated for series Ubuntu Bionic
2021-07-23 18:09:25 Guilherme G. Piccoli bug task added linux-azure (Ubuntu Bionic)
2021-07-23 18:09:39 Guilherme G. Piccoli linux-azure (Ubuntu): assignee Guilherme G. Piccoli (gpiccoli)
2021-07-23 18:09:50 Guilherme G. Piccoli linux-azure (Ubuntu): assignee Marcelo Cerri (mhcerri)
2021-07-23 18:10:00 Guilherme G. Piccoli linux-azure (Ubuntu Bionic): assignee Marcelo Cerri (mhcerri)
2021-07-23 18:10:06 Guilherme G. Piccoli linux-azure (Ubuntu Bionic): status New In Progress
2021-07-23 18:10:10 Guilherme G. Piccoli linux-azure (Ubuntu): importance Undecided Medium
2021-07-23 18:10:11 Guilherme G. Piccoli linux-azure (Ubuntu Bionic): importance Undecided Medium
2021-08-21 02:21:18 Kelsey Steele linux-azure (Ubuntu Bionic): status In Progress Fix Committed
2021-08-23 11:54:00 Krzysztof Kozlowski bug task added linux-azure-4.15 (Ubuntu)
2021-08-23 11:54:18 Krzysztof Kozlowski linux-azure-4.15 (Ubuntu): status New Fix Committed
2021-08-23 11:54:20 Krzysztof Kozlowski linux-azure-4.15 (Ubuntu Bionic): status New Fix Committed
2021-08-24 16:27:37 Ubuntu Kernel Bot tags verification-needed-bionic
2021-09-03 20:24:42 Marcelo Cerri tags verification-needed-bionic verification-done-bionic
2021-09-07 14:49:36 Launchpad Janitor linux-azure-4.15 (Ubuntu Bionic): status Fix Committed Fix Released
2021-09-07 14:49:36 Launchpad Janitor cve linked 2021-3653
2021-09-07 14:49:36 Launchpad Janitor cve linked 2021-3656
2022-02-23 09:27:18 Krzysztof Kozlowski linux-azure-4.15 (Ubuntu): status Fix Committed Fix Released
2022-02-23 09:27:45 Krzysztof Kozlowski linux-azure (Ubuntu Bionic): status Fix Committed Fix Released
2022-02-23 09:27:47 Krzysztof Kozlowski linux-azure (Ubuntu): status In Progress Fix Released