Problem Analysis
================

The dmesg log 'crash/201809061748/dmesg.201809061748' shows the CPU soft
lockup occurs 25 seconds after the 'sdb' virtio-scsi drive is removed.
This indicates the two events are related (there is usually an extra 2-3
seconds between an event and the "stuck for 22s/23s" report, for some
reason).

[ 3002.697474] sd 0:0:2:0: [sdb] Synchronizing SCSI cache
[ 3002.697545] sd 0:0:2:0: [sdb] Synchronize Cache(10) failed: Result: hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK
[ 3028.294602] NMI watchdog: BUG: soft lockup - CPU#1 stuck for 23s! [id:2887]

CPU 1 is waiting on another CPU (to finish something)
-----------------------------------------------------

The locked-up 'id' process is in the page-fault handling path (which is
OK/normal), calling a function on many (SMP) CPUs (see
smp_call_function_many() at the top of the stack), and is specifically
in the call to one CPU (see smp_call_function_single+0xae/0x110 in the
RIP register).

[ 3028.301755] CPU: 1 PID: 2887 Comm: id Not tainted 4.4.0-133-generic #159~14.04.1-Ubuntu
...
[ 3028.301760] RIP: 0010:[] [] smp_call_function_single+0xae/0x110
...
[ 3028.301797] Call Trace:
[ 3028.301803] [] ? do_kernel_range_flush+0x40/0x40
[ 3028.301805] [] smp_call_function_many+0x22e/0x270
[ 3028.301808] [] native_flush_tlb_others+0x48/0x120
[ 3028.301810] [] flush_tlb_mm_range+0x9d/0x180
[ 3028.301815] [] ptep_clear_flush+0x53/0x60
[ 3028.301819] [] wp_page_copy.isra.58+0x29d/0x530
[ 3028.301822] [] do_wp_page+0x8d/0x590
[ 3028.301824] [] handle_mm_fault+0xd86/0x1ac0
[ 3028.301829] [] ? free_pages+0x13/0x20
[ 3028.301835] [] ? finish_task_switch+0x244/0x2a0
[ 3028.301840] [] __do_page_fault+0x19b/0x430
[ 3028.301843] [] do_page_fault+0x22/0x30
[ 3028.301847] [] page_fault+0x28/0x30

The smp_call_function_many() line in the stack trace
(smp_call_function_many+0x22e / ffffffff81102e9e) reflects that it has
called smp_call_function_single() -- the return address points to the
instruction right after the callq:

ffffffff81102c70 <smp_call_function_many>:
...
ffffffff81102e99:	e8 d2 fb ff ff		callq  ffffffff81102a70 <smp_call_function_single>
ffffffff81102e9e:	48 83 c4 10		add    $0x10,%rsp
...

which, per the address in the RIP register (smp_call_function_single+0xae
/ ffffffff81102b1e), is in the (inlined) call to csd_lock_wait() after
generic_exec_single(). csd_lock_wait() spins on the value of csd->flags
with cpu_relax() (the 'pause' instruction), waiting for it to be
unlocked, i.e., to no longer have the CSD_FLAG_LOCK flag (value 0x1) set
in the 'flags' field (offset 0x18):

ffffffff81102a70 <smp_call_function_single>:
...
ffffffff81102b0f:	e8 0c fe ff ff		callq  ffffffff81102920 <generic_exec_single>
ffffffff81102b14:	8b 55 e8		mov    -0x18(%rbp),%edx
ffffffff81102b17:	83 e2 01		and    $0x1,%edx
ffffffff81102b1a:	74 de			je     ffffffff81102afa
ffffffff81102b1c:	f3 90			pause
ffffffff81102b1e:	8b 55 e8		mov    -0x18(%rbp),%edx
ffffffff81102b21:	83 e2 01		and    $0x1,%edx
ffffffff81102b24:	75 f6			jne    ffffffff81102b1c

"""
smp_call_function_single()
...
	err = generic_exec_single(cpu, csd, func, info);

	if (wait)
		csd_lock_wait(csd);
...
"""

"""
static void csd_lock_wait(struct call_single_data *csd)
{
	while (smp_load_acquire(&csd->flags) & CSD_FLAG_LOCK)
		cpu_relax();
}
"""

kernel/smp.c:
	CSD_FLAG_LOCK = 0x01,

include/linux/smp.h:
	struct call_single_data {
		struct llist_node llist;
		smp_call_func_t func;
		void *info;
		unsigned int flags;
	};

$ pahole -C call_single_data vmlinux-4.4.0-133-generic
...
struct call_single_data {
	struct llist_node llist;	/*     0     8 */
	smp_call_func_t   func;		/*     8     8 */
	void *            info;		/*    16     8 */
	unsigned int      flags;	/*    24     4 */
	### ^ 24 == 0x18
...
};

So this shows that CPU 1 is spinning/waiting on another CPU to finish
something.
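As a quick sanity check of the offset correlation above (pahole's 24
vs. the disassembly's 0x18), the struct layout can be replicated in
user space -- a sketch only, assuming the 64-bit layout quoted above,
not kernel code:

"""
#include <stddef.h>
#include <stdio.h>

/* Replicas of the kernel types quoted above (64-bit layout assumed). */
struct llist_node { struct llist_node *next; };
typedef void (*smp_call_func_t)(void *info);

struct call_single_data {
	struct llist_node llist;
	smp_call_func_t func;
	void *info;
	unsigned int flags;
};

int main(void)
{
	/* Prints 24 (0x18), matching pahole and the disassembly. */
	printf("flags offset: %zu\n",
	       offsetof(struct call_single_data, flags));
	return 0;
}
"""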
What the (an)other CPU is doing
-------------------------------

This guest has only 2 CPUs (confirmed with the crashdump file
'dump.201809061748'), so CPU 0 is the only other CPU that could be
doing something.

      KERNEL: /home/mfo/<...>/vmlinux-4.4.0-133-generic
    DUMPFILE: dump.201809061748  [PARTIAL DUMP]
        CPUS: 2

The NMI backtrace of the other CPU in the dmesg log reveals it is
(still -- 20-odd seconds later) in the SCSI drive removal path,
specifically in the virtio-scsi target removal/destroy handler (see
virtscsi_target_destroy() in the RIP register).

[ 3028.301874] Sending NMI to other CPUs:
[ 3028.302876] NMI backtrace for cpu 0
[ 3028.302882] CPU: 0 PID: 2246 Comm: kworker/0:0 Not tainted 4.4.0-133-generic #159~14.04.1-Ubuntu
...
[ 3028.302887] RIP: 0010:[] [] virtscsi_target_destroy+0x19/0x30 [virtio_scsi]
...
[ 3028.302907] Call Trace:
[ 3028.302908] [] scsi_target_destroy+0x82/0xd0
[ 3028.302909] [] scsi_target_reap_ref_release+0x35/0x40
[ 3028.302910] [] scsi_target_reap+0x2c/0x30
[ 3028.302911] [] __scsi_remove_device+0x89/0xe0
[ 3028.302912] [] scsi_remove_device+0x26/0x40
[ 3028.302914] [] virtscsi_handle_event+0x127/0x1e0 [virtio_scsi]
[ 3028.302915] [] process_one_work+0x156/0x400
[ 3028.302916] [] worker_thread+0x11a/0x480
[ 3028.302918] [] ? rescuer_thread+0x310/0x310
[ 3028.302919] [] kthread+0xd8/0xf0
[ 3028.302920] [] ? __schedule+0x2a2/0x820
[ 3028.302921] [] ? kthread_park+0x60/0x60
[ 3028.302922] [] ret_from_fork+0x55/0x80
[ 3028.302924] [] ? kthread_park+0x60/0x60

Looking at the region pointed to by the RIP register,
virtscsi_target_destroy+0x19 (that is, address 0x39, since the function
starts at 0x20 in the kernel module objdump), it's similarly
looping/spinning with 'pause' (cpu_relax()), checking the 'tgt->reqs'
counter until it becomes zero (i.e., completion of all requests on the
virtio-scsi target):

0000000000000020 <virtscsi_target_destroy>:
  20:	e8 00 00 00 00		callq  25
  25:	55			push   %rbp
  26:	48 8b bf 28 03 00 00	mov    0x328(%rdi),%rdi
  2d:	48 89 e5		mov    %rsp,%rbp
  30:	8b 47 04		mov    0x4(%rdi),%eax
  33:	85 c0			test   %eax,%eax
  35:	74 09			je     40
  37:	f3 90			pause
  39:	8b 47 04		mov    0x4(%rdi),%eax
  3c:	85 c0			test   %eax,%eax
  3e:	75 f7			jne    37
  40:	e8 00 00 00 00		callq  45
  45:	5d			pop    %rbp
  46:	c3			retq
  47:	66 0f 1f 84 00 00 00	nopw   0x0(%rax,%rax,1)
  4e:	00 00

"""
static void virtscsi_target_destroy(struct scsi_target *starget)
{
	struct virtio_scsi_target_state *tgt = starget->hostdata;

	/* we can race with concurrent virtscsi_complete_cmd */
	while (atomic_read(&tgt->reqs))
		cpu_relax();
	kfree(tgt);
}
"""
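For illustration only, the wait semantics can be reproduced in user
space -- a sketch with made-up names and counts, not driver code: a
"destroy" side spins until an in-flight request counter drains to zero,
exactly like the loop above, and would spin forever if increments ever
outnumbered decrements (which is what the rest of this analysis
establishes).

"""
#include <pthread.h>
#include <stdatomic.h>
#include <stdio.h>

static atomic_int reqs;		/* analogue of tgt->reqs */

/* Analogue of virtscsi_complete_cmd(): drain in-flight requests. */
static void *complete_requests(void *arg)
{
	while (atomic_load(&reqs) > 0)
		atomic_fetch_sub(&reqs, 1);
	return NULL;
}

int main(void)
{
	pthread_t completer;

	atomic_store(&reqs, 100);	/* 100 requests in flight */
	pthread_create(&completer, NULL, complete_requests, NULL);

	/* Analogue of virtscsi_target_destroy(): busy-wait until all
	 * in-flight requests have completed before freeing state. */
	while (atomic_load(&reqs))
		;			/* cpu_relax() in the kernel */

	pthread_join(completer, NULL);
	printf("all requests completed; safe to free target state\n");
	return 0;
}
"""

(Compile with 'cc -pthread'.)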
Let's check what that 'tgt->reqs' counter value is with the crashdump.

$ crash ~/ddebs/linux-image-4.4.0-133-generic-dbgsym_4.4.0-133.159~14.04.1_amd64/usr/lib/debug/boot/vmlinux-4.4.0-133-generic crash/201809061748/dump.201809061748

crash> set -c 0
    PID: 2246
COMMAND: "kworker/0:0"
   TASK: ffff880035960000  [THREAD_INFO: ffff880212a90000]
    CPU: 0
  STATE: TASK_RUNNING (ACTIVE)

crash> bt
PID: 2246   TASK: ffff880035960000  CPU: 0   COMMAND: "kworker/0:0"
 #0 [ffff88021fc08e38] crash_nmi_callback at ffffffff8104f727
 #1 [ffff88021fc08e48] nmi_handle at ffffffff81031407
 #2 [ffff88021fc08ea0] default_do_nmi at ffffffff810319e0
 #3 [ffff88021fc08ec0] do_nmi at ffffffff81031b19
 #4 [ffff88021fc08ee8] end_repeat_nmi at ffffffff81822e17
    [exception RIP: virtscsi_target_destroy+25]
    RIP: ffffffffc0002039  RSP: ffff880212a93d38  RFLAGS: 00000006
    RAX: 0000000000001eb2  RBX: ffff880216be7000  RCX: ffff8802179e0a20
    RDX: ffffffffc0002020  RSI: ffff880035bddc28  RDI: ffff880211ea3420
    RBP: ffff880212a93d38   R8: 0000000000000000   R9: ffffffff813eb3e5
    R10: ffff88021fc1a6c0  R11: 0000000000000000  R12: ffff880035bddc00
    R13: ffff880035bddc28  R14: 0000000000000282  R15: 0000000000000002
    ORIG_RAX: ffffffffffffffff  CS: 0010  SS: 0000
--- <NMI exception stack> ---
 #5 [ffff880212a93d38] virtscsi_target_destroy at ffffffffc0002039 [virtio_scsi]
 #6 [ffff880212a93d40] scsi_target_destroy at ffffffff815ad4f2
 #7 [ffff880212a93d70] scsi_target_reap_ref_release at ffffffff815ad575
 #8 [ffff880212a93d90] scsi_target_reap at ffffffff815aefac
 #9 [ffff880212a93da0] __scsi_remove_device at ffffffff815b14f9
#10 [ffff880212a93dc0] scsi_remove_device at ffffffff815b1576
#11 [ffff880212a93de0] virtscsi_handle_event at ffffffffc0002bc7 [virtio_scsi]
#12 [ffff880212a93e20] process_one_work at ffffffff8109b0b6
#13 [ffff880212a93e68] worker_thread at ffffffff8109ba9a
#14 [ffff880212a93ec8] kthread at ffffffff810a0f78
#15 [ffff880212a93f50] ret_from_fork at ffffffff818203b5

The stack trace in crash gives us the register values at the time
virtscsi_target_destroy() was interrupted while running, and RIP (the
instruction pointer, i.e., the next instruction to execute) shows it is
about to reload the EAX register with the value it loops on.

EAX holds tgt->reqs, which is loaded as follows: the scsi_target
pointer is the first argument, thus in the RDI register; its 'hostdata'
field is at offset 0x328 (and is cast to a virtio_scsi_target_state
pointer), whose 'reqs' field is at offset 0x4:

$ pahole -C scsi_target vmlinux-4.4.0-133-generic | grep hostdata
...
	void *             hostdata;	/*   808     8 */
	### ^ 808 == 0x328
...

$ pahole -C virtio_scsi_target_state lib/modules/4.4.0-133-generic/kernel/drivers/scsi/virtio_scsi.ko | grep reqs
...
	atomic_t           reqs;	/*     4     4 */
...

So, let's get the value from memory, based on the RDI register (which
at that point holds the tgt pointer):

crash> mod -s virtio_scsi /home/mfo/ddebs/linux-image-4.4.0-133-generic-dbgsym_4.4.0-133.159~14.04.1_amd64/usr/lib/debug/lib/modules/4.4.0-133-generic/kernel/drivers/scsi/virtio_scsi.ko

crash> struct virtio_scsi_target_state.reqs ffff880211ea3420
  reqs = {
    counter = 0x1eb2
  }

crash> eval 0x1eb2
hexadecimal: 1eb2
    decimal: 7858

And it's indeed a non-zero value -- quite high, actually: ~8k requests.
This is why it's looping. The value does not seem to decrease, though.
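(As a cross-check -- this command is not from the original session --
the same value should be readable directly from memory with crash's
'rd' command, since 'reqs' sits 4 bytes past the tgt pointer held in
RDI: 'rd -32 ffff880211ea3424 1' should display the same 0x1eb2.)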
Bug in virtio-scsi driver
-------------------------

It's unlikely that all of these requests were still pending and had not
finished after all this time (~25 seconds since the SCSI removal
started, and at least 12 seconds since 'sdb' was unmounted -- sdb1 and
sdc1 are mounted, data is copied, both are unmounted, then sdc1 is
mounted again):

[ 2215.675484] EXT4-fs (sdb1): mounted filesystem with ordered data mode. Opts: (null)
[ 2222.563236] EXT4-fs (sdc1): mounted filesystem with ordered data mode. Opts: (null)
[ 2990.023442] EXT4-fs (sdc1): mounted filesystem with ordered data mode. Opts: (null)
[ 3002.697474] sd 0:0:2:0: [sdb] Synchronizing SCSI cache
[ 3002.697545] sd 0:0:2:0: [sdb] Synchronize Cache(10) failed: Result: hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK
[ 3028.294602] NMI watchdog: BUG: soft lockup - CPU#1 stuck for 23s! [id:2887]

So this looks like a problem in the driver.

The wait/spin loop was introduced by a downstream/SAUCE commit to avoid
a race with virtscsi_complete_cmd(). It is not present upstream, which
dropped the tgt->reqs counter altogether. However, per the commit
description, we still need to synchronize with that function in Xenial.
So, instead of finding an alternative way to synchronize for now, let's
check why the reqs counter has problems.

Studying the driver code for references to tgt->reqs, we see it is only
changed in three sites:

1) decremented in one site, virtscsi_complete_cmd(), a callback invoked
   after a SCSI command is completed by the virtio-scsi adapter;

2) incremented in the .queuecommand() path, before calling the actual
   implementation (virtscsi_queuecommand()) -- directly in
   virtscsi_queuecommand_single() for single-queue hosts, and for
   multiqueue hosts in the two virtqueue-picking helpers:
   2.1) virtscsi_pick_vq_mq()
   2.2) virtscsi_pick_vq()

Complete-command path:

"""
virtscsi_complete_cmd()
...
	sc->scsi_done(sc);
	atomic_dec(&tgt->reqs);
}
"""

In the 1-queue path:

"""
virtscsi_queuecommand_single()
...
	atomic_inc(&tgt->reqs);
	return virtscsi_queuecommand(vscsi, &vscsi->req_vqs[0], sc);
}
"""

In the >1-queue (blk-mq) path:

"""
struct virtio_scsi_vq *virtscsi_pick_vq_mq()
...
	atomic_inc(&tgt->reqs);
	return &vscsi->req_vqs[hwq];
"""

In the >1-queue (not blk-mq) path:

"""
struct virtio_scsi_vq *virtscsi_pick_vq()
...
	if (atomic_inc_return(&tgt->reqs) > 1) {
	...
"""

... with their only caller being:

"""
virtscsi_queuecommand_multi(struct Scsi_Host *sh, struct scsi_cmnd *sc)
{
...
	if (shost_use_blk_mq(sh))
		req_vq = virtscsi_pick_vq_mq(vscsi, tgt, sc);
	else
		req_vq = virtscsi_pick_vq(vscsi, tgt);

	return virtscsi_queuecommand(vscsi, req_vq, sc);
"""

These are the SCSI host templates registering those two functions:

"""
static struct scsi_host_template virtscsi_host_template_single = {
...
	.queuecommand = virtscsi_queuecommand_single,
...

static struct scsi_host_template virtscsi_host_template_multi = {
...
	.queuecommand = virtscsi_queuecommand_multi,
...
"""

Now, observing the implementation of virtscsi_queuecommand(), the
problem happens when the command fails to be kicked to the virtio-scsi
adapter: the function is called with the tgt->reqs counter already
incremented, but if virtscsi_kick_cmd() fails with anything other than
-EIO, the SCSI command is requeued and goes through the .queuecommand()
path again, which increments the tgt->reqs counter _again_ for the same
SCSI command. (Note this may repeat multiple times for the same
command, depending on the virtio-scsi adapter/queue
condition/processing/etc.) The -EIO case is not a problem, because it
completes the SCSI command, which decrements the tgt->reqs counter.
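Before looking at the code, the accounting error can be made concrete
with a trivial simulation (a sketch; the retry count of 5 is made up):

"""
#include <stdio.h>

int main(void)
{
	int reqs = 0;		/* analogue of tgt->reqs */
	int retries = 5;	/* kick fails 5 times, then succeeds */

	for (int attempt = 0; attempt <= retries; attempt++) {
		reqs++;			/* .queuecommand() increment */
		if (attempt < retries)
			continue;	/* SCSI_MLQUEUE_HOST_BUSY:
					   requeued, no decrement */
		reqs--;			/* virtscsi_complete_cmd() */
	}

	/* One command, fully completed -- yet reqs is 5, not 0. */
	printf("reqs after one completed command: %d\n", reqs);
	return 0;
}
"""

Each failed kick leaks one increment, so commands requeued many times
under adapter/queue pressure can plausibly account for the ~8k value
seen in the dump. The relevant driver code: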
"""
virtscsi_queuecommand()
...
	ret = virtscsi_kick_cmd(req_vq, cmd, req_size, sizeof(cmd->resp.cmd));
	if (ret == -EIO) {
		cmd->resp.cmd.response = VIRTIO_SCSI_S_BAD_TARGET;
		spin_lock_irqsave(&req_vq->vq_lock, flags);
		virtscsi_complete_cmd(vscsi, cmd);
		spin_unlock_irqrestore(&req_vq->vq_lock, flags);
	} else if (ret != 0) {
		return SCSI_MLQUEUE_HOST_BUSY;
	}
	return 0;
}
"""

So, the solution to keep the tgt->reqs counter correct in this case is
to decrement it in that error condition, when we know the command will
be requeued (and thus tgt->reqs will be incremented _again_). A sketch
of this change is included at the end of this analysis.

Synthetic test-case
-------------------

I could successfully reproduce this problem by attaching GDB to the
QEMU GDB server to debug/play with the Linux kernel, and simulating
that the check for available elements in the virtio ring/queue found
none, so that this error case is hit. I'll detail the steps too. This
was also used to verify the proposed fix.

The fix has also been tested with the test-case reported by the
customer, which mounts the virtio-scsi disk, copies data out of it,
unmounts it, and detaches it from the guest instance, in a loop. That
loop usually fails in the first or second iteration, and with the fix
it has been running for 35 iterations without problems.
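For reference, here is a minimal sketch of the change described above
(assumptions: tgt is looked up inside virtscsi_queuecommand() the same
way its callers do, via scsi_target(sc->device)->hostdata; the actual
patch as applied may differ in form):

"""
	ret = virtscsi_kick_cmd(req_vq, cmd, req_size, sizeof(cmd->resp.cmd));
	if (ret == -EIO) {
		cmd->resp.cmd.response = VIRTIO_SCSI_S_BAD_TARGET;
		spin_lock_irqsave(&req_vq->vq_lock, flags);
		virtscsi_complete_cmd(vscsi, cmd);
		spin_unlock_irqrestore(&req_vq->vq_lock, flags);
	} else if (ret != 0) {
		/*
		 * The command will be requeued and will pass through the
		 * .queuecommand() path again, incrementing tgt->reqs once
		 * more -- balance out the increment done for this attempt.
		 */
		atomic_dec(&tgt->reqs);
		return SCSI_MLQUEUE_HOST_BUSY;
	}
	return 0;
"""

This keeps the increment/decrement pairs balanced across requeues, so
tgt->reqs can reach zero once the target is quiesced, allowing
virtscsi_target_destroy() to complete.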