A userspace process hangs in d-state forever in a virtual machine environment with a virtio-scsi disk

Bug #1821738 reported by Denis Plotnikov
This bug affects 1 person
Affects: linux (Ubuntu)
Status: Confirmed
Importance: Undecided
Assigned to: Unassigned

Bug Description

Kernel: Ubuntu 4.15.0-46.49-generic 4.15.18

The hang happens because the process is waiting for the completion of its request, which never arrives.
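To see which processes are stuck this way, one option is to scan /proc for tasks in state D (uninterruptible sleep). The sketch below is only an illustration: the function names `parse_stat_state` and `find_d_state_processes` are made up for this example and are not part of any tool mentioned in this report.

```python
import os

def parse_stat_state(stat_line):
    """Return (pid, comm, state) parsed from one /proc/<pid>/stat line."""
    # comm is wrapped in parentheses and may itself contain spaces or ')',
    # so split on the LAST closing parenthesis.
    head, _, tail = stat_line.rpartition(")")
    pid_text, _, comm = head.partition("(")
    state = tail.split()[0]
    return int(pid_text), comm, state

def find_d_state_processes(proc_root="/proc"):
    """List (pid, comm) for processes currently in uninterruptible sleep."""
    try:
        entries = os.listdir(proc_root)
    except OSError:
        return []  # no procfs available at this path
    stuck = []
    for entry in entries:
        if not entry.isdigit():
            continue
        try:
            with open(os.path.join(proc_root, entry, "stat")) as f:
                pid, comm, state = parse_stat_state(f.read())
        except (OSError, ValueError, IndexError):
            continue  # process exited or the line was unreadable
        if state == "D":
            stuck.append((pid, comm))
    return stuck
```

Calling `find_d_state_processes()` repeatedly and seeing the same pid every time suggests a hang rather than ordinary short-lived I/O wait.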

The reason for the hung request is a race condition inside the block layer.

Namely, the race is triggered by a long-running request.

Each request has a timer. When the timer fires, its handler sets REQ_ATOM_COMPLETE and clears the flag again when it finishes.

The request completion path checks REQ_ATOM_COMPLETE. If the flag is set, the completion returns without doing anything and never executes again, on the assumption that the request needs no further attention because it has already been completed.

Thus, if the request completion starts executing while the timer handler is in progress, it sees the complete flag set and simply returns. The timer handler then clears the complete flag, and the request stays in the system forever: the timer handler runs again and again, and each time it just rearms itself.
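The interleaving can be replayed step by step in a small model. This is a simplified sketch and not kernel code: the `Request` class, its fields and all function names are hypothetical, and a plain boolean stands in for REQ_ATOM_COMPLETE.

```python
class Request:
    def __init__(self):
        self.complete_flag = False  # stands in for REQ_ATOM_COMPLETE
        self.done = False           # True once the waiting process is woken
        self.timer_rearms = 0

def timer_handler_begin(rq):
    """Timer fires: the handler claims the request by setting the flag."""
    rq.complete_flag = True

def timer_handler_end(rq):
    """Handler finishes: it clears the flag and rearms the timer."""
    rq.complete_flag = False
    rq.timer_rearms += 1

def completion(rq):
    """Completion path: returns without doing anything if the flag is set."""
    if rq.complete_flag:
        return False  # assumes the request is already taken care of
    rq.complete_flag = True
    rq.done = True
    return True

def racy_interleaving():
    rq = Request()
    timer_handler_begin(rq)  # request has been running longer than the timeout
    completion(rq)           # device finishes right now, sees the flag, bails out
    timer_handler_end(rq)    # flag cleared, timer rearmed
    for _ in range(3):       # later timeouts only rearm; completion never reruns
        timer_handler_begin(rq)
        timer_handler_end(rq)
    return rq

def safe_interleaving():
    rq = Request()
    completion(rq)           # completion arrives before any timeout
    return rq
```

In this model `racy_interleaving()` leaves `done` false no matter how many more timeouts fire, which corresponds to the process staying in D state, while `safe_interleaving()` completes normally.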

This happens only with long-running requests. By default the request timeout is 30 seconds, so triggering the race requires a request whose execution time exceeds 30 seconds.
This is rare for local hardware storage but may occur more often when the storage is accessed over a network.
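For SCSI disks the per-device command timeout is normally exposed in sysfs as /sys/block/<dev>/device/timeout, in seconds. The following sketch reads it and compares it with an observed request duration; the helper names `read_scsi_timeout` and `exceeds_timeout` and the device name in the usage note are assumptions made for illustration.

```python
import os

def read_scsi_timeout(device, sysfs_root="/sys/block"):
    """Return the command timeout in seconds for a SCSI block device."""
    path = os.path.join(sysfs_root, device, "device", "timeout")
    with open(path) as f:
        return int(f.read().strip())

def exceeds_timeout(request_seconds, timeout_seconds=30):
    """True if a request ran long enough for its timer to fire."""
    return request_seconds > timeout_seconds
```

For example, `exceeds_timeout(45, read_scsi_timeout("sda"))` would tell whether a 45-second request on a hypothetical disk sda is long enough for the timer handler to run.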

The behavior described affects mainline 4.13, 4.14 and 4.15 kernels and systems based on the rh7-3.10.0-957.5.1.el7 kernel.

Before 4.13 the timer did not rearm itself and just aborted the request. The patch that makes the timer rearm was introduced in 4.13: e72c9a2a67a6400c "scsi: virtio_scsi: let host do exception handling"

After 4.15 the block layer switched to the MQ scheme, which isn't prone to this kind of race. In recent kernels (>= 5.0) only the MQ scheme is left, and the legacy race-prone block layer code has been removed.

Ubuntu Kernel Bot (ubuntu-kernel-bot) wrote : Missing required logs.

This bug is missing log files that will aid in diagnosing the problem. While running an Ubuntu kernel (not a mainline or third-party kernel) please enter the following command in a terminal window:

apport-collect 1821738

and then change the status of the bug to 'Confirmed'.

If, due to the nature of the issue you have encountered, you are unable to run this command, please add a comment stating that fact and change the bug status to 'Confirmed'.

This change has been made by an automated script, maintained by the Ubuntu Kernel Team.

Changed in linux (Ubuntu):
status: New → Incomplete
Denis Plotnikov (denis.plotnikov) wrote :
Changed in linux (Ubuntu):
status: Incomplete → Confirmed
Kai-Heng Feng (kaihengfeng) wrote :

Is there an upstream commit to fix the issue?
