soft lockup from bcache leading to high load and lockup on trusty

Bug #1757277 reported by Drew Freiberger on 2018-03-20
This bug affects 1 person
Affects                  Importance   Assigned to
linux (Ubuntu)           High         Unassigned
linux (Ubuntu Trusty)    High         Unassigned

Bug Description

I have an environment of Dell R630 servers whose RAID controllers each expose two virtual disks and 22 passthrough devices. The 2 SAS SSDs and 20 HDDs are set up as 2 bcache cache sets, resulting in 20 mounted XFS filesystems on bcache that back an 11-node Swift cluster (one zone has one fewer node). Two of the zones use nodes configured as described above, and those nodes appear to be exhibiting soft lockups in the kernel's bcache thread. The lockups push other kernel threads into an I/O-blocked state and keep any process that touches a bcache device from completing. Disk access to the virtual disks, which are mounted without bcache, is still possible when the lockup occurs.

https://pastebin.ubuntu.com/p/mtn47QqBJ3/

There are several soft lockup messages in dmesg, and many of
the stack dumps are stuck inside bch_writeback_thread():

static int bch_writeback_thread(void *arg)
{
    [...]
    while (!kthread_should_stop()) {
        down_write(&dc->writeback_lock);
        [...]
    }
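To see which tasks the watchdog flags most often, the soft lockup reports in a dmesg capture can be tallied per task name. The following is a minimal Python sketch. It assumes the messages follow the kernel watchdog's usual `BUG: soft lockup - CPU#N stuck for Ns! [comm:pid]` wording. The function name, sample lines, PIDs, and timestamps are hypothetical and are not taken from the logs attached to this report.

```python
import re
from collections import Counter

# Assumed message shape; any prefix (timestamp, "NMI watchdog:") is ignored.
SOFT_LOCKUP_RE = re.compile(
    r"BUG: soft lockup - CPU#(?P<cpu>\d+) stuck for (?P<secs>\d+)s! "
    r"\[(?P<comm>[^\]]*):(?P<pid>\d+)\]"
)


def count_soft_lockups(lines):
    """Return a Counter mapping task name (comm) to its soft lockup report count."""
    counts = Counter()
    for line in lines:
        match = SOFT_LOCKUP_RE.search(line)
        if match:
            counts[match.group("comm")] += 1
    return counts


# Illustrative input only: hypothetical CPUs, PIDs, and timestamps.
SAMPLE_DMESG = [
    "[1000.000001] NMI watchdog: BUG: soft lockup - CPU#3 stuck for 22s! [bcache_writebac:1234]",
    "[1000.000002] Call Trace:",
    "[1028.000001] NMI watchdog: BUG: soft lockup - CPU#3 stuck for 23s! [bcache_writebac:1234]",
    "[1030.000001] NMI watchdog: BUG: soft lockup - CPU#7 stuck for 22s! [kswapd0:90]",
]

if __name__ == "__main__":
    for comm, n in count_soft_lockups(SAMPLE_DMESG).most_common():
        print(f"{comm}: {n}")
```

On a real host the input could come from a saved `dmesg` capture read line by line. The tally only shows which tasks were reported as stuck, not the lock they were waiting on.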

One core dump was captured while kswapd was reclaiming the XFS inode
cache:

__xfs_iflock(
    struct xfs_inode *ip)
{
    do {
        prepare_to_wait_exclusive(wq, &wait.wait, TASK_UNINTERRUPTIBLE);
        if (xfs_isiflocked(ip))
            io_schedule();
    } while (!xfs_iflock_nowait(ip));

- Possible fix commits:

1). 9baf30972b55 bcache: fix for gc and write-back race
https://www.spinics.net/lists/linux-bcache/msg04713.html

- Related discussions:

1). Re: [PATCH] md/bcache: Fix a deadlock while calculating writeback rate
https://www.spinics.net/lists/linux-bcache/msg04617.html

2). Re: hang during suspend to RAM when bcache cache device is attached
https://www.spinics.net/lists/linux-bcache/msg04636.html

We are running Trusty/Mitaka Swift storage on these nodes with the 4.4.0-111 kernel (linux-image-generic-lts-xenial).

This bug is missing log files that will aid in diagnosing the problem. While running an Ubuntu kernel (not a mainline or third-party kernel) please enter the following command in a terminal window:

apport-collect 1757277

and then change the status of the bug to 'Confirmed'.

If, due to the nature of the issue you have encountered, you are unable to run this command, please add a comment stating that fact and change the bug status to 'Confirmed'.

This change has been made by an automated script, maintained by the Ubuntu Kernel Team.

Changed in linux (Ubuntu):
status: New → Incomplete
tags: added: trusty
tags: added: kernel-da-key
Changed in linux (Ubuntu):
status: Incomplete → Triaged
importance: Undecided → High
Changed in linux (Ubuntu Trusty):
status: New → Triaged
importance: Undecided → High
Changed in linux (Ubuntu):
assignee: nobody → Joseph Salisbury (jsalisbury)
Changed in linux (Ubuntu Trusty):
assignee: nobody → Joseph Salisbury (jsalisbury)
Changed in linux (Ubuntu):
status: Triaged → In Progress
Changed in linux (Ubuntu Trusty):
status: Triaged → In Progress
Joseph Salisbury (jsalisbury) wrote :

Commit 9baf30972b55 is in the Xenial kernel as of 4.4.0-98. It sounds like you are running the 4.4.0-111 kernel, correct?
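The version check implied here can be sketched in a few lines of Python. This is a rough illustration, assuming Ubuntu kernel versions of the form `4.4.0-111` where the number after the dash increases with each release on the same upstream base. The helper names are hypothetical, and `4.4.0-98` is the first fixed version stated in this comment.

```python
import re


def parse_ubuntu_kernel(version):
    """Split a string such as '4.4.0-111-generic' into (4, 4, 0, 111)."""
    match = re.match(r"^(\d+)\.(\d+)\.(\d+)-(\d+)", version)
    if not match:
        raise ValueError(f"unrecognized kernel version: {version!r}")
    return tuple(int(part) for part in match.groups())


def includes_fix(running, first_fixed):
    """True if `running` is at or after `first_fixed` on the same upstream base."""
    run = parse_ubuntu_kernel(running)
    fixed = parse_ubuntu_kernel(first_fixed)
    if run[:3] != fixed[:3]:
        # Different upstream bases cannot be compared by the release number alone.
        raise ValueError("kernels are built from different upstream bases")
    return run[3] >= fixed[3]


if __name__ == "__main__":
    # Values from this report: running 4.4.0-111, fix present as of 4.4.0-98.
    print(includes_fix("4.4.0-111-generic", "4.4.0-98"))
```

On a live system the running version string would typically come from `uname -r`.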

The second patch you mention was never accepted into mainline. We can test it out. However, it might be good to first test the mainline kernel to see if some other commit already fixed this bug. Can you test the current mainline kernel? It can be downloaded from:

 http://kernel.ubuntu.com/~kernel-ppa/mainline/v4.16-rc6

Chris Gregan (cgregan) wrote :

@Drew
Do you have an update to the request above?

Drew Freiberger (afreiberger) wrote :

Joseph,

I'm currently testing a 4.15.0-13 kernel from the xenial-16.04-edge path on these hosts. The issue recurred just before the kernel change, so we should know within a couple of days whether the new kernel helps. Unfortunately, the logs for this system beyond those already shared cannot be made public.

Changed in linux (Ubuntu Trusty):
assignee: Joseph Salisbury (jsalisbury) → nobody
Changed in linux (Ubuntu):
assignee: Joseph Salisbury (jsalisbury) → nobody