Comment 110 for bug 1505564

Rafael David Tinoco (rafaeldtinoco) wrote :

Hello Junien,

After your last crash, which was similar to the previous ones, one thing caught my attention: for the first time, an RCU stall on one CPU was detected by another CPU. This made me think that the problem was not related only to the SMP logic, as I had believed, but that the stall was also occurring somewhere else.

----
[ 5792.466770] INFO: rcu_sched detected stalls on CPUs/tasks: { 7} (detected by 15, t=15003 jiffies, g=182379, c=182378, q=0)
----

This stall happened before the async I/O callbacks started to be suppressed:

----
[ 5793.190218] block nbd6: Attempted send on closed socket
[ 5793.190221] blk_update_request: 1154 callbacks suppressed
[ 5793.190223] blk_update_request: I/O error, dev nbd6, sector 125828992
[ 5793.190226] buffer_io_error: 1151 callbacks suppressed
[ 5793.190227] Buffer I/O error on dev nbd6, logical block 125828992, async page read
[ 5793.190235] block nbd6: Attempted send on closed socket
[ 5793.190237] blk_update_request: I/O error, dev nbd6, sector 125828993
[ 5793.190238] Buffer I/O error on dev nbd6, logical block 125828993, async page read
[ 5793.190242] block nbd6: Attempted send on closed socket
[ 5793.190243] blk_update_request: I/O error, dev nbd6, sector 125828994
[ 5793.190245] Buffer I/O error on dev nbd6, logical block 125828994, async page read
[ 5793.190248] block nbd6: Attempted send on closed socket
----
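
The ordering claim above can be checked mechanically. This is only a minimal sketch, assuming the relevant log lines are available as text (here three of the quoted lines are embedded directly in the script; the helper name first_ts is made up for illustration). It extracts the timestamp of the first RCU stall message and of the first nbd socket error and compares them:

```shell
#!/bin/sh
# Excerpt copied from the log lines quoted above (trimmed to three lines).
log='[ 5792.466770] INFO: rcu_sched detected stalls on CPUs/tasks: { 7} (detected by 15, t=15003 jiffies, g=182379, c=182378, q=0)
[ 5793.190218] block nbd6: Attempted send on closed socket
[ 5793.190221] blk_update_request: 1154 callbacks suppressed'

# Hypothetical helper: print the timestamp of the first line matching a pattern.
# The timestamp is the text between "[" and "]" at the start of the line.
first_ts() {
    printf '%s\n' "$log" | awk -v pat="$1" '
        $0 ~ pat {
            ts = $0
            sub(/^\[ */, "", ts)   # drop the leading "[" and padding spaces
            sub(/\].*$/, "", ts)   # drop everything from the first "]" onward
            print ts
            exit
        }'
}

stall_ts=$(first_ts 'rcu_sched detected stalls')
nbd_ts=$(first_ts 'Attempted send on closed socket')

# Compare as floating-point numbers in awk; plain sh arithmetic is integer-only.
order=$(awk -v a="$stall_ts" -v b="$nbd_ts" \
    'BEGIN { if (a + 0 < b + 0) print "stall-first"; else print "nbd-first" }')

echo "stall=$stall_ts nbd=$nbd_ts order=$order"
```

On a full log, the embedded excerpt could be replaced by the output of dmesg or the contents of a saved kernel log file; the comparison itself stays the same.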

Digging upstream (from 3.13 to HEAD), I could see that there were not many fixes:

----
$ git log --pretty=oneline v3.13..HEAD -- drivers/block/nbd.c | wc -l
31
----

Among those nbd.c commits, I identified an improvement to nbd timeout handling:

----
commit 7e2893a16d3e71035a38122a77bc55848a29f0e4
Author: Markus Pargmann <email address hidden>
Date: Mon Aug 17 08:20:00 2015 +0200

    nbd: Fix timeout detection
----

This fix is fairly recent (4.3) and it fits this case: the 3.18 kernel faces the same issue.

Later I found out that Debian had a similar bug:

----
https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=770479
https://lists.debian.org/debian-kernel/2015/05/msg00054.html
----

for kernel 3.16, complaining about messages like this:

----
[ 5793.190242] block nbd6: Attempted send on closed socket
----

and about the lack of a proper timeout for nbd connections (the timeout is now based on the time elapsed after I/O submission).

SO...

The backport should be easy* and I'll probably make a PPA containing a 3.18 kernel (plus this patch) available for you tomorrow.

* 2 out of 12 hunks FAILED -- saving rejects to file drivers/block/nbd.c.rej
* Debian has a 3.16 version already

Thank you

Rafael Tinoco