Hello Junien,
After your last crash, which looked similar to the previous ones, one thing caught my attention: for the first time, an RCU stall on one CPU was detected by a different CPU. This made me think that the problem is not only related to the SMP logic, as I had believed, and that the stall is also happening somewhere else.
----
[ 5792.466770] INFO: rcu_sched detected stalls on CPUs/tasks: { 7} (detected by 15, t=15003 jiffies, g=182379, c=182378, q=0)
----
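To make it easier to spot this kind of cross-CPU detection in other logs, here is a minimal Python sketch that extracts the stalled CPUs and the detecting CPU from a stall message. The regular expression is an assumption based only on the single line quoted above (other kernel versions may print a different format), and `parse_rcu_stall` is a hypothetical helper name.

```python
import re

# Assumed format, derived only from the one rcu_sched line quoted above.
STALL_RE = re.compile(
    r"\[\s*(?P<ts>\d+\.\d+)\]\s+INFO: rcu_sched detected stalls on CPUs/tasks: "
    r"\{(?P<cpus>[^}]*)\}\s+\(detected by (?P<by>\d+), t=(?P<t>\d+) jiffies"
)

def parse_rcu_stall(line):
    """Return a dict describing the stall, or None if the line does not match."""
    m = STALL_RE.search(line)
    if m is None:
        return None
    stalled = [int(c) for c in m.group("cpus").split()]
    detected_by = int(m.group("by"))
    return {
        "timestamp": float(m.group("ts")),
        "stalled_cpus": stalled,
        "detected_by": detected_by,
        "jiffies": int(m.group("t")),
        # True when the reporting CPU is not one of the stalled CPUs.
        "detected_remotely": detected_by not in stalled,
    }

STALL_LINE = (
    "[ 5792.466770] INFO: rcu_sched detected stalls on CPUs/tasks: { 7} "
    "(detected by 15, t=15003 jiffies, g=182379, c=182378, q=0)"
)

if __name__ == "__main__":
    print(parse_rcu_stall(STALL_LINE))
```

For the line above, the sketch reports CPU 7 as stalled and CPU 15 as the detector, which is the case that caught my attention.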
And this stall happened before the async I/O error messages started being suppressed:
----
[ 5793.190218] block nbd6: Attempted send on closed socket
[ 5793.190221] blk_update_request: 1154 callbacks suppressed
[ 5793.190223] blk_update_request: I/O error, dev nbd6, sector 125828992
[ 5793.190226] buffer_io_error: 1151 callbacks suppressed
[ 5793.190227] Buffer I/O error on dev nbd6, logical block 125828992, async page read
[ 5793.190235] block nbd6: Attempted send on closed socket
[ 5793.190237] blk_update_request: I/O error, dev nbd6, sector 125828993
[ 5793.190238] Buffer I/O error on dev nbd6, logical block 125828993, async page read
[ 5793.190242] block nbd6: Attempted send on closed socket
[ 5793.190243] blk_update_request: I/O error, dev nbd6, sector 125828994
[ 5793.190245] Buffer I/O error on dev nbd6, logical block 125828994, async page read
[ 5793.190248] block nbd6: Attempted send on closed socket
----
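The ordering claim can be checked mechanically by comparing the kernel timestamps. The following is only a sketch: `first_seen` is a hypothetical helper, and the sample contains just a few of the lines quoted in this comment, not the full dmesg.

```python
import re

# Matches a dmesg line such as "[ 5793.190218] some message".
TS_RE = re.compile(r"^\[\s*(\d+\.\d+)\]\s+(.*)$")

def first_seen(lines, needle):
    """Return the timestamp of the first line whose message contains needle."""
    for line in lines:
        m = TS_RE.match(line)
        if m and needle in m.group(2):
            return float(m.group(1))
    return None

# A few lines copied from the logs quoted in this comment.
SAMPLE = [
    "[ 5792.466770] INFO: rcu_sched detected stalls on CPUs/tasks: { 7} "
    "(detected by 15, t=15003 jiffies, g=182379, c=182378, q=0)",
    "[ 5793.190218] block nbd6: Attempted send on closed socket",
    "[ 5793.190221] blk_update_request: 1154 callbacks suppressed",
    "[ 5793.190226] buffer_io_error: 1151 callbacks suppressed",
]

if __name__ == "__main__":
    stall = first_seen(SAMPLE, "rcu_sched detected stalls")
    nbd = first_seen(SAMPLE, "Attempted send on closed socket")
    print("stall reported %.6f s before the first nbd error" % (nbd - stall))
```

With these sample lines, the stall report comes roughly 0.72 seconds before the first "closed socket" message in this excerpt.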
Digging upstream (from v3.13 to HEAD), I could see that there were not that many fixes:
----
$ git log --pretty=oneline v3.13..HEAD -- drivers/block/nbd.c | wc -l
31
----
Among these nbd.c commits, I identified an improvement to the nbd timeout handling:
----
commit 7e2893a16d3e71035a38122a77bc55848a29f0e4
Author: Markus Pargmann <email address hidden>
Date: Mon Aug 17 08:20:00 2015 +0200
nbd: Fix timeout detection
----
This fix is fairly recent (4.3) and it fits this case: a 3.18 kernel facing the same issue.
Later I found out that Debian had a similar bug:
----
https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=770479
https://lists.debian.org/debian-kernel/2015/05/msg00054.html
----
for kernel 3.16, complaining about messages like this:
----
[ 5793.190242] block nbd6: Attempted send on closed socket
----
And about the lack of a proper timeout for nbd connections (the timeout is now based on the time elapsed after I/O submission).
So...
The backport should be easy* and I'll probably make a PPA containing a 3.18 kernel (plus this patch) available for you tomorrow.
* 2 out of 12 hunks FAILED -- saving rejects to file drivers/block/nbd.c.rej
* Debian has a 3.16 version already
Thank you
Rafael Tinoco