Comment 8 for bug 1638700

Ming Lei (tom-leiming) wrote : Re: [Bug 1638700] Re: hio: SSD data corruption under stress test

On Thu, Nov 3, 2016 at 5:42 AM, Kamal Mostafa <email address hidden> wrote:
> Ming Lei comment #2 says you're the author of this patch to the hio
> driver:
>
> +#if (LINUX_VERSION_CODE >= KERNEL_VERSION(4,3,0))
> + blk_queue_split(q, &bio, q->bio_split);
> +#endif
> +
>
> Can you provide us with a short explanation for the git log, and also
> your Signed-off-by line for that patch?

Sure, please see the attachment.

>
> --
> https://bugs.launchpad.net/bugs/1638700
>
> Title:
> hio: SSD data corruption under stress test
>
> Status in linux package in Ubuntu:
> In Progress
> Status in linux source package in Xenial:
> In Progress
> Status in linux source package in Yakkety:
> In Progress
> Status in linux source package in Zesty:
> In Progress
>
> Bug description:
> (forwarded from James Troup):
>
> Just to follow up on this with a little more information: we have now
> reproduced this in the following scenarios:
>
> * Ubuntu kernel 4.4 (i.e. 16.04) and kernel 4.8 (i.e. HWE-Y)
> * With and without Bcache involved
> * With both XFS and ext4
> * With HIO driver versions 2.1.0-23 and 2.1.0-25
> * With HIO Firmware 640 and 650
> * With and without the following two patches
> - https://git.launchpad.net/~ubuntu-kernel/ubuntu/+source/linux/+git/xenial/commit/?id=7290fa97b945c288d8dd8eb8f284b98cb495b35b
> - https://git.launchpad.net/~ubuntu-kernel/ubuntu/+source/linux/+git/xenial/commit/?id=901a3142db778ddb9ed6a9000ce8e5b0f66c48ba
>
> In all cases, we applied the following two patches in order to get hio
> to build at all with a 4.4 or later kernel:
>
> https://git.launchpad.net/~ubuntu-kernel/ubuntu/+source/linux/+git/xenial/commit/?id=0abbb90372847caeeedeaa9db0f21e05ad8e9c74
> https://git.launchpad.net/~ubuntu-kernel/ubuntu/+source/linux/+git/xenial/commit/?id=a0705c5ff3d12fc31f18f5d3c8589eaaed1aa577
>
> We've confirmed that we can reproduce the corruption on any machine in
> Tele2's Vienna facility.
>
> We've confirmed that on every machine except one, the 'hio_info'
> command reports the health as 'OK'.
>
> Our most common reproducer is one of two scenarios:
>
> a) http://paste.ubuntu.com/23405150/
>
> b) http://paste.ubuntu.com/23405234/
>
> In the second example, the corruption shows up faster if the 'count'
> argument to dd is increased, and it can be avoided by lowering it.
> For example, on the machine I'm currently testing, count=52450 doesn't
> appear to cause corruption, but a count of 53000 shows it immediately,
> every time.
>
> I hope this helps - please let us know what further information we can
> provide to debug this problem.
>
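For context on the blk_queue_split() patch quoted above: as we understand it, starting with Linux 4.3 the block layer no longer guarantees that bios handed to a driver's own make_request function fit within the queue limits, so bio-based drivers are expected to split oversized bios themselves, which is what the version-guarded call adds. The userspace C sketch below only illustrates the basic idea of splitting an oversized request at a maximum-sectors limit. `struct io_req`, `split_req()` and `max_sectors` are hypothetical names for illustration; this is not the hio driver's code or the kernel's implementation, which also enforces segment-count, segment-size and boundary limits.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical, simplified model of a block I/O request ("bio"):
 * a starting sector and a length in 512-byte sectors. */
struct io_req {
    unsigned long long sector;
    unsigned int nr_sectors;
};

/*
 * Split `req` into consecutive pieces of at most `max_sectors` sectors,
 * writing them to `out` (capacity `out_cap`). Returns the number of
 * pieces produced, or 0 if `out` is too small (an empty request also
 * yields 0). Mimics only the size-limit idea behind blk_queue_split().
 */
static size_t split_req(struct io_req req, unsigned int max_sectors,
                        struct io_req *out, size_t out_cap)
{
    size_t n = 0;

    assert(max_sectors > 0);
    while (req.nr_sectors > 0) {
        /* Take at most max_sectors from the front of the request. */
        unsigned int len = req.nr_sectors < max_sectors
                               ? req.nr_sectors : max_sectors;

        if (n == out_cap)
            return 0; /* not enough room for all pieces */
        out[n].sector = req.sector;
        out[n].nr_sectors = len;
        n++;

        /* Advance past the piece just emitted. */
        req.sector += len;
        req.nr_sectors -= len;
    }
    return n;
}
```

With `max_sectors = 1024`, a 2560-sector request starting at sector 100 becomes three pieces (1024, 1024 and 512 sectors) starting at sectors 100, 1124 and 2148. A driver that skips this step would instead receive the whole 2560-sector request in one call.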