Data Corruption with Sil 3114 and 3124 controllers

Bug #1861300 reported by xenoxis on 2020-01-29
This bug affects 1 person
Affects: linux (Ubuntu)
Importance: Undecided
Assigned to: Unassigned

Bug Description

Silicon Image 3114 and 3124 controllers seem to corrupt data on write, particularly with large files or when many files are written.
I used a PCI card with the Sil 3114 chip; when copying files and directories from one HDD to another, data corruption appears, seemingly at random locations in the files.
However, every check of the computer's hardware (RAM, HDD ...) passed.
I don't know what is really happening, but I suspect that the module which controls the Sil 3114 (and therefore the Sil 3124 module too, because it was written from the 3114's module) contains a bug.
I've tried enabling the Sil 3114 module's "slow_down" option, which is supposed to solve some "problems", but there is no difference with or without it. The Sil 3124 module doesn't have this option.
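
The copy-then-verify check described above could be automated along the following lines. This is only a minimal sketch, not the procedure actually used in this report: the function names `sha256_of` and `find_mismatches` are made up for illustration, and it assumes both directory trees are mounted and readable.

```python
import hashlib
from pathlib import Path


def sha256_of(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of a file, read in 1 MiB chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def find_mismatches(src_root, dst_root):
    """List relative paths whose copy under dst_root is missing or differs."""
    src_root, dst_root = Path(src_root), Path(dst_root)
    bad = []
    for src in sorted(src_root.rglob("*")):
        if not src.is_file():
            continue
        rel = src.relative_to(src_root)
        dst = dst_root / rel
        # Caveat: a freshly copied file may still be served from the page
        # cache, so a remount or reboot before verifying is safer.
        if not dst.is_file() or sha256_of(src) != sha256_of(dst):
            bad.append(rel.as_posix())
    return bad
```

Calling `find_mismatches("/mnt/source", "/mnt/copy")` (both paths are placeholders) would return an empty list when every file was copied intact.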

Drives connected to the Sil 3124 controller seem to be affected by data corruption when at least 2 processes write to the same drive (obviously not in the same directory or the same file), whereas with the Sil 3114 the bug seems to appear at any time.
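
A rough way to reproduce the "two writers on one drive" case is sketched below. It is an assumption-laden illustration rather than the test that was run here: it uses two threads as a simpler stand-in for two separate processes, writes random data into separate subdirectories, and compares the hash of what was written with the hash of what is read back. The names `write_random_file`, `read_back_hash` and `stress` are hypothetical.

```python
import hashlib
import os
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path

BLOCK = 1 << 20  # 1 MiB


def write_random_file(path, size_mib):
    """Write size_mib MiB of random data to path; return its SHA-256."""
    digest = hashlib.sha256()
    with open(path, "wb") as f:
        for _ in range(size_mib):
            block = os.urandom(BLOCK)
            digest.update(block)
            f.write(block)
        f.flush()
        os.fsync(f.fileno())  # ask the kernel to push the data to the device
    return digest.hexdigest()


def read_back_hash(path):
    """Return the SHA-256 of the file as it is read back."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for block in iter(lambda: f.read(BLOCK), b""):
            digest.update(block)
    return digest.hexdigest()


def stress(directory, writers=2, size_mib=64):
    """Write files concurrently; return paths whose read-back hash differs."""
    paths = []
    for i in range(writers):
        sub = Path(directory) / f"writer{i}"
        sub.mkdir(parents=True, exist_ok=True)
        paths.append(sub / "data.bin")
    with ThreadPoolExecutor(max_workers=writers) as pool:
        expected = list(pool.map(lambda p: write_random_file(p, size_mib), paths))
    return [str(p) for p, want in zip(paths, expected) if read_back_hash(p) != want]
```

For the read-back to exercise the disk rather than memory, the total amount written would need to exceed the available RAM, or the filesystem would need to be remounted between writing and reading.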

I have already posted a question, https://unix.stackexchange.com/questions/564067/file-corruption-between-2-hdd, with more details about what I have done so far.

I now use a PCI card based on the Sil 3124 controller; it seems to have the same data corruption issue as the Sil 3114, but less pronounced (less often).

Running Ubuntu 16.04 server i386 with Linux Kernel 4.4.211.

xenoxis (xenoxis) on 2020-01-29
description: updated
summary: - Data Corruption with Sil 3114 and 3124 controlleur
+ Data Corruption with Sil 3114 and 3124 controlleurs

Thank you for taking the time to report this bug and helping to make Ubuntu better. It seems that your bug report is not filed about a specific source package though, rather it is just filed against Ubuntu in general. It is important that bug reports be filed about source packages so that people interested in the package can find the bugs about it. You can find some hints about determining what package your bug might be about at https://wiki.ubuntu.com/Bugs/FindRightPackage. You might also ask for help in the #ubuntu-bugs irc channel on Freenode.

To change the source package that this bug is filed about visit https://bugs.launchpad.net/ubuntu/+bug/1861300/+editstatus and add the package name in the text box next to the word Package.

[This is an automated message. I apologize if it reached you inappropriately; please just reply to this message indicating so.]

tags: added: bot-comment
xenoxis (xenoxis) on 2020-01-29
tags: added: i386
tags: added: xenial
affects: ubuntu → linux (Ubuntu)

This bug is missing log files that will aid in diagnosing the problem. While running an Ubuntu kernel (not a mainline or third-party kernel) please enter the following command in a terminal window:

apport-collect 1861300

and then change the status of the bug to 'Confirmed'.

If, due to the nature of the issue you have encountered, you are unable to run this command, please add a comment stating that fact and change the bug status to 'Confirmed'.

This change has been made by an automated script, maintained by the Ubuntu Kernel Team.

Changed in linux (Ubuntu):
status: New → Incomplete
xenoxis (xenoxis) wrote :

There are no logs in dmesg or anywhere else, because the driver tells Linux that everything is OK.

Changed in linux (Ubuntu):
status: Incomplete → Confirmed
Kai-Heng Feng (kaihengfeng) wrote :

Is it possible to try amd64 instead of i386?

Changed in linux (Ubuntu):
status: Confirmed → Incomplete
xenoxis (xenoxis) wrote :

No, because the CPU is i386 only.
I don't think changing the kernel architecture would help.

xenoxis (xenoxis) on 2020-02-09
Changed in linux (Ubuntu):
status: Incomplete → Confirmed
Kai-Heng Feng (kaihengfeng) wrote :

Please run apport-collect to collect necessary logs.

xenoxis (xenoxis) wrote :

I can't, because I get the error "No REFERER Header". I'm using Lynx because I'm connected to the server over an SSH console, so there's no X11 server to run a better web browser.

How could I get logs? Would they help even though there's no error about data corruption in the Linux logs? Linux can't notice that data is being corrupted by a driver.

Kai-Heng Feng (kaihengfeng) wrote :

Ok, please try disabling NCQ via the kernel parameter "libata.force=noncq".
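
On Ubuntu this parameter is usually added to GRUB_CMDLINE_LINUX_DEFAULT in /etc/default/grub, followed by update-grub and a reboot. To confirm that the running kernel actually received it, the boot command line can be inspected. The helper below is a small sketch; the name `noncq_forced` is made up for illustration, and it assumes the documented `libata.force=[ID:]VAL[,[ID:]VAL...]` syntax.

```python
def noncq_forced(cmdline):
    """Return True if a kernel command line contains libata.force=...noncq."""
    for token in cmdline.split():
        if not token.startswith("libata.force="):
            continue
        # The value is a comma-separated list of VAL or ID:VAL items,
        # for example "noncq" or "1.00:noncq,2:1.5Gbps".
        for item in token.split("=", 1)[1].split(","):
            if item.rpartition(":")[2] == "noncq":
                return True
    return False
```

On the server itself, `noncq_forced(open("/proc/cmdline").read())` should return True after rebooting with the parameter set.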

xenoxis (xenoxis) wrote :

It seems that it solves the problem, but I need to do more tests to confirm this.

If it has fixed the problem, can you tell me why this kernel parameter fixed the bug? I know that it disables NCQ (the command queue), but isn't this feature built into the microprocessor of each of my HDDs? Isn't it independent of my SATA controller?

I've checked whether the Sil 3114 and Sil 3124 support NCQ: the 3124 seems to support it for sure, the 3114 less certainly.
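
NCQ only works when both the drive and the controller (with its driver) support it, so one practical check is the queue depth that the kernel reports for each disk under /sys/block/<device>/device/queue_depth. A value of 1 generally means NCQ is not in use for that drive. The sketch below reads those files; the function name `queue_depths` and its `sys_block` argument are illustrative assumptions.

```python
from pathlib import Path


def queue_depths(sys_block="/sys/block"):
    """Map each block device name to its reported queue depth."""
    depths = {}
    for dev in sorted(Path(sys_block).iterdir()):
        depth_file = dev / "device" / "queue_depth"
        # Devices without this attribute (loop devices, for example) are skipped.
        if depth_file.is_file():
            depths[dev.name] = int(depth_file.read_text().strip())
    return depths
```

Comparing the output before and after booting with "libata.force=noncq" would show whether the parameter changed anything for the drives behind the Sil controller.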

xenoxis (xenoxis) wrote :

Ok, I still have data corruption through Samba share copies (between my Windows PC and the server), but copies between HDDs seem to be OK. I'm still calculating checksums over a large amount of copied data ...

Kai-Heng Feng (kaihengfeng) wrote :

Ok, if disabling NCQ solves the issue we need to fix it in the kernel.

xenoxis (xenoxis) wrote :

Ok, after lots of tests (copies between HDDs, copies through Samba ...), it seems that this option fixes the problem.
I'm still not 100% sure; I'm still doing tests and I'll use the server for a long period of time, so I'll report if the issue is still there.

xenoxis (xenoxis) wrote :

I don't know why or how, but it seems that data corruption still appears when reading and writing on the same HDD.

xenoxis (xenoxis) wrote :

After some tests, it seems that some data is copied correctly through Samba, but some files fail every time I try to copy them; the checksum changes on every copy.

I don't know whether consecutive bits fail during the transfer or something else, but the fact remains: the bug is still there (even if disabling NCQ made a difference).
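
To find out whether the damage comes in consecutive runs, a good copy and a corrupted copy of the same file can be compared byte by byte. The following is a minimal sketch under that assumption; `diff_runs` is a hypothetical helper name, and the per-byte loop is slow on chunks that differ, so it is meant for diagnosis rather than bulk use.

```python
def diff_runs(path_a, path_b, chunk_size=1 << 20):
    """Return (offset, length) for each contiguous run of differing bytes."""
    runs = []
    run_start = None  # offset where the current run of differences began
    offset = 0
    with open(path_a, "rb") as fa, open(path_b, "rb") as fb:
        while True:
            a = fa.read(chunk_size)
            b = fb.read(chunk_size)
            if not a and not b:
                break
            n = max(len(a), len(b))
            if a == b:
                # Fast path: identical chunk, so any open run ends here.
                if run_start is not None:
                    runs.append((run_start, offset - run_start))
                    run_start = None
            else:
                for i in range(n):
                    # Bytes past the end of the shorter file count as different.
                    same = i < len(a) and i < len(b) and a[i] == b[i]
                    if not same and run_start is None:
                        run_start = offset + i
                    elif same and run_start is not None:
                        runs.append((run_start, offset + i - run_start))
                        run_start = None
            offset += n
    if run_start is not None:
        runs.append((run_start, offset - run_start))
    return runs
```

Run lengths and offsets that line up with sector or page sizes (512 or 4096 bytes, for instance) would hint at whole blocks being misplaced, whereas scattered short runs would point to bit-level errors.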
