Crucial M550 1TB SSD missing from NCQ TRIM blacklist

Bug #1363462 reported by Carl-Daniel Hailfinger
This bug affects 2 people
Affects         Status      Importance  Assigned to       Milestone
Linux           Unknown     Unknown
linux (Ubuntu)  Incomplete  Medium      Joseph Salisbury

Bug Description

I own a Crucial/Micron M550 1TB SSD which suffers data loss when using NCQ TRIM.
The current Ubuntu Trusty kernel has a blacklist which matches all M550 SSDs except the 1024 GB (1 TB) version because the matching pattern is limited to 3 digits for size. Upstream Linux has fixed this bug already:

https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/commit/?id=2a13772a144d2956a7fedd18685921d0a9b8b783
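The three-digit limitation described above can be illustrated with a small glob-matching sketch. The two patterns below are assumptions made up for illustration and are not copied from the kernel source; Python's fnmatch is used as a stand-in for the kernel's own matcher, on the assumption that `?` (exactly one character) and `*` (any run of characters) behave the same way. The 512 GB model string is likewise an example; only the 1024 GB string appears in this report.

```python
from fnmatch import fnmatchcase

# Hypothetical patterns for illustration only (not taken from libata).
NARROW = "Crucial_CT???M550SSD*"  # "???" matches exactly three characters
WIDE = "Crucial_CT*M550SSD*"      # "*" matches any number of characters


def is_blacklisted(model: str, pattern: str) -> bool:
    """Return True if the drive model string matches the glob pattern."""
    return fnmatchcase(model, pattern)


if __name__ == "__main__":
    for model in ("Crucial_CT512M550SSD1", "Crucial_CT1024M550SSD1"):
        print(model,
              "narrow:", is_blacklisted(model, NARROW),
              "wide:", is_blacklisted(model, WIDE))
```

With a three-character size field the four-digit "1024" model falls through the narrow pattern, while a wildcard for the size catches both.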

Please try to get this merged, and tell me where I can start an installation with a fixed kernel so that I don't end up with a corrupt disk by the end of the installation. Thanks!

Brad Figg (brad-figg) wrote : Missing required logs.

This bug is missing log files that will aid in diagnosing the problem. From a terminal window please run:

apport-collect 1363462

and then change the status of the bug to 'Confirmed'.

If, due to the nature of the issue you have encountered, you are unable to run this command, please add a comment stating that fact and change the bug status to 'Confirmed'.

This change has been made by an automated script, maintained by the Ubuntu Kernel Team.

Changed in linux (Ubuntu):
status: New → Incomplete
Carl-Daniel Hailfinger (hailfinger) wrote :

It's a bit difficult getting logs from a corrupt hard disk. This bug report was created with the help of my secondary computer running another distro.

Changed in linux (Ubuntu):
status: Incomplete → Confirmed
tags: added: patch
Joseph Salisbury (jsalisbury) wrote :

The upstream patch was cc'd to stable, so trusty will get the fix when the upstream commit comes in via the normal stable update process.

tags: added: kernel-fixed-upstream trusty
Changed in linux (Ubuntu):
importance: Undecided → Medium
Joseph Salisbury (jsalisbury) wrote :

I'll build a test kernel with a cherry-pick of upstream commit 2a13772a144d2956a7fedd18685921d0a9b8b783.

Changed in linux (Ubuntu):
assignee: nobody → Joseph Salisbury (jsalisbury)
status: Confirmed → In Progress
Joseph Salisbury (jsalisbury) wrote :

I built a Trusty test kernel with a cherry-pick of commit 2a13772. Could you test this kernel and see whether it resolves the bug? If so, we can submit an SRU request to cover the gap until the fix arrives from upstream.

The test kernel can be downloaded from:
http://kernel.ubuntu.com/~jsalisbury/lp1363462/

Changed in linux (Ubuntu):
status: In Progress → Incomplete
Jason M. (jason-ubuntu) wrote :

I have the same problem with two Crucial 1TB M550s:

 [ 4.094763] ata4.00: ATA-9: Crucial_CT1024M550SSD1, MU01, max UDMA/133
 [ 4.095063] ata3.00: ATA-9: Crucial_CT1024M550SSD1, MU01, max UDMA/133

and have suffered data loss in a RAID1 config that required a reinstall. I also see errors on my two other SSDs, both Intel:

 [ 4.096849] ata2.00: ATA-9: INTEL SSDSC2BW120A4, DC32, max UDMA/133
 [ 4.097088] ata1.00: ATA-9: INTEL SSDSC2BW120A4, DC32, max UDMA/133

however the Intel SSDs do not seem to have suffered data loss. A sample error from syslog, on one of the Intel SSDs:

 [1767684.772447] ata1.00: exception Emask 0x1 0 SAct 0x10 SErr 0x480100 action 0x6 frozen
 [1767684.772456] ata1.00: irq_stat 0x08000000, interface fatal error
 [1767684.772462] ata1: SError: { UnrecovData 10B8B Handshk }
 [1767684.772469] ata1.00: failed command: WRITE FPDMA QUEUED
 [1767684.772479] ata1.00: cmd 61/b0:20:a8:a0:bf/03:00:04:00:00/40 tag 4 ncq 483328 out
 [1767684.772479] res 40/00:20:a8:a0:bf/00:00:04:00:00/40 Emask 0x10 (ATA bus error)
 [1767684.772484] ata1.00: status: { DRDY }
 [1767684.772492] ata1: hard resetting link
 [1767685.088404] ata1: SATA link up 6.0 Gbps (SStatus 133 SControl 300)
 [1767685.129320] ata1.00: configured for UDMA/133
 [1767685.129339] ata1: EH complete

The motherboard is a Gigabyte H87N w/ SATA controller: Intel Corporation 8 Series/C220 Series Chipset Family 6-port SATA Controller 1 [AHCI mode] (rev 05). The lspci output is attached.

uname -rv: 3.13.0-39-generic #66-Ubuntu SMP Tue Oct 28 13:30:27 UTC 2014
lsb_release -d: Description: Ubuntu 14.04.1 LTS

Adding libata.force=noncq to GRUB_CMDLINE_LINUX_DEFAULT in /etc/default/grub has stopped the errors for the last 5 days, and there have been no spontaneous RAID1 resyncs.
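The workaround above amounts to appending one kernel parameter to the GRUB_CMDLINE_LINUX_DEFAULT line. A minimal sketch of that edit follows, assuming the value is a single double-quoted string as in a stock Ubuntu /etc/default/grub; the function name add_kernel_param is hypothetical, and the sketch works on text in memory rather than touching the real file.

```python
import re

# Matches the assignment line and captures the quoted parameter list.
CMDLINE = re.compile(r'^(GRUB_CMDLINE_LINUX_DEFAULT=)"([^"]*)"', re.MULTILINE)


def add_kernel_param(grub_text: str, param: str) -> str:
    """Append param to GRUB_CMDLINE_LINUX_DEFAULT unless already present."""
    def patch(match):
        params = match.group(2).split()
        if param not in params:
            params.append(param)
        return match.group(1) + '"' + " ".join(params) + '"'

    return CMDLINE.sub(patch, grub_text)


if __name__ == "__main__":
    sample = 'GRUB_CMDLINE_LINUX_DEFAULT="quiet splash"\n'
    print(add_kernel_param(sample, "libata.force=noncq"), end="")
```

After changing the real file, the GRUB configuration has to be regenerated (on Ubuntu, with update-grub) and the machine rebooted before the parameter takes effect.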

Joseph Salisbury (jsalisbury) wrote :

@Jason, can you test the kernel posted in comment #5?

Jason M. (jason-ubuntu) wrote :

Hi @Joseph,

Good news and bad news.

Good news: the Crucial 1TB M550s were recognized and queued TRIM support was disabled:
  [ 3.661992] ata4.00: disabling queued TRIM support
  [ 3.661993] ata4.00: ATA-9: Crucial_CT1024M550SSD1, MU01, max UDMA/133
  [ 3.661994] ata4.00: 2000409264 sectors, multi 16: LBA48 NCQ (depth 31/32), AA
  [ 3.662337] ata3.00: disabling queued TRIM support
  [ 3.662337] ata3.00: ATA-9: Crucial_CT1024M550SSD1, MU01, max UDMA/133
  [ 3.662338] ata3.00: 2000409264 sectors, multi 16: LBA48 NCQ (depth 31/32), AA

The bad news: it did not stop "failed command: WRITE FPDMA QUEUED" errors, e.g.
  [ 12.098247] ata4.00: exception Emask 0x10 SAct 0x60000001 SErr 0x400100 action 0x6 frozen
  [ 12.098266] ata4.00: irq_stat 0x08000000, interface fatal error
  [ 12.098280] ata4: SError: { UnrecovData Handshk }
  [ 12.098291] ata4.00: failed command: WRITE FPDMA QUEUED
  [ 12.098304] ata4.00: cmd 61/00:00:48:23:01/04:00:03:00:00/40 tag 0 ncq 524288 out
  [ 12.098304] res 40/00:f4:48:1f:01/00:00:03:00:00/40 Emask 0x10 (ATA bus error)
  [ 12.098346] ata4.00: status: { DRDY }
  [ 12.098355] ata4.00: failed command: WRITE FPDMA QUEUED
  [ 12.098368] ata4.00: cmd 61/00:e8:48:1b:01/04:00:03:00:00/40 tag 29 ncq 524288 out
  [ 12.098368] res 40/00:f4:48:1f:01/00:00:03:00:00/40 Emask 0x10 (ATA bus error)
  [ 12.098409] ata4.00: status: { DRDY }
  [ 12.098417] ata4.00: failed command: WRITE FPDMA QUEUED
  [ 12.098429] ata4.00: cmd 61/00:f0:48:1f:01/04:00:03:00:00/40 tag 30 ncq 524288 out
  [ 12.098429] res 40/00:f4:48:1f:01/00:00:03:00:00/40 Emask 0x10 (ATA bus error)
  [ 12.098466] ata4.00: status: { DRDY }
  [ 12.098476] ata4: hard resetting link

A RAID1 scrub indicated there was a repair on ATA4, so these resets caused some data issue(s).
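To compare kernels or boot options, it can help to count the "failed command" lines per ATA device in a saved log. The sketch below is one way to do that, assuming the log lines have the same shape as the excerpts quoted in this report; the function name count_failed_commands is hypothetical.

```python
import re
from collections import Counter

# Captures the device (e.g. "ata4.00") and the failed command name.
FAILED = re.compile(r"\]\s+(ata\d+(?:\.\d+)?): failed command: (.+)$")


def count_failed_commands(log_text: str) -> Counter:
    """Count failed commands per (device, command) pair in kernel log text."""
    counts = Counter()
    for line in log_text.splitlines():
        match = FAILED.search(line)
        if match:
            counts[(match.group(1), match.group(2).strip())] += 1
    return counts


if __name__ == "__main__":
    sample = (
        "[ 12.098291] ata4.00: failed command: WRITE FPDMA QUEUED\n"
        "[ 12.098355] ata4.00: failed command: WRITE FPDMA QUEUED\n"
        "[ 12.098417] ata4.00: failed command: WRITE FPDMA QUEUED\n"
        "[ 12.098476] ata4: hard resetting link\n"
    )
    for (device, command), n in count_failed_commands(sample).items():
        print(device, command, n)
```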

So far, only "libata.force=X.YY:noncq" seems to fix the problem. After reading quite a few articles, I'm beginning to think there are some pretty serious issues w/ the kernel regarding SSDs and controller interactions. There are many, many folks having trouble, and the consistent response is to disable NCQ, although there may also be some interactions w/ MSI and SSDs, at least as indicated by
  - http://article.gmane.org/gmane.linux.ide/58740
  - https://bugzilla.kernel.org/show_bug.cgi?id=60731

What surprises me is that so many vendors are providing SSD backed VMs based on Linux, and yet it's mostly individuals seeing these problems with common distros, e.g. Ubuntu, Fedora, CentOS. Is there some major disconnect here?

madbiologist (me-again) wrote :

Micron has released an updated firmware (MU02) for M510/M550/MX100 drives to fix the issues with queued TRIM. Queued TRIM remains broken on M500 but is working fine on later drives such as M600 and MX200.

The upstream 3.19 kernel has tweaked the blacklist to reflect the above. This change is also in the Ubuntu 3.19.0-18.18 Vivid kernel and the 3.16.0-43.58~14.04.1 Trusty kernel.
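To see which drives still run the older firmware, the model and firmware revision can be read from the "ATA-9:" identification lines in the boot log. The sketch below assumes the line format shown in the excerpts earlier in this report and assumes Micron revisions are of the form "MU" followed by a number; the helper names drive_firmware and m550_needs_update are hypothetical.

```python
import re

# Captures device, model and firmware from lines such as
# "ata4.00: ATA-9: Crucial_CT1024M550SSD1, MU01, max UDMA/133".
IDENT = re.compile(r"(ata\d+\.\d+): ATA-\d+: ([^,]+), ([^,]+), max")


def drive_firmware(log_text: str) -> dict:
    """Return {device: (model, firmware)} for each identification line."""
    found = {}
    for line in log_text.splitlines():
        match = IDENT.search(line)
        if match:
            found[match.group(1)] = (match.group(2).strip(),
                                     match.group(3).strip())
    return found


def m550_needs_update(model: str, firmware: str) -> bool:
    """True for an M550 whose firmware is older than MU02 (assumed naming)."""
    numbered = re.fullmatch(r"MU(\d+)", firmware)
    return "M550" in model and numbered is not None and int(numbered.group(1)) < 2


if __name__ == "__main__":
    sample = (
        "[ 4.094763] ata4.00: ATA-9: Crucial_CT1024M550SSD1, MU01, max UDMA/133\n"
        "[ 4.096849] ata2.00: ATA-9: INTEL SSDSC2BW120A4, DC32, max UDMA/133\n"
    )
    for device, (model, fw) in drive_firmware(sample).items():
        print(device, model, fw, "update advised:", m550_needs_update(model, fw))
```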
