Linux 4.15 and onwards fails to initialize some hard drives

Bug #1783906 reported by danieru on 2018-07-26
This bug affects 1 person
Affects           Status                     Importance   Assigned to        Milestone
linux (Ubuntu)    Status tracked in Cosmic
  Bionic                                     Medium       Joseph Salisbury
  Cosmic                                     Medium       Joseph Salisbury

Bug Description

I have two hard drives: the main drive is a TOSHIBA DT01ACA200 and the second, backup drive is a Western Digital WD5003AZEX. I installed Lubuntu 18.04.1 on the Toshiba HDD and it boots just fine. The issue is with the second drive: during installation the WD HDD did not even show up as an install target, and after boot it still does not come up. This is the dmesg with the stock kernel (4.15): https://paste.ubuntu.com/p/kpxh94v2SK/

ata6 is the WD HDD that refuses to work. The messages:
[ 302.107650] ata6: SError: { CommWake 10B8B Dispar DevExch }
[ 302.107658] ata6: hard resetting link
[ 307.860291] ata6: link is slow to respond, please be patient (ready=0)
[ 312.120898] ata6: COMRESET failed (errno=-16)
[ 363.445120] INFO: task kworker/u8:5:201 blocked for more than 120 seconds.
[ 363.445131] Not tainted 4.15.0-29-generic #31-Ubuntu
[ 363.445135] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 363.445140] kworker/u8:5 D 0 201 2 0x80000000
[ 363.445155] Workqueue: events_unbound async_run_entry_fn
[ 363.445157] Call Trace:
[ 363.445171] __schedule+0x291/0x8a0
[ 363.445177] schedule+0x2c/0x80
[ 363.445182] ata_port_wait_eh+0x7c/0xf0
[ 363.445186] ? wait_woken+0x80/0x80
[ 363.445189] ata_port_probe+0x28/0x40
[ 363.445192] async_port_probe+0x2e/0x52
[ 363.445196] async_run_entry_fn+0x3c/0x150
[ 363.445199] process_one_work+0x1de/0x410
[ 363.445203] worker_thread+0x32/0x410
[ 363.445207] kthread+0x121/0x140
[ 363.445210] ? process_one_work+0x410/0x410
[ 363.445214] ? kthread_create_worker_on_cpu+0x70/0x70
[ 363.445218] ret_from_fork+0x22/0x40

These messages repeat constantly. Also, when I try to turn off the computer, it seems to freeze: the keyboard and mouse lights turn off, but the computer just stays on.
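
For reference, the SError flags in the ata6 log above correspond to bits in the SATA SError register. The following is a minimal sketch, assuming the bit positions given in the SATA specification for the four flags seen here; `serr_bits` and `serror_names` are hypothetical helpers written for illustration, not libata functions.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

/* SError DIAG bits (assumed from the SATA specification), with the
 * short names that appear in the log line above. */
struct serr_bit { uint32_t mask; const char *name; };

static const struct serr_bit serr_bits[] = {
    { 1u << 18, "CommWake" },  /* COMWAKE detected by the PHY */
    { 1u << 19, "10B8B"    },  /* 10b-to-8b decode error */
    { 1u << 20, "Dispar"   },  /* running disparity error */
    { 1u << 26, "DevExch"  },  /* device presence changed */
};

/* Write the space-separated names of the set bits into buf and
 * return how many names were written. */
static int serror_names(uint32_t serr, char *buf, size_t len)
{
    size_t used = 0;
    int count = 0;

    assert(buf != NULL && len > 0);
    buf[0] = '\0';
    for (size_t i = 0; i < sizeof(serr_bits) / sizeof(serr_bits[0]); i++) {
        if (!(serr & serr_bits[i].mask))
            continue;
        int n = snprintf(buf + used, len - used, "%s%s",
                         count ? " " : "", serr_bits[i].name);
        if (n < 0 || (size_t)n >= len - used)
            break;              /* output would be truncated; stop here */
        used += (size_t)n;
        count++;
    }
    return count;
}
```

Calling `serror_names(0x041C0000u, buf, sizeof buf)` fills `buf` with `CommWake 10B8B Dispar DevExch`. That raw value is reconstructed from the flag names; the log itself does not show it.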

I tried Tiny Core 9.0, which ships Linux 4.14.10, and did not have this issue. I also installed Linux 4.14 on this Lubuntu 18.04 system using the Ukuu kernel update utility; with that kernel, or any earlier version, the WD HDD works again. Here's a dmesg of Lubuntu 18.04 with Linux 4.14, with the WD HDD finally coming up at the end: https://paste.ubuntu.com/p/Gd3cGFbjTJ/

I also tried Linux 4.17, but the WD HDD refuses to work on that version too. Here's the dmesg for it: https://paste.ubuntu.com/p/PmNn96vZZv/

ProblemType: Bug
DistroRelease: Ubuntu 18.04
Package: linux-image-4.15.0-29-generic 4.15.0-29.31
ProcVersionSignature: Ubuntu 4.15.0-29.31-generic 4.15.18
Uname: Linux 4.15.0-29-generic x86_64
AlsaVersion: Advanced Linux Sound Architecture Driver Version k4.15.0-29-generic.
ApportVersion: 2.20.9-0ubuntu7.2
Architecture: amd64
AudioDevicesInUse:
 USER PID ACCESS COMMAND
 /dev/snd/controlC0: testtest 756 F.... pulseaudio
 /dev/snd/controlC1: testtest 756 F.... pulseaudio
Card0.Amixer.info:
 Card hw:0 'NVidia_1'/'HDA NVidia at 0xfe020000 irq 22'
   Mixer name : 'Realtek ALC1200'
   Components : 'HDA:10ec0888,10ec0000,00100101 HDA:10de0002,10de0101,00100000'
   Controls : 56
   Simple ctrls : 21
Card1.Amixer.info:
 Card hw:1 'NVidia'/'HDA NVidia at 0xfcffc000 irq 16'
   Mixer name : 'Nvidia GPU 42 HDMI/DP'
   Components : 'HDA:10de0042,38422651,00100100'
   Controls : 21
   Simple ctrls : 3
CurrentDesktop: LXDE
Date: Thu Jul 26 17:10:58 2018
HibernationDevice: RESUME=UUID=17e70869-516d-4b63-b900-e92e3c4b73b6
InstallationDate: Installed on 2018-07-26 (0 days ago)
InstallationMedia: Lubuntu 18.04.1 LTS "Bionic Beaver" - Release amd64 (20180725)
MachineType: 113 1
ProcFB: 0 nouveaufb
ProcKernelCmdLine: BOOT_IMAGE=/boot/vmlinuz-4.15.0-29-generic root=UUID=71cf0a32-7827-49be-a2c0-cd50a72c26a1 ro quiet splash vt.handoff=1
RelatedPackageVersions:
 linux-restricted-modules-4.15.0-29-generic N/A
 linux-backports-modules-4.15.0-29-generic N/A
 linux-firmware 1.173.1
RfKill:
 0: phy0: Wireless LAN
  Soft blocked: no
  Hard blocked: no
SourcePackage: linux
UpgradeStatus: No upgrade log present (probably fresh install)
dmi.bios.date: 09/30/2008
dmi.bios.vendor: Phoenix Technologies, LTD
dmi.bios.version: 6.00 PG
dmi.board.name: 113-M2-E113
dmi.board.vendor: EVGA
dmi.board.version: 1
dmi.chassis.asset.tag: Unknow
dmi.chassis.type: 3
dmi.chassis.vendor: EVGA
dmi.chassis.version: 113-M2-E113
dmi.modalias: dmi:bvnPhoenixTechnologies,LTD:bvr6.00PG:bd09/30/2008:svn113:pn1:pvr1:rvnEVGA:rn113-M2-E113:rvr1:cvnEVGA:ct3:cvr113-M2-E113:
dmi.product.name: 1
dmi.product.version: 1
dmi.sys.vendor: 113

danieru (danigaritarojas) wrote :

This change was made by a bot.

Changed in linux (Ubuntu):
status: New → Confirmed
danieru (danigaritarojas) wrote :

I forgot to mention the Western Digital hard drive has always been slow to actually show up, and I’ve always gotten those "COMRESET failed, link is slow to respond" messages since I bought it.

And two details I forgot to mention about the hard drives:
1. The Toshiba hard drive that works is formatted with GPT.
2. The Western Digital hard drive that doesn't work with Linux 4.15+ is formatted with MBR.
And just for the record, USB flash drives formatted with MBR work just fine on Linux 4.15.

Joseph Salisbury (jsalisbury) wrote :

Did this issue start happening after an update/upgrade? Was there a prior kernel version where you were not having this particular problem?

Would it be possible for you to test the latest upstream kernel? Refer to https://wiki.ubuntu.com/KernelMainlineBuilds . Please test the latest v4.18 kernel[0].

If this bug is fixed in the mainline kernel, please add the following tag 'kernel-fixed-upstream'.

If the mainline kernel does not fix this bug, please add the tag: 'kernel-bug-exists-upstream'.

Once testing of the upstream kernel is complete, please mark this bug as "Confirmed".

Thanks in advance.

[0] http://kernel.ubuntu.com/~kernel-ppa/mainline/v4.18-rc6

tags: added: kernel-da-key
Changed in linux (Ubuntu):
importance: Undecided → Medium
status: Confirmed → Incomplete
danieru (danigaritarojas) wrote :

I first experienced this issue while testing the second beta of Ubuntu 18.04. As I explained in the original bug report, the issue doesn't happen if I use Linux 4.14.

I've installed Linux 4.18-rc6 as explained in the wiki, but the WD HDD still won't come up.
Here's the dmesg with this kernel: https://paste.ubuntu.com/p/dz8rZZNmHP/
Apart from the WD HDD still not working, I noticed that, unlike with Linux 4.15 and 4.17, my computer did not freeze on shutdown with 4.18-rc6.

Changed in linux (Ubuntu):
status: Incomplete → Confirmed
tags: added: kernel-bug-exists-upstream
danieru (danigaritarojas) wrote :

I went ahead and tested this bug on Linux 4.15-rc1 through 4.15-rc5.
I found that 4.15-rc3 is the last RC where the bug does not occur, and 4.15-rc4 is the first RC where it does.

Here's the dmesg with linux 4.15rc3 and the WD HDD working: https://paste.ubuntu.com/p/6DX5TfzMkW/
And here's the dmesg with linux 4.15rc4 and the WD HDD failing: https://paste.ubuntu.com/p/vF7Zs8xgjT/

Joseph Salisbury (jsalisbury) wrote :

Thanks for the testing. I'll review the commits between -rc3 and -rc4. If nothing sticks out, I'll start a kernel bisect and build a test kernel.

Changed in linux (Ubuntu Bionic):
status: New → In Progress
Changed in linux (Ubuntu Cosmic):
status: Confirmed → In Progress
Changed in linux (Ubuntu Bionic):
importance: Undecided → Medium
assignee: nobody → Joseph Salisbury (jsalisbury)
Changed in linux (Ubuntu Cosmic):
assignee: nobody → Joseph Salisbury (jsalisbury)
Joseph Salisbury (jsalisbury) wrote :

Commit 2dc0b46b5ea3 in v4.15-rc4 looks like it could be related. I built a Bionic test kernel with this commit reverted. The test kernel can be downloaded from:

http://kernel.ubuntu.com/~jsalisbury/lp1783906

Can you test this kernel and see if it resolves this bug?

Thanks in advance!

Note about installing test kernels:
• If the test kernel is prior to 4.15(Bionic) you need to install the linux-image and linux-image-extra .deb packages.
• If the test kernel is 4.15(Bionic) or newer, you need to install the linux-modules, linux-modules-extra and linux-image-unsigned .deb packages.

danieru (danigaritarojas) wrote :

Your test kernel with commit 2dc0b46b5ea3 reverted does indeed fix the issue with the WD HDD. Here's the dmesg with your test kernel: https://paste.ubuntu.com/p/745NscYFJh/

As you can see at the top, I was using: "[ 0.000000] Linux version 4.15.0-29-generic (root@kathleen) (gcc version 7.3.0 (Ubuntu 7.3.0-16ubuntu3)) #32~lp1783906Commit2dc0b46b5ea3Reverted SMP Mon Jul 30 14:47:35 (Ubuntu 4.15.0-29.32~lp1783906Commit2dc0b46b5ea3Reverted-generic 4.15.18)"

And at the end you can see the WD HDD working: "[ 57.091948] ata6.00: ATA-8: WDC WD5003AZEX-00K1GA0, 80.00A80, max UDMA/133"

This, however, did not fix the freeze when rebooting. That bug also seems to have been introduced in Linux 4.15, but I'll have to do more testing and report it as a separate bug. Any ideas on how to collect information (dmesg, logs) after the computer freezes, before rebooting, would be helpful, as I have absolutely no idea what causes the freeze. The only thing I know is that it also doesn't happen on Linux 4.14.

Hi David,

A kernel bug report was opened against Ubuntu [0].  This bug is a
regression introduced in v4.15-rc4.  The following commit was identified
as the cause of the regression:

        2dc0b46b5ea3 ("libata: sata_down_spd_limit should return if driver has not recorded sstatus speed")

I was hoping to get your feedback, since you are the patch author.  Do
you think gathering any additional data will help diagnose this issue,
or would it be best to submit a revert request?

Thanks,

Joe

[0] http://pad.lv/1783906

David Milburn (dmilburn) wrote :

Hi Joe,

Can we put some debug in sata_down_spd_limit() and see the values of spd_limit, sstatus, spd, and mask
right before the change that avoids forcing the mask?
Also, can we track the exact call path into sata_down_spd_limit()?

The intent of the patch was not to force the speed down before reading the
link speed from SStatus; it doesn't change mask. Thanks.

Joseph Salisbury (jsalisbury) wrote :

I built a test kernel with debug output as requested by upstream. Can you test this kernel and post your syslog or dmesg output?

The test kernel can be downloaded from:
http://kernel.ubuntu.com/~jsalisbury/lp1783906

This kernel should exhibit the bug, but will write to syslog with output like:
***-> Value of spd_limit: N
***-> Value of sstatus: N
***-> Value of spd: N
***-> Value of mask: N

I did a dump_stack in the function, so you should see a new stack trace as well.

danieru (danigaritarojas) wrote :

Here's dmesg with 4.15.0-30-generic #33~lp1783906DEBUG running for 10 mins:
https://paste.ubuntu.com/p/Qc9Kpb62ch/

David Milburn (dmilburn) wrote :

Ok, thanks, please let me look through the output.

David Milburn (dmilburn) wrote :

Hi Joe,

The original intent of the patch was to avoid forcing a 6Gbps drive down to 1.5Gbps after hotplug.

Noting,

SSTATUS = 275 = 0x113 = 0001 0001 0011

That corresponds to ACTIVE PM STATE | GEN1 SPEED | DEVICE DETECTED

spd = 1 (corresponds to 1.5Gbps)
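
The decoding above can be checked with a small userspace sketch. It assumes the SStatus layout from the SATA specification (DET in bits 3:0, SPD in bits 7:4, IPM in bits 11:8); `sstatus_fields`, `sstatus_decode` and `spd_to_rate` are hypothetical names used only for illustration.

```c
#include <assert.h>
#include <stdint.h>

/* SStatus fields, assuming the SATA specification layout. */
struct sstatus_fields {
    unsigned det;  /* bits 3:0  - device detection */
    unsigned spd;  /* bits 7:4  - negotiated speed generation */
    unsigned ipm;  /* bits 11:8 - interface power management state */
};

/* Split a raw SStatus value into its three 4-bit fields. */
static struct sstatus_fields sstatus_decode(uint32_t sstatus)
{
    struct sstatus_fields f;

    f.det = sstatus & 0xf;
    f.spd = (sstatus >> 4) & 0xf;
    f.ipm = (sstatus >> 8) & 0xf;
    return f;
}

/* Map an SPD field value to a link rate string. */
static const char *spd_to_rate(unsigned spd)
{
    assert(spd <= 0xf);         /* SPD is a 4-bit field */
    switch (spd) {
    case 1:  return "1.5 Gbps";
    case 2:  return "3.0 Gbps";
    case 3:  return "6.0 Gbps";
    default: return "unknown";  /* 0 means no speed negotiated */
    }
}
```

For the logged value 275 (0x113), `sstatus_decode` yields det = 3, spd = 1 and ipm = 1, which matches the device-detected, Gen1, active-state reading above.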

One question, the print for mask came after these 2 lines of code, right?

        /* unconditionally mask off the highest bit */
        bit = fls(mask) - 1;
        mask &= ~(1 << bit);

In your debug kernel, would you please remove the following 2 lines of code (so the code falls through)?

        if (spd > 1)
                mask &= (1 << (spd - 1)) - 1;
        else <=====
                return -EINVAL; <===== Remove these 2 lines of code.

And finally, at the end of __sata_set_spd_needed(), would you please print out these values?

spd
target
*scontrol

The original patch didn't force a change to mask, but it does "return -EINVAL". I think simply
letting the code fall through to the end of sata_down_spd_limit() may fix the problem, but it would
still help to see the original debug values and these new ones with the possible fix. Thank you.

Joseph Salisbury (jsalisbury) wrote :

Thanks for the response, David! Correct, the print for the mask came after those 2 lines of code. Here is the snippet:

        /* unconditionally mask off the highest bit */
        bit = fls(mask) - 1;
        mask &= ~(1 << bit);

        /* Debug added for lp1783906: */
        dump_stack();
        pr_info("***-> Function calling sata_down_spd_limit: %pf", __builtin_return_address(0));
        printk(KERN_DEBUG "***-> Value of spd_limit: %u\n", spd_limit);
        printk(KERN_DEBUG "***-> Value of sstatus: %u\n", sstatus);
        printk(KERN_DEBUG "***-> Value of spd: %u\n", spd);
        printk(KERN_DEBUG "***-> Value of mask: %u\n", mask);

        /*
         * Mask off all speeds higher than or equal to the current one. At
         * this point, if current SPD is not available and we previously
         * recorded the link speed from SStatus, the driver has already
         * masked off the highest bit so mask should already be 1 or 0.
         * Otherwise, we should not force 1.5Gbps on a link where we have
         * not previously recorded speed from SStatus. Just return in this
         * case.
         */
        if (spd > 1)
                mask &= (1 << (spd - 1)) - 1;

Joseph Salisbury (jsalisbury) wrote :

I'll build another test kernel with your suggestions and ask @danieru to test.

Joseph Salisbury (jsalisbury) wrote :

I built a second test kernel with additional debug output as requested by David. This kernel also has the two lines removed, as requested by David.

Can you test this kernel and post your syslog or dmesg output?

The test kernel can be downloaded from:
http://kernel.ubuntu.com/~jsalisbury/lp1783906

danieru (danigaritarojas) wrote :

Here's dmesg with #33~lp1783906DEBUGv2: https://paste.ubuntu.com/p/jK6zjKjtqc/
I expected this kernel to still have the bug and only print additional debug info, but it seems to fix the bug. The second drive (WDC) came up and I could mount its partitions and read data from them.

Just to make sure, I did a second run and everything seemed to still work fine, here's the dmesg of the second run: https://paste.ubuntu.com/p/Tj5gnTWbs3/

(I didn't test writes, now that I think about it.)

David Milburn (dmilburn) wrote :

Hi,

I think the fix is removing the "return" and letting the code fall through in sata_down_spd_limit().
Please give me some time to review the latest log, and I will need to reconfigure a couple of local
systems and re-test with that change. Thanks.

David Milburn (dmilburn) wrote :

Hi,

May I ask for one more test? Looking at the code some more, I don't think I can just remove
the return. The root of the problem is that the hard reset fails and sata_link_hardreset() is never able
to reconfigure the speed. This patch sets link->sata_spd_limit before returning; I have been
testing linux-4.19-rc1 successfully with a 6Gb drive on an AHCI platform.

Would you mind testing this patch with no debug?

If all goes well, I will submit upstream. Thanks.

tags: added: patch
Joseph Salisbury (jsalisbury) wrote :

I built a test kernel with the patch from David posted in comment #21. The test kernel can be downloaded from:
http://kernel.ubuntu.com/~jsalisbury/lp1783906

Can you test this kernel and see if it resolves this bug?

Note about installing test kernels:
• If the test kernel is prior to 4.15(Bionic) you need to install the linux-image and linux-image-extra .deb packages.
• If the test kernel is 4.15(Bionic) or newer, you need to install the linux-modules, linux-modules-extra and linux-image-unsigned .deb packages.

Thanks in advance!
