Hard disk writes fail in 16.04 daily on nForce 430

Bug #1561830 reported by Stephen Worthington
This bug affects 1 person
Affects         Status       Importance  Assigned to  Milestone
linux (Ubuntu)  In Progress  High        Unassigned
Xenial          In Progress  High        Unassigned

Bug Description

I have an old PC I use for testing new operating systems. It has previously had Ubuntu 15.10 installed and working. The motherboard is an Asus M2NPV-VM, with Nvidia nForce 430 chipset and Nvidia GeForce 6150 GPU. I have installed an Nvidia GT220 card to use for more modern video support.

When I attempt to install Ubuntu 16.04 beta (daily xenial-desktop-amd64.iso file downloaded 24/03/2016 18:17), the installer starts writing to the hard disk (Samsung HD103UJ) and, after a short time, kern.log fills with disk write errors. After the errors, the disk could no longer be read either: "fdisk -l /dev/sda" failed to read a sector, although it had worked before the install started. Unplugging the drive's SATA cable and plugging it back in made the drive work again (now as /dev/sdc), but another install attempt failed with the same write errors.

I noticed that the log had swap write errors also, so I rebooted the install DVD again, and this time did a "swapoff -a" command before attempting to install, but got the same errors again. So I found my Ubuntu 15.10 install DVD and tried a new install from that, which worked just fine.

On rebooting with my 16.04 daily DVD, I again did "swapoff -a" so that the DVD based system would run normally, then tried mounting the EXT4 system partition I had just installed using the 15.10 install DVD. That worked, so I tried dd commands to do test writes to that partition. The following commands worked:

  dd if=/dev/zero of=/mnt/sda8/tmp/output bs=8k count=10k
  dd if=/dev/zero of=/mnt/sda8/tmp/output bs=8k count=100k

but when I did this command:

  dd if=/dev/zero of=/mnt/sda8/tmp/output bs=8k count=1000k

after a while errors started appearing in kern.log, just as with the attempts to install 16.04.

It appears that with sustained write activity, the errors will start and then the drive will become unusable until it is unplugged and plugged in again.
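To put the three dd runs in proportion, here is a minimal sketch of how much data each one asks for. It assumes GNU dd, where the "k" suffix means 1024; dd_total_mib is a hypothetical helper name, not part of any tool.

```shell
# Hypothetical helper: total MiB requested by `dd bs=<B>k count=<N>k`.
# Assumes GNU dd, where the "k" suffix means 1024.
dd_total_mib() {
    bs_kib=$1     # block size in KiB (8 for bs=8k)
    count_k=$2    # block count in units of 1024 (10 for count=10k)
    # bs_kib KiB per block * count_k * 1024 blocks = bs_kib * count_k MiB
    echo $(( bs_kib * count_k ))
}

for count_k in 10 100 1000; do
    echo "bs=8k count=${count_k}k -> $(dd_total_mib 8 "$count_k") MiB"
done
```

So the two runs that passed wrote 80 MiB and 800 MiB, while the failing run was asked for 8000 MiB (about 7.8 GiB) and hit the errors partway through.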

I have attached the kern.log and syslog files from the 15.10 install that worked, and the 16.04 install attempt that failed. The first error message appears to be this:

ata3: EH in SWNCQ mode,QC:qc_active 0x1FFF sactive 0x1FFF
ata3: SWNCQ:qc_active 0x1 defer_bits 0x1FFE last_issue_tag 0x0
dhfis 0x1 dmafis 0x0 sdbfis 0x0

which leads me to suspect a problem with the handling of the SATA controller's interrupts.
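The hex fields in those messages are easier to read as lists of command tags. This is a rough sketch under the assumption that each set bit in qc_active and defer_bits stands for one queued command tag; mask_to_tags is a hypothetical helper written for illustration.

```shell
# Hypothetical helper: list the bit positions set in a hex mask.
# Assumption: each set bit in qc_active / defer_bits is one command tag.
mask_to_tags() {
    mask=$(( $1 ))
    tag=0
    tags=""
    while [ "$mask" -ne 0 ]; do
        if [ $(( mask & 1 )) -eq 1 ]; then
            tags="${tags:+$tags }$tag"
        fi
        mask=$(( mask >> 1 ))
        tag=$(( tag + 1 ))
    done
    echo "$tags"
}

mask_to_tags 0x1FFF   # qc_active / sactive on the first line
mask_to_tags 0x1      # qc_active on the SWNCQ line
mask_to_tags 0x1FFE   # defer_bits on the SWNCQ line
```

Read that way, 0x1FFF is tags 0 to 12 (13 commands outstanding), 0x1 is tag 0 alone, and 0x1FFE is tags 1 to 12 waiting behind it.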
---
ApportVersion: 2.20-0ubuntu3
Architecture: amd64
AudioDevicesInUse:
 USER PID ACCESS COMMAND
 /dev/snd/controlC0: ubuntu 2233 F.... pulseaudio
 /dev/snd/controlC1: ubuntu 2233 F.... pulseaudio
CasperVersion: 1.368
DistroRelease: Ubuntu 16.04
IwConfig:
 enp0s20 no wireless extensions.

 lo no wireless extensions.

 enp2s9 no wireless extensions.
LiveMediaBuild: Ubuntu 16.04 LTS "Xenial Xerus" - Beta amd64 (20160323)
Lsusb:
 Bus 001 Device 001: ID 1d6b:0002 Linux Foundation 2.0 root hub
 Bus 002 Device 002: ID 0458:0118 KYE Systems Corp. (Mouse Systems)
 Bus 002 Device 001: ID 1d6b:0001 Linux Foundation 1.1 root hub
MachineType: System manufacturer System Product Name
Package: linux (not installed)
ProcEnviron:
 TERM=xterm-256color
 PATH=(custom, no user)
 LANG=en_US.UTF-8
 SHELL=/bin/bash
ProcFB: 0 nouveaufb
ProcKernelCmdLine: file=/cdrom/preseed/hostname.seed boot=casper initrd=/casper/initrd.lz quiet splash ---
ProcVersionSignature: Ubuntu 4.4.0-15.31-generic 4.4.6
PulseList: Error: command ['pacmd', 'list'] failed with exit code 1: No PulseAudio daemon running, or not running as session daemon.
RelatedPackageVersions:
 linux-restricted-modules-4.4.0-15-generic N/A
 linux-backports-modules-4.4.0-15-generic N/A
 linux-firmware 1.157
RfKill:

Tags: xenial
Uname: Linux 4.4.0-15-generic x86_64
UpgradeStatus: No upgrade log present (probably fresh install)
UserGroups:

_MarkForUpload: True
dmi.bios.date: 08/07/2008
dmi.bios.vendor: Phoenix Technologies, LTD
dmi.bios.version: ASUS M2NPV-VM ACPI BIOS Revision 1401
dmi.board.name: M2NPV-VM
dmi.board.vendor: ASUSTek Computer INC.
dmi.board.version: 1.xx
dmi.chassis.asset.tag: 123456789000
dmi.chassis.type: 3
dmi.chassis.vendor: Chassis Manufacture
dmi.chassis.version: Chassis Version
dmi.modalias: dmi:bvnPhoenixTechnologies,LTD:bvrASUSM2NPV-VMACPIBIOSRevision1401:bd08/07/2008:svnSystemmanufacturer:pnSystemProductName:pvrSystemVersion:rvnASUSTekComputerINC.:rnM2NPV-VM:rvr1.xx:cvnChassisManufacture:ct3:cvrChassisVersion:
dmi.product.name: System Product Name
dmi.product.version: System Version
dmi.sys.vendor: System manufacturer

Stephen Worthington (stephen-jsw) wrote :

Ubuntu Foundations Team Bug Bot (crichton) wrote :

Thank you for taking the time to report this bug and helping to make Ubuntu better. It seems that your bug report is not filed about a specific source package though, rather it is just filed against Ubuntu in general. It is important that bug reports be filed about source packages so that people interested in the package can find the bugs about it. You can find some hints about determining what package your bug might be about at https://wiki.ubuntu.com/Bugs/FindRightPackage. You might also ask for help in the #ubuntu-bugs irc channel on Freenode.

To change the source package that this bug is filed about visit https://bugs.launchpad.net/ubuntu/+bug/1561830/+editstatus and add the package name in the text box next to the word Package.

[This is an automated message. I apologize if it reached you inappropriately; please just reply to this message indicating so.]

tags: added: bot-comment
affects: ubuntu → linux (Ubuntu)
Brad Figg (brad-figg) wrote : Missing required logs.

This bug is missing log files that will aid in diagnosing the problem. From a terminal window please run:

apport-collect 1561830

and then change the status of the bug to 'Confirmed'.

If, due to the nature of the issue you have encountered, you are unable to run this command, please add a comment stating that fact and change the bug status to 'Confirmed'.

This change has been made by an automated script, maintained by the Ubuntu Kernel Team.

Changed in linux (Ubuntu):
status: New → Incomplete
Stephen Worthington (stephen-jsw) wrote : AlsaInfo.txt

apport information

tags: added: apport-collected xenial
description: updated

Stephen Worthington (stephen-jsw) wrote : CRDA.txt

apport information

Stephen Worthington (stephen-jsw) wrote : CurrentDmesg.txt

apport information

Stephen Worthington (stephen-jsw) wrote : JournalErrors.txt

apport information

Stephen Worthington (stephen-jsw) wrote : Lspci.txt

apport information

Stephen Worthington (stephen-jsw) wrote : ProcCpuinfo.txt

apport information

Stephen Worthington (stephen-jsw) wrote : ProcInterrupts.txt

apport information

Stephen Worthington (stephen-jsw) wrote : ProcModules.txt

apport information

Stephen Worthington (stephen-jsw) wrote : UdevDb.txt

apport information

Stephen Worthington (stephen-jsw) wrote : WifiSyslog.txt

apport information

Stephen Worthington (stephen-jsw) wrote :

apport-collect got an error when running dpkg-query on the linux package, presumably due to this being a live DVD boot. Here is the output of "uname -a":

Linux ubuntu 4.4.0-15-generic #31-Ubuntu SMP Fri Mar 18 19:08:31 UTC 2016 x86_64 x86_64 x86_64 GNU/Linux

Changed in linux (Ubuntu):
status: Incomplete → Confirmed
Joseph Salisbury (jsalisbury) wrote :

Would it be possible for you to test the latest upstream kernel? Refer to https://wiki.ubuntu.com/KernelMainlineBuilds . Please test the latest v4.5 kernel[0].

If this bug is fixed in the mainline kernel, please add the following tag 'kernel-fixed-upstream'.

If the mainline kernel does not fix this bug, please add the tag: 'kernel-bug-exists-upstream'.

Once testing of the upstream kernel is complete, please mark this bug as "Confirmed".

Thanks in advance.

[0] http://kernel.ubuntu.com/~kernel-ppa/mainline/v4.5-wily/

Changed in linux (Ubuntu):
importance: Undecided → High
tags: added: kernel-da-key
Stephen Worthington (stephen-jsw) wrote :

I would be happy to try a mainline kernel, but I am getting this problem when trying to install Ubuntu 16.04 daily from a live install DVD. So is there any way to get a mainline kernel into the live DVD, so I can run it from there? Alternatively, I can reinstall Ubuntu 15.10 and try it there, if that would work. I would also need to try a kernel that did have the problem with Ubuntu 15.10, to properly verify the 4.5 kernel fixed it, so what kernel would I need to try for that?

Joseph Salisbury (jsalisbury) wrote :

I think the best way to test the mainline kernel would be to reinstall 15.10 and test it there. That will also allow testing of prior 16.04 kernels, so we can perform a kernel bisect.

Stephen Worthington (stephen-jsw) wrote :

After quite a bit of messing around, I found a way to test kernels on 16.04 beta properly. I installed my HighPoint Rocket 622A eSATA card and plugged the Samsung HD103UJ drive into it using a long eSATA to SATA cable. That allowed me to boot from my 16.04 daily DVD and install to the HD103UJ without any problems. I did an apt-get upgrade and got the latest kernel (4.4.0-16-generic), and also installed the mainline kernels 4.4.0-040400-generic and 4.5.0-040500-generic.

Then I checked the other drive installed in the box (Seagate ST31000528AS, used for testing Windows 10) and found that its 200 GiB NTFS data partition was almost empty, so I resized it and created a tiny EXT2 partition for Grub2, a 50 GiB EXT4 partition for installing to, and a 10 GiB swap partition, all at the end of that drive. I rebooted to the 16.04 beta DVD and installed to the Seagate drive, then rebooted to the install on the Samsung drive and ran update-grub so that the install on the Seagate drive could be booted from the Grub on the Samsung drive. Then I booted using the Samsung drive and selected the new 16.04 beta install on the Seagate drive. To my surprise it booted, even though the Seagate drive was on the motherboard nForce 430 SATA controller. I then mounted the Samsung drive from the booted Seagate install and tried the test dd commands, and they all worked with no errors.

So the first conclusion I have come to is that the bug seems to only be triggered by the Samsung HD103UJ drive when it is on a motherboard nForce 430 SATA port. It does not happen when that drive is on the Rocket 622A's Marvell SATA port. And the bug also does not happen when using the Seagate ST31000528AS drive on a motherboard nForce 430 SATA port. It seems to require that particular drive on that particular SATA controller, and using the standard 16.04 beta kernels, for the bug to occur.

To prevent problems with the swapper using the swap partition on the Samsung HD103UJ drive, I edited fstab on both 16.04 installs to use the new swap partition on the Seagate ST31000528AS drive only.

The next test was to shut down and move the Samsung HD103UJ to its motherboard nForce 430 SATA port, then reboot using the Grub on that drive to run the install on the Seagate ST31000528AS drive. Again, the boot worked, which I expected as there should be little or no writing to the Samsung drive during that boot process. I mounted the Samsung drive 16.04 install partition from the Seagate install, and ran the test dd commands. I was again surprised that they worked without errors - I would have expected that a boot of the 16.04 beta standard kernels from that drive would work the same as a boot of the 16.04 standard kernel from my install DVD, and would fail when writing to the Samsung HD103UJ drive when it is on the motherboard nForce 430 SATA port.

The next test was to reboot to the 16.04 beta partition on the Samsung HD103UJ drive. As expected, that boot failed badly, and I had to use the PC's reset button to restart it, after which I rebooted to the 16.04 beta install on the Seagate ST31000528AS drive again and used that install to run fsck to repair the 16.04 bet...


Stephen Worthington (stephen-jsw) wrote :

It looks like I spoke too soon - I have now had two failures with the mainline 4.5.0 kernel booted from the Samsung HD103UJ drive. The first of these booted normally, but died when I tried an apt-get update. The second failed during boot, as most have in the past. So it seems that the bug is a bit less likely to occur with the mainline 4.5.0 kernel, but it is still there. So I ran another test booting from the Seagate ST31000528AS drive using the mainline 4.5.0 kernel, then used the Ubuntu Disks tool to run a write benchmark test on the now unused swap partition (/dev/sda7) on the Samsung HD103UJ drive. After a few seconds, the bug occurred. See the attached kern.log file. The error messages are pretty much the same as usual for this bug.

Stephen Worthington (stephen-jsw) wrote :

I have done more testing to narrow down where this bug first appeared in the kernels. I have been using my 16.04 install on the Seagate ST31000528AS drive and installing earlier Ubuntu kernels on it, using the Disks tool to run a write benchmark on the swap partition on the Samsung HD103UJ drive. The Ubuntu 4.3.0-7.18 kernel is the last one that works. The 4.4.0-1.15 kernel has the bug and fails during the write benchmark. I also tried the latest kernel I could find, mainline 4.6.0-40600rc2, and that also failed.

Joseph Salisbury (jsalisbury) wrote :

Can you see if the upstream 4.4-rc1 has the bug? It can be downloaded from:

http://kernel.ubuntu.com/~kernel-ppa/mainline/v4.4-rc1+cod1-wily/

Stephen Worthington (stephen-jsw) wrote :

I had tried a few more mainline kernels earlier today, including 4.4-rc1, before I saw your post. The results:

  mainline-4.3.1-40301 OK
  mainline-4.3.6-40306 OK
  mainline-4.4.0-040400rc1 Fails

So it is looking like the problem was introduced in the transition from the last 4.3 kernel to the first 4.4 ones. Is it possible to do a git bisect on that somehow?

Stephen Worthington (stephen-jsw) wrote :

I have now finished bisecting the mainline kernel, and this is the commit that causes this problem:

https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/commit/?id=64d513ac31bd02a3c9b69ef04444f36c196f9a9d

commit 64d513ac31bd02a3c9b69ef04444f36c196f9a9d
Author: Christoph Hellwig <email address hidden>
Date: Thu Oct 8 09:28:04 2015 +0100

    scsi: use host wide tags by default

    This patch changes the !blk-mq path to the same defaults as the blk-mq
    I/O path by always enabling block tagging, and always using host wide
    tags. We've had blk-mq available for a few releases so bugs with
    this mode should have been ironed out, and this ensures we get better
    coverage of over tagging setup over different configs.

    Signed-off-by: Christoph Hellwig <email address hidden>
    Acked-by: Jens Axboe <email address hidden>
    Reviewed-by: Hannes Reinecke <email address hidden>
    Signed-off-by: James Bottomley <email address hidden>
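
For anyone wanting to repeat a bisect like this, the general shape is roughly as follows. This is only a sketch: the tags v4.3 and v4.4-rc1 are assumed endpoints based on the test results earlier in this report, and start_kernel_bisect and bisect_rounds are hypothetical helper names.

```shell
# Hypothetical helper: start a bisect between a known-good and a known-bad tag.
# Meant to be run inside a clone of the mainline kernel tree; not called here.
start_kernel_bisect() {
    good=$1   # e.g. v4.3
    bad=$2    # e.g. v4.4-rc1
    git bisect start &&
    git bisect bad "$bad" &&
    git bisect good "$good"
    # Then, for each commit git checks out: build, boot, run the write
    # benchmark, and report the result with `git bisect good` or
    # `git bisect bad` until git names the first bad commit.
}

# Rough number of build-and-test rounds for n candidate commits (about log2 n).
bisect_rounds() {
    n=$1
    rounds=0
    while [ "$n" -gt 1 ]; do
        n=$(( (n + 1) / 2 ))
        rounds=$(( rounds + 1 ))
    done
    echo "$rounds"
}

bisect_rounds 1000   # illustrative count only, not the real size of the range
```

The second helper is just a reminder that the number of kernels to build grows with the logarithm of the commit range, which is what makes a bisect across a whole merge window practical.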

Changed in linux (Ubuntu Xenial):
status: New → Confirmed
importance: Undecided → High
tags: added: performing-bisect
Joseph Salisbury (jsalisbury) wrote :

I built a Xenial test kernel with a revert of commit 64d513ac31. The test kernel can be downloaded from:

http://kernel.ubuntu.com/~jsalisbury/lp1561830/

Can you test this kernel and see if it resolves this bug?

Stephen Worthington (stephen-jsw) wrote :

Yes, your test kernel fixes the bug. I ran a test before installing it, and the bug happened rapidly at the start of the benchmark test on the Samsung HD103UJ drive. Then I installed your reverted kernel, rebooted and ran six benchmark tests without any failures. Previously, it has always failed on the first or second benchmark test if it was ever going to fail.

I noticed that the benchmark read speed of the drive is a little lower with the patch reverted, down from about 103 Mbytes/s to about 97 Mbytes/s. The write speed seems unaffected at about 100 Mbytes/s. So it looks like the reverted code produces a worthwhile read speed increase.
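
As a quick check on the size of that read speed difference, here is a small sketch using integer arithmetic only. The figures are the approximate benchmark readings above, and drop_tenths_percent is a hypothetical helper name.

```shell
# Hypothetical helper: drop from old to new, in tenths of a percent,
# using integer arithmetic only (the result is truncated, not rounded).
drop_tenths_percent() {
    old=$1
    new=$2
    echo $(( (old - new) * 1000 / old ))
}

t=$(drop_tenths_percent 103 97)
echo "read speed drop: $(( t / 10 )).$(( t % 10 ))%"
```

With the approximate readings of 103 and 97 this comes to a little under 6%, so the difference is small, and a single set of benchmark runs may not be enough to say how repeatable it is.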

Joseph Salisbury (jsalisbury) wrote :

I pinged the upstream patch author regarding this regression. I'm just awaiting some feedback:

https://lkml.org/lkml/2016/5/6/290

Changed in linux (Ubuntu):
assignee: nobody → Joseph Salisbury (jsalisbury)
Changed in linux (Ubuntu Xenial):
assignee: nobody → Joseph Salisbury (jsalisbury)
Changed in linux (Ubuntu):
status: Confirmed → In Progress
Changed in linux (Ubuntu Xenial):
status: Confirmed → In Progress
Changed in linux (Ubuntu Xenial):
assignee: Joseph Salisbury (jsalisbury) → nobody
Changed in linux (Ubuntu):
assignee: Joseph Salisbury (jsalisbury) → nobody
Brad Figg (brad-figg)
tags: added: cscc