Ubuntu
linux package

Hard disk writes fail in 16.04 daily on nForce 430

Xenial (16.04)
Bug #1561830

Bug #1561830 reported by Stephen Worthington on 2016-03-25

This bug affects 1 person

Affects          Status       Importance  Assigned to  Milestone
linux (Ubuntu)   In Progress  High        Unassigned
Xenial           In Progress  High        Unassigned

Bug Description

I have an old PC I use for testing new operating systems. It has previously had Ubuntu 15.10 installed and working. The motherboard is an Asus M2NPV-VM, with Nvidia nForce 430 chipset and Nvidia GeForce 6150 GPU. I have installed an Nvidia GT220 card to use for more modern video support.

When I attempt to install Ubuntu 16.04 beta (daily xenial-desktop-amd64.iso file downloaded 24/03/2016 18:17), it starts to write to the hard disk (Samsung HD103UJ), and after a short time lots of disk write errors appear in kern.log. After the errors, the disk could not be read either: "fdisk -l /dev/sda" failed to read a sector, where it had worked before starting the install. Unplugging the SATA cable to the drive and plugging it in again made the drive work again (on /dev/sdc), but another attempt to install failed with the same write errors.

I noticed that the log had swap write errors also, so I rebooted the install DVD again, and this time did a "swapoff -a" command before attempting to install, but got the same errors again. So I found my Ubuntu 15.10 install DVD and tried a new install from that, which worked just fine.

On rebooting with my 16.04 daily DVD, I again did "swapoff -a" so that the DVD based system would run normally, then tried mounting the EXT4 system partition I had just installed using the 15.10 install DVD. That worked, so I tried dd commands to do test writes to that partition. The following commands worked:

dd if=/dev/zero of=/mnt/sda8/tmp/output bs=8k count=10k
dd if=/dev/zero of=/mnt/sda8/tmp/output bs=8k count=100k

but when I did this command:

dd if=/dev/zero of=/mnt/sda8/tmp/output bs=8k count=1000k

after a while errors started appearing in kern.log, just as with the attempts to install 16.04.

It appears that with sustained write activity, the errors will start and then the drive will become unusable until it is unplugged and plugged in again.
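To put sizes on the three dd runs above, here is a minimal sketch that prints each command with the total it writes. The helper names total_mib and dd_command are made up for illustration, and nothing is written to disk; the commands are only echoed.

```shell
#!/bin/sh
# Total size in MiB written by dd with bs=<bs_kib>k and count=<count_k>k.
# bs_kib KiB per block times (count_k * 1024) blocks is bs_kib * count_k MiB.
total_mib() {
    bs_kib=$1
    count_k=$2
    echo $((bs_kib * count_k))
}

# Print (do not run) the dd command for one test step.
# $1 is the output file, $2 is the count in units of 1k blocks.
dd_command() {
    echo "dd if=/dev/zero of=$1 bs=8k count=${2}k"
}

# The three steps from the report: 10k, 100k and 1000k blocks of 8 KiB.
for count_k in 10 100 1000; do
    printf '%s  # %s MiB\n' \
        "$(dd_command /mnt/sda8/tmp/output "$count_k")" \
        "$(total_mib 8 "$count_k")"
done
```

So the two runs that worked wrote 80 MiB and 800 MiB, and the run that failed was asked to write 8000 MiB (about 7.8 GiB).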

I have attached the kern.log and syslog files from the 15.10 install that worked, and the 16.04 install attempt that failed. The first error message appears to be this:

ata3: EH in SWNCQ mode,QC:qc_active 0x1FFF sactive 0x1FFF
ata3: SWNCQ:qc_active 0x1 defer_bits 0x1FFE last_issue_tag 0x0
dhfis 0x1 dmafis 0x0 sdbfis 0x0

which leads me to suspect a problem with the handling of the SATA controller's interrupts.
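For comparing the working and failing logs, a rough sketch like the following could pull out how often that first error line occurs and on which ATA port. The function names are hypothetical, and the match string is just the text of the message quoted above.

```shell
#!/bin/sh
# Count error-handler entries of the kind quoted above in a log file.
count_swncq_errors() {
    grep -c 'EH in SWNCQ mode' "$1"
}

# List the ATA ports that reported them, one per line, without duplicates.
swncq_ports() {
    sed -n 's/.*\(ata[0-9][0-9]*\): EH in SWNCQ mode.*/\1/p' "$1" | sort -u
}
```

Usage would be along the lines of "count_swncq_errors /var/log/kern.log" followed by "swncq_ports /var/log/kern.log", run against each of the attached logs in turn.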
---
ApportVersion: 2.20-0ubuntu3
Architecture: amd64
AudioDevicesInUse:
USER PID ACCESS COMMAND
/dev/snd/controlC0: ubuntu 2233 F.... pulseaudio
/dev/snd/controlC1: ubuntu 2233 F.... pulseaudio
CasperVersion: 1.368
DistroRelease: Ubuntu 16.04
IwConfig:
enp0s20 no wireless extensions.

lo no wireless extensions.

enp2s9 no wireless extensions.
LiveMediaBuild: Ubuntu 16.04 LTS "Xenial Xerus" - Beta amd64 (20160323)
Lsusb:
Bus 001 Device 001: ID 1d6b:0002 Linux Foundation 2.0 root hub
Bus 002 Device 002: ID 0458:0118 KYE Systems Corp. (Mouse Systems)
Bus 002 Device 001: ID 1d6b:0001 Linux Foundation 1.1 root hub
MachineType: System manufacturer System Product Name
Package: linux (not installed)
ProcEnviron:
TERM=xterm-256color
PATH=(custom, no user)
LANG=en_US.UTF-8
SHELL=/bin/bash
ProcFB: 0 nouveaufb
ProcKernelCmdLine: file=/cdrom/preseed/hostname.seed boot=casper initrd=/casper/initrd.lz quiet splash ---
ProcVersionSignature: Ubuntu 4.4.0-15.31-generic 4.4.6
PulseList: Error: command ['pacmd', 'list'] failed with exit code 1: No PulseAudio daemon running, or not running as session daemon.
RelatedPackageVersions:
linux-restricted-modules-4.4.0-15-generic N/A
linux-backports-modules-4.4.0-15-generic N/A
linux-firmware 1.157
RfKill:

Tags: xenial
Uname: Linux 4.4.0-15-generic x86_64
UpgradeStatus: No upgrade log present (probably fresh install)
UserGroups:

_MarkForUpload: True
dmi.bios.date: 08/07/2008
dmi.bios.vendor: Phoenix Technologies, LTD
dmi.bios.version: ASUS M2NPV-VM ACPI BIOS Revision 1401
dmi.board.name: M2NPV-VM
dmi.board.vendor: ASUSTek Computer INC.
dmi.board.version: 1.xx
dmi.chassis.asset.tag: 123456789000
dmi.chassis.type: 3
dmi.chassis.vendor: Chassis Manufacture
dmi.chassis.version: Chassis Version
dmi.modalias: dmi:bvnPhoenixTechnologies,LTD:bvrASUSM2NPV-VMACPIBIOSRevision1401:bd08/07/2008:svnSystemmanufacturer:pnSystemProductName:pvrSystemVersion:rvnASUSTekComputerINC.:rnM2NPV-VM:rvr1.xx:cvnChassisManufacture:ct3:cvrChassisVersion:
dmi.product.name: System Product Name
dmi.product.version: System Version
dmi.sys.vendor: System manufacturer

Stephen Worthington (stephen-jsw) wrote on 2016-03-25:

logs.tar.gz (184.5 KiB, application/x-tar)

Ubuntu Foundations Team Bug Bot (crichton) wrote on 2016-03-25:

Thank you for taking the time to report this bug and helping to make Ubuntu better. It seems that your bug report is not filed about a specific source package though, rather it is just filed against Ubuntu in general. It is important that bug reports be filed about source packages so that people interested in the package can find the bugs about it. You can find some hints about determining what package your bug might be about at https://wiki.ubuntu.com/Bugs/FindRightPackage. You might also ask for help in the #ubuntu-bugs irc channel on Freenode.

To change the source package that this bug is filed about visit https://bugs.launchpad.net/ubuntu/+bug/1561830/+editstatus and add the package name in the text box next to the word Package.

[This is an automated message. I apologize if it reached you inappropriately; please just reply to this message indicating so.]

tags:	added: bot-comment

Stephen Worthington (stephen-jsw) on 2016-03-25

affects:	ubuntu → linux (Ubuntu)

Brad Figg (brad-figg) wrote on 2016-03-25: Missing required logs.

This bug is missing log files that will aid in diagnosing the problem. From a terminal window please run:

apport-collect 1561830

and then change the status of the bug to 'Confirmed'.

If, due to the nature of the issue you have encountered, you are unable to run this command, please add a comment stating that fact and change the bug status to 'Confirmed'.

This change has been made by an automated script, maintained by the Ubuntu Kernel Team.

Changed in linux (Ubuntu):
status:	New → Incomplete

Stephen Worthington (stephen-jsw) wrote on 2016-03-25: AlsaInfo.txt

AlsaInfo.txt (48.1 KiB, text/plain)

apport information

tags:	added: apport-collected xenial
description:	updated

Stephen Worthington (stephen-jsw) wrote on 2016-03-25: CRDA.txt

CRDA.txt (392 bytes, text/plain)

apport information

Stephen Worthington (stephen-jsw) wrote on 2016-03-25: CurrentDmesg.txt

CurrentDmesg.txt (247.0 KiB, text/plain)

apport information

Stephen Worthington (stephen-jsw) wrote on 2016-03-25: JournalErrors.txt

JournalErrors.txt (75.3 KiB, text/plain)

apport information

Stephen Worthington (stephen-jsw) wrote on 2016-03-25: Lspci.txt

Lspci.txt (25.9 KiB, text/plain)

apport information

Stephen Worthington (stephen-jsw) wrote on 2016-03-25: ProcCpuinfo.txt

ProcCpuinfo.txt (1.6 KiB, text/plain)

apport information

Stephen Worthington (stephen-jsw) wrote on 2016-03-25: ProcInterrupts.txt

#10

ProcInterrupts.txt (1.9 KiB, text/plain)

apport information

Stephen Worthington (stephen-jsw) wrote on 2016-03-25: ProcModules.txt

#11

ProcModules.txt (4.4 KiB, text/plain)

apport information

Stephen Worthington (stephen-jsw) wrote on 2016-03-25: UdevDb.txt

#12

UdevDb.txt (135.8 KiB, text/plain)

apport information

Stephen Worthington (stephen-jsw) wrote on 2016-03-25: WifiSyslog.txt

#13

WifiSyslog.txt (710.5 KiB, text/plain)

apport information

Stephen Worthington (stephen-jsw) wrote on 2016-03-25:

#14

apport-collect got an error when running dpkg-query on the linux package, presumably due to this being a live DVD boot. Here is the output of "uname -a":

Linux ubuntu 4.4.0-15-generic #31-Ubuntu SMP Fri Mar 18 19:08:31 UTC 2016 x86_64 x86_64 x86_64 GNU/Linux

Changed in linux (Ubuntu):
status:	Incomplete → Confirmed

Joseph Salisbury (jsalisbury) wrote on 2016-03-28:

#15

Would it be possible for you to test the latest upstream kernel? Refer to https://wiki.ubuntu.com/KernelMainlineBuilds . Please test the latest v4.5 kernel[0].

If this bug is fixed in the mainline kernel, please add the following tag 'kernel-fixed-upstream'.

If the mainline kernel does not fix this bug, please add the tag: 'kernel-bug-exists-upstream'.

Once testing of the upstream kernel is complete, please mark this bug as "Confirmed".

Thanks in advance.

[0] http://kernel.ubuntu.com/~kernel-ppa/mainline/v4.5-wily/

Changed in linux (Ubuntu):
importance:	Undecided → High
tags:	added: kernel-da-key

Stephen Worthington (stephen-jsw) wrote on 2016-03-29:

#16

I would be happy to try a mainline kernel, but I am getting this problem when trying to install Ubuntu 16.04 daily from a live install DVD. So is there any way to get a mainline kernel into the live DVD, so I can run it from there? Alternatively, I can reinstall Ubuntu 15.10 and try it there, if that would work. I would also need to try a kernel that did have the problem with Ubuntu 15.10, to properly verify the 4.5 kernel fixed it, so what kernel would I need to try for that?

Joseph Salisbury (jsalisbury) wrote on 2016-03-29:

#17

I think the best way to test the mainline kernel would be to re-install 15.10 and then test it. That will also allow testing of prior 16.04 kernels, so we can perform a kernel bisect.

Stephen Worthington (stephen-jsw) wrote on 2016-04-01:

#18

After quite a bit of messing around, I found a way to test kernels on 16.04 beta properly. I installed my HighPoint Rocket 622A eSATA card and plugged the Samsung HD103UJ drive into it using a long eSATA to SATA cable. That allowed me to boot from my 16.04 daily DVD and do an install to the HD103UJ without any problems. I did an apt-get upgrade and got the latest kernel (4.4.0-16-generic), and also installed the mainline kernels 4.4.0-040400-generic and 4.5.0-040500-generic. Then I checked the other drive installed on the box (Seagate ST31000528AS, used for testing Windows 10) and found that its 200 GiB NTFS data partition was almost empty, so I resized it and created a tiny EXT2 partition for Grub2, a 50 GiB EXT4 partition for installing to, and a 10 GiB swap partition, all at the end of that drive. Then I rebooted to the 16.04 beta DVD and installed to the Seagate drive. Then I rebooted to the install on the Samsung drive and ran update-grub, to get the install on the Seagate drive bootable from the Grub on the Samsung drive. Then I booted using the Samsung drive and selected the new 16.04 beta install on the Seagate drive to boot. It did, to my surprise, as the Seagate drive was on the motherboard nForce 430 SATA controller. So I then mounted the Samsung drive from the booted Seagate install, and tried the test dd commands, and they also all worked with no errors.

So the first conclusion I have come to is that the bug seems to only be triggered by the Samsung HD103UJ drive when it is on a motherboard nForce 430 SATA port.  It does not happen when that drive is on the Rocket 622A's Marvell SATA port.  And the bug also does not happen when using the Seagate ST31000528AS drive on a motherboard nForce 430 SATA port.  It seems to require that particular drive on that particular SATA controller, and using the standard 16.04 beta kernels, for the bug to occur.

To prevent problems with the swapper using the swap partition on the Samsung HD103UJ drive, I edited fstab on both 16.04 installs to use the new swap partition on the Seagate ST31000528AS drive only.

The next test was to shut down and move the Samsung HD103UJ to its motherboard nForce 430 SATA port, then reboot using the Grub on that drive to run the install on the Seagate ST31000528AS drive.  Again, the boot worked, which I expected as there should be little or no writing to the Samsung drive during that boot process.  I mounted the Samsung drive 16.04 install partition from the Seagate install, and ran the test dd commands.  I was again surprised that they worked without errors - I would have expected that a boot of the 16.04 beta standard kernels from that drive would work the same as a boot of the 16.04 standard kernel from my install DVD, and would fail when writing to the Samsung HD103UJ drive when it is on the motherboard nForce 430 SATA port.

The next test was to reboot to the 16.04 beta partition on the Samsung HD103UJ drive.  As expected, that boot failed badly, and I had to use the PC's reset button to restart it, after which I rebooted to the 16.04 beta install on the Seagate ST31000528AS drive again and used that install to run fsck to repair the 16.04 beta install partition on the Samsung HD103UJ drive.   The fsck check showed two errors that needed fixing, where the number of blocks and number of inodes were both wrong.  Once fsck had fixed the partition, I mounted it and looked at the kern.log file from the bad boot.  It looked normal up to a certain point, after which it was corrupt - I think it had a block full of zeroes.  So it looks like as soon as the bug hits, no more successful log writes occur, which makes it difficult to debug.

I do have a serial port on this motherboard, so I looked to see if I could use that to get debug information during a bad boot, but it turned out that I do not have the necessary serial cross-over cable to plug the motherboard's serial port into any of my other PCs' serial ports.  Last time I needed a cross-over cable, I must have borrowed one from work, and unfortunately that is no longer possible.

The next test I ran was to boot the Samsung HD103UJ install on the nForce 430 port, but using Grub to select the mainline 4.4.0-040400-generic kernel. That also failed badly in exactly the same manner, so I rebooted and repaired the partition again, ready for the final test.

For the last test, I rebooted to the Samsung HD103UJ install on the nForce 430 port using the mainline 4.5.0-040500-generic kernel, and it booted without errors.

So it looks like whatever bug is causing this problem has already been fixed in the upstream 4.5.0 kernels.  However, if 16.04 is going to be released using 4.4.0 kernels, I hope the fix for this bug can be backported before 16.04 is released.  Are there any more tests I should do to help with this?  Is there any more information I can provide?

Stephen Worthington (stephen-jsw) wrote on 2016-04-04:

#19

kern.log (168.5 KiB, text/plain)

It looks like I spoke too soon - I have now had two failures with the mainline 4.5.0 kernel booted from the Samsung HD103UJ drive. The first of these booted normally, but died when I tried an apt-get update. The second failed during boot, as most have in the past. So it seems that the bug is a bit less likely to occur with the mainline 4.5.0 kernel, but it is still there. So I ran another test booting from the Seagate ST31000528AS drive using the mainline 4.5.0 kernel, then used the Ubuntu Disks tool to run a write benchmark test on the now unused swap partition (/dev/sda7) on the Samsung HD103UJ drive. After a few seconds, the bug occurred. See the attached kern.log file. The error messages are pretty much the same as usual for this bug.

Stephen Worthington (stephen-jsw) wrote on 2016-04-11:

#20

I have done more testing to narrow down where this bug first appeared in the kernels. I have been using my 16.04 install on the Seagate ST31000528AS drive, and installing earlier Ubuntu kernels on it. The Disks tool is being used to run a write benchmark on the swap partition on the Samsung HD103UJ drive. The Ubuntu 4.3.0-7.18 kernel is the last one that works. The 4.4.0-1.15 kernel has the bug and fails during the write benchmark. I also tried the latest kernel I could find, mainline 4.6.0-40600rc2, and that failed also.

Joseph Salisbury (jsalisbury) wrote on 2016-04-18:

#21

Can you see if the upstream 4.4-rc1 has the bug? It can be downloaded from:

http://kernel.ubuntu.com/~kernel-ppa/mainline/v4.4-rc1+cod1-wily/

Stephen Worthington (stephen-jsw) wrote on 2016-04-19:

#22

I had tried a few more mainline kernels earlier today, including 4.4-rc1, before I saw your post. The results:

  mainline-4.3.1-40301 OK
  mainline-4.3.6-40306 OK
  mainline-4.4.0-040400rc1 Fails

So it is looking like the problem was introduced in the transition from the last 4.3 kernel to the first 4.4 ones. Is it possible to do a git bisect on that somehow?
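For reference, a mainline bisect between the last good and first bad releases usually starts roughly as sketched below. This assumes a clone of the mainline tree that has the v4.3 and v4.4-rc1 tags. The commit count used for the estimate is a made-up round figure, not a measured one, and the function names are hypothetical. The commands are only printed, since each round means building the kernel, booting it and repeating the write benchmark by hand.

```shell
#!/bin/sh
# Rough number of test kernels needed to bisect n candidate commits:
# the search range is halved each round, so this is about log2(n).
bisect_steps() {
    n=$1
    steps=0
    span=1
    while [ "$span" -lt "$n" ]; do
        span=$((span * 2))
        steps=$((steps + 1))
    done
    echo "$steps"
}

# Print the opening commands of a bisect session rather than running them.
# $1 is the known-good revision, $2 the known-bad one.
bisect_plan() {
    good=$1
    bad=$2
    echo "git bisect start"
    echo "git bisect bad $bad"
    echo "git bisect good $good"
}

bisect_plan v4.3 v4.4-rc1
# 12000 is a hypothetical commit count, used only to show the scale.
echo "# roughly $(bisect_steps 12000) build-and-test rounds for 12000 commits"
```

After each test boot, "git bisect good" or "git bisect bad" records the result and checks out the next commit to try, until git names the first bad commit.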

Stephen Worthington (stephen-jsw) wrote on 2016-04-27:

#23

I have now finished bisecting the mainline kernel, and this is the commit that causes this problem:

https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/commit/?id=64d513ac31bd02a3c9b69ef04444f36c196f9a9d

commit 64d513ac31bd02a3c9b69ef04444f36c196f9a9d
Author: Christoph Hellwig <email address hidden>
Date: Thu Oct 8 09:28:04 2015 +0100

    scsi: use host wide tags by default

    This patch changes the !blk-mq path to the same defaults as the blk-mq
    I/O path by always enabling block tagging, and always using host wide
    tags. We've had blk-mq available for a few releases so bugs with
    this mode should have been ironed out, and this ensures we get better
    coverage of over tagging setup over different configs.

    Signed-off-by: Christoph Hellwig <email address hidden>
    Acked-by: Jens Axboe <email address hidden>
    Reviewed-by: Hannes Reinecke <email address hidden>
    Signed-off-by: James Bottomley <email address hidden>

Joseph Salisbury (jsalisbury) on 2016-05-04

Changed in linux (Ubuntu Xenial):
status:	New → Confirmed
importance:	Undecided → High
tags:	added: performing-bisect

Joseph Salisbury (jsalisbury) wrote on 2016-05-04:

#24

I built a Xenial test kernel with a revert of commit 64d513ac31. The test kernel can be downloaded from:

http://kernel.ubuntu.com/~jsalisbury/lp1561830/

Can you test this kernel and see if it resolves this bug?

Stephen Worthington (stephen-jsw) wrote on 2016-05-05:

#25

Yes, your test kernel fixes the bug. I ran a test before installing it, and the bug happened rapidly at the start of the benchmark test on the Samsung HD103UJ drive. Then I installed your reverted kernel, rebooted and ran six benchmark tests without any failures. Previously, it has always failed on the first or second benchmark test if it was ever going to fail.

I noticed that the benchmark results for the read speed of the drive are a little lower with the patch reverted, down from about 103 Mbytes/s to about 97 Mbytes/s. The write speeds seem unaffected at about 100 Mbytes/s. So it looks like whatever the reverted code is doing is producing a worthwhile read speed increase.
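As a quick check on the size of that read speed difference, a small sketch using the approximate figures above (the helper name percent_drop is made up for illustration):

```shell
#!/bin/sh
# Percentage drop going from a "before" figure to an "after" figure.
percent_drop() {
    awk -v before="$1" -v after="$2" \
        'BEGIN { printf "%.1f\n", (before - after) / before * 100 }'
}

# Read speed with the commit applied (103) versus reverted (97).
percent_drop 103 97
```

With those two numbers the drop works out to a little under 6 percent, though both are rough readings from the benchmark rather than precise measurements.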

Joseph Salisbury (jsalisbury) wrote on 2016-05-11:

#26

I pinged the upstream patch author regarding this regression. I'm just awaiting some feedback:

https://lkml.org/lkml/2016/5/6/290

Joseph Salisbury (jsalisbury) on 2016-05-16

Changed in linux (Ubuntu):
assignee:	nobody → Joseph Salisbury (jsalisbury)
Changed in linux (Ubuntu Xenial):
assignee:	nobody → Joseph Salisbury (jsalisbury)
Changed in linux (Ubuntu):
status:	Confirmed → In Progress
Changed in linux (Ubuntu Xenial):
status:	Confirmed → In Progress

Joseph Salisbury (jsalisbury) on 2019-01-09

Changed in linux (Ubuntu Xenial):
assignee:	Joseph Salisbury (jsalisbury) → nobody
Changed in linux (Ubuntu):
assignee:	Joseph Salisbury (jsalisbury) → nobody

Brad Figg (brad-figg) on 2019-07-24

tags:	added: cscc
