disk I/O race condition after update

Bug #986654 reported by Doug Smythies
This bug affects 4 people
Affects: udev (Ubuntu)
Status: Confirmed
Importance: Undecided
Assigned to: Unassigned

Bug Description

About 8 times over this cycle, I have installed various versions of 12.04 server edition on a pathetic old test computer.
In addition to the regular testing, the purpose is to verify minimum system specifications.
The issue raised herein first appeared with the i386 server ISO of 2012.04.12.
The issue remains with an up to date system as of 2012.04.21.
The issue does NOT exist with a fresh install from the i386 server ISO of 2012.03.27, which is the most recent preceding ISO I had.

The issue: Under very intensive disk I/O, the system can lock up. Eventually (after about 30 seconds, I think; I am rarely standing beside the computer when this occurs) the system does realize it is frozen and manages to resume. It appears as though the computer is waiting for some data from the disk, but the disk doesn't think it has anything to do, i.e. they are out of sync. The appropriate lines from kern.log will be attached.

The issue does not appear to be with the kernel itself, because it can be created by starting from the fresh install from the 2012.03.27 ISO and doing "apt-get update" and "apt-get upgrade" but not "apt-get dist-upgrade". I do not know which package introduced the issue, which is why I have not been able to run "ubuntu-bug <packagename>" for this report. I did list all packages before and after the updates, and will post both the difference file and my edited difference file, where I took my best guess at editing out the ones that I didn't think would contain the root cause.

Note also bug number 978384, which seems similar but not the same. Regardless, the test kernel page does have the version I would need to try.

For testing for this issue I use "sudo update-apt-xapian-index --force", but I have seen the same issue a few times under other heavy disk usage conditions.

This issue has been demonstrated with two older style ATA hard drives. Both drives have been health tested with disk test tools and the system booted from a freedos ISO.

The entire sequence (start from a fresh install from the 2012.03.27 ISO, test and show no issue, then upgrade, test and show the issue) has been repeated several times. This latest test included 8 runs of "sudo update-apt-xapian-index --force" without any problem on a fresh installation, and 9 runs of "sudo update-apt-xapian-index --force" after only "apt-get update" and "apt-get upgrade" and re-booting, thus running the same kernel.
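
The repeated runs described above could be scripted roughly as follows. This is only a sketch: the function name `repeat_test` is made up for illustration, and it assumes the freeze is detected afterwards by reading kern.log rather than from the command's exit status.

```shell
# repeat_test N CMD [ARGS...]
# Run CMD N times in a row and print one line per run with its exit
# status and elapsed wall-clock seconds.
# (Hypothetical helper, not part of any package.)
repeat_test() {
    runs=$1
    shift
    i=1
    while [ "$i" -le "$runs" ]; do
        start=$(date +%s)
        "$@"
        status=$?
        end=$(date +%s)
        echo "run $i: exit=$status elapsed=$((end - start))s"
        i=$((i + 1))
    done
}

# Example, using the test command from this report:
#   repeat_test 8 sudo update-apt-xapian-index --force
```

Logging the elapsed time per run also makes it easier to spot a run that took noticeably longer than the others, which is worth checking against kern.log.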

It is possible that my CPU is the problem, being below the minimum server edition specifications (200 MHz, whereas the minimum spec is 300 MHz). However, the CPU is largely idle with these tests, as it mostly waits for disk I/O. (O.K., it also does have some pretty busy periods.)

Attachments will be added over the next hour.

doug@test-smy:~/source-temp$ uname -a
Linux test-smy 3.2.0-23-generic-pae #36-Ubuntu SMP Tue Apr 10 22:19:09 UTC 2012 i686 i686 i386 GNU/Linux
doug@test-smy:~/source-temp$ cat /proc/version
Linux version 3.2.0-23-generic-pae (buildd@palmer) (gcc version 4.6.3 (Ubuntu/Linaro 4.6.3-1ubuntu4) ) #36-Ubuntu SMP Tue Apr 10 22:19:09 UTC 2012
doug@test-smy:~/source-temp$ lsb_release -rd
Description: Ubuntu 12.04 LTS
Release: 12.04

Tags: precise
tags: added: precise
Doug Smythies (dsmythies) wrote :

For these two lines in the edited kern.log.txt file:

Apr 20 23:49:33 test-smy kernel: [ 4662.839710] Clocksource tsc unstable (delta = 303994203 ns)
Apr 20 23:49:33 test-smy kernel: [ 4662.841953] Switching to clocksource pit

I do not think the clock was actually unstable. I think it got out of sync because of the freeze and then was interpreted as unstable. I have no real proof, though.
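
To put the reported delta in perspective, the value can be pulled out of the log and converted to seconds. A small sketch: the function name `tsc_deltas` is made up, and the file name in the usage comment is an assumption about where the edited log is stored.

```shell
# tsc_deltas LOGFILE
# For every "Clocksource tsc unstable" line in LOGFILE, print the
# delta in nanoseconds and in seconds.
# (Hypothetical helper for reading kern.log excerpts.)
tsc_deltas() {
    sed -n 's/.*Clocksource tsc unstable (delta = \(-*[0-9][0-9]*\) ns).*/\1/p' "$1" |
    awk '{ printf "%s ns = %.3f s\n", $1, $1 / 1000000000 }'
}

# Example (file name is an assumption):
#   tsc_deltas kern.log.txt
```

For the line quoted above, the delta works out to roughly 0.3 seconds.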

Doug Smythies (dsmythies) wrote :
Doug Smythies (dsmythies) wrote :
Doug Smythies (dsmythies) wrote :

I made a mistake with the version in my original posting above. The listed versions were as the system was running, and tested, at the time of writing this bug report. The actual conditions for most of the testing and attachments were with the original kernel from the 2012.03.27 ISO, or:

Linux test-smy 3.2.0-20-generic-pae #32-Ubuntu SMP Thu Mar 22 02:43:40 UTC 2012 i686 i686 i386 GNU/Linux
Linux version 3.2.0-20-generic-pae (buildd@roseapple) (gcc version 4.6.3 (Ubuntu/Linaro 4.6.3-1ubuntu3) ) #32-Ubuntu SMP Thu Mar 22 02:43:40 UTC 2012

Doug Smythies (dsmythies) wrote :
Ubuntu Foundations Team Bug Bot (crichton) wrote :

Thank you for taking the time to report this bug and helping to make Ubuntu better. It seems that your bug report is not filed about a specific source package though, rather it is just filed against Ubuntu in general. It is important that bug reports be filed about source packages so that people interested in the package can find the bugs about it. You can find some hints about determining what package your bug might be about at https://wiki.ubuntu.com/Bugs/FindRightPackage. You might also ask for help in the #ubuntu-bugs irc channel on Freenode.

To change the source package that this bug is filed about visit https://bugs.launchpad.net/ubuntu/+bug/986654/+editstatus and add the package name in the text box next to the word Package.

[This is an automated message. I apologize if it reached you inappropriately; please just reply to this message indicating so.]

tags: added: bot-comment
Doug Smythies (dsmythies) wrote :

Believe me, I have been trying to figure out the proper package to relate this bug to. For two reasons: First, for proper filing of this bug; Second, so that I can investigate further myself.

I also can not get "apport-collect 986654" to work because the computer is a server only, and it seems to want to start some web browser.

I found bug 929545, which seems possibly related except that the dates for changes don't agree. I also see that ata_piix stuff doesn't change for the issue being present or not (see attachment).

Doug Smythies (dsmythies) wrote :

My best guess is that this should be package="linux".
I do see activity for ata_piix.c around the right time frame, but I can't find the dated change history for the file.

affects: ubuntu → linux (Ubuntu)
Brad Figg (brad-figg)
Changed in linux (Ubuntu):
status: New → Confirmed
Joseph Salisbury (jsalisbury) wrote :

Would it be possible for you to test the latest upstream kernel? Refer to https://wiki.ubuntu.com/KernelMainlineBuilds . Please test the latest v3.4 kernel [1] (not a kernel in the daily directory). Once you've tested the upstream kernel, please remove the 'needs-upstream-testing' tag (only that one tag; please leave the other tags). This can be done by clicking on the yellow pencil icon next to the tag located at the bottom of the bug description and deleting the 'needs-upstream-testing' text.

If this bug is fixed in the mainline kernel, please add the following tag 'kernel-fixed-upstream'.

If the mainline kernel does not fix this bug, please add the tag: 'kernel-bug-exists-upstream'.

If you are unable to test the mainline kernel, for example it will not boot, please add the tag: 'kernel-unable-to-test-upstream'.
Once testing of the upstream kernel is complete, please mark this bug as "Confirmed".

Thanks in advance.

[1] http://kernel.ubuntu.com/~kernel-ppa/mainline/v3.4-rc3-precise/

Changed in linux (Ubuntu):
importance: Undecided → Medium
status: Confirmed → Incomplete
tags: added: needs-upstream-testing
Doug Smythies (dsmythies) wrote :

In my original posting, I did not mention that I had also tested a bunch of mainline kernels, and always had this issue.
I tested 3.4RC4 this morning, and it has the issue of the bug posting. The edited kern.log file is attached.

Even if the issue did not appear with kernel 3.4RC4, I would be reluctant to call this issue solved. Why? Two reasons: First, the execution of what I am using as a test for this, "sudo update-apt-xapian-index --force", is very different between kernels 3.3.2 and 3.4RC4, with 3.4RC4 taking only an average of 23 minutes instead of the previous average of 45 minutes to an hour; Second, since this seems to be some race condition, how do we know that some subtle change is not merely hiding it, with the underlying root issue still there? (It could be argued that the root underlying issue was always there, and it was a subtle change that revealed it to begin with. O.K., but I doubt it, because I have been running Ubuntu server edition on this computer for several years, and have never seen this before.) It is for these reasons that I was trying to narrow down and find the actual code change that introduced the issue. Then the plan would be to test with and without the change to verify (i.e. a single variable test).

For reference:
doug@test-smy:~/temp-kernel$ uname -a
Linux test-smy 3.4.0-030400rc4-generic-pae #201204230908 SMP Mon Apr 23 13:23:25 UTC 2012 i686 i686 i386 GNU/Linux
doug@test-smy:~/temp-kernel$ cat /proc/version
Linux version 3.4.0-030400rc4-generic-pae (apw@gomeisa) (gcc version 4.4.3 (Ubuntu 4.4.3-4ubuntu5.1) ) #201204230908 SMP Mon Apr 23 13:23:25 UTC 2012

Changed in linux (Ubuntu):
status: Incomplete → Confirmed
tags: added: kernel-bug-exists-upstream
removed: needs-upstream-testing
Doug Smythies (dsmythies) wrote :

As mentioned in the original bug report, everything works fine starting from a fresh installation of the 2012.03.27 ISO.
This includes with any kernel.
Once "sudo apt-get update" and "sudo apt-get upgrade" are done, then any and every kernel shows this issue.

Starting from a fresh install of the 2012.03.27 ISO, only 2 package changes were made via:
"sudo apt-get install e2fsprogs e2fslibs"
Then the problem of this bug report was present.
Before and after package list differences:

doug@test-smy:~/temp-kernel$ diff -b pkg_list1 pkg_list2
< ii e2fslibs 1.42-1ubuntu1 ext2/ext3/ext4 file system libraries
< ii e2fsprogs 1.42-1ubuntu1 ext2/ext3/ext4 file system utilities
---
> ii e2fslibs 1.42-1ubuntu2 ext2/ext3/ext4 file system libraries
> ii e2fsprogs 1.42-1ubuntu2 ext2/ext3/ext4 file system utilities

I got the source code for e2fsprogs and see a lot of files dated March 30. What I would like to do is compare with the source code for version 1.42-1ubuntu1, but I don't know how.
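
One way the two versions could be compared, once both source trees are unpacked side by side, is a recursive diff. A sketch only: the function names and the directory names in the usage comment are placeholders, and whether the older source package can still be fetched from the archive is an assumption.

```shell
# changed_files OLD_DIR NEW_DIR
# List the files that differ between two unpacked source trees.
# -r: recurse, -q: only report that files differ,
# -N: treat a file missing on one side as empty.
changed_files() {
    diff -rqN "$1" "$2"
}

# full_diff OLD_DIR NEW_DIR
# Produce a unified diff of the two trees for closer reading.
full_diff() {
    diff -ruN "$1" "$2"
}

# Example (directory names are placeholders; each version would first
# have to be fetched into its own directory, e.g. with
# "apt-get source e2fsprogs=1.42-1ubuntu1", assuming that version is
# still published and deb-src entries are enabled):
#   changed_files old/e2fsprogs-1.42 new/e2fsprogs-1.42
#   full_diff old/e2fsprogs-1.42 new/e2fsprogs-1.42 > e2fsprogs.diff
```

Note that diff exits with status 1 when the trees differ, which is the expected result here and not an error.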

Doug Smythies (dsmythies) wrote :

The error event of posting 11 above occurred during re-boot and not during the disk thrashing test as I originally thought.
I have not seen the error during re-boot before.

Doug Smythies (dsmythies) wrote :

This issue has been isolated down to the "udev" package.

Using "udev 175-0ubuntu6" the disk I/O is fine
Using "udev 175-0ubuntu9" the intensive disk bashing test fails.

The test has now been repeated a few times: starting from a fresh installation of the 2012.03.27 ISO, verifying things work properly, then making only the one package upgrade change, and then verifying that the disk test fails.

doug@test-smy:~/temp-kernel$ diff b3_pkg.txt b3_udev.txt
390c390
< ii udev 175-0ubuntu6 rule-based device node and kernel event manager
---
> ii udev 175-0ubuntu9 rule-based device node and kernel event manager

The next step will be to attempt to isolate the exact change within that package.

affects: linux (Ubuntu) → udev (Ubuntu)
Changed in udev (Ubuntu):
importance: Medium → Undecided
status: Confirmed → New
Doug Smythies (dsmythies) wrote :

The new udev, 175-0ubuntu9.1, does not fix this problem.
I have not been able to compile udev yet to be able to isolate things further. I have been using http://developer.ubuntu.com/packaging/html/ as a reference.

Doug Smythies (dsmythies) wrote :

I wrote a program to help demonstrate the issue. It is now my preferred method of test.
Finally, I have been able to compile and make the .debs that I was missing.
Now the issue has been isolated as having been introduced in udev 175-0ubuntu7.
It seems most likely to have been introduced with either revision 2760 or 2761.
Now I will try to figure out how to build 2760 and 2761 (I have 2759 because it = 175-0ubuntu6).

tags: removed: bot-comment kernel-bug-exists-upstream
Doug Smythies (dsmythies) wrote :

Posting a few files that may or may not be of interest.

Doug Smythies (dsmythies) wrote :
Doug Smythies (dsmythies) wrote :
Doug Smythies (dsmythies) wrote :
Doug Smythies (dsmythies) wrote :

Basically this diff file shows stuff as expected.

These tests were all done with packages built that were basically 175-0ubuntu9 but with Launchpad revision number 2761 taken out and either 2759 or 2760 put back. (for whatever reason 9.1 was not available yet).

Summary: the one line change introduced with revision 2760 causes the occasional 30 second lock up on the computer. However, it remains unclear as to why.

Doug Smythies (dsmythies) wrote :

Issue persists with 12.10 and udev 175-0ubuntu13.

Workaround: If the CD-ROM drive is not being used, unplug it.

Doug Smythies (dsmythies) wrote :

The issue has been isolated to the single line change of launchpad revision 2760 (while revision 2761 edits the same line again, the results are the same).

 Specifically: in the file rules.d/60-persistent-storage.rules, this line:
 ACTION=="add", KERNEL=="sr*", ATTR{events_poll_msecs}=="0", ATTR{events_poll_msecs}="2000"
 works fine, but when it is changed to this (2760):
 ACTION=="add", KERNEL=="sr*", ATTR{events_poll_msecs}=="-1", ATTR{events_poll_msecs}="2000"
 or this (2761 and subsequent):
 ACTION=="add", ATTR{removable}=="1", ATTR{events_poll_msecs}=="-1", ATTR{events_poll_msecs}="2000"
 the problem exists.

 Why? The rule is for "sr" or "removable" devices. The hard disk is an "sd" device. Why does the change result in issues with the hard disk?
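
Which block devices the rule actually touched can be checked from sysfs, where block devices that support event polling expose an `events_poll_msecs` attribute. A minimal sketch: the function name `poll_report` is made up, and the sysfs root is a parameter only so the function can be tried against a copy of the tree.

```shell
# poll_report [SYSFS_ROOT]
# Print "<device> <events_poll_msecs>" for every block device under
# SYSFS_ROOT/block that has the attribute (default root: /sys).
# (Hypothetical helper, not part of udev.)
poll_report() {
    root=${1:-/sys}
    for f in "$root"/block/*/events_poll_msecs; do
        # Skip devices without the attribute (and an unmatched glob).
        [ -r "$f" ] || continue
        dev=$(basename "$(dirname "$f")")
        echo "$dev $(cat "$f")"
    done
}

# Example:
#   poll_report
```

A value of 2000 on a device after boot would suggest that the rule matched it, while -1 would mean the device is still on the system default.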

Doug Smythies (dsmythies) wrote :

Issue persists with 13.04 (development), which is not surprising since the udev version is the same.

doug@test-smy:/$ uname -a
Linux test-smy 3.7.0-2-generic #8-Ubuntu SMP Thu Nov 15 16:21:20 UTC 2012 i686 i686 i686 GNU/Linux
doug@test-smy:/$ cat /proc/version
Linux version 3.7.0-2-generic (buildd@aatxe) (gcc version 4.7.2 (Ubuntu/Linaro 4.7.2-5ubuntu7) ) #8-Ubuntu SMP Thu Nov 15 16:21:20 UTC 2012
doug@test-smy:/$ dpkg -l | grep udev
ii libudev0:i386 175-0ubuntu13 i386 udev library
ii udev 175-0ubuntu13 i386 rule-based device node and kernel event manager

Doug Smythies (dsmythies) wrote :

Issue persists with up to date 13.04 (development). Reverting the one line (see post #22) in /lib/udev/rules.d/60-persistent-storage.rules to the pre revision 2760 state fixes the issue. I went back and forth a few times. After editing the file, run "sudo update-initramfs -u" and then re-boot (that is so very much easier than what I was doing, re-compiling the whole kernel).
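
The one-line revert can be done by hand in an editor; for repeated back-and-forth testing it could also be scripted along these lines. A sketch, assuming the rule line in the installed file matches the text quoted in post #22 exactly; the function name `revert_poll_rule` is made up, and it writes to standard output instead of editing the file in place.

```shell
# revert_poll_rule RULES_FILE
# Print RULES_FILE with the events_poll_msecs rule put back to its
# pre-2760 form (KERNEL=="sr*" and a match on "0" instead of "-1").
# Both the 2760 and the 2761 wording of the rule are rewritten.
# (Hypothetical helper; all other lines pass through unchanged.)
revert_poll_rule() {
    sed 's/ACTION=="add", .*ATTR{events_poll_msecs}=="-1", ATTR{events_poll_msecs}="2000"/ACTION=="add", KERNEL=="sr*", ATTR{events_poll_msecs}=="0", ATTR{events_poll_msecs}="2000"/' "$1"
}

# Example (review the output before copying it over the original):
#   revert_poll_rule /lib/udev/rules.d/60-persistent-storage.rules > /tmp/60-persistent-storage.rules
```

After the reviewed file is copied back into place, the "sudo update-initramfs -u" and re-boot steps described above still apply.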

doug@test-smy:~$ uname -a
Linux test-smy 3.8.0-0-generic #4-Ubuntu SMP Tue Jan 15 20:39:36 UTC 2013 i686 i686 i686 GNU/Linux
doug@test-smy:~$ dpkg -l | grep udev
ii libudev0:i386 175-0ubuntu17 i386 udev library
ii udev 175-0ubuntu17 i386 rule-based device node and kernel event manager

Doug Smythies (dsmythies) wrote :

From post #22 above: "Why? The rule is for "sr" or "removable" devices. The hard disk is an "sd" device. Why does the change result in issues with the hard disk?"

The reason is that the single udev rule change that "introduced" this issue actually just created a new, and more probable, way to demonstrate a pre-existing issue.

The motherboard has two IDE controllers: the primary uses interrupt 14 and the secondary uses interrupt 15. It turns out that those two interrupts do not cohabit well, and perhaps never did. The single-line udev rule change that started this whole saga also created a steady stream of interrupt 15s, even if the CD-ROM drive was not being used. The hard drive was on the primary IDE controller, using interrupt 14. The CD-ROM drive was on the secondary IDE controller, using interrupt 15. Before the single-line udev rule change, there was never an interrupt 15 if the CD-ROM drive was not being used.
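
The steady stream of interrupt 15s can be observed by sampling the counters in /proc/interrupts. A minimal sketch: the function name `irq_total` is made up, and the file argument exists only so the function can be run against a saved copy.

```shell
# irq_total IRQ [FILE]
# Print the interrupt count for IRQ, summed over all CPU columns,
# read from FILE (default: /proc/interrupts).
# (Hypothetical helper for watching IRQ 14 and 15.)
irq_total() {
    awk -v irq="$1:" '
        $1 == irq {
            total = 0
            # Count columns are purely numeric; stop at the first
            # non-numeric field (the interrupt controller name).
            for (i = 2; i <= NF && $i ~ /^[0-9]+$/; i++)
                total += $i
            print total
        }' "${2:-/proc/interrupts}"
}

# Example: sample IRQ 15 twice, ten seconds apart, with the CD-ROM idle.
#   irq_total 15; sleep 10; irq_total 15
```

If the second number is larger than the first while the CD-ROM drive is idle, the polling described above is generating the interrupts.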

So now the question becomes why do the two interrupts not work well together, particularly when one considers that they are so very basic to IDE and PATA systems?

A. Eibach (andi3) wrote :

Thanks a lot for your hard work in investigating this issue, much appreciated!

I've been fighting with this IRQ problem for almost 2 years now. Same thing as you reported, those random "soft resetting link" messages and drive resets out of the blue.
Unfortunately, no one wants to tackle this issue, which I believe is also a bug deeply rooted in the 3.x kernels.

The only way out _for me_ was just playing dice with the drives: swapping what is on the external PCI IDE/SATA controller, until there are no more errors. It's like human beings: some pairs just would not match. :)
It can, however, get quite time-consuming with lots of drives, and 4 hours of continuous "reswap-reboot" cycles are not rare at all. But once it works, it will keep working, so it pays off after all :)

BTW I'm sick and tired of hearing "your drive is faulty". NO. IT'S NOT! It will sometimes just work solely on the onboard SATA controller and not on _any_ SATA controller card plugged into the PCI port.
Besides, I am pretty sure that people even tossed out their innocent drives just because that kernel or udev bug (or feature?!) drove them insane.

P.S. bug 978384 either does not exist, was removed or you mistyped the bug number. Gives me a 404 here...

Doug Smythies (dsmythies) wrote :

I remain of the opinion that the root issue here is a subtle timing issue.
I spent a tremendous amount of time on this. I have thrashed two old PATA drives to death. Now, the most recent drive I recovered and am using does not have the issue (the drive is a little slower than the others I had tried in the past).

Yes, I must have mistyped that bug number reference, but now, a year and a half later and from what little I recall about it, I don't think it is relevant.

Doug Smythies (dsmythies) wrote :

I did not realize that I hadn't previously given a link to my related web notes:
http://www.smythies.com/~doug/network/hd_race/index.html

@andi3: you could get this bug report confirmed by clicking that this affects you also (although I see that you did not subscribe, and I don't currently have a way to test further).

Launchpad Janitor (janitor) wrote :

Status changed to 'Confirmed' because the bug affects multiple users.

Changed in udev (Ubuntu):
status: New → Confirmed
Magesh GV (magesh-gv) wrote :

The issue is also seen on Ubuntu 13.04, although in my case it is not a real physical machine but a VM running Ubuntu.
