disk I/O race condition after update
| Affects | Status | Importance | Assigned to | Milestone |
|---|---|---|---|---|
| udev (Ubuntu) | | Undecided | Unassigned | |
Bug Description
About 8 times over this cycle, I have installed various versions of 12.04 server edition on a pathetic old test computer.
In addition to the regular testing, the purpose is to verify minimum system specifications.
The issue raised herein first appeared with the i386 server ISO of 2012.04.12.
The issue remains with an up to date system as of 2012.04.21.
The issue does NOT exist with a fresh install from the i386 server ISO of 2012.03.27, which was the most recent preceding ISO I had.
The issue: Under very intensive disk I/O, the system can lock up. Eventually (I think after about 30 seconds; I am rarely standing beside the computer when this occurs) the system does realize it is frozen and manages to resume. It appears as though the computer is waiting for some data from the disk, but the disk doesn't think it has anything to do, i.e. they are out of sync. The appropriate lines from kern.log will be attached.
The issue does not appear to be with the kernel itself, because it can be created by starting from the fresh install from the 2012.03.27 ISO and doing "apt-get update" and "apt-get upgrade" but not "apt-get dist-upgrade". I do not know which package introduced the issue, which is why I have not been able to run "ubuntu-bug <packagename>" for this report. I did list all packages before any updates and after, and will post both the difference file and my edited difference file, where I took my best guess at editing out the ones that I didn't think would contain the root cause.
Note also bug number 978384, which seems similar but not the same. Regardless, the test kernel page does have the version I would need to try.
For testing for this issue I use "sudo update-
This issue has been demonstrated with two older style ATA hard drives. Both drives have been health tested with disk test tools, with the system booted from a FreeDOS ISO.
The entire sequence (start from a fresh install from the 2012.03.27 ISO, test and show no issue, then upgrade, test and show the issue) has been repeated several times. This latest test included 8 times running "sudo update-
It is possible that my CPU is the problem, being below the minimum server edition specifications (200 MHz, whereas minimum spec is 300 MHz). However, the CPU is largely idle with these tests, as it mostly waits for disk I/O. (O.K., it also does have some pretty busy periods.)
Attachments will be added over the next hour.
doug@test-
Linux test-smy 3.2.0-23-
doug@test-
Linux version 3.2.0-23-
doug@test-
Description: Ubuntu 12.04 LTS
Release: 12.04
| tags: | added: precise |
| Doug Smythies (dsmythies) wrote : | #1 |
| Doug Smythies (dsmythies) wrote : | #2 |
| Doug Smythies (dsmythies) wrote : | #3 |
| Doug Smythies (dsmythies) wrote : | #4 |
I made a mistake with the version in my original posting above. The listed versions were as the system was running, and tested, at the time of writing this bug report. The actual conditions for most of the testing and attachments were with the original kernel from the 2012.03.27 ISO or:
Linux test-smy 3.2.0-20-
Linux version 3.2.0-20-
| Doug Smythies (dsmythies) wrote : | #5 |
Thank you for taking the time to report this bug and helping to make Ubuntu better. It seems that your bug report is not filed about a specific source package though, rather it is just filed against Ubuntu in general. It is important that bug reports be filed about source packages so that people interested in the package can find the bugs about it. You can find some hints about determining what package your bug might be about at https:/
To change the source package that this bug is filed about visit https:/
[This is an automated message. I apologize if it reached you inappropriately; please just reply to this message indicating so.]
| tags: | added: bot-comment |
| Doug Smythies (dsmythies) wrote : | #7 |
Believe me, I have been trying to figure out the proper package to relate this bug to. For two reasons: First, for proper filing of this bug; Second, so that I can investigate further myself.
I also can not get "apport-collect 986654" to work because the computer is a server only, and it seems to want to start some web browser.
I found bug 929545, which seems possibly related except that the dates for changes don't agree. I also see that ata_piix stuff doesn't change for the issue being present or not (see attachment).
| Doug Smythies (dsmythies) wrote : | #8 |
My best guess is that this should be package="linux"
I do see activity for ata_piix.c around the right time frame, but I can't find the dated change history for the file.
| affects: | ubuntu → linux (Ubuntu) |
| Changed in linux (Ubuntu): | |
| status: | New → Confirmed |
| Joseph Salisbury (jsalisbury) wrote : | #9 |
Would it be possible for you to test the latest upstream kernel? Refer to https:/
If this bug is fixed in the mainline kernel, please add the following tag 'kernel-
If the mainline kernel does not fix this bug, please add the tag: 'kernel-
If you are unable to test the mainline kernel, for example it will not boot, please add the tag: 'kernel-
Once testing of the upstream kernel is complete, please mark this bug as "Confirmed".
Thanks in advance.
http://
| Changed in linux (Ubuntu): | |
| importance: | Undecided → Medium |
| status: | Confirmed → Incomplete |
| tags: | added: needs-upstream-testing |
| Doug Smythies (dsmythies) wrote : | #10 |
In my original posting, I did not mention that I had also tested a bunch of mainline kernels, and always had this issue.
I tested 3.4RC4 this morning, and it has the issue of the bug posting. The edited kern.log file is attached.
Even if the issue did not appear with kernel 3.4RC4, I would be reluctant to call this issue solved. Why? Two reasons: First, the execution of what I am using as a test for this, "sudo update-
For reference:
doug@test-
Linux test-smy 3.4.0-030400rc4
doug@test-
Linux version 3.4.0-030400rc4
| Changed in linux (Ubuntu): | |
| status: | Incomplete → Confirmed |
| tags: |
added: kernel-bug-exists-upstream removed: needs-upstream-testing |
| Doug Smythies (dsmythies) wrote : | #11 |
As mentioned in the original bug report, everything works fine starting from a fresh installation of the 2012.03.27 ISO.
This includes with any kernel.
Once "sudo apt-get update" and "sudo apt-get upgrade" are done, then any and every kernel shows this issue.
Starting from a fresh install of the 2012.03.27 ISO, only 2 package changes were made via:
"sudo apt-get install e2fsprogs e2fslibs"
Then the problem of this bug report was present.
Before and after package list differences:
doug@test-
< ii e2fslibs 1.42-1ubuntu1 ext2/ext3/ext4 file system libraries
< ii e2fsprogs 1.42-1ubuntu1 ext2/ext3/ext4 file system utilities
---
> ii e2fslibs 1.42-1ubuntu2 ext2/ext3/ext4 file system libraries
> ii e2fsprogs 1.42-1ubuntu2 ext2/ext3/ext4 file system utilities
I got the source code for e2fsprogs and see a lot of files dated March 30. What I would like to do is compare with the source code for version 1.42-1ubuntu1, but I don't know how.
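The before-and-after package list comparison shown above can also be done programmatically. The following is only a minimal sketch, not the method actually used in this report: it assumes the two listings are saved `dpkg -l` outputs, and the function names `parse_dpkg_list` and `changed_packages` are hypothetical.

```python
def parse_dpkg_list(text):
    """Map package name -> version from `dpkg -l` style lines."""
    packages = {}
    for line in text.splitlines():
        fields = line.split()
        # Installed packages start with the status flag "ii";
        # the next two columns are the package name and its version.
        if len(fields) >= 3 and fields[0] == "ii":
            packages[fields[1]] = fields[2]
    return packages


def changed_packages(before_text, after_text):
    """Return {name: (old_version, new_version)} for packages that differ."""
    before = parse_dpkg_list(before_text)
    after = parse_dpkg_list(after_text)
    changes = {}
    for name in sorted(set(before) | set(after)):
        old, new = before.get(name), after.get(name)
        if old != new:
            # None means the package was absent from that listing.
            changes[name] = (old, new)
    return changes
```

Feeding it the listings saved before and after an upgrade gives one entry per package whose version changed, which narrows the set of suspects in the same way the manual diff does.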
| Doug Smythies (dsmythies) wrote : | #12 |
The error event of posting 11 above occurred during re-boot and not during the disk thrashing test as I originally thought.
I have not seen the error during re-boot before.
| Doug Smythies (dsmythies) wrote : | #13 |
This issue has been isolated down to the "udev" package.
Using "udev 175-0ubuntu6" the disk I/O is fine
Using "udev 175-0ubuntu9" the intensive disk bashing test fails.
The test has now been repeated a few times: starting from a fresh installation of the 2012.03.27 ISO, verifying things work properly, then making only the one package upgrade change, and then verifying that the disk test fails.
doug@test-
390c390
< ii udev 175-0ubuntu6 rule-based device node and kernel event manager
---
> ii udev 175-0ubuntu9 rule-based device node and kernel event manager
The next step will be to attempt to isolate the exact change within that package.
| affects: | linux (Ubuntu) → udev (Ubuntu) |
| Changed in udev (Ubuntu): | |
| importance: | Medium → Undecided |
| status: | Confirmed → New |
| Doug Smythies (dsmythies) wrote : | #14 |
The new udev, 175-0ubuntu9.1, does not fix this problem.
I have not been able to compile udev yet to be able to isolate things further. I have been using http://
| Doug Smythies (dsmythies) wrote : | #15 |
I wrote a program to help demonstrate the issue. It is now my preferred method of test.
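For readers who want to reproduce this kind of test, here is a minimal sketch of a disk stall detector. It is not the program referred to above (which is not reproduced in this thread); the function names, chunk sizes, and the 5 second threshold are all assumptions chosen for illustration.

```python
import os
import time


def timed_writes(path, chunks=64, chunk_size=1 << 20):
    """Write `chunks` blocks to `path`, syncing each one; return seconds per block."""
    block = b"\0" * chunk_size
    durations = []
    with open(path, "wb") as handle:
        for _ in range(chunks):
            start = time.monotonic()
            handle.write(block)
            handle.flush()
            os.fsync(handle.fileno())  # force the data out to the disk
            durations.append(time.monotonic() - start)
    return durations


def find_stalls(durations, threshold=5.0):
    """Return (index, seconds) for every write slower than `threshold` seconds."""
    return [(i, d) for i, d in enumerate(durations) if d > threshold]
```

Running `find_stalls(timed_writes("/some/test/file"))` on the disk under test would list any synced write that took longer than the threshold; a lock-up of roughly 30 seconds, as described in this report, would show up as one very slow entry.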
Finally, I have been able to compile and make the .debs that I was missing.
Now the issue has been isolated as having been introduced in udev 175-0ubuntu7.
It seems most likely to have been introduced with either revision 2760 or 2761.
Now I will try to figure out how to build 2760 and 2761 (I have 2759 because it = 175-0ubuntu6).
| tags: | removed: bot-comment kernel-bug-exists-upstream |
| Doug Smythies (dsmythies) wrote : | #16 |
Posting a few files that may or may not be of interest.
| Doug Smythies (dsmythies) wrote : | #17 |
| Doug Smythies (dsmythies) wrote : | #18 |
| Doug Smythies (dsmythies) wrote : | #19 |
| Doug Smythies (dsmythies) wrote : | #20 |
Basically this diff file shows stuff as expected.
These tests were all done with packages built that were basically 175-0ubuntu9 but with Launchpad revision number 2761 taken out and either 2759 or 2760 put back. (for whatever reason 9.1 was not available yet).
Summary: the one line change introduced with revision 2760 causes the occasional 30 second lock up on the computer. However, it remains unclear as to why.
| Doug Smythies (dsmythies) wrote : | #21 |
Issue persists with 12.10 and udev 175-0ubuntu13.
Workaround: If the CD-ROM drive is not being used, unplug it.
| Doug Smythies (dsmythies) wrote : | #22 |
The issue has been isolated to the single line change of Launchpad revision 2760 (while revision 2761 edits the same line again, the results are the same).
Specifically: in the file rules.d/
ACTION=="add", KERNEL=="sr*", ATTR{events_
works fine, but when it is changed to this (2760):
ACTION=="add", KERNEL=="sr*", ATTR{events_
or this (2761 and subsequent):
ACTION=="add", ATTR{removable}
the problem exists.
Why? The rule is for "sr" or "removable" devices. The hard disk is an "sd" device. Why does the change result in issues with the hard disk?
| Doug Smythies (dsmythies) wrote : | #23 |
Issue persists with 13.04 (development), which is not surprising since the udev version is the same.
doug@test-smy:/$ uname -a
Linux test-smy 3.7.0-2-generic #8-Ubuntu SMP Thu Nov 15 16:21:20 UTC 2012 i686 i686 i686 GNU/Linux
doug@test-smy:/$ cat /proc/version
Linux version 3.7.0-2-generic (buildd@aatxe) (gcc version 4.7.2 (Ubuntu/Linaro 4.7.2-5ubuntu7) ) #8-Ubuntu SMP Thu Nov 15 16:21:20 UTC 2012
doug@test-smy:/$ dpkg -l | grep udev
ii libudev0:i386 175-0ubuntu13 i386 udev library
ii udev 175-0ubuntu13 i386 rule-based device node and kernel event manager
| Doug Smythies (dsmythies) wrote : | #24 |
Issue persists with up to date 13.04 (development). Reverting the one line (see post #22) in /lib/udev/
doug@test-smy:~$ uname -a
Linux test-smy 3.8.0-0-generic #4-Ubuntu SMP Tue Jan 15 20:39:36 UTC 2013 i686 i686 i686 GNU/Linux
doug@test-smy:~$ dpkg -l | grep udev
ii libudev0:i386 175-0ubuntu17 i386 udev library
ii udev 175-0ubuntu17 i386 rule-based device node and kernel event manager
| Doug Smythies (dsmythies) wrote : | #25 |
From post #22 above: "Why? The rule is for "sr" or "removable" devices. The hard disk is an "sd" device. Why does the change result in issues with the hard disk?"
The reason is that the single udev rule change that "introduced" this issue actually just created a new, and more probable, way to demonstrate a pre-existing issue.
The motherboard has two IDE controllers: the primary uses interrupt 14 and the secondary uses interrupt 15. It turns out that those two interrupts do not coexist well, and perhaps never did. The single line udev rule change that started this whole saga also created a steady stream of interrupt 15s, even if the CD-ROM drive was not being used. The hard drive was on the primary IDE controller, using interrupt 14. The CD-ROM drive was on the secondary IDE controller, using interrupt 15. Before the single line udev rule change, there was never an interrupt 15 if the CD-ROM drive was not being used.
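One way to check for such a stream of interrupts is to sample /proc/interrupts twice and compare the counters for IRQ 14 and IRQ 15. The sketch below is an illustration only, not the procedure used in this investigation; the function names are hypothetical and the parsing assumes the usual layout of a label, a colon, one count per CPU, then the controller and driver names.

```python
def parse_interrupts(text):
    """Map IRQ label -> total count summed over all CPU columns."""
    totals = {}
    for line in text.splitlines():
        label, sep, rest = line.partition(":")
        if not sep:
            continue  # the header row listing the CPUs has no colon
        count = 0
        for token in rest.split():
            if not token.isdigit():
                break  # counts end where the controller/driver names begin
            count += int(token)
        totals[label.strip()] = count
    return totals


def irq_deltas(before_text, after_text, irqs=("14", "15")):
    """Return how much each listed IRQ counter grew between two samples."""
    before = parse_interrupts(before_text)
    after = parse_interrupts(after_text)
    return {irq: after.get(irq, 0) - before.get(irq, 0) for irq in irqs}
```

Reading /proc/interrupts, waiting several seconds with the CD-ROM drive idle, reading it again, and passing both texts to `irq_deltas` would show whether interrupt 15 keeps counting up while the drive is unused.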
So now the question becomes why do the two interrupts not work well together, particularly when one considers that they are so very basic to IDE and PATA systems?
| A. Eibach (andi3) wrote : | #26 |
Thanks a lot for your hard work in investigating this issue, much appreciated!
I've been fighting with this IRQ problem for almost 2 years now. Same thing as you reported, those random "soft resetting link" messages and drive resets out of the blue.
Unfortunately, no one wants to tackle this issue, which I believe is also a bug deeply rooted in the 3.x kernels.
The only way out _for me_ was just playing dice with the drives: swapping what is on the external PCI IDE/SATA controller, until there are no more errors. It's like human beings: some pairs just would not match. :)
It can, however, get quite time-consuming with lots of drives, and 4 hours of continuous "reswap-reboot" cycles are not rare at all. But once it works, it will keep working, so it pays off after all :)
BTW I'm sick and tired of hearing "your drive is faulty". NO. IT'S NOT! It will sometimes just work solely on the onboard SATA controller and not on _any_ SATA controller card plugged into the PCI port.
Besides, I am pretty sure that people even tossed out their innocent drives just because that kernel or udev bug (or feature?!) drove them insane.
P.S. bug 978384 either does not exist, was removed or you mistyped the bug number. Gives me a 404 here...
| Doug Smythies (dsmythies) wrote : | #27 |
I remain of the opinion that the root issue here is a subtle timing issue.
I spent a tremendous amount of time on this. I have thrashed two old PATA drives to death. Now, the most recent drive I recovered and am using does not have the issue (the drive is a little slower than the others I had tried in the past).
Yes, I must have mistyped that bug number reference, but now, a year and a half later and from what little I recall about it, I don't think it is relevant.
| Doug Smythies (dsmythies) wrote : | #28 |
I did not realize that I hadn't previously given a link to my related web notes:
http://
@andi3: you could get this bug report confirmed by clicking that this affects you also. (Although I see that you did not subscribe, and I don't currently have a way to test further myself.)
| Launchpad Janitor (janitor) wrote : | #29 |
Status changed to 'Confirmed' because the bug affects multiple users.
| Changed in udev (Ubuntu): | |
| status: | New → Confirmed |
| Magesh GV (magesh-gv) wrote : | #30 |
The issue is seen on Ubuntu 13.04 also, although in my case it is not a real physical machine but a VM running Ubuntu.


For these two lines in the edited kern.log.txt file:
Apr 20 23:49:33 test-smy kernel: [ 4662.839710] Clocksource tsc unstable (delta = 303994203 ns)
Apr 20 23:49:33 test-smy kernel: [ 4662.841953] Switching to clocksource pit
I do not think the clock was actually unstable. I think it got out of sync because of the freeze and then was interpreted as unstable. I have no real proof, though.
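For scale, the reported delta of 303994203 ns is about 0.3 seconds. A small sketch for pulling that value out of such kern.log lines follows; the pattern and function name are assumptions written to match the message text quoted above, and the result says nothing about the cause of the delta.

```python
import re

# Matches the kernel message quoted above and captures the delta in nanoseconds.
DELTA_PATTERN = re.compile(r"Clocksource tsc unstable \(delta = (-?\d+) ns\)")


def clocksource_delta_seconds(line):
    """Return the reported clocksource delta in seconds, or None if absent."""
    match = DELTA_PATTERN.search(line)
    if match is None:
        return None
    return int(match.group(1)) / 1e9
```

Applied to every line of a kern.log file, this would give the size of each reported delta, which could then be compared against the times at which the disk lock-ups occurred.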