Extremely high IOWait and hung processes after utopic->vivid upgrade with Iomega Zip drive (IDE)

Bug #1451277 reported by Sergio Callegari
This bug affects 1 person
Affects: linux (Ubuntu)
Status: Incomplete
Importance: Medium
Assigned to: Unassigned

Bug Description

On Kubuntu, 64-bit AMD Phenom II machine with 4 cores and old (GeForce 7025) Nvidia graphics.
Using LVM. The machine also has a 2-disk software RAID (mirroring).

The machine was working fine before the upgrade; now it is almost impossible to use. In top, I/O wait is always above 60%. A few minutes after boot, even when the machine is doing nothing, dmesg reports hung processes (typically vgs, but occasionally others). LVM-related commands (e.g. lvdisplay) almost always hang at the terminal. ps also shows some kernel workers almost constantly stuck in D state. In addition, the machine takes ages to shut down.
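
For anyone who wants to quantify the I/O wait figure outside top, the following is a minimal sketch, not part of the original report. It computes the iowait share between two samples of the aggregate "cpu" line of /proc/stat. The function names are hypothetical; the field order (user, nice, system, idle, iowait, irq, softirq, steal, guest, guest_nice) is the one documented in proc(5).

```python
import time


def parse_cpu_line(stat_text):
    """Return the numeric fields of the aggregate 'cpu' line of /proc/stat."""
    for line in stat_text.splitlines():
        fields = line.split()
        if fields and fields[0] == "cpu":
            return [int(value) for value in fields[1:]]
    raise ValueError("no aggregate 'cpu' line found")


def iowait_percent(before_text, after_text):
    """Percentage of CPU time spent in iowait between two /proc/stat samples."""
    before = parse_cpu_line(before_text)
    after = parse_cpu_line(after_text)
    deltas = [a - b for a, b in zip(after, before)]
    # Sum only the first eight fields: guest and guest_nice are already
    # accounted for in user and nice.
    total = sum(deltas[:8])
    if total <= 0:
        return 0.0
    return 100.0 * deltas[4] / total  # index 4 is the iowait field


def measure(interval=5.0, path="/proc/stat"):
    """Sample the file twice, `interval` seconds apart, and return iowait %."""
    with open(path) as f:
        first = f.read()
    time.sleep(interval)
    with open(path) as f:
        second = f.read()
    return iowait_percent(first, second)
```

Calling measure() on the affected machine should give a number comparable to the "wa" value shown by top, averaged over the sampling interval.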

The issue persists even if I disable X (switching off the X login manager, sddm) and check at the console.

Sorry that I cannot provide details with apport: the machine has been switched off and I will not be able to bring it up again for a few days.

The symptom I am observing may also be a sign of failing hardware, so please leave the bug as unconfirmed until I run some more tests or someone provides details on similar issues. In the meantime, I would like to see whether others are experiencing a similarly serious issue. Is there anything specific that has changed between utopic and vivid that I should look into?

Ubuntu Foundations Team Bug Bot (crichton) wrote :

Thank you for taking the time to report this bug and helping to make Ubuntu better. It seems that your bug report is not filed about a specific source package though, rather it is just filed against Ubuntu in general. It is important that bug reports be filed about source packages so that people interested in the package can find the bugs about it. You can find some hints about determining what package your bug might be about at https://wiki.ubuntu.com/Bugs/FindRightPackage. You might also ask for help in the #ubuntu-bugs irc channel on Freenode.

To change the source package that this bug is filed about visit https://bugs.launchpad.net/ubuntu/+bug/1451277/+editstatus and add the package name in the text box next to the word Package.

[This is an automated message. I apologize if it reached you inappropriately; please just reply to this message indicating so.]

tags: added: bot-comment
Sergio Callegari (callegar) wrote :

After more research, I can confirm the bug and I hope I have also determined the cause...

1) The bug also happens when booting from the vivid ISO image, so it does not depend on leftovers from the utopic->vivid upgrade.

2) The bug stops manifesting if I detach the Iomega Zip drive (IDE connection) that I have in this machine.

I understand that this piece of hardware is obsolete, but it was working just fine with utopic.

It is still not clear to me whether the bug is in the kernel itself or in something in the system that causes the drive to be polled too frequently.

Please propagate upstream.

affects: ubuntu → linux-meta (Ubuntu)
Brad Figg (brad-figg)
affects: linux-meta (Ubuntu) → linux (Ubuntu)
Brad Figg (brad-figg) wrote : Missing required logs.

This bug is missing log files that will aid in diagnosing the problem. From a terminal window please run:

apport-collect 1451277

and then change the status of the bug to 'Confirmed'.

If, due to the nature of the issue you have encountered, you are unable to run this command, please add a comment stating that fact and change the bug status to 'Confirmed'.

This change has been made by an automated script, maintained by the Ubuntu Kernel Team.

Changed in linux (Ubuntu):
status: New → Incomplete
tags: added: vivid
Changed in linux (Ubuntu):
status: Incomplete → New
Sergio Callegari (callegar) wrote : Re: Extremely high IOWait and hung processes after utopic->vivid upgrade

Brad, I cannot provide log files:

- If I boot with the zip disk, the machine goes to 60% iowait and processes dealing with I/O start hanging. I do not want to risk my data just to try running apport (which would likely fail anyway).
- If I boot without the zip disk, the log files cannot report anything because the machine works.

I have already spent one whole night diagnosing this issue down to the ATAPI zip drive.

So I have reset the bug to New. Please confirm that it is propagated upstream.

Brad Figg (brad-figg) wrote : Missing required logs.

This bug is missing log files that will aid in diagnosing the problem. From a terminal window please run:

apport-collect 1451277

and then change the status of the bug to 'Confirmed'.

If, due to the nature of the issue you have encountered, you are unable to run this command, please add a comment stating that fact and change the bug status to 'Confirmed'.

This change has been made by an automated script, maintained by the Ubuntu Kernel Team.

Changed in linux (Ubuntu):
status: New → Incomplete
Changed in linux (Ubuntu):
status: Incomplete → Invalid
status: Invalid → Confirmed
status: Confirmed → Incomplete
Sergio Callegari (callegar) wrote : Re: Extremely high IOWait and hung processes after utopic->vivid upgrade

apport-collect 1451277
ERROR: Could not import module, is a package upgrade in progress? Error: No module named PyQt5.QtCore

Please do not ask for things that are clearly beyond the possibilities of Kubuntu vivid after the Plasma 5 transition.

Changed in linux (Ubuntu):
status: Incomplete → Confirmed
Joseph Salisbury (jsalisbury) wrote :

Would it be possible for you to test the latest upstream kernel? Refer to https://wiki.ubuntu.com/KernelMainlineBuilds . Please test the latest v4.0 kernel[0].

If this bug is fixed in the mainline kernel, please add the following tag 'kernel-fixed-upstream'.

If the mainline kernel does not fix this bug, please add the tag: 'kernel-bug-exists-upstream'.

If you are unable to test the mainline kernel, for example it will not boot, please add the tag: 'kernel-unable-to-test-upstream'.
Once testing of the upstream kernel is complete, please mark this bug as "Confirmed".

Thanks in advance.

[0] http://kernel.ubuntu.com/~kernel-ppa/mainline/v4.1-rc2-vivid/

Changed in linux (Ubuntu):
importance: Undecided → Medium
status: Confirmed → Incomplete
Sergio Callegari (callegar) wrote :

After tons of testing, I strongly suspect that the issue was with a failing SATA cable, passing next to the IDE cable of the ZIP disk. Whenever detaching the ZIP, I was probably bending the SATA cable in such a way that made it work more reliably. Now the cable has failed altogether and, after replacement, the high IOWAIT seems to be gone, even with the ZIP disk attached.

Sorry for the noise

Sergio Callegari (callegar) wrote :

Please disregard my previous message.

The bug is definitely there.

In some cases it does not manifest immediately; in fact, it is quite common to see the IOWAIT close to zero at system startup. Then, after the system has been up for a while, the IOWAIT jumps for no apparent reason to about 50% and stays there.

When this happens, nothing is initially reported in dmesg/syslog. However, the system starts showing some issues.

1) Trying to mount the zip drive, namely

sudo mount /dev/sda /mnt/tmp

hangs forever. The process cannot be interrupted. Interestingly, this happens even if there is no disk in the drive, a situation that the kernel should see immediately, even before trying the mount action. When the hang occurs, the kernel starts complaining in dmesg about a hung process:

[11877.606063] INFO: task mount:14652 blocked for more than 120 seconds.
[11877.606077] Tainted: P C OE 3.19.0-18-generic #18-Ubuntu
[11877.606082] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[11877.606088] mount D ffff88006ad038f8 0 14652 14651 0x00000000
[11877.606099] ffff88006ad038f8 ffff880062ecebf0 0000000000014200 ffff88006ad03fd8
[11877.606108] 0000000000014200 ffff88011a818000 ffff880062ecebf0 ffff88011fcd4200
[11877.606115] ffff88006ad03a50 7fffffffffffffff ffff88006ad03a48 ffff880062ecebf0
[11877.606121] Call Trace:
[11877.606139] [<ffffffff817c4f99>] schedule+0x29/0x70
[11877.606149] [<ffffffff817c857c>] schedule_timeout+0x20c/0x280
[11877.606161] [<ffffffff8109ed1d>] ? ttwu_do_activate.constprop.94+0x5d/0x70
[11877.606169] [<ffffffff810a1c19>] ? try_to_wake_up+0x1e9/0x340
[11877.606178] [<ffffffff817c6954>] wait_for_completion+0xa4/0x170
[11877.606183] [<ffffffff810a1de0>] ? wake_up_state+0x20/0x20
[11877.606191] [<ffffffff8108ef1a>] flush_work+0xea/0x1c0
[11877.606200] [<ffffffff8108bb10>] ? destroy_worker+0xa0/0xa0
[11877.606206] [<ffffffff8108f0f8>] __cancel_work_timer+0x98/0x1b0
[11877.606214] [<ffffffff813949f1>] ? exact_lock+0x11/0x20
[11877.606223] [<ffffffff81509d72>] ? kobj_lookup+0x112/0x170
[11877.606230] [<ffffffff813939f0>] ? disk_map_sector_rcu+0x80/0x80
[11877.606237] [<ffffffff8108f243>] cancel_delayed_work_sync+0x13/0x20
[11877.606243] [<ffffffff81395991>] disk_block_events+0x81/0x90
[11877.606252] [<ffffffff8122d64b>] __blkdev_get+0x5b/0x490
[11877.606259] [<ffffffff8122dac1>] blkdev_get+0x41/0x390
[11877.606266] [<ffffffff8122de70>] ? blkdev_get_by_dev+0x60/0x60
[11877.606273] [<ffffffff8122decf>] blkdev_open+0x5f/0x90
[11877.606281] [<ffffffff811f0d82>] do_dentry_open+0x1d2/0x330
[11877.606288] [<ffffffff811f1049>] vfs_open+0x49/0x50
[11877.606296] [<ffffffff81201b47>] do_last+0x227/0x12c0
[11877.606305] [<ffffffff812041e8>] path_openat+0x88/0x610
[11877.606313] [<ffffffff8120598a>] do_filp_open+0x3a/0xb0
[11877.606320] [<ffffffff81212777>] ? __alloc_fd+0xa7/0x130
[11877.606328] [<ffffffff811f299a>] do_sys_open+0x12a/0x280
[11877.606334] [<ffffffff810963ef>] ? __put_cred+0x3f/0x60
[11877.606341] [<ffffffff811f1e70>] ? SyS_access+0x1c0/0x210
[11877.606348] [<ffffffff811f2b0e>] SyS_open+0x1e/0x20
[11877.606356] [<ffffffff817c990d>] system_call_fastpath+0x16/0...
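
Reports like the one above can be picked out of a long dmesg log mechanically. The following is a minimal sketch, not something that was run as part of this report, that extracts the task name, PID and timeout from every "blocked for more than N seconds" line. The function name and the regular expression are illustrative assumptions based only on the message format visible in the log above.

```python
import re

# Matches e.g. "INFO: task mount:14652 blocked for more than 120 seconds."
# The task name may itself contain colons (kworker names do), so the name
# group is greedy and the PID is the last ":<digits>" before " blocked".
HUNG_TASK_RE = re.compile(
    r"INFO: task (?P<comm>.+):(?P<pid>\d+) blocked for more than (?P<secs>\d+) seconds"
)


def find_hung_tasks(dmesg_text):
    """Return a list of (command, pid, seconds) for each hung-task report."""
    reports = []
    for line in dmesg_text.splitlines():
        match = HUNG_TASK_RE.search(line)
        if match:
            reports.append(
                (match.group("comm"), int(match.group("pid")), int(match.group("secs")))
            )
    return reports
```

Feeding it the saved output of dmesg gives one tuple per report, which makes it easy to see which commands (mount, vgs, kernel workers) are getting stuck and how often.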


Sergio Callegari (callegar) wrote :

3.16.7-031607-generic seems to have no issue.
3.19.0-18-generic has the issue.
4.0.4-040004-generic cannot be tested because I cannot use X with it (the nvidia proprietary driver 304 does not work with it, and KWin does not work on my hardware without that driver).

This is already good news, since staying with the utopic kernel seems to be a workable workaround.

In the future I'll try 3.17 and 3.18 if possible.

This will take some time, since the machine is only rarely used after the vivid upgrade, as Plasma 5 is too immature for regular use.

Sergio Callegari (callegar) wrote :

3.18.14-031814-generic seems to have the issue

summary: Extremely high IOWait and hung processes after utopic->vivid upgrade
+ with Iomega Zip drive (IDE)
Sergio Callegari (callegar) wrote :

Installed python-pyqt5, as reported in bug 1439784, to make apport-collect work.

apport-collect still crashes, though:

dpkg-query: no packages found matching linux
Traceback (most recent call last):
  File "/usr/share/apport/apport-kde", line 530, in <module>
    sys.exit(UserInterface.run_argv())
  File "/usr/lib/python2.7/dist-packages/apport/ui.py", line 652, in run_argv
    return self.run_update_report()
  File "/usr/lib/python2.7/dist-packages/apport/ui.py", line 568, in run_update_report
    response = self.ui_present_report_details(allowed_to_report)
  File "/usr/share/apport/apport-kde", line 367, in ui_present_report_details
    desktop_info)
  File "/usr/share/apport/apport-kde", line 184, in __init__
    self.ui.ui_update_view(self)
  File "/usr/share/apport/apport-kde", line 358, in ui_update_view
    QTreeWidgetItem(keyitem, [str(line)])
UnicodeEncodeError: 'ascii' codec can't encode character u'\xe9' in position 21: ordinal not in range(128)

Sergio Callegari (callegar) wrote :

Given that apport-collect fails, can the Incomplete status be removed anyway?

Sergio Callegari (callegar) wrote :

Bug present with 3.17.8 mainline

Sergio Callegari (callegar) wrote :

3.17.0 is impossible to test because, for some reason, it fails to boot to X with the nvidia proprietary driver, so it is hard to test in realistic conditions.

In any case, I would say that the issue has been introduced in the 3.16 -> 3.17 transition.

Can this be reported upstream?
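
The reasoning behind the 3.16 -> 3.17 conclusion can be sketched in a few lines of Python. This is only an illustration: the helper names are hypothetical, and the table contains just the mainline and Ubuntu kernel versions reported as tested so far in this thread (True meaning no IOWAIT problem was seen).

```python
def version_key(version):
    """Turn '3.17.8' into (3, 17, 8) so that versions sort numerically."""
    return tuple(int(part) for part in version.split("."))


def regression_window(results):
    """Return (last_good, first_bad) from a {version: is_good} mapping.

    Raises ValueError if a good kernel is newer than a bad one, because the
    results then do not describe a single good-to-bad transition.
    """
    good = sorted((v for v, ok in results.items() if ok), key=version_key)
    bad = sorted((v for v, ok in results.items() if not ok), key=version_key)
    if not good or not bad:
        raise ValueError("need at least one good and one bad kernel")
    if version_key(good[-1]) > version_key(bad[0]):
        raise ValueError("results are not monotonic")
    return good[-1], bad[0]


# Results as reported in this thread (True = no IOWAIT problem seen).
tested = {
    "3.16.7": True,
    "3.17.8": False,
    "3.18.14": False,
    "3.19.0": False,
}
print(regression_window(tested))  # ('3.16.7', '3.17.8')
```

The window between the last good and the first bad version is what a subsequent git bisect would have to cover.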

Sergio Callegari (callegar) wrote :

I would also raise the priority, since the issue

- often makes it impossible to shut down the machine cleanly, requiring a hard reset that may result in data loss;
- can, I imagine, make it quite hard to install security updates from Ubuntu, as any update of the kernel package tends to hang the system at the make initramfs phase, which probably queries all system drives.

Sergio Callegari (callegar) wrote :

Also

hdparm -i /dev/sda

hangs when the iowait goes high.

Note that the issue happens with the drive unmounted.

Sergio Callegari (callegar) wrote :

I have tried to look into the issue, but I am encountering an extremely weird situation, so I hope that someone might be able to help.

The latest kernel that works from the Ubuntu mainline PPA repo is 3.16.7-031607_3.16.7-031607.201410301735
Kernel 3.17.0-031700_3.17.0-031700.201410060605 already shows the issue.

However, if I compile 3.17 myself from Linus's tree, following the instructions in https://wiki.ubuntu.com/KernelTeam/GitKernelBuild

I get:

- a kernel that is clearly completely different in configuration from the one in the Ubuntu mainline PPA (different package names, much larger kernel image package) and, most importantly,
- a kernel that apparently does not show the IOWAIT issue.

At the same time, I have recently been experiencing all sorts of issues with nvidia 304 and KDE Plasma 5, causing continuous hangs in plasmashell even with compositing off, so it is quite difficult for me to test properly.

Sergio Callegari (callegar) wrote :

Please ignore my previous message. I did not wait long enough. The bug is present in 3.17 upstream.

Sergio Callegari (callegar) wrote :

Bisected.

The bug appears after commit 045065d8a300a37218c548e9aa7becd581c6a0e8 ("[SCSI] fix qemu boot hang problem").

Actually, the problem is not with that commit, which is perfectly fine. That commit merely allows the real bug to be triggered.

The real bug is with the SCSI host lock and is likely to be triggered by mixing a slow and a fast device on the same IDE/SATA channel.

The bug is also biting other distros. See https://bbs.archlinux.org/viewtopic.php?id=189324

The bug is known to the kernel developers.

https://lkml.org/lkml/2015/8/16/44

https://bugzilla.kernel.org/show_bug.cgi?id=87581

The bug is fixed by the following patch by Christoph Hellwig, which unfortunately has not yet been applied upstream:

https://lkml.org/lkml/2014/11/20/581

Changed in linux (Ubuntu):
status: Incomplete → Confirmed
Sergio Callegari (callegar) wrote :

From the LKML conversation, the fix https://lkml.org/lkml/2014/11/20/581 is not likely to be applied upstream, because it probably just papers over the real bug.

The real issue appears with the patch series "scsi: convert host_busy to atomic_t", which causes regressions on some hardware configurations. It might not be due to this series either, but to some other race that this change helps trigger.

I wonder whether papering over it would be acceptable downstream.

penalvch (penalvch) wrote :

Sergio Callegari, please boot into a live environment via http://cdimage.ubuntu.com/daily-live/current/ and execute the following command only once, as it will automatically gather debugging information, in a terminal:
apport-collect 1451277

Changed in linux (Ubuntu):
status: Confirmed → Incomplete
Sergio Callegari (callegar) wrote :

Hi, I forgot to report:

the Wily kernel does not have the issue. Only the Vivid kernel has it.

As a matter of fact, according to a conversation I had with one of the kernel developers, the issue may be due to a race. More recent kernels do not seem to trigger the problem.

I cannot test the most recent Ubuntu kernel in the live image right now because the machine where I have the issue is currently packed in a box ahead of a relocation. But I'll test as soon as it is up again.
