[2.6.27-2.3] (sometimes temporary ?!) system deadlock with io_schedule

Bug #262843 reported by Kees Cook
294
This bug affects 5 people
Affects Status Importance Assigned to Milestone
linux (Ubuntu)
Incomplete
High
Unassigned

Bug Description

System started to have IO lock-ups until I maximized (?!) one of the deadlocked processes, and then all blocked processes continued. Most of the time, though, it just slowly hangs more and more of the system until I have a full hardlock.

Revision history for this message
Kees Cook (kees) wrote :
Download full text (13.1 KiB)

[23390.392042] INFO: task firefox:31170 blocked for more than 120 seconds.
[23390.392047] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[23390.392049] firefox D 0000000100578bba 0 31170 1
[23390.392055] ffff88018008fb98 0000000000000086 ffff880203da21b8 00008000048a2780
[23390.392059] ffffffff807c4400 ffffffff807c4400 ffffffff807c4400 ffffffff807c4400
[23390.392062] ffffffff807c4400 ffffffff807c4400 ffffffff807c4400 ffffffff807c4400
[23390.392064] Call Trace:
[23390.392070] [<ffffffff80519767>] io_schedule+0x37/0x50
[23390.392073] [<ffffffff802adbed>] sync_page+0x3d/0x70
[23390.392075] [<ffffffff802adc2e>] sync_page_killable+0xe/0x50
[23390.392077] [<ffffffff80519c4a>] __wait_on_bit_lock+0x5a/0x90
[23390.392079] [<ffffffff802adc20>] ? sync_page_killable+0x0/0x50
[23390.392081] [<ffffffff802ad779>] __lock_page_killable+0x69/0x70
[23390.392084] [<ffffffff80266bc0>] ? wake_bit_function+0x0/0x50
[23390.392086] [<ffffffff802ad613>] ? find_get_page+0x23/0xb0
[23390.392089] [<ffffffff802af5c5>] do_generic_file_read+0x315/0x4a0
[23390.392091] [<ffffffff802ad140>] ? file_read_actor+0x0/0x160
[23390.392093] [<ffffffff802af821>] generic_file_aio_read+0xd1/0x1b0
[23390.392111] [<ffffffffa0556371>] nfs_file_read+0x101/0x140 [nfs]
[23390.392114] [<ffffffff802eac79>] do_sync_read+0xf9/0x140
[23390.392125] [<ffffffffa055945a>] ? nfs_getattr+0x8a/0x140 [nfs]
[23390.392127] [<ffffffff80266b80>] ? autoremove_wake_function+0x0/0x40
[23390.392130] [<ffffffff80387ee8>] ? apparmor_file_permission+0x28/0x30
[23390.392133] [<ffffffff80362aa6>] ? security_file_permission+0x16/0x20
[23390.392135] [<ffffffff802eb458>] vfs_read+0xc8/0x130
[23390.392137] [<ffffffff802eb5c5>] sys_read+0x55/0x90
[23390.392141] [<ffffffff8021288a>] system_call_fastpath+0x16/0x1b
[23390.392142]
[23523.536041] INFO: task firefox:31170 blocked for more than 120 seconds.
[23523.536046] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[23523.536049] firefox D 0000000100578bba 0 31170 1
[23523.536052] ffff88018008fb98 0000000000000086 ffff880203da21b8 00008000048a2780
[23523.536055] ffffffff807c4400 ffffffff807c4400 ffffffff807c4400 ffffffff807c4400
[23523.536058] ffffffff807c4400 ffffffff807c4400 ffffffff807c4400 ffffffff807c4400
[23523.536061] Call Trace:
[23523.536067] [<ffffffff80519767>] io_schedule+0x37/0x50
[23523.536070] [<ffffffff802adbed>] sync_page+0x3d/0x70
[23523.536072] [<ffffffff802adc2e>] sync_page_killable+0xe/0x50
[23523.536074] [<ffffffff80519c4a>] __wait_on_bit_lock+0x5a/0x90
[23523.536077] [<ffffffff802adc20>] ? sync_page_killable+0x0/0x50
[23523.536079] [<ffffffff802ad779>] __lock_page_killable+0x69/0x70
[23523.536081] [<ffffffff80266bc0>] ? wake_bit_function+0x0/0x50
[23523.536083] [<ffffffff802ad613>] ? find_get_page+0x23/0xb0
[23523.536086] [<ffffffff802af5c5>] do_generic_file_read+0x315/0x4a0
[23523.536088] [<ffffffff802ad140>] ? file_read_actor+0x0/0x160
[23523.536090] [<ffffffff802af821>] generic_file_aio_read+0xd1/0x1b0
[23523.536109] [<ffffffffa0556371>] nfs_file_read+0x101/0x140 [nfs]
[23523.536112] [<ffffffff802eac79>] do_sync_read+0xf9/0x140
[2...

Revision history for this message
Kees Cook (kees) wrote :

To clarify -- I actually unminimized my running liferea (which was not blocked, and which does not use NFS), and that seemed to suddenly unblock all the D tasks.

Changed in linux:
assignee: nobody → ubuntu-kernel-team
importance: Undecided → High
status: New → Triaged
Revision history for this message
Michael Bienia (geser) wrote :
Download full text (21.2 KiB)

I see this too on my AMD64 running intrepid.

When it happens I notice it because my keyboard doesn't react to any key presses anymore (using the mouse still works).
After some time everything is back to normal.

[14384.920039] INFO: task postgres:5413 blocked for more than 120 seconds.
[14384.920054] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[14384.920059] postgres D ffff88002800fb98 0 5413 5004
[14384.920074] ffff88011ccfda58 0000000000000086 ffff88011ccfd9e8 ffffffff8039184d
[14384.920096] ffffffff807c4400 ffffffff807c4400 ffffffff807c4400 ffffffff807c4400
[14384.920116] ffffffff807c4400 ffffffff807c4400 ffffffff807c4400 ffffffff807c4400
[14384.920133] Call Trace:
[14384.920157] [<ffffffff8039184d>] ? blk_unplug+0x3d/0x70
[14384.920172] [<ffffffff80519767>] io_schedule+0x37/0x50
[14384.920182] [<ffffffff803151d5>] sync_buffer+0x45/0x50
[14384.920190] [<ffffffff80519c4a>] __wait_on_bit_lock+0x5a/0x90
[14384.920197] [<ffffffff80315190>] ? sync_buffer+0x0/0x50
[14384.920205] [<ffffffff80315190>] ? sync_buffer+0x0/0x50
[14384.920212] [<ffffffff80519cf9>] out_of_line_wait_on_bit_lock+0x79/0x90
[14384.920222] [<ffffffff80266bc0>] ? wake_bit_function+0x0/0x50
[14384.920231] [<ffffffff803152b6>] __lock_buffer+0x36/0x40
[14384.920270] [<ffffffffa023f079>] lock_buffer+0x29/0x30 [jbd]
[14384.920286] [<ffffffffa02408cb>] journal_invalidatepage+0x8b/0xf0 [jbd]
[14384.920313] [<ffffffffa02559f1>] ext3_invalidatepage+0x41/0x50 [ext3]
[14384.920323] [<ffffffff802b8e65>] do_invalidatepage+0x25/0x30
[14384.920330] [<ffffffff802b906a>] truncate_complete_page+0x6a/0x80
[14384.920339] [<ffffffff802b921e>] truncate_inode_pages_range+0x16e/0x400
[14384.920362] [<ffffffffa02627c4>] ? __ext3_journal_stop+0x34/0x60 [ext3]
[14384.920372] [<ffffffff80300241>] ? switch_names+0xc1/0xd0
[14384.920381] [<ffffffff802b94c5>] truncate_inode_pages+0x15/0x20
[14384.920401] [<ffffffffa025973d>] ext3_delete_inode+0x1d/0x110 [ext3]
[14384.920419] [<ffffffffa0259720>] ? ext3_delete_inode+0x0/0x110 [ext3]
[14384.920429] [<ffffffff80303749>] generic_delete_inode+0xb9/0x180
[14384.920438] [<ffffffff80303835>] generic_drop_inode+0x25/0x30
[14384.920447] [<ffffffff803024e2>] iput+0x62/0x70
[14384.920455] [<ffffffff802ff1f0>] dentry_iput+0x90/0xe0
[14384.920463] [<ffffffff802ff304>] d_kill+0x44/0x70
[14384.920471] [<ffffffff80301475>] dput+0x95/0x190
[14384.920478] [<ffffffff802f750b>] sys_renameat+0x20b/0x240
[14384.920489] [<ffffffff802cedde>] ? free_pages_and_swap_cache+0x8e/0xb0
[14384.920499] [<ffffffff803a9358>] ? __up_write+0x68/0x140
[14384.920508] [<ffffffff802f755b>] sys_rename+0x1b/0x20
[14384.920519] [<ffffffff8021288a>] system_call_fastpath+0x16/0x1b
[14384.920525]
[14384.920556] INFO: task firefox:13261 blocked for more than 120 seconds.
[14384.920562] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[14384.920568] firefox D ffff88002800e068 0 13261 1
[14384.920581] ffff8800df40fd58 0000000000000082 ffff8800df40fdc8 ffff880000001000
[14384.920600] ffffffff807c4400 ffffffff807c4400 ffffffff807c4400 ffffffff807c4400
[14384.920620] ffffffff807c4400 ffffff...

Revision history for this message
Kees Cook (kees) wrote : Re: [2.6.27-2.3] temporary (?!) system deadlock with io_schedule

Note that NFS is not required (Michael Bienia is not using NFS), so this seems like a more general problem.

Kees Cook (kees)
description: updated
Revision history for this message
Michael Bienia (geser) wrote :

As this problem was bugging me too much, I switched back to 2.6.27-1.2 where I didn't see this yet.

Revision history for this message
Michael Bienia (geser) wrote :

Still happens with 2.6.27-4.5 on AMD64:

[ 3472.484065] INFO: task mutt:32170 blocked for more than 120 seconds.
[ 3472.484077] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 3472.484082] mutt D 0000000000000002 0 32170 1
[ 3472.484090] ffff88007e52bb08 0000000000000046 ffff880079659ed8 ffff88011986e000
[ 3472.484098] ffffffff807c4400 ffffffff807c4400 ffffffff807c4400 ffffffff807c4400
[ 3472.484105] ffffffff807c4400 ffffffff807c4400 ffffffff807c4400 ffffffff807c4400
[ 3472.484110] Call Trace:
[ 3472.484125] [<ffffffff803a8da9>] ? rb_insert_color+0x109/0x140
[ 3472.484135] [<ffffffff80519b55>] schedule_timeout+0x95/0xd0
[ 3472.484144] [<ffffffff8026ac7f>] ? hrtimer_start+0xdf/0x1b0
[ 3472.484150] [<ffffffff80519906>] wait_for_common+0xb6/0x1a0
[ 3472.484158] [<ffffffff802474d0>] ? default_wake_function+0x0/0x10
[ 3472.484165] [<ffffffff80519a88>] wait_for_completion+0x18/0x20
[ 3472.484172] [<ffffffff80262b12>] flush_cpu_workqueue+0x62/0xa0
[ 3472.484178] [<ffffffff80262770>] ? wq_barrier_func+0x0/0x20
[ 3472.484184] [<ffffffff80262ca3>] flush_workqueue+0x53/0x70
[ 3472.484190] [<ffffffff80262cd5>] flush_scheduled_work+0x15/0x20
[ 3472.484197] [<ffffffff8040cdfd>] tty_ldisc_release+0x4d/0x1f0
[ 3472.484203] [<ffffffff80245cb0>] ? __cond_resched+0x20/0x50
[ 3472.484208] [<ffffffff8051b33e>] ? _spin_lock+0xe/0x20
[ 3472.484215] [<ffffffff8040685d>] release_dev+0x45d/0x5c0
[ 3472.484222] [<ffffffff802e326e>] ? __free_slab+0x9e/0x140
[ 3472.484229] [<ffffffff803076d6>] ? mntput_no_expire+0x36/0x160
[ 3472.484235] [<ffffffff804069de>] tty_release+0x1e/0x30
[ 3472.484241] [<ffffffff802ec929>] __fput+0xc9/0x1b0
[ 3472.484245] [<ffffffff802eca35>] fput+0x25/0x30
[ 3472.484250] [<ffffffff802e8ddb>] filp_close+0x5b/0x90
[ 3472.484256] [<ffffffff802511e2>] put_files_struct+0x82/0xe0
[ 3472.484261] [<ffffffff80251294>] exit_files+0x54/0x70
[ 3472.484266] [<ffffffff802531bd>] do_exit+0x2ad/0x410
[ 3472.484271] [<ffffffff80253367>] do_group_exit+0x47/0xc0
[ 3472.484277] [<ffffffff802533f7>] sys_exit_group+0x17/0x20
[ 3472.484284] [<ffffffff8021288a>] system_call_fastpath+0x16/0x1b

Revision history for this message
Kees Cook (kees) wrote :

I think this is likely due to one of the two AppArmor patches between -1.2 and -2.3. I presently suspect the ptrace changes, and my most recent errors while running -2.3 triggered after doing a bunch of ptrace'ing with gdb. (Currently building a bisected kernel...)

Kees Cook (kees)
Changed in linux:
milestone: none → ubuntu-8.10
Revision history for this message
Kees Cook (kees) wrote :

This bug hasn't hit me since -5, but geser mentioned he is still having problems.

Changed in linux:
status: Triaged → Confirmed
Revision history for this message
Michael Bienia (geser) wrote :

I still see it with -5. I see it also with -6 and -7 but both show also ata resets for me where I don't know if it's related or not.

Here is some output for -5:
$ uname -a
Linux vorlon 2.6.27-5-generic #1 SMP Fri Oct 3 00:36:38 UTC 2008 x86_64 GNU/Linux

[ 462.784063] INFO: task synaptic:6408 blocked for more than 120 seconds.
[ 462.784076] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 462.784082] synaptic D 0000000000000002 0 6408 6384
[ 462.784095] ffff8800df1c3bc8 0000000000000082 ffff88015f1c3d57 ffff8800df1c3c78
[ 462.784107] ffffffff807a5400 ffffffff807a5400 ffffffff807a5400 ffffffff807a5400
[ 462.784115] ffffffff807a5400 ffffffff807a5400 ffffffff807a5400 ffffffff807a5400
[ 462.784123] Call Trace:
[ 462.784150] [<ffffffff8033a350>] ? proc_delete_inode+0x0/0x70
[ 462.784157] [<ffffffff8033a350>] ? proc_delete_inode+0x0/0x70
[ 462.784167] [<ffffffff805023d5>] schedule_timeout+0x95/0xd0
[ 462.784176] [<ffffffff80502186>] wait_for_common+0xb6/0x1a0
[ 462.784185] [<ffffffff802475c0>] ? default_wake_function+0x0/0x10
[ 462.784193] [<ffffffff80502308>] wait_for_completion+0x18/0x20
[ 462.784202] [<ffffffff80262cd2>] flush_cpu_workqueue+0x62/0xa0
[ 462.784209] [<ffffffff80262930>] ? wq_barrier_func+0x0/0x20
[ 462.784217] [<ffffffff80262e63>] flush_workqueue+0x53/0x70
[ 462.784224] [<ffffffff80262e95>] flush_scheduled_work+0x15/0x20
[ 462.784232] [<ffffffff8040d05d>] tty_ldisc_release+0x4d/0x1f0
[ 462.784240] [<ffffffff80503bbe>] ? _spin_lock+0xe/0x20
[ 462.784249] [<ffffffff80406abd>] release_dev+0x45d/0x5c0
[ 462.784256] [<ffffffff80408644>] ? tty_ioctl+0xf4/0x770
[ 462.784264] [<ffffffff802ffa94>] ? d_free+0x74/0x80
[ 462.784271] [<ffffffff80406c3e>] tty_release+0x1e/0x30
[ 462.784278] [<ffffffff802ecae9>] __fput+0xc9/0x1b0
[ 462.784284] [<ffffffff802ecbf5>] fput+0x25/0x30
[ 462.784290] [<ffffffff802e8f9b>] filp_close+0x5b/0x90
[ 462.784296] [<ffffffff802e908f>] sys_close+0xbf/0x120
[ 462.784305] [<ffffffff8021289a>] system_call_fastpath+0x16/0x1b

This got repeated for synaptic several times:
[ 595.928066] INFO: task synaptic:6408 blocked for more than 120 seconds.
[ 729.072053] INFO: task synaptic:6408 blocked for more than 120 seconds.
[ 862.216053] INFO: task synaptic:6408 blocked for more than 120 seconds.
[ 995.360062] INFO: task synaptic:6408 blocked for more than 120 seconds.
[ 1128.504061] INFO: task synaptic:6408 blocked for more than 120 seconds.
[ 1261.648062] INFO: task synaptic:6408 blocked for more than 120 seconds.
[ 1394.792060] INFO: task synaptic:6408 blocked for more than 120 seconds.
[ 1527.936062] INFO: task synaptic:6408 blocked for more than 120 seconds.
[ 1661.080057] INFO: task synaptic:6408 blocked for more than 120 seconds.

Revision history for this message
Leann Ogasawara (leannogasawara) wrote :

Per some kernel team discussions I'm assigning a few bugs directly to a developer. Ben's the lucky winner for this one.

Changed in linux:
assignee: ubuntu-kernel-team → ben-collins
status: Confirmed → Triaged
Revision history for this message
Christopher Berner (cberner) wrote :

Has any progress been made on this? I'm seeing it in 2.6.27-9 too

Revision history for this message
Leann Ogasawara (leannogasawara) wrote :

Would anyone be willing to test with the latest pre-release of Jaunty 9.04 (currently Alpha4). It contains a newer 2.6.28 based kernel. It would be great to know if this issue remains. Also note that Jaunty Alpha5 is expected to come out within the next few days so you may want to test with this even newer Alpha - http://cdimage.ubuntu.com/releases/jaunty/ . Please let us know your results. I'm also reassigning this to the ubuntu-kernel-team. Thanks.

Changed in linux:
assignee: ben-collins → ubuntu-kernel-team
status: Triaged → Incomplete
Revision history for this message
Michael Bienia (geser) wrote :

I upgraded to jaunty two or three weeks ago and use the jaunty kernel (currently 2.6.28-8) since then without problems.

Andy Whitcroft (apw)
Changed in linux:
assignee: ubuntu-kernel-team → apw
status: Incomplete → In Progress
Revision history for this message
Alvin (alvind) wrote :

Could this be a duplicate of bug 276476 ?

Andy Whitcroft (apw)
Changed in linux (Ubuntu):
assignee: Andy Whitcroft (apw) → nobody
Revision history for this message
Jeremy Foshee (jeremyfoshee) wrote :

adjusting from "in progress" as Andy is no longer assigned.

~JFo

Changed in linux (Ubuntu):
status: In Progress → Triaged
Revision history for this message
penalvch (penalvch) wrote :

Kees Cook, this bug was reported a while ago and there hasn't been any activity in it recently. We were wondering if this is still an issue? If so, could you please test for this with the latest development release of Ubuntu? ISO images are available from http://cdimage.ubuntu.com/daily-live/current/ .

If it remains an issue, could you please run the following command in the development release from a Terminal (Applications->Accessories->Terminal), as it will automatically gather and attach updated debug information to this report:

apport-collect -p linux <replace-with-bug-number>

Also, could you please test the latest upstream kernel available following https://wiki.ubuntu.com/KernelMainlineBuilds ? It will allow additional upstream developers to examine the issue. Please do not test the daily folder, but the one all the way at the bottom. Once you've tested the upstream kernel, please comment on which kernel version specifically you tested. If this bug is fixed in the mainline kernel, please add the following tags:
kernel-fixed-upstream
kernel-fixed-upstream-VERSION-NUMBER

where VERSION-NUMBER is the version number of the kernel you tested. For example:
kernel-fixed-upstream-v3.11

This can be done by clicking on the yellow circle with a black pencil icon next to the word Tags located at the bottom of the bug description. As well, please remove the tag:
needs-upstream-testing

If the mainline kernel does not fix this bug, please add the following tags:
kernel-bug-exists-upstream
kernel-bug-exists-upstream-VERSION-NUMBER

As well, please remove the tag:
needs-upstream-testing

Once testing of the upstream kernel is complete, please mark this bug's Status as Confirmed. Please let us know your results. Thank you for your understanding.

Changed in linux (Ubuntu):
status: Triaged → Incomplete
MinWang (anglenet)
information type: Public → Public Security
To post a comment you must log in.
This report contains Public Security information  
Everyone can see this security related information.

Other bug subscribers

Remote bug watches

Bug watches keep track of this bug in other bug trackers.