kernel hangs in xlog_grant_log_space

Bug #979498 reported by Juerg Haefliger
This bug affects 3 people
Affects          Status     Importance  Assigned to  Milestone
Linux            Confirmed  High
linux (Ubuntu)   Triaged    Medium      Unassigned
Lucid            Won't Fix  Medium      Unassigned
Oneiric          Invalid    Medium      Unassigned
Precise          Won't Fix  Medium      Unassigned
Quantal          Invalid    Medium      Unassigned

Bug Description

We're seeing the following stack traces on different production machines that are running Natty 2.6.38-8-server. The machines need to be rebooted to recover. http://oss.sgi.com/archives/xfs/2011-11/msg00401.html claims that this bug is fixed in 3.0. Can this patch be backported to a Natty kernel? Upgrading to Oneiric is not an option at the moment.

Apr 10 23:31:14 nv-aw2az1-database0001 kernel: [2693078.299177] INFO: task xfssyncd/dm-4:739 blocked for more than 120 seconds.
Apr 10 23:31:14 nv-aw2az1-database0001 kernel: [2693078.299206] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Apr 10 23:31:14 nv-aw2az1-database0001 kernel: [2693078.299235] xfssyncd/dm-4 D 000000000000000e 0 739 2 0x00000000
Apr 10 23:31:14 nv-aw2az1-database0001 kernel: [2693078.299241] ffff880bdd211d00 0000000000000046 ffff880bdd211fd8 ffff880bdd210000
Apr 10 23:31:14 nv-aw2az1-database0001 kernel: [2693078.299245] 0000000000013d00 ffff880bddd8df38 ffff880bdd211fd8 0000000000013d00
Apr 10 23:31:14 nv-aw2az1-database0001 kernel: [2693078.299250] ffff881745c496e0 ffff880bddd8db80 0000000000000282 ffff8817de4e2800
Apr 10 23:31:14 nv-aw2az1-database0001 kernel: [2693078.299254] Call Trace:
Apr 10 23:31:14 nv-aw2az1-database0001 kernel: [2693078.299297] [<ffffffffa010b2d8>] xlog_grant_log_space+0x4a8/0x500 [xfs]
Apr 10 23:31:14 nv-aw2az1-database0001 kernel: [2693078.299304] [<ffffffff8105f6f0>] ? default_wake_function+0x0/0x20
Apr 10 23:31:14 nv-aw2az1-database0001 kernel: [2693078.299328] [<ffffffffa010d1ff>] xfs_log_reserve+0xff/0x140 [xfs]
Apr 10 23:31:14 nv-aw2az1-database0001 kernel: [2693078.299352] [<ffffffffa01191fc>] xfs_trans_reserve+0x9c/0x200 [xfs]
Apr 10 23:31:14 nv-aw2az1-database0001 kernel: [2693078.299373] [<ffffffffa00fd383>] xfs_fs_log_dummy+0x43/0x90 [xfs]
Apr 10 23:31:14 nv-aw2az1-database0001 kernel: [2693078.299397] [<ffffffffa01303c1>] xfs_sync_worker+0x81/0x90 [xfs]
Apr 10 23:31:14 nv-aw2az1-database0001 kernel: [2693078.299421] [<ffffffffa012f0f3>] xfssyncd+0x183/0x230 [xfs]
Apr 10 23:31:14 nv-aw2az1-database0001 kernel: [2693078.299444] [<ffffffffa012ef70>] ? xfssyncd+0x0/0x230 [xfs]
Apr 10 23:31:14 nv-aw2az1-database0001 kernel: [2693078.299450] [<ffffffff810871f6>] kthread+0x96/0xa0
Apr 10 23:31:14 nv-aw2az1-database0001 kernel: [2693078.299456] [<ffffffff8100cde4>] kernel_thread_helper+0x4/0x10
Apr 10 23:31:14 nv-aw2az1-database0001 kernel: [2693078.299460] [<ffffffff81087160>] ? kthread+0x0/0xa0
Apr 10 23:31:14 nv-aw2az1-database0001 kernel: [2693078.299463] [<ffffffff8100cde0>] ? kernel_thread_helper+0x0/0x10
Apr 10 23:31:14 nv-aw2az1-database0001 kernel: [2693078.299562] INFO: task mysqld:16377 blocked for more than 120 seconds.
Apr 10 23:31:14 nv-aw2az1-database0001 kernel: [2693078.299586] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Apr 10 23:31:14 nv-aw2az1-database0001 kernel: [2693078.299615] mysqld D 000000000000000e 0 16377 1 0x00000000
Apr 10 23:31:14 nv-aw2az1-database0001 kernel: [2693078.299619] ffff88146e6ffaa8 0000000000000082 ffff88146e6fffd8 ffff88146e6fe000
Apr 10 23:31:14 nv-aw2az1-database0001 kernel: [2693078.299623] 0000000000013d00 ffff8817375303b8 ffff88146e6fffd8 0000000000013d00
Apr 10 23:31:14 nv-aw2az1-database0001 kernel: [2693078.299628] ffff880bdf352dc0 ffff881737530000 0000000000000286 ffff8817de4e2800
Apr 10 23:31:14 nv-aw2az1-database0001 kernel: [2693078.299632] Call Trace:
Apr 10 23:31:14 nv-aw2az1-database0001 kernel: [2693078.299655] [<ffffffffa010b2d8>] xlog_grant_log_space+0x4a8/0x500 [xfs]
Apr 10 23:31:14 nv-aw2az1-database0001 kernel: [2693078.299659] [<ffffffff8105f6f0>] ? default_wake_function+0x0/0x20
Apr 10 23:31:14 nv-aw2az1-database0001 kernel: [2693078.299682] [<ffffffffa010d1ff>] xfs_log_reserve+0xff/0x140 [xfs]
Apr 10 23:31:14 nv-aw2az1-database0001 kernel: [2693078.299705] [<ffffffffa01191fc>] xfs_trans_reserve+0x9c/0x200 [xfs]
Apr 10 23:31:14 nv-aw2az1-database0001 kernel: [2693078.299729] [<ffffffffa0119071>] ? xfs_trans_alloc+0xa1/0xb0 [xfs]
Apr 10 23:31:14 nv-aw2az1-database0001 kernel: [2693078.299752] [<ffffffffa011ef4f>] xfs_create+0x17f/0x660 [xfs]
Apr 10 23:31:14 nv-aw2az1-database0001 kernel: [2693078.299776] [<ffffffffa012c07a>] xfs_vn_mknod+0xaa/0x1c0 [xfs]
Apr 10 23:31:14 nv-aw2az1-database0001 kernel: [2693078.299799] [<ffffffffa012c1c0>] xfs_vn_create+0x10/0x20 [xfs]
Apr 10 23:31:14 nv-aw2az1-database0001 kernel: [2693078.299804] [<ffffffff811705c1>] vfs_create+0xb1/0x110
Apr 10 23:31:14 nv-aw2az1-database0001 kernel: [2693078.299809] [<ffffffff81173bd6>] do_last+0x346/0x410
Apr 10 23:31:14 nv-aw2az1-database0001 kernel: [2693078.299812] [<ffffffff81174032>] do_filp_open+0x392/0x7c0
Apr 10 23:31:14 nv-aw2az1-database0001 kernel: [2693078.299816] [<ffffffff81172d02>] ? user_path_at+0x62/0xa0
Apr 10 23:31:14 nv-aw2az1-database0001 kernel: [2693078.299822] [<ffffffff811810f7>] ? alloc_fd+0xf7/0x150
Apr 10 23:31:14 nv-aw2az1-database0001 kernel: [2693078.299826] [<ffffffff8116474a>] do_sys_open+0x6a/0x150
Apr 10 23:31:14 nv-aw2az1-database0001 kernel: [2693078.299830] [<ffffffff81164850>] sys_open+0x20/0x30
Apr 10 23:31:14 nv-aw2az1-database0001 kernel: [2693078.299833] [<ffffffff8100bfc2>] system_call_fastpath+0x16/0x1b
Apr 10 23:31:14 nv-aw2az1-database0001 kernel: [2693078.299866] INFO: task cron:15652 blocked for more than 120 seconds.
Apr 10 23:31:14 nv-aw2az1-database0001 kernel: [2693078.299889] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Apr 10 23:31:14 nv-aw2az1-database0001 kernel: [2693078.299918] cron D 0000000000000013 0 15652 1053 0x00000000
Apr 10 23:31:14 nv-aw2az1-database0001 kernel: [2693078.299922] ffff88096bac7cb8 0000000000000082 ffff88096bac7fd8 ffff88096bac6000
Apr 10 23:31:14 nv-aw2az1-database0001 kernel: [2693078.299926] 0000000000013d00 ffff880b6e555f38 ffff88096bac7fd8 0000000000013d00
Apr 10 23:31:14 nv-aw2az1-database0001 kernel: [2693078.299930] ffff880bdf3eadc0 ffff880b6e555b80 ffff88096bac7cf8 ffff8817ddaf75b8
Apr 10 23:31:14 nv-aw2az1-database0001 kernel: [2693078.299935] Call Trace:
Apr 10 23:31:14 nv-aw2az1-database0001 kernel: [2693078.299941] [<ffffffff815d6537>] __mutex_lock_slowpath+0xf7/0x180
Apr 10 23:31:14 nv-aw2az1-database0001 kernel: [2693078.299946] [<ffffffff812797d0>] ? security_inode_exec_permission+0x30/0x40
Apr 10 23:31:14 nv-aw2az1-database0001 kernel: [2693078.299950] [<ffffffff815d5f23>] mutex_lock+0x23/0x50
Apr 10 23:31:14 nv-aw2az1-database0001 kernel: [2693078.299954] [<ffffffff811739a8>] do_last+0x118/0x410
Apr 10 23:31:14 nv-aw2az1-database0001 kernel: [2693078.299957] [<ffffffff81174032>] do_filp_open+0x392/0x7c0
Apr 10 23:31:14 nv-aw2az1-database0001 kernel: [2693078.299962] [<ffffffff8113135d>] ? handle_mm_fault+0x16d/0x250
Apr 10 23:31:14 nv-aw2az1-database0001 kernel: [2693078.299967] [<ffffffff811810f7>] ? alloc_fd+0xf7/0x150
Apr 10 23:31:14 nv-aw2az1-database0001 kernel: [2693078.299971] [<ffffffff8116474a>] do_sys_open+0x6a/0x150
Apr 10 23:31:14 nv-aw2az1-database0001 kernel: [2693078.299974] [<ffffffff81164850>] sys_open+0x20/0x30
Apr 10 23:31:14 nv-aw2az1-database0001 kernel: [2693078.299977] [<ffffffff8100bfc2>] system_call_fastpath+0x16/0x1b
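
For triage, hung-task reports like the ones above can be condensed to one entry per blocked task. The following is a minimal Python sketch and was not part of the original report; the function name `summarize_hung_tasks` and both regular expressions are assumptions derived only from the log format shown above. It records the first reliable stack frame of each report and skips frames marked with `?`.

```python
import re

# "INFO: task xfssyncd/dm-4:739 blocked for more than 120 seconds."
HUNG_RE = re.compile(
    r"INFO: task (?P<comm>.+):(?P<pid>\d+) blocked for more than \d+ seconds"
)
# "[<ffffffffa010b2d8>] xlog_grant_log_space+0x4a8/0x500 [xfs]"
# A frame written as "[<addr>] ? symbol+..." is unreliable and does not match.
FRAME_RE = re.compile(r"\[<[0-9a-f]+>\] (?P<sym>[A-Za-z_][\w.]*)\+0x")


def summarize_hung_tasks(lines):
    """Return [(comm, pid, top_frame), ...], one tuple per hung-task report."""
    reports = []
    current = None
    for line in lines:
        hung = HUNG_RE.search(line)
        if hung:
            # A new report starts; its top frame is not known yet.
            current = {"comm": hung.group("comm"),
                       "pid": int(hung.group("pid")),
                       "top": None}
            reports.append(current)
            continue
        if current is not None and current["top"] is None:
            frame = FRAME_RE.search(line)
            if frame:
                current["top"] = frame.group("sym")
    return [(r["comm"], r["pid"], r["top"]) for r in reports]


if __name__ == "__main__":
    import sys
    # Example: python summarize.py < /var/log/kern.log
    for comm, pid, top in summarize_hung_tasks(sys.stdin):
        print(f"{comm} (pid {pid}) blocked in {top}")
```

Applied to the log above, this groups xfssyncd/dm-4 and mysqld under xlog_grant_log_space, while cron is waiting in __mutex_lock_slowpath, presumably on a directory mutex held by one of the blocked tasks.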

Brad Figg (brad-figg) wrote : Missing required logs.

This bug is missing log files that will aid in diagnosing the problem. From a terminal window please run:

apport-collect 979498

and then change the status of the bug to 'Confirmed'.

If, due to the nature of the issue you have encountered, you are unable to run this command, please add a comment stating that fact and change the bug status to 'Confirmed'.

This change has been made by an automated script, maintained by the Ubuntu Kernel Team.

Changed in linux (Ubuntu):
status: New → Incomplete
tags: added: natty
Juerg Haefliger (juergh) wrote : Re: Critical: Natty kernel hangs in xlog_grant_log_space

Can't run apport-collect on these machines.

Changed in linux (Ubuntu):
status: Incomplete → Confirmed
Juerg Haefliger (juergh) wrote :

On one machine, this affected the /tmp logical volume. Any command/task that touched /tmp hung and never completed.
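
A monitoring check for this symptom could look roughly like the Python sketch below. It is an illustration only and not something used in this report; the function name `path_responds` and the 10-second default are assumptions. The probe runs in a child process because a task stuck in the hung code path sits in uninterruptible sleep and cannot be interrupted or killed, so the monitor itself must never touch the filesystem directly or wait on the child.

```python
import subprocess
import sys
import time

# Tiny probe program: create and remove one file in the given directory.
_PROBE = (
    "import os, sys, tempfile\n"
    "fd, name = tempfile.mkstemp(dir=sys.argv[1])\n"
    "os.close(fd)\n"
    "os.unlink(name)\n"
)


def path_responds(directory, timeout=10.0):
    """Return True if a file can be created and removed in `directory`
    within `timeout` seconds, False if the probe fails or does not finish."""
    child = subprocess.Popen(
        [sys.executable, "-c", _PROBE, directory],
        stderr=subprocess.DEVNULL,
    )
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        rc = child.poll()
        if rc is not None:
            return rc == 0
        time.sleep(0.1)
    # The probe is still running and is probably blocked in the kernel.
    # Deliberately no kill()/wait() here: a task in D state ignores signals.
    return False


if __name__ == "__main__":
    print("/tmp responsive:", path_responds("/tmp"))
```

A False result only says that the probe did not complete in time; it does not distinguish this XFS hang from other causes such as a full or read-only filesystem.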

Changed in linux (Ubuntu):
importance: Undecided → High
Brad Figg (brad-figg)
Changed in linux (Ubuntu):
status: Confirmed → In Progress
assignee: nobody → Brad Figg (brad-figg)
Brad Figg (brad-figg) wrote :

@Juerg,

1. I have backported the indicated commit and test kernels are available at:
        http://people.canonical.com/~bradf/797498

    Please test the appropriate kernel and add a comment here if it resolved this issue for you or not.

2. Be aware that we are just 6 months away from the end of support for Natty. This patch has been part of Oneiric for some time now. You may want to think about upgrading to Oneiric or possibly even Precise when it releases.

Brad Figg (brad-figg)
Changed in linux (Ubuntu):
status: In Progress → Incomplete
Juerg Haefliger (juergh) wrote :

Trying to find a reproducer to test the kernel. Thanks.

Jason Yen (jasonyen) wrote :

@Brad,

I think the link to the test kernel should be:

http://people.canonical.com/~bradf/lp979498/

If I am wrong, please feel free to correct me. Thanks.

Brad Figg (brad-figg) wrote :

@Jason,

You are correct. I apologize for the fumble fingers.

Juerg Haefliger (juergh) wrote :

Can I get a copy of the source so that I can check what other patches are in that kernel?

Brad Figg (brad-figg) wrote :

@Juerg,

You can find the git tree at:
    git://kernel.ubuntu.com/bradf/ubuntu-natty

This is the Ubuntu 2.6.38-14.58 tree with just this one patch on it.

Juerg Haefliger (juergh) wrote :

I finally managed to create a reproducer for the XFS hang, but the provided kernel does not solve the problem. It hangs within a few seconds, just like the original 2.6.38-8-server kernel. I also tried the following Ubuntu kernels, but they both hang within a few minutes. They do run a little longer than 2.6.38-8 and -14, though.
3.0.0-17-server
3.2.0-23-lowlatency

Next I tried the upstream stable kernels but they also hang within an hour:
3.0.29
3.1.10
3.2.15
3.3.2

A typical stack trace is attached. Once the task is hanging, the directory or partition becomes unusable and only an emergency sync clears things up again. Note that I've also started the discussion on the XFS mailing list: http://oss.sgi.com/archives/xfs/2012-04/msg00951.html
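
When running a reproducer, one way to notice the hang before the 120-second hung-task message appears is to poll for tasks in uninterruptible sleep (state D). The Python sketch below is an illustration only and is not one of the scripts mentioned in this report; the names `parse_stat` and `blocked_tasks` are assumptions. It relies on the documented layout of /proc/<pid>/stat, where the command name is enclosed in parentheses and followed by the one-letter state.

```python
import os


def parse_stat(stat_text):
    """Return (pid, comm, state) from the contents of /proc/<pid>/stat."""
    # comm may itself contain spaces or parentheses, so split on the
    # first '(' and the last ')' instead of on whitespace.
    lparen = stat_text.index("(")
    rparen = stat_text.rindex(")")
    pid = int(stat_text[:lparen])
    comm = stat_text[lparen + 1:rparen]
    state = stat_text[rparen + 1:].split()[0]
    return pid, comm, state


def blocked_tasks(proc_root="/proc"):
    """Return a sorted list of (pid, comm) for processes in state D."""
    found = []
    for entry in os.listdir(proc_root):
        if not entry.isdigit():
            continue
        try:
            with open(os.path.join(proc_root, entry, "stat")) as f:
                pid, comm, state = parse_stat(f.read())
        except (OSError, ValueError, IndexError):
            continue  # process exited or stat was unreadable
        if state == "D":
            found.append((pid, comm))
    return sorted(found)


if __name__ == "__main__":
    for pid, comm in blocked_tasks():
        print(pid, comm)
```

Short-lived D states are normal during disk I/O, so only a task that stays in the list across several polls is a hint of a hang. The sketch looks at the top-level entries in /proc only and does not walk per-thread entries under /proc/<pid>/task.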

Juerg Haefliger (juergh) wrote :
Changed in linux (Ubuntu):
status: Incomplete → In Progress
Brad Figg (brad-figg) wrote :

@Juerg,

We are currently trying to reproduce the problem on three systems here.

Brad Figg (brad-figg) wrote :

@Juerg,

I should have mentioned that we are attempting to reproduce this using the Precise (3.2) kernel.

Juerg Haefliger (juergh) wrote :

I logged a case #00029027 through the HP landscape account. Do you have access to that? It contains instructions and some scripts that I use to force the hang. Can you try the Natty kernel? That one hangs within a few seconds.

Brad Figg (brad-figg) wrote :

@Juerg,

Yes, I have access and have been using those instructions to try to reproduce.

Brad Figg (brad-figg) wrote :

@Juerg,

Installed Natty and ran the scripts for 1hr 10min without hang.

Juerg Haefliger (juergh) wrote :

I reproduced the issue on 4 different machines with different HW configurations (SE1170/P410, SE2170/P212, SL390/P212, z400 no RAID controller).

Do you have a machine that I can access to give it a try? Or would it help if I gave you access to one of our machines?

tags: added: kernel-da-key
Chris J Arges (arges) wrote :

Filed an upstream bug against xfs here:
http://oss.sgi.com/bugzilla/show_bug.cgi?id=922

Changed in linux:
importance: Unknown → High
status: Unknown → Confirmed
Chris J Arges (arges)
tags: added: lucid oneiric precise quantal
tags: added: exists-upstream
Changed in linux (Ubuntu Lucid):
status: New → In Progress
Changed in linux (Ubuntu Natty):
status: New → In Progress
Changed in linux (Ubuntu Oneiric):
status: New → In Progress
Changed in linux (Ubuntu Precise):
status: New → In Progress
importance: Undecided → High
Changed in linux (Ubuntu Natty):
importance: Undecided → High
Changed in linux (Ubuntu Lucid):
importance: Undecided → High
Changed in linux (Ubuntu Oneiric):
importance: Undecided → High
Chris J Arges (arges)
Changed in linux (Ubuntu Precise):
assignee: nobody → Chris J Arges (christopherarges)
Changed in linux (Ubuntu Quantal):
assignee: Brad Figg (brad-figg) → Chris J Arges (christopherarges)
Changed in linux (Ubuntu Lucid):
assignee: nobody → Chris J Arges (christopherarges)
Changed in linux (Ubuntu Natty):
assignee: nobody → Chris J Arges (christopherarges)
Changed in linux (Ubuntu Oneiric):
assignee: nobody → Chris J Arges (christopherarges)
tags: added: rls-q-incoming
Leann Ogasawara (leannogasawara) wrote :

Removing the rls-q-incoming tag as this has properly been nominated for Quantal and has an assignee.

tags: removed: rls-q-incoming
Changed in linux (Ubuntu Quantal):
milestone: none → ubuntu-12.10
Chris J Arges (arges)
no longer affects: linux (Ubuntu Natty)
Chris J Arges (arges)
Changed in linux (Ubuntu):
importance: High → Medium
Changed in linux (Ubuntu Precise):
importance: High → Medium
Changed in linux (Ubuntu Quantal):
milestone: ubuntu-12.10 → none
importance: High → Medium
Changed in linux (Ubuntu Lucid):
importance: High → Medium
Changed in linux (Ubuntu Oneiric):
importance: High → Medium
dino99 (9d9) wrote :
tags: removed: natty oneiric
Changed in linux (Ubuntu Oneiric):
status: In Progress → Invalid
penalvch (penalvch)
Changed in linux (Ubuntu):
milestone: ubuntu-12.10 → none
status: In Progress → Incomplete
Chris J Arges (arges) wrote :

@penalvch

That patch was already tested in the above comments and does not fix the issue. This may still affect currently released versions of the Ubuntu kernel.

summary: - Critical: Natty kernel hangs in xlog_grant_log_space
+ kernel hangs in xlog_grant_log_space
Changed in linux (Ubuntu Oneiric):
assignee: Chris J Arges (arges) → nobody
Changed in linux (Ubuntu):
assignee: Chris J Arges (arges) → nobody
Changed in linux (Ubuntu Precise):
assignee: Chris J Arges (arges) → nobody
penalvch (penalvch)
Changed in linux (Ubuntu):
status: Incomplete → Triaged
Chris J Arges (arges)
Changed in linux (Ubuntu Lucid):
assignee: Chris J Arges (arges) → nobody
Changed in linux (Ubuntu Quantal):
assignee: Chris J Arges (arges) → nobody
dino99 (9d9) wrote :

EOL reached

Changed in linux (Ubuntu Quantal):
status: In Progress → Invalid
Rolf Leggewie (r0lf) wrote :

lucid has seen the end of its life and is no longer receiving any updates. Marking the lucid task for this ticket as "Won't Fix".

Changed in linux (Ubuntu Lucid):
status: In Progress → Won't Fix
Steve Langasek (vorlon) wrote :

The Precise Pangolin has reached end of life, so this bug will not be fixed for that release.

Changed in linux (Ubuntu Precise):
status: In Progress → Won't Fix