Bug #999755 “Kernel crash in rb_next doing ohai loops” : Bugs : linux package : Ubuntu

Revision history for this message

Karl Matthias (0k53-karl) wrote on 2012-05-15:

#1

BootDmesg.txt Edit (17.4 KiB, text/plain; charset="utf-8")
CurrentDmesg.txt Edit (246.7 KiB, text/plain; charset="utf-8")
Dependencies.txt Edit (2.0 KiB, text/plain; charset="utf-8")
ProcCpuinfo.txt Edit (1.1 KiB, text/plain; charset="utf-8")
ProcInterrupts.txt Edit (1.7 KiB, text/plain; charset="utf-8")
ProcModules.txt Edit (791 bytes, text/plain; charset="utf-8")
UdevDb.txt Edit (40.9 KiB, text/plain; charset="utf-8")
UdevLog.txt Edit (86.2 KiB, text/plain; charset="utf-8")

Revision history for this message

Brad Figg (brad-figg) wrote on 2012-05-15: Missing required logs.

#2

This bug is missing log files that will aid in diagnosing the problem. From a terminal window please run:

apport-collect 999755

and then change the status of the bug to 'Confirmed'.

If, due to the nature of the issue you have encountered, you are unable to run this command, please add a comment stating that fact and change the bug status to 'Confirmed'.

This change has been made by an automated script, maintained by the Ubuntu Kernel Team.

Changed in linux (Ubuntu):
status:	New → Incomplete

Joseph Salisbury (jsalisbury) on 2012-05-15

Changed in linux (Ubuntu):
importance:	Undecided → Medium

Revision history for this message

Stefan Bader (smb) wrote on 2012-05-16: Re: Kernel crash on EC2 m1.large instances

#3

The crash itself seems to point at a broken task structures tree while scheduling during a pipe read. Though this is rather cause not reason. Right now, nothing obvious strikes. There were a few Xen related patches waiting to come in through stable but I am not sure those would help here.
Meanwhile it would be good if we could gather one or two more crash stack traces along with "Xen version" grabbed from dmesg (which is the version of the hypervisor running on the host).

Revision history for this message

Karl Matthias (0k53-karl) wrote on 2012-05-17:

#4

Download full text (3.4 KiB)

Hi Stefan, OK. Here's another stack trace:

[47708.053788] BUG: unable to handle kernel NULL pointer dereference at 0000000000000010
[47708.053810] IP: [<ffffffff8130d7f1>] rb_next+0x1/0x50
[47708.053824] PGD 1d0b83067 PUD 1d0e64067 PMD 0
[47708.053833] Oops: 0000 [#1] SMP
[47708.053840] CPU 1
[47708.053843] Modules linked in: ipt_REJECT xt_tcpudp nf_conntrack_ipv4 nf_defrag_ipv4 xt_state nf_conntrack iptable_filter ip_tables x_tables parport_pc ppdev lp parport isofs acpiphp
[47708.053874]
[47708.053878] Pid: 1190, comm: chef-client Not tainted 3.2.0-24-virtual #37-Ubuntu
[47708.053888] RIP: e030:[<ffffffff8130d7f1>] [<ffffffff8130d7f1>] rb_next+0x1/0x50
[47708.053896] RSP: e02b:ffff8801d2e99c18 EFLAGS: 00010046
[47708.053901] RAX: 0000000000000000 RBX: ffff8801d2745800 RCX: 0000000000000000
[47708.053907] RDX: fffffffffffffff0 RSI: 0000000000000000 RDI: 0000000000000010
[47708.053913] RBP: ffff8801d2e99c48 R08: 0000000000000000 R09: 0000000000000000
[47708.053919] R10: ffff8801dff866c0 R11: 0000000000000001 R12: 0000000000000000
[47708.053925] R13: 0000000000000000 R14: 0000000000000000 R15: 000000000240db78
[47708.053935] FS: 00007fc872779700(0000) GS:ffff8801dff8f000(0000) knlGS:0000000000000000
[47708.053943] CS: e033 DS: 0000 ES: 0000 CR0: 000000008005003b
[47708.053948] CR2: 0000000000000010 CR3: 00000001d0b3d000 CR4: 0000000000002660
[47708.053955] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[47708.053961] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
[47708.053967] Process chef-client (pid: 1190, threadinfo ffff8801d2e98000, task ffff8801d24d2dc0)
[47708.053974] Stack:
[47708.053978] ffff8801d2e99c48 ffffffff8104ece9 ffff8801d2745800 ffff8801dffa26c0
[47708.053990] ffff8801d2744a00 0000000000000000 ffff8801d2e99c78 ffffffff810544b8
[47708.054001] ffff8801d2e99c78 ffff8801dffa26c0 0000000000000001 ffff8801d24d3168
[47708.054013] Call Trace:
[47708.054022] [<ffffffff8104ece9>] ? pick_next_entity+0xb9/0xe0
[47708.054030] [<ffffffff810544b8>] pick_next_task_fair+0x38/0x70
[47708.054038] [<ffffffff81652ddc>] __schedule+0x14c/0x6f0
[47708.054045] [<ffffffff816554ee>] ? _raw_spin_unlock_irqrestore+0x1e/0x30
[47708.054052] [<ffffffff8165344f>] schedule+0x3f/0x60
[47708.054060] [<ffffffff8117e119>] pipe_wait+0x59/0x80
[47708.054067] [<ffffffff81089340>] ? add_wait_queue+0x60/0x60
[47708.054074] [<ffffffff8117e87a>] pipe_read+0x1da/0x330
[47708.054081] [<ffffffff81174522>] do_sync_read+0xd2/0x110
[47708.054088] [<ffffffff8100a25d>] ? xen_force_evtchn_callback+0xd/0x10
[47708.054095] [<ffffffff8100aa32>] ? check_events+0x12/0x20
[47708.054103] [<ffffffff81298d33>] ? security_file_permission+0x93/0xb0
[47708.054110] [<ffffffff811749a1>] ? rw_verify_area+0x61/0xf0
[47708.054116] [<ffffffff81174e80>] vfs_read+0xb0/0x180
[47708.054122] [<ffffffff81174f9a>] sys_read+0x4a/0x90
[47708.054129] [<ffffffff8165d8c2>] system_call_fastpath+0x16/0x1b
[47708.054134] Code: 89 06 48 8b 47 08 48 89 46 08 48 8b 47 10 48 89 46 10 c3 0f 1f 80 00 00 00 00 48 89 32 eb b2 0f 1f 00 48 89 70 10 eb a9 66 90 55 <48> 8b 17 48 89 e5 48 89 d0 48 83 e0 fc 48 39 c7 74 34 48 8b 47
[47708.054224] RIP [<fff...

Hi Stefan, OK.  Here's another stack trace:

[47708.053788] BUG: unable to handle kernel NULL pointer dereference at 0000000000000010
[47708.053810] IP: [<ffffffff8130d7f1>] rb_next+0x1/0x50
[47708.053824] PGD 1d0b83067 PUD 1d0e64067 PMD 0 
[47708.053833] Oops: 0000 [#1] SMP 
[47708.053840] CPU 1 
[47708.053843] Modules linked in: ipt_REJECT xt_tcpudp nf_conntrack_ipv4 nf_defrag_ipv4 xt_state nf_conntrack iptable_filter ip_tables x_tables parport_pc ppdev lp parport isofs acpiphp
[47708.053874] 
[47708.053878] Pid: 1190, comm: chef-client Not tainted 3.2.0-24-virtual #37-Ubuntu  
[47708.053888] RIP: e030:[<ffffffff8130d7f1>]  [<ffffffff8130d7f1>] rb_next+0x1/0x50
[47708.053896] RSP: e02b:ffff8801d2e99c18  EFLAGS: 00010046
[47708.053901] RAX: 0000000000000000 RBX: ffff8801d2745800 RCX: 0000000000000000
[47708.053907] RDX: fffffffffffffff0 RSI: 0000000000000000 RDI: 0000000000000010
[47708.053913] RBP: ffff8801d2e99c48 R08: 0000000000000000 R09: 0000000000000000
[47708.053919] R10: ffff8801dff866c0 R11: 0000000000000001 R12: 0000000000000000
[47708.053925] R13: 0000000000000000 R14: 0000000000000000 R15: 000000000240db78
[47708.053935] FS:  00007fc872779700(0000) GS:ffff8801dff8f000(0000) knlGS:0000000000000000
[47708.053943] CS:  e033 DS: 0000 ES: 0000 CR0: 000000008005003b
[47708.053948] CR2: 0000000000000010 CR3: 00000001d0b3d000 CR4: 0000000000002660
[47708.053955] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[47708.053961] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
[47708.053967] Process chef-client (pid: 1190, threadinfo ffff8801d2e98000, task ffff8801d24d2dc0)
[47708.053974] Stack:
[47708.053978]  ffff8801d2e99c48 ffffffff8104ece9 ffff8801d2745800 ffff8801dffa26c0
[47708.053990]  ffff8801d2744a00 0000000000000000 ffff8801d2e99c78 ffffffff810544b8
[47708.054001]  ffff8801d2e99c78 ffff8801dffa26c0 0000000000000001 ffff8801d24d3168
[47708.054013] Call Trace:
[47708.054022]  [<ffffffff8104ece9>] ? pick_next_entity+0xb9/0xe0
[47708.054030]  [<ffffffff810544b8>] pick_next_task_fair+0x38/0x70
[47708.054038]  [<ffffffff81652ddc>] __schedule+0x14c/0x6f0
[47708.054045]  [<ffffffff816554ee>] ? _raw_spin_unlock_irqrestore+0x1e/0x30
[47708.054052]  [<ffffffff8165344f>] schedule+0x3f/0x60
[47708.054060]  [<ffffffff8117e119>] pipe_wait+0x59/0x80
[47708.054067]  [<ffffffff81089340>] ? add_wait_queue+0x60/0x60
[47708.054074]  [<ffffffff8117e87a>] pipe_read+0x1da/0x330
[47708.054081]  [<ffffffff81174522>] do_sync_read+0xd2/0x110
[47708.054088]  [<ffffffff8100a25d>] ? xen_force_evtchn_callback+0xd/0x10
[47708.054095]  [<ffffffff8100aa32>] ? check_events+0x12/0x20
[47708.054103]  [<ffffffff81298d33>] ? security_file_permission+0x93/0xb0
[47708.054110]  [<ffffffff811749a1>] ? rw_verify_area+0x61/0xf0
[47708.054116]  [<ffffffff81174e80>] vfs_read+0xb0/0x180
[47708.054122]  [<ffffffff81174f9a>] sys_read+0x4a/0x90
[47708.054129]  [<ffffffff8165d8c2>] system_call_fastpath+0x16/0x1b
[47708.054134] Code: 89 06 48 8b 47 08 48 89 46 08 48 8b 47 10 48 89 46 10 c3 0f 1f 80 00 00 00 00 48 89 32 eb b2 0f 1f 00 48 89 70 10 eb a9 66 90 55 <48> 8b 17 48 89 e5 48 89 d0 48 83 e0 fc 48 39 c7 74 34 48 8b 47 
[47708.054224] RIP  [<ffffffff8130d7f1>] rb_next+0x1/0x50
[47708.054231]  RSP <ffff8801d2e99c18>
[47708.054235] CR2: 0000000000000010
[47708.054253] ---[ end trace 3fc2767cd5259b41 ]---

This is, of course still on EC2.  Here's the Xen version info from the system log:

Xen version: 3.4.3-2.6.18 (preserve-AD)

Hope that helps!

Revision history for this message

Karl Matthias (0k53-karl) wrote on 2012-05-17:

#5

Download full text (3.3 KiB)

Here's another one if it helps:

[848423.023156] BUG: unable to handle kernel NULL pointer dereference at 0000000000000010
[848423.023180] IP: [<ffffffff8130d7f1>] rb_next+0x1/0x50
[848423.023194] PGD 18ad83067 PUD 18ad82067 PMD 0
[848423.023203] Oops: 0000 [#1] SMP
[848423.023210] CPU 1
[848423.023213] Modules linked in: ipt_REJECT xt_tcpudp nf_conntrack_ipv4 nf_defrag_ipv4 xt_state nf_conntrack iptable_filter ip_tables x_tables parport_pc ppdev lp parport isofs acpiphp
[848423.023245]
[848423.023249] Pid: 18572, comm: chef-client Not tainted 3.2.0-24-virtual #37-Ubuntu
[848423.023258] RIP: e030:[<ffffffff8130d7f1>] [<ffffffff8130d7f1>] rb_next+0x1/0x50
[848423.023267] RSP: e02b:ffff88018af21c18 EFLAGS: 00010046
[848423.023273] RAX: 0000000000000000 RBX: ffff8801a125ce00 RCX: 0000000000000000
[848423.023278] RDX: fffffffffffffff0 RSI: 0000000000000000 RDI: 0000000000000010
[848423.023285] RBP: ffff88018af21c48 R08: 0000000000000000 R09: 0000000000000000
[848423.023291] R10: ffff8801dff866c0 R11: 0000000000000001 R12: 0000000000000000
[848423.023299] R13: 0000000000000000 R14: 0000000000000000 R15: 0000000003a5a9d0
[848423.023311] FS: 00007fb6bdc16700(0000) GS:ffff8801dff8f000(0000) knlGS:0000000000000000
[848423.023319] CS: e033 DS: 0000 ES: 0000 CR0: 000000008005003b
[848423.023325] CR2: 0000000000000010 CR3: 00000001a12e8000 CR4: 0000000000002660
[848423.023332] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[848423.023339] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
[848423.023346] Process chef-client (pid: 18572, threadinfo ffff88018af20000, task ffff8801a12f8000)
[848423.023353] Stack:
[848423.023356] ffff88018af21c48 ffffffff8104ece9 ffff8801a125ce00 ffff8801dffa26c0
[848423.023369] ffff8801d29cc600 0000000000000000 ffff88018af21c78 ffffffff810544b8
[848423.023380] ffff88018af21c78 ffff8801dffa26c0 0000000000000001 ffff8801a12f83a8
[848423.023392] Call Trace:
[848423.023402] [<ffffffff8104ece9>] ? pick_next_entity+0xb9/0xe0
[848423.023410] [<ffffffff810544b8>] pick_next_task_fair+0x38/0x70
[848423.023419] [<ffffffff81652ddc>] __schedule+0x14c/0x6f0
[848423.023426] [<ffffffff816554ee>] ? _raw_spin_unlock_irqrestore+0x1e/0x30
[848423.023433] [<ffffffff8165344f>] schedule+0x3f/0x60
[848423.023441] [<ffffffff8117e119>] pipe_wait+0x59/0x80
[848423.023449] [<ffffffff81089340>] ? add_wait_queue+0x60/0x60
[848423.023456] [<ffffffff8117e87a>] pipe_read+0x1da/0x330
[848423.023463] [<ffffffff81174522>] do_sync_read+0xd2/0x110
[848423.023471] [<ffffffff8100a25d>] ? xen_force_evtchn_callback+0xd/0x10
[848423.023478] [<ffffffff8100aa32>] ? check_events+0x12/0x20
[848423.023486] [<ffffffff81298d33>] ? security_file_permission+0x93/0xb0
[848423.023493] [<ffffffff811749a1>] ? rw_verify_area+0x61/0xf0
[848423.023499] [<ffffffff81174e80>] vfs_read+0xb0/0x180
[848423.023505] [<ffffffff81174f9a>] sys_read+0x4a/0x90
[848423.023513] [<ffffffff8165d8c2>] system_call_fastpath+0x16/0x1b
[848423.023518] Code: 89 06 48 8b 47 08 48 89 46 08 48 8b 47 10 48 89 46 10 c3 0f 1f 80 00 00 00 00 48 89 32 eb b2 0f 1f 00 48 89 70 10 eb a9 66 90 55 <48> 8b 17 48 89 e5 48 89 d0 48 83 e0 fc 48 39 c7 74 34 48...

Here's another one if it helps:

[848423.023156] BUG: unable to handle kernel NULL pointer dereference at 0000000000000010
[848423.023180] IP: [<ffffffff8130d7f1>] rb_next+0x1/0x50
[848423.023194] PGD 18ad83067 PUD 18ad82067 PMD 0 
[848423.023203] Oops: 0000 [#1] SMP 
[848423.023210] CPU 1 
[848423.023213] Modules linked in: ipt_REJECT xt_tcpudp nf_conntrack_ipv4 nf_defrag_ipv4 xt_state nf_conntrack iptable_filter ip_tables x_tables parport_pc ppdev lp parport isofs acpiphp
[848423.023245] 
[848423.023249] Pid: 18572, comm: chef-client Not tainted 3.2.0-24-virtual #37-Ubuntu  
[848423.023258] RIP: e030:[<ffffffff8130d7f1>]  [<ffffffff8130d7f1>] rb_next+0x1/0x50
[848423.023267] RSP: e02b:ffff88018af21c18  EFLAGS: 00010046
[848423.023273] RAX: 0000000000000000 RBX: ffff8801a125ce00 RCX: 0000000000000000
[848423.023278] RDX: fffffffffffffff0 RSI: 0000000000000000 RDI: 0000000000000010
[848423.023285] RBP: ffff88018af21c48 R08: 0000000000000000 R09: 0000000000000000
[848423.023291] R10: ffff8801dff866c0 R11: 0000000000000001 R12: 0000000000000000
[848423.023299] R13: 0000000000000000 R14: 0000000000000000 R15: 0000000003a5a9d0
[848423.023311] FS:  00007fb6bdc16700(0000) GS:ffff8801dff8f000(0000) knlGS:0000000000000000
[848423.023319] CS:  e033 DS: 0000 ES: 0000 CR0: 000000008005003b
[848423.023325] CR2: 0000000000000010 CR3: 00000001a12e8000 CR4: 0000000000002660
[848423.023332] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[848423.023339] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
[848423.023346] Process chef-client (pid: 18572, threadinfo ffff88018af20000, task ffff8801a12f8000)
[848423.023353] Stack:
[848423.023356]  ffff88018af21c48 ffffffff8104ece9 ffff8801a125ce00 ffff8801dffa26c0
[848423.023369]  ffff8801d29cc600 0000000000000000 ffff88018af21c78 ffffffff810544b8
[848423.023380]  ffff88018af21c78 ffff8801dffa26c0 0000000000000001 ffff8801a12f83a8
[848423.023392] Call Trace:
[848423.023402]  [<ffffffff8104ece9>] ? pick_next_entity+0xb9/0xe0
[848423.023410]  [<ffffffff810544b8>] pick_next_task_fair+0x38/0x70
[848423.023419]  [<ffffffff81652ddc>] __schedule+0x14c/0x6f0
[848423.023426]  [<ffffffff816554ee>] ? _raw_spin_unlock_irqrestore+0x1e/0x30
[848423.023433]  [<ffffffff8165344f>] schedule+0x3f/0x60
[848423.023441]  [<ffffffff8117e119>] pipe_wait+0x59/0x80
[848423.023449]  [<ffffffff81089340>] ? add_wait_queue+0x60/0x60
[848423.023456]  [<ffffffff8117e87a>] pipe_read+0x1da/0x330
[848423.023463]  [<ffffffff81174522>] do_sync_read+0xd2/0x110
[848423.023471]  [<ffffffff8100a25d>] ? xen_force_evtchn_callback+0xd/0x10
[848423.023478]  [<ffffffff8100aa32>] ? check_events+0x12/0x20
[848423.023486]  [<ffffffff81298d33>] ? security_file_permission+0x93/0xb0
[848423.023493]  [<ffffffff811749a1>] ? rw_verify_area+0x61/0xf0
[848423.023499]  [<ffffffff81174e80>] vfs_read+0xb0/0x180
[848423.023505]  [<ffffffff81174f9a>] sys_read+0x4a/0x90
[848423.023513]  [<ffffffff8165d8c2>] system_call_fastpath+0x16/0x1b
[848423.023518] Code: 89 06 48 8b 47 08 48 89 46 08 48 8b 47 10 48 89 46 10 c3 0f 1f 80 00 00 00 00 48 89 32 eb b2 0f 1f 00 48 89 70 10 eb a9 66 90 55 <48> 8b 17 48 89 e5 48 89 d0 48 83 e0 fc 48 39 c7 74 34 48 8b 47 
[848423.023608] RIP  [<ffffffff8130d7f1>] rb_next+0x1/0x50
[848423.023615]  RSP <ffff88018af21c18>
[848423.023619] CR2: 0000000000000010
[848423.023637] ---[ end trace 5361bc1ec5930886 ]---

Revision history for this message

Karl Matthias (0k53-karl) wrote on 2012-05-17:

#6

Forgot to post this earlier. Re: Brad's request for crash logs. apport-cli says "No pending crash reports. Try --help for more information."

Changed in linux (Ubuntu):
status:	Incomplete → Confirmed

Revision history for this message

Gavin Heavyside (mydrive) wrote on 2012-05-19:

#7

Download full text (4.8 KiB)

I've reproduced this by running the OHAI command from the OpsCode Chef ohai gem (0.6.12) in a loop, although it took nearly 2 days before it triggered. Basically I ran `gem install ohai; while true; do ohai; done` in a screen session.

The stack trace is:

[18362917.357055] BUG: unable to handle kernel NULL pointer dereference at 0000000000000010
[18362917.357079] IP: [<ffffffff8130d7f1>] rb_next+0x1/0x50
[18362917.357098] PGD 1d098d067 PUD 1d045b067 PMD 0
[18362917.357110] Oops: 0000 [#1] SMP
[18362917.357122] CPU 0
[18362917.357126] Modules linked in: ipt_REJECT xt_tcpudp nf_conntrack_ipv4 nf_defrag_ipv4 xt_state nf_conntrack iptable_filter ip_tables x_tables isofs acpiphp
[18362917.357152]
[18362917.357157] Pid: 21217, comm: ohai Not tainted 3.2.0-23-virtual #36-Ubuntu
[18362917.357166] RIP: e030:[<ffffffff8130d7f1>] [<ffffffff8130d7f1>] rb_next+0x1/0x50
[18362917.357176] RSP: e02b:ffff8801d22f1808 EFLAGS: 00010046
[18362917.357181] RAX: 0000000000000000 RBX: ffff8801d0842600 RCX: 0000000000000000
[18362917.357187] RDX: fffffffffffffff0 RSI: 0000000000000000 RDI: 0000000000000010
[18362917.357193] RBP: ffff8801d22f1838 R08: 0000000000000000 R09: 0000000000000000
[18362917.357199] R10: ffff8801dffa26c0 R11: 0000000000000001 R12: 0000000000000000
[18362917.357207] R13: 0000000000000000 R14: 0000000000000008 R15: ffff8801d0dc2d00
[18362917.357218] FS: 00007fcdb3810700(0000) GS:ffff8801dff73000(0000) knlGS:0000000000000000
[18362917.357225] CS: e033 DS: 0000 ES: 0000 CR0: 000000008005003b
[18362917.357232] CR2: 0000000000000010 CR3: 00000001d2641000 CR4: 0000000000002660
[18362917.357240] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[18362917.357246] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
[18362917.357253] Process ohai (pid: 21217, threadinfo ffff8801d22f0000, task ffff8801d0ad44a0)
[18362917.357261] Stack:
[18362917.357265] ffff8801d22f1838 ffffffff8104ece9 ffff8801d0842600 ffff8801dff866c0
[18362917.357277] ffff8801d0842e00 0000000000000008 ffff8801d22f1868 ffffffff810544b8
[18362917.357289] ffff8801d22f1868 ffff8801dff866c0 0000000000000000 ffff8801d0ad4848
[18362917.357301] Call Trace:
[18362917.357314] [<ffffffff8104ece9>] ? pick_next_entity+0xb9/0xe0
[18362917.357322] [<ffffffff810544b8>] pick_next_task_fair+0x38/0x70
[18362917.357331] [<ffffffff81652ddc>] __schedule+0x14c/0x6f0
[18362917.357341] [<ffffffff8100a25d>] ? xen_force_evtchn_callback+0xd/0x10
[18362917.357348] [<ffffffff8165344f>] schedule+0x3f/0x60
[18362917.357355] [<ffffffff8165456d>] schedule_hrtimeout_range_clock+0x14d/0x170
[18362917.357365] [<ffffffff8100aa1f>] ? xen_restore_fl_direct_reloc+0x4/0x4
[18362917.357373] [<ffffffff816554ee>] ? _raw_spin_unlock_irqrestore+0x1e/0x30
[18362917.357383] [<ffffffff8108932d>] ? add_wait_queue+0x4d/0x60
[18362917.357389] [<ffffffff816545a3>] schedule_hrtimeout_range+0x13/0x20
[18362917.357400] [<ffffffff811877a9>] poll_schedule_timeout+0x49/0x70
[18362917.357408] [<ffffffff81188326>] do_select+0x4d6/0x600
[18362917.357414] [<ffffffff811878b0>] ? poll_freewait+0xe0/0xe0
[18362917.357422] [<ffffffff811879a0>] ? __pollwait+0xf0/0xf0
[18362917.357431] [<fffffff...

I've reproduced this by running the OHAI command from the OpsCode Chef ohai gem (0.6.12) in a loop, although it took nearly 2 days before it triggered. Basically I ran `gem install ohai; while true; do ohai; done` in a screen session.

The stack trace is:

[18362917.357055] BUG: unable to handle kernel NULL pointer dereference at 0000000000000010
[18362917.357079] IP: [<ffffffff8130d7f1>] rb_next+0x1/0x50
[18362917.357098] PGD 1d098d067 PUD 1d045b067 PMD 0 
[18362917.357110] Oops: 0000 [#1] SMP 
[18362917.357122] CPU 0 
[18362917.357126] Modules linked in: ipt_REJECT xt_tcpudp nf_conntrack_ipv4 nf_defrag_ipv4 xt_state nf_conntrack iptable_filter ip_tables x_tables isofs acpiphp
[18362917.357152] 
[18362917.357157] Pid: 21217, comm: ohai Not tainted 3.2.0-23-virtual #36-Ubuntu  
[18362917.357166] RIP: e030:[<ffffffff8130d7f1>]  [<ffffffff8130d7f1>] rb_next+0x1/0x50
[18362917.357176] RSP: e02b:ffff8801d22f1808  EFLAGS: 00010046
[18362917.357181] RAX: 0000000000000000 RBX: ffff8801d0842600 RCX: 0000000000000000
[18362917.357187] RDX: fffffffffffffff0 RSI: 0000000000000000 RDI: 0000000000000010
[18362917.357193] RBP: ffff8801d22f1838 R08: 0000000000000000 R09: 0000000000000000
[18362917.357199] R10: ffff8801dffa26c0 R11: 0000000000000001 R12: 0000000000000000
[18362917.357207] R13: 0000000000000000 R14: 0000000000000008 R15: ffff8801d0dc2d00
[18362917.357218] FS:  00007fcdb3810700(0000) GS:ffff8801dff73000(0000) knlGS:0000000000000000
[18362917.357225] CS:  e033 DS: 0000 ES: 0000 CR0: 000000008005003b
[18362917.357232] CR2: 0000000000000010 CR3: 00000001d2641000 CR4: 0000000000002660
[18362917.357240] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[18362917.357246] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
[18362917.357253] Process ohai (pid: 21217, threadinfo ffff8801d22f0000, task ffff8801d0ad44a0)
[18362917.357261] Stack:
[18362917.357265]  ffff8801d22f1838 ffffffff8104ece9 ffff8801d0842600 ffff8801dff866c0
[18362917.357277]  ffff8801d0842e00 0000000000000008 ffff8801d22f1868 ffffffff810544b8
[18362917.357289]  ffff8801d22f1868 ffff8801dff866c0 0000000000000000 ffff8801d0ad4848
[18362917.357301] Call Trace:
[18362917.357314]  [<ffffffff8104ece9>] ? pick_next_entity+0xb9/0xe0
[18362917.357322]  [<ffffffff810544b8>] pick_next_task_fair+0x38/0x70
[18362917.357331]  [<ffffffff81652ddc>] __schedule+0x14c/0x6f0
[18362917.357341]  [<ffffffff8100a25d>] ? xen_force_evtchn_callback+0xd/0x10
[18362917.357348]  [<ffffffff8165344f>] schedule+0x3f/0x60
[18362917.357355]  [<ffffffff8165456d>] schedule_hrtimeout_range_clock+0x14d/0x170
[18362917.357365]  [<ffffffff8100aa1f>] ? xen_restore_fl_direct_reloc+0x4/0x4
[18362917.357373]  [<ffffffff816554ee>] ? _raw_spin_unlock_irqrestore+0x1e/0x30
[18362917.357383]  [<ffffffff8108932d>] ? add_wait_queue+0x4d/0x60
[18362917.357389]  [<ffffffff816545a3>] schedule_hrtimeout_range+0x13/0x20
[18362917.357400]  [<ffffffff811877a9>] poll_schedule_timeout+0x49/0x70
[18362917.357408]  [<ffffffff81188326>] do_select+0x4d6/0x600
[18362917.357414]  [<ffffffff811878b0>] ? poll_freewait+0xe0/0xe0
[18362917.357422]  [<ffffffff811879a0>] ? __pollwait+0xf0/0xf0
[18362917.357431]  [<ffffffff81306cd6>] ? cpumask_next_and+0x36/0x50
[18362917.357438]  [<ffffffff81052124>] ? select_idle_sibling+0x174/0x220
[18362917.357445]  [<ffffffff8130bbdb>] ? radix_tree_lookup+0xb/0x10
[18362917.357453]  [<ffffffff810d61c7>] ? irq_to_desc+0x17/0x20
[18362917.357461]  [<ffffffff810d902e>] ? irq_get_irq_data+0xe/0x10
[18362917.357472]  [<ffffffff813a404e>] ? info_for_irq+0xe/0x30
[18362917.357478]  [<ffffffff81306cd6>] ? cpumask_next_and+0x36/0x50
[18362917.357487]  [<ffffffff810592d1>] ? find_busiest_group+0x171/0xbb0
[18362917.357494]  [<ffffffff81188611>] core_sys_select+0x1c1/0x330
[18362917.357501]  [<ffffffff816554ee>] ? _raw_spin_unlock_irqrestore+0x1e/0x30
[18362917.357511]  [<ffffffff8163d029>] ? idle_balance+0xf0/0x11b
[18362917.357517]  [<ffffffff8100a25d>] ? xen_force_evtchn_callback+0xd/0x10
[18362917.357524]  [<ffffffff8100aa32>] ? check_events+0x12/0x20
[18362917.357531]  [<ffffffff8100aa1f>] ? xen_restore_fl_direct_reloc+0x4/0x4
[18362917.357538]  [<ffffffff81004c62>] ? xen_mc_flush+0xb2/0x1c0
[18362917.357545]  [<ffffffff8100aa1f>] ? xen_restore_fl_direct_reloc+0x4/0x4
[18362917.357552]  [<ffffffff811889bb>] sys_select+0xbb/0x100
[18362917.357559]  [<ffffffff8105edc7>] ? schedule_tail+0x27/0xb0
[18362917.357568]  [<ffffffff8165d8c2>] system_call_fastpath+0x16/0x1b
[18362917.357573] Code: 89 06 48 8b 47 08 48 89 46 08 48 8b 47 10 48 89 46 10 c3 0f 1f 80 00 00 00 00 48 89 32 eb b2 0f 1f 00 48 89 70 10 eb a9 66 90 55 <48> 8b 17 48 89 e5 48 89 d0 48 83 e0 fc 48 39 c7 74 34 48 8b 47 
[18362917.357653] RIP  [<ffffffff8130d7f1>] rb_next+0x1/0x50
[18362917.357660]  RSP <ffff8801d22f1808>
[18362917.357664] CR2: 0000000000000010
[18362917.357673] ---[ end trace de16620c8d9e9c7c ]---

Revision history for this message

Gavin Heavyside (mydrive) wrote on 2012-05-19:

#8

BTW Xen version from dmesg is:

Xen version: 3.4.3-2.6.18 (preserve-AD)

This is on EC2 so we have no control over this.

Revision history for this message

Gavin Heavyside (mydrive) wrote on 2012-05-20:

#9

Download full text (4.7 KiB)

Triggered this again by running ohai in a continuous loop, took about 24 hours to occur:

[18438803.627371] BUG: unable to handle kernel NULL pointer dereference at 0000000000000010
[18438803.627388] IP: [<ffffffff8130d7f1>] rb_next+0x1/0x50
[18438803.627402] PGD 1d0efa067 PUD 1d232d067 PMD 0
[18438803.627411] Oops: 0000 [#1] SMP
[18438803.627419] CPU 1
[18438803.627422] Modules linked in: ipt_REJECT xt_tcpudp nf_conntrack_ipv4 nf_defrag_ipv4 xt_state nf_conntrack iptable_filter ip_tables x_tables isofs acpiphp
[18438803.627447]
[18438803.627452] Pid: 29083, comm: ohai Not tainted 3.2.0-23-virtual #36-Ubuntu
[18438803.627460] RIP: e030:[<ffffffff8130d7f1>] [<ffffffff8130d7f1>] rb_next+0x1/0x50
[18438803.627469] RSP: e02b:ffff8801d225d808 EFLAGS: 00010046
[18438803.627473] RAX: 0000000000000000 RBX: ffff8801d2232400 RCX: 0000000000000000
[18438803.627479] RDX: fffffffffffffff0 RSI: ffff8801dffa2760 RDI: 0000000000000010
[18438803.627485] RBP: ffff8801d225d838 R08: 0000000000000000 R09: 0000000000000000
[18438803.627490] R10: ffff8801dff866c0 R11: 0000000000000000 R12: 0000000000000000
[18438803.627497] R13: 0000000000000000 R14: 0000000000000280 R15: ffff8801d0992300
[18438803.627508] FS: 00007f34206c2700(0000) GS:ffff8801dff8f000(0000) knlGS:0000000000000000
[18438803.627515] CS: e033 DS: 0000 ES: 0000 CR0: 000000008005003b
[18438803.627521] CR2: 0000000000000010 CR3: 00000001d0e9e000 CR4: 0000000000002660
[18438803.627527] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[18438803.627534] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
[18438803.627541] Process ohai (pid: 29083, threadinfo ffff8801d225c000, task ffff8801d260adc0)
[18438803.627547] Stack:
[18438803.627551] ffff8801d225d838 ffffffff8104ece9 ffff8801d2232400 ffff8801dffa26c0
[18438803.627562] ffff8801d0f8fc00 0000000000000280 ffff8801d225d868 ffffffff810544b8
[18438803.627573] ffff8801d225d868 ffff8801dffa26c0 0000000000000001 ffff8801d260b168
[18438803.627584] Call Trace:
[18438803.627596] [<ffffffff8104ece9>] ? pick_next_entity+0xb9/0xe0
[18438803.627604] [<ffffffff810544b8>] pick_next_task_fair+0x38/0x70
[18438803.627861] [<ffffffff81652ddc>] __schedule+0x14c/0x6f0
[18438803.627874] [<ffffffff8111d335>] ? prep_new_page+0x145/0x1e0
[18438803.627881] [<ffffffff8165344f>] schedule+0x3f/0x60
[18438803.627889] [<ffffffff8165454c>] schedule_hrtimeout_range_clock+0x12c/0x170
[18438803.627901] [<ffffffff8108c890>] ? update_rmtp+0x70/0x70
[18438803.627908] [<ffffffff8108d684>] ? hrtimer_start_range_ns+0x14/0x20
[18438803.627916] [<ffffffff816545a3>] schedule_hrtimeout_range+0x13/0x20
[18438803.627927] [<ffffffff811877a9>] poll_schedule_timeout+0x49/0x70
[18438803.627934] [<ffffffff81188326>] do_select+0x4d6/0x600
[18438803.627942] [<ffffffff811878b0>] ? poll_freewait+0xe0/0xe0
[18438803.627949] [<ffffffff811879a0>] ? __pollwait+0xf0/0xf0
[18438803.627956] [<ffffffff811879a0>] ? __pollwait+0xf0/0xf0
[18438803.627966] [<ffffffff8100a25d>] ? xen_force_evtchn_callback+0xd/0x10
[18438803.627974] [<ffffffff8100aa32>] ? check_events+0x12/0x20
[18438803.627981] [<ffffffff8100a25d>] ? xen_force_evtchn_callback+0xd/0x10
[1843880...

Triggered this again by running ohai in a continuous loop, took about 24 hours to occur:

[18438803.627371] BUG: unable to handle kernel NULL pointer dereference at 0000000000000010
[18438803.627388] IP: [<ffffffff8130d7f1>] rb_next+0x1/0x50
[18438803.627402] PGD 1d0efa067 PUD 1d232d067 PMD 0 
[18438803.627411] Oops: 0000 [#1] SMP 
[18438803.627419] CPU 1 
[18438803.627422] Modules linked in: ipt_REJECT xt_tcpudp nf_conntrack_ipv4 nf_defrag_ipv4 xt_state nf_conntrack iptable_filter ip_tables x_tables isofs acpiphp
[18438803.627447] 
[18438803.627452] Pid: 29083, comm: ohai Not tainted 3.2.0-23-virtual #36-Ubuntu  
[18438803.627460] RIP: e030:[<ffffffff8130d7f1>]  [<ffffffff8130d7f1>] rb_next+0x1/0x50
[18438803.627469] RSP: e02b:ffff8801d225d808  EFLAGS: 00010046
[18438803.627473] RAX: 0000000000000000 RBX: ffff8801d2232400 RCX: 0000000000000000
[18438803.627479] RDX: fffffffffffffff0 RSI: ffff8801dffa2760 RDI: 0000000000000010
[18438803.627485] RBP: ffff8801d225d838 R08: 0000000000000000 R09: 0000000000000000
[18438803.627490] R10: ffff8801dff866c0 R11: 0000000000000000 R12: 0000000000000000
[18438803.627497] R13: 0000000000000000 R14: 0000000000000280 R15: ffff8801d0992300
[18438803.627508] FS:  00007f34206c2700(0000) GS:ffff8801dff8f000(0000) knlGS:0000000000000000
[18438803.627515] CS:  e033 DS: 0000 ES: 0000 CR0: 000000008005003b
[18438803.627521] CR2: 0000000000000010 CR3: 00000001d0e9e000 CR4: 0000000000002660
[18438803.627527] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[18438803.627534] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
[18438803.627541] Process ohai (pid: 29083, threadinfo ffff8801d225c000, task ffff8801d260adc0)
[18438803.627547] Stack:
[18438803.627551]  ffff8801d225d838 ffffffff8104ece9 ffff8801d2232400 ffff8801dffa26c0
[18438803.627562]  ffff8801d0f8fc00 0000000000000280 ffff8801d225d868 ffffffff810544b8
[18438803.627573]  ffff8801d225d868 ffff8801dffa26c0 0000000000000001 ffff8801d260b168
[18438803.627584] Call Trace:
[18438803.627596]  [<ffffffff8104ece9>] ? pick_next_entity+0xb9/0xe0
[18438803.627604]  [<ffffffff810544b8>] pick_next_task_fair+0x38/0x70
[18438803.627861]  [<ffffffff81652ddc>] __schedule+0x14c/0x6f0
[18438803.627874]  [<ffffffff8111d335>] ? prep_new_page+0x145/0x1e0
[18438803.627881]  [<ffffffff8165344f>] schedule+0x3f/0x60
[18438803.627889]  [<ffffffff8165454c>] schedule_hrtimeout_range_clock+0x12c/0x170
[18438803.627901]  [<ffffffff8108c890>] ? update_rmtp+0x70/0x70
[18438803.627908]  [<ffffffff8108d684>] ? hrtimer_start_range_ns+0x14/0x20
[18438803.627916]  [<ffffffff816545a3>] schedule_hrtimeout_range+0x13/0x20
[18438803.627927]  [<ffffffff811877a9>] poll_schedule_timeout+0x49/0x70
[18438803.627934]  [<ffffffff81188326>] do_select+0x4d6/0x600
[18438803.627942]  [<ffffffff811878b0>] ? poll_freewait+0xe0/0xe0
[18438803.627949]  [<ffffffff811879a0>] ? __pollwait+0xf0/0xf0
[18438803.627956]  [<ffffffff811879a0>] ? __pollwait+0xf0/0xf0
[18438803.627966]  [<ffffffff8100a25d>] ? xen_force_evtchn_callback+0xd/0x10
[18438803.627974]  [<ffffffff8100aa32>] ? check_events+0x12/0x20
[18438803.627981]  [<ffffffff8100a25d>] ? xen_force_evtchn_callback+0xd/0x10
[18438803.627988]  [<ffffffff8100aa32>] ? check_events+0x12/0x20
[18438803.627995]  [<ffffffff8100aa1f>] ? xen_restore_fl_direct_reloc+0x4/0x4
[18438803.628003]  [<ffffffff81006d1d>] ? xen_flush_tlb_single+0xbd/0x210
[18438803.628013]  [<ffffffff81306dbd>] ? cpumask_any_but+0x2d/0x40
[18438803.628022]  [<ffffffff81044b98>] ? flush_tlb_page+0x48/0xb0
[18438803.628030]  [<ffffffff810438ac>] ? ptep_set_access_flags+0x6c/0x70
[18438803.628038]  [<ffffffff81138c52>] ? do_wp_page+0x382/0x740
[18438803.628045]  [<ffffffff81006739>] ? pte_mfn_to_pfn+0x89/0xf0
[18438803.628053]  [<ffffffff81005209>] ? __raw_callee_save_xen_pmd_val+0x11/0x1e
[18438803.628061]  [<ffffffff81188611>] core_sys_select+0x1c1/0x330
[18438803.628069]  [<ffffffff8113af98>] ? handle_mm_fault+0x1f8/0x350
[18438803.628076]  [<ffffffff8103cc65>] ? pvclock_clocksource_read+0x55/0xf0
[18438803.628085]  [<ffffffff8100a540>] ? xen_clocksource_read+0x20/0x30
[18438803.628092]  [<ffffffff8100a629>] ? xen_clocksource_get_cycles+0x9/0x10
[18438803.628101]  [<ffffffff810933ed>] ? ktime_get_ts+0xad/0xe0
[18438803.628108]  [<ffffffff811889bb>] sys_select+0xbb/0x100
[18438803.628117]  [<ffffffff8165d8c2>] system_call_fastpath+0x16/0x1b
[18438803.628123] Code: 89 06 48 8b 47 08 48 89 46 08 48 8b 47 10 48 89 46 10 c3 0f 1f 80 00 00 00 00 48 89 32 eb b2 0f 1f 00 48 89 70 10 eb a9 66 90 55 <48> 8b 17 48 89 e5 48 89 d0 48 83 e0 fc 48 39 c7 74 34 48 8b 47 
[18438803.628207] RIP  [<ffffffff8130d7f1>] rb_next+0x1/0x50
[18438803.628215]  RSP <ffff8801d225d808>
[18438803.628219] CR2: 0000000000000010
[18438803.628229] ---[ end trace 6e3e751b67665edf ]---

Revision history for this message

Stefan Bader (smb) wrote on 2012-05-21:

#10

The Xen version is mostly to check for correlations. So far both times those have been reported it seemed to be a 3.4.3 version (not sure whether one can still hit other versions, but if then it would be great to know whether those are affected the same way). I know its not possible to know what one gets beforehand.

At least in all cases it is the same fatal condition. I suspect some pointer is unexpectedly NULL and then trying to reference an element at offset 0x10. It is in the scheduling code but at least there were two paths that seemed to lead there. So maybe it is not so much a specific action than scheduling new processes a lot...

This likely will be a beast to find as whatever is the cause is not the place it breaks. I'll try to isolate the exact instruction that breaks. Maybe it is possible to find out more by detecting the corruption and dumping more information. And also try to re-create this on a local system hopefully close enough to ec2...

Meanwhile if anybody can confirm breakage (or not) with other versions of Xen that may or may not be seen, please add to this report.

Revision history for this message

Stefan Bader (smb) wrote on 2012-05-21:

#11

Also I see that a new kernel has just been pushed out. It cannot hurt doing tests running that (3.2.0-24.38).

Revision history for this message

Stefan Bader (smb) wrote on 2012-05-21:

#12

Writing that I noticed that the traces differ based on different kernels. One set has -24.37 and the other -23.36. So that may just cause a slightly different timing or location of certain things...

Revision history for this message

Karl Matthias (0k53-karl) wrote on 2012-05-22:

#13

Thanks, we're running tests on the 3.2.0-24.38 kernel now to see if we can get it to crash in the same way. Yes, this affected both the previous versions: -24.37 and -23.36.

Revision history for this message

Stefan Bader (smb) wrote on 2012-05-22:

#14

So from the disassembly and the registers of the crash, it is clear that both variations at some point do a schedule which calls on to pick_next_task_fair(). As that calls into pick_next_entity() it can be assumed that (struct cfs_rq *)->nr_running is not 0.
But then (struct cfs_rq *)->rb_leftmost seem to be NULL (as well as (struct cfs_rq *)->skip).
The code in pick_next_entity() never checks for rb_leftmost to be NULL but I assume that this should never happen as long as the number of running entities is not 0, too.

This does not really help in understanding what goes wrong, but it should allow to create a special kernel that would dump the memory of the affected structure before dying. Might be interesting to see whether the areas around that pointer seem wrong as well. I have not heard, yet, of similar issues when not running as PVM guests on EC2. That points to some Xen interference but how I cannot explain.

Revision history for this message

Karl Matthias (0k53-karl) wrote on 2012-05-23:

#15

Stefan, if you have a kernel build with memory dumps to help debug this, I'm happy to try it.

Revision history for this message

Stefan Bader (smb) wrote on 2012-05-24:

#16

A first attempt to get more information can be found at http://people.canonical.com/~smb/lp999755/ (I took the older -24.37 version as that at least is confirmed to be broken). Up to now I was not able to cause a similar crash (or in fact any crash) locally. But then I am not sure I how to get the same environment (lots of things I never used before).

Revision history for this message

Karl Matthias (0k53-karl) wrote on 2012-05-24:

#17

Excellent, thanks for that. I'll drop it on a box tomorrow (UK standard time).

Revision history for this message

Gavin Heavyside (mydrive) wrote on 2012-05-25:

#18

Download full text (3.3 KiB)

We've also seen this on the -24.38 and -24.39 kernels now:

[56843.390534] BUG: unable to handle kernel NULL pointer dereference at 0000000000000010
[56843.390551] IP: [<ffffffff8130d901>] rb_next+0x1/0x50
[56843.390566] PGD 1d20a7067 PUD 1d29a2067 PMD 0
[56843.390575] Oops: 0000 [#1] SMP
[56843.390583] CPU 1
[56843.390585] Modules linked in: ipt_REJECT xt_tcpudp nf_conntrack_ipv4 nf_defrag_ipv4 xt_state nf_conntrack iptable_filter ip_tables x_tables parport_pc ppdev lp parport isofs acpiphp
[56843.390616]
[56843.390621] Pid: 1340, comm: chef-client Not tainted 3.2.0-24-virtual #39-Ubuntu
[56843.390629] RIP: e030:[<ffffffff8130d901>] [<ffffffff8130d901>] rb_next+0x1/0x50
[56843.390638] RSP: e02b:ffff8801d050dc18 EFLAGS: 00010046
[56843.390642] RAX: 0000000000000000 RBX: ffff8801d24c8e00 RCX: 0000000000000000
[56843.390648] RDX: fffffffffffffff0 RSI: 0000000000000000 RDI: 0000000000000010
[56843.390654] RBP: ffff8801d050dc48 R08: 0000000000000000 R09: 0000000000000000
[56843.390659] R10: ffff8801dff866c0 R11: 0000000000000001 R12: 0000000000000000
[56843.390664] R13: 0000000000000000 R14: 0000000000000000 R15: 00000000011168a8
[56843.390674] FS: 00007fda5feaa700(0000) GS:ffff8801dff8f000(0000) knlGS:0000000000000000
[56843.390681] CS: e033 DS: 0000 ES: 0000 CR0: 000000008005003b
[56843.390686] CR2: 0000000000000010 CR3: 00000001d28cc000 CR4: 0000000000002660
[56843.390692] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[56843.390698] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
[56843.390704] Process chef-client (pid: 1340, threadinfo ffff8801d050c000, task ffff8801d2405b80)
[56843.390711] Stack:
[56843.390714] ffff8801d050dc48 ffffffff8104ee19 ffff8801d24c8e00 ffff8801dffa26c0
[56843.390726] ffff8801d24c9400 0000000000000000 ffff8801d050dc78 ffffffff810545e8
[56843.390737] ffff8801d050dc78 ffff8801dffa26c0 0000000000000001 ffff8801d2405f28
[56843.390748] Call Trace:
[56843.390758] [<ffffffff8104ee19>] ? pick_next_entity+0xb9/0xe0
[56843.390765] [<ffffffff810545e8>] pick_next_task_fair+0x38/0x70
[56843.390774] [<ffffffff816530dc>] __schedule+0x14c/0x6f0
[56843.390781] [<ffffffff816557ee>] ? _raw_spin_unlock_irqrestore+0x1e/0x30
[56843.390789] [<ffffffff8165374f>] schedule+0x3f/0x60
[56843.390797] [<ffffffff8117e209>] pipe_wait+0x59/0x80
[56843.390804] [<ffffffff81089470>] ? add_wait_queue+0x60/0x60
[56843.390810] [<ffffffff8117e96a>] pipe_read+0x1da/0x330
[56843.390817] [<ffffffff81174612>] do_sync_read+0xd2/0x110
[56843.390824] [<ffffffff8100a25d>] ? xen_force_evtchn_callback+0xd/0x10
[56843.390831] [<ffffffff8100aa32>] ? check_events+0x12/0x20
[56843.390838] [<ffffffff81298dc3>] ? security_file_permission+0x93/0xb0
[56843.390845] [<ffffffff81174a91>] ? rw_verify_area+0x61/0xf0
[56843.390851] [<ffffffff81174f70>] vfs_read+0xb0/0x180
[56843.390857] [<ffffffff8117508a>] sys_read+0x4a/0x90
[56843.390863] [<ffffffff8165dbc2>] system_call_fastpath+0x16/0x1b
[56843.390869] Code: 89 06 48 8b 47 08 48 89 46 08 48 8b 47 10 48 89 46 10 c3 0f 1f 80 00 00 00 00 48 89 32 eb b2 0f 1f 00 48 89 70 10 eb a9 66 90 55 <48> 8b 17 48 89 e5 48 89 d0 48 83 e0 fc 48 39 c7 74 34 48 8b 47
[56843.390...

We've also seen this on the -24.38 and -24.39 kernels now:

[56843.390534] BUG: unable to handle kernel NULL pointer dereference at 0000000000000010
[56843.390551] IP: [<ffffffff8130d901>] rb_next+0x1/0x50
[56843.390566] PGD 1d20a7067 PUD 1d29a2067 PMD 0 
[56843.390575] Oops: 0000 [#1] SMP 
[56843.390583] CPU 1 
[56843.390585] Modules linked in: ipt_REJECT xt_tcpudp nf_conntrack_ipv4 nf_defrag_ipv4 xt_state nf_conntrack iptable_filter ip_tables x_tables parport_pc ppdev lp parport isofs acpiphp
[56843.390616] 
[56843.390621] Pid: 1340, comm: chef-client Not tainted 3.2.0-24-virtual #39-Ubuntu  
[56843.390629] RIP: e030:[<ffffffff8130d901>]  [<ffffffff8130d901>] rb_next+0x1/0x50
[56843.390638] RSP: e02b:ffff8801d050dc18  EFLAGS: 00010046
[56843.390642] RAX: 0000000000000000 RBX: ffff8801d24c8e00 RCX: 0000000000000000
[56843.390648] RDX: fffffffffffffff0 RSI: 0000000000000000 RDI: 0000000000000010
[56843.390654] RBP: ffff8801d050dc48 R08: 0000000000000000 R09: 0000000000000000
[56843.390659] R10: ffff8801dff866c0 R11: 0000000000000001 R12: 0000000000000000
[56843.390664] R13: 0000000000000000 R14: 0000000000000000 R15: 00000000011168a8
[56843.390674] FS:  00007fda5feaa700(0000) GS:ffff8801dff8f000(0000) knlGS:0000000000000000
[56843.390681] CS:  e033 DS: 0000 ES: 0000 CR0: 000000008005003b
[56843.390686] CR2: 0000000000000010 CR3: 00000001d28cc000 CR4: 0000000000002660
[56843.390692] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[56843.390698] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
[56843.390704] Process chef-client (pid: 1340, threadinfo ffff8801d050c000, task ffff8801d2405b80)
[56843.390711] Stack:
[56843.390714]  ffff8801d050dc48 ffffffff8104ee19 ffff8801d24c8e00 ffff8801dffa26c0
[56843.390726]  ffff8801d24c9400 0000000000000000 ffff8801d050dc78 ffffffff810545e8
[56843.390737]  ffff8801d050dc78 ffff8801dffa26c0 0000000000000001 ffff8801d2405f28
[56843.390748] Call Trace:
[56843.390758]  [<ffffffff8104ee19>] ? pick_next_entity+0xb9/0xe0
[56843.390765]  [<ffffffff810545e8>] pick_next_task_fair+0x38/0x70
[56843.390774]  [<ffffffff816530dc>] __schedule+0x14c/0x6f0
[56843.390781]  [<ffffffff816557ee>] ? _raw_spin_unlock_irqrestore+0x1e/0x30
[56843.390789]  [<ffffffff8165374f>] schedule+0x3f/0x60
[56843.390797]  [<ffffffff8117e209>] pipe_wait+0x59/0x80
[56843.390804]  [<ffffffff81089470>] ? add_wait_queue+0x60/0x60
[56843.390810]  [<ffffffff8117e96a>] pipe_read+0x1da/0x330
[56843.390817]  [<ffffffff81174612>] do_sync_read+0xd2/0x110
[56843.390824]  [<ffffffff8100a25d>] ? xen_force_evtchn_callback+0xd/0x10
[56843.390831]  [<ffffffff8100aa32>] ? check_events+0x12/0x20
[56843.390838]  [<ffffffff81298dc3>] ? security_file_permission+0x93/0xb0
[56843.390845]  [<ffffffff81174a91>] ? rw_verify_area+0x61/0xf0
[56843.390851]  [<ffffffff81174f70>] vfs_read+0xb0/0x180
[56843.390857]  [<ffffffff8117508a>] sys_read+0x4a/0x90
[56843.390863]  [<ffffffff8165dbc2>] system_call_fastpath+0x16/0x1b
[56843.390869] Code: 89 06 48 8b 47 08 48 89 46 08 48 8b 47 10 48 89 46 10 c3 0f 1f 80 00 00 00 00 48 89 32 eb b2 0f 1f 00 48 89 70 10 eb a9 66 90 55 <48> 8b 17 48 89 e5 48 89 d0 48 83 e0 fc 48 39 c7 74 34 48 8b 47 
[56843.390960] RIP  [<ffffffff8130d901>] rb_next+0x1/0x50
[56843.390967]  RSP <ffff8801d050dc18>
[56843.390970] CR2: 0000000000000010
[56843.390990] ---[ end trace ddb8957a1d924ccc ]---

Revision history for this message

Karl Matthias (0k53-karl) wrote on 2012-05-25:

#19

We have the debugging kernel installed on a box that has been happily crashing for the last few days. We'll see if we can get a good debug for you from it. What would you like us to gather the next time it crashes?

Revision history for this message

Stefan Bader (smb) wrote on 2012-05-25:

#20

It hopefully prints out the hex content of the data structure I want to check. But that should all be in the crash console output. Apart from that just the xen version is still the same 3.4.3-2.6.18 (preserve-AD) as before.

Revision history for this message

Gavin Heavyside (mydrive) wrote on 2012-05-28:

#21

Download full text (4.0 KiB)

I've just reproduced this crash using the stock 3.2.0-24-39 kernel on VirtualBox on OS X (Lion).

I created a 2-CPU VM using the latest VirtualBox (4.1.16 r78094), for Ubuntu 64-bit, default 8GB disk. The steps I followed were:

* Install 64-bit 12.04 Server LTS, minimal install from ISO downloaded from ubuntu.com/download/server
* apt-get dist-upgrade and reboot
* apt-get install ruby1.9.3 screen
* gem install chef
* In a screen session:
* while true; do ohai; done

After a while (less than an hour) I had reproduced the stack trace from this bug. I added "console=ttyS0" to the startup parameters and had VirtualBox log the serial output to a file so I could capture the stack trace (below).

This means this bug is nothing to do with Xen, and hopefully can be reproduced by developers more easily.

Stefan: If you could build a regular kernel I would be happy to try and capture more info for you.

Thanks,

Gavin.

[ 1119.141743] BUG: unable to handle kernel NULL pointer dereference at 0000000000000010
[ 1119.145307] IP: [<ffffffff81311c81>] rb_next+0x1/0x50
[ 1119.145307] PGD 1be80067 PUD 1f17d067 PMD 0
[ 1119.145307] Oops: 0000 [#1] SMP
[ 1119.145307] CPU 0
[ 1119.145307] Modules linked in: vesafb ext2 snd_intel8x0 psmouse snd_ac97_codec ac97_bus snd_pcm snd_timer ppdev snd soundcore joydev snd_page_alloc serio_raw i2c_piix4 parport_pc mac_hid lp parport usbhid hid e1000
[ 1119.145307]
[ 1119.145307] Pid: 6768, comm: ohai Not tainted 3.2.0-24-generic #39-Ubuntu innotek GmbH VirtualBox
[ 1119.145307] RIP: 0010:[<ffffffff81311c81>] [<ffffffff81311c81>] rb_next+0x1/0x50
[ 1119.145307] RSP: 0018:ffff88001bd8dc18 EFLAGS: 00010046
[ 1119.145307] RAX: 0000000000000000 RBX: ffff88001f232400 RCX: 0000000000000000
[ 1119.145307] RDX: fffffffffffffff0 RSI: 0000000000000000 RDI: 0000000000000010
[ 1119.145307] RBP: ffff88001bd8dc48 R08: 0000000000000001 R09: 0000000000000000
[ 1119.145307] R10: ffff88001fd13780 R11: 0000000000000001 R12: 0000000000000000
[ 1119.145307] R13: 0000000000000000 R14: 0000000000000000 R15: 00000000015010f8
[ 1119.145307] FS: 00007f6a721d7700(0000) GS:ffff88001fc00000(0000) knlGS:0000000000000000
[ 1119.145307] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 1119.145307] CR2: 0000000000000010 CR3: 000000001ce72000 CR4: 00000000000006f0
[ 1119.145307] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[ 1119.145307] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
[ 1119.145307] Process ohai (pid: 6768, threadinfo ffff88001bd8c000, task ffff88001cafdbc0)
[ 1119.145307] Stack:
[ 1119.145307] ffff88001bd8dc48 ffffffff8104ff39 ffff88001f232400 ffff88001fc13780
[ 1119.145307] ffff88001f233a00 0000000000000000 ffff88001bd8dc78 ffffffff810561d8
[ 1119.145307] ffff88001bd8dc78 ffff88001fc13780 0000000000000000 ffff88001cafdf80
[ 1119.145307] Call Trace:
[ 1119.145307] [<ffffffff8104ff39>] ? pick_next_entity+0xb9/0xe0
[ 1119.145307] [<ffffffff810561d8>] pick_next_task_fair+0x38/0x70
[ 1119.145307] [<ffffffff81659fbc>] __schedule+0x14c/0x6f0
[ 1119.145307] [<ffffffff81045d18>] ? flush_tlb_page+0x48/0xb0
[ 1119.145307] [<ffffffff8165a88f>] schedule+0x3f/0x60
[ 1119.145307] [<ffffffff8118...

I've just reproduced this crash using the stock 3.2.0-24-39 kernel on VirtualBox on OS X (Lion).

I created a 2-CPU VM using the latest VirtualBox (4.1.16 r78094), for Ubuntu 64-bit, default 8GB disk. The steps I followed were:

* Install 64-bit 12.04 Server LTS, minimal install from ISO downloaded from ubuntu.com/download/server
* apt-get dist-upgrade and reboot
* apt-get install ruby1.9.3 screen
* gem install chef
* In a screen session:
  * while true; do ohai; done

After a while (less than an hour) I had reproduced the stack trace from this bug. I added "console=ttyS0" to the startup parameters and had VirtualBox log the serial output to a file so I could capture the stack trace (below).

This means this bug is nothing to do with Xen, and hopefully can be reproduced by developers more easily.

Stefan: If you could build a regular kernel I would be happy to try and capture more info for you.

Thanks,

Gavin.

[ 1119.141743] BUG: unable to handle kernel NULL pointer dereference at 0000000000000010
[ 1119.145307] IP: [<ffffffff81311c81>] rb_next+0x1/0x50
[ 1119.145307] PGD 1be80067 PUD 1f17d067 PMD 0 
[ 1119.145307] Oops: 0000 [#1] SMP 
[ 1119.145307] CPU 0 
[ 1119.145307] Modules linked in: vesafb ext2 snd_intel8x0 psmouse snd_ac97_codec ac97_bus snd_pcm snd_timer ppdev snd soundcore joydev snd_page_alloc serio_raw i2c_piix4 parport_pc mac_hid lp parport usbhid hid e1000
[ 1119.145307] 
[ 1119.145307] Pid: 6768, comm: ohai Not tainted 3.2.0-24-generic #39-Ubuntu innotek GmbH VirtualBox
[ 1119.145307] RIP: 0010:[<ffffffff81311c81>]  [<ffffffff81311c81>] rb_next+0x1/0x50
[ 1119.145307] RSP: 0018:ffff88001bd8dc18  EFLAGS: 00010046
[ 1119.145307] RAX: 0000000000000000 RBX: ffff88001f232400 RCX: 0000000000000000
[ 1119.145307] RDX: fffffffffffffff0 RSI: 0000000000000000 RDI: 0000000000000010
[ 1119.145307] RBP: ffff88001bd8dc48 R08: 0000000000000001 R09: 0000000000000000
[ 1119.145307] R10: ffff88001fd13780 R11: 0000000000000001 R12: 0000000000000000
[ 1119.145307] R13: 0000000000000000 R14: 0000000000000000 R15: 00000000015010f8
[ 1119.145307] FS:  00007f6a721d7700(0000) GS:ffff88001fc00000(0000) knlGS:0000000000000000
[ 1119.145307] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 1119.145307] CR2: 0000000000000010 CR3: 000000001ce72000 CR4: 00000000000006f0
[ 1119.145307] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[ 1119.145307] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
[ 1119.145307] Process ohai (pid: 6768, threadinfo ffff88001bd8c000, task ffff88001cafdbc0)
[ 1119.145307] Stack:
[ 1119.145307]  ffff88001bd8dc48 ffffffff8104ff39 ffff88001f232400 ffff88001fc13780
[ 1119.145307]  ffff88001f233a00 0000000000000000 ffff88001bd8dc78 ffffffff810561d8
[ 1119.145307]  ffff88001bd8dc78 ffff88001fc13780 0000000000000000 ffff88001cafdf80
[ 1119.145307] Call Trace:
[ 1119.145307]  [<ffffffff8104ff39>] ? pick_next_entity+0xb9/0xe0
[ 1119.145307]  [<ffffffff810561d8>] pick_next_task_fair+0x38/0x70
[ 1119.145307]  [<ffffffff81659fbc>] __schedule+0x14c/0x6f0
[ 1119.145307]  [<ffffffff81045d18>] ? flush_tlb_page+0x48/0xb0
[ 1119.145307]  [<ffffffff8165a88f>] schedule+0x3f/0x60
[ 1119.145307]  [<ffffffff811811f9>] pipe_wait+0x59/0x80
[ 1119.145307]  [<ffffffff8108aff0>] ? add_wait_queue+0x60/0x60
[ 1119.145307]  [<ffffffff8118196a>] pipe_read+0x1da/0x330
[ 1119.145307]  [<ffffffff81177562>] do_sync_read+0xd2/0x110
[ 1119.145307]  [<ffffffff8129cdb3>] ? security_file_permission+0x93/0xb0
[ 1119.145307]  [<ffffffff811779e1>] ? rw_verify_area+0x61/0xf0
[ 1119.145307]  [<ffffffff81177ec0>] vfs_read+0xb0/0x180
[ 1119.145307]  [<ffffffff81177fda>] sys_read+0x4a/0x90
[ 1119.145307]  [<ffffffff81664d82>] system_call_fastpath+0x16/0x1b
[ 1119.145307] Code: 89 06 48 8b 47 08 48 89 46 08 48 8b 47 10 48 89 46 10 c3 0f 1f 80 00 00 00 00 48 89 32 eb b2 0f 1f 00 48 89 70 10 eb a9 66 90 55 <48> 8b 17 48 89 e5 48 89 d0 48 83 e0 fc 48 39 c7 74 34 48 8b 47 
[ 1119.145307] RIP  [<ffffffff81311c81>] rb_next+0x1/0x50
[ 1119.145307]  RSP <ffff88001bd8dc18>
[ 1119.145307] CR2: 0000000000000010
[ 1119.145307] ---[ end trace 787431227c69a127 ]---

Revision history for this message

Gavin Heavyside (mydrive) wrote on 2012-05-28: Re: Kernel crash on EC2 & VirtualBox

#22

We've got small EC2 instances (single processor) that haven't exhibited this behaviour, but we get it with large EC2 instances (2 CPUs); the VirtualBox machine I just reproduced it with was specifically set to 2 CPUS.

It seems to me that this bug might only occur on multi-cpu boxes?

summary:

- Kernel crash on EC2 m1.large instances
+ Kernel crash on EC2 & VirtualBox

Revision history for this message

Gavin Heavyside (mydrive) wrote on 2012-05-28:

#23

Download full text (3.2 KiB)

And we've just reproduced on EC2 with the debug kernel:

[248587.286290] ------------[ cut here ]------------
[248587.286765] kernel BUG at /home/smb/precise-amd64/ubuntu-2.6/kernel/sched_fair.c:1239!
[248587.286775] invalid opcode: 0000 [#1] SMP
[248587.286783] CPU 0
[248587.286786] Modules linked in: ipt_REJECT xt_tcpudp nf_conntrack_ipv4 nf_defrag_ipv4 xt_state nf_conntrack iptable_filter ip_tables x_tables bnep rfcomm bluetooth parport_pc ppdev lp parport isofs acpiphp
[248587.286822]
[248587.286827] Pid: 18805, comm: ohai Not tainted 3.2.0-24-virtual #37+lp999755dbg1
[248587.286836] RIP: e030:[<ffffffff81050cd5>] [<ffffffff81050cd5>] pick_next_entity+0x105/0x110
[248587.286849] RSP: e02b:ffff8801d02d7c28 EFLAGS: 00010096
[248587.286854] RAX: 000000000000002d RBX: ffff8801d20df800 RCX: 000000000003ffff
[248587.286860] RDX: 0000000000000000 RSI: ffffffff81e000a0 RDI: 0000000000000004
[248587.286866] RBP: ffff8801d02d7c48 R08: 000000000000000a R09: 0000000000000000
[248587.286872] R10: 0000000000000000 R11: 0000000000000000 R12: ffff8801dff866c0
[248587.286878] R13: ffff8801d20de600 R14: 0000000000000000 R15: 0000000001f41018
[248587.286889] FS: 00007f5ec2010700(0000) GS:ffff8801dff73000(0000) knlGS:0000000000000000
[248587.286895] CS: e033 DS: 0000 ES: 0000 CR0: 000000008005003b
[248587.286901] CR2: 0000000001f09f30 CR3: 00000001cec3e000 CR4: 0000000000002660
[248587.286908] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[248587.286914] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
[248587.286921] Process ohai (pid: 18805, threadinfo ffff8801d02d6000, task ffff8801cf10db80)
[248587.286928] Stack:
[248587.286931] ffff8801d20df800 ffff8801dff866c0 ffff8801d20de600 0000000000000000
[248587.286944] ffff8801d02d7c78 ffffffff810544e8 ffff8801d02d7c78 ffff8801dff866c0
[248587.286956] 0000000000000000 ffff8801cf10df28 ffff8801d02d7cf8 ffffffff81652f3c
[248587.286968] Call Trace:
[248587.286976] [<ffffffff810544e8>] pick_next_task_fair+0x38/0x70
[248587.286984] [<ffffffff81652f3c>] __schedule+0x14c/0x6f0
[248587.286992] [<ffffffff8165564e>] ? _raw_spin_unlock_irqrestore+0x1e/0x30
[248587.286999] [<ffffffff816535af>] schedule+0x3f/0x60
[248587.287006] [<ffffffff8117e149>] pipe_wait+0x59/0x80
[248587.287014] [<ffffffff81089370>] ? add_wait_queue+0x60/0x60
[248587.287021] [<ffffffff8117e8aa>] pipe_read+0x1da/0x330
[248587.287028] [<ffffffff81174552>] do_sync_read+0xd2/0x110
[248587.287036] [<ffffffff8100a25d>] ? xen_force_evtchn_callback+0xd/0x10
[248587.287043] [<ffffffff8100aa32>] ? check_events+0x12/0x20
[248587.287051] [<ffffffff81298d63>] ? security_file_permission+0x93/0xb0
[248587.287057] [<ffffffff811749d1>] ? rw_verify_area+0x61/0xf0
[248587.287063] [<ffffffff81174eb0>] vfs_read+0xb0/0x180
[248587.287069] [<ffffffff81174fca>] sys_read+0x4a/0x90
[248587.287076] [<ffffffff8165da42>] system_call_fastpath+0x16/0x1b
[248587.287081] Code: 89 df e8 69 c1 5e 00 48 8b 73 38 48 c7 c7 7b ef 9f 81 31 c0 e8 98 c9 5e 00 48 8b 73 10 48 c7 c7 98 ef 9f 81 31 c0 e8 86 c9 5e 00 <0f> 0b 66 0f 1f 84 00 00 00 00 00 55 48 89 e5 48 83 ec 10 48 89
[248587.287170] RIP [<ffffffff81050cd5>] pick_next_en...

And we've just reproduced on EC2 with the debug kernel:

[248587.286290] ------------[ cut here ]------------
[248587.286765] kernel BUG at /home/smb/precise-amd64/ubuntu-2.6/kernel/sched_fair.c:1239!
[248587.286775] invalid opcode: 0000 [#1] SMP 
[248587.286783] CPU 0 
[248587.286786] Modules linked in: ipt_REJECT xt_tcpudp nf_conntrack_ipv4 nf_defrag_ipv4 xt_state nf_conntrack iptable_filter ip_tables x_tables bnep rfcomm bluetooth parport_pc ppdev lp parport isofs acpiphp
[248587.286822] 
[248587.286827] Pid: 18805, comm: ohai Not tainted 3.2.0-24-virtual #37+lp999755dbg1  
[248587.286836] RIP: e030:[<ffffffff81050cd5>]  [<ffffffff81050cd5>] pick_next_entity+0x105/0x110
[248587.286849] RSP: e02b:ffff8801d02d7c28  EFLAGS: 00010096
[248587.286854] RAX: 000000000000002d RBX: ffff8801d20df800 RCX: 000000000003ffff
[248587.286860] RDX: 0000000000000000 RSI: ffffffff81e000a0 RDI: 0000000000000004
[248587.286866] RBP: ffff8801d02d7c48 R08: 000000000000000a R09: 0000000000000000
[248587.286872] R10: 0000000000000000 R11: 0000000000000000 R12: ffff8801dff866c0
[248587.286878] R13: ffff8801d20de600 R14: 0000000000000000 R15: 0000000001f41018
[248587.286889] FS:  00007f5ec2010700(0000) GS:ffff8801dff73000(0000) knlGS:0000000000000000
[248587.286895] CS:  e033 DS: 0000 ES: 0000 CR0: 000000008005003b
[248587.286901] CR2: 0000000001f09f30 CR3: 00000001cec3e000 CR4: 0000000000002660
[248587.286908] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[248587.286914] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
[248587.286921] Process ohai (pid: 18805, threadinfo ffff8801d02d6000, task ffff8801cf10db80)
[248587.286928] Stack:
[248587.286931]  ffff8801d20df800 ffff8801dff866c0 ffff8801d20de600 0000000000000000
[248587.286944]  ffff8801d02d7c78 ffffffff810544e8 ffff8801d02d7c78 ffff8801dff866c0
[248587.286956]  0000000000000000 ffff8801cf10df28 ffff8801d02d7cf8 ffffffff81652f3c
[248587.286968] Call Trace:
[248587.286976]  [<ffffffff810544e8>] pick_next_task_fair+0x38/0x70
[248587.286984]  [<ffffffff81652f3c>] __schedule+0x14c/0x6f0
[248587.286992]  [<ffffffff8165564e>] ? _raw_spin_unlock_irqrestore+0x1e/0x30
[248587.286999]  [<ffffffff816535af>] schedule+0x3f/0x60
[248587.287006]  [<ffffffff8117e149>] pipe_wait+0x59/0x80
[248587.287014]  [<ffffffff81089370>] ? add_wait_queue+0x60/0x60
[248587.287021]  [<ffffffff8117e8aa>] pipe_read+0x1da/0x330
[248587.287028]  [<ffffffff81174552>] do_sync_read+0xd2/0x110
[248587.287036]  [<ffffffff8100a25d>] ? xen_force_evtchn_callback+0xd/0x10
[248587.287043]  [<ffffffff8100aa32>] ? check_events+0x12/0x20
[248587.287051]  [<ffffffff81298d63>] ? security_file_permission+0x93/0xb0
[248587.287057]  [<ffffffff811749d1>] ? rw_verify_area+0x61/0xf0
[248587.287063]  [<ffffffff81174eb0>] vfs_read+0xb0/0x180
[248587.287069]  [<ffffffff81174fca>] sys_read+0x4a/0x90
[248587.287076]  [<ffffffff8165da42>] system_call_fastpath+0x16/0x1b
[248587.287081] Code: 89 df e8 69 c1 5e 00 48 8b 73 38 48 c7 c7 7b ef 9f 81 31 c0 e8 98 c9 5e 00 48 8b 73 10 48 c7 c7 98 ef 9f 81 31 c0 e8 86 c9 5e 00 <0f> 0b 66 0f 1f 84 00 00 00 00 00 55 48 89 e5 48 83 ec 10 48 89 
[248587.287170] RIP  [<ffffffff81050cd5>] pick_next_entity+0x105/0x110
[248587.287177]  RSP <ffff8801d02d7c28>
[248587.287194] ---[ end trace 124f7d4d99f55a46 ]---

Revision history for this message

Karl Matthias (0k53-karl) wrote on 2012-05-28:

#24

Download full text (3.3 KiB)

We have had a crash with Stefan's debugging kernel running. Here is the output (doesn't look like it contains any more information).

[248587.286290] ------------[ cut here ]------------
[248587.286765] kernel BUG at /home/smb/precise-amd64/ubuntu-2.6/kernel/sched_fair.c:1239!
[248587.286775] invalid opcode: 0000 [#1] SMP
[248587.286783] CPU 0
[248587.286786] Modules linked in: ipt_REJECT xt_tcpudp nf_conntrack_ipv4 nf_defrag_ipv4 xt_state nf_conntrack iptable_filter ip_tables x_tables bnep rfcomm bluetooth parport_pc ppdev lp parport isofs acpiphp
[248587.286822]
[248587.286827] Pid: 18805, comm: ohai Not tainted 3.2.0-24-virtual #37+lp999755dbg1
[248587.286836] RIP: e030:[<ffffffff81050cd5>] [<ffffffff81050cd5>] pick_next_entity+0x105/0x110
[248587.286849] RSP: e02b:ffff8801d02d7c28 EFLAGS: 00010096
[248587.286854] RAX: 000000000000002d RBX: ffff8801d20df800 RCX: 000000000003ffff
[248587.286860] RDX: 0000000000000000 RSI: ffffffff81e000a0 RDI: 0000000000000004
[248587.286866] RBP: ffff8801d02d7c48 R08: 000000000000000a R09: 0000000000000000
[248587.286872] R10: 0000000000000000 R11: 0000000000000000 R12: ffff8801dff866c0
[248587.286878] R13: ffff8801d20de600 R14: 0000000000000000 R15: 0000000001f41018
[248587.286889] FS: 00007f5ec2010700(0000) GS:ffff8801dff73000(0000) knlGS:0000000000000000
[248587.286895] CS: e033 DS: 0000 ES: 0000 CR0: 000000008005003b
[248587.286901] CR2: 0000000001f09f30 CR3: 00000001cec3e000 CR4: 0000000000002660
[248587.286908] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[248587.286914] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
[248587.286921] Process ohai (pid: 18805, threadinfo ffff8801d02d6000, task ffff8801cf10db80)
[248587.286928] Stack:
[248587.286931] ffff8801d20df800 ffff8801dff866c0 ffff8801d20de600 0000000000000000
[248587.286944] ffff8801d02d7c78 ffffffff810544e8 ffff8801d02d7c78 ffff8801dff866c0
[248587.286956] 0000000000000000 ffff8801cf10df28 ffff8801d02d7cf8 ffffffff81652f3c
[248587.286968] Call Trace:
[248587.286976] [<ffffffff810544e8>] pick_next_task_fair+0x38/0x70
[248587.286984] [<ffffffff81652f3c>] __schedule+0x14c/0x6f0
[248587.286992] [<ffffffff8165564e>] ? _raw_spin_unlock_irqrestore+0x1e/0x30
[248587.286999] [<ffffffff816535af>] schedule+0x3f/0x60
[248587.287006] [<ffffffff8117e149>] pipe_wait+0x59/0x80
[248587.287014] [<ffffffff81089370>] ? add_wait_queue+0x60/0x60
[248587.287021] [<ffffffff8117e8aa>] pipe_read+0x1da/0x330
[248587.287028] [<ffffffff81174552>] do_sync_read+0xd2/0x110
[248587.287036] [<ffffffff8100a25d>] ? xen_force_evtchn_callback+0xd/0x10
[248587.287043] [<ffffffff8100aa32>] ? check_events+0x12/0x20
[248587.287051] [<ffffffff81298d63>] ? security_file_permission+0x93/0xb0
[248587.287057] [<ffffffff811749d1>] ? rw_verify_area+0x61/0xf0
[248587.287063] [<ffffffff81174eb0>] vfs_read+0xb0/0x180
[248587.287069] [<ffffffff81174fca>] sys_read+0x4a/0x90
[248587.287076] [<ffffffff8165da42>] system_call_fastpath+0x16/0x1b
[248587.287081] Code: 89 df e8 69 c1 5e 00 48 8b 73 38 48 c7 c7 7b ef 9f 81 31 c0 e8 98 c9 5e 00 48 8b 73 10 48 c7 c7 98 ef 9f 81 31 c0 e8 86 c9 5e 00 <0f> 0b 66 0f 1f 84 00 00 00 00 00 55 48 8...

We have had a crash with Stefan's debugging kernel running.  Here is the output (doesn't look like it contains any more information).

[248587.286290] ------------[ cut here ]------------
[248587.286765] kernel BUG at /home/smb/precise-amd64/ubuntu-2.6/kernel/sched_fair.c:1239!
[248587.286775] invalid opcode: 0000 [#1] SMP 
[248587.286783] CPU 0 
[248587.286786] Modules linked in: ipt_REJECT xt_tcpudp nf_conntrack_ipv4 nf_defrag_ipv4 xt_state nf_conntrack iptable_filter ip_tables x_tables bnep rfcomm bluetooth parport_pc ppdev lp parport isofs acpiphp
[248587.286822] 
[248587.286827] Pid: 18805, comm: ohai Not tainted 3.2.0-24-virtual #37+lp999755dbg1  
[248587.286836] RIP: e030:[<ffffffff81050cd5>]  [<ffffffff81050cd5>] pick_next_entity+0x105/0x110
[248587.286849] RSP: e02b:ffff8801d02d7c28  EFLAGS: 00010096
[248587.286854] RAX: 000000000000002d RBX: ffff8801d20df800 RCX: 000000000003ffff
[248587.286860] RDX: 0000000000000000 RSI: ffffffff81e000a0 RDI: 0000000000000004
[248587.286866] RBP: ffff8801d02d7c48 R08: 000000000000000a R09: 0000000000000000
[248587.286872] R10: 0000000000000000 R11: 0000000000000000 R12: ffff8801dff866c0
[248587.286878] R13: ffff8801d20de600 R14: 0000000000000000 R15: 0000000001f41018
[248587.286889] FS:  00007f5ec2010700(0000) GS:ffff8801dff73000(0000) knlGS:0000000000000000
[248587.286895] CS:  e033 DS: 0000 ES: 0000 CR0: 000000008005003b
[248587.286901] CR2: 0000000001f09f30 CR3: 00000001cec3e000 CR4: 0000000000002660
[248587.286908] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[248587.286914] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
[248587.286921] Process ohai (pid: 18805, threadinfo ffff8801d02d6000, task ffff8801cf10db80)
[248587.286928] Stack:
[248587.286931]  ffff8801d20df800 ffff8801dff866c0 ffff8801d20de600 0000000000000000
[248587.286944]  ffff8801d02d7c78 ffffffff810544e8 ffff8801d02d7c78 ffff8801dff866c0
[248587.286956]  0000000000000000 ffff8801cf10df28 ffff8801d02d7cf8 ffffffff81652f3c
[248587.286968] Call Trace:
[248587.286976]  [<ffffffff810544e8>] pick_next_task_fair+0x38/0x70
[248587.286984]  [<ffffffff81652f3c>] __schedule+0x14c/0x6f0
[248587.286992]  [<ffffffff8165564e>] ? _raw_spin_unlock_irqrestore+0x1e/0x30
[248587.286999]  [<ffffffff816535af>] schedule+0x3f/0x60
[248587.287006]  [<ffffffff8117e149>] pipe_wait+0x59/0x80
[248587.287014]  [<ffffffff81089370>] ? add_wait_queue+0x60/0x60
[248587.287021]  [<ffffffff8117e8aa>] pipe_read+0x1da/0x330
[248587.287028]  [<ffffffff81174552>] do_sync_read+0xd2/0x110
[248587.287036]  [<ffffffff8100a25d>] ? xen_force_evtchn_callback+0xd/0x10
[248587.287043]  [<ffffffff8100aa32>] ? check_events+0x12/0x20
[248587.287051]  [<ffffffff81298d63>] ? security_file_permission+0x93/0xb0
[248587.287057]  [<ffffffff811749d1>] ? rw_verify_area+0x61/0xf0
[248587.287063]  [<ffffffff81174eb0>] vfs_read+0xb0/0x180
[248587.287069]  [<ffffffff81174fca>] sys_read+0x4a/0x90
[248587.287076]  [<ffffffff8165da42>] system_call_fastpath+0x16/0x1b
[248587.287081] Code: 89 df e8 69 c1 5e 00 48 8b 73 38 48 c7 c7 7b ef 9f 81 31 c0 e8 98 c9 5e 00 48 8b 73 10 48 c7 c7 98 ef 9f 81 31 c0 e8 86 c9 5e 00 <0f> 0b 66 0f 1f 84 00 00 00 00 00 55 48 89 e5 48 83 ec 10 48 89 
[248587.287170] RIP  [<ffffffff81050cd5>] pick_next_entity+0x105/0x110
[248587.287177]  RSP <ffff8801d02d7c28>
[248587.287194] ---[ end trace 124f7d4d99f55a46 ]---

Revision history for this message

Gavin Heavyside (mydrive) wrote on 2012-05-28:

#25

Download full text (4.4 KiB)

And another one from the debug kernel on EC2, with a slightly different call stack:

[ 4389.480352] ------------[ cut here ]------------
[ 4389.480884] kernel BUG at /home/smb/precise-amd64/ubuntu-2.6/kernel/sched_fair.c:1239!
[ 4389.480894] invalid opcode: 0000 [#1] SMP
[ 4389.480902] CPU 0
[ 4389.480905] Modules linked in: ipt_REJECT xt_tcpudp nf_conntrack_ipv4 nf_defrag_ipv4 xt_state nf_conntrack iptable_filter ip_tables x_tables bnep rfcomm bluetooth parport_pc ppdev lp parport isofs acpiphp
[ 4389.480945]
[ 4389.480949] Pid: 24612, comm: ohai Not tainted 3.2.0-24-virtual #37+lp999755dbg1
[ 4389.480958] RIP: e030:[<ffffffff81050cd5>] [<ffffffff81050cd5>] pick_next_entity+0x105/0x110
[ 4389.480972] RSP: e02b:ffff8801d0cdd818 EFLAGS: 00010092
[ 4389.480977] RAX: 000000000000002c RBX: ffff8801d1106e00 RCX: 000000000003ffff
[ 4389.480983] RDX: 0000000000000000 RSI: ffffffff81e000a0 RDI: 0000000000000004
[ 4389.480988] RBP: ffff8801d0cdd838 R08: 000000000000000a R09: 0000000000000000
[ 4389.480994] R10: 0000000000000000 R11: 0000000000000000 R12: ffff8801dff866c0
[ 4389.481000] R13: ffff8801d1107a00 R14: 0000000000000008 R15: ffff8801d1e97800
[ 4389.481009] FS: 00007f09fcc8a700(0000) GS:ffff8801dff73000(0000) knlGS:0000000000000000
[ 4389.481015] CS: e033 DS: 0000 ES: 0000 CR0: 000000008005003b
[ 4389.481021] CR2: 00000000012dbf30 CR3: 00000001d0702000 CR4: 0000000000002660
[ 4389.481028] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[ 4389.481035] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
[ 4389.481041] Process ohai (pid: 24612, threadinfo ffff8801d0cdc000, task ffff8801d2905b80)
[ 4389.481047] Stack:
[ 4389.481051] ffff8801d1106e00 ffff8801dff866c0 ffff8801d1107a00 0000000000000008
[ 4389.481063] ffff8801d0cdd868 ffffffff810544e8 ffff8801d0cdd868 ffff8801dff866c0
[ 4389.481075] 0000000000000000 ffff8801d2905f28 ffff8801d0cdd8e8 ffffffff81652f3c
[ 4389.481086] Call Trace:
[ 4389.481093] [<ffffffff810544e8>] pick_next_task_fair+0x38/0x70
[ 4389.481101] [<ffffffff81652f3c>] __schedule+0x14c/0x6f0
[ 4389.481108] [<ffffffff816535af>] schedule+0x3f/0x60
[ 4389.481114] [<ffffffff816546cd>] schedule_hrtimeout_range_clock+0x14d/0x170
[ 4389.481124] [<ffffffff8100aa1f>] ? xen_restore_fl_direct_reloc+0x4/0x4
[ 4389.481131] [<ffffffff8165564e>] ? _raw_spin_unlock_irqrestore+0x1e/0x30
[ 4389.481139] [<ffffffff8108935d>] ? add_wait_queue+0x4d/0x60
[ 4389.481146] [<ffffffff81654703>] schedule_hrtimeout_range+0x13/0x20
[ 4389.481154] [<ffffffff811877d9>] poll_schedule_timeout+0x49/0x70
[ 4389.481160] [<ffffffff81188356>] do_select+0x4d6/0x600
[ 4389.481167] [<ffffffff811878e0>] ? poll_freewait+0xe0/0xe0
[ 4389.481173] [<ffffffff811879d0>] ? __pollwait+0xf0/0xf0
[ 4389.481179] [<ffffffff81005191>] ? __raw_callee_save_xen_pte_val+0x11/0x1e
[ 4389.481186] [<ffffffff81054c3e>] ? update_curr+0x21e/0x230
[ 4389.481192] [<ffffffff8103cc65>] ? pvclock_clocksource_read+0x55/0xf0
[ 4389.481199] [<ffffffff810553db>] ? check_preempt_wakeup+0x15b/0x230
[ 4389.481207] [<ffffffff8104ed74>] ? check_preempt_curr+0x84/0xa0
[ 4389.481214] [<ffffffff8104edcd>] ? ttwu_do_wakeup+0x3d/0x120
[ 4389.481222] ...

And another one from the debug kernel on EC2, with a slightly different call stack:

[ 4389.480352] ------------[ cut here ]------------
[ 4389.480884] kernel BUG at /home/smb/precise-amd64/ubuntu-2.6/kernel/sched_fair.c:1239!
[ 4389.480894] invalid opcode: 0000 [#1] SMP 
[ 4389.480902] CPU 0 
[ 4389.480905] Modules linked in: ipt_REJECT xt_tcpudp nf_conntrack_ipv4 nf_defrag_ipv4 xt_state nf_conntrack iptable_filter ip_tables x_tables bnep rfcomm bluetooth parport_pc ppdev lp parport isofs acpiphp
[ 4389.480945] 
[ 4389.480949] Pid: 24612, comm: ohai Not tainted 3.2.0-24-virtual #37+lp999755dbg1  
[ 4389.480958] RIP: e030:[<ffffffff81050cd5>]  [<ffffffff81050cd5>] pick_next_entity+0x105/0x110
[ 4389.480972] RSP: e02b:ffff8801d0cdd818  EFLAGS: 00010092
[ 4389.480977] RAX: 000000000000002c RBX: ffff8801d1106e00 RCX: 000000000003ffff
[ 4389.480983] RDX: 0000000000000000 RSI: ffffffff81e000a0 RDI: 0000000000000004
[ 4389.480988] RBP: ffff8801d0cdd838 R08: 000000000000000a R09: 0000000000000000
[ 4389.480994] R10: 0000000000000000 R11: 0000000000000000 R12: ffff8801dff866c0
[ 4389.481000] R13: ffff8801d1107a00 R14: 0000000000000008 R15: ffff8801d1e97800
[ 4389.481009] FS:  00007f09fcc8a700(0000) GS:ffff8801dff73000(0000) knlGS:0000000000000000
[ 4389.481015] CS:  e033 DS: 0000 ES: 0000 CR0: 000000008005003b
[ 4389.481021] CR2: 00000000012dbf30 CR3: 00000001d0702000 CR4: 0000000000002660
[ 4389.481028] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[ 4389.481035] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
[ 4389.481041] Process ohai (pid: 24612, threadinfo ffff8801d0cdc000, task ffff8801d2905b80)
[ 4389.481047] Stack:
[ 4389.481051]  ffff8801d1106e00 ffff8801dff866c0 ffff8801d1107a00 0000000000000008
[ 4389.481063]  ffff8801d0cdd868 ffffffff810544e8 ffff8801d0cdd868 ffff8801dff866c0
[ 4389.481075]  0000000000000000 ffff8801d2905f28 ffff8801d0cdd8e8 ffffffff81652f3c
[ 4389.481086] Call Trace:
[ 4389.481093]  [<ffffffff810544e8>] pick_next_task_fair+0x38/0x70
[ 4389.481101]  [<ffffffff81652f3c>] __schedule+0x14c/0x6f0
[ 4389.481108]  [<ffffffff816535af>] schedule+0x3f/0x60
[ 4389.481114]  [<ffffffff816546cd>] schedule_hrtimeout_range_clock+0x14d/0x170
[ 4389.481124]  [<ffffffff8100aa1f>] ? xen_restore_fl_direct_reloc+0x4/0x4
[ 4389.481131]  [<ffffffff8165564e>] ? _raw_spin_unlock_irqrestore+0x1e/0x30
[ 4389.481139]  [<ffffffff8108935d>] ? add_wait_queue+0x4d/0x60
[ 4389.481146]  [<ffffffff81654703>] schedule_hrtimeout_range+0x13/0x20
[ 4389.481154]  [<ffffffff811877d9>] poll_schedule_timeout+0x49/0x70
[ 4389.481160]  [<ffffffff81188356>] do_select+0x4d6/0x600
[ 4389.481167]  [<ffffffff811878e0>] ? poll_freewait+0xe0/0xe0
[ 4389.481173]  [<ffffffff811879d0>] ? __pollwait+0xf0/0xf0
[ 4389.481179]  [<ffffffff81005191>] ? __raw_callee_save_xen_pte_val+0x11/0x1e
[ 4389.481186]  [<ffffffff81054c3e>] ? update_curr+0x21e/0x230
[ 4389.481192]  [<ffffffff8103cc65>] ? pvclock_clocksource_read+0x55/0xf0
[ 4389.481199]  [<ffffffff810553db>] ? check_preempt_wakeup+0x15b/0x230
[ 4389.481207]  [<ffffffff8104ed74>] ? check_preempt_curr+0x84/0xa0
[ 4389.481214]  [<ffffffff8104edcd>] ? ttwu_do_wakeup+0x3d/0x120
[ 4389.481222]  [<ffffffff8130fd49>] ? put_dec+0x59/0x60
[ 4389.481228]  [<ffffffff81310c1f>] ? number.isra.2+0x31f/0x350
[ 4389.481236]  [<ffffffff81323436>] ? nla_parse+0x86/0xe0
[ 4389.481242]  [<ffffffff8165564e>] ? _raw_spin_unlock_irqrestore+0x1e/0x30
[ 4389.481250]  [<ffffffff8105e610>] ? try_to_wake_up+0x190/0x200
[ 4389.481257]  [<ffffffff81188641>] core_sys_select+0x1c1/0x330
[ 4389.481263]  [<ffffffff8100aa32>] ? check_events+0x12/0x20
[ 4389.481269]  [<ffffffff8100a25d>] ? xen_force_evtchn_callback+0xd/0x10
[ 4389.481276]  [<ffffffff8100aa32>] ? check_events+0x12/0x20
[ 4389.481282]  [<ffffffff8100aa1f>] ? xen_restore_fl_direct_reloc+0x4/0x4
[ 4389.481289]  [<ffffffff81004c62>] ? xen_mc_flush+0xb2/0x1c0
[ 4389.481295]  [<ffffffff8100aa1f>] ? xen_restore_fl_direct_reloc+0x4/0x4
[ 4389.481302]  [<ffffffff811889eb>] sys_select+0xbb/0x100
[ 4389.481308]  [<ffffffff8105edf7>] ? schedule_tail+0x27/0xb0
[ 4389.481314]  [<ffffffff8165da42>] system_call_fastpath+0x16/0x1b
[ 4389.481319] Code: 89 df e8 69 c1 5e 00 48 8b 73 38 48 c7 c7 7b ef 9f 81 31 c0 e8 98 c9 5e 00 48 8b 73 10 48 c7 c7 98 ef 9f 81 31 c0 e8 86 c9 5e 00 <0f> 0b 66 0f 1f 84 00 00 00 00 00 55 48 89 e5 48 83 ec 10 48 89 
[ 4389.481407] RIP  [<ffffffff81050cd5>] pick_next_entity+0x105/0x110
[ 4389.484006]  RSP <ffff8801d0cdd818>
[ 4389.484006] ---[ end trace 7ee7cea7516c9821 ]---

Revision history for this message

Stefan Bader (smb) wrote on 2012-05-29:

#26

Bugger, I made the severity of the messages too low... :/ If at all they might be in /var/log/syslog on instance reboot... I better make a better version. But its interesting that this appears on a completely different virtualization platform. I wonder whether it could even appear on real hardware. Thanks a lot for the detailed instructions for the test case. I was running a test program doing lots of forks and pipe communication over two days without hitting this, so it helps a lot to know exactly what the reproducing steps are.

Revision history for this message

Stefan Bader (smb) wrote on 2012-05-29:

#27

So with those exact steps I managed to crash a real dual-core machine. So we can rule out any virtualization. This merely is related to whatever that command does and likely having more than one cpu. (It also worked / crashed on a local xen guest, so I can reproduce things locally). Next step will be to test latest upstream kernels to see whether this persists.

summary:	- Kernel crash on EC2 & VirtualBox + Kernel crash in rb_next doin ohai loops
description:	updated
summary:	- Kernel crash in rb_next doin ohai loops + Kernel crash in rb_next doing ohai loops

Revision history for this message

Karl Matthias (0k53-karl) wrote on 2012-05-29:

#28

Ok, that is "good" news. Happy to help with further testing.

description:

updated

Stefan Bader (smb) on 2012-05-29

description:

updated

Revision history for this message

Stefan Bader (smb) wrote on 2012-05-30:

#29

I did repeat the crash on a local HVM guest with the current precise-proposed kernel. That way I was able to dump the crashed VM. A run with the same kernel in a different guest that had 4 CPUs did not crash for several hours. 2 CPUs seems to make the issue less likely and could be a reason this is not observed more often. Currently running the loop with a 3.4 mainline kernel and 2 CPUs.

Revision history for this message

Stefan Bader (smb) wrote on 2012-05-30:

#30

Doh! I meant 4 CPUs makes it less likely and 2 more likely.

Revision history for this message

Karl Matthias (0k53-karl) wrote on 2012-05-30:

#31

Stefan, our experience is that it also does not ever happen on just a single core so it's definitely some scheduler interaction on a multi-core box. You probably could get a 4 core to do it as well if the process list were more heavily loaded.

Revision history for this message

Stefan Bader (smb) wrote on 2012-06-06:

#32

Unfortunately I was not yet able to pinpoint exactly the code that is wrong. But I think I know a way that can be used as a work-around until I find out the real fix. It looks very much like the problem is related to the autogroup feature which will automatically put tasks in separate task groups based on the session ids. The ohai process is doing this a lot, so I think compared to other work-loads there is a high volume of coming and going task groups, which might be the problem.

At the cost of loosing the fairer scheduling which isolation of high cpu using tasks in their own group brings, it is possible to disable the autogroup feature by either using "noautogroup" as pvgrub kernel argument, or changing
/proc/sys/kernel/sched_autogroup_enabled to 0. I guess that can be a sysctl entry, too. That way the ohai reproducer seems to survive, so production systems won't crash all the time (hopefully).

Revision history for this message

Ken (kenshi) wrote on 2012-06-11:

#33

Any chance this is related to this issue? https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1007082

Revision history for this message

Stefan Bader (smb) wrote on 2012-06-12:

#34

@Ken, I do not think so. This one is very like related to a race creating and removing task groups from the scheduler. Yours rather seems to have issues with the flags of a page it tries to use. I don't know why but I have the feeling of having seen this before, but maybe it was just looking at your bug. I think I saw some memory related patches being queued up into upstream stable but iirc those are not yet pushed into any next proposed. The log itself is somewhat helpful but with those kind of bugs hard to link to a specific place as the crash just is the result of something else going wrong before. If you could add to your bug more information on generic details. Like what kind of workload is running, is it possible to get any relation on when it happens (what range is occasionally).

Revision history for this message

Ken (kenshi) wrote on 2012-06-12:

#35

@Stefan - thanks for the tips, will update the ticket with more info.

Revision history for this message

Stefan Bader (smb) wrote on 2012-07-03:

#36

Sorry for not updating this before. So finally the problem has been identified. It is somewhat papered over by a change that went into linux-3.3. Though the potential race was still not fixed. What happens is that when a process calls setsid it is placed into a sperate schduling task group. However that group (a pointer) is changed while holding a different lock than when moving task to another CPU to balance the load. And even worse, while doing so (moving the task). the code before was looking up the task group pointer 4 times in a row while not being protected against changes. So whenever this crash happened, the pointer changed between two lookups to assign values to cfq scheduling elements. There could also be a inconsistency between cfq elements and rt scheduling element or between the two rt elements. Not sure which exact effects this would yield.

There is now a patch making its way upstream that will close the race. While Quantal is not affected that badly, Natty, Oneiric and Precise should be fixed.

Revision history for this message

Stefan Bader (smb) wrote on 2012-07-03:

#37

0001-sched-Fix-race-in-task_group.patch Edit (5.4 KiB, text/plain)

This is the patch that was made upstream and should be queued for mainline.

Changed in linux (Ubuntu):
status:	Confirmed → Triaged

Stefan Bader (smb) on 2012-07-03

description:

updated

Revision history for this message

Karl Matthias (0k53-karl) wrote on 2012-07-03:

#38

Awesome, thanks for the nice explanation and the dedication to working this out. We'll try out the patch on our several test setups as well.

Tim Gardner (timg-tpi) on 2012-07-17

Changed in linux (Ubuntu Oneiric):
status:	New → Fix Committed
Changed in linux (Ubuntu Precise):
status:	New → Fix Committed
Changed in linux (Ubuntu Natty):
status:	New → In Progress
assignee:	nobody → Stefan Bader (stefan-bader-canonical)

Revision history for this message

Armin Bauer (armin-bauer) wrote on 2012-07-19:

#39

Is there any estimate when this will be released? This bug is causing a lot of troubles for our setup due to random crashed instances.

Revision history for this message

Stefan Bader (smb) wrote on 2012-07-19:

#40

This went into the repo yesterday. The current cycle is in testing phase this week and given nothing goes bad there would get released next week. So kernels with this patch should go into the proposed pocket about next week. If there are no problems this should be in the officially released kernels in 3 to 4 weeks. But you could start using the proposed kernels as soon as they are available. Or as I wrote earlier, disable the autogroup feature until then (see comment #32).

Revision history for this message

Armin Bauer (armin-bauer) wrote on 2012-07-19:

#41

Thanks for the info. I think I will try a patched ohai version first, with the setsid call removed. The side effects of this change are probably better than changing the sched_autogroup_enabled on our production instances. If this doesnt help, I will try the sched_autogroup_enabled workaround. Thanks for your help and the work on this bug.

Revision history for this message

Gavin Heavyside (mydrive) wrote on 2012-07-19:

#42

We think that this seems to be an interplay between the kernel bug, ohai, and Ruby 1.9.3. I was unable to reproduce the crash using the opscode omnibus installer (http://www.opscode.com/blog/2012/06/29/omnibus-chef-packaging/) which uses a bundled ruby 1.9.2, so you could potentially try that as an alternative to patching OHAI.

Revision history for this message

Luis Henriques (henrix) wrote on 2012-07-30:

#43

This bug is awaiting verification that the kernel for Oneiric in -proposed solves the problem (3.0.0-24.40). Please test the kernel and update this bug with the results. If the problem is solved, change the tag 'verification-needed-oneiric' to 'verification-done-oneiric'.

If verification is not done by one week from today, this fix will be dropped from the source code, and this bug will be closed.

See https://wiki.ubuntu.com/Testing/EnableProposed for documentation how to enable and use -proposed. Thank you!

tags:

added: verification-needed-oneiric

Revision history for this message

Luis Henriques (henrix) wrote on 2012-07-30:

#44

This bug is awaiting verification that the kernel for Precise in -proposed solves the problem (3.2.0-29.46). Please test the kernel and update this bug with the results. If the problem is solved, change the tag 'verification-needed-precise' to 'verification-done-precise'.

If verification is not done by one week from today, this fix will be dropped from the source code, and this bug will be closed.

See https://wiki.ubuntu.com/Testing/EnableProposed for documentation how to enable and use -proposed. Thank you!

tags:

added: verification-needed-precise

Karl Matthias (0k53-karl) on 2012-07-31

tags:

added: verification-done-precise
removed: verification-needed-precise

Revision history for this message

Karl Matthias (0k53-karl) wrote on 2012-07-31:

#45

We verified the patched kernel does not exhibit the crashing behavior over a week-long run of Ohai loops on an EC2 large instance. It also behaves correctly on 2-core a VirtualBox running Ohai loops.

Revision history for this message

Stefan Bader (smb) wrote on 2012-08-01:

#46

I could confirm the verification for Precise. Though for Oneiric unfortunately, the same ohai loops do not seem to trigger the issue. From the scheduler code it has the same race but than maybe ohai internally is slightly different back there. And it seems to be a very specific pattern of fork and setsid that results in the process getting pulled over to a different CPU right when it moves groups.
So I tend to let Oneiric pass without definite verification and possible ignore the Natty kernel (it will be out of support soon and also has the problem that we have no way to cause the crash right no).

Luis Henriques (henrix) on 2012-08-03

tags:

added: verification-done-oneiric
removed: verification-needed-oneiric

Revision history for this message

Luis Henriques (henrix) wrote on 2012-08-03:

#47

Tagged as verified in Oneiric as per comment #46 and IRC chat.

Revision history for this message

Adam Conrad (adconrad) wrote on 2012-08-10: Update Released

#48

The verification of this Stable Release Update has completed successfully and the package has now been released to -updates. Subsequently, the Ubuntu Stable Release Updates Team is being unsubscribed and will not receive messages about this bug report. In the event that you encounter a regression using the package from -updates please report a new bug using ubuntu-bug and tag the bug report regression-update so we can easily find any regresssions.

Revision history for this message

Launchpad Janitor (janitor) wrote on 2012-08-10:

#49

Download full text (7.5 KiB)

This bug was fixed in the package linux - 3.0.0-24.40

---------------
linux (3.0.0-24.40) oneiric-proposed; urgency=low

[Luis Henriques]

* Release Tracking Bug
- LP: #1027841

[ Andy Whitcroft ]

  * [Packaging] getabis should be extracting all packages
    - LP: #1021174
  * [Config] getabis -- series uses linux-image-extra
    - LP: #1021174

[ Stefan Bader ]

* SAUCE: sched: Fix race in task_group()
- LP: #999755

[ Tim Gardner ]

* Extract firmware module info during getabi
- LP: #1021174

[ Upstream Kernel Changes ]

  * net: sock: validate data_len before allocating skb in
    sock_alloc_send_pskb()
    - LP: #1006622
    - CVE-2012-2136
  * epoll: clear the tfile_check_list on -ELOOP
    - LP: #1021811
    - CVE-2012-3375
  * ALSA: hda - Add Realtek ALC280 codec support
    - LP: #1017017, #1025406
  * powerpc/xmon: Use cpumask iterator to avoid warning
    - LP: #1025406
  * media: smsusb: add autodetection support for USB ID 2040:f5a0
    - LP: #1025406
  * ARM: fix rcu stalls on SMP platforms
    - LP: #1025406
  * cipso: handle CIPSO options correctly when NetLabel is disabled
    - LP: #1025406
  * net: l2tp_eth: fix kernel panic on rmmod l2tp_eth
    - LP: #1025406
  * dummy: fix rcu_sched self-detected stalls
    - LP: #1025406
  * ethtool: allow ETHTOOL_GSSET_INFO for users
    - LP: #1025406
  * bridge: Assign rtnl_link_ops to bridge devices created via ioctl (v2)
    - LP: #1025406
  * bonding: Fix corrupted queue_mapping
    - LP: #1025406
  * ipv6: Move ipv6 proc file registration to end of init order
    - LP: #1025406
  * sky2: fix checksum bit management on some chips
    - LP: #1025406
  * be2net: fix a race in be_xmit()
    - LP: #1025406
  * netpoll: fix netpoll_send_udp() bugs
    - LP: #1025406
  * hwmon: (applesmc) Limit key length in warning messages
    - LP: #1025406
  * nilfs2: ensure proper cache clearing for gc-inodes
    - LP: #1025406
  * udf: Use 'ret' instead of abusing 'i' in udf_load_logicalvol()
    - LP: #1025406
  * udf: Avoid run away loop when partition table length is corrupted
    - LP: #1025406
  * udf: Fortify loading of sparing table
    - LP: #1025406
  * ath9k: Fix softlockup in AR9485
    - LP: #1025406
  * ath9k_hw: avoid possible infinite loop in ar9003_get_pll_sqsum_dvc
    - LP: #1025406
  * ath9k: enable serialize_regmode for non-PCIE AR9287
    - LP: #1025406
  * ASoC: tlv320aic3x: Fix codec pll configure bug
    - LP: #1025406
  * Btrfs: run delayed directory updates during log replay
    - LP: #1025406
  * drm/edid: don't return stack garbage from supports_rb
    - LP: #1025406
  * drm/nouveau/fbcon: using nv_two_heads is not a good idea
    - LP: #1025406
  * drm/i915: Fix eDP blank screen after S3 resume on HP desktops
    - LP: #1025406
  * acpi_pad: fix power_saving thread deadlock
    - LP: #1025406
  * ACPI: Add a quirk for "AMILO PRO V2030" to ignore the timer overriding
    - LP: #1025406
  * ACPI, x86: fix Dell M6600 ACPI reboot regression via DMI
    - LP: #1025406
  * ACPI sysfs.c strlen fix
    - LP: #1025406
  * stable: Allow merging of backports for serious user-visible performance
    issues
    - LP: #1025406
  * USB: option: add id for...

This bug was fixed in the package linux - 3.0.0-24.40

---------------
linux (3.0.0-24.40) oneiric-proposed; urgency=low

[Luis Henriques]

* Release Tracking Bug
    - LP: #1027841

[ Andy Whitcroft ]

* [Packaging] getabis should be extracting all packages
    - LP: #1021174
  * [Config] getabis -- series uses linux-image-extra
    - LP: #1021174

[ Stefan Bader ]

* SAUCE: sched: Fix race in task_group()
    - LP: #999755

[ Tim Gardner ]

* Extract firmware module info during getabi
    - LP: #1021174

[ Upstream Kernel Changes ]

* net: sock: validate data_len before allocating skb in
    sock_alloc_send_pskb()
    - LP: #1006622
    - CVE-2012-2136
  * epoll: clear the tfile_check_list on -ELOOP
    - LP: #1021811
    - CVE-2012-3375
  * ALSA: hda - Add Realtek ALC280 codec support
    - LP: #1017017, #1025406
  * powerpc/xmon: Use cpumask iterator to avoid warning
    - LP: #1025406
  * media: smsusb: add autodetection support for USB ID 2040:f5a0
    - LP: #1025406
  * ARM: fix rcu stalls on SMP platforms
    - LP: #1025406
  * cipso: handle CIPSO options correctly when NetLabel is disabled
    - LP: #1025406
  * net: l2tp_eth: fix kernel panic on rmmod l2tp_eth
    - LP: #1025406
  * dummy: fix rcu_sched self-detected stalls
    - LP: #1025406
  * ethtool: allow ETHTOOL_GSSET_INFO for users
    - LP: #1025406
  * bridge: Assign rtnl_link_ops to bridge devices created via ioctl (v2)
    - LP: #1025406
  * bonding: Fix corrupted queue_mapping
    - LP: #1025406
  * ipv6: Move ipv6 proc file registration to end of init order
    - LP: #1025406
  * sky2: fix checksum bit management on some chips
    - LP: #1025406
  * be2net: fix a race in be_xmit()
    - LP: #1025406
  * netpoll: fix netpoll_send_udp() bugs
    - LP: #1025406
  * hwmon: (applesmc) Limit key length in warning messages
    - LP: #1025406
  * nilfs2: ensure proper cache clearing for gc-inodes
    - LP: #1025406
  * udf: Use 'ret' instead of abusing 'i' in udf_load_logicalvol()
    - LP: #1025406
  * udf: Avoid run away loop when partition table length is corrupted
    - LP: #1025406
  * udf: Fortify loading of sparing table
    - LP: #1025406
  * ath9k: Fix softlockup in AR9485
    - LP: #1025406
  * ath9k_hw: avoid possible infinite loop in ar9003_get_pll_sqsum_dvc
    - LP: #1025406
  * ath9k: enable serialize_regmode for non-PCIE AR9287
    - LP: #1025406
  * ASoC: tlv320aic3x: Fix codec pll configure bug
    - LP: #1025406
  * Btrfs: run delayed directory updates during log replay
    - LP: #1025406
  * drm/edid: don't return stack garbage from supports_rb
    - LP: #1025406
  * drm/nouveau/fbcon: using nv_two_heads is not a good idea
    - LP: #1025406
  * drm/i915: Fix eDP blank screen after S3 resume on HP desktops
    - LP: #1025406
  * acpi_pad: fix power_saving thread deadlock
    - LP: #1025406
  * ACPI: Add a quirk for "AMILO PRO V2030" to ignore the timer overriding
    - LP: #1025406
  * ACPI, x86: fix Dell M6600 ACPI reboot regression via DMI
    - LP: #1025406
  * ACPI sysfs.c strlen fix
    - LP: #1025406
  * stable: Allow merging of backports for serious user-visible performance
    issues
    - LP: #1025406
  * USB: option: add id for Cellient MEN-200
    - LP: #1025406
  * USB: option: Add USB ID for Novatel Ovation MC551
    - LP: #1025406
  * USB: CP210x Add 10 Device IDs
    - LP: #1025406
  * cfg80211: fix potential deadlock in regulatory
    - LP: #1025406
  * can: c_can: precedence error in c_can_chip_config()
    - LP: #1025406
  * oprofile: perf: use NR_CPUS instead or nr_cpumask_bits for static array
    - LP: #1025406
  * mac80211: correct behaviour on unrecognised action frames
    - LP: #1025406
  * mwifiex: fix 11n rx packet drop issue
    - LP: #1025406
  * vfs: make O_PATH file descriptors usable for 'fchdir()'
    - LP: #1025406
  * mtd: cafe_nand: fix an & vs | mistake
    - LP: #1025406
  * tcm_fc: Resolve suspicious RCU usage warnings
    - LP: #1025406
  * eCryptfs: Gracefully refuse miscdev file ops on inherited/passed files
    - LP: #1025406
  * eCryptfs: Fix lockdep warning in miscdev operations
    - LP: #1025406
  * eCryptfs: Properly check for O_RDONLY flag before doing privileged open
    - LP: #1025406
  * USB: cdc-wdm: fix lockup on error in wdm_read
    - LP: #1025406
  * USB: option: add ZTE MF60
    - LP: #1025406
  * USB: option: Add MEDIATEK product ids
    - LP: #1025406
  * PCI: EHCI: fix crash during suspend on ASUS computers
    - LP: #1025406
  * xhci: Avoid dead ports when CONFIG_USB_XHCI_HCD=n
    - LP: #1025406
  * ipheth: add support for iPad
    - LP: #1025406
  * tracing: change CPU ring buffer state from tracing_cpumask
    - LP: #1025406
  * vhost: don't forget to schedule()
    - LP: #1025406
  * raid5: delayed stripe fix
    - LP: #1025406
  * rtl8187: ->brightness_set can not sleep
    - LP: #1025406
  * umem: fix up unplugging
    - LP: #1025406
  * x86, cpufeature: Rename X86_FEATURE_DTS to X86_FEATURE_DTHERM
    - LP: #1025406
  * md/raid5: Do not add data_offset before call to is_badblock
    - LP: #1025406
  * md/raid10: Don't try to recovery unmatched (and unused) chunks.
    - LP: #1025406
  * memory hotplug: fix invalid memory access caused by stale kswapd
    pointer
    - LP: #1025406
  * drivers/rtc/rtc-mxc.c: fix irq enabled interrupts warning
    - LP: #1025406
  * mm, thp: abort compaction if migration page cannot be charged to memcg
    - LP: #1025406
  * fs: ramfs: file-nommu: add SetPageUptodate()
    - LP: #1025406
  * mm: Hold a file reference in madvise_remove
    - LP: #1025406
  * ACPI: Make acpi_skip_timer_override cover all source_irq==0 cases
    - LP: #1025406
  * ACPI: Remove one board specific WARN when ignoring timer overriding
    - LP: #1025406
  * Linux 3.0.37
    - LP: #1025406
  * x86, mce, therm_throt: Don't report power limit and package level
    thermal throttle events in mcelog
    - LP: #930288
  * hwmon: (it87) Preserve configuration register bits on init
    - LP: #1026850
  * block: fix infinite loop in __getblk_slow
    - LP: #1026850
  * media: dvb-core: Release semaphore on error path dvb_register_device()
    - LP: #1026850
  * mtd: nandsim: don't open code a do_div helper
    - LP: #1026850
  * ARM: SAMSUNG: fix race in s3c_adc_start for ADC
    - LP: #1026850
  * intel_ips: blacklist HP ProBook laptops
    - LP: #1026850
  * fifo: Do not restart open() if it already found a partner
    - LP: #1026850
  * rt2x00usb: fix indexes ordering on RX queue kick
    - LP: #1026850
  * e1000e: Correct link check logic for 82571 serdes
    - LP: #1026850
  * Input: xpad - add Andamiro Pump It Up pad
    - LP: #1026850
  * tcp: drop SYN+FIN messages
    - LP: #1026850
  * cfg80211: check iface combinations only when iface is running
    - LP: #1026850
  * ntp: Fix leap-second hrtimer livelock
    - LP: #1026850
  * ntp: Correct TAI offset during leap second
    - LP: #1026850
  * timekeeping: Fix CLOCK_MONOTONIC inconsistency during leapsecond
    - LP: #1026850
  * time: Move common updates to a function
    - LP: #1026850
  * hrtimer: Provide clock_was_set_delayed()
    - LP: #1026850
  * timekeeping: Fix leapsecond triggered load spike issue
    - LP: #1026850
  * timekeeping: Maintain ktime_t based offsets for hrtimers
    - LP: #1026850
  * hrtimers: Move lock held region in hrtimer_interrupt()
    - LP: #1026850
  * timekeeping: Provide hrtimer update function
    - LP: #1026850
  * hrtimer: Update hrtimer base offsets each hrtimer_interrupt
    - LP: #1026850
  * timekeeping: Add missing update call in timekeeping_resume()
    - LP: #1026850
  * Linux 3.0.38
    - LP: #1026850
 -- Luis Henriques <luis.henriques@canonical.com>   Mon, 23 Jul 2012 10:46:46 +0100

Changed in linux (Ubuntu Oneiric):
status:	Fix Committed → Fix Released

Revision history for this message

Launchpad Janitor (janitor) wrote on 2012-08-10:

#50

Download full text (16.5 KiB)

This bug was fixed in the package linux - 3.2.0-29.46

---------------
linux (3.2.0-29.46) precise-proposed; urgency=low

[ Tim Gardner ]

  * No change upload to fix armel/armhf FTBS caused by
    'UBUNTU: [Config] SND_OMAP_SOC, SND_OMAP_SOC_MCBSP and SND_OMAP_SOC_OMAP3_BEAGLE =y'
    Added ignore and ignore.module files to ABI directories.

[Luis Henriques]

* Release Tracking Bug
- LP: #1029507

linux (3.2.0-29.45) precise-proposed; urgency=low

[Luis Henriques]

* Release Tracking Bug
- LP: #1029507

[ Andy Whitcroft ]

  * SAUCE: rds_ib_send() -- prevent local pings triggering BUG_ON()
    - LP: #1016299
    - CVE-2012-2372

[ Upstream Kernel Changes ]

  * Revert "samsung-laptop: make the dmi check less strict"
    - LP: #1029431
  * samsung-laptop: make the dmi check less strict
    - LP: #1029431
  * raid5: delayed stripe fix
    - LP: #1029431
  * tcp: drop SYN+FIN messages
    - LP: #1029431
  * tg3: Apply short DMA frag workaround to 5906
    - LP: #1029431
  * rtl8187: ->brightness_set can not sleep
    - LP: #1029431
  * net/wireless: ipw2x00: add supported cipher suites to wiphy
    initialization
    - LP: #1029431
  * kbuild: do not check for ancient modutils tools
    - LP: #1029431
  * brcmsmac: "INTERMEDIATE but not AMPDU" only when tracing
    - LP: #1029431
  * ext4: Report max_batch_time option correctly
    - LP: #1029431
  * NFSv4: Reduce the footprint of the idmapper
    - LP: #1029431
  * NFSv4: Further reduce the footprint of the idmapper
    - LP: #1029431
  * macvtap: zerocopy: fix offset calculation when building skb
    - LP: #1029431
  * macvtap: zerocopy: fix truesize underestimation
    - LP: #1029431
  * macvtap: zerocopy: put page when fail to get all requested user pages
    - LP: #1029431
  * macvtap: zerocopy: set SKBTX_DEV_ZEROCOPY only when skb is built
    successfully
    - LP: #1029431
  * macvtap: zerocopy: validate vectors before building skb
    - LP: #1029431
  * KVM: Fix buffer overflow in kvm_set_irq()
    - LP: #1029431
  * scsi: Silence unnecessary warnings about ioctl to partition
    - LP: #1029431
  * iommu/amd: Fix missing iommu_shutdown initialization in passthrough
    mode
    - LP: #1029431
  * iommu/amd: Initialize dma_ops for hotplug and sriov devices
    - LP: #1029431
  * usb: Add support for root hub port status CAS
    - LP: #1029431
  * gpiolib: wm8994: Pay attention to the value set when enabling as output
    - LP: #1029431
  * sched/nohz: Rewrite and fix load-avg computation -- again
    - LP: #1029431
  * USB: option: add ZTE MF60
    - LP: #1029431
  * USB: option: Add MEDIATEK product ids
    - LP: #1029431
  * USB: cdc-wdm: fix lockup on error in wdm_read
    - LP: #1029431
  * mtd: nandsim: don't open code a do_div helper
    - LP: #1029431
  * dvb-core: Release semaphore on error path dvb_register_device()
    - LP: #1029431
  * hwspinlock/core: use global ID to register hwspinlocks on multiple
    devices
    - LP: #1029431
  * libsas: fix taskfile corruption in sas_ata_qc_fill_rtf
    - LP: #1029431
  * md/raid1: fix use-after-free bug in RAID1 data-check code.
    - LP: #1029431
  * PCI: EHCI: fix crash during suspend on ASUS computers
    - L...

This bug was fixed in the package linux - 3.2.0-29.46

---------------
linux (3.2.0-29.46) precise-proposed; urgency=low

[ Tim Gardner ]

* No change upload to fix armel/armhf FTBS caused by
    'UBUNTU: [Config] SND_OMAP_SOC, SND_OMAP_SOC_MCBSP and SND_OMAP_SOC_OMAP3_BEAGLE =y'
    Added ignore and ignore.module files to ABI directories.

[Luis Henriques]

* Release Tracking Bug
    - LP: #1029507

linux (3.2.0-29.45) precise-proposed; urgency=low

[Luis Henriques]

* Release Tracking Bug
    - LP: #1029507

[ Andy Whitcroft ]

* SAUCE: rds_ib_send() -- prevent local pings triggering BUG_ON()
    - LP: #1016299
    - CVE-2012-2372

[ Upstream Kernel Changes ]

* Revert "samsung-laptop: make the dmi check less strict"
    - LP: #1029431
  * samsung-laptop: make the dmi check less strict
    - LP: #1029431
  * raid5: delayed stripe fix
    - LP: #1029431
  * tcp: drop SYN+FIN messages
    - LP: #1029431
  * tg3: Apply short DMA frag workaround to 5906
    - LP: #1029431
  * rtl8187: ->brightness_set can not sleep
    - LP: #1029431
  * net/wireless: ipw2x00: add supported cipher suites to wiphy
    initialization
    - LP: #1029431
  * kbuild: do not check for ancient modutils tools
    - LP: #1029431
  * brcmsmac: "INTERMEDIATE but not AMPDU" only when tracing
    - LP: #1029431
  * ext4: Report max_batch_time option correctly
    - LP: #1029431
  * NFSv4: Reduce the footprint of the idmapper
    - LP: #1029431
  * NFSv4: Further reduce the footprint of the idmapper
    - LP: #1029431
  * macvtap: zerocopy: fix offset calculation when building skb
    - LP: #1029431
  * macvtap: zerocopy: fix truesize underestimation
    - LP: #1029431
  * macvtap: zerocopy: put page when fail to get all requested user pages
    - LP: #1029431
  * macvtap: zerocopy: set SKBTX_DEV_ZEROCOPY only when skb is built
    successfully
    - LP: #1029431
  * macvtap: zerocopy: validate vectors before building skb
    - LP: #1029431
  * KVM: Fix buffer overflow in kvm_set_irq()
    - LP: #1029431
  * scsi: Silence unnecessary warnings about ioctl to partition
    - LP: #1029431
  * iommu/amd: Fix missing iommu_shutdown initialization in passthrough
    mode
    - LP: #1029431
  * iommu/amd: Initialize dma_ops for hotplug and sriov devices
    - LP: #1029431
  * usb: Add support for root hub port status CAS
    - LP: #1029431
  * gpiolib: wm8994: Pay attention to the value set when enabling as output
    - LP: #1029431
  * sched/nohz: Rewrite and fix load-avg computation -- again
    - LP: #1029431
  * USB: option: add ZTE MF60
    - LP: #1029431
  * USB: option: Add MEDIATEK product ids
    - LP: #1029431
  * USB: cdc-wdm: fix lockup on error in wdm_read
    - LP: #1029431
  * mtd: nandsim: don't open code a do_div helper
    - LP: #1029431
  * dvb-core: Release semaphore on error path dvb_register_device()
    - LP: #1029431
  * hwspinlock/core: use global ID to register hwspinlocks on multiple
    devices
    - LP: #1029431
  * libsas: fix taskfile corruption in sas_ata_qc_fill_rtf
    - LP: #1029431
  * md/raid1: fix use-after-free bug in RAID1 data-check code.
    - LP: #1029431
  * PCI: EHCI: fix crash during suspend on ASUS computers
    - LP: #1029431
  * memory hotplug: fix invalid memory access caused by stale kswapd
    pointer
    - LP: #1029431
  * mm, thp: abort compaction if migration page cannot be charged to memcg
    - LP: #1029431
  * drivers/rtc/rtc-mxc.c: fix irq enabled interrupts warning
    - LP: #1029431
  * fs: ramfs: file-nommu: add SetPageUptodate()
    - LP: #1029431
  * cpufreq / ACPI: Fix not loading acpi-cpufreq driver regression
    - LP: #1029431
  * hwmon: (it87) Preserve configuration register bits on init
    - LP: #1029431
  * ARM: SAMSUNG: fix race in s3c_adc_start for ADC
    - LP: #1029431
  * block: fix infinite loop in __getblk_slow
    - LP: #1029431
  * Remove easily user-triggerable BUG from generic_setlease
    - LP: #1029431
  * NFC: Export nfc.h to userland
    - LP: #1029431
  * PM / Hibernate: Hibernate/thaw fixes/improvements
    - LP: #1029431
  * cfg80211: check iface combinations only when iface is running
    - LP: #1029431
  * intel_ips: blacklist HP ProBook laptops
    - LP: #1029431
  * atl1c: fix issue of transmit queue 0 timed out
    - LP: #1029431
  * rt2x00usb: fix indexes ordering on RX queue kick
    - LP: #1029431
  * iwlegacy: always monitor for stuck queues
    - LP: #1029431
  * iwlegacy: don't mess up the SCD when removing a key
    - LP: #1029431
  * e1000e: Correct link check logic for 82571 serdes
    - LP: #1029431
  * tcm_fc: Fix crash seen with aborts and large reads
    - LP: #1029431
  * fifo: Do not restart open() if it already found a partner
    - LP: #1029431
  * target: Clean up returning errors in PR handling code
    - LP: #1029431
  * target: Fix range calculation in WRITE SAME emulation when num blocks
    == 0
    - LP: #1029431
  * cifs: on CONFIG_HIGHMEM machines, limit the rsize/wsize to the kmap
    space
    - LP: #1029431
  * cifs: always update the inode cache with the results from a FIND_*
    - LP: #1029431
  * mm: fix lost kswapd wakeup in kswapd_stop()
    - LP: #1029431
  * md: avoid crash when stopping md array races with closing other open
    fds.
    - LP: #1029431
  * md/raid1: close some possible races on write errors during resync
    - LP: #1029431
  * MIPS: Properly align the .data..init_task section.
    - LP: #1029431
  * UBIFS: fix a bug in empty space fix-up
    - LP: #1029431
  * ore: Fix NFS crash by supporting any unaligned RAID IO
    - LP: #1029431
  * ore: Remove support of partial IO request (NFS crash)
    - LP: #1029431
  * pnfs-obj: don't leak objio_state if ore_write/read fails
    - LP: #1029431
  * pnfs-obj: Fix __r4w_get_page when offset is beyond i_size
    - LP: #1029431
  * dm raid1: fix crash with mirror recovery and discard
    - LP: #1029431
  * dm raid1: set discard_zeroes_data_unsupported
    - LP: #1029431
  * ntp: Fix leap-second hrtimer livelock
    - LP: #1029431
  * ntp: Correct TAI offset during leap second
    - LP: #1029431
  * timekeeping: Fix CLOCK_MONOTONIC inconsistency during leapsecond
    - LP: #1029431
  * time: Move common updates to a function
    - LP: #1029431
  * hrtimer: Provide clock_was_set_delayed()
    - LP: #1029431
  * timekeeping: Fix leapsecond triggered load spike issue
    - LP: #1029431
  * timekeeping: Maintain ktime_t based offsets for hrtimers
    - LP: #1029431
  * hrtimers: Move lock held region in hrtimer_interrupt()
    - LP: #1029431
  * timekeeping: Provide hrtimer update function
    - LP: #1029431
  * hrtimer: Update hrtimer base offsets each hrtimer_interrupt
    - LP: #1029431
  * timekeeping: Add missing update call in timekeeping_resume()
    - LP: #1029431
  * powerpc: Fix wrong divisor in usecs_to_cputime
    - LP: #1029431
  * vhost: don't forget to schedule()
    - LP: #1029431
  * r8169: call netif_napi_del at errpaths and at driver unload
    - LP: #1029431
  * bnx2x: fix checksum validation
    - LP: #1029431
  * bnx2x: fix panic when TX ring is full
    - LP: #1029431
  * net: remove skb_orphan_try()
    - LP: #1029431
  * ACPI: Make acpi_skip_timer_override cover all source_irq==0 cases
    - LP: #1029431
  * ACPI: Remove one board specific WARN when ignoring timer overriding
    - LP: #1029431
  * ACPI: Add a quirk for "AMILO PRO V2030" to ignore the timer overriding
    - LP: #1029431
  * ACPI, x86: fix Dell M6600 ACPI reboot regression via DMI
    - LP: #1029431
  * ACPI sysfs.c strlen fix
    - LP: #1029431
  * eCryptfs: Gracefully refuse miscdev file ops on inherited/passed files
    - LP: #1029431
  * eCryptfs: Fix lockdep warning in miscdev operations
    - LP: #1029431
  * eCryptfs: Properly check for O_RDONLY flag before doing privileged open
    - LP: #1029431
  * ipheth: add support for iPad
    - LP: #1029431
  * stmmac: Fix for nfs hang on multiple reboot
    - LP: #1029431
  * bonding: debugfs and network namespaces are incompatible
    - LP: #1029431
  * bonding: Manage /proc/net/bonding/ entries from the netdev events
    - LP: #1029431
  * Input: bcm5974 - Add support for 2012 MacBook Pro Retina
    - LP: #1029431
  * Input: xpad - handle all variations of Mad Catz Beat Pad
    - LP: #1029431
  * Input: xpad - add signature for Razer Onza Tournament Edition
    - LP: #1029431
  * Input: xpad - add Andamiro Pump It Up pad
    - LP: #1029431
  * HID: add support for 2012 MacBook Pro Retina
    - LP: #1029431
  * Linux 3.2.24
    - LP: #1029431

linux (3.2.0-28.44) precise-proposed; urgency=low

## Note 3.2.0-28.44 was never uploaded and superceded by 3.2.0-29.45 ##

[Luis Henriques]

* Release Tracking Bug
    - LP: #1027881

[ Andy Whitcroft ]

* [Packaging] getabis should be extracting all packages
    - LP: #1021174
  * getabis -- series uses linux-image-extra
    - LP: #1021174

[ Herton Ronaldo Krzesinski ]

* SAUCE: (drop after 3.2) ALSA: hda - restrict bass configuration on Dell
    Inspiron 17
    - LP: #994685

[ Leann Ogasawara ]

* [Config] Enable CONFIG_RT2800USB_RT35XX and CONFIG_RT2800USB_RT53XX
    - LP: #1019561

[ Manoj Iyer ]

* SAUCE: Bluetooth: btusb: Add vendor specific ID (0a5c:21f4) BCM20702A0
    - LP: #1010281

[ Paolo Pisati ]

* [Config] SND_OMAP_SOC, SND_OMAP_SOC_MCBSP and SND_OMAP_SOC_OMAP3_BEAGLE
    =y
    - LP: #1019321

[ Stefan Bader ]

* SAUCE: sched: Fix race in task_group()
    - LP: #999755

[ Tim Gardner ]

* [Config] retrieve ABIs from PPA as a last attempt

[ Upstream Kernel Changes ]

* staging:iio:ad7606: Re-add missing scale attribute
    - LP: #1022747
  * Tools: hv: verify origin of netlink connector message
    - LP: #1022747
    - CVE-2012-2669
  * edac: avoid mce decoding crash after edac driver unloaded
    - LP: #1022747
  * hwrng: atmel-rng - fix data valid check
    - LP: #1022747
  * staging: r8712u: Add new USB IDs
    - LP: #1022747
  * hwmon: (applesmc) Limit key length in warning messages
    - LP: #1022747
  * mm: pmd_read_atomic: fix 32bit PAE pmd walk vs pmd_populate SMP race
    condition
    - LP: #1022747
  * thp: avoid atomic64_read in pmd_read_atomic for 32bit PAE
    - LP: #1022747
  * nilfs2: ensure proper cache clearing for gc-inodes
    - LP: #1022747
  * e1000e: Disable ASPM L1 on 82574
    - LP: #1022747
  * e1000e: Remove special case for 82573/82574 ASPM L1 disablement
    - LP: #1022747
  * staging:rts_pstor:Fix possible panic by NULL pointer dereference
    - LP: #1022747
  * gspca-core: Fix buffers staying in queued state after a stream_off
    - LP: #1022747
  * smsusb: add autodetection support for USB ID 2040:f5a0
    - LP: #1022747
  * drm/edid: don't return stack garbage from supports_rb
    - LP: #1022747
  * drm/nouveau/fbcon: using nv_two_heads is not a good idea
    - LP: #1022747
  * dm thin: reinstate missing mempool_free in cell_release_singleton
    - LP: #1022747
  * ath9k: Fix a WARNING on suspend/resume with IBSS
    - LP: #1022747
  * cfg80211: fix potential deadlock in regulatory
    - LP: #1022747
  * ath9k: Fix softlockup in AR9485
    - LP: #1022747
  * can: c_can: precedence error in c_can_chip_config()
    - LP: #1022747
  * ath9k: fix a tx rate duration calculation bug
    - LP: #1022747
  * batman-adv: fix skb->data assignment
    - LP: #1022747
  * ARM: SAMSUNG: Should check for IS_ERR(clk) instead of NULL
    - LP: #1022747
  * ath9k_hw: avoid possible infinite loop in ar9003_get_pll_sqsum_dvc
    - LP: #1022747
  * iwlwifi: remove log_event debugfs file debugging is disabled
    - LP: #1022747
  * ARM: SAMSUNG: Fix for S3C2412 EBI memory mapping
    - LP: #1022747
  * USB: option: add id for Cellient MEN-200
    - LP: #1022747
  * oprofile: perf: use NR_CPUS instead or nr_cpumask_bits for static array
    - LP: #1022747
  * drm/i915: Refactor the deferred PM_IIR handling into a single function
    - LP: #1022747
  * drm/i915: rip out the PM_IIR WARN
    - LP: #1022747
  * drm/i915: Fix eDP blank screen after S3 resume on HP desktops
    - LP: #1022747
  * PM / Sleep: Prevent waiting forever on asynchronous suspend after abort
    - LP: #1022747
  * x86, cpufeature: Rename X86_FEATURE_DTS to X86_FEATURE_DTHERM
    - LP: #1022747
  * stable: Allow merging of backports for serious user-visible performance
    issues
    - LP: #1022747
  * ALSA: hda - Add Realtek ALC280 codec support
    - LP: #1017017, #1022747
  * USB: option: Add USB ID for Novatel Ovation MC551
    - LP: #1022747
  * USB: CP210x Add 10 Device IDs
    - LP: #1022747
  * xen/netfront: teardown the device before unregistering it.
    - LP: #1022747
  * can: flexcan: use be32_to_cpup to handle the value of dt entry
    - LP: #1022747
  * acpi_pad: fix power_saving thread deadlock
    - LP: #1022747
  * batman-adv: only drop packets of known wifi clients
    - LP: #1022747
  * Linux 3.2.22
    - LP: #1022747
  * ACPI / PM: Make acpi_pm_device_sleep_state() follow the specification
    - LP: #904261
  * net: sock: validate data_len before allocating skb in
    sock_alloc_send_pskb()
    - LP: #1006622
    - CVE-2012-2136
  * epoll: clear the tfile_check_list on -ELOOP
    - LP: #1021811
    - CVE-2012-3375
  * splice: fix racy pipe->buffers uses
    - LP: #1024020
  * umem: fix up unplugging
    - LP: #1024020
  * mwifiex: fix 11n rx packet drop issue
    - LP: #1024020
  * mwifiex: fix WPS eapol handshake failure
    - LP: #1024020
  * NFC: Prevent multiple buffer overflows in NCI
    - LP: #1024020
  * ath9k: fix dynamic WEP related regression
    - LP: #1024020
  * NFC: Return from rawsock_release when sk is NULL
    - LP: #1024020
  * rtlwifi: rtl8192cu: New USB IDs
    - LP: #1024020
  * ath9k: enable serialize_regmode for non-PCIE AR9287
    - LP: #1024020
  * mac80211: correct behaviour on unrecognised action frames
    - LP: #1024020
  * ASoC: tlv320aic3x: Fix codec pll configure bug
    - LP: #1024020
  * powerpc/xmon: Use cpumask iterator to avoid warning
    - LP: #1024020
  * powerpc/kvm: sldi should be sld
    - LP: #1024020
  * md/raid10: Don't try to recovery unmatched (and unused) chunks.
    - LP: #1024020
  * md/raid5: Do not add data_offset before call to is_badblock
    - LP: #1024020
  * md/raid5: In ops_run_io, inc nr_pending before calling
    md_wait_for_blocked_rdev
    - LP: #1024020
  * md/raid10: fix failure when trying to repair a read error.
    - LP: #1024020
  * drm/i915: kick any firmware framebuffers before claiming the gtt
    - LP: #1024020
  * dm persistent data: fix shadow_info_leak on dm_tm_destroy
    - LP: #1024020
  * dm persistent data: handle space map checker creation failure
    - LP: #1024020
  * dm persistent data: fix allocation failure in space map checker init
    - LP: #1024020
  * tracing: change CPU ring buffer state from tracing_cpumask
    - LP: #1024020
  * mwifiex: fix wrong return values in add_virtual_intf() error cases
    - LP: #1024020
  * udf: Use 'ret' instead of abusing 'i' in udf_load_logicalvol()
    - LP: #1024020
  * udf: Avoid run away loop when partition table length is corrupted
    - LP: #1024020
  * udf: Fortify loading of sparing table
    - LP: #1024020
  * ARM: fix rcu stalls on SMP platforms
    - LP: #1024020
  * cipso: handle CIPSO options correctly when NetLabel is disabled
    - LP: #1024020
  * net: l2tp_eth: fix kernel panic on rmmod l2tp_eth
    - LP: #1024020
  * l2tp: fix a race in l2tp_ip_sendmsg()
    - LP: #1024020
  * sky2: fix checksum bit management on some chips
    - LP: #1024020
  * be2net: fix a race in be_xmit()
    - LP: #1024020
  * dummy: fix rcu_sched self-detected stalls
    - LP: #1024020
  * bonding: Fix corrupted queue_mapping
    - LP: #1024020
  * ethtool: allow ETHTOOL_GSSET_INFO for users
    - LP: #1024020
  * netpoll: fix netpoll_send_udp() bugs
    - LP: #1024020
  * ipv6: Move ipv6 proc file registration to end of init order
    - LP: #1024020
  * bridge: Assign rtnl_link_ops to bridge devices created via ioctl (v2)
    - LP: #1024020
  * Btrfs: run delayed directory updates during log replay
    - LP: #1024020
  * cifs: when server doesn't set CAP_LARGE_READ_X, cap default rsize at
    MaxBufferSize
    - LP: #1024020
  * ocfs2: clear unaligned io flag when dio fails
    - LP: #1024020
  * aio: make kiocb->private NUll in init_sync_kiocb()
    - LP: #1024020
  * mtd: cafe_nand: fix an & vs | mistake
    - LP: #1024020
  * mm: Hold a file reference in madvise_remove
    - LP: #1024020
  * tcm_fc: Resolve suspicious RCU usage warnings
    - LP: #1024020
  * vfs: make O_PATH file descriptors usable for 'fchdir()'
    - LP: #1024020
  * Linux 3.2.23
    - LP: #1024020
  * asix: asix_rx_fixup surgery to reduce skb truesizes
    - LP: #947723
  * ALSA: hda - Fix internal mic for Lenovo Ideapad U300s
    - LP: #903853
  * ALSA: hda - Fix oops caused by recent commit "Fix internal mic for
    Lenovo Ideapad U300s"
    - LP: #903853
 -- Luis Henriques <luis.henriques@canonical.com>   Fri, 27 Jul 2012 17:08:57 +0100

Changed in linux (Ubuntu Precise):
status:	Fix Committed → Fix Released

Ubuntu Foundations Team Bug Bot (crichton) on 2012-08-10

tags:

added: patch

Stefan Bader (smb) on 2012-10-23

Changed in linux (Ubuntu Natty):
status:	In Progress → Won't Fix
assignee:	Stefan Bader (stefan-bader-canonical) → nobody
Changed in linux (Ubuntu):
status:	Triaged → Fix Released

Ubuntu
linux package

Kernel crash in rb_next doing ohai loops

Bug Description

Related branches

CVE References

Other bug subscribers

Patches

Bug attachments

Remote bug watches

	Status	Importance	Assigned to
linux (Ubuntu)	Fix Released	Medium	Unassigned
Natty	Won't Fix	Undecided	Unassigned
Oneiric	Fix Released	Undecided	Unassigned
Precise	Fix Released	Undecided	Unassigned

Ubuntulinux package

Kernel crash in rb_next doing ohai loops

Bug Description

Related branches

CVE References

Other bug subscribers

Patches

Bug attachments

Remote bug watches

Ubuntu
linux package