precise ec2 images fail to boot with kernel oops

Bug #911204 reported by James Page
This bug affects 1 person
Affects: linux (Ubuntu)
Status: Fix Released
Importance: Critical
Assigned to: Stefan Bader
Milestone: (none listed)

Bug Description

Looks like this started happening around the 28th of December:

https://jenkins.qa.ubuntu.com/view/Precise%20ISO%20Testing%20Dashboard/view/Daily/job/precise-server-ec2-daily/

------ Extract from ec2 console ------

[ 0.607292] BUG: unable to handle kernel NULL pointer dereference at 00000294
[ 0.607303] IP: [<c0172cd3>] queue_work_on+0x13/0x40
[ 0.607316] *pdpt = 0000000000000000 *pde = 0000000000000000
[ 0.607325] Oops: 0002 [#1] SMP
[ 0.607332] Modules linked in:
[ 0.607337]
[ 0.607341] Pid: 21, comm: kworker/0:1 Not tainted 3.2.0-7-virtual #13-Ubuntu
[ 0.607352] EIP: 0061:[<c0172cd3>] EFLAGS: 00010092 CPU: 0
[ 0.607358] EIP is at queue_work_on+0x13/0x40
[ 0.607364] EAX: 00000000 EBX: 00000000 ECX: 00000294 EDX: eb816600
[ 0.607370] ESI: 000000ff EDI: ffffffd0 EBP: eb913dd0 ESP: eb913dc8
[ 0.607377] DS: 007b ES: 007b FS: 00d8 GS: 00e0 SS: e021
[ 0.607383] Process kworker/0:1 (pid: 21, ti=eb912000 task=eb918000 task.ti=eb912000)
[ 0.607390] Stack:
[ 0.607394] eb816600 000000ff eb913ddc c0172d3a c0b04740 eb913de4 c0172d54 eb913dec
[ 0.607408] c054b252 eb913e00 c054dd0c 000000df 00000020 c0b04740 eb913e14 c054dd6a
[ 0.607421] eb913ef4 00000045 c0b04740 eb913e34 c054e5ec 2f2d1403 00000047 00000023
[ 0.607435] Call Trace:
[ 0.607441] [<c0172d3a>] queue_work+0x1a/0x20
[ 0.607448] [<c0172d54>] schedule_work+0x14/0x20
[ 0.607455] [<c054b252>] rtc_update_irq+0x12/0x20
[ 0.607462] [<c054dd0c>] cmos_checkintr.isra.2+0x5c/0x70
[ 0.607468] [<c054dd6a>] cmos_irq_disable+0x4a/0x60
[ 0.607474] [<c054e5ec>] cmos_set_alarm+0xdc/0x190
[ 0.607480] [<c054b9a7>] __rtc_set_alarm+0x87/0xa0
[ 0.607487] [<c054c7e0>] rtc_timer_do_work+0x160/0x200
[ 0.607495] [<c01092ce>] ? __raw_callee_save_xen_irq_enable+0x6/0x8
[ 0.607504] [<c0149701>] ? cpuusage_read+0x51/0x60
[ 0.607511] [<c01748c1>] process_one_work+0x101/0x3a0
[ 0.607517] [<c054c680>] ? rtc_pie_update_irq+0x70/0x70
[ 0.607523] [<c0175384>] worker_thread+0x124/0x2d0
[ 0.607530] [<c0175260>] ? manage_workers.isra.28+0x110/0x110
[ 0.607537] [<c01791ad>] kthread+0x6d/0x80
[ 0.607543] [<c0179140>] ? flush_kthread_worker+0x80/0x80
[ 0.607552] [<c06acb3e>] kernel_thread_helper+0x6/0x10
[ 0.607557] Code: c1 8b 52 04 64 8b 1d ec 3e a2 c0 89 d8 e8 16 fd ff ff 5b 5d c3 8d 76 00 55 89 e5 83 ec 08 89 5d f8 89 75 fc 3e 8d 74 26 00 89 c3 <3e> 0f ba 29 00 19 f6 31 c0 85 f6 75 0c 89 d8 e8 e9 fc ff ff b8
[ 0.607631] EIP: [<c0172cd3>] queue_work_on+0x13/0x40 SS:ESP e021:eb913dc8
[ 0.607642] CR2: 0000000000000294
[ 0.607654] ---[ end trace 679c3f87b6a5d71c ]---
[ 0.607710] rtc_cmos rtc_cmos: rtc core: registered rtc_cmos as rtc0

Phillip Susi (psusi)
affects: ubuntu → linux (Ubuntu)
Dave Walker (davewalker)
Changed in linux (Ubuntu):
importance: Undecided → Critical
status: New → Confirmed
Stefan Bader (smb) wrote :

Looks like the following commit:

commit 93b2ec0128c431148b216b8f7337c1a52131ef03
Author: NeilBrown <email address hidden>
Date: Fri Dec 9 09:39:15 2011 +1100

    rtc: Expire alarms after the time is set.

This commit changed the code so that rtc_initialize_alarm() triggers the irq worker. However, that function is called from within rtc_device_register(), before registration has completed.
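The ordering problem described above can be modelled in a few lines. This is only a language-agnostic sketch in Python, not kernel code: the names RtcDevice, Driver, register and probe are hypothetical, and the model records an error at the point where real code would dereference a pointer that has not been set yet.

```python
class RtcDevice:
    """Hypothetical stand-in for the device object created during registration."""

    def __init__(self, name):
        self.name = name
        self.pending = []  # work items queued on this device

    def queue_work(self, item):
        self.pending.append(item)


class Driver:
    """Hypothetical driver that reaches its device through a back-pointer."""

    def __init__(self):
        self.rtc = None    # only assigned after register() has returned
        self.errors = []

    def on_alarm(self):
        # Models the irq path. Real code would dereference self.rtc here;
        # the sketch records an error instead of crashing.
        if self.rtc is None:
            self.errors.append("alarm fired before registration completed")
            return
        self.rtc.queue_work("update-irq")


def register(driver, name, fire_alarm_during_registration):
    """Create the device; optionally fire the alarm path too early."""
    dev = RtcDevice(name)
    if fire_alarm_during_registration:
        driver.on_alarm()  # driver.rtc is still None at this point
    return dev


def probe(fire_alarm_during_registration):
    """Register the device, then store the back-pointer in the driver."""
    driver = Driver()
    driver.rtc = register(driver, "rtc0", fire_alarm_during_registration)
    if not fire_alarm_during_registration:
        driver.on_alarm()  # deferred until the back-pointer is valid
    return driver
```

Calling probe(True) leaves an entry in errors and nothing queued on the device, while probe(False) queues the work item normally. Deferring the alarm handling until after registration is just one way to make the model safe; it is not meant to describe the actual fix.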

Changed in linux (Ubuntu):
assignee: nobody → Stefan Bader (stefan-bader-canonical)
Stefan Bader (smb) wrote :

I tried this patch on top of the failing kernel and it seems to avoid the crash (or sometimes I saw a long (90s) delay before rtc registration completed with the error). This may still be a coincidence, because the patch naturally changes the timing, so it may well not be the final solution.

tags: added: patch
tags: added: kernel-da-key precise
tags: added: iso-testing qa-daily-testing rls-p-tracking
Scott Moser (smoser)
tags: added: cloud-images ec2-images
Stefan Bader (smb) wrote :

Upstream reverted the critical change just before the 3.2 release. The revert is included in the latest kernel upload.

Changed in linux (Ubuntu):
status: Confirmed → Fix Released