[hardy][xen] Oops in free_hot_cold_cache, probably drbd8-related

Bug #235783 reported by Bernhard Schmidt
Affects: linux (Ubuntu)
Status: Invalid
Importance: Undecided
Assigned to: Unassigned

Bug Description

I run an Ubuntu Hardy i386 + Xen + drbd8 + ocfs2 cluster that frequently (every few hours, even when completely idle) reports an Oops. OCFS2 is already disabled, so the problem seems to be linked to drbd8 being started. The drbd8 resource is not in use though, and is not even mounted on any box.

[ 5545.027304] invalid opcode: 0000 [#1] SMP
[ 5545.027319] Modules linked in: drbd cn bridge sbs container battery sbshc video output ac dock iptable_filter ip_tables x_tables parpe
[ 5545.027410]
[ 5545.027414] Pid: 14947, comm: sshd Not tainted (2.6.24-17-xen #1)
[ 5545.027419] EIP: 0061:[<c194e0e9>] EFLAGS: 00210216 CPU: 0
[ 5545.027426] EIP is at 0xc194e0e9
[ 5545.027429] EAX: c18d3c00 EBX: c18d7800 ECX: 00000004 EDX: 00000000
[ 5545.027434] ESI: 00000002 EDI: 40040000 EBP: 00000000 ESP: daecbe94
[ 5545.027438] DS: 007b ES: 007b FS: 00d8 GS: 0000 SS: 0069
[ 5545.027443] Process sshd (pid: 14947, ti=daeca000 task=db61c7d0 task.ti=daeca000)
[ 5545.027447] Stack: c01623a5 00000000 c03fd800 daecbef0 00000002 00000008 daecbed0 c0162456
[ 5545.027463] c196f960 c1553938 c03fd800 00000008 c0165997 00000008 00000000 00000008
[ 5545.027479] 00000000 c192ade0 c18f40e0 c1822400 c18eddc0 c1971780 c1932060 c18d3c00
[ 5545.027521] Call Trace:
[ 5545.027525] [<c01623a5>] free_hot_cold_page+0x195/0x220
[ 5545.027541] [<c0162456>] __pagevec_free+0x26/0x30
[ 5545.027551] [<c0165997>] release_pages+0x137/0x160
[ 5545.027563] [<c017a5f4>] free_pages_and_swap_cache+0x74/0xa0
[ 5545.027573] [<c01737b7>] exit_mmap+0xe7/0x100
[ 5545.027584] [<c0124303>] mmput+0x23/0x80
[ 5545.027592] [<c0129d95>] do_exit+0x165/0x8b0
[ 5545.027602] [<c0185dff>] vfs_read+0x15f/0x170
[ 5545.027610] [<c019b9c3>] mntput_no_expire+0x13/0x70
[ 5545.027621] [<c012a50a>] do_group_exit+0x2a/0xa0
[ 5545.027631] [<c0105832>] syscall_call+0x7/0xb
[ 5545.027641] [<c0320000>] vcc_create+0x90/0x110
[ 5545.027651] =======================
[ 5545.027654] Code: 20 00 00 00 00 40 01 00 00 00 ff ff ff ff 00 00 00 00 00 00 00 00 00 00 00 00 00 01 10 00 00 02 20 00 00 20 00 40 0
[ 5545.027736] EIP: [<c194e0e9>] 0xc194e0e9 SS:ESP 0069:daecbe94
[ 5545.028899] ---[ end trace 880e03764f078517 ]---
[ 5545.028906] Fixing recursive fault but reboot is needed!

The process is different each time, but free_hot_cold_page is always at the top of the call trace.
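
One way to check that observation across several saved Oops logs is to pull the first entry after each "Call Trace:" marker. This is only a minimal sketch: the function name top_of_call_traces and the regular expression are made up for illustration, and the pattern assumes the 2.6.24-style line format shown in the trace above.

```python
import re

# Matches call-trace lines such as:
# [ 5545.027525] [<c01623a5>] free_hot_cold_page+0x195/0x220
TRACE_LINE = re.compile(r"\[<[0-9a-f]+>\]\s+(\w+)\+0x[0-9a-f]+/0x[0-9a-f]+")


def top_of_call_traces(log_text):
    """Return the first symbol listed after each 'Call Trace:' marker."""
    tops = []
    expecting_first_entry = False
    for line in log_text.splitlines():
        if "Call Trace:" in line:
            expecting_first_entry = True
            continue
        if expecting_first_entry:
            match = TRACE_LINE.search(line)
            if match:
                tops.append(match.group(1))
                expecting_first_entry = False
    return tops


# Two lines copied from the Oops above, used as sample input.
sample = """\
[ 5545.027521] Call Trace:
[ 5545.027525] [<c01623a5>] free_hot_cold_page+0x195/0x220
[ 5545.027541] [<c0162456>] __pagevec_free+0x26/0x30
"""
print(top_of_call_traces(sample))
```

Feeding it the saved output of dmesg from several crashes should show whether the same symbol really heads every trace.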

Revision history for this message
Bernhard Schmidt (berni) wrote :

Here is another series of Oopses that I just got while running bonnie++ on a local partition (not even on drbd8):

[ 91.876570] BUG: unable to handle kernel NULL pointer dereference at virtual address 00000008
[ 91.876587] printing eip: de12d234
[ 91.876594] 1c1c3000 -> *pde = 00000000:09ae4001
[ 91.876598] 15ee4000 -> *pme = 00000000:00000000
[ 91.876604] Oops: 0000 [#1] SMP
[ 91.876611] Modules linked in: drbd cn bridge sbs container battery sbshc video output ac dock iptable_filter ip_tables x_tables parpe
[ 91.876699]
[ 91.876704] Pid: 161, comm: kswapd0 Not tainted (2.6.24-17-xen #1)
[ 91.876708] EIP: 0061:[<de12d234>] EFLAGS: 00010202 CPU: 0
[ 91.876721] EIP is at __journal_remove_checkpoint+0x14/0xb0 [jbd]
[ 91.876725] EAX: 000001c0 EBX: 00000008 ECX: dc803640 EDX: 00924925
[ 91.876730] ESI: dc803640 EDI: dc803640 EBP: c192da94 ESP: db71dda8
[ 91.876734] DS: 007b ES: 007b FS: 00d8 GS: 0000 SS: 0069
[ 91.876738] Process kswapd0 (pid: 161, ti=db71c000 task=db67ceb0 task.ti=db71c000)
[ 91.876743] Stack: c19749e0 c192da94 dc803640 de12ba89 dafaa8b8 c192c340 dafaa800 de2102d0
[ 91.876789] 000000d0 db71df7c db71df7c c015f70c c192c340 db71df0c c0167c15 00000000
[ 91.876804] 00000000 dc76c908 db71de7c db71de54 00000000 00f52873 00000000 00000015
[ 91.876819] Call Trace:
[ 91.876826] [<de12ba89>] journal_try_to_free_buffers+0xe9/0x140 [jbd]
[ 91.876842] [<de2102d0>] ext3_releasepage+0x0/0xa0 [ext3]
[ 91.876860] [<c015f70c>] try_to_release_page+0x2c/0x40
[ 91.876874] [<c0167c15>] shrink_page_list+0x4c5/0x600
[ 91.876888] [<c0166daf>] isolate_lru_pages+0x5f/0x1c0
[ 91.876899] [<c0167e6f>] shrink_inactive_list+0x11f/0x3b0
[ 91.876914] [<c016819c>] shrink_zone+0x9c/0x100
[ 91.876923] [<c016883c>] kswapd+0x44c/0x490
[ 91.876938] [<c013bb90>] autoremove_wake_function+0x0/0x40
[ 91.876949] [<c011e260>] complete+0x40/0x60
[ 91.876958] [<c01683f0>] kswapd+0x0/0x490
[ 91.876966] [<c013b8d2>] kthread+0x42/0x70
[ 91.876972] [<c013b890>] kthread+0x0/0x70
[ 91.876980] [<c0105bb7>] kernel_thread_helper+0x7/0x10
[ 91.876991] =======================
[ 91.876994] Code: 0b eb fe 8d 74 26 00 0f 0b eb fe 0f 0b eb fe 90 8d b4 26 00 00 00 00 56 89 c1 53 83 ec 04 8b 58 28 85 db 74 29 8b 4
[ 91.877080] EIP: [<de12d234>] __journal_remove_checkpoint+0x14/0xb0 [jbd] SS:ESP 0069:db71dda8
[ 91.877668] ---[ end trace 276bea9ce4a4d4b9 ]---
[ 94.370990] BUG: unable to handle kernel paging request at virtual address b3578bd4
[ 94.371007] printing eip: c020ff0d
[ 94.371015] 015c5000 -> *pde = 00000000:1d5c8001
[ 94.371021] 015c8000 -> *pme = 00000000:00000000
[ 94.371028] Oops: 0000 [#2] SMP
[ 94.371036] Modules linked in: drbd cn bridge sbs container battery sbshc video output ac dock iptable_filter ip_tables x_tables parpe
[ 94.371147]
[ 94.371152] Pid: 4686, comm: getty Tainted: G D (2.6.24-17-xen #1)
[ 94.371159] EIP: 0061:[<c020ff0d>] EFLAGS: 00010446 CPU: 1
[ 94.371171] EIP is at memmove+0x1d/0x40
[ 94.371189] EAX: db578bd5 EBX: db578bd5 ECX: d8000000 EDX: db578bd5
[ 94.371195] ESI: b3578bd4 EDI: b3578bd4 EBP: dbed6000 ESP: dbed7f4c
[ 94.371201] DS: 007b ES: ...


Revision history for this message
Bernhard Schmidt (berni) wrote :

Okay, I think I have figured this out. Pretty scary if I'm right.

The box runs headless and has a serial console. For this I basically copied the configuration from a Gutsy box, which gives me the following settings in grub/menu.lst:

---
## Xen hypervisor options to use with the default Xen boot option
# xenhopt=console=com1,vga com1=57600,8n1

## Xen Linux kernel options to use with the default Xen boot option
# xenkopt=console=ttyS0,57600n1
---

This shows all Xen, kernel and boot messages right up to "Running local boot scripts", but obviously no login prompt. For that I copied the file /etc/event.d/ttyS0 from a Gutsy box as well:

---
start on stopped rc2
start on stopped rc3
start on stopped rc4
start on stopped rc5

stop on runlevel 0
stop on runlevel 1
stop on runlevel 6

respawn
exec /sbin/getty -L ttyS0 57600 vt102
---

I still did not get a prompt but did not have a second look, because the box was crashing very frequently and I tried to figure out this problem first.

Today I learned that I need to use xvc0 instead of ttyS0. I changed my config, got a login prompt, and the box was suddenly far more stable. Before the change it usually died within minutes whenever I did something I/O intensive (e.g. bonnie++, no matter whether on a local partition or a drbd8 volume, or debootstrap).

I ran the test: after re-enabling ttyS0, the box crashed as violently and as often as before. It does not appear to happen when ttyS0 is started after the boot process has finished. Every access to ttyS0 fails with an Input/Output error, so getty is restarted in a loop by upstart.

My guess is that accessing/writing to ttyS0 from within dom0 compromises kernel memory and leads to a crash at some time.

I have removed ttyS0 now and will be running bonnie++ on a drbd volume overnight. I hope I can confirm this tomorrow.
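
For anyone who wants to check whether a console device is usable from dom0 before pointing getty at it, a minimal sketch follows. The helper name probe_device is made up for illustration, and the sketch only tests whether the device node can be opened; the Input/Output error described above may instead appear on the first read or write. Since the suspicion here is that touching ttyS0 is itself what corrupts memory, this should only be tried on a test box.

```python
import errno
import os


def probe_device(path):
    """Try to open a device node read/write and report what happens.

    Returns "ok" if the open succeeds, otherwise the symbolic errno name
    (for example "EIO" for an Input/Output error).
    """
    try:
        # O_NONBLOCK avoids waiting for carrier; O_NOCTTY keeps the device
        # from becoming the controlling terminal of this process.
        fd = os.open(path, os.O_RDWR | os.O_NONBLOCK | os.O_NOCTTY)
    except OSError as exc:
        return errno.errorcode.get(exc.errno, str(exc.errno))
    os.close(fd)
    return "ok"
```

Calling probe_device("/dev/ttyS0") and probe_device("/dev/xvc0") as root and comparing the two results would show which device dom0 can actually open.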

Revision history for this message
Leann Ogasawara (leannogasawara) wrote :

The Ubuntu Kernel Team is planning to move to the 2.6.27 kernel for the upcoming Intrepid Ibex 8.10 release. As a result, the kernel team would appreciate it if you could please test this newer 2.6.27 Ubuntu kernel. There are two ways you should be able to test:

1) If you are comfortable installing packages on your own, the linux-image-2.6.27-* package is currently available for you to install and test.

--or--

2) The upcoming Alpha5 for Intrepid Ibex 8.10 will contain this newer 2.6.27 Ubuntu kernel. Alpha5 is set to be released Thursday Sept 4. Please watch http://www.ubuntu.com/testing for Alpha5 to be announced. You should then be able to test via a LiveCD.

Please let us know immediately if this newer 2.6.27 kernel resolves the bug reported here or if the issue remains. More importantly, please open a new bug report for each new bug/regression introduced by the 2.6.27 kernel and tag the bug report with 'linux-2.6.27'. Also, please specifically note if the issue does or does not appear in the 2.6.26 kernel. Thanks again, we really appreciate your help and feedback.

Revision history for this message
kernel-janitor (kernel-janitor) wrote :

Hi berni,

This bug was reported a while ago and there hasn't been any activity in it recently. We were wondering if this is still an issue? Can you try with the latest development release of Ubuntu? ISO CD images are available from http://cdimage.ubuntu.com/releases/ .

If it remains an issue, could you run the following command from a Terminal (Applications->Accessories->Terminal)? It will automatically gather and attach updated debug information to this report.

apport-collect -p linux-image-`uname -r` 235783

Also, if you could test the latest upstream kernel available that would be great. It will allow additional upstream developers to examine the issue. Refer to https://wiki.ubuntu.com/KernelMainlineBuilds . Once you've tested the upstream kernel, please remove the 'needs-upstream-testing' tag. This can be done by clicking on the yellow pencil icon next to the tag located at the bottom of the bug description and deleting the 'needs-upstream-testing' text. Please let us know your results.

Thanks in advance.

[This is an automated message. Apologies if it has reached you inappropriately; please just reply to this message indicating so.]

tags: added: needs-kernel-logs
tags: added: needs-upstream-testing
tags: added: kj-triage
Changed in linux (Ubuntu):
status: New → Incomplete
Revision history for this message
Jeremy Foshee (jeremyfoshee) wrote :

This bug report was marked as Incomplete and has not had any updated comments for quite some time. As a result this bug is being closed. Please reopen if this is still an issue in the current Ubuntu release http://www.ubuntu.com/getubuntu/download . Also, please be sure to provide any requested information that may have been missing. To reopen the bug, click on the current status under the Status column and change the status back to "New". Thanks.

[This is an automated message. Apologies if it has reached you inappropriately; please just reply to this message indicating so.]

tags: added: kj-expired
Changed in linux (Ubuntu):
status: Incomplete → Invalid