2.6.31-rc1 xen domU crashes early during boot

Bug #419315 reported by Scott Moser
This bug affects 1 person
Affects          Status         Importance   Assigned to   Milestone
linux (Fedora)   Fix Released   High
linux (Ubuntu)   Won't Fix      Medium       Unassigned

Bug Description

Booting a karmic kernel under xen crashes very early in the boot process.

This is a duplicate of Red Hat bug 508120 (https://bugzilla.redhat.com/show_bug.cgi?id=508120).

At the very least, our kernels currently lack the two fixes mentioned there:
http://git.kernel.org/?p=linux/kernel/git/tip/linux-2.6-tip.git;a=commitdiff;h=ce2eef33d3
http://git.kernel.org/?p=linux/kernel/git/tip/linux-2.6-tip.git;a=commitdiff;h=5416c26635

Without them, the kernel crashes.

Tags: kj-triage
In , Kalev (kalev-redhat-bugs) wrote :

Created attachment 349429
xm dmesg

I am running an i686 rawhide domU PV machine under an x86_64 Xen host. After updating from F-11's kernel-PAE-2.6.29.4-167.fc11.i686, the new kernels no longer boot. They seem to crash very early, so that I don't even get any printk() output in the console.
Right now I can reproduce the problem with kernel-PAE-2.6.31-0.28.rc1.fc12.i686; however, the same issue started with 2.6.30-something, and I also verified that I get exactly the same behaviour with x86_64 domU kernels.

"xm dmesg" reports the following:
(XEN) Unhandled page fault in domain 12 on VCPU 0 (ec=0000)
(XEN) Pagetable walk from 0000000000000014:
(XEN) L4[0x000] = 0000000081d9d027 0000000000001b48
(XEN) L3[0x000] = 0000000000000000 ffffffffffffffff
(XEN) domain_crash_sync called from entry.S
(XEN) Domain 12 (vcpu#0) crashed on cpu#3:
(XEN) ----[ Xen-3.1.2-155.el5 x86_64 debug=n Not tainted ]----
(XEN) CPU: 3
(XEN) RIP: e019:[<00000000c0a8b501>]
<snip> (full trace attached)
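
For anyone triaging similar reports, the two values that matter in the hypervisor log above are the faulting address and the guest RIP. The following is a minimal sketch for pulling them out of an `xm dmesg` excerpt. The function name `parse_crash` and the regular expressions are illustrative assumptions based only on the lines quoted here, not on any documented Xen log format.

```python
import re

# Excerpt of the hypervisor log quoted in this comment.
XM_DMESG = """\
(XEN) Unhandled page fault in domain 12 on VCPU 0 (ec=0000)
(XEN) Pagetable walk from 0000000000000014:
(XEN) domain_crash_sync called from entry.S
(XEN) Domain 12 (vcpu#0) crashed on cpu#3:
(XEN) RIP: e019:[<00000000c0a8b501>]
"""

def parse_crash(log: str) -> dict:
    """Return the faulting address and guest RIP found in an `xm dmesg` excerpt.

    The patterns are guesses derived from the quoted lines only.
    """
    fault = re.search(r"Pagetable walk from ([0-9a-fA-F]+):", log)
    rip = re.search(r"RIP:\s+[0-9a-fA-F]+:\[<([0-9a-fA-F]+)>\]", log)
    return {
        "fault_addr": int(fault.group(1), 16) if fault else None,
        "rip": int(rip.group(1), 16) if rip else None,
    }

crash = parse_crash(XM_DMESG)
print(hex(crash["fault_addr"]), hex(crash["rip"]))
```

On this excerpt the sketch prints `0x14 0xc0a8b501`. The fault address 0x14 equals the offset in the `mov %gs:0x14,%eax` instruction that gdb reports at that RIP.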

Examining matching vmlinux in gdb I get:
(gdb) x/i 0x00000000c0a8b501
0xc0a8b501 <xen_start_kernel+9>: mov %gs:0x14,%eax
(gdb) l *0x00000000c0a8b501
0xc0a8b501 is in xen_start_kernel (arch/x86/xen/enlighten.c:990).
985 .emergency_restart = xen_emergency_restart,
986 };
987
988 /* First C function to be called on Xen boot */
989 asmlinkage void __init xen_start_kernel(void)
990 {
991 pgd_t *pgd;
992
993 if (!xen_start_info)
994 return;
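
When no matching vmlinux is at hand for gdb, roughly the same address-to-symbol lookup can be done against a System.map. The sketch below shows the idea with a binary search. The symbol table is hypothetical: only the start of xen_start_kernel is derived from this report (0xc0a8b501 minus the offset 9 shown by gdb), and the neighbouring entries are invented for illustration.

```python
import bisect

# Hypothetical System.map-style entries. Only xen_start_kernel's address is
# derived from the report (0xc0a8b501 - 9); the neighbours are made up.
SYMBOLS = sorted([
    (0xc0a8b400, "hypothetical_prev_symbol"),
    (0xc0a8b4f8, "xen_start_kernel"),
    (0xc0a8b900, "hypothetical_next_symbol"),
])

def resolve(addr, symbols=SYMBOLS):
    """Map an address to (symbol, offset) using the nearest symbol at or below it."""
    starts = [start for start, _ in symbols]
    i = bisect.bisect_right(starts, addr) - 1
    if i < 0:
        raise ValueError(f"{addr:#x} is below the first symbol")
    start, name = symbols[i]
    return name, addr - start

name, offset = resolve(0xc0a8b501)
print(f"{name}+{offset}")
```

With the table above this prints `xen_start_kernel+9`, matching the gdb output. A real System.map would be parsed into the same sorted list of (address, name) pairs.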

The host is running CentOS 5.3 with kernel-xen-2.6.18-155.el5 and xen-3.0.3-80.el5_3.3.

Xen config:
name = "fedora-rawhide"
uuid = "1f162091-fa31-c67c-9da1-702bcd5cb40b"
maxmem = 512
memory = 512
vcpus = 1
bootloader = "/usr/bin/pygrub"
on_poweroff = "destroy"
on_reboot = "restart"
on_crash = "restart"
vfb = [ ]
disk = [ "phy:/dev/vg0/xen_rawhide,xvda,w" ]
vif = [ "mac=00:16:3e:11:95:e6,bridge=xenbr0" ]

In , Michal (michal-redhat-bugs) wrote :

I'm seeing the same problem with an x86_64 Rawhide DomU and a RHEL5 Dom0.
I believe it's the same problem Orion Poplawski reported recently on the fedora-xen mailing list: https://www.redhat.com/archives/fedora-xen/2009-August/msg00008.html

I got the stack trace as Mark McLoughlin suggested:

michich@hammerfall ~$ sudo /usr/lib64/xen/bin/xenctx -s /tmp/System.map-2.6.31-rc5-git2 12
rip: ffffffff817290a1 xen_start_kernel+0x10
rsp: ffffffff8171df90
rax: 00000000 rbx: 00000000 rcx: 00000000 rdx: 00000000
rsi: ffffffff82fc3000 rdi: ffffffff82fc3000 rbp: ffffffff8171dff8
 r8: 00000000 r9: 00000000 r10: 00000000 r11: 00000000
r12: 00000000 r13: 00000000 r14: 00000000 r15: 00000000
 cs: 0000e033 ds: 00000000 fs: 00000000 gs: 00000000

Stack:
 0000000000000000 0000000000000000 0000000000000000 ffffffff817290a1
 000000010000e030 0000000000010096 ffffffff8171dfd8 000000000000e02b
 0000000000000000 0000000000000000 0000000000000000 0000000000000000
 0000000000000000 0000000000000000

Code:
bd 93 ff c9 c3 55 48 89 e5 53 48 83 ec 18 48 8b 3d 27 c0 33 00 <65> 48 8b 04 25 28 00 00 00 48 89

Call Trace:
  [<ffffffff817290a1>] xen_start_kernel+0x10 <--
  [<ffffffff817290a1>] xen_start_kernel+0x10

Then I disassembled xen_start_kernel():
0000000000000000 <xen_start_kernel>:
   0: 55 push %rbp
   1: 48 89 e5 mov %rsp,%rbp
   4: 53 push %rbx
   5: 48 83 ec 18 sub $0x18,%rsp
   9: 48 8b 3d 00 00 00 00 mov 0x0(%rip),%rdi # 10 <xen_start_kernel+0x10>
  10: 65 48 8b 04 25 28 00 mov %gs:0x28,%rax ***CRASHES HERE***
  17: 00 00
  19: 48 89 45 e8 mov %rax,-0x18(%rbp)
  1d: 31 c0 xor %eax,%eax
...

At first these last three instructions confused me, because they did not seem to correspond to anything in the C source, but then I realized they set up the canary for stack smashing detection.
So I recompiled the kernel without CONFIG_CC_STACKPROTECTOR and I got much farther with the boot (it hung after loading some drivers, I'll investigate more).

I guess xen_start_kernel() (and possibly more of Xen DomU startup code) should be compiled with -fno-stack-protector.
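
The xenctx "Code:" line and the disassembly in this comment can be cross-checked against each other. The sketch below assumes that the byte wrapped in `<..>` is the byte at RIP (the same convention as kernel oops dumps). It then checks that the function starts 0x10 bytes earlier with `push %rbp`, and decodes the 32-bit displacement of the faulting `mov %gs:disp32,%rax`. The helper name `split_code` is made up for this example.

```python
# "Code:" line copied from the xenctx output in this comment.
CODE_LINE = ("bd 93 ff c9 c3 55 48 89 e5 53 48 83 ec 18 48 8b 3d 27 c0 33 00 "
             "<65> 48 8b 04 25 28 00 00 00 48 89")

def split_code(line):
    """Return (bytes, index of the byte marked <..>), assumed to be the byte at RIP."""
    tokens = line.split()
    fault_index = next(i for i, t in enumerate(tokens) if t.startswith("<"))
    data = bytes(int(t.strip("<>"), 16) for t in tokens)
    return data, fault_index

code, at_rip = split_code(CODE_LINE)

# RIP was reported as xen_start_kernel+0x10, so the function should begin
# 0x10 bytes earlier, with `push %rbp` (opcode 0x55).
func_start = at_rip - 0x10
assert code[func_start] == 0x55

# 65 48 8b 04 25 <disp32> is the encoding shown in the disassembly for
# `mov %gs:disp32,%rax`; the displacement is a little-endian 32-bit value.
assert code[at_rip:at_rip + 5] == bytes.fromhex("65488b0425")
disp = int.from_bytes(code[at_rip + 5:at_rip + 9], "little")
print(at_rip, hex(disp))
```

The decoded displacement is 0x28, the same %gs offset as in the instruction marked as crashing in the disassembly.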

In , Chris (chris-redhat-bugs) wrote :

Wow, excellent analysis, thanks. I was just starting to see this myself, but hadn't yet had time to look into it. We'll have to take this up with upstream and see what they have to say about it.

Chris Lalancette

In , Kevin (kevin-redhat-bugs) wrote :

Yeah, seeing this also on a machine in fedora infrastructure. ;(

In , Jeremy (jeremy-redhat-bugs) wrote :

Thanks for the report and analysis. I guess there's a keyword to prevent gcc from adding stack-smashing to particular functions or files... Erm...

In , Jeremy (jeremy-redhat-bugs) wrote :

Created attachment 357697
Make sure load_percpu_segment doesn't have stack-protector enabled

In , Jeremy (jeremy-redhat-bugs) wrote :

Created attachment 357698
Setup percpu segments before calling stack-protected functions

In , Jeremy (jeremy-redhat-bugs) wrote :

Do those two help?

In , Michal (michal-redhat-bugs) wrote :

Jeremy,
yes, these patches help. The kernel starts booting with them applied.

Just a suggestion: the usual way (as seen in other Makefiles) to disable the stack protection for selected source files seems to be:
nostackp := $(call cc-option, -fno-stack-protector)
CFLAGS_somefile.o := $(nostackp)

And the kernel still hangs for me later during boot, but that's a different bug.
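
To see where a source tree already disables the stack protector, one option is to scan its Makefile and Kbuild files for the flag. This is a rough sketch: the function names are invented here, and it only finds lines that spell out `-fno-stack-protector` literally, so per-file uses through a variable such as `$(nostackp)` would need a second pass.

```python
import os

FLAG = "-fno-stack-protector"

def flag_lines(lines):
    """Return (line number, text) for every line that mentions FLAG."""
    return [(n, text.rstrip()) for n, text in enumerate(lines, 1) if FLAG in text]

def find_nostackp(tree):
    """Yield (path, line number, text) for Makefile/Kbuild lines mentioning FLAG."""
    for dirpath, _dirs, files in os.walk(tree):
        for fname in files:
            if fname not in ("Makefile", "Kbuild"):
                continue
            path = os.path.join(dirpath, fname)
            with open(path, encoding="utf-8", errors="replace") as fh:
                for n, text in flag_lines(fh):
                    yield path, n, text

# Illustrative Makefile fragment; the first line is made up, the other two
# are the convention quoted in this comment.
sample = [
    "obj-y := somefile.o\n",
    "nostackp := $(call cc-option, -fno-stack-protector)\n",
    "CFLAGS_somefile.o := $(nostackp)\n",
]
print(flag_lines(sample))
```

Calling `find_nostackp("arch/x86")` from the top of a kernel tree would list candidate Makefiles to copy the convention from.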

In , Jeremy (jeremy-redhat-bugs) wrote :

Ah, I couldn't find another instance of stack-protector being disabled.

Have you reported the other bug, or is it something purely local?

In , Mark (mark-redhat-bugs) wrote :

Ingo has these queued up in linux-2.6-tip.git/x86/urgent:

http://git.kernel.org/?p=linux/kernel/git/tip/linux-2.6-tip.git;a=commitdiff;h=ce2eef33d3
http://git.kernel.org/?p=linux/kernel/git/tip/linux-2.6-tip.git;a=commitdiff;h=5416c26635

Just need to re-test and close this when they make their way to rawhide

In , Jeremy (jeremy-redhat-bugs) wrote :

Are you sure they work? M A Young still reports crashes when they're applied.

In , Mark (mark-redhat-bugs) wrote :

Michal says it still hangs later on, but that it's a different issue

M A Young is probably testing Dom0, maybe yet another issue?

Dunno, that's why I said we need to re-test :-)

In , Michal (michal-redhat-bugs) wrote :

I've described the other bug in http://lkml.org/lkml/2009/8/21/71

In , Chuck (chuck-redhat-bugs) wrote :

Should be fixed in 2.6.31-0.173.rc7.git2, which has the two x86-tip patches plus the framebuffer fix from LKML.

In , Michal (michal-redhat-bugs) wrote :

2.6.31-0.173.rc7.git2 boots successfully under Xen on x86_64, but i686 still fails. Probably because load_percpu_segment(0); is under #ifdef CONFIG_X86_64 in xen_start_kernel().

In , Kevin (kevin-redhat-bugs) wrote :

Seems to work here under an x86_64 guest. Thanks.

In , Jeremy (jeremy-redhat-bugs) wrote :

(In reply to comment #15)
> 2.6.31-0.173.rc7.git2 boots successfully under Xen on x86_64, but i686 still
> fails. Probably because load_percpu_segment(0); is under #ifdef CONFIG_X86_64
> in xen_start_kernel().

32-bit is trickier because it needs a specifically set-up GDT entry and its own segment register. Doing this setup properly ends up calling functions with stack-protector prologs, which assume the segment register is already set up. I need to work out 1) how native does this setup, and/or 2) how to refactor the segment register setup so that it can avoid functions with stack-protector code.

In , Alexander (alexander-redhat-bugs) wrote :

*** Bug 519342 has been marked as a duplicate of this bug. ***

In , Alexander (alexander-redhat-bugs) wrote :

FYI: 2.6.31-0.174.rc7.git2 still fails on i386. My dom0 is a recent RHEL5 and the domU is F12-Alpha.

Scott Moser (smoser) wrote :

Just a note: the RH bug also mentions http://lkml.org/lkml/2009/8/21/71 (fix at http://lkml.org/lkml/2009/8/21/128).

Changed in linux (Fedora):
status: Unknown → Fix Committed
In , Pasi (pasi-redhat-bugs) wrote :

I also tried the latest rawhide tree (2.6.31-0.174.rc7.git2.fc12.i686) with virt-install on my F11 + Xen 3.4.1 + 2.6.31-rc6 pv_ops dom0 setup, and it still crashes.

In , Jeremy (jeremy-redhat-bugs) wrote :

What compiler are people using? Using F11's gcc-4.4.1-2.fc11.x86_64, it says:

/home/jeremy/git/linux/arch/x86/Makefile:80: stack protector enabled but no compiler support

In , Michal (michal-redhat-bugs) wrote :

Jeremy,

Rawhide builds currently use gcc-4.4.1-6.x86_64. You can find this information in Koji build logs, e.g.: http://kojipkgs.fedoraproject.org/packages/kernel/2.6.31/0.185.rc7.git6.fc12/data/logs/x86_64/
root.log tells you the versions of the packages used in the build.
build.log has the build warnings. There was no such stack protector warning in this case.

In , Jeremy (jeremy-redhat-bugs) wrote :

Where can I get this version of gcc? "yum update --enablerepo=rawhide gcc" doesn't get me anything more recent than gcc-4.4.1-2.fc11.x86_64. Or does 32-bit stackprotector not work in the x86-64 version of the compiler?

In , Michal (michal-redhat-bugs) wrote :

> Where can I get this version of gcc? "yum update --enablerepo=rawhide gcc"
> doesn't get me anything more recent than gcc-4.4.1-2.fc11.x86_64.

Works for me, yum can see the newer version. But the gcc from Rawhide depends on newer glibc, so I do not recommend doing it.

> Or does 32-bit stackprotector not work in the x86-64 version of the compiler?

Bug in the stack protector detection for ARCH=i386 builds on x86_64. I've sent a patch to LKML and CCed you.

Koji always builds packages using native arch toolchain, so it is not affected.

In , Jeremy (jeremy-redhat-bugs) wrote :

Created attachment 359557
Set up kernel GDT early to make -fstack-protector work under Xen

This patch should comprehensively fix -fstack-protector under Xen for both 32 and 64-bit. Please test.

In , Pasi (pasi-redhat-bugs) wrote :

Someone please add that bugfix patch to the next rawhide kernel build so that people can test it.

In , Justin (justin-redhat-bugs) wrote :

The patch has been applied and should be available in the next rawhide kernel build.

In , Michal (michal-redhat-bugs) wrote :

2.6.31-0.203.rc8.git2.fc12 boots successfully as Xen domU. I've tested both i686.PAE and x86_64.

In , Pasi (pasi-redhat-bugs) wrote :

Seems to boot now. virt-install started f12/rawhide Xen domU installation OK, on F11 host with Xen 3.4.1-3 + pv_ops dom0 kernel + libvirt from F11 updates testing.

Installation went fine, and the installed domU seems to have 2.6.31-0.203.rc8.git2.fc12.i686.PAE kernel running.

There's a traceback in the domU dmesg, though the domU still runs fine.

Write protecting the kernel text: 4352k
Write protecting the kernel read-only data: 1800k

=============================================
[ INFO: possible recursive locking detected ]
2.6.31-0.203.rc8.git2.fc12.i686.PAE #1
---------------------------------------------
init/1 is trying to acquire lock:
 (&input_pool.lock){+.+...}, at: [<c043b30e>] __wake_up+0x2b/0x61

but task is already holding lock:
 (&input_pool.lock){+.+...}, at: [<c068e21b>] account+0x30/0xf0

other info that might help us debug this:
2 locks held by init/1:
 #0: (&p->cred_guard_mutex){+.+.+.}, at: [<c0508756>] do_execve+0xa4/0x2ee
 #1: (&input_pool.lock){+.+...}, at: [<c068e21b>] account+0x30/0xf0

stack backtrace:
Pid: 1, comm: init Not tainted 2.6.31-0.203.rc8.git2.fc12.i686.PAE #1
Call Trace:
 [<c08387c0>] ? printk+0x22/0x3a
 [<c0478b59>] __lock_acquire+0x7e9/0xb25
 [<c0478f4c>] lock_acquire+0xb7/0xeb
 [<c043b30e>] ? __wake_up+0x2b/0x61
 [<c043b30e>] ? __wake_up+0x2b/0x61
 [<c083b4f7>] _spin_lock_irqsave+0x45/0x89
 [<c043b30e>] ? __wake_up+0x2b/0x61
 [<c043b30e>] __wake_up+0x2b/0x61
 [<c068e2a0>] account+0xb5/0xf0
 [<c068e3ef>] extract_entropy+0x3e/0xac
 [<c0406b0b>] ? xen_restore_fl_direct_end+0x0/0x1
 [<c04799d7>] ? lock_release+0x186/0x19f
 [<c068e56e>] get_random_bytes+0x29/0x3e
 [<c053bbd1>] load_elf_binary+0xab9/0x106c
 [<c050732d>] search_binary_handler+0xd7/0x27b
 [<c053b118>] ? load_elf_binary+0x0/0x106c
 [<c0539c76>] load_script+0x1a6/0x1c8
 [<c0507323>] ? search_binary_handler+0xcd/0x27b
 [<c0406199>] ? xen_force_evtchn_callback+0x1d/0x34
 [<c0507323>] ? search_binary_handler+0xcd/0x27b
 [<c0406b14>] ? check_events+0x8/0xc
 [<c0406b0b>] ? xen_restore_fl_direct_end+0x0/0x1
 [<c04799d7>] ? lock_release+0x186/0x19f
 [<c050732d>] search_binary_handler+0xd7/0x27b
 [<c0539ad0>] ? load_script+0x0/0x1c8
 [<c050888b>] do_execve+0x1d9/0x2ee
 [<c0408359>] sys_execve+0x39/0x6e
 [<c0409ad0>] syscall_call+0x7/0xb
 [<c04f00d8>] ? sys_swapon+0x348/0xa98
 [<c040d76b>] ? kernel_execve+0x27/0x3e
 [<c04031e0>] ? run_init_process+0x2b/0x3e
 [<c0403275>] ? init_post+0x82/0xe9
 [<c0a9b566>] ? kernel_init+0x1f6/0x211
 [<c0a9b370>] ? kernel_init+0x0/0x211
 [<c040a6bf>] ? kernel_thread_helper+0x7/0x10
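
To compare this trace against existing reports, it can help to reduce the lockdep message to the lock class and call site of the two acquisitions. The sketch below does that for the two "lock:" entries quoted above. The regular expression and the name `lock_pairs` are assumptions based only on this excerpt, not on a specified lockdep output format.

```python
import re

# The two acquisition entries from the lockdep report above.
REPORT = """\
init/1 is trying to acquire lock:
 (&input_pool.lock){+.+...}, at: [<c043b30e>] __wake_up+0x2b/0x61

but task is already holding lock:
 (&input_pool.lock){+.+...}, at: [<c068e21b>] account+0x30/0xf0
"""

# Pattern guessed from the excerpt: "(class){state}, at: [<addr>] site".
LOCK = re.compile(
    r"^\s*\((?P<name>[^)]+)\)\{[^}]*\}, at: \[<[0-9a-f]+>\] (?P<site>\S+)",
    re.M,
)

def lock_pairs(report):
    """Return [(lock class, call site), ...] in the order they appear."""
    return [(m["name"], m["site"]) for m in LOCK.finditer(report)]

(wanted, want_site), (held, held_site) = lock_pairs(REPORT)
print(wanted == held, want_site, held_site)
```

Here both entries name the same lock class, `&input_pool.lock`, taken from `__wake_up` while already held from `account`. Whether that is a real recursion or a lockdep annotation issue is not something this sketch can tell.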

Changed in linux (Fedora):
status: Fix Committed → Fix Released
In , Chris (chris-redhat-bugs) wrote :

(In reply to comment #29)
> Seems to boot now. virt-install started f12/rawhide Xen domU installation OK,
> on F11 host with Xen 3.4.1-3 + pv_ops dom0 kernel + libvirt from F11 updates
> testing.
>
> Installation went fine, and the installed domU seems to have
> 2.6.31-0.203.rc8.git2.fc12.i686.PAE kernel running.
>
> There's a traceback on domU dmesg though.. the domU still runs fine.

It's probably worth looking through BZ quickly to see if a bug with that trace exists already, and if not, to open a new bug about it.

Thanks for the testing,
Chris Lalancette

In , Pasi (pasi-redhat-bugs) wrote :
Changed in linux (Ubuntu):
importance: Undecided → Medium
status: New → Triaged
Jeremy Foshee (jeremyfoshee) wrote :

This bug report was marked as Triaged a while ago but has not had any updated comments for quite some time. Please let us know if this issue remains in the current Ubuntu release, http://www.ubuntu.com/getubuntu/download . If the issue remains, click on the current status under the Status column and change the status back to "New". Thanks.

[This is an automated message. Apologies if it has reached you inappropriately; please just reply to this message indicating so.]

tags: added: kj-triage
Changed in linux (Ubuntu):
status: Triaged → Incomplete
Changed in linux (Fedora):
importance: Unknown → High
Po-Hsu Lin (cypressyew) wrote :

Closing this bug as Won't Fix because this kernel/release is no longer supported.
Please feel free to open a new bug report if you're still experiencing this on a newer release (Bionic 18.04.3 / Disco 19.04).
Thanks!

Changed in linux (Ubuntu):
status: Incomplete → Won't Fix