Comment 1 for bug 768249

Revision history for this message
Paolo Pisati (p-pisati) wrote :

looks like a corruption is happening in machine_kexec.c:machine_kexec() in the final part - let's see a debugging session:

reboot the panda board and connect via openocd+gdb:

(gdb) l machine_kexec,+50
84 {
85 unsigned long page_list;
86 unsigned long reboot_code_buffer_phys;
87 void *reboot_code_buffer;
88
89 printk("%s::%d\n", __FUNCTION__, __LINE__);
90 page_list = image->head & PAGE_MASK;
91
92 /* we need both effective and real address here */
93 reboot_code_buffer_phys =
94 page_to_pfn(image->control_code_page) << PAGE_SHIFT;
95 reboot_code_buffer = page_address(image->control_code_page);
96
97 /* Prepare parameters for reboot_code_buffer*/
98 kexec_start_address = image->start;
99 kexec_indirection_page = page_list;
100 kexec_mach_type = machine_arch_type;
101 kexec_boot_atags = image->start - KEXEC_ARM_ZIMAGE_OFFSET + KEXEC_ARM_ATAGS_OFFSET;
102 printk("%s::%d\n", __FUNCTION__, __LINE__);
103
104 /* copy our kernel relocation code to the control code page */
105 memcpy(reboot_code_buffer,
106 relocate_new_kernel, relocate_new_kernel_size);
107
108 printk("%s::%d kexec_start_address: 0x%08lx reboot_code_buffer: %p reboot_code_buffer_phys: 0x%08lx relocate_new_kernel_size: %u\n",
109 __FUNCTION__, __LINE__, kexec_start_address, reboot_code_buffer, reboot_code_buffer_phys, relocate_new_kernel_size);
110
111 flush_icache_range((unsigned long) reboot_code_buffer,
112 (unsigned long) reboot_code_buffer + KEXEC_CONTROL_PAGE_SIZE);
113 printk(KERN_INFO "Bye!\n");
114
115 if (kexec_reinit)
116 kexec_reinit();
117 local_irq_disable();
118 local_fiq_disable();
119 setup_mm_for_reboot(0); /* mode is not used, so just pass 0*/
120 flush_cache_all();
121 outer_flush_all();
122 outer_disable();
123 cpu_proc_fin();
124 outer_inv_all();
125 flush_cache_all();
126 cpu_reset(reboot_code_buffer_phys);
127 }
(gdb) break machine_kexec.c:111
Breakpoint 1 at 0xc0051bc4: file /home/flag/canonical/ubuntu-natty/arch/arm/kernel/machine_kexec.c, line 111.
(gdb) c
Continuing.

via ssh i execute kexec:

flag@omap:~$ uname -a
Linux omap 2.6.38-1208-omap4 #11 PREEMPT Thu Apr 28 10:44:16 CEST 2011 armv7l GNU/Linux
flag@omap:~$ sudo kexec -l /boot/vmlinuz-2.6.38-1208-omap4 --initrd=/boot/initrd.img-2.6.38-1208-omap4 --command-line="`cat /proc/cmdline`"
flag@omap:~$ sudo kexec -e

on the serial console:

[ 72.652587] sys_reboot::380
[ 72.655517] kernel_kexec::1504
[ 72.986114] omapdss DISPC error: timeout waiting for EVSYNC
[ 73.087677] omapdss DISPC error: timeout waiting for EVSYNC
[ 73.093566] Starting new kernel
[ 73.097625] machine_kexec::89
[ 73.100738] machine_kexec::102
[ 73.104339] machine_kexec::109 kexec_start_address: 0x80008000 reboot_code_buffer: ee4af000 reboot_code_buffer_phys: 0xae4af000 relocate_new_kernel_size: 136

back to gdb:

Breakpoint 1, machine_kexec (image=<value optimized out>) at /home/flag/canonical/ubuntu-natty/arch/arm/kernel/machine_kexec.c:111
111 flush_icache_range((unsigned long) reboot_code_buffer,
(gdb) p/x reboot_code_buffer
$3 = 0xee4af000
(gdb) p/x reboot_code_buffer_phys
$4 = 0xae4af000

everything looks good, advance till line 113:

(gdb) p/x reboot_code_buffer_phys
$5 = 0xae4af000
(gdb) p/x reboot_code_buffer
$6 = 0x0

here is the first weird thing, reboot_code_buffer is NULL now, advance some more till line 122.

at this point the kernel relocation code (arch/arm/kernel/relocate_kernel.S::relocate_new_kernel()) it has already been copied to reboot_code_buffer (see the memcpy at line 105), and the caches have been flushed, so i expect to find that routine there - and dumping the memory confirms it:

(gdb) x/20 relocate_new_kernel
0xc0051dac <relocate_new_kernel>: 0xe59f0074 0xe59f106c 0xe3500000 0x0a000014
0xc0051dbc <relocate_new_kernel+16>: 0xe4903004 0xe3130001 0x0a000001 0xe3c34001
0xc0051dcc <relocate_new_kernel+32>: 0xeafffffa 0xe3130002 0x0a000001 0xe3c30002
0xc0051ddc <relocate_new_kernel+48>: 0xeafffff6 0xe3130004 0x0a000000 0xea000008
0xc0051dec <relocate_new_kernel+64>: 0xe3130008 0x0afffff1 0xe3c33008 0xe3a06b01

(gdb) x/20 0xee4af000 (< this is the relocate_code_buffer value AKA reboot_code_buffer as the MMU sees it before turning 0 - see above)
0xee4af000: 0xe59f0074 0xe59f106c 0xe3500000 0x0a000014
0xee4af010: 0xe4903004 0xe3130001 0x0a000001 0xe3c34001
0xee4af020: 0xeafffffa 0xe3130002 0x0a000001 0xe3c30002
0xee4af030: 0xeafffff6 0xe3130004 0x0a000000 0xea000008
0xee4af040: 0xe3130008 0x0afffff1 0xe3c33008 0xe3a06b01

dump the physical memory via openocd to double check:

(gdb) mon mdw phys 0xae4af000 20
0xae4af000: e59f0074 e59f106c e3500000 0a000014 e4903004 e3130001 0a000001 e3c34001
0xae4af020: eafffffa e3130002 0a000001 e3c30002 eafffff6 e3130004 0a000000 ea000008
0xae4af040: e3130008 0afffff1 e3c33008 e3a06b01

fine, everything is setup, let's step till line 126, the last line:

(gdb) p/x reboot_code_buffer_phys
$7 = 0x815d8000

uh? what? this is wrong, why has reboot_code_buffer_phys changed?

as memory dump shows:

(gdb) mon mdw phys 0x815d8000 20
0x815d8000: 80001001 bef27008 80008001 bef35008 bf000008 bef36008 bef37008 befd2008
0x815d8020: befd3008 bf002008 bf003008 bef24008 bef25008 befd4008 befd5008 befd6008
0x815d8040: befd7008 bef20008 bef21008 bef22008

this is NOT the relocation routine that was copied with memcpy(), and that is NOT the memory location we should jump!

let's stepi a bit more:

(gdb) stepi
0xc0051c60 126 cpu_reset(reboot_code_buffer_phys);
(gdb) stepi
cpu_v7_reset () at /home/flag/canonical/ubuntu-natty/arch/arm/mm/proc-v7.S:64
64 mov pc, r0
(gdb) p/x $r0
$8 = 0x815d8000
(gdb) stepi
0x815d8000 in ?? ()
(gdb) c
Continuing.

and here is the panic:

[ 518.057678] omap_device: omap-aess-audio.-1: new worst case activate latency 0: 152587
[ 518.067565] omap_device: omap-aess-audio.-1: new worst case deactivate latency 0: 61035
[ 518.089172] Internal error: Oops - undefined instruction: 0 [#1] PREEMPT
[ 518.096221] last sysfs file: /sys/kernel/kexec_loaded
[ 518.101501] Modules linked in: wl12xx wl12xx_sdio btsdio
[ 518.107086] CPU: 0 Tainted: G W (2.6.38-1208-omap4 #11)
[ 518.113739] PC is at 0x815dc040
[ 518.117065] LR is at machine_kexec+0x15c/0x198
[ 518.121704] pc : [<815dc040>] lr : [<c0051c64>] psr: 600000d3
[ 518.121704] sp : c170de50 ip : 000098c9 fp : 000098c9
[ 518.133758] r10: 00000105 r9 : 00000000 r8 : 815d8000
[ 518.139221] r7 : c170c000 r6 : 00008000 r5 : 815d8000 r4 : c0051e24
[ 518.146057] r3 : 00000002 r2 : 00000005 r1 : 00008000 r0 : 00000000
[ 518.152893] Flags: nZCv IRQs off FIQs off Mode SVC_32 ISA ARM Segment user
[ 518.160552] Control: 10c5387d Table: ae4dc059 DAC: 00000015
[ 518.166564] Process kexec (pid: 542, stack limit = 0xc170c2f0)
[ 518.172668] Stack: (0xc170de50 to 0xc170e000)
[ 518.177246] de40: ee4af000 ae4af000 00000088 00000000
[ 518.185821] de60: c08a64ac c178cc00 fee1dead 00000000 c004c528 c00c091c 45584543 45584543
[ 518.194396] de80: c0835ed0 c009ec90 00000002 c04a8908 ef9cb800 00000002 00000000 ffffffef
[ 518.202972] dea0: 00000000 c05d45b0 ef9cb800 00001001 00000000 00000000 c160b000 c00abc58
[ 518.211517] dec0: 00000000 00000002 ef9cb800 c04a294c ef9cb800 00001003 c170ded8 c170ded8
[ 518.220092] dee0: 00000002 00000000 ee584f40 c04f50bc ee48a220 ee48a220 ef4adc00 ef6eef00
[ 518.228668] df00: ffffffff c0164480 ef4adc00 00000000 ef6eef00 00000008 ef4adc20 00000000
[ 518.237243] df20: ef6eef00 00000000 ef6eef00 c0143a2c 00000014 ef6eef00 00000000 c0143c58
[ 518.245819] df40: c1608e00 ef4adc20 ef6eef00 c0132600 00000000 00000000 c1608e14 c1608e00
[ 518.254394] df60: 00000000 c160ccc0 000241b8 c014c224 c1608e00 00000000 c160ccc0 c012eebc
[ 518.262939] df80: 00000003 00000001 c170c000 cdcefa50 00000002 400e95e7 00000000 40101fd0
[ 518.271514] dfa0: 00000058 c004c340 400e95e7 00000000 fee1dead 28121969 45584543 00000000
[ 518.280090] dfc0: 400e95e7 00000000 40101fd0 00000058 00000000 00000001 000241c8 00000105
[ 518.288665] dfe0: be967728 be967718 0000ab4b 401a47b0 80000010 fee1dead 00000000 00000000
[ 518.297271] [<c0051c64>] (machine_kexec+0x15c/0x198) from [<c00c091c>] (kernel_kexec+0xd8/0x104)
[ 518.306518] [<c00c091c>] (kernel_kexec+0xd8/0x104) from [<c009ec90>] (sys_reboot+0x190/0x214)
[ 518.315460] [<c009ec90>] (sys_reboot+0x190/0x214) from [<c004c340>] (ret_fast_syscall+0x0/0x30)
[ 518.324584] Code: 00000000 00000000 c15dc038 c15dc038 (ee4f1160)
[ 518.331207] usb 1-1.1: ifconfig timed out on ep0out len=0/4
[ 518.337066] smsc95xx 1-1.1:1.0: eth0: Failed to write register index 0x00000014
init: rsyslog main process (347) killed by SEGV signal
init: rsyslog main process ended, respawning
[ 518.373504] smsc95xx 1-1.1:1.0: eth0: Failed to write HW_CFG_LRST_ bit in HW_CFG register, ret = -110
[ 518.549499] ---[ end trace da227214a82491ba ]---

to me, it looks like a memory (stack?) corruption in the final part of machine_kexec(), anyone with mmu/openocd/gdb knowledge is welcome at this point.