Linaro Toolchain Binaries

Kernel 3.10 crashes randomly after upgrade of gcc from 4.7 to 4.8

Bug #1263764 reported by Wendy Ng on 2013-12-23

This bug affects 2 people

Affects		Status	Importance	Assigned to	Milestone
	Linaro GCC	Fix Released	Undecided	Unassigned
	Linaro Toolchain Binaries	Fix Released	Undecided	Unassigned	Linaro Toolchain Binaries 2013.12

Bug Description

We found that the kernel crashes at various places after we upgraded gcc from 4.7 to 4.8.

In particular, when we switched gcc:
From - gcc-linaro-arm-linux-gnueabihf-2012.08-20120827_linux (4.7.2)
To - gcc-linaro-arm-linux-gnueabihf-4.8-2013.10_linux (4.8.2)

One place where the kernel crashes very often is when an MMC request is initiated. I analyzed the crash logs, traced the MMC stack, and created some debug builds, and concluded that the scatter-gather list length (nsegs) returned from blk_rq_map_sg() is not valid. Furthermore, the signature of the crashes suggests that the invalid length resembles the value of the Current Program Status Register (CPSR).

Below is a snippet of the crash log; the complete log is attached for your reference:
========================
[ 83.756520] Internal error: Oops: 5 [#1] PREEMPT SMP ARM
[ 83.761815] Modules linked in:
[ 83.764884] CPU: 0 PID: 977 Comm: mmcqd/1 Tainted: G W 3.10.16+ #3
[ 83.771893] task: ddd75c40 ti: ddfa8000 task.ti: ddfa8000
[ 83.777271] PC is at mmc_queue_map_sg+0xf4/0x160
[ 83.781881] LR is at mmc_queue_map_sg+0x114/0x160
[ 83.786569] pc : [<c04d381c>] lr : [<c04d383c>] psr: a0000013
[ 83.786569] sp : ddfa9dd0 ip : ddfa9dd0 fp : ddfa9df4
[ 83.797972] r10: ddce9c00 r9 : 00000000 r8 : ddce9c24
[ 83.803170] r7 : 00000002 r6 : 00002000 r5 : 00000002 r4 : a0000013
[ 83.809661] r3 : c13c7222 r2 : c13c7222 r1 : dbd7d418 r0 : 00000000
[ 83.816154] Flags: NzCv IRQs on FIQs on Mode SVC_32 ISA ARM Segment kernel
[ 83.823422] Control: 10c5387d Table: 5e77006a DAC: 00000015
========================
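
The observation that the bad value resembles the CPSR can be checked mechanically against the register dump. The following is a minimal sketch, not part of the original report: it parses the register lines of the snippet above (timestamps stripped) and lists any register that holds exactly the value printed as psr. The names `parse_registers` and `psr_lookalikes` are hypothetical helpers introduced for illustration.

```python
import re

# Register lines copied from the oops above, timestamps removed.
OOPS = """\
pc : [<c04d381c>] lr : [<c04d383c>] psr: a0000013
sp : ddfa9dd0 ip : ddfa9dd0 fp : ddfa9df4
r10: ddce9c00 r9 : 00000000 r8 : ddce9c24
r7 : 00000002 r6 : 00002000 r5 : 00000002 r4 : a0000013
r3 : c13c7222 r2 : c13c7222 r1 : dbd7d418 r0 : 00000000
"""

# Matches "name : 1234abcd" as well as "name : [<1234abcd>]".
REG = re.compile(r"\b(psr|pc|lr|sp|ip|fp|r\d+)\s*:\s*\[?<?([0-9a-f]{8})>?\]?")


def parse_registers(text):
    """Return {register name: value} for every register in an oops dump."""
    return {name: int(value, 16) for name, value in REG.findall(text)}


def psr_lookalikes(regs):
    """Return the names of registers that hold exactly the psr value."""
    psr = regs["psr"]
    return sorted(n for n, v in regs.items() if n != "psr" and v == psr)


regs = parse_registers(OOPS)
print(psr_lookalikes(regs))  # ['r4']
```

For this dump only r4 matches, which is consistent with a PSR-like value having been picked up where a small integer was expected.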

I am aware that there was a known issue in kernel 3.8 related to memset() for the gcc upgrade from 4.7 to 4.8 (https://bugs.launchpad.net/linaro-toolchain-binaries/+bug/1186218). Since we are using kernel 3.10, I can confirm that we already have this fix in our codebase.

Nevertheless, I noticed that the fix made for memset() applies only to the assembly file memset.S, not to memzero.S. Looking at the disassembly of blk_rq_map_sg(), I can see that memzero() is called, but I am not sure whether this is the root cause of the crashes I have observed.

Lastly, I would like to relate this issue to the one reported here:
https://bugs.launchpad.net/ubuntu/+source/gcc-4.8/+bug/1178847 (Comment #15)
The crash signature looks somewhat similar, and that issue is still unresolved.

=========================

List of supporting documents for Linaro to analyze the crash

1. .config
- this is the kernel config used to compile the kernel. The kernel version is based on v3.10.

2. queue.c
- additional debug code is added to catch the crash condition earlier in the MMC stack.
- original file located in: /drivers/mmc/card/

3. vmlinux_v6_chk_sg_len, vmlinux_v6_chk_sg_len.lst, System_v6_chk_sg_len.map
- vmlinux, its listing file and map file

4. kernel_crash_console_log_1.txt, kernel_crash_console_log_2.txt
- 2 instances of the kernel crashes.

5. Makefile
- shows the compiler configuration options
Note: ARCH=arm, CROSS_COMPILE=arm-linux-gnueabihf-

6. mmc_queue_map_sg_v6_asm_annotated.txt
- annotated disassembly that includes the extra debug code added in queue.c

Revision history for this message

Wendy Ng (wendy-ng) wrote on 2013-12-23:

list of support documents for analyzing the kernel crash (73.2 MiB, application/zip)

Wendy Ng (wendy-ng) on 2013-12-23

information type:	Public → Private Security
information type:	Private Security → Private

Wendy Ng (wendy-ng) on 2013-12-23

information type:	Private → Public

Wendy Ng (wendy-ng) wrote on 2013-12-23:

I should mention that 'nsegs' is stored on the stack and appears to be corrupted by the time blk_rq_map_sg() returns.

We have observed other crashes where a value stored on the stack is corrupted and the corrupted value resembles the value stored in the PSR. The following crash logs were obtained from *another linux image*; I am providing this additional info to illustrate my point.

======================================
Klog-3.TXT

Crash #1

[ 346.623251] Unable to handle kernel paging request at virtual address a000002f
[ 346.631225] pgd = d4dc8000
[ 346.635486] [a000002f] *pgd=00000000
[ 346.640859] Internal error: Oops: 5 [#1] PREEMPT SMP ARM
[ 346.646199] Modules linked in: bcmdhd
[ 346.649964] CPU: 1 PID: 2339 Comm: Binder_3 Tainted: G W 3.10.16+ #1
[ 346.657184] task: d1c15200 ti: d1c2e000 task.ti: d1c2e000
[ 346.662592] PC is at plug_rq_cmp+0x18/0x4c
[ 346.666699] LR is at merge+0x40/0x80
[ 346.670286] pc : [<c02a7378>] lr : [<c02d6a44>] psr: a0000013
[ 346.670286] sp : d1c2fc90 ip : d1c2fca0 fp : d1c2fc9c
[ 346.681730] r10: d1c2fd38 r9 : 00000002 r8 : c02a7360
[ 346.686971] r7 : 00000000 r6 : d1c2fca0 r5 : c22853c0 r4 : a0000013
[ 346.693511] r3 : a0000013 r2 : a0000013 r1 : c22853c0 r0 : ddd88000
[ 346.700048] Flags: NzCv IRQs on FIQs on Mode SVC_32 ISA ARM Segment user
[ 346.707183] Control: 10c5387d Table: 64dc806a DAC: 00000015

===========

Klog-6.TXT

Crash #1

[ 8904.709374] Unable to handle kernel paging request at virtual address 6000012f
[ 8904.719140] pgd = c82c8000
[ 8904.721923] [6000012f] *pgd=00000000
[ 8904.725627] Internal error: Oops: 5 [#1] PREEMPT SMP ARM
[ 8904.731010] Modules linked in:
[ 8904.734229] CPU: 2 PID: 6181 Comm: ing.mp3.android Tainted: G W 3.10.16+ #1
[ 8904.742103] task: d034dc40 ti: c829a000 task.ti: c829a000
[ 8904.747525] PC is at plug_rq_cmp+0x14/0x4c
[ 8904.751700] LR is at list_sort+0x194/0x204
[ 8904.755880] pc : [<c02a7374>] lr : [<c02d6c18>] psr: 20000113
[ 8904.755880] sp : c829bba8 ip : c829bbb8 fp : c829bbb4
[ 8904.767394] r10: c829bc20 r9 : 00000001 r8 : d2f074b0
[ 8904.772696] r7 : 60000113 r6 : c02a7360 r5 : 00000000 r4 : c829bc50
[ 8904.779267] r3 : 00000000 r2 : d2f074b0 r1 : 60000113 r0 : 00000000
[ 8904.785847] Flags: nzCv IRQs on FIQs on Mode SVC_32 ISA ARM Segment user
[ 8904.793035] Control: 10c5387d Table: 582c806a DAC: 00000015

Summary:
1. The crashes in Klog-3.txt and Klog-6.txt are interesting because they occur at two adjacent instructions. In one case r2 is corrupted, and in the other r1 is corrupted, so the corrupted values should be close to each other in memory. Both r1 and r2 are inputs passed into plug_rq_cmp().
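
The numbers in the two logs above can be cross-checked with a little arithmetic. This is a sketch under stated assumptions, not something taken from the report: it subtracts the corrupted register value from the faulting virtual address and decodes the N, Z, C, V condition bits (bits 31 to 28) and the mode field (bits 4 to 0) of each value. The helper name `flags` is hypothetical.

```python
def flags(psr):
    """Render bits 31..28 (N, Z, C, V) the way the oops 'Flags:' line does."""
    return "".join(
        letter if psr & (1 << bit) else letter.lower()
        for letter, bit in (("N", 31), ("Z", 30), ("C", 29), ("V", 28))
    )


# (faulting address, corrupted register value, psr printed in the same oops)
CRASHES = {
    "Klog-3": (0xA000002F, 0xA0000013, 0xA0000013),
    "Klog-6": (0x6000012F, 0x60000113, 0x20000113),
}

for name, (fault, bad, psr) in CRASHES.items():
    # offset of the failing access, flags of the bad value, flags of psr, mode bits
    print(name, hex(fault - bad), flags(bad), flags(psr), hex(bad & 0x1F))

# Klog-3 0x1c NzCv NzCv 0x13
# Klog-6 0x1c nZCv nzCv 0x13
```

In both crashes the faulting address is the corrupted value plus 0x1c, which would fit a load from the same offset of a pointer that has been overwritten (an inference, not something the logs state directly). Both corrupted values carry mode bits 0x13, the SVC mode encoding. In Klog-6 the corrupted value differs from the psr at crash time only in the Z flag, so it looks like a PSR captured at a slightly different moment.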

Viktor (vchong) on 2013-12-26

Changed in linaro-toolchain-binaries:
status:	New → Confirmed

Wendy Ng (wendy-ng) wrote on 2014-01-03:

Hello Viktor,

I noticed that you have changed the status of this bug to 'Confirmed', but no comment was added. I wonder if you can provide me with some details.

In the meantime, I have some additional info for you:

I added more debug code to narrow down the problem, and it seems to point to something wrong in the following snippet of the generated code for blk_rq_map_sg():

~~~~~~~~~~~~~~~~~~~~~~~~~~
c02ace58: e24bd028 sub sp, fp, #40 ; 0x28
c02ace5c: e3530000 cmp r3, #0
c02ace60: 15932000 ldrne r2, [r3]
c02ace64: 13c22001 bicne r2, r2, #1
c02ace68: 13822002 orrne r2, r2, #2
c02ace6c: 15832000 strne r2, [r3]
c02ace70: e51b0034 ldr r0, [fp, #-52] ; 0x34 // return nsegs
c02ace74: e89daff0 ldm sp, {r4, r5, r6, r7, r8, r9, sl, fp, sp, pc} // exit this function
~~~~~~~~~~~~~~~~~~~~~~~~~~

The 'sp' is restored too early: 'nsegs', which is stored at [fp, #-52], then lies below the stack pointer. If an interrupt occurs after 'sp' has been set to fp - 40 and before the value at [fp, #-52] is loaded into r0, the stack usage of the interrupt handling can overwrite 'nsegs'.
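
The race described above can be illustrated with a toy model. This is only a sketch under loudly stated assumptions: memory is a Python dict, the "interrupt" simply writes a few words directly below the current sp, and the number and order of those words (`IRQ_WORDS`) are hypothetical and do not reproduce the kernel's real exception frame layout. The second ordering (load before releasing the frame) is shown for contrast only and is not claimed to be the code the fixed compiler emits.

```python
WORD = 4  # bytes per stack slot


def run_epilogue(restore_sp_first, irq_words, fp=0x1000, nsegs=2):
    """Toy model of the epilogue; returns the value that ends up in r0."""
    mem = {fp - 52: nsegs}  # nsegs lives in the frame at [fp, #-52]

    def irq(sp):
        # Hypothetical exception entry: save words just below the current sp.
        for i, word in enumerate(irq_words, start=1):
            mem[sp - i * WORD] = word

    if restore_sp_first:
        sp = fp - 40          # sub sp, fp, #40
        irq(sp)               # interrupt arrives inside the window
        return mem[fp - 52]   # ldr r0, [fp, #-52] reads the clobbered slot

    r0 = mem[fp - 52]         # load nsegs first ...
    sp = fp - 40              # ... then release the frame
    irq(sp)                   # the same interrupt is now harmless
    return r0


# Hypothetical saved words; the third one lands on fp - 52.
IRQ_WORDS = (0x11111111, 0x22222222, 0xA0000013)

print(hex(run_epilogue(True, IRQ_WORDS)))   # 0xa0000013
print(hex(run_epilogue(False, IRQ_WORDS)))  # 0x2
```

With sp at fp - 40, the three saved words occupy fp - 44, fp - 48 and fp - 52, so the slot holding 'nsegs' is overwritten only in the ordering that restores sp first.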

Upon searching the internet, I think this bug is the same as the one reported in the link below:
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=58854

Do you think you can incorporate the patch from the link above into Linaro "gcc-linaro-arm-linux-gnueabihf-4.8-2013.10_linux" so that I can test it and confirm that it is indeed the same problem?

FYI -- the compiler option '-mapcs' is used when compiling my code. This might not be obvious, as I could not find the option in the Makefile I provided. I had to enable the verbose build option to see the actual compiler options applied during compilation.

Matthew Gretton-Dann (matthew-gretton-dann) wrote on 2014-01-07:

That certainly looks like a likely candidate, and I agree with your reasoning. The patch was backported to GCC 4.8 and should be in Linaro GCC 4.8 2013.12. Would you be able to test that release?

Thanks,

Matt

Wendy Ng (wendy-ng) wrote on 2014-01-07:

Hello Matt,

OK -- I will download the gcc from the following link and give it a try:

https://launchpad.net/gcc-linaro/+milestone/4.8-2013.12

Thanks for your update.

-Wendy

Wendy Ng (wendy-ng) wrote on 2014-01-10:

Hello Matt,

I have re-run the same tests in which we observed the kernel crash with gcc 4.8.2, and I can confirm that Linaro GCC 4.8 2013.12 has fixed the problem. The generated code for blk_rq_map_sg() also looks good to me.

I expect more stress tests will be run with Linaro GCC 4.8 2013.12, and I can file a new ticket if those tests find new issues. So I think you can close this ticket.

Thanks,
-Wendy

Fathi Boudra (fboudra) on 2014-01-11

Changed in linaro-toolchain-binaries:
milestone:	none → 2013.12
status:	Confirmed → Fix Released
Changed in gcc-linaro:
status:	New → Fix Released