Continuous reboot with mmc card fails

Bug #1166246 reported by Mathieu Poirier on 2013-04-08
This bug affects 2 people
Affects             Status       Importance  Assigned to    Milestone
Linaro big.LITTLE   In Progress  Low         Nicolas Pitre

Bug Description

Since IKS was moved to the 3.9 kernel, lockups can be observed. These lockups can occur during boot or during normal system operation. When the system hangs, cluster0 is powered up and cluster1 is off; no other information such as console output or error messages can be observed. The sysrq interface is also unavailable. As of now, Android _seems_ to be the only user space where the issue has been seen.

Only a handful of people can address conditions such as this one without diagnostic tools.

David Zinman (dzinman) on 2013-04-08
Changed in linaro-big-little-system:
importance: Undecided → High
Changed in linaro-big-little-system:
status: New → Confirmed

The bug can only be observed when running Android, either at startup while userspace is initialising or when running the benchmark-automation. The issue is very likely present on other distributions, but their workload is either not heavy enough or laid out in a way that doesn't trigger the problem.

Amit Kucheria (amitk) on 2013-04-17
Changed in linaro-big-little-system:
assignee: nobody → Nicolas Pitre (npitre)
Amit Kucheria (amitk) on 2013-04-18
Changed in linaro-big-little-system:
importance: High → Critical
Amit Kucheria (amitk) on 2013-04-24
Changed in linaro-big-little-system:
status: Confirmed → In Progress

The problem occurs when both A15 CPUs have been disabled by cpuidle. In that condition the CPU and cluster state are not properly sync'ed to memory, forcing the CPU wakeup process to wait for the cluster to get back to a known good state.

The condition has been observed when the A15s have been running at 1.2GHz, 1.1GHz, 1.0GHz and 900MHz. Since the system has been proven to work reliably when cpuidle is disabled, we know that if the A15s have a maximum OPP of 500MHz the system will not lock. The current goal is to find at which frequency (600MHz, 700MHz or 800MHz) the system becomes unstable.
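The search over the remaining candidate frequencies could be organised roughly as in the sketch below. It is only an illustration: `first_unstable_mhz` and the `locks_up` callback are hypothetical names, and the sketch assumes that a lockup at one frequency implies lockups at every higher frequency.

```python
from typing import Callable, Optional, Sequence


def first_unstable_mhz(
    candidates_mhz: Sequence[int],
    locks_up: Callable[[int], bool],
) -> Optional[int]:
    """Return the lowest candidate frequency at which the system locks up.

    `locks_up` is a hypothetical callback: given a maximum A15 OPP in MHz,
    it runs the workload and reports whether the system hung.
    Returns None if no candidate frequency triggers a lockup.
    """
    # Walk the candidates from lowest to highest so the first hit is the
    # lowest unstable frequency (assumes instability grows with frequency).
    for mhz in sorted(candidates_mhz):
        if locks_up(mhz):
            return mhz
    return None
```

On real hardware, `locks_up` would have to cap the A15 maximum OPP at the given frequency and run the workload for a chosen period; how that is done depends on the setup and is not shown here.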

This bug is currently being investigated internally by ARM as a potential platform problem. The issue occurs when the file system is hosted on MMC.

The hypothesis is that a DMA transfer is in progress when a cache flush is requested. The flush operation needs to wait for the MMC DMA transfer to finish, hence introducing a delay that is deemed too long by the firmware. When faced with such a condition the processor is switched off without the final cache flush going through, resulting in a corruption of the processor bring-up state machine.

Tixy (Jon Medhurst) (tixy) wrote :

From comment #3 "The flush operation needs to wait for the MMC DMA transfer to finish"

MMC doesn't use DMA, it uses Programmed IO, is that what is meant? I.e. the time to service the MMC IRQ blocks the system too long?

Tixy (Jon Medhurst) (tixy) wrote :

From comment #2...

"The problem occurs when both A15 CPUs have been disabled by cpuidle. In such condition the CPU and cluster state are not properly sync'ed to memory,"

It may be a long shot, but could the comment "We must disable L2 prefetching on A15 before cleaning L2" be relevant in this... http://lists.infradead.org/pipermail/linux-arm-kernel/2013-June/174327.html

On Wed, 12 Jun 2013, Tixy (Jon Medhurst) wrote:

> It may be a long shot, but could the comment "We must disable L2
> prefetching on A15 before cleaning L2" be relevant in this...

That was attempted with no perceptible change.

Here's a recap of the conditions related to this bug:

1- The bug occurs _only_ when the SD slot is in use. When the root
   filesystem is installed on a USB device the bug does not occur.

2- Even if the root fs is on USB, the bug may still occur if the SD
   slot is used e.g. to copy a filesystem image on it.

3- For the bug to occur, the A15 must be running at a sufficiently
   high clock (above 800MHz) and suddenly go to idle where a whole
   cluster shutdown is applied. Such a condition is tricky to
   reproduce in practice as it requires only one CPU to be highly
   busy which suddenly goes idle, presumably due to waiting for IO
   completion. This might explain why the actual block device in use
   might matter.

4- This bug was reproduced using an Ubuntu image at least once so this
   is not Android specific. The common factor was once again IO through
   the SD slot.

5- The following errors are sent over the top left serial port each time
   this bug occurs:
   |ERROR: CA15 power down waiting for SBWFIL2
   |ERROR: CA15 power down waiting for CACTIVE (0x02)
   |ERROR: CA15 power down waiting for PWRDNACK (0x0E)

Workaround: avoid using the SD slot entirely until ARM has a fix.

I'd suggest that the Linaro media creation tools avoid creating
filesystem images where the SD slot is specified as the root device
in the initrd, and use the USB device instead in the meantime.

On hold until ARM is done investigating.

It is easy to reproduce the problem using the attached script. Simply make sure UART2 is free and directly connected to the board. Note that $UART may have to be modified depending on the setup.
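As a complement to the attached script, here is a minimal sketch of how a captured serial log could be checked for the firmware errors quoted in the recap above. The function name `find_power_down_errors` is hypothetical, and the sketch assumes the UART output has already been saved as text lines; it does not talk to the serial port itself.

```python
from typing import Iterable, List

# Common prefix of the three firmware errors quoted in the recap.
POWER_DOWN_MARKER = "ERROR: CA15 power down waiting for"


def find_power_down_errors(lines: Iterable[str]) -> List[str]:
    """Return the signal names (e.g. SBWFIL2) named in matching error lines."""
    signals = []
    for line in lines:
        # partition() leaves `found` empty when the marker is absent.
        _, found, rest = line.partition(POWER_DOWN_MARKER)
        if found:
            parts = rest.split()
            if parts:
                # The first word after the marker is the signal name.
                signals.append(parts[0])
    return signals
```

A non-empty result for a log captured during a hang would suggest the run hit the same CA15 power-down failure described in the recap, rather than some unrelated lockup.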

summary: - Lockup on post 3.9 IKS kernel
+ Continuous reboot with mmc cards fails
summary: - Continuous reboot with mmc cards fails
+ Continuous reboot with mmc card fails

This is considered to be low priority and will be worked on as time and activities permit.

tags: added: bl-iks
Amit Kucheria (amitk) on 2013-11-22
Changed in linaro-big-little-system:
importance: Critical → Low