Latest NO_HZ_FULL patches cause CPU stalls & spinlock deadlocks in 3.10 PREEMPT_RT kernel

Bug #1260397 reported by Gary S. Robertson
Affects: linaro-networking
Status: Invalid
Importance: High
Assigned to: viresh kumar

Bug Description

After applying Kevin Hilman's latest patches for NO_HZ_FULL support to the 3.10.20-rt17 LNG real-time kernel, CI regression tests began failing with random timeouts, etc. Debug traces in the console output showed evidence of deadlocks in spinlock acquisition and reports of CPU stalls detected by RCU. The kernel configuration also had NO_HZ_FULL_ALL selected, so there may be some connection to not reserving a 'management' CPU that is excluded from attempts to provide full tickless operation.

NOTE that the same kernel without PREEMPT_RT enabled seemed to have none of these issues, so it appears to be an interaction with the changes introduced by the RT patch.

Revision history for this message
Gary S. Robertson (gary-robertson) wrote :

Here are links to the CI regression test results where the failures were noted:

CI job 90514
http://validation.linaro.org/dashboard/permalink/bundle/964b3432c71381e286b024d8887003f516e5c9ca/

CI job 90535
http://validation.linaro.org/dashboard/permalink/bundle/29c6cd5fe44dbca22cad9ef6b26dade3b8ab3975/

Initially I thought the 3.12 RT kernel was experiencing related issues, but that was an oversight on my part: the missing test results I noticed were due to a smaller specified set of tests, not hangups or timeouts on the test targets.
When we can re-test the latest 3.12 RT kernel we can confirm whether any issues remain there, but initial indications are that there are none.

The 3.10 issues nonetheless need to be addressed in order to provide support for the member kernels which remain at the 3.10 version level.

Revision history for this message
viresh kumar (viresh.kumar) wrote :

Kevin: Can you comment on this bug please?

Revision history for this message
Gary S. Robertson (gary-robertson) wrote : Re: [Bug 1260397] Re: Latest HO_HZ_FULL patches cause CPU stalls & spinlock deadlocks in 3.10 PREEMPT_RT kernel

Viresh and Kevin,

We are also seeing similar problems in the 3.12.1 kernel without RT -
possibly *without the NO_HZ patches applied* as well. The investigation
there hasn't yet zeroed in on a culprit but I expect that to be forthcoming
within a couple of days. I am starting this morning by testing with our
BE patches removed as well and plan to begin GIT bisect testing shortly to
identify the patch where the problem begins. I will be sure to copy you
on results.

Gary

On Mon, Dec 16, 2013 at 3:37 AM, viresh kumar <email address hidden>wrote:

> Kevin: Can you comment on this bug please?

Revision history for this message
Kevin Hilman (khilman-deactivatedaccount) wrote : Re: Latest HO_HZ_FULL patches cause CPU stalls & spinlock deadlocks in 3.10 PREEMPT_RT kernel

@#1. Does the 3.12 kernel have NO_HZ_FULL patches applied? If so, from where? ARM is still missing a few Kconfig patches for NO_HZ_FULL on 3.12. I suspect 3.12 RT works because it's not in NO_HZ_FULL mode. (c.f. arm-nohz-v3.12 branch in git://git.kernel.org/pub/scm/linux/kernel/git/khilman/linux.git)

Revision history for this message
Kevin Hilman (khilman-deactivatedaccount) wrote :

Are we really telling members that we support NO_HZ_FULL and PREEMPT_RT together?

I've certainly never tested them together, and am not surprised that things fall apart when combined. This is uncharted territory. If this is really a feature set that is expected to work together, then we have some work to do. I think even with latest mainline, folks are only barely starting to combine NO_HZ_FULL and PREEMPT_RT.

Revision history for this message
Gary S. Robertson (gary-robertson) wrote : Re: [Bug 1260397] Re: Latest HO_HZ_FULL patches cause CPU stalls & spinlock deadlocks in 3.10 PREEMPT_RT kernel

I have not tested the 3.12 kernel with RT applied. I did test it with the
same NO_HZ_FULL patches applied which we tried on the 3.10.20 kernel, but
we had similar issues: RCU-detected CPU stalls, suspected spinlock
deadlocks, and some memory addressing issues. I backed out these patches and
re-tested but the problems persisted. All the configurations tested thus
far had BE patches applied but were tested in LE mode - so my next step is
to remove the BE patches to see if the problem persists. Then we'll start
git bisect testing to identify just where the problems begin. Hopefully by
removing the patches for various features/fixes we can foreshorten the
bisect testing somewhat.
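
For reference, the bisect step planned above could be automated roughly as follows. This is only a sketch: `bisect_between` and the test command passed to it are hypothetical, and the revisions are placeholders, since no known-good revision has been identified in this report. In practice the test command would have to build the kernel, boot it on the target, run the regression test, and exit 0 for a good kernel or non-zero for a bad one.

```shell
#!/bin/sh
# Sketch only: bisect_between is a hypothetical helper, not part of any
# LNG tooling. Run it from inside the kernel git tree.
#
# usage: bisect_between <good-rev> <bad-rev> <test-command> [args...]
bisect_between() {
    good="$1"; bad="$2"; shift 2

    # "git bisect start" takes the bad revision first, then the good one.
    git bisect start "$bad" "$good" || return 1

    # The test command must exit 0 for a good revision and 1-127 for a
    # bad one; exit code 125 tells git to skip an untestable revision.
    git bisect run "$@"
    rc=$?

    # Return the working tree to where it was before bisecting.
    git bisect reset
    return $rc
}
```

A call might look like `bisect_between <good-rev> <bad-rev> ./run-regression.sh`, where `run-regression.sh` stands in for whatever script drives the build and the test on the board.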

On Mon, Dec 16, 2013 at 9:49 AM, Kevin Hilman <email address hidden> wrote:

> @#1. Does the 3.12 kernel have NO_HZ_FULL patches applied? If so,
> from where? ARM is still missing a few Kconfig patches for NO_HZ_FULL
> on 3.12. I suspect 3.12 RT works because it's not in NO_HZ_FULL mode.
> (c.f. arm-nohz-v3.12 branch in
> git://git.kernel.org/pub/scm/linux/kernel/git/khilman/linux.git)

Revision history for this message
Mike Holmes (mike-holmes) wrote :

> Are we really telling members that we support NO_HZ_FULL and PREEMPT_RT
> together?

That is a goal. LNG has, as part of its mandate, to work towards making
this a stable configuration, while at the same time investigating whether
we can get the required performance without the PREEMPT_RT patch.

Mike

On 16 December 2013 11:17, Gary S. Robertson <email address hidden>wrote:

> I have not tested the 3.12 kernel with RT applied. I did test it with the
> same NO_HZ_FULL patches applied which we tried on the 3.10.20 kernel, but
> we had similar issues: RCU-detected CPU stalls, suspected spinlock
> deadlocks, and some memory addressing issues. I backed out these patches and
> re-tested but the problems persisted. All the configurations tested thus
> far had BE patches applied but were tested in LE mode - so my next step is
> to remove the BE patches to see if the problem persists. Then we'll start
> git bisect testing to identify just where the problems begin. Hopefully by
> removing the patches for various features/fixes we can foreshorten the
> bisect testing somewhat.

summary: - Latest HO_HZ_FULL patches cause CPU stalls & spinlock deadlocks in 3.10
+ Latest NO_HZ_FULL patches cause CPU stalls & spinlock deadlocks in 3.10
PREEMPT_RT kernel
Changed in linaro-networking:
assignee: nobody → viresh kumar (viresh.kumar)
importance: Undecided → High
status: New → Confirmed
Revision history for this message
Mike Holmes (mike-holmes) wrote :

https://validation.linaro.org/dashboard/image-charts/LNG-NO_HZ

Regression results appear to show that with RT, no isolation is achieved at all.

Revision history for this message
Gary S. Robertson (gary-robertson) wrote :

We need to revisit this issue, since several versions of the NO_HZ support patches have been submitted and at this point it is unclear which version or combination of them is needed.

Revision history for this message
Gary S. Robertson (gary-robertson) wrote :

Both linux-lng 3.12.8 and linux-lng-preempt-rt-3.12.8-rt11 are failing and 'hanging' in the LAVA NO_HZ isolation tests.

Below is the output from a linux-lng-3.12.8 test run:
Section 2149

 /lava/bin/lava-test-runner /lava
 /lava
 <LAVA_TEST_RUNNER>: started
 <LAVA_TEST_RUNNER>: looking for installation work in /lava/lava-test-runner.conf-1390829708
 <LAVA_TEST_RUNNER>: save hardware/software context info...
 <LAVA_TEST_RUNNER>: looking for work in /lava/lava-test-runner.conf-1390829708
 <LAVA_TEST_RUNNER>: running 0_NO_HZ_FULL under lava-test-shell...
 <LAVA_SIGNAL_STARTRUN NO_HZ_FULL 7f6319e0-dcd2-4d00-bb8c-94a98905ccf9>
 echo LAVA_ACK

 Started Isolating CPUs - via CPUSETS
 ------------------------------------

 sched_tick_max_deferment set to: 4294967295
 vm.stat_interval = 1000
 vm.dirty_writeback_centisecs = 100000
 vm.dirty_expire_centisecs = 100000
 error: "kernel.watchdog" is an unknown key
 Cannot move PID 2:kthreadd
 Cannot move PID 3:ksoftirqd/0
 Cannot move PID 4:kworker/0:0
 Cannot move PID 5:kworker/0:0H
 Cannot move PID 6:kworker/u4:0
 Cannot move PID 7:migration/0
 Cannot move PID 14:migration/1
 Cannot move PID 15:ksoftirqd/1
 Cannot move PID 16:kworker/1:0
 Cannot move PID 17:kworker/1:0H
 Cannot move PID 18:khelper
 Cannot move PID 20:kworker/u4:1
 Cannot move PID 265:writeback
 Cannot move PID 267:bioset
 Cannot move PID 269:kblockd
 Cannot move PID 280:ata_sff
 Cannot move PID 428:rpciod
 Cannot move PID 429:kworker/0:1
 Cannot move PID 430:kvm_arch_timer
 Cannot move PID 451:nfsiod
 Cannot move PID 453:bioset
 Cannot move PID 455:crypto
 Cannot move PID 1067:kworker/1:1
 Cannot move PID 1153:kpsmoused
 Cannot move PID 1160:dw-mci-card
 Cannot move PID 1162:dw-mci-card
 Cannot move PID 1163:kworker/u4:2
 Cannot move PID 1193:kworker/0:2
 Cannot move PID 1208:deferwq
 Cannot move PID 1220:kworker/0:1H
 Cannot move PID 1222:ext4-rsv-conver
 Cannot move PID 1226:kworker/1:1H
 Cannot move PID 1668:kworker/u5:0[ 14.300372] IRQ153 no longer affine to CPU1
 [ 14.300743] CPU1: shutdown

 [ 14.317520] ---[ end trace 81f584dacc130a05 ]---
 [ 14.317561] Kernel panic - not syncing: Attempted to kill the idle task!
 [ 14.317573] CPU0: stopping
 [ 14.329888] CPU: 0 PID: 1661 Comm: is-cpu-isolated Tainted: G D 3.12.8-linaro-arndale #2
 [ 14.338702] [<8001cf85>] (unwind_backtrace+0x1/0x9c) from [<8001a90d>] (show_stack+0x11/0x14)
 [ 14.347066] [<8001a90d>] (show_stack+0x11/0x14) from [<8044d9f1>] (dump_stack+0x61/0x6c)
 [ 14.355007] [<8044d9f1>] (dump_stack+0x61/0x6c) from [<8001c43b>] (handle_IPI+0xd3/0xe0)
 [ 14.362948] [<8001c43b>] (handle_IPI+0xd3/0xe0) from [<800084b7>] (gic_handle_irq+0x57/0x58)
 [ 14.371232] [<800084b7>] (gic_handle_irq+0x57/0x58) from [<8001b1db>] (__irq_svc+0x3b/0x5c)
 [ 14.379428] Exception stack(0xed8a3e30 to 0xed8a3e78)
 [ 14.384379] 3e20: 14d081ad 00000000 fff95200 00000097
 [ 14.392412] 3e40: 80797460 000000ef 14d08116 806e00c0 806ed93c 80708fa8 ffff21c2 8045471c
 [ 14.400440] 3e60: 00000002 ed8a3e78 8001c9e9 8027c1e6 80000033 ffffffff
 [ 14.406939] [<8001b1db>] (__irq_svc+0x3b/0x5c) from [<8027c1e6>] (__timer_delay+0x22/0x34)
 [ 14.415053] [<8027c1e6>] (__timer_delay+0x22/0x34) from [<80024a99>] (exyno...


Revision history for this message
Kevin Hilman (khilman-deactivatedaccount) wrote : Re: [Bug 1260397] Re: Latest NO_HZ_FULL patches cause CPU stalls & spinlock deadlocks in 3.10 PREEMPT_RT kernel

The kernel is crashing during the bootup of the secondary CPU(s) after
a hotplug. This is well before any isolation tests are actually
running.

I'm guessing the isolation test script is doing a CPU hot unplug
followed immediately by a CPU hotplug in order to force migration of
timers etc. That is what's causing the crash here.

That out-of-tree hotplug patch could be a potential culprit, but also
some basic hotplug tests should probably be run on the platform to be
sure that it's stable as well.

It also looks like this script is doing the hotplug after the CPUsets
are setup for isolation (but before any isolation tests are running.)
The interactions between hotplug and CPUsets can be a little strange
so I would recommend doing the hotunplug/hotplug *before* the CPUsets
are configured.
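
A minimal sketch of that ordering, assuming CPU1 is the CPU to be isolated, the cpuset hierarchy is mounted at /dev/cpuset with `cpuset.`-prefixed control files, and memory node 0 is the only node. The function names and the SYSFS_ROOT/CPUSET_ROOT overrides are hypothetical; none of this is taken from is-cpu-isolated.sh, and on a real target it has to run as root.

```shell
#!/bin/sh
# Sketch only: paths and names are assumptions, not the test script's own.
# SYSFS_ROOT and CPUSET_ROOT are hypothetical overrides so the functions
# can be exercised against a scratch directory.
SYSFS_ROOT="${SYSFS_ROOT:-/sys}"
CPUSET_ROOT="${CPUSET_ROOT:-/dev/cpuset}"

# Take a CPU offline and bring it straight back, so that timers and
# tasks migrate away before any cpuset exists.
cycle_cpu() {
    cpu="$1"
    ctl="$SYSFS_ROOT/devices/system/cpu/cpu$cpu/online"
    echo 0 > "$ctl" || return 1
    echo 1 > "$ctl" || return 1
}

# Create a cpuset named $1 restricted to CPU list $2 on memory node 0.
make_cpuset() {
    name="$1"; cpus="$2"
    mkdir -p "$CPUSET_ROOT/$name" || return 1
    echo "$cpus" > "$CPUSET_ROOT/$name/cpuset.cpus" || return 1
    echo 0 > "$CPUSET_ROOT/$name/cpuset.mems" || return 1
}

# Suggested order: hotplug cycle first, cpuset configuration second.
isolate_cpu1() {
    cycle_cpu 1 &&
    make_cpuset housekeeping 0 &&
    make_cpuset isolated 1
}
```

Calling `isolate_cpu1` performs the hotunplug/hotplug of CPU1 and only then creates the two cpusets, so the hotplug never runs against an already-configured cpuset.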

Kevin

On Thu, Jan 30, 2014 at 6:50 PM, Gary S. Robertson
<email address hidden> wrote:
> Both linux-lng 3.12.8 and linux-lng-preempt-rt-3.12.8-rt11 are failing
> and 'hanging' in the LAVA NO_HZ isolation tests.

Revision history for this message
viresh kumar (viresh.kumar) wrote :

Thanks Kevin. Yeah, it's the same issue mentioned by Kevin now and by me earlier. You need the patch I sent you to get this fixed.

Revision history for this message
Gary S. Robertson (gary-robertson) wrote :

The last patch I remember receiving was:

> This recent patch from Inderpal Singh was also applied:
> ARM: EXYNOS: Fix hotplug when CPUs are booted in HYP mode 2013-04-24

Is there another one which I missed?

On Sun, Feb 2, 2014 at 11:08 PM, viresh kumar <email address hidden>wrote:

> Thanks Kevin. Yeah its the same issue mentioned by kevin now and me
> earlier. You need the patch I have sent you to get this fixed.

Revision history for this message
Gary S. Robertson (gary-robertson) wrote :

Testing of NO_HZ_FULL CPU isolation continues with kernel 3.12.10. This kernel has the following locally-applied patches to the kernel sources for NO_HZ_FULL support:

37eea35 ARM: EXYNOS: Fix hotplug when CPUs are booted in HYP mode
4749737 sched/nohz: fix overflow error in scheduler_tick_max_deferment()
d74a274 sched/nohz: add debugfs control over sched_tick_max_deferment
fd173ce ARM: Kconfig: allow full nohz CPU accounting
655ac5c full_nohz: Kconfig: add HAVE_VIRT_CPU_ACCOUNTING_GEN
4364b06 nohz_full: Kconfig: drop requrement on 64-bit
94cefdf nohz_full: Kconfig: VIRT_CPU_ACCOUNTING_GEN: drop 64-bit requirement

It was built with the 14.01 Linaro GCC cross-development toolchain release using the following config fragments:

export conf_filenames="linaro/configs/garys.conf linaro/configs/no_hz_full.conf linaro/configs/arndale.conf"

garys.conf is the 'tweaked exynos_defconfig' baseline configuration file.

no_hz_full.conf was modified to contain the settings required for no_hz_full operation in the absence of the standard LNG CI config fragment list:

CONFIG_NO_HZ_FULL=y
CONFIG_NO_HZ_FULL_ALL=y
CONFIG_NO_HZ_COMMON=y
CONFIG_NO_HZ=y
CONFIG_NO_HZ_IDLE=n
CONFIG_HZ_PERIODIC=n
CONFIG_RCU_USER_QS=y
CONFIG_RCU_NOCB_CPU=y
CONFIG_VIRT_CPU_ACCOUNTING_GEN=y
CONFIG_CONTEXT_TRACKING_FORCE=y
CONFIG_IRQ_WORK=y
CONFIG_CGROUPS=y
CONFIG_CPUSETS=y
CONFIG_MAGIC_SYSRQ=y
CONFIG_DEBUG_FS=y

arndale.conf was modified with settings which seem to be required for stability of the 3.12 kernel on arndale:

# Tweaks to make 3.12 kernel stable on arndale
CONFIG_CPU_FREQ=n
CONFIG_CPU_FREQ_DEFAULT_GOV_ONDEMAND=n
CONFIG_CPU_IDLE=n
CONFIG_THUMB2_KERNEL=n
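
One way to confirm that fragments like the ones above actually took effect is to compare them against the generated .config. The sketch below assumes the usual .config conventions (an enabled option appears as CONFIG_X=value, a disabled one is either absent or listed as "# CONFIG_X is not set"). `check_fragment` is a hypothetical helper and is not part of the LNG CI scripts.

```shell
#!/bin/sh
# Sketch only: check_fragment is a hypothetical helper.
#
# usage: check_fragment <fragment-file> <dot-config-file>
# Prints one line per mismatch and returns non-zero if any were found.
check_fragment() {
    fragment="$1"; config="$2"; status=0

    while IFS='=' read -r key value; do
        # Skip blank lines, comments, and anything that is not an option.
        case "$key" in
            CONFIG_*) ;;
            *) continue ;;
        esac

        if [ "$value" = "n" ]; then
            # A disabled option must not be assigned any value in .config.
            if grep -q "^$key=" "$config"; then
                echo "unexpected: $key is set"
                status=1
            fi
        elif ! grep -qx "$key=$value" "$config"; then
            # An enabled option must appear with exactly the requested value.
            echo "missing: $key=$value"
            status=1
        fi
    done < "$fragment"

    return $status
}
```

For example, `check_fragment linaro/configs/no_hz_full.conf .config` run from the build directory would list any option that Kconfig dependencies silently dropped or changed.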

The LAVA NO_HZ_FULL CPU isolation tests continue to become unresponsive and cause timeouts. Excerpted LAVA console output from a typical test run is shown below:

 boot
 reading uImage
 2883664 bytes read
 reading board.dtb
 32589 bytes read
 ## Booting kernel from Legacy Image at 40007000 ...
  Image Name: Linux
  Image Type: ARM Linux Kernel Image (uncompressed)
  Data Size: 2883600 Bytes = 2.8 MiB
  Load Address: 40008000
  Entry Point: 40008000
  Verifying Checksum ... OK
 ## Flattened Device Tree blob at 41f00000
  Booting using the fdt blob at 0x41f00000
  Loading Kernel Image ... OK
 OK
  Using Device Tree in place at 41f00000, end 41f0af4c

 Starting kernel ...

 [ 0.000000] Booting Linux on physical CPU 0x0
 [ 0.000000] Initializing cgroup subsys cpuset
 [ 0.000000] Linux version 3.12.10-linaro-arndale (gary.robertson@honkintosh) (gcc version 4.8.3 20140106 (prerelease) (crosstool-NG linaro-1.13.1-4.8-2014.01 - Linaro GCC 2013.11) ) #2 SMP Mon Feb 10 11:59:07 CST 2014
 [ 0.000000] CPU: ARMv7 Processor [410fc0f4] revision 4 (ARMv7), cr=10c5387d
 [ 0.000000] CPU: PIPT / VIPT nonaliasing data cache, PIPT instruction cache
 [ 0.000000] Machine: SAMSUNG EXYNOS5 (Flattened Device Tree), model: Insignal Arndale evaluation board based on EXYNOS5250
 [ 0.000000] NR_BANKS too low, ignoring high memory
 [ 0.000000] Memory policy: ECC disabled, Data cache writealloc
 [ 0.000000] CPU EXYNOS5250 (id 0x43520010)
 [ 0.000000] PERCPU: Embedded 7 pages/cpu @815db000 s7552 r8192 d12928 u32768
 [ 0.000000] Built 1 zonelists in Zone o...

Revision history for this message
viresh kumar (viresh.kumar) wrote :

Gary, can you please share your version of is-cpu-isolated.sh, just to make sure you are using the latest copy?
I updated the test-definitions repo earlier with these patches:

1249e65 is-cpu-isolated.sh: Sense infinite isolation time
640b19f is-cpu-isolated: Move all tasks to CPU0 by hotunplugging CPU1

Revision history for this message
Gary S. Robertson (gary-robertson) wrote :

Viresh,

My test definitions repo master branch HEAD is at: ae8cf61 Merge "Add
Ethernet Test for Linaro Android", and has both the above patches: 640b19f
is-cpu-isolated: Move all tasks to CPU0 by hotunplugging CPU1 -and- 1249e65
is-cpu-isolated.sh: Sense infinite isolation time. With the CPU hotplug
patch in place on 3.12.10 I am still getting a failure to return from what
I suspect is a lengthy or indefinite CPU isolation period. On linux-lng
3.10.27 with the CPU hotplug patch in place, I am getting a kernel panic
associated with THUMB2 instructions... so I'll be disabling THUMB2_KERNEL
there Friday morning US time. Unfortunately at the moment I am using all
my test resources and time getting data and preparing graphs for an
upcoming LCA14 presentation - so right now my CPU isolation testing is in a
pause mode. Probably next week I can get back to doing some tests there.

On Fri, Feb 14, 2014 at 12:28 AM, viresh kumar <email address hidden>wrote:

> Gary can you please share your version of: is-cpu-isolated.sh, just to
> make sure you are using the latest copy..
> As I have updated test-definitions repo earlier with these patches:
>
> 1249e65 is-cpu-isolated.sh: Sense infinite isolation time
> 640b19f is-cpu-isolated: Move all tasks to CPU0 by hotunplugging CPU1

Revision history for this message
viresh kumar (viresh.kumar) wrote :

Hi Gary,

I have just submitted a few more patches for review in the test-definitions repo. Please see if they fix your LAVA timeouts. I think they will :)

Revision history for this message
Gary S. Robertson (gary-robertson) wrote :

The patch set which was causing these problems has been abandoned and replaced with newer patches. The CPU stalls and spinlock deadlocks seen with the previous patch set are no longer an issue. Let's close this bug now, as all remaining issues are better tracked under bug 1270873, "no_hz full isolation benchmark fails".

Changed in linaro-networking:
status: Confirmed → Invalid