Cannot get NO_HZ_FULL to work

Bug #1224324 reported by Magnus Karlsson
This bug affects 1 person

Affects: linaro-networking
Status: Invalid
Importance: High
Assigned to: viresh kumar

Bug Description

Hi,

I am trying to get NO_HZ_FULL to work with linux-lng-preempt-rt-v3.10.10-rt7 and am failing miserably. What am I doing wrong?

Boot options:

setenv bootargs "isolcpus=1 nohz_full=1 rcu_nocbs=1 root=/dev/mmcblk1p2 rw rootwait console=ttySAC2,115200n8 init --no-log"

Config file attached.

From the boot log:

Preemptible hierarchical RCU implementation.
        Experimental no-CBs for all CPUs
        Experimental no-CBs CPUs: 0-1.
NO_HZ: Full dynticks CPUs: 1.

The ticks are off on core 1 when I start the system. But if I run any program on core 1, by putting the code below in the C file, I see the tick starting in cat /proc/interrupts:

cpu_set_t cpu_set;

CPU_ZERO(&cpu_set);
CPU_SET(1, &cpu_set);
if (sched_setaffinity(0, sizeof(cpu_set_t), &cpu_set) == -1)
{
    perror("sched_setaffinity");
}

A ps -e -L -o pid,psr,pcpu,command tells me that the only kernel threads on core 1 are the ones that are not movable.

  PID PSR %CPU COMMAND
   17   1  0.0 [migration/1]
   18   1  0.0 [ksoftirqd/1]
   19   1  0.0 [kworker/1:0]
   20   1  0.0 [kworker/1:0H]
 2628   1  0.0 [kworker/1:1]

I can even make sure with "taskset" that all threads have an affinity mask of 1 (only core 0) when possible, but it does not help. As soon as I run anything on core 1, the tick starts. What am I doing wrong?

Thank you: Magnus

Revision history for this message
Magnus Karlsson (magnus-karlsson) wrote :
Revision history for this message
Gary S. Robertson (gary-robertson) wrote : Re: [Bug 1224324] [NEW] Cannot get NO_HZ_FULL to work

I can't see anything wrong with your configuration or boot command line.
It looks like this should work as advertised, so we will need to see why
you are getting this behavior. I will begin setting up to test this right
away. This may take a bit of time as the Linaro Networking Group is
relatively new and we are still in the process of staffing and getting our
infrastructure established - but rest assured we will be working on this
issue. Thanks for reporting this behavior.

Gary Robertson

On Thu, Sep 12, 2013 at 2:45 AM, Magnus Karlsson <email address hidden> wrote:

> Public bug reported:
> ...

Revision history for this message
Gary S. Robertson (gary-robertson) wrote :

Once again, I suspect this may be a CPU isolation problem. NO_HZ_FULL is turned off on any core as soon as more than one thread is being scheduled there. If the scheduler begins operating on this core as soon as tasks are scheduled on the other core, and schedules the idle task in conjunction with your RT idle-loop application, then NO_HZ_FULL operation would by design cease on core 1.

Changed in linaro-networking:
assignee: nobody → viresh kumar (viresh.kumar)
importance: Undecided → High
Revision history for this message
Mike Holmes (mike-holmes) wrote :

Is this an RT-specific issue, or does it also occur on linux-lng?
I think running on the current RT head as well, just to be sure, might be worth it -> linux-lng-preempt-rt

Revision history for this message
Gary S. Robertson (gary-robertson) wrote : Re: [Bug 1224324] Re: Cannot get NO_HZ_FULL to work

Looking at the symptom report for this bug and also for bug #1224318,
"Preempt_rt kernel enters idle loop even when there are processes ready"
(https://bugs.launchpad.net/linaro-networking/+bug/1224318),

which Magnus also reported, it looks likely to me that both sets of
erroneous behavior may have the same root cause. The scheduler runs the
idle task on the CPU which was supposed to be isolated and running
NO_HZ_FULL. When this happens, NO_HZ_FULL operation will cease by design
on that CPU core because more than one thread is in the scheduler queue
there.
The critical issue is: why is the idle task running on that CPU core?
Either the single process running on that core is sleeping occasionally or
something is broken in CPU isolation or in the scheduler itself.

It was stated in one of these two bug reports that the behavior was the
same on linux-lng as on linux-lng-preempt-rt, so I don't believe it is an
RT-specific behavior.

Magnus mentions a high-priority busy loop running on the CPU which is
isolated and running NO_HZ_FULL. But if this busy loop makes library or
kernel calls which might sleep - for example during the measurement of
latency or the recording of those measurements - then I suspect this might
override the CPU isolation via a trip through the scheduler from the
aforementioned system or library call. If the single process running on a
NO_HZ_FULL core sleeps, I think the scheduler HAS to enter the idle task on
that core. Even a system or library call to read a timer may encounter a
mutex which might cause the process to sleep.

Without knowing how the busy loop process operates we can only speculate
about this. Crafting a process which works successfully with NO_HZ_FULL
may be surprisingly elusive. Ideally the busy loop process should
accumulate measurements in RAM without making any system calls to perform
or store the latency measurements - perhaps by sampling some timer hardware
directly or using a wait loop to determine when it could safely read a
timer count provided in shared memory by an RT process running at a
slightly higher priority on the other core. After some finite loop count
the busy loop would then write its accumulated measurements to disk and
terminate. Also I would suggest using an RT scheduling priority of 49 or
less, since threaded ISRs typically run at priority 50 or 51 as best as I
recall. This would mean the busy loop at priority 99 might actually defer
timer hardware interrupt servicing and thus distort reported measurements.

On Thu, Sep 26, 2013 at 7:36 AM, Mike Holmes <email address hidden> wrote:

> Is this an RT specific issue, or does it also occur on linux-lng ?
> ...

Revision history for this message
Magnus Karlsson (magnus-karlsson) wrote :

Gary,

Thanks for looking into this. I think you are correct in that 1224318 and this issue might have the same root cause. You can find the test application attached to that issue. As you can see there are no system calls at all in the loop. Time is measured by reading HW registers directly from user space. Also, the benchmark runs fine on 3.6 and 3.7.

/Magnus

Changed in linaro-networking:
status: New → In Progress
Revision history for this message
Mike Holmes (mike-holmes) wrote :

Viresh, do you have any updates on this bug? The last comment was 2013-09-27. Is this the same root cause? The referenced bug was closed as fixed in a newer version.

Revision history for this message
viresh kumar (viresh.kumar) wrote :

On Thursday 14 November 2013 09:51 PM, Mike Holmes wrote:
> Viresh, do you have any updates on this bug, the last comment was
> 013-09-27, is this the same root cause, that referred bug was closed as
> fixed in a newer version.
>

I was looking at this bug today and ended up reading about cpusets... I am
still working on it, so you can expect an update soon..

Revision history for this message
viresh kumar (viresh.kumar) wrote :

Hi Kevin,

I was trying this bug today on Arndale. I was running v3.10.13 with the following patches on top:

cb5c69a ARM: Kconfig: allow full nohz CPU accounting
7296e0e nohz: Drop generic vtime obsolete dependency on CONFIG_64BIT
530b0fd vtime: Add HAVE_VIRT_CPU_ACCOUNTING_GEN Kconfig

I tried running your script from https://wiki.linaro.org/WorkingGroups/PowerManagement/Doc/AdaptiveTickless (with the ftrace parts and the cpuset removal taken out).

I see the following messages when I run it:

Cannot move PID 2: kthreadd
Cannot move PID 3: ksoftirqd/0
Cannot move PID 5: kworker/0:0H
Cannot move PID 6: kworker/u4:0
Cannot move PID 7: migration/0
Cannot move PID 17: migration/1
Cannot move PID 18: ksoftirqd/1
Cannot move PID 20: kworker/1:0H
Cannot move PID 21: khelper
Cannot move PID 23: netns
Cannot move PID 24: kworker/u4:1
Cannot move PID 198: writeback
Cannot move PID 200: bioset
Cannot move PID 202: kblockd
Cannot move PID 357: kworker/1:1
Cannot move PID 424: crypto
Cannot move PID 1128: dw-mci-card
Cannot move PID 1130: dw-mci-card
Cannot move PID 1158: kworker/1:2
Cannot move PID 1162: deferwq
Cannot move PID 1241: kworker/0:1H
Cannot move PID 1257: kworker/1:1H
Cannot move PID 1332: ext4-dio-unwrit
Cannot move PID 1909: kworker/0:2
Cannot move PID 1954: kworker/0:0

None of these tasks could be moved to the "rt" group..
ps -aFd gave this:

UID      PID PPID C SZ RSS PSR STIME TTY TIME     CMD
root       2    0 0  0   0   0  1969   ? 00:00:00 [kthreadd]
root       3    2 0  0   0   0  1969   ? 00:00:00 [ksoftirqd/0]
root       5    2 0  0   0   0  1969   ? 00:00:00 [kworker/0:0H]
root       6    2 0  0   0   1  1969   ? 00:00:00 [kworker/u4:0]
root       7    2 0  0   0   0  1969   ? 00:00:00 [migration/0]
root       8    2 0  0   0   0  1969   ? 00:00:00 [rcu_preempt]
root       9    2 0  0   0   0  1969   ? 00:00:00 [rcuop/0]
root      10    2 0  0   0   0  1969   ? 00:00:00 [rcuop/1]
root      11    2 0  0   0   0  1969   ? 00:00:00 [rcu_bh]
root      12    2 0  0   0   0  1969   ? 00:00:00 [rcuob/0]
root      13    2 0  0   0   0  1969   ? 00:00:00 [rcuob/1]
root      14    2 0  0   0   0  1969   ? 00:00:00 [rcu_sched]
root      15    2 0  0   0   0  1969   ? 00:00:00 [rcuos/0]
root      16    2 0  0   0   0  1969   ? 00:00:00 [rcuos/1]
root      17    2 0  0   0   1  1969   ? 00:00:00 [migration/1]
root      18    2 0  0   0   1  1969   ? 00:00:00 [ksoftirqd/1]
root      20    2 0  0   0   1  1969   ? 00:00:00 [kworker/1:0H]
root      21    2 0  0   0   1  1969   ? 00:00:00 [khelper]
root      22    2 0  0   0   0  1969   ? 00:00:00 [kdevtmpfs]
root      23    2 0  0   0   1  1969   ? 00:00:00 [netns]
root      24    2 0  0   0   0  1969   ? 00:00:00 [kworker/u4:1]
root     198    2 0  0   0   1  1969   ? 00:00:00 [writeback]
root     200    2 0  0   0   1  1969   ? 00:00:00 [bioset]
root     202    2 ...

Read more...

Revision history for this message
viresh kumar (viresh.kumar) wrote :

Kevin,

Same behavior observed on today's Linus/master:

2d3c627 Revert "init/Kconfig: add option to disable kernel compression"

No additional patches were required this time, as all your patches are already in..

Revision history for this message
Kevin Hilman (khilman-deactivatedaccount) wrote :

Viresh, it's normal that per-cpu threads do not get moved since they are pinned to CPUs. However, it's not expected that they run and get in the way. If you see those threads running, it would be useful to have a trace of the activity causing it.

Also, you wrote

> So, when I am running my terminal on gp, then the arch timer for CPU1 doesn't show any update in number.
> But as soon as I move my terminal to rt, arch timer count starts increasing..

Hmm, that's exactly what I expect to happen. Can you clarify what you're expecting vs. what you're seeing?

The full NOHZ patch set does not itself *prevent* anything from running on specific CPUs. All it does is allow the tick to be shut down when 1 (or fewer) tasks are running on a CPU. There is still a bunch of manual work to isolate a CPU using affinity/cpusets etc. in order to create the conditions for full NOHZ to work.

Also for linus/master, you'll need a couple of the debugfs patches to disable the 1Hz residual tick:

https://lkml.org/lkml/2013/9/16/499
https://lkml.org/lkml/2013/9/16/500

Revision history for this message
viresh kumar (viresh.kumar) wrote :

On 19 November 2013 03:24, Kevin Hilman <email address hidden> wrote:
> Viresh, it's normal that per-cpu threads do not get moved since they are
> pinned to CPUs. However, it's not expected that they run and get in the
> way. If you see those threads running, it would be useful to have a
> trace of the activity causing it.

Okay..

>> So, when I am running my terminal on gp, then the arch timer for CPU1 doesn't show any update in number.
>> But as soon as I move my terminal to rt, arch timer count starts increasing..
>
> Hmm, That's exactly what I expect to happen. Can you clarify what
> you're expecting vs what you're seeing?

I thought we had just moved a single thread there, so the tick shouldn't
be running..

But it looks like we moved two threads there.. One is the terminal
and the second is the 'ps' command that I ran :)

So, if I simply keep the terminal on CPU0 and somehow run only ps
on CPU1, the tick shouldn't be running?

> The full NOHZ patches set does not itself *prevent* anything from
> running on specific CPUs. All it does allow the tick to be shut down
> when 1 (or less) tasks are running on a CPU. There is still a bunch of
> manual work to isolate a CPU using affinity/cpusets etc. in order to
> create the conditions for full NOHZ to work.

Affinity is already tied to CPU0 for most of the irqs, and I have used
cpusets as well..

> Also for linus/master, you'll need a couple of the debugfs patches to
> disable the 1Hz residual tick:
>
> https://lkml.org/lkml/2013/9/16/499
> https://lkml.org/lkml/2013/9/16/500

I see..

Revision history for this message
Kevin Hilman (khilman-deactivatedaccount) wrote :

On Mon, Nov 18, 2013 at 7:51 PM, viresh kumar <email address hidden> wrote:
>
> I though we have just moved a single thread there and so we shouldn't
> have tick running..
>
> But it looks like we have moved two threads there.. One is the terminal
> and second is the 'ps' command that I have run :)

That's correct, using a shell as a test case is problematic because it
will spawn other processes as you run commands.

> So, if I simply keep the terminal on CPU0 and somehow run only ps
> on CPU1 tick shouldn't be running?

Correct. Either use a utility like taskset with a single-threaded
test app, or use cpusets like I do in my test script.

Revision history for this message
viresh kumar (viresh.kumar) wrote :

On 19 November 2013 21:07, Kevin Hilman <email address hidden> wrote:
> Correct. Either use a a utility like taskset with a single-threaded
> test app, or use CPUsets like I do in my test script.

Thanks for your help Kevin, I was able to shut down the tick for 30 seconds
on CPU1 with the help of the attached script (mostly like yours).

One more thing: when I run 'stress' with --cpu 1, I can see two threads
created for stress on my CPU. Why is that?

Revision history for this message
viresh kumar (viresh.kumar) wrote :

On 27 September 2013 12:32, Magnus Karlsson <email address hidden> wrote:
> Thanks for looking into this. I think you are correct in that 1224318
> and this issue might have the same root cause. You can find the test
> application attached to that issue. As you can see there are no system
> calls at all in the loop. Time is measured by reading HW registers
> directly from user space. Also, the benchmark runs fine on 3.6 and 3.7.

Hi Magnus,

I am able to get NO_HZ working on mainline. I believe there were some problems
in your setup, which is why you failed to get it working earlier..

Your bootargs:

setenv bootargs "isolcpus=1 nohz_full=1 rcu_nocbs=1
root=/dev/mmcblk1p2 rw rootwait console=ttySAC2,115200n8 init
--no-log"

First of all, it's no longer recommended to use isolcpus (as you already
know), and I don't know how it will behave with NO_HZ.. Better to use
cpusets instead..

Then, I didn't need nohz_full=1 rcu_nocbs=1, as I had the following in my
.config: CONFIG_NO_HZ_FULL_ALL=y

After booting the system I used cpusets to move all existing tasks to CPU0
and then ran 'stress --cpu 1' (the 1 here is the number of worker threads, not a CPU number) on CPU1.

When I use max deferment for the tick, I am able to shut the tick off for
almost 30 seconds; without it, the deferment defaults to 1 HZ, so the CPU
gets a tick every second..

Can you please try the script which I already shared on bug tracker:
my-nohz-test-cpuset.sh

You need to run on mainline for that, plus the patches that Kevin suggested.

To make it easy for you to reproduce it I have pushed my branch here:
https://git.linaro.org/gitweb?p=people/vireshk/mylinux.git;a=shortlog;h=refs/heads/nohz-working

This contains the defconfig updates required to get the exact
setup.. Just use exynos_defconfig and it should work..

Let me know if you have any more issues with this stuff..

Changed in linaro-networking:
status: In Progress → Invalid
Revision history for this message
Magnus Karlsson (magnus-karlsson) wrote :

Thanks Viresh. I will try this out.

Revision history for this message
Kevin Hilman (khilman-deactivatedaccount) wrote :

viresh kumar <email address hidden> writes:

> One more thing, when I run 'stress' with --cpu 1, I can see two threads
> created for stress on my CPU. Why so?

Because of the CPUset method you're using, I suspect the shell itself is
the second task you're seeing. If you can generate/send a trace, I
could tell you for sure.

Kevin

Revision history for this message
viresh kumar (viresh.kumar) wrote :

On 20 November 2013 21:19, Kevin Hilman <email address hidden> wrote:
> viresh kumar <email address hidden> writes:
>
>> One more thing, when I run 'stress' with --cpu 1, I can see two threads
>> created for stress on my CPU. Why so?
>
> Because of the CPUset method you're using, I suspect the shell itself is
> the second task you're seeing. If you can generated/send a trace, I
> could tell you for sure.

So, this is what I get normally on shell:

root@linaro-developer:/home/linaro# stress -q --cpu 1 --timeout 2000 &
[1] 21078

root@linaro-developer:/home/linaro# ps
  PID TTY TIME CMD
 1782 ttySAC2 00:00:00 login
 1874 ttySAC2 00:00:00 bash
21078 ttySAC2 00:00:00 stress
21079 ttySAC2 00:00:01 stress
21080 ttySAC2 00:00:00 ps
root@linaro-developer:/home/linaro#

See, two tasks for stress?

Revision history for this message
Kevin Hilman (khilman-deactivatedaccount) wrote :

viresh kumar <email address hidden> writes:

> On 20 November 2013 21:19, Kevin Hilman <email address hidden> wrote:
>> viresh kumar <email address hidden> writes:
>>
>>> One more thing, when I run 'stress' with --cpu 1, I can see two threads
>>> created for stress on my CPU. Why so?
>>
>> Because of the CPUset method you're using, I suspect the shell itself is
>> the second task you're seeing. If you can generated/send a trace, I
>> could tell you for sure.
>
> So, this is what I get normally on shell:
>
> root@linaro-developer:/home/linaro# stress -q --cpu 1 --timeout 2000 &
> [1] 21078
>
> root@linaro-developer:/home/linaro# ps
> PID TTY TIME CMD
> 1782 ttySAC2 00:00:00 login
> 1874 ttySAC2 00:00:00 bash
> 21078 ttySAC2 00:00:00 stress
> 21079 ttySAC2 00:00:01 stress
> 21080 ttySAC2 00:00:00 ps
> root@linaro-developer:/home/linaro#
>
> See, two tasks for stress ??

Sure, one of them is probably a parent waiting for a child. That
doesn't mean 2 threads are active at the same time.

Revision history for this message
viresh kumar (viresh.kumar) wrote :

On 2 December 2013 23:55, Kevin Hilman <email address hidden> wrote:
> Sure, one of them is probably a parent waiting for a child. That
> doesn't mean 2 threads are active at the same time.

I was sure that only one thread was running and so the CPU was isolated, but
I wasn't sure how 'stress' actually works, i.e. that the parent process
starts a child one.. Thanks.
