Trusty + 3.19 (lts-vivid) PERF wrong cpu-migration counter

Bug #1575407 reported by Rafael David Tinoco
This bug affects 1 person
Affects: linux (Ubuntu)
Status: Fix Released
Importance: Low
Assigned to: Rafael David Tinoco
Nominated for Vivid by Rafael David Tinoco
Nominated for Wily by Rafael David Tinoco
Nominated for Xenial by Rafael David Tinoco
Nominated for Yakkety by Rafael David Tinoco

Bug Description

It was brought to my attention that:

On a PowerPC-based server, perf reports cpu-migrations even though only a single CPU is online.

## perf

Performance counter stats for 'CPU(s) 0':

      15027.888988 task-clock (msec)        #    1.000 CPUs utilized [100.00%]
            25,206 context-switches         #    0.002 M/sec [100.00%]
             3,518 cpu-migrations           #    0.234 K/sec [100.00%]
               639 page-faults              #    0.043 K/sec
    41,545,780,384 cycles                   #    2.765 GHz [66.68%]
     2,868,753,319 stalled-cycles-frontend  #    6.91% frontend cycles idle [50.01%]
    30,162,193,535 stalled-cycles-backend   #   72.60% backend cycles idle [50.01%]
    11,161,722,533 instructions             #    0.27 insns per cycle
                                            #    2.70 stalled cycles per insn [66.68%]
     1,544,072,679 branches                 #  102.747 M/sec [49.99%]
        52,536,867 branch-misses            #    3.40% of all branches [49.99%]

15.027768835 seconds time elapsed

## lscpu

Architecture: ppc64le
Byte Order: Little Endian
CPU(s): 128
On-line CPU(s) list: 0
Off-line CPU(s) list: 1-127
Thread(s) per core: 1
Core(s) per socket: 1
Socket(s): 1
NUMA node(s): 2
Model: 8335-GCA
L1d cache: 64K
L1i cache: 32K
L2 cache: 512K
L3 cache: 8192K
NUMA node0 CPU(s): 0
NUMA node8 CPU(s):

So either task migrations are being performed to offline CPUs, or perf is accounting them incorrectly.

Revision history for this message
Rafael David Tinoco (rafaeldtinoco) wrote :

DMESG shows:

[255688.556945] process 1 (init) no longer affine to cpu1
[266710.938490] process 1 (init) no longer affine to cpu1
[275071.280189] process 1 (init) no longer affine to cpu1
[286088.372647] process 1 (init) no longer affine to cpu1
[355886.470777] process 1 (init) no longer affine to cpu1
[358415.046246] process 1 (init) no longer affine to cpu1

Changed in linux (Ubuntu):
status: New → Confirmed
importance: Undecided → Low
assignee: nobody → Rafael David Tinoco (inaddy)
Revision history for this message
Rafael David Tinoco (rafaeldtinoco) wrote :

I have traced all execution paths that could plausibly increment PERF_COUNT_SW_CPU_MIGRATIONS
(or, equivalently, ((struct task_struct *)p)->se.nr_migrations++), which would show us that the task
migrated between different CPUs.
____________________________________________________________

(Execution Path 1 - Kernel swaps 2 tasks between 2 different CPUs)

[execution interrupted by a page fault]
handle_pte_fault ->
numa_migrate_preferred ->
task_numa_migrate ->
migrate_swap ->
stop_two_cpus ... (scheduling task migration state machine)

<asynchronous handler on cpu IPIs>
[kernel migration thread handles the task switch]
[you have 1 migration thread per cpu running]
[new cpu comes from numa balance/smp logic]

migrate_swap_stop ->
migrate_swap_task -> (kernel swaps 2 tasks from different cpus)
set_task_cpu ->
**** update to perf counter ****
[process scheduled in a new cpu]

____________________________________________________________

(Execution Path 2 - Similar to 1, but instead of swapping it moves the task)

[execution interrupted by a page fault]
handle_pte_fault ->
migrate_task_to ->
stop_one_cpu ...(scheduling task submission to another cpu)

<asynchronous handler on cpu IPIs>
[kernel migration thread handles the task switch]
[you have 1 migration thread per cpu running]
[new cpu comes from numa balance/smp logic]

migration_cpu_stop ->
migrate_task -> (move task from one cpu to another)
move_queued_task ->
set_task_cpu ->
**** update to perf counter ****
[process scheduled in a new cpu]
____________________________________________________________

(Execution Path 3 - New executions)

[fork / exec]
sched_exec ->
stop_one_cpu... (scheduling task submission to another cpu)

<asynchronous handler on cpu IPIs>
[kernel migration thread handles the task switch]
[you have 1 migration thread per cpu running]
[new cpu comes from scheduler_class - fair/deadline/rt - select_task_rq logic]
[new cpu can also come from select_fallback_rq] --> fallback might not take the cpumask into account

migration_cpu_stop ->
migrate_task -> (move task from one cpu to another)
move_queued_task ->
set_task_cpu ->
**** update to perf counter ****
[process scheduled in a new cpu]

____________________________________________________________

(Execution Path 4 - Regular Scheduling)

[wake up process]
[wake up state]
try_to_wake_up ->
[new cpu comes from scheduler_class - fair/deadline/rt - select_task_rq logic]
[new cpu can also come from select_fallback_rq] --> fallback might not take the cpumask into account
select_task_rq ->
set_task_cpu
**** update to perf counter ****
[process scheduled in a new cpu]

****** note for execution paths 3 & 4 ********
-> select_fallback_rq is responsible for the messages:

[255688.556945] process 1 (init) no longer affine to cpu1
[266710.938490] process 1 (init) no longer affine to cpu1
[275071.280189] process 1 (init) no longer affine to cpu1
[286088.372647] process 1 (init) no longer affine to cpu1
[355886.470777] process 1 (init) no longer affine to cpu1
[358415.046246] process 1 (init) no longer affine to cpu1

from the dmesg.
This shows us that the fallback mechanism for picking the CPU run queue was used.
Fallback mechanis...


Revision history for this message
Rafael David Tinoco (rafaeldtinoco) wrote :

This SystemTap script prints the execution path whenever set_task_cpu() is reached in kernel 3.19.
set_task_cpu() is where the perf software cpu-migration counter is updated.

#!/usr/bin/stap

probe kernel.function("set_task_cpu").call
{
    // running process
    // pr_name = execname()
    // pr_pid = pid()

    // task to be migrated
    _pid = task_pid($p)
    _name = task_execname($p)
    _migr = $p->se->nr_migrations
    _ocpu = task_cpu($p)
    _ncpu = $new_cpu

    if (_ocpu != _ncpu) {
        printf("task %d (%s) (migr=%d) from (cpu=%d) to (cpu=%d)\n", _pid, _name, _migr, _ocpu, _ncpu)
        print_backtrace()
    }
}

Revision history for this message
Rafael David Tinoco (rafaeldtinoco) wrote :

## VIVID

from kernel/sched/core.c

1064 #ifdef CONFIG_SMP
1065 void set_task_cpu(struct task_struct *p, unsigned int new_cpu)
1066 {
...
1093     if (task_cpu(p) != new_cpu) {
1094         struct task_migration_notifier tmn;
1095
1096         if (p->sched_class->migrate_task_rq)
1097             p->sched_class->migrate_task_rq(p, new_cpu);
1098         p->se.nr_migrations++;
1099         perf_sw_event(PERF_COUNT_SW_CPU_MIGRATIONS, 1, NULL, 0);
1100
1101         tmn.task = p;
1102         tmn.from_cpu = task_cpu(p);
1103         tmn.to_cpu = new_cpu;
1104
1105         atomic_notifier_call_chain(&task_migration_notifier, 0, &tmn);
1106     }
1107
1108     __set_task_cpu(p, new_cpu);
1109 }

## WILY

from include/linux/perf_event.h:

836 static inline void perf_event_task_sched_in(struct task_struct *prev,
837                                             struct task_struct *task)
838 {
839     if (static_key_false(&perf_sched_events.key))
840         __perf_event_task_sched_in(prev, task);
841
842     if (perf_sw_migrate_enabled() && task->sched_migrated) {
843         struct pt_regs *regs = this_cpu_ptr(&__perf_regs[0]);
844
845         perf_fetch_caller_regs(regs);
846         ___perf_sw_event(PERF_COUNT_SW_CPU_MIGRATIONS, 1, regs, 0);
847         task->sched_migrated = 0;
848     }
849 }

----

Checking how recent kernels increment PERF_COUNT_SW_CPU_MIGRATIONS, I noticed a difference from Vivid. In Vivid, PERF_COUNT_SW_CPU_MIGRATIONS was incremented directly from set_task_cpu() (which is why we asked for tracing of that function), but a later commit changed that behavior, noting that software migrate events were being accounted incorrectly.

Instead of updating the perf software counter right inside set_task_cpu(), the kernel now marks the task as "migrated" (via a flag in task_struct) and, later, when context_switch() calls finish_task_switch(), increments the counter if the task was marked as migrated.

This change fixes two issues: 1) the event used to be counted before the migration had effectively completed, since the task had been moved but not yet scheduled; 2) migrations initiated from softirq context used to be accounted to whatever process happened to be interrupted (possibly as migrations that never happened to it).

Commit:

commit ff303e66c240ba6269e31817a386995440a18c99
Author: Peter Zijlstra <email address hidden>
Date: Fri Apr 17 20:05:30 2015 +0200

perf: Fix software migrate events

Stephane asked about PERF_COUNT_SW_CPU_MIGRATIONS and I realized it
was borken:

> The problem is that the task isn't actually scheduled while its being
> migrated (obviously), and if its not scheduled, the counters aren't
> scheduled either, so there's no observing of the fact.
>
> A further problem with migrations is that many migrations happen from
> softirq context, which is nested inside the 'random' task context of
> whoemever happens to run at that time, similarly for the wakeup
> migrations triggered from (soft)irq context. All those end up being
> accounted in the task that's currently running, eg. your 'ls'.

The below cures this by marking a task as migrated and accounting it
on the subsequent sched_in().

Signed-off-by: Peter Zijlstra (Intel) <email address hidden>

It first appeared in v4.2-rc1.

For now, packages:

linux-image-4.2.0-36-generic
linux-image-extra-4.2.0-36-generic

Will probably mitigat...


Revision history for this message
Rafael David Tinoco (rafaeldtinoco) wrote :

Traces from the buggy execution were brought to my attention.

From the perf statistics:

    201054.649876 task-clock (msec) # 0.999 CPUs utilized
             9,547 context-switches # 0.047 K/sec
                 6 cpu-migrations # 0.000 K/sec
             6,427 page-faults # 0.032 K/sec
   777,038,506,852 cycles # 3.865 GHz [66.66%]
     1,799,766,809 stalled-cycles-frontend # 0.23% frontend cycles idle [50.00%]
   518,104,985,732 stalled-cycles-backend # 66.68% backend cycles idle [50.01%]
   819,194,804,419 instructions # 1.05 insns per cycle
                                                  # 0.63 stalled cycles per insn [66.67%]
    90,811,473,192 branches # 451.676 M/sec [50.00%]
     2,203,470,400 branch-misses # 2.43% of all branches [49.99%]

     201.248973668 seconds time elapsed

**** We had 6 cpu-migrations.

Tracing the execution shows exactly six set_task_cpu() calls coming from bottom halves (from IRQ handlers):

# coming from interrupt handler (1)

    blackscholes-15443 [000] 5599.360541: function: set_task_cpu
    blackscholes-15443 [000] 5599.360543: kernel_stack: <stack trace>
=> set_task_cpu (c0000000000eaa48)
=> try_to_wake_up (c0000000000ee998)
=> process_timeout (c00000000013f2e8)
=> call_timer_fn (c00000000013f154)
=> run_timer_softirq (c00000000013fd9c)
=> __do_softirq (c0000000000b5854)
=> irq_exit (c0000000000b5e28)
=> timer_interrupt (c00000000001fb94)
=> decrementer_common (c000000000002758)

 # coming from interrupt handler (2)

    blackscholes-15443 [000] 5600.800540: function: set_task_cpu
    blackscholes-15443 [000] 5600.800542: kernel_stack: <stack trace>
=> set_task_cpu (c0000000000eaa48)
=> try_to_wake_up (c0000000000ee998)
=> commit_timeout (c0000000003d52f8)
=> call_timer_fn (c00000000013f154)
=> run_timer_softirq (c00000000013fd9c)
=> __do_softirq (c0000000000b5854)
=> irq_exit (c0000000000b5e28)
=> timer_interrupt (c00000000001fb94)
=> decrementer_common (c000000000002758)

# coming from interrupt handler (3)

    blackscholes-15443 [000] 5603.840601: function: set_task_cpu
    blackscholes-15443 [000] 5603.840603: kernel_stack: <stack trace>
=> set_task_cpu (c0000000000eaa48)
=> try_to_wake_up (c0000000000ee998)
=> hrtimer_wakeup (c000000000141a2c)
=> __run_hrtimer (c000000000141798)
=> hrtimer_interrupt (c0000000001427f8)
=> __timer_interrupt (c00000000001f480)
=> timer_interrupt (c00000000001fb90)
=> decrementer_common (c000000000002758)

# coming from interrupt handler (4)

    blackscholes-15444 [000] 5629.340575: function: set_task_cpu
    blackscholes-15444 [000] 5629.340577: kernel_stack: <stack trace>
=> set_task_cpu (c0000000000eaa48)
=> try_to_wake_up (c0000000000ee998)
=> wake_up_worker (c0000000000cf004)
=> __queue_work (c0000000000d1a0c)
=> queue_delayed_work_on (c0000000000d2244)
=...


Revision history for this message
Rafael David Tinoco (rafaeldtinoco) wrote :

I have asked the customer for feedback on whether the kernel in the PPA solves the issue, so I can move on to the SRU proposal.

Changed in linux (Ubuntu):
status: Confirmed → In Progress
Changed in linux (Ubuntu):
status: In Progress → Fix Released