More migrations with constant load

Bug #1713576 reported by bugproxy
This bug affects 1 person
Affects                           Status        Importance  Assigned to            Milestone
The Ubuntu-power-systems project  Fix Released  High        Canonical Kernel Team
linux (Ubuntu)                    Fix Released  High        Joseph Salisbury
Zesty                             Fix Released  High        Joseph Salisbury

Bug Description

== SRU Justification ==
There is a significantly higher number of task migrations when the load is
fixed and not balanced across cores.

Benchmark results are posted in the bug description and in the commit's git log.

This bug is resolved by mainline commit 05b40e057734811ce452344fb3690d09965a7b6a, which is
in mainline as of 4.12-rc1.

== Fix ==
commit 05b40e057734811ce452344fb3690d09965a7b6a
Author: Srikar Dronamraju <email address hidden>
Date: Wed Mar 22 23:27:50 2017 +0530

    sched/fair: Prefer sibiling only if local group is under-utilized

== Regression Potential ==
Medium, since this commit touches the scheduler. However, the only change is that a
local group may now pull a task only if the source group has more tasks than
the local group.
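
The change described above can be illustrated with a toy model. This is only a sketch, not the kernel code: the function names are hypothetical, and the exact thresholds are an assumption (the old check is taken to be "source group runs more than one task", the new check "source group runs more than one task beyond the local group"). Consult the commit itself for the authoritative condition.

```python
# Toy model of the "prefer sibling" pull decision. Not kernel code.
# ASSUMPTION: thresholds below are illustrative; see commit 05b40e0577
# for the real condition in kernel/sched/fair.c.

def should_pull_old(local_has_capacity: bool, src_nr_running: int) -> bool:
    """Assumed pre-fix rule: pull whenever the local group has spare
    capacity and the source group runs more than one task."""
    return local_has_capacity and src_nr_running > 1


def should_pull_new(local_has_capacity: bool,
                    local_nr_running: int,
                    src_nr_running: int) -> bool:
    """Assumed post-fix rule: additionally require that the source group
    runs noticeably more tasks than the local group."""
    return local_has_capacity and src_nr_running > local_nr_running + 1


if __name__ == "__main__":
    # (local tasks, source tasks) pairs for two sibling groups.
    for local_nr, src_nr in [(0, 2), (1, 2), (2, 2), (1, 3)]:
        old = should_pull_old(True, src_nr)
        new = should_pull_new(True, local_nr, src_nr)
        print(f"local={local_nr} source={src_nr} old_pull={old} new_pull={new}")
```

In this model, a nearly balanced pair such as local=1, source=2 triggers a pull under the old rule but not under the new one, which is the kind of back-and-forth migration the bug describes.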

== Test Case ==
A test kernel was built with this patch and tested by the original bug reporter.
The bug reporter states the test kernel resolved the bug.

== Comment: #0 - PUVICHAKRAVARTHY RAMACHANDRAN - 2017-08-06 13:44:45 ==
---Problem Description---
Significantly higher number of task migrations when the load is fixed but not balanced across cores.

---uname output---
Linux isvbos3 4.10.0-29-generic #33~16.04.1-Ubuntu SMP Tue Jul 25 18:17:06 UTC 2017 ppc64le ppc64le ppc64le GNU/Linux

---Additional Hardware Info---
Power9 dd2.0

Machine Type = Power9

---Steps to Reproduce---
 Benchmark: multithreaded, CPU-intensive. The system had 2 sockets and 32 cores, running in SMT4 mode.

When 64 threads were run, there were few migrations over a 10 s interval.
When 80 threads were run, the number of migrations was very high.

Ideally, migrations should have been minimal, as the overall load was constant.

== Comment: #3 - SRIKAR DRONAMRAJU - 2017-08-11 06:56:47 ==
As suspected, commit 05b40e0577 ("sched/fair: Prefer sibiling only if local group is under-utilized")
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=05b40e0577
should fix the problem

Ran ' perf stat -a -r 5 -e sched:sched_migrate_task /home/srikar/work/ebizzy-0.3/ebizzy -t 35 -S 100'
to detect the problem and verify the fix

Here is perf stat without fix.

Performance counter stats for 'system wide' (5 runs):

             7,758 sched:sched_migrate_task ( +- 1.28% )

     100.015658079 seconds time elapsed ( +- 0.00% )

perf stat with fix.

Performance counter stats for 'system wide' (5 runs):

               415 sched:sched_migrate_task ( +- 11.74% )

     100.016021787 seconds time elapsed ( +- 0.00% )
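
To compare the two runs above, the migration counts can be pulled out of the perf stat text and turned into a relative reduction. The following is a minimal sketch; the helper names (migrate_count, reduction_percent) are hypothetical, and the regular expression assumes the line layout shown in the output above.

```python
import re

# Matches a perf stat line such as:
#   "             7,758 sched:sched_migrate_task ( +- 1.28% )"
# ASSUMPTION: the count is the first field and uses "," as thousands separator.
COUNT_RE = re.compile(r"^\s*([\d,]+)\s+sched:sched_migrate_task", re.MULTILINE)


def migrate_count(perf_output: str) -> int:
    """Return the sched:sched_migrate_task count from perf stat text."""
    match = COUNT_RE.search(perf_output)
    if match is None:
        raise ValueError("no sched:sched_migrate_task line found")
    return int(match.group(1).replace(",", ""))


def reduction_percent(before: int, after: int) -> float:
    """Relative drop from 'before' to 'after', in percent."""
    return 100.0 * (before - after) / before


if __name__ == "__main__":
    without_fix = "             7,758 sched:sched_migrate_task ( +- 1.28% )"
    with_fix = "               415 sched:sched_migrate_task ( +- 11.74% )"
    before = migrate_count(without_fix)
    after = migrate_count(with_fix)
    print(f"{before} -> {after}: "
          f"{reduction_percent(before, after):.1f}% fewer migrations")
```

The same helpers can be applied to any of the perf stat results quoted later in this report.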

git describe on the upstream kernel places the commit 227 commits after the v4.11-rc2 tag:
# git describe 05b40e0577
v4.11-rc2-227-g05b40e0

== Comment: #4 - SRIKAR DRONAMRAJU - 2017-08-11 07:05:37 ==
Attaching the patch that needs to be applied to fix this bug.
Verified that patch fixes the problem.

bugproxy (bugproxy) wrote : Migration count

Default Comment by Bridge

tags: added: architecture-ppc64le bugnameltc-157336 severity-high targetmilestone-inin16043
bugproxy (bugproxy) wrote : Migration count with 96 threads

Default Comment by Bridge

bugproxy (bugproxy) wrote : sched/fair: Prefer sibiling only if local group is under-utilized

Default Comment by Bridge

Changed in ubuntu:
assignee: nobody → Ubuntu on IBM Power Systems Bug Triage (ubuntu-power-triage)
affects: ubuntu → linux (Ubuntu)
Changed in ubuntu-power-systems:
importance: Undecided → High
assignee: nobody → Canonical Kernel Team (canonical-kernel-team)
Changed in linux (Ubuntu):
importance: Undecided → High
status: New → Triaged
tags: added: kernel-da-key
Joseph Salisbury (jsalisbury) wrote :

I built a Zesty test kernel with commit 05b40e0577. The test kernel can be downloaded from:

http://kernel.ubuntu.com/~jsalisbury/lp1713576

Can you test this kernel and see if it resolves this bug?

bugproxy (bugproxy) wrote : Comment bridged from LTC Bugzilla

------- Comment From <email address hidden> 2017-09-07 17:48 EDT-------
(In reply to comment #8)
> I built a Zesty test kernel with commit 05b40e0577. The test kernel can be
> downloaded from:
>
> http://kernel.ubuntu.com/~jsalisbury/lp1713576
>
> Can you test this kernel and see if it resolves this bug?

Thanks, but I am having trouble with the perf packages.. any idea what is wrong?

# uname -a
Linux ltc-boston114 4.10.0-33-generic #37~lp1713576 SMP Wed Sep 6 15:50:28 UTC 2017 ppc64le ppc64le ppc64le GNU/Linux

# dpkg -i /home/migration_test_build/linux-tools-common_4.10.0-33.37~lp1713576_all.deb
(Reading database ... 99780 files and directories currently installed.)
Preparing to unpack .../linux-tools-common_4.10.0-33.37~lp1713576_all.deb ...
Unpacking linux-tools-common (4.10.0-33.37~lp1713576) over (4.10.0-33.37~lp1713576) ...
Setting up linux-tools-common (4.10.0-33.37~lp1713576) ...
# dpkg -i /home/migration_test_build/linux-cloud-tools-common_4.10.0-33.37~lp1713576_all.deb
(Reading database ... 99780 files and directories currently installed.)
Preparing to unpack .../linux-cloud-tools-common_4.10.0-33.37~lp1713576_all.deb ...
Unpacking linux-cloud-tools-common (4.10.0-33.37~lp1713576) over (4.10.0-33.37~lp1713576) ...
Setting up linux-cloud-tools-common (4.10.0-33.37~lp1713576) ...

# perf list
The program 'perf' is currently not installed. You can install it by typing:
apt install linux-tools-common

# apt install linux-tools-common
Reading package lists... Done
Building dependency tree
Reading state information... Done
linux-tools-common is already the newest version (4.10.0-33.37~lp1713576).
0 upgraded, 0 newly installed, 0 to remove and 0 not upgraded.

---

Ebizzy does show a 42% improvement on average so that is probably enough, although I was hoping to validate the sched migration counts are lower.

Joseph Salisbury (jsalisbury) wrote :

Did you happen to install both the linux-image and linux-image-extra .deb packages?

bugproxy (bugproxy) wrote :

------- Comment From <email address hidden> 2017-09-08 15:59 EDT-------
Yes, I installed all 9 .debs. I just re-downloaded, installed, and booted, but still no luck.

Frank Heimes (fheimes)
Changed in ubuntu-power-systems:
status: New → Triaged
Joseph Salisbury (jsalisbury) wrote :

You may have to install the linux-tools-common with the stock kernel before installing the test kernel?

bugproxy (bugproxy) wrote :

------- Comment From <email address hidden> 2017-09-11 15:49 EDT-------
Thanks, I was able to make perf work by uninstalling the "linux-tools-common" test .deb and installing the stock version.

This test build resolves the problem by significantly reducing sched migrations in the underutilized case, which improves performance by 30-40% for this ebizzy testcase:

stock kernel (4.10.0-33-generic) :
6,010 sched:sched_migrate_task

test kernel (4.10.0-33-generic #37~lp1713576) :
676 sched:sched_migrate_task

Testcase was:
perf stat -a -r 5 -e sched:sched_migrate_task ./ebizzy -t 35 -S 100

where ebizzy was using 35 threads on a system w/ 128 logical cpus total

Andrew Cloke (andrew-cloke) wrote :

Thanks for the update. Can this issue now be closed?

bugproxy (bugproxy) wrote :

------- Comment From <email address hidden> 2017-09-12 10:08 EDT-------
(In reply to comment #14)
> Thanks for the update. Can this issue now be closed?

Should we wait until it is in an official build?

Joseph Salisbury (jsalisbury) wrote :

@jhopper, yes, I will submit an SRU request for commit 05b40e0577 to be included in 17.04.

Changed in linux (Ubuntu):
status: Triaged → In Progress
assignee: Ubuntu on IBM Power Systems Bug Triage (ubuntu-power-triage) → Joseph Salisbury (jsalisbury)
Changed in linux (Ubuntu Zesty):
status: New → In Progress
importance: Undecided → High
assignee: nobody → Joseph Salisbury (jsalisbury)
Frank Heimes (fheimes)
Changed in ubuntu-power-systems:
status: Triaged → In Progress
Manoj Iyer (manjo)
tags: added: triage-g
description: updated
Changed in linux (Ubuntu Zesty):
status: In Progress → Fix Committed
Kleber Sacilotto de Souza (kleber-souza) wrote :

This bug is awaiting verification that the kernel in -proposed solves the problem. Please test the kernel and update this bug with the results. If the problem is solved, change the tag 'verification-needed-zesty' to 'verification-done-zesty'. If the problem still exists, change the tag 'verification-needed-zesty' to 'verification-failed-zesty'.

If verification is not done by 5 working days from today, this fix will be dropped from the source code, and this bug will be closed.

See https://wiki.ubuntu.com/Testing/EnableProposed for documentation on how to enable and use -proposed. Thank you!

tags: added: verification-needed-zesty
Kleber Sacilotto de Souza (kleber-souza) wrote :

Hello IBM,

Could you please verify the fix with the zesty kernel currently in -proposed?

Thank you!

Launchpad Janitor (janitor) wrote :

This bug was fixed in the package linux - 4.10.0-38.42

---------------
linux (4.10.0-38.42) zesty; urgency=low

  * linux: 4.10.0-38.42 -proposed tracker (LP: #1722330)

  * Controller lockup detected on ProLiant DL380 Gen9 with P440 Controller
    (LP: #1720359)
    - scsi: hpsa: limit transfer length to 1MB

  * [Dell Docking IE][0bda:8153] Realtek USB Ethernet leads to system hang
    (LP: #1720977)
    - r8152: fix the list rx_done may be used without initialization

  * Touchpad not detected in Lenovo X1 Yoga / Yoga 720-15IKB (LP: #1700657)
    - mfd: intel-lpss: Add missing PCI ID for Intel Sunrise Point LPSS devices

  * Add installer support for Broadcom BCM573xx network drivers. (LP: #1720466)
    - d-i: Add bnxt_en to nic-modules.

  * CVE-2017-1000252
    - KVM: VMX: Do not BUG() on out-of-bounds guest IRQ

  * CVE-2017-10663
    - f2fs: sanity check checkpoint segno and blkoff

  * xfstest sanity checks on seek operations fails (LP: #1696049)
    - xfs: fix off-by-one on max nr_pages in xfs_find_get_desired_pgoff()

  * [P9, Power NV][ WSP][Ubuntu 16.04.03] : perf hw breakpoint command results
    in call traces and system goes for reboot. (LP: #1706033)
    - powerpc/64s: Handle data breakpoints in Radix mode

  * 5U84 - ses driver isn't binding right - cannot blink lights on 1 of the 2
    5u84 (LP: #1693369)
    - scsi: ses: do not add a device to an enclosure if enclosure_add_links()
      fails.

  * Vlun resize request could fail with cxlflash driver (LP: #1713575)
    - scsi: cxlflash: Fix vlun resize failure in the shrink path

  * More migrations with constant load (LP: #1713576)
    - sched/fair: Prefer sibiling only if local group is under-utilized

  * New PMU fixes for marked events. (LP: #1716491)
    - powerpc/perf: POWER9 PMU stops after idle workaround

  * CVE-2017-14340
    - xfs: XFS_IS_REALTIME_INODE() should be false if no rt device present

  * [Zesty][Yakkety] rtl8192e bug fixes (LP: #1698470)
    - staging: rtl8192e: rtl92e_fill_tx_desc fix write to mapped out memory.
    - staging: rtl8192e: fix 2 byte alignment of register BSSIDR.
    - staging: rtl8192e: rtl92e_get_eeprom_size Fix read size of EPROM_CMD.
    - staging: rtl8192e: GetTs Fix invalid TID 7 warning.

  * Stranded with ENODEV after mdadm --readonly (LP: #1706243)
    - md: MD_CLOSING needs to be cleared after called md_set_readonly or
      do_md_stop

  * multipath -ll is not showing the disks which are actually multipath
    (LP: #1718397)
    - fs: aio: fix the increment of aio-nr and counting against aio-max-nr

  * ETPS/2 Elantech Touchpad inconsistently detected (Gigabyte P57W laptop)
    (LP: #1594214)
    - Input: i8042 - add Gigabyte P57 to the keyboard reset table

  * CVE-2017-10911
    - xen-blkback: don't leak stack data via response ring

  * CVE-2017-11176
    - mqueue: fix a use-after-free in sys_mq_notify()

  * implement 'complain mode' in seccomp for developer mode with snaps
    (LP: #1567597)
    - Revert "UBUNTU: SAUCE: seccomp: log actions even when audit is disabled"
    - seccomp: Provide matching filter for introspection
    - seccomp: Sysctl to display available actions
    - seccomp: Operation for checking if an a...


Changed in linux (Ubuntu Zesty):
status: Fix Committed → Fix Released
bugproxy (bugproxy) wrote :

------- Comment From <email address hidden> 2017-10-30 16:55 EDT-------
Thanks, I verified the fix in -proposed dramatically reduces unnecessary migrations as expected and improves performance using an underutilized ebizzy testcase.

with previous kernel:
4.10.0-34-generic
5,640 sched:sched_migrate_task

with -proposed kernel:
4.10.0-38-generic
516 sched:sched_migrate_task

The testcase was run on a 32-core system:
# perf stat -a -r 5 -e sched:sched_migrate_task ./ebizzy -t 35 -S 100

tags: added: verification-done-zesty
removed: verification-needed-zesty
Manoj Iyer (manjo)
Changed in ubuntu-power-systems:
status: In Progress → Fix Committed
Manoj Iyer (manjo)
Changed in ubuntu-power-systems:
status: Fix Committed → Fix Released
Changed in linux (Ubuntu):
status: In Progress → Fix Released
Brad Figg (brad-figg)
tags: added: cscc