Kernel network namespace performance regression during rcu development on kernels above 3.8

Bug #1328088 reported by Rafael David Tinoco on 2014-06-09
This bug affects 4 people
Affects         Status  Importance  Assigned to          Milestone
linux (Ubuntu)          Undecided   Rafael David Tinoco
Trusty                  Undecided   Rafael David Tinoco
Utopic                  Undecided   Rafael David Tinoco

Bug Description

SRU Justification:

Impact: network namespace creation has had a performance regression since v3.5.
Fix: my analysis, LKML discussion, upstream patch
Testcase:

 http://people.canonical.com/~inaddy/lp1328088/make_fake_routers.sh
 http://people.canonical.com/~inaddy/lp1328088/parse.py
 http://people.canonical.com/~inaddy/lp1328088/charts/250.html
 http://people.canonical.com/~inaddy/lp1328088/charts/250-tag.html

 Run make_fake_routers.sh 4000 and use parse.py to check whether
 "fake routers" are being created at a good rate per second (you can
 compare the result with the generated charts).
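
The rate check can be sketched as below. This is not the parse.py linked above; it is a minimal illustration that assumes the test logs one timestamp (in seconds) per created "fake router". The function name and the input format are assumptions.

```python
from collections import Counter


def routers_per_second(timestamps):
    """Count how many routers were created in each whole second.

    `timestamps` is a hypothetical input: one timestamp in seconds per
    created fake router, in creation order.
    """
    if not timestamps:
        return []
    start = timestamps[0]
    # Bucket every creation by the whole second elapsed since the first one.
    counts = Counter(int(t - start) for t in timestamps)
    last = max(counts)
    # Seconds in which nothing was created are reported as 0.
    return [counts.get(second, 0) for second in range(last + 1)]


if __name__ == "__main__":
    sample = [0.0, 0.2, 0.4, 0.9, 1.1, 1.6, 3.0]
    print(routers_per_second(sample))
```

A falling sequence of counts as more routers exist is the kind of slowdown the charts show.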

----------------------------

Original Description:

Please follow this at http://people.canonical.com/~inaddy/lp1328088/, which carries the same description and is updated daily.

--
It was brought to my attention that network namespace creation scalability regressed during kernel development.

The following scripts were used for all tests and chart generation:

http://people.canonical.com/~inaddy/lp1328088/make_fake_routers.sh
http://people.canonical.com/~inaddy/lp1328088/parse.py

I measured how many "fake routers" (created by the script above) could be added per second as the number of existing routers grew from 0 to 4000. Using this script and a git bisect on the kernel tree, I was led to one specific commit that causes the regression: #911af50 "rcu: Provide compile-time control for no-CBs CPUs". Even though this change was experimental at that point, it introduced a performance/scalability regression (explained below) that still persists, and the option involved seems to be the default for distributions nowadays.

RCU-related code appeared to be responsible for the problem. Therefore, every commit in v3.8..master that changed any of these files was tested: "kernel/rcutree.c kernel/rcutree.h kernel/rcutree_plugin.h include/trace/events/rcu.h include/linux/rcupdate.h". The idea was to track the performance regression across rcu development. In the worst case, if the regression turned out not to be related to rcu, I would still have data for interpreting the performance/scalability regression.

All text below refers to 2 groups of charts generated during the study:

1) Kernel git tags from 3.8 to 3.14.
http://people.canonical.com/~inaddy/lp1328088/charts/250-tag.html

2) Kernel git commits for rcu development (111 commits).
http://people.canonical.com/~inaddy/lp1328088/charts/250.html

Since results differed depending on the number of cpus and on how the no-cb cpus were configured, 3 kernel config options were used for every measurement:

- CONFIG_RCU_NOCB_CPU (disabled): nocbno
- CONFIG_RCU_NOCB_CPU_ALL (enabled): nocball
- CONFIG_RCU_NOCB_CPU_NONE (enabled): nocbnone

Obs: in the 1 cpu case, nocbno, nocbnone and nocball behave the same, since with only 1 cpu there is no no-cb cpu.
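
To tell which of the three variants a given kernel was built with, the config file can be inspected. The sketch below is an illustration only: the function name is made up, and it assumes the usual .config syntax where an enabled option appears as "CONFIG_FOO=y" and a disabled one as "# CONFIG_FOO is not set" (or is absent).

```python
def classify_nocb(config_text):
    """Map kernel .config text to the labels used in this report."""
    enabled = set()
    for line in config_text.splitlines():
        line = line.strip()
        # Only built-in ("=y") options count as enabled here.
        if line.startswith("CONFIG_") and line.endswith("=y"):
            enabled.add(line[:-2])
    if "CONFIG_RCU_NOCB_CPU" not in enabled:
        return "nocbno"
    if "CONFIG_RCU_NOCB_CPU_ALL" in enabled:
        return "nocball"
    if "CONFIG_RCU_NOCB_CPU_NONE" in enabled:
        return "nocbnone"
    # Any other no-CBs choice is outside the three cases measured here.
    return "unknown"


if __name__ == "__main__":
    sample = "CONFIG_RCU_NOCB_CPU=y\nCONFIG_RCU_NOCB_CPU_ALL=y\n"
    print(classify_nocb(sample))
```

On a running system the text would typically come from a file such as /boot/config-$(uname -r), which is an assumption about the distribution layout.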

Once the charts were generated, it was clear that NOCB_CPU_ALL (4 cpus) hurt the performance of "fake router" creation and that this regression persists up to the current upstream version. It was also clear that, after commit #911af50, having more than 1 cpu does not improve netns performance/scalability; it makes it worse.

#911af50
...
+#ifdef CONFIG_RCU_NOCB_CPU_ALL
+ pr_info("\tExperimental no-CBs for all CPUs\n");
+ cpumask_setall(rcu_nocb_mask);
+#endif /* #ifdef CONFIG_RCU_NOCB_CPU_ALL */
...

Comparing the points that stand out (see charts):

#81e5949 - good
#911af50 - bad

I was able to see that the following lines of the script above have a major impact on netns scalability/performance:

1) ip netns add -> huge performance regression:
    1 cpu: no regression
    4 cpu: regression for NOCB_CPU_ALL
    obs: regression from 250 netns/sec to 50 netns/sec
         at the mark of 500 netns already created

2) ip netns exec -> some performance regression
    1 cpu: no regression
    4 cpu: regression for NOCB_CPU_ALL
    obs: regression from 40 netns/sec (+1 exec per netns
         creation) to 20 netns/sec at the mark of 500
         netns already created
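
To put the two observations in proportion, the drop can be expressed as a slowdown factor and as a percentage. The sketch below only does that arithmetic on the figures quoted above; the function name is made up for illustration.

```python
def slowdown(before, after):
    """Return (factor, percent_drop) for a rate that fell from before to after."""
    factor = before / after
    percent_drop = (before - after) / before * 100
    return factor, percent_drop


if __name__ == "__main__":
    # Rates in netns/sec at the 500-netns mark, as quoted in this report.
    for name, before, after in [("ip netns add", 250, 50),
                                ("ip netns exec", 40, 20)]:
        factor, drop = slowdown(before, after)
        print(f"{name}: {factor:.1f}x slower, {drop:.0f}% fewer netns/sec")
```

With the quoted figures, "ip netns add" comes out 5x slower and "ip netns exec" 2x slower.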

# Assumption (to be confirmed)

rcu callbacks being offloaded to other cpus caused the regression in copy_net_ns<-create_new_namespaces or unshare(CLONE_NEWNET).

summary: - Kernel network namespace performance regression during kernel
- development
+ Kernel network namespace performance regression during rcu development
+ on kernels above 3.8
description: updated
Changed in linux (Ubuntu):
assignee: nobody → Rafael David Tinoco (inaddy)
status: New → In Progress
affects: linux (Ubuntu) → linux
description: updated
description: updated
tags: added: bisect-done
Chris J Arges (arges) wrote :

Related upstream discussion:
https://lkml.org/lkml/2014/6/11/42

Upstream suggestions/observations were accepted and the code was changed:

http://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/commit/?id=728dba3a39c66b3d8ac889ddbe38b5b1c264aec3

These changes have already been tested and fix the performance regression.

I will suggest an SRU to our kernel team soon.

Thank you

-Rafael

description: updated
Tim Gardner (timg-tpi) on 2014-08-22
Changed in linux (Ubuntu):
status: New → Fix Committed
Changed in linux (Ubuntu Trusty):
assignee: nobody → Rafael David Tinoco (inaddy)
status: New → Fix Committed
Launchpad Janitor (janitor) wrote :

This bug was fixed in the package linux - 3.16.0-11.16

---------------
linux (3.16.0-11.16) utopic; urgency=low

  [ Mauricio Faria de Oliveira ]

  * [Config] Switch kernel to vmlinuz (from vmlinux) on ppc64el
    - LP: #1358920

  [ Peter Zijlstra ]

  * SAUCE: (no-up) mmu_notifier: add call_srcu and sync function for listener to delay call and sync
    - LP: #1361300

  [ Tim Gardner ]

  * [Config] CONFIG_ZPOOL=y
    - LP: #1360428
  * Release Tracking Bug
    - LP: #1361308

  [ Upstream Kernel Changes ]

  * Revert "net/mlx4_en: Fix bad use of dev_id"
    - LP: #1347012
  * net/mlx4_en: Reduce memory consumption on kdump kernel
    - LP: #1347012
  * net/mlx4_en: Fix mac_hash database inconsistency
    - LP: #1347012
  * net/mlx4_en: Disable blueflame using ethtool private flags
    - LP: #1347012
  * net/mlx4_en: current_mac isn't updated in port up
    - LP: #1347012
  * net/mlx4_core: Use low memory profile on kdump kernel
    - LP: #1347012
  * Drivers: scsi: storvsc: Change the limits to reflect the values on the host
    - LP: #1347169
  * Drivers: scsi: storvsc: Set cmd_per_lun to reflect value supported by the Host
    - LP: #1347169
  * Drivers: scsi: storvsc: Filter commands based on the storage protocol version
    - LP: #1347169
  * Drivers: scsi: storvsc: Fix a bug in handling VMBUS protocol version
    - LP: #1347169
  * Drivers: scsi: storvsc: Implement a eh_timed_out handler
    - LP: #1347169
  * drivers: scsi: storvsc: Set srb_flags in all cases
    - LP: #1347169
  * drivers: scsi: storvsc: Correctly handle TEST_UNIT_READY failure
    - LP: #1347169
  * namespaces: Use task_lock and not rcu to protect nsproxy
    - LP: #1328088
  * net: xgene: Check negative return value of xgene_enet_get_ring_size()
  * mm/zbud: change zbud_alloc size type to size_t
    - LP: #1360428
  * mm/zpool: implement common zpool api to zbud/zsmalloc
    - LP: #1360428
  * mm/zpool: zbud/zsmalloc implement zpool
    - LP: #1360428
  * mm/zpool: update zswap to use zpool
    - LP: #1360428
  * ideapad-laptop: Change Lenovo Yoga 2 series rfkill handling
    - LP: #1341296
  * iommu/amd: Fix for pasid initialization
    - LP: #1361300
  * iommu/amd: Moving PPR fault flags macros definitions
    - LP: #1361300
  * iommu/amd: Drop oprofile dependency
    - LP: #1361300
  * iommu/amd: Fix typo in amd_iommu_v2 driver
    - LP: #1361300
  * iommu/amd: Don't call mmu_notifer_unregister in __unbind_pasid
    - LP: #1361300
  * iommu/amd: Don't free pasid_state in mn_release path
    - LP: #1361300
  * iommu/amd: Get rid of __unbind_pasid
    - LP: #1361300
  * iommu/amd: Drop pasid_state reference in ppr_notifer error path
    - LP: #1361300
  * iommu/amd: Add pasid_state->invalid flag
    - LP: #1361300
  * iommu/amd: Don't hold a reference to mm_struct
    - LP: #1361300
  * iommu/amd: Don't hold a reference to task_struct
    - LP: #1361300
  * iommu/amd: Don't call the inv_ctx_cb when pasid is not set up
    - LP: #1361300
  * iommu/amd: Don't set pasid_state->mm to NULL in unbind_pasid
    - LP: #1361300
  * iommu/amd: Remove change_pte mmu_notifier call-back
    - LP: #1361300
  * iommu/amd: Fix device_state reference counting
    - LP: #1361300...

Changed in linux (Ubuntu Utopic):
status: Fix Committed → Fix Released
Brad Figg (brad-figg) wrote :

This bug is awaiting verification that the kernel in -proposed solves the problem. Please test the kernel and update this bug with the results. If the problem is solved, change the tag 'verification-needed-trusty' to 'verification-done-trusty'.

If verification is not done by 5 working days from today, this fix will be dropped from the source code, and this bug will be closed.

See https://wiki.ubuntu.com/Testing/EnableProposed for documentation on how to enable and use -proposed. Thank you!

tags: added: verification-needed-trusty
tags: added: verification-done
removed: verification-needed-trusty
tags: added: verification-done-trusty
removed: verification-done
Launchpad Janitor (janitor) wrote :

This bug was fixed in the package linux - 3.13.0-36.63

---------------
linux (3.13.0-36.63) trusty; urgency=low

  [ Joseph Salisbury ]

  * Release Tracking Bug
    - LP: #1365052

  [ Feng Kan ]

  * SAUCE: (no-up) irqchip:gic: change access of gicc_ctrl register to read
    modify write.
    - LP: #1357527
  * SAUCE: (no-up) arm64: optimized copy_to_user and copy_from_user
    assembly code
    - LP: #1358949

  [ Ming Lei ]

  * SAUCE: (no-up) Drop APM X-Gene SoC Ethernet driver
    - LP: #1360140
  * [Config] Drop XGENE entries
    - LP: #1360140
  * [Config] CONFIG_NET_XGENE=m for arm64
    - LP: #1360140

  [ Stefan Bader ]

  * SAUCE: Add compat macro for skb_get_hash
    - LP: #1358162
  * SAUCE: bcache: prevent crash on changing writeback_running
    - LP: #1357295

  [ Suman Tripathi ]

  * SAUCE: (no-up) arm64: Fix the csr-mask for APM X-Gene SoC AHCI SATA PHY
    clock DTS node.
    - LP: #1359489
  * SAUCE: (no-up) ahci_xgene: Skip the PHY and clock initialization if
    already configured by the firmware.
    - LP: #1359501
  * SAUCE: (no-up) ahci_xgene: Fix the link down in first attempt for the
    APM X-Gene SoC AHCI SATA host controller driver.
    - LP: #1359507

  [ Tuan Phan ]

  * SAUCE: (no-up) pci-xgene-msi: fixed deadlock in irq_set_affinity
    - LP: #1359514

  [ Upstream Kernel Changes ]

  * iwlwifi: mvm: Add a missed beacons threshold
    - LP: #1349572
  * mac80211: reset probe_send_count also in HW_CONNECTION_MONITOR case
    - LP: #1349572
  * genirq: Add an accessor for IRQ_PER_CPU flag
    - LP: #1357527
  * arm64: perf: add support for percpu pmu interrupt
    - LP: #1357527
  * cifs: sanity check length of data to send before sending
    - LP: #1283101
  * KVM: nVMX: Pass vmexit parameters to nested_vmx_vmexit
    - LP: #1329434
  * KVM: nVMX: Rework interception of IRQs and NMIs
    - LP: #1329434
  * KVM: vmx: disable APIC virtualization in nested guests
    - LP: #1329434
  * HID: Add transport-driver functions to the USB HID interface.
    - LP: #1353021
  * ahci_xgene: Removing NCQ support from the APM X-Gene SoC AHCI SATA Host
    Controller driver.
    - LP: #1358498
  * fold d_kill() and d_free()
    - LP: #1354234
  * fold try_prune_one_dentry()
    - LP: #1354234
  * new helper: dentry_free()
    - LP: #1354234
  * expand the call of dentry_lru_del() in dentry_kill()
    - LP: #1354234
  * dentry_kill(): don't try to remove from shrink list
    - LP: #1354234
  * don't remove from shrink list in select_collect()
    - LP: #1354234
  * more graceful recovery in umount_collect()
    - LP: #1354234
  * dcache: don't need rcu in shrink_dentry_list()
    - LP: #1354234
  * lift the "already marked killed" case into shrink_dentry_list()
  * split dentry_kill()
    - LP: #1354234
  * expand dentry_kill(dentry, 0) in shrink_dentry_list()
    - LP: #1354234
  * shrink_dentry_list(): take parent's ->d_lock earlier
    - LP: #1354234
  * dealing with the rest of shrink_dentry_list() livelock
    - LP: #1354234
  * dentry_kill() doesn't need the second argument now
    - LP: #1354234
  * dcache: add missing lockdep annotation
    - LP: #1354234
  * fs: convert use of typedef ctl_table to struct ctl_table
 ...

Changed in linux (Ubuntu Trusty):
status: Fix Committed → Fix Released
tags: added: cts
Changed in linux:
assignee: Rafael David Tinoco (inaddy) → nobody
Changed in linux (Ubuntu Trusty):
assignee: Rafael David Tinoco (inaddy) → nobody
no longer affects: linux
Changed in linux (Ubuntu):
assignee: nobody → Rafael David Tinoco (inaddy)
Changed in linux (Ubuntu Trusty):
assignee: nobody → Rafael David Tinoco (inaddy)
Changed in linux (Ubuntu Utopic):
assignee: nobody → Rafael David Tinoco (inaddy)
Gnanasekar Velu (gnanasekarkas) wrote :

Does this bug affect kernel 4.4.0-67-generic on the Trusty release?

Dave Chiluk (chiluk) wrote :

@gnanasekarkas. Not that we know of.

Don Bowman (donbowman) wrote :

I wonder if this fix fell out or is somehow different now?

Running the script below on 4.11-rc7 gives:

1cpu: 0m1.379s
12cpu: 1m36.556s
72cpu: 2m20.118s

This is a *huge* impact for neutron L3 agent on my OpenStack system.

# cd /tmp
# ip netns add foo
# ip netns add bar
# for i in `seq 0 1000` ; do echo -e 'netns exec foo echo\nnetns exec bar echo' >> ipnetns.batch ; done
# time ip -b ipnetns.batch > /dev/null
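
To compare the three timings quoted in this comment, they can be converted to seconds and divided by the 1-cpu run. The sketch below is an illustration only: parse_time is a made-up helper, and it assumes the "XmY.YYYs" format that the shell's time builtin prints.

```python
import re


def parse_time(text):
    """Convert a shell `time` value such as '1m36.556s' to seconds."""
    match = re.fullmatch(r"(\d+)m(\d+(?:\.\d+)?)s", text)
    if match is None:
        raise ValueError(f"unrecognized time value: {text!r}")
    return int(match.group(1)) * 60 + float(match.group(2))


if __name__ == "__main__":
    # Timings as reported in this comment, keyed by cpu count.
    timings = {1: "0m1.379s", 12: "1m36.556s", 72: "2m20.118s"}
    baseline = parse_time(timings[1])
    for cpus, value in timings.items():
        seconds = parse_time(value)
        print(f"{cpus} cpu: {seconds:.3f}s, {seconds / baseline:.1f}x the 1-cpu run")
```

By this arithmetic the 12-cpu run takes roughly 70 times as long as the 1-cpu run, and the 72-cpu run roughly 100 times as long.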
