CPU isolation doesn't work in StarlingX 5.0

Bug #1952769 reported by Shrinidhi M
This bug affects 1 person
Affects: StarlingX
Status: In Progress
Importance: Undecided
Assigned to: Unassigned

Bug Description

_______________________________
From: Gaur, Shubham <Shubham.Gaur at commscope.com>
Sent: Wednesday, October 20, 2021 11:05 AM
To: starlingx-discuss at lists.starlingx.io <starlingx-discuss at lists.starlingx.io>
Subject: Re: CPU Isolation over AIO Controller Nodes

Hi All,

CPU isolation is not working in StarlingX 5.0. The static CPU manager policy has been enabled on all the nodes and the Isol-CPU resource plugin is up and running, but no isolated CPU resource pool is visible. I could not see any isolated CPU annotations (windriver.com/isolcpus). Are there any additional steps needed to get this working?

[sysadmin at controller-0 ~(keystone_admin)]$ system host-cpu-list controller-0 | grep Application-
| bb7e63db-aadd-4ac1-b85e-6d8784ca498b | 2 | 0 | 7 | 0 | Intel(R) Xeon(R) Silver 4216 CPU @ 2.10GHz | Application-isolated
| 94f9f44e-255b-4e48-83bf-bf1480a290f5 | 6 | 0 | 6 | 0 | Intel(R) Xeon(R) Silver 4216 CPU @ 2.10GHz | Application-isolated
| 51eb3110-a097-4b13-a6c3-7b40129c4859 | 10 | 0 | 5 | 0 | Intel(R) Xeon(R) Silver 4216 CPU @ 2.10GHz | Application-isolated
| 7073bc5a-7f43-41d8-bed4-6c0eef448c9e | 14 | 0 | 4 | 0 | Intel(R) Xeon(R) Silver 4216 CPU @ 2.10GHz | Application-isolated
| 609c1870-f182-439a-a0d2-6c56291c107d | 34 | 0 | 7 | 1 | Intel(R) Xeon(R) Silver 4216 CPU @ 2.10GHz | Application-isolated
| 47208322-4a80-416d-8fd8-9427f564345f | 38 | 0 | 6 | 1 | Intel(R) Xeon(R) Silver 4216 CPU @ 2.10GHz | Application-isolated
| 79da3786-aa34-4021-b5cc-be21c60c53ab | 42 | 0 | 5 | 1 | Intel(R) Xeon(R) Silver 4216 CPU @ 2.10GHz | Application-isolated
| 42d5a6d0-fb83-4d98-b884-698ba3e22818 | 46 | 0 | 4 | 1 | Intel(R) Xeon(R) Silver 4216 CPU @ 2.10GHz | Application-isolated
---------------------------------------------------------------------------------------------------------------------------------------
controller-0:~$ kubectl get node controller-0 -o yaml | grep -E " allocatable:" -A 15
  allocatable:
    cpu: "64"
    ephemeral-storage: "9391196145"
    hugepages-1Gi: "0"
    hugepages-2Mi: "0"
    memory: 106113496Ki
    pods: "110"
  capacity:
    cpu: "64"
    ephemeral-storage: 10190100Ki
    hugepages-1Gi: "0"
    hugepages-2Mi: "0"
    memory: 131303896Ki
    pods: "110"
====================================================================
         SYSTEM: edgecloud
====================================================================

controller-0:~$ cat /etc/build.info
###
### StarlingX
### Release 21.05
###

OS="centos"
SW_VERSION="21.05"
BUILD_TARGET="Host Installer"
BUILD_TYPE="Formal"
BUILD_ID="r/stx.5.0"

JOB="STX_5.0_build_layer_flock"
BUILD_BY="starlingx.build at cengn.ca"
BUILD_NUMBER="37"
BUILD_HOST="starlingx_mirror"
BUILD_DATE="2021-05-21 23:03:55 +0000"

FLOCK_OS="centos"
FLOCK_JOB="STX_5.0_build_layer_flock"
FLOCK_BUILD_BY="starlingx.build at cengn.ca"
FLOCK_BUILD_NUMBER="37"
FLOCK_BUILD_HOST="starlingx_mirror"
FLOCK_BUILD_DATE="2021-05-21 23:03:55 +0000"

DISTRO_OS="centos"
DISTRO_JOB="STX_5.0_build_layer_distro"
DISTRO_BUILD_BY="starlingx.build at cengn.ca"
DISTRO_BUILD_NUMBER="35"
DISTRO_BUILD_HOST="starlingx_mirror"
DISTRO_BUILD_DATE="2021-05-18 23:02:22 +0000"

COMPILER_OS="centos"
COMPILER_JOB="STX_5.0_build_layer_compiler"
COMPILER_BUILD_BY="starlingx.build at cengn.ca"
COMPILER_BUILD_NUMBER="35"
COMPILER_BUILD_HOST="starlingx_mirror"
COMPILER_BUILD_DATE="2021-05-14 19:53:00 +0000"

Regards,
Shubham Gaur

________________________________
From: Gaur, Shubham
Sent: Friday, October 8, 2021 2:07 PM
To: starlingx-discuss at lists.starlingx.io <starlingx-discuss at lists.starlingx.io>
Subject: CPU Isolation over AIO Controller Nodes

Does the CPU isolation feature work on AIO controller nodes in an edge distributed cloud environment?

Regards,
Shubham Gaur


Brief Description
-----------------
CPU isolation is not working in StarlingX 5.0. The static CPU manager policy has been enabled on all the nodes and the Isol-CPU resource plugin is up and running, but no isolated CPU resource pool is visible. No isolated CPU annotations (windriver.com/isolcpus) could be seen.

Severity
--------

Critical: System/Feature is not usable due to the defect

Steps to Reproduce
------------------
system host-cpu-list controller-0 | grep Application-
| bb7e63db-aadd-4ac1-b85e-6d8784ca498b | 2 | 0 | 7 | 0 | Intel(R) Xeon(R) Silver 4216 CPU @ 2.10GHz | Application-isolated
| 94f9f44e-255b-4e48-83bf-bf1480a290f5 | 6 | 0 | 6 | 0 | Intel(R) Xeon(R) Silver 4216 CPU @ 2.10GHz | Application-isolated
| 51eb3110-a097-4b13-a6c3-7b40129c4859 | 10 | 0 | 5 | 0 | Intel(R) Xeon(R) Silver 4216 CPU @ 2.10GHz | Application-isolated
| 7073bc5a-7f43-41d8-bed4-6c0eef448c9e | 14 | 0 | 4 | 0 | Intel(R) Xeon(R) Silver 4216 CPU @ 2.10GHz | Application-isolated
| 609c1870-f182-439a-a0d2-6c56291c107d | 34 | 0 | 7 | 1 | Intel(R) Xeon(R) Silver 4216 CPU @ 2.10GHz | Application-isolated
| 47208322-4a80-416d-8fd8-9427f564345f | 38 | 0 | 6 | 1 | Intel(R) Xeon(R) Silver 4216 CPU @ 2.10GHz | Application-isolated
| 79da3786-aa34-4021-b5cc-be21c60c53ab | 42 | 0 | 5 | 1 | Intel(R) Xeon(R) Silver 4216 CPU @ 2.10GHz | Application-isolated
| 42d5a6d0-fb83-4d98-b884-698ba3e22818 | 46 | 0 | 4 | 1 | Intel(R) Xeon(R) Silver 4216 CPU @ 2.10GHz | Application-isolated
---------------------------------------------------------------------------------------------------------------------------------------
controller-0:~$ kubectl get node controller-0 -o yaml | grep -E " allocatable:" -A 15
  allocatable:
    cpu: "64"
    ephemeral-storage: "9391196145"
    hugepages-1Gi: "0"
    hugepages-2Mi: "0"
    memory: 106113496Ki
    pods: "110"
  capacity:
    cpu: "64"
    ephemeral-storage: 10190100Ki
    hugepages-1Gi: "0"
    hugepages-2Mi: "0"
    memory: 131303896Ki
    pods: "110"

Expected Behavior
------------------
windriver.com/isolcpus resources should appear in the node annotations.

Actual Behavior
----------------
windriver.com/isolcpus resources do not appear in the node annotations.
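
A minimal sketch of the check behind this, assuming the isolated CPUs are expected to be advertised as a windriver.com/isolcpus entry among the node's allocatable resources. The dictionary is copied from the kubectl output in the steps above; the helper name isolated_cpu_count is hypothetical and not part of any StarlingX tool.

```python
# Allocatable resources copied from the
# `kubectl get node controller-0 -o yaml` output in Steps to Reproduce.
allocatable = {
    "cpu": "64",
    "ephemeral-storage": "9391196145",
    "hugepages-1Gi": "0",
    "hugepages-2Mi": "0",
    "memory": "106113496Ki",
    "pods": "110",
}

# Resource name taken from the bug description.
ISOLCPUS_RESOURCE = "windriver.com/isolcpus"


def isolated_cpu_count(resources):
    """Return the number of isolated CPUs a node advertises (0 if absent).

    Hypothetical helper for illustration only.
    """
    return int(resources.get(ISOLCPUS_RESOURCE, "0"))


# Logical CPU numbers of the rows reported as Application-isolated
# by `system host-cpu-list controller-0`.
application_isolated = [2, 6, 10, 14, 34, 38, 42, 46]

print("Application-isolated CPUs configured:", len(application_isolated))
print("Isolated CPUs advertised by the node:", isolated_cpu_count(allocatable))
```

With the values from this report, eight CPUs are configured as Application-isolated while the node advertises none, which is the mismatch described above.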

Reproducibility
---------------
The issue is 100% reproducible.

System Configuration
--------------------
AIO Duplex distributed cloud

Branch/Pull Time/Commit
-----------------------
StarlingX 5.0 (SW_VERSION 21.05), BUILD_ID r/stx.5.0, build number 37, build date 2021-05-21 23:03:55 +0000 (from /etc/build.info above).

Last Pass
---------
Did this test scenario pass previously? No.

Timestamp/Logs
--------------
Attach the logs for debugging (use attachments in Launchpad or for large collect files use: https://files.starlingx.kube.cengn.ca/)
Provide a snippet of logs here and the timestamp when issue was seen.
Please indicate the unique identifier in the logs to highlight the problem

Test Activity
-------------
[Sanity, Feature Testing, Regression Testing, Developer Testing, Evaluation, Other - Please specify]

Workaround
----------
Hi,

After a clean install, no isolated CPUs are listed in /sys/devices/system/cpu/isolated.

controller-0:~$ cat /proc/cmdline
BOOT_IMAGE=/vmlinuz-4.18.0-147.3.1.rt24.96.el8_1.tis.10.x86_64 root=UUID=da02d506-3eaa-410e-a5d0-6ed82611d77f ro security_profile=standard module_blacklist=integrity,ima audit=0 tboot=false crashkernel=auto biosdevname=0 console=tty0 iommu=pt usbcore.autosuspend=-1 selinux=0 enforcing=0 nmi_watchdog=0 softlockup_panic=0 softdog.soft_panic=1 intel_iommu=on user_namespace.enable=1 skew_tick=1 nopti nospectre_v2 nospectre_v1 hugepagesz=1G hugepages=20 hugepagesz=2M hugepages=0 default_hugepagesz=1G irqaffinity=0,2,4,6 rcu_nocbs=1,3,5,7-23 nohz_full=1,3,5,7-23 isolcpus=1,3,5,7-12,14 kthread_cpus=0,2,4,6
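
As a rough illustration of what to look for in that command line, the sketch below pulls out the CPU-related arguments and reports which ones differ from isolcpus. The function name kernel_args and the shortened CMDLINE excerpt are assumptions made for this example; on a live host the string would come from /proc/cmdline.

```python
def kernel_args(cmdline):
    """Split a kernel command line into a dict of key=value arguments.

    Hypothetical helper: flags without '=' are skipped, and a repeated
    key keeps only its last value.
    """
    args = {}
    for token in cmdline.split():
        key, sep, value = token.partition("=")
        if sep:
            args[key] = value
    return args


# CPU-related excerpt of the /proc/cmdline output shown above.
CMDLINE = (
    "irqaffinity=0,2,4,6 rcu_nocbs=1,3,5,7-23 nohz_full=1,3,5,7-23 "
    "isolcpus=1,3,5,7-12,14 kthread_cpus=0,2,4,6"
)

args = kernel_args(CMDLINE)

# Plain string comparison: enough here, although two equal CPU sets
# could in principle be written in different ways.
mismatched = sorted(
    name for name in ("rcu_nocbs", "nohz_full")
    if args.get(name) != args.get("isolcpus")
)

print("isolcpus:", args["isolcpus"])
print("arguments that differ from isolcpus:", mismatched)
```

On this host both rcu_nocbs and nohz_full cover more CPUs than isolcpus does.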

After further debugging, we found that keeping rcu_nocbs and nohz_full equal to isolcpus resolved this problem. Looking at the scripts, those parameters are calculated separately.

  * /usr/lib64/python2.7/site-packages/sysinv/puppet/platform.py

582,583c582,583
< rcu_nocbs_cpuset = host_cpuset - platform_cpuset
< rcu_nocbs_ranges = utils.format_range_set(rcu_nocbs_cpuset)
---
> # rcu_nocbs_cpuset = host_cpuset - platform_cpuset
> # rcu_nocbs_ranges = utils.format_range_set(rcu_nocbs_cpuset)
589a590,593
> # non-platform logical cpus
> rcu_nocbs_cpuset = host_cpuset - platform_cpuset
> #rcu_nocbs_ranges = utils.format_range_set(rcu_nocbs_cpuset)
> rcu_nocbs_ranges = utils.format_range_set(vswitch_cpuset.union(app_isolated_cpuset))

Is there any specific reason these parameters are calculated separately? Would there be any side effects if the above modification were introduced in platform.py?
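
To make the difference between the two calculations concrete, here is a small sketch using the CPU sets from the command line above. It assumes a 24-CPU host (0-23), a platform set of 0,2,4,6, and that the isolcpus value equals vswitch_cpuset.union(app_isolated_cpuset), with the vswitch set taken as empty because the report does not give the split. The parse_range_set and format_range_set functions below are local stand-ins written for this example, not the sysinv utils implementation.

```python
def parse_range_set(text):
    """Parse a kernel-style CPU list such as '1,3,5,7-12,14' into a set."""
    cpus = set()
    for part in text.split(","):
        if "-" in part:
            first, last = part.split("-")
            cpus.update(range(int(first), int(last) + 1))
        else:
            cpus.add(int(part))
    return cpus


def format_range_set(cpus):
    """Format a set of CPU numbers as a compact kernel-style list."""
    runs = []
    for cpu in sorted(cpus):
        if runs and cpu == runs[-1][-1] + 1:
            runs[-1].append(cpu)   # extend the current consecutive run
        else:
            runs.append([cpu])     # start a new run
    return ",".join(
        str(run[0]) if len(run) == 1 else "%d-%d" % (run[0], run[-1])
        for run in runs
    )


# Values assumed from the /proc/cmdline output above.
host_cpuset = set(range(24))
platform_cpuset = parse_range_set("0,2,4,6")
vswitch_cpuset = set()  # assumption: split not given in the report
app_isolated_cpuset = parse_range_set("1,3,5,7-12,14")

# Original calculation: every non-platform CPU.
original = format_range_set(host_cpuset - platform_cpuset)

# Modified calculation from the diff: only vswitch and isolated CPUs.
modified = format_range_set(vswitch_cpuset.union(app_isolated_cpuset))

# CPUs covered by the original value but not by the modified one.
dropped = format_range_set(
    (host_cpuset - platform_cpuset) - app_isolated_cpuset - vswitch_cpuset
)

print("original rcu_nocbs:", original)
print("modified rcu_nocbs:", modified)
print("CPUs no longer covered:", dropped)
```

Under these assumptions the modification narrows rcu_nocbs from 1,3,5,7-23 to 1,3,5,7-12,14, so CPUs 13 and 15-23 would no longer be covered by it.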

Thanks and regards,
Shubham

Ghada Khalil (gkhalil) wrote :

screening: This should be looked at by the containers subproject team

tags: added: stx.5.0 stx.containers
Changed in starlingx:
status: New → In Progress

M. Vefa Bicakci (vbicakci) wrote :

Hi Ghada,

I am not 100% sure, but I think that this issue could be a duplicate of the following one, which was fixed on the master branch at the time, but not on the older release branches:

  https://bugs.launchpad.net/starlingx/+bug/1925363
  https://review.opendev.org/c/starlingx/config/+/812711

I added a comment to the code review at https://review.opendev.org/c/starlingx/config/+/821006 prepared by a community member about this.

I did not check if the 4.18 kernel (StarlingX 5.0) and the 5.10 kernel (StarlingX master) are different with respect to the rcu_nocbs kernel argument's handling, but the bug report I linked (and its solution) involved aligning the nohz_full and isolcpus arguments only, and that was sufficient for the 5.10 kernel.

OpenStack Infra (hudson-openstack) wrote : Change abandoned on config (master)

Change abandoned by "Chris Friesen <email address hidden>" on branch: master
Review: https://review.opendev.org/c/starlingx/config/+/821006
Reason: doesn't appear to be needed
