High platform cpu usage on AIO-DX

Bug #1840976 reported by Frank Miller on 2019-08-21
Affects: StarlingX
Importance: Medium
Assigned to: Gerry Kopec

Bug Description

Brief Description
-----------------
This LP is opened to track high platform CPU usage (60-70%) on AIO-DX systems with stx-openstack applied. Several options have been identified to reduce the CPU usage.

LP https://bugs.launchpad.net/starlingx/+bug/1837426 reported very high CPU usage on AIO-DX, with CPU alarms raised because the platform cores were running at >80%. Commits were made to bring CPU usage on an idle system below the 80% alarm threshold. However, platform core usage on an idle AIO-DX with stx-openstack remains high, at 60-70%.

Severity:
--------
Medium: No specific functional impact has been identified so far. However, because idle CPU usage on the platform cores is already >60%, running various actions on AIO-DX could raise CPU alarms or cause actions to time out or fail.

Steps to Reproduce
------------------
Install an AIO-DX system, apply stx-openstack, and check CPU usage on the platform cores while the system is idle.

Expected Behavior
------------------
Platform CPUs should run at <50% usage.

Actual Behavior
----------------
Platform CPUs run at 60-70% usage as of loads built on 2019-08-21.
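
One way to check the per-core numbers quoted here is to compare two samples of /proc/stat. The following is a minimal sketch, not the method used in this report: the function names and the idea of passing the platform cores as a list are assumptions made for illustration, and which cores are platform cores depends on the system's CPU assignment.

```python
import time


def parse_proc_stat(text):
    """Return {cpu_name: (idle_ticks, total_ticks)} for each per-core line."""
    samples = {}
    for line in text.splitlines():
        fields = line.split()
        # Per-core lines look like "cpu0 ..."; skip the aggregate "cpu" line
        # and unrelated lines such as "intr" or "ctxt".
        if not fields or not fields[0].startswith("cpu") or fields[0] == "cpu":
            continue
        # user, nice, system, idle, iowait, irq, softirq, steal.
        # Guest time is already included in user, so it is not added again.
        ticks = [int(value) for value in fields[1:9]]
        idle = ticks[3] + (ticks[4] if len(ticks) > 4 else 0)
        samples[fields[0]] = (idle, sum(ticks))
    return samples


def usage_percent(before, after, cores):
    """Return {core: busy percentage} between two parsed samples."""
    result = {}
    for core in cores:
        idle_delta = after[core][0] - before[core][0]
        total_delta = after[core][1] - before[core][1]
        if total_delta == 0:
            result[core] = 0.0
        else:
            result[core] = 100.0 * (total_delta - idle_delta) / total_delta
    return result


def read_proc_stat(path="/proc/stat"):
    with open(path) as handle:
        return parse_proc_stat(handle.read())


if __name__ == "__main__":
    # Assumption for illustration only: cpu0 and cpu1 are the platform cores.
    platform_cores = ["cpu0", "cpu1"]
    first = read_proc_stat()
    time.sleep(10)
    second = read_proc_stat()
    for core, busy in usage_percent(first, second, platform_cores).items():
        print(f"{core}: {busy:.1f}% busy")
```

Sampling over a longer interval smooths out short bursts, which matters when comparing against a fixed level such as the 50% target.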

Reproducibility
---------------
Reproducible.

System Configuration
--------------------
AIO-DX (two node) system with the following CPU:

[root@controller-1 ~(keystone_admin)]# lscpu
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Byte Order: Little Endian
CPU(s): 36
On-line CPU(s) list: 0-35
Thread(s) per core: 1
Core(s) per socket: 18
Socket(s): 2
NUMA node(s): 2
Vendor ID: GenuineIntel
CPU family: 6
Model: 85
Model name: Intel(R) Xeon(R) Gold 6140 CPU @ 2.30GHz
Stepping: 4
CPU MHz: 2300.000
BogoMIPS: 4600.00
Virtualization: VT-x
L1d cache: 32K
L1i cache: 32K
L2 cache: 1024K
L3 cache: 25344K
NUMA node0 CPU(s): 0-17
NUMA node1 CPU(s): 18-35
Flags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc art arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc aperfmperf eagerfpu pni pclmulqdq dtes64 monitor ds_cpl vmx smx est tm2 ssse3 sdbg fma cx16 xtpr pdcm pcid dca sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm abm 3dnowprefetch epb cat_l3 cdp_l3 intel_ppin intel_pt ssbd mba ibrs ibpb stibp tpr_shadow vnmi flexpriority ept vpid fsgsbase tsc_adjust bmi1 hle avx2 smep bmi2 erms invpcid rtm cqm mpx rdt_a avx512f avx512dq rdseed adx smap clflushopt clwb avx512cd avx512bw avx512vl xsaveopt xsavec xgetbv1 cqm_llc cqm_occup_llc cqm_mbm_total cqm_mbm_local dtherm arat pln pts pku ospke md_clear spec_ctrl intel_stibp flush_l1d

Branch/Pull Time/Commit
-----------------------
Loads built as of 2019-08-21

Last Pass
---------
Unknown. The high CPU usage is expected to result from the move to containerization and from affining the platform processes/pods to the platform cores on AIO.

Timestamp/Logs
--------------
n/a

Test Activity
-------------
Developer testing

Frank Miller (sensfan22) wrote :

Gerry Kopec outlined these suggested areas of investigation as part of his work on LP 1837426:
- Increase the period of cinder-volume-usage-audit and heat-engine-cleaner. These currently run every 5 minutes.
- [...] threads. Commit https://review.opendev.org/#/c/676035/ may address this, but that should be confirmed.
- The CPU usage of the kubelet process slowly increased over time (from 10% to 14% of cpu0 and cpu1 over a week).
- Subsequent application-apply operations may fail: the nova-db-sync job fails due to system overload, and subsequent retries then cannot create tables because they already exist. Recovery requires dropping the nova databases. Startup of the compute-kit (libvirt, nova, nova-api-proxy, neutron, placement) could be smoothed out by not running all charts in parallel.
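
To put the first suggestion in numbers, the sketch below counts how often a periodic job fires per day for a given period. Only the 5-minute period comes from this report; the 30- and 60-minute values are hypothetical candidates chosen for illustration, and the helper name is made up.

```python
MINUTES_PER_DAY = 24 * 60


def runs_per_day(period_minutes):
    """Number of times a job with the given period runs in 24 hours."""
    if period_minutes <= 0:
        raise ValueError("period_minutes must be positive")
    return MINUTES_PER_DAY / period_minutes


if __name__ == "__main__":
    # 5 is the current period from the report; 30 and 60 are hypothetical.
    for period in (5, 30, 60):
        print(f"every {period} min: {runs_per_day(period):.0f} runs/day per job")
```

At the current 5-minute period each job runs 288 times a day, so any longer period cuts the number of invocations proportionally. Whether that translates into a measurable drop in platform CPU usage would still need to be confirmed on a real system.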

Ghada Khalil (gkhalil) wrote :

Marking as stx.3.0 / medium priority - waiting for extra results from testing with the fixes already mentioned above.

tags: added: stx.3.0 stx.config
Changed in starlingx:
importance: Undecided → Medium
status: New → Triaged
assignee: nobody → Gerry Kopec (gerry-kopec)