AIODX: Platform tasks are floating on all cores

Bug #1843294 reported by Tee Ngo
Affects: StarlingX
Status: Won't Fix
Importance: Low
Assigned to: Ghada Khalil
Milestone: (none)

Bug Description

Brief Description
-----------------
Platform tasks, including those of docker, are floating on all cores.

Severity
--------
Critical

Steps to Reproduce
------------------
Run the top command and press 1 to see the per-core detail.

Launch a large number of pods and observe the CPU occupancy of platform cores vs. application cores.
The occupancy of the application cores spikes until the scaling is complete, while the occupancy of the platform cores increases only slightly.

Check the ps-sched.sh dump (ps-sched.sh | sort -k10 -n).
Check the cpuset of the docker cgroup (see the Logs section below).
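
A minimal shell sketch of one way to spot-check affinity from the command line (sm is used here only as an example platform process; taskset and pidof are standard utilities):

taskset -pc $(pidof sm)                        # effective CPU affinity of the sm process
cat /sys/fs/cgroup/cpuset/docker/cpuset.cpus   # cpuset applied to the docker cgroup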

Note: the two controller nodes had originally been assigned the openstack-control-plane and openstack-compute-node labels. The stx-openstack app was not applied.

Expected Behavior
------------------
Except for k8s-infra-related tasks (a known issue, work in progress), all other platform-related tasks should run on the CPU cores reserved for platform use.

Actual Behavior
----------------
Many platform tasks, such as postgres, docker, mtcClient, lldpd, ceph-mgr, sm, etc., are running on non-platform cores.

Reproducibility
---------------
Reproducible on the load stated below.

System Configuration
--------------------
AIODX, IPv6

Branch/Pull Time/Commit
-----------------------
OS="centos"
SW_VERSION="19.09"
BUILD_TARGET="Host Installer"
BUILD_TYPE="Formal"
BUILD_ID="20190821T053000Z"

JOB="STX_build_master_master"
BUILD_BY="<email address hidden>"
BUILD_NUMBER="221"
BUILD_HOST="starlingx_mirror"
BUILD_DATE="2019-08-21 05:30:00 +0000"

Last Pass
---------
I am not sure whether this test has previously been run on an AIODX (IPv6) system.

Timestamp/Logs
--------------
controller-0:/tmp# systemd-cgls cpuset
....
....
....
....
├─docker
│ ├─dc4daa4e77401e96e3c9a32b1f843d07b1fe52d601cca44f020a00d73f2209d0
│ │ ├─2323134 uwsgi -b 32768 --die-on-term --http :8000 --http-timeout 3600 --enable-threads -L --lazy-apps --master --paste config:/etc/armada/api-paste.ini --pyargv --config-file /etc/armada/armada.conf --t
│ │ ├─2323161 uwsgi -b 32768 --die-on-term --http :8000 --http-timeout 3600 --enable-threads -L --lazy-apps --master --paste config:/etc/armada/api-paste.ini --pyargv --config-file /etc/armada/armada.conf --t
│ │ ├─2323162 uwsgi -b 32768 --die-on-term --http :8000 --http-timeout 3600 --enable-threads -L --lazy-apps --master --paste config:/etc/armada/api-paste.ini --pyargv --config-file /etc/armada/armada.conf --t
│ │ ├─2323163 uwsgi -b 32768 --die-on-term --http :8000 --http-timeout 3600 --enable-threads -L --lazy-apps --master --paste config:/etc/armada/api-paste.ini --pyargv --config-file /etc/armada/armada.conf --t
│ │ ├─2323164 uwsgi -b 32768 --die-on-term --http :8000 --http-timeout 3600 --enable-threads -L --lazy-apps --master --paste config:/etc/armada/api-paste.ini --pyargv --config-file /etc/armada/armada.conf --t
│ │ └─2323165 uwsgi -b 32768 --die-on-term --http :8000 --http-timeout 3600 --enable-threads -L --lazy-apps --master --paste config:/etc/armada/api-paste.ini --pyargv --config-file /etc/armada/armada.conf --t
│ ├─2586137481c4a1bb6a38aeadca5c1cbf6f71ba672ce941bff84c9424a74f205e
│ │ ├─4024261 /bin/sh -c /edgex/mongo/config/launch-edgex-mongo.sh
│ │ ├─4024322 /bin/sh /edgex/mongo/config/launch-edgex-mongo.sh
│ │ └─4024327 mongod --smallfiles --ipv6 --bind_ip_all
│ └─5207389c0e4119b92b82048f84b6677a89046ff34127c80b25492a48d3a9f47a
│   ├─3304583 /bin/sh -c rm -rf /consul/data/* && docker-entrypoint.sh agent -server -client=:: -bootstrap -ui | tee /edgex/logs/core-consul.log
│   ├─3304648 /bin/dumb-init /bin/sh /usr/local/bin/docker-entrypoint.sh agent -server -client=:: -bootstrap -ui
│   ├─3304649 tee /edgex/logs/core-consul.log
│   └─3304650 consul agent -data-dir=/consul/data -config-dir=/consul/config -server -client=:: -bootstrap -ui

controller-0:/tmp# cat /proc/2323134/cgroup
11:memory:/docker/dc4daa4e77401e96e3c9a32b1f843d07b1fe52d601cca44f020a00d73f2209d0
10:cpuset:/docker/dc4daa4e77401e96e3c9a32b1f843d07b1fe52d601cca44f020a00d73f2209d0
9:blkio:/docker/dc4daa4e77401e96e3c9a32b1f843d07b1fe52d601cca44f020a00d73f2209d0
8:net_prio,net_cls:/docker/dc4daa4e77401e96e3c9a32b1f843d07b1fe52d601cca44f020a00d73f2209d0
7:devices:/docker/dc4daa4e77401e96e3c9a32b1f843d07b1fe52d601cca44f020a00d73f2209d0
6:perf_event:/docker/dc4daa4e77401e96e3c9a32b1f843d07b1fe52d601cca44f020a00d73f2209d0
5:cpuacct,cpu:/docker/dc4daa4e77401e96e3c9a32b1f843d07b1fe52d601cca44f020a00d73f2209d0
4:freezer:/docker/dc4daa4e77401e96e3c9a32b1f843d07b1fe52d601cca44f020a00d73f2209d0
3:pids:/docker/dc4daa4e77401e96e3c9a32b1f843d07b1fe52d601cca44f020a00d73f2209d0
2:hugetlb:/docker/dc4daa4e77401e96e3c9a32b1f843d07b1fe52d601cca44f020a00d73f2209d0
1:name=systemd:/docker/dc4daa4e77401e96e3c9a32b1f843d07b1fe52d601cca44f020a00d73f2209d0

controller-0:/tmp# cd /sys/fs/cgroup/cpuset/docker
controller-0:/sys/fs/cgroup/cpuset/docker#
controller-0:/sys/fs/cgroup/cpuset/docker# cat cpuset.cpus
0-35
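
For comparison, a correctly affined setup would show only the platform-reserved cores here. As an illustration only (assuming cores 0-1 are the platform-reserved cores on this node; the proper fix belongs in the platform/docker affining logic, not in a manual write), an individual container's cgroup can be narrowed to a subset of the parent cpuset:

controller-0:/sys/fs/cgroup/cpuset/docker# echo 0-1 > dc4daa4e77401e96e3c9a32b1f843d07b1fe52d601cca44f020a00d73f2209d0/cpuset.cpus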

Attached is the ps-sched.sh dump from controller-1 (the primary controller); for example, some postgres-related processes are running on core #7.

Test Activity
-------------
System Test

Tee Ngo (teewrs)
description: updated
Revision history for this message
Tee Ngo (teewrs) wrote :

I was informed that the platform task affining job is tied to the openstack-compute-node label.
After removing the openstack-related labels on both controllers, the platform tasks (except for the k8s-infra-related ones) appear to be affined correctly. However, some pods failed to launch due to insufficient memory. It turned out that the Kubernetes allocatable memory is tied to the openstack labels.
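
For reference, a hedged sketch of the label removal using the StarlingX system CLI (exact syntax, and whether the host must be locked first, may vary by release; treat this as illustrative):

system host-lock controller-0
system host-label-remove controller-0 openstack-control-plane openstack-compute-node
system host-unlock controller-0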

When the node has openstack labels, the CPU and memory reserved for platform use are "visible" to Kubernetes and thus allocatable to pods. Below is a comparison of the two controller nodes, one with openstack labels assigned and one without.

Controller-0 (with openstack labels)
============
root 117578 1 4 13:05 ? 00:15:45 /usr/bin/kubelet --bootstrap-kubeconfig=/etc/kubernetes/bootstrap-kubelet.conf --kubeconfig=/etc/kubernetes/kubelet.conf --config=/var/lib/kubelet/config.yaml --cgroup-driver=cgroupfs --network-plugin=cni --pod-infra-container-image=k8s.gcr.io/pause:3.1 --node-ip=face::3 --cpu-manager-policy=none

Capacity:
 cpu: 36
 ephemeral-storage: 10190100Ki
 hugepages-1Gi: 60Gi
 hugepages-2Mi: 0
 intel.com/pci_sriov_net_group0_data0: 64
 memory: 97528444Ki
 pods: 110
Allocatable:
 cpu: 36 <-- platform and vswitch cpus have not been deducted
 ephemeral-storage: 9391196145
 hugepages-1Gi: 60Gi
 hugepages-2Mi: 0
 intel.com/pci_sriov_net_group0_data0: 0
 memory: 34511484Ki <--- ~32G (platform mem has not been deducted)
 pods: 110

Controller-1 (without openstack labels)
============
root 117762 1 3 15:24 ? 00:06:10 /usr/bin/kubelet --bootstrap-kubeconfig=/etc/kubernetes/bootstrap-kubelet.conf --kubeconfig=/etc/kubernetes/kubelet.conf --config=/var/lib/kubelet/config.yaml --cgroup-driver=cgroupfs --network-plugin=cni --pod-infra-container-image=k8s.gcr.io/pause:3.1 --node-ip=face::4 --cpu-manager-policy=static --system-reserved-cgroup=/system.slice --system-reserved=cpu=2,memory=16500Mi

Capacity:
 cpu: 36
 ephemeral-storage: 10190100Ki
 hugepages-1Gi: 70Gi
 hugepages-2Mi: 0
 intel.com/pci_sriov_net_group0_data0: 64
 memory: 97528444Ki
 pods: 110
Allocatable:
 cpu: 34 <--- vswitch cpus have not been deducted
 ephemeral-storage: 9391196145
 hugepages-1Gi: 70Gi
 hugepages-2Mi: 0
 intel.com/pci_sriov_net_group0_data0: 64
 memory: 7129724Ki <--- ~6G (platform mem has been deducted)
 pods: 110
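
For reference, the Allocatable figures above line up with the standard kubelet computation (Allocatable = Capacity - hugepages - system/kube-reserved - hard-eviction reserve); the 100Mi hard-eviction figure used below is the kubelet default and is an assumption here:

controller-1 memory: 97528444Ki - 73400320Ki (70Gi hugepages) - 16896000Ki (16500Mi system-reserved) - 102400Ki (100Mi eviction) = 7129724Ki
controller-1 cpu:    36 - 2 (system-reserved cpu) = 34
controller-0 memory: 97528444Ki - 62914560Ki (60Gi hugepages) - 102400Ki (100Mi eviction) = 34511484Ki, i.e. no platform memory has been reserved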

Revision history for this message
Ghada Khalil (gkhalil) wrote :

It's expected that processes would float if the openstack compute label is present. Regarding the insufficient memory, this needs to be set up by the user when configuring the system.

tags: added: stx.containers
Revision history for this message
Ghada Khalil (gkhalil) wrote :

@Tee, can this bug be closed? Is there anything else outstanding?

Changed in starlingx:
status: New → Incomplete
Revision history for this message
Tee Ngo (teewrs) wrote :

The issue is that when the node has openstack labels, the CPU and memory reserved for platform use are "visible" to Kubernetes and thus allocatable to pods. Platform resources should be unaffected by openstack label assignment.

Revision history for this message
Brent Rowsell (brent-rowsell) wrote :

The model for openstack nodes is that only openstack control-plane pods run on them.
These pods run on the platform cores and share the reserved platform memory. The rest of the resources are available for VMs. The project does not support running application pods on openstack nodes at this time.

Revision history for this message
Ghada Khalil (gkhalil) wrote :

Marking as Won't Fix based on Brent's comment. There is no plan to do anything further for this launchpad.

Changed in starlingx:
importance: Undecided → Low
status: Incomplete → Won't Fix
assignee: nobody → Ghada Khalil (gkhalil)