Bad behaving pod not well separated from the platform
Affects | Status | Importance | Assigned to | Milestone | |
---|---|---|---|---|---|
StarlingX | Fix Released | Medium | Dan Voiculeasa | | |
Bug Description
Brief Description
-----------------
A pod can spawn an unlimited number of processes.
Severity
--------
Critical: System unusable.
Steps to Reproduce
------------------
Spawn processes inside a pod.
Expected Behavior
------------------
A pod should not be able to consume all PIDs available to the platform.
Actual Behavior
----------------
The pod consumes all PIDs available to the platform.
Reproducibility
---------------
100%
System Configuration
-------
Any
Branch/Pull Time/Commit
-------
not relevant
Last Pass
---------
not relevant
Timestamp/Logs
--------------
not relevant
Test Activity
-------------
Testing
Workaround
----------
Persistent: add --pod-max-pids in /etc/systemd/
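A rough way to check whether a limit is actually in effect is to read the pids controller files of the pod's cgroup. The sketch below assumes the standard `pids.max` and `pids.current` files of the Linux pids cgroup controller; the function names are hypothetical, and the location of the pod's cgroup directory depends on the host setup and is not given in this report, so a temporary directory stands in for it.

```python
import tempfile
from pathlib import Path
from typing import Optional


def read_pids_value(cgroup_dir: Path, name: str) -> Optional[int]:
    """Return the integer stored in a pids controller file, or None for "max" (unlimited)."""
    text = (cgroup_dir / name).read_text().strip()
    return None if text == "max" else int(text)


def pid_headroom(cgroup_dir: Path) -> Optional[int]:
    """Return how many more tasks the cgroup may create, or None if it is unlimited."""
    limit = read_pids_value(cgroup_dir, "pids.max")
    if limit is None:
        return None
    current = read_pids_value(cgroup_dir, "pids.current") or 0
    return max(limit - current, 0)


# Demonstration against a stand-in directory, not a real cgroup mount.
with tempfile.TemporaryDirectory() as tmp:
    fake_cgroup = Path(tmp)
    (fake_cgroup / "pids.max").write_text("750\n")
    (fake_cgroup / "pids.current").write_text("20\n")
    limited_headroom = pid_headroom(fake_cgroup)

    # An unprotected pod reports "max", which is the situation this bug describes.
    (fake_cgroup / "pids.max").write_text("max\n")
    unlimited_headroom = pid_headroom(fake_cgroup)
```

A result of `None` means the pod can keep spawning processes until the host itself runs out of PIDs.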
Changed in starlingx: | |
assignee: | nobody → Dan Voiculeasa (dvoicule) |
OpenStack Infra (hudson-openstack) wrote : Fix proposed to stx-puppet (master) | #1 |
Changed in starlingx: | |
status: | New → In Progress |
OpenStack Infra (hudson-openstack) wrote : Fix proposed to config (master) | #2 |
Fix proposed to branch: master
Review: https:/
OpenStack Infra (hudson-openstack) wrote : Fix proposed to ansible-playbooks (master) | #3 |
Fix proposed to branch: master
Review: https:/
OpenStack Infra (hudson-openstack) wrote : | #4 |
Fix proposed to branch: master
Review: https:/
Ghada Khalil (gkhalil) wrote : | #5 |
Screening: stx.6.0 / medium - robustness handling of misbehaving pods
tags: | added: stx.containers |
Changed in starlingx: | |
importance: | Undecided → Medium |
tags: | added: stx.6.0 |
OpenStack Infra (hudson-openstack) wrote : Fix merged to stx-puppet (master) | #6 |
Reviewed: https:/
Committed: https:/
Submitter: "Zuul (22348)"
Branch: master
commit 3e3940824dfb830
Author: Dan Voiculeasa <email address hidden>
Date: Thu May 13 18:03:45 2021 +0300
Enable kubelet support for pod pid limit
Enable limiting the number of pids inside of pods.
Add a default value to protect against a missing value.
Default to 750 pids limit to align with service parameter default
value for most resource consuming StarlingX optional app (openstack).
In fact any value above service parameter minimum value is good for the
default.
Closes-Bug: 1928353
Signed-off-by: Dan Voiculeasa <email address hidden>
Change-Id: I10c1684fe3145e
Changed in starlingx: | |
status: | In Progress → Fix Released |
OpenStack Infra (hudson-openstack) wrote : Fix proposed to stx-puppet (f/centos8) | #7 |
Fix proposed to branch: f/centos8
Review: https:/
OpenStack Infra (hudson-openstack) wrote : Change abandoned on stx-puppet (f/centos8) | #8 |
Change abandoned by "Chuck Short <email address hidden>" on branch: f/centos8
Review: https:/
OpenStack Infra (hudson-openstack) wrote : Fix proposed to stx-puppet (f/centos8) | #9 |
Fix proposed to branch: f/centos8
Review: https:/
OpenStack Infra (hudson-openstack) wrote : Change abandoned on stx-puppet (f/centos8) | #10 |
Change abandoned by "Chuck Short <email address hidden>" on branch: f/centos8
Review: https:/
OpenStack Infra (hudson-openstack) wrote : Fix proposed to stx-puppet (f/centos8) | #11 |
Fix proposed to branch: f/centos8
Review: https:/
OpenStack Infra (hudson-openstack) wrote : Change abandoned on stx-puppet (f/centos8) | #12 |
Change abandoned by "Chuck Short <email address hidden>" on branch: f/centos8
Review: https:/
OpenStack Infra (hudson-openstack) wrote : Fix proposed to stx-puppet (f/centos8) | #13 |
Fix proposed to branch: f/centos8
Review: https:/
OpenStack Infra (hudson-openstack) wrote : Fix merged to config (master) | #14 |
Reviewed: https:/
Committed: https:/
Submitter: "Zuul (22348)"
Branch: master
commit bf547186d19a218
Author: Dan Voiculeasa <email address hidden>
Date: Wed May 12 15:16:32 2021 +0300
Add service parameter to control pod pids limit
Create a config section for kubernetes service.
Create a parameter named pod_max_pids to have similar name as
the kubernetes parameter pod-max-pids.
Store the value in the config section.
This will create a system-wide entry in hieradata when unlocking:
plattform:
This affects hosts with kubelet running, meaning controller and
worker personalities. A config out of date will be raised for all hosts
of both personalities, even for parameters that target only a specific
personality.
After modifying the parameter a host-lock then host-unlock is required.
Platform pods use under 20 processes in steady state.
Some openstack pods reach ~450 processes in steady state.
Since StarlingX provides some optional apps we provide a default value
that takes into account the most hungry app, that being openstack.
The database entry will be populated considering openstack will be
applied.
Do not restrict the minimum based on optional apps, as this allows the user
to set a lower minimum if there is no plan to use openstack.
Tested on Standard+dedicated storage:
- out of sync raised for controllers and workers when using
service-
- alarm cleared after host-lock, host-unlock
- new value correctly generated and used
- add with system service-
- modify with system service-
Tested on top of: I10c1684fe3145e
Partial-Bug: 1928353
Signed-off-by: Dan Voiculeasa <email address hidden>
Change-Id: I74fcf2bd405c2a
OpenStack Infra (hudson-openstack) wrote : Fix merged to ansible-playbooks (master) | #15 |
Reviewed: https:/
Committed: https:/
Submitter: "Zuul (22348)"
Branch: master
commit 64bf73c85c5de07
Author: Dan Voiculeasa <email address hidden>
Date: Tue May 11 16:22:12 2021 +0300
Enable kubelet support for pod pid limit
This protects the system before the unlock. This has the most meaning
during the restore procedure, when the system is unprotected until
unlock (until puppet generates the config file containing protection).
Partial-Bug: 1928353
Signed-off-by: Dan Voiculeasa <email address hidden>
Change-Id: I09c4d4f494bc11
OpenStack Infra (hudson-openstack) wrote : Fix proposed to ansible-playbooks (f/centos8) | #16 |
Fix proposed to branch: f/centos8
Review: https:/
OpenStack Infra (hudson-openstack) wrote : Fix merged to ansible-playbooks (master) | #17 |
Reviewed: https:/
Committed: https:/
Submitter: "Zuul (22348)"
Branch: master
commit ac0c5d51a8708c8
Author: Dan Voiculeasa <email address hidden>
Date: Thu May 13 19:53:41 2021 +0300
Create the pod_max_pids service parameter
This adds a default entry to service parameters.
Create the default entry taking into consideration the most hungry of
the optional StarlingX apps. The user is free to modify the value as
desired, using 'system service-
Same can be created by the user using 'system service-
but this helps the user by being transparent in service-
If this service parameter was missing an entry, then no hieradata
variable would have been generated, so puppet would have used
a predefined value.
Partial-Bug: 1928353
Depends-On: I74fcf2bd405c2a
Signed-off-by: Dan Voiculeasa <email address hidden>
Change-Id: I707ddc4ca67595
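The fallback described in this commit message can be sketched as a lookup with a predefined default. This is only an illustration of the logic, not the puppet code: the key name `pod_max_pids` and the function name are assumptions (the real hieradata key is truncated in this thread), the 750 fallback is the default quoted in the stx-puppet commit above, and 2000 is just a sample user-supplied value.

```python
from typing import Mapping

# Hypothetical key name; the real hieradata key is truncated in this thread.
HIERADATA_KEY = "pod_max_pids"


def resolve_pod_max_pids(hieradata: Mapping[str, int], predefined: int) -> int:
    """Use the service parameter value if one was generated, else the predefined value."""
    value = hieradata.get(HIERADATA_KEY)
    return predefined if value is None else value


# With a service parameter entry, the generated hieradata value is used.
with_entry = resolve_pod_max_pids({HIERADATA_KEY: 2000}, predefined=750)

# Without an entry, no hieradata variable exists and the predefined value applies.
without_entry = resolve_pod_max_pids({}, predefined=750)
```

Creating the default entry up front makes the value visible to the user instead of leaving it hidden in the fallback.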
OpenStack Infra (hudson-openstack) wrote : Related fix proposed to config (master) | #18 |
Related fix proposed to branch: master
Review: https:/
OpenStack Infra (hudson-openstack) wrote : Related fix proposed to stx-puppet (master) | #19 |
Related fix proposed to branch: master
Review: https:/
OpenStack Infra (hudson-openstack) wrote : Related fix merged to config (master) | #20 |
Reviewed: https:/
Committed: https:/
Submitter: "Zuul (22348)"
Branch: master
commit 8423e70fd04f07b
Author: Dan Voiculeasa <email address hidden>
Date: Fri May 21 12:56:47 2021 +0300
Fix pod max pids service parameter default value
Openstack installation fails for rabbit-mq pods.
Change the approach of how the default value is selected.
Document recommended minimum values for apps instead of using them.
Select the default value as high as possible, protecting against a
rogue pod, protecting against platform slowdowns created by high number
of processes in the system, but low enough such that platform is still
responsive even on older hardware.
User is free to decrease the limit to increase the degree of protection
against slowdowns.
Initially it was observed that openstack pods reach ~450 processes
in steady state.
New tests show even with the 2/3 extra room, 750 pid limit is not
sufficient when deploying rabbit-mq pods. But 2000 is.
Recommended minimum value for openstack pods pid limit becomes 2000.
Partial-Bug: 1928949
Related-Bug: 1928353
Signed-off-by: Dan Voiculeasa <email address hidden>
Change-Id: I0d66173e2247fa
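The numbers in this commit message can be checked with a little arithmetic. The sketch below takes one reading of "2/3 extra room", namely steady state plus two thirds of steady state; the variable names are illustrative only.

```python
from fractions import Fraction

STEADY_STATE_PIDS = 450       # observed for some openstack pods, per the commit message
EXTRA_ROOM = Fraction(2, 3)   # headroom added on top of steady state (assumed reading)

# 450 * (1 + 2/3) reproduces the original 750 pid limit.
original_default = int(STEADY_STATE_PIDS * (1 + EXTRA_ROOM))

RECOMMENDED_MINIMUM = 2000    # value reported as sufficient for rabbit-mq pods

# How far above the observed steady state the new recommended minimum sits.
ratio_to_steady_state = Fraction(RECOMMENDED_MINIMUM, STEADY_STATE_PIDS)
```

The new recommended minimum is a bit more than four times the observed steady state, which suggests that deployment-time peaks, not steady-state usage, determine the limit.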
OpenStack Infra (hudson-openstack) wrote : Related fix merged to stx-puppet (master) | #21 |
Reviewed: https:/
Committed: https:/
Submitter: "Zuul (22348)"
Branch: master
commit 4a9710600d7dfd1
Author: Dan Voiculeasa <email address hidden>
Date: Fri May 21 14:03:30 2021 +0300
Change pod pid limit default value
Change the behavior of kubernetes pod pid limit in case the service
parameter is missing.
The initial change(
value to protect the system by default in case the service parameter was
missing. The value was aligned with what was believed to work for
StarlingX apps. Some apps, openstack for example, are upstream and
StarlingX doesn't control changes inside them. Instead of maintaining
the value initially proposed here, change the approach.
Change the behaviour to use the maximum value for the service parameter
by default.
Partial-Bug: 1928949
Related-Bug: 1928353
Signed-off-by: Dan Voiculeasa <email address hidden>
Change-Id: I0f776d9a8be573
OpenStack Infra (hudson-openstack) wrote : Fix proposed to ansible-playbooks (f/centos8) | #22 |
Fix proposed to branch: f/centos8
Review: https:/
OpenStack Infra (hudson-openstack) wrote : Change abandoned on ansible-playbooks (f/centos8) | #23 |
Change abandoned by "Chuck Short <email address hidden>" on branch: f/centos8
Review: https:/
OpenStack Infra (hudson-openstack) wrote : Related fix proposed to config (f/centos8) | #24 |
Related fix proposed to branch: f/centos8
Review: https:/
OpenStack Infra (hudson-openstack) wrote : | #25 |
Related fix proposed to branch: f/centos8
Review: https:/
OpenStack Infra (hudson-openstack) wrote : Fix proposed to ansible-playbooks (f/centos8) | #26 |
Fix proposed to branch: f/centos8
Review: https:/
OpenStack Infra (hudson-openstack) wrote : Change abandoned on ansible-playbooks (f/centos8) | #27 |
Change abandoned by "Chuck Short <email address hidden>" on branch: f/centos8
Review: https:/
OpenStack Infra (hudson-openstack) wrote : Fix merged to ansible-playbooks (f/centos8) | #28 |
Reviewed: https:/
Committed: https:/
Submitter: "Zuul (22348)"
Branch: f/centos8
commit 4e96b762f549aad
Author: Mihnea Saracin <email address hidden>
Date: Sat May 22 15:48:19 2021 +0000
Revert "Restore host filesystems with collected sizes"
This reverts commit 255488739efa4ac
Reason for revert: Did a rework to fix https:/
Change-Id: Iea79701a874eff
Depends-On: I55ae6954d24ba3
commit c064aacc377c8bd
Author: Angie Wang <email address hidden>
Date: Fri May 21 21:28:02 2021 -0400
Ensure apiserver keys are present before extract from tarball
This is to fix the upgrade playbook issue that happens during
AIO-SX upgrade from stx4.0 to stx5.0 which was introduced by
https:/
The apiserver keys are not available in stx4.0 side so we need
to ensure the keys under /etc/kubernetes/pki are present in the
backed-up tarball before extracting, otherwise playbook fails
because the keys are not found in the archive.
Change-Id: I8602f07d1b1041
Closes-Bug: 928925
Signed-off-by: Angie Wang <email address hidden>
commit 0261f22ff7c23d2
Author: Don Penney <email address hidden>
Date: Thu May 20 23:09:07 2021 -0400
Update SX to DX migration to wait for coredns config
This commit updates the SX to DX migration playbook to wait after
modifying the system mode to duplex until the runtime manifest that
updates coredns config has completed. The playbook will wait for up to
20 minutes to allow for the possibility that sysinv has multiple
runtime manifests queued up, each of which could take several minutes.
Depends-On: https:/
Depends-On: https:/
Change-Id: I3bf94d3493ae20
Closes-Bug: 1929148
Signed-off-by: Don Penney <email address hidden>
commit 7c4f17bd0d92fc1
Author: Daniel Safta <email address hidden>
Date: Wed May 19 09:08:16 2021 +0000
Fixed missing apiserver-
When controller-1 is the active controller
the backup archive does not contain
/etc/
This change adds a new task which brings
the certs from /etc/kubernetes/pki
Closes-bug: 1928925
Signed-off-by: Daniel Safta <email address hidden>
Change-Id: I3c68377603e1af
commit e221ef8fbe51aa6
Author: David Sullivan <email address hidden>
Date: Wed May 19 16:01:27 2021 -0500
Support boo...
tags: | added: in-f-centos8 |
OpenStack Infra (hudson-openstack) wrote : Related fix proposed to config (f/centos8) | #29 |
Related fix proposed to branch: f/centos8
Review: https:/
OpenStack Infra (hudson-openstack) wrote : Fix merged to stx-puppet (f/centos8) | #30 |
Reviewed: https:/
Committed: https:/
Submitter: "Zuul (22348)"
Branch: f/centos8
commit 3e3940824dfb830
Author: Dan Voiculeasa <email address hidden>
Date: Thu May 13 18:03:45 2021 +0300
Enable kubelet support for pod pid limit
Enable limiting the number of pids inside of pods.
Add a default value to protect against a missing value.
Default to 750 pids limit to align with service parameter default
value for most resource consuming StarlingX optional app (openstack).
In fact any value above service parameter minimum value is good for the
default.
Closes-Bug: 1928353
Signed-off-by: Dan Voiculeasa <email address hidden>
Change-Id: I10c1684fe3145e
commit 0c16d288fbc4831
Author: Jessica Castelino <email address hidden>
Date: Tue May 11 10:21:57 2021 -0400
Safe restart of the etcd SM service in etcd upgrade runtime class
While upgrading the central cloud of a DC system, activation failed
because there was an unexpected SWACT to controller-1. This was due
to the etcd upgrade script. Part of this script runs the etcd
manifest. This triggers a reload/restart of the etcd service. As this
is done outside of the sm, sm saw the process failure and triggered
the SWACT.
This commit modifies platform:
to do a safe restart of the etcd SM service and thus, solve the
issue.
Change-Id: I3381b6976114c7
Signed-off-by: Jessica Castelino <email address hidden>
Closes-Bug: 1928135
commit eec3008f600aeeb
Author: Mihnea Saracin <email address hidden>
Date: Mon May 10 13:09:52 2021 +0300
Serialize updates to global_filter in the AIO manifest
Right now, looking at the aio manifest:
https:/
there are 3 classes that update
in parallel the lvm global_filter:
- include ::platform:
- include ::platform:
- include ::platform:
And this generates some errors.
We fix this by adding dependencies between the above classes
in order to update the global_filter in a serial mode.
Closes-Bug: 1927762
Signed-off-by: Mihnea Saracin <email address hidden>
Change-Id: If6971e520454cd
commit 97371409b9b2ae3
Author: Steven Webster <email address hidden>
Date: Fri May 7 15:33:43 2021 -0400
Add SR-IOV rate-limit dependency
Currently, the binding of an SR-IOV virtual function (VF) to a
driver has a dependency on platform:
to ensure that SR-IOV is enabled (VFs created) before actually
doing the bind.
This dependency does not exist for configuring the VF rate-limits
however. There is a cha...
OpenStack Infra (hudson-openstack) wrote : Related fix proposed to config (f/centos8) | #31 |
Related fix proposed to branch: f/centos8
Review: https:/
OpenStack Infra (hudson-openstack) wrote : Change abandoned on config (f/centos8) | #32 |
Change abandoned by "Chuck Short <email address hidden>" on branch: f/centos8
Review: https:/
OpenStack Infra (hudson-openstack) wrote : Related fix merged to config (f/centos8) | #33 |
Reviewed: https:/
Committed: https:/
Submitter: "Zuul (22348)"
Branch: f/centos8
commit 9e420d9513e5faf
Author: Bin Qian <email address hidden>
Date: Mon May 31 14:45:52 2021 -0400
Add more logging to run docker login
Add error log for running docker login. The new log could
help identify docker login failure.
Closes-Bug: 1930310
Change-Id: I8a709fb6665de8
Signed-off-by: Bin Qian <email address hidden>
commit 31c77439d2cea59
Author: albailey <email address hidden>
Date: Fri May 28 13:42:42 2021 -0500
Fix controller-0 downgrade failing to kill ceph
kill_
file that does not exist in an AIO-DX environment.
We no longer invoke kill_ceph_
AIO SX or DX env.
This allows: "system host-downgrade controller-0"
to proceed in an AIO-DX environment where that second
controller (controller-0) was upgraded.
Partial-Bug: 1929884
Signed-off-by: albailey <email address hidden>
Change-Id: I633853f7531773
commit 0dc99eee608336f
Author: albailey <email address hidden>
Date: Fri May 28 11:05:43 2021 -0500
Fix file permissions failure during duplex upgrade abort
When issuing a downgrade for controller-0 in a duplex upgrade
abort and rollback scenario, the downgrade command was failing
because the sysinv API does not have root permissions to set
a file flag.
The fix is to use RPC so the conductor can create the flag
and allow the downgrade for controller-0 to get further.
Partial-Bug: 1929884
Signed-off-by: albailey <email address hidden>
Change-Id: I913bcad73309fe
commit 7ef3724dad17375
Author: Chen, Haochuan Z <email address hidden>
Date: Tue May 25 16:16:29 2021 +0800
Fix bug rook-ceph provision with multi osd on one host
Test case:
1, deploy simplex system
2, apply rook-ceph with below override value
value.yaml
cluster:
  storage:
    nodes:
    - name: controller-0
      devices:
      - name: sdb
      - name: sdc
3, reboot
Without this fix, only osd pod could launch successfully after boot
as vg start with ceph could not correctly add in sysinv-database
Closes-bug: 1929511
Change-Id: Ia5be599cd168d1
Signed-off-by: Chen, Haochuan Z <email address hidden>
commit 23505ba77d76114
Author: Angie Wang <email address hidden>
Date: Tue May 25 18:49:21 2021 -0400
Fix issue in partition data migration script
The created partition dictionary partition_map is not
an ordered dict so we need to sort it by its key -
device node when iterating it to adjust the device
nodes/paths for user created extra partitions to ensure
the number of device node...
OpenStack Infra (hudson-openstack) wrote : Change abandoned on config (f/centos8) | #34 |
Change abandoned by "Chuck Short <email address hidden>" on branch: f/centos8
Review: https:/
OpenStack Infra (hudson-openstack) wrote : | #35 |
Change abandoned by "Chuck Short <email address hidden>" on branch: f/centos8
Review: https:/
Fix proposed to branch: master
Review: https://review.opendev.org/c/starlingx/stx-puppet/+/791267