AIO: Support running high priority RT cpu hog

Bug #1900342 reported by Jim Gauld
This bug affects 1 person
Affects    Status     Importance  Assigned to  Milestone
StarlingX  Won't Fix  Medium      Jim Gauld

Bug Description

Brief Description
-----------------
When application pods run tasks under the SCHED_RR (round-robin) real-time policy at high priority, critical Linux tasks can be starved of CPU time, leading to softdog timeouts and a host reboot.

The proposed update supports running application pods that contain high-priority RT CPU hogs, up to RR priority 50.

Severity
--------
Critical: low-latency systems will see unsatisfactory jitter and potential reboots.

Steps to Reproduce
------------------
Configure the AIO host with the label kube-cpu-mgr-policy=static:
system host-label-assign <hostname or id> kube-cpu-mgr-policy=static

Verify which CPUs are application cores:
system host-cpu-list controller-0

Run stress-ng in an application pod on the application cores with scheduler policy RR at priority 50, restricted to a subset of the application CPUs:

kubectl run stressng --image=alexeiled/stress-ng \
--overrides='{"apiVersion": "v1", "spec": { "nodeSelector": { "kubernetes.io/hostname": "controller-0" }, "containers" : [ {"name": "stressng", "image": "alexeiled/stress-ng", "args": [ "--matrix", "0", "--taskset", "2-35", "--sched", "rr", "--sched-prio", "50" ], "securityContext": { "privileged": true } } ] } }'

Note that this is only an example; there are many variations of applications and stress test options.
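
To make such variations easier to try, the --overrides value can be generated from the three settings that change between runs. This is a minimal sketch, not part of the original report: build_overrides is a hypothetical helper name, and the JSON it prints is the same pod spec as in the one-line command, only spread over several lines.

```shell
#!/bin/sh
# build_overrides HOST CPUS PRIO
# Hypothetical helper (not part of StarlingX or kubectl): print the pod
# overrides JSON used in the reproduction step.
#   HOST  node to schedule the pod on, e.g. controller-0
#   CPUS  subset of application cpus for --taskset, e.g. 2-35
#   PRIO  SCHED_RR priority for --sched-prio, e.g. 50
build_overrides() {
    host="$1"
    cpus="$2"
    prio="$3"
    cat <<EOF
{
  "apiVersion": "v1",
  "spec": {
    "nodeSelector": { "kubernetes.io/hostname": "${host}" },
    "containers": [
      {
        "name": "stressng",
        "image": "alexeiled/stress-ng",
        "args": [ "--matrix", "0", "--taskset", "${cpus}", "--sched", "rr", "--sched-prio", "${prio}" ],
        "securityContext": { "privileged": true }
      }
    ]
  }
}
EOF
}

# Usage, equivalent to the command in the reproduction step:
#   kubectl run stressng --image=alexeiled/stress-ng \
#       --overrides="$(build_overrides controller-0 2-35 50)"
```

Only the hostname, the cpu list and the priority are parameterized; everything else is copied unchanged from the reported command.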

Expected Behavior
-----------------
When running cyclictest in a pod, maximum jitter should stay below 20 usec.
The host should not lock up and reboot.
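
One way to check the 20 usec expectation automatically is to parse the per-thread summary lines that cyclictest prints. This is a sketch under assumptions: check_max_latency is a hypothetical helper, it assumes the summary lines contain a "Max:" field followed by a whitespace-separated value in microseconds, and the cyclictest options in the usage comment (priority, cpu, duration) are illustrative choices, not values from this report.

```shell
#!/bin/sh
# check_max_latency LIMIT_USEC
# Hypothetical helper: read cyclictest summary lines on stdin and succeed
# only if at least one "Max:" value was seen and all of them are below
# LIMIT_USEC.
check_max_latency() {
    awk -v limit="$1" '
        /Max:/ {
            for (i = 1; i < NF; i++) {
                if ($i == "Max:") {
                    seen = 1
                    if ($(i + 1) + 0 >= limit + 0) bad = 1
                }
            }
        }
        END {
            if (seen && !bad) exit 0
            exit 1
        }
    '
}

# Usage inside the test pod (option values are assumptions):
#   cyclictest -m -q -p 80 -i 1000 -t 1 -a 2 -D 60 | check_max_latency 20 \
#       && echo "max jitter below 20 usec" \
#       || echo "max jitter at or above 20 usec"
```

The check fails when no summary line is found, so a cyclictest run that dies early is not mistaken for a pass.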

Actual Behavior
---------------
The softdog timeout is hit and the host reboots.

Reproducibility
---------------
Depends on application settings. With settings that trigger the issue, it is 100% reproducible.

System Configuration
--------------------
AIO low-latency.

Branch/Pull Time/Commit
-----------------------
-

Last Pass
---------
-

Timestamp/Logs
--------------
-

Test Activity
-------------
Evaluation.

Workaround
----------
None.

Jim Gauld (jgauld)
Changed in starlingx:
assignee: nobody → Jim Gauld (jgauld)
OpenStack Infra (hudson-openstack) wrote : Fix proposed to metal (master)

Fix proposed to branch: master
Review: https://review.opendev.org/758689

Changed in starlingx:
status: New → In Progress
OpenStack Infra (hudson-openstack) wrote : Fix proposed to utilities (master)

Fix proposed to branch: master
Review: https://review.opendev.org/758690

OpenStack Infra (hudson-openstack) wrote : Fix proposed to stx-puppet (master)

Fix proposed to branch: master
Review: https://review.opendev.org/758692

OpenStack Infra (hudson-openstack) wrote : Fix merged to metal (master)

Reviewed: https://review.opendev.org/758689
Committed: https://git.openstack.org/cgit/starlingx/metal/commit/?id=b8fb623dc940ec2aed46dbca96bb3fd7040987fa
Submitter: Zuul
Branch: master

commit b8fb623dc940ec2aed46dbca96bb3fd7040987fa
Author: Jim Gauld <email address hidden>
Date: Sun Oct 18 21:07:19 2020 -0400

    Enable 'rcu_nocb_poll' kernel config option

    This update adds 'rcu_nocb_poll' kernel config option to
    aio_and_worker kickstarts on low-latency systems.

    This relieves each CPU from the responsibility of awakening
    their RCU offload threads.

    Change-Id: I99d06d3018c01da27376f612b3afc4a14e85d25e
    Partial-Bug: 1900342
    Signed-off-by: Jim Gauld <email address hidden>
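
On a host installed from the updated kickstart, whether the option took effect can be checked against the kernel command line, since rcu_nocb_poll is passed to the kernel as a boot parameter. This is a minimal sketch: has_boot_arg is a hypothetical helper, and the optional second argument exists only so the check can be run against a saved command line instead of /proc/cmdline.

```shell
#!/bin/sh
# has_boot_arg NAME [CMDLINE]
# Hypothetical helper: succeed if NAME appears as a whole word in the
# kernel command line. CMDLINE defaults to the contents of /proc/cmdline.
has_boot_arg() {
    name="$1"
    cmdline="${2:-$(cat /proc/cmdline)}"
    for arg in $cmdline; do
        if [ "$arg" = "$name" ]; then
            return 0
        fi
    done
    return 1
}

# Usage on the low-latency host:
#   has_boot_arg rcu_nocb_poll && echo "rcu_nocb_poll is set"
```

Matching whole words avoids a false positive from a longer parameter whose name merely contains the string.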

Ghada Khalil (gkhalil)
Changed in starlingx:
importance: Undecided → Medium
tags: added: stx.5.0 stx.config
Ghada Khalil (gkhalil)
tags: added: stx.metal
OpenStack Infra (hudson-openstack) wrote : Change abandoned on stx-puppet (master)

Change abandoned by Jim Gauld (<email address hidden>) on branch: master
Review: https://review.opendev.org/758692
Reason: No longer needed

OpenStack Infra (hudson-openstack) wrote : Change abandoned on utilities (master)

Change abandoned by Jim Gauld (<email address hidden>) on branch: master
Review: https://review.opendev.org/758690
Reason: No longer needed

Jim Gauld (jgauld)
Changed in starlingx:
status: In Progress → New
status: New → In Progress
Ghada Khalil (gkhalil) wrote :

As discussed with Jim Gauld, this LP will no longer be pursued and the code has been abandoned. Closing as "Won't Fix".

Changed in starlingx:
status: In Progress → Won't Fix