AIO standard profile: Incorrect Pod affinity

Bug #1832781 reported by Brent Rowsell
Affects   | Status       | Importance | Assigned to | Milestone
StarlingX | Fix Released | High       | Jim Gauld   |

Bug Description

Brief Description
-----------------
On an AIO configured for openstack (i.e. with the openstack k8s labels), pods should be affined to the platform cores. Currently they are allowed to float across all cores. This causes unpredictable performance for the VM workloads pinned to the application cores.
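
One way to confirm the floating behaviour on a running AIO node is to compare the allowed CPU list of a pod's process with the platform cores. The following is a minimal sketch and not part of the reported fix: the helper names and the example platform core set are assumptions for illustration. Only the `Cpus_allowed_list` field of /proc/<pid>/status is standard Linux.

```python
import re


def parse_cpulist(text):
    """Expand a kernel cpulist such as '0-1,4' into a set of CPU ids."""
    cpus = set()
    for part in text.strip().split(","):
        if not part:
            continue
        lo, _, hi = part.partition("-")
        # A single id like '4' has no upper bound, so reuse the lower one.
        cpus.update(range(int(lo), int(hi or lo) + 1))
    return cpus


def allowed_cpus(status_text):
    """Extract Cpus_allowed_list from the text of /proc/<pid>/status."""
    match = re.search(r"^Cpus_allowed_list:\s*(\S+)", status_text, re.MULTILINE)
    if match is None:
        raise ValueError("Cpus_allowed_list not found")
    return parse_cpulist(match.group(1))


def floats_off_platform(status_text, platform_cpus):
    """Return True if the task may run on any CPU outside the platform set."""
    return not allowed_cpus(status_text) <= set(platform_cpus)
```

On a node, the input would be the contents of /proc/<pid>/status for a process running inside a pod. A True result for such a process matches the behaviour reported here.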

Severity
--------
Major

Steps to Reproduce
------------------
Install system

Expected Behavior
------------------
See above

Actual Behavior
----------------
See above

Reproducibility
---------------
100%

System Configuration
--------------------
AIO-SX, AIO-DX standard profile

Branch/Pull Time/Commit
-----------------------
Any recent load

Last Pass
---------
N/A

Timestamp/Logs
--------------
N/A

Test Activity
-------------
Other

tags: added: stx.2.0 stx.containers
Changed in starlingx:
importance: Undecided → High
Ghada Khalil (gkhalil) wrote :

Marking as stx.2.0 gating; AIO platform processes need to be affined to specific cores for deterministic performance of the application workload

Changed in starlingx:
assignee: nobody → Jim Gauld (jgauld)
status: New → Triaged
OpenStack Infra (hudson-openstack) wrote : Fix proposed to config (master)

Fix proposed to branch: master
Review: https://review.opendev.org/667972

Changed in starlingx:
status: Triaged → In Progress
OpenStack Infra (hudson-openstack) wrote : Fix merged to config (master)

Reviewed: https://review.opendev.org/667972
Committed: https://git.openstack.org/cgit/starlingx/config/commit/?id=dba41755233b2c5e9f8db6ee275f69873ab95612
Submitter: Zuul
Branch: master

commit dba41755233b2c5e9f8db6ee275f69873ab95612
Author: Jim Gauld <email address hidden>
Date: Thu Jun 27 11:47:12 2019 -0400

    AIO reaffine tasks and k8s-infra during startup

    This update reimplements the affine-tasks init script and service to
    dynamically reaffine tasks and k8s-infra cgroup cpuset on AIO nodes.
    This accommodates CPU intensive phases of work. Tasks are initially
    allowed to float across all cores. Once the system is at steady-state,
    this will ensure that K8S pods are constrained to platform cores and
    do not run on cores with VMs/containers.

    This will speed up the first stx-application apply, as well as pod
    recovery after lock/unlock, reboot, and controller swact.

    This script waits forever for sufficient platform readiness criteria
    (e.g., system critical pods are recovered, critical openstack pods
    are running, nova-compute pod is running) before reaffining back
    to platform cores.

    This corrects the pod affinity problem seen on AIO introduced by fix
    for bug: 1826592, commit e513baad44181f667085886007632d0ebf79eeb0,
    i.e., fix allowed the AIO to not timeout, but left pods floating.

    Change-Id: Ic257378eac451904a200a0f2e79f7bc4f8373009
    Partial-Bug: 1832781
    Signed-off-by: Jim Gauld <email address hidden>
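
The wait-then-reaffine flow described in the commit message above can be sketched roughly as follows. This is an illustration in Python, not the affine-tasks script itself: the function names, the polling interval, and the injectable `sleep` and `setaffinity` parameters are hypothetical, and the k8s-infra cgroup cpuset update is not shown. `os.sched_setaffinity` is the standard, Linux-only call for setting a task's CPU affinity.

```python
import os
import time


def wait_until_ready(checks, interval=10.0, sleep=time.sleep):
    """Poll until every readiness check returns True. There is no timeout.

    `checks` is a list of zero-argument callables, e.g. hypothetical probes
    for "system critical pods recovered" or "nova-compute pod running".
    Returns the number of polls that found the system not yet ready.
    """
    polls = 0
    while not all(check() for check in checks):
        polls += 1
        sleep(interval)
    return polls


def reaffine(pids, cpus, setaffinity=None):
    """Restrict each pid to `cpus`; skip tasks that exited or reject the change."""
    # Resolve the default lazily so the module still imports on non-Linux hosts.
    setaffinity = setaffinity or os.sched_setaffinity
    moved = []
    for pid in pids:
        try:
            setaffinity(pid, cpus)
        except OSError:
            # Covers vanished tasks, permission errors, and tasks whose
            # affinity cannot be changed.
            continue
        moved.append(pid)
    return moved
```

In this sketch tasks would first be reaffined to all cores, then `wait_until_ready` would block, and finally `reaffine` would be called again with the platform cores only.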

OpenStack Infra (hudson-openstack) wrote : Fix proposed to config (master)

Fix proposed to branch: master
Review: https://review.opendev.org/671144

OpenStack Infra (hudson-openstack) wrote : Fix merged to config (master)

Reviewed: https://review.opendev.org/671144
Committed: https://git.openstack.org/cgit/starlingx/config/commit/?id=696f987a174725cb2d87a56935ac45e3b2fa56cb
Submitter: Zuul
Branch: master

commit 696f987a174725cb2d87a56935ac45e3b2fa56cb
Author: Jim Gauld <email address hidden>
Date: Tue Jul 16 15:35:36 2019 -0400

    AIO reaffine DRBD tasks during startup

    This will speed up the initial DRBD sync on AIO when there is a limited
    number of platform cores by reaffining DRBD tasks to use all cpus.

    This enhances affine-tasks init script to dynamically reaffine CPU
    intensive DRBD tasks. The receiver threads (i.e., drbd_r_*)
    may use a full core each. On systems with fast disk, we notice the
    receiver threads and softirq processing get CPU limited by the
    number of platform cores configured.

    The DRBD receiver tasks are reaffined initially to float across all
    cores. This will poll for newly created DRBD resources and reaffine
    them as they are found until all DRBD resources have started.

    This script waits for sufficient platform readiness criteria. Once the
    system is at steady-state, this will ensure that DRBD tasks are
    constrained to platform cores and do not run on cores with
    VMs/containers. The DRBD configuration file affinity option is left
    as-is in case the DRBD kernel threads are restarted for some reason.

    Change-Id: I019137ea1cf3736768ad8882bd8d8628cc5c2857
    Closes-Bug: 1832781
    Signed-off-by: Jim Gauld <email address hidden>
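
The polling step for DRBD receiver threads described above might look something like the sketch below. It is not the code from the commit: the function names, the `expected` resource count, the `affine` callback, and the polling interval are all assumptions made for illustration. Only the `drbd_r_*` thread name pattern comes from the commit message.

```python
import fnmatch


def drbd_receivers(tasks, pattern="drbd_r_*"):
    """Return {pid: comm} for tasks whose name matches the receiver pattern."""
    return {
        pid: comm
        for pid, comm in tasks.items()
        if fnmatch.fnmatchcase(comm, pattern)
    }


def reaffine_new_receivers(list_tasks, expected, affine, sleep, interval=5.0):
    """Reaffine each DRBD receiver once as it appears.

    `list_tasks` returns a {pid: comm} snapshot of running tasks, `affine`
    is called once per newly seen receiver pid, and polling stops when
    `expected` receivers have been handled.
    """
    seen = set()
    while True:
        for pid in sorted(drbd_receivers(list_tasks())):
            if pid not in seen:
                affine(pid)
                seen.add(pid)
        if len(seen) >= expected:
            return sorted(seen)
        sleep(interval)
```

The `affine` callback would widen the thread to all cores during the initial sync. Constraining the threads back to platform cores after steady-state is a separate step not shown here.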

Changed in starlingx:
status: In Progress → Fix Released