AIO-DX task affining triggered by swact does not work

Bug #1928836 reported by Jim Gauld
Affects: StarlingX
Status: Fix Released
Importance: Medium
Assigned to: Jim Gauld

Bug Description

Brief Description
-----------------
On AIO-DX, an SM swact invokes a task affinity script to move tasks to idle cores at the start of the swact and back to platform cores at the end. This is intended to dramatically speed up the swact. There are cases where the end-of-swact condition does not trigger (e.g., if a minor service fails), which leaves platform tasks running on both platform and application cores.

Issuing a swact before the openstack application is installed will dramatically slow down that installation, because the swact interacts with the separate affine-tasks.sh init script and moves tasks to platform cores prematurely.

Severity
--------
Major: System is usable but degraded.
* Latency impact to Applications, no longer isolated from Platform
* Openstack install can be slowed down

Steps to Reproduce
------------------
AIO-DX, swact back and forth.
Cause a service to fail or be disabled (eg, ceph-osd, nfv-vim).

Expected Behavior
------------------
The swact should complete, and platform processes should be re-affined back to platform cores.

Actual Behavior
----------------
The swact does not actually complete because of a minor failed service, and tasks remain floating on application cores. Subsequent swacts and host reboots do not fix the affinity: a lingering flag file is created in a persistent disk location, so tasks stay stuck in that state indefinitely.

There is also an interaction with the installation of openstack, which would be perceived as a slowdown. If there is an openstack-compute-node label and a swact is issued before openstack is installed, the swact puts tasks on platform cores when it should not.
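
To confirm whether a given platform task is still floating on application cores, its allowed CPU list can be compared against the platform cores. The following is a minimal sketch, not part of the reported fix: the function names are hypothetical, the platform core list is passed in by the caller (this report does not say where it is configured), and it relies only on the Cpus_allowed_list field of /proc/<pid>/status on Linux.

```shell
#!/bin/bash
# Expand a kernel-style CPU list such as "0-1,4" into "0 1 4".
expand_cpulist() {
    local list="$1" part lo hi cpu out=""
    local IFS=','
    for part in $list; do
        lo=${part%-*}      # "0-1" -> "0", "4" -> "4"
        hi=${part#*-}      # "0-1" -> "1", "4" -> "4"
        for ((cpu = lo; cpu <= hi; cpu++)); do
            out+="${cpu} "
        done
    done
    echo "${out% }"
}

# Return 0 if every CPU in list $1 is also present in list $2.
cpulist_within() {
    local cpu allowed
    allowed=" $(expand_cpulist "$2") "
    for cpu in $(expand_cpulist "$1"); do
        [[ "$allowed" == *" ${cpu} "* ]] || return 1
    done
    return 0
}

# Hypothetical check: is task $1 restricted to the platform cores in $2?
task_on_platform_cores() {
    local pid="$1" platform_cpus="$2" cpus
    cpus=$(awk '/^Cpus_allowed_list:/ {print $2}' "/proc/${pid}/status")
    [ -n "$cpus" ] && cpulist_within "$cpus" "$platform_cpus"
}
```

For example, `task_on_platform_cores 95058 "0-1"` would return non-zero if PID 95058 is still allowed to run outside cores 0 and 1; both the PID and the core list are illustrative values.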

Reproducibility
---------------
Intermittent if swacting manually since we don't generally have failing services.
More frequent in specific Sanity TCs that swact as setup step.

System Configuration
--------------------
AIO-DX only.

Branch/Pull Time/Commit
-----------------------
NA

Last Pass
---------
NA

Timestamp/Logs
--------------
See /var/log/sm.log for swact actions. If the following appears, the swact never reaches its end:
2021-05-12T20:50:47.000 controller-1 sm: debug time[1375.555] log<9096> INFO: sm[95058]: sm_service_enable.c(461): Started enable action (563241) for service (ceph-osd).
2021-05-12T20:50:47.000 controller-1 sm: debug time[1376.181] log<9097> INFO: sm[95058]: sm_service_enable.c(363): Action (enable) completed with result (failed), state (unknown), status (unknown), and condition (unknown) for service (ceph-osd), reason_text=, exit_code=1.

On both controllers, SM can be seen invoking sm_task_affining_thread at the beginning and end of a swact.
controller-0_20210226.205023/var/log $ !grep
grep -rs "affining to " .
./sm.log:2021-02-26T20:36:16.000 controller-0 sm: debug time[7024.633] log<3> INFO: sm_ta[99033]: sm_task_affining_thread.c(39): Invoking system call, affining to idle cores...

./sm.log:2021-02-26T20:36:33.000 controller-0 sm: debug time[7041.960] log<7> INFO: sm_ta[99033]: sm_task_affining_thread.c(45): Invoking system call, affining to platform cores...
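
A quick way to spot a swact that never re-affined tasks is to compare how many "idle cores" and "platform cores" messages a log contains. This is a rough sketch only: the function name check_affining_balance is hypothetical, and it assumes nothing beyond the message text shown in the grep output above.

```shell
#!/bin/bash
# Hypothetical helper: report whether every "affining to idle cores" entry
# in an SM log is matched by an "affining to platform cores" entry.
check_affining_balance() {
    local logfile="$1" idle platform
    # grep -c prints the number of matching lines (0 when there are none).
    idle=$(grep -c "affining to idle cores" "$logfile" 2>/dev/null)
    platform=$(grep -c "affining to platform cores" "$logfile" 2>/dev/null)
    idle=${idle:-0}
    platform=${platform:-0}
    echo "idle=${idle} platform=${platform}"
    # Non-zero return: at least one swact moved tasks to idle cores
    # without a later move back to platform cores.
    [ "$idle" -le "$platform" ]
}
```

Running `check_affining_balance /var/log/sm.log` on each controller prints the two counts; a non-zero exit status suggests an unfinished swact.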

SM is essentially doing the following:
# start of swact
source /etc/init.d/task_affinity_functions.sh; affine_tasks_to_idle_cores

# end of swact
source /etc/init.d/task_affinity_functions.sh; affine_tasks_to_platform_cores

Test Activity
-------------
Sanity

Workaround
----------
None

Tags: stx.6.0 stx.ha
Jim Gauld (jgauld)
Changed in starlingx:
assignee: nobody → Jim Gauld (jgauld)
status: New → Confirmed
status: Confirmed → In Progress
OpenStack Infra (hudson-openstack) wrote : Fix proposed to utilities (master)

Fix proposed to branch: master
Review: https://review.opendev.org/c/starlingx/utilities/+/792028

OpenStack Infra (hudson-openstack) wrote : Fix merged to utilities (master)

Reviewed: https://review.opendev.org/c/starlingx/utilities/+/792028
Committed: https://opendev.org/starlingx/utilities/commit/bb2932c067c410d6d9f22e1638cf5484f55cdfb1
Submitter: "Zuul (22348)"
Branch: master

commit bb2932c067c410d6d9f22e1638cf5484f55cdfb1
Author: Jim Gauld <email address hidden>
Date: Tue May 18 13:48:53 2021 -0400

    AIO-DX swact task affinity robustness

    Task affinity functions are used to speedup initialization of AIO
    and swact on AIO-DX. When swact occurs, SM leverages task affining
    scripts to move platform tasks to idle cores, followed by moving
    platform tasks back to platform cores at the end of the swact.

    This change adds a timeout of 90 seconds so that tasks are always
    affined back to platform cores even if the swact does not complete
    (e.g., due to failed or disabled minor service).

    This also corrects interactions of the task_affinity_functions with
    the affine-tasks.sh init script by checking if the service is running,
    and by updating/removing a common flag file. This will also improve the
    task affinity handling of openstack installation and startup, since
    the affine-tasks.sh script assumes tasks float across cores until
    nova-compute is providing service.

    Closes-Bug: 1928836

    Signed-off-by: Jim Gauld <email address hidden>
    Change-Id: Ief5c65103f98e9ffb57f96327af1e0dd35d13857
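
The timeout behaviour described in the commit message can be pictured with a small watchdog. This is an illustrative sketch under stated assumptions, not the merged implementation: swact_begin, swact_end, the two stub functions, and the flag path are all hypothetical names, and only the 90-second value comes from the commit message. The flag file is placed under /tmp here so that, unlike the persistent flag described in the bug, it would not survive a reboot.

```shell
#!/bin/bash
# Sketch of a "re-affine on timeout" guard. All names are hypothetical
# stand-ins for the logic in task_affinity_functions.sh.

AFFINE_TIMEOUT=${AFFINE_TIMEOUT:-90}                      # seconds, per the commit message
AFFINE_FLAG=${AFFINE_FLAG:-/tmp/.swact_affining_example}  # hypothetical, volatile path

move_to_idle_cores()     { echo "affining to idle cores"; }      # stub
move_to_platform_cores() { echo "affining to platform cores"; }  # stub

# End of swact (or watchdog expiry): restore affinity at most once.
swact_end() {
    # rm succeeds only for the first caller, so the restore runs once.
    if rm -- "$AFFINE_FLAG" 2>/dev/null; then
        move_to_platform_cores
    fi
}

# Start of swact: spread tasks out and arm the watchdog.
swact_begin() {
    touch "$AFFINE_FLAG"
    move_to_idle_cores
    # If swact_end is never called (e.g. a minor service failed),
    # the watchdog restores affinity after AFFINE_TIMEOUT seconds.
    ( sleep "$AFFINE_TIMEOUT"; swact_end ) &
}
```

Using the removal of the flag file as the "who restores" decision means a normal end of swact and the watchdog cannot both re-affine tasks, and no stale flag is left behind in either case.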

Changed in starlingx:
status: In Progress → Fix Released
Ghada Khalil (gkhalil)
Changed in starlingx:
importance: Undecided → Medium
tags: added: stx.6.0 stx.ha