Enhance schedtop with blocked_max, Disk waiters, and watch commands

Bug #1927772 reported by Jim Gauld
6
This bug affects 1 person
Affects Status Importance Assigned to Milestone
StarlingX
Fix Released
Low
Jim Gauld

Bug Description

Brief Description
-----------------
The 'schedtop' tool use used to do engineering analysis of process scheduling and disk-IO.

A tool enhancement prototype was created diagnose poor latency of processes in general, with focus on the following recent issues.
- specifically watch sm-eru and trigger a sysrq to get coredump when significant delay
- measure latency of etcd threads, compare to kern.log for ETCD server

Added the following command line options:
[--watch-cmd=tid1,cmd1,cmd2,...] [--watch-only] [--watch-quiet]
[--trig-delay=time]

This enables watching specific processes and being able to trigger a sysrq when delay threshold is reached, eg, ig either "wait_max" (scheduler wait) or "block_max" (involuntary wait, includes IO wait) exceeds specified trigger delay.

Added field "bmax" corresponds to linux scheduler stats "blocked_max".
Increase field width of pid,tid,ppid to 7.

Added the "D:<n> " field to the heading row, this indicates number of Disk-sleep "D" tasks.

Example usage (showing the 'etcd' watched command at the top...) :
Notice the runq:x blk:x D:x output.
Notice the 'bmax' column (milliseconds).

schedtop --watch-cmd=etcd --period=4 --delay=4
selected options: delay = 4.000s, repeat = 1, idle=no_idle_tasks, hwm=unchanged, sort=cpu, print=full
selected watch/trigger options: watch-cmd=etcd, only=false, quiet=false, delay=0 ms
watching: tid=1976695, comm=etcd
watching: tid=1976936, comm=etcd
watching: tid=1976684, comm=etcd
watching: tid=1976793, comm=etcd
watching: tid=1976693, comm=etcd
watching: tid=1976782, comm=etcd
watching: tid=1976700, comm=etcd
watching: tid=1976794, comm=etcd
watching: tid=1976691, comm=etcd
watching: tid=2021819, comm=etcd
watching: tid=1976699, comm=etcd
watching: tid=1976692, comm=etcd
watching: tid=2133679, comm=etcd
schedtop 0.1 -- 2021-05-07 18:06:43.442 dt:5284.124 ms ldavg:5.24, 2.96, 2.45 runq:1 blk:0 D:0 nproc:4444 up:3:01:48:20
  host:controller-0 nodetype:controller subfunction:controller,worker,lowlatency
  arch:x86_64 processor:Intel(R) Xeon(R) Gold 6142 CPU @ 2.60GHz speed:3300 #CPUs:32
  Linux 3.10.0-1160.15.2.rt56.1152.el7.tis.3.x86_64 build:21.05 2021-05-02_20-00-06
core: total 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31
occ: 207.4 41.2 19.0 1.1 0.7 1.0 0.9 0.7 11.8 1.2 0.7 0.9 0.7 0.8 0.5 1.0 0.7 1.4 6.3 8.2 4.5 17.0 14.0 18.7 16.5 10.7 0.5 10.9 0.3 0.2 4.3 3.1 7.8

    TID PID PPID S P AFF PO NI PR ctxt migr occ tlen tmax delay dmax bmax iowt read write wcncl rsysc wsysc wchan comm cmdline
1976793 1976684 1976683 S 1 0x3 OT -19 1 574 50 0.51 0.047 1.0 3.237 581.4 1370.3 0.04 0 122.880k 0 66 112 futex_wait_queue_me etcd /usr/bin/etcd --name=controller --data-dir=/opt/etcd/21.05/contro
1976936 1976684 1976683 S 1 0x3 OT -19 1 445 36 0.28 0.034 1.0 5.665 735.1 1406.3 0.05 0 126.976k 0 76 133 ep_poll etcd /usr/bin/etcd --name=controller --data-dir=/opt/etcd/21.05/contro
2021819 1976684 1976683 S 1 0x3 OT -19 1 485 26 0.26 0.028 1.0 1.393 600.7 967.2 0.04 0 77.824k 0 108 137 futex_wait_queue_me etcd /usr/bin/etcd --name=controller --data-dir=/opt/etcd/21.05/contro
1976684 1976684 1976683 S 1 0x3 OT -19 1 441 52 0.25 0.030 1.0 7.856 661.5 1153.2 0.04 0 225.280k 0 40 82 futex_wait_queue_me etcd /usr/bin/etcd --name=controller --data-dir=/opt/etcd/21.05/contro
1976693 1976684 1976683 S 1 0x3 OT -19 1 357 18 0.14 0.020 1.0 27.530 599.9 531.1 0.01 0 49.152k 0 30 51 futex_wait_queue_me etcd /usr/bin/etcd --name=controller --data-dir=/opt/etcd/21.05/contro
1976691 1976684 1976683 S 1 0x3 OT -19 1 1750 234 0.13 0.004 1.0 6.881 23.9 0.0 0.00 0 0 0 0 0 futex_wait_queue_me etcd /usr/bin/etcd --name=controller --data-dir=/opt/etcd/21.05/contro
1976692 1976684 1976683 S 1 0x3 OT -19 1 467 26 0.10 0.012 1.0 11.619 753.6 132.0 0.00 0 24.576k 0 21 35 futex_wait_queue_me etcd /usr/bin/etcd --name=controller --data-dir=/opt/etcd/21.05/contro
 924506 910594 910583 R 23 0xffffffff OT 0 20 16214 7 50.23 0.164 45.8 880.436 682.3 1473.7 0.00 0 0 0 307.621k 2 0 elasticsearch[m /usr/share/elasticsearch/jdk/bin/java -Xshare:auto -Des.networkad
 924502 924502 912543 R 24 0xffffffff OT 15 35 35376 4 47.87 0.072 2.9 2108.556 682.3 1473.7 0.00 0 0 0 0 831.926k 0 autodetect ./autodetect --jobid=wra-system-network-egress-packets-rates-net
1285760 1285711 1285692 S 16 0xffffffff OT 0 20 545 4 26.94 2.612 711.6 611.718 650.7 0.0 0.00 0 0 0 0 0 futex_wait_queue_me VM Thread /bin/java -Xms1g -Xmx1g -XX:+UseConcMarkSweepGC -XX:CMSInitiating
. . .

Severity
--------
Minor: No impact to system, no risk.
HOWEVER:
Tool has high value for potential performance issues debugging,
e.g., pinpoint processes and actually quantify values.
Useful for system engineering.

Steps to Reproduce
------------------
NA

Expected Behavior
------------------
NA

Actual Behavior
----------------
NA

Reproducibility
---------------
NA

System Configuration
--------------------
All configs, all nodetypes.

Branch/Pull Time/Commit
-----------------------
NA

Last Pass
---------
NA

Timestamp/Logs
--------------
NA

Test Activity
-------------
Sanity, Evaluation, Developer Testing.

Workaround
----------
None.
No other tool provides this functionality.

Jim Gauld (jgauld)
Changed in starlingx:
assignee: nobody → Jim Gauld (jgauld)
Revision history for this message
Ghada Khalil (gkhalil) wrote :

screening: marking as low priority / not gating. Tool enhancement, not a bug.

tags: added: stx.tools
Changed in starlingx:
importance: Undecided → Low
status: New → Triaged
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to monitoring (master)

Fix proposed to branch: master
Review: https://review.opendev.org/c/starlingx/monitoring/+/819498

Changed in starlingx:
status: Triaged → In Progress
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to monitoring (master)

Reviewed: https://review.opendev.org/c/starlingx/monitoring/+/819498
Committed: https://opendev.org/starlingx/monitoring/commit/34c2ef786555d56e26d3c09b63d9ed464cdcd0ea
Submitter: "Zuul (22348)"
Branch: master

commit 34c2ef786555d56e26d3c09b63d9ed464cdcd0ea
Author: Jim Gauld <email address hidden>
Date: Fri Nov 26 15:44:51 2021 -0500

    Enhance schedtop with blocked_max, disk waiters, and watch commands

    The 'schedtop' monitoring tool is used to do engineering
    analysis of process scheduling, disk IO, and latency.

    This enhances the schedtop monitoring tool with:
    - additional fields "bmax" latency and "D" disk-sleep tasks
    - command-line options to watch specific tasks and mechanism to
      trigger sysrq

    The following new fields are reported:
    - "bmax" milliseconds, corresponds to linux scheduler stats
      "blocked_max". This represents involuntary wait of scheduling
      and IO wait.
    - "D:<n>", the current number of disk-sleep "D" tasks.

    The following command line options are added to be able to watch
    specific processes, and optionally trigger a sysrq (i.e., force
    a crashdump) when trigger delay threshold milliseconds is reached.
    [--watch-cmd=tid1,cmd1,cmd2,...] [--watch-only] [--watch-quiet]
    [--trig-delay=time]

    The --watch-cmd option matches process names 'comm' field pattern.

    The --watch-only option watches and displays only the subset of
    tasks discovered at tool startup. This dramatically reduces the
    tool cpu overhead.

    The --watch-quiet displays no sample output after tool startup,
    the only output occurs when the --trig-delay is exceeded.

    The --trig-delay=time option will trigger a sysrq to force a crash
    dump any watched process "bmax" delay exceeds trigger delay time
    in milliseconds.

    Example: collect 1 minute of data, monitor all tasks,
             reset scheduler hiwatermark statistics

    schedtop \
    --period=60 --reset-hwm

    Example: collect 1 minute of data, watch specific tasks

    schedtop \
    --period=60 --reset-hwm \
    --watch-cmd=jbd2,kube-apiserver,etcd,forward-journald,containerd \
    --watch-only

    Example: watch specific tasks and trigger sysrq when any of the
             watched commands exceed 10000ms delay (10 seconds)

    schedtop \
    --period=36000 --reset-hwm \
    --watch-cmd=jbd2,kube-apiserver,etcd,forward-journald,containerd \
    --watch-only \
    --trig-delay=10000

    Testcases:
    PASS: Collect standard tool output, verified new bmax and D fields
    PASS: Verify --watch-cmds will detect the specified commands or tids
    PASS: Verify --watch-only will only display watched commands
    PASS: Verify --trig-delay will generate a sysrq
    PASS: Verify comm field is limited to 15 characters wide

    Closes-Bug: 1927772

    Signed-off-by: Jim Gauld <email address hidden>
    Change-Id: I5368aac66b24608f5eab366cd929be4c0d4a1f76

Changed in starlingx:
status: In Progress → Fix Released
Revision history for this message
Ghada Khalil (gkhalil) wrote :

screening: Adding stx.6.0 since the fix will be available for that release

tags: added: stx.6.0
To post a comment you must log in.
This report contains Public information  
Everyone can see this information.

Other bug subscribers

Remote bug watches

Bug watches keep track of this bug in other bug trackers.