all health check timers are triggered at the same time causing process congestion.

Bug #1814870 reported by Sofer Athlan-Guyot on 2019-02-06
6
This bug affects 1 person
Affects Status Importance Assigned to Milestone
tripleo
Critical
Sofer Athlan-Guyot

Bug Description

Hi,

on a undercloud vm with 24GB of ram and 8vcpu I had very high load average, around 10 at some point and always above 7.

Looking at vmstat at saw that strange pattern:

    54 1 0 14816040 2080 1277332 0 0 0 63 11143 8774 40 44 15 0 0
    26 0 0 14626328 2080 1277712 0 0 0 568 19025 24207 51 41 8 0 0
    47 0 0 14602848 2080 1278100 0 0 0 676 13876 14737 39 59 2 0 0
    55 1 0 14594476 2080 1278304 0 0 0 876 12595 11443 44 56 0 0 0
    40 0 0 14692028 2080 1278564 0 0 0 838 11700 8938 47 52 0 0 0
    21 0 0 14920588 2080 1278492 0 0 0 1516 9459 3986 41 58 0 0 0
     9 0 0 15120312 2080 1277936 0 0 0 523 9612 4015 50 42 9 0 0
     3 0 0 15148256 2080 1277600 0 0 0 137 10596 11994 38 22 40 0 0
     6 0 0 15145900 2080 1277608 0 0 0 6 11206 9994 30 16 54 0 0
     3 0 0 15145164 2080 1277604 0 0 0 15 12332 11715 31 18 51 0 0
     1 0 0 15144668 2080 1277608 0 0 0 105 7202 6218 22 12 66 0 0
     2 0 0 15143828 2080 1277612 0 0 0 12 4207 3210 16 10 74 0 0
     2 0 0 15144380 2080 1277620 0 0 0 297 9835 9923 28 16 56 0 0
     4 0 0 15145296 2080 1277592 0 0 0 112 4024 2806 15 10 75 0 0
     2 0 0 15145792 2080 1277600 0 0 0 111 4254 2956 17 10 73 0 0
     1 0 0 15145916 2080 1277600 0 0 0 15 4351 2901 16 11 73 0 0
     2 0 0 15144628 2080 1277600 0 0 0 110 5324 4076 17 12 71 0 0
     1 0 0 15208584 2080 1277404 0 0 0 73 4755 3455 14 8 78 0 0
     4 0 0 15208584 2080 1277412 0 0 0 105 3125 2960 12 2 86 0 0
     1 0 0 15208600 2080 1277416 0 0 0 19 3197 2893 12 2 86 0 0
     1 0 0 15208600 2080 1277416 0 0 0 1132 3072 2785 11 3 86 0 0
     3 0 0 15209988 2080 1277424 0 0 0 14 3058 2843 10 2 88 0 0
    procs -----------memory---------- ---swap-- -----io---- -system-- ------cpu-----
     r b swpd free buff cache si so bi bo in cs us sy id wa st
     7 0 0 15209988 2080 1277428 0 0 0 28 3331 3036 14 2 84 0 0
     0 0 0 15209988 2080 1277432 0 0 0 23 3002 2754 10 2 88 0 0
     3 0 0 15209864 2080 1277444 0 0 16 448 3610 3051 16 3 81 0 0
     1 0 0 15209864 2080 1277444 0 0 0 13 4635 3206 27 3 70 0 0
     4 0 0 15210112 2080 1277428 0 0 0 0 4055 3203 22 4 74 0 0
     6 0 0 15210112 2080 1277432 0 0 0 31 4898 3403 27 4 69 0 0
     2 0 0 15210112 2080 1277436 0 0 0 427 3196 2953 12 2 86 0 0
     0 0 0 15209740 2080 1277440 0 0 0 13 3384 3558 11 2 87 0 0
     2 0 0 15209848 2080 1277440 0 0 0 12 3165 3006 12 2 86 0 0
    66 0 0 14804984 2080 1277788 0 0 4 84 12285 9695 41 50 9 0 0
    31 0 0 14632556 2080 1278108 0 0 0 525 17864 21503 55 41 4 0 0
    40 0 0 14601112 2080 1278620 0 0 0 621 13667 14390 45 54 1 0 0
    51 0 0 14613972 2080 1279236 0 0 0 737 12622 10738 45 55 0 0 0
    48 1 0 14656384 2080 1279412 0 0 0 1352 11807 9388 46 53 0 0 0
    22 0 0 14795132 2080 1279444 0 0 0 640 9663 4042 43 56 0 0 0
    15 0 0 15052900 2080 1279084 0 0 0 845 9565 4125 49 50 1 0 0
     5 0 0 15118256 2080 1278244 0 0 0 165 10547 8197 52 29 18 0 0

A high number of processes were triggered at the same time causing cpu
congestion.

Looking at the process diff using

   ps -eo pid,tid,ppid,user,class,rtprio,ni,pri,args

I could see that the healthcheck were the culprits:

    577c577,617
    < 62156 62156 47609 stack TS - 0 19 ps -eo pid,tid,ppid,user,class,rtprio,ni,pri,args
    ---
    > 60720 60720 1 root TS - 0 19 [podman] <defunct>
    > 60721 60721 1 root TS - 0 19 /usr/bin/podman exec heat_api_cfn /openstack/healthcheck
    > 60723 60723 1 root TS - 0 19 /usr/bin/podman exec mistral_api /openstack/healthcheck
    > 60725 60725 1 root TS - 0 19 /usr/bin/podman exec neutron_api /openstack/healthcheck
    > 60727 60727 1 root TS - 0 19 /usr/bin/podman exec ironic_conductor /openstack/healthcheck
    > 60728 60728 1 root TS - 0 19 /usr/bin/podman exec ironic_api /openstack/healthcheck
    > 60729 60729 1 root TS - 0 19 /usr/bin/podman exec nova_metadata /openstack/healthcheck
    > 60730 60730 1 root TS - 0 19 /usr/bin/podman exec heat_api /openstack/healthcheck
    > 60731 60731 1 root TS - 0 19 /usr/bin/podman exec nova_placement /openstack/healthcheck
    > 60735 60735 1 root TS - 0 19 /usr/bin/podman exec mistral_event_engine /openstack/healthcheck
    > 60743 60743 1 root TS - 0 19 /usr/bin/podman exec nova_api /openstack/healthcheck
    > 60746 60746 1 root TS - 0 19 /usr/bin/podman exec neutron_ovs_agent /openstack/healthcheck
    > 60747 60747 1 root TS - 0 19 /usr/bin/podman exec glance_api /openstack/healthcheck
    > 60749 60749 1 root TS - 0 19 /usr/bin/podman exec swift_account_server /openstack/healthcheck
    > 60752 60752 1 root TS - 0 19 /usr/bin/podman exec rabbitmq /openstack/healthcheck
    > 60754 60754 1 root TS - 0 19 /usr/bin/podman exec keystone /openstack/healthcheck
    > 60761 60761 1 root TS - 0 19 /usr/bin/podman exec mysql /openstack/healthcheck
    > 60762 60762 1 root TS - 0 19 /usr/bin/podman exec nova_conductor /openstack/healthcheck
    > 60764 60764 1 root TS - 0 19 /usr/bin/podman exec ironic_inspector /openstack/healthcheck
    > 60765 60765 1 root TS - 0 19 /usr/bin/podman exec mistral_executor /openstack/healthcheck
    > 60767 60767 1 root TS - 0 19 /usr/bin/podman exec swift_object_server /openstack/healthcheck
    > 60772 60772 1 root TS - 0 19 /usr/bin/podman exec neutron_dhcp /openstack/healthcheck
    > 60773 60773 1 root TS - 0 19 /usr/bin/podman exec ironic_pxe_tftp /openstack/healthcheck
    > 60774 60774 1 root TS - 0 19 /usr/bin/podman exec mistral_engine /openstack/healthcheck
    > 60775 60775 1 root TS - 0 19 /usr/bin/podman exec memcached /openstack/healthcheck
    > 60777 60777 1 root TS - 0 19 /usr/bin/podman exec swift_container_server /openstack/healthcheck
    > 60787 60787 1 root TS - 0 19 /usr/bin/podman exec neutron_l3_agent /openstack/healthcheck
    > 60795 60795 1 root TS - 0 19 /usr/bin/podman exec swift_proxy /openstack/healthcheck
    > 60797 60797 1 root TS - 0 19 /usr/bin/podman exec nova_scheduler /openstack/healthcheck

I adjusted the timer to this:

    5c5,6
    < OnUnitActiveSec=30
    ---
    > OnUnitActiveSec=60
    > RandomizedDelaySec=50

And saw the load average drop to 2.5 which look more reasonable.

Fix proposed to branch: master
Review: https://review.openstack.org/635129

Changed in tripleo:
assignee: nobody → Sofer Athlan-Guyot (sofer-athlan-guyot)
status: Triaged → In Progress

For reference, I use that command to change it on my dev env:

  awk '{if ($0 ~ /OnUnitActiveSec=30/){print "OnUnitActiveSec=60\nRandomizedDelaySec=50"}else{print $0}}'

in /etc/systemd/system and followed by systemctl daemon-reload

Reviewed: https://review.openstack.org/635129
Committed: https://git.openstack.org/cgit/openstack/paunch/commit/?id=074930f2a231ea3d9c49178a4acfa77a52bb7e42
Submitter: Zuul
Branch: master

commit 074930f2a231ea3d9c49178a4acfa77a52bb7e42
Author: Sofer Athlan-Guyot <email address hidden>
Date: Wed Feb 6 12:00:23 2019 +0100

    Increase and randomize healthcheck interval to avoid cpu congestion.

    All the healthcheck timers are triggered at the same time which causes
    cpu congestion. With the default interval of 30s this cause
    unnecessary stress on the environment.

    We increase the default interval and randomize the trigger to 3/4 of
    the timer. Tasks, on average and by default, should thus still be
    triggered every minute, but ranging from 15s to 1m45s.

    Change-Id: Ib11b08b5d0dec03db9740520f05fe8f1ff962192
    Closes-Bug: #1814870

Changed in tripleo:
status: In Progress → Fix Released

This issue was fixed in the openstack/paunch 4.3.0 release.

To post a comment you must log in.
This report contains Public information  Edit
Everyone can see this information.

Other bug subscribers