All health check timers are triggered at the same time, causing process congestion
Affects | Status | Importance | Assigned to | Milestone
---|---|---|---|---
tripleo | Fix Released | Critical | Sofer Athlan-Guyot |
Bug Description
Hi,
On an undercloud VM with 24GB of RAM and 8 vCPUs, I had a very high load average, peaking around 10 and always above 7.
Looking at vmstat, I saw this strange pattern:
r b swpd free buff cache si so bi bo in cs us sy id wa st
54 1 0 14816040 2080 1277332 0 0 0 63 11143 8774 40 44 15 0 0
26 0 0 14626328 2080 1277712 0 0 0 568 19025 24207 51 41 8 0 0
47 0 0 14602848 2080 1278100 0 0 0 676 13876 14737 39 59 2 0 0
55 1 0 14594476 2080 1278304 0 0 0 876 12595 11443 44 56 0 0 0
40 0 0 14692028 2080 1278564 0 0 0 838 11700 8938 47 52 0 0 0
21 0 0 14920588 2080 1278492 0 0 0 1516 9459 3986 41 58 0 0 0
9 0 0 15120312 2080 1277936 0 0 0 523 9612 4015 50 42 9 0 0
3 0 0 15148256 2080 1277600 0 0 0 137 10596 11994 38 22 40 0 0
6 0 0 15145900 2080 1277608 0 0 0 6 11206 9994 30 16 54 0 0
3 0 0 15145164 2080 1277604 0 0 0 15 12332 11715 31 18 51 0 0
1 0 0 15144668 2080 1277608 0 0 0 105 7202 6218 22 12 66 0 0
2 0 0 15143828 2080 1277612 0 0 0 12 4207 3210 16 10 74 0 0
2 0 0 15144380 2080 1277620 0 0 0 297 9835 9923 28 16 56 0 0
4 0 0 15145296 2080 1277592 0 0 0 112 4024 2806 15 10 75 0 0
2 0 0 15145792 2080 1277600 0 0 0 111 4254 2956 17 10 73 0 0
1 0 0 15145916 2080 1277600 0 0 0 15 4351 2901 16 11 73 0 0
2 0 0 15144628 2080 1277600 0 0 0 110 5324 4076 17 12 71 0 0
1 0 0 15208584 2080 1277404 0 0 0 73 4755 3455 14 8 78 0 0
4 0 0 15208584 2080 1277412 0 0 0 105 3125 2960 12 2 86 0 0
1 0 0 15208600 2080 1277416 0 0 0 19 3197 2893 12 2 86 0 0
1 0 0 15208600 2080 1277416 0 0 0 1132 3072 2785 11 3 86 0 0
3 0 0 15209988 2080 1277424 0 0 0 14 3058 2843 10 2 88 0 0
7 0 0 15209988 2080 1277428 0 0 0 28 3331 3036 14 2 84 0 0
0 0 0 15209988 2080 1277432 0 0 0 23 3002 2754 10 2 88 0 0
3 0 0 15209864 2080 1277444 0 0 16 448 3610 3051 16 3 81 0 0
1 0 0 15209864 2080 1277444 0 0 0 13 4635 3206 27 3 70 0 0
4 0 0 15210112 2080 1277428 0 0 0 0 4055 3203 22 4 74 0 0
6 0 0 15210112 2080 1277432 0 0 0 31 4898 3403 27 4 69 0 0
2 0 0 15210112 2080 1277436 0 0 0 427 3196 2953 12 2 86 0 0
0 0 0 15209740 2080 1277440 0 0 0 13 3384 3558 11 2 87 0 0
2 0 0 15209848 2080 1277440 0 0 0 12 3165 3006 12 2 86 0 0
66 0 0 14804984 2080 1277788 0 0 4 84 12285 9695 41 50 9 0 0
31 0 0 14632556 2080 1278108 0 0 0 525 17864 21503 55 41 4 0 0
40 0 0 14601112 2080 1278620 0 0 0 621 13667 14390 45 54 1 0 0
51 0 0 14613972 2080 1279236 0 0 0 737 12622 10738 45 55 0 0 0
48 1 0 14656384 2080 1279412 0 0 0 1352 11807 9388 46 53 0 0 0
22 0 0 14795132 2080 1279444 0 0 0 640 9663 4042 43 56 0 0 0
15 0 0 15052900 2080 1279084 0 0 0 845 9565 4125 49 50 1 0 0
5 0 0 15118256 2080 1278244 0 0 0 165 10547 8197 52 29 18 0 0
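These samples come from running vmstat at a fixed interval; the report does not state which interval was used, so the value below is an assumption:

# Print system-wide stats every 5 seconds (assumed interval; Ctrl-C to stop).
vmstat 5

The bursts are visible in the r column (runnable processes), which jumps from single digits into the 40-66 range for a stretch of samples before settling back.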
A high number of processes were being triggered at the same time, causing CPU congestion.
Looking at a process diff taken with
ps -eo pid,tid,
I could see that the healthchecks were the culprits (a sketch of this snapshot-and-diff approach follows the listing):
577c577,617
< 62156 62156 47609 stack TS - 0 19 ps -eo pid,tid,
---
> 60720 60720 1 root TS - 0 19 [podman] <defunct>
> 60721 60721 1 root TS - 0 19 /usr/bin/podman exec heat_api_cfn /openstack/
> 60723 60723 1 root TS - 0 19 /usr/bin/podman exec mistral_api /openstack/
> 60725 60725 1 root TS - 0 19 /usr/bin/podman exec neutron_api /openstack/
> 60727 60727 1 root TS - 0 19 /usr/bin/podman exec ironic_conductor /openstack/
> 60728 60728 1 root TS - 0 19 /usr/bin/podman exec ironic_api /openstack/
> 60729 60729 1 root TS - 0 19 /usr/bin/podman exec nova_metadata /openstack/
> 60730 60730 1 root TS - 0 19 /usr/bin/podman exec heat_api /openstack/
> 60731 60731 1 root TS - 0 19 /usr/bin/podman exec nova_placement /openstack/
> 60735 60735 1 root TS - 0 19 /usr/bin/podman exec mistral_
> 60743 60743 1 root TS - 0 19 /usr/bin/podman exec nova_api /openstack/
> 60746 60746 1 root TS - 0 19 /usr/bin/podman exec neutron_ovs_agent /openstack/
> 60747 60747 1 root TS - 0 19 /usr/bin/podman exec glance_api /openstack/
> 60749 60749 1 root TS - 0 19 /usr/bin/podman exec swift_account_
> 60752 60752 1 root TS - 0 19 /usr/bin/podman exec rabbitmq /openstack/
> 60754 60754 1 root TS - 0 19 /usr/bin/podman exec keystone /openstack/
> 60761 60761 1 root TS - 0 19 /usr/bin/podman exec mysql /openstack/
> 60762 60762 1 root TS - 0 19 /usr/bin/podman exec nova_conductor /openstack/
> 60764 60764 1 root TS - 0 19 /usr/bin/podman exec ironic_inspector /openstack/
> 60765 60765 1 root TS - 0 19 /usr/bin/podman exec mistral_executor /openstack/
> 60767 60767 1 root TS - 0 19 /usr/bin/podman exec swift_object_server /openstack/
> 60772 60772 1 root TS - 0 19 /usr/bin/podman exec neutron_dhcp /openstack/
> 60773 60773 1 root TS - 0 19 /usr/bin/podman exec ironic_pxe_tftp /openstack/
> 60774 60774 1 root TS - 0 19 /usr/bin/podman exec mistral_engine /openstack/
> 60775 60775 1 root TS - 0 19 /usr/bin/podman exec memcached /openstack/
> 60777 60777 1 root TS - 0 19 /usr/bin/podman exec swift_container
> 60787 60787 1 root TS - 0 19 /usr/bin/podman exec neutron_l3_agent /openstack/
> 60795 60795 1 root TS - 0 19 /usr/bin/podman exec swift_proxy /openstack/
> 60797 60797 1 root TS - 0 19 /usr/bin/podman exec nova_scheduler /openstack/
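For reference, a diff like the one above can be produced by snapshotting the process table before and after a spike and comparing the two. A minimal sketch, assuming illustrative /tmp paths and a guessed ps format string (the one in the report is truncated):

# Snapshot the process table twice, ~30s apart, then compare.
# The format string and file names are assumptions, not from the report.
ps -eo pid,tid,ppid,user,class,rtprio,ni,pri,args > /tmp/ps.before
sleep 30
ps -eo pid,tid,ppid,user,class,rtprio,ni,pri,args > /tmp/ps.after
diff /tmp/ps.before /tmp/ps.after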
I adjusted the timer as follows:
5c5,6
< OnUnitActiveSec=30
---
> OnUnitActiveSec=60
> RandomizedDelay
After this change the load average dropped to about 2.5, which looks much more reasonable.
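For context, these healthchecks are driven by per-container systemd timer units. A timer with the adjusted values would look roughly like the minimal sketch below; the unit name is illustrative, RandomizedDelaySec is the full name of the directive truncated in the diff above, and its value here is an assumption:

# /etc/systemd/system/tripleo_keystone_healthcheck.timer (illustrative name)
[Unit]
Description=keystone container healthcheck timer

[Timer]
# Fire 60s after the matching service was last activated (previously 30s).
OnUnitActiveSec=60
# Add a fresh random delay of up to 45s (assumed value) to each activation,
# so the healthcheck timers no longer all fire in the same instant.
RandomizedDelaySec=45

[Install]
WantedBy=timers.target

Because the random delay is re-picked on every activation, the podman exec bursts get spread across the window instead of stacking up.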
Fix proposed to branch: master
Review: https://review.openstack.org/635129