tripleO-common healthcheck constantly spikes CPU by lsof

Bug #1921714 reported by Radoslaw Smigielski
This bug affects 1 person
Affects | Status      | Importance | Assigned to      | Milestone
tripleo | In Progress | High       | Cédric Jeanneret |

Bug Description

On completely idle controllers, with no user traffic and no load, the biggest CPU hog is the containers' healthcheck, which constantly spikes CPU by executing lsof to find open ports.
What's worse, we execute lsof twice:

 1. https://github.com/openstack/tripleo-common/blob/master/healthcheck/common.sh#L61
 2. https://github.com/openstack/tripleo-common/blob/master/healthcheck/common.sh#L63

This is an example of what it looks like:

 top - 11:47:35 up 3 days, 19:26, 2 users, load average: 4.96, 4.74, 5.08
Tasks: 3207 total, 7 running, 3200 sleeping, 0 stopped, 0 zombie
%Cpu(s): 6.4 us, 5.2 sy, 0.0 ni, 88.3 id, 0.0 wa, 0.0 hi, 0.1 si, 0.0 st
KiB Mem : 19651235+total, 57689756 free, 78429920 used, 60392676 buff/cache
KiB Swap: 16777212 total, 16777212 free, 0 used. 11715618+avail Mem

    PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
 909344 42403 20 0 13360 696 556 R 45.3 0.0 0:01.43 lsof
 909211 42402 20 0 16768 932 784 S 41.8 0.0 0:01.32 lsof
 909309 42403 20 0 16768 932 784 S 41.1 0.0 0:01.30 lsof
 909307 root 20 0 16768 936 784 S 40.8 0.0 0:01.29 lsof
 909424 42402 20 0 16768 172 0 R 26.9 0.0 0:00.85 lsof
 909438 root 20 0 16768 180 0 R 17.4 0.0 0:00.55 lsof
 909468 42403 20 0 16768 176 0 R 12.0 0.0 0:00.38 lsof
      1 root 20 0 209632 20952 4276 S 10.8 0.0 418:43.47 systemd

We have 7 lsof commands running at the same time; going by the %CPU column above, they consume roughly 225% of CPU in total.

This is rather inefficient, and it is not going to scale on busy controllers with a large overcloud deployed.

Changed in tripleo:
status: New → Triaged
importance: Undecided → High
milestone: none → wallaby-rc1
tags: added: train-backport-potential
Bogdan Dobrelya (bogdando) wrote :

I think a single cached view of the lsof output should be enough to evaluate all healthchecks within one round of execution.

We could look into using the https://opendev.org/openstack/tripleo-ansible/src/branch/master/tripleo_ansible/roles/tripleo_container_manage/templates/systemd-service.j2#L22 option to install cache create/purge hooks for the systemd units that call the healthchecks.
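
To illustrate the caching idea above, here is a minimal sketch of a helper that runs a command at most once per time window and replays the saved output otherwise. The function name `cached_output`, its arguments, and the cache path in the usage note are hypothetical and do not come from tripleo-common; it also assumes GNU `stat`.

```shell
# Hypothetical helper (not part of tripleo-common).
# Usage: cached_output <ttl-seconds> <cache-file> <command> [args...]
# Runs the command only if the cache file is missing or older than the TTL,
# then prints the cached output either way.
cached_output() {
    local ttl=$1 cache=$2 now mtime
    shift 2
    now=$(date +%s)
    # Modification time of the cache in epoch seconds; 0 if it does not exist.
    mtime=$(stat -c %Y "$cache" 2>/dev/null || echo 0)
    if [ $(( now - mtime )) -ge "$ttl" ]; then
        # Write to a temporary file and rename it, so that concurrent readers
        # never see a half-written snapshot.
        "$@" > "${cache}.$$" && mv "${cache}.$$" "$cache"
    fi
    cat "$cache"
}
```

A healthcheck could then call something like `cached_output 30 /run/healthcheck-lsof.cache lsof -n -P -i`, so that several checks starting within the same 30 seconds share one lsof run. The cache location and the TTL are assumptions to be adapted.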

Bogdan Dobrelya (bogdando) wrote :

But since the systemd-timers-based approach was abandoned, I think that approach would not make much sense... Maybe moving away from passive lsof scanning of connections to some active per-service connection checking would make more sense.

Bogdan Dobrelya (bogdando) wrote :
Changed in tripleo:
status: Triaged → In Progress
assignee: nobody → Bogdan Dobrelya (bogdando)
Cédric Jeanneret (cjeanner) wrote :

Taking over - the following patch seems to do the job in a better, simpler way:
https://review.opendev.org/#/q/I64776992a7e457781aa8ddaba359ef085d4cb77d

It should be merged by the end of the week in train.

Changed in tripleo:
assignee: Bogdan Dobrelya (bogdando) → Cédric Jeanneret (cjeanner)
OpenStack Infra (hudson-openstack) wrote : Fix merged to tripleo-common (stable/ussuri)

Reviewed: https://review.opendev.org/c/openstack/tripleo-common/+/785325
Committed: https://opendev.org/openstack/tripleo-common/commit/654dc08547accf2f4804e15cb1a5b9a7d93ae2a0
Submitter: "Zuul (22348)"
Branch: stable/ussuri

commit 654dc08547accf2f4804e15cb1a5b9a7d93ae2a0
Author: Cédric Jeanneret <email address hidden>
Date: Tue Apr 6 17:46:43 2021 +0200

    healthcheck_port: drop lsof in favor of awk/find

    It seems lsof is a bit too heavy on the system when multiple
    healthchecks are running in parallel.
    So instead of calling a "big" process, let's split a bit things and get
    to the filesystem directly in order to pick only the things we actually
    need.

    What changes:
    - using /proc/net/{tcp,udp}, we get every socket matching the port
    - using find with the right options, we can ensure at least one socket
      exists with the wanted inode(s)

    The last part exits as soon as we have a match in order to make it
    faster and less resource consuming.

    Change-Id: I64776992a7e457781aa8ddaba359ef085d4cb77d
    Partial-Bug: #1921714
    (cherry picked from commit a072a7f07ea75933a2372b1a95ae960095a3250e)
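
The approach described in the commit message can be sketched roughly as follows. This is an illustrative reconstruction, not the merged code: the function names are hypothetical, it assumes GNU find (for `-lname` and `-quit`), and it relies on the /proc/net/tcp layout where field 2 is the local address:port, field 3 the remote address:port, and field 10 the socket inode.

```shell
# Hypothetical helpers (names are illustrative, not taken from tripleo-common).

# Convert a decimal port to the 4-digit uppercase hex form used in /proc/net/*.
port_to_hex() {
    printf '%04X' "$1"
}

# Print the inode of every socket whose local or remote port equals $1.
# Any further arguments are files in /proc/net/tcp format; by default the
# live TCP and UDP tables are read.
socket_inodes_for_port() {
    local hex
    hex=$(port_to_hex "$1")
    shift
    if [ "$#" -eq 0 ]; then
        set -- /proc/net/tcp /proc/net/udp
    fi
    # Skip each file's header line, compare the last 5 characters (":PORT")
    # of both address fields, and ignore entries without an inode (inode 0).
    awk -v suffix=":${hex}" '
        FNR > 1 && $10 != 0 &&
        (substr($2, length($2) - 4) == suffix ||
         substr($3, length($3) - 4) == suffix) { print $10 }
    ' "$@" 2>/dev/null
}

# Succeed if process $1 holds an open descriptor for a socket on port $2.
pid_has_port() {
    local pid=$1 port=$2 inode
    for inode in $(socket_inodes_for_port "$port"); do
        # Entries in /proc/<pid>/fd are symlinks reading "socket:[<inode>]";
        # -quit makes find stop at the first match.
        if [ -n "$(find "/proc/${pid}/fd" -lname "socket:\[${inode}\]" -print -quit 2>/dev/null)" ]; then
            return 0
        fi
    done
    return 1
}
```

For example, `pid_has_port "$(pgrep -o httpd)" 8080` would return 0 only if that process owns a socket on port 8080; the process name here is just a placeholder. Unlike lsof, this reads two small text files and walks a single fd directory, stopping at the first hit.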

tags: added: in-stable-ussuri
Changed in tripleo:
milestone: wallaby-rc1 → xena-1
Changed in tripleo:
milestone: xena-1 → xena-2
Changed in tripleo:
milestone: xena-2 → xena-3