tripleo

tripleO-common healthcheck constantly spikes CPU by lsof

Bug #1921714 reported by Radoslaw Smigielski on 2021-03-29

This bug affects 1 person

Affects	Status	Importance	Assigned to	Milestone
tripleo	In Progress	High	Cédric Jeanneret	tripleo xena-3

Bug Description

On completely idle controllers (no user traffic, no load), the biggest CPU hog is the containers' healthcheck, which constantly spikes CPU by executing lsof to find open ports.
And what's worse, we execute lsof twice:

1. https://github.com/openstack/tripleo-common/blob/master/healthcheck/common.sh#L61
2. https://github.com/openstack/tripleo-common/blob/master/healthcheck/common.sh#L63

This is an example of what it looks like:

top - 11:47:35 up 3 days, 19:26, 2 users, load average: 4.96, 4.74, 5.08
Tasks: 3207 total, 7 running, 3200 sleeping, 0 stopped, 0 zombie
%Cpu(s): 6.4 us, 5.2 sy, 0.0 ni, 88.3 id, 0.0 wa, 0.0 hi, 0.1 si, 0.0 st
KiB Mem : 19651235+total, 57689756 free, 78429920 used, 60392676 buff/cache
KiB Swap: 16777212 total, 16777212 free, 0 used. 11715618+avail Mem

PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
909344 42403 20 0 13360 696 556 R 45.3 0.0 0:01.43 lsof
909211 42402 20 0 16768 932 784 S 41.8 0.0 0:01.32 lsof
909309 42403 20 0 16768 932 784 S 41.1 0.0 0:01.30 lsof
909307 root 20 0 16768 936 784 S 40.8 0.0 0:01.29 lsof
909424 42402 20 0 16768 172 0 R 26.9 0.0 0:00.85 lsof
909438 root 20 0 16768 180 0 R 17.4 0.0 0:00.55 lsof
909468 42403 20 0 16768 176 0 R 12.0 0.0 0:00.38 lsof
1 root 20 0 209632 20952 4276 S 10.8 0.0 418:43.47 systemd

We have 7 lsof commands running at the same time; in total, roughly 225% of CPU is consumed by lsof alone.

This is rather inefficient and is not going to scale on busy controllers with a large overcloud deployed.

Tags:

Bogdan Dobrelya (bogdando) on 2021-03-29

Changed in tripleo:
status:	New → Triaged
importance:	Undecided → High
milestone:	none → wallaby-rc1
tags:	added: train-backport-potential

Bogdan Dobrelya (bogdando) wrote on 2021-03-29:

#1

I think having a single cached view of lsof should be enough to evaluate all healthchecks in a given round of execution.

We could look into the https://opendev.org/openstack/tripleo-ansible/src/branch/master/tripleo_ansible/roles/tripleo_container_manage/templates/systemd-service.j2#L22 option to install some cache create/purge hooks for the systemd units calling healthchecks.

Bogdan Dobrelya (bogdando) wrote on 2021-03-29:

#2

But since the systemd timers based approach was abandoned, I think that approach would not make much sense... Maybe just moving away from lsof's passive scanning of connections to some active per-service connection checking would make more sense.
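
As a rough illustration of what an "active" check could look like, here is a minimal Python sketch that simply tries to open a TCP connection to a service port instead of scanning open files. The function name `port_accepts_connections` and its parameters are hypothetical and are not part of tripleo-common; a real per-service check would probably speak the service's own protocol.

```python
import socket


def port_accepts_connections(host, port, timeout=2.0):
    """Return True if a TCP connection to (host, port) can be established.

    Hypothetical helper for illustration only: it checks reachability,
    not whether the service behind the port is actually healthy.
    """
    try:
        # create_connection resolves the host and connects with a timeout.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and resolution failures.
        return False
```

Such a check costs one connection attempt per healthcheck, regardless of how many processes and file descriptors exist on the host.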

Bogdan Dobrelya (bogdando) wrote on 2021-04-02:

#3

https://review.opendev.org/q/topic:healthcheck_reduce

Changed in tripleo:
status:	Triaged → In Progress
assignee:	nobody → Bogdan Dobrelya (bogdando)

Cédric Jeanneret (cjeanner) wrote on 2021-04-08:

#4

Taking over - the following patch seems to do the job in a better, simpler way:
https://review.opendev.org/#/q/I64776992a7e457781aa8ddaba359ef085d4cb77d

It should be merged by the end of the week in train.

Changed in tripleo:
assignee:	Bogdan Dobrelya (bogdando) → Cédric Jeanneret (cjeanner)

OpenStack Infra (hudson-openstack) wrote on 2021-04-13: Fix merged to tripleo-common (stable/ussuri)

#5

Reviewed: https://review.opendev.org/c/openstack/tripleo-common/+/785325
Committed: https://opendev.org/openstack/tripleo-common/commit/654dc08547accf2f4804e15cb1a5b9a7d93ae2a0
Submitter: "Zuul (22348)"
Branch: stable/ussuri

commit 654dc08547accf2f4804e15cb1a5b9a7d93ae2a0
Author: Cédric Jeanneret <email address hidden>
Date: Tue Apr 6 17:46:43 2021 +0200

healthcheck_port: drop lsof in favor of awk/find

    It seems lsof is a bit too heavy on the system when multiple
    healthchecks are running in parallel.
    So instead of calling a "big" process, let's split a bit things and get
    to the filesystem directly in order to pick only the things we actually
    need.

    What changes:
    - using /proc/net/{tcp,udp}, we get every socket matching the port
    - using find with the right options, we can ensure at least one socket
      exists with the wanted inode(s)

    The last part exits as soon as we have a match in order to make it
    faster and less resource consuming.

    Change-Id: I64776992a7e457781aa8ddaba359ef085d4cb77d
    Partial-Bug: #1921714
    (cherry picked from commit a072a7f07ea75933a2372b1a95ae960095a3250e)
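
The two steps described in the commit message can be sketched in Python as follows. This is not the merged patch (which uses awk and find in a shell script); it only illustrates the idea under stated assumptions. The function names `socket_inodes_for_ports` and `process_holds_socket` and the `proc_root` parameter are hypothetical. The sketch assumes the usual Linux /proc/net/tcp layout, where the local address is the second column, the remote address the third, and the socket inode the tenth, with ports written as four uppercase hex digits.

```python
import os


def socket_inodes_for_ports(proc_net_text, ports):
    """Collect inodes of sockets whose local or remote port is in `ports`.

    `proc_net_text` is the content of /proc/net/tcp or /proc/net/udp.
    """
    wanted = {f"{port:04X}" for port in ports}
    inodes = set()
    for line in proc_net_text.splitlines()[1:]:  # skip the header row
        fields = line.split()
        if len(fields) < 10:
            continue
        local_port = fields[1].rsplit(":", 1)[1]
        remote_port = fields[2].rsplit(":", 1)[1]
        if local_port in wanted or remote_port in wanted:
            inodes.add(fields[9])
    return inodes


def process_holds_socket(pid, inodes, proc_root="/proc"):
    """Return True as soon as one fd of `pid` points at a wanted socket inode."""
    targets = {f"socket:[{inode}]" for inode in inodes}
    fd_dir = os.path.join(proc_root, str(pid), "fd")
    try:
        entries = os.listdir(fd_dir)
    except OSError:
        return False  # process gone or not readable
    for entry in entries:
        try:
            if os.readlink(os.path.join(fd_dir, entry)) in targets:
                return True  # stop at the first match
        except OSError:
            continue  # fd closed while we were scanning
    return False
```

A caller would read `/proc/net/tcp` once, pass its text to `socket_inodes_for_ports`, and then call `process_holds_socket` for the PID of the service being checked. Only that one process's fd directory is scanned, instead of every open file on the host.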

tags:	added: in-stable-ussuri

Marios Andreou (marios-b) on 2021-05-06

Changed in tripleo:
milestone:	wallaby-rc1 → xena-1

Marios Andreou (marios-b) on 2021-06-22

Changed in tripleo:
milestone:	xena-1 → xena-2

Marios Andreou (marios-b) on 2021-07-21

Changed in tripleo:
milestone:	xena-2 → xena-3
