kolla-ansible

cAdvisor has high CPU usage

Bug #2048223 reported by Mark Goddard on 2024-01-05

This bug affects 1 person

Affects	Status	Importance	Assigned to
kolla-ansible	Fix Released	Medium	Unassigned
Antelope	Fix Released	Undecided	Unassigned
Bobcat	Fix Released	Undecided	Unassigned
Caracal	Fix Released	Medium	Unassigned
Zed	Fix Released	Undecided	Unassigned

Bug Description

The prometheus_cadvisor container has high CPU usage. On various production systems I checked, it sits at around 13-16% on controllers, averaged over the 1m Prometheus scrape interval. When viewed with top, its usage is spiky and can jump above 100% (i.e. more than one CPU core).

There are various bugs about this, but I found https://github.com/google/cadvisor/issues/2523, which suggests increasing the per-container housekeeping interval so that cAdvisor collects container stats less often. The interval defaults to 1s, which provides far finer granularity than we need with the default Prometheus scrape interval of 60s.
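The mismatch between the two intervals can be made concrete with a back-of-the-envelope model. The sketch below is an illustration only: it assumes each container is refreshed exactly once per housekeeping interval, ignores any dynamic adjustment cAdvisor may make to that interval, and uses a hypothetical count of 50 containers on the host.

```python
def housekeeping_passes_per_scrape(scrape_interval_s: float,
                                   housekeeping_interval_s: float,
                                   containers: int = 1) -> float:
    """Return how many container stat refreshes happen between two scrapes.

    Simplified model (an assumption, not cAdvisor's exact behaviour):
    every container is refreshed once per housekeeping interval.
    """
    if scrape_interval_s <= 0 or housekeeping_interval_s <= 0:
        raise ValueError("intervals must be positive")
    return containers * scrape_interval_s / housekeeping_interval_s


# 60s is the default scrape interval quoted in the report.
# 50 containers is a hypothetical example, not a measured value.
before = housekeeping_passes_per_scrape(60, 1, containers=50)   # 1s default
after = housekeeping_passes_per_scrape(60, 60, containers=50)   # 60s interval
print(f"before: {before:.0f}, after: {after:.0f} refreshes per scrape")
```

Under this model, a 60s interval means roughly one stats refresh per container per scrape, which matches what a 60s scrape can sample.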

Increasing the housekeeping interval from 1s to 60s on a production controller reduced average CPU usage from 13% to 3.5%. This still seems high, but is more reasonable.
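As a quick check on these figures, the relative saving can be worked out directly. This uses only the two averages quoted in the report and says nothing about the short spikes visible in top.

```python
def relative_reduction(before: float, after: float) -> float:
    """Return the fractional drop from `before` to `after` (0.25 == 25% lower)."""
    if before <= 0:
        raise ValueError("before must be positive")
    return (before - after) / before


# Average prometheus_cadvisor CPU usage (%) from the report, before and after
# raising the housekeeping interval from 1s to 60s.
before_pct, after_pct = 13.0, 3.5
reduction = relative_reduction(before_pct, after_pct)
factor = before_pct / after_pct
print(f"{reduction:.0%} lower, about {factor:.1f}x less CPU")
# prints "73% lower, about 3.7x less CPU"
```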

Mark Goddard (mgoddard) wrote on 2024-01-05:

Attachment: Reduction in CPU usage with housekeeping interval of 60s (41.7 KiB, image/png)

Changed in kolla-ansible:
importance:	Undecided → Medium

OpenStack Infra (hudson-openstack) wrote on 2024-01-05: Fix proposed to kolla-ansible (master)

Fix proposed to branch: master
Review: https://review.opendev.org/c/openstack/kolla-ansible/+/904823

Changed in kolla-ansible:
status:	New → In Progress

Maksim Malchuk (mmalchuk) on 2024-01-06

no longer affects: kolla-ansible/yoga

OpenStack Infra (hudson-openstack) wrote on 2024-01-06: Fix merged to kolla-ansible (master)

Reviewed: https://review.opendev.org/c/openstack/kolla-ansible/+/904823
Committed: https://opendev.org/openstack/kolla-ansible/commit/97e5c0e9b1906f2993b4c12820ac3cb9ddcfe821
Submitter: "Zuul (22348)"
Branch: master

commit 97e5c0e9b1906f2993b4c12820ac3cb9ddcfe821
Author: Mark Goddard <email address hidden>
Date: Fri Jan 5 11:02:39 2024 +0000

cadvisor: Set housekeeping interval to Prometheus scrape interval

Change-Id: I89c62a45b1f358aafadcc0317ce882f4609543e7
Closes-Bug: #2048223
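The commit subject describes tying cAdvisor's housekeeping interval to the Prometheus scrape interval. The Python below is a hypothetical illustration of that mapping, not the kolla-ansible change itself (which lives in the project's Ansible configuration). It assumes the scrape interval is written as a Prometheus duration string such as `60s` or `1m`, converts it to milliseconds with a simplified parser that does not enforce every rule of Prometheus's own parser, and emits cAdvisor's `--housekeeping_interval` flag, whose value cAdvisor reads as a Go duration. The function names are made up for this example.

```python
import re

# Hypothetical helpers; not part of kolla-ansible or cAdvisor.
# Milliseconds per Prometheus duration unit. Prometheus also accepts "y"
# (years); it is left out because it is never a sensible scrape interval.
_UNIT_MS = {"ms": 1, "s": 1_000, "m": 60_000, "h": 3_600_000,
            "d": 86_400_000, "w": 604_800_000}
# "ms" is tried before "m" so that "500ms" is read as milliseconds.
_TOKEN = re.compile(r"(\d+)(ms|s|m|h|d|w)")


def prometheus_duration_ms(value: str) -> int:
    """Parse a Prometheus-style duration such as '60s', '1m' or '1m30s'."""
    total = 0
    pos = 0
    for match in _TOKEN.finditer(value):
        if match.start() != pos:  # something unparsable sits between tokens
            raise ValueError(f"invalid duration: {value!r}")
        total += int(match.group(1)) * _UNIT_MS[match.group(2)]
        pos = match.end()
    if pos == 0 or pos != len(value):  # empty input or trailing junk
        raise ValueError(f"invalid duration: {value!r}")
    return total


def cadvisor_housekeeping_flag(scrape_interval: str) -> str:
    """Return a --housekeeping_interval flag equal to the scrape interval.

    cAdvisor parses the value as a Go duration, so whole seconds are
    written as 'Ns' and anything finer as 'Nms'.
    """
    ms = prometheus_duration_ms(scrape_interval)
    value = f"{ms // 1000}s" if ms % 1000 == 0 else f"{ms}ms"
    return f"--housekeeping_interval={value}"


for interval in ("60s", "1m", "1m30s", "500ms"):
    print(interval, "->", cadvisor_housekeeping_flag(interval))
```

Normalising to whole seconds or milliseconds keeps the output in a form Go's duration parser accepts, whichever unit the Prometheus value was written in.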

Changed in kolla-ansible:
status:	In Progress → Fix Released

OpenStack Infra (hudson-openstack) wrote on 2024-01-07: Fix proposed to kolla-ansible (stable/2023.2)

Fix proposed to branch: stable/2023.2
Review: https://review.opendev.org/c/openstack/kolla-ansible/+/904842

OpenStack Infra (hudson-openstack) wrote on 2024-01-07: Fix proposed to kolla-ansible (stable/2023.1)

Fix proposed to branch: stable/2023.1
Review: https://review.opendev.org/c/openstack/kolla-ansible/+/904843

OpenStack Infra (hudson-openstack) wrote on 2024-01-07: Fix proposed to kolla-ansible (stable/zed)

Fix proposed to branch: stable/zed
Review: https://review.opendev.org/c/openstack/kolla-ansible/+/904844

OpenStack Infra (hudson-openstack) wrote on 2024-01-12: Fix merged to kolla-ansible (stable/2023.2)

Reviewed: https://review.opendev.org/c/openstack/kolla-ansible/+/904842
Committed: https://opendev.org/openstack/kolla-ansible/commit/5f35f1784ad453f3b4b7e5dd3312cd97e35dd83c
Submitter: "Zuul (22348)"
Branch: stable/2023.2

commit 5f35f1784ad453f3b4b7e5dd3312cd97e35dd83c
Author: Mark Goddard <email address hidden>
Date: Fri Jan 5 11:02:39 2024 +0000

cadvisor: Set housekeeping interval to Prometheus scrape interval

    Change-Id: I89c62a45b1f358aafadcc0317ce882f4609543e7
    Closes-Bug: #2048223
    (cherry picked from commit 97e5c0e9b1906f2993b4c12820ac3cb9ddcfe821)

OpenStack Infra (hudson-openstack) wrote on 2024-01-12: Fix merged to kolla-ansible (stable/zed)

Reviewed: https://review.opendev.org/c/openstack/kolla-ansible/+/904844
Committed: https://opendev.org/openstack/kolla-ansible/commit/5f148b83f3d91b7a8f43dae219d9e8258221b6be
Submitter: "Zuul (22348)"
Branch: stable/zed

commit 5f148b83f3d91b7a8f43dae219d9e8258221b6be
Author: Mark Goddard <email address hidden>
Date: Fri Jan 5 11:02:39 2024 +0000

cadvisor: Set housekeeping interval to Prometheus scrape interval

    Change-Id: I89c62a45b1f358aafadcc0317ce882f4609543e7
    Closes-Bug: #2048223
    (cherry picked from commit 97e5c0e9b1906f2993b4c12820ac3cb9ddcfe821)

OpenStack Infra (hudson-openstack) wrote on 2024-01-15: Fix merged to kolla-ansible (stable/2023.1)

Reviewed: https://review.opendev.org/c/openstack/kolla-ansible/+/904843
Committed: https://opendev.org/openstack/kolla-ansible/commit/3858e945b8e3c0338a4c377a622cf1c5d975e193
Submitter: "Zuul (22348)"
Branch: stable/2023.1

commit 3858e945b8e3c0338a4c377a622cf1c5d975e193
Author: Mark Goddard <email address hidden>
Date: Fri Jan 5 11:02:39 2024 +0000

cadvisor: Set housekeeping interval to Prometheus scrape interval

    Change-Id: I89c62a45b1f358aafadcc0317ce882f4609543e7
    Closes-Bug: #2048223
    (cherry picked from commit 97e5c0e9b1906f2993b4c12820ac3cb9ddcfe821)

OpenStack Infra (hudson-openstack) wrote on 2024-01-25: Fix included in openstack/kolla-ansible 16.3.0

This issue was fixed in the openstack/kolla-ansible 16.3.0 release.

OpenStack Infra (hudson-openstack) wrote on 2024-01-25: Fix included in openstack/kolla-ansible 17.1.0

This issue was fixed in the openstack/kolla-ansible 17.1.0 release.

OpenStack Infra (hudson-openstack) wrote on 2024-01-25: Fix included in openstack/kolla-ansible 15.4.0

This issue was fixed in the openstack/kolla-ansible 15.4.0 release.

OpenStack Infra (hudson-openstack) wrote on 2024-05-08: Fix included in openstack/kolla-ansible 18.0.0.0rc1

This issue was fixed in the openstack/kolla-ansible 18.0.0.0rc1 release candidate.

Remote bug watches: auto-github-google-cadvisor #2523 [open]