cAdvisor has high CPU usage

Bug #2048223 reported by Mark Goddard
This bug affects 1 person
Affects         Status         Importance   Assigned to   Milestone
kolla-ansible   Fix Released   Medium       Unassigned
Antelope        Fix Released   Undecided    Unassigned
Bobcat          Fix Released   Undecided    Unassigned
Caracal         Fix Released   Medium       Unassigned
Zed             Fix Released   Undecided    Unassigned

Bug Description

The prometheus_cadvisor container has high CPU usage. On various production systems I checked, it sits at around 13-16% on controllers, averaged over the Prometheus 1m scrape interval. Viewed with top, usage is spiky and can jump above 100%.
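A short burst above 100% in top is compatible with a low average over a minute. The sketch below shows how such an average can be derived from two readings of a cumulative CPU-seconds counter. The report does not say how the figures were obtained; the function name and the sample readings are hypothetical, and the counter is assumed to behave like cAdvisor's container_cpu_usage_seconds_total metric.

```python
def average_cpu_percent(cpu_seconds_start: float,
                        cpu_seconds_end: float,
                        window_s: float) -> float:
    """Average CPU usage over a window, as a percentage of one core.

    The inputs are two readings of a cumulative CPU-seconds counter
    taken `window_s` seconds apart.
    """
    if window_s <= 0:
        raise ValueError("window_s must be positive")
    return (cpu_seconds_end - cpu_seconds_start) / window_s * 100.0


if __name__ == "__main__":
    # Hypothetical readings: 8.4 CPU-seconds consumed across a 60 s window.
    print(average_cpu_percent(1000.0, 1008.4, 60.0))
```

With these made-up readings the average works out to about 14% of one core, even if the process briefly saturated a core during the window.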

There are various bugs about this, but I found https://github.com/google/cadvisor/issues/2523, which suggests increasing the per-container housekeeping interval so that housekeeping runs less often. The interval defaults to 1s, which provides far finer granularity than we need with the default Prometheus scrape interval of 60s.

Increasing the housekeeping interval to 60s on a production controller reduced the average CPU usage from 13% to 3.5%. This still seems high, but is more reasonable.
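The arithmetic behind this change can be sketched as follows: how many housekeeping passes fall between two scrapes at each interval, and how large the reported CPU saving is in relative terms. The helper names are hypothetical. The flag rendered at the end is cAdvisor's --housekeeping_interval option; how kolla-ansible passes it to the container is not shown here.

```python
def housekeeping_passes_per_scrape(scrape_interval_s: float,
                                   housekeeping_interval_s: float) -> float:
    """Housekeeping passes cAdvisor runs between two Prometheus scrapes."""
    return scrape_interval_s / housekeeping_interval_s


def relative_reduction(before: float, after: float) -> float:
    """Fractional reduction going from `before` to `after`."""
    return (before - after) / before


def cadvisor_flag(housekeeping_interval_s: int) -> str:
    """Render the cAdvisor command-line flag for an interval in seconds."""
    return f"--housekeeping_interval={housekeeping_interval_s}s"


if __name__ == "__main__":
    # Passes per 60 s scrape at the 1 s default and at 60 s.
    print(housekeeping_passes_per_scrape(60, 1))
    print(housekeeping_passes_per_scrape(60, 60))
    # Relative CPU saving for the 13% -> 3.5% figures in the report.
    print(round(relative_reduction(13.0, 3.5), 3))
    print(cadvisor_flag(60))
```

At the default, 60 housekeeping passes are collected for every sample Prometheus actually scrapes; at 60s there is one. The measured saving is roughly 73%, well short of a 60-fold drop, which suggests that a good part of cAdvisor's CPU cost does not scale with the per-container housekeeping rate.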

Mark Goddard (mgoddard) wrote :
Changed in kolla-ansible:
importance: Undecided → Medium

OpenStack Infra (hudson-openstack) wrote : Fix proposed to kolla-ansible (master)
Changed in kolla-ansible:
status: New → In Progress
no longer affects: kolla-ansible/yoga

OpenStack Infra (hudson-openstack) wrote : Fix merged to kolla-ansible (master)

Reviewed: https://review.opendev.org/c/openstack/kolla-ansible/+/904823
Committed: https://opendev.org/openstack/kolla-ansible/commit/97e5c0e9b1906f2993b4c12820ac3cb9ddcfe821
Submitter: "Zuul (22348)"
Branch: master

commit 97e5c0e9b1906f2993b4c12820ac3cb9ddcfe821
Author: Mark Goddard <email address hidden>
Date: Fri Jan 5 11:02:39 2024 +0000

    cadvisor: Set housekeeping interval to Prometheus scrape interval

    The prometheus_cadvisor container has high CPU usage. On various
    production systems I checked it sits around 13-16% on controllers,
    averaged over the prometheus 1m scrape interval. When viewed with top we
    can see it is a bit spikey and can jump over 100%.

    There are various bugs about this, but I found
    https://github.com/google/cadvisor/issues/2523 which suggests reducing
    the per-container housekeeping interval. This defaults to 1s, which
    provides far greater granularity than we need with the default
    prometheus scrape interval of 60s.

    Reducing the housekeeping interval to 60s on a production controller
    reduced the CPU usage from 13% to 3.5% average. This still seems high,
    but is more reasonable.

    Change-Id: I89c62a45b1f358aafadcc0317ce882f4609543e7
    Closes-Bug: #2048223

Changed in kolla-ansible:
status: In Progress → Fix Released

OpenStack Infra (hudson-openstack) wrote : Fix proposed to kolla-ansible (stable/2023.2)

Fix proposed to branch: stable/2023.2
Review: https://review.opendev.org/c/openstack/kolla-ansible/+/904842

OpenStack Infra (hudson-openstack) wrote : Fix proposed to kolla-ansible (stable/2023.1)

Fix proposed to branch: stable/2023.1
Review: https://review.opendev.org/c/openstack/kolla-ansible/+/904843

OpenStack Infra (hudson-openstack) wrote : Fix proposed to kolla-ansible (stable/zed)

Fix proposed to branch: stable/zed
Review: https://review.opendev.org/c/openstack/kolla-ansible/+/904844

OpenStack Infra (hudson-openstack) wrote : Fix merged to kolla-ansible (stable/2023.2)

Reviewed: https://review.opendev.org/c/openstack/kolla-ansible/+/904842
Committed: https://opendev.org/openstack/kolla-ansible/commit/5f35f1784ad453f3b4b7e5dd3312cd97e35dd83c
Submitter: "Zuul (22348)"
Branch: stable/2023.2

commit 5f35f1784ad453f3b4b7e5dd3312cd97e35dd83c
Author: Mark Goddard <email address hidden>
Date: Fri Jan 5 11:02:39 2024 +0000

    cadvisor: Set housekeeping interval to Prometheus scrape interval

    The prometheus_cadvisor container has high CPU usage. On various
    production systems I checked it sits around 13-16% on controllers,
    averaged over the prometheus 1m scrape interval. When viewed with top we
    can see it is a bit spikey and can jump over 100%.

    There are various bugs about this, but I found
    https://github.com/google/cadvisor/issues/2523 which suggests reducing
    the per-container housekeeping interval. This defaults to 1s, which
    provides far greater granularity than we need with the default
    prometheus scrape interval of 60s.

    Reducing the housekeeping interval to 60s on a production controller
    reduced the CPU usage from 13% to 3.5% average. This still seems high,
    but is more reasonable.

    Change-Id: I89c62a45b1f358aafadcc0317ce882f4609543e7
    Closes-Bug: #2048223
    (cherry picked from commit 97e5c0e9b1906f2993b4c12820ac3cb9ddcfe821)

OpenStack Infra (hudson-openstack) wrote : Fix merged to kolla-ansible (stable/zed)

Reviewed: https://review.opendev.org/c/openstack/kolla-ansible/+/904844
Committed: https://opendev.org/openstack/kolla-ansible/commit/5f148b83f3d91b7a8f43dae219d9e8258221b6be
Submitter: "Zuul (22348)"
Branch: stable/zed

commit 5f148b83f3d91b7a8f43dae219d9e8258221b6be
Author: Mark Goddard <email address hidden>
Date: Fri Jan 5 11:02:39 2024 +0000

    cadvisor: Set housekeeping interval to Prometheus scrape interval

    The prometheus_cadvisor container has high CPU usage. On various
    production systems I checked it sits around 13-16% on controllers,
    averaged over the prometheus 1m scrape interval. When viewed with top we
    can see it is a bit spikey and can jump over 100%.

    There are various bugs about this, but I found
    https://github.com/google/cadvisor/issues/2523 which suggests reducing
    the per-container housekeeping interval. This defaults to 1s, which
    provides far greater granularity than we need with the default
    prometheus scrape interval of 60s.

    Reducing the housekeeping interval to 60s on a production controller
    reduced the CPU usage from 13% to 3.5% average. This still seems high,
    but is more reasonable.

    Change-Id: I89c62a45b1f358aafadcc0317ce882f4609543e7
    Closes-Bug: #2048223
    (cherry picked from commit 97e5c0e9b1906f2993b4c12820ac3cb9ddcfe821)

OpenStack Infra (hudson-openstack) wrote : Fix merged to kolla-ansible (stable/2023.1)

Reviewed: https://review.opendev.org/c/openstack/kolla-ansible/+/904843
Committed: https://opendev.org/openstack/kolla-ansible/commit/3858e945b8e3c0338a4c377a622cf1c5d975e193
Submitter: "Zuul (22348)"
Branch: stable/2023.1

commit 3858e945b8e3c0338a4c377a622cf1c5d975e193
Author: Mark Goddard <email address hidden>
Date: Fri Jan 5 11:02:39 2024 +0000

    cadvisor: Set housekeeping interval to Prometheus scrape interval

    The prometheus_cadvisor container has high CPU usage. On various
    production systems I checked it sits around 13-16% on controllers,
    averaged over the prometheus 1m scrape interval. When viewed with top we
    can see it is a bit spikey and can jump over 100%.

    There are various bugs about this, but I found
    https://github.com/google/cadvisor/issues/2523 which suggests reducing
    the per-container housekeeping interval. This defaults to 1s, which
    provides far greater granularity than we need with the default
    prometheus scrape interval of 60s.

    Reducing the housekeeping interval to 60s on a production controller
    reduced the CPU usage from 13% to 3.5% average. This still seems high,
    but is more reasonable.

    Change-Id: I89c62a45b1f358aafadcc0317ce882f4609543e7
    Closes-Bug: #2048223
    (cherry picked from commit 97e5c0e9b1906f2993b4c12820ac3cb9ddcfe821)

OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/kolla-ansible 16.3.0

This issue was fixed in the openstack/kolla-ansible 16.3.0 release.

OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/kolla-ansible 17.1.0

This issue was fixed in the openstack/kolla-ansible 17.1.0 release.

OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/kolla-ansible 15.4.0

This issue was fixed in the openstack/kolla-ansible 15.4.0 release.

OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/kolla-ansible 18.0.0.0rc1

This issue was fixed in the openstack/kolla-ansible 18.0.0.0rc1 release candidate.
