ceilometer cpu_util over 100%

Bug #1527620 reported by liuwei
This bug affects 5 people
Affects: Ceilometer
Status: Fix Released
Importance: High
Assigned to: gordon chung

Bug Description

In Kilo, cpu_util for a few VMs in ceilometer is always over 100%:

[root@control ~(keystone_test)]# ceilometer sample-list -m cpu_util -l 30
+--------------------------------------+----------+-------+----------------+------+---------------------+
| Resource ID | Name | Type | Volume | Unit | Timestamp |
+--------------------------------------+----------+-------+----------------+------+---------------------+
| 9085bab1-8cde-42f7-9c29-5a3a1f9a3893 | cpu_util | gauge | 0.595 | % | 2015-12-18T13:27:07 |
| 1e48031f-9c69-473f-ba42-fee4cd7a945a | cpu_util | gauge | 2.17708333333 | % | 2015-12-18T13:27:07 |
| 3dff08bd-a08a-4f56-b02e-660d628aae25 | cpu_util | gauge | 100.471 | % | 2015-12-18T13:27:07 |
| 86735967-ebfe-47a3-9115-34f19b663121 | cpu_util | gauge | 3.02583333333 | % | 2015-12-18T13:27:07 |
| 1dd58360-d894-45fd-8da8-786c59e59837 | cpu_util | gauge | 0.179583333333 | % | 2015-12-18T13:22:47 |
| 73efc5e0-acee-48c8-9d9e-ea71c39551cc | cpu_util | gauge | 100.515833333 | % | 2015-12-18T13:22:47 |
| 7602c307-291d-4016-b536-0094f83250e8 | cpu_util | gauge | 100.775 | % | 2015-12-18T13:22:47 |
| 9085bab1-8cde-42f7-9c29-5a3a1f9a3893 | cpu_util | gauge | 0.5675 | % | 2015-12-18T13:17:07 |
| 1e48031f-9c69-473f-ba42-fee4cd7a945a | cpu_util | gauge | 2.17791666667 | % | 2015-12-18T13:17:07 |
| 3dff08bd-a08a-4f56-b02e-660d628aae25 | cpu_util | gauge | 100.4285 | % | 2015-12-18T13:17:07 |
| 86735967-ebfe-47a3-9115-34f19b663121 | cpu_util | gauge | 3.04 | % | 2015-12-18T13:17:07 |
| 1dd58360-d894-45fd-8da8-786c59e59837 | cpu_util | gauge | 0.188333333333 | % | 2015-12-18T13:12:47 |
| 73efc5e0-acee-48c8-9d9e-ea71c39551cc | cpu_util | gauge | 100.239833333 | % | 2015-12-18T13:12:47 |
| 7602c307-291d-4016-b536-0094f83250e8 | cpu_util | gauge | 100.706666667 | % | 2015-12-18T13:12:47 |
| 9085bab1-8cde-42f7-9c29-5a3a1f9a3893 | cpu_util | gauge | 0.579166666667 | % | 2015-12-18T13:07:07 |
| 1e48031f-9c69-473f-ba42-fee4cd7a945a | cpu_util | gauge | 2.23041666667 | % | 2015-12-18T13:07:07 |
| 3dff08bd-a08a-4f56-b02e-660d628aae25 | cpu_util | gauge | 100.403833333 | % | 2015-12-18T13:07:07 |
| 86735967-ebfe-47a3-9115-34f19b663121 | cpu_util | gauge | 3.02666666667 | % | 2015-12-18T13:07:07 |
| 1dd58360-d894-45fd-8da8-786c59e59837 | cpu_util | gauge | 0.189583333333 | % | 2015-12-18T13:02:47 |
| 73efc5e0-acee-48c8-9d9e-ea71c39551cc | cpu_util | gauge | 100.675833333 | % | 2015-12-18T13:02:47 |
| 7602c307-291d-4016-b536-0094f83250e8 | cpu_util | gauge | 100.736388889 | % | 2015-12-18T13:02:47 |
+--------------------------------------+----------+-------+----------------+------+---------------------+

In the Kilo source code, I found that cpu_util is derived from the cpu meter, and the cputime of the cpu meter comes from:
   ceilometer/compute/virt/libvirt/inspector.py
    def inspect_cpus(self, instance):
        domain = self._lookup_by_uuid(instance)
        dom_info = domain.info()
        return virt_inspector.CPUStats(number=dom_info[3], time=dom_info[4])

   (the same in Liberty)
This API is described in libvirt:
    struct virDomainInfo {
      ....
      unsigned long long cpuTime
                        #### the CPU time used in nanoseconds
    }
I think this cpuTime is the host (compute node) CPU time consumed on behalf of the VM (guest), which is greater than the time actually used by the VM's CPUs, as the output of this libvirt command shows:

[root@compute ~]# virsh cpu-stats instance-00000008
CPU0:
        cpu_time 23430.267346021 seconds
        vcpu_time 20686.192447612 seconds
CPU1:
        cpu_time 28564.062613594 seconds
        vcpu_time 26046.699831451 seconds
CPU2:
        cpu_time 29813.558938521 seconds
        vcpu_time 27469.282667848 seconds
CPU3:
        cpu_time 31055.550565857 seconds
        vcpu_time 28872.558341115 seconds
Total:
        cpu_time 112863.444574414 seconds
                      #### this is the cpuTime we get, not the time used by the VM's CPUs
        user_time 105659.660000000 seconds
        system_time 3988.980000000 seconds
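
To make the gap concrete, the per-CPU figures from the virsh output above can be summed. This is only an arithmetic sketch: the variable names are illustrative, the numbers are copied from the output, and the reading of cpu_time versus vcpu_time follows the reporter's interpretation.

```python
# Per-CPU values copied from the `virsh cpu-stats` output above (seconds).
cpu_time = [23430.267346021, 28564.062613594, 29813.558938521, 31055.550565857]
vcpu_time = [20686.192447612, 26046.699831451, 27469.282667848, 28872.558341115]

total_cpu = sum(cpu_time)    # what the reporter says domain.info() reflects
total_vcpu = sum(vcpu_time)  # time attributed to the vCPUs themselves
overhead = total_cpu - total_vcpu

print("cpu_time total:  %.3f s" % total_cpu)
print("vcpu_time total: %.3f s" % total_vcpu)
print("difference:      %.3f s (%.1f%% of cpu_time)"
      % (overhead, 100.0 * overhead / total_cpu))
```

The difference comes to roughly 9,800 seconds, a bit under 9% of the total cpu_time for this instance.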

So I think the VM cputime should be taken from domain.vcpus():
  struct virVcpuInfo {
    ...
    unsigned long long cpuTime
                      #### CPU time used, in nanoseconds
    ...
  }
The modified code:
ceilometer/compute/virt/libvirt/inspector.py
    def inspect_cpus(self, instance):
        domain = self._lookup_by_uuid(instance)
        return virt_inspector.CPUStats(number=domain.info()[3], time=domain.vcpus()[0][0][2])
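
One thing to keep in mind with the proposal above: in libvirt-python, domain.vcpus() returns a pair whose first element is a list of per-vCPU tuples (number, state, cpuTime, cpu), so the index [0][0][2] reads only the first vCPU's time. A minimal sketch of summing over all vCPUs instead, using a hypothetical FakeDomain stub in place of a real libvirt connection (sum_vcpu_time is also a made-up helper name, not ceilometer code):

```python
class FakeDomain(object):
    """Hypothetical stand-in for a libvirt virDomain, for illustration only."""

    def vcpus(self):
        # Shape mirrors libvirt-python: (list of per-vCPU tuples, list of cpumaps).
        # Each per-vCPU tuple is (number, state, cpuTime in ns, physical cpu).
        info = [(0, 1, 2000000000, 0), (1, 1, 3000000000, 1)]
        cpumaps = [(True, True), (True, True)]
        return info, cpumaps


def sum_vcpu_time(domain):
    """Return the cpuTime (ns) accumulated by all vCPUs of the domain."""
    vcpu_info, _cpumaps = domain.vcpus()
    return sum(vcpu[2] for vcpu in vcpu_info)


domain = FakeDomain()
print(domain.vcpus()[0][0][2])  # first vCPU only
print(sum_vcpu_time(domain))    # all vCPUs
```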

liuwei (liu-wei81)
description: updated
Revision history for this message
gordon chung (chungg) wrote :

this seems like a duplicate of https://bugs.launchpad.net/ceilometer/+bug/1421584

can you confirm?

Revision history for this message
liuwei (liu-wei81) wrote :

I saw this bug before, but I think they are different problems.

I mean that the libvirt API called to get the cputime of the cpu meter (used by cpu_util) may not be correct:
  ## ceilometer/compute/virt/libvirt/inspector.py
     def inspect_cpus(self, instance):
         domain = self._lookup_by_uuid(instance)
         dom_info = domain.info()
         return virt_inspector.CPUStats(number=dom_info[3], time=dom_info[4])
                            #### time=dom_info[4]: this time may not be correct

Revision history for this message
Wenzhi Yu (yuywz) wrote :

I agree with Liu Wei, ceilometer.compute.virt.libvirt.inspector.inspect_cpus method should get "vcpu_info.vcpu_time" instead of "info.cpu_time".

Changed in ceilometer:
assignee: nobody → Wen Zhi Yu (yuywz)
status: New → Confirmed
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to ceilometer (master)

Fix proposed to branch: master
Review: https://review.openstack.org/260369

Changed in ceilometer:
status: Confirmed → In Progress
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Change abandoned on ceilometer (master)

Change abandoned by gordon chung (<email address hidden>) on branch: master
Review: https://review.openstack.org/260369
Reason: seems like wrong approach

Revision history for this message
leegayeon (leegy) wrote :

question: what is the progress?

Revision history for this message
yxu (xuyao18) wrote :

This is my approach.

In ceilometer/ceilometer/transformer/conversions.py

    def handle_sample(self, s):
        """Handle a sample, converting if necessary."""
        LOG.debug('handling sample %s', s)
        key = s.name + s.resource_id
        prev = self.cache.get(key)
        timestamp = timeutils.parse_isotime(s.timestamp)
        self.cache[key] = (s.volume, timestamp)

        if prev:
            prev_volume = prev[0]
            prev_timestamp = prev[1]
            time_delta = timeutils.delta_seconds(prev_timestamp, timestamp)
            # disallow violations of the arrow of time
            if time_delta < 0:
                LOG.warning(_('dropping out of time order sample: %s'), (s,))
                # Reset the cache to the newer sample.
                self.cache[key] = prev
                return None
            # we only allow negative volume deltas for noncumulative
            # samples, whereas for cumulative we assume that a reset has
            # occurred in the interim so that the current volume gives a
            # lower bound on growth
            volume_delta = (s.volume - prev_volume
                            if (prev_volume <= s.volume or
                                s.type != sample.TYPE_CUMULATIVE)
                            else s.volume)
            rate_of_change = ((1.0 * volume_delta / time_delta)
                              if time_delta else 0.0)

            s.rate_of_change = rate_of_change # ADD THIS CODE
            s = self._convert(s, rate_of_change)
            LOG.debug('converted to: %s', s)
        else:
            LOG.warning(_('dropping sample with no predecessor: %s'),
                        (s,))
            s = None
        return s

and in /etc/ceilometer/pipeline.yaml

    - name: cpu_sink
      transformers:
          - name: "rate_of_change"
            parameters:
                target:
                    name: "cpu_util"
                    unit: "%"
                    type: "gauge"
                    scale: "100.0 / rate_of_change if rate_of_change >= (10**9 * (resource_metadata.cpu_number or 1)) else 100.0 / (10**9 * (resource_metadata.cpu_number or 1))"
      publishers:
          - notifier://

With this change, cpu_util will never go over 100%. It may not be perfect, but it works.
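
The effect of the scale expression in the pipeline.yaml snippet above can be checked outside ceilometer. The sketch below assumes the transformer multiplies the rate of change by the evaluated scale; capped_scale and cpu_util are hypothetical helper names, not ceilometer functions.

```python
NS_PER_SECOND = 10 ** 9


def capped_scale(rate_of_change, cpu_number):
    """Mirror the conditional scale expression from the snippet above."""
    # Nanoseconds of CPU time per second of wall time at 100% load.
    full_load = NS_PER_SECOND * (cpu_number or 1)
    if rate_of_change >= full_load:
        return 100.0 / rate_of_change  # rate * scale comes out at 100
    return 100.0 / full_load           # the usual conversion to percent


def cpu_util(rate_of_change, cpu_number):
    """Assumed conversion: the rate is multiplied by the scale."""
    return rate_of_change * capped_scale(rate_of_change, cpu_number)


print(cpu_util(0.5e9, 1))    # half-loaded single vCPU
print(cpu_util(1.005e9, 1))  # rate slightly above the theoretical maximum
```

Any rate at or above the full-load threshold maps to 100, up to floating-point rounding, while lower rates are converted as before.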

Revision history for this message
gordon chung (chungg) wrote :

i would recommend you post the diff or send the patch to gerrit. i'm not entirely sure what you changed.

Changed in ceilometer:
status: In Progress → Won't Fix
status: Won't Fix → Triaged
assignee: Wenzhi Yu (yuywz) → nobody
Revision history for this message
Sam Morrison (sorrison) wrote :

Should this be marked as a high priority? If ceilometer is giving incorrect data this is very bad.

We are also seeing this issue

Revision history for this message
gordon chung (chungg) wrote :

are they all hovering around 100% as well?

i don't know if high priority changes the fact we have limited resources :(

Changed in ceilometer:
importance: Undecided → High
Revision history for this message
gordon chung (chungg) wrote :

also what version of libvirt?

Revision history for this message
gordon chung (chungg) wrote :

and ceilometer code.

Revision history for this message
liuwei (liu-wei81) wrote :

libvirt-1.2.21-27.el7.x86_64
libvirt-python-1.2.21-27.el7.x86_64
openstack ceilometer kilo

#are they all hovering around 100% as well

Yes, the VM load is relatively heavy.

Revision history for this message
gordon chung (chungg) wrote :

in that case it seems like we can only handle the rounding error described in the bug. i don't believe anything else can be done as it's not a bug in ceilometer imo, it's just a limitation of what we can do with data received from libvirt.

there is this change[1] which leverages more precise data in libvirt but there is a requirement on libvirt 1.3.2

as we can only handle the potential rounding issue, i'm just going to add a 'cap'/'max' attribute to enforce values to be <=100

[1]https://github.com/openstack/ceilometer/commit/a4ec0911a3ed4137a1c832fbd7c8fee80c7d4601

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to ceilometer (master)

Fix proposed to branch: master
Review: https://review.openstack.org/475943

Changed in ceilometer:
assignee: nobody → gordon chung (chungg)
status: Triaged → In Progress
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to ceilometer (master)

Reviewed: https://review.openstack.org/475943
Committed: https://git.openstack.org/cgit/openstack/ceilometer/commit/?id=41d940e3690ef303dfb46588311dd1337bd821ab
Submitter: Jenkins
Branch: master

commit 41d940e3690ef303dfb46588311dd1337bd821ab
Author: gord chung <email address hidden>
Date: Tue Jun 20 22:16:26 2017 +0000

    cap cpu_util

    deriving cpu_util from cputime is not exact as it relies on timing
    of host and a completely independent timing of pollster. this can
    cause precision issues with nanosecond timing resulting in >100%
    calculations. this sets a cap so at most cpu_util can only report
    100% cpu utilisation.

    Change-Id: I80c099d8618833794ef19e9497cfad4db7912851
    Closes-Bug: #1527620
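
A small worked example of the timing problem the commit message describes, with made-up numbers: if a fully loaded guest accumulates 600 s of cputime between two reads but the sample timestamps are only 599 s apart, the derived value exceeds 100%. The cap argument below is an illustrative stand-in for the cap the commit introduces, not its actual implementation.

```python
NS_PER_SECOND = 10 ** 9


def cpu_util(cputime_delta_ns, time_delta_s, cpu_number=1, cap=None):
    """Percent utilisation from a cputime delta and a measured interval."""
    value = 100.0 * cputime_delta_ns / (time_delta_s * NS_PER_SECOND * cpu_number)
    return min(value, cap) if cap is not None else value


# Hypothetical figures: 600 s of cputime, timestamps 599 s apart, one vCPU.
uncapped = cpu_util(600 * NS_PER_SECOND, 599.0)
capped = cpu_util(600 * NS_PER_SECOND, 599.0, cap=100.0)
print(uncapped, capped)
```

A one-second mismatch over a ten-minute interval is already enough to push the uncapped value slightly above 100.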

Changed in ceilometer:
status: In Progress → Fix Released
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to ceilometer (stable/ocata)

Fix proposed to branch: stable/ocata
Review: https://review.openstack.org/484272

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to ceilometer (stable/newton)

Fix proposed to branch: stable/newton
Review: https://review.openstack.org/484274

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to ceilometer (stable/ocata)

Reviewed: https://review.openstack.org/484272
Committed: https://git.openstack.org/cgit/openstack/ceilometer/commit/?id=6c0d24ec3e1ed577d3c5b92024b36a582aa6d597
Submitter: Jenkins
Branch: stable/ocata

commit 6c0d24ec3e1ed577d3c5b92024b36a582aa6d597
Author: gord chung <email address hidden>
Date: Tue Jun 20 22:16:26 2017 +0000

    cap cpu_util

    deriving cpu_util from cputime is not exact as it relies on timing
    of host and a completely independent timing of pollster. this can
    cause precision issues with nanosecond timing resulting in >100%
    calculations. this sets a cap so at most cpu_util can only report
    100% cpu utilisation.

    Change-Id: I80c099d8618833794ef19e9497cfad4db7912851
    Closes-Bug: #1527620
    (cherry picked from commit 41d940e3690ef303dfb46588311dd1337bd821ab)

tags: added: in-stable-ocata
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to ceilometer (stable/newton)

Reviewed: https://review.openstack.org/484274
Committed: https://git.openstack.org/cgit/openstack/ceilometer/commit/?id=cf895bac402d855771a2257cb941ce24f84f6a76
Submitter: Jenkins
Branch: stable/newton

commit cf895bac402d855771a2257cb941ce24f84f6a76
Author: gord chung <email address hidden>
Date: Tue Jun 20 22:16:26 2017 +0000

    cap cpu_util

    deriving cpu_util from cputime is not exact as it relies on timing
    of host and a completely independent timing of pollster. this can
    cause precision issues with nanosecond timing resulting in >100%
    calculations. this sets a cap so at most cpu_util can only report
    100% cpu utilisation.

    Change-Id: I80c099d8618833794ef19e9497cfad4db7912851
    Closes-Bug: #1527620
    (cherry picked from commit 41d940e3690ef303dfb46588311dd1337bd821ab)

tags: added: in-stable-newton
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix proposed to ceilometer (master)

Related fix proposed to branch: master
Review: https://review.openstack.org/484640

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix proposed to ceilometer (stable/ocata)

Related fix proposed to branch: stable/ocata
Review: https://review.openstack.org/485096

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix proposed to ceilometer (stable/newton)

Related fix proposed to branch: stable/newton
Review: https://review.openstack.org/485708

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix merged to ceilometer (master)

Reviewed: https://review.openstack.org/484640
Committed: https://git.openstack.org/cgit/openstack/ceilometer/commit/?id=75e10b2a4808ee007a8e3d768ae90fa1037a3c11
Submitter: Jenkins
Branch: master

commit 75e10b2a4808ee007a8e3d768ae90fa1037a3c11
Author: Mehdi Abaakouk <email address hidden>
Date: Tue Jul 18 08:10:38 2017 +0200

    High precision rate of change timedelta

    The current way to calculate rate of change is not precise at all and
    depends on the local host clock. So, we have good chance that the host
    clock derive a bit between each polling. Also the timestamp is polling
    cycle run and not the exact polled sample.

    This makes the rate of change transformer not accurate, and maybe wrong
    if the local clock have jumped to much or if a pollster make to much
    time to get the stats (libvirt reconnection, ...).

    A sample gets a new attribute monotonic_time, where we can store an
    accurate polling time using monotonic.monotonic().

    In rate of change transformer, if the monotonic time is available we use
    to calculate the time delta between samples.

    For instance metrics, we set monotonic_time as soon as we poll it from
    libvirt, avoiding almost all precision issue.

    That makes the rate of change precise to the nanoseconds for polled
    samples, while keeping the timestamp identical for all samples polled
    during one cycle.

    Related-bug: #1527620
    Change-Id: I40e14fb6aa595a86df9767be5758f52b7ceafc8f
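
The idea in the commit message above can be sketched as follows. The commit uses monotonic.monotonic(); time.monotonic() is the Python 3 standard library equivalent. The sample dicts and helper names here are hypothetical and do not reflect ceilometer's actual Sample class.

```python
import time


def make_sample(volume):
    """Build a hypothetical sample dict, stamped at poll time."""
    return {
        "volume": volume,                    # cumulative cputime in ns
        "timestamp": time.time(),            # wall clock, can jump or drift
        "monotonic_time": time.monotonic(),  # never goes backwards
    }


def time_delta(prev, cur):
    """Prefer the monotonic clock when both samples carry it."""
    if (prev.get("monotonic_time") is not None
            and cur.get("monotonic_time") is not None):
        return cur["monotonic_time"] - prev["monotonic_time"]
    return cur["timestamp"] - prev["timestamp"]


def rate_of_change(prev, cur):
    """Volume growth per second between two samples."""
    delta = time_delta(prev, cur)
    return (cur["volume"] - prev["volume"]) / delta if delta else 0.0
```

Samples that lack a monotonic_time fall back to the wall-clock timestamps, which matches the "if the monotonic time is available" wording of the commit message.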

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix merged to ceilometer (stable/ocata)

Reviewed: https://review.openstack.org/485096
Committed: https://git.openstack.org/cgit/openstack/ceilometer/commit/?id=251a06d5c6a636050f92984b8f969f56868c41ec
Submitter: Jenkins
Branch: stable/ocata

commit 251a06d5c6a636050f92984b8f969f56868c41ec
Author: Mehdi Abaakouk <email address hidden>
Date: Tue Jul 18 08:10:38 2017 +0200

    High precision rate of change timedelta

    The current way to calculate rate of change is not precise at all and
    depends on the local host clock. So, we have good chance that the host
    clock derive a bit between each polling. Also the timestamp is polling
    cycle run and not the exact polled sample.

    This makes the rate of change transformer not accurate, and maybe wrong
    if the local clock have jumped to much or if a pollster make to much
    time to get the stats (libvirt reconnection, ...).

    A sample gets a new attribute monotonic_time, where we can store an
    accurate polling time using monotonic.monotonic().

    In rate of change transformer, if the monotonic time is available we use
    to calculate the time delta between samples.

    For instance metrics, we set monotonic_time as soon as we poll it from
    libvirt, avoiding almost all precision issue.

    That makes the rate of change precise to the nanoseconds for polled
    samples, while keeping the timestamp identical for all samples polled
    during one cycle.

    Related-bug: #1527620
    Change-Id: I40e14fb6aa595a86df9767be5758f52b7ceafc8f
    (cherry picked from commit fd6a76601a382cdf47527893f9255b48bc235d05)

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix merged to ceilometer (stable/newton)

Reviewed: https://review.openstack.org/485708
Committed: https://git.openstack.org/cgit/openstack/ceilometer/commit/?id=0302d470ae8ca4efab1791e9005495894452d21b
Submitter: Jenkins
Branch: stable/newton

commit 0302d470ae8ca4efab1791e9005495894452d21b
Author: Mehdi Abaakouk <email address hidden>
Date: Tue Jul 18 08:10:38 2017 +0200

    High precision rate of change timedelta

    The current way to calculate rate of change is not precise at all and
    depends on the local host clock. So, we have good chance that the host
    clock derive a bit between each polling. Also the timestamp is polling
    cycle run and not the exact polled sample.

    This makes the rate of change transformer not accurate, and maybe wrong
    if the local clock have jumped to much or if a pollster make to much
    time to get the stats (libvirt reconnection, ...).

    A sample gets a new attribute monotonic_time, where we can store an
    accurate polling time using monotonic.monotonic().

    In rate of change transformer, if the monotonic time is available we use
    to calculate the time delta between samples.

    For instance metrics, we set monotonic_time as soon as we poll it from
    libvirt, avoiding almost all precision issue.

    That makes the rate of change precise to the nanoseconds for polled
    samples, while keeping the timestamp identical for all samples polled
    during one cycle.

    Related-bug: #1527620
    Change-Id: I40e14fb6aa595a86df9767be5758f52b7ceafc8f
    (cherry picked from commit 251a06d5c6a636050f92984b8f969f56868c41ec)

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/ceilometer 7.1.0

This issue was fixed in the openstack/ceilometer 7.1.0 release.

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/ceilometer 8.1.0

This issue was fixed in the openstack/ceilometer 8.1.0 release.

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/ceilometer 9.0.0

This issue was fixed in the openstack/ceilometer 9.0.0 release.
