CPU samples reset to 0 for shutdown instances

Bug #1417949 reported by Björn Hagemeier
This bug affects 2 people
Affects: Ceilometer
Status: Fix Released
Importance: Medium
Assigned to: gordon chung

Bug Description

For instances in the shutdown state, we see the cumulative CPU consumption reset to 0. In our opinion, this should never happen for a cumulative meter. It will probably cause even more confusion when VMs are stopped and later started again to save resources without actually deleting them. A colleague is currently testing this.

Our current deployment is Icehouse. Here's an example of the behavior:

$ ceilometer sample-list -m 'cpu' -q 'resource_id=4f16b8ac-20f3-48e4-a5b8-87c84565c612'
+--------------------------------------+------+------------+---------------+------+---------------------+
| Resource ID                          | Name | Type       | Volume        | Unit | Timestamp           |
+--------------------------------------+------+------------+---------------+------+---------------------+
| 4f16b8ac-20f3-48e4-a5b8-87c84565c612 | cpu  | cumulative | 0.0           | ns   | 2015-02-04T00:34:40 |
| 4f16b8ac-20f3-48e4-a5b8-87c84565c612 | cpu  | cumulative | 2.42711e+12   | ns   | 2015-02-04T00:24:41 |
| 4f16b8ac-20f3-48e4-a5b8-87c84565c612 | cpu  | cumulative | 1.80328e+12   | ns   | 2015-02-04T00:14:41 |
| 4f16b8ac-20f3-48e4-a5b8-87c84565c612 | cpu  | cumulative | 1.17889e+12   | ns   | 2015-02-04T00:04:40 |
| 4f16b8ac-20f3-48e4-a5b8-87c84565c612 | cpu  | cumulative | 5.5995e+11    | ns   | 2015-02-03T23:54:41 |
| 4f16b8ac-20f3-48e4-a5b8-87c84565c612 | cpu  | cumulative | 45450000000.0 | ns   | 2015-02-03T23:44:40 |
+--------------------------------------+------+------------+---------------+------+---------------------+

So the consumed CPU time should actually be max(Volume) rather than the last sample. In our opinion, any cumulative function over time should be monotonically increasing. After all, you cannot consume negative CPU time.

We are not sure whether this also happens for other cumulative meters or if CPU is a special case. This needs to be investigated.

gordon chung (chungg)
Changed in ceilometer:
status: New → Triaged
gordon chung (chungg) wrote :

so this is more complex than just taking max(volume). the cputime we get from libvirt is reset and therefore we get zeros. to guarantee an always growing value, we could try to add the last known cputime whenever the previous cputime is greater than the current one... but that isn't actually 100% foolproof, e.g.:

1. we have a 5 min poll cycle
2. 3 min in, we start an instance
3. after 2 min, we poll and get a cputime
4. we shut down right after the poll and restart right away.
5. next poll cycle, the cputime is greater than the previous cputime, but it isn't proper cumulative data since it's still missing the initial cputime from the first poll
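
A minimal Python sketch of the idea and of its blind spot. The function name `accumulate` and the numbers are hypothetical and chosen only for illustration (one fully busy vCPU, cputime in seconds); this is not Ceilometer code.

```python
def accumulate(readings):
    """Stitch raw cputime readings into a running total that never decreases.

    A reading lower than its predecessor is treated as a counter reset, and
    the last pre-reset reading is carried forward as an offset.
    """
    offset = 0.0
    previous = None
    totals = []
    for value in readings:
        if previous is not None and value < previous:
            # Counter went backwards: assume a reset and keep what was seen.
            offset += previous
        totals.append(offset + value)
        previous = value
    return totals


# A reset that is visible between two polls is repaired.
visible = accumulate([100.0, 200.0, 0.0, 50.0])

# The scenario from the steps above: the first poll reads 120 s, the instance
# restarts right after it, and the next poll reads 300 s. The raw reading
# never decreases, so no reset is detected and the first 120 s are lost.
hidden = accumulate([120.0, 300.0])
```

In the second call the result ends at 300 s although the instance actually consumed 420 s, which is the gap described in step 5.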

anyone know of a proper way to handle this? if not, i will discuss with the nova devs to see if there is a way around it.

Antonio Messina (arcimboldo) wrote :

I think you will always lose data if you use polling. The more often you poll, the less data you lose, but you will lose some.

However, having "cpu" as a cumulative value is, IMHO, wrong, because as it is it's basically useless. There are two problems:

1) you cannot get a reliable value, because you cannot know if the server was rebooted or not, so you cannot just get "max"
2) you cannot easily get the cpu time used *in a specified period of time* (you should subtract the last and the first sample in the time window, but again, if the server was rebooted you could get a negative/meaningless value)

What you need instead of a cumulative metric is a kind of delta: store how much CPU time was used since the last time you polled the value. You also have to correct this value when the last value is greater than the current one (e.g. after a reboot).

If you use a delta instead of a cumulative metric you can still lose some data, but at least sum() will return a *reasonable* value (+ or - your polling interval).
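
As a rough Python illustration of this proposal, using the samples from the bug description in chronological order. The function name `deltas` and the correction rule (after a reset, count the new reading itself as the usage since the restart) are assumptions made for this sketch, not an existing API.

```python
def deltas(readings):
    """Turn cumulative readings into per-interval deltas.

    When the counter goes backwards (for example after a reboot), the current
    reading is used as the delta, since the counter restarted from zero.
    The first reading has no predecessor and yields no delta.
    """
    previous = None
    out = []
    for value in readings:
        if previous is not None:
            diff = value - previous
            out.append(diff if diff >= 0 else value)
        previous = value
    return out


# Volumes (ns) from the sample-list output above, oldest first.
cpu_ns = [45450000000.0, 5.5995e+11, 1.17889e+12,
          1.80328e+12, 2.42711e+12, 0.0]

per_interval = deltas(cpu_ns)
total_ns = sum(per_interval)
```

Here `total_ns` comes to roughly 2.38e12 ns for the polled window, and the final reset to 0 contributes nothing instead of a large negative value.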

gordon chung (chungg) wrote :

@Antonio,

i'll agree with you that cpu meter is wrong as a cumulative value... in the context of libvirt. but as this is also a meter provided by hyper-v, xenapi, vsphere, i don't believe we can simply rename it.

the ideal solution would be to have the time captured by libvirt, as it knows exactly when an instance is rebooted/shutdown. that said, i'm not sure if that is on libvirt's roadmap, so i'm not against creating a new delta metric which captures what you want. this is actually achievable via transformers.

gordon chung (chungg)
Changed in ceilometer:
importance: Undecided → High
OpenStack Infra (hudson-openstack) wrote : Fix proposed to ceilometer (master)

Fix proposed to branch: master
Review: https://review.openstack.org/221907

Changed in ceilometer:
assignee: nobody → gordon chung (chungg)
status: Triaged → In Progress
gordon chung (chungg)
Changed in ceilometer:
milestone: none → liberty-rc1
gordon chung (chungg)
Changed in ceilometer:
importance: High → Medium
OpenStack Infra (hudson-openstack) wrote : Fix merged to ceilometer (master)

Reviewed: https://review.openstack.org/221907
Committed: https://git.openstack.org/cgit/openstack/ceilometer/commit/?id=6ef079953e343885fb6251bc7bbb3c13b8e89483
Submitter: Jenkins
Branch: master

commit 6ef079953e343885fb6251bc7bbb3c13b8e89483
Author: gordon chung <email address hidden>
Date: Wed Sep 9 14:28:34 2015 -0400

    add delta transfomer support

    this patch adds support for a delta transformer. the transformer's
    only functionality is to calculate the delta between current sample
    and previous sample.

    conditions:
    - it will disregard any out of order samples
    - a growth_only param is available to capture only positive deltas
    - supports renaming to a new meter name using same schema as other
      transformers.

    using this transformer, we also create a cpu.delta meter which will
    enable another view of cpu meter. this delta meter will allow for
    (relatively) accurate cputime calculations and will cope with cputime
    resets.

    DocImpact

    Change-Id: Iabcad20d500e3157e4d19f8b2ebffd770218165b
    Closes-Bug: #1417949
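
The conditions listed in the commit message can be pictured with a small standalone Python sketch. It is not the merged Ceilometer code: the `Sample` dataclass, the `DeltaSketch` class and its `handle` method are hypothetical names for illustration, and comparing timestamps as strings assumes they all share the same ISO 8601 format.

```python
from dataclasses import dataclass
from typing import Dict, Optional, Tuple


@dataclass
class Sample:
    name: str
    resource_id: str
    volume: float
    timestamp: str  # ISO 8601; same-format strings sort chronologically


class DeltaSketch:
    """Emit the difference between each sample and the previous one."""

    def __init__(self, target_name: str, growth_only: bool = False):
        self.target_name = target_name  # new meter name, e.g. "cpu.delta"
        self.growth_only = growth_only  # drop negative deltas when True
        self._last: Dict[Tuple[str, str], Sample] = {}

    def handle(self, sample: Sample) -> Optional[Sample]:
        key = (sample.name, sample.resource_id)
        prev = self._last.get(key)
        if prev is not None and sample.timestamp <= prev.timestamp:
            return None  # out-of-order sample: disregard, keep the old one
        self._last[key] = sample
        if prev is None:
            return None  # first sample: nothing to compare against yet
        delta = sample.volume - prev.volume
        if self.growth_only and delta < 0:
            return None  # counter reset: no delta for this interval
        return Sample(self.target_name, sample.resource_id, delta,
                      sample.timestamp)
```

Fed the last samples from the bug description with `growth_only=True`, such a transformer would emit no delta for the drop from 2.42711e+12 to 0.0 and would resume from the next reading after the reset.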

Changed in ceilometer:
status: In Progress → Fix Committed
Thierry Carrez (ttx)
Changed in ceilometer:
status: Fix Committed → Fix Released
Thierry Carrez (ttx)
Changed in ceilometer:
milestone: liberty-rc1 → 5.0.0