CPU samples reset to 0 for shutdown instances
Affects | Status | Importance | Assigned to | Milestone |
---|---|---|---|---|
Ceilometer | Fix Released | Medium | gordon chung | |
Bug Description
For instances in the shutdown state, we see the cumulative CPU consumption reset to 0. In our opinion, this should never happen for a cumulative meter. It will probably cause even more confusion when VMs are stopped and later started again to save resources without actually deleting them. A colleague is currently testing this case.
Our current deployment is Icehouse. Here's an example of the behavior:
$ ceilometer sample-list -m 'cpu' -q 'resource_
+------
| Resource ID | Name | Type | Volume | Unit | Timestamp |
+------
| 4f16b8ac-
| 4f16b8ac-
| 4f16b8ac-
| 4f16b8ac-
| 4f16b8ac-
| 4f16b8ac-
+------
So the consumed CPU time should actually be max(Volume) rather than the last sample. In our opinion, any cumulative function over time should be monotonically increasing. After all, you cannot consume negative CPU time.
We are not sure whether this also happens for other cumulative meters or if CPU is a special case. This needs to be investigated.
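To illustrate the difference between the last sample and max(Volume), here is a minimal sketch. The volumes are made-up values for illustration only (the real values from the listing above are not reproduced here), and the helper names `is_monotonic` and `consumed_cpu_time` are hypothetical, not part of Ceilometer.

```python
from typing import Sequence


def is_monotonic(volumes: Sequence[int]) -> bool:
    """Return True if every sample is >= the one before it."""
    return all(later >= earlier for earlier, later in zip(volumes, volumes[1:]))


def consumed_cpu_time(volumes: Sequence[int]) -> int:
    """Workaround suggested in the report: take the largest sample seen."""
    return max(volumes)


# Hypothetical cumulative 'cpu' samples (ns), oldest first.
# The trailing zeros mimic the samples reported after shutdown.
volumes = [100, 250, 400, 0, 0]

print(is_monotonic(volumes))       # the reset breaks monotonicity
print(volumes[-1])                 # last sample
print(consumed_cpu_time(volumes))  # max(Volume)
```

With these values the last sample is 0 while max(Volume) is 400, which is the discrepancy described above. This only holds as long as the instance is not started again.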
Changed in ceilometer:
status: New → Triaged
Changed in ceilometer:
importance: Undecided → High
Changed in ceilometer:
milestone: none → liberty-rc1
Changed in ceilometer:
importance: High → Medium
Changed in ceilometer:
status: Fix Committed → Fix Released
Changed in ceilometer:
milestone: liberty-rc1 → 5.0.0
So this is more complex than just taking max(volume). The cputime we get from libvirt is reset, and therefore we get zeros. To guarantee an always-growing value, we could try adding the last known cputime whenever the previous cputime is greater than the current one... but that isn't actually 100% foolproof, e.g.:
1. we have a 5 min poll cycle
2. 3 min in, we start an instance
3. after 2 min, we poll and get a cputime
4. we shut down right after the poll and restart right away
5. at the next poll cycle, the cputime is greater than the previous cputime, but it isn't proper cumulative data since it's still missing the initial cputime from the first poll
Does anyone know of a proper way to handle this? I will discuss with the nova devs to see if there is a way around it.
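The heuristic and its blind spot can be sketched roughly as follows. This is only an illustration under stated assumptions: `accumulate` is a hypothetical helper, not Ceilometer code and not the fix that was eventually released, and the raw cputime values are invented.

```python
from typing import List, Sequence


def accumulate(raw_samples: Sequence[int]) -> List[int]:
    """Carry the last known cputime forward whenever the raw value drops.

    A drop (current < previous) is taken as evidence of a reset, so the
    previous raw value is added to a running offset.
    """
    offset = 0
    previous = None
    result = []
    for raw in raw_samples:
        if previous is not None and raw < previous:
            offset += previous  # counter was reset; keep what we had
        result.append(offset + raw)
        previous = raw
    return result


# Case 1: the reset is visible between polls, so the heuristic works.
visible_reset = [100, 250, 0, 30]
print(accumulate(visible_reset))  # [100, 250, 250, 280]

# Case 2 (the scenario above): poll 1 reads 120, the instance is
# restarted right after, and by poll 2 the new counter is already at 300.
# The raw value never drops, so the reset goes undetected.
missed_reset = [120, 300]
true_total = 120 + 300
print(accumulate(missed_reset))   # [120, 300]
print(true_total)                 # 420
```

In the second case the output still increases, but it undercounts by the 120 consumed before the restart, which is the gap described in step 5.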