Too low precision on sample timestamps

Bug #1215676 reported by Thomas Maddox
Affects: Ceilometer
Status: Fix Released
Importance: Undecided
Assigned to: Thomas Maddox
Milestone: 2013.2

Bug Description

Samples are ending up with duplicate timestamps for the same resource - this prevents precision when we want to ask for, say, the latest sample for a resource. Because the message queue can't guarantee order, we are likely to end up with old samples being stored after newer samples within the same second, thus having a higher ID but not actually being newer.

Example of samples for a single resource: http://paste.openstack.org/show/44970/
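To make the failure mode concrete, here is a minimal, hypothetical Python sketch (not Ceilometer code; the sample values are made up) of how second-granularity timestamps plus out-of-order delivery break "latest" queries:

    from datetime import datetime

    # Two hypothetical samples for one resource, in the order they were stored.
    # The queue delivered them out of order, so the *older* sample got the higher ID.
    samples = [
        {"id": 1, "timestamp": datetime(2013, 8, 24, 14, 31, 6)},  # emitted at .154444
        {"id": 2, "timestamp": datetime(2013, 8, 24, 14, 31, 6)},  # emitted at .021337, stored later
    ]

    # Once sub-second precision is gone, neither tie-breaker is reliable:
    max(samples, key=lambda s: s["timestamp"])  # ambiguous - both timestamps are equal
    max(samples, key=lambda s: s["id"])         # picks id=2, which is actually the older sample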

Julien Danjou (jdanjou) wrote :

I understand the context, but I'm not sure I understand the bug.

Thomas Maddox (thomas-maddox) wrote :

Sorry, I should have been clearer; I sometimes forget to give enough context. =]

Here's how I'm seeing the problem:

The intended behavior is to be able to return the latest metadata for each resource when requesting that resource or a list of them. That intent is not consistently supported when timestamps have only second-level precision. We can't guarantee that what shows up is actually the latest, for two reasons:

1. Timestamps have only one-second resolution, and multiple messages can be processed within a second. So we end up with several messages regarding the same resource with the exact same timestamp, and we can't ORDER BY Meter.timestamp DESC to sort with certainty.
2. MAX(Meter.id) doesn't necessarily return the latest because they can be processed and stored out of order, due to network or message queue constraints. So, Meter.id 1 may actually be newer than Meter.id 2.

So, the bug is how the current implementation undermines the intent of the application.
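A sketch of the two query shapes being discussed, assuming a SQLAlchemy Meter model with resource_id, timestamp (second precision) and id columns, roughly as in the SQL driver; neither ordering is reliable on its own:

    # Sketch only: assumes an open SQLAlchemy `session`, a `Meter` model and a
    # `resource_id` value; names are illustrative, not the exact driver code.

    # (1) Order by timestamp: several rows share the same second, so which row
    # comes back "first" is arbitrary.
    latest_by_time = (session.query(Meter)
                      .filter(Meter.resource_id == resource_id)
                      .order_by(Meter.timestamp.desc())
                      .first())

    # (2) Order by id (i.e. MAX(Meter.id)): the id reflects when the collector
    # stored the row, not when the sample was generated, so it can pick an
    # older sample that arrived late.
    latest_by_id = (session.query(Meter)
                    .filter(Meter.resource_id == resource_id)
                    .order_by(Meter.id.desc())
                    .first())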

Doug Hellmann (doug-hellmann) wrote :

There are two contributing factors here. First, this is just the nature of a distributed asynchronous system. We can't be sure what order the data is generated or received, so the precision we have for "latest" is vague.

The second cause of many samples with the same timestamp is the fact that a single notification event is converted into several different samples. It is almost always necessary to restrict queries by meter name to get meaningful results.
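To illustrate this point with the same hypothetical model as above: samples for different meters derived from one notification legitimately share a timestamp, so a "latest sample" query only makes sense once it is restricted to a single meter, e.g. by a meter-name column (assumed here to be counter_name):

    # Sketch only: `counter_name` is assumed as the meter-name column.
    latest_cpu = (session.query(Meter)
                  .filter(Meter.resource_id == resource_id,
                          Meter.counter_name == 'cpu_util')
                  .order_by(Meter.timestamp.desc())
                  .first())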

Julien Danjou (jdanjou) wrote :

I don't know what we can do for 1.; we don't generate the timestamps most of the time, and I don't think we internally round them to the second.

2. Agreed, this is a bad implementation that is probably not well covered by our unit tests. I don't think we have this problem in the MongoDB driver, since we rely on the meter table to do the aggregation.

Thomas Maddox (thomas-maddox) wrote :

All good points, for sure.

It's definitely vague; we cannot be 100% certain in any case. I didn't know that a single notification would be broken out into many samples, though it makes sense now that you mention it. =]

The suggestion I had was to align our precision with what Nova emits, so we're at least that accurate. At that point we could tie-break with ID, and that'd be about as accurate as we can get. Of course, to your point, that means nothing if the samples are intended to have duplicate timestamps (without regard to the meter they're associated with) due to samples being collected for several meters from the same notification. Is that the point you're making?
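A sketch of the combined ordering suggested here, again with the hypothetical model above: once sub-second precision is preserved, sort by timestamp and fall back to the ID only to break genuine ties.

    latest = (session.query(Meter)
              .filter(Meter.resource_id == resource_id,
                      Meter.counter_name == meter_name)
              .order_by(Meter.timestamp.desc(), Meter.id.desc())
              .first())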

Doug Hellmann (doug-hellmann) wrote :

Yes, that's right. We fully expect some samples to have the same timestamp for a given resource.

Thomas Maddox (thomas-maddox) wrote :

Nova notifications have timestamps like: '_context_timestamp': '2013-08-24T14:31:06.154444'. This is similar to the unit test data setup.

After some poking around, there are a couple of places where this function is called without the higher-precision (sub-second) flag: https://github.com/openstack/ceilometer/blob/master/ceilometer/openstack/common/timeutils.py#L35-L44. For example, here: https://github.com/openstack/ceilometer/blob/master/ceilometer/compute/pollsters/net.py#L59.

Also, it looks like prior to MySQL 5.6.4, sub-second values in temporal fields were truncated (http://dev.mysql.com/doc/refman/5.6/en/fractional-seconds.html). That's probably why Sandy just went with a Decimal representation in StackTach.

So, do you think we ought to fix this up to preserve the precision provided to CM from the source? From my perspective: it's destructive in that it removes a level of precision that already existed in the notification. I think it's important to preserve this for auditing and replay purposes.
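For reference, the precision loss at the first link above looks roughly like this (illustrative values; the isotime() signature is from the oslo-incubator timeutils of that era):

    from ceilometer.openstack.common import timeutils

    at = timeutils.utcnow()                # e.g. 2013-08-24 14:31:06.154444
    timeutils.isotime(at)                  # '2013-08-24T14:31:06Z' - microseconds dropped
    timeutils.isotime(at, subsecond=True)  # '2013-08-24T14:31:06.154444Z'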

Julien Danjou (jdanjou) wrote :

Yes, it seems like it's worth fixing up to me.

Changed in ceilometer:
assignee: nobody → Thomas Maddox (thomas-maddox)
Sandy Walsh (sandy-walsh) wrote :

We've already had to deal with this for Events. The unixtime stuff is already in there from StackTach.

https://github.com/openstack/ceilometer/blob/master/ceilometer/utils.py#L47-L67
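For readers not following the link, those helpers are roughly of this shape (a paraphrased sketch, not the exact code):

    import calendar
    import datetime
    import decimal


    def dt_to_decimal(utc):
        """Convert a datetime into a Decimal unix timestamp, keeping microseconds."""
        decimal.getcontext().prec = 30
        return (decimal.Decimal(str(calendar.timegm(utc.utctimetuple()))) +
                decimal.Decimal(str(utc.microsecond)) / decimal.Decimal('1000000.0'))


    def decimal_to_dt(dec):
        """Convert a Decimal unix timestamp back into a datetime."""
        integer = int(dec)
        micro = (dec - decimal.Decimal(integer)) * decimal.Decimal(1000000)
        return datetime.datetime.utcfromtimestamp(integer).replace(
            microsecond=int(micro))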

Changed in ceilometer:
status: New → In Progress
OpenStack Infra (hudson-openstack) wrote : Fix merged to ceilometer (master)

Reviewed: https://review.openstack.org/46100
Committed: http://github.com/openstack/ceilometer/commit/a86e7423f1834aaf8f13de101727e39a3a537bb3
Submitter: Jenkins
Branch: master

commit a86e7423f1834aaf8f13de101727e39a3a537bb3
Author: Thomas Maddox <email address hidden>
Date: Wed Sep 11 17:02:00 2013 +0000

    Fix for timestamp precision in SQLAlchemy

    This fix adds the PreciseTimestamp custom type to address how
    MySQL < 5.6.4 truncates temporal columns to the second:
    http://dev.mysql.com/doc/refman/5.6/en/fractional-seconds.html

    Change-Id: I9d324d1cb6867cf99e15e2ecdc566dc5f8fef536
    Closes-Bug: #1215676
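In outline, such a custom type can be built as a SQLAlchemy TypeDecorator that stores a DECIMAL unix timestamp on MySQL and falls back to the native DateTime elsewhere. The sketch below is a simplified reading of the commit description (reusing dt_to_decimal/decimal_to_dt from ceilometer.utils, per Sandy's link above), not necessarily the exact merged code:

    from sqlalchemy import types

    from ceilometer import utils


    class PreciseTimestamp(types.TypeDecorator):
        """Keep microsecond precision even on MySQL < 5.6.4."""

        impl = types.DateTime

        def load_dialect_impl(self, dialect):
            if dialect.name == 'mysql':
                return dialect.type_descriptor(
                    types.DECIMAL(precision=20, scale=6, asdecimal=True))
            return self.impl

        def process_bind_param(self, value, dialect):
            if value is None or dialect.name != 'mysql':
                return value
            return utils.dt_to_decimal(value)

        def process_result_value(self, value, dialect):
            if value is None or dialect.name != 'mysql':
                return value
            return utils.decimal_to_dt(value)

Timestamp columns such as Meter.timestamp can then be declared with this type in place of DateTime.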

Changed in ceilometer:
status: In Progress → Fix Committed
Thierry Carrez (ttx)
Changed in ceilometer:
milestone: none → havana-rc1
status: Fix Committed → Fix Released
Thierry Carrez (ttx)
Changed in ceilometer:
milestone: havana-rc1 → 2013.2