Graphs play "catch-up" when carbon performs a log rotation

Bug #1097547 reported by Justin Brinegar on 2013-01-09
6
This bug affects 1 person
Affects Status Importance Assigned to Milestone
Graphite
Fix Committed
Medium
Unassigned

Bug Description

Every day at midnight server time, or 16:00 local time, carbon performs log rotation and my graphs play "catch-up" for 1-3 minutes. Most of my metrics are impacted by this: the line will dip down before shooting back up, right at the time when the server is rotating logs.

I currently have between 60k and 70k metrics per minute sent to my graphite server. This behavior didn't start happening until 20k metrics per minute or so. I have attached a screenshot of this behavior.

The logs that are getting rotated live on SAS drives. The whisper databases live on a separate, high-performance storage device.

Log sizes, if that helps:
-rw-r--r-- 1 root root 6.7M Jan 8 23:59 console.log.2013_1_8
-rw-r--r-- 1 root root 439M Jan 8 23:59 listener.log.2013_1_8
-rw-r--r-- 1 root root 1014M Jan 8 23:59 query.log.2013_1_8

There is nothing in the logs that indicate a problem.

Log rotation should not impact the running application whose logs are being rotated.

It would be nice to be able to easily control:
1) Whether logs are rotated or not
2) Where logs are written to

If I could turn off rotation, I would just send the logs to /dev/null. Seems like a lot of overhead that could be avoided if there was a simple switch for all logging.

Michael Leinartas (mleinartas) wrote :

I'm guessing you're using the debian package for carbon (or one based on it): http://packages.debian.org/unstable/utils/graphite-carbon

The packager chose to use logrotate to rotate the logs and perform a restart of carbon to force recreation of the log files. Unfortunately, a restart of carbon is problematic as it causes it to close connections and stop listening while it flushes all cached metrics to disk. This flushing can take many minutes depending on how large the cache is and what MAX_UPDATES_PER_SECOND is set to.

If you apply these two patches and remove the restart in the logrotate conf it should behave as expected when LOG_ROTATION = False in the config

https://github.com/graphite-project/carbon/commit/d779ac704f9fe9dd2b02501521d5f49740e1b58c.patch
https://github.com/graphite-project/carbon/commit/4793eb90bc58f55b066536a5fdf15114ba974fea.patch

Changed in graphite:
importance: Undecided → Medium
status: New → Fix Committed

Apologies for not getting back to you sooner. I discovered that some of my agents were sending junk data at midnight GMT. Once I solved that problem, my graphs recovered. Oddly, the graphs that were impacted were not necessarily the same graphs that had known bad data coming in at 4:00pm. I've added strict checks to all agents responsible for sending metrics, and this problem has not recurred. Logging is still an issue for me, so these patches are helpful. Thank you!

To post a comment you must log in.
This report contains Public information  Edit
Everyone can see this information.

Other bug subscribers