carbon/whisper performing 800mb writes

Bug #885944 reported by Jordan Sissel
This bug affects 3 people
Affects: Graphite
Status: Fix Released
Importance: Undecided
Assigned to: Unassigned
Milestone: 0.9.10

Bug Description

I found my fresh graphite install abusing my disk pretty heavily. Upon investigation, I found carbon creating new whisper files and then issuing an 800mb write to each file. This is repeatable for all metrics not previously seen.

strace output here: https://raw.github.com/gist/1338425/3da1dd9e541cc62f7419c706e5631b6774147624/gistfile1.txt

I tracked this down in whisper.py to the following (this code is copied from 0.9.9's whisper):

  for secondsPerPoint,points in archiveList:
    archiveInfo = struct.pack(archiveInfoFormat, archiveOffsetPointer, secondsPerPoint, points)
    fh.write(archiveInfo)
    archiveOffsetPointer += (points * pointSize)  # ends up equal to the total file size

  # Zero-fill everything after the headers (the entire data region) in a single write.
  zeroes = '\x00' * (archiveOffsetPointer - headerSize)
  fh.write(zeroes)

As I read it, this code writes a few headers and then pads the rest of the file with zeroes. This zero-fill causes a huge number of bytes to be written to disk and explains the heavy I/O usage I observed.

I think the 'zeroing' action can be better written like this:

fh.seek(archiveOffsetPointer - headerSize - 1)
fh.write("\0")

The above should achieve the same results as the original code but without incurring huge amounts of disk activity.
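Spelled out a little more fully (an illustrative sketch only, not a patch - it reuses fh, headerSize, and archiveOffsetPointer from the excerpt above, and seeks relative to the current position since the file pointer already sits just past the headers):

remaining = archiveOffsetPointer - headerSize   # bytes the original code zero-filled
fh.seek(remaining - 1, 1)                       # jump ahead, relative to the current position
fh.write('\x00')                                # single NUL at the final offset

The file ends up with the same logical size, but the filesystem only materializes blocks as datapoints are actually written (i.e. a sparse file).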

I'm pretty sure this is a problem and am quite happy to write a patch for this. Thoughts?

Tags: whisper
Revision history for this message
Jordan Sissel (jls-semicomplete) wrote :

(Keeping in mind that the '800mb' figure is due to my carbon schema configuration.)

My proposed fix above does two things. First, it prevents carbon from allocating an 800mb string and writing it, which keeps carbon from ballooning with memory that python tends not to give back to the system - top(1) shows carbon currently using 812MB, which maps to the roughly 800mb string plus the (small) amount of memory carbon otherwise needs.

Second, it prevents the disk thrashing caused by writing one massive chunk to disk.

If there is a strong reason why force-writing the entire file full of zeroes is necessary, please let me know. If nothing else, we can certainly fix the first issue I mentioned (high memory usage due to massive strings) while leaving the second (disk thrashing on creates) alone.

Revision history for this message
Nathan Burns (nathanlburns) wrote :

We have both problems in our installations. I'd be interested in this getting "fixed". Awesome, Jordan.

Revision history for this message
chrismd (chrismd) wrote :

So the original intent behind writing out the whole file was to ensure all of the filesystem blocks get allocated contiguously on disk, to minimize fragmentation and thus reduce seek time for large reads (which tend to be very common for graphite). Your approach creates an equivalent file from the filesystem's perspective, but it does not allocate the blocks until data actually gets written to each part of the file, and that would almost certainly happen in a fragmented fashion (first one block across all the new metrics, then the next block for each, and so on - about as fragmented as things can get). So the idea was to eat the one-time allocation cost up front to ensure less seeking forever after.

That was the *intent*, but it's been so many years since I wrote whisper that I am honestly not sure whether this actually provides any tangible benefit. I think the only way to tell is to experiment and look at some I/O performance data.

The MAX_CREATES_PER_MINUTE setting was implemented to mitigate the performance impact of allocating new metrics. Many users get frustrated that allocating, say, 200,000 new metrics takes forever, so they crank this knob up really high and it wreaks havoc until all the metrics get created. That is fine if it's a brand new system and you want to speed up the initial allocations, but it is a huge liability for a production system because a slew of new metrics could flat out kill it, so it is wise to leave this very low (the default is 50). The same is true of MAX_WRITES_PER_SECOND: more is not always better, it is a trade-off between frequency and size of writes, and the right balance is system-dependent.
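For reference, the create throttle lives in carbon.conf; an illustrative excerpt with the default value mentioned above:

[cache]
# Cap how many new whisper files carbon-cache may create per minute; datapoints
# for metrics that have not been created yet simply wait in the cache meanwhile.
MAX_CREATES_PER_MINUTE = 50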

Regarding the memory usage concern, it is true that python allocates an 800MB string to write the 800MB file, but there is no cumulative effect. I was curious about this, so I tested the following snippet:

import time

string_size = 1048576 * 800            # 800MB, matching the whisper file size
for i in range(255):
  big_string = chr(i) * string_size    # a brand new, unique 800MB string each pass
  time.sleep(0.1)
  for line in open('/proc/self/status'):
    if 'vmrss' in line.lower():        # kernel-reported resident set size
      print line

The VM size immediately jumps up to 800M but then remains flat, despite creating a bunch of unique 800MB strings in succession, because of garbage collection. A simple way to cap the memory cost here would be to write the 800MB string in chunks, say 10MB at a time.
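For illustration (a sketch, not the actual patch; chunk size arbitrary, names taken from the whisper excerpt in the description):

CHUNKSIZE = 10 * 1024 * 1024                    # zero-fill 10MB at a time
remaining = archiveOffsetPointer - headerSize   # total bytes to pad after the headers
chunk = '\x00' * CHUNKSIZE
while remaining > CHUNKSIZE:
  fh.write(chunk)
  remaining -= CHUNKSIZE
fh.write('\x00' * remaining)                    # final partial chunk

Peak allocation is then bounded by CHUNKSIZE rather than by the file size.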

Anyway, all of that could be moot and your patch would be perfect if we can determine whether the fragmentation factor has a real performance impact. I think we should make this behavior optional so people can test. It will be hard to replicate the right testing conditions through simulation because fragmentation will only affect reads, and the nature of graphite's reads can vary greatly depending on the type of request, frequency of datapoints, caching configuration, etc. Plus many users with larger installs have some type of RAID setup, not to mention this will probably vary across different filesystems...

So anyway, I am positive there are situations in which there will be no discernible performance impact, but we need to figure out which cases it does matter in, if any, before changing the default behavior.

Revision history for this message
Jordan Sissel (jls-semicomplete) wrote :

I'm happy with sparse file creation being a config option, with the default staying the current behavior (allocate the entire file at creation time).

Regarding memory usage alone - yeah, it is not cumulative, but carbon would otherwise not be using 800mb of memory if it hadn't done a one-time "make a huge string" that was quickly garbage collected. Python won't free() that large chunk back to the system, so I think carbon should avoid allocating it in the first place. This alone could be solved by using a smaller buffer and writing it repeatedly until the file reaches its proper size.

I'm working on the patch as I originally proposed (perhaps including a way to make it an option), but I'm pretty terrible with bzr so it'll be a little bit :)

Revision history for this message
Jordan Sissel (jls-semicomplete) wrote :

I came up with this: http://bazaar.launchpad.net/~jls-semicomplete/+junk/graphite/revision/600

It makes sparse creation optional and non-default as discussed. It also makes the zeroing write small chunks at a time to avoid the massive string allocation (which needs testing).

Can you review? If it looks OK, I'll move forward on making it a carbon config option.

Revision history for this message
chrismd (chrismd) wrote :

Looks good to me, thanks!

Revision history for this message
Michael Leinartas (mleinartas) wrote :

Committed the patch to Whisper in r695 and a new configuration option to expose it in r696.

Thanks for the patch

Changed in graphite:
milestone: none → 0.9.10
status: New → Fix Committed
Revision history for this message
Linux Benchmark Solutions (linux-benchmarksolutions) wrote :

I know this is 'fixed', but I just came across it.

I'm afraid this setting may allow users to shoot themselves in the foot. File fragmentation is still a very real problem for those of us on rotational media. This is especially true for anyone with multiple archives in their whisper databases, since propagation between archives requires a potentially large number of reads. If those reads are fragmented, performance will fall through the floor. I/O ordering and fragmentation are so important that I've even custom-hacked my graphite install to sort the workload by inode number to reduce head seeks.
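A rough sketch of that inode-ordering idea, for illustration only (order_by_inode is a made-up helper, not the actual hack):

import os

def order_by_inode(paths):
  # Sort pending whisper files by inode number so the write workload tends to
  # sweep across the disk in one direction instead of seeking back and forth.
  return sorted(paths, key=lambda p: os.stat(p).st_ino)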

A better solution would be to detect the presence of 'fallocate' and use this if available. This will preserve data ordering and completely eliminate the excess I/O on database creation.

From the 'fallocate' man page:

DESCRIPTION
       fallocate is used to preallocate blocks to a file. For filesystems which support the fallocate system call, this
       is done quickly by allocating blocks and marking them as uninitialized, requiring no IO to the data blocks. This
       is much faster than creating a file by filling it with zeros.

       As of the Linux Kernel v2.6.31, the fallocate system call is supported by the btrfs, ext4, ocfs2, and xfs
       filesystems.

       The exit code returned by fallocate is 0 on success and 1 on failure.

Python support is available in linux-ftools. On a side note, this python package also adds POSIX_FADVISE support, which is also useful; e.g. POSIX_FADV_RANDOM disables read-ahead, which can reduce thrashing of the page cache and therefore boost I/O performance by keeping only useful metrics in the cache. POSIX_FADV_WILLNEED can also be used to prefetch only the metrics that will be needed (given a start and end file offset).

http://code.google.com/p/linux-ftools/
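For illustration, fallocate-based preallocation could look roughly like this, assuming Python 3.3+'s os.posix_fallocate rather than the linux-ftools bindings mentioned above (preallocate is a made-up helper, not whisper's API):

import errno
import os

def preallocate(path, total_size):
  # Reserve every block of the whisper file up front without writing any zeroes.
  fd = os.open(path, os.O_WRONLY | os.O_CREAT, 0o644)
  try:
    try:
      os.posix_fallocate(fd, 0, total_size)      # no data I/O on supporting filesystems
    except OSError as e:
      if e.errno != errno.EOPNOTSUPP:
        raise
      # Filesystem does not support fallocate: fall back to a sparse file.
      os.lseek(fd, total_size - 1, os.SEEK_SET)
      os.write(fd, b'\x00')
  finally:
    os.close(fd)

The FADVISE hints map to os.posix_fadvise in the same module, e.g. os.posix_fadvise(fd, 0, 0, os.POSIX_FADV_RANDOM) to disable read-ahead.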

Francois Mikus (fmikus)
description: updated
Revision history for this message
Nicholas Leskiw (nleskiw) wrote : Re: [Bug 885944] Re: carbon/whisper performing 800mb writes

This is not a problem; Whisper does this to ensure that the files do not become fragmented and are as sequential on the disk as possible. A side effect is that if your storage partition fills up, all the space has already been allocated and Graphite continues to collect data for existing metrics.

Please do not change this behavior. You can adjust the rate at which new metrics are created in the carbon configuration files and limit the number of new creates per minute. (Carbon caches data until it is written to disk.)

-Nick

Revision history for this message
Michael Leinartas (mleinartas) wrote :

Thanks for the great information, linux-benchmarksolutions (sorry, I don't know your real name). It certainly makes sense to use fallocate() when available. I'll look into FADVISE as well. I've created a separate bug to track this: https://bugs.launchpad.net/graphite/+bug/957827

Changed in graphite:
status: Fix Committed → Fix Released