Comment 3 for bug 885944

chrismd (chrismd) wrote:

So the original intent behind writing out the whole file byte-by-byte was to ensure all of the filesystem blocks get allocated contiguously on disk, to minimize fragmentation and thus reduce seek time for large reads (which tend to be very common for graphite). Your approach creates an equivalent file from the filesystem's perspective, but it does not actually allocate the data blocks until data gets written to each part of the file, and that would almost certainly happen in a fragmented fashion (first one block across all the new metrics, then the next block for each, and so on, which is about as fragmented as things can get). So the idea was to eat the one-time allocation cost up front to ensure less seeking forever after.
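
To make the difference concrete, here is a minimal sketch of the two allocation strategies (the file names and the 10MB size are purely illustrative, the sparse behavior assumes a filesystem that supports sparse files, and this is not whisper's actual create code):

import os

size = 1048576 * 10  # small size just for the demo

# dense allocation: write real bytes so the filesystem assigns blocks
# for the whole file up front (ideally contiguously)
with open('dense.dat', 'wb') as f:
    f.write(b'\x00' * size)

# sparse allocation: the file reports the same length, but blocks only
# get assigned later, as each region is first written to
with open('sparse.dat', 'wb') as f:
    f.seek(size - 1)
    f.write(b'\x00')

# same apparent size...
print(os.path.getsize('dense.dat'))
print(os.path.getsize('sparse.dat'))
# ...but very different on-disk usage (st_blocks is in 512-byte units)
print(os.stat('dense.dat').st_blocks)
print(os.stat('sparse.dat').st_blocks)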

That was the *intent*, but it's been so many years since I wrote whisper I am honestly not sure if this actually provides any tangible benefit. I think the only way to tell is to experiment and look at some I/O performance data.

The MAX_CREATES_PER_MINUTE setting was implemented to mitigate the performance impact of allocating new metrics. Many users get frustrated that allocating, say, 200,000 new metrics takes forever, so they crank this knob up really high and it wreaks havoc until all the metrics get created. That is fine if it's a brand new system and you want to speed up the initial allocations, but it is a huge liability on a production system because a slew of new metrics could flat out kill it, so it is wise to leave this very low; the default is 50. The same is true of MAX_WRITES_PER_SECOND: more is not always better, it is a trade-off between the frequency and size of writes, and the right balance is system-dependent.
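
To illustrate why cranking that knob up is tempting, here is a tiny conceptual sketch of a per-minute create throttle (this is not carbon's actual code; throttled_creates and the create callback are hypothetical):

import time

def throttled_creates(new_metrics, max_creates_per_minute=50,
                      create=lambda metric: None):
    # create at most max_creates_per_minute whisper files per minute,
    # deferring the rest to later minute windows
    created_this_minute = 0
    minute_started = time.time()
    for metric in new_metrics:
        if created_this_minute >= max_creates_per_minute:
            # wait out the remainder of the current minute window
            time.sleep(max(0, 60 - (time.time() - minute_started)))
            minute_started = time.time()
            created_this_minute = 0
        create(metric)
        created_this_minute += 1

At the default of 50 per minute, 200,000 new metrics take roughly 200000 / 50 = 4000 minutes (close to three days) to allocate, which is exactly why people raise the limit, and also why raising it on a busy production system dumps a huge burst of allocations onto the disks all at once.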

Regarding the memory usage concern, it is true that Python allocates an 800MB string to write the 800MB file, but there is no cumulative effect. I was curious about this, so I tested the following snippet:

import time

string_size = 1048576 * 800  # 800MB

for i in range(255):
    # each iteration binds big_string to a brand new 800MB string
    big_string = chr(i) * string_size
    time.sleep(0.1)
    # report the process's resident memory after each allocation
    for line in open('/proc/self/status'):
        if 'vmrss' in line.lower():
            print(line)

The resident memory (VmRSS) immediately jumps up to about 800MB, but it stays flat despite creating a bunch of unique 800MB strings in succession, because each previous string gets garbage collected as soon as big_string is rebound to the next one. A simple way to cap the memory cost here would be to write the 800MB of data in chunks, say 10MB at a time.
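
For example, a minimal sketch of that chunked approach (the file name and sizes are just illustrative, not whisper's actual create code):

chunk_size = 1048576 * 10      # 10MB at a time
total_size = 1048576 * 800     # the 800MB file from the example above
zeroes = b'\x00' * chunk_size  # only one 10MB buffer ever lives in memory

with open('metric.wsp', 'wb') as f:
    remaining = total_size
    while remaining > 0:
        f.write(zeroes[:remaining] if remaining < chunk_size else zeroes)
        remaining -= chunk_size

This keeps the peak memory cost at roughly the chunk size instead of the full file size, while still writing every byte so the blocks are all allocated up front.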

Anyways, all of that could be moot and your patch would be perfect if we can determine whether or not the fragmentation factor has a real performance impact. I think we should make this behavior optional so people can test. It will be hard to replicate the right testing conditions through simulation because fragmentation will only affect reads, and the nature of graphite's reads can vary greatly depending on the type of request, the frequency of datapoints, caching configuration, etc. Plus many users with larger installs have some type of RAID setup, not to mention this will probably vary across different filesystems...

So anyways, I am positive there are situations in which there will be no discernible performance impact, but we need to figure out which cases it does matter in, if any, before changing the default behavior.