I wrote a tool that shows, for each timer metric key, how many values statsd receives per flush interval.
tool: https://github.com/Dieterbe/statsd/blob/statsd-timer-metric-counts/utils/statsd-timer-metric-counts.sh
I'm seeing a huge difference in how many packets are sent for different keys.
For example, in my output (see below) there's a huge number of packets for REPLICATE and HEAD requests on the object-server and container-server. These require me to do a lot of sampling in order not to overload statsd.
But as you can see, some other requests, like object-server GET, submit few packets. For these I want every single metric value, otherwise my statistics would be severely skewed.
So I think there should be a way to configure the sampling more granularly.
I think it makes the most sense to leave default_sample_rate at 1 and then supply overrides for keys which are known to submit a lot of values:
{
"container-server.REPLICATE.timing": 0.01,
"container-server.HEAD.timing": 0.02,
"object-server.HEAD.timing": 0.02
}
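To make the idea concrete, here is a minimal sketch of how such per-key overrides could be looked up and applied when emitting a timer. The names `SAMPLE_RATE_OVERRIDES`, `sample_rate_for` and `timing_packet` are hypothetical and not existing Swift code; keys are written without a trailing dot and matched exactly. The only thing taken from statsd itself is the `|@rate` suffix on sampled packets.

```python
import random

DEFAULT_SAMPLE_RATE = 1.0

# Hypothetical override table, mirroring the example config above.
SAMPLE_RATE_OVERRIDES = {
    "container-server.REPLICATE.timing": 0.01,
    "container-server.HEAD.timing": 0.02,
    "object-server.HEAD.timing": 0.02,
}


def sample_rate_for(metric, overrides=SAMPLE_RATE_OVERRIDES,
                    default=DEFAULT_SAMPLE_RATE):
    """Return the override for this key, or the default rate."""
    return overrides.get(metric, default)


def timing_packet(metric, ms, rng=random.random):
    """Return the statsd packet to send, or None if the sample is dropped."""
    rate = sample_rate_for(metric)
    if rate < 1 and rng() >= rate:
        return None  # sampled out
    # Sampled packets carry the rate so statsd can scale the counts back up.
    suffix = "" if rate >= 1 else "|@%s" % rate
    return "%s:%s|ms%s" % (metric, ms, suffix)
```

With this, a low-volume key such as object-server.GET.timing is always sent, while only roughly 1 in 50 object-server.HEAD.timing values go out.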
So far, it seems the frequent keys are just a few and I can easily configure them manually. However, this is not a highly loaded cluster at all, so maybe this won't suffice.
Another idea could be to have swift automatically adjust the sampling on a key-by-key basis, for example with a rule like "sample_rate is 1 by default, but aim for max 1000 packets per 5 seconds." So if it sent more than 500 packets in 2 seconds, it would automatically start lowering the sample rate for that key (i.e. dropping more packets). Similarly, it would raise the rate again when there are few packets.
The algorithm needs some more thinking for smooth operation though.
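As a starting point for discussion, here is a deliberately naive sketch of that rule. `AdaptiveSampler` and its parameters are hypothetical names, not existing Swift code. It only re-evaluates the rate at the end of each window (so it does not react mid-window as in the "500 packets in 2 seconds" case) and does no smoothing, so the rate can jump abruptly.

```python
import random
import time


class AdaptiveSampler:
    """Per-key sample rate aiming for at most max_packets per window."""

    def __init__(self, max_packets=1000, window=5.0,
                 clock=time.monotonic, rng=random.random):
        self.max_packets = max_packets
        self.window = window
        self.clock = clock
        self.rng = rng
        self.state = {}  # key -> [window_start, events_seen, rate]

    def rate(self, key):
        entry = self.state.get(key)
        return entry[2] if entry else 1.0

    def should_send(self, key):
        now = self.clock()
        entry = self.state.setdefault(key, [now, 0, 1.0])
        if now - entry[0] >= self.window:
            # Window over: choose the rate that would have met the budget
            # for the number of events seen in the window just finished.
            seen = entry[1]
            entry[2] = min(1.0, self.max_packets / seen) if seen else 1.0
            entry[0], entry[1] = now, 0
        entry[1] += 1  # count every event, whether it is sent or not
        return entry[2] >= 1.0 or self.rng() < entry[2]
```

A key that produced 100000 events in the last window would be sampled at 0.01 in the next one, and would go back to 1 as soon as a window stays under the budget.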
Here's my output:
[dieter@dfvimeographite2 ~]$ ./find-most-frequently-submitted-statsd-timer-metric.sh
stats.timers.dfvimeodfs5.account-server.HEAD.timing.count 4.0
stats.timers.dfvimeodfsproxy2.proxy-server.account.HEAD.204.timing.count 6.0
stats.timers.dfvimeodfsproxy1.proxy-server.account.HEAD.204.timing.count 7.0
stats.timers.dfvimeodfs1.account-server.PUT.timing.count 8.0
stats.timers.dfvimeodfs2.account-server.PUT.timing.count 8.0
stats.timers.dfvimeodfs5.account-server.PUT.timing.count 8.0
stats.timers.dfvimeodfsproxy2.proxy-server.object.PUT.411.timing.count 16.0
stats.timers.dfvimeodfs1.account-server.HEAD.timing.count 18.0
stats.timers.dfvimeodfs2.account-server.HEAD.timing.count 18.0
stats.timers.dfvimeodfsproxy1.proxy-server.object.PUT.411.timing.count 22.0
stats.timers.dfvimeodfsproxy1.proxy-server.object.HEAD.404.timing.count 23.0
stats.timers.dfvimeodfsproxy2.proxy-server.object.HEAD.404.timing.count 25.0
stats.timers.dfvimeodfs4.account-server.GET.timing.count 26.0
stats.timers.dfvimeodfs5.account-server.GET.timing.count 26.0
stats.timers.dfvimeodfs4.object-server.GET.timing.count 27.0
stats.timers.dfvimeodfs2.account-server.GET.timing.count 29.0
stats.timers.dfvimeodfs6.container-server.GET.timing.count 29.0
stats.timers.dfvimeodfsproxy1.proxy-server.account.GET.200.timing.count 31.0
stats.timers.dfvimeodfs1.container-server.GET.timing.count 36.0
stats.timers.dfvimeodfs3.object-server.GET.timing.count 37.0
stats.timers.dfvimeodfs1.object-server.GET.timing.count 41.0
stats.timers.dfvimeodfs2.object-server.GET.timing.count 41.0
stats.timers.dfvimeodfs5.object-server.PUT.timing.count 46.0
stats.timers.dfvimeodfs6.object-server.GET.timing.count 46.0
stats.timers.dfvimeodfsproxy1.proxy-server.object.PUT.201.timing.count 46.0
stats.timers.dfvimeodfs2.object-server.PUT.timing.count 48.0
stats.timers.dfvimeodfs3.object-server.PUT.timing.count 48.0
stats.timers.dfvimeodfs4.object-server.PUT.timing.count 49.0
stats.timers.dfvimeodfsproxy2.proxy-server.object.PUT.201.timing.count 51.0
stats.timers.dfvimeodfs5.object-server.GET.timing.count 54.0
stats.timers.dfvimeodfs1.object-server.PUT.timing.count 55.0
stats.timers.dfvimeodfsproxy2.proxy-server.account.GET.200.timing.count 57.0
stats.timers.dfvimeodfs6.object-server.PUT.timing.count 58.0
stats.timers.dfvimeodfsproxy1.proxy-server.object.GET.200.timing.count 64.0
stats.timers.dfvimeodfsproxy2.proxy-server.object.GET.200.timing.count 66.0
stats.timers.dfvimeodfsproxy1.proxy-server.container.GET.200.timing.count 70.0
stats.timers.dfvimeodfsproxy2.proxy-server.container.GET.200.timing.count 84.0
stats.timers.dfvimeodfs3.container-server.GET.timing.count 91.0
stats.timers.dfvimeodfs2.container-server.PUT.timing.count 99.0
stats.timers.dfvimeodfs2.container-server.GET.timing.count 102.0
stats.timers.dfvimeodfs3.container-server.PUT.timing.count 104.0
stats.timers.dfvimeodfs4.container-server.PUT.timing.count 104.0
stats.timers.dfvimeodfs4.container-server.GET.timing.count 146.0
stats.timers.dfvimeodfs2.account-server.REPLICATE.timing.count 174.0
stats.timers.dfvimeodfs5.account-server.REPLICATE.timing.count 196.0
stats.timers.dfvimeodfs1.account-server.REPLICATE.timing.count 200.0
stats.timers.dfvimeodfs4.account-server.REPLICATE.timing.count 219.0
stats.timers.dfvimeodfs1.object-server.REPLICATE.timing.count 129563.0
stats.timers.dfvimeodfs5.object-server.REPLICATE.timing.count 144975.0
stats.timers.dfvimeodfs6.object-server.REPLICATE.timing.count 151934.0
stats.timers.dfvimeodfs3.object-server.REPLICATE.timing.count 154501.0
stats.timers.dfvimeodfs4.object-server.REPLICATE.timing.count 155883.0
stats.timers.dfvimeodfs2.object-server.REPLICATE.timing.count 155977.0
stats.timers.dfvimeodfs1.object-server.HEAD.timing.count 204005.0
stats.timers.dfvimeodfs4.object-server.HEAD.timing.count 219981.0
stats.timers.dfvimeodfs1.container-server.HEAD.timing.count 221976.0
stats.timers.dfvimeodfs2.object-server.HEAD.timing.count 222047.0
stats.timers.dfvimeodfs3.object-server.HEAD.timing.count 222121.0
stats.timers.dfvimeodfs6.object-server.HEAD.timing.count 222593.0
stats.timers.dfvimeodfs5.object-server.HEAD.timing.count 222732.0
stats.timers.dfvimeodfs3.container-server.HEAD.timing.count 240697.0
stats.timers.dfvimeodfs5.container-server.HEAD.timing.count 241076.0
stats.timers.dfvimeodfs4.container-server.HEAD.timing.count 242101.0
stats.timers.dfvimeodfs2.container-server.HEAD.timing.count 242305.0
stats.timers.dfvimeodfs6.container-server.HEAD.timing.count 245777.0
stats.timers.dfvimeodfs1.container-server.REPLICATE.timing.count 491286.0
stats.timers.dfvimeodfs3.container-server.REPLICATE.timing.count 536029.0
stats.timers.dfvimeodfs5.container-server.REPLICATE.timing.count 538027.0
stats.timers.dfvimeodfs4.container-server.REPLICATE.timing.count 538628.0
stats.timers.dfvimeodfs2.container-server.REPLICATE.timing.count 538650.0
stats.timers.dfvimeodfs6.container-server.REPLICATE.timing.count 547794.0
Correction: those numbers are for an hour, not for a flush interval.