make statsd sampling config more granular

Bug #1090495 reported by Dieter P
Affects: OpenStack Object Storage (swift)
Status: Fix Released
Importance: Medium
Assigned to: Darrell Bishop
Milestone: 1.8.0

Bug Description

I wrote a tool that checks, for each timer metric, how many values statsd receives per flush interval.
tool: https://github.com/Dieterbe/statsd/blob/statsd-timer-metric-counts/utils/statsd-timer-metric-counts.sh

I'm seeing that there's a huge difference in how many packets are sent for different keys.
For example, in my output (see below) there's a huge number of packets for REPLICATE and HEAD requests on the object-server and container-server; these require me to do a lot of sampling in order not to overload statsd.
But as you can see, some other requests like object-server GET submit few packets. For these, I want to keep every single metric value, otherwise my statistics would be severely skewed.

So I think there should be a way to configure the sampling more granularly.
I think it makes most sense to leave default_sample_rate at 1 and then supply overrides for keys which are known to submit a lot of values, e.g.:
{
"container-server.REPLICATE.timing.": 0.01,
"container-server.HEAD.timing": 0.02,
"object-server.HEAD.timing": 0.02
}
So far it seems the frequent keys are just a few and I can easily configure them manually. However, this is not a highly loaded cluster at all, so maybe this won't suffice.
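
To illustrate the idea (purely a sketch, not Swift code; the SAMPLE_RATE_OVERRIDES dict and send_timing helper are made-up names), per-key overrides on top of a default rate could work like this in a statsd client:

import random
import socket

# Hypothetical per-key overrides; any key not listed falls back to the default.
DEFAULT_SAMPLE_RATE = 1.0
SAMPLE_RATE_OVERRIDES = {
    "container-server.REPLICATE.timing": 0.01,
    "container-server.HEAD.timing": 0.02,
    "object-server.HEAD.timing": 0.02,
}

sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)

def send_timing(metric, value_ms, host="localhost", port=8125):
    # Look up a per-key override, falling back to the default rate.
    rate = SAMPLE_RATE_OVERRIDES.get(metric, DEFAULT_SAMPLE_RATE)
    # Randomly drop samples; statsd scales the surviving ones back up
    # based on the |@rate suffix.
    if rate < 1 and random.random() >= rate:
        return
    payload = "%s:%s|ms" % (metric, value_ms)
    if rate < 1:
        payload += "|@%s" % rate
    sock.sendto(payload.encode("utf-8"), (host, port))

Any key without an override keeps the default rate of 1, so low-volume metrics like object-server GET stay unsampled.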

Another idea could be to have swift adjust the sampling automatically on a key-by-key basis, for example with a rule like "sample_rate is 1 by default, but aim for max 1000 packets per 5 seconds." So if a key sent more than 500 packets in 2 seconds, swift would start lowering the sample rate automatically for that key; similarly, it would raise it again if there are few packets.
The algorithm needs some more thought to operate smoothly, though.
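
Purely as a sketch of that idea (hypothetical names, not proposed Swift code), an adaptive per-key sampler aiming at a packet budget per window might look like:

import time

class AdaptiveSampler(object):
    # Pick a per-key sample rate aiming for roughly `target_packets` sampled
    # packets per window: lower the rate when a key is chatty, raise it back
    # toward 1.0 when traffic drops off.

    def __init__(self, target_packets=1000, window_seconds=5.0):
        self.target = target_packets
        self.window = window_seconds
        self.state = {}  # key -> (window_start, events_seen, sample_rate)

    def rate_for(self, key):
        now = time.time()
        start, events, rate = self.state.get(key, (now, 0, 1.0))
        if now - start >= self.window:
            # Events seen last window predict next window's volume, so aim
            # the rate so that roughly `target` sampled packets get through.
            rate = min(1.0, float(self.target) / max(events, 1))
            start, events = now, 0
        self.state[key] = (start, events + 1, rate)
        return rate

The returned rate would then be used as the |@rate for that key's next packet; smoothing the adjustments (e.g. averaging over several windows) is where it needs more thought.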

here's my output:
[dieter@dfvimeographite2 ~]$ ./find-most-frequently-submitted-statsd-timer-metric.sh
stats.timers.dfvimeodfs5.account-server.HEAD.timing.count 4.0
stats.timers.dfvimeodfsproxy2.proxy-server.account.HEAD.204.timing.count 6.0
stats.timers.dfvimeodfsproxy1.proxy-server.account.HEAD.204.timing.count 7.0
stats.timers.dfvimeodfs1.account-server.PUT.timing.count 8.0
stats.timers.dfvimeodfs2.account-server.PUT.timing.count 8.0
stats.timers.dfvimeodfs5.account-server.PUT.timing.count 8.0
stats.timers.dfvimeodfsproxy2.proxy-server.object.PUT.411.timing.count 16.0
stats.timers.dfvimeodfs1.account-server.HEAD.timing.count 18.0
stats.timers.dfvimeodfs2.account-server.HEAD.timing.count 18.0
stats.timers.dfvimeodfsproxy1.proxy-server.object.PUT.411.timing.count 22.0
stats.timers.dfvimeodfsproxy1.proxy-server.object.HEAD.404.timing.count 23.0
stats.timers.dfvimeodfsproxy2.proxy-server.object.HEAD.404.timing.count 25.0
stats.timers.dfvimeodfs4.account-server.GET.timing.count 26.0
stats.timers.dfvimeodfs5.account-server.GET.timing.count 26.0
stats.timers.dfvimeodfs4.object-server.GET.timing.count 27.0
stats.timers.dfvimeodfs2.account-server.GET.timing.count 29.0
stats.timers.dfvimeodfs6.container-server.GET.timing.count 29.0
stats.timers.dfvimeodfsproxy1.proxy-server.account.GET.200.timing.count 31.0
stats.timers.dfvimeodfs1.container-server.GET.timing.count 36.0
stats.timers.dfvimeodfs3.object-server.GET.timing.count 37.0
stats.timers.dfvimeodfs1.object-server.GET.timing.count 41.0
stats.timers.dfvimeodfs2.object-server.GET.timing.count 41.0
stats.timers.dfvimeodfs5.object-server.PUT.timing.count 46.0
stats.timers.dfvimeodfs6.object-server.GET.timing.count 46.0
stats.timers.dfvimeodfsproxy1.proxy-server.object.PUT.201.timing.count 46.0
stats.timers.dfvimeodfs2.object-server.PUT.timing.count 48.0
stats.timers.dfvimeodfs3.object-server.PUT.timing.count 48.0
stats.timers.dfvimeodfs4.object-server.PUT.timing.count 49.0
stats.timers.dfvimeodfsproxy2.proxy-server.object.PUT.201.timing.count 51.0
stats.timers.dfvimeodfs5.object-server.GET.timing.count 54.0
stats.timers.dfvimeodfs1.object-server.PUT.timing.count 55.0
stats.timers.dfvimeodfsproxy2.proxy-server.account.GET.200.timing.count 57.0
stats.timers.dfvimeodfs6.object-server.PUT.timing.count 58.0
stats.timers.dfvimeodfsproxy1.proxy-server.object.GET.200.timing.count 64.0
stats.timers.dfvimeodfsproxy2.proxy-server.object.GET.200.timing.count 66.0
stats.timers.dfvimeodfsproxy1.proxy-server.container.GET.200.timing.count 70.0
stats.timers.dfvimeodfsproxy2.proxy-server.container.GET.200.timing.count 84.0
stats.timers.dfvimeodfs3.container-server.GET.timing.count 91.0
stats.timers.dfvimeodfs2.container-server.PUT.timing.count 99.0
stats.timers.dfvimeodfs2.container-server.GET.timing.count 102.0
stats.timers.dfvimeodfs3.container-server.PUT.timing.count 104.0
stats.timers.dfvimeodfs4.container-server.PUT.timing.count 104.0
stats.timers.dfvimeodfs4.container-server.GET.timing.count 146.0
stats.timers.dfvimeodfs2.account-server.REPLICATE.timing.count 174.0
stats.timers.dfvimeodfs5.account-server.REPLICATE.timing.count 196.0
stats.timers.dfvimeodfs1.account-server.REPLICATE.timing.count 200.0
stats.timers.dfvimeodfs4.account-server.REPLICATE.timing.count 219.0
stats.timers.dfvimeodfs1.object-server.REPLICATE.timing.count 129563.0
stats.timers.dfvimeodfs5.object-server.REPLICATE.timing.count 144975.0
stats.timers.dfvimeodfs6.object-server.REPLICATE.timing.count 151934.0
stats.timers.dfvimeodfs3.object-server.REPLICATE.timing.count 154501.0
stats.timers.dfvimeodfs4.object-server.REPLICATE.timing.count 155883.0
stats.timers.dfvimeodfs2.object-server.REPLICATE.timing.count 155977.0
stats.timers.dfvimeodfs1.object-server.HEAD.timing.count 204005.0
stats.timers.dfvimeodfs4.object-server.HEAD.timing.count 219981.0
stats.timers.dfvimeodfs1.container-server.HEAD.timing.count 221976.0
stats.timers.dfvimeodfs2.object-server.HEAD.timing.count 222047.0
stats.timers.dfvimeodfs3.object-server.HEAD.timing.count 222121.0
stats.timers.dfvimeodfs6.object-server.HEAD.timing.count 222593.0
stats.timers.dfvimeodfs5.object-server.HEAD.timing.count 222732.0
stats.timers.dfvimeodfs3.container-server.HEAD.timing.count 240697.0
stats.timers.dfvimeodfs5.container-server.HEAD.timing.count 241076.0
stats.timers.dfvimeodfs4.container-server.HEAD.timing.count 242101.0
stats.timers.dfvimeodfs2.container-server.HEAD.timing.count 242305.0
stats.timers.dfvimeodfs6.container-server.HEAD.timing.count 245777.0
stats.timers.dfvimeodfs1.container-server.REPLICATE.timing.count 491286.0
stats.timers.dfvimeodfs3.container-server.REPLICATE.timing.count 536029.0
stats.timers.dfvimeodfs5.container-server.REPLICATE.timing.count 538027.0
stats.timers.dfvimeodfs4.container-server.REPLICATE.timing.count 538628.0
stats.timers.dfvimeodfs2.container-server.REPLICATE.timing.count 538650.0
stats.timers.dfvimeodfs6.container-server.REPLICATE.timing.count 547794.0

Revision history for this message
Dieter P (dieter-plaetinck) wrote :

Correction: those numbers are for an hour, not for a flush interval.

Revision history for this message
Dieter P (dieter-plaetinck) wrote :

Just got bitten by this in production: I had to lower the sample rate because the REPLICATE timing packets were flooding my statsd, but now I'm seeing elevated latencies for GET requests on my proxy servers that I can't trace back, because the few GET timing packets that survive the sampling don't cover the problem.

Changed in swift:
importance: Undecided → Medium
status: New → Confirmed
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to swift (master)

Fix proposed to branch: master
Review: https://review.openstack.org/20093

Changed in swift:
assignee: nobody → Darrell Bishop (darrellb)
status: Confirmed → In Progress
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to swift (master)

Reviewed: https://review.openstack.org/20093
Committed: http://github.com/openstack/swift/commit/8801b7409030a1fb36fefcabd38af190bcb38c95
Submitter: Jenkins
Branch: master

commit 8801b7409030a1fb36fefcabd38af190bcb38c95
Author: Darrell Bishop <email address hidden>
Date: Sat Jan 19 15:25:27 2013 -0800

    Make statsd sample rate behave better.

    As Dieter pointed out in bug 1090495
    (https://bugs.launchpad.net/swift/+bug/1090495), the volume of metrics
    can vary wildly between StatsD metrics.

    This patch implements a partial solution by reducing the sample_rate
    used for known high-volume metrics (operational experience will need to
    inform this over time) and introducing a new tunable,
    log_statsd_sample_rate_factor which is multiplied by the sample_rate for
    every statsd stat. This tunable can be used to reduce StatsD traffic
    proportionally for all metrics and is intended to replace
    log_statsd_default_sample_rate, which is left alone for
    backward-compatibility, should anyone be using it.

    This patch also includes a drive-by fix for log_udp_port, which wasn't
    being converted to an int (I didn't verify that it actually causes trouble
    in SysLogHandler(), but it's definitely an improvement regardless).

    Change-Id: Id404636e3629f6431cf1c4e64a143959750a3c23
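
Roughly, the factor means each metric's effective rate is its built-in per-metric sample rate multiplied by log_statsd_sample_rate_factor. A toy illustration with made-up rates (not the actual values used by Swift):

def effective_sample_rate(per_metric_rate, sample_rate_factor=1.0):
    # The factor scales every metric's built-in rate proportionally.
    return per_metric_rate * sample_rate_factor

# Hypothetical numbers: a chatty REPLICATE timer shipped with a low built-in
# rate gets cut further, while a rare GET timer keeps most of its samples.
effective_sample_rate(0.1, 0.5)   # -> 0.05
effective_sample_rate(1.0, 0.5)   # -> 0.5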

Changed in swift:
status: In Progress → Fix Committed
Thierry Carrez (ttx)
Changed in swift:
milestone: none → 1.8.0-rc1
status: Fix Committed → Fix Released
Thierry Carrez (ttx)
Changed in swift:
milestone: 1.8.0-rc1 → 1.8.0