Investigate relstorage + memcached configuration / performance

Bug #1302750 reported by Tres Seaver
Affects: KARL3
Status: Fix Released
Importance: Low
Assigned to: Shane Hathaway

Bug Description

On Monday, 2014-03-31, Paul reported:

> This morning I decided to visit KARL. I first did a ping on
> <production> to make sure there were no DNS issues. My KARL home
> page is the communities listing.
>
> We have an improved "timeit" at the bottom of every screen. Above
> the copyright statement is a blank line. Mouseover it and you get:
>
> - The execution time in the middleware stack
> - The name of the app server that served that request
>
> From this I could see <appserver1> took 28.87 sec to render the
> results. That, obviously, is a crazy number. The communities listing
> is a very popular request. The cache should be very primed.
>
> I decided to look a little more into it. The
> var/log/connection_stats.csv is Tres's log of how many ZODB objects
> were loaded for slow requests. Here is the line for that request,
> showing 5,119 objects were loaded (not a crazy amount):
>
> "2014-03-31 08:00:04", "GET",
> "<hostname>/communities/active_communities.html",
> 28.865574, 5119, 0
>
> Interestingly (perhaps), that is the only entry in the last 39
> minutes, meaning we haven't had many slow requests on that app
> server.
>
> Now, over to DataDog. At 8AM I see a little spike in CPU, up to 20%
> system. I see a big spike in memcache bytes read at that instant:
> 96 KB, which appears to be the peak amount for all the spikes in the
> last month. The 28.8s shows up in permetrics.tween.t.95percentile.
>
> Ok, getting a little more interesting. Shane has a permetric
> CS.karl.models.catalog.CachingCatalog.search.t.95percentile.
> Basically, time spent executing a catalog operation. At 8AM I see a
> huge jump to... 28.8 seconds. So the time spent on my long request
> was (not surprisingly) in the catalog.
>
> None of the relstorage metrics show anything significant at the 8AM
> timeframe. It would appear that the relevant objects were gotten out
> of memcache. I need to get the DataDog agent running again on the
> other app servers, the frontend, and on the db server. (They were
> running previously but I never wired them up for VM reboot.)
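
The connection_stats.csv line quoted above can be unpacked with a short script. The field names here (timestamp, method, URL, duration in seconds, objects loaded, plus one unidentified trailing counter) are inferred from the sample line, not taken from the logger's actual schema, so treat them as assumptions:

```python
import csv
import io

# One line from var/log/connection_stats.csv, as quoted in the report.
sample = ('"2014-03-31 08:00:04", "GET", '
          '"<hostname>/communities/active_communities.html", '
          '28.865574, 5119, 0')

# Field names are guesses from context; the last column's meaning is unknown.
fields = ["timestamp", "method", "url", "duration_s", "objects_loaded", "extra"]

row = next(csv.reader(io.StringIO(sample), skipinitialspace=True))
record = dict(zip(fields, row))
record["duration_s"] = float(record["duration_s"])
record["objects_loaded"] = int(record["objects_loaded"])

print(record["duration_s"], record["objects_loaded"])
```

Reading the file this way makes it easy to filter for requests over a threshold, which is essentially what Paul did by eye.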

Changed in karl3:
milestone: m136 → m138
Changed in karl3:
importance: Medium → Low
Changed in karl3:
assignee: Tres Seaver (tseaver) → nobody
Changed in karl3:
milestone: m138 → m139
Revision history for this message
Paul Everitt (paul-agendaless) wrote:

In September I will ask ChrisR if there is a 1 hour science experiment that will confirm/deny whether we have a weird config. If it's just a fishing expedition, I'll close this.
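
One low-cost check along these lines: memcached itself reports cumulative get_hits and get_misses counters via its stats command, and the hit rate can be computed from them. A minimal sketch of parsing that output, assuming the standard "STAT <name> <value>" reply format (the sample numbers are made up):

```python
def memcached_hit_rate(stats_text):
    """Compute the get-request hit rate from raw `stats` command output."""
    stats = {}
    for line in stats_text.splitlines():
        parts = line.split()
        # memcached replies with lines like: STAT get_hits 1234
        if len(parts) == 3 and parts[0] == "STAT":
            stats[parts[1]] = parts[2]
    hits = int(stats["get_hits"])
    misses = int(stats["get_misses"])
    return hits / float(hits + misses) if hits + misses else 0.0

# Example with made-up numbers:
sample = "STAT get_hits 9000\nSTAT get_misses 1000\nEND"
print(memcached_hit_rate(sample))  # 0.9
```

A consistently low hit rate would be one concrete signal that the cache is not pulling its weight.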

Changed in karl3:
assignee: nobody → Chris Rossi (chris-archimedeanco)
milestone: m139 → m140
Changed in karl3:
assignee: Chris Rossi (chris-archimedeanco) → Shane Hathaway (shane-hathawaymix)
milestone: m140 → m139
Paul Everitt (paul-agendaless) wrote:

Shane, do you think this one is worth doing? I don't think we have good visibility into how memcached is or isn't helping us. We certainly know it is causing upgrade pain, plus introducing some mystery into how we reason about performance.
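
For reference, memcached integration in RelStorage of that era was controlled by a couple of ZConfig options on the storage section. A sketch of the relevant fragment, in which the server address, cache module, and DSN are placeholders rather than KARL's actual values:

```
%import relstorage
<zodb main>
  <relstorage>
    # Placeholder memcached endpoint; multiple servers may be listed.
    cache-servers localhost:11211
    # Client library wrapper; 'relstorage.pylibmc_wrapper' is another option.
    cache-module-name memcache
    <postgresql>
      dsn dbname='karl_placeholder'
    </postgresql>
  </relstorage>
</zodb>
```

Removing or changing the cache-servers line is the knob any "science experiment" here would be turning.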

Changed in karl3:
milestone: m139 → m140
Paul Everitt (paul-agendaless) wrote:

I'm going to close this. I don't think there's anything we can do about memcached except guess and hope.

Changed in karl3:
status: New → Fix Released