load spike on HA 2.2.6 controller following remove-application

Bug #1733708 reported by Paul Collins on 2017-11-21
This bug affects 7 people
Affects    Importance  Assigned to
juju       High        Unassigned
juju 2.2   High        Andrew Wilkins
juju 2.3   High        Andrew Wilkins

Bug Description

One of our common deployment patterns involves deploying a new Juju application, making its units active in haproxy, removing the previous application, and waiting for it to be removed.

Today during such a deployment, our 3-node 2.2.6 controller started registering increasing load, ever more mongodb operations, and increasingly poor response times, e.g. multiple minutes to request "juju status" of a single application (see attached Grafana screenshot).

The load spike lines up with one of the deployment scripts having issued juju remove-application after the new application was deployed and made active.

Paul Collins (pjdc) on 2017-11-21
description: updated
Paul Collins (pjdc) wrote :

some mongotop: https://pastebin.canonical.com/203792/ (Canonical-only link, sorry!)

Ian Booth (wallyworld) wrote :

Can we get mongo syslog and juju controller logs as well?

Paul Collins (pjdc) wrote :

machine-*.log from 20:50 to 20:59: https://pastebin.canonical.com/203796/

mongod syslog on primary (it became primary at 07:53:50) from 20:50 to 20:59: https://private-fileshare.canonical.com/~pjdc/juju2controllers-mongod-syslog.xz

Paul Collins (pjdc) wrote :

Sorry, those mongodb logs are from 23:50 to 23:59. Here's 20:50 to 20:59:

https://private-fileshare.canonical.com/~pjdc/juju2controllers-mongod-syslog-2.xz

The "wrong" logs might still be of interest as they probably reflect the system's current state.

Ian Booth (wallyworld) wrote :

https://pastebin.canonical.com/203796/ looks like a typo - it contains some mongo commands rather than machine log entries

Paul Collins (pjdc) wrote :

Yep, sorry. The machine-*.log entries are https://pastebin.canonical.com/203794/ (try https://pastebin.canonical.com/203794/plain/ if the former takes too long to load)

Tim Penhey (thumper) on 2017-11-22
Changed in juju:
status: New → Triaged
importance: Undecided → High
milestone: none → 2.3.1
Junien Fridrick (axino) wrote :

FWIW, this is not the first time this has happened; it also happened to me last week (when we were still on 2.2.4). And I'm pretty sure I linked at least one load spike to an application removal earlier than that.

Tim Penhey (thumper) wrote :

I watched the removal of an application today with wgrant and t0mb0.

Before we started we checked out the database. The txns collection had just been cleaned up from the previous night and was at a reasonable level. There were no pending cleanups for the model, and mongotop looked sane.

The command was just to remove a single application that had two units. Each of these principal units would have had four subordinates.

Just after the removal, the juju.txns.log collection had reads go through the roof. A lot of write load into juju.txns. The read load on juju.txns.log was high for a number of minutes causing 'juju status' on other models to spike to 20s from 0.5s.

There was nothing untoward going on as far as I could tell, but the statetracker report on each of the controllers showed that every statepool had 184 models (and a few extra just because). Across the three controllers this would have resulted in 552 model workers each polling txns.log.

I think what we are hitting here is a situation where the cascading changes due to a deletion cause load in the txns collection. This causes a spike in the txns.log reads due to the number of tailers. This introduces significant i/o load on mongo, which in turn can cause other commands to fail, which causes retries, which triggers additional transactions, which just adds to the overall load. This can degrade into a death spiral that is only recoverable by restarting the application servers.

To solve this, I think we really need to investigate a way to provide a central txns.log tailer for each controller.

Junien Fridrick (axino) wrote :

I found out today that in another model (the staging version of the model you took a look at yesterday), an application is removed multiple times per day, but this doesn't generate any load increase. This staging model has about half the number of statuseshistory records, and a third of the logs records, of the production model (in case this is relevant).

This probably invalidates the theory you posted above :|

Tim Penhey (thumper) wrote :
Changed in juju:
status: Triaged → In Progress
assignee: nobody → Tim Penhey (thumper)
Tim Penhey (thumper) wrote :

Junien, is the staging model using the same controllers?

I think the load occurs when there are enough changes. Sometimes the load is enough to trigger bad behaviour and sometimes not.

Are there subordinates in this staging model?

tags: added: scalability
Junien Fridrick (axino) wrote :

Yes, it's using the same controller, and yes, the staging model has the same number of subordinates (6) as production.

However, production has 4 units where staging only has 2.

Changed in juju:
milestone: 2.3.1 → none
Tim Penhey (thumper) on 2017-12-10
Changed in juju:
milestone: none → 2.3.2
Junien Fridrick (axino) wrote :

This is still impacting a 2.2.8 controller when removing applications from a 2.2.6 model. The mongodb "commands" counter (see https://docs.mongodb.com/manual/reference/command/serverStatus/ , section opcounters.command) increases sharply, from ~600 per second to ~3500 per second.

I captured 100M worth of operations (using db.setProfilingLevel(2, 100)) during high load, and after a restart, when load was back to normal.

I'm unable to find anything relevant in this. And as a matter of fact, the 100MB were reached faster when under normal load than when under high load.

"perf top" showed this line at the top on the mongodb primary :
24.89% jujud [.] crypto/sha1.blockAMD64

And it went up to 60% on the secondaries (it's lower on the primary because mongodb takes a fair share of the CPU, I guess). And SHA1 is used during the SASL handshake. So it could be that all the controllers are trying to SASL to mongodb as fast as possible.

In fact, even with mongodb profiling (logging any command taking more than 100ms), I can see mongodb logging multiple thousands of saslStart per minute during the high load period.

When load was back to normal, I configured profiling to log _everything_, and got fewer than 10 saslStart calls in 30 seconds.

I'll keep on investigating when this re-occurs (we can trigger it easily).

Note that 2.2.8 made "high load" lower than before.

I believe the issue is that something internal to Juju causes us to not
fully consume cursors, which ultimately leads us to think the
connection pool is full, so we open one more connection to mongo. If we
had fully consumed and closed the cursor, then that connection would be
available in the pool to re-use for the next query, which would avoid
the need to handshake a new connection. (And also deals with things
like too-many-open-file-handles.)

I think having some sort of logging that gets triggered whenever a new
connection is created, with a stack trace of what we're doing to cause
the connection to be opened, could be useful. It's tricky to do in code,
because it requires hacking the mgo driver to generate the full stack
trace. Do we have any way to reproduce this sort of behavior outside of
production?


Junien Fridrick (axino) wrote :

> Do we have any way to reproduce this sort of behavior that isn't production ?

Not that I'm aware of, unfortunately.

Junien Fridrick (axino) wrote :

Is there a way to kill juju in such a way that it would generate a core dump and/or display all stack traces or something? (the same as what happens when you SIGQUIT a telegraf process)

Junien Fridrick (axino) wrote :

I forgot about the pprof facilities, I'll take a look at them next time !

John A Meinel (jameinel) wrote :

SIGQUIT does exactly that for Juju, with the caveat that it outputs to
whatever the original log file was when we started up. (it is captured via
systemd, and when you rotate log files, that doesn't cause stderr to rotate
as well.)

We also have pprof and some of the other utilities as well.


Tim Penhey (thumper) wrote :

After more investigation, I came across this:

https://github.com/juju/juju/blob/2.2/state/status.go#L371

Statuses history is being removed all at once, and isn't batched at all. I believe that this will be causing i/o timeouts and reconnections.

Tim Penhey (thumper) wrote :

There is a function in state/prune.go for pruning collections. It seems to have some hard-coded references to status history, but it could and should be reused in the places where we are removing all the status history documents.

Andrew Wilkins (axwalk) wrote :

https://github.com/juju/juju/pull/8254 batch-deletes status history.

William Grant (wgrant) wrote :

After a couple of deploys yesterday reinforced the statuseshistory-pruning hypothesis, we preserved some candidate machines and executed controlled demolitions today, which basically proved that the statuseshistory cleanup is the problem.

For context, the two models (on the same 3-way HA OpenStack controller running Juju 2.2.8) that we can reproduce this on have frequently created and deleted applications with two (staging) and four (production) units each with four and six subordinates respectively. Staging is usually deployed to a couple of times a day, and production 1-3 times a week. Relevant model settings are update-status-hook-interval=5m, max-status-history-age=336h, max-status-history-size=5G. The principal and a couple of the subordinates set workload status on every update-status.

Deployments consist of deploying a new application, verifying it, adding relations, then remove-application and remove-machine --force on the old one (the remove-machine --force is mostly a leftover from the 2.1 subordinate bugs). In almost every production deployment, the remove phase tickles this bug. In staging deployments it is very rare, but has happened on at least two occasions. The main differences between the models are that production has twice as many machines and 50% more subordinates, and its units tend to live for several days rather than no more than one or two. It was on this basis that we speculated in November that statuseshistory might be to blame.

During the week of December 18, we were attempting to tickle the bug in controlled, monitored circumstances, so performed production deployments more closely together than is usual. The second production deployment in 24 hours did not tickle the bug, the first time in months that a deployment had proceeded without incident, and so we were unable to gather particularly useful data.

Both production and staging went without deployments for 12 days over the EOY break. The first staging deployment of 2018 tickled the bug -- unusual for staging. Controller logs showed "cannot delete history for unit" errors with IO timeouts for some of the units, some repeating over the next 20 minutes. The saslStart storm mentioned in comment #15 was again evident. A controller restart resolved things.

I preserved remaining production and staging machines from before the break for more detailed experiments. We manually purged the statuseshistories for the units on one machine of the other 12-day-old staging application, and a remove-machine --force worked fine; the machine vanished in about 5s and no added "juju status" latency was observed. Without removing statuseshistories, remove-machine --force of the other machine also worked okay, but it took more than 30s, load on one controller spiked above 90, and status was delayed by several seconds during the removal. Each machine's units had about 50000 statuseshistories in total (on both production and staging there were ~7700 statuseshistory documents for each unit agent, and the same for the charms that update their status often).

Since that seemed like a reasonably successful test, we tried it on a 13-day-old production application inside a normal deployment ...


William Grant (wgrant) wrote :

We also this morning saw a saslStart storm during a destroy-model from an unrelated team on the same controller, but it was too late to gather any data.

Tim Penhey (thumper) wrote :

The deletion of the statuses history values is done using a cleanup process. These are triggered as the units transition from dying to dead and are executed asynchronously, but serially for a particular model.

What we would see would be a unit hanging around for a while although dead as the cleanup jobs run. The cleanup could still delete the statuses history in small batches to effectively yield time back to the database for dealing with other requests.

Tim Penhey (thumper) on 2018-01-03
Changed in juju:
milestone: 2.3.2 → 2.4-beta1
status: In Progress → Triaged
assignee: Tim Penhey (thumper) → nobody
Junien Fridrick (axino) wrote :

Hi,

Disclaimer: much reading but not much progress here :(

So this is still happening, even with 2.2.9. wgrant has a controller that is under high load / high sha1.blockAMD64 activity. On this controller, I tried to get a perf flamegraph but it wasn't really useful: https://private-fileshare.canonical.com/~axino/lp1733708/perf.svg (you can click on boxes to zoom in on them).

I also got a graph of the pprof profile data, which yielded https://private-fileshare.canonical.com/~axino/lp1733708/calls.svg - this isn't as useful as it looks. However, taking a look with "peek" gives hints at what's going on. We're going to start at sha1.blockAMD64 and go up the call chain:

(pprof) peek crypto/sha1.blockAMD64
24.79s of 26.04s total (95.20%)
Dropped 381 nodes (cum <= 0.13s)
----------------------------------------------------------+-------------
      flat flat% sum% cum cum% calls calls% + context
----------------------------------------------------------+-------------
                                            16.77s 100% | crypto/sha1.block
    16.77s 64.40% 64.40% 16.77s 64.40% | crypto/sha1.blockAMD64
----------------------------------------------------------+-------------

^ means that all calls to crypto/sha1.blockAMD64 are made from crypto/sha1.block. Let's continue up:

(pprof) peek crypto/sha1.block$
24.79s of 26.04s total (95.20%)
Dropped 381 nodes (cum <= 0.13s)
----------------------------------------------------------+-------------
      flat flat% sum% cum cum% calls calls% + context
----------------------------------------------------------+-------------
                                               17s 100% | crypto/sha1.(*digest).Write
     0.22s 0.84% 0.84% 17.01s 65.32% | crypto/sha1.block
                                            16.77s 100% | crypto/sha1.blockAMD64
----------------------------------------------------------+-------------

(pprof) peek crypto/sha1.\(\*digest\).Write
24.79s of 26.04s total (95.20%)
Dropped 381 nodes (cum <= 0.13s)
----------------------------------------------------------+-------------
      flat flat% sum% cum cum% calls calls% + context
----------------------------------------------------------+-------------
                                            10.28s 53.15% | crypto/sha1.(*digest).checkSum
                                             4.57s 23.63% | crypto/hmac.(*hmac).Sum
                                             4.21s 21.77% | crypto/hmac.(*hmac).Reset
                                             0.28s 1.45% | crypto/hmac.(*hmac).Write
     1.83s 7.03% 7.03% 19.39s 74.46% | crypto/sha1.(*digest).Write
                                               17s 96.81% | crypto/sha1.block
                                             0.56s 3.19% | runtime.memmove
----------------------------------------------------------+-------------

There's a split here, with 3 different callers > 20%, but they actually all converge back just at the boundary of the "crypto" package. Let's keep looking:

(pprof) peek \.checkSum$
24.79s of 26.04s total (95.2...

John A Meinel (jameinel) wrote :

Given that this overhead is from SASL login calls, and this bug is filed
against mgo, I wonder if it is the culprit:
https://github.com/go-mgo/mgo/issues/254

(it seems to fit the evidence, at least).

John
=:->


Anastasia (anastasia-macmood) wrote :

@John A Meinel,

I am not too sure where this stands for other series... It appears that a lot of code was committed to improve performance, but we have not had recent confirmation that the load issues are resolved.

I'll mark it as "Incomplete" for now. Also, we will not be fixing this in 2.2 as there are no further point releases planned in this series.

Changed in juju:
status: Triaged → Fix Committed
Anastasia (anastasia-macmood) wrote :

Actually, marking as "Fix Committed" for "juju" since I am sure all related code has been forward-ported to "develop" (heading into 2.4+) as part of a larger merge.

Changed in juju:
status: Fix Committed → Fix Released