Hi,

Disclaimer: much reading but not much progress here :(

So this is still happening, even with 2.2.9. wgrant has a controller that is under high load / high sha1.blockAMD64 activity. On this controller I tried to get a perf flamegraph, but it wasn't really useful: https://private-fileshare.canonical.com/~axino/lp1733708/perf.svg (you can click on boxes to zoom in on them). I also got a graph of the pprof profile data, which yielded https://private-fileshare.canonical.com/~axino/lp1733708/calls.svg - this isn't as useful as it looks.

However, taking a look with "peek" gives some hints at what's going on. We're going to start at sha1.blockAMD64 and work our way up the call chain:

(pprof) peek crypto/sha1.blockAMD64
24.79s of 26.04s total (95.20%)
Dropped 381 nodes (cum <= 0.13s)
----------------------------------------------------------+-------------
      flat  flat%   sum%        cum   cum%   calls calls% + context
----------------------------------------------------------+-------------
                                            16.77s   100% |   crypto/sha1.block
    16.77s 64.40% 64.40%     16.77s 64.40%                | crypto/sha1.blockAMD64
----------------------------------------------------------+-------------

^ this means that all calls to crypto/sha1.blockAMD64 are made from crypto/sha1.block. Let's continue up:

(pprof) peek crypto/sha1.block$
24.79s of 26.04s total (95.20%)
Dropped 381 nodes (cum <= 0.13s)
----------------------------------------------------------+-------------
      flat  flat%   sum%        cum   cum%   calls calls% + context
----------------------------------------------------------+-------------
                                               17s   100% |   crypto/sha1.(*digest).Write
     0.22s  0.84%  0.84%     17.01s 65.32%                | crypto/sha1.block
                                            16.77s   100% |   crypto/sha1.blockAMD64
----------------------------------------------------------+-------------

(pprof) peek crypto/sha1.\(\*digest\).Write
24.79s of 26.04s total (95.20%)
Dropped 381 nodes (cum <= 0.13s)
----------------------------------------------------------+-------------
      flat  flat%   sum%        cum   cum%   calls calls% + context
----------------------------------------------------------+-------------
                                            10.28s 53.15% |   crypto/sha1.(*digest).checkSum
                                             4.57s 23.63% |   crypto/hmac.(*hmac).Sum
                                             4.21s 21.77% |   crypto/hmac.(*hmac).Reset
                                             0.28s  1.45% |   crypto/hmac.(*hmac).Write
     1.83s  7.03%  7.03%     19.39s 74.46%                | crypto/sha1.(*digest).Write
                                               17s 96.81% |   crypto/sha1.block
                                             0.56s  3.19% |   runtime.memmove
----------------------------------------------------------+-------------

There's a split here, with 3 different callers above 20%, but they all converge back just at the boundary of the "crypto" package.
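(That convergence is expected: crypto/hmac keeps an inner and an outer SHA-1 digest, so Write, Reset and Sum on the HMAC all end up feeding the same sha1.(*digest).Write. A trivial sketch of the standard-library pattern, just to show why those three callers appear - this is not juju code:)

// Why hmac.(*hmac).Write, Reset and Sum all show up as callers of
// sha1.(*digest).Write: the HMAC object just feeds data into an inner
// and an outer SHA-1 digest.
package main

import (
    "crypto/hmac"
    "crypto/sha1"
    "fmt"
)

func main() {
    mac := hmac.New(sha1.New, []byte("key")) // two sha1 digests under the hood
    mac.Write([]byte("payload"))             // -> sha1.(*digest).Write
    sum := mac.Sum(nil)                      // finalizes: checkSum + more sha1 writes
    mac.Reset()                              // resets the inner digest and re-writes the key pad
    fmt.Printf("%x\n", sum)
}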
Let's keep looking:

(pprof) peek \.checkSum$
24.79s of 26.04s total (95.20%)
Dropped 381 nodes (cum <= 0.13s)
----------------------------------------------------------+-------------
      flat  flat%   sum%        cum   cum%   calls calls% + context
----------------------------------------------------------+-------------
                                            12.31s   100% |   crypto/sha1.(*digest).Sum
        2s  7.68%  7.68%     12.31s 47.27%                | crypto/sha1.(*digest).checkSum
                                            10.28s   100% |   crypto/sha1.(*digest).Write
----------------------------------------------------------+-------------

(pprof) peek \.Sum
24.79s of 26.04s total (95.20%)
Dropped 381 nodes (cum <= 0.13s)
----------------------------------------------------------+-------------
      flat  flat%   sum%        cum   cum%   calls calls% + context
----------------------------------------------------------+-------------
                                            13.52s   100% |   crypto/hmac.(*hmac).Sum
     0.67s  2.57%  2.57%     13.53s 51.96%                | crypto/sha1.(*digest).Sum
                                            12.31s 95.72% |   crypto/sha1.(*digest).checkSum
                                             0.29s  2.26% |   runtime.memmove
                                             0.26s  2.02% |   runtime.duffcopy
----------------------------------------------------------+-------------
                                            18.40s   100% |   gopkg.in/mgo.v2/internal/scram.(*Client).saltPassword
     0.28s  1.08%  3.65%     18.46s 70.89%                | crypto/hmac.(*hmac).Sum
                                            13.52s 74.57% |   crypto/sha1.(*digest).Sum
                                             4.57s 25.21% |   crypto/sha1.(*digest).Write
                                             0.04s  0.22% |   crypto/sha1.(*digest).Reset
----------------------------------------------------------+-------------

OK, this is our first boundary outside of the "crypto" package: gopkg.in/mgo.v2/internal/scram.(*Client).saltPassword (see https://github.com/go-mgo/mgo/blob/v2/internal/scram/scram.go). The 3 "split" callers from earlier are all reached from here, so 100% of the sha1.blockAMD64 calls come from saltPassword().
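For context, saltPassword() implements the Hi() function from RFC 5802, i.e. PBKDF2 with HMAC-SHA-1, with an iteration count dictated by the server (MongoDB typically asks for 10000). A rough sketch of its shape - not the actual mgo code, just an illustration of why a single SCRAM login burns this much CPU in hmac/sha1, and why the hmac.Sum/Reset/Write callers above dominate:

package main

import (
    "crypto/hmac"
    "crypto/sha1"
    "fmt"
)

// saltPassword sketches scram.(*Client).saltPassword: the Hi() function
// from RFC 5802 (PBKDF2-HMAC-SHA1). Each login runs iterCount HMAC-SHA1
// rounds, so logins happening per request add up fast.
func saltPassword(password, salt []byte, iterCount int) []byte {
    mac := hmac.New(sha1.New, password)
    mac.Write(salt)
    mac.Write([]byte{0, 0, 0, 1}) // INT(1): first (and only) PBKDF2 block
    ui := mac.Sum(nil)
    hi := make([]byte, len(ui))
    copy(hi, ui)
    for i := 1; i < iterCount; i++ {
        mac.Reset()
        mac.Write(ui)
        ui = mac.Sum(nil)
        for j := range hi {
            hi[j] ^= ui[j]
        }
    }
    return hi
}

func main() {
    // 10000 HMAC-SHA1 rounds per login (iteration count is an assumption here).
    fmt.Printf("%x\n", saltPassword([]byte("password"), []byte("salt"), 10000))
}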
Let's continue:

(pprof) peek saltPassword
24.79s of 26.04s total (95.20%)
Dropped 381 nodes (cum <= 0.13s)
----------------------------------------------------------+-------------
      flat  flat%   sum%        cum   cum%   calls calls% + context
----------------------------------------------------------+-------------
                                            24.03s   100% |   gopkg.in/mgo.v2/internal/scram.(*Client).step2
     0.65s  2.50%  2.50%     24.03s 92.28%                | gopkg.in/mgo.v2/internal/scram.(*Client).saltPassword
                                            18.40s 78.73% |   crypto/hmac.(*hmac).Sum
                                             4.46s 19.08% |   crypto/hmac.(*hmac).Reset
                                             0.51s  2.18% |   crypto/hmac.(*hmac).Write
----------------------------------------------------------+-------------

(pprof) peek step2
24.79s of 26.04s total (95.20%)
Dropped 381 nodes (cum <= 0.13s)
----------------------------------------------------------+-------------
      flat  flat%   sum%        cum   cum%   calls calls% + context
----------------------------------------------------------+-------------
                                            24.14s   100% |   gopkg.in/mgo.v2/internal/scram.(*Client).Step
         0     0%     0%     24.14s 92.70%                | gopkg.in/mgo.v2/internal/scram.(*Client).step2
                                            24.03s   100% |   gopkg.in/mgo.v2/internal/scram.(*Client).saltPassword
----------------------------------------------------------+-------------

(pprof) peek Step
24.79s of 26.04s total (95.20%)
Dropped 381 nodes (cum <= 0.13s)
----------------------------------------------------------+-------------
      flat  flat%   sum%        cum   cum%   calls calls% + context
----------------------------------------------------------+-------------
                                            24.16s   100% |   gopkg.in/mgo%2ev2.(*mongoSocket).loginSASL
         0     0%     0%     24.16s 92.78%                | gopkg.in/mgo%2ev2.(*saslScram).Step
                                            24.16s   100% |   gopkg.in/mgo.v2/internal/scram.(*Client).Step
----------------------------------------------------------+-------------
                                            24.16s   100% |   gopkg.in/mgo%2ev2.(*saslScram).Step
         0     0%     0%     24.16s 92.78%                | gopkg.in/mgo.v2/internal/scram.(*Client).Step
                                            24.14s   100% |   gopkg.in/mgo.v2/internal/scram.(*Client).step2
----------------------------------------------------------+-------------

(pprof) peek loginSASL
24.79s of 26.04s total (95.20%)
Dropped 381 nodes (cum <= 0.13s)
----------------------------------------------------------+-------------
      flat  flat%   sum%        cum   cum%   calls calls% + context
----------------------------------------------------------+-------------
                                            24.57s   100% |   gopkg.in/mgo%2ev2.(*mongoSocket).Login
         0     0%     0%     24.57s 94.35%                | gopkg.in/mgo%2ev2.(*mongoSocket).loginSASL
                                            24.16s 98.37% |   gopkg.in/mgo%2ev2.(*saslScram).Step
                                             0.40s  1.63% |   gopkg.in/mgo%2ev2.(*mongoSocket).loginRun
----------------------------------------------------------+-------------

(pprof) peek Login
24.79s of 26.04s total (95.20%)
Dropped 381 nodes (cum <= 0.13s)
----------------------------------------------------------+-------------
      flat  flat%   sum%        cum   cum%   calls calls% + context
----------------------------------------------------------+-------------
                                            24.58s   100% |   gopkg.in/mgo%2ev2.(*Session).acquireSocket
         0     0%     0%     24.58s 94.39%                | gopkg.in/mgo%2ev2.(*Session).socketLogin
                                            24.58s   100% |   gopkg.in/mgo%2ev2.(*mongoSocket).Login
----------------------------------------------------------+-------------
                                            24.58s   100% |   gopkg.in/mgo%2ev2.(*Session).socketLogin
         0     0%     0%     24.58s 94.39%                | gopkg.in/mgo%2ev2.(*mongoSocket).Login
                                            24.57s   100% |   gopkg.in/mgo%2ev2.(*mongoSocket).loginSASL
----------------------------------------------------------+-------------

(pprof) peek acquireSocket
24.79s of 26.04s total (95.20%)
Dropped 381 nodes (cum <= 0.13s)
----------------------------------------------------------+-------------
      flat  flat%   sum%        cum   cum%   calls calls% + context
----------------------------------------------------------+-------------
                                            22.58s 90.98% |   gopkg.in/mgo%2ev2.(*Query).One
                                             1.30s  5.24% |   gopkg.in/mgo%2ev2.(*Collection).writeOp
                                             0.62s  2.50% |   gopkg.in/mgo%2ev2.(*Query).Iter
                                             0.32s  1.29% |   gopkg.in/mgo%2ev2.(*Database).Run
         0     0%     0%     24.82s 95.31%                | gopkg.in/mgo%2ev2.(*Session).acquireSocket
                                            24.58s 99.03% |   gopkg.in/mgo%2ev2.(*Session).socketLogin
                                             0.24s  0.97% |   gopkg.in/mgo%2ev2.(*mongoCluster).AcquireSocket
----------------------------------------------------------+-------------

(pprof) peek \.One
24.79s of 26.04s total (95.20%)
Dropped 381 nodes (cum <= 0.13s)
----------------------------------------------------------+-------------
      flat  flat%   sum%        cum   cum%   calls calls% + context
----------------------------------------------------------+-------------
                                            17.81s 80.33% |   github.com/juju/juju/state.getStatus
                                             1.58s  7.13% |   github.com/juju/juju/state.(*State).Unit
                                             0.50s  2.26% |   github.com/juju/juju/state.(*State).getMachineDoc
                                             0.47s  2.12% |   github.com/juju/juju/state.getInstanceData
                                             0.44s  1.98% |   github.com/juju/juju/state.(*allWatcherStateBacking).Changed
                                             0.37s  1.67% |   github.com/juju/juju/state.(*Unit).AssignedMachineId
                                             0.31s  1.40% |   github.com/juju/juju/state.(*State).Application
                                             0.28s  1.26% |   github.com/juju/juju/state.readTxnRevno
                                             0.27s  1.22% |   github.com/juju/juju/state.readSettingsDocInto
                                             0.14s  0.63% |   github.com/juju/juju/state.(*State).APIHostPorts
         0     0%     0%     22.77s 87.44%                | github.com/juju/juju/mongo.queryWrapper.One
                                            22.77s   100% |   gopkg.in/mgo%2ev2.(*Query).One
----------------------------------------------------------+-------------
                                            22.77s   100% |   github.com/juju/juju/mongo.queryWrapper.One
         0     0%     0%     22.78s 87.48%                | gopkg.in/mgo%2ev2.(*Query).One
                                            22.58s 99.12% |   gopkg.in/mgo%2ev2.(*Session).acquireSocket
                                             0.10s  0.44% |   gopkg.in/mgo%2ev2.(*mongoSocket).SimpleQuery
                                             0.10s  0.44% |   gopkg.in/mgo.v2/bson.Unmarshal
----------------------------------------------------------+-------------
OK, end of the line. We can see here the various juju functions making calls to mongodb via mgo (note that on other profiles the distribution is different, i.e. getStatus() is not always so predominant). Looking at a "normal" controller, we have:

----------------------------------------------------------+-------------
                                             2.10s 78.36% |   github.com/juju/juju/mongo.queryWrapper.One
                                             0.25s  9.33% |   gopkg.in/mgo.v2/txn.(*Runner).load
                                             0.17s  6.34% |   gopkg.in/mgo.v2/txn.(*flusher).assert
                                             0.16s  5.97% |   gopkg.in/mgo.v2/txn.(*flusher).rescan
     0.01s 0.047% 0.093%      2.71s 12.63%                | gopkg.in/mgo%2ev2.(*Query).One
                                             1.33s 50.76% |   gopkg.in/mgo%2ev2.(*mongoSocket).SimpleQuery
                                             1.17s 44.66% |   gopkg.in/mgo.v2/bson.Unmarshal
                                             0.07s  2.67% |   runtime.newobject
                                             0.05s  1.91% |   gopkg.in/mgo%2ev2.(*Session).acquireSocket
----------------------------------------------------------+-------------

so there Query.One() is mostly calling SimpleQuery and bson.Unmarshal, as expected.

I believe all this indicates that _somehow_ the cause of this bug is that juju stops using the mongodb connection pool and creates (and authenticates) a new connection for every request it wants to make. If you go back to the flamegraph, click on "runtime.goexit" and then search for "loginSASL" at the top right, you can see that nearly every stack is spending ~99% of its time logging in to the database.

I haven't been able to root cause that yet, and the following would have helped:

a) pprof tracing - see bug 1742955
b) enabling mgo debug logging - I'm not sure how to do that via the "logging-config" model config option, and I'm not sure that's even possible. If it's not possible then let's file a bug.

Thanks
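PS: regarding (b), the mgo package itself does expose SetLogger() and SetDebug() hooks; what I don't know is whether juju wires them up to logging-config at all, so using them would probably need a code change on our side. A minimal sketch of the raw mgo API (the dial address is just an example, not how jujud actually connects):

package main

import (
    "log"
    "os"

    mgo "gopkg.in/mgo.v2"
)

func main() {
    // Route mgo's internal logging to stderr and turn on debug output;
    // socket acquisition and authentication then show up in the log.
    mgo.SetLogger(log.New(os.Stderr, "mgo: ", log.LstdFlags))
    mgo.SetDebug(true)

    session, err := mgo.Dial("localhost:37017")
    if err != nil {
        log.Fatal(err)
    }
    defer session.Close()
}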