I was investigating some of our issues with Transaction Queues and Leadership, and when you introduce a bad txn it causes all of the leadership workers to bounce.
During this time, something happens that causes the presence table to go crazy.
Specifically, I did:
$ juju enable-ha -n3
and then in mongo shell:
db.leases.update({"_id" : "a7dc4401-4028-4a05-8af3-347755f6f62d:application-leadership#ubuntu-lite#"}, {$push: {"txn-queue": "5a03fbfc1741fc14deadbeef_deadbeef"}})
(Pick any valid lease _id)
After which, I let the system sit there for a while (couple of hours?) with something like 100 applications, but only 1 of them broken.
After running this way for a while I did the opposite:
db.leases.update({"_id" : "a7dc4401-4028-4a05-8af3-347755f6f62d:application-leadership#ubuntu-lite#"}, {$pull: {"txn-queue": "5a03fbfc1741fc144411b4cc_562c305b"}}
And I was observing whether the system could recover once the bad transaction was removed.
It was, but I observed in mongotop an inordinate amount of time being spent in presence.presence.beings.
So I ran:
juju:PRIMARY> db.presence.beings.aggregate([{$group: {_id: "$key", count: {$sum: 1}}}, {$sort: {count: -1}}])
{ "_id" : "m#1", "count" : 2813 }
{ "_id" : "m#2", "count" : 2801 }
{ "_id" : "m#0", "count" : 2768 }
{ "_id" : "u#ul084/0", "count" : 1 }
{ "_id" : "u#ul047/0", "count" : 1 }
{ "_id" : "u#ul036/0", "count" : 1 }
{ "_id" : "u#ul073/0", "count" : 1 }
{ "_id" : "u#ul017/0", "count" : 1 }
{ "_id" : "u#ul086/0", "count" : 1 }
{ "_id" : "u#ul000/0", "count" : 1 }
...
It seems that for whatever reason, all of the controllers are running lots-and-lots of Presence data for each of the machine agents.
Presumably something like a Pinger that isn't getting killed when Leadership worker is getting restarted.
Note that the pings themselves are getting reaped properly:
juju:PRIMARY> db.presence.pings.count()
4
It seems that all of these Pingers are still actively pinging. Which also causes "juju status" to thrash, because it sees all of these alive pingers, but only 1 is valid for a given agent. Which means the Pinger logic reads in all of the extra identities, and then throws them away because their state is invalid. But the *next* status has to read them all in again.
juju:PRIMARY> db.presence.pings.find().limit(1).pretty()
{
"_id" : "a7dc4401-4028-4a05-8af3-347755f6f62d:1510484520",
"slot" : NumberLong(1510484520),
"alive" : {
"48" : NumberLong("9223372036854775807"),
"58" : NumberLong("9223372036854775807"),
"3e" : NumberLong("9223372036854775807"),
"8" : NumberLong("9223372036854775807"),
"49" : NumberLong("9223372036854775807"),
"27" : NumberLong("9223372036854775807"),
"2a" : NumberLong("9223372036854775807"),
"6b" : NumberLong("9223372036854775807"),
"75" : NumberLong("9223372036854775807"),
"63" : NumberLong("9223372036854775807"),
"5e" : NumberLong("9223372036854775807"),
"26" : NumberLong("9223372036854775807"),
"89" : NumberLong("9223372036854775807"),
"38" : NumberLong("9223372036854775807"),
"37" : NumberLong("9223372036854775807"),
"3a" : NumberLong("9223372036854775807"),
"20" : NumberLong("9223372036854775807"),
"8d" : NumberLong(8191),
"66" : NumberLong("9223372036854775807"),
"5a" : NumberLong("9223372036854775807"),
"7e" : NumberLong("9223372036854775807"),
"7b" : NumberLong("9223372036854775807"),
"54" : NumberLong("9223372036854775807"),
"67" : NumberLong("9223372036854775807"),
"45" : NumberLong("9223372036854775807"),
"34" : NumberLong("9223372036854775807"),
"2d" : NumberLong("9223372036854775807"),
"79" : NumberLong("9223372036854775807"),
"51" : NumberLong("9223372036854775807"),
"83" : NumberLong("9223372036854775807"),
"68" : NumberLong("9223372036854775807"),
"19" : NumberLong("9223372036854775807"),
"2f" : NumberLong("9223372036854775807"),
"65" : NumberLong("9223372036854775807"),
"74" : NumberLong("9223372036854775807"),
"87" : NumberLong("9223372036854775807"),
"33" : NumberLong("9223372036854775807"),
"80" : NumberLong("9223372036854775807"),
"8a" : NumberLong("9223372036854775807"),
"76" : NumberLong("9223372036854775807"),
"41" : NumberLong("9223372036854775807"),
"1d" : NumberLong("9223372036854775807"),
"5c" : NumberLong("9223372036854775807"),
"c" : NumberLong("9223372036854775807"),
"25" : NumberLong("9223372036854775807"),
"71" : NumberLong("9223372036854775807"),
"b" : NumberLong("9223372036854775807"),
"7c" : NumberLong("9223372036854775807"),
"60" : NumberLong("9223372036854775807"),
"55" : NumberLong("9223372036854775807"),
"3c" : NumberLong("9223372036854775807"),
"1a" : NumberLong("9223372036854775807"),
"64" : NumberLong("9223372036854775807"),
"61" : NumberLong("9223372036854775807"),
"85" : NumberLong("9223372036854775807"),
"7d" : NumberLong("9223372036854775807"),
"31" : NumberLong("9223372036854775807"),
"13" : NumberLong("9223372036854775807"),
"32" : NumberLong("9223372036854775807"),
"7f" : NumberLong("9223372036854775807"),
"d" : NumberLong("9223372036854775807"),
"f" : NumberLong("9223372036854775807"),
"6a" : NumberLong("9223372036854775807"),
"23" : NumberLong("9223372036854775807"),
"1c" : NumberLong("9223372036854775807"),
"8b" : NumberLong("9223372036854775807"),
"47" : NumberLong("9223372036854775807"),
"86" : NumberLong("9223372036854775807"),
"6e" : NumberLong("9223372036854775807"),
"72" : NumberLong("9223372036854775807"),
"3b" : NumberLong("9223372036854775807"),
"17" : NumberLong("9223372036854775807"),
"52" : NumberLong("9223372036854775807"),
"46" : NumberLong("9223372036854775807"),
"44" : NumberLong("9223372036854775807"),
"88" : NumberLong("9223372036854775807"),
"57" : NumberLong("9223372036854775807"),
"59" : NumberLong("9223372036854775807"),
"8c" : NumberLong("9223372036854775807"),
"78" : NumberLong("9223372036854775807"),
"5" : NumberLong("9223372036854775807"),
"4e" : NumberLong("9223372036854775807"),
"73" : NumberLong("9223372036854775807"),
"4b" : NumberLong("9223372036854775807"),
"15" : NumberLong("9223372036854775807"),
"5b" : NumberLong("9223372036854775807"),
"16" : NumberLong("9223372036854775807"),
"30" : NumberLong("9223372036854775807"),
"6c" : NumberLong("9223372036854775807"),
"2e" : NumberLong("9223372036854775807"),
"5d" : NumberLong("9223372036854775807"),
"77" : NumberLong("9223372036854775807"),
"4a" : NumberLong("9223372036854775807"),
"82" : NumberLong("9223372036854775807"),
"70" : NumberLong("9223372036854775807"),
"10" : NumberLong("9223372036854775807"),
"3f" : NumberLong("9223372036854775807"),
"6d" : NumberLong("9223372036854775807"),
"40" : NumberLong("9223372036854775807"),
"1f" : NumberLong("9223372036854775807"),
"24" : NumberLong("9223372036854775807"),
"29" : NumberLong("9223372036854775807"),
"12" : NumberLong("9223372036854775807"),
"7" : NumberLong("9223372036854775807"),
"e" : NumberLong("9223372036854775807"),
"81" : NumberLong("9223372036854775807"),
"18" : NumberLong("9223372036854775807"),
"62" : NumberLong("9223372036854775807"),
"3d" : NumberLong("9223372036854775807"),
"84" : NumberLong("9223372036854775807"),
"4d" : NumberLong("9223372036854775807"),
"36" : NumberLong("9223372036854775807"),
"6" : NumberLong("9223372036854775807"),
"11" : NumberLong("9223372036854775807"),
"1b" : NumberLong("9223372036854775807"),
"7a" : NumberLong("9223372036854775807"),
"5f" : NumberLong("9223372036854775807"),
"22" : NumberLong("9223372036854775807"),
"4f" : NumberLong("9223372036854775807"),
"50" : NumberLong("9223372036854775807"),
"69" : NumberLong("9223372036854775807"),
"53" : NumberLong("9223372036854775807"),
"2c" : NumberLong("9223372036854775807"),
"42" : NumberLong("9223372036854775807"),
"39" : NumberLong("9223372036854775807"),
"14" : NumberLong("9223372036854775807"),
"a" : NumberLong("9223372036854775807"),
"6f" : NumberLong("9223372036854775807"),
"43" : NumberLong("9223372036854775807"),
"9" : NumberLong("9223372036854775807"),
"28" : NumberLong("9223372036854775807"),
"56" : NumberLong("9223372036854775807"),
"1e" : NumberLong("9223372036854775807"),
"21" : NumberLong("9223372036854775807"),
"2b" : NumberLong("9223372036854775807"),
"4c" : NumberLong("9223372036854775807"),
"35" : NumberLong("9223372036854775807"),
"3" : NumberLong("3481790096963796992"),
"4" : NumberLong("9223372036854775797"),
"0" : NumberLong(256)
}
}
Note that statepool report doesn't seem to show anything going on:
# juju-statepool-report
Querying @jujud-machine-0 introspection socket: /statepool/
State Pool Report:
Model count: 1 models
Marked for removal: 0 models
Model: 4051bd02-cbb5-41eb-873c-758286f35f0c
Marked for removal: false
Reference count: 3
[1]
goroutine 7181821 [running]:
runtime/debug.Stack(0x2df9680, 0xc4237b8360, 0xc4203eff4d)
/snap/go/1016/src/runtime/debug/stack.go:24 +0xa7
github.com/juju/juju/state.(*StatePool).Get(0xc4209d0220, 0xc4203eff4d, 0x24, 0x0, 0x0, 0x0, 0x0)
/home/jameinel/dev/go/src/github.com/juju/juju/state/pool.go:91 +0x1ed
github.com/juju/juju/apiserver.(*Server).serveConn(0xc4209583c0, 0xc4207531b8, 0xc4203eff4d, 0x24, 0x4cc4560, 0xc4275e97a0, 0xc42760db20, 0x12, 0x0, 0x0)
/home/jameinel/dev/go/src/github.com/juju/juju/apiserver/apiserver.go:926 +0x75c
github.com/juju/juju/apiserver.(*Server).apiHandler.func2(0xc4207531b8)
/home/jameinel/dev/go/src/github.com/juju/juju/apiserver/apiserver.go:903 +0x17f
github.com/juju/juju/apiserver/websocket.Serve(0x4cc0d60, 0xc42b59f5e0, 0xc425effc00, 0xc42a999ab0)
/home/jameinel/dev/go/src/github.com/juju/juju/apiserver/websocket/websocket.go:64 +0x119
github.com/juju/juju/apiserver.(*Server).apiHandler(0xc4209583c0, 0x4cc0d60, 0xc42b59f5e0, 0xc425effc00)
/home/jameinel/dev/go/src/github.com/juju/juju/apiserver/apiserver.go:900 +0x1a3
github.com/juju/juju/apiserver.(*Server).(github.com/juju/juju/apiserver.apiHandler)-fm(0x4cc0d60, 0xc42b59f5e0, 0xc425effc00)
/home/jameinel/dev/go/src/github.com/juju/juju/apiserver/apiserver.go:623 +0x48
net/http.HandlerFunc.ServeHTTP(0xc4209d3680, 0x4cc0d60, 0xc42b59f5e0, 0xc425effc00)
/snap/go/1016/src/net/http/server.go:1918 +0x44
github.com/juju/juju/apiserver.(*Server).trackRequests.func1(0x4cc0d60, 0xc42b59f5e0, 0xc425effc00)
/home/jameinel/dev/go/src/github.com/juju/juju/apiserver/apiserver.go:873 +0x11b
net/http.HandlerFunc.ServeHTTP(0xc4209f89c0, 0x4cc0d60, 0xc42b59f5e0, 0xc425effc00)
/snap/go/1016/src/net/http/server.go:1918 +0x44
github.com/bmizerany/pat.(*PatternServeMux).ServeHTTP(0xc4209f8980, 0x4cc0d60, 0xc42b59f5e0, 0xc425effc00)
/home/jameinel/dev/go/src/github.com/bmizerany/pat/mux.go:117 +0x15b
net/http.serverHandler.ServeHTTP(0xc4209e45b0, 0x4cc0d60, 0xc42b59f5e0, 0xc425effc00)
/snap/go/1016/src/net/http/server.go:2619 +0xb4
net/http.(*conn).serve(0xc42ab8e1e0, 0x4cc3c20, 0xc423c2da00)
/snap/go/1016/src/net/http/server.go:1801 +0x71d
created by net/http.(*Server).Serve
/snap/go/1016/src/net/http/server.go:2720 +0x288
[2]
goroutine 7182021 [running]:
runtime/debug.Stack(0x2df9680, 0xc42343bb60, 0xc42230ea8d)
/snap/go/1016/src/runtime/debug/stack.go:24 +0xa7
github.com/juju/juju/state.(*StatePool).Get(0xc4209d0220, 0xc42230ea8d, 0x24, 0x0, 0x0, 0x0, 0x0)
/home/jameinel/dev/go/src/github.com/juju/juju/state/pool.go:91 +0x1ed
github.com/juju/juju/apiserver.(*Server).serveConn(0xc4209583c0, 0xc420753f60, 0xc42230ea8d, 0x24, 0x4cc4560, 0xc425912520, 0xc4217fe100, 0x12, 0x0, 0x0)
/home/jameinel/dev/go/src/github.com/juju/juju/apiserver/apiserver.go:926 +0x75c
github.com/juju/juju/apiserver.(*Server).apiHandler.func2(0xc420753f60)
/home/jameinel/dev/go/src/github.com/juju/juju/apiserver/apiserver.go:903 +0x17f
github.com/juju/juju/apiserver/websocket.Serve(0x4cc0d60, 0xc4295920e0, 0xc425ee1e00, 0xc42bf11ab0)
/home/jameinel/dev/go/src/github.com/juju/juju/apiserver/websocket/websocket.go:64 +0x119
github.com/juju/juju/apiserver.(*Server).apiHandler(0xc4209583c0, 0x4cc0d60, 0xc4295920e0, 0xc425ee1e00)
/home/jameinel/dev/go/src/github.com/juju/juju/apiserver/apiserver.go:900 +0x1a3
github.com/juju/juju/apiserver.(*Server).(github.com/juju/juju/apiserver.apiHandler)-fm(0x4cc0d60, 0xc4295920e0, 0xc425ee1e00)
/home/jameinel/dev/go/src/github.com/juju/juju/apiserver/apiserver.go:623 +0x48
net/http.HandlerFunc.ServeHTTP(0xc4209d3680, 0x4cc0d60, 0xc4295920e0, 0xc425ee1e00)
/snap/go/1016/src/net/http/server.go:1918 +0x44
github.com/juju/juju/apiserver.(*Server).trackRequests.func1(0x4cc0d60, 0xc4295920e0, 0xc425ee1e00)
/home/jameinel/dev/go/src/github.com/juju/juju/apiserver/apiserver.go:873 +0x11b
net/http.HandlerFunc.ServeHTTP(0xc4209f89c0, 0x4cc0d60, 0xc4295920e0, 0xc425ee1e00)
/snap/go/1016/src/net/http/server.go:1918 +0x44
github.com/juju/juju/apiserver.(*Server).trackRequests.func1(0x4cc0d60, 0xc4295920e0, 0xc425ee1e00)
/home/jameinel/dev/go/src/github.com/juju/juju/apiserver/apiserver.go:873 +0x11b
net/http.HandlerFunc.ServeHTTP(0xc4209f89c0, 0x4cc0d60, 0xc4295920e0, 0xc425ee1e00)
/snap/go/1016/src/net/http/server.go:1918 +0x44
github.com/bmizerany/pat.(*PatternServeMux).ServeHTTP(0xc4209f8980, 0x4cc0d60, 0xc4295920e0, 0xc425ee1e00)
/home/jameinel/dev/go/src/github.com/bmizerany/pat/mux.go:117 +0x15b
net/http.serverHandler.ServeHTTP(0xc4209e45b0, 0x4cc0d60, 0xc4295920e0, 0xc425ee1e00)
/snap/go/1016/src/net/http/server.go:2619 +0xb4
net/http.(*conn).serve(0xc42ab8e280, 0x4cc3c20, 0xc427230d00)
/snap/go/1016/src/net/http/server.go:1801 +0x71d
created by net/http.(*Server).Serve
/snap/go/1016/src/net/http/server.go:2720 +0x288
[3]
goroutine 745 [running]:
runtime/debug.Stack(0x2df9680, 0xc420df5080, 0xc420a8294d)
/snap/go/1016/src/runtime/debug/stack.go:24 +0xa7
github.com/juju/juju/state.(*StatePool).Get(0xc4209d0220, 0xc420a8294d, 0x24, 0x0, 0x0, 0x0, 0x0)
/home/jameinel/dev/go/src/github.com/juju/juju/state/pool.go:91 +0x1ed
github.com/juju/juju/apiserver.(*Server).serveConn(0xc4209583c0, 0xc42039b3c0, 0xc420a8294d, 0x24, 0x4cc4560, 0xc420b0e160, 0xc4204ee360, 0xf, 0x0, 0x0)
/home/jameinel/dev/go/src/github.com/juju/juju/apiserver/apiserver.go:926 +0x75c
github.com/juju/juju/apiserver.(*Server).apiHandler.func2(0xc42039b3c0)
/home/jameinel/dev/go/src/github.com/juju/juju/apiserver/apiserver.go:903 +0x17f
github.com/juju/juju/apiserver/websocket.Serve(0x4cc0d60, 0xc42013ee00, 0xc420984f00, 0xc420c55ab0)
/home/jameinel/dev/go/src/github.com/juju/juju/apiserver/websocket/websocket.go:64 +0x119
github.com/juju/juju/apiserver.(*Server).apiHandler(0xc4209583c0, 0x4cc0d60, 0xc42013ee00, 0xc420984f00)
/home/jameinel/dev/go/src/github.com/juju/juju/apiserver/apiserver.go:900 +0x1a3
github.com/juju/juju/apiserver.(*Server).(github.com/juju/juju/apiserver.apiHandler)-fm(0x4cc0d60, 0xc42013ee00, 0xc420984f00)
/home/jameinel/dev/go/src/github.com/juju/juju/apiserver/apiserver.go:623 +0x48
net/http.HandlerFunc.ServeHTTP(0xc4209d3680, 0x4cc0d60, 0xc42013ee00, 0xc420984f00)
/snap/go/1016/src/net/http/server.go:1918 +0x44
github.com/juju/juju/apiserver.(*Server).trackRequests.func1(0x4cc0d60, 0xc42013ee00, 0xc420984f00)
/home/jameinel/dev/go/src/github.com/juju/juju/apiserver/apiserver.go:873 +0x11b
net/http.HandlerFunc.ServeHTTP(0xc4209f89c0, 0x4cc0d60, 0xc42013ee00, 0xc420984f00)
/snap/go/1016/src/net/http/server.go:1918 +0x44
github.com/bmizerany/pat.(*PatternServeMux).ServeHTTP(0xc4209f8980, 0x4cc0d60, 0xc42013ee00, 0xc420984f00)
/home/jameinel/dev/go/src/github.com/bmizerany/pat/mux.go:117 +0x15b
net/http.serverHandler.ServeHTTP(0xc4209e45b0, 0x4cc0d60, 0xc42013ee00, 0xc420984f00)
/snap/go/1016/src/net/http/server.go:2619 +0xb4
net/http.(*conn).serve(0xc420da0140, 0x4cc3c20, 0xc420d10840)
/snap/go/1016/src/net/http/server.go:1801 +0x71d
created by net/http.(*Server).Serve
/snap/go/1016/src/net/http/server.go:2720 +0x288
There may be a different issue going on, because I haven't been changing anything, just let it run, and now we're up to: beings. aggregate( [{$group: {_id: "$key", count: {$sum: 1}}}, {$sort: {count: -1}}])
> db.presence.
{ "_id" : "m#1", "count" : 2952 }
{ "_id" : "m#2", "count" : 2941 }
{ "_id" : "m#0", "count" : 2904 }
{ "_id" : "m#1", "count" : 2966 }
{ "_id" : "m#2", "count" : 2955 }
{ "_id" : "m#0", "count" : 2916 }
2017-11- 12T11:10: 29.261+ 0000 beings. aggregate( [{$group: {_id: "$key", count: {$sum: 1}}}, {$sort: {count: -1}}, {$limit: 5}])
db.presence.
i{ "_id" : "m#1", "count" : 2977 }
{ "_id" : "m#2", "count" : 2967 }
{ "_id" : "m#0", "count" : 2928 }