Canonical Juju

mgo saslStart Socket storm

Bug #1747368 reported by John A Meinel on 2018-02-05

This bug affects 1 person

Affects		Status	Importance	Assigned to	Milestone
	Canonical Juju	Fix Released	High	John A Meinel	Canonical Juju 2.4-beta1
	2.3	Fix Released	High	John A Meinel	Canonical Juju 2.3.3

Bug Description

We have seen this in production, but didn't quite understand what was going on. There is a patch in 'globalsign/mgo' which looks to be helpful:
https://github.com/go-mgo/mgo/issues/254

We should bring that into our code base. (likely for 2.3 we'll just cherry-pick this patch, but for 2.4 we'll bring in the rest of globalsign changes.)

Tags:

John A Meinel (jameinel) on 2018-02-05

Changed in juju:
assignee:	nobody → John A Meinel (jameinel)
status:	Triaged → In Progress
milestone:	none → 2.4-beta1

Revision history for this message

John A Meinel (jameinel) wrote on 2018-02-08:

https://github.com/juju/juju/pull/8351

Revision history for this message

John A Meinel (jameinel) wrote on 2018-02-11:

Steps to manually reproduce the saslStart storm, which this patch addresses.

$ juju bootstrap lxd
$ juju deploy cs:~jameinel/ubuntu-lite
# Get 100 units running to have stuff for the controller to do
$ for i in `seq 10`; do juju add-unit ubuntu-lite -n10 --to 0,0,0,0,0,0,0,0,0,0; done
$ juju ssh -m controller 0
$$ sudo su -
$$ apt install dsniff
# Find out what connections Juju has to mongo
$$ lsof -i :37017 | grep ^jujud.*localhost | head -n 20
# Pick one and try to get tcpkill to interrupt it. It must be an active connection for tcpkill to be able to inject the bad packet
$$ tcpkill -i lo 'port 54200'
# Note that the early connections are likely to be the txn watcher, which tends to bring everything down and bounce.
# trying random ones a bit should be able to trigger the abended bug without bringing everything to be restarted
# You should also be able to grab the CPU profile during this time
# The symptom when its bad is that the CPU usage for jujud never really settles back down to low digits, even though it isn't firing many hooks. It seems to stay >50% (before the tcpkill I always let it settle down to 10-20%)
$$ source /etc/profile.d/juju-introspection.sh
$$ juju-cpu-profile 120 > cpu.pprof
# The bad CPU profile shows a lot of:
  0.82mins 62.92% 62.92% 0.82mins 62.92% crypto/sha1.blockAMD64 /snap/go/1404/src/crypto/sha1/sha1block_amd64.s
  0.10mins 7.84% 70.76% 0.59mins 45.27% crypto/sha1.(*digest).checkSum /snap/go/1404/src/crypto/sha1/sha1.go
  0.08mins 6.38% 77.13% 0.95mins 72.92% crypto/sha1.(*digest).Write /snap/go/1404/src/crypto/sha1/sha1.go
  0.05mins 3.83% 80.96% 0.05mins 3.83% runtime.memmove /snap/go/1404/src/runtime/memmove_amd64.s
  0.03mins 2.16% 83.12% 0.64mins 49.23% crypto/sha1.(*digest).Sum /snap/go/1404/src/crypto/sha1/sha1.go
  0.03mins 2.06% 85.18% 1.16mins 89.43% gopkg.in/mgo.v2/internal/scram.(*Client).saltPassword /home/jameinel/dev/go/src/gopkg.in/mgo.v2/internal/scram/scram.go

# Which is Juju re-authing on each request.

Steps to manually reproduce the saslStart storm, which this patch addresses.

$ juju bootstrap lxd
$ juju deploy cs:~jameinel/ubuntu-lite
# Get 100 units running to have stuff for the controller to do
$ for i in `seq 10`; do juju add-unit ubuntu-lite -n10 --to 0,0,0,0,0,0,0,0,0,0; done
$ juju ssh -m controller 0
$$ sudo su -
$$ apt install dsniff
# Find out what connections Juju has to mongo
$$ lsof -i :37017 | grep ^jujud.*localhost  | head -n 20
# Pick one and try to get tcpkill to interrupt it. It must be an active connection for tcpkill to be able to inject the bad packet
$$ tcpkill -i lo 'port 54200'
# Note that the early connections are likely to be the txn watcher, which tends to bring everything down and bounce.
# trying random ones a bit should be able to trigger the abended bug without bringing everything to be restarted
# You should also be able to grab the CPU profile during this time
# The symptom when its bad is that the CPU usage for jujud never really settles back down to low digits, even though it isn't firing many hooks. It seems to stay >50% (before the tcpkill I always let it settle down to 10-20%)
$$ source /etc/profile.d/juju-introspection.sh
$$ juju-cpu-profile 120 > cpu.pprof
# The bad CPU profile shows a lot of:
  0.82mins 62.92% 62.92%   0.82mins 62.92%  crypto/sha1.blockAMD64 /snap/go/1404/src/crypto/sha1/sha1block_amd64.s
  0.10mins  7.84% 70.76%   0.59mins 45.27%  crypto/sha1.(*digest).checkSum /snap/go/1404/src/crypto/sha1/sha1.go
  0.08mins  6.38% 77.13%   0.95mins 72.92%  crypto/sha1.(*digest).Write /snap/go/1404/src/crypto/sha1/sha1.go
  0.05mins  3.83% 80.96%   0.05mins  3.83%  runtime.memmove /snap/go/1404/src/runtime/memmove_amd64.s
  0.03mins  2.16% 83.12%   0.64mins 49.23%  crypto/sha1.(*digest).Sum /snap/go/1404/src/crypto/sha1/sha1.go
  0.03mins  2.06% 85.18%   1.16mins 89.43%  gopkg.in/mgo.v2/internal/scram.(*Client).saltPassword /home/jameinel/dev/go/src/gopkg.in/mgo.v2/internal/scram/scram.go

# Which is Juju re-authing on each request.

Revision history for this message

John A Meinel (jameinel) wrote on 2018-02-11:

See also
https://github.com/globalsign/mgo/pull/100

John A Meinel (jameinel) on 2018-02-28

Changed in juju:
status:	In Progress → Fix Committed

Anastasia (anastasia-macmood) on 2018-07-10

Changed in juju:
status:	Fix Committed → Fix Released

Report a bug

This report contains Public information

Everyone can see this information.

You are

Subscribing...

Edit bug mail

Other bug subscribers

Remote bug watches

Bug watches keep track of this bug in other bug trackers.