mgo saslStart Socket storm

Bug #1747368 reported by John A Meinel on 2018-02-05
Affects     Importance    Assigned to
juju        High          John A Meinel
juju 2.3    High          John A Meinel

Bug Description

We have seen this in production, but didn't quite understand what was going on. There is a patch in 'globalsign/mgo' which looks to be helpful:
  https://github.com/go-mgo/mgo/issues/254

We should bring that into our code base. (Likely for 2.3 we'll just cherry-pick this patch; for 2.4 we'll bring in the rest of the globalsign changes.)

John A Meinel (jameinel) on 2018-02-05
Changed in juju:
assignee: nobody → John A Meinel (jameinel)
status: Triaged → In Progress
milestone: none → 2.4-beta1
John A Meinel (jameinel) wrote:

Steps to manually reproduce the saslStart storm, which this patch addresses.

$ juju bootstrap lxd
$ juju deploy cs:~jameinel/ubuntu-lite
# Get 100 units running to have stuff for the controller to do
$ for i in `seq 10`; do juju add-unit ubuntu-lite -n10 --to 0,0,0,0,0,0,0,0,0,0; done
$ juju ssh -m controller 0
$$ sudo su -
$$ apt install dsniff
# Find out what connections Juju has to mongo
$$ lsof -i :37017 | grep '^jujud.*localhost' | head -n 20
# Pick one and try to get tcpkill to interrupt it. It must be an active connection for tcpkill to be able to inject the bad packet
$$ tcpkill -i lo 'port 54200'
# Note that the early connections are likely to be the txn watcher; killing one of those tends to bring everything down and bounce.
# Trying a few connections at random should trigger the abended-connection bug without forcing everything to restart.
# You should also be able to grab a CPU profile during this time.
# The symptom when it's bad: jujud's CPU usage never settles back down to low single digits, even though it isn't firing many hooks. It seems to stay >50% (before the tcpkill I always let it settle down to 10-20%).
$$ source /etc/profile.d/juju-introspection.sh
$$ juju-cpu-profile 120 > cpu.pprof
# The bad CPU profile shows a lot of:
  0.82mins 62.92% 62.92% 0.82mins 62.92% crypto/sha1.blockAMD64 /snap/go/1404/src/crypto/sha1/sha1block_amd64.s
  0.10mins 7.84% 70.76% 0.59mins 45.27% crypto/sha1.(*digest).checkSum /snap/go/1404/src/crypto/sha1/sha1.go
  0.08mins 6.38% 77.13% 0.95mins 72.92% crypto/sha1.(*digest).Write /snap/go/1404/src/crypto/sha1/sha1.go
  0.05mins 3.83% 80.96% 0.05mins 3.83% runtime.memmove /snap/go/1404/src/runtime/memmove_amd64.s
  0.03mins 2.16% 83.12% 0.64mins 49.23% crypto/sha1.(*digest).Sum /snap/go/1404/src/crypto/sha1/sha1.go
  0.03mins 2.06% 85.18% 1.16mins 89.43% gopkg.in/mgo.v2/internal/scram.(*Client).saltPassword /home/jameinel/dev/go/src/gopkg.in/mgo.v2/internal/scram/scram.go

# Which is Juju re-authing on each request.

John A Meinel (jameinel) on 2018-02-28
Changed in juju:
status: In Progress → Fix Committed
Changed in juju:
status: Fix Committed → Fix Released