External CI failures due to SSH issues
Affects | Status | Importance | Assigned to | Milestone
---|---|---|---|---
OpenStack-Ansible | Fix Released | High | Hugh Saunders |
Juno | Fix Released | High | Jesse Pretorius |
Trunk | Fix Released | High | Hugh Saunders |
Bug Description
Many runs fail due to SSH issues, for example:
16:27:58 fatal: [node17_
16:27:58 fatal: [node17_
I have a couple of theories so far.
1) Multiplexing issues
1.1) I have caught one exception where the ssh client dies because it can't connect to the ControlMaster to request a new session. That's a client-side issue, and I can't think of a solution for it apart from disabling multiplexing (ControlMaster=no).
1.2) I have no evidence for this, but connections could be failing because MaxSessions is hit. That is plausible, as MaxSessions defaults to 10 and forks is 15. This could happen when a task targets all containers and is then delegated to the host.
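To make the arithmetic in 1.2 concrete, here is a minimal sketch. It assumes that every fork talking to the same host shares one multiplexed master connection, so each fork needs its own session on that connection. The function name `excess_sessions` is made up for illustration and is not part of Ansible or OpenSSH.

```python
def excess_sessions(forks, max_sessions=10):
    """Return how many simultaneous session requests exceed the limit.

    Assumption: all forks target the same host at once (for example a task
    delegated to the host) and share one multiplexed connection, so each
    fork needs its own session on that connection.
    """
    return max(0, forks - max_sessions)


if __name__ == "__main__":
    # Values from this report: forks is 15, MaxSessions defaults to 10.
    print(excess_sessions(forks=15))  # prints 5
```

If that assumption holds, up to five session requests could be refused in the worst case. Raising MaxSessions on the host to at least the forks value, or lowering forks, would remove the gap.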
2) Too many unauthenticated connections
I have enabled SSH logging on the hosts and run Ansible with -vvvv to try to determine the cause, but this is not straightforward. The server reports "didn't receive identification string from client" and the client reports "connection closed by server" (because identification wasn't received). So the client is sending its identification but the server isn't receiving it. One possibility is that MaxStartups is being hit, though this seems unlikely if ControlMaster is set to auto.
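For reference, a rough sketch of how MaxStartups would behave if it were being hit. It assumes the `start:rate:full` form described in sshd_config(5), where connections are refused with probability rate/100 once `start` unauthenticated connections are pending, rising linearly to 100% at `full`. The value `10:30:100` is used only as an example (it is the default in recent OpenSSH releases, but check the version on the hosts). The function names are hypothetical, and the real sshd uses integer arithmetic, so results may differ slightly.

```python
def parse_max_startups(value):
    """Parse 'start:rate:full' (or a bare integer) into (start, rate, full)."""
    parts = [int(p) for p in value.split(":")]
    if len(parts) == 1:
        # A bare integer: refuse everything once that many are pending.
        return parts[0], 100, parts[0]
    if len(parts) != 3:
        raise ValueError("expected 'start:rate:full' or a single integer")
    return parts[0], parts[1], parts[2]


def drop_probability(unauthenticated, max_startups="10:30:100"):
    """Approximate chance that sshd refuses a new unauthenticated connection."""
    start, rate, full = parse_max_startups(max_startups)
    if unauthenticated < start:
        return 0.0
    if unauthenticated >= full or rate >= 100:
        return 1.0
    # Linear ramp from rate% at 'start' to 100% at 'full'.
    percent = rate + (100 - rate) * (unauthenticated - start) / (full - start)
    return percent / 100.0


if __name__ == "__main__":
    # Below the threshold nothing is dropped; at forks = 15 some would be.
    for pending in (5, 10, 15):
        print(pending, round(drop_probability(pending), 3))
```

Under these assumptions, nothing is dropped while fewer than 10 unauthenticated connections are pending, which fits the point above: with ControlMaster set to auto, most sessions reuse an authenticated connection and should not count towards this limit.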
Changed in openstack-ansible:
importance: Undecided → High
assignee: nobody → Hugh Saunders (hughsaunders)
Changed in openstack-ansible:
status: New → In Progress
Changed in openstack-ansible:
milestone: none → next
I saw the same thing happen when we were running the playbooks on a 160-node cluster. Some tasks within the plays would intermittently fail with this error, forcing us to re-run the play with the --rejoin flag. At first I thought we were saturating the network. You can't see this problem unless you run the plays on a big cluster. I'll keep an eye out for evidence of any of the theories above the next time I run the playbooks on this cluster.