NEW BUG DESCRIPTION:
After investigating the bug more deeply: it does not make sense to store SSHClient objects, as they are all Threads, which cannot be started again once stopped.
All SSHClient objects should instead be used as Python context managers, which will automatically close all connections even when an exception occurs.
We should also remember that an SSH connection is quite heavy for the system when hundreds of connections are running at the same time, so all future tests should be written with this in mind.
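The two points above can be illustrated with a minimal sketch. FakeSSHClient and run_failing_command are hypothetical stand-ins invented for this example (no real connection is opened, and this is not the fuel-devops implementation); the only behaviour assumed from the report is that clear() closes the underlying transports.

```python
import threading


def restart_fails():
    """Return True if starting a finished thread again raises RuntimeError."""
    worker = threading.Thread(target=lambda: None)
    worker.start()
    worker.join()
    try:
        worker.start()  # a Thread object can only be started once
    except RuntimeError:
        return True
    return False


class FakeSSHClient:
    """Hypothetical stand-in for SSHClient; opens no real connection."""

    def __init__(self, host):
        self.host = host
        self.closed = False

    def clear(self):
        # Assumption: the real clear() closes the paramiko transports.
        self.closed = True

    def __enter__(self):
        return self

    def __exit__(self, exc_type, exc_value, traceback):
        self.clear()
        return False  # do not swallow the exception


def run_failing_command(host):
    """Raise inside the with-block and report whether the client was closed."""
    client = FakeSSHClient(host)
    try:
        with client as remote:
            raise RuntimeError("command failed on " + remote.host)
    except RuntimeError:
        pass
    return client.closed
```

With this pattern the connection is released at the end of the with-block, so nothing has to be stored and cleaned up later by the test framework.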
OLD BUG DESCRIPTION:
In the bug https://bugs.launchpad.net/fuel/+bug/1416365 we tried to fix the issue by updating fuel-devops, where the SSHClient.clear() method was added to automatically close paramiko transports when an SSHClient is deleted/exited [1].
But it doesn't work, for the following reasons:
- we don't use SSHClient as a context manager;
- SSHClient instances are deleted only when Proboscis finishes, because we initialize them directly by IP address instead of through the Node model from devops. So an SSHClient is not deleted when the Environment model of fuel-devops is destroyed after each test case, but exists until the end of the whole job, because Proboscis never deletes classes after a test case is finished;
- the stop_thread() method in the Paramiko transport is not designed for the load of several hundred threads closing at the same time [2].
As a result:
- SSHClient.clear() is called not at the end of each test case, but at the end of the whole test group, when it exits.
- The method is called correctly by the Python interpreter, but because of the 10-second timeout when waiting for a thread (see [2]), some threads finish in time and some do not.
- The threads that do not finish in time cause the exceptions.
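The effect of a join timeout can be shown with plain threading, independent of Paramiko. This is only a sketch of the general mechanism: join_with_timeout is a hypothetical helper, and the short timeout used here stands in for the 10 seconds mentioned above.

```python
import threading


def join_with_timeout(timeout):
    """Join a busy thread with a timeout; return (alive_after_timeout, alive_at_end)."""
    release = threading.Event()
    worker = threading.Thread(target=release.wait)  # blocks until released
    worker.start()

    worker.join(timeout)  # returns once the timeout expires, finished or not
    alive_after_timeout = worker.is_alive()

    release.set()  # let the worker finish
    worker.join()  # no timeout: wait until it really ends
    return alive_after_timeout, worker.is_alive()
```

join() with a timeout does not raise when the time runs out, so the caller has to check is_alive() to find out whether the thread actually stopped.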
What we can do:
1) Always call remote.clear() from methods in fuel-devops/fuel-qa;
2) Use SSHClient as a context manager everywhere in fuel-devops/fuel-qa;
3) Store all opened transports in the 'Environment' model of fuel-devops, re-use them to reduce the number of opened transports, and delete them when the 'Environment' model is destroyed after each test case. This is not ideal, but it is the easiest solution that avoids refactoring the whole fuel-qa source code. 10 seconds should be enough to close fewer than 10 transports.
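Option 3 could look roughly like the sketch below. FakeTransport, EnvironmentSketch, get_transport and close_transports are all hypothetical names made up for illustration; the real 'Environment' model and the paramiko transport have different interfaces.

```python
class FakeTransport:
    """Hypothetical stand-in for an opened SSH transport."""

    def __init__(self, host):
        self.host = host
        self.active = True

    def close(self):
        self.active = False


class EnvironmentSketch:
    """Hypothetical environment that owns one transport per host."""

    def __init__(self):
        self._transports = {}

    def get_transport(self, host):
        # Re-use an active transport for this host instead of opening a new one.
        transport = self._transports.get(host)
        if transport is None or not transport.active:
            transport = FakeTransport(host)
            self._transports[host] = transport
        return transport

    def close_transports(self):
        # Meant to run when the environment is destroyed after a test case.
        closed = len(self._transports)
        for transport in self._transports.values():
            transport.close()
        self._transports.clear()
        return closed
```

Because transports are keyed by host, the number of transports to close per test case is bounded by the number of nodes, not by the number of commands that were run.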
[1] https://github.com/stackforge/fuel-devops/blob/master/devops/helpers/helpers.py#L243
[2] https://github.com/paramiko/paramiko/blob/master/paramiko/transport.py#L1420-L1421
Returned to pool as more important HCF blocking bugs needed attention.