Paramiko 'Exception in thread Thread-NNN' errors after system test/devops command finished.

Bug #1455357 reported by Dennis Dmitriev
38
This bug affects 5 people
Affects Status Importance Assigned to Milestone
Fuel for OpenStack
Fix Released
Medium
Fuel QA Team

Bug Description

NEW BUG DESCRIPTION:
After investigating a bug more deeply: It does not make sense to store SSHClient object as they all are Threads, which once stopped, cannot be started again.

All SSHClient objects should be changed to be used as Python Context Managers - what will automatically close all connections even when an Exception occur.

Also we should remember that SSH connection is quite heavy process for the system if we talking about hundreds of connections running at the same moment, so all the future tests should be written with care about the issue.

OLD BUG DESCRIPTION:
In the bug https://bugs.launchpad.net/fuel/+bug/1416365 we tried to fix the issue by updating fuel-devops, where method SSHClient.clear() was added to automate closing paramiko transports when SSHClient is deleted/exited [1].

But it doesn't work because of two things:
- we don't use SSHClient as a context manager;
- SSHClient instances are deleted only when Proboscis is finished, because we initialize it not using Node model from devops, but directly by IP address. So SSHClient is not deleted when Environment model of fuel-devops is destroyed after each test case, but exists to the end of whole job because Proboscis never delete classes after test case is finished;
- stop_thread() method in Paramiko transport is not designed for such load when several hundreds threads are closing at the same time [2].

As the result:
- SSHClient.clear() is called not at the end of each test case, but at the end of whole test group , when it exits.
- This method is correctly called by python interpreter, but because of 10-seconds timeout of waiting for thread (see [2]) , some threads are finished on time, some are not.
- Those thread which are not finished on time - cause the exceptions.

What we can do:
1) Always call remote.clear() from methods in fuel-devops/fuel-qa ;
2) Use SSHClient as a context manager everywhere in fuel-devops/fuel-qa ;
3) Store all opened transports in the 'Environment' model of fuel-devops, re-use it to reduce the amount of opened transports, delete them when 'Environment' model is destroyed after each test case. This will be not very good, but easiest solution to avoid refactoring whole fuel-qa source code. 10 seconds must be enough to close <10 transports.

[1] https://github.com/stackforge/fuel-devops/blob/master/devops/helpers/helpers.py#L243
[2] https://github.com/paramiko/paramiko/blob/master/paramiko/transport.py#L1420-L1421

Changed in fuel:
importance: Undecided → High
Changed in fuel:
status: New → Confirmed
tags: added: non-release
Changed in fuel:
assignee: MOS QA Team (mos-qa) → Sebastian Kalinowski (prmtl)
status: Confirmed → In Progress
Changed in fuel:
status: In Progress → Confirmed
assignee: Sebastian Kalinowski (prmtl) → Fuel QA Team (fuel-qa)
Revision history for this message
Sebastian Kalinowski (prmtl) wrote :

Returned to pool as more important HCF blocking bugs needed attention.

Changed in fuel:
assignee: Fuel QA Team (fuel-qa) → Sylwester Brzeczkowski (sbrzeczkowski)
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to fuel-devops (master)

Fix proposed to branch: master
Review: https://review.openstack.org/186738

Changed in fuel:
status: Confirmed → In Progress
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to fuel-qa (master)

Fix proposed to branch: master
Review: https://review.openstack.org/186741

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Change abandoned on fuel-devops (master)

Change abandoned by Sylwester Brzeczkowski (<email address hidden>) on branch: master
Review: https://review.openstack.org/186738

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Change abandoned on fuel-qa (master)

Change abandoned by Sylwester Brzeczkowski (<email address hidden>) on branch: master
Review: https://review.openstack.org/186741

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to fuel-qa (master)

Fix proposed to branch: master
Review: https://review.openstack.org/187539

Revision history for this message
Sylwester Brzeczkowski (sbrzeczkowski) wrote :

I tested this patch here http://jenkins-product.srt.mirantis.net:8080/job/6.1.custom_system_test/169/console
Exceptions from threads seem to not appear. So the fix works as intended but it's still a partial fix because ssh connections are badly managed in whole qa - we always should use SSHClient as a context manager.

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to fuel-qa (master)

Reviewed: https://review.openstack.org/187539
Committed: https://git.openstack.org/cgit/stackforge/fuel-qa/commit/?id=a9152f77615eaaca9b4a36d9d218008107519ae4
Submitter: Jenkins
Branch: master

commit a9152f77615eaaca9b4a36d9d218008107519ae4
Author: Sylwester Brzeczkowski <email address hidden>
Date: Tue Jun 2 13:42:55 2015 +0200

    Ensure ssh connections are closed after each TC

    Add plugin to nose, which is responsible for closing paramiko
    ssh connections.

    Change-Id: Ibae19b096f4ddd22b6baa704001cf0bb2c4cbd13
    Partial-Bug: #1455357

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to fuel-devops (master)

Fix proposed to branch: master
Review: https://review.openstack.org/188682

Revision history for this message
Sylwester Brzeczkowski (sbrzeczkowski) wrote :

I'm now trying to reduce number of parmiko's ssh clients by saving SSHClient instances (different for every host) and distributing them depending on needs. I had some problems with reconnect but I'm now working on it. It's quite cumbersome, becase every run of the tests takes 1-2h.

Revision history for this message
Dennis Dmitriev (ddmitriev) wrote :

Still reproduced on CI, for example:
http://jenkins-product.srt.mirantis.net:8080/view/7.0_swarm/job/7.0.system_test.ubuntu.thread_7/12/console
http://jenkins-product.srt.mirantis.net:8080/view/7.0_swarm/job/7.0.system_test.ubuntu.thread_7/10/console
http://jenkins-product.srt.mirantis.net:8080/view/7.0_swarm/job/7.0.system_test.ubuntu.thread_non_func_1/3/console

To faster investigation, you can reproduce the issue in a minute even without 'fuel-qa' repository, using just 'fuel-devops':

$ dos.py time-sync <env_name>

, if you, for example, create ~1000 instances of SSHClient ( get_admin_remote() or get_node_remote() ) here: https://github.com/stackforge/fuel-devops/blob/master/devops/helpers/helpers.py#L169-L171

It happens from time to time even with clear fuel-devops code during different requests.

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to fuel-web (master)

Fix proposed to branch: master
Review: https://review.openstack.org/207917

Revision history for this message
Sylwester Brzeczkowski (sbrzeczkowski) wrote :

Is the bug still occurs? I cannot reproduce it

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Change abandoned on fuel-devops (master)

Change abandoned by Sylwester Brzeczkowski (<email address hidden>) on branch: master
Review: https://review.openstack.org/188682
Reason: threads cannot be restarted. So we should start new ones every time but it actually does nothing with the bug cause.

Revision history for this message
Sylwester Brzeczkowski (sbrzeczkowski) wrote :

Due to https://review.openstack.org/#/c/187539/ the threads exception seems not be the cause of test failures anymore, and because of it's degraded to Medium.

Dmitry Pyzhov (dpyzhov)
no longer affects: fuel/7.0.x
Dmitry Pyzhov (dpyzhov)
no longer affects: fuel/6.1.x
description: updated
Changed in fuel:
assignee: Sylwester Brzeczkowski (sbrzeczkowski) → Fuel QA Team (fuel-qa)
status: In Progress → Triaged
Dmitry Pyzhov (dpyzhov)
tags: added: area-qa
Revision history for this message
Tatyanka (tatyana-leontovich) wrote :

Guys a lot of patches were merged to get rid of this issue, Also I do not see any reproduces over the month - so move to the incomplete status for now

Changed in fuel:
status: Triaged → Incomplete
Revision history for this message
Alexandr Kostrikov (akostrikov-mirantis) wrote :

I would say that bug is 'Fix released' because we made a lot of context manager patches.

Changed in fuel:
status: Incomplete → Fix Released
tags: removed: non-release
To post a comment you must log in.
This report contains Public information  
Everyone can see this information.

Other bug subscribers

Remote bug watches

Bug watches keep track of this bug in other bug trackers.