[neutron-tempest-plugin] If paramiko SSH client connection fails because of authentication, cannot reconnect

Bug #1892861 reported by Rodolfo Alonso
This bug affects 1 person
Affects: neutron
Status: Fix Released
Importance: High
Assigned to: Slawek Kaplonski
Milestone: (none listed)

Bug Description

In the VM boot process, cloud-init copies the SSH keys.

If the tempest test tries to connect to the VM before the SSH keys are copied, the SSH client raises a paramiko.ssh_exception.AuthenticationException. From this point on, even after the SSH keys are copied into the VM, the SSH client can no longer connect to the VM using the pkey.

If a longer sleep is added manually to avoid this race condition (the test tries to connect once the IP is available on the port, but the SSH keys are not yet present in the VM), the SSH client connects without any problem.
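As an illustration of the race described above, here is a minimal sketch of a retry loop that opens a brand-new session on every attempt instead of reusing a client that has already failed authentication. It is not the fix adopted for this bug. The names connect_with_retry and open_paramiko_session, and the attempt and delay values, are hypothetical. Whether a fresh client actually avoids the failure reported here is an assumption and was not verified in this report.

```python
import time


def connect_with_retry(open_session, auth_errors, attempts=10, delay=5.0,
                       sleep=time.sleep):
    """Call open_session() until it succeeds or the attempts run out.

    open_session must build a new connection on every call, so that no
    state from a failed authentication is carried over to the next try.
    Only exceptions listed in auth_errors are retried; others propagate.
    """
    last_error = None
    for attempt in range(1, attempts + 1):
        try:
            return open_session()
        except auth_errors as exc:
            last_error = exc
            if attempt < attempts:
                sleep(delay)
    raise last_error


def open_paramiko_session(host, username, pkey):
    """Hypothetical helper: open one fresh paramiko connection."""
    import paramiko  # imported lazily so the retry loop has no hard dependency

    client = paramiko.SSHClient()
    client.set_missing_host_key_policy(paramiko.AutoAddPolicy())
    try:
        client.connect(host, username=username, pkey=pkey, timeout=10,
                       look_for_keys=False, allow_agent=False)
    except Exception:
        client.close()  # do not keep a half-open client around
        raise
    return client
```

With paramiko installed, the two pieces could be combined as connect_with_retry(lambda: open_paramiko_session(ip, user, key), (paramiko.ssh_exception.AuthenticationException,)), where ip, user and key are placeholders for the values used by the test.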

[1] http://paste.openstack.org/show/797127/

Tags: gate-failure
tags: added: gate-failure
Changed in neutron:
importance: Undecided → High
Slawek Kaplonski (slaweq) wrote :

I'm not sure what I'm missing here, but I tried to reproduce the issue locally and couldn't.

Here is what I did:

- I modified the cirros image and put "sleep 30" in https://github.com/cirros-dev/cirros/blob/359b5ebda0b84db947b60c2b75e4d14feb900dc6/src/sbin/cirros-apply#L115
- using that modified image, I ran neutron_tempest_plugin.scenario.test_basic.NetworkBasicTest.test_basic_instance
- in the tempest logs I saw that it tried to ssh to the instance a few times, and finally, once the ssh key was added, the test passed successfully.

Rodolfo Alonso (rodolfo-alonso-hernandez) wrote :

Hi Slawek:

Maybe that was a Paramiko error. We saw it in our internal CI and could reproduce the issue 100% of the time.

The coincidence here is that, 8 months later, we have a new Paramiko release [1]. To be honest, I don't see any relevant difference between 2.7.1 and 2.7.2, but maybe the error reported is now solved.

Anyway, if the error is not happening now, we can close it.

Regards.

[1] https://pypi.org/project/paramiko/#history

Changed in neutron:
status: New → Incomplete
OpenStack Infra (hudson-openstack) wrote : Related fix proposed to neutron-tempest-plugin (master)

Related fix proposed to branch: master
Review: https://review.opendev.org/758968

OpenStack Infra (hudson-openstack) wrote : Related fix merged to neutron-tempest-plugin (master)

Reviewed: https://review.opendev.org/758968
Committed: https://git.openstack.org/cgit/openstack/neutron-tempest-plugin/commit/?id=2211eabf3be7ccc1ec15d0b63190d085149ffb4d
Submitter: Zuul
Branch: master

commit 2211eabf3be7ccc1ec15d0b63190d085149ffb4d
Author: Slawek Kaplonski <email address hidden>
Date: Tue Oct 20 16:43:53 2020 +0200

    Check VM's console log before trying to SSH to it.

    Due to the issue described in the related bug report, it seems that
    tempest may sometimes start trying to SSH to the instance before the
    SSH key is really configured in the instance. In such a case the
    AuthenticationFailure error may persist, even if the SSH key is
    configured properly later during the test.

    To work around that issue and avoid test failures, this patch adds a
    check that the VM is really booted and ready for SSH. It is done by
    checking the console log of the VM and looking for the specific
    string "login:", which appears at least in the Cirros and Ubuntu
    images used in our CI jobs.
    If the string is not found, the test will continue to run and will
    still try to SSH to the instance. So in the worst case it may slow
    SSH to the instance a bit, but it shouldn't have any bad impact on
    the test, as before this patch it would probably also wait a similar
    amount of time while trying to SSH to the instance.

    Change-Id: I8739f17ec8b05405056fd21f59817de60af12dd8
    Related-Bug: #1892861
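
The check described in the commit message can be sketched roughly as follows. This is not the code from the merged patch. The function name wait_for_console_marker, the get_console_output callable and the timeout and interval values are all hypothetical. The sketch only illustrates the behaviour described above: poll the console log for "login:", and return instead of failing when the string never shows up, so the caller can still go on to try SSH.

```python
import time


def wait_for_console_marker(get_console_output, marker="login:",
                            timeout=300.0, interval=5.0,
                            sleep=time.sleep, clock=time.monotonic):
    """Poll the console log until marker appears or timeout expires.

    get_console_output is any zero-argument callable returning the
    current console log as a string (or None if it is not available).
    Returns True if the marker was seen and False on timeout; it never
    raises on timeout, so the test can still attempt SSH afterwards.
    """
    deadline = clock() + timeout
    while True:
        output = get_console_output() or ""
        if marker in output:
            return True
        if clock() >= deadline:
            return False
        sleep(interval)
```

A test would pass in a callable that fetches the console output of the server under test and then proceed to the SSH step regardless of the return value, optionally logging when the marker was not found.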

Launchpad Janitor (janitor) wrote :

[Expired for neutron because there has been no activity for 60 days.]

Changed in neutron:
status: Incomplete → Expired
Changed in neutron:
status: Expired → Fix Committed
assignee: nobody → Slawek Kaplonski (slaweq)
Changed in neutron:
status: Fix Committed → Fix Released