Ansible ssh connection errors in opendev jobs

Bug #1986708 reported by Rafael Castillo
This bug affects 2 people
Affects   Status         Importance   Assigned to   Milestone
tripleo   Fix Released   Critical     Unassigned

Bug Description

TripleO CI jobs are failing on opendev.org with the message "Failed to connect to the host via ssh: kex_exchange_identification: Connection closed by remote host".

Examples:
https://zuul.opendev.org/t/openstack/build/6a96f799c4c7459392d0cba3544d0817
https://zuul.opendev.org/t/openstack/build/2419cb17169348c697f62773d098bc93
https://zuul.opendev.org/t/openstack/build/a646fc193aa8451386cee2e0608ca596

It appears to be able to hit any job at any stage. The journal shows connection attempts by IP scanners, but that may just be a coincidence.

tags: added: promotion-blocker
Douglas Viroel (dviroel) wrote :
Sandeep Yadav (sandeepyadav93) wrote :

Seen again today in a gate job:

https://storage.bhs.cloud.ovh.net/v1/AUTH_dcaab5e32b234d56b626f72581e3644c/zuul_opendev_logs_649/852532/5/gate/tripleo-ci-centos-9-content-provider/649da27/job-output.txt
~~~
2022-08-18 01:14:06.982440 | primary | failed: [undercloud] (item=quay.io/prometheus/node-exporter:v1.3.1) => {"ansible_loop_var": "item", "item": "quay.io/prometheus/node-exporter:v1.3.1", "msg": "Failed to connect to the host via ssh: kex_exchange_identification: Connection closed by remote host\r\nConnection closed by 127.0.0.2 port 22", "unreachable": true}
~~~

Cédric Jeanneret (cjeanner) wrote :

Seen today as well in this job:
https://review.opendev.org/c/openstack/tripleo-ansible/+/853252

https://zuul.opendev.org/t/openstack/build/c6e5b64b5a4940a08ca751b50ac9e6c9
fatal: [subnode-1]: UNREACHABLE! => {"changed": false, "msg": "Failed to connect to the host via ssh: kex_exchange_identification: read: Connection reset by peer\r\nConnection reset by 10.209.128.68 port 22", "unreachable": true}
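
Both variants of the failure quoted in this report ("Connection closed by remote host" and "read: Connection reset by peer") share the `kex_exchange_identification` prefix, so they can be pulled out of a job log together. A minimal sketch, assuming the log text is already in memory; the function name `find_kex_failures` and the two patterns are illustrative and are not part of any tripleo or zuul tooling:

```python
import re

# Matches the two failure variants quoted in this report.
KEX_FAILURE = re.compile(
    r"kex_exchange_identification: (?:read: )?"
    r"(?P<reason>Connection closed by remote host|Connection reset by peer)"
)
# In the quoted messages the peer address follows later in the same msg string.
PEER = re.compile(
    r"Connection (?:closed|reset) by (?P<ip>\d+\.\d+\.\d+\.\d+) port (?P<port>\d+)"
)


def find_kex_failures(text):
    """Return one (reason, ip, port) tuple per log line that shows the failure."""
    hits = []
    for line in text.splitlines():
        match = KEX_FAILURE.search(line)
        if not match:
            continue
        peer = PEER.search(line)
        ip = peer.group("ip") if peer else None
        port = int(peer.group("port")) if peer else None
        hits.append((match.group("reason"), ip, port))
    return hits
```

Grouping the results by peer address would show at a glance whether the failures hit loopback (127.0.0.2) or a subnode address.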

If we take the time at which the error shows up in the job log (Thursday 18 August 2022 09:32:46 +0000), we can frame the matching window in the journald output:
Aug 18 09:32:45 centos-9-stream-rax-dfw-0030762605 sshd[135791]: Accepted publickey for zuul from 10.209.39.69 port 54484 ssh2: RSA SHA256:ZF9fnpKktgsD1qskSaOryoa9rBiFv0AhNYxlCxB1S/s
Aug 18 09:32:45 centos-9-stream-rax-dfw-0030762605 systemd-logind[704]: New session 128 of user zuul.
Aug 18 09:32:45 centos-9-stream-rax-dfw-0030762605 systemd[1]: Started Session 128 of User zuul.
Aug 18 09:32:45 centos-9-stream-rax-dfw-0030762605 sshd[135791]: pam_unix(sshd:session): session opened for user zuul(uid=1000) by (uid=0)
Aug 18 09:32:45 centos-9-stream-rax-dfw-0030762605 sudo[135828]: zuul : TTY=pts/0 ; PWD=/home/zuul ; USER=root ; COMMAND=/bin/sh -c echo BECOME-SUCCESS-igauwcvpqfxxcpaxhwapjcrsfscqvxhe ; /usr/libexec/platform-python /home/zuul/.ansible/tmp/ansible-tmp-1660815163.318697-128609-249815272917366/AnsiballZ_file.py
Aug 18 09:32:45 centos-9-stream-rax-dfw-0030762605 sudo[135828]: pam_unix(sudo:session): session opened for user root(uid=0) by zuul(uid=1000)
Aug 18 09:32:45 centos-9-stream-rax-dfw-0030762605 systemd[1]: Started /usr/bin/podman healthcheck run d1a62b77bec9a31ba7325592b4f2c8b70ce6f8120d079776e0d70dbf1521e88a.
Aug 18 09:32:45 centos-9-stream-rax-dfw-0030762605 systemd[1]: tmp-crun.FzxKF6.mount: Deactivated successfully.
Aug 18 09:32:45 centos-9-stream-rax-dfw-0030762605 podman[135830]: 2022-08-18 09:32:45.939338182 +0000 UTC m=+0.098918338 container exec d1a62b77bec9a31ba7325592b4f2c8b70ce6f8120d079776e0d70dbf1521e88a (image=192.168.24.1:8787/tripleomastercentos9/openstack-nova-compute:80afe6f3cedeea3f67fa00bebbd52e52, name=nova_migration_target, container_name=nova_migration_target, com.redhat.component=ubi9-container, version=9.0.0, config_id=tripleo_step4, release=1604, url=https://access.redhat.com/containers/#/registry.access.redhat.com/ubi9/images/9.0.0-1604, build-date=2022-08-02T22:00:29.261592, io.openshift.expose-services=, io.buildah.version=1.26.4, vendor=Red Hat, Inc., summary=Provides the latest release of Red Hat Universal Base Image 9., config_data={'environment': {'KOLLA_CONFIG_STRATEGY': 'COPY_ALWAYS', 'TRIPLEO_CONFIG_HASH': '26b74d0c5605283ac83ca9f732ef69b1'}, 'healthcheck': {'test': '/openstack/healthcheck'}, 'image': '192.168.24.1:8787/tripleomastercentos9/openstack-nova-compute:80afe6f3cedeea3f67fa00bebbd52e52', 'net': 'host', 'privileged': True, 'restart': 'always', 'user': 'root', 'volumes': ['/etc/hosts:/etc/hosts:ro', '/etc/localtime:/etc/localtime:ro', '/etc/pki/ca-trust/extracted:/etc/pki/ca-trust/extracted:ro', '/etc/pki/ca-trust/source/anchors:/etc/pki/ca-trust/source/anchors:ro', '/etc/pki/tls/certs/ca-bundle.crt:/etc/pki/tls/certs/ca-bundle.crt:ro', '/etc/pki/tls/ce...
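
The comparison above (job-log failure time against journald entries) can be sketched as a small time-window filter. This is only an illustration: `journal_window` is a hypothetical helper, it assumes syslog-style timestamps at the start of each line, that the node's clock is in UTC like the job log, an English month abbreviation, and that the year is supplied by the caller since the journal lines do not carry one.

```python
from datetime import datetime, timedelta


def journal_window(lines, failure_time, seconds=5, year=2022):
    """Yield journal lines stamped within +/- `seconds` of `failure_time`."""
    delta = timedelta(seconds=seconds)
    for line in lines:
        stamp = line[:15]  # e.g. "Aug 18 09:32:45"
        try:
            when = datetime.strptime(f"{year} {stamp}", "%Y %b %d %H:%M:%S")
        except ValueError:
            continue  # not a syslog-style line, skip it
        if abs(when - failure_time) <= delta:
            yield line
```

For the failure above, the call would be `journal_window(lines, datetime(2022, 8, 18, 9, 32, 46))`.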


Rabi Mishra (rabi) wrote :

Since we've only started seeing this lately, could it be related to recent iptables/nftables changes? It can't even ssh to localhost (loopback address).

Cédric Jeanneret (cjeanner) wrote :

Nope, not that I know of at least. Not nftables (it still isn't used by default). As for iptables, I didn't see any DROPPED entries in the journald content (which is where they would show up).

Apparently, it's more likely down to scans and other things eating resources. We may be able to play with limit/burst on the firewall to limit the impact of bots, though that's of course far from perfect...

But with current master (and other branches), it's still iptables, still the old way.
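
For readers unfamiliar with the limit/burst idea mentioned above: it behaves like a token bucket, where up to `burst` new connections pass immediately and further ones are admitted only as tokens refill at the configured rate. The sketch below is a toy model of that behaviour, not the kernel's implementation and not a proposed tripleo change; the class name and the numbers used with it are hypothetical.

```python
class TokenBucket:
    """Toy model of a limit/burst rule: `rate` tokens per second, capped at `burst`."""

    def __init__(self, rate_per_sec, burst):
        self.rate = rate_per_sec
        self.burst = burst
        self.tokens = float(burst)  # the bucket starts full
        self.last = 0.0             # time of the previous attempt, in seconds

    def allow(self, now):
        """Return True if a connection attempt at time `now` is accepted."""
        # Refill according to elapsed time, never above the burst size.
        self.tokens = min(self.burst, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False
```

With a rate of 1 per second and a burst of 3, five attempts at the same instant give three accepts and two rejects. This also shows the "far from perfect" part: a rule like this keyed on all traffic would throttle the legitimate Ansible connections along with the scanners.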

Douglas Viroel (dviroel)
Changed in tripleo:
status: Triaged → Fix Released