RDO cloud jobs failing with SSH Error: data could not be sent to remote host \"38.145.32.100\". Make sure this host can be reached over ssh

Bug #1781871 reported by chandan kumar
This bug affects 1 person
Affects: tripleo
Status: Fix Released
Importance: Critical
Assigned to: Sagi (Sergey) Shnaidman

Bug Description

On all branches, the openstack-check and openstack-periodic-24hr jobs failed with POST_FAILURE and the following error:
2018-07-16 04:44:34.735807 | primary | TASK [rdo-kolla-build : Build and push images] *********************************
2018-07-16 06:44:45.050329 | [Zuul] Log Stream did not terminate
2018-07-16 06:44:45.051225 | primary | ERROR
2018-07-16 06:44:45.051414 | primary | {
2018-07-16 06:44:45.051473 | primary | "msg": "SSH Error: data could not be sent to remote host \"38.145.32.100\". Make sure this host can be reached over ssh",
2018-07-16 06:44:45.051522 | primary | "unreachable": true

Below is the list of affected jobs:
https://logs.rdoproject.org/openstack-periodic/git.openstack.org/openstack-infra/tripleo-ci/master/legacy-periodic-tripleo-centos-7-queens-containers-build/8aa51e6/job-output.txt.gz#_2018-07-16_06_44_51_326465

https://logs.rdoproject.org/openstack-periodic-24hr/git.openstack.org/openstack-infra/tripleo-ci/master/legacy-periodic-tripleo-ci-centos-7-ovb-1ctlr_1comp-featureset002-pike-upload/fcd86ab/job-output.txt.gz#_2018-07-16_06_27_54_988307

https://logs.rdoproject.org/openstack-periodic-24hr/git.openstack.org/openstack-infra/tripleo-ci/master/legacy-periodic-tripleo-ci-centos-7-ovb-3ctlr_1comp-featureset001-pike/85a8656/job-output.txt.gz#_2018-07-16_06_15_49_631042

https://logs.rdoproject.org/openstack-periodic-24hr/git.openstack.org/openstack-infra/tripleo-ci/master/legacy-periodic-tripleo-ci-centos-7-ovb-1ctlr_1comp-featureset020-pike/05e6fad/job-output.txt.gz#_2018-07-16_06_28_39_032920

https://logs.rdoproject.org/openstack-periodic-24hr/git.openstack.org/openstack-infra/tripleo-ci/master/legacy-periodic-tripleo-ci-centos-7-ovb-1ctlr_1comp_1ceph-featureset024-pike/5b7343d/job-output.txt.gz#_2018-07-16_06_27_29_646653

https://logs.rdoproject.org/openstack-periodic-24hr/git.openstack.org/openstack-infra/tripleo-ci/master/legacy-periodic-tripleo-centos-7-pike-containers-build/1f98165/job-output.txt.gz#_2018-07-16_06_38_52_874246

https://logs.rdoproject.org/openstack-periodic-24hr/git.openstack.org/openstack-infra/tripleo-ci/master/legacy-periodic-tripleo-ci-centos-7-ovb-1ctlr_1comp-featureset002-ocata-upload/1427a55/job-output.txt.gz#_2018-07-16_06_27_59_590250

https://logs.rdoproject.org/openstack-periodic-24hr/git.openstack.org/openstack-infra/tripleo-ci/master/legacy-periodic-tripleo-ci-centos-7-ovb-3ctlr_1comp-featureset001-ocata/8773fa5/job-output.txt.gz#_2018-07-16_06_28_05_790110

https://logs.rdoproject.org/openstack-periodic-24hr/git.openstack.org/openstack-infra/tripleo-ci/master/legacy-periodic-tripleo-ci-centos-7-ovb-1ctlr_1comp-featureset020-ocata/f7cf044/job-output.txt.gz#_2018-07-16_06_15_16_845042

https://logs.rdoproject.org/29/582829/1/openstack-check/legacy-tripleo-ci-centos-7-containers-multinode-upgrades-pike-branch/8f820b3/job-output.txt.gz#_2018-07-16_06_52_52_524457

https://logs.rdoproject.org/29/582829/1/openstack-check/legacy-tripleo-ci-centos-7-multinode-1ctlr-featureset037-updates-master/bb1201a/job-output.txt.gz#_2018-07-16_07_07_53_048663

2018-07-16 07:08:01.963100 | primary | ERROR
2018-07-16 07:08:01.963319 | primary | {
2018-07-16 07:08:01.963387 | primary | "msg": "SSH Error: data could not be sent to remote host \"38.145.33.250\". Make sure this host can be reached over ssh",
2018-07-16 07:08:01.963477 | primary | "unreachable": true
2018-07-16 07:08:01.963524 | primary | }

https://logs.rdoproject.org/29/582829/1/openstack-check/legacy-tripleo-ci-centos-7-container-to-container-upgrades-master/9b8af30/job-output.txt.gz#_2018-07-16_07_05_23_966980

https://logs.rdoproject.org/29/582829/1/openstack-check/legacy-tripleo-ci-centos-7-ovb-3ctlr_1comp-featureset001-master/2744331/job-output.txt.gz#_2018-07-16_07_04_54_225882

https://logs.rdoproject.org/29/582829/1/openstack-check/legacy-tripleo-ci-centos-7-ovb-3ctlr_1comp-featureset035-master/4561bc6/job-output.txt.gz#_2018-07-16_07_06_30_345890

https://logs.rdoproject.org/29/582829/1/openstack-check/legacy-tripleo-ci-centos-7-ovb-3ctlr_1comp-featureset001-pike-branch/f921b67/job-output.txt.gz#_2018-07-16_07_06_45_433857

https://logs.rdoproject.org/29/582829/1/openstack-check/legacy-tripleo-ci-centos-7-ovb-3ctlr_1comp-featureset001-ocata-branch/b8734d4/job-output.txt.gz#_2018-07-16_06_54_26_484999

https://logs.rdoproject.org/29/582829/1/openstack-check/legacy-tripleo-ci-centos-7-ovb-3ctlr_1comp-featureset001-queens-branch/bb72d2d/job-output.txt.gz#_2018-07-16_06_54_25_255885

This might be an infra issue. It needs to be checked, as it is blocking most of the openstack-check jobs.

Tags: ci
Sagi (Sergey) Shnaidman (sshnaidm) wrote :

This is a connection failure from the Zuul executor to the nodepool node.

For example:
https://logs.rdoproject.org/29/582829/1/openstack-check/legacy-tripleo-ci-centos-7-ovb-3ctlr_1comp-featureset001-queens-branch/bb72d2d/job-output.txt.gz#_2018-07-16_06_54_25_255885

"msg": "SSH Error: data could not be sent to remote host \"38.145.33.10\". Make sure this host can be reached over ssh",

and 38.145.33.10 is the VM host:
https://logs.rdoproject.org/29/582829/1/openstack-check/legacy-tripleo-ci-centos-7-ovb-3ctlr_1comp-featureset001-queens-branch/bb72d2d/zuul-info/inventory.yaml

This seems like a network issue in RDO cloud.

Changed in tripleo:
assignee: nobody → Sagi (Sergey) Shnaidman (sshnaidm)
Tristan Cacqueray (tristan-cacqueray) wrote :

It is surprising that 7 jobs failed right after these tasks: "Build and push images" or "run the image build script (direct)".
The other jobs also failed after similar tasks:
2 failed after "build-test-packages : Run DLRN"
1 failed after "Prepare the overcloud images for deploy"
1 failed after "Run the package installation script"
1 failed after "modify-image : Run script on image"
1 failed after "modify-image : Close qcow2 image"

Though there are other, unrelated last tasks:
1 failed after "Add eth2 interface from eth2.conf"
1 failed after "Run the TripleO-CI VXLAN networking script"
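
A tally like the one above could be produced by scanning each job-output.txt for the last TASK line seen before the SSH error. The following is only a minimal sketch, not the script that was actually used: the function names are hypothetical, and the regular expressions assume the console line format quoted in this report.

```python
import re
from collections import Counter

# Assumed format of a Zuul console task line, e.g.
# "... | primary | TASK [rdo-kolla-build : Build and push images] *****"
TASK_RE = re.compile(r"\| TASK \[(?P<task>.+?)\] \*+")

# The unreachable message; the host may be wrapped in \" or in a plain ".
SSH_ERR_RE = re.compile(
    r'SSH Error: data could not be sent to remote host \\?"(?P<host>[^"\\]+)'
)


def last_task_before_ssh_error(lines):
    """Return (task, host) for the first SSH error in lines, or None."""
    task = None
    for line in lines:
        match = TASK_RE.search(line)
        if match:
            task = match.group("task")
            continue
        match = SSH_ERR_RE.search(line)
        if match:
            return task, match.group("host")
    return None


def tally_last_tasks(logs):
    """Count how many logs failed right after each task name."""
    counts = Counter()
    for lines in logs:
        hit = last_task_before_ssh_error(lines)
        if hit is not None:
            counts[hit[0]] += 1
    return counts


if __name__ == "__main__":
    # Lines copied from the bug description.
    sample = [
        r'2018-07-16 04:44:34.735807 | primary | TASK [rdo-kolla-build : Build and push images] *********************************',
        r'2018-07-16 06:44:45.050329 | [Zuul] Log Stream did not terminate',
        r'2018-07-16 06:44:45.051473 | primary | "msg": "SSH Error: data could not be sent to remote host \"38.145.32.100\". Make sure this host can be reached over ssh",',
    ]
    print(last_task_before_ssh_error(sample))
    print(tally_last_tasks([sample]))
```

Each log would be passed in as a list of lines (for example from a downloaded and decompressed job-output.txt.gz). Logs without the SSH error are simply skipped.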

Sagi (Sergey) Shnaidman (sshnaidm) wrote :

Opened a ticket with RDO cloud ops, number #726.

summary: - [All branches][openstack-check][openstack-periodic] failing with SSH
- Error: data could not be sent to remote host \"38.145.32.100\". Make
- sure this host can be reached over ssh
+ RDO cloud jobs failing with SSH Error: data could not be sent to remote
+ host \"38.145.32.100\". Make sure this host can be reached over ssh
Paul Belanger (pabelanger) wrote :

I don't believe this to be a networking issue, but rather an issue where nodepool is deleting the nodes behind the back of the running jobs. The correct fix here is maybe to move zookeeper in sf.io onto its own dedicated server and back it with SSDs. We should also cluster zookeeper here to avoid a single point of failure.

tags: removed: alert
Tristan Cacqueray (tristan-cacqueray) wrote :

We updated the zookeeper settings to:

tickTime=5000
minSessionTimeout=600000
maxSessionTimeout=1800000

Please let us know if the job failure happens again.
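
For readers unfamiliar with these settings: all three values are in milliseconds, and when the session timeouts are not set, ZooKeeper defaults them to 2 x tickTime (minimum) and 20 x tickTime (maximum). The sketch below only converts the values quoted above into readable units and compares them with those defaults. The helper names are hypothetical, and the idea that longer sessions keep nodepool from losing its ZooKeeper session during stalls is an interpretation of the comments above, not something confirmed in this report.

```python
# The settings quoted in the comment above, as they would appear in zoo.cfg.
ZK_SETTINGS = """
tickTime=5000
minSessionTimeout=600000
maxSessionTimeout=1800000
"""


def parse_settings(text):
    """Parse key=value lines into a dict of integers (milliseconds)."""
    settings = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        key, _, value = line.partition("=")
        settings[key.strip()] = int(value)
    return settings


def describe(settings):
    """Convert the timeouts to readable units and compute the defaults."""
    tick = settings["tickTime"]
    return {
        "tick_seconds": tick / 1000,
        "min_session_minutes": settings["minSessionTimeout"] / 60000,
        "max_session_minutes": settings["maxSessionTimeout"] / 60000,
        # ZooKeeper defaults when the timeouts are left unset.
        "default_min_session_ms": 2 * tick,
        "default_max_session_ms": 20 * tick,
    }


if __name__ == "__main__":
    for name, value in describe(parse_settings(ZK_SETTINGS)).items():
        print(f"{name}: {value}")
```

With these values, a session is allowed to last between 10 and 30 minutes without a heartbeat, compared with 10 to 100 seconds under the defaults for the same tickTime.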

chandan kumar (chkumar246) wrote :

We are still seeing this issue: http://logs.openstack.org/23/582823/4/gate/tripleo-ci-centos-7-scenario001-multinode-oooq-container/e548c5a/job-output.txt.gz#_2018-07-18_07_44_08_936878

2018-07-18 06:09:55.151553 | primary | TASK [overcloud-prep-containers : Prepare for the containerized deployment] ****
2018-07-18 06:09:55.199369 | primary | Wednesday 18 July 2018 06:09:55 +0000 (0:00:02.988) 0:00:36.454 ********
2018-07-18 07:44:08.786736 | [Zuul] Log Stream did not terminate
2018-07-18 07:44:08.912815 | primary | ERROR
2018-07-18 07:44:08.935101 | primary | {
2018-07-18 07:44:08.936878 | primary | "msg": "SSH Error: data could not be sent to remote host \"149.202.166.193\". Make sure this host can be reached over ssh",
2018-07-18 07:44:08.941615 | primary | "unreachable": true
2018-07-18 07:44:08.941741 | primary | }
2018-07-18 07:44:10.313197 |

Sagi (Sergey) Shnaidman (sshnaidm) wrote :
chandan kumar (chkumar246) wrote :
Matt Riedemann (mriedem) wrote :
Matt Riedemann (mriedem) wrote :

Looks like it was a major issue with packethost-us-west-1.

Changed in tripleo:
milestone: rocky-3 → rocky-rc1
Sagi (Sergey) Shnaidman (sshnaidm) wrote :

This was solved in infra.

Changed in tripleo:
status: Triaged → Fix Released