post failure on multiple periodic jobs with SSH Error: data could not be sent to remote host

Bug #1803964 reported by chandan kumar
Affects: tripleo
Status: Won't Fix
Importance: Critical
Assigned to: Unassigned

Bug Description

This is a tracker bug for the multiple post failures seen on the RDO Cloud periodic ovb jobs, with errors like the following:
https://logs.rdoproject.org/openstack-periodic/git.openstack.org/openstack-infra/tripleo-ci/master/periodic-tripleo-ci-centos-7-multinode-1ctlr-featureset037-updates-master/44bf16c/job-output.txt.gz#_2018-11-19_01_44_23_312362

TASK [remove-build-sshkey : Remove the build SSH key from all nodes]
2018-11-19 01:44:08.625169 | primary | changed
2018-11-19 01:44:23.311954 | secondary | ERROR
2018-11-19 01:44:23.312239 | secondary | {
2018-11-19 01:44:23.312362 | secondary | "msg": "SSH Error: data could not be sent to remote host \"38.145.33.154\". Make sure this host can be reached over ssh",
2018-11-19 01:44:23.312442 | secondary | "unreachable": true
2018-11-19 01:44:23.312535 | secondary | }

https://logs.rdoproject.org/openstack-periodic/git.openstack.org/openstack-infra/tripleo-ci/master/periodic-tripleo-ci-centos-7-multinode-1ctlr-featureset018-rocky/ef89b3c/job-output.txt.gz#_2018-11-19_08_37_16_190026

https://logs.rdoproject.org/openstack-periodic-24hr/git.openstack.org/openstack-infra/tripleo-ci/master/periodic-tripleo-ci-centos-7-ovb-1ctlr_1comp-featureset002-pike-upload/f00ec1c/job-output.txt.gz#_2018-11-19_06_06_53_264746

https://logs.rdoproject.org/69/618669/1/openstack-check/tripleo-ci-centos-7-ovb-3ctlr_1comp-featureset053/d9dc22f/job-output.txt.gz#_2018-11-19_04_00_43_694057

https://logs.rdoproject.org/openstack-periodic/git.openstack.org/openstack-infra/tripleo-ci/master/periodic-tripleo-ci-centos-7-multinode-1ctlr-featureset037-updates-master/44bf16c/job-output.txt.gz#_2018-11-19_01_44_23_312362

Timestamps when it happened:
- 2018-11-19 01:44:23
- 2018-11-19 08:37:16
- 2018-11-19 06:06:53

As per a discussion with kforde on #rhos-ops, there is no networking issue and the E2E tests were passing.
We need to look into why this happened.

Revision history for this message
Javier Peña (jpena-c) wrote :

In 2 of the 4 cases, we are seeing an SSH error message claiming that the SSH host key had changed:

- https://logs.rdoproject.org/openstack-periodic-24hr/git.openstack.org/openstack-infra/tripleo-ci/master/periodic-tripleo-ci-centos-7-ovb-1ctlr_1comp-featureset002-pike-upload/f00ec1c/job-output.txt.gz#_2018-11-19_06_06_59_957778

- https://logs.rdoproject.org/69/618669/1/openstack-check/tripleo-ci-centos-7-ovb-3ctlr_1comp-featureset053/d9dc22f/job-output.txt.gz#_2018-11-19_04_00_37_775801

It looks like the node had been deleted outside the control of nodepool. Maybe we can check the RDO Cloud logs to see which user requested the deletion of those nodes? Also, do we have any other script/bot (te-broker) that could have deleted the VM?
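
As an illustration, a minimal sketch of that kind of check against the nova-api request log, assuming the default WSGI request-log format; the log path and instance UUID below are placeholders, not taken from the report:

# Minimal sketch: find which request deleted a given instance by scanning the
# nova-api log. Log path, format, and UUID are assumptions for illustration.
import re

NOVA_API_LOG = "/var/log/nova/nova-api.log"                 # hypothetical path
SERVER_UUID = "00000000-0000-0000-0000-000000000000"        # instance to check

pattern = re.compile(
    r'\[req-(?P<req>[0-9a-f-]+) (?P<user>\S+) (?P<project>\S+)[^\]]*\] '
    r'(?P<client>\S+) "DELETE /v2\.1/servers/' + re.escape(SERVER_UUID)
)

with open(NOVA_API_LOG) as fh:
    for line in fh:
        match = pattern.search(line)
        if match:
            print(line.strip())
            print("request-id: req-%s  user: %s  project: %s  client: %s" % (
                match.group("req"), match.group("user"),
                match.group("project"), match.group("client")))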

Revision history for this message
Alan Pevec (apevec) wrote :

@Javier there is a cleanup cron job on the te-broker.
@Chandan @panda where is the log from the cleanup job, so we can compare timestamps?

Revision history for this message
Alan Pevec (apevec) wrote :

@Javier re. RDO Cloud logs - please open a ticket for cloud ops

Revision history for this message
Javier Peña (jpena-c) wrote :

@Alan ticket opened. We have the following Nova log for one of the instances:

2018-11-19 13:49:14.195 4283 INFO nova.osapi_compute.wsgi.server [req-2a1bab57-c41b-4005-b315-99c74c1e2e49 e07dc7f3bfb14472b5a76fc0bba9480f 5d8fad99d6e14040b794ea2b0c4d91f5 - default default] 172.16.0.12 "DELETE /v2.1/servers/3d520087-3069-4608-9846-cf00c8fd5672 HTTP/1.1" status: 204 len: 315 time: 1.1041851

The source IP is one of the controllers, so we need to check the HAProxy logs (if available) to find the real source IP.
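
For reference, that correlation could look roughly like the sketch below, assuming the default HAProxy httplog format (client ip:port right after the syslog prefix, quoted request line at the end); the log path and UUID are placeholders:

# Minimal sketch: recover the real client IP behind the controller VIP from an
# HAProxy HTTP access log, if one is available. Format, path, and UUID are
# assumptions for illustration.
import re

HAPROXY_LOG = "/var/log/haproxy.log"                        # hypothetical path
SERVER_UUID = "00000000-0000-0000-0000-000000000000"        # instance to check

pattern = re.compile(
    r'haproxy\[\d+\]: (?P<client>\d{1,3}(?:\.\d{1,3}){3}):\d+ .*'
    r'"DELETE [^"]*' + re.escape(SERVER_UUID)
)

with open(HAPROXY_LOG) as fh:
    for line in fh:
        match = pattern.search(line)
        if match:
            print("real client:", match.group("client"))
            print(line.strip())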

Revision history for this message
Javier Peña (jpena-c) wrote :

Could we get the /var/log/cleaning_stacks.log file from the te-broker for November 19? We do not have HAProxy logs from RDO Cloud, but checking back through the nodepool logs for one of the cases, I found the following:

2018-11-19 04:03:27,512 DEBUG nodepool.NodeLauncher-0000303194: Node 0000303194 is running [region: regionOne, az: nova, ip: 38.145.33.10 ipv4: 38.145.33.10, ipv6: ]

Instance 52c2c700-6a4e-4f1f-b156-0f18eff30301 was created at that time, and nodepool tried to delete it at 06:07:

2018-11-19 06:07:21,098 INFO nodepool.DeletedNodeWorker: Deleting used instance 52c2c700-6a4e-4f1f-b156-0f18eff30301 from rdo-cloud-tripleo

It got IP 38.145.33.10; however, that same IP was handed out to other nodes about an hour before the original instance was deleted (!):

2018-11-19 05:01:38,812 DEBUG nodepool.NodeLauncher-0000299388: Node 0000299388 is running [region: regionOne, az: nova, ip: 38.145.33.10 ipv4: 38.145.33.10, ipv6: ]
2018-11-19 05:08:12,925 DEBUG nodepool.NodeLauncher-0000303241: Node 0000303241 is running [region: regionOne, az: nova, ip: 38.145.33.10 ipv4: 38.145.33.10, ipv6: ]

That means that either the VM was deleted or the FIP was released way earlier than it should have been. Nodepool did not do it (at least there are no traces of it in the logs), so let's check if the ovb cleanup script did it.
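
As an illustration, detecting this kind of floating IP reuse could be scripted roughly as below, parsing the "Node ... is running" launcher lines quoted above; the log path is a hypothetical example:

# Minimal sketch: flag floating IPs that were handed to more than one nodepool
# node, based on "Node <id> is running [... ip: <addr> ...]" launcher lines.
# The log path is an assumption for illustration.
import re
from collections import defaultdict

LAUNCHER_LOG = "/var/log/nodepool/launcher-debug.log"       # hypothetical path

pattern = re.compile(
    r'(?P<ts>\S+ \S+) DEBUG nodepool\.NodeLauncher-(?P<node>\d+): '
    r'Node \d+ is running \[.*?\bip: (?P<ip>\S+)'
)

assignments = defaultdict(list)   # ip -> [(timestamp, node id), ...]
with open(LAUNCHER_LOG) as fh:
    for line in fh:
        match = pattern.search(line)
        if match:
            assignments[match.group("ip")].append(
                (match.group("ts"), match.group("node")))

for ip, uses in sorted(assignments.items()):
    if len({node for _, node in uses}) > 1:
        print("IP handed to multiple nodes:", ip)
        for ts, node in sorted(uses):
            print("   %s  node %s" % (ts, node))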

Revision history for this message
chandan kumar (chkumar246) wrote :
Revision history for this message
Javier Peña (jpena-c) wrote :

We have been checking logs and correlating timestamps, and this is the current situation. We have seen the following pattern:

1- VM1 gets created by nodepool, gets assigned a floating IP (FIP1).
2- In the Neutron logs, we see nodepool deleting the FIP, but VM1 stays active.
3- VM2 gets created by nodepool, gets FIP1 assigned.
4- The job running on VM1 eventually fails, and VM1 is deleted by nodepool.

We have not yet found the reason for step 2, so we will be running some more tests to find out.
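
A sketch of the corresponding check on the Neutron side, assuming neutron-server logs API requests in a format similar to the nova example above (the log path is hypothetical), so the floating IP DELETE timestamps can be lined up against nodepool's node lifetimes:

# Minimal sketch: list floating-IP DELETE requests from the neutron-server API
# log, with timestamp and request id, so they can be compared against the
# nodepool launch/delete times above. Log path and format are assumptions.
import re

NEUTRON_LOG = "/var/log/neutron/server.log"                 # hypothetical path

pattern = re.compile(
    r'(?P<ts>\S+ \S+).*\[req-(?P<req>[0-9a-f-]+) [^\]]*\] '
    r'\S+ "DELETE /v2\.0/floatingips/(?P<fip>[0-9a-f-]+)'
)

with open(NEUTRON_LOG) as fh:
    for line in fh:
        match = pattern.search(line)
        if match:
            print(match.group("ts"), "req-" + match.group("req"),
                  "deleted floating IP", match.group("fip"))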

Revision history for this message
Javier Peña (jpena-c) wrote :

I have proposed https://softwarefactory-project.io/r/14366 to monitor whether it makes a difference in the rdo-cloud-tripleo tenant.

Revision history for this message
chandan kumar (chkumar246) wrote :
Revision history for this message
Javier Peña (jpena-c) wrote :
Revision history for this message
Sorin Sbarnea (ssbarnea) wrote :

Javier, is this fixed, or do we need to wait a little bit more?

Revision history for this message
Javier Peña (jpena-c) wrote :

We haven't had a single occurrence of the issue since early this week; see [1]. There have been several changes in the meantime:

- RDO Cloud update and hypervisor reboot
- Nodepool restarts due to an internal error

Those changes might have contributed to the issue going away, so we could close it for now and reopen if we see new occurrences.

[1] - https://review.rdoproject.org/app/kibana#/discover?_g=(refreshInterval:(display:Off,pause:!f,value:0),time:(from:now-7d,mode:quick,to:now))&_a=(columns:!(_source),index:'logstash-*',interval:auto,query:(query_string:(analyze_wildcard:!t,query:'tags:%20console%20AND%20message:%22SSH%20Error:%20data%20could%20not%20be%20sent%20to%20remote%20host%22')),sort:!('@timestamp',desc))

Revision history for this message
Sorin Sbarnea (ssbarnea) wrote :

I asked because I was trying to come up with a logstash query to identify this error, but the mentioned string is far too generic to be trusted: "SSH Error: data could not be sent to remote host".

It would be useful to have such a query added so we would catch this when it happens again.

Revision history for this message
wes hayutin (weshayutin) wrote :

Bah, I see several recreations of this issue on Dec 8 and Dec 10, 2018. Still a bug.

Changed in tripleo:
milestone: stein-2 → stein-3
wes hayutin (weshayutin)
Changed in tripleo:
status: Triaged → Incomplete
tags: removed: promotion-blocker
Changed in tripleo:
milestone: stein-3 → stein-rc1
wes hayutin (weshayutin)
Changed in tripleo:
status: Incomplete → Won't Fix