CS9 - OVB FS001 master job is failing on overcloud_node_provisioning Failed to connect to the host via ssh

Bug #1970400 reported by Bhagyashri Shewale
Affects: tripleo
Status: Fix Released
Importance: Critical
Assigned to: Unassigned

Bug Description

2022-04-26 05:35:24.532269 | fa163ec7-522f-c5e7-3258-00000000000f | TIMING | Output growvols_args | overcloud-controller-2 | 0:00:10.698406 | 0.02s
2022-04-26 05:35:24.539613 | fa163ec7-522f-c5e7-3258-000000000010 | TASK | Find the growvols utility
[WARNING]: Unhandled error in Python interpreter discovery for host overcloud-
controller-2: Failed to connect to the host via ssh: ssh: connect to host
192.168.24.24 port 22: No route to host
[WARNING]: Unhandled error in Python interpreter discovery for host overcloud-
controller-1: Failed to connect to the host via ssh: ssh: connect to host
192.168.24.17 port 22: No route to host
[WARNING]: Unhandled error in Python interpreter discovery for host overcloud-
controller-0: Failed to connect to the host via ssh: ssh: connect to host
192.168.24.10 port 22: No route to host
2022-04-26 05:36:04.543539 | fa163ec7-522f-c5e7-3258-000000000010 | UNREACHABLE | Find the growvols utility | overcloud-controller-0
2022-04-26 05:36:04.544840 | fa163ec7-522f-c5e7-3258-000000000010 | TIMING | Find the growvols utility | overcloud-controller-0 | 0:00:50.710967 | 40.00s
2022-04-26 05:36:04.545529 | fa163ec7-522f-c5e7-3258-000000000010 | UNREACHABLE | Find the growvols utility | overcloud-controller-1
2022-04-26 05:36:04.546193 | fa163ec7-522f-c5e7-3258-000000000010 | TIMING | Find the growvols utility | overcloud-controller-1 | 0:00:50.712328 | 40.00s
2022-04-26 05:36:04.546843 | fa163ec7-522f-c5e7-3258-000000000010 | UNREACHABLE | Find the growvols utility | overcloud-controller-2
2022-04-26 05:36:04.547322 | fa163ec7-522f-c5e7-3258-000000000010 | TIMING | Find the growvols utility | overcloud-controller-2 | 0:00:50.713462 | 39.99s

[1]: https://review.rdoproject.org/zuul/builds?job_name=tripleo-ci-centos-9-ovb-3ctlr_1comp-featureset001
[2]: https://logserver.rdoproject.org/79/824479/26/openstack-check/tripleo-ci-centos-9-ovb-3ctlr_1comp-featureset001/35ab5ef/logs/undercloud/home/zuul/overcloud_node_provision.log.txt.gz
[3]: https://logserver.rdoproject.org/63/839163/2/openstack-check/tripleo-ci-centos-9-ovb-3ctlr_1comp-featureset001/ad39bde/logs/baremetal_2_86483_0-console.log

description: updated
summary: - CS9 - OVB FS039 and FS064 master jobs are failing on
- overcloud_node_provisioning Failed to connect to the host via ssh
+ CS9 - OVB FS001 master job is failing on overcloud_node_provisioning
+ Failed to connect to the host via ssh
Revision history for this message
Steve Baker (steve-stevebaker) wrote :

baremetal_2_86483_0-console.log suggests that it didn't progress beyond grub boot.

I downloaded the latest https://images.rdoproject.org/centos9/master/rdo_trunk/current-tripleo/overcloud-hardened-uefi-full.qcow2 and deployed it locally and it worked fine.

According to [1] the baremetal_image is ipxe-boot. Shouldn't it be ipxe-boot-uefi?

[1] https://logserver.rdoproject.org/63/839163/2/openstack-check/tripleo-ci-centos-9-ovb-3ctlr_1comp-featureset001/ad39bde/job-output.txt
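
Not from the bug itself: a hedged sketch of what the suggested image switch could look like in the OVB job definition. Only the baremetal_image variable and the job name come from the links above; the Zuul layout shown here is an assumption.

~~~
# Hypothetical Zuul job vars sketch -- file location and structure are assumed.
- job:
    name: tripleo-ci-centos-9-ovb-3ctlr_1comp-featureset001
    vars:
      # Boot the nodes with the UEFI-capable iPXE image so the
      # overcloud-hardened-uefi-full.qcow2 image can actually boot.
      baremetal_image: ipxe-boot-uefi
~~~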

Revision history for this message
Marios Andreou (marios-b) wrote :

Adding some more context:

Dariusz was testing [1] to switch to UEFI with the testproject at [2]. However, the latest run there [3] seems to hit the same issue as this bug:

[WARNING]: Unhandled error in Python interpreter discovery for host overcloud-
controller-1: Failed to connect to the host via ssh: ssh: connect to host
192.168.24.24 port 22: Connection refused
[WARNING]: Unhandled error in Python interpreter discovery for host overcloud-
controller-2: Failed to connect to the host via ssh: ssh: connect to host
192.168.24.22 port 22: Connection refused
[WARNING]: Unhandled error in Python interpreter discovery for host overcloud-

[1] https://review.rdoproject.org/r/c/rdo-jobs/+/42101
[2] https://review.rdoproject.org/r/c/testproject/+/41128
[3] https://logserver.rdoproject.org/28/41128/6/check/periodic-tripleo-ci-centos-9-ovb-3ctlr_1comp-featureset001-master/0d0101f/logs/undercloud/home/zuul/overcloud_node_provision.log.txt.gz

Revision history for this message
Marios Andreou (marios-b) wrote :

Just noticed... this bug was filed against the check queue, not periodic (i.e. the links in the description). So this isn't strictly speaking a promotion blocker, or at least I haven't found an example of it from periodic.

The last 2 runs from periodic [1] at [2] and [3] are failing with https://bugs.launchpad.net/tripleo/+bug/1970899

2022-05-01 14:14:54.385237 | fa163e71-adf7-868b-a282-00000000001a | FATAL | Provision instances | localhost | error={"changed": false, "logging": "Deploy attempt failed on node baremetal-51591-1 (UUID 9acffeaf-269a-44c5-80d7-217ed8d53127), cleaning up\nDeploy attempt failed on node baremetal-51591-3 (UUID a4f09c49-2cf0-48a6-853c-e7b22cf172b3), cleaning up\nDeploy attempt failed on node baremetal-51591-2 (UUID 75ffc271-6fd3-4bfe-af42-e3703c697454), cleaning up\nDeploy attempt failed on node baremetal-51591-0 (UUID 33e8dca6-2d1a-4244-8e6a-f7329df09771), cleaning up\n", "msg": "ConflictException: 409: Client Error for url: https://192.168.24.2:13696/v2.0/ports, Host 9acffeaf-269a-44c5-80d7-217ed8d53127 is not connected to any segments on routed provider network 'f3d3d76f-1764-4116-b8cd-42f49657abb6'. It should be connected to one."}

From the check queue [4] we have green runs e.g. [5] from 02 May

[1] https://review.rdoproject.org/zuul/builds?job_name=periodic-tripleo-ci-centos-9-ovb-3ctlr_1comp-featureset001-master
[2] https://logserver.rdoproject.org/openstack-periodic-integration-main/opendev.org/openstack/tripleo-ci/master/periodic-tripleo-ci-centos-9-ovb-3ctlr_1comp-featureset001-master/ea4ff20/logs/undercloud/home/zuul/overcloud_node_provision.log.txt.gz
[3] https://logserver.rdoproject.org/openstack-periodic-integration-main/opendev.org/openstack/tripleo-ci/master/periodic-tripleo-ci-centos-9-ovb-3ctlr_1comp-featureset001-master/4c29abe/logs/undercloud/home/zuul/overcloud_node_provision.log.txt.gz
[4] https://review.rdoproject.org/zuul/builds?job_name=tripleo-ci-centos-9-ovb-3ctlr_1comp-featureset001
[5] https://review.rdoproject.org/zuul/build/987cd4baa5a74f838c0bff0bd5927314

Revision history for this message
Marios Andreou (marios-b) wrote :

as discussed in yesterday's cix call - moving this to fix-released for now

we are no longer seeing it in check [1]

we haven't seen it (yet?) in periodic [2]

If you see more examples of this please re-open the bug and we can investigate more

[1] https://review.rdoproject.org/zuul/builds?job_name=tripleo-ci-centos-9-ovb-3ctlr_1comp-featureset001
[2] https://review.rdoproject.org/zuul/builds?job_name=periodic-tripleo-ci-centos-9-ovb-3ctlr_1comp-featureset001-master

Changed in tripleo:
status: Triaged → Fix Released
Revision history for this message
Marios Andreou (marios-b) wrote :

found one example of this today in periodic [1] - definitely not a consistent error - the other jobs in that buildset [2] are failing for different reasons; this is the only example

2022-05-02 22:25:46.906382 | fa163ef4-67df-b27f-92e9-00000000000f | TASK | Find the growvols utility
[WARNING]: Unhandled error in Python interpreter discovery for host overcloud-
controller-0: Failed to connect to the host via ssh: ssh: connect to host
192.168.24.8 port 22: Connection refused
2022-05-02 22:25:59.746164 | fa163ef4-67df-b27f-92e9-00000000000f | UNREACHABLE | Find the growvols utility | overcloud-controller-0

[1] https://logserver.rdoproject.org/openstack-periodic-integration-main/opendev.org/openstack/tripleo-ci/master/periodic-tripleo-ci-centos-9-ovb-1ctlr_2comp-featureset020-master/369e8a4/logs/undercloud/home/zuul/overcloud_node_provision.log.txt.gz
[2] https://review.rdoproject.org/zuul/buildset/af8e6da8e7b24619957694aa2a62991f

Revision history for this message
Marios Andreou (marios-b) wrote :

moving back to in progress - more examples from master integration at https://logserver.rdoproject.org/openstack-periodic-integration-main/opendev.org/openstack/tripleo-ci/master/periodic-tripleo-ci-centos-9-ovb-3ctlr_1comp-featureset001-master/122791b/logs/undercloud/home/zuul/overcloud_node_provision.log.txt.gz

 2022-05-03 06:09:42.548591 | fa163e03-4435-6f90-80b0-000000000010 | TIMING | Find the growvols utility | overcloud-controller-2 | 0:00:11.813522 | 1.20s
[WARNING]: Unhandled error in Python interpreter discovery for host overcloud-
controller-1: Failed to connect to the host via ssh: ssh: connect to host
192.168.24.12 port 22: No route to host
[WARNING]: Unhandled error in Python interpreter discovery for host overcloud-
controller-0: Failed to connect to the host via ssh: ssh: connect to host
192.168.24.7 port 22: No route to host
2022-05-03 06:10:21.701999 | fa163e03-4435-6f90-80b0-000000000010 | UNREACHABLE | Find the growvols utility | overcloud-controller-1

Changed in tripleo:
status: Fix Released → Triaged
Revision history for this message
Sandeep Yadav (sandeepyadav93) wrote :

Looked at the logs along with Harald; a few observations:

* Ansible failed for controller-0 at 2022-05-10 06:13:37.289975 [1]; around the same time there were network-related logs on the controller, suspecting the legacy network service is causing a blip [2].
* cloud-init was still running by the time Ansible failed; as per the logs, cloud-init creates the heat-admin user after the Ansible login failure [3].

[1]

https://logserver.rdoproject.org/openstack-periodic-integration-main/opendev.org/openstack/tripleo-ci/master/periodic-tripleo-ci-centos-9-ovb-1ctlr_2comp-featureset020-master/69a4390/logs/undercloud/home/zuul/overcloud_node_provision.log.txt.gz

~~~
2022-05-10 06:13:28.960063 | fa163e8d-01ec-fe41-ce7a-00000000000f | TASK | Find the growvols utility
[WARNING]: Unhandled error in Python interpreter discovery for host overcloud-
controller-0: Failed to connect to the host via ssh: ssh: connect to host
192.168.24.22 port 22: Connection refused
~~~

[2]

https://logserver.rdoproject.org/openstack-periodic-integration-main/opendev.org/openstack/tripleo-ci/master/periodic-tripleo-ci-centos-9-ovb-1ctlr_2comp-featureset020-master/69a4390/logs/overcloud-controller-0/var/log/extra/journal.txt.gz

~~~
May 10 06:12:38 overcloud-controller-0 NetworkManager[1130]: <info> [1652177558.7952] device (enp3s0): Activation: successful, device activated.
< ... >
May 10 06:13:23 overcloud-controller-0 NetworkManager[1130]: <info> [1652177603.9559] dhcp4 (enp8s0): activation: beginning transaction (timeout in 45 seconds)
May 10 06:13:38 overcloud-controller-0 systemd[1]: NetworkManager-wait-online.service: Main process exited, code=exited, status=1/FAILURE
May 10 06:13:38 overcloud-controller-0 systemd[1]: NetworkManager-wait-online.service: Failed with result 'exit-code'.
May 10 06:13:38 overcloud-controller-0 systemd[1]: Failed to start Network Manager Wait Online.
May 10 06:13:38 overcloud-controller-0 systemd[1]: Starting LSB: Bring up/down networking...
May 10 06:13:38 overcloud-controller-0 network[1195]: WARN : [network] You are using 'network' service provided by 'network-scripts', which are now deprecated.
May 10 06:13:38 overcloud-controller-0 network[1206]: You are using 'network' service provided by 'network-scripts', which are now deprecated.
May 10 06:13:38 overcloud-controller-0 network[1195]: WARN : [network] 'network-scripts' will be removed from distribution in near future.
May 10 06:13:38 overcloud-controller-0 network[1207]: 'network-scripts' will be removed from distribution in near future.
May 10 06:13:38 overcloud-controller-0 network[1195]: WARN : [network] It is advised to switch to 'NetworkManager' instead for network management.
May 10 06:13:38 overcloud-controller-0 network[1208]: It is advised to switch to 'NetworkManager' instead for network management.
May 10 06:13:38 overcloud-controller-0 NetworkManager[1130]: <info> [1652177618.3949] audit: op="connections-reload" pid=1248 uid=0 result="success"
May 10 06:13:38 overcloud-controller-0 network[1195]: Bringing up loopback interface: [ OK ]
May 10 06:13:38 overcloud-controller-0 NetworkManager[1130]: <info> [1652177618.5848] audit: op="connections-load" args="...

Read more...
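
As an aside (not part of the truncated comment above), a minimal Ansible sketch of how the ordering suspicion could be confirmed on an affected controller once it is reachable; the host name is illustrative and the commands assume cloud-init and the systemd journal are available on the node.

~~~
# Hedged diagnostic sketch, not taken from the bug report.
- hosts: overcloud-controller-0
  gather_facts: false
  become: true
  tasks:
    - name: Check whether cloud-init has finished
      ansible.builtin.command: cloud-init status --long
      changed_when: false

    - name: Collect the NetworkManager / network-scripts / sshd / cloud-init ordering
      ansible.builtin.command: >-
        journalctl --no-pager
        -u NetworkManager-wait-online -u network -u sshd -u cloud-init
      changed_when: false
~~~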

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to tripleo-ansible (master)
Changed in tripleo:
status: Triaged → In Progress
Revision history for this message
Steve Baker (steve-stevebaker) wrote :

The switch to wait_for_connection left the old connection: local, so it is not actually attempting any ssh connection. I've proposed https://review.opendev.org/c/openstack/tripleo-ansible/+/841360 to fix this.
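
For context, a minimal sketch of the pattern the fix describes, assuming a play structure similar to the provisioning playbook (an illustration, not the actual patch): with connection: local in effect, wait_for_connection probes the undercloud and returns immediately, so the connection has to target the inventory host for the wait to mean anything.

~~~
# Hedged sketch only; see the linked review for the real change.
- hosts: overcloud
  gather_facts: false
  tasks:
    - name: Wait for the provisioned node to accept ssh connections
      # Override any connection: local inherited from the surrounding play so
      # the wait targets the inventory host instead of localhost.
      vars:
        ansible_connection: ssh
      ansible.builtin.wait_for_connection:
        delay: 5
        timeout: 600
~~~

With this in place, later tasks such as "Find the growvols utility" only run once sshd on the node is reachable, instead of racing cloud-init and NetworkManager as seen in the logs above.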

Revision history for this message
Steve Baker (steve-stevebaker) wrote :

It looks like the proposed fix works: overcloud-controller-2 took 43 seconds to finish wait_for_connection, and the provision then ran to completion:

https://review.rdoproject.org/zuul/build/765126b4ac804d87b9fb6e410a2cda27/log/logs/undercloud/home/zuul/overcloud_node_provision.log.txt.gz#128

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to tripleo-ansible (master)

Reviewed: https://review.opendev.org/c/openstack/tripleo-ansible/+/841360
Committed: https://opendev.org/openstack/tripleo-ansible/commit/ba4e9908acd2931c693f45cda8b4a26748a3b846
Submitter: "Zuul (22348)"
Branch: master

commit ba4e9908acd2931c693f45cda8b4a26748a3b846
Author: Steve Baker <email address hidden>
Date: Wed May 11 13:37:03 2022 +1200

    Use inventory host for wait_for_connection

    Using the local connection will always return immediately, but the
    intent is to wait for the inventory node to have a fully successful
    ssh connection

    Change-Id: Ie763082f79754107031bd93578697e1a35fe11d0
    Closes-Bug: 1970400

Changed in tripleo:
status: In Progress → Fix Released
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/tripleo-ansible 5.0.0

This issue was fixed in the openstack/tripleo-ansible 5.0.0 release.
