periodic FS 1 wallaby fails node provision "Timeout reached while waiting for callback"

Bug #1933504 reported by Marios Andreou
This bug affects 1 person
Affects: tripleo
Status: Fix Released
Importance: Critical
Assigned to: Unassigned
Milestone: —

Bug Description

At [1][2][3] the periodic-tripleo-ci-centos-8-ovb-3ctlr_1comp-featureset001-wallaby job fails during overcloud deploy node provisioning with a trace like:

        * 2021-06-23 12:33:29 | 2021-06-23 12:33:29.267851 | fa163e17-7f4c-6726-4747-000000000017 | FATAL | Provision instances | localhost | error={"changed": false, "logging": "Created port overcloud-novacompute-0-ctlplane (UUID dacde964-3f8f-4c8f-82ae-a6586b945924) for node baremetal-88561-2 (UUID fd99a70f-a936-4f86-a5cc-4f17ff047988) with {'network_id': 'b2f0ad25-d8b4-4af0-9115-ff1dd13f4e08', 'name': 'overcloud-novacompute-0-ctlplane'}\nCreated port overcloud-controller-0-ctlplane (UUID c927e1c6-018b-4f2c-b3b3-2710fb9ba64d) for node baremetal-88561-0 (UUID a8a00499-55ca-45e0-9f1f-aaf0601c5f2e) with {'network_id': 'b2f0ad25-d8b4-4af0-9115-ff1dd13f4e08', 'name': 'overcloud-controller-0-ctlplane'}\nCreated port overcloud-controller-1-ctlplane (UUID 32b81a82-8312-45d9-816c-89289e24d908) for node baremetal-88561-3 (UUID b1c56874-a7c8-4279-b859-4d75cce56ff3) with {'network_id': 'b2f0ad25-d8b4-4af0-9115-ff1dd13f4e08', 'name': 'overcloud-controller-1-ctlplane'}\nCreated port overcloud-controller-2-ctlplane (UUID b4ee18c0-18c8-4918-8d4d-c1d499c8852c) for node baremetal-88561-1 (UUID 54a177cb-b9e5-4cf1-b7f8-6a5f369280bb) with {'network_id': 'b2f0ad25-d8b4-4af0-9115-ff1dd13f4e08', 'name': 'overcloud-controller-2-ctlplane'}\nAttached port overcloud-controller-2-ctlplane (UUID b4ee18c0-18c8-4918-8d4d-c1d499c8852c) to node baremetal-88561-1 (UUID 54a177cb-b9e5-4cf1-b7f8-6a5f369280bb)\nAttached port overcloud-controller-0-ctlplane (UUID c927e1c6-018b-4f2c-b3b3-2710fb9ba64d) to node baremetal-88561-0 (UUID a8a00499-55ca-45e0-9f1f-aaf0601c5f2e)\nAttached port overcloud-controller-1-ctlplane (UUID 32b81a82-8312-45d9-816c-89289e24d908) to node baremetal-88561-3 (UUID b1c56874-a7c8-4279-b859-4d75cce56ff3)\nAttached port overcloud-novacompute-0-ctlplane (UUID dacde964-3f8f-4c8f-82ae-a6586b945924) to node baremetal-88561-2 (UUID fd99a70f-a936-4f86-a5cc-4f17ff047988)\nProvisioning started on node baremetal-88561-1 (UUID 54a177cb-b9e5-4cf1-b7f8-6a5f369280bb)\nProvisioning started on node baremetal-88561-2 (UUID fd99a70f-a936-4f86-a5cc-4f17ff047988)\nProvisioning started on node baremetal-88561-0 (UUID a8a00499-55ca-45e0-9f1f-aaf0601c5f2e)\nProvisioning started on node baremetal-88561-3 (UUID b1c56874-a7c8-4279-b859-4d75cce56ff3)\n", "msg": "Node b1c56874-a7c8-4279-b859-4d75cce56ff3 reached failure state \"deploy failed\"; the last error is Timeout reached while waiting for callback for node b1c56874-a7c8-4279-b859-4d75cce56ff3"}

This is a promotion blocker for wallaby.

[1] https://logserver.rdoproject.org/openstack-periodic-integration-stable1/opendev.org/openstack/tripleo-ci/master/periodic-tripleo-ci-centos-8-ovb-3ctlr_1comp-featureset001-wallaby/4a80f7b/logs/undercloud/home/zuul/overcloud_deploy.log.txt.gz
[2] https://logserver.rdoproject.org/openstack-periodic-integration-stable1/opendev.org/openstack/tripleo-ci/master/periodic-tripleo-ci-centos-8-ovb-3ctlr_1comp-featureset001-wallaby/7d5b530/logs/undercloud/home/zuul/overcloud_deploy.log.txt.gz
[3] https://logserver.rdoproject.org/openstack-periodic-integration-stable1/opendev.org/openstack/tripleo-ci/master/periodic-tripleo-ci-centos-8-ovb-3ctlr_1comp-featureset001-wallaby/0fd607e/logs/undercloud/home/zuul/overcloud_deploy.log.txt.gz
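
For context, the "Timeout reached while waiting for callback" error is what Ironic records when the deploy ramdisk never calls back within the deploy callback timeout. A minimal sketch of inspecting the failed node from the undercloud (node UUID taken from the trace above; the config option pointer is only a hint, not a confirmed root cause):

        # nodes that ended in "deploy failed", and the error Ironic recorded for the node above
        openstack baremetal node list --provision-state "deploy failed"
        openstack baremetal node show b1c56874-a7c8-4279-b859-4d75cce56ff3 -f value -c provision_state -c last_error
        # the timeout itself is governed by [conductor]/deploy_callback_timeout in ironic.conf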

Revision history for this message
Marios Andreou (marios-b) wrote :

A similar issue is seen on featureset 2. Adding 3 traces below, but they are slightly different... not sure yet if this is related.

1 * https://logserver.rdoproject.org/openstack-periodic-integration-stable1/opendev.org/openstack/tripleo-ci/master/periodic-tripleo-ci-centos-8-ovb-1ctlr_1comp-featureset002-wallaby/27ab048/logs/undercloud/home/zuul/overcloud_deploy.log.txt.gz

        * 2021-06-23 18:17:45 | 2021-06-23 18:17:45.569287 | fa163e90-ec04-f20d-bc10-00000000001c | FATAL | Wait for provisioned nodes to boot | overcloud-controller-0 | error={"ansible_facts": {"discovered_interpreter_python": "/usr/libexec/platform-python"}, "changed": false, "elapsed": 601, "msg": "Timeout waiting for provisioned nodes to become available"}
 2021-06-23 18:17:45 | 2021-06-23 18:17:45.573824 | fa163e90-ec04-f20d-bc10-00000000001c | TIMING | Wait for provisioned nodes to boot | overcloud-controller-0 | 0:10:02.157780 | 602.12s
 2021-06-23 18:17:45 |
 2021-06-23 18:17:45 | NO MORE HOSTS LEFT *************************************************************
 2021-06-23 18:17:45 |
 2021-06-23 18:17:45 | PLAY RECAP *********************************************************************
 2021-06-23 18:17:45 | overcloud-controller-0 : ok=0 changed=0 unreachable=0 failed=1 skipped=0 rescued=0 ignored=0

2 * https://logserver.rdoproject.org/openstack-periodic-integration-stable1/opendev.org/openstack/tripleo-ci/master/periodic-tripleo-ci-centos-8-ovb-1ctlr_1comp-featureset002-wallaby/26f06b2/logs/undercloud/home/zuul/overcloud_deploy.log.txt.gz

        * 2021-06-22 18:06:22 | PLAY [Overcloud Node Network Config] *******************************************
 2021-06-22 18:06:22 | 2021-06-22 18:06:22.108592 | fa163e8b-7a5d-3530-4068-000000000020 | TASK | Gathering Facts
 2021-06-22 18:06:26 | [WARNING]: Unhandled error in Python interpreter discovery for host overcloud-
 2021-06-22 18:06:26 | novacompute-0: Failed to connect to the host via ssh: ssh: connect to host
 2021-06-22 18:06:26 | 192.168.24.7 port 22: Connection refused
 2021-06-22 18:06:26 | [WARNING]: Unhandled error in Python interpreter discovery for host overcloud-
 2021-06-22 18:06:26 | controller-0: Failed to connect to the host via ssh: ssh: connect to host

3 * https://logserver.rdoproject.org/openstack-periodic-integration-stable1/opendev.org/openstack/tripleo-ci/master/periodic-tripleo-ci-centos-8-ovb-1ctlr_1comp-featureset002-wallaby/bc6be7e/logs/undercloud/home/zuul/overcloud_deploy.log.txt.gz
        * 2021-06-21 18:24:12 | PLAY [Overcloud Node Network Config] *******************************************
 2021-06-21 18:24:12 | 2021-06-21 18:24:12.809158 | fa163e3a-e35e-3d61-31cf-00000000001e | TASK | Gathering Facts
 2021-06-21 18:24:14 | 2021-06-21 18:24:14.423877 | fa163e3a-e35e-3d61-31cf-00000000001e | OK | Gathering Facts | overcloud-novacompute-0
 2021-06-21 18:24:14 | 2021-06-21 18:24:14.427493 | fa163e3a-e35e-3d61-31cf-00000000001e | TIMING | Gathering Facts | overcloud-novacompute-0 | 0:00:01.640009 | 1.62s
 2021-06-21 18:24:17 | [WARNING]: Unhandled error in Python interpreter discovery for host overcloud-
 2021-06-21 1...
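
In traces 2 and 3 the provisioned nodes never answer on ssh ("Connection refused" on the ctlplane address), so a quick reachability sketch from the undercloud could confirm whether the nodes ever came up at all (192.168.24.7 is taken from the log above; the other IP is a placeholder):

        # probe port 22 on the provisioned nodes' ctlplane addresses
        for ip in 192.168.24.7 <other-ctlplane-ip>; do
            nc -z -w 5 "$ip" 22 && echo "$ip: ssh reachable" || echo "$ip: ssh refused/unreachable"
        done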


Revision history for this message
Marios Andreou (marios-b) wrote :

Revisiting for some more digging - I think this may be infra related rather than a bug on our side, mainly because 1) it is not consistent and 2) only some nodes are failing, e.g. compute-1, controller-2 and controller-3 are OK but controller-0 times out.

WRT 'not consistent': the 3 failures in the description are from the 21st and 23rd of June. Since then we have had a few green runs at [1]. However, I found another example of the bug at [2] from the 28th, yesterday.

WRT 'only some nodes': see logs at [3]

        * [ 7529.849040] openstackbmc[9170]: Reporting power state "False" for instance 43e11297-6737-45f4-8eef-bdf1ad86c71e

In this case the only missing node is controller-0 (no logs at all in [4], but there are logs from the others, e.g. at [5])
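
If someone has access to the hosting cloud, a sketch like the following (instance UUID taken from the openstackbmc line above; assumes credentials for the infra tenant) would show whether the instance behind the missing node ever powered on and whether it produced any console output:

        openstack server show 43e11297-6737-45f4-8eef-bdf1ad86c71e -f value -c status -c OS-EXT-STS:power_state
        openstack console log show 43e11297-6737-45f4-8eef-bdf1ad86c71e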

[1] https://review.rdoproject.org/zuul/builds?job_name=periodic-tripleo-ci-centos-8-ovb-3ctlr_1comp-featureset001-wallaby

[2] https://logserver.rdoproject.org/openstack-periodic-integration-stable1/opendev.org/openstack/tripleo-ci/master/periodic-tripleo-ci-centos-8-ovb-3ctlr_1comp-featureset001-wallaby/076b975/logs/undercloud/home/zuul/overcloud_deploy.log.txt.gz

[3] https://logserver.rdoproject.org/openstack-periodic-integration-stable1/opendev.org/openstack/tripleo-ci/master/periodic-tripleo-ci-centos-8-ovb-3ctlr_1comp-featureset001-wallaby/076b975/logs/bmc-console.log

[4] https://logserver.rdoproject.org/openstack-periodic-integration-stable1/opendev.org/openstack/tripleo-ci/master/periodic-tripleo-ci-centos-8-ovb-3ctlr_1comp-featureset001-wallaby/076b975/logs/baremetal_3-console.log

[5] https://logserver.rdoproject.org/openstack-periodic-integration-stable1/opendev.org/openstack/tripleo-ci/master/periodic-tripleo-ci-centos-8-ovb-3ctlr_1comp-featureset001-wallaby/076b975/logs/baremetal_1-console.log

Revision history for this message
Harald Jensås (harald-jensas) wrote :

I agree with Marios; this looks like an infra issue. I see this happening every now and then in upstream RDO CI.

I use OVB in my homelab and I never have these kinds of issues, but I do see it happening from time to time in upstream CI.

- The console of one node (sometimes two) is completely blank, like [1] in this case.
- The deployment times out after some retries.
- And for one node we consistently see: Reporting power state "False" for instance <INSTANCE_UUID>

I think we may want to involve the infra provider and ask if they can see something in their logs, and possibly look at the server event [2] command output.
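
For reference, a sketch of pulling those events against the hosting cloud (<instance-uuid> and <request-id> are placeholders for the affected OVB instance):

        openstack server event list <instance-uuid>
        openstack server event show <instance-uuid> <request-id>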

Ideally we would figure out a way to detect this specific failure, and hold an environment for troubleshooting.
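
One low-effort detection sketch (assuming the deploy log path these jobs already collect) would be to match on the failure signature and flag the run for a hold:

        # run on the undercloud after a failed deploy
        grep -q "Timeout reached while waiting for callback for node" /home/zuul/overcloud_deploy.log \
            && echo "matches LP#1933504 signature - candidate for a node hold"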

[1] https://logserver.rdoproject.org/openstack-periodic-integration-stable1/opendev.org/openstack/tripleo-ci/master/periodic-tripleo-ci-centos-8-ovb-3ctlr_1comp-featureset001-wallaby/076b975/logs/baremetal_3-console.log
[2] https://docs.openstack.org/python-openstackclient/latest/cli/command-objects/server-event.html

wes hayutin (weshayutin)
Changed in tripleo:
status: Triaged → Fix Released