BMC node in periodic OVB jobs fails to retrieve metadata information

Bug #1831053 reported by Gabriele Cerami
This bug affects 1 person
Affects   Status         Importance   Assigned to       Milestone
tripleo   Fix Released   Critical     Gabriele Cerami

Bug Description

Many periodic jobs are failing during node registration:

e.g.

https://logs.rdoproject.org/openstack-periodic-master/opendev.org/openstack/tripleo-ci/master/periodic-tripleo-ci-centos-7-ovb-3ctlr_1comp-featureset001-master/07633b8/logs/undercloud/home/zuul/overcloud_prep_images.log.txt.gz#_2019-05-30_02_39_52

shows:

FAILED', u'message': [{u'result': u'Node eb42fa7e-ed9c-4eae-a5ad-cba77db34188 did not reach state "manageable", the state is "enroll", error: Failed to get power state for node eb42fa7e-ed9c-4eae-a5ad-cba77db34188. Error: IPMI call failed: power status.'},

The same error is reported for all the nodes.

Looking at the bottom of the BMC console log at

https://logs.rdoproject.org/openstack-periodic-master/opendev.org/openstack/tripleo-ci/master/periodic-tripleo-ci-centos-7-ovb-1ctlr_1comp-featureset002-master-upload/3a9bd80/logs/bmc-console.log

we can see that the node fails to retrieve metadata information, with these errors:

 cloud-init[1335]: 2019-05-30 01:51:09,851 - url_helper.py[WARNING]: Calling 'http://169.254.169.254/2009-04-04/meta-data/instance-id' failed [0/120s]: unexpected error ['NoneType' object has no attribute 'status_code']

So either the network or the metadata service is not working properly.
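
To tell those two cases apart from inside a BMC instance, a probe along these lines could help. This is only a minimal sketch: the function name `probe_metadata` and the three result labels are hypothetical, and the only thing taken from the log above is the metadata URL. It treats an HTTP error status as a metadata service problem and any failure to get an answer at all as a network problem.

```python
import socket
import urllib.error
import urllib.request


def probe_metadata(url, timeout=5.0):
    """Return 'ok', 'service' (HTTP error status) or 'network' (no answer).

    Hypothetical helper for illustration; the labels are not from the bug.
    """
    # Bypass any configured HTTP proxy: the metadata address is link-local.
    opener = urllib.request.build_opener(urllib.request.ProxyHandler({}))
    try:
        with opener.open(url, timeout=timeout) as resp:
            resp.read()
            return "ok"
    except urllib.error.HTTPError:
        # Something answered, but with an error status.
        return "service"
    except (urllib.error.URLError, socket.timeout, OSError):
        # Nothing answered: refused connection, timeout, no route, etc.
        return "network"


if __name__ == "__main__":
    # URL as it appears in the cloud-init warning above.
    print(probe_metadata("http://169.254.169.254/2009-04-04/meta-data/instance-id"))
```

A "network" result would point at routing, DHCP or security groups on the BMC network, while "service" would point at the metadata service itself.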

Marios Andreou (marios-b) wrote :

Today there is no failure, but the job times out. I think it might still be introspection related; at least the introspection output doesn't look quite right:

http://logs.rdoproject.org/openstack-periodic-master/opendev.org/openstack/tripleo-ci/master/periodic-tripleo-ci-centos-7-ovb-3ctlr_1comp-featureset001-master/d6746ac/logs/undercloud/home/zuul/overcloud_prep_images.log.txt.gz

        2019-06-03 02:08:21 | Introspection of node 2c6e169f-5d90-4d9d-8b1c-b1ecd7aa71b2 completed. Status:SUCCESS. Errors:None
        2019-06-03 02:08:21 | Introspection of node d17a4ffc-bb52-4291-b4e3-b5065e116507 timed out.
        2019-06-03 02:08:21 | Introspection of node 78c030a9-f28f-43c4-a02f-55c9b7ba9ffa completed. Status:SUCCESS. Errors:None
        2019-06-03 02:08:21 | Introspection of node 2233542e-aa2a-4257-a31b-97aa8c1aea6d completed. Status:SUCCESS. Errors:None
        2019-06-03 02:08:21 | Successfully introspected 4 node(s).
        2019-06-03 02:08:21 |
        2019-06-03 02:08:21 | Introspection completed.
        2019-06-03 02:08:21 | + openstack overcloud node provide --all-manageable
        2019-06-03 02:08:24 | Waiting for messages on queue 'tripleo' with no timeout.

One of those nodes times out, and then the job waits indefinitely; there is nothing further in the log.
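
For anyone triaging many of these jobs, the timed-out nodes can be pulled out of the prep-images log with a small script. This is a rough sketch, assuming the line format shown in the excerpt above; the names `NODE_RE`, `SAMPLE` and `timed_out_nodes` are made up for illustration.

```python
import re

# Matches lines such as:
#   "... | Introspection of node <uuid> timed out."
#   "... | Introspection of node <uuid> completed. Status:SUCCESS. Errors:None"
NODE_RE = re.compile(
    r"Introspection of node (?P<uuid>[0-9a-f-]{36}) "
    r"(?P<outcome>timed out|completed)"
)

# Two lines copied from the log excerpt above.
SAMPLE = """\
2019-06-03 02:08:21 | Introspection of node 2c6e169f-5d90-4d9d-8b1c-b1ecd7aa71b2 completed. Status:SUCCESS. Errors:None
2019-06-03 02:08:21 | Introspection of node d17a4ffc-bb52-4291-b4e3-b5065e116507 timed out.
"""


def timed_out_nodes(log_text):
    """Return the UUIDs of nodes whose introspection timed out."""
    return [
        m.group("uuid")
        for m in NODE_RE.finditer(log_text)
        if m.group("outcome") == "timed out"
    ]


if __name__ == "__main__":
    for uuid in timed_out_nodes(SAMPLE):
        print(uuid)
```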

From the BMC logs I see:

        http://logs.rdoproject.org/openstack-periodic-master/opendev.org/openstack/tripleo-ci/master/periodic-tripleo-ci-centos-7-ovb-3ctlr_1comp-featureset001-master/d6746ac/logs/bmc-console.log
        [ 93.777302] cloud-init[2063]: parse error: Invalid numeric literal at line 1, column 15

Not sure yet whether this is the same bug or a different one.

Marios Andreou (marios-b) wrote :

Re comment #1: it's definitely an introspection timeout, as seen in the job output:

http://logs.rdoproject.org/openstack-periodic-master/opendev.org/openstack/tripleo-ci/master/periodic-tripleo-ci-centos-7-ovb-3ctlr_1comp-featureset001-master/d6746ac/job-output.txt.gz

2019-06-03 02:04:21.733899 | primary | TASK [overcloud-prep-images : Prepare the overcloud images for deploy] *********
2019-06-03 02:04:21.760474 | primary | Monday 03 June 2019 02:04:21 +0000 (0:00:02.807) 0:00:22.679 ***********
2019-06-03 04:38:27.723356 | RUN END RESULT_TIMED_OUT: [untrusted : opendev.org/openstack/tripleo-ci/playbooks/tripleo-ci/run-v3.yaml@master]
2019-06-03 04:38:27.723555 | POST-RUN START: [trusted : review.rdoproject.org/config/playbooks/tripleo-ci-periodic-base/post.yaml@master]

Bob Fournier (bfournie) wrote :

From the logs in comment 1, it looks like the bmc_ip isn't being set, which would result in an introspection timeout, since the node can't be powered on or off.
[ 86.359447] cloud-init[2063]: parse error: Invalid numeric literal at line 1, column 15
[ 86.386921] cloud-init[2063]: + bmc_ip=
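
The same symptom can be checked for in other BMC console logs. The sketch below assumes that the `+ bmc_ip=` line is a shell trace of an assignment, so a trace line ending right after the `=` means the variable was set to an empty value. The names `EMPTY_ASSIGN_RE`, `SAMPLE` and `empty_assignments` are hypothetical.

```python
import re

# A traced assignment with nothing after the "=", e.g. "+ bmc_ip=".
# Assumption: lines starting with "+ " come from shell tracing.
EMPTY_ASSIGN_RE = re.compile(r"\+ (?P<name>[A-Za-z_][A-Za-z0-9_]*)=\s*$")

# The two console lines quoted above.
SAMPLE = """\
[ 86.359447] cloud-init[2063]: parse error: Invalid numeric literal at line 1, column 15
[ 86.386921] cloud-init[2063]: + bmc_ip=
"""


def empty_assignments(console_text):
    """Return names of variables that were assigned an empty value."""
    names = []
    for line in console_text.splitlines():
        match = EMPTY_ASSIGN_RE.search(line)
        if match:
            names.append(match.group("name"))
    return names


if __name__ == "__main__":
    print(empty_assignments(SAMPLE))
```

An empty `bmc_ip` right after a parse error fits the explanation above: the BMC never learns its address, so IPMI power commands against it fail.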

Changed in tripleo:
milestone: train-1 → train-2
Ronelle Landy (rlandy) wrote :

I am fairly certain that cleaning up the stuck stacks and failed instance will help.
I have a ticket out to the admins to work on that.

Sagi (Sergey) Shnaidman (sshnaidm) wrote :

There are about 14 stacks to delete; I opened a ticket with RHOS OPs.

Changed in tripleo:
status: Triaged → Fix Released