FS 01 master periodic job failed during overcloud deploy with Exceeded maximum number of retries and 503 status code

Bug #1854721 reported by chandan kumar
Affects    Status      Importance   Assigned to   Milestone
tripleo    Won't Fix   Critical     Unassigned

Bug Description

periodic-tripleo-ci-centos-7-ovb-3ctlr_1comp-featureset001-master job failed during overcloud deploy
http://logs.rdoproject.org/openstack-periodic-master/opendev.org/openstack/tripleo-ci/master/periodic-tripleo-ci-centos-7-ovb-3ctlr_1comp-featureset001-master/b216f86/logs/undercloud/home/zuul/overcloud_deploy.log.txt.gz

2019-12-02 04:00:37 | 2019-12-02 03:59:26Z [overcloud.Controller.2.Controller]: CREATE_FAILED ResourceInError: resources.Controller: Went to status ERROR due to "Message: Exceeded maximum number of retries. Exhausted all hosts available for retrying build failures for instance 3fbbece2-521f-4a48-b96b-7e384c5d92f2., Code: 500"
2019-12-02 04:00:37 | 2019-12-02 03:59:26Z [overcloud.Controller.2.Controller]: DELETE_IN_PROGRESS state changed
2019-12-02 04:00:37 | 2019-12-02 03:59:28Z [overcloud.Controller.0.Controller]: CREATE_FAILED ResourceInError: resources.Controller: Went to status ERROR due to "Message: Exceeded maximum number of retries. Exhausted all hosts available for retrying build failures for instance 4143c4cd-8c1a-466f-8073-73eb6b16a339., Code: 500"
2019-12-02 04:00:37 | 2019-12-02 03:59:28Z [overcloud.Controller.0.Controller]: DELETE_IN_PROGRESS state changed
2019-12-02 04:00:37 | 2019-12-02 03:59:32Z [overcloud.Controller.0.Controller]: DELETE_COMPLETE state changed
2019-12-02 04:00:37 | 2019-12-02 03:59:32Z [overcloud.Controller.2.Controller]: DELETE_COMPLETE state changed
2019-12-02 04:00:37 | 2019-12-02 03:59:36Z [overcloud.Controller.0.Controller]: CREATE_IN_PROGRESS state changed
2019-12-02 04:00:37 | 2019-12-02 03:59:36Z [overcloud.Controller.2.Controller]: CREATE_IN_PROGRESS state changed
2019-12-02 04:00:37 | 2019-12-02 03:59:41Z [overcloud.Controller.0.Controller]: CREATE_FAILED ResourceInError: resources.Controller: Went to status ERROR due to "Message: No valid host was found. , Code: 500"
2019-12-02 04:00:37 | 2019-12-02 03:59:41Z [overcloud.Controller.0.Controller]: DELETE_IN_PROGRESS state changed

While looking at the nova-compute logs:

http://logs.rdoproject.org/openstack-periodic-master/opendev.org/openstack/tripleo-ci/master/periodic-tripleo-ci-centos-7-ovb-3ctlr_1comp-featureset001-master/b216f86/logs/undercloud/var/log/containers/nova/nova-compute.log.txt.gz

2019-12-02 03:58:18.732 8 ERROR oslo.service.loopingcall Traceback (most recent call last):
2019-12-02 03:58:18.732 8 ERROR oslo.service.loopingcall File "/usr/lib/python2.7/site-packages/oslo_service/loopingcall.py", line 150, in _run_loop
2019-12-02 03:58:18.732 8 ERROR oslo.service.loopingcall result = func(*self.args, **self.kw)
2019-12-02 03:58:18.732 8 ERROR oslo.service.loopingcall File "/usr/lib/python2.7/site-packages/nova/virt/ironic/driver.py", line 558, in _wait_for_active
2019-12-02 03:58:18.732 8 ERROR oslo.service.loopingcall raise exception.InstanceDeployFailure(msg)
2019-12-02 03:58:18.732 8 ERROR oslo.service.loopingcall InstanceDeployFailure: Failed to provision instance 4143c4cd-8c1a-466f-8073-73eb6b16a339: Timeout reached while waiting for callback for node 9638403c-1676-44ce-8d64-618f86b413aa
2019-12-02 03:58:18.732 8 ERROR oslo.service.loopingcall
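For context, the traceback comes out of the fixed-interval polling loop that nova's ironic driver runs while waiting for the node to go active. A minimal sketch of that pattern (my own illustration, not nova's actual code; get_node, seconds_left and the 2-second interval are placeholders):

# Illustration of the oslo.service polling pattern behind the traceback above;
# not nova's real implementation.
from oslo_service import loopingcall


class InstanceDeployFailure(Exception):
    """Stand-in for nova.exception.InstanceDeployFailure."""


def wait_for_active(get_node, instance_id, seconds_left):
    """Poll an Ironic node until it goes active, or give up on timeout."""

    def _check():
        node = get_node()
        if node['provision_state'] == 'active':
            # Node finished deploying: stop the looping call cleanly.
            raise loopingcall.LoopingCallDone()
        if seconds_left() <= 0:
            # This is the path the CI job hit: the deploy callback never came.
            msg = ('Failed to provision instance %s: timeout reached while '
                   'waiting for callback for node %s'
                   % (instance_id, node['uuid']))
            raise InstanceDeployFailure(msg)

    timer = loopingcall.FixedIntervalLoopingCall(_check)
    # nova drives the interval from config; 2 seconds here is arbitrary.
    timer.start(interval=2).wait()

When _check raises anything other than LoopingCallDone, oslo.service logs it as "ERROR oslo.service.loopingcall" with the traceback, which is exactly what shows up in nova-compute.log above.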

We are also seeing the same failure in the vexxhost train fs01 jobs:
http://logs.rdoproject.org/80/23880/3/check/tripleo-ci-centos-7-ovb-3ctlr_1comp-featureset001-train-vexxhost-nhw/5ddd3bc/logs/undercloud/home/zuul/overcloud_deploy.log.txt.gz

and http://logs.rdoproject.org/80/23880/3/check/tripleo-ci-centos-7-ovb-3ctlr_1comp-featureset001-train-vexxhost-nhw/5ddd3bc/logs/undercloud/var/log/extra/errors.txt.txt.gz

2019-12-02 07:53:28.343 ERROR /var/log/containers/nova/nova-compute.log: 9 ERROR oslo.service.loopingcall Traceback (most recent call last):
2019-12-02 07:53:28.343 ERROR /var/log/containers/nova/nova-compute.log: 9 ERROR oslo.service.loopingcall File "/usr/lib/python2.7/site-packages/oslo_service/loopingcall.py", line 150, in _run_loop
2019-12-02 07:53:28.343 ERROR /var/log/containers/nova/nova-compute.log: 9 ERROR oslo.service.loopingcall result = func(*self.args, **self.kw)
2019-12-02 07:53:28.343 ERROR /var/log/containers/nova/nova-compute.log: 9 ERROR oslo.service.loopingcall File "/usr/lib/python2.7/site-packages/nova/virt/ironic/driver.py", line 558, in _wait_for_active
2019-12-02 07:53:28.343 ERROR /var/log/containers/nova/nova-compute.log: 9 ERROR oslo.service.loopingcall raise exception.InstanceDeployFailure(msg)
2019-12-02 07:53:28.343 ERROR /var/log/containers/nova/nova-compute.log: 9 ERROR oslo.service.loopingcall InstanceDeployFailure: Failed to provision instance 4b78089d-f164-496a-974a-88dc5b9136eb: Failed to deploy: {u'priority': 100, u'interface': u'deploy', u'step': u'deploy', u'argsinfo': None}
2019-12-02 07:53:28.343 ERROR /var/log/containers/nova/nova-compute.log: 9 ERROR oslo.service.loopingcall
2019-12-02 07:53:28.345 ERROR /var/log/containers/nova/nova-compute.log: 9 ERROR nova.virt.ironic.driver [req-d19f41e9-ada1-4c3d-942a-c38ab37af7c9 01abdaaeddeb4e07b66863fc775d227c 52580ccbf60d4c639421341fba61d897 - default default] Error deploying instance 4b78089d-f164-496a-974a-88dc5b9136eb on baremetal node 281e50fe-d8ab-465d-8196-aa24a53ba99a.: InstanceDeployFailure: Failed to provision instance 4b78089d-f164-496a-974a-88dc5b9136eb: Failed to deploy: {u'priority':

It needs to be investigated why this is happening on two different infrastructures.

summary: - FS 01 master periodic job failed during overcloud deploy with No valid
- host was found
+ FS 01 master periodic job failed during overcloud deploy with Exceeded
+ maximum number of retries and 503 status code
Revision history for this message
Marios Andreou (marios-b) wrote :

This is seen intermittently ... as ruck|rover last week I personally saw it at least once per day.

examples
        * http://logs.rdoproject.org/05/695805/6/openstack-check/tripleo-ci-centos-7-ovb-3ctlr_1comp-featureset001/eb60e5d/logs/undercloud/home/zuul/overcloud_deploy.log.txt.gz

        * 2019-11-28 20:06:32 | 2019-11-28 20:04:43Z [overcloud.Controller.1.Controller]: CREATE_FAILED ResourceInError: resources.Controller: Went to status ERROR due to "Message: Exceeded maximum number of retries. Exhausted all hosts available for retrying build failures for instance 7067dc92-0d0e-4d4c-b998-343fa55be66d., Code: 500"

        * http://logs.rdoproject.org/openstack-periodic-master/opendev.org/openstack/tripleo-ci/master/periodic-tripleo-ci-centos-7-ovb-3ctlr_1comp_1supp-featureset039-master/94d0d35/logs/undercloud/home/zuul/overcloud_deploy.log.txt.gz

        * 2019-11-27 22:12:55 | ResourceInError: resources.Controller: Went to status ERROR due to "Message: No valid host was found. , Code: 500"

        * periodic-tripleo-ci-centos-7-ovb-3ctlr_1comp-featureset035-master openstack/tripleo-ci master openstack-periodic-master master 8307 2019-11-27T19:59:11 FAILURE

        * http://logs.rdoproject.org/openstack-periodic-master/opendev.org/openstack/tripleo-ci/master/periodic-tripleo-ci-centos-7-ovb-3ctlr_1comp-featureset035-master/ad4ea1e/logs/undercloud/home/zuul/overcloud_deploy.log.txt.gz

        * 2019-11-27 22:04:52 | ResourceInError: resources.Controller: Went to status ERROR due to "Message: No valid host was found. , Code: 500"

Changed in tripleo:
status: Confirmed → Triaged
Revision history for this message
wes hayutin (weshayutin) wrote :

/me notes... "Message: No valid host was found. , Code: 500" can happen for a number of different issues. Not enough data here to link vex / rdo issues atm.

Revision history for this message
Ronelle Landy (rlandy) wrote :

Agreed - this is pretty inconclusive.
"Message: No valid host was found. , Code: 500" is a standard "not enough resources" issue.

vexxhost jobs did show this error at times when ovb stacks were not being cleaned up.

rdocloud shows this problem from time to time. If resources are in overdemand, the error is legit.
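When triaging these, a quick way to test the "not enough resources" theory is to look at the host cloud's hypervisor usage and at OVB stacks that were never cleaned up. A rough sketch with openstacksdk (listing hypervisors needs admin credentials on the host cloud; the cloud name 'rdo-cloud' is a placeholder for a clouds.yaml entry, not something taken from this bug):

import openstack

# Rough capacity check for the host cloud; 'rdo-cloud' is a placeholder.
conn = openstack.connect(cloud='rdo-cloud')

total_vcpus = used_vcpus = free_ram_mb = 0
for hv in conn.compute.hypervisors(details=True):
    total_vcpus += hv.vcpus or 0
    used_vcpus += hv.vcpus_used or 0
    free_ram_mb += hv.memory_free or 0

print('vCPUs used/total: %d/%d, free RAM: %d MB'
      % (used_vcpus, total_vcpus, free_ram_mb))

# OVB stacks that failed and were never cleaned up keep resources allocated
# and produce the same "No valid host" symptom even without real overdemand.
for stack in conn.orchestration.stacks():
    if stack.status and stack.status.endswith('FAILED'):
        print('possibly leaked stack:', stack.name, stack.status)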

Revision history for this message
Marios Andreou (marios-b) wrote :

So wrt rdocloud and comments #2/#3, this is actually a legit resource issue ... so this bug is trending closed? Let's leave it until we get a fresh round of promotions, and if it increases in frequency we can escalate to sf-ops.

For vexxhost I think we should file a new bug if it continues to be seen... I have personally seen it only on rdo cloud so far, but I haven't been looking at the vexx jobs as much (!)

Revision history for this message
chandan kumar (chkumar246) wrote :

Based on discussion with kforde:
05:38 <@kforde> chandankumar: it looks like a genuine capacity issue
05:39 <@kforde> chandankumar I see much of the CI agg is in use

It might be a capacity issue. @marios, do we want to close this bug?

Revision history for this message
Marios Andreou (marios-b) wrote :

ack marked won't fix then

Changed in tripleo:
status: Triaged → Confirmed
status: Confirmed → Won't Fix
Revision history for this message
chandan kumar (chkumar246) wrote :

Currently, between 10:30 AM and 11:30 AM UTC, the train pipeline (openstack-periodic-latest-released) starts while the master pipeline (openstack-periodic-master) is mid-run. At that time RDO Cloud resource sharing is at its highest, and we see the same issue.
For example: http://logs.rdoproject.org/openstack-periodic-master/opendev.org/openstack/tripleo-ci/master/periodic-tripleo-ci-centos-7-ovb-1ctlr_2comp-featureset020-master/9cf75cd/logs/undercloud/home/zuul/overcloud_deploy.log.txt.gz
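If we want data on that overlap rather than anecdotes, the Zuul builds API can show the start/end windows of recent builds in both pipelines. A quick sketch (the API URL below is assumed from the standard Zuul v3 layout and is not verified for review.rdoproject.org):

import requests

ZUUL_API = 'https://review.rdoproject.org/zuul/api/builds'  # placeholder URL

for pipeline in ('openstack-periodic-master',
                 'openstack-periodic-latest-released'):
    builds = requests.get(ZUUL_API,
                          params={'pipeline': pipeline, 'limit': 20},
                          timeout=30).json()
    for build in builds:
        # Compare start/end windows across the two pipelines by eye.
        print(pipeline, build.get('job_name'),
              build.get('start_time'), '->', build.get('end_time'),
              build.get('result'))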
