FS 01 master periodic job failed during overcloud deploy with Exceeded maximum number of retries and 503 status code

Bug #1854721 reported by chandan kumar
Affects    Status      Importance   Assigned to   Milestone
tripleo    Won't Fix   Critical     Unassigned

Bug Description

periodic-tripleo-ci-centos-7-ovb-3ctlr_1comp-featureset001-master job failed during overcloud deploy
http://logs.rdoproject.org/openstack-periodic-master/opendev.org/openstack/tripleo-ci/master/periodic-tripleo-ci-centos-7-ovb-3ctlr_1comp-featureset001-master/b216f86/logs/undercloud/home/zuul/overcloud_deploy.log.txt.gz

2019-12-02 04:00:37 | 2019-12-02 03:59:26Z [overcloud.Controller.2.Controller]: CREATE_FAILED ResourceInError: resources.Controller: Went to status ERROR due to "Message: Exceeded maximum number of retries. Exhausted all hosts available for retrying build failures for instance 3fbbece2-521f-4a48-b96b-7e384c5d92f2., Code: 500"
2019-12-02 04:00:37 | 2019-12-02 03:59:26Z [overcloud.Controller.2.Controller]: DELETE_IN_PROGRESS state changed
2019-12-02 04:00:37 | 2019-12-02 03:59:28Z [overcloud.Controller.0.Controller]: CREATE_FAILED ResourceInError: resources.Controller: Went to status ERROR due to "Message: Exceeded maximum number of retries. Exhausted all hosts available for retrying build failures for instance 4143c4cd-8c1a-466f-8073-73eb6b16a339., Code: 500"
2019-12-02 04:00:37 | 2019-12-02 03:59:28Z [overcloud.Controller.0.Controller]: DELETE_IN_PROGRESS state changed
2019-12-02 04:00:37 | 2019-12-02 03:59:32Z [overcloud.Controller.0.Controller]: DELETE_COMPLETE state changed
2019-12-02 04:00:37 | 2019-12-02 03:59:32Z [overcloud.Controller.2.Controller]: DELETE_COMPLETE state changed
2019-12-02 04:00:37 | 2019-12-02 03:59:36Z [overcloud.Controller.0.Controller]: CREATE_IN_PROGRESS state changed
2019-12-02 04:00:37 | 2019-12-02 03:59:36Z [overcloud.Controller.2.Controller]: CREATE_IN_PROGRESS state changed
2019-12-02 04:00:37 | 2019-12-02 03:59:41Z [overcloud.Controller.0.Controller]: CREATE_FAILED ResourceInError: resources.Controller: Went to status ERROR due to "Message: No valid host was found. , Code: 500"
2019-12-02 04:00:37 | 2019-12-02 03:59:41Z [overcloud.Controller.0.Controller]: DELETE_IN_PROGRESS state changed

While looking at the nova-compute logs:

http://logs.rdoproject.org/openstack-periodic-master/opendev.org/openstack/tripleo-ci/master/periodic-tripleo-ci-centos-7-ovb-3ctlr_1comp-featureset001-master/b216f86/logs/undercloud/var/log/containers/nova/nova-compute.log.txt.gz

2019-12-02 03:58:18.732 8 ERROR oslo.service.loopingcall Traceback (most recent call last):
2019-12-02 03:58:18.732 8 ERROR oslo.service.loopingcall File "/usr/lib/python2.7/site-packages/oslo_service/loopingcall.py", line 150, in _run_loop
2019-12-02 03:58:18.732 8 ERROR oslo.service.loopingcall result = func(*self.args, **self.kw)
2019-12-02 03:58:18.732 8 ERROR oslo.service.loopingcall File "/usr/lib/python2.7/site-packages/nova/virt/ironic/driver.py", line 558, in _wait_for_active
2019-12-02 03:58:18.732 8 ERROR oslo.service.loopingcall raise exception.InstanceDeployFailure(msg)
2019-12-02 03:58:18.732 8 ERROR oslo.service.loopingcall InstanceDeployFailure: Failed to provision instance 4143c4cd-8c1a-466f-8073-73eb6b16a339: Timeout reached while waiting for callback for node 9638403c-1676-44ce-8d64-618f86b413aa
2019-12-02 03:58:18.732 8 ERROR oslo.service.loopingcall
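For context, the traceback comes out of the fixed-interval polling loop that nova's ironic driver runs while waiting for the node to go active. A minimal sketch of that pattern (my own illustration, not nova's actual code; get_node, seconds_left and the 2-second interval are placeholders):

# Illustration of the oslo.service polling pattern behind the traceback above;
# not nova's real implementation.
from oslo_service import loopingcall


class InstanceDeployFailure(Exception):
    """Stand-in for nova.exception.InstanceDeployFailure."""


def wait_for_active(get_node, instance_id, seconds_left):
    """Poll an Ironic node until it goes active, or give up on timeout."""

    def _check():
        node = get_node()
        if node['provision_state'] == 'active':
            # Node finished deploying: stop the looping call cleanly.
            raise loopingcall.LoopingCallDone()
        if seconds_left() <= 0:
            # This is the path the CI job hit: the deploy callback never came.
            msg = ('Failed to provision instance %s: timeout reached while '
                   'waiting for callback for node %s'
                   % (instance_id, node['uuid']))
            raise InstanceDeployFailure(msg)

    timer = loopingcall.FixedIntervalLoopingCall(_check)
    # nova drives the interval from config; 2 seconds here is arbitrary.
    timer.start(interval=2).wait()

When _check raises anything other than LoopingCallDone, oslo.service logs it as "ERROR oslo.service.loopingcall" with the traceback, which is exactly what shows up in nova-compute.log above.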

We are also seeing the same failure in the vexxhost train fs01 jobs:
http://logs.rdoproject.org/80/23880/3/check/tripleo-ci-centos-7-ovb-3ctlr_1comp-featureset001-train-vexxhost-nhw/5ddd3bc/logs/undercloud/home/zuul/overcloud_deploy.log.txt.gz

and http://logs.rdoproject.org/80/23880/3/check/tripleo-ci-centos-7-ovb-3ctlr_1comp-featureset001-train-vexxhost-nhw/5ddd3bc/logs/undercloud/var/log/extra/errors.txt.txt.gz

2019-12-02 07:53:28.343 ERROR /var/log/containers/nova/nova-compute.log: 9 ERROR oslo.service.loopingcall Traceback (most recent call last):
2019-12-02 07:53:28.343 ERROR /var/log/containers/nova/nova-compute.log: 9 ERROR oslo.service.loopingcall File "/usr/lib/python2.7/site-packages/oslo_service/loopingcall.py", line 150, in _run_loop
2019-12-02 07:53:28.343 ERROR /var/log/containers/nova/nova-compute.log: 9 ERROR oslo.service.loopingcall result = func(*self.args, **self.kw)
2019-12-02 07:53:28.343 ERROR /var/log/containers/nova/nova-compute.log: 9 ERROR oslo.service.loopingcall File "/usr/lib/python2.7/site-packages/nova/virt/ironic/driver.py", line 558, in _wait_for_active
2019-12-02 07:53:28.343 ERROR /var/log/containers/nova/nova-compute.log: 9 ERROR oslo.service.loopingcall raise exception.InstanceDeployFailure(msg)
2019-12-02 07:53:28.343 ERROR /var/log/containers/nova/nova-compute.log: 9 ERROR oslo.service.loopingcall InstanceDeployFailure: Failed to provision instance 4b78089d-f164-496a-974a-88dc5b9136eb: Failed to deploy: {u'priority': 100, u'interface': u'deploy', u'step': u'deploy', u'argsinfo': None}
2019-12-02 07:53:28.343 ERROR /var/log/containers/nova/nova-compute.log: 9 ERROR oslo.service.loopingcall
2019-12-02 07:53:28.345 ERROR /var/log/containers/nova/nova-compute.log: 9 ERROR nova.virt.ironic.driver [req-d19f41e9-ada1-4c3d-942a-c38ab37af7c9 01abdaaeddeb4e07b66863fc775d227c 52580ccbf60d4c639421341fba61d897 - default default] Error deploying instance 4b78089d-f164-496a-974a-88dc5b9136eb on baremetal node 281e50fe-d8ab-465d-8196-aa24a53ba99a.: InstanceDeployFailure: Failed to provision instance 4b78089d-f164-496a-974a-88dc5b9136eb: Failed to deploy: {u'priority':

It needs to be investigated why this is happening on two different infrastructures.

summary: - FS 01 master periodic job failed during overcloud deploy with No valid
- host was found
+ FS 01 master periodic job failed during overcloud deploy with Exceeded
+ maximum number of retries and 503 status code
Revision history for this message
Marios Andreou (marios-b) wrote :

This is seen intermittently ... as ruck|rover last week I personally saw it at least once per day.

examples
        * http://logs.rdoproject.org/05/695805/6/openstack-check/tripleo-ci-centos-7-ovb-3ctlr_1comp-featureset001/eb60e5d/logs/undercloud/home/zuul/overcloud_deploy.log.txt.gz

        * 2019-11-28 20:06:32 | 2019-11-28 20:04:43Z [overcloud.Controller.1.Controller]: CREATE_FAILED ResourceInError: resources.Controller: Went to status ERROR due to "Message: Exceeded maximum number of retries. Exhausted all hosts available for retrying build failures for instance 7067dc92-0d0e-4d4c-b998-343fa55be66d., Code: 500"

        * http://logs.rdoproject.org/openstack-periodic-master/opendev.org/openstack/tripleo-ci/master/periodic-tripleo-ci-centos-7-ovb-3ctlr_1comp_1supp-featureset039-master/94d0d35/logs/undercloud/home/zuul/overcloud_deploy.log.txt.gz

        * 2019-11-27 22:12:55 | ResourceInError: resources.Controller: Went to status ERROR due to "Message: No valid host was found. , Code: 500"

        * periodic-tripleo-ci-centos-7-ovb-3ctlr_1comp-featureset035-master openstack/tripleo-ci master openstack-periodic-master master 8307 2019-11-27T19:59:11 FAILURE

        * http://logs.rdoproject.org/openstack-periodic-master/opendev.org/openstack/tripleo-ci/master/periodic-tripleo-ci-centos-7-ovb-3ctlr_1comp-featureset035-master/ad4ea1e/logs/undercloud/home/zuul/overcloud_deploy.log.txt.gz

        * 2019-11-27 22:04:52 | ResourceInError: resources.Controller: Went to status ERROR due to "Message: No valid host was found. , Code: 500"

Changed in tripleo:
status: Confirmed → Triaged
Revision history for this message
wes hayutin (weshayutin) wrote :

/me notes... "Message: No valid host was found. , Code: 500" can happen for a number of different issues. Not enough data here to link vex / rdo issues atm.

Revision history for this message
Ronelle Landy (rlandy) wrote :

Agreed - this is pretty inconclusive.
"Message: No valid host was found. , Code: 500" is a standard "not enough resources" issue.

vexxhost jobs did show this error at times when ovb stacks were not being cleaned up.

rdocloud shows this problem from time to time. If resources are in overdemand, the error is legit.
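When triaging these, a quick way to test the "not enough resources" theory is to look at the host cloud's hypervisor usage and at OVB stacks that were never cleaned up. A rough sketch with openstacksdk (listing hypervisors needs admin credentials on the host cloud; the cloud name 'rdo-cloud' is a placeholder for a clouds.yaml entry, not something taken from this bug):

import openstack

# Rough capacity check for the host cloud; 'rdo-cloud' is a placeholder.
conn = openstack.connect(cloud='rdo-cloud')

total_vcpus = used_vcpus = free_ram_mb = 0
for hv in conn.compute.hypervisors(details=True):
    total_vcpus += hv.vcpus or 0
    used_vcpus += hv.vcpus_used or 0
    free_ram_mb += hv.memory_free or 0

print('vCPUs used/total: %d/%d, free RAM: %d MB'
      % (used_vcpus, total_vcpus, free_ram_mb))

# OVB stacks that failed and were never cleaned up keep resources allocated
# and produce the same "No valid host" symptom even without real overdemand.
for stack in conn.orchestration.stacks():
    if stack.status and stack.status.endswith('FAILED'):
        print('possibly leaked stack:', stack.name, stack.status)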

Revision history for this message
Marios Andreou (marios-b) wrote :

So wrt rdocloud and comments #2/#3, this is actually a legit resource issue ... so this bug is trending closed? Let's leave it until we get a fresh round of promotions, and if it increases in frequency we can escalate to sf-ops.

For vexxhost I think we should file a new bug if it continues to be seen... I have personally seen it only on rdo cloud so far, but I haven't been looking at the vexx jobs as much (!)

Revision history for this message
chandan kumar (chkumar246) wrote :

Based on discussion with kforde:
05:38 <@kforde> chandankumar: it looks like a genuine capacity issue
05:39 <@kforde> chandankumar I see much of the CI agg is in use

It might be a capacity issue. @marios, do we want to close this bug?

Revision history for this message
Marios Andreou (marios-b) wrote :

ack marked won't fix then

Changed in tripleo:
status: Triaged → Confirmed
status: Confirmed → Won't Fix
Revision history for this message
chandan kumar (chkumar246) wrote :

Currently, between 10:30 AM and 11:30 AM UTC, the train pipeline (openstack-periodic-latest-released) starts while the master pipeline (openstack-periodic-master) is mid-run. At that time RDO Cloud resource sharing is at its highest, and we see the same issue.
For example: http://logs.rdoproject.org/openstack-periodic-master/opendev.org/openstack/tripleo-ci/master/periodic-tripleo-ci-centos-7-ovb-1ctlr_2comp-featureset020-master/9cf75cd/logs/undercloud/home/zuul/overcloud_deploy.log.txt.gz
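If we want data on that overlap rather than anecdotes, the Zuul builds API can show the start/end windows of recent builds in both pipelines. A quick sketch (the API URL below is assumed from the standard Zuul v3 layout and is not verified for review.rdoproject.org):

import requests

ZUUL_API = 'https://review.rdoproject.org/zuul/api/builds'  # placeholder URL

for pipeline in ('openstack-periodic-master',
                 'openstack-periodic-latest-released'):
    builds = requests.get(ZUUL_API,
                          params={'pipeline': pipeline, 'limit': 20},
                          timeout=30).json()
    for build in builds:
        # Compare start/end windows across the two pipelines by eye.
        print(pipeline, build.get('job_name'),
              build.get('start_time'), '->', build.get('end_time'),
              build.get('result'))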
