Creating pingtest_stack fails: Failed to schedule instances: NoValidHost_Remote: No valid host was found

Bug #1767076 reported by Quique Llorente
This bug affects 1 person
Affects: tripleo
Status: Triaged
Importance: High
Assigned to: Sagi (Sergey) Shnaidman

Bug Description

When creating the pingtest stack, Heat reports the following error:

https://logs.rdoproject.org/openstack-periodic-24hr/periodic-tripleo-ci-centos-7-ovb-1ctlr_1comp-featureset002-pike-upload/7dadafb/overcloud-controller-0/var/log/heat/heat-engine.log.txt.gz#_2018-04-25_05_53_30_057

2018-04-25 05:54:47.009 52043 ERROR heat.engine.resource 'code': fault.get('code', _('Unknown'))
2018-04-25 05:54:47.009 52043 ERROR heat.engine.resource ResourceInError: Went to status ERROR due to "Message: No valid host was found. , Code: 500"
2018-04-25 05:54:47.009 52043 ERROR heat.engine.resource
2018-04-25 05:54:47.023 52043 INFO heat.engine.stack [req-0bd1ef0a-846b-4e3a-b6d0-47e3d399f055 - admin - default default] Stack CREATE FAILED (pingtest_stack): Resource CREATE failed: ResourceInError: resources.server1: Went to status ERROR due to "Message: No valid host was found. , Code: 500"

Looking at nova conductor:
https://logs.rdoproject.org/openstack-periodic-24hr/periodic-tripleo-ci-centos-7-ovb-1ctlr_1comp-featureset002-pike-upload/7dadafb/overcloud-controller-0/var/log/nova/nova-conductor.log.txt.gz#_2018-04-25_05_53_37_211
obj_load_attr /usr/lib/python2.7/site-packages/nova/objects/instance.py:1049
2018-04-25 05:53:37.211 104834 ERROR nova.conductor.manager [req-f2e1341d-9db2-4a4d-8992-2c6e6810cc97 11ea08f7a6cd482893f589060245a898 b55881d9263a428392142e369bb63bee - default default] Failed to schedule instances: NoValidHost_Remote: No valid host was found.
Traceback (most recent call last):

  File "/usr/lib/python2.7/site-packages/oslo_messaging/rpc/server.py", line 232, in inner
    return func(*args, **kwargs)

  File "/usr/lib/python2.7/site-packages/nova/scheduler/manager.py", line 137, in select_destinations
    raise exception.NoValidHost(reason="")

Quique Llorente (quiquell) wrote :
Changed in tripleo:
importance: Critical → High
status: New → Triaged
yatin (yatinkarel) wrote :

The last run passed and pike was promoted: https://review.rdoproject.org/jenkins/job/periodic-tripleo-ci-centos-7-ovb-1ctlr_1comp-featureset002-pike-upload/164/. It is still worth investigating, since the issue may be hit again (jobs have failed with this error multiple times).

Quique Llorente (quiquell) wrote :

Does this only happen in RDO cloud jobs?

Oliver Walsh (owalsh) wrote :
Matt Young (halcyondude) wrote :

(tripleo-ci triage)

- We don't presently know the frequency of this error.
- We will experiment with a smaller flavor, as this might help reduce resource utilization.

Oliver Walsh (owalsh) wrote :

Nova devs have taken a look at this. nova-compute cannot create a resource provider in placement API, which suggests a network or keystone issue is the root cause.

yatin (yatinkarel) wrote :

Failed again:- https://review.rdoproject.org/jenkins/job/periodic-tripleo-ci-centos-7-ovb-1ctlr_1comp-featureset002-pike-upload/165/

<<< Nova devs have taken a look at this. nova-compute cannot create a resource provider in placement API, which suggests a network or keystone issue is the root cause.

Looking at the logs, it seems that when nova tries to register the resource provider, keystone is down because httpd is restarting at that time:-
nova:- https://logs.rdoproject.org/openstack-periodic-24hr/periodic-tripleo-ci-centos-7-ovb-1ctlr_1comp-featureset002-pike-upload/7dadafb/overcloud-novacompute-0/var/log/nova/nova-compute.log.txt.gz#_2018-04-25_05_40_03_852

Apache Starting:- https://logs.rdoproject.org/openstack-periodic-24hr/periodic-tripleo-ci-centos-7-ovb-1ctlr_1comp-featureset002-pike-upload/7dadafb/overcloud-controller-0/var/log/journal.txt.gz#_Apr_25_05_39_49

Apache Started:- https://logs.rdoproject.org/openstack-periodic-24hr/periodic-tripleo-ci-centos-7-ovb-1ctlr_1comp-featureset002-pike-upload/7dadafb/overcloud-controller-0/var/log/journal.txt.gz#_Apr_25_05_40_15
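The overlap can be checked with a little arithmetic. The following is only a sketch: the three timestamps are read off the anchors of the log links above (the journal anchors carry no sub-second part, so `.000` is assumed), and `in_window` is a hypothetical helper name.

```python
from datetime import datetime

FMT = "%Y-%m-%d %H:%M:%S.%f"

# Timestamps taken from the log link anchors; sub-second parts of the
# journal entries are not in the anchors and are assumed to be .000.
apache_starting = datetime.strptime("2018-04-25 05:39:49.000", FMT)
apache_started = datetime.strptime("2018-04-25 05:40:15.000", FMT)
nova_attempt = datetime.strptime("2018-04-25 05:40:03.852", FMT)


def in_window(ts, start, end):
    """Return True if ts falls inside the closed interval [start, end]."""
    return start <= ts <= end


# Did nova-compute try to reach keystone while httpd was restarting?
attempt_during_restart = in_window(nova_attempt, apache_starting, apache_started)

# How long the restart window lasted, in seconds.
restart_duration_s = (apache_started - apache_starting).total_seconds()

print(attempt_during_restart, restart_duration_s)
```

With these values the nova-compute attempt lands inside the window between "Apache Starting" and "Apache Started", which is consistent with keystone being unreachable at that moment.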

The Apache start was requested by gnocchi, so we need to find out why this started happening recently.

@owalsh, isn't there also a retry mechanism in nova for registering the placement resource provider? How many times does it retry, and what is the timeout for each attempt?
A workaround could be:- restarting nova-compute, or starting nova-compute only once keystone/apache is responding.
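The second workaround (only start nova-compute once keystone/apache answers) could look roughly like the sketch below. It is not what TripleO or nova does; `wait_for_http`, `_http_probe`, and the timeout and interval defaults are hypothetical names and values chosen for illustration, and the keystone URL has to be supplied by the caller.

```python
import time
import urllib.error
import urllib.request


def _http_probe(url):
    """Return True if anything answers HTTP at url (hypothetical helper)."""
    try:
        with urllib.request.urlopen(url, timeout=5):
            return True
    except urllib.error.HTTPError:
        # Any HTTP status code means httpd is up and answering.
        return True
    except (urllib.error.URLError, OSError):
        # Connection refused, timeout, DNS failure: not responding yet.
        return False


def wait_for_http(url, timeout_s=300.0, interval_s=5.0,
                  probe=None, sleep=time.sleep, clock=time.monotonic):
    """Poll url until it responds or timeout_s elapses.

    Returns True if the endpoint responded, False on timeout.
    probe, sleep and clock are injectable so the loop can be tested
    without a network or real delays.
    """
    probe = probe or _http_probe
    deadline = clock() + timeout_s
    while True:
        if probe(url):
            return True
        if clock() >= deadline:
            return False
        sleep(interval_s)
```

A caller would pass the keystone endpoint of the deployment and (re)start nova-compute only when `wait_for_http(...)` returns True, for example via `systemctl restart openstack-nova-compute` (the unit name is an assumption and differs on containerized deployments).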

Changed in tripleo:
assignee: Quique Llorente (quiquell) → nobody
Oliver Walsh (owalsh) wrote :

@yatin It retries every minute; see _ensure_resource_provider in the logs.

Oliver Walsh (owalsh) wrote :

I found an issue while reviewing _ensure_resource_provider - https://bugs.launchpad.net/nova/+bug/1768953. I expect it's the root cause here.

Oliver Walsh (owalsh) wrote :
Changed in tripleo:
assignee: nobody → Sagi (Sergey) Shnaidman (sshnaidm)