Nova show will not display NoValidHost with right exception traces

Bug #1369818 reported by zhu zhu
This bug affects 3 people
Affects: OpenStack Compute (nova)
Status: Invalid
Importance: Medium
Assigned to: Unassigned
Milestone: (none listed)

Bug Description

When the nova scheduler is configured for multiple scheduling attempts and a deployment attempt on a given host fails with a detailed exception, the scheduler chooses another host and retries.

But after all attempts have been tried, it raises a generic NoValidHost exception without a proper message. As a result, nova show <instance> does not display any information that is directly useful to end users.

So it is suggested to wrap the NoValidHost exception message with the exception detail from the last failed attempt.

For example, when using the nova vmware driver to spawn a VM with a disk larger than the datastore upper limit, the driver raises a DatastoreNotFound exception with details, but after the scheduler retries, that detail is lost from nova show. It would be friendlier for operators to see such errors directly in nova show instead of digging into the scheduler log.

filter_scheduler.py

schedule_run_instance

        for num, instance_uuid in enumerate(instance_uuids):
            request_spec['instance_properties']['launch_index'] = num

            try:
                try:
                    weighed_host = weighed_hosts.pop(0)
                    LOG.info(_("Choosing host %(weighed_host)s "
                                "for instance %(instance_uuid)s"),
                              {'weighed_host': weighed_host,
                               'instance_uuid': instance_uuid})
                except IndexError:
                    raise exception.NoValidHost(reason="")
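
The suggestion above could be sketched roughly as follows. This is a minimal, self-contained illustration and not nova code: NoValidHost and DatastoreNotFound here are simplified stand-ins for the real exceptions, and schedule_with_retries, spawn and failing_spawn are hypothetical names introduced only for this example. It shows the last failure being carried into the NoValidHost reason instead of an empty string.

```python
class NoValidHost(Exception):
    """Simplified stand-in for nova's NoValidHost (assumption, not nova code)."""

    def __init__(self, reason=""):
        self.reason = reason
        super().__init__("No valid host was found. %s" % reason)


class DatastoreNotFound(Exception):
    """Stand-in for the VMware driver error mentioned in the report."""


def schedule_with_retries(hosts, spawn):
    """Try each candidate host in order and return the first that succeeds.

    `spawn` is a hypothetical callable that raises when deployment fails.
    """
    last_failure = None
    for host in hosts:
        try:
            spawn(host)
            return host
        except Exception as exc:  # broad on purpose: any failure triggers a retry
            last_failure = "%s on %s: %s" % (type(exc).__name__, host, exc)
    # All attempts exhausted: keep the last failure instead of reason="".
    raise NoValidHost(reason=last_failure or "No hosts were available.")


def failing_spawn(host):
    """Hypothetical spawn that always fails, to exercise the retry path."""
    raise DatastoreNotFound("disk is larger than the datastore limit")


try:
    schedule_with_retries(["host1", "host2"], failing_spawn)
except NoValidHost as err:
    message = str(err)
    print(message)
```

With a reason like this attached, the failure cause would be visible wherever the exception message is surfaced, which is the behaviour the report asks for in nova show.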

Tags: scheduler
zhu zhu (zhuzhubj)
summary: - Nova show will not display NoValidHost with detail exception traces
+ Nova show will not display NoValidHost with right exception traces
tags: added: scheduler
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to nova (master)

Fix proposed to branch: master
Review: https://review.openstack.org/121739

Changed in nova:
assignee: nobody → zhu zhu (zhuzhubj)
status: New → In Progress
Changed in nova:
importance: Undecided → Medium
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Change abandoned on nova (master)

Change abandoned by Sean Dague (<email address hidden>) on branch: master
Review: https://review.openstack.org/121739
Reason: This review is > 4 weeks without comment, and failed Jenkins the last time it was checked. We are abandoning this for now. Feel free to reactivate the review by pressing the restore button and leaving a 'recheck' comment to get fresh test results.

Revision history for this message
Davanum Srinivas (DIMS) (dims-v) wrote :

Removing "In Progress" status and assignee as change is abandoned.

Changed in nova:
status: In Progress → Confirmed
assignee: zhu zhu (zhuzhubj) → nobody
Revision history for this message
Sudipta Biswas (sbiswas7) wrote :

I was scrubbing through the list of Nova bugs, and hoping to work on a few. Looks like this one hasn't been touched upon for a while. I am assigning it to myself to work on it further.

Changed in nova:
assignee: nobody → Sudipta Biswas (sbiswas7)
Revision history for this message
Sudipta Biswas (sbiswas7) wrote :

I feel there are a lot of moving parts to this problem.
Currently, I see a discrepancy in the way the NoValidHost exception is handled and generated.
The amount of information we would like to provide with this exception is philosophically different in different parts of the code.

In nova/scheduler/utils.py, it appears that when the retries exceed max_attempts, we put a lot of information into the NoValidHost exception:

https://github.com/openstack/nova/blob/master/nova/scheduler/utils.py#L165

However, at the same time, in https://github.com/openstack/nova/blob/master/nova/scheduler/filter_scheduler.py#L79 ,
we seem to say that we shouldn't be putting out too much information.

After chatting with bauzas on IRC, it sounds like we need a discussion about this at the summit.

Revision history for this message
Zhe Jing (jingzhe) wrote :

I am a maintenance engineer who is very interested in this. NoValidHost is not a clear error message for me or for customers, so I have to read lots of logs to find the reason, especially when there are many compute nodes.

Could we use instance_event or a subaction to hold the fault reason for every compute node?

Revision history for this message
Sudipta Biswas (sbiswas7) wrote :

You can take a look at this spec: https://review.openstack.org/#/c/194204/
Additionally you can provide your feedback to Ed Leafe.

Revision history for this message
Ed Leafe (ed-leafe) wrote :

I've added a more specific exception in this patch: https://review.openstack.org/#/c/194780/

If you have suggestions for improving this, please comment on that patch.

Changed in nova:
assignee: Sudipta Biswas (sbiswas7) → nobody
Changed in nova:
assignee: nobody → Manjunath Ranganathaiah (manjunath-ranganathaiah)
Changed in nova:
assignee: Manjunath Ranganathaiah (manjunath-ranganathaiah) → nobody
Revision history for this message
Chris Dent (cdent) wrote :

Ed's change merged 8 months ago, and there's been no additional input since. Let's kill this in favour of a new bug that is more in tune with the current state of affairs and more specific about the problems that need to be solved.

Changed in nova:
status: Confirmed → Invalid
This report contains Public information  
Everyone can see this information.
