OpenStack Compute (nova)

Scheduler Failures are no longer logged with enough detail for a site admin to do problem determination

Bug #1431291 reported by Sean Dague on 2015-03-12

This bug affects 5 people

Affects		Status	Importance	Assigned to	Milestone
	OpenStack Compute (nova)	Invalid	High	Pranav Salunke

Bug Description

After the move of the scheduler dispatch into conductor all the logging changed. The impact is that the scheduler no longer logs scheduled failures at WARN in the scheduler log with the instance id in question. (This apparently also somehow slipped back into Juno)

This is a notable regression from an ops perspective, and has required everyone to start running debug and doing lots of log searching to figure out whats going on.

The ops community asked that we log at a high level (WARN seems appropriate here) with reason, context, and instance id which makes their problem determination loop on "nova didn't boot my thing" from users much quicker.

Tags:

Sean Dague (sdague) on 2015-03-12

Changed in nova:
status:	New → Confirmed
importance:	Undecided → High
milestone:	none → kilo-3

Sylvain Bauza (sylvain-bauza) on 2015-03-12

Changed in nova:
assignee:	nobody → Sylvain Bauza (sylvain-bauza)

Revision history for this message

OpenStack Infra (hudson-openstack) wrote on 2015-03-12: Related fix proposed to nova (master)

Related fix proposed to branch: master
Review: https://review.openstack.org/163771

Revision history for this message

Sylvain Bauza (sylvain-bauza) wrote on 2015-03-12:

As per the move to the conductor, it was also agreed on not leaving the scheduler know the list of instance_uuids
Consequently, the recent Scheduler RPC API version bump (4.0) removed that field from the request_spec which was passed in because it was no longer needed.

If we want to backtrack which instance was failing with the logs, we then only have 2 possibilities :
#1 request_spec still continues to provide a full information of the first instance being requested, including its uuid - so we can show this up but it would only be accurate for the first one
#2 that instance information still includes an EC2-compatible field called reservation_id which is set even for OpenStack API and describe the request by itself. We can show it up too.

Revision history for this message

Sylvain Bauza (sylvain-bauza) wrote on 2015-03-12:

Here is where the reservation_id is set at the API level :
https://github.com/openstack/nova/blob/95eaeb072d3dec25b7b50766140eb378c45b9b69/nova/compute/api.py#L1072-L1073

Matt Riedemann (mriedem) on 2015-03-17

tags:

added: scheduler

Revision history for this message

Alex Xu (xuhj) wrote on 2015-03-18:

We have request-id, is it enough?

Revision history for this message

Sean Dague (sdague) wrote on 2015-03-18:

request-id is not enough, that's not persistent

Revision history for this message

Sean Dague (sdague) wrote on 2015-03-18:

ok, so I've spent some time staring at the filter scheduler, and because the entire thing is written in this generator reduce model, we never have enough information flowing back. I thought this might be an easier issue to solve, but I'm not seeing a way through this.

The UX I think we want is:

on Schedule:

nova.scheduler INFO "Instance UUID scheduled to host X"

on fail to schedule

nova.scheduler WARN "Instance UUID failed to schedule due to Not Enough Y on hosts A(Y=x), B(Y=z)..."

Basically provide the closest match hosts.

All of the information is sitting in the DEBUG logs, but there is a lot of filtering and assembly to get back to a global answer there.

Revision history for this message

Alex Xu (xuhj) wrote on 2015-03-19:

Another think is we should warn log when the filters return too few hosts.

And there should be an option which operator can config to when filter return how many host will trigger that warn log.
For example: the default value of that option is 10%. If we have 10 hosts, after go through a filter, only return <=1 host, we should log that to warn the admin some resource are close to exhaustion or already exhaustion.

Revision history for this message

Alex Xu (xuhj) wrote on 2015-03-19:

Emm... sorry above thinking is still based on request-id. if based on request-id, that log can be helpful.

Thierry Carrez (ttx) on 2015-03-20

Changed in nova:
milestone:	kilo-3 → kilo-rc1

Revision history for this message

Sylvain Bauza (sylvain-bauza) wrote on 2015-03-20:

Just to be clear, while I'm OK with the UX, we only have the first instance details passed in the request_spec (incl. the instance_uuid field) and the number of instances being requested, so the message could only be :

"Instance <# of num_instances> triggered at the same boot time with instance <instance_uuid> [scheduled to | failed to schedule to] ..."

Revision history for this message

Sylvain Bauza (sylvain-bauza) wrote on 2015-03-30:

#10

Still planning to work on the bugfix but struggling with other bugs for RC1 so freeing up the ticket until me or someone else provides a patch

Changed in nova:
assignee:	Sylvain Bauza (sylvain-bauza) → nobody

Revision history for this message

Sylvain Bauza (sylvain-bauza) wrote on 2015-03-31:

#11

As Alex said, sounds like we can take request_id from the context. So, we could say that "instance # from request_id <req_id> failed to boot"

That sounds a bit cleaner.

Revision history for this message

Sylvain Bauza (sylvain-bauza) wrote on 2015-04-01:

#12

Assigning it back to me as I should provide a change by tomorrow.

Changed in nova:
assignee:	nobody → Sylvain Bauza (sylvain-bauza)

Revision history for this message

OpenStack Infra (hudson-openstack) wrote on 2015-04-03:

#13

Related fix proposed to branch: master
Review: https://review.openstack.org/170421

Revision history for this message

OpenStack Infra (hudson-openstack) wrote on 2015-04-03: Fix proposed to nova (master)

#14

Fix proposed to branch: master
Review: https://review.openstack.org/170472

Changed in nova:
status:	Confirmed → In Progress

Revision history for this message

OpenStack Infra (hudson-openstack) wrote on 2015-04-03: Change abandoned on nova (master)

#15

Change abandoned by Matt Riedemann (<email address hidden>) on branch: master
Review: https://review.openstack.org/163771
Reason: bauzas said this could be abandoned, there are two other changes up for review as WIP/POC.

Revision history for this message

Sylvain Bauza (sylvain-bauza) wrote on 2015-04-09:

#16

Downgrading the milestone to kilo-rc-potential as by implementing it, I think we need change the behaviour of select_destinations() by returning something to the conductor when it's unsuccessful instead of raising a NoValidHost.

Changed in nova:
milestone:	kilo-rc1 → none
tags:	added: kilo-rc-potential

Revision history for this message

John Garbutt (johngarbutt) wrote on 2015-04-09:

#17

As a summary, the current efforts are here:
* https://review.openstack.org/#/c/170421/
* https://review.openstack.org/#/c/170472/

Sylvain Bauza (sylvain-bauza) on 2015-04-14

tags:

removed: kilo-rc-potential

Revision history for this message

OpenStack Infra (hudson-openstack) wrote on 2015-05-13:

#18

Change abandoned by Joe Gordon (<email address hidden>) on branch: master
Review: https://review.openstack.org/170472
Reason: This review is > 4 weeks without comment, and failed Jenkins the last time it was checked. We are abandoning this for now. Feel free to reactivate the review by pressing the restore button and leaving a 'recheck' comment to get fresh test results.

Revision history for this message

OpenStack Infra (hudson-openstack) wrote on 2015-08-12:

#19

Change abandoned by Michael Still (<email address hidden>) on branch: master
Review: https://review.openstack.org/170421
Reason: I'm going to abandon this one because its been idle for a fair while, but I do agree we need to do _something_ here.

Michael Still (mikal) on 2015-08-24

Changed in nova:
status:	In Progress → Confirmed
assignee:	Sylvain Bauza (sylvain-bauza) → nobody

Revision history for this message

Matt Riedemann (mriedem) wrote on 2015-08-31:

#20

There has been some work done here in liberty:

https://review.openstack.org/#/q/status:merged+project:openstack/nova+branch:master+topic:bp/no-valid-host-reporting,n,z

There is some more work that could be done, see the discussion in https://review.openstack.org/#/c/194780/ - but so far this is better than what we had.

Pranav Salunke (dguitarbite) on 2015-10-14

Changed in nova:
assignee:	nobody → Pranav Salunke (dguitarbite)

Revision history for this message

Chris Dent (cdent) wrote on 2016-01-08:

#21

dguitarbite, can you confirm whether you are still working on this or not?

If not, or if I don't hear anything, I'll take this on next week.

Revision history for this message

Sylvain Bauza (sylvain-bauza) wrote on 2016-01-11:

#22

Given what has been merged by Liberty and given how the current status is pretty unclear about what needs to be fixed, the scheduler subteam agreed on closing that bug.

Feel free to open another bug with substantial information about what's missing if you feel so, and we'd be glad to look at it.

Changed in nova:
status:	Confirmed → Incomplete

Revision history for this message

Chris Dent (cdent) wrote on 2016-03-15:

#23

Switching to invalid as incomplete is not a closed state.

Changed in nova:
status:	Incomplete → Invalid

Report a bug

This report contains Public information

Everyone can see this information.

You are

Subscribing...

Edit bug mail

Other bug subscribers

Related blueprints

Improve the feedback when NoValidHost is raised by the Scheduler

Remote bug watches

Bug watches keep track of this bug in other bug trackers.