Scheduler Failures are no longer logged with enough detail for a site admin to do problem determination

Bug #1431291 reported by Sean Dague on 2015-03-12
This bug affects 5 people
Affects: OpenStack Compute (nova)
Importance: High
Assigned to: Pranav Salunke

Bug Description

After the move of scheduler dispatch into the conductor, all the logging changed. The impact is that the scheduler no longer logs scheduling failures at WARN in the scheduler log with the instance id in question. (This apparently also somehow slipped back into Juno.)

This is a notable regression from an ops perspective, and has required everyone to start running debug and doing lots of log searching to figure out what's going on.

The ops community asked that we log at a high level (WARN seems appropriate here) with the reason, context, and instance id, which makes their problem-determination loop on "nova didn't boot my thing" reports from users much quicker.
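
What the ops community is asking for could be sketched as a single WARN-level call using the stdlib logging module; the function name and the context's `request_id` attribute are illustrative, not nova's actual code:

```python
import logging

LOG = logging.getLogger("nova.scheduler")

def log_schedule_failure(context, instance_uuid, reason):
    # A single WARN line carrying reason, request context, and instance
    # id lets an admin grep the scheduler log without enabling DEBUG.
    LOG.warning(
        "Failed to schedule instance %s (request %s): %s",
        instance_uuid, getattr(context, "request_id", "-"), reason)
```

This is the minimum that lets an admin answer "why didn't nova boot my thing" from the scheduler log alone.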

Sean Dague (sdague) on 2015-03-12
Changed in nova:
status: New → Confirmed
importance: Undecided → High
milestone: none → kilo-3
Changed in nova:
assignee: nobody → Sylvain Bauza (sylvain-bauza)
Sylvain Bauza (sylvain-bauza) wrote :

As per the move to the conductor, it was also agreed not to let the scheduler know the list of instance_uuids.
Consequently, the recent Scheduler RPC API version bump (4.0) removed that field from the request_spec that was passed in, because it was no longer needed.

If we want to trace back from the logs which instance was failing, we have only two possibilities:
 #1 request_spec still provides the full information of the first instance being requested, including its uuid, so we can log that, but it would only be accurate for the first one
 #2 that instance information still includes an EC2-compatible field called reservation_id, which is set even for the OpenStack API and describes the request by itself. We can log that too.

Matt Riedemann (mriedem) on 2015-03-17
tags: added: scheduler
Alex Xu (xuhj) wrote :

We have request-id, is it enough?

Sean Dague (sdague) wrote :

request-id is not enough; it's not persistent.

Sean Dague (sdague) wrote :

OK, so I've spent some time staring at the filter scheduler, and because the entire thing is written in this generator/reduce model, we never have enough information flowing back. I thought this might be an easier issue to solve, but I'm not seeing a way through this.

The UX I think we want is:

on Schedule:

nova.scheduler INFO "Instance UUID scheduled to host X"

on fail to schedule

nova.scheduler WARN "Instance UUID failed to schedule due to Not Enough Y on hosts A(Y=x), B(Y=z)..."

Basically provide the closest match hosts.

All of the information is sitting in the DEBUG logs, but there is a lot of filtering and assembly to get back to a global answer there.
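
The two message shapes above could be sketched as a single formatting helper; `host_stats` is an assumed mapping from host name to how much of the scarce resource each closest-match host had, and all names here are hypothetical:

```python
def format_schedule_result(instance_uuid, host=None,
                           resource=None, host_stats=None):
    """Build the INFO success line or the WARN closest-match line."""
    if host is not None:
        return "Instance %s scheduled to host %s" % (instance_uuid, host)
    # List the closest-match hosts with how much of the resource each had,
    # e.g. "A(RAM=512), B(RAM=256)".
    closest = ", ".join("%s(%s=%s)" % (h, resource, v)
                        for h, v in sorted(host_stats.items()))
    return ("Instance %s failed to schedule due to not enough %s "
            "on hosts %s" % (instance_uuid, resource, closest))
```

The hard part, per the comment above, is not formatting the line but assembling `host_stats` from the per-filter results the scheduler currently discards.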

Alex Xu (xuhj) wrote :

Another thought: we should log a warning when the filters return too few hosts.

And there should be an option the operator can configure for how few hosts returned by a filter will trigger that warning.
For example, the default value of that option could be 10%: if we have 10 hosts and, after going through a filter, only <=1 host is returned, we should log a warning so the admin knows some resource is close to, or already at, exhaustion.
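
The configurable threshold described here might look like this sketch; the option name and the 10% default are assumptions taken from the comment above, not an actual nova config option:

```python
FILTER_WARN_THRESHOLD = 0.10  # assumed config option, default 10%

def check_filter_exhaustion(filter_name, hosts_in, hosts_out,
                            threshold=FILTER_WARN_THRESHOLD):
    """Return a warning message when a filter keeps too few hosts.

    Returns None when the filter kept more than `threshold` of its
    input hosts, i.e. nothing looks close to exhaustion.
    """
    if hosts_in and len(hosts_out) <= threshold * len(hosts_in):
        return ("Filter %s returned only %d of %d hosts; some resource "
                "may be close to exhaustion"
                % (filter_name, len(hosts_out), len(hosts_in)))
    return None
```

Running a check like this after each filter stage would surface creeping capacity problems before scheduling starts failing outright.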

Alex Xu (xuhj) wrote :

Emm... sorry, the above thinking is still based on the request-id. If it is based on the request-id, that log can be helpful.

Thierry Carrez (ttx) on 2015-03-20
Changed in nova:
milestone: kilo-3 → kilo-rc1
Sylvain Bauza (sylvain-bauza) wrote :

Just to be clear, while I'm OK with the UX, we only have the first instance's details passed in the request_spec (including the instance_uuid field) and the number of instances being requested, so the message could only be:

"Instance <# of num_instances> triggered at the same boot time with instance <instance_uuid> [scheduled to | failed to schedule to] ..."

Sylvain Bauza (sylvain-bauza) wrote :

Still planning to work on the bugfix, but I'm struggling with other bugs for RC1, so I'm freeing up the ticket until I or someone else provides a patch.

Changed in nova:
assignee: Sylvain Bauza (sylvain-bauza) → nobody
Sylvain Bauza (sylvain-bauza) wrote :

As Alex said, it sounds like we can take the request_id from the context. So we could say that "instance # from request_id <req_id> failed to boot".

That sounds a bit cleaner.

Sylvain Bauza (sylvain-bauza) wrote :

Assigning it back to me as I should provide a change by tomorrow.

Changed in nova:
assignee: nobody → Sylvain Bauza (sylvain-bauza)

Related fix proposed to branch: master
Review: https://review.openstack.org/170421

Fix proposed to branch: master
Review: https://review.openstack.org/170472

Changed in nova:
status: Confirmed → In Progress

Change abandoned by Matt Riedemann (<email address hidden>) on branch: master
Review: https://review.openstack.org/163771
Reason: bauzas said this could be abandoned, there are two other changes up for review as WIP/POC.

Sylvain Bauza (sylvain-bauza) wrote :

Downgrading the milestone to kilo-rc-potential because, to implement this, I think we need to change the behaviour of select_destinations() so that it returns something to the conductor when it's unsuccessful instead of raising a NoValidHost.
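
The behaviour change described here could be sketched as a small result object handed back to the conductor in place of a raised NoValidHost; every name below is hypothetical, not nova's actual interface:

```python
class SelectDestinationsResult:
    """Carries either the chosen hosts or the failure details.

    Returning this instead of raising lets the conductor log a
    meaningful WARN (and pick the closest-match hosts) rather than
    only seeing a bare NoValidHost exception.
    """

    def __init__(self, hosts=None, failure_reason=None):
        self.hosts = hosts or []
        self.failure_reason = failure_reason

    @property
    def succeeded(self):
        return self.failure_reason is None
```

The cost of this design is an RPC interface change, which is why it pushed the fix past the RC deadline.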

Changed in nova:
milestone: kilo-rc1 → none
tags: added: kilo-rc-potential
tags: removed: kilo-rc-potential

Change abandoned by Joe Gordon (<email address hidden>) on branch: master
Review: https://review.openstack.org/170472
Reason: This review is > 4 weeks without comment, and failed Jenkins the last time it was checked. We are abandoning this for now. Feel free to reactivate the review by pressing the restore button and leaving a 'recheck' comment to get fresh test results.

Change abandoned by Michael Still (<email address hidden>) on branch: master
Review: https://review.openstack.org/170421
Reason: I'm going to abandon this one because it's been idle for a fair while, but I do agree we need to do _something_ here.

Michael Still (mikal) on 2015-08-24
Changed in nova:
status: In Progress → Confirmed
assignee: Sylvain Bauza (sylvain-bauza) → nobody
Matt Riedemann (mriedem) wrote :

There has been some work done here in liberty:

https://review.openstack.org/#/q/status:merged+project:openstack/nova+branch:master+topic:bp/no-valid-host-reporting,n,z

There is some more work that could be done, see the discussion in https://review.openstack.org/#/c/194780/ - but so far this is better than what we had.

Changed in nova:
assignee: nobody → Pranav Salunke (dguitarbite)
Chris Dent (cdent) wrote :

dguitarbite, can you confirm whether you are still working on this or not?

If not, or if I don't hear anything, I'll take this on next week.

Sylvain Bauza (sylvain-bauza) wrote :

Given what has been merged in Liberty, and given that the current status of what still needs fixing is pretty unclear, the scheduler subteam agreed to close this bug.

Feel free to open another bug with substantial information about what's missing if you feel so inclined, and we'd be glad to look at it.

Changed in nova:
status: Confirmed → Incomplete
Chris Dent (cdent) wrote :

Switching to invalid as incomplete is not a closed state.

Changed in nova:
status: Incomplete → Invalid