Instance that uses force-host still needs to run some filters

Bug #1427772 reported by Nikola Đipanov on 2015-03-03
This bug affects 7 people
Affects Status Importance Assigned to Milestone
OpenStack Compute (nova)

Bug Description

More in-depth discussion can be found here:

Basically - there is a number of filters that need to be re-run even if we force a host. The reasons are two-fold. Placing some instances on some hosts is an obvious mistake and should be disallowed (instances with specific CPU pinning are an example), though this will be eventually rejected by the host. Second reason is that claims logic on compute hosts depends on limits being set by the filters, and if they are not some of the oversubscription as well as more complex placement logic will not work for the instance (see the following bug report as to how it impacts NUMA placement logic

Overall completely bypassing the filters is not ideal.

Changed in nova:
assignee: nobody → Sylvain Bauza (sylvain-bauza)
status: New → Confirmed
importance: Undecided → Low
Changed in nova:
assignee: Sylvain Bauza (sylvain-bauza) → Anant Kaushik (anantkaushik-nsit)
Chris Dent (cdent) wrote :

I got to the end of the referenced thread <> and was none the wiser on what the chosen outcome was for this. There were several options bounced back and forth but nothing that seemed to win consensus. The options appeared to be:

* do nothing, let the targeted compute node fail when force_host is set if the host can't accept
    * except that apparently some claims can't be validated on-node
* pass the hosts listed as forced into the filters without changing the filters
* pass the hosts listed as forced into the filters but only run those filters which are deemed required when doing a force
* make a new command that is like force but means "force with filtering"
* but wait, forcing is antithetical to "cloud"; end users should not know or care about where stuff ends up as long as it is good enough, let's get rid of the force command

From what I can tell the sentence in the description beginning "Second reason..." suggests that "pass the hosts listed as forced into the filters without changing the filters" is the change that resolves the presenting problems with the smallest surface area of a change to the system.

So the question becomes: does the second reason matter?

If not we should mark this as incomplete, or invalid, or not a bug and get it off the radar.

Chris Dent wrote in #1:

> * pass the hosts listed as forced into the filters but only run
> those filters which are deemed required when doing a force

I think that's the outcome of the ML post.

> So the question becomes: does the second reason matter?

I don't know enough of the claim logic to answer that. Sylvain and
Nikola seem to know more about that.

> If not we should mark this as incomplete, or invalid, or not a
> bug and get it off the radar.

I think it's fair to close a bug report when we don't come to a
conclusion within 1 year, as it indicates that this is an issue
which might not be important enough to spent effort on it. To
clearly indicate that this bug report is expired, I'm in favor
of using the status "won't fix" with a comment that this is an
expired issue. But this expiration policy I have in mind is not yet
in place [1].


Chris Dent (cdent) wrote :

Let's close this and let it come back up later. There's been no progress on it for greater than a year and we don't even seem to have consensus on what the right behavior should be.

Nikola Đipanov (ndipanov) wrote :

I am not sure that's valid grounds for closing. We should really keep it open unless we decide we will not fix this ever. If we'll come back to it - leave it open and maybe tag as long-standing or similar to be able to filter it out. Bugs don't need food to stick around in the bugtracker... :)

Chris Dent (cdent) wrote :

Bugs that aren't getting attention and don't have clear replication strategy (after a year) are just noise and noise makes it harder to get things done or to have a sense of purpose or direction. Having to filter a list of bugs between those that are actionable and those that are not is extremely discouraging and makes me not want to look at the bug list. However I'm obliged to do so and every time I do I just think: I do not want to do this again.

We need to make the process more actionable. So if one way to do that is to remove bugs that nobody seems to really care about then we should do so. The great things about bugs is that if they are _actually_ bugs they'll get reported again. If they aren't reported then they must not be real.

Nikola Đipanov (ndipanov) wrote :

I don't think that we want to close bugs just because they are not getting immediate attention. Ultimately - that's what the bug tracker is for, and this bug is (I believe) a valid defect so there is no reason to close it unless we decide we really don't want to fix it.

I don't think we need to get discouraged by the number of open bugs, just organize them better to make them actionable. There is definitely value in keeping backlog bugs. OTOH - we could have a different way to track these kind of issues if we want to keep the bug list here as only immediately actionable stuff.

Either way we should keep useful info accessible. I can see people hitting this issue, and trying to find what's the story and getting discouraged by a closed bug.

Maciej Szankin (mszankin) wrote :

Removing assignee due to lack of visible progress. Feel free to reassign if it is not the case.

Changed in nova:
assignee: Anant Kaushik (anantkaushik-nsit) → nobody
Matt Riedemann (mriedem) wrote :

I was going to open a new bug for this code:

But this existing bug is probably close enough, especially since in Pike we removed RamFilter and DiskFilter from the default list of enabled filters, so the limits for those aren't going to come from the scheduler to the compute anymore for claims (by default for new installs).

But the code here specifically:

Is mimicing the RamFilter and uses the ComputeNode information, when it should be using the Placement service now which would have the latest inventory information for that ComputeNode resource provider at this point.

So as we remove core/ram/disk filters from the scheduler and replace those with using the Placement service, in the hopes of eventually removing that kind of claims logic in the compute service, we need to replace anything that relied on it to also use Placement, like this code in conductor.

tags: added: conductor placement
Matt Riedemann (mriedem) wrote :

Since change we're still allocating resources against the forced destination host during live migration using placement, which for the FilterScheduler at least should be accurate and good since it will tell us if there isn't enough free memory on the host to fix the instance.

I think that means _check_destination_has_enough_memory is only needed then for people using the CachingScheduler because the CachingScheduler isn't using Placement so capacity information on the host wouldn't be accurate (the inventory for the resource provider is getting reported by the allocations for all of the other instances on that host might not be reported since the CachingScheduler doesn't allocate them during scheduling). So the RAMFilter type check in _check_destination_has_enough_memory is still needed for the CachingScheduler, which is deprecated.

We should be able to remove the CachingScheduler and _check_destination_has_enough_memory check in Stein now that we have the "nova-manage placement heal_allocations" CLI which CachingScheduler deployments can use to create allocations in placement for their existing instances.

To post a comment you must log in.
This report contains Public information  Edit
Everyone can see this information.

Duplicates of this bug

Other bug subscribers