Scheduler connects to all cells DBs to gather compute nodes info

Bug #1767303 reported by Belmiro Moreira on 2018-04-27
This bug affects 2 people
Affects: OpenStack Compute (nova)
Assigned to: Surya Seetharaman
Declined for Queens by Matt Riedemann

Bug Description

The scheduler's host manager connects to every cell DB to get compute node info, even when placement returns only a subset of compute node UUIDs.

This has a performance impact in large cloud deployments with several cells.

Also related to:

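    def _get_computes_for_cells(self, context, cells, compute_uuids=None):
        compute_nodes = collections.defaultdict(list)
        services = {}
        for cell in cells:
            LOG.debug('Getting compute nodes and services for cell %(cell)s',
                      {'cell': cell.identity})
            with context_module.target_cell(context, cell) as cctxt:
                if compute_uuids is None:
                    compute_nodes[cell.uuid].extend(
                        objects.ComputeNodeList.get_all(cctxt))
                else:
                    compute_nodes[cell.uuid].extend(
                        objects.ComputeNodeList.get_all_by_uuids(
                            cctxt, compute_uuids))
                services.update(
                    {service.host: service
                     for service in objects.ServiceList.get_by_binary(
                             cctxt, 'nova-compute',
                             include_disabled=True)})
        return compute_nodes, services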

Changed in nova:
assignee: nobody → Surya Seetharaman (tssurya)
tags: added: cells scheduler
Matt Riedemann (mriedem) wrote :

I'm confused, you reference but are pasting a code snippet of old code before that scatter/gather routine was added. Does resolve your issue or at least make it acceptable performance?

tags: added: performance
Changed in nova:
status: New → Incomplete
Matt Riedemann (mriedem) wrote :

If isn't enough, we could also think about adding a HostMapping.uuid field which mirrors the ComputeNode.uuid field and then we could get the list of host mappings by uuids and from that list get the list of cells from which to pull the compute nodes, but it would be good to know if made a big enough difference that it's good enough for now.
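To make the HostMapping.uuid idea concrete, here is a minimal sketch of the lookup it would allow. Everything in it is a toy stand-in: the HostMapping dataclass below is not nova's real object, the uuid field is the hypothetical addition being discussed, and cells_for_compute_uuids is an illustrative helper name, not an existing nova function.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class HostMapping:
    """Toy stand-in for a host mapping row; not nova's real object."""
    host: str
    cell_uuid: str
    uuid: str  # hypothetical field mirroring ComputeNode.uuid


def cells_for_compute_uuids(host_mappings, compute_uuids):
    """Return the set of cell UUIDs that hold any of the given computes."""
    wanted = set(compute_uuids)
    return {hm.cell_uuid for hm in host_mappings if hm.uuid in wanted}


mappings = [
    HostMapping("host-a", "cell-1", "cn-1"),
    HostMapping("host-b", "cell-1", "cn-2"),
    HostMapping("host-c", "cell-2", "cn-3"),
    HostMapping("host-d", "cell-3", "cn-4"),
]

# Placement returned two candidates, both of which live in cell-1,
# so only that one cell would need to be queried.
target_cells = cells_for_compute_uuids(mappings, ["cn-1", "cn-2"])
print(target_cells)
```

In a real deployment the lookup would presumably be a single query against the API database rather than an in-memory scan, which is the point of the suggestion: one cheap query up front instead of one query per cell.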

Matt Riedemann (mriedem) wrote :

Or maybe at this point, this is a duplicate of bug 1737465?

Matt Riedemann (mriedem) wrote :

OK, I understand the issue now. The problem is that when we get results from placement, all of the allocation candidates (compute nodes) might be in a single cell, because maybe the request is tied to an aggregate which represents that cell. But when the HostManager queries the cell databases for the compute nodes, it iterates over all enabled cells, which could be ~70 in CERN's case. So we're doing a lot of extra DB queries that won't yield results, and some of them might hit older, slower cell DBs which take longer to return a response.
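A rough way to picture the cost is the toy simulation below. It is only a sketch: the cell databases are plain Python sets, scan_cells is a made-up helper, and the only number taken from this report is the ~70 cells; the three compute nodes per cell and all of the names are invented for illustration.

```python
def scan_cells(cell_dbs, compute_uuids):
    """Query every cell for the given UUIDs, like a loop over all cells.

    cell_dbs maps a cell name to the set of compute node UUIDs it holds.
    Returns (found, queries, empty_queries).
    """
    wanted = set(compute_uuids)
    found = {}
    queries = 0
    empty = 0
    for cell, nodes in cell_dbs.items():
        queries += 1  # one DB round trip per cell, needed or not
        hits = nodes & wanted
        if hits:
            found[cell] = hits
        else:
            empty += 1  # this query returned nothing useful
    return found, queries, empty


# 70 cells with 3 compute nodes each (invented layout).
cell_dbs = {
    f"cell-{i}": {f"cn-{i}-{j}" for j in range(3)} for i in range(70)
}

# All allocation candidates happen to live in a single cell.
candidates = ["cn-5-0", "cn-5-2"]
found, queries, empty = scan_cells(cell_dbs, candidates)
print(queries, empty)  # 70 69
```

With every candidate in one cell, 69 of the 70 queries come back empty, and the scheduler still has to wait for the slowest of them.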

If we could filter the cells up front based on the computes (via host_mappings maybe) like we do for filtering instances by project mapped to cells in the API using config:

Then that might make scheduling faster, assuming the compute nodes are in fact restricted to a small subset of cells.

Changed in nova:
status: Incomplete → Triaged

Fix proposed to branch: master

Changed in nova:
status: Triaged → In Progress

Change abandoned by Surya Seetharaman (<email address hidden>) on branch: master
Reason: cern specific
