Docs needed for tunables at large scale

Bug #1838819 reported by Matt Riedemann
This bug affects 3 people
Affects: OpenStack Compute (nova)

Bug Description

Various things come up in IRC every once in a while about configuration options that need to be tweaked at large scale (blizzard, cern, etc) which once you hit hundreds or thousands of compute nodes need to be changed to avoid killing the control plane.

One such option is heal_instance_info_cache_interval:

From a Blizzard operator:

(3:04:18 PM) eandersson: mriedem, we had to set heal_instance_info_cache high because it was killing our control plane
(3:05:41 PM) eandersson: It was getting real heavy on large sites with 1k nodes
(3:06:26 PM) eandersson: We also ended up adding a variance
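The "variance" eandersson mentions is the usual jitter trick: randomize each node's periodic interval so thousands of computes don't hit the control plane in lockstep. A minimal illustration of the idea (not nova code; the function name and jitter fraction are made up for this sketch):

```python
import random

def jittered_interval(base_seconds, jitter_fraction=0.1):
    """Return the base interval +/- up to jitter_fraction, so periodic
    tasks on many nodes drift apart instead of firing together."""
    delta = base_seconds * jitter_fraction
    return base_seconds + random.uniform(-delta, delta)

# With a 600s base and 10% jitter, each node picks something in [540, 660].
interval = jittered_interval(600)
assert 540 <= interval <= 660
```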

Similarly, CERN had to disable that periodic task entirely and rely on a SIGHUP / restart of the service when they needed to refresh that cache.
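As a sketch of the two approaches above in nova.conf (the values are deployment-specific, and the disable semantics should be checked against your nova release):

```ini
[DEFAULT]
# Default is 60 seconds; raising this spreads the periodic
# info-cache healing load out across a large fleet of computes.
heal_instance_info_cache_interval = 600

# Setting the interval to a negative value disables the periodic
# task entirely (CERN's approach); the cache is then only refreshed
# on a SIGHUP / restart of nova-compute.
# heal_instance_info_cache_interval = -1
```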

We should put these things in the admin docs as we come across them so we don't forget about this stuff when new operators/users come along and hit scaling issues.

Revision history for this message
Matt Riedemann (mriedem) wrote :

That is another one, and so are rpc_response_timeout and long_rpc_timeout.
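For the RPC timeouts mentioned above, a hedged nova.conf sketch (section placement and defaults as of roughly this era of nova/oslo.messaging; verify against your release):

```ini
[DEFAULT]
# oslo.messaging timeout for a single RPC call (default 60s); often
# raised at scale when the message bus is under heavy load.
rpc_response_timeout = 120

# nova-specific timeout used for known long-running RPC calls, e.g.
# live migration pre-checks (default 1800s).
long_rpc_timeout = 1800
```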

Changed in nova:
assignee: nobody → Takashi NATSUME (natsume-takashi)
status: New → In Progress
Revision history for this message
Matt Riedemann (mriedem) wrote :

I meant to assign this to myself since I have some ideas in mind for what to document, or at least start a document.

Revision history for this message
Tim Bell (tim-bell) wrote :

Another one from CERN's experience is regarding Placement: scheduler.max_placement_results.

Are you looking to document this only for Nova, or for other areas too? We're gradually tuning Ironic/Neutron for scale as well.
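For reference, the Placement tunable Tim mentions lives in the [scheduler] section of nova.conf; a sketch (the value here is illustrative, not a recommendation):

```ini
[scheduler]
# Cap the number of allocation candidates requested from Placement.
# Lowering this reduces scheduler/placement load in very large clouds,
# at the cost of considering fewer candidate hosts per boot request.
max_placement_results = 100
```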

Changed in nova:
assignee: Takashi NATSUME (natsume-takashi) → nobody
status: In Progress → Confirmed
Revision history for this message
Brin Zhang (zhangbailin) wrote :

I think it is necessary to consider this in many ways in the docs: configuration items for Nova, Glance, Neutron, Cinder and the related clients should be covered comprehensively to improve stability in large-scale scenarios.

Revision history for this message
Matt Riedemann (mriedem) wrote :

@Tim, I'm just looking for some simple pointers to options people typically tune at scale for *nova* specifically. The ops guide [1] is something more general for openstack-wide documentation. I'm not trying to boil the ocean here.


Revision history for this message
Matt Riedemann (mriedem) wrote :

More details from CERN's nova+ironic testing at scale:

Revision history for this message
Matt Riedemann (mriedem) wrote :

Another one: image cache management when you're not using Ceph or NFS on your computes.

The idea is to share the image cache across your computes, so once the first compute downloads a new image it is available to all of them and the others don't have to re-download the same image.

Note that there could be unexpected bugs as a result (see bug 1804262).
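A sketch of the related nova.conf knobs, assuming the pre-Ussuri [DEFAULT] option names (these later moved to an [image_cache] group; verify against your release):

```ini
[DEFAULT]
# How often (in seconds) the image cache manager periodic task runs.
image_cache_manager_interval = 2400

# Keep cached base images around so computes sharing the instances
# directory don't have to re-download them.
remove_unused_base_images = false
```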

tags: added: doc
removed: docs