Docs needed for tunables at large scale

Bug #1838819 reported by Matt Riedemann on 2019-08-02
This bug affects 2 people
Affects: OpenStack Compute (nova)
Importance: Undecided
Assigned to: Unassigned

Bug Description

Every once in a while, configuration options come up in IRC that large deployments (Blizzard, CERN, etc.) have had to tweak: once you hit hundreds or thousands of compute nodes, the defaults need to be changed to avoid killing the control plane.

One such option is this:

https://docs.openstack.org/nova/latest/configuration/config.html#DEFAULT.heal_instance_info_cache_interval

From a Blizzard operator:

(3:04:18 PM) eandersson: mriedem, we had to set heal_instance_info_cache high because it was killing our control plane
(3:05:41 PM) eandersson: It was getting real heavy on large sites with 1k nodes
(3:06:26 PM) eandersson: We also ended up adding a variance
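The chat does not say how the "variance" was implemented. One common reading is random jitter on the periodic interval, so that a thousand computes do not all hit the control plane at the same moment. A minimal sketch of that idea follows; `jittered_interval`, its `variance` parameter, and the 600-second base are hypothetical names and values for illustration, not Nova code or Blizzard's actual patch.

```python
import random


def jittered_interval(base_seconds, variance=0.2, rng=random):
    """Scale base_seconds by a random factor in [1 - variance, 1 + variance].

    Hypothetical helper, not part of Nova: it only illustrates spreading
    periodic work across nodes instead of running it in lockstep.
    """
    if base_seconds <= 0:
        raise ValueError("base_seconds must be positive")
    if not 0 <= variance < 1:
        raise ValueError("variance must be in [0, 1)")
    return base_seconds * rng.uniform(1 - variance, 1 + variance)


if __name__ == "__main__":
    rng = random.Random(0)  # seeded only so the demo is repeatable
    # Assumed example: five computes, each starting from a 600 s base interval.
    for node in range(5):
        print(f"compute-{node}: next run in {jittered_interval(600, 0.2, rng):.1f}s")
```

With a variance of 0.2 each node's interval lands somewhere between 480 and 720 seconds, so the runs drift apart over time rather than arriving as one burst.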

Similarly, CERN had to totally disable this one:

https://docs.openstack.org/nova/latest/configuration/config.html#compute.resource_provider_association_refresh

And rely on SIGHUP / restart of the service if they needed to refresh that cache.

We should put these things in the admin docs as we come across them so we don't forget about this stuff when new operators/users come along and hit scaling issues.

Matt Riedemann (mriedem) wrote :

https://docs.openstack.org/nova/latest/configuration/config.html#DEFAULT.report_interval is another one, and rpc_response_timeout and long_rpc_timeout.
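As a starting point for operators, here is a rough sketch that reads a nova.conf-style file and reports which of the options named so far are set explicitly. It uses Python's `configparser` rather than oslo.config, so it is only an approximation of how Nova parses the file. `SCALE_OPTIONS`, `read_scale_options`, and the sample values are made up for illustration and are not recommendations; the `0` for `resource_provider_association_refresh` assumes that value disables the refresh, which should be checked against the configuration reference for your release.

```python
import configparser

# (section, option) pairs mentioned in this bug so far. Sections follow the
# linked configuration reference; rpc_response_timeout is an oslo.messaging
# option that is also set under [DEFAULT].
SCALE_OPTIONS = [
    ("DEFAULT", "heal_instance_info_cache_interval"),
    ("DEFAULT", "report_interval"),
    ("DEFAULT", "rpc_response_timeout"),
    ("DEFAULT", "long_rpc_timeout"),
    ("compute", "resource_provider_association_refresh"),
]

# Illustrative values only (assumptions, not tuning advice).
SAMPLE = """
[DEFAULT]
heal_instance_info_cache_interval = 600
report_interval = 30

[compute]
resource_provider_association_refresh = 0
"""


def read_scale_options(conf_text):
    """Return {"section.option": value or None} for the options above.

    None means the option is not set in the file, so the service default
    would apply.
    """
    # interpolation=None: treat '%' literally; strict=False: tolerate
    # repeated sections or options in the file.
    parser = configparser.ConfigParser(interpolation=None, strict=False)
    parser.read_string(conf_text)
    found = {}
    for section, option in SCALE_OPTIONS:
        key = f"{section}.{option}"
        if parser.has_option(section, option):
            found[key] = parser.get(section, option)
        else:
            found[key] = None
    return found


if __name__ == "__main__":
    for name, value in read_scale_options(SAMPLE).items():
        print(f"{name} = {value if value is not None else '(unset, default applies)'}")
```

In practice you would pass the contents of your own nova.conf instead of `SAMPLE`.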

Changed in nova:
assignee: nobody → Takashi NATSUME (natsume-takashi)
status: New → In Progress
Matt Riedemann (mriedem) wrote :

I meant to assign this to myself since I have some ideas in mind for what to document, or at least start a document.

Tim Bell (tim-bell) wrote :

Another one from CERN's experience concerns Placement: scheduler.max_placement_results (https://techblog.web.cern.ch/techblog/post/scheduling-optimizations/).

Are you looking to document this only for Nova, or for other areas too? We're gradually tuning Ironic/Neutron for scale as well.

Changed in nova:
assignee: Takashi NATSUME (natsume-takashi) → nobody
status: In Progress → Confirmed
Brin Zhang (zhangbailin) wrote :

I think the docs need to consider this from several angles. Configuration items for nova, glance, neutron, cinder, and related clients should be covered comprehensively to improve stability in large-scale scenarios.

Matt Riedemann (mriedem) wrote :

@Tim, I'm just looking for some simple pointers to options people typically tune at scale for *nova* specifically. The ops guide [1] is something more general for openstack-wide documentation. I'm not trying to boil the ocean here.

[1] https://docs.openstack.org/operations-guide/

Matt Riedemann (mriedem) wrote :

More details from CERN's nova+ironic testing at scale:

https://techblog.web.cern.ch/techblog/post/nova-ironic-at-scale/

Matt Riedemann (mriedem) wrote :

Another one: image cache management when you're not using ceph or NFS on your computes:

https://docs.openstack.org/nova/latest/configuration/config.html#DEFAULT.image_cache_subdirectory_name

The idea is to share that directory across your computes: the first compute to download a new image makes it available to all the others, so they don't have to re-download the same image.

Note that there could be unexpected bugs as a result (see bug 1804262).

tags: added: doc
removed: docs