ceilometer default multiple workers option is unrelated to cpu amount

Bug #1481254 reported by Liusheng
This bug affects 1 person
Affects: Ceilometer
Status: Fix Released
Importance: Medium
Assigned to: Liusheng

Bug Description

All of the *_workers options (api_workers, collector_workers, notification_workers) default to 1, but it seems the default was designed to be the number of CPUs, see [1]. Because the default value is 1, if we don't configure the option, the number of workers is unrelated to the CPU count.

Not sure whether the default value of 1 was set on purpose for some reason.

[1] http://git.openstack.org/cgit/openstack/ceilometer/tree/ceilometer/service.py#n101
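
To illustrate the mismatch described above, here is a minimal sketch, not the actual code in ceilometer/service.py. The function name get_workers and the constant DEFAULT_WORKERS are hypothetical; the sketch only assumes that the fallback to the CPU count is taken when the option is unset.

```python
import multiprocessing


def get_workers(configured):
    """Return the configured worker count, falling back to the CPU count.

    `configured` stands in for an option such as api_workers (hypothetical
    helper, not the ceilometer function). None or a value below 1 is
    treated as "not set".
    """
    if configured is None or configured < 1:
        return multiprocessing.cpu_count()
    return configured


# Assumption taken from the report: the option default is 1.
DEFAULT_WORKERS = 1

# A default of 1 is a valid value, so the CPU fallback is never reached.
workers_with_default = get_workers(DEFAULT_WORKERS)

# Only an unset option would pick up the CPU count.
workers_when_unset = get_workers(None)

print(workers_with_default, workers_when_unset)
```

With a default of 1 the first call always returns 1, which is why the CPU-based fallback has no effect unless the default itself changes.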

Liusheng (liusheng)
Changed in ceilometer:
assignee: nobody → Liusheng (liusheng)
Julien Danjou (jdanjou)
Changed in ceilometer:
status: New → Triaged
importance: Undecided → Medium
milestone: none → liberty-3
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to ceilometer (master)

Fix proposed to branch: master
Review: https://review.openstack.org/209333

Changed in ceilometer:
status: Triaged → In Progress
Chris Dent (cdent) wrote :

I don't really agree with the idea that the default number of workers for all the services should be the number of cpus. I think it is far better to default to something "safe" (which is one) and make sure people are well aware of the option to use more.

If you are running an all-in-one installation and all the various services choose to run numcpus workers, that ends up being a huge number of processes. Not only is that not necessarily good for the machine, it's also not really useful for the services because they may not need to be so broad.
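
As a rough illustration of the process count in the all-in-one case, the sketch below compares one worker per service with numcpus workers per service. The list of services is an assumption made for the example, not an inventory of a real deployment.

```python
import multiprocessing

# Hypothetical all-in-one host; the service names are illustrative only.
services = ["api", "collector", "notification"]

cpus = multiprocessing.cpu_count()

# One worker per service (the current default) versus numcpus workers each.
total_with_default_one = len(services) * 1
total_with_numcpus = len(services) * cpus

print(f"{cpus} CPUs, {len(services)} services")
print(f"default of 1 worker each: {total_with_default_one} processes")
print(f"numcpus workers each:     {total_with_numcpus} processes")
```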

gordon chung (chungg) wrote :

agree with cdent. it should be configured per deployment. similar to how we set it via devstack conf

in regards to your patch, i'm indifferent if we keep this bug open and you change the configuration options to fall under appropriate sections.

Julien Danjou (jdanjou) wrote :

I agree it should be configured per deployment, but using only N% of a computer rather than 100% by default, and trying to be smarter than the kernel at allocating resources, is a terrible idea IMHO.

"Yes, it's slow by default but feel free to make it fast" is not a good option.

Liusheng (liusheng) wrote :

Yes, I am also thinking about this and agree with cdent. How about posting a change that sets the default value in the devstack configuration file, since we don't need as many workers as CPUs in a devstack environment? Defaulting the workers to the CPU count is similar to what other projects' services do. I just found that the method [1] is implemented to set the workers to the CPU count when the option is not set, so I reported this bug :)

[1] http://git.openstack.org/cgit/openstack/ceilometer/tree/ceilometer/service.py#n99

What do you think?

Liusheng (liusheng) wrote :

Outside of development environments, defaulting the workers to the CPU count will make better use of compute resources. Maybe this is the better option, as JD suggested?

gordon chung (chungg) wrote :

i'm going to give an awful political answer but i feel like having Ceilometer running at 100% will cause some person (re: idiot) to say 'typical Ceilometer, it needs so many workers, does not scale'...

the default we set in devstack i believe was half the cpus... maybe that's a smarter default? i'm definitely against having ceilometer default to the number of cpus, as Nova does

Julien Danjou (jdanjou) wrote :

This political answer is too bad to ever be acceptable. Apache has always had something like 20 workers preforked etc, and no sysadmin ever complained (except when you have way too many :).

We shouldn't design software for development platforms or wtf. We should design it for prod. And in production, you want things to be as fast as you can. If there are 8 CPUs, and Ceilo and Nova run on the same system, well, both should be spread over the 8 CPUs and have the ability to use 100% of the system. Reality is that the kernel will handle the load spread itself, and the admin can set priority (or reduce the number of worker) if he wants to fine grain the system.

And in prod it's likely anyway that Ceilometer will be alone, so that question does not even exist I guess. :)

gordon chung (chungg) wrote :

"the admin can set priority (or reduce the number of worker) if he wants to fine grain the system." -- isn't this just the same as now except instead of having admins scale up, they'll need to scale down? either way, they'll need to change it.

i'm not sure ceilometer is run completely on its own node.

that said, if this is the default behaviour in OpenStack, it doesn't bother me to follow.

Julien Danjou (jdanjou) wrote :

Gordon: I guess you can see it that way, but then in real life I never saw an admin scaling down a service because users complain it is too fast. ;)

Julien Danjou (jdanjou) wrote :

Re: default behaviour

That's AFAIK the default behaviour when launching API servers using mod_wsgi, since Apache spreads the requests across all the available CPUs, thank sanity. :)

Chris Dent (cdent) wrote :

This Conversation is Weird.

Some comments within to give a bit more color to where it seems like things stand.

The goal of being able to choose the number of processes is not just to use all of the available cpus, but to be able to tweak to minimize context switching and in some cases even do cpu pinning (see nginx and uwsgi for examples).

If you have N cores and M services which are cpu-usage-heavy and you want to avoid context switching, then each M gets N/M processes.
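
The N/M split can be sketched as below. The helper name workers_per_service is hypothetical, and rounding down with a floor of one worker is an assumption for the case where N does not divide evenly by M.

```python
def workers_per_service(cores, services):
    """Split `cores` evenly across `services` CPU-heavy services.

    Uses integer division and never returns fewer than one worker
    (both choices are assumptions of this sketch).
    """
    if cores < 1 or services < 1:
        raise ValueError("cores and services must be positive")
    return max(1, cores // services)


# Example: 8 cores shared by 2 CPU-heavy services.
share = workers_per_service(8, 2)
print(share)
```

With 8 cores and 2 CPU-heavy services, each service gets 4 processes, so in the ideal case no two busy workers compete for the same core.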

If you have lots of services which are fully I/O dependent then it is a different story: then you _want_ to context switch, and yes, the OS is the smartest thing going for how to determine when that should happen.

If, however, you are memory constrained, which is far too often the case, then you've got a whole 'nother set of reasons for not launching a ton of processes. This is especially the case in OpenStack services, which are notoriously leaky and consumptive when given an "unsafe" request.

Let's take gnocchi-metricd as an example. It doesn't consume a ton of memory, which is nice, but when it is busy it will use all the cpus all the time, consuming both tons of cpu and tons of I/O. If metricd is on its own machine this is _fantastic_. If it's not, that's not so great: other processes are frequently waiting (1 min load average of 40-70 in my testing).

A curious person will find out they can set the workers on metricd and lower it and all will be well.

Empirical (but still anecdotal) evidence makes it pretty clear that people aren't always all that curious when it comes to managing their telemetry services, so some measure of safety may be warranted.

I agree that it is unfortunate that the default position ought to be tuned down, but it is aligned with the introductory cases: spinning up proofs of concept, doing a devstack. Only once you've gained some familiarity with the car should you turn on the nitrous.

Liusheng (liusheng) wrote :

Hi cdent, thanks for such a detailed explanation! Awesome :). So shall we reach a consensus and keep the default value as it was? If so, this bug's description can be changed to moving the workers options to the corresponding service config sections and dropping the unused count_cpu function. I have updated the patch.

OpenStack Infra (hudson-openstack) wrote : Fix merged to ceilometer (master)

Reviewed: https://review.openstack.org/209333
Committed: https://git.openstack.org/cgit/openstack/ceilometer/commit/?id=717bd7d0642bc6b7c1b0aa83df148251148a4d46
Submitter: Jenkins
Branch: master

commit 717bd7d0642bc6b7c1b0aa83df148251148a4d46
Author: liu-sheng <email address hidden>
Date: Wed Aug 5 10:48:40 2015 +0800

    Change and move the workers options to corresponding service section

    Currently, the workers options of collector, api and notification services
    are all located in [DEFAULT] section. This change unify the options name
    and move them to corresponding service configure section.
    Additionally, this change will remove the "workers set by cpu number"
    functionality, because it has no effect if the workers options has default
    value.

    Closes-Bug: #1481254

    Change-Id: Idde86762ab6520d3adcbdd2b86d0f4de3a8517cd
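
For illustration only, the sketch below shows what a per-section layout might look like after this change. The section names [api], [collector] and [notification] and the unified option name workers are assumptions, since the commit message does not spell them out. The sketch uses the standard library configparser rather than the configuration library ceilometer actually uses.

```python
import configparser

# Assumed layout: the section names and the option name `workers` are
# guesses based on the commit message, not copied from ceilometer.
SAMPLE = """
[api]
workers = 2

[collector]
workers = 4

[notification]
workers = 1
"""

parser = configparser.ConfigParser()
parser.read_string(SAMPLE)

# Fall back to 1 when a section or option is missing, mirroring the
# default value that was kept.
workers = {
    section: parser.getint(section, "workers", fallback=1)
    for section in ("api", "collector", "notification")
}

print(workers)
```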

Changed in ceilometer:
status: In Progress → Fix Committed
Thierry Carrez (ttx)
Changed in ceilometer:
status: Fix Committed → Fix Released
Thierry Carrez (ttx)
Changed in ceilometer:
milestone: liberty-3 → 5.0.0