Need to set production-oriented configuration parameters for Nova and Neutron

Bug #1324914 reported by Timur Nurlygayanov
Affects              Status          Importance   Assigned to            Milestone
Fuel for OpenStack   Fix Committed   High         Timur Nurlygayanov
  5.0.x              Fix Released    High         Timur Nurlygayanov
Mirantis OpenStack   Fix Released    High         Timur Nurlygayanov
  5.0.x              Fix Released    High         Timur Nurlygayanov
  5.1.x              Fix Released    High         Timur Nurlygayanov

Bug Description

This issue affects both Fuel 4.1.x and Fuel 5.x; it needs to be fixed in both.

We ship one simple Neutron configuration for every environment: the same settings are used for OpenStack running on 2 VirtualBox VMs and for 50 bare-metal servers. As a result, everything works fine on small environments but fails on production environments: the log files contain many errors. It looks like we can fix this simply by updating several parameters in the Neutron configuration: we need to increase the number of API workers and enlarge the database connection pool:

We need to change the Neutron configuration on the compute nodes.
/etc/neutron/neutron.conf:

[DEFAULT]
...
api_workers = CPUs count
...

[database]
...
max_pool_size = 50

-------------------------------
Also, the Nova API already runs several workers (the number of workers equals the number of CPUs on the controller), but we should calculate the number of workers based on the number of compute nodes, for example:
[DEFAULT]
osapi_compute_workers = CPUs count
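
For reference, a minimal sketch (not part of the proposed Fuel patch; the script is purely illustrative) of how the "CPUs count" placeholder above can be filled in on a controller with plain Python:

import multiprocessing

cpus = multiprocessing.cpu_count()
# Values for /etc/neutron/neutron.conf [DEFAULT] and /etc/nova/nova.conf [DEFAULT]
print("api_workers = %d" % cpus)
print("osapi_compute_workers = %d" % cpus)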

Tags: neutron nova
description: updated
summary: - Need to set production-oriented configuration parameters for Neutron
+ Need to set production-oriented configuration parameters for Nova and
+ Neutron
description: updated
Changed in fuel:
assignee: nobody → Matthew Mosesohn (raytrac3r)
Revision history for this message
Timur Nurlygayanov (tnurlygayanov) wrote :

Matthew, could you please review this and either fix it or assign it to someone else?

Revision history for this message
Vladimir Kuklin (vkuklin) wrote :

I am not sure this is a bug at all. We need a much more detailed description of the environment and the use cases if we want to treat this as a bug. Let's add test scenarios and an actual/expected results section so we can understand what's going wrong.

Revision history for this message
Timur Nurlygayanov (tnurlygayanov) wrote :

Vladimir, the real use case is the following:
Fuel 4.1a, 50 hardware servers with 64 GB of RAM and 16 cores per server.
We want to install an OpenStack cloud with 3 controllers and 47 compute nodes.
Users want to run the Rally benchmark and verify that this cloud works correctly.

Expected result: the Rally benchmark shows that the OpenStack cloud handles 1000+ concurrent users.
Observed result: the Rally benchmark shows that the OpenStack cloud handles 20 concurrent users without errors, and 50 users with errors in the Neutron logs. If we increase the number of workers and the database pool size as described here, we can raise the number of concurrent users to 150 without errors.

See the detailed performance testing report: https://docs.google.com/a/mirantis.com/spreadsheets/d/1zPGz-HHTdv5L9D3sLA_COJuksm4yYK3rzjIyCvKRYMI/edit#gid=577770957

Changed in fuel:
assignee: Matthew Mosesohn (raytrac3r) → Vladimir Kuklin (vkuklin)
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to fuel-library (master)

Fix proposed to branch: master
Review: https://review.openstack.org/100859

Changed in fuel:
assignee: Vladimir Kuklin (vkuklin) → Timur Nurlygayanov (tnurlygayanov)
status: Confirmed → In Progress
Revision history for this message
Timur Nurlygayanov (tnurlygayanov) wrote :

enikanorov:
btw, how about rpc_workers?

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to fuel-library (stable/5.0)

Fix proposed to branch: stable/5.0
Review: https://review.openstack.org/100909

no longer affects: fuel/4.1.x
Revision history for this message
Roman Podoliaka (rpodolyaka) wrote :

tl;dr

1 + compute_nodes_count*5 is not viable for large deployments (hundreds of compute nodes). We should consider a smaller value.

Long version:

As you may know, OpenStack projects use eventlet green threads for handling concurrent requests. In theory, this allows an IO-bound application (e.g. API services like Nova API, which mostly do RPC/DB calls) to process thousands of requests concurrently in a single OS thread, simply by monkey patching all socket operations so that a green thread context switch happens whenever a socket operation blocks.

In practice, not all socket operations can be monkey patched. Notable exceptions are Python modules that use C libraries. In our deployments we use the MySQL-python DB API driver, which delegates all MySQL connectivity to the libmysqlclient C library. Obviously, eventlet can only monkey patch operations on Python socket objects, so calls into MySQL-python/libmysqlclient block the whole process (i.e. if some DB query in Nova takes 2 seconds to complete, all other green threads are blocked for those 2 seconds and no other API requests can be processed).
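
To make this concrete, here is a minimal, self-contained sketch (not taken from this bug report; the busy loop merely stands in for an unpatched libmysqlclient call) showing that one green thread stuck in unpatched C-style work freezes every other green thread in the process:

import time
import eventlet
eventlet.monkey_patch()  # patches socket, time.sleep, etc.

def io_bound(n):
    eventlet.sleep(1)  # cooperative: yields to the event loop
    print("green thread %d finished" % n)

def unpatched_blocking_call():
    # Stand-in for a call into a C library such as libmysqlclient: a busy
    # loop has no yield points, so every other green thread in this process
    # is frozen until it returns.
    end = time.time() + 2
    while time.time() < end:
        pass
    print("blocking call finished")

pool = eventlet.GreenPool()
pool.spawn_n(unpatched_blocking_call)
for i in range(3):
    pool.spawn_n(io_bound, i)
pool.waitall()
# Output: "blocking call finished" appears first; the io_bound green threads
# only get to run after the 2-second blocking call has returned.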

eventlet has long provided a workaround for this which allows blocking calls to be executed in OS threads rather than green threads (the database.use_tpool option in Nova, probably in other projects too). The problem with this approach is that you need a custom eventlet build for the feature to work (https://bitbucket.org/eventlet/eventlet/pull-request/29/ hasn't been merged to the eventlet master branch yet).
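
In configuration terms the workaround mentioned here amounts to a single option (shown only as an illustration; it has no effect without the patched eventlet build referred to above):

[database]
use_tpool = True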

Nova, Neutron, and probably other services also provide another workaround for this problem. By means of the api_workers/osapi_compute_workers/etc. options you can tell the nova-api/neutron-server/etc. process to fork right after startup. So even if one of the processes is blocked, there are a few other forks that can process new requests.

So the problem with the numbers you suggest here is that they are way too large: e.g. for a 200-compute-node deployment, you'll end up running 1001 nova-api/neutron-server forks on the controller nodes, which is just a waste of memory and CPU resources. I'd suggest keeping the number of forks small (2-3x the number of CPU cores).

But we still have to solve the problem with MySQL-python blocking the whole process. So we have a few options here:

0. Do nothing. As long as we are using a *sane* number of forks, API processes should work fine (though, obviously, eventlet won't be as efficient as we want it to be).
1. Build a custom eventlet package and set database.use_tpool to True in config files (Rackspace claim they do that, but I doubt anyone else has tried this).
2. Use a pure-Python DB API driver (e.g. pymysql), but this can result in a performance drop for API services (see the sketch after this list).
3. Use PostgreSQL + psycopg2, but this is not a short-term solution :-)
4. Check if it's possible to make MySQL-python cope well with eventlet (this will probably require a custom MySQL-python) - I have a few ideas on how to do this and am going to spend some time on research. Will post the results here.
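
As an illustration of option 2 (host and credentials are placeholders, not values from this deployment), switching to the pure-Python PyMySQL driver is essentially a change of the SQLAlchemy connection URL in the service configuration:

[database]
# MySQL-python / libmysqlclient (blocks green threads on long queries):
# connection = mysql://nova:password@192.168.0.2/nova
# Pure-Python PyMySQL driver (cooperates with eventlet monkey patching):
connection = mysql+pymysql://nova:password@192.168.0.2/nova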

So whichever way we choose, I believe we should not increase the number of forks that much, but rather keep it small.

Revision history for this message
Roman Podoliaka (rpodolyaka) wrote :

Regarding the maximum number of DB connections in the pool:

max_pool_size = 10 + compute_nodes_count*5 is probably way too large too. Both PostgreSQL and MySQL default to 100 connections (and at least for the former, the default values are there for a reason :-) ). Each nova-api fork will have its own connection pool, as will the other OpenStack API services running on the same node. If, for some reason, one service starts later than the others, it *can* happen that you run out of free MySQL connections, because all of them are already held in the other services' connection pools.
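
A quick back-of-the-envelope illustration (the node and core counts are assumptions chosen for the example, not measurements from this deployment) of how the proposed formulas multiply out against the ~100-connection database default mentioned above:

compute_nodes = 200
controller_cores = 16
forks_per_service = 1 + compute_nodes * 5   # 1001 forks with the proposed formula
pool_per_fork = 10 + compute_nodes * 5      # 1010 pooled connections per fork
print(forks_per_service * pool_per_fork)    # over a million potential connections
print(3 * controller_cores)                 # vs. ~48 forks at 2-3x the core count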

Igor Marnat (imarnat)
Changed in mos:
importance: Undecided → High
assignee: nobody → Timur Nurlygayanov (tnurlygayanov)
milestone: none → 5.1
milestone: 5.1 → 5.0.1
milestone: 5.0.1 → 5.1
milestone: 5.1 → 5.0.1
tags: added: nova
Revision history for this message
OpenStack Infra (hudson-openstack) wrote :

Fix proposed to branch: stable/5.0
Review: https://review.openstack.org/101859

description: updated
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Change abandoned on fuel-library (stable/5.0)

Change abandoned by Timur Nurlygayanov (<email address hidden>) on branch: stable/5.0
Review: https://review.openstack.org/100909
Reason: Fixed in https://review.openstack.org/#/c/101859/

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to fuel-library (master)

Reviewed: https://review.openstack.org/100859
Committed: https://git.openstack.org/cgit/stackforge/fuel-library/commit/?id=9031dcf0a5297874fa4a646bddb29506778003c1
Submitter: Jenkins
Branch: master

commit 9031dcf0a5297874fa4a646bddb29506778003c1
Author: TimurNurlygayanov <email address hidden>
Date: Wed Jun 18 15:11:17 2014 +0400

    Fixed default parameters for Nova and Neutron

    Default values of performance-critical parameters
    changed to more comfortable for production environments.

    Change-Id: Ibfb81279892303166dbafe9f1cdc87927c0fe9a0
    Closes-Bug: #1324914

Changed in fuel:
status: In Progress → Fix Committed
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to fuel-library (stable/5.0)

Reviewed: https://review.openstack.org/101859
Committed: https://git.openstack.org/cgit/stackforge/fuel-library/commit/?id=362440931f0f4dfac8470b4cdd856c48f6103f99
Submitter: Jenkins
Branch: stable/5.0

commit 362440931f0f4dfac8470b4cdd856c48f6103f99
Author: TimurNurlygayanov <email address hidden>
Date: Wed Jun 18 17:48:05 2014 +0400

    Fixed default parameters for Nova and Neutron

    Default values of performance-critical parameters
    changed to more comfortable for production environments.

    Change-Id: Ibfb81279892303166dbafe9f1cdc87927c0fe9a0
    Closes-Bug: #1324914

Dmitry Pyzhov (dpyzhov)
no longer affects: fuel/5.1.x
Revision history for this message
Alexander Gubanov (ogubanov) wrote :
Changed in mos:
status: Fix Committed → Fix Released