Apache can consume too much RAM by spawning too many workers

Bug #1619205 reported by Dmitry Tantsur
This bug affects 2 people
Affects: tripleo
Status: Fix Released
Importance: High
Assigned to: Jiří Stránský
Milestone: newton-rc2

Bug Description

I've kept my overcloud running for less than 24 hours, and now I can't access some API services.

From nova-api.log:

2016-09-01 00:05:57.956 3574 INFO nova.osapi_compute.wsgi.server [-] 192.0.2.18 "GET /v2.1/servers/detail?all_tenants=True&changes-since=2016-08-31T23%3A54%3A57.945183%2B00%3A00&host=overcloud-novacompute-0 HTTP/1.1" status: 503 len: 0 time: 60.0056162
2016-09-01 00:06:25.992 3574 ERROR keystonemiddleware.auth_token [-] Bad response code while validating token: 504
2016-09-01 00:06:25.992 3574 WARNING keystonemiddleware.auth_token [-] Identity response: <html><body><h1>504 Gateway Time-out</h1>
The server didn't respond in time.
</body></html>

2016-09-01 00:06:25.992 3574 CRITICAL keystonemiddleware.auth_token [-] Unable to validate token: Failed to fetch token data from identity server

From the ironic-api journald logs:

Sep 01 00:57:05 overcloud-controller-0 ironic-api[22942]: 2016-09-01 00:57:05.733 23012 INFO eventlet.wsgi.server [-] 192.0.2.18 "GET /v1/nodes/detail HTTP/1.1" status: 503
Sep 01 00:58:07 overcloud-controller-0 ironic-api[22942]: 2016-09-01 00:58:07.739 23012 ERROR keystonemiddleware.auth_token [-] Bad response code while validating token: 504
Sep 01 00:58:07 overcloud-controller-0 ironic-api[22942]: 2016-09-01 00:58:07.740 23012 WARNING keystonemiddleware.auth_token [-] Identity response: <html><body><h1>504 Gate
Sep 01 00:58:07 overcloud-controller-0 ironic-api[22942]: The server didn't respond in time.
Sep 01 00:58:07 overcloud-controller-0 ironic-api[22942]: </body></html>
Sep 01 00:58:07 overcloud-controller-0 ironic-api[22942]: 2016-09-01 00:58:07.740 23012 CRITICAL keystonemiddleware.auth_token [-] Unable to validate token: Failed to fetch

However, keystone.log does not contain anything suspicious, and neither does keystone_wsgi_admin_error.log. Restarting httpd takes more than a minute, but it does solve the problem.

Revision history for this message
Dmitry Tantsur (divius) wrote :

The last logs from httpd before restart:

Aug 31 16:55:53 overcloud-controller-0 python[23879]: ERROR:scss.ast:Function not found: function-exists:1
Aug 31 16:55:53 overcloud-controller-0 python[23879]: ERROR:scss.ast:Function not found: function-exists:1
Aug 31 16:55:53 overcloud-controller-0 python[23879]: ERROR:scss.ast:Function not found: function-exists:1
Aug 31 16:55:53 overcloud-controller-0 python[23879]: ERROR:scss.ast:Function not found: function-exists:1
Aug 31 16:55:53 overcloud-controller-0 python[23879]: ERROR:scss.ast:Function not found: function-exists:1
Aug 31 16:55:53 overcloud-controller-0 python[23879]: ERROR:scss.ast:Function not found: function-exists:1
Aug 31 16:55:53 overcloud-controller-0 python[23879]: ERROR:scss.ast:Function not found: function-exists:1
Aug 31 16:55:53 overcloud-controller-0 python[23879]: ERROR:scss.ast:Function not found: function-exists:1
Aug 31 16:55:53 overcloud-controller-0 python[23879]: ERROR:scss.ast:Function not found: function-exists:1
Aug 31 16:55:53 overcloud-controller-0 python[23879]: ERROR:scss.ast:Function not found: function-exists:1
Aug 31 16:55:53 overcloud-controller-0 python[23879]: ERROR:scss.ast:Function not found: function-exists:1
Aug 31 16:55:53 overcloud-controller-0 python[23879]: SassDeprecationWarning: Can't find any matching rules to extend u'.mdi-code-tags' -- thiswill be fatal in 2.0, unless !
Aug 31 16:55:53 overcloud-controller-0 python[23879]: WARNING:py.warnings:SassDeprecationWarning: Can't find any matching rules to extend u'.mdi-code-tags' -- thiswill be fa
Aug 31 16:55:53 overcloud-controller-0 python[23879]: SassDeprecationWarning: Can't find any matching rules to extend u'.mdi-link-variant-off' -- thiswill be fatal in 2.0, u
Aug 31 16:55:53 overcloud-controller-0 python[23879]: WARNING:py.warnings:SassDeprecationWarning: Can't find any matching rules to extend u'.mdi-link-variant-off' -- thiswil
Aug 31 16:56:00 overcloud-controller-0 python[23879]: ERROR:scss.compiler:Maximum number of supported selectors in Internet Explorer (4095) exceeded!
Aug 31 16:56:01 overcloud-controller-0 python[23879]: Found 'compress' tags in:
Aug 31 16:56:01 overcloud-controller-0 python[23879]: /usr/share/openstack-dashboard/openstack_dashboard/templates/horizon/_conf.html
Aug 31 16:56:01 overcloud-controller-0 python[23879]: /usr/share/openstack-dashboard/openstack_dashboard/templates/_stylesheets.html
Aug 31 16:56:01 overcloud-controller-0 python[23879]: /usr/share/openstack-dashboard/openstack_dashboard/templates/horizon/_scripts.html
Aug 31 16:56:01 overcloud-controller-0 python[23879]: Compressing... done
Aug 31 16:56:01 overcloud-controller-0 python[23879]: Compressed 6 block(s) from 3 template(s) for 2 context(s).

Revision history for this message
Dmitry Tantsur (divius) wrote :

Though the Aug 31 16:55:53 entries are probably from the last environment build, I'm not sure that's the root cause.

Revision history for this message
Marius Cornea (mcornea) wrote :

I've seen this behavior in my environment as well, and from my investigation it got stuck because of the httpd MaxClients limit:

[root@controller-1 ~]# grep -Ri MaxClients /etc/httpd/conf*
/etc/httpd/conf.modules.d/prefork.conf: MaxClients 256

On each of the controllers we can see the number of httpd processes is close to 256:

[heat-admin@controller-0 ~]$ sudo ps axu | grep httpd | wc -l
263

[heat-admin@controller-1 ~]$ sudo ps axu | grep httpd | wc -l
263

[heat-admin@controller-2 ~]$ sudo ps axu | grep httpd | wc -l
263
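
Since each prefork worker is a separate process, resident memory grows roughly with the worker count, which is how ~256 workers can exhaust a small controller. A quick way to sanity-check the footprint on a controller (illustrative sketch only, not part of the original report) is to sum the RSS of all httpd processes:

# Sum the resident set size of all httpd workers and print the total in MiB.
# Rough estimate: shared pages are counted per process, so the true footprint is lower.
sudo ps -o rss= -C httpd | awk '{sum+=$1} END {printf "%.0f MiB total\n", sum/1024}'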

Dmitry Tantsur (divius)
Changed in tripleo:
status: New → Confirmed
Changed in tripleo:
milestone: none → newton-rc2
Revision history for this message
Carlos Camacho (ccamacho) wrote :

I have deployed an HA (3 controllers + 1 compute) env. and am now trying to reproduce it.

Revision history for this message
Carlos Camacho (ccamacho) wrote :

These are the results from my tests.

When deploying the overcloud (3+1, 6GB RAM nodes), the httpd process count went up to 235, at which point the controllers ran out of RAM and hit the issue where the APIs stopped responding (restarting httpd on the controllers made them work again).

The second test was to deploy the same environment but with a 4GB swap file on each OC node; in this case I had enough memory and didn't hit the issue. The httpd processes went up to 100 and then decreased to 23.

I'm waiting for a memory upgrade in my local env to test this with 9GB OC nodes (no swap), as we don't add swap by default in our deployments.
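
For reference, a swap file like the one used in the second test can be added with standard tools. A minimal sketch follows; the exact commands used in the test above are not recorded in this bug:

# Create and enable a 4 GiB swap file on an overcloud node (sketch only)
sudo dd if=/dev/zero of=/swapfile bs=1M count=4096
sudo chmod 600 /swapfile
sudo mkswap /swapfile
sudo swapon /swapfile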

Changed in tripleo:
status: Confirmed → Triaged
Changed in tripleo:
status: Triaged → In Progress
assignee: nobody → Carlos Camacho (ccamacho)
Revision history for this message
Jiří Stránský (jistr) wrote :

Working on this.

Changed in tripleo:
assignee: Carlos Camacho (ccamacho) → Jiří Stránský (jistr)
Revision history for this message
Jiří Stránský (jistr) wrote :

Sorry, I didn't have the page fully refreshed, so I didn't see the assignee field. Will sync up with Carlos.

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to tripleo-heat-templates (master)

Fix proposed to branch: master
Review: https://review.openstack.org/374136

summary:
- Overcloud API services go down after some time due to keystonemiddleware failure
+ Apache can consume too much RAM by spawning too many workers
Changed in tripleo:
status: In Progress → Fix Released
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to tripleo-heat-templates (master)

Reviewed: https://review.openstack.org/374136
Committed: https://git.openstack.org/cgit/openstack/tripleo-heat-templates/commit/?id=b524c0999f447d7931fcacb37e2989d3bf66ad26
Submitter: Jenkins
Branch: master

commit b524c0999f447d7931fcacb37e2989d3bf66ad26
Author: Jiri Stransky <email address hidden>
Date: Wed Sep 21 13:53:19 2016 +0200

    Provide for RAM-constrained environments

    We hit problems in environments which don't have a lot of RAM (e.g. dev
    envs, could be also CI) that Apache ate too much memory due to
    too many worker processes being spawned.

    This commit allows customizing the Apache MaxRequestWorkers and
    ServerLimit directives via Heat parameters. The default stays 256 as
    that's the default in the Puppet module, to be suited for production
    environments with powerful machines. Also low-memory-usage.yaml
    environment file is added, which can be used to make dev/test/CI
    overclouds less memory hungry, where the limits are now set to 32.

    Change-Id: Ibcf1d9c3326df8bb5b380066166c4ae3c4bf8d96
    Co-Authored-By: Carlos Camacho <email address hidden>
    Closes-Bug: #1619205
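
Based on the commit description, RAM-constrained dev/test/CI deployments can include the new low-memory-usage.yaml environment file, or override the limits in a custom environment. A minimal sketch, assuming the usual packaged template location on the undercloud and the parameter names implied by the commit message (ApacheMaxRequestWorkers, ApacheServerLimit):

# Option 1: use the low-memory-usage.yaml environment added by this commit
# (path assumes the packaged template location on the undercloud)
openstack overcloud deploy --templates \
  -e /usr/share/openstack-tripleo-heat-templates/environments/low-memory-usage.yaml

# Option 2: override the limits explicitly in a custom environment file
# (parameter names assumed from the commit description)
cat > ~/apache-workers.yaml <<'EOF'
parameter_defaults:
  ApacheMaxRequestWorkers: 32
  ApacheServerLimit: 32
EOF
openstack overcloud deploy --templates -e ~/apache-workers.yaml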

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/tripleo-heat-templates 5.0.0.0rc2

This issue was fixed in the openstack/tripleo-heat-templates 5.0.0.0rc2 release candidate.
