CI / SSL: 504 Gateway Time-out

Bug #1666072 reported by Emilien Macchi
Affects      Status        Importance  Assigned to     Milestone
Ceilometer   Fix Released  Medium      gordon chung
tripleo      Fix Released  Critical    Unassigned

Bug Description

Something is blocking the promotion jobs. The pingtest doesn't work anymore; it fails to reach the OpenStack endpoints:
http://logs.openstack.org/15/359215/62/check-tripleo/gate-tripleo-ci-centos-7-ovb-updates/d048040/console.html#_2017-02-19_15_06_04_835012

Packaging diff between a job that worked and another that fails:
https://www.diffchecker.com/CPTgZwzQ

I've been investigating the certificates, but they're valid (until Nov 7th 2017, by the way).

I've seen some errors related to heat-agents in the logs.

It could be related to https://review.rdoproject.org/r/#/c/5390/ but I'm not sure.

Revision history for this message
Sagi (Sergey) Shnaidman (sshnaidm) wrote :

In the controller's journal I also see os-collect-config errors, although it could be a red herring:

http://logs.openstack.org/15/359215/62/check-tripleo/gate-tripleo-ci-centos-7-ovb-nonha/136beee//logs/overcloud-controller-0/var/log/messages

Feb 19 11:04:50.201420 overcloud-controller-0 os-collect-config[2462]: [2017-02-19 11:04:50,200] (heat-config) [INFO]
Feb 19 11:04:50.201965 overcloud-controller-0 os-collect-config[2462]: [2017-02-19 11:04:50,201] (heat-config) [ERROR] Error running heat-config-notify. [1]
Feb 19 11:04:50.202294 overcloud-controller-0 os-collect-config[2462]: [2017-02-19 11:04:50,201] (heat-config) [ERROR] [2017-02-19 11:04:50,137] (heat-config-notify) [DEBUG] Signaling to https://192.168.24.2:13808/v1/AUTH_75f8c60dff834958a5ed63d159851421/copy_ssh_key-20852643-f7e2-4212-a31a-4e578176ddd5/c2bb6901-fe07-4c82-94a8-98922d4cf6d3?temp_url_sig=e35fcb8bd1f5596ba4a095f38b7b4ecf01fb1871&temp_url_expires=1487520278 via PUT
Feb 19 11:04:50.202651 overcloud-controller-0 os-collect-config[2462]: Traceback (most recent call last):
Feb 19 11:04:50.202973 overcloud-controller-0 os-collect-config[2462]: File "/usr/bin/heat-config-notify", line 163, in <module>
Feb 19 11:04:50.203291 overcloud-controller-0 os-collect-config[2462]: sys.exit(main(sys.argv, sys.stdin))
Feb 19 11:04:50.203626 overcloud-controller-0 os-collect-config[2462]: File "/usr/bin/heat-config-notify", line 110, in main
Feb 19 11:04:50.203941 overcloud-controller-0 os-collect-config[2462]: headers={'content-type': 'application/json'})
Feb 19 11:04:50.204245 overcloud-controller-0 os-collect-config[2462]: File "/usr/lib/python2.7/site-packages/requests/api.py", line 123, in put
Feb 19 11:04:50.204546 overcloud-controller-0 os-collect-config[2462]: return request('put', url, data=data, **kwargs)
Feb 19 11:04:50.204869 overcloud-controller-0 os-collect-config[2462]: File "/usr/lib/python2.7/site-packages/requests/api.py", line 56, in request
Feb 19 11:04:50.205154 overcloud-controller-0 os-collect-config[2462]: return session.request(method=method, url=url, **kwargs)
Feb 19 11:04:50.205458 overcloud-controller-0 os-collect-config[2462]: File "/usr/lib/python2.7/site-packages/requests/sessions.py", line 475, in request
Feb 19 11:04:50.205776 overcloud-controller-0 os-collect-config[2462]: resp = self.send(prep, **send_kwargs)
Feb 19 11:04:50.206058 overcloud-controller-0 os-collect-config[2462]: File "/usr/lib/python2.7/site-packages/requests/sessions.py", line 596, in send
Feb 19 11:04:50.206347 overcloud-controller-0 os-collect-config[2462]: r = adapter.send(request, **kwargs)
Feb 19 11:04:50.206697 overcloud-controller-0 os-collect-config[2462]: File "/usr/lib/python2.7/site-packages/requests/adapters.py", line 497, in send
Feb 19 11:04:50.206998 overcloud-controller-0 os-collect-config[2462]: raise SSLError(e, request=request)
Feb 19 11:04:50.207336 overcloud-controller-0 os-collect-config[2462]: requests.exceptions.SSLError: [SSL: CERTIFICATE_VERIFY_FAILED] certificate verify failed (_ssl.c:579)
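
As a side note, the certificate-trust hypothesis can be checked from the node with a minimal python-requests sketch (the host and port come from the log above; the URL path is only illustrative):

    # Minimal reproduction sketch, to be run on the overcloud node: the same
    # CERTIFICATE_VERIFY_FAILED appears if the node does not trust the CA that
    # signed the endpoint's certificate.
    import requests

    try:
        requests.get("https://192.168.24.2:13808/", timeout=10)
    except requests.exceptions.SSLError as exc:
        print("CA not trusted by this node:", exc)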

Revision history for this message
Juan Antonio Osorio Robles (juan-osorio-robles) wrote :

Well, it is pretty strange that os-collect-config is trying to access a public endpoint from heat. Shouldn't it be accessing only internal ones?

Also, it is an issue that the overcloud nodes don't trust the CA, since we actually do inject it into the image. So there is something wrong with the CA that's being used. We should fix this; however, it seems to me that we should also fix the fact that it's trying to access a public endpoint from the overcloud.

Revision history for this message
Juan Antonio Osorio Robles (juan-osorio-robles) wrote :

Then again, if the failure is at the pingtest, it seems that this os-collect-config issue might be unrelated; if os-collect-config were failing, we wouldn't have been able to complete the overcloud deploy.

Revision history for this message
Juan Antonio Osorio Robles (juan-osorio-robles) wrote :

The Glance logs point to an error accessing keystone (http://logs.openstack.org/15/359215/62/check-tripleo/gate-tripleo-ci-centos-7-ovb-updates/d048040/logs/overcloud-controller-0/var/log/glance/api.txt.gz#_2017-02-19_15_06_04_811), which is where the 504 is coming from. I still can't figure out why that token request timed out.

Revision history for this message
Alfredo Moralejo (amoralej) wrote :

We are hitting the same issue in the RDO-CI promotion pipeline. In this case we see it in overcloud_deploy_post.log [1], but I'd say it's the same issue. We are failing to collect logs from the compute and controller nodes, so that's not very helpful. The differences between the dlrn repos of the last passing job and the first failing one are at http://paste.openstack.org/show/599614/.

And as Juan Antonio mentioned, the issue seems to occur when glance tries to connect to keystone for token validation. Note that similar errors appear in other services' logs, such as the gnocchi log [2]:

2017-02-19 15:04:14.259 188952 ERROR keystonemiddleware.auth_token [-] Bad response code while validating token: 504
2017-02-19 15:04:14.260 188952 WARNING keystonemiddleware.auth_token [-] Identity response: <html><body><h1>504 Gateway Time-out</h1>
The server didn't respond in time.
</body></html>

In the logs mentioned in comment #1 I see the following error in the apache log:

[Sun Feb 19 14:46:05.764384 2017] [mpm_prefork:error] [pid 188948] AH00161: server reached MaxRequestWorkers setting, consider raising the MaxRequestWorkers setting

I can't find the httpd configuration in the job logs, so I'm not sure whether it is undersized in terms of max processes or whether there is something else going on.

Regarding https://review.rdoproject.org/r/#/c/5390/ , jobs on the ocata release are passing after this change [3]. Additionally, the tests in https://review.openstack.org/#/c/435693/ ran with a similar package combination in terms of heat-templates/heat-agents, and they passed, so I'm inclined to think it's unrelated.

[1] https://ci.centos.org/artifacts/rdo/jenkins-tripleo-quickstart-promote-master-delorean-minimal_pacemaker-489/undercloud/home/stack/
[2] http://logs.openstack.org/15/359215/62/check-tripleo/gate-tripleo-ci-centos-7-ovb-updates/d048040/logs/overcloud-controller-0/var/log/gnocchi/app.txt.gz
[3] https://ci.centos.org/view/rdo/view/promotion-pipeline/job/rdo_trunk-promote-ocata-current-tripleo/27/

Revision history for this message
Sagi (Sergey) Shnaidman (sshnaidm) wrote :

We set ServerLimit in Apache to 32 in the low-memory-usage template [1]; in periodic jobs this causes the error about preforking and MaxRequestWorkers (which defaults to ServerLimit). After I increased it to 300, everything started to work.
But we use low-memory-usage in all our jobs, not only the periodic ones, and everywhere else it's OK. So this is more a symptom of the problem than its cause. Something is taking all the ServerLimit forks and preventing keystone from receiving requests.

[1] https://github.com/openstack/tripleo-heat-templates/blob/master/environments/low-memory-usage.yaml#L15

Changed in tripleo:
milestone: ongoing → ocata-rc2
Revision history for this message
Emilien Macchi (emilienm) wrote :
Revision history for this message
Alfredo Moralejo (amoralej) wrote :

Makes sense; jobs with the HA configuration are passing because the redis issue has no impact when redis runs as a pacemaker resource, but we hit it in the -nonha and -updates jobs.

I just wonder how it passes in the ocata minimal job in RDO-CI, without pacemaker and with the new redis (https://ci.centos.org/job/tripleo-quickstart-promote-ocata-rdo_trunk-minimal/13/). Maybe there is a more aggressive retry policy in master than in stable/ocata.

Revision history for this message
Sagi (Sergey) Shnaidman (sshnaidm) wrote :

Redis was updated and works, but jobs still fail.
ovb-updates job:

http://logs.openstack.org/periodic/periodic-tripleo-ci-centos-7-ovb-updates/7f67ea6/console.html

However, gnocchi still can't connect and may be taking up all the apache resources: http://logs.openstack.org/periodic/periodic-tripleo-ci-centos-7-ovb-updates/7f67ea6/logs/overcloud-controller-0/var/log/gnocchi/metricd.txt.gz

Continuing to investigate.

Changed in tripleo:
assignee: nobody → Sagi (Sergey) Shnaidman (sshnaidm)
Revision history for this message
Sagi (Sergey) Shnaidman (sshnaidm) wrote :

After investigating, I found some problems with the apache workers on the controller: http://paste.openstack.org/show/599888/ (the same file is attached).

It seems all 32 possible apache "prefork" workers (ServerLimit is 32) are busy with the request "GET /v1/capabilities/ HTTP/1.1", which is directed at gnocchi_wsgi:
/var/log/httpd/gnocchi_wsgi_access.log:192.168.24.11 - - [21/Feb/2017:18:27:26 +0000] "GET /v1/capabilities/ HTTP/1.1" 500 531 "-" "ceilometer-agent-notification keystoneauth1/2.18.0 python-requests/2.11.1 CPython/2.7.5"

While processing them in parallel, apache can't fork more workers (because it has reached the limit), so no new connections are accepted, including those from nova, neutron, etc. used by the pingtest. Restarting apache usually fixes it.
I tried to run the same jobs without gnocchi enabled and they passed:
https://review.openstack.org/#/c/436497/

So the problem is most likely in gnocchi or in its interaction with ceilometer.
I see a few possible solutions:
1) Investigate and solve it in the gnocchi/ceilometer projects.
2) Increase the apache ServerLimit to something higher than 32 in the low-memory-usage template: https://github.com/openstack/tripleo-heat-templates/blob/c99c48b84e20925b4f4b728e9b103d6c8bcb3d11/environments/low-memory-usage.yaml#L15-L15
3) Configure apache to use mpm_event instead of mpm_prefork.

Revision history for this message
Julien Danjou (jdanjou) wrote :

Hypothesis: it's possible that Ceilometer requests /v1/capabilities with a Keystone token. Since several processes are doing that in parallel, they can consume all 32 request slots doing so. The problem is that, to authorize /v1/capabilities, I imagine the keystone middleware in Gnocchi asks Keystone to validate the token, and then… it blocks forever since there are no more Apache workers available.

Does that sound reasonable?
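
To illustrate the shape of this, here is a minimal, self-contained sketch (plain Python for illustration, not TripleO, Ceilometer or Gnocchi code) of a worker pool deadlocking when every worker needs another worker from the same pool:

    # Illustrative only: the pool stands in for Apache's prefork workers
    # (ServerLimit), the outer tasks for the /v1/capabilities requests, and the
    # inner tasks for the token validation that keystonemiddleware sends back
    # through the same Apache instance.
    import concurrent.futures as cf

    pool = cf.ThreadPoolExecutor(max_workers=2)  # stands in for ServerLimit=32

    def validate_token():
        return "ok"  # in reality, the Keystone call, which also needs a worker

    def capabilities_request():
        # Each handler blocks until token validation gets a worker of its own.
        return pool.submit(validate_token).result()

    # Fill every worker with an outer request: the inner validations can never
    # be scheduled, so each .result() below times out instead of returning.
    futures = [pool.submit(capabilities_request) for _ in range(2)]
    for f in futures:
        try:
            print(f.result(timeout=2))
        except cf.TimeoutError:
            print("deadlocked: every worker waits on a validation that never runs")

    # Unblock the stuck workers so the demo exits cleanly (Python 3.9+).
    pool.shutdown(wait=False, cancel_futures=True)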

Revision history for this message
Sagi (Sergey) Shnaidman (sshnaidm) wrote :

Meanwhile, I submitted a patch increasing the Apache limit to 100; it passed with the consistent repos: https://review.openstack.org/#/c/436650/

Revision history for this message
Sagi (Sergey) Shnaidman (sshnaidm) wrote :

@Julien, could it be that there are 32 parallel requests from ceilometer at once? Because apache forks a new worker for each parallel request.

For example this job:
http://logs.openstack.org/23/436623/1/check-tripleo/gate-tripleo-ci-centos-7-ovb-updates/9c7556c/console.html#_2017-02-21_22_33_43_449098

I see a lot of nearly simultaneous requests to keystone that receive a 503 error:
http://logs.openstack.org/23/436623/1/check-tripleo/gate-tripleo-ci-centos-7-ovb-updates/9c7556c/logs/overcloud-controller-0/var/log/gnocchi/app.txt.gz#_2017-02-21_21_58_43_239

the same story with ceilometer:
http://logs.openstack.org/23/436623/1/check-tripleo/gate-tripleo-ci-centos-7-ovb-updates/9c7556c/logs/overcloud-controller-0/var/log/ceilometer/central.txt.gz

Ceilometer is bombarding gnocchi with capabilities requests:
http://logs.openstack.org/23/436623/1/check-tripleo/gate-tripleo-ci-centos-7-ovb-updates/9c7556c/logs/overcloud-controller-0/var/log/httpd/gnocchi_wsgi_access.txt.gz

Since it's "capabilities" requests that blocked all preforks in apache, could it be that gnocchi can't neither serve them nor drop? And just blocking on them. Maybe some problem with serving capabilities request?
It may be your case, when just no slots to turn to keystone, anyway there are few problems:

1) Too many identical parallel requests from ceilometer to gnocchi. That's definitely a bug.
2) Gnocchi shouldn't block on the request; it needs a timeout, after which it should release the apache worker.

We don't have problems running these jobs on master with the current-tripleo repo, with gnocchi:

openstack-gnocchi-api-3.1.1-0.20170217232151.54631be.el7.centos.noarch
openstack-gnocchi-common-3.1.1-0.20170217232151.54631be.el7.centos.noarch
openstack-gnocchi-indexer-sqlalchemy-3.1.1-0.20170217232151.54631be.el7.centos.noarch
openstack-gnocchi-metricd-3.1.1-0.20170217232151.54631be.el7.centos.noarch
openstack-gnocchi-statsd-3.1.1-0.20170217232151.54631be.el7.centos.noarch
puppet-gnocchi-10.3.0-0.20170208142639.5ec568d.el7.centos.noarch
python2-gnocchiclient-3.0.0-0.20170208105743.32b62fb.el7.centos.noarch
python-gnocchi-3.1.1-0.20170217232151.54631be.el7.centos.noarch

ceilometer:

openstack-ceilometer-api-8.0.1-0.20170217042451.83412c8.el7.centos.noarch
openstack-ceilometer-central-8.0.1-0.20170217042451.83412c8.el7.centos.noarch
openstack-ceilometer-collector-8.0.1-0.20170217042451.83412c8.el7.centos.noarch
openstack-ceilometer-common-8.0.1-0.20170217042451.83412c8.el7.centos.noarch
openstack-ceilometer-compute-8.0.1-0.20170217042451.83412c8.el7.centos.noarch
openstack-ceilometer-notification-8.0.1-0.20170217042451.83412c8.el7.centos.noarch
openstack-ceilometer-polling-8.0.1-0.20170217042451.83412c8.el7.centos.noarch
puppet-ceilometer-10.3.0-0.20170215160028.c3d330e.el7.centos.noarch
python2-ceilometerclient-2.8.0-0.20170206121840.658d7e2.el7.centos.noarch
python-ceilometer-8.0.1-0.20170217042451.83412c8.el7.centos.noarch
python-ceilometermiddleware-1.0.0-0.20170206160402.87bd49b.el7.centos.noarch

And we do have problems with the consistent repo, with these packages:

Gnocchi:
openstack-gnocchi-api-3.1.1-0.20170218182819.ca27a17.el7.centos.noarch
openstack-gnocchi-common-3.1.1-0.20170218182819.ca27a17.el7.centos.noarch
openstack-g...


Revision history for this message
Julien Danjou (jdanjou) wrote :

I guess you did not completely understand the hypothesis I formulated. I'll reply to your comments:

> Since it's the "capabilities" requests that blocked all the prefork workers in apache, could it be that gnocchi can neither serve them nor drop them,
> and just blocks on them? Maybe there is some problem with serving the capabilities request?

Yes, it can't serve them because it has to verify the authentication through Keystone, but Keystone is unreachable since Apache is already full.
Capabilities is static data, so the problem is not *in* Gnocchi; it's just before that, in the Keystone middleware it uses for authentication.

> 1) Too many identical parallel requests from ceilometer to gnocchi. That's definitely a bug.

Not really. It's just Ceilometer spawning threads to handle messages concurrently, and each thread needs to make this request. I'll try to see if it can be lowered.

> Gnocchi shouldn't block on the request; it needs a timeout, after which it should release the apache worker.

This is not Gnocchi blocking. This is the Keystone auth token middleware.

At least, if my scenario is correct and unless proven otherwise. :-)

tags: removed: alert promotion-blocker
Changed in tripleo:
status: Triaged → Fix Released
Revision history for this message
Sagi (Sergey) Shnaidman (sshnaidm) wrote :

@Julien, thanks, now it's clearer. Even if this is exactly what happens (unless proven otherwise):

> It's just Ceilometer spawning threads to handle messages concurrently, and each thread needs to make this request. I'll try to see if it can be lowered.

That's what I call a bug. A service shouldn't issue tens of identical requests; it's the wrong behaviour/design.

Also, please note my comments about the gnocchi/ceilometer versions: everything works without problems with master current-tripleo, and only the versions from the consistent repo have these problems, so it seems like recent changes caused this.

Revision history for this message
Sagi (Sergey) Shnaidman (sshnaidm) wrote :

@Emilien, @Julien - if we close this bug, should we open a bug in gnocchi or ceilometer about it?

Revision history for this message
Julien Danjou (jdanjou) wrote :

Sagi: I already sent https://review.openstack.org/436863 to remove this behavior from Ceilometer; we'll see.

Revision history for this message
gordon chung (chungg) wrote :

Do you install/start gnocchi before ceilometer? You should do that.

Revision history for this message
Sagi (Sergey) Shnaidman (sshnaidm) wrote :

@gordon chung, even if not, gnocchi and ceilometer should take that into account and handle this case properly.

Revision history for this message
Sagi (Sergey) Shnaidman (sshnaidm) wrote :

I created a job which mimics the periodic job and depends on the ceilometer patch mentioned above. It also runs without the apache workaround. Let's see if it helps: https://review.openstack.org/#/c/437008/

Revision history for this message
gordon chung (chungg) wrote :

Well, what would you expect to happen if you ran a service and mysql wasn't started? It seems like we should replicate that behaviour.

Revision history for this message
Sagi (Sergey) Shnaidman (sshnaidm) wrote :

@gordon chung, I'd expect the service to check within a reasonable timeout (60 seconds, for example) whether it can connect, and to report it if it can't. But definitely not to bring down everything around it.
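
For what it's worth, a minimal sketch of the behaviour I mean (a hypothetical helper, not actual ceilometer code): probe the backend once with a bounded timeout and report, rather than letting every worker block indefinitely:

    # Hypothetical helper, not actual ceilometer code: check once, with a
    # bounded timeout, whether the gnocchi API answers, and log a warning if not.
    import logging
    import requests

    LOG = logging.getLogger(__name__)

    def gnocchi_reachable(endpoint, timeout=60):
        try:
            requests.get(endpoint.rstrip("/") + "/v1/capabilities", timeout=timeout)
            return True
        except requests.exceptions.RequestException as exc:
            LOG.warning("gnocchi not reachable within %ss: %s", timeout, exc)
            return False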

Revision history for this message
gordon chung (chungg) wrote :

OK, we should do that then. Let's figure out how to do that in gnocchi/ceilometer.

Revision history for this message
Sagi (Sergey) Shnaidman (sshnaidm) wrote :

gate-tripleo-ci-centos-7-ovb-nonha SUCCESS in 2h 19m 58s
gate-tripleo-ci-centos-7-ovb-updates SUCCESS in 2h 33m 11s

Seems like it helped.

gordon chung (chungg)
no longer affects: gnocchi
Changed in ceilometer:
assignee: nobody → gordon chung (chungg)
status: New → In Progress
Revision history for this message
gordon chung (chungg) wrote :

Can you test against https://review.openstack.org/#/c/437466/?

You can keep the previous fix too, but please validate ^ while you can still reproduce the issue.

gordon chung (chungg)
Changed in ceilometer:
importance: Undecided → Medium
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to ceilometer (master)

Reviewed: https://review.openstack.org/437466
Committed: https://git.openstack.org/cgit/openstack/ceilometer/commit/?id=7c6160e3a19ed50562e51137b2779c328a750bce
Submitter: Jenkins
Branch: master

commit 7c6160e3a19ed50562e51137b2779c328a750bce
Author: gord chung <email address hidden>
Date: Thu Feb 23 15:49:57 2017 +0000

    support gnocchi timeout

    set gnocchi timeout to 6.05 because requests recommends multiple
    of 3[1]. also, it seems safe to assume gnocchi is not functioning well
    and will not scale if we can't post in 6s. in my benchmark, it took
    under 0.25s to post 5000 points. the only thing that took >3s was
    getting >80K datapoints and this workflow is only pushing, and only
    getting small chunks occasionally. making it 6s because it seems
    keystone adds decent overhead.

    do not set timeout for upgrade because it might take longer than normal
    workflow

    [1] http://docs.python-requests.org/en/master/user/advanced/#timeouts

    Change-Id: I41b5e4f0c0f54b7f4b627d9116e63096ca2decf9
    Closes-Bug: #1666072
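
For context, the timeout from the commit above maps to a python-requests call along these lines (a sketch with a placeholder endpoint, not the actual gnocchiclient code path):

    # Sketch only, placeholder URL: the requests documentation suggests a
    # timeout slightly larger than a multiple of 3 (the default TCP
    # retransmission window), which is where the 6.05 above comes from.
    import requests

    GNOCCHI_TIMEOUT = 6.05  # seconds

    resp = requests.get("https://gnocchi.example.com/v1/capabilities",
                        timeout=GNOCCHI_TIMEOUT)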

Changed in ceilometer:
status: In Progress → Fix Released
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix merged to ceilometer (master)

Reviewed: https://review.openstack.org/436863
Committed: https://git.openstack.org/cgit/openstack/ceilometer/commit/?id=79859c7dedd4895b395b3dabde454dda772af2ad
Submitter: Jenkins
Branch: master

commit 79859c7dedd4895b395b3dabde454dda772af2ad
Author: Julien Danjou <email address hidden>
Date: Wed Feb 22 10:31:17 2017 +0100

    Revert "verify gnocchi connection before processing"

    This reverts commit 152aa1af01769841956da1bc0225bde88b80af08.

    This makes Ceilometer spam Gnocchi at startup for no good reason. Gnocchi being
    started or not at the start of Ceilometer should not change anything, as it can
    become down at anytime and Ceilometer ought to handle that case anyway.

    Related-Bug: #1666072
    Change-Id: Ic03ee3343f68428775f2dd780a26298ab0434800

Revision history for this message
Emilien Macchi (emilienm) wrote :
Changed in tripleo:
milestone: ocata-rc2 → pike-2
assignee: Sagi (Sergey) Shnaidman (sshnaidm) → nobody
tags: added: alert
Changed in tripleo:
status: Fix Released → Triaged
Revision history for this message
Emilien Macchi (emilienm) wrote :
Revision history for this message
Emilien Macchi (emilienm) wrote :

2017-04-16 01:52:18.643435 | REQ: curl -g -i -X GET https://192.168.24.2:13004/v1/c1336c5e08744b72a59ac12f14986197/stacks/overcloud/events?marker=b126e655-ecd7-4f9f-9841-f3ecca90364d&nested_depth=2&sort_dir=asc -H "User-Agent: python-heatclient" -H "Accept: application/json" -H "X-Auth-Token: {SHA1}47c9501746fdfaf0b2ac7e905ac21139b667eef4"
2017-04-16 01:54:18.646629 | "GET /v1/c1336c5e08744b72a59ac12f14986197/stacks/overcloud/events?marker=b126e655-ecd7-4f9f-9841-f3ecca90364d&nested_depth=2&sort_dir=asc HTTP/1.1" 504 None
2017-04-16 01:54:18.649440 | RESP: [504] Cache-Control: no-cache Connection: close Content-Type: text/html
2017-04-16 01:54:18.649526 | RESP BODY: Omitted, Content-Type is set to text/html. Only application/json responses have their bodies logged.
2017-04-16 01:54:18.649550 |
2017-04-16 01:54:18.682452 | ERROR: <html><body><h1>504 Gateway Time-out</h1>
2017-04-16 01:54:18.682534 | The server didn't respond in time.
2017-04-16 01:54:18.682564 | </body></html>
2017-04-16 01:54:18.682594 | Traceback (most recent call last):
2017-04-16 01:54:18.682640 | File "/usr/lib/python2.7/site-packages/cliff/app.py", line 400, in run_subcommand
2017-04-16 01:54:18.682669 | result = cmd.run(parsed_args)
2017-04-16 01:54:18.682715 | File "/usr/lib/python2.7/site-packages/osc_lib/command/command.py", line 41, in run
2017-04-16 01:54:18.682749 | return super(Command, self).run(parsed_args)
2017-04-16 01:54:18.682791 | File "/usr/lib/python2.7/site-packages/cliff/command.py", line 90, in run
2017-04-16 01:54:18.682823 | return self.take_action(parsed_args) or 0
2017-04-16 01:54:18.682877 | File "/usr/lib/python2.7/site-packages/tripleoclient/v1/overcloud_deploy.py", line 865, in take_action
2017-04-16 01:54:18.682919 | self._deploy_tripleo_heat_templates_tmpdir(stack, parsed_args)
2017-04-16 01:54:18.682986 | File "/usr/lib/python2.7/site-packages/tripleoclient/v1/overcloud_deploy.py", line 368, in _deploy_tripleo_heat_templates_tmpdir
2017-04-16 01:54:18.683012 | new_tht_root, tht_root)
2017-04-16 01:54:18.683074 | File "/usr/lib/python2.7/site-packages/tripleoclient/v1/overcloud_deploy.py", line 442, in _deploy_tripleo_heat_templates
2017-04-16 01:54:18.683102 | parsed_args.run_validations)
2017-04-16 01:54:18.683165 | File "/usr/lib/python2.7/site-packages/tripleoclient/v1/overcloud_deploy.py", line 454, in _try_overcloud_deploy_with_compat_yaml
2017-04-16 01:54:18.683189 | run_validations)
2017-04-16 01:54:18.683243 | File "/usr/lib/python2.7/site-packages/tripleoclient/v1/overcloud_deploy.py", line 233, in _heat_deploy
2017-04-16 01:54:18.683272 | run_validations=run_validations)
2017-04-16 01:54:18.683351 | File "/usr/lib/python2.7/site-packages/tripleoclient/workflows/deployment.py", line 83, in deploy_and_wait
2017-04-16 01:54:18.683401 | orchestration_client, plan_name, marker, action, verbose_events)
2017-04-16 01:54:18.683453 | File "/usr/lib/python2.7/site-packages/tripleoclient/utils.py", line 190, in wait_for_stack_ready
2017-04-16 01:54:18.683491 | poll_period=5, marker=marker, out=out, nested_depth=2)
2017-04-16 01:54:18.683545 | File "/usr/lib/python2.7/site-packag...


Revision history for this message
Zane Bitter (zaneb) wrote :

Searching through the heat-api logs for "/v1/c1336c5e08744b72a59ac12f14986197/stacks/overcloud/events", there's one instance where it takes an inordinately long time:

2017-04-16 01:52:18.646 9644 DEBUG heat.api.middleware.version_negotiation [req-f2262aba-43f0-41f1-9903-a4c4cf02bb75 - admin - - -] Processing request: GET /v1/c1336c5e08744b72a59ac12f14986197/stacks/overcloud/events Accept: application/json process_request /usr/lib/python2.7/site-packages/heat/api/middleware/version_negotiation.py:50
2017-04-16 01:52:18.657 9644 DEBUG heat.common.wsgi [req-e15ccae1-86fe-4dca-b00d-347a9608d7d5 - admin - - -] Calling <heat.api.openstack.v1.stacks.StackController object at 0x7f55232cbb50> : lookup __call__ /usr/lib/python2.7/site-packages/heat/common/wsgi.py:833
2017-04-16 01:52:18.660 9644 DEBUG oslo_messaging._drivers.amqpdriver [req-e15ccae1-86fe-4dca-b00d-347a9608d7d5 - admin - - -] CALL msg_id: 9d0be2b3670e4ce7b91246172170f2be exchange 'heat' topic 'engine' _send /usr/lib/python2.7/site-packages/oslo_messaging/_drivers/amqpdriver.py:489
...
2017-04-16 01:56:06.717 9644 DEBUG oslo_messaging._drivers.amqpdriver [-] received reply msg_id: 9d0be2b3670e4ce7b91246172170f2be __call__ /usr/lib/python2.7/site-packages/oslo_messaging/_drivers/amqpdriver.py:346

So nearly 4 minutes to respond to that RPC message. Note that this isn't even the RPC call to fetch the events - it's the one to look up the ID of the stack.

Switching to the heat-engine logs, we can see that the engine didn't in fact receive the message until just before it replied (very promptly, within 100ms):

2017-04-16 01:56:06.627 9060 DEBUG oslo_messaging._drivers.amqpdriver [-] received message msg_id: 9d0be2b3670e4ce7b91246172170f2be reply to reply_e05126dea57b4da0ae7b4aee827d2fb0 __call__ /usr/lib/python2.7/site-packages/oslo_messaging/_drivers/amqpdriver.py:213
2017-04-16 01:56:06.714 9060 DEBUG oslo_messaging._drivers.amqpdriver [req-e15ccae1-86fe-4dca-b00d-347a9608d7d5 - admin - - -] sending reply msg_id: 9d0be2b3670e4ce7b91246172170f2be reply queue: reply_e05126dea57b4da0ae7b4aee827d2fb0 time elapsed: 0.0844531349994s _send_reply /usr/lib/python2.7/site-packages/oslo_messaging/_drivers/amqpdriver.py:90

The logs don't reveal any obvious reason for the delay. PIDs 9062 and 9063 are busy processing stuff during this time (although I wouldn't have expected them to exhaust their thread pools). PID 9061 does some stuff at the beginning of that period, and then appears to be mostly idle.

PID 9060, which eventually receives the message, logs nothing between just before the message is sent:

2017-04-16 01:52:18.117 9060 DEBUG heat.engine.parameter_groups [req-6d3ebf36-ee75-4849-b318-e4fe5a3ca369 admin admin - - -] Validating Parameter Groups: OS::project_id, OS::stack_id validate /usr/lib/python2.7/site-packages/heat/engine/parameter_groups.py:42

and shortly before it is received:

2017-04-16 01:56:05.745 9060 INFO heat.engine.resource [req-6d3ebf36-ee75-4849-b318-e4fe5a3ca369 admin admin - - -] Validating TemplateResource "57"

which is suspicious. That could explain it if:

* There's a context switch after PID 9060 has claimed the message, but before it logs/starts processing it ...


Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix proposed to puppet-tripleo (master)

Related fix proposed to branch: master
Review: https://review.openstack.org/457264

Changed in tripleo:
assignee: nobody → Emilien Macchi (emilienm)
status: Triaged → In Progress
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix merged to puppet-tripleo (master)

Reviewed: https://review.openstack.org/457264
Committed: https://git.openstack.org/cgit/openstack/puppet-tripleo/commit/?id=6cb95e6a69677755a070cc73062b70c84452824e
Submitter: Jenkins
Branch: master

commit 6cb95e6a69677755a070cc73062b70c84452824e
Author: Emilien Macchi <email address hidden>
Date: Mon Apr 17 11:30:23 2017 -0400

    HAproxy/heat_api: increase timeout to 10m

    Default timeout is 2min but it doesn't reflect the rpc_response_timeout
    value that we set in THT and instack-undercloud, which is 600 (10 min).

    In some cases (in low-memory environments), Heat needs more than 2
    minutes to reply to the client, when deploying the overcloud.
    It makes sense to increase the timeout to the value of rpc_timeout to
    give a chance to Heat to reply to the client, otherwise HAproxy will
    kill the connection and send 504 to the client.

    Depends-On: I9669d40d86d762101734704fcef153e360767690
    Change-Id: I32c71fe7930c8798d306046d6933e4b20c22740c
    Related-Bug: 1666072

Changed in tripleo:
status: In Progress → Fix Released
tags: removed: alert
Revision history for this message
Steve Baker (steve-stevebaker) wrote :

RDOCloud is experiencing 504s when concurrently deleting many stacks, so I'd like to use this bug to backport this change as far back as newton.

tags: added: newton-backport-potential ocata-backport-potential
Revision history for this message
Emilien Macchi (emilienm) wrote :
Changed in tripleo:
status: Fix Released → Triaged
milestone: pike-2 → pike-3
tags: added: alert
Revision history for this message
Emilien Macchi (emilienm) wrote :
Changed in tripleo:
assignee: Emilien Macchi (emilienm) → nobody
Revision history for this message
Emilien Macchi (emilienm) wrote :
tags: added: promotion-blocker
Revision history for this message
Emilien Macchi (emilienm) wrote :
Changed in tripleo:
status: Triaged → Fix Released
milestone: pike-3 → none
tags: removed: alert promotion-blocker
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/ceilometer 9.0.0

This issue was fixed in the openstack/ceilometer 9.0.0 release.
