CI / SSL: 504 Gateway Time-out

Bug #1666072 reported by Emilien Macchi
Affects      Status        Importance  Assigned to     Milestone
Ceilometer   Fix Released  Medium      gordon chung
tripleo      Fix Released  Critical    Unassigned

Bug Description

Something is blocking the promotion jobs. The pingtest doesn't work anymore; it fails to reach the OpenStack endpoints:
http://logs.openstack.org/15/359215/62/check-tripleo/gate-tripleo-ci-centos-7-ovb-updates/d048040/console.html#_2017-02-19_15_06_04_835012

Packaging diff between a job that worked and another that fails:
https://www.diffchecker.com/CPTgZwzQ

I've been investigating the certificates, but they're valid (until Nov 7th 2017, by the way).

I've seen some errors related to heat-agents in the logs.

It could be related to https://review.rdoproject.org/r/#/c/5390/ but I'm not sure.

Revision history for this message
Sagi (Sergey) Shnaidman (sshnaidm) wrote :

In the controller's journal I also see os-collect-config errors, although it could be a red herring:

http://logs.openstack.org/15/359215/62/check-tripleo/gate-tripleo-ci-centos-7-ovb-nonha/136beee//logs/overcloud-controller-0/var/log/messages

Feb 19 11:04:50.201420 overcloud-controller-0 os-collect-config[2462]: [2017-02-19 11:04:50,200] (heat-config) [INFO]
Feb 19 11:04:50.201965 overcloud-controller-0 os-collect-config[2462]: [2017-02-19 11:04:50,201] (heat-config) [ERROR] Error running heat-config-notify. [1]
Feb 19 11:04:50.202294 overcloud-controller-0 os-collect-config[2462]: [2017-02-19 11:04:50,201] (heat-config) [ERROR] [2017-02-19 11:04:50,137] (heat-config-notify) [DEBUG] Signaling to https://192.168.24.2:13808/v1/AUTH_75f8c60dff834958a5ed63d159851421/copy_ssh_key-20852643-f7e2-4212-a31a-4e578176ddd5/c2bb6901-fe07-4c82-94a8-98922d4cf6d3?temp_url_sig=e35fcb8bd1f5596ba4a095f38b7b4ecf01fb1871&temp_url_expires=1487520278 via PUT
Feb 19 11:04:50.202651 overcloud-controller-0 os-collect-config[2462]: Traceback (most recent call last):
Feb 19 11:04:50.202973 overcloud-controller-0 os-collect-config[2462]: File "/usr/bin/heat-config-notify", line 163, in <module>
Feb 19 11:04:50.203291 overcloud-controller-0 os-collect-config[2462]: sys.exit(main(sys.argv, sys.stdin))
Feb 19 11:04:50.203626 overcloud-controller-0 os-collect-config[2462]: File "/usr/bin/heat-config-notify", line 110, in main
Feb 19 11:04:50.203941 overcloud-controller-0 os-collect-config[2462]: headers={'content-type': 'application/json'})
Feb 19 11:04:50.204245 overcloud-controller-0 os-collect-config[2462]: File "/usr/lib/python2.7/site-packages/requests/api.py", line 123, in put
Feb 19 11:04:50.204546 overcloud-controller-0 os-collect-config[2462]: return request('put', url, data=data, **kwargs)
Feb 19 11:04:50.204869 overcloud-controller-0 os-collect-config[2462]: File "/usr/lib/python2.7/site-packages/requests/api.py", line 56, in request
Feb 19 11:04:50.205154 overcloud-controller-0 os-collect-config[2462]: return session.request(method=method, url=url, **kwargs)
Feb 19 11:04:50.205458 overcloud-controller-0 os-collect-config[2462]: File "/usr/lib/python2.7/site-packages/requests/sessions.py", line 475, in request
Feb 19 11:04:50.205776 overcloud-controller-0 os-collect-config[2462]: resp = self.send(prep, **send_kwargs)
Feb 19 11:04:50.206058 overcloud-controller-0 os-collect-config[2462]: File "/usr/lib/python2.7/site-packages/requests/sessions.py", line 596, in send
Feb 19 11:04:50.206347 overcloud-controller-0 os-collect-config[2462]: r = adapter.send(request, **kwargs)
Feb 19 11:04:50.206697 overcloud-controller-0 os-collect-config[2462]: File "/usr/lib/python2.7/site-packages/requests/adapters.py", line 497, in send
Feb 19 11:04:50.206998 overcloud-controller-0 os-collect-config[2462]: raise SSLError(e, request=request)
Feb 19 11:04:50.207336 overcloud-controller-0 os-collect-config[2462]: requests.exceptions.SSLError: [SSL: CERTIFICATE_VERIFY_FAILED] certificate verify failed (_ssl.c:579)
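
As a side note, the certificate-trust hypothesis can be checked from the node with a minimal python-requests sketch (the host and port come from the log above; the URL path is only illustrative):

    # Minimal reproduction sketch, to be run on the overcloud node: the same
    # CERTIFICATE_VERIFY_FAILED appears if the node does not trust the CA that
    # signed the endpoint's certificate.
    import requests

    try:
        requests.get("https://192.168.24.2:13808/", timeout=10)
    except requests.exceptions.SSLError as exc:
        print("CA not trusted by this node:", exc)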

Revision history for this message
Juan Antonio Osorio Robles (juan-osorio-robles) wrote :

Well, it is pretty strange that os-collect-config is trying to access a public endpoint from heat. Shouldn't it be accessing only internal ones?

Also, it is an issue that the overcloud nodes don't trust the CA, since we actually do inject it into the image. So there is something wrong with the CA that's being used. We should fix this; however, it seems to me that we should also fix the fact that it's trying to access a public endpoint from the overcloud.

Revision history for this message
Juan Antonio Osorio Robles (juan-osorio-robles) wrote :

Then again, if the failure is at the pingtest, it seems that this os-collect-config issue might be unrelated; if os-collect-config were failing, we wouldn't have been able to complete the overcloud deploy.

Revision history for this message
Juan Antonio Osorio Robles (juan-osorio-robles) wrote :

The Glance logs point to an error accessing keystone (http://logs.openstack.org/15/359215/62/check-tripleo/gate-tripleo-ci-centos-7-ovb-updates/d048040/logs/overcloud-controller-0/var/log/glance/api.txt.gz#_2017-02-19_15_06_04_811), which is where the 504 is coming from. I still can't figure out why that token request timed out.

Revision history for this message
Alfredo Moralejo (amoralej) wrote :

We are hitting the same issue in the RDO-CI promotion pipeline. In this case we see it in overcloud_deploy_post.log [1], but I'd say it's the same issue. We are failing to collect logs from the compute and controller nodes, so that's not very helpful. The differences between the dlrn repos of the last passing job and the first failing one are at http://paste.openstack.org/show/599614/.

And as Juan Antonio mentioned, the issue seems to occur when glance tries to connect to keystone for token validation. Note that similar errors appear in other services' logs, such as the gnocchi log [2]:

2017-02-19 15:04:14.259 188952 ERROR keystonemiddleware.auth_token [-] Bad response code while validating token: 504
2017-02-19 15:04:14.260 188952 WARNING keystonemiddleware.auth_token [-] Identity response: <html><body><h1>504 Gateway Time-out</h1>
The server didn't respond in time.
</body></html>

In the logs mentioned in comment #1 I see the following error in the apache log:

[Sun Feb 19 14:46:05.764384 2017] [mpm_prefork:error] [pid 188948] AH00161: server reached MaxRequestWorkers setting, consider raising the MaxRequestWorkers setting

I can't find the httpd configuration in the job logs, so I'm not sure whether it is undersized in terms of max processes or whether there is something else going on.

Regarding https://review.rdoproject.org/r/#/c/5390/ , jobs on the ocata release are passing after this change [3]. Additionally, the tests in https://review.openstack.org/#/c/435693/ ran with a similar package combination in terms of heat-templates/heat-agents, and they passed, so I'm inclined to think it's unrelated.

[1] https://ci.centos.org/artifacts/rdo/jenkins-tripleo-quickstart-promote-master-delorean-minimal_pacemaker-489/undercloud/home/stack/
[2] http://logs.openstack.org/15/359215/62/check-tripleo/gate-tripleo-ci-centos-7-ovb-updates/d048040/logs/overcloud-controller-0/var/log/gnocchi/app.txt.gz
[3] https://ci.centos.org/view/rdo/view/promotion-pipeline/job/rdo_trunk-promote-ocata-current-tripleo/27/

Revision history for this message
Sagi (Sergey) Shnaidman (sshnaidm) wrote :

We set ServerLimit in Apache to 32 in the low-memory-usage template [1]; in periodic jobs this causes the error about preforking and MaxRequestWorkers (which defaults to ServerLimit). After I increased it to 300, everything started to work.
But we use low-memory-usage in all our jobs, not only the periodic ones, and everywhere else it's OK. So this is more a symptom of the problem than its cause. Something is taking all the ServerLimit forks and preventing keystone from receiving requests.

[1] https://github.com/openstack/tripleo-heat-templates/blob/master/environments/low-memory-usage.yaml#L15

Changed in tripleo:
milestone: ongoing → ocata-rc2
Revision history for this message
Emilien Macchi (emilienm) wrote :
Revision history for this message
Alfredo Moralejo (amoralej) wrote :

Makes sense; jobs with the HA configuration are passing because the redis issue has no impact when redis runs as a pacemaker resource, but we hit it in the -nonha and -updates jobs.

I just wonder how it passes in the ocata minimal job in RDO-CI, without pacemaker and with the new redis (https://ci.centos.org/job/tripleo-quickstart-promote-ocata-rdo_trunk-minimal/13/). Maybe there is a more aggressive retry policy in master than in stable/ocata.

Revision history for this message
Sagi (Sergey) Shnaidman (sshnaidm) wrote :

Redis was updated and works, but jobs still fail.
ovb-updates job:

http://logs.openstack.org/periodic/periodic-tripleo-ci-centos-7-ovb-updates/7f67ea6/console.html

However, gnocchi still can't connect and may be taking up all the apache resources: http://logs.openstack.org/periodic/periodic-tripleo-ci-centos-7-ovb-updates/7f67ea6/logs/overcloud-controller-0/var/log/gnocchi/metricd.txt.gz

Continuing to investigate.

Changed in tripleo:
assignee: nobody → Sagi (Sergey) Shnaidman (sshnaidm)
Revision history for this message
Sagi (Sergey) Shnaidman (sshnaidm) wrote :

After investigating, I found some problems with the apache workers on the controller: http://paste.openstack.org/show/599888/ (the same file is attached).

It seems all 32 possible apache "prefork" workers (ServerLimit is 32) are busy with the request "GET /v1/capabilities/ HTTP/1.1", which is directed at gnocchi_wsgi:
/var/log/httpd/gnocchi_wsgi_access.log:192.168.24.11 - - [21/Feb/2017:18:27:26 +0000] "GET /v1/capabilities/ HTTP/1.1" 500 531 "-" "ceilometer-agent-notification keystoneauth1/2.18.0 python-requests/2.11.1 CPython/2.7.5"

While processing them in parallel, apache can't fork more workers (because it has reached the limit), so no new connections are accepted, including those from nova, neutron, etc. used by the pingtest. Restarting apache usually fixes it.
I tried to run the same jobs without gnocchi enabled and they passed:
https://review.openstack.org/#/c/436497/

So the problem is most likely in gnocchi or in its interaction with ceilometer.
I see a few possible solutions:
1) Investigate and solve it in the gnocchi/ceilometer projects.
2) Increase the apache ServerLimit to something higher than 32 in the low-memory-usage template: https://github.com/openstack/tripleo-heat-templates/blob/c99c48b84e20925b4f4b728e9b103d6c8bcb3d11/environments/low-memory-usage.yaml#L15-L15
3) Configure apache to use mpm_event instead of mpm_prefork.

Revision history for this message
Julien Danjou (jdanjou) wrote :

Hypothesis: it's possible that Ceilometer requests /v1/capabilities with a Keystone token. Since several processes are doing that in parallel, they can consume all 32 request slots doing so. The problem is that, to authorize /v1/capabilities, I imagine the keystone middleware in Gnocchi asks Keystone to validate the token, and then… it blocks forever since there are no more Apache workers available.

Does that sound reasonable?
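
To illustrate the shape of this, here is a minimal, self-contained sketch (plain Python for illustration, not TripleO, Ceilometer or Gnocchi code) of a worker pool deadlocking when every worker needs another worker from the same pool:

    # Illustrative only: the pool stands in for Apache's prefork workers
    # (ServerLimit), the outer tasks for the /v1/capabilities requests, and the
    # inner tasks for the token validation that keystonemiddleware sends back
    # through the same Apache instance.
    import concurrent.futures as cf

    pool = cf.ThreadPoolExecutor(max_workers=2)  # stands in for ServerLimit=32

    def validate_token():
        return "ok"  # in reality, the Keystone call, which also needs a worker

    def capabilities_request():
        # Each handler blocks until token validation gets a worker of its own.
        return pool.submit(validate_token).result()

    # Fill every worker with an outer request: the inner validations can never
    # be scheduled, so each .result() below times out instead of returning.
    futures = [pool.submit(capabilities_request) for _ in range(2)]
    for f in futures:
        try:
            print(f.result(timeout=2))
        except cf.TimeoutError:
            print("deadlocked: every worker waits on a validation that never runs")

    # Unblock the stuck workers so the demo exits cleanly (Python 3.9+).
    pool.shutdown(wait=False, cancel_futures=True)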

Revision history for this message
Sagi (Sergey) Shnaidman (sshnaidm) wrote :

Meanwhile, I submitted a patch increasing the Apache limit to 100; it passed with the consistent repos: https://review.openstack.org/#/c/436650/

Revision history for this message
Sagi (Sergey) Shnaidman (sshnaidm) wrote :

@Julien, could it be that there are 32 parallel requests from ceilometer at once? Because apache forks a new worker for each parallel request.

For example this job:
http://logs.openstack.org/23/436623/1/check-tripleo/gate-tripleo-ci-centos-7-ovb-updates/9c7556c/console.html#_2017-02-21_22_33_43_449098

I see a lot of nearly simultaneous requests to keystone that receive a 503 error:
http://logs.openstack.org/23/436623/1/check-tripleo/gate-tripleo-ci-centos-7-ovb-updates/9c7556c/logs/overcloud-controller-0/var/log/gnocchi/app.txt.gz#_2017-02-21_21_58_43_239

the same story with ceilometer:
http://logs.openstack.org/23/436623/1/check-tripleo/gate-tripleo-ci-centos-7-ovb-updates/9c7556c/logs/overcloud-controller-0/var/log/ceilometer/central.txt.gz

Ceilometer is bombarding gnocchi with capabilities requests:
http://logs.openstack.org/23/436623/1/check-tripleo/gate-tripleo-ci-centos-7-ovb-updates/9c7556c/logs/overcloud-controller-0/var/log/httpd/gnocchi_wsgi_access.txt.gz

Since it's "capabilities" requests that blocked all preforks in apache, could it be that gnocchi can't neither serve them nor drop? And just blocking on them. Maybe some problem with serving capabilities request?
It may be your case, when just no slots to turn to keystone, anyway there are few problems:

1) Too many identical parallel requests from ceilometer to gnocchi. That's definitely a bug.
2) Gnocchi shouldn't block on the request; it needs a timeout, after which it should release the apache worker.

We don't have problems running these jobs on master with the current-tripleo repo, with gnocchi:

openstack-gnocchi-api-3.1.1-0.20170217232151.54631be.el7.centos.noarch
openstack-gnocchi-common-3.1.1-0.20170217232151.54631be.el7.centos.noarch
openstack-gnocchi-indexer-sqlalchemy-3.1.1-0.20170217232151.54631be.el7.centos.noarch
openstack-gnocchi-metricd-3.1.1-0.20170217232151.54631be.el7.centos.noarch
openstack-gnocchi-statsd-3.1.1-0.20170217232151.54631be.el7.centos.noarch
puppet-gnocchi-10.3.0-0.20170208142639.5ec568d.el7.centos.noarch
python2-gnocchiclient-3.0.0-0.20170208105743.32b62fb.el7.centos.noarch
python-gnocchi-3.1.1-0.20170217232151.54631be.el7.centos.noarch

ceilometer:

openstack-ceilometer-api-8.0.1-0.20170217042451.83412c8.el7.centos.noarch
openstack-ceilometer-central-8.0.1-0.20170217042451.83412c8.el7.centos.noarch
openstack-ceilometer-collector-8.0.1-0.20170217042451.83412c8.el7.centos.noarch
openstack-ceilometer-common-8.0.1-0.20170217042451.83412c8.el7.centos.noarch
openstack-ceilometer-compute-8.0.1-0.20170217042451.83412c8.el7.centos.noarch
openstack-ceilometer-notification-8.0.1-0.20170217042451.83412c8.el7.centos.noarch
openstack-ceilometer-polling-8.0.1-0.20170217042451.83412c8.el7.centos.noarch
puppet-ceilometer-10.3.0-0.20170215160028.c3d330e.el7.centos.noarch
python2-ceilometerclient-2.8.0-0.20170206121840.658d7e2.el7.centos.noarch
python-ceilometer-8.0.1-0.20170217042451.83412c8.el7.centos.noarch
python-ceilometermiddleware-1.0.0-0.20170206160402.87bd49b.el7.centos.noarch

And we do have problems with the consistent repo, with these packages:

Gnocchi:
openstack-gnocchi-api-3.1.1-0.20170218182819.ca27a17.el7.centos.noarch
openstack-gnocchi-common-3.1.1-0.20170218182819.ca27a17.el7.centos.noarch
openstack-g...


Revision history for this message
Julien Danjou (jdanjou) wrote :

I guess you did not completely understand the hypothesis I formulated. I'll reply to your comments:

> Since it's the "capabilities" requests that blocked all the prefork workers in apache, could it be that gnocchi can neither serve them nor drop them,
> and just blocks on them? Maybe there is some problem with serving the capabilities request?

Yes, it can't serve them because it has to verify the authentication through Keystone, but Keystone is unreachable since Apache is already full.
Capabilities is static data, so the problem is not *in* Gnocchi; it's just before that, in the Keystone middleware it uses for authentication.

> 1) Too many identical parallel requests from ceilometer to gnocchi. That's definitely a bug.

Not really. It's just Ceilometer spawning threads to handle messages concurrently, and each thread needs to make this request. I'll try to see if it can be lowered.

> Gnocchi shouldn't block on the request; it needs a timeout, after which it should release the apache worker.

This is not Gnocchi blocking. This is the Keystone auth token middleware.

At least, if my scenario is correct and unless proven otherwise. :-)

tags: removed: alert promotion-blocker
Changed in tripleo:
status: Triaged → Fix Released
Revision history for this message
Sagi (Sergey) Shnaidman (sshnaidm) wrote :

@Julien, thanks, now it's clearer. Even if this is exactly what happens (unless proven otherwise):

> It's just Ceilometer spawning threads to handle messages concurrently, and each thread needs to make this request. I'll try to see if it can be lowered.

That's what I call a bug. A service shouldn't issue tens of identical requests; it's the wrong behaviour/design.

Also, please note my comments about the gnocchi/ceilometer versions: everything works without problems with master current-tripleo, and only the versions from the consistent repo have these problems, so it seems like recent changes caused this.

Revision history for this message
Sagi (Sergey) Shnaidman (sshnaidm) wrote :

@Emilien, @Julien - if we close this bug, should we open a bug in gnocchi or ceilometer about it?

Revision history for this message
Julien Danjou (jdanjou) wrote :

Sagi: I already sent https://review.openstack.org/436863 to remove this behavior from Ceilometer; we'll see.

Revision history for this message
gordon chung (chungg) wrote :

Do you install/start gnocchi before ceilometer? You should do that.

Revision history for this message
Sagi (Sergey) Shnaidman (sshnaidm) wrote :

@gordon chung, even if not, gnocchi and ceilometer should take that into account and handle this case properly.

Revision history for this message
Sagi (Sergey) Shnaidman (sshnaidm) wrote :

I created a job which mimics the periodic job and depends on the ceilometer patch mentioned above. It also runs without the apache workaround. Let's see if it helps: https://review.openstack.org/#/c/437008/

Revision history for this message
gordon chung (chungg) wrote :

Well, what would you expect to happen if you ran a service and mysql wasn't started? It seems like we should replicate that behaviour.

Revision history for this message
Sagi (Sergey) Shnaidman (sshnaidm) wrote :

@gordon chung, I'd expect the service to check within a reasonable timeout (60 seconds, for example) whether it can connect, and to report it if it can't. But definitely not to bring down everything around it.
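
For what it's worth, a minimal sketch of the behaviour I mean (a hypothetical helper, not actual ceilometer code): probe the backend once with a bounded timeout and report, rather than letting every worker block indefinitely:

    # Hypothetical helper, not actual ceilometer code: check once, with a
    # bounded timeout, whether the gnocchi API answers, and log a warning if not.
    import logging
    import requests

    LOG = logging.getLogger(__name__)

    def gnocchi_reachable(endpoint, timeout=60):
        try:
            requests.get(endpoint.rstrip("/") + "/v1/capabilities", timeout=timeout)
            return True
        except requests.exceptions.RequestException as exc:
            LOG.warning("gnocchi not reachable within %ss: %s", timeout, exc)
            return False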

Revision history for this message
gordon chung (chungg) wrote :

OK, we should do that then. Let's figure out how to do that in gnocchi/ceilometer.

Revision history for this message
Sagi (Sergey) Shnaidman (sshnaidm) wrote :

gate-tripleo-ci-centos-7-ovb-nonha SUCCESS in 2h 19m 58s
gate-tripleo-ci-centos-7-ovb-updates SUCCESS in 2h 33m 11s

Seems like it helped.

gordon chung (chungg)
no longer affects: gnocchi
Changed in ceilometer:
assignee: nobody → gordon chung (chungg)
status: New → In Progress
Revision history for this message
gordon chung (chungg) wrote :

Can you test against https://review.openstack.org/#/c/437466/?

You can keep the previous fix too, but please validate ^ while you can still reproduce the issue.

gordon chung (chungg)
Changed in ceilometer:
importance: Undecided → Medium
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to ceilometer (master)

Reviewed: https://review.openstack.org/437466
Committed: https://git.openstack.org/cgit/openstack/ceilometer/commit/?id=7c6160e3a19ed50562e51137b2779c328a750bce
Submitter: Jenkins
Branch: master

commit 7c6160e3a19ed50562e51137b2779c328a750bce
Author: gord chung <email address hidden>
Date: Thu Feb 23 15:49:57 2017 +0000

    support gnocchi timeout

    set gnocchi timeout to 6.05 because requests recommends multiple
    of 3[1]. also, it seems safe to assume gnocchi is not functioning well
    and will not scale if we can't post in 6s. in my benchmark, it took
    under 0.25s to post 5000 points. the only thing that took >3s was
    getting >80K datapoints and this workflow is only pushing, and only
    getting small chunks occasionally. making it 6s because it seems
    keystone adds decent overhead.

    do not set timeout for upgrade because it might take longer than normal
    workflow

    [1] http://docs.python-requests.org/en/master/user/advanced/#timeouts

    Change-Id: I41b5e4f0c0f54b7f4b627d9116e63096ca2decf9
    Closes-Bug: #1666072
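
For context, the timeout from the commit above maps to a python-requests call along these lines (a sketch with a placeholder endpoint, not the actual gnocchiclient code path):

    # Sketch only, placeholder URL: the requests documentation suggests a
    # timeout slightly larger than a multiple of 3 (the default TCP
    # retransmission window), which is where the 6.05 above comes from.
    import requests

    GNOCCHI_TIMEOUT = 6.05  # seconds

    resp = requests.get("https://gnocchi.example.com/v1/capabilities",
                        timeout=GNOCCHI_TIMEOUT)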

Changed in ceilometer:
status: In Progress → Fix Released
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix merged to ceilometer (master)

Reviewed: https://review.openstack.org/436863
Committed: https://git.openstack.org/cgit/openstack/ceilometer/commit/?id=79859c7dedd4895b395b3dabde454dda772af2ad
Submitter: Jenkins
Branch: master

commit 79859c7dedd4895b395b3dabde454dda772af2ad
Author: Julien Danjou <email address hidden>
Date: Wed Feb 22 10:31:17 2017 +0100

    Revert "verify gnocchi connection before processing"

    This reverts commit 152aa1af01769841956da1bc0225bde88b80af08.

    This makes Ceilometer spam Gnocchi at startup for no good reason. Gnocchi being
    started or not at the start of Ceilometer should not change anything, as it can
    become down at anytime and Ceilometer ought to handle that case anyway.

    Related-Bug: #1666072
    Change-Id: Ic03ee3343f68428775f2dd780a26298ab0434800

Revision history for this message
Emilien Macchi (emilienm) wrote :
Changed in tripleo:
milestone: ocata-rc2 → pike-2
assignee: Sagi (Sergey) Shnaidman (sshnaidm) → nobody
tags: added: alert
Changed in tripleo:
status: Fix Released → Triaged
Revision history for this message
Emilien Macchi (emilienm) wrote :
Revision history for this message
Emilien Macchi (emilienm) wrote :

2017-04-16 01:52:18.643435 | REQ: curl -g -i -X GET https://192.168.24.2:13004/v1/c1336c5e08744b72a59ac12f14986197/stacks/overcloud/events?marker=b126e655-ecd7-4f9f-9841-f3ecca90364d&nested_depth=2&sort_dir=asc -H "User-Agent: python-heatclient" -H "Accept: application/json" -H "X-Auth-Token: {SHA1}47c9501746fdfaf0b2ac7e905ac21139b667eef4"
2017-04-16 01:54:18.646629 | "GET /v1/c1336c5e08744b72a59ac12f14986197/stacks/overcloud/events?marker=b126e655-ecd7-4f9f-9841-f3ecca90364d&nested_depth=2&sort_dir=asc HTTP/1.1" 504 None
2017-04-16 01:54:18.649440 | RESP: [504] Cache-Control: no-cache Connection: close Content-Type: text/html
2017-04-16 01:54:18.649526 | RESP BODY: Omitted, Content-Type is set to text/html. Only application/json responses have their bodies logged.
2017-04-16 01:54:18.649550 |
2017-04-16 01:54:18.682452 | ERROR: <html><body><h1>504 Gateway Time-out</h1>
2017-04-16 01:54:18.682534 | The server didn't respond in time.
2017-04-16 01:54:18.682564 | </body></html>
2017-04-16 01:54:18.682594 | Traceback (most recent call last):
2017-04-16 01:54:18.682640 | File "/usr/lib/python2.7/site-packages/cliff/app.py", line 400, in run_subcommand
2017-04-16 01:54:18.682669 | result = cmd.run(parsed_args)
2017-04-16 01:54:18.682715 | File "/usr/lib/python2.7/site-packages/osc_lib/command/command.py", line 41, in run
2017-04-16 01:54:18.682749 | return super(Command, self).run(parsed_args)
2017-04-16 01:54:18.682791 | File "/usr/lib/python2.7/site-packages/cliff/command.py", line 90, in run
2017-04-16 01:54:18.682823 | return self.take_action(parsed_args) or 0
2017-04-16 01:54:18.682877 | File "/usr/lib/python2.7/site-packages/tripleoclient/v1/overcloud_deploy.py", line 865, in take_action
2017-04-16 01:54:18.682919 | self._deploy_tripleo_heat_templates_tmpdir(stack, parsed_args)
2017-04-16 01:54:18.682986 | File "/usr/lib/python2.7/site-packages/tripleoclient/v1/overcloud_deploy.py", line 368, in _deploy_tripleo_heat_templates_tmpdir
2017-04-16 01:54:18.683012 | new_tht_root, tht_root)
2017-04-16 01:54:18.683074 | File "/usr/lib/python2.7/site-packages/tripleoclient/v1/overcloud_deploy.py", line 442, in _deploy_tripleo_heat_templates
2017-04-16 01:54:18.683102 | parsed_args.run_validations)
2017-04-16 01:54:18.683165 | File "/usr/lib/python2.7/site-packages/tripleoclient/v1/overcloud_deploy.py", line 454, in _try_overcloud_deploy_with_compat_yaml
2017-04-16 01:54:18.683189 | run_validations)
2017-04-16 01:54:18.683243 | File "/usr/lib/python2.7/site-packages/tripleoclient/v1/overcloud_deploy.py", line 233, in _heat_deploy
2017-04-16 01:54:18.683272 | run_validations=run_validations)
2017-04-16 01:54:18.683351 | File "/usr/lib/python2.7/site-packages/tripleoclient/workflows/deployment.py", line 83, in deploy_and_wait
2017-04-16 01:54:18.683401 | orchestration_client, plan_name, marker, action, verbose_events)
2017-04-16 01:54:18.683453 | File "/usr/lib/python2.7/site-packages/tripleoclient/utils.py", line 190, in wait_for_stack_ready
2017-04-16 01:54:18.683491 | poll_period=5, marker=marker, out=out, nested_depth=2)
2017-04-16 01:54:18.683545 | File "/usr/lib/python2.7/site-packag...


Revision history for this message
Zane Bitter (zaneb) wrote :

Searching through the heat-api logs for "/v1/c1336c5e08744b72a59ac12f14986197/stacks/overcloud/events", there's one instance where it takes an inordinately long time:

2017-04-16 01:52:18.646 9644 DEBUG heat.api.middleware.version_negotiation [req-f2262aba-43f0-41f1-9903-a4c4cf02bb75 - admin - - -] Processing request: GET /v1/c1336c5e08744b72a59ac12f14986197/stacks/overcloud/events Accept: application/json process_request /usr/lib/python2.7/site-packages/heat/api/middleware/version_negotiation.py:50
2017-04-16 01:52:18.657 9644 DEBUG heat.common.wsgi [req-e15ccae1-86fe-4dca-b00d-347a9608d7d5 - admin - - -] Calling <heat.api.openstack.v1.stacks.StackController object at 0x7f55232cbb50> : lookup __call__ /usr/lib/python2.7/site-packages/heat/common/wsgi.py:833
2017-04-16 01:52:18.660 9644 DEBUG oslo_messaging._drivers.amqpdriver [req-e15ccae1-86fe-4dca-b00d-347a9608d7d5 - admin - - -] CALL msg_id: 9d0be2b3670e4ce7b91246172170f2be exchange 'heat' topic 'engine' _send /usr/lib/python2.7/site-packages/oslo_messaging/_drivers/amqpdriver.py:489
...
2017-04-16 01:56:06.717 9644 DEBUG oslo_messaging._drivers.amqpdriver [-] received reply msg_id: 9d0be2b3670e4ce7b91246172170f2be __call__ /usr/lib/python2.7/site-packages/oslo_messaging/_drivers/amqpdriver.py:346

So nearly 4 minutes to respond to that RPC message. Note that this isn't even the RPC call to fetch the events - it's the one to look up the ID of the stack.

Switching to the heat-engine logs, we can see that the engine didn't in fact receive the message until just before it replied (very promptly, within 100ms):

2017-04-16 01:56:06.627 9060 DEBUG oslo_messaging._drivers.amqpdriver [-] received message msg_id: 9d0be2b3670e4ce7b91246172170f2be reply to reply_e05126dea57b4da0ae7b4aee827d2fb0 __call__ /usr/lib/python2.7/site-packages/oslo_messaging/_drivers/amqpdriver.py:213
2017-04-16 01:56:06.714 9060 DEBUG oslo_messaging._drivers.amqpdriver [req-e15ccae1-86fe-4dca-b00d-347a9608d7d5 - admin - - -] sending reply msg_id: 9d0be2b3670e4ce7b91246172170f2be reply queue: reply_e05126dea57b4da0ae7b4aee827d2fb0 time elapsed: 0.0844531349994s _send_reply /usr/lib/python2.7/site-packages/oslo_messaging/_drivers/amqpdriver.py:90

The logs don't reveal any obvious reason for the delay. PIDs 9062 and 9063 are busy processing stuff during this time (although I wouldn't have expected them to exhaust their thread pools). PID 9061 does some stuff at the beginning of that period, and then appears to be mostly idle.

PID 9060, which eventually receives the message, logs nothing between just before the message is sent:

2017-04-16 01:52:18.117 9060 DEBUG heat.engine.parameter_groups [req-6d3ebf36-ee75-4849-b318-e4fe5a3ca369 admin admin - - -] Validating Parameter Groups: OS::project_id, OS::stack_id validate /usr/lib/python2.7/site-packages/heat/engine/parameter_groups.py:42

and shortly before it is received:

2017-04-16 01:56:05.745 9060 INFO heat.engine.resource [req-6d3ebf36-ee75-4849-b318-e4fe5a3ca369 admin admin - - -] Validating TemplateResource "57"

which is suspicious. That could explain it if:

* There's a context switch after PID 9060 has claimed the message, but before it logs/starts processing it ...


Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix proposed to puppet-tripleo (master)

Related fix proposed to branch: master
Review: https://review.openstack.org/457264

Changed in tripleo:
assignee: nobody → Emilien Macchi (emilienm)
status: Triaged → In Progress
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix merged to puppet-tripleo (master)

Reviewed: https://review.openstack.org/457264
Committed: https://git.openstack.org/cgit/openstack/puppet-tripleo/commit/?id=6cb95e6a69677755a070cc73062b70c84452824e
Submitter: Jenkins
Branch: master

commit 6cb95e6a69677755a070cc73062b70c84452824e
Author: Emilien Macchi <email address hidden>
Date: Mon Apr 17 11:30:23 2017 -0400

    HAproxy/heat_api: increase timeout to 10m

    Default timeout is 2min but it doesn't reflect the rpc_response_timeout
    value that we set in THT and instack-undercloud, which is 600 (10 min).

    In some cases (in low-memory environments), Heat needs more than 2
    minutes to reply to the client, when deploying the overcloud.
    It makes sense to increase the timeout to the value of rpc_timeout to
    give a chance to Heat to reply to the client, otherwise HAproxy will
    kill the connection and send 504 to the client.

    Depends-On: I9669d40d86d762101734704fcef153e360767690
    Change-Id: I32c71fe7930c8798d306046d6933e4b20c22740c
    Related-Bug: 1666072

Changed in tripleo:
status: In Progress → Fix Released
tags: removed: alert
Revision history for this message
Steve Baker (steve-stevebaker) wrote :

RDOCloud is experiencing 504s when concurrently deleting many stacks, so I'd like to use this bug to backport this change as far back as newton.

tags: added: newton-backport-potential ocata-backport-potential
Revision history for this message
Emilien Macchi (emilienm) wrote :
Changed in tripleo:
status: Fix Released → Triaged
milestone: pike-2 → pike-3
tags: added: alert
Revision history for this message
Emilien Macchi (emilienm) wrote :
Changed in tripleo:
assignee: Emilien Macchi (emilienm) → nobody
Revision history for this message
Emilien Macchi (emilienm) wrote :
tags: added: promotion-blocker
Revision history for this message
Emilien Macchi (emilienm) wrote :
Changed in tripleo:
status: Triaged → Fix Released
milestone: pike-3 → none
tags: removed: alert promotion-blocker
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/ceilometer 9.0.0

This issue was fixed in the openstack/ceilometer 9.0.0 release.
