Searching through the heat-api logs for "/v1/c1336c5e08744b72a59ac12f14986197/stacks/overcloud/events", there's once instance where it takes an inordinately long time:
So nearly 4 minutes to respond to that RPC message. Note that this isn't even the RPC call to fetch the events - it's the one to look up the ID of the stack.
Switching to the heat-engine logs, we can see that the engine didn't in fact receive the message until just before it replied (very promptly, within 100ms):
The logs don't reveal any obvious reasons for the delete. PIDs 9062 and 9063 are busy processing stuff during this time (although I wouldn't have expected them to exhaust their thread pools). PID 9061 does some stuff at the beginning of that period, and then appears to be mostly idle.
PID 9060, which eventually receives the message, logs nothing between just before the message is sent:
* There's a context switch after PID 9060 has claimed the message, but before it logs/starts processing it (why would oslo_messaging do that though?)
* The ResourceChain that PID 9060 is validating something (what?) very slowly - for nearly 4 minutes - between the parameter groups and the resources, without doing any I/O or otherwise prompting a context switch.
Or:
* There's a *massive* GC pause
Or:
* The undercloud box is swapping, and there's a *massive* pause for that process while we wait for it.
Alternatively, a less likely possibility is that Rabbit really does take that long to deliver the message for some reason.
In any event, the solution is probably to retry on the client side if you get a 504, or to increase the gateway timeout to at least as much as Heat's RPC timeout (TBH I thought we had already done this, so it's probably a regression in the HAProxy config).
Searching through the heat-api logs for "/v1/c1336c5e08 744b72a59ac12f1 4986197/ stacks/ overcloud/ events" , there's once instance where it takes an inordinately long time:
2017-04-16 01:52:18.646 9644 DEBUG heat.api. middleware. version_ negotiation [req-f2262aba- 43f0-41f1- 9903-a4c4cf02bb 75 - admin - - -] Processing request: GET /v1/c1336c5e087 44b72a59ac12f14 986197/ stacks/ overcloud/ events Accept: application/json process_request /usr/lib/ python2. 7/site- packages/ heat/api/ middleware/ version_ negotiation. py:50 86fe-4dca- b00d-347a9608d7 d5 - admin - - -] Calling <heat.api. openstack. v1.stacks. StackController object at 0x7f55232cbb50> : lookup __call__ /usr/lib/ python2. 7/site- packages/ heat/common/ wsgi.py: 833 _drivers. amqpdriver [req-e15ccae1- 86fe-4dca- b00d-347a9608d7 d5 - admin - - -] CALL msg_id: 9d0be2b3670e4ce 7b91246172170f2 be exchange 'heat' topic 'engine' _send /usr/lib/ python2. 7/site- packages/ oslo_messaging/ _drivers/ amqpdriver. py:489 _drivers. amqpdriver [-] received reply msg_id: 9d0be2b3670e4ce 7b91246172170f2 be __call__ /usr/lib/ python2. 7/site- packages/ oslo_messaging/ _drivers/ amqpdriver. py:346
2017-04-16 01:52:18.657 9644 DEBUG heat.common.wsgi [req-e15ccae1-
2017-04-16 01:52:18.660 9644 DEBUG oslo_messaging.
...
2017-04-16 01:56:06.717 9644 DEBUG oslo_messaging.
So nearly 4 minutes to respond to that RPC message. Note that this isn't even the RPC call to fetch the events - it's the one to look up the ID of the stack.
Switching to the heat-engine logs, we can see that the engine didn't in fact receive the message until just before it replied (very promptly, within 100ms):
2017-04-16 01:56:06.627 9060 DEBUG oslo_messaging. _drivers. amqpdriver [-] received message msg_id: 9d0be2b3670e4ce 7b91246172170f2 be reply to reply_e05126dea 57b4da0ae7b4aee 827d2fb0 __call__ /usr/lib/ python2. 7/site- packages/ oslo_messaging/ _drivers/ amqpdriver. py:213 _drivers. amqpdriver [req-e15ccae1- 86fe-4dca- b00d-347a9608d7 d5 - admin - - -] sending reply msg_id: 9d0be2b3670e4ce 7b91246172170f2 be reply queue: reply_e05126dea 57b4da0ae7b4aee 827d2fb0 time elapsed: 0.0844531349994s _send_reply /usr/lib/ python2. 7/site- packages/ oslo_messaging/ _drivers/ amqpdriver. py:90
2017-04-16 01:56:06.714 9060 DEBUG oslo_messaging.
The logs don't reveal any obvious reasons for the delete. PIDs 9062 and 9063 are busy processing stuff during this time (although I wouldn't have expected them to exhaust their thread pools). PID 9061 does some stuff at the beginning of that period, and then appears to be mostly idle.
PID 9060, which eventually receives the message, logs nothing between just before the message is sent:
2017-04-16 01:52:18.117 9060 DEBUG heat.engine. parameter_ groups [req-6d3ebf36- ee75-4849- b318-e4fe5a3ca3 69 admin admin - - -] Validating Parameter Groups: OS::project_id, OS::stack_id validate /usr/lib/ python2. 7/site- packages/ heat/engine/ parameter_ groups. py:42
and shortly before it is received:
2017-04-16 01:56:05.745 9060 INFO heat.engine. resource [req-6d3ebf36- ee75-4849- b318-e4fe5a3ca3 69 admin admin - - -] Validating TemplateResource "57"
which is suspicious. That could explain it if:
* There's a context switch after PID 9060 has claimed the message, but before it logs/starts processing it (why would oslo_messaging do that though?)
* The ResourceChain that PID 9060 is validating something (what?) very slowly - for nearly 4 minutes - between the parameter groups and the resources, without doing any I/O or otherwise prompting a context switch.
Or:
* There's a *massive* GC pause
Or:
* The undercloud box is swapping, and there's a *massive* pause for that process while we wait for it.
Alternatively, a less likely possibility is that Rabbit really does take that long to deliver the message for some reason.
In any event, the solution is probably to retry on the client side if you get a 504, or to increase the gateway timeout to at least as much as Heat's RPC timeout (TBH I thought we had already done this, so it's probably a regression in the HAProxy config).