Comment 31 for bug 1666072

Zane Bitter (zaneb) wrote:

Searching through the heat-api logs for "/v1/c1336c5e08744b72a59ac12f14986197/stacks/overcloud/events", there's one instance where it takes an inordinately long time:

2017-04-16 01:52:18.646 9644 DEBUG heat.api.middleware.version_negotiation [req-f2262aba-43f0-41f1-9903-a4c4cf02bb75 - admin - - -] Processing request: GET /v1/c1336c5e08744b72a59ac12f14986197/stacks/overcloud/events Accept: application/json process_request /usr/lib/python2.7/site-packages/heat/api/middleware/version_negotiation.py:50
2017-04-16 01:52:18.657 9644 DEBUG heat.common.wsgi [req-e15ccae1-86fe-4dca-b00d-347a9608d7d5 - admin - - -] Calling <heat.api.openstack.v1.stacks.StackController object at 0x7f55232cbb50> : lookup __call__ /usr/lib/python2.7/site-packages/heat/common/wsgi.py:833
2017-04-16 01:52:18.660 9644 DEBUG oslo_messaging._drivers.amqpdriver [req-e15ccae1-86fe-4dca-b00d-347a9608d7d5 - admin - - -] CALL msg_id: 9d0be2b3670e4ce7b91246172170f2be exchange 'heat' topic 'engine' _send /usr/lib/python2.7/site-packages/oslo_messaging/_drivers/amqpdriver.py:489
...
2017-04-16 01:56:06.717 9644 DEBUG oslo_messaging._drivers.amqpdriver [-] received reply msg_id: 9d0be2b3670e4ce7b91246172170f2be __call__ /usr/lib/python2.7/site-packages/oslo_messaging/_drivers/amqpdriver.py:346

So nearly 4 minutes to respond to that RPC message. Note that this isn't even the RPC call to fetch the events - it's the one to look up the ID of the stack.
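As an aside, this kind of slow round trip can be picked out mechanically. Here's a rough sketch that pairs each "CALL msg_id" DEBUG line with its "received reply msg_id" line in a heat-api log and prints the slowest round trips; it assumes the oslo.messaging log format shown above, and the default filename is just a placeholder:

#!/usr/bin/env python
# Pair oslo.messaging "CALL msg_id: X" and "received reply msg_id: X" lines
# from a heat-api log and print the ten slowest round trips.
import re
import sys
from datetime import datetime

TS_FMT = "%Y-%m-%d %H:%M:%S.%f"
LINE = re.compile(r"^(?P<ts>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}\.\d+).*"
                  r"(?P<kind>CALL|received reply) msg_id: (?P<msg_id>\w+)")

calls, replies = {}, {}
with open(sys.argv[1] if len(sys.argv) > 1 else "heat-api.log") as f:
    for line in f:
        m = LINE.search(line)
        if not m:
            continue
        ts = datetime.strptime(m.group("ts"), TS_FMT)
        (calls if m.group("kind") == "CALL" else replies)[m.group("msg_id")] = ts

pairs = [(replies[i] - calls[i], i) for i in calls if i in replies]
for delta, msg_id in sorted(pairs, reverse=True)[:10]:
    print("%s %s" % (delta, msg_id))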

Switching to the heat-engine logs, we can see that the engine didn't in fact receive the message until just before it replied; once it did receive it, it responded very promptly (within 100ms):

2017-04-16 01:56:06.627 9060 DEBUG oslo_messaging._drivers.amqpdriver [-] received message msg_id: 9d0be2b3670e4ce7b91246172170f2be reply to reply_e05126dea57b4da0ae7b4aee827d2fb0 __call__ /usr/lib/python2.7/site-packages/oslo_messaging/_drivers/amqpdriver.py:213
2017-04-16 01:56:06.714 9060 DEBUG oslo_messaging._drivers.amqpdriver [req-e15ccae1-86fe-4dca-b00d-347a9608d7d5 - admin - - -] sending reply msg_id: 9d0be2b3670e4ce7b91246172170f2be reply queue: reply_e05126dea57b4da0ae7b4aee827d2fb0 time elapsed: 0.0844531349994s _send_reply /usr/lib/python2.7/site-packages/oslo_messaging/_drivers/amqpdriver.py:90
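
Just redoing the arithmetic from the timestamps above (these numbers come straight from the two log excerpts):

from datetime import datetime

FMT = "%Y-%m-%d %H:%M:%S.%f"
sent     = datetime.strptime("2017-04-16 01:52:18.660", FMT)  # heat-api: CALL msg_id
received = datetime.strptime("2017-04-16 01:56:06.627", FMT)  # heat-engine: received message
replied  = datetime.strptime("2017-04-16 01:56:06.714", FMT)  # heat-engine: sending reply

print("delivery delay: %s" % (received - sent))     # 0:03:47.967000
print("handling time:  %s" % (replied - received))  # 0:00:00.087000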

The logs don't reveal any obvious reason for the delay. PIDs 9062 and 9063 are busy processing stuff during this time (although I wouldn't have expected them to exhaust their thread pools). PID 9061 does some stuff at the beginning of that period, and then appears to be mostly idle.

PID 9060, which eventually receives the message, logs nothing between this line, just before the message is sent:

2017-04-16 01:52:18.117 9060 DEBUG heat.engine.parameter_groups [req-6d3ebf36-ee75-4849-b318-e4fe5a3ca369 admin admin - - -] Validating Parameter Groups: OS::project_id, OS::stack_id validate /usr/lib/python2.7/site-packages/heat/engine/parameter_groups.py:42

and this one, shortly before the message is received:

2017-04-16 01:56:05.745 9060 INFO heat.engine.resource [req-6d3ebf36-ee75-4849-b318-e4fe5a3ca369 admin admin - - -] Validating TemplateResource "57"

which is suspicious. That gap could explain the delay if:

* There's a context switch after PID 9060 has claimed the message, but before it logs/starts processing it (why would oslo_messaging do that though?)
* The ResourceChain that PID 9060 is validating does something (what?) very slowly - for nearly 4 minutes - between validating the parameter groups and validating the resources, without doing any I/O or otherwise prompting a context switch.

Or:

* There's a *massive* GC pause

Or:

* The undercloud box is swapping, and there's a *massive* pause for that process while we wait for its pages to be swapped back in (a crude way to check for a whole-process stall like this is sketched just below).
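
One crude way to tell the last two possibilities (or anything else that stalls the whole process) apart from a messaging problem would be a stall detector along these lines - standard library only, and purely hypothetical as to where it would be wired into heat-engine; under eventlet monkey-patching the thread becomes a greenthread, but a whole-process stall still shows up as sleep lag:

import logging
import threading
import time

LOG = logging.getLogger(__name__)

def _watch_for_stalls(interval=1.0, threshold=5.0):
    # Sleep for `interval` seconds at a time and complain whenever we wake
    # up much later than that - a long GC pause or a swap storm stalls this
    # thread along with everything else, whereas a slow Rabbit delivery
    # does not.
    while True:
        before = time.time()
        time.sleep(interval)
        lag = time.time() - before - interval
        if lag > threshold:
            LOG.warning("process appears to have stalled for ~%.1fs", lag)

monitor = threading.Thread(target=_watch_for_stalls)
monitor.daemon = True
monitor.start()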

Alternatively, a less likely possibility is that Rabbit really does take that long to deliver the message for some reason.

In any event, the solution is probably to retry on the client side if you get a 504, or to increase the gateway timeout so that it's at least as long as Heat's RPC timeout (TBH I thought we had already done this, so it's probably a regression in the HAProxy config).
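
For the client-side half, something along these lines would do (a minimal sketch using requests; the endpoint, token and retry budget are placeholders, not actual heatclient code):

import time
import requests

URL = ("https://undercloud.example:13004/v1/c1336c5e08744b72a59ac12f14986197"
       "/stacks/overcloud/events")
HEADERS = {"X-Auth-Token": "<token>", "Accept": "application/json"}

def get_events(retries=3, backoff=5):
    # Retry only on 504 (gateway timeout); anything else is a real error.
    for attempt in range(retries + 1):
        resp = requests.get(URL, headers=HEADERS)
        if resp.status_code != 504:
            resp.raise_for_status()
            return resp.json()
        time.sleep(backoff * (attempt + 1))
    raise RuntimeError("gateway timed out %d times in a row" % (retries + 1))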