Rabbit failure when sending an OOPS seems to hang the producer (socket timeout)
Bug #901449 reported by
Francis J. Lacoste
This bug affects 1 person
Affects | Status | Importance | Assigned to | Milestone | |
---|---|---|---|---|---|
python-oops-amqp | Triaged | High | Unassigned | | |
Bug Description
On 2011-12-07, we had a poppy bug which exhausted RabbitMQ's memory, causing thrashing and swapping and eventual death. During this period the LP app servers appeared to suffer, and oops-amqp is a candidate culprit.
=INFO REPORT==== 7-Dec-2011:
Shortly after that, app servers stopped responding to Nagios checks (10s socket timeout).
We probably aren't handling that Rabbit error very well.
Changed in python-oops-amqp: | |
status: | New → Triaged |
importance: | Undecided → Critical |
tags: | added: rabbit |
summary: |
- Rabbit failure when sending an OOPS seems to hang the producer
+ Rabbit failure when sending an OOPS seems to hang the producer (socket timeout) |
tags: | added: socket-timeout |
Reproducing this in a test will be tricky.
The situation was this: the rabbit service was receiving millions of
messages, and we have no reason to believe any one message was more
than a few MB. The consumers could not keep up and eventually rabbit
ran out of RAM (though since these are persistent messages, why it's
holding any info on them in RAM is an open question).
So the possible failure modes are:
- rabbit was giving slow socket behaviour rather than responding
rapidly - doing a tar pit impression
- rabbit was not acking the message promptly / at all
- rabbit had disconnected but we didn't notice a socket error?
- rabbit had disconnected but connect() wasn't erroring immediately.
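
All four failure modes above leave the producer blocked on a socket call with no deadline. A minimal sketch of bounding each step, using only the standard library socket module rather than whatever AMQP client oops-amqp actually uses; the function name send_with_deadline and the one-byte "ack" are hypothetical stand-ins for a real publish and confirm:

```python
import socket


def send_with_deadline(host, port, payload, timeout=5.0):
    """Try to deliver payload; return False instead of hanging.

    Hypothetical helper: a real AMQP publish involves a handshake and
    framing that are not modelled here. The point is only that connect,
    send and the wait for a reply are each bounded by `timeout`.
    """
    try:
        # Bounds connect(): a dead or unreachable broker fails fast.
        sock = socket.create_connection((host, port), timeout=timeout)
    except OSError:
        return False
    try:
        # The same timeout applies to later operations on this socket.
        sock.settimeout(timeout)
        sock.sendall(payload)
        # Stand-in for waiting on an ack: a tar-pitting peer that never
        # answers raises a timeout here rather than blocking forever.
        reply = sock.recv(1)
        # An empty read means the peer closed the connection.
        return reply != b""
    except OSError:
        # Covers timeouts, resets and other socket errors.
        return False
    finally:
        sock.close()
```

On a False return the caller could drop the OOPS or fall back to writing it to local disk, so that a sick broker costs at most a few seconds per report.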
This situation shouldn't have gotten this bad, but our Nagios alert
wasn't checking queue length. It is now.
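
For illustration only, a queue-length check of this kind might look like the sketch below. It is not the check that was actually deployed: the function name and thresholds are hypothetical, and it assumes the input is the tab-separated output of `rabbitmqctl list_queues name messages`. The return codes follow the usual Nagios plugin convention (0 OK, 1 WARNING, 2 CRITICAL).

```python
def check_queue_depth(listing, warn=10000, crit=100000):
    """Return (nagios_status, message) for the deepest queue in listing.

    Hypothetical sketch: `listing` is assumed to be text with one
    "name<TAB>message_count" pair per line. Lines that do not match
    that shape (banners, blank lines) are skipped.
    """
    worst_name, worst_depth = None, 0
    for line in listing.splitlines():
        parts = line.split("\t")
        if len(parts) != 2 or not parts[1].strip().isdigit():
            continue
        depth = int(parts[1])
        if worst_name is None or depth > worst_depth:
            worst_name, worst_depth = parts[0], depth
    if worst_name is None:
        # Nothing parseable is reported as OK here; a stricter check
        # might prefer to return UNKNOWN (3) instead.
        return 0, "OK: no queues listed"
    if worst_depth >= crit:
        return 2, "CRITICAL: %s has %d messages" % (worst_name, worst_depth)
    if worst_depth >= warn:
        return 1, "WARNING: %s has %d messages" % (worst_name, worst_depth)
    return 0, "OK: deepest queue %s has %d messages" % (worst_name, worst_depth)
```

A wrapper script would run the command, pass its output to this function, print the message and exit with the status code.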