The situation was this: the rabbit service was receiving millions of
messages, and we have no reason to believe any single message was more
than a few MB. The consumers could not keep up, and eventually rabbit
ran out of RAM (though since these are persistent messages, why it's
holding any information about them in RAM is an open question).
So the possible failure modes are:
- rabbit was exhibiting slow socket behaviour rather than responding
promptly - doing a tar-pit impression
- rabbit was not acking messages promptly, or at all
- rabbit had disconnected, but we didn't notice a socket error
- rabbit had disconnected, but connect() wasn't erroring immediately.
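The first and third of those can at least be surfaced on the client side with a socket timeout, so a stalled broker raises instead of hanging forever. A minimal sketch (stdlib only, not the actual client code from this incident - the port, the "ping" payload, and the probe function are all illustrative):

```python
import socket
import threading


def tar_pit_server(state):
    # A server that accepts connections but never responds,
    # mimicking the "tar pit" behaviour described above.
    srv = socket.socket()
    srv.bind(("127.0.0.1", 0))
    srv.listen(1)
    state["port"] = srv.getsockname()[1]
    state["ready"].set()
    conn, _ = srv.accept()
    state["done"].wait()  # hold the connection open, send nothing
    conn.close()
    srv.close()


def probe_broker(port, timeout=0.5):
    """Return True if the peer answers within `timeout`, False if it
    stalls -- the timeout distinguishes a tar pit from a live broker."""
    with socket.create_connection(("127.0.0.1", port), timeout=timeout) as s:
        s.sendall(b"ping")
        try:
            s.recv(1)  # a live broker would answer; a tar pit won't
            return True
        except socket.timeout:
            return False
```

The same `timeout` argument on `socket.create_connection` also bounds connect() itself, which covers the case where the broker is gone but the TCP handshake just hangs.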
This situation shouldn't have gotten this bad, but our Nagios alert
wasn't checking queue length. It is now.
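For reference, that check boils down to comparing the queue depth (however you obtain it - rabbitmqctl list_queues, or the management HTTP API) against thresholds and returning Nagios exit-code semantics (0=OK, 1=WARNING, 2=CRITICAL). A sketch with made-up thresholds, not the ones from our actual alert:

```python
def check_queue_depth(depth, warn=100_000, crit=1_000_000):
    """Nagios-plugin-style check on a queue's message count.

    Returns (exit_code, message) following the usual convention:
    0 = OK, 1 = WARNING, 2 = CRITICAL. Thresholds are illustrative.
    """
    if depth >= crit:
        return 2, f"CRITICAL: queue depth {depth} >= {crit}"
    if depth >= warn:
        return 1, f"WARNING: queue depth {depth} >= {warn}"
    return 0, f"OK: queue depth {depth}"
```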
Reproducing this in a test will be tricky.