Rabbit failure when sending an OOPS seems to hang the producer (socket timeout)

Bug #901449 reported by Francis J. Lacoste
This bug affects 1 person
Affects: python-oops-amqp
Status: Triaged
Importance: High
Assigned to: Unassigned
Milestone: (none listed)

Bug Description

On 2011-12-07, we had a poppy bug which exhausted RabbitMQ's memory, causing thrashing and swapping and eventual death. During this period the LP app servers appeared to suffer, and oops-amqp is a candidate culprit.

=INFO REPORT==== 7-Dec-2011::21:06:44 === alarm_handler: {set,{vm_memory_high_watermark,[]}}

Shortly after that, app servers stopped responding to Nagios checks (10s socket timeout).

We probably aren't handling that Rabbit error very well.

Robert Collins (lifeless) wrote : Re: [Bug 901449] [NEW] Rabbit failure when sending an OOPS seems to hang the producer

Reproducing this in a test will be tricky.

The situation was this: the rabbit service was receiving millions of
messages, and we have no reason to believe any one message was more than
a few MB. The consumers could not keep up and eventually rabbit ran out
of RAM (though since these are persistent messages, why it's holding any
info on them in RAM is an open question).

So the possible failure modes are:
 - rabbit was giving slow socket behaviour rather than responding
rapidly - doing a tar pit impression
 - rabbit was not acking the message promptly / at all
 - rabbit had disconnected but we didn't notice a socket error?
 - rabbit had disconnected but connect() wasn't erroring immediately.

This situation shouldn't have gotten this bad but our nagios alert
wasn't checking queue length. It is now.
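
As a starting point for reproducing the first failure mode in a test, here is a minimal sketch of a "tar pit" peer. It uses only the Python 3 standard library rather than amqplib, and the names start_tarpit and probe are hypothetical, invented for this illustration. The listener never calls accept(), so the kernel completes the TCP handshake but nothing ever replies; a client with a socket timeout then fails fast instead of hanging.

```python
import socket


def start_tarpit():
    """Open a localhost listener that never accepts or replies.

    The kernel completes the TCP handshake for queued connections,
    so connect() succeeds, but no application ever answers.
    """
    server = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    server.bind(("127.0.0.1", 0))  # port 0: let the OS pick a free port
    server.listen(1)
    return server


def probe(host, port, timeout):
    """Connect, send a placeholder payload, and wait for one byte.

    Returns "timeout" if the peer stays silent for `timeout` seconds,
    "closed" if the peer closed the connection, "replied" otherwise.
    """
    sock = socket.create_connection((host, port), timeout=timeout)
    try:
        sock.sendall(b"ping")  # placeholder bytes, not a real AMQP frame
        data = sock.recv(1)
        return "replied" if data else "closed"
    except socket.timeout:
        return "timeout"
    finally:
        sock.close()
```

A test could call start_tarpit(), pass the address from getsockname() to probe() with a short timeout, and assert that the result is "timeout". The other failure modes (no ack, unnoticed disconnect) would need a fake peer that speaks at least part of the AMQP handshake, which this sketch does not attempt.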

Robert Collins (lifeless) wrote : Re: Rabbit failure when sending an OOPS seems to hang the producer

OOPS-04319168ea720529358ae77eb667fea9 may be an example of this:
 error: [Errno 104] Connection reset by peer

Traceback (most recent call last):
  Module lp.services.messaging.rabbit, line 188, in finish
    super(RabbitUnreliableSession, self).finish()
  Module lp.services.messaging.rabbit, line 134, in finish
    self.reset()
  Module lp.services.messaging.rabbit, line 139, in reset
    self.disconnect()
  Module lp.services.messaging.rabbit, line 119, in disconnect
    self._connection.close()
  Module amqplib.client_0_8.connection, line 301, in close
    (10, 61), # Connection.close_ok
  Module amqplib.client_0_8.abstract_channel, line 89, in wait
    self.channel_id, allowed_methods)
  Module amqplib.client_0_8.connection, line 218, in _wait_method
    self.wait()
  Module amqplib.client_0_8.abstract_channel, line 105, in wait
    return amqp_method(self, args)
  Module amqplib.client_0_8.connection, line 365, in _close
    self._x_close_ok()
  Module amqplib.client_0_8.connection, line 384, in _x_close_ok
    self._send_method((10, 61))
  Module amqplib.client_0_8.abstract_channel, line 70, in _send_method
    method_sig, args, content)
  Module amqplib.client_0_8.method_framing, line 233, in write_method
    self.dest.write_frame(1, channel, payload)
  Module amqplib.client_0_8.transport, line 125, in write_frame
    frame_type, channel, size, payload, 0xce))
  Module socket, line 1, in sendall
error: [Errno 104] Connection reset by peer

The absence of a URL and so forth in the OOPS is a little worrying too.

Changed in python-oops-amqp:
status: New → Triaged
importance: Undecided → Critical
tags: added: rabbit

Robert Collins (lifeless) wrote :

(Note that this trace appears to be coming from the LP internal code, now that I look at it more closely, so the erroneous behaviour may not have anything to do with oops.)

Robert Collins (lifeless) wrote :

I've had more of a look around and can't see any obvious sign that oops-amqp is actually at fault here. We may well benefit from putting some very low socket timeouts in place - and may well want to do that globally in LP.

For now, I'm going to make this high: making LP more robust against backend services getting self-DOSed is important, but it is not the source of the problem.
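
A minimal sketch of the low-socket-timeout idea mentioned above, using only the Python 3 standard library. The helper name send_or_give_up and the 2 second default are assumptions made for illustration; this is not the oops-amqp or Launchpad implementation. The point is that every socket operation is bounded, and any socket failure (timeout, connection reset, connection refused) is turned into a plain False so the caller can fall back to another way of recording the OOPS.

```python
import socket


def send_or_give_up(address, payload, timeout=2.0):
    """Try to deliver `payload` to `address` within bounded time.

    Returns True on success, False on any socket-level failure.
    The timeout applies to connect() and to each send operation.
    """
    try:
        with socket.create_connection(address, timeout=timeout) as sock:
            sock.sendall(payload)
        return True
    except OSError:
        # OSError covers socket.timeout, ECONNRESET and ECONNREFUSED.
        return False


# A process-wide default for sockets created afterwards can be set with:
#     socket.setdefaulttimeout(2.0)
# This affects every library in the process that does not set its own
# timeout, so it is a blunt tool and the value here is only an example.
```

Whether a global default or a per-connection timeout is the better fit depends on which other services share the process; the per-connection form is less likely to surprise unrelated code.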

Changed in python-oops-amqp:
importance: Critical → High
description: updated
summary: - Rabbit failure when sending an OOPS seems to hang the producer
+ Rabbit failure when sending an OOPS seems to hang the producer (socket
+ timeout)
tags: added: socket-timeout