So, when thedac tried manually they didn't get a timeout; it failed instantly. That suggests it wasn't a TCP timeout but something more nefarious causing the issue.
The short list of potential candidates is:
- a bug in the LP preparation of the oops_twisted.Config object
- a bug in oops_twisted.Config.publish
- a bug in oops_twisted's adapter for non-twisted publishers.
In addition to that:
- if we were getting socket timeout errors (e.g. due to black-hole firewalling) we should fix that and use a low timeout (somewhere between 0.5 and 1.0 seconds)
- we probably want to do that on all services - e.g. in oops_amqp or even globally.
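To illustrate the low-timeout idea, here is a minimal sketch using only the Python standard library. It is not the oops_amqp or oops_twisted code; the helper name `try_connect` and its default of 1.0 second are assumptions made for illustration. The point is that a black-holed connection attempt gives up after the timeout instead of hanging for the OS default.

```python
import socket


def try_connect(host, port, timeout=1.0):
    """Attempt a TCP connection, giving up after `timeout` seconds.

    Hypothetical helper: returns the connected socket, or None if the
    attempt was refused, timed out, or failed in any other OS-level way.
    """
    try:
        # create_connection applies the timeout to the connect() call
        # and leaves it set on the returned socket.
        return socket.create_connection((host, port), timeout=timeout)
    except OSError:
        # socket.timeout and ConnectionRefusedError are both OSError
        # subclasses in Python 3, so this covers both failure modes.
        return None
```

A caller would treat `None` as "broker unreachable" and carry on, rather than blocking the request on the connection attempt.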
I think the next actions to take are:
- reproduce this problem and craft a fix for it. I'm going to reopen this bug accordingly: our amqp code must fail-soft.
- separately, high priority but not critical, lower the socket timeout for amqp connection attempts. Doing that may be sufficient to solve this bug, but as it's conceptually a separate issue, I'd start with a separate bug.
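For clarity on what "fail-soft" means here, a rough sketch follows. It assumes a publisher is a plain callable that takes a report; the `fail_soft` wrapper and the logger name are hypothetical and are not part of oops_twisted or oops_amqp. Whatever the real fix turns out to be, the behaviour we want is that a publishing failure is logged and swallowed rather than propagated to the caller.

```python
import logging

logger = logging.getLogger("oops.failsoft")  # hypothetical logger name


def fail_soft(publisher):
    """Wrap a publisher callable so its failures never reach the caller.

    Hypothetical wrapper: any exception raised by `publisher` is logged
    and None is returned in place of the publisher's normal result.
    """
    def wrapper(report):
        try:
            return publisher(report)
        except Exception:
            # Log the traceback, then carry on without publishing.
            logger.exception("OOPS publication failed; continuing")
            return None
    return wrapper
```

With this shape, an instantly refused connection and a slow timeout both end the same way for the caller: no OOPS is published, and the request itself continues.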