poppy goes into a death spiral when amqp is not available

Bug #906211 reported by Julian Edwards
10
This bug affects 2 people
Affects Status Importance Assigned to Milestone
txpkgupload
Triaged
Critical
Unassigned

Bug Description

On germanium, the poppy server will suddenly get slow to respond to SFTP requests and eventually time out. In its logs you can see that at some point it stops trying to contact the XMLRPC auth server. During this time, FTP requests are still served OK.

Tags: poppy
Changed in launchpad:
status: New → Triaged
importance: Undecided → Critical
description: updated
description: updated
Revision history for this message
Julian Edwards (julian-edwards) wrote :

(reposting last comment with IP addresses hidden)

So far I can see that xmlrpc requests to the auth server are timing out. See the first line and then further on there's a "
User timeout caused connection failure". This is only ever caused by a TCP-level timeout of the SYN package which defaults to 30 seconds.

2011-12-16 10:27:38+0000 [SSHService ssh-userauth on KeepAliveSettingSSHServerTransport (TimeoutProtocol)] Starting factory <twisted.web.xmlrpc._QueryFactory instance at 0x7b53248>
2011-12-16 10:27:57+0000 [FTP (ProtocolWrapper),21,xxx] DTPFactory.setTimeout set to 10 seconds
2011-12-16 10:27:57+0000 [FTP (ProtocolWrapper),21,xxx] DTPFactory starting on 58422
2011-12-16 10:27:57+0000 [FTP (ProtocolWrapper),21,xxx] Starting factory <twisted.protocols.ftp.DTPFactory instance at 0xa47b248>
2011-12-16 10:27:58+0000 [twisted.protocols.ftp.DTPFactory] DTPFactory.buildProtocol
2011-12-16 10:27:58+0000 [twisted.protocols.ftp.DTPFactory] cancelling DTP timeout
2011-12-16 10:27:59+0000 [FTP (ProtocolWrapper),21xxx] cleanupDTP
2011-12-16 10:27:59+0000 [FTP (ProtocolWrapper),21,xxx] <<class 'twisted.internet.tcp.Port'> of twisted.protocols.ftp.DTPFactory on 58422>
2011-12-16 10:27:59+0000 [FTP (ProtocolWrapper),21xxx] dtpFactory.stopFactory
2011-12-16 10:27:59+0000 [FTP (ProtocolWrapper),21,xxx] DTPFactory.setTimeout set to 10 seconds
2011-12-16 10:27:59+0000 [FTP (ProtocolWrapper),21,xxx] DTPFactory starting on 34064
2011-12-16 10:27:59+0000 [FTP (ProtocolWrapper),21,xxx] Starting factory <twisted.protocols.ftp.DTPFactory instance at 0xa7cefc8>
2011-12-16 10:27:59+0000 [-] (TCP Port 58422 Closed)
2011-12-16 10:27:59+0000 [-] Stopping factory <twisted.protocols.ftp.DTPFactory instance at 0xa47b248>
2011-12-16 10:27:59+0000 [-] dtpFactory.stopFactory
2011-12-16 10:28:00+0000 [twisted.protocols.ftp.DTPFactory] DTPFactory.buildProtocol
2011-12-16 10:28:00+0000 [twisted.protocols.ftp.DTPFactory] cancelling DTP timeout
2011-12-16 10:28:08+0000 [-] [Failure instance: Traceback (failure with no frames): <class 'twisted.internet.error.TimeoutError'>: User timeout caused connection failure.
        ]
2011-12-16 10:28:08+0000 [-] udienz failed auth publickey
2011-12-16 10:28:08+0000 [-] unauthorized login: unable to get avatar id
2011-12-16 10:28:08+0000 [-] Stopping factory <twisted.web.xmlrpc._QueryFactory instance at 0x7b53248>

Revision history for this message
Julian Edwards (julian-edwards) wrote :

It seems as though this lack of response coincided with fast downtime on the appservers. It also seems to send poppy into a death spiral, although I am not sure why it never reconnects once it's had a failure.

Revision history for this message
William Grant (wgrant) wrote :

There was no fastdowntime on Friday. All there was was a poppy restart and a nodowntime. The nodowntime could improbably have resulted in a single 502 to poppy's requests.

Revision history for this message
Julian Edwards (julian-edwards) wrote :

Ok thedac told me there was a FDT. Either way, connections were unavailable to the auth server for whatever reason.

Revision history for this message
Julian Edwards (julian-edwards) wrote :

<mthaddon> bigjools: we did 2xNDT and 1xFDT according to the deployment calendar last friday

Revision history for this message
Julian Edwards (julian-edwards) wrote :

LOSAs appear confused between FDT and ppa/ftpmaster rollouts.

Revision history for this message
Julian Edwards (julian-edwards) wrote :

So the rollout finished at @10:08. There was a successful SFTP test at 10:15:

2011-12-16 10:15:11+0000 [SSHService ssh-userauth on KeepAliveSettingSSHServerTransport (Timeou
tProtocol)] Starting factory <twisted.web.xmlrpc._QueryFactory instance at 0x6737128>
2011-12-16 10:15:11+0000 [QueryProtocol,client] Stopping factory <twisted.web.xmlrpc._QueryFact
ory instance at 0x6737128>

The next SFTP access, as in comment 2, failed.

Revision history for this message
Julian Edwards (julian-edwards) wrote : Re: poppy loses connectivity to the auth server
Download full text (3.9 KiB)

There seems to be a lot of threading going on. I don't know if this is normal, but as far as I can see the only threads created should be by oops_twisted when it is publishing an OOPS.

ps -eLf | grep ^1001
1001 12478 12159 12478 0 1 11:57 pts/0 00:00:00 su - lp_upload
1001 12487 12478 12487 0 1 11:57 pts/0 00:00:00 -su
1001 15485 1 15485 0 11 12:00 ? 00:00:31 /usr/bin/python2.6 -S /srv/launchpad.net/codelines/current/bin/twistd --pidfile /srv/launchpad.net/var/ftpmaster-poppy-sftp.pid --python /srv/launchpad.net/codelines/current/daemons/poppy-sftp.tac --logfile /srv/launchpad.net/production-logs/lp_upload/poppy-sftp.log --umask=0
1001 15485 1 15636 0 11 12:00 ? 00:00:01 /usr/bin/python2.6 -S /srv/launchpad.net/codelines/current/bin/twistd --pidfile /srv/launchpad.net/var/ftpmaster-poppy-sftp.pid --python /srv/launchpad.net/codelines/current/daemons/poppy-sftp.tac --logfile /srv/launchpad.net/production-logs/lp_upload/poppy-sftp.log --umask=0
1001 15485 1 15772 0 11 12:00 ? 00:00:02 /usr/bin/python2.6 -S /srv/launchpad.net/codelines/current/bin/twistd --pidfile /srv/launchpad.net/var/ftpmaster-poppy-sftp.pid --python /srv/launchpad.net/codelines/current/daemons/poppy-sftp.tac --logfile /srv/launchpad.net/production-logs/lp_upload/poppy-sftp.log --umask=0
1001 15485 1 15773 0 11 12:00 ? 00:00:01 /usr/bin/python2.6 -S /srv/launchpad.net/codelines/current/bin/twistd --pidfile /srv/launchpad.net/var/ftpmaster-poppy-sftp.pid --python /srv/launchpad.net/codelines/current/daemons/poppy-sftp.tac --logfile /srv/launchpad.net/production-logs/lp_upload/poppy-sftp.log --umask=0
1001 15485 1 15774 0 11 12:00 ? 00:00:01 /usr/bin/python2.6 -S /srv/launchpad.net/codelines/current/bin/twistd --pidfile /srv/launchpad.net/var/ftpmaster-poppy-sftp.pid --python /srv/launchpad.net/codelines/current/daemons/poppy-sftp.tac --logfile /srv/launchpad.net/production-logs/lp_upload/poppy-sftp.log --umask=0
1001 15485 1 15775 0 11 12:00 ? 00:00:02 /usr/bin/python2.6 -S /srv/launchpad.net/codelines/current/bin/twistd --pidfile /srv/launchpad.net/var/ftpmaster-poppy-sftp.pid --python /srv/launchpad.net/codelines/current/daemons/poppy-sftp.tac --logfile /srv/launchpad.net/production-logs/lp_upload/poppy-sftp.log --umask=0
1001 15485 1 15776 0 11 12:00 ? 00:00:01 /usr/bin/python2.6 -S /srv/launchpad.net/codelines/current/bin/twistd --pidfile /srv/launchpad.net/var/ftpmaster-poppy-sftp.pid --python /srv/launchpad.net/codelines/current/daemons/poppy-sftp.tac --logfile /srv/launchpad.net/production-logs/lp_upload/poppy-sftp.log --umask=0
1001 15485 1 15777 0 11 12:00 ? 00:00:01 /usr/bin/python2.6 -S /srv/launchpad.net/codelines/current/bin/twistd --pidfile /srv/launchpad.net/var/ftpmaster-poppy-sftp.pid --python /srv/launchpad.net/codelines/current/daemons/poppy-sftp.tac --logfile /srv/launchpad.net/production-logs/lp_upload/poppy-sftp.log --umask=0
1001 15485 1 15833 0 11 12:01 ? 00:00:01 /usr/bin/python2.6 -S /srv/launchpad.net/codelines/current/bin/twistd --pidfile /srv/launchpad.net/var/ftpm...

Read more...

tags: added: poppy
summary: - poppy died in a pulp of its own swappiness
+ poppy loses connectivity to the aith server
summary: - poppy loses connectivity to the aith server
+ poppy loses connectivity to the auth server
description: updated
Revision history for this message
Julian Edwards (julian-edwards) wrote :

An strace earlier today revealed that for the first few minutes everything works ok and then eventually it gets slower to respond to SFTP connections to the point where it exceeds the 30 second timeout. The strace shows that normally it calls socket() followed by connect() but the gap between the two increases.

Revision history for this message
Julian Edwards (julian-edwards) wrote :

The problem has been diagnosed: the connection to Rabbit from germanium was firewalled.

When writing an oops, it defers to a thread to talk to Rabbit. Because we write oopses to Rabbit, it was blocking the thread on a TCP timeout and then when the thread pool was full it got fairly catastrophic.

The fix needs to be in the twisted_oops code to use txamqp instead of threads and use a much shorter timeout.

Changed in launchpad:
status: Triaged → Invalid
Revision history for this message
Robert Collins (lifeless) wrote :

So, when thedac tried manually they didn't get a timeout, it failed instantly. That suggests that it wasn't a tcp timeout but something more nefarious causing the issue.

The short list of potential candidates are:
 - a bug in the LP preparation of the oops_twisted.Config object
 - a bug in oops_twisted.Config.publish
 - a bug in oops_twisted's adapter for non-twisted publishers.

In addition to that:
 - if we were getting socket timeout errors (e.g. due to black-hole firewalling) we should fix that and use a low (0.5->1.0 second timeout)
 - we probably want to do that on all services - e.g. in oops_amqp or even globally.

I think the next actions to take are:
 - reproduce this problem and craft a fix for it. I'm going to reopen this bug accordingly: our amqp code must fail-soft.
 - separately, high priority not critical, lower the socket timeout for amqp connection attempts. Doing that may be sufficient to solve this bug, but as its conceptually a separate issue, I'd start with a separate bug.

Changed in launchpad:
status: Invalid → Triaged
summary: - poppy loses connectivity to the auth server
+ poppy goes into a death spiral when amqp is not available
Revision history for this message
Julian Edwards (julian-edwards) wrote : Re: [Bug 906211] Re: poppy loses connectivity to the auth server

I still don't think this bug is in poppy, it's the oops_amqp / oops_twisted
stuff.

Revision history for this message
Julian Edwards (julian-edwards) wrote :
Download full text (4.1 KiB)

The good news - the firewall fix has indeed worked and poppy is now happy with sftp connections.

The bad news - I can't see an OOPSes, so why the heck is it deferring to a thread for stuff?

Here's a log from a session - there are no errors:

2011-12-22 10:38:08+0000 [KeepAliveSettingSSHServerTransport (TimeoutProtocol),28,82.71.93.254] NEW KEYS
2011-12-22 10:38:08+0000 [KeepAliveSettingSSHServerTransport (TimeoutProtocol),28,82.71.93.254] starting service ssh-userauth
2011-12-22 10:38:08+0000 [SSHService ssh-userauth on KeepAliveSettingSSHServerTransport (TimeoutProtocol)] julian-edwards trying auth none
2011-12-22 10:38:08+0000 [SSHService ssh-userauth on KeepAliveSettingSSHServerTransport (TimeoutProtocol)] julian-edwards trying auth publickey
2011-12-22 10:38:08+0000 [SSHService ssh-userauth on KeepAliveSettingSSHServerTransport (TimeoutProtocol)] Starting factory <twisted.web.xmlrpc._QueryFactory instance at 0x959d998>
2011-12-22 10:38:08+0000 [QueryProtocol,client] Stopping factory <twisted.web.xmlrpc._QueryFactory instance at 0x959d998>
2011-12-22 10:38:08+0000 [SSHService ssh-userauth on KeepAliveSettingSSHServerTransport (TimeoutProtocol)] julian-edwards trying auth publickey
2011-12-22 10:38:08+0000 [SSHService ssh-userauth on KeepAliveSettingSSHServerTransport (TimeoutProtocol)] julian-edwards authenticated with publickey
2011-12-22 10:38:08+0000 [SSHService ssh-userauth on KeepAliveSettingSSHServerTransport (TimeoutProtocol)] starting service ssh-connection
2011-12-22 10:38:08+0000 [SSHService ssh-userauth on KeepAliveSettingSSHServerTransport (TimeoutProtocol)] INFO:/srv/launchpad.net/production-logs/lp_upload/poppy-sftp-access.log:[100225824] julian-edwards logged in.
2011-12-22 10:38:08+0000 [SSHService ssh-connection on KeepAliveSettingSSHServerTransport (TimeoutProtocol)] got channel session request
2011-12-22 10:38:08+0000 [SSHChannel session (0) on SSHService ssh-connection on KeepAliveSettingSSHServerTransport (TimeoutProtocol)] channel open
2011-12-22 10:38:09+0000 [SSHChannel session (0) on SSHService ssh-connection on KeepAliveSettingSSHServerTransport (TimeoutProtocol)] unhandled request for env
2011-12-22 10:38:09+0000 [SSHChannel session (0) on SSHService ssh-connection on KeepAliveSettingSSHServerTransport (TimeoutProtocol)] unhandled request for env
2011-12-22 10:38:09+0000 [SSHChannel session (0) on SSHService ssh-connection on KeepAliveSettingSSHServerTransport (TimeoutProtocol)] asking for subsystem "sftp"
2011-12-22 10:38:09+0000 [SSHChannel session (0) on SSHService ssh-connection on KeepAliveSettingSSHServerTransport (TimeoutProtocol)] {'sftp': <class lp.services.sshserver.sftp.FileTransferServer at 0x5112c50>}
2011-12-22 10:38:09+0000 [SSHChannel session (0) on SSHService ssh-connection on KeepAliveSettingSSHServerTransport (TimeoutProtocol)] INFO:/srv/launchpad.net/production-logs/lp_upload/poppy-sftp-access.log:[100225824] julian-edwards started SFTP session.
2011-12-22 10:38:09+0000 [SSHChannel session (0) on SSHService ssh-connection on KeepAliveSettingSSHServerTransport (TimeoutProtocol)] sending close 0
2011-12-22 10:38:09+0000 [SSHChannel session (0) on SSHService ssh-connection on KeepAliveSettingSS...

Read more...

Colin Watson (cjwatson)
affects: launchpad → txpkgupload
Changed in txpkgupload:
assignee: nobody → yongyuth chuenwanichakool (chox1111)
status: Triaged → Incomplete
Colin Watson (cjwatson)
Changed in txpkgupload:
status: Incomplete → Triaged
assignee: yongyuth chuenwanichakool (chox1111) → nobody
To post a comment you must log in.
This report contains Public information  
Everyone can see this information.

Other bug subscribers

Remote bug watches

Bug watches keep track of this bug in other bug trackers.