poppy-sftp OOPSes infinitely

Bug #901498 reported by William Grant
14
This bug affects 2 people
Affects Status Importance Assigned to Milestone
Launchpad itself
Fix Released
Critical
Julian Edwards

Bug Description

One can trivially cause poppy-sftp to go into an infinite OOPS death spiral like so:

http://paste.ubuntu.com/763279/

To reproduce, start poppy-sftp locally, and over FTP upload a changes file that will fail signature verification.

It looks like something is feeding back the Python logger into Twisted logging, causing the INFO log about OOPS logging to itself log an OOPS.

Tags: oops qa-ok

Related branches

Revision history for this message
Robert Collins (lifeless) wrote :

utilities/start-dev-soyuz.sh

Revision history for this message
Robert Collins (lifeless) wrote :

14:22 < lifeless> wgrant: what ip/port does poppy-sftp use in dev?
14:22 < wgrant> lifeless: *:2121
14:22 < wgrant> 5023 for SFTP, but that's not the problem here.

Revision history for this message
Robert Collins (lifeless) wrote :

Ran it with /var/tmp/poppy/incoming missing, and it very happily writes one OOPS per upload.

Reproduced with that dir made and a foo.changes file which was empty.

Revision history for this message
Robert Collins (lifeless) wrote :

editing startLoggingWithObserver (force setStdOut to 0) to let me use -n and debug is very useful.

Some q&d tests:
I managed to trigger 12K oopses once.
I'm getting 1 consistently after that.

Revision history for this message
Robert Collins (lifeless) wrote :

The one oops I get is:
"/var/tmp/lperr/2011-12-08/OOPS-ba20d8ded4e8e05363d5ea7fcec26c88" may be a binary file. See it anyway?
Changes file must be signed with a valid GPG signature: Verification failed 3 times: ['No data', 'No data', 'No data']

Unhandled Error
Traceback (most recent call last):
Failure: twisted.protocols.ftp.PermissionDeniedError: Changes file must be signed with a valid GPG signature: Verification failed 3 times: ['No data', 'No data', 'No data']

So the first useful observation I have is that this shouldn't really be an OOPS, nor even logged - its sent directly to the user anyway, and we expect this to occur in production when users are muppets.

Revision history for this message
Robert Collins (lifeless) wrote :

def startLoggingWithObserver(observer, setStdout=1) is the culprit in twisted.

we call that from lib/lp/services/sshserver/service.py(174)startService() (and in fact its called earlier anyhow).

pathology:
our logging (not log) code has INFO wired up to stderr.
stderr is mapped by startLoggingWithObserver into isError reports
generating an OOPS triggers an INFO report.

the stderr wiring is 100% useless here, because we don't have any ability to write to the console at all.

Revision history for this message
Robert Collins (lifeless) wrote :

set_up_oops_reporting does this:
oops_observer = OOPSObserver(errorlog.globalErrorUtility._oops_config,
    log.PythonLoggingObserver(loggerName=name).emit)

to deliberately pass non-error messages onto the regular LP framework logging setup.

So changing the logging setup for twistd apps to never ever use stdout/stderr will fix this.

Revision history for this message
Robert Collins (lifeless) wrote :

Whats weird is that we separately have python logging glued onto the twisted logging in setup_tacfile_logging.

I think what we want is:
 - 'logging' messsages end up in logs
 - twistd messages end up in logs
 - exceptions from logging make oopses
 - exceptions from twistd code make oopses

seems like the /usual/ way to do this is to use the twisted logging layer so that twistd options work.

So:
 - setup_tacfile_logging gets us all the logging. calls. We must have this.
 - we should not forward twisted messages to python logging in general.
 - set_up_tacfile_logging is poor at the moment - it turns everything into a string and back again rather than translating it (e.g. to preserve backtraces as isError etc).
 - but the *default* for twistd is to capture stdout and stderr and respectively turn them into msg() and err() calls
 - so setup_tacfile_logging is entirely redundant if we have python logging writing to either stdout or stderr.
 - the default for python logging in the absence of any configuration is to output ERROR: lines on stderr.
 - if setLevel is called on the root logger, this will affect the default output

-> we should:
 - delete setup_tacfile_logging as no better than the default.
 - explicitly configure python logging to log to stdout only - no log files, no nothing. This will result in messages being caught by twistd's aggressive redirects.
 - *either* configure a logging OopsHandler (stdlib logging specific) -or- configure logging to send exceptions to stderr, to cause stderr exceptions to trigger oopses
 - configure the fallback observer for OopsObserver to be a UnixLogObserver or something similar.

Revision history for this message
Steve Kowalik (stevenk) wrote :

lp:~stevenk/launchpad/kill-set_up_tacfile_logging is my WIP branch that removes setup_tacfile_logging.

Changed in launchpad:
assignee: nobody → Julian Edwards (julian-edwards)
status: Triaged → In Progress
Revision history for this message
Julian Edwards (julian-edwards) wrote :

How does all this fit in with the buildd-managers use of RotatableLogFileObserver? I still haven't got my head around all this.

Revision history for this message
Julian Edwards (julian-edwards) wrote :

Also, removing setup_tacfile_logging has not helped so far.

Revision history for this message
Julian Edwards (julian-edwards) wrote :

Aha, sorted. I kept setup_tacfile_logging but changed log.StdioOnnaStick with sys.stdout in the StreamHandler. Errors now go to stdout properly and there's no OOPS loop.

Revision history for this message
Julian Edwards (julian-edwards) wrote :

Well I spoke too soon. Using sys.stdout means I get no python logger messages at all. Using StdioOnnaStick makes each message get written to the log ~20 times. WTF?

Revision history for this message
Julian Edwards (julian-edwards) wrote :

The root cause of the issue seems to be the use of logging.getLogger("poppy-sftp").
If that's replaced with logging.getLogger() then we don't get the recursive behaviour. Unfortunately right now I can't fix the issue in my previous comment.

Revision history for this message
Julian Edwards (julian-edwards) wrote :

My experiments so far:
http://pastebin.ubuntu.com/764003/

Revision history for this message
Robert Collins (lifeless) wrote :

This is a sketch that should get the FTP service all honkey dory. Because SSHService does its own oops setup we probably end up with 2 rotatable file observers, which isn't good (but my sketch changes the api so hopefully will break anyhow :P) - I think we get a good core laid out the changes around other code to work with it will be quite containable.

http://pastebin.com/LV3QtjT9

Revision history for this message
Robert Collins (lifeless) wrote :

Julian made the sketch work.. http://pastebin.ubuntu.com/764210/ but that doesn't do what we want (what it doesn't isn't precisely defined either).

instrumenting startLoggingWithObserver, our observer isn't used there, and the symptoms look like it isn't added elsewhere.

Revision history for this message
Robert Collins (lifeless) wrote :

For instance, running with -n:
Starting poppy-sftp.
0 <bound method FileLogObserver.emit of <twisted.python.log.FileLogObserver instance at 0xb347c8c>>
INFO:/tmp/poppy-access.log:---- Server started ----
^CINFO:/tmp/poppy-access.log:---- Server stopped ----
INFO:poppy-sftp.Hooks:Post-processing finished

This shows setStdOut gets 0, so doesn't override the stdout, stderr channels.

Without -n setStdOut still gets 0.

The setup happens from 'AppLogger.start()'

(Pdb) self._observerFactory
(Pdb) application.getComponent(ILogObserver, None)

so both the proscribed ways of getting a custom observer are failing.

Revision history for this message
Robert Collins (lifeless) wrote :

So directlyProvides(oops_observer, ILogObserver) should DTRT.

Revision history for this message
Robert Collins (lifeless) wrote :

except that zope interfaces hates the world.

  File "/home/robertc/source/launchpad/lp-branches/working/lib/lp/services/twistedsupport/loggingsupport.py", line 65, in set_up_oops_reporting
    directlyProvides(oops_observer, log.ILogObserver)
  File "/home/robertc/source/launchpad/lp-sourcedeps/eggs/zope.interface-3.5.2-py2.6-linux-i686.egg/zope/interface/declarations.py", line 861, in directlyProvides
    object.__provides__ = Provides(cls, *interfaces)
exceptions.AttributeError: 'instancemethod' object has no attribute '__provides__'
...
so the example code in twisted is misleading (doing an emit method is incompatible with using actual classes). Worse they don't implement the needed interface themselves!

Revision history for this message
Robert Collins (lifeless) wrote :

ok, small tweak to oops_twisted to accomodate this hard learnt lesson and its copacetic on the twisted side:
2011-12-09 05:39:21+0000 [-] Log opened.
2011-12-09 05:39:21+0000 [-] twistd 11.1.0 (/usr/bin/python2.6 2.6.5) starting up.
2011-12-09 05:39:21+0000 [-] reactor class: twisted.internet.pollreactor.PollReactor.
2011-12-09 05:39:21+0000 [-] FTPFactory starting on 2121
2011-12-09 05:39:21+0000 [-] Starting factory <twisted.protocols.ftp.FTPFactory instance at 0xa7beecc>
2011-12-09 05:39:21+0000 [-] Factory (TimeoutFactory) starting on 5023
2011-12-09 05:39:21+0000 [-] Starting factory <lp.services.sshserver.service.Factory instance at 0xa7d992c>
2011-12-09 05:39:21+0000 [-] Starting factory <twisted.protocols.policies.TimeoutFactory instance at 0xa7d962c>
2011-12-09 05:39:21+0000 [-] daemon ready!
2011-12-09 05:39:26+0000 [FTP (ProtocolWrapper),0,192.168.1.102] DTPFactory.setTimeout set to 10 seconds
2011-12-09 05:39:26+0000 [FTP (ProtocolWrapper),0,192.168.1.102] DTPFactory starting on 52020
2011-12-09 05:39:26+0000 [FTP (ProtocolWrapper),0,192.168.1.102] Starting factory <twisted.protocols.ftp.DTPFactory instance at 0xaa509ec>
2011-12-09 05:39:26+0000 [twisted.protocols.ftp.DTPFactory] DTPFactory.buildProtocol
2011-12-09 05:39:26+0000 [twisted.protocols.ftp.DTPFactory] cancelling DTP timeout
2011-12-09 05:39:26+0000 [DTP,0,192.168.1.102] Unexpected error receiving file from client:
2011-12-09 05:39:26+0000 [DTP,0,192.168.1.102] Logged OOPS id OOPS-955a9002adadc01c92abf2ef3f85a7f7: PermissionDeniedError: Changes file must be signed with a valid GPG signature: Verification failed 3 times: ['No data', 'No data', 'No data']

Revision history for this message
Robert Collins (lifeless) wrote :

http://twistedmatrix.com/trac/ticket/5425 for the misleading inspiration in twisted.

Revision history for this message
Robert Collins (lifeless) wrote :

bug 902006 for the twisted-oops changes needed.

Revision history for this message
Robert Collins (lifeless) wrote :

Running with them and a tweak to the basicConfig gets this:
2011-12-09 06:21:41+0000 [-] Log opened.
2011-12-09 06:21:41+0000 [-] twistd 11.1.0 (/usr/bin/python2.6 2.6.5) starting up.
2011-12-09 06:21:41+0000 [-] reactor class: twisted.internet.pollreactor.PollReactor.
2011-12-09 06:21:41+0000 [-] FTPFactory starting on 2121
2011-12-09 06:21:41+0000 [-] Starting factory <twisted.protocols.ftp.FTPFactory instance at 0xb70ce4c>
2011-12-09 06:21:41+0000 [-] INFO:/tmp/poppy-access.log:---- Server started ----
2011-12-09 06:21:41+0000 [-] Factory (TimeoutFactory) starting on 5023
2011-12-09 06:21:41+0000 [-] Starting factory <lp.services.sshserver.service.Factory instance at 0xb715b0c>
2011-12-09 06:21:41+0000 [-] Starting factory <twisted.protocols.policies.TimeoutFactory instance at 0xb7157ec>
2011-12-09 06:21:41+0000 [-] daemon ready!

2011-12-09 06:22:02+0000 [FTP (ProtocolWrapper),0,192.168.1.102] DTPFactory.setTimeout set to 10 seconds
2011-12-09 06:22:02+0000 [FTP (ProtocolWrapper),0,192.168.1.102] DTPFactory starting on 47420
2011-12-09 06:22:02+0000 [FTP (ProtocolWrapper),0,192.168.1.102] Starting factory <twisted.protocols.ftp.DTPFactory instance at 0xb82550c>
2011-12-09 06:22:02+0000 [twisted.protocols.ftp.DTPFactory] DTPFactory.buildProtocol
2011-12-09 06:22:02+0000 [twisted.protocols.ftp.DTPFactory] cancelling DTP timeout
2011-12-09 06:22:02+0000 [DTP,0,192.168.1.102] Unexpected error receiving file from client:
2011-12-09 06:22:02+0000 [DTP,0,192.168.1.102] Logged OOPS id OOPS-3e5b1d25f4c7841b94f777b7a252dc3e: PermissionDeniedError: Changes file must be signed with a valid GPG signature: Verification failed 3 times: ['No data', 'No data', 'No data']
2011-12-09 06:22:05+0000 [FTP (ProtocolWrapper),0,192.168.1.102] cleanupDTP
2011-12-09 06:22:05+0000 [FTP (ProtocolWrapper),0,192.168.1.102] <<class 'twisted.internet.tcp.Port'> of twisted.protocols.ftp.DTPFactory on 47420>
2011-12-09 06:22:05+0000 [FTP (ProtocolWrapper),0,192.168.1.102] dtpFactory.stopFactory
2011-12-09 06:22:05+0000 [FTP (ProtocolWrapper),0,192.168.1.102] DTPFactory.setTimeout set to 10 seconds
2011-12-09 06:22:05+0000 [FTP (ProtocolWrapper),0,192.168.1.102] DTPFactory starting on 48188
2011-12-09 06:22:05+0000 [FTP (ProtocolWrapper),0,192.168.1.102] Starting factory <twisted.protocols.ftp.DTPFactory instance at 0xb86322c>
2011-12-09 06:22:05+0000 [-] (TCP Port 47420 Closed)
2011-12-09 06:22:05+0000 [-] Stopping factory <twisted.protocols.ftp.DTPFactory instance at 0xb82550c>
2011-12-09 06:22:05+0000 [-] dtpFactory.stopFactory
2011-12-09 06:22:05+0000 [twisted.protocols.ftp.DTPFactory] DTPFactory.buildProtocol
2011-12-09 06:22:05+0000 [twisted.protocols.ftp.DTPFactory] cancelling DTP timeout
2011-12-09 06:22:05+0000 [DTP,0,192.168.1.102] Unexpected error receiving file from client:
2011-12-09 06:22:05+0000 [DTP,0,192.168.1.102] Logged OOPS id OOPS-656bd30762dedb123cfea23fe4223b4d: PermissionDeniedError: Changes file must be signed with a valid GPG signature: Verification failed 3 times: ['No data', 'No data', 'No data']

in the log. I claim success. (the INFO: line is regular python logging).

Revision history for this message
Robert Collins (lifeless) wrote :

I've pushed a branch up that has the right behaviour at least this far:
 - logging output goes to the file
 - log output goes to the file
 - oopses are made when log.err() is called (a log event with isError is emitted)

I didn't reuse setup_tacfile_logging because the contract was all wonky, and keeping it the same would lead to silent WTF's elsewhere.

Other than fixing the obvious fallout the only remaining thing I can think of is to slap a regular logging OOPS adapter on after the basicConfig - this would be useful to turn logging.error() from random places inside LP itself into OOPS reports.

However, I think this is an optional extra and not needed right now, as poppy is meant to be, well, /twisted/.

Revision history for this message
Robert Collins (lifeless) wrote :

(that said, it should be as simple as
logging.getLogger().addHandler(canonical.launchpad.scripts.logger.OopsHandler('poppy'))
Should.

I wouldn't want that to be live without a test that it all hangs together correctly.

Revision history for this message
Julian Edwards (julian-edwards) wrote :

The branch I attached was for adding gpg debugging in poppy - however it won't pass all tests because of this bug so I'm adding lifeless's fixes plus others I need to add.

Revision history for this message
Julian Edwards (julian-edwards) wrote :

I'm landing a fix for this now, it seems to pass all tests locally. We'll need to QA very carefully as this has affected the librarian, importds, puller and poppy, which will now use the RotatableFileLobObserver.

Revision history for this message
Robert Collins (lifeless) wrote : Re: [Bug 901498] Re: poppy-sftp OOPSes infinitely

On Sat, Dec 10, 2011 at 6:25 AM, Julian Edwards
<email address hidden> wrote:
> I'm landing a fix for this now, it seems to pass all tests locally.
> We'll need to QA very carefully as this has affected the librarian,
> importds, puller and poppy, which will now use the
> RotatableFileLobObserver.

What about the bzr smart server?

Revision history for this message
Julian Edwards (julian-edwards) wrote :

On Fri Dec 9 17:48:57 2011, Robert Collins wrote:
> On Sat, Dec 10, 2011 at 6:25 AM, Julian Edwards
> <email address hidden> wrote:
>> I'm landing a fix for this now, it seems to pass all tests locally.
>> We'll need to QA very carefully as this has affected the librarian,
>> importds, puller and poppy, which will now use the
>> RotatableFileLobObserver.
>
> What about the bzr smart server?
>

That doesn't use set_up_oops_logging() does it?

Revision history for this message
Robert Collins (lifeless) wrote :

On Sun, Dec 11, 2011 at 12:39 AM, Julian Edwards
<email address hidden> wrote:
>> What about the bzr smart server?
>>
>
> That doesn't  use set_up_oops_logging() does it?

Via SSHService it does. Perhaps thats what you meant when you said 'importds'.

-Rob

Revision history for this message
Launchpad QA Bot (lpqabot) wrote :
tags: added: qa-needstesting
Changed in launchpad:
status: In Progress → Fix Committed
Revision history for this message
Julian Edwards (julian-edwards) wrote :

So far I've tested OK with:
 poppy
 librarian

I found that the librarian was not restarted on qastaging when code is updated, I had to test it on dogfood.

I see that bzr-sftp is also using the old logging style on tellurium and thedac says that it does get restarted when there's a code update. I'm not sure either way.

Also I can't get a code import to run on staging.

Revision history for this message
Julian Edwards (julian-edwards) wrote :

Ok qastaging code import done.

tags: added: qa-ok
removed: qa-needstesting
William Grant (wgrant)
Changed in launchpad:
status: Fix Committed → Fix Released
To post a comment you must log in.
This report contains Public information  
Everyone can see this information.

Other bug subscribers

Remote bug watches

Bug watches keep track of this bug in other bug trackers.