loggerhead

loggerhead OOPS - error: [Errno 32] Broken pipe

Bug #701329 reported by Robert Collins on 2011-01-11

This bug affects 2 people

Affects		Status	Importance	Assigned to	Milestone
	Launchpad itself	Fix Released	Critical	John A Meinel	Launchpad itself 11.03
	loggerhead	Invalid	High	Unassigned

Bug Description

we get this raised as an oops 10K times a day.
OOPS-1836CBA1
http://bazaar.launchpad.net/%7Evcs-imports/busybox/main/changes

I suspect this is due to the request taking more than 30 seconds and haproxy/apache terminating the connection on the assumption the server (loggerhead) has gone awol.

This gives us some idea of the magnitude of the 'first hit is slow' problem.

I'm not sure what we should do about this - but perhaps adding a timeout cap to loggerhead is appropriate: we could show an error page, perhaps even use a refresh to try again. We probably want to gather and graph this data, but an oops each time isn't terribly useful (at this scale).

OTOH perhaps we should just focus on driving it down.

error: [Errno 32] Broken pipe

    Traceback (most recent call last):
  Module launchpad_loggerhead.app, line 340, in wrapped_app
    app_iter = iter(app(environ, wrapped.start_response))
  Module launchpad_loggerhead.debug, line 106, in wrapped_application
    result = application(environ, response_hook)
  Module __main__, line 165, in wrapped
  Module launchpad_loggerhead.debug, line 45, in wrapped
    return app(environ, start_response)
  Module paste.translogger, line 68, in __call__
    return self.application(environ, replacement_start_response)
  Module paste.deploy.config, line 285, in __call__
    return self.app(environ, start_response)
  Module __main__, line 149, in wrapped
  Module paste.wsgilib, line 179, in catch_errors
    app_iter = application(environ, start_response)
  Module launchpad_loggerhead.session, line 53, in __call__
    return self.cookie_handler(environ, start_response)
  Module paste.auth.cookie, line 305, in __call__
    return self.application(environ, response_hook)
  Module launchpad_loggerhead.session, line 80, in _process
    return self.application(environ, response_hook)
  Module paste.httpexceptions, line 636, in __call__
    return self.application(environ, start_response)
  Module launchpad_loggerhead.app, line 228, in __call__
    return view.app(environ, start_response)
  Module loggerhead.apps.branch, line 167, in app
    return c(environ, start_response)
  Module loggerhead.controllers, line 103, in __call__
    template.expand_into(w, **vals)
  Module loggerhead.zptsupport, line 64, in expand_into
    self.template.expand(context, f, 'utf-8')
  Module simpletal.simpleTAL, line 731, in expand
    self.expandInline (context, encodingFile, interpreter)
  Module simpletal.simpleTAL, line 617, in expandInline
    ourInterpreter.execute (self)
  Module simpletal.simpleTAL, line 212, in execute
    self.commandHandler[cmnd[0]] (cmnd[0], cmnd[1])
  Module simpletal.simpleTAL, line 432, in cmdEndTagEndScope
    resultVal.expandInline (self.context, self.file, self)
  Module simpletal.simpleTAL, line 617, in expandInline
    ourInterpreter.execute (self)
  Module simpletal.simpleTAL, line 212, in execute
    self.commandHandler[cmnd[0]] (cmnd[0], cmnd[1])
  Module simpletal.simpleTAL, line 432, in cmdEndTagEndScope
    resultVal.expandInline (self.context, self.file, self)
  Module simpletal.simpleTAL, line 617, in expandInline
    ourInterpreter.execute (self)
  Module simpletal.simpleTAL, line 212, in execute
    self.commandHandler[cmnd[0]] (cmnd[0], cmnd[1])
  Module simpletal.simpleTAL, line 476, in cmdOutput
    self.file.write (args)
  Module codecs, line 352, in write
    self.stream.write(data)
  Module loggerhead.controllers, line 48, in write
    self.flush()
  Module loggerhead.controllers, line 39, in flush
    self.writefunc(''.join(self.buf))
  Module launchpad_loggerhead.debug, line 104, in wrapped_writer
    return writer(arg)
  Module launchpad_loggerhead.app, line 302, in write_wrapper
    self._write_callable(data)
  Module paste.httpserver, line 150, in wsgi_write_chunk
    self.wfile.write(chunk)
  Module socket, line 300, in write
    self.flush()
  Module socket, line 286, in flush
    self._sock.sendall(buffer)
error: [Errno 32] Broken pipe

See original description

Tags:

Related branches

lp:~jameinel/launchpad/loggerhead-disconnect-701329

Merged into lp:launchpad at revision 12480

Robert Collins (community): Approve on 2011-02-24

William Grant: Approve (code*) on 2011-02-23

Martin Pool (community): Needs Fixing on 2011-02-07

Michael Hudson-Doyle: Pending requested 2011-02-09

Revision history for this message

Robert Collins (lifeless) wrote on 2011-01-11:

Alternatively this might be when the client disconnects normally, (e.g. due to hitting refresh before the page loads - and perhaps we should just ignore the error. Perhaps the first thing to do is to gather more data: we don't currently fill out the duration field in the oops, so we can't determine if this is a timeout or not.

Robert Collins (lifeless) on 2011-01-12

Changed in launchpad:
status:	New → Triaged
importance:	Undecided → Critical

Robert Collins (lifeless) on 2011-01-13

description:

updated

Revision history for this message

Max Kanat-Alexander (mkanat) wrote on 2011-01-13:

I see this on my local loggerhead installs if I close the browser before the page is done, or stop the request before the page is done sending content to me.

Revision history for this message

Martin Pool (mbp) wrote on 2011-01-27:

Robert mentioned the error count is very stable from one day to the next, which suggests it's being driven by some kind of bot.

If loggerhead sees the connection being closed, that's probably because haproxy is closing the connection. That could be because something downstream closed the connection, or because haproxy timed out. If the latter we should be able to see how many timeouts happen per day and on what urls.

Having the timeout in haproxy is kind of nice because it's external, but has a couple of bugs:
* the loggerhead thread presumably does the whole request then aborts, creating wasted load
* we don't get a traceback showing what it was doing when the time expired

So perhaps there should also be an internal timeout.

Revision history for this message

Martin Pool (mbp) wrote on 2011-01-27:

spm says we don't log haproxy, because if it's turned on it logs far too much. It would be nice to at least log at level warn or on an aggressive rotation schedule. Perhaps we can have an RT.

Revision history for this message

Max Kanat-Alexander (mkanat) wrote on 2011-01-27:

That level of stability may indicate that it's the LOSAs' bot that checks if the system is still up.

Revision history for this message

Robert Collins (lifeless) wrote on 2011-01-28:

So this is the daily summary:
   7766 http://0.0.0.0:8081/%7Evcs-imports/busybox/main/changes (Unknown)
       OOPS-1851CBB1, OOPS-1851CBB10, OOPS-1851CBB100, OOPS-1851CBB1000, OOPS-1851CBB1001
   7752 http://0.0.0.0:8080/%7Evcs-imports/busybox/main/changes (Unknown)
       OOPS-1851CBA1, OOPS-1851CBA10, OOPS-1851CBA100, OOPS-1851CBA1000, OOPS-1851CBA1001
     22 http://bazaar.launchpad.net/%7Eibid-core/ibid/trunk/changes (Unknown)
       OOPS-1851CBA1376, OOPS-1851CBA1711, OOPS-1851CBA2041, OOPS-1851CBA2704, OOPS-1851CBA3034
   [118 other URLs]

thats 7766 and 7752 to the two instances, 22 on another branch, and diminishing frequency thereafter. There is no refer, no request variables, no connected-client info, so we cannot include or exclude haproxy / sysadmin scripts on the info we have.

I'm going to file a separate, critical bug, on the data gathering.

Revision history for this message

Max Kanat-Alexander (mkanat) wrote on 2011-01-28:

That is definitely the sysadmin script. It checks that branch.

Revision history for this message

Robert Collins (lifeless) wrote on 2011-01-28: Re: [Bug 701329] Re: error: [Errno 32] Broken pipe

The sysadmins are claiming that they do a tiny fraction of the # of
requests we're seeing.

Revision history for this message

Max Kanat-Alexander (mkanat) wrote on 2011-01-28: Re: error: [Errno 32] Broken pipe

That seems incredibly unlikely, given that ~vcs-imports/busybox/main/changes is the exact URL that the uptime checker checks, and it's checking *directly* on the backend, which only the sysadmin scripts could do.

Revision history for this message

Michael Hudson-Doyle (mwhudson) wrote on 2011-01-30: Re: [Bug 701329] Re: error: [Errno 32] Broken pipe

#10

FWIW, I agree with Max -- given that the logged URL is 0.0.0.0:8080 not
bazaar.launchpad.net, it seems that it must be something internal.

Cheers,
mwh

Robert Collins (lifeless) on 2011-01-30

summary:

- error: [Errno 32] Broken pipe
+ loggerhead OOPS - error: [Errno 32] Broken pipe

Revision history for this message

Julian Edwards (julian-edwards) wrote on 2011-01-31:

#11

This seems exactly like the type of error we can safely swallow and ignore. It just means the client went away before receiving all the data.

John A Meinel (jameinel) on 2011-01-31

Changed in loggerhead:
assignee:	nobody → John A Meinel (jameinel)

Revision history for this message

John A Meinel (jameinel) wrote on 2011-01-31:

#12

This doesn't seem to be a bug in loggerhead, but rather how it is being called from the Launchpad code. So for now I'm closing the loggerhead side of it.

I determined this by running with this patch:
=== modified file 'loggerhead/apps/branch.py'
--- loggerhead/apps/branch.py 2010-12-01 08:30:02 +0000
+++ loggerhead/apps/branch.py 2011-01-31 18:34:00 +0000
@@ -171,7 +171,11 @@
         try:
             try:
                 c = cls(self, self.get_history)
- return c(environ, start_response)
+ try:
+ return c(environ, start_response)
+ except socket.error, e:
+ self.log.warning('Caught a failure.... baby')
+ raise
             except:
                 environ['exc_info'] = sys.exc_info()
                 environ['branch'] = self

=== modified file 'loggerhead/controllers/__init__.py'
--- loggerhead/controllers/__init__.py 2009-10-17 08:59:33 +0000
+++ loggerhead/controllers/__init__.py 2011-01-31 18:18:36 +0000
@@ -100,6 +100,7 @@
         template = load_template(self.template_path)
         z = time.time()
         w = BufferingWriter(writer, 8192)
+ import pdb; pdb.set_trace()
         template.expand_into(w, **vals)
         w.flush()
         self.log.info(

Basically, adding an exception grab as part of the branch app, and a pdb giving me plenty of time to sync with closing the browser.

After doing so, the output on the console was:

> c:\users\jameinel\dev\bzr\plugins\loggerhead\loggerhead\controllers\__init__.py(104)__call__()
-> template.expand_into(w, **vals)
(Pdb) c
Caught a failure.... baby

In other words, no traceback, etc. So whatever the normal mechanism is in loggerhead, it cleanly suppresses the EPIPE error. It could just be the EPIPE handling from 'bzr serve' or whatever. But it seems to make the most sense that it should be the Launchpad codebase that suppresses the error, and not done locally in the Branch.app code.

This doesn't seem to be a bug in loggerhead, but rather how it is being called from the Launchpad code. So for now I'm closing the loggerhead side of it.

I determined this by running with this patch:
=== modified file 'loggerhead/apps/branch.py'
--- loggerhead/apps/branch.py   2010-12-01 08:30:02 +0000
+++ loggerhead/apps/branch.py   2011-01-31 18:34:00 +0000
@@ -171,7 +171,11 @@
         try:
             try:
                 c = cls(self, self.get_history)
-                return c(environ, start_response)
+                try:
+                    return c(environ, start_response)
+                except socket.error, e:
+                    self.log.warning('Caught a failure.... baby')
+                    raise
             except:
                 environ['exc_info'] = sys.exc_info()
                 environ['branch'] = self

=== modified file 'loggerhead/controllers/__init__.py'
--- loggerhead/controllers/__init__.py  2009-10-17 08:59:33 +0000
+++ loggerhead/controllers/__init__.py  2011-01-31 18:18:36 +0000
@@ -100,6 +100,7 @@
         template = load_template(self.template_path)
         z = time.time()
         w = BufferingWriter(writer, 8192)
+        import pdb; pdb.set_trace()
         template.expand_into(w, **vals)
         w.flush()
         self.log.info(

Basically, adding an exception grab as part of the branch app, and a pdb giving me plenty of time to sync with closing the browser.

After doing so, the output on the console was:

> c:\users\jameinel\dev\bzr\plugins\loggerhead\loggerhead\controllers\__init__.py(104)__call__()
-> template.expand_into(w, **vals)
(Pdb) c
Caught a failure.... baby

Revision history for this message

John A Meinel (jameinel) wrote on 2011-01-31:

#13

I don't think this is a bug in the loggerhead code, but in the launchpad calling code.

Changed in loggerhead:
assignee:	John A Meinel (jameinel) → nobody
status:	Triaged → Invalid

Revision history for this message

John A Meinel (jameinel) wrote on 2011-01-31:

#14

Digging into it a bit more, because I wasn't completely happy with "it just doesn't fail locally". It looks like this is being handled in the paste.httpserver code. Specifically this:

class WSGIHandler...

    def handle(self):
        # don't bother logging disconnects while handling a request
        try:
            BaseHTTPRequestHandler.handle(self)
        except SocketErrors, exce:
            self.wsgi_connection_drop(exce)

Where SocketErrors is defined as either:
SocketErrors = (socket.error, SSL.ZeroReturnError, SSL.SysCallError)
or
SocketErrors = (socket.error,)

Depending on if OpenSSL can be imported.

So it looks like paste defaults to just discarding all socket errors. We might consider doing the same in the Launchpad codebase.

Revision history for this message

Robert Collins (lifeless) wrote on 2011-01-31:

#15

So, the launchpad oops logic is hooked into the wsgi stack, its not calling it any differently. My comments about wanting to know if these are are basically timeouts or not are still relevant I think.(see the bug I referenced)

Revision history for this message

John A Meinel (jameinel) wrote on 2011-01-31:

#16

Looking at the traceback, that isn't entirely true.

  Module launchpad_loggerhead.session, line 80, in _process
    return self.application(environ, response_hook)
  Module paste.httpexceptions, line 636, in __call__
    return self.application(environ, start_response)
  Module launchpad_loggerhead.app, line 228, in __call__
    return view.app(environ, start_response)

Shows that launchpad_loggerhead.session is indirecting via paste.httpexceptions and then into launchpad_loggerhead.app and into the view stack,etc.

The stack that is built up is about here:
http://bazaar.launchpad.net/~launchpad-pqm/launchpad/devel/view/head:/lib/launchpad_loggerhead/app.py#L224

And you can see that there isn't any sort of socket.error handling at that level.

My guess is that the fix would fit the best in the 'oops_middleware' code. At the moment it explicitly catches StopIteration, and then has a blanket "except:"
http://bazaar.launchpad.net/~launchpad-pqm/launchpad/devel/view/head:/lib/launchpad_loggerhead/app.py#L348

Just changing that so it also had a:

except (socket.error,), e:
  # suppress socket errors, because they generally mean client disconnected, and not an actual bug in the code
  pass
except:
  ...

If you look at the loggerhead stack itself:
http://bazaar.launchpad.net/+branch/loggerhead/view/head:/__init__.py#L99

It is doing:

from paste.httpserver import serve

And the paste HTTPServer code is where the 'suppress all socket exceptions' is occurring.

basically, instead of seeing
Module launchpad_loggerhead.session, line 80, in _process
return self.application(environ, response_hook)

you would see something to do with the paste HTTPServer. And I'm just suggesting that we should put that same "discard errors like socket.error" in the launchpad app wsgi layer.

At that point, you'll stop getting this OOPS. if you want to track timeouts, etc, then we can do so, and that part of this bug is still worth exploring (but should probably be a different bug).

Regardless, the *loggerhead* code doesn't need changing, it is the Launchpad specific glue that needs some updating.

Looking at the traceback, that isn't entirely true.

Shows that launchpad_loggerhead.session is indirecting via paste.httpexceptions and then into launchpad_loggerhead.app and into the view stack,etc.

The stack that is built up is about here:
http://bazaar.launchpad.net/~launchpad-pqm/launchpad/devel/view/head:/lib/launchpad_loggerhead/app.py#L224

And you can see that there isn't any sort of socket.error handling at that level.

Just changing that so it also had a:

except (socket.error,), e:
  # suppress socket errors, because they generally mean client disconnected, and not an actual bug in the code
  pass
except:
  ...

If you look at the loggerhead stack itself:
http://bazaar.launchpad.net/+branch/loggerhead/view/head:/__init__.py#L99

It is doing:

from paste.httpserver import serve

And the paste HTTPServer code is where the 'suppress all socket exceptions' is occurring.

basically, instead of seeing 
 Module launchpad_loggerhead.session, line 80, in _process
    return self.application(environ, response_hook)

you would see something to do with the paste HTTPServer. And I'm just suggesting that we should put that same "discard errors like socket.error" in the launchpad app wsgi layer.

At that point, you'll stop getting this OOPS. if you want to track timeouts, etc, then we can do so, and that part of this bug is still worth exploring (but should probably be a different bug).

Regardless, the *loggerhead* code doesn't need changing, it is the Launchpad specific glue that needs some updating.

Revision history for this message

Martin Pool (mbp) wrote on 2011-02-01:

#17

It turns out it's not the sysadmin script but rather haproxy is checking these urls itself.

> option httpchk HEAD /~vcs-imports/busybox/main/changes HTTP/1.1

Unfortunately the oopses don't specify the method which would have made it more obvious.

It seems likely the real cause here is that loggerhead doesn't handle an HTTP HEAD correctly: it tries to send the body and the server drops the connection.

Revision history for this message

Martin Pool (mbp) wrote on 2011-02-01:

#18

When I do

curl --head http://bazaar.launchpad.net/~vcs-imports/busybox/main/changes

or the same against loggerhead running locally, it does seem to work correctly, returning a header with no body.

Looking at the access log for codebrowse, I see a lot of

[20110127-19:52:12.040] 91.189.90.11 - - [27/Jan/2011:19:52:10 +0000] "HEAD /%7Evcs-imports/busybox/main/changes HTTP/1.1" 200 - "-" "-"
[20110127-19:52:22.596] 91.189.90.11 - - [27/Jan/2011:19:52:22 +0000] "HEAD /%7Evcs-imports/busybox/main/changes HTTP/1.1" 200 - "-" "-"
[20110127-19:52:32.925] 91.189.90.11 - - [27/Jan/2011:19:52:32 +0000] "HEAD /%7Evcs-imports/busybox/main/changes HTTP/1.1" 200 - "-" "-"

(roughly every 10s) which I presume is haproxy, and also a few of

[20110127-19:51:56.133] 127.0.0.1 - - [27/Jan/2011:19:51:54 +0000] "GET /%7Evcs-imports/busybox/main/changes HTTP/1.1" 200 - "-" "check_http/v1.4.14 (nagios-plugins 1.4.14)"

In the debug.log I see it's decided to give the 200 error and then later on hit the broken pipe.

I would guess that haproxy just disconnects once it's seen the 200 line, without even reading the whole response header.

The only surprising thing is the traceback saying we're rendering the body content.

Revision history for this message

Martin Pool (mbp) wrote on 2011-02-01:

#19

I wonder if the launchpad glue (lib/launchpad_loggerhead) is written such that it actually generates the whole page on every request, and then Paste or wsgi throws away the body content for HEAD requests, or indeed perhaps just serves it up and something else chops it off? From http://pythonpaste.org/do-it-yourself-framework.html it looks like it's our responsibility to check the method and I don't see where we are doing it.

Revision history for this message

Martin Pool (mbp) wrote on 2011-02-01:

#20

The next thing I was going to do here, but I don't have time tonight, is to write a tiny python program that connects, writes a request and disconnects without reading the response. Then see if that gets the same error as in production.

Revision history for this message

John A Meinel (jameinel) wrote on 2011-02-01: Re: [Bug 701329] Re: loggerhead OOPS - error: [Errno 32] Broken pipe

#21

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

On 2/1/2011 2:04 AM, Martin Pool wrote:
> The next thing I was going to do here, but I don't have time tonight, is
> to write a tiny python program that connects, writes a request and
> disconnects without reading the response. Then see if that gets the same
> error as in production.
>

Note that the default haproxy check is just whether it can get a TCP
connection on a given port. Though you can (obviously) configure it to
do an http request, etc.

John
=:->

-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.9 (Cygwin)
Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org/

iEYEARECAAYFAk1IIagACgkQJdeBCYSNAAPUgQCfZvBhJPdwUZ8r4fStsvA1kLwv
lW4An18Ud2uEzM86X2iKOuEJ7Hqfc5hF
=lE9n
-----END PGP SIGNATURE-----

Revision history for this message

Martin Pool (mbp) wrote on 2011-02-01:

#22

In this case it's clear from the configuration and the logs that it is
sending a HEAD request.

- Martin
On 02/02/2011 2:07 AM, "John Arbash Meinel" <email address hidden> wrote:
> -----BEGIN PGP SIGNED MESSAGE-----
> Hash: SHA1
>
> On 2/1/2011 2:04 AM, Martin Pool wrote:
>> The next thing I was going to do here, but I don't have time tonight, is
>> to write a tiny python program that connects, writes a request and
>> disconnects without reading the response. Then see if that gets the same
>> error as in production.
>>
>
> Note that the default haproxy check is just whether it can get a TCP
> connection on a given port. Though you can (obviously) configure it to
> do an http request, etc.
>
> John
> =:->
>
> -----BEGIN PGP SIGNATURE-----
> Version: GnuPG v1.4.9 (Cygwin)
> Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org/
>
> iEYEARECAAYFAk1IIagACgkQJdeBCYSNAAPUgQCfZvBhJPdwUZ8r4fStsvA1kLwv
> lW4An18Ud2uEzM86X2iKOuEJ7Hqfc5hF
> =lE9n
> -----END PGP SIGNATURE-----
>

Revision history for this message

John A Meinel (jameinel) wrote on 2011-02-04:

#23

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

On 2/1/2011 3:55 PM, Martin Pool wrote:
> In this case it's clear from the configuration and the logs that it is
> sending a HEAD request.
>

I managed to reproduce it with:

echo -e "HEAD /~jameinel/+junk/foo/changes HTTP/1.0\n\n" | \
telnet localhost 8080

Note that I was *not* able to reproduce it going through the Apache
proxy. So "telnet bazaar.launchpad.dev 80" did not reproduce the
traceback. My guess is that Apache is reading all the headers, even if
haproxy is not.

Which might be why you had troubles triggering it?

Though 'curl --head' to localhost 8080 also doesn't trigger it. You
really need something like telnet which drops the connection right away.

I'll try to post a patch to launchpad about it. Though I'm not sure how
to test it appropriately.

John
=:->
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.9 (Cygwin)
Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org/

iEYEARECAAYFAk1MJ2cACgkQJdeBCYSNAANM7wCghryMRpdeGesOn40l9pvA2Uqg
EUAAn39dmXACNwurt9m++cu0v1uBuLTh
=UHfV
-----END PGP SIGNATURE-----

Revision history for this message

John A Meinel (jameinel) wrote on 2011-02-04:

#24

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

On 2/4/2011 10:20 AM, John A Meinel wrote:
> On 2/1/2011 3:55 PM, Martin Pool wrote:
>> In this case it's clear from the configuration and the logs that it is
>> sending a HEAD request.

I managed to finally track down scripts/start-loggerhead.py which
decides how the stack is built up.

I see that at the top level, it is still calling:

[paste.]httpserver.serve(...)

Which is what the bzrlib commandline is doing.
It also isn't passing in a custom 'handler', so it is using the default
WSGIHandler. And in WSGIHandler.handle() it suppresses socket errors.

However, the oops_middleware app is being inserted into the app pipeline
as the final app:

app = oops_middleware(app)

Which from what I can tell, means that the oops are being generated at a
lower level than where they are getting suppressed.

Some options then seem to be available:

1) Use a custom handler, rather than oops_middleware.
I'm honestly not really sure how the WSGI stack interacts with the
HTTP Handler/Service/etc stack.

2) Teach oops_middleware to just ignore socket errors. I'm personally
in favor of this one. And I'll put together a patch which does this.

John
=:->
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.9 (Cygwin)
Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org/

iEYEARECAAYFAk1MNcYACgkQJdeBCYSNAAMllgCglI2QNMu+hOjGM6z37OlsHiR6
6dgAoKdMMiG2GyydMjGSUMYddR50icNK
=9wWW
-----END PGP SIGNATURE-----

Revision history for this message

Martin Pool (mbp) wrote on 2011-02-04:

#25

Before we consider this fixed, I'd like to know why it's rendering the
page content at all for a HEAD request. That is a waste of cpu, even
if it doesn't cause an error. I think the problem is that the http
method='HEAD' parameter is getting lost somewhere down the stack.

Revision history for this message

John A Meinel (jameinel) wrote on 2011-02-04:

#26

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

On 2/4/2011 4:59 PM, Martin Pool wrote:
> Before we consider this fixed, I'd like to know why it's rendering the
> page content at all for a HEAD request. That is a waste of cpu, even
> if it doesn't cause an error. I think the problem is that the http
> method='HEAD' parameter is getting lost somewhere down the stack.
>

Note that when I was testing it, you can get the failure while writing
headers, regardless of rendering the rest of the page. At least:

echo "HEAD /... HTTP/1.0" | telnet localhost 8080

can get the exception to trigger very early.

As for why should we render the page even for just a HEAD request... If
HEAD requests are only happening at the haproxy level, why not? Is it
worth spending developer time just to handle this case?

Arguably there is the proxy situation, where a given proxy could say
"try HEAD and see if I can just use my cached value". I didn't think
loggerhead was really ready for that, because of needing ETAGS support, etc.

I don't doubt that it is a waste of CPU, but if <1% of our time is spent
in HEAD requests, it doesn't seem worthwhile to spend developer time to
fix it. (We may need to have better logs to determine what the specific
overhead is.)

John
=:->

-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.9 (Cygwin)
Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org/

iEYEARECAAYFAk1MiVYACgkQJdeBCYSNAAOWwgCeM9ojvwYx10YeqJlje4vsO9Xl
Es4An3YhxQmNvW+Jvkiri0L+M79a6rtr
=uU8v
-----END PGP SIGNATURE-----

Revision history for this message

John A Meinel (jameinel) wrote on 2011-02-04:

#27

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

I guess ultimately, this sounds like a separate bug to me. (loggerhead
renders full content even for HEAD requests.)

John
=:->

-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.9 (Cygwin)
Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org/

iEYEARECAAYFAk1MicUACgkQJdeBCYSNAAOuuwCgm7OoxF56xYlN0i0ggObF37Kr
7icAoKZp1mFnY3qKYKTlA25upJVAVDWE
=4a0Z
-----END PGP SIGNATURE-----

Revision history for this message

Martin Pool (mbp) wrote on 2011-02-05:

#28

On 5 February 2011 10:18, John Arbash Meinel <email address hidden> wrote:
> Note that when I was testing it, you can get the failure while writing
> headers, regardless of rendering the rest of the page. At least:
>
> echo "HEAD /... HTTP/1.0" | telnet localhost 8080
>
> can get the exception to trigger very early.
>
> As for why should we render the page even for just a HEAD request... If
> HEAD requests are only happening at the haproxy level, why not? Is it
> worth spending developer time just to handle this case?
>
> Arguably there is the proxy situation, where a given proxy could say
> "try HEAD and see if I can just use my cached value". I didn't think
> loggerhead was really ready for that, because of needing ETAGS support, etc.
>
> I don't doubt that it is a waste of CPU, but if <1% of our time is spent
> in HEAD requests, it doesn't seem worthwhile to spend developer time to
> fix it. (We may need to have better logs to determine what the specific
> overhead is.)

haproxy sends a request every 10s, so if they're doing unnecessary
work it's potentially going to create a noticeable load. It's a bit
surprising too. But it's not really essential to fixing this bug: we
could file a separate issue saying we noticed this, and it can wait
until we want to reduce the load on that machine.

Revision history for this message

Jeroen T. Vermeulen (jtv) wrote on 2011-02-09:

#29

Rendering full content for a HEAD is one thing, attempting to send it another. Isn't there also a risk of needlessly tying up the connection and/or the thread?

So ISTM there are multiple bugs here: unnecessary rendering, unnecessary attempt to send the body, and an unwanted oops for socket errors. I wouldn't fix the third without fixing the second first, otherwise we might end up hiding the problem.

Revision history for this message

John A Meinel (jameinel) wrote on 2011-02-09:

#30

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

On 2/9/2011 11:25 AM, Jeroen T. Vermeulen wrote:
> Rendering full content for a HEAD is one thing, attempting to send it
> another. Isn't there also a risk of needlessly tying up the connection
> and/or the thread?
>
> So ISTM there are multiple bugs here: unnecessary rendering, unnecessary
> attempt to send the body, and an unwanted oops for socket errors. I
> wouldn't fix the third without fixing the second first, otherwise we
> might end up hiding the problem.
>

We can trigger this failure while writing the headers to the socket.
You're welcome to dig into this as much as you want, but I think the
error is pretty shallow.

John
=:->

-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.9 (Cygwin)
Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org/

iEYEARECAAYFAk1S2c4ACgkQJdeBCYSNAAOSugCgxkfvragModW6v3/EwJnnF4ZU
BiAAoM7arQBW1gAMu6L3oBJIknEgylnu
=ntxx
-----END PGP SIGNATURE-----

Revision history for this message

John A Meinel (jameinel) wrote on 2011-02-09:

#31

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

I did just do another check, and it looks like HEAD is returning the
full content. I still think that is another bug, but you can treat it as
you like.

See pages like this:
http://blog.dscpl.com.au/2009/10/wsgi-issues-with-http-head-requests.html

basically, wsgi defaults to *not* treating HEAD differently than GET,
and expects that you'll put something like Apache in front of it, to
strip the actual body content.

Note also that I can confirm this. If I do:

$ telnet localhost 8080
HEAD /~jameinel/+junk/foo/changes HTTP/1.0
Host: bazaar.launchpad.dev

I get the full header and all the body. If I do:

$ telnet bazaar.launchpad.dev 80
HEAD /~jameinel/+junk/foo/changes HTTP/1.0
Host: bazaar.launchpad.dev

I get just the header data.

So:

1) We don't actually violate the spec to the outside world.
2) Paste/WSGI doesn't make it particularly easy to run as a standalone
service that gets HEAD correct.

According to the above link, part of the issue is how the WSGI stack
*works*. Specifically, you can have WSGI middleware that would change
the values of the headers. For example, a middleware that just
compresses the content, which would then change the "Content-Length:"
header. And without actually doing the work, you can't guarantee that
the HEAD headers will be *identical* to the GET headers.

Even further, in the article it mentions that if you configure
Apache/mod_wsgi, it will check for output filters, and if HEAD comes in
over the wire, it will rewrite the request into GET, and then filter the
result anyway.

So in the end, is it *really* worth putting in any effort to fix this
outside of what I've already done?

John
=:->
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.9 (Cygwin)
Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org/

iEYEARECAAYFAk1S6boACgkQJdeBCYSNAAPRqgCgh7kmbELh8mB6WhOAvTsoL1T2
7AMAnRIiMS7LO9IL2OVFvGWoL9x29Njn
=1sv3
-----END PGP SIGNATURE-----

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

On 2/9/2011 11:25 AM, Jeroen T. Vermeulen wrote:
> Rendering full content for a HEAD is one thing, attempting to send it
> another.  Isn't there also a risk of needlessly tying up the connection
> and/or the thread?
> 
> So ISTM there are multiple bugs here: unnecessary rendering, unnecessary
> attempt to send the body, and an unwanted oops for socket errors.  I
> wouldn't fix the third without fixing the second first, otherwise we
> might end up hiding the problem.
>

I did just do another check, and it looks like HEAD is returning the
full content. I still think that is another bug, but you can treat it as
you like.

See pages like this:
http://blog.dscpl.com.au/2009/10/wsgi-issues-with-http-head-requests.html

basically, wsgi defaults to *not* treating HEAD differently than GET,
and expects that you'll put something like Apache in front of it, to
strip the actual body content.

Note also that I can confirm this. If I do:

$ telnet localhost 8080
HEAD /~jameinel/+junk/foo/changes HTTP/1.0
Host: bazaar.launchpad.dev

I get the full header and all the body. If I do:

$ telnet bazaar.launchpad.dev 80
HEAD /~jameinel/+junk/foo/changes HTTP/1.0
Host: bazaar.launchpad.dev

I get just the header data.

So:

1) We don't actually violate the spec to the outside world.
2) Paste/WSGI doesn't make it particularly easy to run as a standalone
service that gets HEAD correct.

So in the end, is it *really* worth putting in any effort to fix this
outside of what I've already done?

John
=:->
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.9 (Cygwin)
Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org/

iEYEARECAAYFAk1S6boACgkQJdeBCYSNAAPRqgCgh7kmbELh8mB6WhOAvTsoL1T2
7AMAnRIiMS7LO9IL2OVFvGWoL9x29Njn
=1sv3
-----END PGP SIGNATURE-----

Revision history for this message

Martin Pool (mbp) wrote on 2011-02-09:

#32

> I did just do another check, and it looks like HEAD is returning the
> full content. I still think that is another bug, but you can treat it as
> you like.

Yes, I thought it was. It's a separate bug. It's unique to the
launchpad_loggerhead setup and I can't reproduce it with regular
loggerhead.

It should be a single line 'if method==HEAD' fix.

Revision history for this message

John A Meinel (jameinel) wrote on 2011-02-09:

#33

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

On 2/9/2011 2:41 PM, Martin Pool wrote:
>> I did just do another check, and it looks like HEAD is returning the
>> full content. I still think that is another bug, but you can treat it as
>> you like.
>
> Yes, I thought it was. It's a separate bug. It's unique to the
> launchpad_loggerhead setup and I can't reproduce it with regular
> loggerhead.
>
> It should be a single line 'if method==HEAD' fix.
>

I can reproduce it with "bzr serve --http". However, as the link I
pointed to shows, it is a fairly fundamental issue with WSGI, that isn't
really going to be fixed soon.

John
=:->

-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.9 (Cygwin)
Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org/

iEYEARECAAYFAk1TCBgACgkQJdeBCYSNAAO/cgCgr3567S+eBK7i8Z0U9DL0QGsO
oqwAnRvZM2lTH/Bk2jTGhOuBWWWAmSQz
=eYVm
-----END PGP SIGNATURE-----

Revision history for this message

Martin Pool (mbp) wrote on 2011-02-10:

#34

On 10 February 2011 07:41, Martin Pool <email address hidden> wrote:
>> I did just do another check, and it looks like HEAD is returning the
>> full content. I still think that is another bug, but you can treat it as
>> you like.
>
> Yes, I thought it was. It's a separate bug. It's unique to the
> launchpad_loggerhead setup and I can't reproduce it with regular
> loggerhead.

bug 716201

William Grant (wgrant) on 2011-02-28

Changed in launchpad:
assignee:	nobody → John A Meinel (jameinel)
status:	Triaged → In Progress

Revision history for this message

Launchpad QA Bot (lpqabot) wrote on 2011-03-01:

#35

Fixed in stable r12480 <http://bazaar.launchpad.net/~launchpad-pqm/launchpad/stable/revision/12480>.

Changed in launchpad:
milestone:	none → 11.03
tags:	added: qa-needstesting
Changed in launchpad:
status:	In Progress → Fix Committed

William Grant (wgrant) on 2011-03-01

tags:

added: qa-ok
removed: qa-needstesting

Revision history for this message

William Grant (wgrant) wrote on 2011-03-01:

#36

I consider this OK to deploy. It does introduce a new OOPS (bug #726985), but in a case somewhat rarer than the one it fixed. It should be a net win on the OOPS count front, and the new OOPS does not constitute service degradation.

William Grant (wgrant) on 2011-03-02