find a suitable load balancer for fastcgis

Bug #584804 reported by Anand Chitipothu
Affects: Open Library
Status: Fix Released
Importance: High
Assigned to: Anand Chitipothu

Bug Description

Open Library uses lighttpd as its web server and as the load balancer for the fastcgis. The lighttpd load balancer brings the whole website down when some fastcgis are down. Because of this, the website goes down whenever we deploy new code.

Fixing this problem also addresses the memory leak issue: we can set up a process to monitor the fastcgis and restart them when their memory usage goes beyond a threshold.
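A minimal sketch of such a monitor, assuming Linux /proc for memory readings; the threshold, pids, and restart command here are illustrative, not the actual OL deployment:

```python
import os
import signal
import subprocess

def rss_kb(pid):
    """Resident set size of a process in kB, read from /proc (Linux only)."""
    with open('/proc/%d/status' % pid) as f:
        for line in f:
            if line.startswith('VmRSS:'):
                return int(line.split()[1])
    return 0

def check_and_restart(pid, restart_cmd, threshold_kb=500 * 1024):
    """Kill and respawn a fastcgi if it has grown past threshold_kb.

    restart_cmd is whatever relaunches the worker (an assumption here).
    """
    if rss_kb(pid) > threshold_kb:
        os.kill(pid, signal.SIGTERM)
        subprocess.Popen(restart_cmd)
        return True
    return False
```

A cron job or a small supervisor loop could run check_and_restart over the fastcgi pids every minute or so.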

We need to:

1. verify this behavior of lighttpd.

* start a couple of OL fastcgis and expose them via lighttpd.
* start hitting the lighttpd using ab (apache bench)
* kill one or more fastcgis and see if there are any failures

2. try the same with nginx

Are there any other options to try?

Changed in openlibrary:
milestone: none → stability
assignee: nobody → Anand Chitipothu (anandology)
importance: Undecided → Critical
description: updated
Revision history for this message
George (george-archive) wrote :

Is this something Ops can take care of?

Revision history for this message
Anand Chitipothu (anandology) wrote : Re: [Bug 584804] Re: find a suitable load balancer

On 25-May-10, at 10:56 PM, George wrote:

> Is this something Ops can take care of?

Yes.

Revision history for this message
George (george-archive) wrote : Re: find a suitable load balancer

Ralf - please describe the preparatory work you need Anand to do before you can add load balancing.

Changed in openlibrary:
assignee: Anand Chitipothu (anandology) → Ralf Muehlen (launchpad-muehlen)
Revision history for this message
George (george-archive) wrote :

(Might be good to have http load balancing, but what we're really looking for is fastcgi load balancing.)

Revision history for this message
samuel-archive (samuel-archive) wrote :

Also, please consider hosting the HTTP backend services inside Apache's mod_wsgi instead of the current lighttpd+fastcgi (flup?) container.

Running inside Apache's prefork MPM will also solve the memory leak problems, and will allow for graceful code rolls to new versions.

It will also dodge the Python GIL thread performance penalty,
and offer a more robust environment to monitor & control the HTTP service layer.

For the public facing side we may want to consider:
http://trafficserver.apache.org/
(also could try haproxy or varnish or another reverse-proxy-server)

a basic strategy to get the site reliable:
public face runs traditional reverse-proxy-server (right now lighttpd plays this role)
    - request routing -- sends requests to appropriate backend servers
    - perhaps sends static file requests to lighttpd
    - has heartbeat feature, and is the fail-whale server

lighttpd or nginx for static & on disk content

apache mod_wsgi or (maybe) lighttpd+fastcgi+flup (or maybe) python epolling httpserver
   - serves application universe

Advantages:
all logging is on public face.
one place to go to see what's up on the site.
no keepalive on backend, all keepalive on very front
get rid of the fastcgi middle layer: all the API calls go over HTTP anyway, and fastcgi is hard to test & analyze
more reliable fail-pages

summary: - find a suitable load balancer
+ find a suitable load balancer for fastcgis
Revision history for this message
Anand Chitipothu (anandology) wrote :

Yes, mod_wsgi is a potential candidate to try. I'll try to set up apache+mod_wsgi on ia331533 and see how it works.

Trying a reverse-proxy is a good idea. In fact, I experimented with Varnish some time ago. trafficserver looks good too. Does archive.org use any reverse-proxy? Sam, do you have experience with these reverse proxies?

George (george-archive)
Changed in openlibrary:
importance: Critical → High
Changed in openlibrary:
assignee: Ralf Muehlen (launchpad-muehlen) → samuel-archive (samuel-archive)
Revision history for this message
George (george-archive) wrote :

Bump.

Anand - we'd like to do a first small step towards load balancing, and that's to create a new second webserver that we can load balance into the existing setup. Please attend to this within the next week, so, by Tuesday July 6.

Can you please create another identical webserver, and post back here with a URL/test URL to hit.

This load balancing of web servers will be a good first step because it will mean that we won't have to pull the site down when we do a new release and we can quickly fold into existing Ops monitoring with a minimum of fuss.

Once that's done, we can take a look at fastcgi/Apache monitoring issues.

Changed in openlibrary:
assignee: samuel-archive (samuel-archive) → Anand Chitipothu (anandology)
Revision history for this message
Anand Chitipothu (anandology) wrote : Re: [Bug 584804] Re: find a suitable load balancer for fastcgis

On 30-Jun-10, at 4:57 AM, George wrote:

> Bump.
>
> Anand - we'd like to do a first small step towards load balancing, and
> that's to create a new second webserver that we can load balance into
> the existing setup. Please attend to this within the next week, so, by
> Tuesday July 6.
>
> Can you please create another identical webserver, and post back here
> with a URL/test URL to hit.
>
> This load balancing of web servers will be a good first step because
> it
> will mean that we won't have to pull the site down when we do a new
> release and we can quickly fold into existing Ops monitoring with a
> minimum of fuss.
>
> Once that's done, we can take a look at fastcgi/Apache monitoring
> issues.

I don't agree. The webserver is not a bottleneck in our setup. We had to
take the site down because the production db was under migration. I
think we should focus on making the db node more stable.

Anand

Revision history for this message
Ralf Muehlen (launchpad-muehlen) wrote :

A few observations based on my understanding of the OL system:

1) There is only one front-end web server. If that host is down, openlibrary.org is down.
2) If I understand Edward correctly, development necessitates the restart of the front-end, which leads to 30 seconds of unavailability of ol.org several times a day.
3) nagios notices these outages and alerts. Development happens in multiple time zones, so development-caused alerts can go out any time of the day.
4) Occasionally, the connection between lighttpd and fastcgis breaks.
5) There is a memory leak, requiring frequent restarts of the fastcgis.

Please correct me if any of these are wrong.

1-3 could be fixed by having 2 front-end web servers on separate nodes. I propose to use the existing production load balancer already in use for archive.org and IAS3. To make this happen, I would need
a) Someone to setup a second web server on a different OL node.
b) A script that can judge the health of a web server. This script should test a whole stack, i.e. it should hit all back-end components (DB, SE) in a non-cached way. See /petabox/sys/masterfiles/app/keepalived/bin/keepalived-healthcheck.bash for an example.
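A minimal sketch of such a health-check script; the URL would be an assumption, and a real version should exercise the whole stack (DB, SE) in a non-cached way, as the keepalived example does:

```python
import urllib.request

def healthy(url, timeout=5):
    """Return True if the full stack answers the request with HTTP 200."""
    try:
        return urllib.request.urlopen(url, timeout=timeout).getcode() == 200
    except Exception:
        # connection refused, timeouts, and 4xx/5xx responses (which
        # urlopen raises as HTTPError) all count as unhealthy
        return False
```

keepalived (or any monitor) would call this periodically and act on the result, e.g. by exiting non-zero when healthy() returns False.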

4-5 might be solved by using Apache as the webserver. Sam thinks that the connection to the fastcgis cannot break with Apache. Limiting the lifetime of the children can work around the memory leak.

The perfect is the enemy of the good here. There might be load balancers and reverse proxies out there that might do a better job than the existing production load balancer. But the latter is available, well-understood and supported.

Similarly, it would be good to investigate why 4 and especially 5 are happening. In the meantime, we might have a workaround that improves reliability.

Ralf

Revision history for this message
George (george-archive) wrote :

Anand - there is concurrent research going on about making PostgreSQL replicable - Jim Shankland is leading the charge on that, and will have a plan & approach within the next week or so.

We want to do both DB replication and load balancing of web servers.

Revision history for this message
samuel-archive (samuel-archive) wrote :

I did some experiments with lighttpd on ia331511
lighttpd-1.4.19 (ssl) - a light and fast webserver
Build-Date: Jul 29 2008 20:07:53

By repeatedly bringing a pool of fcgi flup servers up and down and
pausing them, I was able to get lighttpd into a state where it returned
500 errors, even though functioning flup workers were available.

My lighttpd config has a block like this:

    fastcgi.server = (
        "/run.py" => (
            ("host" => "127.0.0.1", "port" => 9091, "check-local" => "disable"),
            ("host" => "127.0.0.1", "port" => 9092, "check-local" => "disable")
        )
    )

My fastcgi looks like this:
#!/usr/bin/env python

import sys
import time

# The port comes from the command line so several workers can run at once.
port = int(sys.argv[1])

def myapp(environ, start_response):
    start_response('200 OK', [('Content-Type', 'text/plain')])
    time.sleep(0.3)  # simulate a slow request
    return ['Hello World! %d %f \n' % (port, time.time())]

if __name__ == "__main__":
    from flup.server.fcgi import WSGIServer
    # arguments: app, environ, multithreaded, multiprocess, bindAddress
    WSGIServer(myapp, None, True, False, ("127.0.0.1", port)).run()

I had to run 3 threads doing this to see the failures:
while true; do timeout 2 curl http://ia331511.us.archive.org:9090/run.py/ ; date ; done

The easiest way to enter a broken state is as follows:
0) Start the requestor threads.
1) Start the 2nd fcgi server, leave the first one in the config down.
2) Start lighttpd.
(this does not always trigger the failing mode, but does most of the time)

In this state, half of all requests get 500 errors.

If I bring up the first fcgi server (so both are running) everything is fine.

If I then kill the second server, it's all 500s (no requests are routed to the
now-working 1st fcgi server).

I haven't seen lighttpd recover from this all-500s state (bringing back both
fcgi servers doesn't fix it).

Revision history for this message
Anand Chitipothu (anandology) wrote :

Found a way to run fastcgis in prefork mode. In that mode, a pool of processes is started to handle the requests, and it is possible to specify the maximum number of requests a process should handle. That will save us from the memory leaks.

Shall we try that for coverstore fastcgis? I think it might be a good idea to move the coverstore fastcgis to 26.
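For reference, a sketch of that prefork setup using flup's fcgi_fork server. The import is guarded since flup may not be installed where this runs, and the pool size, request cap, and port are illustrative assumptions:

```python
# flup is the fastcgi container OL uses, but it may not be installed here.
try:
    from flup.server.fcgi_fork import WSGIServer
except ImportError:
    WSGIServer = None

def app(environ, start_response):
    """Trivial WSGI app standing in for the real application."""
    start_response('200 OK', [('Content-Type', 'text/plain')])
    return [b'hello from a prefork worker\n']

def serve(port=9091):
    # Each child exits after maxRequests requests and is respawned,
    # which caps how far a memory leak can grow.
    WSGIServer(app,
               maxChildren=50,    # pool size (illustrative)
               maxRequests=1000,  # recycle after 1000 requests
               bindAddress=('127.0.0.1', port)).run()
```

Here maxChildren and maxRequests are flup PreforkServer options; the exact knobs should be checked against the flup version in use.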

Revision history for this message
samuel-archive (samuel-archive) wrote :

Yes, let's try this on coverstore.

From my experience with Apache prefork running mod_wsgi:

I suggest starting at 50 children, with a max of 150.
I suggest 1000 requests per child before exit & respawn.
(We can go higher if the RSS non-shared memory footprint
doesn't grow too much.)

Revision history for this message
Anand Chitipothu (anandology) wrote :

Tried gunicorn + lighttpd proxy for coverstore and it seems to be working great. Memory usage is quite low even after 24 hours.

One good thing about gunicorn is that it allows restarting the worker processes without taking the website down.
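A hypothetical gunicorn.conf.py sketch (the address and counts are assumptions, not the actual coverstore settings); max_requests recycles workers the same way prefork child limits do, and sending HUP to the gunicorn master restarts workers gracefully:

```python
# gunicorn.conf.py -- hypothetical values, not the real coverstore config
bind = '127.0.0.1:8080'  # lighttpd proxies requests to this address
workers = 4              # number of worker processes
max_requests = 1000      # recycle each worker after this many requests
timeout = 30             # restart workers stuck longer than 30 seconds
```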

Revision history for this message
Anand Chitipothu (anandology) wrote :

Switching to gunicorn solved this issue.

Changed in openlibrary:
status: New → Fix Released