Bug #126713 “have real ops team run the servers” : Bugs : Open Library

Aaron Swartz (aaronsw) on 2007-07-25

Changed in openlibrary:
assignee:	nobody → aaronsw
importance:	Undecided → High
status:	New → Confirmed
assignee:	aaronsw → anandology

Aaron Swartz (aaronsw) wrote on 2007-08-22:

#1

From Joerg:

I agree with handing it over to ops... I'm going to give a quick rant
about what I would want in an ideal world (skip all this crap and go to
the bottom for what will probably happen). Sorry for all of the
parentheses and the unkempt fervor of the mail. There are useful bits
in there even though there isn't full resolution.

Databases:

Replication is nice, but I would be happy with the following. Frequent
(multiple times a day?) backups from the server onto a non-busy spindle
on the same server, copying the backup to a lukewarm spare as well as
off-site (Library of Alexandria? We can get Youssef Eldakar in the loop)

The lukewarm spare loads the backup into its instance of postgresql.

There are several scripts and healthchecks that check the OpenLibrary
database for sanity. They have exit codes of 0,1,2 (OK, WARN, CRIT) so
they are easy to plug into Nagios.
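
A minimal sketch of a check that follows this 0/1/2 convention. The check name, the measured value and the thresholds are hypothetical and are not taken from the actual OpenLibrary scripts:

```python
# Nagios-style status codes as described above.
OK, WARN, CRIT = 0, 1, 2
LABELS = {OK: "OK", WARN: "WARN", CRIT: "CRIT"}


def classify(value, warn_at, crit_at):
    """Map a measured value to a status code (higher values are worse)."""
    if value >= crit_at:
        return CRIT
    if value >= warn_at:
        return WARN
    return OK


def report(name, value, warn_at, crit_at):
    """Return (exit_code, one-line message) for a single check."""
    code = classify(value, warn_at, crit_at)
    return code, f"{name} {LABELS[code]} - value={value}"


def main():
    # Hypothetical example: age of the newest database backup, in hours.
    code, message = report("backup_age_hours", 5, warn_at=8, crit_at=24)
    print(message)
    return code
```

A real script would finish with sys.exit(main()) so that the monitoring system sees the status as the process exit code.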

Search Engine:

The "R" in SolR stands for Replication.  I would like both instances
(full text and normal) to be replicated using SolR replication.  Health
checks like those for the DB should be written... I'm really in favor of
those who have written the app being those to write/come up with the
checks since they have more clue.

Rsyncing the database off to another spindle and then to Alexandria also
makes sense.  If there is going to be replication, then I would LOVE for
the application (and this goes for the DB as well) to have a sense of
LAGGY READER and a WRITER/REAL-TIME READER.  That way, Operations can do
fun things like run keepalived and load balancing against these services
resulting in a service that can go into read-only mode but stay up when
things get funky/wedged.
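
A rough sketch of what a LAGGY READER versus WRITER/REAL-TIME READER split could look like inside the application. The Backend and Router classes are hypothetical names made up for illustration, and the health flags would in practice be set by whatever health checks ops runs:

```python
class Backend:
    """Hypothetical stand-in for a database or SolR endpoint."""

    def __init__(self, name, healthy=True):
        self.name = name
        self.healthy = healthy


class ReadOnlyError(Exception):
    """Raised when a write is attempted while the writer is down."""


class Router:
    """Send writes to the writer and reads to a laggy reader when possible."""

    def __init__(self, writer, laggy_readers):
        self.writer = writer
        self.laggy_readers = list(laggy_readers)

    def for_write(self):
        # With the writer down the service stays up, but read-only.
        if not self.writer.healthy:
            raise ReadOnlyError("writer unavailable; service is read-only")
        return self.writer

    def for_read(self, realtime=False):
        # Real-time reads prefer the writer; they fall back to a laggy
        # reader (possibly stale data) rather than failing outright.
        if realtime and self.writer.healthy:
            return self.writer
        for reader in self.laggy_readers:
            if reader.healthy:
                return reader
        if self.writer.healthy:
            return self.writer
        raise RuntimeError("no healthy backend available")
```

The point of the split is that the application says up front which reads can tolerate stale data, so a wedged writer only costs writes and not the whole site.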

Website/Frontend:

Lots of replicated servers, a minimum of 3. Easily load balanced and
health checked. Caching servers or proxies would be another layer in
front of these.

---------------------------------------------

Application monitoring and self healing:

I'm a huge fan of applications trying to keep themselves alive...
specifically, a tool like "monit" can be really handy..  scaling the
application theoretically can only go so far.. so sticking in bounds
checking for high load, spinning CPUs, overusing RAM and automatically
restarting webservers and SolR instances sounds like a good idea to
me... there are corner cases one needs to watch for (like a SolR
instance using a lot of CPU and causing high load for an extended period
during re-indexing.. etc..) but for the most part I think monit would be
a good thing to noodle on.
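
A small sketch of the bounds-checking idea, written in Python instead of monit's own configuration language. It only asks for a restart when a limit is exceeded for several consecutive samples, and it can be told to stand down during known heavy work such as re-indexing. The class name, the limit and the maintenance flag are all assumptions for illustration:

```python
from collections import deque


class SustainedThreshold:
    """Signal a restart only after `cycles` consecutive samples over `limit`."""

    def __init__(self, limit, cycles):
        self.limit = limit
        self.samples = deque(maxlen=cycles)

    def observe(self, value, maintenance=False):
        """Record one sample; return True when a restart looks warranted."""
        if maintenance:
            # Known heavy work (e.g. re-indexing): forget history, never restart.
            self.samples.clear()
            return False
        self.samples.append(value)
        window_full = len(self.samples) == self.samples.maxlen
        return window_full and all(s > self.limit for s in self.samples)
```

For example, with SustainedThreshold(limit=90, cycles=3) a single CPU sample of 99 does nothing, while three samples in a row above 90 return True.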

---------------------------------------------

Hardware:  In an ideal world, we'd buy 8 servers for $50K to $100K and
make life simple.

We only have one "nice" machine in the OpenLibrary mix.  Given the scale
and architecture of the project, some machines I've been playing with
recently seem like the right thing. 2U, 12x500GB drives, 32GB RAM and
2x4core CPUs. Nice and fast, lots of spindles for lots of speed,
redundancy, hot-swappability and auto-rebuildability.

------------------------------------------

Reality!!

1) Get a list of URLs and scripts from the OpenLibrary team that do good
health checking of DB/SolR/Website
 ** really, this is the most important thing.. specifically, the ideal
would be a http://solrserver/testingurlquery that returns a page whose
md5 digest is always the same (even better than substring matching)
 *** For the web frontend, write a /howhealthyamI.cgi or whatever which
does something like:
--  check local webserver for sanity (high load, disk full, can it talk
to the database server? both of them? does db give expected result? can
it talk to solr instances and get correct results from all that it needs
to talk to (full text, etc..))

If that all works, then the .cgi will print "I'm healthy" if not, it
won't... that's the URL we will check with Nagios...  if you want to
make your life really really easy, call this
http://servername/xx/healthcheck.php (this is what the current
www.archive.org servers use)  if you REALLY want to do it right, like
the www's, then make sure this URL doesn't get rewritten to the
Canonical name of the load balanced hostname. We had a problem where
www01.us.archive.org/xx/healthcheck.php was rewritten as
www.archive.org/xx/healthcheck.php, which was kind of pointless cause
then you are checking the load balanced VIP, not an individual server
(well, you were checking an individual server, but only whether it could
give you a redirect, not whether it could execute a script.)

2) Twist arms until 2 identical types of hardware for each tier exist
3)
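
For step 1 above, a minimal sketch of the two pieces: comparing the md5 digest of a test query's response against a known value, and a health page that prints "I'm healthy" only when every sub-check passes. The function names and the shape of the checks mapping are hypothetical; the real sub-checks (load, disk, database, solr) would have to come from the OpenLibrary team:

```python
import hashlib


def digest_matches(body: bytes, expected_md5: str) -> bool:
    """True if the response body hashes to the digest recorded in advance."""
    return hashlib.md5(body).hexdigest() == expected_md5


def health_page(checks):
    """Run every check and return the text the health URL would print.

    `checks` maps a name to a zero-argument callable returning True when
    healthy. A check that raises is counted as a failure.
    """
    failures = []
    for name, check in checks.items():
        try:
            ok = bool(check())
        except Exception:
            ok = False
        if not ok:
            failures.append(name)
    if failures:
        return "UNHEALTHY: " + ", ".join(failures)
    return "I'm healthy"
```

Nagios would then only need to look for the exact string "I'm healthy" on each individual server's URL, not on the load balanced hostname.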

brewster (brewster) wrote on 2007-08-22:

#2

we have a whole system for monitoring and fixing machines. these were put in a "what the heck" area with no controls.
as we settle down what the hardware should be we should buckle it down (monitoring and the like). are we there now?

-brewster

Anand Chitipothu (anandology) wrote on 2009-02-16:

#3

moved to 1.7 milestone

Changed in openlibrary:
milestone:	1.0 → 1.7

George (george-archive) wrote on 2009-05-08:

#4

Anand - Any idea when you might be able to get Step 1 to the Ops team? I.e., this list of health checking URLs?

I hope to talk to Ralf and the Ops team when we have everyone together in June to get all this nailed.

George (george-archive) wrote on 2009-05-08:

#5

IA has files called stuff like xx_healthcheck.php that they ping every 2 seconds or so... You know?

Ralf is happy to talk to you about it if you like?

George (george-archive) on 2010-06-04

Changed in openlibrary:
milestone:	1.7 → stability

George (george-archive) on 2010-06-30

Changed in openlibrary:
status:	Confirmed → Invalid

George (george-archive) on 2010-07-01

Changed in openlibrary:
milestone:	stability-july-28 → general-bucket
