have real ops team run the servers

Bug #126713 reported by Aaron Swartz
Affects: Open Library
Status: Invalid
Importance: High
Assigned to: Anand Chitipothu

Bug Description

Put them under supervise. Have nagios look at them? What does IA do for stuff like this?

Aaron Swartz (aaronsw)
Changed in openlibrary:
assignee: nobody → aaronsw
importance: Undecided → High
status: New → Confirmed
assignee: aaronsw → anandology
Revision history for this message
Aaron Swartz (aaronsw) wrote :

From Joerg:

I agree with handing it over to ops... I'm going to give a quick rant
about what I would want in an ideal world (skip all this crap and go to
the bottom for what will probably happen). Sorry for all of the
parentheses and the unkempt fervor of the mail. There are useful bits
in there even though there isn't full resolution.

Databases:

Replication is nice, but I would be happy with the following: frequent
(multiple times a day?) backups from the server onto a non-busy spindle
on the same server, then copying the backup to a lukewarm spare as well as
off-site (Library of Alexandria? We can get Youssef Eldakar in the loop).

The lukewarm spare loads the backup into its instance of PostgreSQL.

There are several scripts and healthchecks that check the OpenLibrary
database for sanity. They have exit codes of 0,1,2 (OK, WARN, CRIT) so
they are easy to plug into Nagios.
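
As a rough illustration of the exit-code convention described above, here is a minimal Python sketch of a check that maps a measurement to 0/1/2. The metric (hours since the last backup), the thresholds, and the function names are all assumptions made for the example; they are not the actual OpenLibrary scripts.

```python
import sys

# Exit codes following the OK/WARN/CRIT convention mentioned above.
OK, WARN, CRIT = 0, 1, 2
LABELS = {OK: "OK", WARN: "WARN", CRIT: "CRIT"}


def classify(value, warn_at, crit_at):
    """Map a measurement to an exit code; higher values are worse."""
    if value >= crit_at:
        return CRIT
    if value >= warn_at:
        return WARN
    return OK


def run_check(name, value, warn_at, crit_at):
    """Print a one-line status and return the exit code."""
    code = classify(value, warn_at, crit_at)
    print(f"{LABELS[code]}: {name}={value} (warn>={warn_at}, crit>={crit_at})")
    return code


def main(argv):
    # Hypothetical measurement: hours since the last successful backup,
    # passed on the command line so the sketch needs no database access.
    hours_since_backup = float(argv[1]) if len(argv) > 1 else 0.0
    return run_check("backup_age_hours", hours_since_backup, 12, 24)
```

A wrapper script would end with `sys.exit(main(sys.argv))` so that the monitoring system sees the status as the process exit code.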

Search Engine:

The "R" in SolR stands for Replication. I would like both instances
(full text and normal) to be replicated using SolR replication. Health
checks like those for the DB should be written... I'm really in favor of
those who have written the app being those to write/come up with the
checks since they have more clue.

Rsyncing the database off to another spindle and then to Alexandria also
makes sense. If there is going to be replication, then I would LOVE for
the application (and this goes for the DB as well) to have a sense of
LAGGY READER and a WRITER/REAL-TIME READER. That way, Operations can do
fun things like run keepalived and load balancing against these services
resulting in a service that can go into read-only mode but stay up when
things get funky/wedged.
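
One way to picture the LAGGY READER vs. WRITER/REAL-TIME READER split is a small router that sends writes to the writer and lag-tolerant reads to a replica. This is only a sketch: the `Backend`, `Router`, and `ReadOnlyError` names are hypothetical, and real health state would come from external checks rather than a boolean flag.

```python
from dataclasses import dataclass


@dataclass
class Backend:
    name: str
    healthy: bool = True


class ReadOnlyError(Exception):
    """Raised when a write arrives while the writer is unavailable."""


class Router:
    """Send writes and fresh reads to the writer, lag-tolerant reads to a replica."""

    def __init__(self, writer, laggy_reader):
        self.writer = writer
        self.laggy_reader = laggy_reader

    def pick(self, is_write=False, needs_fresh=False):
        if is_write:
            if not self.writer.healthy:
                # The service stays up, but in read-only mode.
                raise ReadOnlyError("writer is down; service is read-only")
            return self.writer
        if needs_fresh and self.writer.healthy:
            return self.writer
        # Lag-tolerant reads prefer the replica. A fresh read also lands here
        # when the writer is down, so it may see stale data.
        if self.laggy_reader.healthy:
            return self.laggy_reader
        if self.writer.healthy:
            return self.writer
        raise RuntimeError("no healthy backend")
```

With the writer marked unhealthy, reads keep being served from the replica while writes fail fast, which is the read-only behaviour described above.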

Website/Frontend:

Lots of replicated servers, a minimum of 3. Easily load balanced and
health checked. Caching servers or proxies would be another layer in
front of these.

---------------------------------------------

Application monitoring and self healing:

I'm a huge fan of applications trying to keep themselves alive...
specifically, a tool like "monit" can be really handy. Scaling the
application theoretically can only go so far, so sticking in bounds
checking for high load, spinning CPUs, and overused RAM, and automatically
restarting webservers and SolR instances, sounds like a good idea to
me. There are corner cases one needs to watch for (like a SolR
instance using a lot of CPU and causing high load for an extended period
during re-indexing, etc.) but for the most part I think monit would be
a good thing to noodle on.
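
The bounds-checking idea, including the re-indexing corner case, could look roughly like the Python sketch below. It is not monit configuration; `BoundsWatcher`, the limit, and the `maintenance` flag are assumptions made for illustration. A restart is suggested only after several consecutive samples exceed the limit, and never while known heavy work is in progress.

```python
from collections import deque


class BoundsWatcher:
    """Decide when to restart a service based on recent resource samples."""

    def __init__(self, limit, cycles):
        self.limit = limit      # e.g. CPU percent or load average
        self.cycles = cycles    # consecutive breaches required before acting
        self.recent = deque(maxlen=cycles)

    def observe(self, value, maintenance=False):
        """Record one sample; return True if a restart is warranted."""
        if maintenance:
            # Expected heavy work (e.g. re-indexing): forget history, never restart.
            self.recent.clear()
            return False
        self.recent.append(value > self.limit)
        if len(self.recent) == self.cycles and all(self.recent):
            self.recent.clear()  # start fresh after triggering
            return True
        return False
```

Requiring several consecutive breaches avoids restarting on a short spike, at the cost of reacting a few sampling intervals late.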

---------------------------------------------

Hardware: In an ideal world, we'd buy 8 servers for $50K to $100K and
make life simple.

We only have one "nice" machine in the OpenLibrary mix. Given the scale
and architecture of the project, some machines I've been playing with
recently seem like the right thing: 2U, 12x500GB drives, 32GB RAM and
2x4-core CPUs. Nice and fast, lots of spindles for lots of speed,
redundancy, hot-swappability and auto-rebuildability.

------------------------------------------

Reality!!

1) Get a list of URLs and scripts from the OpenLibrary team that do good
health checking of DB/SolR/Website
 ** really, this is the most imp...


brewster (brewster) wrote :

we have a whole system for monitoring and fixing machines. these were put in a "what the heck" area with no controls.
as we settle what the hardware should be, we should buckle it down (monitoring and the like). are we there now?

-brewster

Anand Chitipothu (anandology) wrote :

moved to 1.7 milestone

Changed in openlibrary:
milestone: 1.0 → 1.7
Revision history for this message
George (george-archive) wrote :

Anand - Any idea when you might be able to get Step 1 to the Ops team? I.e., the list of health checking URLs?

I hope to talk to Ralf and the Ops team when we have everyone together in June to get all this nailed.

George (george-archive) wrote :

IA has files called stuff like xx_healthcheck.php that they ping every 2 seconds or so... You know?

Ralf is happy to talk to you about it if you like?
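
For reference, a healthcheck pinger of the kind described above can be sketched in a few lines of Python. The URL list would be whatever the OpenLibrary team supplies; nothing here reflects IA's actual tooling, and treating only HTTP 200 as healthy is an assumption.

```python
import urllib.error
import urllib.request


def fetch_status(url, timeout=2.0):
    """Return the HTTP status code for url, or None if the request fails."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status
    except urllib.error.HTTPError as err:
        # The server answered, but with an error status (4xx/5xx).
        return err.code
    except (urllib.error.URLError, OSError):
        # Connection refused, DNS failure, timeout, and similar.
        return None


def poll(urls, fetch=fetch_status):
    """Check each URL once; return {url: bool}, where True means HTTP 200."""
    return {url: fetch(url) == 200 for url in urls}
```

A scheduler or a simple loop with a sleep would call `poll` every couple of seconds; the `fetch` parameter exists so the logic can be exercised without network access.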

George (george-archive)
Changed in openlibrary:
milestone: 1.7 → stability
George (george-archive)
Changed in openlibrary:
status: Confirmed → Invalid
George (george-archive)
Changed in openlibrary:
milestone: stability-july-28 → general-bucket