have real ops team run the servers
Bug #126713 reported by Aaron Swartz
Affects | Status | Importance | Assigned to | Milestone
---|---|---|---|---
Open Library | Invalid | High | Anand Chitipothu |
Bug Description
Put them under supervise. Have nagios look at them? What does IA do for stuff like this?
Changed in openlibrary:
assignee: nobody → aaronsw
importance: Undecided → High
status: New → Confirmed
assignee: aaronsw → anandology

Changed in openlibrary:
milestone: 1.7 → stability

Changed in openlibrary:
status: Confirmed → Invalid

Changed in openlibrary:
milestone: stability-july-28 → general-bucket
From Joerg:
I agree with handing it over to ops... I'm going to give a quick rant
about what I would want in an ideal world (skip all this crap and go to
the bottom for what will probably happen). Sorry for all of the
parentheses and the unkempt fervor of the mail. There are useful bits
in there even though there isn't full resolution.
Databases:
Replication is nice, but I would be happy with the following: frequent
(multiple times a day?) backups from the server onto a non-busy spindle
on the same server, copying the backup to a lukewarm spare as well as
off-site (Library of Alexandria? We can get Youssef Eldakar in the loop).
The lukewarm spare loads the backup into its instance of PostgreSQL.
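A minimal sketch of that backup cycle, assuming hypothetical host names (`spare-host`, `alexandria-host`) and paths; it only builds the `pg_dump`/`rsync` command lines rather than running them:

```python
from datetime import datetime, timezone

def backup_commands(db="openlibrary", backup_dir="/backups",
                    spare="spare-host", offsite="alexandria-host"):
    """Build the command lines for one backup cycle: dump onto a
    non-busy local spindle, then copy to the lukewarm spare (which
    restores it) and off-site. Hosts/paths are placeholders."""
    stamp = datetime.now(timezone.utc).strftime("%Y%m%d%H%M")
    dump = f"{backup_dir}/{db}-{stamp}.dump"
    return [
        # compressed custom-format dump onto the backup spindle
        ["pg_dump", "-Fc", "-f", dump, db],
        # copy to the lukewarm spare, which loads it via pg_restore
        ["rsync", "-a", dump, f"{spare}:{backup_dir}/"],
        # off-site copy (e.g. Library of Alexandria)
        ["rsync", "-a", dump, f"{offsite}:{backup_dir}/"],
    ]

for cmd in backup_commands():
    print(" ".join(cmd))
```

A cron entry running this a few times a day would cover the "multiple times a day?" cadence.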
There are several scripts and healthchecks that check the OpenLibrary
database for sanity. They have exit codes of 0, 1, 2 (OK, WARN, CRIT), so
they are easy to plug into Nagios.
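That 0/1/2 convention maps directly onto Nagios plugin exit codes (OK/WARNING/CRITICAL). A minimal sketch of one such check, with illustrative thresholds and a made-up "backup age" measurement:

```python
OK, WARN, CRIT = 0, 1, 2  # Nagios plugin exit codes

def classify(value, warn_at, crit_at):
    """Map a measured value onto OK/WARN/CRIT thresholds."""
    if value >= crit_at:
        return CRIT
    if value >= warn_at:
        return WARN
    return OK

def check_backup_age(age_hours, warn_at=8, crit_at=24):
    """One concrete check: age of the newest dump. Thresholds
    here are illustrative, not tuned for OpenLibrary."""
    status = classify(age_hours, warn_at, crit_at)
    label = {OK: "OK", WARN: "WARNING", CRIT: "CRITICAL"}[status]
    print(f"BACKUP {label}: newest dump is {age_hours}h old")
    return status

# A real plugin would end with: sys.exit(check_backup_age(measured_age))
status = check_backup_age(3)
```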
Search Engine:
The "R" in SolR stands for Replication. I would like both instances
(full text and normal) to be replicated using SolR replication. Health
checks like those for the DB should be written... I'm really in favor of
those who have written the app being those to write/come up with the
checks since they have more clue.
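Solr's built-in replication is configured per instance in `solrconfig.xml`; a sketch along these lines (the master host name is hypothetical) would cover both the full-text and normal indexes:

```xml
<!-- on the master: solrconfig.xml -->
<requestHandler name="/replication" class="solr.ReplicationHandler">
  <lst name="master">
    <str name="replicateAfter">commit</str>
    <str name="confFiles">schema.xml,stopwords.txt</str>
  </lst>
</requestHandler>

<!-- on each slave: poll the master for new index versions -->
<requestHandler name="/replication" class="solr.ReplicationHandler">
  <lst name="slave">
    <str name="masterUrl">http://solr-master:8983/solr/replication</str>
    <str name="pollInterval">00:00:60</str>
  </lst>
</requestHandler>
```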
Rsyncing the database off to another spindle and then to Alexandria also
makes sense. If there is going to be replication, then I would LOVE for
the application (and this goes for the DB as well) to have a sense of
LAGGY READER and a WRITER/REAL-TIME READER. That way, Operations can do
fun things like run keepalived and load balancing against these services
resulting in a service that can go into read-only mode but stay up when
things get funky/wedged.
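The LAGGY READER vs WRITER/REAL-TIME READER split could look something like this inside the application (endpoint names are hypothetical); the point is that lag-tolerant reads are routed away from the writer, so the site can drop to read-only but stay up when the writer is wedged:

```python
# Hypothetical endpoints; in production keepalived / a load balancer
# would sit in front of the laggy-reader pool.
ENDPOINTS = {
    "writer": "db-writer:5432",         # all writes, real-time reads
    "laggy_reader": "db-replica:5432",  # reads that tolerate lag
}

def pick_endpoint(operation, read_only_mode=False, tolerates_lag=True):
    """Route a query. In read-only mode writes are refused outright,
    which is what lets the site stay up when things get wedged."""
    if operation == "write":
        if read_only_mode:
            raise RuntimeError("site is in read-only mode")
        return ENDPOINTS["writer"]
    # real-time reads go to the writer; everything else to the replica
    return ENDPOINTS["laggy_reader" if tolerates_lag else "writer"]
```

Ordinary page views would pass `tolerates_lag=True`; an edit form that must show the latest revision would not.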
Website/Frontend:
Lots of replicated servers, a minimum of 3. Easily load balanced and
health checked. Caching servers or proxies would be another layer in
front of these.
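For the load balancer to health-check those frontends, each one needs a cheap endpoint that exercises its dependencies; a sketch (probe names and checks are illustrative, not OpenLibrary's actual ones):

```python
def healthcheck(probes):
    """Run each named probe; any failure means the load balancer
    should pull this frontend out of rotation."""
    failures = [name for name, probe in probes.items() if not probe()]
    return ("fail", failures) if failures else ("ok", [])

# Illustrative probes: a real frontend would touch postgres and
# SolR here, with short timeouts so the check itself can't hang.
probes = {"db": lambda: True, "solr": lambda: True}
print(healthcheck(probes))  # → ('ok', [])
```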
------- ------- ------- ------- ------- ------- ---
Application monitoring and self healing:
I'm a huge fan of applications trying to keep themselves alive...
specifically, a tool like "monit" can be really handy.. scaling the
application theoretically can only go so far.. so sticking in bounds
checking for high load, spinning CPUs, overusing RAM and automatically
restarting webservers and SolR instances sounds like a good idea to
me... there are corner cases one needs to watch for (like a SolR
instance using a lot of CPU and causing high load for an extended period
during re-indexing.. etc..) but for the most part I think monit would be
a good thing to noodle on.
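A monit stanza along these lines (paths, PID file, and thresholds are all illustrative) would cover the restart-on-overuse idea, with the "for N cycles" guard handling the corner case above: a long re-indexing CPU burst has to persist across many monitoring cycles before a restart fires:

```
# /etc/monit.d/solr (illustrative)
check process solr with pidfile /var/run/solr.pid
  start program = "/etc/init.d/solr start"
  stop program  = "/etc/init.d/solr stop"
  # tolerate short spikes (e.g. re-indexing): only restart after
  # the condition has held for 10 consecutive cycles
  if cpu > 90% for 10 cycles then restart
  if totalmem > 2048 MB for 5 cycles then restart
```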
------- ------- ------- ------- ------- ------- ---
Hardware: In an ideal world, we'd buy 8 servers for $50K to $100K and
make life simple.
We only have one "nice" machine in the OpenLibrary mix. Given the scale
and architecture of the project, some machines I've been playing with
recently seem like the right thing: 2U, 12x500GB drives, 32GB RAM and
2x4-core CPUs. Nice and fast, lots of spindles for lots of speed,
redundancy, hot-swappability and auto-rebuildability.
------- ------- ------- ------- ------- -------
Reality!!
1) Get a list of URLs and scripts from the OpenLibrary team that do good
health checking of DB/SolR/Website
** really, this is the most imp...