Comment 3 for bug 561894

George (george-archive) wrote :

NAGIOS
- status codes: 0 = all good; 1 = warning; 2 = critical error
- checks commonly just look for a string within a webpage, and expect it within a certain timeframe

Current NAGIOS setup?
- http://nagios2.us.archive.org/control/nagios-status.php?hostgroup=24.openlibrary&style=detail&hoststatustypes=15
- Ralf: "It works, but I don't know what to do in the middle of the night. I just restarted everything and it seemed to fix it."
- There are some memory leaks in the program. Restarting the fastcgis should fix those.
- current timeframes (3 seconds: warning; 5 seconds: critical; 10 seconds: give up)
- coverstore already monitored; no change required
- all services in upstart script; all restarts call that
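
As a rough sketch of how the string check and the timeframes above could map onto the NAGIOS status codes (the function name, the exact boundary handling, and the choice to treat a missing string or a timed-out fetch as critical are assumptions, not the current plugin configuration):

```python
# NAGIOS status codes from the notes above.
OK, WARNING, CRITICAL = 0, 1, 2

# Timeframes from the notes: 3 seconds = warning, 5 seconds = critical.
# The 10 second "give up" limit is modelled here as body=None (assumption).
WARN_SECONDS = 3.0
CRIT_SECONDS = 5.0


def page_status(elapsed_seconds, body, expected_text):
    """Return a NAGIOS status code for one page fetch.

    body is the page text, or None if the fetch was abandoned.
    """
    # No page, or the expected string is missing: treat as critical.
    if body is None or expected_text not in body:
        return CRITICAL
    if elapsed_seconds >= CRIT_SECONDS:
        return CRITICAL
    if elapsed_seconds >= WARN_SECONDS:
        return WARNING
    return OK
```

A page that contains the expected string and arrives in 1.2 seconds would be OK; the same page arriving in 3.5 seconds would be a warning.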

SOLR
- live updates to one SOLR
- 4GB upstream SOLR memory (untested under production load)
- check the SOLR restart scripts into the OL repo at olsystem/event.d/solr-upstream

NEED
- To check that each Upstream SOLR index is online by hitting it directly (not through the website). Send one URL request per index (x4):
  - http://ia331507.us.archive.org:8983/solr/works/select?wt=json&q=city
  - http://ia331507.us.archive.org:8984/solr/authors/select?wt=json&q=mark
  - http://ia331507.us.archive.org:8984/solr/subjects/select?wt=json&q=city
  - http://ia331507.us.archive.org:8984/solr/editions/select?wt=json&q=mark
    - searches will contain this if working: "response":{"numFound":
    - if it's not working, you'll get responses in HTML, not JSON
- Stick to one port for SOLR - 8983 (the default)
- Update http://home.us.archive.org/cgi-bin/twiki/viewauth/OpenLibrary/WebHome#Ops
- Ralf needs to know when to start what service - we mostly only restart fastcgis; rarely go beyond that
- Should we monitor each process on each node? (Ralf to investigate)
    - or, CRON job every 5 mins; NAGIOS can ping
    - or, could run /usr/lib/nagios/plugins/check_procs -h against benchmarks (benchmarks TBD)
    - main memory concern is the OL software. Perhaps we should watch that specifically
    - or, look for OUT OF MEMORY in standard error in the log (SOLR)
        - If SOLR runs out of memory, you'll see:
            - SEVERE: java.lang.OutOfMemoryError: GC overhead limit exceeded
- To check memory use for the fastcgis, run this on the ia311532 and ia311533 nodes:
    - /usr/lib/nagios/plugins/check_procs --metric=VSZ -w 2000000 -c 3000000 -a openlibrary-server
        - right now this checks all fastcgis; might be able to isolate specific processes
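
A minimal sketch of the direct SOLR test described above, assuming a small Python check script would be acceptable to NAGIOS. The function names are hypothetical; the marker string and the 10 second give-up time come from these notes. Anything that is not JSON containing the marker (for example an HTML error page) is reported as critical.

```python
import http.client
import urllib.request

OK, CRITICAL = 0, 2

# Per the notes, a working search response contains this string.
MARKER = '"response":{"numFound":'


def solr_body_status(body):
    """Classify a SOLR response body: OK if the marker is present."""
    return OK if MARKER in body else CRITICAL


def check_solr(url, timeout=10):
    """Fetch one SOLR select URL directly and return a NAGIOS status code."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            body = resp.read().decode("utf-8", errors="replace")
    except (OSError, http.client.HTTPException):
        # Connection refused, timeout, HTTP error status, broken response.
        return CRITICAL
    return solr_body_status(body)
```

Each of the four index URLs listed above could be passed to check_solr in turn, with the script exiting with the worst status it saw.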

UPSTREAM NOW
- shared DB with production (database; infobase server; http interface)
- different URL structures

TO LAUNCH
- set up testing server on May 3; run tests through that
- migration:
    - need to remove the URL adapter by adding new versions for all pages
        - restart all the memcache servers
    - move all templates to "a regular place"
    - restart server to make sure Upstream plugin is loading
    - Profit!

TO DO
- Ralf will put these checks into NAGIOS; then we'll review
- on ia331532 and ia331533, run:
    - /usr/lib/nagios/plugins/check_procs --metric=VSZ -w 2000000 -c 3000000 -a 'fastcgi 7031'
    - /usr/lib/nagios/plugins/check_procs --metric=VSZ -w 2000000 -c 3000000 -a 'fastcgi 7030'
- Anand: benchmarks for memory usage - under 2GB is fine; 2GB to 3GB is a warning; over 3GB is critical
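
For reference, one reading of those memory benchmarks as code, using the same numbers as the -w 2000000 and -c 3000000 options above (VSZ in kilobytes). The function names and the treatment of values exactly on a threshold are assumptions; check_procs itself does the real work.

```python
OK, WARNING, CRITICAL = 0, 1, 2

# Same values as the -w / -c options passed to check_procs (VSZ in KB).
WARN_KB = 2000000
CRIT_KB = 3000000


def memory_status(vsz_kb):
    """Map one process's VSZ (in KB) to a NAGIOS status code."""
    if vsz_kb > CRIT_KB:
        return CRITICAL
    if vsz_kb > WARN_KB:
        return WARNING
    return OK


def worst_status(vsz_values):
    """Return the most severe status across several fastcgi processes."""
    return max((memory_status(v) for v in vsz_values), default=OK)
```

With one fastcgi at 1500000 KB and another at 2500000 KB, worst_status would report a warning.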