NAGIOS
- 0 = all good; 1 = warning; 2 = critical error
- commonly just look for strings within webpages, and expect them within a certain timeframe

Current NAGIOS setup?
- Ralf: "It works, but I don't know what to do in the middle of the night. I just restarted everything and it seemed to fix it."
- There are some memory leaks in the program. Restarting the fastcgis should fix those.
- current timeframes (3 seconds: warning; 5 seconds: critical; 10 seconds: give up)
- coverstore already monitored; no change required
- all services in upstart script; all restarts call that

SOLR
- live updates to one SOLR
- 4GB upstream SOLR memory (untested under production load)
- check in SOLR restart stuff into OL repo at olsystem/event.d/solr-upstream

NEED
- To check that Upstream SOLR is online (individually) by hitting it directly (not through the website). Send a URL request:
  - A test for each of the indexes (x4)
    -
    -
    -
    -
  - searches will contain this if working: "response":{"numFound":
  - if it's not working, you'll get responses in HTML, not JSON
- Stick to one port for SOLR - 8983 (the default)
- Update
- Ralf needs to know when to start what service
  - we mostly only restart fastcgis; rarely go beyond that
- Should we monitor each process on each node? (Ralf to investigate)
  - or, CRON job every 5 mins; NAGIOS can ping
  - or, could run /usr/lib/nagios/plugins/check_procs -h against benchmarks (benchmarks TBD)
  - main memory concern is the OL software.
    Perhaps we should watch that specifically
  - or, look for OUT OF MEMORY in standard error in the log (SOLR)
    - If SOLR runs out of memory, you'll see:
      - SEVERE: java.lang.OutOfMemoryError: GC overhead limit exceeded
- to check memory use for fastcgis
  - run on ia311532 and ia311533 nodes
  - /usr/lib/nagios/plugins/check_procs --metric=VSZ -w 2000000 -c 3000000 -a openlibrary-server
  - right now this checks all fastcgis; might be able to isolate specific processes

UPSTREAM NOW
- shared DB with production (database; infobase server; http interface)
- different URL structures

TO LAUNCH
- set up testing server on May 3; run tests through that
- migration:
  - need to remove the URL adapter by adding new versions for all pages
  - restart all the memcache servers
  - move all templates to "a regular place"
  - restart server to make sure Upstream plugin is loading
  - Profit!

TO DO
- Ralf will put these checks into NAGIOS; then we'll review
  - ia331532 and ia331533
  - /usr/lib/nagios/plugins/check_procs --metric=VSZ -w 2000000 -c 3000000 -a 'fastcgi 7031'
  - /usr/lib/nagios/plugins/check_procs --metric=VSZ -w 2000000 -c 3000000 -a 'fastcgi 7030'
- Anand: benchmarks for memory usage
  - if <2GB it's fine; <3GB warning; >3GB critical
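
SKETCH: DIRECT SOLR CHECK
- One possible shape for the direct SOLR check under NEED, as a minimal Python sketch rather than the agreed implementation. It looks for the "response":{"numFound": marker in the body and maps the result onto the NAGIOS exit codes (0/1/2).
- Assumptions, not decisions from the meeting: the function names (classify, check_solr, main) are hypothetical; the 3 s / 5 s / 10 s timeframes are borrowed from the current web checks and may not suit SOLR; the search URL is passed in by the caller because the per-index URLs are not recorded in these notes.

```python
import time
import urllib.request

# NAGIOS plugin exit codes: 0 = all good, 1 = warning, 2 = critical error.
OK, WARNING, CRITICAL = 0, 1, 2
UNKNOWN = 3  # standard NAGIOS code for "plugin was called incorrectly"

# Marker the notes say a working search response contains.
# Assumes SOLR returns compact JSON (no whitespace inside the marker).
EXPECTED = '"response":{"numFound":'


def classify(body, elapsed, warn=3.0, crit=5.0):
    """Map a response body and its elapsed time (seconds) to (status, message).

    warn/crit default to the current web-check timeframes; whether they
    are right for SOLR is an assumption.
    """
    if EXPECTED not in body:
        # A broken SOLR answers in HTML, so the JSON marker is missing.
        return CRITICAL, "SOLR CRITICAL - expected JSON marker not found"
    if elapsed >= crit:
        return CRITICAL, "SOLR CRITICAL - response took %.2fs" % elapsed
    if elapsed >= warn:
        return WARNING, "SOLR WARNING - response took %.2fs" % elapsed
    return OK, "SOLR OK - response took %.2fs" % elapsed


def check_solr(url, give_up=10.0):
    """Fetch url directly (not through the website) and classify the result.

    give_up is passed as the socket timeout, so it bounds each network
    operation rather than the total request time.
    """
    start = time.monotonic()
    try:
        with urllib.request.urlopen(url, timeout=give_up) as resp:
            body = resp.read().decode("utf-8", errors="replace")
    except OSError as exc:
        # Covers connection errors, timeouts and HTTP error statuses.
        return CRITICAL, "SOLR CRITICAL - request failed: %s" % exc
    return classify(body, time.monotonic() - start)


def main(argv):
    """Entry point: expects the SOLR search URL as the only argument."""
    if len(argv) != 2:
        print("usage: check_solr.py <solr-search-url>")
        return UNKNOWN
    status, message = check_solr(argv[1])
    print(message)  # NAGIOS shows the first line of output
    return status
```

- To run it as a plugin, the script would end with sys.exit(main(sys.argv)) (after import sys), and NAGIOS would call it once per index URL on port 8983.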