Make sure it's possible for Ops to restart fastcgi processes through NAGIOS

Bug #561894 reported by George
Affects: Open Library
Status: Fix Released
Importance: High
Assigned to: Ralf Muehlen

Bug Description

This afternoon, 532 began seizing up, Edward was at the doctor's, and there wasn't a thing anyone else could do.

If we're going to get anything like 24-hour coverage when we go to WWW, Ops need to be able to restart things for us.

I've attached a screenshot of the state of the NAGIOS checks at the moment... you can see what cannot be restarted yet on there.

George (george-archive) wrote :
Changed in openlibrary:
assignee: nobody → Anand Chitipothu (anandology)
milestone: none → upstream-to-www
importance: Undecided → Critical
George (george-archive) wrote :

Edward:
- document work search SOLR for Ops team
- Add monitoring for both instances of Upstream SOLRs
- look to move SOLR update process off Edward's dev box onto the SOLR production box (*07)

George (george-archive) wrote :

NAGIOS
0 = all good; 1 = warning; 2 = critical error
- checks commonly just look for strings within webpages, and expect them within a certain timeframe

Current NAGIOS setup?
- http://nagios2.us.archive.org/control/nagios-status.php?hostgroup=24.openlibrary&style=detail&hoststatustypes=15
- Ralf: "It works, but I don't know what to do in the middle of the night. I just restarted everything and it seemed to fix it."
- There are some memory leaks in the program. Restarting the fastcgis should fix those.
- current timeframes (3 seconds: warning; 5 seconds: critical; 10 seconds: give up)
- coverstore already monitored; no change required
- all services are in upstart scripts; all restarts go through those (a minimal restart sketch follows below)
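
As a rough illustration of what a restart would look like on a node, assuming the services are driven by upstart as noted above; the job name openlibrary-server is an assumption here - list the jobs in /etc/event.d/ on the node to see the real ones:

    sudo initctl list | grep openlibrary    # see which OL jobs exist and their current state
    sudo initctl stop openlibrary-server    # stop the fastcgi job (job name is hypothetical)
    sudo initctl start openlibrary-server   # start it again; the NAGIOS check should go green shortly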

SOLR
- live updates to one SOLR
- 4GB upstream SOLR memory (untested under production load)
- check the SOLR restart configuration into the OL repo at olsystem/event.d/solr-upstream

NEED
- To check that each Upstream SOLR index is online individually, hit it directly (not through the website) by sending a URL request:
- A test for each of the indexes (x4)
  - http://ia331507.us.archive.org:8983/solr/works/select?wt=json&q=city
  - http://ia331507.us.archive.org:8984/solr/authors/select?wt=json&q=mark
  - http://ia331507.us.archive.org:8984/solr/subjects/select?wt=json&q=city
  - http://ia331507.us.archive.org:8984/solr/editions/select?wt=json&q=mark
    - searches will contain this string if working: "response":{"numFound":
    - if it's not working, you'll get an HTML error page instead of JSON (a minimal check along these lines is sketched after this list)
- Stick to one port for SOLR - 8983 (the default)
- Update http://home.us.archive.org/cgi-bin/twiki/viewauth/OpenLibrary/WebHome#Ops
- Ralf needs to know when to restart which service - we mostly only restart fastcgis; rarely go beyond that
- Should we monitor each process on each node? (Ralf to investigate)
    - or, CRON job every 5 mins; NAGIOS can ping
    - or, could run /usr/lib/nagios/plugins/check_procs -h against benchmarks (benchmarks TBD)
    - main memory concern is the OL software. Perhaps we should watch that specifically
    - or, look for OUT OF MEMORY messages on standard error in the log (SOLR) - a log-grep sketch follows after this list
        - If SOLR runs out of memory, you'll see:
            - SEVERE: java.lang.OutOfMemoryError: GC overhead limit exceeded
- to check memory use for fastcgis - run on ia311532 and ia311533 nodes
    - /usr/lib/nagios/plugins/check_procs --metric=VSZ -w 2000000 -c 3000000 -a openlibrary-server
        - right now this checks all fastcgis; might be able to isolate specific processes
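
As a rough illustration of the check described above (hitting each index directly and looking for the JSON marker), something along these lines could be wired into NAGIOS; the script name and output text are illustrative only, not an existing plugin:

    #!/bin/sh
    # check_solr_index.sh - illustrative check for one Upstream SOLR index
    # usage: check_solr_index.sh 'http://ia331507.us.archive.org:8983/solr/works/select?wt=json&q=city'
    URL="$1"
    BODY=$(curl -s --max-time 10 "$URL")
    if echo "$BODY" | grep -q '"response":{"numFound"'; then
        echo "SOLR OK - $URL answered with JSON"
        exit 0            # 0 = all good, per the NAGIOS convention above
    else
        echo "SOLR CRITICAL - no JSON response from $URL"
        exit 2            # 2 = critical error
    fi

One script per index keeps the four checks independent, matching the four URLs listed above.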
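
For the out-of-memory case, a correspondingly small sketch that greps SOLR's log for the error string quoted above; the log path here is an assumption - substitute wherever the SOLR instance's standard error actually goes:

    #!/bin/sh
    # check_solr_oom.sh - illustrative check for the SOLR OutOfMemoryError condition
    LOG=/var/log/solr/solr.log    # hypothetical path; point this at SOLR's stderr log
    if grep -q 'java.lang.OutOfMemoryError: GC overhead limit exceeded' "$LOG"; then
        echo "SOLR CRITICAL - OutOfMemoryError in $LOG"
        exit 2
    fi
    echo "SOLR OK - no OutOfMemoryError in $LOG"
    exit 0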

UPSTREAM NOW
- shared DB with production (database; infobase server; http interface)
- different URL structures

TO LAUNCH
- set up testing server on May 3; run tests through that
- migration:
    - need to remove the URL adapter by adding new versions for all pages
        - restart all the memcache servers
    - move all templates to "a regular place"
    - restart server to make sure Upstream plugin is loading
    - Profit!

TO DO
- Ralf will put these checks into NAGIOS; then we'll review (a quick way to sanity-check the thresholds by hand is sketched below)
- ia331532 and ia331533
    - /usr/lib/nagios/plugins/check_procs --metric=VSZ -w 2000000 -c 3000000 -a 'fastcgi 7031'
    - /usr/lib/nagios/plugins/check_procs --metric=VSZ -w 2000000 -c 3000000 -a 'fastcgi 7030'
- Anand: benchmarks...
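
Before wiring the two commands above into NAGIOS, they can be sanity-checked by hand on the nodes; the exit code follows the 0/1/2 convention noted earlier, and the -w/-c thresholds are VSZ in KB (so roughly 2 GB warning, 3 GB critical):

    # run on ia331532 / ia331533
    /usr/lib/nagios/plugins/check_procs --metric=VSZ -w 2000000 -c 3000000 -a 'fastcgi 7031'
    echo "exit code: $?"    # 0 = OK, 1 = warning (VSZ over ~2 GB), 2 = critical (over ~3 GB)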


George (george-archive) wrote :

6PM PST is preferred for calls with Anand if needed.

George (george-archive) wrote :

Can we check that other Ops peeps can log on to OL machines? Like, Sam?

Anand Chitipothu (anandology) wrote : Re: [Bug 561894] Re: Make sure it's possible for Ops to restart fastcgi processes through NAGIOS

On 24-Apr-10, at 5:51 AM, George wrote:

> Can we check that other Ops peeps can log on to OL machines? Like,
> Sam?

Yes, ops people have logins on all nodes in the cluster.

George (george-archive)
Changed in openlibrary:
status: New → In Progress
George (george-archive) wrote :

Ralf - let us know what you still need from us to restart fastcgis.

Changed in openlibrary:
assignee: Anand Chitipothu (anandology) → Ralf Muehlen (launchpad-muehlen)
importance: Critical → High
George (george-archive)
Changed in openlibrary:
milestone: upstream-to-www → stability
Ralf Muehlen (launchpad-muehlen) wrote :

I updated nagios to include restart links for the newer services, and dropped the old upstream services. The only services that cannot currently be restarted are the Search Engines on ia331508 and 09. If someone provides a restart script, I can make the nagios links.

Changed in openlibrary:
status: In Progress → Fix Released
Ralf Muehlen (launchpad-muehlen) wrote :

Search Engines now also have restart links.
