Make gsa_sync tolerant to Business Center not responding to requests

Bug #1205380 reported by Nat Katin-Borland
Affects   Status         Importance   Assigned to    Milestone
KARL3     Fix Released   High         Paul Everitt

Bug Description

Nat wrote:

When GSA/Business Center has an outage, KARL GSA sync frequently goes down and must be manually restarted by the KARL dev team. We need to make KARL more resilient to GSA outages.

Analysis
=================

Our error monitor stays in a constant state of alert these days because of the following traceback:

Traceback (most recent call last):
  File "/srv/osfkarl/production/19/eggs/karlserve-1.23-py2.6.egg/karlserve/scripts/main.py", line 204, in wrapper
    return func(args)
  File "/srv/osfkarl/production/19/eggs/osi-3.47.1-py2.6.egg/osi/scripts/gsa_sync.py", line 56, in main
    gsa_sync = GsaSync(site, args.url, args.user, args.password)
  File "/srv/osfkarl/production/19/eggs/osi-3.47.1-py2.6.egg/osi/sync/gsa_sync.py", line 104, in __init__
    resource = urllib2.urlopen(request)
  File "/usr/lib/python2.6/urllib2.py", line 126, in urlopen
    return _opener.open(url, data, timeout)
  File "/usr/lib/python2.6/urllib2.py", line 391, in open
    response = self._open(req, data)
  File "/usr/lib/python2.6/urllib2.py", line 409, in _open
    '_open', req)
  File "/usr/lib/python2.6/urllib2.py", line 369, in _call_chain
    result = func(*args)
  File "/usr/lib/python2.6/urllib2.py", line 1189, in https_open
    return self.do_open(httplib.HTTPSConnection, req)
  File "/usr/lib/python2.6/urllib2.py", line 1156, in do_open
    raise URLError(err)
URLError: <urlopen error [Errno 104] Connection reset by peer>

However, every time this happens, I immediately go to my browser, open the URL that gets me to the XML, and it works fine. There are several points here:

- We should catch this exception and log it as an INFO, to avoid Nagios getting mad and emailing us constantly

- I'm suspicious about the actual problem. Is gsa_sync unable to connect from gocept but I am able to connect from Virginia? Is there some other, lower level problem (perhaps a certificate issue, DNS issue)?

- Sometimes the cron job gets wedged for days and I have to go in and kill the process. Can we set a Python 2.6 timeout on the socket?
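
A minimal sketch of the first and third points, not the code in osi/sync/gsa_sync.py: it passes a timeout to urlopen and logs connection failures at INFO instead of letting the traceback escape. The names fetch_feed, opener and the "gsa_sync" logger are hypothetical, and the 15-second default is only an assumption for illustration.

```python
import logging
import socket

try:  # Python 2.6, as in the traceback above
    from urllib2 import urlopen, URLError
except ImportError:  # Python 3 keeps the same names in other modules
    from urllib.request import urlopen
    from urllib.error import URLError

log = logging.getLogger("gsa_sync")  # hypothetical logger name


def fetch_feed(request, timeout=15.0, opener=urlopen):
    """Return the response body, or None if the remote side is unreachable.

    `timeout` (seconds) applies to each blocking socket operation, so a
    stalled server cannot keep the cron job waiting forever.
    `opener` is injectable only so the function can be tested offline.
    """
    try:
        resource = opener(request, timeout=timeout)
        try:
            return resource.read()
        finally:
            resource.close()
    except (URLError, socket.error) as exc:
        # Covers "Connection reset by peer" as well as socket.timeout.
        log.info("GSA feed unavailable, skipping this run: %s", exc)
        return None
```

The caller would treat a None result as "nothing to sync this time" and exit cleanly, which keeps the monitor quiet while the outage is still visible in the log.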

Note:

I had previously worried that, based on seeing two processes in "ps auwwx | grep gsa", we were running gsa_sync both from cron and supervisord. I confirmed that I was wrong. There is a shell script that is run from cron with gsa in the shell file name, which calls a Python module via karlserve that has gsa in the module name.

Changed in karl3:
assignee: nobody → Paul Everitt (paul-agendaless)
Changed in karl3:
importance: Undecided → Low
milestone: none → m127
summary: - KARL-GSA Sync Failure
+ Make gsa_sync tolerant to Business Center not responding to requests
Paul Everitt (paul-agendaless) wrote :

Assigning to Tres to try and make gsa_sync more resilient.

description: updated
Changed in karl3:
assignee: Paul Everitt (paul-agendaless) → Tres Seaver (tseaver)
importance: Low → High
Tres Seaver (tseaver) wrote :

"Connection reset by peer" means that the GSA server process we are
talking to hung up unexpectedly (likely, it crashed).

Looking at today's logs, it seems that the process was crashing all
weekend: maybe Ajo fixed it around 10:53 this morning? I can change
the level of the error reported by the script (e.g., to WARNING or
INFO), but I don't see what else we could do.

Changed in karl3:
status: New → In Progress
Paul Everitt (paul-agendaless) wrote : Re: [Bug 1205380] Re: Make gsa_sync tolerant to Business Center not responding to requests

I know we've had some cases where our side has hung (meaning the sync process had been running for a couple of days and I had to kill it). Is it too risky to try putting in some socket timeout code?

--Paul

On Jul 29, 2013, at 11:26 AM, Tres Seaver <email address hidden> wrote:

> "Connection reset by peer" means that the GSA server process we are
> talking to hung up unexpectedly (likely, it crashed).
>
> Looking at today's logs, it seems that the process was crashing all
> weekend: maybe Ajo fixed it around 10:53 this morning? I can change
> the level of the error reported by the script (e.g., to WARNING or
> INFO), but I don't see what else we could do.

Tres Seaver (tseaver) wrote :

I have pushed changes on the 'gsa_sync-timout-1205380' branch which add a
'--timeout' argument to the 'gsa_sync' script. The value defaults to 15 seconds.
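
For readers without access to that branch, a rough sketch of what such an option could look like with argparse. This is not the committed change: only the '--timeout' name and the 15-second default come from the comment above, while build_parser, timeout_from_argv, the float type and the positive-value check are assumptions.

```python
import argparse


def build_parser():
    """Hypothetical parser showing only the --timeout option."""
    parser = argparse.ArgumentParser(prog="gsa_sync")
    parser.add_argument(
        "--timeout",
        type=float,  # assumption: fractional seconds are allowed
        default=15.0,
        help="seconds to wait on the GSA connection before giving up",
    )
    return parser


def timeout_from_argv(argv):
    """Parse argv and return the value to pass as urlopen(..., timeout=...)."""
    args = build_parser().parse_args(argv)
    if args.timeout <= 0:
        raise ValueError("--timeout must be positive")
    return args.timeout
```

The cron wrapper could then pass something like '--timeout 30' if 15 seconds turns out to be too short for the feed.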

Changed in karl3:
status: In Progress → Fix Committed
Paul Everitt (paul-agendaless) wrote :

Over to me for setup on karlstaging.

Changed in karl3:
assignee: Tres Seaver (tseaver) → Paul Everitt (paul-agendaless)
Changed in karl3:
status: Fix Committed → Fix Released