End-to-end check intermittently failing

Bug #1038963 reported by Jonathan Lange
This bug affects 1 person
Affects: pkgme service
Status: Fix Released
Importance: Critical
Assigned to: Jonathan Lange
Milestone: (none listed)

Bug Description

For Canonicalers: https://wiki.canonical.com/IncidentReports/2012-08-14-CA-PkgmeErrorSubmittingSuccess

Every 20 minutes we run an end-to-end check on the Canonical production & staging pkgme-services. The end-to-end check consists of requesting that a PDF (a copy of "The Jabberwocky") be packaged.

Sometimes, the check fails with "Connection Refused" and we don't know why.

Here's the full error::
Submitting success info to 'http://localhost:53290/callback': https://pkgme-service.canonical.com/+output/jabberwocky-11004.tar.gz
[2012-08-14 05:00:58,672: ERROR/MainProcess] Task djpkgme.tasks.BuildPackageTask[a8ca47a0-8aaa-48fb-9747-46c5e5e7f4a5] raised exception: error(111, 'Connection refused')
Traceback (most recent call last):
  File "/usr/lib/python2.7/dist-packages/celery/execute/trace.py", line 153, in trace_task
    R = retval = task(*args, **kwargs)
  File "/srv/pkgme-service.canonical.com/production/pkgme-service/sourcecode/../src/djpkgme/tasks.py", line 529, in run
    logger=self.get_logger())
  File "/srv/pkgme-service.canonical.com/production/pkgme-service/sourcecode/../src/djpkgme/tasks.py", line 113, in submit_pkgme_info
    return submit_to_myapps(metadata['callback_url'], body, logger=logger)
  File "/srv/pkgme-service.canonical.com/production/pkgme-service/sourcecode/../src/djpkgme/tasks.py", line 106, in submit_to_myapps
    url, method='PUT', headers=headers, body=json_body)
  File "/usr/lib/python2.7/dist-packages/httplib2/__init__.py", line 1444, in request
    (response, content) = self._request(conn, authority, uri, request_uri, method, body, headers, redirections, cachekey)
  File "/usr/lib/python2.7/dist-packages/httplib2/__init__.py", line 1196, in _request
    (response, content) = self._conn_request(conn, request_uri, method, body, headers)
  File "/usr/lib/python2.7/dist-packages/httplib2/__init__.py", line 1132, in _conn_request
    conn.connect()
  File "/usr/lib/python2.7/dist-packages/httplib2/__init__.py", line 798, in connect
    raise socket.error, msg
error: [Errno 111] Connection refused

We use Nagios to do the check. Because it has built-in limits that prevent a check from taking more than 10s, and because we expect packaging to sometimes take more than 10s, we do the actual packaging request in a cron job. The Nagios check looks at the stored output from the last cron run and evaluates it.
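
For illustration only, here is a minimal sketch of the evaluation half of that split. It is not the code in lp:txpkgme. The stored-result format (JSON with `finished_at`, `success` and an optional `error`), the function name and the 40-minute staleness limit are all assumptions made up for this example; only the 0/1/2/3 exit codes are the usual Nagios plugin convention.

```python
import json

# Conventional Nagios plugin exit codes.
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3


def evaluate_stored_result(text, now, max_age_seconds=40 * 60):
    """Return (exit_code, message) for the result stored by the cron job.

    Hypothetical format: a JSON object with 'finished_at' (Unix time),
    'success' (bool) and an optional 'error' string.
    """
    try:
        result = json.loads(text)
        finished_at = float(result["finished_at"])
        success = bool(result["success"])
    except (ValueError, KeyError, TypeError):
        return UNKNOWN, "UNKNOWN: could not parse stored result"

    # A stale result means the cron job itself has stopped producing output.
    age = now - finished_at
    if age > max_age_seconds:
        return CRITICAL, "CRITICAL: last result is %d s old" % age
    if not success:
        return CRITICAL, "CRITICAL: %s" % result.get("error", "packaging failed")
    return OK, "OK: packaging succeeded %d s ago" % age
```

The point of the design is that the slow part (packaging) never runs inside the check itself, so the check only has to read a file and stays well under the 10s limit.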

Both the submission and the evaluation are done using tools from lp:txpkgme: submit-for-packaging and check-submit-for-packaging-result respectively.

Jonathan Lange (jml) wrote :

Bug 1038967 makes this more difficult to debug. I recommend that we deploy a workaround/fix for that asap so we can better debug this problem.

Jonathan Lange (jml)
summary: - End-to-end check intermittently failing with "Connection refused"
+ End-to-end check intermittently failing
Jonathan Lange (jml) wrote :

We had thought that bug 1038998 might be the cause of this.

Jonathan Lange (jml) wrote :

But since then, we've concluded that it's actually variation in the run-time of the packaging itself. The check fails if packaging takes more than 30s, and sometimes it takes nearly 60s::

 [2012-08-17 14:20:01,714: INFO/PoolWorker-1] djpkgme.tasks.BuildPackageTask[a15ed0e1-4a70-4e4b-80b5-f0dcf23a76a5]: Running pkgme
 [2012-08-17 14:20:55,910: INFO/PoolWorker-1] djpkgme.tasks.BuildPackageTask[a15ed0e1-4a70-4e4b-80b5-f0dcf23a76a5]: pkgme completed
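
A small sketch of how the run-time can be read off a pair of log lines like these. The function names are made up for this example, and it assumes the timestamp format seen in the log excerpts in this report (`[YYYY-MM-DD HH:MM:SS,mmm:`).

```python
import re
from datetime import datetime

# Matches the leading "[2012-08-17 14:20:01,714:" part of a log line.
TIMESTAMP = re.compile(r"\[(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}),(\d{3}):")


def parse_timestamp(line):
    """Extract the timestamp of a log line as a datetime."""
    match = TIMESTAMP.search(line)
    if match is None:
        raise ValueError("no timestamp in line: %r" % line)
    stamp, millis = match.groups()
    parsed = datetime.strptime(stamp, "%Y-%m-%d %H:%M:%S")
    return parsed.replace(microsecond=int(millis) * 1000)


def elapsed_seconds(start_line, end_line):
    """Seconds between the timestamps of two log lines."""
    delta = parse_timestamp(end_line) - parse_timestamp(start_line)
    return delta.total_seconds()
```

Applied to the "Running pkgme" and "pkgme completed" lines above, this gives roughly 54.2 s, well over the 30s limit.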

James Westby (james-w) wrote :

The current belief is that the code that uses launchpadlib to get the current Ubuntu release
is causing this. If it times out talking to LP, that could slow the work down enough to
produce the timings above.

That should be confirmed as best we can.

Then we might want to adjust timeouts, or prevent retries, to keep the time bounded.

We perhaps want to retry at a higher level.

Nothing seems able to prevent a failure when it can't talk to LP, but that seems
ok, provided it's clear that the failure was in talking to LP.

Thanks,

James
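
One possible shape for the "bounded time, retry at a higher level, clear error" idea from the comment above, as a rough sketch only. It is not the fix that was deployed. `call_with_deadline`, `LaunchpadLookupError` and the `operation(timeout)` callback are hypothetical names; in practice the callback would wrap whatever launchpadlib call fetches the current release.

```python
import time


class LaunchpadLookupError(Exception):
    """Raised when the Launchpad lookup fails within the time budget."""


def call_with_deadline(operation, deadline_seconds, attempts=2,
                       clock=time.monotonic):
    """Call operation(timeout) up to `attempts` times within one time budget.

    Each attempt gets only the time left in the overall budget, so the
    total time spent stays bounded however many attempts are made.
    """
    start = clock()
    last_error = None
    for _ in range(attempts):
        remaining = deadline_seconds - (clock() - start)
        if remaining <= 0:
            break
        try:
            return operation(remaining)
        except OSError as error:  # includes socket timeouts and refusals
            last_error = error
    # Name Launchpad explicitly so the failure is easy to attribute.
    raise LaunchpadLookupError(
        "could not talk to Launchpad within %.0f s: %s"
        % (deadline_seconds, last_error))
```

The lower-level call does no retrying of its own here; the retry lives in one place, and the caller sees either a result or an error that says the problem was talking to LP.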

Jonathan Lange (jml)
Changed in pkgme-service:
status: Triaged → In Progress
assignee: nobody → Jonathan Lange (jml)
Jonathan Lange (jml)
Changed in pkgme-service:
status: In Progress → Fix Released