We should default to local archiving in case of an SSH timeout, or just retry it?

Bug #1385229 reported by Caio Begotti
This bug affects 1 person
Affects: Capomastro
Status: Triaged
Importance: Medium
Assigned to: Unassigned
Milestone: future

Bug Description

[2014-10-15 17:18:28,362: ERROR/MainProcess] Task archives.tasks.generate_checksums[741c01f4-455d-43d2-b89e-77bc144e9db3] raised unexpected: error(110, 'Connection timed out')
Traceback (most recent call last):
  File "/usr/lib/python2.7/dist-packages/celery/app/trace.py", line 218, in trace_task
    R = retval = fun(*args, **kwargs)
  File "/usr/lib/python2.7/dist-packages/celery/app/trace.py", line 398, in __protected_call__
    return self.run(*args, **kwargs)
  File "/usr/lib/python2.7/dist-packages/archives/tasks.py", line 126, in generate_checksums
    transport.start()
  File "/usr/lib/python2.7/dist-packages/archives/transports.py", line 197, in start
    self.ssh_client, self.sftp_client = self._get_ssh_clients()
  File "/usr/lib/python2.7/dist-packages/archives/transports.py", line 180, in _get_ssh_clients
    pkey=self.archive.ssh_credentials.get_pkey())
  File "/usr/lib/python2.7/dist-packages/paramiko/client.py", line 300, in connect
    retry_on_signal(lambda: sock.connect(addr))
  File "/usr/lib/python2.7/dist-packages/paramiko/util.py", line 278, in retry_on_signal
    return function()
  File "/usr/lib/python2.7/dist-packages/paramiko/client.py", line 300, in <lambda>
    retry_on_signal(lambda: sock.connect(addr))
  File "/usr/lib/python2.7/socket.py", line 224, in meth
    return getattr(self._sock,name)(*args)
error: [Errno 110] Connection timed out

That causes the build to not finish completely, and it seems we lose its real status.

Tags: story
Revision history for this message
Caio Begotti (caio1982) wrote :

We'd need to simulate it further, as from what I've seen retry_on_signal should keep retrying the connection until no error occurs.
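
For context, paramiko's retry_on_signal (the helper in the traceback above) roughly does the following: it only retries calls interrupted by a signal (EINTR) and re-raises everything else, including errno 110. This is a paraphrase and worth double-checking against the installed version:

    import errno

    def retry_on_signal(function):
        # Keep retrying only while the call is interrupted by a signal;
        # any other error, such as ETIMEDOUT (110), is re-raised at once.
        while True:
            try:
                return function()
            except EnvironmentError as e:
                if e.errno != errno.EINTR:
                    raise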

Revision history for this message
Caio Begotti (caio1982) wrote :

Also, no artifact is archived if checksum generation fails (possibly because both steps share a single connection that is timing out). Just hit this now:

[2015-02-09 16:18:33,695: ERROR/MainProcess] Task archives.tasks.generate_checksums[395e6510-1a9a-4197-8ccb-4ac799c1a7d5] raised unexpected: error(110, 'Connection timed out')
Traceback (most recent call last):
  File "/usr/lib/python2.7/dist-packages/celery/app/trace.py", line 218, in trace_task
    R = retval = fun(*args, **kwargs)
  File "/usr/lib/python2.7/dist-packages/celery/app/trace.py", line 398, in __protected_call__
    return self.run(*args, **kwargs)
  File "/usr/lib/python2.7/dist-packages/archives/tasks.py", line 126, in generate_checksums
    transport.start()
  File "/usr/lib/python2.7/dist-packages/archives/transports.py", line 197, in start
    self.ssh_client, self.sftp_client = self._get_ssh_clients()
  File "/usr/lib/python2.7/dist-packages/archives/transports.py", line 180, in _get_ssh_clients
    pkey=self.archive.ssh_credentials.get_pkey())
  File "/usr/lib/python2.7/dist-packages/paramiko/client.py", line 300, in connect
    retry_on_signal(lambda: sock.connect(addr))
  File "/usr/lib/python2.7/dist-packages/paramiko/util.py", line 278, in retry_on_signal
    return function()
  File "/usr/lib/python2.7/dist-packages/paramiko/client.py", line 300, in <lambda>
    retry_on_signal(lambda: sock.connect(addr))
  File "/usr/lib/python2.7/socket.py", line 224, in meth
    return getattr(self._sock,name)(*args)
error: [Errno 110] Connection timed out

Revision history for this message
Daniel Manrique (roadmr) wrote :

Interesting. For swift archiving there's no shell-accessible storage of any kind, which breaks some of the assumptions the current checksum calculation makes. What I'm thinking of doing is somehow multiplexing the file-like object we're reading from Jenkins, and using hashlib's update method to calculate the checksum of the stream as it comes in (while simultaneously feeding it to swift). We could adapt this to SSH and local storage too, so we wouldn't even rely on the sha256sum tool being installed locally. Maybe that'll help mitigate this a bit.
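
A minimal sketch of that multiplexing idea, assuming plain file-like source and sink objects (stream_with_checksum and both argument names are illustrative, not existing Capomastro code):

    import hashlib

    def stream_with_checksum(source, sink, chunk_size=64 * 1024):
        # Feed each chunk from the source both to the checksum and to the
        # destination, so no shell access or sha256sum binary is needed.
        digest = hashlib.sha256()
        while True:
            chunk = source.read(chunk_size)
            if not chunk:
                break
            digest.update(chunk)  # checksum the stream as it comes in
            sink.write(chunk)     # ...while simultaneously writing it out
        return digest.hexdigest()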

Revision history for this message
Caio Begotti (caio1982) wrote :

Another possibility is to evaluate a newer version of Paramiko, since the one Capomastro is using is nearly two years old. I can't get the SSH channel working again after a timeout, so perhaps we need to work on this first to get things stable before Swift eventually lands.

Revision history for this message
Daniel Manrique (roadmr) wrote :

Oh sure, that'll be simpler than my swift pipe dreams :)

Revision history for this message
Caio Begotti (caio1982) wrote :

I have tried the latest release of Paramiko through some backports and it didn't help at all. Manually checking access with nc, I noticed the ubuntu user on jenkins/0 at Wendigo could no longer connect to the archiver.

On staging I noticed that the security group rule allowing SSH connections between units had vanished. We can check it again in the future with:

nova secgroup-list-rules juju-stg-pes-capomastro | grep 162.213.32.93

It should return something; this is the fixed IP of our Jenkins boxes on staging behind the IS router.

To manually fix this problem and get the archives working again, run this on Wendigo:

nova secgroup-add-rule juju-stg-pes-capomastro tcp 22 22 162.213.32.93/32

Revision history for this message
Daniel Manrique (roadmr) wrote :

OK, so Caio figured out why the archiver was failing, but we still need to decide whether it makes sense to have a backup archiver of some kind.

Using local storage would be the most reliable solution; the main issue I see is that the capomastro host may not have enough resources to archive locally (otherwise we wouldn't need an external archiver :).

The safest way to implement this would be to declare how much local disk space may be used for failed-archiver backups and implement a FIFO queue there, so that once the space is used up, the oldest backed-up artifacts get deleted (a rough sketch follows). We also need to make provisions for people to recover artifacts from within the capomastro UI, since direct host access is sometimes not possible in our deployment scenario.
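
A rough sketch of that FIFO eviction, assuming backups land as flat files in a single directory (prune_backup_dir and max_bytes are hypothetical names, not existing code):

    import os

    def prune_backup_dir(path, max_bytes):
        # Pair every file with its mtime and sort oldest-first (FIFO order).
        entries = [os.path.join(path, name) for name in os.listdir(path)]
        files = sorted((os.path.getmtime(f), f) for f in entries if os.path.isfile(f))
        total = sum(os.path.getsize(f) for _, f in files)
        # Delete the oldest backed-up artifacts until we fit in the budget.
        for _, f in files:
            if total <= max_bytes:
                break
            total -= os.path.getsize(f)
            os.remove(f)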

Ideally, if builds fail to archive, we'd be loud enough about it that artifacts won't get lost in the queue just because a developer didn't check their build status.

I'm triaging this as future since it looks like a slightly larger feature, but feel free to move it around if it needs to be done sooner.

Changed in capomastro:
importance: Undecided → Medium
Revision history for this message
Daniel Manrique (roadmr) wrote :

Oh, and we could implement a simple retry policy to protect against transient archiver failures, though it wouldn't have helped with the problem we just had, because firewall access was entirely broken :/
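
A minimal sketch of such a retry policy using Celery's bind/retry API (the task body and get_transport are illustrative, not the real tasks.py):

    from socket import error as socket_error
    from celery import shared_task

    @shared_task(bind=True, max_retries=3, default_retry_delay=60)
    def generate_checksums(self, archive_id):
        try:
            transport = get_transport(archive_id)  # hypothetical helper
            transport.start()
        except socket_error as exc:
            # Retry transient network failures after a delay; a persistent
            # firewall breakage still exhausts retries and fails loudly.
            raise self.retry(exc=exc)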

Changed in capomastro:
status: New → Triaged
milestone: none → future
tags: added: story