We should default to local archiving in case of an SSH timeout, or just retry it?

Bug #1385229 reported by Caio Begotti
This bug affects 1 person
Affects: Capomastro
Status: Triaged
Importance: Medium
Assigned to: Unassigned
Milestone: future

Bug Description

[2014-10-15 17:18:28,362: ERROR/MainProcess] Task archives.tasks.generate_checksums[741c01f4-455d-43d2-b89e-77bc144e9db3] raised unexpected: error(110, 'Connection timed out')
Traceback (most recent call last):
  File "/usr/lib/python2.7/dist-packages/celery/app/trace.py", line 218, in trace_task
    R = retval = fun(*args, **kwargs)
  File "/usr/lib/python2.7/dist-packages/celery/app/trace.py", line 398, in __protected_call__
    return self.run(*args, **kwargs)
  File "/usr/lib/python2.7/dist-packages/archives/tasks.py", line 126, in generate_checksums
    transport.start()
  File "/usr/lib/python2.7/dist-packages/archives/transports.py", line 197, in start
    self.ssh_client, self.sftp_client = self._get_ssh_clients()
  File "/usr/lib/python2.7/dist-packages/archives/transports.py", line 180, in _get_ssh_clients
    pkey=self.archive.ssh_credentials.get_pkey())
  File "/usr/lib/python2.7/dist-packages/paramiko/client.py", line 300, in connect
    retry_on_signal(lambda: sock.connect(addr))
  File "/usr/lib/python2.7/dist-packages/paramiko/util.py", line 278, in retry_on_signal
    return function()
  File "/usr/lib/python2.7/dist-packages/paramiko/client.py", line 300, in <lambda>
    retry_on_signal(lambda: sock.connect(addr))
  File "/usr/lib/python2.7/socket.py", line 224, in meth
    return getattr(self._sock,name)(*args)
error: [Errno 110] Connection timed out

That causes the build to not finish completely, and it seems we lose its real status.

Tags: story
Revision history for this message
Caio Begotti (caio1982) wrote :

We'd need to simulate it further, as from what I've seen retry_on_signal should keep retrying the connection until no error occurs.
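
For context, paramiko's retry_on_signal (the helper in the traceback above) roughly does the following: it only retries calls interrupted by a signal (EINTR) and re-raises everything else, including errno 110. This is a paraphrase and worth double-checking against the installed version:

    import errno

    def retry_on_signal(function):
        # Keep retrying only while the call is interrupted by a signal;
        # any other error, such as ETIMEDOUT (110), is re-raised at once.
        while True:
            try:
                return function()
            except EnvironmentError as e:
                if e.errno != errno.EINTR:
                    raise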

Revision history for this message
Caio Begotti (caio1982) wrote :

Also, no artifact is archived if checksum generation fails (possibly because both steps share a single connection that is timing out). Just hit this now:

[2015-02-09 16:18:33,695: ERROR/MainProcess] Task archives.tasks.generate_checksums[395e6510-1a9a-4197-8ccb-4ac799c1a7d5] raised unexpected: error(110, 'Connection timed out')
Traceback (most recent call last):
  File "/usr/lib/python2.7/dist-packages/celery/app/trace.py", line 218, in trace_task
    R = retval = fun(*args, **kwargs)
  File "/usr/lib/python2.7/dist-packages/celery/app/trace.py", line 398, in __protected_call__
    return self.run(*args, **kwargs)
  File "/usr/lib/python2.7/dist-packages/archives/tasks.py", line 126, in generate_checksums
    transport.start()
  File "/usr/lib/python2.7/dist-packages/archives/transports.py", line 197, in start
    self.ssh_client, self.sftp_client = self._get_ssh_clients()
  File "/usr/lib/python2.7/dist-packages/archives/transports.py", line 180, in _get_ssh_clients
    pkey=self.archive.ssh_credentials.get_pkey())
  File "/usr/lib/python2.7/dist-packages/paramiko/client.py", line 300, in connect
    retry_on_signal(lambda: sock.connect(addr))
  File "/usr/lib/python2.7/dist-packages/paramiko/util.py", line 278, in retry_on_signal
    return function()
  File "/usr/lib/python2.7/dist-packages/paramiko/client.py", line 300, in <lambda>
    retry_on_signal(lambda: sock.connect(addr))
  File "/usr/lib/python2.7/socket.py", line 224, in meth
    return getattr(self._sock,name)(*args)
error: [Errno 110] Connection timed out

Revision history for this message
Daniel Manrique (roadmr) wrote :

Interesting. For swift archiving there's no shell-accessible storage of any kind, which breaks some of the assumptions the current checksum calculation makes. What I'm thinking of doing is somehow multiplexing the file-like object we're reading from Jenkins, and using hashlib's update method to calculate the checksum of the stream as it comes in (while simultaneously feeding it to swift). We could adapt this to SSH and local storage too, so we wouldn't even rely on the sha256sum tool being installed locally. Maybe that'll help mitigate this a bit.
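
A minimal sketch of that multiplexing idea, assuming plain file-like source and sink objects (stream_with_checksum and both argument names are illustrative, not existing Capomastro code):

    import hashlib

    def stream_with_checksum(source, sink, chunk_size=64 * 1024):
        # Feed each chunk from the source both to the checksum and to the
        # destination, so no shell access or sha256sum binary is needed.
        digest = hashlib.sha256()
        while True:
            chunk = source.read(chunk_size)
            if not chunk:
                break
            digest.update(chunk)  # checksum the stream as it comes in
            sink.write(chunk)     # ...while simultaneously writing it out
        return digest.hexdigest()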

Revision history for this message
Caio Begotti (caio1982) wrote :

Another possibility is to evaluate a newer version of Paramiko, since the one Capomastro is using is nearly two years old. I can't get the SSH channel working again after a timeout, so perhaps we need to work on this first to get things stable before Swift eventually lands.

Revision history for this message
Daniel Manrique (roadmr) wrote :

Oh sure, that'll be simpler than my swift pipe dreams :)

Revision history for this message
Caio Begotti (caio1982) wrote :

I have tried the latest release of Paramiko through some backports and it didn't help at all. Manually checking access with nc, I noticed the ubuntu user on jenkins/0 at Wendigo could no longer connect to the archiver.

On staging I noticed that the security group rule allowing SSH connections between units had vanished. We can check it again in the future with:

nova secgroup-list-rules juju-stg-pes-capomastro | grep 162.213.32.93

It should return something; this is the fixed IP of our Jenkins boxes on staging behind the IS router.

To manually fix this problem and get the archives working again, run this on Wendigo:

nova secgroup-add-rule juju-stg-pes-capomastro tcp 22 22 162.213.32.93/32

Revision history for this message
Daniel Manrique (roadmr) wrote :

OK, so Caio figured out why the archiver was failing, but we still need to decide whether it makes sense to have a backup archiver of some kind.

Using local storage would be the most reliable solution; the main issue I see is that the capomastro host may not have enough resources to archive locally (otherwise we wouldn't need an external archiver :).

The safest way to implement this would be to declare how much local disk space may be used for failed-archiver backups and implement a FIFO queue there, so that once the space is used up, the oldest backed-up artifacts get deleted (a rough sketch follows). We also need to make provisions for people to recover artifacts from within the capomastro UI, since direct host access is sometimes not possible in our deployment scenario.
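
A rough sketch of that FIFO eviction, assuming backups land as flat files in a single directory (prune_backup_dir and max_bytes are hypothetical names, not existing code):

    import os

    def prune_backup_dir(path, max_bytes):
        # Pair every file with its mtime and sort oldest-first (FIFO order).
        entries = [os.path.join(path, name) for name in os.listdir(path)]
        files = sorted((os.path.getmtime(f), f) for f in entries if os.path.isfile(f))
        total = sum(os.path.getsize(f) for _, f in files)
        # Delete the oldest backed-up artifacts until we fit in the budget.
        for _, f in files:
            if total <= max_bytes:
                break
            total -= os.path.getsize(f)
            os.remove(f)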

Ideally, if builds fail to archive, we'd be loud enough about it that artifacts won't get lost in the queue just because a developer didn't check their build status.

I'm triaging this as future since it looks like a slightly larger feature, but feel free to move it around if it needs to be done sooner.

Changed in capomastro:
importance: Undecided → Medium
Revision history for this message
Daniel Manrique (roadmr) wrote :

Oh, and we could implement a simple retry policy to protect against transient archiver failures, though it wouldn't have helped with the problem we just had, because firewall access was entirely broken :/
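
A minimal sketch of such a retry policy using Celery's bind/retry API (the task body and get_transport are illustrative, not the real tasks.py):

    from socket import error as socket_error
    from celery import shared_task

    @shared_task(bind=True, max_retries=3, default_retry_delay=60)
    def generate_checksums(self, archive_id):
        try:
            transport = get_transport(archive_id)  # hypothetical helper
            transport.start()
        except socket_error as exc:
            # Retry transient network failures after a delay; a persistent
            # firewall breakage still exhausts retries and fails loudly.
            raise self.retry(exc=exc)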

Changed in capomastro:
status: New → Triaged
milestone: none → future
tags: added: story