Network problems in CI environment

Bug #1567580 reported by Ben Nemec
Affects   Status         Importance   Assigned to   Milestone
tripleo   Fix Released   Critical     Unassigned

Bug Description

We seem to be having intermittent networking glitches in the CI environment. This is manifesting as undercloud installs that fail cloning the puppet modules with messages like:

fatal: unable to access 'https://github.com/puppetlabs/puppetlabs-concat.git/': Could not resolve host: github.com; Name or service not known

or

error: RPC failed; result=6, HTTP code = 0
fatal: The remote end hung up unexpectedly

I can't logstash on this right now (although I pushed a CI change that would allow us to: https://review.openstack.org/302999), but it appears to be happening frequently. If you see a job that failed after 15 to 25 minutes, chances are it's this. I've also seen image builds fail in similar ways, so there are probably some longer failed jobs that are also being hit by this.

Just eyeballing the status page, I'd say half of our jobs are failing on something related to this right now, so I'm calling it critical.

Ben Nemec (bnemec) wrote :

I should note that I don't think this is just github flakiness. The DNS lookup failure happens before we even try to talk to github.
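A quick way to check that distinction on a CI node is to separate name resolution from the TCP connection. The following is only a minimal sketch, not something that was run as part of this bug: the function name `classify_host` and its injectable `resolver` and `connector` parameters are hypothetical, chosen so the logic can be exercised without network access.

```python
import socket


def classify_host(host, port=443, resolver=socket.getaddrinfo,
                  connector=socket.create_connection, timeout=5.0):
    """Return 'dns_failure', 'connect_failure', or 'ok' for host:port.

    resolver and connector are parameters (an assumption of this sketch)
    so that the check can be tested with stand-ins instead of a real network.
    """
    # Step 1: name resolution only. A failure here matches the
    # "Could not resolve host" symptom and never reaches the remote server.
    try:
        resolver(host, port)
    except socket.gaierror:
        return "dns_failure"

    # Step 2: the name resolved, so now try to open a TCP connection.
    try:
        conn = connector((host, port), timeout=timeout)
    except OSError:
        return "connect_failure"
    conn.close()
    return "ok"
```

Calling `classify_host("github.com")` on an affected node would return `"dns_failure"` if the lookup itself is what fails, which points at the CI environment's resolver rather than at the remote host.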

Ben Nemec (bnemec) wrote :

Can probably add

2016-04-07 17:55:42.960 | Slave went offline during the build
2016-04-07 17:55:42.960 | ERROR: Connection was broken: java.io.IOException: Unexpected termination of the channel

to the list of symptoms.
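For anyone triaging failed jobs by hand, the symptoms listed so far can be matched against a job's console log. This is a rough sketch only and is not the logstash query from the linked review; the names `SYMPTOMS`, `find_symptoms`, and `in_suspect_window` are made up for illustration, and the patterns are taken directly from the messages quoted in this bug.

```python
import re

# Patterns copied from the error messages quoted in this bug report.
SYMPTOMS = {
    "dns_failure": re.compile(r"Could not resolve host"),
    "git_rpc_failure": re.compile(
        r"RPC failed; result=\d+, HTTP code = \d+"
        r"|The remote end hung up unexpectedly"
    ),
    "slave_offline": re.compile(
        r"Slave went offline during the build"
        r"|Unexpected termination of the channel"
    ),
}


def find_symptoms(log_text):
    """Return the set of symptom names whose pattern appears in log_text."""
    found = set()
    for line in log_text.splitlines():
        for name, pattern in SYMPTOMS.items():
            if pattern.search(line):
                found.add(name)
    return found


def in_suspect_window(duration_minutes):
    """True if a failed job ran for roughly 15 to 25 minutes."""
    return 15 <= duration_minutes <= 25
```

A job that fails inside the 15 to 25 minute window and matches at least one pattern is a likely instance of this bug; longer jobs such as image builds would need the pattern match alone.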

John Trowbridge (trown) wrote :

Setting alert on this, status page is almost all red.

tags: added: alert

Derek Higgins (derekh) wrote :

I noticed last night that nodepool seems to have leaked all of its floating IPs. After this, it started constantly starting new instances and deleting them again (after they failed to get a floating IP). I deleted all the free and unused floating IPs from nodepool's account, and that seems to have settled things down (as they are now available to new instances).
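The selection step of that cleanup can be sketched as a filter over floating IP records. This is not the command that was actually run, which is not recorded here. It assumes Neutron-style records where an unassociated floating IP has `port_id` set to `None`; the function name `unattached_floating_ips` and the sample addresses are hypothetical, and the actual deletion call is deliberately left out.

```python
def unattached_floating_ips(records):
    """Return the floating IP records that are not attached to any port.

    Assumption: each record is a dict with a 'port_id' key that is None
    when the floating IP is free (Neutron-style floating IP resources).
    """
    return [record for record in records if record.get("port_id") is None]


# Illustrative data only; these IDs and addresses are invented.
sample = [
    {"id": "fip-1", "floating_ip_address": "192.0.2.10", "port_id": None},
    {"id": "fip-2", "floating_ip_address": "192.0.2.11", "port_id": "port-a"},
    {"id": "fip-3", "floating_ip_address": "192.0.2.12", "port_id": None},
]

free_ids = [record["id"] for record in unattached_floating_ips(sample)]
```

Listing the candidates before deleting anything makes it easier to confirm that only leaked addresses are released.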

We also currently have a problem with a commit to the mongodb puppet modules; once that is fixed, we'll know for sure whether the network problem is now solved.

Emilien Macchi (emilienm) wrote :

Looking at http://tripleo.org/cistatus.html today, a lot of green.

* mongodb patch has been reverted, we don't have the bug anymore.
* nodepool issue seems fixed by Derek.

Dropping the tag for now.

tags: removed: alert

Sagi (Sergey) Shnaidman (sshnaidm) wrote :

Derek Higgins (derekh) wrote :

Closing this, as we haven't seen these errors in some time.

Closing this bug, please feel free to reopen it if you disagree.

Changed in tripleo:
status: Triaged → Fix Released