environs/openstack: provider leaks file descriptors

Bug #1170595 reported by Dave Cheney
This bug affects 1 person
Affects     Status         Importance   Assigned to   Milestone
juju-core   Fix Released   High         Dave Cheney

Bug Description

Discovered during load testing

The PA provisions new machines until it reaches around 200, then starts to fail with:

2013/04/19 02:37:59 WARNING environs/openstack: ignoring constraints, using default-instance-type flavor "standard.small"
2013/04/19 02:37:59 ERROR worker/provisioner: cannot start instance for machine "297": failed to get list of flavours
caused by: failed executing the request https://az-2.region-a.geo-1.compute.hpcloudsvc.com/v1.1/17031369947864/flavors
caused by: Get https://az-2.region-a.geo-1.compute.hpcloudsvc.com/v1.1/17031369947864/flavors: lookup az-2.region-a.geo-1.compute.hpcloudsvc.com: Temporary failure in name resolution
2013/04/19 02:37:59 INFO environs: reading tools with major version 1
2013/04/19 02:37:59 DEBUG environs/tools: reading v1.* tools
2013/04/19 02:37:59 DEBUG environs/tools: found 1.10.0.1-precise-amd64
2013/04/19 02:37:59 INFO environs: filtering tools by version: 1.10.0.1
2013/04/19 02:37:59 INFO environs: filtering tools by series: precise
2013/04/19 02:37:59 WARNING environs/openstack: ignoring constraints, using default-instance-type flavor "standard.small"
2013/04/19 02:37:59 ERROR worker/provisioner: cannot start instance for machine "298": failed to get list of flavours
caused by: failed executing the request https://az-2.region-a.geo-1.compute.hpcloudsvc.com/v1.1/17031369947864/flavors
caused by: Get https://az-2.region-a.geo-1.compute.hpcloudsvc.com/v1.1/17031369947864/flavors: lookup az-2.region-a.geo-1.compute.hpcloudsvc.com: Temporary failure in name resolution

2013/04/19 05:29:11 ERROR state: TLS handshake failed: local error: unexpected message
2013/04/19 05:29:11 ERROR state: connection failed, paused for 2s: dial tcp 127.0.0.1:37017: too many open files
2013/04/19 05:29:13 ERROR state: connection failed, paused for 2s: dial tcp 127.0.0.1:37017: too many open files

This time around, the failure pointed to an fd leak, which explains the DNS resolution problems exactly.

Go uses the system resolver library by default. These libraries are still based on select, which has a 1024 fd limit, so if you leak more than 1024 file descriptors (sometimes not even that many), you suddenly can't do DNS resolution.
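
One way to watch for a leak like this is to count the process's open descriptors over time and compare against the 1024 limit mentioned above. The following is only a rough sketch, not code from juju-core: it assumes Linux (it lists /proc/self/fd), and the helper names countOpenFDs, openWithoutClosing and closeAll are made up for illustration. It simulates a leak by opening os.DevNull without closing it.

```go
package main

import (
	"fmt"
	"os"
)

// selectFDLimit is the select-based limit described in the bug report.
const selectFDLimit = 1024

// countOpenFDs returns how many file descriptors this process has open,
// by listing /proc/self/fd. Linux only. The count includes the descriptor
// used to read the directory itself, so compare deltas rather than
// absolute values.
func countOpenFDs() (int, error) {
	entries, err := os.ReadDir("/proc/self/fd")
	if err != nil {
		return 0, err
	}
	return len(entries), nil
}

// openWithoutClosing simulates a leak: it opens n files and never closes
// them. The handles are returned so the caller decides when to release them.
func openWithoutClosing(n int) ([]*os.File, error) {
	files := make([]*os.File, 0, n)
	for i := 0; i < n; i++ {
		f, err := os.Open(os.DevNull)
		if err != nil {
			return files, err
		}
		files = append(files, f)
	}
	return files, nil
}

// closeAll releases every descriptor held in files.
func closeAll(files []*os.File) {
	for _, f := range files {
		f.Close()
	}
}

func main() {
	before, err := countOpenFDs()
	if err != nil {
		fmt.Println("cannot count fds:", err)
		return
	}

	leaked, err := openWithoutClosing(50)
	if err != nil {
		fmt.Println("open failed:", err)
	}
	during, _ := countOpenFDs()

	closeAll(leaked)
	after, _ := countOpenFDs()

	fmt.Printf("open fds: before=%d during=%d after=%d (select limit %d)\n",
		before, during, after, selectFDLimit)
}
```

In a long-running process, logging the result of countOpenFDs periodically would show the count climbing toward the limit well before name resolution starts to fail.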

Dave Cheney (dave-cheney) wrote :

Further investigation leads me to believe the leak is in the connection to the private control bucket.

Changed in juju-core:
status: Triaged → In Progress
Dave Cheney (dave-cheney) wrote :

It probably wasn't the control bucket exactly, but it shared the same IP address as some other services, and one of those was rate limiting our requests (not surprising) during the load test.

Changed in juju-core:
status: In Progress → Fix Committed
Tim Penhey (thumper)
Changed in juju-core:
status: Fix Committed → Fix Released