Comment 6 for bug 1420057

Dave Cheney (dave-cheney) wrote : Re: [Bug 1420057] Re: agents see "too many open files" errors after many failed API attempts

https://github.com/golang/go/issues/10866

On Fri, May 15, 2015 at 11:43 AM, Cheryl Jennings
<email address hidden> wrote:
> I think I was able to recreate a file handle leak by setting up an EC2
> environment with one mysql machine and 7 state servers. I manually
> shut down two of the state servers, and had a script on the others that
> would kill jujud every 1-2 minutes.
>
> After running overnight, I saw that there were 163 sockets belonging to
> jujud in the CLOSE_WAIT state as reported by lsof.
>
> The current suspicion is that there is a problem in the go.net library
> when we try to close the websocket:
>
>     // Close implements the io.Closer interface.
>     func (ws *Conn) Close() error {
>         err := ws.frameHandler.WriteClose(ws.defaultCloseStatus)
>         if err != nil {
>             return err
>         }
>         return ws.rwc.Close()
>     }
>
> I have confirmed that we are getting an EOF error from WriteClose, and
> that closing rwc even when WriteClose returns an error seems to
> eliminate the extra sockets lying around in CLOSE_WAIT (based on
> initial testing only). However, that change makes the local juju/juju
> tests explode, and we still need to figure out why.
>
> Title:
> agents see "too many open files" errors after many failed API attempts
>
> Status in juju-core:
> Triaged
> Status in juju-core 1.24 series:
> Triaged
>
> Bug description:
> While investigating a customer OpenStack deployment managed by Juju I
> noticed that many unit and machine agents were failing due to file
> handle exhaustion ("too many open files") after many failed
> connections to the (broken) Juju state servers. These agents weren't
> able to reconnect until they were manually restarted.
>
> My guess is that a failed API connection attempt leaks at least one
> file handle (but this is just a guess at this stage). It looks like it
> took about 2 days of failed connection attempts before file handles
> were exhausted.
>
> The issue was seen with Juju 1.20.9 but it is likely that it's still
> there in more recent versions.