agents see "too many open files" errors after many failed API attempts

Bug #1420057 reported by Menno Finlay-Smits
Affects    Status        Importance  Assigned to  Milestone
juju-core  Fix Released  High        Dave Cheney
1.22       Fix Released  Critical    Dave Cheney
1.23       Fix Released  Critical    Dave Cheney
1.24       Fix Released  Critical    Dave Cheney

Bug Description

While investigating a customer OpenStack deployment managed by Juju I noticed that many unit and machine agents were failing due to file handle exhaustion ("too many open files") after many failed connections to the (broken) Juju state servers. These agents weren't able to reconnect until they were manually restarted.

My guess is that a failed API connection attempt leaks at least one file handle (but this is just a guess at this stage). It looks like it took about 2 days of failed connection attempts before file handles were exhausted.

The issue was seen with Juju 1.20.9 but it is likely that it's still there in more recent versions.

Revision history for this message
Menno Finlay-Smits (menno.smits) wrote :
description: updated
Tim Penhey (thumper)
Changed in juju-core:
status: New → Triaged
Curtis Hovey (sinzui)
Changed in juju-core:
milestone: none → 1.21.3
Curtis Hovey (sinzui)
Changed in juju-core:
milestone: 1.21.3 → 1.23
Curtis Hovey (sinzui)
no longer affects: juju-core/1.21
no longer affects: juju-core/1.22
Revision history for this message
Casey Marshall (cmars) wrote :

Took a look at this one this afternoon. I tried to get the API client to leak sockets in a simple standalone main, to no avail. I'm not convinced it was the API client connections that leaked fds here.

It does seem suspicious to me, though, that a large number of prior mongo errors got us to the point of exhausting fds.

What was the mongo configuration on this state server like? Was it clustered?

Revision history for this message
Menno Finlay-Smits (menno.smits) wrote :

This environment had 3 state servers.

I'm not sure that mongo could have much to do with this. It was unit and machine agents on non-state-server hosts that ran out of file handles while trying to establish an API connection to the state servers. These agents don't talk directly to mongo.

It might not be the API connection attempts themselves that caused file handles to leak, but something did. Did you try with 1.20.9? It's possible the problem has been fixed since.

Curtis Hovey (sinzui)
Changed in juju-core:
milestone: 1.23 → 1.24-alpha1
Curtis Hovey (sinzui)
Changed in juju-core:
milestone: 1.24-alpha1 → 1.24.0
Curtis Hovey (sinzui)
Changed in juju-core:
milestone: 1.24.0 → 1.25.0
Nate Finch (natefinch)
tags: added: cpce
tags: removed: cpce
Changed in juju-core:
assignee: nobody → Cheryl Jennings (cherylj)
Revision history for this message
Cheryl Jennings (cherylj) wrote :

I've been working on a recreate for this problem and I *might* have one. I have an EC2 environment up and running with 7 state servers and two non-state servers. Two state servers are down, and the others are repeatedly killing jujud every minute or so. We started with about 40 sockets in CLOSE_WAIT for jujud on the non-state server I'm monitoring, and after about 2 hours it's up to 100.

The total number of sockets in CLOSE_WAIT seems to vary wildly, and previous recreate attempts saw it go up to about 100 and drop down to 40 at various stages, so I'm going to let this run overnight to confirm if I am also seeing the leaked file handles.

In between recreate attempts, I have been walking various code paths to see if I could spot a case of an API connection not being closed, but I haven't been able to find any yet.
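
For reference, a small helper along the lines of the measurement used here: it counts a process's sockets stuck in CLOSE_WAIT by shelling out to lsof (assumed to be installed). It is purely illustrative and not part of juju.

package main

import (
        "fmt"
        "os"
        "os/exec"
        "strings"
)

// Count the sockets in CLOSE_WAIT for the given pid, roughly what was
// being watched with lsof in the recreate described above.
func main() {
        if len(os.Args) != 2 {
                fmt.Fprintln(os.Stderr, "usage: closewait <pid>")
                os.Exit(1)
        }
        out, err := exec.Command("lsof", "-p", os.Args[1]).Output()
        if err != nil {
                fmt.Fprintln(os.Stderr, err)
                os.Exit(1)
        }
        n := 0
        for _, line := range strings.Split(string(out), "\n") {
                if strings.Contains(line, "CLOSE_WAIT") {
                        n++
                }
        }
        fmt.Println("sockets in CLOSE_WAIT:", n)
}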

Revision history for this message
Cheryl Jennings (cherylj) wrote :

I think I was able to recreate a file handle leak by setting up an EC2
environment with one mysql machine and 7 state servers. I manually
shut down two of the state servers, and had a script on the others that
would kill jujud every 1 - 2 minutes.

After running overnight, I saw that there were 163 sockets belonging to
jujud in the CLOSE_WAIT state as reported by lsof.

The current suspicion is that there is a problem in the go.net library
when we try to close the websocket:

// Close implements the io.Closer interface.
func (ws *Conn) Close() error {
        err := ws.frameHandler.WriteClose(ws.defaultCloseStatus)
        if err != nil {
                return err
        }
        return ws.rwc.Close()
}

I have confirmed that we are getting an EOF error from WriteClose, and
that closing rwc even when we get an error there seems to eliminate the
problem of extra sockets lying around in CLOSE_WAIT (initial testing
only). However, it seems to make the local juju/juju tests explode, and
we need to work out why.
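
A minimal sketch of the workaround being tested, assuming the Conn internals shown in the snippet above; it simply closes rwc even when the close frame cannot be written, and is not necessarily the final upstream patch:

// Close implements the io.Closer interface.
func (ws *Conn) Close() error {
        // Write the close frame, but close the underlying connection even
        // if that write fails (e.g. with EOF), so the socket is not left
        // behind in CLOSE_WAIT. Report the first error seen.
        err := ws.frameHandler.WriteClose(ws.defaultCloseStatus)
        err1 := ws.rwc.Close()
        if err != nil {
                return err
        }
        return err1
}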

Revision history for this message
Dave Cheney (dave-cheney) wrote : Re: [Bug 1420057] Re: agents see "too many open files" errors after many failed API attempts

https://github.com/golang/go/issues/10866

Revision history for this message
Cheryl Jennings (cherylj) wrote :

I re-ran my recreate scenario with the websocket fix and noticed that we still leaked a couple of file handles. After some additional tracing, I can see that we are missing a close for the API connection in the case where we fail to start some of the MachineAgent workers. We need to add a deferred function in cmd/jujud/machine.go: APIWorker, after a successful call to openAPIState, that looks like this:

        defer func() {
                if err != nil {
                        st.Close()
                }
        }()

I will add in these changes and re-run my tests overnight to see if any other file handles are leaked.
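
As a side note, the deferred close above relies on APIWorker returning a named error so the closure can observe it. A small self-contained sketch of the pattern (the names startWorkers and fakeConn are hypothetical, not the actual juju code):

package main

import (
        "errors"
        "fmt"
        "io"
)

// fakeConn stands in for the API connection; purely illustrative.
type fakeConn struct{}

func (fakeConn) Close() error {
        fmt.Println("connection closed")
        return nil
}

// startWorkers mimics the shape described above: the deferred closure reads
// the named return value err, so the connection is closed whenever a later
// startup step fails instead of leaking its file handle.
func startWorkers(st io.Closer, fail bool) (err error) {
        defer func() {
                if err != nil {
                        st.Close()
                }
        }()
        if fail {
                return errors.New("worker failed to start")
        }
        return nil
}

func main() {
        _ = startWorkers(fakeConn{}, true)  // prints "connection closed"
        _ = startWorkers(fakeConn{}, false) // connection stays open for the workers
}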

Changed in juju-core:
assignee: Cheryl Jennings (cherylj) → Dave Cheney (dave-cheney)
Revision history for this message
Dave Cheney (dave-cheney) wrote :

1.22 does not use x/net/websocket

Changed in juju-core:
status: Triaged → Fix Committed
Revision history for this message
John A Meinel (jameinel) wrote :

Given we saw the bug in 1.20 and 1.22 is the stable release we are targeting, I think we still need a fix here. I'm guessing we're using an older version of the library from before it was 'x/net'; can we just bugfix that library, or do we have to switch our dependencies in 1.22?

If the dependency fix is too much, we *could* probably get away with WontFix on 1.22, as this should only be a problem when an environment is unhappy. But we'd definitely want people to know that this bug exists and that if they have API servers going down they may need to restart all of their Juju agents.

Revision history for this message
Dave Cheney (dave-cheney) wrote :

I have a fix for 1.22 pending review.

Curtis Hovey (sinzui)
Changed in juju-core:
status: Fix Committed → Fix Released