agents see "too many open files" errors after many failed API attempts
| Affects | Status | Importance | Assigned to | Milestone |
|---|---|---|---|---|
| juju-core | | High | Dave Cheney | |
| juju-core 1.22 | | Critical | Dave Cheney | |
| juju-core 1.23 | | Critical | Dave Cheney | |
| juju-core 1.24 | | Critical | Dave Cheney | |
Bug Description
While investigating a customer OpenStack deployment managed by Juju I noticed that many unit and machine agents were failing due to file handle exhaustion ("too many open files") after many failed connections to the (broken) Juju state servers. These agents weren't able to reconnect until they were manually restarted.
My guess is that a failed API connection attempt leaks at least one file handle (but this is just a guess at this stage). It looks like it took about 2 days of failed connection attempts before file handles were exhausted.
The issue was seen with Juju 1.20.9 but it is likely that it's still there in more recent versions.
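The hypothesis above is that each failed connection attempt strands one descriptor. The following is a purely illustrative Go sketch of that failure mode, not Juju's actual dial code (dialAPI and handshake are hypothetical names): if the conn.Close() on the handshake-failure path were missing, every retry would leak one socket, and an agent retrying for days would eventually hit "too many open files".

    package main

    import (
        "errors"
        "fmt"
        "net"
        "time"
    )

    // handshake stands in for the API/websocket handshake against a broken
    // state server; here it always fails.
    func handshake(conn net.Conn) error {
        return errors.New("state server unavailable")
    }

    // dialAPI is a hypothetical helper. Without the conn.Close() below, each
    // failed attempt would leak the descriptor opened by DialTimeout.
    func dialAPI(addr string) (net.Conn, error) {
        conn, err := net.DialTimeout("tcp", addr, 5*time.Second)
        if err != nil {
            return nil, err // dial itself failed; nothing to clean up
        }
        if err := handshake(conn); err != nil {
            conn.Close() // without this, the socket lingers until the process dies
            return nil, err
        }
        return conn, nil
    }

    func main() {
        if _, err := dialAPI("localhost:17070"); err != nil {
            fmt.Println("connection attempt failed:", err)
        }
    }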
| Changed in juju-core: | |
| status: | New → Triaged |
| Changed in juju-core: | |
| milestone: | none → 1.21.3 |
| Changed in juju-core: | |
| milestone: | 1.21.3 → 1.23 |
| no longer affects: | juju-core/1.21 |
| no longer affects: | juju-core/1.22 |
| Casey Marshall (cmars) wrote : | #2 |
Took a look at this one this afternoon. I tried to get the API client to leak sockets in a simple standalone main, to no avail. I'm not convinced it was the API client connections that leaked fds here.
It does seem suspicious, though, that there were a large number of prior mongo errors in the lead-up to the fd exhaustion.
What was the mongo configuration on this state server like? Was it clustered?
| Menno Finlay-Smits (menno.smits) wrote : | #3 |
This environment had 3 state servers.
I'm not sure that mongo could have much to do with this. It was unit and machine agents on non-state-server hosts that ran out of file handles while trying to establish an API connection to the state servers. These agents don't talk directly to mongo.
It might not be the API connection attempts themselves that caused file handles to leak, but something did. Did you try with 1.20.9? It's possible the problem has been fixed since.
| Changed in juju-core: | |
| milestone: | 1.23 → 1.24-alpha1 |
| Changed in juju-core: | |
| milestone: | 1.24-alpha1 → 1.24.0 |
| Changed in juju-core: | |
| milestone: | 1.24.0 → 1.25.0 |
| tags: | added: cpce |
| tags: | removed: cpce |
| Changed in juju-core: | |
| assignee: | nobody → Cheryl Jennings (cherylj) |
| Cheryl Jennings (cherylj) wrote : | #4 |
I've been working on a recreate for this problem and I *might* have one. I have an EC2 environment up and running with 7 state servers and two non-state-server machines. Two state servers are down, and the others are repeatedly killing jujud every minute or so. We started with about 40 sockets in CLOSE_WAIT for jujud on the non-state-server machine I'm monitoring, and after about 2 hours it's up to 100.
The total number of sockets in CLOSE_WAIT seems to vary wildly; previous recreate attempts saw it go up to about 100 and drop back down to 40 at various stages, so I'm going to let this run overnight to confirm whether I am also seeing leaked file handles.
In between recreate attempts, I have been walking various code paths to see if I could spot a case of an API connection not being closed, but haven't found one yet.
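A rough standalone diagnostic sketch of the kind of CLOSE_WAIT monitoring described above, reading /proc directly instead of lsof (Linux-only; the program name and approach are illustrative, not part of the bug's code; it needs permission to read /proc/<pid>/fd, e.g. running as root or as the jujud user):

    package main

    import (
        "fmt"
        "os"
        "strings"
    )

    // closeWaitInodes returns the inode numbers of sockets currently in
    // CLOSE_WAIT (state 08 in /proc/net/tcp and /proc/net/tcp6).
    func closeWaitInodes(path string) map[string]bool {
        inodes := make(map[string]bool)
        data, err := os.ReadFile(path)
        if err != nil {
            return inodes // e.g. tcp6 absent; just return what we have
        }
        for _, line := range strings.Split(string(data), "\n")[1:] { // skip header
            fields := strings.Fields(line)
            if len(fields) >= 10 && fields[3] == "08" { // 08 == TCP_CLOSE_WAIT
                inodes[fields[9]] = true // field 9 is the socket inode
            }
        }
        return inodes
    }

    func main() {
        if len(os.Args) != 2 {
            fmt.Fprintln(os.Stderr, "usage: closewait <pid>")
            os.Exit(1)
        }
        pid := os.Args[1]

        inodes := closeWaitInodes("/proc/net/tcp")
        for inode := range closeWaitInodes("/proc/net/tcp6") {
            inodes[inode] = true
        }

        fdDir := "/proc/" + pid + "/fd"
        entries, err := os.ReadDir(fdDir)
        if err != nil {
            fmt.Fprintln(os.Stderr, err)
            os.Exit(1)
        }
        count := 0
        for _, e := range entries {
            target, err := os.Readlink(fdDir + "/" + e.Name())
            if err != nil {
                continue
            }
            // Socket fds are symlinks of the form "socket:[<inode>]".
            if strings.HasPrefix(target, "socket:[") {
                inode := strings.TrimSuffix(strings.TrimPrefix(target, "socket:["), "]")
                if inodes[inode] {
                    count++
                }
            }
        }
        fmt.Printf("pid %s: %d sockets in CLOSE_WAIT\n", pid, count)
    }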
| Cheryl Jennings (cherylj) wrote : | #5 |
I think I was able to recreate a file handle leak by setting up an EC2 environment with one mysql machine and 7 state servers. I manually shut down two of the state servers, and had a script on the others that would kill jujud every 1-2 minutes.
After running overnight, I saw that there were 163 sockets belonging to jujud in the CLOSE_WAIT state, as reported by lsof.
The current suspicion is that there is a problem in the go.net library when we try to close the websocket:
    // Close implements the io.Closer interface.
    func (ws *Conn) Close() error {
        err := ws.frameHandler.WriteClose(ws.defaultCloseStatus)
        if err != nil {
            return err
        }
        return ws.rwc.Close()
    }
I have confirmed that we are getting an EOF error from WriteClose, and that closing rwc even if we get an error there seems to eliminate the problem of extra sockets lying around in CLOSE_WAIT (only initial testing). However, it seems to make the local juju/juju tests explode, and we need to work out why.
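A sketch of the change being described, applied to the Close method quoted above (the actual upstream patch may differ in detail): close the underlying rwc even when WriteClose fails, e.g. returns EOF because the server is already gone, so the socket does not linger in CLOSE_WAIT.

    // Close implements the io.Closer interface.
    func (ws *Conn) Close() error {
        err := ws.frameHandler.WriteClose(ws.defaultCloseStatus)
        err1 := ws.rwc.Close() // always release the file descriptor
        if err != nil {
            return err
        }
        return err1
    }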
| Dave Cheney (dave-cheney) wrote : Re: [Bug 1420057] Re: agents see "too many open files" errors after many failed API attempts | #6 |
https:/
| Cheryl Jennings (cherylj) wrote : | #7 |
I re-ran my recreate scenario with the websocket fix and noticed that we still leaked a couple of file handles. After some additional tracing, I see that we are missing a close for the API connection in the case where we fail to start some of the MachineAgent workers. We need to add a defer function to cmd/jujud/ along the lines of:
    defer func() {
        if err != nil {
            // close the API connection that was opened earlier
        }
    }()
I will add in these changes and re-run my tests overnight to see if any other file handles are leaked.
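A minimal, self-contained sketch of the defer pattern described above, with hypothetical names (apiConn, openAPIConnection, startWorkers, runAgent are illustrative, not the actual juju identifiers): the deferred function inspects the named error return and closes the API connection if any later worker fails to start, so the error path does not leak the socket.

    package main

    import (
        "errors"
        "fmt"
    )

    type apiConn struct{}

    func (c *apiConn) Close() error { return nil }

    func openAPIConnection() (*apiConn, error) { return &apiConn{}, nil }

    func startWorkers(c *apiConn) error { return errors.New("worker failed to start") }

    func runAgent() (err error) {
        conn, err := openAPIConnection()
        if err != nil {
            return err
        }
        // Without this defer, the early return below would leave conn open and
        // leak one file descriptor on every failed agent start.
        defer func() {
            if err != nil {
                conn.Close()
            }
        }()
        if err = startWorkers(conn); err != nil {
            return err
        }
        return nil
    }

    func main() {
        fmt.Println("runAgent:", runAgent())
    }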
| Changed in juju-core: | |
| assignee: | Cheryl Jennings (cherylj) → Dave Cheney (dave-cheney) |
| Dave Cheney (dave-cheney) wrote : | #8 |
1.22 does not use x/net/websocket
| Changed in juju-core: | |
| status: | Triaged → Fix Committed |
| John A Meinel (jameinel) wrote : | #9 |
Given that we saw the bug in 1.20 and 1.22 is the stable release we are targeting, I think we still need a fix here. I'm guessing we're using an older version of the library from before it was 'x/net'; can we just bugfix that library, or do we have to switch our dependencies in 1.22?
If the dependency fix is too much, we *could* probably get away with WontFix on 1.22, as this should only be a problem when an environment is unhappy. But we'd definitely want people to know that this bug exists and that if they have API servers going down they may need to restart all of their Juju agents.
| Dave Cheney (dave-cheney) wrote : | #10 |
I have a fix for 1.22 pending review.
| Changed in juju-core: | |
| status: | Fix Committed → Fix Released |

