1.8beta3: Too Many Open Files in maas.log

Bug #1446810 reported by Jason Hobbs
6
This bug affects 1 person
Affects Status Importance Assigned to Milestone
MAAS
Fix Released
Critical
Gavin Panella

Bug Description

My maas.log has lots of these errors in it:

"maas.log:Apr 21 14:10:02 maas-trusty-back-may22 maas.power: [ERROR] mizuchi.local: Failed to refresh power state: [Errno 24] Too many open files"

Also some like this:
maas.log:Apr 21 18:30:41 maas-trusty-back-may22 maas.lease_upload_service: [ERROR] Failed to upload leases: [Errno 24] Too many open files

I've attached logs of "lsof -n" and "ps auxf" and MAAS logs.

There are loads of connects from clusterd to the regiond (on http) in CLOSE_WAIT. Looks like maybe clusterd is leaking fds?

Tags: oil

Related branches

Revision history for this message
Jason Hobbs (jason-hobbs) wrote :
Revision history for this message
Andres Rodriguez (andreserl) wrote :

How many leases are there in /var/lib/maas/dhcp/dhcpd.leases ?

Changed in maas:
importance: Undecided → Critical
milestone: none → 1.8.0
Revision history for this message
Jason Hobbs (jason-hobbs) wrote :

I don't follow the relevance, but:

ubuntu@maas-trusty-back-may22:~$ cat /var/lib/maas/dhcp/dhcpd.leases | grep ^lease | wc -l
27477

Revision history for this message
Blake Rouse (blake-rouse) wrote :

Looks like tgtd also has a huge number of open files. They all seem to be from the same snapshot, so don't think its an issue.

Jason,

Your correct there is a huge number of CLOSE_WAIT sockets. I don't think the cluster is closing the socket correctly.

Revision history for this message
Blake Rouse (blake-rouse) wrote :

There is a total of 940 CLOSE_WAIT connections. They are HTTP connections, looks like the cluster connecting to the region to get the RPC event-loop information.

ubuntu@maas-trusty-back-may22:/var/lib/maas$ sudo lsof -u maas | grep CLOSE_WAIT | wc -l
940

Revision history for this message
Blake Rouse (blake-rouse) wrote :

Okay the CLOSE_WAIT might now be the biggest issue here. tgtd is killing the number of open files.

ubuntu@maas-trusty-back-may22:/var/lib/maas$ sudo lsof | grep tgtd | wc -l
70669

Revision history for this message
Blake Rouse (blake-rouse) wrote :

Looks like tgtd just normally has that many files open. Restarting it keeps it the same amount, and the configuration for maas in tgt looks correct.

Revision history for this message
Jason Hobbs (jason-hobbs) wrote : Re: [Bug 1446810] Re: 1.8beta3: Too Many Open Files in maas.log

The open files limit is per process tree. tgtd is a system service owned by
root; its open files don't count against those used by other parts of MAAS.

On Tue, Apr 21, 2015 at 2:29 PM, Blake Rouse <email address hidden>
wrote:

> Okay the CLOSE_WAIT might now be the biggest issue here. tgtd is killing
> the number of open files.
>
> ubuntu@maas-trusty-back-may22:/var/lib/maas$ sudo lsof | grep tgtd | wc -l
> 70669
>
> --
> You received this bug notification because you are subscribed to the bug
> report.
> https://bugs.launchpad.net/bugs/1446810
>
> Title:
> 1.8beta3: Too Many Open Files in maas.log
>
> Status in MAAS:
> New
>
> Bug description:
> My maas.log has lots of these errors in it:
>
> "maas.log:Apr 21 14:10:02 maas-trusty-back-may22 maas.power: [ERROR]
> mizuchi.local: Failed to refresh power state: [Errno 24] Too many open
> files"
>
> Also some like this:
> maas.log:Apr 21 18:30:41 maas-trusty-back-may22
> maas.lease_upload_service: [ERROR] Failed to upload leases: [Errno 24] Too
> many open files
>
> I've attached logs of "lsof -n" and "ps auxf" and MAAS logs.
>
> There are loads of connects from clusterd to the regiond (on http) in
> CLOSE_WAIT. Looks like maybe clusterd is leaking fds?
>
> To manage notifications about this bug go to:
> https://bugs.launchpad.net/maas/+bug/1446810/+subscriptions
>

Revision history for this message
Raphaël Badin (rvb) wrote :

> Looks like tgtd just normally has that many files open. Restarting it keeps it the same amount, and the configuration for maas
> in tgt looks correct.

That's correct, tgtd changes the max open files limit (it's set to 1048576 on my test system) for its process.

Revision history for this message
Raphaël Badin (rvb) wrote :

> There is a total of 940 CLOSE_WAIT connections. They are HTTP connections, looks like the cluster connecting to the region to
> get the RPC event-loop information.
> ubuntu@maas-trusty-back-may22:/var/lib/maas$ sudo lsof -u maas | grep CLOSE_WAIT | wc -l
> 940

What's the default limit on this machine? On my test system, ulimit -n → 1024 and 940 is dangerously close to 1024.

Gavin Panella (allenap)
Changed in maas:
status: New → In Progress
assignee: nobody → Gavin Panella (allenap)
Revision history for this message
Jason Hobbs (jason-hobbs) wrote :

On 04/22/2015 02:27 AM, Raphaël Badin wrote:
>> There is a total of 940 CLOSE_WAIT connections. They are HTTP connections, looks like the cluster connecting to the region to
>> get the RPC event-loop information.
>> ubuntu@maas-trusty-back-may22:/var/lib/maas$ sudo lsof -u maas | grep CLOSE_WAIT | wc -l
>> 940
>
> What's the default limit on this machine? On my test system, ulimit -n
> → 1024 and 940 is dangerously close to 1024.

It's the same.

Jason

Revision history for this message
Jason Hobbs (jason-hobbs) wrote :

Still seeing this:

May 2 13:32:51 maas-trusty-back-may22 maas.lease_upload_service: [ERROR] Failed to upload leases: [Errno 24] Too many open files

Revision history for this message
Andres Rodriguez (andreserl) wrote :

Jason, can you confirm this stills happens on 1.8b5?

Revision history for this message
Jason Hobbs (jason-hobbs) wrote :

On 05/13/2015 11:44 AM, Andres Rodriguez wrote:
> Jason, can you confirm this stills happens on 1.8b5?

We haven't seen that error since May 8th. It may take a while to show up
though.

There are still a TON of CLOSE_WAIT sockets:

ubuntu@maas-trusty-back-may22:/var/log/maas$ sudo lsof -n | grep
CLOSE_WAIT | grep ^twistd | wc -l
9150

Jason

Revision history for this message
Gavin Panella (allenap) wrote :

Playing around here it appears that lsof reports a single socket once for each thread/task in a process (see the TID column in lsof's output), so that count may be inflated. Doesn't mean there isn't a problem though.

Revision history for this message
Larry Michel (lmic) wrote :

Still haven't seen this since May 8th.

Here's number of CLOSE_WAIT sockets.
ubuntu@maas-trusty-back-may22:~/logs$ sudo lsof -n | grep CLOSE_WAIT | grep ^twistd | wc -l
2948

Revision history for this message
Larry Michel (lmic) wrote :

Current maas version is 1.8.0~beta6+bzr3899-0ubuntu1~trusty1

Changed in maas:
status: In Progress → Incomplete
Gavin Panella (allenap)
Changed in maas:
status: Incomplete → In Progress
Gavin Panella (allenap)
Changed in maas:
status: In Progress → Fix Committed
Changed in maas:
status: Fix Committed → Fix Released
To post a comment you must log in.
This report contains Public information  
Everyone can see this information.

Other bug subscribers

Remote bug watches

Bug watches keep track of this bug in other bug trackers.