dbus needs more than the default 1024 open files

Bug #381063 reported by Steve Bergman
This bug affects 5 people
Affects          Status         Importance   Assigned to   Milestone
dbus (Ubuntu)    Fix Released   Medium       Unassigned
ltsp (Ubuntu)    Invalid        Undecided    Unassigned

Bug Description

=== PROBLEM ===
On a multiuser system with many desktop users, the system dbus-daemon process can easily exceed the 1024 open files allowed by the default ulimit. When it exceeds that, it goes into a tight loop, sucking up 100% processor, and nobody can log in.

=== WORKAROUND ===
Add the line
limit nofile 10000 10000
to /etc/init/dbus.conf. The two numbers are the soft and hard limits. Tested and works on Ubuntu 9.04 and 10.04.

Note: Editing /etc/default/dbus no longer works since the transition to the upstart job.
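
One way to confirm that the raised limit actually reached the running daemon is to read its entry under /proc. This is a minimal sketch, assuming a Linux kernel that exposes /proc/<pid>/limits and that the system bus runs as user 'messagebus' as described in this report. The function names are illustrative only and are not part of dbus or upstart.

```shell
#!/bin/sh
# Print the soft "Max open files" limit from a /proc/<pid>/limits-style file.
# On that line, field 4 is the soft limit and field 5 is the hard limit.
soft_nofile_from_limits() {
    awk '/^Max open files/ { print $4 }' "$1"
}

# Hypothetical helper: report the soft limit of the system dbus-daemon.
# Assumes the daemon runs as user "messagebus".
dbus_soft_nofile() {
    pid=$(pgrep -u messagebus -x dbus-daemon | head -n 1)
    [ -n "$pid" ] || { echo "dbus-daemon not found" >&2; return 1; }
    soft_nofile_from_limits "/proc/$pid/limits"
}
```

After restarting the dbus job, calling dbus_soft_nofile should print the first number from the limit stanza if the change took effect.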

=== ORIGINAL DESCRIPTION ===
Ubuntu 9.04
dbus 1.2.12-0ubuntu2
We're on x86_64, but this is really arch independent

On a multiuser system with many desktop users, the system dbus-daemon process can easily exceed the 1024 open files allowed by the default ulimit. When it exceeds that, it goes into a tight loop, sucking up 100% processor, and nobody can log in. And, of course, everything using dbus is then adversely affected. Nothing in the logs points to too many open files as being the problem. Only attaching strace to the processes clarified what was happening. This is bad for thin client people. e.g. Edubuntu users and corporate desktop rollouts.

As an example, I have a corporate server running 58 gnome desktops via XDMCP and dbus-daemon has about 1200 files open.
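
The figures above give a rough feel for how quickly the default is exhausted. The sketch below is back-of-the-envelope integer arithmetic only; it ignores the descriptors the daemon holds for system services regardless of how many desktops are running, and the function names are made up for illustration.

```shell
#!/bin/sh
# Integer estimate of descriptors held per desktop session.
fds_per_desktop() {  # args: total_open_fds desktops
    echo $(( $1 / $2 ))
}

# Rough number of desktops that fit under a given descriptor limit.
max_desktops() {  # args: limit fds_per_desktop
    echo $(( $1 / $2 ))
}

per=$(fds_per_desktop 1200 58)   # figures from this report
echo "about $per descriptors per desktop"
echo "about $(max_desktops 1024 "$per") desktops fit under a limit of 1024"
```

With 1200 descriptors across 58 desktops, this works out to about 20 per desktop, so the default of 1024 is used up at roughly 50 desktops.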

Increasing the default "ulimit -n" setting for user 'messagebus' or including a higher "ulimit -n" in the sysvinit file solves the problem.

Other than that, Ubuntu has done wonderfully compared to the Fedora that we just upgraded from. I'm most pleased with your excellent work.

Revision history for this message
Scott James Remnant (Canonical) (canonical-scott) wrote :

I need confirmation of the "tight loop" - could you cause this and run strace on the dbus-daemon to capture what it's doing.

Ideally also run "gdb" on it and use "bt" to obtain a backtrace

Changed in dbus (Ubuntu):
importance: Undecided → Medium
status: New → Incomplete
Revision history for this message
Jean-Michel Dault (jmdault) wrote :

We have the same problem too:

root@SLXATS2:~# ps -u messagebus
  PID TTY TIME CMD
 2677 ? 02:28:11 dbus-daemon

root@SLXATS2:~# ls /proc/2677/fd|wc -l
1024
root@SLXATS2:~#

root@SLXATS2:~# (strace -p 2677 2>&1 1>&3 | grep "Too many open files" 1>&2) 3>&1
accept(3, 0xbf8ec168, [16]) = -1 EMFILE (Too many open files)
accept(3, 0xbf8ec168, [16]) = -1 EMFILE (Too many open files)
accept(3, 0xbf8ec168, [16]) = -1 EMFILE (Too many open files)
accept(3, 0xbf8ec168, [16]) = -1 EMFILE (Too many open files)
accept(3, 0xbf8ec168, [16]) = -1 EMFILE (Too many open files)
accept(3, 0xbf8ec168, [16]) = -1 EMFILE (Too many open files)
accept(3, 0xbf8ec168, [16]) = -1 EMFILE (Too many open files)
accept(3, 0xbf8ec168, [16]) = -1 EMFILE (Too many open files)
accept(3, 0xbf8ec168, [16]) = -1 EMFILE (Too many open files)

Here is a detailed excerpt from the strace output:
accept(3, 0xbf8ec168, [16]) = -1 EMFILE (Too many open files)
gettimeofday({1252007529, 345943}, NULL) = 0
poll([{fd=3, events=POLLIN}, {fd=6, events=POLLIN}, {fd=10, events=POLLIN}, {fd=9, events=POLLIN}, {fd=11, events=POLLIN}, {fd=12, events=POLLIN}, {fd=13, events=POLLIN}, {fd=30, events=POLLIN}, {fd=32, events=POLLIN}, {fd=29, events=POLLIN}, {fd=329, events=POLLIN}, {fd=343, events=POLLIN}, {fd=345, events=POLLIN}, {fd=346, events=POLLIN}, {fd=358, events=POLLIN}, {fd=38, events=POLLIN}, {fd=17, events=POLLIN}, {fd=21, events=POLLIN}, {fd=23, events=POLLIN}, {fd=24, events=POLLIN}, {fd=27, events=POLLIN}, {fd=211, events=POLLIN}, {fd=217, events=POLLIN}, {fd=221, events=POLLIN}, {fd=222, events=POLLIN}, {fd=271, events=POLLIN}, {fd=321, events=POLLIN}, {fd=325, events=POLLIN}, {fd=327, events=POLLIN}, {fd=328, events=POLLIN}, {fd=36, events=POLLIN}, {fd=339, events=POLLIN}, ...], 1018, 254392) = 1 ([{fd=3, revents=POLLIN}])
gettimeofday({1252007529, 349210}, NULL) = 0
accept(3, 0xbf8ec168, [16]) = -1 EMFILE (Too many open files)
gettimeofday({1252007529, 349515}, NULL) = 0
poll([{fd=3, events=POLLIN}, {fd=6, events=POLLIN}, {fd=10, events=POLLIN}, {fd=9, events=POLLIN}, {fd=11, events=POLLIN}, {fd=12, events=POLLIN}, {fd=13, events=POLLIN}, {fd=30, events=POLLIN}, {fd=32, events=POLLIN}, {fd=29, events=POLLIN}, {fd=329, events=POLLIN}, {fd=343, events=POLLIN}, {fd=345, events=POLLIN}, {fd=346, events=POLLIN}, {fd=358, events=POLLIN}, {fd=38, events=POLLIN}, {fd=17, events=POLLIN}, {fd=21, events=POLLIN}, {fd=23, events=POLLIN}, {fd=24, events=POLLIN}, {fd=27, events=POLLIN}, {fd=211, events=POLLIN}, {fd=217, events=POLLIN}, {fd=221, events=POLLIN}, {fd=222, events=POLLIN}, {fd=271, events=POLLIN}, {fd=321, events=POLLIN}, {fd=325, events=POLLIN}, {fd=327, events=POLLIN}, {fd=328, events=POLLIN}, {fd=36, events=POLLIN}, {fd=339, events=POLLIN}, ...], 1018, 254388) = 1 ([{fd=3, revents=POLLIN}])
gettimeofday({1252007529, 464829}, NULL) = 0
accept(3, 0xbf8ec168, [16]) = -1 EMFILE (Too many open files)
gettimeofday({1252007529, 471040}, NULL) = 0
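
The excerpt is consistent with a spin on the listening socket: poll() reports fd 3 readable, accept() fails with EMFILE, the pending connection stays queued, and the next poll() returns immediately again. For anyone who has saved such a trace to a file, a minimal sketch for spotting the pattern follows. The function name and the log file path in the usage note are hypothetical.

```shell
#!/bin/sh
# Count accept() calls that failed with EMFILE in a saved strace log.
# The pattern is not anchored, so lines prefixed with "[pid N]" also match.
count_emfile_accepts() {
    grep -c 'accept(.*EMFILE' "$1"
}
```

A trace could be captured with strace -p <pid> -o /tmp/dbus.trace and then checked with count_emfile_accepts /tmp/dbus.trace. A count that climbs rapidly over a few seconds suggests the loop seen here.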

Revision history for this message
Steve Bergman (sbergman27) wrote :

James, with all due respect that is absolutely out of the question. I have 70 business users with stable desktops now that "ulimit -n" is set to an appropriate value. You are asking me to essentially crash 70 users in three cities, and then clean up all the domino effect problems, like residual evolution processes, etc. as users report them to me. I cannot justify that when it is clear that since dbus opens some 20 or so files per desktop, 1024 is just a ridiculously low value for a real desktop server.

If we do this test, I'll need an address at Canonical to send the support bill to, along with compensation for my customer's losses. Because they would *never* approve this experiment otherwise.

Changed in dbus (Ubuntu):
status: Incomplete → New
Revision history for this message
Guillaume Pratte (guillaumepratte) wrote :

As a workaround, one can add this line to /etc/default/dbus :

ulimit -n 65535

Revision history for this message
Jean-Michel Dault (jmdault) wrote :

Editing /etc/default/dbus is a great way to fix the problem. It's also configurable.

I suggest the following patch in the dbus package:
--- /etc/default/dbus.orig 2009-09-03 17:19:14.747613907 -0400
+++ /etc/default/dbus 2009-09-03 16:22:09.117573599 -0400
@@ -8,3 +8,8 @@

 # Parameters to pass to dbus.
 PARAMS=""
+
+# dbus-daemon --system needs way more than 1024 opened files
+# on a multi-user system (LTSP, etc)
+ulimit -n 65535
+

Revision history for this message
Steve Bergman (sbergman27) wrote :

As an admin, the worst part of this issue was the fact that things would randomly stop working and lock up. Logins would inexplicably fail. All with no particular pattern, getting worse and worse... and yet the cause was not at all obvious. Nothing in daemon.log. Nothing in syslog. Nothing in messages. Nothing in dmesg. No pop ups on the desktop about anything being out of files. No indication that dbus was the culprit. While I do think that the default needs to be higher, my main objection is just how *silent* this system-wide disaster is.

Solving the problem was like administering Windows. All my usual voluminous logs were absolutely useless, and there was no obvious place to focus my efforts.
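
Until the daemon reports this condition itself, an admin could make it visible with a small watchdog. What follows is only a sketch under stated assumptions: a Linux /proc filesystem, the standard logger utility for writing to syslog, and root privileges to read another user's /proc/<pid>/fd. The function names and the "fdwatch" tag are invented for illustration.

```shell
#!/bin/sh
# Percentage of the descriptor limit in use (integer arithmetic).
fd_usage_percent() {  # args: open_fds limit
    echo $(( $1 * 100 / $2 ))
}

# Hypothetical watchdog: write a syslog warning when a process nears its
# soft "Max open files" limit. Needs root to inspect another user's process.
warn_if_near_limit() {  # args: pid threshold_percent
    open=$(ls "/proc/$1/fd" | wc -l)
    limit=$(awk '/^Max open files/ { print $4 }' "/proc/$1/limits")
    # Skip processes whose limit is "unlimited" or could not be read.
    case $limit in ''|*[!0-9]*) return 0 ;; esac
    pct=$(fd_usage_percent "$open" "$limit")
    if [ "$pct" -ge "$2" ]; then
        logger -t fdwatch -p daemon.warning \
            "pid $1 uses $open of $limit file descriptors ($pct%)"
    fi
}
```

Run periodically from cron against the dbus-daemon PID with a threshold such as 80, this would have left a line in syslog well before logins started failing.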

Revision history for this message
Scott James Remnant (Canonical) (canonical-scott) wrote : Re: [Bug 381063] Re: dbus needs more than the default 1024 open files

On Thu, 2009-09-03 at 19:55 +0000, Jean-Michel Dault wrote:

> We have the same problem too:
>
Were you able to obtain a backtrace from gdb?

Scott
--
Scott James Remnant
<email address hidden>

Revision history for this message
Scott James Remnant (Canonical) (canonical-scott) wrote :

On Thu, 2009-09-03 at 20:16 +0000, Steve Bergman wrote:

> James, with all due respect that is absolutely out of the question. I
> have 70 business users with stable desktops now that "ulimit -n" is set
> to an appropriate value. You are asking me to essentially crash 70 users
> in three cities, and then clean up all the domino effect problems, like
> residual evolution processes, etc. as users report them to me. I cannot
> justify that when it is clear that since dbus opens some 20 or so files
> per desktop, 1024 is just a ridiculously low value for a real desktop
> server.
>
The request was to find out where in the code it's looping, since that's
clearly a bug as well. Unlike out-of-memory situations, the number of
open files isn't going to go down by spinning in one place.

> If we do this test, I'll need an address at Canonical to send the
> support bill to, along with compensation for my customer's losses.
> Because they would *never* approve this experiment otherwise.
>
This is a most unhelpful attitude.

Perhaps you'd like to bear in mind how much money *you* have given
Canonical for the use of Ubuntu, which is clearly benefiting your
business.

Scott
--
Scott James Remnant
<email address hidden>

Revision history for this message
Steve Bergman (sbergman27) wrote :

I'm sorry you feel that way, but I've already done all I can reasonably do without doing a disservice to my customer. And if Canonical does not understand that, then perhaps Ubuntu doesn't belong in the enterprise.

All things considered, RHEL support subscriptions cost a lot less than this experiment would. And they have enough business sense not to take your attitude.

Revision history for this message
Steve Bergman (sbergman27) wrote :

What I can do, when my workload permits, is bring up 60+ vncserver sessions on a test server. Or, really, I suppose anyone else who cares could do it if they get to it first. I suspect that this is going to be quite reproducible. It's not some hard-to-reproduce problem that only the bug reporter could possibly troubleshoot. And you *are* supposed to be the maintainer.

Revision history for this message
Scott James Remnant (Canonical) (canonical-scott) wrote :

On Tue, 2009-09-08 at 15:53 +0000, Steve Bergman wrote:

> I'm sorry you feel that way, but I've already done all I can reasonably
> do without doing a disservice to my customer. And if Canonical does not
> understand that, then perhaps Ubuntu doesn't belong in the enterprise.
>
> All things considered, RHEL support subscriptions cost a lot less than
> this experiment would. And they have enough business sense not to take
> your attitude.
>
If you filed the same bug upstream, you would have exactly the same
response.

If you filed the bug with RHEL, it would be closed if you did not have a
support contract with them. If you had a support contract with
Canonical, you could have filed this with your account manager.

Sorry, but I'm not the one with the attitude here. You've filed a bug,
I've asked for more information for the bug to allow me to reproduce it,
and you refused and got all snotty about it.

Scott
--
Scott James Remnant
<email address hidden>

Revision history for this message
Steve Bergman (sbergman27) wrote :

Well, let me ask you this. Is there someplace where my customers can take out an official Canonical support contract that would be sufficient to persuade Canonical support employees to get off their butts and actually do some troubleshooting instead of looking for any way they can to put it all back on the user? By that, I don't mean that my customers pay money to get told the same thing. But something that would actually make a difference? If so, I have no problems pitching it to them. And if I recommended it, they would probably do it.

For some, running Linux is part of their business and not just a hobby, in case you were not aware of that.

Revision history for this message
Steve Bergman (sbergman27) wrote :

And one more thing. Note that I had my problem solved before I ever opened this ticket. I opened it, as a courtesy, to report the problem, give some insight as to things that made it particularly difficult to track down, and get the workaround published so that others might not have quite so much difficulty in the future.

Beyond that, whether this gets solved or not is not really of material consequence to me. Keep all that in mind.

Revision history for this message
Scott James Remnant (Canonical) (canonical-scott) wrote :

On Tue, 2009-09-08 at 18:36 +0000, Steve Bergman wrote:

> And one more thing. Note that I had my problem solved before I ever
> opened this ticket. I opened it, as a courtesy, to report the problem,
> give some insight as to things that made it particularly difficult to
> track down, and get the workaround published so that others might not
> have quite so much difficulty in the future.
>
So why are you being so arsey when a developer asks you for some
information so that they can understand what you found to be
particularly difficult to track down, so they can track it down
themselves?

Scott
--
Scott James Remnant
<email address hidden>

Revision history for this message
Andres Freund (andres-anarazel) wrote :

I just hit the same problem. Unfortunately in a single-user environment, but that's likely a separate issue...

Backtrace and strace attached.

Revision history for this message
Andres Freund (andres-anarazel) wrote :

That's a current lucid as of today, with dbus 1.2.16-2ubuntu2.

Revision history for this message
Andres Freund (andres-anarazel) wrote :

Hrrm. In my case that seems to be the session bus, not the system bus, so it might not be related. It looks like it's leaking fds somewhere, because during normal usage it has only 110 fds open. New bug, or similar enough?

description: updated
Changed in dbus (Ubuntu):
status: New → Confirmed
Revision history for this message
Martin Pitt (pitti) wrote :

This should be fixed by 1.4.6 according to upstream NEWS.

dbus (1.4.6-1ubuntu1) natty; urgency=low

  * Merge with Debian unstable. Remaining Ubuntu changes:
    - Install into / rather than /usr.
    - debian/dbus.postinst: Use upstart call instead of invoking the init.d
      script for checking if we are already running.
    - Add debian/dbus.upstart.
    - 0001-activation-allow-for-more-variation-than-just-system.patch,
      0002-bus-change-systemd-activation-to-activation-systemd.patch,
      0003-upstart-add-upstart-as-a-possible-activation-type.patch,
      0004-upstart-add-UpstartJob-to-service-desktop-files.patch,
      0005-activation-implement-upstart-activation.patch: Patches from Scott
      James Remnant to implement Upstart service activation. Not upstream.
    - 20_system_conf_limit.patch: Increase max_match_rules_per_connection for
      the system bus to 5000 (LP #454093)
    - 81-session.conf-timeout.patch: Raise the service startup timeout from 25
      to 60 seconds. It may be too short on the live CD with slow machines.
  * debian/rules: Fix creation of /usr/lib/libdbus-1.so symlink.
  * debian/libdbus-1-dev.install: Put back .pc.

 -- Martin Pitt <email address hidden> Thu, 24 Feb 2011 16:41:36 +0100

Changed in dbus (Ubuntu):
status: Confirmed → Fix Released
Changed in ltsp (Ubuntu):
status: New → Invalid