building ruby1.8 with pthread support causes puppet hangs

Bug #520715 reported by Joel Ebel on 2010-02-11
This bug affects 10 people
Affects                   Importance   Assigned to
eglibc (Ubuntu)           Undecided    Matthias Klose
  Lucid                   Undecided    Matthias Klose
ruby-defaults (Ubuntu)    Undecided    Unassigned
  Lucid                   Undecided    Unassigned

Bug Description

Binary package hint: ruby1.8

Puppet is hanging for us under Lucid with ruby1.8 1.8.7.249-1. I've filed the following bug with upstream ruby regarding this bug:

http://redmine.ruby-lang.org/issues/show/2739

We're not the only ones seeing this problem:
https://groups.google.com/group/puppet-users/browse_thread/thread/7efd79bcd807de4c#

Given the importance of puppet to Ubuntu, I think it best to reconsider building ruby1.8 without pthread support for the time being. As discussed in bug 307462 it provides a performance boost as well. It disables libtcltk-ruby1.8, but no packages depend on that other than the ruby-defaults, so I'd consider puppet to be a far more important use case.

I've provided patches for both packages to disable pthread support, and not build libtcltk-ruby1.8.

Andrew Pollock (apollock) wrote :

Some of the issues with Ruby and pthreads were called out in https://blueprints.launchpad.net/ubuntu/+spec/server-karmic-puppet-integration

Changed in ruby1.8 (Ubuntu):
status: New → Triaged
Changed in ruby-defaults (Ubuntu):
status: New → Triaged

Lucas Nussbaum (lucas) wrote :

On 11/02/10 at 22:33 -0000, Joel Ebel wrote:
> Public bug reported:
>
> Binary package hint: ruby1.8
>
> Puppet is hanging for us under Lucid with ruby1.8 1.8.7.249-1. I've
> filed the following bug with upstream ruby regarding this bug:
>
> http://redmine.ruby-lang.org/issues/show/2739
>
> We're not the only ones seeing this problem:
> https://groups.google.com/group/puppet-users/browse_thread/thread/7efd79bcd807de4c#
>
> Given the importance of puppet to Ubuntu, I think it best to reconsider
> building ruby1.8 without pthread support for the time being. As
> discussed in bug 307462 it provides a performance boost as well. It
> disables libtcltk-ruby1.8, but no packages depend on that other than the
> ruby-defaults, so I'd consider puppet to be a far more important use
> case.

Have you checked that it doesn't break other (ruby) libs that are in the
archive?

puppet is pure-ruby. Are you sure that it isn't a bug in puppet? Or
something that can be fixed in the interpreter?
--
| Lucas Nussbaum
| <email address hidden> http://www.lucas-nussbaum.net/ |
| jabber: <email address hidden> GPG: 1024D/023B3F4F |

tags: added: patch
Joel Ebel (jbebel) wrote :

I don't have a test suite for other ruby libs, so no, I can't say with certainty that building them without pthreads doesn't break them. But since that is the default for upstream ruby, and is used by Red Hat, I believe ruby without pthreads has been pretty thoroughly tested elsewhere.

I'm pretty sure it's a ruby bug. Installing an older version of ruby from Karmic allowed puppet to work fine. You can see in the ruby bug I linked that I was able to find the specific upstream change, which was related to threading, where the problem began. And I've been able to reliably run puppet without problems by disabling pthreads in the current version.

For convenience, here's the upstream change where the badness started happening:
http://svn.ruby-lang.org/cgi-bin/viewvc.cgi?view=rev&revision=24104
And the file diff of what changed:
http://svn.ruby-lang.org/cgi-bin/viewvc.cgi/branches/ruby_1_8_7/eval.c?r1=23997&r2=24104&diff_format=h

Lucas Nussbaum (lucas) wrote :

On 12/02/10 at 00:00 -0000, Joel Ebel wrote:
> I don't have a test suite for other ruby libs, so no I can't say with
> certainty that building them without pthreads doesn't break them, but
> being the default for upstream ruby, and used by Red Hat leads me to
> believe that ruby without pthreads has been pretty thoroughly tested
> elsewhere.

How many libs have you tried?

> I'm pretty sure it's a ruby bug. installing an older version of ruby
> from Karmic allowed puppet to work fine. You can see in the ruby bug I
> linked that I was able to find the specific upstream change, which was
> related to threading, where the problem began. And I've been able to
> reliably run puppet without problems by disabling pthreads in the
> current version.
>
> For convenience, here's the upstream change where the badness started happening:
> http://svn.ruby-lang.org/cgi-bin/viewvc.cgi?view=rev&revision=24104
> And the file diff of what changed:
> http://svn.ruby-lang.org/cgi-bin/viewvc.cgi/branches/ruby_1_8_7/eval.c?r1=23997&r2=24104&diff_format=h

Then please try to talk to upstream so that this bug is fixed, rather
than working around it by disabling pthreads.

With that info, it shouldn't be too hard to really fix the bug for
someone who knows the code, no?
--
| Lucas Nussbaum
| <email address hidden> http://www.lucas-nussbaum.net/ |
| jabber: <email address hidden> GPG: 1024D/023B3F4F |

Lucas Nussbaum (lucas) wrote :

OK, let's try to summarize...

Facts, on --enable-pthread vs --disable-pthread:
- There's a noticeable performance gain with --disable-pthread, at least with some benchmarks.
- Consequences of switching to --disable-pthread are unclear. Linking a non-pthread ruby with a pthread tk doesn't work (so that breaks Ruby/Tk), and other libs might break as well (and are perhaps likely to).
- Ruby has been built with --enable-pthread for ages on Debian and Ubuntu (at least since Dapper).
- You claim that Red Hat doesn't use --enable-pthread, but you are wrong according to http://cvs.fedoraproject.org/viewvc/F-12/ruby/ruby.spec?r1=1.74&r2=1.75&
- Building with --disable-pthread is just a work-around for this issue. And since we don't know what the real issue is yet, it might just hide the issue a bit deeper.

Given the above, it is not a viable solution to build with --disable-pthread.

On the issue itself:
you apparently have a test case, since you were able to pinpoint a specific SVN commit? Could you make it public?

Joel Ebel (jbebel) wrote :

It appears I was incorrect about Red Hat. However, the default ruby build does not include pthreads, which had me confused for quite some time about why the Debian package was failing while a ruby I built by hand worked fine.
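
One way to tell a packaged build from a hand-built one is to look at the configure arguments recorded in the interpreter itself. The following is a minimal sketch, not something taken from this thread: the helper name pthread_setting is hypothetical, and it assumes the flags appear literally as --enable-pthread or --disable-pthread in RbConfig::CONFIG['configure_args']. On Ruby series other than 1.8 the flag may simply be absent.

```ruby
require 'rbconfig'

# Hypothetical helper: classify a configure-argument string by the
# pthread flag it contains. Returns :enabled, :disabled or :unspecified.
def pthread_setting(configure_args)
  return :enabled  if configure_args =~ /--enable-pthread\b/
  return :disabled if configure_args =~ /--disable-pthread\b/
  :unspecified # neither flag was passed, so the build used its default
end

# Usage (inspects the interpreter that runs this script):
#   puts pthread_setting(RbConfig::CONFIG['configure_args'].to_s)
```

A result of :unspecified means the build fell back to whatever the configure script defaults to, which is the case Joel describes for a ruby built by hand.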

Here's what I know:

Reasons to consider disabling pthreads:
- As you say, there are performance impacts of enabling pthreads, particularly with puppet.
- Pthreads causes breakage to puppet under unknown circumstances, but repeatable in certain varied environments.
- Puppet is an important part of ubuntu's plan for cloud management.

Reasons to not disable pthreads:
- It breaks libtcltk-ruby1.8
- This will affect at most 0.03% (22 people) of the debian community according to the votes at:
http://qa.debian.org/popcon.php?package=ruby1.8 Ubuntu popcon data appears to be down at the moment.
- libtcltk-ruby has no packages that depend on it and provides no functionality of its own, thus nothing ubuntu provides makes any use of it.
- Everybody else uses pthreads, and it's what we've always done.
- Disabling pthreads might do something bad.

In summary, pthreads causes significant problems in a well known and important package to the ubuntu community. Disabling it breaks something that is almost completely unused. Otherwise it MAY break something else and we're scared to try it because no one else is doing it. The fact that it's how it's been done for a few years is not in my mind a valid reason to continue doing it when the original reason it was implemented is not clear.

The reasons to disable it are known and significant. The reasons to leave it are unknown or irrational. I think it's time to evaluate priorities.

Here is what I don't know:
- Anything about ruby. I just know I need to get puppet working in our environment, and this was blocking my progress. I found a solution by analyzing things that changed from version to version, and tracked down where the problem began. I don't know how to test ruby libraries or even what any of them do. I would expect that Ubuntu would have a suite of tests to verify that any package changes don't cause regressions. If such tests existed, I think they could verify that functionality without pthreads was still acceptable.

Sure it would be nice if ruby worked with pthreads, and without any performance impact. I've filed the bug, but there's been no response. Even if they look at it, I'm not sure how likely it is to propagate into Lucid before release. Being an LTS release, and the likelihood of enterprises focusing on it, I would think that having a properly working puppet upon release would be a priority, and I have yet to see any tangible benefit that having pthreads enabled gives us.

I do not have a reproducible test case I can make public yet, and even so, it appears to depend on hardware. Some machines do not exhibit the error. Running puppet under strace makes the problem go away. The two together imply that it is very timing dependent. I will continue trying to find a simple way to reproduce the bug.

Lucas Nussbaum (lucas) wrote :

On 12/02/10 at 19:55 -0000, Joel Ebel wrote:
> - As you say, there are performance impacts of enabling pthreads, particularly with puppet.

No. I said that there are performance impacts of enabling pthreads, at
least with some benchmarks. I didn't say anything about puppet.

> - Pthreads causes breakage to puppet under unknown circumstances, but repeatable in certain varied environments.

No. A bug in the pthread code currently causes breakage to puppet.
Puppet worked fine with versions that were not affected by that bug
(even with --enable-pthread).

> Reasons to not disable pthreads:
> - It breaks libtcltk-ruby1.8
> - This will affect at most 0.03% (22 people) of the debian community according to the votes at:
> http://qa.debian.org/popcon.php?package=ruby1.8 Ubuntu popcon data appears to be down at the moment.
> - libtcltk-ruby has no packages that depend on it and provides no functionality of its own, thus nothing ubuntu provides makes any use of it.
> - Everybody else uses pthreads, and it's what we've always done.
> - Disabling pthreads might do something bad.

Your arguments are fallacious. You failed to mention that disabling
pthreads might break other libraries/software.

> In summary, pthreads causes significant problems in a well known and
> important package to the ubuntu community. Disabling it breaks
> something that is almost completely unused. Otherwise it MAY break
> something else and we're scared to try it because no one else is doing
> it. The fact that it's how it's been done for a few years is not in my
> mind a valid reason to continue doing it when the original reason it was
> implemented is not clear.
>
> The reasons to disable it are known and significant. The reasons to
> leave it are unknown or irrational. I think it's time to evaluate
> priorities.

(lol)

> Here is what I don't know:
> - Anything about ruby. I just know I need to get puppet working in our environment, and this was blocking my progress.
> [...]

Let me rephrase your point. You don't care about Ruby. You just care
about Puppet. Ruby has a bug that happens to affect Puppet. Apparently,
modifying ruby makes it possible to work around the bug, possibly
affecting other Ruby users, but you don't care, and prefer to (possibly)
break Ruby for everybody except you.

Nigel Kersten (nigelk) wrote :

Lucas, I'm going to work on a reproducible case that doesn't involve Puppet at all.

I do believe Puppet is triggering a more fundamental problem, but agree we need to clearly demonstrate this.

Micah Anderson (micah-debian) wrote :

On 11/02/10 at 19:55 -0000, Joel Ebel wrote:
>Puppet is hanging for us under Lucid with ruby1.8 1.8.7.249-1.

Could you provide a puppet snippet that causes this hang? I'd be interested in how you are invoking puppet as well. I'm not currently seeing this issue, but it seems to be a specific case, and I'd like to find out if that case is repeatable in Debian as well.

Andrew Pollock (apollock) wrote :

It's really difficult to reproduce. Puppet hangs loading the facts, before it ever gets down to any real business. The fact it consistently hangs on is a custom one we've written. Furthermore, it only seems to hang on some types of hardware, and not under strace.

Although that said, when Joel was trying to track down particular Ruby changes, he was able to get a different build of Ruby to reliably hang, even under strace.

I don't think he could get Puppet to hang when run against a trivial manifest, though, so we've had a hard time trying to come up with a minimalist, self-contained test case.

Nigel Kersten (nigelk) wrote :

So this isn't a Puppet bug at all.

It looks to be a bug in the Ruby Timeout module that seems to be triggered when most of your cores are busy.

I can reliably reproduce it by firing up openssl speed (n-1) times where n is the number of cores and then using the Timeout module.

#!/usr/bin/ruby1.8
#

%x{/usr/bin/touch /tmp/7777}
puts "executed without timeout ok"

puts "executing with timeout"

require 'timeout'

status = Timeout::timeout(5) {
       %x{/usr/bin/touch /tmp/7777}
}

puts "executed with timeout ok"

which will produce something like:

root@testhost:~# ps auxww|grep [o]penssl
root 22337 99.6 0.0 14616 2028 pts/6 R 15:04 2:52 openssl speed
root 22338 99.9 0.0 14616 2028 pts/6 R 15:04 2:49 openssl speed
root 22339 100 0.0 14616 2024 pts/6 R 15:04 2:49 openssl speed

root@testhost:~# ~/tickle_ruby.rb
executed without timeout ok
executing with timeout
/usr/lib/ruby/1.8/timeout.rb:60: execution expired (Timeout::Error)
 from /root/tickle_ruby.rb:11

root@testhost:~# killall openssl
[1] Terminated openssl speed &>/dev/null
[2]- Terminated openssl speed &>/dev/null
[3]+ Terminated openssl speed &>/dev/null

root@testhost:~# ~/tickle_ruby.rb
executed without timeout ok
executing with timeout
executed with timeout ok
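
The reproduction above can be wrapped in a small harness that generates the load and counts how often the timeout fires. This is only a sketch under stated assumptions: the helper names (run_with_timeout, start_burners, stop_burners, stress_trial) are hypothetical, and the load comes from forked Ruby busy loops rather than from openssl speed, so it may not exercise the machine in quite the same way as the original test.

```ruby
require 'timeout'

# Run a shell command inside Timeout.timeout; report :ok or :expired.
def run_with_timeout(command, seconds)
  Timeout.timeout(seconds) { system(command) }
  :ok
rescue Timeout::Error
  :expired
end

# Fork `count` children that spin forever to keep cores busy.
# (Stand-in for the `openssl speed` processes used in the report.)
def start_burners(count)
  Array.new(count) { fork { loop { } } }
end

# Kill and reap the busy-loop children.
def stop_burners(pids)
  pids.each do |pid|
    Process.kill('KILL', pid)
    Process.wait(pid)
  end
end

# Run `command` under a timeout `runs` times while `burners` busy
# processes are active; return how many runs expired.
def stress_trial(runs, burners, seconds, command = '/usr/bin/touch /tmp/7777')
  pids = start_burners(burners)
  results = Array.new(runs) { run_with_timeout(command, seconds) }
  results.count(:expired)
ensure
  stop_burners(pids) if pids
end

# Usage, e.g. on a 4-core machine (n - 1 = 3 burners, 5 second timeout):
#   puts "expired runs: #{stress_trial(20, 3, 5)}"
```

On an unaffected interpreter the expected count is zero, since touch finishes well inside five seconds; any non-zero count on an otherwise healthy machine would match the behaviour reported here.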

Lucas Nussbaum (lucas) wrote :

On 16/02/10 at 23:10 -0000, Nigel Kersten wrote:
> So this isn't a Puppet bug at all.
>
> It looks to be a bug in the Ruby Timeout module that seems to be
> triggered when most of your cores are busy.

Hi,

Just to clarify: how do you know it is the same problem?

> I can reliably reproduce it by firing up openssl speed (n-1) times where
> n is the number of cores and then using the Timeout module.
>
> #!/usr/bin/ruby1.8
> #
>
> %x{/usr/bin/touch /tmp/7777}
> puts "executed without timeout ok"
>
> puts "executing with timeout"
>
> require 'timeout'
>
> status = Timeout::timeout(5) {
> %x{/usr/bin/touch /tmp/7777}
> }
>
> puts "executed with timeout ok"
>
>
> which will produce something like:
>
> root@testhost:~# ps auxww|grep [o]penssl
> root 22337 99.6 0.0 14616 2028 pts/6 R 15:04 2:52 openssl speed
> root 22338 99.9 0.0 14616 2028 pts/6 R 15:04 2:49 openssl speed
> root 22339 100 0.0 14616 2024 pts/6 R 15:04 2:49 openssl speed
>
> root@testhost:~# ~/tickle_ruby.rb
> executed without timeout ok
> executing with timeout
> /usr/lib/ruby/1.8/timeout.rb:60: execution expired (Timeout::Error)
> from /root/tickle_ruby.rb:11
>
> root@testhost:~# killall openssl
> [1] Terminated openssl speed &>/dev/null
> [2]- Terminated openssl speed &>/dev/null
> [3]+ Terminated openssl speed &>/dev/null
>
> root@testhost:~# ~/tickle_ruby.rb
> executed without timeout ok
> executing with timeout
> executed with timeout ok

I could not reproduce this problem on Debian (same Ruby version as on
Ubuntu). Could it be that the libc is at fault, instead of Ruby
itself?

Nigel Kersten (nigelk) wrote :

Sorry, I should have made this clearer.

I work with Joel and Andrew, and am responsible for the Puppet infrastructure here.

I spent a while debugging what was causing Facter and/or Puppet to hang, and it all came down to the calls being wrapped in Timeout.

I did reproduce this on Debian Testing under VMware, but it wasn't *quite* as reproducible as it appears to be under Lucid for me, it would fail one in every few runs rather than absolutely every run when other cores were busy.

How many cores were on the machine you tried to reproduce on, Lucas?

Nigel Kersten (nigelk) wrote :

PS. I'm open to the possibility libc is at fault. The patch Joel linked to earlier:

http://svn.ruby-lang.org/cgi-bin/viewvc.cgi/branches/ruby_1_8_7/eval.c?r1=23997&r2=24104&diff_format=l

worries me a little, specifically line 12319.

Lucas Nussbaum (lucas) wrote :

On 17/02/10 at 17:45 -0000, Nigel Kersten wrote:
> Sorry, I should have made this clearer.
>
> I work with Joel and Andrew, and am responsible for the Puppet
> infrastructure here.
>
> I spent a while debugging what was causing Facter and/or Puppet to hang,
> and it all came down to the calls being wrapped in Timeout.
>
> I did reproduce this on Debian Testing under VMware, but it wasn't
> *quite* as reproducible as it appears to be under Lucid for me, it would
> fail one in every few runs rather than absolutely every run when other
> cores were busy.
>
> How many cores were on the machine you tried to reproduce on Lucas?

First 2, then 8.

On the 8 cores one, I tried, using ruby1.8 from Debian unstable:
- with libc from Debian unstable (2.10)
- with libc from Debian experimental (2.11)

I could not reproduce the problem in any case, unfortunately.
(I'm running the script in a bash while loop.)

I don't have any Ubuntu images I could deploy on this machine, and I
also have a lot of other work to do tonight (CET, i.e. now), so I won't
be able to spend a lot more time on this before tomorrow.

Are you able to reproduce it in a stock Ubuntu environment, or is it a
Google-specific one?

Note that the Timeout module is pure ruby, so it might be easy to reduce
your test case a bit more.

Thanks a lot for investigating this issue!

Nigel Kersten (nigelk) wrote :

I did verify the issue exists on the latest stock Lucid, but didn't dig this deeply at the time I did that. It will be interesting to see whether I get the same behavior pattern as Lucid + Google stuff or Debian testing.

We'll get some more data in soon.

Nigel Kersten (nigelk) wrote :

I had a quick look at Timeout. The problem is there in:

      x = Thread.current
      y = Thread.start {
        sleep sec
        x.raise exception, "execution expired" if x.alive?
      }

and x.status returns a sleep state even after the test execs complete, which seems pretty fundamentally broken... I'll have a look and see if I can find a dupe.

confirmed on stock Lucid and Debian testing.

Lucas Nussbaum (lucas) wrote :

On 18/02/10 at 06:51 -0000, Nigel Kersten wrote:
> I had a quick look at Timeout. The problem is there in:
>
> x = Thread.current
> y = Thread.start {
> sleep sec
> x.raise exception, "execution expired" if x.alive?
> }
>
> and Thread x.status returns a sleep state even after the test execs
> complete, which seems pretty fundamentally broken... I'll have a look
> and see if I can find a dupe
>
> confirmed on stock Lucid and Debian testing.

No, that's not a bug. How it works is that the code is executed (yield
sec), and if it finishes before the timeout, the thread is killed
(y.kill if y and y.alive?), so the exception is never raised.

So, on paper, it looks correct. However, I'm not sure of the
atomicity guarantees provided by the interpreter here. Might be a
problem.
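
The mechanism described above can be sketched as follows. This is a simplified illustration built from the fragments quoted in this thread (the watcher thread, yield sec, and y.kill if y and y.alive?), not a copy of timeout.rb; the names sketch_timeout and SketchTimeoutError are hypothetical.

```ruby
# Hypothetical error class so the sketch does not depend on Timeout.
class SketchTimeoutError < StandardError; end

# Run the block; raise SketchTimeoutError in the calling thread if the
# block is still running after `sec` seconds.
def sketch_timeout(sec)
  x = Thread.current
  # Watcher thread: sleeps, then interrupts the caller if it is alive.
  y = Thread.start do
    sleep sec
    x.raise SketchTimeoutError, 'execution expired' if x.alive?
  end
  yield sec
ensure
  # If the block finished first, kill the watcher so it never fires.
  y.kill if y && y.alive?
end
```

Correctness depends on the watcher being killed before its sleep ends, which is where the interpreter's thread scheduling and atomicity guarantees come in.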

Nigel Kersten (nigelk) wrote :

Sorry, I wasn't clear again :)

I understand how the code works. I was pointing out that if something this trivial isn't working, there is something fundamentally broken with Thread.status, Thread.alive? etc.

Lucas Nussbaum (lucas) wrote :

Could you write a test case? I'm still not sure I understand what you are seeing.

Andrew Pollock (apollock) wrote :

I'm trying to pick up the ball for Nigel because he's rather busy with something else at the moment.

What Nigel was trying to say in comment #19 was that we're not seeing the correct behaviour, i.e. the code was not working the way it is supposed to. The thread is not getting killed when there is a timeout.

From IRC, just so there's a record here, it sounds like you're asking for a test case in pure Ruby that doesn't involve the Timeout module/class.
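
A starting point for such a test case might look like the sketch below. It is an assumption about what would be useful to observe, not the test case that was eventually used: the helper name probe_main_status is hypothetical, and it only records what Thread#status reports for the main thread while a command runs and after it returns, without using the Timeout module.

```ruby
# Run `command` with backticks in the calling thread while a second
# thread samples the caller's status; report what was seen.
def probe_main_status(command)
  main = Thread.current
  samples = []
  # Sampler thread: record the main thread's status every 10 ms.
  watcher = Thread.new do
    loop do
      samples << main.status
      sleep 0.01
    end
  end
  output = `#{command}`
  watcher.kill
  watcher.join
  # :during lists the distinct states observed while the command ran;
  # :after is the state once the command has returned.
  { :output => output, :during => samples.uniq, :after => main.status }
end

# Usage:
#   p probe_main_status('/usr/bin/touch /tmp/7777')
```

A running thread that asks for its own status should see "run", so an :after value of "sleep" once the command has returned would point at the interpreter rather than at Timeout.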

Lucas Nussbaum (lucas) wrote :

Some more data points. On the same machine, running Linux 2.6.32-3-amd64 (from Debian)
- I cannot reproduce the problem from comment #13 in a sid chroot
- I can reproduce the problem in a lucid chroot
- I can reproduce the problem in a sid chroot, using the libc packages from Debian experimental (2.11.0)

So it is clearly either a libc bug, or a bug in ruby triggered by a change in libc.

Lucas Nussbaum (lucas) wrote :

upstream bug updated with that info.

Andrew Pollock (apollock) wrote :

Reassigning to eglibc and doko as directed by slangasek

affects: ruby1.8 (Ubuntu Lucid) → eglibc (Ubuntu Lucid)
Changed in eglibc (Ubuntu Lucid):
assignee: nobody → Matthias Klose (doko)
Stephan Rügamer (sruegamer) wrote :

Guys,

Is there anything I can do to lend a hand?
I can support you regarding puppet, glibc, and hardware... (I know Google has more than I will ever have, but we have some nice blades which need something to do...)

Furthermore puppet + ubuntu == our way to work

So, andrew, lucas, doko... give me something to test...

regards,

\sh

Lucas Nussbaum (lucas) wrote :

Well, a good start would be to reproduce the issue, analyze how ruby1.8 is using pthreads, and see if you can either produce a test case without ruby (to be able to find a specific bug or change in eglibc), or understand what ruby is doing with pthreads and fix it.

Nigel Kersten (nigelk) wrote :

Are we even sure pthreads is central to the issue?

I've been a bit flat out which is why I disappeared in this bug history, but I'll see if I can get some more concrete info this week.

Lucas Nussbaum (lucas) wrote :

On 08/03/10 at 16:02 -0000, Nigel Kersten wrote:
> Are we even sure pthreads is central to the issue?

I am quite sure that the way ruby plays with pthread is central to this
issue, yes. But it is not clear whether ruby triggers a glibc bug, or
whether the glibc triggers a ruby bug.

Note that there have been more comments in the upstream bug.

Lucas Nussbaum (lucas) wrote :

After investigation it is not a glibc bug.

Changed in eglibc (Ubuntu Lucid):
status: Triaged → Invalid
Lucas Nussbaum (lucas) wrote :

Fixed package just uploaded to Debian, ruby1.8 1.8.7.249-2. Will ask for a sync as soon as it is in the Debian archive.

Colin Watson (cjwatson) wrote :

Synced:

ruby1.8 (1.8.7.249-2) unstable; urgency=low

  * Add 100312_timeout-fix.dpatch: Backport upstream change to fix
    problem with threads and timeouts. Closes: #539987

 -- Lucas Nussbaum <email address hidden> Fri, 12 Mar 2010 07:13:47 +0100

Changed in ruby-defaults (Ubuntu Lucid):
status: Triaged → Fix Released