Puppet agent using 100% CPU, in sched_yield() loop. Looks like an issue with ruby2.3 which has been fixed but not yet made it into Ubuntu.

Bug #1834072 reported by DAN FINN on 2019-06-24
26
This bug affects 2 people
Affects Status Importance Assigned to Milestone
ruby2.3 (Debian)
Fix Released
Unknown
ruby2.3 (Ubuntu)
Xenial
Undecided
Unassigned

Bug Description

[Impact]

Ruby processes can sometimes get stuck in a loop consuming 100% CPU, as described upstream and in the debian bug report. It has most commonly been seen in the puppet agent.

[Test Case]
It's not easy to reproduce. It has been suggested that this script eventually reproduces the problem:

while nice -n19 /opt/puppetlabs/puppet/bin/ruby sched_yield_loop.rb; do :; done

Where sched_yield_loop.rb comes from "https://bugs.debian.org/cgi-bin/bugreport.cgi?att=1;bug=876377;filename=sched_yield_loop.rb;msg=22"

I personally haven't seen it happen with the script, but maybe it could take days.

[Regression Potential]
Races with threads can be hard to reproduce, and so can regressions.
Patch has been applied upstream and to debian for more than a year now.

[Other Info]
Bionic has ruby 2.5.1 and it has this fix already, as do all later ubuntu releaes.

[Original Description]

Ubuntu 16.04
ruby 2.3.1-2~16.04.12
kernel 4.4.0-148-generic

We've noticed an issue across multiple servers where puppet agent will seem to get stuck and consume 100% CPU for days or weeks on end until manually killed.

root@ps-prod-jenkins-qa-ui02:~# ps auxwwww|grep -i puppe[t]
root 1412 0.0 0.2 143716 38680 ? Ssl Jun11 0:39 /usr/bin/ruby /usr/bin/puppet agent
root 34884 74.4 0.3 286848 53724 ? Rs Jun23 1141:44 puppet agent: applying configuration
root 111481 94.1 0.3 288572 54996 ? Rs Jun18 8642:32 puppet agent: applying configuration
root 128479 54.8 0.3 286744 53596 ? Rs 10:30 250:17 puppet agent: applying configuration

Strace shows it in a sched_yield() loop:

root@ps-prod-jenkins-qa-ui02:~# strace -p 34884 -c
strace: Process 34884 attached
^Cstrace: Process 34884 detached
% time seconds usecs/call calls errors syscall
------ ----------- ----------- --------- --------- ----------------
100.00 0.002130 0 123189 sched_yield
------ ----------- ----------- --------- --------- ----------------
100.00 0.002130 123189 total

Some googling shows this is a common issue which was supposedly fixed/backported to ruby 2.3:

https://tickets.puppetlabs.com/browse/PA-1743
https://bugs.ruby-lang.org/issues/13794
https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=876377

The following open Ubuntu bugs look to be having the same issue and suggest that this fix made it into Debian but never made it into Ubuntu:

https://bugs.launchpad.net/ubuntu/+source/pcs/+bug/1783161
https://bugs.launchpad.net/ubuntu/+source/pcs/+bug/1772998

Related branches

Andreas Hasenack (ahasenack) wrote :

https://salsa.debian.org/ruby-team/ruby/commit/50d860d0bd7834e95214a2b1ff5b8e0ede7910a1 seems to be the fix

If I build packages in a ppa, can you test for us, prior to the upload to proposed?

tags: added: server-next
Changed in ruby2.3 (Ubuntu):
status: New → Triaged
importance: Undecided → High
DAN FINN (dfinn) wrote :

Yes, I could test it on a dev box

Changed in ruby2.3 (Debian):
status: Unknown → Fix Released
Andreas Hasenack (ahasenack) wrote :

I built xenial packages in this ppa with the above patch, if you could give them a try:

https://launchpad.net/~ahasenack/+archive/ubuntu/ruby-cpu-busy-loop-1834072/

DAN FINN (dfinn) wrote :

Yes, I will work on that. The only tricky part is that I don't know of a way to manually reproduce this. It seems to happen randomly to us and sometimes infrequently.

DAN FINN (dfinn) wrote :

I could try the ruby script mentioned in message #22 of this thread:

https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=876377

Andreas Hasenack (ahasenack) wrote :

Thanks!

The arm builders are slow, so apt won't see that ppa as populated just yet, but you can download the individual packages. Here are the amd64 ones, for example: https://launchpad.net/~ahasenack/+archive/ubuntu/ruby-cpu-busy-loop-1834072/+build/17190313

DAN FINN (dfinn) wrote :

OK, I'm playing with that ruby script mentioned above now. I've run it several times across several servers and it's only crashed one, leaving a hung process using 100% CPU. I'm trying to get a feel for how often it does this before I patch a test system.

DAN FINN (dfinn) wrote :

Do you want me to install all the .deb packages at that URL?

DAN FINN (dfinn) wrote :

We have the following packages installed that match the ones you updated so I will just be installing your updates to these:

[dfinn@ps-backup-db05 PS SERVER 0 ~]$ dpkg -l|grep -i ruby2
ii libruby2.3:amd64 2.3.1-2~16.04.12 amd64 Libraries necessary to run Ruby 2.3
ii ruby2.3 2.3.1-2~16.04.12 amd64 Interpreter of object-oriented scripting language Ruby
ii ruby2.3-dev:amd64 2.3.1-2~16.04.12 amd64 Header files for compiling extension modules for the Ruby 2.3

I was able to get it to happen twice in somewhere around 100 tries so this isn't a great test but I'm going to update one system and continually run it on there.

Andreas Hasenack (ahasenack) wrote :

The packages from the ppa finally built overnight, so you can also just add the ppa. Instructions are at https://launchpad.net/~ahasenack/+archive/ubuntu/ruby-cpu-busy-loop-1834072/, but it boils down to this command:

sudo add-apt-repository ppa:ahasenack/ruby-cpu-busy-loop-1834072

Then:

sudo apt-get update

and the new ruby packages should be visible on your xenial system and ready to be upgraded via usual apt commands.

Andreas Hasenack (ahasenack) wrote :

Thanks for your testing!

DAN FINN (dfinn) wrote :

Yeah, my pleasure. I updated a server with your packages yesterday and have been testing since, including having a loop run overnight. I've yet to see it get stuck in the sched_yield() loop. So far everything looks much better.

Fran Garcia (hgfran) wrote :

Hi,

Adding another datapoint here. We had the same issue. Been testing the patched packages since last night (on a loop as well), with no issues so far

DAN FINN (dfinn) wrote :

Andreas, can you provide any insight on when we might be able to expect this to make it into the repos? I'm not sure how long that process takes.

Andreas Hasenack (ahasenack) wrote :

I was just waiting for some feedback, as this is hard to reproduce. I tried the script for a few minutes but nothing out of the ordinary happened.

I see two +1s now, so I'll proceed with the SRU.

Changed in ruby2.3 (Ubuntu):
status: Triaged → In Progress
assignee: nobody → Andreas Hasenack (ahasenack)
description: updated
description: updated
Andreas Hasenack (ahasenack) wrote :

Uploaded to the unapproved queue, next step is from the SRU team.

DAN FINN (dfinn) wrote :

Hi Andreas, any update on this?

Paride Legovini (legovini) wrote :

Hi DAN, the package is still waiting for approval in the "unapproved" queue; you can monitor it from this page:

  https://launchpad.net/ubuntu/xenial/+queue?queue_state=1

For more information on the Stable Release Update (SRU) process see this wiki page:

  https://wiki.ubuntu.com/StableReleaseUpdates

Andreas Hasenack (ahasenack) wrote :

In other words, we are waiting for the SRU team to go through the queue.

description: updated

Hello DAN, or anyone else affected,

Accepted ruby2.3 into xenial-proposed. The package will build now and be available at https://launchpad.net/ubuntu/+source/ruby2.3/2.3.1-2~ubuntu16.04.13 in a few hours, and then in the -proposed repository.

Please help us by testing this new package. See https://wiki.ubuntu.com/Testing/EnableProposed for documentation on how to enable and use -proposed.Your feedback will aid us getting this update out to other Ubuntu users.

If this package fixes the bug for you, please add a comment to this bug, mentioning the version of the package you tested and change the tag from verification-needed-xenial to verification-done-xenial. If it does not fix the bug for you, please add a comment stating that, and change the tag to verification-failed-xenial. In either case, details of your testing will help us make a better decision.

Further information regarding the verification process can be found at https://wiki.ubuntu.com/QATeam/PerformingSRUVerification . Thank you in advance!

Changed in ruby2.3 (Ubuntu Xenial):
status: New → Fix Committed
tags: added: verification-needed verification-needed-xenial
Changed in ruby2.3 (Ubuntu):
status: In Progress → Invalid
DAN FINN (dfinn) wrote :

I ran some quick tests last night and everything looked good.

Fran Garcia (hgfran) wrote :

Same here, new package looks good

no longer affects: ruby2.3 (Ubuntu)
Andreas Hasenack (ahasenack) wrote :

Thanks for all the testing, but please note the verification steps laid out in comment #21. In particular, it would help the SRU team immensely if you could confirm in writing which package you tested (name - version), what the result was, and then flip the verification tag as detailed in that comment. Even though it might sound redundant, it keeps a record of what was done and verified.

Thanks!

DAN FINN (dfinn) wrote :

I tested ruby2.3, ruby2.3-dev and libruby2.3. I'm not sure how to flip the tag, still trying to figure that out.

tags: added: verification-done-xenial
removed: server-next verification-needed verification-needed-xenial
tags: added: server-next

All autopkgtests for the newly accepted ruby2.3 (2.3.1-2~ubuntu16.04.13) for xenial have finished running.
The following regressions have been reported in tests triggered by the package:

ruby2.3/2.3.1-2~ubuntu16.04.13 (s390x)

Please visit the excuses page listed below and investigate the failures, proceeding afterwards as per the StableReleaseUpdates policy regarding autopkgtest regressions [1].

https://people.canonical.com/~ubuntu-archive/proposed-migration/xenial/update_excuses.html#ruby2.3

[1] https://wiki.ubuntu.com/StableReleaseUpdates#Autopkgtest_Regressions

Thank you!

Andreas Hasenack (ahasenack) wrote :

The s390x failure is not new, according to http://autopkgtest.ubuntu.com/packages/r/ruby2.3/xenial/s390x

Last time it passed was in 2017-05-19, that's over 2 years ago.

The current failures are these two:

a)
FAIL test/ruby/test_time_tz.rb
------------------------------

Run options:

# Running tests:

..................................................................................................F...........

Finished tests in 0.027207s, 4043.1301 tests/s, 37895.1553 assertions/s.

  1) Failure:
TestTimeTZ#test_gen_lisbon_99 [/tmp/autopkgtest.lxVrvG/autopkgtest_tmp/test/ruby/test_time_tz.rb:435]:
Expected [[false, false, false, false, false], [false, false, false, false, false]] to include [true, true, true, true, true].

b)
FAIL test/io/console/test_io_console.rb
---------------------------------------

Run options:

# Running tests:

...F................

Finished tests in 2.432259s, 8.2228 tests/s, 54.2705 assertions/s.

  1) Failure:
TestIO_Console#test_getpass [/tmp/autopkgtest.lxVrvG/autopkgtest_tmp/test/io/console/test_io_console.rb:190]:
<"\r\n"> expected but was
<"asdf\r\n">.

The package from 2017 that last passed tests, doesn't pass them now. I'll keep digging in the meantime, but I don't think these failures were introduced by my changes in this SRU.

Andreas Hasenack (ahasenack) wrote :

The same two tests fail in the other architectures, except amd64 which has one extra:

FAIL test/rdoc/test_rdoc_markup_pre_process.rb
----------------------------------------------

Run options:

# Running tests:

F..........................................

Finished tests in 0.320807s, 134.0370 tests/s, 240.0198 assertions/s.

  1) Failure:
TestRDocMarkupPreProcess#test_class_post_process [/tmp/autopkgtest.sI4gSn/autopkgtest_tmp/test/rdoc/test_rdoc_markup_pre_process.rb:33]:
Expected: 1
  Actual: 2

This one passed on s390x and everywhere else.

Andreas Hasenack (ahasenack) wrote :

The timezone test is expecting these to be equal:
 ["1911-12-31 23:59:59 -0036", -2192, "-0036", "-00:36", "-00:36:32"],
 ["1912-01-01 00:36:31 +0000", 0, "+0000", "+00:00", "+00:00:00"],

If I downgrade tzdata to the version that was released with xenial, then this test passes.

$ sudo apt install tzdata=2016d-0ubuntu0.16.04
sudo: unable to resolve host xenial-ruby23
Reading package lists... Done
Building dependency tree
Reading state information... Done
The following packages will be DOWNGRADED:
  tzdata
(...)

$ ruby2.3 test/runner.rb test/ruby/test_time_tz.rb
Run options:

# Running tests:

Finished tests in 0.056691s, 1940.3463 tests/s, 18186.3371 assertions/s.
110 tests, 1031 assertions, 0 failures, 0 errors, 0 skips

ruby -v: ruby 2.3.1p112 (2016-04-26) [s390x-linux-gnu]

So I would say that a timezone rule changed since the xenial release, and the test wasn't updated.

Andreas Hasenack (ahasenack) wrote :

If I patch test_io_console.rb like below, and that is what is in ruby2.5, then the test passes:

ubuntu@xenial-ruby23:~/deb/ruby-my-sru/ruby2.3-2.3.1$ diff -u test/io/console/test_io_console.rb{.orig,}
--- test/io/console/test_io_console.rb.orig 2019-08-30 21:47:34.815612212 +0000
+++ test/io/console/test_io_console.rb 2019-08-30 21:47:48.495619431 +0000
@@ -185,8 +185,9 @@
     skip unless IO.method_defined?("getpass")
     run_pty("p IO.console.getpass('> ')") do |r, w|
       assert_equal("> ", r.readpartial(10))
+ sleep 0.1
       w.print "asdf\n"
- sleep 1
+ sleep 0.1
       assert_equal("\r\n", r.gets)
       assert_equal("\"asdf\"", r.gets.chomp)
     end

ubuntu@xenial-ruby23:~/deb/ruby-my-sru/ruby2.3-2.3.1$ ruby2.3 test/runner.rb test/io/console/test_io_console.rb
Run options:

# Running tests:

Finished tests in 1.578071s, 12.6737 tests/s, 84.9138 assertions/s.
20 tests, 134 assertions, 0 failures, 0 errors, 0 skips

ruby -v: ruby 2.3.1p112 (2016-04-26) [s390x-linux-gnu]

So it's some kind of race.

Should I include these fixes in this SRU? Namely:
a) skip the timezone test
b) patch the console test like above

That still leaves the odd test/rdoc/test_rdoc_markup_pre_process.rb failure on amd64, but amd64 is "always failed". I could look at it next, see if I can reproduce it, or if it's a fluke.

Łukasz Zemczak (sil2100) wrote :

Andreas, thanks for the detailed investigation! Could you fill in a bug with the information you have gathered so far? I'll hint the failure for now and release, but it would be nice if we could possibly get this resolved with some future upload.

Launchpad Janitor (janitor) wrote :

This bug was fixed in the package ruby2.3 - 2.3.1-2~ubuntu16.04.13

---------------
ruby2.3 (2.3.1-2~ubuntu16.04.13) xenial; urgency=medium

  * d/p/do-not-wakeup-inside-child-processes.patch: avoid child ruby processes
    being stuck in a busy loop (LP: #1834072)

 -- Andreas Hasenack <email address hidden> Tue, 25 Jun 2019 11:52:54 -0300

Changed in ruby2.3 (Ubuntu Xenial):
status: Fix Committed → Fix Released

The verification of the Stable Release Update for ruby2.3 has completed successfully and the package has now been released to -updates. Subsequently, the Ubuntu Stable Release Updates Team is being unsubscribed and will not receive messages about this bug report. In the event that you encounter a regression using the package from -updates please report a new bug using ubuntu-bug and tag the bug report regression-update so we can easily find any regressions.

To post a comment you must log in.
This report contains Public information  Edit
Everyone can see this information.

Duplicates of this bug

Other bug subscribers

Remote bug watches

Bug watches keep track of this bug in other bug trackers.