Puppet agent using 100% CPU, in sched_yield() loop. Looks like an issue with ruby2.3 which has been fixed but not yet made it into Ubuntu.

Bug #1834072 reported by DAN FINN on 2019-06-24
28
This bug affects 2 people
Affects Status Importance Assigned to Milestone
ruby2.3 (Debian)
Fix Released
Unknown
ruby2.3 (Ubuntu)
Xenial
Undecided
Unassigned

Bug Description

[Impact]

Ruby processes can sometimes get stuck in a loop consuming 100% CPU, as described upstream and in the debian bug report. It has most commonly been seen in the puppet agent.

[Test Case]
It's not easy to reproduce. It has been suggested that this script eventually reproduces the problem:

while nice -n19 /opt/puppetlabs/puppet/bin/ruby sched_yield_loop.rb; do :; done

Where sched_yield_loop.rb comes from "https://bugs.debian.org/cgi-bin/bugreport.cgi?att=1;bug=876377;filename=sched_yield_loop.rb;msg=22"

I personally haven't seen it happen with the script, but maybe it could take days.

[Regression Potential]
Races with threads can be hard to reproduce, and so can regressions.
Patch has been applied upstream and to debian for more than a year now.

[Other Info]
Bionic has ruby 2.5.1 and it has this fix already, as do all later ubuntu releaes.

[Original Description]

Ubuntu 16.04
ruby 2.3.1-2~16.04.12
kernel 4.4.0-148-generic

We've noticed an issue across multiple servers where puppet agent will seem to get stuck and consume 100% CPU for days or weeks on end until manually killed.

root@ps-prod-jenkins-qa-ui02:~# ps auxwwww|grep -i puppe[t]
root 1412 0.0 0.2 143716 38680 ? Ssl Jun11 0:39 /usr/bin/ruby /usr/bin/puppet agent
root 34884 74.4 0.3 286848 53724 ? Rs Jun23 1141:44 puppet agent: applying configuration
root 111481 94.1 0.3 288572 54996 ? Rs Jun18 8642:32 puppet agent: applying configuration
root 128479 54.8 0.3 286744 53596 ? Rs 10:30 250:17 puppet agent: applying configuration

Strace shows it in a sched_yield() loop:

root@ps-prod-jenkins-qa-ui02:~# strace -p 34884 -c
strace: Process 34884 attached
^Cstrace: Process 34884 detached
% time seconds usecs/call calls errors syscall
------ ----------- ----------- --------- --------- ----------------
100.00 0.002130 0 123189 sched_yield
------ ----------- ----------- --------- --------- ----------------
100.00 0.002130 123189 total

Some googling shows this is a common issue which was supposedly fixed/backported to ruby 2.3:

https://tickets.puppetlabs.com/browse/PA-1743
https://bugs.ruby-lang.org/issues/13794
https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=876377

The following open Ubuntu bugs look to be having the same issue and suggest that this fix made it into Debian but never made it into Ubuntu:

https://bugs.launchpad.net/ubuntu/+source/pcs/+bug/1783161
https://bugs.launchpad.net/ubuntu/+source/pcs/+bug/1772998

Related branches

Andreas Hasenack (ahasenack) wrote :

https://salsa.debian.org/ruby-team/ruby/commit/50d860d0bd7834e95214a2b1ff5b8e0ede7910a1 seems to be the fix

If I build packages in a ppa, can you test for us, prior to the upload to proposed?

tags: added: server-next
Changed in ruby2.3 (Ubuntu):
status: New → Triaged
importance: Undecided → High
DAN FINN (dfinn) wrote :

Yes, I could test it on a dev box

Changed in ruby2.3 (Debian):
status: Unknown → Fix Released
Andreas Hasenack (ahasenack) wrote :

I built xenial packages in this ppa with the above patch, if you could give them a try:

https://launchpad.net/~ahasenack/+archive/ubuntu/ruby-cpu-busy-loop-1834072/

DAN FINN (dfinn) wrote :

Yes, I will work on that. The only tricky part is that I don't know of a way to manually reproduce this. It seems to happen randomly to us and sometimes infrequently.

DAN FINN (dfinn) wrote :

I could try the ruby script mentioned in message #22 of this thread:

https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=876377

Andreas Hasenack (ahasenack) wrote :

Thanks!

The arm builders are slow, so apt won't see that ppa as populated just yet, but you can download the individual packages. Here are the amd64 ones, for example: https://launchpad.net/~ahasenack/+archive/ubuntu/ruby-cpu-busy-loop-1834072/+build/17190313

DAN FINN (dfinn) wrote :

OK, I'm playing with that ruby script mentioned above now. I've run it several times across several servers and it's only crashed one, leaving a hung process using 100% CPU. I'm trying to get a feel for how often it does this before I patch a test system.

DAN FINN (dfinn) wrote :

Do you want me to install all the .deb packages at that URL?

DAN FINN (dfinn) wrote :

We have the following packages installed that match the ones you updated so I will just be installing your updates to these:

[dfinn@ps-backup-db05 PS SERVER 0 ~]$ dpkg -l|grep -i ruby2
ii libruby2.3:amd64 2.3.1-2~16.04.12 amd64 Libraries necessary to run Ruby 2.3
ii ruby2.3 2.3.1-2~16.04.12 amd64 Interpreter of object-oriented scripting language Ruby
ii ruby2.3-dev:amd64 2.3.1-2~16.04.12 amd64 Header files for compiling extension modules for the Ruby 2.3

I was able to get it to happen twice in somewhere around 100 tries so this isn't a great test but I'm going to update one system and continually run it on there.

Andreas Hasenack (ahasenack) wrote :

The packages from the ppa finally built overnight, so you can also just add the ppa. Instructions are at https://launchpad.net/~ahasenack/+archive/ubuntu/ruby-cpu-busy-loop-1834072/, but it boils down to this command:

sudo add-apt-repository ppa:ahasenack/ruby-cpu-busy-loop-1834072

Then:

sudo apt-get update

and the new ruby packages should be visible on your xenial system and ready to be upgraded via usual apt commands.

Andreas Hasenack (ahasenack) wrote :

Thanks for your testing!

DAN FINN (dfinn) wrote :

Yeah, my pleasure. I updated a server with your packages yesterday and have been testing since, including having a loop run overnight. I've yet to see it get stuck in the sched_yield() loop. So far everything looks much better.

Fran Garcia (hgfran) wrote :

Hi,

Adding another datapoint here. We had the same issue. Been testing the patched packages since last night (on a loop as well), with no issues so far

DAN FINN (dfinn) wrote :

Andreas, can you provide any insight on when we might be able to expect this to make it into the repos? I'm not sure how long that process takes.

Andreas Hasenack (ahasenack) wrote :

I was just waiting for some feedback, as this is hard to reproduce. I tried the script for a few minutes but nothing out of the ordinary happened.

I see two +1s now, so I'll proceed with the SRU.

Changed in ruby2.3 (Ubuntu):
status: Triaged → In Progress
assignee: nobody → Andreas Hasenack (ahasenack)
description: updated
description: updated
Andreas Hasenack (ahasenack) wrote :

Uploaded to the unapproved queue, next step is from the SRU team.

DAN FINN (dfinn) wrote :

Hi Andreas, any update on this?

Paride Legovini (legovini) wrote :

Hi DAN, the package is still waiting for approval in the "unapproved" queue; you can monitor it from this page:

  https://launchpad.net/ubuntu/xenial/+queue?queue_state=1

For more information on the Stable Release Update (SRU) process see this wiki page:

  https://wiki.ubuntu.com/StableReleaseUpdates

Andreas Hasenack (ahasenack) wrote :

In other words, we are waiting for the SRU team to go through the queue.

description: updated

Hello DAN, or anyone else affected,

Accepted ruby2.3 into xenial-proposed. The package will build now and be available at https://launchpad.net/ubuntu/+source/ruby2.3/2.3.1-2~ubuntu16.04.13 in a few hours, and then in the -proposed repository.

Please help us by testing this new package. See https://wiki.ubuntu.com/Testing/EnableProposed for documentation on how to enable and use -proposed.Your feedback will aid us getting this update out to other Ubuntu users.

If this package fixes the bug for you, please add a comment to this bug, mentioning the version of the package you tested and change the tag from verification-needed-xenial to verification-done-xenial. If it does not fix the bug for you, please add a comment stating that, and change the tag to verification-failed-xenial. In either case, details of your testing will help us make a better decision.

Further information regarding the verification process can be found at https://wiki.ubuntu.com/QATeam/PerformingSRUVerification . Thank you in advance!

Changed in ruby2.3 (Ubuntu Xenial):
status: New → Fix Committed
tags: added: verification-needed verification-needed-xenial
Changed in ruby2.3 (Ubuntu):
status: In Progress → Invalid
DAN FINN (dfinn) wrote :

I ran some quick tests last night and everything looked good.

Fran Garcia (hgfran) wrote :

Same here, new package looks good

no longer affects: ruby2.3 (Ubuntu)
Andreas Hasenack (ahasenack) wrote :

Thanks for all the testing, but please note the verification steps laid out in comment #21. In particular, it would help the SRU team immensely if you could confirm in writing which package you tested (name - version), what the result was, and then flip the verification tag as detailed in that comment. Even though it might sound redundant, it keeps a record of what was done and verified.

Thanks!

DAN FINN (dfinn) wrote :

I tested ruby2.3, ruby2.3-dev and libruby2.3. I'm not sure how to flip the tag, still trying to figure that out.

tags: added: verification-done-xenial
removed: server-next verification-needed verification-needed-xenial
tags: added: server-next
To post a comment you must log in.
This report contains Public information  Edit
Everyone can see this information.

Duplicates of this bug

Other bug subscribers

Remote bug watches

Bug watches keep track of this bug in other bug trackers.