Facter bug causes puppet to hang

Bug #986973 reported by Jared Biel on 2012-04-22
42
This bug affects 7 people
Affects Status Importance Assigned to Milestone
facter (Ubuntu)
Medium
Unassigned
Precise
Medium
Robie Basak
Quantal
Medium
Unassigned

Bug Description

[Impact]

"puppet apply" hangs about 20% of the time on affected servers. This
affects users using puppet without a puppetmaster. The normal use case
is to make "puppet apply" run regularly. This causes hung puppet
processes to build up, eventually exhausting system resources.

For those users using puppet without a puppetmaster, this makes puppet
unusable.

[Workaround]

Run puppet with the timeout command. For example:

     timeout -k 300s 290s puppet apply /etc/puppet/manifests/site.pp

This assumes that there won't be another hang on the next run. This is useful if you run "puppet apply" from a cron job, for example, and assumes that not all attempts hang. If you need puppet to actually apply when you run the command, then no workaround is available.

[Development Fix]

Fixed upstream by removing some unnecessary thread-exclusive locks that
were evidently causing a deadlock when the facter module is used from
puppet in the "puppet apply" case.

[Stable Fix]

Development fix cherry-picked from upstream.

[Test Case]

1. apt-get install -y puppet
2. cat > /etc/puppet/manifests/site.pp <<EOT
file { '/tmp/foo': content => 'bar'; }
EOT
3. successes=0; failures=0; for i in `seq 1 100`; do timeout -k 35s 30s puppet apply /etc/puppet/manifests/site.pp; rv=$?; echo "Exit status $rv"; if [ $rv -eq 0 ]; then successes=$(($successes+1)); else failures=$(($failures+1)); fi; done; echo "Successses: $successes; failures: $failures"

Expected result: Successes: 100; failures: 0
Actual result (approx): Successes: 80; failures: 20

Note that this is dependent on a race condition, so this is difficult to
reproduce. I have not managed to reproduce the fault reliably except on
particular servers where it happens to occur. I have a server where it
is happening, and the fix eliminates the problem, and am prepared to use
this for -proposed verification.

[Regression Potential]

The code that runs inside each lock section is trivial, clearly does not
need a lock, and upstream concurs:
http://projects.puppetlabs.com/issues/12723

[Original Description]

I found a bug with Facter that's affecting the version distributed with 12.04. Upstream bug report here: http://projects.puppetlabs.com/issues/12723 This makes puppet completely unusable for us unless this change is applied.

ProblemType: Bug
DistroRelease: Ubuntu 12.04
Package: facter 1.6.5-1ubuntu1
ProcVersionSignature: Ubuntu 3.2.0-23.36-virtual 3.2.14
Uname: Linux 3.2.0-23-virtual x86_64
ApportVersion: 2.0.1-0ubuntu5
Architecture: amd64
Date: Sun Apr 22 19:11:56 2012
Ec2AMI: ami-2616cd4f
Ec2AMIManifest: (unknown)
Ec2AvailabilityZone: us-east-1b
Ec2InstanceType: c1.medium
Ec2Kernel: aki-825ea7eb
Ec2Ramdisk: unavailable
PackageArchitecture: all
ProcEnviron:
 LC_CTYPE=en_US.UTF-8
 TERM=xterm
 LANG=en_US.UTF-8
 SHELL=/usr/bin/zsh
SourcePackage: facter
UpgradeStatus: No upgrade log present (probably fresh install)

Jared Biel (jared-biel) wrote :
Robie Basak (racb) wrote :
no longer affects: facter
Changed in facter (Ubuntu):
status: New → Triaged
importance: Undecided → Medium
Marc Cluet (lynxman) wrote :

This has been targeted in puppet labs for a 2.0.0 release, it would be counter-productive to merge before that, I'll try to get further feedback from puppet labs to try to take the best approach to this.

Marc Cluet (lynxman) on 2012-04-25
Changed in facter (Ubuntu):
assignee: nobody → Marc Cluet (lynxman)
James Page (james-page) wrote :

jamespage: ping, about bug #986973 after talking with upstream they reckon it's not worth fixing as of yet and recommend to wait to facter 2.0.

Marking 'Won't fix' for precise and raising task for quantal so we don't forget.

Changed in facter (Ubuntu Precise):
status: Triaged → Won't Fix
Changed in facter (Ubuntu Quantal):
status: New → Triaged
importance: Undecided → Medium
James Page (james-page) wrote :

Quantal won't be shipping facter 2.0.0 as this is intended to support puppet 3.x (which we also won't ship).

Marking 'Won't Fix' for quantal and raising task for r-series.

Changed in facter (Ubuntu Quantal):
status: Triaged → Won't Fix
no longer affects: facter (Ubuntu R-series)
Changed in facter (Ubuntu):
assignee: Marc Cluet (lynxman) → nobody
Chris Abbey (cabbey+launchpad) wrote :

Just a note that after this was WONTFIX'd, puppet labs moved the upstream bug back from 2.x to 1.7.0. Perhaps a reconsideration is in order?

Robie Basak (racb) wrote :

@Chris Abbey

Thanks for pointing this out. I can't seem to find any corresponding commit related to a fix in 1.7.0 though. I'm looking at https://github.com/puppetlabs/facter/commits/1.7.1/lib/facter/memory.rb for example - given that the original pull request https://github.com/puppetlabs/facter/commit/15b857472a75caf9cd596c2b33d285577382293a involved a change to this file.

I think I've seen this issue myself in Precise. I am running "puppet apply" from a cronjob, and it occasionally hangs, eventually consuming all available memory. I am running the whole thing via the timeout command as a workaround. I think it is reasonable to fix this in Precise if there is a simple, obviously correct fix, but I can't find anything to cherry-pick.

Robie Basak (racb) wrote :

Note that "zombie" has a specific meaning that doesn't apply here.

summary: - Facter bug causes puppet to become a zombie
+ Facter bug causes puppet to freeze
summary: - Facter bug causes puppet to freeze
+ Facter bug causes puppet to hang
Jared Biel (jared-biel) wrote :

@Robie Basak

The copy of memory.rb that you linked in your update does contain the fix (removal of "require 'thread'" etc.) I just checked out the tag 1.7.0 and found the commit (https://github.com/puppetlabs/facter/commit/9ff4453b2f0843df887ae1ec2098a4336291df0c). Note: that commit is only for memory.rb. Patching of processor.rb may also necessary.

It'd be great if this were backported to the Ubuntu versions because right now it's necessary for us to maintain our own version with patched memory.rb and processor.rb. The code is also present in quantal and raring (facter 1.6.9), although I cannot say if this bug also affects those distributions.

Jared Biel (jared-biel) wrote :

I just did some digging and found that processor.rb isn't affected in versions 1.6.5 (precise) or 1.6.9. So, commit 9ff4453b2f0843df887ae1ec2098a4336291df0c is the only necessary one to cherry-pick here.

Shane Meyers (shanemeyers) wrote :

It would be fantastic if we could get memory.rb back ported to Precise, as this is really irritating to need to maintain a separate copy of memory.rb. Without the patched version, we experience many hangs per day with 'puppet apply' running via cron. Surprising to see a hang bug remain for so long with an LTS release.

Robie Basak (racb) wrote :

@Shane

Could you please confirm the patch that you're using on memory.rb, and confirm that this fixes the problem for you?

I've also been seeing 'puppet apply' hanging via cron, but never figured out if this was the issue or if this patch fixed it. I've been working around it with a timeout.

If we can get some confidence in the fix, I'll put this forward as an SRU in Precise.

Changed in facter (Ubuntu Precise):
assignee: Marc Cluet (lynxman) → nobody
status: Won't Fix → New
Robie Basak (racb) wrote :

We'll also need this fixed in Saucy before we can commence an SRU. What is the fix needed in Saucy? Just memory.rb? The same patch? Or processor.rb as well?

Shane Meyers (shanemeyers) wrote :

Installing the patched version of memory.rb from https://github.com/puppetlabs/facter/commit/9ff4453b2f0843df887ae1ec2098a4336291df0c helped reduce the number of puppet hangs that we experience, but we were still getting hung processes. Attaching strace to a "hung" puppet process showed a constant output of the following lines repeated:

31024 futex(0x7f96be6ed5b0, FUTEX_WAKE_PRIVATE, 1) = 0
31024 futex(0x7f96be6ed584, FUTEX_WAIT_BITSET_PRIVATE|FUTEX_CLOCK_REALTIME, 231453, {1371588406, 113327589}, ffffffff) = -1 ETIMEDOUT (Connection timed out)
31024 clock_gettime(CLOCK_REALTIME, {1371588406, 113578908}) = 0
31024 futex(0x7f96be6ed5b0, FUTEX_WAKE_PRIVATE, 1) = 0
31024 futex(0x7f96be6ed584, FUTEX_WAIT_BITSET_PRIVATE|FUTEX_CLOCK_REALTIME, 231455, {1371588406, 123578908}, ffffffff) = -1 ETIMEDOUT (Connection timed out)
31024 clock_gettime(CLOCK_REALTIME, {1371588406, 123789011}) = 0

It turns out that puppet isn't actually hung, but is waiting forever for the futex() call to stop timing out.

Once I copied in the patched processor.rb from the same git changeset linked above, the hang problem stopped occurring for us. So from my perspective, both memory.pp and processor.rb need to be patched in the puppet 1.6.5 (precise) package.

Jared Biel (jared-biel) wrote :

@Shane

Yes - it appears you are correct and I was in error in my previous comment. The source file lib/facter/processor.rb does not make use of Thread::Exclusive (except for Windows), but lib/facter/util/processor.rb does, so it's necessary to patch that file also.

Robie Basak (racb) wrote :

I found a machine exhibiting this problem. On Precise (facter 1.6.5-1ubuntu1), I ran the following experiments:

No patch: "puppet apply" failed 20 out of 100 runs.
Patched both memory.rb and processor.rb: "puppet apply" failed 0 out of 100 runs.
Patched only memory.rb: "puppet apply" failed 23 out of 100 runs.
Pathced only processor.rb: "puppet apply" failed 0 out of 100 runs.

So for me, patching just processor.rb fixed the problem. However, this being a race, it might be that I just wasn't able to trigger a race involving memory.rb in my specific circumstances.

The patch I used was https://github.com/puppetlabs/facter/commit/9ff4453b2f0843df887ae1ec2098a4336291df0c.patch. I can see that at least on Linux, all the affected areas do is parse /proc/meminfo and /proc/cpuinfo in a way that I don't see why any thread exclusive locks would be needed at all. Upstream seemed to agree in https://github.com/puppetlabs/facter/pull/174 and had no explanation for why they were put in either. So I think it's safe to backport all of the patch to Precise. I'll do this next week.

Changed in facter (Ubuntu Precise):
assignee: nobody → Robie Basak (racb)
Robie Basak (racb) wrote :

(this appears fixed in Saucy, so just the Precise SRU needed)

description: updated
Changed in facter (Ubuntu Precise):
status: New → In Progress
Changed in facter (Ubuntu):
status: Triaged → Fix Released
Robie Basak (racb) on 2013-07-02
description: updated
Dave Walker (davewalker) wrote :

This is blocked on in-flight SRU bug 1170325.

Thanks

Dave Walker (davewalker) wrote :

As discussed with Robie, this upload is being folded with the current one for dual verification.

Changed in facter (Ubuntu Precise):
status: In Progress → Fix Committed
tags: added: verification-needed

Hello Jared, or anyone else affected,

Accepted facter into precise-proposed. The package will build now and be available at http://launchpad.net/ubuntu/+source/facter/1.6.5-1ubuntu1.2 in a few hours, and then in the -proposed repository.

Please help us by testing this new package. See https://wiki.ubuntu.com/Testing/EnableProposed for documentation how to enable and use -proposed. Your feedback will aid us getting this update out to other Ubuntu users.

If this package fixes the bug for you, please add a comment to this bug, mentioning the version of the package you tested, and change the tag from verification-needed to verification-done. If it does not fix the bug for you, please add a comment stating that, and change the tag to verification-failed. In either case, details of your testing will help us make a better decision.

Further information regarding the verification process can be found at https://wiki.ubuntu.com/QATeam/PerformingSRUVerification . Thank you in advance!

Robie Basak (racb) wrote :

In the ten days before applying the -proposed update (1.6.5-1ubuntu1.2), one of my affected servers had nine failures (I haven't been testing any others). In the ten days since applying the -proposed update, the same server has had no failures at all. I haven't made any other changes in this time.

I still need to verify bug 1170325 by hand since it seems that nobody else will, and then this SRU can land.

tags: added: verification-done
removed: verification-needed
Tuomas Starck (humppa) wrote :

tl;dr: I second the verification

Dear everyone,

I have used puppet to maintain a collection of some dozens of Precise servers. This bug, as described, has manifested regularly.

I've enabled the precise-proposed repository on a few test servers and updated facter to 1.6.5-1ubuntu1.2. Based on some hundreds of test runs, the issue is completely resolved.

Launchpad Janitor (janitor) wrote :

This bug was fixed in the package facter - 1.6.5-1ubuntu1.2

---------------
facter (1.6.5-1ubuntu1.2) precise; urgency=low

  * d/p/thread-critical-freeze: fix "puppet apply" hang by avoiding
    unnecessary thread critical sections (LP: #986973).
 -- Robie Basak <email address hidden> Tue, 02 Jul 2013 03:21:36 +0000

Changed in facter (Ubuntu Precise):
status: Fix Committed → Fix Released

The verification of this Stable Release Update has completed successfully and the package has now been released to -updates. Subsequently, the Ubuntu Stable Release Updates Team is being unsubscribed and will not receive messages about this bug report. In the event that you encounter a regression using the package from -updates please report a new bug using ubuntu-bug and tag the bug report regression-update so we can easily find any regresssions.

To post a comment you must log in.
This report contains Public information  Edit
Everyone can see this information.

Other bug subscribers