Ubuntu

rsyslog hangs loading modules

Reported by Jon Tai on 2013-04-16
22
This bug affects 2 people
Affects Status Importance Assigned to Milestone
rsyslog (Ubuntu)
Undecided
Unassigned
Precise
Undecided
Unassigned
Quantal
Undecided
Unassigned
Raring
Undecided
Unassigned

Bug Description

[Impact]
We have rsyslog configured to listen for forwarded logs on TCP and UDP, and to forward logs on to a central log host. On a number of occasions we have observed rsyslog in a hung state on our production machines -- not fowarding logs and not writing them to disk locally. It doesn't happen all the time, so I believe there's some sort of race condition involved.

I've attached a stack trace taken when the process was hung. I believe the issue is the same as the one discussed here: http://www.gossamer-threads.com/lists/rsyslog/users/5873 There is a patch attached to that thread.

[Test Case]
#!/bin/sh

while [ 1 ]; do
  service rsyslog stop
  sleep 1
  killall -9 rsyslogd > /dev/null 2>&1
  service rsyslog start
  sleep 1
  nonce=$(date '+%s')
  logger $nonce
  if grep $nonce /var/log/hourly/* > /dev/null; then
    echo found nonce $nonce
  else
    echo FAIL
    exit 1
  fi
done

[Regression Potential]
This is a fix for a deadlock issue it is pretty difficult to determine if fixing a race condition would impact anything else, however, testing from the community hasn't raised any concerns.

[Additional Info]

Ubuntu 12.04.1 LTS x86_64
rsyslog 5.8.6-1ubuntu8

Here is the (obscured) section of our rsyslog config that I believe is causing the issue.

$ActionQueueType LinkedList
$ActionQueueFileName central_syslog
$ActionQueueMaxDiskSpace 256M
$ActionResumeRetryCount -1
$ActionQueueSaveOnShutdown on
$ActionSendTCPRebindInterval 100000
*.* @@central_syslog.domain:514

$ModLoad imudp
$UDPServerRun 514

$ModLoad imtcp
$InputTCPServerRun 514

Jon Tai (jontai) wrote :
Launchpad Janitor (janitor) wrote :

Status changed to 'Confirmed' because the bug affects multiple users.

Changed in rsyslog (Ubuntu):
status: New → Confirmed
Pierre Carrier (pcarrier) wrote :
tags: added: precise
tags: added: patch

The attachment "rsyslog-startup-deadlock.patch" seems to be a patch. If it isn't, please remove the "patch" flag from the attachment, remove the "patch" tag, and if you are a member of the ~ubuntu-reviewers, unsubscribe the team.

[This is an automated message performed by a Launchpad user owned by ~brian-murray, for any issues please contact him.]

Adam Stokes (adam-stokes) wrote :

Thanks Pierre,

I'll see about getting this as an SRU and sent out for testing.

Adam

Adam Stokes (adam-stokes) wrote :

Jon,

Ive got a test package building for Precise, when it becomes available would you please test?

https://launchpad.net/~adam-stokes/+archive/rsyslog-1169740

Thank you
Adam

Jon Tai (jontai) wrote :

Thanks Adam! I will roll the test package out to a subset of our machines today.

Jon Tai (jontai) wrote :

Adam,

I installed the test package on a number of machines that exhibited the deadlock before. They have been running OK for several hours. I also wrote up a quick and dirty synthetic test that restarts rsyslog, logs a unique test value, checks that it appears in the local logs, and does it again unless the value did not appear. This has been running for several hours as well. I will check back in later to see if there are any failures.

- Jon

Adam Stokes (adam-stokes) wrote :

Hey Jon,

Excellent news! Would you object to attaching your test steps/script here so I can include them in the SRU?

Thanks
Adam

Adam Stokes (adam-stokes) wrote :
description: updated
Jon Tai (jontai) wrote :

Here is the script I used to reproduce the bug. I had 3 machines running the patched version of rsyslog (5.8.6-1ubuntu8.1+lp1169740) running it for over 10 hours, no deadlocks. I had 1 machine running the unpatched version of rsyslog (5.8.6-1ubuntu8) and it deadlocked in ~33 minutes.

I also had 5 production machines running the patched version for 10 hours, no deadlocks. On 5 other machines running the unpatched version, 2 deadlocked immediately when I restarted rsyslog, and 2 more deadlocked shortly afterward.

At this point I'm fairly confident that the patch works.

Jon Tai (jontai) wrote :

Here is the (obscured) section of our rsyslog config that I believe is causing the issue.

$ActionQueueType LinkedList
$ActionQueueFileName central_syslog
$ActionQueueMaxDiskSpace 256M
$ActionResumeRetryCount -1
$ActionQueueSaveOnShutdown on
$ActionSendTCPRebindInterval 100000
*.* @@central_syslog.domain:514

$ModLoad imudp
$UDPServerRun 514

$ModLoad imtcp
$InputTCPServerRun 514

Adam Stokes (adam-stokes) wrote :

Thanks for all your help Jon. Since raring is being released next week I can't say for sure when we should expect this into -proposed. I will bring this up in my next meeting (Wednesday) if I am unable to get at least the Precise SRU uploaded by then.

Thanks again
Adam

description: updated
Colin Watson (cjwatson) wrote :

Thanks, Adam - I've uploaded this, with the only change being to adjust the version number to match practices in development releases (i.e. 5.8.11-2ubuntu2). It's currently awaiting freeze review in the upload queue.

Launchpad Janitor (janitor) wrote :

This bug was fixed in the package rsyslog - 5.8.11-2ubuntu2

---------------
rsyslog (5.8.11-2ubuntu2) raring-proposed; urgency=low

  [ Pierre Carrier ]
  * debian/patches/04-fix_startup_deadlock.patch:
    - Fixes deadlock during startup (LP: #1169740)
 -- Adam Stokes <email address hidden> Wed, 17 Apr 2013 09:33:32 -0400

Changed in rsyslog (Ubuntu Raring):
status: Confirmed → Fix Released
Jon Tai (jontai) wrote :

Will this fix be backported to Precise?

Pierre Carrier (pcarrier) wrote :

As Adam mentioned, he's going through the process already. SRU stands for https://wiki.ubuntu.com/StableReleaseUpdates

Michael Terry (mterry) wrote :

I've uploaded your precise and quantal patches. Now we wait for SRU approval. I'll unsubscribe the sponsors team.

Hello Jon, or anyone else affected,

Accepted rsyslog into precise-proposed. The package will build now and be available at http://launchpad.net/ubuntu/+source/rsyslog/5.8.6-1ubuntu8.2 in a few hours, and then in the -proposed repository.

Please help us by testing this new package. See https://wiki.ubuntu.com/Testing/EnableProposed for documentation how to enable and use -proposed. Your feedback will aid us getting this update out to other Ubuntu users.

If this package fixes the bug for you, please add a comment to this bug, mentioning the version of the package you tested, and change the tag from verification-needed to verification-done. If it does not fix the bug for you, please add a comment stating that, and change the tag to verification-failed. In either case, details of your testing will help us make a better decision.

Further information regarding the verification process can be found at https://wiki.ubuntu.com/QATeam/PerformingSRUVerification . Thank you in advance!

Changed in rsyslog (Ubuntu Precise):
status: New → Fix Committed
tags: added: verification-needed
Changed in rsyslog (Ubuntu Quantal):
status: New → Fix Committed
Brian Murray (brian-murray) wrote :

Hello Jon, or anyone else affected,

Accepted rsyslog into quantal-proposed. The package will build now and be available at http://launchpad.net/ubuntu/+source/rsyslog/5.8.6-1ubuntu9.2 in a few hours, and then in the -proposed repository.

Please help us by testing this new package. See https://wiki.ubuntu.com/Testing/EnableProposed for documentation how to enable and use -proposed. Your feedback will aid us getting this update out to other Ubuntu users.

If this package fixes the bug for you, please add a comment to this bug, mentioning the version of the package you tested, and change the tag from verification-needed to verification-done. If it does not fix the bug for you, please add a comment stating that, and change the tag to verification-failed. In either case, details of your testing will help us make a better decision.

Further information regarding the verification process can be found at https://wiki.ubuntu.com/QATeam/PerformingSRUVerification . Thank you in advance!

Hi,

I had to modify the reproducer script which did trigger "false positive" (the noonce pattern getting into syslog after the grep) by adding a sleep 1 after :

if grep $nonce /var/log/hourly/* > /dev/null; then

The new package in -proposed has been running successfully for more than 3 hours, but the original package does not trigger a FAIL from the script either with the added sleep 1.

I am tempted to mark it 'verification-done' since it has been running correctly for a long period. Maybe Adam wants to comment ?

since the test ran without sign of any regression, I'm marking it verification-done

tags: added: verification-done
removed: verification-needed
Dave Walker (davewalker) wrote :

Louis, would you mind checking that the current released version does present this issue with the test case, in your same enviroment please?

This would just help confirm that that it was broken, and now works. :)

Thanks.

@dave

The main problem I encountered while testing with the provided test script is that it produces false positives both with previous and -proposed versions. Sometimes, the test messages sent using "logger" makes it to /var/log/syslog after the 'grep' has gone through, identifying rsyslog as being hung while it is not. So the script is not reliable.

When using the following modified script :

#!/bin/sh

while [ 1 ]; do
  service rsyslog stop
  sleep 1
  killall -9 rsyslogd > /dev/null 2>&1
  service rsyslog start
  sleep 1
  nonce=$(date '+%s')
  logger $nonce
  sleep 1
  if grep $nonce /var/log/hourly/* > /dev/null; then
    echo found nonce $nonce
  else
    echo FAIL
    exit 1
  fi
done

The script has not reported any failure when running for > 12 hours on the current released version which does not have the fix.

My conclusion is that triggering the bug using the modified script is not possible, hence the mention in comment #24.

So both current released version and the version in -proposed have been running correctly with my modified test script so I am not able to reproduce the failure.

Jon Tai (jontai) wrote :

Sorry for not commenting on this sooner. For what it's worth, we were definitely able to reproduce this bug on our production infrastructure before, and we've been running Adam's PPA version of this package for several weeks now without issue. Since I've already reported these findings and I'm the original reporter of the bug, I'm not sure that qualifies the package to be verification-done.

Adam Stokes (adam-stokes) wrote :

Thanks Jon,

I believe we have enough verification done now we can move forward.

Dave:

Are we able to make sure precise is fix-released now?

Thank you
Adam

The verification of this Stable Release Update has completed successfully and the package has now been released to -updates. Subsequently, the Ubuntu Stable Release Updates Team is being unsubscribed and will not receive messages about this bug report. In the event that you encounter a regression using the package from -updates please report a new bug using ubuntu-bug and tag the bug report regression-update so we can easily find any regresssions.

Launchpad Janitor (janitor) wrote :

This bug was fixed in the package rsyslog - 5.8.6-1ubuntu8.2

---------------
rsyslog (5.8.6-1ubuntu8.2) precise-proposed; urgency=low

  * debian/patches/02-fix_startup_deadlock.patch:
    - Fixes deadlock during startup (LP: #1169740)
 -- Adam Stokes <email address hidden> Wed, 17 Apr 2013 09:54:00 -0400

Changed in rsyslog (Ubuntu Precise):
status: Fix Committed → Fix Released
Launchpad Janitor (janitor) wrote :

This bug was fixed in the package rsyslog - 5.8.6-1ubuntu9.2

---------------
rsyslog (5.8.6-1ubuntu9.2) quantal-proposed; urgency=low

  [ Pierre Carrier ]
  * debian/patches/02-fix_startup_deadlock.patch:
    - Fixes deadlock during startup (LP: #1169740)
 -- Adam Stokes <email address hidden> Wed, 17 Apr 2013 09:41:26 -0400

Changed in rsyslog (Ubuntu Quantal):
status: Fix Committed → Fix Released
To post a comment you must log in.
This report contains Public information  Edit
Everyone can see this information.

Other bug subscribers