salsauthd maxes cpu

Bug #997217 reported by David Farrar on 2012-05-09
34
This bug affects 5 people
Affects Status Importance Assigned to Milestone
cyrus-sasl2 (Ubuntu)
Undecided
Unassigned
Precise
High
Unassigned
Trusty
High
Andreas Hasenack
Xenial
High
Andreas Hasenack

Bug Description

[Impact]

The rimap authentication mechanism in saslauthd can hit a condition where it will start spinning and using all available CPU. This condition can be easily encountered when an authentication is happening and the imap service is being restarted.

Furthermore, the saslauthd child process that picked up that authentication request and that is spinning now won't be reaped nor can it service further requests. If all children are left in this state, the authentication service as a whole won't be working anymore.

[Test Case]

This test can be performed in a LXD or VM.

* install the needed packages. mail-stack-delivery is used to have an imap server available on localhost that needs no further configuration. Accept the defaults for all debconf prompts:

sudo apt update
sudo apt install sasl2-bin mail-stack-delivery

* set the password "ubuntu" for the ubuntu user
echo ubuntu:ubuntu | sudo chpasswd

* start saslauthd like this, with just one child:
sudo /usr/sbin/saslauthd -a rimap -O localhost -r -n 1

* restart dovecot
sudo service dovecot restart

* test saslauthd authentication:
$ sudo testsaslauthd -u ubuntu -p ubuntu
0: OK "Success."

* Now let's break it. In one terminal watch the output of top:
top

* in another terminal, run the following:
sudo testsaslauthd -u ubuntu -p ubuntu & sleep 1; sudo service dovecot stop

* observe in the "top" terminal that saslauthd is consuming a lot of cpu. If that's not happening, try starting dovecot again and adjusting the sleep value in the previous test command, but 1s was enough in all my runs.

* start dovecot and repeat the authentication request. Since the only saslauthd child is now spinning, this will block:
sudo service dovecot start
$ sudo testsaslauthd -u ubuntu -p ubuntu
<blocks>

[Regression Potential]
This fix relies on read(2) returning zero bytes when the connection is dropped, and that is clearly documented in its manpage:

"On success, the number of bytes read is returned (zero indicates end of file),"

The select manpage also documents such a case being a valid case to indicate that a socket is ready to be read from, and that it won't block:

"The file descriptors listed in readfds will be watched to see if characters become available for reading (more precisely, to see if a read will not block; in particular, a file descriptor is also ready on end-of-file)"

This patch is what was used upstream, and is also present in bionic.

I can't think of regressions specific to this change, other than the usual risk of rebuilding a widely used library (sasl2) in an environment different from the one from when xenial was released, i.e., different libraries available, maybe different system-wide build options, etc.

[Other Info]
Trusty is still not accounting for read() returning zero being an end-of-file case, but the loop there has a counter and it eventually exits, not leading to a perpetual spin or high cpu usage (see comment #17 for a brief history on how this fix was dropped in the xenial package).

The fix is simple and could be applied there as well, if the SRU team prefers.

[Original Description]

sasl2-bin version 2.1.24~rc1.dfsg1+cvs2011-05-23-4ubuntu contains a bug that causes heavy cpu utilization, impacting normal operation of one of our mail servers following an upgrade to Ubuntu 12.04.

We are running the daemon with the following options:

/usr/sbin/saslauthd -a rimap -O our.imap.server -r -m /var/spool/postfix/var/run/saslauthd -n 5

We noticed that users were unable to send mail and that the saslauthd processes were using approximately 100% of each cpu core. An strace of one of the runaway process showed that it was stuck in the following behaviour:

select(9, [8], NULL, NULL, {0, 0}) = 1 (in [8], left {0, 0})
read(8, "", 940) = 0
select(9, [8], NULL, NULL, {0, 0}) = 1 (in [8], left {0, 0})
read(8, "", 940) = 0
select(9, [8], NULL, NULL, {0, 0}) = 1 (in [8], left {0, 0})
read(8, "", 940) = 0
select(9, [8], NULL, NULL, {0, 0}) = 1 (in [8], left {0, 0})
read(8, "", 940) = 0
select(9, [8], NULL, NULL, {0, 0}) = 1 (in [8], left {0, 0})
read(8, "", 940) = 0
select(9, [8], NULL, NULL, {0, 0}) = 1 (in [8], left {0, 0})
read(8, "", 940) = 0
select(9, [8], NULL, NULL, {0, 0}) = 1 (in [8], left {0, 0})
read(8, "", 940) = 0
.....

with further inspection showing that the file descriptor in question was a socket connected to our imap server in CLOSE_WAIT.

Browsing saslauthd/auth_rimap.c in the source package for sasl2-bin, we came across the following code, repeated in two locations:

while( select (fds, &perm, NULL, NULL, &timeout ) >0 ) {
           if ( FD_ISSET(s, &perm) ) {
              ret = read(s, rbuf+rc, sizeof(rbuf)-rc);
              if ( ret<0 ) {
                 rc = ret;
                 break;
              } else {
                 rc += ret;
              }
           }
        }

It looks like this loop is expected to run until a read error is encountered or the timeout of 1 second is reached. There is no test to check that 0 bytes were read, indicating that the connection was closed by the remote peer. Since select() will immediately return the size of the set of the partially closed descriptor (1, which is >0), and calls to read() will always yield 0 bytes, there's the potential for execution to get stuck in this non blocking loop and I'm presuming that that's what's happening here.

We've not performed any further analysis to prove that this is really what's happening but if my intuition is correct then our IMAP server (an nginx imap proxy) most liklely closes the connection at an unexpected time under as yet undetermined conditions.

Launchpad Janitor (janitor) wrote :

Status changed to 'Confirmed' because the bug affects multiple users.

Changed in cyrus-sasl2 (Ubuntu):
status: New → Confirmed
Fufroma (fufroma-a) wrote :

Bug & patch for Debian Wheezy are here: http://bugs.debian.org/cgi-bin/bugreport.cgi?bug=708552

Fufroma (fufroma-a) wrote :

Fixed in version cyrus-sasl2/2.1.25.dfsg1-7

Robie Basak (racb) wrote :

Thanks. Saucy is now on 2.1.25.dfsg1-7, so marking Fix Released.

If you need a fix for an existing stable release, please comment with a justification against https://wiki.ubuntu.com/StableReleaseUpdates#When and complete steps 1 through 4 in https://wiki.ubuntu.com/StableReleaseUpdates#Procedure

Changed in cyrus-sasl2 (Ubuntu):
status: Confirmed → Fix Released
Daniel (pada) wrote :

This bugs represents a severe regression on upgrade from Ubuntu 10.04 LTS to Ubuntu 12.04 LTS.
The package is rather unusable since it eats up 100% CPU on production mailservers after upgrade to current LTS.

Would you please include the fix http://bugs.debian.org/cgi-bin/bugreport.cgi?msg=5;filename=0034-fix_dovecot_authentication.patch;att=1;bug=708552 or a backport of 2.1.25.dfsg1-7 in the current LTS Release 12.04 Precise.

Robie Basak (racb) wrote :

If you need a fix for an existing stable release, please comment with a justification against https://wiki.ubuntu.com/StableReleaseUpdates#When and complete steps 1 through 4 in https://wiki.ubuntu.com/StableReleaseUpdates#Procedure

Daniel (pada) wrote :

Hi Robie,

Justification against https://wiki.ubuntu.com/StableReleaseUpdates#When:
* severe regression bug: Package sasl2-bin is unusable after Upgrade from Lucid LTS to Precise LTS since saslauthd eats up 100% CPU on production mailservers when using rimap and needs regular restarts. saslauthd in Lucid LTS did work just fine with rimap.

Complete steps 1 through 4 in https://wiki.ubuntu.com/StableReleaseUpdates#Procedure
Step 1) Check that the bug is fixed in the current development release, and that its bug task is "Fix Released".
 --> Yes, launchpad bug #997217 has status "Fix released". Check.

Step 2) Ensure that the bug report for this issue is public.
--> Yes, launchpad bug #997217 is public.

Step 3)

Update the bug report description and make sure it contains the following information:

[Impact]
* High Load on production mailservers since Upgrade of Long Term Support Release from Lucid 10.04 LTS to Precise 12.04 LTS

[Test Case]
* Sporadically running into high load (infinite loop in saslauthd) when running saslauthd via Remote (rimap) against nginx mailproxy from "nginx-extras" package on mailservers with high imap traffic

[Regression]
* The bug itself is a regression in comparison to Lucid 10.04.

Step 4) Ask someone in the Ubuntu bug control team to nominate the bug for the appropriate Ubuntu release(s)/series (e. g. the current LTS and latest stable release).
--> Nominated for Precise by Adolfo Jayme Barrientos

Is anything missing?

Regards
Daniel

tags: added: precise
Robie Basak (racb) wrote :

Daniel,

Please could you elaborate on the test case? Neither me nor the stable update verification team will be able to reproduce or test the presence of the bug or a fix for it based on the description you've given. These need to be step-by-step instructions that anybody can follow, including somebody not familiar with the specifics of the package.

Will you be OK to test the proposed fix once it is built and ready, so that the update will be able to land for general use?

Thanks!

Daniel (pada) wrote :

I'm sorry to say, that there is no simple test case.

The problem cannot be reproduced with simple instructions. You need to run saslauthd -a rimap for a few days in front of nginx-extras+dovecot IMAP and do a lot of imap logins to trigger the bug. But the bug is already fixed by the newer patch version of 0034-fix_dovecot_authentication.patch in 2.1.25.dfsg1-17.

Some more system information:

Precise nginx:

mail01:~# dpkg -l "nginx*" |grep ^ii
ii nginx-common 1.1.19-1ubuntu0.4 small, but very powerful and efficient web server (common files)
ii nginx-extras 1.1.19-1ubuntu0.4 nginx web server with full set of core modules and extras

Self-compiled known-to-be-working stable dovecot release 2.1.16:

mail01:~# doveconf -n |head -2
# 2.1.16: /etc/dovecot/dovecot.conf
# OS: Linux 3.2.0-56-generic x86_64 Ubuntu 12.04.3 LTS

Saslauthd ist running with following parameters, where 10.10.10.10 is the remote dovecot imap server:

/usr/sbin/saslauthd -a rimap -O 10.10.10.10 -r -c -t 60 -m /var/run/saslauthd -n 5

In the meanwhile I took 0034-fix_dovecot_authentication.patch from the saucy package https://launchpad.net/ubuntu/saucy/amd64/sasl2-bin/2.1.25.dfsg1-17, updated the patch
0034-fix_dovecot_authentication.patch in the precise package and installed it on our servers.

I'll run for a few days (or weeks) now, and see if the CPU problem goes away.

If you want to test an official package update for precise, we can do so on one of the machines.

Regards
Daniel

Changed in cyrus-sasl2 (Ubuntu Precise):
status: New → Triaged
importance: Undecided → High
Daniel (pada) wrote :

Just a short note:

With the 0034-fix_dovecot_authentication.patch from saucy package https://launchpad.net/ubuntu/saucy/amd64/sasl2-bin/2.1.25.dfsg1-17 all systems are running fine so far (for 9 days now), without high load of saslauthd.

Roberto Suarez (robe-allenta) wrote :

I'm seeing this exact same problem with saslauthd in sasl2-bin 2.1.26.dfsg1-14, using Ubuntu 16.04.4. Is there any other possible cause?

Roberto Suarez (robe-allenta) wrote :

I reply to myself: while the version in Xenial is *still* affected with this bug, any of the later versions is fixed. The version 2.1.26.dfsg1-14 is bugged, but the next one, 2.1.26.dfsg1-15 is not. I backported the one in Bionic easily and it's working now without problems.

Andreas Hasenack (ahasenack) wrote :

The patch 0034 mentioned in comment #10 is applied in the xenial package 2.1.26.dfsg1-14build1, so what Roberto hit could be a different issue requiring a different fix.

Might have been this:
cyrus-sasl2 (2.1.26.dfsg1-15) unstable; urgency=medium

  * Add fix for auth_rimap infinite loop (hang) when IMAP server closes
    connection (Closes: #815208)

Patch is https://bugs.debian.org/cgi-bin/bugreport.cgi?att=1;bug=815208;filename=auth_rimap_socket_closed.patch;msg=5:
--- a/saslauthd/auth_rimap.c
+++ b/saslauthd/auth_rimap.c
@@ -494,7 +494,7 @@
         while( select (fds, &perm, NULL, NULL, &timeout ) >0 ) {
            if ( FD_ISSET(s, &perm) ) {
               ret = read(s, rbuf+rc, sizeof(rbuf)-rc);
- if ( ret<0 ) {
+ if ( ret<=0 ) {
                  rc = ret;
                  break;
               } else {
@@ -607,7 +607,7 @@
         while( select (fds, &perm, NULL, NULL, &timeout ) >0 ) {
            if ( FD_ISSET(s, &perm) ) {
               ret = read(s, rbuf+rc, sizeof(rbuf)-rc);
- if ( ret<0 ) {
+ if ( ret<=0 ) {
                  rc = ret;
                  break;
               } else {

Changed in cyrus-sasl2 (Ubuntu Precise):
status: Triaged → Won't Fix
Andreas Hasenack (ahasenack) wrote :

Trusty has 0034-fix_dovecot_authentication.patch, but it lacks the above fix as well.

tags: added: server-next
removed: precise
Changed in cyrus-sasl2 (Ubuntu Trusty):
status: New → Triaged
Changed in cyrus-sasl2 (Ubuntu Xenial):
status: New → Triaged
Andreas Hasenack (ahasenack) wrote :

I reproduced the problem in xenial by timing a "testsaslauthd" call with a "service dovecot stop" one (dovecot being my test imap server on localhost).

This is good enough for an SRU test case, thanks all.

Changed in cyrus-sasl2 (Ubuntu Xenial):
assignee: nobody → Andreas Hasenack (ahasenack)
importance: Undecided → High
Changed in cyrus-sasl2 (Ubuntu Trusty):
importance: Undecided → High
assignee: nobody → Andreas Hasenack (ahasenack)
description: updated
description: updated
description: updated
description: updated
Andreas Hasenack (ahasenack) wrote :

I'm unable to trigger this bug when running the test case on trusty, even though the same code is there. I also tried adjusting the sleep.

Andreas Hasenack (ahasenack) wrote :

Ok, trusty isn't affected because the loop there has an exit clause:
+ ret = read(s, rbuf+rc, sizeof(rbuf)-rc);
+ if ( ret<0 ) {
+ rc = ret;
+ break;
+ } else {
+ if (ret == 0) {
+ loopc += 1;
+ } else {
+ loopc = 0;
+ }
+ if (loopc > sizeof(rbuf)) { // arbitrary chosen value
+ break;
+ }

That comes from trusty's patch named 0034-fix_dovecot_authentication.patch. So trusty does loop for a bit (sizeof(rbuf) is 1000), but won't get stuck. Someone added the loop counter as a safety net, but didn't change the "ret<0" check into "ret<=0" which would also have fixed this.

In precise, that same patch (0034) adds the loop, but *without* an exit clause, hence this bug.

Xenial is interesting. Upstream at some point adopted the patch that does *NOT* exit the loop, but the code is so similar that someone decided the patch from the package was already applied and dropped it from the package, reintroducing the bug.

To add to the confusion, in xenial that patch file was super slightly renamed from 0034_fix_dovecot_authentication.patch to 0034-fix_dovecot_authentication.patch (can you spot what changed?) and got totally different contents:
--- cyrus-sasl2.orig/lib/checkpw.c
+++ cyrus-sasl2/lib/checkpw.c
@@ -587,16 +587,14 @@ static int read_wait(int fd, unsigned de
        /* Timeout. */
        errno = ETIMEDOUT;
        return -1;
- case +1:
- if (FD_ISSET(fd, &rfds)) {
- /* Success, file descriptor is readable. */
- return 0;
- }
- return -1;
    case -1:
        if (errno == EINTR || errno == EAGAIN)
        continue;
    default:
+ if (FD_ISSET(fd, &rfds)) {
+ /* Success, file descriptor is readable. */
+ return 0;
+ }
        /* Error catch-all. */
        return -1;
    }

From bionic onwards, the upstream version has the loop with no loop counter, but it checks read()'s result for <= 0, not just 0, so it's fixed there.

Bottom line, only xenial is currently affected (and precise, but precise is EOL).

Changed in cyrus-sasl2 (Ubuntu Trusty):
status: Triaged → Invalid
description: updated
description: updated
Changed in cyrus-sasl2 (Ubuntu Xenial):
status: Triaged → In Progress

Hello David, or anyone else affected,

Accepted cyrus-sasl2 into xenial-proposed. The package will build now and be available at https://launchpad.net/ubuntu/+source/cyrus-sasl2/2.1.26.dfsg1-14ubuntu0.1 in a few hours, and then in the -proposed repository.

Please help us by testing this new package. See https://wiki.ubuntu.com/Testing/EnableProposed for documentation on how to enable and use -proposed. Your feedback will aid us getting this update out to other Ubuntu users.

If this package fixes the bug for you, please add a comment to this bug, mentioning the version of the package you tested and change the tag from verification-needed-xenial to verification-done-xenial. If it does not fix the bug for you, please add a comment stating that, and change the tag to verification-failed-xenial. In either case, without details of your testing we will not be able to proceed.

Further information regarding the verification process can be found at https://wiki.ubuntu.com/QATeam/PerformingSRUVerification . Thank you in advance for helping!

N.B. The updated package will be released to -updates after the bug(s) fixed by this package have been verified and the package has been in -proposed for a minimum of 7 days.

Changed in cyrus-sasl2 (Ubuntu Xenial):
status: In Progress → Fix Committed
tags: added: verification-needed verification-needed-xenial
Andreas Hasenack (ahasenack) wrote :

Roberto, are you able to test this update? If not, I can do it, but it would be better if someone not involved with the fix could do it.

Andreas Hasenack (ahasenack) wrote :

xenial verification

Confirming bug:

ubuntu@xenial-saslauthd:~$ apt-cache policy sasl2-bin
sasl2-bin:
  Installed: 2.1.26.dfsg1-14build1
  Candidate: 2.1.26.dfsg1-14build1
  Version table:
 *** 2.1.26.dfsg1-14build1 500
        500 http://br.archive.ubuntu.com/ubuntu xenial/main amd64 Packages

Confirmed saslauthd at 100% CPU after the test command:
  PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
 6666 root 20 0 90860 972 0 R 100.0 0.0 0:05.72 saslauthd

subsequent testsaslauthd commands just block, as expected due to the bug.

Repeating the test with the package from proposed:
ubuntu@xenial-saslauthd:~$ apt-cache policy sasl2-bin
sasl2-bin:
  Installed: 2.1.26.dfsg1-14ubuntu0.1
  Candidate: 2.1.26.dfsg1-14ubuntu0.1
  Version table:
 *** 2.1.26.dfsg1-14ubuntu0.1 500
        500 http://br.archive.ubuntu.com/ubuntu xenial-proposed/main amd64 Packages

Not only does saslauthd not start eating cpu, it keeps working:
with dovecot stopped:
ubuntu@xenial-saslauthd:~$ sudo testsaslauthd -u ubuntu -p ubuntu
0: NO "authentication failed"

Starting up dovecot and trying again:
ubuntu@xenial-saslauthd:~$ sudo service dovecot start
ubuntu@xenial-saslauthd:~$ sudo testsaslauthd -u ubuntu -p ubuntu
0: OK "Success."
ubuntu@xenial-saslauthd:~$

Xenial verification completed successfully.

tags: added: verification-done-xenial
removed: verification-needed-xenial
To post a comment you must log in.
This report contains Public information  Edit
Everyone can see this information.

Other bug subscribers

Remote bug watches

Bug watches keep track of this bug in other bug trackers.