rpc.rquotad takes 100% CPU

Bug #1774431 reported by Robert Sander on 2018-05-31
14
This bug affects 1 person
Affects Status Importance Assigned to Milestone
quota (CentOS)
Fix Committed
High
quota (Fedora)
Fix Committed
High
quota (Ubuntu)
High
 Christian Ehrhardt 
Bionic
Undecided
 Christian Ehrhardt 

Bug Description

[Impact]

 * Certain configurations will fall into a busy loop and never fully start

 * Fix by backporting from upstream

[Test Case]

1. install packages
   $ sudo apt install rpcbind quota
2. to let the service start at all one need:
   - Add ",usrquota" to your fstab options somewhere
   - Add a fake export to /etc/exports like
     /tmp 127.0.0.1
3. offending config:
   Only then will quotarpc.service start, but it does not expose the
   faulty behavior yet, to get that set a special port via
   RPCRQUOTADOPTS="-p 599" in /etc/default/quota.

Now you can run systemctl restart quota and it will lock up in 100% CPU.

[Regression Potential]

 * It is an extra listen and setsockipt call. While it is correct to do
   those (that is the fix) there might exist cases we didn't hit in
   testing where those calls fail and break it.
   Yet since it lies on a code path that currently is affected by the 100%
   cpu hog that should (tm) not be a huge regression potential.
   At least both paths - if failing - have proper error reporting so a
   user will know what broke and is not just failing without knowing why
   out of a sudden.

[Other Info]

 * n/a

---

This happens on Ubuntu 18.04 with package quota 4.04-2 installed.

The bug is already known and a patch is available: https://sourceforge.net/p/linuxquota/feature-requests/16/

Related branches

Description of problem:

The quota daemon, rpc.rquotad consumes 100% CPU on one core.
It does answer correctly to queries though.

Version-Release number of selected component (if applicable):

quota-rpc-4.04-5.fc28.x86_64

How reproducible:

100% and immediately.

Steps to Reproduce:
1. dnf install quota-rpc
2. systemctl start rpc-rquotad.services
3. top

Actual results:

You see 90+% CPU usage from rpc.rquotad

Expected results:

Minimal CPU usage from rpc.rquotad

Additional info:

1) Downgrade to the last version of Fedora 27, quota-rpc-4.03-9.fc27.x86_64
   makes everything okay again.

   dnf -y downgrade quota-rpc --releasever=27

2) strace on the daemon spits out endless repeats of:

poll([{fd=4, events=POLLIN|POLLPRI|POLLRDNORM|POLLRDBAND}, {fd=5,
  events=POLLIN|POLLPRI|POLLRDNORM|POLLRDBAND}, {fd=6,
  events=POLLIN|POLLPRI|POLLRDNORM|POLLRDBAND}, {fd=7,
  events=POLLIN|POLLPRI|POLLRDNORM|POLLRDBAND}], 4, -1) = 2 ([{fd=5,
  revents=POLLHUP}, {fd=7, revents=POLLHUP}])
accept(5, 0x7ffe61698700, [128]) = -1 EINVAL (Invalid argument)
accept(7, 0x7ffe61698700, [128]) = -1 EINVAL (Invalid argument)

Thanks for the report. I can reproduce it.

This happens inside tirpc library:

#0 0x00007f86ab2e3254 in accept () from /lib64/libc.so.6
#1 0x00007f86ab5c1274 in rendezvous_request () from /lib64/libtirpc.so.3
#2 0x00007f86ab5bd6b8 in svc_getreq_common () from /lib64/libtirpc.so.3
#3 0x00007f86ab5bd947 in svc_getreq_poll () from /lib64/libtirpc.so.3
#4 0x00007f86ab5c016e in svc_run () from /lib64/libtirpc.so.3
#5 0x000055e6f0a3d6c7 in main (argc=2, argv=0x7ffc9afd7d18) at rquota_svc.c:538

The polled descriptors are:

rpc.rquot 21981 root 4u IPv4 80449159 0t0 UDP *:rquotad
rpc.rquot 21981 root 5u sock 0,9 0t0 80449162 protocol: TCP
rpc.rquot 21981 root 6u IPv6 80449165 0t0 UDP *:rquotad
rpc.rquot 21981 root 7u sock 0,9 0t0 80449168 protocol: TCPv6

One can use "/usr/sbin/rpc.rquotad -F" command to prevent from daemonization.

Fedora 27 is not affected, because it does not use tirpc there. I can also reproduce it on Fedora 29.

Either it's a bug in libtirpc-1.0.3-1.rc1.fc28.x86_64 that wants to accept raw sockets, or rpc.rquotad setups the tirpc service inappropriately. Maybe libtirpc maintainer will know more. rpc.rquotad code is in rquota_svc.c

(In reply to Petr Pisar from comment #2)
>
> Either it's a bug in libtirpc-1.0.3-1.rc1.fc28.x86_64 that wants to accept
> raw sockets, or rpc.rquotad setups the tirpc service inappropriately. Maybe
> libtirpc maintainer will know more. rpc.rquotad code is in rquota_svc.c
Well it is spinning in rendezvous_request due to this loop:

again:
    len = sizeof addr;
    sock = accept(xprt->xp_fd, (struct sockaddr *)(void *)&addr, &len);
    if (sock < 0) {
        if (errno == EINTR)
            goto again;
        return (FALSE);
    }
There was the same loop in the old glibc code as well
so nothing there has changed there...

It definitly has something to do with creating the
TCP listening socket since the UDP socket seems to
be fine:

# rpcinfo -T udp localhost 100011
program 100011 version 1 ready and waiting
program 100011 version 2 ready and waiting
# rpcinfo -T tcp localhost 100011
rpcinfo: RPC: Remote system error - Connection refused

In Comment 3 I was mistaken with that again loop since
the errno being returned from accept is EINVAL not
EINTR.

The reason for this looping and accept returning
EINVAL is a listen(2) call is not being done the
the fd used to create the SVCXPRT pointer.

The reason a listen is not done is because
svc_tli_create() is given pre-created socket
and a bound socket. In this cause svc_tli_create()
will not call listen().

In the glibc code, a listen was called, I proved
that with strace, but I'm not sure from where
since I can not find a svc_til_create() in the
older (Fedora 27) glibc code.

Since there does not seem to be an upstream
git tree, I can not tell when that call
was added. Was that call recently added?

So the assumption is made by the tirpc version
of svc_tli_create() is when handed a socket that
is bound, its assumed the listen call has also
been made. In other apps using it (in nfs-utils)
this is the case.

Created attachment 1435510
Patch that fixes the problem

The attached patch adds a listen() call to svc_create_sock()
routine which is needed with libtirpc version of svc_tli_create()
as well as a needed IPv6 setsockopt().

It appears the glibc version of that call must have done
the listen() call. Not clear how that ever worked since
other versions of svc_tli_create() (aka BSD) don't
do the bind either plus the IPV6_V6ONLY is needed
to allow the binds to work.

I can conform that this solves the issue, at least here
(no ipv6 in use).

Thanks for the quick fix ... Now wait for the rpms...

The git tree is at <git://git.code.sf.net/p/linuxquota/code>. Fedora 28 rebased quota to 4.04 that migrated from NIS RPC to tirpc. Hence Fedora 28 is the first one that uses tirpc and supports IPv6.

Thanks for the patch. I will review it, forward to the upstream and apply to Fedora.

The patch looks good and I confirm it fixes this issue. The RPC calls still work over both IPv4 and IPv6. I sent the patch to <https://sourceforge.net/p/linuxquota/feature-requests/16/>

quota-4.04-6.fc28 has been submitted as an update to Fedora 28. https://bodhi.fedoraproject.org/updates/FEDORA-2018-4adeec12ed

quota-4.04-6.fc28 has been pushed to the Fedora 28 testing repository. If problems still persist, please make note of it in this bug report.
See https://fedoraproject.org/wiki/QA:Updates_Testing for
instructions on how to install test updates.
You can provide feedback for this update here: https://bodhi.fedoraproject.org/updates/FEDORA-2018-4adeec12ed

quota-4.04-6.fc28 has been pushed to the Fedora 28 stable repository. If problems still persist, please make note of it in this bug report.

Robert Sander (gurubert) on 2018-05-31
description: updated
description: updated
Changed in quota (CentOS):
importance: Unknown → High
status: Unknown → Fix Committed
Changed in quota (Fedora):
importance: Unknown → High
status: Unknown → Fix Committed
tags: added: server-next
Changed in quota (Ubuntu):
status: New → Triaged
importance: Undecided → High

I liked the report especially with the patch already linked and all that.
But to reproduce on Debian/Ubuntu one needs to:

$ sudo apt install rpcbind quota

To let the service start at all one need:
- Add ",usrquota" to your fstab options somewhere
- Add a fake export to /etc/exports like
  /tmp 127.0.0.1

Only then will quotarpc.service start, but it does not expose the faulty behavior.

If you could outline the steps to get the bad behavior on Ubuntu that would be very kind.

Changed in quota (Ubuntu):
status: Triaged → Incomplete
Robert Sander (gurubert) wrote :

This behaviour is only observed when rpc.rquotad is told to run on a specific port with e.g. RPCRQUOTADOPTS="-p 599" in /etc/default/quota.

I need this because I am running a ctdb cluster with Samba and NFS on top of CephFS. Every node needs to have the NFS services on the same ports.

Confirmed, thanks Robert.
Yet with RPCRQUOTADOPTS="-p 599" in /etc/default/quota the issue can be reproduced.

Changed in quota (Ubuntu):
status: Incomplete → Triaged

Hi,
I have built a fixed version in ppa [1].
This works for me a dn I have not spotted any issues yet, but since you seem to be a more heavey user of this I wanted to ask if you could check from the ppa as well.

If it is good we can push that to 18.10 and afterwards consider an SRU for Xenial.

[1]: https://launchpad.net/~ci-train-ppa-service/+archive/ubuntu/3281

Changed in quota (Ubuntu Bionic):
status: New → Triaged
Robert Sander (gurubert) wrote :

Looks good. Thanks.

I have opened a merge proposal to get a 2nd opinion avoiding mistakes and missing use cases affected in a bad way. Once that is acked I can upload to Cosmic.

Changed in quota (Ubuntu):
status: Triaged → In Progress
Changed in quota (Ubuntu Bionic):
status: Triaged → In Progress
Changed in quota (Ubuntu):
assignee: nobody →  Christian Ehrhardt  (paelzer)
Changed in quota (Ubuntu Bionic):
assignee: nobody →  Christian Ehrhardt  (paelzer)

Cosmic upload is currently tested, also prepared the SRU Template for the latter upload to Bionic.

description: updated
Launchpad Janitor (janitor) wrote :

This bug was fixed in the package quota - 4.04-2ubuntu1

---------------
quota (4.04-2ubuntu1) cosmic; urgency=medium

  * d/p/lp-1774431-fix-Listen-on-a-TCP-socket.patch: avoid issues with TCP
    sockets hogging quota at 100% cpu (LP: #1774431 Can be dropped next
    release >4.04 as the patch is upstream already).

 -- Christian Ehrhardt <email address hidden> Mon, 04 Jun 2018 09:13:16 +0200

Changed in quota (Ubuntu):
status: In Progress → Fix Released

Hello Robert, or anyone else affected,

Accepted quota into bionic-proposed. The package will build now and be available at https://launchpad.net/ubuntu/+source/quota/4.04-2ubuntu0.1 in a few hours, and then in the -proposed repository.

Please help us by testing this new package. See https://wiki.ubuntu.com/Testing/EnableProposed for documentation on how to enable and use -proposed.Your feedback will aid us getting this update out to other Ubuntu users.

If this package fixes the bug for you, please add a comment to this bug, mentioning the version of the package you tested and change the tag from verification-needed-bionic to verification-done-bionic. If it does not fix the bug for you, please add a comment stating that, and change the tag to verification-failed-bionic. In either case, without details of your testing we will not be able to proceed.

Further information regarding the verification process can be found at https://wiki.ubuntu.com/QATeam/PerformingSRUVerification . Thank you in advance!

Changed in quota (Ubuntu Bionic):
status: In Progress → Fix Committed
tags: added: verification-needed verification-needed-bionic
Robert Sander (gurubert) wrote :

Hi Łukasz,

looks good so far.

tags: added: verification-done verification-done-bionic
removed: verification-needed verification-needed-bionic
Robert Sander (gurubert) wrote :

Forgot the package version: 4.04-2ubuntu0.1, had to downgrade from the PPA's version.

The issue LGTM, now only waiting the usual extra days if someone spots something in proposed that might be missed.

@Robert - you'll be added to my personal list of great, active, friendly and helpful bug reporters :-)
Thank you for your participation in all of this!

Robert Sander (gurubert) wrote :

Thank you.

If only all package maintainers would be as fast as you.
I would be happy if https://bugs.launchpad.net/ubuntu/+source/ctdb/+bug/722201 could be sped up a little bit…

Łukasz Zemczak (sil2100) wrote :

Thanks for all the testing! There's one systemd autopkgtest failure happening for this upload in -proposed though. Might not be related as those tend to be flaky, but I'd like someone taking a look at that before we proceed. I re-ran it some days ago but it again failed.

Hi Lukasz,
it is a flaky test in this case.
On one hand there is no quota usage in this test (boot-smoke failed last).
Further different subtests fail on different retries.

Thanks for asking, but for now we can only ignore it and go on or retrigger until green.
I takled to xnox about it but there seems not too much he can do about it atm.
We want the coverage of the working tests, but some are flaky (some more than others).

I retriggered it for now and usually it is 1/3 passing, but murphy ...

To post a comment you must log in.
This report contains Public information  Edit
Everyone can see this information.

Other bug subscribers

Remote bug watches

Bug watches keep track of this bug in other bug trackers.