beegfs-meta lockup with glibc 2.27 on bionic

Bug #1844195 reported by Dirk Petersen
This bug affects 4 people
Affects: glibc (Ubuntu)
Status: Confirmed
Importance: Undecided
Assigned to: Unassigned

Bug Description

Bug report: Lock up of beegfs-meta with glibc 2.27

Affected system:

Release: Ubuntu 18.04.3 bionic
Kernel: 4.15.0-62-generic
libc6: 2.27-3ubuntu1
beegfs: 7.1.3

We have discovered an issue, which we believe to be a bug in the version of glibc
shipped in Ubuntu 18.04, that causes the beegfs-meta service
(https://www.beegfs.io/) to lock up and become unresponsive.

The issue has also been observed in three other installations, all running
Ubuntu 18.04; it does not occur on Ubuntu 16.04 or RHEL/CentOS 6 or 7.

The affected processes resume normal operation almost immediately after a tool
such as strace or gdb is attached, and then continue to run normally for some
time until they get stuck again. In the short period between attaching strace
and the process resuming normal operation we see messages like

38371 futex(0x5597341d9ca8, FUTEX_WAIT_BITSET_PRIVATE|FUTEX_CLOCK_REALTIME, 282, NULL, 0xffffffff) = -1 EAGAIN (Resource temporarily unavailable)
38371 futex(0x5597341d9ca8, FUTEX_WAIT_BITSET_PRIVATE|FUTEX_CLOCK_REALTIME, 282, NULL, 0xffffffff) = -1 EAGAIN (Resource temporarily unavailable)
38371 futex(0x5597341d9ca8, FUTEX_WAIT_BITSET_PRIVATE|FUTEX_CLOCK_REALTIME, 282, NULL, 0xffffffff) = -1 EAGAIN (Resource temporarily unavailable)
38371 futex(0x5597341d9ca8, FUTEX_WAIT_BITSET_PRIVATE|FUTEX_CLOCK_REALTIME, 282, NULL, 0xffffffff) = -1 EAGAIN (Resource temporarily unavailable)

together with a CPU load of 100% on one core. After the process gets unstuck, we see

38371 futex(0x5597341d9ca8, FUTEX_WAIT_BITSET_PRIVATE|FUTEX_CLOCK_REALTIME, 282, NULL, 0xffffffff) = -1 EAGAIN (Resource temporarily unavailable)
38371 futex(0x5597341d9ca8, FUTEX_WAIT_BITSET_PRIVATE|FUTEX_CLOCK_REALTIME, 282, NULL, 0xffffffff) = -1 EAGAIN (Resource temporarily unavailable)
38371 futex(0x5597341d9cb0, FUTEX_WAIT_BITSET_PRIVATE|FUTEX_CLOCK_REALTIME, 3, NULL, 0xffffffff <unfinished ...>
38231 futex(0x5597341d9cb0, FUTEX_WAKE_PRIVATE, 2147483647) = 2
38371 <... futex resumed> ) = 0
38371 futex(0x5597341d9cb0, FUTEX_WAIT_BITSET_PRIVATE|FUTEX_CLOCK_REALTIME, 3, NULL, 0xffffffff <unfinished ...>
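
For reference, traces like the ones above can be captured roughly as follows (a sketch only: the PID and the output path are placeholders, and the futex filter is just to cut down the noise):

# Find the beegfs-meta PID, then attach strace to all of its threads and log
# only futex calls with timestamps. Attaching is usually enough to unstick the
# process, so the interesting part is the short window right after attaching.
pidof beegfs-meta
strace -f -tt -e trace=futex -p <PID> -o /tmp/beegfs-meta.strace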

We found a patch [1] to glibc that might be related to the issue and built our
own version of the official glibc package with only the following diff applied.
All other changes in the upstream patch only add tests, the Makefile changes
needed to build them, and a changelog entry, so we left those out in order to
apply the patch cleanly to the Ubuntu glibc.

index 5dd5342..85fc1bc 100644
--- a/nptl/pthread_rwlock_common.c
+++ b/nptl/pthread_rwlock_common.c
@@ -314,7 +314,7 @@ __pthread_rwlock_rdlock_full (pthread_rwlock_t *rwlock,
                 harmless because the flag is just about the state of
                 __readers, and all threads set the flag under the same
                 conditions. */
- while ((atomic_load_relaxed (&rwlock->__data.__readers)
+ while (((r = atomic_load_relaxed (&rwlock->__data.__readers))
                      & PTHREAD_RWLOCK_RWAITING) != 0)
                {
                  int private = __pthread_rwlock_get_private (rwlock);
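
Roughly, the rebuild of the patched package can be done as follows (a sketch only: the patch file name is made up, and depending on the glibc packaging layout it may be cleaner to register the patch under debian/patches/ instead of patching the tree directly):

# Fetch the Ubuntu glibc source and build dependencies, apply the hunk above,
# and rebuild the binary packages.
apt-get source glibc
sudo apt-get build-dep glibc
cd glibc-2.27
patch -p1 < ../rwlock-rwaiting.patch   # hypothetical file containing the hunk above
dpkg-buildpackage -us -uc -b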

Unfortunately the lockups did not stop after we installed the patched package
versions and restarted our services. The only difference we noticed was that
during the lockups we could no longer observe high CPU load.

We were able to record backtraces of all of the threads in our stuck processes
before and after applying the patch. The traces are attached to this report.

Additionally, to rule out other causes, we examined the internal mutexes and
condition variables for deadlocks or livelocks produced at the application
level (BeeGFS routines). We could not find any.
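
For anyone who wants to reproduce that kind of inspection, something along these lines should be enough (the PID is a placeholder and the address is the futex word from the strace output above):

# Attach, dump every thread's stack, peek at the memory around the futex word,
# then detach. Note that attaching gdb also tends to unstick the process.
gdb -p <PID> \
    -ex "set pagination off" \
    -ex "thread apply all bt" \
    -ex "x/4wx 0x5597341d9ca8" \
    -ex detach -ex quit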

If you need additional information or testing, we would be happy to provide you
with what we can to help solve this issue.

[1] https://sourceware.org/git/?p=glibc.git;a=commit;h=f21e8f8ca466320fed38bdb71526c574dae98026

Dirk Petersen (q-petersen) wrote :
information type: Public → Private Security
information type: Private Security → Public Security
Marc Deslauriers (mdeslaur) wrote : Bug is not a security issue

Thanks for taking the time to report this bug and helping to make Ubuntu better. We appreciate the difficulties you are facing, but this appears to be a "regular" (non-security) bug. I have unmarked it as a security issue since this bug does not show evidence of allowing attackers to cross privilege boundaries nor directly cause loss of data/privacy. Please feel free to report any other bugs you may find.

information type: Public Security → Public
Ekrem SEREN (ekremseren) wrote :

Hi, we have the same issue. I can confirm that after attaching gdb to the beegfs-meta process, it resumes normal operation.

On our system the issue seems to recur roughly every 24 hours.

Release: Ubuntu 18.04.1 bionic
Kernel: 4.15.0-39-generic
libc6: 2.27-3ubuntu1
beegfs: 7.1.1

Brian Koebbe (koebbe) wrote :

Hi, we also have the same issue.

Release: Ubuntu 18.04.3 bionic
Kernel: 4.15.0-62-generic
libc6: 2.27-3ubuntu1
beegfs: 7.1.3

Not ideal, of course, but we are going to try running beegfs-meta in a chrooted xenial (using the xenial glibc version).

Launchpad Janitor (janitor) wrote :

Status changed to 'Confirmed' because the bug affects multiple users.

Changed in glibc (Ubuntu):
status: New → Confirmed
Bernd Pfrommer (bernd-pfrommer) wrote :

I am seeing the same issue on the same configuration, but it occurs every 24 hours, within a few minutes or even seconds of the following systemd timer going off.

Sat 2019-10-05 06:41:41 EDT 21h left Fri 2019-10-04 06:33:12 EDT 2h 59min ago apt-daily-upgrade.timer apt-daily-upgrade.service

The update started at 06:33:05 (tzdata was updated); the metadata server stopped responding at 06:35:55.

We have seen similar behavior on 3 different days.

I will switch off the updates to see if that fixes the problem.
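
The timer line above looks like systemctl list-timers output; roughly, the scheduling around the hang can be checked like this (the time window is only an example):

# List all timers with their last/next trigger times, then pull the journal
# for the minutes around the hang.
systemctl list-timers --all
journalctl --since "06:25" --until "06:45"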

Bernd Pfrommer (bernd-pfrommer) wrote :

beegfs-meta went into high CPU lock-up again today, at 6:27:07 am, even with daily updates disabled. So apt-daily-upgrade is not directly related to the hangs.

It turns out though that there is other stuff scheduled around that time by daily cron:

25 6 * * * root test -x /usr/sbin/anacron || ( cd / && run-parts --report /etc/cron.daily )

It is still not clear what triggers the hangs, but it is likely related to a scheduled job, given how consistently this happens at around 6:30 am.
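
To see what that cron entry actually runs, without executing anything (standard tools, nothing specific to this bug):

# Print the scripts /etc/cron.daily would run, and check the cron log for the
# jobs that fired around 06:25.
run-parts --test /etc/cron.daily
grep CRON /var/log/syslog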

Paul Jähne (sethosii) wrote :

I also experience the issue, but not at specific times; it shows up in medium/high load situations.

Brian Koebbe (koebbe) wrote :

I am really curious whether this turns out to be a glibc problem or a beegfs-meta problem, but we were able to hack together a workaround that sidesteps it by forcing beegfs-meta to use an older libc6 (from xenial). beegfs-meta is now stable and performing much better!

There's probably a simpler way to do this, but we did something along these lines:

1. debootstrap xenial /srv/xenial-chroot
2. chroot into xenial-chroot, add beegfs to apt sources.list, apt install beegfs-meta, exit chroot
3. prepare a systemd pre-exec script "/usr/local/bin/setup-beegfs-meta-chroot.sh":

#!/bin/bash

set -e

# keep the chroot's beegfs-meta config in sync with the host's
cp /etc/beegfs/beegfs-meta.conf /srv/xenial-chroot/etc/beegfs/beegfs-meta.conf

# bind-mount /proc, /sys and the metadata store into the chroot (skipped if already mounted)
mountpoint -q /srv/xenial-chroot/proc || mount --bind /proc /srv/xenial-chroot/proc
mountpoint -q /srv/xenial-chroot/sys || mount --bind /sys /srv/xenial-chroot/sys
mountpoint -q /srv/xenial-chroot/path/to/metadata || mount --bind /path/to/metadata /srv/xenial-chroot/path/to/metadata

4. copy /lib/systemd/system/beegfs-meta.service to /etc/systemd/system/beegfs-meta.service, adding the following to the [Service] section:

RootDirectory=/srv/xenial-chroot
ExecStartPre=/usr/local/bin/setup-beegfs-meta-chroot.sh
RootDirectoryStartOnly=yes

5. daemon-reload systemd and restart beegfs-meta (see the sketch below)
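
Step 5 is just the usual systemctl invocations:

# reload unit files and restart the service inside the chroot
systemctl daemon-reload
systemctl restart beegfs-meta

A drop-in created with "systemctl edit beegfs-meta" instead of copying the whole unit file would probably work just as well.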

Dirk Petersen (q-petersen) wrote :

Just a follow-up, as this issue was reported about a month ago: I wonder if anyone from the glibc team could have a look at this code?

torel (torehl) wrote :

We have been seeing the same, but it has escalated to the point that it occurs every morning between 6:30 and 6:40 am.

root@srl-mds1:~# uname -ar
Linux srl-mds1 4.15.0-109-generic #110-Ubuntu SMP Tue Jun 23 02:39:32 UTC 2020 x86_64 x86_64 x86_64 GNU/Linux

root@srl-mds1:~# dpkg -l libc6
Desired=Unknown/Install/Remove/Purge/Hold
| Status=Not/Inst/Conf-files/Unpacked/halF-conf/Half-inst/trig-aWait/Trig-pend
|/ Err?=(none)/Reinst-required (Status,Err: uppercase=bad)
||/ Name Version Architecture Description
+++-====================-===============-===============-=============================================
ii libc6:amd64 2.27-3ubuntu1.2 amd64 GNU C Library: Shared libraries

root@srl-mds1:~# lsb_release -a
No LSB modules are available.
Distributor ID: Ubuntu
Description: Ubuntu 18.04.4 LTS
Release: 18.04
Codename: bionic

root@srl-mds1:~# dpkg -l beegfs-common
Desired=Unknown/Install/Remove/Purge/Hold
| Status=Not/Inst/Conf-files/Unpacked/halF-conf/Half-inst/trig-aWait/Trig-pend
|/ Err?=(none)/Reinst-required (Status,Err: uppercase=bad)
||/ Name Version Architecture Description
+++-====================-===============-===============-=============================================
ii beegfs-common 20:7.1.5 amd64 BeeGFS common files

root@srl-mds1:~# dpkg -S /opt/beegfs/sbin/beegfs-meta
beegfs-meta: /opt/beegfs/sbin/beegfs-meta

root@srl-mds1:~# dpkg -l beegfs-meta
Desired=Unknown/Install/Remove/Purge/Hold
| Status=Not/Inst/Conf-files/Unpacked/halF-conf/Half-inst/trig-aWait/Trig-pend
|/ Err?=(none)/Reinst-required (Status,Err: uppercase=bad)
||/ Name Version Architecture Description
+++-====================-===============-===============-=============================================
ii beegfs-meta 20:7.1.5 amd64 BeeGFS metadata server daemon
root@srl-mds1:~#

Debug trace will follow tomorrow morning.

torel (torehl) wrote :

It sounds like something may have been checked into glibc (libc6 on Ubuntu) between

 libc6:amd64 2.27-3ubuntu1 amd64 GNU C Library: Shared libraries

and

 libc6:amd64 2.27-3ubuntu1.2 amd64

Which check-in is the culprit in https://launchpad.net/ubuntu/+source/glibc/2.27-3ubuntu1.2?
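
The easiest way to compare the two uploads is probably the package changelog (standard apt/dpkg tooling, nothing specific to this bug):

# Changelog of the installed libc6, or fetched from the archive without
# installing anything.
zless /usr/share/doc/libc6/changelog.Debian.gz
apt-get changelog libc6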

torel (torehl) wrote :

gdb of the locked beegfs-meta. I will have to redo it; I forgot stdout.

torel (torehl) wrote :

Debug of this morning's lockup.

torel (torehl) wrote :

A workaround is adding beegfs to PRUNEFS and the BeeGFS mount point to PRUNEPATHS in /etc/updatedb.conf (example below).
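
For reference, the kind of change meant here ("..." stands for the entries already in the file, and /mnt/beegfs is just an example mount point):

# /etc/updatedb.conf -- keep updatedb/mlocate away from BeeGFS
PRUNEFS="... beegfs"
PRUNEPATHS="... /mnt/beegfs"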

Asa Sourdiffe (astat100) wrote :

This bug is affecting our systems as well.

I just wanted to note that torel's workaround of disabling updatedb indexing of BeeGFS may work for specific uses of BeeGFS, but it would not fix the problem for us. Our users regularly generate large numbers of files in cluster jobs, causing the beegfs-meta service to fail.

I hope a real fix is available soon, since this critical bug is nearly a year old.

torel (torehl) wrote :

Any movement?
