Brick SEGFAULTs in 11.1

Bug #2064843 reported by Nick O'Connor
This bug affects 1 person
Affects             Status         Importance  Assigned to       Milestone
glusterfs (Ubuntu)  Status tracked in Oracular
  Noble             Fix Released   Undecided   Bryce Harrington
  Oracular          Fix Released   Undecided   Bryce Harrington

Bug Description

[ Impact ]

 * Users experience brick SEGFAULTs under certain not-yet-understood scenarios. Some reports include a high percentage of small file I/O. I encountered the issue roughly every hour with Minio backed by GlusterFS on ZFS.

 * This bug introduces an increased risk of data loss or corruption depending on the user's configuration and timing of brick crashes.

 * Core dumps from multiple users revealed that the SEGFAULTs are caused by a stack overflow when namespaced inodes are destroyed.

 * The patch removes the recursive call to inode_unref when a namespaced inode is destroyed.
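
To illustrate the failure mode described above, here is a minimal toy model in Python. It is not GlusterFS code: the `Node` class, `make_chain`, `unref_recursive`, and `unref_iterative` are hypothetical names invented for this sketch, and the actual patch may be structured differently. It only shows why a destroy path that recurses into another unref can exhaust the stack when the cascade is long, and how a loop avoids that.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Node:
    """Toy reference-counted object; NOT the GlusterFS inode structure."""
    refs: int = 1
    parent: Optional["Node"] = None
    destroyed: bool = False


def make_chain(length: int) -> Node:
    """Build a chain where each node holds the only reference to its parent."""
    node = Node()
    for _ in range(length - 1):
        node = Node(parent=node)
    return node  # the leaf; dropping it cascades up the whole chain


def unref_recursive(node: Node) -> None:
    """Drop one reference; destroying a node recurses into its parent."""
    node.refs -= 1
    if node.refs == 0:
        node.destroyed = True
        if node.parent is not None:
            unref_recursive(node.parent)  # one extra stack frame per link


def unref_iterative(node: Optional[Node]) -> int:
    """Same cascade written as a loop, so the stack depth stays constant."""
    destroyed = 0
    while node is not None:
        node.refs -= 1
        if node.refs > 0:
            break  # someone else still holds a reference; stop here
        node.destroyed = True
        destroyed += 1
        node = node.parent
    return destroyed


if __name__ == "__main__":
    depth = 50_000
    try:
        unref_recursive(make_chain(depth))
    except RecursionError:
        print("recursive cascade exhausted the stack")
    print(unref_iterative(make_chain(depth)), "nodes destroyed iteratively")
```

In CPython the recursive variant raises RecursionError once the chain is longer than the interpreter's recursion limit. In C the equivalent failure is a stack overflow, which typically surfaces as a SEGFAULT.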

[ Test Plan ]

 * I experienced brick crashes on specific volumes about once per hour. On my system, this issue only impacted a locally mounted volume backing a Minio instance (an S3 API compatible server) used by Restic clients (an incremental backup system with lots of small file creations and deletions). Other volumes served with NFS Ganesha with primarily large file random access never triggered it.

 * I attempted to replicate the workload by running various file system benchmarking tools within their own user namespace (i.e. lots of small file creations and deletions) but was not able to replicate the crash.

 * I've been running the proposed patch since 2024-05-06 and haven't experienced a single crash.

 * The test plan is to run the packages from proposed for at least a day, under the same load as when the bug happened, and confirm that the crashes reported in this bug no longer happen.
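
As a rough sanity check on the one-day window, the sketch below computes the chance of seeing no crash purely by luck. It assumes crashes on an unpatched system arrive as a Poisson process at about one per hour, which is a modelling assumption and not something measured in this bug; `prob_no_crash` is a hypothetical helper.

```python
import math


def prob_no_crash(rate_per_hour: float, hours: float) -> float:
    """P(zero events) for a Poisson process: exp(-rate * duration)."""
    if rate_per_hour < 0 or hours < 0:
        raise ValueError("rate and duration must be non-negative")
    return math.exp(-rate_per_hour * hours)


if __name__ == "__main__":
    # Assumed rate: about one crash per hour, as reported for the unpatched system.
    print(f"{prob_no_crash(1.0, 24.0):.2e}")
```

Under that assumption, a crash-free day on an unpatched system would have a probability of roughly 4e-11, so a clean 24-hour run under the same load is strong evidence that the fix works.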

[ Where problems could occur ]

 * It's conceivable that this patch introduces undesired behavior when inodes are destroyed. However, I consider this unlikely, as __inode_destroy was not recursive before the change which introduced the bug.

[ Other Info ]

 * PR which introduced the bug: https://github.com/gluster/glusterfs/pull/1763
 * PR which added this patch: https://github.com/gluster/glusterfs/pull/4302
 * Issue discussion: https://github.com/gluster/glusterfs/issues/4295

description: updated
Nick O'Connor (nick-oconnor) wrote :

I've recompiled glusterfs locally with the changes. I can confirm the fix linked above addresses the issue.

description: updated
Athos Ribeiro (athos-ribeiro) wrote :

Thanks, Nick.

I am adding this to the server team backlog so someone can start working on this one soon.

If you are willing to drive this one, please let us know so we can aid you through the SRU process and sponsor uploads on your behalf in case it is needed.

Otherwise, someone in the server team will start checking/driving this one soon.

Changed in glusterfs (Ubuntu Oracular):
status: New → Triaged
Changed in glusterfs (Ubuntu Noble):
status: New → Triaged
tags: added: server-todo
Nick O'Connor (nick-oconnor) wrote :

SGTM. I can drive this. Let me know what needs to be done.

description: updated
Changed in glusterfs (Ubuntu Noble):
assignee: nobody → Nick O'Connor (nick-oconnor)
status: Triaged → In Progress
summary: - Gluster 11.1 brick SEGFAULT
+ Brick SEGFAULTs in 11.1
Nick O'Connor (nick-oconnor) wrote :
Nick O'Connor (nick-oconnor) wrote :

This is ready for review/sponsorship.

Changed in glusterfs (Ubuntu Noble):
assignee: Nick O'Connor (nick-oconnor) → nobody
Changed in glusterfs (Ubuntu Oracular):
assignee: nobody → Athos Ribeiro (athos-ribeiro)
Changed in glusterfs (Ubuntu Noble):
assignee: nobody → Athos Ribeiro (athos-ribeiro)
Bryce Harrington (bryce)
Changed in glusterfs (Ubuntu Oracular):
assignee: Athos Ribeiro (athos-ribeiro) → Bryce Harrington (bryce)
Changed in glusterfs (Ubuntu Noble):
assignee: Athos Ribeiro (athos-ribeiro) → Bryce Harrington (bryce)
Bryce Harrington (bryce) wrote :

Hi Nick,

Sorry for the delay - Athos got tied up with another project, but I can help in moving this forward.

Thank you for the links to the upstream history. From that, it appears this only affects glusterfs 11.x, so it correctly targets noble and oracular.

I've reviewed your debdiff and the patch in question. The changes look sensible to me, and you've given attention to the patch DEP3 header. I've tweaked the changelog entry to include a bit more detail, and retargeted it to oracular. The changes are on a branch here:

    https://code.launchpad.net/~bryce/ubuntu/+source/glusterfs/+git/glusterfs/+ref/ubuntu/oracular-devel

The fixed package is now uploaded to oracular:

Vcs-Git: https://git.launchpad.net/~bryce/ubuntu/+source/glusterfs
Vcs-Git-Commit: 55391d03e55f519d17f7698e6e6478caf5f53b78
Vcs-Git-Ref: refs/heads/sru-lp2064843-oracular

gpg: ../glusterfs_11.1-4ubuntu1_source.changes: Valid signature from E603B2578FB8F0FB
Checking signature on .dsc
gpg: ../glusterfs_11.1-4ubuntu1.dsc: Valid signature from E603B2578FB8F0FB
Uploading to ubuntu (via ftp to upload.ubuntu.com):
  Uploading glusterfs_11.1-4ubuntu1.dsc: done.
  Uploading glusterfs_11.1-4ubuntu1.debian.tar.xz: done.
  Uploading glusterfs_11.1-4ubuntu1_source.buildinfo: done.
  Uploading glusterfs_11.1-4ubuntu1_source.changes: done.
Successfully uploaded packages.

Changed in glusterfs (Ubuntu Oracular):
status: Triaged → Fix Committed
Changed in glusterfs (Ubuntu Noble):
status: In Progress → Confirmed
Bryce Harrington (bryce) wrote :

For the noble SRU, I've also sponsored the upload of the package for you:

Uploading to ubuntu (via ftp to upload.ubuntu.com):
  Uploading glusterfs_11.1-4ubuntu0.1.dsc: done.
  Uploading glusterfs_11.1-4ubuntu0.1.debian.tar.xz: done.
  Uploading glusterfs_11.1-4ubuntu0.1_source.buildinfo: done.
  Uploading glusterfs_11.1-4ubuntu0.1_source.changes: done.
Successfully uploaded packages.

To be honest, however, I'm not sure if this will pass SRU review. That team tends to want to see a detailed test plan to ensure the issue is definitively found without the patch, and confirmed absent with it. Even if it's just statistical (e.g. "crashes multiple times a day without this fix, but with the fix it should be crash-free for several days"), that documents how the upload can be verified.

You may also want to sharpen your pencil on the impact statement. "not yet understood" may be a bit of a red flag that there is more diagnosis homework to be done.

A stack trace or crash dump would be appropriate to include in the Other Info section. That can often be useful for other users to help them evaluate if they're affected by this same issue or have something unrelated.

Thanks again for working on this bug!

Changed in glusterfs (Ubuntu Noble):
status: Confirmed → Fix Committed
Nick O'Connor (nick-oconnor) wrote :

Hi Bryce!

I really appreciate you getting this submitted.

I took a day and looked through the code/tests for GlusterFS and couldn't figure out how to reliably trigger the crash. For my system, I experienced brick crashes about once per hour, but only with specific volumes. For instance, the volumes which I serve with NFS Ganesha never triggered this crash (they triggered a different **crash with NFS Ganesha, but not within GlusterFS). Only a locally mounted volume which backs a Minio instance (an S3 API compatible server) used by Restic clients (an incremental backup system) triggered this crash. I attempted to replicate the workload by running various file system benchmarking tools within their own user namespace (i.e. lots of small file creations/deletions) but that never triggered the crash either. I've been running the above patch since 2024-05-06 and haven't experienced a single crash. Unfortunately I'm at a loss for what else to try.

FWIW I filed this because my experience upgrading my NAS machines from 22.04 to 24.04 was extremely poor due to issues with GlusterFS and NFS Ganesha.

**https://bugs.launchpad.net/ubuntu/+source/nfs-ganesha/+bug/2065856

Launchpad Janitor (janitor) wrote :

This bug was fixed in the package glusterfs - 11.1-4ubuntu1

---------------
glusterfs (11.1-4ubuntu1) oracular; urgency=medium

  * Fix stack overflow in __inode_destroy due to recursive calls to
    inode_unref.
    (LP: #2064843)

 -- Nick O'Connor <email address hidden> Wed, 08 May 2024 18:49:32 -0700

Changed in glusterfs (Ubuntu Oracular):
status: Fix Committed → Fix Released
Bryce Harrington (bryce) wrote :

Hi Nick,

Good work on the troubleshooting analysis. While a reproducible set of steps didn't reveal itself, it sounds like you narrowed the conditions down a good bit, à la "locally mounted volume which backs a Minio instance (an S3 API compatible server) used by Restic clients (an incremental backup system) triggered this crash". I'd encourage you to use that to help update the bug description. Even if it can't be boiled down to a paint-by-numbers set of tests, if it can be expressed as a very specific configuration and a statistical measure of crash frequency (once an hour), that can sometimes be sufficient for the SRU reviewers to accept it.

Looks like the oracular fix landed without issue. For the noble fix, unfortunately as I understand it the SRU team is quite swamped right now due to heavy SRU requests for 24.04, and they're estimating the queue size is up to ~6 weeks, so it may take them time before they can process or provide feedback on this one. Meanwhile stay tuned.

Andreas Hasenack (ahasenack) wrote :

@nick-oconnor, thanks for your work on this bug, especially on the troubleshooting.

To proceed with accepting this package into noble-proposed, I would kindly ask you to update the [Test plan] section with the conditions you described in comment #11 that trigger the crash, even if not immediately.

What I mean is that it's fine to have a test plan that acknowledges that the crash is hard to reproduce and just describes the scenario where it happens with frequency X, without being able to detail exactly what is causing it. As long as you have such a deployment, where with the bug the crashes happen periodically, and with the proposed packages you can assert that after a much longer time it does not happen anymore, that's good enough.

So let's say, if you get crashes every hour, we could say that after a day without crashes of this type, the bug is confirmed fixed.

How does that sound?

Changed in glusterfs (Ubuntu Noble):
status: Fix Committed → Incomplete
Nick O'Connor (nick-oconnor) wrote :

Hi Andreas! I've updated the testing section. PTAL.

description: updated
Andreas Hasenack (ahasenack) wrote :

Thanks, I added a small conclusion to make it clear what we expect. Please keep in mind that once the packages in noble-proposed are built, those are the ones that need testing, even if your local builds have the same patch, ok?

Thanks again

description: updated
Nick O'Connor (nick-oconnor) wrote :

Yep! Understood. I'll install them once available.

Andreas Hasenack (ahasenack) wrote : Please test proposed package

Hello Nick, or anyone else affected,

Accepted glusterfs into noble-proposed. The package will build now and be available at https://launchpad.net/ubuntu/+source/glusterfs/11.1-4ubuntu0.1 in a few hours, and then in the -proposed repository.

Please help us by testing this new package. See https://wiki.ubuntu.com/Testing/EnableProposed for documentation on how to enable and use -proposed. Your feedback will aid us getting this update out to other Ubuntu users.

If this package fixes the bug for you, please add a comment to this bug, mentioning the version of the package you tested, what testing has been performed on the package and change the tag from verification-needed-noble to verification-done-noble. If it does not fix the bug for you, please add a comment stating that, and change the tag to verification-failed-noble. In either case, without details of your testing we will not be able to proceed.

Further information regarding the verification process can be found at https://wiki.ubuntu.com/QATeam/PerformingSRUVerification . Thank you in advance for helping!

N.B. The updated package will be released to -updates after the bug(s) fixed by this package have been verified and the package has been in -proposed for a minimum of 7 days.

Changed in glusterfs (Ubuntu Noble):
status: Incomplete → Fix Committed
tags: added: verification-needed verification-needed-noble
Nick O'Connor (nick-oconnor) wrote :

The new version in noble-proposed is "11.1-4ubuntu0.1" vs the patch which has "11.1-4ubuntu1". Is that expected or did I screw up the patch?

Nick O'Connor (nick-oconnor) wrote :
Ubuntu SRU Bot (ubuntu-sru-bot) wrote : Autopkgtest regression report (glusterfs/11.1-4ubuntu0.1)

All autopkgtests for the newly accepted glusterfs (11.1-4ubuntu0.1) for noble have finished running.
The following regressions have been reported in tests triggered by the package:

samba/2:4.19.5+dfsg-4ubuntu9 (amd64)

Please visit the excuses page listed below and investigate the failures, proceeding afterwards as per the StableReleaseUpdates policy regarding autopkgtest regressions [1].

https://people.canonical.com/~ubuntu-archive/proposed-migration/noble/update_excuses.html#glusterfs

[1] https://wiki.ubuntu.com/StableReleaseUpdates#Autopkgtest_Regressions

Thank you!

Bryce Harrington (bryce) wrote :

Yep, I changed it to 11.1-4ubuntu0.1 when I uploaded, per that policy.

Regarding the samba autopkgtest, I've just retriggered it to re-run. We've been having some false positive errors due to infrastructure timeouts, and this looks like one of those.
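
The ordering of the two version strings discussed above can be checked with a small sketch. The function below is a simplified re-implementation of the Debian version comparison rules, written for illustration only; `compare_versions` and its helpers are hypothetical names, and `dpkg --compare-versions` remains the authoritative tool.

```python
def _is_digit(ch: str) -> bool:
    return "0" <= ch <= "9"


def _order(ch: str) -> int:
    """Sort weight of one character; '' stands for end of string."""
    if ch == "" or _is_digit(ch):
        return 0
    if ch == "~":
        return -1  # tilde sorts before everything, even end of string
    if ch.isascii() and ch.isalpha():
        return ord(ch)
    return ord(ch) + 256  # other characters sort after letters


def _digits_end(s: str, pos: int) -> int:
    """Index just past the run of digits starting at pos."""
    while pos < len(s) and _is_digit(s[pos]):
        pos += 1
    return pos


def _compare_fragment(a: str, b: str) -> int:
    """Compare upstream or revision strings: alternate text and number runs."""
    i = j = 0
    while i < len(a) or j < len(b):
        # Non-digit runs are compared character by character.
        while (i < len(a) and not _is_digit(a[i])) or (j < len(b) and not _is_digit(b[j])):
            wa = _order(a[i] if i < len(a) else "")
            wb = _order(b[j] if j < len(b) else "")
            if wa != wb:
                return -1 if wa < wb else 1
            i, j = i + 1, j + 1
        # Digit runs are compared numerically; an empty run counts as 0.
        ei, ej = _digits_end(a, i), _digits_end(b, j)
        na, nb = int(a[i:ei] or "0"), int(b[j:ej] or "0")
        if na != nb:
            return -1 if na < nb else 1
        i, j = ei, ej
    return 0


def _split(version: str):
    """Split 'epoch:upstream-revision'; epoch and revision are optional."""
    epoch, rest = version.split(":", 1) if ":" in version else ("0", version)
    upstream, sep, revision = rest.rpartition("-")
    if not sep:
        upstream, revision = rest, ""
    return int(epoch), upstream, revision


def compare_versions(a: str, b: str) -> int:
    """Return -1, 0, or 1 if a sorts before, equal to, or after b."""
    ea, ua, ra = _split(a)
    eb, ub, rb = _split(b)
    if ea != eb:
        return -1 if ea < eb else 1
    return _compare_fragment(ua, ub) or _compare_fragment(ra, rb)
```

With this sketch, compare_versions("11.1-4ubuntu0.1", "11.1-4ubuntu1") returns -1: the noble SRU version sorts before the oracular one, so an upgrade from noble to oracular still moves to a higher package version.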

Nick O'Connor (nick-oconnor) wrote :

Packages installed and running. Looks good so far. I'll continue to monitor it.

description: updated
Nick O'Connor (nick-oconnor) wrote :

Zero crashes. Service uptime 1d 9h.

Christian Ehrhardt (paelzer) wrote :

Thank you, that satisfies the test case; now it needs a few more days of maturing.

tags: added: verification-done verification-done-noble
removed: verification-needed verification-needed-noble
Andreas Hasenack (ahasenack) wrote :

DEP8 is also clear, and from comments #19 and #22 I conclude that the test was performed from the version in proposed.

Launchpad Janitor (janitor) wrote :

This bug was fixed in the package glusterfs - 11.1-4ubuntu0.1

---------------
glusterfs (11.1-4ubuntu0.1) noble; urgency=medium

  * Fix stack overflow in __inode_destroy due to recursive calls to
    inode_unref.
    (LP: #2064843)

 -- Nick O'Connor <email address hidden> Wed, 08 May 2024 18:49:32 -0700

Changed in glusterfs (Ubuntu Noble):
status: Fix Committed → Fix Released
Andreas Hasenack (ahasenack) wrote : Update Released

The verification of the Stable Release Update for glusterfs has completed successfully and the package is now being released to -updates. Subsequently, the Ubuntu Stable Release Updates Team is being unsubscribed and will not receive messages about this bug report. In the event that you encounter a regression using the package from -updates please report a new bug using ubuntu-bug and tag the bug report regression-update so we can easily find any regressions.
