SEGFAULTs in v4.3 with GlusterFS

Bug #2065856 reported by Nick O'Connor
10
This bug affects 1 person
Affects Status Importance Assigned to Milestone
nfs-ganesha (Ubuntu)
Status tracked in Oracular
Mantic
Won't Fix
Undecided
Unassigned
Noble
Incomplete
Undecided
Unassigned
Oracular
Incomplete
Undecided
Unassigned

Bug Description

[ Impact ]

 * Users experience SEGFAULTs when Ganesha fails to communicate with Gluster, causing the NFS share to disappear.

 * The patch refactors how free_space() is called to avoid the null pointer deference.

[ Test Plan ]

 * I experienced crashes about once per day on volumes with heavy large file random access. I've been running the proposed patch since 2024-05-16 and haven't experienced a single crash.

 * The test plan is to run the packages from proposed for at least 5 days, under the same load as when the bug happened, and confirm that the crashes reported in this bug no longer happen.

[ Other Info ]

 * Commit which added this patch: https://github.com/nfs-ganesha/nfs-ganesha/commit/336abcba0b9e0dae0aadb4657c311d04862f2028
 * Issue discussion: https://github.com/nfs-ganesha/nfs-ganesha/issues/904

description: updated
Revision history for this message
Nick O'Connor (nick-oconnor) wrote :
description: updated
summary: - SEGFAULTs in v4.3
+ SEGFAULTs in v4.3 with GlusterFS
summary: - SEGFAULTs in v4.3 with GlusterFS
+ SEGFAULTs in v4.3 with Gluster
Revision history for this message
Nick O'Connor (nick-oconnor) wrote : Re: SEGFAULTs in v4.3 with Gluster

My initial patch missed a line from the original commit (which doesn't cleanly apply to v4.3). I've fixed the patch and it resolves the issue.

summary: - SEGFAULTs in v4.3 with Gluster
+ SEGFAULTs in v4.3 with GlusterFS
Revision history for this message
Nick O'Connor (nick-oconnor) wrote :

This is ready for review/sponsorship.

description: updated
Changed in nfs-ganesha (Ubuntu):
status: New → In Progress
description: updated
description: updated
description: updated
description: updated
Revision history for this message
Christian Ehrhardt  (paelzer) wrote :

Hi, I come by reviewing patches that have patches to sponsor, trying to help this going forward.
Thank you for your work and linking the related upstream discussions.

The patch is merged upstream but only in 5.0 and later.
There have been some reports that around 4.0 it was still ok.
Comparing that to Ubuntu revisions that means we need to consider fixing mantic, noble and oracular (as per SRU policy we need to fix all to avoid the issue coming back on upgrades).

Earlier releases do not have the issue, and the patch can be dropped once we merge 5.0 or later.

Updating bug states accordingly.

Revision history for this message
Christian Ehrhardt  (paelzer) wrote :

Being an upstream merged change makes it ok for Oracular,
but for the backports it is complex as it is messing with memory management and as you said does not apply as-is.

Some comments and adaptations I'll have to make:
- thanks for having dep-3 patch headers at all, that is often forgotten
- usually if modified from the original it states backport instead of upstream, but I can fix that along the way
- the version in the changelog would not work. Noble and Oracular are both at 4.3-8ubuntu1. Only Oracular can go to the proposed 4.3-8ubuntu2 noble will need to go to 4.3-8ubuntu1.1. If you are curious about this you can read [1] which is an explanation on this more complex than it should be topic :-)
- it is no rule but personal preference to list the related patch in a changelog

Other than that this LGTM, it is complex to test and I need to ask you if you are willing and able to do the same for all releases that need SRU verification (noble + mantic in this case).

OTOH strictly speaking the rule is to fix all "later" releases.
We could as well assume that nfs-ganesha is probably not the most common mantic workload and it goes out of support soon. So instead of messing with yet another version we could stick to what you tested already as oracular and noble have the same.

I'll try to prep MPs with the same content you had but my fixes I mentioned above on top.
You can give me a +1 that this matches what you have and expected and then I could upload it.

[1]: https://github.com/canonical/ubuntu-maintainers-handbook/blob/main/VersionStrings.md

Revision history for this message
Christian Ehrhardt  (paelzer) wrote :
Download full text (3.8 KiB)

Patch review

First of all it is not as complex as it seems, it does the very same multiple times.
But I still found some things I need to ask.

Overall, once you had a look at those 4 points, addressed or explained them the next sponsor coming by (or myself if I do not miss the ping due to your reply) could sponsor that to Oracular and for the Noble SRU.

#1
I saw the context changed around
  PTHREAD_RWLOCK_destroy(&my_fd->fdlock);
which formerly was in many *_free_state functions in the version that is in noble.

That is part of the removing of all custom free functions used as function pointer in the ops struct.
What is left is that instead of through the function pointer like
  op_ctx->fsal_export->exp_ops.free_state(
it now calls directly
  free_state(pfid->state);
And in these calls the right state is selected that needs to be processed.

The new free_state now checks for bad state.
Then only if a custom state free function is registered calls that or to gsh_free as final fallback.
Those are all set to NULL as seen in `grep -Hrn init_state src/* -A1`.
So gsh_free it is which is in src/include/abstract_mem.h

But that is just:
254 static inline void
255 gsh_free(void *p)
256 {
257 »···free(p);
258 }

Maybe I just do not see it yet, but could you explain why the proposed changes would not break the locking which seems to be no more done?
I'm sure the SRU team would want to know as well before accepting.

#2
Further there is a comment lost in the struct of *_state_fd
Upstream added:
  /** state MUST be first to use default free_state */
The comment is not functional and therefore not a huge problem.

But I also see no reason why it is not added, it would only apply other backports harder right?

If there is a reason why that isn't added let explain it.
And if there is none the next revision could have that change to be more similar to the upstream patch right?

#3 the aforementioned versioning and changelog
I think this should be
3.1 oracular
nfs-ganesha (4.3-8ubuntu2) oracular; urgency=medium

  * Fix null pointer ref in free_state (LP: #2065856)
    - d/p/20-fix-null-ptr-ref-in-free-state.diff
3.2 noble
nfs-ganesha (4.3-8ubuntu1.1) noble; urgency=medium

  * Fix null pointer ref in free_state (LP: #2065856)
    - d/p/20-fix-null-ptr-ref-in-free-state.diff

#4 dep-3 headers
As mentioned I'm happy you have them.
But since I came to the point to ask you to refresh your proposal anyway.
On one hand since it is adapted we'd change origin->backport
And on the other I see no reason why we should not use exactly the text from the upstream patch.
Was there a reason to reformat it?
I'd sug...

Read more...

Changed in nfs-ganesha (Ubuntu Mantic):
status: New → Won't Fix
Changed in nfs-ganesha (Ubuntu Noble):
status: New → Confirmed
Changed in nfs-ganesha (Ubuntu Oracular):
status: In Progress → Confirmed
Changed in nfs-ganesha (Ubuntu Noble):
status: Confirmed → Incomplete
Changed in nfs-ganesha (Ubuntu Oracular):
status: Confirmed → Incomplete
Revision history for this message
Christian Ehrhardt  (paelzer) wrote :

State -> incomplete only as I wait for your answers - set it back to new or confirmed once provided.

Thanks in advance!
Christian

Revision history for this message
Christian Ehrhardt  (paelzer) wrote :

I've dropped Ubuntu-sponsors to get off the lists people check.
Once the answers (or even better updated debdiffs) have been provided please add them back to pop up so someone can have a look again.

To post a comment you must log in.
This report contains Public information  
Everyone can see this information.

Other bug subscribers

Remote bug watches

Bug watches keep track of this bug in other bug trackers.