libfuse2: race in fuse_daemonize() causes 'Transport endpoint is not connected' (found with cmsfs-fuse)

Bug #1558967 reported by bugproxy
This bug affects 1 person

Affects                  Status        Importance  Assigned to          Milestone
Ubuntu on IBM z Systems  Fix Released  High        Unassigned
fuse (Ubuntu)            Fix Released  Medium      Dimitri John Ledkov
Xenial                   Fix Released  Low         Dimitri John Ledkov

Bug Description

== Comment: #21 - Hendrik Brueckner - 2016-03-16 06:44:09 ==
Package: libfuse2
Version: 2.9.4-1ubuntu2

The cmsfs-fuse program is used to transfer files from a CMSFS dasd (on z/VM) to Linux. The procedure is to mount, copy files, umount. All commands are issued from within an application over an SSH connection.

The problem is that the copy intermittently fails with "Transport endpoint is not connected". The procedure is as follows:

   #mount cmsfs
   sudo /usr/bin/cmsfs-fuse /dev/dasdb /usr/wave/wavedisk
   # copy file
   /bin/cp -f /usr/wave/wavedisk/WAVEDATA.SCRIPT /usr/wave/wavedata
   /bin/cp: cannot stat '/usr/wave/wavedisk/WAVEDATA.SCRIPT': Transport endpoint is not connected
   #umount
   umount /usr/wave/wavedisk

Because the application uses JSCH to issue the commands, I worked on a non-Java reproducer using SSH.

The problem can be easily re-created with ssh as follows:

root@r3559004:~# ssh -t root@localhost "cmsfs-fuse /dev/disk/by-path/ccw-0.0.0190 /CMSFS"
Connection to localhost closed.
root@r3559004:~# ls /CMSFS
ls: cannot access '/CMSFS': Transport endpoint is not connected

Problem analysis will follow, but note that this is not specific to cmsfs-fuse; the problem might also occur with other fuse file systems that are mounted through an SSH connection.

== Comment: #23 - Hendrik Brueckner - 2016-03-16 07:07:30 ==
After debugging and some code review on the libfuse library, I think that
we identified the root cause. As suggested, the problem is not related
to cmsfs-fuse directly.

The cmsfs-fuse main program calls into the libfuse library using the
fuse_main() function. The fuse_main() function later calls
fuse_daemonize() to fork the daemon process that handles the fuse file
system I/O.

The fuse_daemonize() function looks as follows:

int fuse_daemonize(int foreground)
{
    if (!foreground) {
        int nullfd;

        /*
         * demonize current process by forking it and killing the
         * parent. This makes current process as a child of 'init'.
         */
        switch(fork()) {
        case -1:
            perror("fuse_daemonize: fork");
            return -1;
        case 0:
            break;
        default:
            _exit(0);
        }

        if (setsid() == -1) {
            perror("fuse_daemonize: setsid");
            return -1;
        }

        (void) chdir("/");

        nullfd = open("/dev/null", O_RDWR, 0);
        if (nullfd != -1) {
            (void) dup2(nullfd, 0);
            (void) dup2(nullfd, 1);
            (void) dup2(nullfd, 2);
            if (nullfd > 2)
                close(nullfd);
        }
    }
    return 0;
}

The fuse_daemonize() function calls fork() as usual. The child proceeds with setsid() and then redirects its file descriptors to /dev/null. The parent process simply exits.

The child's setup steps and the parent's exit create a subtle race. It shows up over an SSH connection: the command ssh -t root@localhost "cmsfs-fuse /dev/disk/by-path/ccw-0.0.0190 /CMSFS" runs cmsfs-fuse on an allocated pseudo-terminal device (-t option).

When the parent exits, SSH sees that its command has finished and closes the connection, that is, it closes the master side of the pseudo-terminal. This causes a HUP signal to be sent to the process group on the pseudo-terminal. The child might not have completed the setsid() call yet and is therefore terminated. Note that fuse sets up its signal handlers only later, after fuse_daemonize() has completed.

Even if the child gets the chance to leave its parent's process group and form its own with setsid(), it still has the pseudo-terminal open as stdin, stdout, and stderr. So the pseudo-terminal still behaves as a controlling terminal and might cause a SIGHUP to be issued when the master side is closed.
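
The session relationship described above can be checked with a small C sketch. It does not reproduce the SIGHUP race itself; it only shows that a freshly forked child still shares its parent's session until setsid() returns, which is the window in which a hangup on the pseudo-terminal can reach it. The helper name setsid_detaches_child() is made up for this illustration and is not part of libfuse.

```c
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/*
 * Hypothetical demo helper (not a libfuse function).
 * Returns 1 if a forked child shares the caller's session before
 * setsid() and leads its own session afterwards, 0 if not, -1 on error.
 */
int setsid_detaches_child(void)
{
    pid_t parent_sid = getsid(0);
    pid_t pid = fork();
    int status;

    if (pid == -1)
        return -1;

    if (pid == 0) {
        /* Right after fork() the child is still in the parent's session. */
        int shared_before = (getsid(0) == parent_sid);
        /* A forked child is never a process group leader, so this may succeed. */
        int detached = (setsid() != -1);
        /* After setsid() the child is the leader of a new session. */
        int leader_after = (getsid(0) == getpid());

        _exit(shared_before && detached && leader_after ? 0 : 1);
    }

    if (waitpid(pid, &status, 0) == -1)
        return -1;
    return WIFEXITED(status) && WEXITSTATUS(status) == 0;
}
```

Everything the child does between fork() and the return of setsid() happens while it is still attached to the original session.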

To solve the problem, the parent has to wait until the child (the fuse daemon process) has completed its setup, that is, until it has moved into its own process group with setsid() and closed any file descriptors pointing to the pseudo-terminal.

For example, using a pipe as follows could solve the problem:

The parent waits on the pipe, then exits:

   read(waiter[0], &completed, sizeof(completed));
   _exit(0);

The child signals its completion (after redirecting its file descriptors) with:

   completed = 1;
   write(waiter[1], &completed, sizeof(completed));
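
Put together, the two fragments above could look roughly like the following. This is only a sketch of the pipe idea applied to the fuse_daemonize() code quoted earlier, not the patch that libfuse eventually merged; the names daemonize_synced() and wait_for_child() are hypothetical.

```c
#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Hypothetical helper: block until the child reports completion.
 * Returns 0 if the completion byte arrived, -1 on EOF or read error.
 * Closes readfd in either case.
 */
int wait_for_child(int readfd)
{
    char completed = 0;
    ssize_t n;

    do {
        n = read(readfd, &completed, sizeof(completed));
    } while (n == -1 && errno == EINTR);
    close(readfd);
    return (n == (ssize_t) sizeof(completed) && completed == 1) ? 0 : -1;
}

/* Sketch of fuse_daemonize() with parent/child synchronization (assumed design). */
int daemonize_synced(int foreground)
{
    int waiter[2];
    int nullfd;
    char completed = 1;

    if (foreground)
        return 0;

    if (pipe(waiter) == -1) {
        perror("daemonize_synced: pipe");
        return -1;
    }

    switch (fork()) {
    case -1:
        perror("daemonize_synced: fork");
        close(waiter[0]);
        close(waiter[1]);
        return -1;
    case 0:
        break;
    default:
        /* Parent: exit only after the child has detached (or died). */
        close(waiter[1]);
        _exit(wait_for_child(waiter[0]) == 0 ? 0 : 1);
    }

    /* Child: only the write end of the pipe is needed from here on. */
    close(waiter[0]);

    if (setsid() == -1) {
        perror("daemonize_synced: setsid");
        close(waiter[1]);   /* parent sees EOF and exits non-zero */
        return -1;
    }

    if (chdir("/") == -1)
        perror("daemonize_synced: chdir");

    nullfd = open("/dev/null", O_RDWR, 0);
    if (nullfd != -1) {
        (void) dup2(nullfd, 0);
        (void) dup2(nullfd, 1);
        (void) dup2(nullfd, 2);
        if (nullfd > 2)
            close(nullfd);
    }

    /* Tell the parent that the terminal is no longer in use. */
    if (write(waiter[1], &completed, sizeof(completed)) !=
        (ssize_t) sizeof(completed)) {
        close(waiter[1]);
        return -1;
    }
    close(waiter[1]);
    return 0;
}
```

A caller would use daemonize_synced(0) in place of fuse_daemonize(0). Because the parent blocks in read() until the child has called setsid() and redirected stdin, stdout, and stderr, SSH cannot see the command finish while the child is still attached to the pseudo-terminal.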

== Comment: #24 - Gerald Schaefer - 2016-03-16 08:18:20 ==
The race can also be triggered w/o ssh, by using "setsid -c", and I can also reproduce it w/o cmsfs-fuse but with sshfs:

root@s3545003:~# setsid -c sshfs geraldsc@tuxmaker: sshfs/
root@s3545003:~# ls sshfs
ls: cannot access 'sshfs': Transport endpoint is not connected

bugproxy (bugproxy) wrote : sosreport

Default Comment by Bridge

tags: added: architecture-s39064 bugnameltc-138907 severity-high targetmilestone-inin1604
bugproxy (bugproxy) wrote : dbginfo

Default Comment by Bridge

Changed in ubuntu:
assignee: nobody → Skipper Bug Screeners (skipper-screen-team)
Ubuntu Foundations Team Bug Bot (crichton) wrote :

Thank you for taking the time to report this bug and helping to make Ubuntu better. It seems that your bug report is not filed about a specific source package though, rather it is just filed against Ubuntu in general. It is important that bug reports be filed about source packages so that people interested in the package can find the bugs about it. You can find some hints about determining what package your bug might be about at https://wiki.ubuntu.com/Bugs/FindRightPackage. You might also ask for help in the #ubuntu-bugs irc channel on Freenode.

To change the source package that this bug is filed about visit https://bugs.launchpad.net/ubuntu/+bug/1558967/+editstatus and add the package name in the text box next to the word Package.

[This is an automated message. I apologize if it reached you inappropriately; please just reply to this message indicating so.]

tags: added: bot-comment
Kevin W. Rudd (kevinr)
affects: ubuntu → fuse (Ubuntu)
dann frazier (dannf) wrote :
Changed in fuse (Ubuntu):
assignee: Skipper Bug Screeners (skipper-screen-team) → Dimitri John Ledkov (xnox)
bugproxy (bugproxy) wrote : Comment bridged from LTC Bugzilla

------- Comment From <email address hidden> 2016-03-21 09:22 EDT-------
(In reply to comment #24)
> The race can also be triggered w/o ssh, by using "setsid -c", and I can also
> reproduce it w/o cmsfs-fuse but with sshfs:
>
> root@s3545003:~# setsid -c sshfs geraldsc@tuxmaker: sshfs/
> root@s3545003:~# ls sshfs
> ls: cannot access 'sshfs': Transport endpoint is not connected

I verified that Hendrik's suggested fix (using a pipe for synchronization) does fix the problem for cmsfs-fuse. For sshfs, though, there seem to be more problems, probably inside sshfs itself rather than in the generic fuse code. The "Transport endpoint" issue is also fixed for sshfs, but now nothing is mounted. Since sshfs does some forking on its own, there is probably some issue left within sshfs, but that would be a different story.

So, for the issue reported here, with cmsfs-fuse, the pipe synchronization fix would help.

bugproxy (bugproxy) wrote :

------- Comment From <email address hidden> 2016-04-05 06:55 EDT-------
Canonical, what is the target for getting this fix integrated? Many thx in advance

Dimitri John Ledkov (xnox) wrote :

I'm not sure what we are meant to integrate here.

Can Wave / JSCH maintain & reuse a connection, instead of establishing a new one each time, so that the fuse process has extra time to win its race?

Also for Wave I hope you are not using d-i nor Ubuntu Server ISO, and instead working with us to integrate and use curtin and/or cloud-image based installation for rapid provisioning.

bugproxy (bugproxy) wrote :

------- Comment From <email address hidden> 2016-04-13 08:21 EDT-------
hi,

at the moment Wave is using nohup to work around the issue.
we would like to remove that from the code and use cmsfs-fuse directly.
we are not closing the ssh connection for each command but we are using a different channel for each command (which also causes the problem).
we would like the fix to go into the 16.04 GA release so we can fix the code before sending to customers.

i am not sure what you mean...
Wave has 2 ways to install Ubuntu (and RHEL/SLES).
1. from a CD (mounted somewhere and available via ftp)
Wave will use the parmfile to pass all the relevant parameters to the installation (without the user accessing z/VM) up to the point the user can SSH into the installer and continue the installation on his own.
2. wave also supports cloning
a user can install once via the CD and then clone the guest (which will take a few seconds to perform).

Dimitri John Ledkov (xnox) wrote :

Ok, but when you say "the fix" what do you mean? =) there is no patch available on the bug report, and libfuse upstream acknowledges the bug, but without a solution present. Is there a patch for cmsfs-fuse to make at least that one be less-racy?

Does IBM Wave manage a repository mirror too? Installed systems should be configured to get security updates from the master location ports.ubuntu.com and other updates from a country mirror, e.g. us.ports.ubuntu.com. Otherwise something needs to manage a mirror of that; users who only have access to the packages available from the ISO will have a limited, sub-par experience of Ubuntu without any security or update support. Note that upgrading or receiving security updates from ISOs is not supported.

bugproxy (bugproxy) wrote :

------- Comment From <email address hidden> 2016-04-13 09:31 EDT-------
i was hoping someone will create a patch before 16.04 is released...
i guess Hendrik Brueckner might be able to answer the question if cmsfs-fuse can be less racy to avoid the defect...

Wave will not manage online repository mirrors... it is the customer's responsibility...
Wave tries to bridge the gap in z/VM knowledge needed to install Linux on z.

Dimitri John Ledkov (xnox) wrote :

Hm, but there is this comment earlier on "I verified that Hendrik's suggested fix (using a pipe for synchronization) does fix the problem for cmsfs-fuse." -> is there code to do that, or is that the hack that is in place in Wave right now?

bugproxy (bugproxy) wrote :

------- Comment From <email address hidden> 2016-04-13 10:32 EDT-------
that was sent by Gerald Schaefer.
this is not a fix in Wave.
the fix in Wave is to run cmsfs-fuse with nohup...

bugproxy (bugproxy) wrote :

------- Comment From <email address hidden> 2016-04-13 12:51 EDT-------
(In reply to comment #43)
> Hm, but there is this comment earlier on "I verified that Hendrik's suggested
> fix (using a pipe for synchronization) does fix the problem for cmsfs-fuse."
> -> is there code to do that, or is that the hack that is in place in Wave
> right now?

The "pipe synchronization" would be an example fix for the bug in fuse, as explained by Hendrik at the beginning of this bugzilla. The race in fuse_daemonize() leads to the issues that we see in Wave using cmsfs-fuse, but there is no bug in either Wave or cmsfs-fuse, only in libfuse itself.

As noted earlier, the race in libfuse that leads to the "Transport endpoint is not connected" error can also be reproduced w/o Wave or cmsfs-fuse, by using "setsid -c" and sshfs (and probably any other fuse fs):

root@s3545003:~# setsid -c sshfs geraldsc@tuxmaker: sshfs/
root@s3545003:~# ls sshfs
ls: cannot access 'sshfs': Transport endpoint is not connected

When applying the suggested "pipe synchronization" fix from Hendrik to libfuse, it solves the issue for Wave and cmsfs-fuse. For sshfs, it also fixes the "Transport endpoint is not connected" error, but unfortunately sshfs itself also does some forking on its own, which also seems to be broken, which is why using sshfs as in the above example still doesn't work as expected (but at least now w/o the "Transport endpoint" error).

Unfortunately we have no clearance to submit patches directly to libfuse, so there is only the "general hint" about using a pipe for synchronization between parent and child inside fuse_daemonize(). I have just verified that such a patch for libfuse would fix the issue, and now we hope that someone could implement and integrate it into libfuse (and then into Ubuntu).

Dimitri John Ledkov (xnox) wrote :

So shall we send you an invoice for libfuse upstream development work?! =) Cause it would be helpful to fix that bug properly upstream and for as many fuse clients as possible.

Ideally, you should have clearance for patches flow between yourself and Ubuntu. Ideally both ways. Would an invoice help to clear and release existing patches?

Frank Heimes (fheimes)
Changed in ubuntu-z-systems:
importance: Undecided → High
Changed in fuse (Ubuntu):
status: New → Incomplete
Changed in ubuntu-z-systems:
status: New → Incomplete
bugproxy (bugproxy) wrote : sosreport

Default Comment by Bridge

bugproxy (bugproxy) wrote : Comment bridged from LTC Bugzilla

------- Comment From <email address hidden> 2016-06-21 03:25 EDT-------
I have created a github pull request to correct this issue (after requesting and receiving clearance to do so :-). The pull request is:

https://github.com/libfuse/libfuse/pull/55

The PR is now merged:
https://github.com/libfuse/libfuse/commit/6189312b0c530792657556b266546cd2edb23d4a

Dimitri John Ledkov (xnox) wrote :

#winning

reopening bug report, targeting to yakkety & xenial.

Changed in ubuntu-z-systems:
status: Incomplete → Triaged
Changed in fuse (Ubuntu):
status: Incomplete → Triaged
importance: Undecided → Medium
Changed in fuse (Ubuntu Xenial):
status: New → Triaged
assignee: nobody → Dimitri John Ledkov (xnox)
Changed in fuse (Ubuntu Xenial):
importance: Undecided → Low
Changed in fuse (Ubuntu):
status: Triaged → Fix Committed
Launchpad Janitor (janitor) wrote :

This bug was fixed in the package fuse - 2.9.4-1ubuntu4

---------------
fuse (2.9.4-1ubuntu4) yakkety; urgency=medium

  * Cherrypick upstream patch for parent to wait until daemon child
    process is ready. LP: #1558967.

 -- Dimitri John Ledkov <email address hidden> Mon, 27 Jun 2016 12:19:25 +0100

Changed in fuse (Ubuntu):
status: Fix Committed → Fix Released
Martin Pitt (pitti) wrote : Please test proposed package

Hello bugproxy, or anyone else affected,

Accepted fuse into xenial-proposed. The package will build now and be available at https://launchpad.net/ubuntu/+source/fuse/2.9.4-1ubuntu3.1 in a few hours, and then in the -proposed repository.

Please help us by testing this new package. See https://wiki.ubuntu.com/Testing/EnableProposed for documentation how to enable and use -proposed. Your feedback will aid us getting this update out to other Ubuntu users.

If this package fixes the bug for you, please add a comment to this bug, mentioning the version of the package you tested, and change the tag from verification-needed to verification-done. If it does not fix the bug for you, please add a comment stating that, and change the tag to verification-failed. In either case, details of your testing will help us make a better decision.

Further information regarding the verification process can be found at https://wiki.ubuntu.com/QATeam/PerformingSRUVerification . Thank you in advance!

Changed in fuse (Ubuntu Xenial):
status: Triaged → Fix Committed
tags: added: verification-needed
Frank Heimes (fheimes)
Changed in ubuntu-z-systems:
status: Triaged → Fix Committed
bugproxy (bugproxy) wrote : Comment bridged from LTC Bugzilla

------- Comment From <email address hidden> 2016-07-20 05:04 EDT-------
Hi,

i was able to test the new fix and everything works fine now.

thank you!

bugproxy (bugproxy)
tags: added: verification-done
removed: verification-needed
Launchpad Janitor (janitor) wrote :

This bug was fixed in the package fuse - 2.9.4-1ubuntu3.1

---------------
fuse (2.9.4-1ubuntu3.1) xenial; urgency=medium

  * Cherrypick upstream patch for parent to wait until daemon child
    process is ready. LP: #1558967.

 -- Dimitri John Ledkov <email address hidden> Mon, 27 Jun 2016 12:19:25 +0100

Changed in fuse (Ubuntu Xenial):
status: Fix Committed → Fix Released
Frank Heimes (fheimes)
Changed in ubuntu-z-systems:
status: Fix Committed → Fix Released
Adam Conrad (adconrad) wrote : Update Released

The verification of the Stable Release Update for fuse has completed successfully and the package has now been released to -updates. Subsequently, the Ubuntu Stable Release Updates Team is being unsubscribed and will not receive messages about this bug report. In the event that you encounter a regression using the package from -updates please report a new bug using ubuntu-bug and tag the bug report regression-update so we can easily find any regressions.
