mksh exits with status 0 upon error reading script

Bug #2002044 reported by Sam Kendall
6
This bug affects 1 person
Affects Status Importance Assigned to Milestone
mksh
New
Undecided
Unassigned

Bug Description

A customer reported that a large, long-running script exited early, but with status 0. This should have been impossible. We eventually figured out that the cluster filesystem containing the script had gone offline while the script was executing. The shell, reading a block of the script itself, got ESTALE from read(2). The bug is that the shell swallowed the error, treating it like EOF. It should have exited with an error status.

To reproduce the bug easily, you need a modern enough Linux that strace can inject errors. In the following session, we create a trivial script and run it under strace twice, the first time to see which read(2) calls read the script, and the second time to inject the error.

$ echo 'echo hello' > s.sh
$ strace -e trace=read mksh s.sh
read(3, "\177ELF\2\1\1\3\0\0\0\0\0\0\0\0\3\0>\0\1\0\0\0\300A\2\0\0\0\0\0"..., 832) = 832
read(11, "echo hello\n", 512) = 11
hello
read(11, "", 512) = 0
+++ exited with 0 +++
$ strace -e trace=read -e inject=read:error=ESTALE:when=2 mksh s.sh
read(3, "\177ELF\2\1\1\3\0\0\0\0\0\0\0\0\3\0>\0\1\0\0\0\300A\2\0\0\0\0\0"..., 832) = 832
read(11, 0x55629ae2aee0, 512) = -1 ESTALE (Stale file handle) (INJECTED)
+++ exited with 0 +++
$

Variations on that last command are to substitute EIO (a more severe error) for ESTALE, or when=3 for when=2. These variations also display the buggy behavior. The latter variation models something more like the original report we got from our customer, where the script was larger and the error occurred not in reading the first block of the file, but in reading a later block.

The OS is Ubuntu 20.04.5 LTS and KSH_VERSION is @(#)MIRBSD KSH R58 2020/03/27. We also see the bug in ksh93 on RHEL, in fact in every version of mksh and ksh93 we've tested; and in some but not all versions of bash. The bash bug report is https://savannah.gnu.org/support/index.php?110763.

It is common sense that this issue is a bug: one cannot build reliable software if an I/O error might cause a program to exit and report success. POSIX (https://pubs.opengroup.org/onlinepubs/9699919799/utilities/sh.html) mandates an exit status in the range 1-125 for this condition: "A non-interactive shell detected an error other than command_file not found or executable, including but not limited to syntax, redirection, or variable assignment errors." A diagnostic message is a good idea, too, but not mandated by POSIX, at least not in the linked section "Consequences of Shell Errors".

How likely is the bug to reoccur without fault injection? Unlikely in a script smaller than mksh's read block size, 512 bytes. The bug becomes likely if the script lives on an unreliable filesystem, and is larger than one block, and runs for a long time in a portion of the script before the last block. These are not unusual conditions.

Revision history for this message
Thorsten Glaser (mirabilos) wrote :

Good catch!

And thanks for including a reproducer. I didn’t know Linux strace could even do that.

I wonder why this wasn’t caught in so many shells.

In the sense of cross-shell coordination, what errorlevel do you suggest? I have a slight dislike for the small numbers here, but arguments could be held that the large numbers are more likely to have special meaning for the scripts, too.

In fact, interrupting by being unable to read the next part of the script is a rather severe issue. I’m tempted to argue that killing myself with a suitable signal (PIPE? BUS? HUP?) and, if ignored, simulating the corresponding 128+signo exit would be better.

Revision history for this message
Thorsten Glaser (mirabilos) wrote :

getsc_line() in lex.c does treat a read error (other than EINTR, which retries) as EOF (which matches what Chet describes), but without an explicit comment indicating its deliberateness. Hmm…

Revision history for this message
Sam Kendall (sckendall) wrote :

For the shell's exit status, I suggest 125. My rationale is that grep, the only significant precedent I know of, assigns special meaning to 1, so avoid it; and 125 has an appropriate system-ish flavor. But it doesn't matter much.

About signaling and exiting with 128+signalno: that's not a bad thought, given the asynchronous nature of the error, but I'd argue that it would mostly confuse people.

Revision history for this message
Thorsten Glaser (mirabilos) wrote :

Forwarded to: https://<email address hidden>/msg10562.html

To post a comment you must log in.
This report contains Public information  
Everyone can see this information.

Other bug subscribers

Remote bug watches

Bug watches keep track of this bug in other bug trackers.