fsck.xfs can drop into a rescue shell when fsck.mode=force is set
Affects | Status | Importance | Assigned to | Milestone | |
---|---|---|---|---|---|
xfsprogs (Ubuntu) |
New
|
Undecided
|
Unassigned |
Bug Description
Description: Ubuntu 22.04 LTS
Release: 22.04
Package: xfsprogs 5.13.0-1ubuntu2.1
Kernel cmdline containing: fsck.mode=force fsck.repair=yes
[Issue]
LP #2081163 (https:/
commit 19dde7fac0f38af
Author: Gerald Yang <email address hidden>
Date: Tue Aug 13 15:25:51 2024 +0800
fsck.xfs: fix fsck.xfs run by different shells when fsck.mode=force is set
When fsck.mode=force is specified in the kernel command line, fsck.xfs
is executed during the boot process. However, when the default shell is
not bash, $PS1 should be a different value, consider the following script:
cat ps1.sh
echo "$PS1"
run ps1.sh with different shells:
ash ./ps1.sh
$
bash ./ps1.sh
dash ./ps1.sh
$
ksh ./ps1.sh
zsh ./ps1.sh
On systems like Ubuntu, where dash is the default shell during the boot
process to improve startup speed. This results in FORCE being incorrectly
set to false and then xfs_repair is not invoked:
if [ -n "$PS1" -o -t 0 ]; then
fi
Other distros may encounter this issue too if the default shell is set
to anoother shell.
Check "-t 0" is enough to determine if we are in interactive mode, and
xfs_repair is invoked as expected regardless of the shell used.
Fixes: 04a2d5dc ("fsck.xfs: allow forced repairs using xfs_repair")
That fix was originally applied upstream for 6.11.0, but it appears to implicitly depend on an earlier commit from xfsprogs 6.1.0:
commit 79ba1e15d80eba3
Author: Srikanth C S <email address hidden>
Date: Tue Dec 13 22:45:43 2022 +0530
fsck.xfs: mount/umount xfs fs to replay log before running xfs_repair
After a recent data center crash, we had to recover root filesystems
on several thousands of VMs via a boot time fsck. Since these
machines are remotely manageable, support can inject the kernel
command line with 'fsck.mode=force fsck.repair=yes' to kick off
xfs_repair if the machine won't come up or if they suspect there
might be deeper issues with latent errors in the fs metadata, which
is what they did to try to get everyone running ASAP while
journal replay in case of a crash.
fsck.xfs does xfs_repair -e if fsck.mode=force is set. It is
possible that when the machine crashes, the fs is in inconsistent
state with the journal log not yet replayed. This can drop the machine
into the rescue shell because xfs_fsck.sh does not know how to clean the
log. Since the administrator told us to force repairs, address the
deficiency by cleaning the log and rerunning xfs_repair.
Run xfs_repair -e when fsck.mode=force and repair=auto or yes.
Replay the logs only if fsck.mode=force and fsck.repair=yes. For
other option -fa and -f drop to the rescue shell if repair detects
any corruptions.
Now that xfs_repair is actually invoked when fsck.mode=force is set, it's possible for xfs_repair to fail at boot time and drop into a rescue shell.
This behaviour is seen with xfsprogs 5.13.0-1ubuntu2.1 in Jammy. It would presumably also affect 5.3.0-1ubuntu2.1 in Focal, but I haven't tried it personally.
[Potential Solution]
I applied 79ba1e15d80eba3
Booting with fsck.mode=force fsck.repair=yes, fsck.xfs correctly mounts, replays the log, unmounts, repairs, and mounts the filesystem at boot.
Looking at the original patch for 79ba1e15d80eba3 aff4396f44629eb 8960722d36, there may be a potential race condition as well:
echo "Replaying log for $DEV"
mkdir -p /tmp/repair_mnt || exit 1
...
rm -d /tmp/repair_mnt
If there are two or more XFS filesystems, each invoking a parallel mount/repair sequence to the singular /tmp/repair_mnt, it's possible they interfere with one another.
It probably requires an additional upstream change with something like:
TMP_ DIR=$(mktemp -d) || exit 1
...
rm -d "$TMP_DIR"