jfs filesystem corruption after power failure, fast reboot sequences (stale NFS lock)
Affects | Status | Importance | Assigned to | Milestone
---|---|---|---|---
jfsutils (Ubuntu) | New | Undecided | Unassigned |
Bug Description
Binary package hint: jfsutils
Power failure leads to filesystem corruption and data loss, probably because fsck.jfs does not correctly detect the damage on the first run.
See also the jfs mailing list discussion: http://<email address hidden>
The problem is readily reproducible on a minimal Ubuntu Lucid install in VMware. Corruption can be detected using ls -alR, which reports a "stale NFS lock" on the jfs filesystem. I have not found a pattern in which directory or file inodes are affected. Even unmodified files can be lost: some vanish without a trace (e.g. /etc/resolv.conf or /usr/local/share), others are reconnected to /lost+found, and others show up as "stale NFS lock" inodes in /lost+found, so one knows that an inode was lost but not its content.
It is not clear whether a single reboot triggers the corruption (fsck fails to detect it, the mount therefore succeeds, and the error only becomes visible later), or whether the sequence is:
corruption -> invalid fsck repair -> modifications cause secondary corruption -> another invalid fsck repair makes the corruption visible
To verify this, one would have to run the reproducer on a completely sane (fresh) filesystem quite often to find the minimal number of successive reboots to trigger the problem.
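One way to count the reboot cycles would be a small counter kept on a partition the test never touches. A minimal sketch (the function name, counter path, and placement in DiskTest.sh are assumptions, not part of the original reproducer):

```shell
# Hypothetical helper: persist a reboot-cycle count across hard reboots so
# the cycle at which corruption first appears can be recorded. The counter
# file must live on a filesystem that the test does not exercise.
bump_reboot_counter() {
    counter_file="$1"
    count=$(cat "$counter_file" 2>/dev/null || echo 0)
    count=$((count + 1))
    echo "$count" > "$counter_file"
    sync                       # make sure the count survives the hard reboot
    echo "$count"
}
# Example: call this near the top of the test script, e.g.
#   cycle=$(bump_reboot_counter /var/lib/disktest-cycles)
```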
To reproduce it on lucid:
* Create init script to trigger test on each reboot:
# cat /etc/init/
description "Start Disktest"
start on filesystem
task
script
/root/
end script
* Format a small disk partition
I performed this step only to produce a smaller 20MB corrupted image at roughly 60% disk usage; the corruption also occurs on the root partition, so multiple test runs may be needed to get a result with "data" corruption rather than root-partition corruption.
dd if=/dev/zero of=/dev/sdb1
mkfs.jfs -f /dev/sdb1
mkdir /data
mount /dev/sdb1 /data
# fill data approx 60%, create a dump of this data, adjust tar name in DiskTest.sh
umount /data
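The "fill data approx 60%" step is not spelled out in the report. One possible sketch (the function name, file sizes, directory layout, and tar path are all assumptions):

```shell
# Hypothetical helper: fill a mount point with small random files until it
# holds roughly target_mb of data, then archive it so the test script can
# restore the same tree on every cycle.
fill_and_dump() {
    dir="$1"        # e.g. /data
    target_mb="$2"  # e.g. 12 (about 60% of a 20MB partition)
    tar_out="$3"    # e.g. a tar file under /root/DiskTest
    i=0
    while [ "$(du -sm "$dir" | cut -f1)" -lt "$target_mb" ]; do
        mkdir -p "$dir/dir$((i % 10))"
        dd if=/dev/urandom of="$dir/dir$((i % 10))/file$i" \
           bs=64k count=1 2>/dev/null
        i=$((i + 1))
    done
    tar -C "$dir" -cf "$tar_out" .
}
```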
* Add the test script
# cat /root/DiskTest/
#!/bin/bash -e
echo "$(date): Starting disktest" >&2
mountDev=/dev/sdb1
if ! fsck.jfs "${mountDev}" || ! jfs_fsck -n "${mountDev}"; then
echo "Fsck failed!" >&2
exit 1
fi
mount "${mountDev}" /data
if ls -alR / 2>&1 | grep -E -e '(\?|stale )'; then
echo "Damage marker found" >&2
exit 1
fi
rm -rf /data/usr/bin/*d*
tar -C /data -xf /root/DiskTest/
umount /data
echo "Killing system with hard reboot"
echo "b" > /proc/sysrq-trigger
* Start test
start DiskTest
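Once a run ends with the damage marker, the small test partition makes it practical to keep the corrupted state for offline analysis (e.g. with jfs_debugfs) or for attaching to this report. A sketch, assuming the device and output paths from the reproducer above:

```shell
# Hypothetical helper: copy a block device (or any file) to a compressed
# image so the corrupted state can be inspected later without the disk.
dump_image() {
    src="$1"   # e.g. /dev/sdb1 (must be unmounted)
    out="$2"   # e.g. /root/corrupt-sdb1.img
    dd if="$src" of="$out" bs=1M 2>/dev/null
    gzip -f -9 "$out"          # produces "$out.gz"
}
# Example: dump_image /dev/sdb1 /root/corrupt-sdb1.img
```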
The problem also occurs after replacing fsck.jfs and jfs_fsck with version 1.1.15 from jfsutils trunk. It seems unrelated to the jfs root node corruption issue, which does not produce stale NFS locks but destroys the root directory after multiple mount/unmount cycles alone.
$ lsb_release -rd
Description: Ubuntu 10.04.2 LTS
Release: 10.04
$ apt-cache policy jfsutils
jfsutils:
Installed: 1.1.12-2.1
Candidate: 1.1.12-2.1
Version table:
*** 1.1.12-2.1 0
500 http://
100 /var/lib/