jfs filesystem corruption after power failure, fast reboot sequences (stale NFS lock)

Bug #754495 reported by Roman Fiedler on 2011-04-08
This bug affects 1 person
Affects: jfsutils (Ubuntu)
Importance: Undecided
Assigned to: Unassigned

Bug Description

Binary package hint: jfsutils

Power failure leads to file system corruption and data loss, probably because fsck.jfs does not correctly detect the damage on its first run.

See also the jfs mailing list discussion: http://<email address hidden>/msg01682.html

The problem reproduces reliably on a minimal Ubuntu lucid install in VMware. Corruption can be detected using ls -alR, which reports a "stale NFS lock" on the jfs filesystem. I haven't found a pattern in which directory or file inodes are affected. Even unmodified files can be lost: some vanish without a trace (e.g. /etc/resolv.conf or /usr/local/share), some are reconnected to /lost+found, and others show up in /lost+found as "stale NFS lock" inodes, so one knows that an inode was lost but not its content.

It is not clear whether a reboot triggers the corruption and fsck fails to detect it, so that the mount succeeds and the error only becomes visible afterwards, or whether the sequence is:
corruption - fsck invalid repair - modifications cause secondary corruption - fsck invalid repair makes corruption visible

To verify this, one would have to run the reproducer many times, each time starting from a completely sane (fresh) filesystem, to find the minimal number of successive reboots needed to trigger the problem.

To reproduce it on lucid:

* Create init script to trigger test on each reboot:

# cat /etc/init/DiskTest.conf
description "Start Disktest"

start on filesystem

task

script
  /root/DiskTest/DiskTest.sh >> /root/DiskTest/DiskTest.log 2>&1
end script

* Format a small disk partition

I only did this step to produce a smaller 20MB corrupted image with 60% disk usage. Corruption also occurs on the root partition, so multiple test runs may be needed to get a result where the data partition, rather than the root partition, is the one that is corrupted.

dd if=/dev/zero of=/dev/sdb1
mkfs.jfs -f /dev/sdb1
mkdir /data
mount /dev/sdb1 /data
# fill data approx 60%, create a dump of this data, adjust tar name in DiskTest.sh
umount /data
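
The fill-and-dump step, left as a comment in the commands above, could be done roughly as follows while /data is still mounted. This is only a sketch: the helper names usage_percent and make_dump are hypothetical, and the choice of source tree used to fill the partition is an assumption (the test script only implies that the dump contains a usr/bin directory).

```shell
#!/bin/bash
# Hypothetical helper: print how full the filesystem holding a directory is,
# as a bare number (the "Capacity" column of POSIX-format df output).
usage_percent() {
  df -P "$1" | awk 'NR == 2 { sub(/%/, "", $5); print $5 }'
}

# Hypothetical helper: archive everything below data_dir into tar_path,
# with member names relative to data_dir so that "tar -C /data -xf" restores it.
make_dump() {
  local data_dir="$1" tar_path="$2"
  tar -C "$data_dir" -cf "$tar_path" .
}

# Example (as root, /dev/sdb1 mounted on /data; the source files are an assumption):
#   mkdir -p /data/usr/bin
#   cp -a /usr/bin/[a-f]* /data/usr/bin/    # copy until usage is near 60%
#   usage_percent /data
#   make_dump /data /root/DiskTest/2011-04-08-ContentOriginal.tar
```

Keeping the tar file outside /data matters here, since the archive is what restores the deleted files on every boot.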

* Add the test script

# cat /root/DiskTest/DiskTest.sh
#!/bin/bash -e

echo "$(date): Starting disktest" >&2

mountDev=/dev/sdb1
if ! fsck.jfs "${mountDev}" || ! jfs_fsck -n "${mountDev}"; then
  echo "Fsck failed!" >&2
  exit 1
fi

mount "${mountDev}" /data

if ls -alR / 2>&1 | grep -E -e '(\?|stale )'; then
  echo "Damage marker found" >&2
  exit 1
fi

rm -rf /data/usr/bin/*d*
tar -C /data -xf /root/DiskTest/2011-04-08-ContentOriginal.tar
umount /data

echo "Killing system with hard reboot"
echo "b" > /proc/sysrq-trigger
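
The damage check in the script above can be isolated into a small function, which makes the pattern easier to try against sample listings. This is a sketch, not the script as run for this report: the function name has_damage_marker is hypothetical, and the -i flag is an addition, made on the assumption that ls may print the error with a capital "Stale" on some systems.

```shell
#!/bin/bash
# Reads a recursive listing on stdin and prints the lines that look damaged.
# Exit status is 0 if at least one such line was found.
# Pattern: a literal "?" (ls prints "?" for attributes it cannot stat) or
# the word "stale" followed by a space, in any letter case.
# The -i flag is an assumption added here; the original pattern is case-sensitive.
has_damage_marker() {
  grep -E -i -e '(\?|stale )'
}

# Usage mirroring the original step:
#   if ls -alR / 2>&1 | has_damage_marker; then
#     echo "Damage marker found" >&2
#   fi
```

File names that legitimately contain a question mark or the word "stale " would also match, so a hit should be confirmed by looking at the printed lines.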

* Start test

start DiskTest
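
To estimate how many cycles were needed before the damage became visible, the log written by the init job can be summarised. The sketch below assumes only the messages that DiskTest.sh itself prints ("Starting disktest", "Damage marker found", "Fsck failed!"); the function name count_runs_before_failure is hypothetical.

```shell
#!/bin/bash
# Count the test cycles recorded in a DiskTest log up to the first failure.
# Prints "<runs> failed" or "<runs> ok". The run that detects the damage is
# included in the count, so the number of hard reboots before it is runs - 1.
count_runs_before_failure() {
  local log="$1"
  awk '
    /Starting disktest/                { runs++ }
    /Damage marker found|Fsck failed!/ { failed = 1; exit }
    END { printf "%d %s\n", runs, (failed ? "failed" : "ok") }
  ' "$log"
}

# Example:
#   count_runs_before_failure /root/DiskTest/DiskTest.log
```

Because the machine is reset without a sync, the last log lines of a cycle may not reach the disk, so the count should be read as a lower bound.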

The problem also occurs after replacing fsck.jfs and jfs_fsck with version 1.1.15 from jfsutils trunk. It seems to be unrelated to the jfs root node corruption, which does not produce stale NFS locks but destroys the root directory merely by mounting and unmounting multiple times.

$ lsb_release -rd
Description: Ubuntu 10.04.2 LTS
Release: 10.04

$ apt-cache policy jfsutils
jfsutils:
  Installed: 1.1.12-2.1
  Candidate: 1.1.12-2.1
  Version table:
 *** 1.1.12-2.1 0
        500 http://archive.ubuntu.com/ubuntu/ lucid/main Packages
        100 /var/lib/dpkg/status
