Data loss on ext3, maybe related to data=journal

Bug #485562 reported by Jürgen Kreileder
This bug affects 1 person
Affects: linux (Ubuntu)
Status: Won't Fix
Importance: Medium
Assigned to: Unassigned
Milestone: (none)

Bug Description

I'm currently testing a backup scheme on a new karmic installation. The procedure worked flawlessly on jaunty and older Ubuntu/Debian distributions, although those machines used hardware RAID while the new machine uses software RAID. With karmic, however, I'm experiencing data loss (at least on the designated backup partition).

The partition in question gets mounted once per hour. The respective entry in /etc/fstab is

UUID="7420cd8f-dd47-4fdb-b64e-4fd02f945e43" /srv/backup ext3 noatime,nodiratime,user_xattr,acl,noauto,nodev,nosuid,data=journal 0 2
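To confirm which journaling mode is actually in effect once the partition is mounted, the mount table can be inspected. The following is only a sketch: `data_mode_of` is a hypothetical helper name, and whether the kernel lists the `data=` option in /proc/mounts at all depends on the kernel version, so "(not listed)" does not prove that the default mode is active.

```shell
#!/bin/sh
# Hypothetical helper: print the data= mode recorded for a mount point.
#   $1  mount point, e.g. /srv/backup
#   $2  mount table to read (defaults to /proc/mounts)
# Prints nothing if the mount point does not appear in the table.
data_mode_of() {
    awk -v mp="$1" '
        $2 == mp {
            # Field 4 holds the comma-separated mount options.
            n = split($4, opts, ",")
            mode = "(not listed)"
            for (i = 1; i <= n; i++)
                if (opts[i] ~ /^data=/) mode = substr(opts[i], 6)
            found = mode
        }
        END { if (found != "") print found }
    ' "${2:-/proc/mounts}"
}

# Usage (after "mount /srv/backup"):
#   data_mode_of /srv/backup
```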

The partition is an LVM2 logical volume which runs on a single PV on a RAID 1 composed of 2 disks (driver is AHCI).

I noticed the data loss because I use sitecopy to push the backups to another machine after each backup run. On about 1 out of 3 backup runs sitecopy complains about a corrupted state file. I haven't checked the integrity of the backups themselves yet, as I can easily reproduce the problem with sitecopy alone.

To reproduce it I do:

# cd /srv/backup/backup2l/scripts/
# cp data.1001.all.tar.gpg xxxx # change something so sitecopy has something to push
# sitecopy -r /srv/backup/backup2l/scripts/.sitecopyrc -p /srv/backup/backup2l/scripts/.sitecopy -q -u backup
# cd /
# umount /srv/backup
# mount /srv/backup
# less /srv/backup/backup2l/scripts/.sitecopy/backup

In about one out of three runs, the last step shows a corrupted file: either the old contents with the rest filled with zeros, or a truncated file.
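The zero-filled case can be checked mechanically after the remount instead of by eye in less. This is a minimal sketch, assuming the sitecopy state file is plain text and therefore should never contain NUL bytes (not verified here); `count_nul_bytes` and `check_state_file` are hypothetical helper names. A merely truncated but non-empty file is not detected, since that would need a known-good size or checksum to compare against.

```shell
#!/bin/sh
# Count the NUL bytes in a file. A plain-text file should contain none.
count_nul_bytes() {
    # tr -cd keeps only NUL bytes; wc -c counts them.
    tr -cd '\000' < "$1" | wc -c | tr -d ' '
}

# Classify a file as "ok", "contains-nul" or "empty-or-missing".
check_state_file() {
    if [ ! -s "$1" ]; then
        echo "empty-or-missing"
    elif [ "$(count_nul_bytes "$1")" -gt 0 ]; then
        echo "contains-nul"
    else
        echo "ok"
    fi
}

# Usage (run only AFTER the umount/mount cycle, because reading the
# file before the unmount hides the problem):
#   check_state_file /srv/backup/backup2l/scripts/.sitecopy/backup
```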

dmesg and syslog show nothing. In particular no journal-replay related message. Adding a "fsck.ext3 -f /dev/vg0/srv_backup" before mounting shows no problem either, still the file gets corrupted every now and then.

So far I've discovered two ways to work around the problem:
* Don't use "data=journal". Both data=writeback and data=ordered seem to work fine
* Do "less /srv/backup/backup2l/scripts/.sitecopy/backup" before the unmount

Especially the latter seems to suggest a strange flush problem with the data=journal code in karmic's current x86-64 kernel (2.6.31.15.28).
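The first workaround amounts to changing a single mount option. As a sketch only, the filter below rewrites the `data=` mode in fstab-style text read from stdin and prints the result to stdout; it does not edit /etc/fstab itself. `set_data_mode` is a hypothetical helper name, and the filter assumes at most one `data=` option per line.

```shell
#!/bin/sh
# Hypothetical helper: replace the data= mode in fstab-style text.
#   $1  new mode: journal, ordered or writeback
# Reads from stdin, writes to stdout; lines without data= pass unchanged.
set_data_mode() {
    sed "s/data=[a-z]*/data=$1/"
}

# Usage (review the output before replacing /etc/fstab by hand):
#   set_data_mode ordered < /etc/fstab
```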

# sudo lvdisplay /dev/vg0/srv_backup
  --- Logical volume ---
  LV Name /dev/vg0/srv_backup
  VG Name vg0
  LV UUID KXZqxv-v8MQ-UD4x-41Vf-2c2t-0wsr-etUNjQ
  LV Write Access read/write
  LV Status available
  # open 0
  LV Size 128.00 GB
  Current LE 32768
  Segments 1
  Allocation inherit
  Read ahead sectors auto
  - currently set to 256
  Block device 253:12

# sudo pvdisplay
  --- Physical volume ---
  PV Name /dev/md2
  VG Name vg0
  PV Size 693.63 GB / not usable 4.12 MB
  Allocatable yes
  PE Size (KByte) 4096
  Total PE 177567
  Free PE 64927
  Allocated PE 112640
  PV UUID FHAWPv-otHj-jpDD-x35T-nE0Q-13uB-30GuSt

# cat /proc/mdstat
Personalities : [raid1]
md2 : active raid1 sda3[0] sdb3[1]
      727318656 blocks [2/2] [UU]

md1 : active raid1 sda2[0] sdb2[1]
      1052160 blocks [2/2] [UU]

md0 : active raid1 sda1[0] sdb1[1]
      4192896 blocks [2/2] [UU]

unused devices: <none>

Tags: karmic
Jürgen Kreileder (jk) wrote :
Changed in linux (Ubuntu):
importance: Undecided → High
status: New → Triaged
Andy Whitcroft (apw)
tags: added: kernel-series-unknown
tags: added: karmic
removed: kernel-series-unknown
Surbhi Palande (csurbhi) wrote :

Jürgen Kreileder, is it possible to attach the complete dmesg? Thanks!

Jürgen Kreileder (jk) wrote :

dmesg2.txt is obviously from a different kernel (the bug report was filed 4 months ago).
I can't tell whether the problem still occurs with this kernel; the machine is in production and I won't experiment on it.

Surbhi Palande (csurbhi) wrote :

Jürgen Kreileder, there is a patch in the Ubuntu kernel which we believe fixes this error:
commit 56fcad29d4b3cbcbb2ed47a9d3ceca3f57175417
Author: Jan Kara <email address hidden>
Date: Tue Sep 8 14:59:42 2009 +0200
    ext3: Flush disk caches on fsync when needed

Please let me know if this bug persists for you whenever you can experiment again. I will need to investigate if things are still not working for you. Thanks!

Surbhi Palande (csurbhi)
Changed in linux (Ubuntu):
importance: High → Medium
Brad Figg (brad-figg) wrote : Unsupported series, setting status to "Won't Fix".

This bug was filed against a series that is no longer supported and so is being marked as Won't Fix. If this issue still exists in a supported series, please file a new bug.

This change has been made by an automated script, maintained by the Ubuntu Kernel Team.

Changed in linux (Ubuntu):
status: Triaged → Won't Fix