Comment 7 for bug 1681410

Kyle O'Donnell (kyleo-t) wrote : Re: [Bug 1681410] Re: fstrim corrupts ocfs2 filesystems when clustered

I tried disabling fstrim on all but one server and hit exactly the same issue as when cron ran it on all servers.

----- Original Message -----
From: "Nick Stallman" <email address hidden>
To: "Kyle O'Donnell" <kyleo@0b10.mx>
Sent: Tuesday, August 1, 2017 7:49:49 PM
Subject: [Bug 1681410] Re: fstrim corrupts ocfs2 filesystems when clustered

I think we've also had a related issue.
We haven't had any serious corruption, but we have had random locks that never get released and that require a server reboot to clear.

OCFS2 does support trim, as does our SAN. However, I think the issue may be related to running fstrim in parallel.
I didn't realise fstrim was in cron.weekly on all 3 servers that had OCFS2 mounted, so they all ran it at almost exactly the same time.

Since I finally noticed it running and disabled it, I haven't had any
further issues (mind you, it's only been a few days).

Running fstrim by default is probably a bad idea on these more advanced filesystems, since there is a real likelihood of it running on several nodes at once.
It's safer to assume that the sysadmin knows about their SAN's fstrim capability and can schedule it in a more controlled manner.

https://bugs.launchpad.net/bugs/1681410

Title:
  fstrim corrupts ocfs2 filesystems when clustered

Status in util-linux package in Ubuntu:
  Expired

Bug description:
  Recently upgraded from trusty to xenial and found that our ocfs2
  filesystems, which are mounted across a number of nodes
  simultaneously, would become corrupt on the weekend:

  [Sun Apr 9 06:46:35 2017] OCFS2: ERROR (device dm-2): ocfs2_validate_gd_self: Group descriptor #516096 has bad signature
  [Sun Apr 9 06:46:35 2017] On-disk corruption discovered. Please run fsck.ocfs2 once the filesystem is unmounted.
  [Sun Apr 9 06:46:35 2017] OCFS2: File system is now read-only.
  [Sun Apr 9 06:46:35 2017] (fstrim,1080,8):ocfs2_trim_fs:7399 ERROR: status = -30
  [Sun Apr 9 06:46:35 2017] OCFS2: ERROR (device dm-3): ocfs2_validate_gd_self: Group descriptor #516096 has bad signature
  [Sun Apr 9 06:46:36 2017] On-disk corruption discovered. Please run fsck.ocfs2 once the filesystem is unmounted.
  [Sun Apr 9 06:46:36 2017] OCFS2: File system is now read-only.
  [Sun Apr 9 06:46:36 2017] (fstrim,1080,10):ocfs2_trim_fs:7399 ERROR: status = -30

  We found the cron.weekly job, whose schedule closely matches the timing:
  47 6 * * 7 root test -x /usr/sbin/anacron || ( cd / && run-parts --report /etc/cron.weekly )

  # cat /etc/cron.weekly/fstrim
  #!/bin/sh
  # trim all mounted file systems which support it
  /sbin/fstrim --all || true

  We have disabled this job across our servers running clustered ocfs2 filesystems. I think either the utility or the cronjob should ignore ocfs2 (gfs too?) filesystems.
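
A rough sketch of what "the cronjob should ignore ocfs2" could look like, assuming the job walks the mount table itself instead of calling fstrim --all. This is not the script Ubuntu ships. The function names (is_clustered_fs, trim_candidates) and the list of filesystem types to skip are assumptions made for illustration, and the part that actually calls fstrim is left commented out.

```shell
#!/bin/sh
# Hypothetical replacement for /etc/cron.weekly/fstrim (illustrative only).

# Return success if the filesystem type is one we assume should never be
# trimmed automatically. The list is an assumption based on this report.
is_clustered_fs() {
    case "$1" in
        ocfs2|gfs|gfs2) return 0 ;;
        *) return 1 ;;
    esac
}

# Read "TARGET FSTYPE" pairs on stdin (one per line) and print only the
# mount points whose filesystem type is not in the skip list.
trim_candidates() {
    while read -r target fstype; do
        is_clustered_fs "$fstype" || printf '%s\n' "$target"
    done
}

# Possible wiring, commented out so the sketch has no side effects.
# fstrim fails on filesystems without discard support, hence "|| true".
# findmnt -rn -o TARGET,FSTYPE | trim_candidates | while read -r mp; do
#     /sbin/fstrim "$mp" || true
# done
```

Skipping the clustered types outright, instead of staggering the runs, follows from the report that trimming from a single node still produced the same corruption. Note that findmnt -r escapes spaces in mount points, so paths containing spaces would need extra handling before being passed to fstrim.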
