cannot unfreeze filesystem due to a deadlock due to multipath failover

Bug #897421 reported by Peter Petrakis
This bug affects 5 people
Affects Status Importance Assigned to Milestone
linux (Ubuntu)
High
Unassigned
Declined for Lucid by Chris Van Hoof
Declined for Maverick by Chris Van Hoof
Oneiric
High
Unassigned
Precise
High
Unassigned

Bug Description

To reproduce:
- 2 or more servers using shared storage (SAN)
- Each loaded with iozone on attached data luns (/s1, /s2, /s3)
 iozone -R -l 2 -u 2 -r 4k -s 100m
- Each system has three data luns of 10G, the root filesystem is not stressed
- A failover injected every 6 mins (this can happen on the first failover)
- dmesg -n 8 as root from serial consoles on all systems
- kdump configured
- set sysctl kernel.hung_task_panic = 1

The hang occurs after a broken path comes back, regardless of whether
the HBA enters error handling.

In simplest terms, the OS via UDEV is recreating the once broken path
by instantiating block devices and creating symlinks. To do this it runs
the following udev rule: /lib/udev/rules.d/95-kpartx.rules

# Create dm tables for partitions
ENV{DM_STATE}=="ACTIVE", ENV{DM_UUID}=="mpath-*", \
        RUN+="/sbin/dmsetup ls --target multipath --exec '/sbin/kpartx -a -p -part' -j %M -m %m"

This rule runs kpartx, which acquires the s_umount semaphore for
serialization and freezes the block device, and thus the filesystem, while
it adds additional partitions to the root block device.

At the same time, a flush thread for the block device in question begins to
write back dirty pages. It acquires the s_umount semaphore for reading
before kpartx takes it again to thaw, and then sleeps waiting for the block
device to become unfrozen.

Since kpartx is trying to obtain a write lock on the s_umount semaphore
while the flush thread is asleep holding a read lock on it, kpartx can never
enter the critical section that unfreezes the block device. Since the
flush thread is in turn sleeping until the block device is unfrozen, it is
deadlocked as well.

Root Cause:
After exhausting the down_write instrumentation and not finding any
other instances competing for the down_write, I changed focus to the
primary hung thread, the writeback flush.

Back to the kpartx hang:
thaw_bdev...

        down_write(&sb->s_umount); <== hang here

        if (sb->s_flags & MS_RDONLY)
                goto out_unfrozen;

        if (sb->s_op->unfreeze_fs) {
                error = sb->s_op->unfreeze_fs(sb);
                if (error) {
                        printk(KERN_ERR
                                "VFS:Filesystem thaw failed\n");
                        sb->s_frozen = SB_FREEZE_TRANS;
                        bdev->bd_fsfreeze_count++;
                        mutex_unlock(&bdev->bd_fsfreeze_mutex);
                        return error;
                }
        }

out_unfrozen:
        sb->s_frozen = SB_UNFROZEN;
        smp_wmb();
        wake_up(&sb->s_wait_unfrozen);

Were we to successfully exit, we change the superblock to unfrozen.
However the flush thread is sleeping, waiting for the super_block
to become unfrozen.

int ext4_force_commit(struct super_block *sb)
{
        journal_t *journal;
        int ret = 0;

        if (sb->s_flags & MS_RDONLY)
                return 0;

        journal = EXT4_SB(sb)->s_journal;
        if (journal) {
                vfs_check_frozen(sb, SB_FREEZE_TRANS); <=== this is where the flush thread sleeps
                ret = ext4_journal_force_commit(journal);
        }

        return ret;
}

enum {
        SB_UNFROZEN = 0,
        SB_FREEZE_WRITE = 1,
        SB_FREEZE_TRANS = 2,
};

#define vfs_check_frozen(sb, level) \
        wait_event((sb)->s_wait_unfrozen, ((sb)->s_frozen < (level)))

crash-5.0> super_block.s_frozen ffff880268a4e000
  s_frozen = 0x2,
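
To make the wake-up condition concrete, here is a small sketch of the predicate inside vfs_check_frozen(). The enum values are copied from the excerpt above; `check_frozen_would_sleep()` is a hypothetical helper introduced only for illustration and does not exist in the kernel.

```c
#include <assert.h>
#include <stdbool.h>

/* Freeze levels, copied from the enum quoted in the report. */
enum {
        SB_UNFROZEN = 0,
        SB_FREEZE_WRITE = 1,
        SB_FREEZE_TRANS = 2,
};

/*
 * Hypothetical helper: true when vfs_check_frozen(sb, level) would sleep,
 * i.e. when its wake-up condition (s_frozen < level) is false.
 */
static bool check_frozen_would_sleep(int s_frozen, int level)
{
        /* Sanity check: only the levels defined above are meaningful. */
        assert(level >= SB_UNFROZEN && level <= SB_FREEZE_TRANS);
        return !(s_frozen < level);
}
```

With the s_frozen value of 0x2 (SB_FREEZE_TRANS) read from the crash dump and the SB_FREEZE_TRANS level passed by ext4_force_commit, the condition 2 < 2 is false, so the caller sleeps until something lowers s_frozen and wakes s_wait_unfrozen.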

So why can't thaw_bdev make any forward progress? There's a reader
out there, that's holding the s_umount sema somewhere in this call
stack.

PID: 992 TASK: ffff8802678a8000 CPU: 7 COMMAND: "flush-251:5"
 #0 [ffff880267bddb00] schedule at ffffffff8158bcbd
 #1 [ffff880267bddbb8] ext4_force_commit at ffffffff8120b16d
 #2 [ffff880267bddc18] ext4_write_inode at ffffffff811f29e5
 #3 [ffff880267bddc68] writeback_single_inode at ffffffff81178964
 #4 [ffff880267bddcb8] writeback_sb_inodes at ffffffff81178f09
 #5 [ffff880267bddd18] wb_writeback at ffffffff8117995c
(down_read(sb->s_umount) taken here)

 #6 [ffff880267bdddc8] wb_do_writeback at ffffffff81179b6b
 #7 [ffff880267bdde58] bdi_writeback_task at ffffffff81179cc3
 #8 [ffff880267bdde98] bdi_start_fn at ffffffff8111e816
 #9 [ffff880267bddec8] kthread at ffffffff81088a06
#10 [ffff880267bddf48] kernel_thread at ffffffff810142ea

and as long as there's an active reader, the writer can't
change anything. After some dissection, the likely culprit is
in frame #5.

(We must have gotten here through writeback_inodes_wb)

void writeback_inodes_wb(struct bdi_writeback *wb,
                struct writeback_control *wbc)
{
        int ret = 0;

        wbc->wb_start = jiffies; /* livelock avoidance */
        spin_lock(&inode_lock);
        if (!wbc->for_kupdate || list_empty(&wb->b_io))
                queue_io(wb, wbc->older_than_this);

        while (!list_empty(&wb->b_io)) {
                struct inode *inode = list_entry(wb->b_io.prev,
                                                 struct inode, i_list);
                struct super_block *sb = inode->i_sb;


!!! This is where the down_read is taken !!!

                if (!pin_sb_for_writeback(sb)) { <== performs down_read_trylock on s_umount
                        requeue_io(inode);
                        continue;
                }
                ret = writeback_sb_inodes(sb, wb, wbc, false);

You must have successfully grabbed s_umount for reading before
reaching this point.

Thus the deadlock: the flush thread will wait forever to be unfrozen
because it's sleeping with a read lock on s_umount, which prevents
the write lock from making any forward progress in thaw_bdev. As a
result, s_frozen is never set to SB_UNFROZEN, which is the step that
would wake the wait queue and allow the flush to complete.

This signature is identical to an issue raised on the fs-dev
lists just 14 days ago. There's also a test case that simply runs
"sync" in a loop, which adds more credibility to the failover
hang on the SCM that isn't under load. It's not the traffic
that's the issue; it's the forced writeback coupled with the
freeze/thaw action triggered by udev that gives us the conditions
for the deadlock.

http://66.135.57.166/lists/linux-fsdevel/msg42068.html

Because it's really related to "sync" it doesn't matter what filesystem
you use.

The proposed solution is against 2.6.38. I've tried parts of it
already with no success. The full patch will require dramatic
changes to the superblock just for it to apply, which of course
could pose even more issues. However, we can finally say that the
root cause has been identified.

Chris Van Hoof (vanhoof)
tags: added: blocks-hwcert-enablement
Changed in linux (Ubuntu Oneiric):
importance: Undecided → High
Changed in linux (Ubuntu Precise):
importance: Undecided → High
Changed in linux (Ubuntu Oneiric):
status: New → Incomplete
status: Incomplete → Confirmed
Changed in linux (Ubuntu Precise):
status: New → Confirmed
Changed in linux (Ubuntu Oneiric):
assignee: nobody → Kamal Mostafa (kamalmostafa)
Changed in linux (Ubuntu Precise):
assignee: nobody → Kamal Mostafa (kamalmostafa)
Changed in linux (Ubuntu Oneiric):
status: Confirmed → In Progress
Changed in linux (Ubuntu Precise):
status: Confirmed → In Progress
Kamal Mostafa (kamalmostafa) wrote :

The attached patch set lp897421-patches.tar.gz has been determined to resolve the problem. This set applies to linux 3.2-rc3 (or Ubuntu-3.2.0-2.5) and has been submitted to upstream lists as "Subject: [PATCH 0/5] fix s_umount thaw/write and journal deadlock".

An ubuntu-precise kernel with this patch set applied is available:
  PPA: https://launchpad.net/~kamalmostafa/+archive/lp897421-unfreeze-deadlock
  git: http://kernel.ubuntu.com/git?p=kamal/ubuntu-precise.git;a=shortlog;h=refs/heads/lp897421-unfreeze-deadlock

The patch set is comprised of:

  Surbhi Palande (2):
    Adding support to freeze and unfreeze a journal
    Thaw the journal when you unfreeze the fs.

  Valerie Aurora (3):
    VFS: Fix s_umount thaw/write deadlock
    VFS: Rename vfs_check_frozen() to
    Documentation: Correct s_umount state for

Kamal Mostafa (kamalmostafa) wrote :

Revised patch set per upstream feedback.

The attached patch set lp897421-patches-2b.tar.gz has been determined to resolve the problem. This set applies to linux 3.2-rc4 (or Ubuntu-3.2.0-3.8) and has been submitted to upstream lists as "Subject: [PATCH v2 0/7] fix s_umount thaw/write and journal deadlock".

An ubuntu-precise kernel with this patch set applied is available:
  PPA: https://launchpad.net/~kamalmostafa/+archive/lp897421-unfreeze-deadlock
  git: http://kernel.ubuntu.com/git?p=kamal/ubuntu-precise.git;a=shortlog;h=refs/heads/lp897421-unfreeze-deadlock-2b

The patch set is comprised of:

Kamal Mostafa (1):
  VFS: Rename and refactor writeback_inodes_sb_if_idle

Surbhi Palande (2):
  Adding support to freeze and unfreeze a journal
  Freeze and thaw the journal on ext4 freeze

Valerie Aurora (4):

  VFS: Fix s_umount thaw/write deadlock
  VFS: Avoid read-write deadlock in try_to_writeback_inodes_sb
  VFS: Document s_frozen state through freeze_super
  Documentation: Correct s_umount state for freeze_fs/unfreeze_fs

Peter Petrakis (peter-petrakis) wrote :

The existing patchset has essentially been re-written and resubmitted by Jan Kara.

https://lkml.org/lkml/2012/3/5/278

"Hallelujah,

  after a couple of weeks and several rewrites, here comes the third iteration
of my patches to improve filesystem freezing. Filesystem freezing is currently
racy and thus we can end up with dirty data on frozen filesystem (see changelog
patch 06 for detailed race description). This patch series aims at fixing this."

Kamal Mostafa (kamalmostafa) wrote :

I've backported Jan Kara's patch set https://lkml.org/lkml/2012/3/5/278 (and its prerequisites) to ubuntu-precise:

  git: http://kernel.ubuntu.com/git?p=kamal/ubuntu-precise.git;a=shortlog;h=refs/heads/lp897421-jankara-fsfreeze

Massimo reports that, in his initial testing, that kernel appears to resolve the problem (100+ iterations without failure).

tags: added: precise rls-mgr-p-tracking
David Bosso (boss-launchpad) wrote :

I'm still able to get deadlocks with the backported patches. The latest (v5) patchset from Jan Kara is rock solid for me with vanilla 3.4.0-rc3. Are you planning on backporting it?

Peter Petrakis (peter-petrakis) wrote :

We know the current patchset isn't perfect, but it does dramatically reduce the frequency of the fault. We plan to take
the fix when it's final. Hopefully the skew from 3.2 (current precise baseline) to 3.4 isn't too dramatic, as this does have to pass the SRU process.

Changed in linux (Ubuntu):
assignee: Kamal Mostafa (kamalmostafa) → Canonical Hardware Enablement Project Management Team (canonical-hwe-pm-team)
Changed in linux (Ubuntu Oneiric):
assignee: Kamal Mostafa (kamalmostafa) → Canonical Hardware Enablement Project Management Team (canonical-hwe-pm-team)
Changed in linux (Ubuntu Precise):
assignee: Kamal Mostafa (kamalmostafa) → Canonical Hardware Enablement Project Management Team (canonical-hwe-pm-team)
Peter Petrakis (peter-petrakis) wrote :

AFAIK most of the work upstream is done; there was just some arguing concerning
an additional interface that could freeze/unfreeze filesystems.

https://lkml.org/lkml/2010/6/10/59

I don't think it's in mainline yet. It really needs a sherpa again.

At a minimum, we should integrate the unit test shown in the ML thread.

dino99 (9d9) wrote :
tags: removed: oneiric
Changed in linux (Ubuntu Oneiric):
status: In Progress → Invalid
Changed in linux (Ubuntu):
assignee: Canonical Hardware Enablement Project Management Team (canonical-hwe-pm-team) → nobody
Changed in linux (Ubuntu Precise):
assignee: Canonical Hardware Enablement Project Management Team (canonical-hwe-pm-team) → nobody
Changed in linux (Ubuntu Oneiric):
assignee: Canonical Hardware Enablement Project Management Team (canonical-hwe-pm-team) → nobody
Changed in linux (Ubuntu Precise):
status: In Progress → Confirmed
Changed in linux (Ubuntu):
status: In Progress → Confirmed
tags: added: needs-kernel-logs needs-upstream-testing
dino99 (9d9) wrote :

This issue was reported long ago and should no longer be a problem.

Changed in linux (Ubuntu Precise):
status: Confirmed → Invalid
Changed in linux (Ubuntu):
status: Confirmed → Invalid