xfs slab objects (memory) leak when xfs shutdown is called

Bug #1706132 reported by Rafael David Tinoco on 2017-07-24
8
This bug affects 1 person
Affects Status Importance Assigned to Milestone
linux (Ubuntu)
Low
Unassigned
Xenial
Low
Unassigned

Bug Description

[Impact]

 * xfs kernel memory leak in case of xfs shutdown due to i/o errors
 * if xfs on iscsi, iscsi disconnection and module unload will case mem leak

[Test Case]

 * configure tgtd with 1 lun and make it available through tcp/ip
 * configure open-iscsi to map this lun
 * make sure node.session.timeo.replacement_timeout = 0 in iscsid.conf
 * mount a xfs volume using the lun from tgtd host, run bonnie -d /xfsdir
 * in tgtd server, drop iscsi packets and watch client to have i/o errors
 * after sometime (depending on timeout) xfs will call shutdown
 * make sure the i/o errors led to xfs shutdown (comment #3)
 * after shutdown you try to remove xfs module and it will leak

[Regression Potential]

 * based on upstream fix
 * tested in the same environment
 * potential damage to xfs

[Other Info]

Original Description:

#### This scenario is testing [iscsi <-> scsi <-> disk <-> xfs]

[ 551.125604] sd 2:0:0:1: rejecting I/O to offline device
[ 551.125615] sd 2:0:0:1: rejecting I/O to offline device
[ 551.125627] sd 2:0:0:1: rejecting I/O to offline device
[ 551.125639] sd 2:0:0:1: rejecting I/O to offline device

[ 551.135216] XFS (sda1): metadata I/O error: block 0xeffe01 ("xfs_trans_read_buf_map") error 5 numblks 1
[ 551.135274] XFS (sda1): page discard on page ffffea0002a89cc0, inode 0x83, offset 6442385408.

# when XFS shuts down because of an error (or offline disk, example):

[ 551.850498] XFS (sda1): xfs_do_force_shutdown(0x2) called from line 1197 of file /build/linux-lts-xenial-roXrYH/linux-lts-xenial-4.4.0/fs/xfs/xfs_log.c. Return address = 0xffffffffc0300388
[ 551.850568] XFS (sda1): Log I/O Error Detected. Shutting down filesystem

[ 551.850618] XFS (sda1): xfs_log_force: error -5 returned.

[ 551.850630] XFS (sda1): Failing async write on buffer block 0x77ff08. Retrying async write.
[ 551.850634] XFS (sda1): Failing async write on buffer block 0x77ff10. Retrying async write.
[ 551.850638] XFS (sda1): Failing async write on buffer block 0x77ff01. Retrying async write.
[ 551.853171] XFS (sda1): Please umount the filesystem and rectify the problem(s)

[ 551.874131] XFS (sda1): metadata I/O error: block 0x1dffc49 ("xlog_iodone") error 5 numblks 64
[ 551.877993] XFS (sda1): xfs_do_force_shutdown(0x2) called from line 1197 of file /build/linux-lts-xenial-roXrYH/linux-lts-xenial-4.4.0/fs/xfs/xfs_log.c. Return address = 0xffffffffc0300388

[ 551.899036] XFS (sda1): xfs_log_force: error -5 returned.
[ 569.323074] XFS (sda1): xfs_log_force: error -5 returned.
[ 599.403085] XFS (sda1): xfs_log_force: error -5 returned.
[ 629.483111] XFS (sda1): xfs_log_force: error -5 returned.
[ 659.563115] XFS (sda1): xfs_log_force: error -5 returned.
[ 689.643014] XFS (sda1): xfs_log_force: error -5 returned.

# when I execute:

# sudo umount /dev/sda1:

[81634.923043] XFS (sda1): xfs_log_force: error -5 returned.
[81640.739097] XFS (sda1): xfs_log_force: error -5 returned.
[81640.739137] XFS (sda1): Unmounting Filesystem
[81640.739463] XFS (sda1): xfs_log_force: error -5 returned.
[81640.739508] XFS (sda1): xfs_log_force: error -5 returned.
[81640.742741] sd 2:0:0:1: rejecting I/O to offline device
[81640.745576] blk_update_request: 25 callbacks suppressed
[81640.745601] blk_update_request: I/O error, dev sda, sector 0

# i was able to umount and then to remove iscsi disk.

# but if i try to unload the xfs module:

inaddy@(trustyiscsicli):~$ sudo rmmod xfs
[82211.059301] =============================================================================
[82211.063764] BUG xfs_log_ticket (Tainted: G OE ): Objects remaining in xfs_log_ticket on kmem_cache_close()
[82211.067450] -----------------------------------------------------------------------------
[82211.067450]
[82211.070580] INFO: Slab 0xffffea0002eb7640 objects=22 used=1 fp=0xffff8800badd9f18 flags=0xffffc000000080
[82211.074430] INFO: Object 0xffff8800badd9228 @offset=552
[82211.076133] kmem_cache_destroy xfs_log_ticket: Slab cache still has objects

AND

[82211.059301] =============================================================================
[82211.063764] BUG xfs_log_ticket (Tainted: G OE ): Objects remaining in xfs_log_ticket on kmem_cache_close()
[82211.067450] -----------------------------------------------------------------------------
[82211.067450]
[82211.070570] Disabling lock debugging due to kernel taint
[82211.070580] INFO: Slab 0xffffea0002eb7640 objects=22 used=1 fp=0xffff8800badd9f18 flags=0xffffc000000080
[82211.073964] CPU: 3 PID: 32230 Comm: rmmod Tainted: G B OE 4.4.0-74-generic #95~14.04.1-Ubuntu
[82211.073970] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Ubuntu-1.8.2-1ubuntu1 04/01/2014
[82211.073975] 0000000000000000 ffff8800baf53d28 ffffffff813dce3c ffffea0002eb7640
[82211.073984] ffff880036614300 ffff8800baf53e00 ffffffff811dc4b4 ffff880000000020
[82211.073991] ffff8800baf53e10 ffff8800baf53dc0 656a624f02e294c0 616d657220737463
[82211.074013] Call Trace:
[82211.074039] [<ffffffff813dce3c>] dump_stack+0x63/0x87
[82211.074066] [<ffffffff811dc4b4>] slab_err+0xb4/0xe0
[82211.074081] [<ffffffff8118cfd2>] ? __free_kmem_pages+0x22/0x40
[82211.074089] [<ffffffff811dcc6c>] ? __free_slab+0xcc/0x1a0
[82211.074097] [<ffffffff811df799>] ? __kmalloc+0x229/0x270
[82211.074106] [<ffffffff811e14b1>] ? __kmem_cache_shutdown+0x111/0x2b0
[82211.074127] [<ffffffff811e14d1>] __kmem_cache_shutdown+0x131/0x2b0
[82211.074157] [<ffffffff811a85ef>] kmem_cache_destroy+0x1bf/0x220
[82211.074289] [<ffffffffc031a368>] xfs_destroy_zones+0xeb/0x10a [xfs]
[82211.074382] [<ffffffffc031a3fc>] exit_xfs_fs+0x6f/0xc73 [xfs]
[82211.074395] [<ffffffff81101f3d>] SyS_delete_module+0x18d/0x220
[82211.074404] [<ffffffff8107986b>] ? exit_to_usermode_loop+0x78/0xa2
[82211.074416] [<ffffffff818066b6>] entry_SYSCALL_64_fastpath+0x16/0x75
[82211.074430] INFO: Object 0xffff8800badd9228 @offset=552
[82211.076133] kmem_cache_destroy xfs_log_ticket: Slab cache still has objects
[82211.078221] CPU: 3 PID: 32230 Comm: rmmod Tainted: G B OE 4.4.0-74-generic #95~14.04.1-Ubuntu
[82211.078226] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Ubuntu-1.8.2-1ubuntu1 04/01/2014
[82211.078230] 0000000000000000 ffff8800baf53e60 ffffffff813dce3c ffff880036614300
[82211.078238] ffff880036614300 ffff8800baf53ec8 ffffffff811a8648 0000000000000000
[82211.078245] 00ff8800baf53eb8 ffff8800baf53e80 ffff8800baf53e80 ffff8800baf53e90
[82211.078253] Call Trace:
[82211.078262] [<ffffffff813dce3c>] dump_stack+0x63/0x87
[82211.078271] [<ffffffff811a8648>] kmem_cache_destroy+0x218/0x220
[82211.078334] [<ffffffffc031a368>] xfs_destroy_zones+0xeb/0x10a [xfs]
[82211.078382] [<ffffffffc031a3fc>] exit_xfs_fs+0x6f/0xc73 [xfs]
[82211.078389] [<ffffffff81101f3d>] SyS_delete_module+0x18d/0x220
[82211.078395] [<ffffffff8107986b>] ? exit_to_usermode_loop+0x78/0xa2
[82211.078401] [<ffffffff818066b6>] entry_SYSCALL_64_fastpath+0x16/0x75
[82269.173849] SGI XFS with ACLs, security attributes, realtime, no debug enabled

Rafael David Tinoco (inaddy) wrote :

# the following commit solves the issue:

commit af055e37a91d215d7174d0b84c86795ca81086a7
Author: Brian Foster <email address hidden>
Date: Mon Feb 8 15:00:02 2016 +1100

    xfs: fix xfs_log_ticket leak in xfs_end_io() after fs shutdown

    If the filesystem has shut down, xfs_end_io() currently sets an
    error on the ioend and proceeds to ioend destruction. The ioend
    might contain a truncate transaction if the I/O extended the size of
    the file. This transaction is only cleaned up in
    xfs_setfilesize_ioend(), however, which is skipped in this case.
    This results in an xfs_log_ticket leak message when the associate
    cache slab is destroyed (e.g., on rmmod).

    This was originally reproduced by xfs/141 on a distro kernel. The
    problem is reproducible on an upstream kernel, but not easily
    detected in current upstream if the xfs_log_ticket cache happens to
    be merged with another cache. This can be reproduced more
    deterministically with the 'slab_nomerge' kernel boot option.

    Update xfs_end_io() to proceed with normal end I/O processing after
    an error is set on an ioend due to fs shutdown. The I/O type-based
    processing is already designed to handle an I/O error and ensure
    that the ioend is cleaned up correctly.

    Signed-off-by: Brian Foster <email address hidden>
    Reviewed-by: Dave Chinner <email address hidden>
    Signed-off-by: Dave Chinner <email address hidden>

Rafael David Tinoco (inaddy) wrote :

$ git tag --contains af055e37a91d215d7174d0b84c86795ca81086a7 | grep -v -- -rc
v4.10
v4.11
v4.12
v4.6
v4.7
v4.8
v4.9

Only Xenial seems to be affected.

Changed in linux (Ubuntu):
assignee: nobody → Rafael David Tinoco (inaddy)
importance: Undecided → Medium
status: New → Confirmed
milestone: none → ubuntu-16.04.3
description: updated
Changed in linux (Ubuntu):
importance: Medium → Low
Rafael David Tinoco (inaddy) wrote :

Consistent reproducer:

inaddy@iscsixfs:~$ uname -a
Linux iscsixfs 4.4.0-87-generic #110~14.04.1-Ubuntu SMP Tue Jul 18 14:51:32 UTC 2017 x86_64 x86_64 x86_64 GNU/Linux

inaddy@iscsixfs:~$ dmesg | grep xfs

[ 381.648984] XFS (sda1): xfs_do_force_shutdown(0x2) called from line 1197 of file /build/linux-lts-xenial-N0BD4p/linux-lts-xenial-4.4.0/fs/xfs/xfs_log.c. Return address = 0xffffffffc02a16a8
[ 381.700064] XFS (sda1): xfs_log_force: error -5 returned.

inaddy@iscsixfs:~$ lsmod | grep xfs
xfs 954368 1
libcrc32c 16384 1 xfs

inaddy@iscsixfs:~$ sudo umount /xfs01
[ 449.785939] sd 2:0:0:1: rejecting I/O to offline device
[ 449.789923] blk_update_request: I/O error, dev sda, sector 0

inaddy@iscsixfs:~$ sudo rmmod xfs
[ 452.928198] =============================================================================
[ 452.929647] BUG xfs_log_ticket (Not tainted): Objects remaining in xfs_log_ticket on kmem_cache_close()
[ 452.931259] -----------------------------------------------------------------------------
[ 452.931259]
[ 452.933002] INFO: Slab 0xffffea0002e67a80 objects=22 used=3 fp=0xffff8800b99eaf18 flags=0xffffc000000080
[ 452.934773] INFO: Object 0xffff8800b99ea000 @offset=0
[ 452.935693] INFO: Object 0xffff8800b99ea228 @offset=552
[ 452.936675] INFO: Object 0xffff8800b99ea2e0 @offset=736
[ 452.937642] kmem_cache_destroy xfs_log_ticket: Slab cache still has objects

description: updated
Rafael David Tinoco (inaddy) wrote :

Testing this patch:

inaddy@iscsixfs:~$ sudo uname -a
Linux iscsixfs 4.4.70+ #2 SMP Mon Jul 24 19:25:49 UTC 2017 x86_64 x86_64 x86_64 GNU/Linux

inaddy@iscsixfs:~$ sudo dmesg | grep xfs
[ 109.876336] XFS (sda1): metadata I/O error: block 0xc34d83 ("xfs_trans_read_buf_map") error 5 numblks 1
[ 109.876420] XFS (sda1): metadata I/O error: block 0xc34d83 ("xfs_trans_read_buf_map") error 5 numblks 1
[ 109.876463] XFS (sda1): metadata I/O error: block 0xc34d83 ("xfs_trans_read_buf_map") error 5 numblks 1
[ 109.876524] XFS (sda1): metadata I/O error: block 0xc34d83 ("xfs_trans_read_buf_map") error 5 numblks 1
... tons of times ...
[ 114.612131] XFS (sda1): metadata I/O error: block 0xc34d83 ("xfs_trans_read_buf_map") error 5 numblks 1
[ 114.612204] XFS (sda1): metadata I/O error: block 0xc34d83 ("xfs_trans_read_buf_map") error 5 numblks 1
[ 114.612254] XFS (sda1): metadata I/O error: block 0xc34d83 ("xfs_trans_read_buf_map") error 5 numblks 1
[ 114.631338] XFS (sda1): xfs_do_force_shutdown(0x2) called from line 1197 of file /home/inaddy/sources/trees/xenial/fs/xfs/xfs_log.c. Return address = 0xffffffffc032e0d8
[ 114.631381] XFS (sda1): xfs_log_force: error -5 returned.
[ 114.635642] XFS (sda1): xfs_do_force_shutdown(0x2) called from line 1197 of file /home/inaddy/sources/trees/xenial/fs/xfs/xfs_log.c. Return address = 0xffffffffc032e0d8
[ 114.680054] XFS (sda1): xfs_log_force: error -5 returned.

inaddy@iscsixfs:~$ lsmod | grep xfs
xfs 970752 1
libcrc32c 16384 1 xfs

inaddy@iscsixfs:~$ sudo umount /xfs01
[ 139.491719] sd 2:0:0:1: rejecting I/O to offline device
[ 139.492705] blk_update_request: I/O error, dev sda, sector 0

inaddy@iscsixfs:~$ sudo rmmod xfs

Patch fixes the issue.

inaddy@iscsixfs:~$ sudo modprobe xfs

inaddy@iscsixfs:~$ ls /dev/sd*
/dev/sda /dev/sda1 /dev/sdb /dev/sdb1

inaddy@iscsixfs:~$ sudo mount /dev/sda1 /xfs01

inaddy@iscsixfs:~$ ls /xfs01
Bonnie.1677

Disk is back online after module was reloaded.
(I could have checked slab objects for xfs_log_ticket but I guess this is enough).

Changed in linux (Ubuntu Xenial):
status: New → In Progress
Changed in linux (Ubuntu):
status: Confirmed → In Progress
Changed in linux (Ubuntu Xenial):
importance: Undecided → Low
assignee: nobody → Rafael David Tinoco (inaddy)
milestone: none → ubuntu-16.04.3
Changed in linux (Ubuntu Xenial):
status: In Progress → Fix Committed
tags: added: sts

This bug is awaiting verification that the kernel in -proposed solves the problem. Please test the kernel and update this bug with the results. If the problem is solved, change the tag 'verification-needed-xenial' to 'verification-done-xenial'. If the problem still exists, change the tag 'verification-needed-xenial' to 'verification-failed-xenial'.

If verification is not done by 5 working days from today, this fix will be dropped from the source code, and this bug will be closed.

See https://wiki.ubuntu.com/Testing/EnableProposed for documentation how to enable and use -proposed. Thank you!

tags: added: verification-needed-xenial
Rafael David Tinoco (inaddy) wrote :

Working on verification right now.

Rafael David Tinoco (inaddy) wrote :

Okay,

First I have failed the path without propagating transport layer error and xfs stayed up (nmi watchdog would eventually detect a lockup if pagecache was ever attempted to be flushed), i returned the path and failed it again, now with error propagation. The error was detected and all SCSI CDBs were failed, SCSI sensed errors and XFS called shutdown:

http://pastebin.ubuntu.com/25376025/

Aug 23 11:34:47 xfshang kernel: [ 1871.328044] blk_update_request: I/O error, dev sdb, sector 20481104
Aug 23 11:34:47 xfshang kernel: [ 1871.328051] XFS (sdb1): metadata I/O error: block 0x1387c50 ("xlog_iodone") error 5 numblks 64
Aug 23 11:34:47 xfshang kernel: [ 1871.328053] XFS (sdb1): xfs_do_force_shutdown(0x2) called from line 1197 of file /build/linux-lts-xenial-AsqcmJ/linux-lts-xenial-4.4.0/fs/xfs/xfs_log.c. Return address = 0xffffffffc025f698
Aug 23 11:34:47 xfshang kernel: [ 1871.328066] XFS (sdb1): Log I/O Error Detected. Shutting down filesystem
Aug 23 11:34:47 xfshang kernel: [ 1871.328067] XFS (sdb1): Please umount the filesystem and rectify the problem(s)
Aug 23 11:34:47 xfshang kernel: [ 1873.966241] buffer_io_error: 173396 callbacks suppressed

Aug 23 11:34:47 xfshang kernel: [ 1873.966244] Buffer I/O error on dev sdb1, logical block 102421, lost async page write
Aug 23 11:34:47 xfshang kernel: [ 1873.967411] Buffer I/O error on dev sdb1, logical block 102422, lost async page write
Aug 23 11:34:47 xfshang kernel: [ 1873.968599] Buffer I/O error on dev sdb1, logical block 102423, lost async page write
...
Aug 23 11:34:56 xfshang dhclient: DHCPREQUEST of 192.168.50.95 on internal to 192.168.48.1 port 67 (xid=0x87b4739)
Aug 23 11:34:56 xfshang dhclient: DHCPACK of 192.168.50.95 from 192.168.48.1
Aug 23 11:34:56 xfshang dhclient: bound to 192.168.50.95 -- renewal in 15 seconds.
Aug 23 11:35:01 xfshang kernel: [ 1887.284525] XFS (sdb1): xfs_log_force: error -5 returned.
Aug 23 11:35:01 xfshang kernel: [ 1887.548087] XFS (sdb1): Unmounting Filesystem
Aug 23 11:35:01 xfshang kernel: [ 1887.548100] XFS (sdb1): xfs_log_force: error -5 returned.
Aug 23 11:35:01 xfshang kernel: [ 1887.548145] XFS (sdb1): xfs_log_force: error -5 returned.
Aug 23 11:35:01 xfshang kernel: [ 1887.548438] sd 5:0:0:1: rejecting I/O to offline device
Aug 23 11:35:01 xfshang kernel: [ 1887.549424] blk_update_request: I/O error, dev sdb, sector 0

Then I removed xfs module and no leak was found. I consider this verified and good to go.

Thank you!

-Rafael

tags: added: verification-done
removed: verification-needed-xenial
Launchpad Janitor (janitor) wrote :
Download full text (16.2 KiB)

This bug was fixed in the package linux - 4.4.0-93.116

---------------
linux (4.4.0-93.116) xenial; urgency=low

  * linux: 4.4.0-93.116 -proposed tracker (LP: #1709296)

  * Creating conntrack entry failure with kernel 4.4.0-89 (LP: #1709032)
    - Revert "Revert "netfilter: synproxy: fix conntrackd interaction""
    - netfilter: nf_ct_ext: fix possible panic after nf_ct_extend_unregister

  * CVE-2017-1000112
    - Revert "udp: consistently apply ufo or fragmentation"
    - udp: consistently apply ufo or fragmentation

  * CVE-2017-1000111
    - Revert "net-packet: fix race in packet_set_ring on PACKET_RESERVE"
    - packet: fix tp_reserve race in packet_set_ring

  * kernel BUG at [tty_ldisc_reinit] mm/slub.c! (LP: #1709126)
    - tty: Simplify tty_set_ldisc() exit handling
    - tty: Reset c_line from driver's init_termios
    - tty: Handle NULL tty->ldisc
    - tty: Move tty_ldisc_kill()
    - tty: Use 'disc' for line discipline index name
    - tty: Refactor tty_ldisc_reinit() for reuse
    - tty: Destroy ldisc instance on hangup

  * atheros bt failed after S3 (LP: #1706833)
    - SAUCE: Bluetooth: Make request workqueue freezable

  * The Precision Touchpad(PTP) button sends incorrect event code (LP: #1708372)
    - HID: multitouch: handle external buttons for Precision Touchpads

  * Set CONFIG_SATA_HIGHBANK=y on armhf (LP: #1703430)
    - [Config] CONFIG_SATA_HIGHBANK=y

  * xfs slab objects (memory) leak when xfs shutdown is called (LP: #1706132)
    - xfs: fix xfs_log_ticket leak in xfs_end_io() after fs shutdown

  * Adt tests of src:linux time out often on armhf lxc containers (LP: #1705495)
    - [Packaging] tests -- reduce rebuild test to one flavour

  * CVE-2017-7495
    - ext4: fix data exposure after a crash

  * ubuntu/rsi driver downlink wifi throughput drops to 5-6 Mbps when BT
    keyboard is connected (LP: #1706991)
    - SAUCE: Redpine: enable power save by default for coex mode
    - SAUCE: Redpine: uapsd configuration changes

  * [Hyper-V] hv_netvsc: Exclude non-TCP port numbers from vRSS hashing
    (LP: #1690174)
    - hv_netvsc: Exclude non-TCP port numbers from vRSS hashing

  * ath10k doesn't report full RSSI information (LP: #1706531)
    - ath10k: add per chain RSSI reporting

  * ideapad_laptop don't support v310-14isk (LP: #1705378)
    - platform/x86: ideapad-laptop: Add several models to no_hw_rfkill

  * [8087:0a2b] Failed to load bluetooth firmware(might affect some other Intel
    bt devices) (LP: #1705633)
    - Bluetooth: btintel: Create common Intel Version Read function
    - Bluetooth: Use switch statement for Intel hardware variants
    - Bluetooth: Replace constant hw_variant from Intel Bluetooth firmware
      filename
    - Bluetooth: hci_intel: Fix firmware file name to use hw_variant
    - Bluetooth: btintel: Add MODULE_FIRMWARE entries for iBT 3.5 controllers

  * xhci_hcd: ERROR Transfer event TRB DMA ptr not part of current TD ep_index 2
    comp_code 13 (LP: #1667750)
    - xhci: Bad Ethernet performance plugged in ASM1042A host

  * OpenPower: Some multipaths temporarily have only a single path
    (LP: #1696445)
    - scsi: ses: don't get power status of SES device slot on probe

  ...

Changed in linux (Ubuntu Xenial):
status: Fix Committed → Fix Released

Thank you for taking the time to verify this stable release fix. We have noticed that you have used the verification-done tag for marking the bug as verified and would like to point out that due to a recent change in SRU bug verification policy fixes now have to be marked with per-release tags (i.e. verification-done-$RELEASE). Please remove the verification-done tag and add one for the release you have tested the package in. Thank you!

https://wiki.ubuntu.com/StableReleaseUpdates#Verification

Changed in linux (Ubuntu):
status: In Progress → Fix Released
assignee: Rafael David Tinoco (inaddy) → nobody
Changed in linux (Ubuntu Xenial):
assignee: Rafael David Tinoco (inaddy) → nobody
To post a comment you must log in.
This report contains Public information  Edit
Everyone can see this information.

Other bug subscribers