fscache: jobs might hang when fscache disk is full

Bug #1821395 reported by Mauricio Faria de Oliveira
6
This bug affects 1 person
Affects Status Importance Assigned to Milestone
linux (Ubuntu)
Invalid
Undecided
Unassigned
Bionic
Fix Released
Undecided
Unassigned
Cosmic
Fix Released
Undecided
Unassigned

Bug Description

[Impact]

 * fscache issue where jobs get hung when fscache disk is full.

 * trivial upstream fix; already applied in X/D, required in B/C:
   commit c5a94f434c82 ("fscache: fix race between enablement and
   dropping of object").

[Test Case]

 * Test kernel verified / regression-tested by reporter.

 * Apparently there's no simple test case,
   but these are the conditions to hit the problem:

   1) The active dataset size is equal to the cache disk size.
      The application reads the data over and over again.
   2) Disk is near full (90%+)
   3) cachefilesd in userspace is trying to cull the old objects
      while new objects are being looked up.
   4) new cachefiles are created and some fail with no disk space.
   5) race in dropping object state machine and
      deferred lookup state machine causes the hang.
   6) HUNG in fscache_wait_for_deferred_lookup for
      clear bit FSCACHE_COOKIE_LOOKING_UP cookie->flags.

[Regression Potential]

 * Low; contained in fscache; no further fixes applied upstream.

 * This patch is applied in a stable tree (linux-4.4.y).

[Original Description]

An user reported an fscache issue where jobs get hung when the fscache disk is full.

After investigation, it's been found to be an issue already reported/fixed upstream,
by commit c5a94f434c82 ("fscache: fix race between enablement and dropping of object").

This patch is required in Bionic and Cosmic, and it's applied in Xenial (via stable) and Disco.

Apparently there's no simple test case, but these are the conditions to hit the problem:

1) The active dataset size is equal to the cache disk size.
   The application reads the data over and over again.
2) Disk is near full (90%+)
3) cachefilesd in userspace is trying to cull the old objects
   while new objects are being looked up.
4) new cachefiles are created and some fail with no disk space.
5) race in dropping object state machine and
   deferred lookup state machine causes the hang.
6) HUNG in fscache_wait_for_deferred_lookup for
   clear bit FSCACHE_COOKIE_LOOKING_UP cookie->flags.

CVE References

Revision history for this message
Ubuntu Kernel Bot (ubuntu-kernel-bot) wrote : Missing required logs.

This bug is missing log files that will aid in diagnosing the problem. While running an Ubuntu kernel (not a mainline or third-party kernel) please enter the following command in a terminal window:

apport-collect 1821395

and then change the status of the bug to 'Confirmed'.

If, due to the nature of the issue you have encountered, you are unable to run this command, please add a comment stating that fact and change the bug status to 'Confirmed'.

This change has been made by an automated script, maintained by the Ubuntu Kernel Team.

Changed in linux (Ubuntu):
status: New → Incomplete
Changed in linux (Ubuntu):
status: Incomplete → Invalid
Changed in linux (Ubuntu Bionic):
status: New → Confirmed
Changed in linux (Ubuntu Cosmic):
status: New → Confirmed
Revision history for this message
Mauricio Faria de Oliveira (mfo) wrote :

[B/C][PATCH 0/1] Fix for LP#1821395 (fscache: jobs might hang when fscache disk is full)
https://lists.ubuntu.com/archives/kernel-team/2019-March/099448.html

description: updated
Changed in linux (Ubuntu Bionic):
status: Confirmed → Fix Committed
Changed in linux (Ubuntu Cosmic):
status: Confirmed → Fix Committed
Revision history for this message
Ubuntu Kernel Bot (ubuntu-kernel-bot) wrote :

This bug is awaiting verification that the kernel in -proposed solves the problem. Please test the kernel and update this bug with the results. If the problem is solved, change the tag 'verification-needed-cosmic' to 'verification-done-cosmic'. If the problem still exists, change the tag 'verification-needed-cosmic' to 'verification-failed-cosmic'.

If verification is not done by 5 working days from today, this fix will be dropped from the source code, and this bug will be closed.

See https://wiki.ubuntu.com/Testing/EnableProposed for documentation how to enable and use -proposed. Thank you!

tags: added: verification-needed-cosmic
tags: added: verification-needed-bionic
Revision history for this message
Ubuntu Kernel Bot (ubuntu-kernel-bot) wrote :

This bug is awaiting verification that the kernel in -proposed solves the problem. Please test the kernel and update this bug with the results. If the problem is solved, change the tag 'verification-needed-bionic' to 'verification-done-bionic'. If the problem still exists, change the tag 'verification-needed-bionic' to 'verification-failed-bionic'.

If verification is not done by 5 working days from today, this fix will be dropped from the source code, and this bug will be closed.

See https://wiki.ubuntu.com/Testing/EnableProposed for documentation how to enable and use -proposed. Thank you!

Revision history for this message
Mauricio Faria de Oliveira (mfo) wrote :

The verification for bionic/cosmic -proposed is expected to finish by tomorrow (Apr 12).

Revision history for this message
Mauricio Faria de Oliveira (mfo) wrote :

Verification successful with xfstests on nfs+fscache.
No regression in bionic-proposed from bionic-updates.

bionic-updates / 4.15.0-47:

Failures: generic/035 generic/075 generic/091 generic/112 generic/263 generic/294 generic/306 generic/307 generic/430 generic/431 generic/434 generic/469 generic/484 generic/495
Failed 14 of 437 tests

bionic-proposed / 4.15.0-48:

Failures: generic/035 generic/075 generic/091 generic/112 generic/263 generic/294 generic/306 generic/307 generic/430 generic/431 generic/434 generic/469 generic/484 generic/495
Failed 14 of 437 tests

tags: added: verification-done-bionic
removed: verification-needed-bionic
Revision history for this message
Mauricio Faria de Oliveira (mfo) wrote :

Verification successful with xfstests on nfs+fscache.
No regression in cosmic-proposed from cosmic-updates.

cosmic-updates / 4.18.0-17:

Failures: generic/035 generic/258 generic/294 generic/448 generic/467 generic/477 generic/484 generic/490 generic/495
Failed 9 of 437 tests

cosmic-proposed / 4.18.0-18:

Failures: generic/035 generic/258 generic/294 generic/448 generic/467 generic/477 generic/484 generic/490 generic/495
Failed 9 of 437 tests

Revision history for this message
Mauricio Faria de Oliveira (mfo) wrote :
Download full text (4.0 KiB)

Regression testing setup/steps
===

fscache
-------

sudo apt-get -y install cachefilesd
echo 'RUN=yes' | sudo tee -a /etc/default/cachefilesd
sudo modprobe fscache
sudo systemctl start cachefilesd

nfs
---

sudo apt-get -y install nfs-kernel-server
sudo systemctl start nfs-kernel-server

sudo mkdir -p /{srv,mnt}/nfs-{test,scratch}

# different fsid if in the same local filesystem
echo '/srv/nfs-test 127.0.0.1(rw,no_subtree_check,no_root_squash,fsid=0)' | sudo tee -a /etc/exports
echo '/srv/nfs-scratch 127.0.0.1(rw,no_subtree_check,no_root_squash,fsid=1)' | sudo tee -a /etc/exports
sudo exportfs -ra

xfs-tests
---------

sudo apt-get -y install automake gcc make git xfsprogs xfslibs-dev \
  uuid-dev uuid-runtime libtool-bin e2fsprogs libuuid1 attr libattr1-dev \
  libacl1-dev libaio-dev libgdbm-dev quota gawk fio dbench python sqlite3

git clone https://git.kernel.org/pub/scm/fs/xfs/xfstests-dev.git
cd xfstests-dev

git log --oneline -1 HEAD
  f3c1bca generic: Test that SEEK_HOLE can find a punched hole

make -j$(nproc); echo $? # must be 0

sudo useradd fsgqa
sudo groupadd fsgqa
sudo useradd 123456-fsgqa

export TEST_DEV=127.0.0.1:/srv/nfs-test
export TEST_DIR=/mnt/nfs-test

export SCRATCH_DEV=127.0.0.1:/srv/nfs-scratch
export SCRATCH_MNT=/mnt/nfs-scratch

export TEST_FS_MOUNT_OPTS="-o fsc" # for fscache / test dev
export NFS_MOUNT_OPTIONS="-o fsc" # for fscache / scratch dev

cd ~/xfstests-dev
sudo -E ./check -nfs -g quick 2>&1 | tee ~/xfs-tests.nfs.log.$(uname -r)
<...>

---

In another terminal, check the NFS mounts are indeed with the 'fsc' (fscache) attribute:

$ mount | grep nfs | grep fsc
127.0.0.1:/srv/nfs-test on /mnt/nfs-test type nfs4 (rw,relatime,vers=4.2,rsize=1048576,wsize=1048576,namlen=255,hard,proto=tcp,timeo=600,retrans=2,sec=sys,clientaddr=127.0.0.1,fsc,local_lock=none,addr=127.0.0.1)
127.0.0.1:/srv/nfs-scratch on /mnt/nfs-scratch type nfs4 (rw,relatime,vers=4.2,rsize=1048576,wsize=1048576,namlen=255,hard,proto=tcp,timeo=600,retrans=2,sec=sys,clientaddr=127.0.0.1,fsc,local_lock=none,addr=127.0.0.1)

And compare fscache stats before/after run:

$ cat /proc/fs/fscache/stats
FS-Cache statistics
Cookies: idx=0 dat=0 spc=0
Objects: alc=0 nal=0 avl=0 ded=0
ChkAux : non=0 ok=0 upd=0 obs=0
Pages : mrk=0 unc=0
Acquire: n=0 nul=0 noc=0 ok=0 nbf=0 oom=0
Lookups: n=0 neg=0 pos=0 crt=0 tmo=0
Invals : n=0 run=0
Updates: n=0 nul=0 run=0
Relinqs: n=0 nul=0 wcr=0 rtr=0
AttrChg: n=0 ok=0 nbf=0 oom=0 run=0
Allocs : n=0 ok=0 wt=0 nbf=0 int=0
Allocs : ops=0 owt=0 abt=0
Retrvls: n=0 ok=0 wt=0 nod=0 nbf=0 int=0 oom=0
Retrvls: ops=0 owt=0 abt=0
Stores : n=0 ok=0 agn=0 nbf=0 oom=0
Stores : ops=0 run=0 pgs=0 rxd=0 olm=0
VmScan : nos=0 gon=0 bsy=0 can=0 wt=0
Ops : pend=0 run=0 enq=0 can=0 rej=0
Ops : ini=0 dfr=0 rel=0 gc=0
CacheOp: alo=0 luo=0 luc=0 gro=0
CacheOp: inv=0 upo=0 dro=0 pto=0 atc=0 syn=0
CacheOp: rap=0 ras=0 alp=0 als=0 wrp=0 ucp=0 dsp=0

...

$ cat /proc/fs/fscache/stats
FS-Cache statistics
Cookies: idx=412 dat=2441632 spc=0
Objects: alc=8929 nal=0 avl=8741 ded=8928
ChkAux : non=0 ok=86 upd=0 obs=1123
Pages : mrk=371441 unc=371441
Acquire: n=2442044 nul=0 noc=0 ok=2442044 nbf=0 oom=0
Lookups: n=8929 neg=8817 pos=112...

Read more...

tags: added: verification-done-cosmic
removed: verification-needed-cosmic
Revision history for this message
Launchpad Janitor (janitor) wrote :
Download full text (6.9 KiB)

This bug was fixed in the package linux - 4.18.0-18.19

---------------
linux (4.18.0-18.19) cosmic; urgency=medium

  * linux: 4.18.0-18.19 -proposed tracker (LP: #1822796)

  * Packaging resync (LP: #1786013)
    - [Packaging] update helper scripts
    - [Packaging] resync retpoline extraction

  * 3b080b2564287be91605bfd1d5ee985696e61d3c in ubuntu_btrfs_kernel_fixes
    triggers system hang on i386 (LP: #1812845)
    - btrfs: raid56: properly unmap parity page in finish_parity_scrub()

  * [SRU][B/C/OEM]IOMMU: add kernel dma protection (LP: #1820153)
    - ACPI / property: Allow multiple property compatible _DSD entries
    - PCI / ACPI: Identify untrusted PCI devices
    - iommu/vt-d: Force IOMMU on for platform opt in hint
    - iommu/vt-d: Do not enable ATS for untrusted devices
    - thunderbolt: Export IOMMU based DMA protection support to userspace
    - iommu/vt-d: Disable ATS support on untrusted devices

  * Huawei Hi1822 NIC has poor performance (LP: #1820187)
    - net-next: hinic: fix a problem in free_tx_poll()
    - hinic: remove ndo_poll_controller
    - net-next/hinic: add checksum offload and TSO support
    - hinic: Fix l4_type parameter in hinic_task_set_tunnel_l4
    - net-next/hinic:replace multiply and division operators
    - net-next/hinic:add rx checksum offload for HiNIC
    - net-next/hinic:fix a bug in set mac address
    - net-next/hinic: fix a bug in rx data flow
    - net: hinic: fix null pointer dereference on pointer hwdev
    - hinic: optmize rx refill buffer mechanism
    - net-next/hinic:add shutdown callback
    - net-next/hinic: replace disable_irq_nosync/enable_irq

  * [CONFIG] please enable highdpi font FONT_TER16x32 (LP: #1819881)
    - Fonts: New Terminus large console font
    - [Config]: enable highdpi Terminus 16x32 font support

  * [19.04 FEAT] qeth: Enhanced link speed - kernel part (LP: #1814892)
    - s390/qeth: report 25Gbit link speed

  * Avoid potential memory corruption on HiSilicon SoCs (LP: #1819546)
    - iommu/arm-smmu-v3: Avoid memory corruption from Hisilicon MSI payloads

  * CVE-2017-5715
    - x86/speculation: Apply IBPB more strictly to avoid cross-process data leak
    - x86/speculation: Propagate information about RSB filling mitigation to sysfs
    - x86/speculation: Add RETPOLINE_AMD support to the inline asm CALL_NOSPEC
      variant
    - x86/retpoline: Make CONFIG_RETPOLINE depend on compiler support
    - x86/retpoline: Remove minimal retpoline support
    - x86/speculation: Update the TIF_SSBD comment
    - x86/speculation: Clean up spectre_v2_parse_cmdline()
    - x86/speculation: Remove unnecessary ret variable in cpu_show_common()
    - x86/speculation: Move STIPB/IBPB string conditionals out of
      cpu_show_common()
    - x86/speculation: Disable STIBP when enhanced IBRS is in use
    - x86/speculation: Rename SSBD update functions
    - x86/speculation: Reorganize speculation control MSRs update
    - sched/smt: Make sched_smt_present track topology
    - x86/Kconfig: Select SCHED_SMT if SMP enabled
    - sched/smt: Expose sched_smt_present static key
    - x86/speculation: Rework SMT state change
    - x86/l1tf: Show actual SMT state
    - x86/speculation: R...

Read more...

Changed in linux (Ubuntu Cosmic):
status: Fix Committed → Fix Released
Revision history for this message
Launchpad Janitor (janitor) wrote :
Download full text (14.6 KiB)

This bug was fixed in the package linux - 4.15.0-48.51

---------------
linux (4.15.0-48.51) bionic; urgency=medium

  * linux: 4.15.0-48.51 -proposed tracker (LP: #1822820)

  * Packaging resync (LP: #1786013)
    - [Packaging] update helper scripts
    - [Packaging] resync retpoline extraction

  * 3b080b2564287be91605bfd1d5ee985696e61d3c in ubuntu_btrfs_kernel_fixes
    triggers system hang on i386 (LP: #1812845)
    - btrfs: raid56: properly unmap parity page in finish_parity_scrub()

  * [P9][LTCTest][Opal][FW910] cpupower monitor shows multiple stop Idle_Stats
    (LP: #1719545)
    - cpupower : Fix header name to read idle state name

  * [amdgpu] screen corruption when using touchpad (LP: #1818617)
    - drm/amdgpu/gmc: steal the appropriate amount of vram for fw hand-over (v3)
    - drm/amdgpu: Free VGA stolen memory as soon as possible.

  * [SRU][B/C/OEM]IOMMU: add kernel dma protection (LP: #1820153)
    - ACPICA: AML parser: attempt to continue loading table after error
    - ACPI / property: Allow multiple property compatible _DSD entries
    - PCI / ACPI: Identify untrusted PCI devices
    - iommu/vt-d: Force IOMMU on for platform opt in hint
    - iommu/vt-d: Do not enable ATS for untrusted devices
    - thunderbolt: Export IOMMU based DMA protection support to userspace
    - iommu/vt-d: Disable ATS support on untrusted devices

  * Add basic support to NVLink2 passthrough (LP: #1819989)
    - powerpc/powernv/npu: Do not try invalidating 32bit table when 64bit table is
      enabled
    - powerpc/powernv: call OPAL_QUIESCE before OPAL_SIGNAL_SYSTEM_RESET
    - powerpc/powernv: Export opal_check_token symbol
    - powerpc/powernv: Make possible for user to force a full ipl cec reboot
    - powerpc/powernv/idoa: Remove unnecessary pcidev from pci_dn
    - powerpc/powernv: Move npu struct from pnv_phb to pci_controller
    - powerpc/powernv/npu: Move OPAL calls away from context manipulation
    - powerpc/pseries/iommu: Use memory@ nodes in max RAM address calculation
    - powerpc/pseries/npu: Enable platform support
    - powerpc/pseries: Remove IOMMU API support for non-LPAR systems
    - powerpc/powernv/npu: Check mmio_atsd array bounds when populating
    - powerpc/powernv/npu: Fault user page into the hypervisor's pagetable

  * Huawei Hi1822 NIC has poor performance (LP: #1820187)
    - net-next: hinic: fix a problem in free_tx_poll()
    - hinic: remove ndo_poll_controller
    - net-next/hinic: add checksum offload and TSO support
    - hinic: Fix l4_type parameter in hinic_task_set_tunnel_l4
    - net-next/hinic:replace multiply and division operators
    - net-next/hinic:add rx checksum offload for HiNIC
    - net-next/hinic:fix a bug in set mac address
    - net-next/hinic: fix a bug in rx data flow
    - net: hinic: fix null pointer dereference on pointer hwdev
    - hinic: optmize rx refill buffer mechanism
    - net-next/hinic:add shutdown callback
    - net-next/hinic: replace disable_irq_nosync/enable_irq

  * [CONFIG] please enable highdpi font FONT_TER16x32 (LP: #1819881)
    - Fonts: New Terminus large console font
    - [Config]: enable highdpi Terminus 16x32 font support

  * [19.04 FEAT] qeth: Enhanced link...

Changed in linux (Ubuntu Bionic):
status: Fix Committed → Fix Released
Revision history for this message
Steve Langasek (vorlon) wrote : Update Released

The verification of the Stable Release Update for linux-azure has completed successfully and the package has now been released to -updates. Subsequently, the Ubuntu Stable Release Updates Team is being unsubscribed and will not receive messages about this bug report. In the event that you encounter a regression using the package from -updates please report a new bug using ubuntu-bug and tag the bug report regression-update so we can easily find any regressions.

To post a comment you must log in.
This report contains Public information  
Everyone can see this information.

Other bug subscribers

Remote bug watches

Bug watches keep track of this bug in other bug trackers.