fscache: bad refcounting in fscache_op_complete leads to OOPS

Bug #1797314 reported by Daniel Axtens on 2018-10-11
10
This bug affects 1 person
Affects Status Importance Assigned to Milestone
linux (Ubuntu)
Medium
Daniel Axtens
Xenial
Medium
Unassigned
Bionic
Medium
Unassigned
Cosmic
Medium
Daniel Axtens

Bug Description

SRU Justification
-----------------

[Impact]

A kernel BUG is sometimes observed when using fscache:
    [4740718.880898] FS-Cache:
    [4740718.880920] FS-Cache: Assertion failed
    [4740718.880934] FS-Cache: 0 > 0 is false
    [4740718.881001] ------------[ cut here ]------------
    [4740718.881017] kernel BUG at /usr/src/linux-4.4.0/fs/fscache/operation.c:449!
    [4740718.881040] invalid opcode: 0000 [#1] SMP

    [4740718.892659] Call Trace:
    [4740718.893506] [<ffffffffc1464cf9>] cachefiles_read_copier+0x3a9/0x410 [cachefiles]
    [4740718.894374] [<ffffffffc037e272>] fscache_op_work_func+0x22/0x50 [fscache]
    [4740718.895180] [<ffffffff81096da0>] process_one_work+0x150/0x3f0
    [4740718.895966] [<ffffffff8109751a>] worker_thread+0x11a/0x470
    [4740718.896753] [<ffffffff81808e59>] ? __schedule+0x359/0x980
    [4740718.897783] [<ffffffff81097400>] ? rescuer_thread+0x310/0x310
    [4740718.898581] [<ffffffff8109cdd6>] kthread+0xd6/0xf0
    [4740718.899469] [<ffffffff8109cd00>] ? kthread_park+0x60/0x60
    [4740718.900477] [<ffffffff8180d0cf>] ret_from_fork+0x3f/0x70
    [4740718.901514] [<ffffffff8109cd00>] ? kthread_park+0x60/0x60

[Problem]

In include/linux/fscache-cache.h, fscache_retrieval_complete reads, in part:

            atomic_sub(n_pages, &op->n_pages);
            if (atomic_read(&op->n_pages) <= 0)
                    fscache_op_complete(&op->op, true);

The code is using atomic_sub followed by an atomic_read. This causes two threads doing a decrement of pages to race with each other seeing the op->refcount <= 0 at same time, and end up calling fscache_op_complete in both the threads leading to the OOPS.

[Fix]
The fix is trivial to use atomic_sub_return instead of two calls.

[Testcase]
I believe the user has tested the patch successfully on their fscache/cachefiles setup.

[Regression Potential]
Limited to fscache. Small, comprehensible change.

Daniel Axtens (daxtens) on 2018-10-11
description: updated

This bug is missing log files that will aid in diagnosing the problem. While running an Ubuntu kernel (not a mainline or third-party kernel) please enter the following command in a terminal window:

apport-collect 1797314

and then change the status of the bug to 'Confirmed'.

If, due to the nature of the issue you have encountered, you are unable to run this command, please add a comment stating that fact and change the bug status to 'Confirmed'.

This change has been made by an automated script, maintained by the Ubuntu Kernel Team.

Changed in linux (Ubuntu):
status: New → Incomplete
Daniel Axtens (daxtens) on 2018-10-12
description: updated
Stefan Bader (smb) on 2018-10-12
Changed in linux (Ubuntu Xenial):
status: New → In Progress
Changed in linux (Ubuntu Bionic):
status: New → In Progress
Changed in linux (Ubuntu Xenial):
importance: Undecided → Medium
Changed in linux (Ubuntu Bionic):
importance: Undecided → Medium
Changed in linux (Ubuntu Cosmic):
importance: Undecided → Medium
status: Incomplete → In Progress
assignee: nobody → Daniel Axtens (daxtens)
Changed in linux (Ubuntu Bionic):
status: In Progress → Fix Committed
Changed in linux (Ubuntu Xenial):
status: In Progress → Fix Committed
Brad Figg (brad-figg) wrote :

This bug is awaiting verification that the kernel in -proposed solves the problem. Please test the kernel and update this bug with the results. If the problem is solved, change the tag 'verification-needed-bionic' to 'verification-done-bionic'. If the problem still exists, change the tag 'verification-needed-bionic' to 'verification-failed-bionic'.

If verification is not done by 5 working days from today, this fix will be dropped from the source code, and this bug will be closed.

See https://wiki.ubuntu.com/Testing/EnableProposed for documentation how to enable and use -proposed. Thank you!

tags: added: verification-needed-bionic
Brad Figg (brad-figg) wrote :

This bug is awaiting verification that the kernel in -proposed solves the problem. Please test the kernel and update this bug with the results. If the problem is solved, change the tag 'verification-needed-cosmic' to 'verification-done-cosmic'. If the problem still exists, change the tag 'verification-needed-cosmic' to 'verification-failed-cosmic'.

If verification is not done by 5 working days from today, this fix will be dropped from the source code, and this bug will be closed.

See https://wiki.ubuntu.com/Testing/EnableProposed for documentation how to enable and use -proposed. Thank you!

tags: added: verification-needed-cosmic
Brad Figg (brad-figg) wrote :

This bug is awaiting verification that the kernel in -proposed solves the problem. Please test the kernel and update this bug with the results. If the problem is solved, change the tag 'verification-needed-xenial' to 'verification-done-xenial'. If the problem still exists, change the tag 'verification-needed-xenial' to 'verification-failed-xenial'.

If verification is not done by 5 working days from today, this fix will be dropped from the source code, and this bug will be closed.

See https://wiki.ubuntu.com/Testing/EnableProposed for documentation how to enable and use -proposed. Thank you!

tags: added: verification-needed-xenial
David Coronel (davecore) wrote :

I have a user who confirms the 4.15.0-39 kernel fixes this issue:

# uname -a
Linux <hostname> 4.15.0-39-generic #42~16.04.1-Ubuntu SMP Wed Oct 24 17:09:54 UTC 2018 x86_64 x86_64 x86_64 GNU/Linux

# cat /proc/fs/fscache/stats
FS-Cache statistics
Cookies: idx=2 dat=283731 spc=0
Objects: alc=278557 nal=0 avl=278557 ded=220831
ChkAux : non=0 ok=278555 upd=0 obs=0
Pages : mrk=347998933 unc=323161526
Acquire: n=283733 nul=0 noc=0 ok=283733 nbf=0 oom=0
Lookups: n=278557 neg=0 pos=278557 crt=0 tmo=0
Invals : n=0 run=0
Updates: n=0 nul=0 run=0
Relinqs: n=226008 nul=0 wcr=0 rtr=0
AttrChg: n=0 ok=0 nbf=0 oom=0 run=0
Allocs : n=0 ok=0 wt=0 nbf=0 int=0
Allocs : ops=0 owt=0 abt=0
Retrvls: n=960948 ok=960948 wt=24031 nod=0 nbf=0 int=0 oom=0
Retrvls: ops=960948 owt=51 abt=0
Stores : n=0 ok=0 agn=0 nbf=0 oom=0
Stores : ops=0 run=0 pgs=0 rxd=0 olm=0
VmScan : nos=323161526 gon=0 bsy=0 can=0 wt=0
Ops : pend=51 run=960948 enq=379179023 can=0 rej=0
Ops : ini=960948 dfr=15 rel=960948 gc=15
CacheOp: alo=0 luo=0 luc=0 gro=0
CacheOp: inv=0 upo=0 dro=0 pto=0 atc=0 syn=0
CacheOp: rap=0 ras=0 alp=0 als=0 wrp=0 ucp=0 dsp=0
CacheEv: nsp=0 stl=0 rtr=0 cul=0

Marking verification done for Bionic.

tags: added: verification-done-bionic
removed: verification-needed-bionic

Yes the test results indicates that data was read cached using fscache and 960948 ops have gone through the change.

we are not planning to test with 4.4.0-139 kernel as I don't have a system setup for this kernel.

Changed in linux (Ubuntu Cosmic):
status: In Progress → Fix Committed
Launchpad Janitor (janitor) wrote :
Download full text (21.0 KiB)

This bug was fixed in the package linux - 4.4.0-139.165

---------------
linux (4.4.0-139.165) xenial; urgency=medium

  * linux: 4.4.0-139.165 -proposed tracker (LP: #1799401)

  * Kernel panic after the ubuntu_nbd_smoke_test on Xenial kernel (LP: #1793464)
    - nbd: Remove signal usage
    - nbd: Timeouts are not user requested disconnects
    - nbd: Cleanup reset of nbd and bdev after a disconnect
    - nbd: don't shutdown sock with irq's disabled
    - nbd: fix race in ioctl

  * fscache: bad refcounting in fscache_op_complete leads to OOPS (LP: #1797314)
    - SAUCE: fscache: Fix race in decrementing refcount of op->npages

  * xenial: virtio-scsi: CPU soft lockup due to loop in
    virtscsi_target_destroy() (LP: #1798110)
    - SAUCE: (no-up) virtio-scsi: Decrement reqs counter before SCSI command
      requeue

  * Error reported when creating ZFS pool with "-t" option, despite successful
    pool creation (LP: #1769937)
    - SAUCE: (noup) Update zfs to 0.6.5.6-0ubuntu26

  * Xenial update: 4.4.160 upstream stable release (LP: #1798770)
    - crypto: skcipher - Fix -Wstringop-truncation warnings
    - tsl2550: fix lux1_input error in low light
    - vmci: type promotion bug in qp_host_get_user_memory()
    - x86/numa_emulation: Fix emulated-to-physical node mapping
    - staging: rts5208: fix missing error check on call to rtsx_write_register
    - uwb: hwa-rc: fix memory leak at probe
    - power: vexpress: fix corruption in notifier registration
    - Bluetooth: Add a new Realtek 8723DE ID 0bda:b009
    - USB: serial: kobil_sct: fix modem-status error handling
    - 6lowpan: iphc: reset mac_header after decompress to fix panic
    - md-cluster: clear another node's suspend_area after the copy is finished
    - media: exynos4-is: Prevent NULL pointer dereference in __isp_video_try_fmt()
    - powerpc/kdump: Handle crashkernel memory reservation failure
    - media: fsl-viu: fix error handling in viu_of_probe()
    - x86/tsc: Add missing header to tsc_msr.c
    - x86/entry/64: Add two more instruction suffixes
    - scsi: target/iscsi: Make iscsit_ta_authentication() respect the output
      buffer size
    - scsi: klist: Make it safe to use klists in atomic context
    - scsi: ibmvscsi: Improve strings handling
    - usb: wusbcore: security: cast sizeof to int for comparison
    - powerpc/powernv/ioda2: Reduce upper limit for DMA window size
    - alarmtimer: Prevent overflow for relative nanosleep
    - s390/extmem: fix gcc 8 stringop-overflow warning
    - ALSA: snd-aoa: add of_node_put() in error path
    - media: s3c-camif: ignore -ENOIOCTLCMD from v4l2_subdev_call for s_power
    - media: soc_camera: ov772x: correct setting of banding filter
    - media: omap3isp: zero-initialize the isp cam_xclk{a,b} initial data
    - staging: android: ashmem: Fix mmap size validation
    - drivers/tty: add error handling for pcmcia_loop_config
    - media: tm6000: add error handling for dvb_register_adapter
    - ALSA: hda: Add AZX_DCAPS_PM_RUNTIME for AMD Raven Ridge
    - ath10k: protect ath10k_htt_rx_ring_free with rx_ring.lock
    - rndis_wlan: potential buffer overflow in rndis_wlan_auth_indication()
    - wlcore: Add missing PM call fo...

Changed in linux (Ubuntu Xenial):
status: Fix Committed → Fix Released
Launchpad Janitor (janitor) wrote :
Download full text (5.4 KiB)

This bug was fixed in the package linux - 4.15.0-39.42

---------------
linux (4.15.0-39.42) bionic; urgency=medium

  * linux: 4.15.0-39.42 -proposed tracker (LP: #1799411)

  * Linux: insufficient shootdown for paging-structure caches (LP: #1798897)
    - mm: move tlb_table_flush to tlb_flush_mmu_free
    - mm/tlb: Remove tlb_remove_table() non-concurrent condition
    - mm/tlb, x86/mm: Support invalidating TLB caches for RCU_TABLE_FREE
    - [Config] CONFIG_HAVE_RCU_TABLE_INVALIDATE=y

  * Ubuntu18.04: GPU total memory is reduced (LP: #1792102)
    - Revert "powerpc/powernv: Increase memory block size to 1GB on radix"

  * arm64: snapdragon: reduce boot noise (LP: #1797154)
    - [Config] arm64: snapdragon: DRM_MSM=m
    - [Config] arm64: snapdragon: SND*=m
    - [Config] arm64: snapdragon: disable ARM_SDE_INTERFACE
    - [Config] arm64: snapdragon: disable DRM_I2C_ADV7511_CEC
    - [Config] arm64: snapdragon: disable VIDEO_ADV7511, VIDEO_COBALT

  * [Bionic] CPPC bug fixes (LP: #1796949)
    - ACPI / CPPC: Update all pr_(debug/err) messages to log the susbspace id
    - cpufreq: CPPC: Don't set transition_latency
    - ACPI / CPPC: Fix invalid PCC channel status errors

  * regression in 'ip --family bridge neigh' since linux v4.12 (LP: #1796748)
    - rtnetlink: fix rtnl_fdb_dump() for ndmsg header

  * screen displays abnormally on the lenovo M715 with the AMD GPU (Radeon Vega
    8 Mobile, rev ca, 1002:15dd) (LP: #1796786)
    - drm/amd/display: Fix takover from VGA mode
    - drm/amd/display: early return if not in vga mode in disable_vga
    - drm/amd/display: Refine disable VGA

  * arm64: snapdragon: WARNING: CPU: 0 PID: 1 arch/arm64/kernel/setup.c:271
    reserve_memblock_reserved_regions (LP: #1797139)
    - SAUCE: arm64: Fix /proc/iomem for reserved but not memory regions

  * The front MIC can't work on the Lenovo M715 (LP: #1797292)
    - ALSA: hda/realtek - Fix the problem of the front MIC on the Lenovo M715

  * Keyboard backlight sysfs sometimes is missing on Dell laptops (LP: #1797304)
    - platform/x86: dell-smbios: Correct some style warnings
    - platform/x86: dell-smbios: Rename dell-smbios source to dell-smbios-base
    - platform/x86: dell-smbios: Link all dell-smbios-* modules together
    - [Config] CONFIG_DELL_SMBIOS_SMM=y, CONFIG_DELL_SMBIOS_WMI=y

  * rpi3b+: ethernet not working (LP: #1797406)
    - lan78xx: Don't reset the interface on open

  * 87cdf3148b11 was never backported to 4.15 (LP: #1795653)
    - xfrm: Verify MAC header exists before overwriting eth_hdr(skb)->h_proto

  * [Ubuntu18.04][Power9][DD2.2]package installation segfaults inside debian
    chroot env in P9 KVM guest with HTM enabled (kvm) (LP: #1792501)
    - KVM: PPC: Book3S HV: Fix guest r11 corruption with POWER9 TM workarounds

  * Provide mode where all vCPUs on a core must be the same VM (LP: #1792957)
    - KVM: PPC: Book3S HV: Provide mode where all vCPUs on a core must be the same
      VM

  * fscache: bad refcounting in fscache_op_complete leads to OOPS (LP: #1797314)
    - SAUCE: fscache: Fix race in decrementing refcount of op->npages

  * CVE-2018-9363
    - Bluetooth: hidp: buffer overflow in hidp_process_report

  * CVE-20...

Read more...

Changed in linux (Ubuntu Bionic):
status: Fix Committed → Fix Released
Launchpad Janitor (janitor) wrote :

This bug was fixed in the package linux - 4.18.0-11.12

---------------
linux (4.18.0-11.12) cosmic; urgency=medium

  * linux: 4.18.0-11.12 -proposed tracker (LP: #1799445)

  * arm64: snapdragon: WARNING: CPU: 0 PID: 1 arch/arm64/kernel/setup.c:271
    reserve_memblock_reserved_regions (LP: #1797139)
    - SAUCE: arm64: Fix /proc/iomem for reserved but not memory regions

  * arm64: snapdragon: WARNING: CPU: 0 PID: 1 at drivers/irqchip/irq-gic.c:1016
    gic_irq_domain_translate (LP: #1797143)
    - SAUCE: arm64: dts: msm8916: camms: fix gic_irq_domain_translate warnings

  * The front MIC can't work on the Lenovo M715 (LP: #1797292)
    - ALSA: hda/realtek - Fix the problem of the front MIC on the Lenovo M715

  * Provide mode where all vCPUs on a core must be the same VM (LP: #1792957)
    - KVM: PPC: Book3S HV: Provide mode where all vCPUs on a core must be the same
      VM

  * fscache: bad refcounting in fscache_op_complete leads to OOPS (LP: #1797314)
    - SAUCE: fscache: Fix race in decrementing refcount of op->npages

  * hns3: autoneg settings get lost on down/up (LP: #1797654)
    - net: hns3: Fix for information of phydev lost problem when down/up

  * not able to unwind the stack from within __kernel_clock_gettime in the Linux
    vDSO (LP: #1797963)
    - powerpc/vdso: Correct call frame information

  * Signal 7 error when running GPFS tracing in cluster (LP: #1792195)
    - powerpc/mm/books3s: Add new pte bit to mark pte temporarily invalid.
    - powerpc/mm/radix: Only need the Nest MMU workaround for R -> RW transition

  * Support Edge Gateway's WIFI LED (LP: #1798330)
    - SAUCE: mwifiex: Switch WiFi LED state according to the device status

  * Support Edge Gateway's Bluetooth LED (LP: #1798332)
    - SAUCE: Bluetooth: Support for LED on Edge Gateways

  * kvm doesn't work on 36 physical bits systems (LP: #1798427)
    - KVM: x86: fix L1TF's MMIO GFN calculation

  * CVE-2018-15471
    - xen-netback: fix input validation in xenvif_set_hash_mapping()

  * regression in 'ip --family bridge neigh' since linux v4.12 (LP: #1796748)
    - rtnetlink: fix rtnl_fdb_dump() for ndmsg header

 -- Stefan Bader <email address hidden> Tue, 23 Oct 2018 18:59:15 +0200

Changed in linux (Ubuntu Cosmic):
status: Fix Committed → Fix Released
To post a comment you must log in.
This report contains Public information  Edit
Everyone can see this information.

Other bug subscribers