[AEP-bug] ext4: more rare direct I/O vs unmap failures

Bug #1787089 reported by quanxian
6
This bug affects 1 person
Affects Status Importance Assigned to Milestone
intel
Fix Released
Medium
Unassigned
linux (Ubuntu)
Fix Released
Medium
Unassigned

Bug Description

Description:
Even with the ext4_break_layouts() support added by "ext4: handle layout changes to pinned DAX mappings" Still seeing occasional cases with unit test where we are truncating a page that has an elevated reference count. Investigate.

The root cause of this issue is that while the ei->i_mmap_sem provides
synchronization between ext4_break_layouts() and page faults, it doesn't
provide synchronize us with the direct I/O path. This exact same issue exists
in XFS AFAICT, with the synchronization tool there being the XFS_MMAPLOCK.

This allows the direct I/O path to do I/O and raise & lower page->_refcount
while we're executing a truncate/hole punch. This leads to us trying to free
a page with an elevated refcount.

Here's one instance of the race:

CPU 0 CPU 1
----- -----
ext4_punch_hole()
ext4_break_layouts() # all pages have refcount=1

ext4_direct_IO()
... lots of layers ...
follow_page_pte()
get_page() # elevates refcount

truncate_pagecache_range()
... a few layers ...
dax_disassociate_entry() # sees elevated refcount, WARN_ON_ONCE()

A similar race occurs when the refcount is being dropped while we're running
ext4_break_layouts(), and this is the one that my test was actually hitting:

CPU 0 CPU 1
----- -----
ext4_direct_IO()
... lots of layers ...
follow_page_pte()
get_page()

elevates refcount of page X
ext4_punch_hole()
ext4_break_layouts() # two pages, X & Y, have refcount == 2
__wait_var_event() # called for page X
__put_devmap_managed_page()

drops refcount of X to 1
__wait_var_events() checks X's refcount in "if (condition)", and breaks.
We never actually called ext4_wait_dax_page(), so 'retry' in
ext4_break_layouts() is still false. Exit do/while loop in
ext4_break_layouts, never attempting to wait on page Y which still has an
elevated refcount of 2.
truncate_pagecache_range()
... a few layers ...
dax_disassociate_entry() # sees elevated refcount for Y, WARN_ON_ONCE()

Essentially the solution will most likely involve adding synchronization between the direct I/O path and truncate/hole punch type operations, and it'll need to happen for both ext4 and XFS, so the filesystem folks need to be involved.

CVE References

Revision history for this message
quanxian (quanxian-wang) wrote :

This bug is related with CLX platform which is not published, keep it private until CLX platform is released. Thanks for your understanding.

information type: Public → Private
Revision history for this message
quanxian (quanxian-wang) wrote :

Additional Note from Developer: This is a bug fix than feature request since it's addressing race condition in the filesystem. But for this patch to apply, the following patches must be there already:
Commit from Dan released 5/9
d6dc57e251a ("xfs, dax: introduce xfs_break_dax_layouts()")
Commit from Ross released 7/29
cdbf8897cb0 ("dax: dax_layout_busy_page() warn on !exceptional")
430657b6be8 ("ext4: handle layout changes to pinned DAX mappings")

quanxian (quanxian-wang)
no longer affects: linux (Ubuntu)
tags: added: kernel-da-key
Revision history for this message
pragyansri.pathi@intel.com (pragyan) wrote :

Commit in 4.19:
e25ff835af

Revision history for this message
quanxian (quanxian-wang) wrote :

another commit for xfs:
xfs: more rare direct I/O vs unmap failures
e25ff835af8

this patch is in 4.19. please cherry pick it into 18.10. Thanks

Revision history for this message
quanxian (quanxian-wang) wrote :

need backporting

affects: ubuntu → linux (Ubuntu)
Changed in linux (Ubuntu):
importance: Undecided → Medium
status: New → Triaged
Changed in intel:
status: New → Triaged
importance: Undecided → Medium
Revision history for this message
Joseph Salisbury (jsalisbury) wrote :

I built a test kernel with commits cdbf8897cb0, 430657b6be8 and e25ff835af8. Commit d6dc57e251a is already in the 18.10 kernel since it was in mainline as of v4.18-rc1.

The test kernel can be downloaded from:
http://kernel.ubuntu.com/~jsalisbury/lp1787089

Can you test this kernel and see if it resolves this bug?

Note about installing test kernels:
• If the test kernel is prior to 4.15(Bionic) you need to install the linux-image and linux-image-extra .deb packages.
• If the test kernel is 4.15(Bionic) or newer, you need to install the linux-modules, linux-modules-extra and linux-image-unsigned .deb packages.

Thanks in advance!

Changed in linux (Ubuntu):
status: Triaged → In Progress
assignee: nobody → Joseph Salisbury (jsalisbury)
Revision history for this message
quanxian (quanxian-wang) wrote :

I am finding the test case from upstream to have a try.

information type: Private → Public
Revision history for this message
Joseph Salisbury (jsalisbury) wrote :
Revision history for this message
quanxian (quanxian-wang) wrote :

Intel upstream developer no longer works at Intel and without test case left. Currently we have no way to test it. Sorry about that.

Changed in linux (Ubuntu):
status: In Progress → Fix Committed
Revision history for this message
Launchpad Janitor (janitor) wrote :
Download full text (60.2 KiB)

This bug was fixed in the package linux - 4.18.0-9.10

---------------
linux (4.18.0-9.10) cosmic; urgency=medium

  * linux: 4.18.0-9.10 -proposed tracker (LP: #1796346)

  * Cosmic update: v4.18.12 upstream stable release (LP: #1796139)
    - crypto: skcipher - Fix -Wstringop-truncation warnings
    - iio: adc: ina2xx: avoid kthread_stop() with stale task_struct
    - tsl2550: fix lux1_input error in low light
    - misc: ibmvmc: Use GFP_ATOMIC under spin lock
    - vmci: type promotion bug in qp_host_get_user_memory()
    - siox: don't create a thread without starting it
    - x86/numa_emulation: Fix emulated-to-physical node mapping
    - staging: rts5208: fix missing error check on call to rtsx_write_register
    - power: supply: axp288_charger: Fix initial constant_charge_current value
    - misc: sram: enable clock before registering regions
    - serial: sh-sci: Stop RX FIFO timer during port shutdown
    - uwb: hwa-rc: fix memory leak at probe
    - power: vexpress: fix corruption in notifier registration
    - iommu/amd: make sure TLB to be flushed before IOVA freed
    - Bluetooth: Add a new Realtek 8723DE ID 0bda:b009
    - USB: serial: kobil_sct: fix modem-status error handling
    - 6lowpan: iphc: reset mac_header after decompress to fix panic
    - iommu/msm: Don't call iommu_device_{,un}link from atomic context
    - s390/mm: correct allocate_pgste proc_handler callback
    - power: remove possible deadlock when unregistering power_supply
    - drm/amd/display/dc/dce: Fix multiple potential integer overflows
    - drm/amd/display: fix use of uninitialized memory
    - md-cluster: clear another node's suspend_area after the copy is finished
    - cxgb4: Fix the condition to check if the card is T5
    - RDMA/bnxt_re: Fix a couple off by one bugs
    - RDMA/i40w: Hold read semaphore while looking after VMA
    - RDMA/bnxt_re: Fix a bunch of off by one bugs in qplib_fp.c
    - IB/core: type promotion bug in rdma_rw_init_one_mr()
    - media: exynos4-is: Prevent NULL pointer dereference in __isp_video_try_fmt()
    - IB/mlx4: Test port number before querying type.
    - powerpc/kdump: Handle crashkernel memory reservation failure
    - media: fsl-viu: fix error handling in viu_of_probe()
    - vhost_net: Avoid tx vring kicks during busyloop
    - media: staging/imx: fill vb2_v4l2_buffer field entry
    - IB/mlx5: Fix GRE flow specification
    - include/rdma/opa_addr.h: Fix an endianness issue
    - x86/tsc: Add missing header to tsc_msr.c
    - ARM: hwmod: RTC: Don't assume lock/unlock will be called with irq enabled
    - x86/entry/64: Add two more instruction suffixes
    - ARM: dts: ls1021a: Add missing cooling device properties for CPUs
    - scsi: target/iscsi: Make iscsit_ta_authentication() respect the output
      buffer size
    - thermal: i.MX: Allow thermal probe to fail gracefully in case of bad
      calibration.
    - scsi: klist: Make it safe to use klists in atomic context
    - scsi: ibmvscsi: Improve strings handling
    - scsi: target: Avoid that EXTENDED COPY commands trigger lock inversion
    - usb: wusbcore: security: cast sizeof to int for comparison
    - ath10k: sdio: use same endpoint id for all packets...

Changed in linux (Ubuntu):
status: Fix Committed → Fix Released
quanxian (quanxian-wang)
Changed in intel:
status: Triaged → Fix Released
Revision history for this message
Brad Figg (brad-figg) wrote :

This bug is awaiting verification that the kernel in -proposed solves the problem. Please test the kernel and update this bug with the results. If the problem is solved, change the tag 'verification-needed-bionic' to 'verification-done-bionic'. If the problem still exists, change the tag 'verification-needed-bionic' to 'verification-failed-bionic'.

If verification is not done by 5 working days from today, this fix will be dropped from the source code, and this bug will be closed.

See https://wiki.ubuntu.com/Testing/EnableProposed for documentation how to enable and use -proposed. Thank you!

tags: added: verification-needed-bionic
Andy Whitcroft (apw)
tags: added: kernel-fixup-verification-needed-bionic
removed: verification-needed-bionic
Revision history for this message
Andy Whitcroft (apw) wrote :

This bug was erroneously marked for verification in bionic; verification is not required and verification-needed-bionic is being removed.

tags: added: verification-done-bionic
To post a comment you must log in.
This report contains Public information  
Everyone can see this information.

Other bug subscribers

Remote bug watches

Bug watches keep track of this bug in other bug trackers.