[AEP-bug] ext4: more rare direct I/O vs unmap failures

Bug #1787089 reported by quanxian on 2018-08-15
6
This bug affects 1 person
Affects Status Importance Assigned to Milestone
intel
Medium
Unassigned
linux (Ubuntu)
Medium
Joseph Salisbury

Bug Description

Description:
Even with the ext4_break_layouts() support added by "ext4: handle layout changes to pinned DAX mappings" Still seeing occasional cases with unit test where we are truncating a page that has an elevated reference count. Investigate.

The root cause of this issue is that while the ei->i_mmap_sem provides
synchronization between ext4_break_layouts() and page faults, it doesn't
provide synchronize us with the direct I/O path. This exact same issue exists
in XFS AFAICT, with the synchronization tool there being the XFS_MMAPLOCK.

This allows the direct I/O path to do I/O and raise & lower page->_refcount
while we're executing a truncate/hole punch. This leads to us trying to free
a page with an elevated refcount.

Here's one instance of the race:

CPU 0 CPU 1
----- -----
ext4_punch_hole()
ext4_break_layouts() # all pages have refcount=1

ext4_direct_IO()
... lots of layers ...
follow_page_pte()
get_page() # elevates refcount

truncate_pagecache_range()
... a few layers ...
dax_disassociate_entry() # sees elevated refcount, WARN_ON_ONCE()

A similar race occurs when the refcount is being dropped while we're running
ext4_break_layouts(), and this is the one that my test was actually hitting:

CPU 0 CPU 1
----- -----
ext4_direct_IO()
... lots of layers ...
follow_page_pte()
get_page()

elevates refcount of page X
ext4_punch_hole()
ext4_break_layouts() # two pages, X & Y, have refcount == 2
__wait_var_event() # called for page X
__put_devmap_managed_page()

drops refcount of X to 1
__wait_var_events() checks X's refcount in "if (condition)", and breaks.
We never actually called ext4_wait_dax_page(), so 'retry' in
ext4_break_layouts() is still false. Exit do/while loop in
ext4_break_layouts, never attempting to wait on page Y which still has an
elevated refcount of 2.
truncate_pagecache_range()
... a few layers ...
dax_disassociate_entry() # sees elevated refcount for Y, WARN_ON_ONCE()

Essentially the solution will most likely involve adding synchronization between the direct I/O path and truncate/hole punch type operations, and it'll need to happen for both ext4 and XFS, so the filesystem folks need to be involved.

CVE References

quanxian (quanxian-wang) wrote :

This bug is related with CLX platform which is not published, keep it private until CLX platform is released. Thanks for your understanding.

information type: Public → Private
quanxian (quanxian-wang) wrote :

Additional Note from Developer: This is a bug fix than feature request since it's addressing race condition in the filesystem. But for this patch to apply, the following patches must be there already:
Commit from Dan released 5/9
d6dc57e251a ("xfs, dax: introduce xfs_break_dax_layouts()")
Commit from Ross released 7/29
cdbf8897cb0 ("dax: dax_layout_busy_page() warn on !exceptional")
430657b6be8 ("ext4: handle layout changes to pinned DAX mappings")

quanxian (quanxian-wang) on 2018-08-15
no longer affects: linux (Ubuntu)
tags: added: kernel-da-key

Commit in 4.19:
e25ff835af

quanxian (quanxian-wang) wrote :

another commit for xfs:
xfs: more rare direct I/O vs unmap failures
e25ff835af8

this patch is in 4.19. please cherry pick it into 18.10. Thanks

quanxian (quanxian-wang) wrote :

need backporting

affects: ubuntu → linux (Ubuntu)
Changed in linux (Ubuntu):
importance: Undecided → Medium
status: New → Triaged
Changed in intel:
status: New → Triaged
importance: Undecided → Medium
Joseph Salisbury (jsalisbury) wrote :

I built a test kernel with commits cdbf8897cb0, 430657b6be8 and e25ff835af8. Commit d6dc57e251a is already in the 18.10 kernel since it was in mainline as of v4.18-rc1.

The test kernel can be downloaded from:
http://kernel.ubuntu.com/~jsalisbury/lp1787089

Can you test this kernel and see if it resolves this bug?

Note about installing test kernels:
• If the test kernel is prior to 4.15(Bionic) you need to install the linux-image and linux-image-extra .deb packages.
• If the test kernel is 4.15(Bionic) or newer, you need to install the linux-modules, linux-modules-extra and linux-image-unsigned .deb packages.

Thanks in advance!

Changed in linux (Ubuntu):
status: Triaged → In Progress
assignee: nobody → Joseph Salisbury (jsalisbury)
quanxian (quanxian-wang) wrote :

I am finding the test case from upstream to have a try.

information type: Private → Public
quanxian (quanxian-wang) wrote :

Intel upstream developer no longer works at Intel and without test case left. Currently we have no way to test it. Sorry about that.

Changed in linux (Ubuntu):
status: In Progress → Fix Committed
Launchpad Janitor (janitor) wrote :
Download full text (60.2 KiB)

This bug was fixed in the package linux - 4.18.0-9.10

---------------
linux (4.18.0-9.10) cosmic; urgency=medium

  * linux: 4.18.0-9.10 -proposed tracker (LP: #1796346)

  * Cosmic update: v4.18.12 upstream stable release (LP: #1796139)
    - crypto: skcipher - Fix -Wstringop-truncation warnings
    - iio: adc: ina2xx: avoid kthread_stop() with stale task_struct
    - tsl2550: fix lux1_input error in low light
    - misc: ibmvmc: Use GFP_ATOMIC under spin lock
    - vmci: type promotion bug in qp_host_get_user_memory()
    - siox: don't create a thread without starting it
    - x86/numa_emulation: Fix emulated-to-physical node mapping
    - staging: rts5208: fix missing error check on call to rtsx_write_register
    - power: supply: axp288_charger: Fix initial constant_charge_current value
    - misc: sram: enable clock before registering regions
    - serial: sh-sci: Stop RX FIFO timer during port shutdown
    - uwb: hwa-rc: fix memory leak at probe
    - power: vexpress: fix corruption in notifier registration
    - iommu/amd: make sure TLB to be flushed before IOVA freed
    - Bluetooth: Add a new Realtek 8723DE ID 0bda:b009
    - USB: serial: kobil_sct: fix modem-status error handling
    - 6lowpan: iphc: reset mac_header after decompress to fix panic
    - iommu/msm: Don't call iommu_device_{,un}link from atomic context
    - s390/mm: correct allocate_pgste proc_handler callback
    - power: remove possible deadlock when unregistering power_supply
    - drm/amd/display/dc/dce: Fix multiple potential integer overflows
    - drm/amd/display: fix use of uninitialized memory
    - md-cluster: clear another node's suspend_area after the copy is finished
    - cxgb4: Fix the condition to check if the card is T5
    - RDMA/bnxt_re: Fix a couple off by one bugs
    - RDMA/i40w: Hold read semaphore while looking after VMA
    - RDMA/bnxt_re: Fix a bunch of off by one bugs in qplib_fp.c
    - IB/core: type promotion bug in rdma_rw_init_one_mr()
    - media: exynos4-is: Prevent NULL pointer dereference in __isp_video_try_fmt()
    - IB/mlx4: Test port number before querying type.
    - powerpc/kdump: Handle crashkernel memory reservation failure
    - media: fsl-viu: fix error handling in viu_of_probe()
    - vhost_net: Avoid tx vring kicks during busyloop
    - media: staging/imx: fill vb2_v4l2_buffer field entry
    - IB/mlx5: Fix GRE flow specification
    - include/rdma/opa_addr.h: Fix an endianness issue
    - x86/tsc: Add missing header to tsc_msr.c
    - ARM: hwmod: RTC: Don't assume lock/unlock will be called with irq enabled
    - x86/entry/64: Add two more instruction suffixes
    - ARM: dts: ls1021a: Add missing cooling device properties for CPUs
    - scsi: target/iscsi: Make iscsit_ta_authentication() respect the output
      buffer size
    - thermal: i.MX: Allow thermal probe to fail gracefully in case of bad
      calibration.
    - scsi: klist: Make it safe to use klists in atomic context
    - scsi: ibmvscsi: Improve strings handling
    - scsi: target: Avoid that EXTENDED COPY commands trigger lock inversion
    - usb: wusbcore: security: cast sizeof to int for comparison
    - ath10k: sdio: use same endpoint id for all packets...

Changed in linux (Ubuntu):
status: Fix Committed → Fix Released
quanxian (quanxian-wang) on 2018-10-26
Changed in intel:
status: Triaged → Fix Released
To post a comment you must log in.
This report contains Public information  Edit
Everyone can see this information.

Other bug subscribers