arm64: cma_alloc errors at boot

Bug #1823753 reported by dann frazier on 2019-04-08
20
This bug affects 2 people
Affects Status Importance Assigned to Milestone
linux (Ubuntu)
Status tracked in Eoan
Disco
High
dann frazier
Eoan
High
dann frazier

Bug Description

[Impact]
We enabled CONFIG_DMA_CMA to fix bug 1803206, but that led to a regression
on other arm64 systems that began spewing these messages on boot - sometimes > 10K of them:

[ 19.534097] cma: cma_alloc: alloc failed, req-size: 64 pages, ret: -12
[ 19.534109] cma: cma_alloc: alloc failed, req-size: 16 pages, ret: -12
[ 19.534113] cma: cma_alloc: alloc failed, req-size: 64 pages, ret: -12
[ 19.534126] cma: cma_alloc: alloc failed, req-size: 16 pages, ret: -12
[ 19.534130] cma: cma_alloc: alloc failed, req-size: 64 pages, ret: -12
[ 19.534142] cma: cma_alloc: alloc failed, req-size: 16 pages, ret: -12
[ 19.534146] cma: cma_alloc: alloc failed, req-size: 64 pages, ret: -12
[ 19.534157] cma: cma_alloc: alloc failed, req-size: 16 pages, ret: -12
[ 19.534161] cma: cma_alloc: alloc failed, req-size: 64 pages, ret: -12
[ 19.534173] cma: cma_alloc: alloc failed, req-size: 16 pages, ret: -12
[ 19.534177] cma: cma_alloc: alloc failed, req-size: 64 pages, ret: -12

In a previous SRU (bug 1828092), we worked around this by just rate-limiting these messages. These are "err" priority messages though, so even a lower number of them is still disconcerting.

[Fix]
1) Bump up the amount of available CMA on arm64 to 32M (same as upstream defconfig)
2) A patch-set from linux-next that redirects dma-direct contiguous allocations to alloc_pages() for single page allocations (single pages are by definition contiguous), avoiding CMA usage/fragmentation.

[Test Case]
dmesg | grep "cma_alloc: alloc failed"
Some system configs will still have some of these errors even after this fix - but this should reduce them significantly. Per-driver optimizations can be used to make further improvements, but we should track those in other bugs.

[Regression Risk]
Tested on a HiSilicon D06 and HP m400 (Hi1620 & X-Gene arm64).
Regression tested on:
 - Raspberry Pi 3B (see Comment #22)
 - Power9 system (ppc64el)
 - z/VM instance (s390x)
 - Intel Centerton system (amd64)

Test builds are staged at http://ppa.launchpad.net/~dannf/ubuntu/cma

This bug is missing log files that will aid in diagnosing the problem. While running an Ubuntu kernel (not a mainline or third-party kernel) please enter the following command in a terminal window:

apport-collect 1823753

and then change the status of the bug to 'Confirmed'.

If, due to the nature of the issue you have encountered, you are unable to run this command, please add a comment stating that fact and change the bug status to 'Confirmed'.

This change has been made by an automated script, maintained by the Ubuntu Kernel Team.

Changed in linux (Ubuntu):
status: New → Incomplete
dann frazier (dannf) wrote :

The disco kernel uses CONFIG_CMA_SIZE_MBYTES=16, while the arm64 defconfig sets it to 32. I tried bumping up this setting by powers of 2 until the messages went away, and it finally did at 128, which seems extreme. Need to figure out what is attempting these allocations, and why they are failing.

dann frazier (dannf) wrote :

This was not an issue prior to v4.20. Bisection identified the following commit:
  886643b766321 arm64: use the generic swiotlb_dma_ops

dann frazier (dannf) wrote :

I enabled CMA_DEBUG & CMA_DEBUGFS and booted w/ cma=128M so that we can see the CMA state when no allocations fail.

root@d06-4:/sys/kernel/debug/cma/cma-reserved# cat count
32768
root@d06-4:/sys/kernel/debug/cma/cma-reserved# cat used
15630

Looks like we're actually using ~61M - and perhaps the alignment requirements are pushing us over the 64M barrier (cma=64M still results in some cma_alloc failures).

dann frazier (dannf) on 2019-04-08
Changed in linux (Ubuntu):
status: Incomplete → Confirmed
dann frazier (dannf) wrote :

Note that this appears to be a regression introduced by bug 1803206. Disabling DMA_CMA avoids it - but obviously would regress bug 1803206.

Seth Forshee (sforshee) wrote :

I only had time to look briefly, but here's a few notes. Firstly, a workaround for now is passing cma=0 on the kernel command line.

I feel like there must be something wrong here if everything just works fine with CMA turned off, because obviously devices are getting CMA allocations when they don't really need them. I'm not familiar with CMA, but my assumption would be that you want as small a region as possible for it and thus you want to only use it for devices where it's actually necessary. Shooting from the hip I'd guess that one of two things is wrong, either some devices are doing their allocations wrong and getting CMA memory when they don't need it, or the DMA allocator is over zealous in its use of CMA.

Terry Rudd (terrykrudd) on 2019-05-03
Changed in linux (Ubuntu):
importance: Undecided → High
dann frazier (dannf) wrote :

To try to understand how the CMA is allocated - and what the problem might be - I booted a HiSilicon D06 w/ plenty of cma cushion (cma=128M), and looked at cma debugfs. In the upstream thread, Robin asked if this was potentially an issue with lots of 1 page allocations - something for which patches have been proposed. Here's a histogram of allocation sizes:

$ dmesg | grep "cma: cma_alloc(cma" | sed -r 's/.*count
([0-9]+)\,.*/\1/' | sort -n | uniq -c
   2062 1
     32 2
    266 8
      2 24
      4 32
    256 33
      7 64
      2 128
      2 1024

So, there are 2062 1 page allocations. While that's the largest # of allocations by page size, that's only really 13% of the allocated pages, just over 8M. That would get us down to 53M. But we know that total allocation size isn't the only problem - fragmentation must be playing a role, as we're using ~61M, but still seeing errors w/ cma=64M.

To understand the fragmentation impact, I looked at the debugfs cma bitmap. I converted that map to binary, coalescing repeated lines, and got this:

11111111111111111111111111111111 x309
1111111111111111
11111111111111111111111111111111\__ x252
1 /
11111111111111111111111111111111
0
11111111111111111111111111111111 x48
0 x412

I don't see any signs of fragmentation w/ the single page allocations. What is concerning is the 252 33 page allocations (a subset of the 256 in the histogram). They are getting allocated on 64M boundaries, which ends up leaving 31 free pages per allocation. 256 * 31 * 4K = 31M of potentially wasted space.

Seth Forshee (sforshee) wrote :

That wastage is pretty awful.

You might want try a patch to at least ratelimit the cma_alloc errors. Not that it's a real solution, but it might be something we can apply to mitigate the problem short-term while continuing to search for something better.

dann frazier (dannf) wrote :

The culprit for the 256 33 page allocations (that causes the fragmentation mentioned in comment #8) is the hisi_sas_v3_hw driver:

[ 21.220867] Call trace:
[ 21.223301] dump_backtrace+0x0/0x1b0
[ 21.226948] show_stack+0x24/0x30
[ 21.230251] dump_stack+0x90/0xb4
[ 21.233554] cma_alloc+0x3f4/0x430
[ 21.236942] dma_alloc_from_contiguous+0x70/0x80
[ 21.241544] __dma_direct_alloc_pages+0x14c/0x228
[ 21.246234] dma_direct_alloc_pages+0x48/0xc0
[ 21.250576] dma_direct_alloc+0x50/0x80
[ 21.254397] dma_alloc_attrs+0x94/0x128
[ 21.258218] dmam_alloc_attrs+0x68/0xb8
[ 21.262043] hisi_sas_alloc+0x360/0x538 [hisi_sas_main]
[ 21.267257] hisi_sas_shost_alloc_pci+0xfc/0x170 [hisi_sas_v3_hw]
[ 21.273337] hisi_sas_v3_probe+0xd8/0x360 [hisi_sas_v3_hw]
[ 21.278810] local_pci_probe+0x44/0xa8
[ 21.282544] work_for_cpu_fn+0x20/0x30
[ 21.286279] process_one_work+0x1f0/0x430
[ 21.290274] worker_thread+0x248/0x488
[ 21.294009] kthread+0x134/0x138
[ 21.297223] ret_from_fork+0x10/0x18

I wonder if it'd be possible to adjust this allocation somehow.

dann frazier (dannf) wrote :

Note that in current disco things have gotten worse - our single page allocations have blown up to 11753:

$ cat cma.disco.dmesg | grep "cma: cma_alloc(cma" | sed -r 's/.*count ([0-9]+)\,.*/\1/' | sort -n | uniq -c
  11753 1
      3 2
      3 4
    234 8
     32 16
      2 24
      4 32
    256 33
     39 64
      2 128
      2 1024

I bisected this down to a backport of the following commit:
3e394f9413ec RDMA/hns: Modify qp&cq&pd specification according to UM

Reverting that gets us down to 3029 single page allocations, freeing up about 34M.

dann frazier (dannf) wrote :

As per Seth's suggestion, I tried ratelimiting the cma_alloc messages:

diff --git a/mm/cma.c b/mm/cma.c
index c7b39dd3b4f6..56d2a046f689 100644
--- a/mm/cma.c
+++ b/mm/cma.c
@@ -477,7 +477,7 @@ struct page *cma_alloc(struct cma *cma, size_t count, unsigned int align,
                        page_kasan_tag_reset(page + i);
        }

- if (ret && !no_warn) {
+ if (ret && !no_warn && printk_ratelimit()) {
                pr_err("%s: alloc failed, req-size: %zu pages, ret: %d\n",
                        __func__, count, ret);
                cma_debug_show_areas(cma);

This drops the cma_alloc error message count down from 10758 to 21.

Fred Kimmy (kongzizaixian) wrote :

I find this "Mellonox 25Gigabit-64bit-SFP28-2port– PCIe3.0 X8–15B3-1015-2" card can require 260M CMA memory. If we insert 2 or 3...4 Mellonox card, this card will fail to init it, so we should disable this cma config or set 1024M or above CMA memory.

dann frazier (dannf) wrote :

Why would it fail to init? If CMA allocation fails, the kernel will fulfill the request using alloc_pages_node():

struct page *__dma_direct_alloc_pages(struct device *dev, size_t size,
  dma_addr_t *dma_handle, gfp_t gfp, unsigned long attrs)
{
[...]
 /* CMA can be used only in the context which permits sleeping */
 if (gfpflags_allow_blocking(gfp)) {
  page = dma_alloc_from_contiguous(dev, count, page_order,
       gfp & __GFP_NOWARN);
  if (page && !dma_coherent_ok(dev, page_to_phys(page), size)) {
   dma_release_from_contiguous(dev, page, count);
   page = NULL;
  }
 }
 if (!page)
  page = alloc_pages_node(dev_to_node(dev), gfp, page_order);
[...]
}

dann frazier (dannf) wrote :

I've split the ratelimiting topic out to bug 1828092 so to keep this one open while we continue to try and solve the underlying cause of the messages.

dann frazier (dannf) wrote :

The hisi_sas maintainer is experimenting with a patch that will change the driver's 33 page allocations to single page allocations. If there is no significant performance impact, that could be a way forward to deal with the fragmentation issue costing us up to 31M of CMA.

In addition, there is DMA/CMA optimization making its way upstream, that uses non-CMA memory for single page requests:
  https://lkml.org/lkml/2019/5/6/1219

This avoids the impact of the CMA memory usage increase in the RDMA/hns driver (see comment #11) and - with the aforementioned hisi_sas patch - those allocations as well.

After those optimizations, I am able to fulfill all cma allocation requests from our D06 CS board[*] with cma=64M. However, Ubuntu ships with only 16M of CMA. Note that upstream's defconfig allocates 32M of CMA,
 which is apparently required for the RPi VC4:
  https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=ebf089248dab2ef569e5e26a607f0977a71182b7

So there maybe other reasons we want to bump to at least 32M - I'm not sure that we'd want to go all the way to 64M for general purpose kernel though.

[*] Keeping in mind that this is just tuning for a specific system/config. Throw in a bunch of mlx5 cards, and I suspect we'll still see cma_alloc log spew on this platform.

Paolo Pisati (p-pisati) wrote :

I realize this is just a workaround (and the above 31M cma memory fragmentation is ugly), but we should definitely bump CMA allocation space: definitely 32M (since that's what upstream default to) but if 64M solves a problem you have at the moment (until we sort out the driver issue), i'm not opposing to such a change - after all, you are the main consumer of generic/arm64 (all other boards are either armhf or have their own topic kernel) so if there's a workaround we can apply to make your life easier, i don't see why we shouldn't do it.

On Wed, May 22, 2019 at 8:41 AM Paolo Pisati <email address hidden> wrote:
>
> I realize this is just a workaround (and the above 31M cma memory
> fragmentation is ugly), but we should definitely bump CMA allocation
> space: definitely 32M (since that's what upstream default to) but if 64M
> solves a problem you have at the moment (until we sort out the driver
> issue), i'm not opposing to such a change -

Unfortunately, without other mitigations, we blow past 64M as well.
Ignoring fragmentation, we're using ~108M of CMA (summing up the
totals in comment #11). I think including the dma-contiguous patches
are key. They would alleviate the pain of the single page allocations
(~46M), and unblock driver optimizations from doing the same (hisi_sas
driver could move 33M of allocs out of CMA). Those would impact archs
that have CONFIG_DMA_CMA, which are arm64 and armhf-generic:

debian.master/config/annotations:CONFIG_DMA_CMA
          policy<{'amd64': 'n', 'arm64': 'y', 'armhf-generic': 'y',
'armhf-generic-lpae': 'n', 'i386': 'n', 's390x': 'n'}>

I don't have any real armhf-generic hw anymore - if I prepared PPA
kernels, would you happen to have kit for regression testing?

> after all, you are the main
> consumer of generic/arm64 (all other boards are either armhf or have
> their own topic kernel) so if there's a workaround we can apply to make
> your life easier, i don't see why we shouldn't do it.

My biggest concern w/ bumping up CMA is that we do appear to have
users of the arm64/generic kernel on platforms I don't have. For
instance, the reason we got into this CMA issue at all was by turning
on DMA_CMA on arm64 for the RPi3 (bug 1803206) - so I'd at least like
to get some regression testing on that platform. And, obviously such
relatively lowmem platforms make me want to be rather conservative
about bumping up CMA sizes.

  -dann

Paolo Pisati (p-pisati) wrote :

Yes, i have all the hardware to do some testing, prepare some kernels and we can take it from there.

dann frazier (dannf) wrote :

On Thu, May 23, 2019 at 8:31 AM Paolo Pisati <email address hidden> wrote:
>
> Yes, i have all the hardware to do some testing, prepare some kernels
> and we can take it from there.

Thanks Paolo. I pushed a test build to ppa:dannf/cma. This bumps cma
to 32M, but would also be good to know the results w/ cma=64M on the
cmdline.

  -dann

dann frazier (dannf) wrote :

I forgot to enable CMA_DEBUGFS in those builds, so I've uploaded another (cma.3)

Paolo Pisati (p-pisati) wrote :

Here are dmesg and meminfo for all 3 cases (generuic, cma3 and cma3=64M):

https://people.canonical.com/~ppisati/lp1823753/

Other than the different memory alloaction, i didn't notice any evident regression.

John Garry (johnpgarry) wrote :

Note: We have tested https://lkml.org/lkml/2019/5/6/1221, which is now on linux-next *, and the hi162 RoCE driver single patch usage is diminished:
$cat cma.disco.dmesg | grep "cma: cma_alloc(cma" | sed -r 's/.*count ([0-9]+)\,.*/\1/' | sort -n | uniq -c
129 2
3 4
206 8
32 16
2 24
4 32
256 33
39 64
2 128
2 1024

I think that we had ~11K x 1K CMA pages previously, without this patchset.

* https://git.kernel.org/pub/scm/linux/kernel/git/next/linux-next.git/log/kernel/dma?h=next-20190530

dann frazier (dannf) wrote :

Thanks Paolo and John.

I've uploaded a cma.4 kernel to ppa:dannf/cma that includes the following additional fix:

https://git.kernel.org/pub/scm/linux/kernel/git/next/linux-next.git/commit/?id=e7a647580852f90b7f9a5b940e7eaffa105f6271

I'll get some non-ARM architecture testing on that before submitting for the next SRU cycle.

Changed in linux (Ubuntu Disco):
status: New → In Progress
Changed in linux (Ubuntu Eoan):
status: Confirmed → In Progress
Changed in linux (Ubuntu Disco):
assignee: nobody → dann frazier (dannf)
importance: Undecided → High
Changed in linux (Ubuntu Eoan):
assignee: nobody → dann frazier (dannf)
dann frazier (dannf) on 2019-06-04
description: updated
description: updated
description: updated
description: updated
dann frazier (dannf) on 2019-06-04
description: updated
description: updated
description: updated
Seth Forshee (sforshee) on 2019-06-07
Changed in linux (Ubuntu Eoan):
status: In Progress → Fix Committed
Changed in linux (Ubuntu Disco):
status: In Progress → Fix Committed

This bug is awaiting verification that the kernel in -proposed solves the problem. Please test the kernel and update this bug with the results. If the problem is solved, change the tag 'verification-needed-disco' to 'verification-done-disco'. If the problem still exists, change the tag 'verification-needed-disco' to 'verification-failed-disco'.

If verification is not done by 5 working days from today, this fix will be dropped from the source code, and this bug will be closed.

See https://wiki.ubuntu.com/Testing/EnableProposed for documentation how to enable and use -proposed. Thank you!

tags: added: verification-needed-disco
dann frazier (dannf) wrote :

Verification:

ubuntu@d06-1:~$ dmesg | grep cma
[ 0.000000] cma: Reserved 32 MiB at 0x000000007e000000
[ 0.000000] Memory: 526954648K/536866624K available (12092K kernel code, 1694K rwdata, 5112K rodata, 5504K init, 1161K bss, 9879208K reserved, 32768K cma-reserved)
ubuntu@d06-1:~$ cat /proc/version
Linux version 5.0.0-21-generic (buildd@bos02-arm64-046) (gcc version 7.4.0 (Ubuntu/Linaro 7.4.0-1ubuntu1~18.04.1)) #22~18.04.1-Ubuntu SMP Thu Jul 4 17:25:45 UTC 2019

tags: added: verification-done-disco
removed: verification-needed-disco
Launchpad Janitor (janitor) wrote :

This bug was fixed in the package linux - 5.2.0-8.9

---------------
linux (5.2.0-8.9) eoan; urgency=medium

  * linux: 5.2.0-8.9 -proposed tracker (LP: #1835700)

  * Miscellaneous Ubuntu changes
    - [Packaging] replace zfs and spl build with zfs 0.8.1-1ubuntu1
    - SAUCE: test_bpf: remove expected fail for Ctx heavy transformations test on
      s390
    - SAUCE: add -fcf-protection=none to retpoline flags
    - SAUCE: usbip: ensure strings copied using strncpy are null-terminated
    - SAUCE: usbip: add -Wno-address-of-packed-member to EXTRA_CFLAGS
    - SAUCE: perf jvmti: ensure strncpy result is null-terminated
    - update dkms package versions
    - add removed zfs modules to modules.ignore

  [ Upstream Kernel Changes ]

  * Rebase to v5.2

 -- Seth Forshee <email address hidden> Mon, 08 Jul 2019 07:13:41 -0500

Changed in linux (Ubuntu Eoan):
status: Fix Committed → Fix Released
Launchpad Janitor (janitor) wrote :
Download full text (57.5 KiB)

This bug was fixed in the package linux - 5.0.0-21.22

---------------
linux (5.0.0-21.22) disco; urgency=medium

  * linux: 5.0.0-21.22 -proposed tracker (LP: #1834902)

  * Disco update: 5.0.15 upstream stable release (LP: #1834529)
    - net: stmmac: Use bfsize1 in ndesc_init_rx_desc
    - Drivers: hv: vmbus: Remove the undesired put_cpu_ptr() in hv_synic_cleanup()
    - ubsan: Fix nasty -Wbuiltin-declaration-mismatch GCC-9 warnings
    - staging: greybus: power_supply: fix prop-descriptor request size
    - staging: wilc1000: Avoid GFP_KERNEL allocation from atomic context.
    - staging: most: cdev: fix chrdev_region leak in mod_exit
    - staging: most: sound: pass correct device when creating a sound card
    - ASoC: tlv320aic3x: fix reset gpio reference counting
    - ASoC: hdmi-codec: fix S/PDIF DAI
    - ASoC: stm32: sai: fix iec958 controls indexation
    - ASoC: stm32: sai: fix exposed capabilities in spdif mode
    - ASoC: stm32: sai: fix race condition in irq handler
    - ASoC:soc-pcm:fix a codec fixup issue in TDM case
    - ASoC:hdac_hda:use correct format to setup hda codec
    - ASoC:intel:skl:fix a simultaneous playback & capture issue on hda platform
    - ASoC: dpcm: prevent snd_soc_dpcm use after free
    - ASoC: nau8824: fix the issue of the widget with prefix name
    - ASoC: nau8810: fix the issue of widget with prefixed name
    - ASoC: samsung: odroid: Fix clock configuration for 44100 sample rate
    - ASoC: rt5682: Check JD status when system resume
    - ASoC: rt5682: fix jack type detection issue
    - ASoC: rt5682: recording has no sound after booting
    - ASoC: wm_adsp: Add locking to wm_adsp2_bus_error
    - clk: meson-gxbb: round the vdec dividers to closest
    - ASoC: stm32: dfsdm: manage multiple prepare
    - ASoC: stm32: dfsdm: fix debugfs warnings on entry creation
    - ASoC: cs4270: Set auto-increment bit for register writes
    - ASoC: dapm: Fix NULL pointer dereference in snd_soc_dapm_free_kcontrol
    - drm/omap: hdmi4_cec: Fix CEC clock handling for PM
    - IB/hfi1: Clear the IOWAIT pending bits when QP is put into error state
    - IB/hfi1: Eliminate opcode tests on mr deref
    - IB/hfi1: Fix the allocation of RSM table
    - MIPS: KGDB: fix kgdb support for SMP platforms.
    - ASoC: tlv320aic32x4: Fix Common Pins
    - drm/mediatek: Fix an error code in mtk_hdmi_dt_parse_pdata()
    - perf/x86/intel: Fix handling of wakeup_events for multi-entry PEBS
    - perf/x86/intel: Initialize TFA MSR
    - linux/kernel.h: Use parentheses around argument in u64_to_user_ptr()
    - iov_iter: Fix build error without CONFIG_CRYPTO
    - xtensa: fix initialization of pt_regs::syscall in start_thread
    - ASoC: rockchip: pdm: fix regmap_ops hang issue
    - drm/amdkfd: Add picasso pci id
    - drm/amdgpu: Adjust IB test timeout for XGMI configuration
    - drm/amdgpu: amdgpu_device_recover_vram always failed if only one node in
      shadow_list
    - drm/amd/display: fix cursor black issue
    - ASoC: cs35l35: Disable regulators on driver removal
    - objtool: Add rewind_stack_do_exit() to the noreturn list
    - slab: fix a crash by reading /proc/slab_allocators
    - drm/sun4i: tcon top: Fix NULL/inv...

Changed in linux (Ubuntu Disco):
status: Fix Committed → Fix Released
Brad Figg (brad-figg) on 2019-07-24
tags: added: cscc
To post a comment you must log in.
This report contains Public information  Edit
Everyone can see this information.

Other bug subscribers