When hugepages is set vm.max_map_count is not automatically adjusted

Bug #1507921 reported by Liam Young on 2015-10-20
This bug affects 3 people
Affects                                 Status         Importance  Assigned to
falkor                                  Fix Released   High        Chris Glass
dpdk (Ubuntu)                                          Medium      Unassigned
nova-compute (Juju Charms Collection)                  High        Liam Young
openvswitch-dpdk (Ubuntu)                              Undecided   Unassigned

Bug Description

When hugepages is set the kernel parameter vm.max_map_count should be a minimum of 2 * vm.nr_hugepages but it is currently not dynamically increased.

This minimum seems to come from https://www.kernel.org/doc/Documentation/sysctl/vm.txt

"While most applications need less than a thousand maps, certain
programs, particularly malloc debuggers, may consume lots of them,
e.g., up to one or two maps per allocation."
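
The rule can be sketched as a small shell helper (the helper name is illustrative, not from the bug; 65530 is the kernel's default vm.max_map_count, kept as padding for the rest of the system):

```shell
# Sketch: suggested vm.max_map_count for a given number of huge pages,
# following the 2 * nr_hugepages + padding rule discussed in this bug.
suggested_max_map_count() {
    nr_hugepages="$1"
    default_max_map_count=65530   # kernel default, kept as padding
    echo $(( 2 * nr_hugepages + default_max_map_count ))
}

# Example: 64G worth of 2M huge pages = 32768 pages
suggested_max_map_count 32768
```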


Liam Young (gnuoy) on 2015-10-20
Changed in nova-compute (Juju Charms Collection):
status: New → In Progress
importance: Undecided → High
assignee: nobody → Liam Young (gnuoy)
Liam Young (gnuoy) on 2015-10-20
description: updated
Liam Young (gnuoy) on 2015-10-26
Changed in nova-compute (Juju Charms Collection):
status: In Progress → Fix Released
milestone: none → 15.10
Changed in falkor:
milestone: none → 0.13
assignee: nobody → Chris Glass (tribaal)
importance: Undecided → High
status: New → Confirmed
Changed in falkor:
status: Confirmed → Fix Committed
Changed in falkor:
status: Fix Committed → Fix Released
Vladimir Eremin (yottatsa) wrote :

For openvswitch-dpdk, vm.max_map_count should be adjusted to at least 2*nr_hugepages plus some padding for other apps, e.g.:

    # Sum nr_hugepages over all NUMA nodes and page sizes, double it, and
    # add the default limit (65530) as padding for the rest of the system.
    max_map_count="$(awk -v padding=65530 '{total+=$1}END{print total*2+padding}' /sys/devices/system/node/node*/hugepages/hugepages-*/nr_hugepages)"
    sysctl -q vm.max_map_count=${max_map_count:-65530}

Launchpad Janitor (janitor) wrote :

Status changed to 'Confirmed' because the bug affects multiple users.

Changed in dpdk (Ubuntu):
status: New → Confirmed
Changed in openvswitch-dpdk (Ubuntu):
status: New → Confirmed

After a discussion with gnuoy I picked this up for the DPDK init scripts, which can be used to set huge pages properly for DPDK.

I still find the reasoning rather unclear as to why exactly 2*#hp+padding is "correct".
According to our discussion it seems to be derived only from "e.g., up to one or two maps per allocation."

If anybody has more, such as an example that breaks, and could share it, that would be great.
Without that it is hard to say whether "2*#hp+padding" is correct for 1G hugepages as well.

Changed in dpdk (Ubuntu):
status: Confirmed → Triaged
importance: Undecided → Low

The comment dates back to 2004-04-01: http://git.kernel.org/cgit/linux/kernel/git/tglx/history.git/commit/?id=56d93842e4840f371cb9acc8e5a628496b615a96

I doubt that anybody was thinking about 1G hugepages back then.
Reading the referenced doc over and over again, I also realized it refers to 2*allocations, not 2*#hugepages.

The only other references I found were:
- some forums and howtos that set it to a very high number for high-memory systems ("high memory" depending on the time of the post, e.g. 64G in one example, which today is normal for servers)
- the hugepage.py charmhelper, which got it from this bug
- a DPDK issue with a lot of huge pages: http://dpdk.org/ml/archives/dev/2014-September/005397.html

The latter is the only source close to what we discuss here.

Around rte_eal_hugepage_init/map_all_hugepages in the DPDK source one finds the chance of two mappings of every hugepage.
In fact those can be limited via -m / --socket-mem or whatever EAL parameter you prefer, but let's assume up to #hugepages.
There it does a mapping of hpi->hugepage_sz.
So it does up to two mappings per hugepage, no matter what the page size is.
And the padding is there to add the normal system limit on top, since the application and DPDK do more than just handle the huge pages.
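
As a quick worked example of why the old default breaks (my arithmetic, matching the 2M-page case mentioned later in the changelog):

```shell
# With 2M huge pages, 64G of memory means 64*1024/2 = 32768 pages.
# If DPDK can create up to two mappings per huge page, that alone is
# 65536 maps, already above the default vm.max_map_count of 65530,
# leaving nothing for the rest of the application.
pages=$(( 64 * 1024 / 2 ))
maps=$(( 2 * pages ))
echo "$pages pages -> up to $maps maps (default limit: 65530)"
```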

OK, summarized like that, it makes sense to me now.
I hope that helps the next person who comes by to understand it as well.

Changed in dpdk (Ubuntu):
importance: Low → Medium

I did some tests to be sure:
/sys/kernel/mm/hugepages/hugepages-1048576kB/nr_hugepages : 0
/sys/kernel/mm/hugepages/hugepages-2048kB/nr_hugepages : 5
/sys/devices/system/node/node0/hugepages/hugepages-1048576kB/nr_hugepages : 0
/sys/devices/system/node/node0/hugepages/hugepages-2048kB/nr_hugepages : 2
/sys/devices/system/node/node1/hugepages/hugepages-1048576kB/nr_hugepages : 0
/sys/devices/system/node/node1/hugepages/hugepages-2048kB/nr_hugepages : 3

That shows that /sys/kernel/mm/hugepages/* always holds the globally aggregated view.
This avoids some hassle on non-NUMA systems, where /sys/devices/system/node doesn't even exist, e.g. on i386.
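
Based on that, the sum can be taken from the global files instead of the per-node ones. A sketch (the path is the standard sysfs location; the base directory is a parameter purely to keep the sketch self-contained, and missing files are skipped so it also works where huge pages are unavailable):

```shell
# Sum nr_hugepages over all page sizes from the global sysfs view,
# which exists even on non-NUMA systems, unlike /sys/devices/system/node.
total_hugepages() {
    base="${1:-/sys/kernel/mm/hugepages}"
    total=0
    for f in "$base"/hugepages-*/nr_hugepages; do
        [ -r "$f" ] || continue    # skip if the glob did not match
        total=$(( total + $(cat "$f") ))
    done
    echo "$total"
}
```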

Launchpad Janitor (janitor) wrote :

This bug was fixed in the package dpdk - 2.2.0-0ubuntu7

---------------
dpdk (2.2.0-0ubuntu7) xenial; urgency=medium

  * Increase max_map_count after setting huge pages (LP: #1507921):
    - The default config of 65530 would cause issues as soon as about 64GB or
      more are used as 2M huge pages for dpdk.
    - Increase this value to base+2*#hugepages to avoid issues on huge systems.
  * d/p/ubuntu-backport-[28-32,34-35] backports for stability (LP: #1568838):
     - these will be in the 16.04 dpdk release, delta can then be dropped.
     - 5 fixes that do not change api/behaviour but fix serious issues.
        - 01 f82f705b lpm: fix allocation of an existing object
        - 02 f9bd3342 hash: fix multi-process support
        - 03 1aadacb5 hash: fix allocation of an existing object
        - 04 5d7bfb73 hash: fix race condition at creation
        - 05 fe671356 vfio: fix resource leak
        - 06 356445f9 port: fix ring writer buffer overflow
        - 07 52f7a5ae port: fix burst size mask type
  * d/p/ubuntu-backport-33-vhost-user-add-error-handling-for-fd-1023.patch
     - this will likely be in dpdk release 16.07 and delta can then be dropped.
     - fixes a crash on using fd's >1023 (LP: #1566874)
  * d/p/ubuntu-fix-lpm-use-after-free-and-leak.patch fix lpm_free (LP: #1569375)
     - the old patches had an error freeing a pointer which had no meta data
     - that lead to a crash on any lpm_free call
     - folded into the fix that generally covers the lpm allocation and free
       weaknesses already (also there this particular mistake was added)

 -- Christian Ehrhardt <email address hidden> Tue, 12 Apr 2016 16:13:47 +0200

Changed in dpdk (Ubuntu):
status: Triaged → Fix Released
Changed in openvswitch-dpdk (Ubuntu):
status: Confirmed → Invalid
status: Invalid → Won't Fix