vm hugepages cannot claim the entirety of the reported available memory

Bug #1813325 reported by Peng Peng
This bug affects 2 people
Affects: StarlingX
Status: Fix Released
Importance: High
Assigned to: Tao Liu
Milestone: (none)

Bug Description

Brief Description
-----------------
As title

Severity
--------
Major

Steps to Reproduce
------------------
Modify host0 proc0 to have 0 of 1G pages and <2GiB of 4K pages

Expected Behaviour
------------------
The modification succeeds.

Actual Behaviour
----------------
The modification fails (see the error output under Timestamp/Logs).

Reproducibility
---------------
Reproducible
100%

System Configuration
--------------------
Multi-node system
Dedicated storage

Branch/Pull Time/Commit
-----------------------
master as of 2019-01-24_20-18-00

Timestamp/Logs
--------------
[2019-01-25 12:40:18,729] 262 DEBUG MainThread ssh.send :: Send 'system --os-username 'admin' --os-password 'Li69nux*' --os-project-name admin --os-auth-url http://[abba::2]:5000/v3 --os-user-domain-name Default --os-project-domain-name Default --os-endpoint-type internalURL --os-region-name RegionOne host-memory-modify -1G 0 -2M 27372 compute-7 0'
[2019-01-25 12:40:20,899] 387 DEBUG MainThread ssh.expect :: Output:
Processor 0:No available space for 1G vswitch huge page allocation, max 1G vswitch pages: 0

Ghada Khalil (gkhalil) wrote :

Assigning to Steve to triage and determine if this is related to:
https://storyboard.openstack.org/#!/story/2004472

Changed in starlingx:
assignee: nobody → Steven Webster (swebster-wr)
Ghada Khalil (gkhalil) wrote :

Marking as release gating - a semantic check is needed to cover this scenario as a result of the recent story

Changed in starlingx:
importance: Undecided → Medium
status: New → Triaged
tags: added: stx.2019.05 stx.config
Jim Gauld (jgauld) wrote :

Related to this, but perhaps not the exact same bug:
The memory accounting overheads for nova (the 4K platform reserved overhead and the vswitch hugepages overhead) are not yet configured, so nova does not know the actual available memory. The end result is that nova's reported available memory is overstated.

Need to do the following: create helm nova-compute per-host overrides to set reserved memory:

That is, for the following nova config options:

cfg.MultiOpt('reserved_huge_pages',
        item_type=types.Dict(),
        help="""Number of huge/large memory pages to reserve per NUMA host cell.
"""),

E.g.,
    reserved_huge_pages = node:0,size:2048,count:64
    reserved_huge_pages = node:1,size:1GB,count:1

cfg.IntOpt('reserved_host_memory_mb',
        default=512,
        min=0,
        help="""Amount of memory in MB to reserve for the host so that it is always available
to host processes. The host resources usage is reported back to the scheduler
continuously from nova-compute running on the compute node. To prevent the host
memory from being considered as available, this option is used to reserve
memory for the host.
"""),

E.g.,
    reserved_host_memory_mb = 512

The 'reserved_host_memory_mb' value should be generated from the sysinv host platform reserved memory.

The 'reserved_huge_pages' value should be generated from the sysinv host platform vswitch memory.

Since reserved_huge_pages uses MultiOpt(), it needs to use my new routine self._oslo_multistring_override() (or a variant of it), which I used for pci_alias and pci_whitelist; those use oslo_config.MultiStringOpt().
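A rough sketch of how the two values could be derived, for illustration only. The function names and the input shapes (dicts keyed by NUMA node) are hypothetical and are not the sysinv data model; summing the platform reservation across nodes is also an assumption. The real override would still need to go through the multistring routine mentioned above, which is not reproduced here.

```python
def reserved_huge_pages_values(vswitch_pages_by_node):
    """Build nova 'reserved_huge_pages' strings from vswitch huge page data.

    vswitch_pages_by_node maps NUMA node id -> (page_size_mib, page_count).
    This input shape is hypothetical.
    """
    values = []
    for node, (size_mib, count) in sorted(vswitch_pages_by_node.items()):
        if count > 0:
            # nova interprets a bare 'size' number as KiB.
            size_kib = size_mib * 1024
            values.append("node:%d,size:%d,count:%d" % (node, size_kib, count))
    return values


def nova_memory_overrides(platform_reserved_mib_by_node, vswitch_pages_by_node):
    """Return a nova DEFAULT section carrying both reserved-memory settings.

    Assumption: the host-wide reservation is the sum of the per-node
    platform reserved memory.
    """
    return {
        "nova": {
            "DEFAULT": {
                "reserved_host_memory_mb": sum(platform_reserved_mib_by_node.values()),
                "reserved_huge_pages": reserved_huge_pages_values(vswitch_pages_by_node),
            }
        }
    }
```

With one 1G vswitch page on node 0 and none on node 1, this would yield a single entry, "node:0,size:1048576,count:1".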

Remove this static YAML:
stx-config/kubernetes/applications/stx-openstack/stx-openstack-helm/stx-openstack-helm/manifests/manifest.yaml
      nova:
        DEFAULT:
          default_mempages_size: 2048
          reserved_host_memory_mb: 0

Ken Young (kenyis)
tags: added: stx.2.0
removed: stx.2019.05
Ghada Khalil (gkhalil)
tags: added: stx.retestneeded
Numan Waheed (nwaheed) wrote :

This issue is causing continuous failures in the nightly regression. More than 35 test cases in the Nova automated regression are also failing because of this defect. Increasing the priority to P2 and requesting an early fix.

Changed in starlingx:
importance: Medium → High
OpenStack Infra (hudson-openstack) wrote : Fix proposed to config (master)

Fix proposed to branch: master
Review: https://review.opendev.org/658222

Changed in starlingx:
status: Triaged → In Progress
Dariush Eslimi (deslimi)
Changed in starlingx:
assignee: Steven Webster (swebster-wr) → Tao Liu (tliu88)
Ghada Khalil (gkhalil)
summary: - STX: vm hugepages cannot claim the entirety of the reported available
- memory
+ vm hugepages cannot claim the entirety of the reported available memory
OpenStack Infra (hudson-openstack) wrote :

Fix proposed to branch: master
Review: https://review.opendev.org/661987

OpenStack Infra (hudson-openstack) wrote : Change abandoned on config (master)

Change abandoned by Steven Webster (<email address hidden>) on branch: master
Review: https://review.opendev.org/658222
Reason: New review: https://review.opendev.org/#/c/661987/

OpenStack Infra (hudson-openstack) wrote : Fix merged to config (master)

Reviewed: https://review.opendev.org/661987
Committed: https://git.openstack.org/cgit/starlingx/config/commit/?id=9a1ddaa291493226a3e282781c4409665c849161
Submitter: Zuul
Branch: master

commit 9a1ddaa291493226a3e282781c4409665c849161
Author: Tao Liu <email address hidden>
Date: Wed May 29 09:45:48 2019 -0400

    Account for vswitch hugepage count on memory modify

    This commit fixes a bug which can occur if a user tries to claim
    the entirety of available memory for VM hugepage usage.

    This update restructures the huge pages semantic check and
    combines VM and vswitch check. All user request huge pages
    (including 2M, 1G and vswitch) are validated against the
    last reported total possible huge pages.

    Change-Id: I5ac7883416b3128106ee20a27f4a35046ccabfb7
    Closes-Bug: 1813325
    Signed-off-by: Tao Liu <email address hidden>
    Co-authored-by: Steven Webster <email address hidden>
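
To illustrate the idea in the commit message (one combined check over VM 2M, VM 1G and vswitch pages), here is a minimal sketch. It is not the code from the merged change: the function name, its parameters, and the modelling of "total possible huge pages" as a single MiB budget per NUMA node are all assumptions.

```python
MIB_PER_2M_PAGE = 2
MIB_PER_1G_PAGE = 1024


def check_huge_page_request(possible_mib, vm_2m_pages, vm_1g_pages,
                            vswitch_pages, vswitch_page_size_mib):
    """Validate one NUMA node's combined huge page request.

    Returns (ok, message). All names here are hypothetical.
    """
    counts = {"2M": vm_2m_pages, "1G": vm_1g_pages, "vswitch": vswitch_pages}
    for name, count in counts.items():
        if count < 0:
            return False, "%s page count must not be negative" % name

    # Add up every requested pool before comparing, so that VM pages and
    # vswitch pages cannot each pass a separate check while jointly
    # exceeding what the node can provide.
    requested_mib = (vm_2m_pages * MIB_PER_2M_PAGE
                     + vm_1g_pages * MIB_PER_1G_PAGE
                     + vswitch_pages * vswitch_page_size_mib)
    if requested_mib > possible_mib:
        return False, ("requested %d MiB of huge pages, but only %d MiB is possible"
                       % (requested_mib, possible_mib))
    return True, "ok"
```

Under these assumptions, a request that exactly fills the budget is accepted and one extra 2M page is rejected, which is the boundary this bug was about.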

Changed in starlingx:
status: In Progress → Fix Released
Peng Peng (ppeng) wrote :

The test case passed on WCP_61-62
BUILD_ID="20190602T233000Z"
JOB="STX_build_master_master"

[2019-06-03 19:07:54,917] 262 DEBUG MainThread ssh.send :: Send 'system --os-username 'admin' --os-password 'Li69nux*' --os-project-name admin --os-auth-url http://192.168.204.2:5000/v3 --os-user-domain-name Default --os-project-domain-name Default --os-endpoint-type internalURL --os-region-name RegionOne host-memory-modify -1G 0 -2M 6656 controller-1 0'
[2019-06-03 19:07:57,019] 387 DEBUG MainThread ssh.expect :: Output:
+-------------------------------------+--------------------------------------+
| Property                            | Value                                |
+-------------------------------------+--------------------------------------+
| Memory: Usable Total (MiB)          | 16114                                |
| Platform (MiB)                      | 14500                                |
| Available (MiB)                     | 15090                                |
| Huge Pages Configured               | True                                 |
| vSwitch Huge Pages: Size (MiB)      | 1024                                 |
| Total                               | 1                                    |
| Available                           | 0                                    |
| Required                            | None                                 |
| Application Pages (4K): Total       | 0                                    |
| Application Huge Pages (2M): Total  | 7545                                 |
| Total Pending                       | 6656                                 |
| Available                           | 7545                                 |
| Application Huge Pages (1G): Total  | 0                                    |
| Total Pending                       | 0                                    |
| Available                           | 0                                    |
| uuid                                | 5c7c08d4-b62a-47fb-ad3f-33ef3a97e365 |
| ihost_uuid                          | dd2a5500-a19f-4491-95bb-f8c455efe744 |
| inode_uuid                          | 8387f21e-03df-446d-9b35-69da60f1c52c |
| created_at                          | 2019-06-03T16:32:55.618775+00:00     |
| updated_at                          | 2019-06-03T19:07:25.145232+00:00     |
+-------------------------------------+--------------------------------------+
[wrsroot@controller-0 ~(keystone_admin)]$
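
For readers checking the numbers in this output, the arithmetic can be spelled out as below. The values are copied from the table; the variable names are made up for this note, and the relationships shown are only observations about these particular figures, not a statement of how sysinv computes the fields.

```python
# Values copied from the host-memory-modify output above.
usable_total_mib = 16114
available_mib = 15090
vswitch_pages = 1
vswitch_page_size_mib = 1024
app_2m_available_pages = 7545
app_2m_pending_pages = 6656

# Size of the pending 2M request in MiB.
pending_mib = app_2m_pending_pages * 2

# Memory covered by the 2M pages listed as available.
available_2m_mib = app_2m_available_pages * 2

# Memory held by the vswitch 1G page.
vswitch_mib = vswitch_pages * vswitch_page_size_mib

print(pending_mib)                   # 13312
print(available_2m_mib)              # 15090
print(available_mib - pending_mib)   # 1778
print(available_mib + vswitch_mib)   # 16114
```

In this output the pending request (13312 MiB) fits within the memory listed as available (15090 MiB), and the available memory plus the single vswitch page adds up to the usable total (16114 MiB).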

tags: removed: stx.retestneeded