Bug #1797187 “error messages popped up when changing huge page settings” : Bugs : StarlingX

Ghada Khalil (gkhalil) on 2018-10-10

description:

updated

Revision history for this message

Ghada Khalil (gkhalil) wrote on 2018-10-12:

#1

Targeting stx.2019.03 as this is not a common use-case

Changed in starlingx:
assignee:	nobody → Tao Liu (tliu88)
importance:	Undecided → Medium
status:	New → Triaged
tags:	added: stx.2019.03 stx.metal

Erich Cordoba (ericho) wrote on 2018-12-12:

#2

Thanks for the report.

This could be an issue that can be solved by configuring your compute's kernel command line.

Could you please share the output of cat /proc/cmdline and cat /proc/meminfo | grep Huge from the compute machine?
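
As a rough illustration of what the requested /proc/meminfo output contains, the sketch below extracts the huge page fields from meminfo-style text. The function name and the sample values are hypothetical; the sample is not output captured from the affected compute node.

```python
def parse_hugepage_info(meminfo_text):
    """Return the Huge* fields of /proc/meminfo-style text as integers.

    Values with a "kB" suffix stay in kB; plain counters stay as counts.
    """
    info = {}
    for line in meminfo_text.splitlines():
        key, sep, rest = line.partition(":")
        # Mirror `grep Huge`: keep only lines whose key contains "Huge".
        if not sep or "Huge" not in key:
            continue
        fields = rest.split()
        if fields:
            info[key.strip()] = int(fields[0])
    return info


# Illustrative sample only (assumed values, not from the reported system).
SAMPLE = """\
MemTotal:       131811200 kB
AnonHugePages:         0 kB
HugePages_Total:    1024
HugePages_Free:     1024
HugePages_Rsvd:        0
HugePages_Surp:        0
Hugepagesize:       2048 kB
"""

info = parse_hugepage_info(SAMPLE)
print(info["HugePages_Total"], info["Hugepagesize"])
```

On a real host the same function could be fed the contents of /proc/meminfo read from disk.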

Ken Young (kenyis) on 2019-01-18

tags:

added: stx.2019.05
removed: stx.2019.03

Ken Young (kenyis) on 2019-04-05

tags:

added: stx.2.0
removed: stx.2019.05

Ghada Khalil (gkhalil) on 2019-04-09

tags:

added: stx.retestneeded

Abraham Arce (xe1gyq) wrote on 2019-05-30:

#3

The "Steps To Reproduce" were executed on a Bare Metal Dedicated Storage 2+3+2 system. Depending on the value and the CPU core, the commands either produced the same error message as reported or completed without it. The system used for the initial test had previously been used for CEPH validation; another system is being installed to repeat the steps and give more clarity on the process followed and the results obtained.

For specific details about the process followed including the requests from Erich:
https://github.com/xe1gyq/starlingx/blob/master/bugs/1797187.md

A high-level overview of the tasks done:

Bare Metal Overall Huge Pages Information

- From controller-1, look at the number of hosts and memory information for compute-0 via host-memory-list
- From compute-0, look at the memory information via kernel boot arguments (/etc/default/grub /proc/cmdline) and memory information (/proc/meminfo)

Bare Metal 2M Hugepages

- Get default values for compute-0 2M Hugepages via host-memory-list
- Modify to default values for compute-0 2M Hugepages via host-memory-modify:

$ system host-memory-modify compute-0 0 -2M 42477 [Fail]
Processor 0:No available space for 1G vswitch huge page allocation, max 1G vswitch pages: 0
$ system host-memory-modify compute-0 1 -2M 43232 [Ok]

- Decreasing to 512 2M huge pages

$ system host-memory-modify compute-0 0 -2M 512 [Ok]
$ system host-memory-modify compute-0 1 -2M 512 [Ok]

- Going back to default 2M huge pages values

$ system host-memory-modify compute-0 0 -2M 42477 [Fail]
Processor 0:No available space for 1G vswitch huge page allocation, max 1G vswitch pages: 0
$ system host-memory-modify compute-0 1 -2M 43232 [Ok]

Bare Metal 1G Hugepages

- From controller-0, reserve at least one 1G hugepage

$ system host-memory-modify compute-0 0 -1G 1 [Fail]
Processor 0:No available space for new VM hugepage settings.Max 1G pages is 0 when 2M is 42477, or Max 2M pages is 39966 when 1G is 5.

- Reboot, modify bootargs to include: hugepagesz=1G hugepages=2
- reserve at least one 1G hugepage:

$ system host-memory-modify compute-0 0 -1G 1 [Ok]
$ system host-memory-modify compute-0 0 -1G 2 [Ok]

- Verify 1G values

$ system host-memory-show compute-0 0
| Application Huge Pages (1G): Total | 0
|                 Total Pending      | 2
|                 Available          | 0

$ system host-memory-modify compute-0 0 -1G 10 [Ok]

Will get back with more information...

Abraham Arce (xe1gyq) wrote on 2019-06-07:

#4

This was tested again in a brand new deployment:

BUILD_ID="20190523T013000Z"
Configuration: Bare Metal 2 + 2 + 2

The message is seen only when the number of huge pages requested exceeds the limit, as in the following commands:

[wrsroot@controller-0 ~(keystone_admin)]$ system host-memory-modify compute-0 0 -2M 42482
Processor 0:No available space for 2M VM huge page allocation, max 2M VM pages: 42204

[wrsroot@controller-0 ~(keystone_admin)]$ system host-memory-modify compute-0 0 -2M 42204
Processor 0:No available space for 1G vswitch huge page allocation, max 1G vswitch pages: 0

The message suggests a number, but that number does not account for the 1G vswitch huge page. Including one 1G vswitch huge page in the calculation gives the maximum number of 2M / 1G VM pages that can actually be allocated.

      max 2M VM pages 42204 * 2048 KiB = 86433792 KiB (compute-0 memory)
      compute-0 memory 86433792 KiB - 1G huge page size 1048576 KiB = 85385216 KiB (compute-0 memory available)
      compute-0 memory available 85385216 KiB / 2048 KiB = 41692 (max 2M VM pages)

[wrsroot@controller-0 ~(keystone_admin)]$ system host-memory-modify compute-0 0 -2M 41692
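
The arithmetic in this comment can be reproduced with a short sketch. The function and constant names are hypothetical; the only inputs taken from the thread are the suggested maximum of 42204 2M pages and the single 1G vswitch huge page.

```python
KIB_PER_2M_PAGE = 2048          # size of one 2M huge page, in KiB
KIB_PER_1G_PAGE = 1024 * 1024   # size of one 1G huge page, in KiB


def max_2m_pages_after_1g(suggested_2m_pages, reserved_1g_pages=1):
    """2M pages that still fit after setting aside 1G huge pages.

    suggested_2m_pages is the maximum reported by the error message;
    reserved_1g_pages is the number of 1G pages to subtract (assumed 1).
    """
    total_kib = suggested_2m_pages * KIB_PER_2M_PAGE
    remaining_kib = total_kib - reserved_1g_pages * KIB_PER_1G_PAGE
    return max(0, remaining_kib // KIB_PER_2M_PAGE)


print(max_2m_pages_after_1g(42204))  # 41692
```

Since one 1G page occupies the same space as 512 2M pages, the result is simply the suggested maximum minus 512.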

Full details:
https://github.com/xe1gyq/starlingx/edit/master/bugs/1797187.md

Tao Liu (tliu88) wrote on 2019-07-18:

#5

Most of the reported error messages have been eliminated by my patch, which was integrated in https://review.opendev.org/#/c/667811/.

Two scenarios still trigger unwanted error messages. One, when the vswitch huge page size changes, the total requested memory is calculated using the previous vswitch page size. Two, when the platform reserved memory is decreased to allow additional VM huge pages to be allocated, an error message is displayed that prevents the allocation. This is due to a check against the possible VM huge pages setting, which is calculated using the previous platform reserved value. This check was removed in my original patch but was added back in the final code submission.
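
The two scenarios can be illustrated with a minimal sketch. It is not the code from the patch: the function names and all numbers are hypothetical, chosen only to show how a stale page size or a stale platform reserved value skews the result.

```python
MIB_PER_2M_PAGE = 2


def requested_vswitch_mib(pages, page_size_mib):
    """Memory requested for vswitch huge pages, in MiB."""
    return pages * page_size_mib


def possible_vm_2m_pages(node_total_mib, platform_reserved_mib, vswitch_mib):
    """2M VM pages that fit once platform and vswitch memory are set aside."""
    free_mib = node_total_mib - platform_reserved_mib - vswitch_mib
    return max(0, free_mib // MIB_PER_2M_PAGE)


# Scenario one: a single vswitch page changes from 2M to 1G (assumed values).
stale_vswitch_mib = requested_vswitch_mib(1, 2)       # previous page size
actual_vswitch_mib = requested_vswitch_mib(1, 1024)   # new page size

# Scenario two: platform reserved memory drops from 4000 MiB to 2000 MiB
# on a node with 16000 MiB (assumed values).
stale_limit = possible_vm_2m_pages(16000, 4000, actual_vswitch_mib)
fresh_limit = possible_vm_2m_pages(16000, 2000, actual_vswitch_mib)

request = 6000  # requested 2M VM pages
print(stale_vswitch_mib, actual_vswitch_mib)            # 2 1024
print(request <= stale_limit, request <= fresh_limit)   # False True
```

With the stale platform reserved value the request is rejected even though it fits once the new value is taken into account.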

Changed in starlingx:
status:	Triaged → In Progress

OpenStack Infra (hudson-openstack) wrote on 2019-07-18: Fix proposed to config (master)

#6

Fix proposed to branch: master
Review: https://review.opendev.org/671553

Tao Liu (tliu88) wrote on 2019-07-19:

#7

A clarification on the following test steps:
1. Reduce the number of 2M pages by 512
2. Add one 1G page
The changes are expected to be accepted without any errors.

However, in a few exceptional cases an error message may still appear, telling the user that “No available space for …”. This can happen because the semantic check uses the currently available memory to determine whether the requested pages exceed what is available, whereas the existing allocation is based on the memory that was available before the last reboot. The host's available memory might have changed after the reboot.

The error message should tell you how many 2M or 1G pages are supported, based on the total requested pages versus how much memory is available. You should be able to use this suggested number to make the changes.
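
A check of this kind might look like the sketch below. It is an assumption-laden illustration, not the actual semantic check: the function name, the MiB-based inputs and the message wording (loosely modeled on the errors quoted earlier in this thread) are all hypothetical.

```python
MIB_PER_2M_PAGE = 2
MIB_PER_1G_PAGE = 1024


def check_vm_hugepages(available_mib, req_2m, req_1g):
    """Return None if the request fits, else a message with suggested maxima."""
    needed_mib = req_2m * MIB_PER_2M_PAGE + req_1g * MIB_PER_1G_PAGE
    if needed_mib <= available_mib:
        return None
    # Largest 1G count that fits if the 2M request is kept, and vice versa.
    max_1g = max(0, (available_mib - req_2m * MIB_PER_2M_PAGE) // MIB_PER_1G_PAGE)
    max_2m = max(0, (available_mib - req_1g * MIB_PER_1G_PAGE) // MIB_PER_2M_PAGE)
    return (
        "No available space: max 1G pages is %d when 2M is %d, "
        "or max 2M pages is %d when 1G is %d" % (max_1g, req_2m, max_2m, req_1g)
    )


# Assumed example: 4096 MiB available, 1024 2M pages and 3 1G pages requested.
print(check_vm_hugepages(4096, 1024, 3))
print(check_vm_hugepages(4096, 1024, 2))  # None: the request fits exactly
```

Either suggested value can be fed back into the request to obtain a combination that fits in the assumed available memory.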

In the future, we will change the user-requested page setting from a number to a percentage of the available memory in order to remedy this issue.
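
To show why a percentage-based request would avoid the problem, here is a minimal sketch. It is only an illustration of the idea, not a planned implementation; the function name and the memory figures are hypothetical.

```python
def pages_from_percent(available_mib, percent, page_size_mib):
    """Convert a percentage of available memory into a whole number of pages."""
    if not 0 <= percent <= 100:
        raise ValueError("percent must be between 0 and 100")
    return (available_mib * percent // 100) // page_size_mib


# Assumed figures: available memory shrinks slightly after a reboot.
before_reboot = pages_from_percent(80000, 50, 2)
after_reboot = pages_from_percent(79000, 50, 2)
print(before_reboot, after_reboot)  # 20000 19750
```

The same 50% request maps to a valid page count before and after the reboot, whereas a fixed count of 20000 pages would exceed half of the smaller figure.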

OpenStack Infra (hudson-openstack) wrote on 2019-07-23: Fix merged to config (master)

#8

Reviewed: https://review.opendev.org/671553
Committed: https://git.openstack.org/cgit/starlingx/config/commit/?id=16eb0ce4e0492a56458537d8d5205227e79e80c5
Submitter: Zuul
Branch: master

commit 16eb0ce4e0492a56458537d8d5205227e79e80c5
Author: Tao Liu <email address hidden>
Date: Thu Jul 18 14:59:22 2019 -0400

Fixed unwanted error message when changing huge pages

Most of the unnecessary error messages have been eliminated by
a recent update: https://review.opendev.org/#/c/667811/.

There are two scenarios that still triggers error messages.
One, when the vswitch huge page size changes, the requested vswitch
memory is calculated using the previous vswitch page size.
Two, when the platform reserved memory is decreased to allow
additional VM huge pages allocation, an error message is displayed
which hinders allocation of additional VM huge pages. This is due
to a check against the VM huge pages possible setting that is
calculated using the previous platform reserved value.
This check was removed from my original patch,
which was integrated to https://review.opendev.org/#/c/667811/,
but was added back in the final code submission.

This update fixes the above two error scenarios.

Closes-Bug: 1797187

Change-Id: I2383e0d949e0af1c86e2546e63d6cc3a8a693175
Signed-off-by: Tao Liu <email address hidden>

Changed in starlingx:
status:	In Progress → Fix Released

Peng Peng (ppeng) wrote on 2019-08-13:

#9

Verified on 2019-08-12_20-59-00

tags:

removed: stx.retestneeded

StarlingX

error messages popped up when changing huge page settings
