hugepages not allocated after unlock

Bug #1830549 reported by Brent Rowsell
Affects: StarlingX
Status: Fix Released
Importance: High
Assigned to: Tao Liu

Bug Description

Brief Description
-----------------
This was on AIO-SX. With the controller locked, I configured 10 1G hugepages per NUMA node.
After the unlock was complete, the pages were still showing as pending.
I then locked the server and, unexpectedly, the pages showed as allocated.
After the second unlock the pages were still allocated.

Severity
--------
Major

Steps to Reproduce
------------------
See above

Expected Behavior
------------------
Pages are allocated

Actual Behavior
----------------
See above

Reproducibility
---------------
Not sure

System Configuration
--------------------
One node (AIO-SX), but likely applicable to other configs.

Branch/Pull Time/Commit
-----------------------
BUILD_DATE="2019-05-24 17:42:34 -0400"

Last Pass
---------
Don't know

Timestamp/Logs
--------------
Logs attached

Test Activity
-------------
Other

Revision history for this message
Ghada Khalil (gkhalil) wrote :

Marking as release gating; robustness issues with huge page allocation

summary: - AIO-SX: hugepages not allocated after unlock
+ hugepages not allocated after unlock
tags: added: stx.2.0 stx.config
Changed in starlingx:
importance: Undecided → High
status: New → Triaged
description: updated
Ghada Khalil (gkhalil)
Changed in starlingx:
assignee: nobody → Tao Liu (tliu88)
Revision history for this message
Brent Rowsell (brent-rowsell) wrote :

logs

Tao Liu (tliu88)
Changed in starlingx:
status: Triaged → In Progress
Revision history for this message
Tao Liu (tliu88) wrote :

The issue here is just that the clearing of the pending fields was delayed.

The huge pages are allocated after the worker manifest is applied, and the pending fields are cleared after the conductor receives the first memory update from the agent after the unlock. In this case, the first memory update was delayed for the following reasons.

The sysinv agent normally starts 5 or 6 minutes after the host is unlocked/rebooted and sends an inventory update at startup. If the worker manifest apply has not completed by that time (determined by checking the .worker_config_complete flag file), the memory update is skipped, because the huge pages are allocated via the puppet manifest.
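
A rough sketch of the startup check described above (illustrative only; the flag path, argument, and function names here are assumptions, not the actual sysinv code):

    import os

    # Assumed path of the flag dropped once the worker manifest has applied.
    WORKER_CONFIG_COMPLETE_FLAG = '/var/run/.worker_config_complete'

    def initial_inventory_update(send_memory_report):
        # Huge pages are allocated by the puppet worker manifest, so a memory
        # report taken before the manifest completes would be incomplete.
        if not os.path.exists(WORKER_CONFIG_COMPLETE_FLAG):
            return False  # skip the memory update for now
        send_memory_report()
        return True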

After that, the memory update is triggered by the periodic audit. The audit runs every minute, but the memory & LLDP reports are only sent every 5 audit intervals (agent throttling implemented in a previous release for big-lab performance issues). The audit throttling results in a 5-minute delay after the first inventory report is sent, which adds up to around 11 minutes to clear the pending fields after reboot.
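
A minimal sketch of the audit throttling described above, with hypothetical names (the real sysinv agent constants and structure differ):

    AUDIT_INTERVAL_SECONDS = 60       # "the audit runs every minute"
    MEMORY_REPORT_AUDIT_COUNT = 5     # memory & LLDP reports every 5 audits

    class AgentAudit(object):
        def __init__(self, send_memory_report):
            self._send_memory_report = send_memory_report
            self._audit_count = 0

        def run_audit(self):
            # Called once per audit interval; the memory report only goes out
            # every MEMORY_REPORT_AUDIT_COUNT passes, which is the source of
            # the extra ~5 minute delay before the pending fields clear.
            self._audit_count += 1
            if self._audit_count % MEMORY_REPORT_AUDIT_COUNT == 0:
                self._send_memory_report()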

In this scenario, the second lock was less than 10 minutes after the first unlock. Once the host was locked, the sysinv-conductor would not clear the pending fields.

2019-05-26T17:37:15.000 controller-0 -sh: info HISTORY: PID=617529 UID=1875 system host-lock controller-0
2019-05-26T17:37:19.000 controller-0 -sh: info HISTORY: PID=617529 UID=1875 system host-memory-modify controller-0 0 -1G 10 -f application
2019-05-26T17:37:28.000 controller-0 -sh: info HISTORY: PID=617529 UID=1875 system host-memory-modify controller-0 1 -1G 10 -f application
2019-05-26T17:37:37.000 controller-0 -sh: info HISTORY: PID=617529 UID=1875 system host-unlock controller-0
2019-05-26T17:47:01.000 controller-0 -sh: info HISTORY: PID=126741 UID=1875 system host-lock controller-0

Revision history for this message
Ghada Khalil (gkhalil) wrote :

This seems to be specific to All-in-one systems.

Revision history for this message
Tao Liu (tliu88) wrote :

The pending fields take longer to clear on AIO because the worker manifest is applied after the controller manifest has been applied. When sysinv-agent starts, it sends the host inventory update; if the worker manifest apply has not completed by then, the memory report is skipped. After the initial report, the memory update is triggered by the periodic audit. The audit runs every minute, but the memory report is only sent every 5 audit intervals. As a result, the huge page settings could still show as pending in the CLI/GUI after around 10 minutes (although the huge pages have already been allocated in Linux once the manifest is applied).

After talking to John, we decided to send the memory report more frequently, i.e. at the normal audit interval (every other minute). This change will cause the pending fields to be cleared about one minute after the worker manifest is applied.
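
Roughly, the change amounts to taking the memory report out of the throttled group; a before/after sketch under assumed names (not the actual sysinv constants):

    THROTTLED_REPORT_AUDIT_COUNT = 5   # old: memory report every 5 audits

    def should_send_memory_report(audit_count, throttled=False):
        # Old behaviour: the memory report piggybacked on the 5-audit
        # throttle added for large-lab performance.
        if throttled:
            return audit_count % THROTTLED_REPORT_AUDIT_COUNT == 0
        # New behaviour: send the memory report at the normal audit cadence,
        # so the pending fields clear shortly after the worker manifest is
        # applied.
        return True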

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to config (master)

Fix proposed to branch: master
Review: https://review.opendev.org/671354

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to config (master)

Reviewed: https://review.opendev.org/671354
Committed: https://git.openstack.org/cgit/starlingx/config/commit/?id=1c24b40ac9ede4836070d3c6e960f4c29efc3afa
Submitter: Zuul
Branch: master

commit 1c24b40ac9ede4836070d3c6e960f4c29efc3afa
Author: Tao Liu <email address hidden>
Date: Wed Jul 17 15:50:08 2019 -0400

    Fix the huge pages fields showing as pending

    The user expects the pending fields to be cleared after
    the huge page modification and unlock are complete.
    When sysinv-agent starts, it sends the host inventory
    update. If the worker manifest apply has not completed
    by then, the memory report is skipped. After the initial
    report, the memory update is triggered by the periodic
    audit. The audit runs every minute, but the memory
    report is only sent every 5 audit intervals. As a
    result, the huge page settings could still show as
    pending on the CLI/GUI after around 10 minutes (more
    noticeable on AIO), which led the user to believe
    something went wrong with the huge page allocation
    (although the huge pages have been allocated in Linux
    once the manifest is applied).

    After talking with John Kung, we decided to send the
    memory report at the normal audit interval (every other
    minute).

    Closes-Bug: 1830549

    Change-Id: Idf5067648031168078d99a5d84c8368cbd400508
    Signed-off-by: Tao Liu <email address hidden>

Changed in starlingx:
status: In Progress → Fix Released