openvswitch fails to allocate memory pool in virtual environment

Bug #1796380 reported by Matt Peters on 2018-10-05
This bug affects 2 people
Affects: StarlingX
Importance: High
Assigned to: Steven Webster

Bug Description

Title
-----
openvswitch fails to allocate memory pool in virtual environment

Brief Description
-----------------
On some virtual systems, openvswitch fails to allocate the mbuf memory pools for the data path when the 2M huge pages allocated for DPDK are fragmented and therefore do not provide a contiguous block of memory large enough for the pools. If a pool is not allocated, the physical ports are not configured properly and compute initialization fails.
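
As a rough illustration of the fragmentation problem described above, the sketch below checks whether a set of 2M pages contains a physically contiguous run large enough for a pool. The page addresses, the pool size, and the helper name are all hypothetical, and this is not how DPDK itself performs the check.

```python
PAGE_SIZE = 2 * 1024 * 1024  # one 2M huge page, in bytes


def largest_contiguous_bytes(page_addrs, page_size=PAGE_SIZE):
    """Return the size in bytes of the longest run of adjacent pages.

    page_addrs is an iterable of page start addresses. This is a
    hypothetical input; a real system would get these from the kernel.
    """
    best = run = 0
    prev = None
    for addr in sorted(set(page_addrs)):
        if prev is not None and addr == prev + page_size:
            run += 1  # this page directly follows the previous one
        else:
            run = 1   # gap found, start a new run
        best = max(best, run)
        prev = addr
    return best * page_size


# Four pages in total, but only the first two are adjacent.
fragmented = [0 * PAGE_SIZE, 1 * PAGE_SIZE, 5 * PAGE_SIZE, 9 * PAGE_SIZE]
pool_bytes = 3 * PAGE_SIZE  # hypothetical pool size

enough_total = len(fragmented) * PAGE_SIZE >= pool_bytes
enough_contiguous = largest_contiguous_bytes(fragmented) >= pool_bytes
print(enough_total, enough_contiguous)  # prints "True False"
```

The total amount of memory is sufficient, yet no single contiguous block is, which matches the failure mode reported here.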

Severity
--------
Minor

Steps to Reproduce
------------------
The issue is not reproducible on most systems and affects only some virtual environments. It is not present on real hardware, since the huge page memory for openvswitch is backed by 1G huge pages there.
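
To see which default huge page size a host uses, one option is to read the huge page counters from /proc/meminfo. The sketch below parses text in that format; the function name and the sample values are made up for illustration. Note that /proc/meminfo reports only the default huge page size.

```python
import re


def parse_hugepage_info(meminfo_text):
    """Extract huge page counters from /proc/meminfo-style text.

    Hugepagesize is reported by the kernel in kB.
    """
    fields = {}
    for key in ("HugePages_Total", "HugePages_Free", "Hugepagesize"):
        match = re.search(r"^%s:\s+(\d+)" % key, meminfo_text, re.MULTILINE)
        if match:
            fields[key] = int(match.group(1))
    return fields


# Sample values are invented for this example.
sample = """\
HugePages_Total:     512
HugePages_Free:      100
Hugepagesize:       2048 kB
"""
info = parse_hugepage_info(sample)
print(info)
```

On a live host the same function could be called with the contents of /proc/meminfo, for example `parse_hugepage_info(open("/proc/meminfo").read())`.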

Expected Behavior
-----------------
openvswitch should be able to allocate the required mbuf memory pools during initialization.

Actual Behavior
---------------
openvswitch is unable to allocate the required mbuf memory pools, resulting in a failure to initialize the openvswitch service.

Reproducibility
---------------
Intermittent.
On systems that experience the issue, it will occur 20-30% of the time when booting a compute host.

System Configuration
--------------------
All virtual systems are susceptible to this issue.

Branch/Pull Time/Commit
-----------------------
master - 2018-10-05

Timestamp/Logs
--------------
2018-10-03T18:37:13.000 compute-1 ovs-vswitchd[24073]: err ovs|00106|netdev_dpdk|ERR|Failed to create memory pool for netdev eth1, with MTU 1500 on socket 0: Invalid argument
2018-10-03T18:37:13.000 compute-1 ovs-vswitchd[24073]: err ovs|00107|dpif_netdev|ERR|Failed to set interface eth1 new configuration
2018-10-03T18:37:13.000 compute-1 ovs-vswitchd[24073]: err ovs|00111|netdev_dpdk|ERR|Failed to create memory pool for netdev eth0, with MTU 1500 on socket 0: Invalid argument
2018-10-03T18:37:13.000 compute-1 ovs-vswitchd[24073]: err ovs|00112|dpif_netdev|ERR|Failed to set interface eth0 new configuration
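
A small script can help spot this failure in collected logs. The sketch below is only an illustration: the regular expression is derived from the error text in the excerpt above, and the function name is hypothetical.

```python
import re

# Pattern based on the netdev_dpdk error lines in the log excerpt.
MEMPOOL_ERR = re.compile(
    r"Failed to create memory pool for netdev (?P<netdev>[^,\s]+), "
    r"with MTU (?P<mtu>\d+) on socket (?P<socket>\d+): (?P<reason>.+)$"
)


def failed_mempools(lines):
    """Return (netdev, mtu, socket, reason) for each mempool failure."""
    results = []
    for line in lines:
        match = MEMPOOL_ERR.search(line.rstrip())
        if match:
            results.append((
                match.group("netdev"),
                int(match.group("mtu")),
                int(match.group("socket")),
                match.group("reason"),
            ))
    return results


log = [
    "2018-10-03T18:37:13.000 compute-1 ovs-vswitchd[24073]: err ovs|00106|netdev_dpdk|ERR|Failed to create memory pool for netdev eth1, with MTU 1500 on socket 0: Invalid argument",
    "2018-10-03T18:37:13.000 compute-1 ovs-vswitchd[24073]: err ovs|00107|dpif_netdev|ERR|Failed to set interface eth1 new configuration",
    "2018-10-03T18:37:13.000 compute-1 ovs-vswitchd[24073]: err ovs|00111|netdev_dpdk|ERR|Failed to create memory pool for netdev eth0, with MTU 1500 on socket 0: Invalid argument",
]
for entry in failed_mempools(log):
    print(entry)
```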

Matt Peters (mpeters-wrs) wrote :

The _set_default_vswitch_hugesize method needs to be generalized so that it does not configure a different amount of memory for a virtual environment. OVS does not size the mempool differently in a virtual environment and therefore requires the full 1G allocation.

http://git.openstack.org/cgit/openstack/stx-config/tree/sysinv/sysinv/sysinv/sysinv/agent/node.py#n273
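
A minimal sketch of what such a generalization could look like follows. It is not the sysinv code: the constant and function names are hypothetical, and the 1024 MB value is an assumption taken from the "full 1G allocation" wording in this comment, not from any merged change.

```python
VSWITCH_MEMORY_MB = 1024  # assumed from "full 1G allocation" above
SIZE_2M_MB = 2
SIZE_1G_MB = 1024


def vswitch_hugepages_nr(hugepage_size_mb,
                         vswitch_memory_mb=VSWITCH_MEMORY_MB):
    """Number of huge pages needed to cover the vswitch memory.

    The result depends only on the page size, not on whether the
    host is virtual or real hardware.
    """
    if hugepage_size_mb <= 0:
        raise ValueError("hugepage size must be positive")
    # Ceiling division so that a partial page is rounded up.
    return -(-vswitch_memory_mb // hugepage_size_mb)


print(vswitch_hugepages_nr(SIZE_2M_MB))  # 512 pages of 2M
print(vswitch_hugepages_nr(SIZE_1G_MB))  # 1 page of 1G
```

Keeping a single memory constant and deriving the page count from the page size avoids a separate code path for virtual environments.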

Ghada Khalil (gkhalil) on 2018-10-05
tags: added: stx.networking
Ghada Khalil (gkhalil) on 2018-10-10
Changed in starlingx:
importance: Undecided → Medium
Ghada Khalil (gkhalil) wrote :

Targeting stx.2019.03 - intermittent issue which only occurs on a subset of virtual environments

tags: added: stx.2019.03
Ghada Khalil (gkhalil) on 2018-10-10
Changed in starlingx:
status: New → Triaged
Ghada Khalil (gkhalil) wrote :

This issue is the root cause of test failures seen by the Intel test team. See https://bugs.launchpad.net/starlingx/+bug/1797474

Re-gating this bug to stx.2018.10 as it's more widely seen than initially thought.

tags: added: stx.2018.10
removed: stx.2019.03
Changed in starlingx:
assignee: nobody → Steven Webster (swebster-wr)
Ghada Khalil (gkhalil) on 2018-10-12
Changed in starlingx:
importance: Medium → High
Juan Pablo Gomez (jpgomez) wrote :

Ghada, this issue is also reproducible in a virtual Duplex environment.

Fix proposed to branch: master
Review: https://review.openstack.org/611391

Changed in starlingx:
status: Triaged → In Progress

Fix proposed to branch: master
Review: https://review.openstack.org/611432

Change abandoned by Steven Webster (<email address hidden>) on branch: master
Review: https://review.openstack.org/611432
Reason: Real (original) review is here

https://review.openstack.org/#/c/611391/

Reviewed: https://review.openstack.org/611391
Committed: https://git.openstack.org/cgit/openstack/stx-config/commit/?id=bcc89e579cea90aa8c79dd3a4d164c371f638e5d
Submitter: Zuul
Branch: master

commit bcc89e579cea90aa8c79dd3a4d164c371f638e5d
Author: Steven Webster <email address hidden>
Date: Mon Oct 15 13:58:36 2018 -0400

    OVS: fix memory pool allocation for virtual environment

    This commit increases the vswitch hugepage number for virtual
    environments from 512 to 1024, making it equal to the same amount
    used for non-virtual environments.

    An issue was seen after the da1110a commit to enable LLDP
    over OVS, in which puppet would fail to successfully add ports to
    OVS. The issue would have manifested previously not as a puppet
    error, but as a failure to communicate over the data ports of
    some virtual compute nodes.

    The issue is a failure of DPDK to be able to find a contiguous
    mempool of sufficient size in any of the hugepages, which can
    happen in a virtual environment restricted to a 2M hugepage size.

    Since 1G and 2M pages can be used for both vswitch and vm
    purposes, the concept of a hugepage role is removed.

    Finally, the code has had some cleanup to separate out constants
    and make variable names more pythonic. Unit identifiers have
    been made consistent for readability and to prevent confusion.

    Change-Id: I14550526deddfaf13284d9273397a00b80eb8527
    Closes-Bug: #1796380
    Signed-off-by: Steven Webster <email address hidden>

Changed in starlingx:
status: In Progress → Fix Released
Erich Cordoba (ericho) wrote :

It's great to have a fix :)
Can we get the cherry-pick into the r/2018.10 branch? Hopefully we will see this change in tomorrow's build.

Steven Webster (swebster-wr) wrote :

Yep, working on that. It won't do a clean cherry-pick via gerrit. Stand by ...

Reviewed: https://review.openstack.org/611873
Committed: https://git.openstack.org/cgit/openstack/stx-config/commit/?id=a7cdb4ce95113c74d12d4581c676faa06d1c3112
Submitter: Zuul
Branch: r/2018.10

commit a7cdb4ce95113c74d12d4581c676faa06d1c3112
Author: Steven Webster <email address hidden>
Date: Mon Oct 15 13:58:36 2018 -0400

    OVS: fix memory pool allocation for virtual environment

    Cherry-pick to r/2018.10 branch of commit bcc89e5

    This commit increases the vswitch hugepage number for virtual
    environments from 512 to 1024, making it equal to the same amount
    used for non-virtual environments.

    An issue was seen after the da1110a commit to enable LLDP
    over OVS, in which puppet would fail to successfully add ports to
    OVS. The issue would have manifested previously not as a puppet
    error, but as a failure to communicate over the data ports of
    some virtual compute nodes.

    The issue is a failure of DPDK to be able to find a contiguous
    mempool of sufficient size in any of the hugepages, which can
    happen in a virtual environment restricted to a 2M hugepage size.

    Since 1G and 2M pages can be used for both vswitch and vm
    purposes, the concept of a hugepage role is removed.

    Finally, the code has had some cleanup to separate out constants
    and make variable names more pythonic. Unit identifiers have
    been made consistent for readability and to prevent confusion.

    Change-Id: I14550526deddfaf13284d9273397a00b80eb8527
    Closes-Bug: #1796380
    Signed-off-by: Steven Webster <email address hidden>

Ken Young (kenyis) on 2019-04-06
tags: added: stx.1.0
removed: stx.2018.10