Simplex Bare-Metal : Nova-compute with Status=disable and State=down/degraded

Bug #1796420 reported by Maria Guadalupe Perez Ibara
This bug affects 1 person
Affects | Status | Importance | Assigned to | Milestone
StarlingX | Fix Released | High | Hayde Martinez |

Bug Description

Title
-----

Nova-compute with Status=disable and State=down

Brief Description
-----------------
Nova-compute does not start on its own. When it is enabled manually, only the Status changes to enabled; the State remains down and the host availability is degraded.

Severity
--------
Critical

Steps to Reproduce
------------------
1.- Configure Provider Networks
2.- Provide Data Interface on Controller-0
3.- Configure Cinder on Controller Disk
4.- Configure VM Local Storage on Controller Disk
5.- Add LVM Storage Backend
6.- Unlock controller-0
7.- Verify the controller-0 configuration with nova service-list
8.- Verify the host state with system host-list

Expected Behavior
------------------
Nova-compute should be Status: Enabled
system host-list availability : available

Actual Behavior
----------------
Nova-compute shows Status: disabled and State: down
system host-list availability: degraded

Reproducibility
---------------

Reproducible 100% in Simplex - Bare Metal

System Configuration
--------------------
Simplex Bare Metal

iso : stx-2018-10-03-11-r-2018.10.iso

Branch/Pull Time/Commit
-----------------------
stx-tools

Revision history for this message
Maria Guadalupe Perez Ibara (maria-gp) wrote :
summary: - Nova-compute with Status=disable and State=down/degraded
+ Simplex Bare-Metal : Nova-compute with Status=disable and
+ State=down/degraded
description: updated
Revision history for this message
Maria Guadalupe Perez Ibara (maria-gp) wrote :

The same problem also reproduces in simplex-virtual.

Ada Cabrales (acabrale)
tags: added: stx.2018.10
Revision history for this message
Jim Gauld (jgauld) wrote :

The system does not complete the computeconfig step because of a puppet manifest failure. The puppet manifest logs show specific ovs-vswitchd DPDK errors. As a result, compute is not configured and nova-compute does not start, so compute is never brought into service.

The user.log and daemon.log indicate the puppet manifest failure. Note that any puppet warning or error yields a failing RC:1.

/var/log/daemon.log:
2018-10-05T11:30:28.609 controller-0 compute_config[20588]: info Warnings found. See /var/log/puppet/2018-10-05-11-28-23_compute/puppet.log for details
2018-10-05T11:30:28.633 controller-0 compute_config[20588]: info Failed to run the puppet manifest (RC:1

Looking at the /var/log/puppet/2018-10-05-11-28-23_compute/puppet.log, we see:
2018-10-05T11:28:59.134 Notice: 2018-10-05 11:28:59 +0000 /Stage[main]/Platform::Vswitch::Ovs/Platform::Vswitch::Ovs::Port[eth0]/Exec[ovs-add-port: eth0]/returns: ovs-vsctl: Error detected while setting up 'eth0': Error attaching device '0000:03:00.0' to DPDK. See ovs-vswitchd log for details.

Looking at the corresponding /var/log/openvswitch/ovs-vswitchd.log and kern.log we see related dpdk err logs (0000:03:00.0 DMA remapping failed, error 14; Driver cannot attach the device).

I recommend that a networking/OVS subject-matter expert look at their configuration for the data/eno2 interface.
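
The log walk described in this comment (daemon.log points to puppet.log, which names the failing device) can be sketched in Python. This is only an illustration under stated assumptions: the sample text is copied from the lines quoted above, and the helper names `find_puppet_log` and `find_dpdk_attach_failures` are hypothetical, not part of any StarlingX tool.

```python
import re

# Sample lines copied from the logs quoted in this comment.
DAEMON_LOG = (
    "2018-10-05T11:30:28.609 controller-0 compute_config[20588]: info "
    "Warnings found. See /var/log/puppet/2018-10-05-11-28-23_compute/puppet.log for details\n"
    "2018-10-05T11:30:28.633 controller-0 compute_config[20588]: info "
    "Failed to run the puppet manifest (RC:1\n"
)

PUPPET_LOG = (
    "2018-10-05T11:28:59.134 Notice: 2018-10-05 11:28:59 +0000 "
    "/Stage[main]/Platform::Vswitch::Ovs/Platform::Vswitch::Ovs::Port[eth0]/"
    "Exec[ovs-add-port: eth0]/returns: ovs-vsctl: Error detected while setting up "
    "'eth0': Error attaching device '0000:03:00.0' to DPDK. "
    "See ovs-vswitchd log for details.\n"
)


def find_puppet_log(daemon_text):
    """Return the puppet.log path named in a 'Warnings found' line, or None."""
    match = re.search(r"Warnings found\. See (\S+) for details", daemon_text)
    return match.group(1) if match else None


def find_dpdk_attach_failures(puppet_text):
    """Return (interface, pci_address) pairs for DPDK attach errors."""
    pattern = re.compile(
        r"Error detected while setting up '([^']+)': "
        r"Error attaching device '([^']+)' to DPDK"
    )
    return pattern.findall(puppet_text)


print(find_puppet_log(DAEMON_LOG))
print(find_dpdk_attach_failures(PUPPET_LOG))
```

On a live system the two strings would be read from the files on disk; the PCI address returned by the second helper is the value to search for in ovs-vswitchd.log and kern.log.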

Revision history for this message
Ghada Khalil (gkhalil) wrote :

Maria, can you please confirm that VT-d is enabled in the BIOS of the node?

Revision history for this message
Maria Guadalupe Perez Ibara (maria-gp) wrote :

Yes, VT-d is enabled.

Revision history for this message
Ada Cabrales (acabrale) wrote :

I have reproduced the issue on a second workstation. I have VT-d enabled.
One extra thing is that I'm constantly having the following message displayed on the console:

DMAR: intel_iommu_map: iommu width (39) is not sufficient for the mapped address (7fbdc0000000)
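
The arithmetic behind this DMAR message can be checked with a few lines of Python. This is a minimal sketch: it assumes the value in parentheses is a hexadecimal address and that "iommu width (39)" means 39 address bits. The function names are hypothetical.

```python
def required_address_bits(address_hex):
    """Number of bits needed to represent the given hexadecimal address."""
    return int(address_hex, 16).bit_length()


def fits_iommu_width(address_hex, width_bits):
    """True if the address can be expressed within width_bits bits."""
    return required_address_bits(address_hex) <= width_bits


mapped = "7fbdc0000000"  # address taken from the DMAR console message
print(required_address_bits(mapped))  # 47
print(fits_iommu_width(mapped, 39))   # False
```

Under those assumptions the mapped address needs 47 bits, which is more than the 39 bits the message reports for the IOMMU.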

Revision history for this message
Maria Guadalupe Perez Ibara (maria-gp) wrote :

In simplex-virtual, /var/log/openvswitch/ovs-vswitchd.log shows the following errors:

2018-10-09T13:52:24.876Z|00015|dpdk|ERR|EAL: Not enough memory available on socket 0! Requested: 1024MB, available: 210MB
2018-10-09T13:52:19.462Z|00016|dpdk|ERR|EAL: Cannot init memory

Attached log: ovs-vswitchd.log

Ghada Khalil (gkhalil)
tags: added: stx.networking
Revision history for this message
Matt Peters (mpeters-wrs) wrote :

What are the HW specs (cpu, memory, nics) of the system that this is being tested on?

Also, the error you are getting from ovs-dpdk indicates that it is requesting 1G for socket 0, whereas a virtual environment is currently configured for 512MB. Even so, it still looks like you have an insufficient number of huge pages available for OVS. Can you confirm the configuration of your virtual machine?
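
The figures in the EAL error quoted earlier in this thread can be checked with a short Python sketch. It is only an illustration: the regular expression is written against that single log line, and `parse_eal_memory_error` is a hypothetical helper rather than part of DPDK or OVS.

```python
import re

EAL_MEMORY_ERROR = re.compile(
    r"Not enough memory available on socket (\d+)! "
    r"Requested: (\d+)MB, available: (\d+)MB"
)


def parse_eal_memory_error(line):
    """Extract socket, requested and available memory (in MB) from an EAL error line."""
    match = EAL_MEMORY_ERROR.search(line)
    if match is None:
        return None
    socket_id, requested_mb, available_mb = (int(g) for g in match.groups())
    return {
        "socket": socket_id,
        "requested_mb": requested_mb,
        "available_mb": available_mb,
        "shortfall_mb": requested_mb - available_mb,
    }


line = (
    "2018-10-09T13:52:24.876Z|00015|dpdk|ERR|EAL: Not enough memory available "
    "on socket 0! Requested: 1024MB, available: 210MB"
)
info = parse_eal_memory_error(line)
print(info["shortfall_mb"])        # 814
print(512 - info["available_mb"])  # 302, so a 512MB request would not fit either
```

With 210MB available, the request falls short at 1024MB and would still fall short at the 512MB expected for a virtual environment, which matches the observation about insufficient huge pages.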

Revision history for this message
Erich Cordoba (ericho) wrote :

It turns out that, for the bare-metal case, the problem seems to be related to DPDK and a wrong address width. The issue is reported and fixed in DPDK[1], but the fix is not yet available in the CentOS openvswitch package.

I applied the patch attached below to openvswitch and reinstalled the package. After this, all the kernel errors related to intel_iommu were gone and nova-compute was up; the degraded state also changed to active.

Revision history for this message
Erich Cordoba (ericho) wrote :
Changed in starlingx:
assignee: nobody → Hayde Martinez (haydemtz)
Ghada Khalil (gkhalil)
Changed in starlingx:
importance: Undecided → High
Revision history for this message
Ghada Khalil (gkhalil) wrote :

Adding recommendation from Matt Peters (stx networking technical lead) with regards to the proposed patch:

-----Original Message-----
From: Peters, Matt [mailto:<email address hidden>]
Sent: Thursday, October 11, 2018 11:02 AM
To: Rowsell, Brent; Dean Troyer; Cordoba Malibran, Erich
Cc: <email address hidden>
Subject: Re: [Starlingx-discuss] Patch to support different address witdhs in DPDK

The patch is not for openvswitch itself and is only for the bundled DPDK distribution that is included in the CentOS specific SRPM (currently based on 17.11.0 - old snapshot).

Currently CentOS is not distributing a newer version of the openvswitch package that includes a newer version of DPDK. I don't think we want to up-version the DPDK distribution in this package within StarlingX, since this would deviate from a supported lineup of CentOS. That being said, the recommended version from OVS is DPDK 17.11.3.

I think the short term solution is we should include this patch in StarlingX to address the immediate issue. Longer term, we would want to push to get the CentOS openvswitch SRPM package updated with a newer version of DPDK 17.11 (currently at 17.11.4) to have all available bug fixes in the dpdk-stable branch.

Regards, Matt

Revision history for this message
Ghada Khalil (gkhalil) wrote :

The patch will need to be included in master as well as the r/2018.10 release branch

Changed in starlingx:
status: New → Triaged
Changed in starlingx:
status: Triaged → In Progress
Revision history for this message
Ada Cabrales (acabrale) wrote :

We are still having a similar problem in the virtual environment.
A new Launchpad bug for simplex-virtual has been created:

https://bugs.launchpad.net/starlingx/+bug/1797474

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to stx-integ (master)

Reviewed: https://review.openstack.org/609819
Committed: https://git.openstack.org/cgit/openstack/stx-integ/commit/?id=9009950c5721ea2d6f7292e63fbe6ecf124bdcfc
Submitter: Zuul
Branch: master

commit 9009950c5721ea2d6f7292e63fbe6ecf124bdcfc
Author: Hayde Martinez <email address hidden>
Date: Thu Oct 11 16:08:59 2018 -0500

    Fix wrong address width

    A patch was added to fix the wrong address width
    This patch refers to commit:
    github.com/DPDK/dpdk/commit/54a328f552ff2e0098c3f96f9e32302675f2bcf4
    Also it can be removed when the bundled DPDK source is upversioned.

    Closes-Bug:1796420

    Change-Id: I1d148f170302343982fb0fef07d88994ffefb43e
    Signed-off-by: Hayde Martinez <email address hidden>

Changed in starlingx:
status: In Progress → Fix Released
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to stx-integ (r/2018.10)

Fix proposed to branch: r/2018.10
Review: https://review.openstack.org/611196

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to stx-integ (r/2018.10)

Reviewed: https://review.openstack.org/611196
Committed: https://git.openstack.org/cgit/openstack/stx-integ/commit/?id=fcf5d8c3b598a04e503113ebc99b26afeb4acbf9
Submitter: Zuul
Branch: r/2018.10

commit fcf5d8c3b598a04e503113ebc99b26afeb4acbf9
Author: Hayde Martinez <email address hidden>
Date: Thu Oct 11 16:08:59 2018 -0500

    Fix wrong address width

    A patch was added to fix the wrong address width
    This patch refers to commit:
    github.com/DPDK/dpdk/commit/54a328f552ff2e0098c3f96f9e32302675f2bcf4
    Also it can be removed when the bundled DPDK source is upversioned.

    Closes-Bug:1796420

    Change-Id: I1d148f170302343982fb0fef07d88994ffefb43e
    Signed-off-by: Hayde Martinez <email address hidden>
    (cherry picked from commit 9009950c5721ea2d6f7292e63fbe6ecf124bdcfc)

Ken Young (kenyis)
tags: added: stx.1.0
removed: stx.2018.10