application-apply failing due to error on osh-openstack-openvswitch

Bug #1826445 reported by Jose Perez Carranza
Affects: StarlingX
Status: Invalid
Importance: Critical
Assigned to: Mingyuan Qi

Bug Description

Brief Description
-----------------
Application apply fails with an error on osh-openstack-openvswitch. After 30 minutes of failures the apply is aborted, and when applying again it freezes waiting for osh-openstack-garbd to become ready, which never happens. Trying application-remove also fails.

Severity
--------
Critical: System/Feature is not usable due to the defect

Steps to Reproduce
------------------
1. Have a Standard 2+2 deployment ready
2. Execute application apply
  $ system application-apply stx-openstack
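The apply can be watched from the active controller along these lines (a sketch; the status strings and the 6th `|`-separated column are taken from the `system application-list` table shown later in this report):

```shell
#!/bin/bash
# Sketch of watching the apply; "applied" / "apply-failed" are the status
# strings sysinv prints in `system application-list`, parsed out of the
# 6th '|'-separated column of that table.
poll_status() {
    system application-list \
        | awk -F'|' '/stx-openstack/ {gsub(/ /,"",$6); print $6}'
}

watch_apply() {
    while true; do
        status=$(poll_status)
        echo "$(date '+%H:%M:%S') stx-openstack: ${status}"
        case "${status}" in
            applied)      return 0 ;;
            apply-failed) echo "apply failed - see sysinv.log"; return 1 ;;
        esac
        sleep 30
    done
}
# usage (on the active controller): watch_apply
```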

Expected Behavior
------------------
Application apply should be completed successfully

Actual Behavior
----------------
- Application apply failed with the error below:
========
2019-04-25 11:42:42.746 41 ERROR armada.handlers.armada File "/usr/local/lib/python3.6/dist-packages/armada/handlers/tiller.py", line 475, in install_release
2019-04-25 11:42:42.746 41 ERROR armada.handlers.armada raise ex.ReleaseException(release, status, 'Install')
2019-04-25 11:42:42.746 41 ERROR armada.handlers.armada armada.exceptions.tiller_exceptions.ReleaseException: Failed to Install release: osh-openstack-openvswitch - Tiller Message: b'Release "osh-openstack-openvswitch" failed: timed out waiting for the condition'
2019-04-25 11:42:42.746 41 ERROR armada.handlers.armada
=====

- Applying the application again failed with the error below:
=====
2019-04-25 12:04:09.261 12505 DEBUG armada.handlers.wait [-] [chart=garbd]: Resource osh-openstack-garbd-garbd-7f488f575d-cbhfq not ready: Waiting for pod osh-openstack-garbd-garbd-7f488f575d-cbhfq to be ready... handle_resource /usr/local/lib/python3.6/dist-packages/armada/handlers/wait.py:184
=====

- application-remove also fails with the error below:
=============
2019-04-25 12:18:44.446 29820 ERROR sysinv.common.kubernetes [req-7da7b7ca-4f87-479b-9a0c-d47f51bf35ec admin admin] Failed to clean up Namespace openstack: {"kind":"Status","apiVersion":"v1","metadata":{},"status":"Failure","message":"Operation cannot be fulfilled on namespaces \"openstack\": The system is ensuring all content is removed from this namespace. Upon completion, this namespace will automatically be purged by the system.","reason":"Conflict","details":{"name":"openstack","kind":"namespaces"},"code":409}
=============
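The 409 Conflict above means kube-apiserver is still purging the namespace's contents; a retry can only succeed once the namespace is fully gone. A hypothetical helper for that (function name and timeout are illustrative, not part of StarlingX):

```shell
#!/bin/bash
# Hypothetical helper: wait for the openstack namespace to finish
# terminating before retrying application-remove / application-apply.
wait_ns_gone() {
    local ns="$1" timeout="${2:-600}" waited=0
    while kubectl get namespace "${ns}" >/dev/null 2>&1; do
        [ "${waited}" -ge "${timeout}" ] && return 1
        echo "namespace ${ns} still terminating (${waited}s)..."
        sleep 10
        waited=$((waited + 10))
    done
    return 0
}
# usage: wait_ns_gone openstack 900 && echo "openstack namespace fully removed"
# `kubectl get all -n openstack` shows most of what is still being deleted.
```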

Reproducibility
---------------
100% on Standard 2+2

System Configuration
--------------------
Multi-node system 2+2 virtual environment

Branch/Pull Time/Commit
-----------------------
###
### StarlingX
### Built from master
###

OS="centos"
SW_VERSION="19.01"
BUILD_TARGET="Host Installer"
BUILD_TYPE="Formal"
BUILD_ID="20190425T013000Z"

JOB="STX_build_master_master"
<email address hidden>"
BUILD_NUMBER="79"
BUILD_HOST="starlingx_mirror"
BUILD_DATE="2019-04-25 01:30:00 +0000"

Last Pass
---------
Last Pass observed on this same host machine and environment was with BUILD_DATE="2019-04-15".

Timestamp/Logs
--------------
"sysinv.log" attached

Test Activity
-------------
Test Case development

Revision history for this message
Frank Miller (sensfan22) wrote :

It is interesting that this issue does not occur in a BM environment but is occurring in a virtual environment.

Marking as release gating as we require the stx-openstack application to work on both BM and virtual environments.

Assigning to Cindy and requesting assistance to identify a prime to investigate this issue.

tags: added: stx.2.0
tags: added: stx.containers
Changed in starlingx:
status: New → Triaged
importance: Undecided → Critical
assignee: nobody → Cindy Xie (xxie1)
Cindy Xie (xxie1)
Changed in starlingx:
assignee: Cindy Xie (xxie1) → Mingyuan Qi (myqi)
Revision history for this message
Jose Perez Carranza (jgperezc) wrote :

Same behavior observed on today's ISO:
=========
###
### StarlingX
### Built from master
###

OS="centos"
SW_VERSION="19.01"
BUILD_TARGET="Host Installer"
BUILD_TYPE="Formal"
BUILD_ID="20190428T233000Z"

JOB="STX_build_master_master"
<email address hidden>"
BUILD_NUMBER="84"
BUILD_HOST="starlingx_mirror"
BUILD_DATE="2019-04-28 23:30:00 +0000"

http://mirror.starlingx.cengn.ca/mirror/starlingx/master/centos/20190428T233000Z/outputs/

=======

Revision history for this message
Mingyuan Qi (myqi) wrote :

@Jose, could you please run `kubectl -n openstack describe po <openvswitch-related-pods>` and `kubectl -n openstack logs <openvswitch-related-pods>` when the issue happens? I don't see any detailed logs that identify openvswitch as the root cause. An overall view of the existing pods and jobs in the openstack namespace would help as well.
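The requested data could be gathered along these lines (a sketch; pod names come from the live listing, and `--all-containers` assumes a reasonably recent kubectl):

```shell
#!/bin/bash
# Sketch of gathering the requested triage data; nothing is hard-coded,
# the pods iterated over are whatever the live listing returns.
kubectl -n openstack get pods -o wide   # overall pod view
kubectl -n openstack get jobs           # job states in the namespace

# describe + logs for every openvswitch-related pod:
for pod in $(kubectl -n openstack get pods -o name | grep -i openvswitch); do
    kubectl -n openstack describe "${pod}"
    kubectl -n openstack logs "${pod}" --all-containers=true
done
```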

Revision history for this message
Jose Perez Carranza (jgperezc) wrote :

Hi, with the build BUILD_ID="20190428T233000Z" the details of the failing environment are:

Controllers:
  controller-0:
    partition_a: 250
    partition_b: 250
    partition_d: 50
    memory_size: 10240
    system_cores: 4
  controller-1:
    partition_a: 250
    partition_b: 250
    partition_d: 50
    memory_size: 10240
    system_cores: 4
Computes:
  compute-0:
    partition_a: 250 ...


Revision history for this message
Cristopher Lemus (cjlemusc) wrote :

With 8 cores and 32GB for each VM, we manage to complete the application apply. Attached file with outputs for further checking.

I'm going to scale down to 4-6 cores and 16GB.

Revision history for this message
Jose Perez Carranza (jgperezc) wrote :

The minimum memory configuration that we saw working is:

Controllers - 10 GB + 4 Cores
Compute - 19 GB + 8 cores

For workers (computes) the required memory is double what worked on the previous build, so an investigation is needed to track down why this much memory is now needed.

Revision history for this message
Mingyuan Qi (myqi) wrote :

@Jose, I was able to apply stx-openstack successfully with 20190506T233000Z ISO, here is my configuration:

[wrsroot@controller-0 ~(keystone_admin)]$ system host-list
+----+--------------+-------------+----------------+-------------+--------------+
| id | hostname     | personality | administrative | operational | availability |
+----+--------------+-------------+----------------+-------------+--------------+
| 1  | controller-0 | controller  | unlocked       | enabled     | available    |
| 2  | controller-1 | controller  | unlocked       | enabled     | available    |
| 3  | compute-0    | worker      | unlocked       | enabled     | available    |
| 4  | compute-1    | worker      | unlocked       | enabled     | available    |
+----+--------------+-------------+----------------+-------------+--------------+
[wrsroot@controller-0 ~(keystone_admin)]$ system application-list
+---------------+-----------------------------+-----------------+---------------+---------+-----------+
| application   | version                     | manifest name   | manifest file | status  | progress  |
+---------------+-----------------------------+-----------------+---------------+---------+-----------+
| stx-openstack | 1.0-11-centos-stable-latest | armada-manifest | manifest.yaml | applied | completed |
+---------------+-----------------------------+-----------------+---------------+---------+-----------+
[wrsroot@controller-0 ~(keystone_admin)]$ cat /etc/build.info
###
### StarlingX
### Built from master
###

OS="centos"
SW_VERSION="19.01"
BUILD_TARGET="Host Installer"
BUILD_TYPE="Formal"
BUILD_ID="20190506T233000Z"

JOB="STX_build_master_master"
<email address hidden>"
BUILD_NUMBER="93"
BUILD_HOST="starlingx_mirror"
BUILD_DATE="2019-05-06 23:30:00 +0000"

Name: Mingyuan_multi-controller-0
UUID: 8389ef80-7138-4a37-8138-ded0d1c3d2a8
OS Type: hvm
State: running
CPU(s): 6
CPU time: 14914.8s
Max memory: 31457280 KiB
Used memory: 31457280 KiB

Name: Mingyuan_multi-controller-1
UUID: 24f2f54c-a34c-4e8e-b7ef-0090540f2d4a
OS Type: hvm
State: running
CPU(s): 6
CPU time: 9374.0s
Max memory: 31457280 KiB
Used memory: 31457280 KiB

Name: Mingyuan_multi-compute-0
UUID: 4b279f22-812f-497d-9203-0ab6722de449
OS Type: hvm
State: running
CPU(s): 4
CPU time: 1763.1s
Max memory: 16777216 KiB
Used memory: 16777216 KiB

Name: Mingyuan_multi-compute-1
UUID: 4f185e25-0108-4e34-b526-d11b5192cd99
OS Type: hvm
State: running
CPU(s): 4
CPU time: 1469.3s
Max memory: 16777216 KiB
Used memory: 16777216 KiB

compute-0:~$ free -h
              total        used        free      shared  buff/cache   available
Mem:            15G        1.6G        7.9G         11M        6.0G         13G
Swap:            0B          0B          0B

Revision history for this message
Mingyuan Qi (myqi) wrote :

I can run stx-openstack with two 4-core + 16G worker nodes.

Revision history for this message
Jose Perez Carranza (jgperezc) wrote :

Is the configuration below the one you used?

Controller 6 Core + 30 GB
Worker 4 Core + 16 GB

If that is the case the wiki should be updated with the new minimum values for the nodes.

According to the wiki [1] below are the minimum values for the nodes.
 * Memory size:
        * Controller nodes: 16384 MB
        * Compute nodes: 10240 MB

[1] - https://wiki.openstack.org/wiki/StarlingX/Containers/InstallationOnStandard

Revision history for this message
Bart Wensley (bartwensley) wrote :

From the notes above, it seems like the original failure was reported when the controllers were configured with only 10G of memory. The recommendations in the wiki are for 16G controllers and 10G computes (https://wiki.openstack.org/wiki/StarlingX/Containers/InstallationOnStandard). I have been running this config without issue in recent loads.

Revision history for this message
Mingyuan Qi (myqi) wrote :

I used the same configuration as the wiki requires and still got a successful apply.

Controller:
4 CPUs
16GB Mem

Compute:
3 CPUs
10GB Mem

[wrsroot@controller-0 ~(keystone_admin)]$ system application-list
+---------------+-----------------------------+-----------------+---------------+---------+-----------+
| application   | version                     | manifest name   | manifest file | status  | progress  |
+---------------+-----------------------------+-----------------+---------------+---------+-----------+
| stx-openstack | 1.0-11-centos-stable-latest | armada-manifest | manifest.yaml | applied | completed |
+---------------+-----------------------------+-----------------+---------------+---------+-----------+
[wrsroot@controller-0 ~(keystone_admin)]$ cat /etc/build.info
###
### StarlingX
### Built from master
###

OS="centos"
SW_VERSION="19.01"
BUILD_TARGET="Host Installer"
BUILD_TYPE="Formal"
BUILD_ID="20190506T233000Z"

JOB="STX_build_master_master"
<email address hidden>"
BUILD_NUMBER="93"
BUILD_HOST="starlingx_mirror"
BUILD_DATE="2019-05-06 23:30:00 +0000"

Revision history for this message
Cindy Xie (xxie1) wrote :

@Jose, can you please test it on the HW with fewer CPUs? Huge pages are allocated for tenant VMs according to the number of cores, so if your system has more CPUs than recommended but less RAM, the host may not have enough memory left.
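A rough illustration of this effect (the per-core figure below is an assumption for illustration, not the actual StarlingX allocator value):

```shell
#!/bin/bash
# Illustration only: per-core hugepage figure is an assumed number,
# not the real StarlingX allocation policy.
hugepage_reserved_mib() {
    local cores="$1" per_core_mib="$2"
    echo $((cores * per_core_mib))
}

total_mib=16384        # 16 GB worker node
per_core_mib=1024      # assumed: 1 GiB of hugepages reserved per core
for cores in 4 8; do
    reserved=$(hugepage_reserved_mib "${cores}" "${per_core_mib}")
    echo "${cores} cores: ${reserved} MiB hugepages, $((total_mib - reserved)) MiB left for platform"
done
# Actual hugepage state on a node: grep -i huge /proc/meminfo
```

With the same RAM, doubling the cores doubles the assumed hugepage reservation and halves what remains for platform pods, which matches the behavior Jose observed.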

Revision history for this message
Jose Perez Carranza (jgperezc) wrote :

@Cindy thanks for the hint, it actually worked for us!! By decreasing the number of cores we were able to have a deployment ready with the same RAM.

Mingyuan Qi (myqi)
Changed in starlingx:
status: Triaged → Invalid