application-apply failing due to error on osh-openstack-openvswitch

Bug #1826445 reported by Jose Perez Carranza
Affects: StarlingX
Status: Invalid
Importance: Critical
Assigned to: Mingyuan Qi

Bug Description

Brief Description
-----------------
Application apply fails with an error on osh-openstack-openvswitch. After 30 minutes of failures the apply is aborted, and when applying again it freezes waiting for osh-openstack-garbd to become ready, which never happens. Trying application-remove also fails.

Severity
--------
Critical: System/Feature is not usable due to the defect

Steps to Reproduce
------------------
1. Have a Standard 2+2 deployment ready
2. Execute application apply
  $ system application-apply stx-openstack
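The apply can be watched from the active controller along these lines (a sketch; the status strings and the 6th `|`-separated column are taken from the `system application-list` table shown later in this report):

```shell
#!/bin/bash
# Sketch of watching the apply; "applied" / "apply-failed" are the status
# strings sysinv prints in `system application-list`, parsed out of the
# 6th '|'-separated column of that table.
poll_status() {
    system application-list \
        | awk -F'|' '/stx-openstack/ {gsub(/ /,"",$6); print $6}'
}

watch_apply() {
    while true; do
        status=$(poll_status)
        echo "$(date '+%H:%M:%S') stx-openstack: ${status}"
        case "${status}" in
            applied)      return 0 ;;
            apply-failed) echo "apply failed - see sysinv.log"; return 1 ;;
        esac
        sleep 30
    done
}
# usage (on the active controller): watch_apply
```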

Expected Behavior
------------------
Application apply should be completed successfully

Actual Behavior
----------------
- Application apply failed with the error below:
========
2019-04-25 11:42:42.746 41 ERROR armada.handlers.armada File "/usr/local/lib/python3.6/dist-packages/armada/handlers/tiller.py", line 475, in install_release
2019-04-25 11:42:42.746 41 ERROR armada.handlers.armada raise ex.ReleaseException(release, status, 'Install')
2019-04-25 11:42:42.746 41 ERROR armada.handlers.armada armada.exceptions.tiller_exceptions.ReleaseException: Failed to Install release: osh-openstack-openvswitch - Tiller Message: b'Release "osh-openstack-openvswitch" failed: timed out waiting for the condition'
2019-04-25 11:42:42.746 41 ERROR armada.handlers.armada
=====

- Applying the application again failed with the error below:
=====
2019-04-25 12:04:09.261 12505 DEBUG armada.handlers.wait [-] [chart=garbd]: Resource osh-openstack-garbd-garbd-7f488f575d-cbhfq not ready: Waiting for pod osh-openstack-garbd-garbd-7f488f575d-cbhfq to be ready... handle_resource /usr/local/lib/python3.6/dist-packages/armada/handlers/wait.py:184
=====

- application-remove also fails with the error below:
=============
2019-04-25 12:18:44.446 29820 ERROR sysinv.common.kubernetes [req-7da7b7ca-4f87-479b-9a0c-d47f51bf35ec admin admin] Failed to clean up Namespace openstack: {"kind":"Status","apiVersion":"v1","metadata":{},"status":"Failure","message":"Operation cannot be fulfilled on namespaces \"openstack\": The system is ensuring all content is removed from this namespace. Upon completion, this namespace will automatically be purged by the system.","reason":"Conflict","details":{"name":"openstack","kind":"namespaces"},"code":409}
=============
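The 409 Conflict above means kube-apiserver is still purging the namespace's contents; a retry can only succeed once the namespace is fully gone. A hypothetical helper for that (function name and timeout are illustrative, not part of StarlingX):

```shell
#!/bin/bash
# Hypothetical helper: wait for the openstack namespace to finish
# terminating before retrying application-remove / application-apply.
wait_ns_gone() {
    local ns="$1" timeout="${2:-600}" waited=0
    while kubectl get namespace "${ns}" >/dev/null 2>&1; do
        [ "${waited}" -ge "${timeout}" ] && return 1
        echo "namespace ${ns} still terminating (${waited}s)..."
        sleep 10
        waited=$((waited + 10))
    done
    return 0
}
# usage: wait_ns_gone openstack 900 && echo "openstack namespace fully removed"
# `kubectl get all -n openstack` shows most of what is still being deleted.
```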

Reproducibility
---------------
100% on Standard 2+2

System Configuration
--------------------
Multi-node system 2+2 virtual environment

Branch/Pull Time/Commit
-----------------------
###
### StarlingX
### Built from master
###

OS="centos"
SW_VERSION="19.01"
BUILD_TARGET="Host Installer"
BUILD_TYPE="Formal"
BUILD_ID="20190425T013000Z"

JOB="STX_build_master_master"
<email address hidden>"
BUILD_NUMBER="79"
BUILD_HOST="starlingx_mirror"
BUILD_DATE="2019-04-25 01:30:00 +0000"

Last Pass
---------
Last Pass observed on this same host machine and environment was with BUILD_DATE="2019-04-15".

Timestamp/Logs
--------------
"sysinv.log" attached

Test Activity
-------------
Test Case development

Revision history for this message
Frank Miller (sensfan22) wrote :

It is interesting that this issue does not occur in a BM environment but is occurring in a virtual environment.

Marking as release gating as we require the stx-openstack application to work on both BM and virtual environments.

Assigning to Cindy and requesting assistance to identify a prime to investigate this issue.

tags: added: stx.2.0
tags: added: stx.containers
Changed in starlingx:
status: New → Triaged
importance: Undecided → Critical
assignee: nobody → Cindy Xie (xxie1)
Cindy Xie (xxie1)
Changed in starlingx:
assignee: Cindy Xie (xxie1) → Mingyuan Qi (myqi)
Revision history for this message
Jose Perez Carranza (jgperezc) wrote :

Same behavior observed on today's ISO:
=========
###
### StarlingX
### Built from master
###

OS="centos"
SW_VERSION="19.01"
BUILD_TARGET="Host Installer"
BUILD_TYPE="Formal"
BUILD_ID="20190428T233000Z"

JOB="STX_build_master_master"
<email address hidden>"
BUILD_NUMBER="84"
BUILD_HOST="starlingx_mirror"
BUILD_DATE="2019-04-28 23:30:00 +0000"

http://mirror.starlingx.cengn.ca/mirror/starlingx/master/centos/20190428T233000Z/outputs/

=======

Revision history for this message
Mingyuan Qi (myqi) wrote :

@Jose, could you please run `kubectl -n openstack describe po <openvswitch-related-pods>` and `kubectl -n openstack logs <openvswitch-related-pods>` when the issue happens? I don't see any detailed logs that identify openvswitch as the root cause. An overall view of the existing pods and jobs in the openstack namespace would help as well.
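The requested data could be gathered along these lines (a sketch; pod names come from the live listing, and `--all-containers` assumes a reasonably recent kubectl):

```shell
#!/bin/bash
# Sketch of gathering the requested triage data; nothing is hard-coded,
# the pods iterated over are whatever the live listing returns.
kubectl -n openstack get pods -o wide   # overall pod view
kubectl -n openstack get jobs           # job states in the namespace

# describe + logs for every openvswitch-related pod:
for pod in $(kubectl -n openstack get pods -o name | grep -i openvswitch); do
    kubectl -n openstack describe "${pod}"
    kubectl -n openstack logs "${pod}" --all-containers=true
done
```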

Revision history for this message
Jose Perez Carranza (jgperezc) wrote :

Hi, with the build BUILD_ID="20190428T233000Z" the details of the failing environment are:

Controllers:
  controller-0:
    partition_a: 250
    partition_b: 250
    partition_d: 50
    memory_size: 10240
    system_cores: 4
  controller-1:
    partition_a: 250
    partition_b: 250
    partition_d: 50
    memory_size: 10240
    system_cores: 4
Computes:
  compute-0:
    partition_a: 250 ...


Revision history for this message
Cristopher Lemus (cjlemusc) wrote :

With 8 cores and 32GB for each VM, we manage to complete the application apply. Attached file with outputs for further checking.

I'm going to scale down to 4-6 cores and 16GB.

Revision history for this message
Jose Perez Carranza (jgperezc) wrote :

The minimum memory configuration that we saw working is:

Controllers - 10 GB + 4 Cores
Compute - 19 GB + 8 cores

For workers (computes) the required memory is double what worked on the previous build, so an investigation is needed to track down why this much memory is now needed.

Revision history for this message
Mingyuan Qi (myqi) wrote :

@Jose, I was able to apply stx-openstack successfully with 20190506T233000Z ISO, here is my configuration:

[wrsroot@controller-0 ~(keystone_admin)]$ system host-list
+----+--------------+-------------+----------------+-------------+--------------+
| id | hostname     | personality | administrative | operational | availability |
+----+--------------+-------------+----------------+-------------+--------------+
| 1  | controller-0 | controller  | unlocked       | enabled     | available    |
| 2  | controller-1 | controller  | unlocked       | enabled     | available    |
| 3  | compute-0    | worker      | unlocked       | enabled     | available    |
| 4  | compute-1    | worker      | unlocked       | enabled     | available    |
+----+--------------+-------------+----------------+-------------+--------------+
[wrsroot@controller-0 ~(keystone_admin)]$ system application-list
+---------------+-----------------------------+-----------------+---------------+---------+-----------+
| application   | version                     | manifest name   | manifest file | status  | progress  |
+---------------+-----------------------------+-----------------+---------------+---------+-----------+
| stx-openstack | 1.0-11-centos-stable-latest | armada-manifest | manifest.yaml | applied | completed |
+---------------+-----------------------------+-----------------+---------------+---------+-----------+
[wrsroot@controller-0 ~(keystone_admin)]$ cat /etc/build.info
###
### StarlingX
### Built from master
###

OS="centos"
SW_VERSION="19.01"
BUILD_TARGET="Host Installer"
BUILD_TYPE="Formal"
BUILD_ID="20190506T233000Z"

JOB="STX_build_master_master"
<email address hidden>"
BUILD_NUMBER="93"
BUILD_HOST="starlingx_mirror"
BUILD_DATE="2019-05-06 23:30:00 +0000"

Name: Mingyuan_multi-controller-0
UUID: 8389ef80-7138-4a37-8138-ded0d1c3d2a8
OS Type: hvm
State: running
CPU(s): 6
CPU time: 14914.8s
Max memory: 31457280 KiB
Used memory: 31457280 KiB

Name: Mingyuan_multi-controller-1
UUID: 24f2f54c-a34c-4e8e-b7ef-0090540f2d4a
OS Type: hvm
State: running
CPU(s): 6
CPU time: 9374.0s
Max memory: 31457280 KiB
Used memory: 31457280 KiB

Name: Mingyuan_multi-compute-0
UUID: 4b279f22-812f-497d-9203-0ab6722de449
OS Type: hvm
State: running
CPU(s): 4
CPU time: 1763.1s
Max memory: 16777216 KiB
Used memory: 16777216 KiB

Name: Mingyuan_multi-compute-1
UUID: 4f185e25-0108-4e34-b526-d11b5192cd99
OS Type: hvm
State: running
CPU(s): 4
CPU time: 1469.3s
Max memory: 16777216 KiB
Used memory: 16777216 KiB

compute-0:~$ free -h
              total        used        free      shared  buff/cache   available
Mem:            15G        1.6G        7.9G         11M        6.0G         13G
Swap:            0B          0B          0B

Revision history for this message
Mingyuan Qi (myqi) wrote :

I can run stx-openstack with two 4-core + 16G worker nodes.

Revision history for this message
Jose Perez Carranza (jgperezc) wrote :

Is the configuration below the one you used?

Controller 6 Core + 30 GB
Worker 4 Core + 16 GB

If that is the case the wiki should be updated with the new minimum values for the nodes.

According to the wiki [1] below are the minimum values for the nodes.
 * Memory size:
        * Controller nodes: 16384 MB
        * Compute nodes: 10240 MB

[1] - https://wiki.openstack.org/wiki/StarlingX/Containers/InstallationOnStandard

Revision history for this message
Bart Wensley (bartwensley) wrote :

From the notes above, it seems like the original failure was reported when the controllers were configured with only 10G of memory. The recommendations in the wiki are for 16G controllers and 10G computes (https://wiki.openstack.org/wiki/StarlingX/Containers/InstallationOnStandard). I have been running this config without issue in recent loads.

Revision history for this message
Mingyuan Qi (myqi) wrote :

I used the same configuration as the wiki requires and still got a successful apply.

Controller:
4 CPUs
16GB Mem

Compute:
3 CPUs
10GB Mem

[wrsroot@controller-0 ~(keystone_admin)]$ system application-list
+---------------+-----------------------------+-----------------+---------------+---------+-----------+
| application   | version                     | manifest name   | manifest file | status  | progress  |
+---------------+-----------------------------+-----------------+---------------+---------+-----------+
| stx-openstack | 1.0-11-centos-stable-latest | armada-manifest | manifest.yaml | applied | completed |
+---------------+-----------------------------+-----------------+---------------+---------+-----------+
[wrsroot@controller-0 ~(keystone_admin)]$ cat /etc/build.info
###
### StarlingX
### Built from master
###

OS="centos"
SW_VERSION="19.01"
BUILD_TARGET="Host Installer"
BUILD_TYPE="Formal"
BUILD_ID="20190506T233000Z"

JOB="STX_build_master_master"
<email address hidden>"
BUILD_NUMBER="93"
BUILD_HOST="starlingx_mirror"
BUILD_DATE="2019-05-06 23:30:00 +0000"

Revision history for this message
Cindy Xie (xxie1) wrote :

@Jose, can you please test it on the HW with fewer CPUs? Huge pages are allocated for tenant VMs according to the number of cores, so if your system has more CPUs than recommended but less RAM, the host may not have enough memory left.
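A rough illustration of this effect (the per-core figure below is an assumption for illustration, not the actual StarlingX allocator value):

```shell
#!/bin/bash
# Illustration only: per-core hugepage figure is an assumed number,
# not the real StarlingX allocation policy.
hugepage_reserved_mib() {
    local cores="$1" per_core_mib="$2"
    echo $((cores * per_core_mib))
}

total_mib=16384        # 16 GB worker node
per_core_mib=1024      # assumed: 1 GiB of hugepages reserved per core
for cores in 4 8; do
    reserved=$(hugepage_reserved_mib "${cores}" "${per_core_mib}")
    echo "${cores} cores: ${reserved} MiB hugepages, $((total_mib - reserved)) MiB left for platform"
done
# Actual hugepage state on a node: grep -i huge /proc/meminfo
```

With the same RAM, doubling the cores doubles the assumed hugepage reservation and halves what remains for platform pods, which matches the behavior Jose observed.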

Revision history for this message
Jose Perez Carranza (jgperezc) wrote :

@Cindy thanks for the hint, it actually worked for us!! By decreasing the number of cores we were able to have a deployment ready with the same RAM.

Mingyuan Qi (myqi)
Changed in starlingx:
status: Triaged → Invalid