Alarm 750.003 - Application remove failure for stx-openstack

Bug #1918420 reported by Nicolae Jascanu
Affects: StarlingX
Status: In Progress
Importance: Low
Assigned to: chen haochuan
Milestone:

Bug Description

Brief Description
-----------------
This alarm is raised by the sanity test testcases/sanity/sanity_openstack/test_host_management.py::test_reapply_openstack.
It seems that after the swact/lock/unlock tests, stx-openstack fails to apply, so the reapply test fails.

Even though the alarm is present, the OpenStack-specific commands are working and the tests are PASSING.

Might be related to bug: https://bugs.launchpad.net/starlingx/+bug/1917308

This is the order of sanity tests on standard_ext configuration:
18:59:50 [2021-03-09T16:59:50.200Z] PASS 20210309 14:14:44 testcases/sanity/sanity_openstack/test_host_management.py::test_host_status
18:59:50 [2021-03-09T16:59:50.200Z] PASS 20210309 14:15:21 testcases/sanity/sanity_openstack/test_host_management.py::test_swact_controller_host
18:59:50 [2021-03-09T16:59:50.200Z] PASS 20210309 14:18:07 testcases/sanity/sanity_openstack/test_host_management.py::test_lock_unlock_active_controller
18:59:50 [2021-03-09T16:59:50.200Z] PASS 20210309 14:18:17 testcases/sanity/sanity_openstack/test_host_management.py::test_lock_unlock_standby_controller
18:59:50 [2021-03-09T16:59:50.200Z] PASS 20210309 14:25:07 testcases/sanity/sanity_openstack/test_host_management.py::test_lock_unlock_compute_hosts
18:59:50 [2021-03-09T16:59:50.201Z] PASS 20210309 14:39:23 testcases/sanity/sanity_openstack/test_host_management.py::test_lock_unlock_storage_hosts
18:59:50 [2021-03-09T16:59:50.201Z] PASS 20210309 14:46:33 testcases/sanity/sanity_openstack/test_host_management.py::test_reboot_standby_controller
18:59:50 [2021-03-09T16:59:50.201Z] PASS 20210309 14:54:39 testcases/sanity/sanity_openstack/test_host_management.py::test_reboot_active_controller
18:59:50 [2021-03-09T16:59:50.201Z] FAIL 20210309 15:24:15 testcases/sanity/sanity_openstack/test_host_management.py::test_reapply_openstack
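
To confirm the failure manually after the reapply test fails, an illustrative check from the active controller (alarm ID and application name as reported above) could be:

[sysadmin@controller-0 ~(keystone_admin)]$ fm alarm-list | grep 750.003
[sysadmin@controller-0 ~(keystone_admin)]$ system application-list | grep stx-openstack
[sysadmin@controller-0 ~(keystone_admin)]$ system application-show stx-openstack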

Severity
--------
MAJOR

Steps to Reproduce
------------------
Run sanity
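
As a sketch, assuming the sanity suite is pytest-based (the test IDs above use pytest node-ID syntax), the failing step alone can be invoked with:

controller-0:~$ pytest testcases/sanity/sanity_openstack/test_host_management.py::test_reapply_openstack

Note that in the runs above the failure only shows up after the earlier swact/lock/unlock tests, so running the reapply test in isolation may not reproduce it.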

Expected Behavior
------------------
There should be NO Alarm 750.003 - Application remove failure for stx-openstack

Actual Behavior
----------------
Alarm is present

Reproducibility
---------------
100% on this build

System Configuration
--------------------
All baremetal and virtual configurations
OS="centos"
SW_VERSION="20.12"
BUILD_TARGET="Host Installer"
BUILD_TYPE="Formal"
BUILD_ID="20210309T000306Z"

JOB="STX_build_layer_flock_master_master"
<email address hidden>"
BUILD_NUMBER="419"
BUILD_HOST="starlingx_mirror"
BUILD_DATE="2021-03-09 00:03:06 +0000"

FLOCK_OS="centos"
FLOCK_JOB="STX_build_layer_flock_master_master"
<email address hidden>"
FLOCK_BUILD_NUMBER="419"
FLOCK_BUILD_HOST="starlingx_mirror"
FLOCK_BUILD_DATE="2021-03-09 00:03:06 +0000"

DISTRO_OS="centos"
DISTRO_JOB="STX_build_layer_distro_master_master"
<email address hidden>"
DISTRO_BUILD_NUMBER="421"
DISTRO_BUILD_HOST="starlingx_mirror"
DISTRO_BUILD_DATE="2021-03-05 02:31:48 +0000"

COMPILER_OS="centos"
COMPILER_JOB="STX_build_layer_compiler_master_master"
<email address hidden>"
COMPILER_BUILD_NUMBER="478"
COMPILER_BUILD_HOST="starlingx_mirror"
COMPILER_BUILD_DATE="2021-03-04 02:30:01 +0000"

Last Pass
---------
We haven't seen this alarm on the build from: http://mirror.starlingx.cengn.ca/mirror/starlingx/master/centos/flock/20210306T023358Z/outputs/iso/

Test Activity
-------------
sanity

Revision history for this message
Nicolae Jascanu (njascanu-intel) wrote :

Please find the collected logs from baremetal Standard_Ext at:
https://drive.google.com/file/d/1T05Dmtq2ZyhoMP6HmCHQH1m3YowURSau/view?usp=sharing

Ghada Khalil (gkhalil)
tags: added: stx.containers
Revision history for this message
chen haochuan (martin1982) wrote :

controller-0_20210309.173751/var/log/armada/stx-openstack-delete_2021-03-09-15-24-22.log
2021-03-09 15:24:25.159 68 INFO armada.handlers.lock [-] Releasing lock^[[00m
2021-03-09 15:24:25.165 68 ERROR armada.cli [-] Caught unexpected exception: grpc._channel._Rendezvous: <_Rendezvous of RPC that terminated with:
        status = StatusCode.UNKNOWN
        details = "write tcp 172.16.192.117:39446->10.10.59.10:5432: write: broken pipe"
        debug_error_string = "{"created":"@1615303464.163584714","description":"Error received from peer","file":"src/core/lib/surface/call.cc","file_line":1017,"grpc_message":"write tcp 172.16.192.117:39446->10.10.59.10:5432: write: broken pipe","grpc_status":2}"
>

var/log/container/armada-api-696b77b494-xnfg2_armada_tiller-b007e27061e37a45e4d60cd8281ef8dc4fe8cea2084496a3baa196664ac9362f.log
2021-03-09T15:24:24.162677518Z stderr F [storage] 2021/03/09 15:24:24 listing all releases with filter
2021-03-09T15:24:24.162693699Z stderr F [storage/driver] 2021/03/09 15:24:24 list: failed to list: write tcp 172.16.192.117:39446->10.10.59.10:5432: write: broken pipe
2021-03-09T17:39:02.84176903Z stderr F [storage] 2021/03/09 17:39:02 listing all releases with filter

The Tiller container in the armada-api pod fails to get data from the secret sh.helm.release.v1.armada.v1 in the armada namespace.
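
For anyone triaging this from the cluster side, some illustrative kubectl checks (the pod name, label selector and secret name are taken from the logs above; adjust the pod name to the current deployment):

controller-0:~$ kubectl -n armada get pods -l application=armada
controller-0:~$ kubectl -n armada get secret sh.helm.release.v1.armada.v1
controller-0:~$ kubectl -n armada logs armada-api-696b77b494-xnfg2 -c tiller --tail=50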

Austin Sun (sunausti)
Changed in starlingx:
importance: Undecided → Medium
assignee: nobody → chen haochuan (martin1982)
Revision history for this message
Alexandru Dimofte (adimofte) wrote :

I checked /var/log/fm-event.log and see these problems:

2021-04-09T13:48:23.000 controller-1 fmManager: info { "event_log_id" : "750.006", "reason_text" : "A configuration change requires a reapply of the stx-openstack application.", "entity_instance_id" : "region=RegionOne.system=1606aba2-4182-485d-b46d-d50a67bdab33.k8s_application=stx-openstack", "severity" : "warning", "state" : "set", "timestamp" : "2021-04-09 13:48:23.531584" }
2021-04-09T13:48:25.000 controller-1 fmManager: info { "event_log_id" : "400.005", "reason_text" : "Communication failure detected with peer over port eno2 on host controller-1 within the last 30 seconds", "entity_instance_id" : "region=RegionOne.system=1606aba2-4182-485d-b46d-d50a67bdab33.host=controller-1.network=cluster-host", "severity" : "major", "state" : "clear", "timestamp" : "2021-04-09 13:48:25.799332" }
2021-04-09T13:48:25.000 controller-1 fmManager: info { "event_log_id" : "401.005", "reason_text" : "Communication established with peer over port eno2 on host controller-1", "entity_instance_id" : "host=controller-1.network=cluster-host", "severity" : "critical", "state" : "msg", "timestamp" : "2021-04-09 13:48:25.637745" }
2021-04-09T13:48:25.000 controller-1 fmManager: info { "event_log_id" : "401.005", "reason_text" : "Communication established with peer over port eno1 on host controller-1", "entity_instance_id" : "host=controller-1.network=oam", "severity" : "critical", "state" : "msg", "timestamp" : "2021-04-09 13:48:25.762306" }
2021-04-09T13:48:26.000 controller-1 fmManager: info { "event_log_id" : "400.005", "reason_text" : "Communication failure detected with peer over port eno1 on host controller-1 within the last 30 seconds", "entity_instance_id" : "region=RegionOne.system=1606aba2-4182-485d-b46d-d50a67bdab33.host=controller-1.network=oam", "severity" : "major", "state" : "clear", "timestamp" : "2021-04-09 13:48:26.001493" }
2021-04-09T13:48:26.000 controller-1 fmManager: info { "event_log_id" : "401.005", "reason_text" : "Communication established with peer over port eno2 on host controller-1", "entity_instance_id" : "host=controller-1.network=mgmt", "severity" : "critical", "state" : "msg", "timestamp" : "2021-04-09 13:48:25.845300" }
2021-04-09T13:48:26.000 controller-1 fmManager: info { "event_log_id" : "400.005", "reason_text" : "Communication failure detected with peer over port eno2 on host controller-1 within the last 30 seconds", "entity_instance_id" : "region=RegionOne.system=1606aba2-4182-485d-b46d-d50a67bdab33.host=controller-1.network=mgmt", "severity" : "major", "state" : "clear", "timestamp" : "2021-04-09 13:48:26.203367" }
2021-04-09T13:48:29.000 controller-1 fmManager: info { "event_log_id" : "401.001", "reason_text" : "Service group web-services state change from go-active to active on host controller-0", "entity_instance_id" : "service_domain=controller.service_group=web-services.host=controller-0", "severity" : "critical", "state" : "msg", "timestamp" : "2021-04-09 13:48:29.163167" }
2021-04-09T13:48:32.000 controller-1 fmManager: info { "event_log_id" : "401.002", "reason_text" : "Service group web-services redundancy restored", "entity_instance_id" : "service_domain=contr...


Revision history for this message
Ghada Khalil (gkhalil) wrote :

Fix for https://bugs.launchpad.net/starlingx/+bug/1917308 merged in r/stx.5.0 on 2021-04-27

Changed in starlingx:
status: New → Fix Released
Revision history for this message
Ghada Khalil (gkhalil) wrote :

Fix for duplicate LP: https://bugs.launchpad.net/starlingx/+bug/1917308 merged in r/stx.5.0 on 2021-04-27

Revision history for this message
Alexandru Dimofte (adimofte) wrote :

Bug #1917308 was fixed yesterday on RC 5.0, but the remove-failed issue is still visible.

Bill Zvonar (billzvonar)
Changed in starlingx:
status: Fix Released → Confirmed
tags: added: stx.5.0
Revision history for this message
Austin Sun (sunausti) wrote :

Hi Alex, please upload the new logs. @Martin, could you triage this first once the new logs are available?

Revision history for this message
Alexandru Dimofte (adimofte) wrote :

I see the stx-openstack application in remove-failed status. I also see cert-manager (certmanager-manifest.yaml) in apply-failed status:
 [sysadmin@controller-1 ~(keystone_admin)]$ system application-list
+--------------------------+--------------------------------+-----------------------------------+----------------------------------------+---------------+------------------------------------------+
| application              | version                        | manifest name                     | manifest file                          | status        | progress                                 |
+--------------------------+--------------------------------+-----------------------------------+----------------------------------------+---------------+------------------------------------------+
| cert-manager             | 1.0-15                         | cert-manager-manifest             | certmanager-manifest.yaml              | apply-failed  | operation aborted, check logs for detail |
| nginx-ingress-controller | 1.1-14                         | nginx-ingress-controller-manifest | nginx_ingress_controller_manifest.yaml | applied       | completed                                |
| oidc-auth-apps           | 1.0-43                         | oidc-auth-manifest                | manifest.yaml                          | uploaded      | completed                                |
| platform-integ-apps      | 1.0-28                         | platform-integration-manifest     | manifest.yaml                          | applied       | completed                                |
| rook-ceph-apps           | 1.0-4                          | rook-ceph-manifest                | manifest.yaml                          | uploaded      | completed                                |
| stx-openstack            | 1.0-82-centos-stable-versioned | armada-manifest                   | stx-openstack.yaml                     | remove-failed | operation aborted, check logs for detail |
+--------------------------+--------------------------------+-----------------------------------+----------------------------------------+---------------+------------------------------------------+
[sysadmin@controller-1 ~(keystone_admin)]$

Looking at the alarms, I see:
| 750.003 | Application Remove Failure | k8s_application=stx-openstack | major

Revision history for this message
chen haochuan (martin1982) wrote :

sysinv 2021-04-28 15:35:37.938 6323 WARNING sysinv.openstack.common.loopingcall [-] task run outlasted interval by 255.745897 sec
sysinv 2021-04-28 15:35:38.154 1072608 WARNING urllib3.connectionpool [-] Retrying (Retry(total=2, connect=None, read=None, redirect=None, status=None)) after connection broken by 'NewConnectionError('<urllib3.connection.VerifiedHTTPSConnection object at 0x7fb255356790>: Failed to establish a new connection: [Errno 111] ECONNREFUSED',)': /api/v1/namespaces/armada/pods?fieldSelector=&labelSelector=application%3Darmada: NewConnectionError: <urllib3.connection.VerifiedHTTPSConnection object at 0x7fb255356790>: Failed to establish a new connection: [Errno 111] ECONNREFUSED
sysinv 2021-04-28 15:35:38.155 1072608 WARNING urllib3.connectionpool [-] Retrying (Retry(total=1, connect=None, read=None, redirect=None, status=None)) after connection broken by 'NewConnectionError('<urllib3.connection.VerifiedHTTPSConnection object at 0x7fb2559c7d50>: Failed to establish a new connection: [Errno 111] ECONNREFUSED',)': /api/v1/namespaces/armada/pods?fieldSelector=&labelSelector=application%3Darmada: NewConnectionError: <urllib3.connection.VerifiedHTTPSConnection object at 0x7fb2559c7d50>: Failed to establish a new connection: [Errno 111] ECONNREFUSED
sysinv 2021-04-28 15:35:38.156 1072608 WARNING urllib3.connectionpool [-] Retrying (Retry(total=0, connect=None, read=None, redirect=None, status=None)) after connection broken by 'NewConnectionError('<urllib3.connection.VerifiedHTTPSConnection object at 0x7fb2559c7990>: Failed to establish a new connection: [Errno 111] ECONNREFUSED',)': /api/v1/namespaces/armada/pods?fieldSelector=&labelSelector=application%3Darmada: NewConnectionError: <urllib3.connection.VerifiedHTTPSConnection object at 0x7fb2559c7990>: Failed to establish a new connection: [Errno 111] ECONNREFUSED
sysinv 2021-04-28 15:35:38.157 1072608 INFO sysinv.conductor.kube_app [-] Could not get Armada service : HTTPSConnectionPool(host='192.168.206.1', port=6443): Max retries exceeded with url: /api/v1/namespaces/armada/pods?fieldSelector=&labelSelector=application%3Darmada (Caused by NewConnectionError('<urllib3.connection.VerifiedHTTPSConnection object at 0x7fb2559c70d0>: Failed to establish a new connection: [Errno 111] ECONNREFUSED',))

sysinv 2021-04-28 15:35:58.174 1072608 ERROR sysinv.conductor.kube_app [-] Armada service failed to start/restart: SysinvException: Failed to generate bootstrap token
sysinv 2021-04-28 15:35:58.175 1072608 INFO sysinv.conductor.kube_app [-] Exiting progress monitoring thread for app stx-openstack
sysinv 2021-04-28 15:35:58.175 1072608 INFO sysinv.conductor.kube_app [-] lifecycle hook for application stx-openstack (1.0-82-centos-stable-versioned) started {'lifecycle_type': u'armada-request', 'relative_timing': u'post', 'operation': u'delete', 'extra': {'rc': False}}.
sysinv 2021-04-28 15:35:58.409 1072608 ERROR sysinv.conductor.kube_app [-] Application remove aborted!.: SysinvException: Failed to generate bootstrap token
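
The ECONNREFUSED retries above suggest kube-apiserver (192.168.206.1:6443) was unreachable while sysinv was polling the armada pods and trying to generate the bootstrap token. An illustrative way to check connectivity at that point (the /healthz probe only verifies reachability; it may still return an authorization error):

controller-0:~$ curl -k https://192.168.206.1:6443/healthz
controller-0:~$ kubectl -n armada get pods -l application=armada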

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to ha (master)

Fix proposed to branch: master
Review: https://review.opendev.org/c/starlingx/ha/+/788897

Changed in starlingx:
status: Confirmed → In Progress
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Change abandoned on ha (master)

Change abandoned by "chen haochuan <email address hidden>" on branch: master
Review: https://review.opendev.org/c/starlingx/ha/+/788897

Revision history for this message
Ghada Khalil (gkhalil) wrote :

This is still seen in the most recent stx.5.0 sanity:
http://lists.starlingx.io/pipermail/starlingx-discuss/2021-May/011460.html

Revision history for this message
Austin Sun (sunausti) wrote :
Revision history for this message
Alexandru Dimofte (adimofte) wrote :

I reproduced this issue by running this test:
FAIL 20210518 12:30:47 testcases/sanity/sanity_openstack/test_host_management.py::test_reapply_openstack

Debug log:
controller-0:~$ cat /var/log/armada/stx-openstack-delete_2021-05-18-12-30-07.log
2021-05-18 12:30:09.235 177 DEBUG armada.handlers.tiller [-] Using Tiller host IP: 127.0.0.1 _get_tiller_ip /usr/local/lib/python3.6/dist-packages/armada/handlers/tiller.py:165
2021-05-18 12:30:09.235 177 DEBUG armada.handlers.tiller [-] Using Tiller host port: 24134 _get_tiller_port /usr/local/lib/python3.6/dist-packages/armada/handlers/tiller.py:174
2021-05-18 12:30:09.235 177 DEBUG armada.handlers.tiller [-] Tiller getting gRPC insecure channel at 127.0.0.1:24134 with options: [grpc.max_send_message_length=429496729, grpc.max_receive_message_length=429496729] get_channel /usr/local/lib/python3.6/dist-packages/armada/handlers/tiller.py:127
2021-05-18 12:30:09.238 177 DEBUG armada.handlers.tiller [-] Armada is using Tiller at: 127.0.0.1:24134, namespace=kube-system, timeout=300 __init__ /usr/local/lib/python3.6/dist-packages/armada/handlers/tiller.py:107
2021-05-18 12:30:09.238 177 INFO armada.handlers.lock [-] Acquiring lock
2021-05-18 12:30:09.250 177 DEBUG armada.handlers.tiller [-] Getting known releases from Tiller... list_charts /usr/local/lib/python3.6/dist-packages/armada/handlers/tiller.py:359
2021-05-18 12:30:09.250 177 DEBUG armada.handlers.tiller [-] Tiller ListReleases() with timeout=300, request=limit: 32
status_codes: UNKNOWN
status_codes: DEPLOYED
status_codes: DELETED
status_codes: DELETING
status_codes: FAILED
status_codes: PENDING_INSTALL
status_codes: PENDING_UPGRADE
status_codes: PENDING_ROLLBACK
 get_results /usr/local/lib/python3.6/dist-packages/armada/handlers/tiller.py:215
2021-05-18 12:30:10.251 177 INFO armada.handlers.lock [-] Releasing lock
2021-05-18 12:30:10.257 177 ERROR armada.cli [-] Caught unexpected exception: grpc._channel._Rendezvous: <_Rendezvous of RPC that terminated with:
        status = StatusCode.UNKNOWN
        details = "write tcp 172.16.166.171:34140->10.10.56.10:5432: write: broken pipe"
        debug_error_string = "{"created":"@1621341009.251408267","description":"Error received from peer","file":"src/core/lib/surface/call.cc","file_line":1017,"grpc_message":"write tcp 172.16.166.171:34140->10.10.56.10:5432: write: broken pipe","grpc_status":2}"
>
2021-05-18 12:30:10.257 177 ERROR armada.cli Traceback (most recent call last):
2021-05-18 12:30:10.257 177 ERROR armada.cli File "/usr/local/lib/python3.6/dist-packages/armada/cli/__init__.py", line 38, in safe_invoke
2021-05-18 12:30:10.257 177 ERROR armada.cli self.invoke()
2021-05-18 12:30:10.257 177 ERROR armada.cli File "/usr/local/lib/python3.6/dist-packages/armada/cli/delete.py", line 98, in invoke
2021-05-18 12:30:10.257 177 ERROR armada.cli self.handle(tiller)
2021-05-18 12:30:10.257 177 ERROR armada.cli File "/usr/local/lib/python3.6/dist-packages/armada/handlers/lock.py", line 81, in func_wrapper
2021-05-18 12:30:10.257 177 ERROR armada.cli return future.result()
2021-05-18 12:30:10.257 177 ERROR armada.cli File "/usr/lib/python3.6/concurrent/futures/_base.py", line 425, in result...


Revision history for this message
Ghada Khalil (gkhalil) wrote :

Based on discussion in the stx release meeting (2021-05-19), the issue is related to the controller being in an offline state when the application removal is triggered. As such, this is lower priority than initially expected and will not hold up the stx.5.0 release.
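
Given that analysis, a hedged workaround sketch is to verify controller availability and the alarm state before retrying the remove; these are standard StarlingX commands, not a fix for the underlying race:

[sysadmin@controller-0 ~(keystone_admin)]$ system host-list
[sysadmin@controller-0 ~(keystone_admin)]$ fm alarm-list | grep 750.003
# retry only once both controllers are unlocked/enabled/available:
[sysadmin@controller-0 ~(keystone_admin)]$ system application-remove stx-openstack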

tags: removed: stx.5.0
Changed in starlingx:
importance: Medium → Low