Alarm 750.003 - Application remove failure for stx-openstack

Bug #1918420 reported by Nicolae Jascanu
Affects: StarlingX
Status: In Progress
Importance: Low
Assigned to: chen haochuan
Milestone:

Bug Description

Brief Description
-----------------
This alarm is raised by the sanity test testcases/sanity/sanity_openstack/test_host_management.py::test_reapply_openstack.
It seems that after the swact/lock/unlock tests, stx-openstack fails to apply, so the reapply test fails.

Even though the alarm is present, the OpenStack-specific commands are working and the tests are PASSING.

Might be related to bug: https://bugs.launchpad.net/starlingx/+bug/1917308

This is the order of sanity tests on standard_ext configuration:
18:59:50 [2021-03-09T16:59:50.200Z] PASS 20210309 14:14:44 testcases/sanity/sanity_openstack/test_host_management.py::test_host_status
18:59:50 [2021-03-09T16:59:50.200Z] PASS 20210309 14:15:21 testcases/sanity/sanity_openstack/test_host_management.py::test_swact_controller_host
18:59:50 [2021-03-09T16:59:50.200Z] PASS 20210309 14:18:07 testcases/sanity/sanity_openstack/test_host_management.py::test_lock_unlock_active_controller
18:59:50 [2021-03-09T16:59:50.200Z] PASS 20210309 14:18:17 testcases/sanity/sanity_openstack/test_host_management.py::test_lock_unlock_standby_controller
18:59:50 [2021-03-09T16:59:50.200Z] PASS 20210309 14:25:07 testcases/sanity/sanity_openstack/test_host_management.py::test_lock_unlock_compute_hosts
18:59:50 [2021-03-09T16:59:50.201Z] PASS 20210309 14:39:23 testcases/sanity/sanity_openstack/test_host_management.py::test_lock_unlock_storage_hosts
18:59:50 [2021-03-09T16:59:50.201Z] PASS 20210309 14:46:33 testcases/sanity/sanity_openstack/test_host_management.py::test_reboot_standby_controller
18:59:50 [2021-03-09T16:59:50.201Z] PASS 20210309 14:54:39 testcases/sanity/sanity_openstack/test_host_management.py::test_reboot_active_controller
18:59:50 [2021-03-09T16:59:50.201Z] FAIL 20210309 15:24:15 testcases/sanity/sanity_openstack/test_host_management.py::test_reapply_openstack
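
To confirm the failure manually after the reapply test fails, an illustrative check from the active controller (alarm ID and application name as reported above) could be:

[sysadmin@controller-0 ~(keystone_admin)]$ fm alarm-list | grep 750.003
[sysadmin@controller-0 ~(keystone_admin)]$ system application-list | grep stx-openstack
[sysadmin@controller-0 ~(keystone_admin)]$ system application-show stx-openstack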

Severity
--------
MAJOR

Steps to Reproduce
------------------
Run sanity
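
As a sketch, assuming the sanity suite is pytest-based (the test IDs above use pytest node-ID syntax), the failing step alone can be invoked with:

controller-0:~$ pytest testcases/sanity/sanity_openstack/test_host_management.py::test_reapply_openstack

Note that in the runs above the failure only shows up after the earlier swact/lock/unlock tests, so running the reapply test in isolation may not reproduce it.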

Expected Behavior
------------------
There should be NO Alarm 750.003 - Application remove failure for stx-openstack

Actual Behavior
----------------
Alarm is present

Reproducibility
---------------
100% on this build

System Configuration
--------------------
All baremetal and virtual configurations
OS="centos"
SW_VERSION="20.12"
BUILD_TARGET="Host Installer"
BUILD_TYPE="Formal"
BUILD_ID="20210309T000306Z"

JOB="STX_build_layer_flock_master_master"
<email address hidden>"
BUILD_NUMBER="419"
BUILD_HOST="starlingx_mirror"
BUILD_DATE="2021-03-09 00:03:06 +0000"

FLOCK_OS="centos"
FLOCK_JOB="STX_build_layer_flock_master_master"
<email address hidden>"
FLOCK_BUILD_NUMBER="419"
FLOCK_BUILD_HOST="starlingx_mirror"
FLOCK_BUILD_DATE="2021-03-09 00:03:06 +0000"

DISTRO_OS="centos"
DISTRO_JOB="STX_build_layer_distro_master_master"
<email address hidden>"
DISTRO_BUILD_NUMBER="421"
DISTRO_BUILD_HOST="starlingx_mirror"
DISTRO_BUILD_DATE="2021-03-05 02:31:48 +0000"

COMPILER_OS="centos"
COMPILER_JOB="STX_build_layer_compiler_master_master"
<email address hidden>"
COMPILER_BUILD_NUMBER="478"
COMPILER_BUILD_HOST="starlingx_mirror"
COMPILER_BUILD_DATE="2021-03-04 02:30:01 +0000"

Last Pass
---------
We haven't seen this alarm on the build from: http://mirror.starlingx.cengn.ca/mirror/starlingx/master/centos/flock/20210306T023358Z/outputs/iso/

Test Activity
-------------
sanity

Revision history for this message
Nicolae Jascanu (njascanu-intel) wrote :

Please find the collected logs from baremetal Standard_Ext at:
https://drive.google.com/file/d/1T05Dmtq2ZyhoMP6HmCHQH1m3YowURSau/view?usp=sharing

Ghada Khalil (gkhalil)
tags: added: stx.containers
Revision history for this message
chen haochuan (martin1982) wrote :

controller-0_20210309.173751/var/log/armada/stx-openstack-delete_2021-03-09-15-24-22.log
2021-03-09 15:24:25.159 68 INFO armada.handlers.lock [-] Releasing lock^[[00m
2021-03-09 15:24:25.165 68 ERROR armada.cli [-] Caught unexpected exception: grpc._channel._Rendezvous: <_Rendezvous of RPC that terminated with:
        status = StatusCode.UNKNOWN
        details = "write tcp 172.16.192.117:39446->10.10.59.10:5432: write: broken pipe"
        debug_error_string = "{"created":"@1615303464.163584714","description":"Error received from peer","file":"src/core/lib/surface/call.cc","file_line":1017,"grpc_message":"write tcp 172.16.192.117:39446->10.10.59.10:5432: write: broken pipe","grpc_status":2}"
>

var/log/container/armada-api-696b77b494-xnfg2_armada_tiller-b007e27061e37a45e4d60cd8281ef8dc4fe8cea2084496a3baa196664ac9362f.log
2021-03-09T15:24:24.162677518Z stderr F [storage] 2021/03/09 15:24:24 listing all releases with filter
2021-03-09T15:24:24.162693699Z stderr F [storage/driver] 2021/03/09 15:24:24 list: failed to list: write tcp 172.16.192.117:39446->10.10.59.10:5432: write: broken pipe
2021-03-09T17:39:02.84176903Z stderr F [storage] 2021/03/09 17:39:02 listing all releases with filter

The Tiller container in the armada-api pod fails to get data from the secret sh.helm.release.v1.armada.v1 in the armada namespace.
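
For anyone triaging this from the cluster side, some illustrative kubectl checks (the pod name, label selector and secret name are taken from the logs above; adjust the pod name to the current deployment):

controller-0:~$ kubectl -n armada get pods -l application=armada
controller-0:~$ kubectl -n armada get secret sh.helm.release.v1.armada.v1
controller-0:~$ kubectl -n armada logs armada-api-696b77b494-xnfg2 -c tiller --tail=50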

Austin Sun (sunausti)
Changed in starlingx:
importance: Undecided → Medium
assignee: nobody → chen haochuan (martin1982)
Revision history for this message
Alexandru Dimofte (adimofte) wrote :

I checked /var/log/fm-event.log and see these problems:

2021-04-09T13:48:23.000 controller-1 fmManager: info { "event_log_id" : "750.006", "reason_text" : "A configuration change requires a reapply of the stx-openstack application.", "entity_instance_id" : "region=RegionOne.system=1606aba2-4182-485d-b46d-d50a67bdab33.k8s_application=stx-openstack", "severity" : "warning", "state" : "set", "timestamp" : "2021-04-09 13:48:23.531584" }
2021-04-09T13:48:25.000 controller-1 fmManager: info { "event_log_id" : "400.005", "reason_text" : "Communication failure detected with peer over port eno2 on host controller-1 within the last 30 seconds", "entity_instance_id" : "region=RegionOne.system=1606aba2-4182-485d-b46d-d50a67bdab33.host=controller-1.network=cluster-host", "severity" : "major", "state" : "clear", "timestamp" : "2021-04-09 13:48:25.799332" }
2021-04-09T13:48:25.000 controller-1 fmManager: info { "event_log_id" : "401.005", "reason_text" : "Communication established with peer over port eno2 on host controller-1", "entity_instance_id" : "host=controller-1.network=cluster-host", "severity" : "critical", "state" : "msg", "timestamp" : "2021-04-09 13:48:25.637745" }
2021-04-09T13:48:25.000 controller-1 fmManager: info { "event_log_id" : "401.005", "reason_text" : "Communication established with peer over port eno1 on host controller-1", "entity_instance_id" : "host=controller-1.network=oam", "severity" : "critical", "state" : "msg", "timestamp" : "2021-04-09 13:48:25.762306" }
2021-04-09T13:48:26.000 controller-1 fmManager: info { "event_log_id" : "400.005", "reason_text" : "Communication failure detected with peer over port eno1 on host controller-1 within the last 30 seconds", "entity_instance_id" : "region=RegionOne.system=1606aba2-4182-485d-b46d-d50a67bdab33.host=controller-1.network=oam", "severity" : "major", "state" : "clear", "timestamp" : "2021-04-09 13:48:26.001493" }
2021-04-09T13:48:26.000 controller-1 fmManager: info { "event_log_id" : "401.005", "reason_text" : "Communication established with peer over port eno2 on host controller-1", "entity_instance_id" : "host=controller-1.network=mgmt", "severity" : "critical", "state" : "msg", "timestamp" : "2021-04-09 13:48:25.845300" }
2021-04-09T13:48:26.000 controller-1 fmManager: info { "event_log_id" : "400.005", "reason_text" : "Communication failure detected with peer over port eno2 on host controller-1 within the last 30 seconds", "entity_instance_id" : "region=RegionOne.system=1606aba2-4182-485d-b46d-d50a67bdab33.host=controller-1.network=mgmt", "severity" : "major", "state" : "clear", "timestamp" : "2021-04-09 13:48:26.203367" }
2021-04-09T13:48:29.000 controller-1 fmManager: info { "event_log_id" : "401.001", "reason_text" : "Service group web-services state change from go-active to active on host controller-0", "entity_instance_id" : "service_domain=controller.service_group=web-services.host=controller-0", "severity" : "critical", "state" : "msg", "timestamp" : "2021-04-09 13:48:29.163167" }
2021-04-09T13:48:32.000 controller-1 fmManager: info { "event_log_id" : "401.002", "reason_text" : "Service group web-services redundancy restored", "entity_instance_id" : "service_domain=contr...


Revision history for this message
Ghada Khalil (gkhalil) wrote :

Fix for https://bugs.launchpad.net/starlingx/+bug/1917308 merged in r/stx.5.0 on 2021-04-27

Changed in starlingx:
status: New → Fix Released
Revision history for this message
Ghada Khalil (gkhalil) wrote :

Fix for duplicate LP: https://bugs.launchpad.net/starlingx/+bug/1917308 merged in r/stx.5.0 on 2021-04-27

Revision history for this message
Alexandru Dimofte (adimofte) wrote :

Bug #1917308 was fixed yesterday on RC 5.0, but the remove-failed issue is still visible.

Bill Zvonar (billzvonar)
Changed in starlingx:
status: Fix Released → Confirmed
tags: added: stx.5.0
Revision history for this message
Austin Sun (sunausti) wrote :

Hi Alex, please upload the new logs. @Martin, could you triage this first once the new logs are available?

Revision history for this message
Alexandru Dimofte (adimofte) wrote :

I see the stx-openstack application in remove-failed status. I also see cert-manager (certmanager-manifest.yaml) in apply-failed status:
 [sysadmin@controller-1 ~(keystone_admin)]$ system application-list
+--------------------------+--------------------------------+-----------------------------------+----------------------------------------+---------------+------------------------------------------+
| application              | version                        | manifest name                     | manifest file                          | status        | progress                                 |
+--------------------------+--------------------------------+-----------------------------------+----------------------------------------+---------------+------------------------------------------+
| cert-manager             | 1.0-15                         | cert-manager-manifest             | certmanager-manifest.yaml              | apply-failed  | operation aborted, check logs for detail |
| nginx-ingress-controller | 1.1-14                         | nginx-ingress-controller-manifest | nginx_ingress_controller_manifest.yaml | applied       | completed                                |
| oidc-auth-apps           | 1.0-43                         | oidc-auth-manifest                | manifest.yaml                          | uploaded      | completed                                |
| platform-integ-apps      | 1.0-28                         | platform-integration-manifest     | manifest.yaml                          | applied       | completed                                |
| rook-ceph-apps           | 1.0-4                          | rook-ceph-manifest                | manifest.yaml                          | uploaded      | completed                                |
| stx-openstack            | 1.0-82-centos-stable-versioned | armada-manifest                   | stx-openstack.yaml                     | remove-failed | operation aborted, check logs for detail |
+--------------------------+--------------------------------+-----------------------------------+----------------------------------------+---------------+------------------------------------------+
[sysadmin@controller-1 ~(keystone_admin)]$

Looking at the alarms, I see:
| 750.003 | Application Remove Failure | k8s_application=stx-openstack | major

Revision history for this message
chen haochuan (martin1982) wrote :

sysinv 2021-04-28 15:35:37.938 6323 WARNING sysinv.openstack.common.loopingcall [-] task run outlasted interval by 255.745897 sec
sysinv 2021-04-28 15:35:38.154 1072608 WARNING urllib3.connectionpool [-] Retrying (Retry(total=2, connect=None, read=None, redirect=None, status=None)) after connection broken by 'NewConnectionError('<urllib3.connection.VerifiedHTTPSConnection object at 0x7fb255356790>: Failed to establish a new connection: [Errno 111] ECONNREFUSED',)': /api/v1/namespaces/armada/pods?fieldSelector=&labelSelector=application%3Darmada: NewConnectionError: <urllib3.connection.VerifiedHTTPSConnection object at 0x7fb255356790>: Failed to establish a new connection: [Errno 111] ECONNREFUSED
sysinv 2021-04-28 15:35:38.155 1072608 WARNING urllib3.connectionpool [-] Retrying (Retry(total=1, connect=None, read=None, redirect=None, status=None)) after connection broken by 'NewConnectionError('<urllib3.connection.VerifiedHTTPSConnection object at 0x7fb2559c7d50>: Failed to establish a new connection: [Errno 111] ECONNREFUSED',)': /api/v1/namespaces/armada/pods?fieldSelector=&labelSelector=application%3Darmada: NewConnectionError: <urllib3.connection.VerifiedHTTPSConnection object at 0x7fb2559c7d50>: Failed to establish a new connection: [Errno 111] ECONNREFUSED
sysinv 2021-04-28 15:35:38.156 1072608 WARNING urllib3.connectionpool [-] Retrying (Retry(total=0, connect=None, read=None, redirect=None, status=None)) after connection broken by 'NewConnectionError('<urllib3.connection.VerifiedHTTPSConnection object at 0x7fb2559c7990>: Failed to establish a new connection: [Errno 111] ECONNREFUSED',)': /api/v1/namespaces/armada/pods?fieldSelector=&labelSelector=application%3Darmada: NewConnectionError: <urllib3.connection.VerifiedHTTPSConnection object at 0x7fb2559c7990>: Failed to establish a new connection: [Errno 111] ECONNREFUSED
sysinv 2021-04-28 15:35:38.157 1072608 INFO sysinv.conductor.kube_app [-] Could not get Armada service : HTTPSConnectionPool(host='192.168.206.1', port=6443): Max retries exceeded with url: /api/v1/namespaces/armada/pods?fieldSelector=&labelSelector=application%3Darmada (Caused by NewConnectionError('<urllib3.connection.VerifiedHTTPSConnection object at 0x7fb2559c70d0>: Failed to establish a new connection: [Errno 111] ECONNREFUSED',))

sysinv 2021-04-28 15:35:58.174 1072608 ERROR sysinv.conductor.kube_app [-] Armada service failed to start/restart: SysinvException: Failed to generate bootstrap token
sysinv 2021-04-28 15:35:58.175 1072608 INFO sysinv.conductor.kube_app [-] Exiting progress monitoring thread for app stx-openstack
sysinv 2021-04-28 15:35:58.175 1072608 INFO sysinv.conductor.kube_app [-] lifecycle hook for application stx-openstack (1.0-82-centos-stable-versioned) started {'lifecycle_type': u'armada-request', 'relative_timing': u'post', 'operation': u'delete', 'extra': {'rc': False}}.
sysinv 2021-04-28 15:35:58.409 1072608 ERROR sysinv.conductor.kube_app [-] Application remove aborted!.: SysinvException: Failed to generate bootstrap token
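
The ECONNREFUSED retries above suggest kube-apiserver (192.168.206.1:6443) was unreachable while sysinv was polling the armada pods and trying to generate the bootstrap token. An illustrative way to check connectivity at that point (the /healthz probe only verifies reachability; it may still return an authorization error):

controller-0:~$ curl -k https://192.168.206.1:6443/healthz
controller-0:~$ kubectl -n armada get pods -l application=armada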

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to ha (master)

Fix proposed to branch: master
Review: https://review.opendev.org/c/starlingx/ha/+/788897

Changed in starlingx:
status: Confirmed → In Progress
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Change abandoned on ha (master)

Change abandoned by "chen haochuan <email address hidden>" on branch: master
Review: https://review.opendev.org/c/starlingx/ha/+/788897

Revision history for this message
Ghada Khalil (gkhalil) wrote :

This is still seen in the most recent stx.5.0 sanity:
http://lists.starlingx.io/pipermail/starlingx-discuss/2021-May/011460.html

Revision history for this message
Austin Sun (sunausti) wrote :
Revision history for this message
Alexandru Dimofte (adimofte) wrote :

I reproduced this issue by running this test:
FAIL 20210518 12:30:47 testcases/sanity/sanity_openstack/test_host_management.py::test_reapply_openstack

Debug log:
controller-0:~$ cat /var/log/armada/stx-openstack-delete_2021-05-18-12-30-07.log
2021-05-18 12:30:09.235 177 DEBUG armada.handlers.tiller [-] Using Tiller host IP: 127.0.0.1 _get_tiller_ip /usr/local/lib/python3.6/dist-packages/armada/handlers/tiller.py:165
2021-05-18 12:30:09.235 177 DEBUG armada.handlers.tiller [-] Using Tiller host port: 24134 _get_tiller_port /usr/local/lib/python3.6/dist-packages/armada/handlers/tiller.py:174
2021-05-18 12:30:09.235 177 DEBUG armada.handlers.tiller [-] Tiller getting gRPC insecure channel at 127.0.0.1:24134 with options: [grpc.max_send_message_length=429496729, grpc.max_receive_message_length=429496729] get_channel /usr/local/lib/python3.6/dist-packages/armada/handlers/tiller.py:127
2021-05-18 12:30:09.238 177 DEBUG armada.handlers.tiller [-] Armada is using Tiller at: 127.0.0.1:24134, namespace=kube-system, timeout=300 __init__ /usr/local/lib/python3.6/dist-packages/armada/handlers/tiller.py:107
2021-05-18 12:30:09.238 177 INFO armada.handlers.lock [-] Acquiring lock
2021-05-18 12:30:09.250 177 DEBUG armada.handlers.tiller [-] Getting known releases from Tiller... list_charts /usr/local/lib/python3.6/dist-packages/armada/handlers/tiller.py:359
2021-05-18 12:30:09.250 177 DEBUG armada.handlers.tiller [-] Tiller ListReleases() with timeout=300, request=limit: 32
status_codes: UNKNOWN
status_codes: DEPLOYED
status_codes: DELETED
status_codes: DELETING
status_codes: FAILED
status_codes: PENDING_INSTALL
status_codes: PENDING_UPGRADE
status_codes: PENDING_ROLLBACK
 get_results /usr/local/lib/python3.6/dist-packages/armada/handlers/tiller.py:215
2021-05-18 12:30:10.251 177 INFO armada.handlers.lock [-] Releasing lock
2021-05-18 12:30:10.257 177 ERROR armada.cli [-] Caught unexpected exception: grpc._channel._Rendezvous: <_Rendezvous of RPC that terminated with:
        status = StatusCode.UNKNOWN
        details = "write tcp 172.16.166.171:34140->10.10.56.10:5432: write: broken pipe"
        debug_error_string = "{"created":"@1621341009.251408267","description":"Error received from peer","file":"src/core/lib/surface/call.cc","file_line":1017,"grpc_message":"write tcp 172.16.166.171:34140->10.10.56.10:5432: write: broken pipe","grpc_status":2}"
>
2021-05-18 12:30:10.257 177 ERROR armada.cli Traceback (most recent call last):
2021-05-18 12:30:10.257 177 ERROR armada.cli File "/usr/local/lib/python3.6/dist-packages/armada/cli/__init__.py", line 38, in safe_invoke
2021-05-18 12:30:10.257 177 ERROR armada.cli self.invoke()
2021-05-18 12:30:10.257 177 ERROR armada.cli File "/usr/local/lib/python3.6/dist-packages/armada/cli/delete.py", line 98, in invoke
2021-05-18 12:30:10.257 177 ERROR armada.cli self.handle(tiller)
2021-05-18 12:30:10.257 177 ERROR armada.cli File "/usr/local/lib/python3.6/dist-packages/armada/handlers/lock.py", line 81, in func_wrapper
2021-05-18 12:30:10.257 177 ERROR armada.cli return future.result()
2021-05-18 12:30:10.257 177 ERROR armada.cli File "/usr/lib/python3.6/concurrent/futures/_base.py", line 425, in result...


Revision history for this message
Ghada Khalil (gkhalil) wrote :

Based on discussion in the stx release meeting (2021-05-19), the issue is related to the controller being in an offline state when the application removal is triggered. As such, this is lower priority than initially expected and will not hold up the stx.5.0 release.
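
Given that analysis, a hedged workaround sketch is to verify controller availability and the alarm state before retrying the remove; these are standard StarlingX commands, not a fix for the underlying race:

[sysadmin@controller-0 ~(keystone_admin)]$ system host-list
[sysadmin@controller-0 ~(keystone_admin)]$ fm alarm-list | grep 750.003
# retry only once both controllers are unlocked/enabled/available:
[sysadmin@controller-0 ~(keystone_admin)]$ system application-remove stx-openstack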

tags: removed: stx.5.0
Changed in starlingx:
importance: Medium → Low