controller-0 is degraded due to the failure of its 'pci-irq-affinity-agent' process. Auto recovery of this major process is in progress.

Bug #1918318 reported by Bruce Jones
This bug affects 2 people

Affects: StarlingX
Status: Triaged
Importance: Low
Assigned to: chen haochuan

Bug Description

An end user is seeing this issue, which looks very similar to https://review.opendev.org/c/starlingx/integ/+/659081 which was fixed 2 years ago. The user is seeing this issue on StarlingX 4.0. The configuration is a simplex (AIO) system.

The user reports that locking/unlocking the controller does not resolve the issue.

Logs captured are:

controller-0:~$ source /etc/platform/openrc
[sysadmin@controller-0 ~(keystone_admin)]$ system host-list
+----+--------------+-------------+----------------+-------------+--------------+
| id | hostname | personality | administrative | operational | availability |
+----+--------------+-------------+----------------+-------------+--------------+
| 1 | controller-0 | controller | unlocked | enabled | degraded |
+----+--------------+-------------+----------------+-------------+--------------+
[sysadmin@controller-0 ~(keystone_admin)]$ system application-list
+--------------------------+---------------------------+-----------------------------------+----------------------------------------+----------+-----------+
| application | version | manifest name | manifest file | status | progress |
+--------------------------+---------------------------+-----------------------------------+----------------------------------------+----------+-----------+
| cert-manager | 1.0-6 | cert-manager-manifest | certmanager-manifest.yaml | applied | completed |
| nginx-ingress-controller | 1.0-0 | nginx-ingress-controller-manifest | nginx_ingress_controller_manifest.yaml | applied | completed |
| oidc-auth-apps | 1.0-28 | oidc-auth-manifest | manifest.yaml | uploaded | completed |
| platform-integ-apps | 1.0-10 | platform-integration-manifest | manifest.yaml | applied | completed |
| stx-openstack | 1.0-49-centos-stable- | armada-manifest | stx-openstack.yaml | applied | completed |
| | latest | | | | |
| | | | | | |
+--------------------------+---------------------------+-----------------------------------+----------------------------------------+----------+-----------+

[sysadmin@controller-0 ~(keystone_admin)]$ openstack compute service list
internal endpoint for compute service in RegionOne region not found
[sysadmin@controller-0 ~(keystone_admin)]$ openstack endpoint list
+----------------------------------+-----------+--------------+-----------------+---------+-----------+-------------------------------+
| ID | Region | Service Name | Service Type | Enabled | Interface | URL |
+----------------------------------+-----------+--------------+-----------------+---------+-----------+-------------------------------+
| 7640ca71125c468d8d404eb6e5fab0be | RegionOne | fm | faultmanagement | True | admin | http://192.168.204.1:18002 |
| ed95da9eed7f45a48536f56cd98d7f6d | RegionOne | fm | faultmanagement | True | internal | http://192.168.204.1:18002 |
| 0fbbef441ab648e19167cd72271d993a | RegionOne | fm | faultmanagement | True | public | http://192.168.172.18:18002 |
| 4cf33a8c21004501b65701ca4af9e085 | RegionOne | patching | patching | True | admin | http://192.168.204.1:5491 |
| 9b3ed1ea44a344798c1508090b8455c1 | RegionOne | patching | patching | True | internal | http://192.168.204.1:5491 |
| 9bcc2dc5da56472482b99158580a15aa | RegionOne | patching | patching | True | public | http://192.168.172.18:15491 |
| 974458fe518848a1b353bc522aa719f4 | RegionOne | vim | nfv | True | admin | http://192.168.204.1:4545 |
| 776e1616ae8e4487a088d1c3bcc393d8 | RegionOne | vim | nfv | True | internal | http://192.168.204.1:4545 |
| 00a8ec1edef442859d002ba82106c839 | RegionOne | vim | nfv | True | public | http://192.168.172.18:4545 |
| 4e1be4f3cfe440a1b3696d310ba587cc | RegionOne | barbican | key-manager | True | admin | http://192.168.204.1:9311 |
| 1f62de9274044b76967eaa5c50f8963a | RegionOne | barbican | key-manager | True | internal | http://192.168.204.1:9311 |
| 0f336b7a951740fab6f5624f1b4745c1 | RegionOne | barbican | key-manager | True | public | http://192.168.172.18:9311 |
| b268f65d54f4442582831ef350ad04b3 | RegionOne | smapi | smapi | True | admin | http://192.168.204.1:7777 |
| 21ce287f252f4c29a8b488a95011fd38 | RegionOne | smapi | smapi | True | internal | http://192.168.204.1:7777 |
| f2bdd1e93e70449cbe9c86ce4d1b5ef8 | RegionOne | smapi | smapi | True | public | http://192.168.172.18:7777 |
| 8c8b84b269284623aa7a56410f610f6a | RegionOne | keystone | identity | True | admin | http://192.168.204.1:5000/v3 |
| e5098d9e405243b1a65b9481829cfacf | RegionOne | keystone | identity | True | internal | http://192.168.204.1:5000/v3 |
| 7772e70e06824cf7af9d419810747843 | RegionOne | keystone | identity | True | public | http://192.168.172.18:5000/v3 |
| 2c893433a3f345249a4bf8e2aa57c71d | RegionOne | sysinv | platform | True | admin | http://192.168.204.1:6385/v1 |
| 2d508cf2e24a4a16a46c56f6afcd2ec7 | RegionOne | sysinv | platform | True | internal | http://192.168.204.1:6385/v1 |
| ac433b091270449f8f99dacd7147c5af | RegionOne | sysinv | platform | True | public | http://192.168.172.18:6385/v1 |
+----------------------------------+-----------+--------------+-----------------+---------+-----------+-------------------------------+
[sysadmin@controller-0 ~(keystone_admin)]$ openstack compute service list --service nova-compute
internal endpoint for compute service in RegionOne region not found
[sysadmin@controller-0 ~(keystone_admin)]$ openstack flavor list
internal endpoint for compute service in RegionOne region not found

controller-0:~$ kubectl get pods --all-namespaces -o wide|grep nova | grep -a -e Running -e Completed
openstack nova-bootstrap-bt6jz 0/1 Completed 0 2d8h 172.16.192.157 controller-0 <none> <none>
openstack nova-bootstrap-zhj9w 0/1 Completed 0 5d12h 172.16.192.109 controller-0 <none> <none>
openstack nova-cell-setup-bxh4q 0/1 Completed 0 5d12h 172.16.192.105 controller-0 <none> <none>
openstack nova-cell-setup-ltxmx 0/1 Completed 0 2d8h 172.16.192.154 controller-0 <none> <none>
openstack nova-db-init-g67tk 0/3 Completed 0 5d12h 172.16.192.111 controller-0 <none> <none>
openstack nova-db-init-hnnxw 0/3 Completed 0 2d8h 172.16.192.155 controller-0 <none> <none>
openstack nova-db-sync-8zq8m 0/1 Completed 0 2d8h 172.16.192.156 controller-0 <none> <none>
openstack nova-db-sync-xdcb2 0/1 Completed 0 5d12h 172.16.192.107 controller-0 <none> <none>
openstack nova-ks-endpoints-c9kkx 0/3 Completed 0 5d12h 172.16.192.70 controller-0 <none> <none>
openstack nova-ks-service-9hr9j 0/1 Completed 0 2d8h 172.16.192.159 controller-0 <none> <none>
openstack nova-ks-service-jpf7g 0/1 Completed 0 5d12h 172.16.192.98 controller-0 <none> <none>
openstack nova-ks-user-bfjg2 0/1 Completed 0 2d8h 172.16.192.160 controller-0 <none> <none>
openstack nova-ks-user-bqh5q 0/1 Completed 0 5d12h 172.16.192.108 controller-0 <none> <none>
openstack nova-rabbit-init-7g7hk 0/1 Completed 0 2d8h 172.16.192.158 controller-0 <none> <none>
openstack nova-rabbit-init-cfxws 0/1 Completed 0 5d12h 172.16.192.112 controller-0 <none> <none>
openstack nova-service-cleaner-1614945600-hck6t 0/1 Completed 0 3d12h 172.16.192.113 controller-0 <none> <none>
openstack nova-service-cleaner-1614949200-j6h5g 0/1 Completed 0 3d11h 172.16.192.105 controller-0 <none> <none>
openstack nova-service-cleaner-1614967200-kh2w7 0/1 Completed 0 2d9h 172.16.192.92 controller-0 <none> <none>
openstack nova-service-cleaner-1615201200-kk47d 0/1 Completed 0 13h 172.16.192.154 controller-0 <none> <none>
openstack nova-service-cleaner-1615204800-h6w8p 0/1 Completed 0 12h 172.16.192.161 controller-0 <none> <none>
openstack nova-service-cleaner-1615208400-965sc 0/1 Completed 0 11h 172.16.192.163 controller-0 <none> <none>
openstack nova-storage-init-vsbdz 0/1 Completed 0 5d12h 172.16.192.95 controller-0 <none> <none>
openstack nova-storage-init-wl674 0/1 Completed 0 2d8h 172.16.192.161 controller-0 <none> <none>

controller-0:~$ kubectl get pods --all-namespaces -o wide|grep nova | grep -v -e Running -e Completed
openstack nova-api-metadata-89496f88b-wwwfg 0/1 Init:Unknown 0 12h <none> controller-0 <none> <none>
openstack nova-api-osapi-8c586b98b-k6dsn 0/1 Init:Unknown 0 12h <none> controller-0 <none> <none>
openstack nova-api-proxy-6c5488769-dbdt8 0/1 Unknown 1 12h <none> controller-0 <none> <none>
openstack nova-compute-controller-0-937646f6-4jlmq 0/2 Init:Unknown 1 13h 192.168.204.2 controller-0 <none> <none>
openstack nova-conductor-7d6fc48cc9-hq7sb 0/1 Init:Unknown 0 12h <none> controller-0 <none> <none>
openstack nova-novncproxy-c7b7f4c69-xpdj4 0/1 Init:Unknown 0 3h18m <none> controller-0 <none> <none>
openstack nova-scheduler-cfcf778f7-dkksd 0/1 Init:0/1 0 147m 172.16.192.157 controller-0 <none> <none>
openstack nova-service-cleaner-1615215600-klrrw 0/1 Init:0/1 0 147m 172.16.192.134 controller-0 <none> <none>

----inventory log---

sysinv 2021-03-09 04:38:24.379 83939 INFO sysinv.api.controllers.v1.rest_api [-] GET cmd:http://localhost:30001/nfvi-plugins/v1/sw-update hdr:{'Content-type': 'application/json', 'User-Agent': 'sysinv/1.0'} payload:None
sysinv 2021-03-09 04:38:24.383 83939 INFO sysinv.api.controllers.v1.rest_api [-] Response={u'status': u'success', u'in-progress': None, u'sw-update-type': None}
sysinv 2021-03-09 04:38:24.428 83939 INFO sysinv.conductor.manager [-] Platform managed application oidc-auth-apps: Prerequisites not met.
sysinv 2021-03-09 04:38:24.434 83939 INFO sysinv.api.controllers.v1.rest_api [-] GET cmd:http://localhost:30001/nfvi-plugins/v1/sw-update hdr:{'Content-type': 'application/json', 'User-Agent': 'sysinv/1.0'} payload:None
sysinv 2021-03-09 04:38:24.436 83939 INFO sysinv.api.controllers.v1.rest_api [-] Response={u'status': u'success', u'in-progress': None, u'sw-update-type': None}
sysinv 2021-03-09 04:38:24.443 83939 INFO sysinv.conductor.manager [-] platform-integ-apps requires re-apply but there are currently node(s) in an unstable state. Will retry on next audit
sysinv 2021-03-09 04:39:24.384 83939 INFO sysinv.api.controllers.v1.rest_api [-] GET cmd:http://localhost:30001/nfvi-plugins/v1/sw-update hdr:{'Content-type': 'application/json', 'User-Agent': 'sysinv/1.0'} payload:None
sysinv 2021-03-09 04:39:24.386 83939 INFO sysinv.api.controllers.v1.rest_api [-] Response={u'status': u'success', u'in-progress': None, u'sw-update-type': None}
sysinv 2021-03-09 04:39:24.431 83939 INFO sysinv.conductor.manager [-] Platform managed application oidc-auth-apps: Prerequisites not met.
sysinv 2021-03-09 04:39:24.437 83939 INFO sysinv.api.controllers.v1.rest_api [-] GET cmd:http://localhost:30001/nfvi-plugins/v1/sw-update hdr:{'Content-type': 'application/json', 'User-Agent': 'sysinv/1.0'} payload:None
sysinv 2021-03-09 04:39:24.439 83939 INFO sysinv.api.controllers.v1.rest_api [-] Response={u'status': u'success', u'in-progress': None, u'sw-update-type': None}
sysinv 2021-03-09 04:39:24.448 83939 INFO sysinv.conductor.manager [-] platform-integ-apps requires re-apply but there are currently node(s) in an unstable state. Will retry on next audit
sysinv 2021-03-09 04:40:24.404 83939 INFO sysinv.api.controllers.v1.rest_api [-] GET cmd:http://localhost:30001/nfvi-plugins/v1/sw-update hdr:{'Content-type': 'application/json', 'User-Agent': 'sysinv/1.0'} payload:None
sysinv 2021-03-09 04:40:24.405 83939 WARNING sysinv.api.controllers.v1.rest_api [-] URLError Error e=<urlopen error [Errno 111] ECONNREFUSED>: URLError: <urlopen error [Errno 111] ECONNREFUSED>
sysinv 2021-03-09 04:40:24.405 83939 ERROR sysinv.openstack.common.periodic_task [-] Error during ConductorManager._k8s_application_audit: 'NoneType' object has no attribute '__getitem__': TypeError: 'NoneType' object has no attribute '__getitem__'
2021-03-09 04:40:24.405 83939 ERROR sysinv.openstack.common.periodic_task Traceback (most recent call last):
2021-03-09 04:40:24.405 83939 ERROR sysinv.openstack.common.periodic_task File "/usr/lib64/python2.7/site-packages/sysinv/openstack/common/periodic_task.py", line 180, in run_periodic_tasks
2021-03-09 04:40:24.405 83939 ERROR sysinv.openstack.common.periodic_task task(self, context)
2021-03-09 04:40:24.405 83939 ERROR sysinv.openstack.common.periodic_task File "/usr/lib64/python2.7/site-packages/sysinv/conductor/manager.py", line 5454, in _k8s_application_audit
2021-03-09 04:40:24.405 83939 ERROR sysinv.openstack.common.periodic_task if self._check_software_orchestration_in_progress():
2021-03-09 04:40:24.405 83939 ERROR sysinv.openstack.common.periodic_task File "/usr/lib64/python2.7/site-packages/sysinv/conductor/manager.py", line 5411, in _check_software_orchestration_in_progress
2021-03-09 04:40:24.405 83939 ERROR sysinv.openstack.common.periodic_task if vim_resp['sw-update-type'] is not None and \
2021-03-09 04:40:24.405 83939 ERROR sysinv.openstack.common.periodic_task TypeError: 'NoneType' object has no attribute '__getitem__'
2021-03-09 04:40:24.405 83939 ERROR sysinv.openstack.common.periodic_task
sysinv 2021-03-09 04:41:24.375 83939 INFO sysinv.api.controllers.v1.rest_api [-] GET cmd:http://localhost:30001/nfvi-plugins/v1/sw-update hdr:{'Content-type': 'application/json', 'User-Agent': 'sysinv/1.0'} payload:None
sysinv 2021-03-09 04:41:24.378 83939 INFO sysinv.api.controllers.v1.rest_api [-] Response={u'status': u'success', u'in-progress': None, u'sw-update-type': None}
sysinv 2021-03-09 04:41:24.424 83939 INFO sysinv.conductor.manager [-] Platform managed application oidc-auth-apps: Prerequisites not met.
sysinv 2021-03-09 04:41:24.430 83939 INFO sysinv.api.controllers.v1.rest_api [-] GET cmd:http://localhost:30001/nfvi-plugins/v1/sw-update hdr:{'Content-type': 'application/json', 'User-Agent': 'sysinv/1.0'} payload:None
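
The traceback above shows the periodic audit subscripting the VIM response after the REST call failed with ECONNREFUSED. Below is a minimal sketch of the defensive pattern, assuming the REST helper returns None on any connection failure; the function name and response shape are illustrative, taken from the Response={...} lines in this log rather than from the sysinv source:

def orchestration_in_progress(vim_resp):
    """Return True only when a valid VIM response reports an update in progress."""
    if not vim_resp:
        # The REST helper appears to return None on URLError; treat that as
        # "no orchestration in progress" instead of subscripting None.
        return False
    return (vim_resp.get('sw-update-type') is not None and
            vim_resp.get('in-progress') is not None)

# Shapes taken from the log above:
print(orchestration_in_progress(None))                      # False, no TypeError
print(orchestration_in_progress({u'status': u'success',
                                 u'in-progress': None,
                                 u'sw-update-type': None}))  # False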

Tags: stx.metal
Revision history for this message
Bruce Jones (brucej) wrote :

This bug was reported by Danishka Navin <email address hidden>, who should be contacted if more information is needed.

Changed in starlingx:
importance: Undecided → High
assignee: nobody → yong hu (yhu6)
assignee: yong hu (yhu6) → nobody
assignee: nobody → Nicolae Jascanu (njascanu-intel)
Revision history for this message
Nicolae Jascanu (njascanu-intel) wrote :

Could you please update the bug with the following information?
run command: source /etc/platform/openrc
- output of: cat /etc/build.info
- output of: cat /var/log/pmond.log
- output of: fm alarm-list

Also, Yong's team will need the archive produced by running the collect tool:
- please run: collect all

Revision history for this message
Nicolae Jascanu (njascanu-intel) wrote :

On our AIO simplex baremetal, I'm seeing errors related to pci-irq-affinity-agent in /var/log/pmond.log, but controller-0 is NOT in degraded mode yet.

Revision history for this message
OpenInfra (openinfra) wrote :

[sysadmin@controller-0 ~(keystone_admin)]$ cat /etc/build.info
###
### StarlingX
### Release 20.06
###

OS="centos"
SW_VERSION="20.06"
BUILD_TARGET="Host Installer"
BUILD_TYPE="Formal"
BUILD_ID="r/stx.4.0"

JOB="STX_4.0_build_layer_flock"
BUILD_BY="<email address hidden>"
BUILD_NUMBER="22"
BUILD_HOST="starlingx_mirror"
BUILD_DATE="2020-08-05 12:25:52 +0000"

FLOCK_OS="centos"
FLOCK_JOB="STX_4.0_build_layer_flock"
FLOCK_BUILD_BY="<email address hidden>"
FLOCK_BUILD_NUMBER="22"
FLOCK_BUILD_HOST="starlingx_mirror"
FLOCK_BUILD_DATE="2020-08-05 12:25:52 +0000"

Please note that I have increased the memory from 18GB to 32GB and then locked and unlocked controller-0.

[sysadmin@controller-0 ~(keystone_admin)]$ fm alarm-list
+----------+------------------------------------------------------------------------+--------------------------------------+----------+-----------------------+
| Alarm ID | Reason Text | Entity ID | Severity | Time Stamp |
+----------+------------------------------------------------------------------------+--------------------------------------+----------+-----------------------+
| 100.114 | NTP configuration does not contain any valid or reachable NTP servers. | host=controller-0.ntp | major | 2021-03-10T13:58:52. |
| | | | | 739176 |
| | | | | |
| 100.114 | NTP address 131.188.3.220 is not a valid or a reachable NTP server. | host=controller-0.ntp=131.188.3.220 | minor | 2021-03-10T13:58:52. |
| | | | | 736444 |
| | | | | |
| 700.004 | Instance iv-centos owned by admin is stopped on host controller-0 | tenant=d86f3dec-65f6-4afa- | critical | 2021-03-10T10:43:48. |
| | | b0f5-36dae0f52b71.instance= | | 922999 |
| | | fd1b6814-6ee8-4caf-8bf4-2027b91a24d3 | | |
| | | | | |
| 700.004 | Instance ap-centos-8 owned by admin is stopped on host controller-0 | tenant=d86f3dec-65f6-4afa- | critical | 2021-03-10T10:40:16. |
| | | b0f5-36dae0f52b71.instance= | | 166109 |
| | ...


Revision history for this message
OpenInfra (openinfra) wrote :

Attached pmond.log file

Revision history for this message
Ghada Khalil (gkhalil) wrote :

Tagging as stx.4.0 since this is the release the issue is reported on

tags: added: stx.4.0
tags: added: stx.metal
Revision history for this message
Nicolae Jascanu (njascanu-intel) wrote :

I added the 'collect all' logs.

Bruce Jones (brucej)
Changed in starlingx:
status: New → Triaged
assignee: Nicolae Jascanu (njascanu-intel) → yong hu (yhu6)
yong hu (yhu6)
Changed in starlingx:
assignee: yong hu (yhu6) → chen haochuan (martin1982)
Revision history for this message
Nicolae Jascanu (njascanu-intel) wrote :

We've installed a baremetal simplex. The error appears in pmond.log but the controller is not degraded.

[sysadmin@controller-0 ~(keystone_admin)]$ system host-list
+----+--------------+-------------+----------------+-------------+--------------+
| id | hostname | personality | administrative | operational | availability |
+----+--------------+-------------+----------------+-------------+--------------+
| 1 | controller-0 | controller | unlocked | enabled | available |
+----+--------------+-------------+----------------+-------------+--------------+
[sysadmin@controller-0 ~(keystone_admin)]$ system application-list
+--------------------------+------------------------------+-----------------------------------+----------------------------------------+----------+-----------+
| application | version | manifest name | manifest file | status | progress |
+--------------------------+------------------------------+-----------------------------------+----------------------------------------+----------+-----------+
| cert-manager | 1.0-5 | cert-manager-manifest | certmanager-manifest.yaml | applied | completed |
| nginx-ingress-controller | 1.0-0 | nginx-ingress-controller-manifest | nginx_ingress_controller_manifest.yaml | applied | completed |
| oidc-auth-apps | 1.0-27 | oidc-auth-manifest | manifest.yaml | uploaded | completed |
| platform-integ-apps | 1.0-9 | platform-integration-manifest | manifest.yaml | applied | completed |
| stx-openstack | 1.0-49-centos-stable- | armada-manifest | stx-openstack.yaml | applied | completed |
| | versioned | | | | |
| | | | | | |
+--------------------------+------------------------------+-----------------------------------+----------------------------------------+----------+-----------+
[sysadmin@controller-0 ~(keystone_admin)]$ fm alarm-list
+----------+----------------------------------------------------------------------+-------------------+----------+-------------------+
| Alarm ID | Reason Text | Entity ID | Severity | Time Stamp |
+----------+----------------------------------------------------------------------+-------------------+----------+-------------------+
| 100.103 | Platform Memory threshold exceeded ; threshold 80.00%, actual 80.62% | host=controller-0 | major | 2021-03-12T07:07: |
| | | | | 35.607210 |
| | ...


Revision history for this message
chen haochuan (martin1982) wrote :

The pci-irq-affinity script fails to launch, as shown in platform.log:

2021-03-09T06:00:55.476 controller-0 2021-03-09 info 06:00:55,476 MainThread[118285] pci-interrupt-affinity./usr/lib64/python2.7/site-packages/pci_irq_affinity/agent.py.176 - INFO Enter PCIInterruptAffinity Agent
2021-03-09T06:00:55.477 controller-0 2021-03-09 info 06:00:55,476 MainThread[118285] pci-interrupt-affinity./usr/lib64/python2.7/site-packages/pci_irq_affinity/agent.py.194 - INFO 'NoneType' object has no attribute 'getCapabilities'
2021-03-09T06:00:55.477 controller-0 2021-03-09 err 06:00:55,476 MainThread[118285] pci-interrupt-affinity./usr/lib64/python2.7/site-packages/pci_irq_affinity/agent.py.198 - ERROR proces_main finalized!!!

Revision history for this message
chen haochuan (martin1982) wrote :

It should be failing here, at conn.getCapabilities() in guest.py:

def get_host_cpu_topology():
    """Enumerate logical cpu topology using socket_id, core_id, thread_id.

    This generates the following dictionary:
    topology[socket_id][core_id][thread_id] = cpu_id
    """
    global total_cpus

    # Connect to local libvirt hypervisor
    conn = connect_to_libvirt()
    # Get host capabilities
    caps_str = conn.getCapabilities()
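
For context, the platform.log line "'NoneType' object has no attribute 'getCapabilities'" implies connect_to_libvirt() returned None, i.e. libvirtd could not be reached. Below is a hedged sketch of a guard/retry around that call; the URI, retry parameters, and use of libvirt.openReadOnly() are assumptions, and the agent's own connect_to_libvirt() wrapper may differ:

import time
import libvirt  # python libvirt bindings (the agent talks to libvirt per guest.py above)

def connect_to_libvirt_with_retry(uri='qemu:///system', attempts=5, delay=10):
    """Try to open a read-only libvirt connection; return None if it never comes up."""
    for _ in range(attempts):
        try:
            conn = libvirt.openReadOnly(uri)
            if conn is not None:
                return conn
        except libvirt.libvirtError:
            pass  # libvirtd (the openstack libvirt pod) is not reachable yet
        time.sleep(delay)
    return None

def get_host_capabilities():
    conn = connect_to_libvirt_with_retry()
    if conn is None:
        # Without a guard like this, the agent fails with
        # "'NoneType' object has no attribute 'getCapabilities'".
        raise RuntimeError("libvirt unavailable; is the openstack libvirt pod running?")
    try:
        return conn.getCapabilities()
    finally:
        conn.close()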

Revision history for this message
chen haochuan (martin1982) wrote :

The openstack libvirt pod likely did not launch correctly. If this is reproducible, please check whether that pod is running properly.

Revision history for this message
Sivonaldo Diogo Santos Silva (sdiogosa) wrote :

I'm having the same problem in a DX

2021-04-20T06:02:39.842 [471839.00353] controller-1 mtcAgent hbs nodeClass.cpp (5687) log_process_failure : Warn : controller-0 pmon: 'pci-irq-affinity-agent' process failed and is being auto recovered
2021-04-20T06:03:38.842 [471839.00354] controller-1 mtcAgent hbs nodeClass.cpp (5744) degrade_process_raise : Warn : controller-0 is degraded due to 'pci-irq-affinity-agent' process failure

Revision history for this message
Alexandru Dimofte (adimofte) wrote :

I observed this issue today affecting compute-0 on the Standard baremetal RC5.0 image 20210505T010951Z:

| 200.006 | compute-0 is degraded due to the failure of its 'pci-irq-affinity-agent' | host=compute-0.process= | major | 2021-05-05T11: |
| | process. Auto recovery of this major process is in progress. | pci-irq-affinity-agent | | 22:34.602662 |

Revision history for this message
Ghada Khalil (gkhalil) wrote :

screening: Added the stx.5.0 label based on the note above that this is also seen in the stx.5.0 release sanity. Seen twice in sanity already; frequency is TBD.

tags: added: stx.5.0
Revision history for this message
Austin Sun (sunausti) wrote :

I just logged into this setup.

The alarm raised is:
| 200. | compute-0 is degraded due to the failure of | host=compute-0.process=pci-irq- | major | 2021-05-12T |
| 006 | its 'pci-irq-affinity-agent' process. Auto | affinity-agent | | 11:18:56. |
| | recovery of this major process is in progress | | | 615041 |

And compute-0 is locked

+----+--------------+-------------+----------------+-------------+--------------+
| id | hostname | personality | administrative | operational | availability |
+----+--------------+-------------+----------------+-------------+--------------+
| 1 | controller-0 | controller | unlocked | enabled | available |
| 2 | compute-0 | worker | locked | disabled | online |
| 3 | compute-1 | worker | unlocked | enabled | available |
| 4 | controller-1 | controller | unlocked | enabled | available |
| 5 | storage-0 | storage | locked | disabled | online |
| 6 | storage-1 | storage | unlocked | enabled | available |
+----+--------------+-------------+----------------+-------------+--------------+

So no openstack pods are running on compute-0.

Revision history for this message
Yvonne Ding (yding) wrote :

I hit the issue on AIO-SX on a StarlingX master load from 0429. After unlocking controller-0, alarm 200.006 appears and the host goes degraded once it comes up unlocked/enabled/available, then it returns to available again.

Revision history for this message
chen haochuan (martin1982) wrote :

Hi Yvonne, please share your log file from the AIO-SX reproduction.

Revision history for this message
Austin Sun (sunausti) wrote :

This should be a different issue from the original one.

The original issue from the customer is seen on a simplex where controller-0 is unlocked and available.

In the recent reproductions, the issue is seen either on AIO-SX with controller-0 locked, or on a multi-node system with computes/workers locked. That should be the normal case: once controller-0 or the workers are unlocked, these alarms will clear as the openstack pods running on those nodes come back.

Revision history for this message
Austin Sun (sunausti) wrote :
Revision history for this message
Ghada Khalil (gkhalil) wrote :

Based on discussion in the stx release meeting (2021-05-19), the issue is highly intermittent. Lowering the priority; this will not hold up the stx.5.0 release.

From Alexandru: Of 32 installations, he saw it 2 times

Changed in starlingx:
importance: High → Low
tags: removed: stx.4.0 stx.5.0