Comment 16 for bug 1832047

Wendy Mitchell (wmitchellwr) wrote : Re: 200.006 alarm "controller-0 is degraded due to the failure of its 'pci-irq-affinity-agent' process" after reboot

BUILD_ID="20190622T013000Z"
SYSTEM_NAME="yow-cgcs-wolfpass-03_07" (2+3 HW system)
VSWITCH_TYPE="ovs-dpdk"

[Failed in teardown in nova regression testcase
nova/test_force_lock_with_vms.py::test_force_lock_with_mig_vms]

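$ system application-list
+---------------------+--------------------------------+-------------------------------+--------------------+---------+-----------+
| application | version | manifest name | manifest file | status | progress |
+---------------------+--------------------------------+-------------------------------+--------------------+---------+-----------+
| platform-integ-apps | 1.0-7 | platform-integration-manifest | manifest.yaml | applied | completed |
| stx-openstack | 1.0-16-centos-stable-versioned | armada-manifest | stx-openstack.yaml | applied | completed |
+---------------------+--------------------------------+-------------------------------+--------------------+---------+-----------+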

1. When compute-0 was force locked (at approximately 2019-06-25 14:18:30), all instances that were running on it landed on compute-1.

The nova hypervisor on compute-0 then went to state down, status disabled:

$ nova hypervisor-list
+--------------------------------------+---------------------+-------+----------+
| ID | Hypervisor hostname | State | Status |
+--------------------------------------+---------------------+-------+----------+
| a8aea911-05a5-4410-9f39-08630783d373 | compute-0 | down | disabled |
| ba4344e6-5d97-4b42-a67b-eabbc95b531b | compute-1 | up | enabled |
| a612e8ad-5026-4d9e-b103-339b6d94eefe | compute-2 | up | enabled |
+--------------------------------------+---------------------+-------+----------+
[sysadmin@controller-0 ~(keystone_admin)]$ date
Tue Jun 25 14:22:00 UTC 2019

compute-0 was unlocked at approximately 2019-06-25 14:24:15.

Two alarms were reported after this (as shown in Horizon). Only the CPU alarm on compute-1 cleared.

200.006 compute-0 is degraded due to the failure of its 'pci-irq-affinity-agent' process. Auto recovery of this major process is in progress.
host=compute-0.process=pci-irq-affinity-agent major 2019-06-25T10:28:17

100.101 Platform CPU threshold exceeded ; threshold 90.00%, actual 93.43% host=compute-1 major 2019-06-25T10:33:11

+----+--------------+-------------+----------------+-------------+--------------+
| id | hostname | personality | administrative | operational | availability |
+----+--------------+-------------+----------------+-------------+--------------+
| 1 | controller-0 | controller | unlocked | enabled | available |
| 2 | controller-1 | controller | unlocked | enabled | available |
| 3 | compute-0 | worker | unlocked | disabled | intest |
| 4 | compute-1 | worker | unlocked | enabled | degraded |
| 5 | compute-2 | worker | unlocked | enabled | available |
+----+--------------+-------------+----------------+-------------+--------------+

$ date
Tue Jun 25 14:28:15 UTC 2019

$ nova hypervisor-list;date
+--------------------------------------+---------------------+-------+---------+
| ID | Hypervisor hostname | State | Status |
+--------------------------------------+---------------------+-------+---------+
| a8aea911-05a5-4410-9f39-08630783d373 | compute-0 | down | enabled |
| ba4344e6-5d97-4b42-a67b-eabbc95b531b | compute-1 | up | enabled |
| a612e8ad-5026-4d9e-b103-339b6d94eefe | compute-2 | up | enabled |
+--------------------------------------+---------------------+-------+---------+
Tue Jun 25 14:41:14 UTC 2019

$ nova hypervisor-list; date
+--------------------------------------+---------------------+-------+---------+
| ID | Hypervisor hostname | State | Status |
+--------------------------------------+---------------------+-------+---------+
| a8aea911-05a5-4410-9f39-08630783d373 | compute-0 | down | enabled |
| ba4344e6-5d97-4b42-a67b-eabbc95b531b | compute-1 | up | enabled |
| a612e8ad-5026-4d9e-b103-339b6d94eefe | compute-2 | up | enabled |
+--------------------------------------+---------------------+-------+---------+
Tue Jun 25 15:05:37 UTC 2019

Result:
The pci-irq-affinity-agent alarm on compute-0 did not clear long after the compute was unlocked.
The hypervisor on compute-0 remained in state down (status enabled). It was still down at 15:05:37 UTC, about 41 minutes after the unlock.
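
For anyone re-running this scenario, the check made by eye on the hypervisor tables above could be scripted. The following is only a rough sketch, not part of the regression suite: the helper names `parse_cli_table` and `unhealthy_hypervisors` are hypothetical, and it assumes the captured text uses the same `|`-delimited layout as the `nova hypervisor-list` output pasted in this comment.

```python
def parse_cli_table(text):
    """Parse a '| a | b |' style CLI table into a list of dicts.

    Hypothetical helper: border lines, prompts, and timestamps are skipped
    because they do not start with '|'. The first '|' row is the header.
    """
    header = None
    rows = []
    for line in text.splitlines():
        line = line.strip()
        if not line.startswith("|"):
            continue
        cells = [cell.strip() for cell in line.strip("|").split("|")]
        if header is None:
            header = cells
        else:
            rows.append(dict(zip(header, cells)))
    return rows


def unhealthy_hypervisors(text):
    """Return hostnames whose (State, Status) is not ('up', 'enabled')."""
    return [
        row["Hypervisor hostname"]
        for row in parse_cli_table(text)
        if (row["State"], row["Status"]) != ("up", "enabled")
    ]


# Sample taken from the 15:05:37 UTC capture in this comment.
SAMPLE = """\
+--------------------------------------+---------------------+-------+---------+
| ID | Hypervisor hostname | State | Status |
+--------------------------------------+---------------------+-------+---------+
| a8aea911-05a5-4410-9f39-08630783d373 | compute-0 | down | enabled |
| ba4344e6-5d97-4b42-a67b-eabbc95b531b | compute-1 | up | enabled |
| a612e8ad-5026-4d9e-b103-339b6d94eefe | compute-2 | up | enabled |
+--------------------------------------+---------------------+-------+---------+
Tue Jun 25 15:05:37 UTC 2019
"""

print(unhealthy_hypervisors(SAMPLE))  # ['compute-0']
```

Run against the last capture, this flags only compute-0, which matches the result reported here.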