hypervisor stays down after force lock and unlock due to pci-irq-affinity-agent process failure

Bug #1839160 reported by Yang Liu
Affects: StarlingX
Status: Won't Fix
Importance: Low
Assigned to: Jim Gauld

Bug Description

Brief Description
-----------------
After a force lock followed by an unlock of a worker node, the node gets stuck in the degraded state with its hypervisor down. The following alarm is seen on the system:
200.006 | compute-0 is degraded due to the failure of its 'pci-irq-affinity-agent' process. Auto recovery of this major process is in progress.
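
For context, alarm 200.006 is raised while the platform process monitor (pmon) attempts to auto-recover a process it supervises; the host is held in the degraded state until the process stays up. A representative sketch of a pmon conf for this agent follows; the field names and values are assumptions modeled on typical StarlingX /etc/pmon.d/ entries, not copied from the failing system:

[process]
process  = pci-irq-affinity-agent              ; process name pmon watches
pidfile  = /var/run/pci-irq-affinity-agent.pid
script   = /etc/init.d/pci-irq-affinity-agent
style    = lsb                                 ; init-script managed process
severity = major                               ; matches the 200.006 severity
restarts = 3                                   ; restart attempts per interval
interval = 5                                   ; seconds between restart attempts
debounce = 20                                  ; seconds the process must stay up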

Severity
--------
Major

Steps to Reproduce
------------------
- Precondition: the stx-openstack application is applied, all pods are running/completed, and the system is healthy.
- Launch a few VMs on the same compute host (e.g., compute-0).
- system host-lock compute-0 --force
- system host-unlock compute-0
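
For reference, these steps can be scripted. A minimal Python sketch, assuming it runs on the active controller with keystone admin credentials sourced; the host name, timeout, and poll interval are illustrative:

import subprocess
import time

HOST = "compute-0"  # worker under test (illustrative)

def run(cmd):
    return subprocess.run(cmd, check=True, capture_output=True,
                          text=True).stdout

run(["system", "host-lock", HOST, "--force"])
run(["system", "host-unlock", HOST])

# Poll 'nova hypervisor-list' until the host reports up, or give up.
deadline = time.time() + 30 * 60
while time.time() < deadline:
    out = run(["nova", "hypervisor-list"])
    row = next((line for line in out.splitlines() if HOST in line), "")
    if "| up" in row:
        print("%s hypervisor is up" % HOST)
        break
    time.sleep(60)
else:
    raise SystemExit("%s hypervisor still down - bug reproduced" % HOST)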

TC-name:
test_force_lock_with_mig_vms
test_force_lock_with_non_mig_vms

Expected Behavior
------------------
- compute-0 recovers and becomes available; the hypervisor is up in nova hypervisor-list and no new alarms are raised on the system.

Actual Behavior
----------------
- compute-0 remains degraded and its hypervisor stays down.
The following alarm is generated:
200.006 | compute-0 is degraded due to the failure of its 'pci-irq-affinity-agent' process. Auto recovery of this major process is in progress.

The following pods are stuck in Init status:
NAMESPACE NAME READY STATUS RESTARTS AGE IP NODE NOMINATED NODE READINESS GATES
openstack libvirt-libvirt-default-p9sv8 0/1 Init:0/3 0 54m 192.168.204.138 compute-0 <none> <none>
openstack neutron-dhcp-agent-compute-0-5621f953-q6fs2 0/1 Init:0/1 0 54m 192.168.204.138 compute-0 <none> <none>
openstack neutron-l3-agent-compute-0-5621f953-dgw26 0/1 Init:0/1 0 54m 192.168.204.138 compute-0 <none> <none>
openstack neutron-metadata-agent-compute-0-5621f953-9gss7 0/1 Init:0/2 0 54m 192.168.204.138 compute-0 <none> <none>
openstack nova-compute-compute-0-5621f953-pk2rz 0/2 Init:0/6 0 54m 192.168.204.138 compute-0 <none> <none>

Reproducibility
---------------
Intermittent
Test fails about 50% of the time.

System Configuration
--------------------
Multi-node system
Lab-name: Wolfpass3-7

Branch/Pull Time/Commit
-----------------------
stx master as of 20190803T013000Z

Last Pass
---------
Same load and lab; it is not clear when the issue was introduced.

Timestamp/Logs
--------------
[2019-08-03 20:25:15,108] 301 DEBUG MainThread ssh.send :: Send 'system --os-username 'admin' --os-password 'Li69nux*' --os-project-name admin --os-auth-url http://192.168.204.2:5000/v3 --os-user-domain-name Default --os-project-domain-name Default --os-endpoint-type internalURL --os-region-name RegionOne host-lock compute-0 --force'

[2019-08-03 20:31:04,994] 301 DEBUG MainThread ssh.send :: Send 'system --os-username 'admin' --os-password 'Li69nux*' --os-project-name admin --os-auth-url http://192.168.204.2:5000/v3 --os-user-domain-name Default --os-project-domain-name Default --os-endpoint-type internalURL --os-region-name RegionOne host-unlock compute-0'

[2019-08-03 20:46:21,382] 301 DEBUG MainThread ssh.send :: Send 'fm --os-username 'admin' --os-password 'Li69nux*' --os-project-name admin --os-auth-url http://192.168.204.2:5000/v3 --os-user-domain-name Default --os-project-domain-name Default --os-endpoint-type internalURL --os-region-name RegionOne alarm-list --nowrap --uuid'
+--------------------------------------+----------+---------------------------------------------------------------------------------------------------------------------------------------+-----------------------------------------------+----------+----------------------------+
| UUID | Alarm ID | Reason Text | Entity ID | Severity | Time Stamp |
+--------------------------------------+----------+---------------------------------------------------------------------------------------------------------------------------------------+-----------------------------------------------+----------+----------------------------+
| 1016f7f6-4f15-4ca0-8476-e218ca594212 | 200.006 | compute-0 is degraded due to the failure of its 'pci-irq-affinity-agent' process. Auto recovery of this major process is in progress. | host=compute-0.process=pci-irq-affinity-agent | major | 2019-08-03T20:35:06.545527 |

[2019-08-03 21:29:39,562] 301 DEBUG MainThread ssh.send :: Send 'kubectl get pod --all-namespaces -o=wide --field-selector=status.phase!=Running,status.phase!=Succeeded'
NAMESPACE NAME READY STATUS RESTARTS AGE IP NODE NOMINATED NODE READINESS GATES
openstack libvirt-libvirt-default-p9sv8 0/1 Init:0/3 0 54m 192.168.204.138 compute-0 <none> <none>
openstack neutron-dhcp-agent-compute-0-5621f953-q6fs2 0/1 Init:0/1 0 54m 192.168.204.138 compute-0 <none> <none>
openstack neutron-l3-agent-compute-0-5621f953-dgw26 0/1 Init:0/1 0 54m 192.168.204.138 compute-0 <none> <none>
openstack neutron-metadata-agent-compute-0-5621f953-9gss7 0/1 Init:0/2 0 54m 192.168.204.138 compute-0 <none> <none>
openstack nova-compute-compute-0-5621f953-pk2rz 0/2 Init:0/6 0 54m 192.168.204.138 compute-0 <none> <none>
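
The same check can be scripted. A minimal sketch using the kubernetes Python client, assuming a kubeconfig is available on the controller (illustrative, not part of the original test):

from kubernetes import client, config

config.load_kube_config()
v1 = client.CoreV1Api()
# Mirror the kubectl field selector above: anything not Running/Succeeded.
pods = v1.list_pod_for_all_namespaces(
    field_selector="status.phase!=Running,status.phase!=Succeeded")
for pod in pods.items:
    stuck = [s.name for s in (pod.status.init_container_statuses or [])
             if not s.ready]
    print(pod.metadata.namespace, pod.metadata.name,
          "waiting on init containers:", stuck)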

[2019-08-03 21:29:42,693] 423 DEBUG MainThread ssh.expect :: Output:
+----+---------------------+-----------------+----------------+-------+
| ID | Hypervisor Hostname | Hypervisor Type | Host IP | State |
+----+---------------------+-----------------+----------------+-------+
| 5 | compute-0 | QEMU | 192.168.206.86 | down |
| 8 | compute-1 | QEMU | 192.168.206.11 | up |
| 11 | compute-2 | QEMU | 192.168.206.21 | up |
+----+---------------------+-----------------+----------------+-------+
[sysadmin@controller-0 ~(keystone_admin)]$

Test Activity
-------------
Regression Testing

Revision history for this message
Yang Liu (yliu12) wrote :
tags: added: stx.regression
Numan Waheed (nwaheed)
tags: added: stx.retestneeded
Revision history for this message
Frank Miller (sensfan22) wrote :

Marking high priority as the nova service does not recover. Assigning to Jim to investigate why the nova hypervisor is not recovering after the force lock/unlock.

Changed in starlingx:
status: New → Triaged
importance: Undecided → High
assignee: nobody → Jim Gauld (jgauld)
tags: added: stx.2.0 stx.metal
Revision history for this message
Austin Sun (sunausti) wrote :

from compute-0 platform.log
2019-08-03T20:02:49.968 compute-0 2019-08-03 info 20:02:49,968 MainThread[21551] pci-interrupt-affinity./usr/lib64/python2.7/site-packages/pci_irq_affinity/agent.py.176 - INFO Enter PCIInterruptAffinity Agent
2019-08-03T20:02:49.972 compute-0 2019-08-03 info 20:02:49,972 MainThread[21551] pci-interrupt-affinity./usr/lib64/python2.7/site-packages/pci_irq_affinity/agent.py.194 - INFO 'NoneType' object has no attribute 'getCapabilities'
2019-08-03T20:02:49.972 compute-0 2019-08-03 err 20:02:49,972 MainThread[21551] pci-interrupt-affinity./usr/lib64/python2.7/site-packages/pci_irq_affinity/agent.py.198 - ERROR proces_main finalized!!!
This error occurs because the agent cannot connect to libvirt.
The pci-irq-affinity agent did not return to normal until 2019-08-04T04:17:54.441 on compute-0, but there is no libvirt container log from when the issue occurred. The issue needs to be reproduced and the libvirt container log captured at the time it happens.
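
To illustrate the failure mode described above: the log suggests the agent obtained a None libvirt connection (the libvirt pod was still stuck in Init) and then called getCapabilities() on it, after which the agent's main loop exited ('proces_main finalized!!!'). A hedged sketch of a defensive connection helper; the names are illustrative and not the actual agent source:

import time

import libvirt  # python libvirt bindings

def open_libvirt(uri="qemu:///system", retries=30, delay=10):
    # Retry until libvirtd is reachable instead of exiting on first failure.
    for _ in range(retries):
        try:
            conn = libvirt.openReadOnly(uri)
            if conn is not None:
                conn.getCapabilities()  # raises if libvirtd is not really up
                return conn
        except libvirt.libvirtError:
            pass  # libvirt pod may still be initializing
        time.sleep(delay)
    raise RuntimeError("libvirt did not become reachable")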

Revision history for this message
Ghada Khalil (gkhalil) wrote :

Reducing the priority. There are no plans to investigate stx.2.0 issues, as there are no plans to build another maintenance release for stx.2.0.

Changed in starlingx:
importance: High → Low
Revision history for this message
Peng Peng (ppeng) wrote :

The issue appears to have been reproduced on:
Lab: wcp_122
Build ID: 2020-06-27_18-35-20
Software Version: 20.06
Patches:
WRCP_20.06_PATCH_0001
WRCP_20.06_PATCH_0002
WRCP_20.06_PATCH_0003
WRCP_20.06_PATCH_0004
WRCP_20.06_PATCH_0005

After the hbsClient process was killed, the system rebooted. After the system booted up, event-list showed:
controller-0 is degraded due to the failure of its 'pci-irq-affinity-agent' process.

[2020-10-10 18:40:45,421] 314 DEBUG MainThread ssh.send :: Send 'fm --os-username 'admin' --os-password 'Li69nux*' --os-project-name admin --os-auth-url http://192.168.204.1:5000/v3 --os-user-domain-name Default --os-project-domain-name Default --os-endpoint-type internalURL --os-region-name RegionOne event-list -l 10 -q 'event_log_id=200.006;start=2020-10-10T16:58:01.513375' --nowrap --nopaging --uuid'
[2020-10-10 18:40:46,687] 436 DEBUG MainThread ssh.expect :: Output:
+--------------------------------------+----------------------------+-------+--------------+------------------------------------------------------------------------------------------------------------------------------------------+--------------------------------------------------+----------+
| UUID | Time Stamp | State | Event Log ID | Reason Text | Entity Instance ID | Severity |
+--------------------------------------+----------------------------+-------+--------------+------------------------------------------------------------------------------------------------------------------------------------------+--------------------------------------------------+----------+
| cc911600-df02-4f54-813c-a1d7278836e4 | 2020-10-10T16:58:01.513375 | clear | 200.006 | controller-0 is degraded due to the failure of its 'pci-irq-affinity-agent' process. Auto recovery of this major process is in progress. | host=controller-0.process=pci-irq-affinity-agent | major |
+--------------------------------------+----------------------------+-------+--------------+------------------------------------------------------------------------------------------------------------------------------------------+--------------------------------------------------+----------+

collect log:
https://files.starlingx.kube.cengn.ca/launchpad/1839160

Revision history for this message
Ghada Khalil (gkhalil) wrote :

Closing as stx.2.0 is EOL as of Aug 2020.

Changed in starlingx:
status: Triaged → Won't Fix
Ghada Khalil (gkhalil)
tags: removed: stx.retestneeded