hypervisor stays down after force lock and unlock due to pci-irq-affinity-agent process failure

Bug #1839160 reported by Yang Liu
Affects: StarlingX
Status: Won't Fix
Importance: Low
Assigned to: Jim Gauld

Bug Description

Brief Description
-----------------
After a force lock followed by an unlock of a worker node, the node gets stuck in the degraded state with its hypervisor down. The following alarm is seen on the system:
200.006 | compute-0 is degraded due to the failure of its 'pci-irq-affinity-agent' process. Auto recovery of this major process is in progress.
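
For context, alarm 200.006 is raised while the platform process monitor (pmon) attempts to auto-recover a process it supervises; the host is held in the degraded state until the process stays up. A representative sketch of a pmon conf for this agent follows; the field names and values are assumptions modeled on typical StarlingX /etc/pmon.d/ entries, not copied from the failing system:

[process]
process  = pci-irq-affinity-agent              ; process name pmon watches
pidfile  = /var/run/pci-irq-affinity-agent.pid
script   = /etc/init.d/pci-irq-affinity-agent
style    = lsb                                 ; init-script managed process
severity = major                               ; matches the 200.006 severity
restarts = 3                                   ; restart attempts per interval
interval = 5                                   ; seconds between restart attempts
debounce = 20                                  ; seconds the process must stay up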

Severity
--------
Major

Steps to Reproduce
------------------
- Precondition: the stx-openstack application is applied, all pods are running/completed, and the system is healthy.
- Launch a few VMs on the same compute host (e.g., compute-0).
- system host-lock compute-0 --force
- system host-unlock compute-0
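
For reference, these steps can be scripted. A minimal Python sketch, assuming it runs on the active controller with keystone admin credentials sourced; the host name, timeout, and poll interval are illustrative:

import subprocess
import time

HOST = "compute-0"  # worker under test (illustrative)

def run(cmd):
    return subprocess.run(cmd, check=True, capture_output=True,
                          text=True).stdout

run(["system", "host-lock", HOST, "--force"])
run(["system", "host-unlock", HOST])

# Poll 'nova hypervisor-list' until the host reports up, or give up.
deadline = time.time() + 30 * 60
while time.time() < deadline:
    out = run(["nova", "hypervisor-list"])
    row = next((line for line in out.splitlines() if HOST in line), "")
    if "| up" in row:
        print("%s hypervisor is up" % HOST)
        break
    time.sleep(60)
else:
    raise SystemExit("%s hypervisor still down - bug reproduced" % HOST)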

TC-name:
test_force_lock_with_mig_vms
test_force_lock_with_non_mig_vms

Expected Behavior
------------------
- compute-0 recovers and becomes available; the hypervisor is up in nova hypervisor-list and no new alarms are raised on the system.

Actual Behavior
----------------
- compute-0 remains degraded and its hypervisor stays down.
The following alarm is generated:
200.006 | compute-0 is degraded due to the failure of its 'pci-irq-affinity-agent' process. Auto recovery of this major process is in progress.

The following pods are stuck in Init status:
NAMESPACE NAME READY STATUS RESTARTS AGE IP NODE NOMINATED NODE READINESS GATES
openstack libvirt-libvirt-default-p9sv8 0/1 Init:0/3 0 54m 192.168.204.138 compute-0 <none> <none>
openstack neutron-dhcp-agent-compute-0-5621f953-q6fs2 0/1 Init:0/1 0 54m 192.168.204.138 compute-0 <none> <none>
openstack neutron-l3-agent-compute-0-5621f953-dgw26 0/1 Init:0/1 0 54m 192.168.204.138 compute-0 <none> <none>
openstack neutron-metadata-agent-compute-0-5621f953-9gss7 0/1 Init:0/2 0 54m 192.168.204.138 compute-0 <none> <none>
openstack nova-compute-compute-0-5621f953-pk2rz 0/2 Init:0/6 0 54m 192.168.204.138 compute-0 <none> <none>

Reproducibility
---------------
Intermittent
Test fails about 50% of the time.

System Configuration
--------------------
Multi-node system
Lab-name: Wolfpass3-7

Branch/Pull Time/Commit
-----------------------
stx master as of 20190803T013000Z

Last Pass
---------
Same load and lab; it is not clear when the issue was introduced.

Timestamp/Logs
--------------
[2019-08-03 20:25:15,108] 301 DEBUG MainThread ssh.send :: Send 'system --os-username 'admin' --os-password 'Li69nux*' --os-project-name admin --os-auth-url http://192.168.204.2:5000/v3 --os-user-domain-name Default --os-project-domain-name Default --os-endpoint-type internalURL --os-region-name RegionOne host-lock compute-0 --force'

[2019-08-03 20:31:04,994] 301 DEBUG MainThread ssh.send :: Send 'system --os-username 'admin' --os-password 'Li69nux*' --os-project-name admin --os-auth-url http://192.168.204.2:5000/v3 --os-user-domain-name Default --os-project-domain-name Default --os-endpoint-type internalURL --os-region-name RegionOne host-unlock compute-0'

[2019-08-03 20:46:21,382] 301 DEBUG MainThread ssh.send :: Send 'fm --os-username 'admin' --os-password 'Li69nux*' --os-project-name admin --os-auth-url http://192.168.204.2:5000/v3 --os-user-domain-name Default --os-project-domain-name Default --os-endpoint-type internalURL --os-region-name RegionOne alarm-list --nowrap --uuid'
+--------------------------------------+----------+---------------------------------------------------------------------------------------------------------------------------------------+-----------------------------------------------+----------+----------------------------+
| UUID | Alarm ID | Reason Text | Entity ID | Severity | Time Stamp |
+--------------------------------------+----------+---------------------------------------------------------------------------------------------------------------------------------------+-----------------------------------------------+----------+----------------------------+
| 1016f7f6-4f15-4ca0-8476-e218ca594212 | 200.006 | compute-0 is degraded due to the failure of its 'pci-irq-affinity-agent' process. Auto recovery of this major process is in progress. | host=compute-0.process=pci-irq-affinity-agent | major | 2019-08-03T20:35:06.545527 |

[2019-08-03 21:29:39,562] 301 DEBUG MainThread ssh.send :: Send 'kubectl get pod --all-namespaces -o=wide --field-selector=status.phase!=Running,status.phase!=Succeeded'
NAMESPACE NAME READY STATUS RESTARTS AGE IP NODE NOMINATED NODE READINESS GATES
openstack libvirt-libvirt-default-p9sv8 0/1 Init:0/3 0 54m 192.168.204.138 compute-0 <none> <none>
openstack neutron-dhcp-agent-compute-0-5621f953-q6fs2 0/1 Init:0/1 0 54m 192.168.204.138 compute-0 <none> <none>
openstack neutron-l3-agent-compute-0-5621f953-dgw26 0/1 Init:0/1 0 54m 192.168.204.138 compute-0 <none> <none>
openstack neutron-metadata-agent-compute-0-5621f953-9gss7 0/1 Init:0/2 0 54m 192.168.204.138 compute-0 <none> <none>
openstack nova-compute-compute-0-5621f953-pk2rz 0/2 Init:0/6 0 54m 192.168.204.138 compute-0 <none> <none>
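
The same check can be scripted. A minimal sketch using the kubernetes Python client, assuming a kubeconfig is available on the controller (illustrative, not part of the original test):

from kubernetes import client, config

config.load_kube_config()
v1 = client.CoreV1Api()
# Mirror the kubectl field selector above: anything not Running/Succeeded.
pods = v1.list_pod_for_all_namespaces(
    field_selector="status.phase!=Running,status.phase!=Succeeded")
for pod in pods.items:
    stuck = [s.name for s in (pod.status.init_container_statuses or [])
             if not s.ready]
    print(pod.metadata.namespace, pod.metadata.name,
          "waiting on init containers:", stuck)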

[2019-08-03 21:29:42,693] 423 DEBUG MainThread ssh.expect :: Output:
+----+---------------------+-----------------+----------------+-------+
| ID | Hypervisor Hostname | Hypervisor Type | Host IP | State |
+----+---------------------+-----------------+----------------+-------+
| 5 | compute-0 | QEMU | 192.168.206.86 | down |
| 8 | compute-1 | QEMU | 192.168.206.11 | up |
| 11 | compute-2 | QEMU | 192.168.206.21 | up |
+----+---------------------+-----------------+----------------+-------+
[sysadmin@controller-0 ~(keystone_admin)]$

Test Activity
-------------
Regression Testing

Revision history for this message
Yang Liu (yliu12) wrote :
tags: added: stx.regression
Numan Waheed (nwaheed)
tags: added: stx.retestneeded
Revision history for this message
Frank Miller (sensfan22) wrote :

Marking high priority as the nova service does not recover. Assigning to Jim to investigate why the nova hypervisor is not recovering after the force lock/unlock.

Changed in starlingx:
status: New → Triaged
importance: Undecided → High
assignee: nobody → Jim Gauld (jgauld)
tags: added: stx.2.0 stx.metal
Revision history for this message
Austin Sun (sunausti) wrote :

from compute-0 platform.log
2019-08-03T20:02:49.968 compute-0 2019-08-03 info 20:02:49,968 MainThread[21551] pci-interrupt-affinity./usr/lib64/python2.7/site-packages/pci_irq_affinity/agent.py.176 - INFO Enter PCIInterruptAffinity Agent
2019-08-03T20:02:49.972 compute-0 2019-08-03 info 20:02:49,972 MainThread[21551] pci-interrupt-affinity./usr/lib64/python2.7/site-packages/pci_irq_affinity/agent.py.194 - INFO 'NoneType' object has no attribute 'getCapabilities'
2019-08-03T20:02:49.972 compute-0 2019-08-03 err 20:02:49,972 MainThread[21551] pci-interrupt-affinity./usr/lib64/python2.7/site-packages/pci_irq_affinity/agent.py.198 - ERROR proces_main finalized!!!
This error occurs because the agent cannot connect to libvirt.
The pci-irq-affinity agent did not return to normal until 2019-08-04T04:17:54.441 on compute-0, but there is no libvirt container log from when the issue occurred. The issue needs to be reproduced and the libvirt container log captured at the time it happens.
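
To illustrate the failure mode described above: the log suggests the agent obtained a None libvirt connection (the libvirt pod was still stuck in Init) and then called getCapabilities() on it, after which the agent's main loop exited ('proces_main finalized!!!'). A hedged sketch of a defensive connection helper; the names are illustrative and not the actual agent source:

import time

import libvirt  # python libvirt bindings

def open_libvirt(uri="qemu:///system", retries=30, delay=10):
    # Retry until libvirtd is reachable instead of exiting on first failure.
    for _ in range(retries):
        try:
            conn = libvirt.openReadOnly(uri)
            if conn is not None:
                conn.getCapabilities()  # raises if libvirtd is not really up
                return conn
        except libvirt.libvirtError:
            pass  # libvirt pod may still be initializing
        time.sleep(delay)
    raise RuntimeError("libvirt did not become reachable")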

Revision history for this message
Ghada Khalil (gkhalil) wrote :

Reducing the priority. There are no plans to investigate stx.2.0 issues, as there are no plans to build another maintenance release for stx.2.0.

Changed in starlingx:
importance: High → Low
Revision history for this message
Peng Peng (ppeng) wrote :

The issue appears to have been reproduced on:
Lab: wcp_122
Build ID: 2020-06-27_18-35-20
Software Version: 20.06
Patches:
WRCP_20.06_PATCH_0001
WRCP_20.06_PATCH_0002
WRCP_20.06_PATCH_0003
WRCP_20.06_PATCH_0004
WRCP_20.06_PATCH_0005

After the hbsClient process was killed, the system rebooted. After the system booted up, event-list showed:
controller-0 is degraded due to the failure of its 'pci-irq-affinity-agent' process.

[2020-10-10 18:40:45,421] 314 DEBUG MainThread ssh.send :: Send 'fm --os-username 'admin' --os-password 'Li69nux*' --os-project-name admin --os-auth-url http://192.168.204.1:5000/v3 --os-user-domain-name Default --os-project-domain-name Default --os-endpoint-type internalURL --os-region-name RegionOne event-list -l 10 -q 'event_log_id=200.006;start=2020-10-10T16:58:01.513375' --nowrap --nopaging --uuid'
[2020-10-10 18:40:46,687] 436 DEBUG MainThread ssh.expect :: Output:
+--------------------------------------+----------------------------+-------+--------------+------------------------------------------------------------------------------------------------------------------------------------------+--------------------------------------------------+----------+
| UUID | Time Stamp | State | Event Log ID | Reason Text | Entity Instance ID | Severity |
+--------------------------------------+----------------------------+-------+--------------+------------------------------------------------------------------------------------------------------------------------------------------+--------------------------------------------------+----------+
| cc911600-df02-4f54-813c-a1d7278836e4 | 2020-10-10T16:58:01.513375 | clear | 200.006 | controller-0 is degraded due to the failure of its 'pci-irq-affinity-agent' process. Auto recovery of this major process is in progress. | host=controller-0.process=pci-irq-affinity-agent | major |
+--------------------------------------+----------------------------+-------+--------------+------------------------------------------------------------------------------------------------------------------------------------------+--------------------------------------------------+----------+

collect log:
https://files.starlingx.kube.cengn.ca/launchpad/1839160

Revision history for this message
Ghada Khalil (gkhalil) wrote :

Closing as stx.2.0 is EOL as of Aug 2020.

Changed in starlingx:
status: Triaged → Won't Fix
Ghada Khalil (gkhalil)
tags: removed: stx.retestneeded