hypervisor stays down after force lock and unlock due to pci-irq-affinity-agent process failure

Bug #1839160 reported by Yang Liu on 2019-08-06
This bug affects 1 person
Affects: StarlingX
Status: Triaged
Importance: High
Assigned to: Jim Gauld

Bug Description

Brief Description
-----------------
After force locking and then unlocking a worker node, the node gets stuck in a degraded state with its hypervisor state down. The following alarm is seen on the system:
200.006 | compute-0 is degraded due to the failure of its 'pci-irq-affinity-agent' process. Auto recovery of this major process is in progress.

Severity
--------
Major

Steps to Reproduce
------------------
- Precondition: stx-openstack application is applied, all pods are running/completed, and the system is healthy.
- Launch a few VMs on the same compute host (e.g., compute-0).
- system host-lock compute-0 --force
- system host-unlock compute-0

TC-name:
test_force_lock_with_mig_vms
test_force_lock_with_non_mig_vms

Expected Behavior
------------------
- compute-0 recovers and becomes available; the hypervisor is Up in nova hypervisor-list; no new alarms appear on the system.

Actual Behavior
----------------
- compute-0 is degraded and the hypervisor stays Down.
The following alarm is generated:
200.006 | compute-0 is degraded due to the failure of its 'pci-irq-affinity-agent' process. Auto recovery of this major process is in progress.

The following pods are stuck in Init status (see the check commands after the table):
NAMESPACE NAME READY STATUS RESTARTS AGE IP NODE NOMINATED NODE READINESS GATES
openstack libvirt-libvirt-default-p9sv8 0/1 Init:0/3 0 54m 192.168.204.138 compute-0 <none> <none>
openstack neutron-dhcp-agent-compute-0-5621f953-q6fs2 0/1 Init:0/1 0 54m 192.168.204.138 compute-0 <none> <none>
openstack neutron-l3-agent-compute-0-5621f953-dgw26 0/1 Init:0/1 0 54m 192.168.204.138 compute-0 <none> <none>
openstack neutron-metadata-agent-compute-0-5621f953-9gss7 0/1 Init:0/2 0 54m 192.168.204.138 compute-0 <none> <none>
openstack nova-compute-compute-0-5621f953-pk2rz 0/2 Init:0/6 0 54m 192.168.204.138 compute-0 <none> <none>
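
The failure signature can be confirmed with the checks below, a sketch that mirrors the queries captured in the Timestamp/Logs section; the host and namespace names are the ones from this report.

# Check for the 200.006 process-failure alarm against compute-0.
fm alarm-list --nowrap --uuid | grep 200.006

# List pods that are not Running/Succeeded (the stuck Init pods above).
kubectl get pod --all-namespaces -o wide \
    --field-selector=status.phase!=Running,status.phase!=Succeeded

# Confirm the hypervisor state for compute-0.
nova hypervisor-list | grep compute-0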

Reproducibility
---------------
Intermittent
Test fails about 50% of the time.

System Configuration
--------------------
Multi-node system
Lab-name: Wolfpass3-7

Branch/Pull Time/Commit
-----------------------
stx master as of 20190803T013000Z

Last Pass
---------
Previously passed with the same load and lab; it is not clear when the issue was introduced.

Timestamp/Logs
--------------
[2019-08-03 20:25:15,108] 301 DEBUG MainThread ssh.send :: Send 'system --os-username 'admin' --os-password 'Li69nux*' --os-project-name admin --os-auth-url http://192.168.204.2:5000/v3 --os-user-domain-name Default --os-project-domain-name Default --os-endpoint-type internalURL --os-region-name RegionOne host-lock compute-0 --force'

[2019-08-03 20:31:04,994] 301 DEBUG MainThread ssh.send :: Send 'system --os-username 'admin' --os-password 'Li69nux*' --os-project-name admin --os-auth-url http://192.168.204.2:5000/v3 --os-user-domain-name Default --os-project-domain-name Default --os-endpoint-type internalURL --os-region-name RegionOne host-unlock compute-0'

[2019-08-03 20:46:21,382] 301 DEBUG MainThread ssh.send :: Send 'fm --os-username 'admin' --os-password 'Li69nux*' --os-project-name admin --os-auth-url http://192.168.204.2:5000/v3 --os-user-domain-name Default --os-project-domain-name Default --os-endpoint-type internalURL --os-region-name RegionOne alarm-list --nowrap --uuid'
+--------------------------------------+----------+---------------------------------------------------------------------------------------------------------------------------------------+-----------------------------------------------+----------+----------------------------+
| UUID | Alarm ID | Reason Text | Entity ID | Severity | Time Stamp |
+--------------------------------------+----------+---------------------------------------------------------------------------------------------------------------------------------------+-----------------------------------------------+----------+----------------------------+
| 1016f7f6-4f15-4ca0-8476-e218ca594212 | 200.006 | compute-0 is degraded due to the failure of its 'pci-irq-affinity-agent' process. Auto recovery of this major process is in progress. | host=compute-0.process=pci-irq-affinity-agent | major | 2019-08-03T20:35:06.545527 |

[2019-08-03 21:29:39,562] 301 DEBUG MainThread ssh.send :: Send 'kubectl get pod --all-namespaces -o=wide --field-selector=status.phase!=Running,status.phase!=Succeeded'
NAMESPACE NAME READY STATUS RESTARTS AGE IP NODE NOMINATED NODE READINESS GATES
openstack libvirt-libvirt-default-p9sv8 0/1 Init:0/3 0 54m 192.168.204.138 compute-0 <none> <none>
openstack neutron-dhcp-agent-compute-0-5621f953-q6fs2 0/1 Init:0/1 0 54m 192.168.204.138 compute-0 <none> <none>
openstack neutron-l3-agent-compute-0-5621f953-dgw26 0/1 Init:0/1 0 54m 192.168.204.138 compute-0 <none> <none>
openstack neutron-metadata-agent-compute-0-5621f953-9gss7 0/1 Init:0/2 0 54m 192.168.204.138 compute-0 <none> <none>
openstack nova-compute-compute-0-5621f953-pk2rz 0/2 Init:0/6 0 54m 192.168.204.138 compute-0 <none> <none>

[2019-08-03 21:29:42,693] 423 DEBUG MainThread ssh.expect :: Output:
+----+---------------------+-----------------+----------------+-------+
| ID | Hypervisor Hostname | Hypervisor Type | Host IP | State |
+----+---------------------+-----------------+----------------+-------+
| 5 | compute-0 | QEMU | 192.168.206.86 | down |
| 8 | compute-1 | QEMU | 192.168.206.11 | up |
| 11 | compute-2 | QEMU | 192.168.206.21 | up |
+----+---------------------+-----------------+----------------+-------+
[sysadmin@controller-0 ~(keystone_admin)]$

Test Activity
-------------
Regression Testing

Yang Liu (yliu12) wrote :
tags: added: stx.regression
Numan Waheed (nwaheed) on 2019-08-07
tags: added: stx.retestneeded
Frank Miller (sensfan22) wrote :

Marking as high priority since the nova service does not recover. Assigning to Jim to investigate why the nova hypervisor is not recovering after the force lock/unlock.

Changed in starlingx:
status: New → Triaged
importance: Undecided → High
assignee: nobody → Jim Gauld (jgauld)
tags: added: stx.2.0 stx.metal
Austin Sun (sunausti) wrote :

From compute-0 platform.log:
2019-08-03T20:02:49.968 compute-0 2019-08-03 info 20:02:49,968 MainThread[21551] pci-interrupt-affinity./usr/lib64/python2.7/site-packages/pci_irq_affinity/agent.py.176 - INFO Enter PCIInterruptAffinity Agent
2019-08-03T20:02:49.972 compute-0 2019-08-03 info 20:02:49,972 MainThread[21551] pci-interrupt-affinity./usr/lib64/python2.7/site-packages/pci_irq_affinity/agent.py.194 - INFO 'NoneType' object has no attribute 'getCapabilities'
2019-08-03T20:02:49.972 compute-0 2019-08-03 err 20:02:49,972 MainThread[21551] pci-interrupt-affinity./usr/lib64/python2.7/site-packages/pci_irq_affinity/agent.py.198 - ERROR proces_main finalized!!!
This error occurs because libvirt cannot be connected to.
The pci_irq_affinity agent did not get back to normal until 2019-08-04T04:17:54.441 on compute-0, but there is no libvirt container log from the time the issue happened. We need to reproduce the issue and capture the libvirt container log when it happens.
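
A sketch of the data that could be captured on the next reproduction, per the comment above; the pod selection, output file names, and the platform.log path on compute-0 are assumptions and the pod names would differ on a fresh reproduction.

# On the active controller: find the libvirt pod scheduled on compute-0 and
# capture its description and recent events (the pod is stuck in Init, so
# 'describe' and the init-container names it lists are the interesting parts).
POD=$(kubectl -n openstack get pod -o wide | awk '/libvirt/ && /compute-0/ {print $1}')
kubectl -n openstack describe pod "$POD" > libvirt-pod-describe.txt
kubectl -n openstack get events --sort-by='.lastTimestamp' > openstack-events.txt

# On compute-0: pull the pci-irq-affinity-agent errors from platform.log.
grep pci-interrupt-affinity /var/log/platform.log > pci-irq-affinity-agent.log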
