hypervisor stays down after force lock and unlock due to pci-irq-affinity-agent process failure
Affects | Status | Importance | Assigned to | Milestone
---|---|---|---|---
StarlingX | Won't Fix | Low | Jim Gauld |
Bug Description
Brief Description
-----------------
After a force lock followed by an unlock of a worker node, the node gets stuck in a degraded state with its hypervisor down. The following alarm is seen on the system:
200.006 | compute-0 is degraded due to the failure of its 'pci-irq-
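A quick sanity check on the degraded host (a minimal sketch; it assumes SSH access to compute-0, and the pmond log path is an assumption):
# On compute-0: confirm whether the monitored agent process is actually running
ps -ef | grep [p]ci-irq-affinity-agent
# The 200.006 alarm is raised by host process monitoring (pmond); its log
# should show the failure and any restart attempts (log path is an assumption):
sudo tail -n 100 /var/log/pmond.log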
Severity
--------
Major
Steps to Reproduce
------------------
- Precondition: the stx-openstack application is applied, all pods are running/completed, and the system is healthy.
- Launch a few VMs on the same compute host (e.g., compute-0), then force-lock and unlock it (a full sequence is sketched after these steps):
- system host-lock compute-0 --force
- system host-unlock compute-0
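The sequence above, sketched end to end (image, flavor, and network names are placeholders; pinning VMs to a host via the nova availability-zone syntax is an assumption about the lab setup):
source /etc/platform/openrc   # standard StarlingX admin credentials
# Launch a few VMs pinned to compute-0 (placeholder image/flavor/network names)
for i in 1 2 3; do
  openstack server create --image cirros --flavor m1.small \
    --network private-net --availability-zone nova:compute-0 vm-$i
done
# Force-lock, then unlock, the worker node
system host-lock compute-0 --force
system host-unlock compute-0
# Wait for the node to come back online, then check its state
system host-list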
TC-name:
test_force_
test_force_
Expected Behavior
------------------
- compute-0 recovers and becomes available; the hypervisor is up in nova hypervisor-list, and no new alarms are raised on the system
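The checks behind this expectation, as a sketch (the same CLIs used elsewhere in this report):
# Node should return to unlocked/enabled/available
system host-list
# Hypervisor should report "up" for compute-0
nova hypervisor-list
# No 200.006 alarm should remain against compute-0
fm alarm-list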
Actual Behavior
----------------
- compute-0 is degraded and the hypervisor stays down.
The following alarm is generated:
200.006 | compute-0 is degraded due to the failure of its 'pci-irq-
The following pods are stuck in Init status:
NAMESPACE NAME READY STATUS RESTARTS AGE IP NODE NOMINATED NODE READINESS GATES
openstack libvirt-
openstack neutron-
openstack neutron-
openstack neutron-
openstack nova-compute-
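For pods stuck in Init, the usual kubectl triage applies (the pod and init-container names below are placeholders for the truncated names above):
# Inspect init-container status and events for a stuck pod
kubectl -n openstack describe pod <pod-name>
# Fetch logs from a specific init container
kubectl -n openstack logs <pod-name> -c <init-container-name>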
Reproducibility
---------------
Intermittent; the test fails about 50% of the time.
System Configuration
--------------------
Multi-node system
Lab-name: Wolfpass3-7
Branch/Pull Time/Commit
-----------------------
stx master as of 20190803T013000Z
Last Pass
---------
The test last passed on the same load and lab; since the failure is intermittent, it is not clear when the issue was introduced.
Timestamp/Logs
--------------
[2019-08-03 20:25:15,108] 301 DEBUG MainThread ssh.send :: Send 'system --os-username 'admin' --os-password 'Li69nux*' --os-project-name admin --os-auth-url http://
[2019-08-03 20:31:04,994] 301 DEBUG MainThread ssh.send :: Send 'system --os-username 'admin' --os-password 'Li69nux*' --os-project-name admin --os-auth-url http://
[2019-08-03 20:46:21,382] 301 DEBUG MainThread ssh.send :: Send 'fm --os-username 'admin' --os-password 'Li69nux*' --os-project-name admin --os-auth-url http://
+------
| UUID | Alarm ID | Reason Text | Entity ID | Severity | Time Stamp |
+------
| 1016f7f6-
[2019-08-03 21:29:39,562] 301 DEBUG MainThread ssh.send :: Send 'kubectl get pod --all-namespaces -o=wide --field-
NAMESPACE NAME READY STATUS RESTARTS AGE IP NODE NOMINATED NODE READINESS GATES
openstack libvirt-
openstack neutron-
openstack neutron-
openstack neutron-
openstack nova-compute-
[2019-08-03 21:29:42,693] 423 DEBUG MainThread ssh.expect :: Output:
+----+---------------------+-----------------+----------------+-------+
| ID | Hypervisor Hostname | Hypervisor Type | Host IP        | State |
+----+---------------------+-----------------+----------------+-------+
| 5  | compute-0           | QEMU            | 192.168.206.86 | down  |
| 8  | compute-1           | QEMU            | 192.168.206.11 | up    |
| 11 | compute-2           | QEMU            | 192.168.206.21 | up    |
+----+---------------------+-----------------+----------------+-------+
[sysadmin@
Test Activity
-------------
Regression Testing
tags: added: stx.regression
tags: added: stx.retestneeded
tags: removed: stx.retestneeded
Marking high priority as the nova service does not recover. Assigning to Jim to investigate why the nova hypervisor is not recovering after the force lock/unlock.
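One possible starting point for that investigation (a sketch; it assumes the agent is a pmon-monitored host process, consistent with the 200.006 alarm text, and the systemd service name is an assumption):
# On compute-0: restart the failed agent manually (service name is an assumption)
sudo systemctl restart pci-irq-affinity-agent
# Then check whether the alarm clears and the hypervisor recovers
fm alarm-list
nova hypervisor-list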