controller-0 is degraded due to failure of pci-irq-affinity-agent process after reboot - libvirt failed to connect to ovs

Bug #1832047 reported by Peng Peng
This bug affects 2 people
Affects: StarlingX
Status: Invalid
Importance: Medium
Assigned to: cheng li
Milestone: —

Bug Description

Brief Description
-----------------
In a DX system, after an active controller reboot, the system failed to recover within 180 seconds (as in LP-1829432), and alarm 200.006 was raised: "controller-0 is degraded due to the failure of its 'pci-irq-affinity-agent' process. Auto recovery of this major process is in progress." This alarm was not cleared within 1 hour.

Severity
--------
Major

Steps to Reproduce
------------------
reboot active controller

TC-name: test_evacuate_vms

Expected Behavior
------------------
The 200.006 alarm should be cleared after the system recovers from the reboot

Actual Behavior
----------------
200.006 alarm not cleared

Reproducibility
---------------
Intermittent

System Configuration
--------------------
Two node system

Lab-name:

Branch/Pull Time/Commit
-----------------------
stx master as of 20190604T144018Z

Last Pass
---------
2019-06-03_18-34-53

Timestamp/Logs
--------------
controller-0:~$
[2019-06-07 10:48:47,307] 139 INFO MainThread host_helper.reboot_hosts:: Rebooting active controller: controller-0
[2019-06-07 10:48:47,307] 262 DEBUG MainThread ssh.send :: Send 'sudo reboot -f'

[2019-06-07 10:54:02,726] 262 DEBUG MainThread ssh.send :: Send 'fm --os-username 'admin' --os-password 'Li69nux*' --os-project-name admin --os-auth-url http://192.168.204.2:5000/v3 --os-user-domain-name Default --os-project-domain-name Default --os-endpoint-type internalURL --os-region-name RegionOne alarm-list --nowrap --uuid'
[2019-06-07 10:54:04,803] 387 DEBUG MainThread ssh.expect :: Output:
+--------------------------------------+----------+-----------------------------------------------------------------------------------------------------------------------------------------------------------+------------------------------------------------------------+----------+----------------------------+
| UUID | Alarm ID | Reason Text | Entity ID | Severity | Time Stamp |
+--------------------------------------+----------+-----------------------------------------------------------------------------------------------------------------------------------------------------------+------------------------------------------------------------+----------+----------------------------+
| 08339a9b-5d53-4145-be3e-2e9458ba61b8 | 200.006 | controller-0 is degraded due to the failure of its 'pci-irq-affinity-agent' process. Auto recovery of this major process is in progress. | host=controller-0.process=pci-irq-affinity-agent | major | 2019-06-07T10:54:04.046760 |

[2019-06-07 11:45:46,732] 262 DEBUG MainThread ssh.send :: Send 'fm --os-username 'admin' --os-password 'Li69nux*' --os-project-name admin --os-auth-url http://192.168.204.2:5000/v3 --os-user-domain-name Default --os-project-domain-name Default --os-endpoint-type internalURL --os-region-name RegionOne alarm-list --nowrap --uuid'
[2019-06-07 11:45:48,315] 387 DEBUG MainThread ssh.expect :: Output:
+--------------------------------------+----------+------------------------------------------------------------------------------------------------------------------------------------------+--------------------------------------------------------------------------+----------+----------------------------+
| UUID | Alarm ID | Reason Text | Entity ID | Severity | Time Stamp |
+--------------------------------------+----------+------------------------------------------------------------------------------------------------------------------------------------------+--------------------------------------------------------------------------+----------+----------------------------+
                                                                  | host=controller-1.ntp | major | 2019-06-07T11:03:53.446426 |
| 08339a9b-5d53-4145-be3e-2e9458ba61b8 | 200.006 | controller-0 is degraded due to the failure of its 'pci-irq-affinity-agent' process. Auto recovery of this major process is in progress. | host=controller-0.process=pci-irq-affinity-agent | major | 2019-06-07T10:54:04.046760 |

Test Activity
-------------
Sanity

Revision history for this message
Peng Peng (ppeng) wrote :
Revision history for this message
Ghada Khalil (gkhalil) wrote :

This sounds similar to https://bugs.launchpad.net/starlingx/+bug/1828877 which was already fixed on 2019-05-15. Assigning to Zhipeng to investigate.

Changed in starlingx:
assignee: nobody → zhipeng liu (zhipengs)
Revision history for this message
zhipeng liu (zhipengs) wrote :

From the logs, it seems libvirtd did not restart normally after the reboot.
As a result, pci-irq-affinity-agent could not get a connection to libvirtd and exited.

Could anyone help check why libvirtd could not be started after the reboot?

Thanks!
Zhipeng
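
For context, a minimal sketch (not the agent's actual code) of how one could probe libvirtd availability with retries using the libvirt Python bindings; the connection URI, retry count, and delay are assumptions chosen for illustration:

```python
# Hypothetical probe: keep retrying the libvirt connection instead of exiting
# on the first failure after a reboot. URI and timings are assumptions.
import time
import libvirt

def wait_for_libvirt(uri='qemu:///system', retries=30, delay=10):
    """Return an open read-only connection, or None if libvirtd never comes up."""
    for attempt in range(1, retries + 1):
        try:
            conn = libvirt.openReadOnly(uri)
            if conn is not None and conn.isAlive():
                return conn
        except libvirt.libvirtError as err:
            print('libvirt not ready (attempt %d/%d): %s' % (attempt, retries, err))
        time.sleep(delay)
    return None
```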

Numan Waheed (nwaheed)
description: updated
tags: added: stx.retestneeded
Peng Peng (ppeng)
description: updated
Revision history for this message
Ghada Khalil (gkhalil) wrote :

Are there any clues in the libvirtd pod logs?

Revision history for this message
zhipeng liu (zhipengs) wrote :

The libvirt log is below; it seems libvirt exited abnormally.

zhipengl@zhipengl-nuc:~/ALL_NODES_20190607.144900/controller-0_20190607.144900/var/log/libvirt$ vi libvirtd.log
2019-06-07 09:07:24.839+0000: 900080: info : libvirt version: 4.7.0, package: 1.tis.100 (Unknown, 2019-06-04-16:54:06, 14b46d92114f)
2019-06-07 09:07:24.839+0000: 900080: info : hostname: controller-0
2019-06-07 09:07:24.839+0000: 900080: warning : qemuDomainObjTaint:7640 : Domain id=1 name='instance-00000004' uuid=b158aa0e-9d88-457a-aab0-828625f02c1c is tainted: high-privileges
2019-06-07 09:07:39.679+0000: 900082: error : virCommandWait:2591 : internal error: Child process (ovs-vsctl --timeout=5 get Interface vhub51236a2-f3 statistics:tx_errors) unexpected exit status 1: ovs-vsctl: no key "tx_errors" in Interface record "vhub51236a2-f3" column statistics

2019-06-07 09:24:35.225+0000: 900084: warning : qemuDomainObjTaint:7640 : Domain id=2 name='instance-00000008' uuid=5d285f2c-fc62-4060-be9c-46bc5ebf04d2 is tainted: high-privileges
2019-06-07 09:25:10.121+0000: 900082: error : virCommandWait:2591 : internal error: Child process (ovs-vsctl --timeout=5 get Interface vhu08e410c3-2a statistics:tx_errors) unexpected exit status 1: ovs-vsctl: no key "tx_errors" in Interface record "vhu08e410c3-2a" column statistics

2019-06-07 09:25:10.247+0000: 900082: error : virCommandWait:2591 : internal error: Child process (ovs-vsctl --timeout=5 get Interface vhu096be99f-d0 statistics:tx_errors) unexpected exit status 1: ovs-vsctl: no key "tx_errors" in Interface record "vhu096be99f-d0" column statistics
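
For reference, a rough Python illustration (not libvirt's code) of the ovs-vsctl query failing above; the point is that a missing statistics key makes the child process exit non-zero, which libvirtd then logs as an internal error:

```python
# Illustrative only: run the same ovs-vsctl query that libvirtd runs above,
# treating a missing "tx_errors" key as "no data yet" rather than a fatal error.
import subprocess

def tx_errors(iface):
    cmd = ['ovs-vsctl', '--timeout=5', 'get', 'Interface', iface,
           'statistics:tx_errors']
    result = subprocess.run(cmd, capture_output=True, text=True)
    if result.returncode != 0:
        # e.g. 'no key "tx_errors" in Interface record ... column statistics'
        return None
    return int(result.stdout.strip())
```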

The pci-irq-affinity-agent log entry below is from /var/log/platform.log:
2019-06-07T09:24:40.536 controller-0 2019-06-07 warning 09:24:40,535 Thread-37[1232790] pci-interrupt-affinity./usr/lib64/python2.7/site-packages/pci_irq_affinity/nova_provider.py.103 - WARNING Failed to access libvirt! error='NoneType' object has no attribute 'get'

Zhipeng
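
The 'NoneType' warning above is consistent with a helper returning None when libvirtd is unreachable and a caller then invoking .get() on that result. A hypothetical illustration (the function names are invented, not the agent's real API):

```python
# Hypothetical illustration of the warning text above; not the agent's real code.
def get_instance_info(conn):
    if conn is None:                      # libvirtd unreachable after the reboot
        return None
    return {'cpus': conn.getCPUMap()}     # real libvirt-python call

def cpus_for(instance_info):
    # Without this guard, a None result raises:
    #   AttributeError: 'NoneType' object has no attribute 'get'
    if instance_info is None:
        return []
    return instance_info.get('cpus', [])
```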

Revision history for this message
Ghada Khalil (gkhalil) wrote :

The errors seem related to ovs; assigning to the networking team to investigate further.

Changed in starlingx:
assignee: zhipeng liu (zhipengs) → Forrest Zhao (forrest.zhao)
tags: added: stx.networking
Changed in starlingx:
assignee: Forrest Zhao (forrest.zhao) → cheng li (chengli3)
Revision history for this message
Qianshui Jiang (qianshui) wrote :

Here are two questions about the configuration of the test environment:

1. Does the DX system in this bug description run in a virtual environment or on bare metal?

2. Which of the .tgz files below was used during environment setup?
    helm-charts-manifest-centos-dev-latest.tgz
    helm-charts-manifest-centos-dev-versioned.tgz
    helm-charts-manifest-centos-stable-latest.tgz
    helm-charts-manifest-centos-stable-versioned.tgz
    stx-openstack-1.0-14-centos-dev-latest.tgz
    stx-openstack-1.0-14-centos-dev-versioned.tgz
    stx-openstack-1.0-14-centos-stable-latest.tgz
    stx-openstack-1.0-14-centos-stable-versioned.tgz

Revision history for this message
cheng li (chengli3) wrote :

Hi PengPeng, could you please let us know which tarball was used? Was the deployment on a virtual environment or on bare metal? Thanks

Revision history for this message
Peng Peng (ppeng) wrote :

helm-charts-manifest-centos-stable-versioned.tgz

Revision history for this message
cheng li (chengli3) wrote :

Thanks PengPeng. Did you deploy the duplex environment on a virtual environment or on bare metal?

Revision history for this message
Peng Peng (ppeng) wrote :

on real lab, not virtual.

Revision history for this message
Ghada Khalil (gkhalil) wrote :

Marking as release gating for now until the networking team has had a chance to investigate.

Peng, Please clarify the frequency of occurrence of this issue. Was this a one time occurrence? How many times was this issue seen? If seen multiple times, does it happen on only one system or multiple systems?

tags: added: stx.2.0
Revision history for this message
Peng Peng (ppeng) wrote :

So far we have only seen this issue once, on a two-node system.

Revision history for this message
cheng li (chengli3) wrote :

@PengPeng, which vswitch_type were you using? The default 'none' or 'ovs-dpdk'?

https://wiki.openstack.org/wiki/StarlingX/Containers/Installation#Configure_the_vswitch_type_.28optional.29

Revision history for this message
Peng Peng (ppeng) wrote :

it should be ovs-dpdk

Ghada Khalil (gkhalil)
Changed in starlingx:
importance: Undecided → Medium
status: New → Triaged
Revision history for this message
Wendy Mitchell (wmitchellwr) wrote :

BUILD_ID="20190622T013000Z"
SYSTEM_NAME="yow-cgcs-wolfpass-03_07" (2+3 HW system)
VSWITCH_TYPE="ovs-dpdk"

[Failed in teardown in nova regression testcase
nova/test_force_lock_with_vms.py::test_force_lock_with_mig_vms]

$ system application-list
+---------------------+--------------------------------+-------------------------------+--------------------+---------+-----------+
| application | version | manifest name | manifest file | status | progress |
+---------------------+--------------------------------+-------------------------------+--------------------+---------+-----------+
| platform-integ-apps | 1.0-7 | platform-integration-manifest | manifest.yaml | applied | completed |
| stx-openstack | 1.0-16-centos-stable-versioned | armada-manifest | stx-openstack.yaml | applied | completed |

1. All instances that were running on compute-0 landed on compute-1 (when compute-0 was force locked at approx 2019-06-25 14:18:30)

nova hypervisor reverts to down and disabled on compute-0

]$ nova hypervisor-list
+--------------------------------------+---------------------+-------+----------+
| ID | Hypervisor hostname | State | Status |
+--------------------------------------+---------------------+-------+----------+
| a8aea911-05a5-4410-9f39-08630783d373 | compute-0 | down | disabled |
| ba4344e6-5d97-4b42-a67b-eabbc95b531b | compute-1 | up | enabled |
| a612e8ad-5026-4d9e-b103-339b6d94eefe | compute-2 | up | enabled |
+--------------------------------------+---------------------+-------+----------+
[sysadmin@controller-0 ~(keystone_admin)]$ date
Tue Jun 25 14:22:00 UTC 2019

compute-0 is unlocked ~ 2019-06-25 14:24:15

2 alarms are reported after this (from Horizon). Only the CPU alarm on compute-1 clears

200.006 compute-0 is degraded due to the failure of its 'pci-irq-affinity-agent' process. Auto recovery of this major process is in progress.
host=compute-0.process=pci-irq-affinity-agent major 2019-06-25T10:28:17

100.101 Platform CPU threshold exceeded ; threshold 90.00%, actual 93.43% host=compute-1 major 2019-06-25T10:33:11

+----+--------------+-------------+----------------+-------------+--------------+
| id | hostname | personality | administrative | operational | availability |
+----+--------------+-------------+----------------+-------------+--------------+
| 1 | controller-0 | controller | unlocked | enabled | available |
| 2 | controller-1 | controller | unlocked | enabled | available |
| 3 | compute-0 | worker | unlocked | disabled | intest |
| 4 | compute-1 | worker | unlocked | enabled | degraded |
| 5 | compute-2 | worker | unlocked | enabled | available

$ date
Tue Jun 25 14:28:15 UTC 2019

$ nova hypervisor-list;date
+--------------------------------------+---------------------+-------+---------+
| ID | Hypervisor hostname | State | Status |
+--------------------------------------+---------------------+-------+---------+
| a8aea911-05a5-4410-9f39-08630783d373 | compute-0 | down | enabled |
| ba4344e6-5d97-4b42-a67b-eabbc...


Revision history for this message
Wendy Mitchell (wmitchellwr) wrote :
Revision history for this message
Wendy Mitchell (wmitchellwr) wrote :
Revision history for this message
Wendy Mitchell (wmitchellwr) wrote :

2019-06-25T14:30:19.615 controller-0 VIM_Thread[2586686] INFO _host_state_enabling.py.73 Enable failed for compute-0.
2019-06-25T14:33:23.287 controller-0 VIM_Thread[2586686] INFO _host_state_enabling.py.73 Enable failed for compute-0.
...
2019-06-25T15:10:33.793 controller-0 VIM_Thread[2586686] INFO _host_state_enabling.py.73 Enable failed for compute-0.
2019-06-25T15:12:44.894 controller-0 VIM_Thread[2586686] INFO _host_state_enabling.py.73 Enable failed for compute-0.

Revision history for this message
Wendy Mitchell (wmitchellwr) wrote :
Revision history for this message
Wendy Mitchell (wmitchellwr) wrote :

Increasing the priority on this and adding the stx.regression tag, as this affects regression testing.

Live migration testcases that follow this testcase all fail as the hypervisor remains down.

nova/test_force_lock_with_vms.py::test_force_lock_with_mig_vms
[2019-06-24 00:33:59,249]
E Details: Host(s) ['compute-0'] are not up in hypervisor-list within timeout

767cd209-0853-439b-90c2-567988043d53 | 200.006 | compute-0 is degraded due to the failure of its 'pci-irq-affinity-agent' process. Auto recovery of this major process is in progress. | host=compute-0.process=pci-irq-affinity-agent | major | 2019-06-24T00:38:00.878998

FAIL 20190624 02:12:43 test_live_migrate_vm_positive[remote-0-0-None-2-volume-False]
FAIL 20190624 02:12:56 test_live_migrate_vm_positive[remote-1-0-dedicated-1-volume-False]
FAIL 20190624 02:13:08 test_live_migrate_vm_positive[remote-1-512-None-1-image-False]
FAIL 20190624 02:13:21 test_live_migrate_vm_positive[remote-0-512-dedicated-2-image_with_vol-False]
FAIL 20190624 02:13:33 test_live_migrate_vm_negative[remote-0-0-volume-True-LiveMigErr.BLOCK_MIG_UNSUPPORTED]
FAIL 20190624 02:13:46 test_live_migrate_vm_negative[remote-1-0-volume-True-LiveMigErr.BLOCK_MIG_UNSUPPORTED]
FAIL 20190624 02:13:58 test_live_migrate_vm_negative[remote-0-512-volume-True-LiveMigErr.BLOCK_MIG_UNSUPPORTED]
FAIL 20190624 02:14:11 test_live_migrate_vm_negative[remote-0-512-image-True-LiveMigErr.BLOCK_MIG_UNSUPPORTED]
FAIL 20190624 02:14:23 test_live_migrate_vm_negative[remote-0-0-image_with_vol-True-LiveMigErr.BLOCK_MIG_UNSUPPORTED]

tags: added: stx.regression
Revision history for this message
cheng li (chengli3) wrote :

I am trying to reproduce this bug, but one issue is blocking me from installing a duplex environment on bare metal.
I am still working on fixing that issue. I am not sure whether it is a bug, but I would like to paste the info here in case anyone else has met the same problem.

After unlocking the first controller node, neither ceph nor kubernetes came up. After checking the puppet log, I found the root cause could be that some management IP addresses were missing on eno2 (the management and cluster-host network interface). The eno2 interface was supposed to have 192.168.206.3/24 and 192.168.204.3/28, but it only has the .2 addresses. When I ran `ifdown eno2` and `ifup eno2`, error messages appeared.

What's weird is that we didn't hit this issue with the virtual duplex environment.

```
controller-0:/etc/sysconfig/network-scripts# ip a
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN group default qlen 1000
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
    inet 127.0.0.1/8 scope host lo
       valid_lft forever preferred_lft forever
    inet6 ::1/128 scope host
       valid_lft forever preferred_lft forever
2: enp24s0f0: <BROADCAST,MULTICAST> mtu 1500 qdisc mq state DOWN group default qlen 1000
    link/ether 3c:fd:fe:d2:10:24 brd ff:ff:ff:ff:ff:ff
3: enp24s0f1: <BROADCAST,MULTICAST> mtu 1500 qdisc mq state DOWN group default qlen 1000
    link/ether 3c:fd:fe:d2:10:25 brd ff:ff:ff:ff:ff:ff
4: eno1: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc mq state UP group default qlen 1000
    link/ether a4:bf:01:3a:53:52 brd ff:ff:ff:ff:ff:ff
    inet 10.239.48.72/24 brd 10.239.48.255 scope global eno1
       valid_lft forever preferred_lft forever
    inet 10.239.48.103/24 scope global secondary eno1
       valid_lft forever preferred_lft forever
    inet6 fe80::a6bf:1ff:fe3a:5352/64 scope link
       valid_lft forever preferred_lft forever
5: eno2: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc mq state UP group default qlen 1000
    link/ether a4:bf:01:3a:53:53 brd ff:ff:ff:ff:ff:ff
    inet 192.168.206.2/24 scope global eno2
       valid_lft forever preferred_lft forever
    inet 169.254.202.2/24 scope global eno2
       valid_lft forever preferred_lft forever
    inet 192.168.204.2/28 scope global eno2
       valid_lft forever preferred_lft forever
    inet6 fe80::a6bf:1ff:fe3a:5353/64 scope link
       valid_lft forever preferred_lft forever
6: enp175s0f0: <BROADCAST,MULTICAST> mtu 1500 qdisc mq state DOWN group default qlen 1000
    link/ether 3c:fd:fe:d2:10:58 brd ff:ff:ff:ff:ff:ff
7: enp175s0f1: <BROADCAST,MULTICAST> mtu 1500 qdisc mq state DOWN group default qlen 1000
    link/ether 3c:fd:fe:d2:10:59 brd ff:ff:ff:ff:ff:ff
8: docker0: <NO-CARRIER,BROADCAST,MULTICAST,UP> mtu 1500 qdisc noqueue state DOWN group default
    link/ether 02:42:f3:c5:8e:82 brd ff:ff:ff:ff:ff:ff
    inet 172.17.0.1/16 brd 172.17.255.255 scope global docker0
       valid_lft forever preferred_lft forever
```

```
controller-0:/etc/sysconfig/network-scripts# ifup eno2
ERROR : [/etc/sysconfig/network-scripts/ifup-ipv6] Global IPv6 forwarding is disabled in configuration, but not currently disabled in kernel
ERROR : [/etc/sysconfig/network-scripts/ifup-...
```

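As a quick way to confirm the finding above, a small check sketch; the interface name and expected addresses are taken from the comment and will differ per lab:

```python
# Hypothetical check: verify the expected management/cluster-host addresses
# are actually configured on the interface. Values are examples from the comment.
import subprocess

EXPECTED = {'192.168.206.3/24', '192.168.204.3/28'}

def missing_addresses(dev='eno2'):
    out = subprocess.run(['ip', '-o', 'addr', 'show', 'dev', dev],
                         capture_output=True, text=True).stdout
    present = set()
    for line in out.splitlines():
        parts = line.split()
        if len(parts) > 3 and parts[2] == 'inet':
            present.add(parts[3])          # e.g. '192.168.206.2/24'
    return EXPECTED - present
```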

Revision history for this message
Qianshui Jiang (qianshui) wrote :

We've tried to reproduce the bug, but we didn't see the same phenomenon when using the .iso file that came out on 6.7.
In our test, the ovs and libvirtd pods seem to work properly after rebooting controller-0. Here is the log information:

controller-0:~$ openstack hypervisor list
+----+---------------------+-----------------+---------------+-------+
| ID | Hypervisor Hostname | Hypervisor Type | Host IP | State |
+----+---------------------+-----------------+---------------+-------+
| 4 | controller-1 | QEMU | 192.168.206.4 | up |
| 6 | controller-0 | QEMU | 192.168.206.3 | up |
+----+---------------------+-----------------+---------------+-------+

controller-0:/home/wrsroot# kubectl logs -n openstack libvirt-libvirt-default-69mlr
'[' -n '' ']'
+ rm -f /var/run/libvirtd.pid
+ [[ -c /dev/kvm ]]
+ chmod 660 /dev/kvm
+ chown root:kvm /dev/kvm
+ CGROUPS=
+ for CGROUP in cpu rdma hugetlb
+ '[' -d /sys/fs/cgroup/cpu ']'
+ CGROUPS+=cpu,
+ for CGROUP in cpu rdma hugetlb
+ '[' -d /sys/fs/cgroup/rdma ']'
+ for CGROUP in cpu rdma hugetlb
+ '[' -d /sys/fs/cgroup/hugetlb ']'
+ CGROUPS+=hugetlb,
+ cgcreate -g cpu,hugetlb:/osh-libvirt
++ cat /proc/meminfo
++ grep HugePages_Total
++ tr -cd '[:digit:]'
+ hp_count=78225
+ '[' 078225 -gt 0 ']'
+ echo 'INFO: Detected hugepage count of '\''78225'\''. Enabling hugepage settings for libvirt/qemu.'
INFO: Detected hugepage count of '78225'. Enabling hugepage settings for libvirt/qemu.
++ grep KVM_HUGEPAGES=0 /etc/default/qemu-kvm
grep: /etc/default/qemu-kvm: No such file or directory
+ '[' -n '' ']'
+ echo KVM_HUGEPAGES=1
+ '[' '!' -d /dev/hugepages ']'
+ '[' -d /sys/fs/cgroup/hugetlb ']'
++ ls /sys/fs/cgroup/hugetlb/k8s-infra/hugetlb.1GB.limit_in_bytes /sys/fs/cgroup/hugetlb/k8s-infra/hugetlb.2MB.limit_in_bytes
+ limits='/sys/fs/cgroup/hugetlb/k8s-infra/hugetlb.1GB.limit_in_bytes
/sys/fs/cgroup/hugetlb/k8s-infra/hugetlb.2MB.limit_in_bytes'
+ for limit in '$limits'
+++ awk -F: '($2~/hugetlb/){print $3}' /proc/self/cgroup
++ dirname /k8s-infra/kubepods/besteffort/pod23696212-9be3-11e9-b11d-3cfdfed21024/ecce25603440da89fe3b6d5539f051e6b47943ac196f73e9afdc694a1113f38c
++ basename /sys/fs/cgroup/hugetlb/k8s-infra/hugetlb.1GB.limit_in_bytes
+ target=/sys/fs/cgroup/hugetlb//k8s-infra/kubepods/besteffort/pod23696212-9be3-11e9-b11d-3cfdfed21024/hugetlb.1GB.limit_in_bytes
+ '[' '!' -f /sys/fs/cgroup/hugetlb//k8s-infra/kubepods/besteffort/pod23696212-9be3-11e9-b11d-3cfdfed21024/hugetlb.1GB.limit_in_bytes ']'
++ cat /sys/fs/cgroup/hugetlb/k8s-infra/hugetlb.1GB.limit_in_bytes
+ echo 9223372036854771712
+ for limit in '$limits'
+++ awk -F: '($2~/hugetlb/){print $3}' /proc/self/cgroup
++ dirname /k8s-infra/kubepods/besteffort/pod23696212-9be3-11e9-b11d-3cfdfed21024/ecce25603440da89fe3b6d5539f051e6b47943ac196f73e9afdc694a1113f38c
++ basename /sys/fs/cgroup/hugetlb/k8s-infra/hugetlb.2MB.limit_in_bytes
+ target=/sys/fs/cgroup/hugetlb//k8s-infra/kubepods/besteffort/pod23696212-9be3-11e9-b11d-3cfdfed21024/hugetlb.2MB.limit_in_bytes
+ '[' '!' -f /sys/fs/cgroup/hugetlb//k8s-infra/kubepods/besteffort/pod23696212-9be3-11e9-b11d-3cfdfed21024/hugetlb.2MB.limit_in_bytes ']'
++ cat /sys/fs...


Revision history for this message
Wendy Mitchell (wmitchellwr) wrote :

still fails in weekly regression test for
Lab: WP_3_7
Load: 20190706T013000Z

test_force_lock_with_mig_vms
compute-0 is force locked in this testcase

@ [2019-07-06 19:41:00,533] compute-0 becomes online
| administrative | locked
| availability | online

@ [2019-07-06 19:42:04,408] compute-0 is unlocked via cli

@ [2019-07-06 19:42:11,330] compute-0 is in unlocking task
| administrative | locked |
| availability | online
| task | Unlocking

@ [2019-07-06 19:42:13,039] compute-0 appears to be in administrative state "unlocked" (but disabled, online)

@ [2019-07-06 19:42:17,833] compute-0 is offline (disabled)
| 3 | compute-0 | worker | unlocked | disabled | offline

@ [2019-07-06 19:44:34,969] compute-0 is 'intest'
| 3 | compute-0 | worker | unlocked | disabled | intest

@[2019-07-06 19:45:55,073] compute-0 becomes available
| 3 | compute-0 | worker | unlocked | enabled | available

However, the hypervisor remains down

[2019-07-06 19:51:11,109]
+----+---------------------+-----------------+-----------------+-------+
| ID | Hypervisor Hostname | Hypervisor Type | Host IP | State |
+----+---------------------+-----------------+-----------------+-------+
| 5 | compute-0 | QEMU | 192.168.206.89 | down |
| 8 | compute-2 | QEMU | 192.168.206.135 | up |
| 11 | compute-1 | QEMU | 192.168.206.21 | up |
+----+---------------------+-----------------+-----------------+-------+

fm-event.log
2019-07-06T19:44:45.000 controller-0 fmManager: info { "event_log_id" : "200.022", "reason_text" : "compute-0 is now 'online'", "entity_instance_id" : "host=compute-0.status=online", "severity" : "not-applicable", "state" : "msg", "timestamp" : "2019-07-06 19:44:45.983111" }
2019-07-06T19:44:56.000 controller-0 fmManager: info { "event_log_id" : "800.001", "reason_text" : "Storage Alarm Condition: HEALTH_WARN. Please check 'ceph -s' for more details.", "entity_instance_id" : "region=RegionOne.system=yow-cgcs-wolfpass-03_07.cluster=cf9dc322-6a2b-4f56-b3bb-8e059098aa4d", "severity" : "warning", "state" : "clear", "timestamp" : "2019-07-06 19:44:56.995349" }
2019-07-06T19:45:15.000 controller-0 fmManager: info { "event_log_id" : "200.006", "reason_text" : "compute-0 is degraded due to the failure of its 'pci-irq-affinity-agent' process. Auto recovery of this major process is in progress.", "entity_instance_id" : "region=RegionOne.system=yow-cgcs-wolfpass-03_07.host=compute-0.process=pci-irq-affinity-agent", "severity" : "major", "state" : "clear", "timestamp" : "2019-07-06 19:45:15.206717" }
2019-07-06T19:45:41.000 controller-0 fmManager: info { "event_log_id" : "200.006", "reason_text" : "compute-0 is degraded due to the failure of its 'pci-irq-affinity-agent' process. Auto recovery of this major process is in progress.", "entity_instance_id" : "region=RegionOne.system=yow-cgcs-wolfpass-03_07.host=compute-0.process=pci-irq-affinity-agent", "severity" : "major", "state" : "set", "timestamp" : "2019-07-06 19:45:41.623027" }
2019-07-06T19:45:43.000 contr...


Revision history for this message
Cristopher Lemus (cjlemusc) wrote :

This bug was reproduced on a Standard (2+2+2) with Build: 20190710T140851Z

During sanity execution, compute-1 is locked/unlocked by the automated test. One hour after the unlock action, the current status is:

[sysadmin@controller-0 ~(keystone_admin)]$ system host-list
+----+--------------+-------------+----------------+-------------+--------------+
| id | hostname | personality | administrative | operational | availability |
+----+--------------+-------------+----------------+-------------+--------------+
| 1 | controller-0 | controller | unlocked | enabled | available |
| 2 | compute-0 | worker | unlocked | enabled | available |
| 3 | compute-1 | worker | unlocked | enabled | degraded |
| 4 | controller-1 | controller | unlocked | enabled | available |
| 5 | storage-0 | storage | unlocked | enabled | available |
| 6 | storage-1 | storage | unlocked | enabled | available |
+----+--------------+-------------+----------------+-------------+--------------+

Reported by fm alarm-list
| 200.006 | compute-1 is degraded due to the failure of its 'pci-irq-affinity-agent' process. Auto recovery of | host=compute-1.process=pci- | major | 2019-07-10T17:25: |
| | this major process is in progress. | irq-affinity-agent | | 18.738181 |
| | | | | |

Hypervisor down:
+----+---------------------+-----------------+-----------------+-------+
| ID | Hypervisor Hostname | Hypervisor Type | Host IP | State |
+----+---------------------+-----------------+-----------------+-------+
| 5 | compute-1 | QEMU | 192.168.206.182 | down |
| 8 | compute-0 | QEMU | 192.168.206.147 | up |
+----+---------------------+-----------------+-----------------+-------+

Full collect attached.

Just a note: we got the same event on some other configurations, but in those cases the compute node recovers after a few minutes (under ~5 minutes).
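
To distinguish the stuck case from the configurations that recover within a few minutes, a simple watcher along these lines could be used (it assumes admin credentials are already sourced so that `fm alarm-list` works, as in the logs above):

```python
# Hypothetical watcher: poll "fm alarm-list" (the same command used earlier in
# this report) until the 200.006 pci-irq-affinity-agent alarm clears or we time out.
import subprocess
import time

def alarm_present():
    out = subprocess.run(['fm', 'alarm-list', '--nowrap', '--uuid'],
                         capture_output=True, text=True).stdout
    return '200.006' in out and 'pci-irq-affinity-agent' in out

def wait_for_clear(timeout=3600, poll=60):
    """Return True if the alarm clears within `timeout` seconds."""
    deadline = time.time() + timeout
    while time.time() < deadline:
        if not alarm_present():
            return True
        time.sleep(poll)
    return False
```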

Ghada Khalil (gkhalil)
summary: - 200.006 alarm "controller-0 is degraded due to the failure of its 'pci-
- irq-affinity-agent' process" after reboot
+ controller-0 is degraded due to failure of pci-irq-affinity-agent
+ process after reboot - libvirt failed to connect to ovs
Revision history for this message
Ghada Khalil (gkhalil) wrote :

Cristopher, the occurrence you report related to the 'pci-irq-affinity-agent' process is not the same as the original issue. Please open a separate Launchpad bug for it.

Revision history for this message
Wendy Mitchell (wmitchellwr) wrote :
Revision history for this message
Wendy Mitchell (wmitchellwr) wrote :
Revision history for this message
Wendy Mitchell (wmitchellwr) wrote :

Please refer to the controller-0*.tgz and compute-0*.tgz log collections (attached above) from the following regression run.

Lab: WP_3_7
Load: 20190706T013000Z

[2019-07-06 19:29:02,166] 423 DEBUG MainThread ssh.expect :: Output:
+----+---------------------+-----------------+-----------------+-------+
| ID | Hypervisor Hostname | Hypervisor Type | Host IP | State |
+----+---------------------+-----------------+-----------------+-------+
| 5 | compute-0 | QEMU | 192.168.206.89 | up |
| 8 | compute-2 | QEMU | 192.168.206.135 | up |
| 11 | compute-1 | QEMU | 192.168.206.21 | up |
+----+---------------------+-----------------+-----------------+----

[2019-07-06 19:31:53,399] 423 DEBUG MainThread ssh.expect :: Output:
+----+---------------------+-----------------+-----------------+-------+
| ID | Hypervisor Hostname | Hypervisor Type | Host IP | State |
+----+---------------------+-----------------+-----------------+-------+
| 5 | compute-0 | QEMU | 192.168.206.89 | up |
| 8 | compute-2 | QEMU | 192.168.206.135 | up |
| 11 | compute-1 | QEMU | 192.168.206.21 | up |
+----+---------------------+-----------------+-----------------+-------+
[2019-07-06 19:36:29,379]
Send 'system host-lock compute-0 --force'

[2019-07-06 19:42:04,408]
Send 'system host-unlock compute-0'

[2019-07-06 19:42:13,039] 423 DEBUG MainThread ssh.expect :: Output:
+----+--------------+-------------+----------------+-------------+--------------+
| id | hostname | personality | administrative | operational | availability |
+----+--------------+-------------+----------------+-------------+--------------+
| 1 | controller-0 | controller | unlocked | enabled | available |
| 2 | controller-1 | controller | unlocked | enabled | available |
| 3 | compute-0 | worker | unlocked | disabled | online |
| 4 | compute-1 | worker | unlocked | enabled | available |
| 5 | compute-2 | worker | unlocked | enabled | available |
[2019-07-06 19:46:07,218] 423 DEBUG MainThread ssh.expect :: Output:
+----+---------------------+-----------------+-----------------+-------+
| ID | Hypervisor Hostname | Hypervisor Type | Host IP | State |
+----+---------------------+-----------------+-----------------+-------+
| 5 | compute-0 | QEMU | 192.168.206.89 | down |
| 8 | compute-2 | QEMU | 192.168.206.135 | up |
| 11 | compute-1 | QEMU | 192.168.206.21 | up |
+----+---------------------+-----------------+-----------------+---

[2019-07-06 19:51:55,928]
| 306f746a-a44a-47a9-99e3-fae402762753 | 200.006 | compute-0 is degraded due to the failure of its 'pci-irq-affinity-agent' process. Auto recovery of this major process is in progress. | host=compute-0.process=pci-irq-affinity-agent | major | 2019-07-06T19:45:41.599167 |

[2019-07-06 21:31:34,636] 423 DEBUG MainThread ssh.expect :: Output:
+----+---------------------+-----------------+-----------...


Revision history for this message
cheng li (chengli3) wrote :

It's interesting. I can see a libvirt error in compute-0's var/log/libvirt/libvirt/libvirtd.log, but the error is not related to ovs.

```
qemuMonitorIO:718 : internal error: End of file from qemu monitor
```

To summarize, the degrade alarm caused by the pci-irq-affinity-agent process can arise through the following paths (see the triage sketch below):

1. pci-irq-affinity-agent failed <- libvirt error <- ovs-related error. (Only PengPeng has seen this scenario, and only once.)
2. pci-irq-affinity-agent failed <- libvirt error <- qemu internal error. (The latest log uploaded by Wendy matches this path.)
3. pci-irq-affinity-agent failed, but recovered automatically within ~5 minutes.

This bug tracks the first path listed above, and no one has been able to reproduce it so far.
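
A hypothetical triage helper based on the log signatures quoted in this thread; the log path and the labels are assumptions, not part of any StarlingX tooling:

```python
# Hypothetical triage: classify a failure by the libvirtd.log signatures quoted above.
import re

SIGNATURES = {
    'path 1 (ovs-related error)': re.compile(r'ovs-vsctl .*no key "tx_errors"'),
    'path 2 (qemu internal error)': re.compile(r'End of file from qemu monitor'),
}

def classify(logfile='/var/log/libvirt/libvirtd.log'):
    hits = set()
    with open(logfile) as f:
        for line in f:
            for label, pattern in SIGNATURES.items():
                if pattern.search(line):
                    hits.add(label)
    return sorted(hits) or ['path 3 (no matching libvirt error; likely transient)']
```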

Revision history for this message
Yang Liu (yliu12) wrote :

Based on Cheng's comments, a new LP has been opened to track the issue in Wendy's comments (issue 2). Although issue 2 is also intermittent, it occurs more frequently than issue 1.
https://bugs.launchpad.net/starlingx/+bug/1839160

Revision history for this message
Forrest Zhao (forrest.zhao) wrote :

Closing this since it cannot be reproduced. A new bug can be opened if it is observed again.

Changed in starlingx:
status: Triaged → Invalid
Peng Peng (ppeng)
tags: removed: stx.retestneeded