controller-0 is degraded due to failure of pci-irq-affinity-agent process after reboot - libvirt failed to connect to ovs

Bug #1832047 reported by Peng Peng
This bug affects 2 people
Affects: StarlingX
Status: Invalid
Importance: Medium
Assigned to: cheng li
Milestone: —

Bug Description

Brief Description
-----------------
In a DX system, after an active controller reboot, the system failed to recover within 180 seconds (as in LP-1829432), and alarm 200.006 was raised: "controller-0 is degraded due to the failure of its 'pci-irq-affinity-agent' process. Auto recovery of this major process is in progress." This alarm was not cleared within 1 hour.

Severity
--------
Major

Steps to Reproduce
------------------
reboot active controller

TC-name: test_evacuate_vms

Expected Behavior
------------------
The 200.006 alarm should be cleared after the system recovers from the reboot

Actual Behavior
----------------
200.006 alarm not cleared

Reproducibility
---------------
Intermittent

System Configuration
--------------------
Two node system

Lab-name:

Branch/Pull Time/Commit
-----------------------
stx master as of 20190604T144018Z

Last Pass
---------
2019-06-03_18-34-53

Timestamp/Logs
--------------
controller-0:~$
[2019-06-07 10:48:47,307] 139 INFO MainThread host_helper.reboot_hosts:: Rebooting active controller: controller-0
[2019-06-07 10:48:47,307] 262 DEBUG MainThread ssh.send :: Send 'sudo reboot -f'

[2019-06-07 10:54:02,726] 262 DEBUG MainThread ssh.send :: Send 'fm --os-username 'admin' --os-password 'Li69nux*' --os-project-name admin --os-auth-url http://192.168.204.2:5000/v3 --os-user-domain-name Default --os-project-domain-name Default --os-endpoint-type internalURL --os-region-name RegionOne alarm-list --nowrap --uuid'
[2019-06-07 10:54:04,803] 387 DEBUG MainThread ssh.expect :: Output:
+--------------------------------------+----------+-----------------------------------------------------------------------------------------------------------------------------------------------------------+------------------------------------------------------------+----------+----------------------------+
| UUID | Alarm ID | Reason Text | Entity ID | Severity | Time Stamp |
+--------------------------------------+----------+-----------------------------------------------------------------------------------------------------------------------------------------------------------+------------------------------------------------------------+----------+----------------------------+
| 08339a9b-5d53-4145-be3e-2e9458ba61b8 | 200.006 | controller-0 is degraded due to the failure of its 'pci-irq-affinity-agent' process. Auto recovery of this major process is in progress. | host=controller-0.process=pci-irq-affinity-agent | major | 2019-06-07T10:54:04.046760 |

[2019-06-07 11:45:46,732] 262 DEBUG MainThread ssh.send :: Send 'fm --os-username 'admin' --os-password 'Li69nux*' --os-project-name admin --os-auth-url http://192.168.204.2:5000/v3 --os-user-domain-name Default --os-project-domain-name Default --os-endpoint-type internalURL --os-region-name RegionOne alarm-list --nowrap --uuid'
[2019-06-07 11:45:48,315] 387 DEBUG MainThread ssh.expect :: Output:
+--------------------------------------+----------+------------------------------------------------------------------------------------------------------------------------------------------+--------------------------------------------------------------------------+----------+----------------------------+
| UUID | Alarm ID | Reason Text | Entity ID | Severity | Time Stamp |
+--------------------------------------+----------+------------------------------------------------------------------------------------------------------------------------------------------+--------------------------------------------------------------------------+----------+----------------------------+
                                                                  | host=controller-1.ntp | major | 2019-06-07T11:03:53.446426 |
| 08339a9b-5d53-4145-be3e-2e9458ba61b8 | 200.006 | controller-0 is degraded due to the failure of its 'pci-irq-affinity-agent' process. Auto recovery of this major process is in progress. | host=controller-0.process=pci-irq-affinity-agent | major | 2019-06-07T10:54:04.046760 |

Test Activity
-------------
Sanity

Revision history for this message
Peng Peng (ppeng) wrote :
Revision history for this message
Ghada Khalil (gkhalil) wrote :

This sounds similar to https://bugs.launchpad.net/starlingx/+bug/1828877 which was already fixed on 2019-05-15. Assigning to Zhipeng to investigate.

Changed in starlingx:
assignee: nobody → zhipeng liu (zhipengs)
Revision history for this message
zhipeng liu (zhipengs) wrote :

From the logs, it seems libvirtd did not restart normally after the reboot.
As a result, pci-irq-affinity-agent could not get a connection to libvirtd and exited.

Could anyone help check why libvirtd could not be started after the reboot?

Thanks!
Zhipeng
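
For context, a minimal sketch (not the agent's actual code) of how one could probe libvirtd availability with retries using the libvirt Python bindings; the connection URI, retry count, and delay are assumptions chosen for illustration:

```python
# Hypothetical probe: keep retrying the libvirt connection instead of exiting
# on the first failure after a reboot. URI and timings are assumptions.
import time
import libvirt

def wait_for_libvirt(uri='qemu:///system', retries=30, delay=10):
    """Return an open read-only connection, or None if libvirtd never comes up."""
    for attempt in range(1, retries + 1):
        try:
            conn = libvirt.openReadOnly(uri)
            if conn is not None and conn.isAlive():
                return conn
        except libvirt.libvirtError as err:
            print('libvirt not ready (attempt %d/%d): %s' % (attempt, retries, err))
        time.sleep(delay)
    return None
```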

Numan Waheed (nwaheed)
description: updated
tags: added: stx.retestneeded
Peng Peng (ppeng)
description: updated
Revision history for this message
Ghada Khalil (gkhalil) wrote :

Are there any clues in the libvirtd pod logs?

Revision history for this message
zhipeng liu (zhipengs) wrote :

The libvirt log is below; it seems libvirt exited abnormally.

zhipengl@zhipengl-nuc:~/ALL_NODES_20190607.144900/controller-0_20190607.144900/var/log/libvirt$ vi libvirtd.log
2019-06-07 09:07:24.839+0000: 900080: info : libvirt version: 4.7.0, package: 1.tis.100 (Unknown, 2019-06-04-16:54:06, 14b46d92114f)
2019-06-07 09:07:24.839+0000: 900080: info : hostname: controller-0
2019-06-07 09:07:24.839+0000: 900080: warning : qemuDomainObjTaint:7640 : Domain id=1 name='instance-00000004' uuid=b158aa0e-9d88-457a-aab0-828625f02c1c is tainted: high-privileges
2019-06-07 09:07:39.679+0000: 900082: error : virCommandWait:2591 : internal error: Child process (ovs-vsctl --timeout=5 get Interface vhub51236a2-f3 statistics:tx_errors) unexpected exit status 1: ovs-vsctl: no key "tx_errors" in Interface record "vhub51236a2-f3" column statistics

2019-06-07 09:24:35.225+0000: 900084: warning : qemuDomainObjTaint:7640 : Domain id=2 name='instance-00000008' uuid=5d285f2c-fc62-4060-be9c-46bc5ebf04d2 is tainted: high-privileges
2019-06-07 09:25:10.121+0000: 900082: error : virCommandWait:2591 : internal error: Child process (ovs-vsctl --timeout=5 get Interface vhu08e410c3-2a statistics:tx_errors) unexpected exit status 1: ovs-vsctl: no key "tx_errors" in Interface record "vhu08e410c3-2a" column statistics

2019-06-07 09:25:10.247+0000: 900082: error : virCommandWait:2591 : internal error: Child process (ovs-vsctl --timeout=5 get Interface vhu096be99f-d0 statistics:tx_errors) unexpected exit status 1: ovs-vsctl: no key "tx_errors" in Interface record "vhu096be99f-d0" column statistics
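
For reference, a rough Python illustration (not libvirt's code) of the ovs-vsctl query failing above; the point is that a missing statistics key makes the child process exit non-zero, which libvirtd then logs as an internal error:

```python
# Illustrative only: run the same ovs-vsctl query that libvirtd runs above,
# treating a missing "tx_errors" key as "no data yet" rather than a fatal error.
import subprocess

def tx_errors(iface):
    cmd = ['ovs-vsctl', '--timeout=5', 'get', 'Interface', iface,
           'statistics:tx_errors']
    result = subprocess.run(cmd, capture_output=True, text=True)
    if result.returncode != 0:
        # e.g. 'no key "tx_errors" in Interface record ... column statistics'
        return None
    return int(result.stdout.strip())
```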

The pci-irq-affinity-agent log entry below is from /var/log/platform.log:
2019-06-07T09:24:40.536 controller-0 2019-06-07 warning 09:24:40,535 Thread-37[1232790] pci-interrupt-affinity./usr/lib64/python2.7/site-packages/pci_irq_affinity/nova_provider.py.103 - WARNING Failed to access libvirt! error='NoneType' object has no attribute 'get'

Zhipeng
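
The 'NoneType' warning above is consistent with a helper returning None when libvirtd is unreachable and a caller then invoking .get() on that result. A hypothetical illustration (the function names are invented, not the agent's real API):

```python
# Hypothetical illustration of the warning text above; not the agent's real code.
def get_instance_info(conn):
    if conn is None:                      # libvirtd unreachable after the reboot
        return None
    return {'cpus': conn.getCPUMap()}     # real libvirt-python call

def cpus_for(instance_info):
    # Without this guard, a None result raises:
    #   AttributeError: 'NoneType' object has no attribute 'get'
    if instance_info is None:
        return []
    return instance_info.get('cpus', [])
```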

Revision history for this message
Ghada Khalil (gkhalil) wrote :

The errors seem related to ovs; assigning to the networking team to investigate further.

Changed in starlingx:
assignee: zhipeng liu (zhipengs) → Forrest Zhao (forrest.zhao)
tags: added: stx.networking
Changed in starlingx:
assignee: Forrest Zhao (forrest.zhao) → cheng li (chengli3)
Revision history for this message
Qianshui Jiang (qianshui) wrote :

Here are two questions about the configuration of the test environment:

1. Does the DX system in this bug description run in a virtual environment or on bare metal?

2. Which of the .tgz files below was used during environment setup?
    helm-charts-manifest-centos-dev-latest.tgz
    helm-charts-manifest-centos-dev-versioned.tgz
    helm-charts-manifest-centos-stable-latest.tgz
    helm-charts-manifest-centos-stable-versioned.tgz
    stx-openstack-1.0-14-centos-dev-latest.tgz
    stx-openstack-1.0-14-centos-dev-versioned.tgz
    stx-openstack-1.0-14-centos-stable-latest.tgz
    stx-openstack-1.0-14-centos-stable-versioned.tgz

Revision history for this message
cheng li (chengli3) wrote :

Hi PengPeng, could you please let us know which tarball was used? Was the deployment on a virtual environment or on bare metal? Thanks

Revision history for this message
Peng Peng (ppeng) wrote :

helm-charts-manifest-centos-stable-versioned.tgz

Revision history for this message
cheng li (chengli3) wrote :

Thanks PengPeng. Did you deploy the duplex environment on a virtual environment or on bare metal?

Revision history for this message
Peng Peng (ppeng) wrote :

on real lab, not virtual.

Revision history for this message
Ghada Khalil (gkhalil) wrote :

Marking as release gating for now until the networking team has had a chance to investigate.

Peng, Please clarify the frequency of occurrence of this issue. Was this a one time occurrence? How many times was this issue seen? If seen multiple times, does it happen on only one system or multiple systems?

tags: added: stx.2.0
Revision history for this message
Peng Peng (ppeng) wrote :

So far we have only seen this issue once, on a two-node system.

Revision history for this message
cheng li (chengli3) wrote :

@PengPeng, which vswitch_type were you using? The default 'none' or 'ovs-dpdk'?

https://wiki.openstack.org/wiki/StarlingX/Containers/Installation#Configure_the_vswitch_type_.28optional.29

Revision history for this message
Peng Peng (ppeng) wrote :

it should be ovs-dpdk

Ghada Khalil (gkhalil)
Changed in starlingx:
importance: Undecided → Medium
status: New → Triaged
Revision history for this message
Wendy Mitchell (wmitchellwr) wrote :

BUILD_ID="20190622T013000Z"
SYSTEM_NAME="yow-cgcs-wolfpass-03_07" (2+3 HW system)
VSWITCH_TYPE="ovs-dpdk"

[Failed in teardown in nova regression testcase
nova/test_force_lock_with_vms.py::test_force_lock_with_mig_vms]

$ system application-list
+---------------------+--------------------------------+-------------------------------+--------------------+---------+-----------+
| application | version | manifest name | manifest file | status | progress |
+---------------------+--------------------------------+-------------------------------+--------------------+---------+-----------+
| platform-integ-apps | 1.0-7 | platform-integration-manifest | manifest.yaml | applied | completed |
| stx-openstack | 1.0-16-centos-stable-versioned | armada-manifest | stx-openstack.yaml | applied | completed |

1. All instances that were running on compute-0 landed on compute-1 (when compute-0 was force locked at approx 2019-06-25 14:18:30)

nova hypervisor reverts to down and disabled on compute-0

]$ nova hypervisor-list
+--------------------------------------+---------------------+-------+----------+
| ID | Hypervisor hostname | State | Status |
+--------------------------------------+---------------------+-------+----------+
| a8aea911-05a5-4410-9f39-08630783d373 | compute-0 | down | disabled |
| ba4344e6-5d97-4b42-a67b-eabbc95b531b | compute-1 | up | enabled |
| a612e8ad-5026-4d9e-b103-339b6d94eefe | compute-2 | up | enabled |
+--------------------------------------+---------------------+-------+----------+
[sysadmin@controller-0 ~(keystone_admin)]$ date
Tue Jun 25 14:22:00 UTC 2019

compute-0 is unlocked ~ 2019-06-25 14:24:15

2 alarms are reported after this (from Horizon). Only the CPU alarm on compute-1 clears

200.006 compute-0 is degraded due to the failure of its 'pci-irq-affinity-agent' process. Auto recovery of this major process is in progress.
host=compute-0.process=pci-irq-affinity-agent major 2019-06-25T10:28:17

100.101 Platform CPU threshold exceeded ; threshold 90.00%, actual 93.43% host=compute-1 major 2019-06-25T10:33:11

+----+--------------+-------------+----------------+-------------+--------------+
| id | hostname | personality | administrative | operational | availability |
+----+--------------+-------------+----------------+-------------+--------------+
| 1 | controller-0 | controller | unlocked | enabled | available |
| 2 | controller-1 | controller | unlocked | enabled | available |
| 3 | compute-0 | worker | unlocked | disabled | intest |
| 4 | compute-1 | worker | unlocked | enabled | degraded |
| 5 | compute-2 | worker | unlocked | enabled | available

$ date
Tue Jun 25 14:28:15 UTC 2019

$ nova hypervisor-list;date
+--------------------------------------+---------------------+-------+---------+
| ID | Hypervisor hostname | State | Status |
+--------------------------------------+---------------------+-------+---------+
| a8aea911-05a5-4410-9f39-08630783d373 | compute-0 | down | enabled |
| ba4344e6-5d97-4b42-a67b-eabbc...


Revision history for this message
Wendy Mitchell (wmitchellwr) wrote :
Revision history for this message
Wendy Mitchell (wmitchellwr) wrote :
Revision history for this message
Wendy Mitchell (wmitchellwr) wrote :

2019-06-25T14:30:19.615 controller-0 VIM_Thread[2586686] INFO _host_state_enabling.py.73 Enable failed for compute-0.
2019-06-25T14:33:23.287 controller-0 VIM_Thread[2586686] INFO _host_state_enabling.py.73 Enable failed for compute-0.
...
2019-06-25T15:10:33.793 controller-0 VIM_Thread[2586686] INFO _host_state_enabling.py.73 Enable failed for compute-0.
2019-06-25T15:12:44.894 controller-0 VIM_Thread[2586686] INFO _host_state_enabling.py.73 Enable failed for compute-0.

Revision history for this message
Wendy Mitchell (wmitchellwr) wrote :
Revision history for this message
Wendy Mitchell (wmitchellwr) wrote :

Increasing the priority on this and adding the stx.regression tag, as this affects regression testing.

Live migration testcases that follow this testcase all fail as the hypervisor remains down.

nova/test_force_lock_with_vms.py::test_force_lock_with_mig_vms
[2019-06-24 00:33:59,249]
E Details: Host(s) ['compute-0'] are not up in hypervisor-list within timeout

767cd209-0853-439b-90c2-567988043d53 | 200.006 | compute-0 is degraded due to the failure of its 'pci-irq-affinity-agent' process. Auto recovery of this major process is in progress. | host=compute-0.process=pci-irq-affinity-agent | major | 2019-06-24T00:38:00.878998

FAIL 20190624 02:12:43 test_live_migrate_vm_positive[remote-0-0-None-2-volume-False]
FAIL 20190624 02:12:56 test_live_migrate_vm_positive[remote-1-0-dedicated-1-volume-False]
FAIL 20190624 02:13:08 test_live_migrate_vm_positive[remote-1-512-None-1-image-False]
FAIL 20190624 02:13:21 test_live_migrate_vm_positive[remote-0-512-dedicated-2-image_with_vol-False]
FAIL 20190624 02:13:33 test_live_migrate_vm_negative[remote-0-0-volume-True-LiveMigErr.BLOCK_MIG_UNSUPPORTED]
FAIL 20190624 02:13:46 test_live_migrate_vm_negative[remote-1-0-volume-True-LiveMigErr.BLOCK_MIG_UNSUPPORTED]
FAIL 20190624 02:13:58 test_live_migrate_vm_negative[remote-0-512-volume-True-LiveMigErr.BLOCK_MIG_UNSUPPORTED]
FAIL 20190624 02:14:11 test_live_migrate_vm_negative[remote-0-512-image-True-LiveMigErr.BLOCK_MIG_UNSUPPORTED]
FAIL 20190624 02:14:23 test_live_migrate_vm_negative[remote-0-0-image_with_vol-True-LiveMigErr.BLOCK_MIG_UNSUPPORTED]

tags: added: stx.regression
Revision history for this message
cheng li (chengli3) wrote :

I am trying to reproduce this bug, but one issue is blocking me from installing a duplex environment on bare metal.
I am still working on fixing that issue. I am not sure whether it is a bug, but I would like to paste the info here in case anyone else has met the same problem.

After unlocking the first controller node, neither ceph nor kubernetes came up. After checking the puppet log, I found the root cause could be that some management IP addresses were missing on eno2 (the management and cluster-host network interface). The eno2 interface was supposed to have 192.168.206.3/24 and 192.168.204.3/28, but it only has the .2 addresses. When I ran `ifdown eno2` and `ifup eno2`, error messages appeared.

What's weird is that we didn't hit this issue with the virtual duplex environment.

```
controller-0:/etc/sysconfig/network-scripts# ip a
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN group default qlen 1000
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
    inet 127.0.0.1/8 scope host lo
       valid_lft forever preferred_lft forever
    inet6 ::1/128 scope host
       valid_lft forever preferred_lft forever
2: enp24s0f0: <BROADCAST,MULTICAST> mtu 1500 qdisc mq state DOWN group default qlen 1000
    link/ether 3c:fd:fe:d2:10:24 brd ff:ff:ff:ff:ff:ff
3: enp24s0f1: <BROADCAST,MULTICAST> mtu 1500 qdisc mq state DOWN group default qlen 1000
    link/ether 3c:fd:fe:d2:10:25 brd ff:ff:ff:ff:ff:ff
4: eno1: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc mq state UP group default qlen 1000
    link/ether a4:bf:01:3a:53:52 brd ff:ff:ff:ff:ff:ff
    inet 10.239.48.72/24 brd 10.239.48.255 scope global eno1
       valid_lft forever preferred_lft forever
    inet 10.239.48.103/24 scope global secondary eno1
       valid_lft forever preferred_lft forever
    inet6 fe80::a6bf:1ff:fe3a:5352/64 scope link
       valid_lft forever preferred_lft forever
5: eno2: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc mq state UP group default qlen 1000
    link/ether a4:bf:01:3a:53:53 brd ff:ff:ff:ff:ff:ff
    inet 192.168.206.2/24 scope global eno2
       valid_lft forever preferred_lft forever
    inet 169.254.202.2/24 scope global eno2
       valid_lft forever preferred_lft forever
    inet 192.168.204.2/28 scope global eno2
       valid_lft forever preferred_lft forever
    inet6 fe80::a6bf:1ff:fe3a:5353/64 scope link
       valid_lft forever preferred_lft forever
6: enp175s0f0: <BROADCAST,MULTICAST> mtu 1500 qdisc mq state DOWN group default qlen 1000
    link/ether 3c:fd:fe:d2:10:58 brd ff:ff:ff:ff:ff:ff
7: enp175s0f1: <BROADCAST,MULTICAST> mtu 1500 qdisc mq state DOWN group default qlen 1000
    link/ether 3c:fd:fe:d2:10:59 brd ff:ff:ff:ff:ff:ff
8: docker0: <NO-CARRIER,BROADCAST,MULTICAST,UP> mtu 1500 qdisc noqueue state DOWN group default
    link/ether 02:42:f3:c5:8e:82 brd ff:ff:ff:ff:ff:ff
    inet 172.17.0.1/16 brd 172.17.255.255 scope global docker0
       valid_lft forever preferred_lft forever
```

```
controller-0:/etc/sysconfig/network-scripts# ifup eno2
ERROR : [/etc/sysconfig/network-scripts/ifup-ipv6] Global IPv6 forwarding is disabled in configuration, but not currently disabled in kernel
ERROR : [/etc/sysconfig/network-scripts/ifup-...
```

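As a quick way to confirm the finding above, a small check sketch; the interface name and expected addresses are taken from the comment and will differ per lab:

```python
# Hypothetical check: verify the expected management/cluster-host addresses
# are actually configured on the interface. Values are examples from the comment.
import subprocess

EXPECTED = {'192.168.206.3/24', '192.168.204.3/28'}

def missing_addresses(dev='eno2'):
    out = subprocess.run(['ip', '-o', 'addr', 'show', 'dev', dev],
                         capture_output=True, text=True).stdout
    present = set()
    for line in out.splitlines():
        parts = line.split()
        if len(parts) > 3 and parts[2] == 'inet':
            present.add(parts[3])          # e.g. '192.168.206.2/24'
    return EXPECTED - present
```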

Revision history for this message
Qianshui Jiang (qianshui) wrote :

We've tried to reproduce the bug, but we didn't see the same phenomenon when using the .iso file that came out on 6.7.
In our test, the ovs and libvirtd pods seem to work properly after rebooting controller-0. Here is the log information:

controller-0:~$ openstack hypervisor list
+----+---------------------+-----------------+---------------+-------+
| ID | Hypervisor Hostname | Hypervisor Type | Host IP | State |
+----+---------------------+-----------------+---------------+-------+
| 4 | controller-1 | QEMU | 192.168.206.4 | up |
| 6 | controller-0 | QEMU | 192.168.206.3 | up |
+----+---------------------+-----------------+---------------+-------+

controller-0:/home/wrsroot# kubectl logs -n openstack libvirt-libvirt-default-69mlr
'[' -n '' ']'
+ rm -f /var/run/libvirtd.pid
+ [[ -c /dev/kvm ]]
+ chmod 660 /dev/kvm
+ chown root:kvm /dev/kvm
+ CGROUPS=
+ for CGROUP in cpu rdma hugetlb
+ '[' -d /sys/fs/cgroup/cpu ']'
+ CGROUPS+=cpu,
+ for CGROUP in cpu rdma hugetlb
+ '[' -d /sys/fs/cgroup/rdma ']'
+ for CGROUP in cpu rdma hugetlb
+ '[' -d /sys/fs/cgroup/hugetlb ']'
+ CGROUPS+=hugetlb,
+ cgcreate -g cpu,hugetlb:/osh-libvirt
++ cat /proc/meminfo
++ grep HugePages_Total
++ tr -cd '[:digit:]'
+ hp_count=78225
+ '[' 078225 -gt 0 ']'
+ echo 'INFO: Detected hugepage count of '\''78225'\''. Enabling hugepage settings for libvirt/qemu.'
INFO: Detected hugepage count of '78225'. Enabling hugepage settings for libvirt/qemu.
++ grep KVM_HUGEPAGES=0 /etc/default/qemu-kvm
grep: /etc/default/qemu-kvm: No such file or directory
+ '[' -n '' ']'
+ echo KVM_HUGEPAGES=1
+ '[' '!' -d /dev/hugepages ']'
+ '[' -d /sys/fs/cgroup/hugetlb ']'
++ ls /sys/fs/cgroup/hugetlb/k8s-infra/hugetlb.1GB.limit_in_bytes /sys/fs/cgroup/hugetlb/k8s-infra/hugetlb.2MB.limit_in_bytes
+ limits='/sys/fs/cgroup/hugetlb/k8s-infra/hugetlb.1GB.limit_in_bytes
/sys/fs/cgroup/hugetlb/k8s-infra/hugetlb.2MB.limit_in_bytes'
+ for limit in '$limits'
+++ awk -F: '($2~/hugetlb/){print $3}' /proc/self/cgroup
++ dirname /k8s-infra/kubepods/besteffort/pod23696212-9be3-11e9-b11d-3cfdfed21024/ecce25603440da89fe3b6d5539f051e6b47943ac196f73e9afdc694a1113f38c
++ basename /sys/fs/cgroup/hugetlb/k8s-infra/hugetlb.1GB.limit_in_bytes
+ target=/sys/fs/cgroup/hugetlb//k8s-infra/kubepods/besteffort/pod23696212-9be3-11e9-b11d-3cfdfed21024/hugetlb.1GB.limit_in_bytes
+ '[' '!' -f /sys/fs/cgroup/hugetlb//k8s-infra/kubepods/besteffort/pod23696212-9be3-11e9-b11d-3cfdfed21024/hugetlb.1GB.limit_in_bytes ']'
++ cat /sys/fs/cgroup/hugetlb/k8s-infra/hugetlb.1GB.limit_in_bytes
+ echo 9223372036854771712
+ for limit in '$limits'
+++ awk -F: '($2~/hugetlb/){print $3}' /proc/self/cgroup
++ dirname /k8s-infra/kubepods/besteffort/pod23696212-9be3-11e9-b11d-3cfdfed21024/ecce25603440da89fe3b6d5539f051e6b47943ac196f73e9afdc694a1113f38c
++ basename /sys/fs/cgroup/hugetlb/k8s-infra/hugetlb.2MB.limit_in_bytes
+ target=/sys/fs/cgroup/hugetlb//k8s-infra/kubepods/besteffort/pod23696212-9be3-11e9-b11d-3cfdfed21024/hugetlb.2MB.limit_in_bytes
+ '[' '!' -f /sys/fs/cgroup/hugetlb//k8s-infra/kubepods/besteffort/pod23696212-9be3-11e9-b11d-3cfdfed21024/hugetlb.2MB.limit_in_bytes ']'
++ cat /sys/fs...


Revision history for this message
Wendy Mitchell (wmitchellwr) wrote :

still fails in weekly regression test for
Lab: WP_3_7
Load: 20190706T013000Z

test_force_lock_with_mig_vms
compute-0 is force locked in this testcase

@ [2019-07-06 19:41:00,533] compute-0 becomes online
| administrative | locked
| availability | online

@ [2019-07-06 19:42:04,408] compute-0 is unlocked via cli

@ [2019-07-06 19:42:11,330] compute-0 is in unlocking task
| administrative | locked |
| availability | online
| task | Unlocking

@ [2019-07-06 19:42:13,039] compute-0 appears to be in administrative state "unlocked" (but disabled, online)

@ [2019-07-06 19:42:17,833] compute-0 is offline (disabled)
| 3 | compute-0 | worker | unlocked | disabled | offline

@ [2019-07-06 19:44:34,969] compute-0 is 'intest'
| 3 | compute-0 | worker | unlocked | disabled | intest

@[2019-07-06 19:45:55,073] compute-0 becomes available
| 3 | compute-0 | worker | unlocked | enabled | available

However, the hypervisor remains down

[2019-07-06 19:51:11,109]
+----+---------------------+-----------------+-----------------+-------+
| ID | Hypervisor Hostname | Hypervisor Type | Host IP | State |
+----+---------------------+-----------------+-----------------+-------+
| 5 | compute-0 | QEMU | 192.168.206.89 | down |
| 8 | compute-2 | QEMU | 192.168.206.135 | up |
| 11 | compute-1 | QEMU | 192.168.206.21 | up |
+----+---------------------+-----------------+-----------------+-------+

fm-event.log
2019-07-06T19:44:45.000 controller-0 fmManager: info { "event_log_id" : "200.022", "reason_text" : "compute-0 is now 'online'", "entity_instance_id" : "host=compute-0.status=online", "severity" : "not-applicable", "state" : "msg", "timestamp" : "2019-07-06 19:44:45.983111" }
2019-07-06T19:44:56.000 controller-0 fmManager: info { "event_log_id" : "800.001", "reason_text" : "Storage Alarm Condition: HEALTH_WARN. Please check 'ceph -s' for more details.", "entity_instance_id" : "region=RegionOne.system=yow-cgcs-wolfpass-03_07.cluster=cf9dc322-6a2b-4f56-b3bb-8e059098aa4d", "severity" : "warning", "state" : "clear", "timestamp" : "2019-07-06 19:44:56.995349" }
2019-07-06T19:45:15.000 controller-0 fmManager: info { "event_log_id" : "200.006", "reason_text" : "compute-0 is degraded due to the failure of its 'pci-irq-affinity-agent' process. Auto recovery of this major process is in progress.", "entity_instance_id" : "region=RegionOne.system=yow-cgcs-wolfpass-03_07.host=compute-0.process=pci-irq-affinity-agent", "severity" : "major", "state" : "clear", "timestamp" : "2019-07-06 19:45:15.206717" }
2019-07-06T19:45:41.000 controller-0 fmManager: info { "event_log_id" : "200.006", "reason_text" : "compute-0 is degraded due to the failure of its 'pci-irq-affinity-agent' process. Auto recovery of this major process is in progress.", "entity_instance_id" : "region=RegionOne.system=yow-cgcs-wolfpass-03_07.host=compute-0.process=pci-irq-affinity-agent", "severity" : "major", "state" : "set", "timestamp" : "2019-07-06 19:45:41.623027" }
2019-07-06T19:45:43.000 contr...


Revision history for this message
Cristopher Lemus (cjlemusc) wrote :

This bug was reproduced on a Standard (2+2+2) with Build: 20190710T140851Z

During sanity execution, compute-1 is locked/unlocked by the automated test. One hour after the unlock action, the current status is:

[sysadmin@controller-0 ~(keystone_admin)]$ system host-list
+----+--------------+-------------+----------------+-------------+--------------+
| id | hostname | personality | administrative | operational | availability |
+----+--------------+-------------+----------------+-------------+--------------+
| 1 | controller-0 | controller | unlocked | enabled | available |
| 2 | compute-0 | worker | unlocked | enabled | available |
| 3 | compute-1 | worker | unlocked | enabled | degraded |
| 4 | controller-1 | controller | unlocked | enabled | available |
| 5 | storage-0 | storage | unlocked | enabled | available |
| 6 | storage-1 | storage | unlocked | enabled | available |
+----+--------------+-------------+----------------+-------------+--------------+

Reported by fm alarm-list
| 200.006 | compute-1 is degraded due to the failure of its 'pci-irq-affinity-agent' process. Auto recovery of | host=compute-1.process=pci- | major | 2019-07-10T17:25: |
| | this major process is in progress. | irq-affinity-agent | | 18.738181 |
| | | | | |

Hypervisor down:
+----+---------------------+-----------------+-----------------+-------+
| ID | Hypervisor Hostname | Hypervisor Type | Host IP | State |
+----+---------------------+-----------------+-----------------+-------+
| 5 | compute-1 | QEMU | 192.168.206.182 | down |
| 8 | compute-0 | QEMU | 192.168.206.147 | up |
+----+---------------------+-----------------+-----------------+-------+

Full collect attached.

Just a note: we got the same event on some other configurations, but in those cases the compute node recovers after a few minutes (under ~5 minutes).
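
To distinguish the stuck case from the configurations that recover within a few minutes, a simple watcher along these lines could be used (it assumes admin credentials are already sourced so that `fm alarm-list` works, as in the logs above):

```python
# Hypothetical watcher: poll "fm alarm-list" (the same command used earlier in
# this report) until the 200.006 pci-irq-affinity-agent alarm clears or we time out.
import subprocess
import time

def alarm_present():
    out = subprocess.run(['fm', 'alarm-list', '--nowrap', '--uuid'],
                         capture_output=True, text=True).stdout
    return '200.006' in out and 'pci-irq-affinity-agent' in out

def wait_for_clear(timeout=3600, poll=60):
    """Return True if the alarm clears within `timeout` seconds."""
    deadline = time.time() + timeout
    while time.time() < deadline:
        if not alarm_present():
            return True
        time.sleep(poll)
    return False
```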

Ghada Khalil (gkhalil)
summary: - 200.006 alarm "controller-0 is degraded due to the failure of its 'pci-
- irq-affinity-agent' process" after reboot
+ controller-0 is degraded due to failure of pci-irq-affinity-agent
+ process after reboot - libvirt failed to connect to ovs
Revision history for this message
Ghada Khalil (gkhalil) wrote :

Cristopher, the occurrence you report related to the 'pci-irq-affinity-agent' process is not the same as the original issue. Please open a separate Launchpad bug for it.

Revision history for this message
Wendy Mitchell (wmitchellwr) wrote :
Revision history for this message
Wendy Mitchell (wmitchellwr) wrote :
Revision history for this message
Wendy Mitchell (wmitchellwr) wrote :

Please refer to the controller-0*.tgz and compute-0*.tgz log collections (attached above) from the following regression run.

Lab: WP_3_7
Load: 20190706T013000Z

[2019-07-06 19:29:02,166] 423 DEBUG MainThread ssh.expect :: Output:
+----+---------------------+-----------------+-----------------+-------+
| ID | Hypervisor Hostname | Hypervisor Type | Host IP | State |
+----+---------------------+-----------------+-----------------+-------+
| 5 | compute-0 | QEMU | 192.168.206.89 | up |
| 8 | compute-2 | QEMU | 192.168.206.135 | up |
| 11 | compute-1 | QEMU | 192.168.206.21 | up |
+----+---------------------+-----------------+-----------------+----

[2019-07-06 19:31:53,399] 423 DEBUG MainThread ssh.expect :: Output:
+----+---------------------+-----------------+-----------------+-------+
| ID | Hypervisor Hostname | Hypervisor Type | Host IP | State |
+----+---------------------+-----------------+-----------------+-------+
| 5 | compute-0 | QEMU | 192.168.206.89 | up |
| 8 | compute-2 | QEMU | 192.168.206.135 | up |
| 11 | compute-1 | QEMU | 192.168.206.21 | up |
+----+---------------------+-----------------+-----------------+-------+
[2019-07-06 19:36:29,379]
Send 'system host-lock compute-0 --force'

[2019-07-06 19:42:04,408]
Send 'system host-unlock compute-0'

[2019-07-06 19:42:13,039] 423 DEBUG MainThread ssh.expect :: Output:
+----+--------------+-------------+----------------+-------------+--------------+
| id | hostname | personality | administrative | operational | availability |
+----+--------------+-------------+----------------+-------------+--------------+
| 1 | controller-0 | controller | unlocked | enabled | available |
| 2 | controller-1 | controller | unlocked | enabled | available |
| 3 | compute-0 | worker | unlocked | disabled | online |
| 4 | compute-1 | worker | unlocked | enabled | available |
| 5 | compute-2 | worker | unlocked | enabled | available |
[2019-07-06 19:46:07,218] 423 DEBUG MainThread ssh.expect :: Output:
+----+---------------------+-----------------+-----------------+-------+
| ID | Hypervisor Hostname | Hypervisor Type | Host IP | State |
+----+---------------------+-----------------+-----------------+-------+
| 5 | compute-0 | QEMU | 192.168.206.89 | down |
| 8 | compute-2 | QEMU | 192.168.206.135 | up |
| 11 | compute-1 | QEMU | 192.168.206.21 | up |
+----+---------------------+-----------------+-----------------+---

[2019-07-06 19:51:55,928]
| 306f746a-a44a-47a9-99e3-fae402762753 | 200.006 | compute-0 is degraded due to the failure of its 'pci-irq-affinity-agent' process. Auto recovery of this major process is in progress. | host=compute-0.process=pci-irq-affinity-agent | major | 2019-07-06T19:45:41.599167 |

[2019-07-06 21:31:34,636] 423 DEBUG MainThread ssh.expect :: Output:
+----+---------------------+-----------------+-----------...


Revision history for this message
cheng li (chengli3) wrote :

It's interesting. I can see a libvirt error in compute-0's var/log/libvirt/libvirt/libvirtd.log, but the error is not related to ovs.

```
qemuMonitorIO:718 : internal error: End of file from qemu monitor
```

To summarize, the degrade alarm caused by the pci-irq-affinity-agent process can arise through the following paths (see the triage sketch below):

1. pci-irq-affinity-agent failed <- libvirt error <- ovs-related error. (Only PengPeng has seen this scenario, and only once.)
2. pci-irq-affinity-agent failed <- libvirt error <- qemu internal error. (The latest log uploaded by Wendy matches this path.)
3. pci-irq-affinity-agent failed, but recovered automatically within ~5 minutes.

This bug tracks the first path listed above, and no one has been able to reproduce it so far.
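
A hypothetical triage helper based on the log signatures quoted in this thread; the log path and the labels are assumptions, not part of any StarlingX tooling:

```python
# Hypothetical triage: classify a failure by the libvirtd.log signatures quoted above.
import re

SIGNATURES = {
    'path 1 (ovs-related error)': re.compile(r'ovs-vsctl .*no key "tx_errors"'),
    'path 2 (qemu internal error)': re.compile(r'End of file from qemu monitor'),
}

def classify(logfile='/var/log/libvirt/libvirtd.log'):
    hits = set()
    with open(logfile) as f:
        for line in f:
            for label, pattern in SIGNATURES.items():
                if pattern.search(line):
                    hits.add(label)
    return sorted(hits) or ['path 3 (no matching libvirt error; likely transient)']
```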

Revision history for this message
Yang Liu (yliu12) wrote :

Based on Cheng's comments, a new LP has been opened to track the issue in Wendy's comments (issue 2). Although issue 2 is also intermittent, it occurs more frequently than issue 1.
https://bugs.launchpad.net/starlingx/+bug/1839160

Revision history for this message
Forrest Zhao (forrest.zhao) wrote :

Closing this since it cannot be reproduced. A new bug can be opened if it is observed again.

Changed in starlingx:
status: Triaged → Invalid
Peng Peng (ppeng)
tags: removed: stx.retestneeded