Instances crash on each system application-update operation (and CPU exceeds thresholds)

Bug #1833096 reported by Wendy Mitchell
This bug affects 2 people

Affects: StarlingX
Status: Fix Released
Importance: High
Assigned to: Bart Wensley

Bug Description

Brief Description
-----------------
Instances crash on each system application-update operation (and CPU exceeds thresholds)
pci_irq_affinity/nova_provider.py.131 error=Unknown Error (HTTP 503)
nova.exception.NeutronAdminCredentialConfigurationInvalid'>", "code": 500}

Severity
--------
Major
Steps to Reproduce
------------------
Platform-only install on a system with 2 worker nodes running with labels.

1. Perform system application-apply
2019-06-14T15:01:07.000 controller-0 -sh: info HISTORY: PID=902640 UID=1875 system application-apply stx-openstack
2 compute hosts now running (enabled, available)

2. Create 1 or 2 instance(s) on compute-1

3. Perform a system application-update operation (to or from centos-stable-latest)

2019-06-14T16:13:02.000 controller-0 -sh: info HISTORY: PID=1152783 UID=1875 system application-update -n stx-openstack -v 1.0-14stable-latest stx-openstack-1.0-14-centos-stable-latest.tg

2019-06-14T18:20:38.000 controller-0 bash: info HISTORY: PID=3510299 UID=0 system application-update -n stx-openstack -v stx-openstack-1.0-14-centos-stable-versioned stx-openstack-1.0-14-centos-stable-versioned.tgz
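
For clarity, the steps above reduce to roughly these commands (a consolidated sketch; the exact version strings and tarball names come from this lab's shell history and will differ elsewhere):

  system application-apply stx-openstack
  # create 1-2 instances on compute-1, then:
  system application-update -n stx-openstack -v <target-version> <app-tarball>.tgz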

Expected Behavior
------------------
Instances migrate during a system application-update (the hypervisor is locked-disabled during a system application-update).
No critical-level CPU usage.

Actual Behavior
----------------
1. CPU threshold exceeded during system application-update, as follows:

On the first system application-update operation, the CPU reached critical levels;
on subsequent application-update operations it exceeded the major threshold level.
2019-06-14T16:24:37.418093 set 100.101 Platform CPU threshold exceeded ; threshold 95.00%, actual 97.32% host=compute-0 critical
2019-06-14T16:26:37.421164 set 100.101 Platform CPU threshold exceeded ; threshold 90.00%, actual 94.65% host=compute-0 major

cat fm-event.log | grep "CPU"
2019-06-14T16:24:37.000 controller-0 fmManager: info { "event_log_id" : "100.101", "reason_text" : "Platform CPU threshold exceeded ; threshold 95.00%, actual 97.32%", "entity_instance_id" : "region=RegionOne.system=yow-cgcs-wildcat-63_66.host=compute-0", "severity" : "critical", "state" : "set", "timestamp" : "2019-06-14 16:24:37.418093" }
2019-06-14T16:26:37.000 controller-0 fmManager: info { "event_log_id" : "100.101", "reason_text" : "Platform CPU threshold exceeded ; threshold 90.00%, actual 94.65%", "entity_instance_id" : "region=RegionOne.system=yow-cgcs-wildcat-63_66.host=compute-0", "severity" : "major", "state" : "set", "timestamp" : "2019-06-14 16:26:37.421164" }

2019-06-14T18:28:37.000 controller-0 fmManager: info { "event_log_id" : "100.101", "reason_text" : "Platform CPU threshold exceeded ; threshold 90.00%, actual 90.57%", "entity_instance_id" : "region=RegionOne.system=yow-cgcs-wildcat-63_66.host=compute-0", "severity" : "major", "state" : "set", "timestamp" : "2019-06-14 18:28:37.608475" }
2019-06-14T18:30:37.000 controller-0 fmManager: info { "event_log_id" : "100.101", "reason_text" : "Platform CPU threshold exceeded ; threshold 90.00%, actual 90.57%", "entity_instance_id" : "region=RegionOne.system=yow-cgcs-wildcat-63_66.host=compute-0", "severity" : "major", "state" : "clear", "timestamp" : "2019-06-14 18:30:37.593239" }

2. platform.log reports pci-interrupt-affinity 503 errors when getting instance info, as follows:

2019-06-14T16:21:55.753 compute-1 2019-06-14 warning 16:21:55,752 Thread-1[133577] pci-interrupt-affinity./usr/lib64/python2.7/site-packages/pci_irq_affinity/nova_provider.py.131 - WARNING Failed to get instances info! error=Unknown Error (HTTP 503)

2019-06-14T16:22:55.817 compute-1 2019-06-14 warning 16:22:55,817 Thread-1[133577] pci-interrupt-affinity./usr/lib64/python2.7/site-packages/pci_irq_affinity/nova_provider.py.131 - WARNING Failed to get instances info! error=Unknown Error (HTTP 503)
2019-06-14T18:25:58.751 compute-1 2019-06-14 warning 18:25:58,751 Thread-1[133577] pci-interrupt-affinity./usr/lib64/python2.7/site-packages/pci_irq_affinity/nova_provider.py.131 - WARNING Failed to get instances info! error=Unknown Error (HTTP 503)
2019-06-14T18:26:58.818 compute-1 2019-06-14 warning 18:26:58,818 Thread-1[133577] pci-interrupt-affinity./usr/lib64/python2.7/site-packages/pci_irq_affinity/nova_provider.py.131 - WARNING Failed to get instances info! error=Unknown Error (HTTP 503)

3. Instance(s) crashed on each application-update operation.
See the 500-level error in the log:
<class 'nova.exception.NeutronAdminCredentialConfigurationInvalid'>", "code": 500

Each instance crashed on each application-update with a nova-api exception
(eventually the instances were hard rebooted to recover).

2019-06-14T16:18:33.291 controller-0 VIM_Thread[1945186] ERROR Caught exception while trying to get an instance, error=[OpenStack Rest-API Exception: method=GET, url=http://nova-api.openstack.svc.cluster.local:8774/v2.1/24f381887b5d41679209360b132bcd85/servers/87b3d078-ad65-4331-8517-da64249b78f8, headers={'X-OpenStack-Nova-API-Version': '2.53'}, body=None, status_code=500, reason=HTTP Error 500: Internal Server Error, response_headers=[('Content-Length', '231'), ('X-Compute-Request-Id', 'req-7fb665fa-aa98-4d18-9958-07749bb232ba'), ('Vary', 'OpenStack-API-Version, X-OpenStack-Nova-API-Version'), ('Openstack-Api-Version', 'compute 2.53'), ('X-Openstack-Nova-Api-Version', '2.53'), ('Date', 'Fri, 14 Jun 2019 16:18:33 GMT'), ('Content-Type', 'application/json; charset=UTF-8'), ('X-Openstack-Request-Id', 'req-7fb665fa-aa98-4d18-9958-07749bb232ba')], response_body={"computeFault": {"message": "Unexpected API Error. Please report this at http://bugs.launchpad.net/nova/ and attach the Nova API log if possible.\n<class 'nova.exception.NeutronAdminCredentialConfigurationInvalid'>", "code": 500}}].
OpenStackRestAPIException: [OpenStack Rest-API Exception: method=GET, url=http://nova-api.openstack.svc.cluster.local:8774/v2.1/24f381887b5d41679209360b132bcd85/servers/87b3d078-ad65-4331-8517-da64249b78f8, headers={'X-OpenStack-Nova-API-Version': '2.53'}, body=None, status_code=500, reason=HTTP Error 500: Internal Server Error, response_headers=[('Content-Length', '231'), ('X-Compute-Request-Id', 'req-7fb665fa-aa98-4d18-9958-07749bb232ba'), ('Vary', 'OpenStack-API-Version, X-OpenStack-Nova-API-Version'), ('Openstack-Api-Version', 'compute 2.53'), ('X-Openstack-Nova-Api-Version', '2.53'), ('Date', 'Fri, 14 Jun 2019 16:18:33 GMT'), ('Content-Type', 'application/json; charset=UTF-8'), ('X-Openstack-Request-Id', 'req-7fb665fa-aa98-4d18-9958-07749bb232ba')], response_body={"computeFault": {"message": "Unexpected API Error. Please report this at http://bugs.launchpad.net/nova/ and attach the Nova API log if possible.\n<class 'nova.exception.NeutronAdminCredentialConfigurationInvalid'>", "code": 500}}]

2019-06-14T18:23:51.347 controller-0 VIM_Thread[1945186] ERROR Caught exception while trying to get an instance, error=[OpenStack Rest-API Exception: method=GET, url=http://nova-api.openstack.svc.cluster.local:8774/v2.1/24f381887b5d41679209360b132bcd85/servers/369db3e0-30e8-412d-8b89-dd8ee658ad1f, headers={'X-OpenStack-Nova-API-Version': '2.53'}, body=None, status_code=500, reason=HTTP Error 500: Internal Server Error, response_headers=[('Content-Length', '231'), ('X-Compute-Request-Id', 'req-ee775414-44b3-4260-8bcf-b8c3c1427a4d'), ('Vary', 'OpenStack-API-Version, X-OpenStack-Nova-API-Version'), ('Openstack-Api-Version', 'compute 2.53'), ('X-Openstack-Nova-Api-Version', '2.53'), ('Date', 'Fri, 14 Jun 2019 18:23:51 GMT'), ('Content-Type', 'application/json; charset=UTF-8'), ('X-Openstack-Request-Id', 'req-ee775414-44b3-4260-8bcf-b8c3c1427a4d')], response_body={"computeFault": {"message": "Unexpected API Error. Please report this at http://bugs.launchpad.net/nova/ and attach the Nova API log if possible.\n<class 'nova.exception.NeutronAdminCredentialConfigurationInvalid'>", "code": 500}}].
OpenStackRestAPIException: [OpenStack Rest-API Exception: method=GET, url=http://nova-api.openstack.svc.cluster.local:8774/v2.1/24f381887b5d41679209360b132bcd85/servers/369db3e0-30e8-412d-8b89-dd8ee658ad1f, headers={'X-OpenStack-Nova-API-Version': '2.53'}, body=None, status_code=500, reason=HTTP Error 500: Internal Server Error, response_headers=[('Content-Length', '231'), ('X-Compute-Request-Id', 'req-ee775414-44b3-4260-8bcf-b8c3c1427a4d'), ('Vary', 'OpenStack-API-Version, X-OpenStack-Nova-API-Version'), ('Openstack-Api-Version', 'compute 2.53'), ('X-Openstack-Nova-Api-Version', '2.53'), ('Date', 'Fri, 14 Jun 2019 18:23:51 GMT'), ('Content-Type', 'application/json; charset=UTF-8'), ('X-Openstack-Request-Id', 'req-ee775414-44b3-4260-8bcf-b8c3c1427a4d')], response_body={"computeFault": {"message": "Unexpected API Error. Please report this at http://bugs.launchpad.net/nova/ and attach the Nova API log if possible.\n<class 'nova.exception.NeutronAdminCredentialConfigurationInvalid'>", "code": 500}}]

[The instance log instance-00000005.log indicates instance shutdown, reason=crashed]

2019-06-14 16:01:44.537+0000: starting up libvirt version: 4.7.0, package: 1.tis.100 (Unknown, 2019-06-12-17:02:54, 990e9d9abdda), qemu version: 3.0.0qemu-kvm-ev-3.0.0-0.tis.100, kernel: 3.10.0-957.12.2.el7.1.tis.x86_64, hostname: compute-1
LC_ALL=C PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin HOME=/root QEMU_AUDIO_DRV=none /usr/libexec/qemu-kvm -name guest=instance-00000005,debug-threads=on -S -object secret,id=masterKey0,format=raw,file=/var/lib/libvirt/qemu/domain-1-instance-00000005/master-key.aes -machine pc-i440fx-3.0,accel=kvm,usb=off,dump-guest-core=off -m 4096 -realtime mlock=off -smp 2,sockets=2,cores=1,threads=1 -uuid 87b3d078-ad65-4331-8517-da64249b78f8 -smbios 'type=1,manufacturer=OpenStack Foundation,product=OpenStack Nova,version=19.0.1,serial=87b3d078-ad65-4331-8517-da64249b78f8,uuid=87b3d078-ad65-4331-8517-da64249b78f8,family=Virtual Machine' -no-user-config -nodefaults -chardev socket,id=charmonitor,fd=61,server,nowait -mon chardev=charmonitor,id=monitor,mode=control -rtc base=utc,driftfix=slew -global kvm-pit.lost_tick_policy=delay -no-hpet -no-shutdown -boot strict=on -device piix3-usb-uhci,id=usb,bus=pci.0,addr=0x1.0x2 -drive file=/var/lib/nova/instances/87b3d078-ad65-4331-8517-da64249b78f8/disk,format=qcow2,if=none,id=drive-virtio-disk0,cache=none,discard=unmap -device virtio-blk-pci,scsi=off,bus=pci.0,addr=0x6,drive=drive-virtio-disk0,id=virtio-disk0,bootindex=1,write-cache=on -netdev tap,fd=62,id=hostnet0,vhost=on,vhostfd=63 -device virtio-net-pci,host_mtu=1500,netdev=hostnet0,id=net0,mac=fa:16:3e:8e:0c:40,bus=pci.0,addr=0x3 -netdev tap,fd=64,id=hostnet1,vhost=on,vhostfd=65 -device virtio-net-pci,host_mtu=1500,netdev=hostnet1,id=net1,mac=fa:16:3e:31:48:b7,bus=pci.0,addr=0x4 -netdev tap,fd=66,id=hostnet2,vhost=on,vhostfd=67 -device virtio-net-pci,host_mtu=1500,netdev=hostnet2,id=net2,mac=fa:16:3e:93:cb:80,bus=pci.0,addr=0x5 -chardev pty,id=charserial0,logfile=/var/lib/nova/instances/87b3d078-ad65-4331-8517-da64249b78f8/console.log,logappend=off -device isa-serial,chardev=charserial0,id=serial0 -device usb-tablet,id=input0,bus=usb.0,port=1 -vnc 0.0.0.0:0 -device cirrus-vga,id=video0,bus=pci.0,addr=0x2 -device virtio-balloon-pci,id=balloon0,bus=pci.0,addr=0x7 -sandbox on,obsolete=deny,elevateprivileges=deny,spawn=deny,resourcecontrol=deny -msg timestamp=on
2019-06-14 16:01:44.537+0000: Domain id=1 is tainted: high-privileges
2019-06-14T16:01:44.987963Z qemu-kvm: -chardev pty,id=charserial0,logfile=/var/lib/nova/instances/87b3d078-ad65-4331-8517-da64249b78f8/console.log,logappend=off: char device redirected to /dev/pts/0 (label charserial0)
2019-06-14 16:24:51.605+0000: shutting down, reason=crashed
2019-06-14 16:27:36.711+0000: starting up libvirt version: 4.7.0, package: 1.tis.100 (Unknown, 2019-06-12-17:02:54, 990e9d9abdda), qemu version: 3.0.0qemu-kvm-ev-3.0.0-0.tis.100, kernel: 3.10.0-957.12.2.el7.1.tis.x86_64, hostname: compute-1
LC_ALL=C PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin HOME=/root QEMU_AUDIO_DRV=none /usr/libexec/qemu-kvm -name guest=instance-00000005,debug-threads=on -S -object secret,id=masterKey0,format=raw,file=/var/lib/libvirt/qemu/domain-2-instance-00000005/master-key.aes -machine pc-i440fx-3.0,accel=kvm,usb=off,dump-guest-core=off -m 4096 -realtime mlock=off -smp 2,sockets=2,cores=1,threads=1 -uuid 87b3d078-ad65-4331-8517-da64249b78f8 -smbios 'type=1,manufacturer=OpenStack Foundation,product=OpenStack Nova,version=19.0.1,serial=87b3d078-ad65-4331-8517-da64249b78f8,uuid=87b3d078-ad65-4331-8517-da64249b78f8,family=Virtual Machine' -no-user-config -nodefaults -chardev socket,id=charmonitor,fd=26,server,nowait -mon chardev=charmonitor,id=monitor,mode=control -rtc base=utc,driftfix=slew -global kvm-pit.lost_tick_policy=delay -no-hpet -no-shutdown -boot strict=on -device piix3-usb-uhci,id=usb,bus=pci.0,addr=0x1.0x2 -drive file=/var/lib/nova/instances/87b3d078-ad65-4331-8517-da64249b78f8/disk,format=qcow2,if=none,id=drive-virtio-disk0,cache=none,discard=unmap -device virtio-blk-pci,scsi=off,bus=pci.0,addr=0x6,drive=drive-virtio-disk0,id=virtio-disk0,bootindex=1,write-cache=on -netdev tap,fd=27,id=hostnet0,vhost=on,vhostfd=28 -device virtio-net-pci,host_mtu=1500,netdev=hostnet0,id=net0,mac=fa:16:3e:8e:0c:40,bus=pci.0,addr=0x3 -netdev tap,fd=29,id=hostnet1,vhost=on,vhostfd=30 -device virtio-net-pci,host_mtu=1500,netdev=hostnet1,id=net1,mac=fa:16:3e:31:48:b7,bus=pci.0,addr=0x4 -netdev tap,fd=31,id=hostnet2,vhost=on,vhostfd=32 -device virtio-net-pci,host_mtu=1500,netdev=hostnet2,id=net2,mac=fa:16:3e:93:cb:80,bus=pci.0,addr=0x5 -chardev pty,id=charserial0,logfile=/var/lib/nova/instances/87b3d078-ad65-4331-8517-da64249b78f8/console.log,logappend=off -device isa-serial,chardev=charserial0,id=serial0 -device usb-tablet,id=input0,bus=usb.0,port=1 -vnc 0.0.0.0:0 -device cirrus-vga,id=video0,bus=pci.0,addr=0x2 -device virtio-balloon-pci,id=balloon0,bus=pci.0,addr=0x7 -sandbox on,obsolete=deny,elevateprivileges=deny,spawn=deny,resourcecontrol=deny -msg timestamp=on
2019-06-14 16:27:36.712+0000: Domain id=2 is tainted: high-privileges
2019-06-14T16:27:37.302634Z qemu-kvm: -chardev pty,id=charserial0,logfile=/var/lib/nova/instances/87b3d078-ad65-4331-8517-da64249b78f8/console.log,logappend=off: char device redirected to /dev/pts/0 (label charserial0)
2019-06-14 18:05:44.497+0000: initiating migration
2019-06-14 18:05:47.276+0000: shutting down, reason=migrated
2019-06-14T18:05:47.277834Z qemu-kvm: terminating on signal 15 from pid 232446 (/usr/sbin/libvirtd)
2019-06-14 18:18:45.074+0000: starting up libvirt version: 4.7.0, package: 1.tis.100 (Unknown, 2019-06-12-17:02:54, 990e9d9abdda), qemu version: 3.0.0qemu-kvm-ev-3.0.0-0.tis.100, kernel: 3.10.0-957.12.2.el7.1.tis.x86_64, hostname: compute-1
LC_ALL=C PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin HOME=/root QEMU_AUDIO_DRV=none /usr/libexec/qemu-kvm -name guest=instance-00000005,debug-threads=on -S -object secret,id=masterKey0,format=raw,file=/var/lib/libvirt/qemu/domain-4-instance-00000005/master-key.aes -machine pc-i440fx-3.0,accel=kvm,usb=off,dump-guest-core=off -m 4096 -realtime mlock=off -smp 2,sockets=2,cores=1,threads=1 -uuid 87b3d078-ad65-4331-8517-da64249b78f8 -smbios 'type=1,manufacturer=OpenStack Foundation,product=OpenStack Nova,version=19.0.1,serial=87b3d078-ad65-4331-8517-da64249b78f8,uuid=87b3d078-ad65-4331-8517-da64249b78f8,family=Virtual Machine' -no-user-config -nodefaults -chardev socket,id=charmonitor,fd=62,server,nowait -mon chardev=charmonitor,id=monitor,mode=control -rtc base=utc,driftfix=slew -global kvm-pit.lost_tick_policy=delay -no-hpet -no-shutdown -boot strict=on -device piix3-usb-uhci,id=usb,bus=pci.0,addr=0x1.0x2 -drive file=/var/lib/nova/instances/87b3d078-ad65-4331-8517-da64249b78f8/disk,format=qcow2,if=none,id=drive-virtio-disk0,cache=none,discard=unmap -device virtio-blk-pci,scsi=off,bus=pci.0,addr=0x6,drive=drive-virtio-disk0,id=virtio-disk0,bootindex=1,write-cache=on -netdev tap,fd=63,id=hostnet0,vhost=on,vhostfd=64 -device virtio-net-pci,host_mtu=1500,netdev=hostnet0,id=net0,mac=fa:16:3e:8e:0c:40,bus=pci.0,addr=0x3 -netdev tap,fd=65,id=hostnet1,vhost=on,vhostfd=66 -device virtio-net-pci,host_mtu=1500,netdev=hostnet1,id=net1,mac=fa:16:3e:31:48:b7,bus=pci.0,addr=0x4 -netdev tap,fd=67,id=hostnet2,vhost=on,vhostfd=68 -device virtio-net-pci,host_mtu=1500,netdev=hostnet2,id=net2,mac=fa:16:3e:93:cb:80,bus=pci.0,addr=0x5 -chardev pty,id=charserial0,logfile=/var/lib/nova/instances/87b3d078-ad65-4331-8517-da64249b78f8/console.log,logappend=off -device isa-serial,chardev=charserial0,id=serial0 -device usb-tablet,id=input0,bus=usb.0,port=1 -vnc 0.0.0.0:1 -device cirrus-vga,id=video0,bus=pci.0,addr=0x2 -incoming defer -device virtio-balloon-pci,id=balloon0,bus=pci.0,addr=0x7 -sandbox on,obsolete=deny,elevateprivileges=deny,spawn=deny,resourcecontrol=deny -msg timestamp=on
2019-06-14 18:18:45.075+0000: Domain id=4 is tainted: high-privileges
2019-06-14T18:18:45.451972Z qemu-kvm: -chardev pty,id=charserial0,logfile=/var/lib/nova/instances/87b3d078-ad65-4331-8517-da64249b78f8/console.log,logappend=off: char device redirected to /dev/pts/1 (label charserial0)
2019-06-14 18:29:08.426+0000: shutting down, reason=crashed
2019-06-14 18:30:54.800+0000: starting up libvirt version: 4.7.0, package: 1.tis.100 (Unknown, 2019-06-12-17:02:54, 990e9d9abdda), qemu version: 3.0.0qemu-kvm-ev-3.0.0-0.tis.100, kernel: 3.10.0-957.12.2.el7.1.tis.x86_64, hostname: compute-1
LC_ALL=C PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin HOME=/root QEMU_AUDIO_DRV=none /usr/libexec/qemu-kvm -name guest=instance-00000005,debug-threads=on -S -object secret,id=masterKey0,format=raw,file=/var/lib/libvirt/qemu/domain-6-instance-00000005/master-key.aes -machine pc-i440fx-3.0,accel=kvm,usb=off,dump-guest-core=off -m 4096 -realtime mlock=off -smp 2,sockets=2,cores=1,threads=1 -uuid 87b3d078-ad65-4331-8517-da64249b78f8 -smbios 'type=1,manufacturer=OpenStack Foundation,product=OpenStack Nova,version=19.0.1,serial=87b3d078-ad65-4331-8517-da64249b78f8,uuid=87b3d078-ad65-4331-8517-da64249b78f8,family=Virtual Machine' -no-user-config -nodefaults -chardev socket,id=charmonitor,fd=27,server,nowait -mon chardev=charmonitor,id=monitor,mode=control -rtc base=utc,driftfix=slew -global kvm-pit.lost_tick_policy=delay -no-hpet -no-shutdown -boot strict=on -device piix3-usb-uhci,id=usb,bus=pci.0,addr=0x1.0x2 -drive file=/var/lib/nova/instances/87b3d078-ad65-4331-8517-da64249b78f8/disk,format=qcow2,if=none,id=drive-virtio-disk0,cache=none,discard=unmap -device virtio-blk-pci,scsi=off,bus=pci.0,addr=0x6,drive=drive-virtio-disk0,id=virtio-disk0,bootindex=1,write-cache=on -netdev tap,fd=28,id=hostnet0,vhost=on,vhostfd=29 -device virtio-net-pci,host_mtu=1500,netdev=hostnet0,id=net0,mac=fa:16:3e:8e:0c:40,bus=pci.0,addr=0x3 -netdev tap,fd=30,id=hostnet1,vhost=on,vhostfd=31 -device virtio-net-pci,host_mtu=1500,netdev=hostnet1,id=net1,mac=fa:16:3e:31:48:b7,bus=pci.0,addr=0x4 -netdev tap,fd=32,id=hostnet2,vhost=on,vhostfd=33 -device virtio-net-pci,host_mtu=1500,netdev=hostnet2,id=net2,mac=fa:16:3e:93:cb:80,bus=pci.0,addr=0x5 -chardev pty,id=charserial0,logfile=/var/lib/nova/instances/87b3d078-ad65-4331-8517-da64249b78f8/console.log,logappend=off -device isa-serial,chardev=charserial0,id=serial0 -device usb-tablet,id=input0,bus=usb.0,port=1 -vnc 0.0.0.0:1 -device cirrus-vga,id=video0,bus=pci.0,addr=0x2 -device virtio-balloon-pci,id=balloon0,bus=pci.0,addr=0x7 -sandbox on,obsolete=deny,elevateprivileges=deny,spawn=deny,resourcecontrol=deny -msg timestamp=on
2019-06-14 18:30:54.800+0000: Domain id=6 is tainted: high-privileges
2019-06-14T18:30:55.534249Z qemu-kvm: -chardev pty,id=charserial0,logfile=/var/lib/nova/instances/87b3d078-ad65-4331-8517-da64249b78f8/console.log,logappend=off: char device redirected to /dev/pts/1 (label charserial0)

Reproducibility
---------------
Reproducible

System Configuration
--------------------
HW (2+2 system)
wcp 63-66

Branch/Pull Time/Commit
-----------------------
BUILD_DATE=20190613T013000Z

Last Pass
---------
feature

Timestamp/Logs
--------------
see inline

Test Activity
-------------
Regression test

Numan Waheed (nwaheed)
tags: added: stx.retestneeded
Revision history for this message
Ghada Khalil (gkhalil) wrote :

Marking as release gating; instances should be unaffected by an application-update.

Note: As per Brent, the statement that the expected behavior is that instances should be migrated is not correct. The instances should remain in a running state.

tags: added: stx.2.0 stx.containers
Changed in starlingx:
importance: Undecided → High
status: New → Triaged
Ghada Khalil (gkhalil)
Changed in starlingx:
assignee: nobody → Gerry Kopec (gerry-kopec)
Revision history for this message
Wendy Mitchell (wmitchellwr) wrote :

Reproduced BUILD_ID="20190617T233000Z"

Another instance failure on a system application-update; the compute host goes to NodeNotReady state.
Tested from version 1.0-15-centos-stable-latest to version 1.0-15-centos-stable-versioned

The update appears to be 'completed' but the instance on compute-1 failed.
The libvirt pod appears in Terminating state and the compute-1 hypervisor goes down:
libvirt-libvirt-default-dvj5v 1/1 Terminating 0 97m
libvirt-libvirt-default-h4m8r 1/1 Running 0 104s
libvirt-libvirt-default-rk2gx 1/1 Running 0 28m

  Normal NodeNotReady 112s (x18 over 70m) kubelet, compute-1 Node compute-1 status is now: NodeNotReady
  Normal NodeReady 52s (x19 over 5h11m) kubelet, compute-1 Node compute-1 status is now: NodeReady

platform.log (compute-1)
2019-06-18T20:16:12.380 compute-1 2019-06-18 info 20:16:12,379 Thread-15[122404] pci-interrupt-affinity./usr/lib64/python2.7/site-packages/pci_irq_affinity/agent.py.95 - INFO instance_created: uuid=2c1c5a73-99ac-4b06-a0e3-9724d5ff9b81.
2019-06-18T20:16:12.625 compute-1 2019-06-18 warning 20:16:12,625 Thread-15[122404] pci-interrupt-affinity./usr/lib64/python2.7/site-packages/pci_irq_affinity/guest.py.262 - WARNING Failed to get domain for uuid=2c1c5a73-99ac-4b06-a0e3-9724d5ff9b81! error=Cannot write data: Broken pipe
2019-06-18T20:26:04.606 compute-1 2019-06-18 warning 20:26:04,605 Thread-1[122404] pci-interrupt-affinity./usr/lib64/python2.7/site-packages/pci_irq_affinity/nova_provider.py.131 - WARNING Failed to get instances info! error=Unknown Error (HTTP 503)
2019-06-18T20:27:05.419 compute-1 2019-06-18 warning 20:27:05,419 Thread-1[122404] pci-interrupt-affinity./usr/lib64/python2.7/site-packages/pci_irq_affinity/nova_provider.py.131 - WARNING Failed to get instances info! error=Unknown Error (HTTP 503)
2019-06-18T20:28:07.418 compute-1 2019-06-18 warning 20:28:07,417 Thread-1[122404] pci-interrupt-affinity./usr/lib64/python2.7/site-packages/pci_irq_affinity/nova_provider.py.131 - WARNING Failed to get instances info! error=Unknown Error (HTTP 503)

daemon.log (compute-1)

2019-06-18T20:36:45.568 compute-1 kubelet[50524]: info I0618 20:36:45.568288 50524 setters.go:518] Node became not ready: {Type:Ready Status:False LastHeartbeatTime:2019-06-18 20:36:45.568264194 +0000 UTC m=+14532.912687448 LastTransitionTime:2019-06-18 20:36:45.568264194 +0000 UTC m=+14532.912687448 Reason:KubeletNotReady Message:PLEG is not healthy: pleg was last seen active 3m3.786180934s ago; threshold is 3m0s}
2019-06-18T20:40:46.146 compute-1 kubelet[50524]: info I0618 20:40:46.146444 50524 setters.go:518] Node became not ready: {Type:Ready Status:False LastHeartbeatTime:2019-06-18 20:40:46.14641817 +0000 UTC m=+14773.490841420 LastTransitionTime:2019-06-18 20:40:46.14641817 +0000 UTC m=+14773.490841420 Reason:KubeletNotReady Message:PLEG is not healthy: pleg was last seen active 3m3.333005661s ago; threshold is 3m0s}

compute-1:/var/log/libvirt/qemu# cat instance-00000006.log
2019-06-18 20:16:08.480+0000: starting up libvirt version: 4.7.0, package: 1.tis.100 (Unknown, 2...


Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to config (master)

Fix proposed to branch: master
Review: https://review.opendev.org/671955

Changed in starlingx:
status: Triaged → In Progress
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to config (master)

Reviewed: https://review.opendev.org/671955
Committed: https://git.openstack.org/cgit/starlingx/config/commit/?id=c67ad17608fe496b6a12c86fe1f593732bdaaec5
Submitter: Zuul
Branch: master

commit c67ad17608fe496b6a12c86fe1f593732bdaaec5
Author: Gerry Kopec <email address hidden>
Date: Sun Jul 21 22:36:30 2019 -0400

    Add more cgroup controllers to libvirt qemu.conf

    Add cgroup controllers freezer, net_cls & perf_event to
    libvirt/qemu.conf. This will move these controllers from the k8s-infra
    group used by the libvirt pod to the machine.slice group.
    For background, see commit:
    https://review.opendev.org/#/c/648511/

    This prevents VM processes from being killed when the libvirt pod is
    terminated.

    Change-Id: I6bf9871fbb5e516bc21a23f966381862ab330a8a
    Partial-Bug: 1833096
    Signed-off-by: Gerry Kopec <email address hidden>
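
For reference, the change amounts to extending the controller list in libvirt's qemu.conf, roughly as follows (a sketch that assumes the stock controller list plus the three controllers named in the commit message; the authoritative diff is in the review above):

  # /etc/libvirt/qemu.conf (sketch)
  # Controllers listed here are managed by libvirt per-VM under
  # machine.slice, instead of the VM processes inheriting the libvirt
  # pod's k8s-infra cgroup - so terminating the pod no longer kills VMs.
  cgroup_controllers = [ "cpu", "devices", "memory", "blkio", "cpuset",
                         "cpuacct", "freezer", "net_cls", "perf_event" ]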

Revision history for this message
Frank Miller (sensfan22) wrote :

Gerry's commit has addressed the issue with the VMs crashing when the libvirtd pod is terminated.
But a second commit is required to prevent the VIM from trying to migrate or evacuate VMs when libvirt is being restarted. The VIM takes action when the nova-compute service is down, and the nova-compute service is disabled when libvirtd is down.

To address this, a commit is required that changes the VIM to not take action when nova-compute is down, but only raise an alarm. There is not much point in the VIM trying to take action in this case, as it cannot perform any nova actions while nova-compute is down. (A future feature might be needed to take action at the server level, but that is beyond the scope of this bug.)
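
In pseudocode, the requested change amounts to something like this (illustrative only; the function and alarm names are hypothetical placeholders, not the actual nfv-vim API):

  def handle_compute_service_down(host):
      # Old behaviour: "disable" the host, triggering migration or
      # evacuation of its instances - which cannot succeed while
      # nova-compute itself is down.
      # New behaviour: leave the host and instances alone; just raise
      # an alarm so the operator knows the compute service failed.
      raise_alarm('compute-service-failed', entity=host.name)  # hypothetical helper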

Changed in starlingx:
assignee: Gerry Kopec (gerry-kopec) → Al Bailey (albailey1974)
Revision history for this message
Frank Miller (sensfan22) wrote :

Assigning to Al to implement the 2nd commit.

Revision history for this message
Frank Miller (sensfan22) wrote :

Re-assigning to Bart as Al is tied up on another LP.

Changed in starlingx:
assignee: Al Bailey (albailey1974) → Bart Wensley (bartwensley)
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to nfv (master)

Fix proposed to branch: master
Review: https://review.opendev.org/677019

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to nfv (master)

Reviewed: https://review.opendev.org/677019
Committed: https://git.openstack.org/cgit/starlingx/nfv/commit/?id=a9004988dc37bdd9caefdbcf911472b38c4db5ac
Submitter: Zuul
Branch: master

commit a9004988dc37bdd9caefdbcf911472b38c4db5ac
Author: Bart Wensley <email address hidden>
Date: Fri Aug 16 13:48:09 2019 -0500

    Make VIM tolerant of compute service failures

    When the VIM detects the nova compute service is down on a
    worker host, it attempts to migrate instances off that host
    (by "disabling" the host). However, this isn't possible if
    the compute service is down. The VIM then fails the instances,
    which will eventually result in their evacuation (if the host
    goes offline) or a reboot of the instance (if the compute
    service recovers).

    In the containers world, when the libvirt pod is restarted
    (e.g. when stx-openstack application is re-applied), nova
    reports that the compute service is down (for a short period
    of time), which causes the undesirable behaviour described
    above. The VIM is being updated to not disable the host in
    this case and instead just raise an alarm to indicate that
    the compute service has failed.

    Change-Id: I186d8d76bbcd87405bafec47deb92ec24580640e
    Closes-Bug: 1833096
    Signed-off-by: Bart Wensley <email address hidden>

Changed in starlingx:
status: In Progress → Fix Released
Revision history for this message
Ghada Khalil (gkhalil) wrote :

@Bart, please cherrypick to the r/stx.2.0 branch before 2019-08-23

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to nfv (r/stx.2.0)

Fix proposed to branch: r/stx.2.0
Review: https://review.opendev.org/677706

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to nfv (r/stx.2.0)

Reviewed: https://review.opendev.org/677706
Committed: https://git.openstack.org/cgit/starlingx/nfv/commit/?id=4ea74a99c9148ee5c813a3bec086103c463634e5
Submitter: Zuul
Branch: r/stx.2.0

commit 4ea74a99c9148ee5c813a3bec086103c463634e5
Author: Bart Wensley <email address hidden>
Date: Fri Aug 16 13:48:09 2019 -0500

    Make VIM tolerant of compute service failures

    When the VIM detects the nova compute service is down on a
    worker host, it attempts to migrate instances off that host
    (by "disabling" the host). However, this isn't possible if
    the compute service is down. The VIM then fails the instances,
    which will eventually result in their evacuation (if the host
    goes offline) or a reboot of the instance (if the compute
    service recovers).

    In the containers world, when the libvirt pod is restarted
    (e.g. when stx-openstack application is re-applied), nova
    reports that the compute service is down (for a short period
    of time), which causes the undesirable behaviour described
    above. The VIM is being updated to not disable the host in
    this case and instead just raise an alarm to indicate that
    the compute service has failed.

    Change-Id: I186d8d76bbcd87405bafec47deb92ec24580640e
    Closes-Bug: 1833096
    Signed-off-by: Bart Wensley <email address hidden>
    (cherry picked from commit a9004988dc37bdd9caefdbcf911472b38c4db5ac)

Ghada Khalil (gkhalil)
tags: added: in-r-stx20
Revision history for this message
Wendy Mitchell (wmitchellwr) wrote :

Verified: no instance crash observed during a system application-update operation.
HW: ip20-27

BUILD_DATE="2019-08-20 21:00:35 -0400"

$ system application-list
+---------------------+---------------------------+-------------------------------+----------------+---------+----------------------------------------+
| application         | version                   | manifest name                 | manifest file  | status  | progress                               |
+---------------------+---------------------------+-------------------------------+----------------+---------+----------------------------------------+
| platform-integ-apps | 1.0-7                     | platform-integration-manifest | manifest.yaml  | applied | completed                              |
| stx-openstack       | 1.0-17-centos-stable-     | armada-manifest               | stx-openstack. | applied | Application update from version 1.0-17 |
|                     | latest                    |                               | yaml           |         | -centos-stable-versioned to version 1. |
|                     |                           |                               |                |         | 0-17-centos-stable-latest completed.   |
+---------------------+---------------------------+-------------------------------+----------------+---------+----------------------------------------+

tags: removed: stx.retestneeded
Revision history for this message
Oscar Medina Perez (omedinap) wrote :

Verified: no issue was present in either platform.log or fm-event.log:

BUILD_DATE="2019-08-26 23:30:00 +0000"

The commands below did not show anything related to "pci-interrupt-affinity" or "CPU":

cat /var/log/platform.log | grep "pci-interrupt-affinity"

cat /var/log/fm-event.log | grep "CPU"
