Activity log for bug #1917857

Date Who What changed Old value New value Message
2021-03-05 08:05:49 Venkata Veldanda bug added bug
2021-03-05 08:05:49 Venkata Veldanda attachment added controller-0_20210303.161347.tar https://bugs.launchpad.net/bugs/1917857/+attachment/5473326/+files/controller-0_20210303.161347.tar
2021-03-05 08:10:49 Venkata Veldanda description

Brief Description
We are using STX 4.0.1 to install our FlexRAN-based 5G solution in AIO-SX mode. We had created VFs on the N3000 device and on some of the NIC interfaces. These resources were reflected in the Kubernetes allocatable resources. During the course of using the system, the allocatable resources for the N3000 and one of the NIC interface cards started coming up as 0. The following is part of the "kubectl describe nodes" output. The affected resources are intel.com/intel_fpga_fec, intel.com/pci_sriov_net_datanet_c, and intel.com/pci_sriov_net_datanet_u. We already tried lock/unlock and deleting and re-creating the resources, but none of these helped to recover them.

[root@controller-0 sysadmin(keystone_admin)]# cat /etc/build.info
###
### StarlingX
###  Release 20.06
###
OS="centos"
SW_VERSION="20.06"
BUILD_TARGET="Host Installer"
BUILD_TYPE="Formal"
BUILD_ID="r/stx.4.0"
JOB="STX_4.0_build_layer_flock"
BUILD_BY="starlingx.build@cengn.ca"
BUILD_NUMBER="22"
BUILD_HOST="starlingx_mirror"
BUILD_DATE="2020-08-05 12:25:52 +0000"
FLOCK_OS="centos"
FLOCK_JOB="STX_4.0_build_layer_flock"
FLOCK_BUILD_BY="starlingx.build@cengn.ca"
FLOCK_BUILD_NUMBER="22"
FLOCK_BUILD_HOST="starlingx_mirror"
FLOCK_BUILD_DATE="2020-08-05 12:25:52 +0000"

Capacity:
  cpu: 96
  ephemeral-storage: 10190100Ki
  hugepages-1Gi: 46Gi
  hugepages-2Mi: 0
  intel.com/intel_fpga_fec: 0
  intel.com/pci_sriov_net_datanet_c: 0
  intel.com/pci_sriov_net_datanet_u: 0
  intel.com/pci_sriov_net_datanetbh1: 8
  intel.com/pci_sriov_net_datanetdn1: 8
  intel.com/pci_sriov_net_datanetmh1: 8
  memory: 97436728Ki
  pods: 110
Allocatable:
  cpu: 92
  ephemeral-storage: 9391196145
  hugepages-1Gi: 46Gi
  hugepages-2Mi: 0
  intel.com/intel_fpga_fec: 0
  intel.com/pci_sriov_net_datanet_c: 0
  intel.com/pci_sriov_net_datanet_u: 0
  intel.com/pci_sriov_net_datanetbh1: 8
  intel.com/pci_sriov_net_datanetdn1: 8
  intel.com/pci_sriov_net_datanetmh1: 8

It seems like everything is "OK" up to the SR-IOV device plugin, because the SR-IOV
device plugin pod logs do show that the correct number of resources is being reported to Kubernetes.

Here are some of the CLI outputs:

[root@controller-0 sysadmin(keystone_admin)]# system host-device-show controller-0 pci_0000_1d_00_0
+-----------------------+---------------------------------------------------------------------------------------------------------+
| Property | Value |
+-----------------------+---------------------------------------------------------------------------------------------------------+
| name | pci_0000_1d_00_0 |
| address | 0000:1d:00.0 |
| class id | 120000 |
| vendor id | 8086 |
| device id | 0d8f |
| class name | Processing accelerators |
| vendor name | Intel Corporation |
| device name | Device 0d8f |
| numa_node | 0 |
| enabled | True |
| sriov_totalvfs | 8 |
| sriov_numvfs | 8 |
| sriov_vfs_pci_address | 0000:1d:00.1,0000:1d:00.2,0000:1d:00.3,0000:1d:00.4,0000:1d:00.5,0000:1d:00.6,0000:1d:00.7,0000:1d:01.0 |
| sriov_vf_pdevice_id | 0d90 |
| extra_info | |
| created_at | 2021-03-03T13:46:26.363470+00:00 |
| updated_at | 2021-03-03T13:47:12.684827+00:00 |
| root_key | None |
| revoked_key_ids | None |
| boot_page | None |
| bitstream_id | None |
| bmc_build_version | None |
| bmc_fw_version | None |
| driver | igb_uio |
| sriov_vf_driver | igb_uio |
+-----------------------+---------------------------------------------------------------------------------------------------------+

[root@controller-0 sysadmin(keystone_admin)]# system host-if-show controller-0 sriovfh1
+-----------------+--------------------------------------+
| Property | Value |
+-----------------+--------------------------------------+
| ifname | sriovfh1 |
| iftype | ethernet |
| ports | [u'enp177s0f3'] |
| imac | 40:a6:b7:34:e4:a3 |
| imtu | 9216 |
| ifclass | pci-sriov |
| ptp_role | none |
| aemode | None |
| schedpolicy | None |
| txhashpolicy | None |
| uuid | 6f30a690-2414-424f-b5fc-d324d63cc502 |
| ihost_uuid | 8075e0db-4cc5-4d74-8601-849adce97b7e |
| vlan_id | None |
| uses | [] |
| used_by | [] |
| created_at | |
| updated_at | |
| sriov_numvfs | 16 |
| sriov_vf_driver | vfio |
| accelerated | [True] |
+-----------------+--------------------------------------+

[root@controller-0 sysadmin(keystone_admin)]# system interface-datanetwork-list controller-0
+--------------+--------------------------------------+----------+------------------+
| hostname | uuid | ifname | datanetwork_name |
+--------------+--------------------------------------+----------+------------------+
| controller-0 | 63a44e7b-18f4-4f9b-8504-a950cb8abb86 | sriovfh1 | datanet-c |
| controller-0 | 6aff29d7-cfaf-48b4-9802-b17b8a025efc | sriovdn1 | datanetdn1 |
| controller-0 | 76a2da50-11a6-408e-90b3-3a316cef6557 | sriovmh1 | datanetmh1 |
| controller-0 | e155e1d0-8dec-47e6-ac60-076832698a95 | sriovfh1 | datanet-u |
| controller-0 | e569db46-a31b-4b8f-b7ca-175b1168798f | sriovbh1 | datanetbh1 |
+--------------+--------------------------------------+----------+------------------+

[root@controller-0 sysadmin(keystone_admin)]# system interface-datanetwork-show 63a44e7b-18f4-4f9b-8504-a950cb8abb86
+------------------+--------------------------------------+
| Property | Value |
+------------------+--------------------------------------+
| hostname | controller-0 |
| uuid | 63a44e7b-18f4-4f9b-8504-a950cb8abb86 |
| ifname | sriovfh1 |
| datanetwork_name | datanet-c |
+------------------+--------------------------------------+

[root@controller-0 sysadmin(keystone_admin)]# system interface-datanetwork-show e155e1d0-8dec-47e6-ac60-076832698a95
+------------------+--------------------------------------+
| Property | Value |
+------------------+--------------------------------------+
| hostname | controller-0 |
| uuid | e155e1d0-8dec-47e6-ac60-076832698a95 |
| ifname | sriovfh1 |
| datanetwork_name | datanet-u |
+------------------+--------------------------------------+

sriov device plugin logs:
=====================================================================================
controller-0:/home/sysadmin# kubectl logs kube-sriov-device-plugin-amd64-h9kp8 -n kube-system | grep intel_fpga_fec
      "resourceName": "intel_fpga_fec",
I0303 13:47:07.275600 138581 manager.go:106] unmarshalled ResourceList: [{ResourcePrefix: ResourceName:pci_sriov_net_datanet_c DeviceType:netDevice Selectors:0xc00000c900 SelectorObj:0xc0002601e0} {ResourcePrefix: ResourceName:pci_sriov_net_datanetbh1 DeviceType:netDevice Selectors:0xc00000c920 SelectorObj:0xc000260280} {ResourcePrefix: ResourceName:pci_sriov_net_datanetdn1 DeviceType:netDevice Selectors:0xc00000c940 SelectorObj:0xc000260320} {ResourcePrefix: ResourceName:pci_sriov_net_datanetmh1 DeviceType:netDevice Selectors:0xc00000c960 SelectorObj:0xc0002603c0} {ResourcePrefix: ResourceName:pci_sriov_net_datanet_u DeviceType:netDevice Selectors:0xc00000c9a0 SelectorObj:0xc000260460} {ResourcePrefix: ResourceName:intel_fpga_fec DeviceType:accelerator Selectors:0xc00000c9c0 SelectorObj:0xc00014c820}]
I0303 13:47:07.275709 138581 manager.go:193] validating resource name "intel.com/intel_fpga_fec"
I0303 13:47:07.450400 138581 manager.go:116] Creating new ResourcePool: intel_fpga_fec
I0303 13:47:07.450446 138581 manager.go:145] New resource server is created for intel_fpga_fec ResourcePool
I0303 13:47:07.453772 138581 server.go:191] starting intel_fpga_fec device plugin endpoint at: intel.com_intel_fpga_fec.sock
I0303 13:47:07.454032 138581 server.go:217] intel_fpga_fec device plugin endpoint started serving
I0303 13:47:07.640208 138581 server.go:106] Plugin: intel.com_intel_fpga_fec.sock gets registered successfully at Kubelet
I0303 13:47:07.640225 138581 server.go:131] ListAndWatch(intel_fpga_fec) invoked
I0303 13:47:07.640342 138581 server.go:139] ListAndWatch(intel_fpga_fec): send devices &ListAndWatchResponse{Devices:[]*Device{&Device{ID:0000:1d:01.0,Health:Healthy,Topology:&TopologyInfo{Nodes:[]*NUMANode{&NUMANode{ID:0,},},},},&Device{ID:0000:1d:00.1,Health:Healthy,Topology:&TopologyInfo{Nodes:[]*NUMANode{&NUMANode{ID:0,},},},},&Device{ID:0000:1d:00.2,Health:Healthy,Topology:&TopologyInfo{Nodes:[]*NUMANode{&NUMANode{ID:0,},},},},&Device{ID:0000:1d:00.3,Health:Healthy,Topology:&TopologyInfo{Nodes:[]*NUMANode{&NUMANode{ID:0,},},},},&Device{ID:0000:1d:00.4,Health:Healthy,Topology:&TopologyInfo{Nodes:[]*NUMANode{&NUMANode{ID:0,},},},},&Device{ID:0000:1d:00.5,Health:Healthy,Topology:&TopologyInfo{Nodes:[]*NUMANode{&NUMANode{ID:0,},},},},&Device{ID:0000:1d:00.6,Health:Healthy,Topology:&TopologyInfo{Nodes:[]*NUMANode{&NUMANode{ID:0,},},},},&Device{ID:0000:1d:00.7,Health:Healthy,Topology:&TopologyInfo{Nodes:[]*NUMANode{&NUMANode{ID:0,},},},},},}

controller-0:/home/sysadmin# kubectl logs kube-sriov-device-plugin-amd64-h9kp8 -n kube-system | grep pci_sriov_net_datanet_c
      "resourceName": "pci_sriov_net_datanet_c",
I0303 13:47:07.275600 138581 manager.go:106] unmarshalled ResourceList: [{ResourcePrefix: ResourceName:pci_sriov_net_datanet_c DeviceType:netDevice Selectors:0xc00000c900 SelectorObj:0xc0002601e0} {ResourcePrefix: ResourceName:pci_sriov_net_datanetbh1 DeviceType:netDevice Selectors:0xc00000c920 SelectorObj:0xc000260280} {ResourcePrefix: ResourceName:pci_sriov_net_datanetdn1 DeviceType:netDevice Selectors:0xc00000c940 SelectorObj:0xc000260320} {ResourcePrefix: ResourceName:pci_sriov_net_datanetmh1 DeviceType:netDevice Selectors:0xc00000c960 SelectorObj:0xc0002603c0} {ResourcePrefix: ResourceName:pci_sriov_net_datanet_u DeviceType:netDevice Selectors:0xc00000c9a0 SelectorObj:0xc000260460} {ResourcePrefix: ResourceName:intel_fpga_fec DeviceType:accelerator Selectors:0xc00000c9c0 SelectorObj:0xc00014c820}]
I0303 13:47:07.275645 138581 manager.go:193] validating resource name "intel.com/pci_sriov_net_datanet_c"
I0303 13:47:07.449979 138581 manager.go:116] Creating new ResourcePool: pci_sriov_net_datanet_c
I0303 13:47:07.450122 138581 manager.go:145] New resource server is created for pci_sriov_net_datanet_c ResourcePool
I0303 13:47:07.450478 138581 server.go:191] starting pci_sriov_net_datanet_c device plugin endpoint at: intel.com_pci_sriov_net_datanet_c.sock
I0303 13:47:07.451088 138581 server.go:217] pci_sriov_net_datanet_c device plugin endpoint started serving
I0303 13:47:07.639929 138581 server.go:131] ListAndWatch(pci_sriov_net_datanet_c) invoked
I0303 13:47:07.640068 138581 server.go:106] Plugin: intel.com_pci_sriov_net_datanet_c.sock gets registered successfully at Kubelet
I0303 13:47:07.639996 138581 server.go:139] ListAndWatch(pci_sriov_net_datanet_c): send devices &ListAndWatchResponse{Devices:[]*Device{&Device{ID:0000:b1:0f.4,Health:Healthy,Topology:&TopologyInfo{Nodes:[]*NUMANode{&NUMANode{ID:1,},},},},&Device{ID:0000:b1:0f.2,Health:Healthy,Topology:&TopologyInfo{Nodes:[]*NUMANode{&NUMANode{ID:1,},},},},&Device{ID:0000:b1:0e.4,Health:Healthy,Topology:&TopologyInfo{Nodes:[]*NUMANode{&NUMANode{ID:1,},},},},&Device{ID:0000:b1:0f.0,Health:Healthy,Topology:&TopologyInfo{Nodes:[]*NUMANode{&NUMANode{ID:1,},},},},&Device{ID:0000:b1:0e.1,Health:Healthy,Topology:&TopologyInfo{Nodes:[]*NUMANode{&NUMANode{ID:1,},},},},&Device{ID:0000:b1:0e.3,Health:Healthy,Topology:&TopologyInfo{Nodes:[]*NUMANode{&NUMANode{ID:1,},},},},&Device{ID:0000:b1:0e.6,Health:Healthy,Topology:&TopologyInfo{Nodes:[]*NUMANode{&NUMANode{ID:1,},},},},&Device{ID:0000:b1:0f.1,Health:Healthy,Topology:&TopologyInfo{Nodes:[]*NUMANode{&NUMANode{ID:1,},},},},&Device{ID:0000:b1:0f.3,Health:Healthy,Topology:&TopologyInfo{Nodes:[]*NUMANode{&NUMANode{ID:1,},},},},&Device{ID:0000:b1:0f.5,Health:Healthy,Topology:&TopologyInfo{Nodes:[]*NUMANode{&NUMANode{ID:1,},},},},&Device{ID:0000:b1:0f.6,Health:Healthy,Topology:&TopologyInfo{Nodes:[]*NUMANode{&NUMANode{ID:1,},},},},&Device{ID:0000:b1:0e.2,Health:Healthy,Topology:&TopologyInfo{Nodes:[]*NUMANode{&NUMANode{ID:1,},},},},&Device{ID:0000:b1:0e.5,Health:Healthy,Topology:&TopologyInfo{Nodes:[]*NUMANode{&NUMANode{ID:1,},},},},&Device{ID:0000:b1:0e.7,Health:Healthy,Topology:&TopologyInfo{Nodes:[]*NUMANode{&NUMANode{ID:1,},},},},&Device{ID:0000:b1:0f.7,Health:Healthy,Topology:&TopologyInfo{Nodes:[]*NUMANode{&NUMANode{ID:1,},},},},&Device{ID:0000:b1:0e.0,Health:Healthy,Topology:&TopologyInfo{Nodes:[]*NUMANode{&NUMANode{ID:1,},},},},},}

controller-0:/home/sysadmin# kubectl logs kube-sriov-device-plugin-amd64-h9kp8 -n kube-system | grep pci_sriov_net_datanet_u
      "resourceName": "pci_sriov_net_datanet_u",
I0303 13:47:07.275600 138581 manager.go:106] unmarshalled ResourceList: [{ResourcePrefix: ResourceName:pci_sriov_net_datanet_c DeviceType:netDevice Selectors:0xc00000c900 SelectorObj:0xc0002601e0} {ResourcePrefix: ResourceName:pci_sriov_net_datanetbh1 DeviceType:netDevice Selectors:0xc00000c920 SelectorObj:0xc000260280} {ResourcePrefix: ResourceName:pci_sriov_net_datanetdn1 DeviceType:netDevice Selectors:0xc00000c940 SelectorObj:0xc000260320} {ResourcePrefix: ResourceName:pci_sriov_net_datanetmh1 DeviceType:netDevice Selectors:0xc00000c960 SelectorObj:0xc0002603c0} {ResourcePrefix: ResourceName:pci_sriov_net_datanet_u DeviceType:netDevice Selectors:0xc00000c9a0 SelectorObj:0xc000260460} {ResourcePrefix: ResourceName:intel_fpga_fec DeviceType:accelerator Selectors:0xc00000c9c0 SelectorObj:0xc00014c820}]
I0303 13:47:07.275700 138581 manager.go:193] validating resource name "intel.com/pci_sriov_net_datanet_u"
I0303 13:47:07.450306 138581 manager.go:116] Creating new ResourcePool: pci_sriov_net_datanet_u
I0303 13:47:07.450388 138581 manager.go:145] New resource server is created for pci_sriov_net_datanet_u ResourcePool
I0303 13:47:07.453443 138581 server.go:191] starting pci_sriov_net_datanet_u device plugin endpoint at: intel.com_pci_sriov_net_datanet_u.sock
I0303 13:47:07.453750 138581 server.go:217] pci_sriov_net_datanet_u device plugin endpoint started serving
I0303 13:47:07.639929 138581 server.go:131] ListAndWatch(pci_sriov_net_datanet_u) invoked
I0303 13:47:07.639945 138581 server.go:139] ListAndWatch(pci_sriov_net_datanet_u): send devices &ListAndWatchResponse{Devices:[]*Device{&Device{ID:0000:b1:0e.1,Health:Healthy,Topology:&TopologyInfo{Nodes:[]*NUMANode{&NUMANode{ID:1,},},},},&Device{ID:0000:b1:0e.5,Health:Healthy,Topology:&TopologyInfo{Nodes:[]*NUMANode{&NUMANode{ID:1,},},},},&Device{ID:0000:b1:0e.2,Health:Healthy,Topology:&TopologyInfo{Nodes:[]*NUMANode{&NUMANode{ID:1,},},},},&Device{ID:0000:b1:0e.3,Health:Healthy,Topology:&TopologyInfo{Nodes:[]*NUMANode{&NUMANode{ID:1,},},},},&Device{ID:0000:b1:0f.0,Health:Healthy,Topology:&TopologyInfo{Nodes:[]*NUMANode{&NUMANode{ID:1,},},},},&Device{ID:0000:b1:0e.0,Health:Healthy,Topology:&TopologyInfo{Nodes:[]*NUMANode{&NUMANode{ID:1,},},},},&Device{ID:0000:b1:0f.1,Health:Healthy,Topology:&TopologyInfo{Nodes:[]*NUMANode{&NUMANode{ID:1,},},},},&Device{ID:0000:b1:0f.3,Health:Healthy,Topology:&TopologyInfo{Nodes:[]*NUMANode{&NUMANode{ID:1,},},},},&Device{ID:0000:b1:0f.5,Health:Healthy,Topology:&TopologyInfo{Nodes:[]*NUMANode{&NUMANode{ID:1,},},},},&Device{ID:0000:b1:0f.6,Health:Healthy,Topology:&TopologyInfo{Nodes:[]*NUMANode{&NUMANode{ID:1,},},},},&Device{ID:0000:b1:0f.7,Health:Healthy,Topology:&TopologyInfo{Nodes:[]*NUMANode{&NUMANode{ID:1,},},},},&Device{ID:0000:b1:0e.4,Health:Healthy,Topology:&TopologyInfo{Nodes:[]*NUMANode{&NUMANode{ID:1,},},},},&Device{ID:0000:b1:0e.6,Health:Healthy,Topology:&TopologyInfo{Nodes:[]*NUMANode{&NUMANode{ID:1,},},},},&Device{ID:0000:b1:0e.7,Health:Healthy,Topology:&TopologyInfo{Nodes:[]*NUMANode{&NUMANode{ID:1,},},},},&Device{ID:0000:b1:0f.2,Health:Healthy,Topology:&TopologyInfo{Nodes:[]*NUMANode{&NUMANode{ID:1,},},},},&Device{ID:0000:b1:0f.4,Health:Healthy,Topology:&TopologyInfo{Nodes:[]*NUMANode{&NUMANode{ID:1,},},},},},}
I0303 13:47:07.640791 138581 server.go:106] Plugin: intel.com_pci_sriov_net_datanet_u.sock gets registered successfully at Kubelet

Severity: Critical - This is a showstopper and blocks the deployment of FlexRAN.

Steps to Reproduce:
1> Created 8 VFs on the FEC device with the igb_uio driver and 16 VFs on a 10G NIC
with the vfio driver
2> system lock and unlock of the host
3> Checked the FEC and NIC resources reported by Kubernetes:
   Allocatable:
     intel.com/intel_fpga_fec: 8
     intel.com/pci_sriov_net_datanet_c: 16
     intel.com/pci_sriov_net_datanet_u: 16
4> system lock and unlock of the host multiple times during regular usage
5> The Kubernetes allocatable resources become 0 and then never recover, even after multiple host lock/unlock cycles:
   Allocatable:
     intel.com/intel_fpga_fec: 0
     intel.com/pci_sriov_net_datanet_c: 0
     intel.com/pci_sriov_net_datanet_u: 0
6> The sriov daemonset pod logs seem to indicate correct processing of the above resource set definition and registration with the Kubelet

Expected Behavior: All FPGA and SR-IOV resources properly populated in the output of "kubectl describe nodes controller-0"

Actual Behavior: Resources are not seen as expected

Reproducibility: Intermittent

System Configuration: Simplex (AIO)

Branch/Pull Time/Commit: StarlingX 4.0 official ISO from http://mirror.starlingx.cengn.ca/mirror/starlingx/release/4.0.1/
The same issue is observed even with the ISO built on 26-Feb-2021 03:40: http://mirror.starlingx.cengn.ca/mirror/starlingx/master/centos/latest_green_build/outputs/

Timestamp/Logs
--------------
The logs are huge and the page https://files.starlingx.kube.cengn.ca/ is not reachable. Is there any other place we can upload the logs, or let us know which individual log files you are looking for?

Test Activity: Evaluation

Workaround: None
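Since the failure signature is simply these counters dropping to 0 in the node's Allocatable block, the check in steps 3 and 5 can be scripted. A minimal sketch in plain shell: the sample values below are copied from this report, and in a live check the real "kubectl describe node controller-0" output would be fed in instead of the sample.

```shell
#!/bin/sh
# Sample 'Allocatable' entries as printed by 'kubectl describe node controller-0'
# (values copied from this bug report; replace with live output for a real check).
sample='intel.com/intel_fpga_fec: 0
intel.com/pci_sriov_net_datanet_c: 0
intel.com/pci_sriov_net_datanet_u: 0
intel.com/pci_sriov_net_datanetbh1: 8
intel.com/pci_sriov_net_datanetdn1: 8
intel.com/pci_sriov_net_datanetmh1: 8'

# List every intel.com resource whose allocatable count is 0.
zeroed=$(printf '%s\n' "$sample" | awk -F': *' '/^ *intel\.com\// && $2+0 == 0 {print $1}')
echo "zeroed resources:"
echo "$zeroed"
```

Running this after each lock/unlock cycle would catch the intermittent drop to 0 as soon as it happens, rather than only when a pod fails to schedule.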
with the vfio driver 2> system lock and unlock the host 3> checked the FEC device resources from the k8s Allocatable: intel.com/intel_fpga_fec: 8 intel.com/pci_sriov_net_datanet_c: 16 intel.com/pci_sriov_net_datanet_u: 16 4> system lock and unlock the host multiple times during regular usage 5> check that the k8s allocatable resources become 0 and then never recover, even after multiple host lock/unlock cycles Allocatable: intel.com/intel_fpga_fec: 0 intel.com/pci_sriov_net_datanet_c: 0 intel.com/pci_sriov_net_datanet_u: 0 6> sriov daemonset pod logs seem to indicate correct processing of the above resource set definition and registration with Kubelet Expected Behavior: All FPGA and SR-IOV resources properly populated in the output of "kubectl describe nodes controller-0" Actual Behavior: Resources are not seen as expected Reproducibility: Intermittent System Configuration Simplex (AIO) Branch/Pull Time/Commit StarlingX 4.0 Official ISO from http://mirror.starlingx.cengn.ca/mirror/starlingx/release/4.0.1/ The same issue was observed even with the ISO built on 26-Feb-2021 03:40 http://mirror.starlingx.cengn.ca/mirror/starlingx/master/centos/latest_green_build/outputs/ Timestamp/Logs -------------- Logs are attached Timestamp: Issue occurred on 03-03-2021 Test Activity Evaluation Workaround None
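The reproduction steps above boil down to watching the node's intel.com/* extended resources drop to 0 after host lock/unlock cycles. Not part of the original report: a small shell sketch of how that failing state could be detected automatically. The function name and the sample input are illustrative; on a live node the input would come from the Allocatable section of `kubectl describe node controller-0`.

```shell
# Hypothetical helper: print the name of every intel.com/* extended resource
# whose reported count is 0 (the symptom described in step 5 above).
check_zero_resources() {
  # Split "name: count" on the colon; keep lines matching intel.com/* with count 0.
  awk -F': *' '/intel\.com\// && $2 == 0 { gsub(/^ +/, "", $1); print $1 }'
}

# Sample input mimicking the Allocatable section quoted in the report.
zeroed=$(check_zero_resources <<'EOF'
  intel.com/intel_fpga_fec: 0
  intel.com/pci_sriov_net_datanet_c: 0
  intel.com/pci_sriov_net_datanetbh1: 8
EOF
)
printf '%s\n' "$zeroed"
```

A non-empty result would flag exactly the resources this bug reports as lost (intel_fpga_fec and the datanet_c/datanet_u pools), while healthy pools such as datanetbh1 are ignored.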
2021-03-05 08:15:36 Venkata Veldanda description Brief Description We are using STX 4.0.1 to install our FlexRAN-based 5G solution in AIO-SX mode. We had created VFs on the N3000 device and on some of the NIC interfaces. These resources were reflected in the Kubernetes allocatable resources. During the course of using the system, the allocatable resources for the N3000 and one of the NIC interface cards started coming up as 0. The following is part of the kubectl describe nodes output.
The affected resources are intel.com/intel_fpga_fec, intel.com/pci_sriov_net_datanet_c, and intel.com/pci_sriov_net_datanet_u. We already tried lock/unlock and deleting and re-creating the resources, but none of these helped to recover the resources [root@controller-0 sysadmin(keystone_admin)]# cat /etc/build.info ### ### StarlingX ### Release 20.06 ### OS="centos" SW_VERSION="20.06" BUILD_TARGET="Host Installer" BUILD_TYPE="Formal" BUILD_ID="r/stx.4.0" JOB="STX_4.0_build_layer_flock" BUILD_BY="starlingx.build@cengn.ca" BUILD_NUMBER="22" BUILD_HOST="starlingx_mirror" BUILD_DATE="2020-08-05 12:25:52 +0000" FLOCK_OS="centos" FLOCK_JOB="STX_4.0_build_layer_flock" FLOCK_BUILD_BY="starlingx.build@cengn.ca" FLOCK_BUILD_NUMBER="22" FLOCK_BUILD_HOST="starlingx_mirror" FLOCK_BUILD_DATE="2020-08-05 12:25:52 +0000" Capacity:   cpu: 96   ephemeral-storage: 10190100Ki   hugepages-1Gi: 46Gi   hugepages-2Mi: 0   intel.com/intel_fpga_fec: 0   intel.com/pci_sriov_net_datanet_c: 0   intel.com/pci_sriov_net_datanet_u: 0   intel.com/pci_sriov_net_datanetbh1: 8   intel.com/pci_sriov_net_datanetdn1: 8   intel.com/pci_sriov_net_datanetmh1: 8   memory: 97436728Ki   pods: 110 Allocatable:   cpu: 92   ephemeral-storage: 9391196145   hugepages-1Gi: 46Gi   hugepages-2Mi: 0   intel.com/intel_fpga_fec: 0   intel.com/pci_sriov_net_datanet_c: 0   intel.com/pci_sriov_net_datanet_u: 0   intel.com/pci_sriov_net_datanetbh1: 8   intel.com/pci_sriov_net_datanetdn1: 8   intel.com/pci_sriov_net_datanetmh1: 8 It seems like everything is "OK" up to the sriov device plugin, because the sriov device plugin pod logs do show that the correct number of resources is being reported to Kubernetes. Here are some of the CLI outputs: [root@controller-0 sysadmin(keystone_admin)]# system host-device-show controller-0 pci_0000_1d_00_0 +-----------------------+---------------------------------------------------------------------------------------------------------+ | Property | Value |
+-----------------------+---------------------------------------------------------------------------------------------------------+ | name | pci_0000_1d_00_0 | | address | 0000:1d:00.0 | | class id | 120000 | | vendor id | 8086 | | device id | 0d8f | | class name | Processing accelerators | | vendor name | Intel Corporation | | device name | Device 0d8f | | numa_node | 0 | | enabled | True | | sriov_totalvfs | 8 | | sriov_numvfs | 8 | | sriov_vfs_pci_address | 0000:1d:00.1,0000:1d:00.2,0000:1d:00.3,0000:1d:00.4,0000:1d:00.5,0000:1d:00.6,0000:1d:00.7,0000:1d:01.0 | | sriov_vf_pdevice_id | 0d90 | | extra_info | | | created_at | 2021-03-03T13:46:26.363470+00:00 | | updated_at | 2021-03-03T13:47:12.684827+00:00 | | root_key | None | | revoked_key_ids | None | | boot_page | None | | bitstream_id | None | | bmc_build_version | None | | bmc_fw_version | None | | driver | igb_uio | | sriov_vf_driver | igb_uio | +-----------------------+---------------------------------------------------------------------------------------------------------+ [root@controller-0 sysadmin(keystone_admin)]# system host-if-show controller-0 sriovfh1 +-----------------+--------------------------------------+ | Property | Value | +-----------------+--------------------------------------+ | ifname | sriovfh1 | | iftype | ethernet | | ports | [u'enp177s0f3'] | | imac | 40:a6:b7:34:e4:a3 | | imtu | 9216 | | ifclass | pci-sriov | | ptp_role | none | | aemode | None | | schedpolicy | None | | txhashpolicy | None | | uuid | 6f30a690-2414-424f-b5fc-d324d63cc502 | | ihost_uuid | 8075e0db-4cc5-4d74-8601-849adce97b7e | | vlan_id | None | | uses | [] | | used_by | [] | | created_at | | | updated_at | | | sriov_numvfs | 16 | | sriov_vf_driver | vfio | | accelerated | [True] | +-----------------+--------------------------------------+ [root@controller-0 sysadmin(keystone_admin)]# system interface-datanetwork-list controller-0 +--------------+--------------------------------------+----------+------------------+ 
| hostname | uuid | ifname | datanetwork_name | +--------------+--------------------------------------+----------+------------------+ | controller-0 | 63a44e7b-18f4-4f9b-8504-a950cb8abb86 | sriovfh1 | datanet-c | | controller-0 | 6aff29d7-cfaf-48b4-9802-b17b8a025efc | sriovdn1 | datanetdn1 | | controller-0 | 76a2da50-11a6-408e-90b3-3a316cef6557 | sriovmh1 | datanetmh1 | | controller-0 | e155e1d0-8dec-47e6-ac60-076832698a95 | sriovfh1 | datanet-u | | controller-0 | e569db46-a31b-4b8f-b7ca-175b1168798f | sriovbh1 | datanetbh1 | +--------------+--------------------------------------+----------+------------------+ [root@controller-0 sysadmin(keystone_admin)]# system interface-datanetwork-list controller-0 +--------------+--------------------------------------+----------+------------------+ | hostname | uuid | ifname | datanetwork_name | +--------------+--------------------------------------+----------+------------------+ | controller-0 | 63a44e7b-18f4-4f9b-8504-a950cb8abb86 | sriovfh1 | datanet-c | | controller-0 | 6aff29d7-cfaf-48b4-9802-b17b8a025efc | sriovdn1 | datanetdn1 | | controller-0 | 76a2da50-11a6-408e-90b3-3a316cef6557 | sriovmh1 | datanetmh1 | | controller-0 | e155e1d0-8dec-47e6-ac60-076832698a95 | sriovfh1 | datanet-u | | controller-0 | e569db46-a31b-4b8f-b7ca-175b1168798f | sriovbh1 | datanetbh1 | +--------------+--------------------------------------+----------+------------------+ [root@controller-0 sysadmin(keystone_admin)]# system interface-datanetwork-show 63a44e7b-18f4-4f9b-8504-a950cb8abb86 +------------------+--------------------------------------+ | Property | Value | +------------------+--------------------------------------+ | hostname | controller-0 | | uuid | 63a44e7b-18f4-4f9b-8504-a950cb8abb86 | | ifname | sriovfh1 | | datanetwork_name | datanet-c | +------------------+--------------------------------------+ [root@controller-0 sysadmin(keystone_admin)]#
system interface-datanetwork-show e155e1d0-8dec-47e6-ac60-076832698a95 +------------------+--------------------------------------+ | Property | Value | +------------------+--------------------------------------+ | hostname | controller-0 | | uuid | e155e1d0-8dec-47e6-ac60-076832698a95 | | ifname | sriovfh1 | | datanetwork_name | datanet-u | +------------------+--------------------------------------+ [root@controller-0 sysadmin(keystone_admin)]# sriov device plugin logs: ===================================================================================== controller-0:/home/sysadmin# kubectl logs kube-sriov-device-plugin-amd64-h9kp8 -n kube-system | grep intel_fpga_fec       "resourceName": "intel_fpga_fec", I0303 13:47:07.275600 138581 manager.go:106] unmarshalled ResourceList: [{ResourcePrefix: ResourceName:pci_sriov_net_datanet_c DeviceType:netDevice Selectors:0xc00000c900 SelectorObj:0xc0002601e0} {ResourcePrefix: ResourceName:pci_sriov_net_datanetbh1 DeviceType:netDevice Selectors:0xc00000c920 SelectorObj:0xc000260280} {ResourcePrefix: ResourceName:pci_sriov_net_datanetdn1 DeviceType:netDevice Selectors:0xc00000c940 SelectorObj:0xc000260320} {ResourcePrefix: ResourceName:pci_sriov_net_datanetmh1 DeviceType:netDevice Selectors:0xc00000c960 SelectorObj:0xc0002603c0} {ResourcePrefix: ResourceName:pci_sriov_net_datanet_u DeviceType:netDevice Selectors:0xc00000c9a0 SelectorObj:0xc000260460} {ResourcePrefix: ResourceName:intel_fpga_fec DeviceType:accelerator Selectors:0xc00000c9c0 SelectorObj:0xc00014c820}] I0303 13:47:07.275709 138581 manager.go:193] validating resource name "intel.com/intel_fpga_fec" I0303 13:47:07.450400 138581 manager.go:116] Creating new ResourcePool: intel_fpga_fec I0303 13:47:07.450446 138581 manager.go:145] New resource server is created for intel_fpga_fec ResourcePool I0303 13:47:07.453772 138581 server.go:191] starting intel_fpga_fec device plugin endpoint at: intel.com_intel_fpga_fec.sock I0303
13:47:07.454032 138581 server.go:217] intel_fpga_fec device plugin endpoint started serving I0303 13:47:07.640208 138581 server.go:106] Plugin: intel.com_intel_fpga_fec.sock gets registered successfully at Kubelet I0303 13:47:07.640225 138581 server.go:131] ListAndWatch(intel_fpga_fec) invoked I0303 13:47:07.640342 138581 server.go:139] ListAndWatch(intel_fpga_fec): send devices &ListAndWatchResponse{Devices:[]*Device{&Device{ID:0000:1d:01.0,Health:Healthy,Topology:&TopologyInfo{Nodes:[]*NUMANode{&NUMANode{ID:0,},},},},&Device{ID:0000:1d:00.1,Health:Healthy,Topology:&TopologyInfo{Nodes:[]*NUMANode{&NUMANode{ID:0,},},},},&Device{ID:0000:1d:00.2,Health:Healthy,Topology:&TopologyInfo{Nodes:[]*NUMANode{&NUMANode{ID:0,},},},},&Device{ID:0000:1d:00.3,Health:Healthy,Topology:&TopologyInfo{Nodes:[]*NUMANode{&NUMANode{ID:0,},},},},&Device{ID:0000:1d:00.4,Health:Healthy,Topology:&TopologyInfo{Nodes:[]*NUMANode{&NUMANode{ID:0,},},},},&Device{ID:0000:1d:00.5,Health:Healthy,Topology:&TopologyInfo{Nodes:[]*NUMANode{&NUMANode{ID:0,},},},},&Device{ID:0000:1d:00.6,Health:Healthy,Topology:&TopologyInfo{Nodes:[]*NUMANode{&NUMANode{ID:0,},},},},&Device{ID:0000:1d:00.7,Health:Healthy,Topology:&TopologyInfo{Nodes:[]*NUMANode{&NUMANode{ID:0,},},},},},} controller-0:/home/sysadmin# kubectl logs kube-sriov-device-plugin-amd64-h9kp8 -n kube-system | grep pci_sriov_net_datanet_c       "resourceName": "pci_sriov_net_datanet_c", I0303 13:47:07.275600 138581 manager.go:106] unmarshalled ResourceList: [{ResourcePrefix: ResourceName:pci_sriov_net_datanet_c DeviceType:netDevice Selectors:0xc00000c900 SelectorObj:0xc0002601e0} {ResourcePrefix: ResourceName:pci_sriov_net_datanetbh1 DeviceType:netDevice Selectors:0xc00000c920 SelectorObj:0xc000260280} {ResourcePrefix: ResourceName:pci_sriov_net_datanetdn1 DeviceType:netDevice Selectors:0xc00000c940
SelectorObj:0xc000260320} {ResourcePrefix: ResourceName:pci_sriov_net_datanetmh1 DeviceType:netDevice Selectors:0xc00000c960 SelectorObj:0xc0002603c0} {ResourcePrefix: ResourceName:pci_sriov_net_datanet_u DeviceType:netDevice Selectors:0xc00000c9a0 SelectorObj:0xc000260460} {ResourcePrefix: ResourceName:intel_fpga_fec DeviceType:accelerator Selectors:0xc00000c9c0 SelectorObj:0xc00014c820}] I0303 13:47:07.275645 138581 manager.go:193] validating resource name "intel.com/pci_sriov_net_datanet_c" I0303 13:47:07.449979 138581 manager.go:116] Creating new ResourcePool: pci_sriov_net_datanet_c I0303 13:47:07.450122 138581 manager.go:145] New resource server is created for pci_sriov_net_datanet_c ResourcePool I0303 13:47:07.450478 138581 server.go:191] starting pci_sriov_net_datanet_c device plugin endpoint at: intel.com_pci_sriov_net_datanet_c.sock I0303 13:47:07.451088 138581 server.go:217] pci_sriov_net_datanet_c device plugin endpoint started serving I0303 13:47:07.639929 138581 server.go:131] ListAndWatch(pci_sriov_net_datanet_c) invoked I0303 13:47:07.640068 138581 server.go:106] Plugin: intel.com_pci_sriov_net_datanet_c.sock gets registered successfully at Kubelet I0303 13:47:07.639996 138581 server.go:139] ListAndWatch(pci_sriov_net_datanet_c): send devices 
&ListAndWatchResponse{Devices:[]*Device{&Device{ID:0000:b1:0f.4,Health:Healthy,Topology:&TopologyInfo{Nodes:[]*NUMANode{&NUMANode{ID:1,},},},},&Device{ID:0000:b1:0f.2,Health:Healthy,Topology:&TopologyInfo{Nodes:[]*NUMANode{&NUMANode{ID:1,},},},},&Device{ID:0000:b1:0e.4,Health:Healthy,Topology:&TopologyInfo{Nodes:[]*NUMANode{&NUMANode{ID:1,},},},},&Device{ID:0000:b1:0f.0,Health:Healthy,Topology:&TopologyInfo{Nodes:[]*NUMANode{&NUMANode{ID:1,},},},},&Device{ID:0000:b1:0e.1,Health:Healthy,Topology:&TopologyInfo{Nodes:[]*NUMANode{&NUMANode{ID:1,},},},},&Device{ID:0000:b1:0e.3,Health:Healthy,Topology:&TopologyInfo{Nodes:[]*NUMANode{&NUMANode{ID:1,},},},},&Device{ID:0000:b1:0e.6,Health:Healthy,Topology:&TopologyInfo{Nodes:[]*NUMANode{&NUMANode{ID:1,},},},},&Device{ID:0000:b1:0f.1,Health:Healthy,Topology:&TopologyInfo{Nodes:[]*NUMANode{&NUMANode{ID:1,},},},},&Device{ID:0000:b1:0f.3,Health:Healthy,Topology:&TopologyInfo{Nodes:[]*NUMANode{&NUMANode{ID:1,},},},},&Device{ID:0000:b1:0f.5,Health:Healthy,Topology:&TopologyInfo{Nodes:[]*NUMANode{&NUMANode{ID:1,},},},},&Device{ID:0000:b1:0f.6,Health:Healthy,Topology:&TopologyInfo{Nodes:[]*NUMANode{&NUMANode{ID:1,},},},},&Device{ID:0000:b1:0e.2,Health:Healthy,Topology:&TopologyInfo{Nodes:[]*NUMANode{&NUMANode{ID:1,},},},},&Device{ID:0000:b1:0e.5,Health:Healthy,Topology:&TopologyInfo{Nodes:[]*NUMANode{&NUMANode{ID:1,},},},},&Device{ID:0000:b1:0e.7,Health:Healthy,Topology:&TopologyInfo{Nodes:[]*NUMANode{&NUMANode{ID:1,},},},},&Device{ID:0000:b1:0f.7,Health:Healthy,Topology:&TopologyInfo{Nodes:[]*NUMANode{&NUMANode{ID:1,},},},},&Device{ID:0000:b1:0e.0,Health:Healthy,Topology:&TopologyInfo{Nodes:[]*NUMANode{&NUMANode{ID:1,},},},},},} controller-0:/home/sysadmin# kubectl logs kube-sriov-device-plugin-amd64-h9kp8 -n kube-system | grep pci_sriov_net_datanet_u       "resourceName": "pci_sriov_net_datanet_u", I0303 13:47:07.275600 138581 manager.go:106] unmarshalled ResourceList: [{ResourcePrefix: ResourceName:pci_sriov_net_datanet_c
DeviceType:netDevice Selectors:0xc00000c900 SelectorObj:0xc0002601e0} {ResourcePrefix: ResourceName:pci_sriov_net_datanetbh1 DeviceType:netDevice Selectors:0xc00000c920 SelectorObj:0xc000260280} {ResourcePrefix: ResourceName:pci_sriov_net_datanetdn1 DeviceType:netDevice Selectors:0xc00000c940 SelectorObj:0xc000260320} {ResourcePrefix: ResourceName:pci_sriov_net_datanetmh1 DeviceType:netDevice Selectors:0xc00000c960 SelectorObj:0xc0002603c0} {ResourcePrefix: ResourceName:pci_sriov_net_datanet_u DeviceType:netDevice Selectors:0xc00000c9a0 SelectorObj:0xc000260460} {ResourcePrefix: ResourceName:intel_fpga_fec DeviceType:accelerator Selectors:0xc00000c9c0 SelectorObj:0xc00014c820}] I0303 13:47:07.275700 138581 manager.go:193] validating resource name "intel.com/pci_sriov_net_datanet_u" I0303 13:47:07.450306 138581 manager.go:116] Creating new ResourcePool: pci_sriov_net_datanet_u I0303 13:47:07.450388 138581 manager.go:145] New resource server is created for pci_sriov_net_datanet_u ResourcePool I0303 13:47:07.453443 138581 server.go:191] starting pci_sriov_net_datanet_u device plugin endpoint at: intel.com_pci_sriov_net_datanet_u.sock I0303 13:47:07.453750 138581 server.go:217] pci_sriov_net_datanet_u device plugin endpoint started serving I0303 13:47:07.639929 138581 server.go:131] ListAndWatch(pci_sriov_net_datanet_u) invoked I0303 13:47:07.639945 138581 server.go:139] ListAndWatch(pci_sriov_net_datanet_u): send devices 
&ListAndWatchResponse{Devices:[]*Device{&Device{ID:0000:b1:0e.1,Health:Healthy,Topology:&TopologyInfo{Nodes:[]*NUMANode{&NUMANode{ID:1,},},},},&Device{ID:0000:b1:0e.5,Health:Healthy,Topology:&TopologyInfo{Nodes:[]*NUMANode{&NUMANode{ID:1,},},},},&Device{ID:0000:b1:0e.2,Health:Healthy,Topology:&TopologyInfo{Nodes:[]*NUMANode{&NUMANode{ID:1,},},},},&Device{ID:0000:b1:0e.3,Health:Healthy,Topology:&TopologyInfo{Nodes:[]*NUMANode{&NUMANode{ID:1,},},},},&Device{ID:0000:b1:0f.0,Health:Healthy,Topology:&TopologyInfo{Nodes:[]*NUMANode{&NUMANode{ID:1,},},},},&Device{ID:0000:b1:0e.0,Health:Healthy,Topology:&TopologyInfo{Nodes:[]*NUMANode{&NUMANode{ID:1,},},},},&Device{ID:0000:b1:0f.1,Health:Healthy,Topology:&TopologyInfo{Nodes:[]*NUMANode{&NUMANode{ID:1,},},},},&Device{ID:0000:b1:0f.3,Health:Healthy,Topology:&TopologyInfo{Nodes:[]*NUMANode{&NUMANode{ID:1,},},},},&Device{ID:0000:b1:0f.5,Health:Healthy,Topology:&TopologyInfo{Nodes:[]*NUMANode{&NUMANode{ID:1,},},},},&Device{ID:0000:b1:0f.6,Health:Healthy,Topology:&TopologyInfo{Nodes:[]*NUMANode{&NUMANode{ID:1,},},},},&Device{ID:0000:b1:0f.7,Health:Healthy,Topology:&TopologyInfo{Nodes:[]*NUMANode{&NUMANode{ID:1,},},},},&Device{ID:0000:b1:0e.4,Health:Healthy,Topology:&TopologyInfo{Nodes:[]*NUMANode{&NUMANode{ID:1,},},},},&Device{ID:0000:b1:0e.6,Health:Healthy,Topology:&TopologyInfo{Nodes:[]*NUMANode{&NUMANode{ID:1,},},},},&Device{ID:0000:b1:0e.7,Health:Healthy,Topology:&TopologyInfo{Nodes:[]*NUMANode{&NUMANode{ID:1,},},},},&Device{ID:0000:b1:0f.2,Health:Healthy,Topology:&TopologyInfo{Nodes:[]*NUMANode{&NUMANode{ID:1,},},},},&Device{ID:0000:b1:0f.4,Health:Healthy,Topology:&TopologyInfo{Nodes:[]*NUMANode{&NUMANode{ID:1,},},},},},} I0303 13:47:07.640791 138581 server.go:106] Plugin: intel.com_pci_sriov_net_datanet_u.sock gets registered successfully at Kubelet Severity: Critical - This is a show stopper and blocks the deployment of Flexran Steps to Reproduce: 1> created 8 VFs on FEC device with igb_uio driver and 16 VFs on a 10G NIC 
with vfio driver 2> system lock and unlock the host 3> checked the resource of FEC device from k8s Allocatable: intel.com/intel_fpga_fec: 8 intel.com/pci_sriov_net_datanet_c: 16 intel.com/pci_sriov_net_datanet_u: 16 4> system lock and unlock the host multiple times during regular usage 5> check that the k8s allocatable resources become 0 and then never recover, even after multiple host lock/unlock cycles  Allocatable: intel.com/intel_fpga_fec: 0 intel.com/pci_sriov_net_datanet_c: 0 intel.com/pci_sriov_net_datanet_u: 0 6> sriov daemonset pod logs seem to indicate the correct processing of the above resource set definition and registration to Kubelet Expected Behavior: All FPGA resources and SRIOV resources properly populated in the output of "kubectl describe nodes controller-0" Actual Behavior: Resources are not seen as expected Reproducibility: Intermittent System Configuration Simplex (AIO) Branch/Pull Time/Commit StarlingX 4.0 Official ISO from http://mirror.starlingx.cengn.ca/mirror/starlingx/release/4.0.1/ Same issue observed even with the ISO build on 26-Feb-2021 03:40 http://mirror.starlingx.cengn.ca/mirror/starlingx/master/centos/latest_green_build/outputs/ Timestamp/Logs -------------- Logs are attached. Issue occurred on 03-03-2021 Test Activity Evaluation Workaround None
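The failure signature in the steps above (extended resources silently dropping to zero in the node's allocatable list) can be flagged programmatically. A minimal sketch follows; the dictionary is static sample data mirroring the `kubectl describe nodes` output in this report, not live cluster state. On a real system the same data could be read with `kubectl get node controller-0 -o jsonpath='{.status.allocatable}'`.

```python
# Sample "allocatable" section, copied from the kubectl describe output in
# this report (hypothetical static data for illustration only).
allocatable = {
    "intel.com/intel_fpga_fec": "0",
    "intel.com/pci_sriov_net_datanet_c": "0",
    "intel.com/pci_sriov_net_datanet_u": "0",
    "intel.com/pci_sriov_net_datanetbh1": "8",
    "intel.com/pci_sriov_net_datanetdn1": "8",
    "intel.com/pci_sriov_net_datanetmh1": "8",
}

# Flag extended device-plugin resources whose capacity has dropped to zero.
zeroed = sorted(name for name, qty in allocatable.items()
                if name.startswith("intel.com/") and int(qty) == 0)
print(zeroed)
# prints ['intel.com/intel_fpga_fec', 'intel.com/pci_sriov_net_datanet_c', 'intel.com/pci_sriov_net_datanet_u']
```

Such a check could be run after each lock/unlock cycle to catch the intermittent regression described here as soon as it occurs.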
2021-03-05 15:16:17 Venkata Veldanda description Brief Description We are using STX 4.0.1 to install our Flexran-based 5G solution in AIO-SX mode. We had created VFs on the N3000 device and some of the NIC interfaces. These resources were reflected in the Kubernetes allocatable resources. During the course of using the system, the allocatable resources for N3000 and one of the NIC interface cards started coming up as 0. The following is a part of kubectl describe nodes output. The affected resources are intel.com/intel_fpga_fec, intel.com/pci_sriov_net_datanet_c, and intel.com/pci_sriov_net_datanet_u. We already tried lock/unlock, deleting and re-creating the resources, but none of these helped to recover the resources. [root@controller-0 sysadmin(keystone_admin)]# cat /etc/build.info ### ### StarlingX ### Release 20.06 ### OS="centos" SW_VERSION="20.06" BUILD_TARGET="Host Installer" BUILD_TYPE="Formal" BUILD_ID="r/stx.4.0" JOB="STX_4.0_build_layer_flock" BUILD_BY="starlingx.build@cengn.ca" BUILD_NUMBER="22" BUILD_HOST="starlingx_mirror" BUILD_DATE="2020-08-05 12:25:52 +0000" FLOCK_OS="centos" FLOCK_JOB="STX_4.0_build_layer_flock" FLOCK_BUILD_BY="starlingx.build@cengn.ca" FLOCK_BUILD_NUMBER="22" FLOCK_BUILD_HOST="starlingx_mirror" FLOCK_BUILD_DATE="2020-08-05 12:25:52 +0000" Capacity:   cpu: 96   ephemeral-storage: 10190100Ki   hugepages-1Gi: 46Gi   hugepages-2Mi: 0   intel.com/intel_fpga_fec: 0   intel.com/pci_sriov_net_datanet_c: 0   intel.com/pci_sriov_net_datanet_u: 0   intel.com/pci_sriov_net_datanetbh1: 8   intel.com/pci_sriov_net_datanetdn1: 8   intel.com/pci_sriov_net_datanetmh1: 8   memory: 97436728Ki   pods: 110 Allocatable:   cpu: 92   ephemeral-storage: 9391196145   hugepages-1Gi: 46Gi   hugepages-2Mi: 0   intel.com/intel_fpga_fec: 0   intel.com/pci_sriov_net_datanet_c: 0   intel.com/pci_sriov_net_datanet_u: 0   intel.com/pci_sriov_net_datanetbh1: 8   intel.com/pci_sriov_net_datanetdn1: 8   intel.com/pci_sriov_net_datanetmh1: 8 It seems like everything is "OK"
up to the sriov device plugin, because the sriov device plugin pod logs do show that the correct number of resources are getting updated towards Kubernetes Here are some of the CLI outputs: [root@controller-0 sysadmin(keystone_admin)]# system host-device-show controller-0 pci_0000_1d_00_0 +-----------------------+---------------------------------------------------------------------------------------------------------+ | Property | Value | +-----------------------+---------------------------------------------------------------------------------------------------------+ | name | pci_0000_1d_00_0 | | address | 0000:1d:00.0 | | class id | 120000 | | vendor id | 8086 | | device id | 0d8f | | class name | Processing accelerators | | vendor name | Intel Corporation | | device name | Device 0d8f | | numa_node | 0 | | enabled | True | | sriov_totalvfs | 8 | | sriov_numvfs | 8 | | sriov_vfs_pci_address | 0000:1d:00.1,0000:1d:00.2,0000:1d:00.3,0000:1d:00.4,0000:1d:00.5,0000:1d:00.6,0000:1d:00.7,0000:1d:01.0 | | sriov_vf_pdevice_id | 0d90 | | extra_info | | | created_at | 2021-03-03T13:46:26.363470+00:00 | | updated_at | 2021-03-03T13:47:12.684827+00:00 | | root_key | None | | revoked_key_ids | None | | boot_page | None | | bitstream_id | None | | bmc_build_version | None | | bmc_fw_version | None | | driver | igb_uio | | sriov_vf_driver | igb_uio | +-----------------------+---------------------------------------------------------------------------------------------------------+ [root@controller-0 sysadmin(keystone_admin)]# system host-if-show controller-0 sriovfh1 +-----------------+--------------------------------------+ | Property | Value | +-----------------+--------------------------------------+ | ifname | sriovfh1 | | iftype | ethernet | | ports | [u'enp177s0f3'] | | imac | 40:a6:b7:34:e4:a3 | | imtu | 9216 | | ifclass | pci-sriov | | ptp_role | none | | aemode | None | | schedpolicy | None | | txhashpolicy | None | | uuid | 6f30a690-2414-424f-b5fc-d324d63cc502 | | ihost_uuid
| 8075e0db-4cc5-4d74-8601-849adce97b7e | | vlan_id | None | | uses | [] | | used_by | [] | | created_at | | | updated_at | | | sriov_numvfs | 16 | | sriov_vf_driver | vfio | | accelerated | [True] | +-----------------+--------------------------------------+ [root@controller-0 sysadmin(keystone_admin)]# system interface-datanetwork-list controller-0 +--------------+--------------------------------------+----------+------------------+ | hostname | uuid | ifname | datanetwork_name | +--------------+--------------------------------------+----------+------------------+ | controller-0 | 63a44e7b-18f4-4f9b-8504-a950cb8abb86 | sriovfh1 | datanet-c | | controller-0 | 6aff29d7-cfaf-48b4-9802-b17b8a025efc | sriovdn1 | datanetdn1 | | controller-0 | 76a2da50-11a6-408e-90b3-3a316cef6557 | sriovmh1 | datanetmh1 | | controller-0 | e155e1d0-8dec-47e6-ac60-076832698a95 | sriovfh1 | datanet-u | | controller-0 | e569db46-a31b-4b8f-b7ca-175b1168798f | sriovbh1 | datanetbh1 | +--------------+--------------------------------------+----------+------------------+ [root@controller-0 sysadmin(keystone_admin)]# system interface-datanetwork-list controller-0 +--------------+--------------------------------------+----------+------------------+ | hostname | uuid | ifname | datanetwork_name | +--------------+--------------------------------------+----------+------------------+ | controller-0 | 63a44e7b-18f4-4f9b-8504-a950cb8abb86 | sriovfh1 | datanet-c | | controller-0 | 6aff29d7-cfaf-48b4-9802-b17b8a025efc | sriovdn1 | datanetdn1 | | controller-0 | 76a2da50-11a6-408e-90b3-3a316cef6557 | sriovmh1 | datanetmh1 | | controller-0 | e155e1d0-8dec-47e6-ac60-076832698a95 | sriovfh1 | datanet-u | | controller-0 | e569db46-a31b-4b8f-b7ca-175b1168798f | sriovbh1 | datanetbh1 | +--------------+--------------------------------------+----------+------------------+ [root@controller-0 sysadmin(keystone_admin)]# system
interface-datanetwork-show 63a44e7b-18f4-4f9b-8504-a950cb8abb86 +------------------+--------------------------------------+ | Property | Value | +------------------+--------------------------------------+ | hostname | controller-0 | | uuid | 63a44e7b-18f4-4f9b-8504-a950cb8abb86 | | ifname | sriovfh1 | | datanetwork_name | datanet-c | +------------------+--------------------------------------+ [root@controller-0 sysadmin(keystone_admin)]# system interface-datanetwork-show e155e1d0-8dec-47e6-ac60-076832698a95 +------------------+--------------------------------------+ | Property | Value | +------------------+--------------------------------------+ | hostname | controller-0 | | uuid | e155e1d0-8dec-47e6-ac60-076832698a95 | | ifname | sriovfh1 | | datanetwork_name | datanet-u | +------------------+--------------------------------------+ [root@controller-0 sysadmin(keystone_admin)]# sriov device plugin logs: ===================================================================================== controller-0:/home/sysadmin# kubectl logs kube-sriov-device-plugin-amd64-h9kp8 -n kube-system | grep intel_fpga_fec       "resourceName": "intel_fpga_fec", I0303 13:47:07.275600 138581 manager.go:106] unmarshalled ResourceList: [{ResourcePrefix: ResourceName:pci_sriov_net_datanet_c DeviceType:netDevice Selectors:0xc00000c900 SelectorObj:0xc0002601e0} {ResourcePrefix: ResourceName:pci_sriov_net_datanetbh1 DeviceType:netDevice Selectors:0xc00000c920 SelectorObj:0xc000260280} {ResourcePrefix: ResourceName:pci_sriov_net_datanetdn1 DeviceType:netDevice Selectors:0xc00000c940 SelectorObj:0xc000260320} {ResourcePrefix: ResourceName:pci_sriov_net_datanetmh1 DeviceType:netDevice Selectors:0xc00000c960 SelectorObj:0xc0002603c0} {ResourcePrefix: ResourceName:pci_sriov_net_datanet_u DeviceType:netDevice Selectors:0xc00000c9a0 SelectorObj:0xc000260460} {ResourcePrefix: ResourceName:intel_fpga_fec DeviceType:accelerator Selectors:0xc00000c9c0 SelectorObj:0xc00014c820}] I0303 13:47:07.275709 138581 manager.go:193] validating resource name "intel.com/intel_fpga_fec" I0303 13:47:07.450400 138581 manager.go:116] Creating new ResourcePool: intel_fpga_fec I0303 13:47:07.450446 138581 manager.go:145] New resource server is created for intel_fpga_fec ResourcePool I0303 13:47:07.453772 138581 server.go:191] starting intel_fpga_fec device plugin endpoint at: intel.com_intel_fpga_fec.sock I0303
SelectorObj:0xc00014c820}] I0303 13:47:07.275709 138581 manager.go:193] validating resource name "intel.com/intel_fpga_fec" I0303 13:47:07.450400 138581 manager.go:116] Creating new ResourcePool: intel_fpga_fec I0303 13:47:07.450446 138581 manager.go:145] New resource server is created for intel_fpga_fec ResourcePool I0303 13:47:07.453772 138581 server.go:191] starting intel_fpga_fec device plugin endpoint at: intel.com_intel_fpga_fec.sock I0303 13:47:07.454032 138581 server.go:217] intel_fpga_fec device plugin endpoint started serving I0303 13:47:07.640208 138581 server.go:106] Plugin: intel.com_intel_fpga_fec.sock gets registered successfully at Kubelet I0303 13:47:07.640225 138581 server.go:131] ListAndWatch(intel_fpga_fec) invoked I0303 13:47:07.640342 138581 server.go:139] ListAndWatch(intel_fpga_fec): send devices &ListAndWatchResponse{Devices:[]*Device{&Device{ID:0000:1d:01.0,Health:Healthy,Topology:&TopologyInfo{Nodes:[]*NUMANode{&NUMANode{ID:0,},},},},&Device{ID:0000:1d:00.1,Health:Healthy,Topology:&TopologyInfo{Nodes:[]*NUMANode{&NUMANode{ID:0,},},},},&Device{ID:0000:1d:00.2,Health:Healthy,Topology:&TopologyInfo{Nodes:[]*NUMANode{&NUMANode{ID:0,},},},},&Device{ID:0000:1d:00.3,Health:Healthy,Topology:&TopologyInfo{Nodes:[]*NUMANode{&NUMANode{ID:0,},},},},&Device{ID:0000:1d:00.4,Health:Healthy,Topology:&TopologyInfo{Nodes:[]*NUMANode{&NUMANode{ID:0,},},},},&Device{ID:0000:1d:00.5,Health:Healthy,Topology:&TopologyInfo{Nodes:[]*NUMANode{&NUMANode{ID:0,},},},},&Device{ID:0000:1d:00.6,Health:Healthy,Topology:&TopologyInfo{Nodes:[]*NUMANode{&NUMANode{ID:0,},},},},&Device{ID:0000:1d:00.7,Health:Healthy,Topology:&TopologyInfo{Nodes:[]*NUMANode{&NUMANode{ID:0,},},},},},} controller-0:/home/sysadmin# controller-0:/home/sysadmin# controller-0:/home/sysadmin# controller-0:/home/sysadmin# controller-0:/home/sysadmin# kubectl logs kube-sriov-device-plugin-amd64-h9kp8 -n kube-system | grep intel_fpga_fecpci_sriov_net_datanet_c       "resourceName": 
"pci_sriov_net_datanet_c", I0303 13:47:07.275600 138581 manager.go:106] unmarshalled ResourceList: [{ResourcePrefix: ResourceName:pci_sriov_net_datanet_c DeviceType:netDevice Selectors:0xc00000c900 SelectorObj:0xc0002601e0} {ResourcePrefix: ResourceName:pci_sriov_net_datanetbh1 DeviceType:netDevice Selectors:0xc00000c920 SelectorObj:0xc000260280} {ResourcePrefix: ResourceName:pci_sriov_net_datanetdn1 DeviceType:netDevice Selectors:0xc00000c940 SelectorObj:0xc000260320} {ResourcePrefix: ResourceName:pci_sriov_net_datanetmh1 DeviceType:netDevice Selectors:0xc00000c960 SelectorObj:0xc0002603c0} {ResourcePrefix: ResourceName:pci_sriov_net_datanet_u DeviceType:netDevice Selectors:0xc00000c9a0 SelectorObj:0xc000260460} {ResourcePrefix: ResourceName:intel_fpga_fec DeviceType:accelerator Selectors:0xc00000c9c0 SelectorObj:0xc00014c820}] I0303 13:47:07.275645 138581 manager.go:193] validating resource name "intel.com/pci_sriov_net_datanet_c" I0303 13:47:07.449979 138581 manager.go:116] Creating new ResourcePool: pci_sriov_net_datanet_c I0303 13:47:07.450122 138581 manager.go:145] New resource server is created for pci_sriov_net_datanet_c ResourcePool I0303 13:47:07.450478 138581 server.go:191] starting pci_sriov_net_datanet_c device plugin endpoint at: intel.com_pci_sriov_net_datanet_c.sock I0303 13:47:07.451088 138581 server.go:217] pci_sriov_net_datanet_c device plugin endpoint started serving I0303 13:47:07.639929 138581 server.go:131] ListAndWatch(pci_sriov_net_datanet_c) invoked I0303 13:47:07.640068 138581 server.go:106] Plugin: intel.com_pci_sriov_net_datanet_c.sock gets registered successfully at Kubelet I0303 13:47:07.639996 138581 server.go:139] ListAndWatch(pci_sriov_net_datanet_c): send devices 
&ListAndWatchResponse{Devices:[]*Device{&Device{ID:0000:b1:0f.4,Health:Healthy,Topology:&TopologyInfo{Nodes:[]*NUMANode{&NUMANode{ID:1,},},},},&Device{ID:0000:b1:0f.2,Health:Healthy,Topology:&TopologyInfo{Nodes:[]*NUMANode{&NUMANode{ID:1,},},},},&Device{ID:0000:b1:0e.4,Health:Healthy,Topology:&TopologyInfo{Nodes:[]*NUMANode{&NUMANode{ID:1,},},},},&Device{ID:0000:b1:0f.0,Health:Healthy,Topology:&TopologyInfo{Nodes:[]*NUMANode{&NUMANode{ID:1,},},},},&Device{ID:0000:b1:0e.1,Health:Healthy,Topology:&TopologyInfo{Nodes:[]*NUMANode{&NUMANode{ID:1,},},},},&Device{ID:0000:b1:0e.3,Health:Healthy,Topology:&TopologyInfo{Nodes:[]*NUMANode{&NUMANode{ID:1,},},},},&Device{ID:0000:b1:0e.6,Health:Healthy,Topology:&TopologyInfo{Nodes:[]*NUMANode{&NUMANode{ID:1,},},},},&Device{ID:0000:b1:0f.1,Health:Healthy,Topology:&TopologyInfo{Nodes:[]*NUMANode{&NUMANode{ID:1,},},},},&Device{ID:0000:b1:0f.3,Health:Healthy,Topology:&TopologyInfo{Nodes:[]*NUMANode{&NUMANode{ID:1,},},},},&Device{ID:0000:b1:0f.5,Health:Healthy,Topology:&TopologyInfo{Nodes:[]*NUMANode{&NUMANode{ID:1,},},},},&Device{ID:0000:b1:0f.6,Health:Healthy,Topology:&TopologyInfo{Nodes:[]*NUMANode{&NUMANode{ID:1,},},},},&Device{ID:0000:b1:0e.2,Health:Healthy,Topology:&TopologyInfo{Nodes:[]*NUMANode{&NUMANode{ID:1,},},},},&Device{ID:0000:b1:0e.5,Health:Healthy,Topology:&TopologyInfo{Nodes:[]*NUMANode{&NUMANode{ID:1,},},},},&Device{ID:0000:b1:0e.7,Health:Healthy,Topology:&TopologyInfo{Nodes:[]*NUMANode{&NUMANode{ID:1,},},},},&Device{ID:0000:b1:0f.7,Health:Healthy,Topology:&TopologyInfo{Nodes:[]*NUMANode{&NUMANode{ID:1,},},},},&Device{ID:0000:b1:0e.0,Health:Healthy,Topology:&TopologyInfo{Nodes:[]*NUMANode{&NUMANode{ID:1,},},},},},} controller-0:/home/sysadmin# kubectl logs kube-sriov-device-plugin-amd64-h9kp8 -n kube-system | grep pci_sriov_net_datanet_cu       "resourceName": "pci_sriov_net_datanet_u", I0303 13:47:07.275600 138581 manager.go:106] unmarshalled ResourceList: [{ResourcePrefix: ResourceName:pci_sriov_net_datanet_c 
DeviceType:netDevice Selectors:0xc00000c900 SelectorObj:0xc0002601e0} {ResourcePrefix: ResourceName:pci_sriov_net_datanetbh1 DeviceType:netDevice Selectors:0xc00000c920 SelectorObj:0xc000260280} {ResourcePrefix: ResourceName:pci_sriov_net_datanetdn1 DeviceType:netDevice Selectors:0xc00000c940 SelectorObj:0xc000260320} {ResourcePrefix: ResourceName:pci_sriov_net_datanetmh1 DeviceType:netDevice Selectors:0xc00000c960 SelectorObj:0xc0002603c0} {ResourcePrefix: ResourceName:pci_sriov_net_datanet_u DeviceType:netDevice Selectors:0xc00000c9a0 SelectorObj:0xc000260460} {ResourcePrefix: ResourceName:intel_fpga_fec DeviceType:accelerator Selectors:0xc00000c9c0 SelectorObj:0xc00014c820}] I0303 13:47:07.275700 138581 manager.go:193] validating resource name "intel.com/pci_sriov_net_datanet_u" I0303 13:47:07.450306 138581 manager.go:116] Creating new ResourcePool: pci_sriov_net_datanet_u I0303 13:47:07.450388 138581 manager.go:145] New resource server is created for pci_sriov_net_datanet_u ResourcePool I0303 13:47:07.453443 138581 server.go:191] starting pci_sriov_net_datanet_u device plugin endpoint at: intel.com_pci_sriov_net_datanet_u.sock I0303 13:47:07.453750 138581 server.go:217] pci_sriov_net_datanet_u device plugin endpoint started serving I0303 13:47:07.639929 138581 server.go:131] ListAndWatch(pci_sriov_net_datanet_u) invoked I0303 13:47:07.639945 138581 server.go:139] ListAndWatch(pci_sriov_net_datanet_u): send devices 
&ListAndWatchResponse{Devices:[]*Device{&Device{ID:0000:b1:0e.1,Health:Healthy,Topology:&TopologyInfo{Nodes:[]*NUMANode{&NUMANode{ID:1,},},},},&Device{ID:0000:b1:0e.5,Health:Healthy,Topology:&TopologyInfo{Nodes:[]*NUMANode{&NUMANode{ID:1,},},},},&Device{ID:0000:b1:0e.2,Health:Healthy,Topology:&TopologyInfo{Nodes:[]*NUMANode{&NUMANode{ID:1,},},},},&Device{ID:0000:b1:0e.3,Health:Healthy,Topology:&TopologyInfo{Nodes:[]*NUMANode{&NUMANode{ID:1,},},},},&Device{ID:0000:b1:0f.0,Health:Healthy,Topology:&TopologyInfo{Nodes:[]*NUMANode{&NUMANode{ID:1,},},},},&Device{ID:0000:b1:0e.0,Health:Healthy,Topology:&TopologyInfo{Nodes:[]*NUMANode{&NUMANode{ID:1,},},},},&Device{ID:0000:b1:0f.1,Health:Healthy,Topology:&TopologyInfo{Nodes:[]*NUMANode{&NUMANode{ID:1,},},},},&Device{ID:0000:b1:0f.3,Health:Healthy,Topology:&TopologyInfo{Nodes:[]*NUMANode{&NUMANode{ID:1,},},},},&Device{ID:0000:b1:0f.5,Health:Healthy,Topology:&TopologyInfo{Nodes:[]*NUMANode{&NUMANode{ID:1,},},},},&Device{ID:0000:b1:0f.6,Health:Healthy,Topology:&TopologyInfo{Nodes:[]*NUMANode{&NUMANode{ID:1,},},},},&Device{ID:0000:b1:0f.7,Health:Healthy,Topology:&TopologyInfo{Nodes:[]*NUMANode{&NUMANode{ID:1,},},},},&Device{ID:0000:b1:0e.4,Health:Healthy,Topology:&TopologyInfo{Nodes:[]*NUMANode{&NUMANode{ID:1,},},},},&Device{ID:0000:b1:0e.6,Health:Healthy,Topology:&TopologyInfo{Nodes:[]*NUMANode{&NUMANode{ID:1,},},},},&Device{ID:0000:b1:0e.7,Health:Healthy,Topology:&TopologyInfo{Nodes:[]*NUMANode{&NUMANode{ID:1,},},},},&Device{ID:0000:b1:0f.2,Health:Healthy,Topology:&TopologyInfo{Nodes:[]*NUMANode{&NUMANode{ID:1,},},},},&Device{ID:0000:b1:0f.4,Health:Healthy,Topology:&TopologyInfo{Nodes:[]*NUMANode{&NUMANode{ID:1,},},},},},} I0303 13:47:07.640791 138581 server.go:106] Plugin: intel.com_pci_sriov_net_datanet_u.sock gets registered successfully at Kubelet Severity: Critical - This is a show stopper and blocks the deployment of Flexran Steps to Reproduce: 1> created 8 VFs on FEC device with igb_uio driver and 16 VFs on a 10G NIC 
with vfio driver 2> system lock and unlock the host 3> checked the resource of FEC device from k8s Allocatable: intel.com/intel_fpga_fec: 8 intel.com/pci_sriov_net_datanet_c: 16 intel.com/pci_sriov_net_datanet_u: 16 4> system lock and unlock the host multiple times during regular usage 5> check that the k8s allocatable resources become 0 and then never recover, even after multiple host lock/unlock cycles  Allocatable: intel.com/intel_fpga_fec: 0 intel.com/pci_sriov_net_datanet_c: 0 intel.com/pci_sriov_net_datanet_u: 0 6> sriov daemonset pod logs seem to indicate the correct processing of the above resource set definition and registration to Kubelet Expected Behavior: All FPGA resources and SRIOV resources properly populated in the output of "kubectl describe nodes controller-0" Actual Behavior: Resources are not seen as expected Reproducibility: Intermittent System Configuration Simplex (AIO) Branch/Pull Time/Commit StarlingX 4.0 Official ISO from http://mirror.starlingx.cengn.ca/mirror/starlingx/release/4.0.1/ Same issue observed even with the ISO build on 26-Feb-2021 03:40 http://mirror.starlingx.cengn.ca/mirror/starlingx/master/centos/latest_green_build/outputs/ Timestamp/Logs -------------- Logs are attached. Issue occurred on 03-03-2021 Test Activity Evaluation Workaround None Brief Description We are using STX 4.0.1 to install our Flexran-based 5G solution in AIO-SX mode. We had created VFs on the N3000 device and some of the NIC interfaces. These resources were reflected in the Kubernetes allocatable resources. During the course of using the system, the allocatable resources for N3000 and one of the NIC interface cards started coming up as 0. The following is a part of kubectl describe nodes output.
The affected resources are intel.com/intel_fpga_fec, intel.com/pci_sriov_net_datanet_c, and intel.com/pci_sriov_net_datanet_u. We already tried lock/unlock, deleting and re-creating the resources, but none of these helped to recover the resources. [root@controller-0 sysadmin(keystone_admin)]# cat /etc/build.info ### ### StarlingX ### Release 20.06 ### OS="centos" SW_VERSION="20.06" BUILD_TARGET="Host Installer" BUILD_TYPE="Formal" BUILD_ID="r/stx.4.0" JOB="STX_4.0_build_layer_flock" BUILD_BY="starlingx.build@cengn.ca" BUILD_NUMBER="22" BUILD_HOST="starlingx_mirror" BUILD_DATE="2020-08-05 12:25:52 +0000" FLOCK_OS="centos" FLOCK_JOB="STX_4.0_build_layer_flock" FLOCK_BUILD_BY="starlingx.build@cengn.ca" FLOCK_BUILD_NUMBER="22" FLOCK_BUILD_HOST="starlingx_mirror" FLOCK_BUILD_DATE="2020-08-05 12:25:52 +0000" Capacity:   cpu: 96   ephemeral-storage: 10190100Ki   hugepages-1Gi: 46Gi   hugepages-2Mi: 0   intel.com/intel_fpga_fec: 0   intel.com/pci_sriov_net_datanet_c: 0   intel.com/pci_sriov_net_datanet_u: 0   intel.com/pci_sriov_net_datanetbh1: 8   intel.com/pci_sriov_net_datanetdn1: 8   intel.com/pci_sriov_net_datanetmh1: 8   memory: 97436728Ki   pods: 110 Allocatable:   cpu: 92   ephemeral-storage: 9391196145   hugepages-1Gi: 46Gi   hugepages-2Mi: 0   intel.com/intel_fpga_fec: 0   intel.com/pci_sriov_net_datanet_c: 0   intel.com/pci_sriov_net_datanet_u: 0   intel.com/pci_sriov_net_datanetbh1: 8   intel.com/pci_sriov_net_datanetdn1: 8   intel.com/pci_sriov_net_datanetmh1: 8 It seems like everything is "OK" up to the sriov device plugin, because the sriov device plugin pod logs do show that the correct number of resources are getting updated towards Kubernetes Here are some of the CLI outputs: [root@controller-0 sysadmin(keystone_admin)]# system host-device-show controller-0 pci_0000_1d_00_0 +-----------------------+---------------------------------------------------------------------------------------------------------+ | Property | Value |
+-----------------------+---------------------------------------------------------------------------------------------------------+ | name | pci_0000_1d_00_0 | | address | 0000:1d:00.0 | | class id | 120000 | | vendor id | 8086 | | device id | 0d8f | | class name | Processing accelerators | | vendor name | Intel Corporation | | device name | Device 0d8f | | numa_node | 0 | | enabled | True | | sriov_totalvfs | 8 | | sriov_numvfs | 8 | | sriov_vfs_pci_address | 0000:1d:00.1,0000:1d:00.2,0000:1d:00.3,0000:1d:00.4,0000:1d:00.5,0000:1d:00.6,0000:1d:00.7,0000:1d:01.0 | | sriov_vf_pdevice_id | 0d90 | | extra_info | | | created_at | 2021-03-03T13:46:26.363470+00:00 | | updated_at | 2021-03-03T13:47:12.684827+00:00 | | root_key | None | | revoked_key_ids | None | | boot_page | None | | bitstream_id | None | | bmc_build_version | None | | bmc_fw_version | None | | driver | igb_uio | | sriov_vf_driver | igb_uio | +-----------------------+---------------------------------------------------------------------------------------------------------+ [root@controller-0 sysadmin(keystone_admin)]# system host-if-show controller-0 sriovfh1 +-----------------+--------------------------------------+ | Property | Value | +-----------------+--------------------------------------+ | ifname | sriovfh1 | | iftype | ethernet | | ports | [u'enp177s0f3'] | | imac | 40:a6:b7:34:e4:a3 | | imtu | 9216 | | ifclass | pci-sriov | | ptp_role | none | | aemode | None | | schedpolicy | None | | txhashpolicy | None | | uuid | 6f30a690-2414-424f-b5fc-d324d63cc502 | | ihost_uuid | 8075e0db-4cc5-4d74-8601-849adce97b7e | | vlan_id | None | | uses | [] | | used_by | [] | | created_at | | | updated_at | | | sriov_numvfs | 16 | | sriov_vf_driver | vfio | | accelerated | [True] | +-----------------+--------------------------------------+ [root@controller-0 sysadmin(keystone_admin)]# system interface-datanetwork-list controller-0 +--------------+--------------------------------------+----------+------------------+ 
| hostname | uuid | ifname | datanetwork_name | +--------------+--------------------------------------+----------+------------------+ | controller-0 | 63a44e7b-18f4-4f9b-8504-a950cb8abb86 | sriovfh1 | datanet-c | | controller-0 | 6aff29d7-cfaf-48b4-9802-b17b8a025efc | sriovdn1 | datanetdn1 | | controller-0 | 76a2da50-11a6-408e-90b3-3a316cef6557 | sriovmh1 | datanetmh1 | | controller-0 | e155e1d0-8dec-47e6-ac60-076832698a95 | sriovfh1 | datanet-u | | controller-0 | e569db46-a31b-4b8f-b7ca-175b1168798f | sriovbh1 | datanetbh1 | +--------------+--------------------------------------+----------+------------------+ [root@controller-0 sysadmin(keystone_admin)]# system interface-datanetwork-list controller-0 +--------------+--------------------------------------+----------+------------------+ | hostname | uuid | ifname | datanetwork_name | +--------------+--------------------------------------+----------+------------------+ | controller-0 | 63a44e7b-18f4-4f9b-8504-a950cb8abb86 | sriovfh1 | datanet-c | | controller-0 | 6aff29d7-cfaf-48b4-9802-b17b8a025efc | sriovdn1 | datanetdn1 | | controller-0 | 76a2da50-11a6-408e-90b3-3a316cef6557 | sriovmh1 | datanetmh1 | | controller-0 | e155e1d0-8dec-47e6-ac60-076832698a95 | sriovfh1 | datanet-u | | controller-0 | e569db46-a31b-4b8f-b7ca-175b1168798f | sriovbh1 | datanetbh1 | +--------------+--------------------------------------+----------+------------------+ [root@controller-0 sysadmin(keystone_admin)]# system interface-datanetwork-show 63a44e7b-18f4-4f9b-8504-a950cb8abb86 +------------------+--------------------------------------+ | Property | Value | +------------------+--------------------------------------+ | hostname | controller-0 | | uuid | 63a44e7b-18f4-4f9b-8504-a950cb8abb86 | | ifname | sriovfh1 | | datanetwork_name | datanet-c | +------------------+--------------------------------------+ [root@controller-0 sysadmin(keystone_admin)]#
[root@controller-0 sysadmin(keystone_admin)]# system interface-datanetwork-show e155e1d0-8dec-47e6-ac60-076832698a95
+------------------+--------------------------------------+
| Property         | Value                                |
+------------------+--------------------------------------+
| hostname         | controller-0                         |
| uuid             | e155e1d0-8dec-47e6-ac60-076832698a95 |
| ifname           | sriovfh1                             |
| datanetwork_name | datanet-u                            |
+------------------+--------------------------------------+

sriov device plugin logs:
=====================================================================================
controller-0:/home/sysadmin# kubectl logs kube-sriov-device-plugin-amd64-h9kp8 -n kube-system | grep intel_fpga_fec
      "resourceName": "intel_fpga_fec",
I0303 13:47:07.275600 138581 manager.go:106] unmarshalled ResourceList: [{ResourcePrefix: ResourceName:pci_sriov_net_datanet_c DeviceType:netDevice Selectors:0xc00000c900 SelectorObj:0xc0002601e0} {ResourcePrefix: ResourceName:pci_sriov_net_datanetbh1 DeviceType:netDevice Selectors:0xc00000c920 SelectorObj:0xc000260280} {ResourcePrefix: ResourceName:pci_sriov_net_datanetdn1 DeviceType:netDevice Selectors:0xc00000c940 SelectorObj:0xc000260320} {ResourcePrefix: ResourceName:pci_sriov_net_datanetmh1 DeviceType:netDevice Selectors:0xc00000c960 SelectorObj:0xc0002603c0} {ResourcePrefix: ResourceName:pci_sriov_net_datanet_u DeviceType:netDevice Selectors:0xc00000c9a0 SelectorObj:0xc000260460} {ResourcePrefix: ResourceName:intel_fpga_fec DeviceType:accelerator Selectors:0xc00000c9c0 SelectorObj:0xc00014c820}]
I0303 13:47:07.275709 138581 manager.go:193] validating resource name "intel.com/intel_fpga_fec"
I0303 13:47:07.450400 138581 manager.go:116] Creating new ResourcePool: intel_fpga_fec
I0303 13:47:07.450446 138581 manager.go:145] New resource server is created for intel_fpga_fec ResourcePool
I0303 13:47:07.453772 138581 server.go:191] starting intel_fpga_fec device plugin endpoint at: intel.com_intel_fpga_fec.sock
I0303 13:47:07.454032 138581 server.go:217] intel_fpga_fec device plugin endpoint started serving
I0303 13:47:07.640208 138581 server.go:106] Plugin: intel.com_intel_fpga_fec.sock gets registered successfully at Kubelet
I0303 13:47:07.640225 138581 server.go:131] ListAndWatch(intel_fpga_fec) invoked
I0303 13:47:07.640342 138581 server.go:139] ListAndWatch(intel_fpga_fec): send devices &ListAndWatchResponse{Devices:[]*Device{&Device{ID:0000:1d:01.0,Health:Healthy,Topology:&TopologyInfo{Nodes:[]*NUMANode{&NUMANode{ID:0,},},},},&Device{ID:0000:1d:00.1,Health:Healthy,Topology:&TopologyInfo{Nodes:[]*NUMANode{&NUMANode{ID:0,},},},},&Device{ID:0000:1d:00.2,Health:Healthy,Topology:&TopologyInfo{Nodes:[]*NUMANode{&NUMANode{ID:0,},},},},&Device{ID:0000:1d:00.3,Health:Healthy,Topology:&TopologyInfo{Nodes:[]*NUMANode{&NUMANode{ID:0,},},},},&Device{ID:0000:1d:00.4,Health:Healthy,Topology:&TopologyInfo{Nodes:[]*NUMANode{&NUMANode{ID:0,},},},},&Device{ID:0000:1d:00.5,Health:Healthy,Topology:&TopologyInfo{Nodes:[]*NUMANode{&NUMANode{ID:0,},},},},&Device{ID:0000:1d:00.6,Health:Healthy,Topology:&TopologyInfo{Nodes:[]*NUMANode{&NUMANode{ID:0,},},},},&Device{ID:0000:1d:00.7,Health:Healthy,Topology:&TopologyInfo{Nodes:[]*NUMANode{&NUMANode{ID:0,},},},},},}
controller-0:/home/sysadmin# kubectl logs kube-sriov-device-plugin-amd64-h9kp8 -n kube-system | grep pci_sriov_net_datanet_c
      "resourceName": "pci_sriov_net_datanet_c",
I0303 13:47:07.275600 138581 manager.go:106] unmarshalled ResourceList: [{ResourcePrefix: ResourceName:pci_sriov_net_datanet_c DeviceType:netDevice Selectors:0xc00000c900 SelectorObj:0xc0002601e0} {ResourcePrefix: ResourceName:pci_sriov_net_datanetbh1 DeviceType:netDevice Selectors:0xc00000c920 SelectorObj:0xc000260280} {ResourcePrefix: ResourceName:pci_sriov_net_datanetdn1 DeviceType:netDevice Selectors:0xc00000c940 SelectorObj:0xc000260320} {ResourcePrefix: ResourceName:pci_sriov_net_datanetmh1 DeviceType:netDevice Selectors:0xc00000c960 SelectorObj:0xc0002603c0} {ResourcePrefix: ResourceName:pci_sriov_net_datanet_u DeviceType:netDevice Selectors:0xc00000c9a0 SelectorObj:0xc000260460} {ResourcePrefix: ResourceName:intel_fpga_fec DeviceType:accelerator Selectors:0xc00000c9c0 SelectorObj:0xc00014c820}]
I0303 13:47:07.275645 138581 manager.go:193] validating resource name "intel.com/pci_sriov_net_datanet_c"
I0303 13:47:07.449979 138581 manager.go:116] Creating new ResourcePool: pci_sriov_net_datanet_c
I0303 13:47:07.450122 138581 manager.go:145] New resource server is created for pci_sriov_net_datanet_c ResourcePool
I0303 13:47:07.450478 138581 server.go:191] starting pci_sriov_net_datanet_c device plugin endpoint at: intel.com_pci_sriov_net_datanet_c.sock
I0303 13:47:07.451088 138581 server.go:217] pci_sriov_net_datanet_c device plugin endpoint started serving
I0303 13:47:07.639929 138581 server.go:131] ListAndWatch(pci_sriov_net_datanet_c) invoked
I0303 13:47:07.640068 138581 server.go:106] Plugin: intel.com_pci_sriov_net_datanet_c.sock gets registered successfully at Kubelet
I0303 13:47:07.639996 138581 server.go:139] ListAndWatch(pci_sriov_net_datanet_c): send devices &ListAndWatchResponse{Devices:[]*Device{&Device{ID:0000:b1:0f.4,Health:Healthy,Topology:&TopologyInfo{Nodes:[]*NUMANode{&NUMANode{ID:1,},},},},&Device{ID:0000:b1:0f.2,Health:Healthy,Topology:&TopologyInfo{Nodes:[]*NUMANode{&NUMANode{ID:1,},},},},&Device{ID:0000:b1:0e.4,Health:Healthy,Topology:&TopologyInfo{Nodes:[]*NUMANode{&NUMANode{ID:1,},},},},&Device{ID:0000:b1:0f.0,Health:Healthy,Topology:&TopologyInfo{Nodes:[]*NUMANode{&NUMANode{ID:1,},},},},&Device{ID:0000:b1:0e.1,Health:Healthy,Topology:&TopologyInfo{Nodes:[]*NUMANode{&NUMANode{ID:1,},},},},&Device{ID:0000:b1:0e.3,Health:Healthy,Topology:&TopologyInfo{Nodes:[]*NUMANode{&NUMANode{ID:1,},},},},&Device{ID:0000:b1:0e.6,Health:Healthy,Topology:&TopologyInfo{Nodes:[]*NUMANode{&NUMANode{ID:1,},},},},&Device{ID:0000:b1:0f.1,Health:Healthy,Topology:&TopologyInfo{Nodes:[]*NUMANode{&NUMANode{ID:1,},},},},&Device{ID:0000:b1:0f.3,Health:Healthy,Topology:&TopologyInfo{Nodes:[]*NUMANode{&NUMANode{ID:1,},},},},&Device{ID:0000:b1:0f.5,Health:Healthy,Topology:&TopologyInfo{Nodes:[]*NUMANode{&NUMANode{ID:1,},},},},&Device{ID:0000:b1:0f.6,Health:Healthy,Topology:&TopologyInfo{Nodes:[]*NUMANode{&NUMANode{ID:1,},},},},&Device{ID:0000:b1:0e.2,Health:Healthy,Topology:&TopologyInfo{Nodes:[]*NUMANode{&NUMANode{ID:1,},},},},&Device{ID:0000:b1:0e.5,Health:Healthy,Topology:&TopologyInfo{Nodes:[]*NUMANode{&NUMANode{ID:1,},},},},&Device{ID:0000:b1:0e.7,Health:Healthy,Topology:&TopologyInfo{Nodes:[]*NUMANode{&NUMANode{ID:1,},},},},&Device{ID:0000:b1:0f.7,Health:Healthy,Topology:&TopologyInfo{Nodes:[]*NUMANode{&NUMANode{ID:1,},},},},&Device{ID:0000:b1:0e.0,Health:Healthy,Topology:&TopologyInfo{Nodes:[]*NUMANode{&NUMANode{ID:1,},},},},},}
controller-0:/home/sysadmin# kubectl logs kube-sriov-device-plugin-amd64-h9kp8 -n kube-system | grep pci_sriov_net_datanet_u
      "resourceName": "pci_sriov_net_datanet_u",
I0303 13:47:07.275600 138581 manager.go:106] unmarshalled ResourceList: [{ResourcePrefix: ResourceName:pci_sriov_net_datanet_c DeviceType:netDevice Selectors:0xc00000c900 SelectorObj:0xc0002601e0} {ResourcePrefix: ResourceName:pci_sriov_net_datanetbh1 DeviceType:netDevice Selectors:0xc00000c920 SelectorObj:0xc000260280} {ResourcePrefix: ResourceName:pci_sriov_net_datanetdn1 DeviceType:netDevice Selectors:0xc00000c940 SelectorObj:0xc000260320} {ResourcePrefix: ResourceName:pci_sriov_net_datanetmh1 DeviceType:netDevice Selectors:0xc00000c960 SelectorObj:0xc0002603c0} {ResourcePrefix: ResourceName:pci_sriov_net_datanet_u DeviceType:netDevice Selectors:0xc00000c9a0 SelectorObj:0xc000260460} {ResourcePrefix: ResourceName:intel_fpga_fec DeviceType:accelerator Selectors:0xc00000c9c0 SelectorObj:0xc00014c820}]
I0303 13:47:07.275700 138581 manager.go:193] validating resource name "intel.com/pci_sriov_net_datanet_u"
I0303 13:47:07.450306 138581 manager.go:116] Creating new ResourcePool: pci_sriov_net_datanet_u
I0303 13:47:07.450388 138581 manager.go:145] New resource server is created for pci_sriov_net_datanet_u ResourcePool
I0303 13:47:07.453443 138581 server.go:191] starting pci_sriov_net_datanet_u device plugin endpoint at: intel.com_pci_sriov_net_datanet_u.sock
I0303 13:47:07.453750 138581 server.go:217] pci_sriov_net_datanet_u device plugin endpoint started serving
I0303 13:47:07.639929 138581 server.go:131] ListAndWatch(pci_sriov_net_datanet_u) invoked
I0303 13:47:07.639945 138581 server.go:139] ListAndWatch(pci_sriov_net_datanet_u): send devices &ListAndWatchResponse{Devices:[]*Device{&Device{ID:0000:b1:0e.1,Health:Healthy,Topology:&TopologyInfo{Nodes:[]*NUMANode{&NUMANode{ID:1,},},},},&Device{ID:0000:b1:0e.5,Health:Healthy,Topology:&TopologyInfo{Nodes:[]*NUMANode{&NUMANode{ID:1,},},},},&Device{ID:0000:b1:0e.2,Health:Healthy,Topology:&TopologyInfo{Nodes:[]*NUMANode{&NUMANode{ID:1,},},},},&Device{ID:0000:b1:0e.3,Health:Healthy,Topology:&TopologyInfo{Nodes:[]*NUMANode{&NUMANode{ID:1,},},},},&Device{ID:0000:b1:0f.0,Health:Healthy,Topology:&TopologyInfo{Nodes:[]*NUMANode{&NUMANode{ID:1,},},},},&Device{ID:0000:b1:0e.0,Health:Healthy,Topology:&TopologyInfo{Nodes:[]*NUMANode{&NUMANode{ID:1,},},},},&Device{ID:0000:b1:0f.1,Health:Healthy,Topology:&TopologyInfo{Nodes:[]*NUMANode{&NUMANode{ID:1,},},},},&Device{ID:0000:b1:0f.3,Health:Healthy,Topology:&TopologyInfo{Nodes:[]*NUMANode{&NUMANode{ID:1,},},},},&Device{ID:0000:b1:0f.5,Health:Healthy,Topology:&TopologyInfo{Nodes:[]*NUMANode{&NUMANode{ID:1,},},},},&Device{ID:0000:b1:0f.6,Health:Healthy,Topology:&TopologyInfo{Nodes:[]*NUMANode{&NUMANode{ID:1,},},},},&Device{ID:0000:b1:0f.7,Health:Healthy,Topology:&TopologyInfo{Nodes:[]*NUMANode{&NUMANode{ID:1,},},},},&Device{ID:0000:b1:0e.4,Health:Healthy,Topology:&TopologyInfo{Nodes:[]*NUMANode{&NUMANode{ID:1,},},},},&Device{ID:0000:b1:0e.6,Health:Healthy,Topology:&TopologyInfo{Nodes:[]*NUMANode{&NUMANode{ID:1,},},},},&Device{ID:0000:b1:0e.7,Health:Healthy,Topology:&TopologyInfo{Nodes:[]*NUMANode{&NUMANode{ID:1,},},},},&Device{ID:0000:b1:0f.2,Health:Healthy,Topology:&TopologyInfo{Nodes:[]*NUMANode{&NUMANode{ID:1,},},},},&Device{ID:0000:b1:0f.4,Health:Healthy,Topology:&TopologyInfo{Nodes:[]*NUMANode{&NUMANode{ID:1,},},},},},}
I0303 13:47:07.640791 138581 server.go:106] Plugin: intel.com_pci_sriov_net_datanet_u.sock gets registered successfully at Kubelet

Severity: Critical - This is a showstopper and blocks the deployment of FlexRAN.

Steps to Reproduce:
1> created 8 VFs on the FEC device with the igb_uio driver and 16 VFs on a 10G NIC
   with the vfio driver
2> system lock and unlock the host
3> checked the resources of the FEC device from k8s:
   Allocatable:
     intel.com/intel_fpga_fec:          8
     intel.com/pci_sriov_net_datanet_c: 16
     intel.com/pci_sriov_net_datanet_u: 16
4> system lock and unlock the host multiple times during regular usage
5> check k8s allocatable resources: they become 0 and then never recover, even after multiple host lock/unlocks
   Allocatable:
     intel.com/intel_fpga_fec:          0
     intel.com/pci_sriov_net_datanet_c: 0
     intel.com/pci_sriov_net_datanet_u: 0
6> sriov daemonset pod logs seem to indicate correct processing of the above resource set definition and registration with Kubelet

Expected Behavior:
All FPGA and SR-IOV resources are properly populated in the output of "kubectl describe nodes controller-0"

Actual Behavior:
Resources are not seen as expected

Reproducibility: Intermittent

System Configuration: Simplex (AIO)

Branch/Pull Time/Commit:
StarlingX 4.0 official ISO from http://mirror.starlingx.cengn.ca/mirror/starlingx/release/4.0.1/
The same issue was observed even with the ISO built on 26-Feb-2021 03:40 from http://mirror.starlingx.cengn.ca/mirror/starlingx/master/centos/latest_green_build/outputs/

Timestamp/Logs
--------------
Logs are attached. The issue occurred on 03-03-2021.
The collect log was captured after we tried several workarounds of removing the VF associations and recreating them (with a lock/unlock). Hence the config.json present in the collect log may not reflect the same resources at that point in time.

Test Activity: Evaluation

Workaround: None
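The failure signature in steps 3 and 5 can be checked without eyeballing "kubectl describe" output. A minimal sketch, assuming the JSON shape returned by `kubectl get node controller-0 -o json` (the helper name and the trimmed sample below are hypothetical, not part of this report; extended resources appear in .status.allocatable under vendor-prefixed names such as intel.com/intel_fpga_fec):

```python
import json

def zeroed_extended_resources(node_json: str) -> list:
    """Return names of vendor extended resources whose allocatable count is 0."""
    allocatable = json.loads(node_json)["status"].get("allocatable", {})
    # Extended resources are namespaced ("intel.com/intel_fpga_fec");
    # built-in resources (cpu, memory, pods, hugepages-*) contain no "/",
    # so the int() conversion only ever sees plain integer quantities.
    return sorted(name for name, qty in allocatable.items()
                  if "/" in name and int(qty) == 0)

# Trimmed sample mirroring the failing node in this report (hypothetical
# stand-in for real `kubectl get node controller-0 -o json` output):
sample = json.dumps({"status": {"allocatable": {
    "cpu": "92",
    "intel.com/intel_fpga_fec": "0",
    "intel.com/pci_sriov_net_datanet_c": "0",
    "intel.com/pci_sriov_net_datanet_u": "0",
    "intel.com/pci_sriov_net_datanetbh1": "8",
}}})

print(zeroed_extended_resources(sample))
# → ['intel.com/intel_fpga_fec', 'intel.com/pci_sriov_net_datanet_c', 'intel.com/pci_sriov_net_datanet_u']
```

An empty result after a lock/unlock cycle would indicate the resources recovered; a non-empty result reproduces the condition above even though the device plugin logs show a successful ListAndWatch registration.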
2021-03-06 02:31:54 Ghada Khalil tags stx.networking
2021-04-29 12:44:41 Steven Webster starlingx: status New Triaged
2021-05-13 13:22:13 Steven Webster starlingx: assignee Steven Webster (swebster-wr)
2021-08-25 16:29:03 Ghada Khalil starlingx: importance Undecided Low