Booting a VM with pci-sriov and pci-passthrough vifs fails with 'No valid host was found'

Bug #1854516 reported by Peng Peng
Affects: StarlingX
Status: Invalid
Importance: High
Assigned to: zhipeng liu

Bug Description

Brief Description
-----------------
Booting a VM with the following vifs on the same internal network ('virtio', 'pci-sriov', 'pci-passthrough') fails with 'No valid host was found. There are not enough hosts available.'

Severity
--------
Major

Steps to Reproduce
------------------
See the condensed commands below and the attached logs for details.

TC-name: networking/test_multiple_ports.py::TestMutiPortsPCI::test_multiports_on_same_network_pci_vm_actions[virtio_pci-sriov_pci-passthrough]
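
A condensed sketch of the reproduction steps, derived from the log excerpts below (angle-bracket placeholders stand in for the UUIDs shown in the log; authentication options and the two additional net-id NICs from the full boot command are omitted):

openstack port create --network=<internal-net> --vnic-type=normal port_virtio-7
openstack port create --network=<internal-net> --vnic-type=direct port_pci-sriov-8
openstack port create --network=<internal-net> --vnic-type=direct-physical port_pci-passthrough-9
nova boot --flavor=dedicated-3 --boot-volume=<volume-uuid> --key-name=<keypair> --poll --nic port-id=<port_virtio-7> --nic port-id=<port_pci-sriov-8> --nic port-id=<port_pci-passthrough-9> tenant1-multiports_pci-21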

Expected Behavior
------------------
The VM boots successfully with all of the requested vifs attached.

Actual Behavior
----------------
The boot fails and the VM goes to ERROR state with fault 'No valid host was found. There are not enough hosts available.'

Reproducibility
---------------
Seen once

System Configuration
--------------------
Multi-node system

Lab-name: WCP_3_6

Branch/Pull Time/Commit
-----------------------
2019-11-21_20-00-00

Last Pass
---------
Unknown

Timestamp/Logs
--------------
[2019-11-25 02:06:20,260] 311 DEBUG MainThread ssh.send :: Send 'openstack --os-username 'tenant1' --os-password 'Li69nux*' --os-project-name tenant1 --os-auth-url http://keystone.openstack.svc.cluster.local/v3 --os-user-domain-name Default --os-project-domain-name Default --os-identity-api-version 3 --os-interface internal --os-region-name RegionOne port create --network=33db651b-a402-48c5-a432-e35560ecc079 --vnic-type=normal port_virtio-7'

[2019-11-25 02:06:22,850] 311 DEBUG MainThread ssh.send :: Send 'openstack --os-username 'tenant1' --os-password 'Li69nux*' --os-project-name tenant1 --os-auth-url http://keystone.openstack.svc.cluster.local/v3 --os-user-domain-name Default --os-project-domain-name Default --os-identity-api-version 3 --os-interface internal --os-region-name RegionOne port create --network=33db651b-a402-48c5-a432-e35560ecc079 --vnic-type=direct port_pci-sriov-8'

[2019-11-25 02:06:25,556] 311 DEBUG MainThread ssh.send :: Send 'openstack --os-username 'tenant1' --os-password 'Li69nux*' --os-project-name tenant1 --os-auth-url http://keystone.openstack.svc.cluster.local/v3 --os-user-domain-name Default --os-project-domain-name Default --os-identity-api-version 3 --os-interface internal --os-region-name RegionOne port create --network=33db651b-a402-48c5-a432-e35560ecc079 --vnic-type=direct-physical port_pci-passthrough-9'

[2019-11-25 02:06:27,976] 311 DEBUG MainThread ssh.send :: Send 'openstack --os-username 'tenant1' --os-password 'Li69nux*' --os-project-name tenant1 --os-auth-url http://keystone.openstack.svc.cluster.local/v3 --os-user-domain-name Default --os-project-domain-name Default --os-identity-api-version 3 --os-interface internal --os-region-name RegionOne port create --network=ce00e391-1eab-4bbe-9d4f-13e22be1a7c8 --vnic-type=direct-physical port_pci-passthrough-10'

[2019-11-25 02:06:37,788] 311 DEBUG MainThread ssh.send :: Send 'nova --os-username 'tenant1' --os-password 'Li69nux*' --os-project-name tenant1 --os-auth-url http://keystone.openstack.svc.cluster.local/v3 --os-user-domain-name Default --os-project-domain-name Default --os-endpoint-type internalURL --os-region-name RegionOne boot --flavor=8bf0b131-4f9a-4e23-98e7-9ba54cd7a5aa --boot-volume=752ab134-6140-431b-8cf9-f366f0daee78 --key-name=keypair-tenant1 --poll --nic net-id=a35d11a3-7adc-4a73-a1c3-98ad4b547f23 --nic net-id=36498e81-3f8e-4b0a-815b-1ccc308c4bf4 --nic port-id=579913c4-a131-42fd-b211-bcd7322f000b --nic port-id=1bea07de-be5f-4142-9d82-17b69db26497 --nic port-id=f94315a7-36a5-4c85-b839-671f18c93842 --nic port-id=3cd6a181-1913-4f1a-b451-31e4fcebd194 tenant1-multiports_pci-21'
[2019-11-25 02:06:47,119] 433 DEBUG MainThread ssh.expect :: Output:
+--------------------------------------+---------------------------------------------------------------------------------------------------------+
| Property | Value |
+--------------------------------------+---------------------------------------------------------------------------------------------------------+
| OS-DCF:diskConfig | MANUAL |
| OS-EXT-AZ:availability_zone | |
| OS-EXT-STS:power_state | 0 |
| OS-EXT-STS:task_state | scheduling |
| OS-EXT-STS:vm_state | building |
| OS-SRV-USG:launched_at | - |
| OS-SRV-USG:terminated_at | - |
| accessIPv4 | |
| accessIPv6 | |
| adminPass | FvFD5wFCBk8V |
| config_drive | |
| created | 2019-11-25T02:27:08Z |
| description | - |
| flavor:disk | 2 |
| flavor:ephemeral | 0 |
| flavor:extra_specs | {"hw:cpu_policy": "dedicated", "hw:mem_page_size": "large", "hw:pci_numa_affinity_policy": "preferred"} |
| flavor:original_name | dedicated-3 |
| flavor:ram | 2048 |
| flavor:swap | 0 |
| flavor:vcpus | 2 |
| hostId | |
| id | 4a791eab-d1eb-41a9-a1bb-92be03b001b9 |
| image | Attempt to boot from volume - no image supplied |
| key_name | keypair-tenant1 |
| locked | False |
| metadata | {} |
| name | tenant1-multiports_pci-21 |
| os-extended-volumes:volumes_attached | [] |
| progress | 0 |
| security_groups | default |
| server_groups | [] |
| status | BUILD |
| tags | [] |
| tenant_id | e6eda99366be44938afcdbbb941447f0 |
| trusted_image_certificates | - |
| updated | 2019-11-25T02:27:08Z |
| user_id | 6ba2a9882ccb476cb1b74f99a92daef7 |
+--------------------------------------+---------------------------------------------------------------------------------------------------------+

Server building... 0% complete
Error building server
ERROR (ResourceInErrorState):

[2019-11-25 02:06:47,180] 311 DEBUG MainThread ssh.send :: Send 'openstack --os-username 'admin' --os-password 'Li69nux*' --os-project-name admin --os-auth-url http://keystone.openstack.svc.cluster.local/v3 --os-user-domain-name Default --os-project-domain-name Default --os-identity-api-version 3 --os-interface internal --os-region-name RegionOne server show 4a791eab-d1eb-41a9-a1bb-92be03b001b9'
                                                                                         |
| created | 2019-11-25T02:27:08Z |
| fault | {u'message': u'No valid host was found. There are not enough hosts available.', u'code': 500, u'details': u' File "/var/lib/openstack/lib/python2.7/site-packages/nova/conductor/manager.py", line 1346, in schedule_and_build_instances\n instance_uuids, return_alternates=True)\n File "/var/lib/openstack/lib/python2.7/site-packages/nova/conductor/manager.py", line 800, in _schedule_instances\n return_alternates=return_alternates)\n File "/var/lib/openstack/lib/python2.7/site-packages/nova/scheduler/client/query.py", line 42, in select_destinations\n instance_uuids, return_objects, return_alternates)\n File "/var/lib/openstack/lib/python2.7/site-packages/nova/scheduler/rpcapi.py", line 160, in select_destinations\n return cctxt.call(ctxt, \'select_destinations\', **msg_args)\n File "/var/lib/openstack/lib/python2.7/site-packages/oslo_messaging/rpc/client.py", line 178, in call\n retry=self.retry)\n File "/var/lib/openstack/lib/python2.7/site-packages/oslo_messaging/transport.py", line 128, in _send\n retry=retry)\n File "/var/lib/openstack/lib/python2.7/site-packages/oslo_messaging/_drivers/amqpdriver.py", line 645, in send\n call_monitor_timeout, retry=retry)\n File "/var/lib/openstack/lib/python2.7/site-packages/oslo_messaging/_drivers/amqpdriver.py", line 636, in _send\n raise result\n', u'created': u'2019-11-25T02:27:08Z'} |
| flavor | dedicated-3 (8bf0b131-4f9a-4e23-98e7-9ba54cd7a5aa) |
| hostId

Test Activity
-------------
Sanity
Feature Testing
Regression Testing
Developer Testing
Evaluation
Other - Please specify

Revision history for this message
Peng Peng (ppeng) wrote :
description: updated
Revision history for this message
Peng Peng (ppeng) wrote :
Revision history for this message
Ghada Khalil (gkhalil) wrote :

@Peng, is this issue reproducible? It says "Seen once". Does that mean it passed in some cases, or was the test run only once? Please run the test case multiple times and update the report with the frequency of failure.

Please also indicate whether the test case was executed on Stein previously (on the same setup) and whether it passed.

tags: added: stx.distro.openstack
Changed in starlingx:
status: New → Incomplete
assignee: nobody → Peng Peng (ppeng)
Revision history for this message
Peng Peng (ppeng) wrote :

The issue was reproduced 3/3 times on Train:
2019-11-21_20-00-00
wcp_3-6

Revision history for this message
yong hu (yhu6) wrote :

@zhipeng, please help debug this LP.

tags: added: stx.3.0
Changed in starlingx:
importance: Undecided → Medium
yong hu (yhu6)
Changed in starlingx:
assignee: Peng Peng (ppeng) → zhipeng liu (zhipengs)
Revision history for this message
zhipeng liu (zhipengs) wrote :

Hi Peng,

In LP 1841660 we fixed a bug that caused the neutron port to go down.
The patch below has been merged since the 20191122T023000Z build:
https://review.opendev.org/#/c/695342/
Unfortunately, you tested with the 2019-11-21_20-00-00 build.
Please retest with the latest green build.
If it still fails, please provide the failure logs as well.

Thanks!
Zhipeng

Revision history for this message
Peng Peng (ppeng) wrote :

The issue was reproduced on:
Lab: WCP_3_6
BUILD_ID="r/stx.3.0"
BUILD_DATE="2019-12-09 02:30:07 +0000"

created | 2019-12-10T21:18:15Z |
| fault | {u'message': u'No valid host was found. There are not enough hosts available.', u'code': 500, u'details': u'Traceback (most recent call last):\n File "/var/lib/openstack/lib/python2.7/site-packages/nova/conductor/manager.py", line 1333, in schedule_and_build_instances\n instance_uuids, return_alternates=True)\n File "/var/lib/openstack/lib/python2.7/site-packages/nova/conductor/manager.py", line 839, in _schedule_instances\n return_alternates=return_alternates)\n File "/var/lib/openstack/lib/python2.7/site-packages/nova/sch...

Revision history for this message
zhipeng liu (zhipengs) wrote :

Hi Peng,

I found the root cause in the nova-scheduler log:
{"log":"2019-11-25 02:27:08.517 1 INFO nova.filters [req-7edad29a-0aa1-4219-9ee2-b80a329f5b66 6ba2a9882ccb476cb1b74f99a92daef7 e6eda99366be44938afcdbbb941447f0 - default default] Filtering removed all hosts for the request with instance ID '4a791eab-d1eb-41a9-a1bb-92be03b001b9'. Filter results: ['RetryFilter: (start: 2, end: 2)', 'ComputeFilter: (start: 2, end: 2)', 'AvailabilityZoneFilter: (start: 2, end: 2)', 'AggregateInstanceExtraSpecsFilter: (start: 2, end: 2)', 'ComputeCapabilitiesFilter: (start: 2, end: 2)', 'ImagePropertiesFilter: (start: 2, end: 2)', 'NUMATopologyFilter: (start: 2, end: 0)']\n","stream":"stdout","time":"2019-11-25T02:27:08.518277823Z"}

NUMATopologyFilter did not pass: 'NUMATopologyFilter: (start: 2, end: 0)'.
It seems Nova Train still does not fully support this feature.
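
For reference, one way to pull this filter summary out of the nova-scheduler pod (a sketch; the openstack namespace and the application=nova,component=scheduler labels follow common openstack-helm conventions and may differ on a given deployment):

kubectl -n openstack logs -l application=nova,component=scheduler --tail=10000 | grep "Filtering removed all hosts"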

Some background FYI.
https://bugs.launchpad.net/nova/+bug/1795920 SR-IOV shared PCI numa not working
https://blueprints.launchpad.net/nova/+spec/vm-scoped-sriov-numa-affinity

The related Nova patch below is still in progress:
https://review.opendev.org/#/c/674072/
support pci numa affinity policies in flavor and image

So, I propose you retest this case after removing
"hw:pci_numa_affinity_policy": "preferred"
from flavor:extra_specs.
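
For reference, removing that extra spec from the flavor seen in the server show output (the dedicated-3 flavor name is taken from the log above) would look roughly like this:

openstack flavor unset --property hw:pci_numa_affinity_policy dedicated-3
openstack flavor show dedicated-3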

Thanks!
Zhipeng

Revision history for this message
zhipeng liu (zhipengs) wrote :

Hi Peng,

Please also check the huge page memory allocation,
as we have seen a similar issue with NUMATopologyFilter failures.
The related email is attached.
======================
What does "virsh capabilities|grep -C5 pages" show on the compute node?
What does /etc/nova/nova.conf show inside the nova-compute container?
It's possible to enable debug mode for nova.conf but it's a little tricky to get the syntax right. Basically you'd be using the "system helm-override-update" command. Can you give the output of "system helm-override-show" for the nova chart in the openstack application?

Thanks!
Zhipeng

Revision history for this message
zhipeng liu (zhipengs) wrote :

If it still fails after the above checks, please enable nova debug mode, as we need more detailed logs, especially from nova-scheduler. Thanks!
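
As a sketch of what enabling nova debug could look like on StarlingX (the stx-openstack application name, nova chart name, and openstack namespace are the usual defaults and are assumed here; the override layout follows the openstack-helm nova chart):

system helm-override-show stx-openstack nova openstack
# nova-debug.yaml is a hypothetical override file containing:
#   conf:
#     nova:
#       DEFAULT:
#         debug: true
system helm-override-update --values nova-debug.yaml stx-openstack nova openstack
system application-apply stx-openstack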

Zhipeng

Revision history for this message
zhipeng liu (zhipengs) wrote :

On my side, I tried to start an SR-IOV VM.

If I set the flavor with hw:mem_page_size=large, NUMATopologyFilter does not pass, even after I enabled and adjusted the huge page memory configuration on the compute node.
If I remove it from the flavor, NUMATopologyFilter passes, but PciPassthroughFilter does not pass in my setup.
That might be caused by my setup; I'm still looking into it.
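
For reference, toggling that property and checking the huge pages provisioned on the compute node might look like this (the m1.test3 flavor and compute-0 host names are placeholders):

openstack flavor set --property hw:mem_page_size=large m1.test3
openstack flavor unset --property hw:mem_page_size m1.test3
system host-memory-list compute-0
grep -H "" /sys/devices/system/node/node*/hugepages/hugepages-*/nr_hugepages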

Zhipeng

Revision history for this message
yong hu (yhu6) wrote :

Upgrade it to "high" because the related case is relatively important to the users.

Changed in starlingx:
importance: Medium → High
Revision history for this message
zhipeng liu (zhipengs) wrote :

Hi all,

This is a known issue (low priority, no solution so far):
a virtio port and an SR-IOV port cannot be configured on the same data network, which prevents the ovs-agent from starting.
https://bugs.launchpad.net/starlingx/+bug/1836313

In the TIS_AUTOMATION log, we can see three ports configured on the same data network:
Line 45233: [2019-11-25 02:06:20,260] openstack port create --network=33db651b-a402-48c5-a432-e35560ecc079 --vnic-type=normal port_virtio-7
Line 45281: [2019-11-25 02:06:22,850] openstack port create --network=33db651b-a402-48c5-a432-e35560ecc079 --vnic-type=direct port_pci-sriov-8
Line 45329: [2019-11-25 02:06:25,556] openstack port create --network=33db651b-a402-48c5-a432-e35560ecc079 --vnic-type=direct-physical port_pci-passthrough-9

I also reproduced this issue in my bare-metal setup.
If I put the virtio and SR-IOV ports on different networks, it works, as shown below.

controller-0:~$ openstack server create --flavor m1.test3 --image cirros --nic port-id=a566413b-6fcf-4f4f-b215-17b11fc78291 --nic port-id=f5158f6d-29b1-47ac-be8e-1f6618bc4ee4 --nic net-id=private-net0 test-sriov

controller-0:~$ openstack server list
+--------------------------------------+------------+--------+--------------------------------------------------------------------------+--------+----------+
| ID | Name | Status | Networks | Image | Flavor |
+--------------------------------------+------------+--------+--------------------------------------------------------------------------+--------+----------+
| 7102d10d-f635-48d8-99d1-2e8e8b3ae14c | test-sriov | ACTIVE | private-net0=192.168.201.85; public-net0=192.168.101.101, 192.168.101.33 | cirros | m1.test3 |
+--------------------------------------+------------+--------+--------------------------------------------------------------------------+--------+----------+
controller-0:~$ openstack port list
+--------------------------------------+-------------+-------------------+--------------------------------------------------------------------------------+--------+
| ID | Name | MAC Address | Fixed IP Addresses | Status |
+--------------------------------------+-------------+-------------------+--------------------------------------------------------------------------------+--------+
| 0c664415-e1a7-4ab9-b5d6-f4d3a3a4b98f | | fa:16:3e:5f:a6:5e | ip_address='192.168.201.1', subnet_id='bf7f1185-2cc3-49c9-887c-196ccb48386d' | ACTIVE |
| 1b10ed27-703a-4ee6-971c-58ba7410213e | | fa:16:3e:53:24:dc | ip_address='192.168.201.85', subnet_id='bf7f1185-2cc3-49c9-887c-196ccb48386d' | ACTIVE |
| 21713345-da91-4ea6-9f24-1ce97c497c50 | | fa:16:3e:48:db:83 | ip_address='192.168.101.2', subnet_id='ed560723-af25-416b-ac6e-bc924a391816' | ACTIVE |
| 2a67da3a-ba8d-4626-90e7-0c3c61566f55 | | fa:16:3e:cf:79:85 | ip_address='192.168.1.158', subnet_id='0ff46234-a365-4736-8399-ec2ca6ffb950' | ACTIVE |
| a566413b-6fcf-4f4f-b215-17b11fc78291 | sriov_port | fa:16:3e:79:f2:04 | ip_add...


Changed in starlingx:
status: Incomplete → Confirmed
Revision history for this message
yong hu (yhu6) wrote :

An enhancement is needed to amend this kind of test case.
Let's move this LP to stx.4.0.

tags: added: stx.4.0
removed: stx.3.0
Yang Liu (yliu12)
tags: added: stx.retestneeded
Revision history for this message
Yang Liu (yliu12) wrote :

Will update the test case to remove this scenario due to the ovs-agent limitation.

tags: removed: stx.retestneeded
Revision history for this message
zhipeng liu (zhipengs) wrote :

Setting this to Invalid after confirmation from Yang.

Changed in starlingx:
status: Confirmed → Invalid