[pci-passthrough] nova-scheduler throws a 500 when PCI passthrough is used if the PCI passthrough scheduler filter has not been added

Bug #1391816 reported by Itzik Brown
This bug affects 3 people
Affects: OpenStack Compute (nova)
Status: Confirmed
Importance: Low
Assigned to: Unassigned
Milestone: —

Bug Description

When launching an instance with a preconfigured port and there are no VFs available for allocation, the error message is not clear.

#neutron port-create tenant1-net1 --binding:vnic-type direct
#nova boot --flavor m1.tiny --image cirros --nic port-id=<port-id> vm100

# nova show vm100

(output truncated)

| fault | {"message": "PCI device request ({'requests': [InstancePCIRequest(alias_name=None,count=1,is_new=False,request_id=66b02b9b-600b-4c46-9f66-38ceb6cc2742,spec=[{physical_network='physnet2'}])], 'code': 500}equests)s failed", "code": 500, "created": "2014-11-12T10:10:14Z"} |

Expected:
A clear message in the fault entry when issuing 'nova show', or better, when launching the instance.

Tags: pci
Revision history for this message
Joe Gordon (jogo) wrote :

What version of nova is this on?

A 500 error code usually means there is a stack trace in the logs somewhere. Can you please include a paste of the stack trace causing the 500 error?

Changed in nova:
status: New → Incomplete
Revision history for this message
Itzik Brown (itzikb1) wrote :

Version
=======
Juno on RHEL7.0
openstack-nova-compute-2014.2-2.el7.centos.noarch
python-novaclient-2.20.0-1.el7.centos.noarch
openstack-nova-conductor-2014.2-2.el7.centos.noarch
openstack-nova-cert-2014.2-2.el7.centos.noarch
openstack-nova-scheduler-2014.2-2.el7.centos.noarch
python-nova-2014.2-2.el7.centos.noarch
openstack-nova-api-2014.2-2.el7.centos.noarch
openstack-nova-novncproxy-2014.2-2.el7.centos.noarch
openstack-nova-common-2014.2-2.el7.centos.noarch
openstack-nova-console-2014.2-2.el7.centos.noarch

nova-scheduler.log
===============
2014-11-16 09:36:58.779 12356 ERROR oslo.messaging.rpc.dispatcher [req-bd6e304d-00c3-489a-88f2-91cabbc23fed ] Exception during message handling: PCI device request ({'requests': [InstancePCIRequest(alias_name=None,count=1,is_new=False,request_id=d887d713-2c42-4a88-b238-37acc0517715,spec=[{physical_network='physnet2'}])], 'code': 500}equests)s failed
2014-11-16 09:36:58.779 12356 TRACE oslo.messaging.rpc.dispatcher Traceback (most recent call last):
2014-11-16 09:36:58.779 12356 TRACE oslo.messaging.rpc.dispatcher File "/usr/lib/python2.7/site-packages/oslo/messaging/rpc/dispatcher.py", line 134, in _dispatch_and_reply
2014-11-16 09:36:58.779 12356 TRACE oslo.messaging.rpc.dispatcher incoming.message))
2014-11-16 09:36:58.779 12356 TRACE oslo.messaging.rpc.dispatcher File "/usr/lib/python2.7/site-packages/oslo/messaging/rpc/dispatcher.py", line 177, in _dispatch
2014-11-16 09:36:58.779 12356 TRACE oslo.messaging.rpc.dispatcher return self._do_dispatch(endpoint, method, ctxt, args)
2014-11-16 09:36:58.779 12356 TRACE oslo.messaging.rpc.dispatcher File "/usr/lib/python2.7/site-packages/oslo/messaging/rpc/dispatcher.py", line 123, in _do_dispatch
2014-11-16 09:36:58.779 12356 TRACE oslo.messaging.rpc.dispatcher result = getattr(endpoint, method)(ctxt, **new_args)
2014-11-16 09:36:58.779 12356 TRACE oslo.messaging.rpc.dispatcher File "/usr/lib/python2.7/site-packages/oslo/messaging/rpc/server.py", line 139, in inner
2014-11-16 09:36:58.779 12356 TRACE oslo.messaging.rpc.dispatcher return func(*args, **kwargs)
2014-11-16 09:36:58.779 12356 TRACE oslo.messaging.rpc.dispatcher File "/usr/lib/python2.7/site-packages/nova/scheduler/manager.py", line 175, in select_destinations
2014-11-16 09:36:58.779 12356 TRACE oslo.messaging.rpc.dispatcher filter_properties)
2014-11-16 09:36:58.779 12356 TRACE oslo.messaging.rpc.dispatcher File "/usr/lib/python2.7/site-packages/nova/scheduler/filter_scheduler.py", line 147, in select_destinations
2014-11-16 09:36:58.779 12356 TRACE oslo.messaging.rpc.dispatcher filter_properties)
2014-11-16 09:36:58.779 12356 TRACE oslo.messaging.rpc.dispatcher File "/usr/lib/python2.7/site-packages/nova/scheduler/filter_scheduler.py", line 308, in _schedule
2014-11-16 09:36:58.779 12356 TRACE oslo.messaging.rpc.dispatcher chosen_host.obj.consume_from_instance(instance_properties)
2014-11-16 09:36:58.779 12356 TRACE oslo.messaging.rpc.dispatcher File "/usr/lib/python2.7/site-packages/nova/scheduler/host_manager.py", line 246, in consume_from_instance
2014-11-16 09:36:58.779 12356 TRACE oslo.messaging.rpc.dispatcher self.pci_stats.apply_requests(p...


Changed in nova:
status: Incomplete → New
Joe Gordon (jogo)
Changed in nova:
status: New → Confirmed
importance: Undecided → Low
Moshe Levi (moshele)
tags: added: pci-passthrough
Revision history for this message
Moshe Levi (moshele) wrote :

Hi Itzik,

I think you didn't add the PciPassthroughFilter to nova.conf.
Can you make sure you have the following lines:
scheduler_available_filters=nova.scheduler.filters.all_filters
scheduler_default_filters=RamFilter,ComputeFilter,AvailabilityZoneFilter,ComputeCapabilitiesFilter,ImagePropertiesFilter,PciPassthroughFilter

Then you should see this message in the log:
Filter PciPassthroughFilter returned 0 hosts
2015-02-01 13:53:51.700 DEBUG nova.scheduler.filter_scheduler [req-cacc47a6-8242-46b6-83c8-ea696798612f demo demo] There are 0 hosts available but 1 instances requested to build
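To illustrate what that filter does, here is a minimal, self-contained sketch of a PciPassthroughFilter-style scheduler filter. The class and method names mirror Nova's filter interface, but FakePciStats and its internals are simplified stand-ins for illustration, not Nova's actual code.

```python
# Sketch of how a PciPassthroughFilter-style scheduler filter decides
# whether a host can satisfy an instance's PCI requests.

class FakePciStats:
    """Hypothetical stand-in for nova.pci.stats.PciDeviceStats."""
    def __init__(self, free_vfs_per_net):
        # e.g. {'physnet2': 0} means no VFs left on that physical network
        self.free_vfs_per_net = free_vfs_per_net

    def support_requests(self, requests):
        # Each request asks for `count` VFs matching a physical_network spec.
        return all(
            self.free_vfs_per_net.get(spec['physical_network'], 0) >= req['count']
            for req in requests
            for spec in req['spec']
        )

class HostState:
    def __init__(self, pci_stats):
        self.pci_stats = pci_stats

class PciPassthroughFilter:
    """Reject hosts that cannot satisfy the instance's PCI requests."""
    def host_passes(self, host_state, filter_properties):
        requests = filter_properties.get('pci_requests')
        if not requests:
            return True  # no PCI devices requested; any host is fine
        return host_state.pci_stats.support_requests(requests)

f = PciPassthroughFilter()
good = HostState(FakePciStats({'physnet2': 4}))
bad = HostState(FakePciStats({'physnet2': 0}))
req = {'pci_requests': [{'count': 1, 'spec': [{'physical_network': 'physnet2'}]}]}
print(f.host_passes(good, req))  # True
print(f.host_passes(bad, req))   # False
```

With the filter enabled, hosts with no free VFs are filtered out before scheduling, which is why the log then shows "Filter PciPassthroughFilter returned 0 hosts" instead of a 500.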

Revision history for this message
Moshe Levi (moshele) wrote :

Itzik,
here is the nova show output:
moshele@r-ufm165:/.autodirect/mtrswgwork/moshele/openstack/devstack$ nova show 39ad9624-760f-43bf-8cf0-2b36ff9a7fb8
+--------------------------------------+-------------------------------------------------------------------------------------------------------------------------------+
| Property | Value |
+--------------------------------------+-------------------------------------------------------------------------------------------------------------------------------+
| OS-DCF:diskConfig | MANUAL |
| OS-EXT-AZ:availability_zone | nova |
| OS-EXT-STS:power_state | 0 |
| OS-EXT-STS:task_state | - |
| OS-EXT-STS:vm_state | error |
| OS-SRV-USG:launched_at | - |
| OS-SRV-USG:terminated_at | - |
| accessIPv4 | |
| accessIPv6 | |
| config_drive | |
| created | 2015-02-01T13:29:31Z |
| fault | {"message": "No valid host was found. There are not enough hosts available.", "code": 500, "created": "2015-02-01T13:29:32Z"} |
| flavor | m1.medium (3) |
| hostId | |
| id | 39ad9624-760f-43bf-8cf0-2b36ff9a7fb8 |
| image ...


Revision history for this message
Itzik Brown (itzikb1) wrote :

You are right.
It's a misconfiguration.

Changed in nova:
status: Confirmed → Invalid
Changed in nova:
status: Invalid → Triaged
summary: - [pci-passthrough] PCI Clear message should be reported when there are no
- VFs left for allocation
+ [pci-passthrough] nova-scheduler throws a 500 when PCI passthrough is
+ used if the PCI passthrough scheduler has not been added
description: updated
Revision history for this message
Sean Dague (sdague) wrote : Re: [pci-passthrough] nova-scheduler throws a 500 when PCI passthrough is used if the PCI passthrough scheduler has not been added

Triaged means there is a clear path to fixing it? What is that clear path?

Sean Dague (sdague)
Changed in nova:
status: Triaged → Confirmed
Revision history for this message
Rui Chen (kiwik-chenrui) wrote :

I specified a host to boot an instance with a PCI request, but the host didn't have any PCI devices, like below:

 nova boot --image cirros-0.3.2-x86_64-disk --nic net-id=d9eee163-f148-4244-92c5-ffda7d9db06a --flavor chenrui_f --availability ::devstack chenrui_pci

An exception is raised from self.pci_stats.apply_requests in HostState.consume_from_instance:

https://github.com/openstack/nova/blob/master/nova/pci/stats.py#L234

But at this point, part of the compute resources (RAM, disk, vCPUs and so on) has already been consumed, and there is no revert logic to release those resources when the exception is raised. I think this is a resource leak.
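A toy reproduction of the leak described above, using a heavily simplified stand-in for Nova's HostState (the class and exception names are modeled on Nova's, but the internals are hypothetical): generic resources are subtracted before the PCI check, so every failed PCI request permanently decrements the in-memory host state.

```python
# Toy model of the leak: consume_from_instance subtracts RAM *before* the
# PCI check, and a PciDeviceRequestFailed raised afterwards is never
# rolled back, so the host "loses" RAM on every failed boot attempt.

class PciDeviceRequestFailed(Exception):
    pass

class HostState:
    def __init__(self, free_ram_mb, pci_available):
        self.free_ram_mb = free_ram_mb
        self.pci_available = pci_available

    def consume_from_instance(self, instance):
        # Generic resources are consumed first...
        self.free_ram_mb -= instance['ram_mb']
        # ...then the PCI request is applied; if it fails, nothing
        # above is reverted.
        if instance['pci_count'] > self.pci_available:
            raise PciDeviceRequestFailed()
        self.pci_available -= instance['pci_count']

host = HostState(free_ram_mb=1024, pci_available=0)
for _ in range(3):
    try:
        host.consume_from_instance({'ram_mb': 512, 'pci_count': 1})
    except PciDeviceRequestFailed:
        pass
print(host.free_ram_mb)  # -512: RAM leaked even though every boot failed
```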

I booted 12 instances; the following is nova-scheduler.log, where you can see the resources constantly decreasing. In the end I had to restart nova-scheduler, or else I couldn't boot any instances.

stack@devstack:/opt/stack/logs$ tailf screen-n-sch.log | fgrep 'Selected host: WeighedHost'
2015-05-11 15:54:45.735 DEBUG nova.scheduler.filter_scheduler [req-11dcc5ee-586a-472f-afa0-260c282676e3 admin admin] Selected host: WeighedHost [host: (devstack, devstack) ram:14509 disk:76800 io_ops:0 instances:2, weight: 0.965914386526] _schedule /opt/stack/nova/nova/scheduler/filter_scheduler.py:158
2015-05-11 15:54:53.620 DEBUG nova.scheduler.filter_scheduler [req-a88af594-2633-4527-8d8b-4db8feef7489 admin admin] Selected host: WeighedHost [host: (devstack, devstack) ram:13997 disk:75776 io_ops:0 instances:3, weight: 0.931828773051] _schedule /opt/stack/nova/nova/scheduler/filter_scheduler.py:158
2015-05-11 15:54:58.849 DEBUG nova.scheduler.filter_scheduler [req-8a79ad56-eb1b-4bc8-8573-d387bfc38184 admin admin] Selected host: WeighedHost [host: (devstack, devstack) ram:13485 disk:74752 io_ops:0 instances:4, weight: 0.897743159577] _schedule /opt/stack/nova/nova/scheduler/filter_scheduler.py:158
2015-05-11 15:55:05.956 DEBUG nova.scheduler.filter_scheduler [req-e2a3577a-e739-406b-957a-3bc8fc16a7d8 admin admin] Selected host: WeighedHost [host: (devstack, devstack) ram:12973 disk:73728 io_ops:0 instances:5, weight: 0.863657546102] _schedule /opt/stack/nova/nova/scheduler/filter_scheduler.py:158
2015-05-11 15:55:10.868 DEBUG nova.scheduler.filter_scheduler [req-6f943265-dfc7-473a-a9df-3e078c7abb08 admin admin] Selected host: WeighedHost [host: (devstack, devstack) ram:12461 disk:72704 io_ops:0 instances:6, weight: 0.829571932628] _schedule /opt/stack/nova/nova/scheduler/filter_scheduler.py:158
2015-05-11 15:55:43.500 DEBUG nova.scheduler.filter_scheduler [req-e171dcfd-373e-4ff9-b7de-e8d8d977b727 admin admin] Selected host: WeighedHost [host: (devstack, devstack) ram:11949 disk:71680 io_ops:0 instances:7, weight: 0.795486319153] _schedule /opt/stack/nova/nova/scheduler/filter_scheduler.py:158
2015-05-11 15:55:55.551 DEBUG nova.scheduler.filter_scheduler [req-522f9d71-35ed-44bb-b308-d3f78374c24e admin admin] Selected host: WeighedHost [host: (devstack, devstack) ram:11437 disk:70656 io_ops:0 instances:8, weight: 0.761400705679] _schedule /opt/stack/nova/nova/scheduler/filter_scheduler.py:158
2015-05-11 15:56:13.723 DEBUG nova.scheduler.filter_scheduler [req-106cccfb-4778-4eb7-90d8-a97d4a62de8c admin admin] Selec...


Rui Chen (kiwik-chenrui)
Changed in nova:
assignee: nobody → Rui Chen (kiwik-chenrui)
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to nova (master)

Fix proposed to branch: master
Review: https://review.openstack.org/182165

Changed in nova:
status: Confirmed → In Progress
summary: [pci-passthrough] nova-scheduler throws a 500 when PCI passthrough is
- used if the PCI passthrough scheduler has not been added
+ used if the PCI passthrough scheduler filter has not been added
Revision history for this message
Nikola Đipanov (ndipanov) wrote :

According to comment #7, this sounds like a symptom of the following, slightly more general bug:

https://bugs.launchpad.net/nova/+bug/1427772

Though not specifying the filter and then requesting a PCI device is really a misconfiguration rather than anything else.

I'll comment further on the proposed patch.

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix merged to nova (master)

Reviewed: https://review.openstack.org/182165
Committed: https://git.openstack.org/cgit/openstack/nova/commit/?id=26a4f430d1ec45f7b9a61b69f009a705d2d3cd52
Submitter: Jenkins
Branch: master

commit 26a4f430d1ec45f7b9a61b69f009a705d2d3cd52
Author: Rui Chen <email address hidden>
Date: Mon May 11 17:45:04 2015 +0800

    Fix resource leaking when consume_from_instance raises an exception

    When we boot an instance with a forced host, and the instance
    requests some PCI devices, all the scheduler filters are
    ignored, including PciPassthroughFilter.
    So if the specified host can't satisfy the PCI request, a
    PciDeviceRequestFailed exception is raised from
    HostState.consume_from_instance. At this point part of the
    compute resources (vcpus, ram, disk and so on) has already
    been updated, so we should give the HostState resources a
    chance to be reverted in order to avoid resource leaking.

    When the exception is raised from self.pci_stats.apply_requests(),
    catch it and don't set the updated field of HostState;
    HostState will be synced with the db in the next scheduling
    pass, and the resources are reverted at that time.

    Change-Id: I0a87088d56337bcd180606013a5473fa2ec6c608
    Related-Bug: #1391816
    Closes-Bug: #1482019
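The idea behind the merged fix can be sketched as follows. The classes and the sync_from_db helper are hypothetical simplifications, not the actual patch: when the PCI request fails, the `updated` field is left untouched, so the stale in-memory view is replaced from the database on the next scheduling pass.

```python
# Sketch of the fix's mechanism: on PCI failure, don't mark the
# HostState as updated; the next DB sync then reverts the partial
# consumption. All names here are simplified stand-ins.
import datetime

class PciDeviceRequestFailed(Exception):
    pass

class HostState:
    def __init__(self, free_ram_mb, pci_available):
        self.free_ram_mb = free_ram_mb
        self.pci_available = pci_available
        self.updated = None  # None means stale: re-read from the DB

    def consume_from_instance(self, instance):
        # Generic resources are still consumed first, as before the fix.
        self.free_ram_mb -= instance['ram_mb']
        if instance['pci_count'] > self.pci_available:
            # Fix: propagate the failure *without* touching self.updated,
            # so the next sync replaces this in-memory view.
            raise PciDeviceRequestFailed()
        self.pci_available -= instance['pci_count']
        self.updated = datetime.datetime.now()  # only set on success

def sync_from_db(host, db_free_ram_mb):
    # Hypothetical stand-in for the scheduler refreshing HostState from
    # the compute node record; a host never marked updated is refreshed.
    if host.updated is None:
        host.free_ram_mb = db_free_ram_mb

host = HostState(free_ram_mb=1024, pci_available=0)
try:
    host.consume_from_instance({'ram_mb': 512, 'pci_count': 1})
except PciDeviceRequestFailed:
    pass

sync_from_db(host, db_free_ram_mb=1024)
print(host.free_ram_mb)  # 1024: the leaked RAM is recovered on the next sync
```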

tags: added: pci
removed: pci-passthrough
Changed in nova:
status: In Progress → Confirmed
assignee: Rui Chen (kiwik-chenrui) → nobody