[SR-IOV][CPU Pinning] Error state of VMs booted on an SR-IOV port with 1 NUMA node

Bug #1581471 reported by Kristina Berezovskaia
This bug affects 2 people
Affects             Status           Importance   Assigned to      Milestone
Mirantis OpenStack  (status tracked in 10.0.x)
  10.0.x            Fix Committed    High         Sergey Nikitin
  9.x               Fix Released     High         Sergey Nikitin

Bug Description

Upstream bug: https://bugs.launchpad.net/nova/+bug/1582278

Detailed bug description:
 Instances booted on an SR-IOV port with a CPU-pinning flavor do not work properly: in most cases the VM goes to the ERROR state after boot

Steps to reproduce:
 1) Deploy an environment with SR-IOV and CPU pinning enabled
 2) Create aggregates for cpu pinning:
nova aggregate-create performance
nova aggregate-set-metadata performance pinned=true
nova aggregate-add-host performance node-2.test.domain.local
nova aggregate-add-host performance node-3.test.domain.local
nova aggregate-add-host performance node-4.test.domain.local
nova aggregate-add-host performance node-5.test.domain.local
 3) Create a new flavor with a dedicated CPU policy and a single NUMA node (a sketch of the creation commands follows after step 5); the resulting flavor is:
nova flavor-show m1.small.performance
+----------------------------+-------------------------------------------------------------------------------------------------------+
| Property | Value |
+----------------------------+-------------------------------------------------------------------------------------------------------+
| OS-FLV-DISABLED:disabled | False |
| OS-FLV-EXT-DATA:ephemeral | 0 |
| disk | 20 |
| extra_specs | {"aggregate_instance_extra_specs:pinned": "true", "hw:cpu_policy": "dedicated", "hw:numa_nodes": "1"} |
| id | 7b0e5ee0-0bf7-4a46-9653-9279a947c650 |
| name | m1.small.performance |
| os-flavor-access:is_public | True |
| ram | 2048 |
| rxtx_factor | 1.0 |
| swap | |
| vcpus | 1 |
+----------------------------+-------------------------------------------------------------------------------------------------------+
 4) Download an Ubuntu image
 5) Create an SR-IOV port and boot a VM on this port with the m1.small.performance flavor:
NODE_1='node-4.test.domain.local'
NODE_2='node-5.test.domain.local'
NET_ID_1=$(neutron net-list | grep net_EW_2 | awk '{print$2}')
neutron port-create $NET_ID_1 --binding:vnic-type direct --device_owner nova-compute --name sriov_23
port_id=$(neutron port-list | grep 'sriov_23' | awk '{print$2}')
nova boot vm23 --flavor m1.small.performance --image ubuntu_image --availability-zone nova:$NODE_1 --nic port-id=$port_id --key-name vm_key
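
Step 3 above shows only the resulting flavor. A minimal sketch of the commands that could produce such a flavor (RAM, disk and vCPU values are taken from the flavor-show output; the auto-generated ID is an assumption):

# create a flavor with 2048 MB RAM, 20 GB disk and 1 vCPU
nova flavor-create m1.small.performance auto 2048 20 1
# request dedicated CPUs on a single NUMA node and restrict scheduling
# to hosts in the "performance" aggregate
nova flavor-key m1.small.performance set \
  aggregate_instance_extra_specs:pinned=true \
  hw:cpu_policy=dedicated \
  hw:numa_nodes=1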

Expected results:
 VM is in the ACTIVE state
Actual result:
 In most cases the state is ERROR. Sometimes we can boot VMs without specifying an availability zone, but even then they very rarely reach the ACTIVE state
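
When the instance lands in ERROR, the fault reason can usually be read from the instance details; a quick check, assuming the VM name from step 5 and the default compute log location:

nova show vm23 | grep -A 2 fault
# on the target compute node, look for PCI/NUMA claim failures
grep -iE 'pci|numa' /var/log/nova/nova-compute.log | tail -n 50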

Description of the environment:
 ISO #312, 9.0, Neutron with VLAN segmentation. 1 controller, 4 computes:
- 2 computes with SR-IOV + huge pages + CPU pinning
- 1 compute with CPU pinning + huge pages
- 1 compute with CPU pinning

shotgun2 short-report
cat /etc/fuel_build_id:
 312
cat /etc/fuel_build_number:
 312
cat /etc/fuel_release:
 9.0
cat /etc/fuel_openstack_version:
 mitaka-9.0
rpm -qa | egrep 'fuel|astute|network-checker|nailgun|packetary|shotgun':
 fuel-release-9.0.0-1.mos6344.noarch
 network-checker-9.0.0-1.mos72.x86_64
 fuel-mirror-9.0.0-1.mos135.noarch
 fuel-openstack-metadata-9.0.0-1.mos8681.noarch
 fuel-notify-9.0.0-1.mos8338.noarch
 fuel-ostf-9.0.0-1.mos934.noarch
 python-fuelclient-9.0.0-1.mos313.noarch
 fuelmenu-9.0.0-1.mos270.noarch
 fuel-9.0.0-1.mos6344.noarch
 fuel-utils-9.0.0-1.mos8338.noarch
 fuel-nailgun-9.0.0-1.mos8681.noarch
 rubygem-astute-9.0.0-1.mos742.noarch
 fuel-misc-9.0.0-1.mos8338.noarch
 fuel-library9.0-9.0.0-1.mos8338.noarch
 shotgun-9.0.0-1.mos88.noarch
 fuel-agent-9.0.0-1.mos276.noarch
 fuel-ui-9.0.0-1.mos2678.noarch
 fuel-setup-9.0.0-1.mos6344.noarch
 nailgun-mcagents-9.0.0-1.mos742.noarch
 fuel-provisioning-scripts-9.0.0-1.mos8681.noarch
 python-packetary-9.0.0-1.mos135.noarch
 fuel-bootstrap-cli-9.0.0-1.mos276.noarch
 fuel-migrate-9.0.0-1.mos8338.noarch

Revision history for this message
Kristina Berezovskaia (kkuznetsova) wrote :
Changed in mos:
importance: Undecided → High
Changed in mos:
assignee: MOS Nova (mos-nova) → Sergey Nikitin (snikitin)
tags: added: area-nova
description: updated
Revision history for this message
Sergey Nikitin (snikitin) wrote :

fixes for stable/mitaka: https://review.openstack.org/#/q/status:open+project:openstack/nova+branch:stable/mitaka+topic:stable_mitaka_sriov

Will wait for the merge and for the sync with stable/mitaka.

Revision history for this message
Sergey Nikitin (snikitin) wrote :

The fix was merged into master a couple of weeks ago, so the bug is fixed for 10.0.
https://review.fuel-infra.org/gitweb?p=openstack/nova.git;a=commit;h=74fbff88639891269f6a0752e70b78340cf87e9a

Still waiting for the merge into stable/mitaka.

Revision history for this message
Kristina Berezovskaia (kkuznetsova) wrote :

On 9.0 ISO #362 we also saw the same situation with Huge Pages.

Revision history for this message
Fuel Devops McRobotson (fuel-devops-robot) wrote : Related fix proposed to openstack/nova (9.0/mitaka)

Related fix proposed to branch: 9.0/mitaka
Change author: Jay Pipes <email address hidden>
Review: https://review.fuel-infra.org/21472

Revision history for this message
Fuel Devops McRobotson (fuel-devops-robot) wrote : Fix proposed to openstack/nova (9.0/mitaka)

Fix proposed to branch: 9.0/mitaka
Change author: Jay Pipes <email address hidden>
Review: https://review.fuel-infra.org/21473

Revision history for this message
Timur Nurlygayanov (tnurlygayanov) wrote :

The issue is critical for the 9.0 release; please merge the fix into stable/mitaka.

Revision history for this message
Fuel Devops McRobotson (fuel-devops-robot) wrote : Related fix merged to openstack/nova (9.0/mitaka)

Reviewed: https://review.fuel-infra.org/21472
Submitter: Pkgs Jenkins <email address hidden>
Branch: 9.0/mitaka

Commit: 8cda4898425372a062cdec1018b56116386738f0
Author: Jay Pipes <email address hidden>
Date: Wed Jun 1 09:52:56 2016

pci: pass in instance PCI requests to claim

Removes the calls to InstancePCIRequests.get_XXX() from within the
claims.Claim and claims.MoveClaim constructors and instead has the
resource tracker construct the PCI requests and pass them into the
constructor.

This allows us to remove the needlessly duplicative _test_pci() method
in claims.MoveClaim and will allow the next patch in the series to
remove the call in nova.pci.manager.PciDevTracker.claim_instance() that
re-fetches PCI requests for the supplied instance.

Related-Bug: #1368201
Related-Bug: #1581471

Change-Id: Ib2cc7c985839fbf88b5e6e437c4b395ab484b1b6
(cherry picked from commit 74fbff88639891269f6a0752e70b78340cf87e9a)

Revision history for this message
Fuel Devops McRobotson (fuel-devops-robot) wrote : Fix merged to openstack/nova (9.0/mitaka)

Reviewed: https://review.fuel-infra.org/21473
Submitter: Pkgs Jenkins <email address hidden>
Branch: 9.0/mitaka

Commit: 36dbc9fe165e9f9936148a82f1c66e7cc44a9b19
Author: Jay Pipes <email address hidden>
Date: Wed Jun 1 09:52:56 2016

pci: eliminate DB lookup PCI requests during claim

The nova.pci.manager.PciDevTracker.claim_instance() accepted an Instance
object and called nova.objects.InstancePCIRequests.get_by_instance() to
retrieve the PCI requests for the instance. This caused a DB lookup of
the PCI requests for that instance, even though in all situations other
than for migration/resize, the instance's PCI requests were already
retrieved by the resource tracker.

This change removes that additional DB lookup during claim_instance() by
changing the instance parameter to instead be an InstancePCIRequests
object and an InstanceNUMATopology object.

Also in this patch is a change to nova.objects.PciDevice.claim() that
changes the single parameter to an instance UUID instead of an Instance
object, since nothing other than the instance's UUID was used in the
method.

Related-Bug: #1368201
Closes-Bug: #1581471

Change-Id: I9ab10c3035628f083233114b47b43a9b9ecdd166
(cherry picked from commit 1f259e2a9423a4777f79ca561d5e6a74747a5019)

tags: added: on-verification
Revision history for this message
Kristina Berezovskaia (kkuznetsova) wrote :

Verify on:
cat /etc/fuel_build_id:
 466
cat /etc/fuel_build_number:
 466
cat /etc/fuel_release:
 9.0
cat /etc/fuel_openstack_version:
 mitaka-9.0
rpm -qa | egrep 'fuel|astute|network-checker|nailgun|packetary|shotgun':
 fuel-release-9.0.0-1.mos6349.noarch
 fuel-misc-9.0.0-1.mos8454.noarch
 python-packetary-9.0.0-1.mos140.noarch
 fuel-bootstrap-cli-9.0.0-1.mos285.noarch
 fuel-migrate-9.0.0-1.mos8454.noarch
 rubygem-astute-9.0.0-1.mos750.noarch
 fuel-mirror-9.0.0-1.mos140.noarch
 shotgun-9.0.0-1.mos90.noarch
 fuel-openstack-metadata-9.0.0-1.mos8742.noarch
 fuel-notify-9.0.0-1.mos8454.noarch
 nailgun-mcagents-9.0.0-1.mos750.noarch
 python-fuelclient-9.0.0-1.mos325.noarch
 fuel-9.0.0-1.mos6349.noarch
 fuel-utils-9.0.0-1.mos8454.noarch
 fuel-setup-9.0.0-1.mos6349.noarch
 fuel-provisioning-scripts-9.0.0-1.mos8742.noarch
 fuel-library9.0-9.0.0-1.mos8454.noarch
 network-checker-9.0.0-1.mos74.x86_64
 fuel-agent-9.0.0-1.mos285.noarch
 fuel-ui-9.0.0-1.mos2717.noarch
 fuel-ostf-9.0.0-1.mos935.noarch
 fuelmenu-9.0.0-1.mos274.noarch
 fuel-nailgun-9.0.0-1.mos8742.noarch

Created SR-IOV VMs with a CPU-pinning flavor and with a flavor combining pinning and huge pages. The VMs are in the ACTIVE state, and CPUs and huge pages are distributed correctly (a verification sketch follows below).
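
A minimal sketch of how the placement can be checked on the compute node hosting the VM (the libvirt domain name below is hypothetical; take the real one from virsh list):

virsh list --all
virsh vcpupin instance-00000001        # hypothetical domain name; shows vCPU-to-pCPU pinning
virsh dumpxml instance-00000001 | grep -A 2 -E 'memoryBacking|hugepages|numatune'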

tags: removed: on-verification