live-migrate failed for VM with various # of interfaces

Bug #1837759 reported by Chris Winnicki
Affects    Status   Importance  Assigned to  Milestone
StarlingX  Invalid  High        zhipeng liu

Bug Description

Brief Description
-----------------
live-migrate failed for VM with maximum number (16) of interfaces attached

Severity
--------
Major

Steps to Reproduce
------------------
Install a standard system, ex: 2 controllers + 3 computes

Start a VM with 16 NICs, ex:

nova --os-username 'tenant1' --os-password 'Li69nux*' --os-project-name tenant1 --os-auth-url http://keystone.openstack.svc.cluster.local/v3 --os-user-domain-name Default --os-project-domain-name Default --os-endpoint-type internalURL --os-region-name RegionOne boot --image=8c955aaf-3445-4aa6-aa4b-07a4c131e36a --flavor=90af2a35-8d30-4a35-805c-ca555b242b88 --poll --key-name=keypair-tenant1 --nic net-id=5b550dc6-aaf8-408e-acfb-7ec54cf477ea --nic port-id=2da9f920-00e2-43a9-825c-830f1c12c138 --nic port-id=2bb20890-377e-4cd5-8b3c-038a89a3817c --nic port-id=0c9d22f2-0cf9-47b5-abd9-59e74212f235 --nic port-id=9942ce14-ebb6-4deb-9405-19b9ab999a6e --nic port-id=4052d775-5cfc-444f-b6be-d643de1cf6fc --nic port-id=97002a5b-a973-42bd-b69d-2f0775601cf0 --nic port-id=98901dbf-d6f9-4558-8165-d593c4491d51 --nic port-id=f1d67bae-0393-47b5-994b-9b0bbcdea15b --nic port-id=8707ea32-04fe-4da6-8557-dbc7536e798e --nic port-id=edec9153-ff00-41b0-b80a-a8618d60df12 --nic port-id=58838a9a-9f10-4811-a592-76646dfb79a2 --nic port-id=853265c1-53c2-4fa6-a2f0-9c0118e01af5 --nic port-id=2cb244d0-7972-4632-9ffd-142fa7a601c9 --nic port-id=b770da4b-9507-4f0a-af27-20066c9d033a --nic port-id=67831b2c-029e-4087-8725-3e0c54dbbfc0 tenant1-max_vifs-tis-centos-guest-image-2
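The command above repeats one `--nic` argument per interface: one `net-id` entry plus 15 `port-id` entries, for 16 interfaces in total. A minimal sketch of how such an argument list could be assembled is shown here. `build_nic_args` is a hypothetical helper, not part of the nova client, and the network and port IDs are placeholders rather than the ones from this report.

```python
def build_nic_args(net_id, port_ids):
    """Return the --nic arguments for one network plus pre-created ports.

    Hypothetical helper for illustration; it only formats strings in the
    same form as the command shown in this report.
    """
    args = ["--nic", "net-id={}".format(net_id)]
    for port_id in port_ids:
        args.extend(["--nic", "port-id={}".format(port_id)])
    return args


# One network plus 15 ports gives the 16 interfaces under test.
# These IDs are placeholders, not the ones from the report.
placeholder_ports = ["port-{:02d}".format(i) for i in range(15)]
nic_args = build_nic_args("mgmt-net-id", placeholder_ports)
nic_count = nic_args.count("--nic")
print(nic_count)
```

The resulting list could be appended to the rest of the boot command, which makes it easier to rerun the test with a smaller number of interfaces.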

Attempt to live migrate the VM, ex:

nova --os-username 'admin' --os-password 'Li69nux*' --os-project-name admin --os-auth-url http://keystone.openstack.svc.cluster.local/v3 --os-user-domain-name Default --os-project-domain-name Default --os-endpoint-type internalURL --os-region-name RegionOne live-migration 39e9064c-69bc-4e29-9411-8159dc82755f

Expected Behavior
------------------
VM is expected to successfully live migrate

Actual Behavior
----------------
VM did not migrate:

[sysadmin@controller-0 ~(keystone_admin)]$ nova list --all-tenants --fields OS-EXT-SRV-ATTR:host,name,status,OS-EXT-STS:task_state,networks
+--------------------------------------+-----------------------+-------------------------------------------+--------+------------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| ID | OS-EXT-SRV-ATTR: Host | Name | Status | OS-EXT-STS: Task State | Networks |
+--------------------------------------+-----------------------+-------------------------------------------+--------+------------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| 39e9064c-69bc-4e29-9411-8159dc82755f | compute-0 | tenant1-max_vifs-tis-centos-guest-image-2 | ACTIVE | None | tenant1-mgmt-net=192.168.146.70; tenant1-net1=172.16.1.179, 172.16.1.235, 172.16.1.175, 172.16.1.214, 172.16.1.196, 172.16.1.186, 172.16.1.253, 172.16.1.241, 172.16.1.180, 172.16.1.195, 172.16.1.129, 172.16.1.130, 172.16.1.207, 172.16.1.162, 172.16.1.174 |
+--------------------------------------+-----------------------+-------------------------------------------+--------+------------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+

[sysadmin@controller-0 ~(keystone_admin)]$ nova live-migration 39e9064c-69bc-4e29-9411-8159dc82755f

[sysadmin@controller-0 ~(keystone_admin)]$ nova list --all-tenants --fields OS-EXT-SRV-ATTR:host,name,status,OS-EXT-STS:task_state,networks
+--------------------------------------+-----------------------+-------------------------------------------+--------+------------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| ID | OS-EXT-SRV-ATTR: Host | Name | Status | OS-EXT-STS: Task State | Networks |
+--------------------------------------+-----------------------+-------------------------------------------+--------+------------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| 39e9064c-69bc-4e29-9411-8159dc82755f | compute-0 | tenant1-max_vifs-tis-centos-guest-image-2 | ACTIVE | None | tenant1-mgmt-net=192.168.146.70; tenant1-net1=172.16.1.179, 172.16.1.235, 172.16.1.175, 172.16.1.214, 172.16.1.196, 172.16.1.186, 172.16.1.253, 172.16.1.241, 172.16.1.180, 172.16.1.195, 172.16.1.129, 172.16.1.130, 172.16.1.207, 172.16.1.162, 172.16.1.174 |
+--------------------------------------+-----------------------+-------------------------------------------+--------+------------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+

Reproducibility
---------------
100% reproducible

System Configuration
--------------------
Standard system, ex: 2 controllers + 3 computes

Branch/Pull Time/Commit
-----------------------
[sysadmin@controller-0 ~(keystone_admin)]$ cat /etc/build.info
###
### StarlingX
### Built from master
###

OS="centos"
SW_VERSION="19.01"
BUILD_TARGET="Host Installer"
BUILD_TYPE="Formal"
BUILD_ID="20190720T013000Z"

JOB="STX_build_master_master"
<email address hidden>"
BUILD_NUMBER="186"
BUILD_HOST="starlingx_mirror"
BUILD_DATE="2019-07-20 01:30:00 +0000"

Last Pass
---------
20190713T013000Z

Timestamp/Logs
--------------
Attached

Test Activity
-------------
Networking regression
Wind River internal test name:
networking/test_interface_attach_detach.py::test_vm_with_max_vnics_attached_during_boot[tis-centos-guest-port_id-image]

Revision history for this message
Ghada Khalil (gkhalil) wrote :

@Chris, please attach the logs.
Does VM live migration consistently work with a smaller # of vnics? If so, how many?

tags: added: stx.regression
tags: added: stx.distro.openstack
Changed in starlingx:
status: New → Incomplete
Revision history for this message
Chris Winnicki (chriswinnicki) wrote :

To reconstruct the logs execute the following after downloading yow-cgcs-wildcat-92_98_ALL_NODES_20190724.151814.tar.part*:

cat yow-cgcs-wildcat-92_98_ALL_NODES_20190724.151814.tar.part* > yow-cgcs-wildcat-92_98_ALL_NODES_20190724.151814.tar

Ghada Khalil (gkhalil)
tags: added: stx.networking
description: updated
Revision history for this message
Chris Winnicki (chriswinnicki) wrote :

Likely this bug is similar and/or potentially the same as: https://bugs.launchpad.net/starlingx/+bug/1830915 and should be investigated in conjunction.

I performed another test where number of attached interfaces was dropped to 2 and live migration failed in the same way - with a generic error listed below:

{"log":"2019-07-24 20:06:08.975 1 ERROR oslo_messaging.rpc.server NoValidHost: No valid host was found. There are not enough hosts available.\n","stream":"stdout","time":"2019-07-24T20:06:08.976695508Z"}

* Note: Performing a cold migration on the same VM was successful
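The log lines in this report are wrapped in a JSON envelope with `log`, `stream` and `time` keys, which makes them hard to read in bulk. As a rough sketch, the inner messages logged at ERROR level could be pulled out as follows. `extract_errors` is a hypothetical helper name, and the sample line is the one quoted above.

```python
import json

# Log line copied from the comment above (JSON-wrapped container log).
raw_line = r'{"log":"2019-07-24 20:06:08.975 1 ERROR oslo_messaging.rpc.server NoValidHost: No valid host was found. There are not enough hosts available.\n","stream":"stdout","time":"2019-07-24T20:06:08.976695508Z"}'


def extract_errors(lines):
    """Yield the inner message of every JSON-wrapped line logged at ERROR."""
    for line in lines:
        line = line.strip()
        if not line:
            continue
        try:
            record = json.loads(line)
        except ValueError:
            continue  # skip lines that are not valid JSON
        message = record.get("log", "").rstrip("\n")
        if " ERROR " in message:
            yield message


errors = list(extract_errors([raw_line]))
print(errors[0])
```

The same function could be fed the lines of a log file from the collected tarball instead of the single sample line.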

Revision history for this message
Ghada Khalil (gkhalil) wrote :

Assigning to the distro.openstack PL for further action. I would consider this high priority given that this is consistently reproducible and live migration is a key VM operation.

tags: removed: stx.networking
Changed in starlingx:
assignee: nobody → yong hu (yhu6)
Revision history for this message
yong hu (yhu6) wrote :

Chris said "I performed another test where number of attached interfaces was dropped to 2 and live migration failed in the same way". If VM live-migration does work even with 2 interfaces - it should be a LP with HIGH importance.

@zhipeng, please help to reproduce this issue with 2 scenarios:
1. Create a VM with 2 interfaces and do a live-migration. If this is not working, confirm whether it is a duplicate of https://bugs.launchpad.net/starlingx/+bug/1830915.
2. If that works, create a VM with 16 interfaces, do a live-migration, and do further analysis.

Changed in starlingx:
assignee: yong hu (yhu6) → zhipeng liu (zhipengs)
Revision history for this message
yong hu (yhu6) wrote :

correct a typo: If VM live-migration does work even with 2 interfaces - it should be a LP with HIGH importance.
==>
If VM live-migration does NOT work even with 2 interfaces - it should be a LP with HIGH importance.

Revision history for this message
Ghada Khalil (gkhalil) wrote :

I agree that this should be set as high importance

Changed in starlingx:
importance: Undecided → High
status: Incomplete → Triaged
tags: added: stx.2.0
Revision history for this message
zhipeng liu (zhipengs) wrote :

This issue seems to have appeared after we rebased Nova to Stein.2 (7.15)

@Chris Winnicki Could you also share TIS_AUTOMATION.log?
What is the flavor info of the VM? Did you use dedicated CPU mode?

Meanwhile I will check the log after the download finishes.

Thanks
Zhipeng

Revision history for this message
zhipeng liu (zhipengs) wrote :

Hi all,
From the log, the failure is related to the CPU compatibility check during live migration.
In Nova, the CPU compatibility check is performed during live migration only for KVM guests.

{"log":"2019-07-23 19:26:39.464 70801 ERROR oslo_messaging.rpc.server File \"/var/lib/openstack/lib/python2.7/site-packages/nova/virt/libvirt/driver.py\", line 7187, in check_can_live_migrate_destination\n","stream":"stdout","time":"2019-07-23T19:26:39.465248101Z"}
{"log":"2019-07-23 19:26:39.464 70801 ERROR oslo_messaging.rpc.server self._compare_cpu(None, source_cpu_info, instance)\n","stream":"stdout","time":"2019-07-23T19:26:39.46525473Z"}
{"log":"2019-07-23 19:26:39.464 70801 ERROR oslo_messaging.rpc.server File \"/var/lib/openstack/lib/python2.7/site-packages/nova/virt/libvirt/driver.py\", line 7501, in _compare_cpu\n","stream":"stdout","time":"2019-07-23T19:26:39.465260018Z"}
{"log":"2019-07-23 19:26:39.464 70801 ERROR oslo_messaging.rpc.server raise exception.InvalidCPUInfo(reason=m % {'ret': ret, 'u': u})\n","stream":"stdout","time":"2019-07-23T19:26:39.465265978Z"}
{"log":"2019-07-23 19:26:39.464 70801 ERROR oslo_messaging.rpc.server InvalidCPUInfo: Unacceptable CPU info: CPU doesn't have compatibility.\n","stream":"stdout","time":"2019-07-23T19:26:39.46527134Z"}

=========================================================================================
I can also see that the CPU info is not exactly the same for compute-0 and compute-1.

compute1: Instance launched has CPU info: {\"vendor\": \"Intel\", \"model\": \"Broadwell\", \"arch\": \"x86_64\", \"features\": [\"pge\", \"avx\", \"xsaveopt\", \"clflush\", \"sep\", \"rtm\", \"tsc_adjust\", \"tsc-deadline\", \"dtes64\", \"invpcid\", \"tsc\", \"fsgsbase\", \"xsave\", \"smap\", \"vmx\", \"erms\", \"xtpr\", \"cmov\", \"hle\", \"smep\", \"pcid\", \"est\", \"pat\", \"monitor\", \"smx\", \"pbe\", \"lm\", \"msr\", \"adx\", \"3dnowprefetch\", \"nx\", \"fxsr\", \"syscall\", \"tm\", \"sse4.1\", \"pae\", \"sse4.2\", \"pclmuldq\", \"acpi\", \"fma\", \"vme\", \"popcnt\", \"mmx\", \"osxsave\", \"cx8\", \"mce\", \"de\", \"rdtscp\", \"ht\", \"dca\", \"lahf_lm\", \"abm\", \"rdseed\", \"pdcm\", \"mca\", \"pdpe1gb\", \"apic\", \"sse\", \"f16c\", \"pse\", \"ds\", \"invtsc\", \"pni\", \"tm2\", \"avx2\", \"aes\", \"sse2\", \"ss\", \"ds_cpl\", \"arat\", \"bmi1\", \"bmi2\", \"ssse3\", \"fpu\", \"cx16\", \"pse36\", \"mtrr\", \"movbe\", \"rdrand\", \"x2apic\"], \"topology\": {\"cores\": 22, \"cells\": 2, \"threads\": 2, \"sockets\": 1}}\n","stream":"stdout","time":"2019-07-23T21:14:37.137830364Z"}

compute0: Instance launched has CPU info: {\"vendor\": \"Intel\", \"model\": \"Haswell-noTSX-IBRS\", \"arch\": \"x86_64\", \"features\": [\"pge\", \"avx\", \"xsaveopt\", \"clflush\", \"sep\", \"syscall\", \"tsc_adjust\", \"tsc-deadline\", \"dtes64\", \"invpcid\", \"tsc\", \"fsgsbase\", \"xsave\", \"vmx\", \"erms\", \"xtpr\", \"cmov\", \"smep\", \"ssse3\", \"est\", \"pat\", \"monitor\", \"smx\", \"pbe\", \"lm\", \"msr\", \"nx\", \"fxsr\", \"tm\", \"sse4.1\", \"pae\", \"sse4.2\", \"pclmuldq\", \"cx16\", \"pcid\", \"fma\", \"vme\", \"popcnt\", \"mmx\", \"osxsave\", \"cx8\", \"mce\", \"de\", \"rdtscp\", \"ht\", \"dca\", \"lahf_lm\", \"abm\"...
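Comparing two long feature lists by eye is error-prone. A small sketch of a set-based comparison follows. The feature lists in it are abbreviated, illustrative placeholders and not the full data from the logs (the compute0 line is truncated in this report), so the result says nothing definite about the real hosts. `feature_diff` is a hypothetical helper name.

```python
def feature_diff(source_cpu, dest_cpu):
    """Return (features only on source, features only on destination)."""
    src = set(source_cpu["features"])
    dst = set(dest_cpu["features"])
    return sorted(src - dst), sorted(dst - src)


# Abbreviated, illustrative feature lists: NOT the full data from the logs.
compute1 = {
    "model": "Broadwell",
    "features": ["pge", "avx", "xsaveopt", "rtm", "hle", "smap"],
}
compute0 = {
    "model": "Haswell-noTSX-IBRS",
    "features": ["pge", "avx", "xsaveopt"],
}

only_on_compute1, only_on_compute0 = feature_diff(compute1, compute0)
print(only_on_compute1)
print(only_on_compute0)
```

With the full CPU info dictionaries from both hosts pasted in, the same call would list exactly which features differ.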


zhipeng liu (zhipengs)
Changed in starlingx:
status: Triaged → Confirmed
Ghada Khalil (gkhalil)
summary: - live-migrate failed for VM with maximum number (16) of interfaces
- attached
+ live-migrate failed for VM with various # of interfaces
Revision history for this message
yong hu (yhu6) wrote :

@zhipeng,
what's the conclusion here? Is "CPU compatibility check during live migration" a hard requirement from the latest Nova Stein? Or could it be optional?
Please do some research in upstream community.

Revision history for this message
zhipeng liu (zhipengs) wrote :

From code comment in Nova, it is a hard requirement just for KVM virt-type

# NOTE(kchamart): Comparing host to guest CPU model for emulated
# guests (<domain type='qemu'>) should not matter -- in this
# mode (QEMU "TCG") the CPU is fully emulated in software and no
# hardware acceleration, like KVM, is involved. So, skip the CPU
# compatibility check for the QEMU domain type, and retain it for
# KVM guests.

Zhipeng
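The rule described in the quoted comment can be restated as a one-line predicate. This is only an illustration of the rule as stated in this thread, not Nova's implementation, and `needs_cpu_comparison` is a hypothetical name.

```python
def needs_cpu_comparison(virt_type):
    """Sketch of the rule in the quoted comment: only KVM guests are checked.

    Illustrative only; this is not Nova's implementation.
    """
    return virt_type == "kvm"


results = {vt: needs_cpu_comparison(vt) for vt in ("kvm", "qemu")}
for virt_type, checked in results.items():
    print(virt_type, checked)
```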

Revision history for this message
Gerry Kopec (gerry-kopec) wrote :

Based on the comments above, this is an expected failure. A couple of weeks ago we set the nova config option libvirt/cpu_mode to host-model via https://review.opendev.org/#/c/669544/.

This means that on h/w labs (aka virt_type=kvm) libvirt will compare CPUs between the two hosts of a live migration to ensure they are closely matched. See nova docs for more info:
https://docs.openstack.org/nova/latest/admin/configuration/hypervisor-kvm.html#specify-the-cpu-model-of-kvm-guests

Our expectation is that most customers would have homogeneous hardware and, if not, they would organize their hosts into live-migratable groups via the nova host aggregates capability. So I would recommend that for the lab in this test, assuming 2 of the 3 computes are compatible.
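As a rough sketch of how hosts could be sorted into candidate live-migration groups before creating host aggregates, the snippet below groups hosts by reported CPU model. The models for compute-0 and compute-1 come from the log excerpts in this report; the entry for compute-2 is an assumption, since its CPU model is not stated here. An identical model name is only a first approximation, because the CPU comparison also takes features into account. `group_by_cpu_model` is a hypothetical helper name.

```python
from collections import defaultdict


def group_by_cpu_model(host_models):
    """Map each CPU model to the sorted list of hosts reporting it."""
    groups = defaultdict(list)
    for host, model in host_models.items():
        groups[model].append(host)
    return {model: sorted(hosts) for model, hosts in groups.items()}


host_models = {
    "compute-0": "Haswell-noTSX-IBRS",  # from the log excerpt in this report
    "compute-1": "Broadwell",           # from the log excerpt in this report
    "compute-2": "Broadwell",           # ASSUMPTION: not stated in this report
}
groups = group_by_cpu_model(host_models)
for model, hosts in sorted(groups.items()):
    print(model, hosts)
```

Each resulting group is a candidate for one host aggregate, so that live migration is only attempted between hosts with matching CPUs.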

Revision history for this message
yong hu (yhu6) wrote :

agreed in meeting, this is normal behavior, so the LP is invalid.

Changed in starlingx:
status: Confirmed → Invalid