Neutron VM could not get IP after cold migrate

Bug #1888255 reported by Yvonne Ding
This bug affects 1 person
Affects: StarlingX
Status: Invalid
Importance: High
Assigned to: Austin Sun

Bug Description

Brief Description
------------------
The VM fails to get an IP after cold migration.

Severity
--------
Major

Steps to Reproduce
------------------
1. Boot a VM with a mgmt interface
2. Attach vNICs to the VM with attach_arg and vif model
3. Bring up the interface from within the VM
4. Ping the VM over the mgmt and tenant networks
5. Perform VM actions: cold migrate, live migrate
6. Ping the VM over the mgmt and tenant networks after the VM actions (a hedged CLI sketch of these steps follows)
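
A hedged CLI sketch of the steps above (image, port, and interface names are assumptions; the exact attach_arg and vif model used by the test are not shown in this report):

# 1-2. boot a VM on the mgmt network, then attach an extra vNIC via a port
openstack server create --image tis-centos-guest --flavor flavor --network tenant1-mgmt-net --key-name keypair-tenant1 tenant1-if_attach_tenant-4
openstack port create --network tenant1-net4 extra-vnic-port
openstack server add port tenant1-if_attach_tenant-4 extra-vnic-port
# 3. inside the guest, bring the new interface up (interface name is an assumption)
#    sudo ip link set eth1 up && sudo dhclient eth1
# 4. ping the VM over the mgmt network from the NAT box
ping -c 3 192.168.121.87
# 5. cold migrate (confirm the resize) and live migrate
openstack server migrate d113fcac-298c-4b7d-bfa1-fbbc659f6f7b
openstack server resize confirm d113fcac-298c-4b7d-bfa1-fbbc659f6f7b
nova live-migration d113fcac-298c-4b7d-bfa1-fbbc659f6f7b
# 6. ping again over the mgmt and tenant networks
ping -c 3 192.168.121.87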

TC-name:
test_interface_attach_detach_vnics

Expected Behavior
-----------------
Ping from the NAT box to the VM succeeds

Actual Behavior
----------------
Ping from the NAT box to the VM fails

Reproducibility
---------------
reproducible

System Configuration
--------------------
Regular standard 2+2

Lab-name:
wcp_7_10

Branch/Pull Time/Commit
-----------------------
BUILD_ID="r/stx.4.0"

Timestamp/Logs
--------------
[2020-07-17 03:43:16,914] 314 DEBUG MainThread ssh.send :: Send 'openstack --os-username 'admin' --os-password 'Li69nux*' --os-project-name admin --os-auth-url http://keystone.openstack.svc.cluster.local/v3 --os-user-domain-name Default --os-project-domain-name Default --os-identity-api-version 3 --os-interface internal --os-region-name RegionOne server resize confirm d113fcac-298c-4b7d-bfa1-fbbc659f6f7b'

[2020-07-17 03:43:25,194] 314 DEBUG MainThread ssh.send :: Send 'openstack --os-username 'admin' --os-password 'Li69nux*' --os-project-name admin --os-auth-url http://keystone.openstack.svc.cluster.local/v3 --os-user-domain-name Default --os-project-domain-name Default --os-identity-api-version 3 --os-interface internal --os-region-name RegionOne server show d113fcac-298c-4b7d-bfa1-fbbc659f6f7b'
[2020-07-17 03:43:27,689] 436 DEBUG MainThread ssh.expect :: Output:
+-------------------------------------+--------------------------------------------------------------------------+
| Field | Value |
+-------------------------------------+--------------------------------------------------------------------------+
| OS-DCF:diskConfig | MANUAL |
| OS-EXT-AZ:availability_zone | nova |
| OS-EXT-SRV-ATTR:host | compute-0 |
| OS-EXT-SRV-ATTR:hypervisor_hostname | compute-0 |
| OS-EXT-SRV-ATTR:instance_name | instance-00000007 |
| OS-EXT-STS:power_state | Running |
| OS-EXT-STS:task_state | None |
| OS-EXT-STS:vm_state | active |
| OS-SRV-USG:launched_at | 2020-07-17T03:42:16.000000 |
| OS-SRV-USG:terminated_at | None |
| accessIPv4 | |
| accessIPv6 | |
| addresses | tenant1-mgmt-net=192.168.121.87; tenant1-net4=172.16.4.218, 172.16.4.137 |
| config_drive | |
| created | 2020-07-17T03:29:34Z |
| flavor | flavor (1fc1dafb-6035-4f4c-85a5-8761b45b4974) |
| hostId | 36bb26974f12ccd941319f8d785ba454af9a97ca68aa3b63d694e0b6 |
| id | d113fcac-298c-4b7d-bfa1-fbbc659f6f7b |
| image | |
| key_name | keypair-tenant1 |
| name | tenant1-if_attach_tenant-4 |
| progress | 0 |
| project_id | 23a6f438af8e493ab5b13501d5b08d4f |
| properties | |
| security_groups | name='default' |
| | name='default' |
| | name='default' |
| status | ACTIVE |
| updated | 2020-07-17T03:42:28Z |
| user_id | 9b6cab5b9973430d88a87baff9dfd42a |
| volumes_attached | id='5a50767a-7fcf-4745-ba9b-500871582ec5' |
+-------------------------------------+--------------------------------------------------------------------------+

[2020-07-17 03:43:30,511] 314 DEBUG MainThread ssh.send :: Send 'openstack --os-username 'admin' --os-password 'Li69nux*' --os-project-name admin --os-auth-url http://keystone.openstack.svc.cluster.local/v3 --os-user-domain-name Default --os-project-domain-name Default --os-identity-api-version 3 --os-interface internal --os-region-name RegionOne server list --a'
[2020-07-17 03:43:33,185] 436 DEBUG MainThread ssh.expect :: Output:
+--------------------------------------+----------------------------+--------+-----------------------------------------------------------------------------------------+-------+----------------------+
| ID | Name | Status | Networks | Image | Flavor |
+--------------------------------------+----------------------------+--------+-----------------------------------------------------------------------------------------+-------+----------------------+
| d113fcac-298c-4b7d-bfa1-fbbc659f6f7b | tenant1-if_attach_tenant-4 | ACTIVE | tenant1-mgmt-net=192.168.121.87; tenant1-net4=172.16.4.218, 172.16.4.137 | | flavor |
| 9fe23839-f230-4277-a26c-5fef46147413 | tenant1-base_vm-3 | ACTIVE | internal0-net0-1=10.1.1.152; tenant1-mgmt-net=192.168.121.92; tenant1-net4=172.16.4.198 | | flavor-default-size2 |
+--------------------------------------+----------------------------+--------+-----------------------------------------------------------------------------------------+-------+----------------------+
[sysadmin@controller-0 ~(keystone_admin)]$
[2020-07-17 03:43:33,185] 314 DEBUG MainThread ssh.send :: Send 'echo $?'
[2020-07-17 03:43:33,288] 436 DEBUG MainThread ssh.expect :: Output:
0
[sysadmin@controller-0 ~(keystone_admin)]$
[2020-07-17 03:43:33,288] 1714 DEBUG MainThread network_helper._get_net_ips_for_vms:: targeted ips for vm: ['192.168.121.87']
[2020-07-17 03:43:33,288] 1726 INFO MainThread network_helper._get_net_ips_for_vms:: IPs dict: {'d113fcac-298c-4b7d-bfa1-fbbc659f6f7b': ['192.168.121.87']}
[2020-07-17 03:43:33,289] 2588 INFO MainThread network_helper.ping_server:: Ping 192.168.121.87 from host 128.224.186.181
[2020-07-17 03:43:33,289] 479 DEBUG MainThread ssh.exec_cmd:: Executing command...
[2020-07-17 03:43:33,289] 314 DEBUG MainThread ssh.send :: Send 'ping -c 3 192.168.121.87'
[2020-07-17 03:43:45,392] 436 DEBUG MainThread ssh.expect :: Output:
PING 192.168.121.87 (192.168.121.87) 56(84) bytes of data.

--- 192.168.121.87 ping statistics ---
3 packets transmitted, 0 received, 100% packet loss, time 1999ms

[2020-07-17 03:43:45,494] 2626 INFO MainThread network_helper.ping_server:: retry in 3 seconds
[2020-07-17 03:43:48,498] 2630 WARNING MainThread network_helper.ping_server:: Ping from 128.224.186.181 to 192.168.121.87 failed.

The collect .tar and automation logs are available below:
https://files.starlingx.kube.cengn.ca/launchpad/1888171

Test Activity
-------------

Yvonne Ding (yding)
description: updated
Revision history for this message
Frank Miller (sensfan22) wrote :

TC fails and may be related to recent upversion to Ussuri. Marking as stx.4.0 gating for now as VMs are expected to be reachable after migrations. Assigning to distro.openstack PL to determine how to proceed.

Changed in starlingx:
status: New → Triaged
importance: Undecided → High
assignee: nobody → yong hu (yhu6)
tags: added: stx.4.0 stx.distro.openstack
Austin Sun (sunausti)
Changed in starlingx:
assignee: yong hu (yhu6) → Austin Sun (sunausti)
Revision history for this message
Austin Sun (sunausti) wrote :

From TIS_AUTOMATION.log
[2020-07-17 03:30:25,103] boot tenant1-if_attach_tenant-4
[2020-07-17 03:31:24,176] running tenant1-if_attach_tenant-4 (instance-00000007,d113fcac-298c-4b7d-bfa1-fbbc659f6f7b, compute-0, 192.168.121.87)
[2020-07-17 03:31:49,920] reboot tenant1-if_attach_tenant-4
[2020-07-17 03:32:06,814] tenant1-if_attach_tenant-4 running
[2020-07-17 03:32:43,299] live-migration tenant1-if_attach_tenant-4 trigger
[2020-07-17 03:33:16,747] migration tenant1-if_attach_tenant-4 running on compute-1
[2020-07-17 03:33:34,555] ping tenant1-if_attach_tenant-4(192.168.121.87) test failure.
[2020-07-17 03:37:09,007] ping tenant1-if_attach_tenant-4(192.168.121.87) test still failure.
[2020-07-17 03:37:20,252] ping tenant1-if_attach_tenant-4(192.168.121.87) successful
[2020-07-17 03:38:10,227] ssh access tenant1-if_attach_tenant-4
[2020-07-17 03:38:15,071] reboot tenant1-if_attach_tenant-4
[2020-07-17 03:38:30,622] reboot successfully
[2020-07-17 03:39:17,249] reboot tenant1-if_attach_tenant-4
[2020-07-17 03:39:31,889] reboot successfully
[2020-07-17 03:41:35,368] reboot tenant1-if_attach_tenant-4
[2020-07-17 03:41:50,882] reboot successfully
[2020-07-17 03:41:53,830] Ping unsuccessful from vm
[2020-07-17 03:42:11,872] Ping successful
[2020-07-17 03:42:27,103] Attempt to connect to 192.168.121.92 from 128.224.186.181 (tenant1-base_vm-3 192.168.121.92)
[2020-07-17 03:42:30,443] Ping 172.16.4.218 from host 192.168.121.92
[2020-07-17 03:42:34,655] Ping 172.16.4.137(tenant1-if_attach_tenant-4) from host 192.168.121.92 Successfully
[2020-07-17 03:42:44,274] Cold migrating tenant1-if_attach_tenant-4
[2020-07-17 03:43:16,808] migration tenant1-if_attach_tenant-4 running on compute-0 again
[2020-07-17 03:43:16,808] Test Step 13
[2020-07-17 03:43:48,498] Ping from 128.224.186.181 to 192.168.121.87 failed.
[2020-07-17 03:49:34,782] Still Ping from 128.224.186.181 to 192.168.121.87 failed.
[2020-07-17 03:49:34,993] Start VM get logs
                          From the console log, the guest bound 172.16.4.137 and 172.16.4.218, but no other IP was bound.
                          Those two 172.16 addresses are from tenant1-net4;
                          eth0 did not get an IP.
in nova-compute log
2020-07-17T03:42:16.335614308Z stdout F 2020-07-17 03:42:16.334 138871 INFO nova.compute.manager [req-f6804da3-81cd-414f-8c4e-6d31dfaeb3d8 - - - - -] [instance: d113fcac-298c-4b7d-bfa1-fbbc659f6f7b] During sync_power_state the instance has a pending task (resize_finish). Skip.
2020-07-17T03:48:44.033473338Z stdout F 2020-07-17 03:48:44.032 138871 INFO nova.compute.manager [req-a6beaa6d-2cbb-42ac-8421-ea2a23cb8760 1f9715bfc7f24da69562bd29247e3d8c 3c468d05132e441aa08e163828ddb639 - default default] [instance: d113fcac-298c-4b7d-bfa1-fbbc659f6f7b] Get console output

2020-07-17T03:42:12.007448839Z stdout F 2020-07-17 03:42:12.006 138871 WARNING nova.compute.manager [req-f7860e90-1843-466e-ac7f-2456b1d1f1ff ec73d69ea8cb43ab8444b5e060c238a5 2ee5cf89fcd844cfbdd945cb5d52cd27 - 6b57fd7684d743ffbef8147fa5cd85ff 6b57fd7684d743ffbef8147fa5cd85ff] [instance: d113fcac-298c-4b7d-bfa1-fbbc659f6f7b] Received unexpected event network-vif-plugged-ced68e69-f8d3-4599-a3b6-1355a99c096b for...
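
For reference, a hedged sketch of guest-side checks matching what the console log shows (interface and gateway are assumptions; run inside the VM):

ip addr show              # only the 172.16.4.x addresses from tenant1-net4 are bound; eth0 has no IP
ip route                  # check whether a default route exists at all
sudo dhclient -v eth0     # retry DHCP on the mgmt interface and watch for DHCPOFFER/ACK
ping -c 3 <mgmt-gateway>  # gateway address is an assumption; tests reachability from the guest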


Revision history for this message
Yvonne Ding (yding) wrote :

Two other nova test cases can reproduce the same issue.

testcases/functional/nova/test_migrate_vms.py::test_migrate_vm_various_guest[win_2012-3-1024-dedicated-image]
testcases/functional/nova/test_migrate_vms.py::test_migrate_vm_various_guest[win_2016-4-4096-dedicated-volume]

Nova logs (.tar):
https://files.starlingx.kube.cengn.ca/launchpad/1888255

Revision history for this message
Yvonne Ding (yding) wrote :

Steps to Reproduce
------------------
1. Boot a VM with a mgmt interface
2. Attach vNICs to the VM with attach_arg and vif model
3. Bring up the interface from within the VM
4. Ping the VM over the mgmt and tenant networks
5. Perform VM actions: cold migrate, live migrate
6. Ping the VM over the mgmt and tenant networks after the VM actions

1. Create a flavor (3 vCPUs, per the command below)
openstack flavor create --disk=13 --ram=1024 --vcpus=3 migrate
2. Add extra specs: {'hw:cpu_policy': 'dedicated'}
openstack flavor set --property hw:cpu_policy=dedicated migrate
3. Boot a win_2012 VM from image with the above flavor
nova boot
4. Live migrate the win_2012 VM
nova live-migration
5. Ping the VM from the NAT box after the live migration
6. Cold migrate the VM and check that it moved to a different host
nova migrate --poll
7. Ping the VM from the NAT box after the cold migration (a fuller hedged sketch follows these steps)
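
A hedged, fuller version of these commands (server name, image name, and network UUID are placeholders; the bare commands above omit their arguments):

openstack flavor create --disk=13 --ram=1024 --vcpus=3 migrate
openstack flavor set --property hw:cpu_policy=dedicated migrate
# boot the guest from a Glance image on the mgmt network (names are assumptions)
nova boot --flavor migrate --image win_2012 --nic net-id=<mgmt-net-uuid> win2012-vm
# live migrate, then verify connectivity from the NAT box
nova live-migration win2012-vm
ping -c 3 <vm-mgmt-ip>
# cold migrate, confirm the resize, and verify connectivity again
nova migrate --poll win2012-vm
nova resize-confirm win2012-vm
ping -c 3 <vm-mgmt-ip>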

[2020-07-24 05:50:47,589] 314 DEBUG MainThread ssh.send :: Send 'ping -c 3 192.168.121.10'
[2020-07-24 05:50:59,658] 436 DEBUG MainThread ssh.expect :: Output:
PING 192.168.121.10 (192.168.121.10) 56(84) bytes of data.

--- 192.168.121.10 ping statistics ---
3 packets transmitted, 0 received, 100% packet loss, time 2015ms

svc-cgcsauto@tis-lab-nat-box:~$
[2020-07-24 05:50:59,659] 314 DEBUG MainThread ssh.send :: Send 'echo $?'
[2020-07-24 05:50:59,711] 436 DEBUG MainThread ssh.expect :: Output:
1
svc-cgcsauto@tis-lab-nat-box:~$
[2020-07-24 05:50:59,712] 2626 INFO MainThread network_helper.ping_server:: retry in 3 seconds
[2020-07-24 05:51:02,715] 2630 WARNING MainThread network_helper.ping_server:: Ping from 128.224.186.181 to 192.168.121.10 failed.

Revision history for this message
Yvonne Ding (yding) wrote :

Please note there are two sets of nova logs under https://files.starlingx.kube.cengn.ca/launchpad/1888255
1) Original logs of the nova regression, based on build "20200710T214745Z"
   ALL_NODES_20200715.132742.tar & TIS_AUTOMATION.log

2) Latest logs of the nova retest, based on build "20200717T143848Z"
   ALL_NODES_20200724.221917.tar & TIS_AUTOMATION_20200724.log

Revision history for this message
Austin Sun (sunausti) wrote :

Hi Yvonne:
The new logs for ALL_NODES_20200724.221917 and TIS_AUTOMATION_20200724.log were overwritten.

TIS_AUTOMATION_20200724.log starts at [2020-07-24 21:29:23,577], but the issue was reported at [2020-07-24 05:50:47,589]. Would you please share the whole test case log?

ALL_NODES_20200724.221917:
nova-compute-compute-0 log starts at 2020-07-24T19:00:43
nova-compute-compute-1 log starts at 2020-07-24T15:51:35

So these logs cannot help with the analysis.

Thanks.

Revision history for this message
Austin Sun (sunausti) wrote :

Hi Yvonne:
   1) Could you let me know the reproducibility?
      For cold migration, I tried several times in my setup, but all the cold migrations were successful.
   2) Is it possible to stop the test when the cold migration fails and keep the environment, or to test this case manually?

Thanks.
BR
Austin Sun.

Revision history for this message
Yvonne Ding (yding) wrote :

@Austin Sun,

The issue was reproduced a couple of times, so the reported timestamp may differ from the .tar. Please ignore the timestamp in this case; both sets can be used for debugging.

The whole test case log can be found in the attached TIS_AUTOMATION.log.

Revision history for this message
Yvonne Ding (yding) wrote :

@Austin Sun,

Responding to the two concerns above:
1) The reproduction steps are in the description. Details of the commands and parameters can be found in the attached TIS_AUTOMATION.log.

Having a closer look at the log attached to the LP, isn't the cold migration itself successful, with the failure being the ping from the NAT box to the VM on the compute node?

2) We can stop at any step we want.

Thanks,

Revision history for this message
Austin Sun (sunausti) wrote :

Thanks Yvonne:
  For item 1), my statement caused some confusion. I am not asking how to reproduce this issue; I mean, is it 100% reproducible or random, and with what probability?

  For item 2), thanks. I looked at the log from 7/24. After the cold migration finished:
  1) Would you add a script to collect VM info the way live migration does (virsh dumpxml to collect the VM info)? A hedged sketch follows this comment.
  2) Once the issue reproduces, could you access the VM, check the network card info, and check whether the VM can ping the gateway?

Thanks.
BR
Austin Sun.
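
A hedged sketch of the checks requested above (instance name and gateway are assumptions; on a containerized StarlingX deployment virsh may need to be run inside the libvirt container):

# on the compute host: dump the libvirt domain XML to capture the VM's interface wiring after cold migration
sudo virsh list --all
sudo virsh dumpxml instance-00000007 > /tmp/instance-00000007.xml
# inside the guest: check NIC state and whether the gateway is reachable
ip addr show
ip route
ping -c 3 <mgmt-gateway>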

Revision history for this message
Yvonne Ding (yding) wrote :

@Austin

1) Please go through the description carefully before raising questions. The reproducibility is described as below:

Reproducibility
---------------
reproducible

2) The issue was found via automation regression. Have you reproduced it manually in your environment?
Did you add the script on your side? What is blocking you from reproducing it manually?

If you've already tried but can't reproduce the issue, please let me know.

Thanks,

Revision history for this message
Yvonne Ding (yding) wrote :

To speed up the progress, I can be reached at <email address hidden>

Revision history for this message
Austin Sun (sunausti) wrote :

Hi Yvonne:
  Thanks for your response.
   For item 1: is it 100% reproducible, or 80%, or 50%? Do you have an estimated number?
   For item 2: I tried 100 iterations of VM live-migration -> ping VM -> cold-migration -> ping VM, but could not reproduce it. The VM image and network setup are not the same as yours because of HW limitations.

Thanks.
BR
Austin Sun.

Revision history for this message
Yvonne Ding (yding) wrote :

@Austin Sun,

1) It is 100% reproducible.
2) An extra lab_setup.group0.log is attached.

Revision history for this message
Yvonne Ding (yding) wrote :

The test image tis-centos-guest.img can be found here in case you need it:
https://drive.google.com/drive/folders/1a4lHQ_op7qJ2ohXhsOYae-tJlOKC4dEi

Revision history for this message
Austin Sun (sunausti) wrote :

Thanks Yvonne for the info.
It is really bad that this is 100% reproducible. :(
I will check and see if I can find any clue.

As for the image, it seems to be empty in Google Drive. Would you check again?

Thanks.
BR
Austin Sun

Revision history for this message
Yvonne Ding (yding) wrote :

Could you please check the image in Google Drive again?

Revision history for this message
Austin Sun (sunausti) wrote :

Thanks. downloading...

Revision history for this message
yong hu (yhu6) wrote :

Suggest continuing to address this issue in stx.4.0 maintenance releases.

Revision history for this message
Austin Sun (sunausti) wrote :

Today I successfully deployed a setup with DPDK enabled, but due to HW limitations only a duplex configuration was set up, and this issue still cannot be reproduced.

This issue is very similar to https://bugs.launchpad.net/starlingx/+bug/1820378. I will check further whether a lack of resources is preventing the VM from getting an IP.

Revision history for this message
Austin Sun (sunausti) wrote :

After joint debugging with Yvonne, the issue goes away once the test case timeout is set to 1000s, so it appears to be related to the setup and resource limitations, which make VM migration and boot-up slow.
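
A minimal sketch of what the longer timeout amounts to (the 1000s value comes from the comment above; the target IP and poll interval are assumptions), waiting for the VM to answer pings after the cold migration instead of failing on the first attempts:

timeout=1000; interval=10; elapsed=0
until ping -c 3 192.168.121.87 > /dev/null 2>&1; do
  elapsed=$((elapsed + interval))
  if [ "$elapsed" -ge "$timeout" ]; then echo "VM still unreachable after ${timeout}s"; exit 1; fi
  sleep "$interval"
done
echo "VM reachable after ~${elapsed}s"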

Changed in starlingx:
status: Triaged → Invalid