ping to vm fails in live migration test

Bug #1841189 reported by Wendy Mitchell on 2019-08-23
This bug affects 1 person
Affects: StarlingX
Importance: High
Assigned to: Gongjun Song

Bug Description

Brief Description
-----------------
Ping from the NatBox to the instance fails in live migration tests

Severity
--------
Major

Steps to Reproduce
------------------
The testcase performs the following operations

a. create a bootable volume
--bootable --image=fd57e28e-6ea4-4380-ba0d-32d0f787c152 --size=2 vol-tenant1-volume-stress-1-1
b. boot a single instance from the volume with flavor (hw:mem_page_size=large)

--flavor=b8f20ebe-4c60-4d9d-81e7-f30332e01049 --boot-volume=753449a4-494b-42f5-aa19-768f8fc0d60e --key-name=keypair-tenant1 --poll --nic net-id=dcfc9285-916f-41b5-b5d4-cc9dcdc1bd99 --nic net-id=30fed34d-e58d-47b7-902f-7566ccc264a9 tenant1-volume-stress-1
c. live migrate the instance many times.
The instance live migrates between compute-0, compute-1 and compute-2 in this test

The last live migration at ~ [2019-08-22 19:30:20,631] for this instance was from compute-2 to compute-1

698586e7-0ff0-4f52-a9f5-76ff1693539c

(Note: compute-0 was rebooted around this time.)

d. After the live migration to compute-1, ping failures were observed (i.e. cannot ping from the NatBox to the VM)
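
A minimal CLI equivalent of steps a-c (a sketch only; the IDs in angle brackets are placeholders for the ones above, and the tenant credentials are assumed to be sourced):

$ openstack volume create --bootable --image <image-id> --size 2 vol-tenant1-volume-stress-1-1
$ nova boot --flavor <flavor-id> --boot-volume <volume-id> --key-name keypair-tenant1 --poll --nic net-id=<mgmt-net-id> --nic net-id=<data-net-id> tenant1-volume-stress-1
$ # per iteration: live migrate, then ping from the NatBox once the VM is ACTIVE again
$ nova live-migration tenant1-volume-stress-1
$ ping -c 3 <vm-mgmt-ip>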

Expected Behavior
------------------
Did not expect ping failure from the NatBox to the VM
(and did not expect lock/unlock of compute-0 to cause CPU usage on compute-2 to reach a critical level)

Actual Behavior
----------------
Ping failures

Reproducibility
---------------
Reproducible

System Configuration
--------------------
Lab: IP_20-27
Load: 2019-08-20_20-59-00
testcases/stress/nova/test_migrate.py::test_migrate_stress[volume-1000]

Last Pass
---------

Timestamp/Logs
--------------
-----------------[2019-08-22 19:33:41]-----------------
==================Ping from NatBox to vm 698586e7-0ff0-4f52-a9f5-76ff1693539c failed for 200 seconds.===============

-----------------[2019-08-22 19:33:44]-----------------

#### Console log for vm 698586e7-0ff0-4f52-a9f5-76ff1693539c ####
[ 4248.518530] chronyd[520]: Can't synchronise: no selectable sources

-----------------[2019-08-22 19:33:59]-----------------
#### Ping router interfaces succeeded ####
{'192.168.104.1': 0, '192.168.104.65': 0, '192.168.104.33': 0}

Nova-compute.log compute-2
{"log":"2019-08-22 19:30:18.743 861834 INFO nova.compute.manager [-] [instance: 698586e7-0ff0-4f52-a9f5-76ff1693539c] Migrating instance to compute-1 finished successfully.\n","stream":"stdout","time":"2019-08-22T19:30:18.743651362Z"}
{"log":"2019-08-22 19:30:18.743 861834 INFO nova.compute.manager [-] [instance: 698586e7-0ff0-4f52-a9f5-76ff1693539c] Migrating instance to compute-1 finished successfully.\n","stream":"stdout","time":"2019-08-22T19:30:18.743877833Z"}

Nova-compute.log compute-1
{"log":"2019-08-22 19:30:09.452 852129 INFO nova.compute.manager [req-3675697d-dfd7-497a-8526-c3037f843654 - - - - -] [instance: 698586e7-0ff0-4f52-a9f5-76ff1693539c] During the sync_power process the instance has moved from host compute-2 to host compute-1\n","stream":"stdout","time":"2019-08-22T19:30:09.453965979Z"}
{"log":"2019-08-22 19:30:09.452 852129 INFO nova.compute.manager [req-3675697d-dfd7-497a-8526-c3037f843654 - - - - -] [instance: 698586e7-0ff0-4f52-a9f5-76ff1693539c] During the sync_power process the instance has moved from host compute-2 to host compute-1\n","stream":"stdout","time":"2019-08-22T19:30:09.453994652Z"}
{"log":"2019-08-22 19:30:16.162 852129 INFO nova.compute.manager [req-85e0e1b9-d6e4-482f-a401-24256e2113fc 59cf4be13765408298e8805246364b99 5e79f82bd4244daca343a86d2dbe8408 - default default] [instance: 698586e7-0ff0-4f52-a9f5-76ff1693539c] Post operation of migration started\n","stream":"stdout","time":"2019-08-22T19:30:16.163123222Z"}
{"log":"2019-08-22 19:30:16.162 852129 INFO nova.compute.manager [req-85e0e1b9-d6e4-482f-a401-24256e2113fc 59cf4be13765408298e8805246364b99 5e79f82bd4244daca343a86d2dbe8408 - default default] [instance: 698586e7-0ff0-4f52-a9f5-76ff1693539c] Post operation of migration started\n","stream":"stdout","time":"2019-08-22T19:30:16.163768603Z"}
{"log":"2019-08-22 19:30:20.732 852129 WARNING nova.compute.manager [req-ef961e8d-8e94-498e-851e-5aff2a24085f e4e954886a0a46b6a688f091633d08d1 305a436d092647d8815325402f17d3a8 - 4a3ccc82d5b046c4b9d903daa39752db 4a3ccc82d5b046c4b9d903daa39752db] [instance: 698586e7-0ff0-4f52-a9f5-76ff1693539c] Received unexpected event network-vif-plugged-17154804-f1c6-45d9-9773-835fdd85baf8 for instance with vm_state active and task_state None.\n","stream":"stdout","time":"2019-08-22T19:30:20.73334443Z"}
{"log":"2019-08-22 19:30:20.732 852129 WARNING nova.compute.manager [req-ef961e8d-8e94-498e-851e-5aff2a24085f e4e954886a0a46b6a688f091633d08d1 305a436d092647d8815325402f17d3a8 - 4a3ccc82d5b046c4b9d903daa39752db 4a3ccc82d5b046c4b9d903daa39752db] [instance: 698586e7-0ff0-4f52-a9f5-76ff1693539c] Received unexpected event network-vif-plugged-17154804-f1c6-45d9-9773-835fdd85baf8 for instance with vm_state active and task_state None.\n","stream":"stdout","time":"2019-08-22T19:30:20.733373191Z"}
{"log":"2019-08-22 19:30:22.809 852129 WARNING nova.compute.manager [req-2eefca8b-6913-4a6b-8b82-42378126fb94 e4e954886a0a46b6a688f091633d08d1 305a436d092647d8815325402f17d3a8 - 4a3ccc82d5b046c4b9d903daa39752db 4a3ccc82d5b046c4b9d903daa39752db] [instance: 698586e7-0ff0-4f52-a9f5-76ff1693539c] Received unexpected event network-vif-plugged-f2cc70c4-2c6c-403e-9c6f-ad288c3d7128 for instance with vm_state active and task_state None.\n","stream":"stdout","time":"2019-08-22T19:30:22.81447433Z"}
{"log":"2019-08-22 19:30:22.809 852129 WARNING nova.compute.manager [req-2eefca8b-6913-4a6b-8b82-42378126fb94 e4e954886a0a46b6a688f091633d08d1 305a436d092647d8815325402f17d3a8 - 4a3ccc82d5b046c4b9d903daa39752db 4a3ccc82d5b046c4b9d903daa39752db] [instance: 698586e7-0ff0-4f52-a9f5-76ff1693539c] Received unexpected event network-vif-plugged-f2cc70c4-2c6c-403e-9c6f-ad288c3d7128 for instance with vm_state active and task_state None.\n","stream":"stdout","time":"2019-08-22T19:30:22.814500403Z"}
{"log":"2019-08-22 19:33:43.812 852129 INFO nova.compute.manager [req-43d7e446-9cce-45c3-ab61-91a29a06aea2 59cf4be13765408298e8805246364b99 5e79f82bd4244daca343a86d2dbe8408 - default default] [instance: 698586e7-0ff0-4f52-a9f5-76ff1693539c] Get console output\n","stream":"stdout","time":"2019-08-22T19:33:43.844570027Z"}
{"log":"2019-08-22 19:33:43.812 852129 INFO nova.compute.manager [req-43d7e446-9cce-45c3-ab61-91a29a06aea2 59cf4be13765408298e8805246364b99 5e79f82bd4244daca343a86d2dbe8408 - default default] [instance: 698586e7-0ff0-4f52-a9f5-76ff1693539c] Get console output\n","stream":"stdout","time":"2019-08-22T19:33:43.844602Z"}
{"log":"2019-08-22 19:37:25.632 852129 WARNING nova.virt.libvirt.imagecache [req-2322e957-1704-437c-bab2-45b9ac21426b - - - - -] Unknown base file: /var/lib/nova/instances/_base/10bae7f8102b3e457aba60d89da898ea5b019fd6\n","stream":"stdout","time":"2019-08-22T19:37:25.647764906Z"}
{"log":"2019-08-22 19:37:25.632 852129 WARNING nova.virt.libvirt.imagecache [req-2322e957-1704-437c-bab2-45b9ac21426b - - - - -] Unknown base file: /var/lib/nova/instances/_base/10bae7f8102b3e457aba60d89da898ea5b019fd6\n","stream":"stdout","time":"2019-08-22T19:37:25.647800983Z"}
{"log":"2019-08-22 19:37:25.632 852129 INFO nova.virt.libvirt.imagecache [req-2322e957-1704-437c-bab2-45b9ac21426b - - - - -] Removable base files: /var/lib/nova/instances/_base/10bae7f8102b3e457aba60d89da898ea5b019fd6\n","stream":"stdout","time":"2019-08-22T19:37:25.647818282Z"}
{"log":"2019-08-22 19:37:25.632 852129 INFO nova.virt.libvirt.imagecache [req-2322e957-1704-437c-bab2-45b9ac21426b - - - - -] Removable base files: /var/lib/nova/instances/_base/10bae7f8102b3e457aba60d89da898ea5b019fd6\n","stream":"stdout","time":"2019-08-22T19:37:25.647825746Z"}
{"log":"2019-08-22 19:37:25.634 852129 INFO nova.virt.libvirt.imagecache [req-2322e957-1704-437c-bab2-45b9ac21426b - - - - -] Removing base or swap file: /var/lib/nova/instances/_base/10bae7f8102b3e457aba60d89da898ea5b019fd6\n","stream":"stdout","time":"2019-08-22T19:37:25.647843155Z"}
{"log":"2019-08-22 19:37:25.634 852129 INFO nova.virt.libvirt.imagecache [req-2322e957-1704-437c-bab2-45b9ac21426b - - - - -] Removing base or swap file: /var/lib/nova/instances/_base/10bae7f8102b3e457aba60d89da898ea5b019fd6\n","stream":"stdout","time":"2019-08-22T19:37:25.647849756Z"}

2019-08-22T19:46:00 clear 100.101 Platform CPU threshold exceeded ; threshold 95.00%, actual 99.32% host=compute-2 critical
 2019-08-22T19:44:00 set 100.101 Platform CPU threshold exceeded ; threshold 95.00%, actual 99.32% host=compute-2 critical
 2019-08-22T19:41:17 clear 100.101 Platform CPU threshold exceeded ; threshold 95.00%, actual 99.93% host=compute-0 critical
 2019-08-22T19:40:17 log 275.001 Host compute-0 hypervisor is now unlocked-enabled host=compute-0.hypervisor=5743f8ea-7e02-4c9d-8ddd-d66bfa4ab79a critical
 2019-08-22T19:39:40 clear 200.006 compute-0 is degraded due to the failure of its 'pci-irq-affinity-agent' process. Auto recovery of this major process is in progress. host=compute-0.process=pci-irq-affinity-agent major
 2019-08-22T19:39:17 set 100.101 Platform CPU threshold exceeded ; threshold 95.00%, actual 99.93% host=compute-0 critical
 2019-08-22T19:38:04 log 275.001 Host compute-0 hypervisor is now unlocked-disabled host=compute-0.hypervisor=5743f8ea-7e02-4c9d-8ddd-d66bfa4ab79a critical
 2019-08-22T19:37:50 clear 200.004 compute-0 experienced a service-affecting failure. Auto-recovery in progress. Manual Lock and Unlock may be required if auto-recovery is unsuccessful. host=compute-0 critical
 2019-08-22T19:37:39 log 200.022 compute-0 is now 'enabled' host=compute-0.state=enabled not-applicable
 2019-08-22T19:37:37 set 200.006 compute-0 is degraded due to the failure of its 'pci-irq-affinity-agent' process. Auto recovery of this major process is in progress. host=compute-0.process=pci-irq-affinity-agent major
 2019-08-22T19:37:28 clear 200.009 compute-0 experienced a persistent critical 'Cluster-host Network' communication failure. host=compute-0.network=Cluster-host critical
 2019-08-22T19:37:28 clear 200.005 compute-0 experienced a persistent critical 'Management Network' communication failure. host=compute-0.network=Management critical
 2019-08-22T19:30:17 clear 700.008 Instance tenant1-volume-stress-1 owned by tenant1 is live migrating from host compute-2 tenant=71e4950f-b29c-40d8-ac13-742b4990dfad.instance=698586e7-0ff0-4f52-a9f5-76ff1693539c warning
 2019-08-22T19:30:17 log 700.156 Live-Migrate complete for instance tenant1-volume-stress-1 now enabled on host compute-1 tenant=71e4950f-b29c-40d8-ac13-742b4990dfad.instance=698586e7-0ff0-4f52-a9f5-76ff1693539c critical
 2019-08-22T19:30:16 log 275.001 Host compute-0 hypervisor is now locked-disabled host=compute-0.hypervisor=5743f8ea-7e02-4c9d-8ddd-d66bfa4ab79a critical
 2019-08-22T19:30:16 set 200.004 compute-0 experienced a service-affecting failure. Auto-recovery in progress. Manual Lock and Unlock may be required if auto-recovery is unsuccessful. host=compute-0 critical
 2019-08-22T19:30:16 log 200.022 compute-0 is 'disabled-failed' to the system host=compute-0.status=failed not-applicable
 2019-08-22T19:30:16 log 200.022 compute-0 is now 'offline' host=compute-0.status=offline not-applicable

Ironpass 20-27
Ping from NatBox to VM unsuccessful from [2019-08-22 19:28:19,881]

Test Activity
-------------
Regression

Wendy Mitchell (wmitchellwr) wrote :

*wording correction

Expected Behavior
------------------
Did not expect ping failure from the NatBox to the VM
(and did not expect reboot of compute-0 to cause CPU usage on compute-2 to reach a critical level)

Wendy Mitchell (wmitchellwr) wrote :

Note: Critical CPU threshold levels are exceeded on compute-2 each time this instance (instance-00000023) is live migrated from compute-2

see fm-event.log
2019-08-22T18:57:25.000 controller-0 fmManager: info { "event_log_id" : "700.008", "reason_text" : "Instance tenant1-volume-stress-1 owned by tenant1 is live migrating from host compute-2", "entity_instance_id" : "region=RegionOne.system=yow-cgcs-ironpass-20-27.tenant=71e4950f-b29c-40d8-ac13-742b4990dfad.instance=698586e7-0ff0-4f52-a9f5-76ff1693539c", "severity" : "warning", "state" : "set", "timestamp" : "2019-08-22 18:57:25.150504" }
2019-08-22T18:57:25.000 controller-0 fmManager: info { "event_log_id" : "700.152", "reason_text" : "Live-Migrate inprogress for instance tenant1-volume-stress-1 from host compute-2", "entity_instance_id" : "tenant=71e4950f-b29c-40d8-ac13-742b4990dfad.instance=698586e7-0ff0-4f52-a9f5-76ff1693539c", "severity" : "critical", "state" : "msg", "timestamp" : "2019-08-22 18:57:25.150691" }
2019-08-22T18:57:58.000 controller-0 fmManager: info { "event_log_id" : "700.156", "reason_text" : "Live-Migrate complete for instance tenant1-volume-stress-1 now enabled on host compute-1", "entity_instance_id" : "tenant=71e4950f-b29c-40d8-ac13-742b4990dfad.instance=698586e7-0ff0-4f52-a9f5-76ff1693539c", "severity" : "critical", "state" : "msg", "timestamp" : "2019-08-22 18:57:58.511765" }
2019-08-22T18:57:58.000 controller-0 fmManager: info { "event_log_id" : "700.008", "reason_text" : "Instance tenant1-volume-stress-1 owned by tenant1 is live migrating from host compute-2", "entity_instance_id" : "region=RegionOne.system=yow-cgcs-ironpass-20-27.tenant=71e4950f-b29c-40d8-ac13-742b4990dfad.instance=698586e7-0ff0-4f52-a9f5-76ff1693539c", "severity" : "warning", "state" : "clear", "timestamp" : "2019-08-22 18:57:58.602502" }
--->
2019-08-22T18:58:00.000 controller-0 fmManager: info { "event_log_id" : "100.101", "reason_text" : "Platform CPU threshold exceeded ; threshold 95.00%, actual 98.93%", "entity_instance_id" : "region=RegionOne.system=yow-cgcs-ironpass-20-27.host=compute-2", "severity" : "critical", "state" : "set", "timestamp" : "2019-08-22 18:58:00.296733" }
2019

...

2019-08-22T19:27:19.000 controller-0 fmManager: info { "event_log_id" : "700.008", "reason_text" : "Instance tenant1-volume-stress-1 owned by tenant1 is live migrating from host compute-2", "entity_instance_id" : "region=RegionOne.system=yow-cgcs-ironpass-20-27.tenant=71e4950f-b29c-40d8-ac13-742b4990dfad.instance=698586e7-0ff0-4f52-a9f5-76ff1693539c", "severity" : "warning", "state" : "set", "timestamp" : "2019-08-22 19:27:19.670505" }
2019-08-22T19:27:19.000 controller-0 fmManager: info { "event_log_id" : "700.152", "reason_text" : "Live-Migrate inprogress for instance tenant1-volume-stress-1 from host compute-2", "entity_instance_id" : "tenant=71e4950f-b29c-40d8-ac13-742b4990dfad.instance=698586e7-0ff0-4f52-a9f5-76ff1693539c", "severity" : "critical", "state" : "msg", "timestamp" : "2019-08-22 19:27:19.670740" }
2019-08-22T19:27:55.000 controller-0 fmManager: info { "event_log_id" : "700.008", "reason_text" : "Instance tenant1-volume-stress-1 owned by tenant1 is live migrating from host compute-2", "ent...


Wendy Mitchell (wmitchellwr) wrote :

Again, ping failure after live migration from compute-2 to compute-0 at ~16:01
instance id fb8e41e1-d51b-4bec-b139-47be706f96f7

[2019-08-23 16:01:27,880] 1913 INFO MainThread vm_helper._ping_vms:: Ping successful from 128.224.186.181: {'192.168.104.84': 33}

[2019-08-23 16:01:32,886] 164 INFO MainThread test_migrate.test_migrate_stress:: Live migration iter25

[2019-08-23 16:01:37,288] 1288 INFO MainThread vm_helper.live_migrate_vm:: Live migrating VM fb8e41e1-d51b-4bec-b139-47be706f96f7 from compute-2 started.
[2019-08-23 16:01:37,288] 1289 INFO MainThread vm_helper.live_migrate_vm:: nova live-migration fb8e41e1-d51b-4bec-b139-47be706f96f7

[2019-08-23 16:01:39,265] 1300 INFO MainThread vm_helper.live_migrate_vm:: Waiting for VM status change to MIGRATING with best effort

[2019-08-23 16:01:41,791] 1761 INFO MainThread vm_helper.wait_for_vm_status:: VM status has reached MIGRATING
[2019-08-23 16:01:41,791] 1309 INFO MainThread vm_helper.live_migrate_vm:: Waiting for VM status change to original state ACTIVE

[2019-08-23 16:02:14,911] 1316 INFO MainThread vm_helper.live_migrate_vm:: Live migrate vm fb8e41e1-d51b-4bec-b139-47be706f96f7 completed

[2019-08-23 16:02:17,058] 1368 INFO MainThread vm_helper.live_migrate_vm:: VM fb8e41e1-d51b-4bec-b139-47be706f96f7 successfully migrated from compute-2 to compute-0
[2019-08-23 16:02:17,058] 1419 INFO MainThread ssh.get_natbox_client:: Getting NatBox Client...

[2019-08-23 16:02:19,825] 1666 INFO MainThread network_helper._get_net_ips_for_vms:: IPs dict: {'fb8e41e1-d51b-4bec-b139-47be706f96f7': ['192.168.104.84']}
[2019-08-23 16:02:19,826] 2525 INFO MainThread network_helper.ping_server:: Ping 192.168.104.84 from host 128.224.186.181

[2019-08-23 16:02:32,031] 2563 INFO MainThread network_helper.ping_server:: retry in 3 seconds

[2019-08-23 16:02:35,034] 2567 WARNING MainThread network_helper.ping_server:: Ping from 128.224.186.181 to 192.168.104.84 failed.
[2019-08-23 16:02:35,035] 1926 INFO MainThread vm_helper._ping_vms:: Ping unsuccessful from vm (logged in via 128.224.186.181): {'192.168.104.84': 100}
[2019-08-23 16:02:35,035] 1419 INFO MainThread ssh.get_natbox_client:: Getting NatBox Client...

Ghada Khalil (gkhalil) wrote :

How many times was this issue reproduced? The bug report says the issue is reproducible.
Is there a definitive recipe?

Changed in starlingx:
status: New → Incomplete
Wendy Mitchell (wmitchellwr) wrote :

It has been reproduced multiple times on ip 20-27
(test runs failed every time with ping failures - 2 runs failed without rebooting compute-0 at all)

Wendy Mitchell (wmitchellwr) wrote :

Looks like I can reproduce on the same lab (ip 20-27) with the following steps:

- live migration iterations are successful, running for many iterations, while only the compute-0 and compute-1 hosts are enabled and available
- unlocked compute-2 so that compute-0, compute-1 and compute-2 are all enabled and available
- live migration iterations continued to be successful

compute-2 has HDD
hypervisor info
local_gb | 5544
cpu_info_model | IvyBridge-IBRS

compute-0/compute-1 109GB (have only SSD)
hypervisor info
cpu_info_model | IvyBridge-IBRS
local_gb | 109

--> Then compute-0 was rebooted (sudo reboot command issued)
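
In command form, the recipe above is roughly (a sketch; the migration loop itself is the automated stress test, and the prompts match this lab):

[sysadmin@controller-0 ~(keystone_admin)]$ system host-unlock compute-2   # all three computes enabled and available
# ... live migration iterations keep running via the stress test ...
compute-0:~$ sudo reboot                                                  # reboot compute-0 mid-run
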
@ [2019-08-23 21:00:15,348] automation logs indicate tenant1-router Router f224e410-8643-47ae-9f7a-16ccb1f6e1e7 is hosted on compute-2

Automation logs reported failure in ping test on subsequent live migration iteration from compute-2 to compute-1 although the migration appears to have been successful.

[2019-08-23 20:55:38,059] 164 INFO MainThread test_migrate.test_migrate_stress:: Live migration iter82
[2019-08-23 20:55:42,582]
Live migrating VM 813861c3-7e7c-414f-81ad-828e2151ed55 from compute-2 started.
[2019-08-23 20:56:22,953]
VM 813861c3-7e7c-414f-81ad-828e2151ed55 successfully migrated from compute-2 to compute-1
status | ACTIVE

@ 20:52:20 fm-events report the failures related to compute-0
200.004 compute-0 experienced a service-affecting failure. Auto-recovery in progress. Manual Lock and Unlock may be required if auto-recovery is unsuccessful. host=compute-0 critical 2019-08-23T20:52:25
200.009 compute-0 experienced a persistent critical 'Cluster-host Network' communication failure. host=compute-0.network=Cluster-host critical 2019-08-23T20:52:20
200.005 compute-0 experienced a persistent critical 'Management Network' communication failure. host=compute-0.network=Management critical 2019-08-23T20:52:20

Ping test failures start here
[2019-08-23 20:56:40,466]
Ping from 128.224.186.181 to 192.168.104.93 failed.

Dariush Eslimi (deslimi) on 2019-08-26
tags: added: stx.3.0 stx.distro.openstack
Changed in starlingx:
status: Incomplete → Triaged
importance: Undecided → High
assignee: nobody → yong hu (yhu6)
Changed in starlingx:
assignee: yong hu (yhu6) → chen haochuan (martin1982)
chen haochuan (martin1982) wrote :

Reproduced the issue in Shanghai; as there is no NatBox there, reproduced with the following steps:

On controller-0
$ openstack --os-username 'admin' --os-password 'Local.123' --os-project-name admin --os-auth-url http://keystone.openstack.svc.cluster.local/v3 --os-user-domain-name Default --os-project-domain-name Default --os-identity-api-version 3 --os-interface internal --os-region-name RegionOne image create --file cirros-0.4.0-x86_64-disk.img --disk-format qcow2 --container-format=bare cgcs-guest

$ openstack --os-username 'admin' --os-password 'Local.123' --os-project-name admin --os-auth-url http://keystone.openstack.svc.cluster.local/v3 --os-user-domain-name Default --os-project-domain-name Default --os-identity-api-version 3 --os-interface internal --os-region-name RegionOne volume create --size=1 --image=31f88e65-9dbe-4795-979a-2df286123ab1 --bootable vol_inst1-3

$ openstack --os-username 'admin' --os-password 'Local.123' --os-project-name admin --os-auth-url http://keystone.openstack.svc.cluster.local/v3 --os-user-domain-name Default --os-project-domain-name Default --os-identity-api-version 3 --os-interface internal --os-region-name RegionOne keypair create keypair-tenant1

$ nova --os-username 'admin' --os-password 'Local.123' --os-project-name admin --os-auth-url http://keystone.openstack.svc.cluster.local/v3 --os-user-domain-name Default --os-project-domain-name Default --os-endpoint-type internalURL --os-region-name RegionOne boot --key-name=keypair-tenant1 --poll --flavor=4bce0697-ec05-486f-b7a3-3355e63529ec --boot-volume=fa20dcec-34ae-4ceb-afbf-8ddbbfa89313 --nic net-id=2d8b491e-6fcb-4b72-a60b-2fe3658c0026 --nic net-id=6611c00a-de71-48c2-b4fc-48e8332c264a --meta foo=bar tenant1-config_drive-5

ssh to compute-1
$ nova --os-username 'admin' --os-password 'Local.123' --os-project-name admin --os-auth-url http://keystone.openstack.svc.cluster.local/v3 --os-user-domain-name Default --os-project-domain-name Default --os-endpoint-type internalURL --os-region-name RegionOne live-migration 4f4c6c52-e989-49ae-bff1-949d4ae8f1be compute-0

ssh to compute-0
$ sudo ip netns exec qdhcp-6611c00a-de71-48c2-b4fc-48e8332c264a ping 192.168.101.53

Could not ping the VM.

Wendy Mitchell (wmitchellwr) wrote :

See automation log attached

The last live migration is at this time for instance 698586e7-0ff0-4f52-a9f5-76ff1693539c:
[2019-08-22 19:30:20,631] 1368 INFO MainThread vm_helper.live_migrate_vm:: VM 698586e7-0ff0-4f52-a9f5-76ff1693539c successfully migrated from compute-2 to compute-1

Wendy Mitchell (wmitchellwr) wrote :

Reproduced the issue once again - see info below and logs (attaching now)

### StarlingX
BUILD_ID="20190912T013000Z"

ww: Ironpass 20-27

I have reproduced the issue with the same reproduction scenario (i.e. live migrating the instance while rebooting compute-0 during the test).
I will update the LP with this info and the new logs.

Instance instance-0000000b (tenant1-volume-stress-1): ping from the NatBox to the VM failed.
I am able to log in to the instance using 'sudo virsh console instance-0000000b' and the instance appears to have an IP; however, I am unable to ping the internal interface of the tenant1-router (192.168.104.1) from the VM.

compute-1:~$ sudo virsh list
Password:
Id Name State
-----------------------------------
6 instance-0000000b running
7 instance-0000000e running

tenant1-mgmt-net subnets as follows:
• tenant1-mgmt0-subnet0 192.168.104.0/27
• tenant1-mgmt0-subnet1 192.168.104.32/27
• tenant1-mgmt0-subnet2 192.168.104.64/27

(29080474-5be2) 192.168.104.1 fa:16:3e:f6:83:91 network:router_interface Active UP
(49ce62a2-0270) 192.168.104.65 fa:16:3e:cc:9c:d1 network:router_interface Active UP
(4a3a2700-bd7e) 192.168.104.33 fa:16:3e:d9:5f:22 network:router_interface Active UP

ID 29080474-5be2-4885-aba3-d278d7a5c23c
Network Name tenant1-mgmt-net
Network ID 113023fe-ca52-4a01-9db9-5534324f9fcc
Project ID c824ca71b6fc441bbe6df26773223155
MAC Address fa:16:3e:f6:83:91
Status Active
Admin State UP
Port Security Enabled False
DNS Name None
DNS Assignment None
Fixed IPs

IP Address 192.168.104.1
Subnet ID 25d506d6-dc76-426f-a37d-38d5022319ef
Attached Device
Device Owner network:router_interface
Device ID 558ebc16-10f5-488f-9f78-bf4bf11f9733
Security Groups

tenant1-volume-stress-1 login: root
[ 7158.596103] login[11064]: pam_unix(login:session): session opened for user root by LOGIN(uid=0)
[ 7158.602495] systemd-logind[529]: New session 4 of user root.
[ 7158.604156] login[11064]: DIALUP AT ttyS0 BY root
[ 7158.605429] systemd[1]: Started Session 4 of user root.
[ 7158.607371] login[11064]: ROOT LOGIN ON ttyS0
[ 7158.609313] systemd[1]: Starting Session 4 of user root.
tenant1-volume-stress-1:~# ifconfig
[ 7167.675942] -bash[11086]: HISTORY: PID=11086 UID=0 ifconfig
eth0: flags=4163<UP,BROADCAST,RUNNING,MULTICAST> mtu 1500
        inet 192.168.104.7 netmask 255.255.255.224 broadcast 192.168.104.31
        inet6 fe80::f816:3eff:fe93:728c prefixlen 64 scopeid 0x20<link>
        ether fa:16:3e:93:72:8c txqueuelen 1000 (Ethernet)
        RX packets 1068 bytes 102974 (100.5 KiB)
        RX errors 0 dropped 0 overruns 0 frame 0
        TX packets 700 bytes 50286 (49.1 KiB)
        TX errors 0 dropped 0 overruns 0 carrier 0 collisions 0

eth1: flags=4163<UP,BROADCAST,RUNNING,MULTICAST> mtu 1500
        inet 172.16.3.246 netmask 255.255.255.0 broadcast 172.16.3.255
        inet6 fe80::f816:3eff:fe44:8ea9 prefixlen 64 scopeid 0x20<link>
        ether fa:16:3e:44:8e:a9 txqueuelen 1000 (Ethernet)
        RX packe...

Wendy Mitchell (wmitchellwr) wrote :

(tenant1 user)
tenant1-volume-stress-1 -
    tenant1-net3 172.16.3.246
    tenant1-mgmt-net 192.168.104.7
flavor-default-size2 Active None Running 58 minutes

flavor flavor-default-size2 1 1GB 2GB 39aeccae-ff82-496f-aade-c31f8c03eac8
hw:mem_page_size=large
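
For reference, a hedged sketch of how an equivalent flavor could be created (sizes taken from the line above; standard openstack CLI syntax):

$ openstack flavor create --vcpus 1 --ram 1024 --disk 2 flavor-default-size2
$ openstack flavor set --property hw:mem_page_size=large flavor-default-size2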

[2019-09-16 18:46:01,361] 1316 INFO MainThread vm_helper.live_migrate_vm:: Live migrate vm f9b18cc7-4002-499b-83a3-fd82b77cb68a completed

[2019-09-16 18:46:03,467] 1368 INFO MainThread vm_helper.live_migrate_vm:: VM f9b18cc7-4002-499b-83a3-fd82b77cb68a successfully migrated from compute-3 to compute-1
...
2019-09-16 18:49:18,395] 2525 INFO MainThread network_helper.ping_server:: Ping 192.168.104.7 from host 128.224.186.181
[2019-09-16 18:49:30,619] 2563 INFO MainThread network_helper.ping_server:: retry in 3 seconds
[2019-09-16 18:49:33,623] 2567 WARNING MainThread network_helper.ping_server:: Ping from 128.224.186.181 to 192.168.104.7 failed.
[2019-09-16 18:49:33,623] 1926 INFO MainThread vm_helper._ping_vms:: Ping unsuccessful from vm (logged in via 128.224.186.181): {'192.168.104.7': 100}
...
E Details: Ping from NatBox to vm f9b18cc7-4002-499b-83a3-fd82b77cb68a failed for 200 seconds.

chen haochuan (martin1982) wrote :

The VM could not ping the qrouter, so this should be a neutron or OVS issue. Request to transfer to the network team.
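
A few checks that could help narrow this down (a sketch only; run on the node hosting the router's L3 agent and on the VM's compute as appropriate, IDs are placeholders, and with a containerized vswitch the ovs commands may need to run inside the openvswitch pod):

$ sudo ip netns list                                         # find the qrouter-/qdhcp- namespaces
$ sudo ip netns exec qrouter-<router-id> ping -c 3 <vm-ip>   # ping the VM from the router namespace
$ sudo ovs-vsctl show                                        # confirm the VM's tap port is on br-int
$ sudo ovs-ofctl dump-flows br-int | grep <vm-mac>           # look for flows matching the VM's MAC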

yong hu (yhu6) on 2019-09-17
Changed in starlingx:
assignee: chen haochuan (martin1982) → nobody
tags: added: stx.networking
tags: added: stx.retestneeded
Cindy Xie (xxie1) on 2019-09-21
Changed in starlingx:
assignee: nobody → Forrest Zhao (forrest.zhao)
Changed in starlingx:
assignee: Forrest Zhao (forrest.zhao) → Gongjun Song (songgongjun)
Gongjun Song (songgongjun) wrote :

I installed an AIO-Duplex environment using the September 25, 2019 image.
When I perform lock/unlock on the controller-1 node, controller-0 does not show CPU exceeding the threshold, and there is no corresponding network error in ovs-vswitchd.log.
However, both of my nodes were in a degraded state; I don't know if that is related. So I will rebuild the AIO-Duplex environment and test it again.

Gongjun Song (songgongjun) wrote :

@Wendy Mitchell
Are you deploying StarlingX in a virtual environment? I see you ping from the NatBox.

If you are using a physical server to deploy a virtual environment, is OVS-DPDK used or not?

Wendy Mitchell (wmitchellwr) wrote :

From the log collection file you can see the configuration:
see var/extra/database/sysinv.db.sql.txt - this indicates that "vswitch_type": "none" (vswitch is containerized)
HW lab (2 controllers, 2 storage, 4 worker nodes)
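
A minimal way to confirm this from a collect bundle (assuming the path above):

$ grep -i vswitch_type var/extra/database/sysinv.db.sql.txt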

Gongjun Song (songgongjun) wrote :

I tried to reproduce this bug in a virtual machine, but I cannot reproduce it on a duplex system using the 1030 ISO image and OVS.
I created an instance and live migrated it between controller-0 and controller-1. When I ping the instance IP in the network namespace multiple times, the ping always succeeds.
The log is as following:

controller-0:~$ openstack server create --image cirros --flavor m1.tiny --network public-net0 vm1 --security-group security1

controller-0:~$ openstack server show vm1
+-------------------------------------+----------------------------------------------------------+
| Field | Value |
+-------------------------------------+----------------------------------------------------------+
| OS-DCF:diskConfig | MANUAL |
| OS-EXT-AZ:availability_zone | nova |
| OS-EXT-SRV-ATTR:host | controller-0 |
| OS-EXT-SRV-ATTR:hypervisor_hostname | controller-0 |
| OS-EXT-SRV-ATTR:instance_name | instance-00000020 |
| OS-EXT-STS:power_state | Running |
| OS-EXT-STS:task_state | None |
| OS-EXT-STS:vm_state | active |
| OS-SRV-USG:launched_at | 2019-11-11T06:43:36.000000 |
| OS-SRV-USG:terminated_at | None |
| accessIPv4 | |
| accessIPv6 | |
| addresses | public-net0=192.168.101.124 |
| config_drive | |
| created | 2019-11-11T06:43:22Z |
| flavor | m1.tiny (8f06630c-5562-45cb-8388-1cacc54a0db8) |
| hostId | d6a9ce7a8bce088e58adce0b116bac1932e3c4c95e3f1d41b749e3dc |
| id | e3e51909-b05b-483b-822b-2f719c6b598e |
| image | cirros (271e62a3-f3fd-47ad-aa06-de9f1dbcdb0a) |
| key_name | None |
| name | vm1 |
| progress | 0 |
| project_id | 820c5ebe5a164087815efc522ce79920 |
| properties | |
| security_groups ...


yong hu (yhu6) wrote :

@songgongjun, have you tried to ping the VMs from outside through the data network?
My understanding is that the original testing was not pinging between VMs.

Gongjun Song (songgongjun) wrote :

@yhu6, the log I posted above is the result of pinging the VMs from outside through the data network. Next I am going to try to reproduce this bug on bare metal.

Gongjun Song (songgongjun) wrote :

@martin1982, did you set a specific security group when you created the virtual machine using openstack?

The log of the virtual machine you created is as follows:
$ nova --os-username 'admin' --os-password 'Local.123' --os-project-name admin --os-auth-url http://keystone.openstack.svc.cluster.local/v3 --os-user-domain-name Default --os-project-domain-name Default --os-endpoint-type internalURL --os-region-name RegionOne boot --key-name=keypair-tenant1 --poll --flavor=4bce0697-ec05-486f-b7a3-3355e63529ec --boot-volume=fa20dcec-34ae-4ceb-afbf-8ddbbfa89313 --nic net-id=2d8b491e-6fcb-4b72-a60b-2fe3658c0026 --nic net-id=6611c00a-de71-48c2-b4fc-48e8332c264a --meta foo=bar tenant1-config_drive-5

From the log of the virtual machine creation, you are not using a specified security group. If a security group is not specified, the default security group does not allow you to ping the VMs from outside through the data network.
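
For reference, a hedged example of the rules that would need to be present on the security group applied to the VM (standard openstack CLI; "default" here is the project's default group):

$ openstack security group rule create --ingress --protocol icmp default
$ openstack security group rule create --ingress --protocol tcp --dst-port 22 default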

Changed in starlingx:
assignee: Gongjun Song (songgongjun) → chen haochuan (martin1982)
Gongjun Song (songgongjun) wrote :

@martin1982, I saw that you assigned the bug to yourself. Are you planning to reproduce it?

Gongjun Song (songgongjun) wrote :

I also can't reproduce this bug on bare metal with an AIO duplex system using the 1030 ISO image and OVS. Next, I will build a virtual environment with 2 controller nodes and 3 compute nodes to try to reproduce this bug.

yong hu (yhu6) wrote :

Thanks @Gongjun. Please consider using the latest build from CENGN when you run the test next time.
I can provide the build URL to you if needed.

Gongjun Song (songgongjun) wrote :

@yhu6, thanks for your advice. I will use the latest CENGN build when I run the test next time.

chen haochuan (martin1982) wrote :

For the WR reproduction steps: the "default" security group already has rules added for ICMP and TCP port 22.

[2019-08-22 18:19:54,133] 301 DEBUG MainThread ssh.send :: Send 'openstack --os-username 'tenant1' --os-password 'Li69nux*' --os-project-name tenant1 --os-auth-url http://keystone.openstack.svc.cluster.local/v3 --os-user-domain-name Default --os-project-domain-name Default --os-identity-api-version 3 --os-interface internal --os-region-name RegionOne security group rule list --long'
[2019-08-22 18:19:55,891] 423 DEBUG MainThread ssh.expect :: Output:
+--------------------------------------+-------------+-----------+------------+-----------+-----------+--------------------------------------+--------------------------------------+
| ID | IP Protocol | IP Range | Port Range | Direction | Ethertype | Remote Security Group | Security Group |
+--------------------------------------+-------------+-----------+------------+-----------+-----------+--------------------------------------+--------------------------------------+
| 828f8902-681c-4dbd-b5a6-1e5ee477c1c9 | None | None | | egress | IPv4 | None | 37e619d8-1724-4dc7-a130-d4df9ffecb0d |
| 927ef8e5-b10a-4f72-8c46-aefaa5f8b6f0 | icmp | 0.0.0.0/0 | | ingress | IPv4 | None | 37e619d8-1724-4dc7-a130-d4df9ffecb0d |
| c0f13b7f-e844-402e-baad-eda4113097b5 | None | None | | ingress | IPv6 | 37e619d8-1724-4dc7-a130-d4df9ffecb0d | 37e619d8-1724-4dc7-a130-d4df9ffecb0d |
| dd9ff74d-1003-44a6-bcdc-5e5d20f613eb | None | None | | ingress | IPv4 | 37e619d8-1724-4dc7-a130-d4df9ffecb0d | 37e619d8-1724-4dc7-a130-d4df9ffecb0d |
| e13c763b-5030-4071-ba8d-1d9582b8ed0c | None | None | | egress | IPv6 | None | 37e619d8-1724-4dc7-a130-d4df9ffecb0d |
| e5088825-d652-41fd-b90d-aa4c7245f5b6 | tcp | 0.0.0.0/0 | 22:22 | ingress | IPv4 | None | 37e619d8-1724-4dc7-a130-d4df9ffecb0d |
+--------------------------------------+-------------+-----------+------------+-----------+-----------+--------------------------------------+--------------------------------------+
[sysadmin@controller-0 ~(keystone_admin)]$

At first, it could ping the VM from external.
[2019-08-22 19:33:52,813] 2525 INFO MainThread network_helper.ping_server:: Ping 192.168.104.33 from host 128.224.186.181
[2019-08-22 19:33:52,813] 466 DEBUG MainThread ssh.exec_cmd:: Executing command...
[2019-08-22 19:33:52,813] 301 DEBUG MainThread ssh.send :: Send 'ping -c 3 192.168.104.33'
[2019-08-22 19:33:54,914] 423 DEBUG MainThread ssh.expect :: Output:
PING 192.168.104.33 (192.168.104.33) 56(84) bytes of data.
64 bytes from 192.168.104.33: icmp_seq=1 ttl=64 time=0.100 ms

Will set up a 2+3 configuration to check.

Changed in starlingx:
assignee: chen haochuan (martin1982) → Gongjun Song (songgongjun)
Gongjun Song (songgongjun) wrote :

@Wendy Mitchell, I set up a virtual env with a 2+3 (2-controller, 3-compute) system using the 1120 ISO image and OVS, and this bug also cannot be reproduced in this environment. Can you still reproduce this bug? If you can, can you elaborate on the steps to reproduce?

The log is as following:

controller-0:nova --os-username 'admin' --os-password 'Local.123' --os-project-name admin --os-auth-url http://keystone.openstack.svc.cluster.local/v3 --os-user-domain-name Default --os-project-domain-name Default --os-endpoint-type internalURL --os-region-name RegionOne boot --key-name=keypair-tenant1 --poll --flavor=1f2d052a-7891-4c99-934b-a362547d4198 --boot-volume=0ae7ea19-af67-476b-8552-93578fe7ca2f --nic net-id=56a66d9c-0fbf-4bd9-a6ec-429e28e3253d --nic net-id=dabcebd3-accb-418b-8f7b-9a656fa60fb0 --nic net-id=f2a1eed0-1823-4fe4-b0fb-99de85c21ed8 --meta foo=bar tenant1-config_drive-5

controller-0:~$ openstack --os-username 'admin' --os-password 'Local.123' --os-project-name admin --os-auth-url http://keystone.openstack.svc.cluster.local/v3 --os-user-domain-name Default --os-project-domain-name Default --os-identity-api-version 3 --os-interface internal --os-region-name RegionOne server show tenant1-config_drive-5
+-------------------------------------+----------------------------------------------------------------------------------------+
| Field | Value |
+-------------------------------------+----------------------------------------------------------------------------------------+
| OS-DCF:diskConfig | MANUAL |
| OS-EXT-AZ:availability_zone | nova |
| OS-EXT-SRV-ATTR:host | compute-1 |
| OS-EXT-SRV-ATTR:hypervisor_hostname | compute-1 |
| OS-EXT-SRV-ATTR:instance_name | instance-00000006 |
| OS-EXT-STS:power_state | Running |
| OS-EXT-STS:task_state | None |
| OS-EXT-STS:vm_state | active |
| OS-SRV-USG:launched_at | 2019-11-21T05:44:03.000000 |
| OS-SRV-USG:terminated_at | None |
| accessIPv4 | |
| accessIPv6 | |
| addresses | externa...

chen haochuan (martin1982) wrote :

openstack --os-username 'admin' --os-password 'Local.123' --os-project-name admin --os-auth-url http://keystone.openstack.svc.cluster.local/v3 --os-user-domain-name Default --os-project-domain-name Default --os-identity-api-version 3 --os-interface internal --os-region-name RegionOne image create --file cirros-0.4.0-x86_64-disk.img --disk-format qcow2 --container-format=bare cgcs-guest
openstack --os-username 'admin' --os-password 'Local.123' --os-project-name admin --os-auth-url http://keystone.openstack.svc.cluster.local/v3 --os-user-domain-name Default --os-project-domain-name Default --os-identity-api-version 3 --os-interface internal --os-region-name RegionOne volume create --size=1 --image=aa3eb976-c606-42ce-8c3b-ad90b315ccbf --bootable vol_inst1-3
openstack --os-username 'admin' --os-password 'Local.123' --os-project-name admin --os-auth-url http://keystone.openstack.svc.cluster.local/v3 --os-user-domain-name Default --os-project-domain-name Default --os-identity-api-version 3 --os-interface internal --os-region-name RegionOne keypair create keypair-tenant1
openstack --os-username 'admin' --os-password 'Local.123' --os-project-name admin --os-auth-url http://keystone.openstack.svc.cluster.local/v3 --os-user-domain-name Default --os-project-domain-name Default --os-identity-api-version 3 --os-interface internal --os-region-name RegionOne keypair list
openstack --os-username 'admin' --os-password 'Local.123' --os-project-name admin --os-auth-url http://keystone.openstack.svc.cluster.local/v3 --os-user-domain-name Default --os-project-domain-name Default --os-identity-api-version 3 --os-interface internal --os-region-name RegionOne volume list
openstack --os-username 'admin' --os-password 'Local.123' --os-project-name admin --os-auth-url http://keystone.openstack.svc.cluster.local/v3 --os-user-domain-name Default --os-project-domain-name Default --os-identity-api-version 3 --os-interface internal --os-region-name RegionOne network list
openstack --os-username 'admin' --os-password 'Local.123' --os-project-name admin --os-auth-url http://keystone.openstack.svc.cluster.local/v3 --os-user-domain-name Default --os-project-domain-name Default --os-identity-api-version 3 --os-interface internal --os-region-name RegionOne volume list
openstack --os-username 'admin' --os-password 'Local.123' --os-project-name admin --os-auth-url http://keystone.openstack.svc.cluster.local/v3 --os-user-domain-name Default --os-project-domain-name Default --os-identity-api-version 3 --os-interface internal --os-region-name RegionOne flavor list

nova --os-username 'admin' --os-password 'Local.123' --os-project-name admin --os-auth-url http://keystone.openstack.svc.cluster.local/v3 --os-user-domain-name Default --os-project-domain-name Default --os-endpoint-type internalURL --os-region-name RegionOne boot --key-name=keypair-tenant1 --poll --flavor=1f2d052a-7891-4c99-934b-a362547d4198 --boot-volume=0ae7ea19-af67-476b-8552-93578fe7ca2f --nic net-id=56a66d9c-0fbf-4bd9-a6ec-429e28e3253d --nic net-id=dabcebd3-accb-418b-8f7b-9a656fa60fb0 --nic net-id=f2a1eed0-1823-4fe4-b0fb-99de85c21ed8 --meta foo=bar tenant1-config_drive-5

openstack --os-username 'admin' --os-password ...


Yang Liu (yliu12) wrote :

Hi Gongjun and Haochuan,

I think the VM needs to be reached via the neutron router.

In our environment, the VM's eth0 is reached via a server (we call it the NatBox) that can reach the router's external gateway; the traffic then goes through the router to one of the router's internal interfaces, and then to the VM.

[sysadmin@controller-0 ~(keystone_admin)]$ openstack port list --router=tenant2-router --long
+--------------------------------------+------+-------------------+-------------------------------------------------------------------------------+--------+-----------------+--------------------------+------+
| ID | Name | MAC Address | Fixed IP Addresses | Status | Security Groups | Device Owner | Tags |
+--------------------------------------+------+-------------------+-------------------------------------------------------------------------------+--------+-----------------+--------------------------+------+
| 09ef76d7-caea-444c-9a20-665999b223e0 | | fa:16:3e:00:0e:7a | ip_address='192.168.220.1', subnet_id='d3ba508c-6ca9-4a5f-ade6-baf31a465d8e' | ACTIVE | | network:router_interface | |
| 3996236a-e33f-44d5-aeb5-d5ddb21fce62 | | fa:16:3e:e6:af:a7 | ip_address='192.168.20.3', subnet_id='de25fcbb-6d74-4847-a04c-fd15a60a072f' | ACTIVE | | network:router_gateway | |
| 566d17b7-905c-406f-940b-c2a9a5a61245 | | fa:16:3e:d8:3c:4e | ip_address='192.168.220.65', subnet_id='b79acfe2-74f3-42e9-9dfd-7a5f2e58ada4' | ACTIVE | | network:router_interface | |
| af0818df-883a-4468-94d0-eb435e8f48b1 | | fa:16:3e:60:eb:97 | ip_address='192.168.220.33', subnet_id='f5802f33-a85e-4d91-bb12-b0a3a9e78ca2' | ACTIVE | | network:router_interface | |
+--------------------------------------+------+-------------------+-------------------------------------------------------------------------------+--------+-----------------+--------------------------+------+
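
Given that topology, a hedged hop-by-hop check from the NatBox would be (addresses taken from the table above; the VM IP is a placeholder):

$ ping -c 3 192.168.20.3      # router external gateway (network:router_gateway)
$ ping -c 3 192.168.220.1     # one of the router internal interfaces
$ ping -c 3 <vm-ip>           # the VM itself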

Gongjun Song (songgongjun) wrote :

@yliu12, what kind of network environment do you have? How is the network connected to the router? Can you provide more specific test steps, i.e. how/where you ping the VM to verify connectivity?

Peng Peng (ppeng) wrote :

Issue was reproduced on train (3/11)
2019-11-21_20-00-00
wcp_3-6

[2019-11-22 21:47:10,381] 311 DEBUG MainThread ssh.send :: Send 'ping -c 3 192.168.120.21'
[2019-11-22 21:47:22,444] 433 DEBUG MainThread ssh.expect :: Output:
PING 192.168.120.21 (192.168.120.21) 56(84) bytes of data.

--- 192.168.120.21 ping statistics ---
3 packets transmitted, 0 received, 100% packet loss, time 1999ms

Peng Peng (ppeng) wrote :
Gongjun Song (songgongjun) wrote :

@ppeng, can you keep this environment? Maybe I will need more log information. At the same time, I will try to reproduce it in a 2+3 environment again.

Peng Peng (ppeng) wrote :

Issue was reproduced (1/36) on
Lab: WCP_3_6
Load: 2019-12-01_20-00-00

The lab is kept for investigation now.

Gongjun Song (songgongjun) wrote :

@ppeng, thanks very much for the detailed log information. I checked your log and this issue seems to happen after several migrations. I still have some questions in order to reproduce this issue for more investigation; could you please help clarify?
1. This issue does not seem to happen for just a single migration operation. In general, how many migrations does it take to cause this issue? When the issue happens, can the VM be pinged internally from the neutron router's dhcp namespace (to verify the neutron router still works)?
2. How does the NatBox server (this should be a physical machine instead of a neutron-created VM, right?) connect to the StarlingX neutron network: via a physical router or the neutron router? Is it possible to draw a topology diagram of the network so that I can understand it more easily?
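
For question 1, a sketch of the internal check (run on the node hosting the dhcp agent for the VM's tenant network; the network ID and VM IP are placeholders):

$ sudo ip netns list | grep qdhcp
$ sudo ip netns exec qdhcp-<network-id> ping -c 3 <vm-ip>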

Wendy Mitchell (wmitchellwr) wrote :

I also reproduced the ping failure issue while testing live migrations
In this example, the 8th live migration iteration exhibited the ping failure

The HW system (PV-1) was installed with stx-openstack
(2 controllers, 2 storage, compute-0 through compute-7)

$ system application-list
+---------------------+------------------------------+-------------------------------+-----------------+---------+-----------+
| application | version | manifest name | manifest file | status | progress |
+---------------------+------------------------------+-------------------------------+-----------------+---------+-----------+
| platform-integ-apps | 1.0-8 | platform-integration-manifest | manifest.yaml | applied | completed |
| stx-openstack | 1.0-19-centos-stable- | armada-manifest | stx-openstack. | applied | completed |
| | versioned | | yaml | | |
| | | | | | |
+---------------------+------------------------------+-------------------------------+--------------

[The live migration to compute-6 seems to be when it started]
[2019-12-04 19:37:11,844] 1316 INFO MainThread vm_helper.live_migrate_vm:: Live migrate vm 358526fa-0f85-45e9-b246-63be77384e3d completed
[2019-12-04 19:37:13,951] 1368 INFO MainThread vm_helper.live_migrate_vm:: VM 358526fa-0f85-45e9-b246-63be77384e3d successfully migrated from compute-1 to compute-6
[2019-12-04 19:37:13,951] 1484 INFO MainThread ssh.get_natbox_client:: Getting NatBox Client...
[2019-12-04 19:37:16,074] 1666 INFO MainThread network_helper._get_net_ips_for_vms:: IPs dict: {'358526fa-0f85-45e9-b246-63be77384e3d': ['192.168.131.50']}
[2019-12-04 19:37:16,074] 2525 INFO MainThread network_helper.ping_server:: Ping 192.168.131.50 from host 128.224.186.181
[2019-12-04 19:37:28,281] 2563 INFO MainThread network_helper.ping_server:: retry in 3 seconds
[2019-12-04 19:37:31,284] 2567 WARNING MainThread network_helper.ping_server:: Ping from 128.224.186.181 to 192.168.131.50 failed.

[Live migration 44 from compute-1 to compute-6 appears to complete but the ping failure is observed at this point]

| 44 | de40b7bd-66c7-437e-8d5e-6d9d08cd47f9 | compute-1 | compute-6 | compute-1 | compute-6 | 192.168.206.155 | completed | 358526fa-0f85-45e9-b246-63be77384e3d | 32 | 32 | 2019-12-04T19:36:48.000000 | 2019-12-04T19:37:09.000000 | live-migration |
| 41 | 994aa781-380b-4705-b308-83ac95017ade | compute-3 | compute-1 | compute-3 | compute-1 | 192.168.206.234 | completed | 358526fa-0f85-45e9-b246-63be77384e3d | 32 | 32 | 2019-12-04T19:36:08.000000 | 2019-12-04T19:36:30.000000 | live-migration |
| 38 | b6bdc142-f5b8-47f2-9d63-97ac094ac2c1 | compute-2 | compute-3 | compute-2 | compute-3 | 192.168.206.49 | completed | 358526fa-0f85-45e9-b246-63be77384e3d | 32 | 32 | 2019-12-04T19:35:29.000000 | 2019-12-04T19:35:51.000000 | live-migration |
| 35 | efb...

Wendy Mitchell (wmitchellwr) wrote :

instance on compute-6:
    tenant1-net10 172.16.10.135
    tenant1-mgmt-net 192.168.131.50

Name tenant1-router
ID cc260ce1-4dbf-473d-bd58-17cbbffe151e
Description
Project ID 35b64b45763e4c9f9ae92f4a659cb656
Status Active
Admin State UP
L3 Agent
    Host ID
    compute-1 354e0515-c7c0-4779-8c27-bb47b067b3ca
Availability Zones nova

External Gateway
Network Name external-net0
Network ID 3f0b67e5-f3a2-44b3-ac0c-f39f9cc52f7a
External Fixed IPs
        Subnet ID 2fdd9804-fef4-455f-9f1a-e1738110e62e
        IP Address 192.168.31.2
SNAT Disabled

(20b14c7e-0453) 192.168.131.1 Active Internal Interface UP
(3006e0df-0a95) 192.168.131.65 Active Internal Interface UP
(69f7186e-f008) 192.168.131.33 Active Internal Interface UP
(72149874-f1fe) 192.168.31.2 Active External Gateway UP

tenant1-mgmt0-subnet1 192.168.131.32/27 IPv4 192.168.131.33 3 26
tenant1-mgmt0-subnet0 192.168.131.0/27 IPv4 192.168.131.1 2 27
tenant1-mgmt0-subnet2 192.168.131.64/27 IPv4 192.168.131.65 2 27

compute-6:~$ sudo virsh console instance-0000000b
Connected to domain instance-0000000b

tenant1-volume-stress-1:~# ip add
[ 3752.763497] -bash[1597]: HISTORY: PID=1597 UID=0 ip add
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN qlen 1000
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
    inet 127.0.0.1/8 scope host lo
       valid_lft forever preferred_lft forever
    inet6 ::1/128 scope host
       valid_lft forever preferred_lft forever
2: eth0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc pfifo_fast state UP qlen 1000
    link/ether fa:16:3e:77:6e:55 brd ff:ff:ff:ff:ff:ff
    inet 192.168.131.50/27 brd 192.168.131.63 scope global dynamic eth0
       valid_lft 82655sec preferred_lft 82655sec
    inet6 fe80::f816:3eff:fe77:6e55/64 scope link
       valid_lft forever preferred_lft forever
3: eth1: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc pfifo_fast state UP qlen 1000
    link/ether fa:16:3e:1a:98:e7 brd ff:ff:ff:ff:ff:ff
    inet 172.16.10.135/24 brd 172.16.10.255 scope global dynamic eth1
       valid_lft 82661sec preferred_lft 82661sec
    inet6 fe80::f816:3eff:fe1a:98e7/64 scope link
       valid_lft forever preferred_lft forever
tenant1-volume-stress-1:~# ping 0.0.0.0
[ 3759.638041] -bash[1597]: HISTORY: PID=1597 UID=0 ping 0.0.0.0
PING 0.0.0.0 (127.0.0.1) 56(84) bytes of data.
64 bytes from 127.0.0.1: icmp_seq=1 ttl=64 time=0.022 ms
64 bytes from 127.0.0.1: icmp_seq=2 ttl=64 time=0.018 ms
64 bytes from 127.0.0.1: icmp_seq=3 ttl=64 time=0.031 ms

--- 0.0.0.0 ping statistics ---
3 packets transmitted, 3 received, 0% packet loss, time 2000ms
rtt min/avg/max/mdev = 0.018/0.023/0.031/0.007 ms
tenant1-volume-stress-1:~# ping 192.168.131.1
[ 3770.922772] -bash[1597]: HISTORY: PID=1597 UID=0 ping 192.168.131.1
PING 192.168.131.1 (192.168.131.1) 56(84) bytes of data.
From 192.168.131.50 icmp_seq=1 Destination Host Unreachable
From 192.168.131.50 icmp_seq=2 Destination Host Unreachable
From 192.168.131.50 icmp_seq=3 Destination Host Unreachable
From 192.168.131.50 icmp_seq=4 Destination Host Unreachable

--- 192.168.131.1 ping statistics ---
5 packets transmi...
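
A "Destination Host Unreachable" reply sourced from the VM's own address usually means the guest could not resolve the gateway's MAC address; a hedged in-guest follow-up check (interface name from the ip output above):

tenant1-volume-stress-1:~# ip neigh show dev eth0     # look for a FAILED/INCOMPLETE entry for 192.168.131.1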

