Ping to VM fails in live migration test

Bug #1841189 reported by Wendy Mitchell
This bug affects 1 person

Affects: StarlingX
Status: Won't Fix
Importance: High
Assigned to: Wendy Mitchell

Bug Description

Brief Description
-----------------
Ping from the NatBox to the instance fails in live migration tests

Severity
--------
Major

Steps to Reproduce
------------------
The test case performs the following operations (a hedged command sketch of this flow follows step d below):

a. Create a bootable volume:
--bootable --image=fd57e28e-6ea4-4380-ba0d-32d0f787c152 --size=2 vol-tenant1-volume-stress-1-1

b. Boot a single instance from the volume with a flavor that sets hw:mem_page_size=large:

--flavor=b8f20ebe-4c60-4d9d-81e7-f30332e01049 --boot-volume=753449a4-494b-42f5-aa19-768f8fc0d60e --key-name=keypair-tenant1 --poll --nic net-id=dcfc9285-916f-41b5-b5d4-cc9dcdc1bd99 --nic net-id=30fed34d-e58d-47b7-902f-7566ccc264a9 tenant1-volume-stress-1

c. Live migrate the instance many times.
In this test the instance live migrates between compute-0, compute-1 and compute-2.

The last live migration for this instance, at ~[2019-08-22 19:30:20,631], was from compute-2 to compute-1.

Instance ID: 698586e7-0ff0-4f52-a9f5-76ff1693539c

(Note: compute-0 was rebooted at around this time.)

d. After the live migration to compute-1, ping failures were observed (i.e. the VM could not be pinged from the NatBox).
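
A minimal command sketch of this flow, assuming the openstack/nova CLI forms used elsewhere in this report; all UUIDs/IPs are placeholders, and the real automation lives in testcases/stress/nova/test_migrate.py:

# a. Create a bootable volume from the image (placeholder UUID)
$ openstack volume create --bootable --image=<image-uuid> --size=2 vol-tenant1-volume-stress-1-1
# b. Boot a single instance from that volume with the hw:mem_page_size=large flavor
$ nova boot --key-name=keypair-tenant1 --poll --flavor=<flavor-uuid> --boot-volume=<volume-uuid> \
    --nic net-id=<mgmt-net-uuid> --nic net-id=<data-net-uuid> tenant1-volume-stress-1
# c./d. Live migrate repeatedly; after each iteration, ping the VM's management IP from the NatBox
$ nova live-migration <instance-uuid>    # target host chosen by the scheduler
$ ping -c 3 <vm-mgmt-ip>                 # run on the NatBox; this is the check that fails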

Expected Behavior
------------------
Did not expect ping failures from the NatBox to the VM
(and did not expect a lock/unlock of compute-0 to cause CPU usage on compute-2 to reach a critical level)

Actual Behavior
----------------
Ping failures

Reproducibility
---------------
Reproducible

System Configuration
--------------------
Lab: IP_20-27
Load: 2019-08-20_20-59-00
testcases/stress/nova/test_migrate.py::test_migrate_stress[volume-1000]

Last Pass
---------

Timestamp/Logs
--------------
-----------------[2019-08-22 19:33:41]-----------------
==================Ping from NatBox to vm 698586e7-0ff0-4f52-a9f5-76ff1693539c failed for 200 seconds.===============

-----------------[2019-08-22 19:33:44]-----------------

#### Console log for vm 698586e7-0ff0-4f52-a9f5-76ff1693539c ####
[ 4248.518530] chronyd[520]: Can't synchronise: no selectable sources

-----------------[2019-08-22 19:33:59]-----------------
#### Ping router interfaces succeeded ####
{'192.168.104.1': 0, '192.168.104.65': 0, '192.168.104.33': 0}

nova-compute.log on compute-2:
{"log":"2019-08-22 19:30:18.743 861834 INFO nova.compute.manager [-] [instance: 698586e7-0ff0-4f52-a9f5-76ff1693539c] Migrating instance to compute-1 finished successfully.\n","stream":"stdout","time":"2019-08-22T19:30:18.743651362Z"}
{"log":"2019-08-22 19:30:18.743 861834 INFO nova.compute.manager [-] [instance: 698586e7-0ff0-4f52-a9f5-76ff1693539c] Migrating instance to compute-1 finished successfully.\n","stream":"stdout","time":"2019-08-22T19:30:18.743877833Z"}

nova-compute.log on compute-1:
{"log":"2019-08-22 19:30:09.452 852129 INFO nova.compute.manager [req-3675697d-dfd7-497a-8526-c3037f843654 - - - - -] [instance: 698586e7-0ff0-4f52-a9f5-76ff1693539c] During the sync_power process the instance has moved from host compute-2 to host compute-1\n","stream":"stdout","time":"2019-08-22T19:30:09.453965979Z"}
{"log":"2019-08-22 19:30:09.452 852129 INFO nova.compute.manager [req-3675697d-dfd7-497a-8526-c3037f843654 - - - - -] [instance: 698586e7-0ff0-4f52-a9f5-76ff1693539c] During the sync_power process the instance has moved from host compute-2 to host compute-1\n","stream":"stdout","time":"2019-08-22T19:30:09.453994652Z"}
{"log":"2019-08-22 19:30:16.162 852129 INFO nova.compute.manager [req-85e0e1b9-d6e4-482f-a401-24256e2113fc 59cf4be13765408298e8805246364b99 5e79f82bd4244daca343a86d2dbe8408 - default default] [instance: 698586e7-0ff0-4f52-a9f5-76ff1693539c] Post operation of migration started\n","stream":"stdout","time":"2019-08-22T19:30:16.163123222Z"}
{"log":"2019-08-22 19:30:16.162 852129 INFO nova.compute.manager [req-85e0e1b9-d6e4-482f-a401-24256e2113fc 59cf4be13765408298e8805246364b99 5e79f82bd4244daca343a86d2dbe8408 - default default] [instance: 698586e7-0ff0-4f52-a9f5-76ff1693539c] Post operation of migration started\n","stream":"stdout","time":"2019-08-22T19:30:16.163768603Z"}
{"log":"2019-08-22 19:30:20.732 852129 WARNING nova.compute.manager [req-ef961e8d-8e94-498e-851e-5aff2a24085f e4e954886a0a46b6a688f091633d08d1 305a436d092647d8815325402f17d3a8 - 4a3ccc82d5b046c4b9d903daa39752db 4a3ccc82d5b046c4b9d903daa39752db] [instance: 698586e7-0ff0-4f52-a9f5-76ff1693539c] Received unexpected event network-vif-plugged-17154804-f1c6-45d9-9773-835fdd85baf8 for instance with vm_state active and task_state None.\n","stream":"stdout","time":"2019-08-22T19:30:20.73334443Z"}
{"log":"2019-08-22 19:30:20.732 852129 WARNING nova.compute.manager [req-ef961e8d-8e94-498e-851e-5aff2a24085f e4e954886a0a46b6a688f091633d08d1 305a436d092647d8815325402f17d3a8 - 4a3ccc82d5b046c4b9d903daa39752db 4a3ccc82d5b046c4b9d903daa39752db] [instance: 698586e7-0ff0-4f52-a9f5-76ff1693539c] Received unexpected event network-vif-plugged-17154804-f1c6-45d9-9773-835fdd85baf8 for instance with vm_state active and task_state None.\n","stream":"stdout","time":"2019-08-22T19:30:20.733373191Z"}
{"log":"2019-08-22 19:30:22.809 852129 WARNING nova.compute.manager [req-2eefca8b-6913-4a6b-8b82-42378126fb94 e4e954886a0a46b6a688f091633d08d1 305a436d092647d8815325402f17d3a8 - 4a3ccc82d5b046c4b9d903daa39752db 4a3ccc82d5b046c4b9d903daa39752db] [instance: 698586e7-0ff0-4f52-a9f5-76ff1693539c] Received unexpected event network-vif-plugged-f2cc70c4-2c6c-403e-9c6f-ad288c3d7128 for instance with vm_state active and task_state None.\n","stream":"stdout","time":"2019-08-22T19:30:22.81447433Z"}
{"log":"2019-08-22 19:30:22.809 852129 WARNING nova.compute.manager [req-2eefca8b-6913-4a6b-8b82-42378126fb94 e4e954886a0a46b6a688f091633d08d1 305a436d092647d8815325402f17d3a8 - 4a3ccc82d5b046c4b9d903daa39752db 4a3ccc82d5b046c4b9d903daa39752db] [instance: 698586e7-0ff0-4f52-a9f5-76ff1693539c] Received unexpected event network-vif-plugged-f2cc70c4-2c6c-403e-9c6f-ad288c3d7128 for instance with vm_state active and task_state None.\n","stream":"stdout","time":"2019-08-22T19:30:22.814500403Z"}
{"log":"2019-08-22 19:33:43.812 852129 INFO nova.compute.manager [req-43d7e446-9cce-45c3-ab61-91a29a06aea2 59cf4be13765408298e8805246364b99 5e79f82bd4244daca343a86d2dbe8408 - default default] [instance: 698586e7-0ff0-4f52-a9f5-76ff1693539c] Get console output\n","stream":"stdout","time":"2019-08-22T19:33:43.844570027Z"}
{"log":"2019-08-22 19:33:43.812 852129 INFO nova.compute.manager [req-43d7e446-9cce-45c3-ab61-91a29a06aea2 59cf4be13765408298e8805246364b99 5e79f82bd4244daca343a86d2dbe8408 - default default] [instance: 698586e7-0ff0-4f52-a9f5-76ff1693539c] Get console output\n","stream":"stdout","time":"2019-08-22T19:33:43.844602Z"}
{"log":"2019-08-22 19:37:25.632 852129 WARNING nova.virt.libvirt.imagecache [req-2322e957-1704-437c-bab2-45b9ac21426b - - - - -] Unknown base file: /var/lib/nova/instances/_base/10bae7f8102b3e457aba60d89da898ea5b019fd6\n","stream":"stdout","time":"2019-08-22T19:37:25.647764906Z"}
{"log":"2019-08-22 19:37:25.632 852129 WARNING nova.virt.libvirt.imagecache [req-2322e957-1704-437c-bab2-45b9ac21426b - - - - -] Unknown base file: /var/lib/nova/instances/_base/10bae7f8102b3e457aba60d89da898ea5b019fd6\n","stream":"stdout","time":"2019-08-22T19:37:25.647800983Z"}
{"log":"2019-08-22 19:37:25.632 852129 INFO nova.virt.libvirt.imagecache [req-2322e957-1704-437c-bab2-45b9ac21426b - - - - -] Removable base files: /var/lib/nova/instances/_base/10bae7f8102b3e457aba60d89da898ea5b019fd6\n","stream":"stdout","time":"2019-08-22T19:37:25.647818282Z"}
{"log":"2019-08-22 19:37:25.632 852129 INFO nova.virt.libvirt.imagecache [req-2322e957-1704-437c-bab2-45b9ac21426b - - - - -] Removable base files: /var/lib/nova/instances/_base/10bae7f8102b3e457aba60d89da898ea5b019fd6\n","stream":"stdout","time":"2019-08-22T19:37:25.647825746Z"}
{"log":"2019-08-22 19:37:25.634 852129 INFO nova.virt.libvirt.imagecache [req-2322e957-1704-437c-bab2-45b9ac21426b - - - - -] Removing base or swap file: /var/lib/nova/instances/_base/10bae7f8102b3e457aba60d89da898ea5b019fd6\n","stream":"stdout","time":"2019-08-22T19:37:25.647843155Z"}
{"log":"2019-08-22 19:37:25.634 852129 INFO nova.virt.libvirt.imagecache [req-2322e957-1704-437c-bab2-45b9ac21426b - - - - -] Removing base or swap file: /var/lib/nova/instances/_base/10bae7f8102b3e457aba60d89da898ea5b019fd6\n","stream":"stdout","time":"2019-08-22T19:37:25.647849756Z"}

2019-08-22T19:46:00 clear 100.101 Platform CPU threshold exceeded ; threshold 95.00%, actual 99.32% host=compute-2 critical
 2019-08-22T19:44:00 set 100.101 Platform CPU threshold exceeded ; threshold 95.00%, actual 99.32% host=compute-2 critical
 2019-08-22T19:41:17 clear 100.101 Platform CPU threshold exceeded ; threshold 95.00%, actual 99.93% host=compute-0 critical
 2019-08-22T19:40:17 log 275.001 Host compute-0 hypervisor is now unlocked-enabled host=compute-0.hypervisor=5743f8ea-7e02-4c9d-8ddd-d66bfa4ab79a critical
 2019-08-22T19:39:40 clear 200.006 compute-0 is degraded due to the failure of its 'pci-irq-affinity-agent' process. Auto recovery of this major process is in progress. host=compute-0.process=pci-irq-affinity-agent major
 2019-08-22T19:39:17 set 100.101 Platform CPU threshold exceeded ; threshold 95.00%, actual 99.93% host=compute-0 critical
 2019-08-22T19:38:04 log 275.001 Host compute-0 hypervisor is now unlocked-disabled host=compute-0.hypervisor=5743f8ea-7e02-4c9d-8ddd-d66bfa4ab79a critical
 2019-08-22T19:37:50 clear 200.004 compute-0 experienced a service-affecting failure. Auto-recovery in progress. Manual Lock and Unlock may be required if auto-recovery is unsuccessful. host=compute-0 critical
 2019-08-22T19:37:39 log 200.022 compute-0 is now 'enabled' host=compute-0.state=enabled not-applicable
 2019-08-22T19:37:37 set 200.006 compute-0 is degraded due to the failure of its 'pci-irq-affinity-agent' process. Auto recovery of this major process is in progress. host=compute-0.process=pci-irq-affinity-agent major
 2019-08-22T19:37:28 clear 200.009 compute-0 experienced a persistent critical 'Cluster-host Network' communication failure. host=compute-0.network=Cluster-host critical
 2019-08-22T19:37:28 clear 200.005 compute-0 experienced a persistent critical 'Management Network' communication failure. host=compute-0.network=Management critical
 2019-08-22T19:30:17 clear 700.008 Instance tenant1-volume-stress-1 owned by tenant1 is live migrating from host compute-2 tenant=71e4950f-b29c-40d8-ac13-742b4990dfad.instance=698586e7-0ff0-4f52-a9f5-76ff1693539c warning
 2019-08-22T19:30:17 log 700.156 Live-Migrate complete for instance tenant1-volume-stress-1 now enabled on host compute-1 tenant=71e4950f-b29c-40d8-ac13-742b4990dfad.instance=698586e7-0ff0-4f52-a9f5-76ff1693539c critical
 2019-08-22T19:30:16 log 275.001 Host compute-0 hypervisor is now locked-disabled host=compute-0.hypervisor=5743f8ea-7e02-4c9d-8ddd-d66bfa4ab79a critical
 2019-08-22T19:30:16 set 200.004 compute-0 experienced a service-affecting failure. Auto-recovery in progress. Manual Lock and Unlock may be required if auto-recovery is unsuccessful. host=compute-0 critical
 2019-08-22T19:30:16 log 200.022 compute-0 is 'disabled-failed' to the system host=compute-0.status=failed not-applicable
 2019-08-22T19:30:16 log 200.022 compute-0 is now 'offline' host=compute-0.status=offline not-applicable

Ironpass 20-27
Ping from the NatBox to the VM unsuccessful from [2019-08-22 19:28:19,881]

Test Activity
-------------
Regression

Revision history for this message
Wendy Mitchell (wmitchellwr) wrote :

* Wording correction:

Expected Behavior
------------------
Did not expect ping failures from the NatBox to the VM
(and did not expect a reboot of compute-0 to cause CPU usage on compute-2 to reach a critical level)

Revision history for this message
Wendy Mitchell (wmitchellwr) wrote :

Note: There are critical threshold levels exceeded on compute-2 each time this instance (instance-00000023) is live migrated from compute-2

see fm-event.log
2019-08-22T18:57:25.000 controller-0 fmManager: info { "event_log_id" : "700.008", "reason_text" : "Instance tenant1-volume-stress-1 owned by tenant1 is live migrating from host compute-2", "entity_instance_id" : "region=RegionOne.system=yow-cgcs-ironpass-20-27.tenant=71e4950f-b29c-40d8-ac13-742b4990dfad.instance=698586e7-0ff0-4f52-a9f5-76ff1693539c", "severity" : "warning", "state" : "set", "timestamp" : "2019-08-22 18:57:25.150504" }
2019-08-22T18:57:25.000 controller-0 fmManager: info { "event_log_id" : "700.152", "reason_text" : "Live-Migrate inprogress for instance tenant1-volume-stress-1 from host compute-2", "entity_instance_id" : "tenant=71e4950f-b29c-40d8-ac13-742b4990dfad.instance=698586e7-0ff0-4f52-a9f5-76ff1693539c", "severity" : "critical", "state" : "msg", "timestamp" : "2019-08-22 18:57:25.150691" }
2019-08-22T18:57:58.000 controller-0 fmManager: info { "event_log_id" : "700.156", "reason_text" : "Live-Migrate complete for instance tenant1-volume-stress-1 now enabled on host compute-1", "entity_instance_id" : "tenant=71e4950f-b29c-40d8-ac13-742b4990dfad.instance=698586e7-0ff0-4f52-a9f5-76ff1693539c", "severity" : "critical", "state" : "msg", "timestamp" : "2019-08-22 18:57:58.511765" }
2019-08-22T18:57:58.000 controller-0 fmManager: info { "event_log_id" : "700.008", "reason_text" : "Instance tenant1-volume-stress-1 owned by tenant1 is live migrating from host compute-2", "entity_instance_id" : "region=RegionOne.system=yow-cgcs-ironpass-20-27.tenant=71e4950f-b29c-40d8-ac13-742b4990dfad.instance=698586e7-0ff0-4f52-a9f5-76ff1693539c", "severity" : "warning", "state" : "clear", "timestamp" : "2019-08-22 18:57:58.602502" }
--->
2019-08-22T18:58:00.000 controller-0 fmManager: info { "event_log_id" : "100.101", "reason_text" : "Platform CPU threshold exceeded ; threshold 95.00%, actual 98.93%", "entity_instance_id" : "region=RegionOne.system=yow-cgcs-ironpass-20-27.host=compute-2", "severity" : "critical", "state" : "set", "timestamp" : "2019-08-22 18:58:00.296733" }
2019

...

2019-08-22T19:27:19.000 controller-0 fmManager: info { "event_log_id" : "700.008", "reason_text" : "Instance tenant1-volume-stress-1 owned by tenant1 is live migrating from host compute-2", "entity_instance_id" : "region=RegionOne.system=yow-cgcs-ironpass-20-27.tenant=71e4950f-b29c-40d8-ac13-742b4990dfad.instance=698586e7-0ff0-4f52-a9f5-76ff1693539c", "severity" : "warning", "state" : "set", "timestamp" : "2019-08-22 19:27:19.670505" }
2019-08-22T19:27:19.000 controller-0 fmManager: info { "event_log_id" : "700.152", "reason_text" : "Live-Migrate inprogress for instance tenant1-volume-stress-1 from host compute-2", "entity_instance_id" : "tenant=71e4950f-b29c-40d8-ac13-742b4990dfad.instance=698586e7-0ff0-4f52-a9f5-76ff1693539c", "severity" : "critical", "state" : "msg", "timestamp" : "2019-08-22 19:27:19.670740" }
2019-08-22T19:27:55.000 controller-0 fmManager: info { "event_log_id" : "700.008", "reason_text" : "Instance tenant1-volume-stress-1 owned by tenant1 is live migrating from host compute-2", "ent...


Revision history for this message
Wendy Mitchell (wmitchellwr) wrote :

Again, a ping failure after live migration from compute-2 to compute-0 at ~16:01.
Instance ID: fb8e41e1-d51b-4bec-b139-47be706f96f7

[2019-08-23 16:01:27,880] 1913 INFO MainThread vm_helper._ping_vms:: Ping successful from 128.224.186.181: {'192.168.104.84': 33}

[2019-08-23 16:01:32,886] 164 INFO MainThread test_migrate.test_migrate_stress:: Live migration iter25

[2019-08-23 16:01:37,288] 1288 INFO MainThread vm_helper.live_migrate_vm:: Live migrating VM fb8e41e1-d51b-4bec-b139-47be706f96f7 from compute-2 started.
[2019-08-23 16:01:37,288] 1289 INFO MainThread vm_helper.live_migrate_vm:: nova live-migration fb8e41e1-d51b-4bec-b139-47be706f96f7

[2019-08-23 16:01:39,265] 1300 INFO MainThread vm_helper.live_migrate_vm:: Waiting for VM status change to MIGRATING with best effort

[2019-08-23 16:01:41,791] 1761 INFO MainThread vm_helper.wait_for_vm_status:: VM status has reached MIGRATING
[2019-08-23 16:01:41,791] 1309 INFO MainThread vm_helper.live_migrate_vm:: Waiting for VM status change to original state ACTIVE

[2019-08-23 16:02:14,911] 1316 INFO MainThread vm_helper.live_migrate_vm:: Live migrate vm fb8e41e1-d51b-4bec-b139-47be706f96f7 completed

[2019-08-23 16:02:17,058] 1368 INFO MainThread vm_helper.live_migrate_vm:: VM fb8e41e1-d51b-4bec-b139-47be706f96f7 successfully migrated from compute-2 to compute-0
[2019-08-23 16:02:17,058] 1419 INFO MainThread ssh.get_natbox_client:: Getting NatBox Client...

[2019-08-23 16:02:19,825] 1666 INFO MainThread network_helper._get_net_ips_for_vms:: IPs dict: {'fb8e41e1-d51b-4bec-b139-47be706f96f7': ['192.168.104.84']}
[2019-08-23 16:02:19,826] 2525 INFO MainThread network_helper.ping_server:: Ping 192.168.104.84 from host 128.224.186.181

[2019-08-23 16:02:32,031] 2563 INFO MainThread network_helper.ping_server:: retry in 3 seconds

[2019-08-23 16:02:35,034] 2567 WARNING MainThread network_helper.ping_server:: Ping from 128.224.186.181 to 192.168.104.84 failed.
[2019-08-23 16:02:35,035] 1926 INFO MainThread vm_helper._ping_vms:: Ping unsuccessful from vm (logged in via 128.224.186.181): {'192.168.104.84': 100}
[2019-08-23 16:02:35,035] 1419 INFO MainThread ssh.get_natbox_client:: Getting NatBox Client...

Revision history for this message
Ghada Khalil (gkhalil) wrote :

How many times was this issue reproduced? The bug report says the issue is reproducible.
Is there a definitive recipe?

Changed in starlingx:
status: New → Incomplete
Revision history for this message
Wendy Mitchell (wmitchellwr) wrote :

It has been reproduced multiple times on IP 20-27
(test runs failed every time with ping failures; 2 runs failed without rebooting compute-0 at all).

Revision history for this message
Wendy Mitchell (wmitchellwr) wrote :

Looks like I can reproduce on the same lab (IP 20-27) with the following steps:

- Live migration iterations were successful for many iterations while only the compute-0 and compute-1 hosts were enabled and available.
- Unlocked compute-2, so compute-0, compute-1 and compute-2 were all enabled and available.
- Live migration iterations continued to be successful.

compute-2 (has HDD), hypervisor info:
  local_gb       | 5544
  cpu_info_model | IvyBridge-IBRS

compute-0/compute-1 (109 GB, SSD only), hypervisor info:
  cpu_info_model | IvyBridge-IBRS
  local_gb       | 109

--> Then compute-0 was rebooted (sudo reboot command issued).
At [2019-08-23 21:00:15,348] the automation logs indicate that tenant1-router (f224e410-8643-47ae-9f7a-16ccb1f6e1e7) is hosted on compute-2.

Automation logs reported a failure in the ping test on the subsequent live migration iteration from compute-2 to compute-1, although the migration appears to have been successful.

[2019-08-23 20:55:38,059] 164 INFO MainThread test_migrate.test_migrate_stress:: Live migration iter82
[2019-08-23 20:55:42,582]
Live migrating VM 813861c3-7e7c-414f-81ad-828e2151ed55 from compute-2 started.
[2019-08-23 20:56:22,953]
VM 813861c3-7e7c-414f-81ad-828e2151ed55 successfully migrated from compute-2 to compute-1
status | ACTIVE

At 20:52:20 the fm-events report the failures related to compute-0:
200.004 compute-0 experienced a service-affecting failure. Auto-recovery in progress. Manual Lock and Unlock may be required if auto-recovery is unsuccessful. host=compute-0 critical 2019-08-23T20:52:25
200.009 compute-0 experienced a persistent critical 'Cluster-host Network' communication failure. host=compute-0.network=Cluster-host critical 2019-08-23T20:52:20
200.005 compute-0 experienced a persistent critical 'Management Network' communication failure. host=compute-0.network=Management critical 2019-08-23T20:52:20

Ping test failures start here
[2019-08-23 20:56:40,466]
Ping from 128.224.186.181 to 192.168.104.93 failed.

Dariush Eslimi (deslimi)
tags: added: stx.3.0 stx.distro.openstack
Changed in starlingx:
status: Incomplete → Triaged
importance: Undecided → High
assignee: nobody → yong hu (yhu6)
Changed in starlingx:
assignee: yong hu (yhu6) → chen haochuan (martin1982)
Revision history for this message
chen haochuan (martin1982) wrote :

Reproduced the issue in the Shanghai lab; as there is no NatBox there, reproduced with the following steps.

On controller-0
$ openstack --os-username 'admin' --os-password 'Local.123' --os-project-name admin --os-auth-url http://keystone.openstack.svc.cluster.local/v3 --os-user-domain-name Default --os-project-domain-name Default --os-identity-api-version 3 --os-interface internal --os-region-name RegionOne image create --file cirros-0.4.0-x86_64-disk.img --disk-format qcow2 --container-format=bare cgcs-guest

$ openstack --os-username 'admin' --os-password 'Local.123' --os-project-name admin --os-auth-url http://keystone.openstack.svc.cluster.local/v3 --os-user-domain-name Default --os-project-domain-name Default --os-identity-api-version 3 --os-interface internal --os-region-name RegionOne volume create --size=1 --image=31f88e65-9dbe-4795-979a-2df286123ab1 --bootable vol_inst1-3

$ openstack --os-username 'admin' --os-password 'Local.123' --os-project-name admin --os-auth-url http://keystone.openstack.svc.cluster.local/v3 --os-user-domain-name Default --os-project-domain-name Default --os-identity-api-version 3 --os-interface internal --os-region-name RegionOne keypair create keypair-tenant1

$ nova --os-username 'admin' --os-password 'Local.123' --os-project-name admin --os-auth-url http://keystone.openstack.svc.cluster.local/v3 --os-user-domain-name Default --os-project-domain-name Default --os-endpoint-type internalURL --os-region-name RegionOne boot --key-name=keypair-tenant1 --poll --flavor=4bce0697-ec05-486f-b7a3-3355e63529ec --boot-volume=fa20dcec-34ae-4ceb-afbf-8ddbbfa89313 --nic net-id=2d8b491e-6fcb-4b72-a60b-2fe3658c0026 --nic net-id=6611c00a-de71-48c2-b4fc-48e8332c264a --meta foo=bar tenant1-config_drive-5

ssh to compute-1
$ nova --os-username 'admin' --os-password 'Local.123' --os-project-name admin --os-auth-url http://keystone.openstack.svc.cluster.local/v3 --os-user-domain-name Default --os-project-domain-name Default --os-endpoint-type internalURL --os-region-name RegionOne live-migration 4f4c6c52-e989-49ae-bff1-949d4ae8f1be compute-0

ssh to compute-0
$ sudo ip netns exec qdhcp-6611c00a-de71-48c2-b4fc-48e8332c264a ping 192.168.101.53

Could not ping the VM.
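
A hedged additional check under the same setup (if a neutron router is attached to the network) would be to repeat the ping from the router namespace, since the NatBox path in the original test goes through the router; the namespace name is a placeholder:

$ ip netns | grep qrouter
$ sudo ip netns exec qrouter-<router-uuid> ping -c 3 192.168.101.53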

Revision history for this message
Wendy Mitchell (wmitchellwr) wrote :

See automation log attached

The last live migration for instance 698586e7-0ff0-4f52-a9f5-76ff1693539c is at this time:
[2019-08-22 19:30:20,631] 1368 INFO MainThread vm_helper.live_migrate_vm:: VM 698586e7-0ff0-4f52-a9f5-76ff1693539c successfully migrated from compute-2 to compute-1

Revision history for this message
Wendy Mitchell (wmitchellwr) wrote :

Reproduced the issue once again - see info below and logs (attaching now)

### StarlingX
BUILD_ID="20190912T013000Z"

ww: Ironpass 20-27

I have reproduced the issue with the same reproduction scenario (i.e. live migrating the instance and rebooting compute-0 during the test).
I will update the LP with this info and the new logs.

For instance instance-0000000b (tenant1-volume-stress-1), ping from the NatBox to the VM failed.
I am able to log in to the instance using 'sudo virsh console instance-0000000b' and the instance appears to have an IP; however, I am unable to ping the internal interface of the tenant1-router (192.168.104.1) from the VM.

compute-1:~$ sudo virsh list
Password:
Id Name State
-----------------------------------
6 instance-0000000b running
7 instance-0000000e running

tenant1-mgmt-net subnets as follows:
• tenant1-mgmt0-subnet0 192.168.104.0/27
• tenant1-mgmt0-subnet1 192.168.104.32/27
• tenant1-mgmt0-subnet2 192.168.104.64/27

(29080474-5be2) 192.168.104.1 fa:16:3e:f6:83:91 network:router_interface Active UP
(49ce62a2-0270) 192.168.104.65 fa:16:3e:cc:9c:d1 network:router_interface Active UP
(4a3a2700-bd7e) 192.168.104.33 fa:16:3e:d9:5f:22 network:router_interface Active UP

ID 29080474-5be2-4885-aba3-d278d7a5c23c
Network Name tenant1-mgmt-net
Network ID 113023fe-ca52-4a01-9db9-5534324f9fcc
Project ID c824ca71b6fc441bbe6df26773223155
MAC Address fa:16:3e:f6:83:91
Status Active
Admin State UP
Port Security Enabled False
DNS Name None
DNS Assignment None
Fixed IPs

IP Address 192.168.104.1
Subnet ID 25d506d6-dc76-426f-a37d-38d5022319ef
Attached Device
Device Owner network:router_interface
Device ID 558ebc16-10f5-488f-9f78-bf4bf11f9733
Security Groups

tenant1-volume-stress-1 login: root
[ 7158.596103] login[11064]: pam_unix(login:session): session opened for user root by LOGIN(uid=0)
[ 7158.602495] systemd-logind[529]: New session 4 of user root.
[ 7158.604156] login[11064]: DIALUP AT ttyS0 BY root
[ 7158.605429] systemd[1]: Started Session 4 of user root.
[ 7158.607371] login[11064]: ROOT LOGIN ON ttyS0
[ 7158.609313] systemd[1]: Starting Session 4 of user root.
tenant1-volume-stress-1:~# ifconfig
[ 7167.675942] -bash[11086]: HISTORY: PID=11086 UID=0 ifconfig
eth0: flags=4163<UP,BROADCAST,RUNNING,MULTICAST> mtu 1500
        inet 192.168.104.7 netmask 255.255.255.224 broadcast 192.168.104.31
        inet6 fe80::f816:3eff:fe93:728c prefixlen 64 scopeid 0x20<link>
        ether fa:16:3e:93:72:8c txqueuelen 1000 (Ethernet)
        RX packets 1068 bytes 102974 (100.5 KiB)
        RX errors 0 dropped 0 overruns 0 frame 0
        TX packets 700 bytes 50286 (49.1 KiB)
        TX errors 0 dropped 0 overruns 0 carrier 0 collisions 0

eth1: flags=4163<UP,BROADCAST,RUNNING,MULTICAST> mtu 1500
        inet 172.16.3.246 netmask 255.255.255.0 broadcast 172.16.3.255
        inet6 fe80::f816:3eff:fe44:8ea9 prefixlen 64 scopeid 0x20<link>
        ether fa:16:3e:44:8e:a9 txqueuelen 1000 (Ethernet)
        RX packe...

Revision history for this message
Wendy Mitchell (wmitchellwr) wrote :

(tenant1 user)
tenant1-volume-stress-1 -
    tenant1-net3 172.16.3.246
    tenant1-mgmt-net 192.168.104.7
flavor-default-size2 Active None Running 58 minutes

flavor flavor-default-size2 1 1GB 2GB 39aeccae-ff82-496f-aade-c31f8c03eac8
hw:mem_page_size=large

[2019-09-16 18:46:01,361] 1316 INFO MainThread vm_helper.live_migrate_vm:: Live migrate vm f9b18cc7-4002-499b-83a3-fd82b77cb68a completed

[2019-09-16 18:46:03,467] 1368 INFO MainThread vm_helper.live_migrate_vm:: VM f9b18cc7-4002-499b-83a3-fd82b77cb68a successfully migrated from compute-3 to compute-1
...
2019-09-16 18:49:18,395] 2525 INFO MainThread network_helper.ping_server:: Ping 192.168.104.7 from host 128.224.186.181
[2019-09-16 18:49:30,619] 2563 INFO MainThread network_helper.ping_server:: retry in 3 seconds
[2019-09-16 18:49:33,623] 2567 WARNING MainThread network_helper.ping_server:: Ping from 128.224.186.181 to 192.168.104.7 failed.
[2019-09-16 18:49:33,623] 1926 INFO MainThread vm_helper._ping_vms:: Ping unsuccessful from vm (logged in via 128.224.186.181): {'192.168.104.7': 100}
...
E Details: Ping from NatBox to vm f9b18cc7-4002-499b-83a3-fd82b77cb68a failed for 200 seconds.

Revision history for this message
chen haochuan (martin1982) wrote :

The VM could not ping the qrouter, so this should be a neutron or OVS issue. Requesting transfer to the network team.

yong hu (yhu6)
Changed in starlingx:
assignee: chen haochuan (martin1982) → nobody
tags: added: stx.networking
tags: added: stx.retestneeded
Cindy Xie (xxie1)
Changed in starlingx:
assignee: nobody → Forrest Zhao (forrest.zhao)
Changed in starlingx:
assignee: Forrest Zhao (forrest.zhao) → Gongjun Song (songgongjun)
Revision history for this message
Gongjun Song (songgongjun) wrote :

I installed an AIO-Duplex environment using the September 25, 2019 image.
When I perform lock/unlock on the controller-1 node, controller-0 does not show the CPU exceeding the threshold, and there is no corresponding network error in ovs-vswitchd.log.
However, both of my nodes were in a degraded state. I don't know whether that is related, so I will rebuild the AIO-Duplex environment and test again.

Revision history for this message
Gongjun Song (songgongjun) wrote :

@Wendy Mitchell
Are you deploying StarlingX in a virtual environment? I see you ping from the NatBox.

If you are using a physical server to deploy the environment, is OVS-DPDK used or not?

Revision history for this message
Wendy Mitchell (wmitchellwr) wrote :

From the log collection file you can see the configuration;
see var/extra/database/sysinv.db.sql.txt - this indicates that "vswitch_type": "none" (the vswitch is containerized).
HW lab (2 controllers, 2 storage, 4 worker nodes).

Revision history for this message
Gongjun Song (songgongjun) wrote :

I tried to reproduce this bug in a virtual machine, but I cannot reproduce it on a duplex system using the 1030 ISO image and OVS.
I created an instance and live migrated it between controller-0 and controller-1. When I ping the instance IP in the network namespace multiple times, the ping always succeeds.
The log is as follows:

controller-0:~$ openstack server create --image cirros --flavor m1.tiny --network public-net0 vm1 --security-group security1

controller-0:~$ openstack server show vm1
+-------------------------------------+----------------------------------------------------------+
| Field | Value |
+-------------------------------------+----------------------------------------------------------+
| OS-DCF:diskConfig | MANUAL |
| OS-EXT-AZ:availability_zone | nova |
| OS-EXT-SRV-ATTR:host | controller-0 |
| OS-EXT-SRV-ATTR:hypervisor_hostname | controller-0 |
| OS-EXT-SRV-ATTR:instance_name | instance-00000020 |
| OS-EXT-STS:power_state | Running |
| OS-EXT-STS:task_state | None |
| OS-EXT-STS:vm_state | active |
| OS-SRV-USG:launched_at | 2019-11-11T06:43:36.000000 |
| OS-SRV-USG:terminated_at | None |
| accessIPv4 | |
| accessIPv6 | |
| addresses | public-net0=192.168.101.124 |
| config_drive | |
| created | 2019-11-11T06:43:22Z |
| flavor | m1.tiny (8f06630c-5562-45cb-8388-1cacc54a0db8) |
| hostId | d6a9ce7a8bce088e58adce0b116bac1932e3c4c95e3f1d41b749e3dc |
| id | e3e51909-b05b-483b-822b-2f719c6b598e |
| image | cirros (271e62a3-f3fd-47ad-aa06-de9f1dbcdb0a) |
| key_name | None |
| name | vm1 |
| progress | 0 |
| project_id | 820c5ebe5a164087815efc522ce79920 |
| properties | |
| security_groups ...


Revision history for this message
yong hu (yhu6) wrote :

@songgongjun, have you tried to ping the VMs from outside through the data network?
My understanding is that the original testing was not pinging between VMs.

Revision history for this message
Gongjun Song (songgongjun) wrote :

@yhu6, the log I posted above is the result of pinging the VMs from outside through the data network. Next I am going to try to reproduce this bug on bare metal.

Revision history for this message
Gongjun Song (songgongjun) wrote :

@martin1982, did you set up a specific security group when you created the virtual machine using openstack?

The log of the virtual machine you created is as follows:
$ nova --os-username 'admin' --os-password 'Local.123' --os-project-name admin --os-auth-url http://keystone.openstack.svc.cluster.local/v3 --os-user-domain-name Default --os-project-domain-name Default --os-endpoint-type internalURL --os-region-name RegionOne boot --key-name=keypair-tenant1 --poll --flavor=4bce0697-ec05-486f-b7a3-3355e63529ec --boot-volume=fa20dcec-34ae-4ceb-afbf-8ddbbfa89313 --nic net-id=2d8b491e-6fcb-4b72-a60b-2fe3658c0026 --nic net-id=6611c00a-de71-48c2-b4fc-48e8332c264a --meta foo=bar tenant1-config_drive-5

From the virtual machine creation log, you are not specifying a security group. If no security group is specified, the default security group does not allow you to ping the VMs from outside through the data network. (A hedged example of the ingress rules that would be needed is sketched below.)
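
For reference, a hedged sketch of the ingress rules that would have to exist on the instance's security group for external ICMP/SSH checks to work (the group name "default" is illustrative; the WR rule listing later in this thread shows such rules already present in their lab):

$ openstack security group rule create --ingress --protocol icmp default
$ openstack security group rule create --ingress --protocol tcp --dst-port 22 default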

Changed in starlingx:
assignee: Gongjun Song (songgongjun) → chen haochuan (martin1982)
Revision history for this message
Gongjun Song (songgongjun) wrote :

@martin1982, I saw that you assigned the bug to yourself. Are you planning to reproduce it?

Revision history for this message
Gongjun Song (songgongjun) wrote :

I also can't reproduce this bug on bare metal with an AIO duplex system using the 1030 ISO image and OVS. Next, I will build a virtual environment with 2 controller nodes and 3 compute nodes to try to reproduce this bug.

Revision history for this message
yong hu (yhu6) wrote :

Thanks @Gongjun. Please consider using the latest build from CENGN when you run the test next time.
I can provide the build URL to you if needed.

Revision history for this message
Gongjun Song (songgongjun) wrote :

@yhu6, thanks for your advice. I will use the latest CENGN build when I run the test next time.

Revision history for this message
chen haochuan (martin1982) wrote :

For the WR reproduction steps: the "default" security group already has rules for ICMP and TCP port 22.

[2019-08-22 18:19:54,133] 301 DEBUG MainThread ssh.send :: Send 'openstack --os-username 'tenant1' --os-password 'Li69nux*' --os-project-name tenant1 --os-auth-url http://keystone.openstack.svc.cluster.local/v3 --os-user-domain-name Default --os-project-domain-name Default --os-identity-api-version 3 --os-interface internal --os-region-name RegionOne security group rule list --long'
[2019-08-22 18:19:55,891] 423 DEBUG MainThread ssh.expect :: Output:
+--------------------------------------+-------------+-----------+------------+-----------+-----------+--------------------------------------+--------------------------------------+
| ID | IP Protocol | IP Range | Port Range | Direction | Ethertype | Remote Security Group | Security Group |
+--------------------------------------+-------------+-----------+------------+-----------+-----------+--------------------------------------+--------------------------------------+
| 828f8902-681c-4dbd-b5a6-1e5ee477c1c9 | None | None | | egress | IPv4 | None | 37e619d8-1724-4dc7-a130-d4df9ffecb0d |
| 927ef8e5-b10a-4f72-8c46-aefaa5f8b6f0 | icmp | 0.0.0.0/0 | | ingress | IPv4 | None | 37e619d8-1724-4dc7-a130-d4df9ffecb0d |
| c0f13b7f-e844-402e-baad-eda4113097b5 | None | None | | ingress | IPv6 | 37e619d8-1724-4dc7-a130-d4df9ffecb0d | 37e619d8-1724-4dc7-a130-d4df9ffecb0d |
| dd9ff74d-1003-44a6-bcdc-5e5d20f613eb | None | None | | ingress | IPv4 | 37e619d8-1724-4dc7-a130-d4df9ffecb0d | 37e619d8-1724-4dc7-a130-d4df9ffecb0d |
| e13c763b-5030-4071-ba8d-1d9582b8ed0c | None | None | | egress | IPv6 | None | 37e619d8-1724-4dc7-a130-d4df9ffecb0d |
| e5088825-d652-41fd-b90d-aa4c7245f5b6 | tcp | 0.0.0.0/0 | 22:22 | ingress | IPv4 | None | 37e619d8-1724-4dc7-a130-d4df9ffecb0d |
+--------------------------------------+-------------+-----------+------------+-----------+-----------+--------------------------------------+--------------------------------------+
[sysadmin@controller-0 ~(keystone_admin)]$

At first, the VM could be pinged from external:
[2019-08-22 19:33:52,813] 2525 INFO MainThread network_helper.ping_server:: Ping 192.168.104.33 from host 128.224.186.181
[2019-08-22 19:33:52,813] 466 DEBUG MainThread ssh.exec_cmd:: Executing command...
[2019-08-22 19:33:52,813] 301 DEBUG MainThread ssh.send :: Send 'ping -c 3 192.168.104.33'
[2019-08-22 19:33:54,914] 423 DEBUG MainThread ssh.expect :: Output:
PING 192.168.104.33 (192.168.104.33) 56(84) bytes of data.
64 bytes from 192.168.104.33: icmp_seq=1 ttl=64 time=0.100 ms

Will set up a 2+3 configuration to check.

Changed in starlingx:
assignee: chen haochuan (martin1982) → Gongjun Song (songgongjun)
Revision history for this message
Gongjun Song (songgongjun) wrote :

@Wendy Mitchell, I set up a virtual 2+3 (2-controller, 3-compute) system using the 1120 ISO image and OVS, and this bug also cannot be reproduced in that environment. Can you still reproduce this bug? If you can, could you elaborate on the steps to reproduce?

The log is as follows:

controller-0:nova --os-username 'admin' --os-password 'Local.123' --os-project-name admin --os-auth-url http://keystone.openstack.svc.cluster.local/v3 --os-user-domain-name Default --os-project-domain-name Default --os-endpoint-type internalURL --os-region-name RegionOne boot --key-name=keypair-tenant1 --poll --flavor=1f2d052a-7891-4c99-934b-a362547d4198 --boot-volume=0ae7ea19-af67-476b-8552-93578fe7ca2f --nic net-id=56a66d9c-0fbf-4bd9-a6ec-429e28e3253d --nic net-id=dabcebd3-accb-418b-8f7b-9a656fa60fb0 --nic net-id=f2a1eed0-1823-4fe4-b0fb-99de85c21ed8 --meta foo=bar tenant1-config_drive-5

controller-0:~$ openstack --os-username 'admin' --os-password 'Local.123' --os-project-name admin --os-auth-url http://keystone.openstack.svc.cluster.local/v3 --os-user-domain-name Default --os-project-domain-name Default --os-identity-api-version 3 --os-interface internal --os-region-name RegionOne server show tenant1-config_drive-5
+-------------------------------------+----------------------------------------------------------------------------------------+
| Field | Value |
+-------------------------------------+----------------------------------------------------------------------------------------+
| OS-DCF:diskConfig | MANUAL |
| OS-EXT-AZ:availability_zone | nova |
| OS-EXT-SRV-ATTR:host | compute-1 |
| OS-EXT-SRV-ATTR:hypervisor_hostname | compute-1 |
| OS-EXT-SRV-ATTR:instance_name | instance-00000006 |
| OS-EXT-STS:power_state | Running |
| OS-EXT-STS:task_state | None |
| OS-EXT-STS:vm_state | active |
| OS-SRV-USG:launched_at | 2019-11-21T05:44:03.000000 |
| OS-SRV-USG:terminated_at | None |
| accessIPv4 | |
| accessIPv6 | |
| addresses | externa...

Revision history for this message
chen haochuan (martin1982) wrote :

openstack --os-username 'admin' --os-password 'Local.123' --os-project-name admin --os-auth-url http://keystone.openstack.svc.cluster.local/v3 --os-user-domain-name Default --os-project-domain-name Default --os-identity-api-version 3 --os-interface internal --os-region-name RegionOne image create --file cirros-0.4.0-x86_64-disk.img --disk-format qcow2 --container-format=bare cgcs-guest
openstack --os-username 'admin' --os-password 'Local.123' --os-project-name admin --os-auth-url http://keystone.openstack.svc.cluster.local/v3 --os-user-domain-name Default --os-project-domain-name Default --os-identity-api-version 3 --os-interface internal --os-region-name RegionOne volume create --size=1 --image=aa3eb976-c606-42ce-8c3b-ad90b315ccbf --bootable vol_inst1-3
openstack --os-username 'admin' --os-password 'Local.123' --os-project-name admin --os-auth-url http://keystone.openstack.svc.cluster.local/v3 --os-user-domain-name Default --os-project-domain-name Default --os-identity-api-version 3 --os-interface internal --os-region-name RegionOne keypair create keypair-tenant1
openstack --os-username 'admin' --os-password 'Local.123' --os-project-name admin --os-auth-url http://keystone.openstack.svc.cluster.local/v3 --os-user-domain-name Default --os-project-domain-name Default --os-identity-api-version 3 --os-interface internal --os-region-name RegionOne keypair list
openstack --os-username 'admin' --os-password 'Local.123' --os-project-name admin --os-auth-url http://keystone.openstack.svc.cluster.local/v3 --os-user-domain-name Default --os-project-domain-name Default --os-identity-api-version 3 --os-interface internal --os-region-name RegionOne volume list
openstack --os-username 'admin' --os-password 'Local.123' --os-project-name admin --os-auth-url http://keystone.openstack.svc.cluster.local/v3 --os-user-domain-name Default --os-project-domain-name Default --os-identity-api-version 3 --os-interface internal --os-region-name RegionOne network list
openstack --os-username 'admin' --os-password 'Local.123' --os-project-name admin --os-auth-url http://keystone.openstack.svc.cluster.local/v3 --os-user-domain-name Default --os-project-domain-name Default --os-identity-api-version 3 --os-interface internal --os-region-name RegionOne volume list
openstack --os-username 'admin' --os-password 'Local.123' --os-project-name admin --os-auth-url http://keystone.openstack.svc.cluster.local/v3 --os-user-domain-name Default --os-project-domain-name Default --os-identity-api-version 3 --os-interface internal --os-region-name RegionOne flavor list

nova --os-username 'admin' --os-password 'Local.123' --os-project-name admin --os-auth-url http://keystone.openstack.svc.cluster.local/v3 --os-user-domain-name Default --os-project-domain-name Default --os-endpoint-type internalURL --os-region-name RegionOne boot --key-name=keypair-tenant1 --poll --flavor=1f2d052a-7891-4c99-934b-a362547d4198 --boot-volume=0ae7ea19-af67-476b-8552-93578fe7ca2f --nic net-id=56a66d9c-0fbf-4bd9-a6ec-429e28e3253d --nic net-id=dabcebd3-accb-418b-8f7b-9a656fa60fb0 --nic net-id=f2a1eed0-1823-4fe4-b0fb-99de85c21ed8 --meta foo=bar tenant1-config_drive-5

openstack --os-username 'admin' --os-password ...


Revision history for this message
Yang Liu (yliu12) wrote :

Hi Gongjun and Haochuan,

I think the VM needs to be reached via the neutron router.

In our environment, the VM's eth0 is reached via a server (we call it the NatBox) that can reach the router's external gateway; the traffic then goes through the router to one of the router's internal interfaces, and then to the VM. (An illustrative sketch follows the port listing below.)

[sysadmin@controller-0 ~(keystone_admin)]$ openstack port list --router=tenant2-router --long
+--------------------------------------+------+-------------------+-------------------------------------------------------------------------------+--------+-----------------+--------------------------+------+
| ID | Name | MAC Address | Fixed IP Addresses | Status | Security Groups | Device Owner | Tags |
+--------------------------------------+------+-------------------+-------------------------------------------------------------------------------+--------+-----------------+--------------------------+------+
| 09ef76d7-caea-444c-9a20-665999b223e0 | | fa:16:3e:00:0e:7a | ip_address='192.168.220.1', subnet_id='d3ba508c-6ca9-4a5f-ade6-baf31a465d8e' | ACTIVE | | network:router_interface | |
| 3996236a-e33f-44d5-aeb5-d5ddb21fce62 | | fa:16:3e:e6:af:a7 | ip_address='192.168.20.3', subnet_id='de25fcbb-6d74-4847-a04c-fd15a60a072f' | ACTIVE | | network:router_gateway | |
| 566d17b7-905c-406f-940b-c2a9a5a61245 | | fa:16:3e:d8:3c:4e | ip_address='192.168.220.65', subnet_id='b79acfe2-74f3-42e9-9dfd-7a5f2e58ada4' | ACTIVE | | network:router_interface | |
| af0818df-883a-4468-94d0-eb435e8f48b1 | | fa:16:3e:60:eb:97 | ip_address='192.168.220.33', subnet_id='f5802f33-a85e-4d91-bb12-b0a3a9e78ca2' | ACTIVE | | network:router_interface | |
+--------------------------------------+------+-------------------+-------------------------------------------------------------------------------+--------+-----------------+--------------------------+------+
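
To make the path above concrete, a hedged illustration of how a NatBox host could reach the tenant subnets via the router's external gateway (addresses taken from the port listing above; the summary route and VM IP are assumptions about the NatBox setup, not a confirmed configuration):

# On the NatBox: route the tenant2 subnets via the router's external gateway address
$ sudo ip route add 192.168.220.0/24 via 192.168.20.3
# Traffic to a VM then enters the router on its gateway port and exits on an internal interface
$ ping -c 3 <vm-ip-on-192.168.220.x>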

Revision history for this message
Gongjun Song (songgongjun) wrote :

@yliu12, what kind of network environment do you have, and how is the network connected to the router? Can you provide more specific test steps: how/where do you ping the VM to verify connectivity?

Revision history for this message
Peng Peng (ppeng) wrote :

Issue was reproduced on train (3/11)
Lab: wcp_3-6
Load: 2019-11-21_20-00-00

[2019-11-22 21:47:10,381] 311 DEBUG MainThread ssh.send :: Send 'ping -c 3 192.168.120.21'
[2019-11-22 21:47:22,444] 433 DEBUG MainThread ssh.expect :: Output:
PING 192.168.120.21 (192.168.120.21) 56(84) bytes of data.

--- 192.168.120.21 ping statistics ---
3 packets transmitted, 0 received, 100% packet loss, time 1999ms

Revision history for this message
Gongjun Song (songgongjun) wrote :

@ppeng, can you keep this environment? I may need more log information. In the meantime, I will try to reproduce it in a 2+3 environment again.

Revision history for this message
Peng Peng (ppeng) wrote :

Issue was reproduced (1/36) on
Lab: WCP_3_6
Load: 2019-12-01_20-00-00

The lab is kept for investigation now.

Revision history for this message
Gongjun Song (songgongjun) wrote :

@ppeng, thanks very much for the detailed log information. I checked your log and this issue seems to happen only after several migrations. I still have some questions in order to reproduce this issue for more investigation; could you please help to clarify?
1. This issue does not seem to happen after just a single migration operation. In general, how many migrations does it take to cause this issue? When the issue happens, can the VM be pinged internally from the neutron router/DHCP namespace (to verify the neutron router still works)? A command sketch for this check follows below.
2. How does the NatBox server (this should be a physical machine rather than a neutron-created VM, right?) connect to the StarlingX neutron network: via a physical router or the neutron router? Is it possible to draw a topology diagram of the network so that I can understand it more easily?
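
A hedged sketch of the namespace ping check mentioned in question 1, run on the node hosting the DHCP agent or router; the namespace names and VM IP are placeholders (the same qdhcp form was used in the Shanghai reproduction earlier in this thread):

$ ip netns
$ sudo ip netns exec qdhcp-<network-uuid> ping -c 3 <vm-ip>
$ sudo ip netns exec qrouter-<router-uuid> ping -c 3 <vm-ip>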

Revision history for this message
Wendy Mitchell (wmitchellwr) wrote :

I also reproduced the ping failure issue while testing live migrations
In this example, the 8th live migration iteration exhibited the ping failure

The HW system (PV-1) was installed with stx-openstack
(2 controllers, 2 storage, compute-0 through compute-7)

$ system application-list
+---------------------+------------------------------+-------------------------------+-----------------+---------+-----------+
| application | version | manifest name | manifest file | status | progress |
+---------------------+------------------------------+-------------------------------+-----------------+---------+-----------+
| platform-integ-apps | 1.0-8 | platform-integration-manifest | manifest.yaml | applied | completed |
| stx-openstack | 1.0-19-centos-stable- | armada-manifest | stx-openstack. | applied | completed |
| | versioned | | yaml | | |
| | | | | | |
+---------------------+------------------------------+-------------------------------+--------------

[The live migration to compute-6 seems to be when it started]
[2019-12-04 19:37:11,844] 1316 INFO MainThread vm_helper.live_migrate_vm:: Live migrate vm 358526fa-0f85-45e9-b246-63be77384e3d completed
[2019-12-04 19:37:13,951] 1368 INFO MainThread vm_helper.live_migrate_vm:: VM 358526fa-0f85-45e9-b246-63be77384e3d successfully migrated from compute-1 to compute-6
[2019-12-04 19:37:13,951] 1484 INFO MainThread ssh.get_natbox_client:: Getting NatBox Client...
[2019-12-04 19:37:16,074] 1666 INFO MainThread network_helper._get_net_ips_for_vms:: IPs dict: {'358526fa-0f85-45e9-b246-63be77384e3d': ['192.168.131.50']}
[2019-12-04 19:37:16,074] 2525 INFO MainThread network_helper.ping_server:: Ping 192.168.131.50 from host 128.224.186.181
[2019-12-04 19:37:28,281] 2563 INFO MainThread network_helper.ping_server:: retry in 3 seconds
[2019-12-04 19:37:31,284] 2567 WARNING MainThread network_helper.ping_server:: Ping from 128.224.186.181 to 192.168.131.50 failed.

[Live migration 44 from compute-1 to compute-6 appears to complete but the ping failure is observed at this point]

| 44 | de40b7bd-66c7-437e-8d5e-6d9d08cd47f9 | compute-1 | compute-6 | compute-1 | compute-6 | 192.168.206.155 | completed | 358526fa-0f85-45e9-b246-63be77384e3d | 32 | 32 | 2019-12-04T19:36:48.000000 | 2019-12-04T19:37:09.000000 | live-migration |
| 41 | 994aa781-380b-4705-b308-83ac95017ade | compute-3 | compute-1 | compute-3 | compute-1 | 192.168.206.234 | completed | 358526fa-0f85-45e9-b246-63be77384e3d | 32 | 32 | 2019-12-04T19:36:08.000000 | 2019-12-04T19:36:30.000000 | live-migration |
| 38 | b6bdc142-f5b8-47f2-9d63-97ac094ac2c1 | compute-2 | compute-3 | compute-2 | compute-3 | 192.168.206.49 | completed | 358526fa-0f85-45e9-b246-63be77384e3d | 32 | 32 | 2019-12-04T19:35:29.000000 | 2019-12-04T19:35:51.000000 | live-migration |
| 35 | efb...

Revision history for this message
Wendy Mitchell (wmitchellwr) wrote :

instance on compute-6:
    tenant1-net10 172.16.10.135
    tenant1-mgmt-net 192.168.131.50

Name tenant1-router
ID cc260ce1-4dbf-473d-bd58-17cbbffe151e
Description (empty)
Project ID 35b64b45763e4c9f9ae92f4a659cb656
Status Active
Admin State UP
L3 Agent Host compute-1 (354e0515-c7c0-4779-8c27-bb47b067b3ca)
Availability Zones nova

External Gateway
Network Name external-net0
Network ID 3f0b67e5-f3a2-44b3-ac0c-f39f9cc52f7a
External Fixed IPs
        Subnet ID 2fdd9804-fef4-455f-9f1a-e1738110e62e
        IP Address 192.168.31.2
SNAT Disabled

(20b14c7e-0453) 192.168.131.1 Active Internal Interface UP
(3006e0df-0a95) 192.168.131.65 Active Internal Interface UP
(69f7186e-f008) 192.168.131.33 Active Internal Interface UP
(72149874-f1fe) 192.168.31.2 Active External Gateway UP

tenant1-mgmt0-subnet1 192.168.131.32/27 IPv4 192.168.131.33 3 26
tenant1-mgmt0-subnet0 192.168.131.0/27 IPv4 192.168.131.1 2 27
tenant1-mgmt0-subnet2 192.168.131.64/27 IPv4 192.168.131.65 2 27

compute-6:~$ sudo virsh console instance-0000000b
Connected to domain instance-0000000b

tenant1-volume-stress-1:~# ip add
[ 3752.763497] -bash[1597]: HISTORY: PID=1597 UID=0 ip add
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN qlen 1000
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
    inet 127.0.0.1/8 scope host lo
       valid_lft forever preferred_lft forever
    inet6 ::1/128 scope host
       valid_lft forever preferred_lft forever
2: eth0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc pfifo_fast state UP qlen 1000
    link/ether fa:16:3e:77:6e:55 brd ff:ff:ff:ff:ff:ff
    inet 192.168.131.50/27 brd 192.168.131.63 scope global dynamic eth0
       valid_lft 82655sec preferred_lft 82655sec
    inet6 fe80::f816:3eff:fe77:6e55/64 scope link
       valid_lft forever preferred_lft forever
3: eth1: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc pfifo_fast state UP qlen 1000
    link/ether fa:16:3e:1a:98:e7 brd ff:ff:ff:ff:ff:ff
    inet 172.16.10.135/24 brd 172.16.10.255 scope global dynamic eth1
       valid_lft 82661sec preferred_lft 82661sec
    inet6 fe80::f816:3eff:fe1a:98e7/64 scope link
       valid_lft forever preferred_lft forever
tenant1-volume-stress-1:~# ping 0.0.0.0
[ 3759.638041] -bash[1597]: HISTORY: PID=1597 UID=0 ping 0.0.0.0
PING 0.0.0.0 (127.0.0.1) 56(84) bytes of data.
64 bytes from 127.0.0.1: icmp_seq=1 ttl=64 time=0.022 ms
64 bytes from 127.0.0.1: icmp_seq=2 ttl=64 time=0.018 ms
64 bytes from 127.0.0.1: icmp_seq=3 ttl=64 time=0.031 ms

--- 0.0.0.0 ping statistics ---
3 packets transmitted, 3 received, 0% packet loss, time 2000ms
rtt min/avg/max/mdev = 0.018/0.023/0.031/0.007 ms
tenant1-volume-stress-1:~# ping 192.168.131.1
[ 3770.922772] -bash[1597]: HISTORY: PID=1597 UID=0 ping 192.168.131.1
PING 192.168.131.1 (192.168.131.1) 56(84) bytes of data.
From 192.168.131.50 icmp_seq=1 Destination Host Unreachable
From 192.168.131.50 icmp_seq=2 Destination Host Unreachable
From 192.168.131.50 icmp_seq=3 Destination Host Unreachable
From 192.168.131.50 icmp_seq=4 Destination Host Unreachable

--- 192.168.131.1 ping statistics ---
5 packets transmi...


Revision history for this message
Ghada Khalil (gkhalil) wrote :

Leaving as gating stx.3.0 given that the issue is reproduced on OpenStack Train.

Revision history for this message
Gongjun Song (songgongjun) wrote :

@Wendy, @Peng, I am trying to reproduce this bug; can you please do the following test:
1. Create a subnet in a network (for example, tenant1-mgmt-net with only one subnet, tenant1-mgmt0-subnet0).
2. Under this network, create two instances, vm1 and vm2.
3. Check whether the NatBox can ping vm1.
4. Check whether the NatBox can ping vm2.
5. Check whether vm1 can ping vm2. (If steps 3, 4 and 5 succeed, proceed to the next step.)
6. Live migrate vm1; leave vm2 unchanged.
7. When the NatBox can no longer ping vm1, repeat steps 4 and 5.
8. At this time, can you successfully ping vm1 and vm2 from the router's namespace? (The commands are as follows.)
   $ ip netns
   $ sudo ip netns exec "router's namespace" bash
   $ ping "vm1's ip"
   $ ping "vm2's ip"
9. Check the status of OVS at this time (the command is as follows).
   $ systemctl status ovs-vswitchd

Revision history for this message
Gongjun Song (songgongjun) wrote :

Checking Peng's log: when the NatBox fails to ping the instance after it is migrated, deleting this instance and creating a new one results in the NatBox pinging the new instance successfully. After migrating the new instance, the NatBox can still ping it successfully. The relevant log information is as follows:

(192.168.120.26 is the IP of the instance that the NatBox failed to ping after migration)
controller-0:~$
[2019-11-23 05:12:31,087] 1654 DEBUG MainThread network_helper._get_net_ips_for_vms:: targeted ips for vm: ['192.168.120.26']
[2019-11-23 05:12:31,088] 1666 INFO MainThread network_helper._get_net_ips_for_vms:: IPs dict: {'a30dfd31-ed78-4a88-97d2-02aa762ff9ff': ['192.168.120.26']}
[2019-11-23 05:12:31,088] 2525 INFO MainThread network_helper.ping_server:: Ping 192.168.120.26 from host 128.224.186.181
[2019-11-23 05:12:31,089] 476 DEBUG MainThread ssh.exec_cmd:: Executing command...
[2019-11-23 05:12:31,089] 311 DEBUG MainThread ssh.send :: Send 'ping -c 3 192.168.120.26'
[2019-11-23 05:12:42,909] 433 DEBUG MainThread ssh.expect :: Output:
PING 192.168.120.26 (192.168.120.26) 56(84) bytes of data.

--- 192.168.120.26 ping statistics ---
3 packets transmitted, 0 received, 100% packet loss, time 1999ms

(192.168.120.22 is the new instance's IP; this is the log before migration)
controller-0:~$
[2019-11-23 05:25:49,088] 1654 DEBUG MainThread network_helper._get_net_ips_for_vms:: targeted ips for vm: ['192.168.120.22']
[2019-11-23 05:25:49,088] 1666 INFO MainThread network_helper._get_net_ips_for_vms:: IPs dict: {'ea9e2ad0-b06f-4b5d-a3e9-bee87ac5f456': ['192.168.120.22']}
[2019-11-23 05:25:49,089] 2525 INFO MainThread network_helper.ping_server:: Ping 192.168.120.22 from host 128.224.186.181
[2019-11-23 05:25:49,089] 476 DEBUG MainThread ssh.exec_cmd:: Executing command...
[2019-11-23 05:25:49,090] 311 DEBUG MainThread ssh.send :: Send 'ping -c 3 192.168.120.22'
[2019-11-23 05:25:51,148] 433 DEBUG MainThread ssh.expect :: Output:
PING 192.168.120.22 (192.168.120.22) 56(84) bytes of data.
64 bytes from 192.168.120.22: icmp_seq=1 ttl=63 time=0.736 ms
64 bytes from 192.168.120.22: icmp_seq=2 ttl=63 time=0.664 ms
64 bytes from 192.168.120.22: icmp_seq=3 ttl=63 time=0.273 ms

--- 192.168.120.22 ping statistics ---
3 packets transmitted, 3 received, 0% packet loss, time 1999ms
rtt min/avg/max/mdev = 0.273/0.557/0.736/0.205 ms

(This is the log information after migration)
controller-0:~$
[2019-11-23 05:28:17,789] 1654 DEBUG MainThread network_helper._get_net_ips_for_vms:: targeted ips for vm: ['192.168.120.22']
[2019-11-23 05:28:17,789] 1666 INFO MainThread network_helper._get_net_ips_for_vms:: IPs dict: {'ea9e2ad0-b06f-4b5d-a3e9-bee87ac5f456': ['192.168.120.22']}
[2019-11-23 05:28:17,790] 2525 INFO MainThread network_helper.ping_server:: Ping 192.168.120.22 from host 128.224.186.181
[2019-11-23 05:28:17,791] 476 DEBUG MainThread ssh.exec_cmd:: Executing command...
[2019-11-23 05:28:17,791] 311 DEBUG MainThread ssh.send :: Send 'ping -c 3 192.168.120.22'
[2019-11-23 05:28:19,869] 433 DEBUG MainThread ssh.expect :: Output:
PING 192.168.120.22 (192.168.120.22) 56(84) bytes of data.
64 bytes from 192.168.120....


Revision history for this message
Gongjun Song (songgongjun) wrote :

@Wendy, I found the following information in your /var/log/openvswitch/ovs-vswitchd log:
2019-12-04T19:37:07.759Z|00728|netdev_linux|WARN|ethtool command ETHTOOL_GDRVINFO on network device tap1fd78fc1-56 failed: No such device
2019-12-04T19:37:07.793Z|00729|netdev_linux|WARN|ethtool command ETHTOOL_GDRVINFO on network device tapaec2ad7f-d1 failed: No such device
2019-12-04T19:37:07.821Z|00730|netdev_linux|WARN|ethtool command ETHTOOL_GDRVINFO on network device tap43b81a46-fc failed: No such device
2019-12-04T19:37:07.829Z|00731|netdev_linux|WARN|ethtool command ETHTOOL_GDRVINFO on network device tap51620f1e-97 failed: No such device

In order to check whether the "No such device" errors caused the ping failure after the instance migration, in addition to the previous test, could you help me collect some more logs? (A small cross-check sketch follows the steps below.)
1. On the controller-0 node, check the ovs-vswitchd log of the compute node where the virtual machine is located.
$: kubectl get pods -n openstack -o wide | grep openvswitch
$: kubectl logs openvswitch-vswitchd-??? -n openstack | less
2. Log in to the compute node where the ping fails after the instance migration. Use the "ifconfig" command to check the devices, compare them with the "No such device" devices in the ovs-vswitchd log, and check whether those devices can be found.
$: ssh compute-???
$: ifconfig
3. Use the "ifconfig" command to check the devices in the network namespaces, compare them with the "No such device" devices in the ovs-vswitchd log, and check whether those devices can be found.
$: ip netns
$: sudo ip netns exec "network namespace" bash
$: ifconfig
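
For reference, a minimal shell sketch of that cross-check, assuming the device names taken from the ovs-vswitchd warnings above (adjust the list for the actual failure):

for dev in tap1fd78fc1-56 tapaec2ad7f-d1 tap43b81a46-fc tap51620f1e-97; do
    if ip link show "$dev" >/dev/null 2>&1; then
        echo "$dev: present in the default namespace"
    else
        # the tap may live inside a neutron network namespace, so check those too
        found=""
        for ns in $(ip netns list | awk '{print $1}'); do
            sudo ip netns exec "$ns" ip link show "$dev" >/dev/null 2>&1 && found="$ns"
        done
        if [ -n "$found" ]; then
            echo "$dev: present in netns $found"
        else
            echo "$dev: not found on this host"
        fi
    fi
done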

Revision history for this message
Le, Huifeng (hle2) wrote :

@Wendy,
From the ovs log analysis, it seems the tap device for the vm instance may not have been created correctly during migration. Could you please help collect the logs Gongjun asked for, to confirm this assumption when the failure occurs? Thanks!

@Matt,
What's your opinion on the error log? Do you have any other suggestions? Thanks!

Revision history for this message
Wendy Mitchell (wmitchellwr) wrote :

Sorry for the late response.
I have been swamped and will not have a chance to look at this.

As you can see from my comment above ( https://bugs.launchpad.net/starlingx/+bug/1841189/comments/46 ),
the ping was failing only after the live migration (but not every time).

I can tell you the following details, however:
- a new instance could be launched and could ping each target successfully (i.e. even though it was on the same host as the instance that could not ping)
- if the instance that could not ping was subsequently live migrated away from and back to the same host, the ping started working again (a sketch of this workaround follows)
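
For context, the "migrate away and back" workaround in the last point amounts to something like the following; the host names and instance ID are illustrative, and the legacy nova client is shown (the equivalent openstack client command also works):

# move the affected instance off its current host (compute-1 here)...
nova live-migration <instance-uuid> compute-2
# ...wait for it to become ACTIVE again, then move it back
nova live-migration <instance-uuid> compute-1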

Revision history for this message
Gongjun Song (songgongjun) wrote :

@Wendy, according to the analysis, the problem may be with the tap device after the migration. I am also trying to reproduce it. In addition, can you help me collect the following logs when you have time:
1. Log in to the compute node where the ping fails after the instance migration.
$: ssh compute-???

2. When the ping fails, use the following commands to collect some logs (do not delete the instance that fails to ping).
$: ifconfig
$: sudo ovs-vsctl show

3. When the ping fails, also collect the logs of ovs, nova and neutron (a rough collection sketch follows these steps).
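
A rough collection script along those lines might look like this; the compute host, output directory and pod-name patterns are assumptions about the lab, not exact names:

HOST=compute-1                                   # compute node where the ping fails (assumption)
OUT=/tmp/bug1841189-logs; mkdir -p "$OUT"

# interface and bridge state on the compute host
ssh "$HOST" 'ifconfig -a' > "$OUT/ifconfig.txt"
ssh "$HOST" 'sudo ovs-vsctl show' > "$OUT/ovs-vsctl-show.txt"

# ovs / nova / neutron container logs for that host (pod names vary per lab;
# add a container name or --all-containers if a pod runs more than one container)
kubectl get pods -n openstack -o wide | grep "$HOST" > "$OUT/pods.txt"
for pod in $(kubectl get pods -n openstack -o wide | grep "$HOST" \
             | grep -E 'openvswitch|nova-compute|neutron' | awk '{print $1}'); do
    kubectl logs "$pod" -n openstack > "$OUT/$pod.log" 2>&1
done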

Revision history for this message
Gongjun Song (songgongjun) wrote :

I used an automated script to try to reproduce this bug in a 2+3 environment (2 controllers, 3 computes, using ovs and the 1204 image). Vm1 was live migrated 640 times (after 640 migrations the system crashed due to other problems), and this bug did not reproduce. I will fix the system problem later and test again. Also, I am checking nova's live-migration code to see how the network is migrated.
Attach automated test scripts and collected logs.
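
For reference, the core of the reproduction loop is roughly the following; this is a hedged sketch rather than the attached script, the instance name, ping target and iteration count are assumptions, and older openstack clients use "--live <host>" instead of "--live-migration":

VM=vm1                                # instance under test (assumption)
VM_IP=192.168.120.22                  # tenant IP reachable from the NAT box (assumption)
for i in $(seq 1 1000); do
    openstack server migrate --live-migration "$VM"
    # wait until the migration finishes and the server is ACTIVE again
    until [ "$(openstack server show "$VM" -f value -c status)" = "ACTIVE" ]; do
        sleep 5
    done
    if ! ping -c 3 "$VM_IP" >/dev/null; then
        echo "iteration $i: ping to $VM_IP failed after live migration"
        break
    fi
done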

Revision history for this message
Gongjun Song (songgongjun) wrote :

Using the automation scripts, I still cannot reproduce this bug, which makes it very difficult to solve. So I suggest that a WR networking engineer take this bug, because it can be reproduced in the WR lab, the servers are accessible to them, and the logs provided by Wendy are not enough to solve it.

Revision history for this message
Le, Huifeng (hle2) wrote :

@Wendy Mitchell,
As aligned in the network meeting, since we cannot reproduce this issue in our lab, I have reassigned it to you to reproduce in the WR lab; the WR team will then help to do the debugging first. Thanks!

Changed in starlingx:
assignee: Gongjun Song (songgongjun) → Wendy Mitchell (wmitchellwr)
Revision history for this message
yong hu (yhu6) wrote :

Based on the comments above, we are waiting for a test update and will target the solution for stx.4.0.

tags: added: stx.4.0
removed: stx.3.0
Revision history for this message
Wendy Mitchell (wmitchellwr) wrote :
Download full text (6.3 KiB)

I have successfully run 375+ live-migration iterations of the same testcase on the original lab that previously exhibited the failure (ip 20-27, BUILD_ID="20200115T023003Z").

However, I have compared the original ovs-vswitchd.log with the current "working" logs.
The original logs had NUMA-related warnings reported in ovs-vswitchd.log, as indicated below.

2019-12-04T19:36:13.494Z|00835|dpdk|INFO|VHOST_CONFIG: /var/run/openvswitch/vhu0ef6b42f-79: reconnecting...
2019-12-04T19:36:13.494Z|00836|dpif_netdev|WARN|There's no available (non-isolated) pmd thread on numa node 1. Queue 0 on port 'eth0' will be assigned to the pmd on core 1 (numa node 0). Expect reduced performance.
2019-12-04T19:36:13.494Z|00837|dpif_netdev|WARN|There's no available (non-isolated) pmd thread on numa node 1. Queue 1 on port 'eth0' will be assigned to the pmd on core 2 (numa node 0). Expect reduced performance.
2019-12-04T19:36:13.494Z|00838|dpif_netdev|INFO|Core 2 on numa node 0 assigned port 'vhuff279836-40' rx queue 0 (measured processing cycles 0).
2019-12-04T19:36:13.494Z|00839|dpif_netdev|INFO|Core 1 on numa node 0 assigned port 'vhu0ef6b42f-79' rx queue 0 (measured processing cycles 0).
...

2019-12-04T19:37:06.305Z|00889|bridge|INFO|bridge br-int: deleted interface vhuff279836-40 on port 14
2019-12-04T19:37:06.305Z|00890|dpif_netdev|WARN|There's no available (non-isolated) pmd thread on numa node 1. Queue 0 on port 'eth0' will be assigned to the pmd on core 1 (numa node 0). Expect reduced performance.
2019-12-04T19:37:06.305Z|00891|dpif_netdev|WARN|There's no available (non-isolated) pmd thread on numa node 1. Queue 1 on port 'eth0' will be assigned to the pmd on core 2 (numa node 0). Expect reduced performance.
2019-12-04T19:37:06.305Z|00892|dpif_netdev|INFO|Core 2 on numa node 0 assigned port 'vhu0ef6b42f-79' rx queue 0 (measured processing cycles 264156).
2019-12-04T19:37:06.343Z|00893|netdev_linux|WARN|Dropped 51 log messages in last 41 seconds (most recently, 34 seconds ago) due to excessive rate
2019-12-04T19:37:06.343Z|00894|netdev_linux|WARN|ethtool command ETHTOOL_GDRVINFO on network device tap7964d501-39 failed: No such device
2019-12-04T19:37:06.375Z|00895|netdev_linux|WARN|ethtool command ETHTOOL_GDRVINFO on network device tapdcbc1bc9-06 failed: No such device
2019-12-04T19:37:06.407Z|00896|netdev_linux|WARN|ethtool command ETHTOOL_GDRVINFO on network device tap05f66e28-9b failed: No such device
2019-12-04T19:37:06.508Z|00897|bridge|INFO|bridge br-int: deleted interface vhu0ef6b42f-79 on port 15
2019-12-04T19:37:06.510Z|00898|dpif_netdev|WARN|There's no available (non-isolated) pmd thread on numa node 1. Queue 0 on port 'eth0' will be assigned to the pmd on core 1 (numa node 0). Expect reduced performance.
2019-12-04T19:37:06.510Z|00899|dpif_netdev|WARN|There's no available (non-isolated) pmd thread on numa node 1. Queue 1 on port 'eth0' will be assigned to the pmd on core 2 (numa node 0). Expect reduced performance.
2019-12-04T19:37:08.746Z|00900|connmgr|INFO|br-int<->unix#3129: 1 flow_mods in the last 0 s (1 deletes)
2019-12-04T19:37:09.431Z|00901|connmgr|INFO|br-int<->unix#3132: 1 flow_mods in the last 0 s (1 deletes)
2019-12-04T19:37:09.868Z|00902|con...

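For what it's worth, the PMD/NUMA situation those warnings describe can be inspected directly on the affected vswitchd with standard OVS-DPDK tools; a hedged sketch follows (the mask value is purely illustrative, and on StarlingX the vswitch cores are normally managed by the platform rather than set by hand):

# show which PMD core (and NUMA node) owns each rx queue
ovs-appctl dpif-netdev/pmd-rxq-show
# show the currently configured PMD core mask
ovs-vsctl get Open_vSwitch . other_config:pmd-cpu-mask
# a mask that also includes a core on NUMA node 1 would avoid the
# "no available (non-isolated) pmd thread on numa node 1" warning;
# 0x42 (cores 1 and 6) is illustrative only and depends on the CPU layout
ovs-vsctl set Open_vSwitch . other_config:pmd-cpu-mask=0x42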

Revision history for this message
Wendy Mitchell (wmitchellwr) wrote :

As stated, I verified with 375+ iterations of live migration, as noted above, when testing with vswitch type ovs-dpdk on the same lab as previously and with the same automation test.
The issue no longer occurred.

20200115T023003Z
This load has a newer version of openstack and a newer kernel.
Closing.

tags: removed: stx.retestneeded
Revision history for this message
Wendy Mitchell (wmitchellwr) wrote :

Also note: When using OVS-DPDK, the instance must be configured to use a flavor with property: hw:mem_page_size=large for the vhostuser port to function properly.
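
For example, the property can be set on an existing flavor with the standard client (the flavor name here is illustrative):

openstack flavor set --property hw:mem_page_size=large flavor-dpdk
openstack flavor show flavor-dpdk -c properties    # confirm the property is present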

Yang Liu (yliu12)
Changed in starlingx:
status: Triaged → Won't Fix