Ping to VM fails in live migration test

Bug #1841189 reported by Wendy Mitchell
This bug affects 1 person

Affects: StarlingX
Status: Won't Fix
Importance: High
Assigned to: Wendy Mitchell

Bug Description

Brief Description
-----------------
Ping from the NatBox to the instance fails in live migration tests

Severity
--------
Major

Steps to Reproduce
------------------
The test case performs the following operations (a hedged command sketch of this flow follows step d below):

a. Create a bootable volume:
--bootable --image=fd57e28e-6ea4-4380-ba0d-32d0f787c152 --size=2 vol-tenant1-volume-stress-1-1

b. Boot a single instance from the volume with a flavor that sets hw:mem_page_size=large:

--flavor=b8f20ebe-4c60-4d9d-81e7-f30332e01049 --boot-volume=753449a4-494b-42f5-aa19-768f8fc0d60e --key-name=keypair-tenant1 --poll --nic net-id=dcfc9285-916f-41b5-b5d4-cc9dcdc1bd99 --nic net-id=30fed34d-e58d-47b7-902f-7566ccc264a9 tenant1-volume-stress-1

c. Live migrate the instance many times.
In this test the instance live migrates between compute-0, compute-1 and compute-2.

The last live migration for this instance, at ~[2019-08-22 19:30:20,631], was from compute-2 to compute-1.

Instance ID: 698586e7-0ff0-4f52-a9f5-76ff1693539c

(Note: compute-0 was rebooted at around this time.)

d. After the live migration to compute-1, ping failures were observed (i.e. the VM could not be pinged from the NatBox).
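
A minimal command sketch of this flow, assuming the openstack/nova CLI forms used elsewhere in this report; all UUIDs/IPs are placeholders, and the real automation lives in testcases/stress/nova/test_migrate.py:

# a. Create a bootable volume from the image (placeholder UUID)
$ openstack volume create --bootable --image=<image-uuid> --size=2 vol-tenant1-volume-stress-1-1
# b. Boot a single instance from that volume with the hw:mem_page_size=large flavor
$ nova boot --key-name=keypair-tenant1 --poll --flavor=<flavor-uuid> --boot-volume=<volume-uuid> \
    --nic net-id=<mgmt-net-uuid> --nic net-id=<data-net-uuid> tenant1-volume-stress-1
# c./d. Live migrate repeatedly; after each iteration, ping the VM's management IP from the NatBox
$ nova live-migration <instance-uuid>    # target host chosen by the scheduler
$ ping -c 3 <vm-mgmt-ip>                 # run on the NatBox; this is the check that fails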

Expected Behavior
------------------
Did not expect ping failures from the NatBox to the VM
(and did not expect a lock/unlock of compute-0 to cause CPU usage on compute-2 to reach a critical level)

Actual Behavior
----------------
Ping failures

Reproducibility
---------------
Reproducible

System Configuration
--------------------
Lab: IP_20-27
Load: 2019-08-20_20-59-00
testcases/stress/nova/test_migrate.py::test_migrate_stress[volume-1000]

Last Pass
---------

Timestamp/Logs
--------------
-----------------[2019-08-22 19:33:41]-----------------
==================Ping from NatBox to vm 698586e7-0ff0-4f52-a9f5-76ff1693539c failed for 200 seconds.===============

-----------------[2019-08-22 19:33:44]-----------------

#### Console log for vm 698586e7-0ff0-4f52-a9f5-76ff1693539c ####
[ 4248.518530] chronyd[520]: Can't synchronise: no selectable sources

-----------------[2019-08-22 19:33:59]-----------------
#### Ping router interfaces succeeded ####
{'192.168.104.1': 0, '192.168.104.65': 0, '192.168.104.33': 0}

nova-compute.log on compute-2:
{"log":"2019-08-22 19:30:18.743 861834 INFO nova.compute.manager [-] [instance: 698586e7-0ff0-4f52-a9f5-76ff1693539c] Migrating instance to compute-1 finished successfully.\n","stream":"stdout","time":"2019-08-22T19:30:18.743651362Z"}
{"log":"2019-08-22 19:30:18.743 861834 INFO nova.compute.manager [-] [instance: 698586e7-0ff0-4f52-a9f5-76ff1693539c] Migrating instance to compute-1 finished successfully.\n","stream":"stdout","time":"2019-08-22T19:30:18.743877833Z"}

nova-compute.log on compute-1:
{"log":"2019-08-22 19:30:09.452 852129 INFO nova.compute.manager [req-3675697d-dfd7-497a-8526-c3037f843654 - - - - -] [instance: 698586e7-0ff0-4f52-a9f5-76ff1693539c] During the sync_power process the instance has moved from host compute-2 to host compute-1\n","stream":"stdout","time":"2019-08-22T19:30:09.453965979Z"}
{"log":"2019-08-22 19:30:09.452 852129 INFO nova.compute.manager [req-3675697d-dfd7-497a-8526-c3037f843654 - - - - -] [instance: 698586e7-0ff0-4f52-a9f5-76ff1693539c] During the sync_power process the instance has moved from host compute-2 to host compute-1\n","stream":"stdout","time":"2019-08-22T19:30:09.453994652Z"}
{"log":"2019-08-22 19:30:16.162 852129 INFO nova.compute.manager [req-85e0e1b9-d6e4-482f-a401-24256e2113fc 59cf4be13765408298e8805246364b99 5e79f82bd4244daca343a86d2dbe8408 - default default] [instance: 698586e7-0ff0-4f52-a9f5-76ff1693539c] Post operation of migration started\n","stream":"stdout","time":"2019-08-22T19:30:16.163123222Z"}
{"log":"2019-08-22 19:30:16.162 852129 INFO nova.compute.manager [req-85e0e1b9-d6e4-482f-a401-24256e2113fc 59cf4be13765408298e8805246364b99 5e79f82bd4244daca343a86d2dbe8408 - default default] [instance: 698586e7-0ff0-4f52-a9f5-76ff1693539c] Post operation of migration started\n","stream":"stdout","time":"2019-08-22T19:30:16.163768603Z"}
{"log":"2019-08-22 19:30:20.732 852129 WARNING nova.compute.manager [req-ef961e8d-8e94-498e-851e-5aff2a24085f e4e954886a0a46b6a688f091633d08d1 305a436d092647d8815325402f17d3a8 - 4a3ccc82d5b046c4b9d903daa39752db 4a3ccc82d5b046c4b9d903daa39752db] [instance: 698586e7-0ff0-4f52-a9f5-76ff1693539c] Received unexpected event network-vif-plugged-17154804-f1c6-45d9-9773-835fdd85baf8 for instance with vm_state active and task_state None.\n","stream":"stdout","time":"2019-08-22T19:30:20.73334443Z"}
{"log":"2019-08-22 19:30:20.732 852129 WARNING nova.compute.manager [req-ef961e8d-8e94-498e-851e-5aff2a24085f e4e954886a0a46b6a688f091633d08d1 305a436d092647d8815325402f17d3a8 - 4a3ccc82d5b046c4b9d903daa39752db 4a3ccc82d5b046c4b9d903daa39752db] [instance: 698586e7-0ff0-4f52-a9f5-76ff1693539c] Received unexpected event network-vif-plugged-17154804-f1c6-45d9-9773-835fdd85baf8 for instance with vm_state active and task_state None.\n","stream":"stdout","time":"2019-08-22T19:30:20.733373191Z"}
{"log":"2019-08-22 19:30:22.809 852129 WARNING nova.compute.manager [req-2eefca8b-6913-4a6b-8b82-42378126fb94 e4e954886a0a46b6a688f091633d08d1 305a436d092647d8815325402f17d3a8 - 4a3ccc82d5b046c4b9d903daa39752db 4a3ccc82d5b046c4b9d903daa39752db] [instance: 698586e7-0ff0-4f52-a9f5-76ff1693539c] Received unexpected event network-vif-plugged-f2cc70c4-2c6c-403e-9c6f-ad288c3d7128 for instance with vm_state active and task_state None.\n","stream":"stdout","time":"2019-08-22T19:30:22.81447433Z"}
{"log":"2019-08-22 19:30:22.809 852129 WARNING nova.compute.manager [req-2eefca8b-6913-4a6b-8b82-42378126fb94 e4e954886a0a46b6a688f091633d08d1 305a436d092647d8815325402f17d3a8 - 4a3ccc82d5b046c4b9d903daa39752db 4a3ccc82d5b046c4b9d903daa39752db] [instance: 698586e7-0ff0-4f52-a9f5-76ff1693539c] Received unexpected event network-vif-plugged-f2cc70c4-2c6c-403e-9c6f-ad288c3d7128 for instance with vm_state active and task_state None.\n","stream":"stdout","time":"2019-08-22T19:30:22.814500403Z"}
{"log":"2019-08-22 19:33:43.812 852129 INFO nova.compute.manager [req-43d7e446-9cce-45c3-ab61-91a29a06aea2 59cf4be13765408298e8805246364b99 5e79f82bd4244daca343a86d2dbe8408 - default default] [instance: 698586e7-0ff0-4f52-a9f5-76ff1693539c] Get console output\n","stream":"stdout","time":"2019-08-22T19:33:43.844570027Z"}
{"log":"2019-08-22 19:33:43.812 852129 INFO nova.compute.manager [req-43d7e446-9cce-45c3-ab61-91a29a06aea2 59cf4be13765408298e8805246364b99 5e79f82bd4244daca343a86d2dbe8408 - default default] [instance: 698586e7-0ff0-4f52-a9f5-76ff1693539c] Get console output\n","stream":"stdout","time":"2019-08-22T19:33:43.844602Z"}
{"log":"2019-08-22 19:37:25.632 852129 WARNING nova.virt.libvirt.imagecache [req-2322e957-1704-437c-bab2-45b9ac21426b - - - - -] Unknown base file: /var/lib/nova/instances/_base/10bae7f8102b3e457aba60d89da898ea5b019fd6\n","stream":"stdout","time":"2019-08-22T19:37:25.647764906Z"}
{"log":"2019-08-22 19:37:25.632 852129 WARNING nova.virt.libvirt.imagecache [req-2322e957-1704-437c-bab2-45b9ac21426b - - - - -] Unknown base file: /var/lib/nova/instances/_base/10bae7f8102b3e457aba60d89da898ea5b019fd6\n","stream":"stdout","time":"2019-08-22T19:37:25.647800983Z"}
{"log":"2019-08-22 19:37:25.632 852129 INFO nova.virt.libvirt.imagecache [req-2322e957-1704-437c-bab2-45b9ac21426b - - - - -] Removable base files: /var/lib/nova/instances/_base/10bae7f8102b3e457aba60d89da898ea5b019fd6\n","stream":"stdout","time":"2019-08-22T19:37:25.647818282Z"}
{"log":"2019-08-22 19:37:25.632 852129 INFO nova.virt.libvirt.imagecache [req-2322e957-1704-437c-bab2-45b9ac21426b - - - - -] Removable base files: /var/lib/nova/instances/_base/10bae7f8102b3e457aba60d89da898ea5b019fd6\n","stream":"stdout","time":"2019-08-22T19:37:25.647825746Z"}
{"log":"2019-08-22 19:37:25.634 852129 INFO nova.virt.libvirt.imagecache [req-2322e957-1704-437c-bab2-45b9ac21426b - - - - -] Removing base or swap file: /var/lib/nova/instances/_base/10bae7f8102b3e457aba60d89da898ea5b019fd6\n","stream":"stdout","time":"2019-08-22T19:37:25.647843155Z"}
{"log":"2019-08-22 19:37:25.634 852129 INFO nova.virt.libvirt.imagecache [req-2322e957-1704-437c-bab2-45b9ac21426b - - - - -] Removing base or swap file: /var/lib/nova/instances/_base/10bae7f8102b3e457aba60d89da898ea5b019fd6\n","stream":"stdout","time":"2019-08-22T19:37:25.647849756Z"}

2019-08-22T19:46:00 clear 100.101 Platform CPU threshold exceeded ; threshold 95.00%, actual 99.32% host=compute-2 critical
 2019-08-22T19:44:00 set 100.101 Platform CPU threshold exceeded ; threshold 95.00%, actual 99.32% host=compute-2 critical
 2019-08-22T19:41:17 clear 100.101 Platform CPU threshold exceeded ; threshold 95.00%, actual 99.93% host=compute-0 critical
 2019-08-22T19:40:17 log 275.001 Host compute-0 hypervisor is now unlocked-enabled host=compute-0.hypervisor=5743f8ea-7e02-4c9d-8ddd-d66bfa4ab79a critical
 2019-08-22T19:39:40 clear 200.006 compute-0 is degraded due to the failure of its 'pci-irq-affinity-agent' process. Auto recovery of this major process is in progress. host=compute-0.process=pci-irq-affinity-agent major
 2019-08-22T19:39:17 set 100.101 Platform CPU threshold exceeded ; threshold 95.00%, actual 99.93% host=compute-0 critical
 2019-08-22T19:38:04 log 275.001 Host compute-0 hypervisor is now unlocked-disabled host=compute-0.hypervisor=5743f8ea-7e02-4c9d-8ddd-d66bfa4ab79a critical
 2019-08-22T19:37:50 clear 200.004 compute-0 experienced a service-affecting failure. Auto-recovery in progress. Manual Lock and Unlock may be required if auto-recovery is unsuccessful. host=compute-0 critical
 2019-08-22T19:37:39 log 200.022 compute-0 is now 'enabled' host=compute-0.state=enabled not-applicable
 2019-08-22T19:37:37 set 200.006 compute-0 is degraded due to the failure of its 'pci-irq-affinity-agent' process. Auto recovery of this major process is in progress. host=compute-0.process=pci-irq-affinity-agent major
 2019-08-22T19:37:28 clear 200.009 compute-0 experienced a persistent critical 'Cluster-host Network' communication failure. host=compute-0.network=Cluster-host critical
 2019-08-22T19:37:28 clear 200.005 compute-0 experienced a persistent critical 'Management Network' communication failure. host=compute-0.network=Management critical
 2019-08-22T19:30:17 clear 700.008 Instance tenant1-volume-stress-1 owned by tenant1 is live migrating from host compute-2 tenant=71e4950f-b29c-40d8-ac13-742b4990dfad.instance=698586e7-0ff0-4f52-a9f5-76ff1693539c warning
 2019-08-22T19:30:17 log 700.156 Live-Migrate complete for instance tenant1-volume-stress-1 now enabled on host compute-1 tenant=71e4950f-b29c-40d8-ac13-742b4990dfad.instance=698586e7-0ff0-4f52-a9f5-76ff1693539c critical
 2019-08-22T19:30:16 log 275.001 Host compute-0 hypervisor is now locked-disabled host=compute-0.hypervisor=5743f8ea-7e02-4c9d-8ddd-d66bfa4ab79a critical
 2019-08-22T19:30:16 set 200.004 compute-0 experienced a service-affecting failure. Auto-recovery in progress. Manual Lock and Unlock may be required if auto-recovery is unsuccessful. host=compute-0 critical
 2019-08-22T19:30:16 log 200.022 compute-0 is 'disabled-failed' to the system host=compute-0.status=failed not-applicable
 2019-08-22T19:30:16 log 200.022 compute-0 is now 'offline' host=compute-0.status=offline not-applicable

Ironpass 20-27
Ping from the NatBox to the VM unsuccessful from [2019-08-22 19:28:19,881]

Test Activity
-------------
Regression

Revision history for this message
Wendy Mitchell (wmitchellwr) wrote :

* Wording correction:

Expected Behavior
------------------
Did not expect ping failures from the NatBox to the VM
(and did not expect a reboot of compute-0 to cause CPU usage on compute-2 to reach a critical level)

Revision history for this message
Wendy Mitchell (wmitchellwr) wrote :

Note: There are critical threshold levels exceeded on compute-2 each time this instance (instance-00000023) is live migrated from compute-2

see fm-event.log
2019-08-22T18:57:25.000 controller-0 fmManager: info { "event_log_id" : "700.008", "reason_text" : "Instance tenant1-volume-stress-1 owned by tenant1 is live migrating from host compute-2", "entity_instance_id" : "region=RegionOne.system=yow-cgcs-ironpass-20-27.tenant=71e4950f-b29c-40d8-ac13-742b4990dfad.instance=698586e7-0ff0-4f52-a9f5-76ff1693539c", "severity" : "warning", "state" : "set", "timestamp" : "2019-08-22 18:57:25.150504" }
2019-08-22T18:57:25.000 controller-0 fmManager: info { "event_log_id" : "700.152", "reason_text" : "Live-Migrate inprogress for instance tenant1-volume-stress-1 from host compute-2", "entity_instance_id" : "tenant=71e4950f-b29c-40d8-ac13-742b4990dfad.instance=698586e7-0ff0-4f52-a9f5-76ff1693539c", "severity" : "critical", "state" : "msg", "timestamp" : "2019-08-22 18:57:25.150691" }
2019-08-22T18:57:58.000 controller-0 fmManager: info { "event_log_id" : "700.156", "reason_text" : "Live-Migrate complete for instance tenant1-volume-stress-1 now enabled on host compute-1", "entity_instance_id" : "tenant=71e4950f-b29c-40d8-ac13-742b4990dfad.instance=698586e7-0ff0-4f52-a9f5-76ff1693539c", "severity" : "critical", "state" : "msg", "timestamp" : "2019-08-22 18:57:58.511765" }
2019-08-22T18:57:58.000 controller-0 fmManager: info { "event_log_id" : "700.008", "reason_text" : "Instance tenant1-volume-stress-1 owned by tenant1 is live migrating from host compute-2", "entity_instance_id" : "region=RegionOne.system=yow-cgcs-ironpass-20-27.tenant=71e4950f-b29c-40d8-ac13-742b4990dfad.instance=698586e7-0ff0-4f52-a9f5-76ff1693539c", "severity" : "warning", "state" : "clear", "timestamp" : "2019-08-22 18:57:58.602502" }
--->
2019-08-22T18:58:00.000 controller-0 fmManager: info { "event_log_id" : "100.101", "reason_text" : "Platform CPU threshold exceeded ; threshold 95.00%, actual 98.93%", "entity_instance_id" : "region=RegionOne.system=yow-cgcs-ironpass-20-27.host=compute-2", "severity" : "critical", "state" : "set", "timestamp" : "2019-08-22 18:58:00.296733" }
2019

...

2019-08-22T19:27:19.000 controller-0 fmManager: info { "event_log_id" : "700.008", "reason_text" : "Instance tenant1-volume-stress-1 owned by tenant1 is live migrating from host compute-2", "entity_instance_id" : "region=RegionOne.system=yow-cgcs-ironpass-20-27.tenant=71e4950f-b29c-40d8-ac13-742b4990dfad.instance=698586e7-0ff0-4f52-a9f5-76ff1693539c", "severity" : "warning", "state" : "set", "timestamp" : "2019-08-22 19:27:19.670505" }
2019-08-22T19:27:19.000 controller-0 fmManager: info { "event_log_id" : "700.152", "reason_text" : "Live-Migrate inprogress for instance tenant1-volume-stress-1 from host compute-2", "entity_instance_id" : "tenant=71e4950f-b29c-40d8-ac13-742b4990dfad.instance=698586e7-0ff0-4f52-a9f5-76ff1693539c", "severity" : "critical", "state" : "msg", "timestamp" : "2019-08-22 19:27:19.670740" }
2019-08-22T19:27:55.000 controller-0 fmManager: info { "event_log_id" : "700.008", "reason_text" : "Instance tenant1-volume-stress-1 owned by tenant1 is live migrating from host compute-2", "ent...


Revision history for this message
Wendy Mitchell (wmitchellwr) wrote :

Again, a ping failure after live migration from compute-2 to compute-0 at ~16:01.
Instance ID: fb8e41e1-d51b-4bec-b139-47be706f96f7

[2019-08-23 16:01:27,880] 1913 INFO MainThread vm_helper._ping_vms:: Ping successful from 128.224.186.181: {'192.168.104.84': 33}

[2019-08-23 16:01:32,886] 164 INFO MainThread test_migrate.test_migrate_stress:: Live migration iter25

[2019-08-23 16:01:37,288] 1288 INFO MainThread vm_helper.live_migrate_vm:: Live migrating VM fb8e41e1-d51b-4bec-b139-47be706f96f7 from compute-2 started.
[2019-08-23 16:01:37,288] 1289 INFO MainThread vm_helper.live_migrate_vm:: nova live-migration fb8e41e1-d51b-4bec-b139-47be706f96f7

[2019-08-23 16:01:39,265] 1300 INFO MainThread vm_helper.live_migrate_vm:: Waiting for VM status change to MIGRATING with best effort

[2019-08-23 16:01:41,791] 1761 INFO MainThread vm_helper.wait_for_vm_status:: VM status has reached MIGRATING
[2019-08-23 16:01:41,791] 1309 INFO MainThread vm_helper.live_migrate_vm:: Waiting for VM status change to original state ACTIVE

[2019-08-23 16:02:14,911] 1316 INFO MainThread vm_helper.live_migrate_vm:: Live migrate vm fb8e41e1-d51b-4bec-b139-47be706f96f7 completed

[2019-08-23 16:02:17,058] 1368 INFO MainThread vm_helper.live_migrate_vm:: VM fb8e41e1-d51b-4bec-b139-47be706f96f7 successfully migrated from compute-2 to compute-0
[2019-08-23 16:02:17,058] 1419 INFO MainThread ssh.get_natbox_client:: Getting NatBox Client...

[2019-08-23 16:02:19,825] 1666 INFO MainThread network_helper._get_net_ips_for_vms:: IPs dict: {'fb8e41e1-d51b-4bec-b139-47be706f96f7': ['192.168.104.84']}
[2019-08-23 16:02:19,826] 2525 INFO MainThread network_helper.ping_server:: Ping 192.168.104.84 from host 128.224.186.181

[2019-08-23 16:02:32,031] 2563 INFO MainThread network_helper.ping_server:: retry in 3 seconds

[2019-08-23 16:02:35,034] 2567 WARNING MainThread network_helper.ping_server:: Ping from 128.224.186.181 to 192.168.104.84 failed.
[2019-08-23 16:02:35,035] 1926 INFO MainThread vm_helper._ping_vms:: Ping unsuccessful from vm (logged in via 128.224.186.181): {'192.168.104.84': 100}
[2019-08-23 16:02:35,035] 1419 INFO MainThread ssh.get_natbox_client:: Getting NatBox Client...

Revision history for this message
Ghada Khalil (gkhalil) wrote :

How many times was this issue reproduced? The bug report says the issue is reproducible.
Is there a definitive recipe?

Changed in starlingx:
status: New → Incomplete
Revision history for this message
Wendy Mitchell (wmitchellwr) wrote :

It has been reproduced multiple times on IP 20-27
(test runs failed every time with ping failures; 2 runs failed without rebooting compute-0 at all).

Revision history for this message
Wendy Mitchell (wmitchellwr) wrote :

Looks like I can reproduce on the same lab (IP 20-27) with the following steps:

- Live migration iterations were successful for many iterations while only the compute-0 and compute-1 hosts were enabled and available.
- Unlocked compute-2, so compute-0, compute-1 and compute-2 were all enabled and available.
- Live migration iterations continued to be successful.

compute-2 (has HDD), hypervisor info:
  local_gb       | 5544
  cpu_info_model | IvyBridge-IBRS

compute-0/compute-1 (109 GB, SSD only), hypervisor info:
  cpu_info_model | IvyBridge-IBRS
  local_gb       | 109

--> Then compute-0 was rebooted (sudo reboot command issued).
At [2019-08-23 21:00:15,348] the automation logs indicate that tenant1-router (f224e410-8643-47ae-9f7a-16ccb1f6e1e7) is hosted on compute-2.

Automation logs reported a failure in the ping test on the subsequent live migration iteration from compute-2 to compute-1, although the migration appears to have been successful.

[2019-08-23 20:55:38,059] 164 INFO MainThread test_migrate.test_migrate_stress:: Live migration iter82
[2019-08-23 20:55:42,582]
Live migrating VM 813861c3-7e7c-414f-81ad-828e2151ed55 from compute-2 started.
[2019-08-23 20:56:22,953]
VM 813861c3-7e7c-414f-81ad-828e2151ed55 successfully migrated from compute-2 to compute-1
status | ACTIVE

At 20:52:20 the fm-events report the failures related to compute-0:
200.004 compute-0 experienced a service-affecting failure. Auto-recovery in progress. Manual Lock and Unlock may be required if auto-recovery is unsuccessful. host=compute-0 critical 2019-08-23T20:52:25
200.009 compute-0 experienced a persistent critical 'Cluster-host Network' communication failure. host=compute-0.network=Cluster-host critical 2019-08-23T20:52:20
200.005 compute-0 experienced a persistent critical 'Management Network' communication failure. host=compute-0.network=Management critical 2019-08-23T20:52:20

Ping test failures start here
[2019-08-23 20:56:40,466]
Ping from 128.224.186.181 to 192.168.104.93 failed.

Dariush Eslimi (deslimi)
tags: added: stx.3.0 stx.distro.openstack
Changed in starlingx:
status: Incomplete → Triaged
importance: Undecided → High
assignee: nobody → yong hu (yhu6)
Changed in starlingx:
assignee: yong hu (yhu6) → chen haochuan (martin1982)
Revision history for this message
chen haochuan (martin1982) wrote :

Reproduced the issue in the Shanghai lab; as there is no NatBox there, reproduced with the following steps.

On controller-0
$ openstack --os-username 'admin' --os-password 'Local.123' --os-project-name admin --os-auth-url http://keystone.openstack.svc.cluster.local/v3 --os-user-domain-name Default --os-project-domain-name Default --os-identity-api-version 3 --os-interface internal --os-region-name RegionOne image create --file cirros-0.4.0-x86_64-disk.img --disk-format qcow2 --container-format=bare cgcs-guest

$ openstack --os-username 'admin' --os-password 'Local.123' --os-project-name admin --os-auth-url http://keystone.openstack.svc.cluster.local/v3 --os-user-domain-name Default --os-project-domain-name Default --os-identity-api-version 3 --os-interface internal --os-region-name RegionOne volume create --size=1 --image=31f88e65-9dbe-4795-979a-2df286123ab1 --bootable vol_inst1-3

$ openstack --os-username 'admin' --os-password 'Local.123' --os-project-name admin --os-auth-url http://keystone.openstack.svc.cluster.local/v3 --os-user-domain-name Default --os-project-domain-name Default --os-identity-api-version 3 --os-interface internal --os-region-name RegionOne keypair create keypair-tenant1

$ nova --os-username 'admin' --os-password 'Local.123' --os-project-name admin --os-auth-url http://keystone.openstack.svc.cluster.local/v3 --os-user-domain-name Default --os-project-domain-name Default --os-endpoint-type internalURL --os-region-name RegionOne boot --key-name=keypair-tenant1 --poll --flavor=4bce0697-ec05-486f-b7a3-3355e63529ec --boot-volume=fa20dcec-34ae-4ceb-afbf-8ddbbfa89313 --nic net-id=2d8b491e-6fcb-4b72-a60b-2fe3658c0026 --nic net-id=6611c00a-de71-48c2-b4fc-48e8332c264a --meta foo=bar tenant1-config_drive-5

ssh to compute-1
$ nova --os-username 'admin' --os-password 'Local.123' --os-project-name admin --os-auth-url http://keystone.openstack.svc.cluster.local/v3 --os-user-domain-name Default --os-project-domain-name Default --os-endpoint-type internalURL --os-region-name RegionOne live-migration 4f4c6c52-e989-49ae-bff1-949d4ae8f1be compute-0

ssh to compute-0
$ sudo ip netns exec qdhcp-6611c00a-de71-48c2-b4fc-48e8332c264a ping 192.168.101.53

Could not ping the VM.
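
A hedged additional check under the same setup (if a neutron router is attached to the network) would be to repeat the ping from the router namespace, since the NatBox path in the original test goes through the router; the namespace name is a placeholder:

$ ip netns | grep qrouter
$ sudo ip netns exec qrouter-<router-uuid> ping -c 3 192.168.101.53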

Revision history for this message
Wendy Mitchell (wmitchellwr) wrote :

See automation log attached

The last live migration for instance 698586e7-0ff0-4f52-a9f5-76ff1693539c is at this time:
[2019-08-22 19:30:20,631] 1368 INFO MainThread vm_helper.live_migrate_vm:: VM 698586e7-0ff0-4f52-a9f5-76ff1693539c successfully migrated from compute-2 to compute-1

Revision history for this message
Wendy Mitchell (wmitchellwr) wrote :

Reproduced the issue once again - see info below and logs (attaching now)

### StarlingX
BUILD_ID="20190912T013000Z"

ww: Ironpass 20-27

I have reproduced the issue with the same reproduction scenario (i.e. live migrating the instance and rebooting compute-0 during the test).
I will update the LP with this info and the new logs.

For instance instance-0000000b (tenant1-volume-stress-1), ping from the NatBox to the VM failed.
I am able to log in to the instance using 'sudo virsh console instance-0000000b' and the instance appears to have an IP; however, I am unable to ping the internal interface of the tenant1-router (192.168.104.1) from the VM.

compute-1:~$ sudo virsh list
Password:
Id Name State
-----------------------------------
6 instance-0000000b running
7 instance-0000000e running

tenant1-mgmt-net subnets as follows:
• tenant1-mgmt0-subnet0 192.168.104.0/27
• tenant1-mgmt0-subnet1 192.168.104.32/27
• tenant1-mgmt0-subnet2 192.168.104.64/27

(29080474-5be2) 192.168.104.1 fa:16:3e:f6:83:91 network:router_interface Active UP
(49ce62a2-0270) 192.168.104.65 fa:16:3e:cc:9c:d1 network:router_interface Active UP
(4a3a2700-bd7e) 192.168.104.33 fa:16:3e:d9:5f:22 network:router_interface Active UP

ID 29080474-5be2-4885-aba3-d278d7a5c23c
Network Name tenant1-mgmt-net
Network ID 113023fe-ca52-4a01-9db9-5534324f9fcc
Project ID c824ca71b6fc441bbe6df26773223155
MAC Address fa:16:3e:f6:83:91
Status Active
Admin State UP
Port Security Enabled False
DNS Name None
DNS Assignment None
Fixed IPs

IP Address 192.168.104.1
Subnet ID 25d506d6-dc76-426f-a37d-38d5022319ef
Attached Device
Device Owner network:router_interface
Device ID 558ebc16-10f5-488f-9f78-bf4bf11f9733
Security Groups

tenant1-volume-stress-1 login: root
[ 7158.596103] login[11064]: pam_unix(login:session): session opened for user root by LOGIN(uid=0)
[ 7158.602495] systemd-logind[529]: New session 4 of user root.
[ 7158.604156] login[11064]: DIALUP AT ttyS0 BY root
[ 7158.605429] systemd[1]: Started Session 4 of user root.
[ 7158.607371] login[11064]: ROOT LOGIN ON ttyS0
[ 7158.609313] systemd[1]: Starting Session 4 of user root.
tenant1-volume-stress-1:~# ifconfig
[ 7167.675942] -bash[11086]: HISTORY: PID=11086 UID=0 ifconfig
eth0: flags=4163<UP,BROADCAST,RUNNING,MULTICAST> mtu 1500
        inet 192.168.104.7 netmask 255.255.255.224 broadcast 192.168.104.31
        inet6 fe80::f816:3eff:fe93:728c prefixlen 64 scopeid 0x20<link>
        ether fa:16:3e:93:72:8c txqueuelen 1000 (Ethernet)
        RX packets 1068 bytes 102974 (100.5 KiB)
        RX errors 0 dropped 0 overruns 0 frame 0
        TX packets 700 bytes 50286 (49.1 KiB)
        TX errors 0 dropped 0 overruns 0 carrier 0 collisions 0

eth1: flags=4163<UP,BROADCAST,RUNNING,MULTICAST> mtu 1500
        inet 172.16.3.246 netmask 255.255.255.0 broadcast 172.16.3.255
        inet6 fe80::f816:3eff:fe44:8ea9 prefixlen 64 scopeid 0x20<link>
        ether fa:16:3e:44:8e:a9 txqueuelen 1000 (Ethernet)
        RX packe...

Revision history for this message
Wendy Mitchell (wmitchellwr) wrote :

(tenant1 user)
tenant1-volume-stress-1 -
    tenant1-net3 172.16.3.246
    tenant1-mgmt-net 192.168.104.7
flavor-default-size2 Active None Running 58 minutes

flavor flavor-default-size2 1 1GB 2GB 39aeccae-ff82-496f-aade-c31f8c03eac8
hw:mem_page_size=large

[2019-09-16 18:46:01,361] 1316 INFO MainThread vm_helper.live_migrate_vm:: Live migrate vm f9b18cc7-4002-499b-83a3-fd82b77cb68a completed

[2019-09-16 18:46:03,467] 1368 INFO MainThread vm_helper.live_migrate_vm:: VM f9b18cc7-4002-499b-83a3-fd82b77cb68a successfully migrated from compute-3 to compute-1
...
2019-09-16 18:49:18,395] 2525 INFO MainThread network_helper.ping_server:: Ping 192.168.104.7 from host 128.224.186.181
[2019-09-16 18:49:30,619] 2563 INFO MainThread network_helper.ping_server:: retry in 3 seconds
[2019-09-16 18:49:33,623] 2567 WARNING MainThread network_helper.ping_server:: Ping from 128.224.186.181 to 192.168.104.7 failed.
[2019-09-16 18:49:33,623] 1926 INFO MainThread vm_helper._ping_vms:: Ping unsuccessful from vm (logged in via 128.224.186.181): {'192.168.104.7': 100}
...
E Details: Ping from NatBox to vm f9b18cc7-4002-499b-83a3-fd82b77cb68a failed for 200 seconds.

Revision history for this message
chen haochuan (martin1982) wrote :

The VM could not ping the qrouter, so this should be a neutron or OVS issue. Requesting transfer to the network team.

yong hu (yhu6)
Changed in starlingx:
assignee: chen haochuan (martin1982) → nobody
tags: added: stx.networking
tags: added: stx.retestneeded
Cindy Xie (xxie1)
Changed in starlingx:
assignee: nobody → Forrest Zhao (forrest.zhao)
Changed in starlingx:
assignee: Forrest Zhao (forrest.zhao) → Gongjun Song (songgongjun)
Revision history for this message
Gongjun Song (songgongjun) wrote :

I installed an AIO-Duplex environment using the September 25, 2019 image.
When I perform lock/unlock on the controller-1 node, controller-0 does not show the CPU exceeding the threshold, and there is no corresponding network error in ovs-vswitchd.log.
However, both of my nodes were in a degraded state. I don't know whether that is related, so I will rebuild the AIO-Duplex environment and test again.

Revision history for this message
Gongjun Song (songgongjun) wrote :

@Wendy Mitchell
Are you deploying StarlingX in a virtual environment? I see you ping from the NatBox.

If you are using a physical server to deploy the environment, is OVS-DPDK used or not?

Revision history for this message
Wendy Mitchell (wmitchellwr) wrote :

From the log collection file you can see the configuration;
see var/extra/database/sysinv.db.sql.txt - this indicates that "vswitch_type": "none" (the vswitch is containerized).
HW lab (2 controllers, 2 storage, 4 worker nodes).

Revision history for this message
Gongjun Song (songgongjun) wrote :

I tried to reproduce this bug in a virtual machine, but I cannot reproduce it on a duplex system using the 1030 ISO image and OVS.
I created an instance and live migrated it between controller-0 and controller-1. When I ping the instance IP in the network namespace multiple times, the ping always succeeds.
The log is as follows:

controller-0:~$ openstack server create --image cirros --flavor m1.tiny --network public-net0 vm1 --security-group security1

controller-0:~$ openstack server show vm1
+-------------------------------------+----------------------------------------------------------+
| Field | Value |
+-------------------------------------+----------------------------------------------------------+
| OS-DCF:diskConfig | MANUAL |
| OS-EXT-AZ:availability_zone | nova |
| OS-EXT-SRV-ATTR:host | controller-0 |
| OS-EXT-SRV-ATTR:hypervisor_hostname | controller-0 |
| OS-EXT-SRV-ATTR:instance_name | instance-00000020 |
| OS-EXT-STS:power_state | Running |
| OS-EXT-STS:task_state | None |
| OS-EXT-STS:vm_state | active |
| OS-SRV-USG:launched_at | 2019-11-11T06:43:36.000000 |
| OS-SRV-USG:terminated_at | None |
| accessIPv4 | |
| accessIPv6 | |
| addresses | public-net0=192.168.101.124 |
| config_drive | |
| created | 2019-11-11T06:43:22Z |
| flavor | m1.tiny (8f06630c-5562-45cb-8388-1cacc54a0db8) |
| hostId | d6a9ce7a8bce088e58adce0b116bac1932e3c4c95e3f1d41b749e3dc |
| id | e3e51909-b05b-483b-822b-2f719c6b598e |
| image | cirros (271e62a3-f3fd-47ad-aa06-de9f1dbcdb0a) |
| key_name | None |
| name | vm1 |
| progress | 0 |
| project_id | 820c5ebe5a164087815efc522ce79920 |
| properties | |
| security_groups ...


Revision history for this message
yong hu (yhu6) wrote :

@songgongjun, have you tried to ping the VMs from outside through the data network?
My understanding is that the original testing was not pinging between VMs.

Revision history for this message
Gongjun Song (songgongjun) wrote :

@yhu6, the log I posted above is the result of pinging the VMs from outside through the data network. Next I am going to try to reproduce this bug on bare metal.

Revision history for this message
Gongjun Song (songgongjun) wrote :

@martin1982, did you set up a specific security group when you created the virtual machine using openstack?

The log of the virtual machine you created is as follows:
$ nova --os-username 'admin' --os-password 'Local.123' --os-project-name admin --os-auth-url http://keystone.openstack.svc.cluster.local/v3 --os-user-domain-name Default --os-project-domain-name Default --os-endpoint-type internalURL --os-region-name RegionOne boot --key-name=keypair-tenant1 --poll --flavor=4bce0697-ec05-486f-b7a3-3355e63529ec --boot-volume=fa20dcec-34ae-4ceb-afbf-8ddbbfa89313 --nic net-id=2d8b491e-6fcb-4b72-a60b-2fe3658c0026 --nic net-id=6611c00a-de71-48c2-b4fc-48e8332c264a --meta foo=bar tenant1-config_drive-5

From the virtual machine creation log, you are not specifying a security group. If no security group is specified, the default security group does not allow you to ping the VMs from outside through the data network. (A hedged example of the ingress rules that would be needed is sketched below.)
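
For reference, a hedged sketch of the ingress rules that would have to exist on the instance's security group for external ICMP/SSH checks to work (the group name "default" is illustrative; the WR rule listing later in this thread shows such rules already present in their lab):

$ openstack security group rule create --ingress --protocol icmp default
$ openstack security group rule create --ingress --protocol tcp --dst-port 22 default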

Changed in starlingx:
assignee: Gongjun Song (songgongjun) → chen haochuan (martin1982)
Revision history for this message
Gongjun Song (songgongjun) wrote :

@martin1982, I saw that you assigned the bug to yourself. Are you planning to reproduce it?

Revision history for this message
Gongjun Song (songgongjun) wrote :

I also can't reproduce this bug on bare metal with an AIO duplex system using the 1030 ISO image and OVS. Next, I will build a virtual environment with 2 controller nodes and 3 compute nodes to try to reproduce this bug.

Revision history for this message
yong hu (yhu6) wrote :

Thanks @Gongjun. Please consider using the latest build from CENGN when you run the test next time.
I can provide the build URL to you if needed.

Revision history for this message
Gongjun Song (songgongjun) wrote :

@yhu6, thanks for your advice. I will use the latest CENGN build when I run the test next time.

Revision history for this message
chen haochuan (martin1982) wrote :

For the WR reproduction steps: the "default" security group already has rules for ICMP and TCP port 22.

[2019-08-22 18:19:54,133] 301 DEBUG MainThread ssh.send :: Send 'openstack --os-username 'tenant1' --os-password 'Li69nux*' --os-project-name tenant1 --os-auth-url http://keystone.openstack.svc.cluster.local/v3 --os-user-domain-name Default --os-project-domain-name Default --os-identity-api-version 3 --os-interface internal --os-region-name RegionOne security group rule list --long'
[2019-08-22 18:19:55,891] 423 DEBUG MainThread ssh.expect :: Output:
+--------------------------------------+-------------+-----------+------------+-----------+-----------+--------------------------------------+--------------------------------------+
| ID | IP Protocol | IP Range | Port Range | Direction | Ethertype | Remote Security Group | Security Group |
+--------------------------------------+-------------+-----------+------------+-----------+-----------+--------------------------------------+--------------------------------------+
| 828f8902-681c-4dbd-b5a6-1e5ee477c1c9 | None | None | | egress | IPv4 | None | 37e619d8-1724-4dc7-a130-d4df9ffecb0d |
| 927ef8e5-b10a-4f72-8c46-aefaa5f8b6f0 | icmp | 0.0.0.0/0 | | ingress | IPv4 | None | 37e619d8-1724-4dc7-a130-d4df9ffecb0d |
| c0f13b7f-e844-402e-baad-eda4113097b5 | None | None | | ingress | IPv6 | 37e619d8-1724-4dc7-a130-d4df9ffecb0d | 37e619d8-1724-4dc7-a130-d4df9ffecb0d |
| dd9ff74d-1003-44a6-bcdc-5e5d20f613eb | None | None | | ingress | IPv4 | 37e619d8-1724-4dc7-a130-d4df9ffecb0d | 37e619d8-1724-4dc7-a130-d4df9ffecb0d |
| e13c763b-5030-4071-ba8d-1d9582b8ed0c | None | None | | egress | IPv6 | None | 37e619d8-1724-4dc7-a130-d4df9ffecb0d |
| e5088825-d652-41fd-b90d-aa4c7245f5b6 | tcp | 0.0.0.0/0 | 22:22 | ingress | IPv4 | None | 37e619d8-1724-4dc7-a130-d4df9ffecb0d |
+--------------------------------------+-------------+-----------+------------+-----------+-----------+--------------------------------------+--------------------------------------+
[sysadmin@controller-0 ~(keystone_admin)]$

At first, the VM could be pinged from external:
[2019-08-22 19:33:52,813] 2525 INFO MainThread network_helper.ping_server:: Ping 192.168.104.33 from host 128.224.186.181
[2019-08-22 19:33:52,813] 466 DEBUG MainThread ssh.exec_cmd:: Executing command...
[2019-08-22 19:33:52,813] 301 DEBUG MainThread ssh.send :: Send 'ping -c 3 192.168.104.33'
[2019-08-22 19:33:54,914] 423 DEBUG MainThread ssh.expect :: Output:
PING 192.168.104.33 (192.168.104.33) 56(84) bytes of data.
64 bytes from 192.168.104.33: icmp_seq=1 ttl=64 time=0.100 ms

Will set up a 2+3 configuration to check.

Changed in starlingx:
assignee: chen haochuan (martin1982) → Gongjun Song (songgongjun)
Revision history for this message
Gongjun Song (songgongjun) wrote :

@Wendy Mitchell, I set up a virtual 2+3 (2-controller, 3-compute) system using the 1120 ISO image and OVS, and this bug also cannot be reproduced in that environment. Can you still reproduce this bug? If you can, could you elaborate on the steps to reproduce?

The log is as follows:

controller-0:nova --os-username 'admin' --os-password 'Local.123' --os-project-name admin --os-auth-url http://keystone.openstack.svc.cluster.local/v3 --os-user-domain-name Default --os-project-domain-name Default --os-endpoint-type internalURL --os-region-name RegionOne boot --key-name=keypair-tenant1 --poll --flavor=1f2d052a-7891-4c99-934b-a362547d4198 --boot-volume=0ae7ea19-af67-476b-8552-93578fe7ca2f --nic net-id=56a66d9c-0fbf-4bd9-a6ec-429e28e3253d --nic net-id=dabcebd3-accb-418b-8f7b-9a656fa60fb0 --nic net-id=f2a1eed0-1823-4fe4-b0fb-99de85c21ed8 --meta foo=bar tenant1-config_drive-5

controller-0:~$ openstack --os-username 'admin' --os-password 'Local.123' --os-project-name admin --os-auth-url http://keystone.openstack.svc.cluster.local/v3 --os-user-domain-name Default --os-project-domain-name Default --os-identity-api-version 3 --os-interface internal --os-region-name RegionOne server show tenant1-config_drive-5
+-------------------------------------+----------------------------------------------------------------------------------------+
| Field | Value |
+-------------------------------------+----------------------------------------------------------------------------------------+
| OS-DCF:diskConfig | MANUAL |
| OS-EXT-AZ:availability_zone | nova |
| OS-EXT-SRV-ATTR:host | compute-1 |
| OS-EXT-SRV-ATTR:hypervisor_hostname | compute-1 |
| OS-EXT-SRV-ATTR:instance_name | instance-00000006 |
| OS-EXT-STS:power_state | Running |
| OS-EXT-STS:task_state | None |
| OS-EXT-STS:vm_state | active |
| OS-SRV-USG:launched_at | 2019-11-21T05:44:03.000000 |
| OS-SRV-USG:terminated_at | None |
| accessIPv4 | |
| accessIPv6 | |
| addresses | externa...

Revision history for this message
chen haochuan (martin1982) wrote :

openstack --os-username 'admin' --os-password 'Local.123' --os-project-name admin --os-auth-url http://keystone.openstack.svc.cluster.local/v3 --os-user-domain-name Default --os-project-domain-name Default --os-identity-api-version 3 --os-interface internal --os-region-name RegionOne image create --file cirros-0.4.0-x86_64-disk.img --disk-format qcow2 --container-format=bare cgcs-guest
openstack --os-username 'admin' --os-password 'Local.123' --os-project-name admin --os-auth-url http://keystone.openstack.svc.cluster.local/v3 --os-user-domain-name Default --os-project-domain-name Default --os-identity-api-version 3 --os-interface internal --os-region-name RegionOne volume create --size=1 --image=aa3eb976-c606-42ce-8c3b-ad90b315ccbf --bootable vol_inst1-3
openstack --os-username 'admin' --os-password 'Local.123' --os-project-name admin --os-auth-url http://keystone.openstack.svc.cluster.local/v3 --os-user-domain-name Default --os-project-domain-name Default --os-identity-api-version 3 --os-interface internal --os-region-name RegionOne keypair create keypair-tenant1
openstack --os-username 'admin' --os-password 'Local.123' --os-project-name admin --os-auth-url http://keystone.openstack.svc.cluster.local/v3 --os-user-domain-name Default --os-project-domain-name Default --os-identity-api-version 3 --os-interface internal --os-region-name RegionOne keypair list
openstack --os-username 'admin' --os-password 'Local.123' --os-project-name admin --os-auth-url http://keystone.openstack.svc.cluster.local/v3 --os-user-domain-name Default --os-project-domain-name Default --os-identity-api-version 3 --os-interface internal --os-region-name RegionOne volume list
openstack --os-username 'admin' --os-password 'Local.123' --os-project-name admin --os-auth-url http://keystone.openstack.svc.cluster.local/v3 --os-user-domain-name Default --os-project-domain-name Default --os-identity-api-version 3 --os-interface internal --os-region-name RegionOne network list
openstack --os-username 'admin' --os-password 'Local.123' --os-project-name admin --os-auth-url http://keystone.openstack.svc.cluster.local/v3 --os-user-domain-name Default --os-project-domain-name Default --os-identity-api-version 3 --os-interface internal --os-region-name RegionOne volume list
openstack --os-username 'admin' --os-password 'Local.123' --os-project-name admin --os-auth-url http://keystone.openstack.svc.cluster.local/v3 --os-user-domain-name Default --os-project-domain-name Default --os-identity-api-version 3 --os-interface internal --os-region-name RegionOne flavor list

nova --os-username 'admin' --os-password 'Local.123' --os-project-name admin --os-auth-url http://keystone.openstack.svc.cluster.local/v3 --os-user-domain-name Default --os-project-domain-name Default --os-endpoint-type internalURL --os-region-name RegionOne boot --key-name=keypair-tenant1 --poll --flavor=1f2d052a-7891-4c99-934b-a362547d4198 --boot-volume=0ae7ea19-af67-476b-8552-93578fe7ca2f --nic net-id=56a66d9c-0fbf-4bd9-a6ec-429e28e3253d --nic net-id=dabcebd3-accb-418b-8f7b-9a656fa60fb0 --nic net-id=f2a1eed0-1823-4fe4-b0fb-99de85c21ed8 --meta foo=bar tenant1-config_drive-5

openstack --os-username 'admin' --os-password ...


Revision history for this message
Yang Liu (yliu12) wrote :

Hi Gongjun and Haochuan,

I think the VM needs to be reached via the neutron router.

In our environment, the VM's eth0 is reached via a server (we call it the NatBox) that can reach the router's external gateway; the traffic then goes through the router to one of the router's internal interfaces, and then to the VM. (An illustrative sketch follows the port listing below.)

[sysadmin@controller-0 ~(keystone_admin)]$ openstack port list --router=tenant2-router --long
+--------------------------------------+------+-------------------+-------------------------------------------------------------------------------+--------+-----------------+--------------------------+------+
| ID | Name | MAC Address | Fixed IP Addresses | Status | Security Groups | Device Owner | Tags |
+--------------------------------------+------+-------------------+-------------------------------------------------------------------------------+--------+-----------------+--------------------------+------+
| 09ef76d7-caea-444c-9a20-665999b223e0 | | fa:16:3e:00:0e:7a | ip_address='192.168.220.1', subnet_id='d3ba508c-6ca9-4a5f-ade6-baf31a465d8e' | ACTIVE | | network:router_interface | |
| 3996236a-e33f-44d5-aeb5-d5ddb21fce62 | | fa:16:3e:e6:af:a7 | ip_address='192.168.20.3', subnet_id='de25fcbb-6d74-4847-a04c-fd15a60a072f' | ACTIVE | | network:router_gateway | |
| 566d17b7-905c-406f-940b-c2a9a5a61245 | | fa:16:3e:d8:3c:4e | ip_address='192.168.220.65', subnet_id='b79acfe2-74f3-42e9-9dfd-7a5f2e58ada4' | ACTIVE | | network:router_interface | |
| af0818df-883a-4468-94d0-eb435e8f48b1 | | fa:16:3e:60:eb:97 | ip_address='192.168.220.33', subnet_id='f5802f33-a85e-4d91-bb12-b0a3a9e78ca2' | ACTIVE | | network:router_interface | |
+--------------------------------------+------+-------------------+-------------------------------------------------------------------------------+--------+-----------------+--------------------------+------+
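
To make the path above concrete, a hedged illustration of how a NatBox host could reach the tenant subnets via the router's external gateway (addresses taken from the port listing above; the summary route and VM IP are assumptions about the NatBox setup, not a confirmed configuration):

# On the NatBox: route the tenant2 subnets via the router's external gateway address
$ sudo ip route add 192.168.220.0/24 via 192.168.20.3
# Traffic to a VM then enters the router on its gateway port and exits on an internal interface
$ ping -c 3 <vm-ip-on-192.168.220.x>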

Revision history for this message
Gongjun Song (songgongjun) wrote :

@yliu12, what kind of network environment do you have, and how is the network connected to the router? Can you provide more specific test steps: how/where do you ping the VM to verify connectivity?

Revision history for this message
Peng Peng (ppeng) wrote :

Issue was reproduced on train (3/11)
Lab: wcp_3-6
Load: 2019-11-21_20-00-00

[2019-11-22 21:47:10,381] 311 DEBUG MainThread ssh.send :: Send 'ping -c 3 192.168.120.21'
[2019-11-22 21:47:22,444] 433 DEBUG MainThread ssh.expect :: Output:
PING 192.168.120.21 (192.168.120.21) 56(84) bytes of data.

--- 192.168.120.21 ping statistics ---
3 packets transmitted, 0 received, 100% packet loss, time 1999ms

Revision history for this message
Gongjun Song (songgongjun) wrote :

@ppeng, can you keep this environment? I may need more log information. In the meantime, I will try to reproduce it in a 2+3 environment again.

Revision history for this message
Peng Peng (ppeng) wrote :

Issue was reproduced (1/36) on
Lab: WCP_3_6
Load: 2019-12-01_20-00-00

The lab is kept for investigation now.

Revision history for this message
Gongjun Song (songgongjun) wrote :

@ppeng, thanks very much for the detailed log information. I checked your log and this issue seems to happen only after several migrations. I still have some questions in order to reproduce this issue for more investigation; could you please help to clarify?
1. This issue does not seem to happen after just a single migration operation. In general, how many migrations does it take to cause this issue? When the issue happens, can the VM be pinged internally from the neutron router/DHCP namespace (to verify the neutron router still works)? A command sketch for this check follows below.
2. How does the NatBox server (this should be a physical machine rather than a neutron-created VM, right?) connect to the StarlingX neutron network: via a physical router or the neutron router? Is it possible to draw a topology diagram of the network so that I can understand it more easily?
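
A hedged sketch of the namespace ping check mentioned in question 1, run on the node hosting the DHCP agent or router; the namespace names and VM IP are placeholders (the same qdhcp form was used in the Shanghai reproduction earlier in this thread):

$ ip netns
$ sudo ip netns exec qdhcp-<network-uuid> ping -c 3 <vm-ip>
$ sudo ip netns exec qrouter-<router-uuid> ping -c 3 <vm-ip>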

Revision history for this message
Wendy Mitchell (wmitchellwr) wrote :

I also reproduced the ping failure issue while testing live migrations
In this example, the 8th live migration iteration exhibited the ping failure

The HW system (PV-1) was installed with stx-openstack
(2 controllers, 2 storage, compute-0 through compute-7)

$ system application-list
+---------------------+------------------------------+-------------------------------+-----------------+---------+-----------+
| application | version | manifest name | manifest file | status | progress |
+---------------------+------------------------------+-------------------------------+-----------------+---------+-----------+
| platform-integ-apps | 1.0-8 | platform-integration-manifest | manifest.yaml | applied | completed |
| stx-openstack | 1.0-19-centos-stable- | armada-manifest | stx-openstack. | applied | completed |
| | versioned | | yaml | | |
| | | | | | |
+---------------------+------------------------------+-------------------------------+--------------

[The live migration to compute-6 seems to be when it started]
[2019-12-04 19:37:11,844] 1316 INFO MainThread vm_helper.live_migrate_vm:: Live migrate vm 358526fa-0f85-45e9-b246-63be77384e3d completed
[2019-12-04 19:37:13,951] 1368 INFO MainThread vm_helper.live_migrate_vm:: VM 358526fa-0f85-45e9-b246-63be77384e3d successfully migrated from compute-1 to compute-6
[2019-12-04 19:37:13,951] 1484 INFO MainThread ssh.get_natbox_client:: Getting NatBox Client...
[2019-12-04 19:37:16,074] 1666 INFO MainThread network_helper._get_net_ips_for_vms:: IPs dict: {'358526fa-0f85-45e9-b246-63be77384e3d': ['192.168.131.50']}
[2019-12-04 19:37:16,074] 2525 INFO MainThread network_helper.ping_server:: Ping 192.168.131.50 from host 128.224.186.181
[2019-12-04 19:37:28,281] 2563 INFO MainThread network_helper.ping_server:: retry in 3 seconds
[2019-12-04 19:37:31,284] 2567 WARNING MainThread network_helper.ping_server:: Ping from 128.224.186.181 to 192.168.131.50 failed.

[Live migration 44 from compute-1 to compute-6 appears to complete but the ping failure is observed at this point]

| 44 | de40b7bd-66c7-437e-8d5e-6d9d08cd47f9 | compute-1 | compute-6 | compute-1 | compute-6 | 192.168.206.155 | completed | 358526fa-0f85-45e9-b246-63be77384e3d | 32 | 32 | 2019-12-04T19:36:48.000000 | 2019-12-04T19:37:09.000000 | live-migration |
| 41 | 994aa781-380b-4705-b308-83ac95017ade | compute-3 | compute-1 | compute-3 | compute-1 | 192.168.206.234 | completed | 358526fa-0f85-45e9-b246-63be77384e3d | 32 | 32 | 2019-12-04T19:36:08.000000 | 2019-12-04T19:36:30.000000 | live-migration |
| 38 | b6bdc142-f5b8-47f2-9d63-97ac094ac2c1 | compute-2 | compute-3 | compute-2 | compute-3 | 192.168.206.49 | completed | 358526fa-0f85-45e9-b246-63be77384e3d | 32 | 32 | 2019-12-04T19:35:29.000000 | 2019-12-04T19:35:51.000000 | live-migration |
| 35 | efb...

Revision history for this message
Wendy Mitchell (wmitchellwr) wrote :

instance on compute-6:
    tenant1-net10 172.16.10.135
    tenant1-mgmt-net 192.168.131.50

Name tenant1-router
ID cc260ce1-4dbf-473d-bd58-17cbbffe151e
Description (empty)
Project ID 35b64b45763e4c9f9ae92f4a659cb656
Status Active
Admin State UP
L3 Agent Host compute-1 (354e0515-c7c0-4779-8c27-bb47b067b3ca)
Availability Zones nova

External Gateway
Network Name external-net0
Network ID 3f0b67e5-f3a2-44b3-ac0c-f39f9cc52f7a
External Fixed IPs
        Subnet ID 2fdd9804-fef4-455f-9f1a-e1738110e62e
        IP Address 192.168.31.2
SNAT Disabled

(20b14c7e-0453) 192.168.131.1 Active Internal Interface UP
(3006e0df-0a95) 192.168.131.65 Active Internal Interface UP
(69f7186e-f008) 192.168.131.33 Active Internal Interface UP
(72149874-f1fe) 192.168.31.2 Active External Gateway UP

tenant1-mgmt0-subnet1 192.168.131.32/27 IPv4 192.168.131.33 3 26
tenant1-mgmt0-subnet0 192.168.131.0/27 IPv4 192.168.131.1 2 27
tenant1-mgmt0-subnet2 192.168.131.64/27 IPv4 192.168.131.65 2 27

compute-6:~$ sudo virsh console instance-0000000b
Connected to domain instance-0000000b

tenant1-volume-stress-1:~# ip add
[ 3752.763497] -bash[1597]: HISTORY: PID=1597 UID=0 ip add
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN qlen 1000
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
    inet 127.0.0.1/8 scope host lo
       valid_lft forever preferred_lft forever
    inet6 ::1/128 scope host
       valid_lft forever preferred_lft forever
2: eth0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc pfifo_fast state UP qlen 1000
    link/ether fa:16:3e:77:6e:55 brd ff:ff:ff:ff:ff:ff
    inet 192.168.131.50/27 brd 192.168.131.63 scope global dynamic eth0
       valid_lft 82655sec preferred_lft 82655sec
    inet6 fe80::f816:3eff:fe77:6e55/64 scope link
       valid_lft forever preferred_lft forever
3: eth1: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc pfifo_fast state UP qlen 1000
    link/ether fa:16:3e:1a:98:e7 brd ff:ff:ff:ff:ff:ff
    inet 172.16.10.135/24 brd 172.16.10.255 scope global dynamic eth1
       valid_lft 82661sec preferred_lft 82661sec
    inet6 fe80::f816:3eff:fe1a:98e7/64 scope link
       valid_lft forever preferred_lft forever
tenant1-volume-stress-1:~# ping 0.0.0.0
[ 3759.638041] -bash[1597]: HISTORY: PID=1597 UID=0 ping 0.0.0.0
PING 0.0.0.0 (127.0.0.1) 56(84) bytes of data.
64 bytes from 127.0.0.1: icmp_seq=1 ttl=64 time=0.022 ms
64 bytes from 127.0.0.1: icmp_seq=2 ttl=64 time=0.018 ms
64 bytes from 127.0.0.1: icmp_seq=3 ttl=64 time=0.031 ms

--- 0.0.0.0 ping statistics ---
3 packets transmitted, 3 received, 0% packet loss, time 2000ms
rtt min/avg/max/mdev = 0.018/0.023/0.031/0.007 ms
tenant1-volume-stress-1:~# ping 192.168.131.1
[ 3770.922772] -bash[1597]: HISTORY: PID=1597 UID=0 ping 192.168.131.1
PING 192.168.131.1 (192.168.131.1) 56(84) bytes of data.
From 192.168.131.50 icmp_seq=1 Destination Host Unreachable
From 192.168.131.50 icmp_seq=2 Destination Host Unreachable
From 192.168.131.50 icmp_seq=3 Destination Host Unreachable
From 192.168.131.50 icmp_seq=4 Destination Host Unreachable

--- 192.168.131.1 ping statistics ---
5 packets transmi...


Revision history for this message
Ghada Khalil (gkhalil) wrote :

Leaving as gating stx.3.0 given that the issue is reproduced on OpenStack Train.

Revision history for this message
Gongjun Song (songgongjun) wrote :

@Wendy, @Peng, I am trying to reproduce this bug; can you please do the following test:
1. Create a subnet in a network (for example, tenant1-mgmt-net with only one subnet, tenant1-mgmt0-subnet0).
2. Under this network, create two instances, vm1 and vm2.
3. Check whether the NatBox can ping vm1.
4. Check whether the NatBox can ping vm2.
5. Check whether vm1 can ping vm2. (If steps 3, 4 and 5 succeed, proceed to the next step.)
6. Live migrate vm1; leave vm2 unchanged.
7. When the NatBox can no longer ping vm1, repeat steps 4 and 5.
8. At this time, can you successfully ping vm1 and vm2 from the router's namespace? (The commands are as follows.)
   $ ip netns
   $ sudo ip netns exec "router's namespace" bash
   $ ping "vm1's ip"
   $ ping "vm2's ip"
9. Check the status of OVS at this time (the command is as follows).
   $ systemctl status ovs-vswitchd

Revision history for this message
Gongjun Song (songgongjun) wrote :

Checking Peng's log: when the NatBox fails to ping the instance after it is migrated, deleting this instance and creating a new one results in the NatBox pinging the new instance successfully. After migrating the new instance, the NatBox can still ping it successfully. The relevant log information is as follows:

(192.168.120.26 is the IP of the instance that the NatBox failed to ping after migration)
controller-0:~$
[2019-11-23 05:12:31,087] 1654 DEBUG MainThread network_helper._get_net_ips_for_vms:: targeted ips for vm: ['192.168.120.26']
[2019-11-23 05:12:31,088] 1666 INFO MainThread network_helper._get_net_ips_for_vms:: IPs dict: {'a30dfd31-ed78-4a88-97d2-02aa762ff9ff': ['192.168.120.26']}
[2019-11-23 05:12:31,088] 2525 INFO MainThread network_helper.ping_server:: Ping 192.168.120.26 from host 128.224.186.181
[2019-11-23 05:12:31,089] 476 DEBUG MainThread ssh.exec_cmd:: Executing command...
[2019-11-23 05:12:31,089] 311 DEBUG MainThread ssh.send :: Send 'ping -c 3 192.168.120.26'
[2019-11-23 05:12:42,909] 433 DEBUG MainThread ssh.expect :: Output:
PING 192.168.120.26 (192.168.120.26) 56(84) bytes of data.

--- 192.168.120.26 ping statistics ---
3 packets transmitted, 0 received, 100% packet loss, time 1999ms

(192.168.120.22 is the new instance's IP; this is the log before migration)
controller-0:~$
[2019-11-23 05:25:49,088] 1654 DEBUG MainThread network_helper._get_net_ips_for_vms:: targeted ips for vm: ['192.168.120.22']
[2019-11-23 05:25:49,088] 1666 INFO MainThread network_helper._get_net_ips_for_vms:: IPs dict: {'ea9e2ad0-b06f-4b5d-a3e9-bee87ac5f456': ['192.168.120.22']}
[2019-11-23 05:25:49,089] 2525 INFO MainThread network_helper.ping_server:: Ping 192.168.120.22 from host 128.224.186.181
[2019-11-23 05:25:49,089] 476 DEBUG MainThread ssh.exec_cmd:: Executing command...
[2019-11-23 05:25:49,090] 311 DEBUG MainThread ssh.send :: Send 'ping -c 3 192.168.120.22'
[2019-11-23 05:25:51,148] 433 DEBUG MainThread ssh.expect :: Output:
PING 192.168.120.22 (192.168.120.22) 56(84) bytes of data.
64 bytes from 192.168.120.22: icmp_seq=1 ttl=63 time=0.736 ms
64 bytes from 192.168.120.22: icmp_seq=2 ttl=63 time=0.664 ms
64 bytes from 192.168.120.22: icmp_seq=3 ttl=63 time=0.273 ms

--- 192.168.120.22 ping statistics ---
3 packets transmitted, 3 received, 0% packet loss, time 1999ms
rtt min/avg/max/mdev = 0.273/0.557/0.736/0.205 ms

(This is the log information after migration)
controller-0:~$
[2019-11-23 05:28:17,789] 1654 DEBUG MainThread network_helper._get_net_ips_for_vms:: targeted ips for vm: ['192.168.120.22']
[2019-11-23 05:28:17,789] 1666 INFO MainThread network_helper._get_net_ips_for_vms:: IPs dict: {'ea9e2ad0-b06f-4b5d-a3e9-bee87ac5f456': ['192.168.120.22']}
[2019-11-23 05:28:17,790] 2525 INFO MainThread network_helper.ping_server:: Ping 192.168.120.22 from host 128.224.186.181
[2019-11-23 05:28:17,791] 476 DEBUG MainThread ssh.exec_cmd:: Executing command...
[2019-11-23 05:28:17,791] 311 DEBUG MainThread ssh.send :: Send 'ping -c 3 192.168.120.22'
[2019-11-23 05:28:19,869] 433 DEBUG MainThread ssh.expect :: Output:
PING 192.168.120.22 (192.168.120.22) 56(84) bytes of data.
64 bytes from 192.168.120....


Revision history for this message
Gongjun Song (songgongjun) wrote :

@Wendy, I found the following information in your /var/log/openvswitch/ovs-vswitchd log:
2019-12-04T19:37:07.759Z|00728|netdev_linux|WARN|ethtool command ETHTOOL_GDRVINFO on network device tap1fd78fc1-56 failed: No such device
2019-12-04T19:37:07.793Z|00729|netdev_linux|WARN|ethtool command ETHTOOL_GDRVINFO on network device tapaec2ad7f-d1 failed: No such device
2019-12-04T19:37:07.821Z|00730|netdev_linux|WARN|ethtool command ETHTOOL_GDRVINFO on network device tap43b81a46-fc failed: No such device
2019-12-04T19:37:07.829Z|00731|netdev_linux|WARN|ethtool command ETHTOOL_GDRVINFO on network device tap51620f1e-97 failed: No such device

In order to check whether the "No such device" errors caused the ping failure after the instance migration, in addition to the previous test, could you help me collect some more logs? (A small cross-check sketch follows the steps below.)
1. On the controller-0 node, check the ovs-vswitchd log of the compute node where the virtual machine is located.
$: kubectl get pods -n openstack -o wide | grep openvswitch
$: kubectl logs openvswitch-vswitchd-??? -n openstack | less
2. Log in to the compute node where the ping fails after the instance migration. Use the "ifconfig" command to check the devices, compare them with the "No such device" devices in the ovs-vswitchd log, and check whether those devices can be found.
$: ssh compute-???
$: ifconfig
3. Use the "ifconfig" command to check the devices in the network namespaces, compare them with the "No such device" devices in the ovs-vswitchd log, and check whether those devices can be found.
$: ip netns
$: sudo ip netns exec "network namespace" bash
$: ifconfig
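
For reference, a minimal shell sketch of that cross-check, assuming the device names taken from the ovs-vswitchd warnings above (adjust the list for the actual failure):

for dev in tap1fd78fc1-56 tapaec2ad7f-d1 tap43b81a46-fc tap51620f1e-97; do
    if ip link show "$dev" >/dev/null 2>&1; then
        echo "$dev: present in the default namespace"
    else
        # the tap may live inside a neutron network namespace, so check those too
        found=""
        for ns in $(ip netns list | awk '{print $1}'); do
            sudo ip netns exec "$ns" ip link show "$dev" >/dev/null 2>&1 && found="$ns"
        done
        if [ -n "$found" ]; then
            echo "$dev: present in netns $found"
        else
            echo "$dev: not found on this host"
        fi
    fi
done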

Revision history for this message
Le, Huifeng (hle2) wrote :

@Wendy,
From the ovs log analysis, it seems the tap device for the vm instance may not have been created correctly during migration. Could you please help collect the logs Gongjun asked for, to confirm this assumption when the failure occurs? Thanks!

@Matt,
What's your opinion on the error log? Do you have any other suggestions? Thanks!

Revision history for this message
Wendy Mitchell (wmitchellwr) wrote :

Sorry for the late response.
I have been swamped and will not have a chance to look at this.

As you can see from my comment above ( https://bugs.launchpad.net/starlingx/+bug/1841189/comments/46 ),
the ping was failing only after the live migration (but not every time).

I can tell you the following details, however:
- a new instance could be launched and could ping each target successfully (i.e. even though it was on the same host as the instance that could not ping)
- if the instance that could not ping was subsequently live migrated away from and back to the same host, the ping started working again (a sketch of this workaround follows)
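
For context, the "migrate away and back" workaround in the last point amounts to something like the following; the host names and instance ID are illustrative, and the legacy nova client is shown (the equivalent openstack client command also works):

# move the affected instance off its current host (compute-1 here)...
nova live-migration <instance-uuid> compute-2
# ...wait for it to become ACTIVE again, then move it back
nova live-migration <instance-uuid> compute-1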

Revision history for this message
Gongjun Song (songgongjun) wrote :

@Wendy, according to the analysis, the problem may be with the tap device after the migration. I am also trying to reproduce it. In addition, can you help me collect the following logs when you have time:
1. Log in to the compute node where the ping fails after the instance migration.
$: ssh compute-???

2. When the ping fails, use the following commands to collect some logs (do not delete the instance that fails to ping).
$: ifconfig
$: sudo ovs-vsctl show

3. When the ping fails, also collect the logs of ovs, nova and neutron (a rough collection sketch follows these steps).
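
A rough collection script along those lines might look like this; the compute host, output directory and pod-name patterns are assumptions about the lab, not exact names:

HOST=compute-1                                   # compute node where the ping fails (assumption)
OUT=/tmp/bug1841189-logs; mkdir -p "$OUT"

# interface and bridge state on the compute host
ssh "$HOST" 'ifconfig -a' > "$OUT/ifconfig.txt"
ssh "$HOST" 'sudo ovs-vsctl show' > "$OUT/ovs-vsctl-show.txt"

# ovs / nova / neutron container logs for that host (pod names vary per lab;
# add a container name or --all-containers if a pod runs more than one container)
kubectl get pods -n openstack -o wide | grep "$HOST" > "$OUT/pods.txt"
for pod in $(kubectl get pods -n openstack -o wide | grep "$HOST" \
             | grep -E 'openvswitch|nova-compute|neutron' | awk '{print $1}'); do
    kubectl logs "$pod" -n openstack > "$OUT/$pod.log" 2>&1
done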

Revision history for this message
Gongjun Song (songgongjun) wrote :

I used an automated script to try to reproduce this bug in a 2+3 environment (2 controllers, 3 computes, using ovs and the 1204 image). Vm1 was live migrated 640 times (after 640 migrations the system crashed due to other problems), and this bug did not reproduce. I will fix the system problem later and test again. Also, I am checking nova's live-migration code to see how the network is migrated.
Attach automated test scripts and collected logs.
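
For reference, the core of the reproduction loop is roughly the following; this is a hedged sketch rather than the attached script, the instance name, ping target and iteration count are assumptions, and older openstack clients use "--live <host>" instead of "--live-migration":

VM=vm1                                # instance under test (assumption)
VM_IP=192.168.120.22                  # tenant IP reachable from the NAT box (assumption)
for i in $(seq 1 1000); do
    openstack server migrate --live-migration "$VM"
    # wait until the migration finishes and the server is ACTIVE again
    until [ "$(openstack server show "$VM" -f value -c status)" = "ACTIVE" ]; do
        sleep 5
    done
    if ! ping -c 3 "$VM_IP" >/dev/null; then
        echo "iteration $i: ping to $VM_IP failed after live migration"
        break
    fi
done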

Revision history for this message
Gongjun Song (songgongjun) wrote :

Using the automation scripts, I still cannot reproduce this bug, which makes it very difficult to solve. So I suggest that a WR networking engineer take this bug, because it can be reproduced in the WR lab, the servers are accessible to them, and the logs provided by Wendy are not enough to solve it.

Revision history for this message
Le, Huifeng (hle2) wrote :

@Wendy Mitchell,
As aligned in the network meeting, since we cannot reproduce this issue in our lab, I have reassigned it to you to reproduce in the WR lab; the WR team will then help to do the debugging first. Thanks!

Changed in starlingx:
assignee: Gongjun Song (songgongjun) → Wendy Mitchell (wmitchellwr)
Revision history for this message
yong hu (yhu6) wrote :

Based on the comments above, we are waiting for a test update and will target the solution for stx.4.0.

tags: added: stx.4.0
removed: stx.3.0
Revision history for this message
Wendy Mitchell (wmitchellwr) wrote :
Download full text (6.3 KiB)

I have successfully run 375+ live-migration iterations of the same testcase on the original lab that previously exhibited the failure (ip 20-27, BUILD_ID="20200115T023003Z").

However, I have compared the original ovs-vswitchd.log with the current "working" logs.
The original logs had NUMA-related warnings reported in ovs-vswitchd.log, as indicated below.

2019-12-04T19:36:13.494Z|00835|dpdk|INFO|VHOST_CONFIG: /var/run/openvswitch/vhu0ef6b42f-79: reconnecting...
2019-12-04T19:36:13.494Z|00836|dpif_netdev|WARN|There's no available (non-isolated) pmd thread on numa node 1. Queue 0 on port 'eth0' will be assigned to the pmd on core 1 (numa node 0). Expect reduced performance.
2019-12-04T19:36:13.494Z|00837|dpif_netdev|WARN|There's no available (non-isolated) pmd thread on numa node 1. Queue 1 on port 'eth0' will be assigned to the pmd on core 2 (numa node 0). Expect reduced performance.
2019-12-04T19:36:13.494Z|00838|dpif_netdev|INFO|Core 2 on numa node 0 assigned port 'vhuff279836-40' rx queue 0 (measured processing cycles 0).
2019-12-04T19:36:13.494Z|00839|dpif_netdev|INFO|Core 1 on numa node 0 assigned port 'vhu0ef6b42f-79' rx queue 0 (measured processing cycles 0).
...

2019-12-04T19:37:06.305Z|00889|bridge|INFO|bridge br-int: deleted interface vhuff279836-40 on port 14
2019-12-04T19:37:06.305Z|00890|dpif_netdev|WARN|There's no available (non-isolated) pmd thread on numa node 1. Queue 0 on port 'eth0' will be assigned to the pmd on core 1 (numa node 0). Expect reduced performance.
2019-12-04T19:37:06.305Z|00891|dpif_netdev|WARN|There's no available (non-isolated) pmd thread on numa node 1. Queue 1 on port 'eth0' will be assigned to the pmd on core 2 (numa node 0). Expect reduced performance.
2019-12-04T19:37:06.305Z|00892|dpif_netdev|INFO|Core 2 on numa node 0 assigned port 'vhu0ef6b42f-79' rx queue 0 (measured processing cycles 264156).
2019-12-04T19:37:06.343Z|00893|netdev_linux|WARN|Dropped 51 log messages in last 41 seconds (most recently, 34 seconds ago) due to excessive rate
2019-12-04T19:37:06.343Z|00894|netdev_linux|WARN|ethtool command ETHTOOL_GDRVINFO on network device tap7964d501-39 failed: No such device
2019-12-04T19:37:06.375Z|00895|netdev_linux|WARN|ethtool command ETHTOOL_GDRVINFO on network device tapdcbc1bc9-06 failed: No such device
2019-12-04T19:37:06.407Z|00896|netdev_linux|WARN|ethtool command ETHTOOL_GDRVINFO on network device tap05f66e28-9b failed: No such device
2019-12-04T19:37:06.508Z|00897|bridge|INFO|bridge br-int: deleted interface vhu0ef6b42f-79 on port 15
2019-12-04T19:37:06.510Z|00898|dpif_netdev|WARN|There's no available (non-isolated) pmd thread on numa node 1. Queue 0 on port 'eth0' will be assigned to the pmd on core 1 (numa node 0). Expect reduced performance.
2019-12-04T19:37:06.510Z|00899|dpif_netdev|WARN|There's no available (non-isolated) pmd thread on numa node 1. Queue 1 on port 'eth0' will be assigned to the pmd on core 2 (numa node 0). Expect reduced performance.
2019-12-04T19:37:08.746Z|00900|connmgr|INFO|br-int<->unix#3129: 1 flow_mods in the last 0 s (1 deletes)
2019-12-04T19:37:09.431Z|00901|connmgr|INFO|br-int<->unix#3132: 1 flow_mods in the last 0 s (1 deletes)
2019-12-04T19:37:09.868Z|00902|con...

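For what it's worth, the PMD/NUMA situation those warnings describe can be inspected directly on the affected vswitchd with standard OVS-DPDK tools; a hedged sketch follows (the mask value is purely illustrative, and on StarlingX the vswitch cores are normally managed by the platform rather than set by hand):

# show which PMD core (and NUMA node) owns each rx queue
ovs-appctl dpif-netdev/pmd-rxq-show
# show the currently configured PMD core mask
ovs-vsctl get Open_vSwitch . other_config:pmd-cpu-mask
# a mask that also includes a core on NUMA node 1 would avoid the
# "no available (non-isolated) pmd thread on numa node 1" warning;
# 0x42 (cores 1 and 6) is illustrative only and depends on the CPU layout
ovs-vsctl set Open_vSwitch . other_config:pmd-cpu-mask=0x42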

Revision history for this message
Wendy Mitchell (wmitchellwr) wrote :

As stated, I verified with 375+ iterations of live migration, as noted above, when testing with vswitch type ovs-dpdk on the same lab as previously and with the same automation test.
The issue no longer occurred.

20200115T023003Z
This load has a newer version of openstack and a newer kernel.
Closing.

tags: removed: stx.retestneeded
Revision history for this message
Wendy Mitchell (wmitchellwr) wrote :

Also note: When using OVS-DPDK, the instance must be configured to use a flavor with property: hw:mem_page_size=large for the vhostuser port to function properly.
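
For example, the property can be set on an existing flavor with the standard client (the flavor name here is illustrative):

openstack flavor set --property hw:mem_page_size=large flavor-dpdk
openstack flavor show flavor-dpdk -c properties    # confirm the property is present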

Yang Liu (yliu12)
Changed in starlingx:
status: Triaged → Won't Fix