2019-01-29 05:57:21 |
LIU Yulong |
bug |
|
|
added bug |
2019-01-29 05:59:27 |
LIU Yulong |
description |
[L2] [summary] ovs-agent issues at large scale
Recently we tested the ovs-agent with the ovs-flow based firewall, and we encountered some issues at large scale.
This bug will give us a centralized location to track the following problems.
Problems:
(1) RPC timeout during ovs-agent restart
(2) local connection to ovs-vswitchd was dropped or timed out
(3) ovs-agent failed to restart
(4) ovs-agent restart takes too long (15-40 mins+)
(5) unexpected flow loss
(6) unexpected tunnel loss
(7) flows with multiple cookies (stale flows)
(8) dump-flows takes a lot of time
(9) really hard to troubleshoot when one VM loses its connection; the flow tables are almost unreadable (reaching 30k+ flows).
The problems can be seen in the following scenarios:
(1) 2000-3000 ports related to one single security group (or one remote security group)
(2) create 2000-3000 VMs in one single subnet (network)
(3) create 2000-3000 VMs under one single security group
Yes, scale is the main problem: when one host's VM count approaches 150-200 (and, at the same time, the number of ports in one subnet or security group approaches 2000), the ovs-agent restart gets worse.
Test ENV:
stable/queens
Deployment topology:
neutron-server, database, and message queue each have their own dedicated physical hosts, with at least 3 nodes per service.
Configurations:
ovs-agent was set up with l2pop and the ovs flow based firewall, and the config was basically like the following:
[agent]
enable_distributed_routing = True
l2_population = True
tunnel_types = vxlan
arp_responder = True
prevent_arp_spoofing = True
extensions = qos
report_interval = 60
[ovs]
bridge_mappings = tenant:br-vlan,external:br-ex
local_ip = 10.114.4.48
[securitygroup]
firewall_driver = openvswitch
enable_security_group = True
Some issue tracking:
(1) mostly caused by the large number of ports related to one security group or in one network
(2) unnecessary RPC calls during ovs-agent restart
(3) inefficient database query conditions
(4) full sync is redone again and again if any exception is raised in rpc_loop
(5) cleaning stale flows dumps all flows first (not once, but multiple times), which is really time-consuming
So this is a summary bug for all the scale issues we have met.
Some potential solutions:
Increasing config options such as rpc_response_timeout, of_connect_timeout, of_request_timeout, ovsdb_timeout, etc. does not help much; such changes can make the restart take even longer, and the issues can still be seen.
One workaround is to disable the openvswitch flow based security group; the ovs-agent can then restart in less than 10 mins. |
[L2] [summary] ovs-agent issues at large scale
Recently we tested the ovs-agent with the openvswitch flow based security group, and we encountered some issues at large scale. This bug will give us a centralized location to track the following problems.
Problems:
(1) RPC timeout during ovs-agent restart
(2) local connection to ovs-vswitchd was dropped or timed out
(3) ovs-agent failed to restart
(4) ovs-agent restart takes too long (15-40 mins+)
(5) unexpected flow loss
(6) unexpected tunnel loss
(7) flows with multiple cookies (stale flows)
(8) dump-flows takes a lot of time
(9) really hard to troubleshoot when one VM loses its connection; the flow tables are almost unreadable (reaching 30k+ flows).
The problems can be seen in the following scenarios:
(1) 2000-3000 ports related to one single security group (or one remote security group)
(2) create 2000-3000 VMs in one single subnet (network)
(3) create 2000-3000 VMs under one single security group
Yes, scale is the main problem: when one host's VM count approaches 150-200 (and, at the same time, the number of ports in one subnet or security group approaches 2000), the ovs-agent restart gets worse.
Test ENV:
stable/queens
Deployment topology:
neutron-server, database, and message queue each have their own dedicated physical hosts, with at least 3 nodes per service.
Configurations:
ovs-agent was set up with l2pop and the ovs flow based firewall, and the config was basically like the following:
[agent]
enable_distributed_routing = True
l2_population = True
tunnel_types = vxlan
arp_responder = True
prevent_arp_spoofing = True
extensions = qos
report_interval = 60
[ovs]
bridge_mappings = tenant:br-vlan,external:br-ex
local_ip = 10.114.4.48
[securitygroup]
firewall_driver = openvswitch
enable_security_group = True
Some issue tracking:
(1) mostly caused by the large number of ports related to one security group or in one network
(2) unnecessary RPC calls during ovs-agent restart
(3) inefficient database query conditions
(4) full sync is redone again and again if any exception is raised in rpc_loop
(5) cleaning stale flows dumps all flows first (not once, but multiple times), which is really time-consuming
So this is a summary bug for all the scale issues we have met.
Some potential solutions:
Increasing config options such as rpc_response_timeout, of_connect_timeout, of_request_timeout, ovsdb_timeout, etc. does not help much; such changes can make the restart take even longer, and the issues can still be seen.
One workaround is to disable the openvswitch flow based security group; the ovs-agent can then restart in less than 10 mins. |
|
2019-01-29 06:00:20 |
LIU Yulong |
description |
[L2] [summary] ovs-agent issues at large scale
Recently we tested the ovs-agent with the openvswitch flow based security group, and we encountered some issues at large scale. This bug will give us a centralized location to track the following problems.
Problems:
(1) RPC timeout during ovs-agent restart
(2) local connection to ovs-vswitchd was dropped or timed out
(3) ovs-agent failed to restart
(4) ovs-agent restart takes too long (15-40 mins+)
(5) unexpected flow loss
(6) unexpected tunnel loss
(7) flows with multiple cookies (stale flows)
(8) dump-flows takes a lot of time
(9) really hard to troubleshoot when one VM loses its connection; the flow tables are almost unreadable (reaching 30k+ flows).
The problems can be seen in the following scenarios:
(1) 2000-3000 ports related to one single security group (or one remote security group)
(2) create 2000-3000 VMs in one single subnet (network)
(3) create 2000-3000 VMs under one single security group
Yes, scale is the main problem: when one host's VM count approaches 150-200 (and, at the same time, the number of ports in one subnet or security group approaches 2000), the ovs-agent restart gets worse.
Test ENV:
stable/queens
Deployment topology:
neutron-server, database, and message queue each have their own dedicated physical hosts, with at least 3 nodes per service.
Configurations:
ovs-agent was set up with l2pop and the ovs flow based firewall, and the config was basically like the following:
[agent]
enable_distributed_routing = True
l2_population = True
tunnel_types = vxlan
arp_responder = True
prevent_arp_spoofing = True
extensions = qos
report_interval = 60
[ovs]
bridge_mappings = tenant:br-vlan,external:br-ex
local_ip = 10.114.4.48
[securitygroup]
firewall_driver = openvswitch
enable_security_group = True
Some issue tracking:
(1) mostly caused by the large number of ports related to one security group or in one network
(2) unnecessary RPC calls during ovs-agent restart
(3) inefficient database query conditions
(4) full sync is redone again and again if any exception is raised in rpc_loop
(5) cleaning stale flows dumps all flows first (not once, but multiple times), which is really time-consuming
So this is a summary bug for all the scale issues we have met.
Some potential solutions:
Increasing config options such as rpc_response_timeout, of_connect_timeout, of_request_timeout, ovsdb_timeout, etc. does not help much; such changes can make the restart take even longer, and the issues can still be seen.
One workaround is to disable the openvswitch flow based security group; the ovs-agent can then restart in less than 10 mins. |
[L2] [summary] ovs-agent issues at large scale
Recently we tested the ovs-agent with the openvswitch flow based security group, and we encountered some issues at large scale. This bug will give us a centralized location to track the following problems.
Problems:
(1) RPC timeout during ovs-agent restart
(2) local connection to ovs-vswitchd was dropped or timed out
(3) ovs-agent failed to restart
(4) ovs-agent restart takes too long (15-40 mins+)
(5) unexpected flow loss
(6) unexpected tunnel loss
(7) flows with multiple cookies (stale flows)
(8) dump-flows takes a lot of time
(9) really hard to troubleshoot when one VM loses its connection; the flow tables are almost unreadable (reaching 30k+ flows).
The problems can be seen in the following scenarios:
(1) 2000-3000 ports related to one single security group (or one remote security group)
(2) create 2000-3000 VMs in one single subnet (network)
(3) create 2000-3000 VMs under one single security group
Yes, scale is the main problem: when one host's VM count approaches 150-200 (and, at the same time, the number of ports in one subnet or security group approaches 2000), the ovs-agent restart gets worse.
Test ENV:
stable/queens
Deployment topology:
neutron-server, database, and message queue each have their own dedicated physical hosts, with at least 3 nodes per service.
Configurations:
ovs-agent was set up with l2pop and the ovs flow based security group, and the config was basically like the following:
[agent]
enable_distributed_routing = True
l2_population = True
tunnel_types = vxlan
arp_responder = True
prevent_arp_spoofing = True
extensions = qos
report_interval = 60
[ovs]
bridge_mappings = tenant:br-vlan,external:br-ex
local_ip = 10.114.4.48
[securitygroup]
firewall_driver = openvswitch
enable_security_group = True
Some issue tracking:
(1) mostly caused by the large number of ports related to one security group or in one network
(2) unnecessary RPC calls during ovs-agent restart
(3) inefficient database query conditions
(4) full sync is redone again and again if any exception is raised in rpc_loop
(5) cleaning stale flows dumps all flows first (not once, but multiple times), which is really time-consuming
So this is a summary bug for all the scale issues we have met.
Some potential solutions:
Increasing config options such as rpc_response_timeout, of_connect_timeout, of_request_timeout, ovsdb_timeout, etc. does not help much; such changes can make the restart take even longer, and the issues can still be seen.
One workaround is to disable the openvswitch flow based security group; the ovs-agent can then restart in less than 10 mins. |
|
2019-01-29 06:20:04 |
LIU Yulong |
description |
[L2] [summary] ovs-agent issues at large scale
Recently we tested the ovs-agent with the openvswitch flow based security group, and we encountered some issues at large scale. This bug will give us a centralized location to track the following problems.
Problems:
(1) RPC timeout during ovs-agent restart
(2) local connection to ovs-vswitchd was dropped or timed out
(3) ovs-agent failed to restart
(4) ovs-agent restart takes too long (15-40 mins+)
(5) unexpected flow loss
(6) unexpected tunnel loss
(7) flows with multiple cookies (stale flows)
(8) dump-flows takes a lot of time
(9) really hard to troubleshoot when one VM loses its connection; the flow tables are almost unreadable (reaching 30k+ flows).
The problems can be seen in the following scenarios:
(1) 2000-3000 ports related to one single security group (or one remote security group)
(2) create 2000-3000 VMs in one single subnet (network)
(3) create 2000-3000 VMs under one single security group
Yes, scale is the main problem: when one host's VM count approaches 150-200 (and, at the same time, the number of ports in one subnet or security group approaches 2000), the ovs-agent restart gets worse.
Test ENV:
stable/queens
Deployment topology:
neutron-server, database, and message queue each have their own dedicated physical hosts, with at least 3 nodes per service.
Configurations:
ovs-agent was set up with l2pop and the ovs flow based security group, and the config was basically like the following:
[agent]
enable_distributed_routing = True
l2_population = True
tunnel_types = vxlan
arp_responder = True
prevent_arp_spoofing = True
extensions = qos
report_interval = 60
[ovs]
bridge_mappings = tenant:br-vlan,external:br-ex
local_ip = 10.114.4.48
[securitygroup]
firewall_driver = openvswitch
enable_security_group = True
Some issue tracking:
(1) mostly caused by the large number of ports related to one security group or in one network
(2) unnecessary RPC calls during ovs-agent restart
(3) inefficient database query conditions
(4) full sync is redone again and again if any exception is raised in rpc_loop
(5) cleaning stale flows dumps all flows first (not once, but multiple times), which is really time-consuming
So this is a summary bug for all the scale issues we have met.
Some potential solutions:
Increasing config options such as rpc_response_timeout, of_connect_timeout, of_request_timeout, ovsdb_timeout, etc. does not help much; such changes can make the restart take even longer, and the issues can still be seen.
One workaround is to disable the openvswitch flow based security group; the ovs-agent can then restart in less than 10 mins. |
[L2] [summary] ovs-agent issues at large scale
Recently we tested the ovs-agent with the openvswitch flow based security group, and we encountered some issues at large scale. This bug will give us a centralized location to track the following problems.
Problems:
(1) RPC timeout during ovs-agent restart
https://bugs.launchpad.net/neutron/+bug/1813704
(2) local connection to ovs-vswitchd was dropped or timed out
https://bugs.launchpad.net/neutron/+bug/1813705
(3) ovs-agent failed to restart
https://bugs.launchpad.net/neutron/+bug/1813706
(4) ovs-agent restart takes too long (15-40 mins+)
https://bugs.launchpad.net/neutron/+bug/1813707
(5) unexpected flow loss
(6) unexpected tunnel loss
(7) flows with multiple cookies (stale flows)
(8) dump-flows takes a lot of time
(9) really hard to troubleshoot when one VM loses its connection; the flow tables are almost unreadable (reaching 30k+ flows).
https://bugs.launchpad.net/neutron/+bug/1813708
The problems can be seen in the following scenarios:
(1) 2000-3000 ports related to one single security group (or one remote security group)
(2) create 2000-3000 VMs in one single subnet (network)
(3) create 2000-3000 VMs under one single security group
Yes, scale is the main problem: when one host's VM count approaches 150-200 (and, at the same time, the number of ports in one subnet or security group approaches 2000), the ovs-agent restart gets worse.
Test ENV:
stable/queens
Deployment topology:
neutron-server, database, and message queue each have their own dedicated physical hosts, with at least 3 nodes per service.
Configurations:
ovs-agent was set up with l2pop and the ovs flow based security group, and the config was basically like the following:
[agent]
enable_distributed_routing = True
l2_population = True
tunnel_types = vxlan
arp_responder = True
prevent_arp_spoofing = True
extensions = qos
report_interval = 60
[ovs]
bridge_mappings = tenant:br-vlan,external:br-ex
local_ip = 10.114.4.48
[securitygroup]
firewall_driver = openvswitch
enable_security_group = True
Some issue tracking:
(1) mostly caused by the large number of ports related to one security group or in one network
(2) unnecessary RPC calls during ovs-agent restart
(3) inefficient database query conditions
(4) full sync is redone again and again if any exception is raised in rpc_loop
(5) cleaning stale flows dumps all flows first (not once, but multiple times), which is really time-consuming
So this is a summary bug for all the scale issues we have met.
Some potential solutions:
Increasing config options such as rpc_response_timeout, of_connect_timeout, of_request_timeout, ovsdb_timeout, etc. does not help much; such changes can make the restart take even longer, and the issues can still be seen.
One workaround is to disable the openvswitch flow based security group; the ovs-agent can then restart in less than 10 mins. |
|
2019-01-29 06:49:24 |
LIU Yulong |
description |
[L2] [summary] ovs-agent issues at large scale
Recently we tested the ovs-agent with the openvswitch flow based security group, and we encountered some issues at large scale. This bug will give us a centralized location to track the following problems.
Problems:
(1) RPC timeout during ovs-agent restart
https://bugs.launchpad.net/neutron/+bug/1813704
(2) local connection to ovs-vswitchd was dropped or timed out
https://bugs.launchpad.net/neutron/+bug/1813705
(3) ovs-agent failed to restart
https://bugs.launchpad.net/neutron/+bug/1813706
(4) ovs-agent restart takes too long (15-40 mins+)
https://bugs.launchpad.net/neutron/+bug/1813707
(5) unexpected flow loss
(6) unexpected tunnel loss
(7) flows with multiple cookies (stale flows)
(8) dump-flows takes a lot of time
(9) really hard to troubleshoot when one VM loses its connection; the flow tables are almost unreadable (reaching 30k+ flows).
https://bugs.launchpad.net/neutron/+bug/1813708
The problems can be seen in the following scenarios:
(1) 2000-3000 ports related to one single security group (or one remote security group)
(2) create 2000-3000 VMs in one single subnet (network)
(3) create 2000-3000 VMs under one single security group
Yes, scale is the main problem: when one host's VM count approaches 150-200 (and, at the same time, the number of ports in one subnet or security group approaches 2000), the ovs-agent restart gets worse.
Test ENV:
stable/queens
Deployment topology:
neutron-server, database, and message queue each have their own dedicated physical hosts, with at least 3 nodes per service.
Configurations:
ovs-agent was set up with l2pop and the ovs flow based security group, and the config was basically like the following:
[agent]
enable_distributed_routing = True
l2_population = True
tunnel_types = vxlan
arp_responder = True
prevent_arp_spoofing = True
extensions = qos
report_interval = 60
[ovs]
bridge_mappings = tenant:br-vlan,external:br-ex
local_ip = 10.114.4.48
[securitygroup]
firewall_driver = openvswitch
enable_security_group = True
Some issue tracking:
(1) mostly caused by the large number of ports related to one security group or in one network
(2) unnecessary RPC calls during ovs-agent restart
(3) inefficient database query conditions
(4) full sync is redone again and again if any exception is raised in rpc_loop
(5) cleaning stale flows dumps all flows first (not once, but multiple times), which is really time-consuming
So this is a summary bug for all the scale issues we have met.
Some potential solutions:
Increasing config options such as rpc_response_timeout, of_connect_timeout, of_request_timeout, ovsdb_timeout, etc. does not help much; such changes can make the restart take even longer, and the issues can still be seen.
One workaround is to disable the openvswitch flow based security group; the ovs-agent can then restart in less than 10 mins. |
[L2] [summary] ovs-agent issues at large scale
Recently we tested the ovs-agent with the openvswitch flow based security group, and we encountered some issues at large scale. This bug will give us a centralized location to track the following problems.
Problems:
(1) RPC timeout during ovs-agent restart
https://bugs.launchpad.net/neutron/+bug/1813704
(2) local connection to ovs-vswitchd was dropped or timed out
https://bugs.launchpad.net/neutron/+bug/1813705
(3) ovs-agent failed to restart
https://bugs.launchpad.net/neutron/+bug/1813706
(4) ovs-agent restart takes too long (15-40 mins+)
https://bugs.launchpad.net/neutron/+bug/1813707
(5) unexpected flow loss
https://bugs.launchpad.net/neutron/+bug/1813714
(6) unexpected tunnel loss
https://bugs.launchpad.net/neutron/+bug/1813715
(7) flows with multiple cookies (stale flows)
https://bugs.launchpad.net/neutron/+bug/1813712
(8) dump-flows takes a lot of time
https://bugs.launchpad.net/neutron/+bug/1813709
(9) really hard to troubleshoot when one VM loses its connection; the flow tables are almost unreadable (reaching 30k+ flows).
https://bugs.launchpad.net/neutron/+bug/1813708
The problems can be seen in the following scenarios:
(1) 2000-3000 ports related to one single security group (or one remote security group)
(2) create 2000-3000 VMs in one single subnet (network)
(3) create 2000-3000 VMs under one single security group
Yes, scale is the main problem: when one host's VM count approaches 150-200 (and, at the same time, the number of ports in one subnet or security group approaches 2000), the ovs-agent restart gets worse.
Test ENV:
stable/queens
Deployment topology:
neutron-server, database, and message queue each have their own dedicated physical hosts, with at least 3 nodes per service.
Configurations:
ovs-agent was set up with l2pop and the ovs flow based security group, and the config was basically like the following:
[agent]
enable_distributed_routing = True
l2_population = True
tunnel_types = vxlan
arp_responder = True
prevent_arp_spoofing = True
extensions = qos
report_interval = 60
[ovs]
bridge_mappings = tenant:br-vlan,external:br-ex
local_ip = 10.114.4.48
[securitygroup]
firewall_driver = openvswitch
enable_security_group = True
Some issue tracking:
(1) mostly caused by the large number of ports related to one security group or in one network
(2) unnecessary RPC calls during ovs-agent restart
(3) inefficient database query conditions
(4) full sync is redone again and again if any exception is raised in rpc_loop
(5) cleaning stale flows dumps all flows first (not once, but multiple times), which is really time-consuming (see the troubleshooting sketch below)
So this is a summary bug for all the scale issues we have met.
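As a small, hedged troubleshooting aid (not part of neutron; the bridge name br-int and the availability of ovs-ofctl on the compute node are assumptions), the following Python sketch dumps the flow table a single time and counts flows per cookie, which makes stale cookies left by previous agent runs and the overall flow-count explosion easy to spot:
# Illustrative, read-only sketch: it only runs "ovs-ofctl dump-flows" once
# and never modifies any flows.
import collections
import re
import subprocess

def flows_per_cookie(bridge="br-int"):
    output = subprocess.check_output(
        ["ovs-ofctl", "dump-flows", bridge], text=True)
    counter = collections.Counter()
    for line in output.splitlines():
        # Every flow line in the dump starts with "cookie=0x...".
        match = re.search(r"cookie=(0x[0-9a-fA-F]+)", line)
        if match:
            counter[match.group(1)] += 1
    return counter

if __name__ == "__main__":
    # Cookies that should have been removed after a restart show up here
    # with large flow counts.
    for cookie, count in flows_per_cookie().most_common():
        print(cookie, count)
The same one-off check can be done with plain ovs-ofctl and grep; the point is simply to avoid dumping the table repeatedly.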
Some potential solutions:
Increasing config options such as rpc_response_timeout, of_connect_timeout, of_request_timeout, ovsdb_timeout, etc. does not help much; such changes can make the restart take even longer, and the issues can still be seen.
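For reference, a hedged sketch of where such options are usually tuned (the values below are illustrative examples only, not recommendations): rpc_response_timeout lives in neutron.conf under [DEFAULT], while the OpenFlow and ovsdb timeouts are typically set in the agent's [ovs] section of openvswitch_agent.ini:
[DEFAULT]
rpc_response_timeout = 300
[ovs]
of_connect_timeout = 300
of_request_timeout = 300
ovsdb_timeout = 30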
One workaround is to disable the openvswitch flow based security group; the ovs-agent can then restart in less than 10 mins. A hedged config sketch of such a fallback follows.
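The snippet below only illustrates that workaround; it assumes the iptables_hybrid driver is an acceptable substitute for this deployment (the noop driver is the other option when no port filtering is required at all), and switching drivers for ports that are already plugged is typically not transparent:
[securitygroup]
firewall_driver = iptables_hybrid
enable_security_group = True
|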
|
2019-01-29 14:34:59 |
Dongcan Ye |
bug |
|
|
added subscriber Dongcan Ye |
2019-01-29 18:32:43 |
Swaminathan Vasudevan |
tags |
|
ovs |
|
2019-01-29 19:16:42 |
Swaminathan Vasudevan |
tags |
ovs |
l2-pop ovs |
|
2019-01-29 19:30:22 |
Swaminathan Vasudevan |
neutron: status |
New |
Confirmed |
|
2019-01-29 19:30:43 |
Swaminathan Vasudevan |
neutron: importance |
Undecided |
High |
|
2019-02-04 09:48:13 |
Dennis Kusidlo |
bug |
|
|
added subscriber Dennis Kusidlo |
2019-02-08 15:38:29 |
Kurt Garloff |
bug |
|
|
added subscriber Kurt Garloff |
2019-02-20 08:48:34 |
LIU Yulong |
neutron: assignee |
|
LIU Yulong (dragon889) |
|
2019-02-26 02:00:26 |
LIU Yulong |
neutron: status |
Confirmed |
In Progress |
|
2019-03-07 13:32:43 |
s10 |
bug |
|
|
added subscriber s10 |
2019-03-07 19:59:47 |
OpenStack Infra |
neutron: assignee |
LIU Yulong (dragon889) |
Brian Haley (brian-haley) |
|
2019-03-07 23:58:15 |
OpenStack Infra |
neutron: assignee |
Brian Haley (brian-haley) |
LIU Yulong (dragon889) |
|
2019-03-23 04:46:52 |
OpenStack Infra |
neutron: status |
In Progress |
Fix Released |
|
2019-04-06 03:40:00 |
OpenStack Infra |
tags |
l2-pop ovs |
in-stable-pike l2-pop ovs |
|
2019-04-07 14:07:52 |
OpenStack Infra |
tags |
in-stable-pike l2-pop ovs |
in-stable-pike in-stable-rocky l2-pop ovs |
|
2019-04-08 22:07:41 |
OpenStack Infra |
tags |
in-stable-pike in-stable-rocky l2-pop ovs |
in-stable-pike in-stable-rocky in-stable-stein l2-pop ovs |
|
2019-04-08 22:08:18 |
OpenStack Infra |
tags |
in-stable-pike in-stable-rocky in-stable-stein l2-pop ovs |
in-stable-pike in-stable-queens in-stable-rocky in-stable-stein l2-pop ovs |
|
2019-04-10 23:44:32 |
OpenStack Infra |
tags |
in-stable-pike in-stable-queens in-stable-rocky in-stable-stein l2-pop ovs |
in-stable-ocata in-stable-pike in-stable-queens in-stable-rocky in-stable-stein l2-pop ovs |
|
2019-06-19 14:54:03 |
Bernard Cafarelli |
tags |
in-stable-ocata in-stable-pike in-stable-queens in-stable-rocky in-stable-stein l2-pop ovs |
in-stable-ocata in-stable-pike in-stable-queens in-stable-rocky in-stable-stein l2-pop neutron-proactive-backport-potential ovs |
|
2019-11-01 13:28:13 |
Dr. Jens Harbott |
bug |
|
|
added subscriber Dr. Jens Harbott |
2020-04-20 11:33:03 |
Slawek Kaplonski |
tags |
in-stable-ocata in-stable-pike in-stable-queens in-stable-rocky in-stable-stein l2-pop neutron-proactive-backport-potential ovs |
in-stable-ocata in-stable-pike in-stable-queens in-stable-rocky in-stable-stein l2-pop ovs |
|
2020-12-21 18:04:24 |
Hang Yang |
bug |
|
|
added subscriber Hang Yang |
2022-09-26 15:54:59 |
Arnaud Morin |
bug |
|
|
added subscriber Arnaud Morin |