Let me give some updates here:
I've tested the following patches many times with 400+ ports and 1000+ security group rules hosted on a single ovs-agent:
[1] https://review.openstack.org/#/c/638641/
[2] https://review.openstack.org/#/c/638642/
[3] https://review.openstack.org/#/c/638644/
[4] https://review.openstack.org/#/c/638645/
[5] https://review.openstack.org/#/c/638646/
[6] https://review.openstack.org/#/c/638647/
[7] https://review.openstack.org/#/c/640797/
The results show:
(1) ovs-agent starts successfully almost 95% of the time.
The failures mainly come from the following:
a. neutron server is under heavy load
b. the ovs-agent cache takes a lot of memory, and sometimes a MemoryError is raised
c. the local ovs-vswitchd connection still hits some timeouts or drops, but this may also be addressed by this bug and fix:
https://bugs.launchpad.net/neutron/+bug/1817022
https://review.openstack.org/#/c/641681/
(2) no flow loss (no dataplane downtime)
(3) no more RPC timeouts coming out of the RPC loop
(sometimes neutron server raises a timeout for report_state; IMO, in that situation you may need to add more neutron server state report workers, or restart the ovs-agents in small batches rather than all at once)
(4) restart time is reduced to 15-20 min on average
(5) the dump-and-clean stale flows action has a very high success rate; at least I didn't observe any failures.
(6) no stale flows remain, based on (5)
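
To illustrate the "small set, not all at once" restart mentioned under (3), here is a minimal sketch of a batched restart loop. It is not part of any of the patches above. The names batches, restart_in_batches, restart and is_alive are hypothetical, and the 30 minute default timeout is only an assumption derived from the 15-20 min restart time in (4). How an agent is actually restarted and how its liveness is checked is deployment specific, so both are passed in as callables.

```python
import time
from typing import Callable, Iterable, Iterator, List


def batches(hosts: Iterable[str], size: int) -> Iterator[List[str]]:
    """Yield consecutive groups of at most `size` hosts."""
    if size < 1:
        raise ValueError("size must be >= 1")
    batch: List[str] = []
    for host in hosts:
        batch.append(host)
        if len(batch) == size:
            yield batch
            batch = []
    if batch:
        yield batch


def restart_in_batches(
    hosts: Iterable[str],
    batch_size: int,
    restart: Callable[[str], None],    # hypothetical: restarts the agent on one host
    is_alive: Callable[[str], bool],   # hypothetical: True once the agent reports state again
    timeout: float = 1800.0,           # assumption: a bit above the 15-20 min seen in testing
    poll_interval: float = 10.0,
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> List[str]:
    """Restart agents one batch at a time.

    Returns the hosts whose agent did not come back within `timeout`.
    """
    failed: List[str] = []
    for batch in batches(hosts, batch_size):
        # Restart the whole batch, then wait for it before touching the next one,
        # so neutron server only has to resync `batch_size` agents at a time.
        for host in batch:
            restart(host)
        pending = set(batch)
        deadline = clock() + timeout
        while pending:
            pending = {h for h in pending if not is_alive(h)}
            if not pending or clock() >= deadline:
                break
            sleep(poll_interval)
        failed.extend(sorted(pending))
    return failed
```

In a real deployment, restart could wrap something like restarting the neutron-openvswitch-agent service over SSH (the service name depends on the distribution), and is_alive could check the Alive column of "openstack network agent list" for that host. This sketch keeps going after a batch times out and only reports the failed hosts at the end; stopping at the first failed batch would be the more conservative choice when neutron server is already under heavy load.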