Comment 15 for bug 1813703

LIU Yulong (dragon889) wrote:

Let me give some updates here:

I've tested the following patches many times with 400+ ports and 1000+ security group rules hosted on one single ovs-agent (a rough sketch of how a comparable load can be generated follows the patch list).
[1] https://review.openstack.org/#/c/638641/
[2] https://review.openstack.org/#/c/638642/
[3] https://review.openstack.org/#/c/638644/
[4] https://review.openstack.org/#/c/638645/
[5] https://review.openstack.org/#/c/638646/
[6] https://review.openstack.org/#/c/638647/
[7] https://review.openstack.org/#/c/640797/
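For reference, a load of a comparable scale can be created with openstacksdk. This is only an illustrative sketch, not the setup used in the tests above: the cloud name, network name, security group name, and counts are my own assumptions, and in a real test the ports would additionally have to be bound to the node running the ovs-agent (for example by attaching them to VMs on that node):

#!/usr/bin/env python3
# Sketch: create many security group rules and many ports for an ovs-agent scale test.
import openstack

conn = openstack.connect(cloud="devstack")  # cloud name is an assumption

network = conn.network.find_network("net-test", ignore_missing=False)  # hypothetical network
sg = conn.network.create_security_group(name="sg-scale-test")

# Create a large number of security group rules (one TCP port per rule).
for port_number in range(10000, 11000):
    conn.network.create_security_group_rule(
        security_group_id=sg.id,
        direction="ingress",
        ethertype="IPv4",
        protocol="tcp",
        port_range_min=port_number,
        port_range_max=port_number,
    )

# Create many ports that all reference the large security group.
for index in range(400):
    conn.network.create_port(
        network_id=network.id,
        name=f"scale-port-{index}",
        security_group_ids=[sg.id],
    )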

The results show:
(1) the ovs-agent can start successfully, with almost a 95% success rate
Failures are mainly concentrated in the following areas:
a. the neutron server is under heavy load
b. the ovs-agent cache consumes a lot of memory, and sometimes a MemoryError is raised
c. the local ovs-vswitchd connection still hits occasional timeouts or drops, but this may also be addressed by this bug and fix:
https://bugs.launchpad.net/neutron/+bug/1817022
https://review.openstack.org/#/c/641681/

(2) no flow loss (no dataplane outage)
(3) no more RPC timeouts from the RPC loop
(sometimes the neutron server raises a timeout for report_state; IMO, in that situation you may need to add more neutron server state report workers, or restart the ovs-agents in small batches rather than all at once; see the sketch after this list)
(4) restart time reduced to 15-20 minutes on average
(5) the dump-and-clean stale flows action has a very high success rate; at least I did not observe a failure
(6) no stale flows remain, based on (5)
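On the report_state timeouts mentioned in (3): the number of state-report RPC workers is controlled by the rpc_state_report_workers option in neutron.conf, and the other mitigation is simply to restart the agents in small batches. Below is a minimal sketch of such a batched restart, assuming passwordless SSH to the compute nodes and the usual neutron-openvswitch-agent systemd unit name; the host names, batch size, and settle time are illustrative only:

#!/usr/bin/env python3
# Sketch: restart ovs-agents a few at a time instead of all at once,
# so each restarted agent can finish its full sync before the next batch.
import subprocess
import time

COMPUTE_HOSTS = ["compute-001", "compute-002", "compute-003"]  # hypothetical host list
BATCH_SIZE = 1          # number of agents restarted per batch
SETTLE_SECONDS = 60     # wait between batches to let agents finish syncing

def restart_agent(host):
    # Restart the OVS agent on one host over SSH (the unit name may differ per distro).
    subprocess.run(
        ["ssh", host, "sudo", "systemctl", "restart", "neutron-openvswitch-agent"],
        check=True,
    )

for i in range(0, len(COMPUTE_HOSTS), BATCH_SIZE):
    for host in COMPUTE_HOSTS[i:i + BATCH_SIZE]:
        restart_agent(host)
    time.sleep(SETTLE_SECONDS)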