[scale issue] ovs-agent port processing time increases linearly and eventually times out

Bug #1838431 reported by LIU Yulong
Affects: neutron
Status: Confirmed
Importance: Medium
Assigned to: Unassigned

Bug Description

ENV: stable/queens
Master has essentially the same code, so the issue may exist there as well.

Config: L2 ovs-agent with the openflow based security group firewall enabled.
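
For reference, the relevant agent setting is roughly the following (a sketch; the exact config file path depends on the deployment):

# /etc/neutron/plugins/ml2/openvswitch_agent.ini (path may vary)
[securitygroup]
enable_security_group = true
firewall_driver = openvswitch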

Recently I ran an extreme test locally, booting 2700 instances for one single tenant.
The instances were booted across 2000 networks, but the entire tenant has only one security group with only 5 rules. (This is the key point of the problem.)

The result is totally unacceptable: more than 2000 instances failed to boot (ERROR), and almost every one of them hit the "vif-plug-timeout" exception.

How to reproduce (a CLI sketch for steps 1-3 follows the boot loop below):
1. create 2700 networks one by one with "openstack network create"
2. create one IPv4 subnet and one IPv6 subnet for every network
3. create 2700 routers (a single tenant cannot create more than 255 HA routers because of the VRID range) and connect them to these subnets
4. boot instances:
for i in {1..100}
do
    for j in {1..27}
    do
        nova boot --nic net-name="test-network-xxx" ...
    done
    echo "CLI: booted 27 VMs"
    sleep 30s
done
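
For reference, steps 1-3 map to CLI calls roughly like this (a sketch; names and CIDRs are placeholders, not the exact values used in the test):

for i in {1..2700}
do
    openstack network create "test-network-$i"
    openstack subnet create --network "test-network-$i" --subnet-range "10.$((i / 256)).$((i % 256)).0/24" "test-subnet-v4-$i"
    openstack subnet create --network "test-network-$i" --ip-version 6 --subnet-range "fd00:$i::/64" "test-subnet-v6-$i"
    openstack router create "test-router-$i"
    openstack router add subnet "test-router-$i" "test-subnet-v4-$i"
    openstack router add subnet "test-router-$i" "test-subnet-v6-$i"
done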

I have some clue about this issue; the linear increase in processing time looks like this:
(1) rpc_loop X
5 ports are added to the ovs-agent; they are processed and then land in the updated list due to the local notification.
(2) rpc_loop X + 1
Another 10 ports are added to the ovs-agent, plus 10 updated-port local notifications.
This iteration's processing time is the update processing for the 5 earlier ports plus the added-port processing for the 10 new ones.
(3) rpc_loop X + 2
Another 20 ports are added to the ovs-agent:
10 updated + 20 added ports' processing time.

And the worse part is that as the number of ports grows, every port under this single security group becomes related to all the others, so the openflow based security group processing takes longer and longer,
until some instance ports hit the vif-plug timeout and those instances fail to boot.
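
One way to watch the per-iteration growth is to follow the agent's rpc_loop log lines (a sketch; the log path and exact message format depend on the deployment and release):

grep "rpc_loop" /var/log/neutron/openvswitch-agent.log | tail -n 20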

tags: added: loadimpact ovs-fw
Changed in neutron:
status: New → Confirmed
importance: Undecided → Medium
Revision history for this message
Miguel Lavalle (minsel) wrote :

In the bug description you mention "But the entire tenant has only one security group with only 5 rules. (This is the key point of the problem.)". If you perform the exact same test without the security group constraint, do you still get the timeouts?
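
One way to take the security group out of the picture for the re-run is to pre-create ports with port security disabled and boot from them (a sketch; the network and port names are placeholders):

openstack port create --network "test-network-xxx" --disable-port-security --no-security-group "test-port-1"
nova boot --nic port-id=<port-uuid> ...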

Revision history for this message
Slawek Kaplonski (slaweq) wrote :

Please try to remove from your security group the rules which refer to remote_security_group_id. IMO this is the main source of the issue here, and it has been a known problem for a long time.
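
For example, a rule that references the group itself can usually be replaced with a CIDR based rule (a sketch; the group name, protocol, port and CIDR are placeholders):

# rule that triggers the remote_security_group_id handling
openstack security group rule create --ingress --protocol tcp --dst-port 22 --remote-group default default
# CIDR based alternative that avoids it
openstack security group rule create --ingress --protocol tcp --dst-port 22 --remote-ip 10.0.0.0/8 default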
