neutron

Bug #2007674
Comment #3

Julien Cosmao (julien-cosmao) wrote on 2024-03-19 (last edit on 2024-03-19):

Hello,

Same observation here. I started looking at this after recurrent issues on the neutron RabbitMQ cluster of our largest regions (~2000 nodes), which run an OVS deployment with DVR and metadata. Each node has a lot of connections to the broker [1].

Issues mostly appear when agents need to be restarted, or when a problem hits a node of the rabbit cluster (e.g. a cluster partition) and the agents reconnect.

I would also note that the number of queues created by neutron agents [2] is way too high, and most of them are not even used. During a network partition, for example, the rabbit cluster needs to re-elect a leader for each queue owned by the failed node; this process is also an issue at scale.

I started working on these topics for our infra scaling needs and to reduce stress on the rabbitmq cluster, because we had too many outages related to neutron.

[1] Reduce the number of connections
As Anton says, the agent creates a separate topic/RPC server for each resource tracked (resource cache).
In oslo.messaging, 1 RPC server = 1 topic = 1 connection (pooling is only used for publishing).
For the resource cache, Neutron creates the "same" RPC server multiple times, once for each resource.

Here, I see 2 solutions for reducing the number of connections:
- be able to associate 1 RPC server with multiple topics in oslo.messaging; this way, only 1 connection is needed to consume from multiple queues. This change could be proposed to the oslo.messaging project.
- reduce the number of different topics / declare only 1 topic per common purpose on the neutron side (resource cache, q-agent-notifier), but this will require more changes in how neutron implements RPC.
For the resource cache example, we would go from 7 connections to only 1.
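
To make the 7 to 1 figure concrete, here is a minimal toy model of the first solution. It does not use oslo.messaging at all: FakeBroker and FakeRPCServer are hypothetical stand-ins, the topic names are placeholders, and a multi-topic RPC server is the proposed behaviour, not something oslo.messaging offers today as far as I know.

```python
from dataclasses import dataclass


@dataclass
class FakeBroker:
    """Toy connection counter standing in for the rabbit cluster."""
    connections: int = 0

    def connect(self) -> None:
        self.connections += 1


@dataclass
class FakeRPCServer:
    """Hypothetical RPC server: opens exactly one consumer connection,
    however many topics it listens on."""
    broker: FakeBroker
    topics: list

    def start(self) -> None:
        self.broker.connect()


# Placeholder topic names: 7 tracked resources, as in the example above.
RESOURCE_TOPICS = [f"neutron-vo-RESOURCE{i}" for i in range(1, 8)]


def connections_one_server_per_topic(topics):
    """Current behaviour as described: 1 RPC server = 1 topic = 1 connection."""
    broker = FakeBroker()
    for topic in topics:
        FakeRPCServer(broker, [topic]).start()
    return broker.connections


def connections_one_server_for_all(topics):
    """Proposed behaviour: a single RPC server consuming from every topic."""
    broker = FakeBroker()
    FakeRPCServer(broker, list(topics)).start()
    return broker.connections


print(connections_one_server_per_topic(RESOURCE_TOPICS))  # 7
print(connections_one_server_for_all(RESOURCE_TOPICS))    # 1
```

The second solution (a single shared topic on the neutron side) would give the same count in this model, since only one server would be started.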

[2] Reduce the number of queues
When 1 RPC server is declared, oslo.messaging creates 1 connection, creates 3 queues and starts listening on them:
- topic_fanout
- topic
- topic.host

In most cases only 1 of those queues is used by Neutron (e.g. the resource cache uses only fanout, so the neutron-vo-RESOURCE.hostxxx queues are not used). When an RPC server is declared, an oslo.messaging Target describing the topic is passed along with a fanout=bool flag. We could use that on the oslo.messaging side to declare only the needed queues on the backend, and so avoid having all agents declare extra queues.
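
A rough sketch of that idea, again as a toy model and not actual oslo.messaging code. The function names are hypothetical, the queue names follow the simplified list above (real queue names on the broker may differ), and keeping both topic and topic.host for a non-fanout server is my assumption. The 2000 nodes and 7 topics are the figures from this comment, and the totals count queue declarations issued by agents, not distinct queues on the broker.

```python
def queues_declared_today(topic, host):
    """Queues declared per RPC server, following the list above."""
    return [f"{topic}_fanout", topic, f"{topic}.{host}"]


def queues_declared_if_filtered(topic, host, fanout):
    """Hypothetical variant: use the Target's fanout flag to skip
    the queues that the server will never consume from."""
    if fanout:
        return [f"{topic}_fanout"]
    # Assumption: a non-fanout server still needs both of these.
    return [topic, f"{topic}.{host}"]


def total_declarations(nodes, topics, per_server):
    """Declarations issued when every node starts one server per topic."""
    return nodes * topics * per_server


topic, host = "neutron-vo-RESOURCE", "hostxxx"  # placeholder names
today = queues_declared_today(topic, host)
filtered = queues_declared_if_filtered(topic, host, fanout=True)

print(today)     # ['neutron-vo-RESOURCE_fanout', 'neutron-vo-RESOURCE', 'neutron-vo-RESOURCE.hostxxx']
print(filtered)  # ['neutron-vo-RESOURCE_fanout']

# Resource cache only: ~2000 nodes, 7 fanout-only topics.
print(total_declarations(2000, 7, len(today)))     # 42000
print(total_declarations(2000, 7, len(filtered)))  # 14000
```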

With a few changes in oslo.messaging, the number of connections and queues can be reduced easily.

What do you think?
