neutron

Bug #1806390
Comment #0

Yang Youseok (ileixe) wrote on 2018-12-03:

This is a very old issue and it ended up marked as an invalid feature, but I have not found an ideal solution, so I am raising it again. I wonder what others think of it.

It is closely related to the old issue (https://bugs.launchpad.net/neutron/+bug/1468236), which I have reconstructed below from my understanding.

Problems
- A giant shared provider network with more than 10000 ports in a single network.
- Several DHCP agents serve the network, even one per hypervisor in the Calico project.
- A scalability issue occurs (the DHCP lease file is not updated after the VM has started).

Solutions from the reporter
1. Add a distributed flag for the DHCP agent, and provision a DHCP agent on every compute node.
2. Change the DHCP agent notifier to specify the DHCP agent per
3. Do not spread DHCP flows outside of the local hypervisor.

Conclusion
- Rejected because:
  - Solution step (2) adds significant complexity to the agent notifier RPC.
  - (3) is not a general solution.
  - It is even worse for migration. There were many side effects that we would have to take care of.
  - There were building blocks with which we could achieve the purpose. (This was mentioned on IRC, but I still do not understand which building blocks were meant.)

Our private cluster is very much like Calico. We have a giant provider network, make it routable with Quagga, and run a DHCP agent on each compute node. I believe the community has formed some consensus that this kind of architecture handles scale issues well, judging by approaches like Routed networks.

To achieve this architecture without L2, modifying the DHCP agent cannot be avoided, since its default HA behavior causes critical DB performance issues.

At the same time, I absolutely agree with the comment that was concerned about the unnecessary complexity of a distributed approach like DVR.

So what I suggest is
- Do not modify current DHCP agent behaviors such as the notifier-side API. This does not harm the migration logic.
- Do not change the DHCP HA concept or the L2 agent at all.
- Just add a distributed flag for the DHCP agent, and add host filtering logic to the handler-side RPC (get_active_network_info, get_network_info) only when the DHCP agent is distributed.
- Operators get one small new concept: a distributed DHCP agent serves only the ports within its local hypervisor.
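
To make the host filtering idea concrete, here is a minimal sketch in plain Python. It is not Neutron code: the Port and DhcpAgent classes, the distributed attribute, and the ports_for_agent() helper are all hypothetical names used only for illustration. It only shows that a non-distributed agent keeps seeing every port of the network, while a distributed agent would see just the ports on its own host.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Port:
    # Simplified stand-in for a port; only the fields this sketch needs.
    id: str
    network_id: str
    host: str  # assumption: the host the port is bound to


@dataclass
class DhcpAgent:
    host: str
    distributed: bool = False  # the proposed flag; the name is an assumption


def ports_for_agent(ports: List[Port], network_id: str, agent: DhcpAgent) -> List[Port]:
    """Return the ports of one network that the given agent should serve.

    A non-distributed agent keeps the current behavior and gets every port.
    A distributed agent only gets the ports bound to its own host.
    """
    on_network = [p for p in ports if p.network_id == network_id]
    if not agent.distributed:
        return on_network
    return [p for p in on_network if p.host == agent.host]


if __name__ == "__main__":
    ports = [
        Port("p1", "net-a", "compute-1"),
        Port("p2", "net-a", "compute-2"),
        Port("p3", "net-a", "compute-1"),
    ]
    legacy = DhcpAgent(host="network-node")
    local = DhcpAgent(host="compute-1", distributed=True)
    print([p.id for p in ports_for_agent(ports, "net-a", legacy)])
    print([p.id for p in ports_for_agent(ports, "net-a", local)])
```

In a real change the filter would presumably be applied in the DB query behind the handler-side RPC rather than in Python, so that the server never loads the ports of other hosts for a distributed agent.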
 
With this change we can
- Reduce the performance overhead. I found that the performance penalty is on the DB side (getting ports with get_active_info(), and completing the provisioning step with dhcp_ready_on_ports()). The RPC fanout is minor.
- Introduce a new concept in which the DHCP agent failure domain is split.

Any comments are appreciated.