Enable openflow based dvr routing for east/west traffic

Bug #1509184 reported by sean mooney
Affects: neutron
Status: Won't Fix
Importance: Undecided
Assigned to: sean mooney
Milestone: (none)

Bug Description

In the Juno cycle, dvr support was added to neutron to decentralise routing to the compute nodes.
This RFE bug proposes the introduction of a new dvr mode (dvr_local_openflow) to optimise the datapath
of east/west traffic.

-----------------------------------------------High level description-------------------------------
The current implementation of DVR with ovs utilizes linux network namespaces to instantiate l3
routers, the details of which are described here: http://docs.openstack.org/networking-guide/scenario_dvr_ovs.html

Fundamentally a neutron router consists of 3 elements:
- a router instance (network namespace)
- a router interface (tap device)
- a set of routing rules (kernel ip routes)

In the special case of routing east/west traffic, both the source and destination interfaces are known to neutron.
Because of that, neutron contains all the information required to logically route traffic from its origin to its destination,
enabling the path to be established proactively. This proposal suggests moving the instantiation of the dvr local router from the kernel ip stack to Open vSwitch (ovs) for east/west traffic.

Open vSwitch provides a logical programmable interface (OpenFlow) to configure traffic forwarding and modification actions on arbitrary packet streams. When managed by the neutron openvswitch l2 agent, ovs operates as a simple mac learning switch with limited utilisation of its programmable dataplane. To utilise ovs to create an l3 router, the following mappings from the 3 fundamental elements can be made:
- a router instance (network namespace + an ovs bridge)
- a router interface (tap device + patch port pair)
- a set of routing rules (kernel ip routes + openflow rules)

----------------------------------------background context---------------------------------------------
TL;DR
Basic explanation of openflow/ovs bridges and patch ports;
skip to the implementation section if familiar.

ovs implementation background:
In Open vSwitch, at the control layer an ovs bridge is a unique logical domain of interfaces and flow rules.
Similarly, at the control layer a patch port pair is a logical entity that interconnects two bridges (or logical domains).

From a dataplane perspective, each ovs bridge is first created as a separate instance of a dataplane.
If these separate bridges/dataplanes are interconnected by patch ports, ovs will collapse the independent dataplanes into a single
ovs dataplane instance. As a direct result of this implementation, a logical topology of 1 bridge with two interfaces is realised at the dataplane level identically to 2 bridges each with 1 interface interconnected by patch ports. This translates to zero dataplane overhead for the creation of multiple bridges, allowing arbitrary numbers of router instances to be created.
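
As a minimal sketch (bridge and port names here are illustrative, not part of the proposal), two bridges joined by a patch port pair can be created as follows; ovs will merge them into one dataplane instance as described above:

    # sketch: two logically separate bridges collapsed into one dataplane
    import subprocess

    def vsctl(*args):
        subprocess.run(["ovs-vsctl", *args], check=True)

    vsctl("add-br", "br-a")
    vsctl("add-br", "br-b")
    # a patch port pair: each side names the other as its peer
    vsctl("add-port", "br-a", "patch-a2b", "--",
          "set", "Interface", "patch-a2b", "type=patch", "options:peer=patch-b2a")
    vsctl("add-port", "br-b", "patch-b2a", "--",
          "set", "Interface", "patch-b2a", "type=patch", "options:peer=patch-a2b")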

Openflow capability background:
The openflow protocol provides many capabilities which can be generally summarised as packet match criteria and actions to apply
when the criteria are satisfied. In the case of l3 routing, the match criteria of relevance are the Ethernet type and the destination ip address. Similarly, the openflow actions required are mod_dest, set_field, move, dec_ttl, output and drop.
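
For example (the bridge name, addresses and port number below are purely illustrative), a single l3 "route" toward one destination can be expressed as a flow that matches the ip ethertype plus the destination ip, rewrites the macs, decrements the ttl and outputs the packet:

    # sketch: an l3 hop expressed as a single openflow rule
    import subprocess

    flow = ("priority=100,ip,nw_dst=10.1.0.5,"
            "actions=mod_dl_src:fa:16:3e:aa:bb:01,"   # router interface mac
            "mod_dl_dst:fa:16:3e:cc:dd:05,"           # destination vm mac
            "dec_ttl,output:2")                       # forward to the vm's port
    subprocess.run(["ovs-ofctl", "add-flow", "br-router", flow], check=True)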

logical packet flow for a ping between two vms on the same host:
In the l2 case, if a vm tries to ping another vm in the same subnet there are 4 stages.
- first, it sends a broadcast arp request to learn the mac address for the destination ip of the remote vm.
- second, the destination vm receives the arp request, learns the source vm's mac, then replies as follows:
    a.) swap the source and destination ip of the arp packet
    b.) copy the source mac address to the destination mac address and set the source mac address to the local interface mac
    c.) set the arp type code from request to reply
    d.) transmit the reply via the received interface
- third, on receiving the arp reply, the source vm transmits the icmp packet to the destination vm with the learned mac address.
- fourth, on receiving the icmp, the destination vm replies.
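
Expressed concretely, here is a minimal sketch (using scapy, with a hypothetical local mac) of the request-to-reply transformation in steps a.) to d.) above; it is only illustrative and not part of the proposal:

    # sketch: turning an arp request into the corresponding reply
    from scapy.all import ARP, Ether

    LOCAL_MAC = "fa:16:3e:cc:dd:05"   # hypothetical mac of the replying vm

    def arp_reply(request):
        req = request[ARP]
        return Ether(src=LOCAL_MAC, dst=request[Ether].src) / ARP(
            op=2,               # c.) change the arp type code from request to reply
            hwsrc=LOCAL_MAC,    # b.) local interface mac becomes the source mac
            hwdst=req.hwsrc,    # b.) requester's mac becomes the destination mac
            psrc=req.pdst,      # a.) swap the source and destination ip
            pdst=req.psrc,
        )                       # d.) the reply is sent back via the received interface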

In the l3 case the packet flow is similar but slightly different.
- first, the source vm sends an arp request to the subnet gateway.
- second, the gateway router responds with its mac address.
- third, the source vm sends the icmp packet to the router.
- fourth, the router receives the icmp packet and sends an arp request to the destination vm.
- fifth, the destination vm sends an arp reply to the gateway.
- sixth, the router forwards the icmp to the destination vm.
- seventh, the destination vm replies to the router.
- eighth, the reply is received by the source vm.

----------------------------------current implementation---------------------------------------------------

l3 ping packet flow in dvr_local mode (simplified to ignore broadcast):
logical:
- the arp packet is received from the source vm and logically vlan tagged (tenant isolation)
- the arp packet is output to the router tap device (tap1), the vlan is stripped and the packet is copied from the ovs dataplane to the
  kernel networking stack in the router's linux namespace.
- the kernel network stack replies to the arp; the reply packet is copied to the ovs dataplane and is logically vlan tagged
- the vlan is logically stripped and the arp reply is switched to the source vm interface.
- the icmp packet is received from the source vm and logically vlan tagged (tenant isolation)
- the icmp packet is output to the router tap device, the vlan is stripped and the packet is copied from the ovs dataplane to the
  kernel networking stack in the router's linux namespace.
- the kernel generates an arp request to the destination vm which follows the same path as the arp described above
- the kernel modifies the dest mac address, decrements the ttl and routes the packet to the appropriate tap device (tap2), where the packet is copied to the ovs dataplane and is logically vlan tagged
- the vlan is logically stripped and the icmp packet is switched to the destination vm interface.
- the reply path is similar and is shortened as follows:
   dest vm -> vlan tagged -> vlan stripped -> copied to kernel namespace via tap2 -> copied to ovs dataplane via tap1 -> vlan tagged -> vlan stripped -> received by source vm.

actual:
- arp from source vm -> tap1 (vlan tagging skipped) + broadcast to other ports
- tap1 -> kernel network stack
- kernel sends arp reply to tap1
- tap1 -> source vm (vlan tagging skipped)
- icmp from source vm -> tap1 (vlan tagging skipped)
- kernel receives icmp on tap1 and sends arp request to dest vm via tap2 (broadcast)
- arp via tap2 -> dest vm (vlan tagging skipped)
- dest vm replies -> tap2
- kernel updates dest mac, decrements ttl, then forwards the icmp packet to tap2
- tap2 -> dest vm -> dest vm replies -> tap2 (vlan tagging skipped)
- kernel updates dest mac, decrements ttl, then forwards the icmp reply packet to tap1
- tap1 -> source vm

-------------------------------------proposed change----------------------------------------------------------
Proposed change:
- a new class will be added to implement the new mode that subclasses the existing
  dvr_local router class.
- if the mode is dvr_local_openflow, a routing bridge will be created for each dvr router.
- when an internal network is added to the router, the following actions will be performed (a sketch of rules d, e and f is shown after this list):
  a.) the tap interface will be created in the router network namespace as normal but added
        to the routing bridge instead of the br-int (tap devices are only used for north/south traffic)
  b.) a patch port pair will be created between the br-int and the routing bridge
  c.) the attached-mac, iface-id and iface-status will be populated in the external-ids field of the br-int side of the patch port.
       this will enable the unmodified neutron l2 agent to correctly manage the patch port.
  d.) a low priority rule that sends all traffic from the patch port to the tap device will be added to the routing bridge.
  e.) a medium priority rule that replies to all arp requests to the router will be added to the routing bridge.
        this rule will use openflow's move and set_field actions to rewrite the arp request into a reply and output=in_port.
  f.) a high priority dest mac update and ttl decrement rule will be added to the routing bridge for each port
       on the internal network.
- when an external network is added to the router, the workflow will be unchanged and is inherited from the dvr_local
  implementation.
- the _update_arp_entry function will be extended to additionally populate and delete the high priority dest mac update rules
  as neutron ports are added/removed from connected networks.
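
As a rough sketch of rules d.), e.) and f.) above (the bridge name, port numbers, macs and ip addresses are illustrative placeholders; the real flows would be generated by the agent):

    # sketch: the three rule levels installed on a per-router routing bridge
    import subprocess

    BRIDGE = "br-router-1"          # hypothetical per-router routing bridge

    def add_flow(flow):
        subprocess.run(["ovs-ofctl", "add-flow", BRIDGE, flow], check=True)

    # d.) low priority: anything not matched below is punted to the router
    #     tap device (the kernel namespace), e.g. north/south traffic
    add_flow("priority=1,in_port=1,actions=output:2")

    # e.) medium priority: answer arp requests for the router gateway ip by
    #     rewriting the request into a reply (nicira move/load actions) and
    #     sending it back out the port it arrived on
    add_flow("priority=50,arp,arp_tpa=10.1.0.1,"
             "actions=move:NXM_OF_ETH_SRC[]->NXM_OF_ETH_DST[],"
             "mod_dl_src:fa:16:3e:aa:bb:01,"
             "load:0x2->NXM_OF_ARP_OP[],"
             "move:NXM_NX_ARP_SHA[]->NXM_NX_ARP_THA[],"
             "move:NXM_OF_ARP_SPA[]->NXM_OF_ARP_TPA[],"
             "load:0xfa163eaabb01->NXM_NX_ARP_SHA[],"
             "load:0x0a010001->NXM_OF_ARP_SPA[],"
             "in_port")

    # f.) high priority: one rule per port on the connected internal networks
    #     that rewrites the macs, decrements the ttl and outputs toward that port
    add_flow("priority=100,ip,nw_dst=10.1.1.5,"
             "actions=mod_dl_src:fa:16:3e:aa:bb:02,"
             "mod_dl_dst:fa:16:3e:cc:dd:05,dec_ttl,output:3")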

l3 packet flow in dvr_local_openflow mode:

logical:
- the arp packet is received from the source vm and logically vlan tagged (tenant isolation)
- the arp packet is output to the router bridge patch port, the vlan is stripped
- the arp request is rewritten into a reply, sent back to the br-int and logically vlan tagged
- the vlan is logically stripped and the arp reply is switched to the source vm interface.
- the icmp packet is received from the source vm and logically vlan tagged (tenant isolation)
- the icmp packet is output to the router bridge patch port, the vlan is stripped.
- the icmp packet matches the high priority rule, its destination mac is updated, then it is output to the second patch port and logically vlan tagged
- the vlan is logically stripped and the icmp packet is switched to the destination vm interface.
- the reply path is similar and is shortened as follows:
   dest vm -> vlan tagged -> vlan stripped -> router bridge via patch 2 -> dest mac and ttl updated then output patch 1 -> vlan tagged -> vlan stripped -> received by source vm.

actual:
- arp from source vm -> arp rewritten to reply -> sent to source vm (single openflow action).
- icmp from source vm -> destination mac updated, ttl decremented -> dest vm (single openflow action)
- icmp from dest vm -> destination mac updated, ttl decremented -> source vm (single openflow action)

other considerations:

- north/south
    as ovs cannot look up the destination mac dynamically via arp, it is not possible to optimise the
    north/south path as described above.

- openvswitch support
    this mechanism is compatible with both kernel and dpdk ovs.
    this mechanism requires the nicira extensions for the arp rewrite.
    the arp rewrite can be skipped for broader compatibility if required, as it will fall back to the tap device and kernel.
    icmp traffic for the router interface will be handled by the tap device, as ovs currently does not
    support setting the icmp type code via the set_field or load openflow actions.

- performance
   performance of l3 routing is expected to approach l2 performance for east/west traffic.
   performance is not expected to change for north/south.

Tags: rfe
Changed in neutron:
assignee: nobody → sean mooney (sean-k-mooney)
Revision history for this message
Rossella Sblendido (rossella-o) wrote :

This is a spec already, :) It would be good to copy the part where you explain how things work in the Neutron documentation. I am sure it would help many people. The idea is interesting, it would make the model much more complex though and more difficult to debug. If we could get rid of the router namespace...then I'd go for it but I can't think of any way to do that if we want to handle floating ips, etc

Revision history for this message
sean mooney (sean-k-mooney) wrote :

Hi, yes, sorry it is so long. I originally wrote it as a draft spec before the summit with the hope of discussing it there, but I never found time to bring it up or to rework it into the actual spec format.

Yes, I originally wanted to rework this into either a blog post or a wiki page to describe some of the basic ovs concepts, but a neutron doc entry might make sense too.

When I originally started looking at this I wanted to remove the linux namespace entirely.
It may be possible to extend this to handle north/south traffic and remove the namespace when some of the changes Ben Pfaff is working on for ovs are merged.

Currently Ben Pfaff is looking at adding the ability to generate an arp from an arbitrary packet as a flow action.
OVN will use this internally to implement its l3 router implementation.
With that we will be able to generate arps for ip addresses we (neutron) do not know about and use the learn action to add dynamic arp entries.

In the above description I make use of the namespace to work around the arp lookup limitation in ovs, which hopefully will be resolved soon.
I agree the mixed internal and kernel/namespace routing is not elegant and increases the complexity.
I hope to be able to simplify it, but I need to revisit it in more detail as I have not looked at my prototype implementation since July.

I think floating ips could be supported by adding a learn-action-based rule to the router bridge that would translate from the floating ip to the private ip and then learn a reverse flow to convert the reply to use the floating ip instead of the private ip.
I would have to do some testing to confirm, but I believe it would be possible.

Revision history for this message
Carl Baldwin (carl-baldwin) wrote :

Wasn't this how dragonflow got started?

Revision history for this message
Kyle Mestery (mestery) wrote :

Making the model more complex doesn't sound good at all, given how complex and hard to debug the current model is.

Revision history for this message
sean mooney (sean-k-mooney) wrote :

Actually, thinking about this a little, I think I know how to make this much simpler.
I will create an etherpad and describe the simplified version, as it is a little long for a bug.
Essentially I think I know how to completely remove the network namespace, which will
simplify the model rather than making it more complex.
I have not had much time to work on this lately but hope to get back to it in the next week or two.

Revision history for this message
Armando Migliaccio (armando-migliaccio) wrote :

Dragonflow came to my mind just as it did to Carl's. Enabling DVR via openflow is not exclusive to Dragonflow though, and there are other examples (like OpenDaylight). So does this need to be exclusively handled in the existing core? Probably not, and I would encourage you to collaborate with/check dragonflow to see if it fits your bill. I didn't see you comment on how your proposed approach is different.

Revision history for this message
Armando Migliaccio (armando-migliaccio) wrote :

Unless I am missing something, this can be handled by working in dragonflow.

Changed in neutron:
status: New → Won't Fix
Revision history for this message
sean mooney (sean-k-mooney) wrote :

Hi Armando, I have been spread rather thin lately and have not been able to revisit this until now.
While I didn't want to tie this to ovs-dpdk, as I did all my prototyping with kernel ovs, this was originally intended to improve the performance of ovs with dpdk when managed by the ovs agent with dvr enabled.

I will look into Dragonflow and see if it can be used in an ovs-dpdk based deployment.
I have to admit I know of its existence but have not had time to look into how it works.
Can it be used to provide l3 only, in place of the standard l3 agent?

I am also aware of the odl based routing solution, though I wanted to ensure that the reference implementation
in neutron had efficient dvr support when using ovs with dpdk.

The dpdk enhancements to ovs to date have focused solely on the physical interfaces and dpdk based northbound interfaces like ivshmem or vhost-user. As a result the ovs internal ports (kernel tun devices) still use the generic netdev implementation. The standard netdev datapath (without dpdk) in ovs is single threaded; as a result all packets to/from these kernel tun devices must be copied to/from user space via the ovs vswitchd main thread.

ovs-dpdk is optimized to accelerate transferring packets from the nic to the vm and back, but not to the host kernel. As a result, if using the neutron reference agents/drivers, the most efficient fault tolerant deployment is currently to use neutron ha routers with standard ovs on dedicated network nodes, with ovs-dpdk used on the compute nodes.

ovn and odl should both be able to support this openflow based dvr if they are deployed, but I am just concerned about the case where you want dvr without a controller.

Armando, if you don't mind, I may ask to reopen this RFE after m2 (probably targeting the n release at that point)
if I cannot find a solution that does not require an extension to the core dvr.

A simple solution would be to multi-thread the handling of ovs local ports in the standard netdev implementation, as it is in the kernel implementation, or just move the processing off the main thread, which is already busy doing other things.
I will pursue this in parallel with our vswitch team and see if it is viable and what the openvswitch community's
opinion is. An openflow based implementation would be faster, but I understand the need for a balance between performance and usability.

regards
sean.

Revision history for this message
sean mooney (sean-k-mooney) wrote :

Actually, Armando, I just realized I could probably support my use case rather easily via the ml3 work:
https://bugs.launchpad.net/neutron/+bug/1461133
https://blueprints.launchpad.net/neutron/+spec/multi-l3-backends

Assuming that work is on track to be completed by mitaka-2, I could leverage it by providing an ml3 driver as part of the
networking-ovs-dpdk repo. That way the openflow dvr implementation would not have to be part of core neutron, and it would provide a clean mechanism for this extensibility without complicating the existing dvr model.

If the community found value in the driver I could contribute it to neutron core at a later date.

My end goal is to find a clean solution to enable ovs-dpdk with security groups, dvr, qos, vhost-user and sfc.
The last item could take a while, but I think the first four (sec groups, dvr, qos, vhost-user) should be doable by the
end of mitaka, either via the neutron ovs agent or a controller like odl, ovn or dragonflow.

Anyway, thanks for everyone's time.
