If you've been following the other VxLAN articles, you will remember that it is an overlay technology. VTEPs are used to encapsulate traffic, and send it over the IP network.
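To make the encapsulation step concrete, here is a minimal Python sketch of the 8-byte VxLAN header as laid out in RFC 7348. The function names are made up for illustration, and the outer Ethernet, IP, and UDP headers that a real VTEP adds are left out.

```python
import struct

VXLAN_UDP_PORT = 4789          # IANA-assigned destination UDP port for VxLAN
VXLAN_FLAG_VNI_VALID = 0x08    # the "I" flag: the VNI field is valid


def build_vxlan_header(vni: int) -> bytes:
    """Return the 8-byte VxLAN header for a 24-bit VNI."""
    if not 0 <= vni < 2 ** 24:
        raise ValueError("VNI must fit in 24 bits")
    # Word 1: flags (8 bits) followed by 24 reserved bits.
    # Word 2: VNI (24 bits) followed by 8 reserved bits.
    return struct.pack("!II", VXLAN_FLAG_VNI_VALID << 24, vni << 8)


def encapsulate(inner_frame: bytes, vni: int) -> bytes:
    """Prepend the VxLAN header to the original Ethernet frame.

    The outer UDP/IP/Ethernet headers are omitted in this sketch.
    """
    return build_vxlan_header(vni) + inner_frame


def parse_vni(vxlan_payload: bytes) -> int:
    """Read the VNI back out of an encapsulated payload."""
    _flags_word, vni_word = struct.unpack("!II", vxlan_payload[:8])
    return vni_word >> 8
```

The 24-bit VNI field is what allows roughly 16 million segments, compared with the 4096 IDs of a 12-bit VLAN tag.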
But have you thought about how the destination MAC addresses are learned? What about the VTEPs? How do the switches know the VTEP IPs, and which one to send to? There are two ways to learn this information: data plane learning and control plane learning.
Data plane learning is the typical flood-and-learn style. It's in the original VxLAN specification and is very similar to the way Ethernet learns addresses.
Control plane learning is more recent and sophisticated. This method uses MP-BGP to share MAC address and VTEP information. This feels similar to how BGP learns routes.
But before getting more detail on these methods, it's important to understand the types of traffic that we may encounter.
Unicast traffic is traffic that is sent to only one destination. This is fairly simple to handle, as the destination is always known in this case.
BUM traffic is the tricky type. This is traffic that has an unknown destination or is sent to more than one destination. This is the important one to understand.
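BUM is an acronym, standing for Broadcast, Unknown Unicast, and Multicast. It is significant, as it is any traffic that must be delivered to more than one place, either because it has several destinations or because the switch does not know where its single destination is.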
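As a rough illustration, a switch could sort frames into these categories by looking only at the destination MAC. This is a simplified sketch: the function names and the `known_macs` set (standing in for the switch's MAC table) are assumptions made for the example.

```python
BROADCAST_MAC = "ff:ff:ff:ff:ff:ff"


def classify_frame(dst_mac: str, known_macs: set) -> str:
    """Classify a frame by its destination MAC address."""
    dst = dst_mac.lower()
    if dst == BROADCAST_MAC:
        return "broadcast"
    # The least significant bit of the first octet marks a group address.
    first_octet = int(dst.split(":")[0], 16)
    if first_octet & 0x01:
        return "multicast"
    # A unicast address the switch has not learned yet.
    if dst not in known_macs:
        return "unknown-unicast"
    return "known-unicast"


def is_bum(dst_mac: str, known_macs: set) -> bool:
    """True for Broadcast, Unknown unicast, and Multicast frames."""
    return classify_frame(dst_mac, known_macs) != "known-unicast"
```

Only the last category, known unicast, can be sent straight to a single destination. Everything else needs one of the BUM handling methods.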
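Consider how Ethernet would handle a broadcast, for example. If traffic needs to go to an IP, but the MAC address is unknown, the host sends an ARP request, which is flooded to the network segment. The owner of the IP will respond, and the rest will ignore the request. Unknown unicast is flooded in the same way: if a switch has no MAC table entry for the destination, it sends the frame out of every port in the segment.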
VxLAN is similar in some ways but uses a few tricks to make it more efficient.
One method is to use multicast. Each VNI is mapped to a multicast group. Each group can have one or more VNIs mapped to it.
When a VTEP comes online, it uses IGMP to join the groups that it needs for the VNIs that it hosts.
Now when the switch needs to send BUM traffic, it can send it to the multicast group. Any VTEP that's part of the group will receive the traffic. Other destinations are not in the group, and will not get the traffic, limiting the scope of flooding. An ARP request, for example, will not go to the entire network, just to the parts it needs to go to.
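The group membership idea can be modelled in a few lines of Python. This is only a sketch of the bookkeeping, not of IGMP or PIM themselves. The class name, the VNI numbers, and the group addresses in the usage note are all hypothetical.

```python
from collections import defaultdict


class MulticastUnderlay:
    """Toy model of VNI-to-group mapping and group membership."""

    def __init__(self, vni_to_group: dict):
        self.vni_to_group = dict(vni_to_group)
        self.members = defaultdict(set)  # group address -> set of VTEP IPs

    def join(self, vtep_ip: str, local_vnis: list) -> None:
        """A VTEP joins the group of every VNI it hosts (IGMP in real life)."""
        for vni in local_vnis:
            self.members[self.vni_to_group[vni]].add(vtep_ip)

    def bum_receivers(self, source_vtep: str, vni: int) -> set:
        """VTEPs that receive BUM traffic sent into this VNI's group."""
        group = self.vni_to_group[vni]
        return self.members[group] - {source_vtep}
```

For example, with VNI 10010 mapped to 239.1.1.10 and VNI 10020 mapped to 239.1.1.20, a VTEP that only hosts VNI 10020 never sees an ARP request flooded in VNI 10010. If two VNIs share one group, a VTEP may receive traffic for a VNI it does not host and has to drop it.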
This is a very efficient way to handle BUM traffic, and it scales well. It does require a multicast infrastructure, which may make it a little more tricky to implement.
The other way to handle BUM traffic is called Head End Replication. This is only available when using control plane learning, as it needs to know where the VTEPs are. The reasons behind this will make more sense as you go through this article.
When BUM traffic needs to be sent, the switch creates a copy of the traffic for each relevant destination VTEP. This results in the duplication of traffic, so it's less efficient than multicast. It also won't scale as well. If you have 20 or more VTEPs, you should be using multicast.
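The duplication is easy to see in a small sketch. Here `vni_peers` is a hypothetical table of the VTEPs known for each VNI, which in practice would be populated by the control plane.

```python
def head_end_replicate(frame: bytes, source_vtep: str, vni: int,
                       vni_peers: dict) -> list:
    """Return one (destination VTEP, frame) pair per remote VTEP in the VNI.

    Each pair represents a separate unicast copy that the ingress
    switch has to build and send.
    """
    remotes = sorted(vni_peers.get(vni, set()) - {source_vtep})
    return [(remote, frame) for remote in remotes]
```

With N VTEPs in a VNI, the ingress switch sends N - 1 copies of every BUM frame. With multicast it sends one, and the network replicates it where the paths diverge.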
The advantage of this method is that it's very simple to configure.
Flood and Learn was the original learning method for VxLAN. It is also known as bridging, as it's used to create virtual bridges (VNIs) between hosts.
The other reason this is called bridging is that this is a layer-2 only solution. There is no built-in way to route between VNIs. If you need this, you must connect an external router, and let traffic hairpin through it.
The flooding nature of this method limits its scalability. Also, as there is no control plane learning of VTEPs, it's possible for a rogue VTEP to be added to the network. It could intercept and inject traffic.
BUM traffic must be handled by multicast. There is no option for Head End replication in this case.
While this method is still worth understanding, it is recommended to use control plane learning in production.
Let's take a moment to think about how this process works. Imagine a host needs to send traffic to a remote host in the same VNI.
The host knows the IP address of the destination, but not its MAC. It will send an ARP request (BUM traffic) to discover this information.
The ARP request reaches the switch. The switch needs to send it to all VTEPs with this VNI, as well as to all local ports with this VNI. To do this, it sends the ARP to the appropriate multicast group.
When the remote VTEPs receive the ARP, they cache the sender's MAC address against the IP address of the VTEP it arrived from, for possible use later. The VTEPs then forward the ARP to all local ports that belong to the VNI.
One of the hosts will respond to the ARP, while the rest will discard the request. The response is sent to the switch. This is unicast traffic, as it has a single destination.
The responding host's VTEP encapsulates the response and sends it to the original VTEP, whose address it cached when the ARP request arrived.
The original VTEP receives the response and passes it on to the host. The host then begins normal unicast communication with the destination host.
If the cache expires, the flooding process begins again.
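The whole flood-and-learn cycle, including the cache expiry, can be sketched as a small table with ageing. The class name and the 300-second `max_age` are assumptions made for the example. Real platforms have their own timers.

```python
class FloodAndLearnVtep:
    """Toy VTEP that learns remote MACs from the data plane."""

    def __init__(self, ip: str, max_age: float = 300.0):
        self.ip = ip
        self.max_age = max_age  # hypothetical ageing timer, in seconds
        self.table = {}         # (vni, mac) -> (remote VTEP IP, learned_at)

    def learn(self, vni: int, src_mac: str, remote_vtep_ip: str,
              now: float) -> None:
        """Cache the inner source MAC against the outer source IP."""
        self.table[(vni, src_mac)] = (remote_vtep_ip, now)

    def lookup(self, vni: int, dst_mac: str, now: float):
        """Return the remote VTEP for a MAC, or None if unknown or expired."""
        entry = self.table.get((vni, dst_mac))
        if entry is None:
            return None
        remote, learned_at = entry
        if now - learned_at > self.max_age:
            del self.table[(vni, dst_mac)]
            return None
        return remote

    def forward(self, vni: int, dst_mac: str, now: float) -> tuple:
        """Unicast to a known VTEP, otherwise flood to the multicast group."""
        remote = self.lookup(vni, dst_mac, now)
        if remote is not None:
            return ("unicast", remote)
        return ("flood", None)
```

The first frame towards an unlearned MAC is flooded. Once the remote VTEP has seen traffic from the sender, replies go back as unicast until the entry ages out.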
BGP operates in the control plane. A normal BGP deployment will share IP reachability information (routes). When integrated with VxLAN, it can also share MAC and VTEP reachability information.
It does this with the EVPN (Ethernet VPN) address family. If you're not familiar with address families, they are a way for BGP to carry information for different protocols. There are address families for IPv4, IPv6, L3VPN (MPLS), and others.
As addresses are learned proactively, there's far less need for flooding.
All switches in the VxLAN topology need to run BGP EVPN. They don't all need to be running VTEPs. An example is the spine switches in the spine/leaf topology.
The switches peer with each other. The standard BGP rules apply here. This means that you need an iBGP full mesh or route reflectors. The spine switches are good candidates for route reflectors.
When a host comes online, it announces its MAC address. This can also happen at other times with GARP messages. The local switch will add the MAC into the local BGP database. This is then sent to its peers as a BGP update.
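Here is a simplified model of that advertisement. In EVPN this is carried as a MAC/IP advertisement (route type 2). The sketch keeps only a few fields and skips route distinguishers, route targets, and route reflection. The class and attribute names are invented for illustration.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class MacIpRoute:
    """Simplified stand-in for an EVPN MAC/IP advertisement."""
    mac: str
    ip: Optional[str]
    l2vni: int
    next_hop_vtep: str


class EvpnSpeaker:
    """Toy BGP EVPN speaker that peers directly with other speakers."""

    def __init__(self, vtep_ip: str):
        self.vtep_ip = vtep_ip
        self.routes = {}  # (l2vni, mac) -> MacIpRoute
        self.peers = []   # other EvpnSpeaker objects

    def learn_local_host(self, mac: str, ip: Optional[str], l2vni: int) -> None:
        """Add a locally learned host and send it to every peer."""
        route = MacIpRoute(mac, ip, l2vni, self.vtep_ip)
        self.routes[(l2vni, mac)] = route
        for peer in self.peers:
            peer.receive_update(route)

    def receive_update(self, route: MacIpRoute) -> None:
        """Install a route learned from a peer."""
        self.routes[(route.l2vni, route.mac)] = route
```

After one update, a remote switch knows both the host's MAC and the VTEP to send to, without any traffic having been flooded.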
When VTEPs are learned through BGP, they are dynamically added to a whitelist. This prevents rogue VTEP injection. BGP neighbour authentication can be used to prevent rogue peers.
A major advantage to control plane learning is ARP Suppression. As we saw earlier, ARP messages are flooded through the network. The behaviour is quite different now.
ARP still exists of course. Hosts are unaware of VxLAN and BGP, and will still maintain their own tables of MAC to IP bindings. But after they send an ARP, the behaviour is different.
When the ARP request reaches the local switch, the switch looks at its BGP database. It sees the information it needs and responds to the host. The ARP doesn't get any further than the local switch.
The strange exception to this is when there are silent hosts. A silent host is a host that does not announce its presence. There aren't too many of these, but they do exist.
As they don't announce themselves, they can't be proactively learned. In this case, an ARP request will still need to be flooded to all VTEPs in the VNI. When a response is seen, the MAC to IP information is gleaned, and added into BGP.
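The decision the local switch makes can be sketched as a single lookup. Here `evpn_ip_to_mac` is a hypothetical dictionary standing in for the IP to MAC bindings learned through BGP.

```python
def handle_arp_request(target_ip: str, vni: int, evpn_ip_to_mac: dict) -> tuple:
    """Decide what the local switch does with an ARP request.

    evpn_ip_to_mac maps (vni, ip) -> mac for hosts learned through BGP.
    """
    mac = evpn_ip_to_mac.get((vni, target_ip))
    if mac is not None:
        # Known host: answer on its behalf, nothing leaves the switch.
        return ("reply-locally", mac)
    # Silent or not-yet-learned host: fall back to flooding in the VNI.
    return ("flood", None)
```

The second branch is the silent host case. Once the host answers the flooded request, its binding goes into the table, and the next ARP for it is suppressed.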
Good news! Integrated Routing and Bridging (IRB) is supported when using BGP. This means that each switch with a VTEP can also be a router. There is no need to hairpin traffic through an external router. Not all platforms support IRB, so choose wisely.
VNIs can now be considered layer-2 VNIs (L2VNI) or layer-3 VNIs (L3VNI). BGP distributes both of these to all peers.
The L2VNI is the bridge domain. This is for bridging hosts on the same layer-2 segment. Essentially, it is the VxLAN equivalent of a VLAN.
An L3VNI can be used to route between L2VNIs. With Symmetric IRB, both the ingress and egress VTEPs perform routing, and routed traffic crosses the network in the L3VNI. Another form of routing, called Asymmetric IRB, uses the ingress VTEP for routing and bridging, while the egress VTEP only does bridging. Not a lot of vendors support asymmetric IRB.
A VTEP needs to know about all locally used L2VNIs. It does not need to know about any L2VNIs that it doesn't need to support. A VTEP also needs to know about all L3VNIs in use across the network.
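One way to see the difference between the two IRB models is to ask which VNI ends up in the VxLAN header while the packet crosses the network. The function below is a sketch of that rule. The name and arguments are made up for the example.

```python
def transit_vni(mode: str, src_l2vni: int, dst_l2vni: int, l3vni: int) -> int:
    """Return the VNI carried in the VxLAN header between the two VTEPs."""
    if src_l2vni == dst_l2vni:
        # Same segment: plain bridging, no routing involved.
        return src_l2vni
    if mode == "symmetric":
        # Ingress routes into the L3VNI, egress routes out of it.
        return l3vni
    if mode == "asymmetric":
        # Ingress routes and bridges straight into the destination L2VNI.
        return dst_l2vni
    raise ValueError(f"unknown IRB mode: {mode}")
```

This is also why an asymmetric ingress VTEP needs the destination L2VNI configured locally, while a symmetric one only needs the tenant's L3VNI.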
Each L3VNI can be associated with a VRF. This makes multitenancy possible, in a similar way to MPLS.
Each VRF is still configured with a Route Distinguisher to keep it unique. Reachability information is imported and exported with Route Targets. Remember that you will need extended communities for this.
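The import and export logic works the same way as in MPLS L3VPN, and can be sketched as set matching on route targets. The RD and RT values in the usage note are hypothetical, as are the helper names.

```python
from dataclasses import dataclass, field


@dataclass
class Vrf:
    name: str
    rd: str  # Route Distinguisher, keeps the VRF's routes unique
    import_rts: set = field(default_factory=set)
    export_rts: set = field(default_factory=set)
    routes: list = field(default_factory=list)


def export_route(vrf: Vrf, prefix: str) -> dict:
    """Tag a prefix with the VRF's RD and export Route Targets."""
    return {"rd": vrf.rd, "prefix": prefix, "rts": set(vrf.export_rts)}


def import_route(vrf: Vrf, advert: dict) -> bool:
    """Install the prefix only if an RT matches the VRF's import list."""
    if advert["rts"] & vrf.import_rts:
        vrf.routes.append(advert["prefix"])
        return True
    return False
```

If tenant A exports with RT 65000:50001, only VRFs that import 65000:50001 install its routes. Tenant B, importing 65000:50002, never sees them, even though both tenants share the same BGP sessions.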
Each tenant can have one L3VNI, and many L2VNIs.
Now that routing is supported, each switch with a VTEP can be a default gateway. To make this better, this is an anycast gateway.
This means that every switch will have an IP address in every routable VNI. All hosts in the VNI will use that IP as their default gateway. The IP address within the VNI is the same on every switch that supports the VNI. This means that a host can connect to any switch, and use the same default gateway.
The switches also use the same vMAC, so hosts don't have to wait for their MAC table entries to expire if they move.
This is better than VRRP or HSRP, as there can be many different switches in the solution. Also, there are no hello messages, priorities, preemption concerns, and so on.
If a switch receives a frame, and the destination MAC is the anycast gateway, the switch knows that routing needs to occur.
The frame will be encapsulated with the layer-3 VNI, which identifies the tenant's VRF.
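The route-or-bridge decision can be sketched as a comparison against the shared gateway MAC. The vMAC value below is a made-up, locally administered address, and the sketch assumes the destination sits behind a remote VTEP.

```python
# Hypothetical shared gateway vMAC, identical on every switch.
ANYCAST_GATEWAY_MAC = "02:00:00:00:00:01"


def forwarding_action(dst_mac: str, ingress_l2vni: int,
                      tenant_l3vni: int) -> tuple:
    """Decide whether to route or bridge, and which VNI to encapsulate with."""
    if dst_mac.lower() == ANYCAST_GATEWAY_MAC:
        # Addressed to the gateway: route, and use the tenant's L3VNI.
        return ("route", tenant_l3vni)
    # Addressed to another host in the segment: bridge in the L2VNI.
    return ("bridge", ingress_l2vni)
```

Because every switch answers to the same gateway IP and vMAC, a host that moves to another switch keeps sending to the same address and gets the same treatment.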
Last update 2018-05-10 12:17