TIP: An Ethernet Ethertype of 0x0800 indicates that the payload is an IPv4 packet.
TIP: VXLAN requires a maximum transmission unit (MTU) of at least
1,600 bytes to support IPv4 and IPv6 guest traffic.
TIP: The VLAN tag in the layer 2 Ethernet frame exists if the
port group that your VXLAN VMkernel port is connected to has an associated VLAN
number. When the port group is associated with a VLAN number, the port group
tags the VXLAN frame with that VLAN number.
VXLAN (Virtual Extensible LAN) is an overlay between the ESXi hosts. It's an Ethernet-in-IP overlay technology, where the original layer 2 frames are encapsulated in a User Datagram Protocol (UDP, destination port 4789) packet and delivered over a transport network. It gives us the ability to build proper micro-segmentation, and it doesn't have the numbering limitation that VLANs do. Since VXLAN brings its own encapsulation, VLAN configuration becomes irrelevant. The VXLAN modes are Unicast, Hybrid and Multicast (this refers to the control traffic), and they impact the Teaming Type decision. In one of the later posts I will explain the concepts of the Logical Switch and the Transport Zone, where the control traffic transport will be better understood.
A Virtual Tunnel End Point (VTEP) is an entity that encapsulates an Ethernet frame in a VXLAN frame, or de-encapsulates a VXLAN frame and forwards the inner Ethernet frame. In vSphere, the VTEP is the VMkernel interface that serves as the endpoint for encapsulation and de-encapsulation of VXLAN traffic.
We can have up to 16 million VXLANs, not only 4,096 as is the case for L2 VLANs. Where the VLAN header stores a 12-bit VLAN ID, the VXLAN header carries a 24-bit network identifier, which allows for roughly 16 million (2^24 = 16,777,216) potential logical networks.
VXLAN implementation has two simple networking requirements:
- The network MTU needs to be set to a minimum of 1,600 bytes, because VXLAN adds around 50 bytes of headers (see the calculation after this list). Jumbo frames (9,000 bytes) might be a good solution here.
- We need to assign at least one VLAN.
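To see where the 1,600-byte figure comes from, here is a quick back-of-the-envelope calculation in Python; the header sizes assume an IPv4 underlay with no outer VLAN tag:

# Rough VXLAN overhead arithmetic (assumed IPv4 underlay; IPv6 adds 20 more bytes).
OUTER_ETHERNET = 14   # outer Ethernet header
OUTER_IPV4     = 20   # outer IPv4 header
OUTER_UDP      = 8    # outer UDP header (destination port 4789)
VXLAN_HEADER   = 8    # VXLAN header carrying the 24-bit VNI

overhead  = OUTER_ETHERNET + OUTER_IPV4 + OUTER_UDP + VXLAN_HEADER   # 50 bytes
guest_mtu = 1500                       # standard MTU inside the guest
required  = guest_mtu + overhead       # 1550 bytes on the transport network
print(f"Overhead: {overhead} bytes, transport MTU needed: at least {required}")
# Rounding up to 1,600 leaves headroom for an IPv6 underlay and optional VLAN tags.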
vSphere hosts use VMkernel interfaces to communicate over VXLAN. When you configure VXLAN on a cluster, NSX creates the VMkernel interfaces called VTEPs (VXLAN Tunnel End Points). The number of VTEPs per host depends on the number of NICs and the teaming type.
Directly Connected Networks
If I have three VMkernel port groups defined with the following IP information:
vmk0: 10.1.0.1 / 255.255.255.0
vmk1: 10.1.1.1 / 255.255.255.0
vmk2: 10.1.2.1 / 255.255.255.0
Then vmk0 will be used to talk to everything on 10.1.0.0, vmk1 for 10.1.1.0, and vmk2 for 10.1.2.0.
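Just to illustrate the idea, here is a minimal Python sketch of that "directly connected" decision, using the three example interfaces above (the ipaddress module stands in for the VMkernel stack):

import ipaddress

# The three VMkernel interfaces from the example above.
vmk_interfaces = {
    "vmk0": ipaddress.ip_interface("10.1.0.1/24"),
    "vmk1": ipaddress.ip_interface("10.1.1.1/24"),
    "vmk2": ipaddress.ip_interface("10.1.2.1/24"),
}

def pick_local_interface(destination: str):
    """Return the vmk whose directly connected subnet contains the destination."""
    dest = ipaddress.ip_address(destination)
    for name, iface in vmk_interfaces.items():
        if dest in iface.network:
            return name
    return None   # not directly connected; the routing table takes over

print(pick_local_interface("10.1.1.42"))    # vmk1
print(pick_local_interface("192.168.5.9"))  # None -> needs a gateway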
Remote Networks
So, what happens when the device I am talking to is on a subnet that I am not directly connected to? This is where the routing table really comes into play, so let's take a look at it using:
vicfg-route --list
VMkernel Routes:
Network          Netmask          Gateway
10.1.0.0         255.255.255.0    Local Subnet
10.1.1.0         255.255.255.0    Local Subnet
10.1.2.0         255.255.255.0    Local Subnet
default          0.0.0.0          10.1.0.254
We
see the directly connected networks with a Gateway of Local Subnet. This
describes the direct communication that we discussed in Directly Connected
Networks. The last line is a result of our configuration of the “VMkernel
Default Gateway” when setting up the vmkernel port group. What it says is send
everything else to the router at 10.1.0.254. The router is in the 10.1.0.0
network and since vmk0 is directly connected to that subnet we know that it
will be used for all non-local traffic.
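The same lookup can be sketched in a few lines of Python; the routes below mirror the vicfg-route output above, and the longest-prefix match is a simplification of what the VMkernel really does:

import ipaddress

# Simplified view of the VMkernel routing table shown above.
routes = [
    (ipaddress.ip_network("10.1.0.0/24"), "Local Subnet"),
    (ipaddress.ip_network("10.1.1.0/24"), "Local Subnet"),
    (ipaddress.ip_network("10.1.2.0/24"), "Local Subnet"),
    (ipaddress.ip_network("0.0.0.0/0"),   "10.1.0.254"),   # default route
]

def next_hop(destination: str) -> str:
    """Longest-prefix match: the most specific route wins, the default route last."""
    dest = ipaddress.ip_address(destination)
    matches = [(net, gw) for net, gw in routes if dest in net]
    net, gw = max(matches, key=lambda m: m[0].prefixlen)
    return gw

print(next_hop("10.1.2.7"))    # Local Subnet (directly connected via vmk2)
print(next_hop("172.16.0.5"))  # 10.1.0.254  (sent to the default gateway)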
VXLAN is a network overlay technology designed for data center networks. It
provides massively increased scalability over VLAN IDs alone while allowing for L2 adjacency over
L3 networks. The VXLAN VTEP
can be implemented in both virtual and physical switches allowing the virtual
network to map to physical resources and network services. VXLAN
currently has both wide support and hardware adoption in switching ASICs and
hardware NICs, as well as virtualization software.
The
VXLAN encapsulation method is IP based and provides for a virtual L2
network. With VXLAN the full Ethernet Frame (with the exception of the
Frame Check Sequence: FCS) is carried as the payload of a UDP packet.
VXLAN utilizes an 8-byte header carrying a 24-bit network identifier, shown in the diagram
below,
to identify virtual networks. This identifier provides for up to 16 million
virtual L2 networks.
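For the curious, here is a rough Python sketch of that header; the flag value and field layout follow the standard VXLAN format, and VNI 5001 is just an example value:

import struct

VXLAN_UDP_PORT = 4789

def vxlan_header(vni: int) -> bytes:
    """Build the 8-byte VXLAN header: flags (I bit set), reserved bits, 24-bit VNI."""
    if not 0 <= vni < 2**24:
        raise ValueError("VNI must fit in 24 bits")
    flags = 0x08000000        # only the 'VNI valid' (I) flag set
    vni_field = vni << 8      # the VNI sits in the upper 24 bits of the second word
    return struct.pack("!II", flags, vni_field)

print(2**24)                        # 16777216 possible virtual networks
print(vxlan_header(5001).hex())     # '0800000000138900'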
As I
said before the frame encapsulation is done by an entity known as a VXLAN
Tunnel Endpoint (VTEP). VTEP has two logical interfaces: an uplink and a
downlink. The uplink is responsible for receiving VXLAN frames and acts
as a tunnel endpoint with an IP address used for routing VXLAN encapsulated
frames. These IP addresses are infrastructure addresses and are separate
from the tenant IP addressing for the nodes using the VXLAN fabric. VTEP functionality can be implemented in
software such as a virtual switch or in the form of a physical switch.
The best VXLAN/VTEP explanation I've found comes from the Define the Cloud forum []:
VXLAN
frames are sent to the IP address assigned to the destination VTEP; this IP is
placed in the Outer IP DA. The IP of the VTEP sending the frame resides
in the Outer IP SA. Packets received on the uplink are mapped from the
VXLAN ID to a VLAN and the Ethernet frame payload is sent as an 802.1Q Ethernet
frame on the downlink. During this process the inner MAC SA and VXLAN ID
is learned in a local table. Packets received on the downlink are mapped
to a VXLAN ID using the VLAN of the frame. A lookup is then performed
within the VTEP L2 table using the VXLAN ID and destination MAC; this lookup
provides the IP address of the destination VTEP. The frame is then
encapsulated and sent out the uplink interface.
Using
the diagram above for reference a frame entering the downlink on VLAN 100 with
a destination MAC of 11:11:11:11:11:11 will be encapsulated in a VXLAN packet
with an outer destination address of 10.1.1.1. The outer source address
will be the IP of this VTEP (not shown) and the VXLAN ID will be 1001.
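To make the lookup logic concrete, here is a small Python sketch; the table contents are the hypothetical values from the example above (VLAN 100, VXLAN ID 1001, destination VTEP 10.1.1.1), not anything pulled from a real VTEP:

# Hypothetical VTEP tables illustrating the lookup described above.
vlan_to_vni = {100: 1001}                        # downlink VLAN -> VXLAN ID
l2_table = {                                     # (VXLAN ID, inner dest MAC) -> remote VTEP IP
    (1001, "11:11:11:11:11:11"): "10.1.1.1",
}

def forward_from_downlink(vlan: int, dst_mac: str):
    """Map the VLAN to a VXLAN ID, look up the destination VTEP, then encapsulate."""
    vni = vlan_to_vni[vlan]
    remote_vtep = l2_table.get((vni, dst_mac))
    if remote_vtep is None:
        return ("flood", vni)                    # unknown unicast -> multicast group
    return ("unicast", vni, remote_vtep)         # outer IP DA = remote VTEP

print(forward_from_downlink(100, "11:11:11:11:11:11"))
# ('unicast', 1001, '10.1.1.1')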
VTEP Table Concept: In a traditional L2 switch a behaviour known as flood
and learn is used for unknown destinations (i.e. a MAC not stored in the MAC
table). This means that if there is a miss when looking up the MAC, the
frame is flooded out all ports except the one on which it was received.
When a response is sent the MAC is then learned and written to the table.
The next frame for the same MAC will not incur a miss because the table will
reflect the port it exists on. VXLAN preserves this behaviour over an IP
network using IP multicast groups.
Each
VXLAN ID has an assigned IP multicast group to use for traffic flooding (the
same multicast group can be shared across VXLAN IDs.) When a frame is
received on the downlink bound for an unknown destination it is encapsulated
using the IP of the assigned multicast group as the Outer DA; it’s then sent
out the uplink. Any VTEP with nodes on that VXLAN ID will have joined the
multicast group and therefore receive the frame. This maintains the
traditional Ethernet “flood and learn” behaviour.
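Here is a tiny Python sketch of that flood-and-learn behaviour; the multicast group and the MAC/VTEP values are made up for illustration:

# Assumed multicast group assignment per VXLAN ID, and an initially empty table.
vni_to_mcast = {1001: "239.1.1.1"}
mac_table = {}                          # (VXLAN ID, MAC) -> remote VTEP IP

def outer_destination(vni: int, dst_mac: str) -> str:
    """Known MAC -> unicast to its VTEP; unknown MAC -> flood to the multicast group."""
    return mac_table.get((vni, dst_mac), vni_to_mcast[vni])

def learn(vni: int, inner_src_mac: str, sending_vtep: str) -> None:
    """On receiving an encapsulated frame, learn the inner source MAC's VTEP."""
    mac_table[(vni, inner_src_mac)] = sending_vtep

print(outer_destination(1001, "bb:bb:bb:bb:bb:bb"))  # 239.1.1.1 (flooded)
learn(1001, "bb:bb:bb:bb:bb:bb", "10.1.2.1")         # response comes back, learn it
print(outer_destination(1001, "bb:bb:bb:bb:bb:bb"))  # 10.1.2.1 (unicast from now on)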
Each VTEP is designed as a logical device on an L2 switch. The L2 switch connects to the VTEP via a logical
802.1Q VLAN trunk. This trunk contains a VXLAN infrastructure VLAN in
addition to the production VLANs. The infrastructure VLAN is used to
carry VXLAN encapsulated traffic to the VXLAN fabric. The only member
interfaces of this VLAN will be VTEP’s logical connection to the bridge itself
and the uplink to the VXLAN fabric. This interface is the ‘uplink’ described above, while the
logical 802.1Q trunk is the downlink.
IP addresses have to be assigned per VMkernel interface, and the number of IPs depends on the number of NICs and the teaming type; the NIC teaming itself is configured on the VDS. IP pools are used to assign the addresses to the VTEPs, or you can use DHCP. These VMkernel ports are essentially management ports. VXLAN configuration can be broken down into three important steps:
- Configure a Virtual Tunnel Endpoint (VTEP) on each host. The VTEP is basically the IP address configured on the VMkernel interface, and normally these will be in different subnets, which is why we need VXLAN to simulate L2 connectivity in the first place.
- Configure a Segment ID range to create a pool of logical networks. The Segment ID acts as a logical broadcast domain for VXLAN. I tend to use Unicast mode, and then we don't need to specify a multicast range.
- Define the span of the logical network by configuring the Transport Zone (remember that the transport zone defines the span of a logical switch). As you add new clusters to your datacenter, you can extend the transport zone and thus increase the span of the logical networks. Once you have the logical switch spanning across all compute clusters, you remove all the mobility and placement barriers you had before because of the limited VLAN boundary. A sketch of these three inputs follows the list.
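To tie the three steps together, here is an illustrative Python data structure; the names and values are my own assumptions, not an NSX API:

# Illustrative data only -- names and values are assumptions, not an NSX API.
vxlan_prep = {
    "vtep_ip_pool": {                          # step 1: VTEP addressing per cluster
        "cluster-compute-01": "192.168.10.0/24",
        "cluster-compute-02": "192.168.20.0/24",   # a different subnet is fine
    },
    "segment_id_range": range(5000, 6000),     # step 2: VNIs 5000-5999 for logical switches
    "transport_zone": {                        # step 3: span of the logical switches
        "name": "TZ-Global",
        "replication_mode": "unicast",         # Unicast mode -> no multicast range needed
        "clusters": ["cluster-compute-01", "cluster-compute-02"],
    },
}

# Adding a new cluster to the transport zone extends the span of every logical
# switch inside it, which is exactly the mobility benefit described above.
vxlan_prep["transport_zone"]["clusters"].append("cluster-compute-03")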
Load Balancing is really a statistical distribution of the load; it's not truly balancing the load. Also, the throughput of a single flow will at most be the throughput of the biggest NIC, because traffic with the same source and destination IP and port always stays on one uplink, so there is no way to load balance it further. You can distribute the load based on a few factors (ways to hash the traffic), and the most popular are:
1. Load Balancing, Virtual Port ID, which is the default: Each VM's virtual port ID is assigned to an uplink in a round-robin fashion, and all of that VM's traffic travels over that NIC.
2. Load Balancing, MAC Hash: The same, but based on the virtual NIC's MAC address.
3. Load Balancing, IP Hash: Does source/destination IP hashing, and it might be the best option for NSX (see the IP-hash sketch after this list). The switches must be STACKED (vPC on Nexus, VSS on 6500, Stack on 3x50), and it requires port channels. This is more complex, and it requires physical switch configuration.
4. Load Balancing, Load-Based Teaming (LBT): This is the ONLY mechanism
that is Utilization Aware, and you must use the VDS. There is no special Switch
configuration, but it’s NOT
supported by NSX.
5. Load Balancing, LACP (Link Aggregation Control Protocol), which is only available on the VDS (not on the Standard Switch). It allows vSphere and the switch to negotiate the hashing mechanism.
6. Explicit Failover: The simplest form, where you have one ACTIVE uplink and the other(s) are STANDBY. This is used when you have 10G uplinks, and it's common in the NSX world if it meets the performance needs.
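To show why a single flow can never exceed the throughput of one NIC, here is a simplified Python sketch of source/destination IP hashing across two uplinks (real vSwitch hashing is more involved):

import ipaddress

uplinks = ["vmnic0", "vmnic1"]   # assumed two-uplink team

def ip_hash_uplink(src_ip: str, dst_ip: str) -> str:
    """Pick an uplink from a hash of the source and destination IP addresses."""
    key = int(ipaddress.ip_address(src_ip)) ^ int(ipaddress.ip_address(dst_ip))
    return uplinks[key % len(uplinks)]

# The same source/destination pair always lands on the same uplink,
# while different pairs can spread across both uplinks.
print(ip_hash_uplink("10.1.0.1", "10.1.1.1"))
print(ip_hash_uplink("10.1.0.1", "10.1.2.5"))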
Equal-Cost Multi-Path (ECMP) routing is a routing
strategy that provides the ability to forward traffic across multiple next-hop
"paths" to a single destination (IP prefix). These next-hop
"paths" can be added statically via static routes, or through the use
of dynamic routing protocols that support ECMP such as Open Shortest Path First
(OSPF) and Border Gateway Protocol (BGP).
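A simplified sketch of flow-based ECMP next-hop selection (the next-hop addresses are made up):

import hashlib

next_hops = ["192.168.100.1", "192.168.100.2", "192.168.100.3"]   # equal-cost paths

def ecmp_next_hop(src_ip: str, dst_ip: str, src_port: int, dst_port: int) -> str:
    """Hash the flow so one flow sticks to one path while flows spread across all paths."""
    flow = f"{src_ip}-{dst_ip}-{src_port}-{dst_port}".encode()
    digest = int(hashlib.sha256(flow).hexdigest(), 16)
    return next_hops[digest % len(next_hops)]

print(ecmp_next_hop("10.1.0.5", "172.16.1.9", 51000, 443))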
The compatibility between NSX and the teaming types is shown below.
Replication mode relates to the handling of broadcast, unknown unicast, and multicast (BUM) traffic. Three modes of traffic replication exist: two are based on the VMware NSX Controller and one is based on the data plane (a sketch follows the list):
- Unicast mode: all replication is done using unicast.
- Hybrid mode (recommended for most deployments): local replication is offloaded to the physical network, and remote replication is done through unicast.
- Multicast mode: requires IGMP for a layer 2 topology and multicast routing for a layer 3 topology.
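A rough Python sketch of how the three modes treat a BUM frame (heavily simplified; the proxy-VTEP details of the real unicast and hybrid modes are left out, and all addresses are made up):

def replicate_bum(mode: str, local_segment_vteps: list, remote_vteps: list, mcast_group: str):
    if mode == "unicast":
        # head-end replication: one unicast copy per other VTEP
        return [("unicast", v) for v in local_segment_vteps + remote_vteps]
    if mode == "hybrid":
        # local segment offloaded to the physical network as L2 multicast,
        # remote VTEPs reached through unicast copies
        return [("l2-multicast", mcast_group)] + [("unicast", v) for v in remote_vteps]
    if mode == "multicast":
        # the physical network replicates everything (IGMP + multicast routing)
        return [("multicast", mcast_group)]
    raise ValueError(f"unknown mode: {mode}")

print(replicate_bum("hybrid", ["10.1.0.2"], ["10.2.0.3"], "239.1.1.1"))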