How VXLANs Work

TIP: An Ethernet Ethertype of 0x0800 indicates that the payload is an IPv4 packet.
TIP: VXLAN requires a minimum MTU (maximum transmission unit) of 1,600 bytes to support IPv4 and IPv6 guest traffic.
TIP: The VLAN tag in the layer 2 Ethernet frame exists if the port group that your VXLAN VMkernel port is connected to has an associated VLAN number. When the port group is associated with a VLAN number, the port group tags the VXLAN frame with that VLAN number.

VXLAN (Virtual Extensible LAN) is an overlay between the ESXi hosts. It is an Ethernet-in-IP overlay technology, where the original layer 2 frames are encapsulated in a User Datagram Protocol (UDP port 4789) packet and delivered over a transport network. It gives you the ability to build proper micro-segmentation, and it doesn't have the numbering limitation that VLANs do. Because VXLAN provides its own encapsulation, the VLAN configuration of the workloads becomes irrelevant. The VXLAN modes are Unicast, Hybrid and Multicast (this refers to the control traffic), and they influence the teaming type decision. In one of the later posts I will explain the concepts of the Logical Switch and the Transport Zone, where the control traffic transport will be easier to understand.

A VXLAN Tunnel End Point (VTEP) is an entity that encapsulates an Ethernet frame in a VXLAN frame, or de-encapsulates a VXLAN frame and forwards the inner Ethernet frame. In vSphere, the VTEP is the VMkernel interface that serves as the endpoint for encapsulation and de-encapsulation of VXLAN traffic.

We can have up to 16 million VXLAN segments, not only 4,096 as is the case for L2 VLANs. Similar to the VLAN ID field in the 802.1Q header, the 24-bit VXLAN Network Identifier (VNI) allows for roughly 16 million potential logical networks.

VXLAN implementation has two simple networking requirements:
  1.         The network MTU needs to be set to a minimum of 1,600 bytes, because VXLAN adds around 50 bytes of encapsulation overhead (the arithmetic is sketched after this list). Jumbo frames (9,000 bytes) might be a good solution here.
  2.         We need to assign at least one VLAN for the VXLAN transport traffic.
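To see where the roughly 50 bytes of overhead and the 1,600-byte MTU recommendation come from, here is a minimal sketch of the arithmetic in Python (the individual header sizes are standard; the exact total grows slightly if an outer 802.1Q tag or an IPv6 outer header is used, which is why 1,600 leaves headroom):

# VXLAN encapsulation overhead, in bytes (an outer 802.1Q tag would add 4 more)
OUTER_ETHERNET = 14   # outer destination/source MAC + Ethertype
OUTER_IPV4     = 20   # outer IP header, no options
OUTER_UDP      = 8    # outer UDP header, destination port 4789
VXLAN_HEADER   = 8    # flags + 24-bit VNI + reserved bits

overhead = OUTER_ETHERNET + OUTER_IPV4 + OUTER_UDP + VXLAN_HEADER
print(overhead)          # 50 bytes of encapsulation overhead
print(1500 + overhead)   # 1550 -> hence the 1,600-byte minimum MTU recommendation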

vSphere hosts use VMkernel interfaces to communicate over VXLAN. When you configure VXLAN on a cluster, NSX creates VMkernel interfaces called VTEPs (VXLAN Tunnel End Points). The number of VTEPs per host depends on the number of NICs and the teaming type.

Directly Connected Networks

If I have three VMkernel ports defined with the following IP information:
vmk0: 10.1.0.1    255.255.255.0
vmk1: 10.1.1.1    255.255.255.0
vmk2: 10.1.2.1    255.255.255.0

Then vmk0 will be used to talk to everything on 10.1.0.0, vmk1 for 10.1.1.0, and vmk2 for 10.1.2.0.
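As a rough illustration of that selection, here is a sketch using Python's standard ipaddress module (just to show the decision, not how the VMkernel actually implements it):

import ipaddress

# VMkernel interfaces and their directly connected subnets (from the example above)
vmknics = {
    "vmk0": ipaddress.ip_network("10.1.0.0/24"),
    "vmk1": ipaddress.ip_network("10.1.1.0/24"),
    "vmk2": ipaddress.ip_network("10.1.2.0/24"),
}

def pick_vmknic(destination: str):
    """Return the VMkernel interface whose subnet contains the destination, if any."""
    dst = ipaddress.ip_address(destination)
    for name, network in vmknics.items():
        if dst in network:
            return name
    return None  # not directly connected -> routing table / default gateway

print(pick_vmknic("10.1.1.42"))    # vmk1
print(pick_vmknic("192.168.5.9"))  # None -> see "Remote Networks" below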

Remote Networks

So, what happens when the device I am talking to is on a subnet that I am not directly connected to? This is where the routing table really comes into play, so let's take a look at it using:
vicfg-route --list
VMkernel Routes:
Network             Netmask             Gateway
10.1.0.0            255.255.255.0       Local Subnet
10.1.1.0            255.255.255.0       Local Subnet
10.1.2.0            255.255.255.0       Local Subnet
default             0.0.0.0             10.1.0.254

We see the directly connected networks with a Gateway of Local Subnet. This describes the direct communication that we discussed in Directly Connected Networks. The last line is a result of configuring the "VMkernel Default Gateway" when setting up the VMkernel port: it says send everything else to the router at 10.1.0.254. The router is in the 10.1.0.0 network, and since vmk0 is directly connected to that subnet we know that it will be used for all non-local traffic.
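Here is a small sketch of that same lookup, including the default route (again just an illustration of the decision logic, not the actual VMkernel code):

import ipaddress

# Simplified copy of the routing table printed above
routes = [
    (ipaddress.ip_network("10.1.0.0/24"), "Local Subnet"),
    (ipaddress.ip_network("10.1.1.0/24"), "Local Subnet"),
    (ipaddress.ip_network("10.1.2.0/24"), "Local Subnet"),
    (ipaddress.ip_network("0.0.0.0/0"),   "10.1.0.254"),   # default route
]

def next_hop(destination: str) -> str:
    dst = ipaddress.ip_address(destination)
    # Longest-prefix match: the most specific matching route wins, default route last
    matches = [(net, gw) for net, gw in routes if dst in net]
    net, gw = max(matches, key=lambda item: item[0].prefixlen)
    return gw

print(next_hop("10.1.2.7"))     # Local Subnet -> sent directly out vmk2
print(next_hop("172.16.40.1"))  # 10.1.0.254   -> via the default gateway on vmk0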

VXLAN is a network overlay technology designed for data center networks. It provides massively increased scalability over VLAN IDs alone while allowing for L2 adjacency over L3 networks. The VXLAN VTEP can be implemented in both virtual and physical switches, allowing the virtual network to map to physical resources and network services. VXLAN currently has wide support and hardware adoption in switching ASICs and hardware NICs, as well as in virtualization software.

The VXLAN encapsulation method is IP based and provides for a virtual L2 network. With VXLAN the full Ethernet frame (with the exception of the Frame Check Sequence, FCS) is carried as the payload of a UDP packet. VXLAN uses a 24-bit identifier inside its header, shown in the diagram below, to identify virtual networks. This provides for up to 16 million virtual L2 networks.
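Alongside the diagram, here is a minimal sketch of the 8-byte VXLAN header built with Python's struct module (field layout per RFC 7348: an 8-bit flags field with the I bit set, 24 reserved bits, the 24-bit VNI, and a final reserved byte; the VNI value 5001 is just an example):

import struct

def vxlan_header(vni: int) -> bytes:
    """Build the 8-byte VXLAN header: flags (I bit set), 24 reserved bits,
    the 24-bit VNI, and 8 reserved bits (RFC 7348 layout)."""
    if not 0 <= vni < 2**24:
        raise ValueError("VNI must fit in 24 bits")
    flags_and_reserved = 0x08 << 24   # I flag set, reserved bits zero
    vni_and_reserved = vni << 8       # 24-bit VNI followed by a reserved byte
    return struct.pack("!II", flags_and_reserved, vni_and_reserved)

print(vxlan_header(5001).hex())   # 0800000000138900
print(2 ** 24)                    # 16777216 -> the "16 million" logical networks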



As I said before, the frame encapsulation is done by an entity known as a VXLAN Tunnel Endpoint (VTEP). The VTEP has two logical interfaces: an uplink and a downlink. The uplink is responsible for receiving VXLAN frames and acts as a tunnel endpoint with an IP address used for routing VXLAN encapsulated frames. These IP addresses are infrastructure addresses and are separate from the tenant IP addressing for the nodes using the VXLAN fabric. VTEP functionality can be implemented in software, such as a virtual switch, or in the form of a physical switch.

The best VXLAN/VTEP explanation I've found comes from the Define the Cloud forum []:

VXLAN frames are sent to the IP address assigned to the destination VTEP; this IP is placed in the Outer IP DA.  The IP of the VTEP sending the frame resides in the Outer IP SA.  Packets received on the uplink are mapped from the VXLAN ID to a VLAN and the Ethernet frame payload is sent as an 802.1Q Ethernet frame on the downlink.  During this process the inner MAC SA and VXLAN ID is learned in a local table.  Packets received on the downlink are mapped to a VXLAN ID using the VLAN of the frame.  A lookup is then performed within the VTEP L2 table using the VXLAN ID and destination MAC; this lookup provides the IP address of the destination VTEP.  The frame is then encapsulated and sent out the uplink interface.
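A toy sketch of that table logic follows, with dictionaries standing in for the VTEP's learning table (the MAC, VNI, VLAN and IP values are made up for illustration and mirror the diagram example below):

# (VXLAN ID, inner destination MAC) -> IP of the remote VTEP behind which the MAC lives
vtep_l2_table = {
    (1001, "11:11:11:11:11:11"): "10.1.1.1",
}

# VLAN on the downlink <-> VXLAN ID mapping for this VTEP
vlan_to_vni = {100: 1001}

def learn(vni: int, src_mac: str, remote_vtep_ip: str) -> None:
    """Learn the inner source MAC against the VTEP it was received from."""
    vtep_l2_table[(vni, src_mac)] = remote_vtep_ip

def lookup_destination_vtep(vlan: int, dst_mac: str):
    """Map a downlink frame (VLAN + destination MAC) to the destination VTEP IP."""
    vni = vlan_to_vni[vlan]
    return vni, vtep_l2_table.get((vni, dst_mac))   # None means unknown -> flood

print(lookup_destination_vtep(100, "11:11:11:11:11:11"))   # (1001, '10.1.1.1')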



Using the diagram above for reference a frame entering the downlink on VLAN 100 with a destination MAC of 11:11:11:11:11:11 will be encapsulated in a VXLAN packet with an outer destination address of 10.1.1.1.  The outer source address will be the IP of this VTEP (not shown) and the VXLAN ID will be 1001.

VTEP Table Concept: In a traditional L2 switch a behaviour known as flood and learn is used for unknown destinations (i.e., a MAC not stored in the MAC table). This means that if there is a miss when looking up the MAC, the frame is flooded out all ports except the one on which it was received. When a response is sent, the MAC is learned and written to the table. The next frame for the same MAC will not incur a miss because the table will reflect the port it exists on. VXLAN preserves this behaviour over an IP network using IP multicast groups.

Each VXLAN ID has an assigned IP multicast group to use for traffic flooding (the same multicast group can be shared across VXLAN IDs.)  When a frame is received on the downlink bound for an unknown destination it is encapsulated using the IP of the assigned multicast group as the Outer DA; it’s then sent out the uplink.  Any VTEP with nodes on that VXLAN ID will have joined the multicast group and therefore receive the frame.  This maintains the traditional Ethernet “flood and learn” behaviour.
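Continuing the same toy sketch, an unknown destination simply resolves to the VNI's flood group instead of a specific VTEP IP (the multicast group addresses are made up for illustration):

# Each VXLAN ID is assigned an IP multicast group for flooding BUM / unknown-unicast traffic
vni_to_multicast_group = {1001: "239.1.1.1", 1002: "239.1.1.1"}   # groups can be shared

def outer_destination(vni: int, dst_vtep_ip):
    """Pick the outer IP DA: the learned VTEP if known, otherwise the flood group."""
    return dst_vtep_ip if dst_vtep_ip is not None else vni_to_multicast_group[vni]

print(outer_destination(1001, "10.1.1.1"))   # known MAC  -> unicast to 10.1.1.1
print(outer_destination(1001, None))         # table miss -> flood to 239.1.1.1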

VTEPs are designed as a logical device on an L2 switch.  The L2 switch connects to the VTEP via a logical 802.1Q VLAN trunk.  This trunk contains a VXLAN infrastructure VLAN in addition to the production VLANs.  The infrastructure VLAN is used to carry VXLAN encapsulated traffic to the VXLAN fabric.  The only member interfaces of this VLAN will be VTEP’s logical connection to the bridge itself and the uplink to the VXLAN fabric.  This interface is the ‘uplink’ described above, while the logical 802.1Q trunk is the downlink.



An IP address has to be assigned to each VTEP VMkernel interface, and the number of IPs depends on the number of NICs and the teaming type; the NIC teaming itself is configured on the VDS. Network pools are used to assign the IP addresses to the VTEPs, or you can use a DHCP pool. These VMkernel ports are actually management ports. VXLAN configuration can be broken down into three important steps:
  •         Configure the VXLAN Tunnel Endpoint (VTEP) on each host. The VTEP is basically the IP address configured on the VMkernel interface, and normally these will be in different subnets, which is why we need VXLAN to simulate L2 connectivity in the first place.
  •         Configure a Segment ID range to create a pool of logical networks (a small pool sketch follows this list). A segment is used as a logical broadcast domain for VXLAN. I tend to use Unicast mode, and then we don't need to specify a multicast range.

  •         Define the span of the logical network by configuring the Transport Zone (remember that the transport zone defines the span of a logical switch). As you add new clusters to your datacenter, you can extend the transport zone and thus increase the span of the logical networks. Once you have a logical switch spanning all compute clusters, you remove the mobility and placement barriers you had before because of VLAN boundaries.
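As a rough illustration of the Segment ID pool idea, here is a sketch; the range 5000-5999 and the logical switch names are hypothetical examples, standing in for whatever range you configure in NSX:

# Hypothetical segment ID (VNI) pool, e.g. 5000-5999 configured in NSX
segment_id_pool = iter(range(5000, 6000))
logical_switches = {}

def create_logical_switch(name: str) -> int:
    """Assign the next free segment ID (VNI) to a new logical switch."""
    vni = next(segment_id_pool)
    logical_switches[name] = vni
    return vni

print(create_logical_switch("web-tier"))   # 5000
print(create_logical_switch("app-tier"))   # 5001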

Load balancing is really a statistical distribution of the load; it is not truly balancing the load. Also, the maximum throughput of a single flow is limited to the throughput of one NIC, because a flow keeps the same source and destination IP and port and therefore always hashes to the same uplink. You can distribute the load based on a few factors (ways to hash the traffic), and the most popular are listed below (a minimal hashing sketch follows the list):
1.      Load Balancing, Virtual Port ID, which is the default: each virtual port is assigned to an uplink in round-robin fashion, and the VM's traffic then uses that uplink.
2.      Load Balancing, MAC Hash: the same idea, but based on the NIC MAC addresses.
3.      Load Balancing, IP Hash: does source/destination IP hashing, and it might be the best option for NSX. The switches must be STACKED (vPC on Nexus, VSS on 6500, Stack on 3x50), and it requires port channels. This is more complex, and it requires physical switch configuration.
4.      Load Balancing, Load-Based Teaming (LBT): this is the ONLY mechanism that is utilization aware, and you must use the VDS. There is no special switch configuration, but it is NOT supported by NSX.
5.      Load Balancing, LACP (Link Aggregation Control Protocol), which is available only on the VDS. It allows vSphere and the switch to negotiate the hashing mechanism.
6.      Explicit Failover: the simplest form, where you have one ACTIVE uplink and the other(s) are STANDBY. This is used when you have 10G uplinks, and it is common in the NSX world if it meets the performance needs.
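Here is a minimal sketch of the source/destination hashing idea behind options 3 and 5, and of why a single flow can never exceed one NIC's throughput. The hash function and uplink names are illustrative, not vSphere's actual algorithm:

import zlib

uplinks = ["vmnic0", "vmnic1"]

def pick_uplink(src_ip: str, dst_ip: str) -> str:
    """Hash the source/destination IP pair onto one of the uplinks.
    A given pair always lands on the same uplink, so one flow is
    limited to one NIC's throughput."""
    key = f"{src_ip}-{dst_ip}".encode()
    return uplinks[zlib.crc32(key) % len(uplinks)]

print(pick_uplink("10.1.0.1", "10.2.0.9"))    # always the same uplink for this pair
print(pick_uplink("10.1.0.1", "10.2.0.10"))   # a different pair may hash elsewhere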

Equal-Cost Multi-Path (ECMP) routing is a routing strategy that provides the ability to forward traffic across multiple next-hop "paths" to a single destination (IP prefix). These next-hop "paths" can be added statically via static routes, or through the use of dynamic routing protocols that support ECMP such as Open Shortest Path First (OSPF) and Border Gateway Protocol (BGP).
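The same per-flow hashing idea applies to ECMP, only with equal-cost next hops instead of uplinks. A small sketch follows; the next-hop addresses are example values, and the hash is illustrative rather than the behaviour of any particular router:

import zlib

next_hops = ["192.168.10.1", "192.168.10.2", "192.168.10.3"]   # equal-cost next hops (example IPs)

def ecmp_next_hop(src_ip: str, dst_ip: str, src_port: int, dst_port: int, proto: str) -> str:
    """Hash the 5-tuple so packets of one flow stay on one path (no reordering),
    while different flows are spread across all equal-cost next hops."""
    key = f"{src_ip}|{dst_ip}|{src_port}|{dst_port}|{proto}".encode()
    return next_hops[zlib.crc32(key) % len(next_hops)]

print(ecmp_next_hop("10.1.0.5", "172.16.1.20", 34512, 443, "tcp"))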

The compatibility between NSX and the teaming types is shown below.



Replication mode relates to the handling of broadcast, unknown unicast, and multicast (BUM) traffic. Three modes of traffic replication exist: two modes are based on the VMware NSX Controller, and one mode is based on the data plane:
  •         Unicast mode performs all replication using unicast.
  •         Hybrid mode [recommended for most deployments] is local replication that is offloaded to the physical network and remote replication through unicast.
  •         Multicast mode requires IGMP for a layer 2 topology and multicast routing for L3 topology.

VMware NSX Fundamentals

In July 2012 VMware acquired Nicira (founded by Martin Casado of Stanford University, with a product called NVP, the Network Virtualization Platform), and that is basically how VMware started the NSX venture and got into SDN.
NSX enables you to start with your existing network and server hardware in the data center, as it is independent of the network hardware. This does not mean that you can use just any hardware; you still need a stable, highly available and fast network. ESXi hosts, virtual switches, and distributed switches run on that hardware.



On the other hand, to avoid physical network problems, tend towards a Leaf and Spine architecture.


Nowadays you won't find many clients with a Spine and Leaf network deployed. Therefore, perhaps the best approach would be to propose a slow transition (or even an upgrade) where the traditional 3-tier architecture evolves towards the L3 Spine and Leaf design presented below:





There are 2 versions of NSX:
-        NSX-v, or NSX for vSphere (you need 100% vSphere, no other hypervisors are allowed), which has more features.
-        NSX-MH, or Multi-Hypervisor (pending some standards being globally accepted).

The SDN concept, or any Software Defined Service concept, is based on moving the proprietary features ("intelligence") from the hardware to the software layer, turning the hardware into a commodity. Some popular terms are SDS (Storage, such as VSAN), SDC (Compute, such as vSphere and Hyper-V), SDN (Network, such as NSX, Juniper Contrail, and Cisco ACI) and SDDC (Data Center), which includes all the previous ones.
SDN introduces a very simple idea: provision and deploy the infrastructure in accordance with the applications' needs. The optimization opportunity here is clear, since we all know that it takes a long time to provision FW rules, load balancers, VPNs, VLANs, IP addresses, etc.



Any NSX implementation will need a Management Cluster and a Compute Cluster. Both Layer 3 and Layer 2 transport networks are fully supported by NSX; however, to most effectively demonstrate the flexibility and scalability of network virtualization, an L3 topology with different routed networks for the Compute and Management/Edge Clusters is recommended.



Best practices when integrating NSX with your infrastructure are the following:
  •         The Management Cluster is a really important concept, and a best practice recommendation by VMware.
  •         Install the kernel modules on the vSphere hosts; there is no disruption in service.
  •         NSX layers on top of the VDS (vSphere Distributed Switch). Yes, you have to deploy a VDS.


Why would a customer deploy NSX, and what improvements does it introduce?
  •         Network abstraction (VXLAN), allowing transparent communication over an entire network, Layer 2 over Layer 3, decoupled from the physical network.
  •         Automation, which brings transparency to the vendor variety in the infrastructure by adding a management layer on top.
  •         DLR (Distributed Logical Routing).
  •         Edge services, better than the old ones, with more features.
  •         Distributed Firewalling (DFW), to have the firewall across the entire infrastructure, and to be able to allow/deny communication even between hosts in the same network.
  •         3rd-party extensions like an L7 firewall, or any network service really.

NSX components require a number of ports for communication (a quick reachability-check sketch follows this list):
  •         443 between the ESXi hosts, vCenter Server, and NSX Manager.
  •         443 between the REST client and NSX Manager.
  •         TCP 902 and 903 between the vSphere Web Client and ESXi hosts.
  •         TCP 80 and 443 to access the NSX Manager management user interface and initialize the vSphere and NSX Manager connection.
  •         TCP 22 for CLI troubleshooting.
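If you want to sanity-check that those ports are reachable from a management host, here is a small sketch using Python's socket module; the hostnames are placeholders standing in for your own NSX Manager, vCenter Server and ESXi addresses:

import socket

# Placeholder addresses -- replace with your NSX Manager, vCenter and ESXi hosts
checks = [
    ("nsx-manager.lab.local", 443),
    ("vcenter.lab.local", 443),
    ("esxi01.lab.local", 902),
    ("nsx-manager.lab.local", 22),
]

for host, port in checks:
    try:
        # Open and immediately close a TCP connection to test reachability
        with socket.create_connection((host, port), timeout=3):
            print(f"{host}:{port} reachable")
    except OSError as err:
        print(f"{host}:{port} NOT reachable ({err})")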

NSX Manager is installed as a typical .ova virtual machine, and keep in mind that you need to integrate NSX Manager into the existing platform by connecting it to the desired vCenter Server. NSX uses the management plane, control plane, and data plane model. Components on one plane have minimal or no effect on the functions of the planes below.


The NSX Controller is the central control point for all logical switches within a network and maintains information about all virtual machines, hosts, logical switches, and VXLANs (it supports Multicast, Unicast and Hybrid control plane modes). Unicast mode replicates all the BUM traffic locally on the host and requires no physical network configuration.

How vSphere sees VLANs

-     Virtual Switch (vSwitch): Manages virtual machine networking at the host level. There is NEVER a direct connection between two vSwitches, and Spanning Tree is OFF. So EAST-WEST traffic is NOT ALLOWED between vSwitches, and the only way out of a vSwitch is via the UPLINKs (physical interconnections with the physical switch, NIC=VMNIC), which are teamed to work as one link. There is a variety of ways of teaming them (Active-Standby, LACP, etc.).



Since Spanning Tree is not running at all, be sure to configure BPDU Guard and PortFast trunk on the physical switch ports facing the hosts.

The existence of VLANs is inevitable in any kind of L2 environment, and in the case of vSphere there are 3 methods to configure them:
  •         EST (External Switch Tagging), which is the default method: all port groups on a vSwitch are in VLAN 0. The physical switch port facing the host needs to be set to access mode (any VLAN will work, it depends on your network), because the traffic arrives untagged.
  •         VST (Virtual Switch Tagging), which means that you basically create a new port group and put it in the VLAN you want, and the tagging is handled automatically on the vSwitch. The physical switch needs to have the ports defined as a trunk.
  •         VGT (Virtual Guest Tagging), when you want to TRUNK to the actual VMs (the VM receives packets on a dot1q trunk carrying various VLANs); a sketch of the dot1q tag itself follows this list. To do this, you need to set the port group VLAN ID to All (4095).
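To make the VST/VGT tagging concrete, here is a small sketch of the 4-byte 802.1Q tag that gets inserted into the Ethernet frame: TPID 0x8100, then 3 bits of priority, 1 DEI bit, and the 12-bit VLAN ID, which is where the 4,096 VLAN limit comes from. The VLAN value 100 is just an example:

import struct

def dot1q_tag(vlan_id: int, priority: int = 0) -> bytes:
    """Build the 4-byte 802.1Q tag: TPID 0x8100 + PCP/DEI/12-bit VLAN ID."""
    if not 0 <= vlan_id < 2**12:
        raise ValueError("VLAN ID must fit in 12 bits (0-4095)")
    tci = (priority << 13) | vlan_id   # DEI bit left at 0
    return struct.pack("!HH", 0x8100, tci)

print(dot1q_tag(100).hex())   # 81000064 -> VLAN 100
print(2 ** 12)                # 4096 -> the VLAN ID space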
