TIP: An Ethernet EtherType of 0x0800 indicates that the payload is an IPv4 packet.
TIP: The maximum transmission unit (MTU) requirement for VXLAN is a minimum of 1,600 bytes to support both IPv4 and IPv6 guest traffic.
TIP: The VLAN tag in the layer 2 Ethernet frame is present only if the port group that your VXLAN VMkernel port is connected to has an associated VLAN number; in that case the port group tags the VXLAN frame with that VLAN number.
VXLAN (Virtual Extensible LAN) is an overlay between the ESXi hosts. It's an Ethernet-in-IP overlay technology, where the original layer 2 frames are encapsulated in User Datagram Protocol (UDP port 4789) packets and delivered over a transport network. It gives us the capability to create proper micro-segmentation, and it doesn't have the numbering limitation that VLANs do. Because VXLAN brings its own encapsulation, VLAN configuration becomes irrelevant. The VXLAN modes are Unicast, Hybrid and Multicast (this refers to the control traffic), and the chosen mode impacts the Teaming Type decision. In one of the later posts I will explain the concepts of the Logical Switch and the Transport Zone, where the control traffic transport will be better understood.
A VXLAN Tunnel End Point (VTEP) is the entity that encapsulates an Ethernet frame in a VXLAN frame, or de-encapsulates a VXLAN frame and forwards the inner Ethernet frame. In vSphere, the VTEP is the VMkernel interface that serves as the endpoint for encapsulation and de-encapsulation of VXLAN traffic.
We can have up to 16 million VXLANs, not only 4,096, which is the limit for L2 VLANs. Similar to the field in the VLAN header where a VLAN ID is stored, the 24-bit VXLAN Network Identifier (VNI) in the VXLAN header allows for 16 million potential logical networks.
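The arithmetic behind those numbers is just the width of the ID field, as this one-liner illustrates:

# ID space: 12-bit VLAN ID vs. 24-bit VXLAN Network Identifier (VNI)
print(2 ** 12)   # 4096 VLANs
print(2 ** 24)   # 16777216 (~16 million) VXLAN segments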
A VXLAN implementation has two simple networking requirements:
- The network MTU needs to be set to a minimum of 1,600 bytes, because VXLAN adds roughly 50 bytes of header overhead (see the breakdown after this list). Jumbo frames (9,000 bytes) might be a good solution here.
- We need to assign at least one VLAN for the VXLAN transport traffic.
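To see where those ~50 bytes come from, here is a minimal overhead breakdown (an IPv4 outer header is assumed; an IPv6 outer header adds another 20 bytes, which is one reason 1,600 is a comfortable minimum):

# VXLAN encapsulation overhead added on top of the original (inner) Ethernet frame
outer_ethernet = 14   # outer MAC header (18 if the transport VLAN is 802.1Q tagged)
outer_ipv4     = 20   # outer IPv4 header
outer_udp      = 8    # outer UDP header (destination port 4789)
vxlan_header   = 8    # VXLAN header carrying the 24-bit VNI

overhead = outer_ethernet + outer_ipv4 + outer_udp + vxlan_header
print(overhead)          # 50
print(1500 + overhead)   # 1550 -> hence the minimum MTU recommendation of 1,600 bytes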
vSphere hosts use VMkernel interfaces to communicate over VXLAN. When you configure VXLAN on a cluster, NSX creates VMkernel interfaces called VTEPs (VXLAN Tunnel End Points). The number of VTEPs per host depends on the number of NICs and the teaming type.
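As a rough illustration of that dependency (the policy-to-VTEP mapping below reflects commonly cited NSX-v design guidance and is an assumption on my part, not something taken from an API):

# Hypothetical sketch: estimate the number of VTEPs a host gets from its teaming policy.
MULTI_VTEP_POLICIES = {"src_port_id", "src_mac"}        # typically one VTEP per dvUplink
SINGLE_VTEP_POLICIES = {"failover", "etherchannel", "lacp"}

def vtep_count(teaming_policy, uplinks):
    if teaming_policy in MULTI_VTEP_POLICIES:
        return uplinks
    if teaming_policy in SINGLE_VTEP_POLICIES:
        return 1
    raise ValueError("unknown teaming policy: " + teaming_policy)

print(vtep_count("src_port_id", 2))   # 2 VTEPs
print(vtep_count("lacp", 2))          # 1 VTEP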
Directly Connected Networks
If I have three VMkernel interfaces defined with the following IP information:
vmk0: 10.1.0.1 / 255.255.255.0
vmk1: 10.1.1.1 / 255.255.255.0
vmk2: 10.1.2.1 / 255.255.255.0
then vmk0 will be used to talk to everything on 10.1.0.0, vmk1 for 10.1.1.0, and vmk2 for 10.1.2.0.
Remote Networks
So, what happens when the device I am talking to is on a subnet that I am not directly connected to? This is where the routing table really comes into play, so let’s take a look at it using:
vicfg-route --list
VMkernel Routes:
Network      Netmask          Gateway
10.1.0.0     255.255.255.0    Local Subnet
10.1.1.0     255.255.255.0    Local Subnet
10.1.2.0     255.255.255.0    Local Subnet
default      0.0.0.0          10.1.0.254
We see the directly connected networks with a Gateway of “Local Subnet”. This describes the direct communication that we discussed in Directly Connected Networks. The last line is the result of configuring the “VMkernel Default Gateway” when setting up the VMkernel port group. What it says is: send everything else to the router at 10.1.0.254. The router is in the 10.1.0.0 network, and since vmk0 is directly connected to that subnet, we know that it will be used for all non-local traffic.
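A minimal sketch of that selection logic (the structure and names are mine, not how the VMkernel actually implements its routing table): match the destination against the directly connected subnets first, and fall back to the default gateway, which is itself reached through the interface on its subnet.

import ipaddress

# Directly connected VMkernel interfaces and their subnets, from the example above
interfaces = {
    "vmk0": ipaddress.ip_network("10.1.0.0/24"),
    "vmk1": ipaddress.ip_network("10.1.1.0/24"),
    "vmk2": ipaddress.ip_network("10.1.2.0/24"),
}
default_gateway = ipaddress.ip_address("10.1.0.254")

def select_interface(destination):
    dst = ipaddress.ip_address(destination)
    # 1. Directly connected subnet wins ("Gateway: Local Subnet")
    for vmk, subnet in interfaces.items():
        if dst in subnet:
            return vmk, "Local Subnet"
    # 2. Everything else goes to the default gateway, via the interface on its subnet
    for vmk, subnet in interfaces.items():
        if default_gateway in subnet:
            return vmk, str(default_gateway)
    raise RuntimeError("no route to host")

print(select_interface("10.1.1.25"))    # ('vmk1', 'Local Subnet')
print(select_interface("192.168.5.9"))  # ('vmk0', '10.1.0.254')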
VXLAN is a network overlay technology designed for data center networks. It provides massively increased scalability over VLAN IDs alone while allowing for L2 adjacency over L3 networks. The VXLAN VTEP can be implemented in both virtual and physical switches, allowing the virtual network to map to physical resources and network services. VXLAN currently has both wide support and hardware adoption in switching ASICs and hardware NICs, as well as in virtualization software.
The VXLAN encapsulation method is IP based and provides for a virtual L2 network. With VXLAN the full Ethernet frame (with the exception of the Frame Check Sequence, FCS) is carried as the payload of a UDP packet. VXLAN uses a 24-bit VXLAN Network Identifier (VNI) in its header, shown in the diagram below, to identify virtual networks. This 24-bit field provides for up to 16 million virtual L2 networks.
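As an illustration of that header layout, here is a small sketch that builds the 8-byte VXLAN header defined in RFC 7348 (only the I flag and the 24-bit VNI are populated; the reserved fields stay zero):

import struct

def vxlan_header(vni):
    # Build the 8-byte VXLAN header: flags with the I bit set, 24-bit VNI, reserved bits zero
    flags_word = 0x08 << 24                           # 'I' flag = VNI is valid
    return struct.pack("!II", flags_word, vni << 8)   # VNI occupies the upper 24 bits of word 2

header = vxlan_header(1001)
print(header.hex())   # 080000000003e900 -> VNI 1001 (0x0003e9) in the top 24 bits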
As I said before, the frame encapsulation is done by an entity known as a VXLAN Tunnel Endpoint (VTEP). A VTEP has two logical interfaces: an uplink and a downlink. The uplink is responsible for receiving VXLAN frames and acts as a tunnel endpoint with an IP address used for routing VXLAN encapsulated frames. These IP addresses are infrastructure addresses and are separate from the tenant IP addressing for the nodes using the VXLAN fabric. VTEP functionality can be implemented in software, such as a virtual switch, or in the form of a physical switch.
The best VXLAN/VTEP explanation I've found comes from the Define the Cloud forum []:
VXLAN frames are sent to the IP address assigned to the destination VTEP; this IP is placed in the Outer IP DA. The IP of the VTEP sending the frame resides in the Outer IP SA. Packets received on the uplink are mapped from the VXLAN ID to a VLAN, and the Ethernet frame payload is sent as an 802.1Q Ethernet frame on the downlink. During this process the inner MAC SA and VXLAN ID are learned in a local table. Packets received on the downlink are mapped to a VXLAN ID using the VLAN of the frame. A lookup is then performed within the VTEP L2 table using the VXLAN ID and destination MAC; this lookup provides the IP address of the destination VTEP. The frame is then encapsulated and sent out the uplink interface.
Using the diagram above for reference, a frame entering the downlink on VLAN 100 with a destination MAC of 11:11:11:11:11:11 will be encapsulated in a VXLAN packet with an outer destination address of 10.1.1.1. The outer source address will be the IP of this VTEP (not shown), and the VXLAN ID will be 1001.
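A minimal sketch of that downlink-to-uplink decision, using the numbers from the example (the table contents and function name are illustrative, not an actual VTEP implementation):

# VLAN on the downlink -> VXLAN ID (VNI)
vlan_to_vni = {100: 1001}

# VTEP L2 table: (VNI, destination MAC) -> IP address of the destination VTEP
vtep_l2_table = {(1001, "11:11:11:11:11:11"): "10.1.1.1"}

def forward_from_downlink(vlan, dst_mac):
    vni = vlan_to_vni[vlan]                          # map the frame's VLAN to a VXLAN ID
    remote_vtep = vtep_l2_table.get((vni, dst_mac))  # look up the destination VTEP
    if remote_vtep is None:
        return ("flood", vni)                        # unknown destination -> flood (see below)
    return ("encapsulate", vni, remote_vtep)         # outer IP DA = remote VTEP, VNI in the header

print(forward_from_downlink(100, "11:11:11:11:11:11"))
# ('encapsulate', 1001, '10.1.1.1')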
VTEP Table Concept: In a traditional L2 switch, a behaviour known as flood and learn is used for unknown destinations (i.e. a MAC not stored in the MAC table). This means that if there is a miss when looking up the MAC, the frame is flooded out all ports except the one on which it was received. When a response is sent, the MAC is then learned and written to the table. The next frame for the same MAC will not incur a miss because the table will reflect the port it exists on. VXLAN preserves this behaviour over an IP network using IP multicast groups.
Each VXLAN ID has an assigned IP multicast group to use for traffic flooding (the same multicast group can be shared across VXLAN IDs). When a frame is received on the downlink bound for an unknown destination, it is encapsulated using the IP of the assigned multicast group as the Outer DA; it’s then sent out the uplink. Any VTEP with nodes on that VXLAN ID will have joined the multicast group and therefore receives the frame. This maintains the traditional Ethernet “flood and learn” behaviour.
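Continuing in the same spirit, the flood-and-learn behaviour over the multicast group could be sketched like this (the group address and table contents are made up for illustration):

# VNI -> assigned flood group (IP multicast); the address is an example only
vni_to_mcast_group = {1001: "239.1.1.1"}

# VTEP L2 table: (VNI, MAC) -> destination VTEP IP, populated by learning
vtep_l2_table = {}

def lookup_or_flood(vni, dst_mac):
    # Unicast to the learned VTEP if the MAC is known, otherwise flood to the VNI's group
    known = vtep_l2_table.get((vni, dst_mac))
    return ("unicast", known) if known else ("flood", vni_to_mcast_group[vni])

def learn(vni, src_mac, src_vtep_ip):
    # On receiving a VXLAN frame, learn the inner source MAC against the sending VTEP
    vtep_l2_table[(vni, src_mac)] = src_vtep_ip

print(lookup_or_flood(1001, "22:22:22:22:22:22"))   # ('flood', '239.1.1.1')
learn(1001, "22:22:22:22:22:22", "10.1.1.2")        # the response arrives, MAC is learned
print(lookup_or_flood(1001, "22:22:22:22:22:22"))   # ('unicast', '10.1.1.2')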
VTEPs are designed as a
logical device on an L2 switch. The L2 switch connects to the VTEP via a logical
802.1Q VLAN trunk. This trunk contains a VXLAN infrastructure VLAN in
addition to the production VLANs. The infrastructure VLAN is used to
carry VXLAN encapsulated traffic to the VXLAN fabric. The only member interfaces of this VLAN will be the VTEP’s logical connection to the bridge itself and the uplink to the VXLAN fabric. This interface is the ‘uplink’ described above, while the logical 802.1Q trunk is the downlink.
IP addresses have to be assigned per VMkernel interface, and the number of IPs will depend on the number of NICs and the teaming type, while the NIC teaming itself is configured on the VDS. IP pools (Network Pools) are used to assign the IP addresses to the VTEPs, or you can use DHCP. These VMkernel ports are actually management ports. VXLAN configuration can be broken down into three important steps (a rough sketch of how the pieces relate follows this list):
- Configure the VXLAN Tunnel Endpoint (VTEP) on each host. The VTEP is basically the IP address configured on the VMkernel interface, and normally these will be in different subnets, which is why we need VXLAN to simulate L2 connectivity in the first place.
- Configure a Segment ID range to create a pool of logical networks. A Segment ID is used like a logical broadcast domain for VXLAN. I tend to use Unicast mode, in which case we don't need to specify a multicast range.
- Define the span of the logical network by configuring the Transport Zone (remember that the transport zone defines the span of a logical switch). As you add new clusters in your datacenter, you can expand the transport zone and thus increase the span of the logical networks. Once you have a logical switch spanning across all compute clusters, you remove all the mobility and placement barriers you had before because of the limited VLAN boundary.
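Purely as a mental model of how these three pieces relate (the class names and values below are mine, not NSX objects or API calls):

from dataclasses import dataclass, field

@dataclass
class TransportZone:
    # Defines the span of logical switches: the set of clusters they can reach
    name: str
    clusters: set = field(default_factory=set)

@dataclass
class LogicalSwitch:
    # A logical L2 segment: takes a Segment ID (VNI) and spans one transport zone
    name: str
    segment_id: int
    transport_zone: TransportZone

segment_id_pool = iter(range(5000, 5999))            # example Segment ID range
tz = TransportZone("TZ-Global", {"Compute-A", "Compute-B"})

web_ls = LogicalSwitch("Web-Tier", next(segment_id_pool), tz)
tz.clusters.add("Compute-C")                         # adding a cluster widens the span of every
                                                     # logical switch in this transport zone
print(web_ls.segment_id, sorted(web_ls.transport_zone.clusters))
# 5000 ['Compute-A', 'Compute-B', 'Compute-C']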
Load Balancing is really a statistical distribution of the load; it’s not truly balancing the load. Also, the throughput of a single flow will at most be the throughput of the biggest NIC, because a flow has the same source and destination IP and port, so there is no way to load balance it across uplinks. You can distribute the load based on a few factors (ways to hash the traffic), and the most popular are:
1. Load Balancing, Virtual Port ID, which is the default: Every VM (virtual port) gets a Virtual Port ID, and the port IDs are distributed across the NICs in a round-robin manner to choose the NIC over which each VM's traffic will travel.
2. Load Balancing, MAC Hash: The same, but based on the source MAC addresses of the virtual NICs.
3. Load Balancing, IP Hash: Does source/destination IP hashing, and it might be the best option for NSX (a simplified hashing sketch follows this list). The switches must be STACKED (vPC on Nexus, VSS on the 6500, Stack on the 3x50), and it requires port channels. This is more complex, and it requires physical switch configuration.
4. Load Balancing, Load-Based Teaming (LBT): This is the ONLY mechanism that is utilization aware, and you must use the VDS. There is no special switch configuration, but it's NOT supported by NSX.
5. Load Balancing, LACP (Link Aggregation Control Protocol), which is available only on the VDS (not on the standard switch). It allows vSphere and the switch to negotiate the hashing mechanism.
6. Explicit Failover: The simplest form, where you have one ACTIVE uplink and the other(s) are STANDBY. This is used when you have 10G uplinks, and it's common in the NSX world if it meets the performance needs.
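As a simplified illustration of why a single flow never exceeds one NIC (this is not VMware's exact hashing algorithm, just the general idea behind source/destination IP hashing):

import ipaddress

uplinks = ["vmnic0", "vmnic1"]

def ip_hash_uplink(src_ip, dst_ip):
    # Pick an uplink by hashing the source and destination IP (simplified IP-hash teaming)
    src = int(ipaddress.ip_address(src_ip))
    dst = int(ipaddress.ip_address(dst_ip))
    return uplinks[(src ^ dst) % len(uplinks)]

# The same source/destination pair always lands on the same uplink,
# so a single flow can never be spread across NICs.
print(ip_hash_uplink("10.1.0.1", "10.1.1.1"))
print(ip_hash_uplink("10.1.0.1", "10.1.2.1"))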
Equal-Cost Multi-Path (ECMP) routing is a routing
strategy that provides the ability to forward traffic across multiple next-hop
"paths" to a single destination (IP prefix). These next-hop
"paths" can be added statically via static routes, or through the use
of dynamic routing protocols that support ECMP such as Open Shortest Path First
(OSPF) and Border Gateway Protocol (BGP).
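A minimal sketch of flow-based ECMP next-hop selection (hashing the flow 5-tuple so all packets of a flow stick to one path; the hash used here is purely illustrative, real routers use vendor-specific hash functions):

import hashlib

next_hops = ["192.168.10.1", "192.168.10.2"]   # equal-cost next hops for the same prefix

def ecmp_next_hop(src_ip, dst_ip, proto, src_port, dst_port):
    # Hash the 5-tuple so every packet of a given flow takes the same equal-cost path
    key = "{}|{}|{}|{}|{}".format(src_ip, dst_ip, proto, src_port, dst_port).encode()
    digest = hashlib.sha256(key).digest()
    return next_hops[digest[0] % len(next_hops)]

print(ecmp_next_hop("10.1.0.10", "172.16.5.20", "tcp", 49152, 443))
print(ecmp_next_hop("10.1.0.11", "172.16.5.20", "tcp", 49153, 443))   # may take the other path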
The compatibility between NSX and the teaming types is shown below.
Replication mode relates to the handling of broadcast, unknown unicast, and multicast (BUM) traffic. Three modes of traffic replication exist: two modes are based on the VMware NSX Controller and one mode is based on the data plane (a small sketch of the differences follows this list):
- Unicast mode: all replication is done using unicast.
- Hybrid mode [recommended for most deployments]: local replication is offloaded to the physical network, and remote replication is done through unicast.
- Multicast mode: requires IGMP for a layer 2 topology and multicast routing for a layer 3 topology.
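A rough sketch of how a BUM frame leaves the source host in each mode, mirroring the descriptions above (the function and addresses are illustrative only):

def replicate_bum(mode, local_vteps, remote_vteps, mcast_group):
    # Return the set of destinations a BUM frame is replicated to from the source VTEP
    if mode == "unicast":
        # All replication is done with unicast copies
        return [("unicast", v) for v in local_vteps + remote_vteps]
    if mode == "hybrid":
        # Local replication is offloaded to the physical network (L2 multicast via IGMP),
        # remote replication is done through unicast
        return [("l2-multicast", mcast_group)] + [("unicast", v) for v in remote_vteps]
    if mode == "multicast":
        # The physical network handles it all: IGMP at layer 2, multicast routing at layer 3
        return [("ip-multicast", mcast_group)]
    raise ValueError("unknown replication mode: " + mode)

print(replicate_bum("hybrid", ["10.1.0.2"], ["10.2.0.2"], "239.1.1.1"))
# [('l2-multicast', '239.1.1.1'), ('unicast', '10.2.0.2')]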