The high bandwidth requirements of modern data centers are driven by the demands of business applications, data explosion, and the much faster storage devices available today. For example, to utilize a 100GbE link, you needed 250 hard drives in the past, while today, you need only three NVMe SSDs.
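The drive counts above imply a per-device throughput. A quick back-of-the-envelope check might look like the sketch below; it assumes the link is fully utilized and ignores protocol overhead, neither of which the figures above state explicitly.

```python
# Back-of-the-envelope check of the drive counts quoted above.
# Assumption: the 100GbE link is fully utilized and protocol overhead is ignored.

LINK_GBPS = 100  # 100GbE line rate in gigabits per second


def per_device_mbytes_per_s(link_gbps: float, devices: int) -> float:
    """Throughput each device must sustain to fill the link, in MB/s."""
    link_mbytes_per_s = link_gbps * 1000 / 8  # Gb/s -> MB/s (decimal units)
    return link_mbytes_per_s / devices


hdd = per_device_mbytes_per_s(LINK_GBPS, 250)  # share per hard drive
nvme = per_device_mbytes_per_s(LINK_GBPS, 3)   # share per NVMe SSD

print(f"per HDD:  {hdd:.0f} MB/s")
print(f"per NVMe: {nvme:.0f} MB/s")
```

In other words, the comparison works out to roughly 50 MB/s per hard drive versus a little over 4 GB/s per NVMe SSD.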
After investing in the most advanced network infrastructure, the highest-bandwidth links, the shiniest SDN controller, and the latest cloud automation tools, you expect to fully utilize each link, whether 100GbE, 25GbE, or legacy 10GbE, and to reach the highest IOPS with your software-defined storage solution. But is a collection of cutting-edge technologies enough?
Moving to a scale-out paradigm is common practice today. Especially with hyper-converged solutions, data traffic is continuously running east-west between storage and compute server nodes. Even with 10GbE interfaces on individual servers, the aggregated data flow can fully utilize 100GbE links between leaf and spine layer switches. In addition, software-defined storage generates extra traffic to maintain the solution, which is yet another consumer of network bandwidth.
To get the most from your network equipment, you need to look at it from a PCIe-to-PCIe perspective, define the specific use case, and run a few simulations. Let us consider a simple example of an OpenStack deployment:
- Dual 10GbE SFP port NICs on a medium density rack of 40 compute servers, 60 VMs per server
- Layer 2 between ToR and Server with high availability requirement
- Layer 3 between ToR and Aggregation layer, VXLAN Overlay and Nuage SDN controller
- Storage is Ceph connected with 50GbE QSFP28 dual ports
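As a first simulation, the figures in the list above can be turned into a rough bandwidth budget. This is only a sketch: it assumes both NIC ports on every server are active at line rate and that VMs share bandwidth equally, which a real deployment will not do.

```python
# Rough bandwidth model of the example rack.
# Figures come from the list above; active-active ports at full line rate
# and an equal split between VMs are assumptions made for illustration.

SERVERS = 40
PORTS_PER_SERVER = 2
PORT_GBPS = 10
VMS_PER_SERVER = 60

server_gbps = PORTS_PER_SERVER * PORT_GBPS            # bandwidth per server
rack_gbps = SERVERS * server_gbps                     # aggregate for the rack
vm_share_mbps = server_gbps * 1000 / VMS_PER_SERVER   # equal share per VM, Mb/s

print(f"per server: {server_gbps} Gb/s")
print(f"per rack:   {rack_gbps} Gb/s")
print(f"per VM:     {vm_share_mbps:.0f} Mb/s")
```

Under these assumptions the rack can offer up to 800 Gb/s toward the ToR layer, while each VM gets only about a third of a gigabit per second.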
Now, where are those pesky bottlenecks?
VXLAN UDP packets are hitting the NIC on the server, and the NIC has no idea what to do with this creature, so it pushes it up to the kernel. Once software is involved in the data plane, it is game over for high performance. The only way to sustain 10Gbps is for the NIC to know how to get inside the UDP packet and parse it for checksum, RSS, TSS, and other operations that are natively handled with a simple VLAN. If the NIC cannot do that, then the CPU will need to, and that will come at the expense of your application.
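To illustrate what "getting inside the UDP packet" involves, here is a minimal software sketch of the VXLAN header layout defined in RFC 7348. It is not how a NIC implements the offload; the helper names and the zero-filled inner frame are hypothetical and exist only for this example.

```python
import struct
from typing import Tuple

VXLAN_UDP_PORT = 4789        # IANA-assigned destination port for VXLAN (RFC 7348)
VXLAN_FLAG_VNI_VALID = 0x08  # "I" flag: the VNI field is valid


def build_vxlan_header(vni: int) -> bytes:
    """Build the 8-byte header: flags, 24 reserved bits, 24-bit VNI, 8 reserved bits."""
    if not 0 <= vni < 2**24:
        raise ValueError("VNI must fit in 24 bits")
    return struct.pack("!II", VXLAN_FLAG_VNI_VALID << 24, vni << 8)


def parse_vxlan_payload(udp_payload: bytes) -> Tuple[int, bytes]:
    """Return (vni, inner_ethernet_frame) from the payload of a VXLAN UDP datagram."""
    if len(udp_payload) < 8:
        raise ValueError("payload too short for a VXLAN header")
    word0, word1 = struct.unpack("!II", udp_payload[:8])
    if not (word0 >> 24) & VXLAN_FLAG_VNI_VALID:
        raise ValueError("VNI-valid flag not set")
    return word1 >> 8, udp_payload[8:]


# Hypothetical inner frame: zeroed bytes standing in for a minimal Ethernet frame.
inner = bytes(64)
vni, frame = parse_vxlan_payload(build_vxlan_header(5000) + inner)
print(vni, len(frame))
```

The inner frame, with the headers that checksum and RSS logic would normally inspect, only begins after these eight bytes. A NIC without VXLAN awareness sees just the outer UDP datagram.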
So far, we have managed to achieve higher CPU utilization and lower performance. But what about the switch?
Can my switch sustain 100GbE between the ToR and the spine? Lost packets mean retransmissions, so how can you be sure that your switch has zero packet loss?
Ceph is now pushing 50GbE toward the compute nodes' 10GbE interfaces. Congestion occurs, and because the compute nodes are dispersed, you cannot design the network so that the congestion points are predictable. So the question remains: can you ensure that the switch will handle this kind of congestion fairly?
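The speed mismatch can be put into numbers with a small sketch. The buffer size below is a hypothetical figure chosen for illustration, not a specification of any switch; only the 50GbE and 10GbE rates come from the example.

```python
# Incast sketch: how quickly does a buffer fill when a fast sender feeds a slow port?
# BUFFER_BYTES is a hypothetical figure for illustration, not a switch specification.

BUFFER_BYTES = 1_000_000  # hypothetical buffer share available to one egress port
INGRESS_GBPS = 50         # Ceph node sending at 50GbE
EGRESS_GBPS = 10          # compute node receiving at 10GbE


def time_to_fill_us(buffer_bytes: int, ingress_gbps: float, egress_gbps: float) -> float:
    """Microseconds until the buffer overflows when ingress exceeds egress."""
    excess_gbps = ingress_gbps - egress_gbps
    if excess_gbps <= 0:
        return float("inf")  # the port drains at least as fast as it fills
    # 1 Gb/s equals 1000 bits per microsecond
    return buffer_bytes * 8 / (excess_gbps * 1000)


fill_us = time_to_fill_us(BUFFER_BYTES, INGRESS_GBPS, EGRESS_GBPS)
print(f"buffer absorbs the burst for about {fill_us:.0f} microseconds")
```

With these assumed numbers the buffer lasts about 200 microseconds before packets are dropped or the sender must be slowed down, which is why buffer fairness across ports matters.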
A VXLAN Tunnel End Point (VTEP) is needed to connect bare-metal servers to the virtual networks, and this should be handled by the switch hardware. Another VTEP can run on the hypervisor, but then OVS becomes the bottleneck. So, what if we offloaded it to the NIC?
I could go on and on about the TCP/IP flows that involve the CPU in network operations, but now let’s talk about deploying 100GbE infrastructure and getting to a 100GbE SDN deployment via:
- Mellanox ConnectX®-4 Lx on your servers, which provides VXLAN offload with a single parameter configuration on the driver.
- Give your Nuage SDN controller the ability to terminate VXLAN on the server, but with hardware OVS offload; Mellanox ASAP² provides OVS offload with Nuage integration on ConnectX®-4 Lx.
- Give your Nuage SDN controller the ability to terminate VXLAN on the ToR for bare-metal servers and other L2 gateway requirements.
- A switch that runs L2 and L3 at scale without losing packets and that handles congestion scenarios: Mellanox Spectrum™-based switches do not lose packets and still provide fair buffering for each of the device ports. This is not a trivial accomplishment.
- In 1RU, two SN2100 Spectrum-based switches serve 40 x 10GbE servers in an MLAG topology with no oversubscription, using 400Gb/s of downlinks and 400Gb/s of uplinks.
- Run your Ansible, Puppet, or other automation tools on the network operating system (NOS) just as you do on the servers: Cumulus Linux over Spectrum-based switches.
- A fabric monitoring and provisioning tool for the switch and NIC that can be launched from Horizon, vSphere, or any VM: Mellanox NEO.
- A tool that can provision the transformation of network types for Ironic and that has an ML2 mechanism driver interface: Mellanox NEO.
- Reduce complexity by choosing inbox solutions.
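The "no oversubscription" claim in the list above can be checked with a short calculation. This sketch assumes that the 400G figures apply per switch, that each server connects one 10GbE port to each switch, and that the uplinks are four 100GbE ports; the port split is an assumption, not something the list states.

```python
# Oversubscription check for one ToR switch in the MLAG pair.
# Assumption: 40 x 10GbE downlinks and 4 x 100GbE uplinks per switch.


def oversubscription(down_ports: int, down_gbps: float,
                     up_ports: int, up_gbps: float) -> float:
    """Ratio of downlink to uplink capacity; 1.0 or less means non-blocking."""
    return (down_ports * down_gbps) / (up_ports * up_gbps)


ratio = oversubscription(down_ports=40, down_gbps=10, up_ports=4, up_gbps=100)
print(f"oversubscription ratio: {ratio:.1f}:1")
```

The same function can be rerun with other port counts or speeds when planning a different rack.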
And now, what about 25GbE from the server to the top-of-rack switch? You already have the infrastructure. Make sure that your cables are SFP28 capable; the form factor is the same, and you are all set for the next step. You are now ready for 25G.