In the “old days” of tech, meaning roughly three to six years ago, there were some hard-and-fast rules about getting really high throughput from your storage:
Figure 1: The Good Old Days may have been good for many reasons, but faster computer products were not one of them.
The Times They Are A-Changin’
But around 2013, I started to see people breaking these rules. Software-defined storage delivered good performance on commodity servers. Hyper-converged infrastructure let compute and storage run on the same machines. Flash delivered many times the performance of spinning disks. Faster interconnects like 40Gb Ethernet grew in popularity for large clouds, compute clusters, and scale-out storage, as five vendors, including Mellanox, announced the new 25 and 50Gb Ethernet standards. And then there was Microsoft…
Figure 2: The HPE DL380 Gen 9 looks like a server, but thanks to software-defined storage and hyper-converged infrastructure, it can be storage, or compute and storage simultaneously.
Revolution from Redmond
Microsoft was an early leader in several of these fields. Windows Server 2012 R2 had native support to run over both 40Gb Ethernet (with RoCE, or RDMA over Converged Ethernet) and FDR 56Gb InfiniBand at a time when most enterprise storage systems only supported 10GbE and 8Gb Fibre Channel. In 2013, Microsoft and their server partners demonstrated that SMB Direct on 40Gb Ethernet or FDR InfiniBand could best Fibre Channel SANs in both performance and price, and reduce the number of application servers needed to support a given workload. Faster and more efficient networking saved customers money on both server hardware and software licenses.
Figure 3: Microsoft 2013 study showed Windows Storage with RDMA and SAS hard drives had half the cost/capacity of Fibre Channel SAN, with the same performance.
At TechEd 2013, Microsoft demonstrated the power of RDMA with 40Gb Ethernet by showing that live migration of virtual machines, a frequent and important task in both public and private clouds, was up to ten times faster over RDMA than over TCP/IP.
Figure 4: RDMA enables live VM migration up to 10x faster than TCP/IP, as presented at the TechEd 2013 opening keynote session.
In 2014, at the Open Networking Summit, Microsoft presented how they ran storage traffic using RoCE on 40GbE in their own cloud to lower the cost of running Azure Storage. Dell and Mellanox teamed up with Microsoft to demonstrate over one million read IOPS using just two storage nodes and two clients, connected with FDR 56Gb/s InfiniBand. At the time, reaching 1M IOPS normally required a fancy and expensive dedicated storage array, but this demo achieved it with just two Windows servers.
Then in 2015, at Microsoft Ignite, we saw demonstrations of Windows Storage Spaces using a single Mellanox 100Gb Ethernet link and Micron’s NVMe flash cards to achieve over 90 Gb/s (~11.1 GB/s) of actual throughput at just one millisecond of latency. This was the first time I’d seen any single server really use 100Gb Ethernet, let alone a Windows server. It also proved that using SMB Direct with RDMA was a major advantage, delivering approximately twice the throughput, half the latency, and half the CPU utilization of the regular SMB 3 protocol over TCP/IP.
Figure 5: One Windows Server delivers over 90 Gb/s of throughput using a single 100GbE link with RoCE. Performance without RoCE was halved.
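A quick bits-to-bytes sanity check on those numbers (a sketch in Python; the 90 Gb/s and 100GbE figures come from the demo above, and the nominal conversion ignores protocol overhead, which is why the reported ~11.1 GB/s sits slightly below 90/8 = 11.25 GB/s):

```python
def gbits_to_gbytes(gbps: float) -> float:
    """Convert gigabits per second to gigabytes per second (8 bits per byte)."""
    return gbps / 8

LINE_RATE_GBPS = 100   # nominal rate of a single 100GbE link
ACHIEVED_GBPS = 90     # throughput reported in the Ignite 2015 demo

print(f"Line rate: {gbits_to_gbytes(LINE_RATE_GBPS):.2f} GB/s")   # 12.50 GB/s
print(f"Achieved:  {gbits_to_gbytes(ACHIEVED_GBPS):.2f} GB/s")    # 11.25 GB/s
print(f"Link utilization: {ACHIEVED_GBPS / LINE_RATE_GBPS:.0%}")  # 90%
```

In other words, the demo drove a single 100GbE port at roughly 90 percent of its theoretical line rate, which is remarkable for any server, Windows or otherwise.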
Hyper-Race to Hyper-converged Windows Storage
In 2016, the race began to demonstrate ever faster performance using Windows Storage Spaces Direct (S2D) in a hyper-converged setup with NVMe flash storage and 100Gb RoCE. First, Mellanox, Dell, and HGST (a Western Digital brand) built a two-server cluster with Dell R730XD machines, each with two HGST UltraStar SN150 NVMe SSDs and two Mellanox ConnectX-4 100GbE NICs. A Mellanox Spectrum™ switch connected the machines, and the cluster delivered 178Gb/s (22.3 GB/s). Then, at Flash Memory Summit and Intel Developer Forum, Microsoft, Dell, Samsung, and Mellanox showed 640 Gb/s (80GB/s) using four Dell R730XD servers, each with four Samsung NVMe SSDs and two Mellanox ConnectX-4 100GbE NICs, all connected by a Mellanox Spectrum switch and LinkX® cables.
These demos were getting faster every few months, showing the power and flexibility of a hyper-converged deployment on Windows Server 2016.
Figure 6: Microsoft recommends RDMA when deploying Windows Storage Spaces Direct as a hyper-converged infrastructure.
Breaking the One Terabit Per Second Barrier
Now, at Microsoft Ignite 2016, comes the latest demonstration of Windows S2D with an amazing milestone. A 12-node cluster using HPE DL380 servers with Micron 9100 Max NVMe SSDs and Mellanox ConnectX-4 100GbE NICs delivers over 1.2 Terabits per second (>160 GB/s). Connectivity was through a Mellanox Spectrum 100GbE switch and LinkX cables. More impressively, the cluster hosted 336 virtual machines while consuming only 25 percent of the CPU power, leaving 75 percent of the CPU capacity for running additional applications, which matters because the whole purpose of hyper-converged infrastructure is to run applications.
This HPE server is the latest version of the ever-popular DL380 series, the world’s best-selling server (per IDC’s Server Tracker for Q1 2016). It supports up to 3TB of DDR4 memory, up to six NVMe SSDs, and 25/40/50/100GbE adapters from Mellanox, and of course it runs Microsoft Windows Server 2016, which was used for this demo.
Figure 7: HPE, Mellanox and Micron exceed 1.2 Terabit/second using twelve HPE DL380 Gen 9 servers, 48 Micron NVMe SSDs, with Mellanox 100GbE NICs, switch, and cables.
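Some back-of-envelope arithmetic (a sketch; the 160 GB/s, 12-node, and 336-VM figures come from the demo above) shows what that aggregate number means per server and per VM:

```python
TOTAL_GB_PER_S = 160   # aggregate cluster throughput (GB/s)
NODES = 12
VMS = 336

per_node_gbytes = TOTAL_GB_PER_S / NODES      # GB/s each server sustains
per_node_gbits = per_node_gbytes * 8          # same figure in Gb/s
per_vm_mbytes = TOTAL_GB_PER_S * 1000 / VMS   # MB/s available to each VM

print(f"Per node: {per_node_gbytes:.1f} GB/s ({per_node_gbits:.0f} Gb/s)")  # 13.3 GB/s (107 Gb/s)
print(f"Per VM:   {per_vm_mbytes:.0f} MB/s")                                # 476 MB/s
```

Note that each node averages more than a single 100GbE link’s line rate, consistent with each server carrying more than one ConnectX-4 port.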
Amazing Performance, Price and Footprint
This is an amazing achievement: a few years ago this was easily considered supercomputer performance, and now it is available in any datacenter using off-the-shelf servers, SSDs, and Ethernet, with software-defined storage (Windows Server 2016), all at a reasonable off-the-shelf price. To see just how impressive this is, look at what’s required to deliver this throughput using traditional enterprise or HPC storage. Suppose each HPE server, with four Micron NVMe SSDs and two Mellanox ConnectX-4 100Gb Ethernet NICs included, costs $15,000, plus $12,000 for the 100GbE switch and cables. That puts the cost of the entire 12-node cluster at $192,000, and at 25U in height (2U per server, 1U for the switch) it consumes half a rack of footprint. But if we use traditional storage arrays…
| 1.2 Tb/s Solution | Nodes | Footprint | Network Ports | Clients | Estimated Cost |
|---|---|---|---|---|---|
| Windows S2D | 12 | ½ rack | 24 | Included | $192,000 |
| AFA | 13 | 2 racks | 52 IB / 104 FC | Separate | $390,000 |
| Lustre cluster | 14 | 3 racks | 56 | Separate | $450,000 |
| Scale-out SAN | 4 | 1.5 racks | | | |
| High-end SAN | 3-4 | 2 racks | 80-96 | Separate | $600,000 |
Figure 8: The Windows S2D solution uses one-quarter to one-third the footprint of traditional storage solutions, at an estimated acquisition cost less than half that of the other solutions.
To be clear, these are all fine, proven enterprise or HPC storage solutions that offer many rich features, not all of which are necessarily available in Windows S2D. They probably have much more storage capacity than the 115 TB (raw) in the 1.2 Tb/s Windows S2D cluster. (Also, these are approximate price estimates, and actual prices for each system above could be significantly higher or lower.) But these are also prices for storage only, not including any compute power to run virtual machines and applications, whereas the Windows S2D solution includes the compute nodes and plenty of free CPU power. That only re-emphasizes my point: the Windows solution delivers much more throughput at a much lower price, and it consumes much less space, less power, and fewer network ports to do it.
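The Windows S2D cost and footprint figures above follow from simple arithmetic (a sketch; the $15,000 per-server and $12,000 switch figures are the estimates assumed earlier, not quoted prices):

```python
SERVER_COST = 15_000   # HPE DL380 with 4 NVMe SSDs and 2 100GbE NICs (estimate)
SWITCH_COST = 12_000   # 100GbE switch plus cables (estimate)
NODES = 12
SERVER_U = 2           # rack units per server
SWITCH_U = 1           # rack units for the switch

total_cost = NODES * SERVER_COST + SWITCH_COST
total_u = NODES * SERVER_U + SWITCH_U

print(f"Cluster cost: ${total_cost:,}")             # $192,000
print(f"Footprint: {total_u}U, about half a rack")  # 25U
```

Swapping in estimates for any of the traditional solutions in the table gives the multi-rack, multi-hundred-thousand-dollar totals shown above.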
Figure 9: This amazing Windows S2D performance didn’t come from completely generic components, but it did use general-purpose, off-the-shelf servers, storage, software, and networking from HPE, Micron, Microsoft and Mellanox to achieve performance that is far from generic.
The key lesson here is that software-defined storage and hyper-converged infrastructure, combined with NVMe flash and 100Gb RDMA networking, can now equal or surpass traditional enterprise arrays in performance at a far lower price. What used to be the sole realm of racks of dedicated arrays, expensive Fibre Channel networks, and row upon row of drives can now be accomplished with half a rack of general-purpose servers with a few SSDs and two Mellanox NICs in each server.
It’s a software-defined storage revolution riding on the back of faster flash and faster networks — a revolution in more ways than one.