HPE, Mellanox, Micron, and Microsoft Exceed One Terabit Per Second of Storage Throughput With Hyper-converged Solution

 

In the “old days” of tech, meaning roughly three to six years ago, there were some hard and fast rules about getting really high throughput from your storage:

  1. The storage was always separate from the compute servers and you had to buy dedicated, specialized storage systems
  2. The storage network was most likely Fibre Channel SAN for block storage (InfiniBand for a parallel file system and 10Gb Ethernet for scale-out NAS)
  3. You needed many disk drives — dozens or hundreds to meet the needed throughput
  4. It was really expensive — $200,000 to $1M just to reach 100Gb/s (~12GB/s) of sustained throughput.
  5. “High performance storage” and “Microsoft Windows” were never in the same rack, let alone the same paragraph — all fast storage ran Linux, Solaris, FreeBSD, or a specialized real-time operating system.


Figure 1: The Good Old Days may have been good for many reasons, but fast computer products were not one of them.

 

The Times They Are A’ Changing

But starting around 2013, I began to see people breaking these rules. Software-defined storage delivered good performance on commodity servers. Hyper-converged infrastructure let compute and storage run on the same machines. Flash delivered many times the performance of spinning disks. Faster interconnects like 40Gb Ethernet grew in popularity for large clouds, compute clusters, and scale-out storage, as five vendors, including Mellanox, announced the new 25 and 50Gb Ethernet standards. And then there was Microsoft…


Figure 2: The HPE DL380 Gen 9 looks like a server, but thanks to software-defined storage and hyper-converged infrastructure, it can be storage, or compute and storage simultaneously.

Revolution from Redmond

Microsoft was an early leader in several of these fields. Windows Server 2012 R2 natively supported RDMA storage networking over both 40Gb Ethernet (with RoCE, RDMA over Converged Ethernet) and FDR 56Gb InfiniBand at a time when most enterprise storage systems supported only 10GbE and 8Gb Fibre Channel. In 2013, Microsoft and their server partners demonstrated that SMB Direct on 40Gb Ethernet or FDR InfiniBand could best Fibre Channel SANs in both performance and price, and reduce the number of application servers needed to support a given workload. Faster and more efficient networking saved customers money on both server hardware and software licenses.


Figure 3: A 2013 Microsoft study showed that Windows Storage with RDMA and SAS hard drives had half the cost per capacity of a Fibre Channel SAN, with the same performance.

 

At TechEd 2013, Microsoft demonstrated the power of RDMA with 40Gb Ethernet by showing that live migration of virtual machines, a frequent and important task in both public and private clouds, was up to ten times faster using RDMA than using TCP/IP.

 


Figure 4: RDMA enables live VM migration 10x faster than TCP/IP, as presented at the TechEd 2013 opening keynote session.

In 2014, at the Open Networking Summit, Microsoft presented how they run storage traffic using RoCE on 40GbE in their own cloud to lower the cost of running Azure Storage. Dell and Mellanox teamed up with Microsoft to demonstrate over one million read IOPS using just two storage nodes and two clients, connected with FDR 56Gb/s InfiniBand. At the time, reaching 1M IOPS normally required a fancy and expensive dedicated storage array, but this demo achieved it with just two Windows servers.

 

Then, at Microsoft Ignite 2015, we saw a demonstration of Windows Storage Spaces using one Mellanox 100Gb Ethernet link and Micron’s NVMe flash cards to achieve over 90 Gb/s (~11.1 GB/s) of actual throughput with just one millisecond of latency. This was the first time I’d seen any single server really use 100Gb Ethernet, let alone a Windows Server. It also proved that using SMB Direct with RDMA was a major advantage, with approximately twice the throughput, half the latency, and half the CPU utilization of using the regular SMB 3 protocol over TCP/IP.


Figure 5: One Windows Server delivers over 90 Gb/s of throughput using a single 100GbE link with RoCE. Performance without RoCE was halved.
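As a quick sanity check on those numbers, here is my own back-of-the-envelope arithmetic (not part of the demo); the variable names and the assumption that the non-RDMA run reached roughly half the throughput are mine, taken from the comparison quoted above:

```python
# Back-of-the-envelope check of the Ignite 2015 single-link result described above.
# Assumption (mine): throughput in GB/s is simply Gb/s divided by 8.

link_rate_gbps = 100      # one Mellanox 100GbE link
rdma_gbps = 90            # throughput reported with SMB Direct over RoCE

rdma_gBps = rdma_gbps / 8                        # 11.25 GB/s, in line with the ~11.1 GB/s quoted
link_utilization = rdma_gbps / link_rate_gbps    # 0.90, i.e. ~90% of line rate

# The demo reported roughly twice the throughput of SMB 3 over plain TCP/IP,
# so the non-RDMA run would land near half of the RDMA figure.
tcp_gbps_estimate = rdma_gbps / 2                # ~45 Gb/s

print(f"RDMA: ~{rdma_gBps:.1f} GB/s, ~{link_utilization:.0%} of line rate")
print(f"TCP/IP (estimated): ~{tcp_gbps_estimate:.0f} Gb/s")
```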

Hyper-Race to Hyper-converged Windows Storage

 

In 2016, the race began to demonstrate ever-faster performance using Windows Storage Spaces Direct (S2D) in a hyper-converged setup with NVMe flash storage and 100Gb RoCE. First, Mellanox, Dell, and HGST (a Western Digital brand) built a two-server cluster of Dell R730XD machines, each with two HGST UltraStar SN150 NVMe SSDs and two Mellanox ConnectX-4 100GbE NICs. A Mellanox Spectrum switch connected the machines, and the cluster delivered 178Gb/s (22.3 GB/s). Then, at Flash Memory Summit and Intel Developer Forum, Microsoft, Dell, Samsung, and Mellanox showed 640Gb/s (80GB/s) using four Dell R730XD servers, each with four Samsung NVMe SSDs and two Mellanox ConnectX-4 100GbE NICs, all connected by a Mellanox Spectrum switch and LinkX® cables.
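For perspective, here is the per-node arithmetic behind those two demos. This is my own calculation from the figures quoted above, assuming throughput scales evenly across the nodes:

```python
# Per-node throughput for the two 2016 S2D demos described above (1 GB/s = 8 Gb/s assumed).
demos = {
    "2-node Dell R730XD + HGST SN150 NVMe": {"nodes": 2, "total_gbps": 178},
    "4-node Dell R730XD + Samsung NVMe":    {"nodes": 4, "total_gbps": 640},
}

for name, demo in demos.items():
    per_node_gbps = demo["total_gbps"] / demo["nodes"]
    print(f"{name}: ~{per_node_gbps:.0f} Gb/s (~{per_node_gbps / 8:.1f} GB/s) per node")

# Prints ~89 Gb/s (~11.1 GB/s) per node for the 2-node demo and
# ~160 Gb/s (~20.0 GB/s) per node for the 4-node demo.
```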

 

These demos were getting faster every few months, showing the power and flexibility of a hyper-converged deployment on Windows Server 2016.


Figure 6: Microsoft recommends RDMA when deploying Windows Storage Spaces Direct as a hyper-converged infrastructure.

Breaking the One Terabit Per Second Barrier

 

Now, at Microsoft Ignite 2016, comes the latest demonstration of Windows S2D and an amazing milestone: a 12-node cluster using HPE DL380 servers with Micron 9100 Max NVMe SSDs and Mellanox ConnectX-4 100GbE NICs delivers over 1.2 Terabits per second (>160 GB/s). Connectivity was through a Mellanox Spectrum 100GbE switch and LinkX cables. More impressively, the cluster hosted 336 virtual machines while consuming only 25 percent of the CPU power, leaving 75 percent of the CPU capacity for running additional applications. That matters because the whole purpose of hyper-converged infrastructure is to run applications.
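To put the headline number in perspective, here is a rough per-node and per-VM breakdown; this is my own arithmetic, using the conservative 1.2 Tb/s figure quoted above:

```python
# Rough breakdown of the 12-node Ignite 2016 result (1 GB/s = 8 Gb/s assumed).
nodes = 12
total_gbps = 1200          # conservative headline figure: >1.2 Tb/s reported
vms = 336
cpu_used = 0.25            # ~25% of CPU consumed by storage plus the 336 VMs

per_node_gbps = total_gbps / nodes    # ~100 Gb/s per server
per_node_gBps = per_node_gbps / 8     # ~12.5 GB/s per server
vms_per_node = vms / nodes            # 28 VMs per server

print(f"~{per_node_gbps:.0f} Gb/s (~{per_node_gBps:.1f} GB/s) and {vms_per_node:.0f} VMs "
      f"per node, with {1 - cpu_used:.0%} of the CPU left for applications")
```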

 

This HPE server is the latest version of HPE’s ever-popular DL380 series, the world’s best-selling server (per IDC’s Server Tracker for Q1 2016). It supports up to 3TB of DDR4 memory, up to six NVMe SSDs, 25/40/50/100GbE adapters from Mellanox, and, of course, Microsoft Windows Server 2016, which was used for this demo.


Figure 7: HPE, Mellanox, and Micron exceed 1.2 Terabits per second using twelve HPE DL380 Gen 9 servers and 48 Micron NVMe SSDs, with Mellanox 100GbE NICs, switch, and cables.

Amazing Performance, Price and Footprint

This is an amazing achievement considering that a few years ago this was easily considered supercomputer performance, and now it is available in any datacenter using off-the-shelf servers, SSDs, and Ethernet, with software-defined storage (Windows Server 2016), all at a reasonable off-the-shelf price. To give an idea of how impressive this is, look at what’s required to deliver the same throughput using traditional enterprise or HPC storage. Suppose each HPE server, with four Micron NVMe SSDs and two Mellanox ConnectX-4 100Gb Ethernet NICs included, costs $15,000, plus $12,000 for the 100GbE switch and cables. That puts the cost of the entire 12-node cluster at $192,000, and at 25U in height (2U per server plus 1U for the switch) it consumes about half a rack of footprint.
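Here is that cost and footprint estimate spelled out; the per-server and switch prices are the assumed figures from the paragraph above, not vendor quotes:

```python
# Cost and rack-space estimate for the 12-node S2D cluster, using the assumed prices above.
servers = 12
server_cost = 15_000        # assumed price: DL380 with 4 Micron NVMe SSDs + 2 ConnectX-4 NICs
network_cost = 12_000       # assumed price: Mellanox Spectrum 100GbE switch plus LinkX cables
server_height_u = 2         # 2U per DL380 Gen 9
switch_height_u = 1         # 1U for the switch

total_cost = servers * server_cost + network_cost               # $192,000
total_height_u = servers * server_height_u + switch_height_u    # 25U, roughly half a rack

print(f"${total_cost:,} total, {total_height_u}U of rack space")
```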

 

But if we use traditional storage arrays, the picture looks very different (a rough sizing check follows the comparison table below):

  • A well-known all-flash array can deliver up to 96Gb/s (12GB/s) using 8x16Gb FC ports or 4x56Gb FDR InfiniBand ports. Suppose it costs $30,000 and consumes 6U each; equaling the Windows cluster’s throughput would require 13 systems, costing $390,000 and consuming two racks.
  • A proven Lustre-based appliance delivers 96Gb/s (12GB/s) per node in an 8U system, or 480Gb/s (60GB/s) per rack. Suppose it costs $150,000 per rack; matching the Windows cluster’s throughput would require three racks and $450,000.
  • A popular enterprise scale-out SAN supports 320Gb/s (40GB/s) of front-end bandwidth per node (20x16Gb FC ports). If each controller can drive the full 320Gb/s, you would need four nodes plus a load of flash or disk shelves to reach 1.2Tb/s. That’s probably at least $400,000 and 1.5 racks.
  • A high-end, Tier-1 scale-out SAN supports 512Gb/s (64GB/s) of front-end bandwidth per engine, so reaching 1.2Tb/s would require three engines (four if they must be deployed in pairs); let’s assume that costs at least $600,000 and consumes at least two racks.
1.2 Tb/s Solution | Nodes | Footprint | Network Ports  | Clients  | Estimated Cost
Windows S2D       | 12    | ½ rack    | 24             | Included | $192,000
AFA               | 13    | 2 racks   | 52 IB / 104 FC | Separate | $390,000
Lustre cluster    | 14    | 3 racks   | 56             | Separate | $450,000
Scale-out SAN     | 4     | 1.5 racks | 80             | Separate | $400,000
High-end SAN      | 3-4   | 2 racks   | 80-96          | Separate | $600,000

Figure 8: The Windows S2D solution uses one-quarter to one-third the footprint of the traditional storage solutions, and its acquisition cost is estimated to be less than half that of the other solutions.
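As a rough sizing check on the table above, the node counts follow directly from dividing the 1.2 Tb/s target by each solution’s quoted per-unit front-end bandwidth. This is my own arithmetic; the table rounds the Lustre row up to 14 nodes:

```python
# Sizing check: units needed to match 1.2 Tb/s, from the per-unit bandwidth quoted above.
import math

target_gbps = 1200
per_unit_gbps = {
    "All-flash array":     96,    # 8x16Gb FC or 4x56Gb FDR IB per system
    "Lustre appliance":    96,    # per 8U node
    "Scale-out SAN node":  320,   # 20x16Gb FC ports per node
    "High-end SAN engine": 512,   # per engine (may need to be deployed in pairs)
}

for name, gbps in per_unit_gbps.items():
    units = math.ceil(target_gbps / gbps)
    print(f"{name}: {units} x {gbps} Gb/s to reach {target_gbps} Gb/s")
```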

To be clear, these are all fine and proven enterprise or HPC storage solutions that offer many rich features, not all of which are necessarily available in Windows S2D. They probably have much more storage capacity than the 115 TB (raw) in the 1.2 Tb/s Windows S2D cluster. (Also, these are approximate price estimates, and actual prices for each system above could be significantly higher or lower.) But these are also prices for storage only, not including any compute power to run virtual machines and applications, whereas the Windows S2D solution includes the compute nodes and plenty of free CPU power. That only re-emphasizes my point: the Windows solution delivers much more throughput at a much lower price, and it consumes much less space, power, and fewer network ports to achieve it.


Figure 9: This amazing Windows S2D performance didn’t come from completely generic components, but it did use general-purpose, off-the-shelf servers, storage, software, and networking from HPE, Micron, Microsoft, and Mellanox to achieve performance that is far from generic.

The key lesson here is that software-defined storage and hyper-converged infrastructure, combined with NVMe flash and 100Gb RDMA networking, can now equal or surpass traditional enterprise arrays in performance at a far lower price. What used to be the sole realm of racks of dedicated arrays, expensive Fibre Channel networks, and row upon row of drives can now be accomplished with half a rack of general-purpose servers with a few SSDs and two Mellanox NICs in each server.

 

It’s a software-defined storage revolution riding on the back of faster flash and faster networks — a revolution in more ways than one.

Read the solution brief here. 


About Motti Beck

Motti Beck is Sr. Director of Enterprise Market Development at Mellanox Technologies Inc. Before joining Mellanox, Motti was a founder of BindKey Technologies, an EDA startup that provided deep-submicron semiconductor verification solutions and was acquired by DuPont Photomasks, and of Butterfly Communications, a pioneering provider of Bluetooth solutions that was acquired by Texas Instruments. Prior to that, he was a Business Unit Director at National Semiconductor. Motti holds a B.Sc. in computer engineering from the Technion – Israel Institute of Technology. Follow Motti on Twitter: @MottiBeck
