All posts by Motti Beck

About Motti Beck

Motti Beck is Sr. Director of Enterprise Market Development at Mellanox Technologies Inc. Before joining Mellanox, Motti was a founder of BindKey Technologies, an EDA startup that provided deep-submicron semiconductor verification solutions and was acquired by DuPont Photomasks, and of Butterfly Communications, a pioneering provider of Bluetooth solutions that was acquired by Texas Instruments. Prior to that, he was a Business Unit Director at National Semiconductor. Motti holds a B.Sc. in computer engineering from the Technion – Israel Institute of Technology. Follow Motti on Twitter: @MottiBeck

How Microsoft Enhanced Azure Cloud Efficiency

There is no doubt that over the past couple of years, in a relatively short period of time, Microsoft has become the worldwide-leading Software-as-a-Service (SaaS) cloud provider. Recent data shows that Microsoft continues to grow: in addition to the existing 30 locations, four new locations are coming soon.

 

Figure 1: Microsoft Azure – 30 established locations, with four new sites coming soon.

 

There is also no doubt that scalability and efficiency were among the top priorities of the Microsoft hardware team originally chartered to build the Azure Cloud infrastructure. Those two features are key for any cloud and have a direct impact on Quality of Service (QoS) and Return on Investment (ROI). As such, choosing the right architecture and solutions is key to establishing a sustainable differentiation, which is critical because it directly affects a company's ability to meet its business goals.

In addition to easy scalability, a requirement mostly driven by the tremendous growth of stored data (doubling every two years and expected to grow from 4.4 zettabytes in 2013 to over 44 zettabytes in 2020), the need to analyze that data in real time is driving the need to reduce latency and increase the throughput of the infrastructure using a scale-out architecture. This, in turn, will boost data analytics and greatly enhance the interactive user experience.

To achieve this goal, the Microsoft team decided to build an architecture and use technologies that offload network functions, as much as possible, to the networking subsystem, and chose Mellanox's efficient 40Gb/s Ethernet products for Azure's mission-critical worldwide deployments. This decision has already proven to be a sound one, as it prevents the bottlenecks associated with waiting for the CPU to complete a job. This, of course, improves the overall cloud efficiency, QoS and ROI.

One of the best examples of efficient offload is the use of RDMA-enabled networks versus TCP/IP. Numerous papers and videos published by Microsoft already validate the higher efficiency that RDMA enables, including Jose Barreto's video. In the video, Barreto uses Mellanox ConnectX-4® 100GbE to compare the performance of TCP/IP vs. RDMA (Ethernet vs. RoCE) and clearly shows that RoCE delivers almost two times higher bandwidth and two times lower latency than Ethernet, at 50 percent of the CPU utilization required for the data communication task (more data can be found at: Enabling Higher Azure Stack Efficiency – Networking Matters).

 

Figure 2: RoCE vs. TCP/IP

Offloading the CPU doesn't only free up more CPU cycles to run applications; it also eliminates computational bottlenecks caused by the CPU's limited ability to process data fast enough to feed the network at maximum wire speed. This is why Azure architects decided to use RoCE connectivity for Azure storage: it eliminates the typical bottlenecks associated with accessing storage.
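
To put the CPU cost of a software transport in perspective, here is a back-of-envelope sketch (not from the original post) using the common rule of thumb of roughly 1 Hz of CPU per 1 bit/s of TCP throughput; the rule and the 2.5 GHz core speed are illustrative assumptions, not measured Azure data.

```python
# Rough estimate of how many cores a software TCP/IP stack could consume at a
# given line rate, assuming ~1 Hz of CPU per bit/s of TCP throughput.

def tcp_cores_needed(line_rate_gbps: float, core_ghz: float = 2.5,
                     hz_per_bps: float = 1.0) -> float:
    """Approximate number of cores kept busy just moving TCP traffic."""
    cpu_hz = line_rate_gbps * 1e9 * hz_per_bps   # total CPU cycles/s consumed
    return cpu_hz / (core_ghz * 1e9)             # expressed in 2.5 GHz cores

for rate in (10, 25, 40, 100):
    print(f"{rate:>3} Gb/s of TCP ~ {tcp_cores_needed(rate):4.1f} cores kept busy")

# With RDMA/RoCE the transport runs on the NIC, so these cycles return to the
# applications, which is the source of the CPU-utilization gap noted above.
```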

One of the most impressive and convincing benchmarks that shows the differentiation that Mellanox's RoCE solutions enable is a record-breaking benchmark performed over a 12-node cluster using HPE DL380 Gen9 servers, each with four Micron 9100MAX NVMe storage cards and dual Mellanox ConnectX-4 100GbE NICs, all connected by Mellanox's Spectrum 100GbE switch and LinkX cables. The cluster delivered an astonishing, sustained 1.2Tb/s of bandwidth across application-to-application communication, which, of course, enables higher data center efficiency.  A separate benchmark, using a smaller cluster of only four nodes to run MS SQL, achieved a record performance of 2.5 million transactions per minute (using the SOFS architecture).
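
As a quick sanity check on the headline number, the arithmetic below (illustrative only, using the figures quoted above) breaks the 1.2Tb/s aggregate down to a per-node rate.

```python
# Per-node breakdown of the 1.2 Tb/s Storage Spaces Direct benchmark quoted above.
total_tbps = 1.2
nodes = 12

per_node_gbps = total_tbps * 1000 / nodes   # ~100 Gb/s sustained per server
per_node_gbytes = per_node_gbps / 8         # ~12.5 GB/s per server

print(f"Per-node throughput: {per_node_gbps:.0f} Gb/s (~{per_node_gbytes:.1f} GB/s)")
# Each server sustains roughly a full 100GbE link's worth of traffic, which is
# consistent with the dual ConnectX-4 100GbE NICs installed in every node.
```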

 

Figure 3: Storage Spaces Direct record performance of 1.2 Terabit/sec over 12-node cluster

 

Figure 4: NVGRE Offload enables 3X Higher Microsoft Azure Pack Applications’ Efficiency

RDMA isn't the only offload engine available today. Another good example is the three times higher application efficiency that NVGRE offload enables in a Cloud Platform System (CPS) versus running the encapsulation on the CPU. Many other offloads are available in Mellanox's ConnectX family of NICs, including Erasure Coding, VXLAN, Geneve, Packet Pacing and others.

However, Microsoft's Azure team hasn't stopped at using advanced NICs. Modern hyper-scale data centers with hundreds of thousands to millions of servers require much higher levels of offloading, which drives the need for an FPGA in the IO path, between the server and the switches. Such a programmable device, acting as a "bump-in-the-wire," can be used to offload CPU-hungry network functions from the CPU to the FPGA and enable data communication at wire speed. A good example is the offload of IPsec or TLS, which can run over Mellanox's Innova IPsec 4 Lx EN Adapter; it enables a four-times improvement in crypto performance and frees the CPU cores to run the real users' applications.

Security is just one of many functions that can be offloaded from the CPU to the network. Other functions, such as a software-defined networking (SDN) controller or deep machine learning algorithms, can use the FPGA to maximize the scalability and agility of the infrastructure without the need to replace the existing deployment. These functions are also expected to become available in standard NICs, including the existing ConnectX-5 and the upcoming ConnectX-6, which will enable in-network computing. In addition, a new class of light, agile and fast co-processors, such as Mellanox's BlueField, is emerging, which will free the CPU to drive data at wire speeds of 25Gb/s or 50Gb/s, and soon at 100Gb/s. All of these are expected to drive cloud efficiency to new record heights.

 

 

 

 

Mellanox: First to Fast at Dell EMC World 2017 – Pick your Dell Server, We’ve got your Networking Needs Covered

Mellanox will be showcasing Dell server solutions at Dell EMC World, May 8-11, at the Sands Expo Center, Las Vegas, NV. We will be at booth #1549 and invite you to drop by and see for yourself demos of our data center networking solutions, including the Spectrum 100GbE switch; ConnectX-5, ConnectX-4, ConnectX-4 Lx and Dell IO NICs; and LinkX copper and optical cable solutions.

Figure 1: Mellanox End-to-End Data Center Networking Products

 

We are also pleased to announce that at the event we will have in our booth Dell's PowerEdge C6320p, a modular server node with our EDR InfiniBand connectivity. Dell EMC continues a proud history of innovation with the new C6320p modular server node. This 1U half-width server contains a single Intel® Xeon® Phi™ processor with up to 72 out-of-order cores and Mellanox 100Gb/s EDR InfiniBand to ensure maximum CPU and server utilization, which makes the system ideal for performance-demanding HPC environments.

Figure 2: Dell's PowerEdge C6320p has an embedded ConnectX-4 EDR HCA

 

Housing up to four nodes per C6300 chassis, it can deliver a maximum of 288 cores per 2U for tremendous parallel performance. With a Mellanox® ConnectX®-4 100Gb/s embedded HCA in each node, this system can offer more than 6,000 cores per 42U rack, creating a platform for accelerating data-intensive computation such as deep learning, life sciences or weather simulation.
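
The density numbers above follow directly from the node and chassis counts; the short sketch below just makes the arithmetic explicit (illustrative only, assuming a 42U rack filled with 2U chassis).

```python
# Core-density arithmetic for the C6320p/C6300 figures quoted above.
cores_per_node = 72            # Xeon Phi cores per C6320p node
nodes_per_chassis = 4          # nodes per 2U C6300 chassis

cores_per_2u = cores_per_node * nodes_per_chassis      # 288 cores per 2U
chassis_per_rack = 42 // 2                             # 21 chassis in a 42U rack
cores_per_rack = cores_per_2u * chassis_per_rack       # 6048 cores

print(cores_per_2u, "cores per 2U,", cores_per_rack, "cores per 42U rack")
```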

 

Mellanox EDR networking offers Switch-IB 2 with SHArP™ technology as a scale-out platform that can offload MPI and SHMEM/PGAS communications. This is particularly important for Knights Landing processors, which have a lower clock speed than their Xeon cousins and are not well suited to running RDMA emulation or MPI collective processing themselves. This key differentiating capability of Mellanox networks is particularly appropriate in this age of artificial intelligence and machine learning, as it allows the network to operate as an intelligent co-processor in your cluster, making the fabric a synergistic partner in the parallel processing paradigm.

 

The C6320p also leverages the scale and cost efficiencies of the integrated Dell Remote Access Controller 8 (iDRAC8) with Lifecycle Controller, proven by operation in millions of servers worldwide. iDRAC8 with Lifecycle Controller delivers intelligent management and configuration automation for hyper-converged solutions and appliances, and enables you to select the management functions you need, to streamline operations by reducing the time and number of steps to deploy, monitor, and update PowerEdge C6320p servers throughout their entire life cycle.

 

We also have several Mellanox technologists speaking at the show and invite you to attend:

  • Tuesday, May 9th, Time: 3:25 p.m. – 3:40 p.m., Rob Davis, Vice President, Storage Technology: NVMe over Fabrics – High Performance SSDs Networked over Ethernet
  • Wednesday, May 10th, Time: 12:55 p.m. – 1:10 p.m., Bryan Varble, OEM Solutions Architect: Mellanox: Maximizing Performance and Scale from Dell Solutions for Artificial Intelligence

 

Looking forward to seeing you there.

 

 

 

 

Mellanox’s RoCE Networking Enables Micron’s SolidScale™ NVMeoF-based SAN Platform to Achieve a Breakthrough in Data Center Efficiency

With the tremendous growth of data, leading worldwide enterprises already realize that analyzing and understanding large amounts of data in real time is a key differentiator that they must have. Micron's SolidScale, which was launched on March 3rd, 2017, is the first NVMe-over-Fabrics (NVMe-oF)-based scale-out SAN system, and it enables enterprises to achieve their goal of becoming more efficient, data-driven companies.

Figure 1: Micron's SolidScale™ Platform Architecture

 

The Micron SolidScale platform connects multiple server nodes (with Micron NVMe SSDs inside) using Mellanox's high-speed, low-latency RDMA over Converged Ethernet (RoCE) networking solution together with low-latency software that provides a crafted set of data services. This combination results in an extremely efficient hyperconverged solution that performs like local, direct-attached storage.

Preliminary benchmark results for the Micron SolidScale platform have been published in the Micron® SolidScale Platform Architecture solution brief. The results show that when running over three 2U servers, connected over a single Mellanox 100Gb/s RoCE link, the solution achieves more than 10.9M IOPS. This is within 4 percent of an equivalent server-local deployment (Figure 2) and is achieved while adding no more than an average of 10µs to the overall IO for 100 percent random reads of 4K blocks, which is also within 1 percent of an equivalent server-local deployment (Figure 3). These results validate again the efficiency of Mellanox's end-to-end RoCE solution, which includes the Spectrum switch, ConnectX®-4 NICs and LinkX cables.

At the launch event, Micron also published a couple of application-level benchmark results that were tested over the same three-node SolidScale cluster. The results show that under this configuration, using a single 100GbE RoCE link between the nodes, each of the SQL Server instances was able to achieve 11.1 GB/s (see Figure 4), which is very close to the theoretical bandwidth that each link can support, practically reaching the maximum possible wire speed.

Figure 4: SQL Server over SolidScale Benchmark Results

The above benchmark results show that combining the speed of Micron NVMe SSDs with a high-bandwidth Mellanox fabric delivers performance that scales with a negligible penalty compared to local, in-server NVMe. Positioning the joint solution to power modern NVMe-oF-based SAN solutions will take the data center to new record levels of efficiency.

 

Accelerating Artificial Intelligence over High Performance Networks

Last year, on November 10, 2016, IBM announced that Tencent Cloud had joined IBM and Mellanox to break data sorting world records. In that press release, IBM mentioned that, running over a high-speed 100Gb/s network, Tencent Cloud established four world records by winning the GraySort and MinuteSort categories at the renowned Sort Benchmark contest, all at rates two to five times faster than the previous year's winners. Highlights included:

  • Took just 98.8 seconds to sort 100TB of data
  • Used IBM OpenPOWER servers and Mellanox 100Gb/s Ethernet

This year, at the end of last month, Tencent published a blog, "The Future is Here: Tencent AI Computing Network," in which they discuss the advantages of using standard RDMA-enabled networks with NVMe when building an AI system. The blog was written in Chinese; the English translation appears below.

Both references show the significant role of networking in building a data center that needs to meet the growing compute and storage demands of artificial intelligence applications, such as image recognition and speech recognition.

The Future is Here: Tencent AI Computing Network

March 24, 2017, Xiang Li, Tencent Network

Tencent Network – a platform for young Tencent network enthusiasts and peers to exchange their ideas. This group of young people has planned, designed, built and operated a huge and complex Tencent network, and experienced all the ups and downs.

“Tencent Network” is operated by the technical engineering business network platform department of the Shenzhen Tencent Computer Systems Co., Ltd. We hope to work with like-minded friends in the industry to stay on top of the latest network and server trends, while sharing Tencent’s experiences and achievements in network and server planning, operations, research and development, as well as service. We look forward to working and growing with everyone.

There is no doubt that artificial intelligence has been the hottest topic in the IT industry in recent years. Especially since landmark events like AlphaGo in 2016, technology giants in China and worldwide have continued to increase investment in artificial intelligence. At present, the main applications of artificial intelligence, such as image recognition and speech recognition, are achieved through machine learning, with a powerful computing platform performing massive data analysis and calculation. However, with data growth, standalone machines have become unable to meet the calculation need, so high-performance computing (HPC) clusters are required to further enhance computing power.

An HPC cluster is a distributed system that organizes multiple computing nodes together. It generally uses RDMA (Remote Direct Memory Access) technology, such as iWARP, RoCE or InfiniBand (IB), to complete the fast exchange of data between computing nodes. As shown in Figure 1, the RDMA network card can fetch data from the sending node's address space and send it directly to the address space of the receiving node. The entire interaction does not involve the kernel, thus greatly reducing the processing delay on the server side. At the same time, with the network as part of the HPC cluster, any transmission blockage will cause a waste of computing resources. In order to maximize the cluster computing power, the network is usually required to complete RDMA traffic within 10us. Therefore, for HPC-enabled networks, latency is a primary indicator of cluster computing performance.

Figure 1: RDMA interconnection architecture

 

In actual deployments, the main factors that affect network delay are:

  1. Hardware delay. Network equipment forwarding, the number of forwarding hops, and fiber distance all affect network delay; one optimization is to use a two-level "Fat-Tree" to reduce the number of forwarding levels, to upgrade the network speed to forward data at a higher baud rate, and to deploy low-delay switches (minimum 0.3us); a back-of-envelope latency budget is sketched after this list;
  2. Network packet loss. When the network experiences buffer-overflow packet loss due to congestion, the server side needs to retransmit the entire data segment, resulting in serious deterioration of the delay. Common solutions are: increasing the switch cache, increasing the network bandwidth to improve anti-congestion capacity, optimizing incast scenarios in the application-layer algorithm to reduce network congestion points, and deploying flow control technology that slows down the source in order to eliminate congestion, etc.
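
The sketch below puts rough numbers on the hardware-delay factors above. It is illustrative only: the 0.3us cut-through switch delay comes from the text, while the ~5 ns/m fiber propagation figure, the 100 m cable run and the 4 KB message size are assumptions.

```python
# Rough one-way latency budget for a small two-level fat-tree.
def one_way_latency_us(msg_bytes: int, link_gbps: float, hops: int,
                       switch_us: float = 0.3, fiber_m: float = 100.0) -> float:
    serialize = msg_bytes * 8 / (link_gbps * 1e3)   # wire time for the message, us
    switching = hops * switch_us                    # cut-through switch delay, us
    propagate = fiber_m * 5e-3                      # ~5 ns per meter of fiber, us
    return serialize + switching + propagate

# 4 KB RDMA message, 100 Gb/s links, 3 switch hops (leaf-spine-leaf)
print(f"{one_way_latency_us(4096, 100, 3):.2f} us")   # ~1.7 us, well inside 10 us
```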

When the data center network hardware environment is relatively fixed, relying on hardware upgrades to reduce delay has a very limited effect; the more common practice is to reduce delay by reducing network congestion. So, for the HPC network, the industry is focused on studying the "lossless network"; the currently more mature solutions, however, combine a lossy network with flow control protocols, which is a different direction from an industrial lossless network.

Lossy network and flow control protocol

 

Ethernet uses "best effort" forwarding, where each network element tries its best to transmit the data to the downstream network element without caring about the downstream element's forwarding capacity, which may cause congestion packet loss at the downstream element. This means that Ethernet is a lossy network that does not guarantee reliable transmission. Data centers use the reliable TCP protocol to transmit data, but Ethernet RDMA packets are mostly UDP packets, which require the deployment of cache management and flow control technology to reduce packet loss on the network side.

PFC (Priority Flow Control) is a queue-based back-pressure protocol. A congested network element prevents buffer-overflow packet loss by sending Pause frames to tell the upstream network element to reduce its speed. In a single-switch scenario, PFC can throttle the servers quickly and effectively to ensure that the network does not lose packets. However, in a multi-level network, head-of-line blocking (Figure 2), unfair deceleration, PFC storms and other issues may occur, and when an abnormal server transmits PFC messages into the network, it may even paralyze the entire network. Therefore, enabling PFC in the data center requires strict monitoring and management of Pause frames in order to ensure network reliability.

 

Figure 2: PFC head-of-line blocking issue

 

ECN (Explicit Congestion Notification) is an IP-based end-to-end flow control mechanism.

Figure 3: ECN deceleration process

 

As shown in Figure 3, when the switch detects that the port cache occupancy crosses a threshold, it sets the ECN field of the packet at forwarding time. The destination network card generates a congestion notification based on the marked packets, and the source network card decelerates precisely. ECN avoids the head-of-line blocking problem and can achieve accurate deceleration at the flow level. However, because it requires back-pressure packets to be generated on the NIC side, its response period is longer, and it is usually used as an auxiliary means to PFC, to reduce the amount of PFC in the network. As shown in Figure 4, the ECN trigger threshold should be smaller, so that flows are decelerated before PFC takes effect.

Figure 4: PFC and ECN trigger time
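
A minimal sketch of the threshold relationship described above: the values are assumptions chosen only to show the ordering (ECN marking first, PFC pause later, drop only at buffer exhaustion), not vendor defaults.

```python
# Toy model of a switch egress queue with ECN and PFC thresholds.
ECN_MARK_KB = 150      # start ECN-marking packets here (assumed value)
PFC_XOFF_KB = 400      # send PFC Pause frames here (assumed value)
QUEUE_CAP_KB = 1024    # buffer available to the queue (assumed value)

def queue_action(depth_kb: int) -> str:
    """What the switch does at a given egress-queue depth."""
    if depth_kb >= QUEUE_CAP_KB:
        return "DROP (buffer overflow)"
    if depth_kb >= PFC_XOFF_KB:
        return "PFC pause upstream port"
    if depth_kb >= ECN_MARK_KB:
        return "ECN-mark packets (end-to-end slowdown)"
    return "forward normally"

for depth in (50, 200, 500, 1100):
    print(f"queue depth {depth:>4} KB -> {queue_action(depth)}")
```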

 

In addition to the mainstream approaches of large caches, PFC and ECN, the industry has also proposed RDMA-field-based hashing, elephant-flow shaping, the queue-length-based hashing algorithm DRILL, the bandwidth-cache algorithm HULL, and other solutions. However, most of these schemes need support from network cards and switch chips, which makes large-scale deployment hard in the short term.

Industrial lossless network

Figure 5: InfiniBand flow control mechanism

InfiniBand is an interconnect architecture designed for high-performance computing and storage, with a complete, self-defined protocol stack and characteristics such as low latency and lossless forwarding. As shown in Figure 5, the IB network adopts a credit-based flow control mechanism. The sender negotiates an initial credit for each queue when the link is initialized, indicating the number of packets that can be sent to the other end, and the receiver, according to its own forwarding capability, refreshes each queue's credits with the sender in real time; when the sender's credits are exhausted, packet sending stops. Because a network element or network card must be authorized before sending packets, an IB network will not experience prolonged congestion, which ensures the lossless network's reliable transmission. IB provides 15 service queues to differentiate traffic, and traffic in different queues does not block one another. At the same time, IB switches use "cut-through" forwarding, with a single-hop forwarding delay of about 0.3us, much lower than an Ethernet switch.
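
The toy model below (a sketch of the idea only, not the InfiniBand wire protocol; the credit count and drain rate are assumptions) shows why credit-based flow control is lossless by construction: the sender can never put more packets in flight than the receiver has advertised buffer for.

```python
# Toy model of credit-based link-level flow control.
from collections import deque

credits = 8                 # negotiated at link initialization (assumed value)
rx_buffer = deque()
sent = dropped = 0

for tick in range(32):
    # Sender: transmit one packet per tick only if a credit is available.
    if credits > 0:
        rx_buffer.append(f"pkt{tick}")
        credits -= 1
        sent += 1
    # Receiver: drain one packet every other tick and return a credit.
    if tick % 2 == 1 and rx_buffer:
        rx_buffer.popleft()
        credits += 1

print(f"sent={sent}, dropped={dropped}, buffered={len(rx_buffer)}, credits left={credits}")
# The sender throttles to the receiver's drain rate instead of dropping packets.
```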

Therefore, for a small HPC or storage network, IB is an excellent choice, but other issues, such as IB not being compatible with Ethernet and its limited product variety, make it difficult to integrate into the Tencent production network.

The Tencent AI Computing Network

The Tencent AI computing network is part of the production network; in addition to communicating with other network modules, it also needs to connect to back-end systems such as network management and security. This means that only the Ethernet option, compatible with the existing networks, can be chosen. The architecture of the computing network has gone through multiple iterations as business requirements grew, from the earliest HPC v1.0, which supported 80 40G nodes, to today's HPC v3.0, which supports 2,000 100G nodes.

Computing nodes in the computing network are used by the entire company as a resource pool, which exposes the network to concurrent congestion from multi-service traffic. For a network carrying a single service, we can avoid network congestion through application-layer algorithm scheduling. However, for a multi-service shared network, concurrent congestion of multi-service traffic is inevitable; even with queue protection and flow control mechanisms to reduce network packet loss, the cluster still loses computing capacity as servers slow down. At the same time, PFC's defects make it unsuitable for a multi-level network, and its scope of effectiveness needs to be limited. Therefore, our design approach is as follows:

  1. Physically isolate businesses: use high-density equipment as access equipment to concentrate a department's nodes in one access device as far as possible, and limit the number of cross-device clusters;
  2. PFC is enabled only in the access device to ensure rapid back pressure, while ECN protection for cross-device clusters is enabled across the entire network;
  3. For small cross-device clusters, provide enough network bandwidth to reduce congestion and use large-cache switches to solve the problem of long ECN back-pressure cycles;
  4. To meet the requirements of high-density access, large caches, end-to-end back pressure, etc., the HPC v3.0 architecture uses BCM DUNE-series chip-based chassis switches as access devices.

Figure 6: HPC v3.0 architecture

As shown in Figure 6, HPC v3.0 is a two-stage CLOS architecture in which both the convergence devices (LC) and the access devices (LA) are BCM DUNE chip chassis switches, and each LA can connect up to 72 40G/100G servers. Taking into account that the clusters used by most applications today are 10 to 20 nodes, and that improvements in future computing nodes and algorithms will further limit cluster sizes, 72 ports are sufficient to meet the computing requirements of a single service. The DUNE line card supports 4GB of cache, can buffer ms-level congestion, and supports an end-to-end, VoQ-based flow control scheme (Figure 7) that can achieve accurate deceleration of servers under the same chassis, similar to PFC. Although the forwarding delay of the chassis switch (4us) is greater than that of a fixed-form-factor switch (1.3us), this does not affect cluster performance, given the reductions in packet loss and congestion that it brings.

Figure 7: DUNE chip end-to-end flow control
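
For context on the port counts above, the rough sizing sketch below (illustrative arithmetic only, not Tencent's actual bill of materials) shows how many LA switches the stated 2,000-node target implies and why a typical 10-20 node cluster fits inside a single LA.

```python
# Access-layer sizing for the HPC v3.0 figures quoted above.
import math

servers_per_la = 72       # 40G/100G server ports per LA chassis switch
target_nodes = 2000       # 100G nodes the architecture is designed for

la_switches = math.ceil(target_nodes / servers_per_la)   # access switches needed
print(f"{la_switches} LA switches cover {la_switches * servers_per_la} server ports")
# -> 28 LA switches, 2016 ports: a typical 10-20 node cluster fits entirely
# inside one LA, so PFC stays local and only ECN operates across the LC layer.
```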


From a financial perspective, the per-port cost of a chassis switch is higher than that of a fixed-form-factor switch. A single LA node, however, can meet most of the computing needs, and the demand for cross-LA clusters is limited, which reduces the number of interconnection modules; overall, this is lower in cost than the traditional fixed-switch access with one-to-one aggregation.

Summary

For a long time, the network was not the bottleneck in data center performance, and "large bandwidth"-based network design was able to meet business application needs. In recent years, however, the rapid development of server technology has led to rapid improvement of data center computing and storage capacity, and RDMA technologies such as RoCE and NVMe over Fabrics have shifted the data center performance bottleneck to the network side. Especially for new RDMA-based applications such as HPC, distributed storage, GPU clouds and hyper-converged architectures, network delay has become a major constraint on performance. Therefore, it is foreseeable that future data center designs will gradually shift from being bandwidth-driven to delay-driven. Our long-term goals include building a low-latency, lossless, large-scale Ethernet data center and establishing a complete cache and delay monitoring mechanism.

You are more than welcome to follow our public account, “Tencent Network”. We provide you with the latest industry news, the hands-on experiences of Tencent network and servers, as well as a few interactive exchange events with prizes that are being prepared. We welcome and look forward to your active participation.

  • Note 1: The copyright of any texts and images that are marked as from “Tencent Network” belongs to the Shenzhen Tencent Computer System Co., Ltd. Without its official authorization, use is not permitted. Any violation of such, once verified, will be prosecuted.
  • Note 2: Some of the images of this article come from the internet. Please contact kevinmi@tencent.com for any copyright issues

You can subscribe by simply clicking on the “public account” above!

Enabling Higher Azure Stack Efficiency – Networking Matters

A couple of weeks ago, Mellanox's ConnectX®-3/ConnectX-3 Pro and ConnectX-4/ConnectX-4 Lx NICs became the first to pass Microsoft's Server Software-Defined Data Center (SDDC) Premium certification for Microsoft Windows at all standard Ethernet speeds: 10, 25, 40, 50 and 100GbE. This was the latest significant and crucial milestone in the journey that Microsoft and Mellanox started more than six years ago to enable our networking hardware to deliver the most efficient solutions for new Windows Server and Azure Stack-based deployments. These ConnectX NICs have already been certified by the world's leading server OEMs (HPE, Dell, Lenovo, and others¹), and when deployed with the most advanced switches and cables, such as Mellanox's Spectrum switch and LinkX copper and optical cables, they have been proven to provide the most efficient Azure Stack solutions. This latest milestone is a good occasion to look back and analyze the progress that has led to this point. For brevity's sake, let's start in 2012.

In 2012, Microsoft released its first Windows Server 2012 product. In that release, Microsoft launched a game-changing, enterprise-class storage solution called Storage Spaces. The new solution was developed to handle the exponential growth of data, which created significant challenges for IT, as traditional block storage no longer proved effective for query and analysis. Compared to older storage solutions, Storage Spaces doubled the performance at half the cost, enabling significantly higher efficiency in Windows-based data centers.

To accomplish this, Microsoft's storage team leveraged the availability of new technologies, such as flash-based storage and high-performance networking, and made a couple of brave decisions that helped them achieve their goals. The first was to enable SMB 3.0 to run over an RDMA-enabled network (SMB Direct); the second was to replace the traditional Fibre Channel (FC)-based Storage Area Network (SAN) architecture, which fell short of addressing modern data centers' storage needs, with the advanced Storage Spaces solution running over SMB Direct. This meant using only Ethernet networking, a faster and lower-cost replacement for FC and SAN.

Figure 1: Windows Server 2012 Storage Spaces over SMB Direct

 

Windows Server 2012 supports only a converged system architecture (in which there are dedicated servers for compute and dedicated servers for storage), so Storage Spaces could run only as a Scale-Out File Server (SOFS). The efficiency boost of replacing FC with an RDMA-enabled network, as published by Microsoft for Windows Server 2012 R2 Storage, showed a 50 percent lower cost per GB of storage.

Figure 2: A comparison of cost acquisition between model scenarios

 

Immediately after the release of Windows Server 2012, several papers were published demonstrating the higher efficiency of the solution, including "Achieving Over 1-Million IOPS from Hyper-V VMs in a Scale-Out File Server Cluster Using Windows Server 2012 R2" and "Optimizing MS-SQL AlwaysOn Availability Groups With Server SSD." All showed the advantages of using Mellanox's RDMA-enabled network solution in scale-out deployments.

Microsoft continued to develop and enhance the Storage Spaces features and capabilities, and in 2016, in the Windows Server 2016 release, they added support for hyperconverged systems: a solution that uses Software-Defined Storage (SDS) to run compute and storage over the same servers, using Storage Spaces over RDMA-enabled networks (Storage Spaces Direct, or S2D).

Figure 3: Windows Server, past and future Storage Solutions (source: Microsoft Storage Spaces Direct – the Future of Hyper-V and Azure Stack)

 

The efficiency boost that Microsoft’s new Hyperconverged S2D system delivers is clearly illustrated in Figure 3. However, building a Hyperconverged system requires special attention to network performance, as the network must handle all data communication, including:

  • Application to application
  • Applications to storage
  • Management
  • User access
  • Backup and recovery
  • Compute, storage and management traffic

Figure 4: Networking matters in a Hyperconverged deployment – CapEx only

 

When building a system in which 25GbE replaces the more traditional 10GbE, the higher bandwidth enables close to two times higher efficiency, as displayed in Figure 4. However, above and beyond the higher bandwidth, an RDMA-enabled network, such as RDMA over Converged Ethernet (RoCE), reduces the overall data communication latency and maximizes server utilization, resulting in improved deployment efficiency.

Jose Barreto delivered a fascinating presentation at Microsoft's Ignite 2015, where he showed, in real time, the performance boost that RDMA enables. In a three-minute video, Barreto compared the performance of TCP/IP vs. RDMA (Ethernet vs. RoCE) and clearly showed that RoCE delivers almost two times higher bandwidth and two times lower latency than Ethernet at 50 percent of the CPU utilization required for the data communication task.

Figure 5: Mellanox RoCE solutions maximize S2D efficiency

 

The presentation also analyzed the "magic" behind RoCE's advantage, showing that when running over TCP/IP, all of the CPU cores assigned to communication tasks were 100 percent utilized, whereas with RoCE the cores were barely used. As such, the TCP/IP protocol stack could not scale, while RoCE could support a much larger scale-out cluster size. With such performance advantages, Mellanox's RoCE solutions became the de facto standard for Windows Server 2016 S2D benchmarks and products, to the degree that at Ignite '16 a number of record-level benchmarks and products were announced.

Figure 6: Storage Spaces Direct record performance of 1.2 Terabit/sec over 12-node cluster

 

One of the most impressive demos at the show was a record-breaking benchmark performed over a 12-node cluster using HPE DL380 Gen9 servers, each with four Micron 9100MAX NVMe storage cards and dual Mellanox ConnectX-4 100GbE NICs, all connected by Mellanox's Spectrum 100GbE switch and LinkX cables. The cluster delivered an astonishing, sustained 1.2Tb/s of bandwidth across application-to-application communication, which, of course, enables higher data center efficiency.  A separate benchmark, using a smaller cluster of only four nodes to run MS SQL, achieved a record performance of 2.5 million transactions per minute (using SOFS). In addition, many other blogs have been published showing the competitive advantages that Mellanox networking solutions enable when used in Windows Server 2016-based deployments, including:

 

At Ignite 2016, Microsoft also announced that it expects the lead OEMs to release Azure Stack-based hyperconverged systems in June 2017. Such systems will be compatible with Microsoft's Azure cloud, which already runs over RoCE, enabling seamless operation between on-premises (private) and off-premises (Azure public) clouds.

Figure 7: Quote from Albert Greenberg, Microsoft / Azure, presenting ONS2014 Keynote

 

Azure Stack can run over hyperconverged systems that use traditional networking, requiring the SDDC Standard certification, or over Software-Defined Networking (SDN)-based hyperconverged systems, requiring the SDDC Premium certification. DataON's hyperconverged system, for example, is connected by Mellanox's end-to-end networking solution. A DataON appliance that was launched at Ignite '16 has already been deployed and delivers significant competitive advantages to its users.

The Storage Spaces Direct journey that started six years ago with uncertainty has led to its establishment as the next-generation Windows-based deployment model. S2D leverages the high performance that RoCE delivers, enabling higher performance at lower cost and replacing traditional FC-based SANs. That journey continues: additional networking enhancements to the solution's capabilities are under development and will be added soon.

¹References:

2017 Prediction – Networking will take Clouds to New Levels of Efficiency

At the dawn of the 21st century, in order to meet market demand, data center architects started to move away from the traditional scale-up architecture, which suffered from limited and expensive scalability, to a scale-out architecture. Looking back, it was the right direction to take, since the ongoing growth of data and mobile applications has required the efficient and easy-to-scale infrastructure that the new architecture successfully enables.

However, as we roll into 2017 and beyond, IT managers who deploy data centers based on a scale-out architecture should be aware of a new consideration. In the past, with scale-up architectures, maximizing efficiency only required verifying that the performance of the CPU, memory and storage was balanced; in the new scale-out architecture, networking performance must be taken into account as well.

There are a couple of parameters that IT managers need to consider in 2017 when choosing the right networking for their next-generation deployments. The first, of course, is networking speed. Although 10GbE is the most popular speed today, the industry has already started to realize that 25GbE provides higher efficiency and is fast becoming the next 10GbE; 25GbE will become the new 10GbE in 2017. The main reason behind this prediction is the much higher bandwidth that flash-based storage, such as NVMe, SAS and SATA SSDs, can provide, whereby a single 25GbE port replaces three 10GbE ports. This, by itself, can cut the networking cost threefold, enabling the use of a single switch port, a single NIC port and a single cable instead of three of each.  As IT managers start looking at next year's budget, they know this is where they should be allocating their networking dollars.
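
A simple sketch of that port-count argument is below. The per-port prices are hypothetical placeholders (not quotes from any vendor or from the original post); the point is only that one 25GbE link per server replaces three 10GbE links' worth of NIC ports, switch ports and cables.

```python
# Illustrative bill-of-materials comparison: 3x10GbE links vs. 1x25GbE link per server.
def ports_bom(servers: int, ports_per_server: int, nic_port: float,
              switch_port: float, cable: float) -> float:
    per_link = nic_port + switch_port + cable
    return servers * ports_per_server * per_link

servers = 100
cost_10g = ports_bom(servers, 3, nic_port=100, switch_port=150, cable=50)   # 3x10GbE
cost_25g = ports_bom(servers, 1, nic_port=150, switch_port=200, cable=70)   # 1x25GbE

print(f"3x10GbE: ${cost_10g:,.0f}   1x25GbE: ${cost_25g:,.0f}")
# One NIC port, one switch port and one cable per server instead of three of each.
```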

 

A couple of other good examples for 2017 show where higher bandwidth enables higher efficiency. This will happen more and more next year in VDI deployments, where a 25GbE solution cuts the cost per virtual desktop in half, or when deploying new infrastructure that has to support modern media and entertainment requirements, where 10GbE can't deliver the additional speed needed to support the number of streams required for today's high-definition resolutions. As IT managers revisit those all-important 2017 budgets, ROI will become more and more important as companies are increasingly unwilling to accept the performance-cost trade-off. Essentially, in 2017, IT managers and their companies want to have their networking cake and eat it too.

However, deploying higher networking speeds is just one way that IT managers can take their cloud efficiency to the next level. As they consider their options, they should also use networking products that offload specific networking functions from the CPU to the IO controller itself. By choosing this solution, more CPU cycles are freed to run the applications, which accelerates job completion and enables using fewer CPUs or CPUs with fewer cores. Ultimately, in 2017, the overall license fees for the OS and the hypervisor will be lower; both, of course, increase the overall cloud efficiency.

Several network functions have already been offloaded to the IO controller. One of the most widely used is RDMA (Remote Direct Memory Access), which offloads the transport layer to the NIC instead of running the heavy, CPU-demanding TCP/IP protocol on the CPU. This is the main reason why IT managers should consider deploying RoCE (RDMA over Converged Ethernet) next year. Using RoCE makes data transfers more efficient and enables fast data movement between servers and storage without involving the server's CPU. Throughput is increased, latency is reduced, and CPU power is freed up for running the real applications. RoCE is already widely used for efficient data transfer in render farms and in large cloud deployments such as Microsoft Azure. Moreover, it has proven superior efficiency vs. TCP/IP and thus will be utilized more than ever before in 2017.


 

Offloading overlay network technologies, such as VXLAN, NVGRE and the new Geneve standard, on the NIC, or VTEP on the switch, enables another significant gain in cloud efficiency. These represent typical stateless networking functions which, when offloaded, significantly shorten job execution times. A good example is the comparison that Dell published, running typical enterprise applications over its CPS appliance with and without NVGRE offloading. It shows that offloading accelerated the applications by more than 2.5 times on the same system, which, of course, increased the overall system efficiency by the same amount.

 

There are several other offloads supported by the networking components, such as offloading security functions, for example IPsec, or erasure coding, which is used in storage systems. In addition, a couple of IO solution providers have already announced that their next-generation products will include new offload functions, such as vSwitch offloading, which will accelerate virtualization, or NVMe-oF offload. The latter has also been announced by Mellanox for its next ConnectX-5 NIC, which we believe will proliferate as a solution of choice in 2017.

These new networking capabilities have already been added to the leading OSes and hypervisors. At VMworld '16, VMware announced support for all networking speeds (10, 25, 40, 50 and 100GbE) and VM-to-VM communication over RoCE in vSphere 6.5. Microsoft, at its recent Ignite '16 conference, announced support for up to 100GbE and recommended running production deployments of Storage Spaces Direct over RoCE. Microsoft has also published superior SQL Server 2016 performance results when running over a network that supports the highest speeds and RoCE. These capabilities have been included in Linux for a very long time as well. So, with a new year looming on the horizon, it's up to IT architects to choose the right networking speeds and offloads that will take their cloud efficiency to the next level in 2017.

This post originally appeared on VMblog here.

HPE, Mellanox, Micron, and Microsoft Exceed One Terabit Per Second of Storage Throughput With Hyper-converged Solution

In the "old days" of tech, meaning roughly three to six years ago, there were some hard and fast rules about getting really high throughput from your storage:

  1. The storage was always separate from the compute servers and you had to buy dedicated, specialized storage systems
  2. The storage network was most likely Fibre Channel SAN for block storage (InfiniBand for a parallel file system and 10Gb Ethernet for scale-out NAS)
  3. You needed many disk drives — dozens or hundreds to meet the needed throughput
  4. It was really expensive — $200,000 to $1M just to reach 100Gb/s (~12GB/s) of sustained throughput.
  5. “High performance storage” and “Microsoft Windows” were never in the same rack, let alone the same paragraph — all fast storage ran Linux, Solaris, FreeBSD, or a specialized real-time operating system.


Figure 1: The Good Old Days may have been good for many reasons, but fast computer products were not one of them.

 

The Times They Are A’ Changing

But starting in 2013, I began to see people breaking these rules. Software-defined storage delivered good performance on commodity servers. Hyper-converged infrastructure let compute and storage run on the same machines. Flash delivered many times the performance of spinning disks. Faster interconnects like 40Gb Ethernet grew in popularity for large clouds, compute clusters, and scale-out storage, as five vendors, including Mellanox, announced the new 25 and 50Gb Ethernet standards. And then there was Microsoft…


Figure 2: The HPE DL380 Gen 9 looks like a server, but thanks to software-defined storage and hyper-converged infrastructure, it can be storage, or compute and storage simultaneously.

Revolution from Redmond

Microsoft was an early leader in several of these fields. Windows Server 2012 R2 had native support to run over both 40 Gb Ethernet (with RoCE—RDMA over Converged Ethernet) and FDR 56Gb InfiniBand at a time when most enterprise storage systems only supported 10GbE and 8Gb Fibre Channel. In 2013 Microsoft and their server partners demonstrated that SMB Direct on 40 Gb Ethernet or FDR InfiniBand could best Fibre Channel SANs in both performance and price, and reduce the number of application servers needed to support a given workload. Faster and more efficient networking saved customers money on both server hardware and software licenses.


Figure 3: Microsoft 2013 study showed Windows Storage with RDMA and SAS hard drives had half the cost/capacity of Fibre Channel SAN, with the same performance.

 

At TechEd 2013, Microsoft demonstrated the power of RDMA with 40Gb Ethernet by showing that the live migration of virtual machines, a frequent and important task in both public and private clouds, was up to ten times faster using RDMA than using TCP/IP.

 


Figure 4: RDMA Enables live VM migration 10x faster than using TCP/IP – presented at the TechED’13 Opening Keynote Session.

In 2014, at the Open Networking Summit, Microsoft presented how they ran storage traffic using RoCE on 40GbE in their own cloud to lower the cost of running their Azure Storage. Dell and Mellanox teamed up with Microsoft to demonstrate over one million read IOPS using just two storage nodes and two clients, connected with FDR 56Gb/s InfiniBand. At the time, reaching 1M IOPS normally required a fancy and expensive dedicated storage array but this demo achieved it with just two Windows servers.

 

Then in 2015, we saw demonstrations of Windows Storage Spaces at Microsoft Ignite 2015 using one Mellanox 100Gb Ethernet link and Micron’s NVMe flash cards to achieve over 90 Gb/s (~11.1 GB/s) of actual throughput with just one millisecond latency. This was the first time I’d seen any single server really use 100 Gb Ethernet, let alone a Windows Server. It also proved that using SMB Direct with RDMA was a major advantage, with approximately twice the throughput, half the latency, and half the CPU utilization of using the regular SMB 3 protocol over TCP/IP.


Figure 5: One Windows Server delivers over 90 Gb/s of throughput using a single 100GbE link with RoCE. Performance without RoCE was halved.

Hyper-Race to Hyper-converged Windows Storage

 

In 2016, the race began to demonstrate ever faster performance using Windows Storage Spaces Direct (S2D) in a hyper-converged setup with NVMe flash storage and 100Gb RoCE. First, Mellanox, Dell and HGST (a Western Digital brand) built a two-server cluster with Dell R730XD machines, each with two HGST UltraStar SN150 NVMe SSDs and two Mellanox ConnectX-4 100GbE NICs. A Mellanox Spectrum switch connected the machines, and the cluster delivered 178Gb/s (22.3GB/s). Then, at Flash Memory Summit and Intel Developer Forum, Microsoft, Dell, Samsung and Mellanox showed 640Gb/s (80GB/s) using four Dell R730XD servers, each with four Samsung NVMe SSDs and two Mellanox ConnectX-4 100GbE NICs, all connected by Mellanox's Spectrum switch and LinkX® cables.

 

These demos were getting faster every few months, showing the power and flexibility of a hyper-converged deployment on Windows Server 2016.


Figure 6: Microsoft recommends RDMA when deploying Windows Storage Spaces Direct as a hyper-converged infrastructure.

Breaking the One Terabit Per Second Barrier

 

Now, at Microsoft Ignite 2016, comes the latest demonstration of Windows S2D with an amazing milestone. A 12-node cluster using HPE DL380 servers with Micron 9100 Max NVMe SSDs and Mellanox ConnectX-4 100GbE NICs delivers over 1.2 terabits per second (>160GB/s). Connectivity was through a Mellanox Spectrum 100GbE switch and LinkX cables. More impressively, the cluster hosted 336 virtual machines while consuming only 25 percent of the CPU power, leaving 75 percent of the CPU capacity for running additional applications, which is important because the whole purpose of hyper-converged infrastructure is to run applications.

 

This HPE server is the latest version of their ever popular DL380 series, the world’s best-selling server (per IDC’s Server Tracker for Q1 2016). It supports up to 3TB of DDR4 memory, up to 6 NVMe SSDs, 25/40/50/100GbE adapters from Mellanox, and of course Microsoft Windows Server 2016, which was used for this demo.


Figure 7: HPE, Mellanox and Micron exceed 1.2 Terabit/second using twelve HPE DL380 Gen 9 servers, 48 Micron NVMe SSDs, with Mellanox 100GbE NICs, switch, and cables.

Amazing Performance, Price and Footprint

This is an amazing achievement considering that, a few years ago, this was easily considered supercomputer performance, and now it is available in any data center using off-the-shelf servers, SSDs, and Ethernet, with software-defined storage (Windows Server 2016), all at a reasonable off-the-shelf price. To give an idea of how impressive this is, look at what's required to deliver this throughput using traditional enterprise or HPC storage. Suppose each HPE server, with four Micron NVMe SSDs and two Mellanox ConnectX-4 100Gb Ethernet NICs included, costs $15,000, plus $12,000 for the 100GbE switch and cables. That makes the cost of the entire 12-node cluster $192,000, and at 25U in height (2U per server, 1U for the switch) it consumes half a rack of footprint. But if we use traditional storage arrays…

 

  • A well-known all-flash array can deliver up to 96Gb/s (12GB/s) using 8x16Gb FC ports or 4x56Gb FDR InfiniBand ports. Suppose it costs $30,000 and consumes 6U each; to equal the Windows cluster throughput would require 13 systems costing $390,000 and consuming two racks.
  • A proven Lustre-based appliance delivers 96 Gb/s (12GB/s) per node in an 8U system so 480Gb/s (60GB/s) per rack. Suppose it costs $150,000 per rack — matching the Windows cluster throughput would require three racks and $450,000.
  • A popular enterprise scale-out SAN supports 320 Gb/s (40GB/s) of front-end bandwidth per node (20x16Gb FC ports). If each controller can drive the full 320 Gb/s, you would need four nodes plus a load of flash or disk shelves to reach 1.2 Tb/s. That’s probably at least $400,000 and 1.5 racks.
  • A high-end, Tier-1 scale-out SAN supports 512 Gb/s (64 GB/s) of front-end bandwidth per engine so reaching 1.2Tb/s would require 3 engines (4 if they must be deployed in pairs); let’s assume that costs at least $600,000 and consumes at least two racks.
1.2 Tb/s Solution    Nodes   Footprint   Network Ports    Clients    Estimated Cost
Windows S2D          12      ½ rack      24               Included   $192,000
AFA                  13      2 racks     52 IB / 104 FC   Separate   $390,000
Lustre cluster       14      3 racks     56               Separate   $450,000
Scale-out SAN        4       1.5 racks   80               Separate   $400,000
High-end SAN         3-4     2 racks     80-96            Separate   $600,000

Figure 8: The Windows S2D solution uses from one-quarter to one-third the footprint of traditional storage solutions, and its acquisition cost is estimated to be less than half that of the other solutions.

To be clear, these are all fine and proven enterprise or HPC storage solutions which offer many rich features, not all of which are necessarily available in Windows S2D. They probably have much more storage capacity than the 115TB (raw) in the 1.2Tb/s Windows S2D cluster. (Also, these are approximate price estimates, and actual prices for each system above could be significantly higher or lower.) But these are also prices for storage only, not including any compute power to run virtual machines and applications, whereas the Windows S2D solution includes the compute nodes and plenty of free CPU power. That only re-emphasizes my point: the Windows solution delivers much more throughput at a much lower price, and it consumes much less space, power, and fewer network ports to achieve it.
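
To make the comparison in the table easier to eyeball, the snippet below normalizes each configuration to the Windows S2D cluster, using only the estimated costs and footprints listed above (which, again, are rough estimates rather than quotes).

```python
# Normalize each 1.2 Tb/s configuration to the Windows S2D baseline.
solutions = {                      # name: (estimated cost in USD, racks)
    "Windows S2D":     (192_000, 0.5),
    "All-flash array": (390_000, 2.0),
    "Lustre cluster":  (450_000, 3.0),
    "Scale-out SAN":   (400_000, 1.5),
    "High-end SAN":    (600_000, 2.0),
}

base_cost, base_racks = solutions["Windows S2D"]
for name, (cost, racks) in solutions.items():
    print(f"{name:16s} {cost / base_cost:4.1f}x cost, {racks / base_racks:3.1f}x footprint")
```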


Figure 9: This amazing Windows S2D performance didn't come from completely generic components, but it did use general-purpose, off-the-shelf servers, storage, software, and networking from HPE, Micron, Microsoft and Mellanox to achieve performance that is far from generic.

The key lesson here is that software-defined storage and hyper-converged infrastructure, combined with NVMe flash and 100Gb RDMA networking, can now equal or surpass traditional enterprise arrays in performance at a far lower price. What used to be the sole realm of racks of dedicated arrays, expensive Fibre Channel networks, and row-upon-row of drives can now be accomplished with half a rack of general purpose servers with a few SSDs and 2 Mellanox NICs in each server.

 

It’s a software-defined storage revolution riding on the back of faster flash and faster networks — a revolution in more ways than one.

Read the solution brief here. 

Discover more about technology innovation being led by Mellanox and Microsoft below:

Mellanox RoCE’d Las Vegas at VMWorld 2016

Mellanox accelerated the speed of data in the virtualized data center from 10G to new heights of 25G at VMworld 2016, which was held in Las Vegas, Aug. 29 – Sept. 1, at the Mandalay Bay.

 

The big news at the show revolved around Mellanox’s announcement regarding the integration of its software driver support for ConnectX®-4 Ethernet and RoCE (RDMA over Converged Ethernet) in VMware vSphere®, the industry’s leading virtualization platform. For the first time, virtualized enterprise applications are able to realize the same industry-leading performance and efficiency as non-virtualized environments. The new vSphere 6.5 software for ConnectX-4 delivers three critical new capabilities: increased Ethernet network speeds at 25/50 and 100 Gb/s, virtualized application communication over RoCE, and advanced network virtualization and SDN (Software Defined Networking) acceleration support. Now applications that run in the Cloud over a vSphere-based virtualized infrastructure, can communicate over 10/25/40/50 and 100 Gb/s Ethernet and leverage RoCE Networking to maximize Cloud infrastructure efficiency.

Mellanox also hosted technology demos and ongoing presentations showcasing the benefits of using our Spectrum 10/25/40/50/100GbE switch, which doesn't lose packets, and we demonstrated why vSphere runs best over it. We also had a number of key Mellanox partners participate in the Mellanox booth presentations, showing how we enable the superior capabilities of our network solutions, including end-to-end support for 25GbE, which is much more efficient networking than 10GbE, which is no longer fast enough to meet today's performance and efficiency needs. This is why, when it's time to choose your next networking technology provider for your cloud deployment, or for your hyper-converged systems, you will choose Mellanox.

Mellanox also had a number of key presentations at the show on the topics of:

  • Achieving New Levels of Cloud Efficiency over vSphere based Hyper-Converged Infrastructure
  • iSCSI/iSER: HW SAN Performance Over the Converged Data Center
  • Latencies and Extreme Bandwidth Without Compromise

Some highlight endorsements from our partners included:

“We are happy to collaborate with Mellanox to enable running VMware vSphere-based deployments with high performance networking,” said Mike Adams, senior director, Product Marketing, VMware. “With VMware vSphere and Mellanox technologies, mutual customers can accelerate mission critical applications, while achieving high performance and reduced costs.”

“High performance, reliable fabrics are fundamental to application performance and efficiency,” said JR Rivers, Co-Founder and CTO, Cumulus Networks. “We are pleased to collaborate with Mellanox and deliver the Cumulus Linux and Mellanox Spectrum joint solution to enable RDMA fabrics in vSphere.”

Overall, at the show, Mellanox demonstrated its Ethernet and RoCE end-to-end solutions and showed how easy it is to increase the efficiency of the cloud just by deploying tomorrow's networking solutions today. Those solutions are based on the company's recent network products, which improve productivity, scalability and flexibility, all of which now enable the industry to define a data center that meets your current needs while remaining future-proof. The solutions support not just all data speeds but also include efficient offload engines that accelerate data center applications with minimal CPU overhead.

 

Things Are About to Get RoCE with Mellanox in Las Vegas

Mellanox will accelerate the speed of data in the virtualized data center from 10G to new heights of 25G at VMworld 2016, which converges in Las Vegas Aug. 29 to Sept. 1 at Mandalay Bay.


With an announcement coming for Mellanox's ConnectX®-4 Ethernet and RoCE (RDMA over Converged Ethernet), things are about to get rocky in the best possible way. For starters, Mellanox will be showing how easy it is to increase the efficiency of the Cloud just by deploying tomorrow's networking solutions today. Mellanox delivers network products that improve productivity, scalability and flexibility, all of which enable the industry to define a data center that meets your needs now while remaining future-proof. As the leading provider of higher-performance Ethernet NICs, with an 85 percent market share above 10GbE, Mellanox's ConnectX-4 and ConnectX-4 Lx NICs support not just all data speeds but also include efficient offload engines that accelerate data center applications with minimal CPU overhead. Stay tuned at the show for exciting news about Mellanox's 10/25/40/50/100Gb/s Ethernet and RoCE end-to-end solution.


Mellanox will also be hosting technology demos and ongoing presentations at booth #2223, where show attendees can learn about the benefits of using our Spectrum 10/25/40/50/100GbE switch, which doesn't lose packets, and why vSphere runs best over it. Show-goers will also have the chance to win prizes, all around the theme of driving the industry to 25G. In addition, a number of key Mellanox partners will be participating in the Mellanox booth presentations as we enable the superior capabilities of our network solutions, including end-to-end support for 25GbE, which is much more efficient networking than 10GbE, which is no longer fast enough to meet today's performance and efficiency needs. This is why, when it's time to choose your next networking technology provider for your cloud deployment, or for your hyper-converged systems, you will choose Mellanox – but don't just believe me, come visit our booth and see for yourself!

Lastly, don’t miss one of our paper presentations that were selected by the VMworld committee this year:

  • Achieving New Levels of Cloud Efficiency over vSphere based Hyper-Converged Infrastructure [HBC9453-SPO]
    • Monday Aug 29 from 5:30 p.m. to 6:30 p.m.
  • iSCSI/iSER: HW SAN Performance Over the Converged Data Center [INF8469]
    • Wednesday, Aug 31 from 1 p.m. to 2 p.m.
  • Latencies and Extreme Bandwidth Without Compromise [CTO8519]
    • Thursday, Sept 1 from 12 p.m. to 1 p.m.

Hope to see you there.

 

 


Achieving New Levels of Application Efficiency with Dell’s PowerEdge Connected over 25GbE

After many years of extensive development of data center virtualization technologies, which started with server virtualization and continued with network virtualization and storage virtualization, the time has arrived to work on maximizing the efficiency of the data centers that have been deployed over those advanced solutions. The rationale for doing this is pretty clear. New data centers are based on the hyper-converged architecture, which eliminates the need for dedicated storage systems (such as SAN or NAS) and for servers dedicated solely to storage. Modern servers used in such hyper-converged deployments usually contain multiple CPUs and large storage capacity. Modern CPUs have double-digit core counts, enabling the servers to support tens, and in some cases hundreds, of virtual machines (VMs). From the storage point of view, such servers have a higher number of PCIe slots, which enables NVMe storage to be used, as well as the ability to host 24 or 48 SAS/SATA SSDs, both of which result in extremely high storage capacity.


Figure 1: Microsoft's Windows Server 2016 Hyper-Converged Architecture, in which the same server is used for compute and storage.

Now that there are high-performance servers, each capable of hosting tens of VMs and millions of IOPS, IT managers must take a careful look at the networking capabilities and avoid IO-bound situations. The network must now support all traffic classes: the compute communication, the storage communication, the control traffic, and so on. As such, not having high enough networking bandwidth will result in unbalanced systems (see: How Scale-Out Systems Affect Amdahl's Law) and will therefore reduce the overall deployment efficiency. That is why Dell has equipped its PowerEdge 13th-generation servers with Mellanox's ConnectX®-4 Lx 10/25Gb/s Ethernet adapters, delivering significant application efficiency advantages and cost savings for private and hybrid clouds running demanding big data, Web 2.0, analytics, and storage workloads.

In addition to data communication over 25GbE, Dell’s PowerEdge servers, equipped with ConnectX-4 Lx-based 10/25GbE adapters, are capable of accelerating latency-sensitive data center applications over RoCE (RDMA over Converged Ethernet), which enables similar performance in a virtualized infrastructure as in a non-virtualized infrastructure. This, of course, further maximizes system efficiency.

A good example that demonstrates the efficiency that higher-bandwidth, lower-latency networks enable is Microsoft's recent blog, which published the performance results of a benchmark run over a 4-node Dell PowerEdge R730XD cluster connected over 100Gb Ethernet. Each node was equipped with the following hardware:

  • 2x Xeon E5-2660v3 2.6Ghz (10c20t)
  • 256GB DRAM (16x 16GB DDR4 2133 MHz DIMM)
  • 4x Samsung PM1725 3.2TB NVME SSD (PCIe 3.0 x8 AIC)
  • Dell HBA330
    • 4x Intel S3710 800GB SATA SSD
    • 12x Seagate 4TB Enterprise Capacity 3.5” SATA HDD
  • 2x Mellanox ConnectX-4 100Gb (Dual Port 100Gb PCIe 3.0 x16)
    • Mellanox FW v. 12.14.2036
    • Mellanox ConnectX-4 Driver v. 1.35.14894
    • Device PSID MT_2150110033
    • Single port connected / adapter


Figure 2: Storage throughput with Storage Spaces Direct (TP5)

The Microsoft team measured the storage performance and, in order to maximize the traffic, ran 20 VMs per server (a total of 80 VMs for the entire cluster). They achieved an astonishing performance of 60GB/s over the 4-node cluster, which perfectly demonstrates the higher efficiency that can be achieved when the three components of compute, storage, and networking are balanced, minimizing the potential bottlenecks that can occur in an unbalanced system.
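
A quick breakdown of those numbers (illustrative arithmetic using only the figures quoted above; note that in S2D some of this throughput is served from local drives rather than crossing the wire):

```python
# Per-node and per-VM breakdown of the 60 GB/s, 4-node, 80-VM benchmark above.
aggregate_gbs = 60          # GB/s across the whole cluster
nodes = 4
vms = 80

per_node_gbs = aggregate_gbs / nodes            # 15 GB/s per server
per_node_gbps = per_node_gbs * 8                # ~120 Gb/s per server
per_vm_mbs = aggregate_gbs * 1024 / vms         # ~768 MB/s per VM

print(f"{per_node_gbs:.0f} GB/s (~{per_node_gbps:.0f} Gb/s) per node, ~{per_vm_mbs:.0f} MB/s per VM")
# Roughly 120 Gb/s of storage throughput per node is more than a single 100GbE
# port, consistent with the two ConnectX-4 100Gb adapters listed for each server.
```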

Another example that shows the efficiency advantages of a higher-bandwidth network is a simple ROI analysis of a VDI deployment of 5,000 virtual desktops, which compares connectivity over 25GbE versus 10GbE (published in my previous blog, "10/40GbE Architecture Efficiency Maxed-Out? It's Time to Deploy 25/50/100GbE"). Looking at the hardware CAPEX savings alone, running over 25GbE cuts the VM costs in half, while adding the cost of the software and the OPEX improves the ROI even further.


Modern data centers must be capable of handling the flow of data and enabling (near) real-time analysis, which is driving the demand for higher-performance and more efficient networks. New deployments based on Dell PowerEdge servers, equipped with Mellanox ConnectX-4 Lx 10/25GbE adapters, allow clients an easy migration from today's 10GbE to 25GbE without demanding costly upgrades or incurring additional operating expenses.