All posts by Motti Beck

About Motti Beck

Motti Beck is Director of Marketing, EDC market segment at Mellanox Technologies, Inc. Before joining Mellanox, Motti was a founder of several start-up companies including BindKey Technologies, which was acquired by DuPont Photomask (today Toppan Printing Company LTD), and Butterfly Communications, which was acquired by Texas Instruments. Prior to that he was a Business Unit Director at National Semiconductor. Motti holds a B.Sc. in computer engineering from the Technion - Israel Institute of Technology. Follow Motti on Twitter: @MottiBeck

2017 Prediction – Networking will take Clouds to New Levels of Efficiency

At the dawn of the 21st century, in order to meet market demand, data center architects started to move away from the traditional scale-up architecture, which suffered from limited and expensive scalability, to a scale-out architecture. Looking back, it was the right direction to take, since the ongoing growth of data and mobile applications has required an efficient and easy-to-scale infrastructure, which the new architecture successfully enables.

However, as we roll into 2017 and beyond, IT managers who deploy data centers based on scale-out architecture should be aware of one difference. In the past, when scale-up architecture was used, maximizing efficiency only required verifying that the performance of the CPU, memory and storage was balanced. In the new scale-out architecture, the networking performance must be taken into consideration too.

There are a couple of parameters that IT managers need to consider in 2017 when choosing the right networking for their next-generation deployments. The first, of course, is the networking speed. Although 10GbE is the most popular speed today, the industry has already started to realize that 25GbE provides higher efficiency, and 25GbE will become the new 10GbE in 2017. The main reason behind this prediction is the much higher bandwidth that flash-based storage, such as NVMe, SAS and SATA SSDs, can provide: a single 25GbE port replaces three 10GbE ports. This by itself can cut the networking cost by a factor of three, enabling the use of a single switch port, a single NIC port and a single cable instead of three of each. As IT managers start looking at next year’s budget, they know this is where they should be allocating their networking dollars.
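The port-consolidation arithmetic can be sketched in a few lines of Python. This is only an illustration: the 25 Gb/s demand figure and the `ports_needed` helper are assumptions made for the example, not measurements from any deployment.

```python
import math


def ports_needed(required_gbps: float, port_speed_gbps: float) -> int:
    """Smallest whole number of ports whose combined line rate covers the demand."""
    return math.ceil(required_gbps / port_speed_gbps)


# Hypothetical server whose flash storage can drive about 25 Gb/s of traffic.
demand_gbps = 25.0

ports_10g = ports_needed(demand_gbps, 10.0)
ports_25g = ports_needed(demand_gbps, 25.0)

# Every port implies one NIC port, one switch port and one cable,
# so the port count is a rough proxy for the networking cost.
print(f"10GbE ports needed: {ports_10g}")
print(f"25GbE ports needed: {ports_25g}")
```

Under these assumptions, the 10GbE design needs three of everything where the 25GbE design needs one, which is where the factor of three comes from.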


Two other good examples for 2017 show where higher bandwidth enables higher efficiency. The first, which will happen more and more next year, is VDI deployment, where a 25GbE solution cuts the cost per virtual desktop in half. The second is new infrastructure that has to support modern media and entertainment requirements, where 10GbE can’t deliver the additional speed needed to support the number of streams required for today’s high-definition resolutions. As IT managers revisit those all-important 2017 budgets, ROI will become more and more important, as companies are increasingly unwilling to accept a performance-cost trade-off. Essentially, in 2017, IT managers and their companies want to have their networking cake and eat it too.

However, deploying higher-speed networking is just one way that IT managers can take their cloud efficiency to the next level. As they consider their options, they should also use networking products that offload specific networking functions from the CPU to the IO controller itself. With this approach, more CPU cycles are freed to run the applications, which accelerates job completion and enables using fewer CPUs or CPUs with fewer cores. Ultimately, in 2017, the overall license fees for the OS and/or the hypervisor will be lower. Both, of course, will increase the overall cloud efficiency.

There are several network functions that have already been offloaded to the IO controller. One of the most widely used is RDMA (Remote Direct Memory Access), which offloads the transport layer to the NIC instead of running the heavy, CPU-demanding TCP/IP protocol on the CPU. This is the main reason why IT managers should consider deploying RoCE (RDMA over Converged Ethernet) next year. Using RoCE makes data transfers more efficient and enables fast data movement between servers and storage without involving the server’s CPU. Throughput is increased, latency is reduced, and CPU power is freed up for running the real applications. RoCE is already widely used for efficient data transfer in render farms and in large cloud deployments such as Microsoft Azure. Moreover, it has proven superior efficiency vs. TCP/IP and thus will be utilized more than ever before in 2017.



Offloading overlay network technologies such as VXLAN, NVGRE and the new Geneve standard to the NIC, or VTEP to the switch, enables another significant cloud acceleration. It is another typical stateless networking function whose offloading significantly shortens job execution time. A good example is the comparison that Dell published, running typical enterprise applications over its PCS appliance with and without NVGRE offloading. It shows that offloading accelerated the applications by more than 2.5 times on the same system, which of course increased the overall system efficiency by the same amount.


There are several other offloads that networking components support, such as offloading security functions (for example, IPsec) or erasure coding, which is used in storage systems. In addition, a couple of IO solution providers have already announced that their next-generation products will include new offload functions, such as vSwitch offloading, which will accelerate virtualization, or NVMeoF offload. Mellanox has announced these for its upcoming ConnectX-5 NIC, which we believe will proliferate as a solution of choice in 2017.

Those new networking capabilities have already been added to the leading operating systems and hypervisors. At VMworld’16, VMware announced support in vSphere 6.5 for all networking speeds (10, 25, 40, 50 and 100 GbE) and for VM-to-VM communication over RoCE. Microsoft, at its recent Ignite’16 conference, announced support for up to 100 GbE and recommended running production deployments of Storage Spaces Direct over RoCE. Microsoft has also published superior SQL 2016 performance results when running over a network that supports the highest speeds and RoCE. Those capabilities have been included in Linux for a very long time too. So now, with a new year looming on the horizon, it’s up to IT architects to choose the right networking speeds and offloads that will take their cloud efficiency to the next level in 2017.

This post originally appeared on VMblog here.

HPE, Mellanox, Micron, and Microsoft Exceed One Terabit Per Second of Storage Throughput With Hyper-converged Solution

In the “old days” of tech, meaning roughly 3-6 years ago, there were some hard and fast rules about getting really high throughput from your storage:

  1. The storage was always separate from the compute servers and you had to buy dedicated, specialized storage systems
  2. The storage network was most likely Fibre Channel SAN for block storage (InfiniBand for a parallel file system and 10Gb Ethernet for scale-out NAS)
  3. You needed many disk drives — dozens or hundreds to meet the needed throughput
  4. It was really expensive — $200,000 to $1M just to reach 100Gb/s (~12GB/s) of sustained throughput.
  5. “High performance storage” and “Microsoft Windows” were never in the same rack, let alone the same paragraph — all fast storage ran Linux, Solaris, FreeBSD, or a specialized real-time operating system.


Figure 1: The Good Old Days may have been good for many reasons, but faster computer products were not one of them.


The Times They Are A’ Changing

But starting in 2013 I started to see people breaking these rules. Software-defined storage delivered good performance on commodity servers. Hyper-converged infrastructure let compute and storage run on the same machines. Flash delivered many times the performance of spinning disks. Faster interconnects like 40Gb Ethernet grew in popularity for large clouds, compute clusters, and scale-out storage, as five vendors, including Mellanox, announced the new 25 and 50Gb Ethernet standards. And then there was Microsoft…


Figure 2: The HPE DL380 Gen 9 looks like a server, but thanks to software-defined storage and hyper-converged infrastructure, it can be storage, or compute and storage simultaneously.

Revolution from Redmond

Microsoft was an early leader in several of these fields. Windows Server 2012 R2 had native support to run over both 40 Gb Ethernet (with RoCE—RDMA over Converged Ethernet) and FDR 56Gb InfiniBand at a time when most enterprise storage systems only supported 10GbE and 8Gb Fibre Channel. In 2013 Microsoft and their server partners demonstrated that SMB Direct on 40 Gb Ethernet or FDR InfiniBand could best Fibre Channel SANs in both performance and price, and reduce the number of application servers needed to support a given workload. Faster and more efficient networking saved customers money on both server hardware and software licenses.


Figure 3: A 2013 Microsoft study showed Windows Storage with RDMA and SAS hard drives had half the cost/capacity of Fibre Channel SAN, with the same performance.


At Tech Ed 2013, Microsoft demonstrated the power of RDMA with 40 Gb Ethernet by showing that the live migration of virtual machines, a frequent and important task in both public and private clouds, was up to ten times faster using RDMA than using TCP/IP.



Figure 4: RDMA Enables live VM migration 10x faster than using TCP/IP – presented at the TechED’13 Opening Keynote Session.

In 2014, at the Open Networking Summit, Microsoft presented how they ran storage traffic using RoCE on 40GbE in their own cloud to lower the cost of running their Azure Storage. Dell and Mellanox teamed up with Microsoft to demonstrate over one million read IOPS using just two storage nodes and two clients, connected with FDR 56Gb/s InfiniBand. At the time, reaching 1M IOPS normally required a fancy and expensive dedicated storage array but this demo achieved it with just two Windows servers.


Then in 2015, we saw demonstrations of Windows Storage Spaces at Microsoft Ignite 2015 using one Mellanox 100Gb Ethernet link and Micron’s NVMe flash cards to achieve over 90 Gb/s (~11.1 GB/s) of actual throughput with just one millisecond latency. This was the first time I’d seen any single server really use 100 Gb Ethernet, let alone a Windows Server. It also proved that using SMB Direct with RDMA was a major advantage, with approximately twice the throughput, half the latency, and half the CPU utilization of using the regular SMB 3 protocol over TCP/IP.


Figure 5: One Windows Server delivers over 90 Gb/s of throughput using a single 100GbE link with RoCE. Performance without RoCE was halved.

Hyper-Race to Hyper-converged Windows Storage


In 2016, the race began to demonstrate ever faster performance using Windows Storage Spaces Direct (S2D) in a hyper-converged setup with NVMe flash storage and 100Gb RoCE. First Mellanox, Dell and HGST (a Western Digital brand) built a two-server cluster with Dell R730XD machines, each with two HGST UltraStar SN150 NVMe SSDs and two Mellanox ConnectX-4 100GbE NICs. A Mellanox Spectrum switch connected the machines and the cluster delivered 178Gb/s (22.3 GB/s). Then, at Flash Memory Summit and Intel Developer Forum, Microsoft, Dell, Samsung and Mellanox showed 640 Gb/s (80GB/s) using four Dell R730XD servers, each with 4 Samsung NVMe SSDs and 2 Mellanox ConnectX-4 100GbE NICs, all connected by Mellanox’s Spectrum switch and LinkX® cables.


These demos were getting faster every few months and showing the power and flexibility of a hyper-converged deployment on Windows Server 2016.
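The throughput figures in these demos are quoted in both gigabits and gigabytes per second. A minimal sketch of the conversion, and of the average per-server share, is shown below. The helper names are made up for illustration; the inputs are the demo numbers quoted above.

```python
def gbps_to_gigabytes_per_s(gigabits_per_s: float) -> float:
    """Convert a rate in gigabits per second to gigabytes per second (8 bits per byte)."""
    return gigabits_per_s / 8


def per_node_gbps(cluster_gbps: float, nodes: int) -> float:
    """Average share of the cluster throughput carried by each server."""
    return cluster_gbps / nodes


# (label, cluster throughput in Gb/s, number of servers) as quoted in the text
demos = [
    ("two-server cluster", 178, 2),
    ("four-server cluster", 640, 4),
]

for label, gbps, nodes in demos:
    print(label, gbps_to_gigabytes_per_s(gbps), "GB/s total,",
          per_node_gbps(gbps, nodes), "Gb/s per server")
```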


Figure 6: Microsoft recommends RDMA when deploying Windows Storage Spaces Direct as a hyper-converged infrastructure.

Breaking the One Terabit Per Second Barrier


Now, at Microsoft Ignite 2016, comes the latest demonstration of Windows S2D with an amazing milestone. A 12-node cluster using HPE DL380 servers with Micron 9100 Max NVMe SSDs and Mellanox ConnectX-4 100GbE NICs delivers over 1.2 Terabits per second (>160 GB/s). Connectivity was through a Mellanox Spectrum 100 GbE switch and LinkX cables. More impressively, the cluster hosted 336 virtual machines and only 25 percent of the CPU power was consumed, leaving 75 percent of the CPU capacity for running additional applications. That is important, because the whole purpose of hyper-converged infrastructure is to run applications.


This HPE server is the latest version of their ever popular DL380 series, the world’s best-selling server (per IDC’s Server Tracker for Q1 2016). It supports up to 3TB of DDR4 memory, up to 6 NVMe SSDs, 25/40/50/100GbE adapters from Mellanox, and of course Microsoft Windows Server 2016, which was used for this demo.


Figure 7: HPE, Mellanox and Micron exceed 1.2 Terabit/second using twelve HPE DL380 Gen 9 servers, 48 Micron NVMe SSDs, with Mellanox 100GbE NICs, switch, and cables.

Amazing Performance, Price and Footprint

This is an amazing achievement considering that a few years ago this was easily considered supercomputer performance, and now it is available in any datacenter using off-the-shelf servers, SSDs, and Ethernet, with software-defined storage (Windows Server 2016), all at a reasonable off-the-shelf price. To give an idea of how impressive this is, look at what’s required to deliver this throughput using traditional enterprise or HPC storage. Suppose each HPE server, with 4 Micron NVMe SSDs and two Mellanox ConnectX-4 100Gb Ethernet NICs included, costs $15,000, plus $12,000 for the 100GbE switch and cables. That makes the cost of the entire 12-node cluster $192,000, and at 25U in height (2U per server, 1U for the switch) it consumes about half a rack of footprint. But if we use traditional storage arrays…


  • A well-known all-flash array can deliver up to 96Gb/s (12GB/s) using 8x16Gb FC ports or 4x56Gb FDR InfiniBand ports. Suppose each system costs $30,000 and consumes 6U; equaling the Windows cluster throughput would require 13 systems costing $390,000 and consuming two racks.
  • A proven Lustre-based appliance delivers 96 Gb/s (12GB/s) per node in an 8U system so 480Gb/s (60GB/s) per rack. Suppose it costs $150,000 per rack — matching the Windows cluster throughput would require three racks and $450,000.
  • A popular enterprise scale-out SAN supports 320 Gb/s (40GB/s) of front-end bandwidth per node (20x16Gb FC ports). If each controller can drive the full 320 Gb/s, you would need four nodes plus a load of flash or disk shelves to reach 1.2 Tb/s. That’s probably at least $400,000 and 1.5 racks.
  • A high-end, Tier-1 scale-out SAN supports 512 Gb/s (64 GB/s) of front-end bandwidth per engine so reaching 1.2Tb/s would require 3 engines (4 if they must be deployed in pairs); let’s assume that costs at least $600,000 and consumes at least two racks.

| 1.2 Tb/s Solution | Nodes | Footprint | Network Ports | Clients | Estimated Cost |
| --- | --- | --- | --- | --- | --- |
| Windows S2D | 12 | ½ rack | 24 | Included | $192,000 |
| AFA | 13 | 2 Racks | 52 IB / 104 FC | Separate | $390,000 |
| Lustre cluster | 14 | 3 Racks | 56 | Separate | $450,000 |
| Scale-out SAN | 4 | 1.5 Racks | | Separate | $400,000 |
| High-end SAN | 3-4 | 2 Racks | 80-96 | Separate | $600,000 |

Figure 8: The Windows S2D solution uses from one-quarter to one-third the footprint of traditional storage solutions and acquisition cost is estimated to be less than half of other solutions.
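The arithmetic behind the first two estimates above can be reproduced with a short Python sketch. All prices and sizes are the article's own "suppose" figures, not quotes, and the variable names are chosen only for this illustration.

```python
import math

TARGET_GBPS = 1200  # the 1.2 Tb/s throughput being matched

# Windows S2D cluster, using the assumed prices from the text
servers = 12
s2d_cost = servers * 15_000 + 12_000   # servers plus switch and cables
s2d_height_u = servers * 2 + 1         # 2U per server, 1U for the switch

# All-flash array, assuming 96 Gb/s, $30,000 and 6U per system
afa_systems = math.ceil(TARGET_GBPS / 96)
afa_cost = afa_systems * 30_000
afa_height_u = afa_systems * 6

print(f"S2D: ${s2d_cost:,} in {s2d_height_u}U")
print(f"AFA: {afa_systems} systems, ${afa_cost:,} in {afa_height_u}U")
print(f"AFA cost relative to S2D: {afa_cost / s2d_cost:.2f}x")
```

Changing the assumed per-unit prices changes the totals directly, which is why the comparison should be read as an estimate.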

To be clear, these are all fine and proven enterprise or HPC storage solutions which offer many rich features, not all of which are necessarily available in Windows S2D. They probably have much more storage capacity than the 115 TB (raw) in the 1.2 Tb/s Windows S2D cluster. (Also, these are approximate price estimates, and actual prices for each system above could be significantly higher or lower.) But these are also prices for storage only, not including any compute power to run virtual machines and applications, whereas the Windows S2D solution includes the compute nodes and plenty of free CPU power. That only re-emphasizes my point that the Windows solution delivers much more throughput at a much lower price, and that it consumes much less space and power and far fewer network ports to achieve it.


Figure 9: This amazing Windows S2D performance didn’t come from completely generic components, but it did use general-purpose, off-the-shelf servers, storage, software, and networking from HPE, Micron, Microsoft and Mellanox to achieve performance far from generic.

The key lesson here is that software-defined storage and hyper-converged infrastructure, combined with NVMe flash and 100Gb RDMA networking, can now equal or surpass traditional enterprise arrays in performance at a far lower price. What used to be the sole realm of racks of dedicated arrays, expensive Fibre Channel networks, and row-upon-row of drives can now be accomplished with half a rack of general purpose servers with a few SSDs and 2 Mellanox NICs in each server.


It’s a software-defined storage revolution riding on the back of faster flash and faster networks — a revolution in more ways than one.

Read the solution brief here. 


Mellanox RoCE’d Las Vegas at VMWorld 2016

Mellanox accelerated the speed of data in the virtualized data center from 10G to new heights of 25G at VMWorld 2016, which was held in Las Vegas from Aug. 29 to Sept. 1 at the Mandalay Bay.


The big news at the show revolved around Mellanox’s announcement regarding the integration of its software driver support for ConnectX®-4 Ethernet and RoCE (RDMA over Converged Ethernet) in VMware vSphere®, the industry’s leading virtualization platform. For the first time, virtualized enterprise applications are able to realize the same industry-leading performance and efficiency as non-virtualized environments. The new vSphere 6.5 software for ConnectX-4 delivers three critical new capabilities: increased Ethernet network speeds at 25/50 and 100 Gb/s, virtualized application communication over RoCE, and advanced network virtualization and SDN (Software Defined Networking) acceleration support. Now applications that run in the Cloud over a vSphere-based virtualized infrastructure, can communicate over 10/25/40/50 and 100 Gb/s Ethernet and leverage RoCE Networking to maximize Cloud infrastructure efficiency.

Mellanox also hosted technology demos and ongoing presentations showcasing the benefits of using our Spectrum 10/25/40/50/100GbE switch, which doesn’t lose packets, and we demonstrated why vSphere runs best over it. We also had a number of key Mellanox partners participate in the Mellanox booth presentations, showing how we enabled the superior capabilities of our network solutions, including end-to-end support for 25GbE, which is much more efficient than 10GbE, a speed that is no longer fast enough to meet today’s performance and efficiency needs. This is why, when it’s time to choose your next networking technology provider for your cloud deployment, or for your hyper-converged systems, you will choose Mellanox.

Mellanox also had a number of key presentations at the show on the topics of:

  • Achieving New Levels of Cloud Efficiency over vSphere based Hyper-Converged Infrastructure
  • iSCSI/iSER: HW SAN Performance Over the Converged Data Center
  • Latencies and Extreme Bandwidth Without Compromise

Some highlight endorsements from our partners included:

“We are happy to collaborate with Mellanox to enable running VMware vSphere-based deployments with high performance networking,” said Mike Adams, senior director, Product Marketing, VMware. “With VMware vSphere and Mellanox technologies, mutual customers can accelerate mission critical applications, while achieving high performance and reduced costs.”

“High performance, reliable fabrics are fundamental to application performance and efficiency,” said JR Rivers, Co-Founder and CTO, Cumulus Networks. “We are pleased to collaborate with Mellanox and deliver the Cumulus Linux and Mellanox Spectrum joint solution to enable RDMA fabrics in vSphere.”

Overall, at the show, Mellanox demonstrated its Ethernet and RoCE end-to-end solutions, and showed how easy it is to increase the efficiency of the Cloud just by deploying tomorrow’s networking solutions today. Those solutions are based on the company’s recent network products that improve productivity, scalability and flexibility, all of which enable the industry to define a data center that meets current needs and is future-proofed at the same time. The solutions support not just all data speeds but also include efficient offload engines that accelerate data center applications with minimal CPU overhead.


Things Are About to Get RoCE with Mellanox in Las Vegas

Mellanox will accelerate the speed of data in the virtualized data center from 10G to new heights of 25G at VMWorld 2016, which convenes in Las Vegas from Aug. 29 to Sept. 1 at Mandalay Bay.


With an announcement coming for Mellanox’s ConnectX®-4 Ethernet and RoCE (RDMA over Converged Ethernet), things are about to get rocky in the best possible way. For starters, Mellanox will be showing how easy it is to increase the efficiency of the Cloud just by deploying tomorrow’s networking solutions today. Mellanox delivers network products that improve productivity, scalability and flexibility, all of which enable the industry to define a data center that meets your needs now and is future-proof at the same time. From the leading provider of higher-performance Ethernet NICs, with an 85 percent market share above 10GbE, Mellanox’s ConnectX-4 and ConnectX-4 Lx NICs support not just all data speeds but also include efficient offload engines that accelerate data center applications with minimal CPU overhead. Stay tuned at the show for exciting news about Mellanox’s 10/25/40/50/100Gb/s Ethernet and RoCE end-to-end solution.


Mellanox will also be hosting technology demos and ongoing presentations at booth #2223, where show attendees can learn about the benefits of using our Spectrum 10/25/40/50/100GbE switch, which doesn’t lose packets, and why vSphere runs best over it. Show goers will also have the chance to win prizes, all around the theme of driving the industry to 25G. In addition, a number of key Mellanox partners will be participating in the Mellanox booth presentations as we enable the superior capabilities of our network solutions, including end-to-end support for 25GbE, which is much more efficient than 10GbE, a speed that is no longer fast enough to meet today’s performance and efficiency needs. This is why, when it’s time to choose your next networking technology provider for your cloud deployment, or for your hyper-converged systems, you will choose Mellanox. But don’t just believe me, come visit our booth and see for yourself!

Lastly, don’t miss one of our paper presentations that were selected by the VMworld committee this year:

  • Achieving New Levels of Cloud Efficiency over vSphere based Hyper-Converged Infrastructure [HBC9453-SPO]
    • Monday Aug 29 from 5:30 p.m. to 6:30 p.m.
  • iSCSI/iSER: HW SAN Performance Over the Converged Data Center [INF8469]
    • Wednesday, Aug 31 from 1 p.m. to 2 p.m.
  • Latencies and Extreme Bandwidth Without Compromise [CTO8519]
    • Thursday, Sept 1 from 12 p.m. to 1 p.m.

Hope to see you there.




Achieving New Levels of Application Efficiency with Dell’s PowerEdge Connected over 25GbE

After many years of extensive development of data center virtualization technologies, which started with server virtualization and continued with networking virtualization and storage virtualization, the time has arrived to work on maximizing the efficiency of the data centers that have been deployed over those advanced solutions. The rationale for doing this is pretty clear. New data centers are based on the Hyper-Converged architecture, which eliminates the need for dedicated storage systems (such as SAN or NAS) and the need for dedicated servers for just storage. Modern servers that are used in such Hyper-Converged deployments usually contain multiple CPUs and large storage capacity. Modern CPUs have double-digit core counts that enable the servers to support tens, and in some cases hundreds, of Virtual Machines (VMs). From the storage point of view, such servers have a higher number of PCIe slots, which enables NVMe storage to be used, as well as the ability to host 24 or 48 SAS/SATA SSDs, both of which result in extremely high storage capacity.


Figure 1: Microsoft’s Windows Server 2016 Hyper-Converged Architecture, in which the same server is used for Compute and Storage.

Now that there are high-performance servers, each capable of tens of VMs and millions of IOPS, IT managers must take a careful look at the networking capabilities and avoid I/O-bound situations. The network must now support all traffic classes: the compute communication, the storage communication, the control traffic, and so on. As such, not having high enough networking bandwidth will result in unbalanced systems (see: How Scale-Out Systems Affect Amdahl’s Law) and will therefore reduce the overall deployment efficiency. That is why Dell has equipped their PowerEdge 13th generation servers with Mellanox’s ConnectX®-4 Lx 10/25Gb/s Ethernet adapters, delivering significant application efficiency advantages and cost savings for private and hybrid clouds running demanding big data, Web 2.0, analytics, and storage workloads.

In addition to data communication over 25GbE, Dell’s PowerEdge servers, equipped with ConnectX-4 Lx-based 10/25GbE adapters, are capable of accelerating latency-sensitive data center applications over RoCE (RDMA over Converged Ethernet), which enables similar performance in a virtualized infrastructure as in a non-virtualized infrastructure. This, of course, further maximizes system efficiency.

A good example that demonstrates the efficiency that higher bandwidth and lower latency networks enable is Microsoft’s recent blog, which published the performance results of a benchmark that they ran on a 4-node Dell PowerEdge R730XD cluster connected over 100Gb Ethernet. Each node was equipped with the following hardware:

  • 2x Xeon E5-2660v3 2.6GHz (10c20t)
  • 256GB DRAM (16x 16GB DDR4 2133 MHz DIMM)
  • 4x Samsung PM1725 3.2TB NVMe SSD (PCIe 3.0 x8 AIC)
  • Dell HBA330
    • 4x Intel S3710 800GB SATA SSD
    • 12x Seagate 4TB Enterprise Capacity 3.5” SATA HDD
  • 2x Mellanox ConnectX-4 100Gb (Dual Port 100Gb PCIe 3.0 x16)
    • Mellanox FW v. 12.14.2036
    • Mellanox ConnectX-4 Driver v. 1.35.14894
    • Device PSID MT_2150110033
    • Single port connected / adapter


Figure 2: Storage throughput with Storage Spaces Direct (TP5)

The Microsoft team measured the storage performance, and, in order to maximize the traffic, they ran 20 VMs per server (total of 80 VMs for the entire cluster). They achieved astonishing performance of 60GB/s over a 4-node cluster, which perfectly demonstrates the higher efficiency that can be achieved when the three components of compute, storage, and networking are balanced, minimizing potential bottlenecks that can occur in an unbalanced system.

Another example that shows the efficiency advantages of a higher bandwidth network is a simple ROI analysis of VDI deployment of 5000 Virtual Desktops, which compares connectivity over 25GbE versus 10GbE (published in my previous blog: “10/40GbE Architecture Efficiency Maxed-Out? It’s Time to deploy 25/50/100GbE”). When looking at only the hardware CAPEX savings, running over 25GbE cuts the VM costs in half, while adding the cost of the software and the OPEX even further improves the ROI.


Modern data centers must be capable of handling the flow of data and of enabling (near) real-time analysis, which is driving the demand for higher performance and more efficient networks. New deployments that are based on Dell PowerEdge servers, equipped with Mellanox ConnectX-4 Lx 10/25GbE adapters, allow clients an easy migration from today’s 10GbE to 25GbE without demanding costly upgrades or incurring additional operating expenses.



Accelerating High-Frequency Trading with HPE Apollo 2000 and 25Gb/s Ethernet


The financial services industry (FSI) is facing various challenges these days, including the ongoing data explosion, new regulatory demands, more messages per trade, and increased competition. In a business where profits are directly determined by communications speed and latency, building a high-performance infrastructure that is capable of analyzing a high volume of data is critical. In particular, for high frequency trading applications saving a few microseconds in latency can be worth millions of dollars. Furthermore, in order to maintain a competitive advantage, financial firms must constantly upgrade infrastructure and accelerate data analytics. Given these factors the Trading and Market Data Applications market is one of the most demanding in terms of data center networking requirements, and requires IT managers to incorporate the most advanced networking technologies, supporting ultra-low latency and the highest possible throughput, while maintaining the lowest possible total cost of ownership (TCO).


Figure 1: HPE dual-port 25GbE adapter in both mezzanine (640SFP28) and PCIe card

This week at the HPE Discover 2016 conference, Mellanox announced the availability of new 25/100Gb/s Ethernet solutions for ProLiant and Apollo servers that will reach new levels of networking performance at lower TCO. The announcement includes two dual-port 10/25GbE network interface controllers (NICs): the HPE 10/25Gb/s 2-port 640SFP28 Ethernet Adapter and the HPE 10/25Gb/s 2-port 640FLR-SFP28 Ethernet Adapter. Both are based on the Mellanox ConnectX®-4 Lx 10/25GbE controller.

One of the simplest and most effective ways to take advantage of the higher speed is with VMA Messaging Acceleration Software. VMA is an open source, dynamically-linked user-space Linux library for accelerating messaging traffic, and is proven to boost performance of high frequency trading applications. Applications that utilize standard BSD sockets use the library to offload network processing from a server’s CPU. The traffic is passed directly to the NIC from the application user space, bypassing the kernel and IP stack and thereby minimizing context switches, buffer copies, and interrupts. This results in extremely low latency. VMA software runs on both of the new HPE Ethernet 10/25 Gb/s adapters and requires no changes to the applications.
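To make "standard BSD sockets" and "no changes to the applications" concrete, here is a minimal sketch of the kind of plain UDP sockets code such a library is designed to sit underneath. The function name and the message payload are made up for illustration. The sketch assumes that a dynamically linked library like VMA is loaded when the process is launched (commonly through the dynamic linker's preload mechanism), so nothing in the source refers to VMA. Run as-is on loopback it only exercises the regular kernel path, and it makes no claim about which traffic VMA would actually accelerate.

```python
import socket


def udp_round_trip(payload: bytes, host: str = "127.0.0.1") -> bytes:
    """Send one datagram to a local echo socket and return the echoed reply."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as server, \
         socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as client:
        server.bind((host, 0))      # port 0: let the OS pick a free port
        server.settimeout(2.0)
        client.settimeout(2.0)

        client.sendto(payload, server.getsockname())
        data, peer = server.recvfrom(2048)
        server.sendto(data, peer)   # echo back to the sender

        reply, _ = client.recvfrom(2048)
        return reply


if __name__ == "__main__":
    # Hypothetical market-data style message, used only as a payload.
    print(udp_round_trip(b"quote:XYZ 101.25"))
```

The point of the design is that the application keeps calling the ordinary socket API; any acceleration happens below that interface.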


Figure 2: VMA block diagram

Running trading and market data applications over 25GbE and VMA enables the lowest application latency, highest application throughput, and improved scalability compared to other solutions, making Mellanox Ethernet the best interconnect solution for high frequency trading. At the conference, Mellanox and HPE demonstrated the Trade and Match Server solution, which is based on the Apollo 2000 platform and has been designed to minimize system latency and optimized for higher performance, specifically for high-frequency trading operations. HPE has published benchmark results for the Trade and Match Server, connected by ConnectX-4 Lx 25GbE, that demonstrate the competitive advantages that Mellanox’s high-performance interconnect solutions enable.


Figure 3: HPE’s Trade and Match Server, with industry-leading TCP and UDP latencies when connected by ConnectX-4 Lx 25GbE

In another example, HPE compares UDP latency under various traffic load scenarios, thereby simulating the consumption of high volume market data feeds like OPRA, where systems are required to maintain low and consistent latency under high volumes of traffic from the feed. Here too, the solution is able to sustain very low latency even under conditions of high message rate.


Figure 4: VMA UDP latency under high message rates (sockperf)

In addition to its higher bandwidth and lower latency, the ConnectX-4 Lx also enables IT managers to leverage Remote Direct Memory Access (RDMA) offload engines by running latency-sensitive trading and market data applications over RoCE (RDMA over Converged Ethernet). RDMA enables the network adapter to transfer data directly from application to application without involving the operating system, thereby eliminating intermediate buffer copies. As such, running over RoCE minimizes the latency and maximizes the messages per second that the infrastructure is capable of providing, both of which are essential for businesses to maintain their competitive advantage in data analysis.

The financial services industry is one of the most demanding in terms of IT networking requirements. Much more data needs to be analyzed in real time, and every microsecond can translate into millions of dollars of profits or losses. It is therefore crucial to improve system performance with low-latency, high-bandwidth connectivity such as Mellanox 25Gb/s Ethernet in order to maintain a sustainable advantage over the competition.

10/40GbE Architecture Efficiency Maxed-Out? It’s Time to Deploy 25/50/100GbE

In 2014, after the IEEE rejected the idea of standardizing 25GbE and 50GbE over one lane and two lanes respectively, a group of technology leaders (including Mellanox, Google, Microsoft, Broadcom, and Arista) formed the 25Gb Ethernet Consortium in order to create an industry standard for defining interoperable solutions. The Consortium has been so successful in its mission that many of the larger companies that had opposed standardizing 25GbE in the IEEE have joined the 25GbE Consortium and are now top-level promoters. Since then, the IEEE has changed its original position and has now standardized 25/50GbE.

However, now that 25/50GbE is an industry standard, it is interesting to look back and analyze whether the decision to form the Consortium was the right one.


There are many ways to handle such an analysis, but the best way is to compare the efficiency that modern ultra-fast and ultra-scalable data centers experience when running over a 10/40GbE architecture versus a 25/50/100GbE architecture. Here, too, there are many parameters that can be analyzed, but the most important is the architecture’s ability to achieve (near) real-time data processing (serving the ever-growing “mobile world”) at the lowest possible TCO per virtual machine (VM).

Of course, processing the data in (near) real-time requires higher performance, but it also needs cost-efficient storage systems, which implies that scale-out software defined storage with flash-based disks must be deployed. Doing so will enable Ethernet-based networking and eliminate the need for an additional separate network (like Fibre Channel) that is dedicated to storage, thereby reducing the overall deployment cost and maintenance.

To further reduce cost, and yet to still support the faster speeds that flash-based storage can provide, it is more efficient to use only one 25GbE NIC instead of using three 10GbE NICs. Running over 25GbE also reduces the number of switch ports and the number of cables by a factor of three. So, access to storage is accelerated at a lower system cost.  A good example of this is the NexentaEdge high performance scale-out block and object storage that has been deployed by Cambridge University for their OpenStack-based cloud.
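The three-to-one consolidation above is simple ceiling arithmetic, which the following sketch spells out. The function name `ports_needed` is hypothetical, and the calculation uses nominal line rates only; it ignores protocol overhead, link aggregation efficiency, and redundancy requirements.

```python
import math


def ports_needed(required_gbps, port_speed_gbps):
    """Number of ports of a given nominal speed needed to carry a load."""
    return math.ceil(required_gbps / port_speed_gbps)


# A storage node that must sustain 25 Gb/s of traffic:
print(ports_needed(25, 10))  # 10GbE ports (each also needs a switch port and a cable)
print(ports_needed(25, 25))  # 25GbE ports
```

Because every NIC port implies one switch port and one cable, the same ratio carries over to both.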


Building a bottleneck-free storage system is critical for achieving the highest possible efficiency of various workloads in a virtualized data center. (For example, VDI performance issues begin in the storage infrastructure.) However, it is no less important to find ways to reduce the cost per VM, which is best accomplished by maximizing the number of VMs that can run on a single server. With the growing number of cores per CPU, as well as the growing number of CPUs per server, hundreds of VMs can run on a single server, cutting the cost per VM. However, a faster network is essential to avoid becoming I/O bound. For example, a simple ROI analysis of a VDI deployment of 5000 virtual desktops that compares just the hardware CAPEX savings shows that running over 25GbE cuts the VM cost in half. Adding the cost of the software and the OPEX further improves the ROI.
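The effect of being I/O bound on cost per VM can be sketched with a toy model: the number of VMs a server can host is the smaller of what the CPUs allow and what the network allows. Every number below (server cost, CPU-limited VM count, per-VM bandwidth) is invented purely for illustration and is not taken from the ROI analysis mentioned above; the model also ignores the price difference between adapters, switches, and cables.

```python
def vms_per_server(cpu_vm_limit, nic_mbps, per_vm_mbps):
    """VMs a server can host: the tighter of the CPU and network limits."""
    network_vm_limit = nic_mbps // per_vm_mbps
    return min(cpu_vm_limit, network_vm_limit)


def cost_per_vm(server_cost, cpu_vm_limit, nic_mbps, per_vm_mbps):
    """Hardware cost of one server divided across the VMs it can host."""
    return server_cost / vms_per_server(cpu_vm_limit, nic_mbps, per_vm_mbps)


# Hypothetical inputs: a 20,000 (currency units) server whose CPUs could
# host 200 VMs, each VM needing 100 Mb/s of network bandwidth.
for name, nic_mbps in (("10GbE", 10_000), ("25GbE", 25_000)):
    vms = vms_per_server(200, nic_mbps, 100)
    cost = cost_per_vm(20_000, 200, nic_mbps, 100)
    print(f"{name}: {vms} VMs per server, {cost:.0f} per VM")
```

With these made-up inputs the 10GbE server is network-limited while the 25GbE server is CPU-limited, which is the situation the paragraph describes; real deployments would need measured per-VM traffic and actual pricing.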


The growth in computing power per server and the move to faster flash-based storage systems demand higher-performance networking. The old 10/40GbE-based architecture simply cannot hit the right density/price point, and the new 25/50/100GbE speeds are therefore the right choice to close the ROI gap.

As such, the move by Mellanox, Google, Microsoft, and others to form the 25Gb Ethernet Consortium in order to push ahead with 25/50GbE as a standard, despite the IEEE’s initial short-sighted rejection, now seems like an enlightened decision, not only because of the IEEE’s ultimate change of heart, but even more so because of the performance and efficiency gains that 25/50GbE brings to data centers.

Content at the Speed of Your Imagination

In the past, one port of 10GbE was enough to support the bandwidth needs of 4K DPX, three ports could drive 8K formats, and four ports could drive 4K-Full EXR. However, the recent evolution in the media and entertainment industry presented this week at the NAB Show showcases the need for higher resolution. This trend continues to drive the need for networking technologies that can stream more bits per second in real time. Moreover, this number of ports can drive only one stream of data. New films and video productions today include special effects that require multiple streams to be supported simultaneously in real time. This creates a major “data size” challenge for studios and post-production shops, as 10GbE interconnects have been maxed out and can no longer provide an efficient solution for the ever-growing workload demands.

This is why IT managers should consider using the newly emerging Ethernet speeds of 25, 50, and 100GbE. These speeds have been established as the new industry standard, driven by a consortium of companies that includes Google, Microsoft, Mellanox, Arista, and Broadcom, and recently adopted by the IEEE as well. A good example of the efficiency that higher speed enables is the Mellanox ConnectX-4 100GbE NIC that has been deployed in Netflix’s new data center. This solution now provides the highest-quality viewing experience for as many as 100K concurrent streams out of a single server. (Mellanox also published a CDN reference architecture based on our end-to-end 25/50/100GbE solutions, including the Mellanox Spectrum™ switch, the ConnectX®-4 and ConnectX-4 Lx NICs, and LinkX™ copper and optical cables.)



Bandwidth required for uncompressed 4K/8K video streams
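The port counts quoted earlier for 4K DPX and 8K can be approximated from first principles: an uncompressed stream needs width × height × bits per pixel × frames per second. The sketch below assumes 4096×2160 and 8192×4320 frames, 10-bit RGB packed into 32 bits per pixel (a common DPX layout), and 24 frames per second. These parameters are assumptions chosen for illustration, not figures from the chart, and the result is raw payload only with no file or network protocol overhead. The function names are hypothetical.

```python
import math


def stream_gbps(width, height, bits_per_pixel, fps):
    """Raw bandwidth of one uncompressed video stream in Gb/s."""
    return width * height * bits_per_pixel * fps / 1e9


def ports_for_stream(gbps, port_speed_gbps):
    """Ports of a given nominal speed needed to carry one stream."""
    return math.ceil(gbps / port_speed_gbps)


# Assumed formats: 10-bit RGB packed into 32 bits per pixel, 24 fps.
formats = {
    "4K DPX": (4096, 2160, 32, 24),
    "8K DPX": (8192, 4320, 32, 24),
}

for name, params in formats.items():
    gbps = stream_gbps(*params)
    print(f"{name}: {gbps:.2f} Gb/s, "
          f"{ports_for_stream(gbps, 10)} x 10GbE or "
          f"{ports_for_stream(gbps, 25)} x 25GbE per stream")
```

Under these assumptions a single 4K stream fits in one 10GbE port and a single 8K stream needs three, matching the figures in the text; multiplying by the number of simultaneous streams shows how quickly 10GbE runs out.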

Another important parameter that IT managers must take into account when building media and entertainment data centers is the latency of streaming the data. Running multiple streams over the heavy and CPU-hungry TCP/IP protocol will result in lower CPU efficiency (as a significant percentage of the CPU cycles will be spent running the data communication protocol rather than the workload itself), which will reduce the effective bandwidth that the real workload can use.

This is why IT managers should consider deploying RoCE (RDMA over Converged Ethernet). Remote Direct Memory Access (RDMA) makes data transfers more efficient and enables fast data movement between servers and storage without involving the server’s CPU. Throughput is increased, latency reduced, and CPU power freed up for video editing, compositing, and rendering work. RDMA technology is already widely used for efficient data transfer in render farms and in large cloud deployments such as Microsoft Azure, and can accelerate video editing, encoding/transcoding, and playback.



RoCE utilizes advances in Ethernet to enable more efficient implementations of RDMA over Ethernet. It enables widespread deployment of RDMA technologies in mainstream data center applications. RoCE-based networks are managed in the same way as any other Ethernet network, eliminating the need for IT managers to learn new technologies. Using RoCE can result in 2X higher efficiency, since it doubles the number of streams compared to running over TCP (source: ATTO Technology).


The impact of RoCE for 40Gb/s vs. TCP on the number of supported video streams


Designing data centers that can serve the needs of the media and entertainment industry has traditionally been a complicated task that has often led to slow streams and bottlenecks in the pure storage performance, and in many cases has required the use of very expensive systems that resulted in lower-than-expected efficiency gains. Using high performance networking that supports higher bandwidth and low latency guarantees a hassle-free operation and enables extreme scalability and higher ROI for any industry-standard resolution and any content imaginable.

QCT’s Cloud Solution Center – Innovative Hyper Converged Solution at Work

On Tuesday, October 6, QCT opened its Cloud Solution Center, located within QCT’s new U.S. corporate headquarters in San Jose. The new facility is designed to test and demonstrate modern cloud datacenter solutions that have been jointly developed by QCT and its technology partners. Among the demonstrated solutions was an innovative VDI deployment, jointly developed by QCT and Mellanox, that is based on a virtualized hyper-converged infrastructure with scale-out software-defined storage and connected over 40GbE.


VDI enables companies to centralize all of their desktop services over a virtualized data center. With VDI, users are not tied to a specific PC and can access their desktop and run applications from anywhere. VDI also helps IT administrators by creating more efficient and secure environments, which enables them to better serve their customers’ business needs.


VDI efficiency is measured by the number of virtual desktops that a specific infrastructure can support, or, in other words, by the cost per user. The major limiting factor is the access time to storage. Replacing the traditional Storage Area Network (SAN) architecture with a modern scale-out software-defined storage architecture and a fast 40GbE interconnect eliminates potential bottlenecks, enabling the lowest total cost of ownership (TCO) and highest efficiency.


Ethernet That Delivers: VMworld 2015

Just one more week to go before VMworld 2015 begins at Moscone Center in San Francisco. VMworld is the go-to event where business and technical decision makers converge.  In recent years, this week-long conference has become the major virtualization technologies event, and this year is expected to be the biggest ever.


We are thrilled to co-present a breakout session in the Technology Deep Dives and Futures track: Delivering Maximum Performance for Scale-Out Applications with ESX 6 [Tuesday, September 1, 2015: 11AM-Noon]


Session CTO6454:
Presented by Josh Simons, Office of the CTO, HPC – VMware and Liran Liss, Senior Principal Architect, Mellanox.

An increasing number of important scale-out workloads – Telco Network Function Virtualization (NFV), in-memory distributed databases, parallel file systems, Microsoft Server Message Block (SMB) Direct, and High Performance Computing – benefit significantly from network interfaces that provide ultra-low latency, high bandwidth, and high packet rates. Prior to ESXi 6.0, Single-Root-IO-Virtualization (SR-IOV) and Fixed Passthrough (FPT), which allow placing hardware network interfaces directly under VM control, introduced significant latency and CPU overheads relative to bare-metal configurations. ESXi 6.0 introduces support for Write Combining, which eliminates these overheads, resulting in near-native performance on this important class of workloads. The benefits of these improvements will be demonstrated using several prominent workloads, including a High Performance Computing (HPC) application, a Data-Plane-Development-Kit (DPDK) based NFV appliance, and the Windows SMB Direct storage protocol. Detailed information will be provided to show attendees how to configure systems to achieve these results.
