All posts by John F. Kim

About John F. Kim

John Kim is Director of Storage Marketing at Mellanox Technologies, where he helps storage customers and vendors benefit from high performance interconnects and RDMA (Remote Direct Memory Access). After starting his high tech career in an IT helpdesk, John worked in enterprise software and networked storage, with many years of solution marketing, product management, and alliances at enterprise software companies, followed by 12 years working at NetApp and EMC. Follow him on Twitter: @Tier1Storage

Excelero Unites NVMe Over Fabrics With Hyper-Converged Infrastructure

Two Hot IT Topics Standing Alone, Until Now…

Two of the hottest topics and IT trends right now are hyper-converged infrastructure (HCI) and NVMe Over Fabrics (NVMe-oF).  The hotness of HCI is evident in the IPO of Nutanix in September and HPE’s acquisition of Simplivity in January 2017. The interest in NVMe-oF has been astounding with all the major storage vendors working on it and all the major SSD vendors promoting it as well.

But the two trends have been completely separate—you could do one, the other, or both, but not together in the same architecture. HCI solutions could use NVMe SSDs but not NVMe-oF, while NVMe-oF solutions were being deployed either as separate, standalone flash arrays or NVMe flash shelves behind a storage controller. There was no easy way to create a hyper-converged solution using NVMe-oF.


Excelero NVMesh Combines NVMe-oF with HCI

Now a new solution launched by Excelero combines the low latency and high throughput of NVMe-oF with the scale-out and software-defined power of HCI. Excelero does this with a technology called NVMesh that takes commodity server, flash, and networking technology and connects it in a hyper-converged configuration using an enhanced version of the NVMe-oF protocol. With this solution, each node can act both as an application server and as a storage target, making its local flash storage accessible to all the other nodes in the cluster. It also supports a disaggregated flash model so customers have a choice between scale-out converged infrastructure and a traditional centralized storage array.

Figure 1: Excelero NVMesh combines NVMe-oF with HCI, much like combining peanut butter and chocolate into one tasty treat).



Remote Flash Access Without the Usual CPU Penalties

NVMesh creates a virtualized pool of block storage using the NVMe SSDs on each server and leverages a technology called Remote Direct Drive Access (RDDA) to let each node access flash storage remotely.   RDDA itself builds on top of industry-standard Remote Direct Memory Access (RDMA) networking to maintain the low latency of NVMe SSDs even when accessed over the network fabric.  The virtualized pools allow several NVMe SSDs to be accessed as one logical volume by either local or remote applications.

In a traditional hyper-converged model, the storage sharing consumes some part of the local CPU cycles, meaning they are not available for the application. The faster the storage and the network, the more CPU is required to share the storage. RDDA avoids this by allowing the NVMesh clients to directly access the remote storage without interrupting the target node’s CPU. This means high performance—whether throughput or IOPS—is supported across the cluster without eating up all the CPU cycles.


Recent testing showed a 4-server NVMesh cluster with 8 SSDs per server could support several million 4KB IOPS or over 6.5GB/s (>50Gb/s)—very impressive results for a cluster that size.

Figure 2: NVMesh leverages RDDA and RDMA to allow fast storage sharing with minimal latency and without consuming CPU cycles on the target. The control path passes through the management module and CPUs but the data path does not, eliminating potential performance bottlenecks.


Integrates with Docker and OpenStack

Another feature NVMesh has over the standard NVMe-oF 1.0 protocol is that it supports integration with Docker and OpenStack. NVMesh includes plugins for both Docker Persistent Volumes and Cinder, which makes it easy to support and manage container and OpenStack block storage. In a world where large clouds increasingly use either OpenStack or Docker, this is a critical feature.

Figure 3: Excelero’s NVMesh includes plug-ins for both Docker and OpenStack Cinder, making it easy to use it for both container and cloud block storage.



Another Step Forward in the NVMe-oF Revolution

The launch of Excelero’s NVMesh is an important step forward in the ongoing revolution of NVMe over Fabrics. The open source solution supports high performance but only with a centralized storage solution and without many important storage features. The NVMe-oF array solutions offer a proven appliance solution but some customers want a software-defined storage option built on their favorite server hardware.  Excelero offers them all of these features together: hyper-converged infrastructure, NVMe over Fabrics technology, and software-defined storage.


Supporting Resources:

Storage Predictions for 2017

Looking at what’s to come for storage in 2017, I find three simple and easy predictions which lead to three more complex predictions.  Let’s start with the easy ones:

  • Flash keeps taking over
  • NVMe over Fabrics remains the hottest storage technology
  • Cloud continues to eat the world of IT


Flash keeps taking over

Every year, for the past four years, has been “The Year Flash Takes Over” and every year flash owns a growing minority of storage capacity and spend, but it’s still in the minority. 2017 is not the year flash surpasses disk in spending or capacity — there’s simply not enough NAND fab capacity yet, but it is the year all-flash arrays go mainstream. SSDs are now growing in capacity faster than HDDs (15TB SSD recently announced) and every storage vendor offers an all-flash flavor. New forms of 3D NAND are lowering price/TB on one side to compete with high capacity disks while persistent memory technologies like 3D-XPoint (while not actually buillt on NAND flash) are increasing SSD performance even further above that of disk. HDDs will still dominate low price, high-capacity storage for some years, but are rapidly becoming a niche technology.


Figure 1: TrendFocus 2015 chart shows worldwide hard drive shipments have fallen since 2010. Flash is one major reason, cloud is another.


According to IDC (Worldwide Quarterly Enterprise Storage Systems Tracker, September 2016) in Q2 2016 the all-flash array (AFA) market grew 94.5% YoY while the overall enterprise storage market grew 0%, giving AFAs 19.4% of the external (outside the server) enterprise storage systems market. This share will continue to rise.


Figure 2: Wikibon 2015 forecast predicts 4-year TCO of flash storage dropped below that of hard disk storage in 2016. 


NVMe over Fabrics (NVMe-oF) remains the hottest storage technology

It’s been a hot topic since 2014 and it’s getting hotter, even though production deployments are not yet widespread. The first new block storage protocol in 20 years has all the storage and SSD vendors excited because it makes their products and the applications running on them work better.  At least 4 startups have NVMe-oF products out with POCs in progress, while large vendors such as Intel, Samsung, Seagate, and Western Digital are demonstrating it regularly. Mainstream storage vendors are exploring how to use it while Web 2.0 customers want it to disaggregate storage, moving flash out of each individual server into more flexible, centralized repositories.

It’s so hot because it helps vendors and customers get the most out of flash (and other non-volatile memory) storage. Analyst G2M, Inc. predicts the NVMe market will exceed $57 Billion by 2020, with a compound annual growth rate (CAGR) of 95%. They also claim say 40% of AFAs will use NVMe SSDs by 2020, and hundreds of thousands of those arrays will connect with NVMe over Fabrics.


Figure 3: G2M predicts incredibly fast growth for NVMe SSDs, servers, appliances, and NVMe over Fabrics.


Cloud continues to eat the world of IT 

Nobody is surprised to hear cloud is growing faster than enterprise IT. IDC reported cloud (public + private) IT spending for Q2 2016 grew 14.5% YoY while traditional IT spending shrank 6% YoY. Cloud offers greater flexibility and efficiency, and in the case of public cloud the ability to replace capital expense investments with a pure OpEx model.

It’s not a panacea, as there are always concerns about security, privacy, and speed of access. Also, larger customers often find that on-premises infrastructure — often set up as private cloud — can cost less than public cloud in the long run. But there is no doubting the inexorable shift of projects, infrastructure, and spending to the cloud. This shift affects compute (servers), networking, software, and storage, and drives both cloud and enterprise customers to find more efficient solutions that offer lower cost and greater flexibility.


Figure 4: IDC Forecasts cloud will consume >40% of IT infrastructure spending by 2020. Full chart available at:


OK Captain Obvious, Now Make Some Real Predictions!

Now let’s look at the complex predictions which are derived from the easy ones:

  • Storage vendors consolidate and innovate
  • Fibre Channel continues its slow decline
  • Ceph grows in popularity for large customers
  • RDMA becomes more prevalent in storage


Traditional storage vendors consolidate and innovate

Data keeps growing at over 30% per year but spending on traditional storage is flat. This is forcing vendors to fight harder for market share by innovating more quickly to make their solutions more efficient, flexible, flash-focused, and cloud-friendly. Vendors that previously offered only standalone arrays are offering software-defined options, cloud-based storage, and more converged or hyper-converged infrastructure (HCI) options. For example, NetApp offers options to replicate or back up data from NetApp boxes to Amazon Web Services, Dell/EMC HDS, and IBM all sell converged infrastructure racks. In addition, startup Zadara Storage offers enterprise storage-as-a-service running either in the public cloud or as on-premises private cloud.

Meanwhile, major vendors all offer software versions of some of their products instead of only selling hardware appliances. For example, EMC ScaleIO, IBM Spectrum Storage, IBM Cloud Object Storage (formerly CleverSafe), and NetApp ONTAP Edge are all available as software that runs on commodity servers.

The environment for flash startups is getting tougher because all the traditional vendors now offer their own all-flash flavors. There are still startups making exciting progress in NVMe over Fabrics, object storage, hyper-converged infrastructure, data classification, and persistent memory, but only a few can grow into profitability on their own. 2017 will see a round of acquisitions as storage vendors who can’t grow enough organically look to expand their portfolios in these areas.


Fibre Channel Continues its Downward Spiral

One year ago I wrote a blog about why Fibre Channel (FC) is doomed and all signs (and analyst forecasts) point to its continued slow decline. All the storage trends around efficiency, flash, performance, big data, Ceph, machine learning, object storage, containers, HCI, etc. are moving against Fibre Channel. (Remember the “Cloud Eats the World” chart above? They definitely don’t want to use FC either.) The only thing keeping FC hopes alive is the rapid growth of all-flash arrays, which deploy mostly FC today because they are replacing legacy disk or hybrid FC arrays. But even AFAs are trending to using more Ethernet and InfiniBand (occasionally direct PCIe connections) to get more performance and flexibility at lower cost.

The FC vendors know the best they can hope for is to slow the rate of decline, so all of them were betting on growing their Ethernet product lines. More recently the FC vendors (Emulex, QLogic, Brocade) have been acquired by larger companies, but not as hot growth engines but rather so the larger companies can milk the cash flow from the expensive FC hardware before their customers convert to Ethernet and escape.


Ceph grows in Popularity for Large Customers

Ceph — both the community version and Red Hat Ceph Storage — continues to gain fans and use cases. Originally seen as suited only for storing big content on hard drives (low-cost, high-capacity storage), it’s now gained features and performance making it suitable for other applications. Vendors like Samsung, SanDisk (now WD), and Seagate are demonstrating Ceph on all-flash storage, while Red Hat and Supermicro teamed up with Percona to show Ceph works well as database storage (and is less expensive than Amazon storage for running MySQL).  I wrote a series of blogs on Ceph’s popularity, optimizing Ceph performance, and using Ceph for databases.

Ceph is still the only storage solution that is software-defined, open source, scale-out and offering enterprise storage features (though Lustre is approaching this as well). Major contributors to Ceph development include not just Red Hat but also Intel, the drive/SSD makers, Linux vendors (Canonical and SUSE), Ceph customers, and, of course, Mellanox.

In 2016, Ceph added features and stability to its file/NAS offering, CephFS, as well as major performance improvements for Ceph block storage. In 2017, Ceph will improve performance, management, and CephFS even more while also enhancing RDMA support. As a result, its adoption grows beyond its traditional base to add Telcos, cable companies, and large enterprises who want a scalable software-defined storage solution for OpenStack.



RDMA More Prevalent in Storage

RDMA, or Remote Direct Memory Access, has actually been prevalent in storage for a long time as a cluster interconnect and for HPC storage. Just about all the high-performance scale-out storage products use Mellanox-powered RDMA for their cluster communications — examples include Dell FluidCache for SAN, EMC XtremIO, EMC VMAX3, IBM XIV, InfiniDat, Kaminario, Oracle Engineered Systems, Zadara Storage, and many implementations of Lustre and IBM Spectrum Scale (GPFS).

The growing use of flash media and intense interest in NVMe-oF are accelerating the move to RDMA. Faster storage requires faster networks, not just more bandwidth but also lower latency, and in fact the NVMe-oF spec requires RDMA to deliver its super performance.


Figure 5: Intel presented a chart at Flash Memory Summit 2016 showing how the latency of storage devices is rapidly decreasing, leading to the need to decrease software and networking latency with higher-speed networks (like 25GbE) and RDMA.

In addition to the exploding interest in NVMe-oF, Microsoft has improved support for RDMA access to storage in Windows Server 2016, using SMB Direct and Windows Storage Spaces Direct, and Ceph RDMA is getting an upgrade. VMware has enhanced support for iSER (iSCSI Extensions for RDMA) in VSphere 2016 and more storage vendors like Oracle (in tape libraries) and Synology have added iSER support to enable accelerated client access. On top of this, multiple NIC vendors (not just Mellanox) have announced support for RoCE (RDMA over Converged Ethernet) on 25, 40, 50, and 100Gb Ethernet speeds. These changes all mean more storage vendors and storage deployments will leverage RDMA in 2017.


So Let’s Get This Party Started

2017 promises to be a super year for storage innovation. With technology changes, disruption, and consolidation, not every vendor will be a winner and not every storage startup will find hockey-stick growth and riches, but it’s clear the storage hardware and software vendors are working harder than ever, and customers will be big winners in many ways.



Ceph For Databases? Yes You Can, and Should

Ceph is traditionally known for both object and block storage, but not for database storage. While its scale-out design supports both high capacity and high throughput, the stereotype is that Ceph doesn’t support the low latency and high IOPS typically required by database workloads.

However, recent testing by Red Hat, Supermicro, and Percona—one of the top suppliers of MySQL database software—show that Red Hat Ceph Storage actually does a good job of supporting database storage, especially when running it on multiple VMs, and it does very well compared to running MySQL on Amazon Web Services(AWS).

In fact, Red Hat was a sponsor of Percona Live Europe last week in Amsterdam, and it wasn’t just to promote Red Hat Enterprise Linux. Sr. Storage Architect Karan Singh presented a session “MySQL and Ceph: A tale of two friends.”



Figure 1: This shadowy figure with the stylish hat has been spotted storing MySQL databases in a lab near you.


MySQL Needs Performance, But Not Just Performance

The front page of the Percona Europe web site says “Database Performance Matters,” and so it does. But there are multiple ways to measure database performance—it’s not just about running one huge instance of MySQL on one huge bare metal server with the fastest possible flash array. (Just in case that is what you want, check out conference sponsor Mangstor, who offer a very fast flash array connected using NVMe Over Fabrics.)  The majority of MySQL customers also consider other aspects of performance:

  • Performance across many instances: Comparing aggregate performance of many instances instead of just one large MySQL instance
  • Ease of deployment: The ability to spin up, manage, move and retire many MySQL instances using virtual machines.
  • Availability: Making sure the database keeps running even in case of hardware failure, and can be backed up and restored in case of corruption.
  • Storage management: Can the database storage be centralized, easily expanded, and possibly shared with other applications?
  • Price/Performance: Evaluating the cost of each database transaction or storage IOP.
  • Private vs. Public Cloud: Which instances should be run in a public cloud like AWS vs. in a private, on-premises cloud?

It’s common for customers to deploy many MySQL instances to support different applications, users, and projects. It’s also common to deploy them on virtual machines, which makes more efficient use of hardware and simplifies migration of instances. For example a particular MySQL instance can be given more resources when it’s hot then moved to an older server when it’s not.

Likewise it’s preferred to offer persistent, shared storage which can scale up in both capacity and performance when needed. While a straight flash array or local server flash might offer more peak performance to one MySQL instance, Ceph’s scale-out architecture makes it easy to scale up the storage performance to run many MySQL instances across many storage nodes. Persistent storage ensures the data continues to exist even if the database instances goes away. Ceph also features replication and erasure coding to protect against hardware failure and snapshots to support quick backup and restore of databases.

As for the debate between public vs. private cloud, it has too many angles to cover here, but clearly there are MySQL customers who prefer to run in their own datacenter rather than AWS, and others who would happily go either way depending which costs less.



Figure 2: Ceph can scale out to many nodes for both redundancy and increased performance for multiple database instances.

But the questions remain: can Ceph perform well enough for a typical MySQL user, and how does it compare to AWS in performance and price? This is what Red Hat, Supermicro, and Percona set off to find out.




Figure 3: MySQL on AWS vs. MySQL on Red Hat Ceph Storage. Which is faster? Which is less expensive?

First Red Hat ran baseline benchmarks on AWS EC2 (r3.2xlarge and m4.4xlarge) using Amazon’s Elastic Block Storage (EBS) with provisioned IOPS set to 30 IOPS/GB, testing with Sysbench for 100% read and 100% write. Not surprisingly, after converting from Sysbench numbers (requests per second per MySQL instance) to IOPS, AWS performance was as advertised—30 read IOPS/GB and 26 write IOPS/GB.

Then they tested the Ceph cluster illustrated above: 5 Supermicro cloud servers (SSG-6028R-E1CF12L) with four NVMe SSDs each, plus 12 Supermicro client machines on dual 10GbE networks. Software was Red Hat Ceph Storage 1.3.2 on RHEL 7.2 with Percona Server. After running the same Sysbench tests the Ceph cluster at 14% and 87% capacity utilization, they found read IOPS/GB were 8x or 5x better, while write IOPS/GB were 3x better than AWS at 14% utilization.  At 87% utilization of the Ceph cluster, write IOPS/DB were 14% lower than AWS due to the write amplification from the combination of InnoDB write buffering, Ceph replication, and OSD journaling.


Figure 4: Ceph private cloud generated far better write IOPS/GB at 14% capacity and slightly lower IOPS/GB at 72% and 87% capacity.


What about Price/Performance?

The Ceph cluster was always better than AWS for reads and much better than AWS for writes when nearly empty but slightly slower than AWS for writes when nearly full. On the other hand when looking at the cost per IOP for MySQL writes, Ceph was far less expensive than AWS in all scenarios. In the best case Ceph was less than 1/3rd the price/IOP and in the worst case half the price/IOP, vs. AWS EBS with provisioned IOPS.


Figure 5: MySQL on a Ceph private cloud showed much better (lower) price/performance than running on AWS EBS with Provisioned IOPS.


What Next for the Database Squid?

Having shown good performance chops running MySQL on Red Hat Ceph Storage, Red Hat also looked at tuning Ceph block storage performance, including RBD format, RBD order, RBD fancy striping, TCP settings, and various QEMU settings. These are covered in the Red Hat Summit presentation and Percona webinar.

For the next phase in this database testing, I’d like to see Red Hat, Supermicro, and Percona test larger server configurations that use more flash per server and faster networking. While this test only used dual 10GbE networks, previous testing has shown that using Mellanox 40 or 50Gb Ethernet can reduce latency and therefore increase IOPS performance for Ceph, even when dual 10GbE networks provide enough bandwidth. It would also be great to demonstrate the benefits of Ceph replication and cluster self-healing features for data protection as well as Ceph snapshots for nearly instant backup and restore of databases.

My key takeaways from this project are as follows:

  • Ceph is a good choice for many MySQL use cases
  • Ceph offers excellent performance and capacity scalability, even if it might not offer the fastest performance for one specific instance.
  • Ceph performance for MySQL compares favorably with AWS EBS Provisioned IOPS
  • You can build a private storage cloud with Red Hat Ceph Storage with a lower price/capacity and price/performance than running on AWS.

If you’re running a lot of MySQL instances, especially on AWS, it behooves you to evaluate Ceph as a storage option. You can learn more about this from the PerconaLive and Red Hat Summit presentations linked below.

Supporting Resources:


No Wrinkles as Mellanox Powers NVMe over Fabrics Demos at Flash Memory Summit and IDF

Mellanox just rounded out a two very busy weeks with back-to-back trade shows related to storage. We were at Flash Memory Summit August 9-11 in Santa Clara, followed by Intel Developer Forum (IDF) August 16-18 in San Francisco. A common theme was seeing Mellanox networking everywhere for demonstrating the performance of flash storage.

The fun began at Flash Memory Summit with several demos of NVMe over Fabrics (NVMe-oF). As my colleague Rob Davis wrote in his blog, the 1.0 standard and community drivers were just released in June 2016, and while FMS 2015 also featured NVMe-oF demos from Mangstor, Micron and PMC Sierra (now Microsemi), all were pre-standard and only Mangstor had a shipping product. Plus all the demos ran only on Linux.


Figure 1: NVMe over Fabrics is nearly always powered by RoCE (RDMA over Converged Ethernet)

So it was extremely exciting this year to see FIVE demos of NVMe over Fabrics at FMS using Mellanox networking, with three of them available as products. All the demos either used the standard NVMe-oF drivers or were compatible with the standard drivers, and they showed initiators running on Windows and VMware, not just Linux.

  • E8 Storage showed a distributed, scale-out NVMe-oF software-defined storage solution
  • Mangstor showed a high-performance, scale-up NVMe-oF array, with initiators running on bare-metal Linux and on a Linux VM running on top of VMware ESXi
  • Micron showed a Windows NVMe-oF initiator interoperating with a Linux target
  • Newisys (division of Sanmina) showed a live NVMe-oF demo
  • Pavilion Data showed a super dense NVMe-oF custom array supporting up to 460TB, 40x40GbE connections, and up to 20 million IOPS, all in one 4RU box.

blog 2

Figure 2: Pavilion Data’s custom-engineered all-flash array supports up to 460TB of raw capacity, 120GB/s of throughput, and 20M IOPS, all running NVMe-oF with up to forty 40GbE connections.

But NVMe over Fabrics wasn’t the only flash demo to leverage Mellanox networking! Samsung demonstrated an impressive Windows Storage Spaces Direct (S2D) cluster that reached 80GB/s (640 Gb/s) of data throughput. It used just 4 Dell servers, each with 4 Samsung NVMe SSDs and two Mellanox ConnectX-4 100GbE RDMA-enabled NICs, all connected by Mellanox’s Spectrum 2700 100GbE switch and LinkX® cables. Samsung also showed an all-flash reference design with 24 NVMe SSDs, capable of supporting several storage solutions including Ceph.

Nimbus Data unveiled a new family of flashy arrays which all support iSER (iSCSI Extensions for RDMA) on top of RoCE. Nexenta and Mellanox released a joint white paper showing how to deploy a hyper-converged software-defined storage (NexentaEdge) solution using Micron SSDs and Mellanox 50Gb Ethernet.

blog 3


Figure 3: Nimbus Data’s Exaflash C-series supports up to 3PB raw flash and can connect at 100Gb/s with either Ethernet or InfiniBand

At IDF a week later, there were more flashy demos. This time HGST (a Western Digital Brand), Seagate, and Samsung, showed NVMe over Fabrics using Mellanox adapters. Newisys and E8 Storage returned with their NVMe-oF demos, while Samsung also brought back their glorious Windows S2D cluster. To add to the storage excitement, Plexistor showed a solution for Shared Persistent Memory (uses technology similar to NVMe over Fabrics). Atto demonstrated ThunderLink which connects Thunderbolt 3 devices to 40Gb Ethernet networks, and Nokia showed their Airframe OCP rack.


blog 4blog5










Figure 4: Seagate showed a 2U NVMe-oF system with 24 Seagate Nytro XF1440 NVMe SSDs, while Atto’s ThunderLink™ connects Thunderbolt™ 3 devices to 40GbE networks.

Even Intel themselves showed NVMe over Fabrics with Mellanox ConnectX-4 100GbE NICs, paired with their Storage Performance Developer Kit (SPDK) and an Intel Silicon Photonics 100GbE cable. (Mellanox LinkX cables also support Silicon Photonics for 100GbE speeds at distances up to 2km.)

blog 6

Figure 5: Intel showed NVMe over Fabrics using their SPDK software and Mellanox ConnectX-4 adapters.

The common thread across these demos at FMS and IDF? They all used Mellanox ConnectX-3 or ConnectX-4 network adapters, and they all ran at speeds of 25Gb/s or faster (many at 100Gb/s).  In fact as far as I could see, every single demonstration of NVMe over Fabrics used Mellanox adapters, except for demos by other network adapter or chip vendors who showed their own networking.

This is not surprising given that Mellanox adapters and switches are the first to support 25, 50, and 100GbE speeds, and the first and best at supporting low-latency RDMA— via InfiniBand or RoCE—for super-efficient data movement. In addition, ConnectX-4 makes RoCE—and thus NVMe over Fabrics—deployments easier by allowing RoCE to run with Priority Flow Control (PFC) or Explicit Congestion Notification (ECN), or both (see my blog about that).

The key takeaways from these recent events are as follows:

  • NVMe over Fabrics is now a released standard with working products from several vendors
  • NVMe-oF support is expanding to Windows and VMware, no longer Linux-only
  • The speed of flash absolutely requires faster network speeds: 25, 40, 50, or even 100Gb/s
  • RoCE on Mellanox adapters is by far the most popular RDMA solution for supporting NVMe over Fabrics
  • Other flash storage solutions—such as Windows Storage Spaces, NexentaEdge, Ceph, and Plexistor—also choose Mellanox networking for the higher performance and efficiency

Many of the presentations—some given by me and my colleagues—from these two shows are now available online (links in the Resources section below). And if you’d like to see more solutions leveraging the power and efficiency of Mellanox networking, look for Mellanox at an upcoming event near you.

Supporting Resources:


Resilient RoCE Relaxes RDMA Requirements

RoCE — or RDMA over Converged Ethernet — has already proven to be the most popular choice for cloud deployments of Remote Direct Memory Access (RDMA). And it’s increasingly being used for fast flash storage access, such as with NVMe Over Fabrics. But some customers prefer not to configure their networks to be lossless using priority flow control (PFC). Now, with new software from Mellanox, RoCE can be deployed either with or without PFC, depending upon customer network requirements, infrastructure, and preference. This makes RoCE easier to deploy for more customers and will accelerate adoption of RDMA.

Background: Why RDMA?

The increasing speed of CPUs, networks, and storage (flash) have amplified the advantages of RDMA, making it more popular. As CPUs and storage get faster, they support faster network speeds such as 25, 40, 50, and 100GbE. But, as network speeds increase, more of the CPU cores are devoted to handling network traffic with its related data copies and interrupts. And as solid-state storage offers ever lower latencies, the network stack latency becomes a greater and greater part of the total time to access data.


Figure 1: As storage gets faster, software latency becomes a larger part of total data access latency. (Source: Intel presentation on SPDK, May 2016.)

RDMA solves both of these issues by reducing network latency and offloading the CPU. It uses zero-copy and hardware transport technology to transfer data directly from the memory of one server to another (or from server to storage) without making multiple copies, and hardware offloads relieve the CPU from managing any of the networking. This means that with RoCE, more CPU cores are available to run the important applications and the lower latency lets faster storage like flash shine.


Figure 2: RDMA increases network efficiency by transferring data directly to memory and bypassing the CPU. (Source: RoCE Initiative.)

The Purpose of Ethernet Flow Control

It’s clear that all RDMA performs best without packet loss, simply because detecting and retransmitting lost packets causes delays, no matter what protocol is used. The faster the network gets — such as 25, 40, 50, and 100GbE speeds — the greater the relative effect of packet loss and the more valuable to avoid packet loss.

RoCE has built-in error correction and retransmission mechanisms so it does not require a lossless network, however initial implementations recommended lossless networks. The most common source of packet loss within the datacenter is traffic overload on ports, such as an incast situation. So, it was recommended that customers deploy RoCE with Priority Flow Control (PFC).

PFC is part of the Ethernet Data Center Bridging (DCB) specification, originally implemented to support FCoE, which requires a lossless network. It acts like a traffic light or traffic cop at intersections, preventing collisions and avoiding packet loss from overloaded switch ports. The “Priority” in PFC allows traffic to be grouped into several classes so more important or latency-sensitive packets (for example storage or RDMA traffic) get priority over less latency-sensitive traffic.








Figure 3: PFC prevents packet loss on busy networks, just like a traffic cop prevents accidents at busy intersections.

Priority Flow Control

Priority Flow Control works very well, all major enterprise switches (including Mellanox switches) support it, and it’s been successfully deployed with RoCE in very large networks. In fact, because PFC eliminates packet loss from port overload, it effectively makes any datacenter network lossless. However, PFC requires the network administrators to set up VLANs and configure the flow control priorities, and some network administrators prefer not to do this.

ECN Eliminates Congestion for Smoother Network Flows

But there is an alternative mechanism to avoid packet loss, which leverages Explicit Congestion Notification (ECN). ECN allows switches to notify hosts when congestion is likely to happen, and the end nodes adjust their data transmission speeds to prevent congestion before it occurs.

The RoCE congestion management protocol takes advantage of ECN to avoid congestion and packet loss. ECN capable switches detect when a port is getting too busy and marks outbound packets from that port with the Congestion Experienced (CE) bit. The receiving NIC sees the CE indication and notifies the sending NIC with a Congestion Notification Packet (CNP). In turn, the sending NIC backs off its sending rate temporarily to prevent congestion from occurring. Once the risk of congestion declines sufficiently, the sender resumes full-speed data transmission.


Figure 4: RoCE congestion management leverages ECN to avoid both congestion and packet loss.

It’s like putting all the RoCE packets into self-driving cars which sense and avoid traffic jams using the data shared from all the other cars and local businesses. If a red light is ahead, the cars slow down so they won’t hit the red light, instead arriving at the intersection during the next green light.

Of course, ECN isn’t new. What is new is the latest software release that takes advantage of the advanced hardware mechanisms in the Mellanox ConnectX®-4 and ConnectX-4 Lx adapters which are optimized for deployment with ECN. Of course, you can still use PFC alone. You can even use both in a, “belt and suspenders” approach where ECN prevents congestion but just in case, PFC steps in as a, “traffic cop” to prevent packet loss and keep flows orderly.


Figure 5: RoCE can be deployed with ECN only, PFC only, or both, if you want to ensure your pants (or network flows) won’t fall down.

It’s the Same RoCE Specification as Before

To be clear, this is still the same RoCE specification and wire protocol, which hasn’t changed. It’s simply an enhanced implementation of RoCE, leveraging the improved features and capability of the Mellanox ConnectX-4 adapter family and the ECN support found in advanced switches, including the Mellanox Spectrum switch family. Different RoCE capable adapters still interoperate exactly as before.

Resilient RoCE delivers RDMA performance on lossy networks that performs on par with lossless networks and substantially better than protocols that rely on TCP/IP for error recovery. It gives customers more flexibility to deploy RDMA in the way that best suits their network architecture and performance needs. Some customers will deploy only PFC, some will deploy only ECN, and some will deploy both.

RoCE Continues to Improve and Evolve

Resilient RoCE continues the evolution of RoCE to serve the needs of both bigger networks and more types of enterprise and cloud customers.

  • 2013: First RoCE NICs shipped which are L3-routable
  • 2014: L3-routable RoCE standard approved
  • 2015 (June): Soft-RoCE lets any NIC run RoCE (though only rNICs offer the hardware acceleration and offload)
  • 2015 (October): RoCE plugfest proves multiple RoCE rNIC vendors can interoperate
  • 2016: Resilient RoCE lets RoCE run on lossless or lossy networks


Figure 6: RoCE continues to evolve and improve (source: Mellanox and InfiniBand Trade Association).

RoCE On!

It’s clear why RoCE is the most popular way to use RDMA over Ethernet—it provides the best performance and greatest efficiency. Now, with the addition of Soft-RoCE and the ability to operate with or without lossless networks, RoCE has the most flexibility and largest ecosystem of any Ethernet-based RDMA technology.




The Drive for 25: HPE Introduces New 25GbE NICs


At the Discover Conference earlier this month, HPE introduced exciting new 25G networking technology in their “Drive for 25” including dual-port 25GbE adapters, in both mezzanine and stand-up PCIe card form factors. These new adapters — based on the Mellanox ConnectX-4 Lx silicon — enable cloud and enterprise customers to improve network performance and efficiency while lowering total cost of ownership (TCO).

fig 1

Figure 1: HPE dual-port 25GbE adapter in both mezzanine (640SFP28) and PCIe card (640FLR_SFP28) formats, both based on the Mellanox ConnectX-4 Lx silicon.

Increasing Demand for 25GbE

With the increasing levels of performance coming out of HPE servers, applications frequently need more network bandwidth. 25GbE is ideal for many workloads and servers, providing 2.5x more bandwidth than 10GbE on each port. It accelerates many workloads including database, virtualization, video streaming, high-frequency trading (HFT), and network function virtualization (NFV). 25GbE — along with its close cousins 50GbE and 100GbE, also accelerates the new generation of infrastructure including hyper-converged infrastructure, in-memory computing, software-defined storage, and big data.

fig 2

Figure 2: HPE offers new speedy 25GbE adapters as part of the “Drive to 25” solution

 Two Ports, Flexible Connection Options

These new HPE 25GbE adapters each support two SFP28 ports to allow for high-availability or connection to multiple physical networks. Using a SFP28 form factor allows each port on the adapter to support many connectivity options, giving HPE customers the ability to choose the best cabling option for their needs:

  • 10GbE or 25GbE speeds
  • Copper or fiber optic cabling
  • Cables and transceivers supporting distances from 0.5M (50cm) to 10km
  • No breakout cables required
  • Ability to re-use existing structured 10GbE fiber for 25GbE connections

Advanced Support for Public and Private Cloud Workloads

These new HPE adapters, with Mellanox ConnectX-4 technology, also support advanced cloud offloads to improve packet processing speeds and maximize performance in virtualized environments. They include features to optimize video streaming, and support Remote Direct Memory Access (RDMA) using the RDMA over Converged Ethernet (RoCE) protocol.

As customers increasingly deploy HPE servers to handle cloud workloads and as network speeds increase, the smart offloads in these new HPE adapters offload the CPU and reduce network latencies. This delivers more CPU power to the applications. HPE customers will also leverage the increased bandwidth and efficiency to create more efficient software-defined storage and hyper-converged infrastructure solutions.

Highest Performance and Efficiency

The HPE 640SFP28 and 640FLR-SFP28 adapters come with impressive speed and green credentials as well. They feature some of the lowest latency and highest message rates of any 25GbE NIC, as well as very low power consumption for efficiency and a fan-less design for maximum reliability. The smart offloads allow more work to be accomplished more quickly by fewer CPU cores, and the two-port SFP28 design mentioned earlier allows a broad choice of the most efficient cabling for the distances required, including the ability to re-use existing structured fiber. (HPE also offers an EDR IB and 100Gb Ethernet adapter based on the Mellanox ConnectX-4 silicon.)

Mellanox Helps HPE Lead in Server Innovation

By offering 25GbE adapters with flexible ports and smart offloads, HPE and Mellanox are helping customers to build more efficient datacenters. This, “Drive to 25” is another example of the technology leadership that has made HPE a leader in server and networking technology for the last 25 years and Mellanox is proud to be an HPE server networking partner.




25 Is the New 10, 50 Is the new 40, 100 Is the New Amazing

(This blog was inspired by an insightful article in EE Times, written by my colleague, Chloe Jian Ma.)

The latest buzz about Ethernet is that 25GbE is coming. Scratch that, it’s already here and THE hot topic in the Ethernet world, with multiple vendors sampling 25GbE wares and Mellanox already shipping an end-to-end solution with adapters, switches and cables that support 25, 50, and 100GbE speeds. Analysts predict 25GbE sales will ramp faster than any previous Ethernet speed.

Why?????? What’s driving this shift?

John Kim 030416 Fig 1Figure 1: Analysts predict 25/40/50/100GbE adapters reach 57% of a $1.8 Billion USD high-speed Ethernet adapter market by 2020. (Based on Crehan Research data published January 2016.)

These new speeds are so hot that, like the ageless celebrities you just saw on the Oscar Night red carpet, we say “25 is the new 10 and 50 is the new 40.” But whoa! Sure everyone wants to look younger for the camera, but no 25-year old actor wants to look 10. More importantly, why would anyone want 25GbE or 50GbE when we already have 40GbE and 100GbE?

Continue reading

Ethernet Is the New Storage Network

I recently saw an infographic titled “2015 Data Storage Roadmap” and was pleasantly surprised to see Mellanox listed under the storage networking section. The side comment was “Ethernet Becoming The Standard Storage Network.”


John Kim 030416 Fig new storage networkFigure 1: Tech Expectations blog infographic shows the new storage networking vendors. (Graphic excerpted from the larger original graphic, which is available here.)



Why surprised? Because in the past, when people said “Storage Networking” they usually meant Fibre Channel. But the growth of cloud, software-defined, and scale-out storage, as well as hyper-converged and big data solutions, have all made Ethernet the new standard storage network (rather than Fibre Channel), just as the infographic above says. Since Mellanox is the leading vendor of networking equipment for speeds above 10Gb/s, it’s really not a surprise after all to have Mellanox on the leaderboard.



Continue reading

Making Ceph Faster: Lessons From Performance Testing

In my first blog on Ceph I explained what it is and why it’s hot; in my second blog on Ceph I showed how faster networking can enable faster Ceph performance (especially throughput).  But many customers are asking how to make Ceph even faster. And recent testing by Red Hat and Mellanox, along with key partners like Supermicro, QCT (Quanta Cloud technology), and Intel have provided more insight into increasing Ceph performance, especially around IOPS-sensitive workloads.

John Kim 021716 cheetah

Figure 1: Everyone wants Ceph to go FASTER


Different Data in Ceph Imposes Different Workloads

Ceph can be used for block or object storage and different workloads. Usually, block workloads consist of smaller, random I/O, where data is managed in blocks ranging from 1KB to 64KB in size. Object storage workloads usually offer large, sequential I/O with data chunks ranging from 16KB to 4MB in size (and individual objects can be many gigabytes in size). The stereotypical small, random block workload is a database such as MySQL or active virtual machine images. Common object data include archived log files, photos, or videos. However in special cases, block I/O can be large and sequential (like copying a large part of a database) and object I/O can be small and random (like analyzing many small text files).


The different workloads put different requirements on the Ceph system. Large sequential I/O (usually objects) tends to stress the storage and network bandwidth for both reads and writes. Small random I/O (usually blocks) tends to stress the CPU and memory of the OSD server as well as the storage and network latency. Reads usually require fewer CPU cycles and are more likely to stress storage and network bandwidth, while writes are more likely to stress the CPUs as they calculate data placement. Erasure Coding writes require more CPU power but less network and storage bandwidth.

Continue reading

Top 7 Reasons Why Fibre Channel Is Doomed

Analyst firm, Neuralytix, just published a terrific white paper about the revolution affecting data storage interconnects.  Titled Faster Interconnects for Next Generation Data Centers, it explains why customers are rethinking their data center storage and networks, in particular around how iSCSI and iSER (iSCSI with RDMA) are starting to replace Fibre Channel for block storage.

John Kim 121415 Pic1

You can find the paper here. It’s on-target about iSCSI vs. FC, but it doesn’t cover the full spectrum of factors dooming FC to a long and slow fadeout from the storage connectivity market. I’ll summarize the key points of the paper as well as the other reasons Fibre Channel has no future.


Three reasons Fibre Channel is a Dead End, As Explained by Neuralytix:

1.  Flash: Fast Storage Needs Fast Networking

Today’s flash far outperforms hard drives for throughput, latency, IOPS, power consumption, and reliability. It has better price/performance than hard disks and already represents between 10-15% of shipping enterprise storage capacity according to analysts. With fast storage, your physical network and your network protocol must have high bandwidth and low latency, otherwise you’re wasting much of the value of flash. Tomorrow’s NVMe devices will support up to 2-3GB/s (16-24Gb/s) each with latencies <50 us (that’s <0.05 milliseconds vs. 2-5 milliseconds for hard drives).  Modern Ethernet supports speeds of 100Gb/s per link, with latencies of several microseconds, and combined with the hardware-accelerated iSER block protocol, it’s perfect for supporting maximum performance on non-volatile memory (NVM), whether today’s flash or tomorrow’s next-gen solid state storage.


John Kim 121415 Pic2

Figure 1: Storage Media Gets Much Faster

Continue reading