All posts by John F. Kim

About John F. Kim

John Kim is Director of Storage Marketing at Mellanox Technologies, where he helps storage customers and vendors benefit from high performance interconnects and RDMA (Remote Direct Memory Access). After starting his high tech career in an IT helpdesk, John worked in enterprise software and networked storage, with many years of solution marketing, product management, and alliances at enterprise software companies, followed by 12 years working at NetApp and EMC. Follow him on Twitter: @Tier1Storage

No Wrinkles as Mellanox Powers NVMe over Fabrics Demos at Flash Memory Summit and IDF

Mellanox just rounded out a two very busy weeks with back-to-back trade shows related to storage. We were at Flash Memory Summit August 9-11 in Santa Clara, followed by Intel Developer Forum (IDF) August 16-18 in San Francisco. A common theme was seeing Mellanox networking everywhere for demonstrating the performance of flash storage.

The fun began at Flash Memory Summit with several demos of NVMe over Fabrics (NVMe-oF). As my colleague Rob Davis wrote in his blog, the 1.0 standard and community drivers were just released in June 2016, and while FMS 2015 also featured NVMe-oF demos from Mangstor, Micron and PMC Sierra (now Microsemi), all were pre-standard and only Mangstor had a shipping product. Plus all the demos ran only on Linux.


Figure 1: NVMe over Fabrics is nearly always powered by RoCE (RDMA over Converged Ethernet)

So it was extremely exciting this year to see FIVE demos of NVMe over Fabrics at FMS using Mellanox networking, with three of them available as products. All the demos either used the standard NVMe-oF drivers or were compatible with the standard drivers, and they showed initiators running on Windows and VMware, not just Linux.

  • E8 Storage showed a distributed, scale-out NVMe-oF software-defined storage solution
  • Mangstor showed a high-performance, scale-up NVMe-oF array, with initiators running on bare-metal Linux and on a Linux VM running on top of VMware ESXi
  • Micron showed a Windows NVMe-oF initiator interoperating with a Linux target
  • Newisys (division of Sanmina) showed a live NVMe-oF demo
  • Pavilion Data showed a super dense NVMe-oF custom array supporting up to 460TB, 40x40GbE connections, and up to 20 million IOPS, all in one 4RU box.

blog 2

Figure 2: Pavilion Data’s custom-engineered all-flash array supports up to 460TB of raw capacity, 120GB/s of throughput, and 20M IOPS, all running NVMe-oF with up to forty 40GbE connections.

But NVMe over Fabrics wasn’t the only flash demo to leverage Mellanox networking! Samsung demonstrated an impressive Windows Storage Spaces Direct (S2D) cluster that reached 80GB/s (640 Gb/s) of data throughput. It used just 4 Dell servers, each with 4 Samsung NVMe SSDs and two Mellanox ConnectX-4 100GbE RDMA-enabled NICs, all connected by Mellanox’s Spectrum 2700 100GbE switch and LinkX® cables. Samsung also showed an all-flash reference design with 24 NVMe SSDs, capable of supporting several storage solutions including Ceph.

Nimbus Data unveiled a new family of flashy arrays which all support iSER (iSCSI Extensions for RDMA) on top of RoCE. Nexenta and Mellanox released a joint white paper showing how to deploy a hyper-converged software-defined storage (NexentaEdge) solution using Micron SSDs and Mellanox 50Gb Ethernet.

blog 3


Figure 3: Nimbus Data’s Exaflash C-series supports up to 3PB raw flash and can connect at 100Gb/s with either Ethernet or InfiniBand

At IDF a week later, there were more flashy demos. This time HGST (a Western Digital Brand), Seagate, and Samsung, showed NVMe over Fabrics using Mellanox adapters. Newisys and E8 Storage returned with their NVMe-oF demos, while Samsung also brought back their glorious Windows S2D cluster. To add to the storage excitement, Plexistor showed a solution for Shared Persistent Memory (uses technology similar to NVMe over Fabrics). Atto demonstrated ThunderLink which connects Thunderbolt 3 devices to 40Gb Ethernet networks, and Nokia showed their Airframe OCP rack.


blog 4blog5










Figure 4: Seagate showed a 2U NVMe-oF system with 24 Seagate Nytro XF1440 NVMe SSDs, while Atto’s ThunderLink™ connects Thunderbolt™ 3 devices to 40GbE networks.

Even Intel themselves showed NVMe over Fabrics with Mellanox ConnectX-4 100GbE NICs, paired with their Storage Performance Developer Kit (SPDK) and an Intel Silicon Photonics 100GbE cable. (Mellanox LinkX cables also support Silicon Photonics for 100GbE speeds at distances up to 2km.)

blog 6

Figure 5: Intel showed NVMe over Fabrics using their SPDK software and Mellanox ConnectX-4 adapters.

The common thread across these demos at FMS and IDF? They all used Mellanox ConnectX-3 or ConnectX-4 network adapters, and they all ran at speeds of 25Gb/s or faster (many at 100Gb/s).  In fact as far as I could see, every single demonstration of NVMe over Fabrics used Mellanox adapters, except for demos by other network adapter or chip vendors who showed their own networking.

This is not surprising given that Mellanox adapters and switches are the first to support 25, 50, and 100GbE speeds, and the first and best at supporting low-latency RDMA— via InfiniBand or RoCE—for super-efficient data movement. In addition, ConnectX-4 makes RoCE—and thus NVMe over Fabrics—deployments easier by allowing RoCE to run with Priority Flow Control (PFC) or Explicit Congestion Notification (ECN), or both (see my blog about that).

The key takeaways from these recent events are as follows:

  • NVMe over Fabrics is now a released standard with working products from several vendors
  • NVMe-oF support is expanding to Windows and VMware, no longer Linux-only
  • The speed of flash absolutely requires faster network speeds: 25, 40, 50, or even 100Gb/s
  • RoCE on Mellanox adapters is by far the most popular RDMA solution for supporting NVMe over Fabrics
  • Other flash storage solutions—such as Windows Storage Spaces, NexentaEdge, Ceph, and Plexistor—also choose Mellanox networking for the higher performance and efficiency

Many of the presentations—some given by me and my colleagues—from these two shows are now available online (links in the Resources section below). And if you’d like to see more solutions leveraging the power and efficiency of Mellanox networking, look for Mellanox at an upcoming event near you.

Supporting Resources:


Resilient RoCE Relaxes RDMA Requirements

RoCE — or RDMA over Converged Ethernet — has already proven to be the most popular choice for cloud deployments of Remote Direct Memory Access (RDMA). And it’s increasingly being used for fast flash storage access, such as with NVMe Over Fabrics. But some customers prefer not to configure their networks to be lossless using priority flow control (PFC). Now, with new software from Mellanox, RoCE can be deployed either with or without PFC, depending upon customer network requirements, infrastructure, and preference. This makes RoCE easier to deploy for more customers and will accelerate adoption of RDMA.

Background: Why RDMA?

The increasing speed of CPUs, networks, and storage (flash) have amplified the advantages of RDMA, making it more popular. As CPUs and storage get faster, they support faster network speeds such as 25, 40, 50, and 100GbE. But, as network speeds increase, more of the CPU cores are devoted to handling network traffic with its related data copies and interrupts. And as solid-state storage offers ever lower latencies, the network stack latency becomes a greater and greater part of the total time to access data.


Figure 1: As storage gets faster, software latency becomes a larger part of total data access latency. (Source: Intel presentation on SPDK, May 2016.)

RDMA solves both of these issues by reducing network latency and offloading the CPU. It uses zero-copy and hardware transport technology to transfer data directly from the memory of one server to another (or from server to storage) without making multiple copies, and hardware offloads relieve the CPU from managing any of the networking. This means that with RoCE, more CPU cores are available to run the important applications and the lower latency lets faster storage like flash shine.


Figure 2: RDMA increases network efficiency by transferring data directly to memory and bypassing the CPU. (Source: RoCE Initiative.)

The Purpose of Ethernet Flow Control

It’s clear that all RDMA performs best without packet loss, simply because detecting and retransmitting lost packets causes delays, no matter what protocol is used. The faster the network gets — such as 25, 40, 50, and 100GbE speeds — the greater the relative effect of packet loss and the more valuable to avoid packet loss.

RoCE has built-in error correction and retransmission mechanisms so it does not require a lossless network, however initial implementations recommended lossless networks. The most common source of packet loss within the datacenter is traffic overload on ports, such as an incast situation. So, it was recommended that customers deploy RoCE with Priority Flow Control (PFC).

PFC is part of the Ethernet Data Center Bridging (DCB) specification, originally implemented to support FCoE, which requires a lossless network. It acts like a traffic light or traffic cop at intersections, preventing collisions and avoiding packet loss from overloaded switch ports. The “Priority” in PFC allows traffic to be grouped into several classes so more important or latency-sensitive packets (for example storage or RDMA traffic) get priority over less latency-sensitive traffic.








Figure 3: PFC prevents packet loss on busy networks, just like a traffic cop prevents accidents at busy intersections.

Priority Flow Control

Priority Flow Control works very well, all major enterprise switches (including Mellanox switches) support it, and it’s been successfully deployed with RoCE in very large networks. In fact, because PFC eliminates packet loss from port overload, it effectively makes any datacenter network lossless. However, PFC requires the network administrators to set up VLANs and configure the flow control priorities, and some network administrators prefer not to do this.

ECN Eliminates Congestion for Smoother Network Flows

But there is an alternative mechanism to avoid packet loss, which leverages Explicit Congestion Notification (ECN). ECN allows switches to notify hosts when congestion is likely to happen, and the end nodes adjust their data transmission speeds to prevent congestion before it occurs.

The RoCE congestion management protocol takes advantage of ECN to avoid congestion and packet loss. ECN capable switches detect when a port is getting too busy and marks outbound packets from that port with the Congestion Experienced (CE) bit. The receiving NIC sees the CE indication and notifies the sending NIC with a Congestion Notification Packet (CNP). In turn, the sending NIC backs off its sending rate temporarily to prevent congestion from occurring. Once the risk of congestion declines sufficiently, the sender resumes full-speed data transmission.


Figure 4: RoCE congestion management leverages ECN to avoid both congestion and packet loss.

It’s like putting all the RoCE packets into self-driving cars which sense and avoid traffic jams using the data shared from all the other cars and local businesses. If a red light is ahead, the cars slow down so they won’t hit the red light, instead arriving at the intersection during the next green light.

Of course, ECN isn’t new. What is new is the latest software release that takes advantage of the advanced hardware mechanisms in the Mellanox ConnectX®-4 and ConnectX-4 Lx adapters which are optimized for deployment with ECN. Of course, you can still use PFC alone. You can even use both in a, “belt and suspenders” approach where ECN prevents congestion but just in case, PFC steps in as a, “traffic cop” to prevent packet loss and keep flows orderly.


Figure 5: RoCE can be deployed with ECN only, PFC only, or both, if you want to ensure your pants (or network flows) won’t fall down.

It’s the Same RoCE Specification as Before

To be clear, this is still the same RoCE specification and wire protocol, which hasn’t changed. It’s simply an enhanced implementation of RoCE, leveraging the improved features and capability of the Mellanox ConnectX-4 adapter family and the ECN support found in advanced switches, including the Mellanox Spectrum switch family. Different RoCE capable adapters still interoperate exactly as before.

Resilient RoCE delivers RDMA performance on lossy networks that performs on par with lossless networks and substantially better than protocols that rely on TCP/IP for error recovery. It gives customers more flexibility to deploy RDMA in the way that best suits their network architecture and performance needs. Some customers will deploy only PFC, some will deploy only ECN, and some will deploy both.

RoCE Continues to Improve and Evolve

Resilient RoCE continues the evolution of RoCE to serve the needs of both bigger networks and more types of enterprise and cloud customers.

  • 2013: First RoCE NICs shipped which are L3-routable
  • 2014: L3-routable RoCE standard approved
  • 2015 (June): Soft-RoCE lets any NIC run RoCE (though only rNICs offer the hardware acceleration and offload)
  • 2015 (October): RoCE plugfest proves multiple RoCE rNIC vendors can interoperate
  • 2016: Resilient RoCE lets RoCE run on lossless or lossy networks


Figure 6: RoCE continues to evolve and improve (source: Mellanox and InfiniBand Trade Association).

RoCE On!

It’s clear why RoCE is the most popular way to use RDMA over Ethernet—it provides the best performance and greatest efficiency. Now, with the addition of Soft-RoCE and the ability to operate with or without lossless networks, RoCE has the most flexibility and largest ecosystem of any Ethernet-based RDMA technology.




The Drive for 25: HPE Introduces New 25GbE NICs


At the Discover Conference earlier this month, HPE introduced exciting new 25G networking technology in their “Drive for 25” including dual-port 25GbE adapters, in both mezzanine and stand-up PCIe card form factors. These new adapters — based on the Mellanox ConnectX-4 Lx silicon — enable cloud and enterprise customers to improve network performance and efficiency while lowering total cost of ownership (TCO).

fig 1

Figure 1: HPE dual-port 25GbE adapter in both mezzanine (640SFP28) and PCIe card (640FLR_SFP28) formats, both based on the Mellanox ConnectX-4 Lx silicon.

Increasing Demand for 25GbE

With the increasing levels of performance coming out of HPE servers, applications frequently need more network bandwidth. 25GbE is ideal for many workloads and servers, providing 2.5x more bandwidth than 10GbE on each port. It accelerates many workloads including database, virtualization, video streaming, high-frequency trading (HFT), and network function virtualization (NFV). 25GbE — along with its close cousins 50GbE and 100GbE, also accelerates the new generation of infrastructure including hyper-converged infrastructure, in-memory computing, software-defined storage, and big data.

fig 2

Figure 2: HPE offers new speedy 25GbE adapters as part of the “Drive to 25” solution

 Two Ports, Flexible Connection Options

These new HPE 25GbE adapters each support two SFP28 ports to allow for high-availability or connection to multiple physical networks. Using a SFP28 form factor allows each port on the adapter to support many connectivity options, giving HPE customers the ability to choose the best cabling option for their needs:

  • 10GbE or 25GbE speeds
  • Copper or fiber optic cabling
  • Cables and transceivers supporting distances from 0.5M (50cm) to 10km
  • No breakout cables required
  • Ability to re-use existing structured 10GbE fiber for 25GbE connections

Advanced Support for Public and Private Cloud Workloads

These new HPE adapters, with Mellanox ConnectX-4 technology, also support advanced cloud offloads to improve packet processing speeds and maximize performance in virtualized environments. They include features to optimize video streaming, and support Remote Direct Memory Access (RDMA) using the RDMA over Converged Ethernet (RoCE) protocol.

As customers increasingly deploy HPE servers to handle cloud workloads and as network speeds increase, the smart offloads in these new HPE adapters offload the CPU and reduce network latencies. This delivers more CPU power to the applications. HPE customers will also leverage the increased bandwidth and efficiency to create more efficient software-defined storage and hyper-converged infrastructure solutions.

Highest Performance and Efficiency

The HPE 640SFP28 and 640FLR-SFP28 adapters come with impressive speed and green credentials as well. They feature some of the lowest latency and highest message rates of any 25GbE NIC, as well as very low power consumption for efficiency and a fan-less design for maximum reliability. The smart offloads allow more work to be accomplished more quickly by fewer CPU cores, and the two-port SFP28 design mentioned earlier allows a broad choice of the most efficient cabling for the distances required, including the ability to re-use existing structured fiber. (HPE also offers an EDR IB and 100Gb Ethernet adapter based on the Mellanox ConnectX-4 silicon.)

Mellanox Helps HPE Lead in Server Innovation

By offering 25GbE adapters with flexible ports and smart offloads, HPE and Mellanox are helping customers to build more efficient datacenters. This, “Drive to 25” is another example of the technology leadership that has made HPE a leader in server and networking technology for the last 25 years and Mellanox is proud to be an HPE server networking partner.




25 Is the New 10, 50 Is the new 40, 100 Is the New Amazing

(This blog was inspired by an insightful article in EE Times, written by my colleague, Chloe Jian Ma.)

The latest buzz about Ethernet is that 25GbE is coming. Scratch that, it’s already here and THE hot topic in the Ethernet world, with multiple vendors sampling 25GbE wares and Mellanox already shipping an end-to-end solution with adapters, switches and cables that support 25, 50, and 100GbE speeds. Analysts predict 25GbE sales will ramp faster than any previous Ethernet speed.

Why?????? What’s driving this shift?

John Kim 030416 Fig 1Figure 1: Analysts predict 25/40/50/100GbE adapters reach 57% of a $1.8 Billion USD high-speed Ethernet adapter market by 2020. (Based on Crehan Research data published January 2016.)

These new speeds are so hot that, like the ageless celebrities you just saw on the Oscar Night red carpet, we say “25 is the new 10 and 50 is the new 40.” But whoa! Sure everyone wants to look younger for the camera, but no 25-year old actor wants to look 10. More importantly, why would anyone want 25GbE or 50GbE when we already have 40GbE and 100GbE?

Continue reading

Ethernet Is the New Storage Network

I recently saw an infographic titled “2015 Data Storage Roadmap” and was pleasantly surprised to see Mellanox listed under the storage networking section. The side comment was “Ethernet Becoming The Standard Storage Network.”


John Kim 030416 Fig new storage networkFigure 1: Tech Expectations blog infographic shows the new storage networking vendors. (Graphic excerpted from the larger original graphic, which is available here.)



Why surprised? Because in the past, when people said “Storage Networking” they usually meant Fibre Channel. But the growth of cloud, software-defined, and scale-out storage, as well as hyper-converged and big data solutions, have all made Ethernet the new standard storage network (rather than Fibre Channel), just as the infographic above says. Since Mellanox is the leading vendor of networking equipment for speeds above 10Gb/s, it’s really not a surprise after all to have Mellanox on the leaderboard.



Continue reading

Making Ceph Faster: Lessons From Performance Testing

In my first blog on Ceph I explained what it is and why it’s hot; in my second blog on Ceph I showed how faster networking can enable faster Ceph performance (especially throughput).  But many customers are asking how to make Ceph even faster. And recent testing by Red Hat and Mellanox, along with key partners like Supermicro, QCT (Quanta Cloud technology), and Intel have provided more insight into increasing Ceph performance, especially around IOPS-sensitive workloads.

John Kim 021716 cheetah

Figure 1: Everyone wants Ceph to go FASTER


Different Data in Ceph Imposes Different Workloads

Ceph can be used for block or object storage and different workloads. Usually, block workloads consist of smaller, random I/O, where data is managed in blocks ranging from 1KB to 64KB in size. Object storage workloads usually offer large, sequential I/O with data chunks ranging from 16KB to 4MB in size (and individual objects can be many gigabytes in size). The stereotypical small, random block workload is a database such as MySQL or active virtual machine images. Common object data include archived log files, photos, or videos. However in special cases, block I/O can be large and sequential (like copying a large part of a database) and object I/O can be small and random (like analyzing many small text files).


The different workloads put different requirements on the Ceph system. Large sequential I/O (usually objects) tends to stress the storage and network bandwidth for both reads and writes. Small random I/O (usually blocks) tends to stress the CPU and memory of the OSD server as well as the storage and network latency. Reads usually require fewer CPU cycles and are more likely to stress storage and network bandwidth, while writes are more likely to stress the CPUs as they calculate data placement. Erasure Coding writes require more CPU power but less network and storage bandwidth.

Continue reading

Top 7 Reasons Why Fibre Channel Is Doomed

Analyst firm, Neuralytix, just published a terrific white paper about the revolution affecting data storage interconnects.  Titled Faster Interconnects for Next Generation Data Centers, it explains why customers are rethinking their data center storage and networks, in particular around how iSCSI and iSER (iSCSI with RDMA) are starting to replace Fibre Channel for block storage.

John Kim 121415 Pic1

You can find the paper here. It’s on-target about iSCSI vs. FC, but it doesn’t cover the full spectrum of factors dooming FC to a long and slow fadeout from the storage connectivity market. I’ll summarize the key points of the paper as well as the other reasons Fibre Channel has no future.


Three reasons Fibre Channel is a Dead End, As Explained by Neuralytix:

1.  Flash: Fast Storage Needs Fast Networking

Today’s flash far outperforms hard drives for throughput, latency, IOPS, power consumption, and reliability. It has better price/performance than hard disks and already represents between 10-15% of shipping enterprise storage capacity according to analysts. With fast storage, your physical network and your network protocol must have high bandwidth and low latency, otherwise you’re wasting much of the value of flash. Tomorrow’s NVMe devices will support up to 2-3GB/s (16-24Gb/s) each with latencies <50 us (that’s <0.05 milliseconds vs. 2-5 milliseconds for hard drives).  Modern Ethernet supports speeds of 100Gb/s per link, with latencies of several microseconds, and combined with the hardware-accelerated iSER block protocol, it’s perfect for supporting maximum performance on non-volatile memory (NVM), whether today’s flash or tomorrow’s next-gen solid state storage.


John Kim 121415 Pic2

Figure 1: Storage Media Gets Much Faster

Continue reading

A Good Network Connects Ceph To Faster Performance

In my first blog on Ceph, I explained what it is and why it’s hot. But what does Mellanox, a networking company, have to do with Ceph, a software-defined storage solution?  The answer lies in the Ceph scale-out design. And some empirical results are found in the new “Red Hat Ceph Storage Clusters on Supermicro storage servers” reference architecture published August 10th.


Ceph has two logical networks, the client-facing (public) and the cluster (private) networks. Communication with clients or application servers is via the former while replication, heartbeat, and reconstruction traffic run on the latter. You can run both logical networks on one physical network or separate the networks if you have a large cluster or lots of activity.

John Kim 081915 Fig1

Figure 1: Logical diagram of the two Ceph networks

Continue reading

50 Shades of Flash—Solutions That Won’t Tie Up Your Storage

Where Are We On this NVMe Thing?

Back in April 2015, during the Ethernet Technology Summit conference, my colleague Rob Davis wrote a great blog about NVMe Over Fabrics. He outlined the basics of what NVMe is and why Mellanox is collaborating on a standard to access NVMe devices over networks (over fabrics). We had two demos from two vendors in our booth:

  • Mangstor’s NX-Series array with NVMe Over Fabrics, using Mellanox 56GbE RoCE (or FDR InfiniBand), demonstrated >10GB/s read throughput and >2.5 million 4KB random read IOPS.
  • Saratoga Speed’s Altamont XP-L with iSER (iSCSI RDMA), using Mellanox 56Gb RoCE to reach 11.6GB/s read throughput and 2.7 million 4KB sequential read IOPs

These numbers were pretty impressive, but in the technology world, nothing stands still. One must always strive to be faster, cheaper, more reliable, and/or more efficient.


The Story Gets Better

Today—four months after Ethernet Technology Summit—kicked off the Flash Memory Summit in Santa Clara, California. Mellanox issued a press release highlighting the fact that we now have NINE vendors showing TWELVE demos of flash (or other non-volatile memory) being accessed using high-speed Mellanox networks at 40, 56, or even 100Gb/s speeds.  Mangstor and Saratoga Speed are both back with faster, more impressive demos and we have other demos from Apeiron, HGST, Memblaze, Micron, NetApp, PMC-Sierra, and Samsung. Here’s a quick summary:

  Continue reading

Ceph Is A Hot Storage Solution – But Why?

In talks with customers, server vendors, the IT press, and even within Mellanox, one of the hottest storage topics is Ceph. You’ve probably heard of it and many big customers are implementing it or evaluating it. But I am also frequently asked the following:

  • What is Ceph?
  • Why is it a hot topic in storage?
  • Why does Mellanox, a networking company, care about Ceph, and why should Ceph customers care about networking?

I’ll answer #1 and #2 in this blog and #3 in another blog.



Figure 1: A bigfin reef squid (Sepioteuthis lessoniana) of the Class Cephalopoda


Continue reading