All posts by John F. Kim

About John F. Kim

John Kim is Director of Storage Marketing at Mellanox Technologies, where he helps storage customers and vendors benefit from high performance interconnects and RDMA (Remote Direct Memory Access). After starting his high tech career in an IT helpdesk, John worked in enterprise software and networked storage, with many years of solution marketing, product management, and alliances at enterprise software companies, followed by 12 years working at NetApp and EMC. Follow him on Twitter: @Tier1Storage

IBM Demonstrates NVMe Over Fabrics on InfiniBand with Power9 Servers and PCIe Gen 4

IBM Supports NVMe over Fabrics using Mellanox

Today, at the AI Summit New York, IBM is demonstrating a technology preview of NVMe over Fabrics using their Power9 servers, Mellanox InfiniBand connectivity, and IBM Flash Storage.

As I mentioned in my blog 3 weeks ago, during the SC17 conference, the IBM FlashSystem 900 array would be an excellent candidate to support NVMe over Fabrics. It is a superfast flash array with very low latency and already supports the SCSI RDMA Protocol (SRP) over InfiniBand connections.

IBM is a strong player in several industries and solutions that require high bandwidth and/or low latency, such as High Performance Computing (HPC), Media and Entertainment, and Database. And of course IBM has been a long-time leader and innovator in Artificial Intelligence (AI).

Figure 1: The IBM FlashSystem 900 features very low latency and is now being demonstrated with NVMe over Fabrics over InfiniBand.



Using NVMe-oF on InfiniBand to Support AI

One of their demonstrations at the AI Summit is IBM Power AI Vision, which can automatically and quickly recognize and classify objects with image recognition via neural networks. In this case, an IBM AC922 server, running the Power9 CPU, connects to five FlashSystem 900 storage arrays using networking technology from Mellanox. This includes a Mellanox Switch-IB 2 SB7800 switch, which supports InfiniBand networking at DDR, QDR, FDR, and EDR speeds (20, 40, 56, or 100Gb/s). This tech preview is achieving 41 Gigabytes/second of throughput (23GB/s of reads plus 18GB/s of writes) using a single Mellanox ConnectX-5 dual-port (2x100Gb) adapter in the server.
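The throughput figures above can be sanity-checked against the adapter's line rate. This is only a rough sketch: it assumes reads and writes travel in opposite directions on full-duplex links, so each direction has its own 2x100Gb/s budget, and it ignores protocol overhead. The variable and function names are illustrative.

```python
# Sanity check: do 23 GB/s of reads plus 18 GB/s of writes fit on one
# dual-port 100Gb/s adapter? Assumes full-duplex links (reads and writes
# use opposite directions) and ignores protocol overhead.

PORTS = 2
PORT_SPEED_GBPS = 100          # gigabits per second, per port, per direction


def gbytes_to_gbits(gbytes_per_s: float) -> float:
    """Convert gigabytes/second to gigabits/second."""
    return gbytes_per_s * 8


per_direction_gbps = PORTS * PORT_SPEED_GBPS      # budget in each direction
read_gbps = gbytes_to_gbits(23)                   # traffic toward the server
write_gbps = gbytes_to_gbits(18)                  # traffic toward the arrays

for name, used in (("reads", read_gbps), ("writes", write_gbps)):
    share = used / per_direction_gbps
    print(f"{name}: {used:.0f} of {per_direction_gbps} Gb/s ({share:.0%})")
```

Under these assumptions both directions fit within the adapter's line rate, with the read direction running close to it.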

Showcasing the Power of PCIe Gen 4

This is also one of the first demonstrations of a server connecting to the Mellanox ConnectX adapter using PCIe Gen 4 technology, which can support 2x faster data transfers per lane than PCIe Gen 3.  Mellanox networking technology offers the fastest performance whether on Ethernet or InfiniBand, and includes the first NICs and HBAs to support PCIe Gen 4 slots in servers. Mellanox is shipping end-to-end 100Gb/s Ethernet and InfiniBand solutions today, including adapters, switches, cables and transceivers, with 200Gb/s technology coming soon for both Ethernet and InfiniBand. With PCIe Gen 4, a single adapter can easily support throughput up to 200Gb/s, so it’s no surprise that the fastest storage and server vendors in the world, like IBM, have chosen Mellanox to connect their solutions together and demonstrate NVMe over Fabrics.
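To see why the PCIe generation matters for a 200Gb/s adapter, the raw slot bandwidth can be estimated from the per-lane transfer rates. The sketch below uses the published PCIe Gen 3 and Gen 4 rates (8 and 16 GT/s with 128b/130b encoding); the x16 slot width is an assumption, and packet overhead is ignored, so real throughput would be somewhat lower.

```python
# Raw PCIe bandwidth per direction for a given generation and lane count.
# Transfer rates and 128b/130b encoding are the published PCIe figures;
# the x16 slot width is an assumption, and packet (TLP) overhead is ignored.

GIGATRANSFERS_PER_LANE = {3: 8.0, 4: 16.0}   # GT/s per lane
ENCODING_EFFICIENCY = 128 / 130              # 128b/130b line encoding


def pcie_gbps(generation: int, lanes: int) -> float:
    """Usable gigabits/second per direction, before packet overhead."""
    return GIGATRANSFERS_PER_LANE[generation] * ENCODING_EFFICIENCY * lanes


NETWORK_GBPS = 200   # dual-port 100Gb/s adapter, one direction

for gen in (3, 4):
    slot = pcie_gbps(gen, lanes=16)
    verdict = "enough" if slot >= NETWORK_GBPS else "not enough"
    print(f"Gen{gen} x16: {slot:.0f} Gb/s, {verdict} for {NETWORK_GBPS} Gb/s")
```

With these numbers a Gen 3 x16 slot falls short of 200Gb/s in one direction, while a Gen 4 x16 slot has headroom.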

Upcoming G2M Webinar about NVMe Over Fabrics

  • To learn more about NVMe over Fabrics technology directions and trends, you can attend an upcoming webinar next Tuesday (December 12, 9am PST or 12pm EST) hosted by G2M, Mellanox, and other leading vendors in the NVMe and NVMe-oF space. Featured speakers include: Howard Marks of DeepStorage, Mike Heumann of G2M, and Rob Davis of Mellanox. You can register for this webinar here:


NVMe Over Fabrics on InfiniBand, the World’s Fastest Storage Network

More Vendors support NVMe over Fabrics using Mellanox

G2M Research, an analyst that specializes in solid state storage, just held a webinar on October 24th 2017, about NVMe and NVMe over Fabrics (NVMe-oF). In it, they predicted rapid growth in the NVMe market, including rising demand for specialized network adapters, and they named Mellanox as the “Highest Footprint” vendor with the largest share of these adapters. Back in August 2017 at Flash Memory Summit, IT Brand Pulse readers voted Mellanox as the leading provider of NVMe-oF network adapters. Neither of these is any surprise since Mellanox was first to market with 25, 40, 50, and 100GbE adapters and has been a longtime leader in the Remote Direct Memory Access (RDMA) technology that is currently required for NVMe-oF.

However, while most of the news about NVMe-oF has focused on Ethernet (using RoCE), some well-known storage vendors are supporting NVMe-oF over InfiniBand.


NetApp E-Series Supports NVMe-oF on InfiniBand

In September 2017, NetApp announced their new E-5700 hybrid storage array and EF-570 all-flash arrays, which both support NVMe over Fabrics (NVMe-oF) connections to hosts (servers) using EDR 100Gb/s InfiniBand. This made NetApp the first large enterprise vendor to support NVMe over Fabrics to the host, and first enterprise storage vendor to support EDR 100Gb/s InfiniBand. They are also — as far as I know — the first all-flash and hybrid arrays to support three block storage protocols with RDMA: NVMe-oF, iSER, and SRP. Rob Davis wrote about the NetApp announcement in his recent blog.

Figure 1: NetApp EF-570 supports NVMe-oF, iSER and SRP on EDR 100Gb InfiniBand.



Excelero demonstrates NVMe-oF on InfiniBand at SC17

Excelero has been a fast software-defined storage (SDS) innovator in supporting NVMe-oF, either as a disaggregated flash array or in a hyperconverged configuration. They support both 100Gb Ethernet and EDR 100Gb InfiniBand, and they are demonstrating their solution using InfiniBand in the Mellanox booth #653 at Supercomputing 2017.


Other Systems Likely to Add NVMe-oF on InfiniBand Support

In addition to the publicly declared demonstrations, there are other vendors who already support InfiniBand front-end (host) connections and could add NVMe-oF support fairly easily. For example, the IBM FlashSystem 900 has a long history of supporting InfiniBand host connections and is known for high performance and low latency, even amongst other all-flash arrays. IBM also has a strong history of delivering HPC and technical computing solutions including storage. So it wouldn’t be much of a surprise if IBM decided to add NVMe-oF support over IB to the FlashSystem 900 in the future.


InfiniBand is the World’s Fastest Storage Network

NVMe-oF allows networked access to NVMe flash devices, which themselves are faster and more efficient than SAS- or SATA- connected SSDs. Because it eliminates the SCSI layer, NVMe-oF is more efficient than iSCSI, Fibre Channel Protocol (FCP), or Fibre Channel over Ethernet (FCoE). But with more efficient storage devices and protocols, the underlying network latency becomes more important.

InfiniBand is appealing because it is the world’s fastest storage networking technology. It supports the highest bandwidth (EDR 100Gb/s shipping since 2014) and the lowest latency of any major fabric technology, <90ns port-to-port, which is far lower than 32Gb Fibre Channel and slightly lower than 100Gb Ethernet.  InfiniBand is a lossless network with credit-based flow control and built-in congestion control and QoS mechanisms.

Since the NetApp E-series arrays are very fast — positioned for “Extreme Performance”— and NetApp is targeting high-performance workloads such as analytics, video processing, high performance computing (HPC), and machine learning, it’s no surprise that the product family has long supported InfiniBand and the newest models support EDR InfiniBand.

Likewise, Excelero positions their NVMesh® to meet the needs of the most demanding enterprise and cloud-scale applications, while the IBM FlashSystem 900 is designed to accelerate demanding applications such as online transaction processing (OLTP), analytics database, virtual desktop infrastructure (VDI), technical computing applications, and cloud environments. With their focus on these applications, it makes sense that they support InfiniBand as a host connection option.

Figure 2: The IBM FlashSystem 900 already supports InfiniBand host connections and IBM has promised to add support soon for NVMe technology.


InfiniBand Supports Multi-Lingual Storage Networking

InfiniBand is a versatile transport for storage. Besides supporting NVMe-oF, it supports iSCSI Extensions for RDMA (iSER) and the SCSI RDMA Protocol (SRP). It can also be used for SMB Direct, NFS over RDMA, Ceph, and most non-RDMA storage protocols that run over TCP/IP (using IP-over-IB). One of the innovative aspects of the new NetApp E-5700 and EF-570 is that they are “trilingual” and support any of the three block storage protocols over EDR (or FDR) InfiniBand. The IBM FlashSystem 900 also supports SRP and will presumably become “bilingual” on InfiniBand storage protocols after adding NVMe-oF.


So, whether you are already using SRP for HPC or want to adopt NVMe-oF as the newest and most efficient block storage protocol (or use iSER with NetApp), Mellanox InfiniBand has you covered.



Figure 3: The Mellanox Switch-IB 2 family supports up to 36 ports at EDR 100Gb/s in a 1U switch or up to 648 ports in a chassis switch, with port-to-port latency below 90 nanoseconds.


Cloud, HPC, Media, and Database Customers Drive Demand for EDR InfiniBand

Now, who exactly needs InfiniBand connections, or any type of 100Gb connection to the storage? If most storage customers have been running on 10Gb Ethernet and 8/16Gb Fibre Channel, what would drive someone to jump to 100Gb networking? It turns out many high performance computing (HPC), cloud, media, and database customers need this high level of storage networking performance to connect to flash arrays.

HPC customers are on the cutting edge of pure performance, wanting either the most bandwidth or the lowest latency, or both. Bandwidth allows them to move more data to where it can be analyzed or used. Low latency lets them compute and share results faster. EDR InfiniBand is the clear winner either way, with the highest bandwidth and lowest latency of any storage networking fabric. Several machine learning (ML) and artificial intelligence (AI) applications also support RDMA and perform better using the super low latency of InfiniBand. And the latest servers from vendors such as Dell EMC, HPE, Lenovo, and Supermicro can all be ordered with FDR 56Gb or EDR 100Gb InfiniBand adapters (based on Mellanox technology).

Cloud customers are on the cutting edge of scale, efficiency, and disaggregation, and whatever lets them support more users, more applications, and more VMs or containers in the most efficient way. As they use virtualization to pack more applications onto each physical host, they need faster and faster networking. And as they disaggregate flash storage (by moving it from individual servers to centralized flash pools) to improve efficiency, they need NVMe-oF to access that flash efficiently. While most cloud customers are running Ethernet, there are some who have built their networks on InfiniBand so want InfiniBand-connected storage arrays to power their cloud operations.

Media and Entertainment customers are scrambling to deal with ultra-high definition (UHD) video at 4K and 8K resolutions. 4K cinema video has almost 4.3x more pixels than standard HD TV (4K TV video is slightly lower, having only 4x more pixels than HD TV). While the initial capture and final broadcast use compression, many of the editing, rendering, and special effects steps used to create your favorite TV shows and movies require dealing with uncompressed video streams, often in real-time. These uncompressed 4K streams cannot fit in an 8Gb or 10Gb pipe, and sometimes even exceed what 16Gb FC can do. This has pushed many media production customers to use FDR (56Gb) or EDR (100Gb) InfiniBand, and they need fast storage to match that.
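The pixel ratios and the pressure on 8Gb to 16Gb links can be checked with a little arithmetic. The frame sizes below are the standard HD, UHD, and DCI 4K resolutions; the bit depth (30 bits per pixel, i.e. 10-bit color with no chroma subsampling) and the frame rates are assumptions chosen for illustration, and real production formats vary.

```python
# Pixel counts and uncompressed bit rates for HD and 4K video.
# Resolutions are the standard HD, UHD (4K TV), and DCI (4K cinema) frame
# sizes; the bit depth and frame rates are assumptions for illustration.

RESOLUTIONS = {
    "HD TV": (1920, 1080),
    "4K TV (UHD)": (3840, 2160),
    "4K cinema (DCI)": (4096, 2160),
}


def pixels(name: str) -> int:
    """Pixels per frame for a named resolution."""
    width, height = RESOLUTIONS[name]
    return width * height


def uncompressed_gbps(name: str, bits_per_pixel: int, fps: int) -> float:
    """Raw video bit rate in gigabits/second, with no compression."""
    return pixels(name) * bits_per_pixel * fps / 1e9


hd = pixels("HD TV")
for name in RESOLUTIONS:
    ratio = pixels(name) / hd
    # Assumed format: 10-bit color, no chroma subsampling = 30 bits/pixel.
    rates = [uncompressed_gbps(name, 30, fps) for fps in (24, 60)]
    print(f"{name}: {ratio:.2f}x HD pixels, "
          f"{rates[0]:.1f} Gb/s at 24 fps, {rates[1]:.1f} Gb/s at 60 fps")
```

Under these assumptions a single 60 fps 4K stream needs roughly 15 to 16 Gb/s before any storage or network overhead, which is consistent with such streams overflowing 10Gb links and crowding 16Gb FC.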

Figure 4: Adoption of 4K and 8K video is driving media and entertainment companies to adopt high-speed networking for storage, including EDR InfiniBand.

Database over InfiniBand may surprise some of you, but it makes perfect sense because database servers need low latency, both between each other and to the storage. The Oracle Engineered Systems (Exadata, Exalogic, Exalytics) are designed around an InfiniBand fabric and Oracle RAC server clustering software supports InfiniBand. Even in the cloud, a majority of financial or e-commerce transactions end up going through a SQL database, and low latency for the database is critical to ensure a smooth online and e-commerce experience.


Leveraging The World’s Fastest Fabric for Storage

InfiniBand is the world’s fastest fabric with 100Gb/s today and 200Gb/s products announced and coming soon. While most of the world’s networked storage deployments are moving to Ethernet, it’s clear that when the fastest possible storage connections with the lowest latency are needed, InfiniBand is often the best choice. With new flash array support for NVMe over Fabrics, IBM and NetApp are supporting the world’s most efficient block storage protocol on top of the world’s fastest storage networking technology, and I expect the result will be many happy customers enjoying superfast storage performance.





New HPE StoreFabric M-series Switches Power Ethernet Storage Fabric

HPE Launches New Ethernet Switches for Storage

Today, Hewlett Packard Enterprise (HPE) announced their new StoreFabric M-series Ethernet switches, which are built on Mellanox Spectrum switch technology. This is an exciting new product line, specifically designed for storage workloads and ideal for building an Ethernet Storage Fabric (ESF). The switches are perfect for building fast and scalable storage networks for block, object, and file storage, as well as hyper converged infrastructure (HCI). They make it easy to start by connecting a few nodes in a rack then scale up to a full rack, and later, to hundreds of nodes connected across many racks, running at speeds from 1GbE up to 100GbE.

Figure 1: HPE StoreFabric M-series switches are ideal for building an Ethernet Storage Fabric


Why HPE Needs an Ethernet Storage Switch

HPE has long sold Fibre Channel SAN switches but this is their first Ethernet switch specifically targeted at storage. It turns out, Ethernet-connected storage is growing much more rapidly than FC-SAN connected storage, and about 80 percent of storage capacity today is well suited for Ethernet (or can only run on Ethernet).  Only 20 percent of storage capacity is the kind of Tier-1 block storage that traditionally goes on FC-SAN, and even most of that block storage can also run on iSCSI or newer block protocols such as iSER (iSCSI RDMA) and NVMe over Fabrics (NVMe-oF, over Ethernet RDMA).

If you look at HPE’s extensive storage lineup, the products which are more focused on Ethernet are growing much faster than those focused on Fibre Channel.

  • The very high-end Enterprise XP arrays are probably growing very slowly and are almost entirely FC or FCoE connected.
  • The high-end 3PAR arrays are growing modestly and are mostly FC-connected (I would guess 75 percent FC today) but their Ethernet connect rate is rising.
  • The HPE Nimble Storage arrays were growing at a robust 28 percent/year when HPE acquired Nimble, and are mostly Ethernet-connected (I’d guess at least 70 percent Ethernet).
  • The HPE Simplivity HCI solution is growing super quickly and is 100 percent Ethernet.

HPE also has key storage software partners who specialize in file storage (like Qumulo), object storage (like Scality), and hyper converged secondary storage (like Cohesity). And HPE servers also get deployed with other HCI or software-defined storage solutions such as VMware VSAN, Ceph, and Microsoft Windows Storage Spaces Direct — all products which require Ethernet networking. So, while Fibre Channel remains important to key HPE customers and storage products, most or all of the growth is in Ethernet-connected solutions. It makes perfect sense for HPE to offer a line of Ethernet switches optimized for Ethernet storage.


Figure 2: The HPE M-series switches support many kinds of storage arrays, tiers, and HPE storage partners.


There is No Fibre Channel in the Cloud

Currently, perhaps the single most powerful trend in IT is the cloud. Workloads are moving to the public cloud and enterprises are transforming their on-premises IT infrastructure to emulate the cloud to achieve similar cost savings and efficiency gains. All the major cloud providers long ago realized that Fibre Channel is too expensive, too inflexible, and too limited as a storage network for their highly-scalable, super-efficient deployments. Hence, all the public clouds run both compute and storage on Ethernet (except for those that need high performance and efficiency and therefore run on InfiniBand), and large enterprises are following suit. They are deploying more virtualization, more containers, and more hyperconverged infrastructure to increase their flexibility and agility. As enterprises build private and hybrid clouds using HPE storage and servers, it makes sense that they would look for fast, reliable HPE Ethernet switches to power their own cloud deployments.


Mellanox Spectrum is Ideal for Storage Networking

Now what kind of Ethernet switch is ideal for storage? First it must be FAST, meaning high-bandwidth, non-blocking, and with consistently low latency. As noted in my previous blogs, faster storage needs faster networks, especially for all-flash arrays. HPE is the world’s #1 enterprise storage systems vendor according to IDC (IDC Worldwide Quarterly Enterprise Storage Systems Tracker, 2Q 2017) so we can assume they sell more flash storage than just about anyone else. These faster systems need faster connections. While Fibre Channel recently reached 32Gb/s, there are already all-flash arrays on the market making full use of 100Gb Ethernet. And 100GbE delivers 3x the performance of 32Gb FC at 1/3rd the price, for a 9x advantage in price-performance.
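The 9x figure follows directly from the two ratios quoted above. A tiny sketch of the arithmetic, using the article's round numbers as inputs rather than any measured prices or benchmarks:

```python
# Price-performance from the two ratios quoted above. The inputs are the
# article's round numbers, not measured prices or benchmarks.
from fractions import Fraction

performance_ratio = Fraction(3)        # 100GbE vs 32Gb FC performance
price_ratio = Fraction(1, 3)           # 100GbE vs 32Gb FC price

# Price-performance = performance delivered per unit of price.
price_performance_advantage = performance_ratio / price_ratio
print(f"price-performance advantage: {price_performance_advantage}x")

# For reference, the ratio of the nominal link speeds alone.
nominal_bandwidth_ratio = Fraction(100, 32)
print(f"nominal bandwidth ratio: {float(nominal_bandwidth_ratio)}x")
```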

The trend amongst top storage vendors is also to support NVMe SSDs and the NVMe over Fabrics protocol, which requires higher bandwidth, lower latency, and an RDMA-capable network. HPE Servers — one of the world’s most popular server brands — already support 25, 40 and 100GbE networking (most often with Mellanox adapters), and we can assume that HPE Storage flash arrays will support faster Ethernet speeds such as 25, 40, 50, or 100GbE in the future.

This means an Ethernet storage switch needs to be ready to support these faster speeds with high bandwidth, and also with features like RDMA over Converged Ethernet (RoCE), non-blocking operation, ZeroPacketLoss, and consistently low latency. Mellanox Spectrum switches, and now HPE StoreFabric M-series switches, are best in class in all these categories.


What is an Ethernet Storage Fabric?

Beyond performance, the ideal Ethernet storage switch should offer Flexibility and Efficiency. That means efficient form-factors that support many ports in just one RU of space. It should support all Ethernet speeds and allow easy upgrades to port speeds, port counts, features, and the network architecture. And, of course, it should have low power consumption, be easy to manage, and be affordable, with flexible pricing and financing.

The HPE StoreFabric M-series switches combine the best of Mellanox and HPE innovation and technology. The unique form factors allow high-availability and up to 128 ports (at 10/25GbE speeds) in one RU of space. The switches deliver consistently low latency across all speeds and port combinations, letting different server and storage nodes use different speeds without any performance penalty. They support speeds up to 100GbE and have the best support for Ethernet RDMA, traffic isolation, security, telemetry, and Quality of Service (QoS).

Figure 3: HPE StoreFabric M-series Switches Support an Ethernet Storage Fabric


StoreFabric M-series Make Your Storage Network Future-Proof

Thanks to flexible licensing, customers can start with as few as 8 ports per switch and upgrade the port count as needed. The same switches can be used to grow storage networks from one rack to many racks with hundreds of servers and ports, without needing to discard or replace any of the original switches. The port speeds can be upgraded easily from 10 to 25 to 40/50 to 100GbE speeds and the switch is ready to support advanced storage protocols.

Even better, the M-series switches are designed to allow software upgrades and future integrations with specific storage, server, or cloud management tools. This means your network infrastructure investment in HPE M-series switches today will support multiple generations of HPE servers and storage arrays, making your storage network future-proof.

To learn more about the amazing new HPE StoreFabric M-series switches, contact your HPE channel partner or HPE sales rep today!

Figure 4: Upgradable port speeds, network architecture, and switch software make the HPE M-series switches future-proof.





The Best Flash Array Controller Is a System-on-Chip called BlueField

As the storage world turns to flash and flash turns to NVMe over Fabrics, the BlueField SoC could be the most highly integrated and most efficient flash controller ever. Let me explain why.

The backstory—NVMe Flash Changes Storage

Dramatic changes are happening in the storage market. This change comes from NVMe over Fabrics, which comes from NVMe, which comes from flash. Flash has been capturing more and more of the storage market. IDC reported that in Q2 2017, the all-flash array (AFA) revenue grew 75% YoY while the overall external enterprise storage array market was slightly down. In the past this flash consisted of all SAS and SATA solid state drives (SSDs), but flash and SSDs have long been fast enough that the SATA and SAS interfaces imposed bandwidth bottlenecks and extra latency.


Figure 1: SATA and SAS controllers can cause a bottleneck and result in higher latency.


The SSD vendors developed the Non-Volatile Memory Express (NVMe) standard and command set (version 1.0 released March 2011), which run over a PCIe interface. NVMe allows higher throughput, up to 20Gb/s per SSD today (and more in the near future), and lower latency. It eliminates the SAS/SATA controllers and requires PCIe connections, typically 4 PCIe Gen 3 lanes per SSD. Many servers deployed with local flash now enjoy the higher performance of NVMe SSDs.
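A rough comparison of the interface ceilings shows why SAS and SATA became the bottleneck. The sketch below uses the published line rates and encodings for SATA III, SAS-3, and PCIe Gen 3; it considers a single port per device and ignores protocol overhead above the line encoding, so the numbers are upper bounds rather than measured drive throughput.

```python
# Per-device interface ceilings: SATA and SAS versus a 4-lane PCIe Gen 3
# NVMe link. Line rates and encodings are the published figures for SATA III,
# SAS-3, and PCIe Gen 3; one port per device is assumed and protocol
# overhead above the encoding is ignored.

INTERFACES = {
    # name: (line rate in Gb/s per lane, lanes, encoding efficiency)
    "SATA III": (6.0, 1, 8 / 10),
    "SAS-3": (12.0, 1, 8 / 10),
    "NVMe (PCIe Gen3 x4)": (8.0, 4, 128 / 130),
}


def usable_gbps(name: str) -> float:
    """Gigabits/second left after line encoding for the named interface."""
    rate, lanes, efficiency = INTERFACES[name]
    return rate * lanes * efficiency


for name in INTERFACES:
    print(f"{name}: {usable_gbps(name):.1f} Gb/s")
```

The 4-lane PCIe link leaves room above the 20Gb/s per-SSD figure mentioned above, while neither a single SATA nor a single SAS-3 port could carry it.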


How to Share Fast SSD Goodness

But local flash deployed this way is “trapped in the server” because each server can only use its own flash. Different servers need different amounts of flash at different times, but with a local model you must overprovision enough flash in each server to support the maximum that might be needed, even if you need the extra flash for only a few hours at some point in the future. The answer over the last 20 years has been to centralize and network the storage using iSCSI, Fibre Channel Protocol, iSER (iSCSI over RDMA), or NAS protocols like SMB and NFS.

But these all use either SCSI commands or file semantics and were not optimized for flash performance, so they can deliver good performance but not the best possible performance. As a result the NVMe community, including Mellanox, created NVMe over Fabrics (NVMe-oF) to allow fast, efficient sharing of NVMe flash over a fabric. It allows the lean and efficient NVMe commands to operate across an RDMA network with protocols like RoCE and InfiniBand. And it maintains the efficiency and low latency of NVMe while allowing sharing, remote access, replication, failover, etc.  A good overview of NVMe over Fabrics is in this YouTube video:


Video 1: An overview of how NVMe over Fabrics has Evolved

NVMe over Fabrics Frees the Flash But Doesn’t Come Free

Once NVMe-oF frees the flash from the server, you now need an additional CPU to run NVMe commands in a Just-A-Bunch-of-Flash (JBOF) box, plus more CPU power if it’s a storage controller running storage software. You need DRAM to store the buffers and queues. You need a PCIe switch to connect to the SSDs. And you need rNICs that can handle RDMA at high enough speeds to support all the fast NVMe SSDs. In other words, you have to build a complete server design with enhanced internal and external connectivity to support this faster storage. For a storage controller this is not unusual, but for a JBOF it’s more complex and costly than what vendors are accustomed to doing with SAS or SATA HBAs and expanders, which don’t require CPUs, DRAM, PCIe switches, or rNICs.

Also, since NVMe SSDs and the NVMe over Fabrics protocol are inherently low latency, the latency of everything else in the system (software, network, HBAs, cache or DRAM access, etc.) becomes more prominent, and reducing latency in those areas becomes more critical.

A New SoC Is the Most Efficient Way to Drive NVMe-oF

Fortunately there is a new way to build NVMe-oF systems: a single chip that provides everything needed, other than the SSDs and the DRAM DIMMs; it is the Mellanox BlueField.  It includes:

  • ConnectX-5 high-speed NIC (up to 2x100Gb/s ports, Ethernet or InfiniBand),
  • Up to 16 ARM A72 (64-bit) CPU cores,
  • A built-in PCIe switch (32 lanes at Gen3/Gen4),
  • DRAM controller & coherent cache
  • A fast mesh fabric to connect it all


Figure 2: BlueField (logical design illustration) includes networking, CPU cores, cache, DRAM controllers, and a PCIe switch all on one chip.


The embedded ConnectX-5 delivers not just 200Gb/s of network bandwidth but all the features of ConnectX-5, including RDMA and NVMe protocol offloads. This means the NVMe-oF data traffic can go directly from SSD to NIC (or NIC to SSD) without interrupting the CPU. It also means overlay network encapsulation (like VXLAN), virtual switch features (such as OVS), erasure coding, T10 Data Integrity Field (T10-DIF) signatures, and stateless TCP offloads can all be processed by the NIC without involving the CPU cores. The CPU cores remain free to run storage software, security, encryption, or other functionality.


The fast mesh internal fabric enables near-instantaneous data movement between the PCIe, CPU, cache and networking elements as needed, and operates much more efficiently than a classic server design where traffic between the SSDs and NIC(s) must traverse the PCIe switch and DRAM multiple times for each I/O. With this design, NVMe-oF data traffic queues and buffers can be handled completely in the on-chip cache and don’t need to go to the external DRAM, which is only needed if additional storage functions running on the CPU cores are applied to the data. Otherwise the DRAM can be used for control plane traffic, reporting, and management. The PCIe switch supports up to 32 lanes of either Gen3 or Gen4, so it can transfer more than 200Gb/s of data to/from SSDs and is ready for the new PCIe Gen4-enabled SSDs expected to arrive in 2018. (PCIe Gen4 can transfer 2x more traffic per lane than PCIe Gen3.)
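The lane budget behind the "more than 200Gb/s" claim can be sketched as follows. The 32 lanes, 4 lanes per SSD, and 20Gb/s per SSD figures come from the text; attaching every SSD directly at x4 with no further PCIe fan-out is an assumption, and the raw PCIe numbers ignore packet overhead.

```python
# Lane budget for a 32-lane PCIe switch feeding a 200Gb/s network interface.
# The 32 lanes, x4 per SSD, and 20 Gb/s per SSD figures come from the text;
# attaching SSDs directly with no further PCIe fan-out is an assumption.
import math

SWITCH_LANES = 32
LANES_PER_SSD = 4
SSD_GBPS = 20            # per-SSD throughput quoted for today's NVMe drives
NETWORK_GBPS = 200       # 2 x 100Gb/s ports

# 128b/130b-encoded lane rates in Gb/s for each PCIe generation.
LANE_GBPS = {"Gen3": 8 * 128 / 130, "Gen4": 16 * 128 / 130}

direct_ssds = SWITCH_LANES // LANES_PER_SSD
ssds_to_fill_network = math.ceil(NETWORK_GBPS / SSD_GBPS)
raw_gbps = {gen: lane * SWITCH_LANES for gen, lane in LANE_GBPS.items()}

print(f"SSDs attachable at x4: {direct_ssds}")
print(f"SSDs needed to fill the network at {SSD_GBPS} Gb/s each: "
      f"{ssds_to_fill_network}")
for gen, total in raw_gbps.items():
    print(f"{gen} x{SWITCH_LANES}: {total:.0f} Gb/s raw")
```

Under these assumptions the PCIe side has more raw bandwidth than the network ports at either generation, so the limit is set by how many SSDs are attached and how fast each one is.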

BlueField is the FIRST SoC to include all these features and performance, making it uniquely well-suited to control flash arrays, in particular NVMe-oF arrays and JBOFs.


BlueField Is the Most Integrated NVMe-oF Solution

We’ve seen that in the flash storage world, performance is very important. But simplicity of design and controlling costs are also important. By combining all the components of an NVMe-oF server into a single chip, BlueField makes the flash array design very simple and lowers the cost, while also allowing a smaller footprint and lower power consumption.

Figure 3: BlueField (logical design illustration) includes networking, CPU cores, cache, DRAM controllers, and a PCIe switch all on one chip.


Vendors Start Building Storage Solutions Based on BlueField

Not surprisingly, key Original Design Manufacturers (ODMs) and storage Original Equipment Manufacturers (OEMs) are already designing storage solutions based on BlueField SoC. Mellanox is also working with key partners to create more BlueField solutions for network processing, cloud, security, machine learning, and other non-storage use cases. Mellanox has created a BlueField Storage Reference Platform that can handle many NVMe SSDs and serve them up using NVMe over Fabrics using BlueField. This is the perfect development and reference platform to help customers and partners test and develop their own BlueField-powered storage controllers and JBOFs.

Figure 4: The BlueField Reference System helps vendors and partners quickly develop BlueField-based storage systems.


BlueField is the Best Flash Array Controller

The optimized performance and tight integration of all the needed components make BlueField the perfect flash array controller, especially for NVMe-oF storage arrays and JBOFs. Designs using BlueField will deliver more flash performance at lower cost and using less power than standard server-based designs.

You can see the BlueField SoC and BlueField Storage Reference Platform this week (August 8-10) at Flash Memory Summit, in the Santa Clara Convention Center, in the Mellanox booth #138.





The Ideal Network for Containers and NFV Microservices

Containers are the New Virtual Machine

Containers represent a hot trend in cloud computing today. They allow virtualization of servers and portability of applications minus the overhead of running a hypervisor on every host and without a copy of the full operating system in every virtual machine. This makes them more efficient than using full virtual machines. You can pack more applications on each server with containers than with a hypervisor.

Figure 1: Containers don’t replicate the entire OS for each application so have less overhead than virtual machines. Illustration courtesy of Docker, Inc. and RightScale, Inc.


Containers Make it Easy to Convert Legacy Appliances Into Microservices

Because they are more efficient, containers also make it easier to convert legacy networking appliances into Virtualized Network Functions (VNF) and into microservices. It’s important to understand that network function virtualization (NFV) is not the same as re-architecting functions as microservices, but that the two are still highly complementary.


Figure 2: Docker Swarm and Kubernetes are tools to automate deployment of containers. Using containers increases IT and cloud flexibility but puts new demands on the network.


The Difference Between Microservices and Plain Old NFV

Strictly speaking, NFV simply replaces a dedicated appliance with the same application running as a virtual machine or container. The monolithic app remains monolithic and must be deployed in the same manner as if it were still on proprietary hardware, except it’s now running on commercial off-the-shelf (COTS) servers. These servers are cheaper than the dedicated appliances but performance is often slower, because generic server CPUs generally are not great at high-speed packet processing or switching.

Microservices means disaggregating the parts of a monolithic application into many small parts that can interact with each other and scale separately. Suppose my legacy appliance inspects packets, routes them to the correct destination, and analyzes suspicious traffic. As I deploy more appliances, I get these three capabilities in exactly the same ratio, even though one particular customer (or week, or day) might require substantially more routing and very little analysis, or vice versa. However, if I break my application into specific components, or microservices that interoperate with each other, then I can scale only the services that are needed. Deploying microservices in containers means it’s easy to add, reduce, or change the mix and ratio of services running from customer to customer, or even hour to hour. It also makes applications faster to deploy and easier to develop and update, because individual microservices can be designed, tested, deployed or updated quickly without affecting all the other services.
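The scaling argument can be made concrete with a toy model of the three-function appliance described above. All demand and capacity numbers here are made-up illustration values, and the function names are hypothetical; the point is only that a monolithic appliance scales every function to match the busiest one, while microservices scale each function on its own.

```python
# Scaling three network functions as monolithic appliances versus as
# separately scaled microservices. All demand and capacity numbers are
# made-up illustration values, not measurements.
import math

# Work each function must handle, in arbitrary units.
demand = {"inspect": 40, "route": 90, "analyze": 5}
# Work one instance of each function can handle.
capacity = {"inspect": 10, "route": 10, "analyze": 10}


def monolithic_appliances(demand: dict, capacity: dict) -> int:
    """Every appliance bundles all functions, so the busiest one sets the count."""
    return max(math.ceil(demand[f] / capacity[f]) for f in demand)


def microservice_instances(demand: dict, capacity: dict) -> dict:
    """Each function scales on its own."""
    return {f: math.ceil(demand[f] / capacity[f]) for f in demand}


appliances = monolithic_appliances(demand, capacity)
instances = microservice_instances(demand, capacity)

# Each appliance runs one copy of every function.
print(f"monolithic: {appliances} appliances = "
      f"{appliances * len(demand)} function copies")
print(f"microservices: {instances} = {sum(instances.values())} instances")
```

In this routing-heavy example the monolithic deployment carries many idle copies of the inspection and analysis functions, which is the overprovisioning that separately scaled microservices avoid.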

So, NFV moves network functions from dedicated appliances to COTS servers and microservices disaggregates monolithic functions into scalable components. Doing both gives cloud service providers total flexibility in choosing which services are deployed and what hardware is used. But one more critical element must be considered in the quest for total infrastructure efficiency: NFV-optimized networking.

Figure 3: Plain NFV uses monolithic apps on commodity servers. Microservices decomposes apps into individual components that can be scaled separately.


Microservices and Containers Require the Right Network Infrastructure

When you decompose monolithic applications into microservices, you place greatly increased demand on the network. Monolithic apps connect their functions within one server so there is little or no east-west traffic — all traffic is north-south to and from the clients or routers. But, an app consisting of disaggregated microservices relies on the network for inter-service communication and can easily generate several times more east-west traffic than north-south traffic. Much of this traffic can even occur between containers on the same physical host, thereby taxing the virtual switch running in host software.
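
A rough way to see where the extra east-west traffic comes from is to count messages. The sketch below assumes every call is one request plus one response, and the figure of three service-to-service calls per client request is a hypothetical value, not a measurement.

```python
def traffic_messages(requests, internal_calls_per_request):
    """Count messages, assuming each call is one request plus one response."""
    north_south = 2 * requests                              # client <-> application
    east_west = 2 * requests * internal_calls_per_request   # service <-> service
    return north_south, east_west

# Monolith: functions call each other in-process, so nothing crosses the network.
ns_mono, ew_mono = traffic_messages(1000, 0)

# Microservices: hypothetically, each client request triggers 3 inter-service calls.
ns_micro, ew_micro = traffic_messages(1000, 3)

print(ew_mono, ew_micro / ns_micro)  # prints: 0 3.0
```

The north-south load is the same in both cases; only the east-west load grows with the number of inter-service calls.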

Figure 4: Changing to a microservices design allows flexibility to deploy exactly the services that are needed but greatly increases east-west network traffic, mandating the use of robust and reliable switches.

Moving to COTS servers also poses a performance challenge because the proprietary appliances use purpose-built chips to accelerate their packet processing, while general-purpose x86 CPUs require many cycles to process packet streams, especially for small packets.

The answer to both challenges is deploying the right networking hardware. The increased east-west traffic demands a switch that is not only fast and reliable, but able to handle micro-bursts of traffic while also fairly allocating performance across ports. Many Ethernet switches use merchant silicon that only delivers the advertised bandwidth for larger packet sizes, or only when certain combinations of ports are used. They might drop packets unexpectedly under load or switch from cut-through networking to store-and-forward networking, which will greatly increase network latency. The main problem with these switches is that performance becomes unpredictable: sometimes it’s good and sometimes it’s bad, and this makes supporting cloud service level agreements impossible. On the other hand, choosing the right switch ensures good throughput and low latency across all packet sizes and port combinations, which also eliminates packet loss during traffic microbursts.

Figure 5: Mellanox Spectrum has up to 8x better microburst absorption capability than the Broadcom Tomahawk silicon used in many other switches. Spectrum also delivers full rated bandwidth at all packet sizes without any avoidable packet loss.

Separately from the switch, an optimized smart NIC such as the Mellanox ConnectX®-4 includes Single Root I/O Virtualization (SR-IOV) and an internal Ethernet switch, or eSwitch, to accelerate network functions. These features let each container access the NIC directly and can offload inter-container traffic from the software virtual switch using an Open vSwitch (OVS) offload technology called ASAP2. These smart NICs also offload the protocol translation for overlay networks, such as VXLAN, NVGRE, and Geneve, which are used to provide improved container isolation and mobility. These features and offloads greatly accelerate container networking performance while reducing the host’s CPU utilization. Faster networking plus more available CPU cycles enables more containers per host, improving cloud scalability and reducing costs.

Figure 6: ASAP2 offloads packet processing from the software vSwitch to a hardware-accelerated eSwitch in the NIC, greatly accelerating container network performance.


Medallia Deploys Microservices Using Containers

Medallia provides a great case study of a modern cloud services provider that has embraced containers and advanced networking, in order to deliver Customer Feedback Management as Software-as-a-Service (SaaS). Medallia enables companies to track and improve their customers’ experiences. Every day, Medallia must capture and analyze online and social media feedback from millions of interactions and deliver real-time analysis and reporting, including personalized dashboards to thousands of their customers’ employees.  Medallia wanted to run their service on commodity hardware using open standards and fully-automated provisioning. They also wanted full portability of any app, service, or networking function, making it easy to move, replace, or relaunch any function on any hardware.


To accomplish all this, they designed a software-defined, scalable cloud infrastructure using microservices and containers on the following components:

  • Docker for container management
  • Aurora, Mesos, and Bamboo for automation
  • Ceph for storage
  • Ubuntu Linux for compute servers and Cumulus Linux for networking
  • Mellanox ConnectX-4 Lx 50GbE adapters
  • Mellanox Spectrum switches running Cumulus Linux (50GbE to servers, 100GbE for aggregation)

Figure 7 and Video 1: Medallia uses containers, Cumulus Linux, and Ceph running on Mellanox adapters and switches to deliver a superior cloud SaaS to their customers.


Medallia found that using end-to-end Mellanox networking hardware to underlay their containers and microservices resulted in faster performance and a more reliable network. Their Ceph networked storage performance matched that of their local storage, and they were able to automate network management tasks and reduce the number of network cables per rack. All of this enables Medallia to deliver a better SaaS to their cloud customers, who, in turn, learn how to be better listeners and vendors to their own retail customers.

Mellanox is the Container Networking Company

The quest for NFV and containerization of microservices is a noble one that increases flexibility and lowers hardware costs. However, to do this correctly, cloud service providers need networking solutions like Mellanox ConnectX-4 adapters and Spectrum switches. Using the right network hardware ensures fast, reliable and secure performance from containers and VNFs, making Mellanox the ideal NFV and Container Networking Company.





Excelero Unites NVMe Over Fabrics With Hyper-Converged Infrastructure

Two Hot IT Topics Standing Alone, Until Now…

Two of the hottest topics and IT trends right now are hyper-converged infrastructure (HCI) and NVMe Over Fabrics (NVMe-oF).  The hotness of HCI is evident in the IPO of Nutanix in September and HPE’s acquisition of Simplivity in January 2017. The interest in NVMe-oF has been astounding with all the major storage vendors working on it and all the major SSD vendors promoting it as well.

But the two trends have been completely separate—you could do one, the other, or both, but not together in the same architecture. HCI solutions could use NVMe SSDs but not NVMe-oF, while NVMe-oF solutions were being deployed either as separate, standalone flash arrays or NVMe flash shelves behind a storage controller. There was no easy way to create a hyper-converged solution using NVMe-oF.


Excelero NVMesh Combines NVMe-oF with HCI

Now a new solution launched by Excelero combines the low latency and high throughput of NVMe-oF with the scale-out and software-defined power of HCI. Excelero does this with a technology called NVMesh that takes commodity server, flash, and networking technology and connects it in a hyper-converged configuration using an enhanced version of the NVMe-oF protocol. With this solution, each node can act both as an application server and as a storage target, making its local flash storage accessible to all the other nodes in the cluster. It also supports a disaggregated flash model so customers have a choice between scale-out converged infrastructure and a traditional centralized storage array.

Figure 1: Excelero NVMesh combines NVMe-oF with HCI (much like combining peanut butter and chocolate into one tasty treat).



Remote Flash Access Without the Usual CPU Penalties

NVMesh creates a virtualized pool of block storage using the NVMe SSDs on each server and leverages a technology called Remote Direct Drive Access (RDDA) to let each node access flash storage remotely.   RDDA itself builds on top of industry-standard Remote Direct Memory Access (RDMA) networking to maintain the low latency of NVMe SSDs even when accessed over the network fabric.  The virtualized pools allow several NVMe SSDs to be accessed as one logical volume by either local or remote applications.

In a traditional hyper-converged model, the storage sharing consumes some part of the local CPU cycles, meaning they are not available for the application. The faster the storage and the network, the more CPU is required to share the storage. RDDA avoids this by allowing the NVMesh clients to directly access the remote storage without interrupting the target node’s CPU. This means high performance—whether throughput or IOPS—is supported across the cluster without eating up all the CPU cycles.


Recent testing showed a 4-server NVMesh cluster with 8 SSDs per server could support several million 4KB IOPS or over 6.5GB/s (>50Gb/s)—very impressive results for a cluster that size.
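
The unit conversions behind figures like these are easy to check. The sketch below uses decimal units (1 GB = 1,000,000 KB) and 8 bits per byte; the one-million-IOPS input is a hypothetical round number, not a result from the test.

```python
def gbytes_to_gbits(gbytes_per_s):
    """Convert gigabytes/second to gigabits/second (8 bits per byte)."""
    return gbytes_per_s * 8

def throughput_gbytes_per_s(iops, block_kb=4):
    """Throughput in GB/s for a given IOPS rate, using decimal units."""
    return iops * block_kb / 1e6

print(gbytes_to_gbits(6.5))                 # prints: 52.0
print(throughput_gbytes_per_s(1_000_000))   # prints: 4.0
```

So 6.5GB/s corresponds to 52Gb/s on the wire, and every million 4KB IOPS moves roughly 4GB/s of data.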

Figure 2: NVMesh leverages RDDA and RDMA to allow fast storage sharing with minimal latency and without consuming CPU cycles on the target. The control path passes through the management module and CPUs but the data path does not, eliminating potential performance bottlenecks.


Integrates with Docker and OpenStack

Another feature NVMesh has over the standard NVMe-oF 1.0 protocol is that it supports integration with Docker and OpenStack. NVMesh includes plugins for both Docker Persistent Volumes and Cinder, which makes it easy to support and manage container and OpenStack block storage. In a world where large clouds increasingly use either OpenStack or Docker, this is a critical feature.

Figure 3: Excelero’s NVMesh includes plug-ins for both Docker and OpenStack Cinder, making it easy to use it for both container and cloud block storage.



Another Step Forward in the NVMe-oF Revolution

The launch of Excelero’s NVMesh is an important step forward in the ongoing revolution of NVMe over Fabrics. The open source solution supports high performance but only with a centralized storage solution and without many important storage features. The NVMe-oF array solutions offer a proven appliance solution but some customers want a software-defined storage option built on their favorite server hardware.  Excelero offers them all of these features together: hyper-converged infrastructure, NVMe over Fabrics technology, and software-defined storage.



Storage Predictions for 2017

Looking at what’s to come for storage in 2017, I find three simple and easy predictions which lead to three more complex predictions.  Let’s start with the easy ones:

  • Flash keeps taking over
  • NVMe over Fabrics remains the hottest storage technology
  • Cloud continues to eat the world of IT


Flash keeps taking over

Every year, for the past four years, has been “The Year Flash Takes Over” and every year flash owns a growing minority of storage capacity and spend, but it’s still in the minority. 2017 is not the year flash surpasses disk in spending or capacity (there’s simply not enough NAND fab capacity yet), but it is the year all-flash arrays go mainstream. SSDs are now growing in capacity faster than HDDs (15TB SSD recently announced) and every storage vendor offers an all-flash flavor. New forms of 3D NAND are lowering price/TB on one side to compete with high capacity disks while persistent memory technologies like 3D XPoint (while not actually built on NAND flash) are increasing SSD performance even further above that of disk. HDDs will still dominate low price, high-capacity storage for some years, but are rapidly becoming a niche technology.


Figure 1: TrendFocus 2015 chart shows worldwide hard drive shipments have fallen since 2010. Flash is one major reason, cloud is another.


According to IDC (Worldwide Quarterly Enterprise Storage Systems Tracker, September 2016) in Q2 2016 the all-flash array (AFA) market grew 94.5% YoY while the overall enterprise storage market grew 0%, giving AFAs 19.4% of the external (outside the server) enterprise storage systems market. This share will continue to rise.


Figure 2: Wikibon 2015 forecast predicts 4-year TCO of flash storage dropped below that of hard disk storage in 2016. 


NVMe over Fabrics (NVMe-oF) remains the hottest storage technology

It’s been a hot topic since 2014 and it’s getting hotter, even though production deployments are not yet widespread. The first new block storage protocol in 20 years has all the storage and SSD vendors excited because it makes their products and the applications running on them work better.  At least 4 startups have NVMe-oF products out with POCs in progress, while large vendors such as Intel, Samsung, Seagate, and Western Digital are demonstrating it regularly. Mainstream storage vendors are exploring how to use it while Web 2.0 customers want it to disaggregate storage, moving flash out of each individual server into more flexible, centralized repositories.

It’s so hot because it helps vendors and customers get the most out of flash (and other non-volatile memory) storage. Analyst G2M, Inc. predicts the NVMe market will exceed $57 Billion by 2020, with a compound annual growth rate (CAGR) of 95%. They also say 40% of AFAs will use NVMe SSDs by 2020, and hundreds of thousands of those arrays will connect with NVMe over Fabrics.
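
For readers who want to see what a 95% CAGR implies, here is a small Python sketch of the compound growth formula. The base year is not stated above, so the four-year period (2016 to 2020) is an assumption made only for illustration.

```python
def project(start_value, cagr, years):
    """Value after compounding at a fixed annual growth rate."""
    return start_value * (1 + cagr) ** years

def cagr(start_value, end_value, years):
    """Compound annual growth rate that links a start and an end value."""
    return (end_value / start_value) ** (1 / years) - 1

# Assumption: 4 years of growth. Growing at 95% per year multiplies the
# market by about 14.5x, so $57B in 2020 would imply a base of roughly $3.9B.
years = 4
multiplier = project(1.0, 0.95, years)
implied_start_billion = 57 / multiplier

print(round(multiplier, 2), round(implied_start_billion, 2))
```

A different base year would change the implied starting point, but not the formula.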


Figure 3: G2M predicts incredibly fast growth for NVMe SSDs, servers, appliances, and NVMe over Fabrics.


Cloud continues to eat the world of IT 

Nobody is surprised to hear cloud is growing faster than enterprise IT. IDC reported cloud (public + private) IT spending for Q2 2016 grew 14.5% YoY while traditional IT spending shrank 6% YoY. Cloud offers greater flexibility and efficiency, and in the case of public cloud the ability to replace capital expense investments with a pure OpEx model.

It’s not a panacea, as there are always concerns about security, privacy, and speed of access. Also, larger customers often find that on-premises infrastructure — often set up as private cloud — can cost less than public cloud in the long run. But there is no doubting the inexorable shift of projects, infrastructure, and spending to the cloud. This shift affects compute (servers), networking, software, and storage, and drives both cloud and enterprise customers to find more efficient solutions that offer lower cost and greater flexibility.


Figure 4: IDC Forecasts cloud will consume >40% of IT infrastructure spending by 2020. Full chart available at:


OK Captain Obvious, Now Make Some Real Predictions!

Now let’s look at the complex predictions which are derived from the easy ones:

  • Storage vendors consolidate and innovate
  • Fibre Channel continues its slow decline
  • Ceph grows in popularity for large customers
  • RDMA becomes more prevalent in storage


Traditional storage vendors consolidate and innovate

Data keeps growing at over 30% per year but spending on traditional storage is flat. This is forcing vendors to fight harder for market share by innovating more quickly to make their solutions more efficient, flexible, flash-focused, and cloud-friendly. Vendors that previously offered only standalone arrays are offering software-defined options, cloud-based storage, and more converged or hyper-converged infrastructure (HCI) options. For example, NetApp offers options to replicate or back up data from NetApp boxes to Amazon Web Services, while Dell/EMC, HDS, and IBM all sell converged infrastructure racks. In addition, startup Zadara Storage offers enterprise storage-as-a-service running either in the public cloud or as on-premises private cloud.

Meanwhile, major vendors all offer software versions of some of their products instead of only selling hardware appliances. For example, EMC ScaleIO, IBM Spectrum Storage, IBM Cloud Object Storage (formerly CleverSafe), and NetApp ONTAP Edge are all available as software that runs on commodity servers.

The environment for flash startups is getting tougher because all the traditional vendors now offer their own all-flash flavors. There are still startups making exciting progress in NVMe over Fabrics, object storage, hyper-converged infrastructure, data classification, and persistent memory, but only a few can grow into profitability on their own. 2017 will see a round of acquisitions as storage vendors who can’t grow enough organically look to expand their portfolios in these areas.


Fibre Channel Continues its Downward Spiral

One year ago I wrote a blog about why Fibre Channel (FC) is doomed and all signs (and analyst forecasts) point to its continued slow decline. All the storage trends around efficiency, flash, performance, big data, Ceph, machine learning, object storage, containers, HCI, etc. are moving against Fibre Channel. (Remember the “Cloud Eats the World” chart above? They definitely don’t want to use FC either.) The only thing keeping FC hopes alive is the rapid growth of all-flash arrays, which deploy mostly FC today because they are replacing legacy disk or hybrid FC arrays. But even AFAs are trending to using more Ethernet and InfiniBand (occasionally direct PCIe connections) to get more performance and flexibility at lower cost.

The FC vendors know the best they can hope for is to slow the rate of decline, so all of them have been betting on growing their Ethernet product lines. More recently the FC vendors (Emulex, QLogic, Brocade) have been acquired by larger companies, not as hot growth engines but so the larger companies can milk the cash flow from the expensive FC hardware before their customers convert to Ethernet and escape.


Ceph grows in Popularity for Large Customers

Ceph — both the community version and Red Hat Ceph Storage — continues to gain fans and use cases. Originally seen as suited only for storing big content on hard drives (low-cost, high-capacity storage), it’s now gained features and performance making it suitable for other applications. Vendors like Samsung, SanDisk (now WD), and Seagate are demonstrating Ceph on all-flash storage, while Red Hat and Supermicro teamed up with Percona to show Ceph works well as database storage (and is less expensive than Amazon storage for running MySQL).  I wrote a series of blogs on Ceph’s popularity, optimizing Ceph performance, and using Ceph for databases.

Ceph is still the only storage solution that is software-defined, open source, scale-out and offering enterprise storage features (though Lustre is approaching this as well). Major contributors to Ceph development include not just Red Hat but also Intel, the drive/SSD makers, Linux vendors (Canonical and SUSE), Ceph customers, and, of course, Mellanox.

In 2016, Ceph added features and stability to its file/NAS offering, CephFS, as well as major performance improvements for Ceph block storage. In 2017, Ceph will improve performance, management, and CephFS even more while also enhancing RDMA support. As a result, its adoption will grow beyond its traditional base to add Telcos, cable companies, and large enterprises who want a scalable software-defined storage solution for OpenStack.



RDMA More Prevalent in Storage

RDMA, or Remote Direct Memory Access, has actually been prevalent in storage for a long time as a cluster interconnect and for HPC storage. Just about all the high-performance scale-out storage products use Mellanox-powered RDMA for their cluster communications — examples include Dell FluidCache for SAN, EMC XtremIO, EMC VMAX3, IBM XIV, InfiniDat, Kaminario, Oracle Engineered Systems, Zadara Storage, and many implementations of Lustre and IBM Spectrum Scale (GPFS).

The growing use of flash media and intense interest in NVMe-oF are accelerating the move to RDMA. Faster storage requires faster networks, not just more bandwidth but also lower latency, and in fact the NVMe-oF spec requires RDMA to deliver its super performance.


Figure 5: Intel presented a chart at Flash Memory Summit 2016 showing how the latency of storage devices is rapidly decreasing, leading to the need to decrease software and networking latency with higher-speed networks (like 25GbE) and RDMA.

In addition to the exploding interest in NVMe-oF, Microsoft has improved support for RDMA access to storage in Windows Server 2016, using SMB Direct and Windows Storage Spaces Direct, and Ceph RDMA is getting an upgrade. VMware has enhanced support for iSER (iSCSI Extensions for RDMA) in vSphere 2016, and more storage vendors like Oracle (in tape libraries) and Synology have added iSER support to enable accelerated client access. On top of this, multiple NIC vendors (not just Mellanox) have announced support for RoCE (RDMA over Converged Ethernet) on 25, 40, 50, and 100Gb Ethernet speeds. These changes all mean more storage vendors and storage deployments will leverage RDMA in 2017.


So Let’s Get This Party Started

2017 promises to be a super year for storage innovation. With technology changes, disruption, and consolidation, not every vendor will be a winner and not every storage startup will find hockey-stick growth and riches, but it’s clear the storage hardware and software vendors are working harder than ever, and customers will be big winners in many ways.



Ceph For Databases? Yes You Can, and Should

Ceph is traditionally known for both object and block storage, but not for database storage. While its scale-out design supports both high capacity and high throughput, the stereotype is that Ceph doesn’t support the low latency and high IOPS typically required by database workloads.

However, recent testing by Red Hat, Supermicro, and Percona (one of the top suppliers of MySQL database software) shows that Red Hat Ceph Storage actually does a good job of supporting database storage, especially when running it on multiple VMs, and it does very well compared to running MySQL on Amazon Web Services (AWS).

In fact, Red Hat was a sponsor of Percona Live Europe last week in Amsterdam, and it wasn’t just to promote Red Hat Enterprise Linux. Sr. Storage Architect Karan Singh presented a session “MySQL and Ceph: A tale of two friends.”



Figure 1: This shadowy figure with the stylish hat has been spotted storing MySQL databases in a lab near you.


MySQL Needs Performance, But Not Just Performance

The front page of the Percona Europe web site says “Database Performance Matters,” and so it does. But there are multiple ways to measure database performance—it’s not just about running one huge instance of MySQL on one huge bare metal server with the fastest possible flash array. (Just in case that is what you want, check out conference sponsor Mangstor, who offer a very fast flash array connected using NVMe Over Fabrics.)  The majority of MySQL customers also consider other aspects of performance:

  • Performance across many instances: Comparing aggregate performance of many instances instead of just one large MySQL instance
  • Ease of deployment: The ability to spin up, manage, move and retire many MySQL instances using virtual machines.
  • Availability: Making sure the database keeps running even in case of hardware failure, and can be backed up and restored in case of corruption.
  • Storage management: Can the database storage be centralized, easily expanded, and possibly shared with other applications?
  • Price/Performance: Evaluating the cost of each database transaction or storage IOP.
  • Private vs. Public Cloud: Which instances should be run in a public cloud like AWS vs. in a private, on-premises cloud?

It’s common for customers to deploy many MySQL instances to support different applications, users, and projects. It’s also common to deploy them on virtual machines, which makes more efficient use of hardware and simplifies migration of instances. For example, a particular MySQL instance can be given more resources when it’s hot, then moved to an older server when it’s not.

Likewise, it’s preferred to offer persistent, shared storage which can scale up in both capacity and performance when needed. While a straight flash array or local server flash might offer more peak performance to one MySQL instance, Ceph’s scale-out architecture makes it easy to scale up the storage performance to run many MySQL instances across many storage nodes. Persistent storage ensures the data continues to exist even if the database instance goes away. Ceph also features replication and erasure coding to protect against hardware failure and snapshots to support quick backup and restore of databases.

As for the debate between public vs. private cloud, it has too many angles to cover here, but clearly there are MySQL customers who prefer to run in their own datacenter rather than AWS, and others who would happily go either way depending which costs less.



Figure 2: Ceph can scale out to many nodes for both redundancy and increased performance for multiple database instances.

But the questions remain: can Ceph perform well enough for a typical MySQL user, and how does it compare to AWS in performance and price? This is what Red Hat, Supermicro, and Percona set off to find out.




Figure 3: MySQL on AWS vs. MySQL on Red Hat Ceph Storage. Which is faster? Which is less expensive?

First, Red Hat ran baseline benchmarks on AWS EC2 (r3.2xlarge and m4.4xlarge) using Amazon’s Elastic Block Store (EBS) with provisioned IOPS set to 30 IOPS/GB, testing with Sysbench for 100% read and 100% write. Not surprisingly, after converting from Sysbench numbers (requests per second per MySQL instance) to IOPS, AWS performance was as advertised: 30 read IOPS/GB and 26 write IOPS/GB.

Then they tested the Ceph cluster illustrated above: 5 Supermicro cloud servers (SSG-6028R-E1CF12L) with four NVMe SSDs each, plus 12 Supermicro client machines on dual 10GbE networks. Software was Red Hat Ceph Storage 1.3.2 on RHEL 7.2 with Percona Server. After running the same Sysbench tests on the Ceph cluster at 14% and 87% capacity utilization, they found read IOPS/GB were 8x or 5x better than AWS, depending on utilization, while write IOPS/GB were 3x better than AWS at 14% utilization. At 87% utilization of the Ceph cluster, write IOPS/GB were 14% lower than AWS due to the write amplification from the combination of InnoDB write buffering, Ceph replication, and OSD journaling.


Figure 4: Ceph private cloud generated far better write IOPS/GB at 14% capacity and slightly lower IOPS/GB at 72% and 87% capacity.


What about Price/Performance?

The Ceph cluster was always better than AWS for reads and much better than AWS for writes when nearly empty but slightly slower than AWS for writes when nearly full. On the other hand when looking at the cost per IOP for MySQL writes, Ceph was far less expensive than AWS in all scenarios. In the best case Ceph was less than 1/3rd the price/IOP and in the worst case half the price/IOP, vs. AWS EBS with provisioned IOPS.
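
The two metrics used in this comparison, IOPS/GB and price per IOP, are simple ratios. The Python sketch below shows how they are computed; the cost, IOPS, and capacity inputs are hypothetical placeholders, not the figures from the Red Hat tests.

```python
def iops_per_gb(iops, capacity_gb):
    """Storage performance density: I/O operations per second per gigabyte."""
    return iops / capacity_gb

def price_per_iop(monthly_cost_usd, iops):
    """Cost of each I/O operation per second of delivered performance."""
    return monthly_cost_usd / iops

# Hypothetical inputs for illustration only.
aws = {"monthly_cost_usd": 3000.0, "iops": 3000, "capacity_gb": 100}
ceph = {"monthly_cost_usd": 2000.0, "iops": 6000, "capacity_gb": 100}

aws_ppi = price_per_iop(aws["monthly_cost_usd"], aws["iops"])
ceph_ppi = price_per_iop(ceph["monthly_cost_usd"], ceph["iops"])
ratio = ceph_ppi / aws_ppi   # below 1.0 means the second option costs less per IOP

print(iops_per_gb(aws["iops"], aws["capacity_gb"]), round(ratio, 2))
```

With these made-up inputs the second option delivers each IOP at about one third of the price, which is the kind of ratio the comparison above refers to.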


Figure 5: MySQL on a Ceph private cloud showed much better (lower) price/performance than running on AWS EBS with Provisioned IOPS.


What Next for the Database Squid?

Having shown good performance chops running MySQL on Red Hat Ceph Storage, Red Hat also looked at tuning Ceph block storage performance, including RBD format, RBD order, RBD fancy striping, TCP settings, and various QEMU settings. These are covered in the Red Hat Summit presentation and Percona webinar.

For the next phase in this database testing, I’d like to see Red Hat, Supermicro, and Percona test larger server configurations that use more flash per server and faster networking. While this test only used dual 10GbE networks, previous testing has shown that using Mellanox 40 or 50Gb Ethernet can reduce latency and therefore increase IOPS performance for Ceph, even when dual 10GbE networks provide enough bandwidth. It would also be great to demonstrate the benefits of Ceph replication and cluster self-healing features for data protection as well as Ceph snapshots for nearly instant backup and restore of databases.

My key takeaways from this project are as follows:

  • Ceph is a good choice for many MySQL use cases
  • Ceph offers excellent performance and capacity scalability, even if it might not offer the fastest performance for one specific instance.
  • Ceph performance for MySQL compares favorably with AWS EBS Provisioned IOPS
  • You can build a private storage cloud with Red Hat Ceph Storage with a lower price/capacity and price/performance than running on AWS.

If you’re running a lot of MySQL instances, especially on AWS, it behooves you to evaluate Ceph as a storage option. You can learn more about this from the PerconaLive and Red Hat Summit presentations linked below.



No Wrinkles as Mellanox Powers NVMe over Fabrics Demos at Flash Memory Summit and IDF

Mellanox just rounded out two very busy weeks with back-to-back trade shows related to storage. We were at Flash Memory Summit August 9-11 in Santa Clara, followed by Intel Developer Forum (IDF) August 16-18 in San Francisco. A common theme was seeing Mellanox networking everywhere for demonstrating the performance of flash storage.

The fun began at Flash Memory Summit with several demos of NVMe over Fabrics (NVMe-oF). As my colleague Rob Davis wrote in his blog, the 1.0 standard and community drivers were just released in June 2016, and while FMS 2015 also featured NVMe-oF demos from Mangstor, Micron and PMC Sierra (now Microsemi), all were pre-standard and only Mangstor had a shipping product. Plus all the demos ran only on Linux.


Figure 1: NVMe over Fabrics is nearly always powered by RoCE (RDMA over Converged Ethernet)

So it was extremely exciting this year to see FIVE demos of NVMe over Fabrics at FMS using Mellanox networking, with three of them available as products. All the demos either used the standard NVMe-oF drivers or were compatible with the standard drivers, and they showed initiators running on Windows and VMware, not just Linux.

  • E8 Storage showed a distributed, scale-out NVMe-oF software-defined storage solution
  • Mangstor showed a high-performance, scale-up NVMe-oF array, with initiators running on bare-metal Linux and on a Linux VM running on top of VMware ESXi
  • Micron showed a Windows NVMe-oF initiator interoperating with a Linux target
  • Newisys (division of Sanmina) showed a live NVMe-oF demo
  • Pavilion Data showed a super dense NVMe-oF custom array supporting up to 460TB, 40x40GbE connections, and up to 20 million IOPS, all in one 4RU box.

Figure 2: Pavilion Data’s custom-engineered all-flash array supports up to 460TB of raw capacity, 120GB/s of throughput, and 20M IOPS, all running NVMe-oF with up to forty 40GbE connections.

But NVMe over Fabrics wasn’t the only flash demo to leverage Mellanox networking! Samsung demonstrated an impressive Windows Storage Spaces Direct (S2D) cluster that reached 80GB/s (640 Gb/s) of data throughput. It used just 4 Dell servers, each with 4 Samsung NVMe SSDs and two Mellanox ConnectX-4 100GbE RDMA-enabled NICs, all connected by Mellanox’s Spectrum 2700 100GbE switch and LinkX® cables. Samsung also showed an all-flash reference design with 24 NVMe SSDs, capable of supporting several storage solutions including Ceph.
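
As a quick sanity check on the scale of that number, the sketch below compares the reported throughput with the cluster's aggregate NIC bandwidth. It assumes one 100GbE port in use per NIC, which is not stated above, and it is only a scale comparison: in a hyper-converged cluster, not every byte of storage traffic necessarily crosses the network.

```python
def gbytes_to_gbits(gbytes_per_s):
    """Convert gigabytes/second to gigabits/second (8 bits per byte)."""
    return gbytes_per_s * 8

servers = 4
nics_per_server = 2
nic_speed_gbits = 100   # assumption: one 100GbE port in use per NIC

throughput_gbits = gbytes_to_gbits(80)                              # 640
aggregate_nic_gbits = servers * nics_per_server * nic_speed_gbits   # 800
ratio = throughput_gbits / aggregate_nic_gbits

print(throughput_gbits, aggregate_nic_gbits, ratio)  # prints: 640 800 0.8
```

Under that assumption, the reported throughput is equivalent to 80% of the combined line rate of all eight NICs.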

Nimbus Data unveiled a new family of flashy arrays which all support iSER (iSCSI Extensions for RDMA) on top of RoCE. Nexenta and Mellanox released a joint white paper showing how to deploy a hyper-converged software-defined storage (NexentaEdge) solution using Micron SSDs and Mellanox 50Gb Ethernet.



Figure 3: Nimbus Data’s Exaflash C-series supports up to 3PB raw flash and can connect at 100Gb/s with either Ethernet or InfiniBand

At IDF a week later, there were more flashy demos. This time HGST (a Western Digital brand), Seagate, and Samsung showed NVMe over Fabrics using Mellanox adapters. Newisys and E8 Storage returned with their NVMe-oF demos, while Samsung also brought back their glorious Windows S2D cluster. To add to the storage excitement, Plexistor showed a solution for Shared Persistent Memory (which uses technology similar to NVMe over Fabrics). Atto demonstrated ThunderLink, which connects Thunderbolt 3 devices to 40Gb Ethernet networks, and Nokia showed their Airframe OCP rack.


Figure 4: Seagate showed a 2U NVMe-oF system with 24 Seagate Nytro XF1440 NVMe SSDs, while Atto’s ThunderLink™ connects Thunderbolt™ 3 devices to 40GbE networks.

Even Intel themselves showed NVMe over Fabrics with Mellanox ConnectX-4 100GbE NICs, paired with their Storage Performance Developer Kit (SPDK) and an Intel Silicon Photonics 100GbE cable. (Mellanox LinkX cables also support Silicon Photonics for 100GbE speeds at distances up to 2km.)


Figure 5: Intel showed NVMe over Fabrics using their SPDK software and Mellanox ConnectX-4 adapters.

The common thread across these demos at FMS and IDF? They all used Mellanox ConnectX-3 or ConnectX-4 network adapters, and they all ran at speeds of 25Gb/s or faster (many at 100Gb/s). In fact, as far as I could see, every single demonstration of NVMe over Fabrics used Mellanox adapters, except for demos by other network adapter or chip vendors who showed their own networking.

This is not surprising given that Mellanox adapters and switches are the first to support 25, 50, and 100GbE speeds, and the first and best at supporting low-latency RDMA (via InfiniBand or RoCE) for super-efficient data movement. In addition, ConnectX-4 makes RoCE deployments, and thus NVMe over Fabrics deployments, easier by allowing RoCE to run with Priority Flow Control (PFC) or Explicit Congestion Notification (ECN), or both (see my blog about that).

The key takeaways from these recent events are as follows:

  • NVMe over Fabrics is now a released standard with working products from several vendors
  • NVMe-oF support is expanding to Windows and VMware, no longer Linux-only
  • The speed of flash absolutely requires faster network speeds: 25, 40, 50, or even 100Gb/s
  • RoCE on Mellanox adapters is by far the most popular RDMA solution for supporting NVMe over Fabrics
  • Other flash storage solutions—such as Windows Storage Spaces, NexentaEdge, Ceph, and Plexistor—also choose Mellanox networking for the higher performance and efficiency

Many of the presentations—some given by me and my colleagues—from these two shows are now available online (links in the Resources section below). And if you’d like to see more solutions leveraging the power and efficiency of Mellanox networking, look for Mellanox at an upcoming event near you.

Supporting Resources:


Resilient RoCE Relaxes RDMA Requirements

RoCE — or RDMA over Converged Ethernet — has already proven to be the most popular choice for cloud deployments of Remote Direct Memory Access (RDMA). And it’s increasingly being used for fast flash storage access, such as with NVMe Over Fabrics. But some customers prefer not to configure their networks to be lossless using priority flow control (PFC). Now, with new software from Mellanox, RoCE can be deployed either with or without PFC, depending upon customer network requirements, infrastructure, and preference. This makes RoCE easier to deploy for more customers and will accelerate adoption of RDMA.

Background: Why RDMA?

The increasing speed of CPUs, networks, and storage (flash) has amplified the advantages of RDMA, making it more popular. As CPUs and storage get faster, they support faster network speeds such as 25, 40, 50, and 100GbE. But, as network speeds increase, more of the CPU cores are devoted to handling network traffic with its related data copies and interrupts. And as solid-state storage offers ever lower latencies, the network stack latency becomes a greater and greater part of the total time to access data.


Figure 1: As storage gets faster, software latency becomes a larger part of total data access latency. (Source: Intel presentation on SPDK, May 2016.)
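To make that trend concrete, here is a small illustrative calculation. Every latency in it is a hypothetical round number chosen for the example, not a figure from this post or from the Intel presentation. The only point is that a fixed software and network stack overhead takes up a growing share of the total as the storage media gets faster.

```python
# Illustrative only: all latencies below are hypothetical round numbers.
STACK_LATENCY_US = 20.0   # assumed fixed software/network stack overhead

MEDIA_LATENCY_US = {      # assumed media access latencies in microseconds
    "hard disk": 5000.0,
    "SATA flash": 100.0,
    "NVMe flash": 20.0,
}


def stack_share(media_us: float, stack_us: float = STACK_LATENCY_US) -> float:
    """Fraction of total access time spent in the stack rather than the media."""
    return stack_us / (media_us + stack_us)


for name, media_us in MEDIA_LATENCY_US.items():
    print(f"{name:10s}: stack is {stack_share(media_us):.1%} of total latency")
```

With these made-up inputs, the stack goes from a rounding error on hard disk to half of the total on the fastest media, which is the gap RDMA is meant to close.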

RDMA solves both of these issues by reducing network latency and offloading the CPU. It uses zero-copy and hardware transport technology to transfer data directly from the memory of one server to another (or from server to storage) without making multiple copies, and hardware offloads relieve the CPU from managing any of the networking. This means that with RoCE, more CPU cores are available to run the important applications, and the lower latency lets faster storage like flash shine.


Figure 2: RDMA increases network efficiency by transferring data directly to memory and bypassing the CPU. (Source: RoCE Initiative.)

The Purpose of Ethernet Flow Control

It’s clear that all RDMA performs best without packet loss, simply because detecting and retransmitting lost packets causes delays, no matter what protocol is used. The faster the network gets (such as at 25, 40, 50, and 100GbE speeds), the greater the relative effect of packet loss and the more valuable it becomes to avoid it.
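One way to see why loss hurts more at higher speeds: if recovering from a lost packet stalls a flow for a fixed amount of time, a faster link gives up more data during that stall. The sketch below assumes a hypothetical 1 ms recovery delay (`STALL_US`), picked only for illustration. Real recovery times depend on the protocol and implementation.

```python
STALL_US = 1000.0   # hypothetical recovery delay after a lost packet (1 ms)


def bytes_not_sent(link_gbps: float, stall_us: float = STALL_US) -> float:
    """Bytes the link could have carried while the flow was stalled."""
    bits = link_gbps * 1e9 * stall_us * 1e-6
    return bits / 8


for speed_gbps in (25, 40, 50, 100):
    megabytes = bytes_not_sent(speed_gbps) / 1e6
    print(f"{speed_gbps:3d}GbE: about {megabytes:.2f} MB of capacity lost per stall")
```

The same stall costs four times as much transfer capacity at 100GbE as at 25GbE.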

RoCE has built-in error correction and retransmission mechanisms, so it does not require a lossless network; however, initial implementations recommended lossless networks. The most common source of packet loss within the datacenter is traffic overload on ports, such as an incast situation. So, it was recommended that customers deploy RoCE with Priority Flow Control (PFC).

PFC is part of the Ethernet Data Center Bridging (DCB) specification, originally implemented to support FCoE, which requires a lossless network. It acts like a traffic light or traffic cop at intersections, preventing collisions and avoiding packet loss from overloaded switch ports. The “Priority” in PFC allows traffic to be grouped into several classes so more important or latency-sensitive packets (for example storage or RDMA traffic) get priority over less latency-sensitive traffic.








Figure 3: PFC prevents packet loss on busy networks, just like a traffic cop prevents accidents at busy intersections.

Priority Flow Control

Priority Flow Control works very well. All major enterprise switches (including Mellanox switches) support it, and it’s been successfully deployed with RoCE in very large networks. In fact, because PFC eliminates packet loss from port overload, it effectively makes any datacenter network lossless. However, PFC requires the network administrators to set up VLANs and configure the flow control priorities, and some network administrators prefer not to do this.
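The per-priority behavior of PFC can be sketched with a toy model. This is a simplification for illustration, not switch firmware: the thresholds, the class name `PfcIngressPort`, and the priority assignments are all hypothetical. It shows only that a pause is applied to the one traffic class whose queue is filling, while the other classes keep flowing.

```python
from dataclasses import dataclass, field

NUM_PRIORITIES = 8    # PFC distinguishes eight priorities (0-7)
XOFF_THRESHOLD = 8    # hypothetical queue depth that triggers a pause
XON_THRESHOLD = 4     # hypothetical queue depth at which the pause is lifted


@dataclass
class PfcIngressPort:
    """Toy model of one switch port with a separate queue per priority."""
    depth: list = field(default_factory=lambda: [0] * NUM_PRIORITIES)
    paused: list = field(default_factory=lambda: [False] * NUM_PRIORITIES)

    def receive(self, priority: int) -> bool:
        """Buffer one packet; return True if this priority is now paused."""
        self.depth[priority] += 1
        if self.depth[priority] >= XOFF_THRESHOLD:
            self.paused[priority] = True    # a real port would send a pause frame
        return self.paused[priority]

    def drain(self, priority: int) -> bool:
        """Forward one packet; return True if this priority is still paused."""
        if self.depth[priority] > 0:
            self.depth[priority] -= 1
        if self.depth[priority] <= XON_THRESHOLD:
            self.paused[priority] = False   # a real port would signal resume
        return self.paused[priority]


RDMA, BULK = 3, 0     # hypothetical priority assignments
port = PfcIngressPort()
for _ in range(XOFF_THRESHOLD):
    port.receive(BULK)                      # burst arrives on the bulk class only
port.receive(RDMA)

print("bulk paused:", port.paused[BULK])
print("rdma paused:", port.paused[RDMA])
```

In this model the bulk class is paused before its queue can overflow, and the RDMA class is unaffected. Choosing which traffic maps to which priority is the configuration work that administrators have to do.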

ECN Eliminates Congestion for Smoother Network Flows

But there is an alternative mechanism to avoid packet loss, which leverages Explicit Congestion Notification (ECN). ECN allows switches to notify hosts when congestion is likely to happen, and the end nodes adjust their data transmission speeds to prevent congestion before it occurs.

The RoCE congestion management protocol takes advantage of ECN to avoid congestion and packet loss. ECN-capable switches detect when a port is getting too busy and mark outbound packets from that port with the Congestion Experienced (CE) bit. The receiving NIC sees the CE indication and notifies the sending NIC with a Congestion Notification Packet (CNP). In turn, the sending NIC backs off its sending rate temporarily to prevent congestion from occurring. Once the risk of congestion declines sufficiently, the sender resumes full-speed data transmission.


Figure 4: RoCE congestion management leverages ECN to avoid both congestion and packet loss.
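The mark, notify, and back-off loop described above can be sketched as a tiny simulation. To be clear, this is not the congestion control algorithm the adapters actually run. It is a simplified model with made-up constants (drain rate, marking threshold, back-off factor, recovery step), meant only to show the feedback cycle: the switch marks CE when its queue gets deep, the receiver answers with a CNP, the sender cuts its rate, and then it ramps back up.

```python
LINE_RATE_GBPS = 100.0
DRAIN_RATE_GBPS = 60.0   # hypothetical share of the egress port this flow can use
MARK_THRESHOLD = 50.0    # hypothetical queue depth (arbitrary units) for CE marking
BACKOFF = 0.5            # hypothetical multiplicative rate cut per CNP
RECOVERY_STEP = 5.0      # hypothetical rate increase per interval without a CNP


def simulate(intervals: int = 40):
    """Return one (send_rate, queue_depth, cnp_sent) tuple per interval."""
    rate, queue, history = LINE_RATE_GBPS, 0.0, []
    for _ in range(intervals):
        # Switch: the queue grows whenever arrivals exceed the drain rate.
        queue = max(0.0, queue + rate - DRAIN_RATE_GBPS)
        # Switch marks CE on a deep queue; the receiving NIC replies with a CNP.
        cnp = queue > MARK_THRESHOLD
        history.append((rate, queue, cnp))
        if cnp:
            rate = rate * BACKOFF                              # sender backs off
        else:
            rate = min(LINE_RATE_GBPS, rate + RECOVERY_STEP)   # sender recovers
    return history


for step, (rate, queue, cnp) in enumerate(simulate(12)):
    flag = "CNP" if cnp else ""
    print(f"t={step:2d} rate={rate:6.2f} Gb/s queue={queue:6.2f} {flag}")
```

In this toy model the queue stays bounded because the sender slows down as soon as marking starts, so the buffer never has to drop packets as long as it is larger than that bound.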

It’s like putting all the RoCE packets into self-driving cars which sense and avoid traffic jams using the data shared from all the other cars and local businesses. If a red light is ahead, the cars slow down so they won’t hit the red light, instead arriving at the intersection during the next green light.

Of course, ECN isn’t new. What is new is the latest software release that takes advantage of the advanced hardware mechanisms in the Mellanox ConnectX®-4 and ConnectX-4 Lx adapters, which are optimized for deployment with ECN. You can still use PFC alone. You can even use both in a “belt and suspenders” approach where ECN prevents congestion but, just in case, PFC steps in as a “traffic cop” to prevent packet loss and keep flows orderly.


Figure 5: RoCE can be deployed with ECN only, PFC only, or both, if you want to ensure your pants (or network flows) won’t fall down.

It’s the Same RoCE Specification as Before

To be clear, this is still the same RoCE specification and wire protocol, which hasn’t changed. It’s simply an enhanced implementation of RoCE, leveraging the improved features and capability of the Mellanox ConnectX-4 adapter family and the ECN support found in advanced switches, including the Mellanox Spectrum switch family. Different RoCE-capable adapters still interoperate exactly as before.

Resilient RoCE delivers RDMA performance on lossy networks that is on par with lossless networks and substantially better than protocols that rely on TCP/IP for error recovery. It gives customers more flexibility to deploy RDMA in the way that best suits their network architecture and performance needs. Some customers will deploy only PFC, some will deploy only ECN, and some will deploy both.

RoCE Continues to Improve and Evolve

Resilient RoCE continues the evolution of RoCE to serve the needs of both bigger networks and more types of enterprise and cloud customers.

  • 2013: First L3-routable RoCE NICs shipped
  • 2014: L3-routable RoCE standard approved
  • 2015 (June): Soft-RoCE lets any NIC run RoCE (though only rNICs offer the hardware acceleration and offload)
  • 2015 (October): RoCE plugfest proves multiple RoCE rNIC vendors can interoperate
  • 2016: Resilient RoCE lets RoCE run on lossless or lossy networks


Figure 6: RoCE continues to evolve and improve (source: Mellanox and InfiniBand Trade Association).

RoCE On!

It’s clear why RoCE is the most popular way to use RDMA over Ethernet—it provides the best performance and greatest efficiency. Now, with the addition of Soft-RoCE and the ability to operate with or without lossless networks, RoCE has the most flexibility and largest ecosystem of any Ethernet-based RDMA technology.