As the storage world turns to flash and flash turns to NVMe over Fabrics, the BlueField SoC could be the most highly integrated and most efficient flash controller ever. Let me explain why.
The Backstory: NVMe Flash Changes Storage
Dramatic changes are happening in the storage market. These changes come from NVMe over Fabrics, which comes from NVMe, which comes from flash. Flash has been capturing more and more of the storage market. IDC reported that in Q2 2017, all-flash array (AFA) revenue grew 75% YoY while the overall external enterprise storage array market was slightly down. In the past, this flash consisted entirely of SAS and SATA solid-state drives (SSDs), but flash and SSDs have long been fast enough that the SATA and SAS interfaces impose bandwidth bottlenecks and extra latency.
The SSD vendors developed the Non-Volatile Memory Express (NVMe) standard and command set (version 1.0 released March 2011), which runs over a PCIe interface. NVMe allows higher throughput, up to 20Gb/s per SSD today (and more in the near future), and lower latency. It eliminates the SAS/SATA controllers and requires PCIe connections, typically 4 PCIe Gen3 lanes per SSD. Many servers deployed with local flash now enjoy the higher performance of NVMe SSDs.
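To put rough numbers on the interface bottleneck, here is a small back-of-the-envelope sketch. It assumes the standard line rates and encodings (SATA III at 6Gb/s and SAS-3 at 12Gb/s with 8b/10b encoding, PCIe Gen3 at 8GT/s per lane with 128b/130b encoding) and ignores packet and protocol overhead, so real-world throughput will be lower. The function name `usable_gbps` is made up for this illustration.

```python
def usable_gbps(line_rate_gbps, payload_bits, encoded_bits, lanes=1):
    """Link bandwidth left after line encoding, in Gb/s (protocol overhead ignored)."""
    return line_rate_gbps * payload_bits / encoded_bits * lanes


# SATA III: 6 Gb/s line rate, 8b/10b encoding
sata3 = usable_gbps(6.0, 8, 10)
# SAS-3: 12 Gb/s line rate, 8b/10b encoding
sas3 = usable_gbps(12.0, 8, 10)
# PCIe Gen3: 8 GT/s per lane, 128b/130b encoding, 4 lanes per NVMe SSD
pcie3_x4 = usable_gbps(8.0, 128, 130, lanes=4)

print(f"SATA III     : {sata3:.1f} Gb/s")
print(f"SAS-3        : {sas3:.1f} Gb/s")
print(f"PCIe Gen3 x4 : {pcie3_x4:.1f} Gb/s")
```

Under these assumptions a 4-lane Gen3 link leaves roughly 31.5Gb/s of headroom, comfortably above the 20Gb/s that a fast NVMe SSD delivers today, while a single SATA or SAS link falls well short of it.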
How to Share Fast SSD Goodness
But local flash deployed this way is “trapped in the server” because each server can only use its own flash. Different servers need different amounts of flash at different times, but with a local model you must overprovision enough flash in each server to support the maximum that might be needed, even if you need the extra flash for only a few hours at some point in the future. The answer over the last 20 years has been to centralize and network the storage using iSCSI, Fibre Channel Protocol, iSER (iSCSI over RDMA), or NAS protocols like SMB and NFS.
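The overprovisioning cost of the local model can be sketched with a toy calculation. The demand figures below are invented purely for illustration (three hypothetical servers, each peaking at a different time); the point is only the difference between sizing every server for its own peak and sizing one shared pool for the peak of the combined demand.

```python
# Hypothetical flash demand in TB per server over three time periods.
demand_tb = {
    "server_a": [1, 4, 1],
    "server_b": [4, 1, 1],
    "server_c": [1, 1, 4],
}

# Local model: every server must hold enough flash for its own peak.
local_capacity = sum(max(series) for series in demand_tb.values())

# Shared model: the pool must cover the busiest moment of the combined demand.
combined = [sum(period) for period in zip(*demand_tb.values())]
pooled_capacity = max(combined)

print(f"Local flash needed : {local_capacity} TB")
print(f"Pooled flash needed: {pooled_capacity} TB")
```

With these made-up numbers the local model needs 12TB while a shared pool needs 6TB. The real saving depends entirely on how much the servers' peaks overlap.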
But these protocols all use either SCSI commands or file semantics and were not optimized for flash, so they deliver good performance but not the best possible performance. As a result, the NVMe community, including Mellanox, created NVMe over Fabrics (NVMe-oF) to allow fast, efficient sharing of NVMe flash over a fabric. It allows the lean and efficient NVMe commands to operate across an RDMA network with protocols like RoCE and InfiniBand. And it maintains the efficiency and low latency of NVMe while allowing sharing, remote access, replication, failover, etc. A good overview of NVMe over Fabrics is in this YouTube video:
Video 1: An overview of how NVMe over Fabrics has Evolved
NVMe over Fabrics Frees the Flash But Doesn’t Come Free
Once NVMe-oF frees the flash from the server, you need an additional CPU to run NVMe commands in a Just-a-Bunch-of-Flash (JBOF) box, plus more CPU power if it's a storage controller running storage software. You need DRAM to store the buffers and queues. You need a PCIe switch to connect to the SSDs. And you need rNICs that can handle RDMA at high enough speeds to support all the fast NVMe SSDs. In other words, you have to build a complete server design with enhanced internal and external connectivity to support this faster storage. For a storage controller this is not unusual, but for a JBOF it's more complex and costly than what JBOF builders are accustomed to doing with SAS or SATA HBAs and expanders, which don't require CPUs, DRAM, PCIe switches, or rNICs.
Also, since NVMe SSDs and the NVMe over Fabrics protocol are inherently low latency, the latency of everything else in the system (software, network, HBAs, cache or DRAM access, etc.) becomes more prominent, and reducing latency in those areas becomes more critical.
A New SoC Is the Most Efficient Way to Drive NVMe-oF
Fortunately, there is a new way to build NVMe-oF systems: a single chip that provides everything needed other than the SSDs and the DRAM DIMMs. That chip is the Mellanox BlueField SoC. It includes:
- ConnectX-5 high-speed NIC (up to 2x100Gb/s ports, Ethernet or InfiniBand)
- Up to 16 ARM A72 (64-bit) CPU cores
- A built-in PCIe switch (32 lanes at Gen3/Gen4)
- DRAM controller and coherent cache
- A fast mesh fabric to connect it all
The embedded ConnectX-5 delivers not just 200Gb/s of network bandwidth but all the features of ConnectX-5, including RDMA and NVMe protocol offloads. This means the NVMe-oF data traffic can go directly from SSD to NIC (or NIC to SSD) without interrupting the CPU. It also means overlay network encapsulation (like VXLAN), virtual switch features (such as OVS), erasure coding, T10 Data Integrity Field (T10 DIF) signatures, and stateless TCP offloads can all be processed by the NIC without involving the CPU cores. The CPU cores remain free to run storage software, security, encryption, or other functionality.
The fast internal mesh fabric enables near-instantaneous data movement between the PCIe, CPU, cache, and networking elements as needed, and operates much more efficiently than a classic server design where traffic between the SSDs and NIC(s) must traverse the PCIe switch and DRAM multiple times for each I/O. With this design, NVMe-oF queues and data buffers can be handled completely in the on-chip cache and don't need to go to the external DRAM, which is needed only if additional storage functions running on the CPU cores are applied to the data. Otherwise the DRAM can be used for control plane traffic, reporting, and management. The PCIe switch supports up to 32 lanes of either Gen3 or Gen4, so it can transfer more than 200Gb/s of data to/from SSDs and is ready for the new PCIe Gen4-enabled SSDs expected to arrive in 2018. (PCIe Gen4 can transfer 2x more traffic per lane than PCIe Gen3.)
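The claim that 32 lanes can move more than 200Gb/s is easy to check with a quick sketch. It assumes the standard PCIe signaling rates (8GT/s per lane for Gen3, 16GT/s for Gen4, both with 128b/130b encoding), counts one direction only, and ignores packet overhead, so it gives an upper bound rather than a measured figure. The 4-lanes-per-SSD and 20Gb/s-per-SSD values come from earlier in this post; the constant and function names are invented for the example.

```python
ENCODING_EFFICIENCY = 128 / 130   # 128b/130b line encoding (Gen3 and Gen4)
GT_PER_LANE = {"Gen3": 8.0, "Gen4": 16.0}

SWITCH_LANES = 32      # BlueField's built-in PCIe switch
NIC_GBPS = 200         # 2 x 100Gb/s network ports
LANES_PER_SSD = 4      # typical NVMe SSD attachment
SSD_GBPS = 20          # per-SSD throughput quoted earlier


def switch_gbps(generation, lanes=SWITCH_LANES):
    """Upper bound on one-direction PCIe bandwidth in Gb/s."""
    return lanes * GT_PER_LANE[generation] * ENCODING_EFFICIENCY


gen3_total = switch_gbps("Gen3")
gen4_total = switch_gbps("Gen4")
ssd_count = SWITCH_LANES // LANES_PER_SSD
ssd_total = ssd_count * SSD_GBPS

print(f"Gen3 x32: {gen3_total:.0f} Gb/s, Gen4 x32: {gen4_total:.0f} Gb/s")
print(f"{ssd_count} directly attached SSDs at {SSD_GBPS} Gb/s each: {ssd_total} Gb/s")
print(f"Network ports: {NIC_GBPS} Gb/s")
```

Under these assumptions, 32 Gen3 lanes offer about 252Gb/s and 32 Gen4 lanes about 504Gb/s, so the PCIe side does not hold back the 200Gb/s of network bandwidth. Eight directly attached x4 SSDs at today's 20Gb/s would supply about 160Gb/s.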
BlueField is the first SoC to include all these features and performance, making it uniquely well-suited to control flash arrays, in particular NVMe-oF arrays and JBOFs.
BlueField Is the Most Integrated NVMe-oF Solution
We’ve seen that in the flash storage world, performance is very important. But simplicity of design and cost control are also important. By combining all the components of an NVMe-oF server into a single chip, BlueField makes the flash array design very simple and lowers the cost, while also allowing a smaller footprint and lower power consumption.
Vendors Start Building Storage Solutions Based on BlueField
Not surprisingly, key Original Design Manufacturers (ODMs) and storage Original Equipment Manufacturers (OEMs) are already designing storage solutions based on the BlueField SoC. Mellanox is also working with key partners to create more BlueField solutions for network processing, cloud, security, machine learning, and other non-storage use cases. Mellanox has created a BlueField Storage Reference Platform that can handle many NVMe SSDs and serve them up over NVMe over Fabrics. This is the perfect development and reference platform to help customers and partners test and develop their own BlueField-powered storage controllers and JBOFs.
BlueField Is the Best Flash Array Controller
The optimized performance and tight integration of all the needed components make BlueField the perfect flash array controller, especially for NVMe-oF storage arrays and JBOFs. Designs using BlueField will deliver more flash performance at lower cost and with less power than standard server-based designs.
You can see the BlueField SoC and BlueField Storage Reference Platform this week (August 8-10) at Flash Memory Summit, in the Santa Clara Convention Center, in the Mellanox booth #138.
- BlueField Video
- BlueField Product web page and product brief
- BlueField Reference Platform
- BlueField storage press release
- The evolution of NVMe over Fabrics overview video
- Mellanox Blog: NVMe over Fabrics Standard is Released
- Mellanox Ethernet Switches