Resilient RoCE Relaxes RDMA Requirements

 

RoCE (RDMA over Converged Ethernet) has already proven to be the most popular choice for cloud deployments of Remote Direct Memory Access (RDMA), and it is increasingly being used for fast flash storage access, such as with NVMe over Fabrics. But some customers prefer not to configure their networks to be lossless using Priority Flow Control (PFC). Now, with new software from Mellanox, RoCE can be deployed either with or without PFC, depending on customer network requirements, infrastructure, and preference. This makes RoCE easier to deploy for more customers and will accelerate adoption of RDMA.

Background: Why RDMA?

The increasing speed of CPUs, networks, and storage (flash) has amplified the advantages of RDMA, making it more popular. As CPUs and storage get faster, they can support faster network speeds such as 25, 40, 50, and 100GbE. But as network speeds increase, more CPU cores are devoted to handling network traffic, with its associated data copies and interrupts. And as solid-state storage offers ever lower latencies, network stack latency becomes a larger and larger share of the total time to access data.

Figure 1: As storage gets faster, software latency becomes a larger part of total data access latency. (Source: Intel presentation on SPDK, May 2016.)

RDMA solves both of these issues by reducing network latency and offloading the CPU. It uses zero-copy, hardware-based transport to transfer data directly from the memory of one server to another (or from server to storage) without making multiple copies, and hardware offloads relieve the CPU of managing the network transport. This means that with RoCE, more CPU cores are available to run the important applications, and the lower latency lets faster storage like flash shine.
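
To make the zero-copy idea concrete, below is a minimal sketch of posting an RDMA write with the standard libibverbs API. It assumes a reliable-connected queue pair and protection domain already exist and that the peer's buffer address and rkey were exchanged out of band; the function name and parameters here are illustrative only, not part of any Mellanox product.

    /* Minimal sketch: zero-copy RDMA write via libibverbs.
       Connection setup (device, PD, CQ, QP, address/rkey exchange) is omitted. */
    #include <infiniband/verbs.h>
    #include <stdint.h>
    #include <string.h>

    int rdma_write_example(struct ibv_pd *pd, struct ibv_qp *qp,
                           void *local_buf, size_t len,
                           uint64_t remote_addr, uint32_t remote_rkey)
    {
        /* Register the local buffer so the NIC can DMA directly from it. */
        struct ibv_mr *mr = ibv_reg_mr(pd, local_buf, len, IBV_ACCESS_LOCAL_WRITE);
        if (!mr)
            return -1;

        /* Describe the local data with a scatter/gather entry. */
        struct ibv_sge sge = {
            .addr   = (uintptr_t)local_buf,
            .length = (uint32_t)len,
            .lkey   = mr->lkey,
        };

        /* Build an RDMA WRITE work request: the adapter places the data
           directly into the remote buffer without involving the remote CPU. */
        struct ibv_send_wr wr, *bad_wr = NULL;
        memset(&wr, 0, sizeof(wr));
        wr.opcode              = IBV_WR_RDMA_WRITE;
        wr.sg_list             = &sge;
        wr.num_sge             = 1;
        wr.send_flags          = IBV_SEND_SIGNALED;
        wr.wr.rdma.remote_addr = remote_addr;
        wr.wr.rdma.rkey        = remote_rkey;

        /* Hand the work request to the adapter; the CPU is now free. */
        return ibv_post_send(qp, &wr, &bad_wr);
    }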

Figure 2: RDMA increases network efficiency by transferring data directly to memory and bypassing the CPU. (Source: RoCE Initiative.)

The Purpose of Ethernet Flow Control

It’s clear that all RDMA performs best without packet loss, simply because detecting and retransmitting lost packets causes delays, no matter what protocol is used. The faster the network gets (25, 40, 50, and 100GbE speeds), the greater the relative effect of packet loss and the more valuable it is to avoid it. At 100GbE, for example, even one millisecond spent detecting and retransmitting a lost packet represents roughly 12.5 megabytes of link capacity left unused.

RoCE has built-in error correction and retransmission mechanisms, so it does not strictly require a lossless network; however, initial implementations recommended deploying it on lossless networks. The most common source of packet loss within the datacenter is traffic overload on switch ports, such as in an incast situation, so customers were advised to deploy RoCE with Priority Flow Control (PFC).

PFC is part of the Ethernet Data Center Bridging (DCB) specification, originally implemented to support FCoE, which requires a lossless network. It acts like a traffic light or traffic cop at intersections, preventing collisions and avoiding packet loss from overloaded switch ports. The “Priority” in PFC allows traffic to be grouped into several classes so more important or latency-sensitive packets (for example storage or RDMA traffic) get priority over less latency-sensitive traffic.
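
For a concrete picture of the mechanism, here is a sketch of the on-the-wire layout of an IEEE 802.1Qbb PFC pause frame, written as a C struct purely for illustration (it is not code from any product). A congested switch port sends one of these back toward the traffic source; only the priorities flagged in the class-enable vector are paused, and only for the requested number of pause quanta, so other traffic classes keep flowing.

    #include <stdint.h>

    /* Illustrative layout of an 802.1Qbb PFC pause frame.
       All multi-byte fields are big-endian (network byte order) on the wire. */
    struct pfc_pause_frame {
        uint8_t  dest_mac[6];    /* 01:80:C2:00:00:01, the MAC-control multicast address */
        uint8_t  src_mac[6];     /* MAC address of the congested switch port */
        uint16_t ethertype;      /* 0x8808 = MAC control */
        uint16_t opcode;         /* 0x0101 = priority-based flow control */
        uint16_t class_enable;   /* bit N set => pause traffic class (priority) N */
        uint16_t pause_time[8];  /* per-priority pause duration, in 512-bit-time quanta */
        uint8_t  padding[26];    /* pad to the 64-byte Ethernet minimum frame size */
    } __attribute__((packed));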

Figure 3: PFC prevents packet loss on busy networks, just like a traffic cop prevents accidents at busy intersections.

Priority Flow Control

Priority Flow Control works very well, all major enterprise switches (including Mellanox switches) support it, and it’s been successfully deployed with RoCE in very large networks. In fact, because PFC eliminates packet loss from port overload, it effectively makes any datacenter network lossless. However, PFC requires the network administrators to set up VLANs and configure the flow control priorities, and some network administrators prefer not to do this.

ECN Eliminates Congestion for Smoother Network Flows

But there is an alternative mechanism to avoid packet loss, which leverages Explicit Congestion Notification (ECN). ECN allows switches to notify hosts when congestion is likely to happen, and the end nodes adjust their data transmission speeds to prevent congestion before it occurs.

The RoCE congestion management protocol takes advantage of ECN to avoid congestion and packet loss. ECN-capable switches detect when a port is getting too busy and mark outbound packets from that port with the Congestion Experienced (CE) bit. The receiving NIC sees the CE indication and notifies the sending NIC with a Congestion Notification Packet (CNP). In turn, the sending NIC temporarily backs off its sending rate to prevent congestion from occurring. Once the risk of congestion declines sufficiently, the sender resumes full-speed data transmission.
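
As a rough illustration of that back-off-and-recover behavior, here is a deliberately simplified toy model in C. The real congestion-control algorithm runs in the adapter hardware with its own state, timers, and parameters; the constants below are made up purely to show the shape of the reaction.

    #include <stdio.h>

    /* Toy model of the sender-side reaction: cut the send rate when a
       Congestion Notification Packet (CNP) arrives, then ramp back toward
       line rate once CNPs stop arriving. Illustration only. */
    static double rate_gbps   = 100.0;   /* current send rate */
    static double target_gbps = 100.0;   /* line rate to recover toward */

    /* Called when the NIC receives a CNP from the destination. */
    static void on_cnp_received(void)
    {
        rate_gbps *= 0.5;                 /* multiplicative decrease */
    }

    /* Called periodically when no CNP has arrived recently. */
    static void on_quiet_period(void)
    {
        rate_gbps += 0.05 * (target_gbps - rate_gbps);   /* gradual recovery */
    }

    int main(void)
    {
        on_cnp_received();                /* congestion signalled: back off */
        for (int i = 0; i < 20; i++)      /* congestion clears: speed back up */
            on_quiet_period();
        printf("current rate: %.1f Gb/s\n", rate_gbps);
        return 0;
    }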

Figure 4: RoCE congestion management leverages ECN to avoid both congestion and packet loss.

It’s like putting all the RoCE packets into self-driving cars which sense and avoid traffic jams using the data shared from all the other cars and local businesses. If a red light is ahead, the cars slow down so they won’t hit the red light, instead arriving at the intersection during the next green light.

Of course, ECN isn't new. What is new is the latest software release, which takes advantage of advanced hardware mechanisms in the Mellanox ConnectX®-4 and ConnectX-4 Lx adapters that are optimized for deployment with ECN. You can still use PFC alone, or you can use both in a "belt and suspenders" approach, where ECN prevents congestion and, just in case, PFC steps in as a "traffic cop" to prevent packet loss and keep flows orderly.

Figure 5: RoCE can be deployed with ECN only, PFC only, or both, if you want to ensure your pants (or network flows) won’t fall down.

It’s the Same RoCE Specification as Before

To be clear, this is still the same RoCE specification and wire protocol, which hasn't changed. It's simply an enhanced implementation of RoCE, leveraging the improved features and capabilities of the Mellanox ConnectX-4 adapter family and the ECN support found in advanced switches, including the Mellanox Spectrum switch family. Different RoCE-capable adapters still interoperate exactly as before.

Resilient RoCE delivers RDMA performance on lossy networks that is on par with lossless networks and substantially better than protocols that rely on TCP/IP for error recovery. It gives customers more flexibility to deploy RDMA in the way that best suits their network architecture and performance needs. Some customers will deploy only PFC, some will deploy only ECN, and some will deploy both.

RoCE Continues to Improve and Evolve

Resilient RoCE continues the evolution of RoCE to serve the needs of both bigger networks and more types of enterprise and cloud customers.

  • 2013: First L3-routable RoCE NICs shipped
  • 2014: L3-routable RoCE standard approved
  • 2015 (June): Soft-RoCE lets any NIC run RoCE (though only rNICs offer the hardware acceleration and offload)
  • 2015 (October): RoCE plugfest proves multiple RoCE rNIC vendors can interoperate
  • 2016: Resilient RoCE lets RoCE run on lossless or lossy networks

Figure 6: RoCE continues to evolve and improve (source: Mellanox and InfiniBand Trade Association).

RoCE On!

It’s clear why RoCE is the most popular way to use RDMA over Ethernet—it provides the best performance and greatest efficiency. Now, with the addition of Soft-RoCE and the ability to operate with or without lossless networks, RoCE has the most flexibility and largest ecosystem of any Ethernet-based RDMA technology.


About John F. Kim

John Kim is Director of Storage Marketing at Mellanox Technologies, where he helps storage customers and vendors benefit from high-performance interconnects and RDMA (Remote Direct Memory Access). After starting his high-tech career on an IT helpdesk, John worked in enterprise software and networked storage, with many years of solution marketing, product management, and alliances at enterprise software companies, followed by 12 years at NetApp and EMC. Follow him on Twitter: @Tier1Storage
