Double Your Network File System (NFS) Performance with RDMA-Enabled Networking

 

Network File System (NFS) is a ubiquitous component of most modern clusters. It was originally designed as a workgroup file system, making a central file store available to, and shared among, a number of client servers. As NFS grew more popular and mission-critical applications began running over it, high-speed access to storage became paramount, and higher-performance networking started to be used for client-to-NFS communication. In addition to higher networking speeds (today 100GbE and soon 200GbE), the industry has been looking for technologies that offload stateless networking functions from the CPU to the I/O subsystem. This leaves more CPU cycles free to run business applications and maximizes data center efficiency.

One of the more popular networking offload technologies is RDMA (Remote Direct Memory Access). RDMA makes data transfers more efficient and enables fast data movement between servers and storage without involving the server's CPU. Throughput is increased, latency is reduced, and CPU power is freed up for the applications. RDMA technology is already widely used for efficient data transfer in render farms and in large cloud deployments such as Microsoft Azure, in HPC solutions (including machine/deep learning), in iSER- and NVMe-oF-based storage, and in mission-critical SQL database solutions such as Oracle RAC (Exadata), IBM DB2 pureScale, Microsoft SQL solutions, and Teradata, as well as many others.

Figure 1: Data Communication over TCP vs RDMA

 

The figure above illustrates why IT managers have been deploying RoCE (RDMA over Converged Ethernet). RoCE leverages advances in Ethernet to deliver more efficient implementations of RDMA over Ethernet, enabling widespread deployment of RDMA technology in mainstream data center applications.

The growing deployment of RDMA-enabled networking solutions in public and private clouds, such as RoCE, which enables running RDMA over Ethernet, together with recent NFS protocol extensions, makes NFS communication over RoCE possible. (For more details, please watch the Open Source NFS/RDMA Roadmap presentation given at the OpenFabrics Workshop in March 2017 by Chuck Lever, upstream Linux contributor and Linux Kernel Architect at Oracle.) For a detailed description of how to run NFS over RoCE, please read How to Configure NFS over RDMA (RoCE) at the Mellanox community site.
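
As a rough orientation (not a substitute for the how-to above), the sketch below shows the kind of steps involved in bringing up NFS over RDMA with the in-kernel Linux support: loading the svcrdma/xprtrdma transport modules, telling the NFS server to also listen on the NFS/RDMA port (20049), and mounting with the RDMA transport. The host name nfs-server, export path /srv/nfs, and mount point /mnt/nfs-rdma are placeholders, and service names may differ by distribution.

```python
# Sketch only: bringing up NFS over RDMA (RoCE) on Linux.
# Assumptions: run as root, an RoCE-capable NIC is already configured,
# and /etc/exports on the server already contains an entry for /srv/nfs.
import subprocess

def run(cmd, **kwargs):
    """Echo a command and run it, raising on failure."""
    print("+", cmd if isinstance(cmd, str) else " ".join(cmd))
    subprocess.run(cmd, check=True, **kwargs)

def setup_server():
    run(["modprobe", "svcrdma"])               # server-side NFS/RDMA transport
    run(["systemctl", "start", "nfs-server"])  # service name may differ by distro
    # Ask knfsd to also listen for RDMA connections on port 20049,
    # the registered NFS/RDMA port.
    run("echo rdma 20049 > /proc/fs/nfsd/portlist", shell=True)
    run(["exportfs", "-ra"])                   # publish the exports

def setup_client():
    run(["modprobe", "xprtrdma"])              # client-side NFS/RDMA transport
    run(["mkdir", "-p", "/mnt/nfs-rdma"])
    # Mount the export over the RDMA transport instead of TCP.
    run(["mount", "-t", "nfs", "-o", "rdma,port=20049",
         "nfs-server:/srv/nfs", "/mnt/nfs-rdma"])

# Call setup_server() on the NFS server and setup_client() on the client.
```

Once mounted, nfsstat -m (or /proc/mounts) should show proto=rdma for the mount, confirming that the traffic is going over the RDMA transport rather than the TCP stack.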

In order to evaluate the performance boost that RoCE enables versus TCP, we ran a set of IOzone tests at Mellanox and measured the read/write IOPS and throughput of multi-threaded read and write workloads. The tests were performed on a single client against a Linux NFS server backed by tmpfs, so storage latencies are removed from the picture and the transport behavior is clearly exposed.
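
For concreteness, here is one way such a 16-thread IOzone throughput run might be launched from the client. The mount point /mnt/nfs-rdma, the per-thread file size, and the exact flag set are illustrative assumptions rather than the precise command used in these tests.

```python
# Sketch only: a multi-threaded IOzone throughput run against an NFS mount.
import subprocess

MOUNT = "/mnt/nfs-rdma"   # assumed NFS-over-RDMA mount point
THREADS = 16              # matches the 16-thread tests described below

cmd = [
    "iozone",
    "-i", "0",             # test 0: write/rewrite (creates the files)
    "-i", "1",             # test 1: read/reread
    "-r", "128k",          # record (block) size; use 8k or 2k for the IOPS runs
    "-s", "1g",            # file size per thread (assumed)
    "-t", str(THREADS),    # throughput mode with 16 threads
    "-c", "-e",            # include close() and fsync() in the timing
    "-F",                  # one scratch file per thread on the NFS mount
] + [f"{MOUNT}/iozone.{n}" for n in range(THREADS)]

subprocess.run(cmd, check=True)
```

Repeating the same run with the export mounted over TCP (proto=tcp) gives the baseline that the figures below compare against.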

The client included an Intel(R) Core(TM) i5-3450S CPU @ 2.80GHz (one socket, four cores, HT disabled) and 16GB of non-ECC 1333MHz DDR3 RAM, together with a Mellanox ConnectX-5 100GbE NIC (SW version 16.20.1010) as the HCA, plugged into a PCIe 3.0 x16 slot.

The NFS server included an Intel(R) Xeon(R) CPU E5-1620 v4 @ 3.50GHz (one socket, four cores, HT disabled) and 64GB of 2400MHz DDR4 RAM, together with a Mellanox ConnectX-5 100GbE NIC (SW version 16.20.1010) as the HCA, plugged into a PCIe 3.0 x16 slot.

The client and the NFS server were connected over single 100GbE Mellanox LinkX® copper cables to a Mellanox Spectrum™ SN2700 switch with 32 x 100GbE ports, the lowest-latency Ethernet switch available on the market today, which makes it ideal for running latency-sensitive applications over Ethernet.

Below are the bandwidth and IOPS results that were measured over RoCE vs. TCP while running the IOzone tests.

Figure 2: Running NFS over RoCE enables 2X to 3X higher bandwidth (using 128KB block size, read and write with 16 threads, aggregate throughput)

Figure 3: NFS over RoCE enables up to 140% higher IOPS (using 8KB block size, read and write with 16 threads, aggregate IOPS)

Figure 4: NFS over RoCE enables up to 150% higher IOPS (using 2KB block size, read and write with 16 threads, aggregate IOPS). The difference between the 2KB and 8KB tests is that each 2KB I/O fits entirely in a single RDMA Send, whereas each 8KB I/O conveys the data payload with an RDMA Read or Write.

Conclusion

Running NFS over RDMA-enabled networks such as RoCE, which offload the data communication work from the CPU, generates a significant performance boost. As a result, Mellanox expects that NFS over RoCE will eventually replace NFS over TCP and become the leading transport technology in data centers.

 

Acknowledgement

Thanks to Chuck Lever for sharing his performance results and for his guidance.

 

About Motti Beck

Motti Beck is Sr. Director of Enterprise Market Development at Mellanox Technologies Inc. Before joining Mellanox, Motti was a founder of BindKey Technologies, an EDA startup that provided deep-submicron semiconductor verification solutions and was acquired by DuPont Photomasks, and of Butterfly Communications, a pioneering startup provider of Bluetooth solutions that was acquired by Texas Instruments. Prior to that, he was a Business Unit Director at National Semiconductor. Motti holds a B.Sc. in computer engineering from the Technion – Israel Institute of Technology. Follow Motti on Twitter: @MottiBeck
