Apache Spark™ is a fast, general-purpose, open-source engine for large-scale data processing. Spark provides an interface for programming entire clusters with implicit data parallelism and fault tolerance.
Apache Spark™ replaces MapReduce
MapReduce, as implemented in Hadoop, is a popular and widely used engine. Despite its popularity, MapReduce suffers from high latency, and its batch-mode response is a poor fit for many applications that process and analyze data. Apache Spark is a general-purpose engine like MapReduce, but it is designed to run much faster and to support many more types of workloads. One of the most interesting features of Spark is its efficient use of memory, whereas MapReduce has always worked primarily with data stored on disk.
Accelerating Spark Shuffle
Shuffling is the process of redistributing data across partitions (that is, repartitioning) between stages of computation. It is a costly process that should be avoided when possible. In Hadoop, shuffle writes intermediate files to disk, and the next step or stage pulls these files. With Spark shuffle, datasets are kept in memory, which keeps the data within easy reach. However, when working in a cluster, fetching data blocks requires network resources, which adds to overall execution time. The SparkRDMA plugin accelerates the network fetch of data blocks using RDMA/RoCE technology, which reduces CPU usage and overall execution time.
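To make the redistribution step concrete, the following is a minimal, framework-free sketch of a hash-based shuffle in plain Python. It is an illustration only and does not reflect Spark's or SparkRDMA's actual implementation; the function names `partition_for`, `split_into_blocks`, and `shuffle` are hypothetical. Each "map" partition splits its records into one block per "reduce" partition, and each reduce partition then collects its blocks from every map partition. In a real cluster, that collection step is the network fetch described above.

```python
import zlib


def partition_for(key, num_partitions):
    """Pick a target partition for a key using a deterministic hash."""
    # zlib.crc32 is used instead of hash() so results are stable across runs.
    return zlib.crc32(str(key).encode("utf-8")) % num_partitions


def split_into_blocks(map_partition, num_reducers):
    """Map side: split one partition's records into one block per reducer."""
    blocks = [[] for _ in range(num_reducers)]
    for key, value in map_partition:
        blocks[partition_for(key, num_reducers)].append((key, value))
    return blocks


def shuffle(map_partitions, num_reducers):
    """Redistribute records so that each key lands in exactly one reduce partition."""
    all_blocks = [split_into_blocks(p, num_reducers) for p in map_partitions]
    reduce_partitions = [[] for _ in range(num_reducers)]
    for reducer_id in range(num_reducers):
        for blocks in all_blocks:
            # In a cluster, this is where a block is fetched over the network.
            reduce_partitions[reducer_id].extend(blocks[reducer_id])
    return reduce_partitions


if __name__ == "__main__":
    map_side = [
        [("a", 1), ("b", 2)],
        [("a", 3), ("c", 4)],
        [("b", 5)],
    ]
    for i, part in enumerate(shuffle(map_side, num_reducers=2)):
        print(i, part)
```

With M map partitions and R reduce partitions, up to M × R blocks are exchanged, which is why the fetch path has such a large influence on overall execution time.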
The SparkRDMA plugin is a high-performance, scalable, and efficient open-source ShuffleManager plugin for Apache Spark.
It uses RDMA/RoCE technology to reduce the CPU cycles needed for shuffle data transfers, and it reduces memory usage by reusing memory for transfers rather than copying data multiple times, as the traditional TCP stack does.
The SparkRDMA plugin is built to provide the best performance out of the box. It also provides multiple configuration options for further tuning SparkRDMA on a per-job basis.
SparkRDMA plugin Benefits
- Provides improved performance
• Lower block transfer times
• Lower memory consumption
• Lower CPU utilization
- Easy to deploy
• Single JAR file
• Enabled with a simple configuration setting
• Finer tuning available
• Can be deployed incrementally
• Can be limited to Shuffle-intensive jobs
- Supported on all RDMA-capable ConnectX family products
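
As a rough illustration of the deployment points above (single JAR, simple configuration, limited to shuffle-intensive jobs), the sketch below builds per-job `spark-submit` arguments in Python. The properties `spark.shuffle.manager`, `spark.driver.extraClassPath`, and `spark.executor.extraClassPath` are standard Spark settings. The shuffle manager class name and the JAR path are assumptions made for this example and should be checked against the SparkRDMA release being deployed; the helper names `rdma_shuffle_conf` and `to_submit_args` are hypothetical.

```python
# Assumed class name; verify against the documentation of the SparkRDMA release in use.
ASSUMED_SHUFFLE_MANAGER = "org.apache.spark.shuffle.rdma.RdmaShuffleManager"


def rdma_shuffle_conf(jar_path, enable_rdma):
    """Return the Spark properties for one job.

    Jobs that are not shuffle-intensive pass enable_rdma=False and keep
    Spark's default shuffle manager, which allows incremental deployment.
    """
    if not enable_rdma:
        return {}
    return {
        # The plugin JAR must be visible to both the driver and the executors.
        "spark.driver.extraClassPath": jar_path,
        "spark.executor.extraClassPath": jar_path,
        "spark.shuffle.manager": ASSUMED_SHUFFLE_MANAGER,
    }


def to_submit_args(conf):
    """Turn a property dict into a list of '--conf key=value' arguments."""
    args = []
    for key in sorted(conf):
        args.extend(["--conf", f"{key}={conf[key]}"])
    return args


if __name__ == "__main__":
    # Placeholder path; substitute the location of the actual plugin JAR.
    jar = "/path/to/spark-rdma.jar"
    print(" ".join(["spark-submit"] + to_submit_args(rdma_shuffle_conf(jar, True))))
```

Keeping the properties in a per-job helper, rather than in a cluster-wide defaults file, is one way to enable the plugin only for the jobs that benefit from it.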