Data volume, velocity, variety, veracity and value are the dimensions by which Big Data deployments are measured. Hadoop, as the dominant Big Data platform, helps organizations improve operations, launch new business and advance research. A Hadoop cluster's storage capacity, ingest rate and analytics speed are the metrics that today's data scientists work to improve.
Legacy Ethernet networks are no longer capable of delivering the performance required for Hadoop clusters. Multi-socket, multicore servers have outgrown legacy Gigabit Ethernet capacity even with multiple aggregated links.
Selecting the network and its required capabilities is one of the major challenges in building a Hadoop cluster, because workloads vary widely. The goal is to provision enough network capacity for all nodes in the cluster to communicate with each other at optimal rates. Mellanox's end-to-end Hadoop networking solutions deliver the performance needed to eliminate current and future bottlenecks. Whether Ethernet or InfiniBand, Mellanox switches, cables and adapter cards provide enough bandwidth to sustain the throughput of today's advanced disk controllers. Their low-latency features raise NoSQL database performance to new levels. Sub-microsecond latency reduces HBase query response time and makes the response-time distribution more predictable.
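The sizing question above comes down to simple arithmetic: compare a node's aggregate disk throughput with the raw capacity of its network links. The sketch below is only an illustration. The disk count and per-disk throughput are hypothetical placeholders rather than measured figures, the function names are invented for this example, and protocol overhead is ignored.

```python
import math


def link_capacity_mb_per_s(gbit_per_s: float) -> float:
    """Raw link rate in MB/s (1 Gb/s = 125 MB/s, protocol overhead ignored)."""
    return gbit_per_s * 1000 / 8


def node_disk_throughput_mb_per_s(num_disks: int, mb_per_s_per_disk: float) -> float:
    """Aggregate sequential throughput of all disks in one node."""
    return num_disks * mb_per_s_per_disk


def links_needed(disk_mb_per_s: float, link_gbit_per_s: float) -> int:
    """Number of links of the given speed needed to carry the disk throughput."""
    return math.ceil(disk_mb_per_s / link_capacity_mb_per_s(link_gbit_per_s))


if __name__ == "__main__":
    # Assumed node: 12 disks at 100 MB/s each (illustrative values only).
    disks = node_disk_throughput_mb_per_s(12, 100.0)
    for speed in (1, 10, 40, 56):
        print(f"{speed} Gb/s links needed: {links_needed(disks, speed)}")
```

Under these assumed values, a single node would need ten aggregated Gigabit Ethernet links to keep up with its disks, while one 10Gb/s link would be enough.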
New interactive frameworks built on top of Apache Hadoop provide near-real-time data analysis. To handle large amounts of data, these frameworks require low-latency, high-throughput connectivity. Linear scalability is a requirement for keeping up with business growth, as data volume increases exponentially while sales and revenue grow linearly. Advances in microprocessor technology mean that multi-core processors and servers demand a higher-throughput network to keep their CPUs fed. InfiniBand and Ethernet high-bandwidth technologies deliver up to 56Gb/s of bandwidth, and each has RDMA (Remote Direct Memory Access) capabilities to offload data movement. Mellanox's Unstructured Data Accelerator (UDA) is an open-source project that brings RDMA offloads to the Hadoop MapReduce framework.
UDA is a software plugin that accelerates Hadoop networks and improves the scaling of Hadoop clusters running data-intensive applications such as analytics. It implements a novel data shuffle and merge protocol that uses RDMA to run an efficient merge-sort algorithm over Mellanox InfiniBand and 10Gb/40Gb/56Gb Ethernet RoCE (RDMA over Converged Ethernet) adapter cards. UDA is integrated into Apache Hadoop 2.0.x and 3.0.x, and patches are available for all binary-compatible distributions. Installing UDA is a simple matter of adding an RPM to the execution library.
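To make the merge step of a shuffle more concrete, here is a minimal sketch of a generic k-way merge over sorted segments, the kind of work a reducer does once sorted map outputs have arrived. It is not UDA's implementation: the RDMA transport is not modeled, and the `Record` type and `merge_sorted_segments` name are hypothetical, chosen only for this example.

```python
import heapq
from typing import Iterable, Iterator, List, Tuple

# Hypothetical record type: (key, value) pairs, each segment pre-sorted by key.
Record = Tuple[str, int]


def merge_sorted_segments(segments: List[Iterable[Record]]) -> Iterator[Record]:
    """Lazily merge several key-sorted segments into one key-sorted stream."""
    iterators = [iter(segment) for segment in segments]
    heap = []
    # Seed the heap with the first record of every non-empty segment.
    for index, iterator in enumerate(iterators):
        first = next(iterator, None)
        if first is not None:
            # The segment index breaks ties between equal keys.
            heap.append((first[0], index, first[1]))
    heapq.heapify(heap)

    while heap:
        key, index, value = heapq.heappop(heap)
        yield key, value
        # Refill from the segment that just produced the smallest key.
        following = next(iterators[index], None)
        if following is not None:
            heapq.heappush(heap, (following[0], index, following[1]))


if __name__ == "__main__":
    map_outputs = [
        [("a", 1), ("c", 3)],
        [("b", 2), ("c", 4)],
        [],
    ]
    print(list(merge_sorted_segments(map_outputs)))
```

Because the heap holds only one record per segment at a time, memory use scales with the number of segments rather than the total data size.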
Depending on the analytics workload, UDA can double data processing throughput and cut total job execution time in half. Because RDMA offloads network protocol processing from the CPU, CPU efficiency increases and clusters gain a corresponding increase in compute efficiency. A higher-bandwidth scale-out architecture based on RDMA over InfiniBand and Ethernet provides a single consolidated network pipe for transferring larger datasets across the wire.
UDA Key Advantages
- Leverages the world's fastest interconnect: FDR InfiniBand or 10Gb/40Gb/56Gb Ethernet
- Increases Hadoop MapReduce efficiency by processing data with RDMA technology and an efficient merge-sort algorithm
- Lowers total job execution time
- Provides a lossless, scalable fabric solution
The UDA acceleration kit is available at the following Google Code repository. Mellanox will update the repository from time to time to provide the community with a powerful and efficient data analytics tool.
UDA code and binaries are available at the following URL. We welcome your contributions.
UDA 3.1 is jointly developed by the Parallel Architecture and System Laboratory headed by Dr. Weikuan Yu from Auburn University and Mellanox.