Hadoop

Volume, velocity, variety, veracity, and value are the dimensions by which Big Data deployments are measured. Hadoop, as the dominant Big Data application, helps organizations improve existing operations, start new lines of business, and advance research. A Hadoop cluster's storage capacity, ingest rate, and analytics speed are central concerns for today's data scientists. Legacy Ethernet networks can no longer deliver the performance that Hadoop clusters require: multi-socket, multicore servers have outgrown legacy Gigabit Ethernet capacity, even with multiple aggregated links.

Selecting a network and its required capabilities is one of the major challenges in building a Hadoop cluster, because workloads vary significantly. The goal is to buy enough network capacity that all nodes in the cluster can communicate with each other at optimal rates. Mellanox's end-to-end Hadoop networking solutions deliver the performance needed to eliminate current and future bottlenecks. Whether Ethernet or InfiniBand, Mellanox switches, cables, and adapter cards provide enough bandwidth to sustain the throughput of today's advanced disk controllers. The low-latency features drive NoSQL database performance to new highs. Sub-microsecond latency reduces HBase query response time and makes the response-time distribution more predictable.
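The sizing argument above can be checked with simple arithmetic. The sketch below compares the aggregate disk throughput of one node against nominal link rates. The disk count and per-disk throughput are illustrative assumptions, not figures from this page, and the link rates are nominal values that ignore protocol and encoding overhead.

```python
# Illustrative assumptions only; replace with measured values for a real cluster.
DISKS_PER_NODE = 12      # assumed number of data disks in one node
DISK_MB_PER_S = 150.0    # assumed sequential throughput per disk, in MB/s

# Nominal link rates in Gb/s (protocol and encoding overhead ignored).
LINK_RATES_GBPS = {
    "1GbE": 1,
    "4x1GbE (aggregated)": 4,
    "10GbE": 10,
    "40GbE": 40,
    "56Gb/s FDR InfiniBand": 56,
}


def disk_throughput_gbps(disks, mb_per_s):
    """Aggregate disk throughput in Gb/s (1 MB = 8 Mb, 1000 Mb = 1 Gb)."""
    return disks * mb_per_s * 8 / 1000


def links_that_keep_up(disks, mb_per_s, link_rates):
    """Names of links whose nominal rate is at least the node's disk throughput."""
    need = disk_throughput_gbps(disks, mb_per_s)
    return [name for name, rate in link_rates.items() if rate >= need]


if __name__ == "__main__":
    need = disk_throughput_gbps(DISKS_PER_NODE, DISK_MB_PER_S)
    print(f"Disk throughput per node: {need:.1f} Gb/s")
    print("Links that keep up:", links_that_keep_up(
        DISKS_PER_NODE, DISK_MB_PER_S, LINK_RATES_GBPS))
```

Under these assumed figures, a single node's disks can stream more data than Gigabit Ethernet carries even with four aggregated links, which is the situation described above.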

New interactive frameworks, on top of Apache Hadoop, provide near-real-time data analysis performance. To handle large amounts of data, these frameworks require low-latency, high-throughput connectivity. Linear scalability is required to keep pace with business growth, because data volume grows exponentially while sales and revenue grow linearly. With advances in microprocessor technology, multi-core processors and servers create a demand for higher-throughput networks to keep the CPUs fed with data. InfiniBand and Ethernet high-bandwidth technologies deliver up to 56Gb/s of bandwidth, and each has RDMA (Remote Direct Memory Access) capabilities to offload data movement. Mellanox's Unstructured Data Accelerator (UDA) is an open-source project that brings RDMA offloads to the Hadoop MapReduce framework.

UDA

UDA is a software plugin that accelerates Hadoop networks and improves the scaling of Hadoop clusters that run data-intensive applications such as data analytics. A novel data shuffle and merge protocol, UDA uses RDMA to implement an efficient merge-sort algorithm over Mellanox InfiniBand and 10Gb/40Gb/56Gb Ethernet RoCE (RDMA over Converged Ethernet) adapter cards. UDA is integrated into Apache Hadoop 2.0.x and 3.0.x, and patches are available for all binary-compatible distributions. Installation is a simple RPM addition to the execution library.
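To make the merge step concrete, here is a minimal sketch of a k-way merge over sorted map-output segments, followed by grouping values per key for the reducer. It is a generic illustration written in Python with the standard library, not UDA's implementation: it involves no RDMA, and the function names and sample data are hypothetical.

```python
import heapq
from itertools import groupby


def merge_sorted_segments(segments):
    """Merge already-sorted (key, value) segments into one sorted list.

    Each segment stands in for the sorted output one map task produced
    for this reducer. heapq.merge performs a k-way merge and is stable:
    for equal keys, records from earlier segments come first.
    """
    return list(heapq.merge(*segments, key=lambda kv: kv[0]))


def group_for_reduce(merged):
    """Group a key-sorted record list into {key: [values]} for a reducer."""
    return {
        key: [value for _, value in records]
        for key, records in groupby(merged, key=lambda kv: kv[0])
    }


# Hypothetical sample data: three sorted segments from three map tasks.
segments = [
    [("apple", 1), ("cherry", 1)],
    [("banana", 1), ("cherry", 2)],
    [("apple", 3), ("date", 1)],
]

merged = merge_sorted_segments(segments)
grouped = group_for_reduce(merged)

if __name__ == "__main__":
    print(merged)
    print(grouped)
```

Because each segment arrives already sorted, the merge never needs to re-sort the full dataset. It only has to compare the current head record of each segment.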

UDA Performance

Depending on the analytics workload, UDA can double data processing throughput and cut total job execution time in half. Because RDMA offloads network protocol processing from the CPU, more CPU cycles remain for computation, so clusters use their compute power more efficiently. Higher bandwidth in a scale-out architecture based on RDMA over InfiniBand and Ethernet provides a single consolidated network pipe for transferring larger datasets across the wire.

UDA Key Advantages

  • Leverages the world's fastest interconnect: FDR InfiniBand or 10Gb/40Gb/56Gb Ethernet
  • Increases Hadoop MapReduce efficiency by processing data with RDMA technology and an efficient merge-sort algorithm
  • Reduces total job execution time
  • Lossless, scalable fabric solution

UDA Availability

The UDA acceleration kit, including code and binaries, is available at the Google Code repository below. Mellanox updates the repository from time to time to provide the community with a powerful and efficient data analytics tool. Contributions are welcome.

https://code.google.com/p/uda-plugin

UDA 3.1 is jointly developed by Mellanox and the Parallel Architecture and System Laboratory at Auburn University, headed by Dr. Weikuan Yu.