Tag Archives: Hadoop

Deploying Hadoop on Top of Ceph, Using FDR InfiniBand Network

We recently posted a whitepaper, “Deploying Ceph with High Performance Networks,” on using Ceph as a block storage device. In this post, we review the advantages of using CephFS as an alternative to HDFS.

Hadoop has become a leading programming framework in the big data space. Organizations are replacing several traditional architectures with Hadoop and using it as a storage, database, business intelligence and data warehouse solution. Enabling a single file system for Hadoop and other programming frameworks benefits users who need dynamic scalability of compute and/or storage capabilities.

Continue reading

Advantages of RDMA for Big Data Applications

Hadoop MapReduce is the leading Big Data analytics framework. It enables data scientists to process data at volumes and varieties never handled before, yielding new business creation and operational efficiency.

As MapReduce and Hadoop advance, more organizations try to push the frameworks toward near real-time capabilities. Leveraging RDMA (Remote Direct Memory Access) to accelerate Hadoop MapReduce has proven to be a successful approach.

In our presentation at Oracle Open World 2013, we show the advantages RDMA brings to enterprises deploying Hadoop and other Big Data applications:

- Double analytics performance by accelerating the MapReduce framework

- Double Hadoop file system ingress capabilities

- Reduce NoSQL databases’ latencies by 30%

On the analytics side, UDA (Unstructured Data Accelerator) doubles computation power by offloading networking and buffer copying from the server’s CPU to the network controller. In addition, a novel shuffle-and-merge approach helps achieve the needed performance acceleration. UDA is an open source package, available at https://code.google.com/p/uda-plugin/. The HDFS (Hadoop Distributed File System) layer is also getting its share of the performance boost.

While the community continues to improve the feature, work conducted at Ohio State University brings RDMA capabilities to the HDFS data ingress process. Initial testing shows over 80% improvement in the data write path to the HDFS repository. The RDMA HDFS acceleration research and downloadable package are available from the Ohio State University website at http://hadoop-rdma.cse.ohio-state.edu/

We expect RDMA acceleration to be enabled for more Big Data frameworks in the future. If you have a good use case, we will be glad to discuss the need and help with the implementation.

Contact us through the comments section below or at bigdata@mellanox.com

 

Eyal Gutkind
Author: Eyal Gutkind is a Senior Manager, Enterprise Market Development at Mellanox Technologies, focusing on Web 2.0 and Big Data applications. He has held several engineering and management roles at Mellanox Technologies over the last 11 years. He holds a BSc in Electrical Engineering from Ben Gurion University in Israel and an MBA from the Fuqua School of Business at Duke University, North Carolina.

Product Flash: DDN hScaler Hadoop Appliance

 

Of the many strange-sounding application and product names in the industry today, Hadoop remains one of the most recognized. Why? We’ve talked about the impact that data creation, storage and management is having on the overall business landscape; it’s the quintessential Big Data problem. Since all that data has no value unless it’s made useful and actionable through analysis, a variety of Big Data analytics software and hardware solutions have been created. The most popular solution on the software side is, of course, Hadoop. Recently, however, DDN announced an exciting new integrated solution to solve the Big Data equation: hScaler.

 

Based on DDN’s award-winning SFA 12K architecture, hScaler is the world’s first enterprise Hadoop appliance.  Unlike many Hadoop installations, hScaler is factory-configured and simple to deploy, eliminating the need for trial-and-error approaches that require substantial expertise and time to configure and tune.  The hScaler can be deployed in a matter of hours, compared to homegrown approaches requiring weeks or even months, allowing enterprises to focus on their actual business, and not the mechanics of the Hadoop infrastructure.


DDN hScaler

 

Performance-wise, the hScaler is no slouch. Accelerating the Hadoop shuffle phase with Mellanox InfiniBand and 40GbE RDMA interconnects, combined with ultra-dense storage and an efficient processing infrastructure, delivers results up to 7x faster than typical Hadoop installations. That means quicker time-to-insight and a more competitive business.

 

For enterprise installations, hScaler includes an integrated ETL engine, over 200 connectors for data ingestion and remote manipulation, high availability and management through DDN’s DirectMon framework.  Independently scalable storage and compute resources provide additional flexibility and cost savings, as organizations can choose to provision to meet only their current needs, and add resources later as their needs change.  Because hScaler’s integrated architecture is four times as dense as commodity installations, additional TCO dollars can be saved in floorspace, power and cooling.

 

Overall, hScaler looks to be a great all-in-one, plug-n-play package for enterprise organizations that need Big Data results fast, but don’t have the time, resources or desire to build an installation from the ground up.

 

Find out more about the hScaler Hadoop Appliance at DDN’s website: http://www.ddn.com/en/products/hscaler-appliance and http://www.ddn.com/en/press-releases/2013/new-era-of-hadoop-simplicity

 

Don’t forget to join the Mellanox Storage Community: http://community.mellanox.com/groups/storage

 

New Territories Explored: Distributed File System Hadoop

It took me a while, but I’m back – I hope you’ve all been waiting to hear from me.

With that, I’ve decided to venture into uncharted territory… Hadoop.

 

Hadoop is an Apache project. It is a framework, written in Java, for running applications on large clusters built with commodity hardware (distributed computing). Hadoop implements a computational paradigm named Map/Reduce, where the application is divided into many small fragments of work, each of which may be executed or re-executed on any node in the cluster.

 

Hadoop provides a distributed file system (HDFS) that stores data on the compute nodes, providing very high aggregate bandwidth across the cluster. The two key functions in Hadoop are map and reduce. I would like to briefly touch on what they mean.

 

The map function processes a key/value pair to generate a set of intermediate key/value pairs, while the reduce function merges all intermediate values associated with the same intermediate key. A MapReduce job usually splits the input data set into independent chunks, which the map tasks process in a completely parallel manner; the framework then sorts the outputs of the maps, which become the input to the reduce tasks. Typically both the input and the output of the job are stored in a file system, which in Hadoop follows a master/slave architecture.
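The map/sort/reduce flow described above can be sketched in a few lines of Python. This is a toy single-process word count, not Hadoop code; the function names are illustrative only:

```python
from collections import defaultdict

def map_fn(_, line):
    # Map: emit an intermediate (word, 1) pair for each word in the line.
    for word in line.split():
        yield word, 1

def reduce_fn(word, counts):
    # Reduce: merge all intermediate values associated with the same key.
    return word, sum(counts)

def map_reduce(records):
    # "Shuffle": group intermediate pairs by key, standing in for the
    # framework's sort of the map outputs before they reach the reducers.
    groups = defaultdict(list)
    for key, value in records:
        for k, v in map_fn(key, value):
            groups[k].append(v)
    return dict(reduce_fn(k, vs) for k, vs in groups.items())

lines = enumerate(["the quick brown fox", "the lazy dog"])
print(map_reduce(lines))
# {'the': 2, 'quick': 1, 'brown': 1, 'fox': 1, 'lazy': 1, 'dog': 1}
```

In a real cluster, the map calls run in parallel on the nodes holding each input chunk, and the grouped values are streamed to reduce tasks over the network.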

 

The test we conducted was the DFSIO benchmark, a MapReduce job in which each map task opens a file, writes to or reads from it, closes it, and measures the I/O time. A single reduce task aggregates the individual times and sizes. We limited the test to measurements of 10 to 50 files of 360 MB each, which we found reasonable given the ratio of the number of nodes used to the number of files. We then compared this to results published by Yahoo, which used 14K files over 4K nodes. That boils down to 3.5 files per node, while we used 50 files over 12 nodes, which equates to over 4 files per node.
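The files-per-node comparison above is simple arithmetic; as a quick sanity check (figures taken directly from the paragraph above):

```python
# Files-per-node ratios from the DFSIO comparison above.
yahoo_files, yahoo_nodes = 14_000, 4_000
our_files, our_nodes = 50, 12

yahoo_ratio = yahoo_files / yahoo_nodes  # 3.5 files per node
our_ratio = our_files / our_nodes        # ~4.17 files per node

print(f"Yahoo: {yahoo_ratio:.2f} files/node, ours: {our_ratio:.2f} files/node")
```

So the load per node in our 12-node test is slightly heavier than in the published Yahoo run, making the comparison fair.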

 

Given the configuration and the test described above, here is a snapshot of the results we’ve seen:

 

 



It can clearly be seen from the above, as well as from other results we’ve gathered, that InfiniBand and 10GigE (via our ConnectX adapters) cut execution time in half and more than triple the bandwidth… these are very conclusive results by any metric.

 

A very interesting point to review: the tests executed with DFS located on a hard disk already showed significantly better performance, but when testing with a RAM disk, the gap increased even more, e.g., latency dropped from one-half to one-third… it seems like a clear way to unleash the potential.

 

In my next blog post I plan to review either a new application or another aspect of this one.

 

 

Nimrod Gindi

Director of Corporate Strategy

nimrodg@mellanox.com