Tag Archives: InfiniBand

The Automotive Makers Require Better Compute Simulations Capabilities

This week I presented at the LS-DYNA user conference. LS-DYNA is one of the most widely used applications for automotive computer simulations – simulations that are used throughout the vehicle design process and decrease the need to build expensive physical prototypes. Computer simulation has shortened the vehicle design cycle from years to months, and is responsible for cost reduction throughout the process. Almost every part of the vehicle is designed with computer-aided simulations, from crash/safety simulation to engine and gasoline flow, from air conditioning to water pumps.

Today’s challenges in vehicle simulation revolve around the motivation to build more economical and ecological designs, and around how to design lighter vehicles (using less material) while meeting increased safety regulation demands. For example, national and international standards have been put in place that set structural crashworthiness requirements for railway vehicle bodies.

Meeting all of those requirements and demands calls for higher compute simulation capability. It is no surprise that LS-DYNA is mostly used in high-performance clustering environments, as they provide the flexibility, scalability and efficiency such simulations need. Increasing high-performance clustering productivity and the capability to handle more complex simulations is the most important factor for automotive makers today. It requires a balanced cluster design (hardware – CPU, memory, interconnect, GPU; and software), enhanced messaging techniques and knowledge of how to increase the productivity of a given design.

For LS-DYNA, InfiniBand interconnect-based solutions have been proven to provide the highest productivity compared to Ethernet (GigE, 10GigE, iWARP). With InfiniBand, LS-DYNA demonstrated high parallelism and scalability, which enabled it to take full advantage of multi-core high-performance computing clusters. Among the Ethernet options of GigE, 10GigE and iWARP, the better choice is 10GigE. While iWARP aims to provide better performance, typical high-performance applications use send-receive semantics, for which iWARP provides no added value. Even worse, it just increases the complexity and the CPU overhead/power consumption.

If you want a copy of a paper that presents ways to increase simulation productivity while decreasing power consumption, don’t hesitate to send me a note (hpc@mellanox.com).

Gilad Shainer
shainer@mellanox.com

Web Retailer Uses InfiniBand to Improve Response Time to Its Customers

Recently while talking with an IT operations manager for a major Web retailer, I was enlightened on the importance of reducing latency in web-based applications. He explained that they were challenged to find a way to reduce the response time to their web customers. They investigated this for quite some time before discovering that the major issue seemed to be the time it takes to initiate a TCP transaction between their app servers and database servers. Subsequently their search focused on finding the best interconnect fabric to minimize this time.

Well, they found it in InfiniBand. With its 1 microsecond latency between servers, this web retailer saw a tremendous opportunity to improve response time to its customers. In their subsequent proof-of-concept testing, they found that they could indeed reduce latency between their app servers and database servers. The resulting improvement in response time for their customers is over 30%. This is a huge advantage in their highly competitive market. I would tell you who they are but they would probably shoot me.

More and more enterprise data centers are finding that low latency, high-performance interconnects, like InfiniBand, can improve their customer-facing systems and their resulting web business.

If you want to hear more, or try it for yourself, send me an email.

Thanks,

Wayne Augsburger
Vice President of Business Development
wayne@mellanox.com

I/O Virtualization

I/O virtualization is a complementary solution to server and storage virtualization, which aims to reduce the management complexity of physical connections in and out of virtual hosts. Virtualized data center clusters will have multiple networking connections to LAN and SAN, and virtualizing the network avoids the extra complexity associated with them. While I/O virtualization reduces the management complexity, in order to maintain high productivity and scalability one should pay attention to other characteristics of the network being virtualized.

Offloading the network virtualization from the VMM (virtual machine manager, e.g. the hypervisor) to a smart networking adapter not only reduces the CPU overhead associated with virtualization management, but also increases the performance capability of the virtual machines (or guest OSs) and can give them native performance.

The PCI-SIG has standards in place to help simplify I/O virtualization. The most interesting solution is Single Root I/O Virtualization (SR-IOV). SR-IOV allows a smart adapter to create multiple virtual adapters (virtual functions) for a given physical server. The virtual adapters can be assigned directly to a virtual machine (VM) instead of relying on the VMM to manage everything.

SR-IOV provides a standard mechanism for devices to advertise their ability to be simultaneously shared among multiple virtual machines. SR-IOV allows the partitioning of PCI functions into many virtual interfaces for the purpose of sharing the resources of a PCI device in a virtual environment.
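
As a rough illustration of how a device advertises this ability, the sketch below lists SR-IOV capable devices on a Linux host. It assumes the standard Linux sysfs files `sriov_totalvfs` (the maximum number of virtual functions a device supports) and `sriov_numvfs` (the number currently enabled); the function name and the `sysfs_root` parameter are made up for this example, and nothing here is specific to any one vendor's adapter.

```python
from pathlib import Path


def sriov_capabilities(sysfs_root="/sys/bus/pci/devices"):
    """Return {pci_address: (enabled_vfs, total_vfs)} for SR-IOV capable devices.

    sysfs_root is a parameter only so the function can be pointed at a test
    directory; on a real Linux host the default is the usual location.
    """
    result = {}
    root = Path(sysfs_root)
    if not root.is_dir():
        return result  # not Linux, or sysfs is not mounted
    for dev in sorted(root.iterdir()):
        total_file = dev / "sriov_totalvfs"
        num_file = dev / "sriov_numvfs"
        # Only physical functions that advertise SR-IOV expose these files.
        if total_file.is_file() and num_file.is_file():
            total = int(total_file.read_text().strip())
            enabled = int(num_file.read_text().strip())
            result[dev.name] = (enabled, total)
    return result


if __name__ == "__main__":
    for address, (enabled, total) in sriov_capabilities().items():
        print(f"{address}: {enabled} of {total} virtual functions enabled")
```

On a host without SR-IOV capable devices the script simply prints nothing. Each enabled virtual function shows up as its own PCI device, which is what allows it to be handed directly to a virtual machine.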

Mellanox interconnect solutions provide full SR-IOV support while adding the scalability and high throughput required to effectively support multiple virtual machines on a single physical server. With Mellanox 10GigE or 40Gb/s InfiniBand solutions, each virtual machine can get the bandwidth allocation it needs to ensure the highest productivity and performance, just as if it were a physical server.

Gilad Shainer
Director of Technical Marketing
gilad@mellanox.com

SSD over InfiniBand

Last week I was at Storage Networking World in Orlando, Florida. The sessions were much better organized, with a focus on popular topics like Cloud Computing, Storage Virtualization and Solid State Storage (SSD). In our booth, we demonstrated our Layer 2 agnostic storage supporting iSCSI, FCoE (Fibre Channel over Ethernet) and SRP (SCSI RDMA Protocol), all coexisting in a single network. We partnered with Rorke Data, who demonstrated a 40Gb/s InfiniBand-based storage array, and with Texas Memory Systems, whose ‘World’s Fastest Storage’ demonstrated sustained rates of 3Gb/s and over 400K I/Os using Solid State Drives in our booth.

I attended a few of the sessions in the SSD and Cloud Computing streams. SSD was my favorite topic, primarily because InfiniBand and SSD together will provide the highest storage performance and have the potential to carve out a niche in the data center OLTP applications market. The presentation on SSD by Clod Barrera, IBM’s Chief Technical Storage Strategist, was very good. He had a chart showing how HDD I/O rates per GByte have dropped very low, with I/O rates currently staying constant at around 150 to 200 I/Os per drive. In contrast, SSDs are capable of 50K I/Os on reads and 17K I/Os on writes. Significant synergy can be achieved by combining SSD with InfiniBand technology. InfiniBand delivers the lowest latency, at under 1us, and the highest bandwidth, at 40Gb/s. The combination of these technologies will provide significant value in the data center and has the potential to change the database and OLTP storage infrastructure.
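
To put those I/O figures in perspective, here is a small back-of-the-envelope sketch that works out how many hard drives it would take to match a single SSD. It uses only the numbers quoted above and assumes they are per-device rates measured the same way; the function name is made up for the example, and the comparison ignores capacity, latency and cost.

```python
def drives_to_match(target_ios, ios_per_drive):
    """Smallest whole number of drives whose combined I/O rate reaches target_ios."""
    return (target_ios + ios_per_drive - 1) // ios_per_drive  # integer ceiling


# Figures quoted from the presentation.
SSD_READ_IOS = 50_000
SSD_WRITE_IOS = 17_000
HDD_IOS_RANGE = (150, 200)

if __name__ == "__main__":
    for label, ssd_ios in (("read", SSD_READ_IOS), ("write", SSD_WRITE_IOS)):
        for hdd_ios in HDD_IOS_RANGE:
            count = drives_to_match(ssd_ios, hdd_ios)
            print(f"{label}: {count} HDDs at {hdd_ios} I/Os each to match one SSD")
```

With these figures, matching one SSD on reads takes somewhere between 250 and 334 hard drives, which is why the interconnect in front of the SSD starts to matter.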

SSD over InfiniBand delivers:

-  Ultra-fast, lowest latency infrastructure for transaction processing applications

-  A more compelling Green per GB

-  Faster recovery time for business continuity applications

-  Disruptive scaling

I see a lot of opportunity for InfiniBand technology in the storage infrastructure, as SSD provides the much-needed discontinuity from rotating media.

TA Ramanujam (TAR)
tar@mellanox.com

Unleashing Performance, Scalability and Productivity with Intel Xeon 5500 Processors “Nehalem”

The industry has been talking about it for a long time, but on March 30th, it was officially announced. The new Xeon 5500 “Nehalem” platform from Intel has introduced a totally new concept of server architecture for Intel-based platforms. The memory has moved from being connected to the chipset to being connected directly to the CPU, and the memory speed has increased. More importantly, PCI-Express (PCIe) Gen2 can now be fully utilized to unleash new performance and efficiency levels from Intel-based platforms. PCIe Gen2 is the interface that connects the CPU and memory to the networking that links servers together to form compute clusters. With PCIe Gen2 now integrated in compute platforms from the majority of OEMs, more data can be sent and received by a single server or blade. This means that applications can exchange data faster and complete simulations much faster, bringing a competitive advantage to end-users. In order to feed PCIe Gen2, one needs a big pipe for the networking solution, and this is what InfiniBand 40Gb/s brings to the table. It is no surprise that multiple server OEMs (for example HP and Dell) have announced the availability of 40Gb/s InfiniBand in conjunction with Intel’s announcement.

 

I have been testing several applications to compare the performance benefits of Intel Xeon 5500 processors and Mellanox end-to-end 40Gb/s networking solutions. One of those applications was the Weather Research and Forecasting (WRF) application, widely used around the world. With Intel Xeon 5500-based servers, Mellanox 40Gb/s ConnectX InfiniBand adapters and the MTS3600 36-port 40Gb/s InfiniBand switch system, we witnessed a 100% increase in performance and productivity over previous Intel platforms.

With a digital media rendering application – Direct Transport Compositor – we have seen a 100% increase in frames per second delivered, while increasing the screen anti-aliasing at the same time. Other applications have shown similar levels of performance and productivity boost as well.

 

The reasons for the new performance levels are the decrease in latency (1usec) and the huge increase in throughput (more than 3.2GB/s uni-directional and more than 6.5GB/s bi-directional on a single InfiniBand port). With the increase in the number of CPU cores, and the new server architecture, bigger pipes in and out of the servers are required in order to keep the system balanced and to avoid creating artificial bottlenecks. Another advantage of InfiniBand is its ability to use RDMA and transfer data directly to and from the CPU memory, without involving the CPU in the data transfer. This means one thing only – more CPU cycles can be dedicated to the applications!
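
As a rough sanity check on those throughput numbers, the sketch below estimates the usable bandwidth of the two links in the data path. The assumptions are not from the measurements above: a 40Gb/s InfiniBand port signaling with 8b/10b encoding, and an adapter sitting in a PCIe Gen2 x8 slot (5 GT/s per lane, also 8b/10b). The function names are made up for the example.

```python
def payload_gbytes_per_sec(signal_gbits_per_sec, data_bits=8, encoded_bits=10):
    """Convert a raw signaling rate (Gb/s) to usable payload bandwidth (GB/s)."""
    payload_gbits = signal_gbits_per_sec * data_bits / encoded_bits
    return payload_gbits / 8  # 8 bits per byte


def link_efficiency(measured_gbytes_per_sec, signal_rates_gbits_per_sec):
    """Measured throughput as a fraction of the slowest link in the path."""
    ceiling = min(payload_gbytes_per_sec(r) for r in signal_rates_gbits_per_sec)
    return measured_gbytes_per_sec / ceiling


if __name__ == "__main__":
    infiniband_40g = 40.0    # Gb/s signaling on one port (assumed 8b/10b)
    pcie_gen2_x8 = 5.0 * 8   # 5 GT/s per lane, 8 lanes (assumed slot width)
    measured = 3.2           # GB/s uni-directional, from the measurements above
    print("InfiniBand payload ceiling (GB/s):", payload_gbytes_per_sec(infiniband_40g))
    print("PCIe Gen2 x8 payload ceiling (GB/s):", payload_gbytes_per_sec(pcie_gen2_x8))
    print("Fraction of ceiling reached:",
          link_efficiency(measured, [infiniband_40g, pcie_gen2_x8]))
```

Under these assumptions both links top out at about 4GB/s of payload in each direction, so a measured 3.2GB/s is roughly 80% of the ceiling. The rest is plausibly protocol overhead on the two links.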

 

Gilad Shainer

Director, HPC Marketing

New Territories Explored: Distributed File System Hadoop

It took me a while but I’m back – hope you’ve all been waiting to hear from me.

With that, I’ve decided to go into uncharted territories…HADOOP

 

Hadoop is an Apache project. It is a framework, written in Java, for running applications on large clusters built with commodity hardware (distributed computing). Hadoop implements a computational paradigm named Map/Reduce, where the application is divided into many small fragments of work, each of which may be executed or re-executed on any node in the cluster.

 

Hadoop provides a distributed file system (HDFS) that stores data on the compute nodes, providing very high aggregate bandwidth across the cluster. The two key functions in Hadoop are map and reduce. I would like to briefly touch on what they mean.

 

The map function processes a key/value pair to generate a set of intermediate key/value pairs, while the reduce function merges all intermediate values associated with the same intermediate key. A map-reduce job usually splits the input data set into independent chunks, which are processed by the map tasks in a completely parallel manner. The framework sorts the outputs of the maps, which are then input to the reduce tasks. Typically, both the input and the output of the job are stored in a file system, and the framework itself follows a master/slave architecture.
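
The map, sort and reduce steps described above can be sketched in a few lines of plain Python. This is a single-process toy, not the Hadoop API: the function names are made up for the example, and a word count stands in for a real job.

```python
from collections import defaultdict


def map_words(_key, line):
    """Map: emit an intermediate (word, 1) pair for each word in the line."""
    for word in line.lower().split():
        yield word, 1


def reduce_counts(word, counts):
    """Reduce: merge all intermediate values that share one key."""
    return word, sum(counts)


def run_job(records, mapper, reducer):
    """Run mapper over (key, value) records, group by key, then reduce."""
    # Map phase: each input record is processed independently of the others.
    grouped = defaultdict(list)
    for key, value in records:
        for out_key, out_value in mapper(key, value):
            grouped[out_key].append(out_value)
    # Sort phase, then reduce phase: one reducer call per distinct key.
    return [reducer(k, grouped[k]) for k in sorted(grouped)]


if __name__ == "__main__":
    lines = enumerate(["InfiniBand moves data", "Hadoop moves data to compute"])
    for word, count in run_job(lines, map_words, reduce_counts):
        print(word, count)
```

In a real cluster the map calls run on many nodes at once and the grouped intermediate data travels over the network to the reducers, which is where the interconnect comes into play.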

 

The test we conducted was the DFSIO benchmark, a map-reduce job where each map task opens a file, writes to or reads from it, closes it, and measures the I/O time. There is only a single reduce task, which aggregates the individual times and sizes. We limited the test to measurements of 10 to 50 files with 360 MB, which we found reasonable given the ratio between the number of nodes used and the number of files. We then compared that to publicly published results from Yahoo, which used 14k files over 4k nodes. That boils down to 3.5 files per node, whereas we are using 50 files over 12 nodes, which equates to over 4 files per node.
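
The files-per-node comparison is simple arithmetic, sketched below. It assumes “14k” and “4k” mean 14,000 and 4,000; the function name is made up for the example.

```python
def files_per_node(files, nodes):
    """Average number of benchmark files handled by each node."""
    return files / nodes


if __name__ == "__main__":
    yahoo = files_per_node(14_000, 4_000)   # published Yahoo configuration
    ours_max = files_per_node(50, 12)       # our largest run
    ours_min = files_per_node(10, 12)       # our smallest run
    print(f"Yahoo: {yahoo:.2f} files per node")
    print(f"Ours: {ours_min:.2f} to {ours_max:.2f} files per node")
```

Only the 50-file run exceeds the Yahoo ratio. The 10-file run has fewer files than nodes.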

 

Given the configuration and test described above, here is a snapshot of the results we’ve seen:

 

 



It can clearly be seen from the above, as well as through other results we’ve been given, that InfiniBand and 10GigE (via our ConnectX adapters) take half the execution time and deliver over triple the bandwidth…these are very conclusive results by any metric.

 

A very interesting point to review is that the tests executed with DFS located on a hard disk showed significantly better performance, but when testing with RamDisk the gap increased even more, e.g. latency went from one-half to one-third… it seems like a clear way to unleash the potential.

 

In my next blog post I plan to either review a new application or another aspect of this application.

 

 

Nimrod Gindi

Director of Corporate Strategy

nimrodg@mellanox.com