All posts by brian

Mellanox InfiniBand and Ethernet RDMA Interconnect Solutions Accelerate IBM’s Virtualized Database Solutions

Recently, IBM expanded its PureSystems family with the new PureData System, which builds analytics and the ability to handle big data into the box. To stay competitive, today’s organizations need to analyze and explore big data quickly and easily, even when dealing with petabytes. The new system simplifies and optimizes the performance of data warehouse services and analytics applications. The new PureData System for Analytics is designed to accelerate analytics and boasts the largest library of in-database analytic functions on the market today. Clients can use it to predict and help avoid customer churn in seconds, create targeted advertising and promotions using predictive and spatial analysis, and prevent fraud.

We are pleased to announce that our InfiniBand and Ethernet RoCE interconnect solutions have been selected to accelerate these systems, helping reduce CPU overhead, enable higher system efficiency and availability, and deliver higher return-on-investment.

Modern database applications are placing increased demands on the server and storage interconnects as they require higher performance, scalability and availability. Virtualizing IBM DB2 pureScale® on System x® servers using Mellanox’s RDMA-based interconnect solutions delivers outstanding application performance and business benefits to IBM customers.

Mellanox’s interconnect solutions for virtualized IBM DB2 pureScale databases on System x servers provide the ability to run multiple highly scalable database clusters on the same shared infrastructure while remaining highly available and minimizing downtime.

Mellanox interconnect products enable IBM DB2 pureScale to deliver the performance and functionality needed to support the most demanding database and transaction processing applications. Mellanox’s high-bandwidth, low-latency interconnects are one of the key ingredients in building scalable cluster solutions with DB2 pureScale.

Mellanox InfiniBand and Ethernet interconnects enable IBM DB2 pureScale to provide direct connectivity from the database virtual machines to the interconnect infrastructure while preserving RDMA semantics. This direct connectivity allows the virtual machines to achieve lower latency and faster data access than other solutions.

Live Demonstration and Presentations at Information On Demand 2012 (October 21-26 in Las Vegas, NV)

Visit the Intel booth on the Expo floor to see a live demonstration of a virtualized DB2 pureScale cluster running over Mellanox’s 10GbE interconnect solution with RoCE.

Mellanox InfiniBand and Ethernet Switches Receive IPv6 Certification

I am proud to announce that Mellanox’s SwitchX® line of InfiniBand and Ethernet switches has received gold certification for Internet Protocol v6 (IPv6) from the Internet Protocol Forum. Adding IPv6 support to our SwitchX series is another milestone for Mellanox’s InfiniBand and Ethernet interconnect solutions, and demonstrates our commitment to producing quality, interoperable InfiniBand and Ethernet products optimized for the latest Internet Protocols.

SX1036 - 36-port 40GbE Switch

Mellanox’s drive to satisfy stringent requirements has led to this gold certification as part of the IPv6 Ready Logo Program, a conformance and interoperability testing program designed to increase user confidence by demonstrating that IPv6 is the future of network architecture.

We at Mellanox feel that as global technology adoption rates increase, there is a greater need for larger networks and, consequently, more IP addresses. As background, Internet Protocol version 4 (IPv4), still in dominant use, is now reaching the limit of its capacity. The next generation of IP, IPv6, provides a vastly expanded address space by quadrupling the number of network address bits from 32 in IPv4 to 128, providing more than enough globally unique IP addresses for every networked device on the planet.
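
As a quick back-of-the-envelope illustration of that jump from 32 to 128 address bits, here is a minimal Python sketch based only on the bit widths mentioned above:

```python
# Address space sizes implied by the address widths
ipv4_addresses = 2 ** 32      # about 4.3 billion addresses
ipv6_addresses = 2 ** 128     # about 3.4 * 10**38 addresses

print(f"IPv4 address space: {ipv4_addresses:,}")
print(f"IPv6 address space: {ipv6_addresses:.2e}")
print(f"Expansion factor  : {ipv6_addresses // ipv4_addresses:.2e}")  # ~7.9 * 10**28
```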

Regards,

Amit Katz

Director, Product Management

Mellanox FDR 56Gb/s InfiniBand Adapters Provide Leading Application Performance for Dell PowerEdge C8000 Series Servers

Dell today announced the PowerEdge C8000 series, the industry’s only 4U shared-infrastructure solution to provide customers with compute, GPU/coprocessor and storage options in a single chassis. End users deploying the PowerEdge C8000 with Mellanox fast interconnect solutions gain access to the industry-leading performance of 56Gb/s InfiniBand combined with the power of Dell’s newest high-end server, resulting in a high-performance solution with low total cost of ownership in terms of power efficiency, system scaling efficiency and compute density.

Mellanox FDR 56Gb/s InfiniBand solutions are already being deployed with Dell PowerEdge C8000 systems as part of the Stampede supercomputer at the Texas Advanced Computing Center (TACC) at The University of Texas at Austin. With a peak performance of more than 10 petaflops, Stampede will be the most powerful system available to researchers via the NSF’s Extreme Science and Engineering Discovery Environment (XSEDE) program when it is installed in January 2013.

Mellanox fast interconnect solutions provide the Dell PowerEdge C8000 with low-latency, high-bandwidth benefits for the most resource-intensive hyperscale workloads, including HPC, big data processing and hosting. Mellanox delivers the most effective interconnect solution for the Dell PowerEdge C8000, enabling the highest compute and storage performance at the lowest cost and power consumption.

National Supercomputing Centre in Shenzhen (NSCS) – #2 on June 2010 Top500 list

I had the pleasure of being a little bit involved in the creation of the fastest supercomputer in Asia, and the second fastest supercomputer in the world: the Dawning “Nebulae” Petaflop supercomputer at SIAT. If we look at the peak flops capacity of the system, nearly 3 Petaflops, it is the largest supercomputer in the world. I visited the supercomputer site in April and saw how fast it was assembled. It took around three weeks to get it up and running, which is amazing. This is one of the benefits of using a cluster architecture instead of expensive proprietary systems. The first picture, by the way, was taken during the system setup in Shenzhen.

The system includes 5,200 Dawning TC3600 blades, each with an NVIDIA Fermi GPU, providing 120K cores, all connected with Mellanox ConnectX InfiniBand QDR adapters, IS5000 switches and fabric management. It is the third system in the world to deliver more than a Petaflop of sustained performance (after Roadrunner and Jaguar). Unlike Jaguar (from Cray), which requires around 20K nodes to reach that performance, Nebulae does it with only 5.2K nodes, reducing the needed real estate and making it much more cost effective. It is yet more proof that commodity-based supercomputers can deliver better performance, cost/performance and other x/performance metrics compared to proprietary systems. As GPUs gain popularity, we also witness the effort being made to create and port the needed applications to GPU-based environments, which will bring a new era of GPU computing. It is clear that GPUs will drive the next phase of supercomputers, and of course the new speeds and feeds of the interconnect solutions (such as the IBTA’s new specifications for the FDR/EDR InfiniBand speeds).

The second picture was taken at the ISC’10 conference, after the Top500 award ceremony. You can see the Top500 certificates…

Regards,

Gilad Shainer
Shainer@mellanox.com

Paving The Road to Exascale – Part 2 of many

In the introduction to the “Paving the road to Exascale” series of posts (part 1), one of the items I mentioned was the “many, many cores, CPU or GPU.” The basic performance of a given system is measured in flops. Each CPU/GPU is capable of X flops (which can be calculated, for example, as the number of parallel operations per cycle * frequency * cores), and the sum across all of them in a given system gives you the maximum compute capability of the system. How much of that you can really utilize for your application depends on the system design, memory bandwidth, interconnect, etc. On the Top500 list, you can see, for each of the systems listed, what the maximum amount of flops is and what the effective or measured performance is using the Linpack benchmark.
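
To make the peak-flops arithmetic above concrete, here is a minimal Python sketch; the per-CPU figures (8 operations per cycle, 2.6 GHz, 8 cores) and the node count are hypothetical examples chosen for illustration, not a specific product or system:

```python
def peak_flops(ops_per_cycle: int, frequency_hz: float, cores: int) -> float:
    """Theoretical peak = parallel operations per cycle * frequency * cores."""
    return ops_per_cycle * frequency_hz * cores

# Hypothetical CPU: 8 double-precision operations per cycle, 2.6 GHz, 8 cores
per_cpu = peak_flops(8, 2.6e9, 8)            # ~166 Gflops per CPU

# Hypothetical system: 5,000 nodes, 2 CPUs per node
system_peak = per_cpu * 2 * 5_000            # ~1.66 Pflops theoretical peak

print(f"Per CPU: {per_cpu / 1e9:.1f} Gflops")
print(f"System : {system_peak / 1e15:.2f} Pflops (theoretical peak)")
```

The measured Linpack number reported on the Top500 list is always some fraction of this theoretical peak, and the gap between the two reflects exactly the system design, memory bandwidth and interconnect factors mentioned above.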

In order to achieve the increasing performance targets (we are talking about paving the road to Exascale…), we need to have as many cores as possible. As we have all witnessed, GPUs have become the most cost-effective compute element, and the natural choice for bringing the desired compute capability to the next generation of supercomputers. A simple comparison shows that with a proprietary design, such as a Cray machine, one needs around 20K nodes to achieve Petascale computing, while using GPUs (assuming one per server), 5K nodes are enough to achieve a similar performance capability, making it the most cost-effective solution.

So, now that we are starting to plug more and more GPUs into the new supercomputers, there are two things we need to take care of: one is to start working on the application side and port applications to use parallel GPU computation (a subject for a whole new blog), and the second is to make sure the communication between the GPUs is as efficient as possible. For the latter, we have seen the recent announcements from NVIDIA and Mellanox on creating a new interface, called GPUDirect, that enables a better and more efficient communication interface between the GPUs and the InfiniBand interconnect. The new interface eliminates CPU involvement from the GPU communications data path, using host memory as the medium between the GPU and the InfiniBand adapter. One needs to be aware that the GPUDirect solution requires network offloading capability to completely eliminate the CPU from the data path; if the network requires CPU cycles to send and receive traffic, the CPU will still be involved in the data path! Once you eliminate the CPU from the GPU data path, you can reduce GPU communication time by 30%.

We will be seeing more and more optimizations for GPU communications on high speed networks. The end goal is of course to provide local system latencies for remote GPUs, and with that ensure the maximum utilization of the GPU’s flops capability.

Till next time,

Gilad Shainer
shainer@mellanox.com

The biggest winner of the new June 2010 Top500 Supercomputers list? InfiniBand!

Published twice a year, the Top500 list ranks the world’s fastest supercomputers, provides a great indication of HPC market trends and usage models, and serves as a tool for future predictions. The 35th release of the Top500 list was just published, and according to the new results, InfiniBand has become the de facto interconnect technology for high performance computing.

What hasn’t been said about InfiniBand by the competition? Too many times I have heard that InfiniBand is dead and that Ethernet is the killer. I just sit in my chair and laugh. InfiniBand is the only interconnect growing on the Top500 list, with more than 30% growth year over year (YoY), and it is growing by continuing to uproot Ethernet and the proprietary solutions. Ethernet is down 14% YoY, and it has become very difficult to spot a proprietary clustered interconnect… Even more, in the hard core of HPC, the Top100, 64% of the systems use InfiniBand with solutions from Mellanox. InfiniBand has definitely proven to provide the needed scalability, efficiency and performance, and to really deliver the highest CPU or GPU availability to the user and the applications. Connecting 208 systems on the list, it is only steps away from connecting the majority of the systems.

What makes InfiniBand so strong? The fact that it solves issues rather than shifting them to other parts of the system. In a balanced HPC system, each component needs to do its work, not rely on other components to handle its overhead tasks. Mellanox is doing a great job of providing solutions that offload all the communications, provide the needed accelerations for the CPU or GPU, and maximize the CPU/GPU cycles available to the applications. The collaboration with NVIDIA on NVIDIA GPUDirect, Mellanox CORE-Direct and so forth are just a few examples.

GPUDirect is a great example of how Mellanox can offload the CPU from involvement in GPU-to-GPU communications. No other InfiniBand vendor can do it without using Mellanox technology. GPUDirect requires network offloading or it does not work. Simple. If you want to remove the CPU from GPU-to-GPU communications but your interconnect needs the CPU to handle the transport (since it is an onloading solution), the CPU is still involved in every GPU transaction. Only offloading interconnects, such as Mellanox InfiniBand, can really deliver the benefits of GPUDirect.

If you want more information on GPUDirect and other solutions, feel free to drop a note to hpc@mellanox.com.

Gilad

Visit Mellanox at ISC’10

It’s almost time for ISC’10 in Hamburg, Germany (May 31-June 3). Please stop by the Mellanox Technologies booth (#331) to learn more about how our products deliver market-leading bandwidth, high performance, scalability, power conservation and cost-effectiveness while converging multiple legacy network technologies into one future-proof solution.

Mellanox’s end-to-end 40Gb/s InfiniBand connectivity products deliver the industry’s leading CPU efficiency rating on the TOP500. Come see our application acceleration and offload technologies that decrease run time and increase cluster productivity.

Hear from our HPC Industry Experts

Exhibitor Forum Session – Tuesday, June 1, 9:40AM – 10:10AM

Speaking: Gilad Shainer, Sr. Director of HPC Marketing / Michael Kagan, CTO

HOT SEAT SESSION – Tuesday, June 1, 3:15PM – 3:30PM

Speaking: Michael Kagan, CTO

JuRoPA Breakfast Session – Wednesday, June 2, 7:30AM – 8:45AM

Speaking: Gilad Shainer, Sr. Director of HPC Marketing / Michael Kagan, CTO

“Low Latency, High Throughput, RDMA & the Cloud In-Between” – Wednesday, June 2, 10:00AM – 10:30AM

Speaking: Gilad Shainer, Sr. Director of HPC Marketing

“Collectives Offloads for Large Scale Systems” – Thursday, June 3, 11:40AM – 12:20PM

Speaking: Gilad Shainer, Mellanox Technologies; Prof. Dr. Richard Graham, Oak Ridge National Laboratory

“RoCE – New Concept of RDMA over Ethernet” – Thursday, June 3, 12:20PM – 1:00PM

Speaking: Gilad Shainer, Sr. Director of HPC Marketing and Bill Lee, Sr. Product Marketing Manager

Mellanox Scalable HPC Solutions with NVIDIA GPUDirect Technology Enhance GPU-Based HPC Performance and Efficiency

Mellanox announced the immediate availability of NVIDIA GPUDirect™ technology with Mellanox ConnectX®-2 40Gb/s InfiniBand adapters, which boosts GPU-based cluster efficiency and increases performance by an order of magnitude over today’s fastest high-performance computing clusters. Read the entire press release here.

Paving The Road to Exascale – Part 1 of many

1996 was the year the world saw the first Teraflops system. Twelve years later, the first Petaflops system was built. It took the HPC world 12 years to increase performance by a factor of 1,000. Exascale computing, another performance jump by a factor of 1,000, will not take another 12 years. Expectations indicate that we will see the first Exascale system in 2018, only 10 years after the introduction of the Petaflops system. How we get to the Exascale system is a good question, but we can definitely put down some guidelines on how to do it right. Since there is much to write on this subject, this will probably take multiple blog posts, and we have time till 2018…  :)

Here are the items that I have in mind as overall guidelines:

-  Dense computing – we can’t populate Earth with servers as we need some space for living… so dense solutions will need to be built – packing as many cores as possible in a single rack. This is a task for the Dell folks…  :)

-  Power efficiency – energy is limited, and today’s data centers already consume too much power. Apart from alternative energy solutions, Exascale systems will need to be energy efficient, and this covers all of the system’s components – CPUs, memory, networking. Every Watt is important.

-  Many, many cores – CPU/GPU, as many as possible, and rest assured, software will use them all

-  Offloading networks – every Watt is important, every flop needs to be efficient. CPU/GPU availability will be critical in order to achieve the performance goals. No one can afford to waste cores on non-compute activities.

-  Efficiency – balanced systems, no jitter, no noise, the same order of magnitude of latency everywhere – between CPUs, between GPUs, between end-points

-  Ecosystem/partnership is a must – no one can do it alone.

In future posts I will expand on the different guidelines, and definitely welcome your feedback.

————————————————————————-
Gilad Shainer
Senior Director, HPC and Technical Computing
gilad@mellanox.com

GPU-Direct Technology – Accelerating GPU-Based Systems

The rapid increase in the performance of graphics hardware, coupled with recent improvements in its programmability, has made graphics accelerators a compelling platform for computationally demanding tasks in a wide variety of application domains. Due to the great computational power of the GPU, the GPGPU method has proven valuable in various areas of science and technology.

GPU-based clusters are being used to perform compute-intensive tasks such as finite element computations, computational fluid dynamics and Monte Carlo simulations. Several of the world’s leading supercomputers use GPUs in order to achieve the desired performance. Since GPUs provide a high core count and strong floating-point capability, a high-speed network such as InfiniBand is required to connect the GPU platforms, in order to provide the needed throughput and the lowest latency for GPU-to-GPU communications.

While GPUs have been shown to provide worthwhile performance acceleration, yielding benefits to both price/performance and power/performance, several areas of GPU-based clusters could be improved in order to provide higher performance and efficiency. One of the main performance issues with deploying clusters of multi-GPU nodes involves the interaction between the GPUs, or the GPU-to-GPU communication model. Prior to GPU-Direct technology, any communication between GPUs had to involve the host CPU and required a buffer copy. The GPU communication model required the CPU to initiate and manage memory transfers between the GPUs and the InfiniBand network. Each GPU-to-GPU communication involved the following steps (modeled in the sketch below the list):

  1. The GPU writes data to host memory dedicated to the GPU
  2. The host CPU copies the data from the GPU-dedicated host memory to host memory available for the InfiniBand device to use for RDMA communications
  3. The InfiniBand device reads the data from that area and sends it to the remote node
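
To make that data flow easier to follow, here is a small conceptual sketch in plain Python (not actual CUDA or verbs code; the buffer and function names are illustrative only). It models the three staging steps above, including the intermediate host-to-host copy that the CPU has to perform:

```python
# Conceptual model (plain Python, not CUDA/verbs code) of the pre-GPU-Direct
# send path described above: data is staged through two separate host buffers
# and the host CPU performs the intermediate copy. All names are illustrative.

GPU_DEDICATED_HOST_BUF = bytearray(4096)   # host memory dedicated to the GPU
IB_RDMA_HOST_BUF = bytearray(4096)         # host memory usable by the IB device

def gpu_write_to_host(payload: bytes) -> None:
    """Step 1: the GPU writes its data into its dedicated host buffer."""
    GPU_DEDICATED_HOST_BUF[:len(payload)] = payload

def cpu_copy_between_buffers(length: int) -> None:
    """Step 2: the host CPU copies the data into the RDMA-capable buffer."""
    # This is the extra copy that GPU-Direct removes.
    IB_RDMA_HOST_BUF[:length] = GPU_DEDICATED_HOST_BUF[:length]

def ib_send_to_remote(length: int) -> bytes:
    """Step 3: the InfiniBand device reads the RDMA buffer and sends it."""
    return bytes(IB_RDMA_HOST_BUF[:length])

payload = b"gpu result block"
gpu_write_to_host(payload)
cpu_copy_between_buffers(len(payload))
on_the_wire = ib_send_to_remote(len(payload))
assert on_the_wire == payload
```

With GPU-Direct, the GPU and the InfiniBand adapter share the same pinned host memory region, so the CPU copy modeled in step 2 is no longer needed and the CPU drops out of the data path.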

Gilad Shainer
Senior Director of HPC and Technical Marketing