Category Archives: High Performance Computing (HPC)

Mellanox FDR 56Gb/s InfiniBand Helps Lead SC’13 Student Cluster Competition Teams to Victory

Mellanox’s end-to-end FDR 56Gb/s InfiniBand solutions helped lead The University of Texas at Austin to victory in the SC Student Cluster Competition’s Standard Track during SC’13. Utilizing Mellanox’s FDR InfiniBand solutions, The University of Texas at Austin achieved superior application run-time and sustained performance within a 26-amp, 120-volt power limit, allowing the team to complete workloads faster while achieving top benchmark performance. Special recognition also went to China’s National University of Defense Technology (NUDT), which, through the use of Mellanox’s FDR 56Gb/s InfiniBand, won the award for highest LINPACK performance.

 

Held as part of HPC Interconnections, the SC Student Cluster Competition is designed to introduce the next generation of students to the high-performance computing community. In this real-time, non-stop, 48-hour challenge, teams of undergraduate students assembled a small cluster on the SC13 exhibit floor and raced to demonstrate the greatest sustained performance across a series of applications. The winning team was determined based on a combined score for workload completed, benchmark performance, conference attendance, and interviews.

Continue reading

Mellanox at the Supercomputing Conference 2013 (SC13) – Denver, CO

 

Attending the SC13 conference in Denver next week?


Yes? Be sure to stop by the Mellanox booth (#2722) to check out the latest products, technology demonstrations, and FDR InfiniBand performance with Connect-IB! We have a long list of theater presentations with our partners at the Mellanox booth. We will have giveaways at every presentation, and a lucky attendee will go home with a new Apple iPad Mini at the end of each day!

Don’t forget to sign up for the Mellanox Special Evening Event during SC13 on Wednesday night. Register here: http://www.mellanox.com/sc13/event.php

Location
Sheraton Denver Downtown Hotel
Plaza Ballroom
1550 Court Place
Denver, Colorado 80202
Phone: (303) 893-3333

Time:
Wednesday, November 20th
7:00PM – 10:00PM

Also download the Print ‘n Fly guide to SC13 in Denver from insideHPC!


Finally, come hear from our experts at the following SC13 sessions:

 

Speakers: Gilad Shainer, VP Marketing; Richard Graham, Sr. Solutions Architect

Title: “OpenSHMEM BoF”

Date: Wednesday, November 20, 2013

Time: 5:30PM – 7:00PM

Room: 201/203

 

Speaker: Richard Graham, Sr. Solutions Architect

Title: “Technical Paper Session Chair: Inter-Node Communication”

Date: Thursday, November 21, 2013

Time: 10:30AM – 12:00PM

Room: 405/406/407

 

Speaker: Richard Graham, Sr. Solutions Architect

Title: “MPI Forum BoF”

Date: Thursday, November 21, 2013

Time: 12:15PM-1:15PM

Room: 705/707/709/711

P.S.  Stop by the Mellanox booth (#2722) to see our jelly bean jar. Comment on this post with your guess, and you could win a $50 Amazon Gift Card! The winner will be announced at the end of the conference. Follow all of our activities on our social channels, including Twitter, Facebook and our Community!

Guess How Many?

 See you in Denver!

 

 

Author: Pak Lui is the Applications Performance Manager at Mellanox Technologies, responsible for managing application performance, characterization, profiling and testing. His main focus is optimizing HPC applications on Mellanox products and exploring new technologies and solutions and their effect on real workloads. Pak has been working in the HPC industry for over 12 years. Prior to joining Mellanox Technologies, Pak worked as a Cluster Engineer, responsible for building and testing HPC cluster configurations from different OEMs and ISVs. Pak holds a B.Sc. in Computer Systems Engineering and an M.Sc. in Computer Science from Boston University in the United States.

Deploying HPC Clusters with Mellanox InfiniBand Interconnect Solutions

High-performance simulations require the most efficient compute platforms. The execution time of a given simulation depends upon many factors, such as the number of CPU/GPU cores and their utilization factor and the interconnect performance, efficiency, and scalability. Efficient high-performance computing systems require high-bandwidth, low-latency connections between thousands of multi-processor nodes, as well as high-speed storage systems.

Mellanox has released “Deploying HPC Clusters with Mellanox InfiniBand Interconnect Solutions”. This guide describes how to design, build, and test a high-performance computing (HPC) cluster using Mellanox® InfiniBand interconnects, covering the installation and setup of the infrastructure, including the following (a minimal verification sketch in C follows the list):

  • HPC cluster design
  • Installation and configuration of the Mellanox Interconnect components
  • Cluster configuration and performance testing
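
To make the “performance testing” step concrete, here is a minimal MPI ping-pong sketch in C that can be run between two nodes once the fabric and an MPI library are in place. It is illustrative only and is not taken from the guide, which relies on standard tools such as the OSU micro-benchmarks and the perftest utilities (ib_send_lat, ib_send_bw); the message size, iteration count, and host names below are arbitrary placeholders.

    /* ping_pong.c - minimal two-rank MPI ping-pong sanity check.
       Build: mpicc ping_pong.c -o ping_pong
       Run:   mpirun -np 2 -H node1,node2 ./ping_pong   (host names are placeholders) */
    #include <mpi.h>
    #include <stdio.h>
    #include <stdlib.h>

    int main(int argc, char **argv)
    {
        const int iters = 1000;
        const int msg_size = 1 << 20;              /* 1 MiB payload, arbitrary */
        char *buf = malloc(msg_size);
        int rank, size;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);
        if (size < 2) {
            if (rank == 0) fprintf(stderr, "run with at least 2 ranks\n");
            MPI_Abort(MPI_COMM_WORLD, 1);
        }

        MPI_Barrier(MPI_COMM_WORLD);
        double t0 = MPI_Wtime();
        for (int i = 0; i < iters; i++) {
            if (rank == 0) {
                MPI_Send(buf, msg_size, MPI_CHAR, 1, 0, MPI_COMM_WORLD);
                MPI_Recv(buf, msg_size, MPI_CHAR, 1, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            } else if (rank == 1) {
                MPI_Recv(buf, msg_size, MPI_CHAR, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
                MPI_Send(buf, msg_size, MPI_CHAR, 0, 0, MPI_COMM_WORLD);
            }
        }
        double t1 = MPI_Wtime();

        if (rank == 0) {
            double rt = (t1 - t0) / iters;                     /* seconds per round trip */
            printf("avg round trip: %.1f us, bandwidth: ~%.0f MB/s\n",
                   rt * 1e6, 2.0 * msg_size / rt / 1e6);       /* two messages per round trip */
        }
        free(buf);
        MPI_Finalize();
        return 0;
    }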

 

Author: Scot Schultz is an HPC technology specialist with broad knowledge of operating systems, high-speed interconnects and processor technologies. Joining the Mellanox team in March 2013 as Director of HPC and Technical Computing, Schultz is a 25-year veteran of the computing industry. Prior to joining Mellanox, he spent 17 years at AMD in various engineering and leadership roles, most recently in strategic HPC technology ecosystem enablement. Scot was also instrumental in the growth and development of the OpenFabrics Alliance as co-chair of its board of directors. Scot also serves as Director of Educational Outreach and is a founding member of the HPC Advisory Council, among various other industry organizations.

Advancing Applications Performance With InfiniBand

High-performance scientific applications typically require the lowest possible latency in order to keep parallel processes in sync as much as possible. In the past, this requirement drove the adoption of SMP machines, where the floating-point elements (CPUs, GPUs) were placed on the same board as much as possible. With increasing demand for compute capability and the need to lower the cost of adoption to make large-scale HPC more widely available, clustering has become the preferred architecture for high-performance computing.

 

 

We introduce and explore some of the latest advancements in high-speed networking and suggest new usage models that leverage these technologies to meet the requirements of today’s demanding applications. The recently launched Mellanox Connect-IB™ InfiniBand adapter introduced a novel high-performance and scalable architecture for HPC clusters. The architecture was designed from the ground up to provide high performance and scalability for the largest supercomputers in the world, today and in the future.

The device includes a new network transport mechanism called Dynamically Connected Transport™ Service (DCT), which was invented to provide a Reliable Connection transport mechanism (the service behind many of InfiniBand’s advanced capabilities, such as RDMA, large message sends, and low-latency kernel bypass) at unlimited cluster sizes. We will also discuss optimizations for MPI collective communications, which are frequently used for process synchronization, and show how their performance is critical for scalable, high-performance applications.
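
To make the collective-performance point concrete, here is a small C sketch (not part of the presentation itself) that times MPI_Allreduce, one of the collectives most commonly used for synchronization; the iteration count and the single-double payload are arbitrary illustrative choices.

    /* allreduce_timing.c - measure average MPI_Allreduce latency, the kind of
       collective whose performance gates synchronization in tightly coupled codes. */
    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char **argv)
    {
        const int iters = 10000;
        double local = 1.0, global = 0.0;
        int rank;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        MPI_Barrier(MPI_COMM_WORLD);
        double t0 = MPI_Wtime();
        for (int i = 0; i < iters; i++)
            MPI_Allreduce(&local, &global, 1, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);
        double t1 = MPI_Wtime();

        if (rank == 0)
            printf("avg MPI_Allreduce latency: %.2f us across the job\n",
                   (t1 - t0) / iters * 1e6);

        MPI_Finalize();
        return 0;
    }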

 

Presented by:  Pak Lui, Application Performance Manager, Mellanox – August 12, 2013 – International Computing for the Atmospheric Sciences Symposium, Annecy, France

 

 

UF launches HiPerGator, the state’s most powerful supercomputer

GAINESVILLE, Fla. — The University of Florida today unveiled the state’s most powerful supercomputer, a machine that will help researchers find life-saving drugs, make decades-long weather forecasts and improve armor for troops.

The HiPerGator supercomputer and recent tenfold increase in the size of the university’s data pipeline make UF one of the nation’s leading public universities in research computing.

“If we expect our researchers to be at the forefront of their fields, we need to make sure they have the most powerful tools available to science, and HiPerGator is one of those tools,” UF President Bernie Machen said. “The computer removes the physical limitations on what scientists and engineers can discover. It frees them to follow their imaginations wherever they lead.”

For UF immunologist David Ostrov, HiPerGator will slash a months-long test to identify safe drugs to a single eight-hour work day.

“HiPerGator can help get drugs from the computer to the clinic more quickly. We want to discover and deliver safe, effective therapies that protect or restore people’s health as soon as we can,” Ostrov said. “UF’s supercomputer will allow me to spend my time on research instead of computing.”

The Dell machine has a peak speed of 150 trillion calculations per second. Put another way, if each calculation were a word in a book, HiPerGator could read the millions of volumes in UF libraries several hundred times per second.

UF worked with Dell, Terascala, Mellanox and AMD to build a machine that makes supercomputing power available to all UF faculty and their collaborators and spreads HiPerGator’s computing power over multiple simultaneous jobs instead of focusing it on a single task at warp speed.

HiPerGator features the latest in high-performance computing technology from Dell and AMD with 16,384 processing cores; a Dell|Terascala HPC Storage Solution (DT-HSS 4.5) with the industry’s fastest open-source parallel file system; and Mellanox’s FDR 56Gb/s InfiniBand interconnects, which provide the highest bandwidth and lowest latency. Together these features give UF researchers unprecedented computational power and faster access to data to advance their research more quickly.

UF unveiled HiPerGator on Tuesday as part of a ribbon-cutting ceremony for the 25,000-square-foot UF Data Center built to house it. HiPerGator was purchased and assembled for $3.4 million, and the Data Center was built for $15 million.

Also today, the university announced that it is the first in the nation to fully implement the Internet2 Innovation Platform, a combination of new technologies and services that will further speed research computing.

National Supercomputing Centre in Shenzhen (NSCS) – #2 on June 2010 Top500 list

I had the pleasure of being a little bit involved in the creation of the fastest supercomputer in Asia, and the second fastest supercomputer in the world: the Dawning “Nebulae” Petaflop supercomputer at SIAT. If we look at the peak flops capacity of the system, nearly 3 Petaflops, it is the largest supercomputer in the world. I visited the supercomputer site in April and saw how fast it was assembled. It took around 3 weeks to get it up and running, which is amazing; this is one of the benefits of using a cluster architecture instead of expensive proprietary systems. The first picture, by the way, was taken during the system setup in Shenzhen.


The system includes 5,200 Dawning TC3600 blades, each with an NVIDIA Fermi GPU, providing 120K cores, all connected with Mellanox ConnectX InfiniBand QDR adapters, IS5000 switches and the fabric management software. It is the third system in the world to deliver more than a sustained Petaflop of performance (after Roadrunner and Jaguar). Unlike Jaguar (from Cray), which requires 20K nodes to reach that performance, Nebulae does it with only 5.2K nodes, reducing the needed real estate and making it much more cost effective. It is yet more proof that commodity-based supercomputers can deliver better performance, cost/performance and other x/performance metrics compared to proprietary systems. As GPUs gain popularity, we also witness the effort being made to create and port the needed applications to GPU-based environments, which will bring a new era of GPU computing. It is clear that GPUs will drive the next phase of supercomputers, along with new speeds and feeds of the interconnect solutions (such as the IBTA’s new specifications for the FDR/EDR InfiniBand speeds).

The second picture was taken at the ISC’10 conference, after the Top500 award ceremony. You can see the Top500 certificates…


Regards,

Gilad Shainer
Shainer@mellanox.com

Paving The Road to Exascale – Part 2 of many

In the introduction to the “Paving the Road to Exascale” series of posts (part 1), one of the items I mentioned was the “many, many cores, CPU or GPUs”. The basic performance of a given system is measured in flops. Each CPU/GPU is capable of X amount of flops (which can be calculated, for example, as the number of parallel operations per cycle * frequency * cores), and the sum of all of them in a given system gives you the maximum compute capability of the system. How much you can really utilize for your application depends on the system design, memory bandwidth, interconnect and so on. On the Top500 list, you can see, for each of the systems listed, the maximum (peak) flops and the effective (measured) performance using the Linpack benchmark.
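
As a back-of-the-envelope illustration of that formula, here is a tiny C sketch with purely hypothetical numbers (they do not describe any specific product):

    /* peak_flops.c - peak flops from the formula above:
       parallel operations per cycle * frequency * cores, summed over the system. */
    #include <stdio.h>

    int main(void)
    {
        double ops_per_cycle = 4.0;     /* hypothetical SIMD double-precision ops per core per cycle */
        double freq_hz = 2.5e9;         /* hypothetical 2.5 GHz clock */
        int cores_per_node = 12;        /* hypothetical core count per node */
        int nodes = 5000;               /* hypothetical node count */

        double node_peak = ops_per_cycle * freq_hz * cores_per_node;   /* 120 Gflops per node */
        double system_peak = node_peak * nodes;                        /* 0.6 Pflops peak */

        printf("per-node peak: %.0f Gflops\n", node_peak / 1e9);
        printf("system peak:   %.2f Pflops\n", system_peak / 1e15);
        return 0;
    }

The Linpack number reported on the Top500 list is then the fraction of this peak the system actually sustains, which is where system design, memory bandwidth and the interconnect come in.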

In order to achieve the increasing performance targets (we are talking about paving the road to Exascale…), we need as many cores as possible. As we have all witnessed, GPUs have become the most cost-effective compute element, and the natural choice for bringing the desired compute capability to the next generation of supercomputers. A simple comparison shows that with a proprietary design, such as a Cray machine, one needs around 20K nodes to achieve Petascale computing, while with GPUs (assuming one per server) around 5K nodes are enough to achieve similar performance capability, making it the more cost-effective solution.

So, now that we are starting to plug more and more GPUs into the new supercomputers, there are two things we need to take care of: first, start working on the application side and port applications to use parallel GPU computation (a subject for a whole new blog post), and second, make sure the communication between the GPUs is as efficient as possible. For the latter, we saw the recent announcement from NVIDIA and Mellanox of a new interface, called GPUDirect, that enables a better and more efficient communication path between the GPUs and the InfiniBand interconnect. The new interface eliminates CPU involvement from the GPU communications data path, using host memory as the medium between the GPU and the InfiniBand adapter. One needs to be aware that the GPUDirect solution requires network offloading capability to completely eliminate the CPU from the data path: if the network requires CPU cycles to send and receive traffic, the CPU is still involved in the data path! Once you eliminate the CPU from the GPU data path, you can reduce GPU communication time by around 30%.
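
For illustration only, here is a minimal C sketch of the staged GPU-to-remote-node send pattern described above, assuming the CUDA runtime and an MPI library are available (the payload size and rank roles are arbitrary). With GPUDirect as described in this post, the pinned host region can be shared by the CUDA and InfiniBand drivers, so the CPU no longer performs an extra copy between separate driver buffers, while the application-level pattern stays essentially the same.

    /* gpu_send_staged.c - send a GPU buffer to a remote node by staging it
       through pinned (page-locked) host memory, which the InfiniBand HCA can
       read directly. Illustrative sketch only. */
    #include <cuda_runtime.h>
    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char **argv)
    {
        const size_t bytes = 1 << 22;    /* 4 MiB payload, arbitrary */
        int rank;
        float *d_buf;                    /* device buffer holding GPU results */
        float *h_buf;                    /* pinned host staging buffer */

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        cudaMalloc((void **)&d_buf, bytes);
        cudaMallocHost((void **)&h_buf, bytes);   /* page-locked, DMA-able memory */

        if (rank == 0) {
            /* Stage device data into pinned host memory, then send it over InfiniBand. */
            cudaMemcpy(h_buf, d_buf, bytes, cudaMemcpyDeviceToHost);
            MPI_Send(h_buf, (int)(bytes / sizeof(float)), MPI_FLOAT, 1, 0, MPI_COMM_WORLD);
        } else if (rank == 1) {
            MPI_Recv(h_buf, (int)(bytes / sizeof(float)), MPI_FLOAT, 0, 0,
                     MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            cudaMemcpy(d_buf, h_buf, bytes, cudaMemcpyHostToDevice);
            printf("rank 1 received the GPU payload\n");
        }

        cudaFreeHost(h_buf);
        cudaFree(d_buf);
        MPI_Finalize();
        return 0;
    }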

We will be seeing more and more optimizations for GPU communications on high speed networks. The end goal is of course to provide local system latencies for remote GPUs, and with that ensure the maximum utilization of the GPU’s flops capability.

Till next time,

Gilad Shainer
shainer@mellanox.com

The biggest winner of the new June 2010 Top500 Supercomputers list? InfiniBand!

Published twice a year, the Top500 supercomputers list ranks the world’s fastest supercomputers and provides a great indication of HPC market trends and usage models, as well as a tool for future predictions. The 35th release of the Top500 list was just published, and according to the new results, InfiniBand has become the de-facto interconnect technology for high-performance computing.

What hasn’t been said about InfiniBand by the competition? Too many times I have heard that InfiniBand is dead and that Ethernet is the killer. I just sit in my chair and laugh. InfiniBand is the only interconnect that is growing on the Top500 list, with more than 30% growth year over year (YoY), and it is growing by continuing to uproot Ethernet and the proprietary solutions. Ethernet is down 14% YoY, and it has become very difficult to spot a proprietary clustered interconnect…  Even more, in the hard core of HPC, the Top100, 64% of the systems use InfiniBand solutions from Mellanox. InfiniBand has definitely proven to provide the needed scalability, efficiency and performance, and to really deliver the highest CPU or GPU availability to the user and the applications. Connecting 208 systems on the list, it is only steps away from connecting the majority of the systems.

What makes InfiniBand so strong? The fact that it solves issues rather than migrating them to other parts of the system. In a balanced HPC system, each component needs to do its own work, not rely on other components to handle its overhead. Mellanox does a great job of providing solutions that offload all of the communications, provide the needed accelerations for the CPU or GPU, and maximize the CPU/GPU cycles available to applications. The collaborations with NVIDIA on NVIDIA GPUDirect, Mellanox CORE-Direct and so forth are just a few examples.

GPUDirect is a great example of how Mellanox can offload the CPU from being involved in GPU-to-GPU communications. No other InfiniBand vendor can do it without using Mellanox technology. GPUDirect requires network offloading or it does not work. Simple. When you want to offload the CPU from being involved in GPU-to-GPU communications, and your interconnect needs the CPU to handle the transport (since it is an onloading solution), the CPU is involved in every GPU transaction. Only offloading interconnects, such as Mellanox InfiniBand, can really deliver the benefits of GPUDirect.

If you want more information on GPUDirect and other solutions, feel free to drop a note to hpc@mellanox.com.

Gilad

Paving The Road to Exascale – Part 1 of many

1996 was the year the world saw the first Teraflops system. Twelve years later, the first Petaflop system was built. It took the HPC world 12 years to increase performance by a factor of 1,000. Exascale computing, another performance jump by a factor of 1,000, will not take another 12 years. Expectations indicate that we will see the first Exascale system in 2018, only 10 years after the introduction of the Petaflop system. How we get to an Exascale system is a good question, but we can definitely lay out some guidelines on how to do it right. Since there is much to write on this subject, this will probably take multiple blog posts, and we have time till 2018…  :)

Here are the items that I have in mind as overall guidelines:

-  Dense computing – we can’t populate the Earth with servers, as we need some space for living… so dense solutions will need to be built, packing as many cores as possible into a single rack. This is a task for the Dell folks…  :)

-  Power efficiency – energy is limited, and today’s data centers already consume too much power. Apart from alternative energy solutions, Exascale systems will need to be energy efficient, and this covers all of the system’s components: CPUs, memory, networking. Every watt is important.

-  Many, many cores – CPUs/GPUs, as many as possible, and rest assured, software will use them all.

-  Offloading networks – every watt is important, and every flop needs to be efficient. CPU/GPU availability will be critical in order to achieve the performance goals. No one can afford to waste cores on non-compute activities.

-  Efficiency – balanced systems, no jitter, no noise, the same order of magnitude of latency everywhere: between CPUs, between GPUs, between end-points.

-  Ecosystem/partnership is a must – no one can do it alone.

In future posts I will expand on the different guidelines, and definitely welcome your feedback.

————————————————————————-
Gilad Shainer
Senior Director, HPC and Technical Computing
gilad@mellanox.com

Interconnect analysis: InfiniBand and 10GigE in High-Performance Computing

InfiniBand and Ethernet are the leading interconnect solutions for connecting servers and storage systems in high-performance computing and in enterprise (virtualized or not) data centers. Recently, the HPC Advisory Council has put together the most comprehensive database for high-performance computing applications to help users understand the performance, productivity, efficiency and scalability differences between InfiniBand and 10 Gigabit Ethernet.

In summary, there are a large number of HPC applications that need the lowest possible latency or the highest bandwidth for best performance (for example, oil and gas applications as well as weather-related applications). There are some HPC applications that are not latency sensitive; for example, gene sequencing and some bioinformatics applications are not sensitive to latency and scale well with TCP-based networks, including GigE and 10GigE. For HPC converged networks, putting HPC message-passing traffic and storage traffic on a single TCP network may not provide enough data throughput for either. Finally, there are a number of examples showing that 10GigE has limited scalability for HPC applications and that InfiniBand proves to be a better performance, price/performance, and power solution than 10GigE.

The complete report can be found under the HPC Advisory Council case studies or by clicking here.