Archive for the ‘Gilad Shainer’ Category

National Supercomputing Centre in Shenzhen (NSCS) – #2 on June 2010 Top500 list

Monday, June 14th, 2010

I had the pleasure to be little bit involved in the creation of the fastest supercomputer in Asia, and the second fastest supercomputer in the world – the Dawning “Nebulae” Petaflop Supercomputer at SIAT. If we look on the peak flops capacity of the system – nearly 3 Petaflops, it is the largest supercomputer in the world. I visited the supercomputer site in April and saw how fast it was assembled. It took around 3 weeks to get it up and running – amazing, well, this is one of the benefits of using cluster architecture instead of the expensive proprietary systems. The first picture by the way was taken during the system setup in Shenzhen.

 

 

 

 

 

 

 

 

 

The system includes 5200 Dawning TC3600 Blades, each with NVIDIA Fermi GPU to provide 120K cores, all connected with Mellanox ConnectX InfiniBand QDR adapters, IS5000 switches and the fabric management. It is the 3rd system in the world to provide more than sustained Petaflop performance (after Roadrunner and Jaguar). Unlike Jaguar (from Cray) that requires 20K nodes to reach the required performance, Nebulae does it with only 5.2K nodes – reducing the needed real-estate etc, making is much more cost effective. It is yet another prove that commodity-based supercomputers can deliver better performance, cost/performance and other x/performance metrics compared to the proprietary systems. As GPUs gain higher popularity, we also witness the effort that is being done to create and port the needed applications to GPU-based environments, which will bring a new era of GPU computing. It is clear that GPUs will drive the next phase of supercomputers, and of course the new speeds and feeds of the interconnect solutions (such as the IBTA’s new specifications for the FDR/EDR InfiniBand speeds).

The second picture was taken at the ISC’10 conference, after the Top500 award ceremony. You can see the Top500 certificates…

 

 

 

 

 

 

 

 

 

Regards,

Gilad Shainer
Shainer@mellanox.com

Paving The Road to Exascale – Part 2 of many

Monday, June 14th, 2010

In the introduction for the “Paving the road to Exascale” series of posts (part 1), one of the items I mentioned was the “many many cores, CPU or GPUs”.  The basic performance of a given system is being measured by flops. Each CPU/GPU is capable for X amount of flops (which can be calculated as number of parallel operations * frequency * cores for example), and the sum of all of them in a given system gives you the maximum compute capability of the system. How much you can really utilize for your application depends on the system design, memory bandwidth, interconnect etc. On the Top500 list, you can see per each of the systems listed, what the maximum amount of flops is, and what is the effective or measured performance using the Linpack benchmark.

In order to achieve the increasing performance targets (we are talking on paving the road to Exascale….) we need to have as many cores as possible. As we all witness, GPUs have become the most cost-effective compute element, and the natural choice for bringing the desired compute capability in the next generation of supercomputers. A simple comparison shows that with a proprietary design, such as a Cray machine, one needs around 20K nodes to achieve Petascale computing, and using GPUs (assuming one per server). 5K nodes will be enough to achieve a similar performance capability – best cost effective solution.

So, now that we starting to plug in more and more GPUs into the new supercomputers, there are two things that we need to take care of – one, is to start working on the application side, and port applications to use parallel GPU computation (this is a subject for a whole new blog) and second, to make sure the communications between the GPU is as effective as possible. For the later, we have saw the recent announcements from NVIDIA and Mellanox on creating a new interface, called GPUDirect, that enables a better and more efficient communication interface between the GPUs and the InfiniBand interconnect. The new interface eliminates the CPU involvement from the GPU communications data path, using the host memory as the medium between the GPU and the InfiniBand adapter. One needs to be aware, that the GPUDirect solution requires network offloading capability to completely eliminate the CPU from being involved in the data path, as if the network requires CPU cycles to send and receive traffic, the CPU will still be involved in the data path! Once you eliminate the CPU from the GPU data path, you can reduce the GPU communications by 30%.

We will be seeing more and more optimizations for GPU communications on high speed networks. The end goal is of course to provide local system latencies for remote GPUs, and with that ensure the maximum utilization of the GPU’s flops capability.

Till next time,

Gilad Shainer
shainer@mellanox.com

The biggest winner of the new June 2010 Top500 Supercomputers list? InfiniBand!

Monday, May 31st, 2010

Published twice a year, the Top500 supercomputers list ranks the world fastest supercomputers and provides a great indication for HPC market trends, usage models and a tool for future predictions. The 35th release of the Top500 list was just published and according to the new results InfiniBand has become the de-facto interconnect technology for high performance computing.

What wasn’t said on InfiniBand from the competitor world? Too many time I have heard that InfiniBand is dead and that Ethernet is the killer. I am just sitting in my chair and laughing. InfiniBand is the only interconnect that is growing on the Top500 list, more than 30% growth year over year (YoY) and it is growing by continuing to uproot Ethernet and the proprietary solutions. Ethernet is 14% down YoY and it has become very difficult to spot a proprietary clustered interconnect…  Even more, in the hard core of HPC, the Top100, 64% of the systems are InfiniBand and are using solutions from Mellanox. InfiniBand is definitely proven to provide the needed scalability, efficiency and performance, and to really deliver the highest CPU or GPU availability to the user or to the applications. Connecting 208 systems from the list is only steps away from connecting the majority of the systems.

What makes InfiniBand so strong? The fact that it solves issues and does not migrate them to other parts of the systems. In a balanced HPC system, each components needs to do its work, and not rely on other components to do overhead tasks. Mellanox is doing a great job in providing solutions that offload all the communications and can provide the needed accelerations for the CPU or GPU, and maximize the CPU/GPU cycles for the applications. The collaborations with NVIDIA on the NVIDA GPUDirect, Mellanox CORE-Direct and so forth are just few examples.

The GPUDIrect is a great example on how Mellanox can offload the CPU from being involved in the GPU-to-GPU communications. No other InfiniBand vendor can do it without using Mellanox technology. GPUDirect requires network offloading or it does not work. Simple. When you want to offload the CPU from being involved in the GPU to GPU communications, and your interconnect needs the CPU to do the transports (since it is an onloading solution), the CPU is involved in every GPU transaction. Only offloading interconnects, such as Mellanox InfiniBand can really deliver the benefits of the GPUDirect.

If you want more information on the GPUDirect and other solutions, feel free to drop a note to hpc@mellanox.com.

Gilad

Visit Mellanox at ISC’10

Thursday, May 27th, 2010

It’s almost time for ISC’10 in Hamburg, Germany (May 31-June 3), please stop by and visit Mellanox Technologies booth (#331) to learn more about how our products deliver market-leading bandwidth, high-performance, scalability, power conservation and cost-effectiveness while converging multiple legacy network technologies into one future-proof solution.  

Mellanox’s end-to-end 40Gb/s InfiniBand connectivity products deliver the industry’s leading CPU efficiency rating on the TOP500. Come see our application acceleration and offload technologies that decrease run time and increase cluster productivity.

Hear from our HPC Industry Exports

Exhibitor Forum Session – Tuesday, June 1, 9:40AM – 10:10AM

Speaking: Gilad Shainer, Sr. Director of HPC Marketing / Michael Kagan, CTO

HOT SEAT SESSION – Tuesday, June 1, 3:15PM – 3:30PM

Speaking: Michael Kagan, CTO

JuRoPa breakfast Session – Wednesday, June 2, 7:30AM – 8:45AM

Speaking: Gilad Shainer, Sr. Director of HPC Marketing / Michael Kagan, CTO

“Low Latency, High Throughput, RDMA & the Cloud In-Between” – Wednesday, June 2, 10:00AM – 10:30AM

Speaking: Gilad Shainer, Sr. Director of HPC Marketing

“Collectives Offloads for Large Scale Systems” – Thursday, June 3, 11:40AM – 12:20PM

Speaking: Gilad Shainer, Mellanox Technologies; Prof. Dr. Richard Graham, Oak Ridge National Laboratory

“RoCE – New Concept of RDMA over Ethernet” – Thursday, June 3, 12:20PM – 1:00PM

Speaking: Gilad Shainer, Sr. Director of HPC Marketing and Bill Lee, Sr. Product Marketing Manager

Paving The Road to Exascale – Part 1 of many

Wednesday, May 26th, 2010

1996 was the year when the world saw the first Teraflops system. 12 years after, the first Petaflop system was built. It took the HPC world 12 years to increase the performance by a factor of 1000. Exascale computing, another performance jump by a factor of 1000 will not take another 12 years. Expectations indicate that we will see the first Exascale system in the year 2018, only 10 years after the introduction of the Petaflop system. How do we get to the Exascale system is a good question, but we definitely put some guidelines on how to do it right. Since there is much to write on this subject, this will probably take multiple blog posts, and we have time till 2018…  :)

Here are the items that I have in mind as overall guidelines:

-  Dense computing – we can’t populate Earth with servers as we need some space for living… so dense solutions will need to be built – packing as many cores as possible in a single rack. This is a task for the Dell folks…  :)

-  Power efficiency – energy is limited, and today data centers already consume too much power. Apart from alternative energy solutions, the Exascale systems will need to be energy efficient, and this covers all of the systems components – CPUs, memory, networking. Every Watt is important.

-  Many-many cores – CPU / GPU, as much as possible and be sure, software will use them all

-  Offloading networks – every Watt is important, every flop needs to be efficient. CPU/GPU availability will be critical in order to achieve the performance goals. No one can afford wasting cores on non-compute activities.

-  Efficiency – balanced systems, no jitters, no noise, same order of magnitude of latency everywhere – between CPUs, between GPUs, between end-points

-  Ecosystem/partnership is a must – no one can do it by himself.

In future posts I will expand on the different guidelines, and definitely welcome your feedback.

————————————————————————-
Gilad Shainer
Senior Director, HPC and Technical Computing
gilad@mellanox.com

GPU-Direct Technology – Accelerating GPU based Systems

Thursday, April 29th, 2010

The rapid increase in the performance of graphics hardware, coupled with recent improvements in its programmability, has made graphics accelerators a compelling platform for computationally demanding tasks in a wide variety of application domains. Due to the great computational power of the GPU, the GPGPU method has proven valuable in various areas of science and technology.

GPU based clusters are being used to perform compute intensive tasks, like finite element computations, Computational Fluids Dynamics, Monte-Carlo simulations etc. Several of the world leading supercomputers are using GPUs in order to achieve the desired performance. Since the GPUs provide high core count and floating point operations capability, a high-speed networking such as InfiniBand is required to connect between the GPU platforms, in order to provide the needed throughput and the lowest latency for the GPU to GPU communications.

While GPUs have been shown to provide worthwhile performance acceleration yielding benefits to both price/performance and power/performance, several areas of GPU based clusters could be improved in order to provide higher performance and efficiency. One of the main performance issues with deploying clusters consisting of multi-GPU nodes involves the interaction between the GPUs, or the GPU to GPU communication model. Prior to the GPU-Direct technology, any communication between GPUs had to involve the host CPU and required buffer copy. The GPU communication model required the CPU to initiate and manage memory transfers between the GPUs and the InfiniBand network. Each GPU to GPU communication had to follow the following steps:

  1. The GPU writes data to a host memory dedicated to the GPU
  2. The host CPU copies the data from the GPU dedicated host memory to host memory available for the InfiniBand devices to use for RDMA communications
  3. The InfiniBand device reads data from that open area and send it to the remote node

Gilad Shainer
Senior Director of HPC and Technical Marketing

ROI through efficiency and utilization

Monday, September 21st, 2009

High-performance computing provides an invaluable role in research, product development and education. It helps accelerate time to market, and provides significant cost reductions in product development and tremendous flexibility. One strength in high-performance computing is the ability to achieve best sustained performance by driving the CPU performance towards its limits. Over the past decade, high-performance computing has migrated from supercomputers to commodity clusters. More than eighty percent of the world’s Top500 compute system installations in June 2009 were clusters. The driver for this move appears to be a combination of Moore’s Law (enabling higher performance computers at lower costs) and the ultimate drive for the best cost/performance and power/performance. Cluster productivity and flexibility are the most important factors for a cluster’s hardware and software configuration.

A deeper examination of the world’s Top500 systems based on commodity clusters shows two main interconnect solutions that are being used to connect the servers for creating those compute powerful systems – InfiniBand and Ethernet. If we divide the systems according to the interconnect family, we will see that the same CPUs, memory speed and other settings are common between the two groups. The only difference between the two groups, besides the interconnect, is the system efficiency, or how many of CPU cycles can be dedicated to the application work, and how many of them will be wasted. The below graph list the systems according to their interconnect setting, and their measured efficiency. 

Top500 Interconnect Efficiency

Top500 Interconnect Efficiency

As seen, systems connected with Ethernet achieves an average 50% efficiency, which means that 50% of the CPU cycles are wasted on non-application work or are idle, waiting for data to arrive.  Systems connected with InfiniBand achieve an above 80% efficiency average, which means that less than 20% of the CPU cycles are wasted. Moreover, the latest InfiniBand based systems have demonstrated up to 94% efficiency (the best Ethernet connected systems demonstrated 63% efficiency).

People might argue that the Linpack benchmark is not the best benchmark for measuring parallel application efficiency, and does not fully utilize the network. The graph results are a clear indication that even for the Linpack application, the network does make a difference, and for better parallel application, the gap will be much higher.

When choosing the system setting, with the notion of maximizing return on investment, one needs to make sure no artificial bottlenecks will be created. Multi-core platforms, parallel applications, large databases etc require fast data exchange and lots of it. Ethernet can become the system bottleneck due to latency/bandwidth and CPU overhead due to the TCP/UDP processing (TOE solutions introduce other issues, sometime more complicated, but this is a topic for another blog) and reduce the system efficiency to 50%. This means that half of the compute system is wasted, and just consumes power and cooling. Same performance capability could have been achieved with half of the servers if they were connected with InfiniBand. More data on different application performance, productivity and ROI, can be found at the HPC Advisory Council web site, under content/best practices.

While InfiniBand will demonstrate higher efficiency and productivity, there are several ways to increase Ethernet efficiency. One of them is optimizing the transport layer to provide zero copy and lower CPU overhead (not by using TOE solutions, as those introduce single points of failure in the system). This capability is known as LLE (low latency Ethernet). More on LLE will be discussed in future blogs.

Gilad Shainer, Director of Technical Marketing
gilad@mellanox.com

Launching IS5000 40Gb/s InfiniBand Switches at ISC’09

Monday, June 29th, 2009

ISC’09 in Hamburg, Germany, went exceptionally well. Below is a quick video of us launching the new IS5000 family of 40Gb/s InfiniBand switches to the attending press and analysts. Afterwards, Gilad Shainer, director of HPC marketing, gives you a tour on the live booth demonstrations for both 40Gb/s IB and low-latency 10 Gigabit Ethernet.

The Automotive Makers Require Better Compute Simulations Capabilities

Friday, May 15th, 2009

This week I presented in the LS-DYNA user conference. LS-DYNA is one of the most used applications for automotive related computer simulations – simulations that are being used throughout the vehicle design process and decreases the need to build expensive physical prototypes. Computer simulation usage has decreased the vehicle design cycle from years to month, and is responsible for cost reduction throughout the process. Almost every part in the vehicle is designed with computer aided simulations. From crash/safety simulation to engine and gasoline flow, from air condition to water pumps, almost every part of the vehicle is simulated.

Today challenges in vehicle simulations are around the motivation to build more economical and ecological designs, how to do design lighter vehicles (less material to be used) while meeting the increased safety regulation demands. For example, national and international standardizations have been put in place, which provide structural crashworthiness requirements for railway vehicle bodies.

In order to be able to meet all of those requirements and demands, higher compute simulation capability is required. It is not a surprise that LS-DYNA is being mostly used in high-performance clustering environments as they provide the needed flexibility, scalability and efficiency for such simulations. Increasing high-performance clustering productivity and the capability to handle more complex simulations is the most important factor for the automotive makers today. It requires using balanced clustering design (hardware – CPU, memory, interconnect, GPU; and software), enhanced messaging techniques and the knowledge on how to increase the productivity from a given design.

For LS-DYNA, InfiniBand interconnect-based solutions have been proven to provide the highest productivity compared to Ethernet (GigE, 10GigE, iWARP). With InfiniBand, LS-DYNA demonstrated high parallelism and scalability, which enabled it to take full advantage of multi-core high-performance computing clusters. In the case of Ethernet, the better choice between GigE, 10GigE and iWARP is 10GigE. While iWARP aim to provide better performance, typical high-performance applications are using send-receive semantics which iWARP does not provide any added value with, and even worse, it just increase the complexity and the CPU overhead/power consumption.

If you want to get a copy of a paper that present the capabilities to increase simulations productivity while decrease power consumption, don’t hesitate to send me a note (hpc@mellanox.com).

Gilad Shainer
shainer@mellanox.com

I/O Virtualization

Wednesday, April 29th, 2009

I/O virtualization is a complimentary solution for server and storage virtualization, which aims to reduce the management complexity of physical connections in and out of virtual hosts. Virtualized data center clusters will have multiple networking connections to LAN and SAN, and virtualizing the network avoids the extra complexity associated with it. While I/O virtualization reduces the management complexity, in order to maintain high productivity and scalability one should pay attention to other characteristics of the network being virtualized.

Offloading the network virtualization from the VMM (virtual machine manager, e.g. Hypervisor) to a smart networking adapter, not only reduces the CPU overhead associated with the virtualization management, but also increases the performance capability of the virtual machines (or guest OSs) and can provide the native performance capabilities to them.

The PCISIG has standards in place to help simplify I/O virtualization. The most interesting solution is Single Root I/O virtualization (SR-IOV). SR-IOV allows a smart adapter to create multiple virtual adapters (virtual functions) for a given physical server. The virtual adapters can be assigned directly to a virtual machine (VM) instead of relying on the VMM to manage everything.

SR-IOV provides a standard mechanism for devices to advertise their ability to be simultaneously shared among multiple virtual machines. SR-IOV allows the partitioning of PCI functions into many virtual interfaces for the purpose of sharing the resources of a PCI device in a virtual environment.

Mellanox interconnect solutions provide full SR-IOV support while adding the required scalability and high throughput capabilities to effectively support multiple virtual machines on a single physical server. With Mellanox 10GigE or 40Gb/s InfiniBand solutions, each of the virtual machines can get the needed bandwidth allocation to ensure highest productivity and performance, just as if it was a physical server. 

Gilad Shainer
Director of Technical Marketing
gilad@mellanox.com