Category Archives: Exascale

Paving The Road to Exascale – Part 2 of many

In the introduction for the “Paving the road to Exascale” series of posts (part 1), one of the items I mentioned was the “many many cores, CPU or GPUs”.  The basic performance of a given system is being measured by flops. Each CPU/GPU is capable for X amount of flops (which can be calculated as number of parallel operations * frequency * cores for example), and the sum of all of them in a given system gives you the maximum compute capability of the system. How much you can really utilize for your application depends on the system design, memory bandwidth, interconnect etc. On the Top500 list, you can see per each of the systems listed, what the maximum amount of flops is, and what is the effective or measured performance using the Linpack benchmark.

In order to achieve the increasing performance targets (we are talking on paving the road to Exascale….) we need to have as many cores as possible. As we all witness, GPUs have become the most cost-effective compute element, and the natural choice for bringing the desired compute capability in the next generation of supercomputers. A simple comparison shows that with a proprietary design, such as a Cray machine, one needs around 20K nodes to achieve Petascale computing, and using GPUs (assuming one per server). 5K nodes will be enough to achieve a similar performance capability – best cost effective solution.

So, now that we starting to plug in more and more GPUs into the new supercomputers, there are two things that we need to take care of – one, is to start working on the application side, and port applications to use parallel GPU computation (this is a subject for a whole new blog) and second, to make sure the communications between the GPU is as effective as possible. For the later, we have saw the recent announcements from NVIDIA and Mellanox on creating a new interface, called GPUDirect, that enables a better and more efficient communication interface between the GPUs and the InfiniBand interconnect. The new interface eliminates the CPU involvement from the GPU communications data path, using the host memory as the medium between the GPU and the InfiniBand adapter. One needs to be aware, that the GPUDirect solution requires network offloading capability to completely eliminate the CPU from being involved in the data path, as if the network requires CPU cycles to send and receive traffic, the CPU will still be involved in the data path! Once you eliminate the CPU from the GPU data path, you can reduce the GPU communications by 30%.

We will be seeing more and more optimizations for GPU communications on high speed networks. The end goal is of course to provide local system latencies for remote GPUs, and with that ensure the maximum utilization of the GPU’s flops capability.

Till next time,

Gilad Shainer

Paving The Road to Exascale – Part 1 of many

1996 was the year when the world saw the first Teraflops system. 12 years after, the first Petaflop system was built. It took the HPC world 12 years to increase the performance by a factor of 1000. Exascale computing, another performance jump by a factor of 1000 will not take another 12 years. Expectations indicate that we will see the first Exascale system in the year 2018, only 10 years after the introduction of the Petaflop system. How do we get to the Exascale system is a good question, but we definitely put some guidelines on how to do it right. Since there is much to write on this subject, this will probably take multiple blog posts, and we have time till 2018…  :)

Here are the items that I have in mind as overall guidelines:

-  Dense computing – we can’t populate Earth with servers as we need some space for living… so dense solutions will need to be built – packing as many cores as possible in a single rack. This is a task for the Dell folks…  :)

-  Power efficiency – energy is limited, and today data centers already consume too much power. Apart from alternative energy solutions, the Exascale systems will need to be energy efficient, and this covers all of the systems components – CPUs, memory, networking. Every Watt is important.

-  Many-many cores – CPU / GPU, as much as possible and be sure, software will use them all

-  Offloading networks – every Watt is important, every flop needs to be efficient. CPU/GPU availability will be critical in order to achieve the performance goals. No one can afford wasting cores on non-compute activities.

-  Efficiency – balanced systems, no jitters, no noise, same order of magnitude of latency everywhere – between CPUs, between GPUs, between end-points

-  Ecosystem/partnership is a must – no one can do it by himself.

In future posts I will expand on the different guidelines, and definitely welcome your feedback.

Gilad Shainer
Senior Director, HPC and Technical Computing