All posts by Gilad Shainer

About Gilad Shainer

Gilad Shainer has served as Mellanox's Vice President of Marketing since March 2013. Previously, he was Mellanox's Vice President of Marketing Development from March 2012 to March 2013. Gilad joined Mellanox in 2001 as a design engineer and later served in senior marketing management roles between July 2005 and February 2012. He holds several patents in the field of high-speed networking and contributed to the PCI-SIG PCI-X and PCIe specifications. Gilad holds an MSc degree (2001, Cum Laude) and a BSc degree (1998, Cum Laude) in Electrical Engineering from the Technion Institute of Technology in Israel.

Deep Learning in the Cloud Accelerated by Mellanox

During the GTC’17 conference week, NVIDIA and Microsoft announced new deep learning solutions for the cloud. Artificial intelligence and deep learning have the power to pull meaningful insights from the data we collect, in real time, enabling businesses to gain a competitive advantage and to develop new products faster and better. As Jim McHugh, vice president and general manager at NVIDIA, said, having AI and deep learning solutions in the cloud simplifies access to the required technology and can unleash AI developers to build a smarter world.

Microsoft announced that Azure customers using Microsoft’s GPU-assisted virtual machines for their AI applications will have newer, faster-performing options. Corey Sanders, director of compute at Microsoft Azure, noted that the cloud offering will provide more than 2x the performance of the previous generation for AI workloads using CNTK (Microsoft Cognitive Toolkit), TensorFlow, Caffe, and other frameworks.

Mellanox solutions enable customers to speed up data insights with scalable deep learning in the cloud. Microsoft Azure NC is a massively scalable, highly accessible GPU computing platform. Customers can use GPUDirect RDMA (Remote Direct Memory Access) over InfiniBand to scale jobs across multiple instances. Scaling out to tens, hundreds, or even thousands of GPUs across hundreds of nodes allows customers to run tightly coupled jobs with frameworks such as the Microsoft Cognitive Toolkit, which is well suited for natural language processing, image recognition, and object detection.
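To make the scale-out pattern concrete, here is a minimal, hypothetical sketch of data-parallel training: each worker computes gradients locally and all workers average them with an MPI all-reduce. On RDMA-capable instances, an InfiniBand-aware MPI library can move these buffers with RDMA (and, where supported, directly from GPU memory via GPUDirect RDMA). The compute_local_gradients placeholder stands in for a framework-specific backward pass and is not a real API.

import numpy as np
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank = comm.Get_rank()
size = comm.Get_size()

def compute_local_gradients():
    # Placeholder for a framework-specific backward pass (CNTK, TensorFlow, Caffe, ...)
    return np.random.rand(1_000_000).astype(np.float32)

local_grad = compute_local_gradients()
global_grad = np.empty_like(local_grad)

# Sum the gradients from every worker, then divide to get the average.
comm.Allreduce(local_grad, global_grad, op=MPI.SUM)
global_grad /= size

if rank == 0:
    print("Averaged gradients across", size, "workers")

Such a job would typically be launched across the instances with a standard MPI launcher, for example mpirun -np 16 -hostfile hosts python train.py (the process count and host file here are illustrative).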

See more at: http://www.nvidia.com/object/gpu-accelerated-microsoft-azure.html

 

Special Effects Winners Need Winning Interconnect Solutions!

Mellanox is proud to enable the Moving Picture Company (MPC, http://www.moving-picture.com/) with our world-leading Ethernet and InfiniBand solutions, which are used during the creative process behind Oscar-winning special effects.

The post-production and editing phases require very high data throughput due to the need for higher resolution (the number of pixels that make up the screen), greater color depth (how many bits are used to represent each color) and higher frame rates (how many frames are played per second). Data must be edited in real time, and it must be edited uncompressed to avoid quality degradation.

You can do the simple math: a single stream of uncompressed 4K video requires 4096 x 2160 (a typical 4K/UHD pixel count) x 24 (color depth in bits) x 60 (frames per second), which is about 12.7Gb/s. Therefore, one needs interconnect speeds greater than 10Gb/s today. As we move to 8K video, with deeper color and higher frame rates, we will need data speeds greater than 100Gb/s! Mellanox is the solutions provider for such speeds and the enabler behind the great movies of today.
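For readers who want to check the arithmetic, here is a small illustrative calculation. The 8K figure assumes 30-bit color and 120 frames per second; these are assumptions for illustration, not a specific production format.

# Uncompressed video bandwidth = width x height x bits-per-pixel x frames-per-second
def video_bandwidth_gbps(width, height, bits_per_pixel, fps):
    return width * height * bits_per_pixel * fps / 1e9

# A single 4K stream at 24-bit color and 60 fps: ~12.7 Gb/s
print(video_bandwidth_gbps(4096, 2160, 24, 60))    # 12.74

# An 8K stream at 30-bit color and 120 fps: ~127 Gb/s
print(video_bandwidth_gbps(8192, 4320, 30, 120))   # 127.40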

We would like to congratulate MPC for winning the 2017 British Academy of Film and Television Arts (BAFTA) award for Special Visual Effects, and for winning the 2017 Oscar for visual effects! Congratulations from the entire Mellanox team.

Find out more about the creation of The Jungle Book effects:
Telegraph

Mellanox Joins OpenCAPI and GenZ Consortiums, and Continues to Drive CCIX Consortium Specification Development

This week, Mellanox was part of three press releases announcing the formation of two new standardization consortiums – OpenCAPI and GenZ – as well as a progress update from the CCIX (Cache Coherent Interconnect for Accelerators) consortium.

These new open standards demonstrate an industry-wide collaborative effort and the need for open, flexible and standard solutions for the future data center. The three consortiums are dedicated to delivering technology enhancements that increase data center application performance, efficiency and scalability, in particular for data-intensive applications, machine learning, high performance computing, cloud, Web 2.0 and more.

Mellanox is delighted to be part of all three consortiums, to be able to leverage the new standards in future products, and to enhance Mellanox Open Ethernet and InfiniBand solutions, enabling better communication between the interconnect and the CPU, memory and accelerators.

There are many common goals across the different consortiums: increasing the bandwidth between the CPU, memory, accelerators and the interconnect; reducing latency; enabling data coherency between platform devices; and more. While each consortium differs in its specific area of focus, they all drive the need for open standards and the ability to leverage existing technologies.

The CCIX consortium has tripled its membership and is getting close to releasing its specification. The CCIX specification enables enhanced performance and capabilities over PCIe Gen4, leveraging the PCIe ecosystem to enhance future compute and storage platforms.

As a board member of CCIX and OpenCAPI, and a member of GenZ, Mellanox is determined to help drive the creation of open standards. We believe that open standards and open collaboration between companies and users form the foundation for developing the technology needed for next-generation cloud, Web 2.0, high performance computing, machine learning, big data, storage and other infrastructures.

Co-Design Architecture to Deliver Next Generation of Performance Boost

The latest revolution in HPC is the move to Co-Design architecture, a collaborative effort to reach Exascale performance by taking a holistic system-level approach to fundamental performance improvements. This collaboration enables all active system devices to become accelerators by orchestrating a more effective mapping of communication between devices and software in the system to produce a well-balanced architecture across the various compute elements, networking, and data storage infrastructures.

Co-Design architecture maximizes system efficiency and optimizes performance by ensuring that all components serve as co-processors in the data center, creating synergies between the hardware and the software, and between the different hardware elements within the data center. This is in diametric opposition to the traditional CPU-centric approach, which seeks to improve performance by on-loading ever more operations onto the CPU.

Rather, Co-Design recognizes that the CPU has reached the limits of its scalability, and offers an intelligent network as the ideal co-processor to share the responsibility for handling and accelerating workloads. Since the CPU has reached its maximum performance, the rest of the network must be better utilized to enable additional performance gains.

Moreover, the CPU was designed to compute, not to oversee data transfer. Offloading non-compute functions frees the CPU to focus on its original purpose. By placing the algorithms that handle those functions on an intelligent network, performance improves both in the network and in the CPU itself.

This technology transition from CPU-centric architecture to Co-Design brings with it smart elements throughout the data center, with every active component becoming more intelligent. Data is processed wherever it is located, essentially providing in-network computing, instead of waiting for the processing bottleneck in the CPU.

The only solution is to enable the network to become a co-processor. Smart devices can move the data directly from the CPU or GPU memory into the network and back, and can analyze the data in the process. This means that the new model is for completely distributed in-network computing, wherever the data is located, whether at the node level, at the switch level, or at the storage level.

The first set of algorithms being migrated to the network are data aggregation protocols, which share and collect information across parallel processes. By offloading these algorithms from the CPU to the intelligent network, a data center can see at least a 10X performance improvement, resulting in a dramatic acceleration of various HPC applications and data analytics.
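To illustrate why moving aggregation into the network pays off at scale, here is a rough back-of-the-envelope latency model; every number in it is a placeholder rather than measured data. A host-based tree all-reduce crosses the network roughly log2(N) times and pays CPU processing at each step, while in-network aggregation reduces the data in a single traversal of the switch hierarchy.

import math

def host_based_allreduce_us(nodes, link_latency_us=1.0, cpu_overhead_us=2.0):
    # log2(N) communication steps, each paying link latency plus CPU processing
    steps = math.ceil(math.log2(nodes))
    return steps * (link_latency_us + cpu_overhead_us)

def in_network_allreduce_us(switch_levels=3, link_latency_us=1.0):
    # One pass up and down the switch tree; the switches do the aggregation
    return 2 * switch_levels * link_latency_us

for nodes in (64, 1024, 10000):
    print(nodes, host_based_allreduce_us(nodes), in_network_allreduce_us())

The gap between the two models widens with node count and with per-step CPU overhead, which is the intuition behind offloading collectives to the interconnect.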

In the future we anticipate seeing most data algorithms and communication frameworks (such as MPI) managed and executed by the data center interconnect, enabling us to perform analytics on the data as the data moves.

Ultimately, the goal of any data center is to experience the highest possible performance with the utmost efficiency, thereby providing the best return on investment. For many years, the best way to do this has been to maximize the frequency of the CPUs and to increase the number of cores. However, the CPU-centric approach can no longer scale to meet the massive needs of today’s data centers, and performance gains must be achieved from other sources. The Co-Design approach addresses this issue by offloading non-compute functions from the CPU onto an intelligent interconnect that is capable of not only transporting data from one endpoint to another efficiently, but is now also able to handle in-network computing, in which it can analyze and process data while it is en route.

Sound interesting? Learn more at our upcoming webinar, Smart Interconnect: The Next Key Driver of HPC Performance Gains.


ISC 2016 Recap

Last week we participated in the International Supercomputing Conference – ISC’16. In reality, the event started the weekend before, kicked off by the HP-CAST conference, where Mellanox presented our newest smart interconnect solutions to the entire HP-CAST audience – solutions that enable the highest application performance and the best cost-performance compute and storage infrastructures.

At ISC’16 we made several important announcements.

The new TOP500 supercomputers list was also announced last Monday, introducing a new number-one supercomputer in the world, built at the supercomputing center in Wuxi, China. The world’s new fastest supercomputer delivers 93 Petaflops (three times higher than the #2 system on the list), connecting nearly 41 thousand nodes and more than ten million CPU cores. The offloading architecture of the Mellanox interconnect solution is the key to providing world-leading performance, scalability and efficiency, connecting the highest number of nodes and CPU cores within a single supercomputer.

We have witnessed ecosystem demand for smart interconnect solutions that offload both network operations and data algorithms, which is critical for delivering higher application performance, efficiency and scalability.

For those who could not attend ISC, we have a video of the highlights you can view here: https://www.youtube.com/watch?v=IYRfkGqXdLo&feature=youtu.be

And for those of you on the go, you can use your mobile phone to view the Mellanox 360 gallery of ConnectX-5 advantages: https://www.youtube.com/watch?v=2yljG2KsBwg

See you at SC’16.


A Look At The Latest Omni-Path Claims

Once again, the temperature kicked up another few degrees in the interconnect debate with HPC Wire’s coverage of information released by Intel on the growth of Omni-Path Architecture (OPA). According to Intel, the company behind OPA, it has been seeing steady market traction. We have always expected Intel to win some systems, just as QLogic did in the past, or even Myricom years back; however, as I read the article in detail, I couldn’t help but take issue with some of their points.

On Market Traction

Intel has seen continued delays in Omni-Path’s production release. We are not aware of any company that can buy any OPA offering in the channel, and OEMs have not released anything.

In the article, a number of public wins are referenced, including the National Nuclear Security Administration’s Tri Labs (the Commodity Technology Systems (CTS-1) program) and the Pittsburgh Supercomputing Center. The latter was built with non-production parts because the center could not delay any further, and we have heard from sources that its performance is lacking.

The specific Department of Energy deal with NNSA is part of the commodity track of the DoE labs, a set of small systems used for commodity work. It does not cover the DoE leadership systems, and we know that Lawrence Livermore National Laboratory decided to use InfiniBand for its next leadership system under the CORAL project. The DoE did grant the previous commodity deal to QLogic TrueScale a few years ago, and QLogic made the same noise then that we are hearing today – that it was allegedly gaining momentum over Mellanox.

Additionally, the CTS program (formerly TLCC) enables a second tier of companies and helps the labs maintain multiple technology choices. The program results in a set of small-scale systems that the labs use for basic work, not for their major, high-scale applications. The previous TLCC was awarded to Appro and QLogic, and the current one to Penguin Computing and Intel OPA.

On A Hybrid Approach

Omni-Path is based on the same old technology, “InfiniPath” by PathScale, which was later bought and marketed by QLogic under the name “TrueScale.” As with QLogic and TrueScale, we believe any description of Omni-Path as a “hybrid” between off-loading and on-loading is likely not supported by the facts. Read more about it in my latest post for HPC Wire. You can see the system performance difference in various HPC application cases, such as WIEN2K, Quantum Espresso, and LS-DYNA.

On Performance

Intel chose to highlight message rate performance, stating: “Compute data coming out of MPI tends to be very high message rate, relatively small size for each message, and highly latency sensitive. There we do use an on-load method because we found it to be the best way to move data. We keep in memory all of the addressing information for every node, core, and process running that requires this communications.” While Intel previously claimed 160 million messages per second with OPA, it recently admitted the figure is closer to 79-100 million. Mellanox delivers a superior solution with 150 million messages per second.

Finally, as of today, Intel has not yet provided application performance benchmarks for OPA that support the details of the article or offer substance to its claims regarding performance versus Mellanox’s InfiniBand. We have a number of case studies that prove the performance of InfiniBand.

We look forward to seeing what Intel comes out with next.

One Step Closer to Exascale Computing: Switch-IB 2 & SHARP Technology

A typical metric used to evaluate network performance is its latency for point-to-point communications. But more important, and sometimes overlooked, is the latency of collective communications, such as barrier synchronization, used to synchronize a set of processes, and all-reduce, used to perform distributed reductions. For many high-performance computing applications, the performance of such collective operations plays a critical role in determining overall application scalability and performance. As such, a system-oriented approach to network design is essential for achieving the network performance needed to reach extreme system scales.

 

The CORE-Direct technology introduced by Mellanox was a first step toward taking a holistic system view, implementing the execution of collective communications in the network. The SHARP technology now being introduced is an extension of this approach, moving support for collective communication from the network edges, i.e. the hosts, to the core of the network – the switch fabric. Processing of collective communications moves to dedicated silicon within the Switch-IB 2 InfiniBand switch, providing the means to accelerate these collective operations by an order of magnitude.
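To get a sense of the latencies in question, below is a minimal micro-benchmark sketch that times small-message all-reduce operations with mpi4py. Whether the collectives are actually executed in the switch depends on the MPI library and fabric configuration, which is outside the scope of this snippet; the iteration count and buffer size are arbitrary choices for illustration.

import numpy as np
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank = comm.Get_rank()
iterations = 10000

send_buf = np.ones(8, dtype=np.float64)
recv_buf = np.empty_like(send_buf)

comm.Barrier()                      # start all ranks together
t0 = MPI.Wtime()
for _ in range(iterations):
    comm.Allreduce(send_buf, recv_buf, op=MPI.SUM)
t1 = MPI.Wtime()

if rank == 0:
    print("Average all-reduce latency: %.2f us" % ((t1 - t0) / iterations * 1e6))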

Continue reading

When Ethernet Meets the Unbelievable

Towering high above the Manhattan skyline on the Horizon level (102nd floor), One World Observatory at the new One World Trade Center (the tallest building in the Western Hemisphere) is the ideal location for an extraordinary event. The amazing views afforded by the floor-to-ceiling glass served as the perfect backdrop for some truly unbelievable Ethernet innovation from Mellanox Technologies as well!

 

Data centers are at a critical breaking point as they need to deal with growing amounts of data. The data center has become the competitive advantage and companies seek the right interconnect solutions to enable innovation. Commonly deployed closed Ethernet solutions (those requiring the use of proprietary hardware-software combinations) leave many businesses unable to optimize their data centers for their unique needs. Today, network performance and flexibility are critical to analyzing more data in real time.

 

Last night, against the fabulous backdrop of the New York City skyline, Mellanox unveiled new Ethernet products to enable the future of the data center – our new Open Ethernet Spectrum switch and ConnectX-4 Lx network adapter. Pushing the boundary for open solutions, Spectrum is the industry’s only 100GbE Open Ethernet-based network switch, which will enable businesses to deploy end-to-end 25 and 100Gb/s Ethernet while choosing the software best suited for their data center.

 

With this announcement, customers in the cloud, Web 2.0, Big Data and other high-performance markets will be able to achieve real-time insights from their data, giving them the information needed to drive their business forward. ConnectX-4 Lx also becomes the industry’s most cost-effective 25GbE and 50GbE network adapter, enabling organizations to migrate from 10GbE to 25GbE and from 40GbE to 50GbE with similar power and cost.

 

Today marks the beginning of a new era for Ethernet-based data center environments and we believe it will also mark the start of significant disruption in the market. What better way to usher in this new era than at an unbelievable venue in one of the world’s greatest cities.

 

Mellanox Enables Machine Learning at Baidu

A soon-to-be-released film, CHAPPiE, tackles the subject matter of artificial intelligence with an experimental robot built and designed to learn and feel. In the near future, crime is patrolled by an oppressive mechanized police force. When one police droid, CHAPPiE, is stolen and given new programming, he becomes the first robot with the ability to think and feel for himself, and he must fight back against forces planning to take him down.

 

While you may see it as a fictional work of scientific vision, the path toward artificial intelligence isn’t quite so far away. The building block of artificial intelligence is machine learning, and deep learning is a new area of machine learning whose objective is to move machine learning closer to artificial intelligence. Multiple organizations are investing significant resources into deep learning, including Google, Microsoft, Yahoo, Facebook, Twitter and Dropbox.

 


 

One such company tackling the challenge is Baidu, Inc., a web services company headquartered in Beijing, China. The company offers many services, including a Chinese-language search engine for websites, audio files and images. It also offers multimedia content, including MP3 music and movies, and was the first company in China to offer wireless access protocol (WAP) and PDA-based mobile search to users. Baidu has seen an ever-increasing percentage of voice and image searches on its platform.

 

Continue reading