Top-3 Ethernet Interconnect Considerations for your Machine Learning Infrastructure


As Machine Learning (ML) workloads gain popularity, more customers are choosing Ethernet interconnects for their ML infrastructure. Mellanox Spectrum 100GbE Ethernet Switches are ideal for Machine Learning workloads. In this article, we explore the attributes of Machine Learning training workloads and how they relate to Spectrum's capabilities.

How are Machine Learning training workloads different?

  1. Calculations are approximate
    The training process is statistical in nature. Tradeoffs between infrastructure speed and algorithm accuracy are often made.
  2. Computation is iterative
    The model parameter optimization is typically done to minimize model prediction error for the given training data set using an iterative method like gradient descent.
  3. Datasets are typically too big to fit in a single server
    It is typical for models to have tens of millions of parameters and for training sets to have tens of billions of samples. Training such a model would take years on a single CPU server, or months on a single off-the-shelf GPU. The solution is to build a scale-out infrastructure and distribute the load across several worker nodes.
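To make the second point concrete, here is a minimal sketch (not Mellanox code; all numbers are illustrative) of the kind of iterative optimization described above: batch gradient descent minimizing mean squared error for a toy linear model.

```python
import numpy as np

# Illustrative sketch: minimize mean squared error for a linear model
# y = X @ w using iterative batch gradient descent (toy data, assumed sizes).
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 10))           # 1000 samples, 10 features
true_w = rng.normal(size=10)
y = X @ true_w + 0.01 * rng.normal(size=1000)  # targets with small noise

w = np.zeros(10)                          # model parameters to learn
lr = 0.1                                  # learning rate (assumed)
for step in range(200):                   # iterative refinement
    grad = 2 * X.T @ (X @ w - y) / len(y) # gradient of MSE w.r.t. w
    w -= lr * grad                        # gradient descent update

# After enough iterations, w approaches true_w (approximately, not exactly):
# the result is statistical, matching point 1 above.
```

Real training replaces this toy loop with stochastic gradient descent over mini-batches, but the iterate-until-converged structure is the same.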

Building high-performance ML infrastructure is a delicate balancing act

We can scale out the workload and distribute subsets of the data to 'm' different worker nodes (see figure above). Each worker node works on its subset of the data and develops a local model. However, because each local model sees only a subset of the data, the models can drift out of sync (and increase the error rate). To prevent this, all worker instances need to work in lockstep with each other and periodically merge their models. The parameter server is responsible for merging the models.
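The data-parallel scheme above can be sketched as follows. This is a hypothetical toy (the shard sizes, worker count, and averaging-based merge are all assumptions for illustration): each of m workers takes one gradient step on its shard, and the parameter server merges the local models by averaging their parameters.

```python
import numpy as np

# Hypothetical sketch of data-parallel training with a parameter server.
# Each of m workers fits a local linear model on its data shard; the
# parameter server periodically merges (averages) the local parameters.

def local_gradient_step(w, X_shard, y_shard, lr=0.1):
    """One gradient-descent step on a worker's data shard (MSE loss)."""
    grad = 2 * X_shard.T @ (X_shard @ w - y_shard) / len(y_shard)
    return w - lr * grad

rng = np.random.default_rng(1)
m = 4                                    # number of worker nodes (assumed)
true_w = rng.normal(size=8)
shards = []
for _ in range(m):                       # each worker holds its own shard
    X = rng.normal(size=(500, 8))
    shards.append((X, X @ true_w))

w_global = np.zeros(8)                   # parameter server's merged model
for sync_round in range(50):             # workers proceed in lockstep rounds
    local_models = [local_gradient_step(w_global, X, y) for X, y in shards]
    w_global = np.mean(local_models, axis=0)  # parameter server merges models
```

Every sync round here requires moving all model parameters between workers and the parameter server, which is exactly the network traffic discussed below.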

Scaling out will parallelize the computation and will shorten the time for a single iteration. However, scaling out will also increase the number of independent worker nodes and hence will increase the error rate. As the error rate increases, more iterations will be needed to converge. At some point, the increased number of iterations needed will wipe out the benefits obtained from scaling out.
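A back-of-the-envelope model makes this tradeoff visible. The numbers and the quadratic convergence penalty below are illustrative assumptions, not measurements: per-iteration time shrinks as 1/m, while the iterations needed to converge grow with m, so total training time falls and then rises again.

```python
# Toy cost model (assumed numbers, not measurements): scaling out speeds up
# each iteration but increases the iterations needed to converge.

def total_time(m, base_iter_time=100.0, base_iters=1000, drift=0.05):
    time_per_iter = base_iter_time / m               # parallel speedup
    iters_needed = base_iters * (1 + drift * m) ** 2 # convergence penalty (assumed)
    return time_per_iter * iters_needed

times = {m: total_time(m) for m in (1, 2, 4, 8, 16, 32, 64)}
best = min(times, key=times.get)
print(best)  # → 16: beyond this point, extra iterations outweigh the speedup
```

With these made-up constants the sweet spot lands at 16 workers; the real optimum depends on the model, dataset, and interconnect, which is why squeezing performance out of the infrastructure itself matters.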

The optimal solution is to scale out to the point that it is still beneficial, and then focus on other ways of extracting performance from the infrastructure.

Mellanox Spectrum 100GbE Ethernet Switches are ideal for Machine Learning workloads

Exchanging millions of model parameters requires enormous network bandwidth. Mellanox Spectrum switches provide just that, with support for line-rate 32x100GbE performance with zero packet loss.

The worker nodes can work independently for a few iterations but need to repeatedly sync up and work in lockstep in order to converge. Consistent low latency is important in distributed systems whose individual processes work in lockstep. Jitter makes the system inefficient, as the entire distributed system slows down waiting for the node that experiences the worst latency. Mellanox Spectrum switches support line-rate traffic with consistently low latency and low jitter.
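A tiny simulation illustrates why jitter, not just average latency, hurts lockstep systems. The latency figures are invented for illustration: each sync round completes only when the slowest of the nodes finishes, so round time tracks the worst-case latency, and wider jitter inflates it far more than the mean would suggest.

```python
import random

# Toy straggler model (all latency numbers are assumed, for illustration):
# in lockstep training, a sync round finishes only when the slowest node
# does, so the round time is the max of per-node latencies, not the mean.

random.seed(0)
NODES = 32

def round_time(jitter_us):
    # each node: 10 us base latency plus uniform jitter (assumed numbers)
    return max(10.0 + random.uniform(0.0, jitter_us) for _ in range(NODES))

def avg_round_time(jitter_us, rounds=10000):
    return sum(round_time(jitter_us) for _ in range(rounds)) / rounds

low_jitter = avg_round_time(1.0)    # tight jitter: rounds take ~11 us
high_jitter = avg_round_time(20.0)  # wide jitter: rounds take ~29 us
```

With 32 nodes, the expected maximum of uniform jitter sits near the top of the jitter range, so widening jitter from 1 us to 20 us roughly triples the effective round time even though the base latency is unchanged.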

The TCP/IP stack does not meet the performance needs of ML/AI workloads. RDMA over Converged Ethernet (RoCE) has proven to be the right choice for high-performance distributed workloads such as ML. Mellanox Spectrum offers robust support for RoCE. Additionally, Mellanox Spectrum has visibility features and “easy button” automation knobs to help users enable RoCE.

Bottom line

As workloads evolve, network infrastructure needs to evolve with them. Mellanox Spectrum Ethernet switches are the right choice for building your high-performance Machine Learning infrastructure because they support:

  1. Reliable line rate 100GbE
  2. Consistently low latency
  3. Robust RoCE

In addition, Mellanox Spectrum has the right hooks to support visibility, automation and orchestration tools. No wonder cloud service providers around the world are picking Mellanox Ethernet solutions to build their AI infrastructure.

About Karthik Mandakolathur

Karthik is a Senior Director of Product Marketing at Mellanox. Karthik has been in the networking industry for over 15 years. Before joining Mellanox, he held product management and engineering positions at Cisco, Broadcom and Brocade. He holds multiple U.S. patents in the area of high performance switching architectures. He earned an MBA from The Wharton School, MSEE from Stanford and BSEE from Indian Institute of Technology.
