As Machine Learning (ML) workloads are gaining popularity, more customers are choosing Ethernet interconnects for their ML infrastructure. Mellanox Spectrum 100GbE Ethernet Switches are ideal for Machine Learning workloads. In this article, we will explore Machine Learning training workload attributes and how they relate to Spectrum’s capabilities.
How are Machine Learning training workloads different?
Building high-performance ML infrastructure is a delicate balancing act
We can scale out the workload and distribute subsets of the data to ‘m’ different worker nodes (see figure above). Each worker node can work on its subset of the data and develop a local model. However, each local model sees only a subset of the data, so the local models can drift out of sync (and increase the error rate). To prevent this, all worker instances need to work in lockstep with each other and periodically merge their models. The parameter server is responsible for merging the models.
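The scale-out pattern described above can be sketched in a few lines of Python. This is a minimal, hypothetical illustration (the worker count, data shapes, and averaging-based merge are assumptions, not a description of any specific ML framework): each of m workers takes a few gradient steps on its own data shard, then a parameter server merges the local models by averaging.

```python
import numpy as np

def local_sgd_step(w, X, y, lr=0.1):
    """One gradient step of linear least-squares on a worker's data shard."""
    grad = 2 * X.T @ (X @ w - y) / len(y)
    return w - lr * grad

def parameter_server_merge(worker_weights):
    """Merge local models by averaging them (the simplest merge strategy)."""
    return np.mean(worker_weights, axis=0)

# m workers, each holding a private shard of the training data
rng = np.random.default_rng(0)
m, n_features = 4, 3
true_w = np.array([1.0, -2.0, 0.5])          # ground truth for this toy problem
shards = []
for _ in range(m):
    X = rng.normal(size=(64, n_features))
    shards.append((X, X @ true_w))

w_global = np.zeros(n_features)
for sync_round in range(50):
    # each worker takes a few local steps on its own shard...
    local_models = []
    for X, y in shards:
        w = w_global.copy()
        for _ in range(5):
            w = local_sgd_step(w, X, y)
        local_models.append(w)
    # ...then the parameter server merges the local models,
    # and the merged model is broadcast back for the next round
    w_global = parameter_server_merge(local_models)
```

With more local steps between merges, each round is cheaper in communication but the local models drift further apart before being reconciled, which mirrors the trade-off the article describes.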
Scaling out will parallelize the computation and will shorten the time for a single iteration. However, scaling out will also increase the number of independent worker nodes and hence will increase the error rate. As the error rate increases, more iterations will be needed to converge. At some point, the increased number of iterations needed will wipe out the benefits obtained from scaling out.
The optimal solution is to scale out only to the point where it is still beneficial, and then focus on other ways of extracting performance from the infrastructure.
Mellanox Spectrum 100GbE Ethernet Switches are ideal for Machine Learning workloads
Exchanging millions of model parameters requires enormous network bandwidth. Mellanox Spectrum switches provide just that, with support for line-rate 32x100GbE performance with zero packet loss.
The worker nodes can work independently for a few iterations but need to repeatedly sync up and work in lockstep in order to converge. Consistent low latency is important in distributed systems where the individual processes work in lockstep: the entire system slows down to wait for the node that experiences the worst latency, so jitter makes the whole system inefficient. Mellanox Spectrum switches support line-rate traffic with consistently low latency and low jitter.
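The straggler effect described above can be demonstrated with a tiny simulation. This is an illustrative sketch (the worker count and the two latency distributions are assumptions chosen for the toy model): two fabrics have the same mean per-worker latency, but a lockstep barrier completes only when the slowest worker arrives, so the high-jitter fabric finishes each round measurably later.

```python
import random

random.seed(1)
m = 16          # workers synchronizing at each barrier
rounds = 1000   # barriers to simulate

def barrier_time(latencies):
    # a lockstep barrier finishes only when the slowest worker arrives
    return max(latencies)

# low-jitter fabric: every worker sees close to 1.0 time units (mean 1.0)
low_jitter = sum(
    barrier_time([random.uniform(0.95, 1.05) for _ in range(m)])
    for _ in range(rounds)
) / rounds

# high-jitter fabric: same mean latency of 1.0, but a much wider spread
high_jitter = sum(
    barrier_time([random.uniform(0.5, 1.5) for _ in range(m)])
    for _ in range(rounds)
) / rounds
```

Even though both fabrics deliver the same average latency, the high-jitter fabric's barriers consistently take longer, and the gap widens as the number of lockstep workers grows.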
The TCP/IP stack does not meet the performance needs of ML/AI workloads. RDMA over Converged Ethernet (RoCE) has proven to be the right choice for high-performance distributed workloads such as ML. Mellanox Spectrum provides robust support for RoCE. Additionally, Mellanox Spectrum has visibility features and “easy button” automation knobs to help users enable RoCE.
As workloads evolve, network infrastructure needs to evolve with them. Mellanox Spectrum Ethernet switches are the right choice for building high-performance Machine Learning infrastructure because they support:

- Line-rate 100GbE throughput with zero packet loss
- Consistently low latency and low jitter
- Robust RoCE
In addition, Mellanox Spectrum has the right hooks to support visibility, automation, and orchestration tools. No wonder cloud service providers around the world are picking Mellanox Ethernet solutions to build their AI infrastructure.