Accelerating Artificial Intelligence over High-Performance Networks

 
AI, InfiniBand, RDMA, Storage

Last year, on November 10, 2016, IBM announced “Tencent Cloud Joins IBM and Mellanox to Break Data Sorting World Records.” In that press release, IBM noted that, running over a high-speed 100Gb/s network, Tencent Cloud set four world records by winning the GraySort and MinuteSort categories at the renowned Sort Benchmark contest, at rates two to five times faster than the previous year’s winners. Highlights included:

  • Took just 98.8 seconds to sort 100TB of data
  • Used IBM OpenPOWER servers and Mellanox 100Gb/s Ethernet

This year, at the end of last month, Tencent published a blog, “The Future is Here: Tencent AI Computing Network”, in which it discusses the advantages of using standard RDMA-enabled networks with NVMe when building an AI system. The blog was written in Chinese; an English translation follows below.

Both references show the significant role that networking plays in building a data center that can meet the growing compute and storage demands of artificial intelligence applications, such as image recognition and speech recognition.

The Future is Here: Tencent AI Computing Network

March 24, 2017, Xiang Li, Tencent Network

Tencent Network – a platform for young Tencent network enthusiasts and their peers to exchange ideas. This group of young people has planned, designed, built, and operated Tencent’s huge and complex network, experiencing all the ups and downs along the way.

“Tencent Network” is operated by the Network Platform Department of the Technical Engineering Group of Shenzhen Tencent Computer Systems Co., Ltd. We hope to work with like-minded friends in the industry to stay on top of the latest network and server trends, while sharing Tencent’s experiences and achievements in network and server planning, operations, research and development, and service. We look forward to working and growing with everyone.

There is no doubt that artificial intelligence has been the hottest topic in the IT industry in recent years. Especially since landmark events such as AlphaGo in 2016, technology giants in China and worldwide have continued to increase their investment in artificial intelligence. At present, the main applications of artificial intelligence, such as image recognition and speech recognition, are achieved through machine learning, using powerful computing platforms to perform massive data analysis and calculation. However, as data volumes grow, standalone machines can no longer meet the computational demand, so high-performance computing (HPC) clusters are needed to further enhance computing power.

An HPC cluster is a distributed system that organizes multiple computing nodes together. It generally uses RDMA (Remote Direct Memory Access) technologies such as iWARP, RoCE, or InfiniBand (IB) to exchange data quickly between computing nodes. As shown in Figure 1, the RDMA network card can fetch data from the sending node’s address space and deliver it directly into the address space of the receiving node. The entire exchange requires no kernel involvement or intermediate memory copies, which greatly reduces processing delay on the server side. At the same time, with the network acting as part of the HPC cluster, any blockage in transmission wastes computing resources. To maximize the cluster’s computing power, the network is usually required to complete RDMA transfers within 10us. Therefore, for networks carrying HPC traffic, latency is a primary indicator of cluster computing performance.

Figure 1: RDMA interconnection architecture

 

In actual deployments, the main factors that affect network delay are:

  1. Hardware delay. Switch forwarding time, the number of forwarding hops, and fiber distance all affect network delay. Typical optimizations are to use a two-level fat-tree to reduce the number of forwarding levels, to upgrade link speeds so that data is serialized at a higher rate, and to deploy low-latency switches (as low as 0.3us per hop); a worked latency budget is sketched after this list.
  2. Network packet loss. When buffer overflow causes packet loss under congestion, the server side must retransmit the entire data segment, which severely worsens the delay. Common countermeasures are: increasing switch buffers, increasing network bandwidth to improve resistance to congestion, optimizing incast scenarios at the application layer to reduce congestion points, and deploying flow control technologies that slow down the source in order to eliminate congestion.
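To make the hardware side concrete, here is a minimal worked latency budget for a two-level fat-tree path. All figures are illustrative assumptions (0.3us cut-through switches, three hops, 100m of fiber at roughly 5ns/m, and one 1KB packet serialized at 100Gb/s), not measurements from Tencent’s network.

```python
# Illustrative latency budget for one path through a two-level fat-tree.
# Every constant below is an assumption chosen for the example.
SWITCH_DELAY_US = 0.3      # low-latency cut-through switch, per hop
HOPS = 3                   # leaf -> spine -> leaf (worst case across the fabric)
FIBER_M = 100              # assumed total fiber length on the path
FIBER_NS_PER_M = 5         # ~5 ns per meter of propagation in fiber
PKT_BYTES = 1024           # example RDMA packet size
LINK_GBPS = 100            # 100Gb/s links

switch_us = HOPS * SWITCH_DELAY_US
fiber_us = FIBER_M * FIBER_NS_PER_M / 1000
serialize_us = PKT_BYTES * 8 / (LINK_GBPS * 1000)   # bits / (bits per us)

total_us = switch_us + fiber_us + serialize_us
print(f"switches {switch_us:.2f}us + fiber {fiber_us:.2f}us "
      f"+ serialization {serialize_us:.2f}us = {total_us:.2f}us")
# Under these assumptions the fabric itself costs roughly 1.5us, leaving most
# of the 10us RDMA budget for the NICs, provided congestion adds no queuing delay.
```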

When a data center’s network hardware environment is relatively fixed, relying on hardware upgrades to reduce delay has only a limited effect; the more common practice is to reduce delay by reducing network congestion. For HPC networks, the industry has therefore focused on studying the “lossless network”; however, the more mature solutions today combine a lossy network with flow control protocols, which is a different direction from the industrial lossless network.

Lossy network and flow control protocol

 

Ethernet uses “best effort” forwarding: each network element transmits data to the downstream element as fast as it can, without regard for the downstream element’s forwarding capacity, which can cause congestion packet loss downstream. Ethernet is therefore a lossy network that does not guarantee reliable transmission. Data centers use the reliable TCP protocol to transmit most data, but Ethernet RDMA (RoCE) packets are mostly carried over UDP, so buffer management and flow control technologies must be deployed to reduce packet loss on the network side.

PFC (Priority Flow Control) is a queue-based back-pressure protocol. A congested network element prevents buffer-overflow packet loss by sending Pause frames that tell the upstream network element to slow down. In a single-level scenario, PFC can throttle the server quickly and effectively to ensure that the network does not lose packets. In a multi-level network, however, problems such as head-of-line blocking (Figure 2), unfair deceleration, and PFC storms can occur, and an abnormal server that transmits PFC frames into the network can even paralyze the entire network. Therefore, enabling PFC in the data center requires strict monitoring and management of Pause frames to ensure network reliability.
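To illustrate the queue-based back pressure described above, here is a minimal simulation of a single PFC-enabled ingress queue. The XOFF/XON thresholds and per-tick rates are arbitrary assumptions chosen so the behavior is visible, not vendor defaults.

```python
# Sketch of PFC-style pause/resume on one ingress queue (assumed numbers).
XOFF_KB = 200    # buffer level at which a Pause frame is sent upstream
XON_KB = 100     # buffer level at which the upstream is told to resume
ARRIVAL_KB = 40  # data arriving per tick while the upstream is not paused
DRAIN_KB = 25    # data drained toward the egress port per tick

buffer_kb = 0
paused = False
for tick in range(40):
    if not paused:
        buffer_kb += ARRIVAL_KB          # upstream keeps sending
    buffer_kb = max(0, buffer_kb - DRAIN_KB)
    if not paused and buffer_kb >= XOFF_KB:
        paused = True                    # Pause frame: upstream must stop
        print(f"t={tick:2d} buffer={buffer_kb:3d}KB -> PAUSE sent upstream")
    elif paused and buffer_kb <= XON_KB:
        paused = False                   # zero-quanta Pause: upstream resumes
        print(f"t={tick:2d} buffer={buffer_kb:3d}KB -> RESUME sent upstream")
# The buffer oscillates between XON and XOFF and never overflows, but while
# paused the backlog simply moves to the upstream device, which is where
# head-of-line blocking and congestion spreading come from.
```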

 

Figure 2: PFC head-of-line blocking issue

 

ECN (Explicit Congestion Notification) is an IP-based end-to-end flow control mechanism.

Figure 3: ECN deceleration process

 

As shown in Figure 3, when the switch detects that a port’s buffer occupancy has crossed a threshold, it sets the ECN field of packets as it forwards them. The destination network card generates a congestion notification based on the marked packets, and the source network card decelerates the offending flow precisely. ECN avoids the head-of-line blocking problem and can achieve accurate per-flow deceleration. However, because the back-pressure packets must be generated at the NIC, its response period is longer, and it is usually used as a complement to PFC to reduce the amount of PFC traffic in the network. As shown in Figure 4, ECN should have a lower trigger threshold than PFC so that flows are decelerated before PFC takes effect.
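The ordering shown in Figure 4 can be sketched numerically: with the ECN marking threshold set well below the PFC XOFF threshold, a filling queue starts marking packets (so sources begin to slow down) long before a Pause frame would be needed. Both thresholds and the fill rate below are assumptions for illustration only.

```python
# Sketch: ECN marking threshold below the PFC XOFF threshold (assumed values).
ECN_THRESHOLD_KB = 150   # switch starts CE-marking forwarded packets here
PFC_XOFF_KB = 400        # switch would emit a Pause frame only at this level
FILL_KB_PER_TICK = 20    # net queue growth per tick under congestion

queue_kb = 0
ecn_tick = pfc_tick = None
for tick in range(40):
    queue_kb += FILL_KB_PER_TICK
    if ecn_tick is None and queue_kb >= ECN_THRESHOLD_KB:
        ecn_tick = tick                  # end-to-end deceleration starts here
    if pfc_tick is None and queue_kb >= PFC_XOFF_KB:
        pfc_tick = tick                  # link-level pausing would start here

print(f"ECN marking starts at t={ecn_tick}, PFC pause would trigger at t={pfc_tick}")
print(f"-> {pfc_tick - ecn_tick} ticks of headroom for the marked flows to back off")
# In practice the marked flows slow down and the queue drains before the PFC
# threshold is ever reached, so PFC is kept only as a burst-protection backstop.
```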

Figure 4: PFC and ECN trigger time

 

In addition to the mainstream approaches of large buffers, PFC, and ECN, the industry has also proposed hashing on RDMA fields, elephant-flow shaping, the queue-length-based hashing algorithm DRILL, the bandwidth-headroom algorithm HULL, and other solutions. However, most of these schemes require support in network cards and switch chips, which makes large-scale deployment difficult in the short term.

Industrial lossless network

Figure 5: InfiniBand flow control mechanism

InfiniBand is an interconnect architecture designed for high-performance computing and storage; it completely defines its own seven-layer protocol stack and is characterized by low latency and lossless forwarding. As shown in Figure 5, the IB network adopts a credit-based flow control mechanism. When the link is initialized, the sender negotiates an initial credit for each queue, indicating the number of packets that can be sent to the other end; the receiver, according to its own forwarding capacity, refreshes each queue’s credit to the sender in real time, and when the sender’s credits are exhausted it stops sending packets. Because network elements and network cards must hold credits before sending packets, the IB network never experiences sustained congestion, which guarantees the reliable transmission of a lossless network. IB provides 15 service queues (virtual lanes) to differentiate traffic, and traffic in different queues does not block one another. In addition, IB switches use “cut-through” forwarding with a single-hop forwarding delay of about 0.3us, much lower than that of Ethernet switches.
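For intuition, below is a toy model of the credit loop described above: the sender may transmit only while it holds credits for the receiver’s queue, and the receiver returns a credit each time it drains a packet, so the receive buffer can never overflow. The queue depth and drain rate are assumed values, not figures from the IB specification.

```python
# Toy model of IB-style credit-based (lossless) flow control on one queue.
RX_QUEUE_SLOTS = 8        # assumed receive-buffer depth for this queue
credits = RX_QUEUE_SLOTS  # initial credit advertised at link initialization
rx_queue = 0              # packets currently held in the receiver's buffer
sent = dropped = 0

for tick in range(30):
    # Sender: transmit one packet per tick, but only while credits remain.
    if credits > 0:
        credits -= 1
        rx_queue += 1
        sent += 1
    # With zero credits the sender simply waits; nothing is ever dropped.

    # Receiver: drain one packet every other tick (slower than the sender)
    # and return a credit for the slot it just freed.
    if tick % 2 == 0 and rx_queue > 0:
        rx_queue -= 1
        credits += 1

print(f"sent={sent} dropped={dropped} rx_queue={rx_queue} credits={credits}")
assert rx_queue <= RX_QUEUE_SLOTS and dropped == 0   # lossless by construction
```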

Therefore, for small HPC and storage networks, IB is an excellent choice. However, issues such as its incompatibility with Ethernet and its limited range of product forms make it difficult to integrate into the Tencent production network.

The Tencent AI Computing Network

The Tencent AI computing network is part of the production network; besides communicating with other network modules, it also needs to interface with back-end systems such as network management and security. This means that only an Ethernet option compatible with the existing network can be chosen. The architecture of the computing network has gone through multiple iterations as business requirements have grown, from the earliest HPC v1.0, which supported 80 40G nodes, to today’s HPC v3.0, which supports 2,000 100G nodes.

Computing nodes in the computing network are pooled and shared by the entire company, which exposes the network to concurrent congestion from multiple services’ traffic. For a network carrying a single service, congestion can be avoided through application-layer scheduling. For a network shared by multiple services, however, concurrent congestion is inevitable; even with queue protection and flow control mechanisms to reduce packet loss, the cluster still loses computing capacity because servers are slowed down. At the same time, PFC’s defects make it unsuitable for a multi-level network, so its scope of effect must be limited. Our design principles are therefore as follows:

  1. Physically isolate services: use high-density switches as access devices so that a department’s nodes are concentrated under a single access device as far as possible, and limit the number of clusters that span devices;
  2. Enable PFC only within the access device to ensure rapid back pressure, and enable ECN across the entire network to protect the clusters that do span devices;
  3. For the limited number of cross-device clusters, provide enough network bandwidth to reduce congestion and use large-buffer switches to compensate for ECN’s long back-pressure cycle;
  4. To meet the requirements of high-density access, large buffers, and end-to-end back pressure, the HPC v3.0 architecture uses chassis switches based on the Broadcom (BCM) DUNE family of chips as access devices.

Figure 6: HPC3.0 architecture

As shown in Figure 6, HPC v3.0 is a two-stage CLOS architecture in which both the aggregation devices (LC) and the access devices (LA) are chassis switches based on BCM DUNE chips; each LA can connect up to 72 40G/100G servers. Considering that most applications today run on clusters of 10 to 20 nodes, and that improvements in computing nodes and algorithms will further limit cluster sizes, 72 ports are sufficient to meet the computing requirements of a single service. The DUNE line card provides 4GB of buffer, can absorb milliseconds of congestion, and supports a VoQ-based end-to-end flow control scheme (Figure 7) that, within the same chassis, can decelerate servers as precisely as PFC. Although the forwarding delay of a chassis switch (4us) is higher than that of a fixed-form-factor switch (1.3us), the reduction in packet loss and congestion means this does not affect cluster performance.
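A back-of-the-envelope check of the “milliseconds of congestion” claim, under the simplifying assumption that the full 4GB line-card buffer is available to absorb the excess traffic of a single incast; the 10:1 incast ratio is only an example scenario.

```python
# Rough check: how long can a 4GB DUNE line-card buffer absorb an incast?
BUFFER_GB = 4
buffer_bits = BUFFER_GB * 8 * 10**9          # 4GB expressed in bits

# Example incast: ten 100Gb/s senders converge on one 100Gb/s egress port,
# so excess traffic accumulates at 9 x 100Gb/s (assumed scenario).
excess_gbps = 9 * 100
excess_bits_per_ms = excess_gbps * 10**9 / 1000

absorb_ms = buffer_bits / excess_bits_per_ms
print(f"~{absorb_ms:.1f} ms of this incast can be buffered before any drop")
# Roughly 35 ms under these assumptions, comfortably "ms-level", buying time
# for the VoQ end-to-end back pressure (and ECN across devices) to slow sources.
```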

Figure 7: DUNE chip end-to-end flow control


From a cost perspective, the per-port cost of a chassis switch is higher than that of a fixed-form-factor switch. A single LA, however, can meet most computing needs, and the demand for cross-LA clusters is limited, which reduces the number of interconnection modules; overall, this design is cheaper than the traditional scheme of fixed-form-factor access switches with one-to-one aggregation.

Summary

For a long time, the network was not the bottleneck in data center performance, and “large bandwidth” network designs were able to meet business needs. In recent years, however, the rapid development of server technology has driven rapid improvement in data center computing and storage capacity, and RDMA technologies such as RoCE and NVMe over Fabrics have shifted the data center performance bottleneck to the network side. Especially for RDMA-based applications such as HPC, distributed storage, GPU clouds, and hyper-converged architectures, network delay has become a major performance constraint. It is therefore foreseeable that future data center design will gradually shift from being bandwidth-driven to being delay-driven. Our long-term goals include building a low-latency, lossless, large-scale Ethernet data center and establishing a complete buffer and delay monitoring mechanism.

You are more than welcome to follow our public account, “Tencent Network”. We provide the latest industry news and hands-on experience from Tencent’s network and server teams, and several interactive events with prizes are in preparation. We welcome and look forward to your active participation.

  • Note 1: The copyright of all texts and images marked as from “Tencent Network” belongs to Shenzhen Tencent Computer Systems Co., Ltd. Use without official authorization is not permitted, and any verified violation will be prosecuted.
  • Note 2: Some of the images in this article come from the internet. Please contact kevinmi@tencent.com regarding any copyright issues.

You can subscribe by simply clicking on the “public account” above!

About Motti Beck

Motti Beck is Sr. Director, Enterprise Market Development at Mellanox Technologies, Inc. Before joining Mellanox, Motti was a founder of BindKey Technologies, an EDA startup that provided deep-submicron semiconductor verification solutions and was acquired by DuPont Photomasks, and of Butterfly Communications, a pioneering provider of Bluetooth solutions that was acquired by Texas Instruments. Prior to that, he was a business unit director at National Semiconductor. Motti holds a B.Sc. in computer engineering from the Technion – Israel Institute of Technology. Follow Motti on Twitter: @MottiBeck
