During my undergraduate days at UC Berkeley in the 1980s, I remember climbing through the attic of Cory Hall running 10Mbit/sec coaxial cables to professors’ offices. Man, that 10BASE2 coax was fast!! Here we are in 2014, right on the verge of 100Gbit/sec networks. A four-orders-of-magnitude increase in bandwidth is no small engineering feat, and achieving 100Gb/s network communications requires innovation at every level of the seven-layer OSI model.
To tell you the truth, I never really understood the top three layers of the OSI model. I prefer the TCP/IP model, which collapses all of them into a single “Application” layer, which makes more sense. Unfortunately, it also collapses the Link layer and the Physical layer, and I don’t think it makes sense to combine those two. I like to build my own ‘hybrid’ model that collapses the top three layers into an Application layer but still lets you consider the Link and Physical layers separately.
It turns out that a tremendous amount of innovation is required in these bottom four layers to achieve effective 100Gb/s communications networks. The application layer needs to change as well to fully take advantage of 100Gb/s networks. For now we’ll focus on the bottom four layers.
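For readers who like to see it written down, here is a minimal sketch of that hybrid model as a lookup table. The names OSI_TO_HYBRID, HYBRID_LAYERS, and bottom_four are made up for this illustration; the layer names are the standard OSI ones.

```python
# Map each of the seven OSI layers (top to bottom) onto the five-layer
# 'hybrid' model described above. All identifiers here are illustrative.
OSI_TO_HYBRID = {
    "Application": "Application",
    "Presentation": "Application",
    "Session": "Application",
    "Transport": "Transport",
    "Network": "Network",
    "Data Link": "Link",
    "Physical": "Physical",
}

# dict.fromkeys removes duplicates while keeping top-to-bottom order.
HYBRID_LAYERS = list(dict.fromkeys(OSI_TO_HYBRID.values()))

# The bottom four layers are everything below the collapsed Application layer.
bottom_four = HYBRID_LAYERS[1:]

print(HYBRID_LAYERS)
print(bottom_four)
```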
Transport Layer Innovation: RDMA
Let’s start at the transport layer and work down. The transport layer provides reliable connections to move data from A to B. These reliable connections are based on the TCP/IP stack, which was developed in the 1980s during an era of 10 Mbit/sec link technologies that were assumed to be unreliable. TCP/IP uses something called Implicit Congestion Notification to detect congestion in the network. This relies on packet acknowledgements from the receiver, sequence number checking, and software timeouts at the sender.
Basically, TCP/IP intentionally allows the network to become congested and drop packets. The sender uses sequence numbers to keep track of packets, and waits for acknowledgements to return from the receiver to determine that the packets have safely arrived.
If a packet gets lost, a software timeout at the sender eventually fires, implicitly signaling that congestion has occurred in the network. Once this congestion is detected, the sender starts resending packets but also throttles back its sending rate to try to avoid future congestion. With modern high speed networks this is a really bad idea!
It is like driving down the freeway and rear-ending someone to determine that there is congestion and only then putting on the brakes!
While this works, it’s obviously not the best idea.
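To make the rear-ending analogy concrete, here is a toy model of a sender that only learns about congestion by losing packets. It is a sketch under loud assumptions, not real TCP: the function name simulate_sender, the fixed link capacity, the one-packet-per-round ramp-up, and the halving on loss are all choices made for this illustration.

```python
def simulate_sender(link_capacity, rounds, start_rate=1):
    """Toy sender that discovers congestion only after packets are dropped.

    Each round the sender offers `rate` packets. Anything beyond
    `link_capacity` is dropped, which stands in for the sender's timeout
    firing. Only then does the sender halve its rate (an assumption of
    this sketch). Otherwise it probes by adding one packet per round.
    """
    rate = start_rate
    history = []
    for _ in range(rounds):
        dropped = max(0, rate - link_capacity)
        history.append((rate, dropped))
        if dropped > 0:
            rate = max(1, rate // 2)  # throttle back, but only after the crash
        else:
            rate += 1                 # keep speeding up until something breaks
    return history


history = simulate_sender(link_capacity=8, rounds=20)
total_dropped = sum(dropped for _, dropped in history)
for round_no, (rate, dropped) in enumerate(history, start=1):
    print(f"round {round_no:2d}: sent {rate}, dropped {dropped}")
print("total dropped:", total_dropped)
```

Notice that the sender never settles at the link capacity: it keeps climbing until it overshoots, loses a packet, and backs off, over and over.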
The original Ethernet coax used shared media where collisions and lost packets were the norm at the physical and link layers. All modern high speed networks use dedicated point-to-point links and switches, so packet loss or corruption at the link and physical layers is the exception rather than the norm.
This is very expensive both in terms of CPU utilization and more importantly in terms of latency. The same is true with using dropped packets to detect network congestion. You simply cannot use millisecond timeouts, dropped packets, and resends to deal with congestion in a 100Gbit/sec network.
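A quick back-of-the-envelope calculation shows why. The sketch below assumes a timeout of exactly 1 millisecond (the text only says "millisecond timeouts") and a link running at full line rate; the function name bytes_sent_during is made up for this example.

```python
def bytes_sent_during(timeout_us, link_bits_per_s):
    """Bytes a link can carry at full line rate while a timeout is pending.

    Integer arithmetic: microseconds * bits per second, divided by
    8 bits per byte and 1,000,000 microseconds per second.
    """
    return timeout_us * link_bits_per_s // (8 * 1_000_000)


TIMEOUT_US = 1_000  # assumed 1 ms software timeout

old_coax = bytes_sent_during(TIMEOUT_US, 10_000_000)        # 10 Mbit/sec
new_link = bytes_sent_during(TIMEOUT_US, 100_000_000_000)   # 100 Gbit/sec

print(f"10 Mbit/sec : {old_coax:,} bytes per timeout")
print(f"100 Gbit/sec: {new_link:,} bytes per timeout")
print(f"ratio       : {new_link // old_coax:,}x")
```

On the old coax, a 1 ms wait costs about one packet’s worth of data (1,250 bytes). At 100Gbit/sec the same wait is worth 12.5 megabytes of traffic that could have been on the wire.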
With 100Gb/sec it is critical to use a transport protocol that is high performance, low latency, and doesn’t rely on software. The transport protocol needs to allow an application on one server to share data with an application running on a remote server, as if that data were in the same physical machine. The ‘server’ can be a virtual machine running on a physical server performing a compute or storage function.
RDMA (Remote Direct Memory Access) is the key 100Gb/s transport protocol that achieves low-latency, high-throughput, reliable connections. It does all of the heavy-lifting protocol processing in hardware and delivers data directly to and from applications without involving the CPU or software stack. RDMA is available over both InfiniBand and Ethernet connections and provides the offloaded, low-latency connections needed to take advantage of 100Gb/s links.
RDMA transport is absolutely critical to taking advantage of high performance networks; however, a detailed discussion is beyond the scope of this overview. I’ll come back to this topic and cover it in more detail in another post.