So in two previous posts, I discussed the innovations required at the transport, network, and link layer of the communications protocol stack to take advantage of 100Gb/s networks . Let’s now talk about the physical layer. A 100Gb/sec signaling rate implies a 10ps symbol period.
Frankly, this is just not possible on a commercial basis with current technology. Neither is it possible on copper nor on optical interfaces. At this rate the electrical and optical pules just can’t travel any useful distance without smearing into each other and getting corrupted.
So there are two possible solutions. The first is to use 4 parallel connections each running @25Gb/sec. The second is to use a single channel with a 25Gb/sec symbol rate but to send four bits per symbol. This can be done either through techniques like Pulse Amplitude Modulation (PAM4) or optically by sending four different colors of light on a single fiber using Wavelength Division Multiplexing (WDM) techniques. Continue reading →
Network and Link Layer Innovation: Lossless Networks
In a previous post, I discussed that innovations are required to take advantage of 100Gb/s at every layer of the communications protocol stack networks – starting off with the need for RDMA at the transport layer. So now let’s look at the requirements at the next two layers of the protocol stack. It turns out that RDMA transport requires innovation at the Network and Link layers in order to provide a lossless infrastructure.
‘Lossless’ in this context does not mean that the network can never lose a packet, as some level of noise and data corruption is unavoidable. Rather by ‘lossless’ we mean a network that is designed such that it avoids intentional, systematic packet loss as a means of signaling congestion. That is packet loss is the exception rather than the rule.
Lossless networks can be achieved by using priority flow control at the link layer which allows packets to be forwarded only if there is buffer space available in the receiving device. In this way buffer overflow and packet loss is avoided and the network becomes lossless.
In the Ethernet world, this is standardized as 802.1 QBB Priority Flow Control (PFC) and is equivalent to putting stop lights at each intersection. A packet on a given priority class can only be forwarded when the light is green.
During my undergraduate days at UC Berkeley in the 1980’s, I remember climbing through the attic of Cory Hall running 10Mbit/sec coaxial cables to professors’ offices. Man, that 10base2 coax was fast!! Here we are in 2014 right on the verge of 100Gbit/sec networks. Four orders of magnitude increase in bandwidth is no small engineering feat, and achieving 100Gb/s network communications requires innovation at every level of the seven layer OSI model.
To tell you the truth, I never really understood the top three layers of this OSI model: I prefer the TCP/IP model which collapses all of them into a single “Application” layer which makes more sense. Unfortunately, it also collapses the Link layer and the Physical layer and I actually don’t think this makes sense to combine these two. I like to build my own ‘hybrid’ model that collapses the top three layers into an Application layer but allows you to consider the Link and Physical layers separately.
It turns out that a tremendous amount of innovation is required in these bottom four layers to achieve effective 100Gb/s communications networks. The application layer needs to change as well to fully take advantage of 100Gb/s networks. For now we’ll focus on the bottom four layers. Continue reading →
In 1967, Gene Amdahl developed a formula that calculates the overall efficiency of a computer system by analyzing how much of the processing can be parallelized and the amount of parallelization that can be applied in the specific system.
At that time, deeper performance analysis had to take into consideration the efficiency of three main hardware resources that are needed for the computation job: the compute, memory and storage.
On the compute side, efficiency has to be measured by how many threads can run in parallel (which depends on the number of cores). The memory size affects the percentage of IO operation that needs to access the storage, which slows significantly the execution time and the overall system efficiency.
Those three hardware resources worked very well until the beginning of 2000. At that time, the computer industry started to use a grid-computing or as it known today, scale-out systems. The benefits of the scale-out architecture are clear. It enables building systems with higher performance, easy to scale with built-in high availability at a lower cost. However, the efficiency of those systems heavily depend on the performance and the resiliency of the interconnect solution.
The importance of the Interconnect became even bigger in the virtualized data center, where the amount of east west traffic continues to grow (as more parallel work is being done). So, if we want to use Amdahl’s law to analyze the efficiency of the scale-out system, in addition to the three traditional items (compute, memory & storage) the fourth item, which is the Interconnect, has to be considered as well.
Last year, Open Compute Project(OCP) launched a new network project focused on developing operating system agnostic switches to address the need for a highly efficient and cost effective open switch platform. Mellanox Technologies collaborated with Cumulus Networks and the OCP community to define unified and open drivers for the OCP switch hardware platforms. As a result, any software provider can now deliver a networking operating system to the open switch specifications on top of the Open Network Install Environment (ONIE) boot loader.
At the upcoming OCP Summit, Mellanox will present recent technical advances such as loading Net-OS on an x86 system with ONIE, OCP platform control using Linux sysfs calls, full L2 and L3 Open Ethernet Switch API, and also demonstrate Open SwitchX SDK. To support this, Mellanox developed SX1024-OCP, a SwitchX®-2-based TOR switch which supports 48 10GbE SFP+ ports and up to 12 40GbE QSFP ports.
The SX1024-OCP enables non-blocking connectivity within the OCP’s Open Rack and 1.92Tb/s throughput. Alternatively, it can enable 60 10GbE server ports when using QSFP+ to SFP+ breakout cables to increase rack efficiency for less bandwidth demanding applications.
Mellanox also introduced SX1036-OCP, a SwitchX-2-based spine switch, which supports 36 40GbE QSFP ports. The SX1036-OCP enables non-blocking connectivity between the racks. These open source switches are the first switches on the market to support ONIE over x86 dual core processors.