Network Disaggregation – Does your switch have the right packet buffer architecture? (Part 1)

Without considering the nuances of buffer architecture, it is difficult for customers to ascertain whether what they see in the datasheet is what they actually get. This blog presents three important buffer-related questions customers can ask themselves to gain better insight into system-level performance.

I want to cover this topic in two parts. In this first part, I will go over the possible architectural choices for on-chip switch packet buffers and their implications. The second part will be dedicated to off-chip “ultra-deep” packet buffer architectures.

The basic purpose of a packet buffer is to provide three functions:

  1. Flow Isolation: Traffic should be treated fairly. Traffic flows that pass through unrelated ports should not interfere with each other. QoS policies have to be enforced. In the cloud context, we want zero interference between the traffic of different tenants.
  2. Burst Absorption: Buffers must hold packets when there is a temporary speed mismatch. Since it is unlikely that all queues will experience a rate mismatch simultaneously, it is important that the buffer be dynamically shared. In other words, if only a single queue is congested at a point in time, it can grow to occupy a substantial portion of the buffer.
  3. Congestion Management: How are the ECN, PFC and WRED features affected by the architectural choices?


Every other functionality can be mapped to one of these three areas. Ask these three questions to ensure what you see is what you will get.


Question 1: Is the switch buffer a single unit or is it made of multiple fragmented slices?


Figure 1: Packet Buffer Architectural choices – single unit versus split


As one can imagine, a single unit packet buffer is much more efficient than a split buffer architecture. With a single buffer, a congested queue can occupy a substantial portion of the packet buffer. With a 4-way split buffer, the congested queue can occupy 25 percent of the buffer at best (see Figure 1). This leads to very poor burst absorption. Also, queues belonging to the same port can physically reside in different buffer slices, and scheduling across slices often leads to port-level and flow-level unfairness.
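The 25 percent figure can be turned into a rough back-of-the-envelope calculation. The sketch below is only an illustration under simplifying assumptions: the buffer is divided into equal slices, a queue cannot borrow from another slice, nothing drains during the burst, and in the single unit case the queue may use the whole buffer. The function names, the 16 MB buffer size and the 6 MB burst are hypothetical values chosen for the example, not figures from any datasheet.

```python
def queue_capacity(total_buffer_bytes: int, num_slices: int = 1) -> int:
    """Upper bound on what one congested queue can occupy.

    Assumption: the buffer is divided into equal slices and a queue
    cannot borrow space from a slice other than its own.
    """
    if num_slices < 1:
        raise ValueError("num_slices must be at least 1")
    return total_buffer_bytes // num_slices


def packets_dropped(burst_bytes: int, total_buffer_bytes: int,
                    num_slices: int = 1, packet_bytes: int = 1500) -> int:
    """Packets lost when a burst lands on one queue with no draining."""
    capacity = queue_capacity(total_buffer_bytes, num_slices)
    overflow = max(0, burst_bytes - capacity)
    # Round up: a packet that only partly fits is dropped whole.
    return -(-overflow // packet_bytes)


if __name__ == "__main__":
    TOTAL = 16_000_000   # hypothetical 16 MB on-chip buffer
    BURST = 6_000_000    # hypothetical 6 MB burst aimed at one queue
    print("single unit drops:", packets_dropped(BURST, TOTAL, num_slices=1))
    print("4-way split drops:", packets_dropped(BURST, TOTAL, num_slices=4))
```

With these made-up numbers the single unit buffer absorbs the whole burst, while the 4-way split overflows its 4 MB slice even though 12 MB of the chip's buffer sits idle.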


Mellanox Spectrum has a single unit packet buffer which is dynamically shared across all ports. Datasheets for other products just state the total buffer capacity as the aggregate sum of the slices. Dig deeper to find out whether the buffer is a single unit or made of fragmented slices.


Question 2: Can the packet buffer system sustain line rate bandwidth from all ports simultaneously?

Figure 2: Line rate versus oversubscribed packet buffer

A line rate packet buffer not only gives better performance, it also makes the rest of the system simpler and more elegant. With an oversubscribed buffer, packets can be dropped even before the forwarding table lookup or packet classification (see Figure 2). This means the packet drops are indiscriminate and port isolation is not guaranteed. Scheduling packets out of an oversubscribed packet buffer can also be tricky and inefficient. The result is very poor flow isolation and inefficient utilization of bandwidth. Note that a loss-based TCP sender halves its congestion window when it detects a loss, so dropping even a few packets during congestion can have a drastic impact on TCP goodput.
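To get a feel for why a few drops hurt goodput, here is a minimal sketch of additive-increase/multiplicative-decrease window behavior. It assumes a simplified Reno-style sender: the window is counted in segments, grows by one segment per round trip, and is halved once for each round trip in which a loss is detected. Real TCP stacks (slow start, fast recovery, CUBIC and so on) behave differently in detail, and the function name and parameters are hypothetical.

```python
def simulate_aimd(rounds: int, loss_rounds: set[int],
                  start_cwnd: int = 64, max_cwnd: int = 64) -> list[int]:
    """Return the congestion window (in segments) used in each round trip."""
    cwnd = start_cwnd
    history = []
    for rtt in range(rounds):
        history.append(cwnd)
        if rtt in loss_rounds:
            # Multiplicative decrease: halve the window, never below 1.
            cwnd = max(1, cwnd // 2)
        else:
            # Additive increase: one more segment per round trip, capped.
            cwnd = min(max_cwnd, cwnd + 1)
    return history


if __name__ == "__main__":
    clean = simulate_aimd(10, loss_rounds=set())
    lossy = simulate_aimd(10, loss_rounds={2, 3})
    print("no loss :", clean, "segments sent:", sum(clean))
    print("2 losses:", lossy, "segments sent:", sum(lossy))
```

In this toy run, two loss events in consecutive round trips cut the segments sent over ten round trips from 640 to 335, because the window needs many round trips to climb back after each halving.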

Mellanox Spectrum Packet Buffer supports full line rate. Datasheets for other products sometimes just state the total port capacity as the platform throughput. Dig deeper to find out if the packet buffer can sustain the port line rate.


Question 3: How is buffer occupancy accounting done for ECN, WRED, PFC?

Congestion management algorithms are based on “buffer occupancy”. The definition of buffer occupancy is straightforward when the buffer is a single unit. It becomes complex when the buffer is split n ways. For example, if only one slice of a 2-way split buffer is experiencing congestion, the system will have to react as if the entire buffer is congested. This is not only suboptimal; it simply does not work for high performance applications such as Ethernet Storage and Deep Learning. Since the existing dominant on-chip buffer solutions are not working, customers are looking to adopt expensive “ultra-deep” buffer solutions.
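The 2-way split example can be sketched as a simple threshold check. This is an illustration only, not how any particular switch implements ECN, WRED or PFC: real implementations typically account per queue or per port and mark probabilistically between minimum and maximum thresholds. The function names, the 80 percent threshold and the byte counts are assumptions made for the example.

```python
def mark_single(slice_occupancy: list[int], total_capacity: int,
                threshold: float = 0.8) -> bool:
    """Single unit buffer: compare total occupancy with total capacity."""
    return sum(slice_occupancy) / total_capacity >= threshold


def mark_split(slice_occupancy: list[int], total_capacity: int,
               threshold: float = 0.8) -> bool:
    """Split buffer: react as soon as any one slice crosses the threshold.

    Assumption: slices are equal in size and cannot lend space to each other.
    """
    slice_capacity = total_capacity / len(slice_occupancy)
    return any(used / slice_capacity >= threshold for used in slice_occupancy)


if __name__ == "__main__":
    TOTAL = 16_000_000                  # hypothetical 16 MB of buffer
    usage = [7_000_000, 1_000_000]      # one busy half, one nearly idle half
    print("single unit signals congestion:", mark_single(usage, TOTAL))
    print("2-way split signals congestion:", mark_split(usage, TOTAL))
```

Here the same 8 MB of queued traffic is 50 percent of a single unit buffer and triggers nothing, but it is 87.5 percent of one slice in the split design, so senders are told to back off while half the memory is still free.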

Mellanox’s Spectrum Packet Buffer is a single unit, so buffer occupancy calculations are straightforward. Spectrum supports ECN, WRED and PFC with no inefficiencies. Datasheets for other products often just state that they support the congestion management protocols. Again, dig deeper to find out how they do the buffer occupancy accounting, and look for inefficiencies if they have a fragmented buffer.

The Bottom Line

Mellanox Spectrum Open Ethernet Switches have the best on-chip buffer architecture. They support line rate 10GbE/25GbE/50GbE/100GbE speeds. They provide the flow isolation, burst absorption and intelligent congestion management that are critical for Cloud, Storage, Deep Learning and other high performance applications. Additionally, Spectrum is open and disaggregated: today you can run your choice of Cumulus Linux or Mellanox OS operating system on the platform. Mellanox will support even more options going forward.

Up next, in part two, we will discuss ultra-deep packet buffer architectures, so stay tuned.



About Karthik Mandakolathur

Karthik is a Senior Director of Product Marketing at Mellanox. Karthik has been in the networking industry for over 15 years. Before joining Mellanox, he held product management and engineering positions at Cisco, Broadcom and Brocade. He holds multiple U.S. patents in the area of high performance switching architectures. He earned an MBA from The Wharton School, MSEE from Stanford and BSEE from Indian Institute of Technology.
