All posts by Amit Katz

The Top 7 Data Center Networking Misconceptions

  1. “Adding more 40GbE links is less expensive than using 100GbE links”
     In fact, the 40GbE price is roughly 2X the 10GbE price, and likewise 2 x 40GbE ports are more expensive than a single 100GbE port.
  2. Vendor C: “You get the best price in the entire country, more than 80% discount”
     Really? I have met many customers proud of getting an 80% discount. It is worth doing an apples-to-apples comparison that includes 3-year OpEx, licenses, support, transceivers…
  3. “I wish I could move to white box, but I don’t have tons of developers like Google and Facebook have…”
     There is no need for developers. Try these solutions and you’ll see: automation is built into the product and is very easy to deploy.
  4. “L2 is simple, L3 is complicated and expensive”
     STP? PVRST? Root Guard? BPDU Guard? mLAG? Broadcast storms? In fact, there is a huge amount of complexity in building a reliable and scalable L2 LAN.

Much of this complexity disappears in an L3 environment, because BGP/ECMP is very simple to use and debug, especially with the right automation. And the price is the same when buying an L2/3 switch from the right switch vendors.
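As one hedged illustration of that point, here is a minimal Python sketch that renders an FRR-style BGP stanza for a leaf switch using BGP unnumbered peering, so each uplink needs a single configuration line and no per-link IP addressing. The ASNs and swp interface names are hypothetical, not taken from any deployment in this post.

```python
# Hypothetical sketch: render an FRR-style BGP stanza for one leaf switch
# in an L3 leaf-spine fabric. ASNs and interface names are illustrative.

def bgp_config(leaf_asn, spine_asn, uplinks):
    """Return a BGP stanza using unnumbered peering on each uplink."""
    lines = [f"router bgp {leaf_asn}"]
    for iface in uplinks:
        # BGP unnumbered: peer over the interface itself,
        # no per-link IP address planning required.
        lines.append(f"  neighbor {iface} interface remote-as {spine_asn}")
    lines.append("  address-family ipv4 unicast")
    lines.append("    redistribute connected")  # advertise the rack's subnets
    return "\n".join(lines)

print(bgp_config(65101, 65100, ["swp51", "swp52"]))
```

With one such stanza per switch, generated from a small inventory, “the right automation” amounts to little more than a loop over the switch list.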

  5. “Nobody ever got fired for buying X”
     No one ever got promoted for it either. Once upon a time some of these brands really were better; not anymore. It may seem like the safe bet, but paying more for less makes your company less competitive, and could jeopardize its very future…
  6. “You can automate servers, storage, billing systems, order systems, and pretty much everything in the infrastructure except the network”
     Today’s networks can be easily and fully automated using standard tools, integrated with compute and storage, and monitored using commercial and open source tools. Check out this short video to see how simple automation can be.
  7. “Telemetry is such a special feature that you must buy a special license ($) to enable it”
     Why should you pay extra for the switch to give you real-time visibility, while regular counters are free? The same question applies to ZTP, VXLAN, EVPN, Tap Aggregation, or BGP. Ask your vendor: what makes feature X so much more complicated that it justifies an extra charge? And why isn’t it standard on a switch that costs over $10K?

Why are Baidu, Tencent, Medallia, Sys11 and others using Mellanox Spectrum?

In 2016, we saw a significant shift in our business. We started seeing our Ethernet switches being deployed in some of the largest data centers of the world. I’d like to share with you some of those that I can talk about.

Here at Mellanox, we have seen a lot of traction lately in the areas of analytics and “something as a service,” which people typically refer to as cloud.

Data Analytics

Baidu, Tencent, and many others have started investing in 25/100GbE infrastructure for their analytics and machine learning deployments. After doing their homework and running many in-depth performance benchmarks, they identified the Mellanox NIC as the key to running these performance-hungry workloads, simply because RoCE provides significant application performance benefits. Earlier this year, Tencent participated in the TeraSort Benchmark’s Annual Global Computing Competition. They utilized the Mellanox Ethernet NIC + switch solution as a key component, which enabled them to achieve impressive improvements over last year’s results.

Baidu experienced similar benefits when they adopted our Ethernet technology. When asking why they chose Mellanox they told us they view the Mellanox solution as the most efficient platform on the market for their applications.

When iFLYTEK from China decided they needed 25GbE server connectivity, they chose the Mellanox Spectrum SN2410 as their ToR, running 25GbE to the servers with 100GbE uplinks to the Spectrum SN2700. They told us our solution enabled them to leverage the scalability of our Ethernet portfolio and grow their compute and storage in the most efficient manner.

These organizations have seen the value of a tested, validated, high-performance end-to-end 100GbE Mellanox solution. Even more importantly, they have done the math, and every single one of them came to the same clear and inevitable conclusion.

Cloud – Anything as a Service

So, let’s talk about Cloud, and review why people choose Mellanox Spectrum to run their businesses.

I’d start by noting that all of these “as a service” solutions are typically driven by efficiency and ROI. When people build them, they typically “do the math,” because the entire business model is based on cost efficiency, taking both OpEx and CapEx into account.

Sys11 is one of the largest clouds in Germany and they needed a fully automated and very efficient cloud infrastructure.

Harald Wagener, Sys11 CTO, told us that they chose Mellanox switches because it allowed them to fully automate their state-of-the-art Cloud data center. He said they also enjoyed the cost effectiveness of our solution which allowed Sys11 to leverage the industry’s best bandwidth with the flexibility of the OpenStack open architecture.

Sys11 was one of our first deployments with Cumulus, but instead of using the more standard Quagga routing stack, they decided to use Bird. Initially we were a little concerned, because that is not what everyone else does, but then we tested it with some 35K routes and it worked like a charm. We used the SN2410 at 25GbE as the ToR and the SN2700 running 100GbE as the spine switch.

One of the most interesting recent deployments has been with Medallia, a great example of a SaaS company that is growing fast and needed a fully automated solution that could effectively drive its various data center needs, such as high-speed Ceph (50GbE), short-lived containers, and scale. They wanted IP mobility without the hassle of tunnels, which they got by adopting a fully routed network, all the way to the server.

Medallia deployed a Spectrum-Cumulus solution, running 50GbE to the servers with 100GbE uplinks, to replace their all-40GbE network. With the all-40GbE network, they needed 2 ToR switches and 1 spine switch per server rack. When they moved to 50/100GbE, they reduced the number of switches needed per rack by a whopping 50 percent.

What’s really cool about Medallia is how open-minded they are: while looking for the right solution, they made their vendor decision only after they had chosen Cumulus Linux, and only then picked the vendor that provided the best ROI. It’s another great example of people who “did the math,” rather than following incumbent vendors, who typically focus on confusing customers so that they will not “do the math.” So, what’s next?

Here’s my view of what’s coming in 2017:

The first change will impact latency-sensitive High Frequency Trading (HFT) environments. Around the world, 10G server connections are being replaced with 25 and 50GbE, because 10G can bottleneck the performance of the fast new servers (Intel Skylake) and their high-bandwidth flash-based (NVMe) storage. And with HFT, the bandwidth for market data increases every year – the OPRA options feed from NYSE recently crossed the 12Gbps barrier.

There is currently a gap in the HFT switch market, because the low latency switches from Cisco and Arista are capped at 10GbE and there is no silicon available for them to build new low latency switches. Their over-10GbE switches have average latencies roughly ten times higher than their 10G switches, which makes them irrelevant for trading.

Mellanox has the solution for the HFT market: we built a super low latency switch for these 25/50/100G connection speeds, with 10-20 times lower latency than Cisco and Arista’s new (25-100GbE) switches. We have also added a suite of new features important to HFT shops: PIM, hardware-based telemetry on packet rates, buffer usage and congestion alerts, fine-tuned buffer controls, and slow-receiver multicast protection.

So, we expect another busy and successful year, making sure organizations are not bottlenecked by their networks and most importantly – we all do the math!!!

100GbE Switches – Have You Done The Math?

100GbE switches – sounds futuristic? Not really, 100GbE is here and being deployed by those who do the math…
100GbE is not just about performance; it is about saving money. For many years, the storage market has been “doing the math”: $/IOPS is a very common metric used to measure storage efficiency and drive buying decisions. Ethernet switches are no different: when designing your network, $/GbE is the way to measure efficiency.
While more performance is always better, 100GbE is also about using fewer components to achieve better data center efficiency, in both CapEx and OpEx. Whether a server should run 10, 25, 50, or 100GbE is a performance question, but for switch-to-switch links, 100GbE simply means better return on investment!
Building a 100GbE switch doesn’t cost 2.5X what building a 40GbE switch costs, and in today’s competitive market, vendors can no longer charge exorbitant prices for their switches. Those days are over.
With 25GbE being adopted on more servers, simply to get more out of the server you’ve already paid for, 100GbE is the natural way to connect the switches.


Today, when people do the math, they minimize the number of links between switches by using 100GbE. When a very large POD (Performance Optimized Datacenter) is needed, we sometimes see 50GbE used as the uplink speed to increase the spine switch fan-out, and thus the number of servers connected to the same POD. In other cases, people simply use the fastest speed available: it used to be 40GbE, and today it is 100GbE.
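The fan-out trade-off can be sketched with simple arithmetic; the port and server counts below are illustrative assumptions, not figures from the deployments discussed here.

```python
# Illustrative POD sizing: splitting a spine's 100GbE ports into 2 x 50GbE
# doubles the number of leaf switches (and thus server ports) one spine
# layer can reach. All port counts here are hypothetical.

def pod_capacity(spine_ports, split, server_ports_per_leaf):
    """Return (leaf switches per spine, server ports per POD)."""
    leaves = spine_ports * split
    return leaves, leaves * server_ports_per_leaf

# A 32-port spine with 48-port leaves: 100GbE uplinks vs 2 x 50GbE uplinks.
for split in (1, 2):
    leaves, servers = pod_capacity(32, split, 48)
    print(f"{split} x uplink split: {leaves} leaves, {servers} server ports")
```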
Who are these customers migrating to 100GbE? They are the ones who consider data center efficiency vital to the success of their business. A few examples:
Medallia recently deployed 32 x Mellanox SN2700 running Cumulus Linux. Thorvald Natvig, Medallia’s lead architect, told us that the math is simply about cost effectiveness, especially when the switches are deployed with zero touch and run simple L3 protocols, eliminating the old-fashioned complications of STP and other unnecessary protocols. QoS? It’s needed when the pipes are too small, not when 100GbE delivers enough bandwidth from each rack. Buffers? Scale? The Mellanox Spectrum ASIC provides everything a data center needs, today and tomorrow.
University of Cambridge has also done the math, and selected the Mellanox end-to-end Ethernet interconnect solution, including Spectrum SN2700 Ethernet switches, for its OpenStack-based scientific research cloud. Why? 100GbE is there to unleash the capabilities of the NexentaEdge Software Defined Storage solution, which can easily stress a 10/40GbE network.
Enter has been running Mellanox Ethernet switches for a few years now. 100GbE is coming soon: Enter will deploy Mellanox Spectrum SN2700 switches with Cumulus Linux because they did the math! As a cloud service provider, Enter cannot get lazy and wait for 100GbE to be everywhere before adopting it. Waiting means losing money. In today’s competitive world, standing still is like walking backwards; 100GbE is here, it works, and it is priced right!
Cloudalize was about to deploy a 10/40GbE solution. After they did the math, they went directly to 100GbE with Mellanox Spectrum SN2700 running Cumulus Linux.

To summarize: if your Data Center efficiency is important for your business, it is time to do the math:
1. Check the cost of any 10/40/100GbE solution vs. Mellanox Spectrum 100GbE. The cost must include all components: cables, support, licenses (there are no additional licenses with Mellanox).
2. Please note that even when 10GbE on the server is enough, 100GbE uplinks still make sense.
3. A breakout cable always costs less than 4 single-speed cables.
4. Pay attention to hidden costs (feature licenses, extra support…).
5. Consider what it is worth to be free, with 100% standard protocols and none of the “vendor specific” protocols – a nicer way of saying “proprietary”.
6. If 100GbE turns out to be more cost effective, it is time to review the differences between the various 100GbE switch solutions on the market; the following performance analysis provides a pretty good view of the available options.
7. How much money do you spend on QoS, vs. the alternative of throwing bandwidth at the problem?
8. $/GbE is the best way to measure network efficiency.
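To make the $/GbE metric concrete, here is a small sketch of the calculation. Every price in it is a made-up placeholder; substitute real quotes that include cables, licenses, and support, as the checklist says.

```python
# $/GbE comparison sketch. All prices are hypothetical placeholders, not
# real quotes; the point is the metric, not the specific numbers.

def dollars_per_gbe(total_cost, ports, gbe_per_port):
    """Total solution cost divided by total switching bandwidth in GbE."""
    return total_cost / (ports * gbe_per_port)

# Hypothetical 32-port switches: a 40GbE box vs a 100GbE box.
cost_40 = dollars_per_gbe(15_000, 32, 40)    # ≈ $11.72 per GbE
cost_100 = dollars_per_gbe(25_000, 32, 100)  # ≈ $7.81 per GbE
print(f"40GbE: ${cost_40:.2f}/GbE   100GbE: ${cost_100:.2f}/GbE")
```

With these placeholder prices, the 100GbE box costs more per unit but delivers cheaper bandwidth: exactly the kind of result “doing the math” is meant to surface.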
Feel free to contact me; I would be happy to help you “do the math” and compare any 10/40/100GbE solution to Mellanox Spectrum.

What Happened to the Good Old RFC2544?

Compromising on the basics has resulted in broken data centers…

After the Spectrum vs. Tomahawk Tolly report was published, people asked me:

“Why was this great report commissioned to Tolly? Isn’t there an industry benchmark where multiple switch vendors participate?”

So, the simple answer is: No, unfortunately there isn’t…

Up until about 3 years ago, Nick Lippis and Network World ran an “Industry Performance Benchmark”.

These reports were conducted by a neutral third party, and different switch vendors used to participate and publish reports showing how great their switches were, how they passed RFC 2544, 2889 and 3918, etc.


Time to check which switch you plan to use in your data center!!!

Since the release of Trident2, which failed to pass the very basic RFC 2544 (it lost 19.8% of packets when tested with small packets), these industry reports seem to have vanished. It is as if no one wants to run an RFC 2544 benchmark anymore – no wonder, when the tests are all failing.

The questions you really need to ask are the following:

  • Why is it that RFC 2544, which was established to test switches and verify they don’t lose packets, is all of a sudden being “forgotten”?
  • Is the Ethernet community lowering its standards because it has become too hard to keep up with the technology?
  • Has it become difficult to build 40GbE and 100GbE switches running at wire speed for all packet sizes and based on modern, true cut-through technology?

The answer to all these questions is simple: RFC 2544 is as important as ever and is still the best way to test a data center switch. Sure, it is hard to build a state-of-the-art switch, which is exactly why RFC 2544 matters now more than ever: there are more small packets in the network (request packets, control packets, cellular messages, SYN attacks…), and zero packet loss was and still is essential for your Ethernet switches.

Here is how Arista defined RFC 2544 before abandoning it:

“RFC 2544 is the industry leading network device benchmarking test specification since 1999, established by the Internet Engineering Task Force (IETF). The standard outlines methodologies to evaluate the performance of network devices using throughput, latency and frame loss. Results of the test provide performance metrics for the Device Under Test (DUT). The test defines bi-directional traffic flows with varying frame size to simulate real world traffic conditions.”

And indeed, the older Fulcrum-based 10GbE switch passed these tests.

A simple web search will provide you with numerous articles defining the importance of running RFC 2544 before choosing a switch.
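At its heart, the throughput portion of RFC 2544 is a binary search for the highest offered load at which the device under test drops zero frames. Here is a minimal sketch of that search logic; the traffic generator is simulated, and the 80% loss threshold it models is an arbitrary assumption for illustration.

```python
# Minimal sketch of the RFC 2544 throughput procedure: binary-search for the
# highest offered load (percent of line rate) with zero frame loss.
# send_burst() is a stand-in for a real traffic generator; the 80% loss
# threshold it models is an arbitrary assumption.

def send_burst(load_pct):
    """Return the number of frames lost at a given offered load (simulated)."""
    return 0 if load_pct <= 80.0 else 1

def rfc2544_throughput(trial, lo=0.0, hi=100.0, resolution=0.1):
    """Highest zero-loss load, found to within `resolution` percent."""
    while hi - lo > resolution:
        mid = (lo + hi) / 2
        if trial(mid) == 0:   # zero loss at this load: search higher
            lo = mid
        else:                 # any loss fails the trial: back off
            hi = mid
    return lo

print(f"Zero-loss throughput: {rfc2544_throughput(send_burst):.1f}% of line rate")
```

A real RFC 2544 run repeats this search for each standard frame size (64 through 1518 bytes) with long trials per step; the sketch only captures the search itself.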

While working on this blog, I ran into a performance report, sponsored by one of the big server vendors, for a switch using the Broadcom Tomahawk ASIC. They worked hard to make a failed RFC 2544 look okay: with a very specific port connectivity and specific packet sizes, RFC 2544 failed only with 64-byte packets using 8 ports (out of 32 ports), and even the mesh test passed. What is a data center customer to conclude from this report? That one should buy a 32-port switch and use only 8 ports? Sponsoring such a report clearly means that RFC 2544, 2889, and 3918 are still important when making a switch decision. I definitely agree: these tests were established to help customers buy the best switches for their data centers.

So, how has the decline in RFC 2544 testing resulted in unfair clouds?

Not surprisingly, once the market accepted the packet loss first introduced with Trident2, things have not improved. In fact, they have gotten worse.

Building a 100GbE switch is harder than building a 40GbE switch, and the compromises are growing worse. The 19.8% switch packet loss has soared to 30%, and the sizes of the packets being lost have increased.

Moreover, a single switch ASIC is now comprised of multiple cores, which brings a new type of compromise. When an ASIC is built out of multiple cores, not all ports are equal. What does this actually mean? It means that nothing is predictable any longer: the behavior depends on which ports are allocated to which buffers (yes, it is no longer a single shared buffer), and on the hardware layout, which defines which ports are assigned to which switch cores. To make it simple: 2 users connected to 2 different ports do not get the same bandwidth… For more details, read the Spectrum Tolly report.

The latest Broadcom-based switch Tolly report was released three weeks after the original Tolly report was issued. It attempted to “answer” the RFC 2544 failure, but nowhere did it refute the fairness issue. It is hard to explain why 2 ports connected to the same switch provide different service levels. In one test, the results showed 3% vs. 50% of the available bandwidth: one customer who is very happy, and another who is very unhappy. But that is only true if the customers know the facts, right? Has anyone told the unhappy customer that the SLA is broken? Probably not.

Bottom line:

Has compromising on the basics truly benefitted end customers and proven worthwhile? Are they really happy with the additional, worsening compromises made to build switches on the faster switch ASICs, which, as all can see, are undergoing multiple revisions and at the end of the day yielding delayed, compromised, packet-losing 100GbE data centers? One should think not!


Mellanox Spectrum™ runs at line rate at all packet sizes. The solution supports true cut through switching; has a single shared buffer; consists of a single, symmetrically balanced switch core; provides the world’s lowest power; and runs MLNX-OS® and Cumulus Linux, with more network operating systems coming…

So, stop compromising and get Mellanox Spectrum for your data center today!!!


A Final Word About Latency

Note that this report also uses a methodology to measure latency that is unusual at best, and bordering on deceptive. It is standard industry practice to measure latency from the first bit into the switch to the first bit out (FIFO). By contrast, here they took the unusual approach of using a last-in, first-out (LIFO) latency measurement methodology. Using LIFO measurements has the effect of dramatically reducing reported latencies, but unlike the normal FIFO measurements, the results are not particularly enlightening or useful. For example, you cannot simply add latencies to get the result through a multi-hop environment. Additionally, for a true cut-through switch such as the Mellanox Spectrum, using LIFO measurements would actually produce negative latency figures – which clearly makes no sense. The only reason to use these non-standard LIFO measurements is to obscure the penalty caused by switches that cannot perform cut-through switching, and to reduce the otherwise very large reported latencies that result from store-and-forward switching.
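The FIFO-versus-LIFO point can be checked with simple arithmetic: a LIFO measurement equals the FIFO latency minus the frame’s serialization time. The FIFO latency figures below are illustrative assumptions; only the serialization math is exact.

```python
# Worked arithmetic for the FIFO vs LIFO point. LIFO latency equals FIFO
# latency minus the frame's serialization time, so it hides store-and-forward
# delay and goes negative for cut-through switches. The FIFO latencies used
# below (600 ns and 100 ns) are illustrative, not measured values.

def serialization_ns(frame_bytes, gbps):
    """Time to clock a frame onto the wire: bits / (Gbit/s) = nanoseconds."""
    return frame_bytes * 8 / gbps

def lifo_from_fifo(fifo_ns, frame_bytes, gbps):
    """What a LIFO methodology would report for a given FIFO latency."""
    return fifo_ns - serialization_ns(frame_bytes, gbps)

frame, rate = 1518, 100  # max standard Ethernet frame at 100GbE
print(serialization_ns(frame, rate))       # 121.44 ns just to receive the frame

# Store-and-forward (assumed 600 ns FIFO): LIFO hides the serialization delay.
print(lifo_from_fifo(600.0, frame, rate))  # reports ~121 ns less than FIFO

# Cut-through (assumed 100 ns FIFO): forwarding starts before the frame is
# fully received, so FIFO latency is below serialization time and LIFO
# goes negative, which is physically meaningless.
print(lifo_from_fifo(100.0, frame, rate))
```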