All posts by David Iles

Who’s Tapping Your Lines and Snooping On Your Apps?

Spoiler alert – it should be you!

No one would argue about the importance of good vision for a surgeon, a welder, or an Uber driver. In technology, whether you’re a Cloud Architect or in Network Operations, you need equally good visibility into what is going on inside your data center. To sleep soundly at night, you have to actively monitor network performance and application performance, and stay on the lookout for security breaches. There are analyzers that specialize in each of these three distinct monitoring disciplines: Network Performance, Application Performance, and Security.

How do you get the right traffic to the right analyzers?

You need to “tap your own lines” by placing TAPs at key points in your network. These TAPs will copy all the data traversing the links they are attached to. Then, you need to aggregate those TAPs, consolidating all the flows onto a few high-bandwidth links to the analyzers. The modern, scaled-out approach for consolidating TAPs is to use a Software Defined TAP Aggregation Fabric, which amounts to a bunch of Ethernet switches that are specialized only in that they don’t run normal Layer 2/3 protocols. Instead, they steer specific flows to specific analyzers.

TAP Aggregation Fabric

You might want the TAP Aggregation fabric to do more than just steer the right flows to the right analyzers. You may want your TAP Aggregators to do some of the following:

  • Filter out unwanted flows – to save bandwidth to the analyzers and increase their utilization
  • Truncate packets – to remove unneeded payload data – especially if your analyzers only look at the packet headers
  • Source tagging – to identify where packets came from by rewriting the MAC address or pushing on a VLAN tag
  • Time-stamping – to identify exactly when packets hit the wire
  • Matching inside tunnels – to forward the right tunneled traffic to the right analyzer, while preserving the MPLS or VXLAN tunnel headers
  • Centralized management – to configure all the TAP Aggregation switches from a single control point. The per-flow filtering and forwarding rules can be configured in a number of ways, but most people like to use an OpenFlow controller, which is almost purpose-built for this type of application. An added bonus is that it makes automation super easy, since the individual switches’ configs are dead simple (see the sketch after this list).
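
For a concrete flavor, here is a minimal sketch of one such per-flow steering rule, written against the open source Ryu OpenFlow controller framework. The port numbers and the HTTPS match are illustrative assumptions, not a prescription:

    # Minimal Ryu app: when a TAP aggregation switch connects, install a
    # rule steering HTTPS traffic from the TAP on port 1 to the analyzer
    # on port 48. Ports and match fields are illustrative assumptions.
    from ryu.base import app_manager
    from ryu.controller import ofp_event
    from ryu.controller.handler import CONFIG_DISPATCHER, set_ev_cls
    from ryu.ofproto import ofproto_v1_3

    class TapSteering(app_manager.RyuApp):
        OFP_VERSIONS = [ofproto_v1_3.OFP_VERSION]

        @set_ev_cls(ofp_event.EventOFPSwitchFeatures, CONFIG_DISPATCHER)
        def on_switch_connect(self, ev):
            dp = ev.msg.datapath
            parser = dp.ofproto_parser
            # Match IPv4/TCP traffic to port 443 arriving from the TAP.
            match = parser.OFPMatch(in_port=1, eth_type=0x0800,
                                    ip_proto=6, tcp_dst=443)
            actions = [parser.OFPActionOutput(48)]  # analyzer-facing port
            inst = [parser.OFPInstructionActions(
                dp.ofproto.OFPIT_APPLY_ACTIONS, actions)]
            dp.send_msg(parser.OFPFlowMod(datapath=dp, priority=100,
                                          match=match, instructions=inst))

Because the switch itself holds nothing but a table of match/action rules like this one, its config stays trivial no matter how many analyzers you add.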

Where do you TAP your network?

There is no universal consensus on where to place your TAPs, but there are some very common models:

Financial Services organizations frequently TAP every tier of their network, so they can measure latency as packets traverse the network while also implementing security monitoring.

Many Cloud Providers TAP every rack in their data centers for their own monitoring purposes, as well as offering Application Performance reports to their customers.

How do you know what traffic to send for analysis?

If you have ever enabled too many debug features on a Cisco/Arista switch, you are rightfully a bit cautious. (Friendly advice: don’t do it unless you also want a switch reboot.)

TAP Aggregation switches are the ideal place to implement heavy duty Telemetry features because they cannot impact your production network.

One technique for determining which flows need to be analyzed is to start monitoring your traffic with sFlow. sFlow can give you a picture of the busiest flows, top talkers, top protocols, flow counts, and various traffic anomalies. It can help you detect and diagnose network problems, and it can provide a glimpse into which applications are using the network most. It can also show you when something changes and point out which flows should be sent on for further analysis.

Some of the best monitoring, analytics, and graphing tools are Open Source. Recently, folks have been well served by sending their sFlow data to sFlow-RT for analysis and then monitoring the state of their data center with Grafana.
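
To make that concrete, here is a short Python sketch of pulling top-talker data out of sFlow-RT’s REST API. The endpoint paths and port are sFlow-RT’s documented defaults, but the flow name and field choices here are just examples:

    # Define a "flow" in sFlow-RT (source/destination IP pair, measured
    # in bytes), then poll for the five busiest flows across all agents.
    import requests

    RT = "http://localhost:8008"  # sFlow-RT's default REST endpoint

    requests.put(f"{RT}/flow/toptalkers/json",
                 json={"keys": "ipsource,ipdestination", "value": "bytes"})

    top = requests.get(f"{RT}/activeflows/ALL/toptalkers/json",
                       params={"maxFlows": 5, "aggMode": "max"}).json()
    for flow in top:
        print(flow["key"], f"{flow['value']:.0f} bytes/s")

A loop like this, fed into Grafana or a simple alert script, is often all it takes to decide which flows deserve a dedicated analyzer.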

What to look out for when considering TAP aggregation solutions

  • Go with an open multi-vendor solution – don’t get locked into a proprietary, one-of-a-kind closed solution. In the data center business, we call these “Unicorns” because they are single-vendor focused, single-vendor sourced, and cannot be easily replaced. Beware – Unicorns are expensive!

  • Be sure to make “apples to apples” cost comparisons. Don’t just look at the switch hardware costs, but also look at the per-switch licensing and controller costs
  • Consider best-of-breed Open Source Tools which were developed for hyperscale data centers and scale better than expensive vendor-specific solutions
  • TAPs are preferred to SPAN as some switches are not able to mirror every packet.
  • Make sure your TAP Aggregation switches have sufficient packet rates (PPS) to be able to forward every packet sent by the TAPs

 


How the Space Race for Data Centers Helps Everyone

A lot of the household products that we take for granted and use every day were created as by-products of the space race. Well-known products that NASA claims as spin-offs include memory foam (originally named “temper foam”), freeze-dried food, firefighting equipment, emergency “space blankets,” Dustbusters, and cochlear implants, to name just a few. Each was created out of necessity to further the space race. These happy side-products were only possible because of the massive government investments involved, and now they benefit our everyday lives. Interestingly, NASA didn’t invent Velcro, Tang, or Teflon, but as of 2012, NASA claimed nearly 1,800 spin-off products in the fields of computer technology, environment and agriculture, health and medicine, public safety, transportation, recreation, and industrial productivity. And everyone knows that Star Trek was indirectly responsible for inspiring cell phone technology. Sorry, but who can resist? Beam me up, Scotty.

Similarly, there is a sort of Internet “space race” of data centers that has been quietly underway for years now. As the hyperscalers have built increasingly massive data centers to serve the needs and scale of Internet users, they have, out of necessity, also created a number of innovations that are applicable to server deployments of every size. Well-known Webscale IT innovations include MapReduce/Hadoop, Mesos (Borg), and containerization. These happy side-projects were only possible because of massive hyperscale investments, and now they benefit our everyday data center lives.

Just as consumers don’t need to work at NASA or on Star Trek to appreciate Dustbusters or their beloved cell phones, IT professionals don’t need to run a social media giant to appreciate network automation. All data centers can benefit from automation. They also benefit from higher speed networks.

Cloud computing and 25/100GbE

Cloud computing is constantly evolving. In 2012, cloud-based servers demanded 10 Gigabit Ethernet connectivity because the 1 Gigabit NICs so common at that time were hindering performance. Fast forward to 2016, and 10 Gigabit Ethernet can now be a bottleneck for modern server platforms, which have leapt forward in performance and in the number of CPU cores and VMs they can support. Cloud Service Providers are leveraging these new server platforms to increase VM density per server, with a corresponding increase in their profits.

These modern servers, with their higher core counts, higher VM density, and flash-based storage, are now bottlenecked by 10GbE connections and need high speed 25 Gigabit Ethernet connectivity. So too, the data center switch interconnects are moving from 40 Gigabit Ethernet to a technology with a similar cost structure but 2.5 times the bandwidth: 100 Gigabit Ethernet.

Cloud Computing is not just about speeds and feeds or about where workloads are located. Cloud computing requires a scalable provisioning framework that is automatic in nature. The best practices for network automation developed for the largest data centers in the world apply to every cloud based network.

Get up to speed on Cloud innovations, now!

Join my upcoming webinar, 25/100GbE and Network Automation for the Cloud, with Dinesh Dutt from Cumulus, where we will discuss the tips and tricks of automating a data center as well as the Webscale data-plane innovations that drive server bandwidth to 25GbE, including OVS offload, RDMA, and VXLAN acceleration. I promise to keep my space references to a minimum.

See you on September 14, 2016 at 10:00 a.m. PT!


Why We Decided To Partner with Cumulus Networks

It is no secret that cloud computing has changed the IT infrastructure model forever. It has transformed the landscape of the data center, including the way data centers are used, how they are designed, and how they are managed. The cloud has altered the buying behavior of the Enterprise, with every new project now going through a “buy vs. lease” evaluation to decide whether to build the infrastructure in-house or to deploy in the cloud. A side-effect of this new model is that a subtle shift in the expectations of data center admins has quietly taken hold. Whether we are talking about public clouds, private clouds, or a hybrid of the two, those managing the infrastructure have grown to expect a plug-and-play experience from their data center. Time was, when you added a peripheral to your PC, you needed to manually configure a slew of settings. That was then and this is now. Today we expect things to just work ‘automagically’. This auto-provisioning mentality is moving to the data center, where there is now an expectation that when you add an application, virtual machine, or container, or deploy a Hadoop cluster, you get the same plug-and-play experience with the entire data center as you do with a laptop. As new workloads are deployed, the servers are automatically provisioned; so too must the storage, firewall rules, load balancers, and physical network infrastructure be automatically provisioned.

The automated provisioning and monitoring practices first developed for Web Scale IT have permeated even the smallest data center footprints. Whether your data center footprint covers two football fields or two rack units with a hyperconverged solution, people desire the force-multiplying benefits of automation. Think about it: when you have hundreds or thousands of servers (virtual or physical) and you need to change some security setting, or change where the SYSLOG alerts are sent, you use a tool like Puppet or Ansible to update all the server endpoints with a single command. Folks are now accustomed to making mass configuration changes in an automated, scriptable manner, for all of the data center infrastructure, not just servers. This shift becomes a significant OPEX differentiator for Cloud providers and larger enterprises that measure manually entered CLI key-strokes in terms of headcount.

At Mellanox, we are continually adding new automation features to our home-grown Network Operating System, MLNX-OS, with support for OpenStack, Puppet, NEO, REST, Neutron, and more every year. We have not stopped investing in MLNX-OS, which offers an industry-standard interface familiar to most networking professionals. However, there is a growing class of customers who are not satisfied with this approach, a flourishing rank of digerati who have embraced the DevOps approach and now treat their infrastructure as code. These technology boat-rockers started by adding network functions to their Linux servers and now want their Ethernet switches to offer the same programmable Linux interface as their servers. They have figured out that, because it is Linux, they can load their own applications on their switches just like they do on servers. If they ever need some network visibility feature that didn’t come with the switch, they can write a simple script to monitor the particular counter they are interested in and have the switch automatically send an alert when appropriate.
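
As a flavor of what such a home-grown script might look like, here is a minimal sketch that polls a standard Linux kernel interface counter and raises a syslog alert when it spikes. The interface name, counter, and threshold are illustrative assumptions:

    # Poll a switch port's rx_errors counter via the standard Linux sysfs
    # interface and log a warning when it jumps. swp1 is a Cumulus-style
    # port name; the threshold and interval are arbitrary examples.
    import syslog
    import time

    IFACE = "swp1"
    THRESHOLD = 100  # max new errors tolerated per interval
    INTERVAL = 10    # seconds between polls

    def read_counter(iface, counter="rx_errors"):
        with open(f"/sys/class/net/{iface}/statistics/{counter}") as f:
            return int(f.read())

    last = read_counter(IFACE)
    while True:
        time.sleep(INTERVAL)
        now = read_counter(IFACE)
        if now - last > THRESHOLD:
            syslog.syslog(syslog.LOG_WARNING,
                          f"{IFACE}: rx_errors jumped by {now - last}")
        last = now

Because the switch runs ordinary Linux, this is the same script you would run on a server; nothing switch-specific is needed beyond the interface name.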

 


The Mellanox leadership team thoughtfully considered whom to partner with in order to create the best solution for this new market, and we found there was one clear choice: Cumulus Networks. Cumulus Networks is *the* leader in automating the network. Besides offering a native Linux interface that enables a switch to behave exactly like a Linux server, they have already integrated with every major Cloud Orchestration solution, including VMware EVO-SDDC, Nutanix, and OpenStack, as well as Network Overlays like NSX, Nuage, PLUMgrid, and Midokura. They offer native support for server automation tools like Ansible, SaltStack, Chef, Puppet, and CFEngine. Beyond that, Cumulus has enhanced networking in Linux with the purpose of streamlining the provisioning of switches by reducing the number of unique network configuration parameters needed per switch. In many cases, all the Cumulus Linux switches in a data center POD will have nearly identical configurations, with the only difference being the loopback address.
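
As a toy illustration of why that matters for automation, a single template plus a one-line-per-switch inventory is enough to render every config in the POD. The template and addresses below are hypothetical, not Cumulus’s actual configuration:

    # Render a near-identical ifupdown loopback stanza for every switch;
    # the loopback address is the only per-switch parameter.
    from jinja2 import Template

    template = Template(
        "auto lo\n"
        "iface lo inet loopback\n"
        "    address {{ loopback }}/32\n"
    )

    # Hypothetical inventory: switch name -> its one unique parameter.
    switches = {"leaf01": "10.0.0.11", "leaf02": "10.0.0.12",
                "spine01": "10.0.0.21"}

    for name, loopback in switches.items():
        print(f"# --- {name} ---")
        print(template.render(loopback=loopback))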


A key benefit of offering a third-party Operating System is that it allows Mellanox to compete with Broadcom-based switches in “apples to apples” comparison tests in a way that highlights the hardware performance differences. Testing two switches with the same OS is like the old Pepsi Challenge, in that it removes testing bias and shows how much better one hardware platform is than the other. We relish the opportunity to compete, especially when performance is the yardstick. When it comes to 100GbE-capable switches, our Spectrum switch is the clear performance leader, as documented by The Tolly Group here, and recorded in a webinar here: http://www.mellanox.com/webinars/2016/tolly-performance-report/


At Mellanox, we have been investing in Open Ethernet for many years. We contributed multiple Ethernet switch designs to the Open Compute Project (OCP). We open-sourced our Multi-chassis Link Aggregation (MLAG) solution and contributed the code to the community. We spearheaded the Switch Abstraction Interface (SAI), which aims to make it easy to port Network Operating Systems to different switch ASICs from any vendor. We are a founding member of the OpenSwitch Linux Foundation project, and we are leading the Open Optics initiative, which is aimed at unlocking 100G, 400G, and higher-speed technologies. This partnership with Cumulus is the logical culmination of that effort.

Albert Einstein famously conducted “thought experiments” to consider new theories and ideas. I would challenge you to a different kind of thought experiment: think about what you could do if your switches were as easy to automate as your servers. But you can do more than just thought experiments. If you are a hands-on kind of person with a penchant for Linux, do yourself a favor and download Cumulus VX, a fully featured software-only version of Cumulus Linux that is free and runs as a virtual machine. Build a virtual network of five or six routers inside your laptop and see how well it works with your favorite server configuration management tool. Then you will experience, first-hand, why Mellanox decided to partner with Cumulus Networks.