All posts by Weijia Song

About Weijia Song

Dr. Weijia Song is the lead FFFS developer working as a post-doctoral student at Cornell University in the Department of Computer Science, in a group headed by Professor Ken Birman (http://www.cs.cornell.edu/ken). He also has roles in Hakim Weatherspoon’s advanced cloud computing group, where he works on his Super Cloud system. PhD student Theo Gkountouvas collaborated closely with Dr. Song on the FFFS data structure design and on extending FFFS to use an optimal consistency-preserving data representation.

Experience with the Freeze Frame File System (FFFS) on Mellanox RDMA

Everyone is talking about the Internet of Things (IoT), but fast, scalable IoT solutions can be difficult to create.  IoT applications often require real-time analysis on data streams, but most software is designed to just read data from existing files.  As a result, the developer may be forced to create entirely new applications that read data continuously.  Developers who prefer to just build scripts that build new solutions from existing analytic tools would find this frustrating and time-consuming.

In fact, this problem has been troublesome to computing professionals for decades.  There is an enormous variety of prebuilt software that can do almost any imaginable computation provided that the input is in files (for example tables represented in comma-separated value format, or files that list a series of values from some kind of sensor, or files written one per input, such as photos captured from an image sensor).  In contrast, there is little prebuilt software for incremental data feeds (settings in which programs need to run continuously and accept one new input at a time).

This is why so many developers prefer to store data into files, then carry out any needed analysis on the files.  But here we run into another issue: in today’s systems, the delay (latency) of the store-then-compute style of analysis is often very high.

Cornell University recently introduced its new Freeze-Frame File System to fill this gap. FFFS is able to accept streams of updates while fulfilling “temporal reads” on demand.   By running analytics on the files that FFFS captures, we can get the best of both worlds: high productivity through reuse of existing code, and ultra-low latency.  The key is that FFFS bridges between a model of continuously writing data into files and one of performing analysis on snapshots representing instants in time (much as if it was a file system backup system creating a backup after every single file update operation).

  • The developer sees a standard file system.
  • Snapshots are available and will materialize past file state at any time desired.
  • The snapshots don’t need to be planned ahead of time, and look to the application like file system directories (folders), with the snapshot time embedded in the name.
  • Applications don’t need any modifications at all, and can run on these snapshots with extremely low delay between when data is captured and when the analysis runs.
  • The file system uses RDMA data transfers for extremely high speed.

The result is that even though the FFFS looks like standard file system, and supports a standard style of program that just stores data into files and then launches analytics that use those files as input, performance is often better than that of a custom data streaming solution.  This gives developers the best of two worlds: they can leverage existing file-oriented data analytics, but gain the performance of a custom-built data streaming solution!

How does FFFS do this?

Getting more technical, here are some of the key ideas used within FFFS:

  • The system maintains an updated history in the form of a memory-mapped log.
  • The system maintains an in-memory cache of recently retrieved data for repeat reads.
  • FFFS uses a hybrid of a real-time and a logical clock to respond to read requests in a manner that is both temporally precise and causally consistent.
  • When RDMA hardware is available, the write and read throughput of a single client reaches 2.6GB/s for writes and 5GB/s for reads, close to the limit (about 6GB/s) on the RDMA hardware used in the experiments summarized below.
  • The FFFS software is free and open-sourced, and can be used in any setting where a standard file system is used, including from the Spark/DataBricks platform, where it would be used just like the built-in HDFS.
  • FFFS can support any normal file-based analytic programs, so once the data is captured, existing analytic code will work as usual.
  • Dramatically better performance through utilization of Mellanox RDMA solutions. When RDMA is available, FFFS takes advantage of the hardware.  Any data read or written will be moved using zero-copy hardware rather than through standard kernel/user-space TCP.  For data-intensive uses, this offers a huge speedup compared to standard non-RDMA networking.
  • Ultra-accurate tracking of real-time for Internet of Things (IoT) applications.

Additional Freeze Frame Features:

  • Can be configured to understand the time fields in your data streams.  Given GPS clocks, FFFS will use them as its time source for the IoT data.
  • Files can be overwritten, in whole or in part. FFFS supports the full POSIX file system API (in HDFS, parts of the POSIX API are not supported: files can only be appended-to).
  • Unlike HDFS, FFFS does not require the user to pre-plan a snapshot. FFFS allows a user to access a snapshot when needed, long after the data was captured, and with no significant overhead costs. In addition, FFFS provided a new read API to access the snapshot, enabling applications to read the state of a file at any given time without creating or referring to a ‘snapshot’ explicitly. This is superior to HDFS’s snapshot directory (a special directory named by a time, within which FFFS will show read-only versions of all the files as they looked at that time) because:
  • The users may not know which snapshot will be required by a future analysis.
  • FFFS snapshots guarantee temporal precision and causal consistency.
  • Never exposes applications to “mashups” of data from multiple times.
  • Applications can specify the time of each read and write. However, FFFS blocks applications from tampering with the distant past or writing far in the future.

FFFS is easy to install, following the download, compilation and install instructions on the distribution website.

One of the outstanding features of FFFS is that it uses a new form of temporal storage system wherein each file system update is represented separately and its time of update is tracked. When a temporal read occurs, FFFS then recovers the exact data that applied at that time for that file. These data structures are highly efficient and FFFS outperforms HDFS even in standard non-temporal uses.

An Example Use Case

FFFS is a general-purpose tool, but it was originally created for the use of smart electric power grid, where developers are creating machine-intelligence solutions to improve the way electric power is managed and reduce wasteful and power generation that creates pollution. In this application, data collectors capture data from Internet of Things (IoT) devices, such as power-line monitoring units (so-called PMU and micro-PMU sensors), in-home smart meters, solar panels, and so on. The data is relayed into a cloud setting for analysis, often by an optimization program that will then turn around and adjust operating modes to balance power generation and demand. With FFFS, grid network models and other configuration information gets updated by writing files to reflect the evolving state. This essentially creates an archive of past states.

The greatest benefit happens when performing temporal analytics which is the analysis of the evolving state of the system over time, rather than just on the state of the system at a single instant. By using FFFS for data captures, smart grid developers can run standard non-real-time analytic applications on a series of snapshots, and they can then study the evolution of power-grid data. Developers only need ask FFFS to retrieve a series of high-accuracy snapshots, at any precision desired. The programs run multiple times, one time on each of a series of snapshots. Because the input files evolve over time, users can obtain a series of outputs that also evolve over time. Hence for many applications, the analytic programs often don’t need to be modified to understand “time” explicitly. The series of outputs would be an evolving temporal analysis for the system.

In addition, FFFS supports temporal access to a file at more than one point in time, from a single program.  As such, one can easily write programs that explicitly search data in the time dimension!  It is as if the same file could be opened again and again, with each file descriptor connecting to a different version from a different instant in time.

Smart grid applications need speed: when adjusting the power grid to deal with a surge in load or production, time is of the essence.  This is where RDMA plays a key role:  FFFS is designed to  leverage Mellanox RDMA solutions.  FFFS with Mellanox RDMA is far faster than standard non-RDMA file systems (like the popular Ceph or HDFS file systems) in all styles of use, and at the same time, FFFS is able to offer better consistency and temporal accuracy.

Freeze Frame in Action (full animations can be viewed at https://goo.gl/QXaODs)

freeze-frame-triple-animation-r2To illustrate the kind of computing that arises in the smart grid, we simulated a wave propagating through a very simple mesh network, and generated 100 10×10 image streams, as if each cell in the mesh were monitored by a distinct camera, including a timestamp for each of these tiny images. We streamed the images in a time-synchronized manner to a set of data collectors over TCP, using a separate connection for each stream (mean RTT was about 10ms). Each data collector simply writes incoming data to files. Finally, to create a movie we extracted data for the time appropriate to each frame trusting the file system to read the proper data, fused the data, then repeated. In Figures 1-3 we see representative output.
In the Figure 2 (middle), we ran the same application but now saved the files into FFFS, configured to assume that each update occurred at the time the data reached the data storage node.  Then we used FFFS snapshots to make the movie.  You can immediately see the improvement: this is because FFFS understands both time and data consistency.  Moreover, HDFS snapshots don’t need to be preplanned: you just ask for data at any desired time.  Finally, in Figure 3 (right), we again used FFFS, but this time configured it to extract time directly from the original image by providing a datatype-specific plug-in.  (Some programming may be needed to create a new plug-in the format of your sensor data is one we haven’t worked with previously).  Now the movie is perfect.The image on the left used the widely popular HDFS system to store the files.  HDFS has a built-in snapshot feature; we triggered snapshots once every 100 milliseconds (HDFS snapshots must be planned in advance and requested at the exact time the snapshot should cover).  Then we rendered the data included in that snapshot to create each frame of the movie. Even from the single frame shown, you should be able to instantly see that this version is of low quality: HDFS is oblivious to timestamps and hence often mixes frames from different times.

With this image in mind, now imagine that rather than making a movie, the plan was to run an advanced smart-grid data analytic on the captured data.  If a file system can’t support making a movie, as in Figure 1, it clearly wouldn’t work well for other time-sensitive computations, either. In effect, the HDFS style of file system fights against the IoT application.  FFFS, by understanding time, offers a far superior environment for doing these kinds of smart analytics.

Performance

Below we see one of several performance experiments; our full paper (published in the ACM Symposium on Cloud Computing, November 2016), reports on many more of them. To create these graphs, we ran FFFS with Mellanox ConnectX-3 FDR 56 Gb/s adapters, and the results are shown in the figures below.

We configured our RDMA layer to transfer data in pages of 4KB, and used scatter/gather to batch as many as 16 pages per RDMA read/write.  Figure (a) (Single Client) shows that the FFFS read and write throughput grows when the data packet size used in write increases. When our clients write 4096KB per update, FFFS write throughput reaches 2.6 GB/s, which is roughly half the RDMA hardware limit of 6 GB/s (measured with ib_send_bw tool). FFFS read throughput reaches 5.0GB/s.

We scaled up to a 16-DataNodes setup consisting of 1 NameNode and 16 DataNode servers. In figure (b) (Multiple Clients), we can see that file system throughput grows in proportion to the number of DataNodes, confirming that FFFS scales well.

 

Figure 4. Transfer Size; Figure 5. Multiple DataNodes.

Figure 4. Transfer Size; Figure 5. Multiple DataNodes.

A few words from the author

I’m Weijia Song, and am the lead FFFS developer.  I work as a post doctoral student at Cornell University in the Department of Computer Science, in a group headed by Professor Ken Birman (http://www.cs.cornell.edu/ken).  I also have roles in Hakim Weatherspoon’s advanced cloud computing group, where I work on his Super Cloud system (but that’s a story for another blog posting).  PhD student Theo Gkountouvas collaborated closely on the FFFS data structure design and on extending FFFS to use an optimal consistency-preserving data representation.

FFFS isn’t finished yet.  In future work, I’ll be using Cornell’s RDMA-based Derecho programming platform to make FFFS highly available via data replication and fault-tolerance. This will also make it even more scalable, since we will be able to spread reads and writes over a large number of compute/storage nodes. Theo is working to use Pig Latin, a scripting language, to extract and format the data from files into matrix or tensor representations, which can then support direct computing from MPI, Matlab, Linpack, or other packages. A deeper integration into Derecho is also in the planning stages. Contact Ken Birman (ken@cs.cornell.edu) with questions about the Derecho and FFFS plans.

I want to express my thanks to the organizations that supported this work, including Mellanox, which generously provided access to its cutting-edge RDMA hardware for our experiments.  Other support was obtained from the DOE ARPAe research organization, from NSF, and most recently, from NYSERDA.   Through ongoing work, I hope to see FFFS in production use helping to optimize the smart power grid, and perhaps in a wide range of other data analytic settings!

References

References

Overview

The Freeze-Frame File System

GitHub – The Freeze Frame File System

GitHub – The Derecho System

Codeplex Freeze-Frame File System (mirror site) : Deprecated; please use the GitHub site.

http://www.cs.cornell.edu/ken/slides/MesosCon.pdf