This is a repository for my research, paper reading summaries/reviews, and relevant blog-like posts in markdown.

The Google File System

What is the problem the authors are trying to solve?

The authors are interested in a fault-tolerant distributed file system that is very fast on cheap, commodity hardware. Additionally, the authors would like the file system to be fast for Google’s workloads, which are distinct from typical workload assumptions made by other file systems.

What other approaches or solutions existed at the time that this work was done?

The authors mention a large range of existing work that they build on or have similarities to. About 20 years prior, CMU had been building the Andrew File System (AFS) to support 5000-10000 workstations. But GFS is claimed to be more similar to xFS, which is based on RAID and Zebra, and to Swift, which tries to hide latency by using high-speed interconnections to many slow storage devices used in parallel.

Harp was a replicated Unix file system first described in 1991 that used primary-copy replication for reliability. The authors mention they, “could adapt a primary-copy scheme like the one in Harp… [for] high availability and stronger consistency guaranees.

Lustre is a parallel file system which the authors mention had similar issues as GFS when serving so many clients.

NASD is a storage architecture using network-attached disks and emphasizing security–Network Attached Secure Disk. The authors of the NASD paper say that there are 4 motivating principles for NASD: (1) clients and appropriate disks should directly transfer between each other, (2) “secure interfaces” with the support of cryptography, (3) system communications not on the critical-path should be asynchronous, and (4) data objects are variable-sized instead of fixed-size. The authors of the GFS paper specifically mention that they differ from NASD in (4), and the rest of the GFS paper suggests that (1) and (3) were used in GFS, although the model for distributed storage is different from network-attached disks.

Finally, the authors mention River (distributed memory-based queues) which is work from Hellerstein’s lab (and maybe others? I can’t tell which authors are professors) that focuses on data flow using a generalization of chained declustering and key insights into the consistency properties of distributed queues. The authors of GFS only mention the support of many consumers (m-to-1 queues) compared to River’s many consumers and many producers (m-to-n queues), but this work also seems relevant to the pipelined network approach to data transfer to secondary replicas.

What was wrong with the other approaches or solutions?

As is typical with Google, the other approaches probably do not perform well enough at large scale. But, more seriously, Google as a company has different workload characteristics compared to general companies and consumers, and the scale at which Google exercises these workloads is larger than most, thus to get optimal performance on the use cases that Google values, they needed to design their own distributed file system.

What is the authors’ approach or solution, and why is it better than the other approaches or solutions? How does it perform?

The authors propose a distributed file system that tries to minimize the amount of time that operations important for correctness take, while maximizing the amount of concurrent operations that can take place. Specifically, they propose a model where there is a single master server, that is stateful but can recover state easily, many chunkservers, which handle data persistence and are final authority on the data chunks that they store. Finally, operations from clients are managed at the master for learning routes and modifications to the namespace (new files, deleted files), but the client then communicates directly with chunk servers once that information is made available.

Another approach for improving performance is to decouple data flow to secondary replicas from the control flow and data flow to/from a primary. This is efficient because when a secondary replica receives data, it immediately passes that data to the next closest secondary replica, therefore minimizing time between when the initial secondary replica receives the last bytes from the client and when the final secondary replica in a pipelined chain receives the last bytes. This also maximizes data flow by ensuring that data writes to secondary replicas are not affected by logic at all–throughput is maximized. Then, control flow can be very fast because all of the data exists at the secondary replicas already.

Why is this work important?

There seem to be a few key ideas that, while based or inspired by other work, should be taken away from this paper: (1) separation of control flow from data flow, (2) the idea that correctness of data persistence is best left to chunk servers rather than the master, (3) pipelined tcp data transfers, and maybe some others.

3+ comments/questions

  1. “Chunkservers neednot cache file data because chunks are stored as local files and so Linux’s buffer cache already keeps frequently accessed data in memory.” This is a great quote that shows that knowing underlying mechanisms can simplify higher levels of abstraction without actually including harmful assumptions. This idea is something our lab wants to explore more.

  2. In some ways I think that GFS reminds me of corfu, specifically in the similarity that consistency is maintained with maximal performance by consolidating as few operations in a single point of failure as possible, and once that operation is complete, then all subsequent operations from distributed agents can be completed at the agent’s leisure. Where GFS consolidates namespace operations at the master, corfu consolidates log position generation at a single, master-like node. This is a very similar idea to minimizing critical regions in concurrent systems and seems to be a fundamental design choice to distributed systems that must guarantee correctness.

  3. What happens to data pushed to a replica that gets aged out before the write is pushed to the primary? Is this a cause of many retries of steps 3 - 7 in the control flow?

  4. The separation of control flow and data flow sounds nice for a given sequence of writes, but I wonder how well the idea holds up when many clients are doing data flow and control flow at the same time and the links are saturated. Do control flow packets get prioritized by the chunkservers?