Paper Review: MapReduce: Simplified Data Processing on Large Clusters

This is a repository for my research, paper reading summaries/reviews, and relevant blog-like posts in markdown.

Paper Review: MapReduce: Simplified Data Processing on Large Clusters

Review

What’s new

MapReduce was a new model for distributed parallel computing over large datasets.

Summary

Inspired by map and reduce functions in functional languages, the idea was that many computations could be expressed in two phases: map and reduce. These phases could then be distributed in an automated, highly performant manner. This approach relies on good approaches for automatically partitioning input data, and also treating input data and intermediate data as key value pairs, where the reduce step typically returns 0 or 1 outputs.

Thoughts/Comments

  1. I believe that even at the time mapreduce first came out, data shuffles (moving of data between nodes) had a huge performance hit. Along with better tools for observing data shuffles, I wonder what would be necessary to make a distributed debugger function well, and why it wasn’t pursued.

  2. It also occurs to me that many mapreduce implementations I know of are based on Java, and I wonder just how much nicer the C++ implementation would be to work with because the difficulties that many JVMs brings would be mitigated.

  3. I suppose this actually makes me wonder if a “map-reduce” based domain specific language would have been useful to make the ecosystem easier to adopt.