Reading Summary 2019.05.17

· by Aldrin Montana

MapReduce: Simplified Data Processing on Large Clusters

Overall Evaluation

Inspired by map and reduce functions in functional languages, the idea was that many computations could be expressed in two phases: map and reduce. These phases could then be distributed in an automated, highly performant manner. This approach relies on good approaches for automatically partitioning input data, and also treating input data and intermediate data as key value pairs, where the reduce step typically returns 0 or 1 outputs.

Strong Points

  1. MapReduce was a new model for distributed parallel computing over large datasets. 2. 3.

Weak Points

1. 2. 3.

Questions Raised

  1. I believe that even at the time mapreduce first came out, data shuffles (moving of data between nodes) had a huge performance hit. Along with better tools for observing data shuffles, I wonder what would be necessary to make a distributed debugger function well, and why it wasn’t pursued.

  2. It also occurs to me that many mapreduce implementations I know of are based on Java, and I wonder just how much nicer the C++ implementation would be to work with because the difficulties that many JVMs brings would be mitigated.

Research Connections