Reading Summary 2019.05.17
· by Aldrin Montana
MapReduce: Simplified Data Processing on Large Clusters
Inspired by map and reduce functions in functional languages, the idea was that many computations could be expressed in two phases: map and reduce. These phases could then be distributed in an automated, highly performant manner. This approach relies on good approaches for automatically partitioning input data, and also treating input data and intermediate data as key value pairs, where the reduce step typically returns 0 or 1 outputs.
- MapReduce was a new model for distributed parallel computing over large datasets. 2. 3.
1. 2. 3.
I believe that even at the time mapreduce first came out, data shuffles (moving of data between nodes) had a huge performance hit. Along with better tools for observing data shuffles, I wonder what would be necessary to make a distributed debugger function well, and why it wasn’t pursued.
It also occurs to me that many mapreduce implementations I know of are based on Java, and I wonder just how much nicer the C++ implementation would be to work with because the difficulties that many JVMs brings would be mitigated.