MapReduce
List the four general steps of MapReduce ?
- Splitting
- Mapping
- Shuffling
- Reducing
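The four steps above can be sketched end to end with a word count in plain Python. This is only an illustrative, single-process model and not how a real cluster runs the job; the function names (`split_input`, `map_split`, `shuffle`, `reduce_groups`) and the sample text are assumptions made for the example.

```python
from collections import defaultdict

def split_input(text, num_splits):
    """Splitting: divide the input lines into roughly equal chunks."""
    lines = text.splitlines()
    size = max(1, -(-len(lines) // num_splits))  # ceiling division
    return [lines[i:i + size] for i in range(0, len(lines), size)]

def map_split(lines):
    """Mapping: emit a (word, 1) pair for every word in the split."""
    return [(word, 1) for line in lines for word in line.split()]

def shuffle(mapped_splits):
    """Shuffling: collect all values that share a key, with keys sorted."""
    groups = defaultdict(list)
    for pairs in mapped_splits:
        for key, value in pairs:
            groups[key].append(value)
    return dict(sorted(groups.items()))

def reduce_groups(groups):
    """Reducing: aggregate the values of each key."""
    return {key: sum(values) for key, values in groups.items()}

text = "deer bear river\ncar car river\ndeer car bear"
splits = split_input(text, num_splits=3)
mapped = [map_split(s) for s in splits]
counts = reduce_groups(shuffle(mapped))
print(counts)  # {'bear': 2, 'car': 3, 'deer': 2, 'river': 2}
```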
Map task takes care of ==loading, transforming, parsing and filtering==
Reduce task takes care of ==grouping and aggregation==
What are the phases in Mapper ?
- RecordReader
- Map
- Combiner
- Partitioner
What are the phases in Reducer ?
- Shuffling
- Sorting
- Reducing
- OutputFormat
What is the function of RecordReader ?
- Converts the byte-oriented view of the input into a record-oriented view
- Generates key-value pairs from the input data
- Usually it reads in byte values and generates `[position: data]` pairs
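A minimal sketch of the `[position: data]` idea, assuming line-oriented text input where the key is the byte offset of the line and the value is the line itself. The helper name `read_records` and the sample bytes are hypothetical.

```python
import io

def read_records(stream):
    """Yield (byte_offset, line) pairs from a binary stream, one per line."""
    offset = 0
    for raw in stream:  # iterating a binary stream yields one line at a time
        yield offset, raw.rstrip(b"\r\n").decode("utf-8")
        offset += len(raw)  # advance by the raw byte length, newline included

data = io.BytesIO(b"deer bear\ncar river\n")
records = list(read_records(data))
print(records)  # [(0, 'deer bear'), (10, 'car river')]
```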
What is the function of Map ?
- Converts `[position: data]` pairs to `[data: some_val]` pairs
- Produces intermediate key-value pairs
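As an illustration, a word-count map function could look like the sketch below. It assumes the input record is a (position, line) pair and that `some_val` is simply the count 1; `word_count_map` is a hypothetical name.

```python
def word_count_map(position, line):
    """Turn one (position, line) record into intermediate (word, 1) pairs."""
    # The input key (position) is ignored; each word becomes a new key.
    for word in line.split():
        yield word.lower(), 1

records = [(0, "Deer bear"), (10, "car deer")]
intermediate = [pair for pos, line in records
                for pair in word_count_map(pos, line)]
print(intermediate)
# [('deer', 1), ('bear', 1), ('car', 1), ('deer', 1)]
```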
What is the function of Combiner ?
- It is a local reducer
- Applies a user-supplied function to the output of that mapper alone
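A rough sketch of a combiner for word count, assuming the user-supplied function is a sum. It only sees one mapper's output, so it shrinks the data before it leaves that mapper; `combine` and the sample pairs are illustrative names and values.

```python
from collections import Counter

def combine(mapper_output):
    """Locally sum the values per key for a single mapper's output."""
    totals = Counter()
    for key, value in mapper_output:
        totals[key] += value
    return sorted(totals.items())

mapper_output = [("car", 1), ("deer", 1), ("car", 1), ("car", 1)]
combined = combine(mapper_output)
print(combined)  # [('car', 3), ('deer', 1)]
```

Four pairs go in and two come out, which is the point: fewer pairs have to be sent on to the reducers.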
What is the function of the Partitioner ?
- Takes the intermediate key-value pairs and splits them into shards
- Then sends each shard to the appropriate reducer as per user code
- Same keys go to the same reducer
- This is typically decided by a hash function
- The number of partitions is equal to the number of reduce tasks
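Hash partitioning can be sketched as "hash of the key modulo the number of reducers". The example below is an assumption-laden illustration: it uses `zlib.crc32` as the hash (chosen because Python's built-in `hash()` of a string changes between runs), and `partition` and `partition_pairs` are hypothetical helpers.

```python
import zlib

def partition(key, num_reducers):
    """Map a key to a reducer index in the range [0, num_reducers)."""
    # A stable hash guarantees the same key always lands on the same reducer.
    return zlib.crc32(key.encode("utf-8")) % num_reducers

def partition_pairs(pairs, num_reducers):
    """Split intermediate pairs into one shard per reducer."""
    shards = [[] for _ in range(num_reducers)]
    for key, value in pairs:
        shards[partition(key, num_reducers)].append((key, value))
    return shards

pairs = [("car", 3), ("deer", 1), ("bear", 2), ("car", 1)]
shards = partition_pairs(pairs, num_reducers=2)
for index, shard in enumerate(shards):
    print(index, shard)
```

Both `car` pairs end up in the same shard regardless of which shard that is, and there is exactly one shard per reducer.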
What is the function of shuffle and sort ?
- Downloads the data assigned to it by the partitioner
- Sorts the pairs by key so that they can be iterated over easily
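The sketch below models shuffle and sort on one reducer, assuming the shards from two mappers have already been fetched into memory. Sorting places equal keys next to each other, so grouping becomes a single pass; `shuffle_and_sort` is a hypothetical name.

```python
from itertools import groupby
from operator import itemgetter

def shuffle_and_sort(shards_from_mappers):
    """Merge the shards fetched from every mapper, sort by key, group by key."""
    fetched = [pair for shard in shards_from_mappers for pair in shard]
    fetched.sort(key=itemgetter(0))  # sorting makes equal keys adjacent
    return [(key, [value for _, value in group])
            for key, group in groupby(fetched, key=itemgetter(0))]

shards_from_mappers = [
    [("deer", 1), ("bear", 1)],  # fetched from mapper 0
    [("bear", 2), ("deer", 1)],  # fetched from mapper 1
]
grouped = shuffle_and_sort(shards_from_mappers)
print(grouped)  # [('bear', [1, 2]), ('deer', [1, 1])]
```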
What is the function of Reduce ?
- Reduce works on one group at a time
- It iterates through all the key-value pairs of that group and applies the reduce function on them
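For word count, the reduce function applied to each group might look like the following sketch. It assumes the groups arrive as (key, list of values) pairs; `word_count_reduce` is an illustrative name.

```python
def word_count_reduce(key, values):
    """Apply the reduce function to one group: sum every value for the key."""
    total = 0
    for value in values:  # iterate through all values of this group
        total += value
    return key, total

grouped = [("bear", [1, 2]), ("deer", [1, 1])]
results = [word_count_reduce(key, values) for key, values in grouped]
print(results)  # [('bear', 3), ('deer', 2)]
```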
What does the OutputFormat do ?
- Writes the final key-value pairs to an output file, by default separating each key from its value with a tab
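The tab-separated layout can be sketched as below. An in-memory `io.StringIO` buffer stands in for the output file, and `write_output` is a hypothetical helper rather than a real framework call.

```python
import io

def write_output(results, out):
    """Write each (key, value) pair as 'key<TAB>value' on its own line."""
    for key, value in results:
        out.write(f"{key}\t{value}\n")

buffer = io.StringIO()  # stands in for the output file
write_output([("bear", 3), ("deer", 2)], buffer)
output_text = buffer.getvalue()
print(output_text, end="")
```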
What is usually the combiner class ?
- The `Reducer` class can also be used as the combiner class
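Reusing the reduce logic as a combiner can be sketched as follows. The example assumes the function is a sum, which is associative and commutative, so applying it early on each mapper does not change the final result; `sum_values` and `group` are hypothetical helpers.

```python
def sum_values(key, values):
    """Reduce function that is also safe to use as a combiner."""
    return key, sum(values)

def group(pairs):
    """Group values by key, with keys in sorted order."""
    grouped = {}
    for key, value in pairs:
        grouped.setdefault(key, []).append(value)
    return sorted(grouped.items())

mapper_outputs = [
    [("car", 1), ("car", 1), ("deer", 1)],  # mapper 0
    [("car", 1), ("deer", 1)],              # mapper 1
]

# Without a combiner: every raw pair reaches the reducer.
all_pairs = [pair for out in mapper_outputs for pair in out]
without_combiner = [sum_values(k, vs) for k, vs in group(all_pairs)]

# With a combiner: the same function first runs on each mapper's output.
combined = [sum_values(k, vs) for out in mapper_outputs for k, vs in group(out)]
with_combiner = [sum_values(k, vs) for k, vs in group(combined)]

print(without_combiner == with_combiner)  # True
```

A function such as a plain average would not be safe to reuse this way, because averaging partial averages gives a different answer than averaging all values at once.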
What is the difference between `Reducer` and `Combiner` ?
- The main difference is that the output of `Combiner` is intermediate and is consumed by the reducer, whereas the output of `Reducer` is the final result and is written out to disk
What is the default partitioner ?
- `HashPartitioner`