2023-10-18

MapReduce

List the four general steps of MapReduce ?

Splitting
Mapping
Shuffling
Reducing
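
To make the four steps concrete, here is a minimal single-process sketch using word count as the job. It only imitates the data flow; the function names (`split_input`, `map_split`, `shuffle`, `reduce_groups`, `word_count`) are made up for these notes and are not part of Hadoop.

```python
from collections import defaultdict

def split_input(text, num_splits):
    """Splitting: divide the input lines into roughly equal chunks."""
    lines = text.splitlines()
    size = max(1, -(-len(lines) // num_splits))  # ceiling division
    return [lines[i:i + size] for i in range(0, len(lines), size)]

def map_split(lines):
    """Mapping: emit a (word, 1) pair for every word in the split."""
    return [(word, 1) for line in lines for word in line.split()]

def shuffle(mapped_splits):
    """Shuffling: gather all values that share a key, across every split."""
    groups = defaultdict(list)
    for pairs in mapped_splits:
        for key, value in pairs:
            groups[key].append(value)
    return groups

def reduce_groups(groups):
    """Reducing: collapse each key's list of values into one result."""
    return {key: sum(values) for key, values in groups.items()}

def word_count(text, num_splits=2):
    splits = split_input(text, num_splits)
    mapped = [map_split(s) for s in splits]
    return reduce_groups(shuffle(mapped))
```

In a real cluster each split is mapped on a different node and the shuffle moves data over the network; here everything runs in one process.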

Map task takes care of ==loading, transforming, parsing and filtering==

Reduce task takes care of ==grouping and aggregation==

What are the phases in Mapper ?

RecordReader
Map
Combiner
Partitioner

What are the phases in Reduce ?

Shuffling
Sorting
Reducing
OutputFormat

What is the function of RecordReader ?

Converts the byte-oriented view of the input into a record-oriented view
Generates key-value pairs from the input data
Usually it reads in bytes and generates [position: data] pairs
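
A rough sketch of what a line-based record reader does, assuming UTF-8 text with one record per line. `read_records` is a hypothetical helper written for these notes, not a Hadoop API.

```python
def read_records(data: bytes):
    """Yield (byte_offset, line) pairs, one per line of the input."""
    offset = 0
    for raw_line in data.splitlines(keepends=True):
        # Key: position of the line's first byte. Value: the line's text.
        yield offset, raw_line.rstrip(b"\r\n").decode("utf-8")
        offset += len(raw_line)

records = list(read_records(b"hello world\nfoo bar\n"))
# records == [(0, "hello world"), (12, "foo bar")]
```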

What is the function of Map ?

Converts [position: data] to [data: some_val]
Produces intermediate key-value pairs
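
For word count, the map step could look roughly like this. `map_record` is an illustrative name, and lowercasing the words is an assumption made for the example.

```python
def map_record(offset, line):
    """Ignore the position key and emit (word, 1) for each word in the line."""
    for word in line.lower().split():
        yield word, 1

pairs = list(map_record(0, "Hello hello world"))
# pairs == [("hello", 1), ("hello", 1), ("world", 1)]
```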

What is the function of Combiner ?

It is a local reducer
Applies a user-supplied function to the data present in that mapper alone
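
A small sketch of a combiner for word count, assuming the user function is a sum. `combine` is a hypothetical helper; it only sees the pairs produced by one mapper.

```python
from collections import Counter

def combine(pairs):
    """Sum the values per key within a single mapper's output."""
    totals = Counter()
    for key, value in pairs:
        totals[key] += value
    return list(totals.items())

combined = combine([("a", 1), ("b", 1), ("a", 1)])
# Three pairs shrink to two: ("a", 2) and ("b", 1)
```

The benefit is that fewer pairs have to be sent across the network to the reducers.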

What is the function of the Partitioner ?

Takes the intermediate key-value pairs and splits them into shards
Then sends each shard to the appropriate reducer as per user code
Same keys go to the same reducer
This is typically decided by a hash function
The number of partitions is equal to the number of reduce tasks
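
A minimal sketch of hash partitioning. `zlib.crc32` is used here only as a stand-in hash, chosen because Python's built-in `hash()` for strings changes between processes; the function names are made up for these notes.

```python
import zlib

def partition(key: str, num_reducers: int) -> int:
    """Map a key to a reducer index in the range [0, num_reducers)."""
    return zlib.crc32(key.encode("utf-8")) % num_reducers

def partition_pairs(pairs, num_reducers):
    """Split intermediate pairs into one shard per reducer."""
    shards = [[] for _ in range(num_reducers)]
    for key, value in pairs:
        shards[partition(key, num_reducers)].append((key, value))
    return shards

shards = partition_pairs([("a", 1), ("b", 1), ("a", 2)], num_reducers=3)
# Both ("a", ...) pairs always land in the same shard.
```

Because the index depends only on the key, every mapper sends a given key to the same reducer without any coordination.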

What is the function of shuffle and sort ?

Downloads the data given by the partitioner to the node running the reducer
Sorts it by key so that it can be iterated easily
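
A sketch of the reducer-side merge, assuming each mapper has already produced one shard for this reducer. `shuffle_and_sort` is an illustrative name, and the in-memory sort stands in for what is normally a disk-backed merge.

```python
from itertools import groupby
from operator import itemgetter

def shuffle_and_sort(shards_for_reducer):
    """Merge shards fetched from every mapper, sort by key, group the values."""
    merged = [pair for shard in shards_for_reducer for pair in shard]
    merged.sort(key=itemgetter(0))  # groupby needs equal keys to be adjacent
    return [(key, [value for _, value in group])
            for key, group in groupby(merged, key=itemgetter(0))]

grouped = shuffle_and_sort([[("b", 1), ("a", 2)], [("a", 1)]])
# grouped == [("a", [2, 1]), ("b", [1])]
```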

What is the function of Reduce ?

Reduce works on one group at a time
It iterates through all the key-value pairs of that group and applies the reduce function to them
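
A sketch of the reduce step for word count, assuming the input is already grouped by key. Both function names are hypothetical.

```python
def reduce_group(key, values):
    """Apply the reduce function (here, a sum) to one key's values."""
    total = 0
    for value in values:
        total += value
    return key, total

def run_reducer(grouped):
    """Process the groups one at a time, in key order."""
    return [reduce_group(key, values) for key, values in grouped]

result = run_reducer([("a", [2, 1]), ("b", [1])])
# result == [("a", 3), ("b", 1)]
```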

What does the OutputFormat do ? By default, separates each key and value with a tab and writes the pairs out to an output file
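
A sketch of the tab-separated text output, one pair per line. `write_output` is a made-up helper, and an in-memory buffer stands in for the real output file.

```python
import io

def write_output(pairs, out):
    """Write each pair as 'key<TAB>value' followed by a newline."""
    for key, value in pairs:
        out.write(f"{key}\t{value}\n")

buffer = io.StringIO()
write_output([("a", 3), ("b", 1)], buffer)
text = buffer.getvalue()
# text == "a\t3\nb\t1\n"
```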

What is usually the combiner class ? The Reducer class can often be reused as the combiner class, as long as the reduce function is commutative and associative
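
Reusing the reducer as the combiner is only safe when applying the function to partial results gives the same answer as applying it to all the values at once. A small sketch with made-up values, contrasting a sum (safe) with a mean (not safe):

```python
def reduce_sum(values):
    return sum(values)

def reduce_mean(values):
    return sum(values) / len(values)

mapper_a = [1, 2, 3]  # values for one key seen by mapper A
mapper_b = [10]       # values for the same key seen by mapper B

# Sum: combining per mapper first does not change the result.
sum_direct = reduce_sum(mapper_a + mapper_b)
sum_combined = reduce_sum([reduce_sum(mapper_a), reduce_sum(mapper_b)])
# sum_direct == sum_combined == 16

# Mean: a mean of per-mapper means is not the overall mean.
mean_direct = reduce_mean(mapper_a + mapper_b)
mean_combined = reduce_mean([reduce_mean(mapper_a), reduce_mean(mapper_b)])
# mean_direct == 4.0, mean_combined == 6.0
```

For a mean, one common workaround is to have the combiner emit (sum, count) pairs and let the reducer do the final division.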

What is the difference between Reducer and Combiner ? The main difference is that the output of the Combiner is intermediate and is consumed by the Reducer, whereas the output of the Reducer is the final result and is written out to disk

What is the default partitioner ? HashPartitioner