
MapReduce

List the four general steps of MapReduce ?

  • Splitting
  • Mapping
  • Shuffling
  • Reducing
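
As a rough illustration of the four steps above, here is a minimal single-process word-count sketch. It is not how Hadoop itself is implemented; the function names (`split_input`, `map_split`, `shuffle`, `reduce_groups`) and the sample text are made up for this example.

```python
from collections import defaultdict

def split_input(text, n_splits):
    """Splitting: divide the input lines into roughly equal chunks."""
    lines = text.splitlines()
    size = max(1, -(-len(lines) // n_splits))  # ceiling division
    return [lines[i:i + size] for i in range(0, len(lines), size)]

def map_split(lines):
    """Mapping: emit (word, 1) for every word in one split."""
    return [(word, 1) for line in lines for word in line.split()]

def shuffle(mapped_splits):
    """Shuffling: group values by key across all mapper outputs."""
    groups = defaultdict(list)
    for pairs in mapped_splits:
        for key, value in pairs:
            groups[key].append(value)
    return dict(sorted(groups.items()))

def reduce_groups(groups):
    """Reducing: aggregate the values of each key."""
    return {key: sum(values) for key, values in groups.items()}

text = "a b a\nb c\na"          # hypothetical sample input
splits = split_input(text, 2)   # two splits, one per "mapper"
mapped = [map_split(s) for s in splits]
counts = reduce_groups(shuffle(mapped))
# counts == {'a': 3, 'b': 2, 'c': 1}
```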

Map task takes care of ==loading, transforming, parsing and filtering==

Reduce task takes care of ==grouping and aggregation==

What are the phases in Mapper ?

  • RecordReader
  • Map
  • Combiner
  • Partitioner

What are the phases in Reduce ?

  • Shuffling
  • Sorting
  • Reducing
  • OutputFormat

What is the function of RecordReader ?

  • Converts the byte-oriented view of the input into a record-oriented view
  • Generates key-value pairs from the input data
  • Usually it reads the input as bytes and generates [position: data] pairs, where position is the byte offset of the record
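
A small sketch of the idea, assuming line-oriented text input: the hypothetical `read_records` function below turns a byte string into [position: data] pairs, using the byte offset of each line as the key.

```python
def read_records(data: bytes):
    """Yield (byte_offset, line_text) pairs, one per line of input."""
    offset = 0
    for raw_line in data.splitlines(keepends=True):
        # Strip the line terminator from the value, but count it in the offset.
        yield offset, raw_line.rstrip(b"\r\n").decode("utf-8")
        offset += len(raw_line)

records = list(read_records(b"hello world\nfoo bar\n"))
# records == [(0, 'hello world'), (12, 'foo bar')]
```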

What is the function of Map ?

  • Converts each [position: data] pair into zero or more [data: some_val] pairs
  • Produces intermediate key-value pairs
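
For word count, the map step could look like the sketch below. `map_record` is an illustrative name, and lower-casing the words is an assumption of this example rather than something MapReduce requires.

```python
def map_record(offset, line):
    """Ignore the offset key and emit (word, 1) for each word in the line."""
    for word in line.split():
        yield word.lower(), 1

intermediate = list(map_record(0, "Hello hello world"))
# intermediate == [('hello', 1), ('hello', 1), ('world', 1)]
```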

What is the function of Combiner ?

  • It is a local reducer
  • Applies a user-supplied function to the output of that mapper alone
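
A minimal sketch of local reduction, assuming the word-count case where the user function is a sum. The `combine` helper and the sample `mapper_output` are hypothetical.

```python
from collections import Counter

def combine(pairs):
    """Sum the counts per key within a single mapper's output."""
    totals = Counter()
    for key, value in pairs:
        totals[key] += value
    return sorted(totals.items())

mapper_output = [("a", 1), ("b", 1), ("a", 1), ("a", 1)]
combined = combine(mapper_output)
# combined == [('a', 3), ('b', 1)]
```

Four pairs shrink to two before anything is sent over the network, which is the point of running a combiner.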

What is the function of the Partitioner ?

  • Takes the intermediate key-value pairs and splits them into shards
  • Sends each shard to the appropriate reducer, as decided by user code
  • Same keys go to the same reducer
  • This is typically decided by a hash function
  • The number of partitions is equal to the number of reduce tasks
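
Hash partitioning can be sketched as follows. CRC32 is used here only as a stand-in for a stable hash function (Python's built-in `hash` for strings varies between runs); the names `partition` and `partition_pairs` are invented for this example.

```python
import zlib

def partition(key: str, num_reducers: int) -> int:
    """Map a key to a reducer index with a stable hash, modulo the reducer count."""
    return zlib.crc32(key.encode("utf-8")) % num_reducers

def partition_pairs(pairs, num_reducers):
    """Split intermediate pairs into one shard per reducer."""
    shards = [[] for _ in range(num_reducers)]
    for key, value in pairs:
        shards[partition(key, num_reducers)].append((key, value))
    return shards

pairs = [("apple", 1), ("pear", 1), ("apple", 1), ("kiwi", 1)]
shards = partition_pairs(pairs, 3)
# Both ("apple", 1) pairs land in the same shard, whichever index that is.
```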

What is the function of shuffle and sort ?

  • Downloads the partitioned map output assigned to this reducer
  • Sorts it by key so that the reducer can iterate over each group easily
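
On the reducer side, this could be sketched as a merge followed by a sort and a group-by. The function name `shuffle_and_sort` and the sample data are assumptions for illustration, and everything is held in memory, which a real framework would not do for large inputs.

```python
from itertools import groupby
from operator import itemgetter

def shuffle_and_sort(shards_for_reducer):
    """Merge the shards fetched from each mapper, sort by key, group the values."""
    merged = [pair for shard in shards_for_reducer for pair in shard]
    merged.sort(key=itemgetter(0))
    return [
        (key, [value for _, value in group])
        for key, group in groupby(merged, key=itemgetter(0))
    ]

from_mappers = [[("b", 1), ("a", 2)], [("a", 1), ("c", 4)]]
grouped = shuffle_and_sort(from_mappers)
# grouped == [('a', [2, 1]), ('b', [1]), ('c', [4])]
```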

What is the function of Reduce ?

  • Reduce works on one group at a time
  • It iterates through all the key-value pairs of that group and applies the reduce function on them
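
A minimal sketch of a reduce function for word count, applied to one group at a time. `reduce_group` and the sample `grouped` input are hypothetical.

```python
def reduce_group(key, values):
    """Iterate over all values of one key and aggregate them into a single result."""
    total = 0
    for value in values:
        total += value
    return key, total

grouped = [("a", [2, 1]), ("b", [1])]
results = [reduce_group(key, values) for key, values in grouped]
# results == [('a', 3), ('b', 1)]
```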

What does the OutputFormat do ? By default, it writes each key-value pair to the output file as one line, with the key and value separated by a tab
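
The tab-separated layout can be sketched like this. An in-memory buffer stands in for the output file, and `write_output` is an illustrative helper, not a Hadoop API.

```python
import io

def write_output(results, out):
    """Write one 'key<TAB>value' line per result."""
    for key, value in results:
        out.write(f"{key}\t{value}\n")

buffer = io.StringIO()
write_output([("a", 3), ("b", 1)], buffer)
output_text = buffer.getvalue()
# output_text == 'a\t3\nb\t1\n'
```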

What is usually the combiner class ? The Reducer class is often reused as the combiner class, which is safe when the reduce function is commutative and associative
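
Reusing the reducer as a combiner only gives correct results for some functions. The sketch below, with made-up values for two mappers, compares sum and mean: pre-aggregating per mapper preserves the sum but changes the mean.

```python
def mean(values):
    return sum(values) / len(values)

mapper_a = [1, 2, 3]   # values for one key seen by mapper A
mapper_b = [10]        # values for the same key seen by mapper B

# Sum: combining per mapper first gives the same answer.
sum_direct = sum(mapper_a + mapper_b)                    # 16
sum_combined = sum([sum(mapper_a), sum(mapper_b)])       # 16

# Mean: combining per mapper first gives a different answer.
mean_direct = mean(mapper_a + mapper_b)                  # 4.0
mean_combined = mean([mean(mapper_a), mean(mapper_b)])   # 6.0
```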

What is the difference between Reducer and Combiner ? The main difference is that the output of the Combiner is intermediate and is consumed by the Reducer, whereas the output of the Reducer is the final result and is written to the output file

What is the default partitioner ? HashPartitioner, which assigns each key to a partition using the key's hash code modulo the number of reduce tasks