Hadoop

Questions

What are the two components of Hadoop ?

  • HDFS
  • MapReduce

What are the two types of nodes ?

  • Master Node
  • Slave Node

What is the role of master HDFS ?

  • Partition Data
  • Keep track of positions

What is the role of master MapReduce ? Schedule work

What are the HDFS daemons ?

  • Name node
  • Data node
  • Secondary Name node

How many name nodes per cluster ? 1

How many data nodes per cluster ? multiple

HDFS breaks large data into smaller pieces called ? Blocks

What is the default block size ? 64MB in Hadoop 1.x (128MB from Hadoop 2.x onward)
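
As a rough illustration of how a file is broken into blocks, here is a minimal Python sketch. It assumes the 64MB block size from these notes; the function name `split_into_blocks` is made up for this example and is not part of any Hadoop API.

```python
BLOCK_SIZE_MB = 64  # block size from the notes; configurable on a real cluster


def split_into_blocks(file_size_mb, block_size_mb=BLOCK_SIZE_MB):
    """Return the sizes (in MB) of the blocks a file would be split into."""
    if file_size_mb <= 0:
        return []
    full_blocks = file_size_mb // block_size_mb
    remainder = file_size_mb % block_size_mb
    sizes = [block_size_mb] * full_blocks
    if remainder:
        # The last block only occupies as much space as the data it holds.
        sizes.append(remainder)
    return sizes


print(split_into_blocks(200))  # [64, 64, 64, 8]
```

A 200MB file therefore needs four blocks: three full ones and a final 8MB block.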

What is the identifier the NN uses to tell which rack a data node belongs to ? Rack ID

What is a rack ? Set of data nodes in a cluster

What is the primary job of a name node ? Managing the File System Namespace

What is a File System Namespace ? The hierarchy of files and directories in a cluster

What is contained in an FsImage ?

  • Mapping of block to file
  • File metadata

What is the replication factor ? The number of copies of each block of a file that HDFS stores

Where is the replication factor stored ? Name node
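
To make the cost of replication concrete, the sketch below estimates the raw disk space a file consumes across the cluster. The default of 3 used here is an assumption (it is the usual `dfs.replication` default) and the helper name `raw_storage_mb` is hypothetical.

```python
def raw_storage_mb(file_size_mb, replication_factor=3):
    """Raw cluster storage used by a file: every block is stored once per replica."""
    if replication_factor < 1:
        raise ValueError("replication factor must be at least 1")
    return file_size_mb * replication_factor


print(raw_storage_mb(200))     # 600
print(raw_storage_mb(200, 2))  # 400
```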

What are the two files used by a NN ?

  • EditLog
  • FsImage

What happens when NN starts ?

  1. It reads the FsImage and EditLog from local disk and applies all transactions from the EditLog to the in-memory representation of the FsImage.
  2. Then it flushes the new version of the FsImage to disk and truncates the old EditLog, because its changes are now part of the FsImage.
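
The two startup steps can be sketched as a small simulation. This is only a toy model: the namespace is a plain dict, and the edit operations (`create`, `delete`, `set_replication`) and function names are invented for illustration rather than taken from the real EditLog format.

```python
import copy


def apply_edit(namespace, edit):
    """Apply one logged transaction to the in-memory namespace."""
    op, path = edit[0], edit[1]
    if op == "create":
        namespace[path] = {"replication": edit[2]}
    elif op == "delete":
        namespace.pop(path, None)
    elif op == "set_replication":
        namespace[path]["replication"] = edit[2]
    else:
        raise ValueError(f"unknown operation: {op}")


def start_name_node(fs_image, edit_log):
    """Replay the EditLog on top of the FsImage, then checkpoint."""
    # Step 1: load the FsImage and replay every logged transaction.
    namespace = copy.deepcopy(fs_image)
    for edit in edit_log:
        apply_edit(namespace, edit)
    # Step 2: write out a fresh FsImage and truncate the EditLog.
    new_fs_image = copy.deepcopy(namespace)
    new_edit_log = []
    return namespace, new_fs_image, new_edit_log


fs_image = {"/a.txt": {"replication": 3}}
edit_log = [
    ("create", "/b.txt", 3),
    ("set_replication", "/a.txt", 2),
    ("delete", "/b.txt"),
]
namespace, new_image, new_log = start_name_node(fs_image, edit_log)
print(new_image)  # {'/a.txt': {'replication': 2}}
print(new_log)    # []
```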

Explain with a diagram when data replication happens in HDFS ? ss 2023-10-05 at 8.12.54 PM.png

Is the secondary name node a backup name node ? No. It is a separate helper daemon that keeps copies of both the EditLog and the FsImage and merges them periodically (checkpointing) so that the EditLog stays a reasonable size. It is usually better to run it on a node different from the name node

How does a client read work in HDFS ? ss 2023-10-05 at 8.15.54 PM.png

Explain the HDFS Replica Strategy ?

  • First replica: the same node as the writer
  • Second replica: a node in a different rack
  • Third replica: another node in the same rack as the second replica
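
For a replication factor of 3, the usual rack-aware placement (writer's node, then a node on another rack, then a second node on that other rack) can be sketched as follows. The `topology` dict, the node names and the function `place_replicas` are assumptions made for this example, not Hadoop's actual placement code.

```python
import random


def place_replicas(writer_node, topology, rng=random):
    """Pick three target nodes. topology maps rack id -> list of node names."""
    local_rack = None
    for rack, nodes in topology.items():
        if writer_node in nodes:
            local_rack = rack
            break
    if local_rack is None:
        raise ValueError(f"{writer_node} is not in the topology")

    # Candidate remote racks need two nodes to hold replicas 2 and 3.
    remote_racks = [
        rack for rack, nodes in topology.items()
        if rack != local_rack and len(nodes) >= 2
    ]
    if not remote_racks:
        raise ValueError("need another rack with at least two nodes")

    remote_rack = rng.choice(remote_racks)
    second, third = rng.sample(topology[remote_rack], 2)
    return [writer_node, second, third]


topology = {"rack1": ["dn1", "dn2"], "rack2": ["dn3", "dn4"]}
print(place_replicas("dn1", topology))
```

Keeping two replicas on one remote rack survives the loss of a whole rack while limiting cross-rack write traffic to a single transfer.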

How does a client write work in HDFS ? ss 2023-10-05 at 8.18.29 PM.png

How would you create a new folder in HDFS ? hdfs dfs -mkdir /sample

How would you copy a file from local FS to HDFS ? hdfs dfs -put ./sample.txt /sample/sample.txt

How would you copy a file from HDFS to local FS ? hdfs dfs -get /sample/sample.txt sample.txt

What are the two special features of Hadoop ?

  • Data Replication: The client is automatically redirected to the nearest replica to ensure maximum performance. The client doesn’t need to keep track of the blocks.
  • Data Pipeline: The client just writes to the first Data Node in the pipeline. The changes are automatically forwarded to the next node, which forwards them to the next node, and so on. The process continues until all the replicas are updated.
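
The data pipeline idea can be illustrated with a short simulation in which each node stores the block and hands it to the next node. The `storage` dict and the function `pipeline_write` are hypothetical; the returned list models acknowledgements travelling back towards the client, which is an assumption added for the sketch.

```python
def pipeline_write(block, pipeline, storage):
    """Store the block on the first node, then forward it down the pipeline.

    Returns the nodes in the order their acknowledgements come back.
    """
    if not pipeline:
        return []
    head, rest = pipeline[0], pipeline[1:]
    storage.setdefault(head, []).append(block)   # this node keeps a copy
    acks = pipeline_write(block, rest, storage)  # and forwards downstream
    return acks + [head]                         # downstream acks arrive first


storage = {}
acks = pipeline_write("blk_1", ["dn1", "dn2", "dn3"], storage)
print(storage)  # {'dn1': ['blk_1'], 'dn2': ['blk_1'], 'dn3': ['blk_1']}
print(acks)     # ['dn3', 'dn2', 'dn1']
```

The client only ever talks to `dn1`; the other copies are made by the data nodes themselves.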

What are the two phases in MapReduce ?

  • Map
  • Reduce

What are the daemons used in MapReduce ?

  • JobTracker
  • TaskTracker

Where are the JobTracker and TaskTracker executed ? JobTracker is executed in the Master Node and TaskTracker is executed in the Slave Node

Draw a diagram showing the interaction between JobTracker and TaskTracker ? ss 2023-10-05 at 8.26.46 PM.png

Explain the MapReduce Workflow ? ss 2023-10-05 at 8.30.09 PM.png

What is the hidden phase in between map and reduce ? Shuffle and Sort
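
The map, shuffle-and-sort and reduce phases can be sketched in plain Python with the classic word count. This runs in a single process and only mirrors the data flow; it does not use the Hadoop API, and the function names are chosen for this example.

```python
from itertools import groupby
from operator import itemgetter


def map_phase(line):
    """Map: emit a (word, 1) pair for every word in the line."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle_and_sort(pairs):
    """Shuffle and sort: order pairs by key and group the values per key."""
    ordered = sorted(pairs, key=itemgetter(0))
    return [
        (key, [value for _, value in group])
        for key, group in groupby(ordered, key=itemgetter(0))
    ]


def reduce_phase(key, values):
    """Reduce: collapse all values of one key into a single result."""
    return key, sum(values)


def word_count(lines):
    mapped = [pair for line in lines for pair in map_phase(line)]
    grouped = shuffle_and_sort(mapped)
    return dict(reduce_phase(key, values) for key, values in grouped)


print(word_count(["big data big cluster", "data node"]))
# {'big': 2, 'cluster': 1, 'data': 2, 'node': 1}
```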

What are the 5 limitations of Hadoop Architecture ?

  • One NameNode is responsible for the entire cluster
  • MapReduce handles both cluster resource management and data processing
  • Only suitable for batch-oriented MapReduce tasks
  • Not suitable for interactive analysis
  • Not suitable for ML, graph processing, or memory-intensive tasks

What is the full form of YARN ? Yet Another Resource Negotiator

What is the primary reason for introducing YARN ? Separate resource management from data processing

What are the two main components of YARN ?

  • ResourceManager
  • NodeManager

What are the components of the ResourceManager ?

  • Scheduler
  • Applications Manager

What are the components of a NodeManager ?

  • Container
  • ApplicationMaster

What are the functions of the Applications Manager ? The Applications Manager is responsible for accepting job submissions, negotiating the first container for executing the application-specific ApplicationMaster, and providing the service for restarting the ApplicationMaster container on failure.

What are the functions of the ApplicationMaster ? The per-application ApplicationMaster is responsible for negotiating appropriate resource containers from the Scheduler, tracking their status, and monitoring their progress.
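
The division of labour between the Scheduler, the Applications Manager and the ApplicationMaster can be sketched with a toy model. The classes below are invented for illustration, containers are reduced to simple slot counts per node, and nothing here reflects the real YARN API.

```python
class Scheduler:
    """Hands out free container slots; knows nothing about applications."""

    def __init__(self, free_slots):
        self.free = dict(free_slots)  # node name -> free container slots

    def allocate(self, count):
        granted = []
        for node in self.free:
            while self.free[node] > 0 and len(granted) < count:
                self.free[node] -= 1
                granted.append(node)
        return granted


class ApplicationMaster:
    """Per-application: negotiates task containers from the Scheduler."""

    def __init__(self, app_id, scheduler):
        self.app_id = app_id
        self.scheduler = scheduler
        self.containers = []

    def request_containers(self, num_tasks):
        self.containers = self.scheduler.allocate(num_tasks)
        return self.containers


class ApplicationsManager:
    """Accepts submissions and obtains the first container for the AM."""

    def __init__(self, scheduler):
        self.scheduler = scheduler

    def submit(self, app_id):
        first = self.scheduler.allocate(1)
        if not first:
            raise RuntimeError("no free container for the ApplicationMaster")
        return ApplicationMaster(app_id, self.scheduler), first[0]


scheduler = Scheduler({"nm1": 2, "nm2": 2})
am, am_node = ApplicationsManager(scheduler).submit("app-1")
print(am_node)                   # nm1
print(am.request_containers(3))  # ['nm1', 'nm2', 'nm2']
```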

Draw the YARN workflow ? Pasted image 20231006054805

Explain Pig ? Pig is a data flow system for Hadoop. Its scripting language, Pig Latin, can be used as an alternative to writing MapReduce jobs directly

Explain Hive ? Hive is a Data Warehousing Layer on top of Hadoop. Analysis and queries can be done using an SQL-like language

Explain Sqoop ? Sqoop is a tool that transfers data between Hadoop and relational databases. With Sqoop, you can import data from an RDBMS into HDFS and export it back

Explain HBase ? HBase is a column-oriented NoSQL database for Hadoop. It is used to store tables with billions of rows and millions of columns, and it provides random read/write operations.
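
To give a feel for random reads and writes by row key, here is a minimal in-memory sketch. It models a table as row key -> "family:qualifier" -> value and leaves out versions and timestamps; the class `TinyTable` and its methods are hypothetical and are not the HBase client API.

```python
from collections import defaultdict


class TinyTable:
    """Toy table: each row holds only the columns that were written to it."""

    def __init__(self):
        self.rows = defaultdict(dict)

    def put(self, row_key, column, value):
        """Random write: set one cell, addressed by row key and column."""
        self.rows[row_key][column] = value

    def get(self, row_key, column=None):
        """Random read: one cell, or the whole row when no column is given."""
        row = self.rows.get(row_key, {})
        return dict(row) if column is None else row.get(column)


table = TinyTable()
table.put("user1", "info:name", "Ada")
table.put("user1", "info:email", "ada@example.com")
table.put("user2", "info:name", "Lin")

print(table.get("user1", "info:name"))  # Ada
print(table.get("user2"))               # {'info:name': 'Lin'}
```

Rows can have different sets of columns, which is why a table with millions of possible columns stays sparse.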