2023-11-18

Week Two

Introduction to Hadoop

What is Hadoop

Framework to process huge amounts of data
Set of open-source programs

History

Born from the Nutch search engine

Components of Hadoop

HDFS
MapReduce
YARN

Challenges

Transactions
Non-parallelizable tasks
Dependencies between records
- For example, when one record has to be processed before another
Low-latency data access
Lots of small files
Intensive calculations on little data

Intro to MapReduce

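MapReduce is a programming model: a map step turns input records into key-value pairs, and a reduce step aggregates the values for each key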
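
A minimal sketch of the model in plain Python, using word count as the example. This is not Hadoop code: the function names (map_phase, shuffle, reduce_phase, word_count) are made up for illustration, and everything runs in one process, whereas Hadoop would spread the map and reduce work across nodes.

```python
from collections import defaultdict


def map_phase(line):
    """Map: emit a (word, 1) pair for every word in one input line."""
    for word in line.lower().split():
        yield word, 1


def shuffle(pairs):
    """Shuffle: group all emitted values by their key."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Reduce: collapse the values for one key into a single result."""
    return key, sum(values)


def word_count(lines):
    """Run map, shuffle, and reduce in sequence over the input lines."""
    pairs = (pair for line in lines for pair in map_phase(line))
    grouped = shuffle(pairs)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


if __name__ == "__main__":
    print(word_count(["big data", "big cluster"]))
```

Each map call only looks at one line and each reduce call only looks at one key, which is what lets a real cluster run them in parallel.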

Hadoop Ecosystem

Flume
- Collects and moves streaming data, such as logs, into Hadoop
Sqoop
- Transfers data between relational databases and Hadoop
- Can generate MapReduce code to perform the transfer

HDFS

Block
- Smallest unit of data storage in HDFS
- Larger files are broken down into blocks
- Block size is usually 64 MB or 128 MB
- A file smaller than the block size is stored as is
- There is no need to pad smaller data up to the block size
Node
- A single computer that stores data
- Hadoop follows a primary-secondary architecture: the NameNode is the primary and the DataNodes are the secondaries
- The NameNode stores the metadata and instructs the DataNodes what to do
- The NameNode prefers DataNodes that are close by or in the same rack
- This is called rack awareness in HDFS
- Data replication is also done with rack awareness
- This is done by keeping track of each DataNode's rack ID
Read/Write
- Write once, read many
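
The block and rack notes above can be sketched in plain Python. This is an illustration, not the HDFS API: split_into_blocks and choose_replica_nodes are hypothetical helpers, the 128 MB block size is one of the two usual values, and the placement rule follows the commonly documented default for three replicas (first on the writer's node, second on a different rack, third on another node in that second rack). Real HDFS picks among candidates at random; the sketch takes the first in sorted order so the result is repeatable.

```python
MB = 1024 * 1024


def split_into_blocks(file_size, block_size=128 * MB):
    """Return the sizes (in bytes) of the blocks a file would occupy."""
    if file_size < 0 or block_size <= 0:
        raise ValueError("file_size must be >= 0 and block_size must be > 0")
    full_blocks, remainder = divmod(file_size, block_size)
    sizes = [block_size] * full_blocks
    if remainder:
        # The last block is stored at its real size, with no padding.
        sizes.append(remainder)
    return sizes


def choose_replica_nodes(writer_node, rack_of):
    """Pick three DataNodes for one block, given a node -> rack ID mapping."""
    local_rack = rack_of[writer_node]
    # Second replica: a node on a different rack than the writer.
    remote_nodes = sorted(n for n, r in rack_of.items() if r != local_rack)
    if not remote_nodes:
        raise ValueError("need at least two racks")
    second = remote_nodes[0]
    # Third replica: another node on the same rack as the second replica.
    third_candidates = sorted(
        n for n, r in rack_of.items() if r == rack_of[second] and n != second
    )
    if not third_candidates:
        raise ValueError("need two nodes on the remote rack")
    return [writer_node, second, third_candidates[0]]


if __name__ == "__main__":
    print([size // MB for size in split_into_blocks(300 * MB)])
    racks = {"dn1": "rackA", "dn2": "rackA", "dn3": "rackB", "dn4": "rackB"}
    print(choose_replica_nodes("dn1", racks))
```

A 300 MB file ends up as two full 128 MB blocks plus one 44 MB block, and placing replicas on two racks means the block survives the loss of a whole rack.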

Hive

Data warehouse for big data, queried with an SQL-like language (HiveQL)

HBase

Column-oriented non-relational database
Suited to write-heavy tasks
Column families have to be predefined. The columns in a family are stored together
The columns in a column family are flexible. We can add columns to a family at any time
In the above picture, patient_details, heart_rate, and timestamp are the column families. patient_details has two columns: name and age
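
The column family rules can be modelled with nested dictionaries. This is a plain-Python sketch and not the HBase client API: new_table and put are hypothetical helpers, the family names follow the example in these notes, and the row key and values are invented sample data.

```python
def new_table(families):
    """Create a table whose column families are fixed up front."""
    return {"families": set(families), "rows": {}}


def put(table, row_key, family, column, value):
    """Write one cell; the family must already exist, the column need not."""
    if family not in table["families"]:
        raise KeyError(f"unknown column family: {family}")
    row = table["rows"].setdefault(row_key, {})
    # Columns inside a family are created on first write.
    row.setdefault(family, {})[column] = value


if __name__ == "__main__":
    table = new_table(["patient_details", "heart_rate", "timestamp"])
    put(table, "row1", "patient_details", "name", "Asha")
    put(table, "row1", "patient_details", "age", 42)
    # A new column can be added to an existing family at any time.
    put(table, "row1", "patient_details", "blood_group", "O+")
    print(table["rows"]["row1"]["patient_details"])
```

Writing to a family that was not declared, for example put(table, "row1", "billing", "amount", 10), raises KeyError, which mirrors the rule that families are predefined while columns are not.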