..

Week Two

Introduction to Hadoop

What is Hadoop

  • Framework to process huge amounts of data
  • Set of open-source programs

History

  • Born from the Nutch search engine

Components of Hadoop

  • HDFS
  • MapReduce
  • YARN

Challenges

  • Transactions
  • Non-parallel tasks
  • Dependencies
    • For example, when one record has to be processed before another
  • Low-latency data access
  • Lots of small files
  • Intensive calculations with little data

Intro to MapReduce

  • MapReduce is a programming model
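As a rough illustration of the programming model, here is a minimal single-machine word-count sketch in plain Python. It does not use Hadoop itself; the function names (map_phase, shuffle, reduce_phase, word_count) are hypothetical and only mirror the usual map, shuffle, and reduce stages.

```python
from collections import defaultdict
from itertools import chain


def map_phase(line):
    """Map: emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle(pairs):
    """Shuffle: group all emitted values by their key."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Reduce: combine all values for one key into a single result."""
    return key, sum(values)


def word_count(lines):
    """Run map, shuffle, and reduce in sequence over a list of lines."""
    mapped = chain.from_iterable(map_phase(line) for line in lines)
    grouped = shuffle(mapped)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


if __name__ == "__main__":
    print(word_count(["big data big cluster", "data node"]))
```

In a real cluster the map and reduce calls would run in parallel on different nodes; here they run one after another, so the sketch shows only the data flow.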

Hadoop Ecosystem

  • Flume
    • Collects and moves streaming data, such as logs, into Hadoop
  • Sqoop
    • Transfers data between relational databases and Hadoop
    • Can generate MapReduce code

HDFS

  • Block
    • Smallest unit of data in HDFS
    • Larger data is broken down into blocks
    • Usually 64 MB or 128 MB
    • Data smaller than a block is stored at its actual size
    • No need to pad smaller data up to the block size
  • Node
    • A single computer that stores data
    • Hadoop follows a primary-secondary architecture
    • The NameNode (primary) stores the metadata and instructs the DataNodes (secondaries) what to do
    • The NameNode prefers DataNodes that are close by or in the same rack
    • This is called rack awareness in HDFS
    • Data replication is also rack-aware
    • This is done by keeping track of each node's rack ID
  • Read/Write
    • Write once, read many
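The block and rack-awareness points above can be sketched in a few lines of Python. This is a simplified illustration, not HDFS code: split_into_blocks and order_by_rack are hypothetical helpers, the 128 MB default is just one of the block sizes mentioned above, and a real NameNode uses a fuller network topology than a single rack ID comparison.

```python
MB = 1024 * 1024  # bytes per megabyte


def split_into_blocks(file_size, block_size=128 * MB):
    """Return the size in bytes of each block a file would occupy."""
    if file_size < 0 or block_size <= 0:
        raise ValueError("file_size must be >= 0 and block_size must be > 0")
    full_blocks, remainder = divmod(file_size, block_size)
    sizes = [block_size] * full_blocks
    if remainder:
        # The final block holds only the leftover bytes; it is not padded.
        sizes.append(remainder)
    return sizes


def order_by_rack(client_rack_id, datanodes):
    """Sort DataNodes so that nodes on the client's rack come first.

    sorted() is stable, so nodes keep their original order within each group.
    """
    return sorted(datanodes, key=lambda node: node["rack_id"] != client_rack_id)


if __name__ == "__main__":
    sizes = split_into_blocks(300 * MB)
    print([size // MB for size in sizes])  # [128, 128, 44]

    nodes = [
        {"name": "dn1", "rack_id": "rack-2"},
        {"name": "dn2", "rack_id": "rack-1"},
        {"name": "dn3", "rack_id": "rack-2"},
    ]
    print([node["name"] for node in order_by_rack("rack-1", nodes)])
```

A 300 MB file therefore occupies two full 128 MB blocks plus one 44 MB block, and a 10 MB file occupies a single 10 MB block.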

Hive

  • RDBMS-like data warehouse for big data, queried with a SQL-like language (HiveQL)

HBase

  • Column-oriented non-relational database
  • Suited to write-heavy workloads
  • Pasted image 20231118110039
  • Column families have to be defined in advance. The columns in a family are stored together
  • The columns within a column family are flexible. We can add columns to a family at any time
  • In the above picture, patient_details, heart_rate, and timestamp are the column families. patient_details has two columns: name and age
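To make the column-family idea concrete, here is a small in-memory sketch in Python. It is not the HBase API: ColumnFamilyTable, put, and get are hypothetical names, and the row key and values are made-up sample data. It only models the two rules from the notes: families are fixed when the table is created, while columns inside a family can be added at any time.

```python
class ColumnFamilyTable:
    """Toy model of a table whose column families are fixed up front."""

    def __init__(self, column_families):
        # Column families must be declared when the table is created.
        self.column_families = frozenset(column_families)
        self.rows = {}

    def put(self, row_key, family, column, value):
        """Store one value; the family must exist, the column need not."""
        if family not in self.column_families:
            raise KeyError(f"unknown column family: {family}")
        row = self.rows.setdefault(
            row_key, {name: {} for name in self.column_families}
        )
        row[family][column] = value

    def get(self, row_key, family):
        """Return a copy of all columns stored in one family of one row."""
        return dict(self.rows[row_key][family])


if __name__ == "__main__":
    table = ColumnFamilyTable(["patient_details", "heart_rate", "timestamp"])
    table.put("patient-1", "patient_details", "name", "Alice")
    table.put("patient-1", "patient_details", "age", 42)
    # A new column can be added to an existing family without a schema change.
    table.put("patient-1", "patient_details", "blood_group", "O+")
    print(table.get("patient-1", "patient_details"))
```

Writing to a family that was not declared, for example table.put("patient-1", "billing", "amount", 10), raises KeyError in this sketch, which mirrors the need to predefine column families.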