
Introduction to Big Data

What is big data ?

  • Data that is too large to be stored in a single computer

Classify digital data ?

  • Structured
  • Semi-Structured
  • Unstructured

Characteristics of big data ?

  • Volume
  • Velocity
  • Variety

What are the different velocities in which data is collected ?

  • Batch
  • Periodic
  • Near Real-time
  • Real-time

What are the different varieties of data ?

  • Structured
  • Semi-Structured
  • Unstructured

What are the other 3 V’s of big data ?

  • Veracity
  • Volatility
  • Variability

Instead of the traditional 3 V’s, some include an additional V. What is that ?

  • Veracity

What are the two different ways to classify data analytics ?

  • basic, operational, advanced and monetized
  • analytics 1.0, 2.0, 3.0

Explain the type of analytics done during the three chronological classes of analytics ?

  • 1.0 : Descriptive
  • 2.0 : Diagnostic
  • 3.0 : Predictive and Prescriptive

Describe a typical data warehouse ? [Figure: diagram of a typical data warehouse; image not included]

Differentiate BI (Business Intelligence) and BD (Big Data) ?

BI | BD
Descriptive, Diagnostic | Predictive
Simple, clean, small datasets | Large, raw, complex, varied datasets
What happened and why | New insights

What are the different types of DBs used in big data ? Give an example for each one.

DB Type | Examples
Key-value | Redis, Riak
Document | MongoDB, CouchDB
Wide-column | HBase, Cassandra
Graph | Neo4j, InfiniteGraph

What is a wide-column database ? A database that stores data in rows and columns, where the names and format of the columns can vary from row to row within the same table
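
As a rough illustration of "columns can vary from row to row", here is a minimal Python sketch that models a wide-column table as nested dicts. The table, row keys and column names are hypothetical, and no real database API is used.

```python
# Hypothetical wide-column table modeled as {row_key: {column_name: value}}.
# Each row stores only the columns it actually has.
users = {
    "user:1": {"name": "Asha", "email": "asha@example.com"},
    "user:2": {"name": "Ravi", "phone": "555-0100", "city": "Pune"},
}

def columns_of(table, row_key):
    """Return the sorted column names present in one row."""
    return sorted(table[row_key])

def get_cell(table, row_key, column, default=None):
    """Read one cell; a column missing from a row yields the default."""
    return table[row_key].get(column, default)

print(columns_of(users, "user:1"))         # ['email', 'name']
print(columns_of(users, "user:2"))         # ['city', 'name', 'phone']
print(get_cell(users, "user:1", "phone"))  # None
```

The two rows live in the same table but carry different column sets, which a fixed relational schema would not allow without NULL-filled columns.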

Draw the Apache Hadoop ecosystem ? [Figure: Apache Hadoop ecosystem diagram; image not included]

What is in-memory analytics ? All processing is done on data held in RAM rather than on disk

What is In-Database Processing ? Integration of data analytics into data warehousing

What is a symmetric multiprocessing system ? A symmetric multiprocessor system (SMP) is a multiprocessor system with centralized shared memory called main memory (MM) operating under a single operating system with two or more homogeneous processors

What is tightly coupled multiprocessing ? Multiprocessing in which the processors share a common main memory; a symmetric multiprocessing system is an example

What are the three types of multiprocessing architectures ?

  • Shared Memory
  • Shared Disk
  • Shared Nothing

Explain Consistency ? All nodes should see the same data at the same time

Explain Availability ?

  • Node failures do not prevent survivors from continuing to operate
  • Every request receives a response indicating success or failure, even when some nodes are down.
  • Every client gets a response, regardless of the state of any individual node in the system.

Explain Partition Tolerance ?

  • The system continues to operate despite network partition failures.
  • Partition-tolerant systems can sustain any amount of network failure that doesn’t result in a failure of the entire network.
  • Data records are sufficiently replicated across combinations of nodes and networks to keep the system up through intermittent outages.

What is the CAP theorem ? A distributed system can guarantee only two of Consistency, Availability and Partition Tolerance at the same time

When to choose consistency and when to choose availability ? Give examples.

  • Choose availability over consistency when your business requirements allow some flexibility around when data in the system synchronizes.
  • Choose consistency over availability when your business requirements demand atomic reads and writes.

What are the different types of consistencies ?

  • Strong
  • Weak
  • Eventual
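
The difference between strong and eventual consistency can be sketched with a toy two-replica store. This is a simplified model under stated assumptions: `ReplicatedStore`, `read_secondary` and `sync` are hypothetical names, and replication is simulated in-process rather than over a network. Weak consistency is not modeled separately; it would correspond to having no guarantee about when, or whether, `sync` runs.

```python
class Replica:
    """One copy of a key-value store."""
    def __init__(self):
        self.data = {}

class ReplicatedStore:
    """Toy two-replica store (hypothetical; not a real database client)."""
    def __init__(self, strong):
        self.strong = strong
        self.primary = Replica()
        self.secondary = Replica()
        self.pending = []  # writes not yet applied to the secondary

    def write(self, key, value):
        self.primary.data[key] = value
        if self.strong:
            # Strong: the write reaches every replica before returning.
            self.secondary.data[key] = value
        else:
            # Eventual: replication to the secondary is deferred.
            self.pending.append((key, value))

    def read_secondary(self, key):
        return self.secondary.data.get(key)

    def sync(self):
        """Apply deferred writes in order so the replicas converge."""
        for key, value in self.pending:
            self.secondary.data[key] = value
        self.pending.clear()

strong = ReplicatedStore(strong=True)
strong.write("x", 1)
print(strong.read_secondary("x"))    # 1

eventual = ReplicatedStore(strong=False)
eventual.write("x", 1)
print(eventual.read_secondary("x"))  # None (stale read)
eventual.sync()
print(eventual.read_secondary("x"))  # 1 (replicas have converged)
```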

What are the different variants of eventual consistency ?

  • Monotonic Read
  • Monotonic Write
  • Read Your Writes
  • Causal consistency
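
Two of these variants, Read Your Writes and Monotonic Read, can be sketched as per-session guarantees layered on top of a lagging replica. The sketch below is one possible approach, not a description of any particular database: `VersionedReplica` and `Session` are hypothetical names, and versions are simple per-key counters.

```python
class VersionedReplica:
    """A replica that stores (version, value) per key."""
    def __init__(self):
        self.data = {}

    def put(self, key, version, value):
        self.data[key] = (version, value)

    def get(self, key):
        # Unknown keys are reported as version 0 with no value.
        return self.data.get(key, (0, None))

class Session:
    """Hypothetical client session enforcing read-your-writes and monotonic reads."""
    def __init__(self, primary, replica):
        self.primary = primary
        self.replica = replica
        self.min_version = {}  # lowest version this session may observe, per key

    def write(self, key, value):
        version = self.primary.get(key)[0] + 1
        self.primary.put(key, version, value)
        self.min_version[key] = version  # read-your-writes
        return version

    def read(self, key):
        version, value = self.replica.get(key)
        if version < self.min_version.get(key, 0):
            # Replica is behind what this session has already seen; use the primary.
            version, value = self.primary.get(key)
        # Monotonic reads: never go back to an older version later.
        self.min_version[key] = max(self.min_version.get(key, 0), version)
        return value

primary, replica = VersionedReplica(), VersionedReplica()
writer = Session(primary, replica)
writer.write("balance", 100)
print(writer.read("balance"))                     # 100 (own write is visible)
print(Session(primary, replica).read("balance"))  # None (other session may read stale data)
```

The writer always sees its own update even though the replica has not caught up, while a different session is still allowed a stale read until replication happens.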

What is BASE ? Basically Available, Soft state, Eventual consistency

Differentiate ACID and BASE ?

ACID | BASE
Availability less important | Weak consistency
Complex mechanisms | Simple and fast

Notes

CAP Theorem

In the event of a network partition, a distributed system can choose to be either consistent or available, but not both.

Simple Example

Reference

  • Suppose we have two ATMs
  • The supported operations are Deposit, Withdrawal, Check Balance
  • There is no central DB, and the ATMs are connected to each other by a network
  • Assume that a network partition occurs. Now the ATMs have to choose between being available and being consistent
  • Case 1: Availability
    • If the ATMs choose to be available then they will operate even though they can’t communicate with each other
    • Suppose your balance is 100 and both the ATMs have the same value now
    • A network partition occurs
    • You withdraw 80 from ATM A
    • You go to ATM B and withdraw 80. It will allow this transaction because, as far as it knows, your balance is still 100. The ATM chose to service this request even though it knew the other ATM was unreachable. You have now withdrawn 160 against a balance of 100, so the system is inconsistent
  • Case 2: Consistency
    • In this case the ATMs will refuse to serve requests (they become unavailable) until they can talk to one another again
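
The two cases can be simulated with a small Python sketch. It is a toy model under stated assumptions: the `ATM` class, the `"AP"`/`"CP"` mode labels and the `run` helper are hypothetical, and the partition is represented by a simple flag rather than a real network failure.

```python
class ATM:
    """Toy ATM holding its own copy of the balance (no central DB)."""
    def __init__(self, name, balance, mode):
        self.name = name
        self.balance = balance
        self.mode = mode          # "AP" = stay available, "CP" = stay consistent
        self.peer = None
        self.partitioned = False  # True when the peer is unreachable

    def withdraw(self, amount):
        if self.partitioned and self.mode == "CP":
            return "refused: peer unreachable"
        if amount > self.balance:
            return "refused: insufficient funds"
        self.balance -= amount
        if not self.partitioned:
            self.peer.balance = self.balance  # replicate to the other ATM
        return "ok"

def run(mode):
    """Withdraw 80 at each ATM during a partition; return results and total paid out."""
    a, b = ATM("A", 100, mode), ATM("B", 100, mode)
    a.peer, b.peer = b, a
    a.partitioned = b.partitioned = True  # network partition occurs
    results = [a.withdraw(80), b.withdraw(80)]
    total_withdrawn = 80 * results.count("ok")
    return results, total_withdrawn

print(run("AP"))  # (['ok', 'ok'], 160)
print(run("CP"))  # (['refused: peer unreachable', 'refused: peer unreachable'], 0)
```

In the available mode both withdrawals succeed and 160 is paid out from a balance of 100; in the consistent mode no money is lost, but neither ATM serves the customer during the partition.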