2023-11-20

Hive

What is Hive ?
- Data Warehousing Tool. Use to query structured data from HDFS
What does hive use for storage ?
- HDFS
What does hive use for execution ?
- MapReduce
What does hive use for storing metadata ?
- RDBMS
What are the suitable use cases for hive ?
- DWH applications
- process batch jobs on large data that is immutable
- Example: Application logs, Web logs
What are some of the features of Hive ?
- SQL-like
- Supports maps, structs and lists
- Supports SQL filter, group-by, order-by
- Custom data types and functions
What are the hive data units ?
- Databases
- Tables
- Partitions
- Buckets(clusters)
What does a partition do ?
- Logical separation of data based on some specified attribute
Assume that partition took place. A new record is inserted. Where is this stored ?
- Based on the value of the specified key, this data will go to that folder
Map the hive data units to their FS manifestations ?

Hive	FS
Tables	Directory
Partition	Subdirectory
Buckets	Files

Draw the hive architecture ?
What are some of the things stored in the meta store ?
- IDs of DB, Tables, Indices
- Creation time
- I/O format
What are the types of Meta Stores ?
- Embedded
- Local
- Remote
Draw a table illustrating where the hive process and the DB process runs in embedded, local and remote meta store ?

Meta store type	Location
Embedded	Same Process
Local	Separate Process
Remote	Separate JVM

What is the advantage of a Remote Meta Store in terms of security ?
- As they are ran on separate JVMs they are completely isolated. This means that the hive users are fire walled from DB user and vice versa
What are some of the numeric data types in hive ?
- TINYINT
- SMALLINT
- INT
- BIGINT
- FLOAT
- DOUBLE
What are the string data types in hive ?
- STRING
- VARCHAR
- CHAR
What are the collection data types in hive ?
- STRUCT
- MAP
- ARRAY
What are the different types of file formats in hive ?
- Text file
- Sequential File
- RCFile
The default file type of hive is ==Text file==
Sequential Files are ==flat files== which store ==binary key-value pairs==
RCFile stands for ==Record Columnar File==
What is the specialty of RCFile ?
- Stores data in columnar manner
- Aggregation is cheap
To create a database named “STUDENTS” with comments and database properties ?
- CREATE DATABASE students IF NOT EXISTS COMMENT "Student DB" WITH PROPERTIES ("creator" = "dbk")
How would you show dbproperties ?
- DESCRIBE DATABASE EXTENDED db_name
List all databases ?
- SHOW DATABASES
How would you change a dbproperty ?
- ALTER DATABASE students SET DBPROPERTIES ("creator" = "deebakkarthi")
What are the two kinds of tables in hive ?
- Managed Tables
- Self-managed/external tables
Draw a table illustrating the difference between managed and external tables ?

Property	Managed	External
Location	Warehouse	Some external location
Drop	Deleted	Retained

How would you create an external table ?
- CREATE EXTERNAL TABLE table_name(column_name COLUMN_TYPE) LOCATION location
How would you create a table in which the fields are separated by a tab ?
- CREATE TABLE student(roll INT, name STRING) ROW FORMAT DELIMITED FIELDS TERMINATED BY "\t"
Load a local tab separated file into a table ?
- LOAD DATA LOCAL INPATH "/root/data.tsv" OVERWRITE INTO TABLE student
Load a HDFS tab separated file into a table ?
- LOAD DATA INPATH "/data.tsv" OVERWRITE INTO TABLE student
Create a table for this input file
- 1001, John,c++:Java,Mark1!45:Mark2!67:Mark3!75
- 1002,Jack,c++:cloud,Mark1!55:Mark2!69:Mark3!85 ?
- CREATE TABLE student (roll INT, name STRING, subject ARRAY<STRING>, marks MAP<STRING, INT>) ROW FORMAT DELIMITED FIELD TERMINATED BY "," COLLECTION ITEMS TERMINATED BY ":" MAP KEYS TERMINATED BY "!"
What are the types of partitions offered by hive ?
- Static
- Dynamic
What is the difference between static and dynamic partition ?
- The values are know at compile time when it comes to static partitions
Create a table with a static partition ?
- CREATE TABLE table_name (...) PARITION BY (column_name COLUMN_TYPE)
Insert into a partition ?
- INSERT INTO TABLE table_name PARTITION (partition_name=partition_val) VALUES (...)
Add another partition to an pre existing table ?
- ALTER TABLE table_name ADD PARTITION (partition_name=partition_val)
How would you create a dynamically partitioned table ?
- SET hive.exec.dynamic.partition = true
- SET hive.exec.dynamic.partition.mode = nonstrict
Why should we set the partition mode to nonstrict ?
- Strict mode which is the default requires at least one static partition
How would you rename a partition ?
- ALTER TABLE table_name PARTITION partition_spec RENAME TO PARTITION partition_spec
How would you drop a partition ?
- ALTER TABLE table_name DROP PARTITION partition_spec
What is the main advantage of bucketing over partitioning ?
- Partition need to create a new partition for every unique value. This can lead to many partitions. This can be reduced by using buckets
How would you create a bucket ?
- CREATE TABLE table_name (...) CLUSTERED BY (column_name) INTO <n> BUCKETS Where n is the number of buckets needed
What is difference in terms of where to specify the column that divides the table in partition and buckets ?
- In partition the column to partition is specified after PARTITIONED BY
- But in bucketing the column is specified as a regular column after the table_name and then referenced by CLUSTERED BY column_name
How would you query from a single bucket ?
- SELECT (...) FROM table_name TABLESAMPLE(BUCKET <n> OUT OF <m> ON column_name) Where n is the bucket to access m is the total number of buckets column_name is the column on which the data is bucketed
How would you create a view ?
- CREATE VIEW view_name AS ...
How would you limit the number of rows returned from a query ?
- SELECT * FROM table_name LIMIT <n>
Write a HQL script to word count from a text file ?

CREATE TABLE doc (line STRING);
LOAD DATA LOCAL INPATH "./doc" OVERWRITE INTO TABLE doc;
CREATE TABLE wc AS SELECT word, count(1) as count FROM (SELECT EXPLODE(SPLIT(line, " ")) FROM doc) w GROUP BY word ORDER BY word;