Hive
-
What is Hive ?
- Data Warehousing Tool. Use to query structured data from HDFS
-
What does hive use for storage ?
- HDFS
-
What does hive use for execution ?
- MapReduce
-
What does hive use for storing metadata ?
- RDBMS
-
What are the suitable use cases for hive ?
- DWH applications
- process batch jobs on large data that is immutable
- Example: Application logs, Web logs
-
What are some of the features of Hive ?
- SQL-like
- Supports maps, structs and lists
- Supports SQL filter, group-by, order-by
- Custom data types and functions
-
What are the hive data units ?
- Databases
- Tables
- Partitions
- Buckets(clusters)
-
What does a partition do ?
- Logical separation of data based on some specified attribute
-
Assume that partition took place. A new record is inserted. Where is this stored ?
- Based on the value of the specified key, this data will go to that folder
-
Map the hive data units to their FS manifestations ?
Hive | FS |
---|---|
Tables | Directory |
Partition | Subdirectory |
Buckets | Files |
-
Draw the hive architecture ?
-
What are some of the things stored in the meta store ?
- IDs of DB, Tables, Indices
- Creation time
- I/O format
-
What are the types of Meta Stores ?
- Embedded
- Local
- Remote
-
Draw a table illustrating where the hive process and the DB process runs in embedded, local and remote meta store ?
Meta store type | Location |
---|---|
Embedded | Same Process |
Local | Separate Process |
Remote | Separate JVM |
-
What is the advantage of a Remote Meta Store in terms of security ?
- As they are ran on separate JVMs they are completely isolated. This means that the hive users are fire walled from DB user and vice versa
-
What are some of the numeric data types in hive ?
TINYINT
SMALLINT
INT
BIGINT
FLOAT
DOUBLE
-
What are the string data types in hive ?
STRING
VARCHAR
CHAR
-
What are the collection data types in hive ?
STRUCT
MAP
ARRAY
-
What are the different types of file formats in hive ?
- Text file
- Sequential File
- RCFile
-
The default file type of hive is ==Text file==
-
Sequential Files are ==flat files== which store ==binary key-value pairs==
-
RCFile stands for ==Record Columnar File==
-
What is the specialty of RCFile ?
- Stores data in columnar manner
- Aggregation is cheap
-
To create a database named “STUDENTS” with comments and database properties ?
CREATE DATABASE students IF NOT EXISTS COMMENT "Student DB" WITH PROPERTIES ("creator" = "dbk")
-
How would you show dbproperties ?
DESCRIBE DATABASE EXTENDED db_name
-
List all databases ?
SHOW DATABASES
-
How would you change a dbproperty ?
ALTER DATABASE students SET DBPROPERTIES ("creator" = "deebakkarthi")
-
What are the two kinds of tables in hive ?
- Managed Tables
- Self-managed/external tables
-
Draw a table illustrating the difference between managed and external tables ?
Property | Managed | External |
---|---|---|
Location | Warehouse | Some external location |
Drop | Deleted | Retained |
-
How would you create an external table ?
CREATE EXTERNAL TABLE table_name(column_name COLUMN_TYPE) LOCATION location
-
How would you create a table in which the fields are separated by a tab ?
CREATE TABLE student(roll INT, name STRING) ROW FORMAT DELIMITED FIELDS TERMINATED BY "\t"
-
Load a local tab separated file into a table ?
LOAD DATA LOCAL INPATH "/root/data.tsv" OVERWRITE INTO TABLE student
-
Load a HDFS tab separated file into a table ?
LOAD DATA INPATH "/data.tsv" OVERWRITE INTO TABLE student
-
Create a table for this input file
-
1001, John,c++:Java,Mark1!45:Mark2!67:Mark3!75
-
1002,Jack,c++:cloud,Mark1!55:Mark2!69:Mark3!85
? -
CREATE TABLE student (roll INT, name STRING, subject ARRAY<STRING>, marks MAP<STRING, INT>) ROW FORMAT DELIMITED FIELD TERMINATED BY "," COLLECTION ITEMS TERMINATED BY ":" MAP KEYS TERMINATED BY "!"
-
-
What are the types of partitions offered by hive ?
- Static
- Dynamic
-
What is the difference between static and dynamic partition ?
- The values are know at compile time when it comes to static partitions
-
Create a table with a static partition ?
CREATE TABLE table_name (...) PARITION BY (column_name COLUMN_TYPE)
-
Insert into a partition ?
INSERT INTO TABLE table_name PARTITION (partition_name=partition_val) VALUES (...)
-
Add another partition to an pre existing table ?
ALTER TABLE table_name ADD PARTITION (partition_name=partition_val)
-
How would you create a dynamically partitioned table ?
SET hive.exec.dynamic.partition = true
SET hive.exec.dynamic.partition.mode = nonstrict
-
Why should we set the partition mode to nonstrict ?
- Strict mode which is the default requires at least one static partition
-
How would you rename a partition ?
ALTER TABLE table_name PARTITION partition_spec RENAME TO PARTITION partition_spec
-
How would you drop a partition ?
ALTER TABLE table_name DROP PARTITION partition_spec
-
What is the main advantage of bucketing over partitioning ?
- Partition need to create a new partition for every unique value. This can lead to many partitions. This can be reduced by using buckets
-
How would you create a bucket ?
CREATE TABLE table_name (...) CLUSTERED BY (column_name) INTO <n> BUCKETS
Where n is the number of buckets needed
-
What is difference in terms of where to specify the column that divides the table in partition and buckets ?
- In partition the column to partition is specified after
PARTITIONED BY
- But in bucketing the column is specified as a regular column after the
table_name
and then referenced byCLUSTERED BY column_name
- In partition the column to partition is specified after
-
How would you query from a single bucket ?
SELECT (...) FROM table_name TABLESAMPLE(BUCKET <n> OUT OF <m> ON column_name)
Where n is the bucket to access m is the total number of buckets column_name is the column on which the data is bucketed
-
How would you create a view ?
CREATE VIEW view_name AS ...
-
How would you limit the number of rows returned from a query ?
SELECT * FROM table_name LIMIT <n>
-
Write a HQL script to word count from a text file ?
CREATE TABLE doc (line STRING);
LOAD DATA LOCAL INPATH "./doc" OVERWRITE INTO TABLE doc;
CREATE TABLE wc AS SELECT word, count(1) as count FROM (SELECT EXPLODE(SPLIT(line, " ")) FROM doc) w GROUP BY word ORDER BY word;