..

Hive

  • What is Hive ?

    • Data Warehousing Tool. Use to query structured data from HDFS
  • What does hive use for storage ?

    • HDFS
  • What does hive use for execution ?

    • MapReduce
  • What does hive use for storing metadata ?

    • RDBMS
  • What are the suitable use cases for hive ?

    • DWH applications
    • process batch jobs on large data that is immutable
    • Example: Application logs, Web logs
  • What are some of the features of Hive ?

    • SQL-like
    • Supports maps, structs and lists
    • Supports SQL filter, group-by, order-by
    • Custom data types and functions
  • What are the hive data units ?

    • Databases
    • Tables
    • Partitions
    • Buckets(clusters)
  • What does a partition do ?

    • Logical separation of data based on some specified attribute
  • Assume that partition took place. A new record is inserted. Where is this stored ?

    • Based on the value of the specified key, this data will go to that folder
  • Map the hive data units to their FS manifestations ?

Hive FS
Tables Directory
Partition Subdirectory
Buckets Files
  • Draw the hive architecture ? ss 2023-10-06 at 9.34.26 AM.png

  • What are some of the things stored in the meta store ?

    • IDs of DB, Tables, Indices
    • Creation time
    • I/O format
  • What are the types of Meta Stores ?

    • Embedded
    • Local
    • Remote
  • Draw a table illustrating where the hive process and the DB process runs in embedded, local and remote meta store ?

Meta store type Location
Embedded Same Process
Local Separate Process
Remote Separate JVM
  • What is the advantage of a Remote Meta Store in terms of security ?

    • As they are ran on separate JVMs they are completely isolated. This means that the hive users are fire walled from DB user and vice versa
  • What are some of the numeric data types in hive ?

    • TINYINT
    • SMALLINT
    • INT
    • BIGINT
    • FLOAT
    • DOUBLE
  • What are the string data types in hive ?

    • STRING
    • VARCHAR
    • CHAR
  • What are the collection data types in hive ?

    • STRUCT
    • MAP
    • ARRAY
  • What are the different types of file formats in hive ?

    • Text file
    • Sequential File
    • RCFile
  • The default file type of hive is ==Text file==

  • Sequential Files are ==flat files== which store ==binary key-value pairs==

  • RCFile stands for ==Record Columnar File==

  • What is the specialty of RCFile ?

    • Stores data in columnar manner
    • Aggregation is cheap
  • To create a database named “STUDENTS” with comments and database properties ?

    • CREATE DATABASE students IF NOT EXISTS COMMENT "Student DB" WITH PROPERTIES ("creator" = "dbk")
  • How would you show dbproperties ?

    • DESCRIBE DATABASE EXTENDED db_name
  • List all databases ?

    • SHOW DATABASES
  • How would you change a dbproperty ?

    • ALTER DATABASE students SET DBPROPERTIES ("creator" = "deebakkarthi")
  • What are the two kinds of tables in hive ?

    • Managed Tables
    • Self-managed/external tables
  • Draw a table illustrating the difference between managed and external tables ?

Property Managed External
Location Warehouse Some external location
Drop Deleted Retained
  • How would you create an external table ?

    • CREATE EXTERNAL TABLE table_name(column_name COLUMN_TYPE) LOCATION location
  • How would you create a table in which the fields are separated by a tab ?

    • CREATE TABLE student(roll INT, name STRING) ROW FORMAT DELIMITED FIELDS TERMINATED BY "\t"
  • Load a local tab separated file into a table ?

    • LOAD DATA LOCAL INPATH "/root/data.tsv" OVERWRITE INTO TABLE student
  • Load a HDFS tab separated file into a table ?

    • LOAD DATA INPATH "/data.tsv" OVERWRITE INTO TABLE student
  • Create a table for this input file

    • 1001, John,c++:Java,Mark1!45:Mark2!67:Mark3!75

    • 1002,Jack,c++:cloud,Mark1!55:Mark2!69:Mark3!85 ?

    • CREATE TABLE student (roll INT, name STRING, subject ARRAY<STRING>, marks MAP<STRING, INT>) ROW FORMAT DELIMITED FIELD TERMINATED BY "," COLLECTION ITEMS TERMINATED BY ":" MAP KEYS TERMINATED BY "!"

  • What are the types of partitions offered by hive ?

    • Static
    • Dynamic
  • What is the difference between static and dynamic partition ?

    • The values are know at compile time when it comes to static partitions
  • Create a table with a static partition ?

    • CREATE TABLE table_name (...) PARITION BY (column_name COLUMN_TYPE)
  • Insert into a partition ?

    • INSERT INTO TABLE table_name PARTITION (partition_name=partition_val) VALUES (...)
  • Add another partition to an pre existing table ?

    • ALTER TABLE table_name ADD PARTITION (partition_name=partition_val)
  • How would you create a dynamically partitioned table ?

    • SET hive.exec.dynamic.partition = true
    • SET hive.exec.dynamic.partition.mode = nonstrict
  • Why should we set the partition mode to nonstrict ?

    • Strict mode which is the default requires at least one static partition
  • How would you rename a partition ?

    • ALTER TABLE table_name PARTITION partition_spec RENAME TO PARTITION partition_spec
  • How would you drop a partition ?

    • ALTER TABLE table_name DROP PARTITION partition_spec
  • What is the main advantage of bucketing over partitioning ?

    • Partition need to create a new partition for every unique value. This can lead to many partitions. This can be reduced by using buckets
  • How would you create a bucket ?

    • CREATE TABLE table_name (...) CLUSTERED BY (column_name) INTO <n> BUCKETS Where n is the number of buckets needed
  • What is difference in terms of where to specify the column that divides the table in partition and buckets ?

    • In partition the column to partition is specified after PARTITIONED BY
    • But in bucketing the column is specified as a regular column after the table_name and then referenced by CLUSTERED BY column_name
  • How would you query from a single bucket ?

    • SELECT (...) FROM table_name TABLESAMPLE(BUCKET <n> OUT OF <m> ON column_name) Where n is the bucket to access m is the total number of buckets column_name is the column on which the data is bucketed
  • How would you create a view ?

    • CREATE VIEW view_name AS ...
  • How would you limit the number of rows returned from a query ?

    • SELECT * FROM table_name LIMIT <n>
  • Write a HQL script to word count from a text file ?

CREATE TABLE doc (line STRING);
LOAD DATA LOCAL INPATH "./doc" OVERWRITE INTO TABLE doc;
CREATE TABLE wc AS SELECT word, count(1) as count FROM (SELECT EXPLODE(SPLIT(line, " ")) FROM doc) w GROUP BY word ORDER BY word;