Getting Started
This page covers a minimal setup of the Hive Warehouse Connector (HWC) with Spark 3.5 and Hive 4.0.1.
Prerequisites
- HiveServer2 (LLAP) and Hive Metastore are running and reachable.
- Spark has access to the Hive client configs (hive-site.xml, core-site.xml, hdfs-site.xml).
- If Kerberos is enabled, a valid principal and keytab are available (see the kinit example below) and HS2 is configured for Kerberos auth.
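For the Kerberos case, a ticket can be obtained up front with kinit; the keytab path and principal below are placeholders for your environment:

kinit -kt /etc/security/keytabs/user.keytab user@EXAMPLE.COM
klist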
Build the assembly jar
From the repo root:
sbt -Dspark.version=3.5.6 -Dhive.version=4.0.1 -Dscala.version=2.12.18 assembly
The jar will be under target/scala-2.12/.
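To confirm the build output before moving on (the exact file name depends on the versions passed to sbt):

ls target/scala-2.12/hive-warehouse-connector-assembly-*.jar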
Spark shell (secure access mode)
spark-shell \
--jars target/scala-2.12/hive-warehouse-connector-assembly-1.3.1.jar \
--conf spark.sql.hive.hiveserver2.jdbc.url="jdbc:hive2://hs2-host:10001/;transportMode=http;httpPath=cliservice;ssl=true" \
--conf spark.sql.hive.hiveserver2.jdbc.url.principal=hive/hs2-host@EXAMPLE.COM \
--conf spark.hadoop.hive.metastore.uris=thrift://hms-host:9083 \
--conf spark.datasource.hive.warehouse.read.mode=secure_access \
--conf spark.datasource.hive.warehouse.read.jdbc.mode=cluster \
--conf spark.datasource.hive.warehouse.load.staging.dir=hdfs://nameservice/apps/hwc_staging \
--conf spark.sql.extensions=com.hortonworks.spark.sql.rule.Extensions
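On a kerberized YARN cluster, Spark can also manage ticket renewal itself. --principal and --keytab are standard spark-shell/spark-submit options; the values below are placeholders, and the remaining flags are the same as above:

spark-shell \
--principal user@EXAMPLE.COM \
--keytab /etc/security/keytabs/user.keytab \
... (same --jars and --conf options as above)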
Minimal Scala usage
import com.hortonworks.hwc.HiveWarehouseSession

// Build an HWC session on top of the active SparkSession.
val hwc = HiveWarehouseSession.session(spark).build()

// DDL runs against HiveServer2 over JDBC.
hwc.executeUpdate("create database if not exists hwc_it")

// Write a small DataFrame to a Hive-managed table through the connector.
val df = spark.range(0, 10).selectExpr("id", "concat('v', id) as v")
df.write
  .format("com.hortonworks.spark.sql.hive.llap.HiveWarehouseConnector")
  .option("database", "hwc_it")
  .option("table", "t_acid")
  .mode("overwrite")
  .save()

// Query the table back through the HWC session.
hwc.sql("select count(*) as c from hwc_it.t_acid").show()
PySpark usage
pyspark \
--jars target/scala-2.12/hive-warehouse-connector-assembly-1.3.1.jar \
--py-files python/pyspark_hwc-1.3.1.zip

Pass the same --conf options as in the spark-shell example so the connector can reach HS2 and the metastore.
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Sanity check: the session starts with the HWC jar and Python bindings on the path.
spark.sql("show databases").show()