Version: 1.3.1.0

Writes and Streaming

HWC supports batch writes and Structured Streaming writes into Hive ACID tables.

Batch writes (DataFrame writer)

For batch writes, HWC stages data files in HDFS and then issues a LOAD DATA statement into the target table. The configured read mode does not affect write behavior.

spark.range(0, 100)
  .selectExpr("id", "concat('v', id) as v")
  .write
  .format("com.hortonworks.spark.sql.hive.llap.HiveWarehouseConnector")
  .option("database", "hwc_it")
  .option("table", "t_acid")
  .mode("overwrite")
  .save()

Notes:

  • Create the target table first (Spark 3 does not auto-create Hive tables on write).
  • Use a fully qualified (scheme-prefixed) path for the staging directory so that staging and LOAD DATA succeed on secure clusters.
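
Because the target table must exist before the batch write runs, it can be created up front through HWC's execute interface. This is a minimal sketch, assuming the hwc_it.t_acid name and the (id BIGINT, v STRING) schema implied by the batch example, and it requires a live Spark session configured for HWC:

```scala
import com.hortonworks.hwc.HiveWarehouseSession

// Build an HWC session from the active SparkSession (requires HWC on the
// classpath and a reachable Hive metastore).
val hive = HiveWarehouseSession.session(spark).build()

// Create a full-ACID target table; schema matches the batch example above.
// The ORC format and explicit 'transactional' property are assumptions that
// match typical Hive ACID requirements.
hive.executeUpdate(
  """CREATE TABLE IF NOT EXISTS hwc_it.t_acid (id BIGINT, v STRING)
    |STORED AS ORC
    |TBLPROPERTIES ('transactional' = 'true')""".stripMargin)
```

Declaring 'transactional' = 'true' explicitly makes the ACID intent clear even on clusters where managed ORC tables default to transactional.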

Streaming writes (Structured Streaming)

Use the streaming sink to write to ACID tables:

import org.apache.spark.sql.streaming.Trigger

val q = spark.readStream
  .format("rate")
  .option("rowsPerSecond", 5)
  .load()
  .selectExpr("cast(timestamp as string) as ts", "value")
  .writeStream
  .format("com.hortonworks.spark.sql.hive.llap.streaming.HiveStreamingDataSource")
  .outputMode("append")
  .option("database", "hwc_it")
  .option("table", "t_stream")
  .option("metastoreUri", "thrift://hms-host:9083")
  .option("checkpointLocation", "hdfs://nameservice/tmp/hwc_ckpt")
  .trigger(Trigger.Once())
  .start()

Streaming notes:

  • The target table must be transactional (ACID).
  • metastoreUri is required for streaming.
  • Use cleanUpStreamingMeta to remove metadata for a stopped query.
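
Following the last note, cleaning up metadata for a stopped query might look like the sketch below. The cleanUpStreamingMeta call shape, its argument order (checkpoint location, database, table), and the session it hangs off are assumptions based on the note above, not a verified signature; the snippet also needs a live HWC session and the q handle from the streaming example:

```scala
import com.hortonworks.hwc.HiveWarehouseSession

val hive = HiveWarehouseSession.session(spark).build()

// Stop the streaming query first so no writer still holds the checkpoint.
q.stop()

// Hypothetical invocation: name taken from the note above, but the argument
// order and types are assumed, not confirmed by this document.
hive.cleanUpStreamingMeta("hdfs://nameservice/tmp/hwc_ckpt", "hwc_it", "t_stream")
```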