Spark Structured Streaming

SekFook
3 min read · May 11, 2022

Introduction

Spark Structured Streaming is a stream processing engine built on top of the Spark SQL engine that lets you express batch-style computations on streaming data, e.g. streaming aggregations, event-time windows, or stream-to-batch joins, using the Dataset/DataFrame API.

There are two processing modes in Spark Structured Streaming:

  1. Micro-batch processing (the default): processes the data stream as a series of small batch jobs.
  2. Continuous processing: can achieve end-to-end latencies as low as 1 millisecond with at-least-once guarantees (see the trigger sketch after this list).
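
The mode is chosen through the trigger on the output query. Below is a minimal PySpark sketch, assuming the built-in rate source and console sink purely for illustration; in practice you would pick one trigger, not both.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("processing-modes").getOrCreate()
stream = spark.readStream.format("rate").option("rowsPerSecond", 10).load()

# 1. Micro-batch processing (the default): start a new micro-batch every 10 seconds.
micro_batch = (
    stream.writeStream
    .format("console")
    .trigger(processingTime="10 seconds")
    .start()
)
micro_batch.stop()

# 2. Continuous processing: low-latency mode, checkpointing progress every 1 second.
#    Only a limited set of sources, sinks and operations are supported in this mode.
continuous = (
    stream.writeStream
    .format("console")
    .trigger(continuous="1 second")
    .start()
)
continuous.awaitTermination()
```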

How it works

Structured Streaming treats a live data stream as a table that is being continuously appended to. Thus you can run batch-like queries on streaming data as if it were a static table. Each new record in the stream is like a new row appended to the input table, and a query on the input table produces the result table.

continuous data stream --(append to)--> input table --(query)--> result table

At every trigger interval (a predefined period), new rows are appended to the input table, and the result table is updated by your queries. Eventually, we write the changed result rows to an external sink (e.g. a Kafka topic).
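
The classic word-count example shows this model: the query below is written exactly as it would be for a static DataFrame, but because `lines` is a streaming DataFrame, Spark re-evaluates it against the growing input table at every trigger and updates the result table. A sketch, assuming a local socket source (e.g. fed by `nc -lk 9999`) and the console sink:

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import explode, split

spark = SparkSession.builder.appName("unbounded-table").getOrCreate()

# Socket source (for testing): each line of text becomes a row in the input table.
lines = (
    spark.readStream
    .format("socket")
    .option("host", "localhost")
    .option("port", 9999)
    .load()
)

words = lines.select(explode(split(lines.value, " ")).alias("word"))
word_counts = words.groupBy("word").count()   # the "result table"

query = (
    word_counts.writeStream
    .outputMode("complete")                   # write the whole result table each trigger
    .format("console")
    .trigger(processingTime="5 seconds")
    .start()
)
query.awaitTermination()
```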

There are three output modes for writing the result into the external sink; a short sketch of each follows the list.

  1. Complete mode — the entire result table is written to the external storage after every trigger.
  2. Append mode — only the new rows appended to the result table since the last trigger are written to the external storage.
  3. Update mode — only the rows updated in the result table since the last trigger are written to the external storage.
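
A sketch of the three modes, reusing the `lines` and `word_counts` DataFrames from the word-count example above. Which modes a query supports depends on the query itself: complete and update suit aggregations, while append requires rows that never change once produced.

```python
# Complete mode: rewrite the entire result table on every trigger.
word_counts.writeStream.outputMode("complete").format("console").start()

# Update mode: write only the rows whose counts changed since the last trigger.
word_counts.writeStream.outputMode("update").format("console").start()

# Append mode: write only brand-new rows; natural for plain projections/filters.
lines.writeStream.outputMode("append").format("console").start()
```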

Fault Tolerance

Structured Streaming aims for end-to-end exactly-once semantics, i.e. each message is delivered to the consumer side exactly once, even in the presence of failures.

Every streaming source is assumed to have offsets that track the read position in the stream, which lets the execution engine restart or reprocess from the right point after any kind of failure.

The engine uses checkpointing and write-ahead logs to record the offset range of the data processed in each trigger.

The streaming sinks are designed to be idempotent, so reprocessed data does not produce duplicate results.

Checkpointing: saves truncated RDDs (without their lineage dependencies) to reliable storage.

Write-ahead logs: save all data received by the receivers to log files in the checkpoint directory, so the data can be replayed after a failure.

Further reading: https://www.waitingforcode.com/apache-spark-streaming/spark-streaming-checkpointing-and-write-ahead-logs/read
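
In practice, enabling this for a query only requires pointing it at a checkpoint location on reliable storage. A minimal sketch, reusing `word_counts` from the word-count example; the path is a placeholder:

```python
query = (
    word_counts.writeStream
    .outputMode("complete")
    .format("console")
    # Offsets and state are recorded here; a restarted query resumes from this point.
    .option("checkpointLocation", "/tmp/checkpoints/word-count")  # hypothetical path
    .start()
)
```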

Input Sources

File source — reads files written in a directory as a stream of data.

Kafka source — reads data from Kafka.

Socket source (for testing) — reads UTF-8 text data from a socket connection.

Rate source (for testing) — generates data at a specified number of rows per second.
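
Hedged sketches of each source, assuming the SparkSession from the earlier examples; the paths, topic and broker address are placeholders:

```python
# File source: treat new CSV files landing in a directory as a stream.
files = (
    spark.readStream
    .format("csv")
    .schema("id INT, amount DOUBLE")      # file sources require an explicit schema
    .load("/data/incoming/")              # hypothetical directory
)

# Kafka source.
kafka = (
    spark.readStream
    .format("kafka")
    .option("kafka.bootstrap.servers", "localhost:9092")
    .option("subscribe", "orders")        # hypothetical topic
    .load()
)

# Socket source (testing only).
socket = (
    spark.readStream.format("socket")
    .option("host", "localhost").option("port", 9999)
    .load()
)

# Rate source (testing only): generates rowsPerSecond rows per second.
rate = spark.readStream.format("rate").option("rowsPerSecond", 100).load()
```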

Window Operations

Window operations aggregate values over sliding event-time windows. You define the window length, e.g. a 10-minute window, and how often the window is updated (the slide interval).

E.g.

A 10-minute window sliding every 5 minutes, starting at 12:00:
received one order at 12:00,
so the order count for window 12:00-12:10 would be 1;
received one order at 12:07,
so the order count for window 12:00-12:10 would be 2,
and the order count for window 12:05-12:15 would be 1.
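
This windowed count can be expressed with `groupBy` over a `window` column. A sketch, assuming a hypothetical streaming DataFrame `orders` with an event-time column `order_time`:

```python
from pyspark.sql.functions import window

windowed_counts = (
    orders
    .groupBy(window(orders.order_time, "10 minutes", "5 minutes"))
    .count()
)
# Each result row is keyed by a window such as [12:00, 12:10) or [12:05, 12:15),
# and each order is counted in every window its event time falls into.
```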

**Late data** (data that was generated earlier but arrives late at the application) is handled automatically and counted toward the event-time window it belongs to.

Watermarking is the mechanism that tracks the current maximum event time seen in the data and cleans up old window state accordingly, so that state does not grow without bound.

Let T be the watermark threshold. A late record with event time t is rejected if:
(max event time seen by the engine) - T > t
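
A sketch of setting the threshold on the same hypothetical `orders` stream: events whose `order_time` falls more than 10 minutes behind the maximum event time seen so far are dropped, and state for windows older than the watermark is cleaned up.

```python
late_tolerant_counts = (
    orders
    .withWatermark("order_time", "10 minutes")   # watermark threshold T = 10 minutes
    .groupBy(window(orders.order_time, "10 minutes", "5 minutes"))
    .count()
)

query = (
    late_tolerant_counts.writeStream
    .outputMode("update")     # append mode also works, emitting a window once the watermark passes it
    .format("console")
    .start()
)
```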
