Distributed Systems

Apache Spark

Parallel processing engine

RDD

DataFrame

Parquet

Delta Lake table format

Extends Parquet with a transaction log and metadata

Brings relational-database guarantees (ACID transactions, schema enforcement) to both batch and streaming workloads
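
Toy sketch of the transaction-log idea, in plain Python with no Spark or Delta Lake dependency. It is a simplified illustration and not Delta Lake's actual implementation: the helper names (commit, live_files, demo) are hypothetical, and the actions carry only a file path, whereas real log entries hold more metadata. It assumes the general layout of a _delta_log directory holding zero-padded, numbered JSON commit files with "add" and "remove" actions, and shows how replaying them in order yields the set of data files in a snapshot.

```python
from __future__ import annotations

import json
import tempfile
from pathlib import Path


def commit(table_dir: Path, actions: list[dict]) -> int:
    """Write the next numbered commit file and return its version."""
    log_dir = table_dir / "_delta_log"
    log_dir.mkdir(parents=True, exist_ok=True)
    version = len(list(log_dir.glob("*.json")))
    path = log_dir / f"{version:020d}.json"
    # Mode "x" raises if the file already exists, so two writers cannot
    # both claim the same version (a toy form of optimistic concurrency).
    with open(path, "x") as f:
        for action in actions:
            f.write(json.dumps(action) + "\n")
    return version


def live_files(table_dir: Path, as_of: int | None = None) -> set[str]:
    """Replay the log in version order to find the files in a snapshot."""
    files: set[str] = set()
    for path in sorted((table_dir / "_delta_log").glob("*.json")):
        if as_of is not None and int(path.stem) > as_of:
            break  # stop at the requested version (time travel)
        for line in path.read_text().splitlines():
            action = json.loads(line)
            if "add" in action:
                files.add(action["add"]["path"])
            elif "remove" in action:
                files.discard(action["remove"]["path"])
    return files


def demo() -> tuple[set[str], set[str]]:
    """Build a three-commit table; return (latest snapshot, snapshot at v1)."""
    with tempfile.TemporaryDirectory() as tmp:
        table = Path(tmp)
        commit(table, [{"add": {"path": "part-0.parquet"}}])
        commit(table, [{"add": {"path": "part-1.parquet"}}])
        # Compaction: swap two files for one within a single commit, so
        # readers see either the old pair or the new file, never a mix.
        commit(table, [
            {"remove": {"path": "part-0.parquet"}},
            {"remove": {"path": "part-1.parquet"}},
            {"add": {"path": "part-2.parquet"}},
        ])
        return live_files(table), live_files(table, as_of=1)
```

Calling demo() returns the latest snapshot (only part-2.parquet) and the snapshot as of version 1 (part-0.parquet and part-1.parquet). Because the data files are never edited in place and only the log decides which ones are live, a single-commit swap can appear atomic to readers and older versions stay queryable.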

Structured Streaming