Distributed Systems

Apache Spark

Parallel processing engine

RDD (Resilient Distributed Dataset)

DataFrame

Parquet (columnar file format)

Delta Lake: table format that extends Parquet with a transaction log and metadata

enables relational database benefits (ACID transactions, schema enforcement) on batch and streaming data
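
To make the transaction log idea concrete, here is a minimal sketch in plain Python. It is not Delta Lake's actual API or on-disk format: the class `ToyLogTable`, the `_toy_log` directory, and the JSON data files (standing in for Parquet files) are all assumptions made for illustration. It shows the general technique only: data files are never modified, and an ordered log of commits records which files are part of the table at each version.

```python
import json
import tempfile
from pathlib import Path


class ToyLogTable:
    """Hypothetical toy table: immutable data files plus an ordered commit log."""

    def __init__(self, root):
        self.root = Path(root)
        self.log_dir = self.root / "_toy_log"  # illustrative name, not Delta's
        self.log_dir.mkdir(parents=True, exist_ok=True)

    def _versions(self):
        # Each log entry is named after its version number.
        return sorted(int(p.stem) for p in self.log_dir.glob("*.json"))

    def commit(self, rows, remove=()):
        """Write a new data file, then publish it with one log entry."""
        versions = self._versions()
        version = versions[-1] + 1 if versions else 0
        data_file = f"part-{version:05d}.json"
        (self.root / data_file).write_text(json.dumps(rows))
        entry = {"add": [data_file], "remove": list(remove)}
        # Exclusive create ("x") raises FileExistsError if this version
        # already exists, so two writers cannot both claim the same version.
        with open(self.log_dir / f"{version:020d}.json", "x") as f:
            json.dump(entry, f)
        return version

    def read(self, as_of=None):
        """Replay the log up to `as_of` to find live files, then load them."""
        live = []
        for v in self._versions():
            if as_of is not None and v > as_of:
                break
            entry = json.loads((self.log_dir / f"{v:020d}.json").read_text())
            live = [f for f in live if f not in entry["remove"]] + entry["add"]
        rows = []
        for name in live:
            rows.extend(json.loads((self.root / name).read_text()))
        return rows


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        table = ToyLogTable(tmp)
        table.commit([{"id": 1}])
        table.commit([{"id": 2}])
        # Logically replace the first file without touching it on disk.
        table.commit([{"id": 3}], remove=["part-00000.json"])
        print(table.read())         # current snapshot
        print(table.read(as_of=1))  # older snapshot replayed from the log
```

Because readers only see files named in the log, a half-written data file is invisible until its commit entry exists, and replaying the log up to an earlier version gives an older snapshot of the table.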

Structured Streaming