Spark Architecture

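Spark is an in-memory processing engine
Created at UC Berkeley's AMPLab in 2009 (its creators later founded Databricks). It is reported to be up to 100x faster than MapReduce for in-memory workloads
The Catalyst Optimizer (query planning) and the Tungsten engine (memory and CPU efficiency) are two separate components; both were in place by Spark v1.6.0 and made Spark even faster
Spark is written in Scala and supports writing code in Scala, Python, R, Java and SQL (polyglot)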

Provides libraries for various workloads and types of data:

Streaming: Spark Streaming
Machine Learning: Spark MLlib
SQL Queries: Spark SQL
Graph: GraphX

Spark can run on YARN, Mesos, Kubernetes or in Standalone mode. Managed services such as EMR (Elastic MapReduce) run Spark for you, typically on YARN
In Hadoop environments YARN is the most common choice

Spark places no restriction on the type of data source or destination; the data can live anywhere
Spark does not have a built-in storage solution, so we need to provide a distributed storage environment
Standalone mode only replaces the cluster manager; it does not add storage, so data still has to be read from and written to an external system

Spark is meant as a replacement for MapReduce. It is not a replacement for Hadoop
UDFs are generally avoided in Spark because they are opaque to the optimizer, but they can be created

Spark lets you express SQL-style queries through the DataFrame DSL (Domain Specific Language)
DSL and SQL commands can be used interchangeably

Spark SQL

spark-sql: Lets you run SQL-style queries on data using the Spark engine

By default Spark SQL stores its tables on the local file system, in the Spark warehouse directory (spark-warehouse)
If a Hive database already exists, data added through Spark is stored in HDFS
In that setup Hive and Spark use the same metastore
