Spark execution environments

Amit Singh Rathore · Dev Genius · Feb 26, 2023

Popular choices for running Spark applications

Spark can be deployed in multiple ways: locally as a standalone application, or on a cluster. When running on a cluster, we can choose among cluster managers such as Spark's own standalone manager, YARN, Kubernetes (K8s), Mesos, or other custom solutions. In this blog, we will go through some of the most commonly used environments.

Deploy modes: client | cluster

Local

In local mode, the entire Spark application (driver and executors as threads) runs inside a single JVM. It is good for quickly testing Spark features.

On Mac

To install Spark on a Mac, execute the following:

# Install Homebrew
/bin/bash -c "$(curl -fsSL https://raw.githubusercontent.com/Homebrew/install/HEAD/install.sh)"
brew install openjdk@11 scala python apache-spark
pyspark

Cluster managers: local[*] | spark:// (standalone) | yarn | k8s | mesos
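The master URL passed at launch decides which of these managers is used. For a quick local run (the script name and thread counts below are arbitrary examples):

# PySpark shell with as many worker threads as cores
pyspark --master "local[*]"

# Submit an application locally with 4 worker threads (my_app.py is a placeholder)
spark-submit --master "local[4]" my_app.py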

Spark on YARN

YARN (Yet Another Resource Negotiator) is the resource management and job scheduling framework of the Hadoop ecosystem. YARN is responsible for allocating system resources to the various applications running in a cluster and for scheduling tasks to be executed on different cluster nodes.

On-Premise

  • Client submits the application
  • Jars and files are uploaded to HDFS
  • Application Master is launched
  • Driver is launched
  • Driver requests executors from the cluster manager
  • Containers are launched
  • Executors are launched in the containers
  • Executors register themselves with the driver
  • Executors fetch job-related artifacts from HDFS
  • Tasks are launched
  • Results are reported to the driver
  • Driver sends the response to the client
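A typical submission that triggers this flow is a plain spark-submit against YARN; the class, jar path, and resource sizes below are only illustrative:

spark-submit --master yarn \
  --deploy-mode cluster \
  --class org.apache.spark.examples.SparkPi \
  --num-executors 4 \
  --executor-cores 2 \
  --executor-memory 4g \
  $SPARK_HOME/examples/jars/spark-examples_2.12-3.2.0.jar 100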

On EMR

EMR is a managed offering from AWS. It takes care of cluster provisioning and maintenance. At a high level, an EMR cluster has three types of nodes: Master, Core & Task.

Internally, it uses YARN to manage the cluster. The flow from application submission to completion mirrors the on-premise YARN flow described above.
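On a running cluster, a Spark job is typically submitted as an EMR step. A minimal sketch with the AWS CLI (the cluster ID and jar path are placeholders):

aws emr add-steps \
  --cluster-id j-XXXXXXXXXXXXX \
  --steps 'Type=Spark,Name=SparkPi,ActionOnFailure=CONTINUE,Args=[--deploy-mode,cluster,--class,org.apache.spark.examples.SparkPi,/usr/lib/spark/examples/jars/spark-examples.jar,100]'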

Spark on K8s

Since Spark 3.1, K8s has been a production-ready deployment option. In this mode, the whole Spark job (driver and executors) runs inside containers on the K8s platform.

On-Premise

EMR on EKS

EMR on EKS uses the concept of virtual clusters. A virtual cluster maps to a namespace on an existing EKS cluster, and EMR uses that K8s cluster for all its resource needs.
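Jobs are submitted against the virtual cluster rather than against the EKS cluster directly. A minimal sketch with the AWS CLI (the IDs, role ARN, release label, and S3 path are placeholders):

aws emr-containers start-job-run \
  --virtual-cluster-id <virtual-cluster-id> \
  --name spark-pi \
  --execution-role-arn <job-execution-role-arn> \
  --release-label emr-6.9.0-latest \
  --job-driver '{
    "sparkSubmitJobDriver": {
      "entryPoint": "s3://my-bucket/scripts/pi.py",
      "sparkSubmitParameters": "--conf spark.executor.instances=2 --conf spark.executor.memory=2G"
    }
  }'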

While running a Spark application on K8s, we have two options for job submission: plain spark-submit, or the spark-operator (the community preference). With spark-submit, the master URL points at the K8s API server:

export MASTER=k8s://https://kubernetes.default:443
export IMAGE=spark:3.2

/opt/spark/bin/spark-submit --master $MASTER \
--deploy-mode cluster \
--name spark-pi \
--class org.apache.spark.examples.SparkPi \
--conf spark.kubernetes.namespace=spark-workloads \
--conf spark.kubernetes.authenticate.driver.serviceAccountName=spark \
--conf spark.kubernetes.driver.podTemplateFile=./bucket/driver.yml \
--conf spark.kubernetes.executor.podTemplateFile=./bucket/executor.yml \
--conf spark.executor.instances=2 \
--conf spark.kubernetes.container.image.pullPolicy=Always \
--conf spark.kubernetes.container.image=$IMAGE \
local:///opt/spark/examples/jars/spark-examples_2.12-3.2.0.jar

# Pyspark in client mode
/opt/spark/bin/pyspark --master $MASTER \
--deploy-mode client \
--name pyspark-client-test \
--conf spark.kubernetes.authenticate.driver.serviceAccountName=spark \
--conf spark.kubernetes.namespace=spark-workloads \
--conf spark.kubernetes.driver.podTemplateFile=./bucket/driver.yml \
--conf spark.kubernetes.executor.podTemplateFile=./bucket/executor.yml \
--conf spark.executor.instances=2 \
--conf spark.kubernetes.container.image=$IMAGE \
--conf spark.kubernetes.container.image.pullPolicy=Always

The second option, the spark-operator, manages Spark jobs as Kubernetes custom resources (SparkApplication objects) and is the community preference.

Installing spark-operator

$ helm repo add incubator https://charts.helm.sh/incubator --force-update
$ helm install sparkoperator incubator/sparkoperator --namespace spark-operator --set sparkJobNamespace=spark-workloads

Once the operator is installed, a SparkApplication manifest can be submitted programmatically with the Kubernetes Python client:

from kubernetes import config
from kubernetes.client import ApiClient, CustomObjectsApi

# Load kubeconfig and build a generic API client
config.load_kube_config("path/to/kubeconfig_file")
k8s_api_client = ApiClient()

# Client for the SparkApplication custom resource managed by spark-operator
k8s_spark_api_client = CustomObjectsApi(k8s_api_client)
k8s_spark_properties = dict(
    group="sparkoperator.k8s.io",
    version="v1beta2",
    plural="sparkapplications",
    namespace="spark-workloads",
)

# Load the job manifest (SparkApplication YAML) into application_config
import os
import yaml
from yaml.loader import SafeLoader

YAML_FILE = "spark-pi-config.yml"

with open(os.path.join(os.path.dirname(__file__), YAML_FILE)) as f:
    application_config = yaml.load(f, Loader=SafeLoader)

# Create the SparkApplication object; the operator picks it up and launches the job
k8s_spark_api_client.create_namespaced_custom_object(
    body=application_config, **k8s_spark_properties
)
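For reference, a minimal spark-pi-config.yml could look like the sketch below (image, namespace, service account, and resource sizes are assumptions that must match your setup); the same manifest can also be applied directly with kubectl:

# Apply an illustrative SparkApplication manifest directly (instead of via the Python client)
cat <<'EOF' | kubectl apply -f -
apiVersion: sparkoperator.k8s.io/v1beta2
kind: SparkApplication
metadata:
  name: spark-pi
  namespace: spark-workloads
spec:
  type: Scala
  mode: cluster
  image: spark:3.2
  mainClass: org.apache.spark.examples.SparkPi
  mainApplicationFile: local:///opt/spark/examples/jars/spark-examples_2.12-3.2.0.jar
  sparkVersion: "3.2.0"
  restartPolicy:
    type: Never
  driver:
    cores: 1
    memory: 512m
    serviceAccount: spark
  executor:
    cores: 1
    instances: 2
    memory: 512m
EOF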

Spark on EMR Serverless

EMR Serverless is a newer deployment option for AWS EMR. With EMR Serverless, we don't need to configure, optimize, or manage clusters to run applications. EMR Serverless automatically identifies the resources a job needs, provisions those resources to run the job, and releases them when the job completes. Jobs are executed asynchronously and tracked through completion.

EMR Serverless does not use the native YARN cluster manager. Instead, it uses an in-house EMR engine, which has three main components:

  • Job Engine — manages resource allocation
  • Metadata Service — Lambda + DynamoDB
  • Execution Environment — customized Spark runtime
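A minimal sketch of submitting a Spark job with the AWS CLI (the application ID, role ARN, script path, and resource settings are placeholders):

aws emr-serverless start-job-run \
  --application-id <application-id> \
  --execution-role-arn <job-runtime-role-arn> \
  --job-driver '{
    "sparkSubmit": {
      "entryPoint": "s3://my-bucket/scripts/pi.py",
      "sparkSubmitParameters": "--conf spark.executor.cores=2 --conf spark.executor.memory=4g"
    }
  }'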

Spark on Databricks

Databricks uses its in-house Unified Analytics Platform (UAP) for running Spark workloads. Its cluster manager has three components:

  • Instance Manager — gets instances/compute from the cloud provider
  • Resource Manager — gets JVMs on the nodes
  • Spark Cluster Manager — creates drivers & executors and assigns them to JVMs
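As a rough sketch, a one-off run can be submitted to this platform through the Databricks Jobs API; the workspace URL, token, node type, script path, and runtime version below are placeholders/assumptions:

curl -X POST "https://<workspace-url>/api/2.1/jobs/runs/submit" \
  -H "Authorization: Bearer $DATABRICKS_TOKEN" \
  -d '{
    "run_name": "spark-pi",
    "tasks": [{
      "task_key": "pi",
      "new_cluster": {
        "spark_version": "11.3.x-scala2.12",
        "node_type_id": "i3.xlarge",
        "num_workers": 2
      },
      "spark_python_task": { "python_file": "dbfs:/scripts/pi.py" }
    }]
  }'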

Spark via interactive UI

Terminals: spark-shell (Scala) | spark-sql (SQL) | pyspark (Python)

Notebook

Jupyter is an open-source interactive computational environment managed by Project Jupyter. A notebook is a shareable document that combines inputs and outputs into a single file. JupyterLab is the next-generation notebook interface that further enhances Jupyter's functionality so that it can support distributed workflows like Spark. Combining Jupyter notebooks with Spark gives developers a powerful and familiar development environment while harnessing the power of Apache Spark.

On Mac

# Install Homebrew
/bin/bash -c "$(curl -fsSL https://raw.githubusercontent.com/Homebrew/install/HEAD/install.sh)"
brew install openjdk@11 scala python apache-spark jupyter
jupyter notebook

On Linux

sudo apt install default-jdk scala git -y
wget https://dlcdn.apache.org/spark/spark-3.2.0/spark-3.2.0-bin-hadoop3.2.tgz
tar xf spark-*
sudo mv spark-3.2.0-bin-hadoop3.2 /opt/spark

pip install notebook

export SPARK_HOME=/opt/spark
export PATH=$PATH:$SPARK_HOME/bin:$SPARK_HOME/sbin
export PYSPARK_PYTHON=/usr/bin/python3
export PYSPARK_DRIVER_PYTHON='jupyter'
export PYSPARK_DRIVER_PYTHON_OPTS='notebook --no-browser --port=8889'
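With the variables above exported, starting PySpark launches a Jupyter notebook server (on port 8889 here) instead of a terminal REPL, and the notebooks it serves come with a SparkContext/SparkSession ready to use:

# PYSPARK_DRIVER_PYTHON=jupyter makes this command start the notebook server
pyspark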

Docker image

docker pull jupyter/pyspark-notebook:spark-3.3.2
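Running the image and publishing Jupyter's default port gives a ready-to-use PySpark notebook environment; the port mapping below assumes the image's default of 8888:

docker run -it --rm -p 8888:8888 jupyter/pyspark-notebook:spark-3.3.2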

In the above notebook setups, both Spark and the notebook run on the same machine/cluster. But this is generally not the production setup for companies; we usually separate the notebook and Spark infrastructure.

Livy comes in handy in this case, as it allows remote execution of Spark workloads over REST calls.
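For example, a batch job can be submitted to a Livy server with a single REST call (the host and jar path are illustrative):

curl -s -X POST "http://livy-host:8998/batches" \
  -H "Content-Type: application/json" \
  -d '{
    "file": "local:///opt/spark/examples/jars/spark-examples_2.12-3.2.0.jar",
    "className": "org.apache.spark.examples.SparkPi"
  }'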

# Installing Jupyter with helm chart
helm repo add jupyterhub https://jupyterhub.github.io/helm-chart/
helm repo update
helm install <release name> jupyterhub/<helm chart name> --version <helm chart version>
# Livy + spark bundled chart
helm repo add jahstreet https://jahstreet.github.io/helm-charts
kubectl create namespace spark-jobs
helm upgrade --install livy --namespace spark-jobs jahstreet/livy

Alternatively, a custom Livy image can be built on top of the Bitnami Spark image:

FROM docker.io/bitnami/spark:3.2.0

USER root
ARG LIVY_VERSION=0.7.0-incubating
ENV LIVY_HOME /opt/bitnami/livy
ENV LIVY_CONF_DIR ${LIVY_HOME}/conf
ENV PATH $LIVY_HOME/bin:${PATH}
ENV SPARK_HOME=/opt/bitnami/spark/
WORKDIR /opt/bitnami/

RUN install_packages unzip \
&& curl "https://downloads.apache.org/incubator/livy/${LIVY_VERSION}/apache-livy-${LIVY_VERSION}-bin.zip" -O \
&& unzip "apache-livy-${LIVY_VERSION}-bin" \
&& rm -rf "apache-livy-${LIVY_VERSION}-bin.zip" \
&& mv "apache-livy-${LIVY_VERSION}-bin" $LIVY_HOME \
&& mkdir $LIVY_HOME/logs \
&& chown -R 1001:1001 $LIVY_HOME

COPY ./livy.conf /opt/bitnami/livy/conf/

USER 1001

HEALTHCHECK CMD curl -f "http://localhost:8998/" || exit 1

EXPOSE 8998 10000

ENTRYPOINT ["sh", "-c", "/opt/bitnami/livy/bin/livy-server"]

Zeppelin

Zeppelin is a web-based, multipurpose notebook for big data workloads. It uses an interpreter plugin system to support multiple languages and processing frameworks.

curl -s -O https://raw.githubusercontent.com/apache/zeppelin/master/k8s/zeppelin-server.yaml
kubectl apply -f zeppelin-server.yaml

There are other notebook options as well, like Polynote (from Netflix) and Spark Notebook (no longer actively developed).

Happy Reading !!
