Spark Submit

spark-submit cheatsheet: launch Spark jobs on clusters. Covers spark-submit --master yarn --deploy-mode cluster, --executor-memory, and --num-executors, and deploying a JAR or PySpark application to YARN or Kubernetes.

What it is

spark-submit is the command-line utility used to launch Spark applications on a cluster, whether it’s a standalone Spark cluster, YARN, Kubernetes, or (in older releases) Mesos. You use it when you have a packaged Spark application (a JAR or a Python script) and want to execute it with specific configurations.

Installation

spark-submit is distributed with Apache Spark. You install it by downloading and extracting a Spark distribution.

Linux/Mac:

# Download a Spark distribution (e.g., from spark.apache.org/downloads.html)
wget https://archive.apache.org/dist/spark/spark-3.5.0/spark-3.5.0-bin-hadoop3.tgz

# Extract the archive
tar -xzf spark-3.5.0-bin-hadoop3.tgz

# Add Spark's bin directory to your PATH (optional, but highly recommended; use ~/.zshrc instead if your shell is zsh)
echo 'export PATH=$PATH:/path/to/spark-3.5.0-bin-hadoop3/bin' >> ~/.bashrc
source ~/.bashrc

Windows: Download and extract a Spark distribution. You’ll then need to run spark-submit.cmd from the extracted bin directory. It’s recommended to add this directory to your system’s PATH environment variable.

Core Concepts

Application JAR/Python File: The compiled code or script that contains your Spark logic.
Master URL: Specifies the cluster manager Spark will connect to (e.g., local[*], yarn, spark://host:port).
Deploy Mode: Where the driver program runs. client mode runs the driver on the machine where spark-submit is invoked. cluster mode runs the driver on one of the worker nodes in the cluster.
Resource Allocation: How many executors you request for your application, and how much CPU and memory the driver and each executor get.

Commands / Usage

Submitting Applications

Submit a JAR application to a local Spark instance:

spark-submit --class com.example.MySparkApp --master "local[*]" my-spark-app.jar

Runs a JAR application locally using all available CPU cores. The quotes keep shells such as zsh from treating [*] as a glob pattern.

Submit a Python application to a local Spark instance:

spark-submit --master "local[*]" my_spark_app.py

Runs a Python script locally using all available CPU cores.

Submit a JAR application to a YARN cluster in client mode:

spark-submit --class com.example.MySparkApp --master yarn --deploy-mode client my-spark-app.jar

Runs a JAR application on YARN, with the driver program running on the machine executing spark-submit.

Submit a JAR application to a YARN cluster in cluster mode:

spark-submit --class com.example.MySparkApp --master yarn --deploy-mode cluster my-spark-app.jar

Runs a JAR application on YARN, with the driver program running inside the YARN ApplicationMaster container on a node of the cluster.

Submit a JAR application with specific resources:

spark-submit \
  --class com.example.MySparkApp \
  --master yarn \
  --deploy-mode cluster \
  --num-executors 10 \
  --executor-cores 4 \
  --executor-memory 8g \
  --driver-memory 4g \
  my-spark-app.jar

Launches an application on YARN requesting 10 executors, each with 4 cores and 8GB of memory, and 4GB for the driver.
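
As a rough check of what the command above asks the cluster for, the arithmetic can be sketched in Python. This assumes Spark's default memory overhead for JVM applications (10% of the requested heap with a 384 MiB floor, controlled by spark.executor.memoryOverhead and spark.driver.memoryOverhead). The function name container_mib is made up for illustration, and YARN may round each container up further to its allocation increment, which is not modeled here.

```python
def container_mib(heap_mib, overhead_factor=0.10, min_overhead_mib=384):
    """Approximate container size: requested heap plus off-heap overhead."""
    overhead = max(min_overhead_mib, int(heap_mib * overhead_factor))
    return heap_mib + overhead


# Values taken from the spark-submit command above.
num_executors = 10
executor_cores = 4
executor_mib = 8 * 1024   # --executor-memory 8g
driver_mib = 4 * 1024     # --driver-memory 4g

total_executor_cores = num_executors * executor_cores
executor_container = container_mib(executor_mib)
driver_container = container_mib(driver_mib)
total_mib = num_executors * executor_container + driver_container

print(f"executor cores requested: {total_executor_cores}")
print(f"per-executor container:   {executor_container} MiB")
print(f"driver container:         {driver_container} MiB")
print(f"total memory requested:   {total_mib} MiB")
```

The point of the estimate is that the cluster has to fit noticeably more than 10 x 8 GB: the overhead and the driver container count against the queue's quota too.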

Submit an application with dependencies:

spark-submit \
  --class com.example.MySparkApp \
  --master yarn \
  --jars /path/to/dependency1.jar,/path/to/dependency2.jar \
  my-spark-app.jar

Submits an application and includes two additional JAR files as dependencies.

Submit a Python application with dependencies:

spark-submit \
  --master yarn \
  --py-files /path/to/my_utils.py,/path/to/my_package.zip \
  my_spark_app.py

Submits a Python application and includes a Python file and a zip archive containing modules.
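
One way to produce a zip like my_package.zip is Python's standard zipfile module. This is a minimal sketch, assuming a package directory named my_package; the helper name zip_package and the file names in the demo are hypothetical. The archive keeps the top-level package folder so that import my_package can resolve once the zip is on the PYTHONPATH.

```python
import tempfile
import zipfile
from pathlib import Path


def zip_package(package_dir, zip_path):
    """Zip every .py file under package_dir, keeping the package folder name."""
    package_dir = Path(package_dir)
    with zipfile.ZipFile(zip_path, "w", zipfile.ZIP_DEFLATED) as zf:
        for path in sorted(package_dir.rglob("*.py")):
            # Archive names are relative to the package's parent directory,
            # e.g. my_package/utils.py rather than just utils.py.
            zf.write(path, path.relative_to(package_dir.parent).as_posix())
    return Path(zip_path)


# Demo on a throwaway package so the sketch runs anywhere.
with tempfile.TemporaryDirectory() as tmp:
    pkg = Path(tmp) / "my_package"
    pkg.mkdir()
    (pkg / "__init__.py").write_text("")
    (pkg / "utils.py").write_text("def double(x):\n    return 2 * x\n")

    archive = zip_package(pkg, Path(tmp) / "my_package.zip")
    with zipfile.ZipFile(archive) as zf:
        names = zf.namelist()

print(names)
```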

Submit an application with application arguments:

spark-submit \
  --class com.example.MySparkApp \
  --master "local[*]" \
  my-spark-app.jar input_path=/data/input output_path=/data/output

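Passes the two strings input_path=/data/input and output_path=/data/output, unchanged, as arguments to your Spark application’s main method. spark-submit does not interpret the key=value form; your application has to parse it.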
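
For a Python application the same strings arrive in sys.argv[1:]. A minimal sketch of parsing them, assuming the key=value convention used in the command above; parse_key_value_args is a hypothetical helper name, not something Spark provides.

```python
def parse_key_value_args(argv):
    """Turn ['key=value', ...] into a dict, rejecting malformed entries."""
    options = {}
    for arg in argv:
        key, sep, value = arg.partition("=")
        if not sep or not key:
            raise ValueError(f"expected key=value, got {arg!r}")
        options[key] = value
    return options


# In a real application you would pass sys.argv[1:] here;
# a literal list stands in for it so the sketch runs on its own.
options = parse_key_value_args(
    ["input_path=/data/input", "output_path=/data/output"]
)
print(options["input_path"], options["output_path"])
```

Splitting on the first "=" only means values that themselves contain "=" survive intact.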

Configuration Options (Common Flags)

Master URL:

--master <master-url>: The cluster manager to connect to.
- local: Run locally with one thread.
- local[k]: Run locally with k worker threads.
- local[*]: Run locally with as many worker threads as logical cores on your machine.
- spark://host:port: Connect to a Spark Standalone cluster.
- yarn: Connect to a YARN cluster (requires HADOOP_CONF_DIR or YARN_CONF_DIR environment variable set).
- mesos://host:port: Connect to a Mesos cluster (Mesos support is deprecated since Spark 3.2 and removed in Spark 4.0).
- k8s://https://<k8s-api-server-url>: Connect to a Kubernetes cluster.

Application Type:

--class <main-class>: The entry point for Scala/Java applications (e.g., com.example.MyApp). Needed for JARs unless the JAR’s manifest declares a Main-Class; not used for Python applications.

Deployment Mode:

--deploy-mode <mode>: Where the driver process runs.
- client: Driver runs on the submitting machine.
- cluster: Driver runs on a worker node in the cluster.

Resource Allocation:

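--num-executors <number>: Number of executors to launch (YARN and Kubernetes only). With dynamic allocation enabled, this is the initial number of executors.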
--executor-cores <number>: Number of CPU cores per executor.
--executor-memory <size>: Amount of memory per executor (e.g., 4g, 512m).
--driver-memory <size>: Amount of memory for the driver process (e.g., 2g).
--driver-cores <number>: Number of CPU cores for the driver process (applicable in cluster mode).
--driver-java-options <options>: JVM options for the driver process.
--conf <key>=<value>: Set an arbitrary Spark configuration property. Repeat the flag to set several properties.

Dependencies:

--jars <comma-separated-list>: Comma-separated list of JARs to include on the driver and executor classpaths.
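--packages <comma-separated-list>: Comma-separated list of Maven coordinates in groupId:artifactId:version form (e.g., org.apache.hadoop:hadoop-aws:3.3.1). Spark resolves them, together with their transitive dependencies, from the local Maven repository and Maven Central. Pick versions that match the libraries on your cluster.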
--py-files <comma-separated-list>: Comma-separated list of .zip, .egg, or .py files to send to the cluster and add to the PYTHONPATH of the Python driver and tasks.

Application Arguments:

<app-arguments>: Any arguments passed after the application JAR/Python file will be forwarded to the main method of your application (or appear in sys.argv for a Python script).

Other Useful Flags:

--name <app-name>: Assign a name to your application, which appears in the Spark UI.
--verbose: Print additional debug information.
--properties-file <path>: Load extra properties from a file. If not specified, spark-submit looks for conf/spark-defaults.conf.

Common Patterns

Submitting a Spark application with a specific Spark version:

# Assuming you have multiple Spark versions installed
export SPARK_HOME=/path/to/spark-3.4.1-bin-hadoop3
"$SPARK_HOME/bin/spark-submit" --class com.example.MyApp --master yarn myapp.jar

Explicitly use a particular Spark installation by setting SPARK_HOME and calling spark-submit from that installation’s bin directory. A relative ./bin/spark-submit only works when the current directory is the Spark installation itself.
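
When a wrapper script has to pick the installation itself, the lookup can be sketched as follows. This is an illustrative helper (resolve_spark_submit is a hypothetical name, not part of Spark): it prefers SPARK_HOME when it is set and otherwise falls back to whichever spark-submit is found on PATH.

```python
import os
import shutil


def resolve_spark_submit(env=None):
    """Return the spark-submit path to run, or None if none can be found."""
    env = os.environ if env is None else env
    spark_home = env.get("SPARK_HOME")
    if spark_home:
        # An explicit SPARK_HOME wins over anything on PATH.
        return os.path.join(spark_home, "bin", "spark-submit")
    return shutil.which("spark-submit", path=env.get("PATH"))


chosen = resolve_spark_submit({"SPARK_HOME": "/path/to/spark-3.4.1-bin-hadoop3"})
print(chosen)
```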

Submitting to YARN with HDFS dependencies:

# Ensure your app JAR and any dependencies are on HDFS
hdfs dfs -put my-spark-app.jar /user/spark/apps/
hdfs dfs -put dependency.jar /user/spark/apps/

spark-submit \
  --class com.example.MySparkApp \
  --master yarn \
  --deploy-mode cluster \
  --jars hdfs:///user/spark/apps/dependency.jar \
  hdfs:///user/spark/apps/my-spark-app.jar

Specify application JARs and dependencies using hdfs:// URIs when running on YARN.

Submitting an application with custom Spark configurations:

spark-submit \
  --class com.example.MySparkApp \
  --master yarn \
  --conf spark.serializer=org.apache.spark.serializer.KryoSerializer \
  --conf spark.sql.shuffle.partitions=200 \
  my-spark-app.jar

Passes specific Spark configuration properties directly using the --conf flag.
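
If the properties live in a dictionary (for example in a scheduler or deployment script), the repeated --conf flags can be generated rather than typed. A minimal sketch; build_spark_submit is a hypothetical helper written for this example, and it only assembles the argument list without launching anything.

```python
import shlex


def build_spark_submit(app, master, main_class=None, conf=None, app_args=()):
    """Return the spark-submit argument list for the given settings."""
    cmd = ["spark-submit", "--master", master]
    if main_class:
        cmd += ["--class", main_class]
    for key, value in (conf or {}).items():
        # Each property becomes its own --conf key=value pair.
        cmd += ["--conf", f"{key}={value}"]
    cmd.append(app)
    cmd += list(app_args)
    return cmd


cmd = build_spark_submit(
    "my-spark-app.jar",
    master="yarn",
    main_class="com.example.MySparkApp",
    conf={
        "spark.serializer": "org.apache.spark.serializer.KryoSerializer",
        "spark.sql.shuffle.partitions": 200,
    },
)
# shlex.join renders the list as a shell-safe string for logging.
print(shlex.join(cmd))
```

Passing the list to subprocess.run(cmd, check=True) would launch the job. Keeping it as a list avoids quoting problems with values that contain spaces.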

Submitting a Python application that reads from Kafka and uses a helper module:

spark-submit \
  --master yarn \
  --packages org.apache.spark:spark-sql-kafka-0-10_2.12:3.5.0 \
  --py-files utils.py \
  my_pyspark_app.py

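Includes the Kafka connector using Maven coordinates and a utility Python file. The Scala suffix (_2.12) and the version (3.5.0) in the coordinate must match the Spark build you are running.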

Debugging a failed submission (client mode):

spark-submit --class com.example.MySparkApp --master yarn --deploy-mode client my-spark-app.jar
# Look for errors printed to your console immediately after submission.

In client mode, driver logs appear directly in your terminal.

Debugging a failed submission (cluster mode):

# Submit the application
spark-submit --class com.example.MySparkApp --master yarn --deploy-mode cluster my-spark-app.jar

# After submission, find the application ID in the spark-submit console output,
# the YARN ResourceManager UI, or the output of `yarn application -list`.
# Then use the yarn logs command:
yarn logs -applicationId application_1678886400000_0001

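In cluster mode, the driver runs on the cluster, so you need to fetch its logs with cluster manager tools (like yarn logs). For finished applications, yarn logs relies on YARN log aggregation being enabled.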
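
When submissions are scripted, the application ID can be pulled out of captured console output instead of being copied by hand. A minimal sketch, assuming the ID follows the application_<timestamp>_<sequence> shape shown above. The sample line is written for illustration (real log wording varies by Spark and Hadoop version), and find_application_id is a hypothetical helper name.

```python
import re

# Matches YARN application IDs such as application_1678886400000_0001.
APP_ID_PATTERN = re.compile(r"application_\d+_\d+")


def find_application_id(text):
    """Return the first YARN application ID found in text, or None."""
    match = APP_ID_PATTERN.search(text)
    return match.group(0) if match else None


# Illustrative line standing in for captured spark-submit output.
sample_output = "Application report for application_1678886400000_0001 (state: ACCEPTED)"

app_id = find_application_id(sample_output)
logs_cmd = ["yarn", "logs", "-applicationId", app_id]
print(" ".join(logs_cmd))
```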

Gotchas

--deploy-mode client vs. --deploy-mode cluster: In client mode, your local machine needs to remain running for the application to continue. If you close your terminal or lose connection, the application will fail. cluster mode is generally preferred for long-running applications as the driver runs on the cluster.
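Classpath Issues: When using --jars or --packages, ensure the specified files and coordinates are reachable and compatible with your Spark and Scala versions and your cluster environment. On YARN, plain paths and file: URIs are uploaded from the submitting machine and distributed to the containers, while hdfs://, http://, and https:// URIs are fetched by the cluster directly. local: URIs are never shipped and must already exist at the same path on every node. In standalone cluster mode, the application JAR path must be visible from the cluster nodes, because the driver does not run on your machine.
YARN Configuration: For YARN, ensure your environment has HADOOP_CONF_DIR or YARN_CONF_DIR set correctly to point to the directory containing the client-side Hadoop configuration files, such as core-site.xml, hdfs-site.xml, and yarn-site.xml.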
Driver vs. Executor Memory: Be mindful of the difference between --driver-memory and --executor-memory. The driver is a single JVM process, while executors are multiple JVMs running your tasks. Allocate appropriately based on your application’s needs.
Python Dependencies: For Python applications, --py-files is used for .py files and .zip or .egg archives of pure-Python modules. For more complex Python environments (e.g., requiring conda or specific package versions), consider using a custom Docker image, or packing a conda or virtual environment (for example with conda-pack or venv-pack) and shipping it with --archives.
Spark Version Compatibility: Ensure the Spark distribution that provides your spark-submit matches the Spark version running on your cluster. Mismatches can lead to unexpected errors.
Resource Limits: On managed clusters (like YARN or Kubernetes), your requests for --num-executors, --executor-cores, and --executor-memory are subject to the cluster’s resource availability and quotas.