What it is
spark-submit is the command-line utility used to launch Spark applications on a cluster, whether it’s a standalone Spark cluster, YARN, Mesos, or Kubernetes. You use it when you have a packaged Spark application (a JAR or a Python file) and want to run it with specific configurations.
Installation
spark-submit is distributed with Apache Spark. You install it by downloading and extracting a Spark distribution.
Linux/Mac:
# Download a Spark distribution (e.g., from spark.apache.org/downloads.html)
wget https://archive.apache.org/dist/spark/spark-3.5.0/spark-3.5.0-bin-hadoop3.tgz
# Extract the archive
tar -xzf spark-3.5.0-bin-hadoop3.tgz
# Add Spark's bin directory to your PATH (optional, but highly recommended); with zsh, the macOS default shell, use ~/.zshrc instead of ~/.bashrc
echo 'export PATH=$PATH:/path/to/spark-3.5.0-bin-hadoop3/bin' >> ~/.bashrc
source ~/.bashrc
Windows:
Download and extract a Spark distribution. You’ll then need to run spark-submit.cmd from the extracted bin directory. It’s recommended to add this directory to your system’s PATH environment variable.
Core Concepts
- Application JAR/Python File: The compiled code or script that contains your Spark logic.
- Master URL: Specifies the cluster manager Spark will connect to (e.g., local[*], yarn, spark://host:port).
- Deploy Mode: Where the driver program runs. client mode runs the driver on the machine where spark-submit is invoked. cluster mode runs the driver on one of the worker nodes in the cluster.
- Resource Allocation: How much CPU, memory, and how many executors you request for your application.
Commands / Usage
Submitting Applications
Submit a JAR application to a local Spark instance:
spark-submit --class com.example.MySparkApp --master local[*] my-spark-app.jar
Runs a JAR application locally using all available CPU cores.
Submit a Python application to a local Spark instance:
spark-submit --master local[*] my_spark_app.py
Runs a Python script locally using all available CPU cores.
Submit a JAR application to a YARN cluster in client mode:
spark-submit --class com.example.MySparkApp --master yarn --deploy-mode client my-spark-app.jar
Runs a JAR application on YARN, with the driver program running on the machine executing spark-submit.
Submit a JAR application to a YARN cluster in cluster mode:
spark-submit --class com.example.MySparkApp --master yarn --deploy-mode cluster my-spark-app.jar
Runs a JAR application on YARN, with the driver program running on a worker node within the YARN cluster.
Submit a JAR application with specific resources:
spark-submit \
--class com.example.MySparkApp \
--master yarn \
--deploy-mode cluster \
--num-executors 10 \
--executor-cores 4 \
--executor-memory 8g \
--driver-memory 4g \
my-spark-app.jar
Launches an application on YARN requesting 10 executors, each with 4 cores and 8GB of memory, and 4GB for the driver.
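As a rough check on what this request adds up to, the arithmetic can be sketched in shell. The sketch assumes Spark's default memory overhead for JVM applications (10% of the heap, with a 384 MiB floor) and that spark.executor.memoryOverhead and spark.driver.memoryOverhead are not set. container_mib is a hypothetical helper name, and YARN may round each container up further to its own allocation increment.

```shell
# container_mib HEAP_MIB
# Prints the heap size plus Spark's default overhead, max(384, 10% of heap), in MiB.
# Hypothetical helper; assumes the default overhead factor of 0.10.
container_mib() (
  heap_mib=$1
  overhead_mib=$(( heap_mib / 10 ))
  if [ "$overhead_mib" -lt 384 ]; then
    overhead_mib=384
  fi
  echo $(( heap_mib + overhead_mib ))
)

executors=10
cores_per_executor=4
executor_mib=$(container_mib 8192)   # --executor-memory 8g
driver_mib=$(container_mib 4096)     # --driver-memory 4g

echo "executor cores in total: $(( executors * cores_per_executor ))"
echo "memory per executor container: ${executor_mib} MiB"
echo "memory for all executors plus driver: $(( executors * executor_mib + driver_mib )) MiB"
```

Under these assumptions the request comes to 40 cores and roughly 92 GiB of memory, noticeably more than 10 × 8 GB, which matters when the queue has a hard quota.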
Submit an application with dependencies:
spark-submit \
--class com.example.MySparkApp \
--master yarn \
--jars /path/to/dependency1.jar,/path/to/dependency2.jar \
my-spark-app.jar
Submits an application and includes two additional JAR files as dependencies.
Submit a Python application with dependencies:
spark-submit \
--master yarn \
--py-files /path/to/my_utils.py,/path/to/my_package.zip \
my_spark_app.py
Submits a Python application and includes a Python file and a zip archive containing modules.
Submit an application with application arguments:
spark-submit \
--class com.example.MySparkApp \
--master local[*] \
my-spark-app.jar input_path=/data/input output_path=/data/output
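Passes the strings input_path=/data/input and output_path=/data/output, unchanged and in order, to your Spark application’s main method. Spark does not interpret the key=value form; your application has to parse it.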
Configuration Options (Common Flags)
Master URL:
- --master <master-url>: The cluster manager to connect to.
  - local: Run locally with one thread.
  - local[k]: Run locally with k worker threads.
  - local[*]: Run locally with as many worker threads as logical cores on your machine.
  - spark://host:port: Connect to a Spark Standalone cluster.
  - yarn: Connect to a YARN cluster (requires the HADOOP_CONF_DIR or YARN_CONF_DIR environment variable to be set).
  - mesos://host:port: Connect to a Mesos cluster.
  - k8s://https://<k8s-api-server-url>: Connect to a Kubernetes cluster.
Application Type:
- --class <main-class>: The entry point for Scala/Java applications (e.g., com.example.MyApp). Required for JARs unless the JAR’s manifest declares a Main-Class.
Deployment Mode:
- --deploy-mode <mode>: Where the driver process runs.
  - client: Driver runs on the submitting machine.
  - cluster: Driver runs on a worker node in the cluster.
Resource Allocation:
- --num-executors <number>: Number of executors to launch (YARN and Kubernetes only).
- --executor-cores <number>: Number of CPU cores per executor.
- --executor-memory <size>: Amount of memory per executor (e.g., 4g, 512m).
- --driver-memory <size>: Amount of memory for the driver process (e.g., 2g).
- --driver-cores <number>: Number of CPU cores for the driver process (applicable in cluster mode).
- --driver-java-options <options>: JVM options for the driver process.
- --conf <key>=<value>: Set an arbitrary Spark configuration property.
Dependencies:
- --jars <comma-separated-list>: Comma-separated list of JARs to include on the driver and executor classpaths.
- --packages <comma-separated-list>: Comma-separated list of Maven coordinates (e.g., org.apache.hadoop:hadoop-aws:3.3.1). These will be downloaded by Spark.
- --py-files <comma-separated-list>: Comma-separated list of .zip, .egg, or .py files to send to the cluster and add to the PYTHONPATH of the Python driver and tasks.
Application Arguments:
- <app-arguments>: Any arguments passed after the application JAR/Python file will be forwarded to the main method of your application.
Other Useful Flags:
- --name <app-name>: Assign a name to your application, which appears in the Spark UI.
- --verbose: Print additional debug information.
- --properties-file <path>: Load extra properties from a file.
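When the same flags recur across submissions, --properties-file can hold them instead. The sketch below writes a small properties file; the file location and the particular values are illustrative assumptions, while the property names are standard Spark configuration keys. Spark's documented precedence is that values set in application code win over command-line flags, which in turn win over the properties file.

```shell
# Write a hypothetical properties file (whitespace-separated key and value per line).
props_file="${TMPDIR:-/tmp}/my-spark-app.conf"
cat > "$props_file" <<'EOF'
spark.master                  yarn
spark.submit.deployMode       cluster
spark.executor.instances      10
spark.executor.cores          4
spark.executor.memory         8g
spark.driver.memory           4g
spark.serializer              org.apache.spark.serializer.KryoSerializer
EOF

# The submission then shrinks to (shown as a comment so the sketch runs without a cluster):
# spark-submit --properties-file "$props_file" --class com.example.MySparkApp my-spark-app.jar
echo "wrote $props_file"
```

Keeping one such file per environment (for example dev and prod) is a common way to avoid long command lines, though which settings belong in the file is a judgment call for your team.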
Common Patterns
Submitting a Spark application with a specific Spark version:
# Assuming you have multiple Spark versions installed
export SPARK_HOME=/path/to/spark-3.4.1-bin-hadoop3
"$SPARK_HOME/bin/spark-submit" --class com.example.MyApp --master yarn myapp.jar
Explicitly use a particular Spark installation by setting SPARK_HOME and calling the spark-submit in that installation’s bin directory.
Submitting to YARN with HDFS dependencies:
# Ensure your app JAR and any dependencies are on HDFS
hdfs dfs -put my-spark-app.jar /user/spark/apps/
hdfs dfs -put dependency.jar /user/spark/apps/
spark-submit \
--class com.example.MySparkApp \
--master yarn \
--deploy-mode cluster \
--jars hdfs:///user/spark/apps/dependency.jar \
hdfs:///user/spark/apps/my-spark-app.jar
Specify application JARs and dependencies using hdfs:// URIs when running on YARN.
Submitting an application with custom Spark configurations:
spark-submit \
--class com.example.MySparkApp \
--master yarn \
--conf spark.serializer=org.apache.spark.serializer.KryoSerializer \
--conf spark.sql.shuffle.partitions=200 \
my-spark-app.jar
Passes specific Spark configuration properties directly using the --conf flag.
Submitting a Python application with a Maven package and a helper module:
spark-submit \
--master yarn \
--packages org.apache.spark:spark-sql-kafka-0-10_2.12:3.5.0 \
--py-files utils.py \
my_pyspark_app.py
Includes Kafka dependencies using Maven coordinates and a utility Python file.
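The Maven coordinate in this command encodes two versions that have to agree with the cluster: the Scala binary version (the _2.12 suffix) and the Spark version. One way to keep them in step is to build the coordinate from variables. The variable names here are illustrative, and the values assume the default Scala 2.12 build of Spark 3.5.0.

```shell
# Assumed values: match these to the Spark build on your cluster.
scala_binary_version=2.12
spark_version=3.5.0

# Coordinate layout: groupId:artifactId_<scala binary version>:version
kafka_package="org.apache.spark:spark-sql-kafka-0-10_${scala_binary_version}:${spark_version}"
echo "$kafka_package"

# Used as (commented out so the sketch runs without a cluster):
# spark-submit --master yarn --packages "$kafka_package" --py-files utils.py my_pyspark_app.py
```

Running spark-submit --version against a given installation prints both its Spark version and the Scala version it was built with, which gives you the two values to plug in.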
Debugging a failed submission (client mode):
spark-submit --class com.example.MySparkApp --master yarn --deploy-mode client my-spark-app.jar
# Look for errors printed to your console immediately after submission.
In client mode, driver logs appear directly in your terminal.
Debugging a failed submission (cluster mode):
# Submit the application
spark-submit --class com.example.MySparkApp --master yarn --deploy-mode cluster my-spark-app.jar
# After submission, find the Application ID in the spark-submit console output or the YARN ResourceManager UI
# Then use yarn logs command:
yarn logs -applicationId application_1678886400000_0001
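In cluster mode, the driver runs on the cluster, so you retrieve its logs with cluster manager tools (like yarn logs). yarn logs reads aggregated logs, which requires YARN log aggregation to be enabled and typically returns complete output only after the application has finished.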
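If the submission output was captured to a file, the application ID can be pulled out of it rather than copied from a UI. This is a sketch that assumes the ID appears in the captured output in YARN's usual application_<timestamp>_<sequence> form; extract_app_id and submit.log are hypothetical names.

```shell
# extract_app_id: read log text on stdin and print the first YARN application ID found.
extract_app_id() {
  grep -oE 'application_[0-9]+_[0-9]+' | head -n 1
}

# Intended use (commented out so the sketch runs without a cluster):
# spark-submit --class com.example.MySparkApp --master yarn --deploy-mode cluster \
#   my-spark-app.jar > submit.log 2>&1
# app_id=$(extract_app_id < submit.log)
# yarn logs -applicationId "$app_id"

# Self-check against an illustrative line (not real spark-submit output):
echo "submitted application_1678886400000_0001 to the cluster" | extract_app_id
```

The 2>&1 redirection is there because spark-submit writes its log messages to standard error by default.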
Gotchas
- --deploy-mode client vs. --deploy-mode cluster: In client mode, the driver runs inside the spark-submit process, so the submitting machine must stay up and connected for the whole run. If you close your terminal or lose the connection, the application will fail. cluster mode is generally preferred for long-running applications because the driver runs on the cluster.
- Classpath Issues: When using --jars or --packages, ensure the specified files/coordinates are accessible and compatible with your Spark version and cluster environment. On YARN, files specified with hdfs:// URIs are distributed automatically, and local paths are uploaded from the submitting machine. On a standalone cluster in cluster mode, the driver starts on a worker, so the application JAR must be reachable from the worker nodes (for example via hdfs:// or a path that exists on every node).
- YARN Configuration: For YARN, ensure your environment has HADOOP_CONF_DIR or YARN_CONF_DIR set correctly to point to the Hadoop configuration directory containing core-site.xml, hdfs-site.xml, and yarn-site.xml.
- Driver vs. Executor Memory: Be mindful of the difference between --driver-memory and --executor-memory. The driver is a single JVM process, while executors are multiple JVMs running your tasks. Allocate appropriately based on your application’s needs.
- Python Dependencies: For Python applications, --py-files is used for single files or zip archives. For more complex Python environments (e.g., requiring conda or specific package versions), consider using a custom Docker image or building a virtual environment and packaging it.
- Spark Version Compatibility: Ensure the Spark distribution you are using (spark-submit) matches the Spark version running on your cluster. Mismatches can lead to unexpected errors.
- Resource Limits: On managed clusters (like YARN or Kubernetes), your requests for --num-executors, --executor-cores, and --executor-memory are subject to the cluster’s resource availability and quotas.
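For the virtual-environment route mentioned under Python Dependencies, Spark's Python packaging documentation describes shipping a packed environment with --archives. The following sketch follows that approach. It assumes the venv-pack tool, a YARN cluster, and illustrative names (pyspark_venv, environment, the listed packages), and it wraps the steps in a function so nothing runs until you call it.

```shell
# pack_and_submit APP_PY: build a virtual environment, pack it, and ship it with the job.
# Assumptions: python3 and pip are available locally, and the cluster nodes run a
# compatible OS and Python so the packed environment works after unpacking.
# The body runs in a subshell so the activated venv and exports do not leak out.
pack_and_submit() (
  app_py=$1
  python3 -m venv pyspark_venv
  . pyspark_venv/bin/activate
  pip install pandas pyarrow venv-pack
  venv-pack -o pyspark_venv.tar.gz

  # The archive is unpacked on the cluster under ./environment; point Python at it.
  export PYSPARK_PYTHON=./environment/bin/python
  spark-submit \
    --master yarn \
    --deploy-mode cluster \
    --archives pyspark_venv.tar.gz#environment \
    "$app_py"
)

# Example call (requires a reachable cluster):
# pack_and_submit my_spark_app.py
```

The part after # in the --archives value is the directory name the archive is unpacked under, which is why PYSPARK_PYTHON points into ./environment. Whether this or a Docker image is the better fit depends on how your cluster is managed.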