Last Updated: November 21, 2025
Focus Areas
| Focus |
|---|
| Match `spark.sql.shuffle.partitions` to cluster size |
| Cache reused DataFrames |
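A minimal PySpark sketch covering both focus areas. The app name, partition count, and input path are illustrative assumptions, not values from this note.

```python
from pyspark.sql import SparkSession

# Hypothetical session; set shuffle partitions to roughly match cluster parallelism.
spark = (
    SparkSession.builder
    .appName("tuning-demo")                          # assumed app name
    .config("spark.sql.shuffle.partitions", "200")   # match to cluster size
    .getOrCreate()
)

df = spark.read.parquet("/data/events")  # assumed input path

# Cache a DataFrame that several downstream actions will reuse.
df.cache()
df.count()  # first action materializes the cache
```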
Commands & Queries
| Command | Purpose |
|---|---|
| `spark-submit --conf spark.sql.shuffle.partitions=200 job.py` | Tune shuffle partition count at submit time |
| `df.repartition(100)` | Repartition a DataFrame into 100 partitions |
| `df.persist(StorageLevel.MEMORY_ONLY)` | Cache a DataFrame in memory |
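For context, a hypothetical `job.py` that the `spark-submit` line above could run, combining repartitioning with an in-memory persist. The input path and column names are assumptions for illustration.

```python
# job.py - illustrative; picks up spark.sql.shuffle.partitions from --conf at submit time
from pyspark import StorageLevel
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("shuffle-tuning-job").getOrCreate()

df = spark.read.parquet("/data/sales")  # assumed input path

# Repartition before expensive wide operations so work spreads evenly across executors.
df = df.repartition(100)

# Persist in memory because the repartitioned data is reused by two aggregations below.
df.persist(StorageLevel.MEMORY_ONLY)

daily = df.groupBy("day").count()         # assumed column name
by_region = df.groupBy("region").count()  # assumed column name

daily.show()
by_region.show()

df.unpersist()
```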
Summary
Partition sensibly, cache reused data, and limit shuffles to speed up Spark jobs.
💡 Pro Tip:
Use `persist()` for reused data and avoid wide dependencies.
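One way to avoid a wide dependency is to broadcast a small lookup table, so the join does not shuffle the large side. The table names, paths, and join key below are assumptions for illustration.

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import broadcast

spark = SparkSession.builder.appName("broadcast-join-demo").getOrCreate()

orders = spark.read.parquet("/data/orders")        # assumed large table
countries = spark.read.parquet("/data/countries")  # assumed small lookup table

# broadcast() hints Spark to ship the small table to every executor,
# replacing a shuffle (wide) join with a map-side (narrow) join.
joined = orders.join(broadcast(countries), "country_code")  # assumed join key
joined.show()
```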